<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Bruised Edge &#187; MARC</title>
	<atom:link href="http://weblog.kevinclarke.info/category/marc/feed/" rel="self" type="application/rss+xml" />
	<link>http://weblog.kevinclarke.info</link>
	<description>Digital Libraries, Repositories, Programming, Technology, Librarianship, etc.</description>
	<lastBuildDate>Wed, 28 Jul 2010 03:19:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Mapping MARC</title>
		<link>http://weblog.kevinclarke.info/2007/01/17/mapping-marc/</link>
		<comments>http://weblog.kevinclarke.info/2007/01/17/mapping-marc/#comments</comments>
		<pubDate>Wed, 17 Jan 2007 16:51:45 +0000</pubDate>
		<dc:creator>ksclarke</dc:creator>
				<category><![CDATA[MARC]]></category>

		<guid isPermaLink="false">http://kevinclarke.info/weblog/?p=254</guid>
		<description><![CDATA[I&#8217;ve been following (off and on) the discussions on #code4lib about mapping MARC to indices. I know each ILS has a different way of making this happen, but I wonder whether there has been any effort to pool together the decisions people have made (for instance, what MARC fields and subfields should be used for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been following (off and on) the discussions on #code4lib about mapping MARC to indices.  I know each ILS has a different way of making this happen, but I wonder whether there has been any effort to pool together the decisions people have made (for instance, what MARC fields and subfields should be used for a title, or author, search?)  It would be interesting to see how much uniformity (or not) is out there.</p>
<p>I&#8217;ve learned from #code4lib that Erik Hatcher is working on a Ruby library that will index MARC (so he has a start on a MARC mapping <a href="http://svn.apache.org/repos/asf/incubator/solr/trunk/client/ruby/solrb/examples/marc/marc_importer.rb" title="MARC Mapping for Solr Flare">in a subversion repository</a>).  Are there other sources for seeing how people have mapped their MARC (or, even, how they&#8217;ve cleaned up their data; I know CDL has a <a href="http://www.cdlib.org/inside/diglib/datenorm/" title="CDL's Date Normalizer">date normalizing</a> library)?   Is a site where this sort of information could be shared something that other people would find useful (and do our ILS contracts allow us to share it in some generic form)?</p>
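<p>To make the question concrete, here is a minimal sketch (in Python, purely for illustration) of what one shared mapping might look like. Everything in it is an assumption on my part: the names <code>INDEX_MAP</code> and <code>index_values</code> are hypothetical, the record is a simplified made-up stand-in for a parsed MARC record, and the choice of 245 $a/$b for title and 100/700 $a for author is just one plausible decision, not a report of what any ILS does.</p>

```python
# Hypothetical index-to-MARC mapping. The tags and subfields are illustrative
# choices only, not a survey of what any particular ILS actually uses.
INDEX_MAP = {
    "title": [("245", "ab")],
    "author": [("100", "a"), ("700", "a")],
}


def index_values(record, index_name):
    """Collect the subfield values that would feed one search index.

    `record` is a simplified stand-in for a parsed MARC record:
    a list of (tag, [(subfield_code, value), ...]) tuples.
    """
    values = []
    for wanted_tag, wanted_codes in INDEX_MAP[index_name]:
        for tag, subfields in record:
            if tag != wanted_tag:
                continue
            # Keep only the subfields this index cares about.
            values.extend(v for code, v in subfields if code in wanted_codes)
    return values


# A made-up record used only to exercise the mapping.
record = [
    ("100", [("a", "Melville, Herman,"), ("d", "1819-1891.")]),
    ("245", [("a", "Moby Dick /"), ("c", "Herman Melville.")]),
]

print(index_values(record, "title"))
print(index_values(record, "author"))
```

<p>If mappings were published in some plain, declarative form like this, comparing the decisions different sites have made would mostly be a matter of diffing tables.</p>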
]]></content:encoded>
			<wfw:commentRss>http://weblog.kevinclarke.info/2007/01/17/mapping-marc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Libxml-Ruby vs. REXML in Ruby-MARC</title>
		<link>http://weblog.kevinclarke.info/2007/01/07/libxml-ruby-vs-rexml-in-ruby-marc/</link>
		<comments>http://weblog.kevinclarke.info/2007/01/07/libxml-ruby-vs-rexml-in-ruby-marc/#comments</comments>
		<pubDate>Mon, 08 Jan 2007 02:41:48 +0000</pubDate>
		<dc:creator>ksclarke</dc:creator>
				<category><![CDATA[MARC]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://kevinclarke.info/weblog/?p=255</guid>
		<description><![CDATA[This weekend I reimplemented the XMLReader and XMLWriter classes in ruby-marc using Libxml-Ruby, a Ruby layer over the Libxml2 C library. Currently, ruby-marc uses REXML, a pure Ruby XML library. Since REXML is built into Ruby, it is convenient. I was curious, though, how much of a performance boost there would be from using Libxml2. [...]]]></description>
			<content:encoded><![CDATA[<p>This weekend I reimplemented the XMLReader and XMLWriter classes in <a href="http://www.textualize.com/ruby_marc" title="ruby-marc">ruby-marc</a> using <a href="http://libxml.rubyforge.org/" title="ruby-libxml">Libxml-Ruby</a>, a Ruby layer over the <a href="http://xmlsoft.org/" title="libxml2">Libxml2 C library</a>.</p>
<p>Currently, ruby-marc uses <a href="http://www.germane-software.com/software/rexml/" title="rexml">REXML</a>, a pure Ruby XML library.  Since REXML is built into Ruby, it is convenient.  I was curious, though, how much of a performance boost there would be from using Libxml2.  Here are the results of my very informal test (using some HCL MARC data):</p>
<blockquote>
<table>
<tr>
<th></th>
<th>User</th>
<th>System</th>
<th>Total</th>
<th>Real</th>
</tr>
<tr>
<th>XMLReader [old]: </th>
<td>24.300000</td>
<td>0.030000</td>
<td>24.330000</td>
<td>25.607547</td>
</tr>
<tr>
<th>XMLReader [new]: </th>
<td>3.180000</td>
<td>0.010000</td>
<td>3.190000</td>
<td>3.231896</td>
</tr>
<tr>
<th>XMLWriter [old]: </th>
<td>38.960000</td>
<td>0.060000</td>
<td>39.020000</td>
<td>41.017238</td>
</tr>
<tr>
<th>XMLWriter [new]: </th>
<td>11.950000</td>
<td>0.050000</td>
<td>12.000000</td>
<td>12.607114</td>
</tr>
</table>
</blockquote>
<p>Both XMLWriter times include the new XMLReader reading records in from a source file.  As a record is read in, it is written out to a new file.  This is just intended to get an inkling of what the difference between the two versions might be (not to be a formal benchmark). Lower numbers are better.</p>
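<p>For a rough sense of scale, the ratios can be worked out from the "Real" column above. This is only a sketch in Python (the names <code>real_seconds</code> and <code>speedup</code> are mine, and the numbers are simply copied from the table); it works out to just under eight times faster for the reader and a bit over three times for the writer, keeping in mind that the writer figures include the reader's time.</p>

```python
# Wall-clock ("Real") seconds copied from the table above.
real_seconds = {
    "XMLReader": {"old": 25.607547, "new": 3.231896},
    "XMLWriter": {"old": 41.017238, "new": 12.607114},
}


def speedup(timings):
    """Return how many times faster the new implementation ran."""
    return timings["old"] / timings["new"]


for name, timings in real_seconds.items():
    print(f"{name}: {speedup(timings):.1f}x faster")
```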
<p>So, in reimplementing, I completely rewrote the reader.  It just reads from a file and returns MARC::Record objects.  What is being used to read the XML is completely swappable with anything else.</p>
<p>With the writer, I changed the encode method so that it now takes an option specifying which library should be used (REXML is still the default).  Since the method is public, I figured someone is probably using the REXML Documents it returns, and their code would break if I returned a Libxml Document instead. The write method, on the other hand, now uses Libxml by default.</p>
<p>I haven&#8217;t checked in any of these changes yet (since I haven&#8217;t passed them by Ed and don&#8217;t know whether they should be incorporated), but I have validated that the existing tests still pass just fine.</p>
<p>The speed improvements are pretty nice. If an extra dependency can be tolerated it would be nice to have the performance boost.  The only other caveat is I used the 0.4.0pre01 version of Libxml-Ruby.  It might be desirable to wait until the final 0.4.0 release.</p>
<p>Anyway, I&#8217;ll get Ed&#8217;s opinion on all this sometime this coming week. Right now, it is just a fun experiment.</p>
]]></content:encoded>
			<wfw:commentRss>http://weblog.kevinclarke.info/2007/01/07/libxml-ruby-vs-rexml-in-ruby-marc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Jellybean Jar</title>
		<link>http://weblog.kevinclarke.info/2005/07/14/the-jellybean-jar/</link>
		<comments>http://weblog.kevinclarke.info/2005/07/14/the-jellybean-jar/#comments</comments>
		<pubDate>Thu, 14 Jul 2005 22:23:50 +0000</pubDate>
		<dc:creator>ksclarke</dc:creator>
				<category><![CDATA[MARC]]></category>
		<category><![CDATA[Metadata]]></category>

		<guid isPermaLink="false">http://kevinclarke.info/weblog/?p=221</guid>
		<description><![CDATA[Remember that contest where there is a jellybean jar full of jellybeans and the goal is to guess how many jellybeans there are in the jar? I think the MARC –&#62; RDF question is a bit like that. RDF has been discussed lately on the MODS list (and a bit here). Ironically, there has been [...]]]></description>
			<content:encoded><![CDATA[<p>Remember that contest where there is a jellybean jar full of jellybeans and the goal is to guess how many jellybeans there are in the jar? I think the MARC –&gt; RDF question is a bit like that. RDF has been discussed lately on the MODS list (and a bit here). Ironically, there has been a discussion of relational databases going on on the MARC list at the same time (not related to RDF, but not wholly unrelated either).When one talks about a triple-store (of whatever kind, native RDF or relational), it is only natural to talk in the number of triples stored. This is just like an object-oriented database. How do you measure its capacity? How many objects can it store and reasonably retrieve in a query? What the triple count doesn’t tell me, though, is how many records (in the library sense) does this represent?</p>
<p>I don’t think we have an answer to this question yet because no one that I am aware of has moved the complete MARC structure into RDF. So, to me, the question of how many triples there are in a MARC record is a bit like the guessing game: “How many jellybeans are there in the jellybean jar?” I’m sure someone out there knows how many possible units of information there are in a MARC record (you’d have to break out all the encoded info from the control fields too), but this is a bit different from triples because the relationships between those units have to be represented as well.</p>
<p>It would also, perhaps, be more instructive to know how many triples there are in your <i>average</i> MARC record (and how many on the high end). This is related to the question, “Which parts of the MARC record do we actually use?” Bill Moen’s <a href="http://www.mcdu.unt.edu/?p=8">MCDU</a> group is doing some interesting work in this area.</p>
<p>As for triples in your average MARC record, my guess would be a hundred (keep in mind all the subfields and the number of bytes in each control field if that seems like a lot to you). Okay, that’s totally off the top of my head with no real logic at all, but it doesn’t seem too far off to me either (in fact it might be a bit low). So, using this completely bogus guess, a database of 10 million triples would represent a database of a hundred thousand records. I think it is worth noting that different triple-per-record counts would affect all parts of the system.</p>
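<p>The back-of-the-envelope arithmetic, as a small Python sketch (the function name <code>records_represented</code> is just something I made up for illustration, and the hundred-triples figure is still the bogus guess from above):</p>

```python
def records_represented(total_triples, triples_per_record):
    """Estimate how many library records a triple count corresponds to."""
    return total_triples // triples_per_record


# The guess from the text: 10 million triples at 100 triples per record.
print(records_represented(10_000_000, 100))

# Halving or doubling the per-record guess doubles or halves the estimate.
for guess in (50, 100, 200):
    print(guess, records_represented(10_000_000, guess))
```

<p>The estimate swings in direct proportion to the guess, which is why knowing the average (and the high end) matters.</p>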
<p>It’s funny that I’ve been contemplating all this RDF stuff again. I first looked into it when we were developing XOBIS. I think Leigh Dodds’ question was the right one when he asked: “What does RDF give us that XML can’t?” Well, he said “XML or a relational database” but I think the database part isn’t really the question since we don’t have to put XML into a relational database (I can certainly understand the frustrations of developers who have had to map one XML model after another into a relational db).</p>
<p>His answers seem to fall into three categories to me. The first is the database/modelling question, the second is easily combining disparate data, and the third is the ability to do Semantic Web-like things (any time someone starts talking about machines “inferencing” I lump it into the last category (right or wrong)). The first seems to me to be more directed at relational databases; the second seems to me to be handled by choosing RELAX NG and NRL. The third I remain skeptical about.</p>
<p>I don’t mean to trivialize these issues by categorizing them in this way. I don’t think XML offers a complete solution to MARC yet either (this doesn’t mean we shouldn’t keep looking and working on one though). I guess this is just my last sweep, before leaving the RDF stuff behind (though I have emailed the AustLit people to find out a little more about their caching system), to sort of make sense of it.</p>
<p>Since RDF isn’t something I work with in my daily life, I’ve probably already spent much more time than I should taking another look at it (and, yes, though I’ve been arguing a point I’ve also been genuinely looking). Maybe in another five years I’ll poke my head up again and see if there is any more out there. Back to work…</p>
]]></content:encoded>
			<wfw:commentRss>http://weblog.kevinclarke.info/2005/07/14/the-jellybean-jar/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MARC Down</title>
		<link>http://weblog.kevinclarke.info/2005/03/21/marc-down/</link>
		<comments>http://weblog.kevinclarke.info/2005/03/21/marc-down/#comments</comments>
		<pubDate>Mon, 21 Mar 2005 12:40:34 +0000</pubDate>
		<dc:creator>ksclarke</dc:creator>
				<category><![CDATA[MARC]]></category>
		<category><![CDATA[Metadata]]></category>
		<category><![CDATA[XOBIS]]></category>

		<guid isPermaLink="false">http://kevinclarke.info/weblog/?p=234</guid>
		<description><![CDATA[Lorcan Dempsey has posted about the recent XML4LIB discussions. He highlights one posting that escaped my notice (because, to be honest, I was originally trying to avoid jumping into the discussion). He says (of someone else who posted on the topic): “He reminds people of the three layers in the classical library metadata stack: encoding [...]]]></description>
			<content:encoded><![CDATA[<p>Lorcan Dempsey has posted about the recent XML4LIB discussions. He highlights one posting that escaped my notice (because, to be honest, I was originally trying to avoid jumping into the discussion). He says (of someone else who posted on the topic): “He reminds people of the three layers in the classical library metadata stack: encoding (ISO 2709 or Z39.2), content designation (as expressed in the various MARC formats), and content values (which is the focus of cataloging rules and controlled terminologies).”</p>
<p>My own post (far from thoughtful, unfortunately… hmmm, am I blog people?) commented that MARC and XML are the first layer (both are the structure in which information is passed around). Then I sort of merged the next two layers, commenting that a 245 has no real importance apart from that assigned to it by cataloging rules (e.g., the idea of title and the rules that govern recording a title both seem, to me, to be under the control of cataloging rules (both are AACR2 governed)).</p>
<p>Lorcan Dempsey also commented that Dublin Core focused on the middle layer (and this caused several possible third layer possibilities to pop up). The first layer, of course, is handled by XML. The whole thing made me think a bit about XOBIS.</p>
<p>We have assumed the encoding layer would be XML. We focused on the second layer and explicitly said, when appropriate, that a particular issue would be handled by the third layer. The interesting difference between DC and XOBIS is that DC started as a community effort… committees were set up, vested interests consulted, etc. XOBIS is just an experiment that attempts to ask and answer (or, rather, give one <i>possible</i> answer to) some interesting questions, but we haven’t really tried to drum up a community around it.</p>
<p>We have been told that we should do this, but apart from giving presentations on it we haven’t. I think, in part, this is because we are more interested in the questions than in doing the work needed to make a standard (perhaps I should just speak for myself on that though).</p>
<p>We have always taken the approach that we are experimenting (this gives us a bit more leeway in our approach and delivery). This is not to say that we shouldn’t tackle the nuts and bolts, but that we’ve worked from the perspective that we didn’t have limitations (so that we would not be restricted by them). In the end, though, one still has to deal with them in order to produce something useful.</p>
<p>Interestingly, Dick has just finished a study of the tags used by Lane Library. Now that I am no longer there, though, I find myself wondering (once again) about an XSLT stylesheet that goes from MARC to XOBIS (my earlier attempts at transforming MARC into XOBIS were done in Java).</p>
<p>Lane and Dick do a lot of special things with their records (making the transformation easier). I’m now wondering about what will be involved with creating a transformation for a standard MARC record (without all of Lane’s special enhancements). Once I get a little more free time (after April 15th), I’d like to take a shot at it.</p>
]]></content:encoded>
			<wfw:commentRss>http://weblog.kevinclarke.info/2005/03/21/marc-down/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HCLC Authorities</title>
		<link>http://weblog.kevinclarke.info/2005/02/12/hclc-authorities/</link>
		<comments>http://weblog.kevinclarke.info/2005/02/12/hclc-authorities/#comments</comments>
		<pubDate>Sat, 12 Feb 2005 22:45:59 +0000</pubDate>
		<dc:creator>ksclarke</dc:creator>
				<category><![CDATA[MARC]]></category>

		<guid isPermaLink="false">http://kevinclarke.info/weblog/?p=235</guid>
		<description><![CDATA[So, I requested the HCLC authority records created by Sanford Berman and staff. I was excited today when the package arrived with my CD of the records (the letter says the CD includes both the authorities and bibs). The bad news is that the CD they returned is unreadable (or at least I am not [...]]]></description>
			<content:encoded><![CDATA[<p>So, I requested the HCLC authority records created by Sanford Berman and staff. I was excited today when the package arrived with my CD of the records (the letter says the CD includes both the authorities and bibs). The bad news is that the CD they returned is unreadable (or at least I am not able to read it on my current machine). I originally sent two CDs in case one was bad and they, in fact, did only return one to me&#8230; letting me know the other <i>was</i> bad. Now, I&#8217;m wondering, though, whether they perhaps returned the bad one to me? I guess I will have to ask for another copy if I can&#8217;t get my work machine to read the CD either (on Monday). Maybe if I give them an ftp site they will send the records to me that way so I don&#8217;t have to wait as long this time.</p>
<p>Anyway, I got all excited for nothing! When I get an accessible copy, my first plan is to convert them into XML and make them searchable via eXist. Eventually, though, I&#8217;d like to convert the records into XOBIS and experiment with retrieval and navigation of the records. I thought about requesting them long ago when I first heard about the snapshot, but the license restricts any changes to the records&#8217; content so they cannot be upgraded or modified&#8230; they are essentially a snapshot of the HCLC&#8217;s cataloging records at the time Berman left. At the time I first heard of them that was enough to discourage my interest. But now that I&#8217;m thinking about writing an aut navigator, the records would be useful (even without being able to modify the content &#8212; perhaps specifically because I can&#8217;t modify the content).</p>
]]></content:encoded>
			<wfw:commentRss>http://weblog.kevinclarke.info/2005/02/12/hclc-authorities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
