Remember that contest where there is a jellybean jar full of jellybeans and the goal is to guess how many jellybeans there are in the jar? I think the MARC –> RDF question is a bit like that. RDF has been discussed lately on the MODS list (and a bit here). Ironically, there has been a discussion of relational databases on the MARC list at the same time (not related to RDF, but not wholly unrelated either).

When one talks about a triple-store (of whatever kind, native RDF or relational), it is only natural to talk in terms of the number of triples stored. This is just like an object-oriented database. How do you measure its capacity? By how many objects it can store and reasonably retrieve in a query. What the triple count doesn’t tell me, though, is how many records (in the library sense) this represents.

I don’t think we have an answer to this question yet because no one that I am aware of has moved the complete MARC structure into RDF. So, to me, the question of how many triples there are in a MARC record is a bit like the guessing game: “How many jellybeans are there in the jellybean jar?” I’m sure someone out there knows how many possible units of information there are in a MARC record (you’d have to break out all the encoded info from the control fields too), but this is a bit different than triples because the relationships between those units have to be represented as well.
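To make that last point a little more concrete, here is a rough sketch of why the triple count ends up higher than the count of information units. It is only an illustration under my own assumptions: the record is a made-up toy, the `marc:` predicate names are hypothetical, and giving each field its own intermediate node is just one of several possible modelling choices, not an established MARC-to-RDF mapping.

```python
# Toy, hypothetical record: (tag, [(subfield code, value), ...]).
# The values are invented for illustration, not taken from a real catalog.
record = [
    ("245", [("a", "An example title"), ("c", "A. N. Author")]),
    ("260", [("a", "Somewhere"), ("b", "Some Press"), ("c", "2005")]),
]


def to_triples(record_id, fields):
    """Flatten a record into (subject, predicate, object) triples.

    Assumed modelling choice: every field becomes an intermediate node,
    so one extra triple is needed to link the record to each field.
    """
    triples = []
    for i, (tag, subfields) in enumerate(fields):
        field_node = f"{record_id}/field{i}"
        # Structural triple: the record has this field.
        triples.append((record_id, f"marc:{tag}", field_node))
        for code, value in subfields:
            # Data triple: the field carries this subfield value.
            triples.append((field_node, f"marc:{tag}{code}", value))
    return triples


triples = to_triples("rec1", record)
unit_count = sum(len(subfields) for _, subfields in record)
print(unit_count, len(triples))  # 5 units of information, 7 triples
```

In this toy case the two extra triples are exactly the record-to-field relationships. A different modelling choice would give a different overhead, which is part of why the jellybean count is hard to guess.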

It would also, perhaps, be more instructive to know how many triples there are in your average MARC record (and how many on the high end). This is related to the question, “Which parts of the MARC record do we actually use?” Bill Moen’s MCDU group is doing some interesting work in this area.

As for triples in your average MARC record, my guess would be a hundred (keep in mind all the subfields and the number of bytes in each control field if that seems like a lot to you). Okay, that’s totally off the top of my head with no real logic at all, but it doesn’t seem too far off to me either (in fact it might be a bit low). So, using this completely bogus guess, a database of 10 million triples would represent a database of a hundred thousand records. I think it is worth noting that a different triples-per-record count would change that estimate, and with it the sizing of every part of the system.
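The arithmetic is simple enough to play with. Here is a minimal sketch, where the function name is my own and the 50 and 200 figures are just alternative guesses placed either side of my bogus hundred:

```python
def records_from_triples(total_triples, triples_per_record):
    """Estimate how many whole records a given number of triples represents."""
    return total_triples // triples_per_record


# 100 is the off-the-cuff guess from above; 50 and 200 are arbitrary
# alternatives to show how sensitive the estimate is to that guess.
for guess in (50, 100, 200):
    print(guess, records_from_triples(10_000_000, guess))
# 50 200000
# 100 100000
# 200 50000
```

Halving or doubling the guess doubles or halves the record count, so the same 10 million triples could mean anything from fifty thousand to two hundred thousand records.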

It’s funny that I’ve been contemplating all this RDF stuff again. I first looked into it when we were developing XOBIS. I think Leigh Dodds’ question was the right one when he asked: “What does RDF give us that XML can’t?” Well, he said “XML or a relational database” but I think the database part isn’t really the question since we don’t have to put XML into a relational database (I can certainly understand the frustrations of developers who have had to map one XML model after another into a relational db).

His answers seem to fall into three categories to me. The first is the database/modelling question, the second is easily combining disparate data, and the third is the ability to do Semantic Web-like things (any time someone starts talking about machines “inferencing” I lump it into the last category, right or wrong). The first seems to me to be more directed at relational databases; the second seems to me to be handled by choosing RELAX NG and NRL. The third I remain skeptical about.

I don’t mean to trivialize these issues by categorizing them in this way. I don’t think XML offers a complete solution to MARC yet either (this doesn’t mean we shouldn’t keep looking and working on one though). I guess this is just my last sweep, before leaving the RDF stuff behind (though I have emailed the AustLit people to find out a little more about their caching system), to sort of make sense of it.

Since RDF isn’t something I work with in my daily life, I’ve probably already spent much more time than I should taking another look at it (and, yes, though I’ve been arguing a point I’ve also been genuinely looking). Maybe in another five years I’ll poke my head up again and see if there is any more out there. Back to work…