Archives for The Bruised Edge

No Matter, Nevermind

Regarding my last post… on the other hand, if I look at the conference as an outgrowth of Code4Lib (the way the journal is an outgrowth), what does it matter? As long as the issue isn’t whether we should formalize the Code4Lib group, but just whether we should formalize the conference, do I really care? Probably not.

Small Coders, Loosely Joined

The Code4Lib conference is fast approaching and, as people start to spend more time thinking about it, there is the question of how things should be done differently over the coming year. It’s a good question and one that Code4Lib does well to ask itself. Any “organization” that doesn’t question whether it’s doing things as well as it could be is already dead.

I think the question was originally posed on the mailing list as to how much structure Code4Lib should put into place to ensure that these conferences continue to happen (and perhaps grow). The issue has also been framed as a question of how much formality do we want to put into place (and where is the proper intersection between structure and formality)? My gut reaction was that I don’t want much formality and that any structure beyond a list of things that need to get done (and a list of volunteers willing to do them) is too much.

I think this is in part because I see Code4Lib as an experiment. There are plenty of well established (national and regional) organizations that put on conferences year after year (they are very dependable). Code4Lib is, for me, the exact opposite. It is a loose group of people who have gathered around a particular banner (library technology) because it interests us — it is more than just a job or a way to pay the bills (though many people in the Code4Lib cloud work in day jobs that do just that). In the end, though, I don’t need Code4Lib to fulfill any professional obligations and, while I would be sad to see it disappear, I wouldn’t lose any sleep over it either — I don’t see it as a movement.

It is this loose association that keeps me interested and coming back. There are times that I do not feel myself in sync with the group and there are times that I do. It is a relationship that I can take or leave, and I do (over and over again). I do not see Code4Lib as some sort of centralized organization that has a majority opinion. There are, at times, majorities of opinion, but they are no more representative of Code4Lib as a whole than the minority opinions because there is no consistency in them (the majorities) over time. I think this is, in part, because Code4Lib hasn’t, up to this point, been formalized into a “real thing” that needs to take a stand on issues (e.g., one that needs to promote a consistent opinion or work as a force in the library community).

The issue to me, I think, boils down to a question of decentralization/centralization. I’d like to see Code4Lib stay as decentralized as possible. I think a couple years of conferences have taught us that we definitely need some procedures (best practices). We also need a way to recognize people who volunteer to step up and take responsibility for something that needs to get done (for the fun stuff to happen). I think we’ve also learned after a couple of bouts that we would benefit from some code that makes it easier to make decisions as a group.

These to me, though, are more indicators that we need better documentation (and more communication) than they are indicators that we need a centralization of the organization. Both approaches would be valid ways to go… don’t get me wrong. We could document more (put our organizational knowledge into external form) or, just as valid, we could rely on people who would hold offices over a period of years. These are just two different paths. I think whether a person chooses one over the other probably has to do with where s/he sees Code4Lib going in the future (that is what paths are good for after all — or, maybe I should say, “Whether one sees the path as the point or whether there is an attainable destination in mind”).

The David Clark quote seems relevant: “We reject kings, presidents and voting. We believe in rough consensus and running code.” It is dissected in a paper up at W3C:

Open Participation ("We")
    citizen engineer:  citizen is a contributor to her space (lists, Web, MUD, FAQ)
Consensus. ("... believe in rough consensus ...")
    is it good enough, does it merit moving on, are there show stoppers?
No Kings ("... reject kings, presidents and voting.")
    consensus mediated by Elders, citizen engineers who built the space and institutions others inhabit.
Running Code / Implementation ("... believe ... in running code.")
    all policy is tested by both its support and formulation through implementation

I do think that having the conference mediated by kings (okay, that’s a loaded word isn’t it?) would ensure the successful growth of the Code4Lib conferences more reliably than just having better documentation would. Maybe. But, on the other hand, I’m okay if we have a few failures (e.g., have a conference or two that is a real stinker). We’ll learn from these (with good communication) just like we learn from making programming mistakes or faulty assumptions in our code. If we don’t learn and Code4Lib disappears that’s okay too. It was what it was and lived as long as it was useful.

I think, though, that creating a centralized structure that manages things from the top down would be a mistake for the group. I like the idea of having the local folks who are going to plan and host the conference take more of the responsibility. This doesn’t mean they have to take it all (especially if they don’t want to). I think this, though, will give each conference its own flavor (and that’s a good thing). If we rely on a centralized group to make consistent decisions year after year, things will go along smoothly enough but we will lose, in my opinion, some of the uniqueness of the conference (if it can be said to have characteristics after just a couple of years).

Maybe people would prefer the consistency. Maybe people want Code4Lib to become a force in the library community. I don’t. I’m happy (as happy can be) with the other professional organizations. Yes, they have their share of problems, but I don’t think creating a new professional organization is going to solve any of these. Anyway, enough crazy rambling…

MARC2Solr (Slight Return)

A while back, Andrew Nagy posted an XSLT for turning MARCXML into Solr’s XML indexing format. I thought it would be fun to take his XSLT and do the same thing in XQuery. I think it is pretty much a one-to-one conversion.

For the upcoming Code4Lib preconference, I thought about forming an XQuery group. I ended up joining the Java group, though, because there aren’t any native HTTP libs in XQuery (so I’d have to do that as an extension in Java anyway). I still think doing an XQuery group would be fun though.

For instance, one nice feature of XQuery is that it allows you to be as strongly or loosely typed as you’d like. Take off all the “as …” statements from the XQuery and it still works just fine (it just won’t be so picky about what you pass into, or return from, its functions).

Recently, I’ve found myself on both sides of this fence; when working with a little bit of throw-away Java code, I’ve found myself wishing for a little of Ruby’s loose typing. On the other hand sometimes, when experimenting with Ruby, I mutter to myself: “Why can’t this just be strongly typed so I know what to expect and do?”

XQuery really gives you the best of both worlds. This isn’t to say XQuery can do everything those other languages can (it can’t… and far from it). But, if you are working with XML (and want to focus on the data rather than the data’s source) I can’t think of a nicer language to use. It will be interesting to watch XQuery grow as a programming language.

So anyway… since my marc2solr.xq is written as a module you’ll need to call it from something else. This little XQuery (also here) works fine from Saxon (pass in the location of a MARCXML file on the file system as $input):

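xquery version "1.0";

(: Import the marc2solr module, located via a relative path. :)
import module
  namespace marc2solr = "http://lisforge.net/ns/marc2solr"
  at "marc2solr.xq";

(: Location of a MARCXML file on the file system, supplied by the caller. :)
declare variable $input external;

(: Load the MARCXML document and hand it to the module for conversion. :)
marc2solr:add-records(doc($input))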

Mapping MARC

I’ve been following (off and on) the discussions on #code4lib about mapping MARC to indices. I know each ILS has a different way of making this happen, but I wonder whether there has been any effort to pool together the decisions people have made (for instance, which MARC fields and subfields should be used for a title or author search). It would be interesting to see how much uniformity (or not) is out there.

I’ve learned from #code4lib that Erik Hatcher is working on a Ruby library that will index MARC (so he has a start on a MARC mapping in a Subversion repository). Are there other sources for seeing how people have mapped their MARC (or even how they’ve cleaned up their data; I know CDL has a date-normalizing library)? Is a site where this sort of information could be shared something that other people would find useful (and do our ILS contracts allow us to share it in some generic form)?
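To make the question a little more concrete, here is a minimal Ruby sketch of what one shareable mapping might look like. Everything in it is an assumption for illustration: the names INDEX_MAP and values_for are made up, the field choices (245 for title, 100 and 700 for author) are just one plausible set and not anyone’s actual ILS configuration, and the record is a plain array of hashes rather than a real MARC record object.

```ruby
# Hypothetical mapping from search index to MARC tags and subfield codes.
# These choices are illustrative only; real systems differ.
INDEX_MAP = {
  title:  { '245' => %w[a b] },
  author: { '100' => %w[a], '700' => %w[a] }
}.freeze

# A record is modeled here as an array of fields, each shaped like
# { tag: '245', subfields: [['a', 'Some title'], ['b', 'a subtitle']] }.
def values_for(record, index)
  wanted = INDEX_MAP.fetch(index)
  record.flat_map do |field|
    codes = wanted[field[:tag]]
    next [] unless codes # this tag is not part of the requested index

    field[:subfields].select { |code, _| codes.include?(code) }
                     .map { |_, value| value }
  end
end

# Made-up sample data, not a real catalog record.
record = [
  { tag: '100', subfields: [['a', 'Smith, Jane']] },
  { tag: '245', subfields: [['a', 'An example title'],
                            ['b', 'a subtitle'],
                            ['c', 'Jane Smith']] }
]

puts values_for(record, :title).join(' ')
puts values_for(record, :author).join('; ')
```

If several sites published something as simple as that hash, comparing them would show pretty quickly how much uniformity there is.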

Libxml-Ruby vs. REXML in Ruby-MARC

This weekend I reimplemented the XMLReader and XMLWriter classes in ruby-marc using Libxml-Ruby, a Ruby layer over the Libxml2 C library.

Currently, ruby-marc uses REXML, a pure Ruby XML library. Since REXML is built into Ruby, it is convenient. I was curious, though, how much of a performance boost there would be from using Libxml2. Here are the results of my very informal test (using some HCL MARC data):

                  User       System    Total      Real
XMLReader [old]:  24.300000  0.030000  24.330000  25.607547
XMLReader [new]:   3.180000  0.010000   3.190000   3.231896
XMLWriter [old]:  38.960000  0.060000  39.020000  41.017238
XMLWriter [new]:  11.950000  0.050000  12.000000  12.607114

Both XMLWriter times include the new XMLReader reading records in from a source file. As a record is read in, it is written out to a new file. This is just intended to get an inkling of what the difference between the two versions might be (not to be a formal benchmark). Lower numbers are better.
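For a rough sense of scale, the ratios can be worked out from the Real column of the table above. This is only arithmetic on the numbers already reported; the constant and method names are mine.

```ruby
# Wall-clock ("Real") times copied from the table above.
REAL_TIMES = {
  'XMLReader' => { old: 25.607547, new: 3.231896 },
  'XMLWriter' => { old: 41.017238, new: 12.607114 }
}.freeze

# Speedup is simply the old time divided by the new time.
def speedup(times)
  times[:old] / times[:new]
end

REAL_TIMES.each do |name, times|
  puts format('%s: about %.1fx faster', name, speedup(times))
end
```

That comes to roughly 7.9x for the reader and 3.3x for the writer. Since both writer runs also include the new reader’s time, the writer-only difference is likely somewhat larger than that second figure suggests.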

So, in reimplementing, I completely rewrote the reader. It just reads from a file and returns MARC::Record objects. What is being used to read the XML is completely swappable with anything else.

With the writer, I changed the encode method so that it now takes an option specifying which library should be used (REXML is still the default). Since the method is public, I figured someone is probably using the REXML Documents it returns, and their code would break if I returned a Libxml Document instead. The write method, on the other hand, now uses Libxml by default.
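As a sketch of the idea (this is not the actual ruby-marc code), an encode method with a library option might look something like the following. The module name EncodeSketch, the library: keyword, the register helper, and the simplified tag-to-text record format are all hypothetical; only the REXML calls are real. A Libxml-backed builder would be registered the same way.

```ruby
require 'rexml/document'

# Hypothetical illustration of choosing an XML library via an option.
module EncodeSketch
  # Builders keyed by library name; each turns simple fields into a document.
  BUILDERS = {}

  def self.register(name, &builder)
    BUILDERS[name] = builder
  end

  # REXML stays the default so existing callers still get a REXML::Document.
  def self.encode(fields, library: :rexml)
    builder = BUILDERS.fetch(library) do
      raise ArgumentError, "unknown XML library: #{library}"
    end
    builder.call(fields)
  end
end

# Default builder: a deliberately simplified record (tag => text value),
# not full MARCXML with indicators and subfields.
EncodeSketch.register(:rexml) do |fields|
  doc = REXML::Document.new
  record = doc.add_element('record')
  fields.each do |tag, value|
    record.add_element('datafield', 'tag' => tag).text = value
  end
  doc
end

doc = EncodeSketch.encode({ '245' => 'An example title' })
puts doc
```

The point of keeping the default where it was is that callers who never pass the option see no change in what they get back.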

I haven’t checked in any of these changes yet (since I haven’t passed them by Ed and don’t know whether they should be incorporated), but I have validated that the existing tests still pass just fine.

The speed improvements are pretty nice. If an extra dependency can be tolerated it would be nice to have the performance boost. The only other caveat is I used the 0.4.0pre01 version of Libxml-Ruby. It might be desirable to wait until the final 0.4.0 release.

Anyway, I’ll get Ed’s opinion on all this sometime this coming week. Right now, it is just a fun experiment.

Do You Trust Your Data Modelers?

In the #code4lib IRC channel today Ed Summers asked me some good questions about storing metadata in a native XML database. The gist of his questions was that he wasn’t sure he saw any advantages that a native XML database might have over a relational database (yes, I’m simplifying a bit I’m sure). As we were winding down he said, “just preppin you for my questions at code4libcon.”

My first thought after digesting the conversation was, “Hey, wait, I’m not even talking about native XML databases at code4libcon!” My proposal is about using XQuery in the digital library realm. True, we are using a native XML database here, but just because one uses XQuery doesn’t mean s/he is using a native XML database. You can use XQuery just as easily with DB2 or Oracle’s database (or files on the file system).

The one thing that native XML databases and XQuery do have in common, though, is that they let you interact with your data directly — it doesn’t have to be deconstructed into another structure and then reconstructed when you want the whole thing back out again (in the case of XQuery being used over a relational database, that (de|re)construction takes place invisibly in the database layer).

But, is this a good thing? Ed kept saying he didn’t see any data modeling going on with native XML databases.

There is data modeling going on with native XML databases, I’d suggest, but it happens on the metadata side of things. Andrew Nagy made this observation recently on the code4lib mailing list when he noted how poorly just putting MARCXML into a native XML database performs.

This is because putting MARCXML into a native XML database makes MARCXML the data model. MARC was intended for concise transfer, not for working with the data… it was assumed by the architects of MARC (I believe and hope) that MARC would be reconstructed into something else before anyone tried to do anything meaningful with it (for what it is worth, this only partially happens in the library world).

XQuery allows the developer to work more nimbly with the data models s/he is given (instead of mapping them into other data models that match the database s/he has chosen to use). So, what are these data models? They are the XML metadata standards being created by the different knowledge communities. Unsurprisingly, to use these data models, the developer needs to know them (i.e., having programmers in our libraries is a better idea than contracting out to people not in the profession).

Are the people creating these (meta)data models working to make accessing the data easier? That could be one critique leveled at native XML databases (from the perspective of the developers)… if you aren’t doing the data modeling, can you trust the people who are?

It’s not that bad though (put down that gun you cynical library developers); keep in mind that XQuery isn’t a fulltext query language. Think of it more as a database query language (even though there doesn’t have to be a database).

To use XQuery in the digital library world, in my opinion, you still need to use a fulltext indexer (like Lucene or the type built into many XML-enabled databases). Indices may be used through proprietary extensions to the XQuery language (indicated by a different namespace) or through separate processes which feed the XQuery engine (as in a pre-processing stage).

For what it is worth, there is a fulltext extension to the XQuery spec that is being written to take advantage of these external indices, but it is not really out there in the world yet. In the meantime, even if our metadata models (e.g., MARCXML) aren’t the best, we can still create and use indices that provide an intelligent view of the data.

One nice thing about working directly with the data/data models you receive is that there are fewer “moving parts” to fix when things change (in other words, fewer things to get in the way). Because we all know digital library standards don’t change, right?

Rather than go through the process of re-mapping to the database’s structure, you only need to modify the parts of your code that deal with the parts of the metadata that have changed. You’d have to do this with the other option too… just because you have a standard way of normalizing data doesn’t mean that all data is structured in the same way (in terms of how you get at the pieces you want).

I could mention, I guess, some reasons why I like native XML databases in my presentation, but I’m not sure this is a good idea. I think it may distract from the beauty that is XQuery. I’m also hoping Andrew Nagy will cover this territory in his presentation comparing different native XML databases. One of XQuery’s strengths is that it is database agnostic; I shouldn’t stray from that.

For the record, the Ed Summers that appears in this post is not the real Ed (despite the conversation actually happening), but one of my own conception for rhetorical purposes only. :-)

Access Hackfest 2006

I had two suggestions in for this year’s Access Hackfest and I’ll confess my suggestions were motivated purely by self-interest (as Mike mentions, both are things I’d really like to be able to use in my day job). Happily, for me, both were tackled by this year’s Hackfest participants.

I chose to work on my second suggestion (an Ajax METS editor) with Declan Fleming, Todd Holbrook, Peter Binkley, and Mike Giarlo because it is (probably) the most pressing need I have. The results from the group effort were good, I think. Hopefully the project we’ve started gives me, and others I’ll be working with, a direction in which to move once I return to work.

For our project, we first showed how Scriptaculous could be used to create a UI that dynamically creates a hierarchical structure into which images (and different types of metadata (descriptive, technical, etc.)) can be placed (modeling the digital object’s intellectual structure). Next we showed, using work from a Cocoon-based project, how results from the editor could be integrated with files on the file system (including pulling MIX metadata automatically from TIFF images). Lastly, we created an XSL stylesheet that generated METS from the XML data moving through Cocoon’s pipelines.
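To give a feel for that last step, here is a small sketch of turning a nested structure into a METS structMap. Note that this swaps in Ruby and REXML for the XSL stylesheet and Cocoon pipeline we actually used, and the input hash, the file IDs, and the method names are all made up for illustration. It also produces only the structMap; a real METS document would need a fileSec for the FILEID references to point at.

```ruby
require 'rexml/document'

METS_NS = 'http://www.loc.gov/METS/'

# Recursively turn a nested hash into mets:div elements, with one
# mets:fptr per file ID attached to that level of the hierarchy.
def add_div(parent, node)
  div = parent.add_element('mets:div', 'LABEL' => node[:label])
  Array(node[:files]).each do |file_id|
    div.add_element('mets:fptr', 'FILEID' => file_id)
  end
  Array(node[:children]).each { |child| add_div(div, child) }
  div
end

# Wrap the hierarchy in a mets:mets root with a logical structMap.
def structure_to_mets(structure)
  doc = REXML::Document.new
  mets = doc.add_element('mets:mets', 'xmlns:mets' => METS_NS)
  struct_map = mets.add_element('mets:structMap', 'TYPE' => 'logical')
  add_div(struct_map, structure)
  doc
end

# Hypothetical output of the editor: a simple two-level hierarchy.
structure = {
  label: 'Example book',
  children: [
    { label: 'Chapter 1', files: %w[img0001 img0002] },
    { label: 'Chapter 2', files: %w[img0003] }
  ]
}

structure_to_mets(structure).write($stdout, 2)
```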

Our demo is a good proof of concept, I think. Future directions for the editor would include support for more complex hierarchies (ours handles only the simplest case but, hey, it was done in a day after all). For the future, the Scriptaculous techniques demonstrated on this page might prove useful. It certainly looks like what we want to do.

Many UI tweaks would need to be made for the editor to be able to handle hundreds of images. Also needed is a way to link to the descriptive metadata for an item (it would exist in an external database — for our prototype we made some assumptions about files and directories on the file system (convention over configuration)). But, those are things for us to work on in the future.

Overall, it was a pretty fun day (despite my nursing a huge headache; I didn’t get much sleep the night before, since I never sleep well the first night away from home). Like last year, the most interesting part of the Hackfest (and this is the main point for the participants, I think) was getting to see how others approached (or have approached) the problem.

Anyway, this was my second Hackfest and I’d definitely say it was worth doing (even though, I’ll admit, I’m not a person who usually does things quickly). I’m looking forward to next year’s.