Disco-Powered pymarc

I'd long been interested in developing code with some sort of MapReduce implementation for distributed computing. I have never been able to get my head around Hadoop, so I gave up on that pretty quickly. I recently discovered Disco, a MapReduce framework with an Erlang-based core. Disco also lets you write your worker code in Python, which was a huge plus for me.

After stumbling through the tutorial, I took the word count demo and put together some basic code using pymarc that gathers tag count statistics for a bunch of MARC files. The code's still in a very early form, and arguably should carve up large files into smaller chunks to pass off to the worker processes; I've gotten around this for the time being by splitting up the files using yaz-marcdump. Once I split the files, I pushed them into DDFS, the Disco Distributed File System, under a single tag. This was a useful way for me to write some demo code for both pymarc and Disco. The Disco job code appears below, and is also available as a Gist on GitHub.
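Before the Disco job itself, a quick aside on the chunking step. It could be sketched in plain Python instead of being done with yaz-marcdump. This is a minimal sketch, not the code used for this post. It assumes MARC 21 transmission format, where the first five bytes of each record's leader give the total record length as ASCII digits. The names iter_raw_records, chunk_records, and fake_record are hypothetical helpers, not part of pymarc or Disco, and fake_record only builds length-prefixed stand-in data rather than valid MARC.

```python
import io

def iter_raw_records(stream):
    """Yield each record as raw bytes, using the 5-digit length in the leader."""
    while True:
        head = stream.read(5)
        if len(head) < 5:
            return  # end of stream, or trailing bytes too short to be a record
        length = int(head.decode("ascii"))
        # The length counts the whole record, including the 5 bytes just read.
        yield head + stream.read(length - 5)

def chunk_records(stream, per_chunk):
    """Group raw records into byte strings of at most per_chunk records each."""
    chunk = []
    for raw in iter_raw_records(stream):
        chunk.append(raw)
        if len(chunk) == per_chunk:
            yield b"".join(chunk)
            chunk = []
    if chunk:
        yield b"".join(chunk)

def fake_record(body):
    """Hypothetical stand-in: length prefix + body + record terminator (0x1D)."""
    length = 5 + len(body) + 1
    return ("%05d" % length).encode("ascii") + body + b"\x1d"

# Three stand-in records split into chunks of two records each.
sample = [fake_record(b"a"), fake_record(b"bb"), fake_record(b"ccc")]
chunks = list(chunk_records(io.BytesIO(b"".join(sample)), 2))
print(len(chunks))
```

Each chunk ends on a record boundary, so it could be written to its own file and handed to a worker without cutting a record in half.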

 
#!/usr/bin/env python
#
# pymarc_disco.py - Mark Matienzo
#   sample MapReduce tasks for Disco to get tag counts from MARC files/streams
#   usage: python pymarc_disco.py <input1> [input2 ... inputN]
 
import sys
from disco.core import Disco, result_iterator
from disco.settings import DiscoSettings
import pymarc
 
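# Reader: wrap each input stream in a MARCReader so the map phase
# receives one pymarc record at a time.
def read(fd, size, fname):
    return pymarc.MARCReader(fd)

# Map: emit a (tag, 1) pair for every field in the record.
def map(record, params):
    for field in record.fields:
        yield field.tag, 1

# Reduce: sort the pairs so equal tags are adjacent, then sum the counts per tag.
def reduce(iter, params):
    from disco.util import kvgroup
    for tag, counts in kvgroup(sorted(iter)):
        yield tag, sum(counts)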
 
disco = Disco(DiscoSettings()['DISCO_MASTER'])
print "Starting Disco job.."
print "Go to %s to see status of the job." % disco.master
results = disco.new_job(name="tagcount",
                        input=sys.argv[1:],
                        map=map,
                        map_reader=read,
                        reduce=reduce,
                        save=True).wait()
print "Job done. Results:"
for word, count in result_iterator(results):
    print word, count
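
To see what the map and reduce steps above compute without a Disco cluster, here is a rough local simulation. It is only a sketch: itertools.groupby stands in for Disco's kvgroup (which, as I understand it, groups consecutive pairs that share a key and does not sort for you), and the records list is made-up data in which each inner list holds the field tags of one record. The names map_tags and reduce_counts are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_tags(tags):
    """Map phase: emit (tag, 1) for every field tag in one record."""
    for tag in tags:
        yield tag, 1

def reduce_counts(pairs):
    """Reduce phase: sort pairs by key, group equal keys, and sum each group."""
    for tag, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield tag, sum(count for _, count in group)

# Hypothetical stand-in data: the field tags of two records.
records = [
    ["001", "245", "650", "650"],
    ["001", "100", "245"],
]

# Run the map phase over every record, then reduce all emitted pairs at once.
pairs = [pair for tags in records for pair in map_tags(tags)]
tag_counts = dict(reduce_counts(pairs))

for tag in sorted(tag_counts):
    print(tag, tag_counts[tag])
```

The sort before grouping matters: without it, the same tag emitted by different records would land in separate groups and be counted separately.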
 

“Using the OCLC WorldCat APIs” now available in Python Magazine

As of last Thursday, I have been inducted into the pantheon of published Python programmers (aye, abuse of alliteration is always acceptable). My article, "Using the OCLC WorldCat APIs," appears in the latest issue (June 2009) of Python Magazine. I'd like to thank my editor, Brandon Craig Rhodes, for helping me along in the process, not least by catching bugs that I'd overlooked. The article includes a brief history lesson about OCLC, WorldCat, and the WorldCat Affiliate APIs; a detailed introduction to worldcat, my Python module for interacting with OCLC's APIs; and a brief introduction to SIMILE Exhibit, which helps generate the holdings mashup referenced earlier on my blog. Subscribers to Python Magazine have access to a copy of the code containing a functional OCLC Web Services key ("wskey") to explore the application.

worldcat In The Wild at OCLC’s WorldCat Mashathon in Amsterdam

It's good to see other people using your code. Thanks to the OCLC Devnet Blog, I found out that Etienne Posthumus used worldcat for a demo application he built during the WorldCat Mashathon in Amsterdam last week. Even more interesting is that Etienne's application was deployed on Google App Engine. Courtesy of OCLC's Alice Sneary, there is a brief video of Etienne presenting his application to the other Mashathon attendees.

Batch Reindexing for Drupal + Solr

Crossposted to NYPL Labs. Sorry for any duplication!

Hey, do you use Drupal on a site with several thousand nodes? Do you also use the Apache Solr Integration module? If you're like me, you've probably needed to reindex your site but couldn't be bothered to wait for those pesky cron runs to finish – in fact, that's what led me to file a feature request on the module to begin with.

Well, fret no more, because thanks to me and Greg Kallenberg, my illustrious fellow Applications Developer at NYPL DGTL, you can finally use Drupal's Batch API to reindex your site. The module is available as an attachment from that same issue node on drupal.org. Nota bene: this is a really rough module, with code swiped pretty shamelessly from the Example Use of the Batch API page on drupal.org. It works, though, and it works well enough for us as we tear stuff down and build it back up over and over again.