Why I Have Given Up On the Archives and Archivists List

I am certainly not the first person to chime in on this topic, and I certainly hope not to be the last. Inspired by two fantastic posts by Ben Bromley and Maureen Callahan, I have chosen to discuss the reasons why I have given up on the Archives and Archivists List. Unlike Ben and Maureen, who discuss why they choose not to post to the list, I'm also including reasons why I choose not to read or subscribe to the list anymore. For what it's worth, until yesterday, I had been on the A&A List for almost nine long years.

  1. I don't think the majority of the traffic is terribly useful. This can be incredibly frustrating, especially when there's a question on a topic you happen to know something about. Telling someone how to perform a Google search is not an adequate response. Given the signal-to-noise ratio of the list, useful or timely messages can be easily buried.
  2. Off-topic messages seem to be the rule rather than the exception. I finally snapped when the "Virtual Picnic" began. I know it's a fun tradition and all, but it's a tradition in which many of us have neither the time nor the interest to indulge.
  3. People can be brusque, mean, angry, unhappy, etc. Posters to the list can be insulting. I tend to have a rather thick skin, but it's still a bit inexcusable to see some of the adults interacting the way they do on the list. Also, if a casual observer came across the A&A List, I'm afraid it'd give our profession a bad rap as containing a lot of miserable people who seem to spend the better part of their work day complaining or overreacting about the state of things. (I know, pot, kettle, black. To be fair, it's no longer during my work day.)
  4. There are better professional resources out there. A recent thread disputed the value of the list over things like blogs and Twitter. Generally speaking, I get better professional advice from sources other than the A&A List. These include blogs and Twitter as well as other e-mail lists. There are plenty of section-, roundtable-, and function-specific e-mail lists that are likely better candidates for your attention. For example, I would send someone to the EAD Listserv before the A&A List if they had a question about EAD.
  5. It is arcane and hard to interact with beyond reading and posting. Maureen suggests checking the list archives before you post a question. Unfortunately, the search interface isn't entirely intuitive. Changing your subscription settings isn't entirely straightforward either. How can we be expected to use this resource with any facility if the tools we have are so suboptimal?

I may not leave the list forever, but I'm certainly done with it for now. I am happy to contribute back to the online professional community elsewhere and I certainly plan to continue doing so. We need a better solution, though, so start preparing your alternate online fora as soon as possible.

With Little Fanfare, dLIST Goes Down

I've been meaning to blog about this for a while. DLIST, the Digital Library of Information Science and Technology, maintained by the University of Arizona School of Information Resources and Library Science, has been down for at least three months. Any URL formerly part of DLIST gets automatically redirected to an announcement page that reads as follows:

Aging hardware and conversion issues following a system crash have taken their toll on DLIST, the University of Arizona's Digitial Library of Information Science and Technology. We are currently exploring choices and alternatives both to short term recovery and long term sustainability. The resources and metadata are fully recovered, and we hope to put them back online in a new repository soon.

If you or your institution would like to assist with the DLIST project, please contact us at sirls@email.arizona.edu. Thanks for your support!

While I feel for the difficulties they've had in maintaining it, I have to admit that it's a bit frustrating for me from the standpoint of someone who submitted material to DLIST. I suppose I had some expectation that the site would be maintained more than it was, but I also know that repositories are a lot of work. This is particularly true of open access repositories like DLIST and Mana'o, the OA repository for anthropologists that shut down last August. I guess I also had a higher social expectation for uptime since DLIST is (was?) maintained by a library school.

The particularly troubling thing to me is that the project seems to be in some sort of limbo. DLIST's maintainers didn't reach out to anyone who deposited their work there, or at least I never received anything alerting me that my resources were lost but now are found. In terms of discovery, it's also troubling to see that something a little dodgy might be going on with the citation metadata and Google Scholar:

[Image: Strange behavior in Google Search for DLIST]

So what is to be done? I'd love to help, but I'd also just like to know I can get my stuff back. I'd love to see what was in DLIST get folded into E-LIS, which also happens to be running EPrints. But most of all, I'd like the maintainers to acknowledge the downtime to the contributors. Like others, I stumbled upon this by sheer accident.

Code4lib 2010: Southern Hospitality

I recently returned from a trip to Asheville, North Carolina for this year's Code4lib conference. Despite the unavoidable hiccups that some attendees experienced as they tried to head home from the conference, I believe that this year's conference was the most successful one that I have attended. If I'm right, this year had a record number of attendees, a record number of new attendees, and much tighter organization to make the new folks feel welcome.

The social activities were certainly more planned and organized than last year, which was a welcome change. While I certainly didn't mind hollering out to the crowd that I would be going to see some bands or to a particular restaurant like I had in previous years, it was nice to see other folks take the lead. The newcomer dinners seemed to go pretty well; the brews cruise and barbecue excursions went smoothly; and even the game(s) of Werewolf seemed to take on a life of their own.

What of the program? Well, again, I was extremely happy with the way it turned out, personally. I spent the morning of the Monday preconference in the Solr blackbelt session led by Erik Hatcher from Lucid Imagination and Naomi Dushay from Stanford. During that afternoon, I helped out Dan Chudnov by bumbling through a demo of pymarc. The keynotes by Cathy Marshall and Paul Jones were both delightfully thought-provoking and whimsical.

In all, the sessions were pretty fantastic, but the following stuck out for me:

Finally, I believe that the Ask Anything session deserves its own treatment. The idea was to ask a roomful of people for help, advice, references, etc. on anything, and I think it largely worked. This session itself proved the vibrancy and strength of Code4lib: largely unstructured, highly social, and focused on mutual aid. I will spare you any comparisons to Rainbow Gatherings, but I'll say this much: I'm proud to be part of Code4lib, and I'd like to welcome you to next year's.

Description Peddlers and Data.gov: Two Peas In a Pod

As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:

The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007–2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.

The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:

  • 2,720,765 cubic feet
  • 520 record groups
  • 2,365 collections
  • 102,598 series
  • 3,265,988 file units
  • 292,887 items

In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.

NARA's decision to share this data is a breakthrough for archives and people who love data. The size of the data provided by NARA in ARC is also immense; the combined descriptions plus contextual information on represented organizations totals approximately 21 gigabytes when uncompressed.

Obviously, transferring this much data is difficult, and when I first decided to get my grubby paws on it, I was quite shocked to discover that NARA hadn't bothered to compress it in the first place. Not to be outdone, I corresponded over Twitter with a few people who were just as interested in the data, specifically Simon Spero at the UNC School of Information and Library Science and Richard Urban at UIUC's Graduate School of Library and Information Science. The three of us made a concerted effort to grab the data from NARA's web server and make a compressed version available.

After 6 hours or so of transferring the files and compressing them, Simon has posted the compressed dataset on ibiblio.org, as part of his Fred2.0 dataset project. Download the whole thing, decompress it, and start crunching - there's so much you can do with it! Convert the series descriptions to EAD! Convert the organizational descriptions and histories to EAC! Throw Mitchell Whitelaw's series browser on top of it! The future's in your hands, people, and now the data is too.

We've talked about posting a torrent, but between the compression and the high bandwidth available from ibiblio, it doesn't seem to be quite as pressing a need. However, if you'd like, it could be arranged. More detail on the datasets, including detailed information about the tags and structure of the data within, can be found on Data.gov.
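
If you want a feel for what's inside before you start converting anything, one low-effort first step is to tally the element names in a file. What follows is a minimal Python sketch, not anything NARA or Simon provides: it assumes the file you downloaded is a single gzip-compressed XML document (check the actual compression format and swap in `bz2.open` or similar if needed), and the function name `tally_tags` is made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def tally_tags(path):
    """Stream a gzip-compressed XML file and count how often each tag occurs.

    Assumption: `path` points at one well-formed XML document compressed
    with gzip. Nothing here is specific to the ARC schema.
    """
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # iterparse hands back each element as its end tag is read, so the
        # parser never needs the full document text in memory at once.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            counts[elem.tag] += 1
            # Drop the element's text and children once counted. The emptied
            # element stays attached to its parent, so memory still grows
            # slowly on very large files.
            elem.clear()
    return counts
```

Calling `tally_tags` on a file path returns a `Counter`, so `.most_common(20)` lists the twenty most frequent tags. Streaming with `iterparse` is a design choice for big dumps; for small files, parsing the whole tree at once would be simpler.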

Onward And Upward…

It's fitting that this is the hundredth (gosh, only the hundredth?) post on thesecretmirror.com, because I have rather important news. First, my fellow developers/producers/UX designers at The New York Public Library and I have been dealing with every minute detail on the upcoming, Drupal-based replacement for the NYPL website. You can see a live preview at http://new.nypl.org/. I can proudly say that this project has helped both me personally and NYPL overall play nice in the open source world - we've been actively contributing code, reporting bugs, and sending patches to the Drupal project. Also, our site search is based on Solr, which always bears mention.

In addition, after working tirelessly as a developer at NYPL for the last year and a half, I have decided to move onward and upward. I am leaving the cozy environs of the still-recently renovated office space I share with my spectacular coworkers. It was not an easy decision by far, but it feels like the best one overall. I am leaving NYPL for the Ivy League; at the end of January, I will be starting as a Digital Archivist with Manuscripts and Archives at the Yale University Library. It's a fantastic opportunity, and I'm looking forward to being part of an incredibly strong archival community. I will be working with fantastic people on interesting projects in the field in which I was educated and to which I feel the strongest bond. It's the perfect blend of traditional archives work, code and metadata geekery, and digital preservation!

Wish me luck, and if you're ever in New Haven, I'll split a white clam pie and a pitcher of birch beer with you at Modern.

Clifford Lynch Clarifies Position on Open Source ILSes

Clifford Lynch, Executive Director of the Coalition for Networked Information, has responded to the leaked SirsiDynix report that spreads horrific untruths about open source. Marshall Breeding posted Lynch's response on GuidePosts. In particular, Lynch notes the following:

I don't think that I ever wrote those words down in an article; I suppose I may have said something to that effect in an interview or q&a in some conference program like ALA Top Tech, though perhaps no quite as strongly as it's expressed here. I have without question spoken out about my concerns regarding investment in open source ILS development in the last few years. IF I did say this, it feels like it's used a little out of context -- or maybe the better characterization is over-simplistically -- in the report.

...

I think there are still major problems -- many of which we really don't know how to solve effectively, and which call for sustained and extensive research and development -- in various areas where ILS get involved in information discovery and the support of research and teaching. While I'm not opposed to seeing an open source ILS -- who could be? -- and recognize that it could be very useful, particularly as a platform for research and future innovation, open source re-implementation of current ILS functionality will not be a panacea for these still-unsolved challenges.

I'd like to thank Lynch for being willing to respond publicly about this. This report has plenty of holes in it, and according to esr's characterization of Linus' Law, "given enough eyeballs, all bugs are shallow." Let's show SirsiDynix that this is truly one of the most important values of open source.

SirsiDynix Report Leaked, Spreading Fear, Uncertainty and Doubt about Open Source

Thanks to Twitter, I discovered that Wikileaks has posted a report written by SirsiDynix Vice President for Innovation Stephen Abram which spreads a fantastic amount of fear, uncertainty and doubt about both open source software in general and, more specifically, the suitability of open source integrated library systems. As the summary provided by Wikileaks states,

This document was released only to a select number of existing customers of the company SirsiDynix, a proprietary library automation software vendor. It has not been released more broadly specifically because of the misinformation about open source software and possible libel per se against certain competitors contained therein ...

The source states that the document should be leaked so that everyone can see to what extent SirsiDynix will attempt to spread falsehoods and smear open source and the proponents of open source.

In addition, as you may have heard, the Queens Library is suing SirsiDynix for breach of contract; for what it's worth, the initial conference is scheduled for next Monday, November 2, 2009. More information on the lawsuit can be found on Justia.

I think one of the most concerning aspects of this disappointing white paper is the attribution of a quote to Clifford Lynch, the Executive Director of the Coalition for Networked Information. The quote is as follows, taken from page 10 of the report:

Although many in the ILS industry are taking an in-depth look at the viability of open source development over the long run, we believe the movement is premature. Moreover, we are joined in our opinion by none other than Cliff Lynch, the head of the Coalition for Networked Information and a leading thinker in the library space.

Cliff called the development of the open source ILS by OLE, [Georgia] Pines [the developers of Evergreen], etc. one of the “stupidest strategies ever undertaken” in the library world. At a time when libraries should be investing in systems to improve the priority issues in the end-user’s research, discovery and learning experience, here we have a cadre of libraries investing in the reinvention or at least, recreation, of something they already have and have at a cheaper cost than the redevelopment effort.

While Lynch is known for his provocative comments, I find it to be quite alarming that Abram, a trained librarian (read: a holder of an MLS) and Fellow of the Special Libraries Association — two facts made very clear on the title page of the report — did not bother to cite the source of this comment or provide adequate context for it. This comment doesn't particularly stand well on its own, but I see the point; perhaps developers in the library sphere should focus on usability and discovery issues rather than back office parts of the ILS. However, what Abram ignores is that this sort of work ends up being part and parcel of open source development in libraries.

Stephen Abram has a blog and a Twitter account. We could ask him to respond in earnest about the FUD-spreading. How we do that is left as an exercise to the reader.

pybhl: Accessing the Biodiversity Heritage Library’s Data Using OpenURL and Python

Via Twitter, I heard about the Biodiversity Heritage Library's relatively new OpenURL Resolver, announced in their blog about a month ago. More specifically, I heard about Matt Yoder's new Ruby library, rubyBHL, which exploits the BHL OpenURL Resolver to provide metadata about items in their holdings and does some additional screen scraping to return things like links to the OCRed version of the text.

In typical fashion, I've ported Matt's library to Python, and have released my code. pybhl is available from my site, PyPI, and GitHub. Use should be fairly straightforward, as seen below:

 
>>> import pybhl
>>> import pprint
>>> b = pybhl.BHLOpenURLRequest(genre='book',
...     aulast='smith', aufirst='john', date='1900',
...     spage='5', volume='4')
>>> r = b.get_response()
>>> len(r.data['citations'])
3
>>> pprint.pprint(r.data['citations'][1])
{u'ATitle': u'',
 u'Authors': [u'Smith, John Donnell,'],
 u'Date': u'1895',
 u'EPage': u'',
 u'Edition': u'',
 u'Genre': u'Journal',
 u'Isbn': u'',
 u'Issn': u'',
 u'ItemUrl': u'http://www.biodiversitylibrary.org/item/15284',
 u'Language': u'Latin',
 u'Lccn': u'',
 u'Oclc': u'10330096',
 u'Pages': u'',
 u'PublicationFrequency': u'',
 u'PublisherName': u'H.N. Patterson,',
 u'PublisherPlace': u'Oquawkae [Ill.] :',
 u'SPage': u'Page 5',
 u'STitle': u'',
 u'Subjects': [u'Central America', u'Guatemala', u'Plants', u''],
 u'Title': u'Enumeratio plantarum Guatemalensium imprimis a H. de Tuerckheim collectarum /quas edidit John Donnell Smith.',
 u'TitleUrl': u'http://www.biodiversitylibrary.org/bibliography/827',
 u'Url': u'http://www.biodiversitylibrary.org/page/707932',
 u'Volume': u'4'}
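
For the curious, an OpenURL request is at bottom just a URL with key/value pairs in its query string. The sketch below (Python 3, unlike the Python 2 session above) builds such a URL from the same parameters. It is not pybhl's actual implementation: `build_openurl` is a hypothetical helper, and the resolver address is an assumption that should be checked against BHL's own announcement.

```python
from urllib.parse import urlencode

# Assumed resolver endpoint; verify against BHL's documentation before use.
BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"


def build_openurl(base=BHL_OPENURL, **params):
    """Return a resolver URL carrying the given OpenURL key/value pairs."""
    # Sort the keys so the same parameters always yield the same URL.
    query = urlencode(sorted(params.items()))
    return base + "?" + query


url = build_openurl(genre="book", aulast="smith", aufirst="john",
                    date="1900", spage="5", volume="4")
print(url)
```

The sketch only constructs the request; fetching it and parsing the response are the parts a library like pybhl takes care of.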

Let me know if you find it useful - I'd appreciate any feedback!

Access and Description Reconsidered

What exactly is archival access, and how does archival description make it possible? I feel like I've been struggling with this question in some form or another throughout my career. Recently, this blog post from The Top Shelf, the blog of the University of Texas at San Antonio Archives and Special Collections Department, came across my radar, wherein they write (emphasis in original):

UTSA Archives and Special Collections is among the growing number of archives to create an online presence for every one of its collections. ... We were able to utilize inventories generated by former and current collection assistants to create guides to the collection with folder-level and box-level descriptions. The project resulted in access to more than 130 collections and 2000 linear feet of materials.

What defines that accessibility? I certainly don't intend to be a negative Nancy about this - adding finding aids and other descriptive metadata about collections is obviously useful. But how has it necessarily increased access to the materials themselves?

Archivists in part provide descriptive, contextual, and structural metadata about collections, but ultimately, the products of our knowledge work are merely representations of a complex whole, teeming with relationships between facts, assumptions, and wild-arsed guesses. Thinking of arrangement and description as representation is not a new idea; cf. Yakel, “Archival Representation” (Archival Science 3, no. 1 (2003): 1-25) and Karen Gracy's class on Archival Representation at Pitt. But really, what is the end goal of that representation and how does that converge with what users and our professional peers (e.g. library and museum professionals) expect?

Our professional community is still a bit too sheltered, I'm afraid. I really disagree with Russell James' recent post "Why not 'records science'?" for a few reasons (just as I agree with him on a few points...for what it's worth, there is records science; it just happens to be called archivistics or archival science!). Archivists still have a ridiculous amount to learn about information science. It might not have seemed relevant during whatever graduate program you attended, but really, it's ultra-important!

I'm reading David C. Blair's Language and Representation in Information Retrieval (Elsevier Science & Technology, 1990) right now and am totally riveted by it. This and other works consistently prove to me that we have a lot of important lessons to learn from our counterparts in other fields, namely how metadata, indexing, and the like actually form sets of representations that unavoidably mediate access between a person with an information need and that information. This isn't new research, and it's not just a problem with electronic records. It's just incredibly unfortunate that there isn't a really good community within the profession to discuss this stuff.

You haven't heard the last of me on this yet, and I know this post has been quite rambling. Until next time...

AIP Receives NHPRC Funding To Digitize Samuel Goudsmit Papers

I'm happy to pass on the news that my former employer, the Niels Bohr Library & Archives of the American Institute of Physics, has received funding from the National Historical Publications and Records Commission to digitize the entirety of the Samuel Goudsmit papers. From the announcement on the Center for History of Physics/Niels Bohr Library & Archives Facebook page:

Goudsmit (1902–1978) was a Dutch-educated physicist who spent his career in the US and was involved at the cutting edge of physics for over 50 years. He was an important player in the development of quantum mechanics in the 1920s and 1930s; he then served as scientific head of the Alsos Mission during World War II, which assessed the progress of the German atomic bomb project. Goudsmit became a senior scientist at Brookhaven National Laboratory and editor-in-chief of the American Physical Society. The papers consist of an estimated 66,000 documents, which include correspondence, research notebooks, lectures, reports, and captured German war documents; the collection is the most used in the library.  This is the first manuscript collection that we will digitize, and it will become one of the few complete history-of-physics collections online. We plan to make the collection freely available online by summer 2010.

AIP's finding aid to the Goudsmit papers is available online. Congratulations to the staff of the NBLA!