As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:
The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007–2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.
The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:
2,720,765 cubic feet
520 record groups
2,365 collections
102,598 series
3,265,988 file units
292,887 items
In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.
NARA's decision to share this data is a breakthrough for archives and people who love data. The size of the data provided by NARA in ARC is also immense; the combined descriptions plus contextual information on represented organizations total approximately 21 gigabytes when uncompressed.
Obviously, transferring this much data is difficult, and when I first decided to get my grubby paws on it, I was quite shocked to discover that NARA hadn't bothered to compress it in the first place. Not to be outdone, I corresponded with a few people over Twitter who were just as interested in the data, specifically Simon Spero at the UNC School of Information and Library Science and Richard Urban at UIUC's Graduate School of Library and Information Science. The three of us made a concerted effort to grab the data from NARA's web server and make a compressed version available.
After 6 hours or so of transferring the files and compressing them, Simon has posted the compressed dataset on ibiblio.org, as part of his Fred2.0 dataset project. Download the whole thing, decompress it, and start crunching - there's so much you can do with it! Convert the series descriptions to EAD! Convert the organizational descriptions and histories to EAC! Throw Mitchell Whitelaw's series browser on top of it! The future's in your hands, people, and now the data is too.
We've talked about posting a torrent, but between the compression and the high bandwidth available from ibiblio, it doesn't seem to be quite as pressing a need. However, if you'd like, it could be arranged. More detail on the datasets, including detailed information about the tags and structure of the data within, can be found on Data.gov.
What exactly is archival access, and how does archival description make it possible? I feel that, in some form or another, I've been struggling with this question throughout my career. Recently, this blog post from The Top Shelf, the blog of the University of Texas at San Antonio Archives and Special Collections Department, came across my radar, wherein they write (emphasis in original):
UTSA Archives and Special Collections is among the growing number of archives to create an online presence for every one of its collections. ... We were able to utilize inventories generated by former and current collection assistants to create guides to the collection with folder-level and box-level descriptions. The project resulted in access to more than 130 collections and 2000 linear feet of materials.
What defines that accessibility? I certainly don't intend to be a negative Nancy about this - adding finding aids and other descriptive metadata about collections is obviously useful. But how has it necessarily increased access to the materials themselves?
Archivists in part provide descriptive, contextual, and structural metadata about collections, but ultimately, the products of our knowledge work are merely representations of a complex whole, teeming with relationships between facts, assumptions, and wild-arsed guesses. Thinking of arrangement and description as representation is not a new idea; cf. Yakel, “Archival Representation” (Archival Science 3, no. 1 (2003): 1-25) and Karen Gracy's class on Archival Representation at Pitt. But really, what is the end goal of that representation and how does that converge with what users and our professional peers (e.g. library and museum professionals) expect?
Our professional community is still a bit too sheltered, I'm afraid. I really disagree with Russell James' recent post "Why not 'records science'?" for a few reasons (just as I agree with him on a few points...for what it's worth, there is records science; it just happens to be called archivistics or archival science!). Archivists still have a ridiculous amount to learn about information science. It might not have seemed relevant during whatever graduate program you attended, but really, it's ultra-important!
I'm reading David C. Blair's Language and Representation in Information Retrieval (Elsevier Science & Technology, 1990) right now and am totally riveted by it. This and other works consistently prove to me that we have a lot of important lessons to learn from our counterparts in other fields, namely how metadata, indexing, and the like actually form sets of representations that unavoidably mediate access between a person with an information need and that information. This isn't new research, and it's not just a problem with electronic records. It's just incredibly unfortunate that there isn't a really good community within the profession to discuss this stuff.
You haven't heard the last of me on this yet, and I know this post has been quite rambling. Until next time...
I've been struggling with the fact that (American) archival practice seems to bind contextual description (i.e., description of records creators) to records description. Many of these thoughts have been stirring in my head as a result of my class at Rare Book School. If we take a relatively hardline approach, e.g. the kind suggested by Chris Hurley ("contextual data should be developed independently of the perceived uses to which it will be put", 1, see also 2), it makes total sense to separate them entirely. In fact, it starts making me mad that the <bioghist> tag exists at all in EAD. Contextual description requires that it be written from a standpoint relative to that of the creator it describes. I guess what I keep getting hung up on is whether there could be a relevant case that really merits this direct intellectual binding. I therefore appeal to you, humble readers, to provide me with your counsel. Do you think there are any such cases, and if so, why?
This week in Charlottesville has been a whirlwind exploration of standards and implementation strategies thus far during my class, Designing Archival Description Systems, at Rare Book School. My classmates and I have been under the esteemed tutelage of Daniel Pitti, who has served as the technical architect for both EAD and EAC. Interestingly, there's been a whole lot of talk about linking data, linked data, and Linked Data, date normalization, and print versus online presentation, among other things. In addition, a few things have floated past on my radar screen this week that have seemed particularly pertinent to the class.
The first of these was a post by Stefano Mazzocchi of Metaweb, "On Data Reconciliation Strategies and Their Impact on the Web of Data". In Stefano's post, he wrote about the problem of a priori data reconciliation vs. a posteriori; in other words, whether you iron out the kinks, apply properties like owl:sameAs, etc., on the way in or on the way out. Via FriendFeed, I noticed Ed Summers' remark about "not [being] sure [he buys] the argument that linking-open-data community isn't doing a-priori reconciliation ... an argument could be made that this is why it is taking off." I'm inclined to agree with Ed - to a certain extent, it's a gracious gesture to do a priori reconciliation. The cool thing about Stefano's post, though, is that it came through to me via Ed posting the FriendFeed discussion to his Delicious as well as being shared as a printout provided to us via Daniel today in class.
Additionally, Joe Clark's post on A List Apart, "Unwebbable," was barking up a far different tree. In it, Clark argues that certain kinds of documents are ill-suited to become web pages. Specifically, he makes the following claim:
Some documents cannot be published using HTML. In many cases, we shouldn’t even bother trying. In other cases, we have to radically change the appearance and structure of the document. Ideally, we’ll start using custom XML document types—which, finally and at long last, might actually work.
In speaking of scripts and screenplays, he writes of their printed documental form:
Typography is lousy; old typewriter fonts of yesteryear were errantly mapped onto today’s spindly Courier type. But as an example of document engineering, scripts are brilliant. There’s an entire science involved in text indention. Text is rarely, if ever, “centered”; everything lines up at a tab stop, a concept that CSS expunges from the collective memory. ... With careful alignments like these, it’s easy to scan down a screenplay page. And now people want to transfer the format—intact—to the web. It’s not going to work. ... The quest to adapt scripts to the web recalls other “category errors,” to use Martin Amis’s phrase. Electronic commerce, we eventually figured out, does not take the form of “shopping malls” you “walk” through. “Magazines” and “catalogues” do not have discrete pages you flip (complete with sound effects) and dog-ear. “Web sites” do not look like magazine layouts, complete with multicolumn text and callouts.
All this drives the argument further that it's time to rethink our instances of archival record and context description in terms of how they are to be used online. "Finding aid" is a term that covers a number of documental forms which don't work well on the web. While EAD was purposefully designed to mark up these extant (or "legacy," depending on your view) descriptive apparatuses, it wasn't entirely designed to exploit the hypertextual form heralded by even the early form of the World Wide Web. Following Clark, Web-based presentations of archival record and context description need not, and probably should not, look like the columnar container list bracketed by large spans of free text content. There are a couple of possibilities of what things could look like by using other people's examples:
Threepress Consulting/ORA's Bookworm, to publish finding aids as ePub files (if we are somewhat insistent in maintaining that documentary form)
But again, the question is what do we want this to look like to provide the best experience for the user. I still have yet to narrow down my suggestions with any certainty, but I stick with my opinion that the documentary form of description needs to change for the Web.
EDIT: Joe Clark also published the part of "Unwebbable" that was cut for the sake of brevity, which covers the following "categories of illustrations or graphics that would translate poorly to HTML semantics" (tx to Joe Clark for corrections in comments):
Org charts and flowcharts. Nested ordered lists are a proven failure here.
I'm staying with colleagues and good friends during my week-long stint in Charlottesville, Virginia for Rare Book School. If you're here - particularly if you're in my class (Daniel Pitti's Designing Archival Description Systems) - let me know. I'm looking forward to a heady week dealing with descriptive standards, knowledge representation, and as always, doing my best to sell the archives world on Linked Data. Notes and thoughts will follow, as always, on thesecretmirror.com.
This last Tuesday, I spoke at the Annual Meeting of the Archivists' Roundtable of Metropolitan New York, where I gave a talk on archives and the Semantic Web. The presentation went over very well, and colleagues from both the archives field and the semantic technology field were in attendance. I did my best to keep the presentation not overtly technical and cover just enough to get archivists to think about how things could be in the future. I also have to give a big hat tip to Dan Chudnov, whose recent keynote at the Texas Conference on Digital Libraries helped me organize my thoughts. Enjoy the slides, and as always, I relish any feedback from the rest of you.
So, it's time for another rant about my issues with EAD. This one is pretty straightforward and short, and comes down to the issue that I should essentially be able to mix and match metadata schemas. This is not a new idea, and I'm tired of the archives community treating it like it is one. Application profiles, as they are called, allow us to define a structured way to combine elements from different schemas, prevent addition of new and arbitrary elements, and tighten existing standards for particular use cases.
However, to a certain extent, the EAD community has accepted the concept of combining XML namespaces but on a very limited level. The creation of the EAD 2002 Schema allows EAD data to be embedded into other XML documents, such as METS. However, I can't do it the other way around; for example, I can't work a MODS or MARCXML record into a finding aid. Why not? As I said in my last dEAD Reckoning rant as well as during my talk at EAD@10, the use of encoding analog attributes is misguided, confusing, and just plain annoying. Let's move on – give me a schema I can stick things into and pull stuff out of!
A while back, I wrote a Bad MARC Rant, and I considered titling this a Bad Metadata Rant. However, as the kids say, I got mad beef with a little metadata standard called Encoded Archival Description. Accordingly, I figured I should begin a new series of posts discussing some of these issues that I have with something that is, for better or for worse, a technological fixture of our profession. This is in part prompted by thoughts that I've had as a result of participating in EAD@10 and attending the Something New for Something Old conference sponsored by the PACSCL Consortial Survey Initiative.
Anyhow, onto my first bone to pick with EAD. I'm incredibly unsatisfied with the controlled access heading tag <controlaccess/>. First of all, it can occur within itself, and because of this, I fear that there will be some sort of weird instance where I have to end up parsing a series of these tags 3 levels deep. Also, it can contain a <chronlist/>, which also seems pretty strange given that I've never seen any example of events being used as controlled access terms in this way. Sure, things that would have ordinarily gone into a 111 tag in MARC have been shoehorned into <corpname/>s, but that's another story for another time.
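For what it's worth, the nesting problem can at least be tamed with a bit of recursion. What follows is only a rough sketch in Python, using the standard library's xml.etree.ElementTree: the sample fragment, the heading values, and the function name flatten_controlaccess are all made up for illustration, and it assumes un-namespaced EAD (as with the DTD flavor). It treats every non-<controlaccess/> child as a heading, so a real finding aid with <head/> or <chronlist/> children would need more care.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <controlaccess> nested three levels deep.
# The heading values are invented for illustration only.
SAMPLE = """
<controlaccess>
  <subject>Labor disputes</subject>
  <controlaccess>
    <geogname>England</geogname>
    <controlaccess>
      <corpname>Example Bedding Company</corpname>
    </controlaccess>
  </controlaccess>
</controlaccess>
"""


def flatten_controlaccess(element, depth=1):
    """Yield (depth, tag, text) for each heading, however deeply nested."""
    for child in element:
        if child.tag == "controlaccess":
            # Recurse into the nested wrapper rather than guessing a fixed depth.
            yield from flatten_controlaccess(child, depth + 1)
        else:
            # Assumption: every other child is a heading element.
            yield depth, child.tag, (child.text or "").strip()


headings = list(flatten_controlaccess(ET.fromstring(SAMPLE)))
for depth, tag, text in headings:
    print(depth, tag, text)
```

It works, but having to write it at all is rather the point of my complaint.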
These are minor issues, though. My biggest complaint is that EAD can't really handle faceted or synthetic subject headings in a way that makes parsing them at all intuitive. In fact, the only support for any sort of "faceted classification" at all built into EAD is the relatively obscure <physfacet/> tag, a rather hamfisted way to pry apart a component's physical description into what the EAD Tag Library seems to suggest as being uncontrolled terms. I've found that access points that are exposed through faceted interfaces often work best when they're separated out into distinct facets; in other words, topics should be grouped separately from places, names of corporate or individual entities, and form/genre descriptions.
I can hear you grumbling already. "Wait, Mark! EAD has different tags for subjects, geographical names, and so forth!" That's all well and good, but the syntactic structure of LCSH has pretty much established itself as being synthetic (see Elaine Svenonius, "LCSH: Semantics, Syntax, and Specificity"). EAD doesn't allow us to easily break apart these synthetic headings like, say, Strikes and lockouts—Bedding industry—England. In this case, EAD would want to wrap this heading something like this:
<controlaccess>
<subject encodinganalog="650">
Strikes and lockouts--Bedding industry--England
</subject>
</controlaccess>
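To get anything out of a heading encoded like this, a consumer has to fall back on string splitting. Here is a minimal sketch in Python, assuming the subdivisions are separated by a double hyphen as in the example above and that the EAD is un-namespaced; the helper name split_heading is my own invention, not part of any EAD tooling.

```python
import xml.etree.ElementTree as ET

# The same fragment as in the example above.
SAMPLE = """<controlaccess>
<subject encodinganalog="650">
Strikes and lockouts--Bedding industry--England
</subject>
</controlaccess>"""


def split_heading(text, delimiter="--"):
    """Split a synthetic heading into its subdivisions, trimming whitespace."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]


subject = ET.fromstring(SAMPLE).find("subject")
parts = split_heading(subject.text)
print(parts)
```

The split yields three plain strings of equal standing, and that is all the markup gives us to work with.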
However, Strikes and lockouts and Bedding industry are both topical facets of the heading, and England is a geographical facet. Unfortunately, thanks to the structure of EAD, I can't do something like the following without feeling physically ill: