NARA's Electronic Archive: A Sanity Check

By Andromeda Yelton

The cost to build digital infrastructure for the national archives could hit $1.4 billion.

I've been fascinated by this story. $1.4 billion? What on earth are they archiving, and on what scale? How does a project originally contracted at $317 million in 2005 end up projected to be finished at $1.4 billion in 2017? [pdf; see page 21]

It sounds outrageous to me, but then again, I don't know what comparable projects cost. So, let's bracket this.

One company that operates in this space, doing data ingest, storage, indexing, taxonomizing, etc. -- including for some libraries -- is Endeca (disclosure: my husband works there). I don't know the size of their individual contracts, but their yearly revenue is in the $100 million range and they have at least 88 customers. Also in the space is Autonomy, with revenues of $870.4 million in 2010 and over 20,000 customers.

Of course these companies are likely not selling each of their customers a new system every year, but neither, with those numbers, are they selling any of them a system for $317 million. This suggests that their contracts are running at least one order of magnitude below NARA's contracted cost -- maybe even two or three orders of magnitude below NARA's projected final cost.
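For anyone who wants to check that bracketing, here is a rough sketch of the arithmetic in Python. It assumes that average yearly revenue per customer is a usable stand-in for contract size, which it only loosely is: I don't know the real contract sizes, and a single contract can span several years. The function names are my own, and the inputs are just the figures quoted above.

```python
import math

# Figures quoted in this post; everything else here is illustrative.
NARA_CONTRACTED = 317e6   # original 2005 contract, in dollars
NARA_PROJECTED = 1.4e9    # projected cost at completion, in dollars

# (yearly revenue in dollars, customer count). Endeca's figures are approximate.
VENDORS = {
    "Endeca": (100e6, 88),
    "Autonomy": (870.4e6, 20_000),
}


def revenue_per_customer(revenue, customers):
    """Average yearly revenue per customer: a crude proxy for contract size."""
    return revenue / customers


def orders_of_magnitude(larger, smaller):
    """How many powers of ten separate two dollar amounts."""
    return math.log10(larger / smaller)


def bracket(vendors=VENDORS):
    """Map each vendor to (gap vs. contracted cost, gap vs. projected cost)."""
    result = {}
    for name, (revenue, customers) in vendors.items():
        typical = revenue_per_customer(revenue, customers)
        result[name] = (
            orders_of_magnitude(NARA_CONTRACTED, typical),
            orders_of_magnitude(NARA_PROJECTED, typical),
        )
    return result


if __name__ == "__main__":
    for name, (contracted_gap, projected_gap) in bracket().items():
        print(f"{name}: {contracted_gap:.1f} orders below contract, "
              f"{projected_gap:.1f} below projection")
```

On these inputs the gap comes out well above one order of magnitude for both vendors, so "at least one order of magnitude" is the conservative reading. Since Endeca has "at least" 88 customers, its per-customer average is, if anything, an overestimate.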

OK, but what if NARA's needs are particularly complex? To be fair, their data set is really big: "billions of pages of e-mails, memos and electronic files created by every branch of government" along with marquee items like the Constitution. And that kind of data might involve complex legal and privacy issues with classified documents. Then again, Endeca's being used to search defense intelligence databases, which include rapidly updating sources like radio signals, news feeds, and State Department emails. Autonomy does e-discovery, which means they deal with multiple petabytes and billions of records.

Both of these applications have complex legal and privacy issues, too.

And that's not even getting into projects like GenBank, covering all publicly accessible DNA sequences, with well over a hundred million records covering a few hundred billion DNA bases, and growing exponentially. Or the Large Hadron Collider, which will be producing 15 petabytes of data a year, with the processing and storage distributed in dozens of countries over hundreds of thousands of computers, in a system CERN had to design because you can't get that kind of thing off the shelf. Or, you know, Google.

All of which is to say: I don't buy that NARA has unusual problems of scope, here. And I don't buy that it takes a billion dollars, or even hundreds of millions, to build their system.

So...what's going on here? At this point I'm stepping into pure speculation, but I'll guess these two factors are involved:

1) Bureaucratic checkboxes that limited the field of possible contractors to a tiny handful. NARA selected Lockheed after a one-year, two-contestant design competition. They say it like either of those numbers is a good thing.

2) Failure to look outside the usual space. Someone on Twitter mentioned a few weeks ago that the same furniture cost three times more in a library catalog than in an office supply catalog. And when I was learning about ILSes, "did you mean...?" was spoken of as an exciting, sophisticated feature -- when it is literally a homework assignment for freshman computer science majors.
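To make the "homework assignment" point concrete, here is a minimal sketch of a "did you mean...?" suggester in Python. It is not how any particular ILS or search vendor does it. It is just the textbook approach, Levenshtein edit distance against a list of known terms. The function names, the vocabulary, and the cutoff of two edits are all my own assumptions.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def did_you_mean(query: str, vocabulary: Sequence[str],
                 max_distance: int = 2) -> Optional[str]:
    """Return the closest known term within max_distance edits, else None.

    An exact match (distance 0) also returns None: nothing to suggest.
    """
    if not vocabulary:
        return None
    query = query.lower()
    best = min(vocabulary, key=lambda term: edit_distance(query, term.lower()))
    distance = edit_distance(query, best.lower())
    return best if 0 < distance <= max_distance else None


if __name__ == "__main__":
    terms = ["library", "archive", "catalog"]  # hypothetical index terms
    print(did_you_mean("libary", terms))
```

A production system would rank candidates by query frequency and avoid scanning the whole vocabulary, but the core idea fits in a couple dozen lines.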

Maybe Lockheed's offer was really very competitive compared to those of all the other people NARA is used to doing business with! (Er, all one of them, apparently.) Maybe it was even competitive compared to all the companies which brand themselves in the library and archive space. But maybe, just as we need to look beyond "library furniture" to just furniture, we need to look beyond "library software" to just software. Think about the state of the art full-stop, not the most impressive technology among a handful of usual suspects.

My ILS class left me with the persistent impression that librarians do not generate very high-quality demand for software. Of course there are individual exceptions, but too many of us do not know where the cutting edge lies, do not know how to sanity-check the options handed to us against the options possible, and thus do not know what we can demand of our vendors. And so we have NARA spending what appears to me to be a few hundred million extra dollars on a system that's taking years too long to be deployed and may never be finished. Imagine what those few hundred million could have done for your library.

How can we, as a field, know more about where the cutting edge lies? How can we demand better?