Library of Congress faces epic big data challenge
Here's an epic big data challenge for you: The Library of Congress has collected 170 billion Twitter messages totaling 133.2 terabytes of data, and now it just needs a way to index it all. One of the main difficulties for the library is finding the right tools to manage that volume of information, reports Brandon Butler at Network World.
Officials at the world's largest library have discovered that the technology to leverage a data set of this volume "is not nearly as advanced as the technology for creating and distributing that data." The complexity and the resources demanded by the task have hampered the availability of affordable solutions, they've found.
The Library of Congress is amassing new tweets by the hour, and the rate of growth is increasing. Scholars are eager to dig into the trove, but a search of even one-eighth of the archive can take as long as a day. Unlike other massive data repositories managed by the agency, the Twitter archives are intended to be easily accessible.
Parallel and distributed computing options are too expensive because of the infrastructure that would be required. Cloud-based options for storing and indexing are available, but it is unclear whether they would be cost-effective in the long run.
- see Brandon Butler's article at Network World