What Hath Von Neumann Wrought?

Skeptical musings of a reluctant cyborg

(Seven Plus Or Minus Two) Databases For Computational Journalists (Updated 2012-05-14)

Updated 2012-05-14 13:28 PDT – Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement has gone to print. Congratulations to Eric Redmond and Jim R. Wilson!

As I’ve noted before, I believe any discipline is defined by its practitioners and the tools they use, and computational journalism is no exception. I think computational journalism is a superset of data journalism, but certainly the terms “data journalism” and “data science” cover the bulk of the tools I have collected and used.

In any case, to do computational journalism, at least some data must be collected, stored, explored, analyzed, cleaned, managed and “governed.” In the past few years, the “traditional” tools for doing this, called relational database management systems (RDBMS), have been supplemented by a new class of tools broadly known as “NoSQL” databases. The name NoSQL is a play on SQL, the most widely used language for working with a traditional RDBMS.

The NoSQL field is rapidly evolving, but enough knowledge exists to fill several books. The best overview of databases for computational journalists I’ve found comes from a soon-to-be-released work from Eric Redmond and Jim R. Wilson called Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement.

I’ve been working through the book, which has been available for a few months in beta from the publisher, Pragmatic Programmers, LLC, in the course of collecting the tools for Data Journalism Developer Studio 2012LX and Computational Journalism Server. My goal is to have all of the databases available in both appliances, although at the moment only PostgreSQL, MongoDB, CouchDB and Redis are available directly from SUSE Studio.

Seven Databases in Seven Weeks covers, in order:

  • PostgreSQL, a traditional RDBMS,
  • Riak, a key-value database,
  • HBase, a columnar database,
  • MongoDB, a document-oriented database,
  • CouchDB, a document-oriented database,
  • Neo4j, a graph-oriented database, and
  • Redis, a key-value database / data structure server.
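
To make those categories a little more concrete, here is a minimal sketch of how one record might look under four of the data models. Everything in it is an assumption for illustration: the contribution record, the table and key names, and the use of Python's built-in sqlite3 module and plain dictionaries as stand-ins for the real databases. None of it comes from the book.

```python
import json
import sqlite3

# Hypothetical record: one campaign contribution (made-up data).
record = {"id": 1, "donor": "Acme Corp", "recipient": "Sen. Smith", "amount": 2500}

# Relational model: fixed columns, queried with SQL.
# SQLite stands in here for a full RDBMS such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contributions ("
    "id INTEGER PRIMARY KEY, donor TEXT, recipient TEXT, amount INTEGER)"
)
conn.execute(
    "INSERT INTO contributions VALUES (:id, :donor, :recipient, :amount)", record
)
relational_row = conn.execute(
    "SELECT donor, amount FROM contributions WHERE recipient = ?", ("Sen. Smith",)
).fetchone()

# Document model: the whole record travels as one self-describing JSON document.
document = json.dumps(record, sort_keys=True)

# Key-value model: an opaque value looked up by a single key.
# A plain dict stands in for the store.
key_value_store = {"contribution:1": document}

# Graph model: nodes connected by labeled relationships.
# A list of (source, relationship, target) tuples stands in for the graph.
edges = [("Acme Corp", "CONTRIBUTED_TO", "Sen. Smith")]

print(relational_row)
print(key_value_store["contribution:1"])
print(edges[0])
```

The point is only the shape of the data in each model; the real systems add their own query languages, indexing and replication on top.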

All of these databases are open source, and they’re all supported by either a corporate entity, a non-profit foundation, or some combination of the two. The title really should have been “Seven Databases in Seven Weekends”; each database is covered in a three-day hands-on session and could easily be done as a weekend project. The book is hands-on: you’ll build things with these databases, including a Node.js application that combines Redis, CouchDB and Neo4j to provide a “band information service.”

Appendix A contains a pair of tables that give an overview of the distinguishing characteristics of the seven databases. As the authors put it, “Although the tables are not a replacement for a true understanding, they should provide you with an at-a-glance sense of what each database is capable of, where it falls short, and how it fits into the modern database landscape.”

I believe all of these databases have a place in modern computational journalism, as do the other two well-known open source RDBMS tools, MySQL and SQLite. In particular, for spatial / mapping projects, PostgreSQL, SQLite, MongoDB and CouchDB have robust geographic information systems capabilities either built in or available as add-ons.

Riak, HBase, MongoDB and CouchDB all support “big data” applications implemented via MapReduce. MongoDB and CouchDB both store their documents as JavaScript Object Notation (JSON) objects, which is the “native” format for Twitter data. Neo4j, as a graph database, is perfect for storing data about relationships, such as the interconnections between corporate executives and legislators. And because of its speed, Redis can serve as a high-speed pipeline between other components in almost any application architecture.
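
For readers who haven’t met MapReduce, the idea can be sketched in a few lines of plain Python. This is a toy, single-process illustration, not how any of these databases implements it: the tweet-like JSON documents and the function names map_doc, reduce_counts and map_reduce are made up for the example.

```python
import json
from collections import defaultdict

# Hypothetical tweet-like JSON documents (made-up data).
raw_docs = [
    '{"user": "alice", "hashtags": ["budget", "election"]}',
    '{"user": "bob", "hashtags": ["election"]}',
    '{"user": "carol", "hashtags": []}',
]


def map_doc(doc):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one document."""
    for tag in doc["hashtags"]:
        yield tag, 1


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key into a total."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Run the mapper over every document, group emitted pairs by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


hashtag_counts = map_reduce(
    (json.loads(d) for d in raw_docs), map_doc, reduce_counts
)
print(hashtag_counts)
```

In a real deployment the map and reduce steps would be spread across many machines, which is where the “big data” part comes in; the structure of the two functions stays much the same.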

I think NoSQL databases will be the core of computational journalism for the next few years. The RDBMS isn’t going away, of course, but if you limit yourself to “SQL thinking” or even “object-relational models” and “model-view-controller” architectures, there will be applications you can’t build. This book will get you up to speed as fast as you’re willing to go.
