NoSQL Impressions

Lately I've been playing with what seems like a multitude of NoSQL databases. Three in particular (MongoDB, CouchDB, and Riak) have caught my eye, though many more have interesting concepts. Ease of use and scalability typically impress me, and these three stand out for their ease of use, ingenuity, and Python/Erlang compatibility.

First, the things they have in common:

  • All are document-oriented, schema-less databases
  • All allow some form of map/reduce built into the API
  • All have a REST interface
  • All rock

1. CouchDB

I have been following CouchDB for the better part of a year and am very impressed with its abilities and progress. I began writing an authentication server in it before rewriting it in Erlang and Mnesia. CouchDB allows for replication between multiple masters and has a conflict-flagging system, though it doesn't have a built-in sharding mechanism as of yet.

The thing that interests me most about CouchDB is its method of indexing map/reduce queries. Queries against the database are either a map or a map/reduce query. Ad-hoc queries are possible, but the real power comes from the way it indexes queries incrementally as data changes. Queries are stored as JavaScript (or now, I believe, Python or Erlang) functions that are run against documents as they are stored or deleted. The results are kept in a B-tree for fast retrieval. The keys can then be filtered further by a set of GET arguments on the URL, which allows for compound queries that can give interesting results.
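
To make the indexing idea concrete, here is a minimal Python sketch of a view that keeps its map output sorted by key and answers key-range queries. It is only an illustration, not CouchDB's implementation: `ToyView`, `logins_by_user`, and the sample documents are all made up, a sorted list stands in for the B-tree, and the `startkey`/`endkey` arguments are named after CouchDB's URL parameters. The toy also updates eagerly on every save, which is a simplification.

```python
import bisect


class ToyView:
    """Toy view index: rows of (key, doc_id, value) kept sorted by key."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = []    # sorted list standing in for the B-tree
        self.by_doc = {}  # doc_id -> rows that document emitted

    def save(self, doc_id, doc):
        # Drop rows from any earlier revision, then index the new one.
        self.delete(doc_id)
        emitted = [(key, doc_id, value) for key, value in self.map_fn(doc)]
        for row in emitted:
            bisect.insort(self.rows, row)
        self.by_doc[doc_id] = emitted

    def delete(self, doc_id):
        for row in self.by_doc.pop(doc_id, []):
            self.rows.remove(row)

    def query(self, startkey, endkey):
        # Inclusive key range, like ?startkey=...&endkey=... on a view URL.
        keys = [row[0] for row in self.rows]
        lo = bisect.bisect_left(keys, startkey)
        hi = bisect.bisect_right(keys, endkey)
        return self.rows[lo:hi]

    def reduce(self, startkey, endkey):
        # A simple "sum" reduce over the selected rows.
        return sum(value for _, _, value in self.query(startkey, endkey))


def logins_by_user(doc):
    # Map function: emit one (key, value) pair per login document.
    if doc.get("type") == "login":
        yield doc["user"], 1


view = ToyView(logins_by_user)
view.save("a", {"type": "login", "user": "alice"})
view.save("b", {"type": "login", "user": "bob"})
view.save("c", {"type": "login", "user": "alice"})
view.save("d", {"type": "note", "text": "not indexed"})
view.delete("b")

print(view.query("alice", "alice"))
print(view.reduce("alice", "alice"))
```

Only the document being saved or deleted is run through the map function, which is why reads against the index stay cheap even as the data grows.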

When I was using CouchDB I ran into speed issues due to the indexing, but I believe it has been optimized since then and will continue to improve as the product matures. For specific types of applications, CouchDB is a very interesting prospect. I am very interested to see where this product goes.

2. Riak

Riak is a distributed (document-oriented, schema-less) database. It works on the concept of distributed hash keys and a configurable amount of shared, master-less replication. This allows for a high degree of fault tolerance and true horizontal scalability. Adding or removing servers redistributes a portion of the virtual nodes among the servers to share the load. Adding fault tolerance is as easy as changing the number of replicas for a particular bucket. Adding disk space or processing ability is as easy as bringing another server online.
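
The distributed hash key idea can be sketched with a generic consistent-hashing ring in Python. This is not Riak's actual ring code: the `Ring` class, the server names, and the choice of 64 points per server are assumptions made for illustration. What it shows is that adding a server moves only a fraction of the keys, and that replicas are simply the next distinct servers around the ring.

```python
import bisect
import hashlib


def ring_position(name):
    # Hash a string to a large integer position on the ring.
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes  # points placed on the ring per server
        self.points = []      # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            point = (ring_position(f"{server}#{i}"), server)
            bisect.insort(self.points, point)

    def remove(self, server):
        self.points = [p for p in self.points if p[1] != server]

    def owners(self, key, n=3):
        # Walk clockwise from the key, collecting n distinct servers.
        start = bisect.bisect(self.points, (ring_position(key),))
        found = []
        for offset in range(len(self.points)):
            server = self.points[(start + offset) % len(self.points)][1]
            if server not in found:
                found.append(server)
                if len(found) == n:
                    break
        return found


ring = Ring(["a", "b", "c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.owners(k, n=1)[0] for k in keys}

ring.add("d")  # bring a fourth server online
after = {k: ring.owners(k, n=1)[0] for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed primary owner")
print("replicas for user:1:", ring.owners("user:1", n=3))
```

Every key that moves lands on the new server, so the existing servers never shuffle data among themselves when capacity is added.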

I haven't yet used Riak's HTTP interface (only its Erlang interface), but I am impressed with what I've read. Basho is composed of some very intelligent people, and it shows in their product. Rather than forcing eventual consistency, Riak gives you the choice of making an object consistent or receiving faster results with less consistency by specifying N, R, and W values (a trade-off framed by the CAP theorem) on each write and read. Data version collisions are detected with vector clocks and must be resolved by the application.
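
Both ideas fit in a few lines of Python. The sketch below is a simplified model rather than Riak's API: the function names are hypothetical, and a vector clock is represented as a plain dict mapping an actor name to a counter. The quorum check encodes the usual rule that a read overlaps the latest write when R + W > N.

```python
def read_overlaps_write(n, r, w):
    # With N replicas, a read of R copies is guaranteed to include at
    # least one copy from the latest successful write of W copies
    # whenever R + W > N.
    return r + w > n


def descends(a, b):
    # True if clock `a` has seen every event that clock `b` has seen.
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    a_covers_b = descends(a, b)
    b_covers_a = descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a is newer"
    if b_covers_a:
        return "b is newer"
    # Neither version has seen the other: the application must merge.
    return "conflict"


print(read_overlaps_write(n=3, r=2, w=2))  # consistent, slower
print(read_overlaps_write(n=3, r=1, w=1))  # faster, may read stale data

print(compare({"x": 2, "y": 1}, {"x": 1, "y": 1}))
print(compare({"x": 2}, {"x": 1, "y": 1}))
```

The second `compare` call is the interesting case: two clients updated the same object independently, so neither version supersedes the other and both are handed back for the application to reconcile.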

Its map/reduce is distributed across the nodes, allowing for scalable response times on large data sets. There are some exciting features coming in the form of indexing and search. I have to say that this database continues to blow me away daily, even if some of it goes over my head. I can't wait to have a project that requires this solution.

3. MongoDB

I have begun working with MongoDB just recently with my cache_plot project/toy. Despite the fact that it has multiple database drivers, advanced map/reduce, and many other features, I can't get over how Pythonic it feels (when used with the Python library). They have created a filter language, built from dictionaries with special options, that truly fits in with the Python language.
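
To give a feel for that filter language without needing a running server, here is a toy Python matcher for a small subset of it. The `$gt`, `$gte`, `$lt`, `$lte`, `$ne`, and `$in` operators are real MongoDB query operators, but the `matches` helper and the sample documents are hypothetical, and treating a missing field as "no match" is a simplification. With the Python driver, the same `query` dict would be passed to `collection.find(query)`.

```python
import operator

# Subset of MongoDB's comparison operators, mapped to Python functions.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$ne": operator.ne,
    "$in": lambda value, choices: value in choices,
}


def matches(doc, query):
    """Return True if `doc` satisfies every condition in `query`."""
    for field, condition in query.items():
        if field not in doc:
            return False  # simplification: missing fields never match
        value = doc[field]
        if isinstance(condition, dict):
            # {"field": {"$op": arg, ...}}: all operators must hold.
            for op, arg in condition.items():
                if not OPERATORS[op](value, arg):
                    return False
        elif value != condition:
            # {"field": literal}: plain equality.
            return False
    return True


# Made-up documents for illustration only.
docs = [
    {"host": "web1", "metric": "hits", "value": 120},
    {"host": "web2", "metric": "hits", "value": 45},
    {"host": "db1", "metric": "load", "value": 3},
]

query = {"metric": "hits", "value": {"$gt": 50, "$lte": 200}}
print([d["host"] for d in docs if matches(d, query)])

query = {"host": {"$in": ["web2", "db1"]}}
print([d["host"] for d in docs if matches(d, query)])
```

The point is that a query is just a nested dictionary, so it can be built, merged, and passed around like any other Python data.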

Oh, it has features too: in-place updates (read: atomic operations that are perfect for things like stats), GridFS for storing large binary files, intuitive indexing, sorting, map/reduce, grouping, master-child replication, and sharding in an alpha stage. It is extremely fast too... on a previous design of cache_plot, it could easily scan 80,000 rows of un-indexed numbers, group them, and return them in well under half a second. Mongo is very exciting to me and may very well become the base for most of my web projects, especially if they fit a document-oriented model (like a blog... don't worry, I'm not going to re-write finderweb again).

Conclusion

There are many other exciting NoSQL databases out there, such as Hadoop, Scalaris, Tokyo Cabinet/Tyrant, Redis, and the tried and true Mnesia. Each has its own abilities and feature sets that make it ideal in different circumstances. I may blog about some of the others as I study and use them, but I am very excited about the entire list of options as an alternative to relational databases.