Development and programming answers: NoSQL

CouchDB vs MongoDB

After posting about Scott Motte’s comparison of MongoDB and CouchDB, I thought there should be some more informative sources out there, so I’ve started to dig.

The first one I came upon is an article about the Raindrop requirements, the issues faced while tackling them with CouchDB, and the pros and cons of possibly replacing CouchDB with MongoDB:

[Pros]

Uses update-in-place, so the file system impact and the need for compaction are lower, and storing our schemas in one document is likely to work better.

Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.

Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.

Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the “joins” effectively in our API app code.

[Cons]

No easy master-master replication. However, for me personally, this is not so important. […] So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.

Ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language; right now we are on the path to inventing our own with the server API, with a “query language” made up of other API calls.
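
The “queries are done at runtime” point in the quoted list can be illustrated with a small Python sketch. It uses plain in-memory dictionaries rather than either database's API, and every name in it is hypothetical. A view built ahead of time (roughly the CouchDB style) answers only the question it was built for, while a runtime query (roughly the MongoDB style) accepts any predicate but scans every document unless an index exists.

```python
from collections import defaultdict

# Hypothetical message documents, loosely modelled on a mail-like data set.
messages = [
    {"id": 1, "sender": "ann", "unread": True},
    {"id": 2, "sender": "bob", "unread": False},
    {"id": 3, "sender": "ann", "unread": False},
]

def build_view(docs, key_fn):
    """Precomputed view: built ahead of time, keyed by one fixed function."""
    view = defaultdict(list)
    for doc in docs:
        view[key_fn(doc)].append(doc["id"])
    return view

def runtime_query(docs, predicate):
    """Ad-hoc query: any predicate, evaluated by scanning every document."""
    return [doc["id"] for doc in docs if predicate(doc)]

# The view is cheap to read, but only answers "which messages per sender?".
by_sender = build_view(messages, lambda d: d["sender"])
from_ann = by_sender["ann"]

# The runtime query can combine conditions nobody planned for in advance.
unread_from_ann = runtime_query(
    messages, lambda d: d["sender"] == "ann" and d["unread"]
)
print(from_ann, unread_from_ann)
```

The trade-off the quote worries about shows up in `runtime_query`: its cost grows with the number of documents, which is why some indexes are still worth setting up ahead of time.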

Anyway, while some of the points above are generic, you should consider them from the perspective of the Raindrop requirements, about which you can read more here.

Another article comparing MongoDB and CouchDB is hosted in the MongoDB docs. I find it well balanced, and you should read it all, as it covers a lot of different aspects: horizontal scalability, query expressions, atomicity, durability, MapReduce support, JavaScript, performance, etc.

I’d also mention this benchmark comparing the performance of MongoDB, CouchDB, and Tokyo Cabinet/Tyrant, which uses MySQL results as a reference. (Note: the author of the benchmark categorizes Tokyo Cabinet as a document database, while it is actually a key-value store.)

If you have other resources that you think would be worth including, do not hesitate to send them over.

How to: Translate SQL to MongoDB MapReduce

I keep hearing people complain that MapReduce is not as easy as SQL. But there are others saying SQL is not easy to grok. I’ll keep myself away from this possible flame war and just point you to this SQL to MongoDB translation PDF put together by Rick Osborne, and also to his post providing some more details.

As regards the SQL and MapReduce comparison, here’s what Rick has to say:

It seems kind of silly to go through all this, right? SQL does all of this, but with much less complexity. However, this approach has some huge advantages over SQL:

Programmers who don’t know SQL or relational theory may find it easier to understand and get using quickly. (Newbies especially, such as my students.)

The map and reduce functions can be heavily parallelized on commodity hardware.

It’s really that second one that is the key.

I’d also like to share something that I’ve learned lately: parallel SQL execution is supported in different forms by some RDBMSs. So at the end of the day, it will probably become just a matter of what better fits the problem and your team.
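
To make the comparison concrete, here is a minimal Python sketch (not taken from Rick's PDF) of how a SQL aggregate such as `SELECT dept, SUM(salary) FROM employees GROUP BY dept` can be expressed as a map step and a reduce step. The table, field names, and helper functions are all hypothetical, and the in-memory grouping stands in for the distribution work that a real engine would do across machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample documents, standing in for rows of an "employees" table.
employees = [
    {"name": "Ann", "dept": "eng", "salary": 100},
    {"name": "Bob", "dept": "eng", "salary": 80},
    {"name": "Cid", "dept": "ops", "salary": 70},
]

def map_fn(doc):
    """Emit (key, value) pairs; the key plays the role of GROUP BY."""
    yield doc["dept"], doc["salary"]

def reduce_fn(a, b):
    """Combine two values for the same key; plays the role of SUM()."""
    return a + b

def map_reduce(docs, map_fn, reduce_fn):
    # Map phase: each document is handled independently of the others,
    # which is what allows this step to be spread over many machines.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce phase: each key is folded independently of the other keys.
    return {key: reduce(reduce_fn, values) for key, values in grouped.items()}

totals = map_reduce(employees, map_fn, reduce_fn)
print(totals)  # {'eng': 180, 'ops': 70}
```

Neither `map_fn` nor `reduce_fn` shares state between calls, which is the property behind the parallelization argument in the quote.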

What is NoSQL?

A list of (possible) definitions for NoSQL (also referred to as NoSQL databases or NoSQL stores):

NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage. Wikipedia.

Non-relational next generation operational datastores and databases - Dwight Merriman, CEO 10gen

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. - nosql-databases.org

NoSQL is a term coined by Carlo Strozzi and repurposed by Eric Evans to refer to “some” storage systems. The term should be read as Not Only SQL, not as No to SQL or Never SQL.

NoSQL is about choice

NoSQL is not about any one feature of any of the projects. NoSQL is not about scaling, NoSQL is not about performance, NoSQL is not about hating SQL, NoSQL is not about ease of use, NoSQL is not about sharding, NoSQL is not about throughput, NoSQL is not about speed, NoSQL is not about dropping ACID, NoSQL is not about Eventual Consistency, NoSQL is not about CAP, NoSQL is not about open standards, NoSQL is not about Open Source and NoSQL is most likely not about whatever else you want NoSQL to be about. NoSQL is about choice. - Jan Lehnardt, CouchDB

Why NoSQL?

Handling massive amounts of data
Exponential growth of newly created digital content
More value around data
Build value around data by connecting the dots
Connectedness
Information format
Data usage scenarios (plus open data)

Fundamental papers

Google BigTable
Amazon Dynamo
BASE: An ACID Alternative
Brewer’s CAP theorem (pdf). Julian Browne’s article can be helpful too.

NoSQL databases

Columnar Stores or Wide Column Stores

BigTable
Cassandra: a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
HBase
Hypertable

Document stores or Document databases

Colayer
CouchDB
FleetDB
Jackrabbit
Lotus Notes
MongoDB
OrientDB
Raven DB
ThruDB
Terrastore

Graph databases

AllegroGraph
Bigdata
Core Data
DEX: a high-performance graph database written in Java and C++. Its main characteristic is its storage and retrieval performance for large graphs, on the order of billions of nodes, edges and attributes, implemented with specialized structures.
Filament
FlockDB
HyperGraphDB
InfiniteGraph
InfoGrid
Neo4j
OpenLink Virtuoso
Sones
VertexDB
Trinity: a graph database and computation platform over distributed memory cloud. As a database, it provides features such as highly concurrent query processing, transaction, consistency control. As a computation platform, it provides synchronous and asynchronous batch-mode computations on large scale graphs.

Key-Value Stores

Amazon SimpleDB
Azure Table Storage
Berkeley DB
Chordless
Dynomite
GenieDB: GenieDB is designed to be a pragmatic solution to a widespread class of data storage problems, with a high-performance native API alongside compatibility with MySQL.
GT.M / M.DB
HamsterDB
Hibari: Hibari is a production-ready, distributed, key-value, big data store. Hibari uses chain replication for strong consistency, high-availability, and durability. Hibari has excellent performance especially for read and large value operations.
KAI
KaTree
Kumofs
LightCloud
Membase
Memcachedb
Mnesia
NorthScale
Orient Key/Value Server
Pincaster
PNUTS/Sherpa
Project Voldemort: LinkedIn’s open-source implementation of the Amazon Dynamo key-value store
Redis
Riak: Dynamo-inspired key/value store that scales predictably and easily.
Scalaris
ScalienDB / Scalien Keyspace: a distributed, consistent key-value store
Tokyo Cabinet

Multi-value databases

OpenQM
Rocket U2

Object databases

Db4o
GemStone/S
KiokuDB
InterSystems Caché
Neo
Objectivity/DB
Perst
Progress
Versant
ZODB

XML databases

Berkeley DB XML
EMC Documentum xDB
eXist
MarkLogic Server
Sausalito: Sausalito powers XQuery in the Cloud
Sedna
Tamino
Xindice

Unclassified

CloudKit
FluidDB
Moneta
Persevere
