When we started the Word Magic project, there were some things we wanted to learn, and some things we wanted to confirm. This post is about those things.
First, we confirmed that the Big Data Reference Model holds true. We tried to operationalize the analysis before we had really finalized the kind of information we wanted to produce. (Specifically, how the numbers behind the "Word Usage Through the Decades" module would work.) In terms of the Reference Model, we were skipping the first two learning loops. Why would we do that? The same reason everyone else does. We thought we knew what we wanted.
Our initial mockups looked great, but the real data foiled them. The algorithm was to divide a word's occurrences in a decade by the occurrences of all words in that decade. That way, we expected to normalize out the general increase in written language since the 1600s. But because we put both a phrase and its individual words on the same graph, we scaled the bubbles to the largest occurrence among them. That sounds fine, but in practice the most common word in the phrase dominated the graph and made the other bubbles all but invisible.
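To make the problem concrete, here is a minimal sketch of that calculation in plain Ruby. The counts are invented and none of these names come from the real code; the scaling step just stands in for what the visualization layer was doing.

```ruby
# Illustrative only: per-decade relative frequency, then bubble scaling.
total_words_1920s = 1_000_000   # all words printed in the decade (invented number)
counts_1920s      = { "new" => 12_000, "york" => 900, "new york" => 850 }

# Step 1: the intended normalization -- a word's count divided by all words
# printed in that decade.
frequencies = counts_1920s.transform_values { |c| c.to_f / total_words_1920s }

# Step 2: the mistake -- scaling every bubble on the graph against the largest
# frequency, so the rarer terms shrink to a few percent of the maximum radius.
max_freq     = frequencies.values.max
bubble_sizes = frequencies.transform_values { |f| f / max_freq }

p bubble_sizes
# => {"new"=>1.0, "york"=>0.075, "new york"=>0.07083333333333333}
```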
We eventually tried two visualizations and three algorithms before we found one suitable for launch. Other calculations might have worked out well, but we couldn't try them. Why not? Because we jumped straight to operationalizing. Every time we changed the calculation in map/reduce, we had to spend multiple days reprocessing 400 years of humanity's entire printed output. We should have followed our own advice and iterated on smaller samples in a quicker toolset.
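In hindsight, even something as crude as the sketch below would have let us audition a new calculation in minutes instead of days. The file layout, sampling rate, and the calculation itself are hypothetical stand-ins for a quick local experiment.

```ruby
# Hypothetical local prototype: try a candidate calculation over a small random
# sample of per-decade counts instead of the full corpus.
# Assumed input format: tab-separated "word<TAB>decade<TAB>count" lines.
SAMPLE_RATE = 0.001

decade_totals = Hash.new(0)
word_counts   = Hash.new { |h, k| h[k] = Hash.new(0) }

File.foreach("ngram_counts_sample.tsv") do |line|
  next unless rand < SAMPLE_RATE
  word, decade, count = line.chomp.split("\t")
  decade_totals[decade]     += count.to_i
  word_counts[decade][word] += count.to_i
end

# The candidate calculation under test. Swap this block out, rerun in seconds,
# and only promote the winner to the real map/reduce jobs.
word_counts.each do |decade, counts|
  counts.each do |word, count|
    puts format("%s\t%s\t%.8f", decade, word, count.to_f / decade_totals[decade])
  end
end
```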
The second thing we confirmed is that JRuby on Hadoop is a viable way to write map/reduce jobs. It's a thin wrapper on top of the Java API, but still an improvement.
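For a sense of what "thin wrapper" means, here is roughly what driving the stock Hadoop classes straight from JRuby looks like. The class names are the standard org.apache.hadoop.mapreduce ones; the job name and paths are invented, and no real mapper or reducer is wired in, so treat it as a skeleton rather than a working job.

```ruby
# Illustrative only: the raw Hadoop Java API driven from JRuby.
# Assumes the Hadoop jars are already on the classpath.
require 'java'

java_import 'org.apache.hadoop.conf.Configuration'
java_import 'org.apache.hadoop.fs.Path'
java_import 'org.apache.hadoop.mapreduce.Job'
java_import 'org.apache.hadoop.mapreduce.lib.input.FileInputFormat'
java_import 'org.apache.hadoop.mapreduce.lib.output.FileOutputFormat'

conf = Configuration.new
job  = Job.new(conf, 'decade-word-counts')   # made-up job name

# The mapper and reducer classes would be set here; the wrapper's improvement
# is letting you write them as Ruby code instead of compiled Java classes.
FileInputFormat.add_input_path(job, Path.new('/ngrams/input'))      # made-up paths
FileOutputFormat.set_output_path(job, Path.new('/ngrams/output'))

job.wait_for_completion(true)
```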
Stuart Sierra's clojure-hadoop library also worked very well. Because this was a "20% project" for most of its life, we ended up with pieces written in JRuby, Clojure, and Java. The polyglot nature didn't really present a problem, because each piece was its own data flow.
On the deployment front, we used Pallet. It worked very well for deploying whole topologies of machines. We built several Hadoop + HBase + MapReduce clusters with it.
Finally, we confirmed that the Hadoop and HBase APIs are cumbersome, poorly documented, and frequently misleading. The processing model is expensive, and the query model is user-hostile. (We aren't the only ones to notice this.) Moving from our daily world of sharp tools like Ruby on Rails, Clojure, ClojureScript, and Datomic into the Hadoop world definitely felt like a step back in time.
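To make "user-hostile" concrete, here is roughly what reading a single cell looked like with the HBase client API of that era, again from JRuby. The table, column family, and qualifier names are invented; the byte-array ceremony is the point.

```ruby
# Illustrative only: one cell read through the HBase 0.9x-era client API.
require 'java'

java_import 'org.apache.hadoop.hbase.HBaseConfiguration'
java_import 'org.apache.hadoop.hbase.client.HTable'
java_import 'org.apache.hadoop.hbase.client.Get'
java_import 'org.apache.hadoop.hbase.util.Bytes'

conf  = HBaseConfiguration.create
table = HTable.new(conf, 'word_counts')   # made-up table name

# Everything -- row keys, families, qualifiers, values -- is a byte array.
get = Get.new(Bytes.to_bytes('magic'))
get.add_column(Bytes.to_bytes('decades'), Bytes.to_bytes('1920'))

result = table.get(get)
count  = Bytes.to_long(result.get_value(Bytes.to_bytes('decades'),
                                        Bytes.to_bytes('1920')))
puts count
```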
Now that we've learned what we needed to learn from Word Magic, we have shut it down. Hadoop is great for scaling up with lots of servers, but it doesn't scale down well at all.