In case you've ever wondered, you can indeed run a full Hadoop cluster on a stack of Raspberry Pis!
It's certainly not fast by any means; MapReduce jobs here are more like tiny data crunching than big data, but at least you can amass a surprising amount of HDFS storage. Why do this? To see if it can be done!
I had a stack of Pis that I wasn't quite sure what to do with yet and decided to build a mini Pi cluster. Other people have done this, but I wanted to see if I could strip things down and make it simpler. Here I've mounted 7 RPis on threaded rods inside an IKEA HELMER
drawer left over from another cluster project. Each one has 16-32 GB of SDHC storage thanks to a Fry's clearance sale. All of the Pis are cabled up to a mini power bus, which in turn is fed by the 5 V DC (red/black) pairs of an ordinary Molex drive connector from an ATX power supply. An 8-port gigabit switch fits into the back, with the 7 Pis plugged into it and the 8th port used for uplink to my home network.
For Hadoop, I didn't really have to do anything special to make it work. All of the Pis run stock Raspbian wheezy and OpenJDK 6, with a Chef client installed to automatically update configurations. I had never actually set up Hadoop before, and found Michael Noll's Ubuntu instructions
to be fantastic for introducing the Hadoop components and walking through a basic install. Once you have one node working you're 80% of the way to having a cluster; the second part of the guide shows how to tie it all together into a distributed cluster and run example MapReduce jobs.
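To give a flavor of those example jobs: the classic starter is a word count, and with Hadoop Streaming you can write it as a small script that reads lines on stdin and emits tab-separated key/value lines on stdout. This is a hypothetical sketch in that style, not the exact job I ran, and the nice thing is you can test it locally with plain pipes before involving the cluster at all.

```python
#!/usr/bin/env python
# A Hadoop Streaming-style word count sketch (hypothetical, not my exact job).
# Streaming mappers/reducers read lines on stdin and write "key\tvalue" lines
# on stdout, so the same script can be sanity-checked with ordinary pipes.
import sys
from itertools import groupby

def mapper(lines):
    # Emit one "word\t1" line per word, like a streaming mapper would.
    for line in lines:
        for word in line.strip().split():
            yield word + "\t1"

def reducer(lines):
    # Reducer input arrives sorted by key (Hadoop's shuffle handles that);
    # sum the counts for each run of identical words.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # "map" or "reduce" selects the phase, mirroring how one script can be
    # passed to hadoop-streaming as both the -mapper and the -reducer.
    phase = mapper if (sys.argv[1:] or ["map"])[0] == "map" else reducer
    sys.stdout.write("\n".join(phase(sys.stdin)) + "\n")
```

Locally you can dry-run the whole pipeline with cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce; on the cluster it would go through the hadoop-streaming jar (the jar's path varies by Hadoop version).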
One thing you do need to do is trim back the Java heap size. Out of the box it's set to 1000 MB; trim this back to 256-384 MB with export HADOOP_OPTS="-Xmx384m" in hadoop-env.sh. Otherwise the Pis can start swapping, and it's not pretty.
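Concretely, the tweak is a one-liner in conf/hadoop-env.sh (384m is what I used; anywhere in that 256-384 MB range should keep the Pis out of swap):

```shell
# conf/hadoop-env.sh -- cap the JVM heap so the Pis don't start swapping
export HADOOP_OPTS="-Xmx384m"
# Depending on the Hadoop version, the 1000 MB default may also be exposed
# as HADOOP_HEAPSIZE in the same file, which can be lowered instead:
# export HADOOP_HEAPSIZE=384
```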
In my setup I have 176 GB of flash storage to play with. For performance, copying a 730 MB ISO file to the DFS took about 5 minutes (1.2 MB/s write), and copying it back to my laptop took about 2 minutes (6 MB/s read). MapReduce was pretty bad: ~10 minutes to word count 2 MB of text. Each hadoop command on the CLI takes about 15 seconds to return, so ouch. That said, it's all stock, and performance can only get better if somebody wants to wrench on it.
One Pi runs the NameNode, SecondaryNameNode, and JobTracker, while the other Pis each run a DataNode and TaskTracker.
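That split is just Hadoop's usual masters/slaves layout. Assuming a Hadoop 1.x-style setup like the one in Noll's guide, and with hypothetical hostnames pi1 through pi7, the conf files on the master would look something like:

```
# conf/masters -- where the SecondaryNameNode runs
pi1

# conf/slaves -- nodes that run a DataNode and TaskTracker
pi2
pi3
pi4
pi5
pi6
pi7
```

(The NameNode and JobTracker addresses themselves live in core-site.xml's fs.default.name and mapred-site.xml's mapred.job.tracker, respectively.)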
Maybe this will give people ideas to build on; regardless, it's been a fun learning experience, and that's the whole point! It remains to be seen how long I keep them in this configuration, or whether a tiny cluster continues to be useful/amusing. This mounting setup would easily adapt to other small boards like the Cubieboard. If you really wanted to, you could fill a HELMER cabinet with five of these drawers, get 35+ nodes on the same ATX PSU, and still be able to roll it under your desk.