Hadoop+HDFS+MR on Pi cluster - works!


15 posts
by bwann » Fri Mar 15, 2013 1:45 am
In case you've ever wondered, you can indeed run a full Hadoop cluster on a stack of Raspberry Pis!

It's certainly not fast by any means; MapReduce jobs are more like tiny data crunching, but at least you can amass a surprising amount of HDFS storage. Why do this? To see if it can be done!

I had a stack of Pis that I wasn't quite sure what I wanted to do with yet and decided to build a mini Pi cluster. Other people have done this, but I wanted to see if I could strip things down and make it simpler. Here I've mounted 7 RPis on threaded rods inside an IKEA HELMER drawer left over from another cluster project. Each one has 16-32 GB of SDHC storage thanks to a Fry's clearance sale. All of the Pis are cabled up to a mini power bus, which in turn is fed by the 5 V DC (red/black) pairs of an ordinary Molex drive connector of an ATX power supply. An 8-port gigabit switch fits into the back, with the 7 Pis plugged into it and the 8th port used for uplink to my home network.

For Hadoop, I didn't really have to do anything special to make it work. All of the Pis run stock Raspbian wheezy, OpenJDK 6, with a Chef client installed to automatically update configurations. I've never actually set up Hadoop before, and found Michael Noll's Ubuntu instructions to be fantastic in introducing the Hadoop components and walking through a basic install. Once you have one node working you're 80% of the way to having a cluster; the second part of the guide shows how to tie it all into a distributed cluster and run example MapReduce jobs.

One thing you need to do is trim back the Java heap size. Out of the box it's set to 1000 MB; trim this back to 256-384MB with export HADOOP_OPTS="-Xmx384m" in hadoop-env.sh. Otherwise the Pis can start swapping and it's not pretty.

In my setup I have 176 GB of flash storage to play with. For performance, copying a 730 MB ISO file to the DFS took about 5 minutes (1.2 MB/s write) and about 2 minutes (6 MB/s read) to copy it back to my laptop. MapReduce was pretty bad: ~10 minutes to word-count 2 MB of text. Each hadoop command on the CLI takes 15 seconds to return, so ouch. That said, it's all stock and performance can only get better if somebody wants to wrench on it.
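
As a rough sanity check on those transfer figures, here is a minimal Python sketch that turns a file size and an elapsed time into an average rate. The function name is purely illustrative, and the times are the "about 5 minutes" and "about 2 minutes" quoted above, so the results are only as good as that rounding. Worth noting: 730 MB in 5 minutes works out nearer 2.4 MB/s than 1.2 MB/s, so either the write took closer to 10 minutes or the quoted rate is on the pessimistic side.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in MB/s for a copy of size_mb taking `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


ISO_MB = 730  # size of the ISO quoted in the post

# Elapsed times are the approximate figures from the post, in seconds.
write_rate = throughput_mb_per_s(ISO_MB, 5 * 60)
read_rate = throughput_mb_per_s(ISO_MB, 2 * 60)

print(f"write: {write_rate:.1f} MB/s, read: {read_rate:.1f} MB/s")
```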

One Pi runs the NameNode, SecondaryNameNode, and a TaskTracker, while the other Pis run DataNodes and TaskTrackers.

Maybe this will give people ideas to build on. Either way, it's a fun learning experience, and that's the whole point! It remains to be seen how long I keep them in this configuration or if a tiny cluster continues to be useful/amusing. This mounting setup would easily adapt to other small boards like the Cubieboard. If one really wanted to, they could fill a HELMER cabinet with five of these drawers and get 35+ nodes on the same ATX PSU, and still be able to roll it under their desk.

Posts: 4
Joined: Sat Oct 13, 2012 7:13 am
Location: Fremont, CA
by mikesmullin » Sat Apr 06, 2013 7:20 pm
great work. someone suggested the JVM is a bottleneck, as well as the NIC. is it possible to see a hadoop-like mapreduce implemented in C/C++ and networked via USB? may also be worthwhile to try attaching Intel SSDs to the Pi.
Posts: 1
Joined: Sat Apr 06, 2013 7:15 pm
by jdriscoll » Mon Apr 29, 2013 2:08 pm
That looks great. I've also been thinking about making a Pi cluster. Would it be possible to get more details on the power system? I definitely want to avoid having to get a separate cell phone charger for each board.

Thanks
Posts: 2
Joined: Mon Apr 29, 2013 2:05 pm
by rpiuser3000 » Tue Apr 30, 2013 5:34 am
mikesmullin wrote:great work. someone suggested the JVM is a bottleneck, as well as the NIC. is it possible to see a hadoop-like mapreduce implemented in C/C++ and networked via USB? may also be worthwhile to try attaching Intel SSDs to the Pi.


Networking over USB will make no difference, as the Pi's network adapter already sits on the USB bus. An SSD won't deliver more than the Pi's USB can handle.
Posts: 23
Joined: Sat Apr 27, 2013 11:47 pm
by siddharth » Tue May 21, 2013 2:23 am
Great! good job..
This gives me the boost to proceed with a project I have in mind.
Thanks
Posts: 1
Joined: Tue May 21, 2013 2:19 am
by bwann » Fri May 31, 2013 4:31 am
jdriscoll wrote:That looks great. I've also been thinking about making a Pi cluster. Would it be possible to get more details on the power system? I definitely want to avoid having to get a separate cell phone charger for each board.


Basically, as long as you have something that'll output 5 volts DC at a sufficient amperage, you can wire up multiple RPis to it. By my figures each one drew 400 mA, so for my seven I needed something that would output at least 2.8 amps. Most USB phone chargers I've seen top out at 600-800 mA or so, not enough for two. I've seen other people use a USB hub to provide power (again check the amperage rating) to a stack.
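
The amperage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name, the optional headroom parameter, and the assumption that every board draws the same 400 mA are hypothetical, not something measured beyond the figure quoted above.

```python
def required_supply_amps(boards: int, draw_ma_per_board: float,
                         headroom: float = 0.0) -> float:
    """Return the minimum 5 V supply current, in amps, for a stack of boards.

    headroom is an optional safety margin (0.2 means 20% extra). It is an
    illustrative assumption, not a figure from the original build.
    """
    if boards < 0 or draw_ma_per_board < 0 or headroom < 0:
        raise ValueError("arguments must be non-negative")
    total_ma = boards * draw_ma_per_board * (1.0 + headroom)
    return total_ma / 1000.0  # convert mA to A


# Seven boards at the 400 mA per-board draw quoted in the post.
seven_pis = required_supply_amps(7, 400)
print(f"7 boards need at least {seven_pis:.1f} A at 5 V")

# Same stack with a hypothetical 20% safety margin.
with_margin = required_supply_amps(7, 400, headroom=0.2)
print(f"with 20% headroom: {with_margin:.2f} A")
```

Real draw varies with load and attached peripherals, so treat the result as a floor rather than a spec.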

The power supply I used is a standard ATX PSU, 250W, that I had lying around. The large Molex 4-pin connectors used for things like hard drives provide two voltages, 12 VDC (yellow/black, unused) and 5 VDC (red/black). I took a 10-terminal block, daisy-chained (with 16ga wire) one side of 5 terminals to the 5 VDC red (+) lead, and the other 5 terminals to the 5 VDC black (-) lead.

For each RPi, I connected pin 2 (5 V+) to a terminal on the positive side and pin 6 (ground) to a terminal on the ground side. I took a pair of standard project jumpers and crimped one end of each with a terminal ring, so one end would fit on the RPi pin and the other would screw down to the terminal block. Incidentally I had both male-male and short male-female jumpers, so I connected them in such a way that I could unplug each board individually from the front.

Here's a closer picture of the terminal bar I used (there's more in the set):
http://www.flickr.com/photos/binaryfury/8462912231/in/set-72157632791989032

Offhand I don't know the maximum number one can connect together like this; it'd depend on the PSU and cable gauge. I recall seeing somebody connecting 30-32 RPis to one PSU like this (~12 amps!) and the PSU cabling was starting to get warm, so that's too many (at least without running heavier-gauge wire).
Posts: 4
Joined: Sat Oct 13, 2012 7:13 am
Location: Fremont, CA
by jdriscoll » Fri May 31, 2013 4:39 am
Thanks, that's very useful.
Posts: 2
Joined: Mon Apr 29, 2013 2:05 pm
by acobley » Thu Jul 04, 2013 2:57 pm
Have you tried running it with the Oracle JVM rather than OpenJDK? My experiments with Cassandra show the Oracle Java to be significantly faster than OpenJDK:

http://ac31004.blogspot.co.uk/2012/07/j ... ebian.html
Posts: 20
Joined: Fri Feb 10, 2012 9:54 am
by aotto2012 » Mon Oct 21, 2013 11:43 am
bwann wrote:For Hadoop, I didn't really have to do anything special to make it work. All of the Pis run stock Raspbian wheezy, OpenJDK 6, with a Chef client installed to automatically update configurations.


Could you share some information regarding the installation of the Chef client?


Cheers,
Andreas
Posts: 5
Joined: Sat Jun 09, 2012 9:37 am
Location: Germany
by peehoo » Mon Nov 18, 2013 10:54 am
Hmm... would it be possible to use, for example, 3 Pis to live-transcode a video stream from a VU+ Ultimo Enigma2 HDTV receiver to the internet? The stream is H.264, and VLC can probably do it with hardware decoding?
Posts: 1
Joined: Mon Nov 18, 2013 10:51 am
by alexr92 » Sun Dec 08, 2013 10:00 am
I was looking at building a similar enclosure... What size are the threaded rods?
Posts: 1
Joined: Sun Dec 08, 2013 9:58 am
by jesmitty » Fri Feb 21, 2014 11:36 pm
can't get past "sudo add-apt-repository ppa:ferramroberto/java" step. I get the following. Anyone recognize this?

You are about to add the following PPA to your system:
PPA esclusivo per l'ultima versione disponibile di JAVA

PPA for the latest version of JAVA

PPA für die neueste Version von JAVA

PPA para la última versión de JAVA

PPA pour la dernière version de JAVA


by LffL http://www.lffl.org

More info: https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it

Traceback (most recent call last):
  File "/usr/bin/add-apt-repository", line 160, in <module>
    sp = SoftwareProperties(options=options)
  File "/usr/lib/python2.7/dist-packages/softwareproperties/SoftwareProperties.py", line 96, in __init__
    self.reload_sourceslist()
  File "/usr/lib/python2.7/dist-packages/softwareproperties/SoftwareProperties.py", line 584, in reload_sourceslist
    self.distro.get_sources(self.sourceslist)
  File "/usr/lib/python2.7/dist-packages/aptsources/distro.py", line 87, in get_sources
    raise NoDistroTemplateException("Error: could not find a "
aptsources.distro.NoDistroTemplateException: Error: could not find a distribution template
Posts: 3
Joined: Fri Feb 21, 2014 10:08 pm
by citrixmeta » Wed Aug 27, 2014 6:24 pm
that is incredible, i wonder how well it would run on this

https://www.kickstarter.com/projects/884048325/worlds-smallest-datacenter
Posts: 2
Joined: Wed Aug 27, 2014 6:23 pm
by kdseptian » Thu Apr 23, 2015 8:36 am
Do I need to disable IPv6 as in the http://michael-noll.com/ guide?
If so, how can I do it? The terminal said 'no such files or directory' when I entered the command from the http://michael-noll.com/ guide.

Thanks
Posts: 1
Joined: Wed Sep 17, 2014 8:56 pm
by stkim1 » Mon Dec 21, 2015 3:55 am
You should come check out Apache Spark on Raspberry Pi 2.

Hadoop and MR used to work fine, but the combination has now been superseded by in-memory computation in industry.
(Disclaimer: I am the builder of PocketCluster.)
I am building pocket-sized BigData clusters. Come check out my blog.
Posts: 5
Joined: Thu Jun 04, 2015 2:24 am