One data scientist shows us how to string six Raspberry Pis together to build a supercomputer and experiment with big data
The Raspberry Pi is great for learning computer science, but there’s one area that’s big news but requires big computers, and that’s ‘big data’.
Big data software typically runs on clusters of networked computers, working together to perform the heavy lifting required. This clustered nature makes learning big data tricky, because you need several computers wired together to practise. Sung-Taek Kim, a software engineer from Korea, decided that the Raspberry Pi would be perfectly suited to the task. “Raspberry Pi is a great education platform to learn how big data software works,” he tells us. “It is [comparatively] slow and low-powered, [so] that you would have hands-on experiences when your data manipulation methods execute as planned.”
In fact, the modest performance of the Raspberry Pi becomes an advantage when learning big data techniques. “Once you miss a small detail,” explains Sung-Taek, “you feel the operation processes slow down.”
“Sending data across [a] network takes time,” he adds. “All the CPUs in your cluster compete for resources such as memory or disk [space], and a node or two could suddenly refuse to work, just like [in] a Google-class data centre cluster.” Meeting these problems on a small, slow Pi cluster lets you prepare for such events.
Sung-Taek’s cluster is based around six Raspberry Pi 2 boards wired together with Ethernet cables via a D-Link 8-port Gigabit Desktop Switch.
“Theoretically, you would only need one Raspberry Pi,” says Sung-Taek, “since Spark exploits the [nature] of a master-slave scheme. Prepare a Raspberry Pi as a slave and your laptop as a master. Connect two Raspberry Pi devices and you have a Spark cluster.”
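As a rough illustration of the master-and-worker idea Sung-Taek describes, here is a minimal sketch in plain Python. It does not use Spark at all: threads on a single machine stand in for the Raspberry Pi nodes, and the names run_job, split, and worker_task are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Work done by one worker: sum the squares of its share of the data."""
    return sum(x * x for x in partition)


def split(data, parts):
    """Cut the data into roughly equal partitions, one per worker."""
    return [data[i::parts] for i in range(parts)]


def run_job(data, workers=6):
    """Master: partition the data, farm it out, then combine the partial results."""
    partitions = split(data, workers)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_task, partitions))
    return sum(partials)


if __name__ == "__main__":
    # Sum of the squares of 1..100, computed across six stand-in workers.
    print(run_job(list(range(1, 101))))
```

The default of six workers simply mirrors the six boards in Sung-Taek’s cluster. On a real cluster the partitions would travel over the network to each node, which is where the delays he mentions come from.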
Sung-Taek suggests using between three and eight Raspberry Pi boards for the project. “Once you have more than ten Raspberry Pis,” he says, “it’s a headache to find a proper power source, to arrange the network and power cord.”
The hardest part seems to be building the enclosure. Sung-Taek hosts schematics on GitHub, but accuracy is vital. Even a half-millimetre offset in the cutting template could render one of the acrylic tiers useless, he warns.
Aside from the Raspberry Pi units, the project isn’t expensive. The power supply, network switch, cables, screws, and enclosure came to only around $60. A complete list of materials is available via a Make: article (bit.ly/1J2jpDf) written by Sung-Taek.
The software requirements are quite intense. “Python is a must,” he says. “In order to fully exploit what a Spark cluster provides, it would be a good idea to learn Scala as well.” Java is another must, and the listed software packages include NumPy, SciPy, and scikit-learn. On top of that, you’ll be learning the MapReduce programming model, which is where you’ll encounter Hadoop and Spark.
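To give a flavour of the MapReduce model, here is a minimal word-count sketch in plain Python. It runs on a single machine and uses neither Hadoop nor Spark; the function names map_phase, shuffle, reduce_phase, and word_count are invented for this example, and a real framework would carry out the grouping step across the cluster for you.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine all the values collected for one key into a single count."""
    return reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {word: reduce_phase(counts) for word, counts in shuffle(pairs).items()}


if __name__ == "__main__":
    sample = ["big data on a small Pi", "a small cluster for big data"]
    print(word_count(sample))
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread over many nodes, which is what makes the model a good fit for a cluster.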
They are all skills worth learning, though: big data is at the forefront of computer science and is an interesting area to study and work in. “The Raspberry Pi cluster lets you prepare [for] events that could take place in a production environment,” says Sung-Taek. “Further, the cluster lets you easily find where to apply optimisation work, since its hardware resource is limited indeed.”
“You might want to plan carefully and start with a small number of nodes,” he advises. “Pay attention to details in each step, and make sure your plan is checked with specifications. Once you’ve successfully built [a cluster], make a bigger one. One of the best strategies to avoid costly mistakes is to follow and observe what others have done.”