1 year ago

Pi Spark supercomputer cluster

One data scientist shows us how to string six Raspberry Pis together to build a supercomputer and experiment with big data

The Raspberry Pi is great for learning computer science, but there’s one area that’s big news but requires big computers, and that’s ‘big data’.

Big data software typically runs on clusters of networked computers, working together to perform the heavy lifting required. This clustered nature makes learning big data tricky, because you need several computers wired together to practise. Sung-Taek Kim, a software engineer from Korea, decided that the Raspberry Pi would be perfectly suited to the task. “Raspberry Pi is a great education platform to learn how big data software works,” he tells us. “It is [comparatively] slow and low-powered, [so] that you would have hands-on experiences when your data manipulation methods execute as planned.”

Six Raspberry Pis make up the SparkPi

In fact, the modest performance of the Raspberry Pi becomes an advantage when learning big data techniques. “Once you miss a small detail,” explains Sung-Taek, “you feel the operation processes slow down.”

“Sending data across [a] network takes time,” he adds. “All the CPUs in your cluster compete for resources such as memory or disk [space], and a node or two could suddenly refuse to work, just like [in] a Google-class data centre cluster.” He explains that the relative slowness of a Pi cluster is actually an advantage, enabling you to prepare for such events.

Sung-Taek’s cluster is based around six Raspberry Pi 2 boards wired together with Ethernet cables via a D-Link 8-port Gigabit Desktop Switch.

“Theoretically, you would only need one Raspberry Pi,” says Sung-Taek, “since Spark exploits the [nature] of a master-slave scheme. Prepare a Raspberry Pi as a slave and your laptop as a master. Connect two Raspberry Pi devices and you have a Spark cluster.”

Sung-Taek suggests using between three and eight Raspberry Pi boards for the project. “Once you have more than ten Raspberry Pis,” he says, “it’s a headache to find a proper power source, to arrange the network and power cord.”

The cluster is made using a custom casing found on GitHub

The hardest part seems to be building the enclosure. Sung-Taek hosts schematics on GitHub, but accuracy is vital. Even a half millimetre offset in the cutting template could render one of the acrylic tiers useless, he warns.

Aside from the Raspberry Pi units, the project isn’t expensive. The power supply, network switch, cables, screws, and enclosure only came to around $60. A complete list of materials is available via a Make: article (bit.ly/1J2jpDf) written by Sung-Taek.

The software requirements are quite intense. “Python is a must,” he says. “In order to fully exploit what a Spark cluster provides, it would be a good idea to learn Scala as well.” Java is another must, and the listed software packages include NumPy, SciPy, and scikit-learn. On top of that, you’ll be learning the MapReduce programming model, which is where you’ll encounter Hadoop and Spark.
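
To get a feel for the MapReduce model before installing Hadoop or Spark, here is a minimal sketch of a word count in plain Python. It is not Sung-Taek's code and it uses no Spark API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, much as a cluster routes
    all pairs that share a key to the same node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    # On a real cluster, each line (or block of lines) could be mapped
    # on a different node; this sketch does it all in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["big data on a small Pi", "small Pi big cluster"]
    print(word_count(sample))
```

The point of splitting the job this way is that the map and reduce steps are independent per line and per key, so a framework such as Spark can spread them over several nodes. The shuffle step is where data has to cross the network, which is where a slow cluster makes the cost visible.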

They are all skills worth learning, though: big data is at the forefront of computer science and is an interesting area to study and work in. “The Raspberry Pi cluster lets you prepare [for] events that could take place in a production environment,” says Sung-Taek. “Further, the cluster lets you easily find where to apply optimisation work, since its hardware resource is limited indeed.”

“You might want to plan carefully and start with a small number of nodes,” he advises. “Pay attention to details in each step, and make sure your plan is checked with specifications. Once you’ve successfully built [a cluster], make a bigger one. One of the best strategies to avoid costly mistakes is to follow and observe what others have done.”

  • Jonnie Bergsén

    Isn’t supercomputer an incorrect term, as this operates more as a cluster?

  • darklinux

    clusters and supercomputers have become synonymous since the Beowulf project … nothing this formalized either

  • Jonnie Bergsén

    Shame as there is a big difference between a supercomputer and computers working as a cluster.

  • darklinux

    Certainly, if one refers to vector architecture and the work of Seymour Cray. But from a modern point of view, a cluster = a supercomputer: libraries are unified, as are languages. A computer like a Cray C90 is now a heresy.

  • You’re right, there is a difference between a single computer with many processors and a cluster of computers (each with its own processor). But the two terms have become increasingly synonymous, so it’s not wholly wrong to use it to pep up an article 🙂

    And as Crays start at $500k (which is a lot less than I expected btw) it’s a bit hard to get any understanding of parallel computing on a pure supercomputer (rather than a bramble).

    ps: I interviewed Sung-Taek Kim and wrote the piece incidentally. You are right to point out the differences. 🙂

  • Jonnie Bergsén

    You are so correct sir.

  • darklinux

    I refer you to all the online documentation on Seymour Cray’s line of supercomputers (including the T3E), if only for the maintenance. Current HPC is more pleasant: still CLI, but you no longer have to haul around the whole chain of workstation – data server – supercomputer and backup servers.

  • khisanth

    I have a 3 Pi “supercomputer” cluster, just need to think of things for it to do now!

  • Per Jensen

    A single computer isn’t technically powerful enough to be a supercomputer; the term originates in the single-processor days. And it always refers to a cluster of many computers.

  • OMGWTFZPMBBQ

    16*8*128 = 4096 Pi Zeros.. (drool)… that’s a lot of processing power.
    In terms of power efficiency it’s actually better to use an array of these: each unit only uses a couple of hundred mA, and assuming 8GB uSD cards, that only goes up by 30mA or less when reading.
    Someone can probably calculate how much power this would use compared to a dedicated single 4*4 (ie Core I7) setup with associated cooling and suchlike.

  • Mangap

    I hope that in future Raspberry Pi prepares new models that are easier to use as a supercomputer,
    maybe also for 3D rendering, for example using V-Ray.

  • XENOM

    There’s no difference!

  • Aiko Man

    Why would someone use a Raspberry Pi cluster computer? Is there a reason?

  • Michael Crowley

    I see tons of videos of people putting together Pi clusters.. but I have yet to see anybody doing anything but flashing LEDs on them!

  • Marko Tasic

    I have a hosting platform based on systemd-nspawn. Each hosts 4-8 containers. They serve web apps and databases without any issue.

  • dsychan007

    I am trying to do the same. Is it just the same instructions, but installing the latest version?

  • khader

    Link for casing? I want to buy.

  • O,owL

    That’s exactly what I’m planning to do too. 😀

  • I have the results of the power calculations. It’s 1.21 gigawatts. Or, more accurately, a bolt of lightning.

  • Tyler Xiaoyu Tang

    Flashing LEDs?

  • Tormato

    Learning HPC (High Performance Computing) without breaking the bank! You will not be using it for production loads, but it’s great training.

  • Hassan Emam

    Marko,
    How did you manage the database? Did you sync DBs across all nodes, or do you have a better way?