Apache Spark is a fast, in-memory data processing engine for large-scale workloads, supporting both batch and streaming jobs, such as machine learning, that require fast access to data sets. Apache Spark’s run-everywhere ethos is now taken a few steps further with the BYOE (bring your own environment) mentality of Singularity: mobility of compute.
Spark can be a tricky application to install and configure, but by installing it within a Singularity container your installation becomes portable and reproducible, allowing you to spin up instances in the same way on your laptop, in an HPC center, or in the cloud. Additionally, because you are using Singularity, your workload can leverage specialized hardware on the underlying hosts, such as GPUs or high-speed interconnects, if it needs them.
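As a rough sketch of that workflow, the commands below build an image from a definition file and then launch Spark from it. The file names `spark.def` and `spark.sif` are placeholders, and `spark-shell` being on the container's PATH assumes the definition file (shown later in this post) installs Spark that way:

```bash
# Build the image from a definition file (spark.def is a placeholder name;
# the actual definition file appears later in this post).
sudo singularity build spark.sif spark.def

# Run a Spark shell from inside the container; the --nv flag exposes the
# host's NVIDIA GPUs to the container if your workload needs them.
singularity exec --nv spark.sif spark-shell
```

The same image file can be copied to a laptop, an HPC cluster, or a cloud instance and run with the identical commands, which is what makes the installation reproducible.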
We’re going to be making a few assumptions about your setup (quick checks for each follow the list):
- You have a shared home directory across the cluster
- You have passwordless SSH access between nodes on the cluster (even when running on a single node)
- You have a directory in $HOME for Spark-related files (e.g. $HOME/spark)
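A minimal way to verify these assumptions before continuing, where `node01` and `node02` are placeholder host names you should replace with your own nodes:

```bash
# Passwordless SSH: neither command should prompt for a password.
ssh node01 hostname
ssh node02 hostname

# Shared home directory: the same $HOME path should exist on each node.
ssh node01 'ls -d $HOME'

# Directory for Spark-related files (image, logs, work directories).
mkdir -p $HOME/spark
```
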
The example definition file follows: