If you’re already familiar with the Kubernetes “kubectl” CLI, you’ll be able to issue standard commands to determine the status of a job once its configuration has been applied; for those less familiar with this CLI, the project’s README provides the details. In other words, job status, output, and even dynamic logging of “stdout” and “stderr” during job execution are now available for your Slurm job via the Kubernetes CLI. “Your” is the operative word here: jobs execute as the submitting user, minimizing the potential for privilege-escalation vulnerabilities.
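By way of illustration, the interaction might look something like the sketch below. The resource name “slurmjob” and the file and pod names are assumptions made for this example; the project’s README documents the actual resource kinds and manifests.

```
$ kubectl apply -f my-slurm-job.yaml   # submit the job's configuration
$ kubectl get slurmjob                 # check on the job's status
$ kubectl logs <job-pod-name>          # follow the job's stdout/stderr
```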
A prerelease demo of this integration was delivered by Ian Lumb during a talk he gave at the inaugural meeting of the Singularity User Group (SUG) in mid-March 2019. As you’ll note from the integration overview provided below, the project has evolved considerably since Ian’s SUG talk.
Integration Overview
The prerequisites for this integration are as follows:
- Kubernetes – a vanilla deployment based upon version 1.12 or more recent
- Singularity – version 3.1 of the open source Community Edition or SingularityPRO 3.1, owing to the need for compliance with the Open Containers Initiative (OCI) runtime specification
- Singularity CRI – the recently released v1.0.0-beta.1
- Slurm-operator – the v1.0.0-alpha.1 release of this project; see below for additional details
- Slurm – an existing or planned deployment based upon version 18.08 or more recent
Note that the specific versions identified above reflect our current testing matrix much more than they do ‘hard’ constraints.
“slurm-operator” is the open source contribution of this project. Developed by Sylabs’ software engineers Vadzim Pisaruk, Sasha Yakovtseva, and Cedric Clerget, the project comprises three primary components:
- Red Box – a RESTful HTTP server written in the Go programming language that serves as a proxy between the project’s “job-companion” and the Slurm cluster itself.
- Resource Daemon – a service that maintains a consistent, up-to-date view of resource specification and utilization between Kubernetes and Slurm. The project employs extended resources to specify capabilities (e.g., “the maximum number of simultaneously running jobs allowed”) not accounted for through Kubernetes node labels (e.g., devices, plugins, architectures, …); capabilities unique to a specific Slurm cluster thus become ‘known’ to Kubernetes. (The general mechanism is sketched just after this list.)
- Operator – a Custom Resource Definition (CRD) and controller that extend Kubernetes for this project’s purpose. (A hypothetical manifest is sketched below.)
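For readers less familiar with extended resources, the sketch below illustrates the general Kubernetes mechanism the resource daemon builds upon: a custom capacity is advertised in a node’s status, and workloads may then request it like any other resource. The resource name here is invented purely for illustration.

```yaml
# Illustration only – the resource name below is made up for this sketch.
# As surfaced in a node's status (e.g., via `kubectl get node <name> -o yaml`):
status:
  capacity:
    slurm.sylabs.io/max-running-jobs: "2"
---
# ...and as requested by a consuming workload:
spec:
  containers:
  - name: example
    resources:
      limits:
        slurm.sylabs.io/max-running-jobs: "1"
```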
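To make the operator itself more concrete, a minimal manifest for a containerized Slurm job might look something like the following. The apiVersion, kind, and field names are assumptions made for the purpose of this sketch; consult the project’s README and bundled examples for the actual schema.

```yaml
# Hypothetical SlurmJob manifest – names and fields are illustrative only.
apiVersion: slurm.sylabs.io/v1alpha1
kind: SlurmJob
metadata:
  name: cow
spec:
  # An sbatch-style script that red-box hands off to the Slurm cluster.
  batch: |
    #!/bin/sh
    #SBATCH --nodes=1
    srun singularity run library://<your-collection>/<your-container>
```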
Again, the project’s GitHub repository provides significantly more detail, including, of course, the source code itself.
Finally, as far as this technical overview is concerned, an architectural schematic is particularly helpful in providing additional context for the overall integration. From this schematic it is evident that a single deployment of Kubernetes can interoperate with one or more Slurm clusters, alongside the services-oriented use cases Kubernetes traditionally orchestrates.
Next-Gen Use Cases
As is often the case, milestones such as this alpha release of a brand-new open source project serve more as a starting point than a destination. Taken at face value, Singularity containerized applications and workflows are ultimately managed in a Slurm cluster by the workload manager, though their specification and control are mediated entirely via Kubernetes. The immediate benefit, then, is that (preexisting) Slurm clusters become ‘consolidated’ with the enterprise infrastructure orchestrated by Kubernetes – making the convergence claim of this integration tangible and valuable.
Workload managers like Slurm, however, excel at handling distributed processing. In classic HPC use cases, applications employing MPI can be scaled to the extreme on supercomputers by exploiting parallelism across distributed memory. As frameworks for Deep Learning such as TensorFlow and PyTorch allow for distributed computing, either directly or by leveraging Horovod, the value proposition for workload management via Slurm rapidly multiplies. Why? HPC setups are typically predisposed towards being performant platforms for distributed computing at extreme scale – routinely featuring low-latency, high-bandwidth interconnects (e.g., InfiniBand), accelerators (e.g., GPUs), parallel file systems, and more. Through integrations such as this one, these ‘HPC affinities’ are made available and usable to even broader classes of use cases.
As we’ve been claiming for some time, the emergence of hybrid use cases is increasingly evident. Because these use cases involve streaming workloads, real-time analysis, and data pipelining into compute-focused services, they are inherently hybrid – and therefore demanding of the converged infrastructure enabled through this integration. To demonstrate just how tangible such hybrid use cases are in practice, Sylabs’ software engineer Carl Madison developed a demonstration that impressed attendees at DockerCon last week in San Francisco. (If you’re attending the Red Hat Summit this week in Boston, come and find us at booth #1133, as the demo will be available there as well.) While Carl’s demo doesn’t yet span Kubernetes to Slurm with multiple Singularity containers in real time, that’s definitely the direction he’s heading with it!
Finally, it’s important to note that the project announced here has emphasized Slurm as the workload manager. Workload management, by comparison to the container ecosystem, is an extremely mature area in terms of software lifecycle management – with some solutions boasting more than two decades of longevity at this point! From in-house to open source to commercial, offerings abound, and organizations can become quite polarized with respect to their preferences. As noted above, it is the red-box RESTful HTTP server that is the lynchpin for this integration. In the case of the current project, only a simple implementation supporting a few endpoints was required, so we developed our own in Go. This approach could be replicated, or existing REST APIs could be used where available and suitable. In other words, it wouldn’t require a tremendous amount of effort to support workload managers other than Slurm.
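To make that last claim a little more concrete, here is a minimal sketch, in Go, of the kind of proxy endpoint involved: an HTTP handler that accepts a batch script and hands it to the local workload manager’s CLI. The endpoint path and the direct use of “sbatch” are assumptions for illustration rather than the project’s actual API; swapping that one command is, in essence, what supporting another workload manager would entail.

```go
// Minimal sketch of a red-box-style REST proxy endpoint (illustrative only).
package main

import (
	"log"
	"net/http"
	"os/exec"
)

// submitHandler reads an sbatch-style script from the request body and
// submits it to the local workload manager via its command-line interface.
func submitHandler(w http.ResponseWriter, r *http.Request) {
	cmd := exec.Command("sbatch")
	cmd.Stdin = r.Body // the batch script travels over HTTP as the request body

	// Swapping the command above for another workload manager's submission
	// CLI is the essence of adapting this approach beyond Slurm.
	out, err := cmd.CombinedOutput()
	if err != nil {
		http.Error(w, string(out), http.StatusInternalServerError)
		return
	}
	w.Write(out) // e.g., "Submitted batch job 42"
}

func main() {
	http.HandleFunc("/jobs", submitHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```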
Next Steps
Whether it’s adapting this integration to work with other workload managers, or contributing to other projects in the container ecosystem, our last word remains true: we encourage you to get involved! If it’s the Singularity ecosystem where you’d like to focus your contributions, the best place to get started is here. We look forward to collaborating with you!