Singularity is an emerging software tool that facilitates the movement of software applications and workflows between computational environments: from a researcher’s laptop, to Berkeley’s high-performance computing cluster, Savio, to XSEDE’s national-scale clusters (e.g., Jetstream, Comet) or the commercial cloud (AWS, Azure, etc.). It allows researchers to take applications and workflows already built in one Linux environment and run them in others, without reinstallation or reconfiguration. Singularity achieves this by packaging applications and workflows in “containers” and running them within those containers.
The software was developed by longtime Lawrence Berkeley National Lab HPC Architect Greg Kurtzer, and is maintained by Greg and collaborators in an open-source community. Here’s how the project’s site introduces the tool:
“Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don’t have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run. Did you already invest in Docker? The Singularity software can import your Docker images without having Docker installed or being a superuser. Need to share your code? Put it in a Singularity container and your collaborator won’t have to go through the pain of installing missing dependencies. Do you need to run a different operating system entirely? You can ‘swap out’ the operating system on your host for a different one within a Singularity container. As the user, you are in control of the extent to which your container interacts with its host.”
Berkeley Research Computing’s Cyberinfrastructure Engineer, Maurice Manning, has been working with researchers across a number of disciplines to set up a Singularity container that runs Tesseract, an open-source engine for optical character recognition (OCR) over large text corpora. One challenge of using Tesseract is that it is notoriously thorny to install; packaging it in a container means the installation only has to be done once. Researchers who want to reduce complexity and avoid the Linux command line can access and run Singularity containers on Savio through Jupyter notebooks.
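For readers curious what running a containerized Tesseract from a notebook might look like, here is a minimal sketch in Python. It is not the setup used on Savio: the container image name (tesseract.simg), the page image (page_0001.png), and the helper functions are all hypothetical. It only assumes that the singularity command is on the path and that the container provides the standard tesseract command-line program.

```python
import shlex
import subprocess
from pathlib import Path


def build_ocr_command(container_image, page_image, output_base, lang="eng"):
    """Build the argument list that runs Tesseract inside a Singularity container.

    `singularity exec <image> <command...>` runs a command inside the container;
    `tesseract <input> <outputbase> -l <lang>` is Tesseract's basic CLI form.
    All file names passed in are placeholders chosen for illustration.
    """
    return [
        "singularity", "exec", str(container_image),
        "tesseract", str(page_image), str(output_base), "-l", lang,
    ]


def run_ocr(container_image, page_image, output_base, lang="eng"):
    """Run the OCR command and return the path of the text file Tesseract writes.

    Requires Singularity and the (hypothetical) container image to be present.
    """
    cmd = build_ocr_command(container_image, page_image, output_base, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".txt")  # Tesseract appends .txt by default


if __name__ == "__main__":
    # Only print the command here, so the sketch runs even without Singularity.
    example = build_ocr_command("tesseract.simg", "page_0001.png", "page_0001")
    print(shlex.join(example))
```

In a Jupyter notebook cell, calling run_ocr on each page image in a loop would produce one text file per page, with no local Tesseract installation needed outside the container.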
Christopher Hench, Program Development Lead for Digital Humanities at Berkeley and a consultant and Python instructor at Berkeley’s D-Lab, is working with law professors Kenneth Ayotte (UC Berkeley) and Jared Ellias (UC Hastings) on a project involving OCR over hundreds of thousands of legal documents. Chris, who is also a PhD candidate at Berkeley in German and Medieval Studies, explains:
“Singularity has been integral to streamlining our large-scale OCR work with Tesseract. Our project collects hundreds of thousands of bankruptcy court PDFs to learn more about the incentives and strategies of the parties in these cases from the language in the case files. Singularity allows us to quickly set up the OCR workflow for these PDFs across platforms in both the HPC Savio cluster and in Azure cloud services, without having to perform complex, time-consuming, and platform-specific software reinstallation and reconfiguration.”
BRC consultants can help campus researchers assess whether Singularity is suitable for a specific research problem or workflow. To get started, please e-mail the consultants at brc@berkeley.edu. For researchers who already have a good understanding of containers and Linux environments, the “Using Singularity on Savio” page on Research IT’s website is another great place to start.