Last Updated: 2022-07-01
In this codelab, you are going to deploy an auto-scaling High Performance Computing (HPC) cluster on Google Cloud with the Slurm job scheduler. You will use an example Terraform deployment that provisions this cluster with your choice of operating system.
In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler.
Clone the research-computing-cluster repository into your home directory:

cd ~
git clone https://github.com/FluidNumerics/research-computing-cluster.git
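The repository provides one Terraform deployment per supported operating system under the tf/ directory. You can list them to see what is available; apart from rcc-centos, which this codelab uses, the exact directory names here are not guaranteed:

ls ~/research-computing-cluster/tf
# Expect one subdirectory per supported OS; rcc-centos is the one used below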
Change to the tf/rcc-centos directory:

cd ~/research-computing-cluster/tf/rcc-centos
Set the environment variables RCC_NAME, RCC_PROJECT, and RCC_ZONE to specify the name of your cluster, your GCP project, and the zone you want to deploy to:

export RCC_PROJECT=<PROJECT ID>
export RCC_ZONE=<ZONE>
export RCC_NAME="rcc-demo"
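The gcloud commands used later in this codelab act on gcloud's active project, so it is worth making sure it matches RCC_PROJECT. An optional sanity check:

# Point gcloud at the same project the cluster will be deployed into
gcloud config set project ${RCC_PROJECT}
# Confirm the variables are set the way you expect
echo "Deploying ${RCC_NAME} to zone ${RCC_ZONE} in project ${RCC_PROJECT}"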
Initialize Terraform:

terraform init

Review the deployment plan:

make plan

Deploy the cluster:

make apply
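Once the apply finishes, you can confirm that the cluster's instances exist. The name filter below assumes instance names are prefixed with the cluster name, which matches the node names shown later in this codelab:

# Assumes instance names start with the cluster name (e.g. rcc-demo-compute-0-0)
gcloud compute instances list --zones ${RCC_ZONE} --filter="name ~ ${RCC_NAME}.*"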
Option: This pair of gcloud commands will figure out the login node name and SSH into it:
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${RCC_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${RCC_ZONE}
Once you are logged in to the login node, verify that Slurm is up and that Singularity is installed:

$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE  NODELIST
c2-standard-8*    up   infinite     25  idle~  rcc-demo-compute-0-[0-24]
$ spack find singularity
==> 1 installed package
-- linux-centos7-x86_64 / gcc@9.4.0 -----------------------------
singularity@3.7.4
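In the sinfo output, the ~ suffix on idle~ marks cloud nodes that are currently powered down; Slurm creates them on demand when jobs arrive. For a per-node view of the same information:

# Node-oriented, long-format listing of every compute node
sinfo -N -l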
In this section, you will download a publicly available Docker container using Singularity, a container platform for HPC systems. You will start an interactive session on a compute node and run a task within a Singularity container on the compute node.
For this section, you must be connected via SSH to the login node of the cluster.
Load Singularity with Spack and pull the cowsay container from Docker Hub. This downloads the container image as cowsay_latest.sif:

spack load singularity
singularity pull docker://grycap/cowsay
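You can optionally verify the download before running anything; singularity inspect prints the metadata stored in the image:

# The .sif file should be present in the current directory after the pull
ls -lh cowsay_latest.sif
# Show the container's labels and metadata
singularity inspect cowsay_latest.sif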
Start an interactive session on a compute node with the srun command:

srun -n1 --pty /bin/bash
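Because this cluster auto-scales, the first srun may pause for a few minutes while a compute node is provisioned. Once the prompt appears, you can confirm that you have landed on a compute node rather than the login node:

# Should print a compute node name such as rcc-demo-compute-0-0
hostname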
Once the session starts, run the cowsay command inside the container with the singularity exec command:

singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"
 _____________
< Hello World >
 -------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
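If you would rather explore the container interactively, Singularity also provides a shell subcommand:

# Open an interactive shell inside the container; type exit to leave it
singularity shell cowsay_latest.sif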
When you are done, release the compute node by typing exit and hitting Enter:

exit
In this section, you will learn how to run a batch job using Slurm. Batch jobs are useful when you have long-running jobs that you want to "set-and-forget" or when you have sequences of jobs that have dependencies between them. Submitting a batch job requires that you write a script that contains the commands to be executed.
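As a minimal sketch of chaining dependent jobs, using the demo.sh script you will create in a moment: sbatch --parsable prints only the job ID, which --dependency=afterok uses to hold the second job until the first finishes successfully.

# Submit the first job and capture its job ID
JOB1=$(sbatch --parsable demo.sh)
# The second job stays pending until JOB1 completes successfully
sbatch --dependency=afterok:${JOB1} demo.sh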
For this section, you must be connected via SSH to the login node of the cluster.
Create a batch script called demo.sh with the following contents:

#!/bin/bash
spack load singularity
singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"
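Job parameters can also be embedded in the script itself with #SBATCH directives instead of being passed on the sbatch command line; a minimal sketch (the job name is purely illustrative):

#!/bin/bash
#SBATCH --ntasks=1          # Equivalent to passing -n1 to sbatch
#SBATCH --job-name=cowsay   # Hypothetical name, just an example
spack load singularity
singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"

With the directives embedded, a bare sbatch demo.sh would behave the same as the sbatch -n1 invocation used in the next step.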
Submit the batch script with the sbatch command. This will create a compute node on your behalf, run the commands listed in the script, store the standard output and standard error in a file, and release the compute node:

sbatch -n1 demo.sh
You can watch the job with squeue while it runs; the CF state below indicates the compute node is still being configured. When the job completes, Slurm writes its output to a file called slurm-4.out in your home directory, where the number is the Slurm job ID. You can review the contents of this file to verify the job ran successfully:

$ squeue
JOBID PARTITION     NAME  USER ST  TIME  NODES NODELIST(REASON)
    4 c2-standa  demo.sh   joe CF  0:00      1 rcc-demo-compute-0-1
$ squeue
JOBID PARTITION     NAME  USER ST  TIME  NODES NODELIST(REASON)
$ cat slurm-4.out
 _____________
< Hello World >
 -------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
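If a job was submitted by mistake or misbehaves, it can be removed from the queue with scancel, using the job ID reported by sbatch and squeue:

# Cancel job 4 from the example above
scancel 4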
In this codelab, you created an auto-scaling, cloud-native HPC cluster and learned how to pull Docker images into Singularity containers and run interactive and batch jobs using Slurm.
To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab:
The easiest way to eliminate billing is to delete the project you created for the codelab.
Caution: Deleting a project has the following effects: everything in the project is deleted, any other work you have done in an existing project is deleted along with it, and the custom project ID is lost.
If you plan to explore multiple codelabs and quickstarts, reusing projects can help you avoid exceeding project quota limits.
Alternatively, to tear down only the cluster while keeping the project, change back to the deployment directory and run make destroy:

cd ~/research-computing-cluster/tf/rcc-centos
make destroy
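After the destroy completes, you can double-check that no cluster instances remain, using the same cluster-name filter assumption as earlier:

# An empty result means all cluster instances were removed
gcloud compute instances list --zones ${RCC_ZONE} --filter="name ~ ${RCC_NAME}.*"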