Last Updated: 2022-07-01

What you will build

In this codelab, you will deploy an auto-scaling High Performance Computing (HPC) cluster on Google Cloud with the Slurm job scheduler. You will use an example Terraform deployment that builds this cluster with your choice of operating system.

What you will learn

What you will need

In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler.

  1. Open your Cloud Shell on GCP.
  2. Clone the Fluid Numerics Research Computing Cluster (RCC) repository
cd ~
git clone https://github.com/FluidNumerics/research-computing-cluster.git
  3. Change to the tf/rcc-centos directory
cd ~/research-computing-cluster/tf/rcc-centos
  4. Create and review a terraform plan. Set the environment variables RCC_NAME, RCC_PROJECT, and RCC_ZONE to specify the name of your cluster, your GCP project, and the zone you want to deploy to.
export RCC_PROJECT=<PROJECT ID>
export RCC_ZONE=<ZONE> 
export RCC_NAME="rcc-demo" 
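If you are not sure of your project ID or which zones are available, these optional gcloud lookup commands (not part of the deployment itself) can help:

gcloud config get-value project
gcloud compute zones list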
  5. The first time you run terraform, you must run the `init` command:
terraform init
  6. Create the plan with the `make` command, which will run `terraform`
make plan
  7. Deploy the cluster. The setup process can take up to 5 minutes.
make apply
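While the deployment runs, you can optionally watch the cluster VMs come up with gcloud; the instance names will include the RCC_NAME you chose (for example, rcc-demo-login0):

gcloud compute instances list --zones ${RCC_ZONE}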
  8. SSH to the login node created in the previous step. The login node name is shown in the output of the previous step (it will likely be called rcc-demo-login0). You can connect by clicking the SSH button next to the instance under Compute Engine -> VM instances in the Cloud Console.

Option: This pair of gcloud commands will figure out the login node name and SSH into it:

export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${RCC_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${RCC_ZONE}
  9. Once you are connected to the login node, verify your cluster setup by checking that the Slurm compute partitions and the Singularity package are available:
$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
c2-standard-8*    up   infinite     25  idle~ rcc-demo-compute-0-[0-24]

$ spack find singularity
==> 1 installed package
-- linux-centos7-x86_64 / gcc@9.4.0 -----------------------------
singularity@3.7.4

In this section, you will download a publicly available Docker container using Singularity, a container platform for HPC systems. You will start an interactive session on a compute node and run a task within a Singularity container on that node.

For this section, you must be connected via SSH to the login node of the cluster.

  1. First, we will download the cowsay Docker image using Singularity. The resulting image will be saved as a Singularity Image File called cowsay_latest.sif.
spack load singularity
singularity pull docker://grycap/cowsay
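You can optionally confirm that the image file was created in your working directory:

ls -lh cowsay_latest.sif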
  2. Start an interactive session on a compute node using Slurm's srun command
srun -n1 --pty /bin/bash
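You can optionally confirm that you are no longer on the login node by printing the hostname of the node you landed on:

hostname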
  3. Once you are on a compute node, load the singularity package to your path using spack:
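As on the login node, spack puts the singularity command on your PATH:

spack load singularity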
  4. Next, you can start a Singularity container using the Singularity image and execute a command within the container. In this example, we will run the cowsay command.
singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"

 _____________
< Hello World >
 -------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
  5. To release the compute node and finish your interactive job, just type exit and hit Enter
exit

In this section, you will learn how to run a batch job using Slurm. Batch jobs are useful when you have long-running jobs that you want to "set-and-forget" or when you have sequences of jobs that have dependencies between them. Submitting a batch job requires that you write a script that contains the commands to be executed.

For this section, you must be connected via SSH to the login node of the cluster.

  1. First, we will create a bash script that runs the same commands as in the previous section. Open a text editor on the login node (e.g. vim or nano) and create a file with the contents shown below. Save the file as demo.sh.
#!/bin/bash

spack load singularity
singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"
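Optionally, you can embed the job parameters in the script itself using #SBATCH directives instead of passing them to sbatch on the command line. A minimal sketch is shown below; the job name is an illustrative choice, not part of the original example.

#!/bin/bash
#SBATCH --ntasks=1             # same effect as passing -n1 to sbatch
#SBATCH --job-name=cowsay-demo # illustrative name, shown in the squeue NAME column

spack load singularity
singularity exec cowsay_latest.sif /usr/games/cowsay "Hello World"

With the directives in place, the job can be submitted with a plain sbatch demo.sh; by default, output still goes to slurm-<JOBID>.out as in the next step.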
  2. Submit the batch job using Slurm's sbatch command. Slurm will allocate a compute node on your behalf, run the commands listed in the script, write the standard output and standard error to a file, and release the compute node.
sbatch -n1 demo.sh
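sbatch prints the ID that Slurm assigned to the job; the output will look similar to:

Submitted batch job 4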
  3. You can monitor the status of your job using the squeue command. When your job is complete, there will be a file called slurm-<JOBID>.out (slurm-4.out in this example) in your home directory. Review the contents of this file to verify that the job ran successfully.
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 4 c2-standa  demo.sh      joe CF       0:00      1 rcc-demo-compute-0-1

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

$ cat slurm-4.out

 _____________
< Hello World >
 -------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

In this codelab, you created an auto-scaling, cloud-native HPC cluster and learned how to run Docker images in Singularity containers and how to run interactive and batch jobs using Slurm.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab:

Delete the project

The easiest way to eliminate billing is to delete the project you created for the codelab.

Caution: Deleting a project deletes everything in it. If you used an existing project rather than one created for this codelab, you will also delete any other work you have done in that project.

If you plan to explore multiple codelabs and quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Cloud Console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.
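Alternatively, if you prefer the command line, the project can be deleted with gcloud; this permanently removes the project and everything in it (use the RCC_PROJECT value set earlier, or substitute your project ID):

gcloud projects delete ${RCC_PROJECT}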

Delete the individual resources

  1. Open your Cloud Shell and navigate to the Terraform directory you used to deploy the cluster
cd ~/research-computing-cluster/tf/rcc-centos
  2. Run make destroy to delete all of the resources.
make destroy
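Once the destroy completes, you can optionally confirm that no cluster instances remain (assuming RCC_ZONE is still set in your shell):

gcloud compute instances list --zones ${RCC_ZONE}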