Last Updated: 2022-09-27

What you will build

In this codelab, you will port a small C++ application to GPU hardware using HIP. You will transition a serial CPU-only mini-application to a portable GPU-accelerated application, using HIP provided through ROCm's hipcc compiler.

The goal of this codelab is to introduce you to a few basic HIP API calls and kernel-launch concepts, and to a development practice that can be applied to porting other applications.

What you will learn

  1. How to profile a CPU application with gprof and interpret the flat profile and call tree
  2. How to allocate GPU memory and copy data between the host and device with the HIP API
  3. How to write HIP kernels and launch them from the host
  4. How to verify GPU activity with rocprof (AMD) or nvprof (Nvidia)

What you will need

  1. A system with an AMD or Nvidia GPU
  2. The ROCm HIP toolchain (hipcc and rocprof); on Nvidia systems, the CUDA toolkit as well
  3. gcc, make, and git

The demo application provided for this tutorial performs 2-D smoothing operations using a 5x5 Gaussian stencil.

In this section, we introduce the demo application and walk through building and verifying the example. It is important to make sure that the code produces the expected result, as we will use the CPU-generated model output to verify that the solution does not change when we port to the GPU.

This application executes a 2-D smoothing operation on a square grid of points. The program proceeds as follows:

  1. Process command line arguments
  2. Allocate memory for smoother class - 5x5 stencil with Gaussian weights
  3. Allocate memory for function and smoothed function
  4. Initialize function on CPU and report function to file
  5. Call smoothing function
  6. Report smoothed function to file
  7. Clear memory

Code Structure

This application's src directory contains the following files:

  1. smoother.cpp : Defines a simple data structure that stores the smoothing operator's weights and the routines for allocating memory, deallocating memory, and executing the smoothing operation.
  2. main.cpp : Defines the main program that sets up the 2-D field to be smoothed and manages file IO.
  3. Makefile : A simple makefile to build the application binary smoother.
  4. viz.py : A Python script for creating plots of the smoother output.

Install and Verify the Application

To get started, we want to make sure that the application builds and runs on your system using the gcc compiler.

  1. Clone the repository.
$ git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
  2. Build the smoother application. Keep in mind that the compiler is set to gcc by default in the provided makefile.
$ cd ~/scientific-computing-edu/samples/c++/smoother/src
$ make
  3. Test run the example. The application takes two arguments: the first is the number of grid cells, and the second is the number of times the smoothing operator is applied.
$ ./smoother 1000 10

Profile the Application

Before starting any GPU porting exercise, it is important to profile your application to find hotspots where your application spends most of its time. Further, it is helpful to keep track of the runtime of the routines in your application so that you can later assess whether or not the GPU porting has resulted in improved performance. Ideally, your GPU-accelerated application should outperform the CPU-only version even when the CPU-only version fully subscribes the available CPUs on a compute node.

Create the profile

In this tutorial, we are going to generate a profile and call graph using gprof. The provided makefile is already configured to create profile output (via gcc's -pg flag, which instruments the binary so that running it writes a gmon.out file). Since you ran smoother in the previous section, gmon.out is already present; from here, you just need to use gprof to create the application profile.

$ gprof ./smoother gmon.out

Interpret the profile and call tree

gprof provides a flat profile and a summary of your application's call structure indicating dependencies within your source code as a call tree. A call tree depicts the relationships between routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.

The first section of the gprof output is the flat-profile. An example flat-profile for the smoother application is given below. The flat-profile provides a list of routines in your application, ordered by the percent time your program spends within those routines from greatest to least. Beneath the flat-profile, gprof provides documentation of each of the columns for your convenience.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 95.24      1.16     1.16       10   116.19   116.19  smoothField
  2.46      1.19     0.03       10     3.00     3.00  resetF
  2.46      1.22     0.03                             main
  0.00      1.22     0.00        1     0.00     0.00  smootherFree
  0.00      1.22     0.00        1     0.00     0.00  smootherInit

Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. The name field for each entry also lists the routines that call it and the routines that it calls.

For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. Further, the called column shows that smoothField and resetF are each called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main and 1.19s calling other routines. Of those 1.19s, 1.16s are spent in smoothField and 0.03s in resetF.

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.03    1.19                 main [1]
                1.16    0.00      10/10          smoothField [2]
                0.03    0.00      10/10          resetF [3]
                0.00    0.00       1/1           smootherInit [5]
                0.00    0.00       1/1           smootherFree [4]
-----------------------------------------------
                1.16    0.00      10/10          main [1]
[2]     95.1    1.16    0.00      10         smoothField [2]
-----------------------------------------------
                0.03    0.00      10/10          main [1]
[3]      2.5    0.03    0.00      10         resetF [3]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[4]      0.0    0.00    0.00       1         smootherFree [4]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[5]      0.0    0.00    0.00       1         smootherInit [5]
-----------------------------------------------

Next steps

Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will use the hipcc compiler provided with ROCm to build for GPUs, we want to first modify the Makefile to use hipcc. Then, we will focus on porting the smoothField routine and the necessary data to the GPU, since smoothField takes up the majority of the run time.

When we port this routine, we will introduce data allocation on the GPU and data copies between the CPU and GPU. This data movement may increase the overall application runtime, even if the smoothField routine itself performs better. In that event, we will then work on minimizing data movements between the CPU and GPU.

In the smoother application, we have seen that the smoothField routine, called by main, takes up the most time. Looking at the function call in main.cpp and the smoothField routine in smoother.cpp, we see that this routine takes in a pointer to the smoother object, real array pointers f and smoothF, and integers nx and ny that are passed by value.

  for( int iter=0; iter<nIter; iter++){
    // Run the smoother
    smoothField( &smoothOperator, f, smoothF, nx, ny );
    // Reassign smoothF to f
    resetF( f, smoothF, nx, ny );
  }

In order to offload smoothField to the GPU, we will need to copy the smoothOperator class data and the f array to the GPU. After calling smoothField, we will eventually want to copy smoothF back to the CPU before calling resetF.

Copy smoothOperator class data to the GPU

The smoothField routine uses the smoothOperator->weights array when applying the operator. Because of this, you will need to declare and allocate a device copy of the weights array. After filling in the weights values on the CPU, you can copy the values over to the device array.

  1. Include the hip runtime header at the top of smoother.cpp so that you can make HIP API calls.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
  2. Declare a real pointer member in the smoother struct in smoother.h to hold a device copy of the smoother weights. The struct itself lives in host memory; only the memory this pointer refers to will reside on the device.
typedef struct smoother{
  int dim;
  real *weights;
  real *weights_dev;
}smoother;
  3. Insert a call to hipMalloc in smootherInit to allocate weights_dev on the GPU, and a call to hipMemcpy to copy weights into weights_dev. A note on error checking follows the code block.
  // Allocate space for the device copy of the smoothing weights
  hipMalloc(&smoothOperator->weights_dev,N*N*sizeof(real));
  // Copy weights from the host to the device
  hipMemcpy(smoothOperator->weights_dev,
            smoothOperator->weights,
            N*N*sizeof(real),
            hipMemcpyHostToDevice);
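Note that HIP API calls return a hipError_t status code, which this codelab ignores for brevity. While porting, it can save debugging time to check these return values. Below is a minimal sketch of an error-checking macro; the HIP_CHECK name is our own choice, not part of the HIP API, and smoother.cpp already includes stdio.h and stdlib.h for fprintf and exit.

// Minimal error-checking helper for HIP API calls (hypothetical name)
#define HIP_CHECK(call)                                        \
  do {                                                         \
    hipError_t err = (call);                                   \
    if (err != hipSuccess) {                                   \
      fprintf(stderr, "HIP error %s at %s:%d\n",               \
              hipGetErrorString(err), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                      \
    }                                                          \
  } while (0)

// Example usage:
//   HIP_CHECK(hipMalloc(&smoothOperator->weights_dev, N*N*sizeof(real)));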
  4. Insert a call to hipFree in smootherFree to deallocate the GPU memory held by weights_dev.
  hipFree(smoothOperator->weights_dev);
  5. Before going too much further, let's check that the code compiles and runs with these changes. Starting with the smoother Makefile, first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference the full path to the hipcc compiler.

    When setting these variables, we use the ?= operator so that a user's environment variables can override these values if desired.
ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda


Change the compiler to hipcc and save your changes.

CC=$(ROCM)/bin/hipcc
CFLAGS=-O0 -g

Once you have completed the code and Makefile modifications, you can compile smoother and verify that data is allocated on and copied to the GPU.

  1. Move the existing CPU output to a reference/ subdirectory for later comparison. Whenever we make a change to the code, we will compare output with this reference data.
$ mkdir ./reference
$ mv function.txt smooth-function.txt ./reference/
  2. Remove the *.o files and the smoother binary to ensure a clean build, and make a new smoother binary.
$ make clean && make smoother
  3. Run smoother with the same input parameters as you did in the previous section and verify the output is unchanged. We use the diff command line utility to compare the output files and the reference files. If there are no differences, diff will produce no output.
$ ./smoother 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
  4. Verify that data is allocated on the GPU and that data is copied from the CPU to GPU by using a profiler.

    To profile, use rocprof with the --sys-trace and --stats flags. Running rocprof will create a file called results.json that contains the data for a trace profile. Additionally, results.stats.csv and results.hip_stats.csv will contain hotspot analyses for HIP kernels and HIP API calls, respectively.
$ rocprof --sys-trace --stats ./smoother 1000 100
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,1,12503928,12503928,95.8586582624
hipMalloc,3,387867,129289,2.9734984242
hipFree,3,152335,50778,1.16784331343
  5. Commit your changes to your local git repository.
$ git add smoother.h smoother.cpp Makefile && git commit

Copy f and smoothF to the GPU

  1. Include the hip runtime header at the top of main.cpp so that you can make HIP API calls.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
  2. Declare real pointers f_dev and smoothF_dev in main.cpp.
int main( int argc, char *argv[] )  {
  smoother smoothOperator;
  int nx, ny, nElements;
  int nIter;
  real dx;
  real *f, *smoothF;
  real *f_dev, *smoothF_dev;
  3. Insert calls to hipMalloc for f_dev and smoothF_dev.
  // Create the smoother
  smootherInit(&smoothOperator);

  // Allocate space for the function we want to smooth
  f  = (real*)malloc( nElements*sizeof(real) );
  smoothF = (real*)malloc( nElements*sizeof(real) );

  hipMalloc(&f_dev, nElements*sizeof(real));
  hipMalloc(&smoothF_dev, nElements*sizeof(real));
  4. Insert a hipMemcpy call to update f_dev (host to device) prior to calling smoothField, and another to update smoothF from smoothF_dev (device to host) after calling smoothField. Note that nothing writes to smoothF_dev yet, so this second copy will overwrite the CPU result; the output will not be correct again until the kernel is in place in the next section, which is why we only verify the build below.
    hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);

    smoothField( &smoothOperator, f, smoothF, nx, ny );

    hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
  5. Last, insert calls to hipFree at the end of the main function to free the memory held by the f_dev and smoothF_dev pointers.
  // Free space
  free(f);
  free(smoothF);
  hipFree(f_dev);
  hipFree(smoothF_dev);
  6. Verify that the application still builds.
$ make
  7. (Optional) You can verify that data is being copied between the CPU and GPU by using a profiler.

    For AMD platforms, use rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show calls to hipMalloc, hipMemcpy, and hipFree.
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,22,43602127,1981914,98.9534503691
hipMalloc,3,287919,95973,0.653421757999
hipFree,3,173225,57741,0.393127872872


For Nvidia platforms, use nvprof. At this stage, you should see three calls to cudaMalloc, three calls to cudaFree, 10 calls to cudaMemcpy (Device to Host), and 11 calls to cudaMemcpy (Host to Device): each of the 10 iterations copies f in and smoothF out, and the extra Host to Device copy is the one-time transfer of the smoothing weights.

$ nvprof ./smoother 1000 10
==23287== NVPROF is profiling process 23287, command: ./smoother 1000 10
==23287== Profiling application: ./smoother 1000 10
==23287== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.12%  4.6421ms        10  464.21us  427.18us  589.69us  [CUDA memcpy DtoH]
                   47.88%  4.2636ms        11  387.60us  1.4720us  511.96us  [CUDA memcpy HtoD]
      API calls:   93.92%  194.18ms         3  64.727ms  135.75us  193.88ms  cudaMalloc
                    5.20%  10.757ms        21  512.25us  22.255us  778.80us  cudaMemcpy
                    0.39%  809.35us        97  8.3430us     537ns  396.19us  cuDeviceGetAttribute
                    0.25%  526.66us         1  526.66us  526.66us  526.66us  cuDeviceTotalMem
                    0.16%  339.77us         3  113.26us  2.7010us  176.22us  cudaFree
                    0.06%  115.21us         1  115.21us  115.21us  115.21us  cuDeviceGetName
                    0.00%  4.4240us         3  1.4740us     700ns  2.9380us  cuDeviceGetCount
                    0.00%  4.3810us         1  4.3810us  4.3810us  4.3810us  cuDeviceGetPCIBusId
                    0.00%  2.5650us         2  1.2820us     662ns  1.9030us  cuDeviceGet
                    0.00%  1.0790us         1  1.0790us  1.0790us  1.0790us  cuDeviceGetUuid

Next steps

At this point, you now have the necessary data declared on the GPU. Additionally, you used hipMemcpy to make the input to smoothField available on the GPU. In the next step, you will create a HIP kernel that will run the smoothField algorithm on the GPU and replace the call to smoothField with a call to launch this kernel.

In this section, you will offload the smoothField routine to the GPU.

Planning the GPU port

Let's look at the smoothField routine from smoother.cpp:

void smoothField( struct smoother *smoothOperator, real *f, real *smoothF, int nx, int ny )
{
  int iel, ism;
  int N = (real)smoothOperator->dim;
  int buf = (real)(smoothOperator->dim-1)/2.0;
  real smLocal;

  for( int j=buf; j < ny-buf; j++ ){
    for( int i=buf; i < nx-buf; i++ ){
      smLocal = 0.0;
      for( int jj=-buf; jj <= buf; jj++ ){
        for( int ii=-buf; ii <= buf; ii++ ){
          iel = (i+ii)+(j+jj)*nx;
          ism = (ii+buf) + (jj+buf)*N;
          smLocal += f[iel]*smoothOperator->weights[ism];
        }
      }
      iel = i+j*nx;
      smoothF[iel] = smLocal;
    }
  }
}

The outer loops, over i and j, are tightly nested loops over a 2-D grid. The sizes of these loops are ny-2*buf and nx-2*buf. The values of nx and ny are determined by the user through the first command line argument (we have been using 1000), and buf is 2 (smoothOperator->dim=5; see smootherInit). For nx = ny = 1000, that gives 996 x 996 (roughly one million) interior points, each of which performs a 25-point weighted reduction. Within the i and j loops, we carry out a reduction operation for smLocal and then assign the value to each element of smoothF.

In the smoothField algorithm, the order in which we execute the i and j loops does not matter. Further, the size of each loop is O(1000) for the example we're working with. A reasonable strategy for offloading this routine to the GPU is to have each GPU thread execute the instructions within the i and j loops. Ideally, then, we want each thread to execute something like the following

    real smLocal = 0.0;
    for( int jj=-buf; jj <= buf; jj++ ){
      for( int ii=-buf; ii <= buf; ii++ ){
        iel = (i+ii)+(j+jj)*nx;
        ism = (ii+buf) + (jj+buf)*N;
        smLocal += f[iel]*smoothOperator->weights[ism];
      }
    }
    iel = i+j*nx;
    smoothF[iel] = smLocal;

Notice that the i and j loops are now gone. Within the HIP kernel, we can calculate i and j from threadIdx.[x,y], blockIdx.[x,y], and blockDim.[x,y], assuming that we will launch the kernel with 2-D grid and block dimensions. You can use something like the following to calculate i and j; the buf offset maps thread (0,0) of block (0,0) to the first interior point.

  size_t i = threadIdx.x + blockIdx.x*blockDim.x+buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y+buf;

Within the main program, you will be able to launch the GPU kernel, but you will need to calculate the grid and block dimensions. For now, let's assume that the number of threads-per-block in the i and j loop dimensions (x and y directions) is fixed at 16. With the number of threads-per-block (in each direction) chosen, you can calculate the grid dimensions by requiring that the total number of threads in each direction (the grid dimension times the threads-per-block) is greater than or equal to the corresponding loop size.

  int buf = (smoothOperator.dim-1)/2;
  int threadsPerBlockX = 16;
  int threadsPerBlockY = 16;
  int gridDimX = (nx-2*buf)/threadsPerBlockX + 1;
  int gridDimY = (ny-2*buf)/threadsPerBlockY + 1;
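Note that the +1 here launches one extra block even when the loop size is an exact multiple of the block size; the bounds check we will add inside the kernel makes the extra threads harmless. For nx = 1000 and buf = 2, for example, gridDimX = 996/16 + 1 = 63 blocks, or 1008 threads in x to cover 996 points. A common alternative is the ceiling-division idiom, sketched below, which avoids the spare block in the evenly divisible case.

  // Ceiling division: rounds up to cover all interior points without
  // adding a full extra block when the loop extent divides evenly by
  // the block size.
  int gridDimX = (nx - 2*buf + threadsPerBlockX - 1)/threadsPerBlockX;
  int gridDimY = (ny - 2*buf + threadsPerBlockY - 1)/threadsPerBlockY;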

Implement the Changes

  1. Add a new routine, smoothField_gpu, to smoother.cpp. This routine needs to be decorated with the __global__ declaration specifier so that it can be launched on the GPU (device) from the CPU (host). Note that the kernel takes the device weights pointer directly, rather than the smoother struct, because the struct itself resides in host memory.
__global__ void smoothField_gpu( real *weights, real *f, real *smoothF, int nX, int nY, int N )
{
  int buf = (N-1)/2;
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  int iel, ism;
  
  if( i < nX-buf && j < nY-buf ){
    real smLocal = 0.0;
    for( int jj=-buf; jj <= buf; jj++ ){
      for( int ii=-buf; ii <= buf; ii++ ){
        iel = (i+ii)+(j+jj)*nX;
        ism = (ii+buf) + (jj+buf)*N;
        smLocal += f[iel]*weights[ism];
      }
    }
    iel = i+j*nX;
    smoothF[iel] = smLocal;
  }
}
  2. Calculate the block and grid dimensions in main.cpp. You can place this block of code just before the iteration loop.
  int buf = (smoothOperator.dim-1)/2;
  int tX = 16;
  int tY = 16;
  int gX = (nx-2*buf)/tX + 1;
  int gY = (ny-2*buf)/tY + 1;
  dim3 threads(tX,tY,1);
  dim3 blocks(gX,gY,1);
  3. Replace the smoothField call in main.cpp with a kernel launch for smoothField_gpu; see the note after this code block.
    // Copy f from host to device : f is input to `smoothField`
    hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);

    // Run the smoother
    smoothField_gpu<<<blocks,threads>>>(
                        smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );

    // Copy smoothF_dev from device to host
    hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
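Keep in mind that kernel launches are asynchronous: control returns to the host immediately, and it is the blocking hipMemcpy afterwards that implicitly waits for the kernel to finish on the default stream. While debugging, you may want to surface launch errors explicitly; a minimal sketch:

    // Kernel launch errors (e.g. invalid grid/block dimensions) surface
    // asynchronously; synchronize and poll hipGetLastError to see them here.
    hipDeviceSynchronize();
    hipError_t launchErr = hipGetLastError();
    if (launchErr != hipSuccess) {
      printf("smoothField_gpu launch failed: %s\n", hipGetErrorString(launchErr));
    }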
  4. Add the smoothField_gpu declaration to smoother.h. Because both main.cpp and smoother.cpp are compiled with hipcc, this shared declaration lets main launch the kernel defined in smoother.cpp.
__global__ void smoothField_gpu( real* weights, real *f, real *smoothF, int nx, int ny, int N);
  5. Rebuild the application.
$ make clean && make
  6. Run the application, verify that the output still matches the reference data, and obtain a profile.
$ rocprof --sys-trace --stats ./smoother 1000 10

Next steps

Congratulations! So far, you've learned how to allocate and manage memory on the GPU and how to launch a GPU kernel. Right now, we have the code in a state where, every iteration, data is copied to the GPU before calling smoothField_gpu and from the GPU after calling smoothField_gpu. This situation happens quite often when porting to GPUs for the first time.

The next step in this codelab is to offload the resetF routine to the GPU, even though it does not take up a lot of time. We want to offload it so that we can move the hipMemcpy calls outside of the iteration loop in main and reduce the number of times data is transferred across the PCIe bus.

In this section, we are going to offload the resetF routine in smoother.cpp to the GPU so that we can migrate hipMemcpy calls outside of the iteration loop in main.cpp. By this point, you have worked through the mechanics for porting a routine to the GPU. Additionally, for this application, we already have all of the necessary data on the GPU that the resetF routine depends on.

  1. Add the resetF_gpu declaration to smoother.h.
__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf );
  2. Add resetF_gpu to smoother.cpp; see the note after this code block.
__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf )
{
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  if( i < nX-buf && j < nY-buf ){
    // Only interior points are reset; the boundary halo of f is preserved.
    int iel = i + nX*j;
    f[iel] = smoothF[iel];
  }
}
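It may be tempting to replace this kernel with a single device-to-device hipMemcpy. Note, however, that resetF_gpu copies only interior points, preserving the boundary halo of f that smoothField_gpu reads on the next iteration; a full-array copy would also overwrite that halo with smoothF's boundary values, which no kernel ever writes. The commented-out sketch below is shown only to illustrate the pitfall.

    // Simpler, but NOT equivalent: this full-array copy would also
    // overwrite the boundary halo of f_dev with smoothF_dev's
    // boundary values, which smoothField_gpu never writes.
    // hipMemcpy(f_dev, smoothF_dev, nElements*sizeof(real),
    //           hipMemcpyDeviceToDevice);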
  3. Replace resetF with resetF_gpu in main.cpp, move the hipMemcpy host-to-device call before the iteration loop, and move the hipMemcpy device-to-host call after the iteration loop.
  hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);

  for( int iter=0; iter<nIter; iter++){

    // Run the smoother
    smoothField_gpu<<<blocks,threads>>>(
                        smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );

    // Reassign smoothF to f
    resetF_gpu<<<blocks,threads>>>(f_dev, smoothF_dev, nx, ny, buf );
  }
  // Copy smoothF_dev from device to host
  hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
  4. Recompile the smoother application.
$ make
  5. Run the application again with the profiler and verify that the output still matches the reference data. When you examine the results.hip_stats.csv file, you should see significantly fewer memory copies between the host and device: only the one-time weights copy, the initial copy of f, and the final copy of smoothF remain.
$ rocprof --sys-trace --stats ./smoother 1000 10

In this codelab, you learned how to port serial CPU-only routines in C++ to GPUs using HIP. To do this, you created device copies of CPU arrays and learned how to copy data from the CPU to the GPU and vice versa. You also learned how to write HIP kernels and launch them from the host.

In the process of doing this, you practiced a strategy for porting to GPUs that included the following steps to make incremental changes to your own source code :

  1. Profile - Find out the hotspots in your code and understand the dependencies with other routines
  2. Plan - Determine what routine you want to port, what data needs to be present on the GPU, and what data needs to be copied back to the CPU after execution
  3. Implement & Verify - Create the necessary device data, insert the appropriate hipMemcpy calls, write an equivalent GPU kernel, and launch it from the host. Run your application's tests and verify the results are correct. Check with a profiler that the new kernel and the necessary hipMemcpy calls are being executed.
  4. Commit - Once you have verified correctness and the expected behavior, commit your changes and start the process over again.

Provide Feedback

If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com.

Further reading