Last Updated: 2022-09-28

What you will build

In this codelab, you will port a small C++ application to GPU hardware using OpenMP. You will transition a serial CPU-only mini-application to a portable GPU accelerated application, using OpenMP provided through the AOMP compiler.

The goal of this codelab is to introduce you to using a few basic OpenMP directives and a development practice that can be applied to porting other applications.

What you will learn

What you will need

The demo application provided for this tutorial performs 2-D smoothing operations using a 3x3 gaussian stencil.

In this section, we introduce the demo application and walk through building and verifying the example. It's important to make sure that the code produces the expected result as we will be using the CPU generated model output to ensure that the solution does not change when we port to the GPU.

This application executes a 2-D smoothing operation on a square grid of points. The program proceeds as follows

  1. Process command line arguments
  2. Allocate memory for smoother class - 5x5 stencil with Gaussian weights
  3. Allocate memory for function and smoothed function
  4. Initialize function on CPU and report function to file
  5. Call smoothing function
  6. Report smoothed function to file
  7. Clear memory

Code Structure

This application's src directory contains the following files

  1. smoother.cpp : Defines a simple data structure that stores the smoothing operators weights and the routines for allocating memory, deallocating memory, and executing the smoothing operation.
  2. main.cpp : Defines the main program that sets up the 2-D field to be smoothed and managed file IO.
  3. Makefile : A simple makefile is to build the application binary smoother.
  4. viz.py : A python script for creating plots of the smoother output

Install and Verify the Application

To get started, we want to make sure that the application builds and runs on your system using the gcc compiler.

  1. Clone the repository
$ git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
  1. Build the smoother application. Keep in mind, the compiler is set to gcc by default in the provided makefile.
$ cd samples/c++/smoother/src
$ make
  1. Test run the example. The application takes two arguments. The first argument is the number of grid cells, and the second argument is the number of times the smoothing operator is applied.

$ ./smoother 1000 10

Profile the Application

Before starting any GPU porting exercise, it is important to profile your application to find hotspots where your application spends most of its time. Further, it is helpful to keep track of the runtime of the routines in your application so that you can later assess whether or not the GPU porting has resulted in improved performance. Ideally, your GPU-Accelerated application should outperform CPU-Only versions of your application when fully subscribed to available CPUs on a compute node.

Create the profile

In this tutorial, we are going to generate a profile and call graph using gprof. The provided makefile was already configured to create profile output. From here, you just need to use gprof to create the application profile.

$ gprof ./smoother gmon.out

Interpret the profile and call tree

gprof provides a flat profile and a summary of your application's call structure indicating dependencies within your source code as a call tree. A call tree depicts the relationships between routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.

The first section of the gprof output is the flat-profile. An example flat-profile for the smoother application is given below. The flat-profile provides a list of routines in your application, ordered by the percent time your program spends within those routines from greatest to least. Beneath the flat-profile, gprof provides documentation of each of the columns for your convenience.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 95.24      1.16     1.16       10   116.19   116.19  smoothField
  2.46      1.19     0.03       10     3.00     3.00  resetF
  2.46      1.22     0.03                             main
  0.00      1.22     0.00        1     0.00     0.00  smootherFree
  0.00      1.22     0.00        1     0.00     0.00  smootherInit

Let's now take a look at at the call tree. This call tree has five entries, one for each routine in our program. The right-most field for each entry indicates the routines that called each routine and that are called by each routine.

For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. Further, the called column indicates that smoothField and resetF routines are shown to be called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main and 1.19s in calling other routines. Further, of those 1.19s, 1.16s are spent in smoothField and 0.03 are spent in resetF.

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.03    1.19                 main [1]
                1.16    0.00      10/10          smoothField [2]
                0.03    0.00      10/10          resetF [3]
                0.00    0.00       1/1           smootherInit [5]
                0.00    0.00       1/1           smootherFree [4]
-----------------------------------------------
                1.16    0.00      10/10          main [1]
[2]     95.1    1.16    0.00      10         smoothField [2]
-----------------------------------------------
                0.03    0.00      10/10          main [1]
[3]      2.5    0.03    0.00      10         resetF [3]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[4]      0.0    0.00    0.00       1         smootherFree [4]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[5]      0.0    0.00    0.00       1         smootherInit [5]
-----------------------------------------------

Next steps

Now that we have a profile and an understanding of the call structure of the application, we can now plan our port to GPUs. Since we will use the AOMP compiler for offloading to GPUs, we want to first modify the Makefile to use the AOMP compiler. Then, we will focus on porting the smoothField routine and the necessary data to the GPU, since smoothField takes up the majority of the run time.

When we port this routine, we will introduce data allocation on the GPU and data copies between CPU and GPU. This data movement may potentially increase the overall application runtime, even if the smoothField routine performs better. In this event, we will then work on minimizing data movements between CPU and GPU.

Before jumping straight into GPU offloading with OpenMP, you will take an incremental step to change the compiler and verify the application can be compiled and executed with the amdclang compiler. Once this is verified, you will then start the GPU offloading process.

ROCm comes with compilers ( amdflang, amdclang, and amdclang++ ) that support OpenMP 5.0. To enable GPU offloading at compile time, there are a few flags that you need to pass to the compiler.

amdclang -fopenmp \
         -fopenmp-targets=[target] \
         -Xopenmp-target=[target] \
         -march=[gpu-arch]
         [other options]
         [source-code]

In this example,

ROCm also comes with a helpful tool (mygpu) that can be used to detect the GPU architecture. This is particularly useful if you are building the code on a machine that has the GPU you want to build for.

In this section, you will make the following changes to the Makefile

  1. Starting from the smoother makefile (samples/fortran/smoother/src/Makefile), Let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference full paths to the compiler and mygpu binary.

    When setting these variables, we use the ?= relation to allow a user's environment variables to override these values if desired.
ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda
  1. Next, change the specification for the C compiler, by setting CC ?= $(ROCM)/bin/amdclang . The first three lines of your Makefile should look like this :
ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda
CC ?= $(ROCM)/bin/amdclang
  1. In ROCm 4.5.0, it is necessary to remove the -g flag from the FFLAGS definition in the Makefile.
  1. Let's now work on a section for detecting the GPU architecture. The mygpu utility can be used to detect a GPU, if one is present. The -d flag is used to set the default architecture to fall back to in case a GPU is not present.

    For example, mygpu -d gfx908 will check for a GPU and return gfx908 if one is not found. In the make system, we want to also allow for someone building the code to specify the target architecture, in case they are on a system that does not have a GPU equipped.

    Add the following section of code to your Makefile just beneath the definition of FC. This section sets GPU_ARCH to the output of mygpu -d gfx908, if it is not set in the user's environment. The -d gfx908 flag indicates that the default GPU architecture will be set to gfx908, if one is not found on the system you are building on.
# Detect the GPU architecture
GPU_ARCH ?= $(shell $(ROCM)/bin/mygpu -d gfx908)
  1. Now that we have the GPU architecture, we can set the GPU target. Nvidia GPU architectures are all defined with a prefix of sm_ . We can use this with the findstring function to set the GPU target accordingly.

    Add the following section of code to your Makefile just beneath the definition of GPU_ARCH. This section sets the GPU_TARGET variable to nvptx64-nvidia-cuda when an Nvidia GPU is detected and amdgcn-amd-amdhsa otherwise. Additionally, this appends the CUDA runtime library to LFLAGS in the event that you are building for an Nvidia platform.
ifeq (sm_,$(findstring sm_,$(GPU_ARCH)))
  GPU_TARGET = nvptx64-nvidia-cuda
  LFLAGS += -L$(CUDA)/targets/x86_64-linux/lib -lcudart
else
  GPU_TARGET = amdgcn-amd-amdhsa
endif
  1. Now you can append the OpenMP offload flags to the CFLAGS variable. Just beneath the GPU_TARGET definition, add the following code to append to the CFLAGS variable.
CFLAGS += -fopenmp -fopenmp-targets=$(GPU_TARGET) -Xopenmp-target=$(GPU_TARGET) -march=$(GPU_ARCH)

Verify the application compiles and runs

Now that you have made the necessary modifications to the Makefile, it is time to re-compile and test the application. You also want to make sure that the application output is unchanged.

  1. Copy the existing output from your previous run to a reference directory.
$ mkdir reference
$ cp function.txt smooth-function.txt reference/
  1. Re-compile the smoother application.
$ make clean
$ make
  1. Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.
$ ./smoother 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/function.txt

Next Steps

Now that you've switched to using the amdclang compiler and have verified the application successfully compiles and runs and produces the correct output, you are ready to begin offloading to GPUs with OpenMP. In the next step, you will offload the smoothField and resetF routines using OpenMP directives.

In the smoother application, we have seen that the smoothField routine, called by main, takes up the most time. Within the main iteration loop in main.cpp, the resetF function is called to update the input for smoother for the next iteration.

You will start by offloading both the smoothField and resetF routines to the GPU using OpenMP directives (also called "pragmas"). In this section you will learn how to offload sections of code to the GPU and how to manage GPU data using OpenMP pragmas.

Offload smoothField

  1. Open smoother.c and navigate to the smoothField routine. Open an OpenMP target region before the start of the first loop in smoothField and map the necessary map directives to copy smoother->weights and f to the GPU and smoothF to and from the GPU.
  #pragma omp target  map(to:smoothOperator->weights[0:N*N], f[0:nX*nY]) map(smoothF[0:nX*nY])
  1. Use a teams parallel for directive with a collapse(2) clause to parallelize the outer two loops.
  #pragma omp target  map(to:smoothOperator->weights[0:N*N], f[0:nX*nY]) map(smoothF[0:nX*nY])
  {
    #pragma omp teams distribute parallel for collapse(2) num_threads(256)
    for( int j=buf; j < nY-buf; j++ ){
      for( int i=buf; i < nX-buf; i++ ){
        smLocal = 0.0;
        for( int jj=-buf; jj <= buf; jj++ ){
          for( int ii=-buf; ii <= buf; ii++ ){
            iloc = (i+ii)+(j+jj)*nX;
            ism = (ii+buf) + (jj+buf)*N;
            smLocal += f[iloc]*smoothOperator->weights[ism];
          }
        }
        iel = i+j*nX;
        smoothF[iel] = smLocal;
      }
    }
  }
  1. Re-compile the smoother application.
$ make
  1. Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.
$ time ./smoother 1000 10
real        0m2.767s
user        0m1.568s
sys        0m1.057s
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/function.txt

Offload resetF

  1. Open smoother.cpp and navigate to the resetF routine. Open an OpenMP target region before the start of the first loop in resetF and map the necessary map directives to copy smoothF to the GPU and f to and from the GPU.
  #pragma omp target map(to: smoothF[0:nx*ny]) map(f[0:nx*ny])
  1. Use a teams parallel for directive with a collapse(2) clause to parallelize the outer two loops.
  #pragma omp target map(to: smoothF[0:nx*ny]) map(f[0:nx*ny])
  {
    #pragma omp teams distribute parallel for collapse(2) num_threads(256)
    for( int iy=buf; iy<ny-buf; iy++ ){
      for( int ix=buf; ix<nx-buf; ix++ ){
        iel = ix + nx*iy;
        f[iel] = smoothF[iel];
      }
    }
  }
  1. Re-compile the smoother application.
$ make
  1. Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.
$ time ./smoother 1000 10
real        0m2.970s
user        0m1.528s
sys        0m1.228s
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/function.txt
  1. To profile, you can profile the application using the rocprof profiler. If you would like to create a trace profile, add the --hsa-trace flags. If you would like to get a summary hotspot profile of the GPU kernels, use the --stats flag.
$ rocprof --hsa-trace  --stats ./smoother 1000 10
$ cat results.stat.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"__omp_offloading_801_440b81_smoothField_l67.kd",10,30603997,3060399,78.4420113965
"__omp_offloading_801_440b81_resetF_l48.kd",10,8410807,841080,21.5579886035

Next steps

You've successfully offloaded two routines to the GPU. However, you may have noticed that the runtime did not improve much, and may have even gotten worse, after you offloaded the second routine (resetF). At the start and end of each target region, the application is copying data between the CPU and GPU. You can see this behavior in the trace profile shown above. Ideally, you want to minimize data movement between the host and device for optimal performance.

In the next section, you will learn how to control when data is allocated and moved to and from the GPU. This will help you minimize data copies between the host and device that often become bottlenecks for GPU accelerated applications.

In this section you will learn how to use unstructured data directives with OpenMP to control when data is copied to and from the GPU.

In the smoother application, there are two routines within a main iteration loop, smoothField and resetF. Both routines operate on data stored in two arrays, f and smoothF.

  for( int iter=0; iter<nIter; iter++){
    // Run the smoother
    smoothField( &smoothOperator, f, smoothF, nx, ny );
    // Reassign smoothF to f
    resetF( f, smoothF, nx, ny, buf );
  }

Additionally, the smoothField routine requires the smoothOperator->weights array in order to calculate smoothF from f. Currently, target regions within smoothField and resetF copy these arrays to and from the GPU, before and after executing the routine instructions in parallel on the GPU; this is also done every iteration.

OMP Enter/Exit Data

Ideally, we want to have all of the necessary data copied to the GPU before the iteration loop and have smoothF copied from the GPU after the iteration loop. This can be achieved using the target enter data and target exit data directives.

Each directive is a standalone directive that can be used to allocate or deallocate memory on the GPU and copy data to or from the GPU. A typical usage is to use the target enter data directive to allocate device memory after allocation on the host and to use the target exit data directive to free device memory before freeing memory on the host. Then, you can use the target update directive to manage updating host and device data when needed.

In this example below, the enter data directive is used to allocate device memory for arrayIn and arrayOut. Before reaching the main block of code, the target update directive is used to update arrayIn on the device. At the end of this region of code, the target update directive is used to update arrayOut on the host. At the end of the example code, the exit data directive is used to free device memory before freeing the associate host pointer.

int N = 1000;
float *arrayIn;
float *arrayOut;
arrayIn = (float*)malloc( N*sizeof(float) );
arrayOut = (float*)malloc( N*sizeof(float) );
# omp pragma target enter data map(alloc:arrayIn[0:N], arrayOut[0:N])

// Initialization routines ... //
.
.
// End Initialization routines
#omp pragma target update to(arrayIn[0:N]) 
{
// Execution block
.
.
}
#omp pragma target update from(arrayOut[0:N]) 
.
.
# omp pragma target exit data map(delete:arrayIn[0:N], arrayOut[0:N])
free(arrayIn);
free(arrayOut);

Transition to enter/exit data directives

In the smoother application, we want to explicitly control data movement for f, smoothF, and smoothOperator->weights. You will work in main.cpp to insert calls to allocate, update, and deallocate device memory for f and smoothF. To handle smoothOperator->weights, you will work in smoother.cpp to allocate, update, and deallocate device memory.

  1. Open main.cppand find where f and smoothF are allocated memory. Just after the malloc calls, add a target enter data directive to allocate device memory for f and smoothF.
  // Allocate space for the function we want to smooth
  f  = (real*)malloc( nElements*sizeof(real) );
  smoothF = (real*)malloc( nElements*sizeof(real) );
  #pragma omp target enter data map(alloc: f[0:nElements], smoothF[0:nElements])
  1. Add a target update to directive to copy f and smoothF data to the GPU just before the main iteration loop and add a target update from directive to copy smoothF from the GPU just after the main iteration loop.
  #pragma omp target update to(f[0:nElements], smoothF[0:nElements])
  for( int iter=0; iter<nIter; iter++){

    // Run the smoother
    smoothField( &smoothOperator, f, smoothF, nx, ny );

    // Reassign smoothF to f
    resetF( f, smoothF, nx, ny, buf );
  }
  #pragma omp target update from(smoothF[0:nElements])
  1. Add a target exit data directive to deallocate device memory held by f and smoothF before calling free at the end of main.cpp.
  // Free space
  #pragma omp target exit data map(delete:f[0:nElements],smoothF[0:nElements])
  free(f);
  free(smoothF);
  1. Save main.cpp and open smoother.cpp. In the smootherInit routine, add a target enter data directive to allocate device memory for smoothOperator->weights.
  smoothOperator->dim = N;
  smoothOperator->weights = (real*)malloc( N*N*sizeof(real) );
  #pragma omp target enter data map(alloc: smoothOperator->weights[0:N*N])
  1. Add a target update directive for smoothOperator->weights at the end of the smootherInit routine, after the weight values have been assigned.
  for( int j=0; j < N; j++ ){
    for( int i=0; i < N; i++ ){
      smoothOperator->weights[i+j*N] = smoothOperator->weights[i+j*N]/wsum;
    }
  }
  #pragma omp target update to(smoothOperator->weights[0:N*N])
  1. Add a target exit data directive to deallocate device memory held by smoothOperator->weights in the smootherFree routine.
void smootherFree( struct smoother *smoothOperator )
{
  #pragma omp target exit data map(delete: smoothOperator->weights[0:smoothOperator->dim*smoothOperator->dim])
  free( smoothOperator->weights );
}
  1. Re-compile the smoother application.
$ make
  1. Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.
$ time ./smoother 1000 10
real        0m2.689s
user        0m1.496s
sys        0m1.052s
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/function.txt

In this codelab, you learned how to port serial CPU-only routines in C to GPUs using OpenMP. To do this, you used target directives to offload regions of code to the GPU. You used teams parallel for directives to parallelize nested loops across teams of SIMD threads.

To reduce data copies between host and device, you applied unstructured OpenMP data directives to explicitly manage when memory is allocated/deallocated on the GPU and when data is copied between to and from the GPU.

In the process of doing this, you practiced a strategy for porting to GPUs that included the following steps to make incremental changes to your own source code :

  1. Profile - Find out the hotspots in your code and understand the dependencies with other routines
  2. Plan - Determine what routine you want to port and what data needs to be copied to and from the GPU.
  3. Implement & Verify - Insert the necessary OpenMP directives, compile the application, and verify the results.
  4. Commit - Once you have verified correctness and the expected behavior, commit your changes and start the process over again.

Provide Feedback

If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com

Further reading