Last Updated: 2022-09-28
In this codelab, you will port a small C++ application to GPU hardware using OpenMP. You will transition a serial, CPU-only mini-application to a portable, GPU-accelerated application using OpenMP provided through the AOMP compiler.
The goal of this codelab is to introduce you to using a few basic OpenMP directives and a development practice that can be applied to porting other applications.
The demo application provided for this tutorial performs 2-D smoothing operations using a 3x3 Gaussian stencil.
In this section, we introduce the demo application and walk through building and verifying the example. It's important to make sure that the code produces the expected result, as we will use the CPU-generated model output to verify that the solution does not change when we port to the GPU.
This application executes a 2-D smoothing operation on a square grid of points. The program proceeds as follows:

1. Allocate memory for the function f and its smoothed counterpart smoothF, and initialize the smoothing operator.
2. Write the initial field to file (function.txt).
3. Iterate, calling smoothField to smooth f and resetF to copy the smoothed field back into f for the next iteration.
4. Write the smoothed field to file (smooth-function.txt) and free memory.

The smoothing operation itself is sketched below.
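To make the operation concrete, here is a minimal serial sketch of applying a 3x3 stencil. This is illustrative only: the actual implementation is the smoothField routine in smoother.cpp (shown later in this codelab), and the helper name and float type used here are our own.

// Illustrative serial 3x3 stencil. f and smoothF are nX x nY grids stored in
// row-major order; w holds the 3x3 Gaussian weights, also in row-major order.
void smooth3x3( const float *f, const float *w, float *smoothF, int nX, int nY ){
  for( int j=1; j < nY-1; j++ ){
    for( int i=1; i < nX-1; i++ ){
      float sm = 0.0f;
      // Weighted sum over the 3x3 neighborhood centered at (i,j)
      for( int jj=-1; jj <= 1; jj++ ){
        for( int ii=-1; ii <= 1; ii++ ){
          sm += f[(i+ii)+(j+jj)*nX]*w[(ii+1)+(jj+1)*3];
        }
      }
      smoothF[i+j*nX] = sm;
    }
  }
}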
This application's src directory contains the following files:

smoother.cpp
: Defines a simple data structure that stores the smoothing operator's weights, along with the routines for allocating memory, deallocating memory, and executing the smoothing operation.

main.cpp
: Defines the main program, which sets up the 2-D field to be smoothed and manages file IO.

Makefile
: A simple makefile to build the application binary smoother.

viz.py
: A python script for creating plots of the smoother output.

To get started, we want to make sure that the application builds and runs on your system using the gcc compiler.
$ git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
$ cd ~/scientific-computing-edu/samples/c++/smoother/src
$ make
$ ./smoother 1000 10
Before starting any GPU porting exercise, it is important to profile your application to find the hotspots where it spends most of its time. It is also helpful to keep track of the runtime of the routines in your application so that you can later assess whether or not the GPU port has improved performance. Ideally, your GPU-accelerated application should outperform the CPU-only version of your application even when the CPU-only version fully subscribes the available CPUs on a compute node.
In this tutorial, we are going to generate a profile and call graph using gprof. The provided makefile is already configured to create profile output. From here, you just need to use gprof to create the application profile.
$ gprof ./smoother gmon.out
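As an aside, if you are adapting this workflow to your own project: gprof instrumentation is typically enabled by compiling and linking with the -pg flag, which causes the executable to write a gmon.out file when it runs. A minimal sketch of the relevant makefile lines (the provided Makefile already does the equivalent):

# Sketch: enable gprof instrumentation when compiling and linking
CFLAGS += -pg
LFLAGS += -pg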
gprof provides a flat profile and a summary of your application's call structure, indicating dependencies within your source code as a call tree. A call tree depicts the relationships between the routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.
The first section of the gprof output is the flat profile. An example flat profile for the smoother application is given below. The flat profile lists the routines in your application, ordered from greatest to least by the percentage of time your program spends within them. Beneath the flat profile, gprof provides documentation of each of the columns for your convenience.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
95.24      1.16      1.16       10   116.19   116.19  smoothField
 2.46      1.19      0.03       10     3.00     3.00  resetF
 2.46      1.22      0.03                             main
 0.00      1.22      0.00        1     0.00     0.00  smootherFree
 0.00      1.22      0.00        1     0.00     0.00  smootherInit
Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. Within each entry, the name column shows the routine itself alongside the routines that call it and the routines it calls.
For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. Further, the called column indicates that the smoothField and resetF routines are each called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main and 1.19s calling other routines. Of those 1.19s, 1.16s are spent in smoothField and 0.03s in resetF.
index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.03      1.19               main [1]
                1.16      0.00      10/10        smoothField [2]
                0.03      0.00      10/10        resetF [3]
                0.00      0.00       1/1         smootherInit [5]
                0.00      0.00       1/1         smootherFree [4]
-----------------------------------------------
                1.16      0.00      10/10     main [1]
[2]     95.1    1.16      0.00      10        smoothField [2]
-----------------------------------------------
                0.03      0.00      10/10     main [1]
[3]      2.5    0.03      0.00      10        resetF [3]
-----------------------------------------------
                0.00      0.00       1/1      main [1]
[4]      0.0    0.00      0.00       1        smootherFree [4]
-----------------------------------------------
                0.00      0.00       1/1      main [1]
[5]      0.0    0.00      0.00       1        smootherInit [5]
-----------------------------------------------
Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will use the AOMP compiler for offloading to GPUs, we want to first modify the Makefile to use the AOMP compiler. Then, we will focus on porting the smoothField routine and the necessary data to the GPU, since smoothField takes up the majority of the runtime.
When we port this routine, we will introduce data allocation on the GPU and data copies between CPU and GPU. This data movement may increase the overall application runtime, even if the smoothField routine itself performs better. In this event, we will then work on minimizing data movement between CPU and GPU.
Before jumping straight into GPU offloading with OpenMP, you will take an incremental step: change the compiler and verify that the application can be compiled and executed with the amdclang compiler. Once this is verified, you will start the GPU offloading process.
ROCm comes with compilers (amdflang, amdclang, and amdclang++) that support OpenMP 5.0. To enable GPU offloading at compile time, there are a few flags that you need to pass to the compiler.
amdclang -fopenmp \
         -fopenmp-targets=[target] \
         -Xopenmp-target=[target] \
         -march=[gpu-arch] [other options] [source-code]
In this example:

* [target] is one of amdgcn-amd-amdhsa (for AMD GPUs) or nvptx64-nvidia-cuda (for Nvidia GPUs)
* [gpu-arch] is the GPU architecture code. For MI100 GPUs, [gpu-arch]=gfx908, and for V100 GPUs, [gpu-arch]=sm_70.
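For example, with [target]=amdgcn-amd-amdhsa and [gpu-arch]=gfx908 (an MI100 system), the compile line would look like the following (the source file name here is illustrative):

amdclang -fopenmp \
         -fopenmp-targets=amdgcn-amd-amdhsa \
         -Xopenmp-target=amdgcn-amd-amdhsa \
         -march=gfx908 -c smoother.cpp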
ROCm also comes with a helpful tool (mygpu) that can be used to detect the GPU architecture. This is particularly useful if you are building the code on a machine that has the GPU you want to build for.
In this section, you will make the following changes to the Makefile:

1. Add variables for the paths to the ROCm and CUDA installations
2. Change the compiler to amdclang
3. Use the mygpu binary to set the GPU architecture and offload target
4. Add the GPU offload flags to the CFLAGS variable

In the smoother makefile (samples/c++/smoother/src/Makefile), let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference full paths to the compiler and the mygpu binary.

ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda

Next, set the compiler to amdclang by updating the CC variable:

CC ?= $(ROCM)/bin/amdclang

Running mygpu -d gfx908 will check for a GPU and return gfx908 if one is not found. In the make system, we also want to allow whoever is building the code to specify the target architecture, in case they are on a system that is not equipped with a GPU. Add the following just below the definition of CC. This sets GPU_ARCH to the output of mygpu -d gfx908 if it is not already set in the user's environment; the -d gfx908 flag makes gfx908 the default GPU architecture when a GPU is not found on the system you are building on.

# Detect the GPU architecture
GPU_ARCH ?= $(shell $(ROCM)/bin/mygpu -d gfx908)
Next, use make's findstring function to set the GPU target according to GPU_ARCH. Add the following just below the definition of GPU_ARCH. This sets the GPU_TARGET variable to nvptx64-nvidia-cuda when an Nvidia GPU is detected and to amdgcn-amd-amdhsa otherwise. Additionally, it appends the CUDA runtime library to LFLAGS in the event that you are building for an Nvidia platform.
in the event that you are building for an Nvidia platform.ifeq (sm_,$(findstring sm_,$(GPU_ARCH)))
GPU_TARGET = nvptx64-nvidia-cuda
LFLAGS += -L$(CUDA)/targets/x86_64-linux/lib -lcudart
else
GPU_TARGET = amdgcn-amd-amdhsa
endif
Finally, add the GPU offload flags to the CFLAGS variable. Just beneath the GPU_TARGET definition, add the following line:

CFLAGS += -fopenmp -fopenmp-targets=$(GPU_TARGET) -Xopenmp-target=$(GPU_TARGET) -march=$(GPU_ARCH)
Now that you have made the necessary modifications to the Makefile, it is time to re-compile and test the application. You also want to make sure that the application output is unchanged. First, save the output from the CPU-only run as reference output:

$ mkdir reference
$ cp function.txt smooth-function.txt reference/

Next, re-compile the smoother application.

$ make clean
$ make

Then run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.

$ ./smoother 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
Now that you've switched to the amdclang compiler and have verified that the application compiles, runs, and produces the correct output, you are ready to begin offloading to GPUs with OpenMP. In the next step, you will offload the smoothField and resetF routines using OpenMP directives.
In the smoother application, we have seen that the smoothField routine, called by main, takes up the most time. Within the main iteration loop in main.cpp, the resetF function is called to update the input to smoothField for the next iteration.
You will start by offloading both the smoothField and resetF routines to the GPU using OpenMP directives (also called "pragmas"). In this section you will learn how to offload sections of code to the GPU and how to manage GPU data using OpenMP pragmas.
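Before editing the application, the following standalone sketch (not part of smoother; the file and variable names are ours) shows the two directives you are about to use: target with map clauses to move data, and teams distribute parallel for to parallelize a loop on the device. It can be compiled with the same offload flags you added to the Makefile earlier.

// saxpy_offload.cpp: a minimal OpenMP offload example
#include <cstdio>
#include <cstdlib>

int main(){
  const int N = 1 << 20;
  float *x = (float*)malloc( N*sizeof(float) );
  float *y = (float*)malloc( N*sizeof(float) );
  for( int i=0; i < N; i++ ){ x[i] = 1.0f; y[i] = 2.0f; }

  // Copy x to the device; copy y to the device and back (map defaults to tofrom)
  #pragma omp target map(to: x[0:N]) map(y[0:N])
  {
    // Distribute the loop iterations across teams of threads on the GPU
    #pragma omp teams distribute parallel for
    for( int i=0; i < N; i++ ){
      y[i] += 2.0f*x[i];
    }
  }

  printf( "y[0] = %f (expect 4.0)\n", y[0] );
  free(x);
  free(y);
  return 0;
}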
First, open smoother.cpp and navigate to the smoothField routine. Open an OpenMP target region before the start of the first loop in smoothField, adding map clauses to copy smoothOperator->weights and f to the GPU and smoothF to and from the GPU.
to and from the GPU. #pragma omp target map(to:smoothOperator->weights[0:N*N], f[0:nX*nY]) map(smoothF[0:nX*nY])
Within the target region, add a teams distribute parallel for directive with a collapse(2) clause to parallelize the outer two loops. (A map clause without an explicit map-type, like map(smoothF[0:nX*nY]) above, defaults to tofrom, which copies the data to the device at region entry and back at region exit.)
clause to parallelize the outer two loops. #pragma omp target map(to:smoothOperator->weights[0:N*N], f[0:nX*nY]) map(smoothF[0:nX*nY])
{
#pragma omp teams distribute parallel for collapse(2) num_threads(256)
for( int j=buf; j < nY-buf; j++ ){
for( int i=buf; i < nX-buf; i++ ){
smLocal = 0.0;
for( int jj=-buf; jj <= buf; jj++ ){
for( int ii=-buf; ii <= buf; ii++ ){
iloc = (i+ii)+(j+jj)*nX;
ism = (ii+buf) + (jj+buf)*N;
smLocal += f[iloc]*smoothOperator->weights[ism];
}
}
iel = i+j*nX;
smoothF[iel] = smLocal;
}
}
}
Re-compile the smoother application.

$ make

Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.

$ time ./smoother 1000 10

real    0m2.767s
user    0m1.568s
sys     0m1.057s

$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
Next, open smoother.cpp and navigate to the resetF routine. Open an OpenMP target region before the start of the first loop in resetF, adding map clauses to copy smoothF to the GPU and f to and from the GPU.
to and from the GPU. #pragma omp target map(to: smoothF[0:nx*ny]) map(f[0:nx*ny])
Within the target region, add a teams distribute parallel for directive with a collapse(2) clause to parallelize the two loops.
clause to parallelize the outer two loops. #pragma omp target map(to: smoothF[0:nx*ny]) map(f[0:nx*ny])
{
#pragma omp teams distribute parallel for collapse(2) num_threads(256)
for( int iy=buf; iy<ny-buf; iy++ ){
for( int ix=buf; ix<nx-buf; ix++ ){
iel = ix + nx*iy;
f[iel] = smoothF[iel];
}
}
}
Re-compile the smoother application.

$ make

Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.

$ time ./smoother 1000 10

real    0m2.970s
user    0m1.528s
sys     0m1.228s

$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
You can trace GPU activity, including kernel launches and memory copies, by running the application under rocprof with the --hsa-trace flag. If you would like a summary hotspot profile of the GPU kernels, use the --stats flag as well. (rocprof also writes a results.json trace file that can be opened in a trace viewer such as chrome://tracing.)

$ rocprof --hsa-trace --stats ./smoother 1000 10
$ cat results.stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"__omp_offloading_801_440b81_smoothField_l67.kd",10,30603997,3060399,78.4420113965
"__omp_offloading_801_440b81_resetF_l48.kd",10,8410807,841080,21.5579886035
You've successfully offloaded two routines to the GPU. However, you may have noticed that the runtime did not improve much, and may even have gotten worse, after you offloaded the second routine (resetF). At the start and end of each target region, the application copies data between the CPU and GPU. You can see this behavior in the trace profile shown above. Ideally, you want to minimize data movement between the host and device for optimal performance.
In the next section, you will learn how to control when data is allocated and moved to and from the GPU. This will help you minimize the data copies between the host and device that often become bottlenecks for GPU-accelerated applications.
In this section you will learn how to use unstructured data directives with OpenMP to control when data is copied to and from the GPU.
In the smoother application, there are two routines within the main iteration loop, smoothField and resetF. Both routines operate on data stored in two arrays, f and smoothF.
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField( &smoothOperator, f, smoothF, nx, ny );
  // Reassign smoothF to f
  resetF( f, smoothF, nx, ny, buf );
}
Additionally, the smoothField routine requires the smoothOperator->weights array in order to calculate smoothF from f. Currently, the target regions within smoothField and resetF copy these arrays to and from the GPU before and after executing the routine's instructions in parallel on the GPU, and they do so on every iteration.
Ideally, we want all of the necessary data copied to the GPU before the iteration loop, and smoothF copied from the GPU after the iteration loop. This can be achieved using the target enter data and target exit data directives.
Both are standalone directives that can be used to allocate or deallocate memory on the GPU and copy data to or from the GPU. A typical usage is to use the target enter data directive to allocate device memory just after allocating on the host, and to use the target exit data directive to free device memory just before freeing memory on the host. Then, you can use the target update directive to move data between host and device when needed: to(...) copies host data to the device, and from(...) copies device data back to the host.
In the example below, the target enter data directive is used to allocate device memory for arrayIn and arrayOut. Before reaching the main block of code, the target update to directive is used to update arrayIn on the device. At the end of this region of code, the target update from directive is used to update arrayOut on the host. At the end of the example, the target exit data directive is used to free the device memory just before freeing the associated host pointers.
int N = 1000;
float *arrayIn;
float *arrayOut;

arrayIn = (float*)malloc( N*sizeof(float) );
arrayOut = (float*)malloc( N*sizeof(float) );

// Allocate device memory for arrayIn and arrayOut
#pragma omp target enter data map(alloc: arrayIn[0:N], arrayOut[0:N])

// Initialization routines ...

// Copy arrayIn from the host to the device
#pragma omp target update to(arrayIn[0:N])

// Execution block ...

// Copy arrayOut from the device to the host
#pragma omp target update from(arrayOut[0:N])

// ...

// Free device memory before freeing the associated host pointers
#pragma omp target exit data map(delete: arrayIn[0:N], arrayOut[0:N])
free(arrayIn);
free(arrayOut);
In the smoother application, we want to explicitly control data movement for f, smoothF, and smoothOperator->weights. You will work in main.cpp to insert directives that allocate, update, and deallocate device memory for f and smoothF. To handle smoothOperator->weights, you will work in smoother.cpp to allocate, update, and deallocate its device memory.
In main.cpp, find where f and smoothF are allocated. Just after the malloc calls, add a target enter data directive to allocate device memory for f and smoothF.

// Allocate space for the function we want to smooth
f = (real*)malloc( nElements*sizeof(real) );
smoothF = (real*)malloc( nElements*sizeof(real) );

#pragma omp target enter data map(alloc: f[0:nElements], smoothF[0:nElements])
Add a target update to directive to copy f and smoothF to the GPU just before the main iteration loop, and add a target update from directive to copy smoothF from the GPU just after the main iteration loop.

#pragma omp target update to(f[0:nElements], smoothF[0:nElements])
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField( &smoothOperator, f, smoothF, nx, ny );
  // Reassign smoothF to f
  resetF( f, smoothF, nx, ny, buf );
}
#pragma omp target update from(smoothF[0:nElements])
Add a target exit data directive to deallocate the device memory held by f and smoothF before the calls to free at the end of main.cpp.

// Free space
#pragma omp target exit data map(delete: f[0:nElements], smoothF[0:nElements])
free(f);
free(smoothF);
Save main.cpp and open smoother.cpp. In the smootherInit routine, add a target enter data directive to allocate device memory for smoothOperator->weights.

smoothOperator->dim = N;
smoothOperator->weights = (real*)malloc( N*N*sizeof(real) );

#pragma omp target enter data map(alloc: smoothOperator->weights[0:N*N])
Add a target update to directive to copy smoothOperator->weights to the GPU at the end of the smootherInit routine, after the weight values have been assigned.

for( int j=0; j < N; j++ ){
  for( int i=0; i < N; i++ ){
    smoothOperator->weights[i+j*N] = smoothOperator->weights[i+j*N]/wsum;
  }
}

#pragma omp target update to(smoothOperator->weights[0:N*N])
Finally, add a target exit data directive to deallocate the device memory held by smoothOperator->weights in the smootherFree routine.

void smootherFree( struct smoother *smoothOperator )
{
  #pragma omp target exit data map(delete: smoothOperator->weights[0:smoothOperator->dim*smoothOperator->dim])
  free( smoothOperator->weights );
}
Re-compile the smoother application.

$ make

Run the smoother application with the same input parameters as before and compare the output with the reference output. You can use the diff command line tool to compare the new output with the reference output. If the files are identical, no output will be printed to screen.

$ time ./smoother 1000 10

real    0m2.689s
user    0m1.496s
sys     0m1.052s

$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
In this codelab, you learned how to port serial CPU-only routines in C++ to GPUs using OpenMP. To do this, you used target directives to offload regions of code to the GPU, and teams distribute parallel for directives to parallelize nested loops across teams of threads.
To reduce data copies between host and device, you applied unstructured OpenMP data directives to explicitly manage when memory is allocated and deallocated on the GPU and when data is copied to and from the GPU.
In the process of doing this, you practiced a strategy for porting to GPUs that makes incremental changes to your source code:

1. Profile the CPU-only application to find hotspots, and save its output as a reference.
2. Switch to a GPU-capable compiler and verify that the application still builds and reproduces the reference output.
3. Offload the hotspot routines with OpenMP target directives, verifying the output after each change.
4. Profile the GPU-accelerated application and minimize data movement between the host and device.
If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com.