Last Updated: 2022-09-27
In this codelab, you will port a small C++ application to GPU hardware using HIP. You will transition a serial, CPU-only mini-application to a portable GPU-accelerated application, using the HIP API and the hipcc compiler provided with ROCm.
The goal of this codelab is to introduce you to a few basic HIP API calls and kernel constructs, and to a development practice that can be applied to porting other applications.
The demo application provided for this tutorial performs 2-D smoothing operations using a 5x5 Gaussian stencil.
In this section, we introduce the demo application and walk through building and verifying the example. It is important to make sure that the code produces the expected result, as we will use the CPU-generated model output to verify that the solution does not change when we port to the GPU.
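Concretely, each smoothed value is a weighted sum over a point's neighborhood. For a stencil of width dim with halo b, the operation implemented in smoother.cpp (shown later in this codelab) is

\[ \tilde{f}_{i,j} \;=\; \sum_{jj=-b}^{b} \sum_{ii=-b}^{b} w_{ii+b,\; jj+b} \, f_{i+ii,\, j+jj}, \qquad b = \frac{\mathrm{dim}-1}{2}. \]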
This application executes a 2-D smoothing operation on a square grid of points.
This application's src directory contains the following files:

smoother.cpp
: Defines a simple data structure that stores the smoothing operator's weights, plus the routines for allocating memory, deallocating memory, and executing the smoothing operation.

main.cpp
: Defines the main program that sets up the 2-D field to be smoothed and manages file IO.

Makefile
: A simple makefile to build the application binary smoother.

viz.py
: A python script for creating plots of the smoother output.

To get started, we want to make sure that the application builds and runs on your system using the gcc compiler.
git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
cd ~/scientific-computing-edu/samples/c++/smoother/src
make
./smoother 1000 10
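For reference, the two command line arguments are the grid size and the number of smoothing iterations, so the run above smooths a 1000x1000 point grid 10 times and writes the output files function.txt and smooth-function.txt.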
Before starting any GPU porting exercise, it is important to profile your application to find the hotspots where it spends most of its time. It is also helpful to keep track of the runtime of the routines in your application so that you can later assess whether the GPU port has improved performance. Ideally, your GPU-accelerated application should outperform the CPU-only version even when the latter fully subscribes the available CPU cores on a compute node.
In this tutorial, we are going to generate a profile and call graph using gprof. The provided Makefile is already configured to create profile output. From here, you just need to use gprof to create the application profile.
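For reference, gprof instrumentation is enabled with the -pg flag at compile and link time; a makefile configured this way contains lines something like the sketch below (the provided Makefile may differ in detail). Running the instrumented binary produces the gmon.out file used in the next command.

CC = g++
CFLAGS = -O0 -g -pg   # -pg inserts gprof instrumentation at compile and link time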
$ gprof ./smoother gmon.out
gprof provides a flat profile and a summary of your application's call structure, indicating dependencies within your source code as a call tree. A call tree depicts the relationships between routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.
The first section of the gprof output is the flat profile. An example flat profile for the smoother application is given below. The flat profile provides a list of routines in your application, ordered from greatest to least by the percentage of time your program spends within them. Beneath the flat profile, gprof provides documentation of each of the columns for your convenience.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 95.24      1.16     1.16       10   116.19   116.19  smoothField
  2.46      1.19     0.03       10     3.00     3.00  resetF
  2.46      1.22     0.03                             main
  0.00      1.22     0.00        1     0.00     0.00  smootherFree
  0.00      1.22     0.00        1     0.00     0.00  smootherInit
Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. Within each entry, the line marked with an index (e.g. [1]) is the routine itself; the lines above it list its callers and the lines below it list the routines it calls.
For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. The called column shows that, in this case, smoothField and resetF are each called 10 times by main. The self and children columns indicate that main spends 0.03s executing its own instructions and 1.19s in the routines it calls. Of those 1.19s, 1.16s are spent in smoothField and 0.03s in resetF.
index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.03    1.19                 main [1]
                1.16    0.00      10/10          smoothField [2]
                0.03    0.00      10/10          resetF [3]
                0.00    0.00       1/1           smootherInit [5]
                0.00    0.00       1/1           smootherFree [4]
-----------------------------------------------
                1.16    0.00      10/10          main [1]
[2]     95.1    1.16    0.00      10         smoothField [2]
-----------------------------------------------
                0.03    0.00      10/10          main [1]
[3]      2.5    0.03    0.00      10         resetF [3]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[4]      0.0    0.00    0.00       1         smootherFree [4]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[5]      0.0    0.00    0.00       1         smootherInit [5]
-----------------------------------------------
Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will use the hipcc compiler for offloading to GPUs, we want to first modify the Makefile to use hipcc. Then we will focus on porting the smoothField routine, and the data it needs, to the GPU, since smoothField takes up the majority of the runtime.
When we port this routine, we will introduce data allocation on the GPU and data copies between CPU and GPU. This data movement may increase the overall application runtime, even if the smoothField routine itself performs better. If that happens, we will then work on minimizing data movement between CPU and GPU.
In the smoother application, we have seen that the smoothField routine, called by main, takes up the most time. Looking at the function call in main.cpp and the smoothField routine in smoother.cpp, we see that this routine takes in a smoother object, real array pointers f and smoothF, and integers nx and ny that are passed by value.
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField( &smoothOperator, f, smoothF, nx, ny );
  // Reassign smoothF to f
  resetF( f, smoothF, nx, ny );
}
In order to offload smoothField to the GPU, we will need to copy the smoothOperator data and the f array to the GPU. After calling smoothField, we will eventually want to copy smoothF back to the CPU before calling resetF.
The smoothField routine uses the smoothOperator->weights array when applying the operator. Because of this, you will need to create and allocate a device copy of the weights array. After filling in the weights values on the CPU, you can copy the values over to the device array.
Modify the includes in smoother.cpp so that you can make HIP API calls.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
Add a field to the smoother struct in smoother.h for a device copy of the smoother weights.

typedef struct smoother{
int dim;
real *weights;
real *weights_dev;
}smoother;
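Note that weights_dev will hold a device address: it can be passed to HIP API calls and to kernels, but dereferencing it on the host is invalid. Keeping the host and device pointers side by side in the struct lets any routine that receives a smoother reach whichever copy it needs.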
Modify smootherInit to allocate weights_dev on the GPU, and insert a call to hipMemcpy in smootherInit to copy weights to weights_dev.

// Allocate space for the device copy of the smoothing weights
hipMalloc(&smoothOperator->weights_dev,N*N*sizeof(real));
// Copy weights from the host to the device
hipMemcpy(smoothOperator->weights_dev,
smoothOperator->weights,
N*N*sizeof(real),
hipMemcpyHostToDevice);
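For brevity, the codelab code does not check return values, but every HIP API call returns a hipError_t. If you want failures to surface immediately during development, a small helper macro (a hypothetical addition, not part of the codelab source) works well:

#define HIP_CHECK(call)                                        \
  do {                                                         \
    hipError_t err = (call);                                   \
    if( err != hipSuccess ){                                   \
      fprintf(stderr, "HIP error '%s' at %s:%d\n",             \
              hipGetErrorString(err), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                      \
    }                                                          \
  } while(0)

// Example usage:
// HIP_CHECK(hipMalloc(&smoothOperator->weights_dev, N*N*sizeof(real)));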
Modify smootherFree to deallocate the GPU memory held by weights_dev.

hipFree(smoothOperator->weights_dev);
In the smoother Makefile, let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference the full path to the hipcc compiler. We use the ?= assignment so that a user's environment variables can override these values if desired.

ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda
Change the compiler to hipcc
and save your changes.
CC=$(ROCM)/bin/hipcc
CFLAGS=-O0 -g
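On Nvidia platforms, hipcc compiles through nvcc, which is why the Makefile also records the CUDA path. As a sketch, assuming HIP is installed with the CUDA backend, the backend is selected through an environment variable (the value is nvidia in recent HIP releases, nvcc in older ones):

$ export HIP_PLATFORM=nvidia
$ make clean && make smoother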
Once you have completed the code and Makefile modifications, you can compile smoother and verify that data is allocated and copied to the GPU.
Save the output from the gcc-built run in a reference/ subdirectory for later comparison. Whenever we make a change to the code, we will compare output with this reference data.

$ mkdir ./reference
$ mv function.txt smooth-function.txt ./reference/
Remove the *.o files and the smoother binary to ensure a clean build, and make a new smoother binary.

$ make clean && make smoother
Run smoother with the same input parameters as you did in the previous section and verify the output is unchanged. We use the diff command line utility to compare the output files and the reference files. If there are no differences, diff will produce no output.

$ ./smoother 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
Profile the application using rocprof with the --sys-trace and --stats flags. Running rocprof will create a file called results.json that contains the data for a trace profile. Additionally, results.stats.csv and results.hip_stats.csv will contain hotspot analyses for HIP kernels and HIP API calls, respectively.

$ rocprof --sys-trace --stats ./smoother 1000 100
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,1,12503928,12503928,95.8586582624
hipMalloc,3,387867,129289,2.9734984242
hipFree,3,152335,50778,1.16784331343
Commit your changes.

$ git add smoother.h smoother.cpp Makefile && git commit
Modify the includes in main.cpp so that you can make HIP API calls.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
Declare the pointers f_dev and smoothF_dev in main.cpp.

int main( int argc, char *argv[] ) {
smoother smoothOperator;
int nx, ny, nElements;
int nIter;
real dx;
real *f, *smoothF;
real *f_dev, *smoothF_dev;
Add calls to hipMalloc to allocate f_dev and smoothF_dev on the GPU.

// Create the smoother
smootherInit(&smoothOperator);
// Allocate space for the function we want to smooth
f = (real*)malloc( nElements*sizeof(real) );
smoothF = (real*)malloc( nElements*sizeof(real) );
hipMalloc(&f_dev, nElements*sizeof(real));
hipMalloc(&smoothF_dev, nElements*sizeof(real));
Add a hipMemcpy call to update f_dev (host to device) prior to calling smoothField, and another to update smoothF (device to host) after calling smoothField.

hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
smoothField( &smoothOperator, f, smoothF, nx, ny );
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
Add calls to hipFree at the end of the main function to free the memory held by the f_dev and smoothF_dev pointers.

// Free space
free(f);
free(smoothF);
hipFree(f_dev);
hipFree(smoothF_dev);
Rebuild the application.

$ make

Profile the application using rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show calls to hipMalloc, hipMemcpy, and hipFree.

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,22,43602127,1981914,98.9534503691
hipMalloc,3,287919,95973,0.653421757999
hipFree,3,173225,57741,0.393127872872
For Nvidia platforms, use nvprof. At this stage, you should see three calls to cudaMalloc, three calls to cudaFree, ten calls to cudaMemcpy (Device to Host), and 11 calls to cudaMemcpy (Host to Device).
$ nvprof ./smoother 1000 10
==23287== NVPROF is profiling process 23287, command: ./smoother 1000 10
==23287== Profiling application: ./smoother 1000 10
==23287== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.12% 4.6421ms 10 464.21us 427.18us 589.69us [CUDA memcpy DtoH]
47.88% 4.2636ms 11 387.60us 1.4720us 511.96us [CUDA memcpy HtoD]
API calls: 93.92% 194.18ms 3 64.727ms 135.75us 193.88ms cudaMalloc
5.20% 10.757ms 21 512.25us 22.255us 778.80us cudaMemcpy
0.39% 809.35us 97 8.3430us 537ns 396.19us cuDeviceGetAttribute
0.25% 526.66us 1 526.66us 526.66us 526.66us cuDeviceTotalMem
0.16% 339.77us 3 113.26us 2.7010us 176.22us cudaFree
0.06% 115.21us 1 115.21us 115.21us 115.21us cuDeviceGetName
0.00% 4.4240us 3 1.4740us 700ns 2.9380us cuDeviceGetCount
0.00% 4.3810us 1 4.3810us 4.3810us 4.3810us cuDeviceGetPCIBusId
0.00% 2.5650us 2 1.2820us 662ns 1.9030us cuDeviceGet
0.00% 1.0790us 1 1.0790us 1.0790us 1.0790us cuDeviceGetUuid
At this point, you have the necessary data declared on the GPU. Additionally, you used hipMemcpy to make the input to smoothField available on the GPU. In the next step, you will create a HIP kernel that runs the smoothField algorithm on the GPU, and you will replace the call to smoothField with a call that launches this kernel.
In this section, you will offload the smoothField routine to the GPU.
Let's look at the smoothField routine from smoother.cpp.
void smoothField( struct smoother *smoothOperator, real *f, real *smoothF, int nx, int ny )
{
  int iel, ism;
  int N = (real)smoothOperator->dim;
  int buf = (real)(smoothOperator->dim-1)/2.0;
  real smLocal;
  for( int j=buf; j < ny-buf; j++ ){
    for( int i=buf; i < nx-buf; i++ ){
      smLocal = 0.0;
      for( int jj=-buf; jj <= buf; jj++ ){
        for( int ii=-buf; ii <= buf; ii++ ){
          iel = (i+ii)+(j+jj)*nx;
          ism = (ii+buf) + (jj+buf)*N;
          smLocal += f[iel]*smoothOperator->weights[ism];
        }
      }
      iel = i+j*nx;
      smoothF[iel] = smLocal;
    }
  }
}
The outer loops, over i and j, are tightly nested loops over a 2-D grid. The sizes of these loops are ny-2*buf and nx-2*buf. The values of nx and ny are determined by the user through the first command line argument (we have been using 1000), and buf is 2 (smoothOperator->dim=5; see smootherInit). Within the i and j loops, we carry out a reduction operation for smLocal and then assign the result to an element of smoothF.
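For the example we have been running, nx = ny = 1000 and buf = 2, so each loop spans 996 points and there are 996 x 996 = 992,016 independent (i,j) grid points, each of which can in principle be handled by its own GPU thread.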
In the smoothField algorithm, the order in which we execute the i and j loops does not matter, and, as the arithmetic above shows, the loops expose on the order of a million independent points. A reasonable strategy for offloading this routine to the GPU is to have each GPU thread execute the instructions within the i and j loops. Ideally, then, we want each thread to execute something like the following.
real smLocal = 0.0;
for( int jj=-buf; jj <= buf; jj++ ){
  for( int ii=-buf; ii <= buf; ii++ ){
    iel = (i+ii)+(j+jj)*nx;
    ism = (ii+buf) + (jj+buf)*N;
    smLocal += f[iel]*smoothOperator->weights[ism];
  }
}
iel = i+j*nx;
smoothF[iel] = smLocal;
Notice that the i and j loops are now gone. Within the HIP kernel, we can calculate i and j from threadIdx.[x,y], blockIdx.[x,y], and blockDim.[x,y], assuming that we will launch the kernel with 2-D grid and block dimensions. You can use something like the following to calculate i and j.
size_t i = threadIdx.x + blockIdx.x*blockDim.x+buf;
size_t j = threadIdx.y + blockIdx.y*blockDim.y+buf;
Within the main program, you will be able to launch the GPU kernel, but you will need to calculate the grid and block dimensions. For now, let's fix the number of threads per block in the i and j loop dimensions (the x and y directions) at 16. With the number of threads per block chosen, you can calculate the grid dimensions by requiring the x and y thread extents to be greater than or equal to the i and j loop sizes, respectively.
int buf = (real)(smoothOperator->dim-1)/2.0;
int threadsPerBlockX = 16;
int threadsPerBlockY = 16;
int gridDimX = (nx-2*buf)/threadsPerBlockX + 1;
int gridDimY = (ny-2*buf)/threadsPerBlockY + 1;
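As an aside, the +1 idiom above launches one extra block when the loop size is an exact multiple of the block size; the bounds check inside the kernel (shown below) makes the surplus threads harmless. A common alternative is ceiling division, which launches exactly enough blocks:

int gridDimX = (nx - 2*buf + threadsPerBlockX - 1)/threadsPerBlockX;
int gridDimY = (ny - 2*buf + threadsPerBlockY - 1)/threadsPerBlockY;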
Add a new routine, smoothField_gpu, to smoother.cpp. This routine needs to be decorated with the __global__ declaration specifier so that it can be launched on the GPU (device) from the CPU (host).

__global__ void smoothField_gpu( real *weights, real *f, real *smoothF, int nX, int nY, int N )
{
  int buf = (N-1)/2;
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  int iel, ism;
  if( i < nX-buf && j < nY-buf ){
    real smLocal = 0.0;
    for( int jj=-buf; jj <= buf; jj++ ){
      for( int ii=-buf; ii <= buf; ii++ ){
        iel = (i+ii)+(j+jj)*nX;
        ism = (ii+buf) + (jj+buf)*N;
        smLocal += f[iel]*weights[ism];
      }
    }
    iel = i+j*nX;
    smoothF[iel] = smLocal;
  }
}
Calculate the kernel launch parameters in main.cpp. You can place this block of code just before the iteration loop.

int buf = (smoothOperator.dim-1)/2;
int tX = 16;
int tY = 16;
int gX = (nx-2*buf)/tX + 1;
int gY = (ny-2*buf)/tY + 1;
dim3 threads(tX,tY,1);
dim3 blocks(gX,gY,1);
Inside the iteration loop, replace the call to smoothField with a launch of smoothField_gpu, keeping the hipMemcpy calls around it for now.

// Copy f from host to device : f is input to `smoothField`
hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
// Run the smoother
smoothField_gpu<<<blocks,threads>>>(
smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );
// Copy smoothF_dev from device to host
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
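Kernel launches do not return an error code directly; if a launch fails (for example, because of invalid launch dimensions), the error surfaces through hipGetLastError. A check like the following is optional and is a hypothetical addition rather than part of the codelab source:

// Optional: detect kernel launch errors
hipError_t launchErr = hipGetLastError();
if( launchErr != hipSuccess ){
  fprintf(stderr, "smoothField_gpu launch failed: %s\n", hipGetErrorString(launchErr));
}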
Add the smoothField_gpu declaration to smoother.h.

__global__ void smoothField_gpu( real *weights, real *f, real *smoothF, int nx, int ny, int N);
Rebuild the application and profile it again.

$ make clean && make
$ rocprof --sys-trace --stats ./smoother 1000 10
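As with every change, verify that the output still matches the reference data:

$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt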
Congratulations! So far, you've learned how to allocate and manage memory on the GPU and how to launch a GPU kernel. Right now, we have the code in a state where, every iteration, data is copied to the GPU before calling smoothField_gpu and back from the GPU after calling smoothField_gpu. This situation happens quite often when porting to GPUs for the first time.
The next step in this codelab is to offload the resetF routine to the GPU, even though it does not take up much time. We want to offload it so that we can move the hipMemcpy calls outside of the iteration loop in main and reduce the number of times data crosses the PCIe bus.
In this section, we are going to offload the resetF routine in smoother.cpp to the GPU so that we can migrate the hipMemcpy calls outside of the iteration loop in main.cpp. By this point, you have worked through the mechanics of porting a routine to the GPU. Additionally, for this application, all of the data that the resetF routine depends on is already on the GPU.
Add the resetF_gpu declaration to smoother.h.

__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf );
Add resetF_gpu to smoother.cpp.

__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf )
{
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  int iel = i + nX*j;
  if( i < nX-buf && j < nY-buf ){
    f[iel] = smoothF[iel];
  }
}
Replace the call to resetF with resetF_gpu in main.cpp, move the hipMemcpy host-to-device call before the iteration loop, and move the hipMemcpy device-to-host call after the iteration loop.

hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField_gpu<<<blocks,threads>>>( smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );
  // Reassign smoothF to f
  resetF_gpu<<<blocks,threads>>>(f_dev, smoothF_dev, nx, ny, buf );
}
// Copy smoothF_dev from device to host
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
Rebuild the application, profile it again, and verify the output against the reference data.

$ make
$ rocprof --sys-trace --stats ./smoother 1000 10
In this codelab, you learned how to port serial CPU-only routines in C++ to GPUs using HIP. To do this, you created device copies of CPU arrays and learned how to copy data from the CPU to the GPU and vice versa. You also learned how to write HIP kernels and launch them from the host.
In the process of doing this, you practiced a strategy for porting to GPUs that makes incremental changes to your source code:

1. Profile the CPU application to find hotspots and understand its call structure.
2. Save reference output so that every change can be verified.
3. Port the data first: allocate device memory and add host-device copies.
4. Offload the hotspot routine as a GPU kernel and verify the output against the reference.
5. Offload neighboring routines so that data copies can move out of iteration loops.
6. Re-profile and commit after each working increment.
If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com.