Last Updated: 2022-09-27
In this codelab, you will port a small C++ application to GPU hardware using HIP. You will transition a serial CPU-only mini-application to a portable GPU-accelerated application, using the HIP API and the hipcc compiler provided with ROCm.
The goal of this codelab is to introduce you to a few basic HIP API routines and a development practice that can be applied to porting other applications.
The demo application provided for this tutorial performs 2-D smoothing operations using a 5x5 Gaussian stencil.
In this section, we introduce the demo application and walk through building and verifying the example. It's important to make sure that the code produces the expected result, as we will use the CPU-generated model output to verify that the solution does not change when we port to the GPU.
This application executes a 2-D smoothing operation on a square grid of points. The program proceeds as follows: the main program initializes the smoothing operator and the 2-D field, repeatedly applies the smoother (smoothField) and copies the result back into the input field (resetF) for a set number of iterations, and then writes the input and smoothed fields to file.
This application's src directory contains the following files:
- smoother.cpp : Defines a simple data structure that stores the smoothing operator's weights, along with the routines for allocating memory, deallocating memory, and executing the smoothing operation.
- main.cpp : Defines the main program that sets up the 2-D field to be smoothed and manages file IO.
- Makefile : A simple makefile to build the application binary smoother.
- viz.py : A Python script for creating plots of the smoother output.
To get started, we want to make sure that the application builds and runs on your system using the gcc compiler.
git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
cd ~/scientific-computing-edu/samples/c++/smoother/src
make
./smoother 1000 10
Here the first argument sets the grid dimension (nx = ny = 1000) and the second sets the number of smoothing iterations.
Before starting any GPU porting exercise, it is important to profile your application to find hotspots where it spends most of its time. It is also helpful to keep track of the runtime of the routines in your application so that you can later assess whether the GPU port has improved performance. Ideally, your GPU-accelerated application should outperform the CPU-only version even when the CPU-only version fully subscribes the available CPU cores on a compute node.
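If you want a quick wall-clock measurement to complement gprof, you could bracket a region of interest with std::chrono. This is a generic sketch, not part of the codelab's workflow; the timed region is a placeholder:

#include <chrono>
#include <cstdio>

int main() {
  auto t0 = std::chrono::high_resolution_clock::now();
  // ... call the routine you want to time here ...
  auto t1 = std::chrono::high_resolution_clock::now();
  // Convert the interval to seconds as a double
  double elapsed = std::chrono::duration<double>(t1 - t0).count();
  printf("Elapsed: %f s\n", elapsed);
  return 0;
}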
In this tutorial, we are going to generate a profile and call graph using gprof. The provided Makefile is already configured to create profile output (gprof instrumentation is typically enabled by compiling and linking with the -pg flag). Run the application once to produce the gmon.out file, then use gprof to create the application profile.
$ gprof ./smoother gmon.out
gprof provides a flat profile and a summary of your application's call structure in the form of a call tree, which depicts the relationships between the routines in your source code. Combining timing information with the call tree will help you plan the order in which you port routines to the GPU.
The first section of the gprof output is the flat profile. An example flat profile for the smoother application is given below. The flat profile lists the routines in your application, ordered by the percentage of time your program spends within them, from greatest to least. Beneath the flat profile, gprof documents each of the columns for your convenience.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
95.24       1.16     1.16       10   116.19   116.19  smoothField
 2.46       1.19     0.03       10     3.00     3.00  resetF
 2.46       1.22     0.03                             main
 0.00       1.22     0.00        1     0.00     0.00  smootherFree
 0.00       1.22     0.00        1     0.00     0.00  smootherInit
Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. The right-most field of each entry indicates the routines that call, and are called by, that routine.
For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. The called column indicates that the smoothField and resetF routines are each called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main and 1.19s calling other routines. Of those 1.19s, 1.16s are spent in smoothField and 0.03s in resetF.
index  % time    self  children    called     name
                                               <spontaneous>
[1]     100.0    0.03     1.19                 main [1]
                 1.16     0.00      10/10          smoothField [2]
                 0.03     0.00      10/10          resetF [3]
                 0.00     0.00       1/1           smootherInit [5]
                 0.00     0.00       1/1           smootherFree [4]
-----------------------------------------------
                 1.16     0.00      10/10      main [1]
[2]      95.1    1.16     0.00      10         smoothField [2]
-----------------------------------------------
                 0.03     0.00      10/10      main [1]
[3]       2.5    0.03     0.00      10         resetF [3]
-----------------------------------------------
                 0.00     0.00       1/1       main [1]
[4]       0.0    0.00     0.00       1         smootherFree [4]
-----------------------------------------------
                 0.00     0.00       1/1       main [1]
[5]       0.0    0.00     0.00       1         smootherInit [5]
-----------------------------------------------
Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will use the hipcc compiler from ROCm for offloading to GPUs, we want to first modify the Makefile to use hipcc. Then, we will focus on porting the smoothField routine and the necessary data to the GPU, since smoothField takes up the majority of the runtime.
When we port this routine, we will introduce data allocation on the GPU and data copies between the CPU and GPU. This data movement may increase the overall application runtime, even if the smoothField routine itself performs better. In that event, we will then work on minimizing data movement between the CPU and GPU.
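As a preview of the mechanics involved, the basic HIP memory-management pattern we will apply throughout this codelab looks like the following. This is a minimal, self-contained sketch (not code from smoother) illustrating the allocate, copy, and free cycle:

#include <hip/hip_runtime.h>
#include <vector>

int main() {
  std::vector<float> host(1000, 1.0f);
  float *dev = nullptr;
  // Allocate device memory sized to match the host array
  hipMalloc(&dev, host.size()*sizeof(float));
  // Copy input data from the host to the device
  hipMemcpy(dev, host.data(), host.size()*sizeof(float), hipMemcpyHostToDevice);
  // ... launch kernels that read and write dev ...
  // Copy results back from the device to the host
  hipMemcpy(host.data(), dev, host.size()*sizeof(float), hipMemcpyDeviceToHost);
  // Release the device memory
  hipFree(dev);
  return 0;
}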
In the smoother application, we have seen that the smoothField routine, called by main, takes up the most time. Looking at the function call in main.cpp and the smoothField routine in smoother.cpp, we see that this routine takes in a smoother object, real array pointers f and smoothF, and integers nx and ny that are passed by value.
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField( &smoothOperator, f, smoothF, nx, ny );
  // Reassign smoothF to f
  resetF( f, smoothF, nx, ny );
}
In order to offload smoothField to the GPU, we will need to copy the smoothOperator class data and the f array to the GPU. After calling smoothField, we will eventually want to copy smoothF back to the CPU before calling resetF.
The smoothField routine uses the smoothOperator->weights array when applying the operator. Because of this, you will need to declare and allocate a device copy of the weights array. After filling in the weights values on the CPU, you can copy the values over to the device array.
Modify the includes in smoother.cpp so that you can make HIP API calls.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
Add a pointer to smoother.h for a device copy of the smoother weights.

typedef struct smoother{
  int dim;
  real *weights;
  real *weights_dev;
}smoother;
Insert a call to hipMalloc in smootherInit to allocate weights_dev on the GPU, and insert a call to hipMemcpy in smootherInit to copy weights to weights_dev.

// Allocate space for the device copy of the smoothing weights
hipMalloc(&smoothOperator->weights_dev, N*N*sizeof(real));
// Copy weights from the host to the device
hipMemcpy(smoothOperator->weights_dev,
          smoothOperator->weights,
          N*N*sizeof(real),
          hipMemcpyHostToDevice);
Add a call to hipFree in smootherFree to deallocate the GPU memory held by weights_dev.

hipFree(smoothOperator->weights_dev);
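The codelab's snippets omit error checking for brevity, but every HIP API call returns a hipError_t. If you want to catch failures early as you work through these steps, you could wrap calls in a small helper macro like the sketch below (HIP_CHECK is our own name here, not part of the HIP API):

#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

// Abort with a readable message if a HIP API call fails
#define HIP_CHECK(call)                                        \
  do {                                                         \
    hipError_t err = (call);                                   \
    if (err != hipSuccess) {                                   \
      fprintf(stderr, "HIP error '%s' at %s:%d\n",             \
              hipGetErrorString(err), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                      \
    }                                                          \
  } while (0)

// Example usage:
// HIP_CHECK(hipMalloc(&smoothOperator->weights_dev, N*N*sizeof(real)));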
In the smoother Makefile, let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference the full path to the hipcc compiler. We use the ?= assignment operator to allow a user's environment variables to override these values if desired.

ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda
Change the compiler to hipcc and save your changes.
CC=$(ROCM)/bin/hipcc
CFLAGS=-O0 -g
Once you have completed the code and Makefile modifications, you can compile smoother and verify that data is allocated on and copied to the GPU.
Move the current model output to a reference/ subdirectory for later comparison. Whenever we make a change to the code, we will compare output with this reference data.

$ mkdir ./reference
$ mv function.txt smooth-function.txt ./reference/
Remove the *.o files and the smoother binary to ensure a clean build, and make a new smoother binary.

$ make clean && make smoother
Run smoother with the same input parameters as you did in the previous section and verify the output is unchanged. We use the diff command line utility to compare the output files and the reference files. If there are no differences, diff will produce no output.

$ ./smoother 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
Profile the application using rocprof with the --sys-trace and --stats flags. Running rocprof will create a file called results.json that contains the data for a trace profile. Additionally, results.stats.csv and results.hip_stats.csv will contain hotspot analyses for HIP kernels and HIP API calls, respectively.

$ rocprof --sys-trace --stats ./smoother 1000 100
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,1,12503928,12503928,95.8586582624
hipMalloc,3,387867,129289,2.9734984242
hipFree,3,152335,50778,1.16784331343
Once you have verified the output, commit your changes.

$ git add smoother.h smoother.cpp Makefile && git commit
Similarly, modify the includes in main.cpp so that you can make HIP API calls.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "precision.h"
#include "smoother.h"
#include <hip/hip_runtime.h>
Declare the device pointers f_dev and smoothF_dev in main.cpp.

int main( int argc, char *argv[] ) {

  smoother smoothOperator;
  int nx, ny, nElements;
  int nIter;
  real dx;
  real *f, *smoothF;
  real *f_dev, *smoothF_dev;
Add calls to hipMalloc for f_dev and smoothF_dev.

// Create the smoother
smootherInit(&smoothOperator);
// Allocate space for the function we want to smooth
f = (real*)malloc( nElements*sizeof(real) );
smoothF = (real*)malloc( nElements*sizeof(real) );
hipMalloc(&f_dev, nElements*sizeof(real));
hipMalloc(&smoothF_dev, nElements*sizeof(real));
Add a hipMemcpy call to update f_dev (host to device) prior to calling smoothField, and another to update smoothF (device to host) after calling smoothField.

hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
smoothField( &smoothOperator, f, smoothF, nx, ny );
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
Add calls to hipFree at the end of the main function to free memory held by the f_dev and smoothF_dev pointers.

// Free space
free(f);
free(smoothF);
hipFree(f_dev);
hipFree(smoothF_dev);
Rebuild the application.

make
Profile the application again using rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show calls to hipMalloc, hipMemcpy, and hipFree.

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,22,43602127,1981914,98.9534503691
hipMalloc,3,287919,95973,0.653421757999
hipFree,3,173225,57741,0.393127872872
For NVIDIA platforms, use nvprof. At this stage, you should see three calls to cudaMalloc, three calls to cudaFree, ten calls to cudaMemcpy (Device to Host), and eleven calls to cudaMemcpy (Host to Device).
$ nvprof ./smoother 1000 10
==23287== NVPROF is profiling process 23287, command: ./smoother 1000 10
==23287== Profiling application: ./smoother 1000 10
==23287== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.12% 4.6421ms 10 464.21us 427.18us 589.69us [CUDA memcpy DtoH]
47.88% 4.2636ms 11 387.60us 1.4720us 511.96us [CUDA memcpy HtoD]
API calls: 93.92% 194.18ms 3 64.727ms 135.75us 193.88ms cudaMalloc
5.20% 10.757ms 21 512.25us 22.255us 778.80us cudaMemcpy
0.39% 809.35us 97 8.3430us 537ns 396.19us cuDeviceGetAttribute
0.25% 526.66us 1 526.66us 526.66us 526.66us cuDeviceTotalMem
0.16% 339.77us 3 113.26us 2.7010us 176.22us cudaFree
0.06% 115.21us 1 115.21us 115.21us 115.21us cuDeviceGetName
0.00% 4.4240us 3 1.4740us 700ns 2.9380us cuDeviceGetCount
0.00% 4.3810us 1 4.3810us 4.3810us 4.3810us cuDeviceGetPCIBusId
0.00% 2.5650us 2 1.2820us 662ns 1.9030us cuDeviceGet
0.00% 1.0790us 1 1.0790us 1.0790us 1.0790us cuDeviceGetUuid
At this point, you now have the necessary data declared on the GPU. Additionally, you used hipMemcpy to make the input to smoothField available on the GPU. In the next step, you will create a HIP kernel that will run the smoothField algorithm on the GPU and replace the call to smoothField with a call to launch this kernel.
In this section, you will offload the smoothField routine to the GPU.
Let's look at the smoothField routine from smoother.cpp.
void smoothField( struct smoother *smoothOperator, real *f, real *smoothF, int nx, int ny )
{
  int iel, ism;
  int N = (real)smoothOperator->dim;
  int buf = (real)(smoothOperator->dim-1)/2.0;
  real smLocal;
  for( int j=buf; j < ny-buf; j++ ){
    for( int i=buf; i < nx-buf; i++ ){
      smLocal = 0.0;
      for( int jj=-buf; jj <= buf; jj++ ){
        for( int ii=-buf; ii <= buf; ii++ ){
          iel = (i+ii)+(j+jj)*nx;
          ism = (ii+buf) + (jj+buf)*N;
          smLocal += f[iel]*smoothOperator->weights[ism];
        }
      }
      iel = i+j*nx;
      smoothF[iel] = smLocal;
    }
  }
}
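Note that f and smoothF are flattened 1-D arrays indexed as iel = i + j*nx. For example, with nx = 1000, the point (i, j) = (3, 2) lives at element 2003. A tiny standalone check of this indexing (hypothetical values, not codelab code):

#include <cstdio>

int main() {
  int nx = 1000;
  int i = 3, j = 2;
  int iel = i + j*nx;          // flattened index of point (i, j)
  printf("iel = %d\n", iel);   // prints 2003
  return 0;
}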
The outer loops, over i and j, are tightly nested loops over a 2-D grid. The sizes of these loops are ny-2*buf and nx-2*buf. The values of nx and ny are determined by the user through the first command line argument (we have been using 1000), and buf is 2 (smoothOperator->dim=5; see smootherInit). For example, with nx = ny = 1000 and buf = 2, the i and j loops together cover 996 x 996, or roughly one million, grid points. Within the i and j loops, we carry out a reduction operation for smLocal and then assign the value to each element of smoothF.
In the smoothField algorithm, the order in which we execute the i and j loops does not matter. Further, the size of each loop is O(1000) for the example we're working with. A reasonable strategy for offloading this routine to the GPU is to have each GPU thread execute the instructions within the i and j loops. Ideally, then, we want each thread to execute something like the following:
real smLocal = 0.0;
for( int jj=-buf; jj <= buf; jj++ ){
  for( int ii=-buf; ii <= buf; ii++ ){
    iel = (i+ii)+(j+jj)*nx;
    ism = (ii+buf) + (jj+buf)*N;
    smLocal += f[iel]*smoothOperator->weights[ism];
  }
}
iel = i+j*nx;
smoothF[iel] = smLocal;
Notice that the i and j loops are now gone. Within the HIP kernel, we can calculate i and j from threadIdx.[x,y], blockIdx.[x,y], and blockDim.[x,y], assuming that we will launch the kernel with 2-D grid and block dimensions. You can use something like the following to calculate i and j.
size_t i = threadIdx.x + blockIdx.x*blockDim.x+buf;
size_t j = threadIdx.y + blockIdx.y*blockDim.y+buf;
Within the main program, you will be able to launch the GPU kernel, but you will need to calculate the grid and block dimensions. For now, let's assume that the number of threads per block in the i and j loop dimensions (the x and y directions) is fixed at 16. With the number of threads per block chosen, you can calculate the grid dimensions by requiring the x and y grid dimensions to be greater than or equal to the i and j loop sizes, respectively.
int buf = (real)(smoothOperator->dim-1)/2.0;
int threadsPerBlockX = 16;
int threadsPerBlockY = 16;
int gridDimX = (nx-2*buf)/threadsPerBlockX + 1;
int gridDimY = (ny-2*buf)/threadsPerBlockY + 1;
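Note that the formula (nx-2*buf)/threadsPerBlockX + 1 launches one extra block whenever the loop size is an exact multiple of the block size; the bounds check inside the kernel makes the extra threads harmless. A common alternative, shown here as a sketch, is the ceiling-division form, which launches exactly enough blocks:

// Ceiling division: the smallest number of blocks that covers all work items
int gridDimX = ( (nx - 2*buf) + threadsPerBlockX - 1 ) / threadsPerBlockX;
int gridDimY = ( (ny - 2*buf) + threadsPerBlockY - 1 ) / threadsPerBlockY;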
Add a new routine, smoothField_gpu, to smoother.cpp. This routine needs to be decorated with the __global__ declaration specifier so that it can be launched on the GPU (device) from the CPU (host).

__global__ void smoothField_gpu( real *weights, real *f, real *smoothF, int nX, int nY, int N )
{
  int buf = (N-1)/2;
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  int iel, ism;
  if( i < nX-buf && j < nY-buf ){
    real smLocal = 0.0;
    for( int jj=-buf; jj <= buf; jj++ ){
      for( int ii=-buf; ii <= buf; ii++ ){
        iel = (i+ii)+(j+jj)*nX;
        ism = (ii+buf) + (jj+buf)*N;
        smLocal += f[iel]*weights[ism];
      }
    }
    iel = i+j*nX;
    smoothF[iel] = smLocal;
  }
}
Set up the launch parameters and replace the smoothField call with a kernel launch in main.cpp. You can place this block of code just before the iteration loop.

int buf = (smoothOperator.dim-1)/2;
int tX = 16;
int tY = 16;
int gX = (nx-2*buf)/tX + 1;
int gY = (ny-2*buf)/tY + 1;
dim3 threads(tX,tY,1);
dim3 blocks(gX,gY,1);
// Copy f from host to device : f is input to `smoothField`
hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
// Run the smoother
smoothField_gpu<<<blocks,threads>>>(
smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );
// Copy smoothF_dev from device to host
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
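Keep in mind that kernel launches are asynchronous with respect to the host; the blocking hipMemcpy that follows implicitly waits for the kernel to finish. While debugging, you may optionally check for launch errors explicitly (a sketch, not one of the codelab's required steps):

// Immediately after the kernel launch:
hipError_t err = hipGetLastError();   // catches invalid launch configurations
if (err != hipSuccess) {
  printf("Kernel launch failed: %s\n", hipGetErrorString(err));
}
hipDeviceSynchronize();               // wait for the kernel to complete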
Add the smoothField_gpu declaration to smoother.h.

__global__ void smoothField_gpu( real* weights, real *f, real *smoothF, int nx, int ny, int N);
$ make clean && make
$ rocprof --sys-trace --stats ./smoother 1000 10
Congratulations! So far, you've learned how to allocate and manage memory on the GPU and how to launch a GPU kernel. Right now, the code is in a state where, on every iteration, data is copied to the GPU before calling smoothField_gpu and back from the GPU afterwards. This situation happens quite often when first porting an application to GPUs.
The next step in this codelab is to offload the resetF routine to the GPU, even though it does not take up much time on the CPU. Offloading it allows us to move the hipMemcpy calls outside of the iteration loop in main and reduce the number of times data is transferred across the PCIe bus. With nx = ny = 1000, each copy of f or smoothF moves one million real values, so hoisting the copies out of the loop reduces the array transfers from two per iteration to two in total.
In this section, we are going to offload the resetF routine in smoother.cpp to the GPU so that we can migrate hipMemcpy calls outside of the iteration loop in main.cpp. By this point, you have worked through the mechanics for porting a routine to the GPU. Additionally, for this application, we already have all of the necessary data on the GPU that the resetF routine depends on.
Add the resetF_gpu declaration to smoother.h.

__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf );
Add resetF_gpu to smoother.cpp.

__global__ void resetF_gpu( real *f, real *smoothF, int nX, int nY, int buf )
{
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + buf;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + buf;
  int iel = i + nX*j;
  if( i < nX-buf && j < nY-buf ){
    f[iel] = smoothF[iel];
  }
}
Replace the call to resetF with resetF_gpu in main.cpp, move the hipMemcpy host-to-device calls before the iteration loop, and move the hipMemcpy device-to-host call after the iteration loop.

hipMemcpy(f_dev, f, nElements*sizeof(real), hipMemcpyHostToDevice);
for( int iter=0; iter<nIter; iter++){
  // Run the smoother
  smoothField_gpu<<<blocks,threads>>>(
    smoothOperator.weights_dev, f_dev, smoothF_dev, nx, ny, smoothOperator.dim );
  // Reassign smoothF to f
  resetF_gpu<<<blocks,threads>>>(f_dev, smoothF_dev, nx, ny, buf );
}
// Copy smoothF_dev from device to host
hipMemcpy(smoothF, smoothF_dev, nElements*sizeof(real), hipMemcpyDeviceToHost);
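With the data transfers hoisted out of the loop, you can quantify the improvement. One option, beyond rocprof, is to time the iteration loop with HIP events; the sketch below is optional and not part of the codelab's steps:

hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
hipEventRecord(start, 0);            // mark the start of the iteration loop
// ... the iteration loop with the kernel launches goes here ...
hipEventRecord(stop, 0);             // mark the end of the iteration loop
hipEventSynchronize(stop);           // wait until the stop event completes
float ms = 0.0f;
hipEventElapsedTime(&ms, start, stop);
printf("Iteration loop took %f ms\n", ms);
hipEventDestroy(start);
hipEventDestroy(stop);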
Rebuild the application and profile it again.

make
rocprof --sys-trace --stats ./smoother 1000 10
In this codelab, you learned how to port serial CPU-only routines in C++ to GPUs using HIP. To do this, you created device copies of CPU arrays and learned how to copy data from the CPU to the GPU and vice versa. You also learned how to write HIP kernels and launch them from the host.
In the process of doing this, you practiced a strategy for porting to GPUs that included the following steps for making incremental changes to your source code:
- Profile the application to find hotspots and record baseline runtimes.
- Save reference output so that correctness can be verified after every change.
- Create device copies of the necessary data and add host-device memory copies.
- Replace the hotspot routines with GPU kernels and verify the output.
- Profile again and reduce data movement between the host and device.
If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com.