OpenMP Basics
The OpenMP 4 standard introduced directives for offloading regions of code to GPU hardware, including directives for parallelizing loop structures and for managing device memory. Additionally, the OpenMP runtime API provides routines for managing devices. With the OpenMP 5 standard, the API was expanded to provide more explicit control over host and device pointer management, which allows for interoperability with HIP, CUDA, and GPU-accelerated libraries. Although the OpenMP standard has progressed to version 5 (as of 2022), support within compilers varies; see, for example, the LLVM Clang/Flang list of supported OpenMP features.
In this section, you will learn:
- How to offload regions of code to GPU accelerators with OpenMP `target` directives
- How to parallelize for/do loops with OpenMP `teams`
- How to use data regions to minimize memory copies between CPU and GPU
- How to compile Fortran and C/C++ applications with `amdflang` and `amdclang`
Offloading Compute to GPUs
Target Regions
With OpenMP, you can use the `omp target` directive to mark regions of code that you want the compiler to offload to the GPU. When you open a target region, a target task is created on the GPU and, with no other directives, all code within the target region runs serially on a single thread. Because GPUs have a memory space distinct from the host CPU's, you also need to tell the compiler what data must be copied to and from the GPU when opening a target region; this is done with the `map` clause.
The examples below show how to use the `target` directive and `map` clauses in Fortran and C/C++ for a simple matrix-vector multiply.
INTEGER, PARAMETER :: N=1000
INTEGER :: i, r
REAL :: axloc
REAL, ALLOCATABLE :: A(:,:), Ax(:), x(:)
ALLOCATE(A(1:N,1:N), Ax(1:N), x(1:N))
! Initialize values
x = 1.0
A = 1.0
!$omp target map(to:A,x) map(from:Ax)
DO i = 1, N
axloc = 0.0
DO r = 1, N
axloc = axloc+A(r,i)*x(r)
ENDDO
Ax(i) = axloc
ENDDO
!$omp end target
DEALLOCATE(A,Ax,x)
int N=1000;
float *A, *Ax, *x;
A  = (float*)malloc(sizeof(float)*N*N);
Ax = (float*)malloc(sizeof(float)*N);
x  = (float*)malloc(sizeof(float)*N);
// Initialize values
for( int i = 0; i < N; i++ ){
x[i] = 1.0;
for( int j = 0; j < N; j++ ){
A[j+i*N] = 1.0;
}
}
float axloc = 0.0;
#pragma omp target map(to:A[0:N*N],x[0:N]) map(from:Ax[0:N])
{
for( int i = 0; i < N; i++ ){
axloc = 0.0;
for( int r = 0; r < N; r++) {
axloc += A[r+N*i]*x[r];
}
Ax[i] = axloc;
}
}
free(A);
free(Ax);
free(x);
In these examples,
- OpenMP directives begin with `!$omp` in Fortran and `#pragma omp` in C/C++.
- In Fortran, the OpenMP target region is closed with the `!$omp end target` directive.
- In C/C++, the OpenMP target region is enclosed in braces `{` and `}` to indicate clearly the region that is offloaded to the GPU.
- The arrays `A` and `x` are copied to the GPU with the `map(to:...)` clause. The `to` map type eliminates an unnecessary copy back to the CPU at the end of the target region, since `A` and `x` are not updated within the target region. In C/C++, array sections such as `A[0:N*N]` and `x[0:N]` are needed so the compiler knows how much memory each pointer refers to; in Fortran, the array bounds are known and `map(to:A,x)` suffices.
- The `Ax` array is copied back from the GPU at the end of the target region with the `map(from:...)` clause. The `from` map type eliminates an unnecessary copy to the GPU at the beginning of the target region, since `Ax` is only written to within the target region.
- Scalar values, like `axloc` and the loop iterators, are implicitly firstprivate and copied to the GPU as needed, so you do not need to list them explicitly in `map` clauses.
Teams Distribute Parallel
Within a target region, with no other directives, a single thread of execution is launched on the GPU. However, GPUs are capable of running thousands of threads simultaneously. Before the `teams` directive was introduced in OpenMP 4.0, parallelization was limited to a single group of threads. Since OpenMP 4.0, the `teams` directive can be used to express an additional level of parallelism that is well suited to GPUs.
The `teams` directive creates a “league” of teams, each of which has a single thread by default. OpenMP teams execute instructions concurrently. Following a `teams` directive with `distribute` causes the iterations of subsequent `for` or `do` loops to be distributed across the teams.
The `parallel` directive creates multiple threads within each team, and the number of threads can be set with the optional `num_threads` clause on the parallel directive. On a GPU, the number of threads per team is ideally a multiple of the wavefront or warp size (64 threads on AMD GPUs, 32 threads on NVIDIA GPUs).
When working with applications that have tightly nested loops, you can increase the amount of exposed parallelism by using the `collapse(n)` clause, where `n` is the number of loops to collapse. Loop collapsing is ideal when the loops are tightly nested and their iterations have no dependencies between them, as in the sketch below.
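For illustration only (the arrays `B` and `C` and the bound `M` are hypothetical and not part of the matrix-vector example), a minimal sketch of collapsing two independent, tightly nested loops so that the combined N*M iteration space is distributed across teams and threads:
// Hypothetical example: collapse(2) merges both loops into a single iteration space
#pragma omp target teams distribute parallel for collapse(2) map(to:B[0:N*M]) map(from:C[0:N*M])
for( int i = 0; i < N; i++ ){
    for( int j = 0; j < M; j++ ){
        C[j+i*M] = 2.0f*B[j+i*M];
    }
}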
The examples below show how we would parallelize the matrix-vector multiply from the previous example.
INTEGER, PARAMETER :: N=1000
INTEGER :: i, r
REAL :: axloc
REAL, ALLOCATABLE :: A(:,:), Ax(:), x(:)
ALLOCATE(A(1:N,1:N), Ax(1:N), x(1:N))
! Initialize values
x = 1.0
A = 1.0
!$omp target map(to:A,x) map(from:Ax)
!$omp teams distribute parallel do private(axloc) num_threads(256)
DO i = 1, N
axloc = 0.0
DO r = 1, N
axloc = axloc+A(r,i)*x(r)
ENDDO
Ax(i) = axloc
ENDDO
!$omp end target
DEALLOCATE(A,Ax,x)
int N=1000;
float *A, *Ax, *x;
A  = (float*)malloc(sizeof(float)*N*N);
Ax = (float*)malloc(sizeof(float)*N);
x  = (float*)malloc(sizeof(float)*N);
// Initialize values
for( int i = 0; i < N; i++ ){
x[i] = 1.0;
for( int j = 0; j < N; j++ ){
A[j+i*N] = 1.0;
}
}
float axloc = 0.0;
#pragma omp target map(to:A[0:N*N],x[0:N]) map(from:Ax[0:N])
{
#pragma omp teams distribute parallel for private(axloc) num_threads(256)
for( int i = 0; i < N; i++ ){
axloc = 0.0;
for( int r = 0; r < N; r++) {
axloc += A[r+N*i]*x[r];
}
Ax[i] = axloc;
}
}
free(A);
free(Ax);
free(x);
In this example, we only parallelize the outer loop with the `teams distribute parallel for` (C/C++) or `teams distribute parallel do` (Fortran) combined construct. Because every thread now computes its own partial result, `axloc` is declared `private` on the construct. The inner loop is a reduction and is left serial in this example. For large enough inner loops, parallelizing that reduction with a `reduction` clause can be beneficial and may improve the runtime by exposing more parallelism; a sketch of this follows below. The number of threads per team is set to 256 with the `num_threads` clause. When parallelizing your own codes, you should consider varying this value between your GPU's wavefront or warp size and 1024 (the upper bound for threads per team on current GPUs) and measuring the kernel runtime.
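One possible way to expose the inner-loop parallelism (a sketch of one approach, not the only option) is to distribute the outer loop across teams and let each team's threads perform the inner reduction with a `reduction(+:axloc)` clause:
// Distribute rows over teams; each team's threads reduce over the inner loop
#pragma omp target teams distribute map(to:A[0:N*N],x[0:N]) map(from:Ax[0:N])
for( int i = 0; i < N; i++ ){
    float axloc = 0.0f;
    #pragma omp parallel for reduction(+:axloc) num_threads(256)
    for( int r = 0; r < N; r++ ){
        axloc += A[r+N*i]*x[r];
    }
    Ax[i] = axloc;
}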
Data Regions
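The examples above map `A`, `x`, and `Ax` every time a target region is opened. When several target regions work on the same arrays, an enclosing `target data` region keeps the data resident on the GPU so it is transferred only once. The code below is a sketch based on the earlier C/C++ example (the additional target regions are only indicated by a comment):
// Device copies of A, x, and Ax live for the duration of the target data region
#pragma omp target data map(to:A[0:N*N],x[0:N]) map(from:Ax[0:N])
{
    #pragma omp target teams distribute parallel for
    for( int i = 0; i < N; i++ ){
        float axloc = 0.0f;
        for( int r = 0; r < N; r++ ){
            axloc += A[r+N*i]*x[r];
        }
        Ax[i] = axloc;
    }
    // Further target regions placed here reuse the device copies of A, x, and Ax
    // without additional host-device transfers.
}
For data that must stay on the device across subroutine or function boundaries, the unstructured `target enter data` and `target exit data` directives provide similar control.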
Compiling Applications
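With the ROCm compilers, OpenMP offload is typically enabled by passing `-fopenmp` together with `--offload-arch` naming the GPU architecture. The commands below are a sketch: the source file names are placeholders, and `gfx90a` (an MI200-series GPU) is only an example architecture, which you can look up for your system with `rocminfo`.
amdclang -fopenmp --offload-arch=gfx90a matvec.c -o matvec
amdflang -fopenmp --offload-arch=gfx90a matvec.f90 -o matvec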
Try it yourself
To accompany these notes, we have created hands-on codelabs that walk you through porting a mini application to the GPU using OpenMP. You will offload two kernels using `target` and `teams distribute parallel` directives. You will then learn how to use the `rocprof` profiler to create hotspot and trace profiles of the application and identify memory copies as a performance bottleneck. Finally, you will use OpenMP data regions to minimize memory copies and reduce the application's runtime.
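As a rough guide (the flag names below follow the classic `rocprof` interface and may differ between ROCm releases; `./mini_app` is a placeholder for your executable):
rocprof --stats ./mini_app      # per-kernel hotspot statistics
rocprof --sys-trace ./mini_app  # timeline trace including kernels and memory copies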