Last Updated: 2022-09-27

What you will build

In this codelab, you will port a small, serial, CPU-only Fortran mini-application to a portable GPU-accelerated application using HIP and HIPFort.

The goal of this codelab is to introduce you to the HIPFort API and a development practice that you can apply when porting other Fortran applications to GPU hardware.

What you will learn

What you will need

The demo application provided for this tutorial performs 2-D smoothing operations using a 5x5 Gaussian stencil.

In this section, we introduce the demo application and walk through building and verifying the example. It is important to make sure that the code produces the expected result, as we will use the CPU-generated model output to verify that the solution does not change when we port to the GPU.

This application executes a 2-D smoothing operation on a square grid of points (the operation itself is written out just after the list below). The program proceeds as follows:

  1. Process command line arguments
  2. Allocate memory for smoother class - 5x5 stencil with Gaussian weights
  3. Allocate memory for function and smoothed function
  4. Initialize function on CPU and report function to file
  5. Call smoothing function
  6. Report smoothed function to file
  7. Clear memory
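In the notation used by the ApplySmoother routine shown later in this codelab, the smoothing in step 5 computes, for each interior point (i, j) of the grid, a weighted sum over a (2 nW + 1) x (2 nW + 1) neighborhood:

\[
\mathrm{smoothF}(i,j) \;=\; \sum_{jj=-nW}^{nW} \;\sum_{ii=-nW}^{nW} \mathrm{weights}(ii,jj)\, f(i+ii,\, j+jj),
\]

with nW = 2, giving the 5x5 stencil mentioned above.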

Code Structure

This application's src directory contains the following files:

  1. smoother.F90 : Defines a simple data structure that stores the smoothing operator's weights and the routines for allocating memory, deallocating memory, and executing the smoothing operation.
  2. main.F90 : Defines the main program that sets up the 2-D field to be smoothed and manages file IO.
  3. Makefile : A simple makefile to build the application binary smoother.
  4. viz.py : A Python script for creating plots of the smoother output.

Install and Verify the Application

To get started, we want to make sure that the application builds and runs on your system using the GNU Fortran compiler (gfortran).

  1. Clone the repository
git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
  1. Build the smoother application. Keep in mind, the compiler is set to gfortran by default in the provided makefile.
cd ~/scientific-computing-edu/samples/fortran/smoother/src
make
  1. Test run the example. The application takes three arguments: the first two are the number of grid cells in the x and y directions, and the third is the number of times the smoothing operator is applied.
./smoother 1000 1000 10

Profile the Application

Before starting any GPU porting exercise, it is important to profile your application to find the hotspots where it spends most of its time. It is also helpful to keep track of the runtime of the routines in your application so that you can later assess whether the GPU port has improved performance. Ideally, your GPU-accelerated application should outperform the CPU-only version even when the CPU-only version fully subscribes the available CPU cores on a compute node.

Create the profile

In this tutorial, we are going to generate a profile and call graph using gprof. The provided makefile is already configured to build with profiling instrumentation (compiling and linking with -pg), so the run you just performed produced a gmon.out file. From here, you just need to run gprof to create the application profile.

$ gprof ./smoother gmon.out

Interpret the profile and call tree

gprof provides a flat profile and a summary of your application's call structure indicating dependencies within your source code as a call tree. A call tree depicts the relationships between routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.

The first section of the gprof output is the flat-profile. An example flat-profile for the smoother application is given below. The flat-profile provides a list of routines in your application, ordered by the percent time your program spends within those routines from greatest to least. Beneath the flat-profile, gprof provides documentation of each of the columns for your convenience.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 95.24      1.16     1.16       10   116.19   116.19  smoothField
  2.46      1.19     0.03       10     3.00     3.00  resetF
  2.46      1.22     0.03                             main
  0.00      1.22     0.00        1     0.00     0.00  smootherFree
  0.00      1.22     0.00        1     0.00     0.00  smootherInit

Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. Within each entry, the rows above the entry's own (numbered) line list the routines that call it, and the rows below list the routines that it calls.

For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. The called column indicates that smoothField and resetF are each called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main itself and 1.19s in the routines it calls. Of those 1.19s, 1.16s are spent in smoothField and 0.03s are spent in resetF.

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.03    1.19                 main [1]
                1.16    0.00      10/10          smoothField [2]
                0.03    0.00      10/10          resetF [3]
                0.00    0.00       1/1           smootherInit [5]
                0.00    0.00       1/1           smootherFree [4]
-----------------------------------------------
                1.16    0.00      10/10          main [1]
[2]     95.1    1.16    0.00      10         smoothField [2]
-----------------------------------------------
                0.03    0.00      10/10          main [1]
[3]      2.5    0.03    0.00      10         resetF [3]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[4]      0.0    0.00    0.00       1         smootherFree [4]
-----------------------------------------------
                0.00    0.00       1/1           main [1]
[5]      0.0    0.00    0.00       1         smootherInit [5]
-----------------------------------------------
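As a quick sanity check, the numbers in entry [1] are internally consistent with the flat profile:

\[
0.03\,\mathrm{s}\;(\text{main, self}) \;+\; 1.16\,\mathrm{s}\;(\text{smoothField}) \;+\; 0.03\,\mathrm{s}\;(\text{resetF}) \;=\; 1.22\,\mathrm{s},
\qquad 1.16 / 1.22 \approx 95\%,
\]

which matches the total cumulative time and the % time reported for smoothField.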

Next steps

Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will build with hipfort's hipfc compiler wrapper, we will first modify the Makefile to use hipfc. Then, we will focus on porting the ApplySmoother routine and the data it needs to the GPU, since this routine takes up the majority of the run time.

When we port this routine, we will introduce data allocation on the GPU and data copies between the CPU and GPU. This data movement may increase the overall application runtime, even if the ApplySmoother routine itself runs faster on the GPU. If that happens, we will then work on minimizing data movement between the CPU and GPU.

In the smoother application, we have seen that the ApplySmoother routine, called by main, takes up the most time. Looking at the main iteration loop in lines 51-56 of main.F90 and the ApplySmoother routine in smoother.F90, we see that this routine takes in the arrays f and weights and the integers nW, nX, and nY, and produces the smoothed field smoothF.

 51     DO iter = 1, nIter
 52 
 53       smoothF = ApplySmoother( f, weights, nW, nX, nY )
 54       CALL ResetF( f, smoothF, nW, nX, nY )
 55 
 56     ENDDO

In order to offload ApplySmoother to the GPU, we will need to copy the f and weights arrays to the GPU. After calling ApplySmoother, we will want to copy smoothF back to the CPU before calling ResetF.
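Putting this plan together, the main iteration loop will eventually look something like the sketch below. This is only an outline, assuming the device arrays f_dev and smoothF_dev and the hipfort calls introduced in the following sections; the exact code is developed step by step there.

   DO iter = 1, nIter

     ! Copy the host field to the device before smoothing
     CALL hipCheck(hipMemcpy(C_LOC(f_dev), C_LOC(f), SIZEOF(f), hipMemcpyHostToDevice))

     ! Apply the smoother (eventually as a GPU kernel launch)
     smoothF = ApplySmoother( f, weights, nW, nX, nY )

     ! Copy the smoothed field back to the host before ResetF
     CALL hipCheck(hipMemcpy(C_LOC(smoothF), C_LOC(smoothF_dev), SIZEOF(smoothF), hipMemcpyDeviceToHost))

     CALL ResetF( f, smoothF, nW, nX, nY )

   ENDDO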

Saving Reference Output

Before making changes to the source code, you'll first want to save the existing output so that you have a reference to compare against later. Having reference output to compare against is critical to verifying that source code changes you make do not change the results.

To do this, make a directory called reference/ and move function.txt and smooth-function.txt into it. These files were created during the initial execution of the smoother application in the previous section of this codelab.

$ mkdir ./reference
$ mv function.txt smooth-function.txt ./reference/

Copying ALLOCATABLE data to the GPU

In this section, you will learn how to allocate memory on the GPU and copy data between the host and device using hipfort. You'll start by inserting calls to allocate and deallocate device memory with hipMalloc and hipFree. Once this is working, you'll then copy data between the CPU and GPU with hipMemcpy.

Allocating Device Memory

  1. Add USE statements for hipfort, hipfort_check, and ISO_C_BINDING at the top of main.F90. We use ISO_C_BINDING because you will need intrinsics such as C_LOC to map between Fortran and C pointers.
PROGRAM main
  
USE smoother
USE hipfort
USE hipfort_check
USE ISO_C_BINDING

IMPLICIT NONE
  1. Change the ALLOCATABLE arrays to Fortran POINTER and add declarations for device copies of the f, smoothF, and weights arrays. The device copies can be declared as TYPE(c_ptr) or Fortran POINTER. In this example, we will use Fortran POINTER.
  IMPLICIT NONE
    
    INTEGER, PARAMETER :: nW = 2
    INTEGER :: nX, nY, nIter
    REAL(prec), POINTER :: f(:,:)
    REAL(prec), POINTER :: smoothF(:,:)
    REAL(prec), POINTER :: weights(:,:)
    REAL(prec), POINTER :: f_dev(:,:)
    REAL(prec), POINTER :: smoothF_dev(:,:)
    REAL(prec), POINTER :: weights_dev(:,:)
    REAL(prec) :: dx, dy, x, y
    INTEGER :: i, j, iter
  1. Next, insert calls to hipMalloc wrapped inside of hipCheck to allocate space for f_dev, smoothF_dev, and weights_dev. The hipMalloc routine has an argument called mold that you can use to pass the host array to set the size of the device array. Because of this, you will want to insert these calls after the ALLOCATE statements for f, smoothF, and weights.
      ! Allocate device memory
      CALL hipCheck(hipMalloc(f_dev, mold=f))
      CALL hipCheck(hipMalloc(smoothF_dev, mold=smoothF))
      CALL hipCheck(hipMalloc(weights_dev, mold=weights))
  1. At the end of main.F90, insert calls to hipFree (wrapped inside hipCheck) to free memory held by f_dev, smoothF_dev, and weights_dev.
      ! Deallocate GPU memory
      CALL hipCheck( hipFree(f_dev) )
      CALL hipCheck( hipFree(smoothF_dev) )
      CALL hipCheck( hipFree(weights_dev) )
  1. Before going too much further, let's check to make sure the code compiles and runs with these changes. Starting from the smoother makefile (samples/fortran/smoother/src/Makefile), let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference full paths to the hipfc compiler.

    When setting these variables, we use the ?= operator so that values already set in a user's environment (or on the make command line) override these defaults.
ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda


Change the compiler to hipfc and save your changes.

HIPFORT_COMPILER ?= /usr/bin/gfortran
FC= $(ROCM)/hipfort/bin/hipfc
FFLAGS=-O0 -g -hipfort-compiler $(HIPFORT_COMPILER)
  1. Remove the *.o files and the smoother binary to ensure a clean build, then make a new smoother binary.
$ make clean && make smoother
  1. Run smoother with the same input parameters as you did in the previous section and verify the output is unchanged. We use the diff command line utility to compare the output files and the reference files. If there are no differences, diff will produce no output.
$ ./smoother 1000 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
  1. You can verify that data is allocated on the GPU by using a profiler. When running under the profiler on AMD platforms, you should observe three (3) calls to hipMalloc and three (3) calls to hipFree. On Nvidia platforms, you should observe three (3) calls to cudaMalloc and three (3) calls to cudaFree.


For AMD platforms, use rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show a summary of the calls to hipMalloc and hipFree (calls to hipMemcpy will also appear here once we add them in the next section).


For Nvidia platforms, use nvprof. When running on Nvidia platforms, you can expect to see calls to cudaMalloc and cudaFree (and, once they are added in the next section, cudaMemcpy).

Copying Data between the Host and Device

So far, you have a code that has both CPU (host) and GPU (device) memory. You are now ready to make calls to transfer data between the host and device.

Recall from the first section of this codelab that the ApplySmoother routine is the most expensive routine. Because of this, we'll start by moving the necessary data to the GPU before the call to ApplySmoother, and moving the result back from the GPU afterwards.

To copy data to the GPU (Host To Device), you will use hipMemcpy, e.g.

hipMemcpy( destination, source, size, hipMemcpyHostToDevice )

The arguments for hipMemcpy are a pointer to where data will be copied to (destination), a pointer to the source of the data (source), the number of bytes to copy (size), and an enum provided by the HIP library to indicate the direction of the data transfer. When copying from host to device, this last argument is hipMemcpyHostToDevice. When copying from device to host, it is hipMemcpyDeviceToHost.

  1. Recall that we added USE ISO_C_BINDING at the top of main.F90. This gives us access to intrinsics like C_LOC, which we need to create the C pointers that the hipfort API expects.
  2. Since the weights array does not change with each iteration, insert a call to hipMemcpy wrapped inside the hipCheck routine to copy weights to weights_dev before the iteration loop starts.
     ! Copy weights to weights_dev
     CALL hipCheck(hipMemcpy(C_LOC(weights_dev), C_LOC(weights), SIZEOF(weights), hipMemcpyHostToDevice))
  1. To ensure that even the ghost cells retain the correct value for the smoothed function, we'll also copy the smoothF array to smoothF_dev, before the main iteration loop.
     ! Copy smoothF to smoothF_dev
     CALL hipCheck(hipMemcpy(C_LOC(smoothF_dev), C_LOC(smoothF), SIZEOF(smoothF), hipMemcpyHostToDevice))
  1. Just before the call to ApplySmoother, inside the iteration loop, insert calls to hipMemcpy wrapped inside the hipCheck routine to copy f to f_dev.
     DO iter = 1, nIter
 
       ! Copy f to f_dev
       CALL hipCheck(hipMemcpy(C_LOC(f_dev), C_LOC(f), SIZEOF(f), hipMemcpyHostToDevice))
  1. When we first port the ApplySmoother routine to the GPU, we will leave ResetF on the CPU. Because of this, we will need to have smoothF copied back to the CPU before calling ResetF. Now, just after the call to ApplySmoother and before the call to ResetF, insert a call to hipMemcpy wrapped inside the hipCheck routine to copy smoothF_dev from the device to the host.
 
       smoothF = ApplySmoother( f, weights, nW, nX, nY )
       
       ! Copy smoothF_dev to smoothF
       CALL hipCheck(hipMemcpy(C_LOC(smoothF), C_LOC(smoothF_dev), SIZEOF(smoothF), hipMemcpyDeviceToHost))
  1. Save your changes in main.F90 and recompile the application to make sure there are no syntax errors in your recent code modifications.
  2. If you run the application now, the output will be incorrect: we copy smoothF_dev, which is never updated on the GPU at this stage, back into smoothF, overwriting the result computed on the CPU. Nonetheless, run the application under a profiler to verify that the calls to hipMemcpy (or cudaMemcpy on Nvidia devices) are executed.

    For AMD platforms, use rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show calls to hipMalloc, hipMemcpy, and hipFree.
$ rocprof --hip-trace ./smoother 1000 1000 100
$ cat results.hip_stats.csv 
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,201,696108877,3463228,99.94123584828755
hipMalloc,3,216944,72314,0.031146925698335683
hipFree,3,192359,64119,0.02761722601411495

Next steps

At this point, you now have the necessary data declared on the GPU. Additionally, you used hipMemcpy to make the input to ApplySmoother available on the GPU. In the next step, you will create a HIP kernel that will run the ApplySmoother algorithm on the GPU and replace the call to ApplySmoother with a call to launch this kernel.

In this section, you will offload the ApplySmoother routine to the GPU.

Planning the GPU port

Let's look at the ApplySmoother subroutine from smoother.F90

SUBROUTINE ApplySmoother( f, weights, smoothF, nW, nX, nY )
  IMPLICIT NONE
  REAL(prec), INTENT(in) :: f(1:nX,1:nY)
  REAL(prec), INTENT(in) :: weights(-nW:nW,-nW:nW)
  INTEGER, INTENT(in) :: nW, nX, nY
  REAL(prec), INTENT(inout) :: smoothF(1:nX,1:nY)
  ! Local
  INTEGER :: i, j, ii, jj

    DO j = 1+nW, nY-nW
      DO i = 1+nW, nX-nW
        ! Take the weighted sum of f to compute the smoothF field
        smoothF(i,j) = 0.0_prec
        DO jj = -nW, nW
          DO ii = -nW, nW
            smoothF(i,j) = smoothF(i,j) + f(i+ii,j+jj)*weights(ii,jj)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
END SUBROUTINE ApplySmoother

The outer loops, over i and j, are tightly nested loops over a 2-D grid. The sizes of these loops are nX-2*nW and nY-2*nW, respectively. The values of nX and nY are determined by the user through the first two command line arguments (we have been using 1000 for both), and nW is 2 (see the declarations in main.F90). Within the i and j loops, we carry out a reduction (a weighted sum over the stencil) and assign the result to each element of smoothF.

In the ApplySmoother algorithm, the order in which we execute the i and j loops does not matter. Further, the size of each loop is O(1000) for the example we are working with. A good strategy for offloading this routine to the GPU is to have each GPU thread execute the instructions within the body of the i and j loops. Ideally, then, we want each thread to execute something like the following

    real smLocal = 0.0;
    for( int jj=-nW; jj <= nW; jj++ ){
      for( int ii=-nW; ii <= nW; ii++ ){
        int iel = (i+ii)+(j+jj)*nX;
        int ism = (ii+nW) + (jj+nW)*(2*nW+1);
        smLocal += f[iel]*weights[ism];
      }
    }
    int iel = i+j*nX;
    smoothF[iel] = smLocal;

Notice that the i and j loops are now gone. Within the HIP kernel, we can calculate i and j from threadIdx.[x,y], blockIdx.[x,y], and blockDim.[x,y], assuming that we launch the kernel with 2-D grid and block dimensions. You can use something like the following to calculate i and j.

  size_t i = threadIdx.x + blockIdx.x*blockDim.x+nW;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y+nW;

Within the main program, you will be able to launch the GPU kernel, but you will need to calculate the grid and block dimensions. For now, let's fix the number of threads per block in the i and j loop dimensions (the x and y directions) at 16. With the threads per block chosen, you can calculate the grid dimensions by requiring that, in each direction, the number of blocks times the threads per block is greater than or equal to the corresponding loop size.
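For example, with the parameters used in this codelab (nX = nY = 1000 and nW = 2), each loop has 996 iterations, so with 16 threads per block we need

\[
\mathrm{gridDim}_x = \mathrm{gridDim}_y \;\ge\; \left\lceil \frac{nX - 2\,nW}{16} \right\rceil \;=\; \left\lceil \frac{996}{16} \right\rceil \;=\; 63
\]

blocks in each direction; the wrapper routine in the next section computes this with integer division as (nX-2*nW)/threadsPerBlock + 1 = 62 + 1 = 63.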

Write the HIP Kernels in C++

  1. You may have noticed that the floating point precision is set in smoother.F90 and can be controlled using a C-Preprocessor flag. To extend this feature into the HIP kernels, add a header file, called precision.h. In this header file, define the type real to be double if the DOUBLE_PRECISION preprocessor flag is defined, and float otherwise.
#include <float.h>
 
#ifdef DOUBLE_PRECISION
  typedef double real;
#else
  typedef float real;
#endif
  1. Create a new file called smoother.cpp. This file will contain the C++ source code for the HIP Kernel and the wrapper routine. Start by adding statements to include the HIP runtime and precision.h.
#include "precision.h"
#include <hip/hip_runtime.h>
  1. Add a new routine, ApplySmoother_gpu. This routine needs to be declared __global__ so that it can be launched on the GPU (device) from the CPU (host) using the chevron launch syntax.
__global__ void ApplySmoother_gpu(real *f, real *weights, real *smoothF, int nW, int nX, int nY)
{
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + nW;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + nW;
  int iel, ism;
  
  if( i < nX-nW && j< nY-nW){
    real smLocal = 0.0;
    for( int jj=-nW; jj <= nW; jj++ ){
      for( int ii=-nW; ii <= nW; ii++ ){
        iel = (i+ii)+(j+jj)*nX;
        ism = (ii+nW) + (jj+nW)*(2*nW+1);
        smLocal += f[iel]*weights[ism];
      }
    }
    iel = i+j*nX;
    smoothF[iel] = smLocal;
  }
}
  1. Next, add a wrapper routine in smoother.cpp that will launch the HIP kernel. Declare the wrapper routine as extern "C" so that it can be called from Fortran through ISO C Binding. In this routine, you can calculate the grid and block size for the kernel launch. Here, we've set the number of threads in the x and y directions to 16 and calculated the grid dimensions by dividing the x and y extents of the outer loops of the original ApplySmoother routine by 16 and adding one block to cover any remainder. Note that the arguments are declared real ** : because the Fortran interface (added in the next section) passes each TYPE(c_ptr) argument by reference, the wrapper receives a pointer to each device pointer and dereferences it once (e.g. *f) when launching the kernel.
extern "C"
{
  void ApplySmoother_gpu_wrapper(real **f, real **weights, real **smoothF, int nW, int nX, int nY)
  {
    int threadsPerBlock = 16;
    int gridDimX = (nX-2*nW)/threadsPerBlock + 1;
    int gridDimY = (nY-2*nW)/threadsPerBlock + 1;
     
    ApplySmoother_gpu<<<dim3(gridDimX,gridDimY,1), dim3(threadsPerBlock,threadsPerBlock,1), 0, 0>>>(*f, *weights, *smoothF, nW, nX, nY);
  } 
} 

Add Calls from Fortran to Launch the HIP Kernel

Now that the HIP kernel and a wrapper routine are both defined, we need to modify the Fortran source code to expose the wrapper routine using an INTERFACE block in the smoother.F90 module. Then, in the main.F90 program, you will replace the call to ApplySmoother with a call to ApplySmoother_gpu_wrapper.

  1. Open the smoother.F90 module.
  2. Add an INTERFACE block for the ApplySmoother_gpu_wrapper routine. This interface block will define the API for the ApplySmoother_gpu_wrapper routine and will bind this subroutine to ApplySmoother_gpu_wrapper defined in smoother.cpp.
 INTERFACE
   SUBROUTINE ApplySmoother_gpu_wrapper(f_dev, weights_dev, smoothF_dev, nW, nX, nY) bind(c,name="ApplySmoother_gpu_wrapper")
     USE ISO_C_BINDING
     IMPLICIT NONE
     TYPE(c_ptr) :: f_dev, weights_dev, smoothF_dev
     INTEGER, VALUE :: nW, nX, nY
   END SUBROUTINE ApplySmoother_gpu_wrapper
 END INTERFACE
  1. Save your changes in smoother.F90 and open main.F90.
  2. Navigate down to the ApplySmoother call in the main iteration loop. Replace this call with a call to ApplySmoother_gpu_wrapper. Replace the input/output variables with their device versions.
CALL ApplySmoother_gpu_wrapper( c_loc(f_dev), c_loc(weights_dev), c_loc(smoothF_dev), nW, nX, nY )
  1. Save your changes in main.F90.

Update the Makefile

  1. Open the Makefile.
  2. Add instructions to compile smoother.cpp to an object file. Note that you can continue to use hipfc as the compiler, since hipfc will detect the .cpp file extension and switch to using hipcc automatically. Note, however, that you can use the CFLAGS variable to pass compilation flags specific to this C++ file.
smoother.cpp.o : smoother.cpp
     $(FC) $(CFLAGS) -c smoother.cpp  -o $@
  1. Add smoother.cpp.o as a dependency of the smoother.o target.
smoother: main.o smoother.o
    ${FC} ${FFLAGS} *.o -o $@

main.o : main.F90 smoother.o
    $(FC) $(FFLAGS) -c main.F90  -o $@

smoother.o : smoother.F90 smoother.cpp.o
    $(FC) $(FFLAGS) -c smoother.F90  -o $@
  1. Save the Makefile and rebuild the application.

Profile the application

You can verify that data is allocated on the GPU, that data is copied between the CPU and GPU, and that the ApplySmoother_gpu kernel is launched by using a profiler.

To profile, use rocprof with the --hip-trace and --stats flags. Running rocprof will create a file called results.json that contains the data for a trace profile. Additionally, results.stats.csv and results.hip_stats.csv will contain hotspot analyses for the HIP kernels and the HIP API calls, respectively.

$ rocprof --hip-trace --stats ./smoother 1000 1000 100

Next steps

Congratulations! So far, you've learned how to allocate and manage memory on the GPU and how to launch a GPU kernel from Fortran using hipfort, HIP, and ISO C Binding.

Right now, we have the code in a state where, every iteration, data is copied to the GPU before calling ApplySmoother_gpu and from the GPU after calling ApplySmoother_gpu. This situation happens quite often in the early stages of porting to the GPU.

The next step in this codelab is to offload the ResetF routine to the GPU, even though it does not take up a lot of time. We want to offload it to the GPU so that we can move the hipMemcpy calls outside of the iteration loop in main and reduce the number of times data is transmitted across the PCI bus.

In this section, we are going to offload the ResetF routine in smoother.F90 to the GPU so that we can migrate the hipMemcpy calls outside of the iteration loop in main.F90. By this point, you have worked through the mechanics of porting a routine to the GPU:

  1. Add a HIP kernel in C++
  2. Add a wrapper routine in C++ to launch the HIP kernel
  3. Add an interface block in Fortran
  4. Replace the subroutine/function call with a call to the wrapper routine

In this section, we'll repeat these steps for the ResetF routine. Additionally, we'll take an extra step, where we push the hipMemcpy calls outside of the main iteration loop, in order to reduce the number of data transfers between the CPU and GPU.

Write the HIP Kernels in C++

  1. Open smoother.cpp
  2. Add a new routine, ResetF_gpu. This routine needs to be of type __global__ so that it can be launched on the GPU (device) from the CPU (host) using chevron syntax.
__global__ void ResetF_gpu(real *f, real *smoothF, int nW, int nX, int nY)
{
  size_t i = threadIdx.x + blockIdx.x*blockDim.x + nW;
  size_t j = threadIdx.y + blockIdx.y*blockDim.y + nW;
  int iel = i + nX*j;
  if( i < nX-nW && j< nY-nW){
    f[iel] = smoothF[iel];
  }
}
  1. Next, add a wrapper routine that will launch the HIP kernel. As before, declare the wrapper routine as extern "C" so that it can be called from Fortran through ISO C Binding. In this routine, you can calculate the grid and block size for the kernel launch. Here, we've set the number of threads in the x and y directions to 16 and calculated the grid dimensions by dividing the x and y extents of the outer loops of the original ResetF routine by 16 and adding one block to cover any remainder.
extern "C"
{ 
  void ResetF_gpu_wrapper(real **f, real **smoothF, int nW, int nX, int nY)
  { 
    int threadsPerBlock = 16;
    int gridDimX = (nX-2*nW)/threadsPerBlock + 1;
    int gridDimY = (nY-2*nW)/threadsPerBlock + 1;

    ResetF_gpu<<<dim3(gridDimX,gridDimY,1), dim3(threadsPerBlock,threadsPerBlock,1),0,0>>>(*f, *smoothF, nW, nX, nY);
  }
}
  1. Save smoother.cpp

Add Calls from Fortran to Launch the HIP Kernel

As with the ApplySmoother_gpu_wrapper routine, you will add an INTERFACE block in the smoother.F90 module for the ResetF_gpu_wrapper. Then, in the main.F90 program, you will replace the call to ResetF with a call to ResetF_gpu_wrapper.

  1. Open the smoother.F90 module.
  2. Add an INTERFACE block for the ResetF_gpu_wrapper routine. This interface block will define the API for the ResetF_gpu_wrapper routine and will bind this subroutine to ResetF_gpu_wrapper defined in smoother.cpp.
 INTERFACE
   SUBROUTINE ResetF_gpu_wrapper(f_dev, smoothF_dev, nW, nX, nY) bind(c,name="ResetF_gpu_wrapper")
     USE ISO_C_BINDING
     IMPLICIT NONE
     TYPE(c_ptr) :: f_dev, smoothF_dev
     INTEGER, VALUE :: nW, nX, nY
   END SUBROUTINE ResetF_gpu_wrapper
 END INTERFACE
  1. Save your changes in smoother.F90 and open main.F90.
  2. Navigate down to the ResetF call in the main iteration loop. Replace this call with a call to ResetF_gpu_wrapper. Replace the input/output variables with their device versions.
CALL ResetF_gpu_wrapper( c_loc(f_dev), c_loc(smoothF_dev), nW, nX, nY )
  1. With ResetF_gpu_wrapper in place, we can now move the hipMemcpy calls outside of the main iteration loop. Move the call that copies f to f_dev to just before the do loop. You will also need the call (added in an earlier step) that copies smoothF to smoothF_dev before the iteration loop, so that the ghost cells are correct on the device. Then, move the call that copies smoothF_dev back to smoothF to just after the do loop.
CALL hipCheck(hipMemcpy(C_LOC(smoothF_dev), C_LOC(smoothF), SIZEOF(smoothF), hipMemcpyHostToDevice))
CALL hipCheck(hipMemcpy(C_LOC(f_dev), C_LOC(f), SIZEOF(f), hipMemcpyHostToDevice))

DO iter = 1, nIter
  CALL ApplySmoother_gpu_wrapper( c_loc(f_dev), c_loc(weights_dev), c_loc(smoothF_dev), nW, nX, nY )
  CALL ResetF_gpu_wrapper( c_loc(f_dev), c_loc(smoothF_dev), nW, nX, nY )
ENDDO
CALL hipCheck(hipMemcpy(C_LOC(smoothF), C_LOC(smoothF_dev), SIZEOF(smoothF), hipMemcpyDeviceToHost))
  1. Save your changes in main.F90. Rebuild and re-run the smoother application with the same parameters as before. Verify that the solution has remained unchanged by comparing against the reference output with diff, as in the earlier sections.

In this codelab, you learned how to port serial CPU-only routines in Fortran to GPUs using hipfort, HIP, and ISO C Binding. To do this, you used Fortran POINTERs to create device copies of CPU arrays and learned how to copy data from the CPU to the GPU and vice versa with hipMemcpy. You also learned how to write HIP kernels in C++, expose them to Fortran, and launch them from the host within Fortran source code.

In the process of doing this, you practiced a strategy for porting to GPUs that makes incremental changes to your source code:

  1. Profile - Find out the hotspots in your code and understand the dependencies with other routines
  2. Plan - Determine what routine you want to port, what data needs to be present on the GPU, and what data needs to be copied back to the CPU after execution
  3. Implement & Verify - Create the necessary device data, insert the appropriate hipMemcpy calls, write an equivalent GPU kernel, and launch the GPU kernel through a C wrapper routine. Run your application's tests and verify the results are correct. Check with a profiler that the new kernel and the necessary hipMemcpy calls are being executed.
  4. Commit - Once you have verified correctness and the expected behavior, commit your changes and start the process over again.

Provide Feedback

If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com

Further reading