Last Updated: 2022-09-27
In this codelab, you will port a small Fortran application to GPU hardware using HIP and HIPFort, transitioning a serial CPU-only mini-application into a portable, GPU-accelerated application.
The goal of this codelab is to introduce you to the HIPFort API and a development practice that can be applied to porting other Fortran applications to GPU hardware.
The demo application provided for this tutorial performs 2-D smoothing operations using a 3x3 Gaussian stencil.
In this section, we introduce the demo application and walk through building and verifying the example. It's important to make sure that the code produces the expected result, as we will be using the CPU-generated model output to verify that the solution does not change when we port to the GPU.
This application executes a 2-D smoothing operation on a square grid of points. The program proceeds as follows: the field is initialized on the square grid, the smoothing operator is applied for a specified number of iterations (with the field reset to the smoothed result after each pass), and the initial and smoothed fields are written to file.
This application's src directory contains the following files:

smoother.F90 : Defines a simple data structure that stores the smoothing operator's weights and the routines for allocating memory, deallocating memory, and executing the smoothing operation.

main.F90 : Defines the main program that sets up the 2-D field to be smoothed and manages file IO.

Makefile : A simple makefile to build the application binary smoother.

viz.py : A python script for creating plots of the smoother output.

To get started, we want to make sure that the application builds and runs on your system using the GNU compilers. Clone the repository:
git clone https://github.com/fluidnumerics/scientific-computing-edu ~/scientific-computing-edu
The makefile uses gfortran by default. Build and run the application:

cd ~/scientific-computing-edu/samples/fortran/smoother/src
make
./smoother 1000 1000 10

The three command line arguments set the grid dimensions (nX and nY) and the number of smoothing iterations (nIter).
Before starting any GPU porting exercise, it is important to profile your application to find hotspots where your application spends most of its time. Further, it is helpful to keep track of the runtime of the routines in your application so that you can later assess whether or not the GPU porting has resulted in improved performance. Ideally, your GPU-Accelerated application should outperform CPU-Only versions of your application when fully subscribed to available CPUs on a compute node.
In this tutorial, we are going to generate a profile and call graph using gprof. The provided makefile was already configured to create profile output. From here, you just need to use gprof to create the application profile.
$ gprof ./smoother gmon.out
gprof provides a flat profile and a summary of your application's call structure, indicating dependencies within your source code as a call tree. A call tree depicts the relationships between routines in your source code. Combining timing information with a call tree will help you plan the order in which you port routines to the GPU.
The first section of the gprof output is the flat profile. An example flat profile for the smoother application is given below. The flat profile provides a list of routines in your application, ordered from greatest to least by the percentage of time your program spends within them. Beneath the flat profile, gprof provides documentation of each of the columns for your convenience.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
95.24 1.16 1.16 10 116.19 116.19 smoothField
2.46 1.19 0.03 10 3.00 3.00 resetF
2.46 1.22 0.03 main
0.00 1.22 0.00 1 0.00 0.00 smootherFree
0.00 1.22 0.00 1 0.00 0.00 smootherInit
Let's now take a look at the call tree. This call tree has five entries, one for each routine in our program. The right-most field for each entry indicates the routines that called each routine and the routines called by each routine.
For smoother, the first entry shows that main calls smoothField, resetF, smootherInit, and smootherFree. Further, the called column shows that smoothField and resetF are each called 10 times (in this case) by main. The self and children columns indicate that main spends 0.03s executing instructions in main and 1.19s calling other routines. Of those 1.19s, 1.16s are spent in smoothField and 0.03s are spent in resetF.
index % time self children called name
<spontaneous>
[1] 100.0 0.03 1.19 main [1]
1.16 0.00 10/10 smoothField [2]
0.03 0.00 10/10 resetF [3]
0.00 0.00 1/1 smootherInit [5]
0.00 0.00 1/1 smootherFree [4]
-----------------------------------------------
1.16 0.00 10/10 main [1]
[2] 95.1 1.16 0.00 10 smoothField [2]
-----------------------------------------------
0.03 0.00 10/10 main [1]
[3] 2.5 0.03 0.00 10 resetF [3]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[4] 0.0 0.00 0.00 1 smootherFree [4]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[5] 0.0 0.00 0.00 1 smootherInit [5]
-----------------------------------------------
Now that we have a profile and an understanding of the call structure of the application, we can plan our port to GPUs. Since we will use hipfort for offloading to GPUs, we want to first modify the Makefile to use the hipfc compiler. Then, we will focus on porting the smoothField routine and the necessary data to the GPU, since smoothField takes up the majority of the run time.
When we port this routine, we will introduce data allocation on the GPU and data copies between the CPU and GPU. This data movement may increase the overall application runtime, even if the smoothField routine itself performs better. In this event, we will then work on minimizing data movement between the CPU and GPU.
In the smoother application, we have seen that the ApplySmoother routine, called by main, takes up the most time. Looking at the main iteration loop in lines 51-56 of main.F90 and the ApplySmoother function in smoother.F90, we see that this function takes in the arrays f and weights and the integers nW, nX, and nY. Each call to ApplySmoother returns the smoothF array.
51 DO iter = 1, nIter
52
53 smoothF = ApplySmoother( f, weights, nW, nX, nY )
54 CALL ResetF( f, smoothF, nW, nX, nY )
55
56 ENDDO
In order to offload ApplySmoother to the GPU, we will need to copy the f and weights arrays to the GPU. After calling ApplySmoother, we will want to copy smoothF back to the CPU before calling ResetF.
Before making changes to the source code, you'll first want to save the existing output so that you have a reference to compare against later. Having reference output to compare against is critical to verifying that source code changes you make do not change the results.
To do this, make a directory called reference/ and move function.txt and smooth-function.txt to this directory. These files were created during the initial execution of the smoother application in the previous section of this codelab.
$ mkdir ./reference
$ mv function.txt smooth-function.txt ./reference/
In this section, you will learn how to allocate memory on the GPU and copy data between the host and device using hipfort. You'll start by inserting calls to allocate and deallocate device memory with hipMalloc and hipFree. Once this is working, you'll then copy data between the CPU and GPU with hipMemcpy.
Add USE statements for hipfort, hipfort_check, and ISO_C_BINDING at the top of main.F90. We use ISO_C_BINDING because you will need to make use of certain intrinsics to map between Fortran and C pointers.

PROGRAM main
USE smoother
USE hipfort
USE hipfort_check
USE ISO_C_BINDING
IMPLICIT NONE
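If the Fortran-to-C pointer mapping is new to you, the short standalone sketch below (not part of the smoother source; the program and array names are made up for illustration) shows the role ISO_C_BINDING plays: C_LOC takes a Fortran POINTER and returns the TYPE(c_ptr) address that C-facing APIs such as hipfort expect.

PROGRAM c_ptr_demo
! Standalone illustration of ISO_C_BINDING intrinsics; compiles with gfortran alone.
USE ISO_C_BINDING
IMPLICIT NONE
REAL, POINTER :: a(:,:)
TYPE(c_ptr) :: a_cptr

ALLOCATE( a(10,10) )
a = 1.0
! C_LOC returns the C address of the array, suitable for passing to C routines
a_cptr = C_LOC(a)
PRINT *, 'a_cptr associated? ', C_ASSOCIATED(a_cptr)
DEALLOCATE( a )

END PROGRAM c_ptr_demo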
Declare device copies of the f, smoothF, and weights arrays. The device copies can be declared as TYPE(c_ptr) or Fortran POINTER. In this example, we will use Fortran POINTER.

 IMPLICIT NONE
INTEGER, PARAMETER :: nW = 2
INTEGER :: nX, nY, nIter
REAL(prec), POINTER :: f(:,:)
REAL(prec), POINTER :: smoothF(:,:)
REAL(prec), POINTER :: weights(:,:)
REAL(prec), POINTER :: f_dev(:,:)
REAL(prec), POINTER :: smoothF_dev(:,:)
REAL(prec), POINTER :: weights_dev(:,:)
REAL(prec) :: dx, dy, x, y
INTEGER :: i, j, iter
Use hipMalloc, wrapped inside of hipCheck, to allocate space for f_dev, smoothF_dev, and weights_dev. The hipMalloc routine has an argument called mold that you can use to pass the host array to set the size of the device array. Because of this, you will want to insert these calls after the ALLOCATE statements for f, smoothF, and weights.

 ! Allocate device memory
CALL hipCheck(hipMalloc(f_dev, mold=f))
CALL hipCheck(hipMalloc(smoothF_dev, mold=smoothF))
CALL hipCheck(hipMalloc(weights_dev, mold=weights))
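hipCheck comes from the hipfort_check module; conceptually it just inspects the integer status returned by each HIP call and stops the program if the call failed. The sketch below is not the hipfort implementation, only a hypothetical equivalent shown to make clear why every call is wrapped (a status of 0 corresponds to hipSuccess).

! Hypothetical stand-in for hipCheck, for illustration only
SUBROUTINE myHipCheck( hipError )
IMPLICIT NONE
INTEGER, INTENT(in) :: hipError

IF( hipError /= 0 )THEN ! 0 corresponds to hipSuccess
  PRINT *, 'HIP call failed with error code ', hipError
  STOP 1
ENDIF

END SUBROUTINE myHipCheck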
At the end of main.F90, insert calls to hipFree (wrapped inside hipCheck) to free memory held by f_dev, smoothF_dev, and weights_dev.

 ! Deallocate GPU memory
CALL hipCheck( hipFree(f_dev) )
CALL hipCheck( hipFree(smoothF_dev) )
CALL hipCheck( hipFree(weights_dev) )
In the smoother makefile (samples/fortran/smoother/src/Makefile), let's first add variables for the paths to ROCm and CUDA at the top of the file. These will be needed to reference the full path to the hipfc compiler.

ROCM ?= /opt/rocm
CUDA ?= /usr/local/cuda
Change the compiler to hipfc and save your changes.
HIPFORT_COMPILER ?= /usr/bin/gfortran
FC= $(ROCM)/hipfort/bin/hipfc
FFLAGS=-O0 -g -hipfort-compiler $(HIPFORT_COMPILER)
Remove the *.o files and the smoother binary to ensure a clean build, and make a new smoother binary.

$ make clean && make smoother
Run smoother with the same input parameters as you did in the previous section and verify the output is unchanged. We use the diff command line utility to compare the output files and the reference files. If there are no differences, diff will produce no output.

$ ./smoother 1000 1000 10
$ diff function.txt reference/function.txt
$ diff smooth-function.txt reference/smooth-function.txt
For AMD platforms, use rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show a summary of calls to hipMalloc, hipMemcpy, and hipFree.
For Nvidia platforms, use nvprof. When running on Nvidia platforms, you can expect to see calls to cudaMalloc, cudaMemcpy, and cudaFree.
So far, you have a code that has both CPU (host) and GPU (device) memory. You are now ready to make calls to transfer data between the host and device.
Recall from the first section of this codelab that the ApplySmoother routine is the most expensive routine. Because of this, we'll start by focusing on moving the necessary data to the GPU before the call to ApplySmoother, and moving data back from the GPU after.
To copy data to the GPU (Host To Device), you will use hipMemcpy, e.g.
hipMemcpy( destination, source, size, hipMemcpyHostToDevice )
The arguments for hipMemcpy are a pointer to where data will be copied to (destination), a pointer to the source of the data (source), the size of the data in bytes (size), and an enum provided by the HIP library to indicate the direction of the data transfer. When copying from host to device, this last argument is hipMemcpyHostToDevice. When copying from device to host, this last argument is hipMemcpyDeviceToHost.
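Before applying this to smoother, here is a minimal, standalone sketch (separate from the smoother source; the array names are illustrative, and it assumes hipfort is installed and the program is compiled with hipfc) showing how hipMalloc, hipMemcpy, C_LOC, and SIZEOF fit together in a round trip from host to device and back.

PROGRAM memcpy_demo
! Illustrative only: round-trip copy of a host array through device memory.
USE hipfort
USE hipfort_check
USE ISO_C_BINDING
IMPLICIT NONE
REAL, POINTER :: a(:,:)      ! host array
REAL, POINTER :: a_dev(:,:)  ! device array

ALLOCATE( a(100,100) )
a = 2.0

CALL hipCheck(hipMalloc(a_dev, mold=a))
! Host to device: destination first, then source, size in bytes, direction
CALL hipCheck(hipMemcpy(C_LOC(a_dev), C_LOC(a), SIZEOF(a), hipMemcpyHostToDevice))
! Device to host
CALL hipCheck(hipMemcpy(C_LOC(a), C_LOC(a_dev), SIZEOF(a), hipMemcpyDeviceToHost))

CALL hipCheck(hipFree(a_dev))
DEALLOCATE( a )

END PROGRAM memcpy_demo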
In main.F90, we need a USE ISO_C_BINDING statement so that we can gain access to intrinsics like C_LOC. This is necessary for creating C pointers that the HIPFort API expects.

Since the weights array does not change with each iteration, insert a call to hipMemcpy, wrapped inside the hipCheck routine, to copy weights to weights_dev before the iteration loop starts.

! Copy weights to weights_dev
CALL hipCheck(hipMemcpy(C_LOC(weights_dev), C_LOC(weights), SIZEOF(weights), hipMemcpyHostToDevice))
Similarly, copy the smoothF array to smoothF_dev before the main iteration loop.

! Copy smoothF to smoothF_dev
CALL hipCheck(hipMemcpy(C_LOC(smoothF_dev), C_LOC(smoothF), SIZEOF(smoothF), hipMemcpyHostToDevice))
Within the main iteration loop, copy f to f_dev before the call to ApplySmoother.

 DO iter = 1, nIter
! Copy f to f_dev
CALL hipCheck(hipMemcpy(C_LOC(f_dev), C_LOC(f), SIZEOF(f), hipMemcpyHostToDevice))
While we will offload the ApplySmoother routine to the GPU, we will leave ResetF on the CPU. Because of this, we will need to have smoothF copied back to the CPU before calling ResetF. Now, just after the call to ApplySmoother and before the call to ResetF, insert a call to hipMemcpy, wrapped inside the hipCheck routine, to copy smoothF_dev from the device to the host.
smoothF = ApplySmoother( f, weights, nW, nX, nY )
! Copy smoothF_dev to smoothF
CALL hipCheck(hipMemcpy(C_LOC(smoothF), C_LOC(smoothF_dev), SIZEOF(smoothF), hipMemcpyDeviceToHost))
Save main.F90 and recompile the application to make sure there are no syntax errors in your recent code modifications. Keep in mind that, at this point, you are copying smoothF_dev, which currently holds uninitialized memory, back to smoothF. Nonetheless, run the application under a profiler to verify that calls to hipMemcpy (or cudaMemcpy for Nvidia devices) are executed.

On AMD platforms, use rocprof with the --hip-trace flag. Running rocprof will create a file called results.json. The contents of results.hip_stats.csv will show calls to hipMalloc, hipMemcpy, and hipFree.

$ rocprof --hip-trace ./smoother 1000 1000 100
$ cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hipMemcpy,201,696108877,3463228,99.94123584828755
hipMalloc,3,216944,72314,0.031146925698335683
hipFree,3,192359,64119,0.02761722601411495
At this point, you have the necessary data declared on the GPU. Additionally, you used hipMemcpy to make the input to ApplySmoother available on the GPU. In the next step, you will create a HIP kernel that runs the ApplySmoother algorithm on the GPU, and replace the call to ApplySmoother with a call to launch this kernel.
In this section, you will offload the ApplySmoother routine to the GPU.

Let's look at the ApplySmoother subroutine from smoother.F90:
SUBROUTINE ApplySmoother( f, weights, smoothF, nW, nX, nY )
IMPLICIT NONE
REAL(prec), INTENT(in) :: f(1:nX,1:nY)
REAL(prec), INTENT(in) :: weights(-nW:nW,-nW:nW)
INTEGER, INTENT(in) :: nW, nX, nY
REAL(prec), INTENT(inout) :: smoothF(1:nX,1:nY)
! Local
INTEGER :: i, j, ii, jj
DO j = 1+nW, nY-nW
DO i = 1+nW, nX-nW
! Take the weighted sum of f to compute the smoothF field
smoothF(i,j) = 0.0_prec
DO jj = -nW, nW
DO ii = -nW, nW
smoothF(i,j) = smoothF(i,j) + f(i+ii,j+jj)*weights(ii,jj)
ENDDO
ENDDO
ENDDO
ENDDO
END SUBROUTINE ApplySmoother
The outer loops, over i and j, are tightly nested loops over a 2-D grid. The sizes of these loops are nY-2*nW and nX-2*nW. The values of nX and nY are determined by the user through the first two command line arguments (we have been using 1000 for both), and nW is 2 (see the declarations in main.F90). Within the i and j loops, we carry out a reduction (a weighted sum of f over the stencil) and assign the result to each element of smoothF.
In the ApplySmoother algorithm, the order in which we execute the i and j loops does not matter. Further, the size of each loop is O(1000) for the example we're working with. A good strategy for offloading this routine to the GPU is to have each GPU thread execute the instructions within the i and j loops. Ideally, then, we want each thread to execute something like the following:
real smLocal = 0.0;
for( int jj=-nW; jj <= nW; jj++ ){
for( int ii=-nW; ii <= nW; ii++ ){
int iel = (i+ii)+(j+jj)*nX;
int ism = (ii+nW) + (jj+nW)*(2*nW+1);
smLocal += f[iel]*weights[ism];
}
}
int iel = i+j*nX;
smoothF[iel] = smLocal;
Notice that the i and j loops are now gone. Within the HIP kernel, we can calculate i and j from threadIdx.[x,y], blockIdx.[x,y], and blockDim.[x,y], assuming that we will launch the kernel with 2-D grid and block dimensions. You can use something like the following to calculate i and j.
size_t i = threadIdx.x + blockIdx.x*blockDim.x+nW;
size_t j = threadIdx.y + blockIdx.y*blockDim.y+nW;
Within the main program, you will be able to launch the GPU kernel, but you will need to calculate the grid and block dimensions. For now, let's assume that the number of threads per block in the i and j loop dimensions (the x and y directions) is fixed at 16. With the number of threads per block chosen, you can calculate the grid dimensions by requiring the total number of threads in the x and y directions (blocks times threads per block) to be greater than or equal to the i and j loop sizes, respectively.
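To make the arithmetic concrete, here is a small standalone sketch of the block-count calculation that the wrapper routine further below performs. The values nX = nY = 1000 and nW = 2 are assumptions taken from the runs earlier in this codelab; integer division plus one extra block guarantees that blocks times threads per block covers every interior point.

PROGRAM grid_dims
! Illustrative arithmetic only: how many 16x16 blocks cover the interior points.
IMPLICIT NONE
INTEGER, PARAMETER :: nX = 1000, nY = 1000, nW = 2
INTEGER, PARAMETER :: threadsPerBlock = 16
INTEGER :: gridDimX, gridDimY

gridDimX = (nX-2*nW)/threadsPerBlock + 1 ! 996/16 = 62, +1 -> 63 blocks = 1008 threads >= 996
gridDimY = (nY-2*nW)/threadsPerBlock + 1
PRINT *, 'grid dimensions: ', gridDimX, gridDimY

END PROGRAM grid_dims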
The floating point precision for this application is set in smoother.F90 and can be controlled using a C-Preprocessor flag. To extend this feature into the HIP kernels, add a header file called precision.h. In this header file, define the type real to be double if the DOUBLE_PRECISION preprocessor flag is defined, and float otherwise.

#include <float.h>
#ifdef DOUBLE_PRECISION
typedef double real;
#else
typedef float real;
#endif
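For context, the Fortran side of this switch might look roughly like the following; this is a hypothetical sketch, and the actual definition of prec lives in smoother.F90 and may differ. The point is that the Fortran kind prec and the C++ typedef real must be toggled by the same DOUBLE_PRECISION flag so that host arrays and device kernels agree on the floating point width.

! Hypothetical sketch of the Fortran precision selection (see smoother.F90 for the real one)
MODULE precision_sketch
USE ISO_C_BINDING
IMPLICIT NONE

#ifdef DOUBLE_PRECISION
INTEGER, PARAMETER :: prec = C_DOUBLE ! matches "typedef double real" in precision.h
#else
INTEGER, PARAMETER :: prec = C_FLOAT  ! matches "typedef float real" in precision.h
#endif

END MODULE precision_sketch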
Create a new file called smoother.cpp. This file will contain the C++ source code for the HIP kernel and the wrapper routine. Start by adding statements to include the HIP runtime and precision.h.

#include "precision.h"
#include <hip/hip_runtime.h>
Next, define a HIP kernel called ApplySmoother_gpu. This routine needs to be of type __global__ so that it can be launched on the GPU (device) from the CPU (host), either with hipLaunchKernelGGL or with the chevron syntax used below.

__global__ void ApplySmoother_gpu(real *f, real *weights, real *smoothF, int nW, int nX, int nY)
{
size_t i = threadIdx.x + blockIdx.x*blockDim.x + nW;
size_t j = threadIdx.y + blockIdx.y*blockDim.y + nW;
int iel, ism;
if( i < nX-nW && j< nY-nW){
real smLocal = 0.0;
for( int jj=-nW; jj <= nW; jj++ ){
for( int ii=-nW; ii <= nW; ii++ ){
iel = (i+ii)+(j+jj)*nX;
ism = (ii+nW) + (jj+nW)*(2*nW+1);
smLocal += f[iel]*weights[ism];
}
}
iel = i+j*nX;
smoothF[iel] = smLocal;
}
}
Next, define a wrapper routine in smoother.cpp that will launch the HIP kernel. Additionally, you will want to declare the wrapper routine as extern "C", so that it can be called from Fortran through ISO C Binding. In this routine, you can calculate the grid and block size for the kernel launch. Here, we've set the number of threads in the x and y directions to 16 and calculated the grid dimensions by dividing the x and y extents of the outer loops of the original ApplySmoother routine by 16 and rounding up.

extern "C"
{
void ApplySmoother_gpu_wrapper(real **f, real **weights, real **smoothF, int nW, int nX, int nY)
{
int threadsPerBlock = 16;
int gridDimX = (nX-2*nW)/threadsPerBlock + 1;
int gridDimY = (nY-2*nW)/threadsPerBlock + 1;
ApplySmoother_gpu<<<dim3(gridDimX,gridDimY,1), dim3(threadsPerBlock,threadsPerBlock,1), 0, 0>>>(*f, *weights, *smoothF, nW, nX, nY);
}
}
Now that the HIP kernel and a wrapper routine are both defined, we need to modify the Fortran source code to expose the wrapper routine using an INTERFACE block in the smoother.F90 module. Then, in the main.F90 program, you will replace the call to ApplySmoother with a call to ApplySmoother_gpu_wrapper.
Open the smoother.F90 module and add an INTERFACE block for the ApplySmoother_gpu_wrapper routine. This interface block will define the API for the ApplySmoother_gpu_wrapper routine and will bind this subroutine to ApplySmoother_gpu_wrapper defined in smoother.cpp.

 INTERFACE
SUBROUTINE ApplySmoother_gpu_wrapper(f_dev, weights_dev, smoothF_dev, nW, nX, nY) bind(c,name="ApplySmoother_gpu_wrapper")
USE ISO_C_BINDING
IMPLICIT NONE
TYPE(c_ptr) :: f_dev, weights_dev, smoothF_dev
INTEGER, VALUE :: nW, nX, nY
END SUBROUTINE ApplySmoother_gpu_wrapper
END INTERFACE
Save smoother.F90 and open main.F90. Find the ApplySmoother call in the main iteration loop. Replace this call with a call to ApplySmoother_gpu_wrapper, and replace the input/output variables with their device versions.

 CALL ApplySmoother_gpu_wrapper( c_loc(f_dev), c_loc(weights_dev), c_loc(smoothF_dev), nW, nX, nY )
Save main.F90 and open the Makefile.

Add a rule to compile smoother.cpp to an object file. Note that you can continue to use hipfc as the compiler, since hipfc will detect the .cpp file extension and switch to using hipcc automatically. Notice, however, that you can use the CFLAGS variable to pass compilation flags specific to compiling this C++ file.

smoother.cpp.o : smoother.cpp
$(FC) $(CFLAGS) -c smoother.cpp -o $@
Add smoother.cpp.o as a dependency of the smoother.o target.

smoother: main.o smoother.o
${FC} ${FFLAGS} *.o -o $@
main.o : main.F90 smoother.o
$(FC) $(FFLAGS) -c main.F90 -o $@
smoother.o : smoother.F90 smoother.cpp.o
$(FC) $(FFLAGS) -c smoother.F90 -o $@
You can verify that data is allocated on the GPU and that data is copied from the CPU to GPU by using a profiler.
To profile, use rocprof with the --hip-trace and --stats flags. Running rocprof will create a file called results.json that contains the data for a trace profile. Additionally, results.stats.csv and results.hip_stats.csv will contain hotspot analyses for HIP kernels and HIP API calls, respectively.

$ rocprof --hip-trace --stats ./smoother 1000 1000 100
Congratulations! So far, you've learned how to allocate and manage memory on the GPU and how to launch a GPU kernel from Fortran using hipfort, HIP, and ISO C Binding.
Right now, we have the code in a state where, every iteration, data is copied to the GPU before calling ApplySmoother_gpu and back from the GPU after calling ApplySmoother_gpu. This situation happens quite often in the early stages of porting to the GPU.
The next step in this codelab is to offload the ResetF routine to the GPU, even though it does not take up a lot of time. We want to offload it so that we can move the hipMemcpy calls outside of the iteration loop in main and reduce the number of times data is transmitted across the PCIe bus.
In this section, we are going to offload the ResetF routine in smoother.F90 to the GPU so that we can migrate the hipMemcpy calls outside of the iteration loop in main.F90. By this point, you have worked through the mechanics for porting a routine to the GPU: writing a HIP kernel and an extern "C" wrapper to launch it, exposing the wrapper to Fortran with an INTERFACE block, and replacing the CPU call with a call to the wrapper using device pointers.
In this section, we'll repeat these steps for the ResetF routine. Additionally, we'll take an extra step, where we push the hipMemcpy calls outside of the main iteration loop, in order to reduce the number of data transfers between the CPU and GPU.
In smoother.cpp, define a new kernel called ResetF_gpu. This routine needs to be of type __global__ so that it can be launched on the GPU (device) from the CPU (host) using chevron syntax.

__global__ void ResetF_gpu(real *f, real *smoothF, int nW, int nX, int nY)
{
size_t i = threadIdx.x + blockIdx.x*blockDim.x + nW;
size_t j = threadIdx.y + blockIdx.y*blockDim.y + nW;
int iel = i + nX*j;
if( i < nX-nW && j< nY-nW){
f[iel] = smoothF[iel];
}
}
extern "C"
, so that it can be called from Fortran through ISO C Binding. In this routine, you can calculate the grid and block size for the kernel launch. Here, we've set the number of threads in the x and y directions to 16 and calculated the grid dimensions by evenly dividing the x and y extents of the outer loops of the original ResetF
routine by 16.extern "C"
{
void ResetF_gpu_wrapper(real **f, real **smoothF, int nW, int nX, int nY)
{
int threadsPerBlock = 16;
int gridDimX = (nX-2*nW)/threadsPerBlock + 1;
int gridDimY = (nY-2*nW)/threadsPerBlock + 1;
ResetF_gpu<<<dim3(gridDimX,gridDimY,1), dim3(threadsPerBlock,threadsPerBlock,1),0,0>>>(*f, *smoothF, nW, nX, nY);
}
}
As with the ApplySmoother_gpu_wrapper routine, you will add an INTERFACE block in the smoother.F90 module for ResetF_gpu_wrapper. Then, in the main.F90 program, you will replace the call to ResetF with a call to ResetF_gpu_wrapper.
Open the smoother.F90 module and add an INTERFACE block for the ResetF_gpu_wrapper routine. This interface block will define the API for the ResetF_gpu_wrapper routine and will bind this subroutine to ResetF_gpu_wrapper defined in smoother.cpp.

 INTERFACE
SUBROUTINE ResetF_gpu_wrapper(f_dev, smoothF_dev, nW, nX, nY) bind(c,name="ResetF_gpu_wrapper")
USE ISO_C_BINDING
IMPLICIT NONE
TYPE(c_ptr) :: f_dev, smoothF_dev
INTEGER, VALUE :: nW, nX, nY
END SUBROUTINE ResetF_gpu_wrapper
END INTERFACE
Save smoother.F90 and open main.F90. Find the ResetF call in the main iteration loop. Replace this call with a call to ResetF_gpu_wrapper, and replace the input/output variables with their device versions.

CALL ResetF_gpu_wrapper( c_loc(f_dev), c_loc(smoothF_dev), nW, nX, nY )
With the call to ResetF_gpu_wrapper in place, we can now move the hipMemcpy calls outside of the main iteration loop. Move the call that copies f to f_dev to just before the do loop. To obtain the correct result, you will also need to add a call to copy smoothF to smoothF_dev before the iteration loop. Then, move the call that copies smoothF_dev to smoothF to just after the do loop.

CALL hipCheck(hipMemcpy(C_LOC(smoothF_dev), C_LOC(smoothF), SIZEOF(smoothF), hipMemcpyHostToDevice))
CALL hipCheck(hipMemcpy(C_LOC(f_dev), C_LOC(f), SIZEOF(f), hipMemcpyHostToDevice))

DO iter = 1, nIter

CALL ApplySmoother_gpu_wrapper( c_loc(f_dev), c_loc(weights_dev), c_loc(smoothF_dev), nW, nX, nY )
CALL ResetF_gpu_wrapper( c_loc(f_dev), c_loc(smoothF_dev), nW, nX, nY )

ENDDO

CALL hipCheck(hipMemcpy(C_LOC(smoothF), C_LOC(smoothF_dev), SIZEOF(smoothF), hipMemcpyDeviceToHost))
Save main.F90. Rebuild and re-run the smoother application with the same parameters as before, and verify that the solution has remained unchanged.

In this codelab, you learned how to port serial CPU-only routines in Fortran to GPUs using hipfort, HIP, and ISO C Binding. To do this, you used Fortran POINTERs to create device copies of CPU arrays and learned how to copy data from the CPU to the GPU and vice versa with hipMemcpy. You also learned how to write HIP kernels in C++, expose them to Fortran, and launch them from the host within Fortran source code.

In the process of doing this, you practiced a strategy for making incremental changes to your own source code while porting to GPUs: profiling the CPU code to find hotspots, saving reference output for verification, allocating device memory and adding host-device data transfers, porting the hotspot routines as HIP kernels, verifying the results after each change, and finally reducing data movement between the host and device.
If you have any questions, comments, or feedback that can help improve this codelab, you can reach out to support@fluidnumerics.com