Parallel Computing using the GPU – Tutorial 5: Grids


Welcome to part 5 of the Parallel Computing tutorial. In this short tutorial, we will look at how to launch multidimensional grids of blocks on the GPU. We will create the same program as in the last tutorial, but this time display a 2D array of blocks, each holding a calculated value.

These blocks work just the same way as the blocks we have seen so far in this series. But since they are arranged in two dimensions, you can think of them as a coordinate system with blocks along the x- and y-axis. Basically, it's all the same as before, except that we use multidimensional indexing.

How do we do this? First of all, we need the dim3 type from the CUDA C library, which we use to define our grid variable:

dim3 multiBlockArray(X,Y);

So, why is it dim3? dim3 actually holds three dimensions, but CUDA C does not currently support 3D grids, so the third dimension is reserved: when you create the variable, you specify the size of the x-axis and the y-axis, and the third component is automatically set to 1.
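A minimal sketch of those defaults at work (the variable names here are only for illustration):

dim3 grid2d(8, 4);   // 8 blocks along x, 4 along y; the z component defaults to 1
dim3 grid1d(16);     // only x given; y and z both default to 1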

Implementing our test solution
First of all, include stdio.h and define the size of our block array:

#include <stdio.h>

#define BLOCKS 10

Next, we create our main function:
int main( void )
{

Then we define a 2D array, a pointer for copying to/from the GPU, and our dim3 variable:
int hostArray[BLOCKS][BLOCKS];
int *deviceArray;

dim3 multiBlockArray(BLOCKS,BLOCKS);

Next, we allocate the memory our array needs on the device and copy the host array over. As you can see, we take care of the two-dimensional array by using BLOCKS * BLOCKS when allocating and copying:

cudaMalloc( (void**)&deviceArray, BLOCKS * BLOCKS * sizeof(int) );
cudaMemcpy( deviceArray,
hostArray, BLOCKS * BLOCKS * sizeof(int),
cudaMemcpyHostToDevice );
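As a side note, cudaMalloc and cudaMemcpy both return a cudaError_t, so real code would normally check the result. A minimal sketch of such a check (we leave it out of the tutorial code to keep things short):

cudaError_t err = cudaMalloc( (void**)&deviceArray, BLOCKS * BLOCKS * sizeof(int) );
if (err != cudaSuccess)
{
    printf( "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
    return 1;
}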

Once we have the space we need on the device, it is time to launch our kernel and let the GPU do the calculation:
generateArray<<<multiBlockArray,1>>>( deviceArray );

The only difference here is that we pass the multiBlockArray we created earlier as the argument for how many blocks we want to run (here BLOCKS * BLOCKS = 100 blocks, each running a single thread), and then proceed as normal.

Next we copy the array our GPU worked on back to the host, so we can display it:

cudaMemcpy( hostArray,
deviceArray,
BLOCKS * BLOCKS * sizeof(int),
cudaMemcpyDeviceToHost );

for (int i=0; i<BLOCKS; i++)
{
    printf( "Thread ID running: %d", hostArray[0][i] );
    for (int j=1; j<BLOCKS; j++)
    {
        printf( " %d", hostArray[j][i] );
    }
    printf( "\n" );
}

cudaFree( deviceArray );

Nothing new here, except that we now copy BLOCKS * BLOCKS elements back from the device as well, and loop through each block and print its content. Note that the loops print the array column by column: output row i contains hostArray[0][i] through hostArray[BLOCKS-1][i].
The last thing missing is the closing bracket } of main, along with return 0;.

Now, let's add the kernel (it goes above main in the source file, so the compiler sees it before the launch):

__global__ void generateArray( int *hostArray )   // the pointer refers to device memory
{
    // Convert the 2D block coordinate into a linear array index
    int ThreadIndex = blockIdx.x + blockIdx.y * gridDim.x;
    hostArray[ThreadIndex] = ThreadIndex;
}

This looks much the same as before. blockIdx.x tells us which block we are working on in the x-dimension, and blockIdx.y which block we are working on in the y-dimension. gridDim holds the size of each dimension of our grid: .x for the x-axis and .y for the y-axis. These are the same numbers you specified when creating the block array.
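For example, in our 10×10 grid, the block at coordinate (x=3, y=2) computes ThreadIndex = 3 + 2 * 10 = 23, and so writes the value 23 into element 23 of the array.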

If you run the example, you will get output like this:
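Thread ID running: 0 10 20 30 40 50 60 70 80 90
Thread ID running: 1 11 21 31 41 51 61 71 81 91
Thread ID running: 2 12 22 32 42 52 62 72 82 92
...
Thread ID running: 9 19 29 39 49 59 69 79 89 99

Since every block writes its own index, the numbers are fully deterministic.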

As you can see, there are many scenarios where multidimensional indexing is the natural choice. One example is working with 2D images, where you can create one block for each pixel, in the same coordinate system as the image; see the sketch below.
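As a rough sketch of that idea, here is what such a kernel could look like (invertImage, WIDTH, HEIGHT, and devicePixels are hypothetical names, and the image is assumed to be 8-bit grayscale stored in row-major order):

__global__ void invertImage( unsigned char *pixels )
{
    int x = blockIdx.x;                // pixel column (one block per pixel)
    int y = blockIdx.y;                // pixel row
    int index = x + y * gridDim.x;     // row-major offset, exactly as before
    pixels[index] = 255 - pixels[index];
}

// Launched with one block per pixel:
// dim3 imageGrid(WIDTH, HEIGHT);
// invertImage<<<imageGrid,1>>>( devicePixels );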

Complete listing:
#include <stdio.h>

#define BLOCKS   10

__global__ void generateArray( int *hostArray )   // the pointer refers to device memory
{
    // Convert the 2D block coordinate into a linear array index
    int ThreadIndex = blockIdx.x + blockIdx.y * gridDim.x;
    hostArray[ThreadIndex] = ThreadIndex;
}

int main( void )
{
    int hostArray[BLOCKS][BLOCKS];
    int *deviceArray;

    dim3 multiBlockArray(BLOCKS,BLOCKS);
   
    cudaMalloc( (void**)&deviceArray, BLOCKS * BLOCKS * sizeof(int) );
    cudaMemcpy( deviceArray,
                hostArray, BLOCKS * BLOCKS * sizeof(int),
                cudaMemcpyHostToDevice );
               
   
    generateArray<<<multiBlockArray,1>>>( deviceArray );

    cudaMemcpy( hostArray,
                deviceArray,
                BLOCKS * BLOCKS * sizeof(int),
                cudaMemcpyDeviceToHost );

    for (int i=0; i<BLOCKS; i++)
    {
        printf( "Thread ID running: %d", hostArray[0][i] );
        for (int j=1; j<BLOCKS; j++)
        {
            printf( " %d", hostArray[j][i] );
        }
        printf( "\n" );
    }

    cudaFree( deviceArray );

    return 0;
}
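If you save the listing as, say, grids.cu (the filename is just an example), it should compile and run with NVIDIA's nvcc compiler:

nvcc -o grids grids.cu
./grids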


2 Responses to Parallel Computing using the GPU – Tutorial 5: Grids

  1. Hamed says:

Excellent! I need it,
but your tutorials about CUDA are few (just 5).

  2. Theo says:

Is it ever possible to have two-dimensional arrays on the GPU,
which we could access like temp = gpu_array[i][j];?

Or do we just have to work with one-dimensional arrays, created by raveling the 2D arrays in row-major fashion?
