Parallel Computing using the GPU – Tutorial 4: Kernels in parallel

So far, we haven't really touched parallel programming. That changes in this tutorial!

Recall that we earlier launched a function on the device using kernelFunction<<<1,1>>>(..). This tutorial is all about the first parameter inside the angle brackets, N1 in <<<N1,1>>>, where some of the parallel magic happens.

N1 is the number of blocks that we want to run in parallel. If we call kernelFunction<<<5,1>>>(..), five copies of the function will run in parallel, and each copy is called a block.
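To make this concrete, here is a minimal sketch (the kernel name and body are placeholders, not from this tutorial): launching with <<<5,1>>> starts five blocks, each running one copy of the function.

__global__ void kernelFunction( void )
{
    // five blocks are launched below, so this body runs five times in parallel
}

int main( void )
{
    kernelFunction<<<5,1>>>();   // 5 blocks, 1 thread per block
    cudaDeviceSynchronize();     // wait for all blocks to finish
    return 0;
}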

The next thing we must do is use an index so that each copy works on a different part of our solution; there isn't much need for parallel computing if all copies do the exact same thing, right? Luckily, this is easy as well. CUDA has a built-in variable that keeps track of every single block running: blockIdx.

blockIdx is a 2D variable, containing x and y. You use either x alone or both x and y, depending on the problem you want to solve. One handy use of both x and y is processing 2D images, with one block for each pixel along the x- and y-axes. You can also just use x if you want; there are no rules. More on this later.
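As a taste of the 2D case (a hedged sketch, not from this tutorial; processImage, WIDTH, and HEIGHT are made-up names), a kernel can use blockIdx.x and blockIdx.y together as pixel coordinates:

#define WIDTH   4
#define HEIGHT  3

// one block per pixel; blockIdx.x and blockIdx.y are the pixel coordinates
__global__ void processImage( unsigned char *devicePixels )
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    devicePixels[y * gridDim.x + x] = 255;   // e.g. set every pixel to white
}

int main( void )
{
    unsigned char *devicePixels;
    cudaMalloc( (void**)&devicePixels, WIDTH * HEIGHT );

    dim3 grid( WIDTH, HEIGHT );               // a 2D grid of blocks
    processImage<<<grid,1>>>( devicePixels );

    cudaFree( devicePixels );
    return 0;
}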

Now that we know how to identify each running copy by checking blockIdx.x, and we know how to run kernels in parallel, let's create a simple example.

In this example, we will create an application that generates an array entirely in parallel on the GPU. The array will contain the ID of each running block (since each block runs a single thread here, this is also the thread ID). After the kernel has completed, we will print out the result using printf.

Implementing the Kernel
Let’s start by looking at the kernel code:

__global__ void generateArray( int *deviceArray )
{
    int ThreadIndex = blockIdx.x;            // this block's ID, 0..BLOCKS-1
    deviceArray[ThreadIndex] = ThreadIndex;  // each block fills one element
}

Recall that __global__ tells the compiler this function runs on the device and can be called from the host. The kernel takes one array as its output, stores blockIdx.x in a variable named ThreadIndex, and then puts that value at the corresponding position in the array.

blockIdx gives every running block its own ID, starting from 0. This makes it a perfect index for arrays.

Implementing main()
Next, we will take a look at our main function. There shouldn't be anything new here:

int main( void )
{
    int hostArray[BLOCKS];
    int *deviceArray;

    // allocate room for the array on the device
    cudaMalloc( (void**)&deviceArray, BLOCKS * sizeof(int) );

    // launch BLOCKS blocks in parallel; each fills one element
    generateArray<<<BLOCKS,1>>>( deviceArray );

    // copy the result back to the host
    cudaMemcpy( hostArray,
                deviceArray,
                BLOCKS * sizeof(int),
                cudaMemcpyDeviceToHost );

    for (int i=0; i<BLOCKS; i++)
    {
        printf( "Thread ID running: %d\n", hostArray[i] );
    }

    cudaFree( deviceArray );

    return 0;
}

First, we create a host array of size BLOCKS, allocate space for the array on the device, and call:
generateArray<<<BLOCKS,1>>>( deviceArray );

This one call now runs BLOCKS blocks in parallel, creating the entire array at once.

Once this is done, we copy the result from the device back to the host (cudaMemcpy implicitly waits for the kernel to finish before copying), print it to the screen, free the device array, and exit!
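A quick aside (not part of the code above): most CUDA runtime calls return an error code, and a real application should check it. A minimal sketch for the copy back to the host:

cudaError_t err = cudaMemcpy( hostArray,
                              deviceArray,
                              BLOCKS * sizeof(int),
                              cudaMemcpyDeviceToHost );
if (err != cudaSuccess)
{
    // cudaGetErrorString turns the error code into a readable message
    printf( "cudaMemcpy failed: %s\n", cudaGetErrorString( err ) );
    return 1;
}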

The source for the entire application:

#include <stdio.h>

#define BLOCKS   25

__global__ void generateArray( int *deviceArray )
{
    int ThreadIndex = blockIdx.x;            // this block's ID, 0..BLOCKS-1
    deviceArray[ThreadIndex] = ThreadIndex;  // each block fills one element
}

int main( void )
{
    int hostArray[BLOCKS];
    int *deviceArray;

    // allocate room for the array on the device
    cudaMalloc( (void**)&deviceArray, BLOCKS * sizeof(int) );

    // launch BLOCKS blocks in parallel; each fills one element
    generateArray<<<BLOCKS,1>>>( deviceArray );

    // copy the result back to the host
    cudaMemcpy( hostArray,
                deviceArray,
                BLOCKS * sizeof(int),
                cudaMemcpyDeviceToHost );

    for (int i=0; i<BLOCKS; i++)
    {
        printf( "Thread ID running: %d\n", hostArray[i] );
    }

    cudaFree( deviceArray );

    return 0;
}

Now compile and run the code, and you should see output something like this:
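Thread ID running: 0
Thread ID running: 1
Thread ID running: 2
...
Thread ID running: 24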

Congratulations, you just made your first parallel application in CUDA! :)

Download: Source+Executable Visual Studio 2008 + CUDA C 3.2 (Soon available, no access to FTP from my cottage)
