“The way the processor industry is going is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it.”
Steve Jobs, NY Times interview, June 10 2008
As hardware platforms evolve, so do the concepts behind them. Yesterday, a computer was a single unified processing unit. Today it is a heterogeneous machine whose processing units (CPUs, GPUs, DSPs…) keep growing both in number and in processing power. So why not take advantage of all these resources? That’s where OpenCL comes in.

OpenCL stands for Open Computing Language and is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. (…) OpenCL provides parallel computing using task-based and data-based parallelism. [+]
Overview.
(I may write up a more conceptual take on this subject later on.)
The basic hardware platform is composed of a Host and one or more Devices, each made of one or more compute units. Each compute unit is in turn made of one or more processing elements. The OpenCL platform allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. For example, the multiple cores of a CPU or GPU together form a single device, and OpenCL executes kernels across all of those cores in a data-parallel manner.
To perform a specific task on a device, the Host program relies on code written in OpenCL C, a programming language derived from ISO C99 with slight differences in keywords and syntax, some additions and some restrictions. Such a piece of code is called a Kernel. The Host program also relies on the OpenCL platform and runtime layers.
Inside this platform sits the OpenCL Runtime layer, which allows the host program to create memory objects associated with contexts and to manipulate contexts once they have been created. It also lets the host compile programs and create kernel objects, issue commands to a command-queue, synchronize commands and clean up OpenCL resources.
Contexts enable devices to share memory. To do that, the devices must belong to the same context. Contexts are also the central element used to manage:
Command Queues (for coordinating kernel execution), which store a set of operations to perform and are associated with a context. All work is submitted through queues, so each device must have a queue. Multiple command-queues can be created to handle independent commands that don’t require synchronization;
Memory objects (for transferring or mapping data), Programs and Kernels.
- A Program encapsulates a context, a program source or binary, a list of target devices and the build options.
- A Kernel is written in OpenCL C and declared with the __kernel qualifier. It encapsulates a specific function in a program. It is compiled into a binary format, and the API provides function calls for loading programs and looking up the functions inside them.
In the OpenCL platform, the compiler is built into the runtime layer: it can be invoked on the raw source text, or a binary can be built once and saved to be loaded later.
Running an OpenCL calculation takes these steps.
Writing the Kernel:
- const char *ProgramSource =
"__kernel void hello(__global float *input, __global float *output)\n"\
"{\n"\
" size_t id = get_global_id(0);\n"\
" output[id] = input[id] * input[id];\n"\
"}\n";
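To make the kernel's behaviour concrete, here is a minimal plain-C sketch of what the runtime conceptually does with it: every work-item runs the same body with a different value of get_global_id(0). The names hello_work_item and run_hello_serial are made up for this illustration and are not part of OpenCL, and a real device runs the work-items in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the "hello" kernel for a single work-item.
   `id` stands in for the value get_global_id(0) would return. */
static void hello_work_item(size_t id, const float *input, float *output)
{
    output[id] = input[id] * input[id];
}

/* Serial stand-in for a 1-D range of `global_size` work-items.
   Hypothetical helper: OpenCL would schedule these in parallel. */
static void run_hello_serial(const float *input, float *output,
                             size_t global_size)
{
    assert(input != NULL && output != NULL);
    for (size_t id = 0; id < global_size; ++id) {
        hello_work_item(id, input, output);
    }
}
```

Running it over {1, 2, 3} fills the output with {1, 4, 9}. A host-side version like this can also serve as a reference to compare the device results against.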
Initialization: Selecting a device and creating a context in which to run the calculation
- Query platform
- Query device(s)
- Create a context
1.
- cl_uint num_of_platforms=0;
cl_platform_id platform_id;
// retrieve the first available platform
if( clGetPlatformIDs(1, &platform_id, &num_of_platforms)!= CL_SUCCESS ){
printf("Unable to get platform_id\n");
return 1;
}
2.
- cl_device_id device_id;
cl_uint num_of_devices=0;
// try to get a supported CPU device (pass CL_DEVICE_TYPE_GPU instead to request a GPU)
if( clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_CPU, 1, &device_id, &num_of_devices) != CL_SUCCESS ){
printf("Unable to get device_id\n");
return 1;
}
3.
- cl_context_properties properties[3];
// context properties list - must be terminated with 0
properties[0]= CL_CONTEXT_PLATFORM;
properties[1]= (cl_context_properties) platform_id;
properties[2]= 0;
cl_int err;
cl_context context;
// create a context with the selected device
context = clCreateContext(properties,1,&device_id,NULL,NULL,&err);
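The listing stores the status of clCreateContext in err but never inspects it. A small reporting helper could look like the sketch below. The names cl_status_name and cl_failed are hypothetical, not part of the OpenCL API, and the numeric literals mirror the corresponding constants in CL/cl.h only so that the sketch builds on its own; in real code you would use the named constants.

```c
#include <assert.h>
#include <stdio.h>

/* Map a few OpenCL status codes to readable names.
   The values match the constants of the same name in CL/cl.h. */
static const char *cl_status_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "unknown OpenCL error";
    }
}

/* Report a failed call on stderr.
   Returns 1 on failure and 0 on success, so callers can bail out. */
static int cl_failed(int code, const char *what)
{
    assert(what != NULL);
    if (code == 0) {
        return 0;
    }
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_status_name(code), code);
    return 1;
}
```

With this in place, the context creation could be followed by `if (cl_failed(err, "clCreateContext")) return 1;`.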
Note: The “best” device depends on both the algorithm and the hardware, so to pick the right one for your algorithm you should query some device info first.
clGetDeviceInfo(device, name, size, *value, *value_size_ret)
Number of compute units: CL_DEVICE_MAX_COMPUTE_UNITS
Clock frequency: CL_DEVICE_MAX_CLOCK_FREQUENCY
Memory size: CL_DEVICE_GLOBAL_MEM_SIZE
…
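One possible way to act on that note is to score each candidate from the values clGetDeviceInfo reports and keep the highest. The sketch below assumes those values have already been queried into a plain struct. The names device_summary, device_score and pick_device are hypothetical, and "compute units times clock frequency" is only a crude heuristic of mine, not something the OpenCL specification recommends.

```c
#include <assert.h>
#include <stddef.h>

/* Values one would fill in from clGetDeviceInfo (hypothetical struct). */
struct device_summary {
    unsigned compute_units;        /* CL_DEVICE_MAX_COMPUTE_UNITS */
    unsigned clock_mhz;            /* CL_DEVICE_MAX_CLOCK_FREQUENCY, in MHz */
    unsigned long long mem_bytes;  /* CL_DEVICE_GLOBAL_MEM_SIZE, in bytes */
};

/* Crude throughput estimate: compute units times clock frequency. */
static unsigned long long device_score(const struct device_summary *d)
{
    return (unsigned long long)d->compute_units * d->clock_mhz;
}

/* Index of the best-scoring device with at least `min_mem` bytes of
   global memory, or -1 if no device qualifies. */
static int pick_device(const struct device_summary *devs, size_t count,
                       unsigned long long min_mem)
{
    int best = -1;
    assert(devs != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (devs[i].mem_bytes < min_mem) {
            continue;  /* not enough global memory for the data set */
        }
        if (best < 0 || device_score(&devs[i]) > device_score(&devs[best])) {
            best = (int)i;
        }
    }
    return best;
}
```

Whether raw throughput, memory size or something else matters most depends on the algorithm, so treat the scoring function as the part to tune.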
Allocation: allocating the memory/storage that will be used on the device and pushing the data to it. Programs and kernels are then read in from source and compiled, or loaded as binaries
- Create command-queue / Allocation of resources
- Create memory object
1.
- cl_command_queue command_queue;
// create command queue using the context and device
command_queue = clCreateCommandQueue(context, device_id, 0, &err);
cl_mem input, output;
// create buffers for the input and output
input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);
// DATA_SIZE must be a compile-time constant of at least 10, e.g. #define DATA_SIZE 10
float inputData[DATA_SIZE]={1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
// load data into the input buffer
clEnqueueWriteBuffer(command_queue, input, CL_TRUE, 0, sizeof(float) * DATA_SIZE, inputData, 0, NULL, NULL);
2.
- cl_program program;
// create a program from the kernel source code
program = clCreateProgramWithSource(context,1,(const char **) &ProgramSource, NULL, &err);
// compile the program
if( clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS ){
printf("Error building program\n");
return 1;
}
Note: Buffers are simple chunks of memory, so kernels can access them for reading and writing however they like (pointers, arrays, structs). Images can only be accessed through the built-in read_image*() and write_image*() functions (read_imagef(), write_imagef() and so on). In OpenCL 1.x, one image cannot be both read and written in the same kernel.
The arguments to the kernel are set and the kernel is executed on all data
- cl_kernel kernel;
// specify which kernel from the program to execute
kernel = clCreateKernel(program, "hello", &err);
// set the argument list for the kernel command
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
size_t global;
global=DATA_SIZE;
// enqueue the kernel command for execution
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(command_queue);
float results[DATA_SIZE]={0};
// copy the results out of the output buffer
clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0, sizeof(float) *DATA_SIZE, results, 0, NULL, NULL);
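Before tearing everything down, it can be worth checking that the device really produced the squares. Here is a minimal host-side sketch. The name count_mismatches is hypothetical, and the absolute tolerance is an assumption that suits small inputs like the ones used here; it is not a general-purpose floating-point comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Count the entries where results[i] differs from input[i] squared by
   more than `tol` (absolute difference). NaN results count as mismatches.
   Returns 0 when the device output looks right. */
static size_t count_mismatches(const float *input, const float *results,
                               size_t n, float tol)
{
    size_t bad = 0;
    assert(input != NULL && results != NULL);
    for (size_t i = 0; i < n; ++i) {
        float expected = input[i] * input[i];
        float diff = results[i] - expected;
        if (diff < 0.0f) {
            diff = -diff;
        }
        if (!(diff <= tol)) {  /* written this way so NaN is caught too */
            ++bad;
        }
    }
    return bad;
}
```

In the program above this would be called as `count_mismatches(inputData, results, DATA_SIZE, 1e-6f)` right after the read.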
Tear down.
- // print the results
printf("output: ");
int i;
for( i=0;i<DATA_SIZE; i++ ){
printf("%f ",results[i]);
}
// cleanup - release OpenCL resources
clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
Note: I didn’t deal with synchronization or events here.
So let’s go for some fun! Grab the code. I’m using 32-bit Ubuntu 10.04 (Lucid Lynx) with an ATI graphics card.
Let’s start with the prerequisites. I had to fetch the drivers from the ATI website.
- sudo wget link
- sudo chmod +x ./ati-driver-installer-10-7-x86.x86_64.run
- sudo ./ati-driver-installer-10-7-x86.x86_64.run --buildpkg Ubuntu/lucid
- sudo dpkg -i fglrx-modaliases_8.753-0ubuntu1_i386.deb fglrx-dev_8.753-0ubuntu1_i386.deb fglrx-amdcccle_8.753-0ubuntu1_i386.deb fglrx_8.753-0ubuntu1_i386.deb
- rm fglrx-modaliases_8.753-0ubuntu1_i386.deb
rm fglrx-dev_8.753-0ubuntu1_i386.deb
rm fglrx-amdcccle_8.753-0ubuntu1_i386.deb
rm fglrx_8.753-0ubuntu1_i386.deb
rm fglrx-installer_8.753-0ubuntu1_i386.changes
Next, download the ATI Stream SDK from the AMD website.
- tar zxvf /path/to/download/ati-stream-sdk-v2.2-lnx32.tgz -C ~
- mv ~/ati-stream-sdk-v2.2-lnx32 ~/.ati-stream-sdk-v2.2-lnx32
Add some variables to your .bashrc
- export LIBRARY_PATH=$LIBRARY_PATH":/home/YOURUSER/.ati-stream-sdk-v2.2-lnx32/lib/x86/"
- export C_INCLUDE_PATH=$C_INCLUDE_PATH":/home/YOURUSER/.ati-stream-sdk-v2.2-lnx32/include/"
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH":/home/YOURUSER/.ati-stream-sdk-v2.2-lnx32/lib/x86/"
Download the icd-registration archive from the AMD website and extract it at the root of your filesystem.
- sudo tar zxvf /path/to/icd-registration.tgz -C /
You can now start programming with:
#include <CL/opencl.h>
and compile it like:
gcc foo.c -o foo -lOpenCL