Java bindings for the CUDA runtime and driver API

With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs. JCuda is the common platform for all libraries on this site.

You may obtain the latest version of JCuda in the Downloads section.


The following features are currently provided by JCuda:

Known limitations:

Please note that not all functionalities have been tested extensively on all operating systems, GPU devices and host architectures. There certainly are more limitations, which will be added to the following list as soon as I become aware of them:

JCuda runtime API

The main application of the JCuda runtime bindings is the interaction with existing libraries that are built based upon the CUDA runtime API.

Some Java bindings for libraries using the CUDA runtime API are available on this web site. The following snippet illustrates how one of these libraries, JCufft, may be used with the JCuda runtime API.

You may also want to download the complete, compilable JCuda runtime API sample from the samples page that shows how to use the runtime libraries.

// Allocate memory on the device and copy the host data to the device
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);
float hostData[] = createInputData();
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
    cudaMemcpyKind.cudaMemcpyHostToDevice);

// Perform in-place complex-to-complex 1D transforms using JCufft
cufftHandle plan = new cufftHandle();
JCufft.cufftPlan1d(plan, complexElements, cufftType.CUFFT_C2C, 1);
JCufft.cufftExecC2C(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);

// Copy the result from the device to the host and clean up
cudaMemcpy(Pointer.to(hostData), deviceData, memorySize,
    cudaMemcpyKind.cudaMemcpyDeviceToHost);
JCufft.cufftDestroy(plan);
cudaFree(deviceData);
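
The snippet above leaves complexElements, memorySize and createInputData() undefined. The following is a minimal plain-Java sketch of how they could relate, assuming single-precision complex values stored as interleaved (real, imaginary) float pairs, which is the layout of cufftComplex. The class name and the ramp-shaped input data are hypothetical and chosen only for illustration; no JCuda classes are involved.

```java
// Hypothetical helper, not part of JCuda: sizes and input data for
// interleaved single-precision complex values.
final class ComplexInputData {
    /** Floats per complex value: one real part, one imaginary part. */
    static final int FLOATS_PER_COMPLEX = 2;

    /** Size in bytes of a buffer holding the given number of complex floats. */
    static long memorySize(int complexElements) {
        return (long) complexElements * FLOATS_PER_COMPLEX * Float.BYTES;
    }

    /** Creates interleaved input: element i has real part i, imaginary part 0. */
    static float[] createInputData(int complexElements) {
        float[] data = new float[complexElements * FLOATS_PER_COMPLEX];
        for (int i = 0; i < complexElements; i++) {
            data[2 * i] = i;        // real part
            data[2 * i + 1] = 0.0f; // imaginary part
        }
        return data;
    }

    public static void main(String[] args) {
        int complexElements = 4;
        float[] hostData = createInputData(complexElements);
        System.out.println("floats: " + hostData.length);
        System.out.println("bytes:  " + memorySize(complexElements));
    }
}
```

With this layout, the host array holds twice as many floats as there are complex elements, and the byte size passed to cudaMalloc and cudaMemcpy has to account for both parts.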

JCuda driver API

The main usage of the JCuda driver bindings is to load PTX and CUBIN modules and execute the kernels from a Java application.

The following code snippet illustrates the basic steps of how to load a module (here, a PTX file) using the JCuda driver bindings, and how to execute a kernel from the module.

You may also want to download a complete JCuda driver sample from the samples page.

// Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Load the PTX that contains the kernel.
CUmodule module = new CUmodule();
cuModuleLoad(module, "sample.ptx");

// Obtain a handle to the kernel function.
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "functionName");

// Allocate the device input data, and copy the
// host input data to the device
CUdeviceptr deviceData = new CUdeviceptr();
cuMemAlloc(deviceData, memorySize);
cuMemcpyHtoD(deviceData, hostData, memorySize);

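// Set up the kernel parameters: one pointer per kernel argument.
// Here, a single argument (the device data) is assumed; the actual
// list depends on the signature of the kernel.
Pointer kernelParameters = Pointer.to(
    Pointer.to(deviceData)
);

// Call the kernel function.
cuLaunchKernel(function,
    gx, gy, gz,               // Grid dimension
    bx, by, bz,               // Block dimension
    sharedMemorySize, stream, // Shared memory size and stream
    kernelParameters, null    // Kernel- and extra parameters
);

// Copy the data back from the device to the host and clean up
cuMemcpyDtoH(hostData, deviceData, memorySize);
cuMemFree(deviceData);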
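
The launch above takes the grid dimensions (gx, gy, gz) and block dimensions (bx, by, bz) as given. For a one-dimensional kernel that processes n elements, a common choice is to fix the block size and round the grid size up so that every element is covered. The following is a minimal sketch in plain Java; the class name, the method name and the block size of 256 are assumptions made for illustration.

```java
// Hypothetical helper, not part of JCuda: computes a 1D grid size.
final class LaunchDimensions {
    /** Number of blocks needed so that blocks * blockSize >= elementCount. */
    static int gridSizeFor(int elementCount, int blockSize) {
        if (elementCount < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        // Integer ceiling division, done in long to avoid overflow
        return (int) (((long) elementCount + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        int n = 1000;
        int bx = 256; // assumed block size
        int gx = gridSizeFor(n, bx);
        System.out.println("gx=" + gx + ", bx=" + bx); // gx=4, bx=256
    }
}
```

Because the last block is usually only partially filled (here, 4 * 256 = 1024 threads for 1000 elements), the kernel itself has to check that its global index is smaller than n.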

OpenGL interoperability

Just as CUDA supports interoperability with OpenGL, JCuda supports interoperability with JOGL and LWJGL.
The OpenGL interoperability makes it possible to access memory that is bound to OpenGL from JCuda. Thus, JCuda can be used to write vertex coordinates that are computed in a CUDA kernel into Vertex Buffer Objects (VBO), or pixel data into Pixel Buffer Objects (PBO). These objects may then be rendered efficiently using JOGL or LWJGL. Additionally, JCuda allows CUDA kernels to efficiently access data that is created on the Java side via texture references.

There are some samples for JCuda OpenGL interaction on the samples page.

The following image is a screenshot of one of the sample applications that reads volume data from an input file, copies it into a 3D texture, uses a CUDA kernel to render the volume data into a PBO, and displays the resulting PBO with JOGL. It uses the kernels from the Volume rendering sample from the NVIDIA CUDA samples web site.

Pointer handling

The most obvious limitation of Java compared to C is the lack of real pointers. All objects in Java are implicitly accessed via references. Arrays or objects are created using the new keyword, as is done in C++. References may be null, as pointers may be in C/C++. So there are similarities between C/C++ pointers and Java references (and the name NullPointerException is not a coincidence). Nevertheless, references are not suitable for emulating native pointers, since they do not allow pointer arithmetic and cannot be passed to native libraries. Additionally, "references to references" are not possible.

To overcome these limitations, the Pointer class has been introduced in JCuda. It may be treated similarly to a void* pointer in C, and thus may be used for native host or device memory, and for Java memory:
// Create a new (null) pointer
Pointer devicePointer = new Pointer();

// Allocate device memory at the given pointer
JCuda.cudaMalloc(devicePointer, 4 * Sizeof.FLOAT);

// Create a pointer to the start of a Java array
float array[] = new float[8];
Pointer hostPointer = Pointer.to(array);

// Add an offset to the Pointer
Pointer hostPointerWithOffset = hostPointer.withByteOffset(2 * Sizeof.FLOAT);

// Copy 4 elements from the middle of the Java array to the device
JCuda.cudaMemcpy(devicePointer, hostPointerWithOffset, 4 * Sizeof.FLOAT,
    cudaMemcpyKind.cudaMemcpyHostToDevice);

Pointers may either be created by instantiating a new Pointer, which initially will be a NULL pointer, or by passing either a (direct or array-based) Buffer or a primitive Java array to one of the "to(...)" methods of the Pointer class.
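
The distinction between direct buffers, array-based buffers and primitive arrays, and the meaning of a byte offset, can be seen with the standard java.nio classes alone. The sketch below uses no JCuda calls, and its class and method names are hypothetical. It shows that a byte offset of 2 * Float.BYTES into an 8-element float array, combined with a size of 4 * Float.BYTES, covers elements 2 to 5, which is the range addressed by the withByteOffset example above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

// Hypothetical illustration, not part of JCuda.
final class HostMemoryKinds {
    /** Returns the float elements covered by a byte offset and a byte count. */
    static float[] slice(float[] array, int byteOffset, int byteCount) {
        int first = byteOffset / Float.BYTES;
        int count = byteCount / Float.BYTES;
        float[] result = new float[count];
        System.arraycopy(array, first, result, 0, count);
        return result;
    }

    public static void main(String[] args) {
        float[] array = {0, 1, 2, 3, 4, 5, 6, 7};

        // Array-based buffer: a view on the Java array
        FloatBuffer wrapped = FloatBuffer.wrap(array);

        // Direct buffer: memory outside the Java heap, in native byte order
        FloatBuffer direct = ByteBuffer.allocateDirect(8 * Float.BYTES)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();

        System.out.println("wrapped: isDirect=" + wrapped.isDirect()
            + ", hasArray=" + wrapped.hasArray());
        System.out.println("direct:  isDirect=" + direct.isDirect()
            + ", hasArray=" + direct.hasArray());

        // Elements addressed by an offset of 2 floats and a size of 4 floats
        float[] middle = slice(array, 2 * Float.BYTES, 4 * Float.BYTES);
        System.out.println(Arrays.toString(middle));
    }
}
```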

Pointers to pointers

It is possible to pass an array of Pointer objects to the "to(...)" method. This is needed in order to allocate a 2D array (i.e. an array of Pointers) on the device, which may then be passed to a library or kernel. See the JCuda driver API example for how to pass a 2D array to a kernel.

However, there are limitations on how these pointers may be used. In particular, not all types of pointers may be written to. When a pointer points to a direct buffer or array, this pointer should not be overwritten. Future versions may support this, but currently, an attempt to overwrite such a pointer may cause unspecified behavior.

Asynchronous operations

There has been some confusion about the behavior of CUDA when it comes to asynchronous operations. This was mainly caused by the different kinds of memory that can be involved in an operation. Additionally, there are several options for transferring memory between Java and a C API like CUDA, which also had to be considered for JCuda. With CUDA 4.1, the synchronous/asynchronous behavior of CUDA was specified in more detail. Unfortunately, the unified addressing and concurrent execution of later CUDA versions add another level of complexity. But at least the basic operations should be covered here.

The following sections contain quotes from the site describing the API synchronization behavior of CUDA.

Asynchronous operations in CUDA

The idea behind an asynchronous operation is that, when the function is called, the call returns immediately, even if the result of the function is not yet available. CUDA offers various types of asynchronous operations, such as kernel launches and the memory copy functions with the Async suffix. Additionally, the runtime libraries offer methods to set a cudaStream_t that should be associated with the functions of the respective library, for example via cublasSetStream or cufftSetStream. For all APIs, the stream and event handling functions may be used to achieve proper synchronization between different calls that may be associated with different streams.
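
The calling pattern described above (the call returns at once, and the caller synchronizes later) is not specific to CUDA. As a plain-Java analogy only, the following sketch uses CompletableFuture. It involves no CUDA calls, and the class and method names are hypothetical. The join() call plays a role loosely comparable to a stream or event synchronization.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

// Hypothetical analogy in plain Java, not part of JCuda.
final class AsyncCallAnalogy {
    /** Starts a sum in the background; the work waits until 'gate' opens. */
    static CompletableFuture<Integer> sumAsync(int[] data, CountDownLatch gate) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                gate.await(); // stands in for work that takes a while
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            int sum = 0;
            for (int value : data) {
                sum += value;
            }
            return sum;
        });
    }

    public static void main(String[] args) {
        CountDownLatch gate = new CountDownLatch(1);
        CompletableFuture<Integer> result = sumAsync(new int[] {1, 2, 3}, gate);

        // The call has returned, but the result is not available yet
        System.out.println("done after call: " + result.isDone());

        gate.countDown();        // let the background work finish
        int sum = result.join(); // synchronize: block until finished
        System.out.println("sum: " + sum);
    }
}
```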

Synchronous and asynchronous memory copy operations

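There are different functions for copying memory in CUDA, in synchronous variants such as cudaMemcpy and asynchronous variants such as cudaMemcpyAsync.
But in contrast to what the names suggest, the exact behavior of these functions mainly depends on the type of the memory that they are operating on. The different types of memory considered here are device memory, pinned host memory, and pageable host memory (either a Java array or a direct Java buffer).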

The following lists describe the synchronization behavior of CUDA depending on the memory copy function that is used, and depending on the type of the memory that is involved. These lists summarize and partially quote the information about the API synchronization behavior of CUDA.

Synchronous memory copy operations:
Asynchronous memory copy operations:


The JCudaAsyncCopyTest program demonstrates the different forms of synchronous and asynchronous copy operations discussed here. It allocates memory blocks of different types (device, pinned host, pageable host with Java array, pageable host with direct Java buffer). Then it performs synchronous and asynchronous copy operations between all types of memory, and prints the timing results.

It can be seen that the only configurations in which the data transfer between the host and the device is really asynchronous are the ones where data is copied from the device to pinned host memory or vice versa.

Asynchronous operations in CUBLAS and CUSPARSE

(Note: This section has to be validated against the API specification, and may be updated accordingly)

The most recent versions of CUBLAS and CUSPARSE (as defined in the header files "cublas_v2.h" and "cusparse_v2.h") are inherently asynchronous. This means that all functions return immediately when they are called, although the result of the computation may not yet be available. This does not pose any problems as long as the functions do not involve host memory. However, in the newest versions of CUBLAS and CUSPARSE, several functions have been introduced that may accept parameters or return results of computations either via pointers to device memory or via pointers to host memory.

These functions are also offered in JCublas2 and JCusparse2. When they are called with pointers to device memory, they are executed asynchronously and return immediately, writing the result to the device memory as soon as the computation is finished. But this is not possible when they are called with pointers to Java arrays. In this case, the functions will block until the computation has completed. Note that the functions will not block when they receive a pointer to a direct buffer, but this has not been tested extensively.