jcuda.runtime
Class JCuda

java.lang.Object
  extended by jcuda.runtime.JCuda

public class JCuda
extends java.lang.Object

Java bindings for the NVIDIA CUDA runtime API.

Most comments are extracted from the CUDA online documentation.


Field Summary
static int cudaArrayCubemap
          Must be set in cudaMalloc3DArray to create a cubemap CUDA array
static int cudaArrayDefault
          Default CUDA array allocation flag
static int cudaArrayLayered
          Must be set in cudaMalloc3DArray to create a layered CUDA array
static int cudaArraySurfaceLoadStore
          Must be set in cudaMallocArray or cudaMalloc3DArray in order to bind surfaces to the CUDA array
static int cudaArrayTextureGather
          Must be set in cudaMallocArray or cudaMalloc3DArray in order to perform texture gather operations on the CUDA array
static int cudaDeviceBlockingSync
          Deprecated. As of CUDA 4.0 and replaced by cudaDeviceScheduleBlockingSync
static int cudaDeviceLmemResizeToMax
          Device flag - Keep local memory allocation after launch
static int cudaDeviceMapHost
          Device flag - Support mapped pinned allocations
static int cudaDeviceMask
          Device flags mask
static int cudaDeviceScheduleAuto
          Device flag - Automatic scheduling
static int cudaDeviceScheduleBlockingSync
          Device flag - Use blocking synchronization
static int cudaDeviceScheduleMask
          Device schedule flags mask
static int cudaDeviceScheduleSpin
          Device flag - Spin default scheduling
static int cudaDeviceScheduleYield
          Device flag - Yield default scheduling
static int cudaEventBlockingSync
          Event uses blocking synchronization
static int cudaEventDefault
          Default event flag
static int cudaEventDisableTiming
          Event will not record timing data
static int cudaEventInterprocess
          Event is suitable for interprocess use. cudaEventDisableTiming must be set
static int cudaHostAllocDefault
          Default page-locked allocation flag
static int cudaHostAllocMapped
          Map allocation into device space
static int cudaHostAllocPortable
          Pinned memory accessible by all CUDA contexts
static int cudaHostAllocWriteCombined
          Write-combined memory
static int cudaHostRegisterDefault
          Default host memory registration flag
static int cudaHostRegisterMapped
          Map registered memory into device space
static int cudaHostRegisterPortable
          Pinned memory accessible by all CUDA contexts
static int cudaIpcMemLazyEnablePeerAccess
          Automatically enable peer access between remote devices as needed
static int cudaPeerAccessDefault
          Default peer addressing enable flag
static int CUDART_VERSION
          CUDA runtime version
static int cudaSurfaceType1D
          cudaSurfaceType1D
static int cudaSurfaceType1DLayered
          cudaSurfaceType1DLayered
static int cudaSurfaceType2D
          cudaSurfaceType2D
static int cudaSurfaceType2DLayered
          cudaSurfaceType2DLayered
static int cudaSurfaceType3D
          cudaSurfaceType3D
static int cudaSurfaceTypeCubemap
          cudaSurfaceTypeCubemap
static int cudaSurfaceTypeCubemapLayered
          cudaSurfaceTypeCubemapLayered
static int cudaTextureType1D
          cudaTextureType1D
static int cudaTextureType1DLayered
          cudaTextureType1DLayered
static int cudaTextureType2D
          cudaTextureType2D
static int cudaTextureType2DLayered
          cudaTextureType2DLayered
static int cudaTextureType3D
          cudaTextureType3D
static int cudaTextureTypeCubemap
          cudaTextureTypeCubemap
static int cudaTextureTypeCubemapLayered
          cudaTextureTypeCubemapLayered
 
Method Summary
static int cudaArrayGetInfo(cudaChannelFormatDesc desc, cudaExtent extent, int[] flags, cudaArray array)
          Gets info about the specified cudaArray.
static int cudaBindSurfaceToArray(surfaceReference surfref, cudaArray array, cudaChannelFormatDesc desc)
          Binds an array to a surface.
static int cudaBindTexture(long[] offset, textureReference texref, Pointer devPtr, cudaChannelFormatDesc desc, long size)
          Binds a memory area to a texture.
static int cudaBindTexture2D(long[] offset, textureReference texref, Pointer devPtr, cudaChannelFormatDesc desc, long width, long height, long pitch)
          Binds a 2D memory area to a texture.
static int cudaBindTextureToArray(textureReference texref, cudaArray array, cudaChannelFormatDesc desc)
          Binds an array to a texture.
static int cudaChooseDevice(int[] device, cudaDeviceProp prop)
          Select compute-device which best matches criteria.
static int cudaConfigureCall(dim3 gridDim, dim3 blockDim, long sharedMem, cudaStream_t stream)
          Configure a device-launch.
static cudaChannelFormatDesc cudaCreateChannelDesc(int x, int y, int z, int w, int cudaChannelFormatKind_f)
          Returns a channel descriptor using the specified format.
static int cudaDeviceCanAccessPeer(int[] canAccessPeer, int device, int peerDevice)
          Queries if a device may directly access a peer device's memory.
static int cudaDeviceDisablePeerAccess(int peerDevice)
          Disables direct access to memory allocations on a peer device and unregisters any registered allocations from that device.
static int cudaDeviceEnablePeerAccess(int peerDevice, int flags)
          Enables direct access to memory allocations on a peer device.
static int cudaDeviceGetByPCIBusId(int[] device, java.lang.String pciBusId)
          Returns a handle to a compute device.
static int cudaDeviceGetCacheConfig(int[] pCacheConfig)
          Returns the preferred cache configuration for the current device.
static int cudaDeviceGetLimit(long[] pValue, int limit)
          Returns resource limits.
static int cudaDeviceGetPCIBusId(java.lang.String[] pciBusId, int len, int device)
          Returns a PCI Bus Id string for the device.
static int cudaDeviceGetSharedMemConfig(int[] pConfig)
          (No documentation in CUDA 4.2) Returns the shared memory configuration
static int cudaDeviceReset()
          Destroy all allocations and reset all state on the current device in the current process.
static int cudaDeviceSetCacheConfig(int cacheConfig)
          Sets the preferred cache configuration for the current device.
static int cudaDeviceSetLimit(int limit, long value)
          Set resource limits.
static int cudaDeviceSetSharedMemConfig(int config)
          (No documentation in CUDA 4.2) Sets the shared memory configuration
static int cudaDeviceSynchronize()
          Wait for compute device to finish.
static int cudaDriverGetVersion(int[] driverVersion)
          Returns the CUDA driver version.
static int cudaEventCreate(cudaEvent_t event)
          Creates an event object.
static int cudaEventCreateWithFlags(cudaEvent_t event, int flags)
          Creates an event object with the specified flags.
static int cudaEventDestroy(cudaEvent_t event)
          Destroys an event object.
static int cudaEventElapsedTime(float[] ms, cudaEvent_t start, cudaEvent_t end)
          Computes the elapsed time between events.
static int cudaEventQuery(cudaEvent_t event)
          Queries an event's status.
static int cudaEventRecord(cudaEvent_t event, cudaStream_t stream)
          Records an event.
static int cudaEventSynchronize(cudaEvent_t event)
          Waits for an event to complete.
static int cudaFree(Pointer devPtr)
          Frees memory on the device.
static int cudaFreeArray(cudaArray array)
          Frees an array on the device.
static int cudaFreeHost(Pointer ptr)
          Frees page-locked memory.
static int cudaFuncGetAttributes(cudaFuncAttributes attr, java.lang.String func)
          Find out attributes for a given function.
static int cudaGetChannelDesc(cudaChannelFormatDesc desc, cudaArray array)
          Get the channel descriptor of an array.
static int cudaGetDevice(int[] device)
          Returns which device is currently being used.
static int cudaGetDeviceCount(int[] count)
          Returns the number of compute-capable devices.
static int cudaGetDeviceProperties(cudaDeviceProp prop, int device)
          Returns information about the compute-device.
static java.lang.String cudaGetErrorString(int error)
          Returns the message string from an error code.
static int cudaGetLastError()
          Returns the last error from a runtime call.
static int cudaGetSurfaceReference(surfaceReference surfref, java.lang.String symbol)
          Deprecated. As of CUDA 4.1
static int cudaGetSymbolAddress(Pointer devPtr, java.lang.String symbol)
          Finds the address associated with a CUDA symbol.
static int cudaGetSymbolSize(long[] size, java.lang.String symbol)
          Finds the size of the object associated with a CUDA symbol.
static int cudaGetTextureAlignmentOffset(long[] offset, textureReference texref)
          Get the alignment offset of a texture.
static int cudaGetTextureReference(textureReference texref, java.lang.String symbol)
          Deprecated. As of CUDA 4.1
static int cudaGLGetDevices(int[] pCudaDeviceCount, int[] pCudaDevices, int cudaDeviceCount, int cudaGLDeviceList_deviceList)
          Gets the CUDA devices associated with the current OpenGL context.
static int cudaGLMapBufferObject(Pointer devPtr, int bufObj)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLMapBufferObjectAsync(Pointer devPtr, int bufObj, cudaStream_t stream)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLRegisterBufferObject(int bufObj)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLSetBufferObjectMapFlags(int bufObj, int flags)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLSetGLDevice(int device)
          Sets a CUDA device to use OpenGL interoperability.
static int cudaGLUnmapBufferObject(int bufObj)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLUnmapBufferObjectAsync(int bufObj, cudaStream_t stream)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGLUnregisterBufferObject(int bufObj)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaGraphicsGLRegisterBuffer(cudaGraphicsResource resource, int buffer, int Flags)
          Registers an OpenGL buffer object.
static int cudaGraphicsGLRegisterImage(cudaGraphicsResource resource, int image, int target, int Flags)
          Register an OpenGL texture or renderbuffer object.
static int cudaGraphicsMapResources(int count, cudaGraphicsResource[] resources, cudaStream_t stream)
          Map graphics resources for access by CUDA.
static int cudaGraphicsResourceGetMappedPointer(Pointer devPtr, long[] size, cudaGraphicsResource resource)
          Get a device pointer through which to access a mapped graphics resource.
static int cudaGraphicsResourceSetMapFlags(cudaGraphicsResource resource, int flags)
          Set usage flags for mapping a graphics resource.
static int cudaGraphicsSubResourceGetMappedArray(cudaArray arrayPtr, cudaGraphicsResource resource, int arrayIndex, int mipLevel)
          Get an array through which to access a subresource of a mapped graphics resource.
static int cudaGraphicsUnmapResources(int count, cudaGraphicsResource[] resources, cudaStream_t stream)
          Unmap graphics resources.
static int cudaGraphicsUnregisterResource(cudaGraphicsResource resource)
          Unregisters a graphics resource for access by CUDA.
static int cudaHostAlloc(Pointer ptr, long size, int flags)
          Allocates page-locked memory on the host.
static int cudaHostGetDevicePointer(Pointer pDevice, Pointer pHost, int flags)
          Passes back device pointer of mapped host memory allocated by cudaHostAlloc() or registered by cudaHostRegister().
static int cudaHostRegister(Pointer ptr, long size, int flags)
          Registers an existing host memory range for use by CUDA.
static int cudaHostUnregister(Pointer ptr)
          Unregisters a memory range that was registered with cudaHostRegister().
static int cudaIpcCloseMemHandle(Pointer devPtr)
          Close memory mapped with cudaIpcOpenMemHandle().
static int cudaIpcGetEventHandle(cudaIpcEventHandle handle, cudaEvent_t event)
          Gets an interprocess handle for a previously allocated event.
static int cudaIpcGetMemHandle(cudaIpcMemHandle handle, Pointer devPtr)
          Gets an interprocess memory handle for an existing device memory allocation.
static int cudaIpcOpenEventHandle(cudaEvent_t event, cudaIpcEventHandle handle)
          Opens an interprocess event handle for use in the current process.
static int cudaIpcOpenMemHandle(Pointer devPtr, cudaIpcMemHandle handle, int flags)
          Opens an interprocess memory handle exported from another process and returns a device pointer usable in the local process.
static int cudaLaunch(java.lang.String symbol)
          Launches a device function.
static int cudaMalloc(Pointer devPtr, long size)
          Allocate memory on the device.
static int cudaMalloc3D(cudaPitchedPtr pitchDevPtr, cudaExtent extent)
          Allocates logical 1D, 2D, or 3D memory objects on the device.
static int cudaMalloc3DArray(cudaArray arrayPtr, cudaChannelFormatDesc desc, cudaExtent extent)
          Calls cudaMalloc3DArray with the default value '0' as the last parameter.
static int cudaMalloc3DArray(cudaArray arrayPtr, cudaChannelFormatDesc desc, cudaExtent extent, int flags)
          Allocate an array on the device.
static int cudaMallocArray(cudaArray array, cudaChannelFormatDesc desc, long width, long height)
          Calls cudaMallocArray with the default value '0' as the last parameter.
static int cudaMallocArray(cudaArray array, cudaChannelFormatDesc desc, long width, long height, int flags)
          Allocate an array on the device.
static int cudaMallocHost(Pointer ptr, long size)
          Allocates page-locked memory on the host.
static int cudaMallocPitch(Pointer devPtr, long[] pitch, long width, long height)
          Allocates pitched memory on the device.
static int cudaMemcpy(Pointer dst, Pointer src, long count, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpy2D(Pointer dst, long dpitch, Pointer src, long spitch, long width, long height, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpy2DArrayToArray(cudaArray dst, long wOffsetDst, long hOffsetDst, cudaArray src, long wOffsetSrc, long hOffsetSrc, long width, long height, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpy2DAsync(Pointer dst, long dpitch, Pointer src, long spitch, long width, long height, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpy2DFromArray(Pointer dst, long dpitch, cudaArray src, long wOffset, long hOffset, long width, long height, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpy2DFromArrayAsync(Pointer dst, long dpitch, cudaArray src, long wOffset, long hOffset, long width, long height, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpy2DToArray(cudaArray dst, long wOffset, long hOffset, Pointer src, long spitch, long width, long height, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpy2DToArrayAsync(cudaArray dst, long wOffset, long hOffset, Pointer src, long spitch, long width, long height, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpy3D(cudaMemcpy3DParms p)
          Copies data between 3D objects.
static int cudaMemcpy3DAsync(cudaMemcpy3DParms p, cudaStream_t stream)
          Copies data between 3D objects.
static int cudaMemcpy3DPeer(cudaMemcpy3DPeerParms p)
          Copies memory between devices.
static int cudaMemcpy3DPeerAsync(cudaMemcpy3DPeerParms p, cudaStream_t stream)
          Copies memory between devices asynchronously.
static int cudaMemcpyArrayToArray(cudaArray dst, long wOffsetDst, long hOffsetDst, cudaArray src, long wOffsetSrc, long hOffsetSrc, long count, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpyAsync(Pointer dst, Pointer src, long count, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpyFromArray(Pointer dst, cudaArray src, long wOffset, long hOffset, long count, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpyFromArrayAsync(Pointer dst, cudaArray src, long wOffset, long hOffset, long count, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpyFromSymbol(Pointer dst, java.lang.String symbol, long count, long offset, int cudaMemcpyKind_kind)
          Copies data from the given symbol on the device.
static int cudaMemcpyFromSymbolAsync(Pointer dst, java.lang.String symbol, long count, long offset, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data from the given symbol on the device.
static int cudaMemcpyPeer(Pointer dst, int dstDevice, Pointer src, int srcDevice, long count)
          Copies memory between two devices.
static int cudaMemcpyPeerAsync(Pointer dst, int dstDevice, Pointer src, int srcDevice, long count, cudaStream_t stream)
          Copies memory between two devices asynchronously.
static int cudaMemcpyToArray(cudaArray dst, long wOffset, long hOffset, Pointer src, long count, int cudaMemcpyKind_kind)
          Copies data between host and device.
static int cudaMemcpyToArrayAsync(cudaArray dst, long wOffset, long hOffset, Pointer src, long count, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data between host and device.
static int cudaMemcpyToSymbol(java.lang.String symbol, Pointer src, long count, long offset, int cudaMemcpyKind_kind)
          Copies data to the given symbol on the device.
static int cudaMemcpyToSymbolAsync(java.lang.String symbol, Pointer src, long count, long offset, int cudaMemcpyKind_kind, cudaStream_t stream)
          Copies data to the given symbol on the device.
static int cudaMemGetInfo(long[] free, long[] total)
          Gets free and total device memory.
static int cudaMemset(Pointer mem, int c, long count)
          Initializes or sets device memory to a value.
static int cudaMemset2D(Pointer mem, long pitch, int c, long width, long height)
          Initializes or sets device memory to a value.
static int cudaMemset2DAsync(Pointer devPtr, long pitch, int value, long width, long height, cudaStream_t stream)
          Initializes or sets device memory to a value.
static int cudaMemset3D(cudaPitchedPtr pitchDevPtr, int value, cudaExtent extent)
          Initializes or sets device memory to a value.
static int cudaMemset3DAsync(cudaPitchedPtr pitchedDevPtr, int value, cudaExtent extent, cudaStream_t stream)
          Initializes or sets device memory to a value.
static int cudaMemsetAsync(Pointer devPtr, int value, long count, cudaStream_t stream)
          Initializes or sets device memory to a value.
static int cudaPeekAtLastError()
          Returns the last error from a runtime call.
static int cudaPointerGetAttributes(cudaPointerAttributes attributes, Pointer ptr)
          Returns attributes about a specified pointer.
static int cudaProfilerInitialize(java.lang.String configFile, java.lang.String outputFile, int outputMode)
          Initializes profiling.
static int cudaProfilerStart()
          Starts profiling.
static int cudaProfilerStop()
          Stops profiling.
static int cudaRuntimeGetVersion(int[] runtimeVersion)
          Returns the CUDA Runtime version.
static int cudaSetDevice(int device)
          Set device to be used for GPU executions.
static int cudaSetDeviceFlags(int flags)
          Sets flags to be used for device executions.
static int cudaSetupArgument(Pointer arg, long size, long offset)
          Configure a device launch.
static int cudaSetValidDevices(int[] device_arr, int len)
          Set a list of devices that can be used for CUDA.
static int cudaStreamCreate(cudaStream_t stream)
          Create an asynchronous stream.
static int cudaStreamDestroy(cudaStream_t stream)
          Destroys and cleans up an asynchronous stream.
static int cudaStreamQuery(cudaStream_t stream)
          Queries an asynchronous stream for completion status.
static int cudaStreamSynchronize(cudaStream_t stream)
          Waits for stream tasks to complete.
static int cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, int flags)
          Make a compute stream wait on an event.
static int cudaThreadExit()
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaThreadGetCacheConfig(int[] pCacheConfig)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaThreadGetLimit(long[] pValue, int limit)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaThreadSetCacheConfig(int cacheConfig)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaThreadSetLimit(int limit, long value)
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaThreadSynchronize()
          Deprecated. This function is deprecated in the latest CUDA version
static int cudaUnbindTexture(textureReference texref)
          Unbinds a texture.
static void initialize()
          Initializes the native library.
static void setExceptionsEnabled(boolean enabled)
          Enables or disables exceptions.
static void setLogLevel(LogLevel logLevel)
          Set the specified log level for the JCuda runtime library.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CUDART_VERSION

public static final int CUDART_VERSION
CUDA runtime version

See Also:
Constant Field Values

cudaHostAllocDefault

public static final int cudaHostAllocDefault
Default page-locked allocation flag

See Also:
Constant Field Values

cudaHostAllocPortable

public static final int cudaHostAllocPortable
Pinned memory accessible by all CUDA contexts

See Also:
Constant Field Values

cudaHostAllocMapped

public static final int cudaHostAllocMapped
Map allocation into device space

See Also:
Constant Field Values

cudaHostAllocWriteCombined

public static final int cudaHostAllocWriteCombined
Write-combined memory

See Also:
Constant Field Values

cudaHostRegisterDefault

public static final int cudaHostRegisterDefault
Default host memory registration flag

See Also:
Constant Field Values

cudaHostRegisterPortable

public static final int cudaHostRegisterPortable
Pinned memory accessible by all CUDA contexts

See Also:
Constant Field Values

cudaHostRegisterMapped

public static final int cudaHostRegisterMapped
Map registered memory into device space

See Also:
Constant Field Values

cudaPeerAccessDefault

public static final int cudaPeerAccessDefault
Default peer addressing enable flag

See Also:
Constant Field Values

cudaEventDefault

public static final int cudaEventDefault
Default event flag

See Also:
Constant Field Values

cudaEventBlockingSync

public static final int cudaEventBlockingSync
Event uses blocking synchronization

See Also:
Constant Field Values

cudaEventDisableTiming

public static final int cudaEventDisableTiming
Event will not record timing data

See Also:
Constant Field Values

cudaEventInterprocess

public static final int cudaEventInterprocess
Event is suitable for interprocess use. cudaEventDisableTiming must be set

See Also:
Constant Field Values

cudaDeviceScheduleAuto

public static final int cudaDeviceScheduleAuto
Device flag - Automatic scheduling

See Also:
Constant Field Values

cudaDeviceScheduleSpin

public static final int cudaDeviceScheduleSpin
Device flag - Spin default scheduling

See Also:
Constant Field Values

cudaDeviceScheduleYield

public static final int cudaDeviceScheduleYield
Device flag - Yield default scheduling

See Also:
Constant Field Values

cudaDeviceScheduleBlockingSync

public static final int cudaDeviceScheduleBlockingSync
Device flag - Use blocking synchronization

See Also:
Constant Field Values

cudaDeviceBlockingSync

public static final int cudaDeviceBlockingSync
Deprecated. As of CUDA 4.0 and replaced by cudaDeviceScheduleBlockingSync
Device flag - Use blocking synchronization

See Also:
Constant Field Values

cudaDeviceScheduleMask

public static final int cudaDeviceScheduleMask
Device schedule flags mask

See Also:
Constant Field Values

cudaDeviceMapHost

public static final int cudaDeviceMapHost
Device flag - Support mapped pinned allocations

See Also:
Constant Field Values

cudaDeviceLmemResizeToMax

public static final int cudaDeviceLmemResizeToMax
Device flag - Keep local memory allocation after launch

See Also:
Constant Field Values

cudaDeviceMask

public static final int cudaDeviceMask
Device flags mask

See Also:
Constant Field Values

cudaArrayDefault

public static final int cudaArrayDefault
Default CUDA array allocation flag

See Also:
Constant Field Values

cudaArrayLayered

public static final int cudaArrayLayered
Must be set in cudaMalloc3DArray to create a layered CUDA array

See Also:
Constant Field Values

cudaArraySurfaceLoadStore

public static final int cudaArraySurfaceLoadStore
Must be set in cudaMallocArray or cudaMalloc3DArray in order to bind surfaces to the CUDA array

See Also:
Constant Field Values

cudaArrayCubemap

public static final int cudaArrayCubemap
Must be set in cudaMalloc3DArray to create a cubemap CUDA array

See Also:
Constant Field Values

cudaArrayTextureGather

public static final int cudaArrayTextureGather
Must be set in cudaMallocArray or cudaMalloc3DArray in order to perform texture gather operations on the CUDA array

See Also:
Constant Field Values

cudaIpcMemLazyEnablePeerAccess

public static final int cudaIpcMemLazyEnablePeerAccess
Automatically enable peer access between remote devices as needed

See Also:
Constant Field Values

cudaSurfaceType1D

public static final int cudaSurfaceType1D
cudaSurfaceType1D

See Also:
Constant Field Values

cudaSurfaceType2D

public static final int cudaSurfaceType2D
cudaSurfaceType2D

See Also:
Constant Field Values

cudaSurfaceType3D

public static final int cudaSurfaceType3D
cudaSurfaceType3D

See Also:
Constant Field Values

cudaSurfaceTypeCubemap

public static final int cudaSurfaceTypeCubemap
cudaSurfaceTypeCubemap

See Also:
Constant Field Values

cudaSurfaceType1DLayered

public static final int cudaSurfaceType1DLayered
cudaSurfaceType1DLayered

See Also:
Constant Field Values

cudaSurfaceType2DLayered

public static final int cudaSurfaceType2DLayered
cudaSurfaceType2DLayered

See Also:
Constant Field Values

cudaSurfaceTypeCubemapLayered

public static final int cudaSurfaceTypeCubemapLayered
cudaSurfaceTypeCubemapLayered

See Also:
Constant Field Values

cudaTextureType1D

public static final int cudaTextureType1D
cudaTextureType1D

See Also:
Constant Field Values

cudaTextureType2D

public static final int cudaTextureType2D
cudaTextureType2D

See Also:
Constant Field Values

cudaTextureType3D

public static final int cudaTextureType3D
cudaTextureType3D

See Also:
Constant Field Values

cudaTextureTypeCubemap

public static final int cudaTextureTypeCubemap
cudaTextureTypeCubemap

See Also:
Constant Field Values

cudaTextureType1DLayered

public static final int cudaTextureType1DLayered
cudaTextureType1DLayered

See Also:
Constant Field Values

cudaTextureType2DLayered

public static final int cudaTextureType2DLayered
cudaTextureType2DLayered

See Also:
Constant Field Values

cudaTextureTypeCubemapLayered

public static final int cudaTextureTypeCubemapLayered
cudaTextureTypeCubemapLayered

See Also:
Constant Field Values

Method Detail

initialize

public static void initialize()
Initializes the native library. Note that this method does not have to be called explicitly by the user of the library: The library will automatically be initialized when this class is loaded.


setLogLevel

public static void setLogLevel(LogLevel logLevel)
Set the specified log level for the JCuda runtime library.

Currently supported log levels:
LOG_QUIET: Never print anything
LOG_ERROR: Print error messages
LOG_TRACE: Print a trace of all native function calls

Parameters:
logLevel - The log level to use.
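
For example, a minimal usage sketch (assuming the LogLevel constants listed above, with jcuda.* and jcuda.runtime.* imported):

     JCuda.setLogLevel(LogLevel.LOG_ERROR); // only print error messages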

setExceptionsEnabled

public static void setExceptionsEnabled(boolean enabled)
Enables or disables exceptions. By default, the methods of this class only return the cudaError error code from the underlying CUDA function. If exceptions are enabled, a CudaException with a detailed error message will be thrown if a method is about to return a result code that is not cudaError.cudaSuccess.

Parameters:
enabled - Whether exceptions are enabled
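
A minimal sketch contrasting the two error handling styles (assuming jcuda.Pointer, jcuda.runtime.cudaError and the CudaException mentioned above, with jcuda.* and jcuda.runtime.* imported):

     // Default: check the returned error code explicitly
     Pointer devPtr = new Pointer();
     int status = JCuda.cudaMalloc(devPtr, 1024);
     if (status != cudaError.cudaSuccess)
     {
         System.err.println("cudaMalloc failed: " + JCuda.cudaGetErrorString(status));
     }

     // With exceptions enabled, any result other than cudaSuccess
     // causes a CudaException to be thrown instead
     JCuda.setExceptionsEnabled(true);
     JCuda.cudaFree(devPtr);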

cudaGetDeviceCount

public static int cudaGetDeviceCount(int[] count)
Returns the number of compute-capable devices.
cudaError_t cudaGetDeviceCount ( int *  count  ) 

Returns in *count the number of devices with compute capability greater or equal to 1.0 that are available for execution. If there is no such device then cudaGetDeviceCount() will return cudaErrorNoDevice. If no driver can be loaded to determine if any such devices exist then cudaGetDeviceCount() will return cudaErrorInsufficientDriver.

Returns:
cudaSuccess, cudaErrorNoDevice, cudaErrorInsufficientDriver
See Also:
cudaGetDevice(int[]), cudaSetDevice(int), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
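
A minimal sketch of querying the device count from Java (the value is passed back in a one-element array):

     int count[] = { 0 };
     JCuda.cudaGetDeviceCount(count);
     System.out.println("Found " + count[0] + " CUDA capable device(s)");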

cudaSetDevice

public static int cudaSetDevice(int device)
Set device to be used for GPU executions.
cudaError_t cudaSetDevice ( int  device  ) 

Sets device as the current device for the calling host thread.

Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch() or cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread using cudaMallocHost() or cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with device. Any streams or events created from this host thread will be associated with device. Any kernels launched from this host thread using the <<<>>> operator or cudaLaunch() will be executed on device.

This call may be made from any host thread, to any device, and at any time. This function will do no synchronization with the previous or new device, and should be considered a very low overhead call.

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorDeviceAlreadyInUse
See Also:
cudaGetDeviceCount(int[]), cudaGetDevice(int[]), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
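
A minimal sketch (assuming the jcuda.Pointer and jcuda.Sizeof helper classes from the JCuda core package):

     // Make device 0 current for this host thread; subsequent
     // allocations, streams, events and kernel launches are
     // associated with it
     JCuda.cudaSetDevice(0);
     Pointer data = new Pointer();
     JCuda.cudaMalloc(data, 1000 * Sizeof.FLOAT); // resident on device 0
     JCuda.cudaFree(data);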

cudaSetDeviceFlags

public static int cudaSetDeviceFlags(int flags)
Sets flags to be used for device executions.
cudaError_t cudaSetDeviceFlags ( unsigned int  flags  ) 

Records flags as the flags to use when initializing the current device. If no device has been made current to the calling thread then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.

If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess. In this case it is necessary to reset the device using cudaDeviceReset() before the device's initialization flags may be set.

The two LSBs of the flags parameter can be used to control how the CPU thread interacts with the OS scheduler when waiting for results from the device.

  • cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor.
  • cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.
  • cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.
  • cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
  • cudaDeviceBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
    Deprecated: This flag was deprecated as of CUDA 4.0 and replaced with cudaDeviceScheduleBlockingSync.
  • cudaDeviceMapHost: This flag must be set in order to allocate pinned host memory that is accessible to the device. If this flag is not set, cudaHostGetDevicePointer() will always return a failure code.
  • cudaDeviceLmemResizeToMax: Instruct CUDA to not reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage at the cost of potentially increased memory usage.

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorSetOnActiveProcess
See Also:
cudaGetDeviceCount(int[]), cudaGetDevice(int[]), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int), cudaSetDevice(int), cudaSetValidDevices(int[], int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
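
For example, a sketch that combines a scheduling flag with host memory mapping (using the flag constants from the field summary above); this has to be done before the device is initialized:

     // Use blocking synchronization and allow mapped pinned host memory
     JCuda.cudaSetDeviceFlags(
         JCuda.cudaDeviceScheduleBlockingSync | JCuda.cudaDeviceMapHost);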

cudaSetValidDevices

public static int cudaSetValidDevices(int[] device_arr,
                                      int len)
Set a list of devices that can be used for CUDA.
cudaError_t cudaSetValidDevices ( int *  device_arr, int  len )

Sets a list of devices for CUDA execution in priority order using device_arr. The parameter len specifies the number of elements in the list. CUDA will try devices from the list sequentially until it finds one that works. If this function is not called, or if it is called with a len of 0, then CUDA will go back to its default behavior of trying devices sequentially from a default list containing all of the available CUDA devices in the system. If a specified device ID in the list does not exist, this function will return cudaErrorInvalidDevice. If len is not 0 and device_arr is NULL or if len exceeds the number of devices in the system, then cudaErrorInvalidValue is returned.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaGetDeviceCount(int[]), cudaSetDevice(int), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int), cudaSetDeviceFlags(int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
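
A minimal sketch:

     // Prefer device 1, and fall back to device 0
     int deviceList[] = { 1, 0 };
     JCuda.cudaSetValidDevices(deviceList, deviceList.length);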

cudaGetDevice

public static int cudaGetDevice(int[] device)
Returns which device is currently being used.
cudaError_t cudaGetDevice ( int *  device  ) 

Returns in *device the current device for the calling host thread.

Returns:
cudaSuccess
See Also:
cudaGetDeviceCount(int[]), cudaSetDevice(int), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
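
A minimal sketch:

     int device[] = { 0 };
     JCuda.cudaGetDevice(device);
     System.out.println("Current device: " + device[0]);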

cudaGetDeviceProperties

public static int cudaGetDeviceProperties(cudaDeviceProp prop,
                                          int device)
Returns information about the compute-device.
cudaError_t cudaGetDeviceProperties ( struct cudaDeviceProp *  prop, int  device )

Returns in *prop the properties of device dev. The cudaDeviceProp structure is defined as:

    struct cudaDeviceProp {
         char name[256];
         size_t totalGlobalMem;
         size_t sharedMemPerBlock;
         int regsPerBlock;
         int warpSize;
         size_t memPitch;
         int maxThreadsPerBlock;
         int maxThreadsDim[3];
         int maxGridSize[3];
         int clockRate;
         size_t totalConstMem;
         int major;
         int minor;
         size_t textureAlignment;
         size_t texturePitchAlignment;
         int deviceOverlap;
         int multiProcessorCount;
         int kernelExecTimeoutEnabled;
         int integrated;
         int canMapHostMemory;
         int computeMode;
         int maxTexture1D;
         int maxTexture1DLinear;
         int maxTexture2D[2];
         int maxTexture2DLinear[3];
         int maxTexture2DGather[2];
         int maxTexture3D[3];
         int maxTextureCubemap;
         int maxTexture1DLayered[2];
         int maxTexture2DLayered[3];
         int maxTextureCubemapLayered[2];
         int maxSurface1D;
         int maxSurface2D[2];
         int maxSurface3D[3];
         int maxSurface1DLayered[2];
         int maxSurface2DLayered[3];
         int maxSurfaceCubemap;
         int maxSurfaceCubemapLayered[2];
         size_t surfaceAlignment;
         int concurrentKernels;
         int ECCEnabled;
         int pciBusID;
         int pciDeviceID;
         int pciDomainID;
         int tccDriver;
         int asyncEngineCount;
         int unifiedAddressing;
         int memoryClockRate;
         int memoryBusWidth;
         int l2CacheSize;
         int maxThreadsPerMultiProcessor;
     }
 
where:
  • name[256] is an ASCII string identifying the device;
  • totalGlobalMem is the total amount of global memory available on the device in bytes;
  • sharedMemPerBlock is the maximum amount of shared memory available to a thread block in bytes; this amount is shared by all thread blocks simultaneously resident on a multiprocessor;
  • regsPerBlock is the maximum number of 32-bit registers available to a thread block; this number is shared by all thread blocks simultaneously resident on a multiprocessor;
  • warpSize is the warp size in threads;
  • memPitch is the maximum pitch in bytes allowed by the memory copy functions that involve memory regions allocated through cudaMallocPitch();
  • maxThreadsPerBlock is the maximum number of threads per block;
  • maxThreadsDim[3] contains the maximum size of each dimension of a block;
  • maxGridSize[3] contains the maximum size of each dimension of a grid;
  • clockRate is the clock frequency in kilohertz;
  • totalConstMem is the total amount of constant memory available on the device in bytes;
  • major, minor are the major and minor revision numbers defining the device's compute capability;
  • textureAlignment is the alignment requirement; texture base addresses that are aligned to textureAlignment bytes do not need an offset applied to texture fetches;
  • texturePitchAlignment is the pitch alignment requirement for 2D texture references that are bound to pitched memory;
  • deviceOverlap is 1 if the device can concurrently copy memory between host and device while executing a kernel, or 0 if not. Deprecated; use asyncEngineCount instead.
  • multiProcessorCount is the number of multiprocessors on the device;
  • kernelExecTimeoutEnabled is 1 if there is a run time limit for kernels executed on the device, or 0 if not.
  • integrated is 1 if the device is an integrated (motherboard) GPU and 0 if it is a discrete (card) component.
  • canMapHostMemory is 1 if the device can map host memory into the CUDA address space for use with cudaHostAlloc()/cudaHostGetDevicePointer(), or 0 if not;
  • computeMode is the compute mode that the device is currently in. Available modes are as follows:
    • cudaComputeModeDefault: Default mode - Device is not restricted and multiple threads can use cudaSetDevice() with this device.
    • cudaComputeModeExclusive: Compute-exclusive mode - Only one thread will be able to use cudaSetDevice() with this device.
    • cudaComputeModeProhibited: Compute-prohibited mode - No threads can use cudaSetDevice() with this device.
    • cudaComputeModeExclusiveProcess: Compute-exclusive-process mode - Many threads in one process will be able to use cudaSetDevice() with this device.
      If cudaSetDevice() is called on an already occupied device with computeMode cudaComputeModeExclusive, cudaErrorDeviceAlreadyInUse will be immediately returned indicating the device cannot be used. When an occupied exclusive mode device is chosen with cudaSetDevice, all subsequent non-device management runtime functions will return cudaErrorDevicesUnavailable.
  • maxTexture1D is the maximum 1D texture size.
  • maxTexture1DLinear is the maximum 1D texture size for textures bound to linear memory.
  • maxTexture2D[2] contains the maximum 2D texture dimensions.
  • maxTexture2DLinear[3] contains the maximum 2D texture dimensions for 2D textures bound to pitch linear memory.
  • maxTexture2DGather[2] contains the maximum 2D texture dimensions if texture gather operations have to be performed.
  • maxTexture3D[3] contains the maximum 3D texture dimensions.
  • maxTextureCubemap is the maximum cubemap texture width or height.
  • maxTexture1DLayered[2] contains the maximum 1D layered texture dimensions.
  • maxTexture2DLayered[3] contains the maximum 2D layered texture dimensions.
  • maxTextureCubemapLayered[2] contains the maximum cubemap layered texture dimensions.
  • maxSurface1D is the maximum 1D surface size.
  • maxSurface2D[2] contains the maximum 2D surface dimensions.
  • maxSurface3D[3] contains the maximum 3D surface dimensions.
  • maxSurface1DLayered[2] contains the maximum 1D layered surface dimensions.
  • maxSurface2DLayered[3] contains the maximum 2D layered surface dimensions.
  • maxSurfaceCubemap is the maximum cubemap surface width or height.
  • maxSurfaceCubemapLayered[2] contains the maximum cubemap layered surface dimensions.
  • surfaceAlignment specifies the alignment requirements for surfaces.
  • concurrentKernels is 1 if the device supports executing multiple kernels within the same context simultaneously, or 0 if not. It is not guaranteed that multiple kernels will be resident on the device concurrently so this feature should not be relied upon for correctness;
  • ECCEnabled is 1 if the device has ECC support turned on, or 0 if not.
  • pciBusID is the PCI bus identifier of the device.
  • pciDeviceID is the PCI device (sometimes called slot) identifier of the device.
  • pciDomainID is the PCI domain identifier of the device.
  • tccDriver is 1 if the device is using a TCC driver or 0 if not.
  • asyncEngineCount is 1 when the device can concurrently copy memory between host and device while executing a kernel. It is 2 when the device can concurrently copy memory between host and device in both directions and execute a kernel at the same time. It is 0 if neither of these is supported.
  • unifiedAddressing is 1 if the device shares a unified address space with the host and 0 otherwise.
  • memoryClockRate is the peak memory clock frequency in kilohertz.
  • memoryBusWidth is the memory bus width in bits.
  • l2CacheSize is L2 cache size in bytes.
  • maxThreadsPerMultiProcessor is the number of maximum resident threads per multiprocessor.

Returns:
cudaSuccess, cudaErrorInvalidDevice
See Also:
cudaGetDeviceCount(int[]), cudaGetDevice(int[]), cudaSetDevice(int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)
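
A minimal sketch of reading a few properties from Java (assuming that the cudaDeviceProp class exposes the C struct members listed above as public fields):

     cudaDeviceProp prop = new cudaDeviceProp();
     JCuda.cudaGetDeviceProperties(prop, 0);
     System.out.println("Compute capability: " + prop.major + "." + prop.minor);
     System.out.println("Global memory     : " + prop.totalGlobalMem + " bytes");
     System.out.println("Multiprocessors   : " + prop.multiProcessorCount);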

cudaChooseDevice

public static int cudaChooseDevice(int[] device,
                                   cudaDeviceProp prop)
Select compute-device which best matches criteria.
cudaError_t cudaChooseDevice ( int *  device, const struct cudaDeviceProp *  prop )

Returns in *device the device which has properties that best match *prop.

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaGetDeviceCount(int[]), cudaGetDevice(int[]), cudaSetDevice(int), cudaGetDeviceProperties(jcuda.runtime.cudaDeviceProp, int)
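
A minimal sketch that requests a device with at least compute capability 2.0 (assuming the major and minor members of cudaDeviceProp are public fields):

     cudaDeviceProp prop = new cudaDeviceProp();
     prop.major = 2;
     prop.minor = 0;
     int device[] = { 0 };
     JCuda.cudaChooseDevice(device, prop);
     JCuda.cudaSetDevice(device[0]);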

cudaMalloc3D

public static int cudaMalloc3D(cudaPitchedPtr pitchDevPtr,
                               cudaExtent extent)
Allocates logical 1D, 2D, or 3D memory objects on the device.
cudaError_t cudaMalloc3D ( struct cudaPitchedPtr *  pitchedDevPtr, struct cudaExtent  extent )

Allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met. The pitch returned in the pitch field of pitchedDevPtr is the width in bytes of the allocation.

The returned cudaPitchedPtr contains additional fields xsize and ysize, the logical width and height of the allocation, which are equivalent to the width and height extent parameters provided by the programmer during allocation.

For allocations of 2D and 3D objects, it is highly recommended that programmers perform allocations using cudaMalloc3D() or cudaMallocPitch(). Due to alignment restrictions in the hardware, this is especially true if the application will be performing memory copies involving 2D or 3D objects (whether linear memory or CUDA arrays).

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaMemcpy3D(jcuda.runtime.cudaMemcpy3DParms), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaHostAlloc(jcuda.Pointer, long, int)
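
A minimal sketch of a 3D allocation from Java (assuming that cudaExtent and cudaPitchedPtr expose the C struct members as public fields, and jcuda.Sizeof for element sizes):

     // Allocate a 64 x 64 x 64 volume of float elements
     cudaExtent extent = new cudaExtent();
     extent.width  = 64 * Sizeof.FLOAT; // width in bytes for linear memory
     extent.height = 64;
     extent.depth  = 64;
     cudaPitchedPtr pitchedDevPtr = new cudaPitchedPtr();
     JCuda.cudaMalloc3D(pitchedDevPtr, extent);
     System.out.println("Row pitch: " + pitchedDevPtr.pitch + " bytes");
     JCuda.cudaFree(pitchedDevPtr.ptr);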

cudaMalloc3DArray

public static int cudaMalloc3DArray(cudaArray arrayPtr,
                                    cudaChannelFormatDesc desc,
                                    cudaExtent extent)
Calls cudaMalloc3DArray with the default value '0' as the last parameter.

See Also:
cudaMalloc3DArray(cudaArray, cudaChannelFormatDesc, cudaExtent, int)

cudaMalloc3DArray

public static int cudaMalloc3DArray(cudaArray arrayPtr,
                                    cudaChannelFormatDesc desc,
                                    cudaExtent extent,
                                    int flags)
Allocate an array on the device.
cudaError_t cudaMalloc3DArray ( struct cudaArray **  array, const struct cudaChannelFormatDesc *  desc, struct cudaExtent  extent, unsigned int  flags = 0 )

Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

The cudaChannelFormatDesc is defined as:

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };
 
where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

cudaMalloc3DArray() can allocate the following:

  • A 1D array is allocated if the height and depth extents are both zero.
  • A 2D array is allocated if only the depth extent is zero.
  • A 3D array is allocated if all three extents are non-zero.
  • A 1D layered CUDA array is allocated if only the height extent is zero and the cudaArrayLayered flag is set. Each layer is a 1D array. The number of layers is determined by the depth extent.
  • A 2D layered CUDA array is allocated if all three extents are non-zero and the cudaArrayLayered flag is set. Each layer is a 2D array. The number of layers is determined by the depth extent.
  • A cubemap CUDA array is allocated if all three extents are non-zero and the cudaArrayCubemap flag is set. Width must be equal to height, and depth must be six. A cubemap is a special type of 2D layered CUDA array, where the six layers represent the six faces of a cube. The order of the six layers in memory is the same as that listed in cudaGraphicsCubeFace.
  • A cubemap layered CUDA array is allocated if all three extents are non-zero, and both the cudaArrayCubemap and cudaArrayLayered flags are set. Width must be equal to height, and depth must be a multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next six layers form the second cubemap, and so on.

The flags parameter enables different options to be specified that affect the allocation, as follows.

  • cudaArrayDefault: This flag's value is defined to be 0 and provides default array allocation
  • cudaArrayLayered: Allocates a layered CUDA array, with the depth extent indicating the number of layers
  • cudaArrayCubemap: Allocates a cubemap CUDA array. Width must be equal to height, and depth must be six. If the cudaArrayLayered flag is also set, depth must be a multiple of six.
  • cudaArraySurfaceLoadStore: Allocates a CUDA array that could be read from or written to using a surface reference.
  • cudaArrayTextureGather: This flag indicates that texture gather operations will be performed on the CUDA array. Texture gather can only be performed on 2D CUDA arrays.

The width, height and depth extents must meet certain size requirements as listed in the following table. All values are specified in elements.

Note that 2D CUDA arrays have different size requirements if the cudaArrayTextureGather flag is set. In that case, the valid range for (width, height, depth) is ((1,maxTexture2DGather[0]), (1,maxTexture2DGather[1]), 0).

CUDA array type | Valid extents that must always be met {(width range in elements), (height range), (depth range)} | Valid extents with cudaArraySurfaceLoadStore set {(width range in elements), (height range), (depth range)}
1D              | { (1,maxTexture1D), 0, 0 } | { (1,maxSurface1D), 0, 0 }
2D              | { (1,maxTexture2D[0]), (1,maxTexture2D[1]), 0 } | { (1,maxSurface2D[0]), (1,maxSurface2D[1]), 0 }
3D              | { (1,maxTexture3D[0]), (1,maxTexture3D[1]), (1,maxTexture3D[2]) } | { (1,maxSurface3D[0]), (1,maxSurface3D[1]), (1,maxSurface3D[2]) }
1D Layered      | { (1,maxTexture1DLayered[0]), 0, (1,maxTexture1DLayered[1]) } | { (1,maxSurface1DLayered[0]), 0, (1,maxSurface1DLayered[1]) }
2D Layered      | { (1,maxTexture2DLayered[0]), (1,maxTexture2DLayered[1]), (1,maxTexture2DLayered[2]) } | { (1,maxSurface2DLayered[0]), (1,maxSurface2DLayered[1]), (1,maxSurface2DLayered[2]) }
Cubemap         | { (1,maxTextureCubemap), (1,maxTextureCubemap), 6 } | { (1,maxSurfaceCubemap), (1,maxSurfaceCubemap), 6 }
Cubemap Layered | { (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[1]) } | { (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[1]) }

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc(jcuda.Pointer, long), cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaHostAlloc(jcuda.Pointer, long, int)
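
A minimal sketch that allocates a 2D layered array of float elements (assuming the cudaChannelFormatKind constants from jcuda.runtime and public fields on cudaExtent; note that array extents are given in elements, not bytes):

     // 256 x 256 elements, with 4 layers
     cudaChannelFormatDesc desc = JCuda.cudaCreateChannelDesc(
         32, 0, 0, 0, cudaChannelFormatKind.cudaChannelFormatKindFloat);
     cudaExtent extent = new cudaExtent();
     extent.width  = 256;
     extent.height = 256;
     extent.depth  = 4; // number of layers
     cudaArray array = new cudaArray();
     JCuda.cudaMalloc3DArray(array, desc, extent, JCuda.cudaArrayLayered);
     JCuda.cudaFreeArray(array);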

cudaMemset3D

public static int cudaMemset3D(cudaPitchedPtr pitchDevPtr,
                               int value,
                               cudaExtent extent)
Initializes or sets device memory to a value.
cudaError_t cudaMemset3D ( struct cudaPitchedPtr  pitchedDevPtr, int  value, struct cudaExtent  extent )

Initializes each element of a 3D array to the specified value value. The object to initialize is defined by pitchedDevPtr. The pitch field of pitchedDevPtr is the width in memory in bytes of the 3D array pointed to by pitchedDevPtr, including any padding added to the end of each row. The xsize field specifies the logical width of each row in bytes, while the ysize field specifies the height of each 2D slice in rows.

The extents of the initialized region are specified as a width in bytes, a height in rows, and a depth in slices.

Extents with width greater than or equal to the xsize of pitchedDevPtr may perform significantly faster than extents narrower than the xsize. Secondarily, extents with height equal to the ysize of pitchedDevPtr will perform faster than when the height is shorter than the ysize.

This function performs fastest when the pitchedDevPtr has been allocated by cudaMalloc3D().

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset(jcuda.Pointer, int, long), cudaMemset2D(jcuda.Pointer, long, int, long, long), cudaMemsetAsync(jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemset2DAsync(jcuda.Pointer, long, int, long, long, jcuda.runtime.cudaStream_t), cudaMemset3DAsync(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent, jcuda.runtime.cudaStream_t), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent)
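
A minimal sketch that clears a pitched 3D allocation (assuming public struct fields on cudaExtent and cudaPitchedPtr, and jcuda.Sizeof):

     cudaExtent extent = new cudaExtent();
     extent.width  = 64 * Sizeof.FLOAT; // width in bytes
     extent.height = 64;
     extent.depth  = 64;
     cudaPitchedPtr pitchedDevPtr = new cudaPitchedPtr();
     JCuda.cudaMalloc3D(pitchedDevPtr, extent);
     JCuda.cudaMemset3D(pitchedDevPtr, 0, extent); // zero the whole volume
     JCuda.cudaFree(pitchedDevPtr.ptr);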

cudaMemsetAsync

public static int cudaMemsetAsync(Pointer devPtr,
                                  int value,
                                  long count,
                                  cudaStream_t stream)
Initializes or sets device memory to a value.
cudaError_t cudaMemsetAsync ( void *  devPtr, int  value, size_t  count, cudaStream_t  stream = 0 )

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

cudaMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset(jcuda.Pointer, int, long), cudaMemset2D(jcuda.Pointer, long, int, long, long), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemset2DAsync(jcuda.Pointer, long, int, long, long, jcuda.runtime.cudaStream_t), cudaMemset3DAsync(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent, jcuda.runtime.cudaStream_t)
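
A minimal sketch using a dedicated stream (assuming jcuda.Pointer and the stream functions listed in the method summary):

     cudaStream_t stream = new cudaStream_t();
     JCuda.cudaStreamCreate(stream);
     Pointer devPtr = new Pointer();
     JCuda.cudaMalloc(devPtr, 1024);
     JCuda.cudaMemsetAsync(devPtr, 0, 1024, stream); // may return before the memset is done
     JCuda.cudaStreamSynchronize(stream);            // wait for completion
     JCuda.cudaFree(devPtr);
     JCuda.cudaStreamDestroy(stream);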

cudaMemset2DAsync

public static int cudaMemset2DAsync(Pointer devPtr,
                                    long pitch,
                                    int value,
                                    long width,
                                    long height,
                                    cudaStream_t stream)
Initializes or sets device memory to a value.
cudaError_t cudaMemset2DAsync ( void *  devPtr, size_t  pitch, int  value, size_t  width, size_t  height, cudaStream_t  stream = 0 )

Sets to the specified value value a matrix (height rows of width bytes each) pointed to by devPtr. pitch is the width in bytes of the 2D array pointed to by devPtr, including any padding added to the end of each row. This function performs fastest when the pitch is one that has been passed back by cudaMallocPitch().

cudaMemset2DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset(jcuda.Pointer, int, long), cudaMemset2D(jcuda.Pointer, long, int, long, long), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemsetAsync(jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemset3DAsync(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent, jcuda.runtime.cudaStream_t)

cudaMemset3DAsync

public static int cudaMemset3DAsync(cudaPitchedPtr pitchedDevPtr,
                                    int value,
                                    cudaExtent extent,
                                    cudaStream_t stream)
Initializes or sets device memory to a value.
cudaError_t cudaMemset3DAsync ( struct cudaPitchedPtr  pitchedDevPtr, int  value, struct cudaExtent  extent, cudaStream_t  stream = 0 )

Initializes each element of a 3D array to the specified value value. The object to initialize is defined by pitchedDevPtr. The pitch field of pitchedDevPtr is the width in memory in bytes of the 3D array pointed to by pitchedDevPtr, including any padding added to the end of each row. The xsize field specifies the logical width of each row in bytes, while the ysize field specifies the height of each 2D slice in rows.

The extents of the initialized region are specified as a width in bytes, a height in rows, and a depth in slices.

Extents with width greater than or equal to the xsize of pitchedDevPtr may perform significantly faster than extents narrower than the xsize. Secondarily, extents with height equal to the ysize of pitchedDevPtr will perform faster than when the height is shorter than the ysize.

This function performs fastest when the pitchedDevPtr has been allocated by cudaMalloc3D().

cudaMemset3DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
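
As a hedged sketch (assuming the JCuda cudaExtent and cudaPitchedPtr classes expose the corresponding struct members as public fields), a 3D allocation created with cudaMalloc3D() could be cleared asynchronously like this:

    cudaExtent extent = new cudaExtent();
    extent.width = 256;   // width in bytes
    extent.height = 64;   // height in rows
    extent.depth = 16;    // depth in slices

    cudaPitchedPtr pitchedDevPtr = new cudaPitchedPtr();
    JCuda.cudaMalloc3D(pitchedDevPtr, extent);

    cudaStream_t stream = new cudaStream_t();
    JCuda.cudaStreamCreate(stream);
    JCuda.cudaMemset3DAsync(pitchedDevPtr, 0, extent, stream);
    JCuda.cudaStreamSynchronize(stream);

    JCuda.cudaStreamDestroy(stream);
    JCuda.cudaFree(pitchedDevPtr.ptr);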

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset(jcuda.Pointer, int, long), cudaMemset2D(jcuda.Pointer, long, int, long, long), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemsetAsync(jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemset2DAsync(jcuda.Pointer, long, int, long, long, jcuda.runtime.cudaStream_t), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent)

cudaMemcpy3D

public static int cudaMemcpy3D(cudaMemcpy3DParms p)
Copies data between 3D objects.
cudaError_t cudaMemcpy3D ( const struct cudaMemcpy3DParms *  p  ) 

struct cudaExtent {
   size_t width;
   size_t height;
   size_t depth;
 };
 struct cudaExtent make_cudaExtent(size_t w, size_t h, size_t d);

 struct cudaPos {
   size_t x;
   size_t y;
   size_t z;
 };
 struct cudaPos make_cudaPos(size_t x, size_t y, size_t z);

 struct cudaMemcpy3DParms {
   struct cudaArray     *srcArray;
   struct cudaPos        srcPos;
   struct cudaPitchedPtr srcPtr;
   struct cudaArray     *dstArray;
   struct cudaPos        dstPos;
   struct cudaPitchedPtr dstPtr;
   struct cudaExtent     extent;
   enum cudaMemcpyKind   kind;
 };
 

cudaMemcpy3D() copies data between two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed are specified by the cudaMemcpy3DParms struct, which should be initialized to zero before use:

cudaMemcpy3DParms myParms = {0};
 

The struct passed to cudaMemcpy3D() must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Passing more than one non-zero source or destination will cause cudaMemcpy3D() to return an error.

The srcPos and dstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to be unsigned char. For CUDA arrays, positions must be in the range [0, 2048) for any dimension.

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.

The kind field defines the direction of the copy. It must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

If the source and destination are both arrays, cudaMemcpy3D() will return an error if they do not have the same element size.

The source and destination object may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.

The source object must lie entirely within the region defined by srcPos and extent. The destination object must lie entirely within the region defined by dstPos and extent.

cudaMemcpy3D() returns an error if the pitch of srcPtr or dstPtr exceeds the maximum allowed. The pitch of a cudaPitchedPtr allocated with cudaMalloc3D() will always be valid.
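
The following is a hedged sketch of a host-to-device 3D copy in JCuda, assuming pitchedDevPtr and extent come from a prior cudaMalloc3D() call, hostData is a byte array holding width*height*depth densely packed bytes, and the cudaMemcpy3DParms members are exposed as public fields:

    cudaMemcpy3DParms params = new cudaMemcpy3DParms();

    // Source: a dense host buffer wrapped as a pitched pointer (no row padding)
    cudaPitchedPtr srcPtr = new cudaPitchedPtr();
    srcPtr.ptr = Pointer.to(hostData);
    srcPtr.pitch = extent.width;   // row width in bytes
    srcPtr.xsize = extent.width;
    srcPtr.ysize = extent.height;
    params.srcPtr = srcPtr;
    params.srcPos = new cudaPos(); // offsets default to (0,0,0)

    // Destination: the pitched device allocation from cudaMalloc3D()
    params.dstPtr = pitchedDevPtr;
    params.dstPos = new cudaPos();

    params.extent = extent;        // width in bytes, height in rows, depth in slices
    params.kind = cudaMemcpyKind.cudaMemcpyHostToDevice;

    JCuda.cudaMemcpy3D(params);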

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemcpy3DAsync(jcuda.runtime.cudaMemcpy3DParms, jcuda.runtime.cudaStream_t), cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy3DPeer

public static int cudaMemcpy3DPeer(cudaMemcpy3DPeerParms p)
Copies memory between devices.
cudaError_t cudaMemcpy3DPeer ( const struct cudaMemcpy3DPeerParms *  p  ) 

Perform a 3D memory copy according to the parameters specified in p. See the definition of the cudaMemcpy3DPeerParms structure for documentation of its parameters.

Note that this function is synchronous with respect to the host only if the source or destination of the transfer is host memory. Note also that this copy is serialized with respect to all pending and future asynchronous work in the current device, the copy's source device, and the copy's destination device (use cudaMemcpy3DPeerAsync to avoid this synchronization).

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpyPeer(jcuda.Pointer, int, jcuda.Pointer, int, long), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyPeerAsync(jcuda.Pointer, int, jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemcpy3DPeerAsync(jcuda.runtime.cudaMemcpy3DPeerParms, jcuda.runtime.cudaStream_t)

cudaMemcpy3DAsync

public static int cudaMemcpy3DAsync(cudaMemcpy3DParms p,
                                    cudaStream_t stream)
Copies data between 3D objects.
cudaError_t cudaMemcpy3DAsync ( const struct cudaMemcpy3DParms *  p,
cudaStream_t  stream = 0  
)

struct cudaExtent {
   size_t width;
   size_t height;
   size_t depth;
 };
 struct cudaExtent make_cudaExtent(size_t w, size_t h, size_t d);

 struct cudaPos {
   size_t x;
   size_t y;
   size_t z;
 };
 struct cudaPos make_cudaPos(size_t x, size_t y, size_t z);

 struct cudaMemcpy3DParms {
   struct cudaArray     *srcArray;
   struct cudaPos        srcPos;
   struct cudaPitchedPtr srcPtr;
   struct cudaArray     *dstArray;
   struct cudaPos        dstPos;
   struct cudaPitchedPtr dstPtr;
   struct cudaExtent     extent;
   enum cudaMemcpyKind   kind;
 };
 

cudaMemcpy3DAsync() copies data between two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed are specified by the cudaMemcpy3DParms struct, which should be initialized to zero before use:

cudaMemcpy3DParms myParms = {0};
 

The struct passed to cudaMemcpy3DAsync() must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Passing more than one non-zero source or destination will cause cudaMemcpy3DAsync() to return an error.

The srcPos and dstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to be unsigned char. For CUDA arrays, positions must be in the range [0, 2048) for any dimension.

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.

The kind field defines the direction of the copy. It must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

If the source and destination are both arrays, cudaMemcpy3DAsync() will return an error if they do not have the same element size.

The source and destination object may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.

The source object must lie entirely within the region defined by srcPos and extent. The destination object must lie entirely within the region defined by dstPos and extent.

cudaMemcpy3DAsync() returns an error if the pitch of srcPtr or dstPtr exceeds the maximum allowed. The pitch of a cudaPitchedPtr allocated with cudaMalloc3D() will always be valid.

cudaMemcpy3DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemcpy3D(jcuda.runtime.cudaMemcpy3DParms), cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy3DPeerAsync

public static int cudaMemcpy3DPeerAsync(cudaMemcpy3DPeerParms p,
                                        cudaStream_t stream)
Copies memory between devices asynchronously.
cudaError_t cudaMemcpy3DPeerAsync ( const struct cudaMemcpy3DPeerParms *  p,
cudaStream_t  stream = 0  
)

Perform a 3D memory copy according to the parameters specified in p. See the definition of the cudaMemcpy3DPeerParms structure for documentation of its parameters.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpyPeer(jcuda.Pointer, int, jcuda.Pointer, int, long), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyPeerAsync(jcuda.Pointer, int, jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemcpy3DPeer(jcuda.runtime.cudaMemcpy3DPeerParms)

cudaMemGetInfo

public static int cudaMemGetInfo(long[] free,
                                 long[] total)
Gets free and total device memory.

Returns in *free and *total respectively, the free and total amount of memory available for allocation by the device in bytes.
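
For example, the current device's memory statistics can be queried like this:

    long[] free = new long[1];
    long[] total = new long[1];
    JCuda.cudaMemGetInfo(free, total);
    System.out.println("Free: " + free[0] + " of " + total[0] + " bytes");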

Parameters:
free - Returned free memory in bytes
total - Returned total memory in bytes
Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorLaunchFailure

cudaArrayGetInfo

public static int cudaArrayGetInfo(cudaChannelFormatDesc desc,
                                   cudaExtent extent,
                                   int[] flags,
                                   cudaArray array)
Gets info about the specified cudaArray.

Returns in *desc, *extent and *flags respectively, the type, shape and flags of array.
Any of *desc, *extent and *flags may be specified as NULL.
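
A brief sketch, assuming array is a cudaArray that was previously created with cudaMallocArray() or cudaMalloc3DArray():

    cudaChannelFormatDesc desc = new cudaChannelFormatDesc();
    cudaExtent extent = new cudaExtent();
    int[] flags = new int[1];
    JCuda.cudaArrayGetInfo(desc, extent, flags, array);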

Parameters:
desc - Returned array type
extent - Returned array shape. 2D arrays will have depth of zero
flags - Returned array flags
array - The cudaArray to get info for
Returns:
cudaSuccess, cudaErrorInvalidValue

cudaHostAlloc

public static int cudaHostAlloc(Pointer ptr,
                                long size,
                                int flags)
Allocates page-locked memory on the host.
cudaError_t cudaHostAlloc ( void **  pHost,
size_t  size,
unsigned int  flags  
)

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

The flags parameter enables different options to be specified that affect the allocation, as follows.

  • cudaHostAllocDefault: This flag's value is defined to be 0 and causes cudaHostAlloc() to emulate cudaMallocHost().
  • cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
  • cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
  • cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.

All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.

cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.

The cudaHostAllocMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostAllocPortable flag.

Memory allocated by this function must be freed with cudaFreeHost().
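
A hedged sketch of allocating mapped, page-locked memory and obtaining the corresponding device pointer (error checking omitted; Sizeof is the jcuda.Sizeof helper class):

    // Must be set before the runtime creates its context for the device
    JCuda.cudaSetDeviceFlags(JCuda.cudaDeviceMapHost);

    long size = 1024 * Sizeof.FLOAT;
    Pointer hostPtr = new Pointer();
    JCuda.cudaHostAlloc(hostPtr, size, JCuda.cudaHostAllocMapped);

    // Device-side view of the same page-locked allocation
    Pointer devPtr = new Pointer();
    JCuda.cudaHostGetDevicePointer(devPtr, hostPtr, 0);

    // ... use devPtr in kernels or memory copies ...

    JCuda.cudaFreeHost(hostPtr);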

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaSetDeviceFlags(int), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer)

cudaHostRegister

public static int cudaHostRegister(Pointer ptr,
                                   long size,
                                   int flags)
Registers an existing host memory range for use by CUDA.
cudaError_t cudaHostRegister ( void *  ptr,
size_t  size,
unsigned int  flags  
)

Page-locks the memory range specified by ptr and size and maps it for the device(s) as specified by flags. This memory range also is added to the same tracking mechanism as cudaHostAlloc() to automatically accelerate calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory that has not been registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.

The flags parameter enables different options to be specified that affect the allocation, as follows.

  • cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.

  • cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer(). This feature is available only on GPUs with compute capability greater than or equal to 1.1.

All of these flags are orthogonal to one another: a developer may page-lock memory that is portable or mapped with no restrictions.

cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostRegisterMapped flag to have any effect.

The cudaHostRegisterMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostRegisterPortable flag.

The pointer ptr and size size must be aligned to the host page size (4 KB).

The memory page-locked by this function must be unregistered with cudaHostUnregister().

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorMemoryAllocation
See Also:
cudaHostUnregister(jcuda.Pointer), cudaHostGetDevicePointer(jcuda.Pointer, jcuda.Pointer, int)

cudaHostUnregister

public static int cudaHostUnregister(Pointer ptr)
Unregisters a memory range that was registered with cudaHostRegister().
cudaError_t cudaHostUnregister ( void *  ptr  ) 

Unmaps the memory range whose base address is specified by ptr, and makes it pageable again.

The base address must be the same one specified to cudaHostRegister().

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaHostRegister(jcuda.Pointer, long, int)

cudaHostGetDevicePointer

public static int cudaHostGetDevicePointer(Pointer pDevice,
                                           Pointer pHost,
                                           int flags)
Passes back device pointer of mapped host memory allocated by cudaHostAlloc() or registered by cudaHostRegister().
cudaError_t cudaHostGetDevicePointer ( void **  pDevice,
void *  pHost,
unsigned int  flags  
)

Passes back the device pointer corresponding to the mapped, pinned host buffer allocated by cudaHostAlloc() or registered by cudaHostRegister().

cudaHostGetDevicePointer() will fail if the cudaDeviceMapHost flag was not specified before deferred context creation occurred, or if called on a device that does not support mapped, pinned memory.

flags is reserved for future releases. For now, it must be set to 0.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorMemoryAllocation
See Also:
cudaSetDeviceFlags(int), cudaHostAlloc(jcuda.Pointer, long, int)

cudaMalloc

public static int cudaMalloc(Pointer devPtr,
                             long size)
Allocate memory on the device.
cudaError_t cudaMalloc ( void **  devPtr,
size_t  size  
)

Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure.
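
A minimal allocation/release sketch:

    int numElements = 1024;
    Pointer devPtr = new Pointer();
    JCuda.cudaMalloc(devPtr, numElements * Sizeof.FLOAT);

    // ... use the allocation ...

    JCuda.cudaFree(devPtr);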

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaFreeArray(jcuda.runtime.cudaArray), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaHostAlloc(jcuda.Pointer, long, int)

cudaMallocHost

public static int cudaMallocHost(Pointer ptr,
                                 long size)
Allocates page-locked memory on the host.
cudaError_t cudaMallocHost ( void **  ptr,
size_t  size,
unsigned int  flags  
)

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

The flags parameter enables different options to be specified that affect the allocation, as follows.

  • cudaHostAllocDefault: This flag's value is defined to be 0.
  • cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
  • cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
  • cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.

All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.

cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.

The cudaHostAllocMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostAllocPortable flag.

Memory allocated by this function must be freed with cudaFreeHost().

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaSetDeviceFlags(int), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaHostAlloc(jcuda.Pointer, long, int)

cudaMallocPitch

public static int cudaMallocPitch(Pointer devPtr,
                                  long[] pitch,
                                  long width,
                                  long height)
Allocates pitched memory on the device.
cudaError_t cudaMallocPitch ( void **  devPtr,
size_t *  pitch,
size_t  width,
size_t  height  
)

Allocates at least width (in bytes) * height bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The function may pad the allocation to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing as the address is updated from row to row. The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as:

    T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
 

For allocations of 2D arrays, it is recommended that programmers consider performing pitch allocations using cudaMallocPitch(). Due to pitch alignment restrictions in the hardware, this is especially true if the application will be performing 2D memory copies between different regions of device memory (whether linear memory or CUDA arrays).
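
A hedged sketch combining a pitched allocation with a 2D host-to-device copy of a densely packed host matrix:

    int widthElements = 256;
    int height = 128;
    long widthBytes = widthElements * Sizeof.FLOAT;

    Pointer devPtr = new Pointer();
    long[] pitch = new long[1];
    JCuda.cudaMallocPitch(devPtr, pitch, widthBytes, height);

    // Copy the dense host matrix into the pitched device allocation
    float[] hostData = new float[widthElements * height];
    JCuda.cudaMemcpy2D(devPtr, pitch[0],
        Pointer.to(hostData), widthBytes,
        widthBytes, height,
        cudaMemcpyKind.cudaMemcpyHostToDevice);

    JCuda.cudaFree(devPtr);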

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaMalloc(jcuda.Pointer, long), cudaFree(jcuda.Pointer), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaHostAlloc(jcuda.Pointer, long, int)

cudaMallocArray

public static int cudaMallocArray(cudaArray array,
                                  cudaChannelFormatDesc desc,
                                  long width,
                                  long height)
Calls cudaMallocArray with the default value '0' as the last parameter.

See Also:
cudaMallocArray(cudaArray, cudaChannelFormatDesc, long, long, int)

cudaMallocArray

public static int cudaMallocArray(cudaArray array,
                                  cudaChannelFormatDesc desc,
                                  long width,
                                  long height,
                                  int flags)
Allocate an array on the device.
cudaError_t cudaMallocArray ( struct cudaArray **  array,
const struct cudaChannelFormatDesc *  desc,
size_t  width,
size_t  height = 0,
unsigned int  flags = 0  
)

Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

The cudaChannelFormatDesc is defined as:

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };
 
where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

The flags parameter enables different options to be specified that affect the allocation, as follows.

  • cudaArrayDefault: This flag's value is defined to be 0 and provides default array allocation
  • cudaArraySurfaceLoadStore: Allocates an array that can be read from or written to using a surface reference
  • cudaArrayTextureGather: This flag indicates that texture gather operations will be performed on the array.

width and height must meet certain size requirements. See cudaMalloc3DArray() for more details.
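
A hedged sketch of allocating a 512x512 array of single float components, assuming the cudaChannelFormatDesc members are exposed as public fields and cudaChannelFormatKind provides the kind constants:

    cudaChannelFormatDesc desc = new cudaChannelFormatDesc();
    desc.x = 32;  // 32 bits in the x component
    desc.y = 0;
    desc.z = 0;
    desc.w = 0;
    desc.f = cudaChannelFormatKind.cudaChannelFormatKindFloat;

    cudaArray array = new cudaArray();
    JCuda.cudaMallocArray(array, desc, 512, 512);

    // ... fill the array, e.g. with cudaMemcpyToArray() ...

    JCuda.cudaFreeArray(array);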

Returns:
cudaSuccess, cudaErrorMemoryAllocation
See Also:
cudaMalloc(jcuda.Pointer, long), cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaHostAlloc(jcuda.Pointer, long, int)

cudaFree

public static int cudaFree(Pointer devPtr)
Frees memory on the device.
cudaError_t cudaFree ( void *  devPtr  ) 

Frees the memory space pointed to by devPtr, which must have been returned by a previous call to cudaMalloc() or cudaMallocPitch(). Otherwise, or if cudaFree(devPtr) has already been called before, an error is returned. If devPtr is 0, no operation is performed. cudaFree() returns cudaErrorInvalidDevicePointer in case of failure.

Returns:
cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorInitializationError
See Also:
cudaMalloc(jcuda.Pointer, long), cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaHostAlloc(jcuda.Pointer, long, int)

cudaFreeHost

public static int cudaFreeHost(Pointer ptr)
Frees page-locked memory.
cudaError_t cudaFreeHost ( void *  ptr  ) 

Frees the memory space pointed to by ptr, which must have been returned by a previous call to cudaMallocHost() or cudaHostAlloc().

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaMalloc(jcuda.Pointer, long), cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaFreeArray(jcuda.runtime.cudaArray), cudaMallocHost(jcuda.Pointer, long), cudaMalloc3D(jcuda.runtime.cudaPitchedPtr, jcuda.runtime.cudaExtent), cudaMalloc3DArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaExtent), cudaHostAlloc(jcuda.Pointer, long, int)

cudaFreeArray

public static int cudaFreeArray(cudaArray array)
Frees an array on the device.
cudaError_t cudaFreeArray ( struct cudaArray *  array  ) 

Frees the CUDA array array, which must have been returned by a previous call to cudaMallocArray(). If cudaFreeArray(array) has already been called before, cudaErrorInvalidValue is returned. If array is 0, no operation is performed.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInitializationError
See Also:
cudaMalloc(jcuda.Pointer, long), cudaMallocPitch(jcuda.Pointer, long[], long, long), cudaFree(jcuda.Pointer), cudaMallocArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc, long, long), cudaMallocHost(jcuda.Pointer, long), cudaFreeHost(jcuda.Pointer), cudaHostAlloc(jcuda.Pointer, long, int)

cudaMemcpy

public static int cudaMemcpy(Pointer dst,
                             Pointer src,
                             long count,
                             int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpy ( void *  dst,
const void *  src,
size_t  count,
enum cudaMemcpyKind  kind  
)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. The memory areas may not overlap. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
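
A round-trip sketch, copying a host array to a device allocation and back (devPtr from cudaMalloc(), error checking omitted):

    float[] hostData = new float[1024];
    long size = hostData.length * Sizeof.FLOAT;

    JCuda.cudaMemcpy(devPtr, Pointer.to(hostData), size,
        cudaMemcpyKind.cudaMemcpyHostToDevice);

    // ... launch kernels that operate on devPtr ...

    JCuda.cudaMemcpy(Pointer.to(hostData), devPtr, size,
        cudaMemcpyKind.cudaMemcpyDeviceToHost);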

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyPeer

public static int cudaMemcpyPeer(Pointer dst,
                                 int dstDevice,
                                 Pointer src,
                                 int srcDevice,
                                 long count)
Copies memory between two devices.
cudaError_t cudaMemcpyPeer ( void *  dst,
int  dstDevice,
const void *  src,
int  srcDevice,
size_t  count  
)

Copies memory from one device to memory on another device. dst is the base device pointer of the destination memory and dstDevice is the destination device. src is the base device pointer of the source memory and srcDevice is the source device. count specifies the number of bytes to copy.

Note that this function is asynchronous with respect to the host, but serialized with respect to all pending and future asynchronous work in the current device, srcDevice, and dstDevice (use cudaMemcpyPeerAsync to avoid this synchronization).
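
A hedged sketch, assuming at least two CUDA devices are present; it transfers the contents of an allocation on device 1 into an allocation on device 0:

    long size = 1024 * Sizeof.FLOAT;

    JCuda.cudaSetDevice(0);
    Pointer dst = new Pointer();
    JCuda.cudaMalloc(dst, size);

    JCuda.cudaSetDevice(1);
    Pointer src = new Pointer();
    JCuda.cudaMalloc(src, size);

    // Copy 'size' bytes from the allocation on device 1 to the one on device 0
    JCuda.cudaMemcpyPeer(dst, 0, src, 1, size);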

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy3DPeer(jcuda.runtime.cudaMemcpy3DPeerParms), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyPeerAsync(jcuda.Pointer, int, jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemcpy3DPeerAsync(jcuda.runtime.cudaMemcpy3DPeerParms, jcuda.runtime.cudaStream_t)

cudaMemcpyToArray

public static int cudaMemcpyToArray(cudaArray dst,
                                    long wOffset,
                                    long hOffset,
                                    Pointer src,
                                    long count,
                                    int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpyToArray ( struct cudaArray *  dst,
size_t  wOffset,
size_t  hOffset,
const void *  src,
size_t  count,
enum cudaMemcpyKind  kind  
)

Copies count bytes from the memory area pointed to by src to the CUDA array dst starting at the upper left corner (wOffset, hOffset), where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyFromArray

public static int cudaMemcpyFromArray(Pointer dst,
                                      cudaArray src,
                                      long wOffset,
                                      long hOffset,
                                      long count,
                                      int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpyFromArray ( void *  dst,
const struct cudaArray *  src,
size_t  wOffset,
size_t  hOffset,
size_t  count,
enum cudaMemcpyKind  kind  
)

Copies count bytes from the CUDA array src starting at the upper left corner (wOffset, hOffset) to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyArrayToArray

public static int cudaMemcpyArrayToArray(cudaArray dst,
                                         long wOffsetDst,
                                         long hOffsetDst,
                                         cudaArray src,
                                         long wOffsetSrc,
                                         long hOffsetSrc,
                                         long count,
                                         int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpyArrayToArray ( struct cudaArray *  dst,
size_t  wOffsetDst,
size_t  hOffsetDst,
const struct cudaArray *  src,
size_t  wOffsetSrc,
size_t  hOffsetSrc,
size_t  count,
enum cudaMemcpyKind  kind = cudaMemcpyDeviceToDevice  
)

Copies count bytes from the CUDA array src starting at the upper left corner (wOffsetSrc, hOffsetSrc) to the CUDA array dst starting at the upper left corner (wOffsetDst, hOffsetDst) where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2D

public static int cudaMemcpy2D(Pointer dst,
                               long dpitch,
                               Pointer src,
                               long spitch,
                               long width,
                               long height,
                               int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpy2D ( void *  dst,
size_t  dpitch,
const void *  src,
size_t  spitch,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind  
)

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. dpitch and spitch are the widths in memory in bytes of the 2D arrays pointed to by dst and src, including any padding added to the end of each row. The memory areas may not overlap. width must not exceed either dpitch or spitch. Calling cudaMemcpy2D() with dst and src pointers that do not match the direction of the copy results in an undefined behavior. cudaMemcpy2D() returns an error if dpitch or spitch exceeds the maximum allowed.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidPitchValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2DToArray

public static int cudaMemcpy2DToArray(cudaArray dst,
                                      long wOffset,
                                      long hOffset,
                                      Pointer src,
                                      long spitch,
                                      long width,
                                      long height,
                                      int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpy2DToArray ( struct cudaArray *  dst,
size_t  wOffset,
size_t  hOffset,
const void *  src,
size_t  spitch,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind  
)

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the CUDA array dst starting at the upper left corner (wOffset, hOffset) where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. spitch is the width in memory in bytes of the 2D array pointed to by src, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array dst. width must not exceed spitch. cudaMemcpy2DToArray() returns an error if spitch exceeds the maximum allowed.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2DFromArray

public static int cudaMemcpy2DFromArray(Pointer dst,
                                        long dpitch,
                                        cudaArray src,
                                        long wOffset,
                                        long hOffset,
                                        long width,
                                        long height,
                                        int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpy2DFromArray ( void *  dst,
size_t  dpitch,
const struct cudaArray *  src,
size_t  wOffset,
size_t  hOffset,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind  
)

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at the upper left corner (wOffset, hOffset) to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. dpitch is the width in memory in bytes of the 2D array pointed to by dst, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array src. width must not exceed dpitch. cudaMemcpy2DFromArray() returns an error if dpitch exceeds the maximum allowed.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2DArrayToArray

public static int cudaMemcpy2DArrayToArray(cudaArray dst,
                                           long wOffsetDst,
                                           long hOffsetDst,
                                           cudaArray src,
                                           long wOffsetSrc,
                                           long hOffsetSrc,
                                           long width,
                                           long height,
                                           int cudaMemcpyKind_kind)
Copies data between host and device.
cudaError_t cudaMemcpy2DArrayToArray ( struct cudaArray *  dst,
size_t  wOffsetDst,
size_t  hOffsetDst,
const struct cudaArray *  src,
size_t  wOffsetSrc,
size_t  hOffsetSrc,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind = cudaMemcpyDeviceToDevice  
)

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at the upper left corner (wOffsetSrc, hOffsetSrc) to the CUDA array dst starting at the upper left corner (wOffsetDst, hOffsetDst), where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. wOffsetDst + width must not exceed the width of the CUDA array dst. wOffsetSrc + width must not exceed the width of the CUDA array src.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyToSymbol

public static int cudaMemcpyToSymbol(java.lang.String symbol,
                                     Pointer src,
                                     long count,
                                     long offset,
                                     int cudaMemcpyKind_kind)
Copies data to the given symbol on the device.
cudaError_t cudaMemcpyToSymbol ( const char *  symbol,
const void *  src,
size_t  count,
size_t  offset = 0,
enum cudaMemcpyKind  kind = cudaMemcpyHostToDevice  
)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by offset bytes from the start of symbol symbol. The memory areas may not overlap. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. kind can be either cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice.
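
A hedged sketch, where "constData" is a hypothetical __constant__ array defined in the device code of the application:

    float[] hostData = new float[256];
    long size = hostData.length * Sizeof.FLOAT;

    // Copy host data to the (hypothetical) __constant__ symbol named "constData"
    JCuda.cudaMemcpyToSymbol("constData", Pointer.to(hostData), size, 0,
        cudaMemcpyKind.cudaMemcpyHostToDevice);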

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSymbol, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyFromSymbol

public static int cudaMemcpyFromSymbol(Pointer dst,
                                       java.lang.String symbol,
                                       long count,
                                       long offset,
                                       int cudaMemcpyKind_kind)
Copies data from the given symbol on the device.
cudaError_t cudaMemcpyFromSymbol ( void *  dst,
const char *  symbol,
size_t  count,
size_t  offset = 0,
enum cudaMemcpyKind  kind = cudaMemcpyDeviceToHost  
)

Copies count bytes from the memory area pointed to by offset bytes from the start of symbol symbol to the memory area pointed to by dst. The memory areas may not overlap. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. kind can be either cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSymbol, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)
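
As a hedged illustration of the two symbol copies above: the sketch below assumes a device-code variable (e.g. a __constant__ float array of at least 256 elements) named "constantData" that is known to the CUDA runtime; the symbol name, class name and sizes are purely illustrative.

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import static jcuda.runtime.JCuda.*;
  import static jcuda.runtime.cudaMemcpyKind.*;

  public class SymbolCopySketch
  {
      public static void copyRoundTrip()
      {
          float[] hostData = new float[256];
          for (int i = 0; i < hostData.length; i++) hostData[i] = i;
          long bytes = hostData.length * Sizeof.FLOAT;

          // Host -> device symbol ("constantData" is a hypothetical symbol name)
          cudaMemcpyToSymbol("constantData", Pointer.to(hostData),
              bytes, 0, cudaMemcpyHostToDevice);

          // Device symbol -> host
          float[] readBack = new float[256];
          cudaMemcpyFromSymbol(Pointer.to(readBack), "constantData",
              bytes, 0, cudaMemcpyDeviceToHost);
      }
  }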

cudaMemcpyAsync

public static int cudaMemcpyAsync(Pointer dst,
                                  Pointer src,
                                  long count,
                                  int cudaMemcpyKind_kind,
                                  cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpyAsync ( void *  dst,
const void *  src,
size_t  count,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. The memory areas may not overlap. Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in undefined behavior.

cudaMemcpyAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)
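
A minimal sketch of the pattern described above: an asynchronous device-to-host copy on a user-created stream, using page-locked host memory from cudaHostAlloc (sizes and names are illustrative).

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;
  import static jcuda.runtime.cudaMemcpyKind.*;

  public class MemcpyAsyncSketch
  {
      public static void main(String[] args)
      {
          long bytes = 1024 * Sizeof.FLOAT;

          // Device buffer, filled with a constant byte pattern
          Pointer deviceData = new Pointer();
          cudaMalloc(deviceData, bytes);
          cudaMemset(deviceData, 0, bytes);

          // Page-locked host buffer (pageable memory would cause an error here)
          Pointer hostData = new Pointer();
          cudaHostAlloc(hostData, bytes, cudaHostAllocDefault);

          // Enqueue the copy on a dedicated stream and wait for it to finish
          cudaStream_t stream = new cudaStream_t();
          cudaStreamCreate(stream);
          cudaMemcpyAsync(hostData, deviceData, bytes,
              cudaMemcpyDeviceToHost, stream);
          cudaStreamSynchronize(stream);

          cudaStreamDestroy(stream);
          cudaFreeHost(hostData);
          cudaFree(deviceData);
      }
  }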

cudaMemcpyPeerAsync

public static int cudaMemcpyPeerAsync(Pointer dst,
                                      int dstDevice,
                                      Pointer src,
                                      int srcDevice,
                                      long count,
                                      cudaStream_t stream)
Copies memory between two devices asynchronously.
cudaError_t cudaMemcpyPeerAsync ( void *  dst,
int  dstDevice,
const void *  src,
int  srcDevice,
size_t  count,
cudaStream_t  stream = 0  
)

Copies memory from one device to memory on another device. dst is the base device pointer of the destination memory and dstDevice is the destination device. src is the base device pointer of the source memory and srcDevice is the source device. count specifies the number of bytes to copy.

Note that this function is asynchronous with respect to the host and all work in other streams and other devices.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpyPeer(jcuda.Pointer, int, jcuda.Pointer, int, long), cudaMemcpy3DPeer(jcuda.runtime.cudaMemcpy3DPeerParms), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy3DPeerAsync(jcuda.runtime.cudaMemcpy3DPeerParms, jcuda.runtime.cudaStream_t)
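
A minimal sketch, assuming at least two CUDA devices are installed; buffer sizes and names are illustrative. Depending on the hardware, cudaDeviceEnablePeerAccess may enable a direct copy path, but it is not required for cudaMemcpyPeerAsync.

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;

  public class PeerCopySketch
  {
      public static void main(String[] args)
      {
          long bytes = 1024 * Sizeof.FLOAT;

          // One buffer on device 0 ...
          cudaSetDevice(0);
          Pointer src = new Pointer();
          cudaMalloc(src, bytes);

          // ... and one on device 1 (assumes a second device exists)
          cudaSetDevice(1);
          Pointer dst = new Pointer();
          cudaMalloc(dst, bytes);

          // Copy device 0 -> device 1 on a stream of the current device (1)
          cudaStream_t stream = new cudaStream_t();
          cudaStreamCreate(stream);
          cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
          cudaStreamSynchronize(stream);

          cudaStreamDestroy(stream);
          cudaFree(dst);
          cudaSetDevice(0);
          cudaFree(src);
      }
  }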

cudaMemcpyToArrayAsync

public static int cudaMemcpyToArrayAsync(cudaArray dst,
                                         long wOffset,
                                         long hOffset,
                                         Pointer src,
                                         long count,
                                         int cudaMemcpyKind_kind,
                                         cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpyToArrayAsync ( struct cudaArray *  dst,
size_t  wOffset,
size_t  hOffset,
const void *  src,
size_t  count,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies count bytes from the memory area pointed to by src to the CUDA array dst starting at the upper left corner (wOffset, hOffset), where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

cudaMemcpyToArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyFromArrayAsync

public static int cudaMemcpyFromArrayAsync(Pointer dst,
                                           cudaArray src,
                                           long wOffset,
                                           long hOffset,
                                           long count,
                                           int cudaMemcpyKind_kind,
                                           cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpyFromArrayAsync ( void *  dst,
const struct cudaArray *  src,
size_t  wOffset,
size_t  hOffset,
size_t  count,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies count bytes from the CUDA array src starting at the upper left corner (wOffset, hOffset) to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

cudaMemcpyFromArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2DAsync

public static int cudaMemcpy2DAsync(Pointer dst,
                                    long dpitch,
                                    Pointer src,
                                    long spitch,
                                    long width,
                                    long height,
                                    int cudaMemcpyKind_kind,
                                    cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpy2DAsync ( void *  dst,
size_t  dpitch,
const void *  src,
size_t  spitch,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. dpitch and spitch are the widths in memory in bytes of the 2D arrays pointed to by dst and src, including any padding added to the end of each row. The memory areas may not overlap. width must not exceed either dpitch or spitch. Calling cudaMemcpy2DAsync() with dst and src pointers that do not match the direction of the copy results in undefined behavior. cudaMemcpy2DAsync() returns an error if dpitch or spitch is greater than the maximum allowed.

cudaMemcpy2DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidPitchValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)
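
A minimal sketch of a pitched, asynchronous host-to-device 2D copy: the device pitch comes from cudaMallocPitch, the host rows are tightly packed (spitch equals the row width in bytes), and the host buffer is page-locked as required above. The dimensions are illustrative.

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;
  import static jcuda.runtime.cudaMemcpyKind.*;

  public class Memcpy2DAsyncSketch
  {
      public static void main(String[] args)
      {
          long width = 640 * Sizeof.FLOAT; // row width in bytes
          long height = 480;               // number of rows

          // Pitched device allocation; pitch[0] receives the padded row size
          Pointer deviceData = new Pointer();
          long[] pitch = new long[1];
          cudaMallocPitch(deviceData, pitch, width, height);

          // Page-locked host buffer with tightly packed rows (spitch == width)
          Pointer hostData = new Pointer();
          cudaHostAlloc(hostData, width * height, cudaHostAllocDefault);

          cudaStream_t stream = new cudaStream_t();
          cudaStreamCreate(stream);
          cudaMemcpy2DAsync(deviceData, pitch[0], hostData, width,
              width, height, cudaMemcpyHostToDevice, stream);
          cudaStreamSynchronize(stream);

          cudaStreamDestroy(stream);
          cudaFreeHost(hostData);
          cudaFree(deviceData);
      }
  }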

cudaMemcpy2DToArrayAsync

public static int cudaMemcpy2DToArrayAsync(cudaArray dst,
                                           long wOffset,
                                           long hOffset,
                                           Pointer src,
                                           long spitch,
                                           long width,
                                           long height,
                                           int cudaMemcpyKind_kind,
                                           cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpy2DToArrayAsync ( struct cudaArray *  dst,
size_t  wOffset,
size_t  hOffset,
const void *  src,
size_t  spitch,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the CUDA array dst starting at the upper left corner (wOffset, hOffset) where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. spitch is the width in memory in bytes of the 2D array pointed to by src, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array dst. width must not exceed spitch. cudaMemcpy2DToArrayAsync() returns an error if spitch exceeds the maximum allowed.

cudaMemcpy2DToArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpy2DFromArrayAsync

public static int cudaMemcpy2DFromArrayAsync(Pointer dst,
                                             long dpitch,
                                             cudaArray src,
                                             long wOffset,
                                             long hOffset,
                                             long width,
                                             long height,
                                             int cudaMemcpyKind_kind,
                                             cudaStream_t stream)
Copies data between host and device.
cudaError_t cudaMemcpy2DFromArrayAsync ( void *  dst,
size_t  dpitch,
const struct cudaArray *  src,
size_t  wOffset,
size_t  hOffset,
size_t  width,
size_t  height,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at the upper left corner (wOffset, hOffset) to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. dpitch is the width in memory in bytes of the 2D array pointed to by dst, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array src. width must not exceed dpitch. cudaMemcpy2DFromArrayAsync() returns an error if dpitch exceeds the maximum allowed.

cudaMemcpy2DFromArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyToSymbolAsync

public static int cudaMemcpyToSymbolAsync(java.lang.String symbol,
                                          Pointer src,
                                          long count,
                                          long offset,
                                          int cudaMemcpyKind_kind,
                                          cudaStream_t stream)
Copies data to the given symbol on the device.
cudaError_t cudaMemcpyToSymbolAsync ( const char *  symbol,
const void *  src,
size_t  count,
size_t  offset,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by offset bytes from the start of symbol symbol. The memory areas may not overlap. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. kind can be either cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice.

cudaMemcpyToSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSymbol, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromSymbolAsync(jcuda.Pointer, java.lang.String, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemcpyFromSymbolAsync

public static int cudaMemcpyFromSymbolAsync(Pointer dst,
                                            java.lang.String symbol,
                                            long count,
                                            long offset,
                                            int cudaMemcpyKind_kind,
                                            cudaStream_t stream)
Copies data from the given symbol on the device.
cudaError_t cudaMemcpyFromSymbolAsync ( void *  dst,
const char *  symbol,
size_t  count,
size_t  offset,
enum cudaMemcpyKind  kind,
cudaStream_t  stream = 0  
)

Copies count bytes from the memory area pointed to by offset bytes from the start of symbol symbol to the memory area pointed to by dst. The memory areas may not overlap. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. kind can be either cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice.

cudaMemcpyFromSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSymbol, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
See Also:
cudaMemcpy(jcuda.Pointer, jcuda.Pointer, long, int), cudaMemcpy2D(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int), cudaMemcpyToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int), cudaMemcpy2DToArray(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int), cudaMemcpyFromArray(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DFromArray(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, int), cudaMemcpy2DArrayToArray(jcuda.runtime.cudaArray, long, long, jcuda.runtime.cudaArray, long, long, long, long, int), cudaMemcpyToSymbol(java.lang.String, jcuda.Pointer, long, long, int), cudaMemcpyFromSymbol(jcuda.Pointer, java.lang.String, long, long, int), cudaMemcpyAsync(jcuda.Pointer, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DAsync(jcuda.Pointer, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DToArrayAsync(jcuda.runtime.cudaArray, long, long, jcuda.Pointer, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyFromArrayAsync(jcuda.Pointer, jcuda.runtime.cudaArray, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpy2DFromArrayAsync(jcuda.Pointer, long, jcuda.runtime.cudaArray, long, long, long, long, int, jcuda.runtime.cudaStream_t), cudaMemcpyToSymbolAsync(java.lang.String, jcuda.Pointer, long, long, int, jcuda.runtime.cudaStream_t)

cudaMemset

public static int cudaMemset(Pointer mem,
                             int c,
                             long count)
Initializes or sets device memory to a value.
cudaError_t cudaMemset ( void *  devPtr,
int  value,
size_t  count  
)

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset2D(jcuda.Pointer, long, int, long, long), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemsetAsync(jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemset2DAsync(jcuda.Pointer, long, int, long, long, jcuda.runtime.cudaStream_t), cudaMemset3DAsync(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent, jcuda.runtime.cudaStream_t)

cudaMemset2D

public static int cudaMemset2D(Pointer mem,
                               long pitch,
                               int c,
                               long width,
                               long height)
Initializes or sets device memory to a value.
cudaError_t cudaMemset2D ( void *  devPtr,
size_t  pitch,
int  value,
size_t  width,
size_t  height  
)

Sets to the specified value value a matrix (height rows of width bytes each) pointed to by devPtr. pitch is the width in bytes of the 2D array pointed to by devPtr, including any padding added to the end of each row. This function performs fastest when the pitch is one that has been passed back by cudaMallocPitch().

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer
See Also:
cudaMemset(jcuda.Pointer, int, long), cudaMemset3D(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent), cudaMemsetAsync(jcuda.Pointer, int, long, jcuda.runtime.cudaStream_t), cudaMemset2DAsync(jcuda.Pointer, long, int, long, long, jcuda.runtime.cudaStream_t), cudaMemset3DAsync(jcuda.runtime.cudaPitchedPtr, int, jcuda.runtime.cudaExtent, jcuda.runtime.cudaStream_t)
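
A minimal sketch clearing a linear buffer with cudaMemset and a pitched buffer with cudaMemset2D, using the pitch returned by cudaMallocPitch as recommended above (sizes are illustrative).

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import static jcuda.runtime.JCuda.*;

  public class MemsetSketch
  {
      public static void main(String[] args)
      {
          // Linear buffer: fill every byte with zero
          Pointer linear = new Pointer();
          long bytes = 4096 * Sizeof.FLOAT;
          cudaMalloc(linear, bytes);
          cudaMemset(linear, 0, bytes);

          // Pitched buffer: fill a width x height byte matrix, row by row
          Pointer pitched = new Pointer();
          long[] pitch = new long[1];
          long width = 1000;  // bytes per row
          long height = 200;  // rows
          cudaMallocPitch(pitched, pitch, width, height);
          cudaMemset2D(pitched, pitch[0], 0, width, height);

          cudaFree(pitched);
          cudaFree(linear);
      }
  }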

cudaGetChannelDesc

public static int cudaGetChannelDesc(cudaChannelFormatDesc desc,
                                     cudaArray array)
Get the channel descriptor of an array.
cudaError_t cudaGetChannelDesc ( struct cudaChannelFormatDesc *  desc,
const struct cudaArray *  array  
)

Returns in *desc the channel descriptor of the CUDA array array.

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaCreateChannelDesc

public static cudaChannelFormatDesc cudaCreateChannelDesc(int x,
                                                          int y,
                                                          int z,
                                                          int w,
                                                          int cudaChannelFormatKind_f)
Returns a channel descriptor using the specified format.
cudaChannelFormatDesc cudaCreateChannelDesc ( int  x,
int  y,
int  z,
int  w,
enum cudaChannelFormatKind  f  
)

Returns a channel descriptor with format f and number of bits of each component x, y, z, and w. The cudaChannelFormatDesc is defined as:

  struct cudaChannelFormatDesc {
      int x, y, z, w;
      enum cudaChannelFormatKind f;
  };

where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

Returns:
Channel descriptor with format f
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)
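
A minimal sketch that creates a single-component 32-bit float channel descriptor, allocates a 2D CUDA array with it via cudaMallocArray, and reads the descriptor back with cudaGetChannelDesc (the dimensions are illustrative).

  import jcuda.runtime.cudaArray;
  import jcuda.runtime.cudaChannelFormatDesc;
  import static jcuda.runtime.JCuda.*;
  import static jcuda.runtime.cudaChannelFormatKind.*;

  public class ChannelDescSketch
  {
      public static void main(String[] args)
      {
          // One 32-bit float component per element (y, z, w unused)
          cudaChannelFormatDesc desc =
              cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

          // 256 x 128 CUDA array using that format
          cudaArray array = new cudaArray();
          cudaMallocArray(array, desc, 256, 128);

          // Query the descriptor back from the array
          cudaChannelFormatDesc queried = new cudaChannelFormatDesc();
          cudaGetChannelDesc(queried, array);

          cudaFreeArray(array);
      }
  }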

cudaGetLastError

public static int cudaGetLastError()
Returns the last error from a runtime call.
cudaError_t cudaGetLastError ( void   ) 

Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess.

Returns:
cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation, cudaErrorInitializationError, cudaErrorLaunchFailure, cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidPitchValue, cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed, cudaErrorInvalidHostPointer, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture, cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor, cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting, cudaErrorInvalidNormSetting, cudaErrorUnknown, cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver, cudaErrorSetOnActiveProcess, cudaErrorStartupFailure
See Also:
cudaPeekAtLastError(), cudaGetErrorString(int), cudaError

cudaPeekAtLastError

public static int cudaPeekAtLastError()
Returns the last error from a runtime call.
cudaError_t cudaPeekAtLastError ( void   ) 

Returns the last error that has been produced by any of the runtime calls in the same host thread. Note that this call does not reset the error to cudaSuccess like cudaGetLastError().

Returns:
cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation, cudaErrorInitializationError, cudaErrorLaunchFailure, cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidPitchValue, cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed, cudaErrorInvalidHostPointer, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture, cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor, cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting, cudaErrorInvalidNormSetting, cudaErrorUnknown, cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver, cudaErrorSetOnActiveProcess, cudaErrorStartupFailure
See Also:
cudaGetLastError(), cudaGetErrorString(int), cudaError

cudaGetErrorString

public static java.lang.String cudaGetErrorString(int error)
Returns the message string from an error code.
const char* cudaGetErrorString ( cudaError_t  error  ) 

Returns the message string from an error code.

Returns:
char* pointer to a NULL-terminated string
See Also:
cudaGetLastError(), cudaPeekAtLastError(), cudaError
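
A small, hedged error-checking helper combining the three calls above: cudaGetLastError() clears the sticky error state, while cudaPeekAtLastError() only inspects it. The helper class and method names are illustrative.

  import jcuda.runtime.cudaError;
  import static jcuda.runtime.JCuda.*;

  public class ErrorCheckSketch
  {
      // Throws if the last runtime call (or an earlier asynchronous launch) failed
      public static void check(String where)
      {
          int error = cudaGetLastError(); // also resets the error to cudaSuccess
          if (error != cudaError.cudaSuccess)
          {
              throw new RuntimeException(
                  where + " failed: " + cudaGetErrorString(error));
          }
      }

      // Non-destructive variant: inspect without clearing the error state
      public static boolean hasPendingError()
      {
          return cudaPeekAtLastError() != cudaError.cudaSuccess;
      }
  }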

cudaStreamCreate

public static int cudaStreamCreate(cudaStream_t stream)
Create an asynchronous stream.
cudaError_t cudaStreamCreate ( cudaStream_t *  pStream  ) 

Creates a new asynchronous stream.

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaStreamQuery(jcuda.runtime.cudaStream_t), cudaStreamSynchronize(jcuda.runtime.cudaStream_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaStreamDestroy(jcuda.runtime.cudaStream_t)

cudaStreamDestroy

public static int cudaStreamDestroy(cudaStream_t stream)
Destroys and cleans up an asynchronous stream.
cudaError_t cudaStreamDestroy ( cudaStream_t  stream  ) 

Destroys and cleans up the asynchronous stream specified by stream.

In the case that the device is still doing work in the stream stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with stream will be released automatically once the device has completed all work in stream.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle
See Also:
cudaStreamCreate(jcuda.runtime.cudaStream_t), cudaStreamQuery(jcuda.runtime.cudaStream_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaStreamSynchronize(jcuda.runtime.cudaStream_t)

cudaStreamWaitEvent

public static int cudaStreamWaitEvent(cudaStream_t stream,
                                      cudaEvent_t event,
                                      int flags)
Make a compute stream wait on an event.
cudaError_t cudaStreamWaitEvent ( cudaStream_t  stream,
cudaEvent_t  event,
unsigned int  flags  
)

Makes all future work submitted to stream wait until event reports completion before beginning execution. This synchronization will be performed efficiently on the device. The event event may be from a different context than stream, in which case this function will perform cross-device synchronization.

The stream stream will wait only for the completion of the most recent host call to cudaEventRecord() on event. Once this call has returned, any functions (including cudaEventRecord() and cudaEventDestroy()) may be called on event again, and the subsequent calls will not have any effect on stream.

If stream is NULL, any future work submitted in any stream will wait for event to complete before beginning execution. This effectively creates a barrier for all future work submitted to the device on this thread.

If cudaEventRecord() has not been called on event, this call acts as if the record has already completed, and so is a functional no-op.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle
See Also:
cudaStreamCreate(jcuda.runtime.cudaStream_t), cudaStreamQuery(jcuda.runtime.cudaStream_t), cudaStreamSynchronize(jcuda.runtime.cudaStream_t), cudaStreamDestroy(jcuda.runtime.cudaStream_t)
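
A minimal sketch of a cross-stream dependency: work submitted to streamB after the cudaStreamWaitEvent call only starts once the event recorded on streamA has completed. The event is created with cudaEventDisableTiming, as suggested elsewhere in this class for pure synchronization; stream and event names are illustrative.

  import jcuda.runtime.cudaEvent_t;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;

  public class StreamWaitEventSketch
  {
      public static void main(String[] args)
      {
          cudaStream_t streamA = new cudaStream_t();
          cudaStream_t streamB = new cudaStream_t();
          cudaStreamCreate(streamA);
          cudaStreamCreate(streamB);

          cudaEvent_t done = new cudaEvent_t();
          cudaEventCreateWithFlags(done, cudaEventDisableTiming);

          // ... enqueue work on streamA here (copies, kernel launches) ...
          cudaEventRecord(done, streamA);

          // Everything submitted to streamB after this call waits for 'done'
          cudaStreamWaitEvent(streamB, done, 0);
          // ... enqueue dependent work on streamB here ...

          cudaStreamSynchronize(streamB);
          cudaEventDestroy(done);
          cudaStreamDestroy(streamA);
          cudaStreamDestroy(streamB);
      }
  }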

cudaStreamSynchronize

public static int cudaStreamSynchronize(cudaStream_t stream)
Waits for stream tasks to complete.
cudaError_t cudaStreamSynchronize ( cudaStream_t  stream  ) 

Blocks until stream has completed all operations. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the stream is finished with all of its tasks.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle
See Also:
cudaStreamCreate(jcuda.runtime.cudaStream_t), cudaStreamQuery(jcuda.runtime.cudaStream_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaStreamDestroy(jcuda.runtime.cudaStream_t)

cudaStreamQuery

public static int cudaStreamQuery(cudaStream_t stream)
Queries an asynchronous stream for completion status.
cudaError_t cudaStreamQuery ( cudaStream_t  stream  ) 

Returns cudaSuccess if all operations in stream have completed, or cudaErrorNotReady if not.

Returns:
cudaSuccess, cudaErrorNotReady, cudaErrorInvalidResourceHandle
See Also:
cudaStreamCreate(jcuda.runtime.cudaStream_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaStreamSynchronize(jcuda.runtime.cudaStream_t), cudaStreamDestroy(jcuda.runtime.cudaStream_t)
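
A small polling sketch: instead of blocking in cudaStreamSynchronize, the host keeps checking cudaStreamQuery and can do other work while the stream still reports cudaErrorNotReady (the constant comes from jcuda.runtime.cudaError; the method name is illustrative).

  import jcuda.runtime.cudaError;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;

  public class StreamPollSketch
  {
      public static void pollUntilDone(cudaStream_t stream)
      {
          // ... asynchronous work has already been enqueued on 'stream' ...
          while (cudaStreamQuery(stream) == cudaError.cudaErrorNotReady)
          {
              // The stream is still busy: do useful host-side work here
          }
          // cudaStreamQuery returned cudaSuccess: all stream operations finished
      }
  }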

cudaEventCreate

public static int cudaEventCreate(cudaEvent_t event)
Creates an event object.
cudaError_t cudaEventCreate ( cudaEvent_t *  event  ) 

Creates an event object with default flags. (To create an event with explicit flags, use cudaEventCreateWithFlags(); the valid flags are listed below.)

  • cudaEventDefault: Default event creation flag.
  • cudaEventBlockingSync: Specifies that event should use blocking synchronization. A host thread that uses cudaEventSynchronize() to wait on an event created with this flag will block until the event actually completes.
  • cudaEventDisableTiming: Specifies that the created event does not need to record timing data. Events created with this flag specified and the cudaEventBlockingSync flag not specified will provide the best performance when used with cudaStreamWaitEvent() and cudaEventQuery().

Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorLaunchFailure, cudaErrorMemoryAllocation
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventRecord(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaStream_t), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int)

cudaEventCreateWithFlags

public static int cudaEventCreateWithFlags(cudaEvent_t event,
                                           int flags)
Creates an event object with the specified flags.
cudaError_t cudaEventCreateWithFlags ( cudaEvent_t *  event,
unsigned int  flags  
)

Creates an event object with the specified flags. Valid flags include:

  • cudaEventDefault: Default event creation flag.
  • cudaEventBlockingSync: Specifies that event should use blocking synchronization. A host thread that uses cudaEventSynchronize() to wait on an event created with this flag will block until the event actually completes.
  • cudaEventDisableTiming: Specifies that the created event does not need to record timing data. Events created with this flag specified and the cudaEventBlockingSync flag not specified will provide the best performance when used with cudaStreamWaitEvent() and cudaEventQuery().

Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorLaunchFailure, cudaErrorMemoryAllocation
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int)

cudaEventRecord

public static int cudaEventRecord(cudaEvent_t event,
                                  cudaStream_t stream)
Records an event.
cudaError_t cudaEventRecord ( cudaEvent_t  event,
cudaStream_t  stream = 0  
)

Records an event. If stream is non-zero, the event is recorded after all preceding operations in stream have been completed; otherwise, it is recorded after all preceding operations in the CUDA context have been completed. Since this operation is asynchronous, cudaEventQuery() and/or cudaEventSynchronize() must be used to determine when the event has actually been recorded.

If cudaEventRecord() has previously been called on event, then this call will overwrite any existing state in event. Any subsequent calls which examine the status of event will only examine the completion of this most recent call to cudaEventRecord().

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInitializationError, cudaErrorInvalidResourceHandle, cudaErrorLaunchFailure
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int)

cudaEventQuery

public static int cudaEventQuery(cudaEvent_t event)
Queries an event's status.
cudaError_t cudaEventQuery ( cudaEvent_t  event  ) 

Query the status of all device work preceding the most recent call to cudaEventRecord() (in the appropriate compute streams, as specified by the arguments to cudaEventRecord()).

If this work has successfully been completed by the device, or if cudaEventRecord() has not been called on event, then cudaSuccess is returned. If this work has not yet been completed by the device then cudaErrorNotReady is returned.

Returns:
cudaSuccess, cudaErrorNotReady, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorLaunchFailure
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventRecord(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaStream_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t)

cudaEventSynchronize

public static int cudaEventSynchronize(cudaEvent_t event)
Waits for an event to complete.
cudaError_t cudaEventSynchronize ( cudaEvent_t  event  ) 

Wait until the completion of all device work preceding the most recent call to cudaEventRecord() (in the appropriate compute streams, as specified by the arguments to cudaEventRecord()).

If cudaEventRecord() has not been called on event, cudaSuccess is returned immediately.

Waiting for an event that was created with the cudaEventBlockingSync flag will cause the calling CPU thread to block until the event has been completed by the device. If the cudaEventBlockingSync flag has not been set, then the CPU thread will busy-wait until the event has been completed by the device.

Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorLaunchFailure
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventRecord(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaStream_t), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t)

cudaEventDestroy

public static int cudaEventDestroy(cudaEvent_t event)
Destroys an event object.
cudaError_t cudaEventDestroy ( cudaEvent_t  event  ) 

Destroys the event specified by event.

In the case that event has been recorded but has not yet been completed when cudaEventDestroy() is called, the function will return immediately and the resources associated with event will be released automatically once the device has completed event.

Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidValue, cudaErrorLaunchFailure
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventRecord(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaStream_t), cudaEventElapsedTime(float[], jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaEvent_t)

cudaEventElapsedTime

public static int cudaEventElapsedTime(float[] ms,
                                       cudaEvent_t start,
                                       cudaEvent_t end)
Computes the elapsed time between events.
cudaError_t cudaEventElapsedTime ( float *  ms,
cudaEvent_t  start,
cudaEvent_t  end  
)

Computes the elapsed time between two events (in milliseconds with a resolution of around 0.5 microseconds).

If either event was last recorded in a non-NULL stream, the resulting time may be greater than expected (even if both used the same stream handle). This happens because the cudaEventRecord() operation takes place asynchronously and there is no guarantee that the measured latency is actually just between the two events. Any number of other different stream operations could execute in between the two measured events, thus altering the timing in a significant way.

If cudaEventRecord() has not been called on either event, then cudaErrorInvalidResourceHandle is returned. If cudaEventRecord() has been called on both events but one or both of them has not yet been completed (that is, cudaEventQuery() would return cudaErrorNotReady on at least one of the events), cudaErrorNotReady is returned. If either event was created with the cudaEventDisableTiming flag, then this function will return cudaErrorInvalidResourceHandle.

Returns:
cudaSuccess, cudaErrorNotReady, cudaErrorInvalidValue, cudaErrorInitializationError, cudaErrorInvalidResourceHandle, cudaErrorLaunchFailure
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventCreateWithFlags(jcuda.runtime.cudaEvent_t, int), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventRecord(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaStream_t)
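
A minimal timing sketch: two events bracket a piece of asynchronous work on a single stream (here a cudaMemsetAsync, chosen only as a stand-in for real work), and cudaEventElapsedTime reports the elapsed milliseconds once the stop event has completed. Sizes and names are illustrative.

  import jcuda.Pointer;
  import jcuda.Sizeof;
  import jcuda.runtime.cudaEvent_t;
  import jcuda.runtime.cudaStream_t;
  import static jcuda.runtime.JCuda.*;

  public class EventTimingSketch
  {
      public static void main(String[] args)
      {
          Pointer deviceData = new Pointer();
          long bytes = (1 << 20) * Sizeof.FLOAT;
          cudaMalloc(deviceData, bytes);

          cudaStream_t stream = new cudaStream_t();
          cudaStreamCreate(stream);

          cudaEvent_t start = new cudaEvent_t();
          cudaEvent_t stop = new cudaEvent_t();
          cudaEventCreate(start);
          cudaEventCreate(stop);

          // Bracket the timed work with the two events, all on the same stream
          cudaEventRecord(start, stream);
          cudaMemsetAsync(deviceData, 0, bytes, stream);
          cudaEventRecord(stop, stream);
          cudaEventSynchronize(stop);

          float[] ms = new float[1];
          cudaEventElapsedTime(ms, start, stop);
          System.out.println("Memset took " + ms[0] + " ms");

          cudaEventDestroy(start);
          cudaEventDestroy(stop);
          cudaStreamDestroy(stream);
          cudaFree(deviceData);
      }
  }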

cudaDeviceReset

public static int cudaDeviceReset()
Destroy all allocations and reset all state on the current device in the current process.
cudaError_t cudaDeviceReset ( void   ) 

Explicitly destroys and cleans up all resources associated with the current device in the current process. Any subsequent API call to this device will reinitialize the device.

Note that this function will reset the device immediately. It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called.

Returns:
cudaSuccess
See Also:
cudaDeviceSynchronize()

cudaDeviceSynchronize

public static int cudaDeviceSynchronize()
Wait for compute device to finish.
cudaError_t cudaDeviceSynchronize ( void   ) 

Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.

Returns:
cudaSuccess
See Also:
cudaDeviceReset()
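
A short sketch of a common shutdown pattern built from the two calls above: drain all outstanding device work, report a failure if one occurred, then reset the device. The helper name is illustrative.

  import jcuda.runtime.cudaError;
  import static jcuda.runtime.JCuda.*;

  public class ShutdownSketch
  {
      public static void shutdown()
      {
          // Block until all previously enqueued work has finished
          int result = cudaDeviceSynchronize();
          if (result != cudaError.cudaSuccess)
          {
              System.err.println("Device error: " + cudaGetErrorString(result));
          }

          // Release all allocations and reset the device state for this process
          cudaDeviceReset();
      }
  }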

cudaDeviceSetLimit

public static int cudaDeviceSetLimit(int limit,
                                     long value)
Set resource limits.
cudaError_t cudaDeviceSetLimit ( enum cudaLimit  limit,
size_t  value  
)

Setting limit to value is a request by the application to update the current limit maintained by the device. The driver is free to modify the requested value to meet h/w requirements (this could be clamping to minimum or maximum values, rounding up to nearest element size, etc). The application can use cudaDeviceGetLimit() to find out exactly what the limit has been set to.

Setting each cudaLimit has its own specific restrictions, so each is discussed here.

  • cudaLimitStackSize controls the stack size of each GPU thread. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

  • cudaLimitPrintfFifoSize controls the size of the shared FIFO used by the printf() and fprintf() device system calls. Setting cudaLimitPrintfFifoSize must be performed before launching any kernel that uses the printf() or fprintf() device system calls, otherwise cudaErrorInvalidValue will be returned. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

  • cudaLimitMallocHeapSize controls the size of the heap used by the malloc() and free() device system calls. Setting cudaLimitMallocHeapSize must be performed before launching any kernel that uses the malloc() or free() device system calls, otherwise cudaErrorInvalidValue will be returned. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue
See Also:
cudaDeviceGetLimit(long[], int)

cudaDeviceGetLimit

public static int cudaDeviceGetLimit(long[] pValue,
                                     int limit)
Returns resource limits.
cudaError_t cudaDeviceGetLimit ( size_t *  pValue,
enum cudaLimit  limit  
)

Returns in *pValue the current size of limit. The supported cudaLimit values are:

  • cudaLimitStackSize: stack size of each GPU thread;
  • cudaLimitPrintfFifoSize: size of the shared FIFO used by the printf() and fprintf() device system calls;
  • cudaLimitMallocHeapSize: size of the heap used by the malloc() and free() device system calls.

Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue
See Also:
cudaDeviceSetLimit(int, long)
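
A minimal sketch that requests a larger device-side malloc() heap and reads back the value actually granted, since the driver may clamp or round the request. The 64 MB figure is illustrative, and the limit constants come from jcuda.runtime.cudaLimit; as noted above, this requires a device of compute capability 2.0 or higher.

  import jcuda.runtime.cudaLimit;
  import static jcuda.runtime.JCuda.*;

  public class DeviceLimitSketch
  {
      public static void main(String[] args)
      {
          // Request a 64 MB heap for in-kernel malloc()/free()
          cudaDeviceSetLimit(cudaLimit.cudaLimitMallocHeapSize, 64L * 1024 * 1024);

          // The driver may clamp or round the value, so read it back
          long[] value = new long[1];
          cudaDeviceGetLimit(value, cudaLimit.cudaLimitMallocHeapSize);
          System.out.println("Malloc heap size is now " + value[0] + " bytes");
      }
  }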

cudaDeviceGetCacheConfig

public static int cudaDeviceGetCacheConfig(int[] pCacheConfig)
Returns the preferred cache configuration for the current device.
cudaError_t cudaDeviceGetCacheConfig ( enum cudaFuncCache *  pCacheConfig  ) 

On devices where the L1 cache and shared memory use the same hardware resources, this returns through pCacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute functions.

This will return a pCacheConfig of cudaFuncCachePreferNone on devices where the size of the L1 cache and shared memory are fixed.

The supported cache configurations are:

  • cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
  • cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
  • cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaDeviceSetCacheConfig(int)

cudaDeviceGetSharedMemConfig

public static int cudaDeviceGetSharedMemConfig(int[] pConfig)
(No documentation in CUDA 4.2) Returns the shared memory configuration

Parameters:
pConfig - The configuration
Returns:
The error return code

cudaDeviceSetSharedMemConfig

public static int cudaDeviceSetSharedMemConfig(int config)
(No documentation in CUDA 4.2) Sets the shared memory configuration

Parameters:
config - The configuration
Returns:
The error return code

cudaDeviceSetCacheConfig

public static int cudaDeviceSetCacheConfig(int cacheConfig)
Sets the preferred cache configuration for the current device.
cudaError_t cudaDeviceSetCacheConfig ( enum cudaFuncCache  cacheConfig  ) 

On devices where the L1 cache and shared memory use the same hardware resources, this sets through cacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute the function. Any function preference set via cudaFuncSetCacheConfig (C API) or cudaFuncSetCacheConfig (C++ API) will be preferred over this device-wide setting. Setting the device-wide cache configuration to cudaFuncCachePreferNone will cause subsequent kernel launches to prefer to not change the cache configuration unless required to launch the kernel.

This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.

Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point.

The supported cache configurations are:

  • cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
  • cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
  • cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
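
For illustration, a minimal sketch (assuming a static import of jcuda.runtime.JCuda.* and the constants from jcuda.runtime.cudaFuncCache) that requests a larger L1 cache and then reads back the device-wide preference:

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.runtime.cudaFuncCache;

    // Hint that subsequent kernels should prefer a larger L1 cache.
    cudaDeviceSetCacheConfig(cudaFuncCache.cudaFuncCachePreferL1);

    // Read back the currently preferred configuration.
    int[] cacheConfig = new int[1];
    cudaDeviceGetCacheConfig(cacheConfig);
    System.out.println("Preferred cache config: " + cacheConfig[0]);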

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaDeviceGetCacheConfig(int[])

cudaDeviceGetByPCIBusId

public static int cudaDeviceGetByPCIBusId(int[] device,
                                          java.lang.String pciBusId)
Returns a handle to a compute device.
cudaError_t cudaDeviceGetByPCIBusId ( int *  device,
char *  pciBusId  
)

Returns in *device a device ordinal given a PCI bus ID string.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaDeviceGetPCIBusId(java.lang.String[], int, int)

cudaDeviceGetPCIBusId

public static int cudaDeviceGetPCIBusId(java.lang.String[] pciBusId,
                                        int len,
                                        int device)
Returns a PCI Bus Id string for the device.
cudaError_t cudaDeviceGetPCIBusId ( char *  pciBusId,
int  len,
int  device  
)

Returns an ASCII string identifying the device dev in the NULL-terminated string pointed to by pciBusId. len specifies the maximum length of the string that may be returned.
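
A small sketch of the round trip between a device ordinal and its PCI bus ID string (assuming a static import of jcuda.runtime.JCuda.*; device 0 and the buffer length of 64 characters are arbitrary choices):

    // Requires: import static jcuda.runtime.JCuda.*;

    // Obtain the PCI bus ID string of device 0 ...
    String[] pciBusId = new String[1];
    cudaDeviceGetPCIBusId(pciBusId, 64, 0);
    System.out.println("Device 0 PCI bus ID: " + pciBusId[0]);

    // ... and map the string back to a device ordinal.
    int[] device = new int[1];
    cudaDeviceGetByPCIBusId(device, pciBusId[0]);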

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
See Also:
cudaDeviceGetByPCIBusId(int[], java.lang.String)

cudaIpcGetEventHandle

public static int cudaIpcGetEventHandle(cudaIpcEventHandle handle,
                                        cudaEvent_t event)
Gets an interprocess handle for a previously allocated event.
cudaError_t cudaIpcGetEventHandle ( cudaIpcEventHandle_t *  handle,
cudaEvent_t  event  
)

Takes as input a previously allocated event. This event must have been created with the cudaEventInterprocess and cudaEventDisableTiming flags set. This opaque handle may be copied into other processes and opened with cudaIpcOpenEventHandle to allow efficient hardware synchronization between GPU work in different processes.

After the event has been opened in the importing process, cudaEventRecord, cudaEventSynchronize, cudaStreamWaitEvent and cudaEventQuery may be used in either process. Performing operations on the imported event after the exported event has been freed with cudaEventDestroy will result in undefined behavior.
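
For illustration, a sketch of the exporting side (assuming a static import of jcuda.runtime.JCuda.* and that cudaEventCreateWithFlags is available, mirroring the runtime API; how the handle is transferred to the other process is up to the application):

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.runtime.*;

    // The event must be created with these two flags to be exportable.
    cudaEvent_t event = new cudaEvent_t();
    cudaEventCreateWithFlags(event, cudaEventInterprocess | cudaEventDisableTiming);

    // Obtain the opaque handle that may be passed to another process.
    cudaIpcEventHandle handle = new cudaIpcEventHandle();
    cudaIpcGetEventHandle(handle, event);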

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorMemoryAllocation, cudaErrorMapBufferObjectFailed
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaIpcOpenEventHandle(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaIpcEventHandle), cudaIpcGetMemHandle(jcuda.runtime.cudaIpcMemHandle, jcuda.Pointer), cudaIpcOpenMemHandle(jcuda.Pointer, jcuda.runtime.cudaIpcMemHandle, int), cudaIpcCloseMemHandle(jcuda.Pointer)

cudaIpcOpenEventHandle

public static int cudaIpcOpenEventHandle(cudaEvent_t event,
                                         cudaIpcEventHandle handle)
Opens an interprocess event handle for use in the current process.
cudaError_t cudaIpcOpenEventHandle ( cudaEvent_t *  event,
cudaIpcEventHandle_t  handle  
)

Opens an interprocess event handle exported from another process with cudaIpcGetEventHandle. This function returns a cudaEvent_t that behaves like a locally created event with the cudaEventDisableTiming flag specified. This event must be freed with cudaEventDestroy.

Performing operations on the imported event after the exported event has been freed with cudaEventDestroy will result in undefined behavior.

IPC functionality is restricted to devices with support for unified addressing on Linux operating systems.
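
A corresponding sketch of the importing side (assuming a static import of jcuda.runtime.JCuda.* and that handle has been received from the exporting process):

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.runtime.*;

    // Open the handle received from the exporting process.
    cudaEvent_t event = new cudaEvent_t();
    cudaIpcOpenEventHandle(event, handle);

    // The imported event behaves like a local event created with
    // cudaEventDisableTiming, and must eventually be destroyed:
    cudaEventDestroy(event);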

Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle
See Also:
cudaEventCreate(jcuda.runtime.cudaEvent_t), cudaEventDestroy(jcuda.runtime.cudaEvent_t), cudaEventSynchronize(jcuda.runtime.cudaEvent_t), cudaEventQuery(jcuda.runtime.cudaEvent_t), cudaStreamWaitEvent(jcuda.runtime.cudaStream_t, jcuda.runtime.cudaEvent_t, int), cudaIpcGetEventHandle(jcuda.runtime.cudaIpcEventHandle, jcuda.runtime.cudaEvent_t), cudaIpcGetMemHandle(jcuda.runtime.cudaIpcMemHandle, jcuda.Pointer), cudaIpcOpenMemHandle(jcuda.Pointer, jcuda.runtime.cudaIpcMemHandle, int), cudaIpcCloseMemHandle(jcuda.Pointer)

cudaIpcGetMemHandle

public static int cudaIpcGetMemHandle(cudaIpcMemHandle handle,
                                      Pointer devPtr)
Gets an interprocess memory handle for an existing device memory allocation.
cudaError_t cudaIpcGetMemHandle ( cudaIpcMemHandle_t *  handle,
void *  devPtr  
)

Takes a pointer to the base of an existing device memory allocation created with cudaMalloc and exports it for use in another process. This is a lightweight operation and may be called multiple times on an allocation without adverse effects.

If a region of memory is freed with cudaFree and a subsequent call to cudaMalloc returns memory with the same device address, cudaIpcGetMemHandle will return a unique handle for the new memory.

IPC functionality is restricted to devices with support for unified addressing on Linux operating systems.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorMemoryAllocation, cudaErrorMapBufferObjectFailed
See Also:
cudaMalloc(jcuda.Pointer, long), cudaFree(jcuda.Pointer), cudaIpcGetEventHandle(jcuda.runtime.cudaIpcEventHandle, jcuda.runtime.cudaEvent_t), cudaIpcOpenEventHandle(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaIpcEventHandle), cudaIpcOpenMemHandle(jcuda.Pointer, jcuda.runtime.cudaIpcMemHandle, int), cudaIpcCloseMemHandle(jcuda.Pointer)

cudaIpcOpenMemHandle

public static int cudaIpcOpenMemHandle(Pointer devPtr,
                                       cudaIpcMemHandle handle,
                                       int flags)
Opens an interprocess memory handle exported from another process and returns a device pointer usable in the local process.
cudaError_t cudaIpcOpenMemHandle ( void **  devPtr,
cudaIpcMemHandle_t  handle,
unsigned int  flags  
)

Maps memory exported from another process with cudaIpcGetMemHandle into the current device address space. For contexts on different devices cudaIpcOpenMemHandle will attempt to enable peer access between the devices as if the user called cudaDeviceEnablePeerAccess. Calling cudaDeviceCanAccessPeer can determine if this mapping is possible.

Calling cudaFree on an exported memory region before calling cudaIpcCloseMemHandle in the importing context will result in undefined behavior.

Memory returned from cudaIpcOpenMemHandle must be freed with cudaIpcCloseMemHandle.

IPC functionality is restricted to devices with support for unified addressing on Linux operating systems.
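
For illustration, a sketch of the complete export/import cycle (assuming static imports of jcuda.runtime.JCuda.*, the classes from jcuda and jcuda.runtime, and that the handle is transferred between the processes by an IPC mechanism of the application's choosing):

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.*; import jcuda.runtime.*;

    // Exporting process: allocate device memory and export a handle for it.
    Pointer devPtr = new Pointer();
    cudaMalloc(devPtr, 1024);
    cudaIpcMemHandle memHandle = new cudaIpcMemHandle();
    cudaIpcGetMemHandle(memHandle, devPtr);
    // (transfer the handle to the importing process)

    // Importing process: map the exported allocation ...
    Pointer mapped = new Pointer();
    cudaIpcOpenMemHandle(mapped, memHandle, cudaIpcMemLazyEnablePeerAccess);
    // ... use 'mapped' in kernels or memory copies, then unmap it again.
    cudaIpcCloseMemHandle(mapped);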

Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle, cudaErrorTooManyPeers
See Also:
cudaMalloc(jcuda.Pointer, long), cudaFree(jcuda.Pointer), cudaIpcGetEventHandle(jcuda.runtime.cudaIpcEventHandle, jcuda.runtime.cudaEvent_t), cudaIpcOpenEventHandle(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaIpcEventHandle), cudaIpcGetMemHandle(jcuda.runtime.cudaIpcMemHandle, jcuda.Pointer), cudaIpcCloseMemHandle(jcuda.Pointer), cudaDeviceEnablePeerAccess(int, int), cudaDeviceCanAccessPeer(int[], int, int)

cudaIpcCloseMemHandle

public static int cudaIpcCloseMemHandle(Pointer devPtr)
Close memory mapped with cudaIpcOpenMemHandle.
cudaError_t cudaIpcCloseMemHandle ( void *  devPtr  ) 

Unmaps memory returned by cudaIpcOpenMemHandle. The original allocation in the exporting process, as well as imported mappings in other processes, will be unaffected.

Any resources used to enable peer access will be freed if this is the last mapping using them.

IPC functionality is restricted to devices with support for unified addressing on Linux operating systems.

Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle
See Also:
cudaMalloc(jcuda.Pointer, long), cudaFree(jcuda.Pointer), cudaIpcGetEventHandle(jcuda.runtime.cudaIpcEventHandle, jcuda.runtime.cudaEvent_t), cudaIpcOpenEventHandle(jcuda.runtime.cudaEvent_t, jcuda.runtime.cudaIpcEventHandle), cudaIpcGetMemHandle(jcuda.runtime.cudaIpcMemHandle, jcuda.Pointer), cudaIpcOpenMemHandle(jcuda.Pointer, jcuda.runtime.cudaIpcMemHandle, int)

cudaThreadExit

public static int cudaThreadExit()
Deprecated. This function is deprecated in the latest CUDA version

Exit and clean up from CUDA launches.
cudaError_t cudaThreadExit ( void   ) 

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceReset(), which should be used instead.

Explicitly destroys and cleans up all resources associated with the current device in the current process. Any subsequent API call to this device will reinitialize the device.

Note that this function will reset the device immediately. It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called.

Returns:
cudaSuccess
See Also:
cudaDeviceReset()

cudaThreadSynchronize

public static int cudaThreadSynchronize()
Deprecated. This function is deprecated in the latest CUDA version

Wait for compute device to finish.
cudaError_t cudaThreadSynchronize ( void   ) 

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is similar to the non-deprecated function cudaDeviceSynchronize(), which should be used instead.

Blocks until the device has completed all preceding requested tasks. cudaThreadSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.

Returns:
cudaSuccess
See Also:
cudaDeviceSynchronize()

cudaThreadSetLimit

public static int cudaThreadSetLimit(int limit,
                                     long value)
Deprecated. This function is deprecated in the latest CUDA version

Set resource limits.
cudaError_t cudaThreadSetLimit ( enum cudaLimit  limit,
size_t  value  
)

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceSetLimit(), which should be used instead.

Setting limit to value is a request by the application to update the current limit maintained by the device. The driver is free to modify the requested value to meet h/w requirements (this could be clamping to minimum or maximum values, rounding up to nearest element size, etc). The application can use cudaThreadGetLimit() to find out exactly what the limit has been set to.

Setting each cudaLimit has its own specific restrictions, so each is discussed here.

  • cudaLimitStackSize controls the stack size of each GPU thread. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

  • cudaLimitPrintfFifoSize controls the size of the shared FIFO used by the printf() and fprintf() device system calls. Setting cudaLimitPrintfFifoSize must be performed before launching any kernel that uses the printf() or fprintf() device system calls, otherwise cudaErrorInvalidValue will be returned. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

  • cudaLimitMallocHeapSize controls the size of the heap used by the malloc() and free() device system calls. Setting cudaLimitMallocHeapSize must be performed before launching any kernel that uses the malloc() or free() device system calls, otherwise cudaErrorInvalidValue will be returned. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue
See Also:
cudaDeviceSetLimit(int, long)

cudaThreadGetCacheConfig

public static int cudaThreadGetCacheConfig(int[] pCacheConfig)
Deprecated. This function is deprecated in the latest CUDA version

Returns the preferred cache configuration for the current device.
cudaError_t cudaThreadGetCacheConfig ( enum cudaFuncCache *  pCacheConfig  ) 

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceGetCacheConfig(), which should be used instead.

On devices where the L1 cache and shared memory use the same hardware resources, this returns through pCacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute functions.

This will return a pCacheConfig of cudaFuncCachePreferNone on devices where the size of the L1 cache and shared memory are fixed.

The supported cache configurations are:

  • cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
  • cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
  • cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaDeviceGetCacheConfig(int[])

cudaThreadSetCacheConfig

public static int cudaThreadSetCacheConfig(int cacheConfig)
Deprecated. This function is deprecated in the latest CUDA version

Sets the preferred cache configuration for the current device.
cudaError_t cudaThreadSetCacheConfig ( enum cudaFuncCache  cacheConfig  ) 

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceSetCacheConfig(), which should be used instead.

On devices where the L1 cache and shared memory use the same hardware resources, this sets through cacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute the function. Any function preference set via cudaFuncSetCacheConfig (C API) or cudaFuncSetCacheConfig (C++ API) will be preferred over this device-wide setting. Setting the device-wide cache configuration to cudaFuncCachePreferNone will cause subsequent kernel launches to prefer to not change the cache configuration unless required to launch the kernel.

This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.

Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point.

The supported cache configurations are:

  • cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
  • cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
  • cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaDeviceSetCacheConfig(int)

cudaThreadGetLimit

public static int cudaThreadGetLimit(long[] pValue,
                                     int limit)
Deprecated. This function is deprecated in the latest CUDA version

Returns resource limits.
cudaError_t cudaThreadGetLimit ( size_t *  pValue,
enum cudaLimit  limit  
)

Deprecated:
Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceGetLimit(), which should be used instead.

Returns in *pValue the current size of limit. The supported cudaLimit values are:

  • cudaLimitStackSize: stack size of each GPU thread;
  • cudaLimitPrintfFifoSize: size of the shared FIFO used by the printf() and fprintf() device system calls.
  • cudaLimitMallocHeapSize: size of the heap used by the malloc() and free() device system calls;

Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue
See Also:
cudaDeviceGetLimit(long[], int)

cudaGetSymbolAddress

public static int cudaGetSymbolAddress(Pointer devPtr,
                                       java.lang.String symbol)
Finds the address associated with a CUDA symbol.
template<class T >
cudaError_t cudaGetSymbolAddress ( void **  devPtr,
const T &  symbol  
)

Returns in *devPtr the address of symbol symbol on the device. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. If symbol cannot be found, or if symbol is not declared in the global or constant memory space, *devPtr is unchanged and the error cudaErrorInvalidSymbol is returned. If there are multiple global or constant variables with the same string name (from separate files) and the lookup is done via character string, cudaErrorDuplicateVariableName is returned.

Returns:
cudaSuccess, cudaErrorInvalidSymbol, cudaErrorDuplicateVariableName
See Also:
cudaGetSymbolAddress(jcuda.Pointer, java.lang.String)

cudaGetSymbolSize

public static int cudaGetSymbolSize(long[] size,
                                    java.lang.String symbol)
Finds the size of the object associated with a CUDA symbol.
template<class T >
cudaError_t cudaGetSymbolSize ( size_t *  size,
const T &  symbol  
)

Returns in *size the size of symbol symbol. symbol can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space. If symbol cannot be found, or if symbol is not declared in global or constant memory space, *size is unchanged and the error cudaErrorInvalidSymbol is returned. If there are multiple global variables with the same string name (from separate files) and the lookup is done via character string, cudaErrorDuplicateVariableName is returned.

Returns:
cudaSuccess, cudaErrorInvalidSymbol, cudaErrorDuplicateVariableName
See Also:
cudaGetSymbolAddress(jcuda.Pointer, java.lang.String)

cudaBindTexture

public static int cudaBindTexture(long[] offset,
                                  textureReference texref,
                                  Pointer devPtr,
                                  cudaChannelFormatDesc desc,
                                  long size)
Binds a memory area to a texture.
template<class T , int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaBindTexture ( size_t *  offset,
const struct texture< T, dim, readMode > &  tex,
const void *  devPtr,
const struct cudaChannelFormatDesc &  desc,
size_t  size = UINT_MAX  
)

Binds size bytes of the memory area pointed to by devPtr to texture reference tex. desc describes how the memory is interpreted when fetching values from the texture. The offset parameter is an optional byte offset as with the low-level cudaBindTexture() function. Any memory previously bound to tex is unbound.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaBindTexture2D

public static int cudaBindTexture2D(long[] offset,
                                    textureReference texref,
                                    Pointer devPtr,
                                    cudaChannelFormatDesc desc,
                                    long width,
                                    long height,
                                    long pitch)
Binds a 2D memory area to a texture.
template<class T , int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaBindTexture2D ( size_t *  offset,
const struct texture< T, dim, readMode > &  tex,
const void *  devPtr,
const struct cudaChannelFormatDesc &  desc,
size_t  width,
size_t  height,
size_t  pitch  
)

Binds the 2D memory area pointed to by devPtr to the texture reference tex. The size of the area is constrained by width in texel units, height in texel units, and pitch in byte units. desc describes how the memory is interpreted when fetching values from the texture. Any memory previously bound to tex is unbound.

Since the hardware enforces an alignment requirement on texture base addresses, cudaBindTexture2D() returns in *offset a byte offset that must be applied to texture fetches in order to read from the desired memory. This offset must be divided by the texel size and passed to kernels that read from the texture so that it can be applied to the tex2D() function. If the device memory pointer was returned from cudaMalloc(), the offset is guaranteed to be 0 and NULL may be passed as the offset parameter.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaBindTextureToArray

public static int cudaBindTextureToArray(textureReference texref,
                                         cudaArray array,
                                         cudaChannelFormatDesc desc)
Binds an array to a texture.
template<class T , int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaBindTextureToArray ( const struct texture< T, dim, readMode > &  tex,
const struct cudaArray *  array,
const struct cudaChannelFormatDesc &  desc  
)

Binds the CUDA array array to the texture reference tex. desc describes how the memory is interpreted when fetching values from the texture. Any CUDA array previously bound to tex is unbound.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaUnbindTexture

public static int cudaUnbindTexture(textureReference texref)
Unbinds a texture.
template<class T , int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaUnbindTexture ( const struct texture< T, dim, readMode > &  tex  ) 

Unbinds the texture bound to tex.

Returns:
cudaSuccess
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaGetTextureAlignmentOffset

public static int cudaGetTextureAlignmentOffset(long[] offset,
                                                textureReference texref)
Get the alignment offset of a texture.
template<class T , int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaGetTextureAlignmentOffset ( size_t *  offset,
const struct texture< T, dim, readMode > &  tex  
)

Returns in *offset the offset that was returned when texture reference tex was bound.

Returns:
cudaSuccess, cudaErrorInvalidTexture, cudaErrorInvalidTextureBinding
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureReference(jcuda.runtime.textureReference, java.lang.String), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference)

cudaGetTextureReference

public static int cudaGetTextureReference(textureReference texref,
                                          java.lang.String symbol)
Deprecated. As of CUDA 4.1

Get the texture reference associated with a symbol.
cudaError_t cudaGetTextureReference ( const struct textureReference **  texref,
const char *  symbol  
)

Returns in *texref the structure associated to the texture reference defined by symbol symbol.

Returns:
cudaSuccess, cudaErrorInvalidTexture
See Also:
cudaCreateChannelDesc(int, int, int, int, int), cudaGetChannelDesc(jcuda.runtime.cudaChannelFormatDesc, jcuda.runtime.cudaArray), cudaGetTextureAlignmentOffset(long[], jcuda.runtime.textureReference), cudaBindTexture(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long), cudaBindTexture2D(long[], jcuda.runtime.textureReference, jcuda.Pointer, jcuda.runtime.cudaChannelFormatDesc, long, long, long), cudaBindTextureToArray(jcuda.runtime.textureReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaUnbindTexture(jcuda.runtime.textureReference)

cudaBindSurfaceToArray

public static int cudaBindSurfaceToArray(surfaceReference surfref,
                                         cudaArray array,
                                         cudaChannelFormatDesc desc)
Binds an array to a surface.
template<class T , int dim>
cudaError_t cudaBindSurfaceToArray ( const struct surface< T, dim > &  surf,
const struct cudaArray *  array,
const struct cudaChannelFormatDesc &  desc  
)

Binds the CUDA array array to the surface reference surf. desc describes how the memory is interpreted when dealing with the surface. Any CUDA array previously bound to surf is unbound.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSurface
See Also:
cudaBindSurfaceToArray(jcuda.runtime.surfaceReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc), cudaBindSurfaceToArray(jcuda.runtime.surfaceReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc)

cudaGetSurfaceReference

public static int cudaGetSurfaceReference(surfaceReference surfref,
                                          java.lang.String symbol)
Deprecated. As of CUDA 4.1

Get the surface reference associated with a symbol.
cudaError_t cudaGetSurfaceReference ( const struct surfaceReference **  surfref,
const char *  symbol  
)

Returns in *surfref the structure associated to the surface reference defined by symbol symbol.

Returns:
cudaSuccess, cudaErrorInvalidSurface
See Also:
cudaBindSurfaceToArray(jcuda.runtime.surfaceReference, jcuda.runtime.cudaArray, jcuda.runtime.cudaChannelFormatDesc)

cudaConfigureCall

public static int cudaConfigureCall(dim3 gridDim,
                                    dim3 blockDim,
                                    long sharedMem,
                                    cudaStream_t stream)
Configure a device-launch.
cudaError_t cudaConfigureCall ( dim3  gridDim,
dim3  blockDim,
size_t  sharedMem = 0,
cudaStream_t  stream = 0  
)

Specifies the grid and block dimensions for the device call to be executed similar to the execution configuration syntax. cudaConfigureCall() is stack based. Each call pushes data on top of an execution stack. This data contains the dimension for the grid and thread blocks, together with any arguments for the call.

Returns:
cudaSuccess, cudaErrorInvalidConfiguration
See Also:
cudaDeviceSetCacheConfig(int), cudaFuncGetAttributes(jcuda.runtime.cudaFuncAttributes, java.lang.String), cudaLaunch(java.lang.String), cudaSetupArgument(jcuda.Pointer, long, long)

cudaSetupArgument

public static int cudaSetupArgument(Pointer arg,
                                    long size,
                                    long offset)
Configure a device launch.
template<class T >
cudaError_t cudaSetupArgument ( T  arg,
size_t  offset  
)

Pushes size bytes of the argument pointed to by arg at offset bytes from the start of the parameter passing area, which starts at offset 0. The arguments are stored in the top of the execution stack. cudaSetupArgument() must be preceded by a call to cudaConfigureCall().

Returns:
cudaSuccess
See Also:
cudaConfigureCall(jcuda.runtime.dim3, jcuda.runtime.dim3, long, jcuda.runtime.cudaStream_t), cudaFuncGetAttributes(jcuda.runtime.cudaFuncAttributes, java.lang.String), cudaLaunch(java.lang.String), cudaSetupArgument(jcuda.Pointer, long, long)

cudaFuncGetAttributes

public static int cudaFuncGetAttributes(cudaFuncAttributes attr,
                                        java.lang.String func)
Find out attributes for a given function.
template<class T >
cudaError_t cudaFuncGetAttributes ( struct cudaFuncAttributes *  attr,
T *  entry  
)

This function obtains the attributes of a function specified via entry. The parameter entry can either be a pointer to a function that executes on the device, or it can be a character string specifying the fully-decorated (C++) name of a function that executes on the device. The parameter specified by entry must be declared as a __global__ function. The fetched attributes are placed in attr. If the specified function does not exist, then cudaErrorInvalidDeviceFunction is returned.

Note that some function attributes such as maxThreadsPerBlock may vary based on the device that is currently being used.

Returns:
cudaSuccess, cudaErrorInitializationError, cudaErrorInvalidDeviceFunction
See Also:
cudaConfigureCall(jcuda.runtime.dim3, jcuda.runtime.dim3, long, jcuda.runtime.cudaStream_t), cudaDeviceSetCacheConfig(int), cudaFuncGetAttributes(jcuda.runtime.cudaFuncAttributes, java.lang.String), cudaLaunch(java.lang.String), cudaSetupArgument(jcuda.Pointer, long, long)

cudaLaunch

public static int cudaLaunch(java.lang.String symbol)
Launches a device function.
template<class T >
cudaError_t cudaLaunch ( T *  entry  ) 

Launches the function entry on the device. The parameter entry can either be a function that executes on the device, or it can be a character string, naming a function that executes on the device. The parameter specified by entry must be declared as a __global__ function. cudaLaunch() must be preceded by a call to cudaConfigureCall() since it pops the data that was pushed by cudaConfigureCall() from the execution stack.
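
For orientation only, a schematic of the configure/setup/launch sequence described by cudaConfigureCall, cudaSetupArgument and cudaLaunch. Launching a kernel by name through the runtime API is generally not practical from Java (JCuda applications normally launch kernels via the driver API), so treat this purely as an illustration of the call order; the kernel name "myKernel", the grid/block sizes, the device pointer and the use of null for the default stream are placeholders and assumptions:

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.*; import jcuda.runtime.*;

    Pointer data = new Pointer();
    cudaMalloc(data, 128 * Sizeof.FLOAT);

    dim3 gridDim = new dim3();
    gridDim.x = 256; gridDim.y = 1; gridDim.z = 1;
    dim3 blockDim = new dim3();
    blockDim.x = 128; blockDim.y = 1; blockDim.z = 1;

    // 1. Push the execution configuration (no shared memory, default stream).
    cudaConfigureCall(gridDim, blockDim, 0, null);

    // 2. Push the kernel arguments (here: a single device pointer).
    cudaSetupArgument(Pointer.to(data), Sizeof.POINTER, 0);

    // 3. Launch; this pops the configuration pushed by cudaConfigureCall.
    cudaLaunch("myKernel");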

Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure, cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorSharedObjectSymbolNotFound, cudaErrorSharedObjectInitFailed
See Also:
cudaConfigureCall(jcuda.runtime.dim3, jcuda.runtime.dim3, long, jcuda.runtime.cudaStream_t), cudaDeviceSetCacheConfig(int), cudaFuncGetAttributes(jcuda.runtime.cudaFuncAttributes, java.lang.String), cudaLaunch(java.lang.String), cudaSetupArgument(jcuda.Pointer, long, long), cudaThreadGetCacheConfig(int[]), cudaThreadSetCacheConfig(int)

cudaGLSetGLDevice

public static int cudaGLSetGLDevice(int device)
Sets a CUDA device to use OpenGL interoperability.
cudaError_t cudaGLSetGLDevice ( int  device  ) 

Records the calling thread's current OpenGL context as the OpenGL context to use for OpenGL interoperability with the CUDA device device and sets device as the current device for the calling host thread.

If device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess. In this case it is necessary to reset device using cudaDeviceReset() before OpenGL interoperability on device may be enabled.

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorSetOnActiveProcess
See Also:
cudaGLRegisterBufferObject(int), cudaGLMapBufferObject(jcuda.Pointer, int), cudaGLUnmapBufferObject(int), cudaGLUnregisterBufferObject(int), cudaGLMapBufferObjectAsync(jcuda.Pointer, int, jcuda.runtime.cudaStream_t), cudaGLUnmapBufferObjectAsync(int, jcuda.runtime.cudaStream_t), cudaDeviceReset()

cudaGLGetDevices

public static int cudaGLGetDevices(int[] pCudaDeviceCount,
                                   int[] pCudaDevices,
                                   int cudaDeviceCount,
                                   int cudaGLDeviceList_deviceList)
Gets the CUDA devices associated with the current OpenGL context.
cudaError_t cudaGLGetDevices ( unsigned int *  pCudaDeviceCount,
int *  pCudaDevices,
unsigned int  cudaDeviceCount,
enum cudaGLDeviceList  deviceList  
)

Returns in *pCudaDeviceCount the number of CUDA-compatible devices corresponding to the current OpenGL context. Also returns in *pCudaDevices at most cudaDeviceCount of the CUDA-compatible devices corresponding to the current OpenGL context. If any of the GPUs being used by the current OpenGL context are not CUDA capable then the call will return cudaErrorNoDevice.

Returns:
cudaSuccess, cudaErrorNoDevice, cudaErrorUnknown
See Also:
cudaGraphicsUnregisterResource(jcuda.runtime.cudaGraphicsResource), cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t), cudaGraphicsSubResourceGetMappedArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaGraphicsResource, int, int), cudaGraphicsResourceGetMappedPointer(jcuda.Pointer, long[], jcuda.runtime.cudaGraphicsResource)

cudaGraphicsGLRegisterImage

public static int cudaGraphicsGLRegisterImage(cudaGraphicsResource resource,
                                              int image,
                                              int target,
                                              int Flags)
Register an OpenGL texture or renderbuffer object.
cudaError_t cudaGraphicsGLRegisterImage ( struct cudaGraphicsResource **  resource,
GLuint  image,
GLenum  target,
unsigned int  flags  
)

Registers the texture or renderbuffer object specified by image for access by CUDA. A handle to the registered object is returned as resource.

target must match the type of the object, and must be one of GL_TEXTURE_2D, GL_TEXTURE_RECTANGLE, GL_TEXTURE_CUBE_MAP, GL_TEXTURE_3D, GL_TEXTURE_2D_ARRAY, or GL_RENDERBUFFER.

The register flags flags specify the intended usage, as follows:

  • cudaGraphicsRegisterFlagsNone: Specifies no hints about how this resource will be used. It is therefore assumed that this resource will be read from and written to by CUDA. This is the default value.
  • cudaGraphicsRegisterFlagsReadOnly: Specifies that CUDA will not write to this resource.
  • cudaGraphicsRegisterFlagsWriteDiscard: Specifies that CUDA will not read from this resource and will write over the entire contents of the resource, so none of the data previously stored in the resource will be preserved.
  • cudaGraphicsRegisterFlagsSurfaceLoadStore: Specifies that CUDA will bind this resource to a surface reference.
  • cudaGraphicsRegisterFlagsTextureGather: Specifies that CUDA will perform texture gather operations on this resource.

The following image formats are supported. For brevity, the list is abbreviated; for example, {GL_R, GL_RG} X {8, 16} would expand to the four formats {GL_R8, GL_R16, GL_RG8, GL_RG16}:

  • GL_RED, GL_RG, GL_RGBA, GL_LUMINANCE, GL_ALPHA, GL_LUMINANCE_ALPHA, GL_INTENSITY
  • {GL_R, GL_RG, GL_RGBA} X {8, 16, 16F, 32F, 8UI, 16UI, 32UI, 8I, 16I, 32I}
  • {GL_LUMINANCE, GL_ALPHA, GL_LUMINANCE_ALPHA, GL_INTENSITY} X {8, 16, 16F_ARB, 32F_ARB, 8UI_EXT, 16UI_EXT, 32UI_EXT, 8I_EXT, 16I_EXT, 32I_EXT}

The following image classes are currently disallowed:

  • Textures with borders
  • Multisampled renderbuffers

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGLSetGLDevice(int), cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t), cudaGraphicsSubResourceGetMappedArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaGraphicsResource, int, int)

cudaGraphicsGLRegisterBuffer

public static int cudaGraphicsGLRegisterBuffer(cudaGraphicsResource resource,
                                               int buffer,
                                               int Flags)
Registers an OpenGL buffer object.
cudaError_t cudaGraphicsGLRegisterBuffer ( struct cudaGraphicsResource **  resource,
GLuint  buffer,
unsigned int  flags  
)

Registers the buffer object specified by buffer for access by CUDA. A handle to the registered object is returned as resource. The register flags flags specify the intended usage, as follows:

  • cudaGraphicsRegisterFlagsNone: Specifies no hints about how this resource will be used. It is therefore assumed that this resource will be read from and written to by CUDA. This is the default value.
  • cudaGraphicsRegisterFlagsReadOnly: Specifies that CUDA will not write to this resource.
  • cudaGraphicsRegisterFlagsWriteDiscard: Specifies that CUDA will not read from this resource and will write over the entire contents of the resource, so none of the data previously stored in the resource will be preserved.
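
For illustration, a minimal sketch (assuming a static import of jcuda.runtime.JCuda.*, that an OpenGL context is current on the calling thread, that vboId is the name of an existing GL buffer object, and that the flag constants are available via jcuda.runtime.cudaGraphicsRegisterFlags):

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.runtime.*;

    cudaGraphicsResource resource = new cudaGraphicsResource();
    cudaGraphicsGLRegisterBuffer(resource, vboId,
        cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsNone);

    // ... map the resource, obtain a device pointer, do CUDA work, unmap ...

    cudaGraphicsUnregisterResource(resource);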

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGLSetGLDevice(int), cudaGraphicsUnregisterResource(jcuda.runtime.cudaGraphicsResource), cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t), cudaGraphicsResourceGetMappedPointer(jcuda.Pointer, long[], jcuda.runtime.cudaGraphicsResource)

cudaGLRegisterBufferObject

public static int cudaGLRegisterBufferObject(int bufObj)
Deprecated. This function is deprecated in the latest CUDA version

Registers a buffer object for access by CUDA.
cudaError_t cudaGLRegisterBufferObject ( GLuint  bufObj  ) 

Deprecated:
This function is deprecated as of CUDA 3.0.
Registers the buffer object of ID bufObj for access by CUDA. This function must be called before CUDA can map the buffer object. The OpenGL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

Returns:
cudaSuccess, cudaErrorInitializationError
See Also:
cudaGraphicsGLRegisterBuffer(jcuda.runtime.cudaGraphicsResource, int, int)

cudaGLMapBufferObject

public static int cudaGLMapBufferObject(Pointer devPtr,
                                        int bufObj)
Deprecated. This function is deprecated in the latest CUDA version

Maps a buffer object for access by CUDA.
cudaError_t cudaGLMapBufferObject ( void **  devPtr,
GLuint  bufObj  
)

Deprecated:
This function is deprecated as of CUDA 3.0.
Maps the buffer object of ID bufObj into the address space of CUDA and returns in *devPtr the base pointer of the resulting mapping. The buffer must have previously been registered by calling cudaGLRegisterBufferObject(). While a buffer is mapped by CUDA, any OpenGL operation which references the buffer will result in undefined behavior. The OpenGL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

All streams in the current thread are synchronized with the current GL context.

Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed
See Also:
cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)

cudaGLUnmapBufferObject

public static int cudaGLUnmapBufferObject(int bufObj)
Deprecated. This function is deprecated in the latest CUDA version

Unmaps a buffer object for access by CUDA.
cudaError_t cudaGLUnmapBufferObject ( GLuint  bufObj  ) 

Deprecated:
This function is deprecated as of CUDA 3.0.
Unmaps the buffer object of ID bufObj for access by CUDA. When a buffer is unmapped, the base address returned by cudaGLMapBufferObject() is invalid and subsequent references to the address result in undefined behavior. The OpenGL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

All streams in the current thread are synchronized with the current GL context.

Returns:
cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorUnmapBufferObjectFailed
See Also:
cudaGraphicsUnmapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)

cudaGLUnregisterBufferObject

public static int cudaGLUnregisterBufferObject(int bufObj)
Deprecated. This function is deprecated in the latest CUDA version

Unregisters a buffer object for access by CUDA.
cudaError_t cudaGLUnregisterBufferObject ( GLuint  bufObj  ) 

Deprecated:
This function is deprecated as of CUDA 3.0.
Unregisters the buffer object of ID bufObj for access by CUDA and releases any CUDA resources associated with the buffer. Once a buffer is unregistered, it may no longer be mapped by CUDA. The GL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

Returns:
cudaSuccess
See Also:
cudaGraphicsUnregisterResource(jcuda.runtime.cudaGraphicsResource)

cudaGLSetBufferObjectMapFlags

public static int cudaGLSetBufferObjectMapFlags(int bufObj,
                                                int flags)
Deprecated. This function is deprecated in the latest CUDA version

Set usage flags for mapping an OpenGL buffer.
cudaError_t cudaGLSetBufferObjectMapFlags ( GLuint  bufObj,
unsigned int  flags  
)

Deprecated:
This function is deprecated as of CUDA 3.0.
Set flags for mapping the OpenGL buffer bufObj.

Changes to flags will take effect the next time bufObj is mapped. The flags argument may be any of the following:

  • cudaGLMapFlagsNone: Specifies no hints about how this buffer will be used. It is therefore assumed that this buffer will be read from and written to by CUDA kernels. This is the default value.
  • cudaGLMapFlagsReadOnly: Specifies that CUDA kernels which access this buffer will not write to the buffer.
  • cudaGLMapFlagsWriteDiscard: Specifies that CUDA kernels which access this buffer will not read from the buffer and will write over the entire contents of the buffer, so none of the data previously stored in the buffer will be preserved.

If bufObj has not been registered for use with CUDA, then cudaErrorInvalidResourceHandle is returned. If bufObj is presently mapped for access by CUDA, then cudaErrorUnknown is returned.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsResourceSetMapFlags(jcuda.runtime.cudaGraphicsResource, int)

cudaGLMapBufferObjectAsync

public static int cudaGLMapBufferObjectAsync(Pointer devPtr,
                                             int bufObj,
                                             cudaStream_t stream)
Deprecated. This function is deprecated in the latest CUDA version

Maps a buffer object for access by CUDA.
cudaError_t cudaGLMapBufferObjectAsync ( void **  devPtr,
GLuint  bufObj,
cudaStream_t  stream  
)

Deprecated:
This function is deprecated as of CUDA 3.0.
Maps the buffer object of ID bufObj into the address space of CUDA and returns in *devPtr the base pointer of the resulting mapping. The buffer must have previously been registered by calling cudaGLRegisterBufferObject(). While a buffer is mapped by CUDA, any OpenGL operation which references the buffer will result in undefined behavior. The OpenGL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

Stream stream is synchronized with the current GL context.

Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed
See Also:
cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)

cudaGLUnmapBufferObjectAsync

public static int cudaGLUnmapBufferObjectAsync(int bufObj,
                                               cudaStream_t stream)
Deprecated. This function is deprecated in the latest CUDA version

Unmaps a buffer object for access by CUDA.
cudaError_t cudaGLUnmapBufferObjectAsync ( GLuint  bufObj,
cudaStream_t  stream  
)

Deprecated:
This function is deprecated as of CUDA 3.0.
Unmaps the buffer object of ID bufObj for access by CUDA. When a buffer is unmapped, the base address returned by cudaGLMapBufferObject() is invalid and subsequent references to the address result in undefined behavior. The OpenGL context used to create the buffer, or another context from the same share group, must be bound to the current thread when this is called.

Stream stream is synchronized with the current GL context.

Returns:
cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorUnmapBufferObjectFailed
See Also:
cudaGraphicsUnmapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)

cudaDriverGetVersion

public static int cudaDriverGetVersion(int[] driverVersion)
Returns the CUDA driver version.
cudaError_t cudaDriverGetVersion ( int *  driverVersion  ) 

Returns in *driverVersion the version number of the installed CUDA driver. If no driver is installed, then 0 is returned as the driver version (via driverVersion). This function automatically returns cudaErrorInvalidValue if the driverVersion argument is NULL.

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaRuntimeGetVersion(int[])

cudaRuntimeGetVersion

public static int cudaRuntimeGetVersion(int[] runtimeVersion)
Returns the CUDA Runtime version.
cudaError_t cudaRuntimeGetVersion ( int *  runtimeVersion  ) 

Returns in *runtimeVersion the version number of the installed CUDA Runtime. This function automatically returns cudaErrorInvalidValue if the runtimeVersion argument is NULL.
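
For illustration (assuming a static import of jcuda.runtime.JCuda.*), both version numbers can be queried as follows; they are encoded as 1000 * major + 10 * minor, so CUDA 4.2 is reported as 4020:

    // Requires: import static jcuda.runtime.JCuda.*;

    int[] driverVersion = new int[1];
    int[] runtimeVersion = new int[1];
    cudaDriverGetVersion(driverVersion);
    cudaRuntimeGetVersion(runtimeVersion);
    System.out.println("Driver version:  " + driverVersion[0]);
    System.out.println("Runtime version: " + runtimeVersion[0]);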

Returns:
cudaSuccess, cudaErrorInvalidValue
See Also:
cudaDriverGetVersion(int[])

cudaPointerGetAttributes

public static int cudaPointerGetAttributes(cudaPointerAttributes attributes,
                                           Pointer ptr)
Returns attributes about a specified pointer.
cudaError_t cudaPointerGetAttributes ( struct cudaPointerAttributes *  attributes,
void *  ptr  
)

Returns in *attributes the attributes of the pointer ptr.

The cudaPointerAttributes structure is defined as:

 struct cudaPointerAttributes {
     enum cudaMemoryType memoryType;
     int device;
     void *devicePointer;
     void *hostPointer;
 }
 
In this structure, the individual fields mean

  • memoryType identifies the physical location of the memory associated with pointer ptr. It can be cudaMemoryTypeHost for host memory or cudaMemoryTypeDevice for device memory.

  • device is the device against which ptr was allocated. If ptr has memory type cudaMemoryTypeDevice then this identifies the device on which the memory referred to by ptr physically resides. If ptr has memory type cudaMemoryTypeHost then this identifies the device which was current when the allocation was made (and if that device is deinitialized then this allocation will vanish with that device's state).

  • devicePointer is the device pointer alias through which the memory referred to by ptr may be accessed on the current device. If the memory referred to by ptr cannot be accessed directly by the current device then this is NULL.

  • hostPointer is the host pointer alias through which the memory referred to by ptr may be accessed on the host. If the memory referred to by ptr cannot be accessed directly by the host then this is NULL.
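
For illustration, a minimal sketch (assuming a static import of jcuda.runtime.JCuda.*, the classes from jcuda and jcuda.runtime, and a platform with unified addressing) that queries the attributes of a pointer obtained from cudaMalloc:

    // Requires: import static jcuda.runtime.JCuda.*; import jcuda.*; import jcuda.runtime.*;

    Pointer devPtr = new Pointer();
    cudaMalloc(devPtr, 1024);

    cudaPointerAttributes attributes = new cudaPointerAttributes();
    cudaPointerGetAttributes(attributes, devPtr);

    // For a cudaMalloc'ed pointer, memoryType is cudaMemoryTypeDevice.
    System.out.println("Memory type: " + attributes.memoryType);
    System.out.println("Device:      " + attributes.device);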

Returns:
cudaSuccess, cudaErrorInvalidDevice
See Also:
cudaGetDeviceCount(int[]), cudaGetDevice(int[]), cudaSetDevice(int), cudaChooseDevice(int[], jcuda.runtime.cudaDeviceProp)

cudaDeviceCanAccessPeer

public static int cudaDeviceCanAccessPeer(int[] canAccessPeer,
                                          int device,
                                          int peerDevice)
Queries if a device may directly access a peer device's memory.
cudaError_t cudaDeviceCanAccessPeer ( int *  canAccessPeer,
int  device,
int  peerDevice  
)

Returns in *canAccessPeer a value of 1 if device device is capable of directly accessing memory from peerDevice and 0 otherwise. If direct access of peerDevice from device is possible, then access may be enabled by calling cudaDeviceEnablePeerAccess().
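
For illustration, a sketch (assuming a static import of jcuda.runtime.JCuda.*; the device ordinals 0 and 1 are placeholders) that checks whether device 0 can access memory on device 1 and, if so, enables that access:

    // Requires: import static jcuda.runtime.JCuda.*;

    int[] canAccessPeer = new int[1];
    cudaDeviceCanAccessPeer(canAccessPeer, 0, 1);
    if (canAccessPeer[0] == 1)
    {
        // Peer access is unidirectional: this enables access from
        // device 0 (made current here) to memory on device 1.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
    }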

Returns:
cudaSuccess, cudaErrorInvalidDevice
See Also:
cudaDeviceEnablePeerAccess(int, int), cudaDeviceDisablePeerAccess(int)

cudaDeviceEnablePeerAccess

public static int cudaDeviceEnablePeerAccess(int peerDevice,
                                             int flags)
Enables direct access to memory allocations on a peer device.
cudaError_t cudaDeviceEnablePeerAccess ( int  peerDevice,
unsigned int  flags  
)

Enables registering memory on peerDevice for direct access from the current device. On success, allocations on peerDevice may be registered for access from the current device using cudaPeerRegister(). Registering peer memory will be possible until it is explicitly disabled using cudaDeviceDisablePeerAccess(), or either the current device or peerDevice is reset using cudaDeviceReset().

If both the current device and peerDevice support unified addressing then all allocations from peerDevice will immediately be accessible by the current device upon success. In this case, explicitly sharing allocations using cudaPeerRegister() is not necessary.

Note that access granted by this call is unidirectional and that in order to access memory on the current device from peerDevice, a separate symmetric call to cudaDeviceEnablePeerAccess() is required.

Returns cudaErrorInvalidDevice if cudaDeviceCanAccessPeer() indicates that the current device cannot directly access memory from peerDevice.

Returns cudaErrorPeerAccessAlreadyEnabled if direct access of peerDevice from the current device has already been enabled.

Returns cudaErrorInvalidValue if flags is not 0.

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorPeerAccessAlreadyEnabled, cudaErrorInvalidValue
See Also:
cudaDeviceCanAccessPeer(int[], int, int), cudaDeviceDisablePeerAccess(int)

cudaDeviceDisablePeerAccess

public static int cudaDeviceDisablePeerAccess(int peerDevice)
Disables direct access to memory allocations on a peer device and unregisters any registered allocations from that device.
cudaError_t cudaDeviceDisablePeerAccess ( int  peerDevice  ) 

Disables registering memory on peerDevice for direct access from the current device. If there are any allocations on peerDevice which were registered in the current device using cudaPeerRegister() then these allocations will be automatically unregistered.

Returns cudaErrorPeerAccessNotEnabled if direct access to memory on peerDevice has not yet been enabled from the current device.

Returns:
cudaSuccess, cudaErrorPeerAccessNotEnabled, cudaErrorInvalidDevice
See Also:
cudaDeviceCanAccessPeer(int[], int, int), cudaDeviceEnablePeerAccess(int, int)

cudaGraphicsUnregisterResource

public static int cudaGraphicsUnregisterResource(cudaGraphicsResource resource)
Unregisters a graphics resource for access by CUDA.
cudaError_t cudaGraphicsUnregisterResource ( cudaGraphicsResource_t  resource  ) 

Unregisters the graphics resource resource so it is not accessible by CUDA unless registered again.

If resource is invalid then cudaErrorInvalidResourceHandle is returned.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsGLRegisterBuffer(jcuda.runtime.cudaGraphicsResource, int, int), cudaGraphicsGLRegisterImage(jcuda.runtime.cudaGraphicsResource, int, int, int)

cudaGraphicsResourceSetMapFlags

public static int cudaGraphicsResourceSetMapFlags(cudaGraphicsResource resource,
                                                  int flags)
Set usage flags for mapping a graphics resource.
cudaError_t cudaGraphicsResourceSetMapFlags(cudaGraphicsResource_t resource, unsigned int flags)

Set flags for mapping the graphics resource resource.

Changes to flags will take effect the next time resource is mapped. The flags argument may be any of the following:

  • cudaGraphicsMapFlagsNone: Specifies no hints about how resource will be used. It is therefore assumed that CUDA may read from or write to resource.
  • cudaGraphicsMapFlagsReadOnly: Specifies that CUDA will not write to resource.
  • cudaGraphicsMapFlagsWriteDiscard: Specifies CUDA will not read from resource and will write over the entire contents of resource, so none of the data previously stored in resource will be preserved.

If resource is presently mapped for access by CUDA then cudaErrorUnknown is returned. If flags is not one of the above values then cudaErrorInvalidValue is returned.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)
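
As a hedged sketch, the following fragment hints that a previously registered resource will only be read by CUDA during subsequent mappings (the registration itself, e.g. via cudaGraphicsGLRegisterBuffer, is assumed to have happened elsewhere, and the cudaGraphicsMapFlags constants are assumed to be available as static fields, as is usual for JCuda enum classes):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaGraphicsMapFlags;
import jcuda.runtime.cudaGraphicsResource;

public class MapFlagsSketch
{
    // 'resource' is assumed to be registered and currently unmapped
    public static void hintReadOnly(cudaGraphicsResource resource)
    {
        // Takes effect the next time the resource is mapped
        JCuda.cudaGraphicsResourceSetMapFlags(resource,
            cudaGraphicsMapFlags.cudaGraphicsMapFlagsReadOnly);
    }
}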

cudaGraphicsMapResources

public static int cudaGraphicsMapResources(int count,
                                           cudaGraphicsResource[] resources,
                                           cudaStream_t stream)
Map graphics resources for access by CUDA.
cudaError_t cudaGraphicsMapResources(int count, cudaGraphicsResource_t* resources, cudaStream_t stream = 0)

Maps the count graphics resources in resources for access by CUDA.

The resources in resources may be accessed by CUDA until they are unmapped. The graphics API from which resources were registered should not access any resources while they are mapped by CUDA. If an application does so, the results are undefined.

This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.

If resources contains any duplicate entries then cudaErrorInvalidResourceHandle is returned. If any of resources are presently mapped for access by CUDA then cudaErrorUnknown is returned.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsResourceGetMappedPointer(jcuda.Pointer, long[], jcuda.runtime.cudaGraphicsResource)

cudaGraphicsUnmapResources

public static int cudaGraphicsUnmapResources(int count,
                                             cudaGraphicsResource[] resources,
                                             cudaStream_t stream)
Unmap graphics resources.
cudaError_t cudaGraphicsUnmapResources(int count, cudaGraphicsResource_t* resources, cudaStream_t stream = 0)

Unmaps the count graphics resources in resources.

Once unmapped, the resources in resources may not be accessed by CUDA until they are mapped again.

This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.

If resources contains any duplicate entries then cudaErrorInvalidResourceHandle is returned. If any of resources are not presently mapped for access by CUDA then cudaErrorUnknown is returned.

Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t)

cudaGraphicsResourceGetMappedPointer

public static int cudaGraphicsResourceGetMappedPointer(Pointer devPtr,
                                                       long[] size,
                                                       cudaGraphicsResource resource)
Get a device pointer through which to access a mapped graphics resource.
cudaError_t cudaGraphicsResourceGetMappedPointer(void** devPtr, size_t* size, cudaGraphicsResource_t resource)

Returns in *devPtr a pointer through which the mapped graphics resource resource may be accessed. Returns in *size the size of the memory in bytes which may be accessed from that pointer. The value set in devPtr may change every time that resource is mapped.

If resource is not a buffer then it cannot be accessed via a pointer and cudaErrorUnknown is returned. If resource is not mapped then cudaErrorUnknown is returned.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsMapResources(int, jcuda.runtime.cudaGraphicsResource[], jcuda.runtime.cudaStream_t), cudaGraphicsSubResourceGetMappedArray(jcuda.runtime.cudaArray, jcuda.runtime.cudaGraphicsResource, int, int)
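
To illustrate the typical map/access/unmap cycle, a minimal sketch might read back data from a registered buffer resource as follows (the prior registration, the element count n, and the helper name readBack are assumptions for the example):

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaGraphicsResource;
import jcuda.runtime.cudaMemcpyKind;
import jcuda.runtime.cudaStream_t;

public class MappedPointerSketch
{
    // Copies the first n float values of a registered buffer resource into hostData.
    // The resource is assumed to be registered already (e.g. via cudaGraphicsGLRegisterBuffer).
    public static void readBack(cudaGraphicsResource resource, float hostData[], int n)
    {
        cudaStream_t stream = new cudaStream_t();
        JCuda.cudaStreamCreate(stream);

        // Map the resource so that CUDA may access it
        cudaGraphicsResource resources[] = { resource };
        JCuda.cudaGraphicsMapResources(1, resources, stream);

        // Obtain the device pointer and the size (in bytes) of the mapped buffer
        Pointer devPtr = new Pointer();
        long size[] = { 0 };
        JCuda.cudaGraphicsResourceGetMappedPointer(devPtr, size, resource);

        // The device pointer is only valid while the resource is mapped
        JCuda.cudaMemcpy(Pointer.to(hostData), devPtr,
            n * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // Unmap so that the graphics API may access the resource again
        JCuda.cudaGraphicsUnmapResources(1, resources, stream);
        JCuda.cudaStreamDestroy(stream);
    }
}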

cudaGraphicsSubResourceGetMappedArray

public static int cudaGraphicsSubResourceGetMappedArray(cudaArray arrayPtr,
                                                        cudaGraphicsResource resource,
                                                        int arrayIndex,
                                                        int mipLevel)
Get an array through which to access a subresource of a mapped graphics resource.
cudaError_t cudaGraphicsSubResourceGetMappedArray(struct cudaArray** array, cudaGraphicsResource_t resource, unsigned int arrayIndex, unsigned int mipLevel)

Returns in *array an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in array may change every time that resource is mapped.

If resource is not a texture then it cannot be accessed via an array and cudaErrorUnknown is returned. If arrayIndex is not a valid array index for resource then cudaErrorInvalidValue is returned. If mipLevel is not a valid mipmap level for resource then cudaErrorInvalidValue is returned. If resource is not mapped then cudaErrorUnknown is returned.

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidResourceHandle, cudaErrorUnknown
See Also:
cudaGraphicsResourceGetMappedPointer(jcuda.Pointer, long[], jcuda.runtime.cudaGraphicsResource)
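
A hedged sketch of obtaining the CUDA array behind array index 0, mipmap level 0 of a registered texture resource might look as follows (the prior registration, e.g. via cudaGraphicsGLRegisterImage, and the surrounding unmapping code are assumptions for the example):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaArray;
import jcuda.runtime.cudaGraphicsResource;
import jcuda.runtime.cudaStream_t;

public class MappedArraySketch
{
    // 'resource' is assumed to be a registered texture resource
    public static cudaArray getLevelZeroArray(cudaGraphicsResource resource, cudaStream_t stream)
    {
        cudaGraphicsResource resources[] = { resource };
        JCuda.cudaGraphicsMapResources(1, resources, stream);

        // The array is only valid while the resource remains mapped
        cudaArray array = new cudaArray();
        JCuda.cudaGraphicsSubResourceGetMappedArray(array, resource, 0, 0);
        return array; // the caller unmaps via cudaGraphicsUnmapResources when done
    }
}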

cudaProfilerInitialize

public static int cudaProfilerInitialize(java.lang.String configFile,
                                         java.lang.String outputFile,
                                         int outputMode)
Initialize the profiling.
cudaError_t cudaProfilerInitialize(const char* configFile, const char* outputFile, cudaOutputMode_t outputMode)

Using this API, the user can specify the configuration file, the output file, and the output file format. This API is generally used to profile different sets of counters by looping over the kernel launches. The configFile parameter can be used to select profiling options, including profiler counters. Refer to the "Command Line Profiler" section in the "Compute Visual Profiler User Guide" for supported profiler options and counters.

Configurations defined initially by environment variable settings are overwritten by cudaProfilerInitialize().

Limitation: The profiling APIs do not work when the application is running under a profiler tool such as the Compute Visual Profiler. The user must handle the error cudaErrorProfilerDisabled returned by the profiling APIs if the application is likely to be used with such a tool.

Typical usage of the profiling APIs is as follows:

for each set of counters
{
    cudaProfilerInitialize(); // initialize profiling, set the counters/options in the config file
    ...
    cudaProfilerStart();
    // code to be profiled
    cudaProfilerStop();
    ...
    cudaProfilerStart();
    // code to be profiled
    cudaProfilerStop();
    ...
}

Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorProfilerDisabled
See Also:
cudaProfilerStart(), cudaProfilerStop()
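
In Java, one iteration of the loop above might be sketched as follows (the file names profiler.cfg and profile.csv are placeholders, and the cudaOutputMode constants are assumed to be available as static fields, as is usual for JCuda enum classes):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaOutputMode;

public class ProfilerSketch
{
    public static void main(String args[])
    {
        // The config file is assumed to list the counters/options to collect
        JCuda.cudaProfilerInitialize("profiler.cfg", "profile.csv",
            cudaOutputMode.cudaCSV);

        JCuda.cudaProfilerStart();
        // ... code to be profiled, e.g. kernel launches ...
        JCuda.cudaProfilerStop();
    }
}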

cudaProfilerStart

public static int cudaProfilerStart()
Start the profiling.
cudaError_t cudaProfilerStart(void)

This API is used in conjunction with cudaProfilerStop to selectively profile subsets of the CUDA program. The profiler must be initialized using cudaProfilerInitialize() before making a call to cudaProfilerStart(). The API returns the error cudaErrorProfilerNotInitialized if it is called without initializing the profiler.

Returns:
cudaSuccess, cudaErrorProfilerDisabled, cudaErrorProfilerAlreadyStarted, cudaErrorProfilerNotInitialized
See Also:
cudaProfilerInitialize(java.lang.String, java.lang.String, int), cudaProfilerStop()

cudaProfilerStop

public static int cudaProfilerStop()
Stop the profiling.
cudaError_t cudaProfilerStop(void)

This API is used in conjunction with cudaProfilerStart to selectively profile subsets of the CUDA program. The profiler must be initialized using cudaProfilerInitialize() before making a call to cudaProfilerStop(). The API returns the error cudaErrorProfilerNotInitialized if it is called without initializing the profiler.

Returns:
cudaSuccess, cudaErrorProfilerDisabled, cudaErrorProfilerAlreadyStopped, cudaErrorProfilerNotInitialized
See Also:
cudaProfilerInitialize(java.lang.String, java.lang.String, int), cudaProfilerStart()