
CUDA GPU Memory Types


1. Memory Hierarchy

The CPU and GPU have separate memory spaces. This means that data processed by the GPU must be transferred from the CPU to the GPU before computation begins, and the results of the computation must be transferred back to the CPU once processing is complete.
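As a minimal sketch of this round trip (the buffer names h_data and d_data are illustrative, not from the article):

#include <cuda_runtime.h>

int main() {
    const int N = 4;
    float h_data[N] = {1.0f, 2.0f, 3.0f, 4.0f};   // host (CPU) memory

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));        // device (GPU) memory

    // Host -> device: the GPU cannot touch h_data directly
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that operate on d_data here ...

    // Device -> host: bring the results back once processing is complete
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}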

CUDA threads may access data from multiple memory spaces during their execution. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. Thread blocks in a thread block cluster can perform read, write, and atomic operations on each other’s shared memory. All threads have access to the same global memory.
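A toy kernel makes these scopes concrete (the names g_counter, s_buf, and hierarchyDemo are illustrative, and it assumes blocks of at most 128 threads):

__device__ int g_counter;                  // global memory: visible to every thread

__global__ void hierarchyDemo(int *d_out) {
    int myPrivate = threadIdx.x;           // thread-private (register or local memory)
    __shared__ int s_buf[128];             // shared memory: one copy per block
    s_buf[threadIdx.x] = myPrivate;
    __syncthreads();                       // synchronize the threads of this block
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x] + g_counter;
}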

Figure 1 Memory Hierarchy
Figure 2 Memory Hierarchy Overview
Figure 3 The GPU Devotes More Transistors to Data Processing

2. Types of Memory in the GPU

Off-Chip Memories (High Latency, Large Capacity)

  1. Global Memory
  2. Constant Memory
  3. Texture Memory

On-Chip Memories (Low Latency, Small Capacity)

  1. Shared Memory
  2. Local Memory (thread-private; physically resides off-chip, as explained below)
  3. Registers

Caches and Other Components

  1. Caches (L1/L2)

Off-Chip Memories (High Latency, Large Capacity)

1. Global Memory (DRAM)

Overview:

  • Speed: Slow
  • Cache: Cached in L2 (and optionally L1) on modern GPUs
  • Access: All threads (and the host)

Description:

There’s a large amount of global memory. It’s slower to access than on-chip memories like shared memory and registers. All running threads can read and write global memory, and so can the CPU. The functions cudaMalloc, cudaFree, cudaMemcpy, and cudaMemset all operate on global memory. Global memory is allocated and deallocated by the host.

This is the main memory store of the GPU; every byte is addressable. It is persistent across kernel calls.

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.

The alignment requirement is automatically fulfilled for the Built-in Vector Types.
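For instance, the built-in float4 type is 16 bytes and 16-byte aligned, so one thread's read of a float4 element compiles to a single 16-byte load; a user-defined struct can request the same alignment explicitly (copyVec and MyVec below are illustrative sketches, not code from the article):

// One 16-byte load and one 16-byte store per thread:
__global__ void copyVec(const float4 *in, float4 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// A user-defined type can request natural 16-byte alignment explicitly:
struct __align__(16) MyVec {
    float x, y, z, w;
};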

Usage:

  • Declared outside of any function: __device__ int globalArray[256];
  • Assigned with cudaMemcpy (a blocking transfer; the host thread waits until the transfer completes), or allocated dynamically:

int *myDeviceMemory = 0;
cudaMalloc(&myDeviceMemory, 256 * sizeof(int));

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// 1. Declare the __device__ variable at global scope.
// This variable will hold a constant scaling factor for all threads.
__device__ float global_scale_factor;

// Kernel that runs on the GPU
__global__ void scaleArrayKernel(float *d_array, int N) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        // 3. Access and use the __device__ variable in a calculation.
        // Each thread multiplies its array element by global_scale_factor.
        d_array[tid] *= global_scale_factor;
    }
}

int main() {
    // --- Host Setup ---
    int N;
    printf("Enter array size: ");
    scanf("%d", &N);
    size_t size = N * sizeof(float);

    // Host data for the array
    float *h_array = (float *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_array[i] = (float)i + 1.0f; // Initialize to 1.0, 2.0, ..., N
    }

    // Host value for the scaling factor
    float h_scale_factor;
    printf("Enter scaling factor: ");
    scanf("%f", &h_scale_factor);

    // --- Device Allocation and Copy ---
    float *d_array;
    cudaMalloc(&d_array, size);
    cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

    // 2. Initialize the __device__ variable from the host using cudaMemcpyToSymbol.
    // Copy the scaling factor to the device's global memory symbol.
    cudaMemcpyToSymbol(global_scale_factor, &h_scale_factor, sizeof(float), 0, cudaMemcpyHostToDevice);

    // Print initial state
    printf("Initial array elements: %.1f, %.1f, ..., %.1f\n", h_array[0], h_array[1], h_array[N - 1]);
    printf("Scaling factor (__device__ variable) set to: %.2f\n", h_scale_factor);

    // --- Launch Kernel ---
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    scaleArrayKernel<<<numBlocks, blockSize>>>(d_array, N);

    // Synchronize and check for errors
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    // --- Copy Results Back ---
    float *h_result_array = (float *)malloc(size);
    cudaMemcpy(h_result_array, d_array, size, cudaMemcpyDeviceToHost);

    // --- Verification ---
    printf("\nArray elements after scaling (multiplied by %.2f):\n", h_scale_factor);
    // 4. Verify the calculation: original_value * global_scale_factor
    for (int i = 0; i < N; ++i) {
        float expected = h_array[i] * h_scale_factor;
        printf("Index %2d: Result %.2f (Expected %.2f)\n", i, h_result_array[i], expected);
    }

    // --- Cleanup ---
    cudaFree(d_array);
    free(h_array);
    free(h_result_array);
    return 0;
}

The previous code initializes the array h_array and multiplies each element by the factor stored in the __device__ variable global_scale_factor.

2. Constant Memory (Read-Only)

Overview:

  • Speed: Slow
  • Cache: Cached
  • Access: All Threads

Description:

Constant memory holds data that does not change over the course of the computation; it is read-only from device code.

This memory is physically part of the GPU’s main memory, but it has its own dedicated cache, separate from the L1 and L2 caches used for global memory. All threads have access to the same constant memory, but they can only read it; they can’t write to it. The CPU sets the values in constant memory before launching the kernel.

It is very fast (register and shared memory speeds) if all running threads in a warp read the same address.

There is only 64 KB of constant memory available on my machine (see deviceQueryDrv).

All running threads share constant memory. In graphics programming, this memory holds constants such as the model, view, and projection matrices.

Since it is cached, a read that hits the constant cache can take as little as one clock cycle, as opposed to 100+ cycles for global memory.

Usage:

  • Declare it outside of main(): __constant__ float cdata; // available in all scopes
  • To get numbers into the constant memory variable cdata, use

cudaMemcpyToSymbol(const void* symbol, const void* src, size_t count, size_t offset = 0, cudaMemcpyKind kind = cudaMemcpyHostToDevice)

  • The constant memory space resides in device memory and is cached in the constant cache.
  • As fast as a register if all threads read the same address.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Define the size of our constant filter/offset array
#define FILTER_SIZE 3

// 1. Declare the __constant__ variable at global scope
__constant__ float c_filter_offsets[FILTER_SIZE];

// Kernel that runs on the GPU
__global__ void applyFilter(float *d_data, int N) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        int offset_index = tid % FILTER_SIZE;
        d_data[tid] += c_filter_offsets[offset_index];
    }
}

int main() {
    // --- Host Setup ---
    const int N = 10;
    size_t size = N * sizeof(float);

    // Dynamically allocate the host array
    float *h_data = (float *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_data[i] = (float)i * 10.0f; // 0, 10, 20, ..., 90
    }

    // Dynamically allocate the filter array
    float *h_filter_offsets = (float *)malloc(FILTER_SIZE * sizeof(float));
    h_filter_offsets[0] = 1.0f;
    h_filter_offsets[1] = 0.5f;
    h_filter_offsets[2] = 2.0f;

    // --- Device Allocation ---
    float *d_data;
    cudaMalloc(&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Copy the filter offsets into __constant__ memory
    cudaMemcpyToSymbol(c_filter_offsets, h_filter_offsets,
                       FILTER_SIZE * sizeof(float), 0, cudaMemcpyHostToDevice);

    printf("Initial Data: {%.0f, %.0f, %.0f, ...}\n", h_data[0], h_data[1], h_data[2]);
    printf("Constant Offsets: {%.1f, %.1f, %.1f}\n",
           h_filter_offsets[0], h_filter_offsets[1], h_filter_offsets[2]);

    // --- Launch Kernel ---
    int blockSize = 4;
    int numBlocks = (N + blockSize - 1) / blockSize;
    applyFilter<<<numBlocks, blockSize>>>(d_data, N);
    cudaDeviceSynchronize();

    // Error check
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    // --- Copy Results Back ---
    float *h_result_data = (float *)malloc(size);
    cudaMemcpy(h_result_data, d_data, size, cudaMemcpyDeviceToHost);

    // --- Print Verification Table ---
    printf("\nProcessed Data (Original + Constant Offset):\n");
    printf("| Index | Original | Offset | Result |\n");
    printf("|-------|----------|--------|--------|\n");
    for (int i = 0; i < N; ++i) {
        int offset_index = i % FILTER_SIZE;
        printf("| %5d | %8.1f | %6.1f | %6.1f |\n",
               i, h_data[i], h_filter_offsets[offset_index], h_result_data[i]);
    }

    // --- Cleanup ---
    cudaFree(d_data);
    free(h_data);
    free(h_filter_offsets);
    free(h_result_data);
    return 0;
}

The previous code initializes the array h_data and adds an offset from the constant array c_filter_offsets to each element.

3. Texture Memory (Read-Only)

Overview:

  • Speed: Slow
  • Cache: Cached (optimized for 2D spatial locality)
  • Access: All Threads

Description:

Like constant memory, texture memory is:

  • Read-only
  • Resident in device memory (the texture and surface memory spaces)
  • Cached, in its own texture cache

So a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache.

The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance.

Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.

Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:

  • If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved, provided that there is locality in the texture fetches or surface reads.
  • Addressing calculations are performed outside the kernel by dedicated units.
  • Packed data may be broadcast to separate variables in a single operation.
  • 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

// Define the dimensions of the input array (image)
#define WIDTH 10
#define HEIGHT 10

// Kernel to read from the texture at fractional coordinates.
// cudaTextureObject_t is a handle created on the host; it must be passed to
// the kernel as an argument, since device code cannot read host globals.
__global__ void interpolateTextureKernel(cudaTextureObject_t texObject, float *d_out,
                                         int out_width, int out_height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < out_width && y < out_height) {
        int tid = y * out_width + x;
        // Sample the input data at half-step resolution to demonstrate interpolation.
        float u = (float)x / (float)out_width;
        float v = (float)y / (float)out_height;
        // Access texture memory through the texture object API:
        // tex2D<float>(textureObject, x_coordinate, y_coordinate).
        // The hardware performs bilinear interpolation based on the settings
        // chosen during texture object creation.
        d_out[tid] = tex2D<float>(texObject, u * WIDTH, v * HEIGHT);
    }
}

// Helper function to check for CUDA errors
void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    // --- Host Setup ---
    int in_width = WIDTH, in_height = HEIGHT;
    int out_width = WIDTH * 2, out_height = HEIGHT * 2; // Output is twice the size
    size_t in_size = in_width * in_height * sizeof(float);
    size_t out_size = out_width * out_height * sizeof(float);

    // Initialize input data (a simple gradient pattern)
    float *h_in = (float *)malloc(in_size);
    for (int j = 0; j < in_height; ++j) {
        for (int i = 0; i < in_width; ++i) {
            h_in[j * in_width + i] = (float)(i + j); // Data value is (row + column)
        }
    }
    float *h_out = (float *)malloc(out_size);

    // --- Device Allocation and Texture Setup ---
    // 1. Allocate device memory for the input data. Textures can sample from
    // linear memory, but a cudaArray is the more traditional and often more
    // optimized backing store for 2D textures.
    cudaArray *d_array;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    checkCudaError(cudaMallocArray(&d_array, &channelDesc, in_width, in_height), "cudaMallocArray failed");

    // Copy the host data to the allocated cudaArray
    checkCudaError(cudaMemcpy2DToArray(d_array, 0, 0, h_in, in_width * sizeof(float),
                                       in_width * sizeof(float), in_height, cudaMemcpyHostToDevice),
                   "cudaMemcpy2DToArray failed");

    // 2. Define the texture resource and the texture object (the handle)
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = d_array;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeClamp; // clamp to edge in x
    texDesc.addressMode[1] = cudaAddressModeClamp; // clamp to edge in y
    texDesc.filterMode = cudaFilterModeLinear;     // THE KEY: enables bilinear interpolation
    texDesc.readMode = cudaReadModeElementType;    // data is read as floating point
    texDesc.normalizedCoords = 0;                  // use absolute (unnormalized) coordinates

    cudaTextureObject_t texObject;
    checkCudaError(cudaCreateTextureObject(&texObject, &resDesc, &texDesc, NULL),
                   "cudaCreateTextureObject failed");

    // Allocate device memory for output
    float *d_out;
    checkCudaError(cudaMalloc(&d_out, out_size), "cudaMalloc failed for d_out");

    // --- Launch Kernel ---
    dim3 threads(16, 16);
    dim3 blocks((out_width + threads.x - 1) / threads.x, (out_height + threads.y - 1) / threads.y);
    interpolateTextureKernel<<<blocks, threads>>>(texObject, d_out, out_width, out_height);

    // Synchronize and check for errors
    cudaDeviceSynchronize();
    checkCudaError(cudaGetLastError(), "Kernel execution failed");

    // --- Copy Results Back ---
    checkCudaError(cudaMemcpy(h_out, d_out, out_size, cudaMemcpyDeviceToHost), "cudaMemcpy failed for h_out");

    // --- Verification ---
    // Input data (4x4 top-left corner):
    //   0.0 1.0 2.0 3.0
    //   1.0 2.0 3.0 4.0
    //   2.0 3.0 4.0 5.0
    //   3.0 4.0 5.0 6.0
    // Expected interpolated values, e.g.:
    //   halfway between 0.0 and 1.0               -> 0.5
    //   center of the 0.0/1.0/1.0/2.0 quad        -> (0.0 + 1.0 + 1.0 + 2.0) / 4 = 1.0
    printf("Output data (4x4 top-left corner, interpolated):\n");
    for (int j = 0; j < 4; ++j) {
        for (int i = 0; i < 4; ++i) {
            printf("%.1f ", h_out[j * out_width + i]);
        }
        printf("\n");
    }

    // --- Cleanup ---
    checkCudaError(cudaDestroyTextureObject(texObject), "cudaDestroyTextureObject failed");
    checkCudaError(cudaFreeArray(d_array), "cudaFreeArray failed");
    checkCudaError(cudaFree(d_out), "cudaFree failed for d_out");
    free(h_in);
    free(h_out);
    return 0;
}
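The previous code uploads a 10×10 gradient into a cudaArray, binds it to a texture object with linear filtering, and samples it at 20×20 resolution, so the printed output shows hardware-interpolated values.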

On-Chip Memories (Low Latency, Small Capacity)

1. Shared Memory

Overview:

  • Speed: Fast (subject to bank conflicts)
  • Cache: None needed (shared memory is itself on-chip)
  • Access: All threads in a block

Description:

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.

Shared memory is very fast (register speeds). Shared memory is used to enable fast communication between threads in a block. Shared memory only exists for the lifetime of the block.

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.

If two addresses of a memory request fall in the same memory bank, there is a bank conflict, and the access has to be serialized.

If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

So, bank conflicts can slow access down. Access is fastest when the threads of a warp read from different banks, or when all threads of a warp read the same value (a broadcast). Bank conflicts are only possible within a warp; no bank conflicts occur between different warps.
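As a sketch of the two extremes (assuming the usual 32 banks of 4-byte words; the kernel name bankDemo and a one-warp <<<1, 32>>> launch are illustrative assumptions, not from the article):

#define TILE 32

__global__ void bankDemo(float *out) {
    __shared__ float s[TILE * TILE];
    int tid = threadIdx.x; // assumes a single 32-thread (one-warp) block

    s[tid] = (float)tid;
    __syncthreads();

    // Conflict-free: consecutive threads read consecutive 4-byte words,
    // which map to 32 different banks.
    float fast = s[tid];

    // 32-way conflict: stride-32 indexing sends every thread of the warp
    // to the same bank, so the 32 reads are serialized.
    float slow = s[tid * TILE];

    out[tid] = fast + slow;
}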

Example: Parallel Sum Reduction Using Shared Memory

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Define the size of each block (and the shared memory array)
#define BLOCK_SIZE 512

// Kernel for parallel array reduction
__global__ void reduceSum(float *g_in, float *g_out, int N) {
    // 1. Declare shared memory.
    // Using a fixed size here. This lives on-chip and is very fast.
    __shared__ float sdata[BLOCK_SIZE];

    // Calculate thread indices
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 2. Load global memory -> shared memory (cooperatively)
    if (i < N) {
        sdata[tid] = g_in[i];
    } else {
        sdata[tid] = 0.0f; // pad when N is not a multiple of the block size
    }
    // Wait for all threads in the block to load their data
    __syncthreads();

    // 3. Perform the reduction in power-of-two steps.
    // Example for BLOCK_SIZE = 8: stride = 4 -> 2 -> 1
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads(); // mandatory barrier after each reduction step
    }

    // 4. Thread 0 writes the block's result to the output array
    if (tid == 0) {
        g_out[blockIdx.x] = sdata[0];
    }
}

int main() {
    // --- Host Setup ---
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    printf("Running classic shared-memory parallel reduction on %d elements...\n", N);

    // Allocate host memory; initialize input to all 1.0f, so the sum should be N
    float *h_in = (float *)malloc(size);
    for (int i = 0; i < N; i++) {
        h_in[i] = 1.0f;
    }

    // --- Device Setup ---
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);

    // Number of blocks needed; the output array holds `numBlocks` partial sums
    int numBlocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    cudaMalloc(&d_out, numBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // --- Launch Reduction ---
    reduceSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    // We now have `numBlocks` partial sums in d_out;
    // reduce them again on the CPU (the cleanest method)
    float *h_partial = (float *)malloc(numBlocks * sizeof(float));
    cudaMemcpy(h_partial, d_out, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int i = 0; i < numBlocks; i++) {
        total += h_partial[i];
    }

    // --- Print Result ---
    printf("Final sum = %.0f (expected %d)\n", total, N);

    // --- Cleanup ---
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_partial);
    return 0;
}
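The example above fixes the shared array size at compile time. Shared memory can also be sized at launch time with extern __shared__ and the third launch-configuration parameter; a brief sketch (reduceSumDyn is an illustrative name):

// "Dynamic" shared memory: the size is supplied at launch, not at compile time.
__global__ void reduceSumDyn(float *g_in, float *g_out, int N) {
    extern __shared__ float sdata[]; // sized by the launch configuration
    // ... same reduction body as reduceSum above ...
}

// Third <<<>>> parameter = bytes of dynamic shared memory per block:
// reduceSumDyn<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, N);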

2. Local Memory

Overview:

  • Speed: Slow
  • Cache: Cached (L1/L2)
  • Access: Only one thread

Description:

Local memory accesses only occur for some automatic variables.

Local memory is also part of the GPU’s main memory (the same DRAM as global memory), so it’s generally slow.

Local memory is used automatically by threads when we run out of registers or when registers cannot be used.

This is called register spilling.

It happens when there are too many variables per thread to fit in registers, or when kernels use large structures. Also, arrays that aren’t indexed with compile-time constants are placed in local memory, since registers don’t have addresses; an addressable memory space must be used. The scope of local memory is per thread.

Local memory is cached in the L1 and then the L2 cache, so register spilling may not mean a dramatic performance decrease.

Some automatic variables that the compiler is likely to place in local memory:

  • Arrays for which it cannot determine that they are indexed with constant quantities.
  • Large structures or arrays that would consume too much register space.
  • Any variable if the kernel uses more registers than available (this is also known as register spilling).
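As a sketch (localDemo and its parameters are illustrative, not from the article), the per-thread array below is indexed with a runtime value, so the compiler must place it in an addressable space rather than registers; compiling with nvcc -Xptxas -v reports each kernel’s register usage along with any spill loads and stores:

__global__ void localDemo(const int *d_idx, float *d_out) {
    // A per-thread scratch array: a prime candidate for local memory.
    float scratch[64];
    for (int i = 0; i < 64; ++i) {
        scratch[i] = i * 2.0f;
    }
    // The index is only known at run time, so scratch[] needs an address;
    // registers have none, so the array lives in (cached) local memory.
    int k = d_idx[threadIdx.x];
    d_out[threadIdx.x] = scratch[k & 63];
}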

3. Registers

Overview:

  • Speed: Fastest
  • Access: Only one thread

Description:

Registers are the fastest memory on the GPU. The variables we declare in a kernel will use registers unless we run out of them or they cannot be stored in registers; in that case, local memory is used instead.

Register scope is per thread. Unlike a CPU, a GPU has thousands of registers on each streaming multiprocessor. Carefully trimming a kernel’s usage by even a few registers can double the number of concurrent blocks the GPU can execute, and therefore increase performance substantially.
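One standard way to trade register usage for occupancy is the __launch_bounds__ qualifier (the kernel below is an illustrative sketch, not from the article); the nvcc flag --maxrregcount offers a coarser, per-file alternative:

// Promise at most 256 threads per block and request at least 4 resident
// blocks per multiprocessor; the compiler then keeps per-thread register
// usage low enough to meet that target (spilling to local memory if needed).
__global__ void __launch_bounds__(256, 4)
tunedKernel(float *d_data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        d_data[i] *= 2.0f;
    }
}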

Summary
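
The table below condenses the Overview bullets from the sections above:

| Memory    | Location | Speed                        | Cached                        | Access             | Scope              |
|-----------|----------|------------------------------|-------------------------------|--------------------|--------------------|
| Registers | On-chip  | Fastest                      | No                            | Read/write         | One thread         |
| Shared    | On-chip  | Fast (bank conflicts)        | No                            | Read/write         | Threads in a block |
| Local     | Off-chip | Slow                         | Yes (L1/L2)                   | Read/write         | One thread         |
| Global    | Off-chip | Slow                         | Yes (L2, optionally L1)       | Read/write         | All threads + host |
| Constant  | Off-chip | Slow (fast on broadcast hit) | Yes (constant cache)          | Read-only (device) | All threads        |
| Texture   | Off-chip | Slow                         | Yes (texture cache, 2D-aware) | Read-only (device) | All threads        |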
