From Parallel Computing Concepts to Programming for CPU and GPU Architectures | by Shreya Shukla | Nov, 2024

For early-career ML Engineers and Data Scientists: to grasp memory fundamentals, parallel execution, and how code is written for CPU and GPU architectures.

Photo by Olav Ahrens Røtne on Unsplash

This article aims to explain the fundamentals of parallel computing. We start with the basics, including understanding shared vs. distributed architectures and communication within these systems. We will explore GPU architecture and how coding elements (using C++ Kokkos) help map architectural principles to code implementation. Finally, we will measure performance metrics (speedup) using the runtime data obtained from running the Kokkos code on both CPU and GPU architectures for vector-matrix multiplication, one of the most common operations in the machine learning domain.

The central theme of the article is exploring and answering questions. It may seem like a lengthy journey, but it will be worth it. Let's get started!

I get that parallel computing saves time by running multiple operations at once. But I've heard that system time is different from human time, or wall-clock time. How is it different?

The smallest unit of time in computing is called a clock tick. It represents the minimum time required to perform an operation, such as fetching data, executing computations, or communicating. A clock tick technically refers to the change of state necessary for an instruction. The state can be processor state, data state, memory state, or control signals. In a single clock tick, a complete instruction, part of an instruction, or multiple instructions may be executed.

A CPU allows for a limited number of state changes per second. For example, a CPU with a 3 GHz clock speed allows for three billion changes of state per second. There is a limit to the allowable clock speed, because each clock tick generates heat, and excessive speed can damage the CPU chip due to the heat generated.

Therefore, we want to make the most of the available capacity by using parallel computing methodologies. The goal is to hide memory latency (the time it takes for the first data to arrive from memory), increase memory bandwidth (the amount of data transferred per unit of time), and increase compute throughput (the work performed per clock tick).

To compare performance, such as when calculating the efficiency of a parallel program, we use wall-clock time instead of clock ticks, since it includes all real-time overheads like memory latency and communication delays, which cannot be directly translated into clock ticks.
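As a concrete (and purely illustrative) example of measuring wall-clock time, the C++ sketch below wraps a simple loop with std::chrono timers; the array sizes and the dot-product workload are arbitrary choices for demonstration.

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> a(1'000'000, 1.0), b(1'000'000, 2.0);

    // Wall-clock time includes memory latency, scheduling, and every other
    // real-time overhead, not just the clock ticks spent computing.
    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    std::cout << "sum = " << sum
              << ", wall-clock seconds = " << elapsed.count() << "\n";
    return 0;
}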

What does the architecture of a basic system look like?

A system can consist of a single processor, a node, or even a cluster. Some of the physical building blocks of a system are:

  1. Node: A physical computer unit that has several processor chips. Multiple nodes combine to form a cluster.
  2. Processor Chips: Chips containing multiple processing elements called cores.
  3. Core: Each core is capable of running an independent thread of execution.

In set terms, a node has a one-to-many relationship with processor chips, and each processor chip has a one-to-many relationship with cores. The image below gives a visual description of a node with processors and cores.

Modern node with four eight-core processors that share a common memory pool. Ref: Cornell Virtual Workshop

The non-physical components of a system include threads and processes:

  1. Thread: A thread is a sequence of CPU instructions that the operating system treats as a single unit for scheduling and execution purposes.
  2. Process: In computing, a process is a coherent unit of resource allocation, including memory, file handles, ports, and devices. A single process may manage resources for several threads. Threads can be modeled as components of a process (a minimal code sketch follows this list).
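To make the thread/process distinction concrete, here is a minimal C++ sketch (purely illustrative, not from the original article): one process launches four threads that all share the process's address space.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    // One process; this vector lives in memory shared by all of its threads.
    std::vector<int> partial(4, 0);

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        // Each thread is an independent stream of instructions that the
        // operating system schedules; here each writes its own slot.
        threads.emplace_back([t, &partial] { partial[t] = t * t; });
    }
    for (auto &th : threads) th.join();

    for (int v : partial) std::cout << v << " ";
    std::cout << "\n";
    return 0;
}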

So, do threads run on cores in the same system, or can they be spread across different systems for a single program? And in either case, how do they communicate? How is memory handled for these threads? Do they share it, or do they each get their own separate memory?

A single program can execute across multiple cores on the same or different systems/nodes. The design of the system and the program determines whether it aligns with the desired execution strategy.

When designing a system, three key aspects must be considered: execution (how threads run), memory access (how memory is allocated to those threads), and communication (how threads communicate, especially when they need to update the same data). It is important to note that these aspects are mostly interdependent.

Execution

Serial execution: This uses a single thread of execution to work on a single data item at any time.

Parallel execution: Here, more than one thing happens simultaneously. In computing, this can be:

  1. One worker: A single thread of execution operating on multiple data items simultaneously (vector instructions in a CPU). Imagine a single person sorting a deck of cards by suit. With four suits to categorize, the person must go through the entire deck to organize the cards for each suit.
  2. Working Together: Multiple threads of execution in a single process. It is equivalent to multiple people working together to sort a single deck of cards by suit.
  3. Working Independently: Multiple processes can work on the same problem, using either the same node or multiple nodes. In this scenario, each person would be sorting their own deck of cards.
  4. Any combination of the above.

Working Together: Two workers need to insert cards from the same suit. Worker A is holding the partial results for the clubs suit, so Worker B is temporarily blocked. Ref: Cornell Virtual Workshop

Memory Access

  1. Shared Memory: When a program runs on multiple cores (a single multithreaded process) on the same system, each thread within the process has access to memory in the same virtual address space.
  2. Distributed Memory: A distributed memory design is employed when a program uses multiple processes (whether on a single node or across different nodes). In this architecture, each process owns a portion of the data, and other processes must send messages to the owner to update their respective parts. Even when multiple processes run on a single node, each has its own virtual memory space. Therefore, such processes should use distributed memory programming for communication.
  3. Hybrid Strategy: Multithreaded processes that can run on the same or different nodes, designed to use multiple cores on a single node through shared memory programming. At the same time, they can employ distributed memory strategies to coordinate with processes on other nodes. Imagine multiple people or threads working in several cubicles in the image above. Workers in the same cubicle communicate using shared memory programming, while those in different cubicles interact through distributed memory programming.

In a distributed memory design, parallel workers are assigned to different cubicles (processes). Ref: Cornell Virtual Workshop

Communication

The communication mechanism depends on the memory architecture. In shared memory architectures, application programming interfaces like OpenMP (Open Multi-Processing) enable communication between threads that share memory and data. On the other hand, MPI (Message Passing Interface) can be used for communication between processes running on the same or different nodes in distributed memory architectures.
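For shared memory, a hedged OpenMP sketch is shown below: the threads share the data array, and the reduction clause takes care of safely combining their partial sums (compile with an OpenMP-capable compiler, e.g. with -fopenmp; the example is illustrative and not from the original article).

#include <iostream>
#include <omp.h>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    double total = 0.0;

    // Threads in this single process share 'data'; the reduction clause handles
    // the communication needed to merge each thread's partial sum.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        total += data[i];
    }

    std::cout << "total = " << total
              << " (up to " << omp_get_max_threads() << " threads)\n";
    return 0;
}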

How can we tell if our parallelization strategy is working effectively?

There are several methods, but here we discuss efficiency and speedup. In parallel computing, efficiency refers to the proportion of available resources that are actively utilized during a computation. It is determined by comparing the actual resource utilization against the peak performance, i.e., optimal resource utilization.

Actual processor utilization refers to the number of floating point operations (FLOP) performed over a specific period.

Peak performance assumes that each processor core executes the maximum possible FLOPs during every clock cycle.

Efficiency for parallel code is the ratio of actual floating-point operations per second (FLOPS) to the peak possible performance.

Speedup is used to assess efficiency and is measured as:

Speedup = wall-clock time of serial execution / wall-clock time of parallel execution

Speedup cannot be greater than the number of parallel resources when programs are limited by the computing speed of the processors.

Using speedup, parallel efficiency is measured as:

Parallel efficiency = Speedup / number of cores (parallel resources)

Hence, a worthy goal when optimizing a parallel program is to bring its speedup as close as possible to the number of cores.

Suppose the serial execution of the code took 300 seconds. After parallelizing the tasks using 50 cores, the overall wall-clock time for parallel execution was 6 seconds. In this case, the speedup can be calculated as the wall-clock time for serial execution divided by the wall-clock time for parallel execution, resulting in a speedup of 300 s / 6 s = 50. We get the parallel efficiency by dividing the speedup by the number of cores: 50/50 = 1. This is an example of the best-case scenario: the workload is perfectly parallelized, and all cores are utilized efficiently.

Will adding more computing units always improve performance if the data size or number of tasks increases?

Only sometimes. In parallel computing, we have two types of scaling based on the problem size or the number of parallel tasks.

Strong Scaling: Increasing the number of parallel tasks while keeping the problem size constant. However, even as we increase the number of computational units (cores, processors, or nodes) to process more tasks in parallel, there is an overhead associated with communication between these units or with the host program, such as the time spent sending and receiving data.

Strong scaling performance is affected by the ratio of the time spent communicating to the time spent computing. The larger this ratio, the worse the strong scaling behaviour will be.

Ideally, the execution time decreases as the number of parallel tasks increases. However, if the code doesn't get faster with strong scaling, it could indicate that we're using too many tasks for the amount of work being done.

Weak Scaling: Here, the problem size increases as the number of tasks increases, so the computation per task remains constant. If your program has good weak scaling performance, you can run a problem twice as large on twice as many nodes in the same wall-clock time.

There are restrictions around what we can parallelize, since some operations can't be parallelized. Is that right?

Yes, parallelizing certain sequential operations can be quite challenging. Parallelizing depends on multiple instruction streams and/or multiple data streams.

Different types of parallel computing architectures. Ref: Cornell Virtual Workshop

To understand what can be parallelized, let's look at SIMD in CPUs, which is achieved using vectorization.

Vectorization is a programming technique in which operations are applied to whole arrays at once rather than processing individual elements one at a time. It is achieved using the vector unit in processors, which includes vector registers and vector instructions.

Consider a scenario where we iterate over an array and perform multiple operations on a single element within a for loop. When the data is independent, writing vectorizable code becomes straightforward; see the example below:

do i = 1, n
  a(i) = b(i) + c(i)
  d(i) = e(i) + f(i)
end do

In this loop, each iteration is independent, meaning a(i) is processed independently of a(i+1) and so on. Therefore, this code is vectorizable, which allows multiple elements of array a to be computed in parallel using elements from b and c, as demonstrated below:

b: | b(i) | b(i+1) | b(i+2) | b(i+3) | ... |
c: | c(i) | c(i+1) | c(i+2) | c(i+3) | ... |
------------------------------------------------
Vectorized Addition (SIMD)

Vector Register 1 (loaded with b values):
| b(i) | b(i+1) | b(i+2) | b(i+3) | ... |

Vector Register 2 (loaded with c values):
| c(i) | c(i+1) | c(i+2) | c(i+3) | ... |

------------------------------------------------
Result in Vector Register 3:
| a(i) | a(i+1) | a(i+2) | a(i+3) | ... |

Modern compilers are generally capable of analyzing such loops and transforming them into sequences of vector operations. A problem arises when an operation in one iteration depends on the result of a previous iteration. In this case, automatic vectorization could lead to incorrect results. This situation is known as a data dependency.

Data dependencies commonly encountered in scientific code are:

Read After Write (RAW): Not Vectorizable

do i = 2, n
  a(i) = a(i-1) + b(i)
end do

Write After Read (WAR): Vectorizable

do i = 1, n-1
  a(i) = a(i+1) + b(i)
end do

Write After Write (WAW): Not Vectorizable

do i = 1, n-1
  a(i%2) = a(i+1) + b(i)
end do

Read After Read (RAR): Vectorizable

do i = 1, n
  a(i) = b(i%2) + c(i)
end do

Adhering to certain standard rules for vectorization, such as ensuring independent assignments in loop iterations, avoiding random data access, and preventing dependencies between iterations, can help in writing vectorizable code.
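The same rules carry over to C++. As a hedged illustration (not from the original article), the first loop below has independent iterations and is typically auto-vectorized at higher optimization levels, while the second has a read-after-write dependency and cannot be safely vectorized as written.

#include <vector>

// Independent iterations: a[i] depends only on b[i] and c[i], so the compiler
// is generally free to emit vector instructions for this loop.
void vectorizable(std::vector<double>& a,
                  const std::vector<double>& b,
                  const std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
    }
}

// Read-after-write dependency: each iteration reads the value written by the
// previous one, so vectorizing it would change the result.
void not_vectorizable(std::vector<double>& a, const std::vector<double>& b) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
}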

When data grows, it makes sense to parallelize as many parallelizable operations as possible to create scalable solutions, but that means we need bigger systems with many cores. Is that why we use GPUs? How are they different from CPUs, and what leads to their high throughput?

YES!

Comparing the relative capabilities of the basic elements of CPU and GPU architectures. Ref: Cornell Virtual Workshop

GPUs (graphics processing units) have many more processing units (green) and higher aggregate memory bandwidth (the amount of data transferred per unit of time) compared to CPUs, which, on the other hand, have more sophisticated instruction processing and faster clock speeds. As seen above, CPUs have more cache memory than GPUs. However, CPUs have fewer arithmetic logic units (ALUs) and floating point units (FPUs) than GPUs. Considering these points, it is intuitive to use CPUs for complex workflows and GPUs for computationally intensive tasks.

GPUs are designed to deliver high computational throughput using their massively parallel architecture. Their computational potential can be measured in billions of floating point operations per second (GFLOPS). GPU hardware comes in the form of standard graphics cards (e.g., NVIDIA Quadro), high-end accelerator cards (e.g., NVIDIA Tesla), etc.

Two key properties of the graphics pipeline that enable parallelization and, thus, high throughput are:

  1. Independence of Objects: A typical graphics scene consists of many independent objects; each object can be processed separately without dependencies on the others.
  2. Uniform Processing Steps: The sequence of processing steps is the same for all objects.

So, multiple cores of a GPU work on different data at the same time, executing computations in parallel like a SIMD (Single Instruction, Multiple Data) architecture. How are tasks divided between cores? Does each core run a single thread, as in the CPU?

In a GPU, Streaming Multiprocessors (SMs) are similar to cores in a CPU. Cores in GPUs are similar to vector lanes in CPUs. SMs are the hardware units that house the cores.

When a function or computation, referred to as a kernel, is executed on the GPU, it is typically broken down into thread blocks. These thread blocks contain multiple threads; each SM can manage many threads across its cores. If there are more thread blocks than SMs, multiple thread blocks can be assigned to a single SM. Also, multiple threads can run on a single core.

Each SM further divides the thread blocks into groups called warps, with each warp consisting of 32 threads. These threads execute the same stream of instructions on different data elements, following a Single Instruction, Multiple Data (SIMD) model. The warp size is set to 32 because, in NVIDIA's architecture, CUDA cores are grouped into sets of 32. This enables all threads in a warp to be processed together in parallel by the 32 CUDA cores, achieving high efficiency and optimized resource utilization.

In SIMD (Single Instruction, Multiple Data), a single instruction acts uniformly on all data elements, with each data element processed in exactly the same way. SIMT (Single Instruction, Multiple Threads), which is commonly used in GPUs, relaxes this restriction. In SIMT, threads can be activated or deactivated so that instructions and data are processed only in active threads, while the local data remains unchanged on inactive threads.
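As a rough illustration of SIMT behaviour (my own sketch, not from the original article), consider the branch inside the Kokkos loop below: on a GPU backend, the threads of a warp whose condition is false are masked off while the other path executes, and the two paths are processed one after the other.

#include <Kokkos_Core.hpp>

void simt_branch_demo(const int n) {
    Kokkos::View<double*> data("data", n);
    Kokkos::parallel_for("simt_branch", n, KOKKOS_LAMBDA(const int i) {
        if (i % 2 == 0) {
            data(i) = 2.0 * i;   // even-index threads take this path
        } else {
            data(i) = i + 1.0;   // odd-index threads take this path
        }
        // Within a warp, the two sides of the branch run one after the other,
        // with the non-participating threads left inactive and unchanged.
    });
}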

I want to understand how we can code for different architectures. Can similar code work for both CPU and GPU architectures? What parameters and methods can we use to ensure that the code efficiently uses the underlying hardware architecture, whether it's CPUs or GPUs?

Code is generally written in high-level languages like C or C++ and must be converted into binary code by a compiler, since machines cannot directly process high-level instructions. While both GPUs and CPUs can execute the same kernel, as we'll see in the example code, we need to use directives or parameters to compile the code for a specific architecture and generate an instruction set for it. This approach allows us to use architecture-specific capabilities. To ensure compatibility, we can specify the appropriate flags for the compiler to produce binary code optimized for the desired architecture, whether it's a CPU or a GPU.

Various coding frameworks, such as SYCL, CUDA, and Kokkos, are used to write kernels or functions for different architectures. In this article, we will use examples from Kokkos.

A bit about Kokkos: an open-source C++ programming model for performance portability for writing kernels. It is implemented as a template library on top of CUDA, OpenMP, and other backends and aims to be descriptive, in the sense that we define what we want to do rather than prescriptive (how we want to do it). Kokkos Core provides a programming model for parallel algorithms that uses many-core chips and shares memory across those cores.

A kernel has three components:

  1. Pattern: The structure of the computation: for, scan, reduction, task-graph.
  2. Execution policy: How computations are executed: static scheduling, dynamic scheduling, thread teams.
  3. Computational body: The code that performs each unit of work, e.g., the loop body.

The pattern and policy drive the computational body. In the example below, used just for illustration, 'for' is the pattern, the condition controlling the pattern (element = 0; element < n; ++element) is the policy, and the computational body is the code executed within the pattern.

for (element = 0; element < n; ++element) {
  total = 0;
  for (qp = 0; qp < numQPs; ++qp) {
    total += dot(left[element][qp], right[element][qp]);
  }
  elementValues[element] = total;
}
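For comparison, here is a hedged sketch of how the same loop could be expressed with Kokkos (illustrative only; 'left', 'right', 'elementValues', 'numQPs', and 'dot' are assumed to be defined elsewhere): 'parallel_for' is the pattern, the RangePolicy is the policy, and the lambda is the computational body.

Kokkos::parallel_for("elementLoop", Kokkos::RangePolicy<>(0, n),
  KOKKOS_LAMBDA(const int element) {
    double total = 0;
    for (int qp = 0; qp < numQPs; ++qp) {
      total += dot(left(element, qp), right(element, qp));
    }
    elementValues(element) = total;
  });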

The Kokkos framework allows developers to define parameters and methods based on three key factors: where the code will run (Execution Space), what memory resources will be used (Memory Space), and how data will be structured and managed (Data Structure and Data Management).

We primarily discuss how to write the Kokkos kernel for the vector-matrix product to understand how these factors are implemented for different architectures.

But before that, let's discuss the building blocks of the kernel we want to write.

Memory Space

Kokkos provides a range of memory space options that enable users to control memory management and data placement on different computing platforms. Some commonly used memory spaces are:

  1. HostSpace: This memory space represents the CPU's main memory. It is used for computations on the CPU and is typically the default memory space when working on a CPU-based system.
  2. CudaSpace: CudaSpace is used for NVIDIA GPUs with CUDA. It provides memory allocation and management for GPU devices, allowing for efficient data transfer and computation.
  3. CudaUVMSpace: For Unified Virtual Memory (UVM) systems, such as those on some NVIDIA GPUs, CudaUVMSpace enables the allocation of memory accessible from both the CPU and GPU without explicit data transfers. The CUDA runtime automatically handles data movement, at a performance cost.

It is also important to discuss memory layout, which refers to the organization and arrangement of data in memory. Kokkos provides several memory layout options to help users optimize their data storage for various computations. Some commonly used memory layouts are:

Row-major vs. column-major iteration of a matrix. Ref: Wikipedia
  1. LayoutRight (also known as Row-Major) is the default memory layout for multi-dimensional arrays in C and C++. In LayoutRight, the rightmost index varies most rapidly in memory. If no layout is chosen, the default layout for HostSpace is LayoutRight.
  2. LayoutLeft (also known as Column-Major): In LayoutLeft, the leftmost index varies most rapidly in memory. If no layout is chosen, the default layout for CudaSpace is LayoutLeft (both layouts appear in the sketch after this list).
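A short sketch of spelling out both choices explicitly as View template parameters (numRows and numCols are assumed to be defined elsewhere; CudaSpace requires a CUDA-enabled Kokkos build):

// 2D array in host (CPU) memory with row-major (LayoutRight) storage.
Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace>
    hostMatrix("hostMatrix", numRows, numCols);

// 2D array in GPU memory with column-major (LayoutLeft) storage, which gives
// coalesced accesses when consecutive GPU threads touch consecutive values of
// the first (leftmost) index.
Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::CudaSpace>
    deviceMatrix("deviceMatrix", numRows, numCols);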

In the programmatic implementation below, we define the memory space and layout as macros based on the compiler flag ENABLE_CUDA, which will be true if we want to run our code on the GPU and false for the CPU.

// ENABLE_CUDA is a compile-time argument with default value true
#define ENABLE_CUDA true

// If CUDA is enabled, run the kernel on the CUDA (GPU) architecture
#if defined(ENABLE_CUDA) && ENABLE_CUDA
#define MemSpace Kokkos::CudaSpace
#define Layout Kokkos::LayoutLeft
#else
// Define default values or behavior when ENABLE_CUDA is not set or is false
#define MemSpace Kokkos::HostSpace
#define Layout Kokkos::LayoutRight
#endif

Data Structure and Data Management

Kokkos Views: In Kokkos, a "view" is a fundamental data structure representing one-dimensional and multi-dimensional arrays, which can be used to store and access data efficiently. Kokkos views provide a high-level abstraction for managing data and are designed to work seamlessly with different execution spaces and memory layouts.

// View for a 2D array of data type double
Kokkos::View<double**> myView("myView", numRows, numCols);
// Access Views
myView(i, j) = 42.0;
double value = myView(i, j);

Kokkos mirroring technique for data management: Mirrors are views of equivalent arrays residing in possibly different memory spaces, which is what we need when data must exist in both CPU and GPU memory. This technique is helpful for scenarios like reading data from a file on the CPU and subsequently processing it on the GPU. Kokkos' mirroring creates a mirrored view of the data, allowing seamless sharing between the CPU and GPU execution spaces and facilitating data transfer and synchronization.

To create a mirrored copy of the primary data, we can use Kokkos' create_mirror_view() function. This function generates a mirror of a view, typically a host-accessible copy of a device view, with the same data type and dimensions as the primary view.

// Intended computation:
// <y, A*x> = y^T * (A * x)
// Here:
// y and x are vectors.
// A is a matrix.

// Allocate y, x vectors and matrix A on the device
typedef Kokkos::View<double*, Layout, MemSpace> ViewVectorType;
typedef Kokkos::View<double**, Layout, MemSpace> ViewMatrixType;

// N and M are the numbers of rows and columns
ViewVectorType y( "y", N );
ViewVectorType x( "x", M );
ViewMatrixType A( "A", N, M );

// Create host mirrors of device views
ViewVectorType::HostMirror h_y = Kokkos::create_mirror_view( y );
ViewVectorType::HostMirror h_x = Kokkos::create_mirror_view( x );
ViewMatrixType::HostMirror h_A = Kokkos::create_mirror_view( A );

// Initialize y vector on host.
for ( int i = 0; i < N; ++i ) {
  h_y( i ) = 1;
}

// Initialize x vector on host.
for ( int i = 0; i < M; ++i ) {
  h_x( i ) = 1;
}

// Initialize A matrix on host.
for ( int j = 0; j < N; ++j ) {
  for ( int i = 0; i < M; ++i ) {
    h_A( j, i ) = 1;
  }
}

// Deep copy host views to device views.
Kokkos::deep_copy( y, h_y );
Kokkos::deep_copy( x, h_x );
Kokkos::deep_copy( A, h_A );

Execution Space

In Kokkos, the execution space refers to the specific computing environment or hardware platform where parallel operations and computations are executed. Kokkos abstracts the execution space, enabling code to be written in a descriptive manner while adapting to various hardware platforms.

We discuss two main execution spaces:

  1. Serial: The Serial execution space is a basic and portable option suitable for single-threaded CPU execution. It is often used for debugging, testing, and as a baseline for performance comparisons.
  2. Cuda: The Cuda execution space is used for NVIDIA GPUs and relies on CUDA technology for parallel processing. It enables efficient GPU acceleration and management of GPU memory.

The ExecSpace can either be defined explicitly, or it can be derived from the memory space, as below:


// Execution space determined based on the memory space
using ExecSpace = MemSpace::execution_space;
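All of these building blocks assume that Kokkos itself has been set up. For completeness, a minimal skeleton of a Kokkos program (standard Kokkos usage, not shown explicitly in this article) looks like this:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);   // set up the enabled execution spaces
    {
        // Views, mirrors, and kernels (like the snippets in this article) go
        // here, inside a scope, so they are destroyed before finalize().
    }
    Kokkos::finalize();               // release backend resources
    return 0;
}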

How do we use these building blocks to write an actual kernel? Can we use it to compare performance between different architectures?

For the purpose of writing a kernel and comparing performance, we use the following computation:

<y, A*x> = y^T * (A * x)

Here:

y and x are vectors.

A is a matrix.

<y, A*x> represents the inner product or dot product of vector y
and the result of the matrix-vector multiplication A*x.

y^T denotes the transpose of vector y.

* denotes matrix-vector multiplication.

The kernel for this operation in Kokkos:

// Use a RangePolicy.
typedef Kokkos::RangePolicy<ExecSpace> range_policy;

// The code below is run for multiple iterations across different
// architectures for time comparison
Kokkos::parallel_reduce( "yAx", range_policy( 0, N ),
  KOKKOS_LAMBDA ( int j, double &update ) {
    double temp2 = 0;

    for ( int i = 0; i < M; ++i ) {
      temp2 += A( j, i ) * x( i );
    }
    update += y( j ) * temp2;
  }, result );

For the above kernel, parallel_reduce serves as the pattern, range_policy defines the policy, and the actual operations constitute the computational body.
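The timings discussed next were obtained by running this kernel repeatedly and measuring wall-clock time. The exact harness is not shown in the article, so the sketch below is an assumption on my part; it uses Kokkos::Timer and a fence so the device work is finished before the time is read.

double result = 0;
Kokkos::Timer timer;                 // hypothetical timing harness

for (int repeat = 0; repeat < 100; ++repeat) {
    Kokkos::parallel_reduce( "yAx", range_policy( 0, N ),
        KOKKOS_LAMBDA ( int j, double &update ) {
            double temp2 = 0;
            for ( int i = 0; i < M; ++i ) {
                temp2 += A( j, i ) * x( i );
            }
            update += y( j ) * temp2;
        }, result );
}

Kokkos::fence();                     // wait for all device work to complete
double elapsed = timer.seconds();    // wall-clock seconds for all repetitions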

I executed this kernel on a TACC Frontera node with an NVIDIA Quadro RTX 5000 GPU. The experiments were performed with varying values of N, which refers to the lengths of the vectors y and x, and the number of rows in matrix A. The computation was performed 100 times to obtain notable results, and the execution time of the kernel was recorded for both the Serial (Host) and CUDA execution spaces. I used the ENABLE_CUDA compiler flag to switch between execution environments: true for the GPU/CUDA execution space and false for the CPU/Serial execution space. The results of these experiments are presented below, with the corresponding speedup.

Kernel execution runtime data and speedup for CPU vs. GPU architecture. Ref: Table by author
Speedup trend for different data sizes (GPU vs. CPU). Ref: Image by author

We notice that the speedup increases significantly with the size of N, indicating that the CUDA implementation becomes increasingly advantageous for larger problem sizes.

That's all for now! I hope this article has been helpful in getting started on the right foot in exploring the domain of computing. Understanding the basics of GPU architecture is crucial, and this article introduces one way of writing cross-architectural code that I experimented with. However, there are several other methods and technologies worth exploring.

While I'm not a subject matter expert, this article reflects my learning journey from my brief experience working at TACC in Austin, TX. I welcome feedback and discussions, and I'd be glad to help if you have any questions or want to learn more. Please refer to the excellent resources below for further reading. Happy computing!