Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: March 24, 2023
In computer science, there are various parallel algorithms that can run on a multiprocessor computer, such as multithreaded algorithms.
In this tutorial, we’ll take a look at what this approach means. Then, we’ll use an example to understand it in a better way.
Multithreading is a crucial topic for modern computing. There exist many competing models of parallel computation that are essentially different. For example, one can have shared or distributed memory. Since multicore processors are ubiquitous, we focus on a parallel computing model with shared memory.
Static threading means that the programmer needs to specify how many processors to use at each point in advance. When it comes to evolving conditions, this can be inflexible.
In dynamic multithreading models, the programmer needs to specify opportunities for parallelism and a concurrency platform manages the decisions of mapping these opportunities to actual static threads.
A concurrency platform is a software layer that schedules, manages, and coordinates parallel-computing resources. We’ll use a simple extension of the serial programming model that reflects current parallel-computing practice:
The “parallel” and “spawn” keywords do not impose parallelism. They just indicate that it is possible. This is known as logical parallelism. So, when parallelism is used, we respect “sync”. Every procedure has an implicit “sync” at the end for safety.
Let’s take a slow algorithm and make it parallel. Here is the definition of Fibonacci numbers:
Here’s an algorithm (non-parallel) for computing Fibonacci numbers based on the above definition:
algorithm FIB(n):
// INPUT
// n = a non-negative integer
// OUTPUT
// The nth Fibonacci number
if n <= 1:
return n
else:
x <- FIB(n-1)
y <- FIB(n-2)
return x + y
Here’s the recursion tree of the algorithm cited above:
Let’s see what improvement we can get by computing the two recursive calls in parallel using the concurrency keywords:
algorithm P-FIB(n):
// INPUT
// n = non-negative integer
// OUTPUT
// The nth Fibonacci number computed in parallel
if n <= 1:
return n
else:
x <- spawn P-FIB(n-1) // parallel execution
y <- spawn P-FIB(n-2) // parallel execution
sync // wait for the results x and y
return x + y
Let’s now look at a formal model for describing parallel computations.
A computation DAG (directed acyclic graph) will be used to model a multithreaded computation :
If there’s an edge between thread (a strand of maximal length will be called a thread) and
they are logically in series; otherwise, they are logically parallel.
Here’s a categorization of the edges:
Before moving forward to the model for the computation DAG, let’s take a look at the parallel algorithm to compute Fibonacci numbers using threads:
algorithm P-FIB(n):
// INPUT
// n = non-negative integer
// OUTPUT
// The nth Fibonacci number, with thread annotations for clarity
if n <= 1:
return n // Thread A
else:
x <- spawn P-FIB(n-1) // Thread A continues
y <- spawn P-FIB(n-2) // Thread B
sync // wait for results
return x + y // Thread C combines results
Here’s the model for the computation DAG for P-FIB():
As we can see in the figure above, the purple thread in the block number spawns another purple thread in the block number
. Then the number
points to the purple thread in the block number
. The number
points to the purple thread in the block number
. Then we point to the green thread in the block number
using the return edge. We continue pointing to the green thread in block number
and
until we reach the final thread.
Let’s walk through some measures and observations the characterizes dynamic multithreading.
The work of a multithreaded computation is the total time required to execute the entire computation on one processor. So, the work represents the sum of the time taken by each thread.
An ideal parallel computer with processors can perform up to
units of work in a one-time step. In other words, it can do up to
work in
time (
is the running time of an algorithm on
processors). Since the total work is
,
. Hence, the work law is
.
The span is the longest time it takes to execute threads along any path of the computational DAG.
An ideal parallel computer with processors cannot run faster than a machine with an infinite number of processors. A computer with an infinite number of processors, on the other hand, can imitate a
processor machine by employing only
of its processors. Therefore,
, which is known as the span law.
The speedup of a computation on processors is
. So, this ratio defines how much speedup we get with the
processor as compared to one. Using the work law (
), we can conclude that an ideal parallel computer cannot have any more speedup than the number of processors.
The parallelism of a multithreaded computation is . So, this ratio means the average amount of work that can be performed for each step of parallel execution time.
The parallel slackness of a multithreaded calculation performed on an ideal parallel computer with processors is defined as the parallelism by
ratio:
Let’s use the Fibonacci example (P-FIB()). In the graph below, we use
threads (
vertices). As we can see, there are
threads on the longest path (red line in the graph).
Let’s assume the unit time for each thread:
Let’s now take a look at the work and span of the Fibonacci computation (using the parallel algorithm above of P-FIB()) to compute the parallelism:
Let’s now use an example to understand the multithreading algorithm and its operations (parallel, spawn, sync).
Here’s an algorithm for multithreaded matrix multiplication, using the algorithm:
algorithm P-Matrix-Multiply(A, B):
// INPUT
// A, B = square matrices of size n x n
// OUTPUT
// C = the product of A and B, computed in parallel
n <- the number of rows in A
C <- initialize an empty n x n matrix
parallel for i <- 1 to n:
parallel for j <- 1 to n:
c[i, j] <- 0
for k <- 1 to n:
c[i, j] <- c[i, j] + a[i, k] * b[k, j]
return C
In this algorithm, the longest path is when spawning the outer and inner parallel loop executions, and then the executions of the innermost for loop. So, the span of this algorithm is
. Hence, the parallelism is
.
We employ divide-and-conquer algorithms for parallelism because they divide the problem into independent subproblems that can be addressed individually. Let’s take a look at merge sort:
algorithm Merge-Sort(a, p, r):
// INPUT
// a = array to sort
// p = starting index
// r = ending index
// OUTPUT
// Array a sorted in place from indices p to r
if p < r:
q <- (p + r) / 2
spawn Merge-Sort(a, p, q)
Merge-Sort(a, q+1, r)
sync // wait for parallel processes to complete
Merge(a, p, q, r)
As we can see, the dividing is in the main procedure Merge-Sort, then we parallelize it by using spawn on the first recursive call. Merge remains a serial algorithm, so its work and span are as before.
Here’s the recurrence for the work of Merge-Sort (it’s the same as the serial version):
The recurrence for the span of Merge-Sort is based on the fact that the recursive calls run in parallel:
Here’s the parallelism:
As we can see, this is low parallelism, which means that even with massive input, having hundreds of processors would not be beneficial. So, to increase the parallelism, we can speed up the serial Merge.
Here are some applications of using multithreading:
In this tutorial, we’ve discussed the basic concept of multithreaded algorithms using an example to unlock its mechanism. We’ve also gone over the directed acyclic graph (DAG).