Demystifying Supercomputers and Parallel Computing: Part 1


Introduction

Note: I hope people who have at least some interest in computer hardware understand most of this, although there are some intermediate electronic-engineering perspectives in these posts.

HAL 9000

HAL 9000 from 2001: A Space Odyssey (Courtesy of Wikipedia). This is what many people picture when they think of a supercomputer, and it is also one of the first fictional incarnations of artificial intelligence to make it into a movie.

People still hear news from TOP500 releases about China and the US competing over who has the most floating-point operations per second (FLOPS). It looks like rocket science, and people imagine geek engineers at the forefront of bleeding-edge science. Well, it isn't.

Basic Structure of TOP500 and Similar Supercomputers

So let me explain in simple terms what supercomputing is.

In floating-point operations per second (FLOPS), floating-point operations are simply arithmetic operations (for example addition, subtraction, multiplication, and division) on decimal numbers stored in a format where the decimal point can move around within a fixed total number of digits (= precision). These operations, along with other less-used ones, are fundamental to everything a CPU does. TOP500 uses the LINPACK benchmark, which is basically a benchmark program that solves a dense system of linear equations from linear algebra.
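
To make FLOPS concrete, here is a minimal sketch in C, not the real LINPACK benchmark but a rough illustration with an assumed matrix size of 512: it times a naive double-precision matrix multiplication and estimates the FLOPS achieved from the roughly 2·N³ multiplications and additions it performs. A tuned LINPACK run does the same kind of accounting, just on a vastly larger and far more carefully optimized linear-algebra problem.

```c
/*
 * A minimal sketch, NOT the real LINPACK benchmark: it times a naive
 * double-precision matrix multiply and estimates FLOPS from the
 * 2*N*N*N multiplications and additions it performs. N = 512 is an
 * arbitrary illustrative size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)rand() / RAND_MAX;
            b[i][j] = (double)rand() / RAND_MAX;
        }

    clock_t start = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];   /* one multiply + one add */
            c[i][j] = sum;
        }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* total floating-point operations: 2 * N^3 */
    printf("~%.2f GFLOPS on a single core (c[0][0] = %f)\n",
           2.0 * N * N * N / secs / 1e9, c[0][0]);
    return 0;
}
```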

Fundamentally, a supercomputer of this kind is a link of many conventional x86-64 server racks. A combination of other, smaller nodes, which can be individual computer systems with a CPU and other accelerators, can also be called a high-performance system, although it may or may not be called a supercomputer; details coming later.

Beowulf Cluster

This is a Beowulf cluster (Courtesy of Wikipedia): basically consumer PCs connected together to produce high-throughput data processing. The principles of supercomputing are not that different, except that supercomputers use server CPUs that are more energy-efficient. People have also made Beowulf clusters with Raspberry Pis, which actually hints at the future of supercomputing with even more energy-efficient ARM processors.

If you have heard of Intel Xeon, AMD Opteron, or more recently AMD Epyc after the introduction of the Ryzen (Zen) microarchitecture, these are the server CPUs in question. And guess what: all TOP500 systems since 2018 run Linux-based OSes, which is what powers approximately three-fourths of all web servers on the internet and is a pretty stable and intuitive OS for computer enthusiasts. Because of this, most conventional computer programs can run on the vast majority of supercomputers, albeit inefficiently. There are exceptions, such as the IBM POWER9 (the CPU architecture in 13 TOP500 builds in total, including the top two supercomputers Summit and Sierra at the Oak Ridge and Lawrence Livermore national laboratories as of November 2019, both coupled with NVIDIA Tesla GPUs explained later; note these are different from the IBM Z mainframe architecture used mainly for handling transactions at credit card corporations and banks), one Sunway CPU architecture developed in-house by China, three SPARC64 builds (this architecture was very dominant before the mid-2000s but has been phased out by Intel and AMD CPUs), and two ARM RISC builds plus one additional Japanese variant (ARM is the architecture of the CPUs in your smartphone and is highly anticipated for the future because of its energy efficiency). Around 480 of the TOP500 are thus literally connected versions of conventional server racks, and three more are essentially smartphone CPUs embedded on motherboards and connected together. Even the other architectures can run normal programs as long as those programs are compiled for the respective architecture.

UPDATE: The ARM architecture just made history. I am writing this just as the June 2020 TOP500 results come in at the ISC 2020 online conference: Summit and Sierra are now second and third. The Fugaku supercomputer, built with ARM-based 48-core 2.2 GHz A64FX CPUs and a custom Tofu interconnect, is housed at the RIKEN Center for Computational Science (R-CCS), was made possible by a coalition with Fujitsu and ARM, and has become the most powerful supercomputer in the world. This means that supercomputer workloads do not need to be accelerated with GPU-specific code, and computations that are only viable on CPUs can be utilized at maximum efficiency. It is also the first supercomputer to surpass one exaFLOPS in single-precision arithmetic (two exaFLOPS in half-precision), and it also places within the top ten of the Green500 list of the most energy-efficient supercomputers.

One additional crucial component of supercomputers: the NVIDIA GPUs you see in lots of computers for games and graphics rendering. They aren't for games in this case, but since games have to render a full screen of two-dimensional pixels every few milliseconds, the same hardware is well suited to computations that are parallel. The programs you write when you first learn programming are linear: if statements and for loops mean the program has to do or check one thing before doing the next, and it waits for one step to finish before moving on because the next step depends on previous results. Such a program executes in one thread of a CPU, thus using one core. But what if we can program it differently and divide it into steps, each consisting of many tasks that can be done simultaneously and independently? We can do this for most problems that require a lot of computing power. This is multicore computing if we divide the tasks among the multiple cores of one CPU, and it is used in the everyday programs you run today.

To apply this to GPUs, we use frameworks called OpenCL or CUDA to write code so that jobs are distributed to the GPU's many cores, which compute these independent tasks. NVIDIA's Tesla V100 and P100 GPUs are used very frequently here, and these same GPUs are used for machine learning, including deep learning with neural networks. Check out my First and Second posts that used these Tesla V100 GPUs to generate Linkin Park music. Moreover, there are frameworks that distribute these jobs efficiently to multiple CPUs or GPUs, most notably MPI (which passes data between different server racks) and OpenMP (which parallelizes work across the cores within a node). GPUs, along with a special processor called the Intel Xeon Phi, are called accelerators. The Xeon Phi is basically a collection of CPU cores that run slower than normal server CPU cores but are far more numerous, so it can run parallel computations that are normally only possible on CPUs rather than through GPU frameworks like OpenCL or CUDA. These accelerators usually aren't run on their own; instead they are connected to normal server CPUs and receive jobs from the CPU. This is what makes up what is called high-performance computing. Supercomputing is a subset of high-performance computing.
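
As a small, hedged illustration of dividing independent work across cores, here is a minimal OpenMP sketch in C (the array size and the loop body are just placeholders). Each iteration depends only on its own index, so the runtime can hand chunks of the loop to different threads; CUDA and OpenCL apply the same principle on the thousands of simpler cores of a GPU.

```c
/*
 * A minimal sketch of dividing independent work across CPU cores with
 * OpenMP (compile with something like `gcc -fopenmp`). Each loop
 * iteration depends only on its own index, so the runtime can hand
 * chunks of the loop to different threads.
 */
#include <stdio.h>
#include <omp.h>

#define N 10000000

static double x[N];

int main(void) {
    double sum = 0.0;

    /* iterations are independent; reduction(+:sum) safely combines the
       per-thread partial sums at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * i;
        sum += x[i];
    }

    printf("used up to %d threads, sum = %.1f\n",
           omp_get_max_threads(), sum);
    return 0;
}
```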

Parallel Computing

Let's look at more types of parallel computing, which is required for most high-performance computing systems. Keep in mind that these types are mainly classified by how much each job depends on the others, which determines how fast the I/O read/write connection between nodes needs to be.

Multiprocessing

Multiprocessing, as discussed right above, is the distribution of jobs within a single CPU socket to multiple cores. Since all cores on a CPU die share the same RAM, this type of computation is best for workloads where neighboring jobs need close access to each other's data. OS task schedulers mainly do this distribution, and Cilk Plus and TBB are examples of APIs designed to make distributing jobs easier. Good examples of such programs are web browsers and computer games. Note that not all computer or server builds are composed of one die: there are plenty of server motherboards with multiple sockets, and CPUs with two dies in one socket. These dies do not share the same RAM, and such a structure is called NUMA (Non-Uniform Memory Access). Computer games are adversely affected by NUMA; the added latency caused by this non-uniformity makes gaming on server systems impractical unless all game data is allocated for computation on one socket/die using only the RAM directly connected to it, which leaves part of the available CPU cores unused. When the term high-performance computing is used for a single computing node, it is because the computing load is very intensive and uses the multiprocessing frameworks of multi-node high-performance computing, but memory access matters too much to offload the work to multiple nodes. A good example of such a RAM-intensive workload is y-cruncher.
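
To make the NUMA point concrete, here is a minimal, Linux-specific sketch in C that pins worker threads to a chosen set of cores using the glibc extension pthread_setaffinity_np, so a memory-heavy job stays on one socket and its directly attached RAM. Which core numbers belong to which socket is an assumption here and would have to be checked on the actual machine.

```c
/*
 * A minimal, Linux-specific sketch of keeping threads on one socket.
 * pthread_setaffinity_np is a glibc extension, and the core numbers
 * below are an assumption about which cores share a socket on a given
 * machine (check with `lscpu` or `numactl --hardware`). Compile with
 * `gcc -pthread`.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    int core = *(int *)arg;

    /* restrict this thread to a single core; memory it touches will
       then be allocated on that core's local NUMA node by default */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    printf("worker pinned to core %d\n", core);
    /* ... memory-heavy work would go here ... */
    return NULL;
}

int main(void) {
    int cores[] = {0, 1, 2, 3};   /* assumed to sit on the same socket */
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```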

Supercomputing vs Cluster Computing

Now we have to distinguish between two kinds of systems larger in scale than multiprocessing (which involves only one computer, or computing node). Both are widely referred to as high-performance computing and involve multiple computing nodes.

Both supercomputing systems (we are mostly talking about TOP500 systems) and cluster computing systems have a structure similar to the one depicted in Basic Structure of TOP500 and Similar Supercomputers, though the Xeon Phi coprocessor accelerator is not as commonly used in cluster computing. Both are housed in datacenters and are cooled by a separate, shared cooling system rather than by coolers embedded in each node. But the biggest difference is whether the I/O reads and writes between the server racks have been rigorously planned for the lowest latency and highest throughput or not.

Recall from my Euler-Mascheroni Constant Record and my First and Second posts with an in-depth analysis of y-cruncher and its mathematical constants that I was whining the whole time about the bottleneck caused by insufficient I/O read/write throughput. Even if you have a good CPU and a lot of RAM, if you are calculating or processing data that is too large to fit entirely in RAM (and this covers most of the cases that actually require a high-performance system!), you have to use secondary storage such as SSDs or HDDs. Even if we put these drives in a RAID configuration that multiplies the throughput, it is still a fraction of the speed of RAM and the CPU. If we are using multiple computing nodes on programs or computing loads where the nodes depend on each other's generated or processed data, this bottleneck causes such underutilization of the CPU cores that some cores stay idle throughout the computation.

Cluster computing systems are basically a link of server computers that usually read and write to a shared storage system (a separate node designed exclusively for this), and they are not designed to perform jobs across many nodes whose data is interdependent. Instead, these server computers are only physically connected, with a normal-to-high-speed internal Ethernet connection, to a frontend node (alongside the shared storage node(s)) that allocates jobs using scheduling systems such as SLURM, PBS, SGE, or Torque (these are also used in supercomputers). The nodes that actually do the computations are not meant to exchange information with each other, and doing so results in painful bottlenecks that underutilize both the CPU and the RAM.

On the contrary, supercomputers are rigorously planned so that each node interconnects with the others and exchanges data between separate server nodes rapidly, keeping the bottleneck as low as possible. Although the underlying secondary storage will still be a bottleneck, the nodes are interconnected as a mesh and carefully placed so that the length of the cables, which also contributes to I/O latency, is minimized as much as possible. High-speed Ethernet is the most common interconnect by count in the TOP500 rankings; it is already faster than the subpar connections in cluster computing systems, but the InfiniBand interconnect is most used in the systems higher up the TOP500 rankings because it pushes node-to-node throughput to the edge. It uses external I/O switches to offload I/O-related work from the CPU, multiple fabric connectors can be combined to multiply the throughput between two nodes, and mesh-based topologies mean that indirectly connected nodes don't suffer as much degradation. It currently competes with the similar Intel Omni-Path, with Ethernet, and with Fibre Channel (although fibre connections are more efficient for long-distance links than within the same datacenter). Remember that NVIDIA isn't just a gaming-GPU manufacturer; one of its biggest businesses is GPU-based accelerators used in supercomputers, clusters, and servers. The importance of InfiniBand in these systems is why NVIDIA spent billions of dollars to acquire Mellanox, the largest manufacturer of InfiniBand switches and fabrics, with the only alternative manufacturer being Intel. This acquisition deserves a good spotlight because it shows where high-performance computing is headed, rather than being dismissed as a gaming-GPU maker buying a networking company for some unknown reason. These technologies have massive applications, including directly interconnecting GPU accelerators between nodes. While supercomputers still have more of an I/O bottleneck, in both latency and throughput, than a single computer or server, new technologies keep bridging the gap between I/O, RAM, and CPU.
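
For a feel of how data actually moves between nodes, here is a minimal MPI sketch in C, a hedged illustration rather than the code of any particular system: rank 0 sends an array of doubles to rank 1, and the MPI library routes it over whatever interconnect the cluster provides, whether plain Ethernet or InfiniBand.

```c
/*
 * A minimal MPI sketch (an illustration, not how any particular TOP500
 * machine is programmed): rank 0 sends an array of doubles to rank 1,
 * and the MPI library carries it over whatever interconnect is
 * available. Build with mpicc and run with something like
 * `mpirun -np 2 ./a.out`.
 */
#include <stdio.h>
#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[COUNT];
    if (rank == 0) {
        for (int i = 0; i < COUNT; i++)
            buf[i] = 0.5 * i;
        /* blocking point-to-point send to rank 1 */
        MPI_Send(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", buf[0], buf[COUNT - 1]);
    }

    MPI_Finalize();
    return 0;
}
```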

Distributed Computing

Distributed computing is a huge-scale parallel computing method that can be used when the divided jobs are almost completely independent of each other. Since the jobs do not depend on each other, the node computers that compute individual jobs do not need to connect to each other. These jobs can therefore run in different datacenters, or even on different personal computers around the world that are not tightly connected; each computer simply does its own job. Projects such as Folding@home or SETI@home are good examples, and both are also grid computing: random computers connected to the web download individual jobs, compute them individually, and then upload the finished results to a server or a peer-to-peer network that distributes the jobs. Distributed computing is not usually referred to as high-performance computing, but there are statistics for the total floating-point operations performed across a network computing a specific group of tasks.

I hope you now understand that parallel computing, which is required to make a high-performance system, is a method of distributing jobs to many different cores or even nodes, and that high-performance computing, which includes supercomputing, is a connected link of server racks, possibly with accelerators such as GPUs to speed up computations, that distributes jobs programmatically using dedicated software.

Cloud Computing

I'm going to talk about cloud computing now, which is also parallel computing. I guess most people reading this have heard of Amazon Web Services, Microsoft Azure, and Google Cloud Platform, as well as cloud virtual private server (VPS) services such as DigitalOcean or Vultr. These were surely developed with different purposes than what I explained above, since they are designed to host programs or services on a server, but like supercomputers they run in datacenters. Important technologies developed alongside the surge of cloud computing are rapidly being applied to high-performance computing, including supercomputing, such that the boundary between them is slowly diminishing.

The fundamental idea of cloud computing is that if you bundle the underlying code of a backend application or a virtualized environment, together with its dependencies, into a standardized container, it can run on any system that hosts such containers through a standardized runtime installed on that system, regardless of the server's OS or location or the computing power allocated to the container. The first technology needed for this is virtualization, which allows many virtual, independent instances with different OSes to run independently of the host server. Docker is a standard framework for creating such containers, and Kubernetes is a great example of a framework for automatically and systematically deploying these standardized containers on demand. The biggest advantage of these systems is that once the code is written, the computing power of the infrastructure can scale up as demand grows without dramatic code changes, unlike traditional infrastructure that depends on the environment of each server. The application or service can also be hosted in datacenters worldwide that use the same standard interface, forming a content distribution network that deploys the containers with low latency around the world.

So, what does this have to do directly with supercomputing? Cloud computing technologies make it very easy for ordinary people to use supercomputers. Stay tuned for the next part of this series on demystifying parallel computing and supercomputers.

Second Post

This post is the first part of a series on demystifying parallel computing and supercomputers.
