Demystifying Supercomputers and Parallel Computing: Part 2


Introduction

Note: As with the last post, I hope anyone with at least some interest in computer hardware can understand most of this, although there are some intermediate electronic engineering perspectives in these posts.

If you have not read the First Post of this series, try reading it first. If you just want to know about cloud computing applications in high-performance computing, you can continue from here.


HAL 9000 from 2001: A Space Odyssey (Courtesy of Wikipedia). This is what people picture when they think of a supercomputer, and also one of the first imagined incarnations of Artificial Intelligence to appear in a movie.

People still hear news from TOP500 releases about China and the US competing over who has the most floating-point operations per second (FLOPS). It looks like rocket science, and people assume these geek engineers are working at the frontmost edge of bleeding-edge science. Well, it isn’t quite that.

Cloud Computing

I will repeat what I wrote in the First Post for those who did not read it.

Cloud computing is also parallel computing. I guess most people reading this have heard of Amazon Web Services, Microsoft Azure, and Google Cloud Platform, as well as cloud Virtual Private Server services such as DigitalOcean or Vultr. These are surely developed for different purposes than the supercomputers I described in the First Post, since they are designed to host programs or services on servers, but just like supercomputers, they run in datacenters. Important technologies have been developed alongside the surge of cloud computing, and they are rapidly being applied to high-performance computing, including supercomputing, such that the boundaries between the two are slowly diminishing.

The fundamental idea of cloud computing is that if the underlying code of a backend application, or a virtualized environment, is bundled together with its dependencies in a standard way into a container, that container can run on any system with the standard container runtime installed, regardless of the server's OS, its location, or the computing power allocated to the container. The first technology needed for this is virtualization, which allows many independent virtual instances with different OSs to run on a single server, independent of that server. Docker is the standard framework for creating such containers, and Kubernetes is a great example of a framework for automatically and systematically deploying these standardized containers according to demand.

The biggest advantage of adopting these systems is that, once the code is written, the computing power of the infrastructure can scale up as demand grows without dramatic code changes, unlike traditional infrastructure that depends on the environment of each individual server. The application or service can also be hosted in datacenters worldwide that expose the same standard interface, forming a content distribution network that deploys the containers with low latency around the world.
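
As a rough sketch of what this looks like in practice (the image name, Dockerfile, and mounted paths here are hypothetical), a containerized workload might be built once and then run anywhere a container runtime is installed:

```bash
# Build an image from the Dockerfile in the current directory; the Dockerfile
# pins the OS, libraries, and application code into a single bundle.
docker build -t my-analysis .

# Run the same image on any machine with a container runtime, regardless of
# the host OS, mounting a local data directory into the container.
docker run --rm -v "$(pwd)/data:/data" my-analysis
```

The same image can then be handed to an orchestrator to be replicated across as many nodes as demand requires.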

Cloud Computing + High-Performance Computing

Cloud computing is designed to deploy applications and computing workloads across multiple datacenters around the world through the same interface. Thus it is widely used for websites and for mobile, desktop, and web-based applications. But because cloud computing is a fairly revolutionary way of managing a large number of computing nodes, offloading burdens that would otherwise fall on software engineers, it is also changing how supercomputers and clusters (= high-performance computing systems) allocate their resources to their users.

Recall that supercomputers and clusters usually use a batch-based job scheduler to choose the order in which jobs are executed, with SLURM, PBS, SGE, and Torque as examples. These schedulers, except for the newer SLURM (used on roughly 60% of all TOP500 systems because it fixes various problems of legacy job schedulers), mostly come from codebases written in the early 2000s, and they require software engineers and researchers using HPC to adapt their workloads to the dependencies (since only a single OS version applies) and the presupplied programs/frameworks of each cluster or supercomputer. Shell scripts written by the job submitter are used to submit jobs automatically, as sketched below. This means that when one moves from one HPC cluster/supercomputer to another, the person who creates the jobs must do a major overhaul of dependencies, workflows, and optimizations.

The advantage of NOT using cloud containerization is that there is no resource overhead from an additional software layer, which can be a reason to avoid cloud computing for throughput-critical applications that communicate with the low-level hardware directly. But the overhead may be very minor compared to the size and intensity of the computations performed (this also depends on the workload). Spack aims to partially solve this problem by providing a package manager for supercomputers (much like the non-superuser package manager conda, or traditional package managers such as apt/rpm), but it only solves the problem about as well as conda does, still leaving disparities in serving standard environments across all hardware.
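
To make the traditional workflow concrete, here is a minimal sketch of a SLURM batch script, submitted with `sbatch job.sh`; the partition, module names, and executable are hypothetical and cluster-specific:

```bash
#!/bin/bash
#SBATCH --job-name=example-job      # name shown in the queue
#SBATCH --nodes=2                   # number of nodes requested
#SBATCH --ntasks-per-node=32        # MPI ranks per node
#SBATCH --time=02:00:00             # wall-clock time limit
#SBATCH --partition=normal          # queue/partition name (cluster-specific)

# Dependencies come from whatever the cluster preinstalls as environment
# modules, which is exactly what makes moving between clusters painful.
module load gcc openmpi
# With Spack this line might instead be: spack load openmpi

srun ./my_simulation input.dat      # hypothetical executable and input file
```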

This is changing, relieving software engineers of that burden, thanks to Amazon, Google, OpenStack (the open-source cloud infrastructure backend software), and other cloud computing pioneers. Now, once a container has been successfully prototyped on a different system, people no longer have to care about the dependencies; the system manages the deployment for them. While cloud computing was originally developed to add an extra software container layer to each server/computing node and to automatically allocate and set up containers for users, containers can also be linked across multiple nodes that communicate to perform a task together, such as hosting an application or analyzing data, and to scale a computation as required. This is called container orchestration (e.g. Docker Swarm, Kubernetes, Apache Mesos). Frontera, a top-10 supercomputer as of November 2019, uses a cloud computing interface called Jetstream to offer containerized applications, supporting different dependencies, OS versions, and distributions in each container and making it easy to expand to larger resource allocations for intensive production computing beyond prototyping.
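
As a hedged sketch of what orchestration looks like from the user's side with Kubernetes (the registry, deployment name, and replica count are hypothetical):

```bash
# Deploy a prototyped container image onto a Kubernetes-managed cluster.
kubectl create deployment my-analysis --image=registry.example.com/my-analysis:latest

# Scale the same workload up for production without changing any code;
# the orchestrator schedules the extra replicas onto available nodes.
kubectl scale deployment my-analysis --replicas=20
```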

Newer programs such as Singularity, Vagrant, and JupyterLab, alongside Docker, are making life easier and easier for software engineers who use high-performance computing infrastructures.
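
Singularity, for example, can run Docker-style images without superuser rights, which is what makes it attractive on shared HPC systems; here is a rough sketch with hypothetical image and script names:

```bash
# Convert a Docker image into a Singularity image file; no root access is
# needed later to run it on the cluster.
singularity pull my-analysis.sif docker://python:3.8-slim

# Execute the containerized tool (inside a batch job, for instance); the host
# only needs the container runtime, not the tool's dependencies.
singularity exec my-analysis.sif python3 analyze.py
```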

Cloud Storage

Until now we have only been talking about resource allocation within nodes, which involves the CPU, RAM, and accelerators such as GPUs and coprocessors. What about storage? Recall that storage has recently become the most important factor in optimizing high-performance computing and will continue to be in the future: I/O throughput is the biggest bottleneck in most workloads that require a lot of RAM and CPU cores.

Cloud storage itself is a very old concept, in fact the first form of cloud computing technology to be widely implemented. Dropbox, Google Drive, and OneDrive have been around since the mid-2000s, and cloud storage is nothing special today. Something uploaded in Europe can be downloaded from a server located in Asia without lag or performance degradation, either by connecting datacenters with dedicated networks to serve files worldwide or by storing them on multiple servers. When files are distributed to many servers worldwide to decrease the latency of serving content in websites or applications, this is called a Content Delivery (Distribution) Network (CDN), like Cloudflare or Fastly (which is accelerating GitHub Pages right now). These technologies have really matured, and they have also been actively implemented in intranet-level communications.

Cloud container orchestration infrastructures optimize this automatically as well, instead of software engineers doing it manually. Remember that supercomputer nodes are connected by either high-throughput multi-gigabit Ethernet or InfiniBand in a mesh-like topology optimized to maximize I/O efficiency. Depending on how the nodes are set up in each supercomputing datacenter, there may be dedicated storage nodes inside the mesh that mainly take care of storage I/O reads and writes, or storage may be embedded in every computing node. These properties can be fine-tuned to ensure that all data in an infrastructure is accessible as close as possible to the computing nodes performing a job that depends on communication with other nodes. Most, if not all, widely used container orchestration frameworks support this. I have already discussed broadly how this works and how important it is in the First Post, so I won't repeat it here.

Future of Supercomputing in the Cloud Era

Honestly, supercomputers are needed less and less as time goes by because of the rise of cloud computing. I cited y-cruncher as an example of a computing workload for which I/O read/write throughput is critical, but honestly it is really an edge case. y-cruncher workloads are bleeding-edge computations that require so much precision and so many CPU resources that the program is used as an overclock stress-testing tool. More and more computational workloads are becoming less I/O intensive, so practical workloads have no bottleneck from I/O and can be deployed worldwide without problems. Those that are “production” are meant to be served to many people, mostly around the world, which strengthens the point.

Blockchain technologies can enable error checking of transactions across many nodes worldwide, and more and more technologies are making things truly worldwide. JavaScript-based static sites (like this exact website you are viewing right now, made with Jekyll) are gaining attention again alongside decentralization technologies, especially because of the convenience of coding both server-side and client-side applications in JavaScript, integrating Node.js with frameworks such as Vue.js, Angular, and React, rather than using more outdated server-side-only languages like PHP. Serverless or FaaS architectures like AWS Lambda allow these static websites to trigger server-side database access and workloads only when a designated event fires, leaving everything else to be rendered by clients. This decreases the need for heavy centralized computation even further, instead of requiring dedicated server instances to be active 24/7 (for example, hosting a WordPress blog that requires everything to be rendered by the server with PHP).
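
As a small illustration (the function name is hypothetical and assumes a function has already been deployed), such a function can be invoked on demand from the AWS CLI, with nothing running and nothing billed between invocations:

```bash
# Invoke a deployed Lambda function for a single event; no dedicated server
# instance exists between invocations.
aws lambda invoke --function-name contact-form-handler response.json

# Inspect the response returned for this one invocation.
cat response.json
```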

Distributed and grid computing enables research tasks that can be divided into partitions to be carried out by volunteers around the world. In a similar fashion, edge computing technologies, which incorporate many small embedded nodes to process data efficiently with lower latency, are pulling originally centralized high-performance computing workloads such as deep learning inference out toward mobile devices. They also accelerate applications, including websites, with the help of mesh-like virtual private networks such as ZeroTier, which loosen the traditional server-client relationship and let each edge node contribute equally to the whole system. Supercomputers will continue to exist and be widely used for the very intensive research computations that actually eat up most supercomputer time, but workloads that do not need supercomputers will keep moving to hosted cloud infrastructures, or to supercomputer/cluster infrastructures that have adopted cloud computing frameworks.

First Post

This post is the second part of a series on demystifying parallel computing and supercomputers, focused on cloud computing applications in high-performance computing systems.
