Optimizing y-cruncher to Actually Set World Records

Note: this post should be understandable to any computer power user. It is meant to supplement Mr. Alexander Yee’s y-cruncher documentation in simpler terms, not to replace it. You still have to read his website for the crucial technical details required to set world records.

Difficulties of mathematical constants (excerpt from y-cruncher v0.7.8.9506; Mr. Yee kindly let me post this):

Compute a Constant: (in ascending order of difficulty to compute)

  #         Constant                      Value        Approximate Difficulty*

Fast Constants:
  0         Sqrt(n)                                         1.46
  1         Golden Ratio                = 1.618034...       1.46
  2         e                           = 2.718281...       3.88 / 3.88
Moderate Constants:
  3         Pi                          = 3.141592...       13.2 / 19.9
  4         Log(n)                                        > 35.7
  5         Zeta(3) (Apery's Constant)  = 1.202056...       62.8 / 65.7
  6         Catalan's Constant          = 0.915965...       78.0 / 105.
  7         Lemniscate                  = 5.244115...       60.4 / 124. / 154.
Slow Constants:
  8         Euler-Mascheroni Constant   = 0.577215...       383. / 574.
Other:
  9         Euler-Mascheroni Constant (parameter override)  
 10         Custom constant with user-defined formula.  

*Actual numbers will vary. Radix conversion = 1.00

If you have not read the First Post and Second Post on the significance and algorithms of mathematical constants, read them first to understand the algorithms and the significance of all the main mathematical constants.

Note that more mathematical constants are available through the custom formula files shipped with the executable, but they involve more complicated math. If you really want to set custom formula records, you will need to research them further on your own.

Overview

Now we get real. How do we compute actual world records with y-cruncher? The first thing to keep in mind is hardware. The CPU doesn’t matter much as long as it is a reasonably high-end desktop CPU or a mediocre server CPU. Disk read/write speed is the bottleneck, so that is what you have to care about most.

The way to reduce the bottleneck is (excerpt from y-cruncher):

The “fastest storage configuration” because that’s the bottleneck.
The “largest memory configuration” because it minimizes the amount of disk I/O that is needed.
A “mediocre CPU” because that’s all you need before you hit the disk bottleneck.

Storage

As I wrote in the First Post, the CPU is not what bottlenecks you. An 8-core 16-thread AMD Ryzen 7 3700X was sufficient to set a record for the lightest mathematical constants, as long as you have RAID swap storage amounting to several TB. HDDs are durable, so they can withstand massive writes, but they are slow. SSDs are fast, so there is less of a bottleneck, but such an intensive amount of writes wears them out to the point of being almost single-use, and the disk will likely fail the next time it is used for intensive operations. Optane SSDs are fairly durable and fast at random R/W, though not as fast as RAM; they are expensive, though not as expensive as adding more RAM. Because of this, people mostly stick to HDDs and raise the overall speed by combining multiple drives in RAID arrays. Hardware reviewers like Dr. Ian Cutress have also tried a twist such as Optane DIMMs, which hold 512 GB per DIMM, unlike the usual 32 GB maximum for ECC RAM.

RAM

Memory allocation currently performs worse on Linux than on Windows, so use Windows (preferably a server version, with automatic updates turned off) for a meaningful increase in speed. However, some CPUs, such as certain generations of AMD Ryzen Threadripper, favor the Linux CPU scheduler, so consult Mr. Alexander Yee about which OS to use if you have a choice. There are also features called “Locked Pages” and “Large Pages” that increase throughput and avoid wasting time on allocating and de-allocating memory by keeping the memory pinned to the program. Enabling them requires full superuser permission and improves I/O time, although it is not crucial. I know this program is very frequently used to stress test overclocked computers, but, unlike what Mr. Alexander Yee says on his webpage, I recommend against overclocking the CPU or RAM for any large computation. Overclocking causes loads of silent errors that don’t matter much in casual workloads like games, and this computation isn’t casual at all.

ECC is also pretty important as a last line of defense against errors, although I never got an ECC-corrected error in any of my computations. Overclocking ECC RAM is also pointless.

CPU

This is the pattern of CPU utilization in a typical swap computation with 32 threads and AVX2 for a world record on y-cruncher v0.7.8.9506, captured from one of my own computations. At the start of the computation the CPU is almost fully utilized, since the initial steps run in RAM. After the program offloads to swap on secondary storage, there is a clear difference between the periods when the CPU is fully utilized and the periods when it is utilized less than 25% because of I/O bottlenecks. As a computation runs longer to store more digits, the zone where the CPU is underutilized stretches more and more. Since the time the CPU spends fully utilized is significantly shorter than the I/O-bottlenecked time, the number of cores is not very important for reducing computation time.

An 8-core 16-thread AMD Ryzen 7 3700X was sufficient to set a world record for the lightest mathematical constants, but the problem with desktop (including HEDT) CPUs is that the total RAM is capped, normally at around 128 GB. This adds to the I/O bottleneck, making such CPUs impractical for more complicated constants that access memory far more for the same number of digits. The reason people use multi-socket server/workstation CPUs is that they can house more RAM, which reduces the bottleneck. More cores are not what drastically increases speed. I would be very interested in the results should this program ever be run on supercomputer or mainframe builds connected to each other with Mellanox InfiniBand fabric.

Configuration in Linux for Using the Optimized y-cruncher

First, if you are running on Linux (the Windows version has all the features built in by default and does not require additional installations), I recommend using the dynamic version, especially in multi-socket environments. As of y-cruncher v0.7.8.9506 and earlier, it requires the system dependencies found in Ubuntu 18.04 (or distros based on that version) and also requires numactl to be installed. CentOS 8 is also tested to work without any other tweaks as long as numactl is installed, and since the CentOS kernel is regarded as more stable and better tested than the Debian/Ubuntu one, I recommend it for world-record computations if you can choose your OS. This does not mean that your host must run Ubuntu 18.04 or CentOS 8 to use the dynamic version; you can run it on any recent Linux distro.

If you do not have root permissions and thus cannot use the default package repositories, you can use Miniconda to install numactl-devel-cos6-x86_64, which provides the libnuma.so.1 library required by the dynamic version. Then add the directory containing libnuma.so.1 to the environment variable LD_LIBRARY_PATH (export LD_LIBRARY_PATH="/path/to/lib:$LD_LIBRARY_PATH"). You can locate the library by running find . -name "libnuma.so.1" in the directory where the conda environment lives, but the variable needs the absolute path, not the relative one. Use double quotes as shown, since single quotes would stop the shell from expanding $LD_LIBRARY_PATH.
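The lookup-and-export step described above can be sketched as two small shell helpers. This is only a sketch under assumptions: the function names find_lib_dir and prepend_ld_path are made up for illustration, and CONDA_PREFIX is assumed to point at the activated conda environment.

```shell
# find_lib_dir PREFIX LIBNAME
# Print the absolute directory holding the first LIBNAME found under PREFIX.
# (Hypothetical helper name, not part of y-cruncher or conda.)
find_lib_dir() {
    _hit="$(find "$1" -name "$2" 2>/dev/null | head -n 1)"
    [ -n "$_hit" ] || return 1
    # cd + pwd turns a possibly relative hit into an absolute directory
    (cd "$(dirname "$_hit")" && pwd)
}

# prepend_ld_path DIR
# Put DIR in front of LD_LIBRARY_PATH. The ${VAR:+...} form avoids a
# dangling colon when the variable was empty or unset.
prepend_ld_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Example (run inside the activated conda environment):
#   numa_dir="$(find_lib_dir "$CONDA_PREFIX" libnuma.so.1)" && prepend_ld_path "$numa_dir"
```

The same two helpers would work for any other library directory that has to be added, since nothing in them is specific to libnuma.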

CentOS 7 (or any Linux distro with a kernel version >= 3.10.0) is also tested to work, but requires extra work because the default libstdc++ is an incompatible version. To fix this, you can install the Red Hat Developer Toolset (devtoolset-9) from the CentOS SCLo RH x86_64 repository and activate its environment (I have not tested this, so I can’t guarantee that it works). Preferably, and without needing root permissions, install libstdcxx-ng through Miniconda in the same way as numactl-devel-cos6-x86_64 (you can install both at the same time). As with numactl-devel-cos6-x86_64, add the directory containing libstdc++.so.6 to LD_LIBRARY_PATH (export LD_LIBRARY_PATH="/conda/path/to/cpp:/conda/path/to/numa:$LD_LIBRARY_PATH"). You can locate the library by running find . -name "libstdc++.so.6" in the directory where the conda environment lives, but again the variable needs the absolute path, not the relative one, and the value must be in double quotes so that $LD_LIBRARY_PATH is expanded.

If you get errors related to libcilkrts.so.5 and/or libtbb.so.2 when executing y-cruncher after this configuration (common if you run it as a remote command or with bash -c), also add the full path of the Binaries directory of the y-cruncher download to LD_LIBRARY_PATH, separating the directories with colons as before.

Recent OS containers with light OS-level virtualization, such as Docker, LXC, or Singularity, also work and were tested: they let you install the correct dependencies from package repositories without much virtualization overhead. However, you don’t really need OS-level virtualization even without root permissions, as long as your host OS has a recent kernel. OS-level virtualization does not raise your kernel version even if you use a newly released OS container, so whether y-cruncher works on your system depends entirely on your host OS.
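Since the host kernel is what counts, it can be worth checking it before anything else. Below is a minimal sketch of such a check. The helper name kernel_at_least is hypothetical, and the 3.10.0 threshold in the example is simply the oldest kernel this post reports as tested, not an official requirement. It assumes the minor and patch numbers stay below 1000.

```shell
# kernel_at_least CURRENT MINIMUM
# Succeed when CURRENT >= MINIMUM, comparing major.minor.patch numerically.
# Anything after the numeric part (e.g. "-1160.el7.x86_64") is ignored.
# Assumes minor and patch numbers are below 1000.
kernel_at_least() {
    _cur="${1%%[!0-9.]*}"
    _min="${2%%[!0-9.]*}"
    IFS=. read -r _c1 _c2 _c3 _rest <<EOF
$_cur
EOF
    IFS=. read -r _m1 _m2 _m3 _rest <<EOF
$_min
EOF
    # Fold each version into one integer; missing fields count as 0
    _cur_n=$(( ${_c1:-0} * 1000000 + ${_c2:-0} * 1000 + ${_c3:-0} ))
    _min_n=$(( ${_m1:-0} * 1000000 + ${_m2:-0} * 1000 + ${_m3:-0} ))
    [ "$_cur_n" -ge "$_min_n" ]
}

# Example: uname -r reports the HOST kernel, even inside a container.
#   kernel_at_least "$(uname -r)" 3.10.0 && echo "host kernel is at least 3.10.0"
```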

If you use the static version instead of the recommended dynamic version, you have to use the custom-coded Push Pool multiprocessing framework, which is the preferred framework for desktop-level CPUs of around 16 threads. For large workloads on one or more CPU sockets totaling over 64 threads, however, it is less efficient than Intel’s Cilk Plus or Threading Building Blocks. Threading Building Blocks is Intel’s replacement for Cilk Plus, but its performance in a past world-record computation of Pi has been underwhelming so far, so the better-performing Cilk Plus is used for now (this could change later).

The y-cruncher program will then automatically choose the recommended framework for each component based on the number of cores, and you will not be restricted in your selections as long as you have the dynamic version.

The y-cruncher Program

Every computer is different, so each one must be tuned to reach its maximum throughput and cut computation time dramatically. The y-cruncher program has some tools for this. The first tool is the stress testing application, which checks that your build can withstand certain Fast Fourier Transform operations and other heavy computations. The second and perhaps most important tool is the I/O Performance Analysis, which matters if you are (as is likely) computing with swap on secondary storage like SSDs or HDDs. After running this benchmark, you have to tweak the Far Memory Tuning configuration based on the results. How to do this is explained later.

To run real world-record-level computations, go to the Custom Compute a Constant menu. Note that, as advised by Mr. Yee, you should run Test Swap Multiply in Advanced Options if you are computing more digits than the current record for Pi. First choose the constant and understand the algorithms and their performance. Then you can choose whether to use swap storage and set how much RAM the program should use. This is automatically set to 90-95% of all available RAM, and that is the guideline to keep if you tweak the configuration, since the remaining 5-10% is needed by the OS and other background tasks.

You then have to choose the path where the digits will be saved and, for swap computations, where the swap files should be stored during the computation. If you already have hardware RAID available, you can just specify its path; if not, the program can set up a custom software RAID configuration from the paths you list. You can also optionally have a backup command run automatically by using the Post-Checkpoint Command in the Checkpoint menu.

You can also set the I/O Memory Buffer Allocator here. If you have multiple sockets and use the dynamic Linux version or the Windows version, it is automatically set to the libnuma version of Node Interleaving; otherwise it is set to the custom-coded version of Node Interleaving. The Affinity configuration designates the cores or threads that have better access to the secondary storage than others. It is normally left unchanged if the OS handles everything.
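To make the 90-95% guideline concrete, here is a tiny sketch that turns a RAM size into a memory budget. The helper name ram_budget_mib is made up for illustration, and the 128 GiB figure in the example is just a sample value; y-cruncher picks its own default, so this is only useful as a sanity check when you override it.

```shell
# ram_budget_mib TOTAL_MIB PERCENT
# Print PERCENT of TOTAL_MIB, rounded down to a whole MiB.
# (Hypothetical helper, not a y-cruncher option.)
ram_budget_mib() {
    echo $(( $1 * $2 / 100 ))
}

# Example for a machine with 128 GiB (131072 MiB) of RAM:
#   ram_budget_mib 131072 90    # prints 117964
#   ram_budget_mib 131072 95    # prints 124518
```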

The last configuration left is the I/O Buffer Size and the Bytes/Seek parameter. This is where the I/O Performance Analysis benchmark comes in. I will only talk about HDDs, since SSDs produce very different figures (I have seen Bytes/Seek values of over 80 MB for mixed NVMe + HDD clustered filesystem configurations and, at the other extreme, tiny values around 128 KB for a pure NVMe RAID configuration); for those you should run the I/O Performance Analysis and work out the appropriate Bytes/Seek value from it. The rule of thumb for the I/O Buffer Size is 64 MiB times the number of hard drives. If you do not know how many drives are in the array, divide the sequential read rate shown in the I/O Performance Analysis results by the sequential read speed of one hard drive or SSD to infer it.

For the Bytes/Seek parameter you first have to understand the logic. It is the number of bytes the hard drives can read sequentially in a time equal to the disk seek time. A normal hard drive has a seek time of 10 ms and a sequential read rate of 100-200 MB/s, so Bytes/Seek comes out at around 1-2 MiB per drive. Start by assuming 2 MiB, because setting the value too small hurts computation speed more than setting it too large, and multiply it by the number of hard drives to get the starting value. Then take the sequential read result of the I/O Performance Analysis directly, multiply it by the disk seek time of 10 ms, and fine-tune the value in the direction the displayed analysis results suggest. If one or more of the benchmark results are shown in red text, you should definitely increase Bytes/Seek dramatically, as this can cause a big bottleneck. The Sequential Read (Write) throughput should be about three times the throughput of Threshold Strided Read (Write).

You have to experiment with this several times to reach the best speed for world-record-sized computations and to utilize your CPU as much as possible. The time invested here really helps the computation, and the result can be reused in another computation with the same system configuration. If there is a big difference between the Threshold Strided Read and Threshold Strided Write speeds, there is unfortunately not much left to tune. This commonly happens on distributed file systems; if it occurs, tune Bytes/Seek so that the lower of the two is not less than 1/4 of the sequential speed, since nothing more can be done beyond that. Check This Page for a more in-depth guide overall.
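The arithmetic behind these rules of thumb can be sketched with a few shell helpers. All function names here are hypothetical and the numbers in the usage note are sample values, not measurements. Rates are taken in MB/s and seek times in ms, so Bytes/Seek is the rate multiplied by the seek time.

```shell
# io_buffer_mib DRIVES
# Rule of thumb from the text: 64 MiB per hard drive in the array.
io_buffer_mib() {
    echo $(( 64 * $1 ))
}

# estimate_drive_count ARRAY_MBPS SINGLE_MBPS
# Infer the drive count from the measured array rate and one drive's rate
# (integer division, so the result is rounded down).
estimate_drive_count() {
    echo $(( $1 / $2 ))
}

# bytes_per_seek SEQ_MBPS SEEK_MS
# Bytes read sequentially during one seek: (MB/s * 10^6) * (ms / 1000).
bytes_per_seek() {
    echo $(( $1 * 1000 * $2 ))
}

# strided_ok SEQ_MBPS STRIDED_MBPS
# Succeed when the strided rate is at least 1/4 of the sequential rate.
strided_ok() {
    [ $(( $2 * 4 )) -ge "$1" ]
}
```

For a hypothetical array measured at 600 MB/s built from drives that each manage 150 MB/s, estimate_drive_count 600 150 prints 4, io_buffer_mib 4 prints 256, and bytes_per_seek 600 10 prints 6000000 (about 6 MB) as a starting point for further tuning.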

Miscellaneous

Now that we hopefully have a stable system with the fastest possible I/O throughput, we can go on and actually try setting a world record. If you set a record and the size of your digit output is reasonable, please consider uploading it to an open repository so that other researchers can use the digits when they need them, such as Google Drive (if your G Suite for Education/Business account comes with unlimited storage; check This Link out too) or the Internet Archive.

If you are using your own build, you have to manage the heat: more heat makes the parts more likely to introduce silent errors into the digits. Use very good CPU coolers and case fans. I think liquid cooling is not advisable for professional builds that stay on for a long time because of the risk of leaks; it may make sense only if you mainly play games and don’t want noisy fans. Otherwise air cooling is normally more stable. Passive cooling in server racks is also very solid.

For more information on speed optimization and on methods to prevent silent corruption in y-cruncher, thoroughly read the Performance Tips section of the y-cruncher website and every link and text under it, which includes Algorithms and Internals, the FAQ, Multi-Threading, Memory Allocation, Swap Mode, and, for people who need them, Custom Formulas.
