Know Your Overclocking Benchmark and Stress Testing Methods and Tools

17 minute read

Introduction

Geekbench, 3DMark, Cinebench, Unigine, CPU-Z/GPU-Z, AIDA64, SiSoftware Sandra, Speccy, PassMark, FurMark, MemTest64, PCMark, on and on, and last but not least Prime95, SuperPi, and y-cruncher. Do you actually know what these tools do when you use them? And for stress testing, are you content leaving your overclocked build unhardened, a build that may break down under realistically possible workloads, while telling yourself “it will be okay in games”? There are a lot of benchmarks and stress testing applications, but even some professional reviewers either use them without proper methodology or compare different builds without isolating the right factors. This post explains how benchmarks and stress testing applications should work, and also recommends some programs not many people have tried.

Benchmarks

First, let’s look at benchmarks. Although people confuse the concepts of benchmarks and stress testing tools, and lots of programs are designed to do both, the methodologies must not be the same: the tools may overlap, but the two activities have different reasons and objectives. Formal benchmarks meant to compare one new component against other new ones must be designed so that only one factor differs quantitatively between test systems, namely the component itself (this is middle school science… :rage:). Every other variable must be held equal, including all the other components in the computer and the OS/environment/software used to benchmark. This is the fundamental principle of scientific research or investigation: the relation between independent variables, dependent variables, and control variables. It is similar to a function. Let’s assume a linear function y = ax + b. The independent variable (x) is what you deliberately change. There must normally be only one independent variable in a series of experiments (unless a computer is simultaneously managing multiple variables in that investigation, but even then the computer is running one investigation with one independent variable at a time, just fast enough that it looks simultaneous), because with more than one you don’t know which independent variable caused how much of the change you see, and thus you cannot deduce a meaningful conclusion. The dependent variable (y) is what changes inside the system during an investigation as you alter the independent variable. It depends on the independent variable: if the values of the independent variable and the control variables are the same, you should get the same dependent variable value, which is why it is called a dependent variable. Control variables (a and b) are everything else in the environment that can change the dependent variable in a way unrelated to the independent variable, thus invalidating the whole experiment if changed at all.

There are so many amateur-level computer reviewers who ignore this (I have seen even some who officially receive products from manufacturers to review, meaning manufacturers regard them as somewhat trustworthy) and do not take care of the control variables (using different RAM, motherboards, secondary disks, even cooling, etc.), or who try to take care of them but miss critical aspects that can change results drastically.

So benchmarks, especially those designed to compare one component against others, have to be experiments in an environment equal in everything except that component, and the changes that result from swapping that component represent its relative performance. If environments can’t be made exactly equal, run the same experiment multiple times and take an average trustworthy enough to cancel out abnormalities. But not everyone is a PC reviewer. Most of you just want to check and improve the performance of the system in your own room, so unlike in an optimal experimental setup you don’t really have to swap components to define independent and dependent variables (again, unless you are a reviewer). The CPU performance score on your machine is close to the real maximum performance you will get in everyday workloads. In this case, your system as a whole is the independent variable if you want to compare it with other complete systems. You still have to run the test multiple times and average the results to reach a conclusion you know you can trust, as sketched below. But if you want to identify which component is bottlenecking inside your system, you still have to do what the reviewers should do, albeit in a more casual, qualitative fashion.
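
As a minimal sketch of that repeat-and-average workflow (`run_benchmark` here is just a placeholder for whatever program or function you actually time):

```python
import statistics
import time

def run_benchmark() -> float:
    """Time one run of a stand-in workload; replace with your real benchmark."""
    start = time.perf_counter()
    sum(i * i for i in range(10**6))  # placeholder workload
    return time.perf_counter() - start

# Repeat the identical test and average it so abnormalities cancel out.
runs = [run_benchmark() for _ in range(10)]
mean, stdev = statistics.mean(runs), statistics.stdev(runs)
print(f"mean {mean:.4f}s, stdev {stdev:.4f}s ({stdev / mean:.1%} spread)")
# A wide spread means the control variables (background load, thermals,
# power plan, ...) are not steady and the average is not yet trustworthy.
```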

Benchmarks do not necessarily have to squeeze your system to its maximum, since most people normally run more casual workloads. Some people want the best performance in word processing. Some want the highest numerical performance in supercomputers. A comparably cheap (but still quite expensive to most people) gaming computer is better at gaming (usually measured in rendered frames) than a server costing ten times more that is only meant for numerical computation, because the most intensive games care more about the system’s latency (at which a server is slower) than about distributed multiprocessing throughput.

Types of Benchmarks

The types of benchmarks:

1. Real-world benchmarks: just use the program you know you will run the most, and the metrics you are most sensitive to inside that program. Word, Excel, CAD, Photoshop, video editors and renderers, electronic stock trading software, games, code you wrote yourself and execute frequently, etc. These don’t have to pull the system to the extreme; they simply compare systems at the things each does best. The thing I want to emphasize about graphics and games is that you may be changing your control variables without knowing it. The benchmark runs should be automated under exactly the same conditions, and if this is really impossible you should try your best to keep the environment (= how the game is rendered, so don’t change settings between runs) as equal as possible, and also repeat the same test multiple times to build a trustworthy average.
2. Synthetic benchmarks: if you want to build a system that does well in many different types of workloads, synthetic benchmarks generate computational load that resembles, but does not fully represent, real-life workloads. Since they are not attached to one type of workload, they are good for getting an overall outline of how the system will behave across a variety of real-world workloads. The focus of most of these benchmarks isn’t evaluating one component at its maximum, but getting a peek at how certain types of programs will perform.
3. Kernel benchmarks: these test how the OS, its underlying layers, and the kernels themselves perform, which is the maximum performance of the system plus the OS itself instead of real-world workloads. Testing the maximum possible underlying performance makes them more intensive on the system itself, though this does not necessarily mean they stress every component: only the bottlenecked components run at maximum while others may stay closer to idle (for instance if the RAM is not 100% allocated and the motherboard does not supply its maximum designed power). This is still a more accurate benchmark when the focus is the system itself, because it depends less on individual software optimization. Examples are LINPACK, which is used in benchmarking the TOP500 supercomputers, and LAPACK. They were developed decades ago, but the same algorithms for solving large systems of linear equations and matrices (as in college linear algebra courses, if you took those) still test the very raw performance of a system: computers were first developed to do best at solving linear problems, and that remains their most fundamental underlying role. A rough sketch of this kind of benchmark follows this list.
4. I/O benchmarks: these focus on one or more of the communication throughput and latency of the L1/L2/L3 caches, RAM, secondary storage such as SSDs or HDDs, or even the interconnects between nodes in a cluster or supercomputer.
5. Database benchmarks: these test the performance of database servers running Database Management Systems (DBMSs), such as SQL DBMSs including MySQL and PostgreSQL, Oracle, Cassandra, Ignite, MongoDB, ArangoDB, Microsoft Access, etc.
6. Browser benchmarks: these test browser rendering and loading on each system, such as WebGL rendering of 3D objects or the loading speed of websites and web components in a browser. They represent the performance of browsing the modern web with its various static and dynamic client-side contents.
7. Component benchmarks or microbenchmarks: these are as simple as the output you get from cat /proc/cpuinfo in Linux, or what you see when you open Task Manager in Windows. They are very short benchmarks used to identify the basic features of a component, such as frequency or memory bandwidth.
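
To make the kernel-benchmark idea concrete, here is a rough LINPACK-flavored sketch in Python with NumPy: it times a dense linear solve and estimates GFLOPS using the standard LINPACK operation count. This only illustrates the idea; it is not the official LINPACK/HPL code.

```python
import time
import numpy as np

n = 4096  # problem size; scale up until RAM is meaningfully used
a = np.random.rand(n, n)
b = np.random.rand(n)

start = time.perf_counter()
x = np.linalg.solve(a, b)  # LU factorization plus triangular solves
elapsed = time.perf_counter() - start

flops = (2 / 3) * n**3 + 2 * n**2  # classic LINPACK operation count
print(f"{elapsed:.3f} s, ~{flops / elapsed / 1e9:.1f} GFLOPS")
```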

Best Practices

Since benchmarking programs come and go, and since the definition of “best” is subjective, with many different programs suited to many different workloads, I won’t point them out one by one. I do have to caution you to avoid benchmarking tools with weird scoring metrics that rate systems in questionable ways, presumably to make some CPU models reach higher scores than others, for example setting a weight of over 80% on single-core scores and the remaining 20% on multi-core scores (see the sketch below). Since even most games nowadays use multiprocessing in their engines, this scoring weight is really peculiar. The list of tools in the Introduction is a non-exhaustive list of common tools. Any program is okay as long as you quantitatively compare the correct metrics. Always keep independent variables, dependent variables, and control variables in mind; this is a very easy way to get a trustworthy result. It is also highly recommended to consult the methodologies that the most trustworthy reviewers use. They can give you new ideas and methods for performing benchmarks, as well as programs for each workload, and you can learn the rigor of experienced computer experimentalists.
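
To see why such a weighting is questionable, here is a toy calculation with made-up numbers (all scores hypothetical, normalized against a reference CPU):

```python
def weighted_score(single: float, multi: float,
                   w_single: float = 0.8, w_multi: float = 0.2) -> float:
    """Composite score with the skewed 80/20 weighting criticized above."""
    return w_single * single + w_multi * multi

cpu_a = weighted_score(single=1.0, multi=2.0)  # twice the reference multicore
cpu_b = weighted_score(single=1.5, multi=0.5)  # a quarter of A's multicore

print(cpu_a, cpu_b)  # 1.2 vs 1.3: B "wins" despite 4x worse multicore
```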

Dr. Ian Cutress (check his Twitter and the TechTechPotato YouTube channel too) is a professional high-performance computing specialist whom I can assert is currently the most skillful PC reviewer at both the server and desktop levels. I have worked with him before on the world record of the Euler-Mascheroni constant. Because he has years of experience in computational chemistry research and simulation from his Ph.D. days at Oxford, his reviews are very reasonable and accurate, tolerating no errors at all. Phoronix is a very good website for Linux benchmarks and methodologies (they even have their own Linux benchmark suite). While its reviews are very slightly less rigorous than Dr. Ian Cutress’s, it maintains great professionalism in every review and is even better for reviews related to development environments and lower-level implementations inside the Linux kernel. Other reviewers aren’t necessarily worse than these two, but you must filter out unprofessional reviewers who don’t clearly disclose the details of the environment (= control variables) used in their benchmarks, as well as those who pop out conclusions from nowhere without releasing the quantitative data that supports them. Benchmarks should only be based on facts, not on opinions.

Stress Tests

Stress tests are different from benchmarks. A stress test is not a quantitative comparison experiment. It is a test to ensure your system will never die from corruption caused by instability, and that components won’t be physically damaged in any situation. Stress tests usually target RAM, both system RAM and GPU VRAM overclocks, because modern CPUs simply throttle under excess heat, so CPU instability is easy to spot. You won’t have to worry about this when running your CPU, GPU, RAM, or any other component as it was shipped, because we usually call instabilities in stock hardware defects: all that has to be done is to request after-sales service or an RMA for the defective component. But there are lots of people who want to push these chips beyond their stock limits, pulling more performance out of a cheaper chip or reaching a higher single-thread frequency to increase framerates in games, which is called overclocking. Stress tests are thus for builds that are overclocked. Honestly, after some experience I personally dislike the concept of overclocking, because server builds use additional error-correcting RAM on top of an already stabilized system to remove even unlikely possibilities of errors. See ECC memory and Dr. Ian Cutress’s Tweet in my other post. I think overclocking amounts to taking a chip the manufacturer guaranteed stable and altering it into an error-prone condition for which people would normally send RMAs if it were not overclocked (but go ahead if the gaming capability of your computer is that important and you won’t commonly use it for everyday tasks such as editing documents, where blue screens and memory errors can cost you your changes). I also wouldn’t count 100% on factory-overclocked chips, since they may still generate silent errors without anyone knowing, outside the scope of what the overclocked-chip vendors have tested. If you still want to overclock chips used for production tasks after reading this, go ahead, but you will likely still get silent errors even if you think you have stabilized everything.

Therefore, the question is not “how does this system work in my workload of choice?” but rather “is the system stable enough for any kind of stressed scenario it might meet by coincidence, even ones the user never meant to run?” The lighter benchmark programs from the previous section won’t work here. Being “light”, common graphics and CPU benchmarks such as 3DMark, Cinebench, PCMark, and PassMark, as well as RAM benchmarking/stress testing programs such as Memtest86+, MemTest64, or AIDA64, may not expose instabilities that actually exist but stay invisible in most uses. These invisible instabilities shorten the lifetime of your components, and they are the main reason your computer produces errors, from as small as a single incorrectly rendered pixel to as serious as a blue screen every hour. But people tend to think the common benchmarking programs are already intensive enough to prove stability in their normal environments, because their CPUs and GPUs get hotter than ever (which they deem incomparably more intensive than the games they normally play) and the coolers spin like mad. There are lots of cases from enthusiast overclockers on the internet showing that what they thought was a “stable overclock” was not stable at all once exposed to more sophisticated and fragile workloads.

So what I want to say is that errors are not deterministic. As a system or a component becomes unstable, the probability of generating errors increases, and some of those errors go undetected as silent errors. I will now walk overclockers through which stress testing tools they should use to ensure their overclocked builds stay stable on use cases up to very fragile numerical computations.

I personally don’t recommend Prime95 (although it is the most widely used program) because it gives people a false sense of security, but I’ll go through it to start. This program mainly runs an algorithmic test called the Lucas-Lehmer primality test or the Fermat probable prime test to find new world record Mersenne primes, which can provide insight into encryption methods such as RSA. This requires a fair bit of precision and also stresses the system a fair bit (it is one of very few programs that have implemented the new AVX-512 vector instructions). If you run this program and see errors or a blue screen, your overclock is not stable. But the downside is that a stress test using this program cannot ensure 100% that the CPU and RAM maintain numerical precision in all situations. There have been cases where runs of Prime95 for hours looked “stable”, yet with the tools I am introducing next the system corrupted very fast. A toy version of the Lucas-Lehmer test follows.
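
For illustration, the Lucas-Lehmer test itself is tiny; Prime95’s difficulty lies in performing the squaring step on multi-million-digit numbers with FFT-based big-integer multiplication. A toy version with plain Python integers:

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number M_p = 2**p - 1 is prime (p prime)."""
    if p == 2:
        return True  # M_2 = 3 is prime
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # Prime95 performs this squaring via huge FFTs
    return s == 0

print(lucas_lehmer(13))  # True:  M_13 = 8191 is prime
print(lucas_lehmer(11))  # False: M_11 = 2047 = 23 * 89
```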

The first program is y-cruncher. This is the program behind world records for various important mathematical constants such as Pi (Google has been a recent world record holder and did a great amount of promotion with that record). It is a program that focuses on the highest precision and does not tolerate any kind of error, because one has to calculate hundreds of billions to tens of trillions of digits correctly. A single error in the computation corrupts every digit generated after it, producing incorrect output and disqualifying the run from being a world record. Because of this, the same constant has to be computed with two different, independent algorithms to be recognized as a correct world record. The same is true for Mersenne primes discovered with Prime95, but the precision and accuracy required by y-cruncher are far higher. The AVX-512 vector instructions are also a core part of y-cruncher and are used in the stress tests if the CPU supports each instruction.

The stress testing mode tests whether your system can withstand certain Fast Fourier Transform operations and other very heavy linear algebra computations comparable to the Linpack Xtreme tool coming up next. If you also run the benchmark Pi RAM-only mode and set the number of digits to a size that almost totally saturates your RAM, you can verify that the output matches the correct results stored in y-cruncher while the CPU and RAM are utilized in the most intensive way a real-world computation can reach. Consult This Post for more details on how y-cruncher should be operated, and for an in-depth explanation of optimizing y-cruncher’s configuration to compute world records as well, and my posts in This Category for examples of how intensive y-cruncher gets in terms of precision. I strongly recommend this program along with Linpack Xtreme, as I have used it multiple times to set world records. It is really not easy to get more intensive than this.
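
The underlying principle is self-verification: a deterministic computation run twice on stable hardware must produce bit-identical results. Here is a minimal sketch of that idea using NumPy FFTs (this only illustrates the principle; it is nowhere near as demanding as y-cruncher itself):

```python
import numpy as np

rng = np.random.default_rng(seed=42)       # fixed seed: identical input every run
data = rng.random(1 << 22)

reference = np.fft.ifft(np.fft.fft(data))  # first run serves as the reference
for i in range(20):
    result = np.fft.ifft(np.fft.fft(data))
    if not np.array_equal(reference, result):
        print(f"silent error detected on iteration {i}!")
        break
else:
    print("all iterations bit-identical")
```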

Linpack Xtreme for Windows and Linux is another good program that detects corruption faster than many others. As I said in Types of Benchmarks, LINPACK focuses on the purpose computers were first built for: solving large linear algebra problems. It is a very good stress testing program given that this is the algorithm used in huge supercomputers, and it really utilizes both the CPU and RAM to a scale most viable workloads cannot reach. It is equally as good as, or even better than, y-cruncher, which can also verify precision computing workloads if you run the benchmark Pi in RAM option and saturate the RAM; on the other hand, Linpack Xtreme is more intensive and also serves as a standardized benchmarking program, so the choice comes down to what you prefer. Why not both? The corruption check works roughly as sketched below.
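
For a picture of what a LINPACK-style corruption check looks like, here is a sketch of the standard scaled-residual test with NumPy (the same formula HPL reports; the problem size here is illustrative). The scaled residual should be a small O(1)-O(10) number; a large or run-to-run-varying value indicates silent errors:

```python
import numpy as np

n = 2048
eps = np.finfo(np.float64).eps
a = np.random.rand(n, n)
b = np.random.rand(n)

x = np.linalg.solve(a, b)
# Scaled residual: ||Ax - b|| / (||A|| * ||x|| * n * eps)
residual = np.linalg.norm(a @ x - b, np.inf) / (
    np.linalg.norm(a, np.inf) * np.linalg.norm(x, np.inf) * n * eps
)
print(f"scaled residual: {residual:.3f}")  # values far above ~O(10) fail
```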

For GPUs, you do not overclock or undervolt one if you are doing HPC acceleration or GPGPU (General Purpose GPU) computation (who would even want to overclock NVIDIA Tesla GPUs, which have ECC memory modules precisely to prevent memory errors?…); instead, you link multiple GPUs together. Because most GPU workloads place high loads on different components inside the GPU (for example, graphics rendering exercises different internal modules than numerical workloads such as CUDA/ROCm or OpenCL), there isn’t one viable benchmark that can ensure numerical accuracy across the board. GPU-Burn is a good stress testing tool because it utilizes almost all of the VRAM capacity with power consumption close to the TDP, and GPU utilization stays at nearly 100% throughout the test. It also counts the number of errors the GPU produces, so you can tell whether the GPU is stable; make sure you run it for at least an hour. The GPU version of LINPACK can also be an intensive corruption test, but it is not guaranteed to test all parts 100%, because you have to compile the program and tune its parameters to be sure the GPU VRAM is fully allocated. GPU-Burn is the more dependable zero-setup stress testing tool; a conceptual sketch of what it does follows. If you want to increase framerates by overclocking gaming GPUs not meant for numerical computation, OCCT, Unigine, and similar benchmarks will be enough to get a picture of how your overclocked GPU fares under high load, but keep in mind this does not mean your GPU will never produce an error. Even when a GPU errors, all you get in games is a stutter or a missing pixel, so I guess that’s a fair enough deal if one really needs higher framerates, although I can also say that stutters caused by errors could hurt your gaming experience more.
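
Conceptually, GPU-Burn hammers the GPU with large matrix multiplications and compares every result against a reference. Here is a sketch of that idea using CuPy (CuPy is my assumption for illustration, not what GPU-Burn itself uses; GPU-Burn is a standalone CUDA program, and this sketch requires an NVIDIA GPU with CuPy installed):

```python
import cupy as cp

n = 8192  # scale up until VRAM is nearly saturated
a = cp.random.rand(n, n, dtype=cp.float32)
b = cp.random.rand(n, n, dtype=cp.float32)

reference = cp.matmul(a, b)  # first result serves as the reference
errors = 0
for i in range(100):
    result = cp.matmul(a, b)  # same inputs: identical bits expected
    if not bool(cp.array_equal(reference, result)):
        errors += 1
print(f"{errors} mismatched iterations")  # any mismatch = unstable GPU
```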

Comment if you have more suggestions on this post.
