
Cores Gone Rogue

Notes on Google's paper on mercurial cores.

Today I read an interesting paper from Google engineers titled “Cores that don’t count”. It’s about CPU cores that silently produce incorrect computation results, causing serious problems for software systems. I found the paper quite interesting because of the implications of the discovery. And I’ve been wanting to start posting notes on papers I read, so I figured I’d start with this one.


Overview

Google runs a massive-scale data-analysis pipeline. One day they noticed that it started giving incorrect results. The investigation suggested it had something to do with a recent change shipped to a low-level library they use (they don’t say which library; probably an internal one). While the change itself was correct, it caused servers to make heavier use of certain instructions that were previously rarely used (they don’t say which instructions). This somehow caused a small subset of the machines to repeatedly produce incorrect results.

Deeper investigation revealed that the real underlying issue is in the hardware, specifically manufacturing defects in the CPU chips. A defect typically causes one of the cores in a CPU to silently produce errors in the form of incorrect computation results. These cores are called “mercurial cores”, and the errors are called “corrupt execution errors”, or CEEs. The authors don’t disclose the exact number of mercurial cores in their fleet for business reasons, but they mention that it’s on the order of “a few mercurial cores per several thousand machines” and that the rate of incidence of CEEs is higher than what software engineers expect.

The behavior caused by the defect is quite complex and interesting in my view. Some mercurial cores don’t exhibit the issue at all until a certain amount of time has passed since the chip was installed (sometimes years). Typically only one core in a chip is affected. And the CEEs themselves happen non-deterministically, repeatedly, and intermittently, and get worse over time. The authors also mention that external factors such as temperature, voltage, and frequency can influence the rate of incidence: they found that some CEE rates are very sensitive to frequency changes while others are not, and that lower frequencies sometimes increase the failure rate.

Why It Happens

The authors don’t have a definitive answer to why this happens, but they posit that some or all of the following factors probably contribute to the issue:

  1. Steady increase in CPU scale and complexity.
  2. Silicon feature size getting smaller and smaller (leaving less room for error).
  3. Novel techniques such as stacked layers, which add more complexity and manufacturing risk.

My Surprise

I had heard that computers in space shuttles and rockets are built with multiple redundant modules to guard against incorrect computation results. But I thought that was because anything can happen in space (cosmic rays, etc.), and that those systems are specifically designed to be Byzantine fault tolerant for that reason. So I never thought this kind of problem could happen in commercial-grade CPUs.

The authors don’t mention the exact CPU models affected, nor the manufacturer or architecture of the CPUs. Still, the fact that this happens in commercial-grade CPUs at all is something I had never heard of before and find quite surprising.

All this time I assumed that pretty much every computer system not specialized for this level of fault tolerance (i.e., the majority) operates under the assumption that the probability of a CPU producing incorrect results is so minuscule that it can be treated as never happening. This discovery challenges that assumption.

At a glance, the implications of this issue are quite serious. Data can be rendered permanently inaccessible, money can be lost, lives can be lost, and so on. But a deeper look into the paper’s references shows that this is common knowledge among the hyperscalers. Facebook has its own paper on a similar issue, titled “Silent Data Corruptions at Scale”. So now I’m wondering why this issue (CPUs producing incorrect results) hasn’t caused a catastrophic incident in the industry yet. It has caused problems in Google’s systems, such as:

  1. Corruption affecting garbage collectors, causing live objects to be collected.
  2. Corruption affecting database indexes, leading to some queries being non-deterministically corrupted.
  3. Mutexes not functioning correctly, causing application crashes.
  4. Corruption in kernel states, causing kernel panics and application malfunctions.

But if this has been common all along, why hasn’t it caused bigger problems than these? Google’s paper suggests that mercurial cores are something we have only recently learned about, probably because CPU manufacturing is getting more complex by the day. So I’m not sure what to make of this.

The Proposed Solution

The mitigation strategies involve efforts at both the hardware and the software level.

On the software side, the authors suggest:

  1. Applying the End-to-End Argument.
  2. Implementing system support for efficient checkpointing, which allows retrying a computation on a different core from a checkpoint.
  3. Implementing application-specific detection methods to decide whether to commit or retry a computation.
  4. Implementing something like triple modular redundancy (a rough sketch of ideas 2-4 follows this list).
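
To make suggestions 2-4 concrete, here’s a minimal Python sketch of the retry-and-vote idea. This is my own illustration, not code from the paper: it assumes a Linux host (for os.sched_setaffinity) and a single-process worker, and compute and verify are hypothetical stand-ins for a real pipeline stage and its application-specific integrity check.

```python
import os
from collections import Counter

def _run_on_core(fn, inputs, core):
    # Pin the current process to a single core (Linux-only) so that a
    # retry after a failed check executes on different silicon.
    os.sched_setaffinity(0, {core})
    return fn(inputs)

def retry_on_other_core(compute, verify, inputs, max_attempts=3):
    # Ideas 2 and 3: redo the work on another core whenever an
    # application-specific check rejects the result.
    cores = sorted(os.sched_getaffinity(0))
    for attempt in range(max_attempts):
        result = _run_on_core(compute, inputs, cores[attempt % len(cores)])
        if verify(inputs, result):  # commit only if the check passes
            return result
    raise RuntimeError("result failed verification on every core tried")

def majority_vote(compute, inputs, copies=3):
    # Idea 4: a crude software take on triple modular redundancy.
    # Run the computation on `copies` different cores and accept the
    # majority result (results must be hashable for Counter).
    cores = sorted(os.sched_getaffinity(0))
    results = [_run_on_core(compute, inputs, cores[i % len(cores)])
               for i in range(copies)]
    winner, count = Counter(results).most_common(1)[0]
    if count <= copies // 2:
        raise RuntimeError("no majority result across cores")
    return winner
```

The point is the shape of the control flow: never commit a result that hasn’t passed some independent check, and when a check fails, don’t retry on the same silicon.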

At the hardware level, the authors suggest:

  1. Implementing design-for-test mechanisms and exposing the test features to end users.
  2. Implementing continuous verification, where functional units always check their own results (a toy sketch of this idea follows the list).
  3. Adopting conservative designs for critical functional units, trading some extra area and power for reliability. The authors give the example of the IBM z990, which has duplicated pipelines and custom changes to the cache controller to make it more resilient.
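
Continuous verification is a hardware technique (e.g., residue codes built into functional units), but a toy software analog may help show the idea. The function below is my own hypothetical sketch, not from the paper: it cross-checks a multiplication against an independently computed modular residue, so a mismatch flags a corrupt execution instead of letting it propagate.

```python
def checked_mul(a, b, m=2**31 - 1):
    # Compute a * b, then cross-check it with an independent residue
    # computation: (a * b) mod m must equal ((a mod m) * (b mod m)) mod m.
    # On a healthy core this identity always holds, so a mismatch signals
    # a corrupt execution rather than silently propagating it.
    product = a * b
    if product % m != ((a % m) * (b % m)) % m:
        raise RuntimeError(f"residue check failed for {a} * {b}")
    return product
```

On a correct execution the check never fires, since the residue identity always holds; the cost is a second, cheaper computation per operation, which is the area-and-power trade-off the authors describe.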

There are a lot more details in the paper that I didn’t cover here. I recommend reading the paper if you’re interested in this kind of stuff.

I also found some of the terms quite hard to understand, especially the hardware-related ones, so there’s a chance I misunderstood parts of the paper. Feel free to correct me if you find any mistakes.