Tuesday, June 23, 2020

Fantastic Bits and Why They Flip

In 2019, Trusted CI examined the causes of random bit flips in scientific computing and the measures commonly used to mitigate their effects. (In a separate effort, we will also be issuing a similar report on data confidentiality needs in science.) The resulting report, “An Examination and Survey of Random Bit Flips and Scientific Computing,” was issued a few days before the winter holidays in December 2019. Because news of the report was buried amid the holidays and the New Year, we are pleased to highlight it in more detail now. This post is longer than most of Trusted CI’s blog posts in order to give you a feel for the report and, we hope, entice you to read it.

For those readers who are not computer scientists, some background: What in the world is a “bit,” how can one “flip,” and what makes a flip occur randomly? Binary notation is the base-2 representation of numbers as combinations of the digits 0 and 1, in contrast to the decimal notation most of us use in daily life, which represents numbers as combinations of the digits 0 through 9. In binary notation, a “bit” is the atomic unit of that representation: a single 0 or 1. Bits can be combined to represent numbers larger than 1 in the same way that decimal digits can be combined to represent numbers larger than 9.
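
To make this concrete, here is a tiny Python sketch (our own illustration, not something from the report) that prints a few numbers in both decimal and binary notation:

    # Print a few numbers in both decimal and binary notation.
    for n in [0, 1, 9, 10, 42]:
        print(f"decimal {n:>2} = binary {n:b}")
    # e.g., decimal 42 = binary 101010: six bits combine to represent a number far larger than 1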

Binary notation has been in use for many hundreds of years. The manipulation of binary numbers made significant advances in the mid-19th century through the efforts of George Boole, who introduced what was later referred to as Boolean algebra or Boolean logic. This advance in mathematics, combined with advances in electronic switching circuits and logic gates by Claude Shannon (and others) in the 1930s, led to binary storage and logic as the basis of computing. As such, binary notation, with numbers represented as bits, has been the basis of how most computers store and process information since the inception of electronic computers.

However, while we think of the binary digits 0 and 1 as discrete, opposite, and rigid representations, in the same way that North and South represent directions, the components of a computer that underlie these representations are analog, and in that analog world 0 and 1 are closer to shades of grey. Bits are typically stored magnetically and transmitted as electrical charges, and both magnetism and electrical charge can degrade or be altered by external forces, including cosmic rays and other forms of radiation and magnetism. To a computer, a “bit flip” is the change of a stored or transmitted bit from a 0 to a 1 or vice versa. Underlying that “flip” could have been a burst of radiation that instantly altered magnetic storage or electrical transmission, or the slow degradation of a magnetically stored bit from something close to 1, a “full” magnetic charge, down to something less than 0.5, at which point it would be read and interpreted as a 0.
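
As an illustration of how consequential a single flip can be, the Python sketch below (again our own example, not taken from the report) flips one bit in the stored representation of a 64-bit floating-point number and prints the result:

    import struct

    # Flip one bit in the 64-bit (IEEE 754 double) representation of a number.
    def flip_bit(value: float, bit: int) -> float:
        (as_int,) = struct.unpack("<Q", struct.pack("<d", value))  # view the 8 bytes as an integer
        as_int ^= 1 << bit                                          # flip the chosen bit
        return struct.unpack("<d", struct.pack("<Q", as_int))[0]    # view the bytes as a float again

    original = 1.0
    corrupted = flip_bit(original, 61)  # flip one bit in the exponent field
    print(original, "->", corrupted)    # 1.0 -> roughly 7.5e-155

A single flipped bit in the exponent changes the value by hundreds of orders of magnitude; a flip in the lowest-order bits of the same number, by contrast, would barely be noticeable.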

The use of error correction in computing and communication was pioneered in the 1940s and 1950s by Richard Hamming, whose codes use some form of redundancy to help identify and mask the effects of bit flips. Despite the creation of these techniques 70–80 years ago, error correction is still not universally used. And even when it is, there are limits to the number of errors a particular blob of data (a number, a file, a database) can incur before those errors can no longer be corrected, or even detected at all.
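
As a simple illustration of both the power and the limits of this kind of redundancy, the sketch below (a plain even-parity check, far simpler than Hamming’s actual codes) detects any single flipped bit in a word but is blind to a pair of flips:

    # One even-parity bit stored alongside the data detects any single flipped bit,
    # but two flips cancel each other out and go undetected.
    def parity(bits):
        return sum(bits) % 2

    word = [1, 0, 1, 1, 0, 1, 0, 0]
    check = parity(word)               # stored alongside the data

    one_flip = list(word)
    one_flip[3] ^= 1                   # a single bit flip

    two_flips = list(word)
    two_flips[3] ^= 1
    two_flips[5] ^= 1                  # two bit flips

    print(parity(one_flip) != check)   # True:  the single-bit error is detected
    print(parity(two_flips) != check)  # False: the double-bit error goes unnoticed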

The report that Trusted CI published last year describes the ways in which bit flips occur. These include isolated single-bit errors due to some kind of interference; bursty faults affecting a number of sequential bits, due to mechanical failure or electrical interference; and malicious tampering. The document then narrows its focus to isolated errors. Malicious tampering, for example, is left to future reports, as are data errors or loss due to improper scientific design, mis-calibrated sensors, and outright bugs, including unaccounted-for non-determinism in computational workflows, improper roundoff and truncation errors, hardware failures, and “natural” faults.

The report then describes why single-bit faults occur (via cosmic rays, ionizing radiation, and corrosion in metal, for example), the odds of faults occurring for a variety of different computing components, and potential mitigation mechanisms. The goal is to help scientists understand the risk that bit faults can either lead to scientific data that is in some way incorrect or prevent scientific results from being reproduced in the future, reproducibility being, of course, a cornerstone of the scientific process.
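
To see how such odds translate into practice, consider a back-of-the-envelope calculation like the one sketched below; the per-bit error rate used here is purely illustrative and is not a figure taken from the report:

    # Back-of-the-envelope estimate of expected bit flips in a dataset.
    per_bit_error_rate = 1e-15           # hypothetical, illustrative probability that any given bit flips
    dataset_bits = 10e12 * 8             # a 10-terabyte dataset, expressed in bits
    expected_flips = per_bit_error_rate * dataset_bits
    print(f"expected flipped bits: {expected_flips:.2f}")   # about 0.08 for these assumptions

Multiply that across years of data collection, many datasets, and every component a bit passes through, and the expected number of flips stops being negligible.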

As part of the process of documenting mitigation mechanisms, the authors of the report surveyed an array of scientists with scientific computing workflows, as well as operators of data repositories and of computing systems ranging from small clusters to large-scale DOE and NSF high-performance computing systems. The report also discusses the impact of bit flips on science. For example, in some cases, including certain types of metadata, corrupted data can be catastrophic. In other cases, such as images, or situations where multiple data streams are already being collected that cross-validate one another, the flip of a single bit or even a small handful of bits is largely or entirely lost in the noise. Finally, the report collects these mechanisms into a set of practices, divided by the components involved in scientific computing, that scientists may wish to consider implementing in order to protect their data and computation: for example, using strong hashing before storing or transmitting data, file systems with automated integrity repair built in, disks with built-in redundancy, and, where possible, fault-tolerant algorithms.
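
As one concrete example of the hashing practice mentioned above, a workflow might record a strong checksum before data is stored or transmitted and verify it afterward. The sketch below does this with SHA-256; the file names are hypothetical:

    import hashlib

    # Compute a SHA-256 checksum of a file, reading it in chunks.
    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    before = sha256_of("results.dat")        # recorded before transfer or archival
    # ... data is copied to long-term storage or another site ...
    after = sha256_of("results_copy.dat")    # recomputed on the received copy
    if after != before:
        print("integrity check failed: the data changed in storage or transit")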

For the time being, this report is intended as a standalone first draft for use by the scientific computing community. Later in 2020, this report will be combined with insights from the Trusted CI “annual challenge” on trustworthy data to offer broader guidance on integrity issues beyond bit flips. Finally, in late 2020, we expect to integrate issues pertaining to bit flips into a future version of the Open Science Cyber Risk Profile (OSCRP). The OSCRP, first created in 2016, is a document that develops a “risk profile” to help scientists understand the risks posed to their projects through scientific computing. While the first version included issues in data integrity, a revised version will address bit flips more directly and in greater detail.

As with many Trusted CI reports, both the bit flip report and the OSCRP are intended to be living documents that will be updated over time to serve community needs. As such, comments, questions, and suggestions about this post and both documents are always welcome at info@trustedci.org.
Going forward, the community can expect additional reports from us on the topics mentioned above, as well as on a variety of other topics. Please watch this space for future blog posts on these studies.