
Monday, March 4, 2024

Trusted CI Webinar: Lessons from the ACCORD project, March 18th @11am Eastern

Ron Hutchins and Tho Nguyen are presenting the talk, Lessons from the ACCORD Project, on March 18th at 11am Eastern time.

Please register here.

The ACCORD cyberinfrastructure project at the University of Virginia (UVA) successfully developed and deployed a community infrastructure providing access to secure research computing resources for users at underserved, minority-serving, and non-PhD-granting institutions. ACCORD's operational model is built around balancing data protection with accessibility. In addition to providing secure research computing resources and services, key outcomes of ACCORD include the creation of a set of policies that enable researchers external to UVA to access and use ACCORD. While the ACCORD expedition achieved its technical and operational goals, its broader mission of expanding access to underserved users had limited success. To better understand the barriers researchers face in accessing ACCORD, our team carried out two community outreach efforts, engaging with researchers and computing service leaders to hear their pain points and to solicit their input for an accessible community infrastructure.

In this talk, we will describe the ACCORD infrastructure and its operational model. We will also discuss insights from our effort to develop policies that balance accessibility with security. And finally, we will share lessons learned from community outreach efforts to understand institutional and social barriers to access.

Speaker Bios:

Ron Hutchins: In the early 1980s, Ron worked at the Georgia Institute of Technology to create a networking laboratory in the College of Computing, teaching data communications courses there. After moving to the role of Director of Campus Networks in 1991, Ron founded and led the Southern Crossroads network aggregation (SoX) across the Southeast. In 2001, after receiving his PhD in computer networks, he took on the role of Chief Technology Officer for the campus. In August of 2015, Ron moved into the role of Vice President of Information Technology for the University of Virginia, working to build partnerships across the campus. Recently, Ron has moved from VP to research faculty in the Computer Science department at UVA and is participating broadly across networking and research computing, including work with the State of California building out the broadband fiber network backbone across the state.

Tho Nguyen is a computer science and policy expert. He served as project manager for the ACCORD effort from 2019 to 2021 and continues to support the project's implementation and growth. Nguyen is currently a Senior Program Officer at the National Academies of Sciences, Engineering, and Medicine. From 2015 to 2021, Nguyen was on the research staff in the Department of Computer Science at the University of Virginia, where he worked on compute-in-memory and on developing HPC systems for research. Prior to UVA, he was an AAAS Science and Technology Policy Fellow at the National Science Foundation, where he worked primarily on the Cyber Physical Systems program. Nguyen holds a PhD in Systems & Controls (Electrical Engineering) from the University of Washington.


--- 

Join Trusted CI's announcements mailing list for information about upcoming events. To submit topics or requests to present, see our call for presentations. Archived presentations are available on our site under "Past Events."

Monday, November 20, 2023

Trusted CI Webinar: Open Science Chain, Dec. 4th @11am Eastern

San Diego Supercomputer Center's Subhashini Sivagnanam is presenting the talk, Open Science Chain - Enabling Integrity and Metadata Provenance for Research Artifacts Using Open Science Chain, on December 4th at 11am Eastern time.

Please register here.

The envisioned advantage of sharing research data lies in its potential for reuse. Although many scientific disciplines are embracing data sharing, some face constraints on the data they can share and with whom. It becomes crucial to establish a secure method that efficiently facilitates sharing and verification of data and metadata while upholding privacy restrictions to enable the reuse of scientific data. This presentation highlights our NSF-funded Open Science Chain (OSC) project, accessible at https://www.opensciencechain.org. Developed using blockchain technologies, the OSC project aims to address challenges related to the integrity and provenance of research artifacts. The project establishes an API-based data integrity verification management service for data-driven research platforms and hubs, aiming to minimize data information loss and provide support for managing diverse metadata standards and access controls.
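
As a rough, hypothetical illustration of what an API-based integrity service involves (this sketch is ours, written in Python, and does not reflect the actual OSC API), a client might fingerprint a research artifact with a cryptographic hash and bundle that fingerprint with metadata so its integrity can be verified later:

    import hashlib, json

    def fingerprint(path):
        """Hash a research artifact so its integrity can be checked later."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Hypothetical registration payload; all field and file names are illustrative only.
    record = {
        "artifact": "sequencing_run_42.fastq",
        "sha256": fingerprint("sequencing_run_42.fastq"),
        "metadata": {"creator": "example-lab", "created": "2023-11-01"},
    }
    print(json.dumps(record, indent=2))
    # A verifier would later recompute the hash and compare it to the registered record.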

Speaker Bio:

Subhashini Sivagnanam is the manager of the Cyberinfrastructure Services and Solutions (CISS) group at the San Diego Supercomputer Center/UCSD. Her research interests predominantly lie in distributed computing, cyberinfrastructure development, scientific data management, and reproducible science. She serves as PI or Co-PI on various NSF/NIH projects related to scientific data integrity and developing cyberinfrastructure software. Furthermore, she oversees the management of UC San Diego’s campus research cluster, known as the Triton Shared Computing Cluster.

---

Join Trusted CI's announcements mailing list for information about upcoming events. To submit topics or requests to present, see our call for presentations. Archived presentations are available on our site under "Past Events."

Friday, November 20, 2020

Open Science Cyber Risk Profile (OSCRP), and Data Confidentiality and Data Integrity Reports Updated

In April 2017, Trusted CI released the Open Science Cyber Risk Profile (OSCRP), a document designed to help principal investigators and their supporting information technology professionals assess cybersecurity risks related to open science projects. The OSCRP was the culmination of extensive discussions with research and education community leaders, and has since become a widely used resource, including numerous references in recent National Science Foundation (NSF) solicitations.

The OSCRP has always been intended to be a living document. To gather material for continued refreshes of its ideas, Trusted CI has spent the past couple of years performing in-depth examinations of additional topics for inclusion in a revised OSCRP. In 2019, Trusted CI examined the causes of random bit flips in scientific computing and common measures used to mitigate their effects. Its report, “An Examination and Survey of Random Bit Flips and Scientific Computing,” was issued in December 2019. To address the community's need for insights on how to start thinking about computing on sensitive data, in 2020 Trusted CI examined data confidentiality issues and solutions in academic research computing. Its report, “An Examination and Survey of Data Confidentiality Issues and Solutions in Academic Research Computing,” was issued in September 2020.

Both reports have now been updated, with the current versions being made available at the links to the report titles above.  In conjunction, the Open Science Cyber Risk Profile (OSCRP) itself has also been refreshed with insights from both data confidentiality and data integrity reports.

All of these documents will continue to be living reports that will be updated over time to serve community needs. Comments, questions, and suggestions about this post, and both documents are always welcome at info@trustedci.org.


Thursday, September 10, 2020

Data Confidentiality Issues and Solutions in Academic Research Computing

Many universities have needs for computing with “sensitive” data, such as data containing protected health information (PHI), personally identifiable information (PII), or proprietary information. Sometimes this data is subject to legal restrictions, such as those imposed by HIPAA, CUI, FISMA, DFARS, GDPR, or the CCPA; at other times, data may simply not be sharable under a data use agreement. It may be tempting to think that such data is typically only in the domain of DOD- and NIH-funded research, but that assumption is far from reality. While this issue arises in scientific domains that people might immediately think of, such as medical research, it also arises in numerous others: economics, sociology, and other social sciences that examine financial data, student data, or psychological records; chemistry and biology, particularly work related to genomic analysis, pharmaceuticals, manufacturing, and materials; engineering analyses, such as airflow dynamics; underwater acoustics; and even computer science and data analysis, including advanced AI research, quantum computing, and research involving system and network logs. Such research is funded by an array of sponsors, including the National Science Foundation (NSF) and private foundations.

Few organizations currently have computing resources appropriate for sensitive data. Many universities have started thinking about how to enable computing on sensitive data, but they may not know where to start.

In order to address the community need for insights on how to start thinking about computing on sensitive data, in 2020, Trusted CI examined data confidentiality issues and solutions in academic research computing.  Its report, “An Examination and Survey of Data Confidentiality Issues and Solutions in Academic Research Computing,” was issued in September 2020.  The report is available at the following URL:

https://escholarship.org/uc/item/7cz7m1ws

The report examined both the varying needs involved in analyzing sensitive data and a variety of solutions currently in use, ranging from campus and PI-operated clusters, to cloud and third-party computing environments, to technologies like secure multiparty computation and differential privacy. We also discussed procedural and policy issues involved in campuses handling sensitive data.
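
For readers unfamiliar with one of the techniques named above, the following minimal Python sketch (ours, not from the report) shows the core idea behind differential privacy: adding calibrated Laplace noise to an aggregate query so that any single individual's record has little influence on the released result.

    import random

    def dp_count(records, predicate, epsilon=1.0, sensitivity=1.0):
        """Return a count perturbed with Laplace noise of scale sensitivity/epsilon."""
        true_count = sum(1 for r in records if predicate(r))
        # The difference of two independent exponentials is Laplace-distributed.
        noise = (random.expovariate(epsilon / sensitivity)
                 - random.expovariate(epsilon / sensitivity))
        return true_count + noise

    # Hypothetical sensitive records: (age, has_condition)
    records = [(34, True), (51, False), (29, True), (63, True)]
    print(dp_count(records, lambda r: r[1]))   # noisy count of records with the condition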

Our report was the result of numerous conversations with members of the community.  We thank all of them and are pleased to acknowledge those who were willing to be identified here and also in the report:

  • Thomas Barton, University of Chicago, and Internet2
  • Sandeep Chandra, Director for the Health Cyberinfrastructure Division and Executive Director for Sherlock Cloud, San Diego Supercomputer Center, University of California, San Diego
  • Erik Deumens, Director of Research Computing, University of Florida
  • Robin Donatello, Associate Professor, Department of Mathematics and Statistics, California State University, Chico
  • Carolyn Ellis, Regulated Research Program Manager, Purdue University
  • Bennet Fauber, University of Michigan
  • Forough Ghahramani, Associate Vice President for Research, Innovation, and Sponsored Programs, Edge, Inc.
  • Ron Hutchins, Vice President for Information Technology, University of Virginia
  • Valerie Meausoone, Research Data Architect & Consultant, Stanford Research Computing Center
  • Mayank Varia, Research Associate Professor of Computer Science, Boston University

For the time being, this report is intended as a standalone initial draft for use by the academic computing community. Later in 2020, this report will be accompanied by an appendix with additional technical details on some of the privacy-preserving computing methods currently available.  

Finally, in late 2020, we also expect to integrate issues pertaining to data confidentiality into a future version of the Open Science Cyber Risk Profile (OSCRP). The OSCRP is a document that was first created in 2016 to develop a “risk profile” for scientists to help understand risks to their projects via threats posed through scientific computing. While the first version included issues in data confidentiality, a revised version will include some of our additional insights gained in developing this report.

As with many Trusted CI reports, both the data confidentiality report and the OSCRP are intended to be living reports that will be updated over time to serve community needs. It is our hope that this new report helps answer many of the questions that universities are asking, and also that it begins conversations in the community and generates questions and feedback that will help us improve the report over time. Comments, questions, and suggestions about this post and both documents are always welcome at info@trustedci.org.

Going forward, the community can expect additional reports from us on the topics mentioned above, as well as a variety of other topics. Please watch this space for future blog posts on these studies.


Tuesday, June 23, 2020

Fantastic Bits and Why They Flip

In 2019, Trusted CI examined the causes of random bit flips in scientific computing and common measures used to mitigate their effects. (In a separate effort, we will also be issuing a similar report on data confidentiality needs in science.) Its report, “An Examination and Survey of Random Bit Flips and Scientific Computing,” was issued a few days before the winter holidays in December 2019. As news of the report was buried amidst the holidays and the New Year, we are pleased to highlight the report in a bit more detail now. This post is longer than most of Trusted CI’s blog posts to give you a feel for the report and, hopefully, entice you to read it.

For those reading this who are not computer scientists, some background: What in the world is a “bit,” how can one “flip,” and what makes one occur randomly? Binary notation is the base-2 representation of numbers as combinations of the digits 0 and 1, in contrast to the decimal notation most of us use in our daily lives, which represents numbers as combinations of the digits 0 through 9. In binary notation, a “bit” is the atomic element of the representation: a single 1 or 0. Bits, 0s or 1s, can be combined to represent numbers larger than 0 or 1 in the same way that decimal digits can be put together to represent numbers larger than 9.
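
As a quick concrete example (ours, in Python), the decimal number 13 is written in binary as 1101, that is, 8 + 4 + 0 + 1:

    # Decimal 13 in binary: 13 = 8 + 4 + 0 + 1 -> bits 1101
    n = 13
    bits = format(n, "04b")          # '1101'
    print(bits)

    # Reassemble the value from its bits to confirm the representation.
    value = sum(int(b) << i for i, b in enumerate(reversed(bits)))
    print(value)                     # 13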

Binary notation has been in use for many hundreds of years. The manipulation of binary numbers made significant advances in the mid-19th century through the efforts of George Boole, who introduced what was later referred to as Boolean algebra or Boolean logic. This advance in mathematics, combined with advances in electronic switching circuits and logic gates by Claude Shannon (and others) in the 1930s, led to binary storage and logic becoming the basis of computing. As such, binary notation, with numbers represented as bits, is the basis of how most computers have stored and processed information since the inception of electronic computers.

However, while we see the binary digits 0 and 1 as discrete, opposite, and rigid representations, in the same way that North and South represent directions, the components of a computer that underlie these 0 and 1 representations are analog, and within them 0 and 1 are in fact closer to shades of grey. In practice, 0 and 1 are typically stored magnetically and transmitted through electrical charges, and both magnetism and electrical charges can degrade or otherwise be altered by external forces, including cosmic rays or other forms of radiation and magnetism. To a computer, a “bit flip” is the change of the representation of a number from a 0 to a 1 or vice versa. Underlying that “flip” could be a sudden burst of radiation that instantly altered magnetic storage or electrical transmission, or the slow degradation of the magnetism of a magnetically stored bit from something close to 1, a “full” magnetic charge, to something less than 0.5, at which point it would be recognized and interpreted as a 0.
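
To make the effect concrete, here is a minimal Python sketch (ours, not from the report) that models a single-bit upset as an XOR with a one-bit mask and shows how the interpreted value changes:

    # A single-bit upset modeled as an XOR with a one-bit mask.
    original = 0b01000001                  # decimal 65, the ASCII code for 'A'
    flipped = original ^ (1 << 5)          # flip bit 5

    print(original, chr(original))         # 65 'A'
    print(flipped, chr(flipped))           # 97 'a' -- one flipped bit, a different value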

The use of error correction in computing and communication was pioneered in the 1940s and 1950s by Richard Hamming, using some form of redundancy to help identify and mask the effects of bit flips. Despite the creation of these techniques 70–80 years ago, error correction is still not universally used. And even when it is, there are limits to the number of errors that can be incurred in a particular blob of data (a number, a file, a database) before those errors can no longer be corrected, or even detected at all.
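
A minimal sketch of the underlying idea (a single even-parity bit, far simpler than Hamming's codes, written by us in Python for illustration): the redundant bit lets a receiver detect, though not locate or correct, a single flipped bit.

    def add_parity(bits):
        """Append an even-parity bit so the total number of 1s is even."""
        return bits + [sum(bits) % 2]

    def parity_ok(word):
        """Return True if the word still has even parity (no single-bit error detected)."""
        return sum(word) % 2 == 0

    word = add_parity([1, 0, 1, 1])   # -> [1, 0, 1, 1, 1]
    print(parity_ok(word))            # True: no error detected

    word[2] ^= 1                      # simulate a single bit flip in storage or transit
    print(parity_ok(word))            # False: the flip is detected, but not located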

The report that Trusted CI published last year describes the ways in which bit flips occur. These include isolated single-bit errors due to some kind of interference; bursty faults affecting a number of sequential bits, due to some kind of mechanical failure or electrical interference; and malicious tampering. The document then narrows its focus to isolated errors. Malicious tampering is the focus of future reports, for example, as are data errors or loss due to improper scientific design, mis-calibrated sensors, and outright bugs, including unaccounted-for non-determinism in computational workflows, improper roundoff and truncation errors, hardware failures, and “natural” faults.

The report then describes why single-bit faults occur (for example, via cosmic rays, ionizing radiation, and corrosion in metal), the potential odds of faults occurring for a variety of different components in computing, and potential mitigation mechanisms. The goal is to help scientists understand the risk that bit faults can either lead to scientific data that is in some way incorrect or prevent scientific results from being reproduced in the future, reproducibility being, of course, a cornerstone of the scientific process.

As part of the process of documenting mitigation mechanisms, the authors of the report surveyed an array of scientists with scientific computing workflows, as well as operators of data repositories and of computing systems ranging from small clusters to large-scale DOE and NSF high-performance computing systems. The report also discusses the impact of bit flips on science. For example, in some cases, including certain types of metadata, corrupt data might be catastrophic. In other cases, such as images, or situations where multiple data streams are already being collected that cross-validate each other, the flip of a single bit or even a small handful of bits is largely or entirely lost in the noise. Finally, the report collects these mechanisms into a set of practices, divided by the components involved in scientific computing, that scientists may wish to consider implementing in order to protect their data and computation: for example, using strong hashing before storing or transmitting data, file systems with automated integrity repair built in, disks with redundancy built in, and, where possible, fault-tolerant algorithms.
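
As one illustration of the “strong hashing” practice mentioned above (a sketch of ours using only Python's standard library; the file names are hypothetical), a digest recorded before a file is stored or transmitted can be compared against a digest computed afterwards, and any flipped bit will change the digest:

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA-256 digest of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Record the digest before transfer, then recompute and compare afterwards.
    before = sha256_of("results.dat")
    after = sha256_of("results_copy.dat")
    print("intact" if before == after else "corrupted")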

For the time being, this report is intended as a standalone first draft for use by the scientific computing community. Later in 2020, this report will be combined with insights from the Trusted CI “annual challenge” on trustworthy data to more broadly offer guidance on integrity issues beyond bit flips. Finally, in late 2020, we expect to integrate issues pertaining to bit flips into a future version of the Open Science Cyber Risk Profile (OSCRP). The OSCRP is a document that was first created in 2016 to develop a “risk profile” for scientists to help understand risks to their projects via threats posed through scientific computing. While the first version included issues in data integrity, a revised version will include bit flips more directly and in greater detail.

As with many Trusted CI reports, both the bit flip report and the OSCRP are intended to be living reports that will be updated over time to serve community needs. As such, comments, questions, and suggestions about this post and both documents are always welcome at info@trustedci.org.

Going forward, the community can expect additional reports from us on the topics mentioned above, as well as a variety of other topics. Please watch this space for future blog posts on these studies.


Tuesday, October 9, 2018

Trusted CI Webinar October 22nd at 11am ET: Urgent Problems and (Mostly) Open Solutions with Jeff Spies

Jeffrey Spies is presenting the talk "Urgent Problems and (Mostly) Open Solutions" on Monday October 22nd at 11am (Eastern).

Please register here. Be sure to check your spam/junk folder for the registration confirmation email.

We're at an important stage in the history of science. The internet has dramatically accelerated the pace and scale of communication and collaboration. We have the computational resources to mine and discover complex relationships within massive datasets from diverse sources. This will usher in a new era of knowledge discovery that will undoubtedly lead to life-saving innovation, and access to content is paramount. But how do we balance transparency and privacy or transparency and IP concerns? How do we protect data from being selectively deleted? How do we decide what to make accessible with limited resources? How do we go from accessible to reusable and then to an ecosystem that fosters inclusivity and diversity?

And what if we no longer own the content we'd like to be made accessible? Such is the case with most journal articles. Skewed incentives have developed around centuries-old publishing practices that reward what is publishable rather than what is rigorous, reproducible, replicable, and reusable. In exchange for publications, we assign our copyrights to publishers, who then lease access back to us and our institutions at ever-increasing prices. And now publishers are turning their eyes (and very large profit margins) towards capturing the rest of the research workflow, including data and analytics. In contrast to the societal-level change that could occur if this research content were in an environment that maximized innovation and reuse, this is very dangerous.

This talk will discuss these urgent problems and the psychology that makes fixing them easier said than done, and it will propose a practical, incremental approach to solving them via decentralized technologies, policy, and respect for researcher workflow.

Speaker Bio:
Jeffrey Spies is the founder of 221B LLC, a strategic consulting firm combining expertise in research technology, methodology, and workflow to accelerate projects across higher-ed. Previously, he co-founded and served as the CTO of the Center for Open Science, a non-profit formed to maintain his Open Science Framework. Jeff has a Ph.D. in Quantitative Psychology from the University of Virginia.

Presentations are recorded and include time for questions with the audience.

Join Trusted CI's announcements mailing list for information about upcoming events. To submit topics or requests to present, see our call for presentations. Archived presentations are available on our site under "Past Events."