Thursday, September 10, 2020

Data Confidentiality Issues and Solutions in Academic Research Computing

Many universities need to compute with “sensitive” data, such as data containing protected health information (PHI), personally identifiable information (PII), or proprietary information.  Sometimes this data is subject to legal or regulatory restrictions, such as those imposed by HIPAA, CUI requirements, FISMA, DFARS, the GDPR, or the CCPA; at other times, the data may simply not be sharable under a data use agreement.  It may be tempting to assume that such data arises only in DOD- and NIH-funded research, but that assumption is far from reality.  While this issue comes up in domains people might immediately think of, such as medical research, it also arises in many others: economics, sociology, and other social sciences that examine financial data, student data, or psychological records; chemistry and biology, particularly research related to genomic analysis, pharmaceuticals, manufacturing, and materials; engineering analyses, such as airflow dynamics and underwater acoustics; and even computer science and data analysis, including advanced AI research, quantum computing, and research involving system and network logs.  Such research is funded by an array of sponsors, including the National Science Foundation (NSF) and private foundations.

Few organizations currently have computing resources appropriate for sensitive data.  Many universities have started thinking about how to enable computing on sensitive data, but may not know where to start.

To address the community's need for guidance on how to start thinking about computing on sensitive data, in 2020 Trusted CI examined data confidentiality issues and solutions in academic research computing.  Its report, “An Examination and Survey of Data Confidentiality Issues and Solutions in Academic Research Computing,” was issued in September 2020.  The report is available at the following URL:

The report examined both the varying needs involved in analyzing sensitive data and a variety of solutions currently in use, ranging from campus- and PI-operated clusters to cloud and third-party computing environments to technologies like secure multiparty computation and differential privacy.  We also discussed the procedural and policy issues campuses face in handling sensitive data.

Our report was the result of numerous conversations with members of the community.  We thank all of them, and are pleased to acknowledge those who were willing to be identified, both here and in the report:

  • Thomas Barton, University of Chicago, and Internet2
  • Sandeep Chandra, Director for the Health Cyberinfrastructure Division and Executive Director for Sherlock Cloud, San Diego Supercomputer Center, University of California, San Diego
  • Erik Deumens, Director of Research Computing, University of Florida
  • Robin Donatello, Associate Professor, Department of Mathematics and Statistics, California State University, Chico
  • Carolyn Ellis, Regulated Research Program Manager, Purdue University
  • Bennet Fauber, University of Michigan
  • Forough Ghahramani, Associate Vice President for Research, Innovation, and Sponsored Programs, Edge, Inc.
  • Ron Hutchins, Vice President for Information Technology, University of Virginia
  • Valerie Meausoone, Research Data Architect & Consultant, Stanford Research Computing Center
  • Mayank Varia, Research Associate Professor of Computer Science, Boston University

For the time being, this report is intended as a standalone initial draft for use by the academic computing community. Later in 2020, this report will be accompanied by an appendix with additional technical details on some of the privacy-preserving computing methods currently available.  

Finally, in late 2020, we also expect to integrate issues pertaining to data confidentiality into a future version of the Open Science Cyber Risk Profile (OSCRP). The OSCRP, first created in 2016, develops a “risk profile” to help scientists understand the risks that scientific computing poses to their projects. While the first version touched on data confidentiality, a revised version will incorporate the additional insights we gained in developing this report.

As with many Trusted CI reports, both the data confidentiality report and the OSCRP are intended to be living documents that will be updated over time to serve community needs. It is our hope that this new report not only answers many of the questions universities are asking, but also starts conversations in the community and elicits questions and feedback that will help us improve it over time.  Comments, questions, and suggestions about this post and both documents are always welcome at

Going forward, the community can expect additional reports from us on the topics mentioned above, as well as a variety of other topics. Please watch this space for future blog posts on these studies.