Professor Yang Receives Grant to Address the Issue of Privacy in Research Data Sharing

Tse-Chuan Yang

Sponsor: The Pennsylvania State University (National Science Foundation-Prime)
Dates: September 1, 2013 – August 31, 2016
Amount: $189,298

TWC SBES: MEDIUM: Utility for Private Data Sharing in Social Science

Sharing research data is an important part of science. It promotes openness and integrity while opening up research opportunities for many scientists. However, many datasets – especially those collected for social science research – contain sensitive information about human subjects and therefore cannot be freely shared as is. One approach to protecting the data is to store it in a secure computing facility and limit access via administrative and legal hurdles (e.g., requiring applications from researchers, non-disclosure agreements, etc.). This approach can incur significant costs (in time and money) to researchers and the computing facility.
An alternative data protection scheme is to simply sanitize (perturb) the data and share the sanitized data with everyone. Privacy can be guaranteed in a rigorous way [53, 92], but a significant problem remains: why should researchers trust this sanitized/perturbed (and often noisy) data [139]?
We will develop the theory to address this issue and then apply it to the concrete realm of social science and spatial demography. We will provide technological solutions for the following concerns:

Will the sanitized data be useful? Using our recently proposed axiomatic utility framework [90, 89], we will design utility measures that are meaningful for social science applications such as Geographically Weighted Regression [25], and we will design algorithms for sanitizing data that maximize these utility measures. We will also evaluate results empirically using pre-collected social science datasets.

How can we derive valid statistical inferences? Researchers fear that perturbed data can be biased and misleading, thereby leading to invalid conclusions. We will develop theoretical statistical tools and methodologies for proper hypothesis testing to ensure that inferences are based on real statistical trends rather than noise or other artifacts introduced by the sanitization process. These tools may require statistical and programming expertise on the part of scientists; thus, we also consider the following usability issues.

How can researchers re-use their statistical packages? Existing statistical software packages do not account for the distortions added to sanitized data. Naive use of these packages will cause them to underestimate the variability in the sanitized data, and this will lead to spurious statistical inferences. However, scientists still want the comfort of using statistical packages they are familiar with, and they often lack the programming and statistical expertise needed to modify those software packages for use on sanitized data. Recent developments in multiple imputation [146, 130, 138, 140, 141, 94] suggest the possibility of combining valid statistical analysis with the comfort of familiar statistical routines. The main idea is to generate multiple different inputs from the sanitized data, run any software package as a black box on each input, and carefully aggregate the results. Existing work does not provide rigorous privacy guarantees. Our research will develop these techniques while provably preserving privacy.
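
The aggregation step can be illustrated with a minimal Python sketch. It assumes the classic combining rules from the multiple-imputation literature (average the estimates, then add the within-input variance to an inflated between-input variance); the rules suited to sanitized data, and the step that generates the inputs, are not specified in this abstract and may differ. The function names are hypothetical, and `mean_with_variance` merely stands in for an arbitrary black-box statistical routine.

```python
from statistics import mean, variance


def mean_with_variance(sample):
    """Hypothetical stand-in for a black-box routine.

    Returns (estimate, estimated variance of that estimate).
    """
    return mean(sample), variance(sample) / len(sample)


def combine(results):
    """Combine m >= 2 (estimate, variance) pairs with the classic rules.

    Assumption: these are the standard multiple-imputation rules, not
    necessarily the ones the project will use for sanitized data.
    """
    m = len(results)
    estimates = [q for q, _ in results]
    q_bar = mean(estimates)                  # combined point estimate
    u_bar = mean([u for _, u in results])    # average within-input variance
    b = variance(estimates)                  # between-input variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b          # total variance of q_bar
    return q_bar, total


def analyze_all(inputs, black_box):
    """Run the black box on every input, then aggregate the results."""
    return combine([black_box(x) for x in inputs])


if __name__ == "__main__":
    # Three toy inputs standing in for data sets generated from sanitized data.
    inputs = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
    estimate, total_variance = analyze_all(inputs, mean_with_variance)
    print(estimate, total_variance)
```

In this toy example each individual run reports a variance of about 0.33, while the combined variance is larger because the estimates disagree across inputs. That extra between-input term is the variability a single naive run would miss.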