Modern data is messy and high-dimensional, and it is often not clear a priori what is the right question to ask. Different types of analysis can lead to disparate levels of bias. The degree of bias also depends on the particulars of the data set.
Modern data is messy and high-dimensional, and it is often not clear a priori
what are the right questions to ask. Instead, the analyst typically needs to
use the data to search for interesting analyses to perform and hypotheses to
test. This is an adaptive process, where the choice of analysis to be performed
next depends on the results of the previous analyses on the same data.
Ultimately, which results are reported can be heavily influenced by the data.
It is widely recognized that this process, even if well-intentioned, can lead
to biases and false discoveries, contributing to the crisis of reproducibility
in science. But while %the adaptive nature of exploration any data-exploration
renders standard statistical theory invalid, experience suggests that different
types of exploratory analysis can lead to disparate levels of bias, and the
degree of bias also depends on the particulars of the data set. In this paper,
we propose a general information usage framework to quantify and provably bound
the bias and other error metrics of an arbitrary exploratory analysis. We prove
that our mutual information based bound is tight in natural settings, and then
use it to give rigorous insights into when commonly used procedures do or do
not lead to substantially biased estimation. Through the lens of information
usage, we analyze the bias of specific exploration procedures such as
filtering, rank selection and clustering. Our general framework also naturally
motivates randomization techniques that provably reduces exploration bias while
preserving the utility of the data analysis. We discuss the connections between
our approach and related ideas from differential privacy and blinded data
analysis, and supplement our results with illustrative simulations.