Published on Mon Nov 16 2015

How much does your data exploration overfit? Controlling bias via information usage

Daniel Russo, James Zou

Modern data is messy and high-dimensional, and it is often not clear a priori what is the right question to ask. Different types of analysis can lead to disparate levels of bias. The degree of bias also depends on the particulars of the data set.

Abstract

Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of the next analysis depends on the results of previous analyses of the same data. Ultimately, which results are reported can be heavily influenced by the data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while any data exploration renders standard statistical theory invalid, experience suggests that different types of exploratory analysis can lead to disparate levels of bias, and the degree of bias also depends on the particulars of the data set. In this paper, we propose a general information usage framework to quantify and provably bound the bias and other error metrics of an arbitrary exploratory analysis. We prove that our mutual-information-based bound is tight in natural settings, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. Through the lens of information usage, we analyze the bias of specific exploration procedures such as filtering, rank selection and clustering. Our general framework also naturally motivates randomization techniques that provably reduce exploration bias while preserving the utility of the data analysis. We discuss the connections between our approach and related ideas from differential privacy and blinded data analysis, and supplement our results with illustrative simulations.
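The exploration bias the abstract describes is easy to reproduce numerically. The sketch below (an illustration of the rank-selection setting, not the authors' code) estimates the means of several effects whose true means are all zero, then reports the largest estimate. Because the selection is data-dependent, the reported value is systematically positive even though each individual estimator is unbiased.

```python
import numpy as np

# Illustrative simulation of rank-selection bias (hypothetical parameters).
# m effects, each with true mean 0; each estimated from n noisy samples.
rng = np.random.default_rng(0)
m, n, trials = 20, 50, 2000

reported = np.empty(trials)
for t in range(trials):
    samples = rng.normal(loc=0.0, scale=1.0, size=(m, n))
    estimates = samples.mean(axis=1)   # unbiased for each fixed effect
    reported[t] = estimates.max()      # data-dependent selection step

bias = reported.mean()                 # average reported value across trials
print(round(bias, 3))                  # positive: exploration bias, not a real effect
```

In the information-usage view, the bias appears because the choice "which effect to report" carries mutual information with the noise in the data; reporting a randomly chosen effect instead would drive that information, and hence the bias, to zero.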

Sun Jan 27 2013
Machine Learning
Equitability Analysis of the Maximal Information Coefficient, with Comparisons
A measure of dependence is said to be equitable if it gives similar scores to noisy relationships of different types. Equitability is important in data exploration when the goal is to identify a relatively small set of associations rather than finding as many non-zero associations as possible.
Fri Jun 02 2017
Machine Learning
Information, Privacy and Stability in Adaptive Data Analysis
Traditional statistical theory assumes that the analysis to be performed on a given data set is selected independently of the data themselves. This assumption breaks down when data are re-used across analyses and the analysis at a given stage depends on the results of earlier stages.
Mon Aug 26 2013
Machine Learning
The Generalized Mean Information Coefficient
Reshef & Reshef present the Generalized Mean Information Coefficient (GMIC). GMIC is a promising new method that mitigates the power issues suffered by MIC, at the possible expense of equitability.
Sat May 09 2015
Machine Learning
Equitability, interval estimation, and statistical power
An equitable statistic is a statistic that assigns similar scores to equally noisy relationships of different types. We define an equitable statistic as one with small interpretable intervals. We then draw on the equivalence of interval estimation and hypothesis testing. We conclude with a demonstration of how our two characterizations of equitability can be used to evaluate the equitability of a statistic in practice.
Mon Nov 10 2014
Machine Learning
Preserving Statistical Validity in Adaptive Data Analysis
The theory of statistical inference assumes a fixed collection of hypotheses to be tested. In practice, data is shared and reused, with new hypotheses and new analyses being generated adaptively. We show that there is a way to accurately estimate an exponential (in the sample size) number of expectations.
Fri Feb 05 2021
Machine Learning
On the Sample Complexity of Causal Discovery and the Value of Domain Expertise
Causal discovery methods seek to identify causal relations between random variables from purely observational data. One of the seminal works in this area is the Inferred Causation algorithm. Practical implementations of this algorithm incorporate statistical tests for conditional independence.