Published on Mon Nov 23 2015

Estimating the number of unseen species: A bird in the hand is worth $\log n$ in the bush

Alon Orlitsky, Ananda Theertha Suresh, Yihong Wu

Estimating the number of unseen species is an important problem in many scientific endeavors. We derive a class of estimators that predict $U$ not just for constant $t$, but all the way up to $t$ proportional to $\log n$. We also show that this range is the best possible and that the estimators' mean-square error is optimal up to constants for any $t$.

Abstract

Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher, uses $n$ samples to predict the number $U$ of hitherto unseen species that would be observed if $t \cdot n$ new samples were collected. Of considerable interest is the largest ratio $t$ between the number of new and existing samples for which $U$ can be accurately predicted. In seminal works, Good and Toulmin constructed an intriguing estimator that predicts $U$ for all $t \le 1$, thereby showing that the number of species can be estimated for a population twice as large as that observed. Subsequently, Efron and Thisted obtained a modified estimator that empirically predicts $U$ even for some $t > 1$, but without provable guarantees. We derive a class of estimators that provably predict $U$ not just for constant $t > 1$, but all the way up to $t$ proportional to $\log n$. This shows that the number of species can be estimated for a population $\log n$ times larger than that observed, a factor that grows arbitrarily large as $n$ increases. We also show that this range is the best possible and that the estimators' mean-square error is optimal up to constants for any $t$. Our approach yields the first provable guarantee for the Efron-Thisted estimator and, in addition, a variant which achieves stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators we derive are simple linear estimators that are computable in time proportional to $n$. The performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
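The estimators in question are linear in the sample's "fingerprint" $\Phi_i$, the number of species observed exactly $i$ times: Good-Toulmin predicts $U^{\mathrm{GT}} = -\sum_{i \ge 1} (-t)^i \Phi_i$, and the smoothed variants temper its exploding coefficients by multiplying each term with the tail probability $\mathbb{P}(L \ge i)$ of a smoothing variable $L$. Below is a minimal Python sketch of one such binomially smoothed estimator; the function name, the default truncation level, and the smoothing choice $L \sim \mathrm{Bin}(k, 2/(t+2))$ are illustrative stand-ins for the tuned parameters in the paper.

```python
import math
from collections import Counter
from math import comb

def smoothed_good_toulmin(samples, t, k=None):
    """Minimal sketch of a binomially smoothed Good-Toulmin estimator.

    Predicts the number of hitherto unseen species that would appear if
    t*n new samples were collected, given n observed samples.  Linear in
    the fingerprint Phi_i, with coefficients h_i = -(-t)^i * P(L >= i)
    for L ~ Binomial(k, 2/(t+2)).  Letting k -> infinity recovers the
    plain Good-Toulmin estimator, which is unstable for t > 1.
    """
    n = len(samples)
    # Fingerprint: phi[i] = number of distinct species seen exactly i times.
    phi = Counter(Counter(samples).values())
    if k is None:
        # Placeholder truncation level on the order of log n; the paper
        # tunes this constant to obtain its provable error bounds.
        k = max(1, math.ceil(0.5 * math.log(n + 1)))
    q = 2.0 / (t + 2.0)

    def binom_tail(i):
        # P(L >= i) for L ~ Binomial(k, q); zero once i exceeds k.
        return sum(comb(k, j) * q**j * (1 - q)**(k - j) for j in range(i, k + 1))

    return -sum((-t)**i * binom_tail(i) * count for i, count in phi.items())
```

For example, smoothed_good_toulmin(samples, t=2.0) predicts how many new species would appear in twice as many samples; the computation is a single pass to build the fingerprint plus a short weighted sum, consistent with the abstract's claim of running time proportional to $n$.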

Fri Jun 15 2012
Machine Learning
On the Cover-Hart Inequality: What's a Sample of Size One Worth?
Under a large class of loss functions, the best Alice can do is to halve Bob's risk. In this sense, half the information in an infinite sample is contained in a sample of size one. These results may help explain the small relative differences in applied classification and forecasting problems.
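The "sample of size one" claim parallels the classical Cover-Hart bound: predicting a binary label with a single independent copy of that label errs with probability $2p(1-p)$ at a point where $\mathbb{P}(Y=1)=p$, at most twice the Bayes error $\min(p, 1-p)$. Here is a quick Monte Carlo check of that factor of two, on a toy binary setup of our own choosing (the paper itself treats a much larger class of loss functions):

```python
import random

# Monte Carlo check of the "factor two" claim on a toy binary problem.
# At a point with p = P(Y = 1), the Bayes rule errs with probability
# min(p, 1 - p), while predicting with a single independent copy of the
# label (a "sample of size one") errs with probability 2p(1 - p),
# which is at most 2 * min(p, 1 - p).
random.seed(0)
trials = 200_000
bayes_errors = one_sample_errors = 0
for _ in range(trials):
    p = random.random()            # conditional probability at a random point
    y = random.random() < p        # true label
    bayes_errors += (y != (p >= 0.5))
    y_copy = random.random() < p   # independent copy: the one-sample prediction
    one_sample_errors += (y != y_copy)

print("Bayes risk      :", bayes_errors / trials)
print("One-sample risk :", one_sample_errors / trials)
print("Ratio           :", one_sample_errors / bayes_errors)  # stays below 2
```

With $p$ uniform on $[0,1]$, the Bayes risk is $1/4$ and the one-sample risk is $1/3$, so the printed ratio settles near $4/3$, comfortably under the bound of 2.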
Tue Feb 12 2019
Machine Learning
Maximum Likelihood Estimation for Learning Populations of Parameters
The maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. For sufficiently large sample sizes, the MLE achieves the information-theoretically optimal error bound with respect to the earth mover's distance.
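As a concrete (assumed) instance of this setting: $N$ coins with unknown biases are each flipped $t$ times, and the goal is to recover the distribution of biases rather than the individual biases. A common way to compute such a distribution-level MLE is EM over a fixed discretization of $[0,1]$; the grid size and iteration count below are illustrative choices, not values from the paper.

```python
import numpy as np
from math import comb

def mle_population_of_biases(heads, t, grid_size=100, iters=500):
    """Sketch: maximum likelihood estimate of the *distribution* of coin
    biases, given N coins each flipped t times (heads[i] = #heads of coin i).

    The MLE over all distributions is approximated by a discrete
    distribution on a fixed grid, fitted by EM (a standard way to compute
    a nonparametric mixture MLE; the parameters here are illustrative).
    """
    heads = np.asarray(heads)
    grid = np.linspace(0.0, 1.0, grid_size)        # candidate bias values
    # Likelihood matrix: L[i, j] = P(heads[i] | bias grid[j])
    binoms = np.array([comb(t, h) for h in heads])[:, None]
    L = binoms * grid[None, :] ** heads[:, None] \
               * (1 - grid[None, :]) ** (t - heads[:, None])
    w = np.full(grid_size, 1.0 / grid_size)        # mixing weights
    for _ in range(iters):                         # EM updates
        post = L * w                               # unnormalized posteriors
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)
    return grid, w
```

The earth mover's (Wasserstein-1) distance between the fitted weights and a reference distribution can then be computed with, e.g., scipy.stats.wasserstein_distance(grid, true_support, w, true_weights).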
Tue Mar 30 2021
Machine Learning
Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest
Fri Apr 26 2019
Machine Learning
Sample Amplification: Increasing Dataset Size even when Learning is Impossible
Given data drawn from an unknown distribution $D$, to what extent is it possible to "amplify" this dataset and output an even larger set of samples? We formalize this question as follows: an amplification procedure is valid if no algorithm can distinguish the set of samples produced by the amplifier from a set of independent draws from $D$. In many settings, a valid amplification procedure exists even when the size of the input dataset, $n$, is significantly less than what would be necessary to learn $D$ to non-trivial accuracy.
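To make the interface concrete, here is the most naive candidate amplifier for discrete data: return the input together with fresh draws from the empirical distribution, shuffled. Whether such a procedure is valid, and how large the output can be, is exactly what the framework quantifies; this sketch illustrates the definition, not the paper's construction.

```python
import random

def naive_amplifier(samples, m, rng=random):
    """Sketch of an (n, m) amplification procedure for discrete data:
    return the n input samples plus m - n fresh draws from the empirical
    distribution.  Validity -- whether the m outputs are statistically
    indistinguishable from m i.i.d. draws from the true distribution --
    is what the sample-amplification framework measures.
    """
    extra = [rng.choice(samples) for _ in range(m - len(samples))]
    out = samples + extra
    rng.shuffle(out)   # hide which samples were original
    return out
```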
Fri May 04 2018
Machine Learning
Estimating Learnability in the Sublinear Data Regime
We show that it is often possible to estimate this "learnability" even when given an amount of data that is too small to reliably learn any accurate model. In contrast to this sublinear sample size, finding an approximation of the best-fit linear function requires on the order of $d$ samples, where $d$ is the dimension of the data.
Fri Aug 14 2020
Machine Learning
A New Perspective on Pool-Based Active Classification and False-Discovery Control
In many scientific settings there is a need for adaptive experimental design to guide the process of identifying regions of the search space that contain as many true positives as possible. Such regions could differ drastically from a predicted set that minimizes 0/1 error.