Published on Mon Jul 19 2021

Introducing a Family of Synthetic Datasets for Research on Bias in Machine Learning

William Blanzeisky, Pádraig Cunningham, Kenneth Kennedy

Abstract

A significant impediment to progress in research on bias in machine learning (ML) is the limited availability of relevant datasets. This situation is unlikely to change much, given the sensitivity of such data. For this reason, there is a role for synthetic data in this research. In this short paper, we present one such family of synthetic datasets. We provide an overview of the data, describe how the level of bias can be varied, and present a simple example of an experiment on the data.

Tue Dec 08 2020
Machine Learning
Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods
Generating synthetic data with privacy guarantees provides one such solution. It allows meaningful research to be carried out "at scale" - by allowing the entirety of the machine learning community to potentially accelerate progress within a field.
Fri Jun 29 2018
Machine Learning
Measuring the quality of Synthetic data for use in competitions
Machine learning has the potential to assist many communities in using large datasets. Much of that potential is not being realized because it would require sharing data in a way that compromises privacy. Several methods have been proposed that generate synthetic data while preserving the privacy of the real data.
Mon Nov 16 2020
Machine Learning
Foundations of Bayesian Learning from Synthetic Data
There is significant growth and interest in the use of synthetic data for machine learning. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data. We use a Bayesian paradigm to characterise the updating of model parameters.
Mon Sep 02 2019
Machine Learning
Understanding Bias in Machine Learning
Bias is known to be an impediment to fair decisions in many domains, such as human resources, the public sector, and health care. In this article, our goal is to explain the issue of bias in machine learning from a technical perspective.
Wed Apr 28 2021
Machine Learning
Algorithmic Factors Influencing Bias in Machine Learning
Thu Apr 16 2020
Machine Learning
Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data
Data quality can be measured along two dimensions. Quality of synthetic data can be evaluated against training data or against an underlying population. It is clear that accommodating all goals at once is a formidable challenge. We invite the academic community to jointly advance the privacy-quality frontier.