Published on Fri Jan 11 2019

Impact of Data Pruning on Machine Learning Algorithm Performance

Arun Thundyill Saseendran, Lovish Setia, Viren Chhabria, Debrup Chakraborty, Aneek Barman Roy

Abstract

Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on an iteratively pruned dataset. The goal was to understand whether, when an algorithm (say A) outperforms another algorithm (say B) on the unpruned dataset, algorithm B would outperform A on the pruned dataset, or vice versa. The dataset chosen for our analysis is a subset of the largest movie ratings database publicly available on the internet, IMDb [1]. The learning objective of the model was to predict the categorical rating of a movie among five bins: poor, average, good, very good, and excellent. The results indicated that an algorithm that performed better on the unpruned dataset also performed better on the pruned dataset.
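The experimental setup described above can be sketched in a few lines of Python. Both helpers below are illustrative assumptions, not the authors' code: the abstract does not publish its exact bin boundaries or its pruning criterion, so `bin_rating` uses hypothetical cut-off points on the 0-10 IMDb scale, and `prune` takes the pruning criterion as a caller-supplied predicate.

```python
def bin_rating(rating: float) -> str:
    # Map a 0-10 IMDb rating to one of the paper's five categorical
    # bins. These cut-off values are illustrative assumptions; the
    # abstract does not state the actual boundaries used.
    if rating < 4.0:
        return "poor"
    if rating < 6.0:
        return "average"
    if rating < 7.5:
        return "good"
    if rating < 8.5:
        return "very good"
    return "excellent"


def prune(dataset, is_suboptimal):
    # One pruning pass: drop tuples flagged as sub-optimal by the
    # supplied predicate. The predicate stands in for whatever
    # pruning criterion a given experiment applies; iterative
    # pruning repeats this pass until some stopping condition holds.
    return [row for row in dataset if not is_suboptimal(row)]
```

With helpers like these, the comparison in the paper amounts to training each algorithm once on the full dataset and once on the output of the iterative `prune` loop, then comparing accuracies on a held-out test set.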

Fri Jan 15 2016
Machine Learning
ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. Many data cleaning workflows can introduce subtle biases into the
Wed Dec 12 2012
Artificial Intelligence
Real-valued All-Dimensions search: Low-overhead rapid searching over subsets of attributes
This paper investigates a new, efficient approach to this class of problems. We compare RADSEARCH with other recent successful search algorithms such as PRIM, APriori, OPUS and DenseMiner. We introduce RADREG, a new regression algorithm for learning real-valued outputs based
Tue Oct 20 2020
Artificial Intelligence
Model-specific Data Subsampling with Influence Functions
The process of model selection is time-consuming and computationally inefficient. We develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence.
Mon Apr 01 2019
Machine Learning
Adaptive Bayesian Linear Regression for Automated Machine Learning
The goal of automated machine learning (AutoML) is to design methods that can automatically perform model selection and hyperparameter optimization without human interventions for a given dataset. The method combines an adaptive Bayesian regression model with a neural network function and the acquisition function from Bayesian optimization.
Sat Sep 27 2014
Machine Learning
Large-scale Online Feature Selection for Ultra-high Dimensional Sparse Data
Feature selection with large-scale high-dimensional data is important yet challenging in machine learning and data mining. We devise a novel smart algorithm for second-order online feature selection using a MaxHeap-based approach, which is significantly more efficient and scalable.
Fri Feb 09 2018
Artificial Intelligence
Using Discretization for Extending the Set of Predictive Features
This paper provides support for a new idea that discretized features should often be used in addition to existing features. We claim that discretization algorithms should be developed with the explicit purpose of enriching a non-discretized dataset.