New Label Noise Injection Methods for the Evaluation of Noise Filters

Authors: Luís Garcia, Jens Lehmann, André de Carvalho and Ana Lorena

Abstract: Noise is often present in real datasets used for training Machine Learning classifiers. Its disruptive effects on the learning process may include increased complexity of the induced models, longer processing time, and reduced predictive power in the classification of new examples. Therefore, treating noisy data in a preprocessing step is crucial for improving data quality and reducing its harmful effects on the learning process. Various filters use different concepts to identify noisy examples in a dataset. Their noise preprocessing ability is usually assessed by how well they identify artificial noise injected into one or more datasets. This is done to overcome the limitation that only a domain expert can guarantee whether a real example is indeed noisy. The most frequently used label noise injection method is the noise at random method, in which a percentage of the training examples have their labels randomly exchanged. This is carried out regardless of the characteristics and example space positions of the selected examples. This paper proposes two novel methods to inject label noise in classification datasets. These methods, based on complexity measures, can produce more challenging and realistic noisy datasets by disturbing the labels of critical examples situated close to the decision borders, and can improve the evaluation of noise filters. An extensive experimental evaluation of different noise filters is performed using public datasets with injected label noise, and the influence of the noise injection methods is compared in both the data preprocessing and classification steps.
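
The contrast drawn in the abstract, between flipping labels at random and flipping the labels of examples near the decision border, can be illustrated with a minimal Python sketch. This is not the born package's API and not the paper's complexity-measure-based methods: the function names are hypothetical, and the distance to the nearest example of another class is assumed here as a simple stand-in for "closeness to the border".

```python
import random
from math import dist


def inject_random_noise(labels, rate, seed=0):
    """Noise at random: flip the labels of a randomly chosen fraction of
    examples, ignoring where those examples lie in the feature space.
    Assumes at least two distinct classes."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    n_noisy = round(rate * len(labels))
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        # Replace the label with a different class chosen uniformly.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, sorted(noisy_idx)


def inject_borderline_noise(points, labels, rate):
    """Illustrative borderline noise (an assumption, not the paper's method):
    flip the examples closest to an example of another class, giving each
    the label of its nearest such example."""
    def nearest_enemy(i):
        # Index of the closest example whose original label differs.
        enemies = [j for j in range(len(points)) if labels[j] != labels[i]]
        return min(enemies, key=lambda j: dist(points[i], points[j]))

    n_noisy = round(rate * len(labels))
    ranked = sorted(
        range(len(labels)),
        key=lambda i: dist(points[i], points[nearest_enemy(i)]),
    )
    noisy_idx = ranked[:n_noisy]
    noisy = list(labels)
    for i in noisy_idx:
        noisy[i] = labels[nearest_enemy(i)]
    return noisy, sorted(noisy_idx)


# Toy one-dimensional dataset: class "a" on the left, class "b" on the right.
points = [(0.0,), (1.0,), (2.0,), (2.9,), (3.1,), (4.0,), (5.0,), (6.0,)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

random_noisy, random_idx = inject_random_noise(labels, rate=0.25, seed=1)
border_noisy, border_idx = inject_borderline_noise(points, labels, rate=0.25)
```

On this toy data, the borderline variant always selects the two examples straddling the class boundary, whereas the random variant may pick examples anywhere. That difference is what makes borderline noise harder for a filter to detect.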

R Package

All noise injection methods detailed in this paper were assembled into an R package named born (Borderline Noise), which is publicly available at the GitHub repository.

Additional Results

The datasets and results can be found here:

Dataset files

RData files

Result files

Acknowledgements

The authors would like to thank CNPq (process 152098/2016-0), FAPESP (processes 2013/07375-0 and 2012/22608-8) and CAPES for their financial support.
