Rcorr

View my home page
http://www.biocom.icmc.usp.br/~lpfgarcia/

View the Project on GitHub
lpfgarcia/rcorr

Effect of label noise in the complexity of classification problems

Authors: Luís P.F. Garcia, André C.P.L.F. de Carvalho and Ana C. Lorena

Abstract: Noisy data are common in real-world problems and may have several causes, like inaccuracies, distortions or contamination during data collection, storage and/or transmission. The presence of noise in data can affect the complexity of classification problems, making the discrimination of objects from different classes more difficult, and requiring more complex decision boundaries for data separation. In this paper, we investigate how noise affects the complexity of classification problems, by monitoring the sensitivity of several indices of data complexity in the presence of different label noise levels. To characterize the complexity of a classification dataset, we use geometric, statistical and structural measures extracted from data. The experimental results show that some measures are more sensitive than others to the addition of noise in a dataset. These measures can be used in the development of new preprocessing techniques for noise identification and novel label noise tolerant algorithms. We thereby show preliminary results on a new filter for noise identification, which is based on two of the complexity measures which were more sensitive to the presence of label noise.
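The protocol described in the abstract (add label noise at increasing rates and watch how a complexity measure responds) can be illustrated with a minimal sketch. It is not the code used in the paper: the function names, the toy two-cluster dataset, and the noise rates below are all assumptions chosen for illustration. The measure is a leave-one-out 1-nearest-neighbor error rate, similar in spirit to the N3 measure from the data complexity literature; it is not necessarily one of the measures the paper found most sensitive.

```python
import random

def inject_label_noise(labels, rate, seed=0):
    """Return a copy of `labels` with round(rate * n) entries flipped.

    Hypothetical helper: each selected label is replaced by a different
    class chosen uniformly at random.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = round(rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

def nn_error_rate(points, labels):
    """Leave-one-out 1-NN error rate (an N3-style complexity measure)."""
    errors = 0
    for i, p in enumerate(points):
        # Nearest other point by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        errors += labels[nearest] != labels[i]
    return errors / len(points)

# Toy dataset (assumption): two well-separated 5x5 grids, one per class.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x + 10, y) for x in range(5) for y in range(5)]
labels = [0] * 25 + [1] * 25

# Measure the complexity index at several label noise rates.
for rate in (0.0, 0.1, 0.2, 0.4):
    noisy = inject_label_noise(labels, rate)
    print(f"noise rate {rate:.1f}: 1-NN error {nn_error_rate(points, noisy):.2f}")
```

On clean labels the two clusters are trivially separable, so the measure is zero. Once labels are flipped, noisy points sit inside the opposite class's region and the measure rises, which is the kind of sensitivity the paper monitors across its complexity indices.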

Additional Results

The results for the artificial datasets can be found here:


Correlation Analysis


Noise Filtering Technique

Download the code

See the README file for instructions on running the code.

Contact

Luís Paulo Faina Garcia - lpgarcia [at] icmc [dot] usp [dot] br
Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo 13560-970, Brazil
