A COMPARATIVE STUDY ON UNIVARIATE OUTLIER WINSORIZATION METHODS IN DATA SCIENCE CONTEXT
DOI:
https://doi.org/10.26398/IJAS.0036-004
Keywords:
Capping; flooring; outlier; quantile-based.
Abstract
Handling outliers is an important step in data analysis, and it can be approached in three ways: accommodation, omission, or winsorization. This article investigates the impact of four winsorization statistics (mean, median, mode, and quantiles) on parameter estimation through an extensive simulation study. Three probability distributions (normal, negative binomial, and exponential) are considered, each with varying degrees of contamination. The simulation results suggest that winsorization is effective for small contamination levels and large sample sizes. Furthermore, it is recommended to winsorize outliers in symmetric distributions using any of the location parameters. However, for asymmetric distributions, the median should be employed. To illustrate these findings, a real dataset on internet usage session durations for 4,500 users, comprising over 2 million records, is fitted to the exponential distribution. The identified outliers are winsorized using the aforementioned statistics.
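As a rough illustration of the four winsorization statistics named in the abstract, the following Python sketch replaces flagged outliers with the mean, median, mode, or quantile-based caps and floors. It is not the authors' implementation. The abstract does not state how outliers are identified, so the sketch assumes Tukey's IQR fences; the function name `winsorize_outliers`, the parameters `k`, `lower_q`, and `upper_q`, and the choice to compute the replacement statistic from the non-outlying observations are all assumptions made for illustration.

```python
import numpy as np


def winsorize_outliers(x, method="median", k=1.5, lower_q=0.05, upper_q=0.95):
    """Replace flagged outliers in a 1-D sample (hypothetical helper).

    Outliers are flagged with Tukey's IQR fences (an assumption; the
    abstract does not specify the detection rule).
    """
    x = np.asarray(x, dtype=float)

    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    too_low = x < low_fence
    too_high = x > high_fence
    is_outlier = too_low | too_high

    # Assumption: location statistics are computed from the clean values only.
    clean = x[~is_outlier]
    out = x.copy()

    if method == "mean":
        out[is_outlier] = clean.mean()
    elif method == "median":
        out[is_outlier] = np.median(clean)
    elif method == "mode":
        # Most frequent clean value; meaningful mainly for discrete data.
        values, counts = np.unique(clean, return_counts=True)
        out[is_outlier] = values[np.argmax(counts)]
    elif method == "quantile":
        # Flooring and capping at sample quantiles of the full data.
        floor, cap = np.quantile(x, [lower_q, upper_q])
        out[too_low] = floor
        out[too_high] = cap
    else:
        raise ValueError(f"unknown method: {method!r}")

    return out


if __name__ == "__main__":
    sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
    for m in ("mean", "median", "mode", "quantile"):
        print(m, winsorize_outliers(sample, method=m))
```

For continuous data such as the exponential case, most values are distinct, so the mode branch would need a binned or kernel-based estimate instead; the version above suits count data such as the negative binomial case.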
License
Copyright (c) 2024 Statistica Applicata - Italian Journal of Applied Statistics
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.