Ma and Tang propose smart data augmentation to address imbalances in data

Department of Biostatistics Chair Yan Ma is the corresponding author on “Smart data augmentation: One equation is all you need”, recently published in Statistical Analysis and Data Mining: The ASA Data Science Journal. The paper’s co-authors include Lu Tang, vice chair for education in the Department of Biostatistics.

The paper proposes a statistical approach called smart data augmentation (SDA) to address imbalances in data by using methods of classification that can be easily fine-tuned. Using a wide range of datasets, Ma and the study's co-authors demonstrate that SDA could significantly improve the performance of popular statistical classifiers including random forest, multi-layer perceptron, and histogram-based gradient boosting.

According to Ma, this method is particularly beneficial for enhancing missing data imputation in his project, which is supported by grant R01MD013901 from the National Institute on Minority Health and Health Disparities at the National Institutes of Health. A major challenge in missing data imputation is class imbalance (CI), where the number of samples in one or more classes (the majority classes) is much greater than the number in the other classes (the minority classes). For example, CI appears in variables such as race in the National Inpatient Sample (NIS) database.

In the 2010 NIS, White is the majority racial group, making up 84% of the patient population. In contrast, minority racial groups, including Black, Hispanic, Asian or Pacific Islander, Native American, and other, altogether make up 16% of the patient population.

The CI problem should be carefully addressed, as traditional machine learning algorithms are often biased toward the majority class, and in extreme cases the minority class may be ignored altogether. In the NIS, this would mean that patients whose race is missing would be more likely to be imputed as White, and that some minority racial groups (e.g., Asian or Pacific Islander, Native American) might never be selected.
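The extreme case described above can be illustrated with a minimal Python sketch. It is not the SDA method from the paper and it does not use NIS data: the 100-patient sample is hypothetical, only the 84% / 16% split comes from the figures quoted for the 2010 NIS, and the breakdown among minority groups is invented for illustration. The helper names `majority_class` and `per_class_recall` are likewise assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy sample of 100 patients. Only the 84% / 16% split matches
# the article; the counts within the minority groups are invented.
observed_race = (
    ["White"] * 84
    + ["Black"] * 7
    + ["Hispanic"] * 5
    + ["Asian or Pacific Islander"] * 2
    + ["Native American"] * 1
    + ["Other"] * 1
)


def majority_class(labels):
    """Return the most frequent label in the sample."""
    return Counter(labels).most_common(1)[0][0]


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# A naive imputer that ignores the minority classes entirely:
# it always predicts the majority class.
guess = majority_class(observed_race)
imputed = [guess] * len(observed_race)

# Overall accuracy rewards the majority class; per-class recall does not.
accuracy = sum(t == p for t, p in zip(observed_race, imputed)) / len(observed_race)
recall = per_class_recall(observed_race, imputed)
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"imputed class     = {guess}")
print(f"accuracy          = {accuracy:.2f}")
print(f"balanced accuracy = {balanced_accuracy:.2f}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.2f}")
```

In this toy setting the naive imputer reaches 84% overall accuracy while recovering none of the minority-group patients, which is why overall accuracy alone can hide the effect of CI and why per-class measures are worth inspecting.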

As race is the key variable in any racial disparities research, misclassification of race due to CI may generate misleading results. Empirical results on a wide range of datasets, including the NIS, demonstrate that SDA could significantly improve the performance of the most popular classifiers.