I worked with a dataset from the UCI Machine Learning Repository that contained 14 variables and over 30,000 entries. My goal was to build a model that predicts whether an adult's income exceeds $50,000 based on these variables. After running numerous tests and making predictions, I realized that mitigating bias with such a small dataset would be challenging without creating a lot of synthetic data.

When the prediction model trained successfully, its outputs tracked the underlying distribution of the data well. One major issue, however, was inconsistency: each run of the model produced different results, which became obvious when I reran the notebook and got outcomes that differed from the original ones.

Once I identified fairness issues in the dataset, my first step was to resample the data to include only white and Black individuals, since these were the only racial groups with more than 2,000 records each. This balanced the dataset somewhat, but a significant imbalance in income levels remained: 78% of the entries had incomes below $50,000, while only 22% were above that threshold. As a result, a model that predicted all 0's (below $50,000) could still achieve a baseline accuracy of 78%.

Next, I looked into whether the model was overfitting or underfitting and whether certain features had disproportionately large coefficients. To address this, I added dropout layers to the prediction model as a form of regularization. This helped somewhat, as the model started predicting more 1's (above $50,000).

Finally, I analyzed the fairness of the model. This was difficult because the model often predicted all 0's, which made fairness analysis almost impossible. However, in the few runs where the model performed better, its predicted values closely matched the real data distributions. Based on this, I concluded the model was relatively fair given the dataset's limitations. The sketches below illustrate each of these steps.
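The run-to-run inconsistency described above is typical of unseeded training, where data shuffling, weight initialization, and dropout masks change on every run. A minimal sketch of how the seeds could be pinned, assuming TensorFlow 2.7 or later (the seed value itself is an arbitrary illustration, not one from my experiments):

```python
import tensorflow as tf

SEED = 42  # arbitrary illustrative value

# Seeds Python's random module, NumPy, and TensorFlow in one call,
# so shuffling, weight initialization, and dropout masks repeat
# identically across notebook runs.
tf.keras.utils.set_random_seed(SEED)

# Optional, stricter step: force deterministic GPU kernels
# (can slow training down noticeably).
tf.config.experimental.enable_op_determinism()
```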
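Restricting the data to the two groups with enough records is straightforward in pandas. A sketch under the assumption that the raw adult.data file and the standard UCI column names are used (the file path is an assumption on my part):

```python
import pandas as pd

# adult.data ships without a header row; these are the documented
# UCI Adult column names.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]
df = pd.read_csv("adult.data", names=columns, skipinitialspace=True)

# Keep only the two racial groups with more than 2,000 records each.
df = df[df["race"].isin(["White", "Black"])].reset_index(drop=True)
```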
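With a 78/22 split, the all-zeros baseline is easy to verify, and inverse-frequency class weights are one standard counter-measure for the imbalance (shown purely for illustration; it is not the remedy I applied). The label encoding assumes the usual "<=50K"/">50K" strings from the UCI file:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Binary target: 0 for <=50K, 1 for >50K.
y = (df["income"] == ">50K").astype(int).to_numpy()

# A constant all-zeros predictor matches the majority-class share.
baseline_accuracy = (y == 0).mean()
print(f"all-zeros baseline accuracy: {baseline_accuracy:.2%}")  # about 78%

# Inverse-frequency weights; could be passed to Keras as
# model.fit(..., class_weight={0: w[0], 1: w[1]}).
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
```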
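Dropout between dense layers is the regularization described above. A minimal Keras sketch of that kind of model; the layer sizes, the 0.5 rate, and n_features are assumptions rather than my exact architecture:

```python
import tensorflow as tf

n_features = 100  # stands in for the width of the encoded input

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly zeroes 50% of units each step
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(income > 50K)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```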
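One way to run the fairness comparison described above is to contrast each group's predicted rate of >50K outcomes with its actual rate, in the spirit of a demographic-parity check. A sketch where df_test, X_test, and the "label" column are hypothetical names:

```python
import pandas as pd

# df_test holds the held-out rows (including "race" and a binary
# "label" column); X_test is the matching encoded feature matrix.
preds = (model.predict(X_test).ravel() >= 0.5).astype(int)

report = pd.DataFrame({
    "race": df_test["race"].to_numpy(),
    "actual": df_test["label"].to_numpy(),
    "predicted": preds,
})

# The mean of a 0/1 column is the share of >50K outcomes per group;
# large gaps between the two columns flag a potential fairness problem.
print(report.groupby("race")[["actual", "predicted"]].mean())
```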
Technologies
Python
pandas
NumPy
scikit-learn
TensorFlow