Mitigating biases in machine learning


by Max Springer
17 May 2023




Machine Learning (ML) is increasingly being used to simplify and automate a number of important computational tasks in modern society. From the disbursement of bank loans to job application screenings, these computer systems streamline several processes that have a considerable impact on our day-to-day lives. However, these artificially intelligent systems are most often devised to emulate human decision making, an inherently biased framework. For example, Microsoft’s Tay online chatbot quickly learned to tweet using racial slurs as a result of its biased online input stream (Caton and Haas 2020), and the COMPAS tool often flagged Black individuals as higher risk of reoffending than white individuals who were statistically similar with respect to many other attributes (Flores, Bechtel and Lowenkamp 2016).

Crucially, these issues are not the product of a malevolent computer programmer instilling radical beliefs, but rather a byproduct of machines learning to optimize for a particular objective, which can inadvertently leverage underlying biases present in the data. These biases are most often the result of two (not mutually exclusive) pervasive issues in data science: bad training data and objectives that are blind to demographic (or comparably sensitive) attributes. The former is again nicely illustrated by the Tay chatbot: training on text from anonymous online forums is unfortunately not a good representation of how the general population talks. Additionally, when models are trained to optimize for an objective that ignores demographic considerations, discriminatory practices can inadvertently be learned, as was the case for the COMPAS algorithm. That algorithm was trained to predict future criminal behavior, but inadvertently learned to weigh an individual’s race heavily in its decision-making process.

To provide a clearer understanding of how these “blind” objectives work, we restrict our attention here to the process of clustering data. This involves grouping objects so that those in the same cluster are more similar to each other than to those in other clusters. Data scientists use cluster analysis to garner insight from data by observing which cluster a point belongs to and assuming it shares the typical properties of that group. For instance, if we represent a patient’s medical data as a set of relevant numbers, such as height, weight, and age, we can visualize the data as points in a three-dimensional space. We can then use standard methods to cluster the data by identifying data points that are close together in this space; to do this, we minimize an “objective function” that measures this closeness. However, this process may inadvertently segregate men from women, as members of each group are likely to be similar to one another in terms of height and weight. This is a toy example, but the inherent biases that clustering algorithms learn in this way can, if not recognized, have a significant impact on the treatment that a physician provides.
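
To make the “blind objective” point concrete, here is a minimal sketch (not from the article) using scikit-learn’s k-means on synthetic patient data; the feature values, group sizes, and the KMeans settings are illustrative assumptions. Even though the protected attribute is never given to the algorithm, minimizing within-cluster distances can still split the data almost perfectly along it.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic demographic groups that differ mainly in height and weight.
group_a = rng.normal(loc=[178, 85, 45], scale=[7, 12, 15], size=(200, 3))
group_b = rng.normal(loc=[164, 68, 45], scale=[6, 10, 15], size=(200, 3))
X = np.vstack([group_a, group_b])              # features: height (cm), weight (kg), age
protected = np.array([0] * 200 + [1] * 200)    # protected attribute, never shown to the model

# Standard k-means only minimizes within-cluster distances (a "blind" objective).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Yet the resulting clusters can separate the two groups almost perfectly.
for c in range(2):
    print(f"cluster {c}: fraction of group 1 = {protected[labels == c].mean():.2f}")
```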

While under-representative data has a (somewhat) clean solution in improved collection procedures, mitigating bias that results from objective minimization is less clear. To counteract this issue, researchers have recently proposed algorithms that ensure some quantified form of “fair representation” at the cost of a degradation in the optimality of the clustering. Consider the following figure, which depicts a traditional clustering result versus what one would deem “fair”, where the two colors represent some protected attribute such as gender or race. We see that the standard (blind) solution may yield clusters of data points that disproportionately segregate the data based on an implicit attribute, leading to potentially discriminatory or even illegal behavior by the algorithm. A more desirable solution in practice might yield clusters that contain equal representation of the two protected groups while still keeping the intra-cluster distances between data points small.

[Figure: a standard (blind) clustering versus a fair clustering, with blue and red circles denoting a protected attribute such as gender or race]
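
One common way the fair clustering literature quantifies this “equal representation” is the balance of a cluster: the ratio of the smaller to the larger protected-group count within it (1 means perfectly even, values near 0 mean the cluster is dominated by one group). The helper below is a minimal sketch under that assumption, reusing the array conventions from the earlier snippet.

```python
import numpy as np

def cluster_balance(labels: np.ndarray, protected: np.ndarray) -> dict[int, float]:
    """Per-cluster balance: min(group counts) / max(group counts)."""
    balances = {}
    for c in np.unique(labels):
        counts = np.bincount(protected[labels == c], minlength=2)
        balances[int(c)] = counts.min() / max(counts.max(), 1)
    return balances

# A "blind" solution often drives these balances toward 0, while a fair
# solution keeps every cluster's balance close to 1, typically at the cost
# of a somewhat larger objective value.
```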

In essence, balancing solutions that minimize our objective with those that yield equal representation is the crux of modern fair clustering work. While a considerable number of studies in the past few years have examined fair flat clustering (partitioning data into disjoint clusters), the major problem of hierarchical clustering has been largely neglected. Like the flat variant depicted above, hierarchical clustering starts with a collection of data points whose similarity is measured by the distance between them: the closer two points are, the more similar they are considered. To build a hierarchy, we first let each point be its own cluster and then merge nearby clusters up to a certain size. We can then repeat this process on our enlarged groupings until we are left with a single cluster that encompasses all our data. A major advantage here is that we do not need to impose a restriction on the number of clusters; rather, we group the data at a wide variety of granularities. For this reason, hierarchical clustering is common in applications such as dividing a geographic region or constructing phylogenetic trees.
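
The bottom-up procedure described above is usually called agglomerative clustering; the sketch below shows it with SciPy on stand-in data (the linkage choice and the cut sizes are illustrative assumptions, not taken from the article).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))   # stand-in for the patient features from before

# Start with every point as its own cluster and repeatedly merge the closest
# pair of clusters; "average" linkage scores cluster distance as the mean
# pairwise distance between their members.
Z = linkage(X, method="average")

# The same hierarchy can be cut at any granularity, so the number of clusters
# never has to be fixed in advance.
for k in (2, 4, 8):
    flat = fcluster(Z, t=k, criterion="maxclust")
    print(f"{k} clusters, sizes: {np.bincount(flat)[1:]}")
```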

In our recent work (Knittel et al. 2023), we examine the notion of fairness in hierarchical clustering due to its pervasive utility in data science. Specifically, we demonstrate that a relatively intuitive algorithm that strategically folds clusters onto one another according to their protected attribute representations can produce a fair clustering with only a modest decrease in the optimality of the solution (as measured by the objective function). Such results are highly nontrivial as well as vital to demonstrating the utility of fairness constraints.
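
The algorithm itself is beyond the scope of this post, and the sketch below is not the method from Knittel et al. (2023); it only illustrates the criterion such work targets, by walking a SciPy linkage matrix and reporting how balanced every merged cluster is with respect to the protected attribute.

```python
import numpy as np

def merge_balances(Z: np.ndarray, protected: np.ndarray) -> list[float]:
    """Balance (min/max protected-group count) of each cluster formed by a merge."""
    n = len(protected)
    members = {i: [i] for i in range(n)}        # every leaf starts as its own cluster
    balances = []
    for i, row in enumerate(Z):
        a, b = int(row[0]), int(row[1])
        merged = members[a] + members[b]        # the new cluster created by this merge
        members[n + i] = merged
        counts = np.bincount(protected[merged], minlength=2)
        balances.append(counts.min() / max(counts.max(), 1))
    return balances

# In a fair hierarchy, the balances of the (larger) merged clusters stay bounded
# away from 0; a standard linkage offers no such guarantee.
```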

Although technological innovation is accelerating daily, it is important to be cognizant of, and mitigate, potentially harmful algorithmic practices that can perpetuate biases. We hope that our work serves as further proof of concept that incorporating fairness constraints or demographic information into the optimization process can reduce biases in ML models without significantly sacrificing performance – a result that is critical where automated decision-making can have a harmful impact.

References

Caton, S. and Haas, C. (2020). Fairness in machine learning: A survey. arXiv:2010.04053.

Flores, A. W., Bechtel, K. and Lowenkamp, C. T. (2016). False positives, false negatives, and false analyses: A rejoinder to “Machine Bias”. Federal Probation, 80(2).

Knittel, M., Springer, M., Dickerson, J. P. and Hajiaghayi, M. (2023). Generalized reductions: Making any hierarchical clustering fair and balanced with low cost. In Proceedings of the 40th International Conference on Machine Learning (ICML).

Max Springer is a PhD candidate in Applied Mathematics at the University of Maryland.