Mitigating bias in AI: a dive into Transformer models for fairness

Published on Feb 1, 2024

Bias in AI models, particularly in statistical and machine learning algorithms, often mirrors and magnifies existing prejudices within training datasets. This issue is especially critical for algorithmic fairness towards demographic groups that are underrepresented or misrepresented in biased data sources. But what are the ethical alternatives when using pre-trained large language models?

Counterfactual Data Augmentation (CDA)

To combat biases in training data, Counterfactual Data Augmentation (CDA) offers a statistically driven approach that alters demographic-specific terms within the data. This technique, introduced by Webster et al. (2021), has demonstrated its efficacy in improving fairness benchmarks for both BERT and ALBERT models. We've converted these models to PyTorch and released them on the Hugging Face model hub (bert-cda, albert-cda), promoting the adoption of open-source AI models for fairness.
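To make the idea concrete, here is a minimal sketch of a heuristic CDA transform. The substitution table is illustrative only, not the word lists used by Webster et al.; it simply swaps gendered terms while preserving capitalisation.

```python
# Minimal sketch of heuristic CDA: swap demographic terms with a fixed,
# hand-written substitution table. The word pairs are illustrative only.
import re

SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her", "hers": "his",
    "man": "woman", "woman": "man",
}

def counterfactual(sentence: str) -> str:
    """Return a copy of the sentence with each listed term swapped."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        # Preserve the capitalisation of the original token.
        return replacement.capitalize() if word[0].isupper() else replacement

    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

# The augmented corpus contains the original sentence plus its counterfactual.
sentence = "She bent over to kiss her friend."
print(counterfactual(sentence))  # -> "He bent over to kiss him friend."
```

Note how the fixed table maps "her" to "him" regardless of whether it is a possessive or an object pronoun, producing the same kind of error visible in the AugLy row of the figure further down; this is precisely the limitation a learned perturber aims to address.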

The same research also explored bias mitigation by applying dropout regularization at increased rates during the final stages of model training. While this method showed promise in reducing bias, it is worth noting that it does not modify the data itself and is not applied for the majority of pre-training.
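As a rough illustration of that dropout variant, the sketch below assumes the Hugging Face transformers library; the 0.15 dropout rates are placeholders rather than the values reported in the paper.

```python
# Sketch of the dropout-based variant: continue masked-language-model
# pre-training for a short final phase with dropout raised above the default 0.1.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.15,            # default is 0.10; value is a placeholder
    attention_probs_dropout_prob=0.15,   # default is 0.10; value is a placeholder
)
model = BertForMaskedLM.from_pretrained("bert-base-uncased", config=config)

# `model` can now be handed to a standard MLM training loop for the final
# pre-training steps; the training data itself is left untouched.
```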

Fairness perturbation

In a more recent line of work, Qian et al. (2022) trained a BART model, called the perturber, to modify textual data across a variety of demographic attributes. It is fine-tuned on a crowdsourced dataset of human-annotated perturbations, which allows it to generalise to new sentences. In the figure below, we clearly see the benefit of a learned perturber over heuristic-based approaches like AugLy or CDA. It is also noteworthy that the perturber fixed a grammatical error: the first word of the sentence is now capitalised.

After training the perturber, the authors train a RoBERTa model on the perturbed corpus, which they call FairBERTa. Just like with CDA, FairBERTa shows improvements on fairness benchmarks compared to vanilla pretrained models. These models have been contributed to Hugging Face by the authors (FairBERTa, perturber).

[Original] she bent over to kiss her friends cheek before sliding in next to her

[Perturber] He bent over to kiss his friends cheek before sliding in next to her

[AugLy] he bent over to kiss him friends cheek before sliding in next to him

[TextFlint] she bent over to kiss her friends cheek before sliding in next to her

Qian et al. (2022), Figure 3: Examples perturbed with heuristic approaches (AugLy and TextFlint) or the perturber (changed words highlighted); TextFlint did not perturb any words.
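If you want to try these checkpoints yourself, the sketch below loads them with transformers. The fully qualified model identifiers are assumptions on our part; verify them against the model cards on the Hub.

```python
# Sketch of loading the authors' released checkpoints. The model identifiers
# below are assumptions; check the Hugging Face Hub for the exact names.
from transformers import AutoModelForMaskedLM, AutoModelForSeq2SeqLM, AutoTokenizer

# The perturber is a BART-style seq2seq model that rewrites a sentence for a
# given target word and demographic attribute.
perturber_tok = AutoTokenizer.from_pretrained("facebook/perturber")
perturber = AutoModelForSeq2SeqLM.from_pretrained("facebook/perturber")

# FairBERTa is a standard RoBERTa encoder, so it drops into any existing
# RoBERTa fine-tuning pipeline without code changes.
fairberta_tok = AutoTokenizer.from_pretrained("facebook/FairBERTa")
fairberta = AutoModelForMaskedLM.from_pretrained("facebook/FairBERTa")
```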

Comparing the methods

When comparing Counterfactual Data Augmentation (CDA) with fairness perturbation, the fundamental difference is between heuristic-driven and ML-driven transformations. CDA operates on a set of fixed rules, leading to consistent and predictable modifications across datasets. Fairness perturbation, on the other hand, uses a learned model to adapt its transformations to the specific nuances of the text, and its output can vary with the decoding strategy used, such as greedy decoding or beam search. This adaptability allows for more context-aware adjustments, often resulting in more precise augmentation.
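To make the decoding point concrete, here is a small sketch contrasting greedy decoding with beam search on the perturber. As before, the model identifier is an assumption, and the prompt format (target word, demographic attribute, then the sentence) is our guess; consult the model card for the exact template.

```python
# Sketch: the same learned perturber can rewrite a sentence differently
# depending on the decoding strategy. Model identifier and prompt format
# are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/perturber")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/perturber")

inputs = tok("her, gender, she bent over to kiss her friends cheek",
             return_tensors="pt")

# Greedy decoding: take the most likely token at every step.
greedy_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=64)
# Beam search: keep the 5 most likely partial rewrites and return the best one.
beam_ids = model.generate(**inputs, do_sample=False, num_beams=5, max_new_tokens=64)

print(tok.decode(greedy_ids[0], skip_special_tokens=True))
print(tok.decode(beam_ids[0], skip_special_tokens=True))
```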

However, this adaptability also introduces a significant challenge. Because the perturbation approach relies on a model like BART, which has itself been pre-trained on large, potentially biased corpora, it inherits the biases of those datasets. This creates a vicious cycle: using a biased model to generate unbiased data raises the question of how one can ensure that the perturber itself is free from bias.

Measuring bias & model selection

The innovative approaches of CDA and fairness perturbation mark significant strides towards addressing bias by focusing on data modifications rather than mere model adjustments. This is a departure from earlier debiasing techniques that were critiqued for merely concealing bias. The challenge lies in validating that these new models avoid similar pitfalls. At FairNLP, we’re dedicated to standardizing bias measurement and fairness in language models, facilitating the responsible creation of AI solutions. Keep an eye on our fairscore project for upcoming developments.
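To give a flavour of what a data-driven bias measurement can look like, the sketch below compares a masked language model's pseudo-log-likelihood on a counterfactual sentence pair, in the spirit of benchmarks such as CrowS-Pairs. It is an illustrative probe, not the fairscore methodology.

```python
# Illustrative bias probe: compare a masked LM's pseudo-log-likelihood on a
# counterfactual sentence pair. A large gap suggests the model prefers one
# demographic phrasing over the other.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it is masked in turn."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

pair = ("The nurse said she would help.", "The nurse said he would help.")
scores = [pseudo_log_likelihood(s) for s in pair]
print(f"gap: {scores[0] - scores[1]:.2f}")      # closer to 0 means less preference
```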