The topic of fairness in NLP is exceptionally broad; in this post, we hope to distill some of the key points from academic literature for an audience of technical practitioners. In section 1, we outline a general framework for thinking about fairness; in section 2, we survey some notable academic work; and in section 3, we outline a set of questions that may be useful when considering specific applications.
1. What does fairness in NLP even mean?
The core idea we hope to illustrate in this post is that there is no panacea for achieving "fair NLP". Though many of the core problems are intuitive, the complexity of human language means that measuring, much less mitigating, "unfairness" is a difficult task. "Bias" and "fairness" are exceptionally broad terms that span a wide range of possible behaviors. There is no single definition of desirable "fair" behavior; to the extent that NLP systems model the world (or a particular worldview), there is no single perfectly neutral, unbiased model. In other words, any NLP system involves some proposition about both what the world does look like and what the world should look like; for practitioners, it's critical to think deeply and with precision about what exactly desired behavior looks like, and why.
That being said:
The first component of any approach to fairness is defining who and what exactly we want to be fair with respect to. While social groups in the real world are fluid, in ML we typically define discrete groups along the axes of gender, race/ethnicity, and religion. (Fair ML, especially in language, unfortunately tends to treat gender as binary, both because it is mathematically convenient and because the data on which language models are trained often reflect a binary.)
For fairness in a typical tabular data setting, we generally assume that each data point reflects information about a single person, and that the values of these demographic attributes are generally known or accessible for each data point. For fairness in NLP, however, there isn't always a clear mapping between text and demographic information. More specifically, social groups might have to be inferred, and demographic information may be labelled based on author demographic (that is, text generated by particular groups, which covers things like dialect, accent, or writing style) or subject demographic (that is, text about particular groups). Crucially, author and subject demographic are distinct approaches to defining fairness. Which method to use for demographic labelling is context-dependent and varies based on the task at hand.
Similarly, there are many dimensions in NLP settings across which fairness can be measured. When the end-goal of the NLP model is something like classification or regression, we might be able to apply existing metrics for fairness in these applications by measuring the group-conditional performance (e.g. positivity rate, TPR, FPR, etc.). In language, particularly text generation, additional harms arise, most prominently language models that propagate harmful societal stereotypes. Measuring stereotypes is a murkier task: existing (academic) approaches have focused on either investigating the trained model artifact itself (i.e. the word embeddings), or evaluating the model outputs on specially curated datasets. However, both of these approaches have known issues, and neither should be treated as a conclusive or concrete standard.
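As a rough illustration of what "group-conditional performance" means for a binary classifier, the sketch below computes positivity rate, TPR, and FPR separately for each group. The function name `group_rates`, the toy labels, and the group names are all made up for this example; it is a minimal sketch, not a substitute for a full fairness toolkit.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group positivity rate, TPR, and FPR for binary (0/1) labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not; "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        rates[g] = {
            "positivity_rate": ratio(c["tp"] + c["fp"], n),
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
        }
    return rates

# Toy data, invented for illustration only
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = group_rates(y_true, y_pred, groups)
for g, r in rates.items():
    print(g, r)
```

Which of these rates matters (and how large a gap is acceptable) depends on the application; the code only surfaces the numbers.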
2. A (non-exhaustive) survey of relevant work
Ordered by year. Starred entries are worth reading in full!
One of the first works on "bias" in language models. Measures/illustrates bias by using the word embeddings to generate analogies via vector addition/subtraction, showing that embeddings confirm stereotypes; demographic groups are therefore determined with respect to text content. The debiasing approach involves identifying the "gender subspace" and projecting the embeddings of gender-neutral words so that they are orthogonal to that subspace.
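To make the two ideas concrete, here is a minimal sketch using made-up 3-dimensional vectors. It simplifies heavily: the "gender subspace" is assumed to be a single direction taken from one word pair, and only the step that removes a vector's component along that direction is shown. The embeddings, the helper names, and the word list are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    return dot(unit(u), unit(v))

def analogy(emb, a, b, c):
    """Word closest (by cosine) to emb[b] - emb[a] + emb[c], excluding the inputs."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

def neutralize(v, direction):
    """Subtract the component of v that lies along the given direction."""
    d = unit(direction)
    scale = dot(v, d)
    return [a - scale * b for a, b in zip(v, d)]

# Hypothetical toy embeddings, invented so the example runs
emb = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "programmer": [0.6, 1.0, 0.1],
    "homemaker": [-0.6, 1.0, 0.1],
    "table": [0.0, -1.0, 0.9],
}

# "man is to programmer as woman is to ...?"
print(analogy(emb, "man", "programmer", "woman"))

# Assumption: one difference vector stands in for the gender subspace
gender_direction = [m - w for m, w in zip(emb["man"], emb["woman"])]
debiased = neutralize(emb["programmer"], gender_direction)
print(dot(debiased, gender_direction))
```

After `neutralize`, the vector has no component along the chosen direction, which is exactly the narrow sense in which it is "debiased".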
This is an illustrative example of the many moving parts in what is casually referred to as "fair NLP". This is a speech-to-text task: i.e., one where the output itself is text. However, there is some notion of performance that summarizes the goodness of the text output in a single number. Demographic groups are determined with respect to speaker (author), not content. Bias is observed here because of differential model performance across groups.
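For a speech-to-text system, one common single-number summary is word error rate (WER). The sketch below assumes WER as the metric and compares its mean across speaker groups; the transcripts and group names are invented, and each reference transcript is assumed to be non-empty.

```python
from collections import defaultdict

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                curr[j - 1] + 1,         # extra word in hypothesis
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

def mean_wer_by_group(samples):
    """samples: iterable of (reference, hypothesis, speaker_group) tuples."""
    by_group = defaultdict(list)
    for ref, hyp, group in samples:
        by_group[group].append(word_error_rate(ref, hyp))
    return {g: sum(v) / len(v) for g, v in by_group.items()}

# Invented transcripts, for illustration only
samples = [
    ("turn the lights off", "turn the lights off", "group_a"),
    ("call my sister now", "call my sister now", "group_a"),
    ("turn the lights off", "turn the light of", "group_b"),
    ("call my sister now", "call me sister", "group_b"),
]
wer = mean_wer_by_group(samples)
print(wer)
```

Note that the grouping key here is the speaker, not anything about what the sentence says.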
This work analyzes a dataset that was popular at the time (the Stanford Natural Language Inference corpus; with the rise of larger language models trained on the web it's unclear the extent to which this is still used). The approach is similar to "Man is to Computer Programmer" — they use a mathematical measurement of similarity between words (in this case mutual information) to find (gendered) associations across words, with demographic groups determined by text content. In my opinion, this does exhibit some of the pitfalls outlined in "Language (Technology) is Power" — there is no explicit discussion of what comprises a harmful association.
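As a sketch of how an association score of this kind can be computed, the code below uses pointwise mutual information (PMI) over sentence-level co-occurrence. Treating "mutual information" as PMI and counting co-occurrence within a single sentence are both assumptions made for this example, and the sentences are invented.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """PMI (log base 2) for each word pair that co-occurs in at least one sentence."""
    n = len(sentences)
    word_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs are alphabetically ordered
    return {
        (a, b): math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }

# Invented toy corpus
sentences = [
    "a woman is cooking",
    "a woman is cooking dinner",
    "a man is driving",
    "a man is reading",
]
scores = pmi_scores(sentences)
print(scores[("cooking", "woman")])
print(("cooking", "man") in scores)
```

A high score only says two words co-occur more than chance would predict; deciding which associations are harmful is a separate judgment that the metric does not make.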
Here, demographic groups are determined via authorship rather than text content — this work explores gender, age, country, and region; finds the existence of performance disparities across groups; and introduces a novel approach to learn text classifiers which reduce those performance disparities.
An extension of the 2016 "Man is to Computer Programmer" paper to the multiclass setting; the original work made use of "binary" gender in calculating a "gender direction/subspace".
A lit review of approaches to gender bias x NLP (at this point a few years old); mostly useful for a high-level overview of many possible tasks and approaches.
This work shows limitations of the approaches to debiasing word embeddings in "Man is to Computer Programmer" and "Black is to Criminal". In short, while the approaches enumerated in those papers do successfully debias with respect to their original definitions of "bias," they ultimately preserve most relationships between words in the corpus: "gendered" words still cluster together. As a result, it is possible to recover the original "biased" relationships, and they may persist in downstream applications of those embeddings even if not detected according to the original metric. This work is a clear example of why "debiasing" in general but especially in language must be evaluated with a critical/skeptical eye, and why specifications of desired "unbiased" behavior must be careful and precise.
This survey paper is worth reading (or at least skimming) in full. This paper is motivated by the idea that there is no single definition of "desirable behavior," and no such thing as a "completely unbiased" model or dataset; instead, any specification of desired behavior is inherently value laden. The survey conducted of work on "fairness/bias in NLP" finds that most such work does not state clearly what comprises "bias" and how to conceptualize algorithm behavior with respect to broader societal power structures — to whom the harm is done and how those groups are defined; whether the harm is primarily representational or allocational; what behavior is deemed harmful and what is not, and why — and more.
Introduces a dataset for evaluating the performance of masked language models (models trained on data like "This is a masked sentence; the [MASK] is to determine what word is behind the masked token."). The paper reports results on BERT and BERT+ models. Several axes of discrimination/bias are included: gender, race, sexual orientation, nationality, religion, age, dis/ability, appearance, and socioeconomic status.
This paper focuses on toxicity, specifically the generation of toxic (racist, sexist, etc) text by pre-trained language models. This is a slightly different paradigm than the typical "bias" approach — rather than considering harms against specific groups, this work groups all harmful/derogatory generated text as "toxic." The authors find that even surface-level innocuous prompts can trigger highly problematic output, and that existing methods are insufficient to prevent this; upon inspection of training corpora, they find high volumes of toxic content in the training data.
This is the infamous Stochastic Parrots paper that ultimately led to the ousting of Drs. Timnit Gebru & Margaret Mitchell from Google Research. This is a broader survey paper about the harms of large language models, including the centralization of power, cultural homogenization/flattening, and environmental harms, among others. Worth a read for broad context in responsible NLP.
Introduces a dataset for evaluating the bias of conversational language tasks based on Reddit data, as well as evaluation frameworks for conversational model performance after debiasing — includes four axes (gender, race, religion, queerness). They benchmark DialoGPT and several debiasing approaches on this dataset and find evidence of bias with respect to religion that can be mitigated with some methods (though not all).
This is a blog post (with a link to the full paper) summarizing some explorations of anti-Muslim bias in GPT-3's text generation. In short, GPT-3 exhibits substantial anti-Muslim bias in its generated text, which is only slightly reduced by existing mitigation methods.
This paper surveys existing benchmark datasets for evaluating "fairness" in NLP tasks. The authors — who also wrote the 2020 "Language (Technology) is Power" paper — apply a social science approach ("measurement modeling"), and find that the benchmark datasets themselves have unclear definitions and specifications behind both what constitutes "biased" or "stereotyping" behavior and what constitutes desirable model behavior. If evaluating models on the datasets covered in this paper, the results should be taken with a grain of salt. Worth a skim, especially the illustrative example on the first page, to understand the gist of the criticism.
This paper focuses on text classifiers, specifically toxicity detection. Demographic groups are explored both in terms of text content (swear words, slurs, identity mentions) and text authorship (AAVE dialectal markers). Bias here is defined by the unjustified flagging of text as toxic (in conventional classification terms, high false positive rates). The authors find that existing methods are generally unsuccessful in debiasing toxicity detectors, and propose a proof of concept approach which synthetically relabels the training data; this approach (modifying the training data) is more effective than attempting to modify a pretrained model.
3. A worksheet for practitioners
1. Defining the language task and model setting:
- Assuming the model takes in some amount of text, does it generate a single output (e.g. a probability, a classification, or multiple classifications), or text output?
2. Defining the sensitive attribute:
- Do you care about author demographic or subject demographic, or both?
- Are you able to come up with or access sensitive feature values for each data point? For example, can you come up with a vector that looks like [ <string input>, <demographic info> ]?
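One way to picture the [ <string input>, <demographic info> ] record is sketched below, with author and subject labels kept as separate optional fields. The class name `LabeledText`, the field names, and the group values are hypothetical; the `coverage` helper simply reports how often each kind of label is actually available.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LabeledText:
    text: str
    author_group: Optional[str] = None   # who wrote or spoke the text, if known
    subject_group: Optional[str] = None  # who the text is about, if known

def coverage(records):
    """Fraction of records that have a known value for each kind of label."""
    n = len(records)
    return {
        "author": sum(r.author_group is not None for r in records) / n,
        "subject": sum(r.subject_group is not None for r in records) / n,
    }

# Placeholder records; the texts and group names are invented
records = [
    LabeledText("first example text", author_group="dialect_a"),
    LabeledText("second example text", author_group="dialect_b", subject_group="group_x"),
    LabeledText("third example text"),
    LabeledText("fourth example text"),
]
cov = coverage(records)
print(cov)
```

Low coverage for a label is itself useful information: any group-wise metric computed on it describes only the labelled subset.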
3. Defining and measuring the harm:
- If the model generates a single output, we can check typical measures of fairness (disparate accuracies or TPRs or FPRs or positivity rates etc).
- If the model generates text output:
- — Is there any notion of performance that is used to measure the "goodness" of the text output? You may be able to measure the performance of the generated text and determine whether there are group-wise performance disparities if you already have a means for evaluating generated text.
- — What sorts of representational or stereotyping harms do you anticipate? In other words, what is the best-case expected output, and what does a "bad" output look like?
4. Mitigating the harm:
- If the model generates a single output: existing classification/regression postprocessing approaches to fairness may be worth attempting, though they will be limited in that they cannot make use of the text input. See annotated bibliography for some examples of bias work in text classification.
- If the model generates text output:
- — For concerns around representational harms (such as stereotyping), do you have a sense of what prompts might trigger "bad" output?
- — Most mitigation techniques for language models rely on adjusting model internals, and even then, have varied degrees of success (see annotated bibliography).
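For the question of which prompts might trigger "bad" output, one simple probing approach is to fill the same templates with terms for different groups and compare a score over the generated text. The sketch below assumes you supply two callables: `generate` (your model) and `score` (for example, a toxicity scorer). Both are stubbed with stand-ins here so the code runs, and the templates and terms are placeholders rather than a vetted benchmark.

```python
from statistics import mean

def build_prompts(templates, group_terms):
    """Fill each template's {group} slot with every term; returns (group, prompt) pairs."""
    return [
        (group, template.format(group=term))
        for template in templates
        for group, terms in group_terms.items()
        for term in terms
    ]

def mean_score_by_group(prompts, generate, score):
    """Average score of the generated text, grouped by the group label of the prompt."""
    by_group = {}
    for group, prompt in prompts:
        by_group.setdefault(group, []).append(score(generate(prompt)))
    return {g: mean(v) for g, v in by_group.items()}

# Stand-ins so the sketch runs; replace with a real model call and a real scorer
def fake_generate(prompt):
    return prompt + " [model continuation]"

def fake_score(text):
    return 1.0 if "term_b" in text else 0.0

templates = ["The {group} person worked as a", "People described the {group} person as"]
group_terms = {"group_a": ["term_a"], "group_b": ["term_b"]}

prompts = build_prompts(templates, group_terms)
scores = mean_score_by_group(prompts, fake_generate, fake_score)
print(scores)
```

A gap between groups under this kind of probe is a prompt to look closer, not a verdict: the result depends heavily on the templates, the terms, and the scorer chosen.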