One day GPT-2, an earlier publicly available version of the automated language generation model developed by the research organization OpenAI, started talking to me openly about “white rights.” Given simple prompts like “a white man is” or “a Black woman is,” the text the model generated would launch into discussions of “white Aryan nations” and “foreign and non-white invaders.”

Not only did these diatribes include horrific slurs like “bitch,” “slut,” “nigger,” “chink,” and “slanteye,” but the generated text embodied a specific American white nationalist rhetoric, describing “demographic threats” and veering into anti-Semitic asides against “Jews” and “Communists.”

GPT-2 doesn’t think for itself—it generates responses by replicating language patterns observed in the data used to develop the model. This data set, named WebText, contains “over 8 million documents for a total of 40 GB of text” sourced from hyperlinks. These links were themselves selected from posts most upvoted on the social media website Reddit, as “a heuristic indicator for whether other users found the link interesting, educational, or just funny.” 
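Anyone who wants to see this behavior firsthand can query the released GPT-2 weights directly. The sketch below is one minimal way to do so, using the open-source Hugging Face transformers library; the library choice, the sampling settings, and the seed are my assumptions, since the essay does not describe how the model was prompted.

```python
# Minimal probe of GPT-2's completions for an identity-based prompt.
# Assumes the Hugging Face `transformers` library (pip install transformers);
# the essay itself does not specify any tooling.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # fix the sampling seed so the completions are reproducible

prompt = "a Black woman is"  # one of the prompts quoted above
completions = generator(
    prompt,
    max_length=40,           # total length in tokens, prompt included
    num_return_sequences=3,  # draw several samples to see the variation
    do_sample=True,          # sample from the model's distribution
)

for i, sample in enumerate(completions, 1):
    print(f"--- completion {i} ---")
    print(sample["generated_text"])
```

The specific continuations vary with the seed and sampling settings, but whatever the model produces is drawn from the statistical patterns of its WebText training data, which is the point of the passage above.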

However, Reddit users—including those uploading and upvoting—are known to include white supremacists. For years, the platform was rife with racist language and permitted links to content expressing racist ideology. And although there are practical options available to curb this behavior on the platform, the first serious attempts to take action, by then-CEO Ellen Pao in 2015, were poorly received by the community and led to intense harassment and backlash. 

Whether dealing with wayward cops or wayward users, technologists choose to allow this particular oppressive worldview to solidify in data sets and to define the nature of the models we develop. OpenAI itself acknowledged the limitations of sourcing data from Reddit, noting that “many malicious groups use those discussion forums to organize.” Yet the organization nevertheless continues to use the Reddit-derived data set, even in subsequent versions of its language model. The dangerously flawed nature of data sources is effectively dismissed for the sake of convenience, despite the consequences. Malicious intent isn’t necessary for this to happen, though a certain unthinking passivity and neglect is.

Little white lies

White supremacy is the false belief that white individuals are superior to those of other races. It is not a simple misconception but an ideology rooted in deception. Race is the first myth, superiority the next. Proponents of this ideology stubbornly cling to an invention that privileges them. 

I hear how this lie softens language from a “war on drugs” to an “opioid epidemic,” and blames “mental health” or “video games” for the actions of white assailants even as it attributes “laziness” and “criminality” to non-white victims. I notice how it erases those who look like me, and I watch it play out in an endless parade of pale faces that I can’t seem to escape—in film, on magazine covers, and at awards shows.

Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one.

This shadow follows my every move, an uncomfortable chill on the nape of my neck. When I hear “murder,” I don’t just see the police officer with his knee on a throat or the misguided vigilante with a gun by his side. I see the economy that strangles us, the sickness that weakens us, and the government that silences us.

Tell me: what is the difference between overpolicing in minority neighborhoods and the bias of the algorithm that sent officers there? What is the difference between a segregated school system and a discriminatory grading algorithm? Between a doctor who doesn’t listen and an algorithm that denies you a hospital bed? There is no systematic racism separate from our algorithmic contributions, from the hidden network of algorithmic deployments that regularly collapse on those who are already most vulnerable.

Resisting technological determinism 

Technology is not independent of us; it is created by us, and we have complete control over it. Data is not just arbitrarily “political”: there are specific toxic and misinformed politics that data scientists carelessly allow to infiltrate our data sets. White supremacy is one of them.

We have already inserted ourselves and our decisions into the outcome; there is no neutral approach. There is no future version of data that is magically unbiased. Data will always be a subjective interpretation of someone’s reality, a specific presentation of the goals and perspectives we choose to prioritize in this moment. That power is held by those of us responsible for sourcing, selecting, and designing the data and developing the models that interpret it. Essentially, there is no trade of “fairness” for “accuracy”; that is a mythical sacrifice, an excuse not to own up to our role in defining performance at the exclusion of others in the first place.