Content Moderation Workshop

By Grace Alvino on November 2, 2020

Who am I, and How Did I Develop the Workshop?

Hi all! My name is Grace Alvino, and I’m one of the new Praxis fellows: a proud member of the 2020-21 cohort, as well as a fifth-year Ph.D. candidate in the Department of English. Much of the research I’ve done at UVa has dealt with the ways self-identified radicals—the New Left, Riot Grrrls, and white supremacists¹—have used emergent and hybrid forms of media to represent themselves and get their messages across. I was particularly excited, then, when Brandon Walsh–the absolutely wonderful Head of Programs in the Scholars’ Lab–let our cohort know that we would be developing workshops that would help audiences learn about certain Digital Humanities methods and/or technologies. More specifically, I was excited because of the way Brandon told us about the project: by presenting his own workshop on sentiment analysis, which he’d used to teach audiences about one of the ways DH can be used to analyze texts. To try to describe Brandon’s workshop in a few lines is to butcher it, but essentially, he gave our cohort a set of passages from famous works (Kafka’s “Metamorphosis,” Bradbury’s Fahrenheit 451, the Declaration of Independence) and had us decide whether each passage had a “positive/happy” or “negative/sad” valence; in order to make this determination, we were asked to circle each word we saw as having either a “positive/happy” or “negative/sad” connotation, and to assign it a value (+1 for the former, -1 for the latter). Crucially, we weren’t meant to evaluate these words based on the context of the rest of the passage, or even the rest of a given sentence—to read them in their element and then to decide whether they had a positive or negative association—but rather to consider each word individually, and to rate it positive or negative depending on its definition. 
After we’d finished going through a passage, we were to add up the point values that we’d assigned to each word; that sum would be the total value for the passage, and would decide how “positive/happy” or “negative/sad” it was.

While my partner and I (hi, Crystal!) felt confident in our calculations re: the Declaration of Independence and “Metamorphosis” (the former returned a positive score, and the latter a negative one), we were less comfortable with the very high positive score that came out of our reading of the opening paragraph of Fahrenheit 451. If you’re familiar with the novel, you might be able to guess why: it details how a pile of books looks as it burns (a sight that makes the narrator anything but happy), and yet, because the description of the burning books contained so many words that, out of context, would be read as positive (from what I can recall, there was a lot of “blazing,” “shining,” “brilliant,” etc.), and because we were asked to evaluate each word on an individual basis, it returned an even higher positive score than the Declaration of Independence. Of course, that was the point: Brandon had chosen the excerpt from Fahrenheit 451 because he knew that, by the metrics he’d given us, it would return a very high positive score, one that we would know was at odds with the actual content of the passage. He was trying to get us to see the dangers of assuming that, in order to perform sentiment analysis on a text, all you had to do was program a computer to identify which words were “positive” and “negative,” “happy” and “sad” (that is, to assign a +1 value to the “positive/happy” words and a -1 value to the “negative/sad” ones), and then the computer would spit out an accurate reading of the tone of the passage. The problem with doing so was that it in no way accounted for context: as we saw in our Fahrenheit 451 analysis, “brilliant” can mean something very different, and consequently have a very different tone, depending on the context in which it is used.
It also didn’t account for ambiguity—what about words that could not be so neatly sorted into “happy” and “sad,” “positive” and “negative” categories?—nor for the fact that certain passages could be interpreted multiple ways by different readers, and that each of their interpretations (especially about something as subjective as tone) could be valid.
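For readers who like to see this sort of thing spelled out, here is a minimal Python sketch of the word-by-word scoring we did by hand. It is only an illustration under my own assumptions: the tiny `LEXICON`, the `score_passage` function, and the sample sentence are all invented for this post (the sentence is not a quotation from the novel), and this is not the code from Brandon’s workshop.

```python
import re

# Hypothetical mini-lexicon: +1 for "positive/happy" words, -1 for "negative/sad" ones.
LEXICON = {
    "blazing": 1,
    "shining": 1,
    "brilliant": 1,
    "happy": 1,
    "sad": -1,
    "ruin": -1,
}


def score_passage(text, lexicon=LEXICON):
    """Sum the value of each word on its own, ignoring all context."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


# An invented sentence about a fire, used only to show the effect.
fire = "The blazing pages were brilliant and shining as the library fell to ruin."
print(score_passage(fire))  # three +1 words and one -1 word give 2
```

The sentence describes a library being destroyed, yet it comes out “positive,” because the score only ever sees isolated words.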

While the workshop soon moved on to discussing ways that digital sentiment analysis could be modified so as to produce more satisfactory readings of passages, I couldn’t stop thinking about the activity described above, as it seemed to offer a way to illustrate one of the most important concepts I’d come across in my research. My study of white nationalists is concerned with how, in the mid-2010s, they began to move away from isolated pockets of the Internet (think Stormfront), where they were really only talking to each other, and onto mainstream social media platforms like Facebook, Instagram, and Twitter. Unlike their former online hangouts, these are spaces that are not primarily occupied by other white supremacists–and, most crucially, where open hate speech is ostensibly banned. The answer to why white supremacists wanted to be on these more mainstream sites is obvious: they hoped to use them to spread and thus to normalize their message, and to attract new recruits. The question of how they managed not to get banned from these sites, however, is more complicated. As these platforms, again, supposedly banned open hate speech, white supremacists had to figure out how to make posts that were racist, but racist in a way that, were the post to be reported, it wouldn’t be so obviously hateful as to trigger an immediate ban. They couldn’t just use slurs, or write white supremacist slogans—those would, presumably, get them kicked off of these sites—and so they had to develop a subtler way to get their messages across: one that would give them plausible deniability were their posts to be reported, and that would make moderators think twice about whether giving them the axe was justified.

They did so by writing in a style that I like to call “viral irony”: a mode of discourse that was developed on boards like 4chan (in which many white supremacists participated, or were radicalized) and SomethingAwful. Viral irony is intensely self-referential, relying on years and years of often-obscure Internet in-jokes and memes; it’s also deliberately ambiguous, mixing hyperbole and callous shock humor with what seems like earnestness so as to always leave the reader unsure whether a poster truly believes in what they’re saying. In short, it’s almost impossible for someone who hasn’t been steeped in online irony for a very long time to read a sentence written in this mode and have a firm grasp on what it means, let alone whether it’s serious or not. That, in turn, is why it’s the perfect mode for white supremacists looking to post undisturbed on mainstream platforms: because if they write using viral irony, and one of their posts is reported, a moderator is going to have almost no idea what to do with it. Is it racist? Well, a moderator might think, there seem to be some stereotypes being drawn on, but it’s also written in such a jokey way that the author could just be trying to laugh at, and thus deflate, these stereotypes. Is it sincere? Maybe, a moderator could tell themselves, but then again, its tone is so strange and almost humorous that it’s tempting to think it’s not. Indeed, because of all this confusion, moderators were far less likely to determine that a post should be deleted, especially when compared to a post that used slurs or other white supremacist language outright.

So, what does any of this have to do with Brandon’s sentiment analysis workshop? Well (and thank you for reading this far; I know it’s a lot!), the reason I was so interested in his workshop is that it gave me a framework for how to show an audience why white supremacists’ use of online irony made content moderation so difficult. Just as Brandon had asked us to assign “positive” or “negative” values to words so as to illustrate why evaluating certain terms out of context was almost impossible (and that not only context, but also ambiguity, made a huge difference in how certain words and phrases should be read), I could ask a workshop audience to evaluate passages in order to show them why it was so difficult for major platforms’ moderators to make decisions about posts white supremacists had made using viral irony. Like the words Brandon asked us to score as “happy” or “sad,” the language white supremacists used on platforms like Facebook, Instagram, and Twitter could not be analyzed only in terms of what it literally said; to do so would be to miss its meaning entirely. Only by considering the context in which it was written (the years and years of accrued memes and references), its potential ambiguity (the way it intentionally skirted the line between serious and joking), and its intent (to mainstream white supremacist talking points and attract new followers while not reading like outright hate speech so that it could stay online) could a moderator understand the real meaning of a given post. To put it simply, I would be asking my workshop participants to do textual analysis so as to illustrate the problems of online content moderation.

As for which texts I would use, I of course already knew I’d be asking my audience to consider posts white supremacists had written using viral irony (though, because I didn’t want to (re)traumatize my audience, and especially my audience members of color, more than was absolutely necessary, I would be picking the least horrific examples). In order to introduce them to textual analysis and the problems it could pose, I decided to have them analyze reviews from Rotten Tomatoes. Let me explain: as someone who spends more of my time watching movies than I should, I’ve often clicked on a Rotten Tomatoes review that was labeled as a whole tomato (read: a positive review) only to find it read like a pan, or opened up a review that had the splattered tomato marker (a negative review) and been surprised to find that it actually seemed mildly positive. After an extraordinarily helpful conversation with Brandon, I figured out how I could use Rotten Tomatoes reviews to show not only the problems with having a human moderator try to analyze a post or passage, but also with programming a computer to do it (as he had in his own workshop), and thus would also be able to show why having computers moderate content brings its own set of issues. Without further ado, then, here’s my script for how I would run my workshop on content moderation:

Grace’s Workshop Plan

I would begin by asking my audience how they would do online content moderation. Presumably, some of them would say that they would hire real people to read and then make decisions about which posts could stay up and which should be deleted; others would likely tell me that they would write a computer program that would do the same thing. I would tell my audience that we’re going to start with the latter idea: that is, we’re going to try to think through what would happen if you wrote a program to moderate content.

I would then ask how they would try to write that program: that is, what they think that program should do. Perhaps you could tell the machine that certain words, like slurs, have bad connotations, and thus that it should delete the posts of anyone who uses those words.
We would then try an experiment to see what it would look like if we actually wrote a program that did what I’ve just described above. I would show my audience a list of certain words that I’ve decided are “good,” and certain words that are “bad,” and then distribute several movie reviews from Rotten Tomatoes. I would tell the audience to look through the list of words I’d put up, and say that they would probably agree that those words seem like pretty clear value judgements: the words in the “good” category are ones like “excellent,” “success,” “wins,” etc., and the ones in the “bad” category include “terrible,” “flop,” “trainwreck,” “cheesy,” etc. I would then put them in groups of two or three, and ask them to go through the reviews they’d been given and make two tallies: one for the “good” words, and one for the “bad” words. Every time they see a word that’s “bad,” they should put a tally in the “bad” column; every time they see a word that’s “good,” they should put a tally in the “good” column. After they’ve finished going through each review, whichever column has more tallies determines whether it should be flagged as a positive or negative review.
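If it helps to picture the program the audience is standing in for, the tally exercise might look something like the following Python sketch. The word lists are just the examples named above (a real list would be much longer), the function names are my own, and the “tie” outcome is an assumption on my part, since the exercise as written doesn’t say what to do when the columns are even.

```python
import re
from collections import Counter

# Example words from the exercise; a real list would be much longer.
GOOD_WORDS = {"excellent", "success", "wins"}
BAD_WORDS = {"terrible", "flop", "trainwreck", "cheesy"}


def tally_review(text):
    """Return (good_tallies, bad_tallies) for a single review."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    good = sum(counts[word] for word in GOOD_WORDS)
    bad = sum(counts[word] for word in BAD_WORDS)
    return good, bad


def label_review(text):
    """Flag the review according to whichever column has more tallies."""
    good, bad = tally_review(text)
    if good > bad:
        return "positive"
    if bad > good:
        return "negative"
    return "tie"  # assumption: the exercise does not define a tie-breaker


review = "An excellent cast wins the day, despite one terrible subplot."
print(tally_review(review))  # (2, 1)
print(label_review(review))  # positive
```

Like the paper version, the sketch only counts matches against the two lists; it has no notion of what the sentence around each word is doing.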

Once they’ve done that, I would ask them to come together, regroup, and talk about what they found. I would ask them how the exercise went, and, more specifically, if they thought the system worked—that is, if they thought it could accurately determine what was a “good” and “bad” review. If not, why not—did they find some problems in looking for the “good” and “bad” words? If so, what were they?

Obviously, the audience would have found some problems: namely, that this sort of system has no way of accounting for the context in which a word is used, and the multiple meanings that it might have depending on that context. That is, a review might say something like, “Unlike the director’s last movie, this film isn’t a trainwreck,” or “This horror movie is delightfully campy and cheesy,” but, since the audience is only being asked to consider the words themselves, not the context in which they’re being used, “trainwreck” and “cheesy” mean they’ll have to put two tallies in the “bad”/“negative review” column. Audience members will also likely have no idea what to do with words that they feel might have a “bad” or “good” association, but that aren’t on the list, or words that don’t fit neatly into those two categories.
I would then ask what would happen if a platform like Twitter tried to use this method as a way of moderating their content: that is, if it wrote a program in which some words were assigned as “good,” and others as “bad,” and if you had too many “bad” words—or any at all—your post would get deleted?

The audience would likely (and correctly) say that this would create a lot of problems. For one, it would treat all uses of those “bad” words the same, and it would have no way to distinguish between in-group and out-group usage. For example, a gay man using a slur playfully in conversation with another gay man, versus a straight person using a slur to insult a gay man, would be read by the program in exactly the same way. The program would also have no way to read for ambiguity, nor for tone.
I would then say—probably with the audience’s agreement—that this doesn’t seem like the best way to do online content moderation. In fact, Rotten Tomatoes agrees—that’s why it doesn’t use a program that assigns values to words in order to sort its reviews! What actually happens at Rotten Tomatoes is that a staff member reads through a review, and decides if a review is positive or negative.
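To make the word-list approach to moderation concrete, here is a small Python sketch of the kind of program discussed above. Everything in it is hypothetical: `badword` is a harmless placeholder standing in for a real slur, and `should_delete` and its `max_allowed` threshold are names I made up to cover both the “too many” and the “any at all” versions of the rule. It is not how any actual platform works.

```python
import re

# Placeholder token standing in for a real slur; the real list is deliberately omitted.
BLOCKLIST = {"badword"}


def should_delete(post, blocklist=BLOCKLIST, max_allowed=0):
    """Delete a post if it contains more than `max_allowed` blocklisted words."""
    words = re.findall(r"[a-z']+", post.lower())
    hits = sum(1 for word in words if word in blocklist)
    return hits > max_allowed


in_group = "Happy birthday, you absolute badword, love you"
hostile = "Get out of here, badword"
print(should_delete(in_group), should_delete(hostile))  # True True

# A post that avoids every listed word is kept, whatever its tone.
print(should_delete("You people know exactly what I mean."))  # False
```

The affectionate post and the hostile one get exactly the same verdict, while a post that avoids the listed words stays up no matter how it reads.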

So, I would ask, what if we try this way of doing things? That is, what if I were to pass out a set of reviews and, after reading them, you would decide whether you think each of them is positive or negative? That’s exactly what I’d do (putting, again, the audience in pairs), and, when they’d finished, I would bring them back together and let them know whether Rotten Tomatoes rated each review positive or negative.

I would ask the audience if they were surprised about any of the results—and if so, why? If they were, what does this indicate about this way of determining whether something should be flagged as positive or negative, as good or bad? The audience, presumably, would say that it shows this system is subjective, that it still depends on context, and that, crucially, each person reads context (and tone!) differently.
I would then ask my audience to consider Twitter again. Before, we decided that having a system in which certain words are flagged as “bad,” and then subject to deletion, isn’t going to work. What Twitter actually does, I would tell them, is a combination of this approach and the Rotten Tomatoes system: it asks users to report tweets that they think fit into certain categories—that are abusive and harmful content; that display a “sensitive” photo or video; that are spam; etc. If it’s one of the first two categories, Twitter will prompt the user reporting the tweet to write a little about why this is the case. Then, that report gets sent to a content moderator—and it’s that content moderator’s job to decide whether or not the tweet should be deleted.
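The report-then-review flow described above can also be sketched in a few lines of Python. This is strictly a toy model of my description, not Twitter’s actual system: the category identifiers, the `Report` class, and the `file_report` and `review` functions are all hypothetical, and the category list is incomplete (the “etc.” above is doing real work). The human moderator is modeled as a plain function that returns True for “delete.”

```python
from dataclasses import dataclass
from typing import Callable

# Categories named in the description; the identifiers are hypothetical.
NEEDS_EXPLANATION = {"abusive_or_harmful", "sensitive_media"}
CATEGORIES = NEEDS_EXPLANATION | {"spam"}


@dataclass
class Report:
    tweet_text: str
    category: str
    explanation: str = ""


def file_report(tweet_text, category, explanation=""):
    """Build a report, requiring an explanation for the first two categories."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if category in NEEDS_EXPLANATION and not explanation.strip():
        raise ValueError("this category needs a short explanation")
    return Report(tweet_text, category, explanation)


def review(report: Report, moderator: Callable[[Report], bool]) -> str:
    """The deciding step is a human judgment, modeled here as a callable."""
    return "delete" if moderator(report) else "keep"


report = file_report("example tweet", "abusive_or_harmful", "targets a group")
unsure_moderator = lambda r: False  # a moderator who cannot tell, so the post stays
print(review(report, unsure_moderator))  # keep
```

The interesting part is the `moderator` argument: the code can route a report, but the verdict still rests on one person’s reading of the post.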

I would then ask my audience what they thought about this approach: what advantages they believe it has, and what potential disadvantages they might see. They would likely identify the same problems that came up during our second exercise, with Rotten Tomatoes’ human review process: that different people interpret certain content, and context, differently, that content and tone really matter, and that people don’t always get the verdict on content and tone, and thus how a piece of text should be designated, 100% right.
I would then ask what they think would happen when a Twitter moderator encounters a message that’s actually designed to mislead a moderator about what it contains: a message that’s designed in a way that, if you don’t have a really deep understanding of the references it uses and the context in which it was written, it would be almost impossible to tell what it’s trying to say, and whether or not it’s hateful.

I would then pass out examples of content that falls into that category: not only examples of viral irony as used by white supremacists on Twitter, but also non-racist viral irony used by regular Twitter posters. I would put my audience into groups again, and tell them that their job is to decide if each example is offensive and should be deleted, and also to be able to articulate a reason why or why not.
After they were finished, I would bring my audience back together and ask them if it was really difficult. They would likely say yes, and I would tell them that’s what the people who made the white nationalist examples of viral irony wanted. There’s a hateful message in there, I would tell the audience, but it’s wrapped up in a discourse called viral irony. This is a discourse that doesn’t state its hate outright: rather, it wraps it in many layers of references and context, so that it’s really difficult for people who aren’t familiar with that context to know or to come up with a clear reason why it’s hateful, and relies on shifting tone and ambiguity to confuse the reader as to whether it’s serious or not. This is all part of a strategy white nationalists use to get their hateful messages past content moderators, and to ensure that those messages can stay up on Twitter and other mainstream platforms, and thus that they can reach a wider audience and attract new converts.

So, what do you think? If you’ve read this far—and again, major kudos for doing so—I’d love to hear your thoughts on my proposed workshop. Is it too long? Too involved? Not clear on what it’s trying to show? Is there anything you’d do differently (especially if you have experience running workshops—but even if you don’t)? Let me know!

¹ This is not to draw a moral equivalency between the New Left or Riot Grrrls and white nationalists, of course!