Vive la différence!

By Sarah Storti on September 20, 2011

When I signed up for David Hoover’s “Out-of-the-Box Text Analysis” course at last summer’s Digital Humanities Summer Institute (DHSI), I had absolutely no idea what I was getting into. Text analysis… with computers? Data mining? What? The first of our meetings felt akin to culture shock, I think, or to having a bucket of ice water thrown over my head. (We are going to do what with the texts? Cut them up into word frequency lists??) Once I recovered myself, as it were, the rest of my fast-paced DHSI week gave me a basic understanding not only of how to use text mining tools but also, perhaps more importantly, of why one might want to use them. I thought now might be the perfect time to share some of what I learned in Victoria.

Firstly, yes, a text mining operation can give you a set of numbers that supposedly correspond in some way to a given text. Oooooh, isn’t that scary? Well, okay, maybe it is. But what happens after that is entirely dependent upon the scholar interpreting the numbers. The machine is only a machine. It is not going to change the text. Text mining tools can, however, change the ways we look at our texts.
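
For anyone curious what "a set of numbers" might look like in practice, here is a minimal sketch of a word frequency list in Python. It is not one of the tools we used at DHSI; the function names `word_frequencies` and `relative_frequencies` are my own hypothetical helpers, and the tokenizing rule (lowercase everything, keep only letters and straight apostrophes) is a simplifying assumption that a real study would want to revisit.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs (hypothetical helper).

    Assumption: a "word" is a run of letters a-z or straight
    apostrophes, after lowercasing the whole text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def relative_frequencies(text):
    """Express each count as a share of all words in the text,
    so that texts of different lengths can be compared."""
    counts = word_frequencies(text)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


sample = "The machine is only a machine. It is not going to change the text."
freqs = word_frequencies(sample)

# Print the three most frequent words with their counts.
print(freqs.most_common(3))
```

The output is nothing more than a tally. Deciding whether that tally means anything is still up to the person reading it.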

In the course, we certainly considered questions that I had never before considered about a set of texts. They included: how well (according to word frequency lists) does the author distinguish between the “voices” of his or her different characters? What about the difference between those number sets and the number sets we get from doing the same kind of test with a different author? Is one “better” at differentiating vocabulary than another? Does this even matter? Can we make an educated guess about authorship of a disputed text based on comparisons of that text with other texts written by the two authors in question? Frequently, other members of my class would push back, pretty forcefully, on the idea that we could draw hard and fast conclusions about a text using only the tools on our computers. Nobody, it seemed to me, was about to publish a paper on why Author 1 is superior to Author 2 because Author 1 uses a richer vocabulary. And these people were in the class voluntarily; they (in theory) wanted to learn how to use such tools, to what various ends I could not say. I do know, however, that I never once felt that the group lost sight of the difference between an author’s text and a set of numbers.

The results we got from running these tools on our texts were simply that: data. Mining texts for word frequency (or what you will) does not inherently devalue or damage the text. Sure, it’s possible to come at a text with a preconceived notion and repeatedly run different tests to try to prove that theory right, but the same holds true, certainly, for traditional literary criticism. I suppose my question is this: do we have any reason to be concerned, really, about what will happen if Prism does eventually allow for some kind of data mining? What is the worst that can happen? And by allowing our fears (e.g., someone will draw an irresponsible conclusion based on numbers) to dictate the direction in which we take Prism, aren’t we obstructing the possibility of unimagined positive outcomes?

While I do think it is important for us to feel good about the tool that we are building, I would also echo Bethany’s cautionary (and immensely helpful) advice “not to reject approaches without exploring them — without feeling fully informed as to their potential and (more importantly, in cases where you intuit a problem) the intervention you might make, or the twist you might bring, to their use in the scholarly community.” “Text mining” may be a Bad Word to some of us, but the way our as-yet unrealized tool works could very well make an interesting intervention in the text mining (and DH) world, regardless of our personal preferences. Isn’t that the kind of thing we’re here at Praxis to do?

To conclude, I will make another admission: even after a week of thought-provoking and congenial collaboration with my colleagues at DHSI, I would not consider myself to be a data mining kind of scholar. In fact, I would consider myself to be more bibliographically inclined than anything else. I own both of the Tanselle syllabi. I am invested in methodologies which make it difficult for me to see how I could implement text mining in my own work right now. These inclinations of mine do not, however, mean that I think everybody else needs to lean my way. I look forward to seeing what other people want to do with Prism, and I hope text mining does eventually become part of that. I would like to see what kinds of questions people will ask about the interpretation of texts thanks to our intervention via Prism, especially because those questions are ones I would probably never ask if left to myself. After all, as my grandmother says, variety is the spice of life.