With text mining, you can look at a large corpus of text and find every mention of mercury. With topic modeling, you can look at the same large corpus and separate it into categories that refer to Mercury the planet, Freddie Mercury from Queen, mercury the element, the WNBA team Phoenix Mercury, or Mercury the Roman god.
In my view, topic modeling works best as a tool for discovery rather than analysis. In theory, the patterns or topics emerge from the algorithm and are not pre-determined by the researcher. In practice, however, a great deal of fine-tuning goes into training the model, and subjective judgment goes into selecting stop words. Consequently, to evaluate the output of a topic modeling program properly, the researcher needs to be transparent about the specifications and assumptions made.
Topic modeling used poorly runs the risk of being circular or tautological if the built-in assumptions necessarily produce the expected results. As noted by David Blei here,
A model of texts, built with a particular theory in mind, cannot provide evidence for the theory. (After all, the theory is built into the assumptions of the model.)
The Mining the Dispatch project brought to light some of the issues with topic modeling and digital scholarship more generally. The project measured the distribution of want ads for runaway slaves in the Richmond Dispatch. The analysis revealed that the want ads coincided with the agricultural seasons – there were more ads when the need for labor was higher during the harvest season. Does that analysis contribute new knowledge? Or is it merely the same knowledge arrived at by a different method? Sometimes digital scholarship is a methodological contribution, not an analytical one. Moreover, does fetishizing digital scholarship over traditional scholarship risk changing the acceptable standard of proof? Can we no longer accept the common-sense assertion that want ad distribution changed with the agricultural seasons? Must we show numeric data and statistics as proof? Or do we diminish the value of qualitative sources like letters or newspaper articles that mention the same phenomenon?
I followed the Mallet tutorial from Programming Historian, which I found difficult.1
I returned to my corpus from the Detroit ‘67 Project from last week. I was curious if Mallet could distinguish among the oral history interviews. I already knew that the project collected the interviews by profession: healthcare workers, law enforcement, students, community activists, business owners, and government officials. Could Mallet differentiate the interview transcripts along these lines?
Here are three of the twenty topics that resulted from my first run through Mallet:
topic 1: [ww city sitting today riot time neighborhood rv growing don lived moved tension didn remember yeah born bf black]
topic 2: [mm detroit remember jj don yeah case city black lp law judge jy communist walter court wallace put bg]
topic 19: [school black park lived high city white work living wayne working worked moved years year don class highland neighborhood]
Hmm, what are those two-letter words? I already knew that Billy Winkel was the predominant interviewer, and that explained the ww. A quick look at the others indicated that these were the initials of interview participants and not a problem with the file or the program. This result told me to add these initials to the stop word list. I did not have a master list of all the participants, and I confess I did not have the patience in the moment to write a script to extract them (probably a regular expression matching the initials before the semicolon at the beginning of each speaker's turn).
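In hindsight, the script would have been short. Something like this sketch is what I had in mind; it assumes each speaker turn really does start with two or three letters and a semicolon, and the folder and file names here are invented:

```python
import glob
import re

# Hypothetical path to the folder of plain-text transcripts.
TRANSCRIPT_DIR = "detroit67_transcripts"

# Two or three letters at the start of a line, followed by a semicolon --
# the pattern the speaker initials seem to follow in these transcripts.
initials_pattern = re.compile(r"^([A-Za-z]{2,3});", re.MULTILINE)

initials = set()
for path in glob.glob(f"{TRANSCRIPT_DIR}/*.txt"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for match in initials_pattern.finditer(text):
        # Mallet lowercases everything, so store the stop words in lowercase.
        initials.add(match.group(1).lower())

# One word per line, so the file can be handed to Mallet as extra stop words.
with open("extra-stopwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(initials)))

print(sorted(initials))
```

Even without a master list of participants, that would have collected the full set of initials to feed back into Mallet.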
The other item I noticed was the reference to don in topic 19. I was hoping Mallet had revealed a link to someone named Don whom I did not recognize as a key figure in the events. I dove into the relevant text files and discovered that Mallet was picking up don't instead. Because Mallet strips out punctuation, don't gets spit out as don. See also didn in topic 1. Mallet evidently also ignores capitalization, so there would be no easy way to distinguish between the contraction and the name Don. Do users edit their text to eliminate contractions before running Mallet? Would that be advisable, or would that skew the data?
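To convince myself of what was happening, I mocked up the preprocessing in a few lines of Python. This is only a toy imitation of what Mallet appears to do (lowercase everything, break on punctuation, drop one-letter leftovers), not its actual tokenizer:

```python
import re
import string

def mallet_like_tokens(text):
    """Lowercase, break on punctuation, drop one-letter leftovers.

    A toy approximation of Mallet's preprocessing for illustration only,
    not its actual tokenizer.
    """
    lowered = text.lower()
    # Treat every punctuation mark (including straight and curly apostrophes) as a break.
    no_punct = re.sub(f"[{re.escape(string.punctuation)}’]", " ", lowered)
    return [token for token in no_punct.split() if len(token) > 1]

print(mallet_like_tokens("Don't ask Don."))     # ['don', 'ask', 'don']
print(mallet_like_tokens("I didn't see him."))  # ['didn', 'see', 'him']
```

Both Don and don't collapse into the same token, which matches what shows up in the topic keys.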
The only topic that remotely returned something relevant was topic 2, grouping words like law, court, and judge. I might get better results once I run it again with the additional stop words.
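The plan for that re-run is to re-import the corpus with the initials as a supplementary stop list and then retrain. Driving Mallet from Python keeps it in the same script as the extraction step above; the paths and output names are placeholders, and --extra-stopwords is the import option I believe adds words on top of Mallet's built-in English stop list (worth double-checking against bin/mallet import-dir --help):

```python
import subprocess

# Re-import the corpus with the initials as extra stop words, then retrain.
subprocess.run([
    "bin/mallet", "import-dir",
    "--input", "detroit67_transcripts",
    "--output", "detroit67.mallet",
    "--keep-sequence",
    "--remove-stopwords",
    "--extra-stopwords", "extra-stopwords.txt",  # the file written by the script above
], check=True)

subprocess.run([
    "bin/mallet", "train-topics",
    "--input", "detroit67.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "detroit67_keys.txt",
    "--output-doc-topics", "detroit67_composition.txt",
], check=True)
```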
Perhaps this corpus was too narrow for Mallet to work well? Was 200 files totaling less than one million words too small to work with? Maybe a different angle would be sorting the interviewees by race or doing a sentiment analysis? Did black and white witnesses recall different things? What did the witnesses react to with fear? To the police or to the rioters? Could an algorithm identify even more granularity and differentiate between reactions to the police and reactions to the National Guard? Interesting questions, but they might be better suited to coding manually and then analyzing the results.
For two reasons: (1) The tutorial used Windows syntax, and I had trouble remembering to use Mac syntax in the Terminal when following along. (2) I did not understand all the parameters and their syntax and was blindly cutting and pasting. See (1). ↩