A Show about Something? Topic Modeling Seinfeld (Part I)

Historians of the recent past are blessed (or, some would say, cursed) with an enormous body of non-textual primary sources: films, music, and radio, just to name a few. Among these, TV shows are one of the most underutilized bodies of historical “data.” The under-representation of television in academic history is, perhaps, not surprising: TV is not indexed, cannot easily be skimmed for relevant information, and requires a large time commitment on the part of the researcher. Further, because of copyright restrictions, some TV shows are much harder for scholars to access than textual resources.

Fortunately, volunteers at OpenSubtitles, Addic7ed, and a number of other sites have produced a huge multilingual corpus of TV and movie subtitles that are completely free to download. Working with captions from Seinfeld, this guide will show you how to batch download TV subtitle files (.srt files) and prepare them for use with MALLET or any other data mining/machine learning application. It presumes you have FileBot (available here) installed and are running the current version of Ubuntu (13.10), but the instructions should work with some modification on OS X.

Downloading the Subs

The first step is actually getting the subtitle files downloaded. This is not as easy as it might seem; the major subtitle websites don’t have a clean, easy way to download a large number of files. And although automated downloaders exist, they all presume you have a properly named copy of each episode located on your machine.

FileBot offers an easy, if inelegant, workaround to this problem. With it, we can generate a list of episode titles, turn that into a simple script to create empty video files, and use FileBot to download the subtitle files. (Note: you can mass download from FileBot’s Subtitle menu, but there is no easy way to deal with the problem of duplicate episodes. This method will download one subtitle file for every episode.)

  1. Open FileBot and click on the “Episodes” menu. Type in the name of the show, in this case “Seinfeld,” and make sure the other options are appropriate. Save the list as “Seinfeld.txt” in an empty directory.
  2. Open a terminal screen to the directory of your episode list. The following set of sed commands will change your episode list into a shell script that will generate video files with valid filenames, delete special features, and ensure proper handling of double episodes:
    sed -i '/xSpecial/d' Seinfeld.txt
    sed -i 's/ - /-/g' Seinfeld.txt
    sed -i 's/[^A-Za-z0-9._-]/_/g' Seinfeld.txt
    sed -i 's/ /_/g' Seinfeld.txt
    sed -i 's/__/_/g' Seinfeld.txt
    sed -i 's/_1_//g' Seinfeld.txt
    sed -i 's/_2_//g' Seinfeld.txt
    sed -i 's/^/touch /g' Seinfeld.txt
    sed -i 's/$/.mp4/' Seinfeld.txt
    sed -i '1i #!/bin/bash' Seinfeld.txt
  3. Then change the extension, make the script executable, and run it:
    mv Seinfeld.txt Seinfeld.sh
    chmod 755 Seinfeld.sh
    ./Seinfeld.sh
  4. If this all worked correctly, your folder should be populated with a set of empty .mp4 files. We can now use FileBot’s command line interface to download the subtitle files in the working directory. Don’t forget the period!
    filebot -get-missing-subtitles .
  5. Finally, delete the script and the empty video files:
    rm *.mp4; rm *.sh
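
If you want to check what the sed steps will do before you run the generated script, the same substitutions can be chained into one filter and tried on a few sample lines. This is only a sketch: make_placeholder_names is a name I made up, and the three episode lines are my guess at what FileBot's list looks like, so compare them against your own Seinfeld.txt.

```shell
#!/bin/bash
# Sketch: turn FileBot-style episode lines into safe .mp4 placeholder names.
# make_placeholder_names is a hypothetical helper, not part of FileBot.
make_placeholder_names() {
    # 1. drop special features        2. tighten " - " separators
    # 3. replace unsafe characters    4. collapse runs of underscores
    # 5. strip double-episode markers 6. append the video extension
    sed -e '/xSpecial/d' \
        -e 's/ - /-/g' \
        -e 's/[^A-Za-z0-9._-]/_/g' \
        -e 's/__*/_/g' \
        -e 's/_[12]_//g' \
        -e 's/$/.mp4/'
}

# Assumed sample input; real FileBot output may be formatted differently.
workdir=$(mktemp -d)
printf '%s\n' \
    'Seinfeld - 1x01 - Good News, Bad News' \
    'Seinfeld - 4x03 - The Pitch (1)' \
    'Seinfeld - 0xSpecial 1 - How It Began' |
    make_placeholder_names |
    while IFS= read -r name; do
        # Quote the name so odd characters cannot split the argument.
        touch -- "$workdir/$name"
    done

ls "$workdir"
rm -r "$workdir"
```

Because this version reads from a pipe instead of editing the file with sed -i, it should also run under the BSD sed that ships with OS X.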

Cleaning the Subs

  1. Now that we’ve downloaded the subtitle files, we should clean up the markup, timestamps, and release information:
    sed -i 's/^.*-->.*$//g' *.srt
    sed -i 's/<[^>]*>//g' *.srt
    sed -i 's/^.*www.*$//g' *.srt
  2. This is not strictly necessary, but I like to append the .txt extension to the files so that I know that I have processed them:
    find . -type f -exec mv '{}' '{}'.txt \;
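
The sed commands above blank out the unwanted lines rather than deleting them, so cue numbers, empty lines, and any Windows line endings stay in the files. If you would rather end up with dialogue only, a stricter filter might look like the sketch below. It is a variant of the steps above, not a replacement for them; clean_srt is a hypothetical name and the sample cue is made up.

```shell
#!/bin/bash
# Sketch: strip SRT scaffolding from stdin and print only the dialogue.
# clean_srt is a hypothetical helper; the sample cue below is invented.
clean_srt() {
    tr -d '\r' |                         # normalise Windows line endings
    sed -e '/-->/d' \
        -e 's/<[^>]*>//g' \
        -e '/www\./d' \
        -e '/^[0-9][0-9]*$/d' \
        -e '/^[[:space:]]*$/d'
    # sed steps: timestamps, markup tags, release credits,
    # bare cue numbers, then empty lines.
}

printf '%s\r\n' \
    '1' \
    '00:00:01,000 --> 00:00:03,500' \
    '<i>Hello there.</i>' \
    '' \
    '2' \
    '00:00:04,000 --> 00:00:05,000' \
    'Downloaded from www.example.com' |
    clean_srt
```

To apply it to a real file, redirect in and out, for example clean_srt < episode.srt > episode.txt, where both filenames are placeholders.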

Topic Modeling Seinfeld with MALLET

Now that we’ve downloaded and scrubbed the files, the subtitles are ready for MALLET. If you haven’t installed the package yet, first open a terminal window and navigate to the directory in which you would like to build MALLET, then run the following commands:

sudo apt-get update; sudo apt-get upgrade -y
sudo apt-get install ant mercurial openjdk-7-jdk -y
hg clone http://hg-iesl.cs.umass.edu/hg/mallet
cd mallet
ant

After you’ve built MALLET, import your subtitle files:

 bin/mallet import-dir --input /home/kevin/run/seinfeld --output sein.mallet --keep-sequence --remove-stopwords

And train your topics:

bin/mallet train-topics --input sein.mallet --num-topics 15 --optimize-interval 10 --output-topic-keys sein_keys.txt --output-doc-topics sein_compostion.txt --word-topic-counts-file sein_topic-counts.txt

After you do that, open up “sein_keys.txt” and inspect your results. You’ll notice that there are a few contraction fragments (e.g. “ll,” “ve,” etc.) in your results. To get rid of these, create a file called “customstop” and enter the words you don’t want to appear in your results (one per line). Delete your sein* files and import your data again with the --extra-stopwords argument:

bin/mallet import-dir --input /home/kevin/run/seinfeld --output sein.mallet --keep-sequence --remove-stopwords --extra-stopwords customstop

And train your topics again:

bin/mallet train-topics --input sein.mallet --num-topics 15 --optimize-interval 10 --output-topic-keys sein_keys.txt --output-doc-topics sein_compostion.txt --word-topic-counts-file sein_topic-counts.txt

Keep playing with your stop-words until you’ve gotten a clean set of keys.
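
Hunting for fragments by eye gets tedious, so one way to speed it up is to pull every very short token out of the keys file and treat the result as a list of candidates. The sketch below assumes each line of the keys file is tab-separated, with the topic number, the weight, and then the space-separated words; short_key_tokens is a name I made up and the sample file is invented.

```shell
#!/bin/bash
# Sketch: list very short tokens in a topic-keys file as stop-word candidates.
# Assumes lines of the form "topic<TAB>weight<TAB>space-separated words".
# short_key_tokens is a hypothetical helper, not a MALLET command.
short_key_tokens() {
    cut -f3 "$1" |                              # keep only the word column
        tr -s ' ' '\n' |                        # one word per line
        awk 'length($0) > 0 && length($0) <= 2' |
        sort -u                                 # de-duplicate
}

# Tiny made-up keys file for demonstration.
sample=$(mktemp)
printf '0\t0.5\tjerry ll good ve george\n1\t0.1\tsoup ll nazi don\n' > "$sample"
short_key_tokens "$sample"
rm -f "$sample"
```

On your own data you could run short_key_tokens sein_keys.txt >> customstop, then read through customstop by hand, since a two-letter token such as “tv” or “iq” may be worth keeping.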

What is Seinfeld a show about?

According to MALLET, Seinfeld is a show about a lot of things. Each line below gives a topic’s weight followed by its top twenty words:

  1. 0.19038 car keys parking cars space golf street market left god candy move bar sponge drive susan park bra ribbon blood
  2. 0.08102 lloyd glasses poppie peterman mother tony rock gum braun chinese pizza serenity pie blood cape yogurt statue mama fruit father
  3. 0.06927 married coffee maestro coma nina martin nose wedding doctor dog susan pam test cake drake chair heart tuscany jackie plane
  4. 0.11828 jimmy yada peterman newman chicken dry salad super beth stock bob tim jacket whatley shirt bowl cleaner muffin club label
  5. 0.12209 parents morty clothes florida paris jack pen father dad nice plans seinfeld shirt cadillac dinner moving tonight boxes bought rachel
  6. 0.06821 alright puddy move van pig man kruger christmas festivus newman david mail coat high mohel frank glass joe mentor assman
  7. 2.59298 jerry good george back elaine kramer guy time thing people make give man big call great wait put told talk
  8. 0.05228 hair car coat cake ticket babka wait rochelle wine dog smell ahead change hat cold cinnamon joel white ring massage
  9. 0.10159 show leo idea uncle kramer doctor banker davola jon voight nbc joe party helmet salsa pay character wallet call danson
  10. 0.05439 soup call bubble show susie plane seinfeld boy drake moops time jerry rye water stop handicapped good nazi court law
  11. 0.15743 good pony hair women opposite naked show bald idea night nice tuna funeral tuesday fake wednesday hand bitter john salad
  12. 0.14558 clown tv women toe tape smell bone baby coffee cable fire neil cigars guide chinese cabin code blow gammy medicine
  13. 0.08529 show russell card pilot raisins butler funny shoes nbc nana library women dogs gay birthday cancer white office grandmother peggy
  14. 0.10092 suit brien pitt plane miss meal flight cosmo murphy jean paul cat wake airport grace party tuck fleas arms bania
  15. 0.05573 babu naked funny money called vincent bike mail iq step mickey lawyer exclamation soda mattress records test store brett calzone
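
The list is easier to read once it is ranked by weight. The sketch below sorts a keys file on its second column, again assuming tab-separated lines of topic number, weight, and words. rank_topics is a hypothetical helper, and the sample rows are abbreviated from the list above, reusing its numbering for readability.

```shell
#!/bin/bash
# Sketch: rank topics from heaviest to lightest.
# Assumes lines of the form "topic<TAB>weight<TAB>words".
# rank_topics is a hypothetical helper, not a MALLET command.
rank_topics() {
    # LC_ALL=C keeps "." as the decimal point regardless of locale.
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr "$1"
}

# Three rows abbreviated from the results above.
sample=$(mktemp)
printf '1\t0.19038\tcar keys parking\n7\t2.59298\tjerry good george\n10\t0.05439\tsoup call bubble\n' > "$sample"
rank_topics "$sample"
rm -f "$sample"
```

Run on the full results, this puts topic 7 (jerry, good, george, elaine, kramer) at the top by a wide margin, which fits its role as a catch-all for character names and everyday words.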

In a future post, I’ll talk about tracking topic trends over time with the output of MALLET’s “--output-doc-topics” option.
