In my last post I discussed the mechanics of downloading and topic modeling television subtitles. The guide included a somewhat kludgy workaround to the problem of mass-downloading subtitles without having matching source video files. Fortunately, the very helpful developer of FileBot contacted me and provided a Groovy script (I’m referring to the language, not the dated mid-century adjective) that solves this problem in a much more elegant manner.
Below you will find my (slightly modified) interactive version of this script, along with instructions for fetching subtitles with it. Since this script (or rather the OpenSubtitles database) has trouble telling the names of some shows (Seinfeld, for instance) apart from individual episode names, I’m leaving my original inelegant (but fool-proof) instructions up in the previous post. And since we might as well work with a different show’s subtitles, I’ll go ahead and give Community the MALLET treatment just for fun.
- The first thing you’ll need to do (after creating a working directory for your subtitle files) is to create a text file called “subtitles.groovy” and paste the following code inside it:
def osdb = net.sourceforge.filebot.WebServices.OpenSubtitles
console.print('Enter Show Name: ')
def query = console.readLine()
def language = 'English'
def options = osdb.search(query)
println "Fetching subtitles for '${options[0]}'"
def subs = osdb.getSubtitleList(options[0], language)
println "Found ${subs.size()} subtitles"
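// Keep one subtitle per episode: filter by language and a parseable episode number,
// group by episode number, then take the first subtitle in each group.
def selection = subs.findAll{ it.languageName =~ language && parseEpisodeNumber(it.name) != null }.groupBy{ parseEpisodeNumber(it.name) }.values()*.get(0)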
println "Selected ${selection.size()} subtitles"
selection.each{
    println it.fetch().saveAs(it.path)
}
- Next, from your subtitle directory, run the following command:
filebot -script subtitles.groovy
You will then be asked to enter the TV show’s name. Type in either “Community” or “Community (2009)” (without the quotes) and FileBot will download the subtitles for each episode automatically.
- Next, we’ll need to clean up the episode names:
filebot -rename . --q "Community" --format "{n.space('_')}-{s00e00}-{t.space('_')}"
- Clean up the subs:
sed -i 's/^.*-->.*$//g' *.srt
sed -i 's/<[^>]*>//g' *.srt
sed -i 's/^.*www.*$//g' *.srt
- And (optionally) append .txt to their filenames:
find . -type f -name '*.srt' -exec mv '{}' '{}'.txt \;
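To see what the three sed substitutions above actually do, here is a minimal sketch that runs them on a throwaway file instead of your real subtitles. The sample subtitle text, the www.example.com line, and the scratch directory are made up for illustration. The final blank-line filter is my own addition so the result is easier to read; it is not part of the cleanup steps above.

```shell
# Create a tiny sample subtitle file in a scratch directory (contents are invented).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.srt" <<'EOF'
1
00:00:01,000 --> 00:00:03,000
<i>Welcome to Greendale.</i>

2
00:00:04,000 --> 00:00:06,000
Subtitles from www.example.com
EOF

# Same three substitutions as above, but printed to stdout rather than
# edited in place with -i. The trailing sed drops the lines left empty.
cleaned=$(sed -e 's/^.*-->.*$//g' \
              -e 's/<[^>]*>//g' \
              -e 's/^.*www.*$//g' "$tmpdir/sample.srt" | sed '/^$/d')

printf '%s\n' "$cleaned"

# Remove the scratch directory.
rm -rf "$tmpdir"
```

On this sample, the timestamp lines, the formatting tags, and the line containing a web address are blanked out, while the dialogue and the cue numbers (1 and 2) remain. If you would rather not keep the cue numbers, you would need an additional substitution for lines made up only of digits.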
And as promised, here are the topics MALLET generated for Community:
- 2.9485 jeff abed guys good pierce britta time troy make back annie people man cool shirley give thing wait stop god
- 0.1548 greendale dean school college year gun cream star shoot city prize commercial history win paintball guns ice plan paint beings
- 0.14253 inspector abed town spacetime pizza cougar timelines evil air dreamatorium foosball timeline toby bathroom constable biology space pies lunch toilet
- 0.14235 pierce father dad gay gilbert family snap game pool son bitches double hawthorne white cool thanksgiving fair cookie shorts party
- 0.12852 pop pen rich football pottery alan halloween magnitude ghost vicki bag woods puppy hours puppet lesbian president yo kettle mixer
- 0.12704 spanish chicken vaughn coach cubes class delta fingers duncan write archie exam senor job kickpuncher homework green buddy whale semester
- 0.1166 christmas dean jesus sing year glee merry club santa meaning regionals cookies cave religion memories debate holiday december planet singing
- 0.10738 fort todd yam war professor pillow subway narrator blanket annie music kim model lab grade record pillows united burns biology
- 0.10684 kevin shirley neil baby sophie chang wedding andre drugs hawkins sword dance changnesia fat lukka cheers dog business documentary duquesne
- 0.07619 blade duh batman street slater schmitty drunk carnival laughing amber seacrest michelle booty jewish calling uncle pain stepdaughter circle tool