For the last couple of days, I’ve been playing around with the named entity recognition (NER) package from the Stanford Natural Language Processing Group. This command line tool automatically labels personal, geographical and organizational names in a text. I’ve found NER software useful for getting a sense of the key actors and locations in primary and secondary texts before I read them.
William J Turkel has already written an excellent guide on how to install the recognizer and generate frequency lists of named entities, so I’m not going to rewrite his walk-through. Below you’ll find my modified version of Turkel’s guide. It differs in two respects: First, it’s entirely CLI-based, so you can just paste the script into a terminal without having to open a text editor. Second, it cleans up a trailing comma problem in some geographical names.
Here’s the script. It assumes the recognizer’s tagged output is saved as FILENAME_ner.txt; replace all occurrences of FILENAME with the base name of your file:
sed 's/\/O / /g' < FILENAME_ner.txt > FILENAME_ner_clean.txt
alias egrepmatch='egrep --color -f pattr'
echo "[[:alpha:]]*/PERSON" > pattr
echo "(([[:alnum:]]|\.)+/ORGANIZATION([[:space:]]|$))+" > orgpattr
echo "(([[:alnum:]]|\.)+/LOCATION[[:space:]](,[[:space:]])?)+" > locpattr
echo "(([[:alpha:]]|\.)*/PERSON([[:space:]]|$))+" > personpattr
egrepmatch *clean.txt
echo "([[:alpha:]]|\.)*/PERSON" > pattr
egrepmatch *clean.txt
echo "([[:alpha:]]|\.)*/PERSON([[:space:]]|$)" > pattr
egrepmatch *clean.txt
egrep -o -f personpattr FILENAME_ner_clean.txt > FILENAME_ner_pers.txt
cat FILENAME_ner_pers.txt | sed 's/\/PERSON//g' | sort | uniq -c | sort -nr > FILENAME_ner_pers_freq.txt
egrep -o -f orgpattr FILENAME_ner_clean.txt > FILENAME_ner_org.txt
cat FILENAME_ner_org.txt | sed 's/\/ORGANIZATION//g' | sort | uniq -c | sort -nr > FILENAME_ner_org_freq.txt
egrep -o -f locpattr FILENAME_ner_clean.txt > FILENAME_ner_loc.txt
sed -i 's/ , /\n/g' FILENAME_ner_loc.txt
sed -i '/^$/d' FILENAME_ner_loc.txt
cat FILENAME_ner_loc.txt | sed 's/\/LOCATION//g; s/[[:space:]]*$//' | sort | uniq -c | sort -nr > FILENAME_ner_loc_freq.txt
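If replacing FILENAME by hand gets tedious, the extract-and-count steps could be wrapped in a small shell function. The following is only a sketch: ner_freq is a hypothetical name, and it uses one simplified pattern for all three tags instead of the separate pattern files above, so it does not include the comma handling for locations.

```shell
# Hypothetical helper (not part of Turkel's guide or Stanford NER).
# $1 = tag: PERSON, ORGANIZATION or LOCATION
# $2 = cleaned NER file, e.g. FILENAME_ner_clean.txt
ner_freq() {
  tag=$1
  infile=$2
  # Pull out runs of consecutive tokens carrying the tag, strip the tag
  # and any trailing whitespace, then count and rank the results.
  grep -E -o "(([[:alnum:]]|\.)+/${tag}([[:space:]]|\$))+" "$infile" |
    sed "s/\/${tag}//g; s/[[:space:]]*\$//" |
    sort | uniq -c | sort -nr
}

# Usage (FILENAME is a placeholder, as in the script above):
# ner_freq PERSON FILENAME_ner_clean.txt > FILENAME_ner_pers_freq.txt
```

Because the function writes to standard output, no intermediate files or pattern files are left behind.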
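If you are running this on a Mac, please note that OS X ships the BSD version of sed, which differs from GNU sed in two ways that matter here: -i expects a backup-suffix argument (an empty one, as in sed -i '', edits in place without keeping a backup), and \n in the replacement text is not turned into a newline. The two sed -i lines in the location step will therefore need adjusting.
If you’d like to delete the temporary files and the marked-up version of your document, run these commands: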
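One way to sidestep the sed differences is to do the comma splitting in awk, which behaves the same on both systems for this job. This is a sketch rather than a tested drop-in replacement: split_locations is a hypothetical name, and it reads standard input and writes standard output instead of editing the file in place.

```shell
# Hypothetical stand-in for the two "sed -i" lines in the location step.
# Splits comma-joined matches onto separate lines, trims trailing
# whitespace and drops anything that ends up empty.
split_locations() {
  awk '{
    n = split($0, parts, " , ")
    for (i = 1; i <= n; i++) {
      sub(/[ \t]+$/, "", parts[i])
      if (parts[i] != "") print parts[i]
    }
  }'
}

# Usage (FILENAME is a placeholder, as in the script above):
# split_locations < FILENAME_ner_loc.txt > FILENAME_ner_loc.tmp
# mv FILENAME_ner_loc.tmp FILENAME_ner_loc.txt
```

Writing to a temporary file and renaming it avoids the -i flag altogether.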
rm *_ner_clean.txt
rm *_ner_org.txt
rm *_ner_loc.txt
rm *_ner_pers.txt
rm *pattr
[Edit: 2013-10-18] I haven’t used the tool in any serious or systematic way yet, but if I do, it might be useful to write something that consolidates common location name variations (e.g. US, USA, United States) and recognizes and deletes common demonyms that Stanford NER sometimes reads as locations (e.g. Russian).
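A first pass at that idea might look something like the sketch below. Everything in it is an assumption for illustration: normalize_locations is a hypothetical name, and the variant table and demonym list contain only the examples mentioned above, so a real version would need much longer lists.

```shell
# Hypothetical helper: reads location names (one per line, tags already
# removed) and writes normalized names to standard output.
normalize_locations() {
  awk '
    BEGIN {
      # Illustrative variant table: variant -> canonical name.
      canon["US"]  = "United States"
      canon["USA"] = "United States"
      # Illustrative list of demonyms to discard.
      drop["Russian"] = 1
    }
    {
      sub(/[ \t]+$/, "")                  # trim trailing whitespace
      if ($0 == "" || ($0 in drop)) next  # skip blanks and demonyms
      print (($0 in canon) ? canon[$0] : $0)
    }'
}

# Usage (FILENAME is a placeholder, as in the script above):
# sed 's/\/LOCATION//g' FILENAME_ner_loc.txt | normalize_locations |
#   sort | uniq -c | sort -nr > FILENAME_ner_loc_freq.txt
```

Normalizing before the sort and uniq -c step means the variants are counted together instead of appearing as separate rows in the frequency list.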