The Library Basement
Reading under ground

Fun with Subtitles

The Natural Language Processing course by Dan Jurafsky and Chris Manning has started. The class consists of readings, video lectures, problems, and code examples. I have been working through the video lectures and was pleased to see that each one has English subtitles. This being a natural language processing course, I felt obliged to process these subtitles.

So I downloaded each subtitle file, concatenated them, normalized the text, tokenized it, and removed stopwords. I then sorted the resulting words by frequency of occurrence. Here are the top 50.

word - 113
we're - 107
one - 86
two - 78
like - 76
words - 75
distance - 65
let's - 57
gonna - 56
it's - 56
we'll - 54
that's - 49
string - 48
there's - 47
look - 46
end - 40
sentence - 40
example - 39
things - 39
inaudible - 38
might - 36
use - 36
going - 35
cost - 34
here's - 34
capital - 31
kind - 31
match - 31
algorithm - 30
see - 30
alignment - 29
could - 29
get - 29
text - 28
three - 28
e - 27
n - 27
regular - 27
different - 26
processing - 26
strings - 26
period - 25
case - 24
character - 24
language - 24
little - 24
characters - 23
means - 23
sound - 23
us - 23
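
For anyone who wants to try something similar, the steps above (concatenate, normalize, tokenize, remove stopwords, count) can be sketched roughly as follows. This is not the exact script behind the list: it uses only the Python standard library, and the `top_words` function, the `TOKEN_RE` pattern, and the tiny `STOPWORDS` set are assumptions made for illustration. A real run would want a fuller stopword list, such as the one that ships with NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used for
# the results above); a real run would use a much fuller one.
STOPWORDS = {
    "a", "an", "and", "at", "i", "in", "is", "it",
    "of", "so", "that", "the", "this", "to", "we",
}

# Lowercase letters, with apostrophes allowed inside a word so that
# contractions like "we're" stay whole. Digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def top_words(texts, n=50, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens across all texts."""
    combined = " ".join(texts).lower()          # concatenate, normalize case
    combined = combined.replace("\u2019", "'")  # normalize curly apostrophes
    tokens = TOKEN_RE.findall(combined)         # tokenize
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    # Stand-in strings; in practice each item would be the text of one
    # subtitle file.
    sample = [
        "So we're going to look at the word distance.",
        "The word E and the word N; we're done.",
    ]
    for word, count in top_words(sample, n=5):
        print(f"{word} - {count}")
```

Note that the token pattern keeps single letters, so a spoken "e" or "n" is counted as a word of its own.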

At first I was thrown off by the presence of "e" and "n" in the list, thinking I had a bug in my tokenizer. But it turns out that the instructors say many individual letters in the course of their discussions. Comments or email with feedback are appreciated.

Published:
Category: language Tags: nltk
