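TextBlob is a fun Python library for processing text. Among other things, it can tag every word in a block of text with its part of speech, which makes it easy to slice the text up in neat ways.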

To use it, all you need is a computer with Python on it. I’m using Linux Mint with Python 2.7.3. Installation of TextBlob is covered pretty well on Steven Loria’s TextBlob page.

To begin I open my Python interpreter and import TextBlob.
>>> from textblob import TextBlob

Then I load my text. I’m using a chunk of The Brothers Karamazov.
>>> with open(r"/home/sean/Documents/text-blobs/the-brothers-karamazov/brothers-044") as infile:
...   data = infile.read()
...   myblob = TextBlob(data)
...

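Now I have a TextBlob object named “myblob” and I can do fun stuff with it. Its tags property gives me a list of (word, tag) pairs, where the tag is the word’s part of speech. For instance, I can loop through those pairs and pull out all the adjectives, which are tagged “JJ”.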
>>> for value,key in sorted(set(myblob.tags)):
...   if key == "JJ":
...     print key,value
...
JJ back
JJ back-way
JJ black
JJ certain
JJ civil
JJ clear
--and so on...

By wrapping myblob.tags in Python’s built-in set() and sorted() functions, I get output that contains no duplicates and is sorted alphabetically (with capitalized words ahead of lowercase ones).
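
Here is a minimal sketch of what set() and sorted() are doing, using a short hand-written list in place of myblob.tags. The name sample_tags and its contents are made up for illustration; real tagger output will differ. I've written the print call so it should run on both Python 2 and Python 3.

```python
# Hypothetical sample standing in for myblob.tags: a list of (word, tag) pairs.
sample_tags = [
    ("black", "JJ"),
    ("back", "JJ"),
    ("black", "JJ"),   # the same pair again, as happens in real text
    ("door", "NN"),
    ("certain", "JJ"),
]

# set() drops the repeated pair; sorted() orders the tuples by word, then by tag.
unique_sorted = sorted(set(sample_tags))

for word, tag in unique_sorted:
    if tag == "JJ":
        # This form of print works on Python 2 and Python 3 alike.
        print("%s %s" % (tag, word))
```

The duplicate "black" pair is printed only once, and the adjectives come out in alphabetical order.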

But suppose I only want to see the adjectives that are five characters long. Then I use Python’s built-in len() function. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...   if key == "JJ" and len(value) == 5:
...     print key,value
...
JJ black
JJ civil
JJ clear
JJ equal
JJ first
--and so on...

I can filter for verbs, too; in fact, any part of speech listed in the Penn Treebank II tag set will work.
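
Since the Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ) all begin with "VB", one way to grab every verb form at once is to test the start of the tag. This is only a sketch: sample_tags is a made-up stand-in for myblob.tags, and words_with_tag_prefix is a helper name I invented here, not part of TextBlob.

```python
# Hypothetical sample standing in for myblob.tags.
sample_tags = [
    ("act", "VB"),
    ("behaving", "VBG"),
    ("accustomed", "VBN"),
    ("black", "JJ"),
    ("said", "VBD"),
    ("April", "NNP"),
]


def words_with_tag_prefix(tags, prefix):
    """Return the sorted, de-duplicated words whose tag starts with prefix."""
    return sorted(set(word for word, tag in tags if tag.startswith(prefix)))


verbs = words_with_tag_prefix(sample_tags, "VB")
print(verbs)
```

Passing "NN" instead of "VB" would collect the noun tags in the same way.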

The Penn Treebank code for gerunds and present participles is VBG. But sometimes I want all the words that end in “ing”, even the ones that aren’t tagged VBG. In that case, I ignore the tag and compare the last three characters of the word instead. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...   if value[-3:] == "ing":
...     print key,value
...
VBG according
NN anything
VBG behaving
VBG bringing
--and so on...
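
The slice value[-3:] works fine, but the string method endswith() says the same thing a little more directly. Here is a small sketch, again with a made-up sample_tags list standing in for myblob.tags:

```python
# Hypothetical sample standing in for myblob.tags.
sample_tags = [
    ("bringing", "VBG"),
    ("anything", "NN"),
    ("black", "JJ"),
    ("according", "VBG"),
    ("anything", "NN"),   # duplicate pair
]

# Keep every pair whose word ends in "ing", whatever its tag.
ing_words = [
    (tag, word)
    for word, tag in sorted(set(sample_tags))
    if word.endswith("ing")
]

for tag, word in ing_words:
    print("%s %s" % (tag, word))
```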

Using Python’s handy string methods I can easily test for a word that begins with a particular letter, too. Here I’ll throw in the lower() method to match regardless of case:
>>> for value,key in sorted(set(myblob.tags)):
...   if value[0].lower() == "a":
...     print key,value
...
DT A
IN Among
NNP April
IN At
DT a
IN about
VBG according
VBN accustomed
--and so on...
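
Capitalized words come first in that output because sorted() compares strings character by character, and uppercase letters sort ahead of lowercase ones. If that ordering gets in the way, a key function can fold case for the comparison. A minimal sketch, with a made-up sample_tags list in place of myblob.tags:

```python
# Hypothetical sample standing in for myblob.tags.
sample_tags = [
    ("a", "DT"),
    ("At", "IN"),
    ("April", "NNP"),
    ("about", "IN"),
    ("A", "DT"),
    ("black", "JJ"),
]

# Keep the pairs whose word starts with "a" in either case.
a_words = [pair for pair in set(sample_tags) if pair[0].lower().startswith("a")]

# Default ordering: all capitalized words before all lowercase words.
default_order = sorted(a_words)

# Case-folded ordering: compare lowercase words first, then use the
# original pair as a tie-breaker so the result is deterministic.
folded_order = sorted(a_words, key=lambda pair: (pair[0].lower(), pair))

for word, tag in folded_order:
    print("%s %s" % (tag, word))
```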

But what if I want to match all the words that start with vowels? Well, I think I’m going to need a regular expression to do that. (I love regular expressions.)

First I’ll import Python’s regex library and then create my regular expression.
>>> import re
>>> reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

As you can see, I’m looking for any word that begins (“^”) with a vowel (“[aeiou]”), followed by zero or more (“*”) word characters (“\w”, which covers letters, digits, and the underscore), and I want to ignore case. Then I just use another for loop, only this time with my new regex. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...   if reg.match(value):
...     print key,value
...
DT A
IN Among
NNP April
IN At
DT Every
IN If
IN In
PRP It
IN Of
DT a
IN about
--and so on...
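
To see what the pattern does on its own, here is a small sketch that runs it against a few hand-picked strings. The candidates list is made up for illustration. I've written the pattern as a raw string (the r prefix) so the backslash in \w reaches the regex engine untouched.

```python
import re

# Same pattern as in the post, written as a raw string.
reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

# Hypothetical test words, not taken from the novel.
candidates = ["April", "every", "If", "black", "_odd", "x-ray", "a"]

# match() only succeeds when the pattern fits at the start of the string.
matched = [word for word in candidates if reg.match(word)]
print(matched)

# \w* stops at the first non-word character, so only the leading
# run of word characters is captured.
prefix = reg.match("ice-cream").group()
print(prefix)
```

Because match() already anchors at the start of the string, the single-letter word "a" passes (zero trailing word characters are allowed), while "_odd" and "x-ray" do not, since neither starts with a vowel.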

I can combine the two tests as well. Here are all the base form verbs that start with a vowel:
>>> for value,key in sorted(set(myblob.tags)):
...   if key == "VB" and reg.match(value):
...     print key,value
...
VB act
VB entertain
VB estrange
VB in
VB into

Pretty cool, right?
