Charles Hooper

Thoughts and projects from an infrastructure engineer

Twitter vs Erotica: Your Corpora’s Source Matters


As a result of my now-defunct project, BookSuggest, I’ve built a fairly large corpus seeded entirely from Twitter. This corpus weighs in at:

  • 16,680,000 documents (tweets)
  • 1,970,165 unique (stemmed) words
    (Red flag: the Oxford Dictionary suggests there are an estimated 250,000 words in the English language. This discrepancy is the result of my failure to filter tweets by language, the inclusion of usernames in the count, and the fact that people “make words up.” Also, “haha” becomes one word while “hahahaha” becomes another.)
  • 83,758,872 words total.

When I look at these numbers, I often think about how the source documents a corpus (and its histogram) is derived from affect the distribution of its term frequencies. The most obvious example is language: a French corpus will never look much like an English one. A less obvious example is subject matter. For example, a corpus derived from English literature will have a different term distribution than one derived from an English medical journal. Common terms will have similar frequencies, but there will be a bias towards domain-specific terms.

To demonstrate, I scraped the “Erotica” section of textfiles.com and built a corpus based on the data there. The resulting corpus is composed of:

  • 4,337 documents
  • 50,709 unique (stemmed) words
  • 10,413,715 words total.

Notes on Term Counting

  • Words shorter than 4 characters were discarded
  • The remaining words were stemmed using the Porter stemming algorithm (a sketch of this pipeline follows the list)
  • There may be slight differences in how words were counted between the two corpora, owing to minor programming differences
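
For reference, here is a minimal sketch of that counting pipeline in Python. It assumes NLTK’s PorterStemmer is available; the actual BookSuggest code isn’t shown here, so the function and variable names are illustrative:

```python
from collections import Counter

from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed


def count_terms(documents):
    """Count stemmed terms across an iterable of documents (strings),
    discarding words shorter than 4 characters."""
    stemmer = PorterStemmer()
    counts = Counter()
    for doc in documents:
        for word in doc.lower().split():
            if len(word) < 4:
                continue
            counts[stemmer.stem(word)] += 1
    return counts


# Example: top terms with their relative frequencies
counts = count_terms(["Follow me and I will follow back", "hahahaha that was good"])
total = sum(counts.values())
for term, count in counts.most_common(20):
    print(f"{term}: {count / total:.2%}")
```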

The Data

Finally, here are the term frequencies with the obvious domain-specific terms in bold:

Corpus Seeded from Twitter

![Counts of Top 20 Terms from Twitter Corpus][6]

  1. that (0.84%)
  2. just (0.70%)
  3. with (0.69%)
  4. thi (0.68%)
  5. have (0.65%)
  6. your (0.61%)
  7. like (0.56%)
  8. love (0.54%)
  9. **follow** (0.45%)
  10. what (0.44%)
  11. from (0.36%)
  12. haha (0.35%)
  13. good (0.34%)
  14. para (0.34%)
  15. will (0.32%)
  16. when (0.30%)
  17. know (0.30%)
  18. want (0.30%)
  19. about (0.30%)
  20. make (0.30%)

Corpus Seeded from Erotica

![Counts of Top 20 Terms from Erotica Corpus][7]

  1. that (1.83%)
  2. with (1.42%)
  3. into (0.76%)
  4. down (0.70%)
  5. then (0.66%)
  6. back (0.66%)
  7. from (0.65%)
  8. thi (0.65%)
  9. hand (0.64%)
  10. were (0.59%)
  11. look (0.58%)
  12. have (0.58%)
  13. **cock** (0.57%)
  14. like (0.57%)
  15. over (0.57%)
  16. thei (0.56%)
  17. your (0.56%)
  18. what (0.55%)
  19. said (0.55%)
  20. could (0.54%)

You’ll note that the Twitter corpus has a heavy bias towards the term “follow,” whereas the Erotica corpus shows an overwhelming use of the term “cock.” (Writers: use synonyms.)

[6]: http://chart.apis.google.com/chart?chxl=0: that just with thi have your like love follow what from haha good para will when know want about make&chxr=0,0,703297&chxt=x&chbh=a,4,10&chs=600x200&cht=bvg&chco=4D89F9&chds=0,703297&chd=t:703297,582988,581346,573197,547218,513823,467673,455264,378187,367112,302254,296974,286671,283887,272176,254419,252303,251673,251325,248572&chtt=Counts of Top 20 Terms from Twitter Corpus
[7]: http://chart.apis.google.com/chart?chxl=0: that with into down then back from thi hand were look have cock like over thei your what said could&chxr=0,0,190543&chxt=x&chbh=a,4,10&chs=600x200&cht=bvg&chco=F889F9&chds=0,190543&chd=t:190543,148204,78688,72452,69045,68642,68164,67998,66826,61787,60236,60179,59622,59357,58856,58760,57851,57670,57348,55739&chtt=Counts of Top 20 Terms from Erotica Corpus

Practical Reasons Why This Is Important

This is important because if I were to build a domain-specific search engine, I would be better off seeding my corpus with domain-specific content. If I didn’t, my relevance (tf-idf) scores would be inaccurate. For example, an Erotica-specific search engine should decrease the weight of the term “cock” simply because it has a very high document frequency and is therefore less significant. Meanwhile, a Twitter-specific search engine should discount the weight of “follow.”
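
To make that concrete, here is a small sketch of the standard tf-idf weighting (tf × log(N/df)). This isn’t necessarily the exact scoring BookSuggest used, and the document-frequency numbers below are hypothetical, but it shows how a term that appears in nearly every document gets discounted:

```python
import math


def tf_idf(term_count, doc_length, doc_freq, num_docs):
    """Standard tf-idf: term frequency scaled by inverse document frequency."""
    tf = term_count / doc_length          # how often the term occurs in this document
    idf = math.log(num_docs / doc_freq)   # shrinks towards zero as doc_freq approaches num_docs
    return tf * idf


# Hypothetical numbers for a 4,337-document corpus:
# a term found in 4,000 documents is weighted near zero...
print(tf_idf(term_count=5, doc_length=1000, doc_freq=4000, num_docs=4337))  # ~0.0004
# ...while an equally frequent term found in only 40 documents scores roughly 60x higher.
print(tf_idf(term_count=5, doc_length=1000, doc_freq=40, num_docs=4337))    # ~0.0234
```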

Conclusion

To conclude, the subject matter of a document set biases that set’s histogram of term frequencies towards domain-specific terms. If you are calculating relevance for a particular document set, you should use a corpus derived from that same document set. In other words, if you can, try not to re-use your corpora!
