Charles Hooper

Thoughts and projects from a hacker and engineer

Drinking From the Gardenhose, Cont.

Previously, I blogged about a storage/database bottleneck that caused dropped connections while I was consuming Twitter’s streaming API. I’m happy to report that the switch from SQLite to MySQL resulted in an immediate increase in throughput. I went from processing ~5 updates/second to just over 11 updates/second, more than doubling my capacity.

CPU usage improved considerably as well. Previously, I was pegging my CPU at 100%. Since the switch to MySQL, which runs on another (similarly spec’d) host, I now use less than 5% CPU on the stream listening/processing host and less than 10% CPU on the MySQL host with the same dataset as before. With that much headroom, I believe that parallelizing my code further would let me take greater advantage of my resources and achieve higher throughput.

Resolving this issue has allowed me to turn my focus back to what I originally started this project for: building a corpus large enough to do accurate tf-idf scoring.
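For anyone unfamiliar with tf-idf, here is a minimal sketch of the idea in Python. It is not the code from this project: the `tf_idf` function, the whitespace tokenization, and the sample tweets are made up for illustration, and it assumes the common variant where tf is the term count divided by document length and idf is log(N / df).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every term in every document.

    docs is a list of token lists. Returns one {term: score} dict per
    document, using tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical sample data standing in for collected status updates.
tweets = [
    "coffee time again".split(),
    "coffee and code".split(),
    "time to ship code".split(),
]
scores = tf_idf(tweets)
```

A term that shows up in only one update scores higher than one that shows up in several, and the idf half of the score only becomes meaningful once the document frequencies come from a large sample, which is why corpus size matters here.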
