Over the last few months I’ve been experimenting with various methods for using concurrency within a Rails application to improve performance, specifically when handling large amounts of data. The app I’ve been testing with is, for all intents and purposes, simply a web scraper. I went through several designs of the underlying system and settled on a final design that I’m really happy with. The code doesn’t entirely work (it’s a database problem), but the theory is solid and the performance is amazing.
My initial inspiration was delayed_job, since I was familiar with it, knew it would work, and knew how I would implement that style. I ran into scaling problems even on my iMac, though: just a few threads would hammer the database hard enough with reads and writes that things would get crazy within a few minutes. I then decided it might work to use an array / queue to cache things that needed to be written to the database and, well, don’t do that. The real problem was that the individual scrapers relied on the database both for getting a job and for noting that it was finished. This also led to a lot of duplicated logic, so the scraper and controller mechanisms had to be tightly coupled, and they kept causing deadlocks and collisions.
I didn’t like that, not at all. I concluded that anything that constantly polled a database would never work at scale, and decided I’d fix the problem for real, with messaging. In my current implementation, I have a Padrino app that handles the web interface and could provide an API to something else if you really wanted it to. On the backend I have two rake tasks, only one of which connects to the database.
When a job is created in the system, it sends a message to the scraper queue to process the initial URL. It also does a few housekeeping things, like reading robots.txt, so that we don’t poll too hard or hit URLs we shouldn’t. I mean, just because this is a toy app doesn’t mean we should be sloppy ;-).
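To give a feel for the robots.txt step, here is a minimal sketch of the kind of check I mean. It is not the code from the app: `RobotsRules` is a hypothetical helper, and it deliberately handles only the `User-agent: *` group, plain-prefix `Disallow` rules and `Crawl-delay`. Real robots.txt files also use `Allow`, wildcards and per-bot groups, so treat this as an illustration rather than a complete parser.

```ruby
# Minimal, deliberately incomplete robots.txt reader (hypothetical helper).
# Assumption: we only care about the "User-agent: *" group.
class RobotsRules
  attr_reader :disallowed, :crawl_delay

  def initialize(text)
    @disallowed = []   # path prefixes we must not fetch
    @crawl_delay = nil # seconds to wait between requests, if stated
    applies = false    # true while we are inside the "*" group

    text.each_line do |line|
      # Drop comments, then split "Field: value" at the first colon.
      field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
      next if field.nil? || value.nil?

      case field.downcase
      when "user-agent"  then applies = (value == "*")
      when "disallow"    then @disallowed << value if applies && !value.empty?
      when "crawl-delay" then @crawl_delay = value.to_f if applies
      end
    end
  end

  # A path is allowed unless it starts with one of the disallowed prefixes.
  def allowed?(path)
    @disallowed.none? { |prefix| path.start_with?(prefix) }
  end
end
```

The job-creation code could consult `allowed?` before queueing a URL and use `crawl_delay` (when present) to decide how often to send messages for a given host.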
The scraper rake task listens to the scraper queue. It manages a thread pool of 20 threads (in production that would probably be a lot higher) and simply processes requests as they come through using MetaInspector. Once it has fetched the data (most of that time is spent waiting on the remote server) it packages it all up into a nice neat JSON blob (I’ll probably switch to MessagePack in the future) and queues it for the manager process.
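The shape of that worker is roughly the following. This is a sketch under assumptions, not the actual rake task: `ScraperPool` is a made-up name, Ruby’s in-process `Queue` stands in for the message broker, and the `fetcher` callable stands in for the MetaInspector call so the example runs without touching the network.

```ruby
require "json"

# Hypothetical stand-in for the scraper task: a fixed pool of threads pulls
# URLs from one queue and pushes JSON blobs onto another.
class ScraperPool
  STOP = :stop # sentinel that tells a worker thread to exit

  attr_reader :results

  def initialize(size:, fetcher:)
    @size = size
    @fetcher = fetcher   # callable: url -> Hash of page data (assumed shape)
    @jobs = Queue.new    # plays the role of the scraper queue
    @results = Queue.new # plays the role of the write queue
  end

  def enqueue(url)
    @jobs << url
  end

  # Start the workers, let them drain everything queued so far, then stop.
  def run
    workers = Array.new(@size) do
      Thread.new do
        while (url = @jobs.pop) != STOP
          page = begin
            @fetcher.call(url)
          rescue StandardError => e
            { error: e.message } # report the failure instead of killing the thread
          end
          @results << JSON.generate(url: url, page: page)
        end
      end
    end
    @size.times { @jobs << STOP } # one sentinel per worker
    workers.each(&:join)
  end
end

# Usage with a fake fetcher (no network access):
fake_fetcher = ->(url) { { title: "Title of #{url}", links: [] } }
pool = ScraperPool.new(size: 4, fetcher: fake_fetcher)
%w[http://example.com/a http://example.com/b].each { |url| pool.enqueue(url) }
pool.run
```

Because the threads spend nearly all their time blocked on I/O, a pool like this gets real concurrency even on MRI, which is why the thread count can be pushed well past the number of cores.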
The manager process is where the app’s intelligence lies. It listens to the write queue for completed scrapes, writes them to the database, figures out the next page to scrape and queues it. There is very little code in either task, and even the models are pretty skinny. By relying on a messaging queue I’ve managed to greatly simplify the architecture of the app while still respecting things such as priority and scrape rates. I wasn’t able to benchmark this mechanism accurately due to system problems I didn’t have time to troubleshoot, but it held up (using an SQLite database) with write performance fast enough that the database logs spinning past my screen were making me dizzy.
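In the same spirit, the manager’s core loop body can be sketched like this. Again, the names are hypothetical: `Manager` is not the class from the app, a plain Hash stands in for the database, and I’m assuming each message carries a `links` array under `page`, with “figure out the next page” simplified to “queue every link we haven’t seen yet”. Priority and rate limiting are left out.

```ruby
require "json"
require "set"

# Hypothetical manager: the only component that touches the "database".
class Manager
  def initialize(store:, scrape_queue:)
    @store = store               # Hash standing in for the database
    @scrape_queue = scrape_queue # where follow-up URLs are sent
    @seen = Set.new              # URLs already stored or queued
  end

  # Handle one JSON message taken from the write queue.
  def handle(message)
    result = JSON.parse(message)
    url = result["url"]

    @store[url] = result["page"] # the single write path
    @seen << url

    # Queue each newly discovered link exactly once.
    Array(result.dig("page", "links")).each do |link|
      @scrape_queue << link if @seen.add?(link)
    end
  end
end
```

Keeping every write in one process is what removes the deadlocks and collisions from the earlier designs: the scrapers never see the database, so there is nothing for them to contend over.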
If you want to try it out, check out the code on my BitBucket. Also, feel free to share your thoughts, comments and suggestions either in the comments or directly.