Stop SOPA / PIPA

You may have noticed many websites across the Internet are speaking out against SOPA and PIPA (Stop Online Piracy Act and Protect IP Act, respectively). While online piracy is a huge problem and not something to be taken lightly, these pieces of legislation are damaging to the way the free Internet operates. The proposed legislation offers a shoot-first, ask-questions-later approach to shutting down unfavorable sites. Unfortunately, those with their fingers on the trigger (movie studios, etc.) have a long and checkered history of abusing their existing powers, and giving them an even blunter tool to work with would be most unwise.

For a general summary, you can watch the rap below, or check out this great infographic.

 

Ready to act? Visit http://americancensorship.org/

Ruby Concurrency

Over the last few months I’ve been experimenting with and testing various methods for using concurrency within a Rails application for improved performance, specifically when handling large amounts of data. The app I’ve been testing with is, for all intents and purposes, simply a web scraper. I went through several designs of the underlying system and settled on a final design that I’m really happy with. The code doesn’t entirely work (it’s a database problem), but the theory is solid and the performance is amazing.

My initial inspiration was delayed_job, since I was familiar with it, knew it would work, and knew how I would implement that style. But I ran into scaling problems even on my iMac: just a few threads would hammer the database hard enough with reads and writes that things would get crazy within a few minutes. I then decided it might work to use an array / queue to cache things that needed to be written to the database and, well, don’t do that. The real problem was that the individual scrapers relied on the database both for getting a job and for noting it was finished. This also led to a lot of duplicated logic, so the scraper and controller mechanisms had to be tightly coupled, and they kept causing deadlocks and collisions.

I didn’t like that, not at all. I decided anything that constantly polled a database would never work at scale, and decided I’d fix the problem for real: with messaging. In my current implementation, I have a Padrino app that handles the web interface and could expose an API to something else if you really wanted it to. On the backend I have two rake tasks, only one of which connects to the database.

When a job is created in the system, it sends a message to the scraper queue to process the initial URL. It also does some things like reading robots.txt so that we don’t poll too hard or hit URLs we shouldn’t. I mean, just because this is a toy app doesn’t mean we should be sloppy ;-).
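To make the robots.txt step concrete, here is a minimal sketch of that kind of check. It is not the code from the app: the method names `disallowed_prefixes` and `allowed?` are made up for illustration, and the parsing is deliberately simplified (it only honors `Disallow` lines under `User-agent: *`, and ignores `Allow` rules, wildcards, and crawl delays).

```ruby
require "uri"

# Collect the Disallow path prefixes that apply to every crawler
# ("User-agent: *"). Simplified on purpose: assumes one User-agent line per
# group and ignores Allow rules, wildcards, and Crawl-delay.
def disallowed_prefixes(robots_txt)
  applies = false
  robots_txt.each_line.with_object([]) do |line, prefixes|
    # Drop comments, then split "Field: value" into its two parts.
    field, value = line.sub(/#.*/, "").strip.split(/\s*:\s*/, 2)
    next if field.nil? || value.nil?

    case field.downcase
    when "user-agent" then applies = (value == "*")
    when "disallow"   then prefixes << value if applies && !value.empty?
    end
  end
end

# True when the URL's path does not start with any disallowed prefix.
def allowed?(url, prefixes)
  path = URI.parse(url).path
  path = "/" if path.empty?
  prefixes.none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical robots.txt content, used only for this example.
robots = <<~ROBOTS
  User-agent: *
  Disallow: /private/

  User-agent: otherbot
  Disallow: /
ROBOTS

prefixes = disallowed_prefixes(robots)
```

With the sample file above, `allowed?("http://example.com/private/x", prefixes)` is false and `allowed?("http://example.com/about", prefixes)` is true, so the job would only queue the second URL.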

This rake task listens to the scraper queue. It manages a pool of 20 threads (in production that would probably be a lot higher) and simply processes requests as they come through using MetaInspector. Once it has fetched the data (which will almost always mean more time waiting on the remote server than anything else), it packages it all up into a nice neat JSON blob (I’ll probably switch to MessagePack in the future) and queues it for the manager process.
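The shape of that scraper task can be sketched with nothing but Ruby’s standard library. This is an illustration under stated assumptions, not the code from the repository: an in-process `Queue` stands in for the real message queue, `FETCH_PAGE` is a hypothetical stub standing in for the MetaInspector fetch so the example runs without network access, and `start_scraper_pool` is a made-up name.

```ruby
require "json"

# Hypothetical stand-in for the real page fetch (the post uses MetaInspector).
# It returns canned data so the sketch runs offline.
FETCH_PAGE = lambda do |url|
  { "url" => url, "title" => "Title for #{url}", "links" => [] }
end

# Start `size` worker threads. Each one pops URLs from `scrape_queue`,
# fetches the page, and pushes the result onto `write_queue` as a JSON string.
def start_scraper_pool(scrape_queue, write_queue, size: 20, fetcher: FETCH_PAGE)
  Array.new(size) do
    Thread.new do
      # Queue#pop blocks until work arrives and returns nil once the
      # queue has been closed and drained, which ends the loop.
      while (url = scrape_queue.pop)
        write_queue << JSON.generate(fetcher.call(url))
      end
    end
  end
end

scrape_queue = Queue.new # stand-in for the scraper message queue
write_queue  = Queue.new # stand-in for the queue the manager listens to
workers = start_scraper_pool(scrape_queue, write_queue, size: 4)

%w[http://example.com/a http://example.com/b http://example.com/c].each do |url|
  scrape_queue << url
end

scrape_queue.close   # no more work: lets the workers exit their loops
workers.each(&:join)

results = []
results << JSON.parse(write_queue.pop) until write_queue.empty?
```

Because the workers spend most of their time blocked on I/O, plain threads are enough here; the pool size is simply how many fetches can be in flight at once.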

The manager process is where the app’s intelligence lies. It listens to the write queue for completed scrapes, writes them to the database, figures out the next page to scrape, and queues it. As you may notice, there is very little code in either task, and even the models are pretty skinny. By relying on a messaging queue I’ve managed to greatly simplify the architecture of the app while still obeying things such as priority and scrape rates. I wasn’t able to benchmark this mechanism accurately because of system problems I didn’t have time to troubleshoot, but even on an SQLite database it sustained writes fast enough that the database logs spinning past my screen were making me dizzy.
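A rough sketch of the manager’s loop, again using only the standard library and again not the actual code: the `Manager` class and its methods are hypothetical, a Hash stands in for the database, and "figuring out the next page" is reduced to queueing every link that has not been seen yet. Priority and scrape-rate handling are left out.

```ruby
require "json"
require "set"

# Hypothetical manager: the only component that touches the "database"
# (an in-memory Hash here; the real app writes to SQLite).
class Manager
  attr_reader :pages

  def initialize(scrape_queue)
    @scrape_queue = scrape_queue
    @pages = {}      # url => scraped page data
    @seen  = Set.new # every URL that has ever been queued
  end

  # Queue a URL for scraping unless it has already been queued.
  # Set#add? returns nil when the element was already present.
  def enqueue(url)
    @scrape_queue << url if @seen.add?(url)
  end

  # Handle one completed scrape: store it, then queue any unseen links.
  def handle(message)
    page = JSON.parse(message)
    @pages[page["url"]] = page
    page.fetch("links", []).each { |link| enqueue(link) }
  end
end

scrape_queue = Queue.new
manager = Manager.new(scrape_queue)
manager.enqueue("http://example.com/")

# A completed scrape as it might arrive on the write queue (made-up data).
scraped = {
  "url"   => "http://example.com/",
  "title" => "Home",
  "links" => ["http://example.com/", "http://example.com/about"]
}
manager.handle(JSON.generate(scraped))
```

Since the scrapers never read or write job state themselves, all the coordination logic lives in this one place, which is what removes the polling and the duplicated logic from the earlier designs.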

If you want to try it out, check out the code on my BitBucket. Also, feel free to share your thoughts, comments and suggestions either in the comments or directly.

Unicorn Performance Tips for Rails Developers

If you’re using Unicorn, first off good job. If you’re not, bookmark this post and fix that. I can wait. Next, you’ll want to make your app as fast as possible. One of the biggest penalties you’ll pay in a Rails app is Garbage Collection. MRI’s GC is just awful in terms of performance. But [...]

Continue reading...

Christmas 2011

As the year comes to a close, I wanted to share photos from this year’s Christmas celebrations. These photos are also all up on Facebook, but for those of you who don’t use it, here you go:

Continue reading...

More Content

I just threw this new blog together while I had some time off because Posterous’ performance drives me crazy. All the old stuff is still up at the old blog though.

Continue reading...