DNSWash Is Alive

Well, the brochure-ware part is, anyway. I wrote up some initial policies last night, sketched the UI for the community portion, and started coding up the database. Further updates and lots of policy information can be found over on the new website:

http://www.dnswasher.org

The Project – DNSWash

I saw the news: OpenDNS is, rightfully, going to start charging businesses for their service. Unfortunately, this puts a lot of people in a bit of a pickle. The pricing isn’t very transparent, most people haven’t budgeted for the change, and we just started a new year, so budgets won’t get redone for a while.

Well, I got to thinking about it and decided, “It can’t be that hard to implement DNS filtering. Surely I could write a little filtering DNS server, hook it up to a blacklist, and that should get people through for a while.” And sure enough, in about 20 lines of Ruby and about 15 minutes I had a fully functional recursive resolver with caching and filtering built right in.
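To give a feel for how little logic that takes, here is a minimal sketch of just the filtering and caching decision. It is not the actual DNSWash code: the class name FilteringResolver, the injected upstream callable and the 300-second TTL are all assumptions made for illustration, and a real server would still need a UDP listener wrapped around it.

```ruby
require 'set'

# Hypothetical sketch: answers a lookup from a blacklist, a cache, or an
# upstream resolver, in that order. Not the actual DNSWash implementation.
class FilteringResolver
  BLOCKED = :blocked

  # blacklist: domain names to block (their subdomains are blocked too)
  # upstream:  any callable that takes a name and returns an answer
  # ttl:       seconds a cached answer stays valid (assumed value)
  def initialize(blacklist, upstream, ttl: 300)
    @blacklist = Set.new(blacklist.map { |d| normalize(d) })
    @upstream  = upstream
    @ttl       = ttl
    @cache     = {}
  end

  def resolve(name, now: Time.now)
    name = normalize(name)
    return BLOCKED if blocked?(name)

    entry = @cache[name]
    return entry[:answer] if entry && entry[:expires] > now

    answer = @upstream.call(name)
    @cache[name] = { answer: answer, expires: now + @ttl }
    answer
  end

  # True if the name itself or any parent domain is on the blacklist.
  def blocked?(name)
    labels = normalize(name).split('.')
    labels.each_index.any? { |i| @blacklist.include?(labels[i..-1].join('.')) }
  end

  private

  # Lowercase and drop the trailing root dot so lookups compare cleanly.
  def normalize(name)
    name.to_s.downcase.chomp('.')
  end
end
```

For the upstream you could pass something like `->(name) { Resolv.getaddresses(name) }` from Ruby's standard resolv library. Injecting it keeps the filter logic testable without touching the network.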

My project for the weekend is to write a web app that manages various filter lists and exposes a basic HTTP API, so that I can release the whole thing as a working proof of concept and get some feedback. I don’t expect great performance from the way I plan to implement this, but it should be totally usable by the end of the week. I chose this option because I can throw a basic web app up on Heroku for free, which means the whole project can operate indefinitely at no ongoing cost to me.

But what I really want isn’t a set of programs that work together to handle this simple task. Ultimately, I’d like to have a great database of community-maintained and community-owned domain categorizations, with real process and policy behind its maintenance. Once the database is mature enough for production use, I think the best way to share it would be as one BIND RPZ (response policy zone) per category, and probably also DNSBL-style.
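As a rough idea of what a per-category RPZ export could look like, here is a small sketch that renders a list of domains as a zone file. The zone name, SOA contact, timers and serial are placeholder assumptions, and nothing here reflects a finished DNSWash format. The one RPZ convention it relies on is that a `CNAME .` record tells the resolver to answer NXDOMAIN for that name.

```ruby
# Hypothetical sketch: render one category's domain list as an RPZ zone.
# The zone name, SOA contact and timer values are placeholder assumptions.
def rpz_zone(category, domains, serial: 1)
  zone = "#{category}.rpz.example."
  lines = [
    "$TTL 300",
    "$ORIGIN #{zone}",
    "@ IN SOA localhost. hostmaster.localhost. (#{serial} 3600 600 86400 300)",
    "@ IN NS localhost."
  ]
  domains.map { |d| d.downcase.chomp('.') }.uniq.sort.each do |domain|
    # "CNAME ." is the RPZ action that makes the resolver answer NXDOMAIN.
    lines << "#{domain} CNAME ."
    # The wildcard entry covers subdomains of the same domain.
    lines << "*.#{domain} CNAME ."
  end
  lines.join("\n") + "\n"
end

puts rpz_zone('gambling', %w[casino.example poker.example])
```

Owner names are written without a trailing dot so that they stay relative to the zone origin, which is how RPZ triggers on a query name are expressed.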

Anyway, I’m going to put together a website for the project and start thinking about policies before I write the actual database app.

Ruby Concurrency

Over the last few months I’ve been experimenting with various methods of using concurrency within a Rails application to improve performance, specifically when handling large amounts of data. The app I’ve been testing with is, for all intents and purposes, simply a web scraper. I went through several designs of the underlying system and settled on a final design that I’m really happy with. The code doesn’t entirely work (it’s a database problem), but the theory is solid and the performance is simply amazing.

My initial inspiration was delayed_job, since I was familiar with it and knew how I would implement that style. But I ran into scaling problems even on my iMac: just a few threads would hammer the database hard enough with reads and writes that things got crazy within a few minutes. I then decided it might work to use an array / queue to cache things that needed to be written to the database and, well, don’t do that. The real problem was that the individual scrapers relied on the database both for getting a job and for noting that it was finished. This also led to a lot of duplicated logic, so the scraper and controller mechanisms had to be tightly coupled, and they kept causing deadlocks and collisions.

I didn’t like that, not at all. I decided anything that constantly polled a database would never work at scale, and decided I’d fix the problem for real – with messaging. In my current implementation, I have a Padrino app that handles the web interface and could provide an API out to something else if you really wanted it to. On the backend I have two rake tasks, only one of which connects to the database.

When a job is created in the system, it sends a message to the scraper queue to process the initial URL. It also does some things like read robots.txt so that we don’t poll too hard or hit URLs we shouldn’t. I mean, just because this is a toy app doesn’t mean we should be sloppy ;-).

The first rake task listens to the scraper queue. It manages a thread pool of 20 threads (in production that would probably be a lot higher) and simply processes requests as they come through using MetaInspector. Once it has fetched the data (which will almost always mean more time waiting on the remote server than anything else), it packages it all up into a nice neat JSON blob (I’ll probably switch to MessagePack in the future) and queues it for the manager process.
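The shape of that scraper task can be sketched with nothing but the standard library. This is an illustration under assumptions, not the code from the repository: Ruby's in-process Queue stands in for the real message broker, the `fetch` callable stands in for MetaInspector, and the `:stop` sentinel and method name are made up for the example.

```ruby
require 'json'

# Hypothetical sketch of the scraper side. Ruby's in-process Queue stands in
# for the real message broker, and `fetch` stands in for MetaInspector.
def start_scrapers(scrape_queue, write_queue, fetch, pool_size: 20)
  Array.new(pool_size) do
    Thread.new do
      # Queue#pop blocks until a message arrives; :stop ends the worker.
      while (url = scrape_queue.pop) != :stop
        begin
          page = fetch.call(url)
          write_queue << JSON.generate({ url: url, title: page[:title], links: page[:links] })
        rescue StandardError => e
          # Report the failure so the manager can decide what to do next.
          write_queue << JSON.generate({ url: url, error: e.message })
        end
      end
    end
  end
end

# Usage with a fake fetcher, so the sketch runs without network access.
scrape_queue = Queue.new
write_queue  = Queue.new
fake_fetch   = ->(url) { { title: "Page at #{url}", links: [] } }

workers = start_scrapers(scrape_queue, write_queue, fake_fetch, pool_size: 4)
scrape_queue << "http://example.com/"
4.times { scrape_queue << :stop }   # one sentinel per worker thread
workers.each(&:join)
```

Because the workers spend most of their time blocked on I/O, plain threads are enough here even under MRI's global lock. The workers never touch the database, which is the point of the design.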

The manager process is where the app’s intelligence lies. It listens to the write queue for completed scrapes, writes them to the database, figures out the next page to scrape and queues it. There is very little code in either task, and even the models are pretty skinny. By relying on a message queue I’ve managed to greatly simplify the architecture of the app while still obeying things such as priority and scrape rates. I wasn’t able to benchmark this mechanism accurately due to system problems I didn’t have time to troubleshoot, but it held up (using an SQLite database) with writes fast enough that the database logs spinning past my screen were making me dizzy.
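The manager loop can be sketched in the same spirit. Again this is a simplified stand-in rather than the real task: `store` is a hypothetical object with a `save(url, data)` method in place of the database models, the queues are in-process Queue objects, and the priority and rate handling mentioned above is left out.

```ruby
require 'json'

# Hypothetical sketch of the manager side. `store` is any object with a
# save(url, data) method; the real app writes to its database here.
def run_manager(write_queue, scrape_queue, store)
  seen = {}
  # This is the only place that touches the store, so writes are serialized.
  while (message = write_queue.pop) != :stop
    result = JSON.parse(message)
    store.save(result['url'], result)
    seen[result['url']] = true

    # Queue every link that has not been handled yet. Priority and rate
    # limiting would also live here, since only the manager sees everything.
    Array(result['links']).each do |link|
      next if seen[link]
      seen[link] = true
      scrape_queue << link
    end
  end
end
```

Keeping all database access in this single consumer is what removes the polling and lock contention described earlier: scrapers only ever talk to queues.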

If you want to try it out, check out the code on my BitBucket. Also, feel free to share your thoughts, comments and suggestions either in the comments or directly.

Unicorn Performance Tips for Rails Developers

If you’re using Unicorn, first off, good job. If you’re not, bookmark this post and fix that. I can wait.

Next, you’ll want to make your app as fast as possible. One of the biggest penalties you’ll pay in a Rails app is garbage collection. MRI’s GC is just awful in terms of performance. But you don’t need to have people waiting on it anymore! In exchange for somewhat higher CPU and memory usage, you can tell Unicorn to run garbage collection outside of the request/response cycle using its out-of-band GC middleware, Unicorn::OobGC. It’s really quite simple, and it works in any Rack application served by Unicorn. Just open config.ru and add this:

require 'unicorn/oob_gc'
use Unicorn::OobGC # enable the middleware; by default GC runs after every 5 requests

That’s all you have to do! There is some additional tuning available: if, for example, some routes are particularly memory intensive, you can have GC run more frequently than the default of every 5 requests. You can read the docs for more info.
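As a hedged example of that tuning, the config.ru fragment below runs GC after every 2 requests, and only for requests whose path matches a pattern. The `/reports` path is a made-up placeholder for whatever your memory-hungry routes are. The interval and path arguments follow the Unicorn::OobGC documentation, so check them against the version you have installed.

```ruby
# config.ru (sketch; "/reports" is a hypothetical memory-hungry route)
require 'unicorn/oob_gc'

# Arguments: the interval (number of matching requests between GC runs,
# default 5) and a regexp the request path must match (default: every path).
use Unicorn::OobGC, 2, %r{\A/reports}

# ... the rest of your config.ru follows, ending with your usual `run` line.
```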

A word of warning: this increased memory usage by ~100% on my GitLab install. It made things much faster too, though. Coincidentally, the memory increase from doing this was the same as the decrease I got from switching from multiple Thin instances to Unicorn.

Another nice trick is to preload your app in the master process so that your workers are a little lighter and spin up faster. With Rails it’s pretty easy. Just add this to your Unicorn config file:

preload_app true

before_fork do |server, worker|
  # the following is highly recommended for Rails + "preload_app true"
  # as there's no need for the master process to hold a connection
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  # the following is *required* for Rails + "preload_app true",
  # since each forked worker needs its own database connection
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end

I’d encourage you to read the docs for Unicorn if you’re using it and spend some time tweaking it for your app. In just a few minutes I significantly sped up GitLab compared with both Thin and the default Unicorn configuration, and everyone likes “free” performance bumps :-)