Some cool uses of Git-like hashes in Ruby (with Gibbler)
Cryptographic hashes are pretty cool. They’re often used as checksums for large files because they’re fast, consistent, and, well, secure. A lot of open source software packages are distributed with an MD5 or SHA1 hash so that you can verify that all the bits are in the correct place (i.e. that the file you downloaded is identical to the one being served). If you’ve used Mercurial or Git, you’ve seen them used there too, to track commits, objects, and trees.
Hashes can also be a useful tool in your code, and I wrote Gibbler to make that easy. Why not use Ruby’s hash method? Because the return values are inconsistent between runs.
As it turns out, you can do some neat stuff when you can rely on the values staying the same between runs, even across different versions and implementations of Ruby. I’m going to point out a few, but first a quick introduction.
A quick intro to Gibbler
When you require gibbler, you get a gibbler method installed into most rudimentary objects like String, Symbol, Hash, Array, etc.
Gibbler works similarly to Git: for complex objects, it dives depth-first and creates digests for each object and at each level creates a summary digest. The final digest is based on the summaries for each element.
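To make the depth-first idea concrete, here is a minimal sketch that uses only Ruby's standard Digest library. It is not Gibbler's implementation: deep_digest is a hypothetical helper, and the choices to include the class name and to sort hash entries are assumptions made for illustration.

```ruby
require 'digest'

# Hypothetical helper, not Gibbler's actual code.
def deep_digest(obj)
  case obj
  when Hash
    # Digest each key/value pair, then summarize the sorted pair digests
    # so that key order does not affect the result (an assumption).
    parts = obj.map { |k, v| Digest::SHA1.hexdigest("#{k.class}:#{k}:#{deep_digest(v)}") }
    Digest::SHA1.hexdigest("Hash:#{parts.sort.join(':')}")
  when Array
    # Order matters here: each element's position is part of its digest.
    parts = obj.each_with_index.map { |v, i| Digest::SHA1.hexdigest("#{i}:#{deep_digest(v)}") }
    Digest::SHA1.hexdigest("Array:#{parts.join(':')}")
  else
    # Leaf values: class name plus string form.
    Digest::SHA1.hexdigest("#{obj.class}:#{obj}")
  end
end

config = { name: "stella", ports: [80, 443] }
puts deep_digest(config)
```

The summary digest at the top changes whenever any nested value changes, which is the property the rest of the examples rely on.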
You can also include gibbler in your own objects with Gibbler::Complex, which will create the hash based on the values of the instance variables:
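As a rough illustration of a digest built from instance variables, the sketch below uses a hypothetical IvarDigest mixin written against the standard library. It stands in for the idea behind Gibbler::Complex and should not be read as its API; the Account class and its fields are made up for the example.

```ruby
require 'digest'

# Hypothetical mixin, standing in for the idea behind Gibbler::Complex.
module IvarDigest
  def ivar_digest
    # Sort the variable names so the result does not depend on assignment order.
    parts = instance_variables.sort.map do |name|
      "#{name}=#{instance_variable_get(name).inspect}"
    end
    Digest::SHA1.hexdigest("#{self.class}:#{parts.join(',')}")
  end
end

class Account
  include IvarDigest

  def initialize(name, email)
    @name = name
    @email = email
  end
end

a = Account.new("alice", "alice@example.com")
b = Account.new("alice", "alice@example.com")
puts a.ivar_digest == b.ivar_digest  # separate instances, equal state
```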
By the way, if you prefer literal method names, you can require gibbler/aliases.
A few examples of using hashes in your code
Know when a complex object has changed
When you store a record to your database, keep track of the latest hash. Later on, you can check that value to determine whether the contents of the object have changed without checking each field individually. You can also use the value of the hash to detect and prevent duplicate content. Here’s one example:
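A minimal sketch of the change check, assuming a record represented as a plain Hash. The record_digest helper is hypothetical and stands in for calling gibbler on the object; the field names are invented.

```ruby
require 'digest'

# Hypothetical helper standing in for the object's gibbler digest.
def record_digest(fields)
  pairs = fields.sort_by { |k, _| k.to_s }.map { |k, v| "#{k}=#{v}" }
  Digest::SHA1.hexdigest(pairs.join("&"))
end

record = { title: "Hello", body: "First draft" }
stored_digest = record_digest(record)   # saved alongside the record

record[:body] = "Second draft"

# One comparison instead of a field-by-field check.
changed = record_digest(record) != stored_digest
puts "needs saving" if changed
```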
Detect duplicate messages
You don’t need to store copies of an object to know if you’ve seen them before:
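For instance, keeping only a set of digests is enough to recognise a repeat. This sketch uses the standard library's SHA1 rather than Gibbler, and first_time? is a hypothetical helper name.

```ruby
require 'digest'
require 'set'

# Returns true the first time a message is seen, false for repeats.
def first_time?(seen, message)
  # Set#add? returns nil when the digest is already in the set.
  !seen.add?(Digest::SHA1.hexdigest(message)).nil?
end

seen = Set.new
messages = ["ping", "pong", "ping"]
fresh = messages.select { |m| first_time?(seen, m) }
puts fresh.inspect
```

Only the 40-character digests are kept in memory, no matter how large the messages are.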
Find data without storing an index
I use this approach extensively for Stella (my web monitoring service). When a customer runs a checkup, it creates a testplan to represent the site and page being tested. I create a new instance of the object every time, but because the digest for a given testplan is always the same, I know where the object is stored without looking it up based on the URI. I also include the customer ID in the digest calculation so that each customer has their own instance of the testplan. You can see an example of that here:
- A testplan created from my account.
- A testplan created by an anonymous customer.
Notice that the list of recent checkups is different for each. I don’t need to do anything special for this. It’s just a freebie that comes along with using these hashes.
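The lookup can be sketched roughly as follows. This is not Stella's code: testplan_key, the "testplan:" prefix, and the in-memory Hash standing in for a database are all assumptions for illustration.

```ruby
require 'digest'

# Hypothetical key scheme: the customer ID is part of the digest input,
# so each customer gets a separate record for the same URI.
def testplan_key(customer_id, uri)
  "testplan:#{Digest::SHA1.hexdigest("#{customer_id}:#{uri}")}"
end

store = {}  # stand-in for a key-value database

key = testplan_key("cust-1", "http://example.com/")
store[key] ||= { uri: "http://example.com/", checkups: [] }
store[key][:checkups] << "first checkup"

# A later run rebuilds the key from the same inputs and finds the same record.
again = testplan_key("cust-1", "http://example.com/")
puts store[again][:checkups].size
```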
Know which local objects to sync remotely
If you have data in one location that you need to synchronize remotely (database records, files, etc.), you can use the hashes to determine which objects need to be sent over. This is essentially how Git determines what it needs to send to or receive from a remote repo. Of course, you could simply keep track of the record IDs (in the case of a database), but by using hashes you get duplicate detection for free.
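Here is a small sketch of that comparison, assuming the objects are strings and the remote side can report the digests it already holds. Both helper names are hypothetical.

```ruby
require 'digest'

# Map each object's digest to the object itself; duplicates collapse to one entry.
def digest_index(objects)
  objects.each_with_object({}) { |obj, idx| idx[Digest::SHA1.hexdigest(obj)] = obj }
end

# Objects whose digests the remote side does not have yet.
def objects_to_send(local_objects, remote_digests)
  index = digest_index(local_objects)
  (index.keys - remote_digests).map { |digest| index[digest] }
end

local  = ["alpha", "beta", "beta", "gamma"]
remote = [Digest::SHA1.hexdigest("alpha")]
puts objects_to_send(local, remote).inspect
```

Because identical content produces an identical digest, the duplicate "beta" is sent only once.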
Maintain an index pointer for an Array without storing the contents
This example can seem arcane but I’ve found it useful on more than one occasion. Let’s say you have a list of values and you want to always process them in sequence. And for whatever reason you don’t store the values locally but every time you see this array of values you want to continue processing at the appropriate element.
With hashes it’s simple: create an index using the gibbler hash of the array. It will always be the same as long as the values and the order of the values are the same (you could optionally create the hash after sorting the array).
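A minimal sketch of the idea, with a hypothetical Cursor class that remembers a position per array digest. It uses the standard library's SHA1 in place of the gibbler digest and assumes the values are strings.

```ruby
require 'digest'

# Hypothetical cursor store: maps an array's digest to the next position.
class Cursor
  def initialize
    @positions = {}
  end

  # Returns the next unprocessed element, or nil once the list is exhausted.
  def next_for(values)
    key = Digest::SHA1.hexdigest(values.join("\0"))
    pos = @positions.fetch(key, 0)
    return nil if pos >= values.size

    @positions[key] = pos + 1
    values[pos]
  end
end

cursor = Cursor.new
puts cursor.next_for(%w[a b c])
puts cursor.next_for(%w[a b c])  # a fresh array with the same contents resumes
```

Only the digest and an integer are stored; the array itself can be rebuilt or received anew each time.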
There are many more uses for hashes in your Ruby code. I’m interested to hear some. Do you use them in your projects?
Installing Gibbler
gem install gibbler
- code at Github
- gem on Rubyforge
- documentation via RDocs
- screencast by Alex Peuchert
Mini-F.A.Q.
Can digests be made unique per application?
Yep. Set Gibbler.secret to anything, preferably something long.
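The effect can be sketched with the standard library by mixing a secret into the digest input. How Gibbler combines the secret internally is not shown here; keyed_digest and the secret strings are hypothetical.

```ruby
require 'digest'

# Hypothetical helper: the secret is prepended to the value before hashing.
def keyed_digest(value, secret)
  Digest::SHA1.hexdigest("#{secret}:#{value}")
end

app_a = keyed_digest("kimmy", "a-long-secret-for-app-a")
app_b = keyed_digest("kimmy", "a-long-secret-for-app-b")
puts app_a == app_b  # false: the same value hashes differently per application
```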
What if attributes are added to or removed from a class?
Use the gibbler class method to explicitly define the names and order of variables you want to use for the digest.
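To show why an explicit field list helps, here is a standard-library sketch. The DIGEST_FIELDS constant and the Customer class are hypothetical and do not reflect the syntax of the gibbler class method.

```ruby
require 'digest'

class Customer
  # Names and order of the variables that feed the digest (hypothetical).
  DIGEST_FIELDS = [:name, :email].freeze

  attr_accessor :name, :email, :last_login

  def initialize(name, email)
    @name = name
    @email = email
  end

  def digest
    parts = DIGEST_FIELDS.map do |field|
      value = instance_variable_get("@#{field}")
      "#{field}=#{value.inspect}"
    end
    Digest::SHA1.hexdigest(parts.join(","))
  end
end

c = Customer.new("alice", "alice@example.com")
before = c.digest
c.last_login = Time.now       # an attribute outside DIGEST_FIELDS
puts before == c.digest       # true: the digest ignores unlisted variables
```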
Can I use something other than SHA-1?
Yep, you can change the digest type globally or per call.
You can also shorten and change the base of the digest:
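The sketch below shows the same three ideas with Ruby's standard Digest library instead of Gibbler's own settings: picking a different algorithm, truncating the hex string, and re-encoding it in base 36. The helper names are hypothetical.

```ruby
require 'digest'

# Keep only the first few characters of a hex digest.
def short_digest(hex, length = 8)
  hex[0, length]
end

# Re-encode a hex digest in base 36 for a more compact string.
def to_base36(hex)
  hex.to_i(16).to_s(36)
end

value  = "kimmy"
sha1   = Digest::SHA1.hexdigest(value)     # 40 hex characters
sha256 = Digest::SHA256.hexdigest(value)   # 64 hex characters

puts short_digest(sha1)
puts to_base36(sha1)
puts sha256.length
```

Shorter digests are easier to read and store but collide sooner, so the length is a trade-off.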