Mar '11

Some cool uses of Git-like hashes in Ruby (with Gibbler)

posted by delano

Cryptographic hashes are pretty cool. They’re often used as checksums for large files because they’re fast, consistent, and, well, secure. A lot of open source software packages are distributed with an MD5 or SHA1 hash so that you can verify that all the bits are in the correct place (i.e. that the file you downloaded is identical to the one being served). If you’ve used Mercurial or Git, you’ve seen them used there too to track commits, objects, and trees.

Hashes can also be a useful tool in your code and I wrote Gibbler to make it easy to do that. Why not use Ruby’s hash method? Because the return values are inconsistent between runs.

t1 = Time.now
t2 = t1.clone
t1.hash                       #=> -2827223250544534006
t2.hash                       #=> -2827223250544534006 (the same!)
t1.object_id                  #=> 2170505820
t2.object_id                  #=> 2170481360

# Later on, with another instance of Ruby
t1 = Time.now
t1.hash                       #=> 2265941047042223117 (different!)
t1.object_id                  #=> 2168957700
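
For contrast, here is a minimal sketch using only Ruby's standard digest library (not Gibbler itself). A digest computed from an object's content depends only on the bytes that go in, so it comes out the same in every process. The content_digest helper is a made-up name for illustration.

```ruby
require 'digest'

# Hypothetical helper (not part of Gibbler): digest a value's string form.
def content_digest(value)
  Digest::SHA1.hexdigest(value.to_s)
end

a = content_digest('tea')
b = content_digest('tea'.dup)   # a different object with the same content

puts a == b       # true: same content, same digest
puts a.length     # 40: SHA-1 hex digests are 40 characters long
```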

But as it turns out, you can do some neat stuff when you can rely on the values staying the same between runs and across different versions and implementations of Ruby. I’m going to point out a few, but first a quick introduction.

A quick intro to Gibbler

When you require gibbler, a gibbler method is installed into most basic objects, such as String, Symbol, Hash, and Array.

require 'gibbler'

'tea'.gibbler                 #=> 6ef1ccef723f8f6c048399cfa5f46a781f559137
:tea.gibbler                  #=> 4f7721e1a1e0a02f87b196fd78f94358293793c1
{:count => 100}.gibbler       #=> 19322962506419bd16d9de2ab3d1e5ec0772c4e6
[4, 3, 2, '1'].gibbler        #=> b05b4fada2105f0f9547ae320423deba729abe53

Gibbler works similarly to Git: for complex objects, it dives depth-first and creates digests for each object and at each level creates a summary digest. The final digest is based on the summaries for each element.

[4, 3, 2, 1].gibbler          #=> d1cf67fb93ec51885e7c74e4b3a3d5ef3aad2bf9
[3, 2, 1].gibbler             #=> 18410df1574242b2730144ed483930072e49bd23
[3, [2, 1]].gibbler           #=> a05a76617a3b848060e6e8024e9c38a264dbd31b
[3, [2, [1]]].gibbler         #=> b32e17d4bf10eb7101153703511d08de4509e0ce
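
To make the depth-first idea concrete, here is a rough sketch built on the standard digest library. It is not Gibbler's actual algorithm: the sketch_digest name and the exact strings fed to SHA-1 are assumptions chosen for illustration, so the values it produces will not match the ones above.

```ruby
require 'digest'

# Illustrative only: Gibbler's real input format differs in its details.
def sketch_digest(obj)
  case obj
  when Array
    # Digest each element first (depth-first), then summarize the list.
    parts = obj.map { |el| sketch_digest(el) }
    Digest::SHA1.hexdigest("Array:#{parts.size}:#{parts.join(':')}")
  when Hash
    # Sort the pair digests so insertion order does not affect the result.
    parts = obj.map { |k, v| "#{sketch_digest(k)}=>#{sketch_digest(v)}" }.sort
    Digest::SHA1.hexdigest("Hash:#{parts.size}:#{parts.join(':')}")
  else
    # Leaf values: include the class so 1 and '1' digest differently.
    Digest::SHA1.hexdigest("#{obj.class}:#{obj}")
  end
end

flat   = sketch_digest([3, 2, 1])
nested = sketch_digest([3, [2, 1]])
puts flat == nested                      # false: nesting changes the digest
puts flat == sketch_digest([3, 2, 1])    # true: same structure, same digest
```

Because each level only sees the digests of its children, changing a deeply nested value changes every summary digest above it, which is the property the examples above rely on.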

You can also include gibbler in your own objects with Gibbler::Complex, which creates the hash based on the values of the instance variables:

class Email
  include Gibbler::Complex
  attr_accessor :to, :from, :subject, :content
  def initialize *args
    @to, @from, @subject, @content = *args
  end
end

msg1 = Email.new             
msg2 = Email.new 'd@example.com', 't@example.com', 'Hello', 'Long time no see!'

msg1.gibbler                  #=> 2667ed303e2e2cc307d49301acd7575ea3f90f2e
msg2.gibbler                  #=> 328dfe801c2563e31aa9a2b4831fa182f5e41dfd

By the way, if you prefer literal method names, you can require gibbler/aliases.

require 'gibbler/aliases'
'tea'.digest                  #=> 6ef1ccef723f8f6c048399cfa5f46a781f559137

A few examples of using hashes in your code

Know when a complex object has changed

When you store a record to your database, keep track of the latest hash. Later on, you can check that value to determine whether the contents of the object have changed without checking each field individually. You can also use the value of the hash to detect and prevent duplicate content. Here’s one example:

class Article
  include Gibbler::Complex
  attr_accessor :author, :title, :content, :checksum
  gibbler :author, :title, :content
  def initialize *args
    @author, @title, @content = *args
  end
  def changed?
    @checksum != gibbler
  end
end
article = Article.new 'jodie', 'Chicken Soup', '...'
article.checksum = article.gibbler
article.save

# Later on, in another process... 
article.content << "and it was delicious."
article.changed?              #=> true

Detect duplicate messages

You don’t need to store copies of an object to know if you’ve seen it before:

from = 't@example.com'
seen = []
['cust1@example.com', 'cust2@example.com', 'cust1@example.com'].each do |to|
  msg = Email.new to, from, 'A catchy subject', 'Some interesting content.'
  if seen.member?(msg.gibbler)
    # cust1 has already received that specific email
    next
  end
  seen << msg.gibbler
end

Find data without storing an index

I use this approach extensively for Stella (my web monitoring service). When a customer runs a checkup, it creates a testplan to represent the site and page being tested. I create a new instance of the object every time, but because the digest for a given testplan is always the same, I know where the object is stored without looking it up based on the URI. I also include the customer ID in the digest calculation so that each customer has their own instance of the testplan. You can see an example of that here:

A testplan created from my account.
A testplan created by an anonymous customer.

Notice that the list of recent checkups is different for each. I don’t need to do anything special for this. It’s just a freebie that comes along with using these hashes.
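
As a rough sketch of the idea (not Stella's actual code), the storage key can be derived from the fields that identify a testplan, so a freshly built instance points straight at the existing record. The storage_key helper, the testplan: prefix, and the in-memory store hash are all made up for illustration, and plain SHA-1 stands in for Gibbler.

```ruby
require 'digest'

# Hypothetical key builder: customer ID plus URI decide where the record lives.
def storage_key(customer_id, uri)
  'testplan:' + Digest::SHA1.hexdigest([customer_id, uri].join("\n"))
end

store = {}   # stand-in for a key-value database

# First checkup: create the record under its digest-based key.
key = storage_key('delano', 'http://example.com/')
store[key] = { checkups: ['checkup 1'] }

# Later: rebuild the key from the same inputs, no lookup table required.
same_key = storage_key('delano', 'http://example.com/')
store[same_key][:checkups] << 'checkup 2'

# A different customer gets a separate record for the same URI.
other_key = storage_key('anonymous', 'http://example.com/')

puts store[key][:checkups].size   # 2
puts store.key?(other_key)        # false
```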

Know which local objects to sync remotely

If you have data in one location that you need to synchronize remotely (database records, files, etc.), you can use the hashes to determine which objects need to be sent over. This is exactly how Git determines what it needs to send to or receive from a remote repo. Of course you could simply keep track of the record IDs (in the case of a database), but by using hashes you get duplicate detection for free.
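
A minimal sketch of that comparison, using plain SHA-1 from the standard library in place of Gibbler. The record values and the digest_of helper are invented for the example, and a real sync would fetch the remote digests over the network rather than build them locally.

```ruby
require 'digest'
require 'set'

# Hypothetical helper: the digest that identifies a record's content.
def digest_of(record)
  Digest::SHA1.hexdigest(record)
end

local_records  = ['alpha', 'bravo', 'charlie', 'alpha']   # note the duplicate
remote_digests = Set.new([digest_of('bravo')])            # what the remote already has

# Index local records by digest; duplicates collapse into one entry.
local_by_digest = local_records.each_with_object({}) do |rec, index|
  index[digest_of(rec)] = rec
end

# Only records whose digest the remote has not seen need to be sent.
to_send = local_by_digest.reject { |d, _| remote_digests.include?(d) }.values

p to_send   # ["alpha", "charlie"]
```

Only the digests have to cross the wire to work out the difference, and the duplicate 'alpha' is sent once.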

Maintain an index pointer for an Array without storing the contents

This example can seem arcane but I’ve found it useful on more than one occasion. Let’s say you have a list of values and you want to always process them in sequence. And for whatever reason you don’t store the values locally but every time you see this array of values you want to continue processing at the appropriate element.

With hashes it’s simple: create an index using the gibbler hash of the array. It will always be the same as long as the values and the order of the values are the same (you could optionally create the hash after sorting the array).

indexes = {}
5.times do
  people = %w[dave john candace]  # inside the loop to simulate different arrays
  indexes[people.gibbler] ||= -1
  indexes[people.gibbler] += 1
  indexes[people.gibbler] = 0 if indexes[people.gibbler] >= people.size
  current_idx = indexes[people.gibbler] 
  puts people[ current_idx ]
end
# Output:
# dave
# john
# candace
# dave
# john

There are many more uses for hashes in your Ruby code. I’m interested to hear some. Do you use them in your projects?

Installing Gibbler

gem install gibbler

code at Github
gem on Rubyforge
documentation via RDocs
screencast by Alex Peuchert

Mini-F.A.Q.

Can digests be made unique per application?

Yep. Set Gibbler.secret to anything, preferably something long.

:kimmy.gibbler                #=> 52be7494a602d85ff5d8a8ab4ffe7f1b171587df

Gibbler.secret = '4cea880a75df6c8b1fa2'

:kimmy.gibbler                #=> 0f71d5813687cb07f8b6be5389e636962f49e213

What if attributes are added to or removed from a class?

Use the gibbler class method to explicitly define the names and order of variables you want to use for the digest.

class Email
  include Gibbler::Complex
  gibbler :to, :subject, :content   # only these fields will be considered
end

msg = Email.new 'd@example.com', 't@example.com', 'Hello', 'Long time no see!'
msg.gibbler                   #=> 7f68056cf34cd42cbb3dee1f81535100ae783fe9

msg.from = 't2@example.com'
msg.gibbler                   #=> 7f68056cf34cd42cbb3dee1f81535100ae783fe9 (unchanged)

Can I use something other than SHA-1?

Yep, you can change the digest type globally or per call.

:a.gibbler                     #=> cd55a626c21b5580141442e789201e7e64276da9

Gibbler.digest_type = Digest::MD5
:a.gibbler                     #=> ef8de1a0d178ce85999a4b54840c21e0

:a.gibbler(Digest::SHA256)     #=> f5fa26a66724c32df25f872fd691dd18e03cc2347a...

You can also shorten and change the base of the digest:

:kimmy.gibbler                #=> 52be7494a602d85ff5d8a8ab4ffe7f1b171587df
:kimmy.gibbler.shorten        #=> 52be7494a602d85ff5d8
:kimmy.gibbler.shorten(10)    #=> 52be7494a6
:kimmy.gibbler.base(36)       #=> 9nydr6mpv6w4k8ngo3jtx0jz1n97h7j
:kimmy.gibbler.base(10)       #=> 472384540402900668368761869477227308873774630879
:kimmy.gibbler.to_i           #=> 472384540402900668368761869477227308873774630879
:kimmy.gibbler.base(2)        #=> 101001010111110011101001001010010100110000...
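
If you're curious what those conversions amount to, here is a plain-Ruby sketch that needs no gem. It treats the digest as an ordinary hex string, which is an assumption about the idea rather than a copy of Gibbler's implementation.

```ruby
hex = '52be7494a602d85ff5d8a8ab4ffe7f1b171587df'   # digest from the example above

number = hex.to_i(16)        # parse the hex string as one big integer
base36 = number.to_s(36)     # re-encode with the digits 0-9 and a-z
short  = hex[0, 10]          # keep only the first 10 characters

puts base36
puts short                   # 52be7494a6
puts number.to_s(16) == hex  # true: the conversion is reversible
```

Note that on a plain String, to_i without an argument parses base 10, so the explicit 16 matters here.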

I'm Delano Mandelbaum, the founder of Solutious Inc. I've worked for companies large and small and now I'm putting everything I've learned into building great tools. I recently launched a monitoring service called Stella.

Solutious is a software company based in Montréal. We build testing and development tools that are both powerful and pleasant to use. All of our software is on GitHub.
