Mar '11
11

Some cool uses of Git-like hashes in Ruby (with Gibbler)

posted by delano

Cryptographic hashes are pretty cool. They're often used as checksums for large files because they're fast, consistent, and well, secure. A lot of open source software packages are distributed with an MD5 or SHA1 hash so that you can verify that all the bits are in the correct place (i.e. that the file you downloaded is identical to the one being served). If you've used Mercurial or Git, you've seen them used there too, to track commits, objects, and trees.

Hashes can also be a useful tool in your code and I wrote Gibbler to make it easy to do that. Why not use Ruby's hash method? Because the return values are inconsistent between runs.

t1 = Time.now
t2 = t1.clone
t1.hash                       #=> -2827223250544534006
t2.hash                       #=> -2827223250544534006 (the same!)
t1.object_id                  #=> 2170505820
t2.object_id                  #=> 2170481360

# Later on, with another instance of Ruby
t1 = Time.now
t1.hash                       #=> 2265941047042223117 (different!)
t1.object_id                  #=> 2168957700
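
To make the contrast concrete, here is a minimal sketch that uses only Ruby's standard library (not Gibbler itself). A SHA-1 digest of a value's string form comes out the same in every process, which is the property the rest of this post relies on. The helper name stable_digest is made up for illustration.

```ruby
require 'digest/sha1'

# Hypothetical helper (not part of Gibbler): digest the class name and the
# string form of a value so the result is identical in every Ruby process.
def stable_digest(value)
  Digest::SHA1.hexdigest("#{value.class}:#{value}")
end

a = stable_digest('tea')
b = stable_digest('tea'.dup)

puts a == b                      # equal content gives an equal digest
puts a.size                      # a SHA-1 hex digest is 40 characters long
puts stable_digest(:tea) == a    # a Symbol and a String digest differently
```

Run it in two separate Ruby processes and the digests will match, unlike the values returned by hash above.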

But as it turns out, you can do some neat stuff when you can rely on the values staying the same between runs and across different versions and implementations of Ruby. I'm going to point out a few, but first a quick introduction.

A quick intro to Gibbler

When you require gibbler, you get a gibbler method installed into most of the basic classes like String, Symbol, Hash, Array, etc.

require 'gibbler'

'tea'.gibbler                 #=> 6ef1ccef723f8f6c048399cfa5f46a781f559137
:tea.gibbler                  #=> 4f7721e1a1e0a02f87b196fd78f94358293793c1
{:count => 100}.gibbler       #=> 19322962506419bd16d9de2ab3d1e5ec0772c4e6
[4, 3, 2, '1'].gibbler        #=> b05b4fada2105f0f9547ae320423deba729abe53

Gibbler works similarly to Git: for complex objects, it dives depth-first and creates digests for each object and at each level creates a summary digest. The final digest is based on the summaries for each element.

[4, 3, 2, 1].gibbler          #=> d1cf67fb93ec51885e7c74e4b3a3d5ef3aad2bf9
[3, 2, 1].gibbler             #=> 18410df1574242b2730144ed483930072e49bd23
[3, [2, 1]].gibbler           #=> a05a76617a3b848060e6e8024e9c38a264dbd31b
[3, [2, [1]]].gibbler         #=> b32e17d4bf10eb7101153703511d08de4509e0ce
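
If you're curious what "dives depth-first and summarizes each level" can look like, here is a rough sketch built on the standard library. It is not Gibbler's actual algorithm (the digests will not match the ones above), and tree_digest is a name I made up for this example.

```ruby
require 'digest/sha1'

# Illustrative only: this is NOT Gibbler's real algorithm, so its digests
# will not match the ones shown in this post.
def tree_digest(obj)
  case obj
  when Array
    # Digest every element first (depth-first), then summarize this level.
    parts = obj.map { |el| tree_digest(el) }
    Digest::SHA1.hexdigest("Array:#{parts.size}:#{parts.join(':')}")
  when Hash
    # Sort keys by their string form so insertion order does not matter
    # (a choice made for this sketch).
    parts = obj.keys.sort_by(&:to_s).map do |k|
      Digest::SHA1.hexdigest("#{tree_digest(k)}:#{tree_digest(obj[k])}")
    end
    Digest::SHA1.hexdigest("Hash:#{parts.size}:#{parts.join(':')}")
  else
    # Leaf value: include the class so 1 and '1' produce different digests.
    Digest::SHA1.hexdigest("#{obj.class}:#{obj}")
  end
end

puts tree_digest([3, 2, 1]) == tree_digest([3, 2, 1])     # same structure, same digest
puts tree_digest([3, 2, 1]) == tree_digest([3, [2, 1]])   # nesting changes the digest
```

Because each level is summarized from the digests beneath it, a change anywhere in the structure changes the final digest.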

You can also include gibbler in your own objects with Gibbler::Complex which will create the hash based on the values of the instance variables:

class Email
  include Gibbler::Complex
  attr_accessor :to, :from, :subject, :content
  def initialize *args
    @to, @from, @subject, @content = *args
  end
end

msg1 = Email.new             
msg2 = Email.new 'd@example.com', 't@example.com', 'Hello', 'Long time no see!'

msg1.gibbler                  #=> 2667ed303e2e2cc307d49301acd7575ea3f90f2e
msg2.gibbler                  #=> 328dfe801c2563e31aa9a2b4831fa182f5e41dfd

By the way, if you prefer literal method names, you can require gibbler/aliases.

require 'gibbler/aliases'
'tea'.digest                  #=> 6ef1ccef723f8f6c048399cfa5f46a781f559137

A few examples of using hashes in your code

Know when a complex object has changed

When you store a record to your database, keep track of the latest hash. Later on, you can check that value to determine whether the contents of the object have changed without checking each field individually. You can also use the value of the hash to detect and prevent duplicate content. Here's one example:

class Article
  include Gibbler::Complex
  attr_accessor :author, :title, :content, :checksum
  gibbler :author, :title, :content
  def initialize *args
    @author, @title, @content = *args
  end
  def changed?
    @checksum != gibbler
  end
end
article = Article.new 'jodie', 'Chicken Soup', '...'
article.checksum = article.gibbler
article.save                  # assumes a save method from your persistence layer

# Later on, in another process... 
article.content << "and it was delicious."
article.changed?              #=> true

Detect duplicate messages

You don't need to store copies of an object to know if you've seen them before:

from = 't@example.com'
seen = []
['cust1@example.com', 'cust2@example.com', 'cust1@example.com'].each do |to|
  msg = Email.new to, from, 'A catchy subject', 'Some interesting content.'
  if seen.member?(msg.gibbler)
    # cust1 has already received that specific email
    next
  end
  seen << msg.gibbler
end

Find data without storing an index

I use this approach extensively for Stella (my web monitoring service). When a customer runs a checkup, it creates a testplan to represent the site and page being tested. I create a new instance of the object every time, but because the digest for a given testplan is always the same, I know where the object is stored without looking it up based on the URI. As well, I include the customer ID in the digest calculation so that each customer has their own instance of the testplan. You can see an example of that here:

Notice that the list of recent checkups is different for each. I don't need to do anything special for this. It's just a freebie that comes along with using these hashes.
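
Here's a small sketch of the idea using a plain Hash in place of a real datastore. The testplan_key helper, the "testplan:" prefix, and the record layout are all assumptions for illustration; this is not Stella's actual schema.

```ruby
require 'digest/sha1'

# Hypothetical key builder: the "testplan:" prefix and the choice of fields
# are assumptions for illustration, not Stella's real schema.
def testplan_key(customer_id, uri)
  # Join with a NUL byte so ('ab', 'c') and ('a', 'bc') cannot collide.
  digest = Digest::SHA1.hexdigest([customer_id, uri].join("\0"))
  "testplan:#{digest}"
end

store = {}  # stands in for a key-value store

key = testplan_key('cust-42', 'http://example.com/')
store[key] ||= { :checkups => [] }
store[key][:checkups] << 'checkup 1'

# A later request rebuilds the same key from the same inputs: no index needed.
same = testplan_key('cust-42', 'http://example.com/')
puts store[same][:checkups].inspect

# Another customer testing the same URI gets a separate record.
other = testplan_key('cust-7', 'http://example.com/')
puts store.key?(other)
```

The key is derived entirely from the inputs, so any process that knows the customer ID and the URI can find the record directly.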

Know which local objects to sync remotely

If you have data in one location that you need to synchronize remotely (database records, files, etc) you can use the hashes to determine which objects need to be sent over. This is exactly how git determines what it needs to send to or receive from a remote repo. Of course you could simply keep track of the record IDs (in the case of a database) but by using hashes you get duplicate detection for free.
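
As a simplified sketch of that comparison (standard library only, with made-up helper names and record fields): ask the remote side for the digests it already has, then send only the local records whose digests are missing.

```ruby
require 'digest/sha1'
require 'set'

# Hypothetical helper: digest a record from its fields in a fixed order.
# The :title and :body fields are placeholders for this example.
def record_digest(record)
  Digest::SHA1.hexdigest(record.values_at(:title, :body).join("\0"))
end

# Returns the local records whose digests the remote side does not have yet.
def records_to_send(local_records, remote_digests)
  remote = Set.new(remote_digests)
  seen = Set.new
  local_records.select do |rec|
    d = record_digest(rec)
    # Skip anything the remote already has, plus local duplicates
    # (Set#add? returns nil when the digest was already seen).
    !remote.include?(d) && seen.add?(d)
  end
end

local = [
  { :title => 'a', :body => 'first'  },
  { :title => 'b', :body => 'second' },
  { :title => 'b', :body => 'second' }   # duplicate content
]
remote_digests = [record_digest(local[0])]

puts records_to_send(local, remote_digests).size   # only one record needs sending
```

The duplicate record never goes over the wire, which is the "free" duplicate detection mentioned above.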

Maintain an index pointer for an Array without storing the contents

This example can seem arcane but I've found it useful on more than one occasion. Let's say you have a list of values and you want to always process them in sequence. And for whatever reason you don't store the values locally but every time you see this array of values you want to continue processing at the appropriate element.

With hashes it's simple: create an index using the gibbler hash of the array. It will always be the same as long as the values and the order of the values are the same (you could optionally create the hash after sorting the array).

indexes = {}
5.times do
  people = %w[dave john candace]  # inside the loop to simulate different arrays
  indexes[people.gibbler] ||= -1
  indexes[people.gibbler] += 1
  indexes[people.gibbler] = 0 if indexes[people.gibbler] >= people.size
  current_idx = indexes[people.gibbler] 
  puts people[ current_idx ]
end
# Output:
# dave
# john
# candace
# dave
# john

There are many more uses for hashes in your Ruby code. I'm interested to hear some. Do you implement them in your projects?

Installing Gibbler

gem install gibbler

Mini-F.A.Q.

Can digests be made unique per application?

Yep. Set Gibbler.secret to anything, preferably something long.

:kimmy.gibbler                #=> 52be7494a602d85ff5d8a8ab4ffe7f1b171587df

Gibbler.secret = '4cea880a75df6c8b1fa2'

:kimmy.gibbler                #=> 0f71d5813687cb07f8b6be5389e636962f49e213

What if attributes are added to or removed from a class?

Use the gibbler class method to explicitly define the names and order of variables you want to use for the digest.

class Email
  include Gibbler::Complex
  gibbler :to, :subject, :content   # only these fields will be considered
end

msg = Email.new 'd@example.com', 't@example.com', 'Hello', 'Long time no see!'
msg.gibbler                   #=> 7f68056cf34cd42cbb3dee1f81535100ae783fe9

msg.from = 't2@example.com'   # from is not part of the digest
msg.gibbler                   #=> 7f68056cf34cd42cbb3dee1f81535100ae783fe9 (unchanged)

Can I use something other than SHA-1?

Yep, you can change the digest type globally or per call.

:a.gibbler                     #=> cd55a626c21b5580141442e789201e7e64276da9

Gibbler.digest_type = Digest::MD5
:a.gibbler                     #=> ef8de1a0d178ce85999a4b54840c21e0

:a.gibbler(Digest::SHA256)     #=> f5fa26a66724c32df25f872fd691dd18e03cc2347a...

You can also shorten and change the base of the digest:

:kimmy.gibbler                #=> 52be7494a602d85ff5d8a8ab4ffe7f1b171587df
:kimmy.gibbler.shorten        #=> 52be7494a602d85ff5d8
:kimmy.gibbler.shorten(10)    #=> 52be7494a6
:kimmy.gibbler.base(36)       #=> 9nydr6mpv6w4k8ngo3jtx0jz1n97h7j
:kimmy.gibbler.base(10)       #=> 472384540402900668368761869477227308873774630879
:kimmy.gibbler.to_i           #=> 472384540402900668368761869477227308873774630879
:kimmy.gibbler.base(2)        #=> 101001010111110011101001001010010100110000...
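
For reference, the same kind of conversion can be done on any hex digest with plain Ruby. This is a sketch of the idea rather than Gibbler's implementation, the rebase helper is hypothetical, and the digest here is a plain SHA-1 of a string, so the values will differ from the ones above.

```ruby
require 'digest/sha1'

# Hypothetical helper: re-encode a hex string in another base (2 to 36).
# Note that leading zeros are lost once the digest is treated as a number.
def rebase(hex, base)
  hex.to_i(16).to_s(base)
end

hex = Digest::SHA1.hexdigest('kimmy')   # plain SHA-1, not a Gibbler digest

short  = hex[0, 10]        # keep only the first 10 hex characters
base36 = rebase(hex, 36)   # same number, shorter string

puts short.size
puts base36.to_i(36) == hex.to_i(16)    # the conversion is reversible
puts base36.size < hex.size             # a larger base gives a shorter string
```

Shortened digests are handy for display, but every character you drop increases the chance of a collision.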

Mar '11
10

An update on yesterday's Stella outage (it was the Redis configuration)

posted by delano

After I reported on the Stella outage yesterday, I did a bit more investigation and made a few changes to the operations of the site. To recap, yesterday at 9am PST the site became unresponsive and the background workers stopped running. I also couldn't SSH in to the main backend machine and ultimately had to reboot it to get back in. After looking into it further, it looks like there were multiple factors at play.

Root Cause, revisited

Memory swapping and blocking. There was a conflict between the hourly backup process and the regular operation of the site. The hourly cronjob is simple: it copies the redis.rdb file, gzips it, encrypts it, and uploads it to S3. The conflict arose during the copy. Redis was configured to run a background save every 5 minutes to the same redis.rdb. What I didn't realize is that redis blocks the bgsave while that file is being copied. That file had grown a lot with the additional traffic over the past few weeks. I didn't notice this issue previously because the cp and bgsave commands ran pretty quickly. With so much more data, both take longer, so it was only a matter of time before this happened.

Operational Changes

I made the following changes:

  • Disabled background saves in Redis
  • Enabled append only persistence
  • Rewrote the hourly backup to explicitly run a Redis background save
  • Added an hourly process to tidy unneeded data to keep the backups smaller
  • Added a nightly process to run bgrewriteaof to keep the append only file tidy
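
To illustrate the rewritten backup step, here is a rough Ruby sketch, not the actual script. It triggers a background save, waits for LASTSAVE to change, and only then compresses the dump file. The redis argument is assumed to be any client object that responds to lastsave and bgsave; the method name, paths, and timeout are placeholders, and the encryption and S3 upload steps are left out.

```ruby
require 'zlib'

# Sketch of a backup that triggers its own snapshot and only reads the dump
# once that snapshot has finished. `redis` is any object that responds to
# `lastsave` and `bgsave` (an assumption; adapt it to your client library).
def backup_snapshot(redis, rdb_path, dest_path, timeout = 600, interval = 1)
  before = redis.lastsave
  redis.bgsave
  waited = 0
  # LASTSAVE only moves after a successful save, so poll until it changes.
  until redis.lastsave != before
    raise 'background save timed out' if waited >= timeout
    sleep interval
    waited += interval
  end
  # Compress the finished dump into the backup location, chunk by chunk.
  Zlib::GzipWriter.open(dest_path) do |gz|
    File.open(rdb_path, 'rb') do |f|
      while (chunk = f.read(64 * 1024))
        gz.write(chunk)
      end
    end
  end
  dest_path
end
```

Since the scheduled background saves are disabled, the backup is the only thing writing the dump file, so the copy and the save no longer overlap. One caveat with this sketch: LASTSAVE has one-second resolution, so a save that completes within the same second as the previous one would not be detected.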

The SSH issue

There was still the issue of the hanging SSH connection. Before the outage and after I rebooted the machine, SSH was fine. Also, I was able to open a connection to the redis server the entire time so it wasn't the network. I'm not 100% sure but I have a suspicion that it was a kernel problem. That suspicion is based entirely on several modprobe errors in the system console (missing modules). The machine was running an EBS instance type. I have had miscellaneous boot and connection problems with the EBS instance types so I decided to go back to a trusty old instance-store machine image. I built a new machine image and launched a new instance that replaced the previous machine.

These changes have been in production for the past few hours. I'll report back if I need to make any further modifications.

Mar '11
09

Morning Stella outage

posted by delano

Update: I posted a follow-up on this outage.

Stella was down for about an hour and a half this morning, including checkups and monitors. I received a notification at 9am (PST) and the site returned to normal at 10:40am. Here is a screenshot of status.blamestella.com from this morning:

The symptoms

The web servers were not responding and the master backend machine was not available via SSH (I could connect but it would hang after authenticating). The connection to Redis was not affected.

The cause

I couldn't verify that there was a single root cause. I couldn't SSH in to the machine until after it was rebooted but it looks like something happened while Redis was running a background save. The only access I had was through redis-cli:

redis> bgsave
(error) ERR Background save already in progress
redis> save
(error) ERR Background save already in progress
redis> info 
...
aof_enabled:0
changes_since_last_save:67162
bgsave_in_progress:1
used_memory_human:2.21G
mem_fragmentation_ratio:1.44
...

My thinking at the time was that the process was swapping to VM so I waited patiently (not so patiently actually) to see if it would finish. There's 7.5GB of RAM and Redis was only using about 2.5GB so there shouldn't have been a need to swap, but it's possible (I've noticed that the redis-server process has used as much as 1GB more than what it reports via info). As you can see, there was some data in memory that hadn't been written to disk yet (changes_since_last_save) so that was my top priority. After about 20 minutes it was clear that there was something else going on and my top priority switched to getting the site back up. The data is stored on an EBS volume so my guess is that there was degraded IO performance. To be clear, I don't suspect it was an issue with Redis itself. However, that still doesn't explain why I couldn't log in via SSH, so it's possible there was a network issue at the same time (which I've experienced before in other EC2 availability zones).

At that point I made the decision to reboot the machine. I also started the process of provisioning a new backend machine using the most recent backup (which was created an hour earlier). I did these in parallel in the event that the existing machine did not come back up. It did and after it ran an fsck I was able to get into the machine. I brought redis up, tested it, and then started the workers (which do the monitoring). Then I brought up the site after a few minutes.

Note: in the report you can see that the site came up briefly at 10am (PST). This was before the backend machine was restarted. The web servers could read and write to Redis but the monitoring and checkups jobs were not running.

Next steps

  • Switch to append-only backups to prevent the need for epic writes.
  • Create and practice a new recovery process to provision a new machine right away which can run off of a snapshot of the master's EBS volume.

I'm Delano Mandelbaum, the founder of Solutious Inc. I've worked for companies large and small and now I'm putting everything I've learned into building great tools. I recently launched a monitoring service called Stella.

Solutious is a software company based in Montréal. We build testing and development tools that are both powerful and pleasant to use. All of our software is on GitHub.

This is our blog about performance, development, and getting stuff done.