Commercial Products
Mar '11
09

Morning Stella outage

posted by delano

Update: I posted a follow-up on this outage.

Stella was down for about an hour and a half this morning, including checkups and monitors. I received a notification at 9am (PST) and the site returned to normal at 10:40am. Here is a screenshot of status.blamestella.com form this morning:

The symptoms

The web servers were not responding and the master backend machine was not available via SSH (I could connect but it would hang after authenticating). The connection to Redis was not affected.

The cause

I couldn't verify that there was a single root cause. I couldn't SSH in to the machine until after it was rebooted but it looks like something happened while Redis was running a background save. The only access I had was through redis-cli:

redis> bgsave
(error) ERR Background save already in progress
redis> save
(error) ERR Background save already in progress
redis> info 
...
aof_enabled:0
changes_since_last_save:67162
bgsave_in_progress:1
used_memory_human:2.21G
mem_fragmentation_ratio:1.44
...

My thinking at the time was that the process was swapping to VM so I waited patiently (not so patiently actually) to see if it would finish. There's 7.5GB of RAM and Redis was only using about 2.5GB so there shouldn't have been a need to swap but it's possible (I've noticed that the redis-server process has used as much as 1GB more than what it report via info). As you can see there was some data in memory that hadn't been written to disk yet so that was my top priority (changes_since_last_save). After about 20 minutes it was clear that there was something else going on and my top priority switched to getting the site back up. The data is stored on an EBS volume so my guess is that there was degraded IO performance. To be clear, I don't suspect it was an issue with Redis itself. However, that still doesn't explain why I couldn't login via SSH so it's possible there was a network issue at the same time (which I've experienced before in other EC2 available zones).

At that point I made the decision to reboot the machine. I also started the process of provisioning a new backend machine using the most recent backup (which was created an hour earlier). I did these in parallel in the event that the existing machine did not come back up. It did and after it ran an fsck I was able to get into the machine. I brought redis up, tested it, and then started the workers (which do the monitoring). Then I brought up the site after a few minutes.

Note: in the report you can see that the site came up briefly at 10am (PST). This was before the backend machine was restarted. The web servers could read and write to Redis but the monitoring and checkups jobs were not running.

Next steps

  • Switch to append-only backups to prevent the need for epic writes.
  • Create and practice a new recovery process to provision a new machine right away which can run off of a snapshot of the master's EBS volume.

I'm Delano Mandelbaum, the founder of Solutious Inc. I've worked for companies large and small and now I'm putting everything I've learned into building great tools. I recently launched a monitoring service called Stella.

You can also find me on:

-       Delano (@solutious.com)

Solutious is a software company based in Montréal. We build testing and development tools that are both powerful and pleasant to use. All of our software is on GitHub.

This is our blog about performance, development, and getting stuff done.

-       Solutious