Like most sysadmins out there, I read the blogs and posts about new technologies that are relevant to my product(s). There has been a lot of buzz about NoSQL and the key-value store type db’s out there, yet I had never really focused on them directly. I had only paid passing attention, until I realized that we use one heavily at my company.

Redis is a key value store that runs completely in memory. It’s fast, REALLY fast, and after spending some time with a few tutorials like The Little Redis Book by Karl Seguin, it makes a ton of sense and is an extremely neat piece of backend software.
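If you have never touched it, the mental model is tiny: keys map to values, and a handful of commands operate on them. A quick taste via redis-cli (the keys and values here are just made up for illustration):

redis-cli set user:42:name "Karl"
redis-cli get user:42:name          # => "Karl"
redis-cli expire user:42:name 60    # keys can have TTLs
redis-cli incr pageviews            # plus counters, lists, hashes, sets...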

As I started to learn more about it, I got talking to a coworker who said he really wanted to upgrade redis from 2.4 to 2.6. I took this as a challenge and ran with it.

The first part of the migration is creating a slave. We had .aof and .rdb snapshotting, but we didn’t have a slave. This post is going to chronicle my adventures getting from 2.4.16 to 2.6.16.
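For context, the persistence we already had boils down to a few lines of redis.conf. This is a sketch of the relevant directives at their stock values, not our exact config:

# RDB snapshotting: dump to disk after <seconds> if at least <changes> keys changed
save 900 1
save 300 10
save 60 10000
dir /var/lib/redis

# AOF: log every write, fsync once a second
appendonly yes
appendfsync everysec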

Another coworker of mine, Lance Woodson, created a script called “bloater” to start putting trash keys into the redis db. It’s pretty straightforward, though I do strongly suggest using bloater.clear! if you are playing with it. I leveraged this to simulate creating a slave against a db that’s already in production: the theory being I’d have data inside the redis db, things talking to the db, and the sync happening all at once. I understand that a lot of people have tested this situation, but I couldn’t seem to find any direct examples. Hopefully this will be useful to someone.
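I won’t reproduce Lance’s script here, but the idea is simple enough that a dumb shell loop gets the point across. This is my own approximation, not his actual code, and the key names and sizes are arbitrary:

# stuff junk keys into redis until the db is as bloated as you want
for i in $(seq 1 100000); do
  redis-cli -h redis-master set "bloat:$i" "$(head -c 1024 /dev/urandom | base64)" > /dev/null
done

# the moral equivalent of bloater.clear!: throw the junk keys away again
redis-cli -h redis-master keys "bloat:*" | xargs -n 100 redis-cli -h redis-master del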

The first step, after getting my 2.4.16 redis machine running, was to run bloater.bloat! 2295000000 to get an .rdb file that was ~2 gigs. I had the snapshotting configured at the default values, so it created the dump without a hitch. I copied it off and called it something like 2gig-dump.rdb. Then I ran bloater.bloat! 4295000000 to create the 4 gig dump file and copied that off as well, so I had two dumpfiles I could play with. You are probably wondering why 2 and 4 gigs, and also thinking “Dang, that’s a huge redis db.” Yes, yes it is. I focused on the “oh crap, our db is too big” situation; if redis handles that, then I know it’ll deal with the 100 meg db just fine.
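The mechanics of grabbing each dump were nothing fancy; roughly this, though the exact paths and filenames are from memory:

redis-cli bgsave                            # or just wait for the save thresholds to trip
redis-cli info | grep bgsave_in_progress    # wait for this to go back to 0
cp /var/lib/redis/dump.rdb ~/2gig-dump.rdb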

Next I ran the FLUSHALL command, shut down my redis instance, and copied the 2 gig dump into /var/lib/redis/. I spun redis back up and ran the INFO command, like all redis admins learn to love to do. I saw that it was 2 gigs, and yay, now I had my foundation to start my testing.
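Spelled out, that reset-and-reload dance looked roughly like this; how you start the daemon depends on your distro, so treat the init script line as a placeholder:

redis-cli flushall
redis-cli shutdown
cp ~/2gig-dump.rdb /var/lib/redis/dump.rdb
/etc/init.d/redis-server start            # or however your distro starts redis
redis-cli info | grep used_memory_human   # should come back at roughly 2 gigs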

I went through quite a few different iterations, but I’ll just explain the basic troubleshooting that I did, and maybe you can build off it.

So with my redis 2.4 instance holding a 2 gig db, and my blank 2.6 instance, it was time to start the fun. I ran the following commands in a tmux session on my main workstation. I liked the idea of a third-party machine watching all this stuff instead of running it directly on one of the redis boxes.

watch "redis-cli -h redis-slave info | grep master_sync_left_bytes "
redis-cli -h redis-slave keys "*" | wc -l
redis-cli -h redis-master keys "*" | wc -l
redis-cli -h redis-master info | grep used_memory
redis-cli -h redis-slave info | grep used_memory
redis-cli -h redis-master --latency

Most, if not all, are pretty self explanatory. The most interesting one is probably the last one, --latency. I used it as a way to emulate something poking my redis instance during the initial clone. The thing we were still worried about was customer impact, and as long as the number didn’t skyrocket and then stay high, it was an acceptable risk.

After I got the tmux session all up and happy, I ran these two commands to start the slaveof:

# on the master
redis-master:6379> slaveof no one
# on the slave
redis-slave:6379> slaveof redis-master 6379
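
If you want a quick sanity check that the link actually came up (same hostnames as above, fields straight out of INFO):

redis-cli -h redis-master info | grep role                          # role:master
redis-cli -h redis-slave info | grep -E "role|master_link_status"   # role:slave, link up once the sync completes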

It worked like a charm. The watch "redis-cli -h redis-slave info | grep master_sync_left_bytes" started showing me the amount of data left to transfer, and before I knew it I had a working master/slave configuration. This was a lot of cruft to basically say slaveof redis-master 6379, but hell, it’s sometimes nice to see the thought process around it. Oh, and the latency did spike from .05 to 800, but I discovered that was in milliseconds and only ONE sample. So one call to the master spiking to something still sub-second was well within the risk params. Next (part 2) I’ll post about my monitoring of this; it’s not that exciting, but it is useful.