Subscribe: antirez weblog
Added By: Feedage Forager Feedage Grade C rated
Language: English
cluster  code  data  good  master  memory  new  order  redis cluster  redis  replication  set  system  things  time  work 
Rate this Feed
Rate this feedRate this feedRate this feedRate this feedRate this feed
Rate this feed 1 starRate this feed 2 starRate this feed 3 starRate this feed 4 starRate this feed 5 star

Comments (0)

Feed Details and Statistics Feed Statistics
Preview: antirez weblog

Description pending


A short tale of a read overflow
[This blog post is also experimentally available on Medium:] When a long running process crashes, it is pretty uncool. More so if the process happens to take a lot of state in memory. This is why I love web programming frameworks that are able, without major performance overhead, to create a new interpreter and a new state for each page view, and deallocate every resource used at the end of the page generation. It is an inherently more reliable programming paradigm, where memory leaks, descriptor leaks, and even random crashes from time to time do not constitute a serious issue. However system software like Redis is at the other side of the spectrum, a side populated by things that should never crash. Months ago I received a crash report from my colleague Dvir Volk. He was developing his RediSearch Redis module, so it was not clear if the crash was due to a programming error inside the module, perhaps corrupting the heap, or a bug inside Redis. However it looked a lot like a real problem into the radix tree implementation: === REDIS BUG REPORT START: Cut & paste starting from here === # Redis 999.999.999 crashed by signal: 11 # Crashed running the instuction at: 0x7fceb6eb5af5 # Accessing address: 0x7fce9c400000 | Backtrace: | redis-server *:7016 [cluster](raxRemoveChild+0xd3)[0x49af53] | redis-server *:7016 [cluster](raxRemove+0x34f)[0x49b34f | redis-server *:7016 [cluster](slotToKeyUpdateKey+0x1ad)[0x4415dd] The radix tree is full of memmove() calls, and Redis crashed exactly trying to access a memory address that was oddly zero padded at the end: 0x7fce9c400000. My first thought was, I’m sure I’m doing some wrong memory movement here, and the address gets overwritten with zeroes, leading to the crash when the program attempts the deference the address. I’m pretty proud about my radix tree implementation. Not because of the implementation itself, while it’s a complex data structure to implement, it’s not rocket science. But because of the fuzz tester that comes with it and is able to cover the whole source code (trivial) and a lot of non trivial state (that’s definitely more interesting). The fuzz tester does not do fuzzing just to reach a crash, it compares the implementation of the radix tree dictionary and iterator with a reference implementation that uses an hash table and qsort, to have exactly the same semantics, but in a short and easy to audit implementation. After receiving the crash report I improved the fuzz tester, ran it for days, with and without Valgrind, wrote additional data models, created tests using 100 millions of keys, but despite the efforts I could not reproduce the crash. A few days ago I would discover that there was no bug in the implementation I was testing, but at the time I was not aware of that. I could simply never find a bug that was not there. So after failing to reproduce I gave up. One week ago I received two other additional bug reports that were almost the same. And again, the addresses were zero padded. Dvir crash: Accessing address: 0x7fce9c400000 Issue 4605: Accessing address: 0x7f2959e00000 Issue 4642: Accessing address: 0x7f0e9b800000 It was time to read every memmove, memcpy, realloccall inside the source code, trying to figure out if there was something wrong that for some reason the fuzz tester was not able to catch. I found nothing, but then inspecting the Redis crash report I noticed something funny. During crashes Redis reports the memory mapped areas of the process, things like the following: *** Preparing to test memory region 7f0e8c400000 (255852544 bytes) Now, if you add 255852544to 0x7f0e8c400000, the result is 0x7f0e9b800000 , which is exactly the accessed address in the crash reported in issue 4642. So the program was not crashing because the memory address was corrupted, but because it was accessing an address immediately after the end of the heap. I checked the other issues, and the same was true in all the instances. Basically the end of the heap, at the edge of the st[...]

An update on Redis Streams development
I saw multiple users asking me what is happening with Streams, when they’ll be ready for production uses, and in general what’s the ETA and the plan of the feature. This post will attempt to clarify a bit what comes next.

To start, in this moment Streams are my main priority: I want to finish this work that I believe is very useful in the Redis community and immediately start with the Redis Cluster improvements plans. Actually the work on Cluster has already started, with my colleague Fabio Nicotra that is porting redis-trib, the Cluster management tool, inside the old and good redis-cli. This step involves translating the code from Ruby to C. In the meantime, a few weeks ago I finished writing the Streams core, and I deleted the “streams” feature branch, merging everything into the “unstable” branch.

Later I reviewed again, several times actually, the specification for consumer groups. A few weeks ago I finally was happy with the result, so I started the implementation of this specification: Be aware that command names changed quite a bit… With the API being more like this:

I’m halfway, after the first 500 lines of code I today was able to see clients using the XREADGROUP command to see only their local history when fetching old messages. This was a milestone in the implementation. Also XACK is now available. Basically all the data structures supporting the streams consumer groups are working, but I need to finish all the commands implementations, support the blocking operations in XREADGROUP and so forth, handle replication, RDB and AOF persistence. More or less in one month I should have everything ready.

Because of consumer groups, the RDB format of the Streams is going to change a lot compared to what “unstable” is using right now. And since there is to break compatibility, we’ll also add the ability of RDB files to persist information about keys last access, so that a Redis restart (or slave loading) involving RDB will get all the info needed in order to continue doing the eviction of keys in the correct way. Right now all that information would be lost instead. Because of this radical RDB changes the plan changed a bit: Streams are going to appear in Redis 5.0 and not 4.0, but 5.0 will be released as GA in about two months: this is possible because 5.0 will be just a 4.0 plus streams, no other major features inside, if not for the RDB changes…

So TLDR: I’m working to Streams almost full time, if not for some time to check PRs / Issues for critical stuff, and in two months if you upgrade to Redis 5.0 you’ll have production ready streams with consumer groups fully implemented. There is also quite of a documentation effort to perform, I’ll write both the command manuals and an extensive introduction to streams once we go RC1, which is in about one month, so that people will have documentation to support their tests during the release candidate stage.

I hope this clarifies what’s happening with Streams. See you soon with Redis 5.0 :-) Comments

Redis PSYNC2 bug post mortem
Four days ago a user posted a critical issue in the Redis Github repository. The problem was related to the new Redis 4.0 PSYNC2 replication protocol, and was very critical. PSYNC2 brings a number of good things to Redis replication, including the ability to resynchronize just exchanging the differences, and not the whole data set, after a failover, and even after a slave controlled restart. The problem was about this latter feature: with PSYNC2 the RDB file is augmented with replication information. After a slave is restarted, the replication metadata is loaded back, and the slave is able to perform a PSYNC attempt, trying to handshake with the master and receive the differences since the last disconnection. All this is good news from the point of view of Redis operations, however while PSYNC2 was pretty solid since the introduction in Redis 4.0.0 stable, the feature involving a restarting slave was definitely lacking reliability. There were two problems with this feature: the first being that it was a last-minute addition to PSYNC2, and was not part of the original design document. It was more like an obvious extension of the work we did in PSYNC2, but was not scrutinized at the same level of the rest of the specification for potential bugs and issues. The second problem was due to the fact that the feature is more complex than it looks like initially, for the fact that it’s tricky to really restore *all* the state the replication has in the slave side after a restart. Moreover, failing to restore certain bits of the state, does not result in evident bugs most of the times, so they are hard to spot via integration testings. Only once specific conditions happen the lack of some state will result in problems. For instance failing to correctly reconstruct the currently selected DB in the slave replication state, will create problems only when there are writes happening in different Redis DBs, and such bug would not store correctly the currently selected DB only under special conditions. Thanks to the help of many Redis contributors, but especially thanks to the @soloestoy Github user, a Redis developer at Alibaba, recently we worked at improving a number of PSYNC2 potential issues. All the work done would finally end inside Redis 4.0.3 in a few days, after much testing of all the patches wrote so far. However once I received issue #4483, I understood I had to rush the release of Redis 4.0.3 because this was far more serious than the other Redis 4 PSYNC2 issues that we had discovered up to that point. The bug described in issue #4483 was fairly simple, yet very problematic. After a slave restart and a resulting reloading of the replication state from the RDB, the master replication backlog could contain Lua scripts executions in the form of EVALSHA commands. However the slave Lua scripting engine is flushed of all the scripts after the restart, so is not able to process such commands. This results in the slave not processing writes originated by Lua scripts, unless such scripts were using “commands replication”, which is not the default. The default is to replicate the script itself. I went a bit in panic mode… and wrote a few alternative patches that I submitted to the issue. At the end the one that did not require RDB incompatibilities with older versions was picked, so I went ahead and released Redis 4.0.3 ASAP. However I did a mistake… It’s a few weeks that I work from an office instead of working from home. Normally I do not work at night, but for 4.0.3, when I returned back home, I opened my home laptop and merged the patch into the 4.0 branch, and did some testing activity. The next day when I returned to the office, I continued working from another computer, but I did not realize that I was missing one commit that I merged at home, without pushing it to the repository. So I basically released a 4.0.3 that had all the PSYNC2 fixes minus the most important one for the replication bug. I released ASAP a new patch level version, Redis 4.0.4, includ[...]

Streams: a new general purpose data structure in Redis.
Until a few months ago, for me streams were no more than an interesting and relatively straightforward concept in the context of messaging. After Kafka popularized the concept, I mostly investigated their usefulness in the case of Disque, a message queue that is now headed to be translated into a Redis 4.2 module. Later I decided that Disque was all about AP messaging, which is, fault tolerance and guarantees of delivery without much efforts from the client, so I decided that the concept of streams was not a good match in that case. However, at the same time, there was a problem in Redis, that was not taking me relaxed about the data structures exported by default. There is some kind of gap between Redis lists, sorted sets, and Pub/Sub capabilities. You can kindly use all these tools in order to model a sequence of messages or events, but with different tradeoffs. Sorted sets are memory hungry, can’t model naturally the same message delivered again and again, clients can’t block for new messages. Because a sorted set is not a sequential data structure, it’s a set where elements can be moved around changing their scores: no wonder if it was not a good match for things like time series. Lists have different problems creating similar applicability issues in certain use cases: you cannot explore what is in the middle of a list because the access time in that case is linear. Moreover no fan-out is possible, blocking operations on list serve a single element to a single client. Nor there was a fixed element identifier in lists, in order to say: given me things starting from that element. For one-to-many workloads there is Pub/Sub, which is great in many cases, but for certain things you do not want fire-and-forget: to retain a history is important, not just to refetch messages after a disconnection, also because certain list of messages, like time series, are very important to explore with range queries: what were my temperature readings in this 10 seconds range? The way I tried to address the above problems, was planning a generalization of sorted sets and lists into a unique more flexible data structure, however my design attempts ended almost always in making the resulting data structure ways more artificial than the current ones. One good thing about Redis is that the data structures exported resemble more the natural computer science data structures, than, “this API that Salvatore invented”. So in the end, I stopped my attempts, and said, ok that’s what we can provide so far, maybe I’ll add some history to Pub/Sub, or some more flexibility to lists access patterns in the future. However every time an user approached me during a conference saying “how would you model time series in Redis?” or similar related questions, my face turned green. Genesis ======= After the introduction of modules in Redis 4.0, users started to see how to fix this problem themselves. One of them, Timothy Downs, wrote me the following over IRC: the module I'm planning on doing is to add a transaction log style data type - meaning that a very large number of subscribers can do something like pub sub without a lot of redis memory growth subscribers keeping their position in a message queue rather than having redis maintain where each consumer is up to and duplicating messages per subscriber This captured my imagination. I thought about it a few days, and realized that this could be the moment when we could solve all the above problems at once. What I needed was to re-imagine the concept of “log”. It is a basic programming element, everybody is used to it, because it’s just as simple as opening a file in append mode and writing data to it in some format. However Redis data structures must be abstract. They are in memory, and we use RAM not just because we are lazy, but because using a few pointers, we can conceptualize data structures and make them abstract, to allow them to break free from the obvious limits. For instance normally[...]

Doing the FizzleFade effect using a Feistel network
Today I read an interesting article about how the Wolfenstein 3D game implemented a fade effect using a Linear Feedback Shift Register. Every pixel of the screen is set red in a pseudo random way, till all the screen turns red (or other colors depending on the event happening in the game). The blog post describing the implementation is here and is a nice read: You may wonder why the original code used a LFSR or why I'm proposing a different approach, instead of the vanilla setPixel(rand(),rand()): doing this with a pseudo random generator, as noted in the blog post, is slow, but is also visually very unpleasant, since the more red pixels you have on the screen already, the less likely is that you hit a new yet-not-red pixel, so the final pixels take forever to turn red (I *bet* that many readers of this blog post tried it in the old times of the Spectum, C64, or later with QBASIC or GWBasic). In the final part of the blog post the author writes: "Because the effect works by plotting pixels individually, it was hard to replicate when developers tried to port the game to hardware accelerated GPU. None of the ports managed to replicate the fizzlefade except Wolf4SDL, which found a LFSR taps configuration to reach resolution higher than 320x200.” While not rocket science, it was possibly hard for other resolutions to find a suitable LFSR. However regardless of the real complexity of finding an appropriate LFSR for other resolutions, the authors of the port could use another technique, called a Feistel Network, to get exactly the same result in a trivial way. What is a Feistel Network? === It’s a building block typically used in cryptography: it creates a transformation between a sequence of bits and another sequence of bits, so that the transformation is always invertible, even if you use all the kind of non linear transformations inside the Feistel network. In practical terms the Feistel network can, for example, translate a 32 bit number A into another 32 bit number B, according to some function F(), so that you can always go from B to A later. Because the function is invertible, it implies that for every input value the Feistel network generates *a different* output value. This is a simple Feistel network in pseudo code: Split the input into L and R halves (Example: L = INPUT & 0xFF, R = INPUT >> 8) REPEAT for N rounds: next_L = R R = L XOR F(R) L = next_L END RETURN the value composing L and R again into a single sequence of bits: R 8; for (var i = 0; i < 8; i++) { var nl = r; var F = (((r * 11) + (r >> 5) + 7 * 127) ^ r) & 0xff; r = l ^ F; l = nl; } return ((r[...]

The mythical 10x programmer
A 10x programmer is, in the mythology of programming, a programmer that can do ten times the work of another normal programmer, where for normal programmer we can imagine one good at doing its work, but without the magical abilities of the 10x programmer. Actually to better characterize the “normal programmer” it is better to say that it represents the one having the average programming output, among the programmers that are professionals in this discipline. The programming community is extremely polarized about the existence or not of such a beast: who says there is no such a thing as the 10x programmer, who says it actually does not just exist, but there are even 100x programmers if you know where to look for. If you see programming as a “linear” discipline, it is clear that the 10x programmer looks like an irrational possibility. How can a runner run 10x faster than another one? Or a construction worker build 10x the things another worker can build in the same time? However programming is a design discipline, in a very special way. Even when a programmer does not participate in the actual architectural design of a program, the act of implementing it still requires a sub-design of the implementation strategy. So if the design and implementation of a program are not linear abilities, things like experience, coding abilities, knowledge, recognition of useless parts, are, in my opinion, not just linear advantages, they work together in a multiplicative way in the act of creating a program. Of course this phenomenon happens much more when a programmer can both handle the design and the implementation of a program. The more “goal oriented” is the task, the more a potential 10x programmer can exploit her/his abilities in order to reach the goal with a lot less efforts. When the task at hand is much more rigid, with specific guidelines about what tools to use and how to implement things, the ability of a 10x programmer to perform a lot of work in less time is weakened: it can still exploit “local” design possibilities to do a much better work, but cannot change in more profound ways the path used to reach the goal, that may include, possibly, even eliminating part of the specification completely from the project, so that the goal to be reached looks almost the same but the efforts to reach it are reduced by a big factor. In twenty years of working as a programmer I observed other programmers working with me, as coworkers, guided by me in order to reach a given goal, providing patches to Redis and other projects. At the same time many people told me that they believe I’m a very fast programmer. Considering I’m far from being a workaholic, I’ll also use myself as a reference of coding things fast. The following is a list of qualities that I believe make the most difference in programmers productivity. * Bare programming abilities: getting sub-tasks done One of the most obvious limits, or strengths, of a programmer is to deal with the sub-task of actually implementing part of a program: a function, an algorithm or whatever. Surprisingly the ability to use basic imperative programming constructs very efficiently in order to implement something is, in my experience, not as widespread as one may think. In a team sometimes I observed very incompetent programmers, that were not even aware of a simple sorting algorithm, to get more work done than graduated programmers that were in theory extremely competent but very poor in the practice of implementing solutions. * Experience: pattern matching By experience I mean the set of already explored solutions for a number of recurring tasks. An experienced programmer eventually knows how to deal with a variety of sub tasks. This avoids both a lot of design work, but especially, is an extremely powerful weapon against design errors, that are in turn among the biggest enemies of simplicity. * Focus: actual time VS hypothetical t[...]

Redis on the Raspberry Pi: adventures in unaligned lands
After 10 million of units sold, and practically an endless set of different applications and auxiliary devices, like sensors and displays, I think it’s deserved to say that the Raspberry Pi is not just a success, it also became one of the preferred platforms for programmers to experiment in the embedded space. Probably with things like the Pi zero, it is also becoming the platform in order to create hardware products, without incurring all the risks and costs of designing, building, and writing software for vertical devices. Well, I love to think that also Redis is a platform that programmers like to use when to hack, experiment, build new things. Moreover devices that can be used for embedded / IoT applications, often have the problem of temporarily or permanently storing data, for example received by sensors, on the device, to perform on-device computations or to send them to remote servers. Redis is adding a “Stream” data type that is specifically suited for streams of data and time series storage, at this point the specification is near complete and work to implement it will start in the next weeks. Redis existing data structures, and the new streams, together with the small memory footprint, the decent performances it can provide even while running on small hardware (and resulting low energy usage), looked like a good match for Raspberry Pi potential applications, and in general for small ARM devices. The missing piece was the obvious one: to run well on the Pi. One of the many cool things about the Pi is that its development environment does not look like the embedded development environments of a few years ago… It just runs Linux, with all the Debian-alike tooling you expect to find. Basically adapting Redis to work on the Pi was not a huge task. The most fundamental mismatch a Linux system program and the Pi could have, is a performance / footprint mismatch, but this a non issue because of the Redis design itself: an empty instance consumes a total of 1MB of Resident Set Size, serves queries from memory, so it is fast enough and does not stress the flash disk too much, and when persistence is needed, it uses AOF which has an append-only write pattern. However the Pi runs an ARM processor, and this requires some care when dealing with unaligned accesses. In this blog post, while showing you what I did to make Redis and Raspberry Pi more happy together, I’ll try to provide an overview about dealing with architectures that do not handle unaligned accesses transparently as the x86 platform does. A few things about ARM processors — The most interesting thing about porting Redis to ARM is that ARM processors are, or actually were… well, not big fans of unaligned memory accesses. If you live your life in high level programming, you may not know it, but many processor architectures were historically not able to load or store memory words in addresses not multiple of the word size. So if the word size is 4 bytes (in the case of a 32 bit processor), you may load or store a word at address 0x4, 0x8, and so forth, but not at address 0x7. The result is an exception sometimes, or an odd behavior some other time, depending on the CPU and its exact configuration. Then the x86 processors family ruled the world and everybody kinda forgot about this issue (if not for dealing with SEE instructions and alike, but now even those instructions have unaligned variants). Oh well, initially forgetting about the issue is not really what happened. Even if x86 processors could deal with unaligned accesses without raising an exception, doing so was a non trivial performance penalty: partial reads/writes at word boundary required to do the double of the work. But then recent x86 processors have optimizations that make unaligned accesses as fast as aligned accesses most of the times, so basically nowadays for x86 this is really Not An Issue. ARM was, up to ARM v5,[...]

The first release candidate of Redis 4.0 is out
It’s not yet stable but it’s soon to become, and comes with a long list of things that will make Redis more useful for we users: finally Redis 4.0 Release Candidate 1 is here, and is bold enough to call itself 4.0 instead of 3.4. For me semantic versioning is not a thing, what I like instead is try to communicate, using version numbers and jumps, what’s up with the new version, and in this specific case 4.0 means “this is the shit”. It’s just that Redis 4.0 has a lot of things that Redis should have had since ages, in a different world where one developer can, like Ken The Warrior, duplicate itself in ten copies and start to code. But it does not matter how hard I try to learn about new vim shortcuts, still the duplicate-me thing is not in my chords. But well, finally with 4.0 we got a lot of these stuff done… and here is the list of the big ones, with a few details and pointers to learn more. 1. Modules As you probably already know Redis 4.0 got a modules system, and one that allows to do pretty fancy stuff like implementing new data types that are RDB/AOF persisted, create non blocking commands and so forth. The deal here is that all this is done with an higher level abstract API completely separated from the core, so the modules you write are going to work with new releases of Redis. Using modules I wrote Neural Redis, a neural network data type that can be trained inside Redis itself, and many people are doing very interesting stuff: there are new rate limiting commands (implemented in Rust!), Graph DBs on top of Redis, secondary indexes, time series modules, full text indexing, and a number of other stuff, and my feeling is that’s just the start. This does not just allow Redis to grow and cover new things, while taking the core, if not minimal, just with things that are useful to most users at least, pretty generic things that many people need. But also has the potential to avoid for many tasks the problem of rewriting a networked server, even if the goal is to create something not related to Redis, databases, caching, or whatever Redis is. That is, you can just write a module to use the Redis “infrastructure”: the protocol, the clients people already wrote and so forth. So I’ve a good feeling about it, no pressure for the core, freedom for the users that want to do more crazy stuff. 2. Replication version 2 So, that’s going to be very useful in production from the POV of operations. At some point in the past we introduced what is known as “PSYNC”. It was a new master-slave protocol that allowed the master and the salve to continue from where they were, if the connection between the two broke. Before that, every single connection break in the replication link between master and slave would result into a full synchronization: generate an RDB file in the master, transfer it, load it in the slave, ok you know how this works. So PSYNC was like a real improvement. But not enough… PSYNC was not good enough when there was a failover. If a slave is promoted to master, the slaves that replicated with the old master were not able to connect to the new promoted slave and PSYNC with it: a full resynchronization was needed. This is not great, and is not good for Redis Cluster as well. However fixing this required to do changes to the replication protocol, because I really wanted to make sure that partial resyncs where going to work after any possible topology change, as long as there was a common replication history among the instances. So the first change needed was about how “chained replication” works, that is, slaves of slaves of slaves … How do they work? Like, A is the master, and we have something like that: A —> B —> C —> D So A is the master of B but B is the master of C and so forth. Before Redis 4.0 what was happening is that B received the replication protocol from A. The rep[...]

Random notes on improving the Redis LRU algorithm
Redis is often used for caching, in a setup where a fixed maximum memory to use is specified. When new data arrives, we need to make space by removing old data. The efficiency of Redis as a cache is related to how good decisions it makes about what data to evict: deleting data that is going to be needed soon is a poor strategy, while deleting data that is unlikely to be requested again is a good one. In other terms every cache has an hits/misses ratio, which is, in qualitative terms, just the percentage of read queries that the cache is able to serve. Accesses to the keys of a cache are not distributed evenly among the data set in most workloads. Often a small percentage of keys get a very large percentage of all the accesses. Moreover the access pattern often changes over time, which means that as time passes certain keys that were very requested may no longer be accessed often, and conversely, keys that once were not popular may turn into the most accessed keys. So in general what a cache should try to do is to retain the keys that have the highest probability of being accessed in the future. From the point of view of an eviction policy (the policy used to make space to allow new data to enter) this translates into the contrary: the key with the least probability of being accessed in the future should be removed from the data set. There is only one problem: Redis and other caches are not able to predict the future. The LRU algorithm === While caches can’t predict the future, they can reason in the following way: keys that are likely to be requested again are keys that were recently requested often. Since usually access patterns don’t change very suddenly, this is an effective strategy. However the notion of “recently requested often” is more insidious that it may look at a first glance (we’ll return shortly on this). So this concept is simplified into an algorithm that is called LRU, which instead just tracks the *last time* a key was requested. Keys that are accessed with an higher frequency have a greater probability of being idle (not accessed) for a shorter time compared to keys that are rarely accessed. For instance this is a representation of four different keys accesses over time. Each “~” character is one second, while the “|” line at the end is the current instant. ~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~| ~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~| ~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~| ~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D| Key A is accessed one time every 5 seconds, key B once every 2 seconds and key C and D are both accessed every 10 seconds. Given the high frequency of accesses of key B, it has among the lowest idle times, which means its last access time is the second most recent among all the four keys. Similarly A and C idle time of 2 and 6 seconds well reflect the access frequency of both those keys. However as you can see this trick does not always work: key D is accessed every 10 seconds, however it has the most recent access time of all the keys. Still, in the long run, this algorithm works well enough. Usually keys with a greater access frequency have a smaller idle time. The LRU algorithm evicts the Least Recently Used key, which means the one with the greatest idle time. It is simple to implement because all we need to do is to track the last time a given key was accessed, or sometimes this is not even needed: we may just have all the objects we want to evict linked in a linked list. When an object is accessed we move it to the top of the list. When we want to evict objects, we evict from the tail of the list. Tada! Win. LRU in Redis: the genesis === Initially Redis had no support for LRU eviction. It was added later, when memory efficiency was a big concern. By modifying a bit the Redis Object structure I was able to make 24 bits of space. There was no room [...]

Writing an editor in less than 1000 lines of code, just for fun
WARNING: Long pretty useless blog post. TLDR is that I wrote, just for fun, a text editor in less than 1000 lines of code that does not depend on ncurses and has support for syntax highlight and search feature. The code is here: Screencast here: For the sentimentalists, keep reading… A couple weeks ago there was this news about the Nano editor no longer being part of the GNU project. My first reaction was, wow people still really care about an old editor which is a clone of an editor originally part of a terminal based EMAIL CLIENT. Let’s say this again, “email client”. The notion of email client itself is gone at this point, everything changed. And yet I read, on Hacker News, a number of people writing how they were often saved by the availability of nano on random systems, doing system administrator tasks, for example. Nano is also how my son wrote his first program in C. It’s an acceptable experience that does not require past experience editing files. This is how I started to think about writing a text editor ways smaller than Nano itself. Just for fun, basically, because I like and admire small programs. How lame, useless, wasting of time is today writing an editor? We are supposed to write shiny incredible projects, very cool, very complex, valuable stuff. But sometimes to make things without a clear purpose is refreshing. There were also memories… I remember my first experiences with the Commodore 16 in-ROM assembler, and all the home computers and BASIC interpreters I used later in my child life. An editor is a fundamental connection between the human and the machine. It allows the human to write something that the computer can interpret. The blinking of a square prompt is something that many of us will never forget. Well, all nice, but my time was very limited. A few hours across two weekends with programmed family activities and meat to prepare for friends in long barbecue sessions. Maybe I could still write an editor on a few spare hours with some trick. My goal was to write an editor which was very small, no curses, and with syntax highlighting. Something usable, basically. That’s the deal.. It’s little stuff, but is already hard to write all this from scratch in a few hours. But … wait, I actually wrote an editor in the past, is part of the LOAD81 project, a Lua based programming environment for children. Maybe I can just reuse it… and instead of using SDL to write on the screen what about sending VT100 escape sequences directly to the terminal? And I’ve code for this as well in linenoise, another toy project that eventually found its place in some may other serious projects. So maybe mixing the two… The first week Saturday morning I went to the sea, and it was great. Later my brother arrived from Edinburg to Catania, and at the end of the day we were together in the garden with our laptops, trying to defend ourselves from the 30 degrees that there were during the day, so I started to hack the first skeleton of the editor. The LOAD81 code was quite modular, to take it away from the original project was a joke. I could kinda edit after a few hours, and it was already time to go bed. The next day I worked again at it before leaving for the sea again. My 15yo sleeps till 1pm, as I did when I was 15yo in summertime after all, so I coded more in sprints of 30 minutes waiting for him to get up, using the rest of the time to play with my wonderful daughter. Finally later in the Sunday night I tried to fix all the remaining stuff. Hey I remember that a few years ago to hack on a project was, to *hack* on it, full time for days. I’m old now, but still young enough to write toy editors and consider it a serious business :-) However life is hard, and Monday arr[...]

Programmers are not different, they need simple UIs.
I’m spending days trying to get a couple of APIs right. New APIs about modules, and a new Redis data type.
I really mean it when I say *days*, just for the API. Writing drafts, starting the implementation shaping data structures and calls, and then restarting from scratch to iterate again in a better way, to improve the design and the user facing part.

Why I do that, delaying features for weeks? Is it really so important?
Programmers are engineers, maybe they should just adapt to whatever API is better to export for the system exporting it.

Should I really reply to my rhetorical questions? No, it is no longer needed today, and that’s a big win.

I want to assume that at this point is tacit, given for granted, that programmers also have user interfaces, and that such user interfaces are so crucial to completely change the perception of a system. Database query languages, libraries calls, programming languages, Unix command line tools, they all have an User Interface part. If you use them daily, for you they are more UIs than anything else.

So if this is all well known why I’m here writing this blog post? Because I want to stress how important is the concept of simplicity, not just in graphical UIs, but also in UIs designed for programmers. The act of iterating again and again to find a simple UI solution is not a form of perfectionism, it’s not futile narcissism. It is more an exploration in the design space, that sometimes is huge, made of small variations that make a big difference, and made of big variations that completely change the point of view. There are no rules to follow but your sensibility. Yes there are good practices, but they are not a good compass when the sea to navigate is the one of the design *space*.

So why programmers should have this privilege of having good, simple UIs? Sure there is the joy of using something well made, that is great to handle, that feels *right*. But there is a more central question. Learning to configure Sendmail via M4 macros, or struggling with an Apache virtual host setup is not real knowledge. If such a system one day is no longer in use, what remains in your hands, or better, in your neurons? Nothing. This is ad-hoc knowledge. It is like junk food: empty calories without micronutrients.

For programmers, the micronutrients are the ideas that last for decades, not the ad-hoc junk. I don’t want to ship junk, so I’ll continue to refine my designs before shipping. You should not accept junk, your neurons are better spent to learn general concepts. However in part it is inevitable: every system will have something that is not general that we need to learn in order to use it. Well, if that’s the deal, at least, let’s make the ad-hoc part a simple one, and if possible, something that is even fun to use. Comments

Redis Loadable Modules System
It was a matter of time but it eventually happened. In the Redis 1.0 release notes, 7 years ago, I mentioned that one of the interesting features for the future was “loadable modules”. I was really interested in such a feature back then, but over the years I became more and more skeptic about the idea of adding loadable modules in Redis. And probably for good reasons. Modules can be the most interesting feature of a system and the most problematic one at the same time: API incompatibilities between versions, low quality modules crashing the system, a lack of identity of a system that is extendible are possible problems. SO, for years, I managed to avoided adding modules to Redis, and Lua scripting was a good tool in order to delay their addition. At the same time, years of experience with scripting, demonstrated that scripting is a way to “compose” existing features, but not a way to extend the capabilities of a system towards use cases it was not designed to cover. Previous attempts at modules also showed that one of the main pain points about mixing Redis and loadable modules is the way modules are bound with the Redis core. In may ways Redis resembles more a programming language than a database. To extend Redis properly, the module needs to have access to the internal API of the system. Directly exporting the Redis core functions to modules creates huge problems: the module starts to depend on the internal details of Redis. If the Redis core evolves, the module needs to be rewritten. This creates either a fragile modules ecosystem or stops the evolution of the Redis core. Redis internals can’t stop to evolve, nor the modules developers can keep modifying the module in order to stay updated with the internals (something that happened in certain popular systems in the past, with poor results). With all this lessons in mind, I was leaving Catania to fly to Tel Aviv, to have a meeting at Redis Labs to talk about the roadmap for the future months. One of the topics of our chat was loadable modules. During the flight I asked myself if it was possible to truly decouple the Redis core from the modules API, but still have a low level access to directly manipulate Redis data structure. So I started to immediately code something. What I wanted was an extreme level of API compatibility for the future, so that a module wrote today could work in 4 years from now with the same API, regardless of the changes to the Redis core. I also wanted binary compatibility so that the 4 years old module could even *load* in the new Redis versions and work as expected, without even the need to be recompiled. At the end of the flight I arrived in Tel Aviv with something already working in the “modules” branch. We discussed together how the API would work, and at the end everybody agreed that to be able to manipulate Redis internals directly was a fundamental feature. What we wanted to accomplish was to allow Redis developers to create commands that were as capable as the Redis native commands, and also as fast as the native commands. This cannot be accomplished just with an high level API that calls Redis commands, it’s too slow and limited. There is no point in having a Redis modules system that can just do what Lua can already do. You need to be able to say, get me the value associated with this key, what type is it? Do this low level operation on the value. Given me a cursor into the sorted set at this position, go to the next element, and so forth. To create an API that works as an intermediate layer for such low level access is tricky, but definitely possible. I returned home and started to work at the modules system immediately. After a couple of weeks I had a prototype that was already functional enough to develop interestin[...]

Three ideas about text messages
I’m aboard of a flight bringing me to San Francisco. Eventually I purchased the slowest internet connection of my life (well at least for a good reason), but for several hours I was without internet, as usually when I fly. I don’t mind staying disconnected for some time usually. It’s a good time to focus, write some code, or a blog post like this one. However when I’m disconnected, what makes the most difference is not Facebook or Twitter or Github, but the lack of text messages. At this point text messages are a fundamental thing in my life. They are also probably the main source of distraction. I use messages to talk with my family, even just to communicate between different floors. I use messages with friends to organize dinners and vacations. I even use messages with the plumber or the doctor. Clearly messages are not going to go away. They need to get better, so I usually tend to thing at topics related to messaging applications. The following are three recurrent thoughts I’ve about this topic. 1. WhatsApp is not used in the US. Do you know what’s the social network that is reshaping the most the way we communicate in Europe? Is the WhatsApp application. Since it is the total standard and an incredible amount of the population has it, “WhatsApp Groups” is changing the way people communicate. There is a group for each school class, one for the children and one for the parents. One for families, another for groups of friends. One for the kindergarten of my daughter, where teachers post from time to time pictures of our children doing activities. WhatsApp is also one of the main pains of modern life. All this groups continuously beep. However what they are doing for the society is incredible: they are truly making every human being interconnected. This magic is possible because here in Italy, and in most other EU countries, WhatsApp is *the standard* so everybody assumes you use it. To me it is pretty incredible that this is not happening in the US, the place where most social applications are funded and created. Given that in the US there are both Android and iOS phones I wonder what’s stopping this from happening. My guess is that it’s just a matter of time and a unified messaging platform will happen there too. * Why voice recognition is not used more. For people that can write text fast into a real keyboard, the software keyboard of the phone is one of the most annoying things ever. Similarly, teenagers excluded, many other people have issues writing long text messages with a phone keyboard. At the same time, the voice recognition software of Android, powered by the Google servers, reached a point where it rarely does errors at all, so why just a few people use it on a daily basis? My theory is that the user interface to activate and use the voice recognition feature is so bad to be the main barrier to make this feature a big hit. First you have to find how to activate it, and usually is not a prominent button on the keyboard. Then there is this delay and it emits a beep and starts to listen, and it’s not clear exactly how to stop it, especially if the environment is aloud. The whole thing looks like if you are suffering from the servers latency, but voice is voice. With delays and non clear experience you kill the big advantage of talking to your phone. This should be just a “push the button while you talk” thing. When you push the button, the system starts to record your voice immediately and connects asynchronously to the servers. As text arrives back, it is shown to the user. When the user releases the button the voice thing is done. As simple as that. At the end, the words that were inserted in this way could be shown in some special way (like a differ[...]

Redis 3.2.0 is out!
It took more than expected, but finally we have it, Redis 3.2.0 stable is out with changes that may be useful to a big number of Redis users. At this point I covered the changes multiple time, but the big ones are: * The GEO API. Index whatever you want by latitude and longitude, and query by radius, with the same speed and easy of use of the other Redis data structures. Here you can find the API documentation: Thank you to Matt Stancliff for the initial implementation, that was reworked but is still at the core of the GEO API, and to the developers of ARDB for providing the geo indexing code that Matt used. * The new BITFIELD command allows to use a Redis string as a bit array composed of many integers with user specified size and offset. It supports increments and decrements to exploit this small (or large) integers as counters, with fine control over the overflow behavior. Thanks to Yoav Steinberg for pushing forward this crazy idea and motivating me to write an implementation. * Many improvements to Redis Cluster, including rebalancing capabilities in redis-trib and better replicas migration. However support for NATted envrionments and Docker port forwarding, while implemented in the “unstable” branch, was not backported to 3.2 for maturity concerns. * Improvements to Lua scripts replication. It is now possible to use “effects replication” to write scripts containing side effects. It’s documented here: Thanks to Yossi Gottlieb for stimulating the addition of this feature and helping in the design. * We have now a serious Lua scripts debugger: with support into redis-cli. Thanks to Itamar Haber, developer of non trivial scripts since the start of Lua scripting, that really wanted this feature and helped during the development. * Redis is now more memory efficient thanks to changes operated by Oran Agra to SDS and the Jemalloc size classes setup. Also we have a new List internal representation contributed by Matt Stancliff which uses just a small percentage of the memory used before to represent large lists! * Finally slaves and masters are in agreement about what keys are expired during read operations. It was about time :-) * SPOP now accepts an optional count argument. Thanks to Alon Diamant for providing the initial implementation. I kinda rewrote it for performances later, and we ended with a pretty fast thing. * RDB AUX fields. So now your RDB files have a few informations inside, like the server that generated it, in what date, and so forth. Soon we’ll have redis-cli able to parse those fields (in 3.2.1) hopefully. It’s a small amount of work but I’m remembering only know writing this notes, honestly. * RDB loading is faster now. * Sentinel can now scale monitoring many masters, but should be considered more an advanced beta than stable, so please test it in your environment carefully before deploying, a lot of code changed inside Sentinel internals. * Many more things but this list is already long enough. A big thank you to all the contributors that made this release possible, to the Redis user base which is lovely and encouraging, and to Redis Labs for sponsoring a lot of the work that went into 3.2.0. During the previous weeks we also worked to a new interesting feature that will be released in the next versions of Redis, it will be announced during RedisConf 2016 in San Francisco, 10-11 May, so stay tuned! One note about stability. I keep saying it all the time, but stability in software is not black and white. It’s like pink, yellow and green. No just kidding. It’s the usual shades of[...]

100 more of those BITFIELDs
Today Redis is 7 years old, so to commemorate the event a bit I passed the latest couple of days doing a fun coding marathon to implement a new crazy command called BITFIELD. The essence of this command is not new, it was proposed in the past by me and others, but never in a serious way, the idea always looked a bit strange. We already have bit operations in Redis: certain users love it, it’s a good way to represent a lot of data in a compact way. However so far we handle each bit separately, setting, testing, getting bits, counting all the bits that are set in a range, and so forth. What about implementing bitfields? Short or large, arbitrary sized integers, at arbitrary offsets, so that I can use a Redis string as an array of 5 bits signed integers, without losing a single bit of juice. A few days ago, Yoav Steinberg from Redis Labs, proposed a set of commands on arbitrary sized integers stored at bit offsets in a more serious way. I smiled when I read the email, since this was kinda of a secret dream. Starting from Yoav proposal and with other feedbacks from Redis Labs engineers, I wrote an initial specification of a single command with sub-commands, using short names for types definitions, and adding very fine grained control on the overflow semantics. I finished the first implementation a few minutes ago, since the plan was to release it for today, in the hope Redis appreciates we do actual work in its bday. The resulting BITFIELD command supports different subcommands: SET — Set the specified value and return its previous value. GET — Get the specified value. INCRBY — Increment the specified counter. There is an additional meta command called OVERFLOW, used to set (guess what) the overflow semantics of the commands that will follow (so OVERFLOW can be specified multiple times): OVERFLOW SAT — Saturation, so that overflowing in one direction or the other, will saturate the integer to its maximum value in the direction of the overflow. OVERFLOW WRAP — This is usual wrap around, but the interesting thing is that this also works for signed integers, by wrapping towards the most negative or most positive values. OVERFLOW FAIL — In this mode the operation is not performed at all if the value would overflow. The integer types can be specified using the “u” or “i” prefix followed by the number of bits, so for example u8, i5, u20 and i53 are valid types. There is a limitation: u64 cannot be specified since the Redis protocol is unable to return 64 bit unsigned integers currently. It's time for a few examples: in order to increment an unsigned integer of 8 bits I could use:> BITFIELD mykey incrby u8 100 1 1) (integer) 3 This is incrementing an unsigned integer, 8 bits, at offset 100 (101th bit in the bitmap). However there is a different way to specify offsets, that is by prefixing the offset with “#”, in order to say: "handle the string as an array of counters of the specified size, and set the N-th counter". Basically this just means that if I use #10 with an 8 bits type, the offset is obtained by multiplying 8*10, this way I can address multiple counters independently without doing offsets math:> BITFIELD mykey incrby u8 #0 1 1) (integer) 1> BITFIELD mykey incrby u8 #0 1 1) (integer) 2> BITFIELD mykey incrby u8 #1 1 1) (integer) 1> BITFIELD mykey incrby u8 #1 1 1) (integer) 2 The ability to control the overflow is also interesting. For example an unsigned counter of 1 bit will actually toggle between 0 and 1 with the default overflow policy of “wrap”:> BITFIELD mykey incrby u1 100 1 1) (integer) 1 127.0.0[...]

The binary search of distributed programming
Yesterday night I was re-reading Redlock analysis Martin Kleppmann wrote ( At some point Martin wonders if there is some good way to generate monotonically increasing IDs with Redis. This apparently simple problem can be more complex than it looks at a first glance, considering that it must ensure that, in all the conditions, there is a safety property which is always guaranteed: the ID generated is always greater than all the past IDs generated, and the same ID cannot be generated multiple times. This must hold during network partitions and other failures. The system may just become unavailable if there are less than the majority of nodes that can be reached, but never provide the wrong answer (note: as we'll see this algorithm has another liveness issue that happens during high load of requests). So for the sake of playing a bit more with distributed systems algorithms, and learn a bit more in the process, I tried to find a solution. Actually I was aware of an algorithm that could solve the problem. It’s an inefficient one, not suitable to generate tons of IDs per second. Many complex distributed algorithms, like Raft and Paxos, use it as a step in order to get monotonically increasing IDs, as a foundation to mount the full set of properties they need to provide. This algorithm is fascinating since it’s extremely easy to understand and implement, and because it’s very intuitive to understand *why* it works. I could say, it is the binary search of distributed algorithms, something easy enough but smart enough to let newcomers to distributed programming to have an ah!-moment. However I had to modify the algorithm in order to adapt it to be implemented in the client side. Hopefully it is still correct (feedbacks appreciated). While I’m not going to use this algorithm in order to improve Redlock (see my previous blog post), I think that trying to solve this kind of problems is both a good exercise, and it may be an interesting read for people approaching distributed systems for the first times looking for simple problems to play with in real world systems. ## How it works? The algorithm requirements are the following two: 1. A data store that supports the operation set_if_less_than(). 2. A data store that can fsync() data to disk on writes, before replying to the client. The above includes almost any *SQL server, Redis, and a number of other stores. We have a set of N nodes, let’s assume N = 5 for simplicity in order to explain the algorithm. We initialize the system by setting a key called “current” to the value of 0, so in Redis terms, we do: SET current 0 In all the 5 instances. This is part of the initialization and must be done only when a new “cluster” is initialized. This step can be skipped but makes the explanation simpler. In order to generate a new ID, this is what we do: 1: Get the “current” value from the majority of instances (3 or more if N=5). 2: If we failed to reach 3 instances, GOTO 1. 3: Get the maximum value among the ones we obtained, and increment it by 1. Let’s call this $NEXTID 4: Send the following write operation to all the nodes we are able to reach. IF current < $NEXTID THEN SET current $NEXTID return $NEXTID ELSE return NULL END 5: If 3 or more instances will reply $NEXTID, the algorithm succeeded, and we successfully generated a new monotonically increasing ID. 6: Otherwise, if we have not reached the majority, GOTO 1. What we send at step 4 can be easily translated to a simple Redis Lua script: local val = tonumber('g[...]

Is Redlock safe?
Martin Kleppmann, a distributed systems researcher, yesterday published an analysis of Redlock (, that you can find here: Redlock is a client side distributed locking algorithm I designed to be used with Redis, but the algorithm orchestrates, client side, a set of nodes that implement a data store with certain capabilities, in order to create a multi-master fault tolerant, and hopefully safe, distributed lock with auto release capabilities. You can implement Redlock using MySQL instead of Redis, for example. The algorithm's goal was to move away people that were using a single Redis instance, or a master-slave setup with failover, in order to implement distributed locks, to something much more reliable and safe, but having a very low complexity and good performance. Since I published Redlock people implemented it in multiple languages and used it for different purposes. Martin's analysis of the algorithm concludes that Redlock is not safe. It is great that Martin published an analysis, I asked for an analysis in the original Redlock specification here: So thank you Martin. However I don’t agree with the analysis. The good thing is that distributed systems are, unlike other fields of programming, pretty mathematically exact, or they are not, so a given set of properties can be guaranteed by an algorithm or the algorithm may fail to guarantee them under certain assumptions. In this analysis I’ll analyze Martin's analysis so that other experts in the field can check the two documents (the analysis and the counter-analysis), and eventually we can understand if Redlock can be considered safe or not. Why Martin thinks Redlock is unsafe ----------------------------------- The arguments in the analysis are mainly two: 1. Distributed locks with an auto-release feature (the mutually exclusive lock property is only valid for a fixed amount of time after the lock is acquired) require a way to avoid issues when clients use a lock after the expire time, violating the mutual exclusion while accessing a shared resource. Martin says that Redlock does not have such a mechanism. 2. Martin says the algorithm is, regardless of problem “1”, inherently unsafe since it makes assumptions about the system model that cannot be guaranteed in practical systems. I’ll address the two concerns separately for clarity, starting with the first “1”. Distributed locks, auto release, and tokens ------------------------------------------- A distributed lock without an auto release mechanism, where the lock owner will hold it indefinitely, is basically useless. If the client holding the lock crashes and does not recover with full state in a short amount of time, a deadlock is created where the shared resource that the distributed lock tried to protect remains forever unaccessible. This creates a liveness issue that is unacceptable in most situations, so a sane distributed lock must be able to auto release itself. So practical locks are provided to clients with a maximum time to live. After the expire time, the mutual exclusion guarantee, which is the *main* property of the lock, is gone: another client may already have the lock. What happens if two clients acquire the lock at two different times, but the first one is so slow, because of GC pauses or other scheduling issues, that will try to do work in the context of the shared resource at the same time with second client that acquired the lock? Martin says that this problem is avoided by having the distributed lock ser[...]

Disque 1.0 RC1 is out!
Today I’m happy to announce that the first release candidate for Disque 1.0 is available.

If you don't know what Disque is, the best starting point is to read the README in the Github project page at

Disque is a just piece of software, so it has a material value which can be zero or more, depending on its ability to make useful things for people using it. But for me there is an huge value that goes over what Disque, materially, is. It is the value of designing and doing something you care about. It’s the magic of programming: where there was nothing, now there is something that works, that other people may potentially analyze, run, use.

Distributed systems are a beautiful field. Thanks to Redis, and to the people that tried to mentor me in a way or the other, I got exposed to distributed systems. I wanted to translate this love to something tangible. A new, small system, designed from scratch, without prejudices and without looking too closely to what other similar systems were doing. The experience with Redis shown me that message brokers were a very interesting topic, and that in some way, they are the perfect topic to apply DS concepts. I even pretend message brokers can be fun and exciting. So I tried to design a new message queue, and Disque is the result.

Disque design goal is to provide a system with a good user experience: to provide certain guarantees in the context of messaging, guarantees which are easy to reason about, and to provide extreme operational simplicity. The RC1 offers the foundation, but there is more work to do. For once I hope that Disque will be tested by Aphyr with Jepsen in depth. Since Disque is a system that provides certain kinds of guarantees that can be tested, if it fails certain tests, this translates directly to some bug to fix, that means to end with a better system.

On the operational side there is to test it in the real world. AP and message queues IMHO are a perfect match to provide operational robustness. However I’m not living into the illusion that I got everything right in the first release, so it will take months (or years?) of iteration to *really* reach the operational simplicity I’m targeting. Moreover this is an RC1 that was heavily modified in the latest weeks, I expect it to have a non trivial amount of bugs.

From the point of view of making a fun and exciting system, I tried to end with a simple and small API that does not force the user to think at the details of *this specific* implementation, but more generally at the messaging problem she or he got. Disque also has a set of introspection capabilities that should help making it a non-opaque system that is actually possible to debug and observe.

Even with all the limits of new code and ideas, the RC release is a great first step, and I’m glad Disque is not in the list of side projects that we programmers start and never complete.

I was not alone during the past months, while hacking with Disque and trying to figure out how to shape it, I received the help of: He Sun, Damian Janowski, Josiah Carlson, Michel Martens, Jacques Chester, Kyle Kingsbury, Mark Paluch, Philipp Krenn, Justin Case, Nathan Fritz, Marcos Nils, Jasper Louis Andersen, Vojtech Vitek, Renato C., Sebastian Waisbrot, Redis Labs and Pivotal, and probably more people I’m not remembering right now. Thank you for your help.

The RC1 is tagged in the Disque Github repository. Have fun! Comments

Generating unique IDs: an easy and reliable way
Two days ago Mike Malone published an interesting post on Medium about the V8 implementation of Math.random(), and how weak is the quality of the PRNG used: The post was one of the top news on Hacker News today. It’s pretty clear and informative from the point of view of how Math.random() is broken and how should be fixed, so I’ve nothing to add to the matter itself. But since the author discovered the weakness of the PRNG in the context of generating large probably-non-colliding IDs, I want to share with you an alternative that I used multiple times in the past, which is fast and extremely reliable. The problem of unique IDs - - - So, in theory, if you want to generate unique IDs you need to store some state that makes sure an ID is never repeated. In the trivial case you may use just a simple counter. However the previous ID generated must be stored in a consistent way. In case of restart of the system, it should never happen that the same ID is generated again because our stored counter was not correctly persisted on disk. If we want to generate unique IDs using multiple processes, each process needs to make sure to prepend its IDs with some process-specific prefix that will never collide with another process prefix. This can be complex to manage as well. The simple fact of having to store in a reliable way the old ID is very time consuming when we want to generate an high number of IDs per second. Fortunately there is a simple solution. Generate a random number in a range between 0 and N, with N so big that the probability of collisions is so small to be, for every practical application, irrelevant. This works if the number we generate is uniformly distributed between 0 and N. If this prerequisite is true we can use the birthday paradox in order to calculate the probability of collisions. By using enough bits it’s trivial to make the probability of a collision billions of times less likely than an asteroid centering the Earth, even if we generate millions of IDs per second for hundreds of years. If this is not enough margin for you, just add more bits, you can easily reach an ID space larger than the number of atoms in the universe. This generation method has a great advantage: it is completely stateless. Multiple nodes can generate IDs at the same time without exchanging messages. Moreover there is nothing to store on disk so we can go as fast as our CPU can go. The computation will easily fit the CPU cache. So it’s terribly fast and convenient. Mike Malone was using this idea, by using the PRNG in order to create an ID composed of a set of characters, where each character was one of 64 possible characters. In order to create each character the weak V8 PRNG was used, resulting into collisions. Remember that our initial assumption is that each new ID must be selected uniformly in the space between 0 and N. You can fix this problem by using a stronger PRNG, but this requires an analysis of the PRNG. Another problem is seeding, how do you start the process again after a restart in order to make sure you don’t pick the initial state of the PRNG again? Otherwise your real ID space is limited by the seeding of the PRNG, not the output space itself. For all the above reasons I want to show you a trivial technique that avoids most of these problems. Using a crypto hash function to generate unique IDs - - - Cryptographic hash functions are non invertible functions that transform a sequence of bits into a fixed sequence of bits. They are designed in order to resist a variety of attacks, however in t[...]

6 years of commit visualized
Today I was curious about plotting all the Redis commits we have on Git, which are 90% of all the Redis commits. There was just an initial period where I used SVN but switched very soon.

Full size image here:


Each commit is a rectangle. The height is the number of affected lines (a logarithmic scale is used). The gray labels show release tags.

There are little surprises since the amount of commit remained pretty much the same over the time, however now that we no longer backport features back into 3.0 and future releases, the rate at which new patchlevel versions are released diminished.

Major releases look more or less evenly spaced between 2.6, 2.8 and 3.0, but were more frequent initially, something that should change soon as we are trying to switch to time-driven releases with 3 new major release each year (that obviously will contain less new things compared to the amount of stuff was present in major releases that took one whole year).
For example 3.2 RC is due for December 2015.

Patch releases of a given major release tend to have a logarithmic curve shape. As a release mature, in general it gets less critical bugs. Also attention shifts progressively to the new release.

I would love Github to give us stuff like that and much more. There is a lot of data in commits of a project that is there for years. This data should be analyzed and explored... it's a shame that the graphics section is apparently the same thing for years.

EDIT: The Tcl script used to generate this graph is here: Comments

Recent improvements to Redis Lua scripting
Lua scripting is probably the most successful Redis feature, among the ones introduced when Redis was already pretty popular: no surprise that a few of the things users really want are about scripting. The following two features were suggested multiple times over the last two years, and many people tried to focus my attention into one or the other during the Redis developers meeting, a few weeks ago. 1. A proper debugger for Redis Lua scripts. 2. Replication, and storage on the AOF, of Lua scripts as a set of write commands materializing the *effects* of the script, instead of replicating the script itself as we normally do. The second feature is not just a matter of how scripts are replicated, but also touches what you can do with Lua scripting as we will see later. Back from London, I implemented both the features. This blog post describes both, giving a few hints about the design and implementation aspects that may be interesting for the readers. A proper Lua debugger --- Lua scripting was initially conceived in order to write really trivial scripts. Things like: if the key exists do this. A couple of lines in order to avoid bloating Redis with all the possible variations of commands. Of course users did a lot more with it, and started to write complex scripts: from quad-tree implementations to full featured messaging systems with non trivial semantics. Lua scripting makes Redis programmable, and usually programmers can’t resist to programmable things. It helps that all the Lua scripts run using the same interpreter and are cached, so they are very fast. It is most of the time possible to do a lot more with a Redis instance by using Lua scripting, both functionally and in terms of operations per second. So complex scripts totally have their place today. We went from a very cold reception of the scripting feature (something as dynamic as a script sent to a database!), to mass usage, to writing complex scripts in a matter of a few years. However writing simple scripts and writing complex scripts is a completely different matter. Bigger programs become exponentially more complex, and you can feel this even when going from 10 to 200 lines of code. While you can debug by brute force any simple script, just trying a few variants and observing the effects on the data set, or putting a few logging instructions in the middle, with complex scripts you have a bad time without a debugger. My colleague Itamar Haber used a lot of his time to write complex scripts recently. At some point he also wrote some kind of debugger for Redis Lua scripting using the Lua debug library. This debugger no longer works since the debug library is now no longer exposed to scripts, for sandboxing concerns, and in general, what you want in a Redis debugger is an interactive and remote debugger, with a proper client able to work alongside with the server, to provide a good debugging experience. Debugging is already a lot of hard work, to have solid tools is really a must. The only way to accomplish this result, was to add proper debugging support inside Redis itself. So back from London Itamar and I started to talk about what a debugger should export to the user in order to be kinda of useful, and a real upgrade compared to the past. It was also discussed to just add support for the Lua debuggers that already exist outside the Redis ecosystem. However I strongly believe the user experience is enhanced when everything is designed specifically to work well with Redis, so in the end I decided to wrote the deb[...]

A few things about Redis security
IMPORTANT EDIT: Redis 3.2 security improved by implementing protected mode. You can find the details about it here: From time to time I get security reports about Redis. It’s good to get reports, but it’s odd that what I get is usually about things like Lua sandbox escaping, insecure temporary file creation, and similar issues, in a software which is designed (as we explain in our security page here to be totally insecure if exposed to the outside world. Yet these bug reports are often useful since there are different levels of security concerning any software in general and Redis specifically. What you can do if you have access to the database, just modify the content of the database itself or compromise the local system where Redis is running? How important is a given security layer in a system depends on its security model. Is a system designed to have untrusted users accessing it, like a web server for example? There are different levels of authorization for different kinds of users? The Redis security model is: “it’s totally insecure to let untrusted clients access the system, please protect it from the outside world yourself”. The reason is that, basically, 99.99% of the Redis use cases are inside a sandboxed environment. Security is complex. Adding security features adds complexity. Complexity for 0.01% of use cases is not great, but it is a matter of design philosophy, so you may disagree of course. The problem is that, whatever we state in our security page, there are a lot of Redis instances exposed to the internet unintentionally. Not because the use case requires outside clients to access Redis, but because nobody bothered to protect a given Redis instance from outside accesses via fire walling, enabling AUTH, binding it to if only local clients are accessing it, and so forth. Let’s crack Redis for fun and no profit at all given I’m the developer of this thing === In order to show the Redis “security model” in a cruel way, I did a quick 5 minutes experiment. In our security page we hint at big issues if Redis is exposed. You can read: “However, the ability to control the server configuration using the CONFIG command makes the client able to change the working directory of the program and the name of the dump file. This allows clients to write RDB Redis files at random paths, that is a security issue that may easily lead to the ability to run untrusted code as the same user as Redis is running”. So my experiment was the following: I’ll run a Redis instance in my Macbook Air, without touching the computer configuration compared to what I’ve currently. Now from another host, my goal is to compromise my laptop. So, to start let’s check if I can access the instance, which is a prerequisite: $ telnet 6379 Trying Connected to Escape character is '^]'. echo "Hey no AUTH required!" $21 Hey no AUTH required! quit +OK Connection closed by foreign host. Works, and no AUTH required. Redis is unprotected without a password set up, and so forth. The simplest thing you can do in such a case, is to write random files. Guess what? my Macbook Air happens to run an SSH server. What about trying to write something into ~/ssh/authorized_keys in order to gain access? Let’s start generating a new SSH key: $ ssh-keygen -t rsa -C "" Generating[...]

Moving the Redis community on Reddit
I’m just back from the Redis Dev meeting 2015. We spent two incredible days talking about Redis internals in many different ways. However while I’m waiting to receive private notes from other attenders, in order to summarize in a blog post what happened and what were the most important ideas exposed during the meetings, I’m going to touch a different topic here. I took the non trivial decision to move the Redis mailing list, consisting of 6700 members, to Reddit. This looks like a crazy ideas probably in some way, and “to move” is probably not the right verb, since the ML will still exist. However it will only be used in order to receive announcements of new releases, critical informations like security related ones, and from time to time, links to very important discussions that are happening on Reddit. Why to move? We have a huge mailing list that served us for years at this point, and there is a decent amount of traffic going on. People go there to get some help, to provide new ideas, and so forth. However while we have some traffic the Redis mailing list is IMHO far from the “vital” thing it should be, considering the number of users using Redis currently. For most parts we see the same questions again and again, and is hard to understand if a reply is really worthwhile or not. Moreover an important topic sometimes slides away because new topics arrive, sometimes without getting much attention at all. It’s like if the ML is just a far echo of the Redis popularity, not the center of its community. Twitter, while being a tool completely unsuitable for becoming the center of a community that needs to discuss things at length, is completely different, and gives me a feedback about how the Redis ML is in some way broken. It’s a lot more vital and it is possible to get quality feedbacks, but everything is limited to 140 characters in flat threads where eventually it is kinda impossible to continue any sane discussion. However Twitter and other signals I get, are the proof that people *moved away* from emails. I bet an huge amount of users subscribed to the ML just archive it into a label to read it never or seldom. So why Reddit instead? Because it’s the center of the most vital communities on the internet today. Because it’s centered on a voting system, so if an user asks for help, even if you don’t want to contribute, you can use the comment voting to tell the difference between a lame request and a great one, a poor reply and an outstanding one. Reddit also allows to vote the topics so that things which are important for the community will get more contributions. Has a side bar that can be used for FAQs and other important pointers, and has a ton of existing subscribers. Reddit also contains “gamification” elements that may be useful and funny. For example you can associate small sentences or images to your username in a given sub-reddit, in order to state, for example, if you use Redis for caching, as a store, for messaging or whatever. Your reply in a given context can be read more clearly if it is possible to understand what kind of Redis user you are. It is possible to write guidelines in the submission page, so that people realize what to provide before posting. For example we’ll have warnings telling you to post the INFO output of your master and slaves if you want us to investigate your replication issues. So, what happens now? I asked Reddit admins to get access to /r/redis, which is a [...]

Clarifications about Redis and Memcached
If you know me, you know I’m not the kind of guy that considers competing products a bad thing. I actually love the users to have choices, so I rarely do anything like comparing Redis with other technologies. However it is also true that in order to pick the right solution users must be correctly informed. This post was triggered by reading a blog post published by Mike Perham, that you may know as the author of a popular library called Sidekiq, that happens to use Redis as backend. So I would not consider Mike a person which is “against” Redis at all. Yet in his blog post that you can find at the URL he states that, for caching, “you should probably use Memcached instead [of Redis]”. So Mike simply really believes Redis is not good for caching, and he arguments his thesis in this way: 1) Memcached is designed for caching. 2) It performs no disk I/O at all. 3) It is multi threaded and can handle 100,000s of requests by scaling multi core. I’ll address the above statements, and later will provide further informations which are not captured by the above sentences and which are in my opinion more relevant to most caching users and use cases. Memcached is designed for caching: I’ll skip this since it is not an argument. I can say “Redis is designed for caching”. So in this regard they are exactly the same, let’s move to the next thing. It performs no disk I/O at all: In Redis you can just disable disk I/O at all if you want, providing you with a purely in-memory experience. Except, if you really need it, you can persist the database only when you are going to reboot, for example with “SHUTDOWN SAVE”. The bottom line here is that Redis persistence is an added value even when you don’t use it at all. It is multi threaded: This is true, and in my goals there is to make Redis I/O threaded (like in memcached, where the data access itself is not threaded, basically). However Redis, especially using pipelining, can serve an impressive amount of requests per second per thread (half a million is a common figure with very intensive pipelining. Without pipelining it is around 100,000 ops/sec). In the vanilla caching scenario where each Redis instance is the same, works as a master, disk ops are disabled, and sharding is up to the client like in the “memcached sharding model”, to spin multiple Redis processes per system is not terrible. Once you do this what you get is a shared-nothing multi threaded setup so what counts is the amount of operations you can serve per single thread. Last time I checked Redis was at least as fast as memcached per each thread. Implementations change over time so the edge today may be of the one or the other, but I bet they provide near performances since they both tend to maximize the resources they can use. Memcached multi threading is still an advantage since it makes things simpler to use and administer, but I think it is not a crucial part. There is more. Mike talks of operations per second without citing the *quality* of operations. The thing is in systems like Redis and Memcached the cost of command dispatching and I/O is dominating compared to actually touching the in-memory data structures. So basically in Redis executing a simple GET, a SET, or a complex operation like a ZRANK operation is about the same cost. But what you can achieve with a complex operation is a lot more work from [...]

Lazy Redis is better Redis
Everybody knows Redis is single threaded. The best informed ones will tell you that, actually, Redis is *kinda* single threaded, since there are threads in order to perform certain slow operations on disk. So far threaded operations were so focused on I/O that our small library to perform asynchronous tasks on a different thread was called bio.c: Background I/O, basically. However some time ago I opened an issue where I promised a new Redis feature that many wanted, me included, called “lazy free”. The original issue is here: The gist of the issue is that Redis DEL operations are normally blocking, so if you send Redis “DEL mykey” and your key happens to have 50 million objects, the server will block for seconds without serving anything in the meantime. Historically this was accepted mostly as a side effect of the Redis design, but is a limit in certain use cases. DEL is not the only blocking command, but is a special one, since usually we say: Redis is very fast as long as you use O(1) and O(log_N) commands. You are free to use O(N) commands but be aware that it’s not the case we optimized for, be prepared for latency spikes. This sounds reasonable, but at the same time, even objects created with fast operations need to be deleted. And in this case, Redis blocks. The first attempt — In a single-threaded server the easy way to make operations non-blocking is to do things incrementally instead of stopping the world. So if there is to free a 1 million allocations, instead of blocking everything in a for() loop, we can free 1000 elements each millisecond, for example. The CPU time used is the same, or a bit more, since there is more logic, but the latency from the point of view of the user is ways better. Maybe those cycles to free 1000 elements per millisecond were not even used. Avoiding to block for seconds is the key here. This is how many things inside Redis work: LRU eviction and keys expires are two obvious examples, but there are more, like incremental rehashing of hash tables. So this was the first thing I tried: create a new timer function, and perform the eviction there. Objects were just queued into a linked list, to be reclaimed slowly and incrementally each time the timer function was called. This requires some trick to work well. For example objects implemented with hash tables were also reclaimed incrementally using the same mechanism used inside Redis SCAN command: taking a cursor inside the dictionary and iterating it to free element after element. This way, in each timer call, we don’t have to free a whole hash table. The cursor will tell us where we left when we re-enter the timer function. Adaptive is hard — Do you know what is the hard part with this? That this time, we are doing a very special task incrementally: we are freeing memory. So if while we free memory incrementally, the server memory raises very fast, we may end, for the sake of latency, to consume an *unbound* amount of memory. Which is very bad. Imagine this, for example: WHILE 1 SADD myset element1 element2 … many many many elements DEL myset END If deleting myset in the background is slower compared to our SADD call adding tons of elements per call, our memory usage will grow forever. However after a few experiments, I found a way to make it working very well. The timer function used two ideas in[...]

About Redis Sets memory efficiency
Yesterday Amplitude published an article about scaling analytics, in the context of using the Set data type. The blog post is here: On Hacker News people asked why not using Redis instead: Amplitude developers have their set of reasons for not using Redis, and in general if you have a very specific problem and want to scale it in the best possible way, it makes sense to implement your vertical solution. I’m not adverse to reinventing the wheel, you want your very specific wheel sometimes, that a general purpose system may not be able to provide. Moreover creating your solution gives you control on what you did, boosts your creativity and your confidence in what you, as a developer can do, makes you able to debug whatever bug may arise in the future without external help. On the other hand of course creating system software from scratch is a very complex matter, requires constant developments if there is a wish to actively develop something, or means to have a stalled, non evolving piece of code if there is no team dedicated to it. If it is very vertical and specialized, likely the new system is capable of handling only a slice of the whole application problems, and yet you have to manage it as an additional component. Moreover if it was created by mostly one or a few programmers that later go away from the company, then fixing and evolving it is a very big problem: there isn’t sizable external community, nor there are the original developers. Basically writing things in house is not good or bad per se, it depends. Of course it is a matter of sensibility to understand when it’s worth to implement something from scratch and when it is not. Good developers know. From my point of view, regardless of what the Amplitude developers final solution was, it is interesting to read the process and why they are not using Redis. One of the concerns they raised is the overhead of the Set data type in Redis. I believe they are right to have such a concern, Redis Sets could be a lot more memory efficient, and weeks before reading the Amplitude article, I already started to explore ways to improve Sets memory efficiency. Today I want to share the plans with you. Dual representation of data types === In principle there where plain data structures, implemented more or less like an algorithm text book suggests: each node of the data structure is implemented dynamically allocating it. Allocation overhead, fat pointers, poor cache locality, are the big limits of this basic solution. Pieter Noordhuis and I later implemented specialized implementations of Redis abstract data types, to be very memory efficient, using single allocations to hold tens or a few hundreds of elements in a single allocation, sometimes with ad-hoc encodings to better use the space. Those versions of the data structures have O(N) time complexity for certain operations, or sometimes are limited to elements having a specific format (numbers) or sizes. So for example when you create an Hash, it starts represented in a memory efficient way which is good for a small number of elements. Later it gets converted to a real hash table if the number of elements reach a given threshold. This means that the memory efficiency of a Redis data type depends a lot on the number of elements [...]

Thanks Pivotal, Hello Redis Labs
I consider myself very lucky for contributing to the open source. For me OSS software is not just a license: it means transparency in the development process, choices that are only taken in order to improve software from the point of view of the users, documentation that attempts to cover everything, and simple, understandable systems. The Redis community had the privilege of finding in Pivotal, and VMware before, a company that thinks at open source in the same way as we, the community of developers, think of it.

Thanks to the Pivotal sponsorship Redis was able to grow, to reach in the latest years a diffusion which I never expected it to reach. However for the final user it always was just a "pure" OSS project: go to the community web site, grab a tar ball, read the free documentation, send a pull request, and watch the stream of commits as they happen live.

In order to not stop this magic from happening, and in order to have enough free time to spend with my family, during these years I made the decision of not starting a Redis company. However I encouraged the creation of an economic ecosystem around Redis. There are multiple companies about Redis doing well at this point. There is one, Redis Labs, that made a remarkable steady work over the years in order to build a very strong company, with a team of developers hacking on the core of Redis, and a great set of products that provide Redis users with the commercial choices they need.

At some point it started to look like a good idea for me to move to Redis Labs. Running a big cluster of Redis instances and having a set of developers on the Redis core is the key asset for Redis future. We can work together in order to improve Redis faster, with a constant feedback on what happens into the wild of actual users running Redis and the efforts required in order to operate it at scale.

Redis Labs was willing to continue what VMware and Pivotal started. I'll be able to work as I do currently, spending all my time in the open source side of the project, while Redis Labs continues to provide Redis users with an hassles-free Redis experience of managed instances and products. However because of my close interaction with Redis Labs I believe we'll see much more contributions from Redis Labs developers to the Redis core. Things like the memory reduction pull requests which are going to be part of Redis 3.2, or the improvements to the key eviction process they contributed for Redis 3.0, are a clear example of what happens when you have great developers working at Redis, able to observe a large set of use cases.

I, Pivotal, and Redis Labs, all agree that this is important for the future of Redis, so I'm officially moving to Redis Labs starting from tomorrow morning. Thank you Pivotal and Redis Labs, we'll have to ship more OSS code in the next years, and this is just great.

EDIT: Redis Labs press release can be found here: Comments

Commit messages are not titles
Nor subjects, for what matters. Everybody will tell you to don't add a dot at the end of the first line of a commit message. I followed the advice for some time, but I'll stop today, because I don't believe commit messages are titles or subjects. They are synopsis of the meaning of the change operated by the commit, so they are small sentences. The sentence can be later augmented with more details in the next lines of the commit message, however many times there is *no* body, there is just the first line. How many emails or articles you see with just the subject or the title? Very little, I guess. So for me it is like:

This is a smart synopsis, as information dense as possible.

And when needed, this is the long version since:
1. I did this.
2. And resulted into this.
3. And you could reproduce this way.

So every time I'll be told again to don't put a dot at the end, I'll link to this article.

But no, it's not just a matter of a dot. If the first line of a commit message is a title, it changes *the way* you write it. It becomes just some text to introduce some more text, without any stress on the information density. Coders gotta code, so if something can be told in a very short way in one line, do it, and reserve the other additional informations for the next line, without sacrificing the first line since "It's a title".

Moreover, programming is the art of writing synopsis, otherwise you end with programs much more complex they should be. So perhaps it's also a good exercise for us. Comments

Plans for Redis 3.2
I’m back from Paris, DotScale 2015 was a very interesting conference. Before leaving I was working on Sentinel in the context of the unstable branch: the work was mainly about connection sharing. In short, it is the ability of a few Sentinels to scale, monitoring many masters. Before to leave, and now that I’m back, I tried to “secure” a set of features that will be the basis for Redis 3.2. In the next weeks I’ll be focusing developing these features, so I thought it’s worth to share the list with you ASAP. Geo hashing API: This work originated from Ardb, that was originally a fork of Redis (, and was later extracted and improved by Matt Stancliff ( that ported it to Redis. Open source is cool eh? The code needs a refactoring effort since currently duplicates parts of the sorted set implementation. It is not impossible that I may also change a few things about the API, I’m currently not sure, if there is something to fix, I’ll fix it. But the bottom line is: this is a great feature, now that Matt is no longer contributing to Redis, there is a huge risk to lose this work, so I’m going to do the effort of refactoring, reviewing and merging it as the first of the tasks for Redis 3.2. I think it is a very exciting addition to the Redis API. Bloom filters: We’ll get bloom filters in 3.2. I’m not sure if this will be implemented as a String type feature like HyperLogLog are, but more likely as a new special type, since I'm interested in non trivial semantics that are more easy to provide as a new type. I’ve many design ideas for bloom filters, but I’m pretty sure I would like to have the ability to control from the API the accuracy/space tradeoff, perhaps not in the lower level from of specifying number of bits and hash functions to use, but in a more higher level way. Another thing I would love to have in this API is the ability of the bloom filter to auto-depollute itself (using multiple rotating filters or something like that). I’ll read all the available literature and decide what to do, but we’ll get this feature into 3.2. Memory PRs: there are two important PRs from RedisLabs to improve Redis memory usage. We’ll get both merged. Memory introspection command: A command that provides information about memory, like the LATENCY command but for memory usage. Hints about where is memory consumed, if its just the RSS that is high because of past peak memory usage, hints about amount of memory used by client output buffers, ability to resize the hash tables to save some memory if needed, and so forth. Some Redis Cluster multi DC support. This will probably just be a “static” option of Cluster slaves so that they’ll not take part to promotion when the master fails. In this way using CLUSTER FAILOVER TAKEOVER it will be possible to promote all the slaves in a minority partition. New List type operations: A few O(1) list operations like LMERGE, and O(N) operations that will be normally used with N very small so that they are most of the times O(1) operations, like operations to move N elements from a list to another. AOF safety feature: AOF rewrites optionally using an RDB preamble, so that rewriting the AOF and reloading back the conte[...]

Adventures in message queues
EDIT: In case you missed it, Disque source code is now available at It is a few months that I spend ~ 15-20% of my time, mostly hours stolen to nights and weekends, working to a new system. It’s a message broker and it’s called Disque. I’ve an implementation of 80% of what was in the original specification, but still I don’t feel like it’s ready to be released. Since I can’t ship, I’ll at least blog… so that’s the story of how it started and a few details about what it is. ~ First steps ~ Many developers use Redis as a message queue, often wrappered via some library abstracting away Redis low level primitives, other times directly building a simple, ad-hoc queue, using the Redis raw API. This use case is covered mainly using blocking list operations, and list push operations. Redis apparently is at the same time the best and the worst system to use like that. It’s good because it is fast, easy to inspect, deploy and use, and in many environments it was already one piece of the infrastructure. However it has disadvantages because Redis mutable data structures are very different than immutable messages. Redis HA / Cluster tradeoffs are totally biased towards large mutable values, but the same tradeoffs are not the best ones to deal with messages. One thing that is important to guarantee for a message broker is that a message is delivered either at least one time, or at most one time. In short given that to guarantee an exact single delivery of a message (where for delivery we intent a message that was received *and* processed by a worker) is practically impossible, the choices are that the message broker is able to guarantee either 0 or 1 deliveries, or 1 to infinite deliveries. This is often referred as at-most-once semantics, and at-least-once semantics. There are use cases for the first, but the most interesting and practical semantics is the latter, that is, to guarantee that a message is delivered at least one time, and deliver multiple times if there are failures. So a few months ago I started to think at some client-side protocol to use a set of Redis masters (without replication or clustering whatsoever) in a way that provides these guarantees. Sometimes with small changes in the way Redis is used for an use case, it is possible to end with a better system. For example for distributed locks I tried to document an algorithm which is trivial to implement but more robust than the single-instance + failover implementation ( However after a few days of work my design draft suggested that it was a better bet to design an ad-hoc system, since the client-side algorithm ended being too complex, non optimal, and certain things I absolutely wanted were impossible or very hard to do. To add more things to Redis sounded like a bad idea, it does a lot of things already, and to cover messaging well I needed things which are very different than the way Redis operates. But why to design a new system given that the world is full of message brokers? Because an impressive number of users were using Redis instead of systems specifically designed for this goal, and this was strange. A few can be wrong, but so many need to get some reason. Maybe Redis low barrier of entr[...]

Redis Conference 2015
I’m back home, after a non easy trip, since to travel from San Francisco to Sicily is kinda NP complete: there are no solutions involving less than three flights. However it was definitely worth it, because the Redis Conference 2015 was very good, SF was wonderful as usually and I was able to meet with many interesting people. Here I’ll limit myself to writing a short account of the conference, but the trip was also an incredible experience because I discovered old and new friends, that are not just smart programmers, but also people I could imagine being my friends here in Sicily. I never felt alone while I was 10k kilometers away from my home. The conference was organized by RackSpace in a magistral way, with RedisLabs, Heroku, and Hulu, sponsoring it as well. I can’t say thank you enough times to everybody. Many people traveled from different parts of US and outside US to SF just for a couple of days, the venue was incredibly cool, and everything organized in the finest details. There was even an incredible cake for the Redis 6th birthday :-) However the killer features of the conference were, the number and the quality of the attenders (mostly actual Redis users), around 250 people, and the quality of the talks. The conference was free, even if it did not looked like a free conference at all, at any level. An incredible stage where to talk, very high quality food, plenty of space. All this honestly helped to create a setup for interesting exchanges. Everybody was using Redis for something, to get actual things done, and a lot of people shared their experiences. Among the talks I found Hulu and Heroku ones extremely interesting, because they covered details about different use cases and operational challenges. I also happen to agree with Bill Andersen (from RackSpace) vision on benchmarking Redis in a use-case oriented fashion, even if I missed the initial part of his talk because I was being interviewed, but the cool thing is, there will be recordings of the talks, so it will be possible for everybody to watch them when available at the conf site, which is, I was approached by several VeryLargeCompanies recounting stories of how they are using or are going to use Redis to do VeryLargeUseCase. Basically at this point Redis is everywhere. Redis Conference was a big gift to the Redis community… and in some way it shows very well how much there is a Redis outside Redis, I mean, at this point it has a life outside the borders of the server and client libraries repositories. It is a technology with many users that exchange ideas and that work with it in different ways: internally to companies to provide it as a technology to cover a number of use cases, and also in the context of cloud providers, that are providing it as a service to other companies. One thing I did not liked was Matt Stancliff talk. He tried to uncover different problems in the Redis development process, and finally proposed the community to replace me as the project leader, with him. In my opinion what Matt actually managed to do was to cherry-pick from my IRC, Twitter and Github issues posts in a very unfair way, in order to provide a bad imagine of myself. I think this was a big mistake. Moreover he did the [...]

Side projects
Today Redis is six years old. This is an incredible accomplishment for me, because in the past I switched to the next thing much faster. There are things that lasted six years in my past, but not like Redis, where after so much time, I still focus most of my everyday energies into. How did I stopped doing new things to focus into an unique effort, drastically monopolizing my professional life? It was a too big sacrifice to do, for an human being with a limited life span. Fortunately I simply never did this, I never stopped doing new things. If I look back at those 6 years, it was an endless stream of side projects, sometimes related to Redis, sometimes not. 1) Load81, children programming environment. 2) Dump1090, software defined radio ADS-B decoder. 3) A Javascript ray tracer. 4) lua-cmsgpack, C implementation of msgpack for Lua. 5) linenoise line editing library. Used in Redis, but well, was not our top priority. 6) lamernews, Redis-based HN clone. 7) Gitan, a small Git web interface. 8) shapeme, images evolver using simulated annealing. 9) Disque, a distributed queue (work in progress right now). And there are much more throw-away projects not listed here. The interesting thing is that many of the projects listed above are not random hacking efforts that had as an unique goal to make me happy. A few found their way into other people’s code. Because of the side projects, I was able to do different things when I was stressed and impoverished from doing again and again the same thing. I could later refocus on Redis, and find again the right motivations to have fun with it, because small projects are cool, but to work for years at a single project can provide more value for others in the long run. So currently I’m using something like 20% of my time to hack on Disque, a distributed message queue. So only 80% is left for Redis development, right? Wrong. The deal is between 80% of focus on Redis and 20% on something else, or 0% of focus on Redis in the long term, because in order to have a long term engagement, you need a long term alternative to explore new things. Side projects are the projects making your bigger projects possible. Moreover they are often the start of new interesting projects. Redis itself was a side project of LLOOGG. Sometimes you stop working at your main project because of side projects, but when this happens it is not because your side project captured your focus, it is because you managed to find a better use for your time, since the side project is more important, interesting, and compelling than the main project. Redis is six years old today, but is aging well: it continues to capture the attention of more developers, and it continues to improve in order to provide a bit more value to users every week. However for me, more users, more pull requests, and more pressure, does not mean to change my setup. What Redis is today is the sum of the work we put into it, and the endurance in the course of six years. To continue along the same path, I’ll make sure to have a few side projects for the next years. UPDATE: Damian Janowski provided an incredible present for the Redis community today, the renewed web site is online now! Thanks [...]

Why we don’t have benchmarks comparing Redis with other DBs
Redis speed could be one selling point for new users, so following the trend of comparative “advertising” it should be logical to have a few comparisons at However there are two problems with this. One is of goals: I don’t want to convince developers to adopt Redis, we just do our best in order to provide a suitable product, and we are happy if people can get work done with it, that’s where my marketing wishes end. There is more: it is almost always impossible to compare different systems in a fair way. When you compare two databases, to get fair numbers, they need to share *a lot*: data model, exact durability guarantees, data replication safety, availability during partitions, and so forth: often a system will score in a lower way than another system since it sacrifices speed to provide less “hey look at me” qualities but that are very important nonetheless. Moreover the testing suite is a complex matter as well unless different database systems talk the same exact protocol: differences in the client library alone can contribute for large differences. However there are people that beg to differ, and believe comparing different database systems for speed is a good idea anyway. For example, yesterday a benchmark of Redis and AerospikeDB was published here: I’ll use this benchmark to show my point about how benchmarks are misleading beasts. In the benchmark huge EC2 instances are used, for some strange reason, since the instances are equipped with 244 GB of RAM (!). Those are R3.8xlarge instances. For my tests I’ll use a more real world m3.medium instance. Using such a beast of an instance Redis scored, in the single node case, able to provide 128k ops per second. My EC2 instance is much more limited, testing from another EC2 instance with Redis benchmark, not using pipelining, and with the same 100 bytes data size, I get 32k ops/sec, so my instance is something like 4 times slower, in the single process case. Let’s see with Redis INFO command how the system is using the CPU during this benchmark: # CPU used_cpu_sys:181.78 used_cpu_user:205.05 used_cpu_sys_children:0.12 used_cpu_user_children:0.87> info cpu … after 10 seconds of test … # CPU used_cpu_sys:184.52 used_cpu_user:206.42 used_cpu_sys_children:0.12 used_cpu_user_children:0.87 Redis spent ~ 3 seconds of system time, and only ~ 1.5 seconds in user space. What happens here is that for each request the biggest part of the work is to perform the read() and write() call. Also since it’s one-query one-reply workload for each client, we pay a full RTT for each request of each client. Now let’s check what happens if I use pipelining instead, a feature very known and much exploited by Redis users, since it’s the only way to maximize the server usage, and there are usually a number of places in the application where you can perform multiple operations at a given time. With a pipeline of 32 operations the numbers changed drastically. My tiny instance was able to deliver 250k ops/sec using a single core, which is 25% of the *top* result using 32 (e[...]

Redis latency spikes and the Linux kernel: a few more details
Today I was testing Redis latency using m3.medium EC2 instances. I was able to replicate the usual latency spikes during BGSAVE, when the process forks, and the child starts saving the dataset on disk. However something was not as expected. The spike did not happened because of disk I/O, nor during the fork() call itself. The test was performed with a 1GB of data in memory, with 150k writes per second originating from a different EC2 instance, targeting 5 million keys (evenly distributed). The pipeline was set to 4 commands. This translates to the following command line of redis-benchmark: ./redis-benchmark -P 4 -t set -r 5000000 -n 1000000000 Every time BGSAVE was triggered, I could see ~300 milliseconds latency spikes of unknown origin, since fork was taking 6 milliseconds. Fortunately Redis has a software watchdog feature, that is able to produce a stack trace of the process during a latency event. It’s quite a simple trick but works great: we setup a SIGALRM to be delivered by the kernel. Each time the serverCron() function is called, the scheduled signal is cleared, so actually Redis never receives it if the control returns fast enough to the Redis process. If instead there is a blocking condition, the signal is delivered by the kernel, and the signal handler prints the stack trace. Instead of getting stack traces with the fork call, the process was always blocked near MOV* operations happening in the context of the parent process just after the fork. I started to develop the theory that Linux was “lazy forking” in some way, and the actual heavy stuff was happening later when memory was accessed and pages had to be copy-on-write-ed. Next step was to read the fork() implementation of the Linux kernel. What the system call does is indeed to copy all the mapped regions (vm_area_struct structures). However a traditional implementation would also duplicate the PTEs at this point, and this was traditionally performed by copy_page_range(). However something changed… as an optimization years ago: now Linux does not just performs lazy page copying, as most modern kernels. The PTEs are also copied in a lazy way on faults. Here is the top comment of copy_range_range(): * Don't copy ptes where a page fault will fill them correctly. * Fork becomes much lighter when there are big shared or private * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. Basically as soon as the parent process performs an access in the shared regions with the child process, during the page fault Linux does the big amount of work skipped by fork, and this is why I could see always a MOV instruction in the stack trace. While this behavior is not good for Redis, since to copy all the PTEs in a single operation is more efficient, it is much better for the traditional use case of fork() on POSIX systems, which is, fork()+exec*() in order to spawn a new process. This issue is not EC2 specific, however virtualized instances are slower at copying PTEs, so the problem is less noticeable with physical servers. However this is definitely not the full story. While I was testing this[...]

Redis latency spikes and the 99th percentile
One interesting thing about the Stripe blog post about Redis is that they included latency graphs obtained during their tests. In order to persist on disk Redis requires to call the fork() system call. Usually forking using physical servers, and most hypervisors, is fast even with big processes. However Xen is slow to fork, so with certain EC2 instance types (and other virtual servers providers as well), it is possible to have serious latency spikes every time the parent process forks in order to persist on disk. The Stripe graph is pretty clear in this regard. img:// As you can guess, if you perform a latency test during the fork, all the requests crossing the moment the parent process forks will be delayed up to one second (taking as example the graph above, not sure about what was the process size nor the EC2 instance). This will produce a number of samples with high latency, and will affect the 99th percentile result. To change instance type, configuration, setup, or whatever in order to improve this behavior is a good idea, and there are use cases where even a single request having a too high latency is unacceptable. However apparently it is not obvious how latency spikes of 1 second every 30 minutes (or more, if you use AOF with the right rewrite triggers) is very different from latency spikes which are evenly distributed in the set of requests. With evenly distributed spikes, if the generation of a page needs to perform a number of requests to a Redis server in order to create the output, it is very likely that a page view will incur in the latency penalty: this impacts the quality of service in a great way potentially, check this link: However 1 second of latency every 30 minutes run is a completely different thing. For once, the percentile with good latency gets better *as the number of requests increase*, since the more the requests are, the more this second of latency will be unlikely to get over-represented in the samples (if you have just 1 request per minute, and one of those requests happen to hit the high latency, it will affect the 99.99th percentile much more than what happens with 100 requests per second). Second: most page views will be unaffected. The only users that will see the 1 second delay are the ones that make a request crossing the fork call. All the other requests will experience an extremely low probability of hitting a request that has a latency which is significantly worse than the average latency. Also note that a page view crossing the fork time, even when composed of 100 requests, can’t be delayed for more than a second, since the requests are completed as soon as the fork() call terminates. The bottom line here is that, if there are hard latency requirements for each single request, it is clear that a setup where a request can be delayed 1 second from time to time is a big problem. However when the goal is to provide a good quality of service, the distribution of the latency spikes have a huge effect on the outcome. Redis latency spikes due to f[...]

This is why I can’t have conversations using Twitter
Yesterday Stripe engineers wrote a detailed report of why they had an issue with Redis. This is very appreciated. In the Hacker News thread I explained that because now we have diskless replication ( now persistence is no longer mandatory for people having a master-slaves replicas set. This changes the design constraints: now that we can have diskless replicas synchronization, it is worth it to better support the Stripe (ex?) use case of replicas set with persistence turned down, in a more safe way. This is a work in progress effort. In the same post Stripe engineers said that they are going to switch to PostgreSQL for the use case where they have issues with Redis, which is a great database indeed, and many times if you can go with the SQL data model and an on-disk database, it is better to use that instead of Redis which is designed for when you really want to scale to a lot of complex operations per second. Stripe engineers also said that they measured the 99th percentile and it was better with PostgreSQL compared to Redis, so in a tweet @aphyr wrote: “Note that *synchronous* Postgres replication *between AZs* delivers lower 99th latencies than asynchronous Redis” And I replied: “It could be useful to look at average latency to better understand what is going on, since I believe the 99% percentile is very affected by the latency spikes that Redis can have running on EC2.” Which means, if you have also the average, you can tell if the 99th percentile is ruined (or not) by latency spikes, that many times can be solved. Usually it is as simple as that: if you have a very low average, but the 99th percentile is bad, likely it is not that Redis is running slow because, for example, operations performed are very time consuming or blocking, but instead a subset of queries are served slow because of the usual issues in EC2: fork time in certain instances, remote disks I/O, and so forth. Stuff that you can likely address, since for example, there are instance types without the fork latency issue. For half the Twitter IT community, my statement was to promote the average latency as the right metric over 99th percentiles: "averages are the worst possible metric for latency. No latency I've ever seen falls on a bell curve. Averages give nonsense." "You have clearly not understood how the math works or why tail latencies matter in dist sys. I think we're done here." “indeed; the problem is that averages are not robust in the presence of outliers” Ehm, who said that average is a good metric? I proposed it to *detect* if there are or not big outliers. So during what was supposed to be a normal exchange, I find after 10 minutes my Twitter completely full of people that tell me that I’m an idiot to endorse averages as The New Metric For Latency in the world. Once you get the first retweets, you got more and more. Even a notable builder of other NoSQL database finds the time to lecture me a few things via Twitter: I reply saying that clearly what I wrote was that if you have 99th + avg you have a better picture of the curve and can understand if th[...]

Diskless replication: a few design notes.
Almost a month ago a number of people interested in Redis development met in London for the first Redis developers meeting. We identified together a number of features that are urgent (and are now listed in a Github issue here:, and among the identified issues, there was one that was mentioned multiple times in the course of the day: diskless replication. The feature is not exactly a new idea, it was proposed several times, especially by EC2 users that know that sometimes it is not trivial for a master to provide good performances during slaves synchronization. However there are a number of use cases where you don’t want to touch disks, even running on physical servers, and especially when Redis is used as a cache. Redis replication was, in short, forcing users to use disk even when they don’t need or want disk durability. When I returned back home I wanted to provide a quick feedback to the developers that attended the meeting, so the first thing I did was to focus on implementing the feature that seemed the most important and non-trivial among the list of identified issues. In the next weeks the attention will be moved to the Redis development process as well: the way issues are handled, how new ideas can be proposed to the Redis project, and so forth. Sorry for the delay about these other important things, for now what you can get is, some code at least ;-) Diskless replication provided a few design challenges. It looks trivial but it is not, so since I want to blog more, I thought about documenting how the internals of this feature work. I’m sure that a blog post may make the understanding and adoption of the new feature simpler. How replication used to work === Newer versions of Redis are able, when the connection with the master is lost, to reconnect with the master, and continue the replication process in an incremental way just fetching the differences accumulated so far. However when a slave is disconnected for a long time, or restarted, or it is a new slave, Redis requires it to perform what is called a “full resynchronization”. It is a trivial concept, and means: in order to setup this slave, let’s transfer *all* the master data set to the slave. It will flush away its old data, and reload the new data from scratch, making sure it is running an exact copy of master’s data. Once the slave is an exact copy of the master, successive changes are streamed as a normal Redis commands, in an incremental way, as the master data set itself gets modified because of write commands sent by clients. The problem was the way this initial “bulk transfer” needed for a full resynchronization was performed. Basically a child process was created by the master, in order to generate an RDB file. When the child was done with the RDB file generation, the file was sent to slaves, using non blocking I/O from the parent process. Finally when the transfer was complete, slaves could reload the RDB file and go online, receiving the incremental stream of new writes. However this means that from the master [...]

A few arguments about Redis Sentinel properties and fail scenarios.
Yesterday distributed systems expert Aphyr, posted a tweet about a Redis Sentinel issue experienced by an unknown company (that wishes to remain anonymous): “OH on Redis Sentinel "They kill -9'd the master, which caused a split brain..." “then the old master popped up with no data and replicated the lack of data to all the other nodes. Literally had to restore from backups." OMG we have some nasty bug I thought. However I tried to get more information from Kyle, and he replied that the users actually disabled disk persistence at all from the master process. Yep: the master was configured on purpose to restart with a wiped data set. Guess what? A Twitter drama immediately started. People were deeply worried for Redis users. Poor Redis users! Always in danger. However while to be very worried is a trait of sure wisdom, I want to take the other path: providing some more information. Moreover this blog post is interesting to write since actually Kyle, while reporting the problem with little context, a few tweets later was able to, IMHO, isolate what is the true aspect that I believe could be improved in Redis Sentinel, which is not related to the described incident, and is in my TODO list for a long time now. But before, let’s check a bit more closely the behavior of Redis / Sentinel about the drama-incident. Welcome to the crash recovery system model === Most real world distributed systems must be designed to be resilient to the fact that processes can restart at random. Note that this is very different from the problem of being partitioned away, which is, the inability to exchange messages with other processes. It is, instead, a matter of losing state. To be more accurate about this problem, we could say that if a distributed algorithm is designed so that a process must guarantee to preserve the state after a restart, and fails to do this, it is technically experiencing a bizantine failure: the state is corrupted, and the process is no longer reliable. Now in a distributed system composed of Redis instances, and Redis Sentinel instances, it is fundamental that rebooted instances are able to restart with the old data set. Starting with a wiped data set is a byzantine failure, and Redis Sentinel is not able to recover from this problem. But let’s do a step backward. Actually Redis Sentinel may not be directly involved in an incident like that. The typical example is what happens if a misconfigured master restarts fast enough so that no failure is detected at all by Sentinels. 1. Node A is the master. 2. Node A is restarted, with persistence disabled. 3. Sentinels (may) see that Node A is not reachable… but not enough to reach the configured timeout. 4. Node A is available again, except it restarted with a totally empty data set. 5. All the slave nodes B, C, D, ... will happily synchronize an empty data set form it. Everything wiped from the master, as per configuration, after all. And everything wiped from the slaves, that are replicating from what is believed to be the current source of truth for the data set.[...]

Redis cluster, no longer vaporware.
The first commit I can find in my git history about Redis Cluster is dated March 29 2011, but it is a “copy and commit” merge: the history of the cluster branch was destroyed since it was a total mess of work-in-progress commits, just to shape the initial idea of API and interactions with the rest of the system. Basically it is a roughly 4 years old project. This is about two thirds the whole history of the Redis project. Yet, it is only today, that I’m releasing a Release Candidate, the first one, of Redis 3.0.0, which is the first version with Cluster support. An erratic run — To understand why it took so long is straightforward: I started the cluster project with a lot of rush, in a moment where it looked like Redis was going to be totally useless without an automatic way to scale. It was not the right moment to start the Cluster project, simply because Redis itself was too immature, so we didn't yet have a solid “single instance” story to tell. While I did the error of starting a project with the wrong timing, at least I didn’t fell in the trap of ignoring the requests arriving from the community, so the project was stopped and stopped an infinite number of times in order to provide more bandwidth to other fundamental features. Persistence, replication, latency, introspection, received a lot more care than cluster, simply because they were more important for the user base. Another limit of the project was that, when I started it, I had no clue whatsoever about distributed programming. I did a first design that was horrible, and managed to capture well only what were the “products” requirement: low latency, linear scalability and small overhead for small clusters. However all the details were wrong, and it was far more complex than it had to be, the algorithms used were unsafe, and so forth. While I was doing small progresses I started to study the basics of distributed programming, redesigned Redis Cluster, and applied the same ideas to the new version of Sentinel. The distributed programming algorithms used by both systems are still primitive since they are asynchronous replicated, eventually consistent systems, so I had no need to deal with consensus and other non trivial problems. However even when you are addressing a simple problem, compared to writing a CP store at least, you need to understand what you are doing otherwise the resulting system can be totally wrong. Despite all this problems, I continued to work at the project, trying to fix it, fix the implementation, and bring it to maturity, because there was this simple fact, like an elephant into a small room, permeating all the Redis Community, which is: people were doing again and again, with their efforts, and many times in a totally broken way, two things: 1) Sharding the dataset among N nodes. 2) A responsive failover procedure in order to survive certain failures. Problem “2” was so bad that at some point I decided to start the Redis Sentinel project before Cluster was finished in order to provide an HA[...]

Queues and databases
Queues are an incredibly useful tool in modern computing, they are often used in order to perform some possibly slow computation at a latter time in web applications. Basically queues allow to split a computation in two times, the time the computation is scheduled, and the time the computation is executed. A “producer”, will put a task to be executed into a queue, and a “consumer” or “worker” will get tasks from the queue to execute them. For example once a new user completes the registration process in a web application, the web application will add a new task to the queue in order to send an email with the activation link. The actual process of sending an email, that may require retrying if there are transient network failures or other errors, is up to the worker. Technically speaking we can think at queues as a form of inter-process messaging primitive, where the receiving process needs to acknowledge the reception of the message. Messages can not be fire-and-forget, since the queue needs to understand if the message can be removed from the queue, so some form of acknowledgement is strictly required. When receiving a message triggers the execution of a task, like it happens in the kind of queues we are talking about, the moment the message reception is acknowledged changes the semantic of the queue. When the worker process acknowledges the reception of the message *before* processing the message, if the worker fails the message can be lost before the task is performed at all. If the acknowledge is sent only *after* the message gets processed, if the worker fails or because of network partitions the queue may re-deliver the message again. This happens whatever the queue consistency properties are, so, even if the queue is modeled using a system providing strong consistency, the indetermination still holds true: * If messages are acknowledged before processing, the queue will have an at-most-once delivery property. This means messages can be processed zero or one time. * If messages are acknowledged after processing, the queue will have an at-least-once delivery property. This means messages can be processed from 1 to infinite number of times. While both of this cases are not perfect, in the real world the second behavior is often preferred, since it is usually much simpler to cope with the case of multiple delivery of the message (triggering multiple executions of the task) than a system that from time to time does not execute a given task at all. An example of at-least-once delivery system is Amazon SQS (Simple Queue Service). There is also a fundamental reason why at-least-once delivery systems are to be preferred, that has to do with distributed systems: the other semantics (at-most-once delivery) requires the queue to be strongly consistent: once the message is acknowledged no other worker must be able to acknowledge the same message, which is a strong property. Once we move our focus to at-least-once delivery systems, we may notice that to model the qu[...]

A proposal for more reliable locks using Redis
----------------- UPDATE: The algorithm is now described in the Redis documentation here => The article is left here in its older version, the updates will go into the Redis documentation instead. ----------------- Many people use Redis to implement distributed locks. Many believe that this is a great use case, and that Redis worked great to solve an otherwise hard to solve problem. Others believe that this is totally broken, unsafe, and wrong use case for Redis. Both are right, basically. Distributed locks are not trivial if we want them to be safe, and at the same time we demand high availability, so that Redis nodes can go down and still clients are able to acquire and release locks. At the same time a fast lock manager can solve tons of problems which are otherwise hard to solve in practice, and sometimes even a far from perfect solution is better than a very slow solution. Can we have a fast and reliable system at the same time based on Redis? This blog post is an exploration in this area. I’ll try to describe a proposal for a simple algorithm to use N Redis instances for distributed and reliable locks, in the hope that the community may help me analyze and comment the algorithm to see if this is a valid candidate. # What we really want? Talking about a distributed system without stating the safety and liveness properties we want is mostly useless, because only when those two requirements are specified it is possible to check if a design is correct, and for people to analyze and find bugs in the design. We are going to model our design with just three properties, that are what I believe the minimum guarantees you need to use distributed locks in an effective way. 1) Safety property: Mutual exclusion. At any given moment, only one client can hold a lock. 2) Liveness property A: Deadlocks free. Eventually it is always possible to acquire a lock, even if the client that locked a resource crashed or gets partitioned. 3) Liveness property B: Fault tolerance. As long as the majority of Redis nodes are up, clients are able to acquire and release locks. # Distributed locks, the naive way. To understand what we want to improve, let’s analyze the current state of affairs. The simple way to use Redis to lock a resource is to create a key into an instance. The key is usually created with a limited time to live, using Redis expires feature, so that eventually it gets released one way or the other (property 2 in our list). When the client needs to release the resource, it deletes the key. Superficially this works well, but there is a problem: this is a single point of failure in our architecture. What happens if the Redis master goes down? Well, let’s add a slave! And use it if the master is unavailable. This is unfortunately not viable. By doing so we can’t implement our safety property of the mutual exclusion, because Redis replication is asynchronous. This is an obvious race condition with this model: [...]

Using Heartbleed as a starting point
The strong reactions about the recent OpenSSL bug are understandable: it is not fun when suddenly all the internet needs to be patched. Moreover for me personally how trivial the bug is, is disturbing. I don’t want to point the finger to the OpenSSL developers, but you just usually think at those class of issues as a bit more subtle, in the case of a software like OpenSSL. Usually you fail to do sanity checks *correctly*, as opposed to this bug where there is a total *lack* of bound checks in the memcpy() call. However sometimes in the morning I read the code I wrote the night before and I’m deeply embarrassed. Programmers sometimes fail, I for sure do often, so my guess is that what is needed is a different process, and not a different OpenSSL team. There is who proposes a different language safer than C, and who proposes that the specification is broken because it is too complex. Probably there is some truth in both arguments, however it is unlikely that we move to a different specification or system language soon, so the real question is, what we can do now to improve system software security? 1) Throw money at it. Making system code safer is simple if there are investments. If different companies hire security experts to do code auditings in the OpenSSL code base, what happens is that the probability of discovering a bug like heartbleed is greater. I’ve seen very complex bugs that are triggered by a set of non-trivial conditions being discovered by serious code auditing efforts. A memcpy() without bound checks is something that if you analyze the code security-wise, will stand out in the first read. And guess how heartbleed was discovered? Via security auditings performed at Google. Probably the time to consider open source something that mostly we take from is over. Many companies should follow the example of Google and other companies, using workforce for OSS software development and security. 2) Static and dynamic checks. Static code analysis is, as a side effect, a semi-automated way to do code auditings. In critical system code like OpenSSL even to do some source code annotation or use a set of rules to make static analysis more effective is definitely acceptable. Static tools today are not a total solution, but the output of a static analysis if carefully inspected by an expert programmer can provide some value. Another great help comes from dynamic checks like Valgrind. Every system software written in C should be tested using Valgrind automatically at every new commit. 3) Abstract C with libraries. C is low level and has no built in safety in the language. However something good about C is that it is a language that allows to build layers on top of its rawness. A sane dynamic string library prevents a lot of buffer overflow issues, and today almost every decent project is using one. However there is more you can do about it. For example for security critical code where memory can contain thin[...]

Redis new data structure: the HyperLogLog
Generally speaking, I love randomized algorithms, but there is one I love particularly since even after you understand how it works, it still remains magical from a programmer point of view. It accomplishes something that is almost illogical given how little it asks for in terms of time or space. This algorithm is called HyperLogLog, and today it is introduced as a new data structure for Redis. Counting unique things === Usually counting unique things, for example the number of unique IPs that connected today to your web site, or the number of unique searches that your users performed, requires to remember all the unique elements encountered so far, in order to match the next element with the set of already seen elements, and increment a counter only if the new element was never seen before. This requires an amount of memory proportional to the cardinality (number of items) in the set we are counting, which is, often absolutely prohibitive. There is a class of algorithms that use randomization in order to provide an approximation of the number of unique elements in a set using just a constant, and small, amount of memory. The best of such algorithms currently known is called HyperLogLog, and is due to Philippe Flajolet. HyperLogLog is remarkable as it provides a very good approximation of the cardinality of a set even using a very small amount of memory. In the Redis implementation it only uses 12kbytes per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely). The algorithm is documented in the original paper [1], and its practical implementation and variants were covered in depth by a 2013 paper from Google [2]. [1] [2] How it works? === There are plenty of wonderful resources to learn more about HyperLogLog, such as [3]. [3] Here I’ll cover only the basic idea using a very clever example found at [3]. Imagine you tell me you spent your day flipping a coin, counting how many times you encountered a non interrupted run of heads. If you tell me that the maximum run was of 3 heads, I can imagine that you did not really flipped the coin a lot of times. If instead your longest run was 13, you probably spent a lot of time flipping the coin. However if you get lucky and the first time you get 10 heads, an event that is unlikely but possible, and then stop flipping your coin, I’ll provide you a very wrong approximation of the time you spent flipping the coin. So I may ask you to repeat the experiment, but this time using 10 coins, and 10 different piece of papers, one per coin, where you record the longest run of head[...]

Fascinating little programs
Yesterday and today I managed to spend some time with linenoise (, a minimal line-editing library designed to be a simple and small replacement for readline. I was trying to merge a few pull requests, to fix issues, and doing some refactoring at the same time. It was some kind of nirvana I was feeling: a complete control of small, self-contained, and useful code. There is something special in simple code. Here I’m not referring to simplicity to fight complexity or over engineering, but to simplicity per se, auto referential, without goals if not beauty, understandability and elegance. After all the programming world has always been fascinated with small programs. For decades programmers challenged in 1k or 4k contexts, from the 6502 assembler to today’s javascript contests. Even the obfuscated C contest, after all, has a big component in the minimalism. Why is it so great to hack a small piece of code? Yes is small and simple, those are two good points. It can be totally understood, dominated. You can use smartness since little code is the only place of the world where coding smartness will pay off, since in large projects obviousness is far better in the long run. However I believe there is more than that, and is that small programs can be perfect. As perfect as a sonnet composed of a few words. The limits in size and in scope, constitute an intellectual stratagem to avoid the “it may be better" trap, when this better is not actually measurable and evident. Under these strict limits, what the program does is far more interesting than what it does not. Actually the constraints are the more fertile ground for creativity of the solutions, otherwise likely useless: at scale there is always a more correct, understood, canonical way to do everything. There is an interview of Bill Gates in the first years of the Microsoft experience where he describes this feeling when writing the famous Microsoft BASIC interpreter. The limits were the same we self impose today to ourselves for fun, in the contests, or just for the sake of it. There was a generation of programmers that was able to experience perfection in their creations, where it was obvious to measure and understand if a change actually lead to an improvement of the program or not, in a territory where space and time were so scarse. There was no room for wastes and not needed complexity. Today’s software is in some way the triumph of the other reality of software: layers of complexities that gave use incredible devices or infrastructure technologies that in the hands of non experts leverage a number of possibilities. However maybe there is still something to preserve from the ancient times where software could be perfect, the feeling that what you are creating has a structure and is not just a pile of code that works. If you zoom out enough, you’ll see your large program i[...]

What is performance?
The title of this blog post is an apparently trivial to answer question, however it is worth to consider a bit better what performance really means: it is easy to get confused between scalability and performance, and to decompose performance, in the specific case of database systems, in its different main components, may not be trivial. In this short blog post I’ll try to write down my current idea of what performance is in the context of database systems. A good starting point is probably the first slide I use lately in my talks about Redis. This first slide is indeed about performance, and says that performance is mainly three different things. 1) Latency: the amount of time I need to get the reply for a query. 2) Operations per unit of time per core: how many queries (operations) the system is able to reply per second, in a given reference computational unit? 3) Quality of operations: how much work those operations are able to accomplish? Latency — This is probably the simplest component of performance. In many applications it is desirable that the time needed to get a reply from the system is small. However while the average time is important, another concern is the predictability of the latency figure, and how much difference there is between the average case and the worst case. When used well, in-memory systems are able to provide very good latency characteristics, and are also able to provide a consistent latency over time. Operations per second per core — The second component I’m enumerating is what makes the difference between raw performance and scalability. We are interested in the amount of work the system is able to do, in a given unit of time, for a given reference computational unit. Linearly scalable systems can reach a big number of operations per second by using a number of nodes, however this means they are scalable, and not necessarily performant. Operations per second per core is also usually bound to the amount of queries you can perform per watt, so to the energy efficiency of the system. Quality of operations — The last point, while probably not as stressed among developers as throughput and latency, is really important in certain kind of systems, especially in-memory systems. A system that is able to perform 100 operations per second, but with operations of “poor quality” (for example just GET and SET in Redis terms) has a lower performance compared to a system that is also able to perform an INCR operation with the same latency and OPS characteristics. For instance, if the problem at hand is to increment counters, the former system will require two operations to increment a counter (we are not considering race conditions in this context), while the system providing INCR is able to use a single operation. As a result it is actually able to provide twice the performance of the former system. [...]

Happy birthday Redis!
Today Redis is 5 years old, at least if we count starting from the initial HN announcement [1], that’s actually a good starting point. After all an open source project really exists as soon as it is public. I’m a bit shocked I worked for five years straight to the same thing. The opportunities for learning new things I had because of the directions where Redis pushed me, and the opportunities to learn new things that I missed because I had almost consistently no time for random hacking, are huge. My feeling today is that the Redis project was possible because of the great coders I encountered in my journey: they made Redis popular adopting it in its infancy, since great coders don’t follow the hype. Great coders provided outstanding additions to Redis in the form of patches and ideas that were able to surpass my instinct to be conservative when the topic was to extend the system or accept external contributions. More great coders made possible to sponsor Redis when it was in its infancy, recognizing that there was something interesting about it, and more great coders applied it in the right way to solve problems in the course of many years, wrote an incredible ecosystem of client libraries and tools, and helped other coders to apply it when it was not clear what was the best way to solve a given problem. The Redis community is outstanding because in some way it managed to attract a number of great coders. I learned that in the future, whatever I’ll do more coding or I’ll be in a team to build something great in a different role, my top priority will be to stay with great coders, and I learned that they are not easy to recognize at first: their abilities don’t correlate with the number of followers on Twitter nor with the number of Github repositories. You have to discover great coders one after the other, and the biggest gift that Redis provided to me, was to get exposed to many of them. In the course of five years there was also time, for me, to evolve my idea of what Redis is. The idea I’ve of Redis today is that its contribution should be to try to explore corner designs and bizzarre ideas. After all there are large teams of people much smarter than me trying to work on the hard problems applying the best technologies available. Redis will continue to be a small research in more obscure places of the design space. After all I’ve the feeling that it helped to popularize certain non obvious ideas, like using data structures as data model for key value stores and caches, or that it is possible to apply scripting to database systems in a different way than stored procedures. However for Redis to be able to do this research, I should be ready to be opinionated and change development direction when something is weak. This was done in the past, deprecating swap and diskstore, but should be done ev[...]

A simple distributed algorithm for small idempotent information
In this blog post I’m going to describe a very simple distributed algorithm that is useful in different programming scenarios. The algorithm is useful when you need to take some kind of information synchronized among a number of processes. The information can be everything as long as it is composed of a small number of bytes, and as long as it is idempotent, that is, the current value of the information does not depend on the previous value, and we can just replace an old value, with the new one. The size of the information is important because for the way the algorithm works, the information should be small enough that every node can broadcast it from time to time to some other random node, so it should fit the size of an “heartbeat” packet. Let’s say that up to a few kbytes everything is fine. This algorithm is no new in any way, it is basically just a trivial way to put together obvious ideas found in other distributed algorithms in a simple way. However the algorithm is very useful in many real-world contexts, and is extremely simple to implement. The algorithm is mostly borrowed from Raft, however because of the premises it uses only a subset of Raft that is trivial to implement. An example scenario === To understand better the algorithm, it is much better to have an example of problem that we want to solve in our distributed system. Let’s say that we have N processes connected with two kind of wifi networks: a very reliable but slow wireless network, that is only suitable to send “control” packets like heartbeats or other low bandwidth data, and a very fast wireless network. Now let’s imagine that while the slow network works at a fixed frequency, the high speed wireless network requires to adapt to changing conditions and noises in a given frequency, and is able to hop to a different frequency as soon as too much noise is detected. We need a way to make sure that all the processes use the same frequency to communicate with the high speed network, and we also need a way to switch frequency when the currently used frequency has issues. We need this system to work if there are network partitions in the slow network, as long as the majority of the processes are able to communicate. Note that this problem has the properties stated above: 1) The information is idempotent, if the high speed network switched to an different frequency, the new frequency does not depend on the old frequency. A process receiving the new frequency can just update frequency regardless of the fact that its old frequency was updated, or an older one (because for some reason it did not received some update). 2) The information is small, it is possible to propagate it easily across nodes in a small data packet. In this case it is actually extremely small, for example the frequency may be en[...]

Redis Cluster and limiting divergences.
Redis Cluster is finally on its road to reach the first stable release in a short timeframe as already discussed in the Redis google group [1]. However despite a design never proposed for the implementation of Redis Cluster was analyzed and discussed at long in the past weeks (unfortunately creating some confusion: many people, including notable personalities of the NoSQL movement, confused the analyzed proposal with Redis Cluster implementation), no attempt was made to analyze or categorize Redis Cluster itself. I believe that putting in perspective the simple ideas that Redis Cluster implements, in a more formal way, is an interesting exercise for the following reason: Redis Cluster is not a design that tries to achieve “AP” or “CP” of the CAP theorem, since for its goals CAP Availability and CAP Consistency are too hard goals to reach without sacrificing other practical qualities. Once a design does not try to maximize what is theoretically possible, the design space becomes much larger, and the implementation main goal is to try to provide some Availability and some reasonable form of Consistency, in the face of other conflicting design requirements like asynchronous replication of data. The goal of this article is to reply to the following question: how Redis Cluster, as an asynchronous system, tries to limit divergences between nodes? Nodes divergence === One of the main problems with asynchronous systems is that a master accepting requests never actually knows in a given moment if it is still the authoritative master or a stale one. For example imagine a cluster of three nodes A, B, C, with one replica each, A1, B1, C1. When a master is partitioned away with some client, A1 may be elected in the other side as the new master, but A is not able, for every request processed, to verify with the other nodes if the request should be accepted or not. At best node A can get asynchronous acknowledges from replicas. Every time this happens, two parallel time lines for the same set of data is created, one in A, and one in A1. In Redis Cluster there is no way to merge data as explained in [2], so you can imagine the merge function between A and A1 data set when the partition heals as just picking one time line between all the time lines created (two in this case). Another case that creates different time lines is the replication process itself. A master A may have three replicas A1, A2, A3. Because of the very concept of asynchronous replication, each of the slaves may represent a point in time in the past history of the time line of A. Usually since replication in Redis is very fast and has minimal delay, as data is transmitted to slaves at the same time as the reply is transmitted to the writing client, the time “delta” between A and the slaves [...]

Some fun with Redis Cluster testing
One of the steps to reach the goal of providing a "testable" Redis Cluster experience to users within a few weeks, is some serious testing that goes over the usual "I'm running 3 nodes in my macbook, it works". Finally this is possible, since Redis Cluster entered into the "refinements" stage, and most of the system design and implementation is in its final form already. In order to perform some testing I assembled an environment like that: * Hardware: 6 real computers: 2 macbook pro, 2 macbook air, 1 Linux desktop, 1 Linux tiny laptop called EEEpc running with a single core at 800Mhz. * Network: the six nodes were wired to the same network in different ways. Two nodes connected via ethernet, and four over wifi, with different access points. Basically there were three groups. The computers connected with the ethernet had 0.3 milliseconds RTT, other two computers connected with a near access point were at around 5 milliseconds, and another group of two with another access point were not very reliable, sometime some packet went lost, latency spikes at 300-1000 milliseconds. During the simulation every computer ran Partitions.tcl ( in order to simulate network partitions three times per minute, lasting an average of 10 seconds. Redis Cluster was configured to detect a failure after 500 milliseconds, so these settings are able to trigger a number of failover procedures. Every computer ran the following: Computer 1 and 2: Redis cluster node + Partitions.tcl + 1 client Computer 3 to 6: Redis cluster node + Partitions.tcl The cluster was configured to have three masters and three slaves in total. As client software I ran the cluster consistency test that is shipped with redis-rb-cluster (, that performs atomic counters increments remembering the value client side, to detect both lost writes and non acknowledged writes that were actually accepted by the cluster. I left the simulation running for about 24 hours, however there were moments where the cluster was completely down due to too many nodes being down. The bugs === The first thing that happened in the simulation was, a big number of crashes of nodes… the simulation was able to trigger bugs that I did not noticed in the past. Also there were obvious mis-behavior due to the fact that one node, the eeepc one, was running a Redis server compiled with a 32 bit target. So in the first part of the simulation I just fixed bugs: 7a666ac Cluster: set n->slaves to NULL in clusterNodeResetSlaves(). fda91db Cluster: check link is valid before sending UPDATE. f57bb36 Cluster: initialize todo_before_sleep flags to 0. c70c0c6 Cluster: use proper type mstime_t for ping delay var. 7c1cbdc Cluster: use an hardcod[...]

Redis as AP system, reloaded
So finally something really good happened from the Redis criticism thread.

At the end of the work day I was reading about Redis as AP and merge operations on Twitter. At the same time I was having a private email exchange with Alexis Richardson (from RabbitMQ, and, my boss). Alexis at some point proposed that perhaps a way to improve safety was to asynchronously ACK the client about what commands actually were not received so that the client could retry. This seemed a lot of efforts in the client side, but somewhat totally opened my view on the matter.

So the idea is, we can't go for synchronous replication, but indeed we get ACKs from the replicas, asynchronous ACKS, specifically.
What about retaining all the writes not acknowledged into a buffer, and "re-play" them to the current master when the partition heals?
The window we need to require to take the log is very small if the ACKs are frequent enough (currently the frequency is 1 per second, but this could be more easily).
If we give up Availability after N times the window we can say, ok, no more room, we now start to reply with errors to queries.

The HUGE difference with this approach is that this works regardless of the size of values. There are also semantical differences since the stream of operations is preserved instead of the value itself, so there is more context. Think for example about INCR.

Of course this would not work for anything, but one could mark in the command table what command to reply and what to discard. SADD is an example of perfect command since the order of operations does not matter. DEL is likely something to avoid replying. And so forth. In turn if we reply against the wrong (stale) master, it will accumulate the commands and so forth. Details may vary, but this is the first thing that really makes a difference.

Probably many of you that are into eventually consistent databases know about the log VS merge strategies already, but I had to re-invent the wheel as I was not aware. This is the kind of feedback I expected in the Redis thread that I did not received.

Another cool thing about this approach is that it's pretty opt-in, it can be just a state in the connection. Send a command and the connection is of "safe" type, so all the commands sent will be retained and replayed if not acknowledged, and so forth.

This is not going to be in the first version of Redis Cluster as I'm more happy to ship ASAP the current design, but it is a solid incremental idea that could be applied later, so a little actual result into the evolution of the design. Comments

The Redis criticism thread
A few days ago I tried to do an experiment by running some kind of “call for critiques” in the Redis mailing list:!topic/redis-db/Oazt2k7Lzz4 The thread has reached 89 posts so far, probably one of the biggest threads in the history of the Redis google group. The main idea was that critiques are a mix of pointless attacks, and truth, so to extract the truth from critiques can be a good exercise, it means to have some seed idea for future improvements from the part of the population that is not using or is not happy with your system. There were a lot of arguments possible: threading, persistence model, API, security, and so forth, however the argument that received the most attention was Redis Cluster design and its tradeoffs. There are people that are not convinced by the proposed design since it does not provide strong Consistency nor Availability (“C” and “A” of CAP, not any random consistency or availability). Instead it only provides some form of weaker consistency and some failure resistance. In this blog post I want to clarify why this is in my opinion the right choice for Redis. Strong consistency requires synchronous replication, and depending on the failure models you want to handle, it also requires to fsync data on disk at every write, in order to cover failures like all the nodes in a single data center rebooting for a power outage. Redis is what it is because the performance and latency characteristics, so a model like the above would not be Redis. So Redis Cluster trades Consistency for performance. The tradeoff is that there are well defined failure modes where it is possible to lose writes, however the system is designed in order to minimize lost writes under certain assumptions. The other property we give away is availability, because being available means, in order to have a decent consistency story, to merge values when partitions heals. Redis data model is not merge-friendly, Redis uses the fact that data representation is in memory to get an advantage and export rich data structures to the user that have practically no limits in the size of every single value, if not the available memory. So there are Redis deployments with just a few keys, with sorted set values of many million of elements, to implement leader boards of Facebook games, and stuff like that. Merging huge and complex values is extremely hard and not very practical. Implementing the values in a way that makes merging more manageable would mean instead to forget the current performance at all. So we trade Availability for a more powerful data model. We still have some degree of resistance to failures. Nodes can stop working and partitions can h[...]

WAIT: synchronous replication for Redis
Redis unstable has a new command called "WAIT". Such a simple name, is indeed the incarnation of a simple feature consisting of less than 200 lines of code, but providing an interesting way to change the default behavior of Redis replication. The feature was extremely easy to implement because of previous work made. WAIT was basically a direct consequence of the new Redis replication design (that started with Redis 2.8). The feature itself is in a form that respects the design of Redis, so it is relatively different from other implementations of synchronous replication, both at API level, and from the point of view of the degree of consistency it is able to ensure. Replication: synchronous or not? === Replication is one of the main concepts of distributed systems. For some state to be durable even when processes fail or when there are network partitions making processes unable to communicate, we are forced to use a simple but effective strategy, that is to take the same information “replicated” across different processes (database nodes). Every kind of system featuring strong consistency will use a form of replication called synchronous replication. It means that before some new data is considered “committed”, a node requires acknowledge from other nodes that the information was received. The node initially proposing the new state, when the acknowledge is received, will consider the information committed and will reply to the client that everything went ok. There is a price to pay for this safety: latency. Systems implementing strong consistency are unable to reply to the client before receiving enough acknowledges. How many acks are “enough”? This depends on the kind of system you are designing. For example in the case of strong consistent systems that are available as long as the majority of nodes are working, the majority of the nodes should reply back before some data is considered to be committed. The result is that the latency is bound to the slowest node that replies as the (N/2+1)th node. Slower nodes can be ignored once the majority is reached. Asynchronous replication === This is why asynchronous replication exists: in this alternative model we reply to the client confirming its write BEFORE we get acknowledges from other nodes. If you think at it from the point of view of CAP, it is like if the node *pretends* that there is a partition and can’t talk with the other nodes. The information will eventually be replicated at a latter time, exactly like an eventually consistent DB would do during a partition. Usually this latter time will be a few hundred microseconds later, but if the node receiving the write fails before propag[...]

Blog lost and recovered in 30 minutes
Yesterday I lost all my blog data in a rather funny way. When I installed this new blog engine, that is basically a Lamer News slightly modified to serve as a blog, I spinned a Redis instance manually with persistence *disabled* just to see if it was working and test it a bit.

I just started a screen instance, and run something like ./redis-server --port 10000. Since this is equivalent to an empty config file with just "port 10000" inside I was running no disk backed at all.

Since Redis very rarely crashes, guess what, after more than one year it was still running inside the screen session, and I totally forgot that it was running like that, happily writing controversial posts in my blog. Yesterday my server was under attack. This caused an higher then normal load, and Linode rebooted the instance. As a result my blog was gone.

The good thing is that I recovered everything in about 30 minutes because simple systems are really better than complex systems when something bad happens. This blog is composed of posts that are just the verbatim dump of what I write in a text area. No formatting at all. Comments are handled by Disqus and the ID I submit is just the post ID.

All I had to do is to setup a new Redis server (this time with AOF, demonized, and a proper configuration file) and search in google one after the other the posts by URL (which is the same for all the post, only the incremental ID changes). For every post I opened the Google cache of the post, select the text, copy, and submit the new post.

The only thing I lost are the post dates... I could fix them modifying a bit the blog code to allow me to do this, but not sure I'll be able to find the time.

Long story short, this is a trivial example, and an human error, but even in serious well maintained systems shit happens, and when the architecture of something is simple, it is simpler to deal with even during failures.

Without to mention that now I know I don't have to enable backups as I can recovery everything. No, just kidding. Comments

The fight against sexism is not a free pass
Today Joyent wrote a blog post in the company blog about an issue that started with this pull request in the libuv project:

Basically the developer Ben Noordhuis rejected a pull request involving a change in the documentation to use gender-neutral form instead of “him”. Joyent replied with this incredible post:

In the blog post you can read:

“But while Isaac is a Joyent employee, Ben is not—and if he had been, he wouldn't be as of this morning: to reject a pull request that eliminates a gendered pronoun on the principle that pronouns should in fact be gendered would constitute a fireable offense for me and for Joyent.”

A few lines later you can read: “Indeed, one of the challenges of an open source project that depends on volunteer effort isdealing with assholes”

Maybe Joyent is thinking something like, you can’t go wrong if you fight sexism, or something like that, as I can’t believe they had a so naive reaction. Really, we have 10k years of culture for a reason, to be able to discriminate more than that, it is not 2+2=4.

Probably Ben is not a sexist, maybe he believes simply that “him” does not make a difference in sexism every day. Everybody has his fight, and changing “him” in not the Ben fight, but possibly when Ben will have to evaluate a female candidate, he will use a truly meritocratic meter and WILL NOT GIVE A FUCK about the gender, like he did when refusing the pull request.

You can’t bash people that don’t have your vision, and you can’t think that the fight against sexism is a free pass. Joyent, you are liable of your actions, and your actions violate more fundamental civil rights than you think.

But the most disturbing thing is that companies act like that because of fear, the fear of the reaction of people that believe that the world is black or white, and that if you don’t see it the same color they see it, you are to blame, you are to crucify. Comments

Finally Redis collections are iterable
Redis API for data access is usually limited, but very direct and straightforward. It is limited because it only allows to access data in a natural way, that is, in a data structure obvious way. Sorted sets are easy to access by score ranges, while hashes by field name, and so forth. This API “way” has profound effects on what Redis is and how users organize data into it, because an API that is data-obvious means fast operations, less code and less bugs in the implementation, but especially forcing the application layer to make meaningful choices: the database as a system in which you are responsible of organizing data in a way that makes sense in your application, versus a database as a magical object where you put data inside, and then it will be able to fetch and organize data for you in any format. However most Redis data types, including the outer key-value shell if we want to consider it a data type for a moment (it is a dictionary after all), are collections of elements. The key-value space is a collection of keys mapped to values, as Sets are collections of unordered elements, and so forth. In most application logic using Redis the idea is that you know what member is inside a collection, or at least you know what member you should test for existence. But life is not always that easy, and sometimes you need something more, that is, to scan the collection in order to retrieve all the elements inside. And yes, since this is a common need, we have commands like SMEMBERS or HGETALL, or even KEYS, in order to retrieve everything there is inside a collection, but those commands are always a last-resort kind of deal, because they are O(N) operations. Your collection is very small? Fine, use SMEMBERS and you get Redis-alike performances anyway. Your collection is big? Don’t use O(N) commands if not for “debugging” purposes. A popular example is the misused KEYS command, source of troubles for non-experts, and top hit among the Redis slow log entries. Black holes === The problem is that because of O(N) operations, Redis collections, (excluding Sorted Sets that can be accessed by rank, in ranges, and in many other different ways), tend to be black holes where you put things inside, and you can hardly explore them again. And there are plenty of reasons to explore what is inside. For example garbage collection tasks, schema migration, or even fixing what is inside keys after an application bug corrupted some data. What we really needed was an iterator. Pieter Noordhuis and I were very aware of this problem since the early days of Redis, but it was a major desig[...]

New Redis Cluster meta-data handling
This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community. Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters. At every single time only one master should serve a given hash slot. Slaves just replicate the master dataset so that it is possible to fail over a master and put the cluster again into an usable state where all the hash slots are served by one node. Redis Cluster is client assisted and nodes are not capable to forward queries to other nodes. However nodes are able to redirect a client to the right node every time a client tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them. The problem I was trying to solve is, how to take this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slots configuration. Another problem to solve was the slave promotion. A master can have multiple slaves, how to detect, and how to act, when a master is failing and a slave should be promoted to replace it? Metadata is not data ==================== In the case of Redis Cluster handling of metadata is significantly different than the way the user data itself is handled. The focus of Redis Cluster is: 1) Speed. 2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis. 3) The ability to retain most writes originating from clients connected to the majority of masters. Given the priorities, Redis Cluster, like the vanilla single node version of Redis, uses asynchronous replication where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words when a node receives a write, the client most of the times directly talk with the node in charge for the key hash slot, and the no[...]

English has been my pain for 15 years
Paul Graham managed to put a very important question, the one of the English language as a requirement for IT workers, in the attention zone of news sites and software developers [1]. It was a controversial matter as he referred to "foreign accents" and the internet is full of people that are just waiting to overreact, but this is the least interesting part of the question, so I'll skip that part. The important part is, no one talks about the "English problem" usually, and I always felt a bit alone in that side, like if it was a problem only affecting me, so in this blog post I want to share my experience about English. [1] A long story --- I still remember me and sullivan ( both drunk in my home in Milan trying to turn an attack I was working on, back in 1998, in a post that was understandable for BUGTRAQ users, and this is the poor result we obtained: Please note the "Instead all others" in the second sentence. I'm still not great at English but I surely improved over 15 years, and sullivan now teaches in US and UK universities so I imagine he is totally fluent (spoiler warning: I'm still not). But here the point is, we were doing new TCP/IP attacks but we were not able to freaking write a post about it in English. It was 1998 and I already felt extremely limited by the fact I was not able to communicate, I was not able to read technical documentation written in English without putting too much efforts in the process of reading itself, so my brain was using like 50% of its energy to just read, and less was left to actually understand what I was reading. However in one way or the other I always accepted English as a good thing. I always advice people against translation efforts in the topic of technology, since I believe that it is much better to have a common language to document and comment the source code, and actually to obtain the skills needed to understand written technical documentation in English is a simple effort for most people. So starting from 1998 I slowly learned to fluently read English without making more efforts compared to reading something written in Italian. I even learned to write at the same speed I wrote stuff in Italian, even if I hit a local minima in this regard, as you can see reading this post: basically I learned to write very fast a broken subset of English, that is usually enough to express my thoughts in the field of programming, but it is not good enough to write about g[...]

Twilio incident and Redis
Twilio just released a post mortem about an incident that caused issues with the billing system: The problem was about a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slaves setup, with multiple slaves in different data centers for obvious availability and data safety concerns. This is a short analysis of the incident, what Twilio can do and what Redis can do to avoid this kind of issues. The first observation is that Twilio uses Redis, an in memory system, in order to save balances, so everybody will say "WTF Twilio! Are you serious with your data?". Actually Redis uses memory to serve data and to internally manipulate its data structures, but the incident has *nothing to do* with the durability of Redis as a DB. In fact Twilio stated that they are using the append only file that can be a very durable solution as explained here: The incident is actually centered around two main aspects of Redis: 1) The replication system. 2) The configuration. I'll address they two things respectively. Analysis of the replication issue === Redis 2.6 always needs a full resynchronization between a master and a slave after a connection issue between the two. Redis 2.8 addressed this problem, but is currently a release candidate, so Twilio had no way to use the new feature called "partial resynchronization". Apparently the master became unavailable because many slaves tried to resynchronize at the same time. Actually for the way Redis works a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. As soon as the second slave attaches and there is already a background save in progress in order to create the first RDB (used for the bulk data transfer), it is put in a queue with the previous slave, and so forth for all the other slaves attaching. Redis will just produce a single RDB file. However what is true is that Redis may use additional memory with many slaves attaching at the same time, since there are multiple output buffers to "record" to transfer when the RDB file is ready. This is true especially in the case of replication over WAN. In the Twilio blog post I read "multiple data centers" so it is possible that the replication process may be slow in some case. The bottom line is, Redis normally does not need to go slow when multiple slaves are [...]

San Francisco
Yesterday night I returned back home after a short trip in San Francisco. Before memory fades out and while my feelings are crisp enough, I'm writing a short report of the trip. The point of view is that of a south European programmer exposed for a few days to what is probably the most active information technology ecosystem and economy of the world. Reaching San Francisco === If you want to reach San Francisco from Sicily, there are no direct flights helping you. My flight was a Lufthansa flight from Catania to Munich, and finally from Munich to San Francisco. This is a total of 15 hours flight, plus the stop in Munich waiting for the second flight. Unfortunately the first flight had a delay big enough that I lost my connection. My trip was already pretty short, just four days, and this issue costed me one day reducing the total available time in SF to just 3 days... It's like to go to the other side of the world just to take a coffee. However it's very rewarding to see Germans having an issue with precision, so I blamed a lot of people with the most heavy of the Sicilian accent... Just kidding :-) The reality is that the operators at the Catania airport, not payed by Lufthansa, said that among all the big operators Lufthansa is one of the best in terms of avoiding delays. Moreover in Munich I was "protected" in a nice enough hotel with good food and great beer. I was not the only unlucky guy, there were other three north Americans headed to San Francisco that became my journey friends in no time, very cool people that allowed me to practice English some hours before reaching SF, recounted fun stories, and got drunk with me in the Munich -> SF flight. Basically it's a long trip, and reaching the final destination is already an experience in itself. A direct flight would be much better: only one flight, and the process is "atomic", either you remain at the starting point or you get to the final destination, no chances to remain half-way unless the aircraft falls on the floor ;-) The city === I'm a "City guy". In Sicily I live near Catania, that counting all the conurbation is like 1 million of inhabitants, not a lot but not a small town. San Francisco from this point of view is like my ideal city: people outside, gardens, large streets, bike lanes, many places where you can have a meal, and shops. The Hotel, the Intercontinental in Howard street, was also great in every possible way, from the food to the staff that was always very professional and willing to [...]

Exploring synchronous replication in Redis
Redis uses streamed asynchronous replication, that's one of the simplest forms of replication you can imagine: a continuos stream of writes is sent to the slaves, without waiting for the slaves to process the writes in any way before replying to the client. I always gave that almost for granted, as I always assumed Redis was not a good match for synchronous replication, that has an higher latency. However recently I tried to fix another issue with Redis replication, that is, timeouts are all up to the slave. This is how it used to work: 1) Master sends data to slaves. However sometimes there is no data to send (no write traffic). We still need to send something to slaves in order to avoid slaves will detect a timeout. 2) So a master periodically sends PINGs to slaves as well, every 10 seconds by default. 3) Detection of a broken replication link is up to the slaves that will close the connection when a timeout is detected. 4) Masters are able to detect errors in the replication link only when reported by the operating system as a socket error. So the ability of masters to detect errors in the replication link was pretty limited in Redis 2.6, and this is BAD. There are many kind of broken links that will result in no error raised in the socket, but still we end accumulating writes for a slave that is down. The only defense against this was the ability of Redis 2.6 to detect when the output buffer was too big, and close the connection before to use all the available memory as slave output buffers. Pinging back === In order to fix this issue the most natural thing to do is to also ping from slave to master, so that the master can be aware of slaves, otherwise the slave -> master communication is completely zero, as slaves don't reply to write commands sent by a master in any way to save bandwidth. However I was not very happy with sending just PING, since it was possible to send something way more useful, that is, the current *replication offset*. The replication offset is a new concept we have in 2.8 with PSYNC. Basically every master has a 64 bit global counter, about how much replication stream it produced. Moreover the replication stream is identical for all the slaves, so every slave shares the same global replication offset with the master. The replication offset is primarily used by PSYNC, so that slaves can request a partial resynchronization asking the master to send data starting from a given offset, that is, the last offset [...]

Availability on planet Terah
Terah is a planet far away, where networks never split. They have a single issue with their computer networks, from time to time, single hosts break in a way or the other. Sometimes is a broken power supply, other times a crashed disk, or a software issue completely blocking the system. The inhabitants of this strange planet use two database systems. One is imported from planet Earth via the Galactic Exchange Program, and is called EDB. The other is produced by engineers from Terah, and is called TDB. The databases are functionally equivalent, but they have different semantics when a network partition happens. While the database from Earth stops accepting writes as long as it is not connected with the majority of the other database nodes, the database from Terah works as long as the majority of the clients can reach at least a database node (incidentally, the author of this story released a similar software project called Sentinel, but this is just a coincidence). Terah users have setups like the following, with three database nodes and three web servers running clients asking queries to the database nodes ("D" denotes a database node, "C" a client). D D D C C C EDB is designed to avoid problems on partitions like this: D1 \ D D / C1 \ C C C1 writing to D1 may result into lost writes if D1 happened to be the master. However in Terah net splits are not an issue, they invented a solution for all the network partitions back in Galactic Year 712! Still their technology is currently not able to avoid that single hosts fail. There is a sysop from Terah, Daniel Fucbitz, that always complains about EDB. He does not understand why on the Earth… oops on the Terah I mean, his company keeps importing EDB, that causes a lot of troubles. He reasons like this: "If a single node of my network fails, I'm safe with both EDB and TDB, but what happens if one night I'm not lucky and two hosts will fail at the same time?". Actually with EDB if two nodes out of the six nodes will fail during the same night, and these nodes happen to be two "D" nodes, the system will stop working. The probability for this to happen is (3/6)*(2/5), that is... 20%! On the other hand TDB will never stop working as long as only two nodes will fail. And what about three nodes failing at the same time? With EDB this will bring the system down with a probability of 50% [...]

Reply to Aphyr attack to Sentinel
In a great series of articles Kyle Kingsbury, aka @aphyr on Twitter, attacked a number of data stores: [1] Postgress, Redis Sentinel, MongoDB, and Riak are audited to find what happens during network partitions and how these systems can provide the claimed guarantees. Redis is attacked here: I said that Kyle "attacked" the systems on purpose, as I see a parallel with the world of computer security here, it is really a good idea to move this paradigm to the database world, to show failure modes of systems against the claims of vendors. Similarly to what happens in the security world the vendor may take the right steps to fix the system when possible, or simply the user base will be able to recognize that under certain circumstances something bad is going to happen with your data. Another awesome thing in the Kyle's series is the educational tone, almost nothing is given for granted and the articles can be read by people that never cared about databases to distributed systems experts. Well done! In this blog post I'll try to address the points Kyle made about Redis Sentinel, that's the system he tested. Sentinel goals === In the post Kyle writes "What are the consistency and availability properties of Sentinel?". Probably this is the only flaw I saw in this article. Redis Sentinel is a distributed *monitoring* system, with support for automatic failover. It is in no way a shell that wraps N Redis instances into a distributed data store. So if you consider the properties of the "Redis data store + Sentinel", what you'll find is the properties of any Redis master-slave system where there is an external component that can promote a slave into a master under certain circumstances, but has limited abilities to change the fundamental fact that Redis, without Redis Cluster, is not a distributed data store. However it is also true that Redis Sentinel also acts as a configuration device, and even with the help of clients, so as a whole it is a complex system with given behavior that's worth analyzing. What I'm saying here is that just the goal of the system is: 1) To promote a slave into a master if the master fails. 2) To do so in a reliable way. All the stress here is in the point "2", that is, the fact that sentinels can be placed outside the master-slaves system makes the user able to decide a more objectiv[...]

Redis configuration rewriting
Lately I'm trying to push forward Redis 2.8 enough to reach the feature freeze and release it as a stable release as soon as possible. Redis 2.8 will not contain Redis Cluster, and its implementation of Redis Sentinel is the same as 2.6 and unstable branches, (Sentinel is taken mostly in sync in all the branches being fundamentally a different project using Redis just as framework). However there are many new interesting features in Redis 2.8 that are back ported from the unstable branch. Basically 2.8 it's our usual "in the middle" release, like 2.4 was: waiting for Redis 3.0 that will feature Redis Cluster (we have great progresses about it! See, we'll have a 2.8 release with everything that is ready to be released into the unstable branch. The goal is of course to put more things in the hands of users ASAP. The big three new entries into Redis 2.8 are replication partial resynchronizations (already covered in this blog), keyspace events notifications via Pub/Sub, and finally CONFIG REWRITE, a feature I just finished to implement (you can find it in the config-rewrite branch at Github). The post explains what CONFIG REWRITE is. An inverted workflow === Almost every unix daemon works like that: 1) You have a configuration file. 2) When you need to hack the configuration, you modify it and either restart the daemon, or send it a signal to reload the config. It's been this way forever, but with Redis I took a different path since the start: as long as I understood people created "farms" of Redis servers either to provision them on-demand, or for internal usage where a big number of Redis servers are used, I really wanted to provide a different paradigm that was more "network oriented". This is why I introduced the CONFIG command, with its sub commands GET and SET. At the start the ability of CONFIG was pretty basic, but now you can reconfigure almost every aspect of Redis on the fly, just sending commands to the server. This is extreme enough you can change persistence type when an instance is running. For example just typing: CONFIG SET appendonly yes Will switch on the Append Only File, will start a rewrite process to create the first AOF, and so forth. Similarly it is possible to alter from replication to memory limits and policy while the server is running, just interacting with the server with normal commands withou[...]

Hacking Italia
Questo post ha lo scopo di presentare alla comunita' italiana interessata ai temi della programmazione e delle startup un progetto nato attorno ad un paio di birre: "Hacking Italia", che trovate all'indirizzo Hacking Italia e' un sito di "social news", molto simile ad Hacker News, il celebre collettore di news per hacker di YCombinator. A che serve un sito italiano, e in italiano se c'e' gia' molto di piu' e di meglio nel panorama internazionale? A mettere assieme una massa critica di persone "giuste" in Italia. Mettere assieme le persone significa molto, specialmente in un paese stretto e lungo 1500 chilometri, dove le occasioni di incontri tra programmatori e startupper sono ridotte, i finanziatori nascosti in chissa' quali palazzi, inaccessibili ai piu'. Sono 15 anni che faccio questo mestiere e conosco pochissime persone in Italia, e una quantita' in tutto il resto del mondo... e dire che non e' certo un paese dove manca la passione per il codice e per l'innovazione, come la storia ci ricorda. E allora mettersi assieme significa, tanto per iniziare, avere gia' una piccola vetrina di persone a cui presentare la tua idea. Significa anche discutere assieme dei temi che non sono di nessun interesse per chi non opera nel nostro territorio, come le forme societarie e i mille problemi burocratici a cui ci tocca far fronte. Inoltre mentre probabilmente creare un clone dei servizi affermati globalmente, come Youtube o Gmail, per il mercato italiano, e' una operazione senza alcun merito, questo sono significa che non esistono delle startup che potrebbero essere di grande successo e che abbiano come target il territorio italiano: news, ristorazione, business to business, medicina... ci sono infiniti temi che si possono trattare facendo leva sul fatto che le economie di scala consentono, a chi opera in Italia, di fare meglio per gli italiani. Per cui se questi temi sono importanti anche per voi, spargete la voce, registratevi, e date il vostro contributo. Un po' di background per finire. Il progetto e' nato grazie al fatto che da qualche settimana, qui a Catania, abbiamo iniziato ad incontrarci tra programmatori. Prendiamo una birra, e parliamo di hacking, e non solo. Non avrei mai pensato che questo potesse accadere a dire il vero, parlare di cose davvero tecniche (e interessanti) a pochi chilometri da casa mia[...]

Redis with an SSD swap, not what you want
Hello! As promised today I did some SSD testing. The setup: a Linux box with 24 GB of RAM, with two disks. A) A spinning disk. b) An SSD (Intel 320 series). The idea is, what happens if I set the SSD disk partition as a swap partition and fill Redis with a dataset larger than RAM? It is a lot of time I want to do this test, especially now that Redis focus is only on RAM and I abandoned the idea of targeting disk for a number of reasons. I already guessed that the SSD swap setup would perform in a bad way, but I was not expecting it was *so bad*. Before testing this setup, let's start testing Redis in memory with in the same box with a 10 GB data set. IN MEMORY TEST === To start I filled the instance with: ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo Write load in this way is very high, more than half million SET commands processed per second using a single core: instantaneous_ops_per_sec:629782 This is possible because we using a pipeline of 32 commands per time (see -P 32), so it is possible to limit the number of sys calls involved in the processing of commands, and the network latency component as well. After a few minutes I reached 10 GB of memory used by Redis, so I tried to save the DB while still sending the same write load to the server to see what the additional memory usage due to copy on write would be in such a stress conditions: [31930] 07 Mar 12:06:48.682 * RDB: 6991 MB of memory used by copy-on-write almost 7GB of additional memory used, that is 70% more memory. Note that this is an interesting value since it is exactly the worst case scenario you can get with Redis: 1) Peak load of more than 0.6 million writes per second. 2) Writes are completely distributed across the data set, there is no working set in this case, all the DB is the working set. But given the enormous pressure on copy on write exercised by this workload, what is the write performance in this case while the system is saving? To find the value I started a BGSAVE and at the same time started the benchmark again: $ redis-cli bgsave; ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo Background saving started ^Ct key:rand:000000000000 foo: 251470.34 250k ops/sec was the lower number I was able to get, as once copy on write starts to happen, the[...]

Log driven programming is a real productivity booster.
One thing, more than everything else, keeps me focused while programming: never interrupt the flow.

If you ever wrote some complex piece of code you know what happens after some time: your mental model of the software starts to be very complex with different ideas nested inside other ideas, like the structure of your program is, after all.

So while you are writing this piece of code, you realize that because of it you need to do that other change. Something like "I'm freeing this object here, but it's connected to this two other objects and I need to do this and that in order to ensure consistent state".

The worst thing you can do is to interrupt what you are currently doing in order to fix the new problem. Instead just write freeMyObject() and don't care, but at the same time, open a different editor, and write:

* freeMyObject() should make sure to remove references from XYZ bla bla bla.

When you finished with the current function / task / whatever, re-read your notes and implement what is possible to implement. You'll get new ideas or new things to fix, use the same trick again, and log your ideas without interrupting the flow.

In this way parts of the program make sense, individually. You can address the other parts later. This is 100 times better than nested-thinking, where you need to stop, do another task, and return back. Humans don't have stack frames.

For my log I use Evernote because the log needs to have one characteristic: No save, No filenames, Nothing more than typing something. Evernote will save it for you, and is a different physical interface compared to your code editor. This in my experience improves the 2 seconds switch you need to log.

After some time your log starts to be long. When you realize most of it feels old as your code base and your idea of the system evolved, trace a like like this:

-------------------- OLD STUFF ---------------------

And continue logging again. From time to time however, re-read your old logs. You may find some gems. Comments

An idea for Twitter
After the "sexism gate" I started to use my Twitter account only for private stuff in order to protect the image of Redis and/from my freedom to say whatever I want. It did not worked actually since the reality is that people continue to address you with at-messages about Redis stuff.

But the good outcome is that now I created a @redisfeed account that I use in order to provide a stream of information to Redis users that are not interested in my personal tweets not related to Redis. Anyway when I say some important thing regarding Redis with my personal account, I just retweet in the other side, so this is a good setup.

However... I wonder if Twitter is missing an opportunity for providing a better service here, that is, the concept of "channels".

Basically I'm a single person, but I've multiple logical streams of informations:

1) I tweet about Redis.
2) I tweet about other technological stuff.
3) I say things related to my personal life.
4) Sometimes I tweet things in Italian language.

Maybe there are followers interested in just one or a few of these logical channels, so it would be cool for Twitter users to be able to follow only a subset of the channels of another twitter user.

Probably this breaks the idea of simplicity of Twitter, but I'm pretty sure there are ways to present such a feature in an interesting way: by default all users have a single channel and following them in general means to follow all the channels, it is only as a refinement and only if the user created multiple channels that you can fine-tune what you follow and what not, so basically the added complexity would be minimal.

I'm pretty sure that now that Twitter is designed with the average user in mind such a feature will never be implemented actually, without to mention that this may add some serious technological complexity to their infrastructure, but maybe in the long run such a feature may be more vital than we believe now because it is pretty related to the "information diet" concept. Comments

News about Redis: 2.8 is shaping, I'm back on Cluster.
This is a very busy moment for Redis because the new year started in a very interesting way: 1) I finished the Partial Resynchronization patch (aka PSYNC) and merged it into the unstable and 2.8 branch. You can read more about it here: 2) We finally have keyspace changes notifications: Everything is already merged into our development branches, so the deal is closed, and Redis 2.8 will include both the features. I'm especially super excited about PSYNC, as this is a not-feature, simply you don't have to deal with it, the only change is that slaves work A LOT better. I love adding stuff that is transparent for users, just making the system better and more robust. What I'm even more excited about is the fact that now that PSYNC and notifications are into 2.8, I'll mostly freeze it and can finally focus on Redis Cluster. It's a lot of time that I wait to finish Redis Cluster, now it's the right time because Redis 2.6 is out and seems very stable, people are just starting to really discovering it and the ways it is possible to use Lua scripting and the advanced bit operations to do more. Redis 2.8 is already consolidated as release, but requires a long beta stage because we touched the inner working of replication. So I can pause other incremental improvements for a bit to focus on Redis Cluster. Basically my plan is to work mostly to cluster as long as it does not reach beta quality, and for beta quality I mean, something that brave users may put into production. Today I already started to commit new stuff to Cluster code. Hash slots are now 16384 instead of 4096, this means that we are now targeting clusters of ~1000 nodes. This decision was taken because there are big Redis users with already a few hundred of nodes running. Another change is that probably, in order to ship Cluster ASAP, in the first stage I plan to use Redis Sentinel in order to failover master nodes (but Sentinel will be able to accept as configuration a list of addresses of cluster nodes and will fetch all the other nodes using CLUSTER NODES). So basically the first version of Redis Cluster to hit a stable release will have the following features: 1) Automatic partition of key space. 2)[...]

A few thoughts about Open Source Software
For a decade and half I contributed to open source regularly, and still it is relatively rare that I stop to think a bit more about what this means for me. Probably it is just because I like to write code, so this is how I use my time: writing code instead of thinking about what this means… however lately I'm starting to have a few recurring ideas about open source, its relationship with the IT industry, and my interpretation of what OSS is, for me, as a developer. First of all, open source for me is not a way to contribute to the free software movement, but to contribute to humanity. This means a lot of things, for instance I don't care about what people do with my code, nor if they'll release back their modifications. I simply want people to use my code in one way or the other. Especially I want people to have fun, learn new stuff, and *make money* with my code. For me other people making money out of something I wrote is not something that I lost, it is something that I gained. 1) I'm having a bigger effect in the world if somebody can pay the bills using my code. 2) If there are N subjects making money with my code, maybe they will be happy to share some of this money with me, or will be more willing to hire me. 3) I can be myself one of the subjects making money with my code, and with other open source software code. For all this reasons my license of choice is the BSD licensed, that is the perfect incarnation of "do whatever you want" as a license. However clearly not everybody thinks alike, and many programmers contributing to open source don't like the idea that other people can take the source code and create business out of it as a commercial product that is not released under the same license. To me instead many of the rules that you need to follow to use the GPL license are a practical barrier reducing the actual freedom of what people can do with the source code. Also I've the feeling that receiving back contributions it is not too much related to the license: if something is useful people will contribute back in some way, because maintaining forks is not great. The real gold is where development happens. Unfixed, not evolved code bases are worth zero. If you as an open source dev[...]

Dear Redis users, in the final part of 2012 I repeated many time that the focus, for 2013, is all about Redis Cluster and Redis Sentinel. This is exactly what I'm going to do from the point of view of the big picture, however there are many smaller features that make a big difference from the point of view of the Redis user day to day operations. Such features can't be ignored as well. They are less shiny in a feature list, and they are not good to generate buzz and interest in new users, and sometimes boring to code, but they are very important from a practical point of view. So I ended the year and I'm starting the new one spending considerable time on a feature that was long awaited by many users having production instances crunching data every day, that is, the ability for a slave to partially resynchronize with the master without requiring a full resynchronization every time. The good news is that finally today I've an implementation that works well in my tests. This means that this feature will be present in Redis 2.8, so it is the right time to start making users aware of it, and to describe how it works. Some background --- Redis replication is a pretty brutal piece of code in many ways: 1) It works by re-playing on slaves every command that was received in the Redis master that actually produced a change in the data set. 2) From the point of view of slaves, masters are just a bit special clients, but they are almost like normal clients sending commands. No special replication protocol or data format used for replication. 3) It *used to force* a full resynchronization every time a slave connects to a master. This means, at every connection, the slave will receive a copy of the master data set, in form of an RDB file, and load it. Because of this characteristics Redis replication has been very reliable from the point of view of corruption. If you always full-resync, there are little chances for inconsistency. Also it was architecturally trivial, because masters are like clients, no special protocol is used and so forth. Simple and reliable, what can go wrong? Well, what goes wrong is that sometimes even when simplicity is very important, to do an O(N) work w[...]

ADS-B wine cork antenna
# Software defined radio is cool About one week ago I received my RTLSDR dongle, entering the already copious crew of software defined radio enthusiasts. It's really a lot of fun, for instance from my home that is at about 10 km from the Catania Airport I can listen the tower talking with the aircrafts in the 118.700 Mhz frequency with AM modulation, however because of lack of time I was not able to explore this further until the past Sunday. My Sunday goal was to use the RTLSDR to see if I was able to capture some ADS-B message from the aircrafts lading or leaving from the airport. Basically ADS-B is a security device that is installed in most aircrafts that is used for collision avoidance and other stuff like this. Every aircraft broadcasts informations about heading, speed, altitude and so forth. With software defined radio there are a lot of programs in order to demodulate this information (that is encoded in a fairly simple to decode format). I use "modes_rx" that is free software, just Google for it. However this transmissions happen in the 1090 Mhz frequency. My toy antenna nor other rabbit ears antennas I had at home worked at all for this frequency, so I read a few things on Google and tried a simple design that actually works well and takes 10 minutes to build using just a bit of wire and a wine cork. # Cork wine dipole antenna Sorry but I know almost nothing about antennas. However I'll try to provide you the informations I've about the theoretical aspects of this antenna. Technically speaking this antenna is an Half Wavelength Dipole. In practical terms it is two pieces aligned parallel wires with a small space between them, with a total length that is half the wavelength of the frequency I want to listen to. Speed of light = 300 000 000 meters per second Frequency I want to listen to = 1090 Mhz, that is, 1090 000 000 Hertz Wavelength at frequency = 300 000 000 / 1090 000 000 = 275 millimeters. The half of 275 millimeters is 137 millimeters more or less, so this is the length of our antenna: img:// Now you can ask, why half the wavelength? But even for a n00b like me this actually makes a[...]

Partial resyncs and synchronous replication.
Currently I'm working on Redis partial resynchronization of slaves as I wrote in previous blog posts.

The idea is that we have a backlog of the replication stream, up to the specified amount of bytes (this will be in the order of a few megabytes by default).

If a slave lost the connection, it connects again, see if the master RUNID is the same, and asks to continue from a given offset. If this is possible, we continue, nothing is lost, and a full resynchronization is not needed. Otherwise if the offset is about data we no longer have in the backlog, we full resync.

Now what's interesting about this is that, in order to make this possible, both the slave and the master know about a global offset that is the replication offset, since the master was ever started.

Now, if we provide a command that returns this offset, it is simple for a client to simulate synchronous replication in Redis just sending the query, asking for the offset (think about MULTI/EXEC to do that) and then asking the same to the slave. Because Redis replication is very low latency, the client can simply do an optimistic "write, read-offset, read-offset-on-slave" and likely the offset we read on the slave will already be ok to continue (or, we can read it again with some pause).

This is already something that could be useful, but I wonder if we could build something even better starting from that, that is, a way to send Redis a command that blocks as long as the current replication offset was not acknowledged from at least N connected slaves, and returns when this happened with +OK.

I'm not promising that this will be available as we need to understand how useful is this and the complexity, but from an initial analysis this could be trivial to implement fast and reliably... and sounds pretty good.

More news ASAP. Comments

Twemproxy, a Redis proxy from Twitter
While a big number of users use large farms of Redis nodes, from the point of view of the project itself currently Redis is a mostly single-instance business. I've big plans about going distributed with the project, to the extent that I'm no longer evaluating any threaded version of Redis: for me from the point of view of Redis a core is like a computer, so that scaling multi core or on a cluster of computers is the same conceptually. Multiple instances is a share-nothing architecture. Everything makes sense AS LONG AS we have a *credible way* to shard :-) This is why Redis Cluster will be the main focus of 2013 for Redis, and finally, now that Redis 2.6 is out and is showing to be pretty stable and mature, it is the right moment to focus on Redis Cluster, Redis Sentinel, and other long awaited improvements in the area of replication (partial resynchronization). However the reality is that Redis Cluster is not yet production ready and requires months of work. Still our users already need to shard data on multiple instances in order to distribute the load, and especially in order to use many computers to get a big amount of RAM ready for data. The sole option so far was client side sharding. Client side sharding has advantages as there are no intermediate layers between clients and nodes, nor routing of request, so it is a very scalable setup (linearly scalable, basically). However to implement it reliably requires some tuning, a way to take clients configuration in sync, and the availability of a solid client with consistent hashing support or some other partitioning algorithm. Apparently there is a big news in the landscape, and has something to do with Twitter, where one of the biggest Redis farms deployed happen to serve timelines to users. So it comes as no surprise that the project I'm talking about in this blog post comes from the Twitter Open Source division. Twemproxy --- Twemproxy is a fast single-threaded proxy supporting the Memcached ASCII protocol and more recently the Redis protocol: It is written entirely in C and is licensed under [...]

Redis Crashes
Premise: a small rant about software reliability. === I'm very serious about software reliability, and this is not just a good thing. It is good in a sense, as I tend to work to ensure that the software I release is solid. At the same time I think I take this issue a bit too personally: I get upset if I receive a crash report that I can't investigate further for some reason, or that looks like almost impossible to me, or with an unhelpful stack trace. Guess what? This is a bad attitude because to deliver bugs free software is simply impossible. We are used to think in terms of labels: "stable", "production ready", "beta quality". I think that these labels are actually pretty misleading if not put in the right perspective. Software reliability is an incredibly complex mix of ingredients. 1) Precision of the specification or documentation itself. If you don't know how the software is supposed to behave, you have things that may be bugs or features. It depends on the point of view. 2) Amount of things not working accordingly to the specification, or causing a software crash. Let's call these just software "errors". 3) Percentage of errors that happen to be in the subset of the software that is actually used and stressed by most users. 4) Probability that the conditions needed for a given error to happen are met. So what happens is that you code something, and this initial version will be as good as good and meticulous are you as a programmer, or as good is your team and organization if it is a larger software project involving many developers. But this initial version contains a number of errors anyway. Then you and your users test, or simply use, this new code. All the errors that are likely to happen and to be recognized, because they live in the subset of the code that users hit the most, start to be discovered. Initially you discover a lot of issues in a short time, then every new bug takes more time to be found, probabilistically speaking, as it starts to be in the part of the code that is less stressed, or the conditions to make the error evident are unlike[...]

Redis children can now report amount of copy-on-write
This is one of this trivial changes in Redis that can make a big difference for users. Basically in the unstable branch I added some code that has the following effect, when running Redis on Linux systems: [32741] 19 Nov 12:00:55.019 * Background saving started by pid 391 [391] 19 Nov 12:01:00.663 * DB saved on disk [391] 19 Nov 12:01:00.673 * RDB: 462 MB of memory used by copy-on-write As you can see now the amount of additional memory used by the saving child is reported (it is also reported for AOF rewrite operations). I think this is big news for users as instead to see us developers and other Redis experts handwaving about the amount of copy-on-write being proportional to number of write ops per second and time used to produce the RDB or AOF file, now they get a number :-) # How it is obtained? We use the /proc//smaps, so yes, this is Linux only. Basically it is the sum of all the Private_Dirty entries in this file for the child process (actually you could measure it on the parent side and it is the same). I verified that the number we obtain actually corresponds very well with the physical amount of memory consumed during a save, in different conditions, so I'm very confident we provide an accurate information. # Why a number in the log file instead of an entry in the INFO output? Because even before calling wait3() from the parent, as long as the child exits we no longer have this information. So to display this information in INFO we need some inter process communication to move this info from the child to the parent. Not rocket science but for now I avoided adding extra complexity. The current patch is trivial enough that we could backport it into 2.6 for the joy of many users: The log is produced at NOTICE level (so it is displayed by default). Comments[...]

Memory errors and DNS
Memory errors in computers are so common that you can register domain names similar to very famous domain names, but altered in one bit, and get a number of requests per hour: Comments

On Twitter, at Twitter
On Twitter:

@War3zRub1 "Hahaha it's silly how people use Redis when they need a reverse proxy"
@C4ntc0de "ZOMG! Use a real message queue, Redis is not a queue!"
@L4m3tr00l "My face when Redis BLABLABLA..."

Meanwhile *at* Twitter:

OP1: "Hey guys, there is a spike in the number of lame messages today, load is increasing..."
OP2: "Yep noticed, it's the usual troll fiesta trowing shit at Redis, 59482 messages per second right now."
OP1: "Ok, no prob, let's spawn two additional Redis nodes to serve their timelines as smooth as usually".

TL;DR: Comments

Eventual consistency: when, and how?
This post by Peter Bailis is a must read. "Safety and Liveness: Eventual consistency is not safe" [1].


An extreme TL;DR of this is.

1) In an eventually consistent system, when all the nodes will agree again after a partition?
2) In an eventually consistent system, HOW the nodes will agree about inconsistencies?
3) In immediately consistent systems, when I'm no longer able to write? When I'm no longer able to read?


"1" is time (or more practically, conditions needed to merge).
"2" is safety (or more practically, merge strategy).
"3" is availability (or more practically, how much of the system can be down, for me to be still able to write and read). Comments

Welcome to RethinkDB
There is a new DB option out there, I know it took a long time to be developed. While I don't know very well how it works I hope it will be an interesting player in the database landscape.

My initial feeling is that it will compete closely with Riak and MongoDB (the system seems more similar to MongoDB itself, but if it can scale well multi-nodes people that don't need high write availability may pick an immediate consistent database such as RethinkDB instead of Riak for certain applications).

Welcome to RethinkDB :-) Comments

Redis data model and eventual consistency
While I consider the Amazon Dynamo design, and its original paper, one of the most interesting things produced in the field of databases in recent times, in the Redis world eventual consistency was never particularly discussed. Redis Cluster for instance is a system biased towards consistency than availability. Redis Sentinel itself is an HA solution with the dogma of consistency and master slave setups. This bias for consistent systems over more available but eventual consistent systems has some good reasons indeed, that I can probably reduce to three main points: 1) The Dynamo design partially rely on the idea that writes don't modify values, but rewrite an entirely new value. In the Redis data model instead most operations modify existing values. 2) The other assumption Dynamo does about values is about their size, that should be limited to 1 MB or so. Redis keys can hold multi million elements aggregate data types. 3) There were a good number of eventual consistent databases in development or already production ready when Redis and the Cluster specification were created. My guess was that to contribute a distributed system with different trade offs could benefit more the user base, providing a less popular option. However every time I introduce a new version of Redis I take some time to experiment with new ideas, or ideas that I never considered to be viable. Also the introduction of MIGRATE, DUMP and RESTORE commands, the availability of Lua scripting, and the ability to set millisecond expires, for the first time makes possible to easily create client orchestrated clusters with Redis. So I spent a couple of hours playing with the idea of the Redis data model and an eventual consistent cluster that could be developed as a library, without any change to the Redis server itself. It was pretty fun, in this blog post I share a few of my findings and ideas about the topic. [...]

Slave partial synchronization work in progress
You can follow the commits in the next days in the "psync" branch at github: Comments

On the importance of testing your failover solutions.
From an HN comment[1]

"(Geek note: In the late nineties I worked briefly with a D&D fanatic ops team lead. He threw a D100 when he came in every morning. Anything >90 he picked a random machine to failover 'politely'. If he threw a 100 he went to the machine room and switched something off or unplugged something. A human chaos monkey)."

[1] Comments

Client side highly available Redis Cluster, Dynamo-style.
I'm pretty surprised no one tried to write a wrapper for redis-rb or other clients implementing a Dynamo-style system on top of Redis primitives. Basically something like that: 1) You have a list of N Redis nodes. 2) On write, use consistent hashing and write the same thing to M nodes (M configurable). 3) On reads, read from M nodes and pick the most common reply to return to the client. For all the non-matching replies, use DUMP / RESTORE in Redis 2.6 to update the value of nodes that are in the minority. 4) To avoid problems with ordering and complex values, optionally implement some way to lock a key when it's the target of a non plain SET/GET/DEL ... operation. This does not need to be race conditions free, it is just a good idea to avoid to end with keys in desync. OK the fourth point needs some explanation. Redis is a bit harder to distribute in this way compared to other plain key-value systems because there are operations that modify the value instead of completely rewriting it. For instance LPUSH is such a command, while SET instead rewrites the value at every time. When a command completely rebuilds a value, out of order operations are not an huge issue. Because of latency you can still have a scenario like this: CLIENT A> "SET key1 value1" in the first node. CLIENT B> "SET key1 value2" in the first node. CLIENT B> "SET key1 value2" in the second node. CLIENT A> "SET key1 value1" in the second node. So you end with the same key with two different values (you can use vector clocks, or ask the application about what is the correct value). However to restore a problem like this involves a fast write. Instead if the same happens during an LPUSH against lists with a lot of values, a simple last value desync may force the update of the whole list that could be slower (even if DUMP / R[...]

Why it is Awesome to be a Girl in Tech
I'm proud to be mentioned in this well-thought and non-bigot post:

Also in perfect accordance with the hacking culture, the post is sort of an HOWTO for girls that want to be involved in IT. Comments

Designing Redis replication partial resync
In this busy days I had the idea to focus on a single, non-huge, self contained project for some time, that could allow me to work focused as much as hours as possible, and could provide a significant advantage to the Redis community. It turns out, the best bet was partial replication resync. An always wanted feature that consists in the ability to a slave to resynchronize to a master without the need of a full resync (and RDB dump creation on the master side) if the replication link was interrupted for a short time, because of a timeout, or a network issue, or similar transient issue. The design is different compared to the one in the Github feature request that I filed myself some time ago, now we have the REPLCONF command that is just perfect to exchange replication information between master and slave. Btw these are the main ideas of the design I'm refining and implementing. 1) The master has a global (for all the slaves) backlog of user configurable size. For instance you can allocate 100 MB of memory for this. As slaves are served in the replicationFeedSlaves() function this backlog is updated to contain the latest 100 MB of data (or whatever is the configured size of the backlog). 2) There is also a global counter, that simply starts at zero when a Redis instance is started, and is updated inside replicationFeedSlaves(), that is a global replication offset. This offset can identify a particular part of the outgoing stream from the master to the slaves at any given time. 3) If we got no slaves at all for too much time, the backlog buffer is destroyed and not updated at all, so we don't pay any performance/memory penalty if there are no slaves. Of course the buffer also starts unused when a new instance is started and is initialized on[...]

Why Github pull requests lack support for labels?
I love Github issues, it is one of the awesome things at Github IMHO: as simple as possible but actually under the hood pretty full featured.

However one of the things I love more is labels. It is a truly powerful thing to organize issues in a project-specific way. Unfortunately if an issue is a pull request, no labels can be attached. I wonder why.

Also I would love the ability to merge against multiple branches instead of the taget one, directly from the web UI. Comments

If you trust simplicity, this could be a good argument
I assume you already read the AWS report[1] about recent troubles. I think it is a very good argument you could use at work against design complexity and in favor of designing stuff that are at a complexity level where analysis of failure modes and prevention is actually possible.

[1] Comments

On complexity and failure
From a comment on Hacker News:


--- quoted comment ---
Full disclosure: I work for an AWS competitor.
While none of the specific AWS systemic failures may themselves be foreseeable, it is not true that issues of this nature cannot be anticipated: the architecture of their system (and in particular, their insistence on network storage for local data) allows for cascading failure modes in which single failures blossom to systemic ones. AWS is not the only entity to have made this mistake with respect to network storage in the cloud; I, too, was fooled.[1]
We have learned this lesson the hard way, many times over: local storage should be local, even in a distributed system. So while we cannot predict the specifics of the next EBS failure, we can say with absolute certainty that there will be a next failure -- and that it will be one in which the magnitude of the system failure is far greater than the initial failing component or subsystem. With respect to network storage in the cloud, the only way to win is not to play.

[1] Comments

Redis 2.6.1 is out
Achievement unlocked: releasing a Redis version the same day your daughter was born ;-)

But that was a bad issue as there was a bug preventing compilation on pretty old Linux systems that are still pretty widespread (RHLE5 & similar).

Redis 2.6.1 fixes just that issue and is available as usually at as a tar.gz or at github/antirez/redis as a "2.6.1" tag. Comments

Redis Bit Operations Use Case at CopperEgg
I really trust both in the usefulness of Redis bit operations and the fact that our community in the future should have documentation about Redis Patterns. So an article from CopperEgg where a bit operations pattern is described is good for sure :) Comments

Greta was born a few hours ago
25 October 2012 01:06, she is 3350 grams of a funny little thing :-) Comments

Back to technology
It's a more quite time now. Redis 2.6 released, the sexism issue almost forgotten. Time to relax, be wise, and focus on work. Right, but, that's not me. I've a few more things to say about what happened, and to reply to the many people that asked me why I felt "obligated" to stop using my Twitter account as before, with a mix of work, thoughts on technology, and personal stuff. I can change idea easily if it is the case, but this time it was not the case. As much as people that criticised me for my blog post may think that I've a problem, I also think they have huge limits. Oh well, different opinions, I don't like you, you don't like me, I don't freaking care after all. I don't think on the same line as most people alive if that's the matter. So, is a bad reaction about a blog post, that was about an argument I usually don't write about, enough to change my social medias usage? Well, it is not. What shocked me was the *source* of many of the extremely poor replies. In the next hours I started to think more and more about the problem. Wait, I said to myself, that's exactly what happened with tech conversations on Twitter in the previous months, multiple times: sarcasm, insults, poor arguments. Or even more subtle than that: a few months ago there was an episode about somebody in a company competing with Redis making jokes about Redis durability. Again, an odd source for such a joke, but I did not replied at all, after all it is a joke. You don't understand jokes otherwise, never mind if this is actually a way to get zillion of retweets and provide a bad[...]

HN comment about Linus
h2s writes about Linus:

"I love this guy's balanced approach to steering the kernel. Somebody asked whether a bunch of security-related patches would be getting into Linus' tree, and his response was great.
Basically, he spent a few minutes explaining how security people tend to think that problems are either security problems or not worth thinking about. They see things in black and white and only care about increasing security at any cost. He said performance fanatics can be the same in their approach to improving performance, and he tries not to treat security or performance patches as being too massively different from any other types of patches such as ones for correctness.
Also, a big fuck-you to this trend for shoehorning mindless Reddit memes into everything. Who the fuck wastes a question to Linus Torvalds on "Do you like cats?".

Very good points, IMHO.

Yesterday during the Redis conf there was as usually a complain about Redis not binding only to by default. Guess what? Redis is a networked server and in many setups clients are not in the same box. In most setups the server is not exposed to internet at all. So why on the earth to save people that put Redis servers exposed on the internet I should ruin the experience of all the other guys?

Btw the original link of the comment is this: Comments

About the recent EC2 issues.
I don't like people that are using recent EC2 problems to get an easy advantage / marketing. Stuff go down and cloud services are not magical, it is better to adjust the expectations.

But there are other reasons why people IMHO should consider going bare metal.

* EC2 (and similar services) are extremely costly. With 100 euros per month you can rent a beast of a dedicated server with 64 GB of RAM and fast RAID disks.

* As you can see you are not down-time safe, and to be down together with a zillion of other sites may be a good excuse with your boss maybe, but does not change your uptime percentage, so it's a poor shield.

* A few problems you prevent in the sysop side, are translated in issues with the software you run (especially DBs) because of the poor disk performance and in general poor predictability of behaviour.

* It's not bad to understand operations since the start, it is not a wasted effort at all, it is an effort, but it is also a good gym to have a deeper understanding of your production stack.

That said, with the money you save, you are likely to be able to duplicate your entire stack in two bare metal providers easily. This means that you have a disaster-recovery ready architecture that you can switch using DNS if things go as bad as yesterday with EC2. Comments

Redis 2.6 is out!
Redis 2.6 is finally out and I think that now that we reached this point we'll start to see how the advantages of a release that was already exploited in production by few, will turn into a big advantage for the rest of the community. Scripting, bitops, and all the big features are good additions but my feeling is that Redis 2.6 is especially significative as a step forward in the maturity of the Redis implementation. This does not mean that's bug free, it's new code and we'll likely discover bugs especially in the early days as with every new release that starts to be adopted more and more. What I'm talking about when I say Redis 2.6 is more mature is that it is in general a more "safe" system to run in production: latency spikes due to mass-expire events or slow disks are handled much better, the system is more observable with the software watchdog if something goes wrong, MONITOR itself shows commands before their execution and as they were sent by the client, slaves are read-only by default, if RDB persistence fails Redis stops accepting new writes by default, and I can continue with a list of features and fixes that are the result of experience with bad behaviour in production of the 2.4 release. Now the followings are the tasks at hand: * Redis Sentinel * Redis Cluster * Redis 2.8 * All the small advantages that will make 2.8 a safer release compared to 2.6 I don't want to comment too much but let me say this: in the next months you'll see Redis Cluster to become re[...]

Redis Conf live streaming
And there is Pieter Noordhuis on stage right now! Comments

Github: where you see how cool humanity can be.
You are there in the morning with your coffee in front of you, scanning pull requests and bug reports, then you see a conversation around a commit among a few guys that modified the code to make it better, than there is another one suggesting to improve it in another way. You click in the account names and you see this people with their transparent eyes and your trust in humanity is restored. Comments

Mission accomplished: videos talks for Redis Conf...

1) Making videos is in some way harder than doing a talk live.
2) Screen Flow is awesome, but could be improved with more video editing capabilities, apparently you can't "cut" the video.
3) The problem is to upload files when they are big and you have normal ADSL connection :-)

But it feels good to be able to send the video talks a few days in advance, so the conf organizers will be able to perform editing, filter audio if needed, whatever. Comments

Today is the day...
of the final recording of the videos I'll send to the Redis Conf. That was hard! The timing of the conf was not excellent for my attending, but producing the video was also less trivial than I thought, but finally I've the slides, an idea about what to say, and the ScreenFlow skills ;) Maybe after this experience I'll produce some video tutorial of Redis new features as I introduce it, in order to accelerate the adoption of new things in our community.

Now back to work... Comments