Evan Jones' Scratch Pad
Evan Jones - Software Engineer | Computer Scientist

I'm a software engineer at Bluecore in New York. I previously fixed interesting bugs at Twitter, and taught a database class at Columbia as an adjunct. I was a co-founder and CTO of Mitro, a password manager for groups and organizations, along with Vijay Pandurangan.


Preventing server overload: limit requests being processed

Sat, 22 Jul 2017 16:34:16 +0000

Recently at Bluecore I ran into a familiar situation: during brief periods of overload, all requests to a server were timing out. It was a catastrophic failure, where the server completely stopped serving. We added more capacity to prevent overload, but we also wanted to fix the server. Instead of exploding, it should answer as many requests as possible while rejecting the rest.

Catastrophic failures like this usually happen because the server is processing too many requests at the same time. Typically, the server runs out of memory, or requests take so long that whatever is making the request gives up. The solution is to limit the maximum number of concurrent requests, so some fraction of requests fail but the others are processed. The challenge is finding the "right" limit. It can vary 1000× depending on the software and the hardware. It also changes as the system evolves, so today's correct value is tomorrow's performance bottleneck.

I've added these limits to many systems and I'm tired of it. Surely there is some "automatic" solution that we could use by default on all servers? Unfortunately, it turns out that this is hard. I think there is an opportunity for some academic research to figure it out, which I am not about to do. Instead, this article shares what I learned while investigating it, so the next time I am fixing a server that falls over when overloaded, I'll know what to do. Hopefully you will too!

The summary: limit the number of concurrent requests in all servers to avoid cascading failures. You need the limit to be well below the "catastrophic" failure point for your service, but setting it too low will limit the throughput of your system. If I'm forced to guess, I'd start with a limit set to some multiple of the number of CPUs, e.g. maybe 3× the number of CPUs if you assume your requests spend about a third of their time executing code and two thirds of their time waiting (e.g. on other network requests or on disk).
However, it really is best to determine the limits by forcing your service to explode, because the "right" limit can vary enormously. If you want to optimize, you can get lower "tail" latency by adding a queue and using a lower concurrent request limit. However, this requires more code and tuning, so I think a simple concurrent limit is good enough in most cases. If you really want the lowest latency, a FIFO queue with a "drop head" policy should decrease latency in overload scenarios.

The rest of the article attempts to describe why servers fail when they process too many requests, and why limiting the concurrency helps. This is less polished than I would like. I had some benchmarks and model results I wanted to include, but it turns out I've answered the question well enough to satisfy my curiosity. Given that this is a fairly niche topic, I'm not sure there is much value in polishing it further. I figure it is better to publish this as-is than have it never leave my computer.

Why limit requests?

To start, let's discuss the simplest scenario I can think of, since it helps in understanding the basics. Imagine we have a server with a single CPU that processes requests by doing 10 ms of computation. Ideally, this server can process 100 requests per second. If requests arrive at a fixed rate less than or equal to 100 requests per second, the server responds to each request after 10 ms. However, what happens if requests start arriving at 101 requests per second? After one second, the server has only processed 100 requests, so it is now processing 2 requests at the same time. The operating system shares the CPU, so now each request takes 20 ms. The server still responds to 100 requests per second, but the latency has increased. As the overload continues, the server processes more and more concurrent requests, which increases the latency further.
Eventually there are an infinite number of requests in the server, and it takes an infinite amount of time to respond. This is why servers without a limit on the number [...]
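The concurrent request limit described above can be sketched with a non-blocking semaphore: when the server is already at its limit, new requests are rejected immediately instead of piling up. This is a minimal illustration, not production code; the class name and the 503 status code are illustrative choices.

```python
import threading

class ConcurrencyLimiter:
    """Reject requests beyond a fixed concurrency limit instead of queueing them."""

    def __init__(self, max_concurrent):
        self._semaphore = threading.Semaphore(max_concurrent)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the server is "full".
        return self._semaphore.acquire(blocking=False)

    def release(self):
        self._semaphore.release()


def handle_request(limiter, do_work):
    if not limiter.try_acquire():
        return 503  # shed load: fail fast rather than let requests accumulate
    try:
        do_work()
        return 200
    finally:
        limiter.release()
```

The key property is that a rejected request costs almost nothing, so the requests that are admitted still finish in roughly their normal time.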

Quick Hack: Terminal apps in your browser

Sat, 04 Mar 2017 15:38:14 +0000

Google Cloud Shell and the SSH button in the Cloud Platform console are pretty handy: Click a button and you get a terminal prompt in your browser, logged in to the correct server. I find this more convenient than finding the IP and copy-and-pasting into my terminal (although it can be a bit slow, so I still often use a "real" shell). At some point I discovered that this uses hterm, Chrome OS's terminal. I wondered if I could make my own web menu, with links to get a MySQL prompt connected to various database instances. I created a hacky server that hterm can connect to, which I've put on Github. I'm not sure anyone will find it useful. However, it should be easy to customize the demo if you want to build your own web-based terminal menus. Here is what the example demo looks like if you run it. The menu shows a list of hard coded choices (ls, vi, and man bash). Clicking on a choice launches that application. The example here shows vim.


Serverless: Platform-as-a-Service done right?

Mon, 19 Dec 2016 14:47:59 +0000

The excitement around "Serverless Architectures" makes me a bit annoyed, but not because it is a bad idea. I think high-level managed services make it faster to deliver working solutions for most applications than managing hardware, VMs, or containers. I’m annoyed because it is hyped as a new concept. It’s not. Serverless is mostly just a new word for Platforms-as-a-Service. The main difference is market timing (and marketing: serverless is a more thought-provoking term than PaaS). Today, building on cloud platforms is normal, and everyone has realized, consciously or not, that they are "locked-in" to their provider. As a result, using a proprietary platform is no longer so scary. This is very different from nine years ago, when Heroku (released November 2007) and Google App Engine (released April 2008) were getting started. At the time, people were afraid of writing apps they couldn’t run anywhere else. Now, since our apps are already trapped on AWS, we might as well take advantage of whatever we can to make our lives easier.

The one potential technical difference is that Serverless means you don’t need to control the number of instances of an application you are running. In Adrian Cockcroft’s words, "If your PaaS can efficiently start instances in 20ms that run for half a second, then call it serverless." Mike Roberts argues that a Serverless platform is one that manages scaling for you. With these definitions, Google App Engine is unquestionably a Serverless platform, as it is capable of rapid automatic scaling, but Heroku may not be. I’d argue that Heroku is close enough: adjusting a slider in a web interface, or running a command-line tool to change the configuration, is basically managed scaling. The platform still handles the mechanics, but chooses not to make the decision about when to scale.

I've spent the last nine months working on an application that runs primarily on Google App Engine. I have lots of specific complaints about that platform, but it still feels like developing software in the future. We pay a premium for computing resources compared to "bare" VMs or co-location. However, we get a tightly integrated development platform that handles essentials like deployment and rollback, log collection, and security. For a huge number of applications, it is simply a waste of time to configure these things yourself, and using AWS Lambda, Heroku, or App Engine is going to reduce the "time to solution."

There will always be applications, or parts of applications, that consume large amounts of computing resources. For these pieces, a fully managed platform may not be appropriate. In order to implement "magical" scaling, these platforms impose a lot of limits that you may need to work around. Similarly, at a certain scale, the cost premium for computing resources may eventually outweigh the development cost advantage from these platforms. In either of these cases, you will want to move parts of your application to other environments. The good news is that with most of the large cloud providers, there are lots of other options. If you select the APIs you use carefully, it should always be possible to move chunks of your code to a more cost efficient environment.

In conclusion: write your apps using the highest-level platform you can. It will save you time and money. You can even call it serverless if you would like. Please, just don’t call it new.

BigQuery Tools: Dataset and table size report

Mon, 28 Nov 2016 13:21:18 +0000

Google BigQuery is a fantastic tool. One of its most innovative features is how you pay for it. You pay for storage at rates that are competitive with S3 and Google Cloud Storage, and you pay for the data accessed when you execute a query. Since you don't pay for computation unless you actually use it, I've used BigQuery to store a lot of things just in case they need to be queried. This is a great example of how a pricing model can strongly influence how you use infrastructure. However, one of the consequences is a growing BigQuery storage bill. Unfortunately, Google only provides the daily aggregate cost. I wanted to know more specifically where the cost was going. As a result, I built BigQuery Tools: a storage cost dashboard. It shows you a breakdown of where your BigQuery storage cost is going, per dataset and per table. If you try the tool, please email me to tell me what you think. I've also released the source code on Github.
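As a rough sketch of the kind of per-dataset breakdown such a dashboard computes, here is how you might aggregate table sizes into a monthly storage cost. The input format and the $0.02/GB/month rate are assumptions for illustration only; check BigQuery's current pricing and metadata APIs.

```python
from collections import defaultdict

# Illustrative rate, not a quote of actual BigQuery pricing.
PRICE_PER_GB_MONTH = 0.02

def storage_cost_by_dataset(tables):
    """Aggregate per-table sizes into a monthly cost estimate per dataset.

    tables: iterable of (dataset_id, table_id, size_bytes) tuples, the kind
    of metadata BigQuery exposes for its tables.
    """
    costs = defaultdict(float)
    for dataset_id, _table_id, size_bytes in tables:
        costs[dataset_id] += size_bytes / 1e9 * PRICE_PER_GB_MONTH
    return dict(costs)
```

Sorting the result by cost immediately shows which datasets dominate the bill, which is the question the daily aggregate number can't answer.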

Tracing a Python performance bug on App Engine

Sun, 30 Oct 2016 13:27:44 +0000

I work at Bluecore where, like any growing application, we periodically run into interesting performance problems. Most of our service runs on App Engine, which has pretty good automatic scaling. That causes most performance issues to show up as large bills, rather than production failures. This is good news for our customers, but bad news for Bluecore, so we keep a close eye on our costs. While investigating a cost increase, we discovered that the Datastore was taking nearly a second to return 20 objects, instead of the normal 100 ms. In this article, I'll talk about how we used traces and logging to find the problem, then monkey patched Google's code to fix it. I think there are some generally applicable lessons, but the details may only be interesting to the unfortunately small number of people who use App Engine.

TL;DR: App Engine's older Python Datastore library is very slow when serializing big objects. Traces helped make the problems visible, but logs pinpointed the issue.

We store a lot of data in the Google Cloud Datastore, which is a slightly weird key/value store with lots of features. For the purposes of this discussion, all that matters is that you can look up objects by key. We had implemented a new feature that read a batch of 20 objects by key, which we expected to take no more than around 100 ms. Unfortunately, our monitoring showed this new feature was very slow, sometimes taking more than a second. App Engine traces a fraction of all requests, so we found a slow request and looked at the trace. We saw something like the following: a backend task taking around two seconds to return a result. The selected Datastore get took 13 ms, and the details on the right show it returned 10 entities. Around 300 ms later, the application does a Datastore put. In the worst cases, we saw a gap after a get of up to a second. So what the heck was happening?
To get more detail, the trace tool can also show us our application logs. The gap between the Datastore get and the log message is about 250 ms, so we are still missing a lot of time. However, the requests combined with the log messages told us approximately which chunk of code was responsible. To track it down further, we logged the execution time of small pieces of the suspicious code. It turns out that the db.get API call was taking basically all the time. This was confusing at first: the trace showed the Datastore get returning after 100 ms, but our logs showed the function call taking a second. I suspected this meant the Datastore was giving our application the data in the ~100 ms reported in the trace, but whatever the Python library was doing after that was slow.

While App Engine does have some private implementation pieces, the bulk of the code appears to be what they provide with the development kit. I walked through the code to see if I could find anything obviously slow. It turns out the older db library we use is built on an undocumented API called datastore.GetAsync. This returns an Entity object, which the db library then converts to the application's custom Model class. Since we were now down to a single function call, I wrote a microbenchmark, which thankfully demonstrated the problem and showed that calling datastore.GetAsync directly was about 2X faster. We tried deploying that fix.

While this did run significantly faster, there were still cases that were extremely slow. We logged the object keys for slow requests, then investigated using the microbenchmark. We found that the slow objects had more than 100 attributes, and that more attributes made deserialization slower. Looking more carefully at the implementation, it became clear why: the code converts all attributes from bytes to Python objects. However, for our feature, we only needed about four or five attributes, so this was wasted effort. Unfortunately, there was no easy API that skipped this step.
You can't change the core libraries on App Engine, since Google [...]

fork() without exec() is dangerous in large programs

Wed, 17 Aug 2016 00:42:22 +0000

The UNIX fork() system call seems like an elegant way to process tasks in parallel: it creates a copy of the current process, allowing tasks to read the same memory without needing fine-grained synchronization. Unfortunately, you need to use it extremely carefully in large programs, because fork in a multi-threaded program can easily cause deadlocks in the child process. In the child process, only the thread that called fork continues running; the other threads no longer exist. If a thread was holding a lock, it will remain locked forever [1, 2, 3]. The next call to a function that needs a lock, like malloc, can deadlock waiting on a thread that is not running.

Most large programs today use threads, either intentionally or accidentally, since many libraries use threads without making it obvious. This makes it very easy for someone to add threads to one part of the program, which can then cause rare hangs in forked child processes in some other part, far removed from where the threads were added. Chrome, which intentionally uses multiple processes and threads, has run into this issue, which I think is good evidence that it is hard to get right, even if you are aware this is a potential problem.

At Bluecore, we ran into this accidentally. We don't explicitly use threads or fork anywhere, but some libraries and tools that we use do. The original problem was that our test suite started hanging on Mac OS X. It turns out that many Mac OS X system libraries use libdispatch (a.k.a. Grand Central Dispatch) for inter-process communication with system processes. In an attempt to avoid rare and hard-to-reproduce bugs, libdispatch explicitly "poisons" the child process when forking, causing the next use to crash. This caused our unit test suite to hang, waiting for children that were dead. In this article, I'll describe how you can safely use fork if you really must, as well as a deep dive into this specific Python crash, and why it really has no workaround.
How to use fork safely

I have three suggestions, in priority order:

1. Only use fork to immediately call exec (or just use posix_spawn). This is the least error-prone. However, you really do need to immediately call exec: most other functions, including malloc, are unsafe. If you need to do some complex work, you must do it in the parent, before the fork (example). Others support this opinion. Using posix_spawn is even better, since it is more efficient and more explicit.

2. Fork a worker at the beginning of your program, before there can be other threads, then tell this worker to fork additional processes. You must ensure that nothing accidentally starts a thread before this worker is started. This is actually complicated in large C++ programs, where static constructors run before the main function (e.g. the Chrome bug I mentioned above). To me, this doesn't seem to have many advantages over calling exec() on the same binary.

3. Only use fork in toy programs. The challenge is that successful toy programs grow into large ones, and large programs eventually use threads. It might be best just to not bother.

Hanging Python unit tests

When I ran into this problem, I was just trying to run all of Bluecore's unit tests on my Mac laptop. We use nose's multiprocess mode, which uses Python's multiprocessing module to utilize multiple CPUs. Unfortunately, the tests hung, even though they passed on our Linux test server. I figured we had some trivial bug, so I ran subsets of our tests until I isolated the problem to a single directory. Strangely, each test worked when run by itself; it was only when I ran the whole directory with nose that it got stuck. It was the --parallel=4 option, which uses multiple processes, that caused the hang. The parent process was waiting for a message from a child worker, but looking at the children using ps showed that the worker had already exited. After adding a ton of print messages, I found that the child proce[...]
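Suggestion 1 above (only fork to immediately exec) is what Python's subprocess module already does for you: it forks and immediately execs a fresh interpreter, so none of the parent's threads or lock state can leak into the child. A minimal sketch, with an illustrative helper name:

```python
import subprocess
import sys

def run_isolated(code):
    """Run a snippet of Python in a brand-new process via fork()+exec().

    Because the child execs a fresh interpreter, it never runs with the
    parent's threads, so there are no inherited locks to deadlock on.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The cost is that the child shares no memory with the parent, so any data it needs must be passed explicitly (arguments, pipes, files), which is exactly the discipline that makes this pattern safe.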

Corrupt data over TCP: It was a kernel bug!

Fri, 12 Feb 2016 18:45:49 +0000

I wrote previously about how it is possible for an application to receive corrupt data over a network, even though there is both an Ethernet and a TCP checksum. The answer is that there was a Linux kernel bug that wasn't checking the checksum! A large team at Twitter was involved in debugging this, and Vijay Pandurangan and I wrote a kernel patch (removing 2 lines of code). Read Vijay's write-up for all the details.

One lesson that I will remember from debugging this: network corruption is, on average, very rare. On most days, across Twitter's entire machine fleet, the machines never see a packet with a corrupt TCP checksum. However, if a hardware failure occurs, a huge number of packets can be corrupt. The usual error model, which assumes corruption is evenly distributed, is wrong. In our case, about 10% of packets passing through a bad switch were corrupt, and something like 0.5% had errors in two bits. This means it is very unlikely, but possible, that this corruption could go undetected by the TCP checksum. If anything, this has made me even more convinced that you probably shouldn't rely on the TCP checksum to protect your data across a network. This is a bit paranoid, but it's easy and cheap to attach a CRC32C to your message and make it nearly impossible for bad things to happen. You probably should be encrypting your data anyway, which ensures an even stronger check. Data corruption causes very expensive and time-consuming failures. This is definitely a case where an ounce of prevention is worth 100 pounds of cure.
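Attaching a CRC to each message is a few lines of code. Here is a sketch of framing a payload with a checksum and verifying it on receipt. Note that Python's standard library exposes plain CRC-32 (`zlib.crc32`), not the CRC32C variant recommended above; for CRC32C you would pull in a third-party library. The function names are illustrative.

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    # Prepend a 4-byte big-endian CRC of the payload.
    # Note: zlib.crc32 is CRC-32, not the hardware-accelerated CRC32C
    # the article recommends; the framing pattern is the same.
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unframe(message: bytes) -> bytes:
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: payload corrupted in transit")
    return payload
```

The receiver detects corruption end-to-end, independent of whatever checks the network hardware did or did not perform along the way.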

Fixing Java's ByteBuffer native memory "leak"

Sun, 27 Dec 2015 19:42:36 +0000

TL;DR: The Java NIO API caches a maximum-sized direct ByteBuffer for each thread, which looks like a native memory leak if you read or write large blocks from many threads. You can easily patch the JDK yourself to work around this problem. Always use direct ByteBuffers with Java NIO APIs for the best performance, and to avoid this "leak." Under the covers, heap ByteBuffers are copied to temporary direct ByteBuffers on each I/O call. [Update 2016-02-10: JDK 9 has a property to control this (thanks Tony!). Run services with -Djdk.nio.maxCachedBufferSize=262144 to avoid this problem.]

The full story

The Java NIO APIs use ByteBuffers as the source and destination of I/O calls, and they come in two flavours. Heap ByteBuffers wrap a byte[] array, allocated in the garbage-collected Java heap. Direct ByteBuffers wrap memory allocated outside the Java heap using malloc. Operating system calls require "native" memory, since it won't be moved by the garbage collector, so when you use a heap ByteBuffer for I/O, it is copied into a temporary direct ByteBuffer. The JDK caches one temporary buffer per thread, without any memory limit. As a result, if you call I/O methods with large heap ByteBuffers from multiple threads, your process can use a huge amount of additional native memory, which looks like a native memory leak. This can cause your process to unexpectedly run into memory limits and get killed.

Our team at Twitter ran into this issue. We had a process that would slowly use more and more memory, until it hit its limit and was killed. It turns out that Finagle responses are currently contained in heap ByteBuffers, triggering this issue. (Finagle will eventually switch to a new version of Netty, which will avoid the issue by using direct ByteBuffers.) To work around the problem, Twitter's JVM team added a flag to our internal version to limit the size of this cache.
However, it turns out you can easily replace one of the JDK classes for a single program, which makes it easy to avoid this native memory leak by following the steps below. I've sent an email to the nio-dev mailing list to see if we can limit the size of this cache. In the meantime, if you are affected by this, you can try my workaround.

Demonstrating the leak

I wrote a program that writes to /dev/null from multiple threads with both heap and direct ByteBuffers. It shows that direct ByteBuffers work as you would expect: they are garbage collected when they are unused. However, heap ByteBuffers cause direct ByteBuffers to be allocated and cached until the threads exit. You can also use this program to show that my quick-and-dirty patch below avoids the leak. I've put the code in a Github repository, and the README includes sample output.

The code behind the leak

When you pass a ByteBuffer to an I/O API, there are checks to copy heap ByteBuffers to a temporary direct ByteBuffer before making the actual system call. For example, for network I/O you use a SocketChannel, which is actually an instance of sun.nio.ch.SocketChannelImpl. Reading from a socket calls IOUtil.read, and writing calls IOUtil.write. Both methods check if the ByteBuffer is a direct ByteBuffer. If it is not, they allocate a temporary direct ByteBuffer by calling Util.getTemporaryDirectBuffer, copy the data, then call the "real" readIntoNativeBuffer or writeFromNativeBuffer implementations. The leak itself is in Util.getTemporaryDirectBuffer, which caches the maximum-sized buffer for each thread.

Patching the leak

Tony Printezis submitted a version of the patch he wrote for Twitter, which has been merged into JDK 9. I suggest running all services with -Djdk.nio.maxCachedBufferSize=262144 to ensure the JDK doesn't cache buffers larger than 256 kB. I would really love to have this set as the default, but unfortunately that seems unlikely. However, if you are r[...]
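The caching pattern behind the "leak" is easy to reproduce in any language: each thread keeps its largest-ever temporary buffer, so many threads that each do one big I/O retain that memory forever. A Python sketch of the same pattern (the names are illustrative, not the JDK's):

```python
import threading

# Per-thread storage, analogous to the JDK's per-thread buffer cache.
_tls = threading.local()

def get_temporary_buffer(size):
    """Return a per-thread scratch buffer of at least `size` bytes.

    The cached buffer grows to the largest size ever requested by this
    thread and is never shrunk -- which is exactly why a single large
    request per thread looks like a native memory leak.
    """
    cached = getattr(_tls, "buffer", None)
    if cached is None or len(cached) < size:
        _tls.buffer = bytearray(size)
    return _tls.buffer
```

A bounded version would simply refuse to cache buffers above some threshold, which is what the `jdk.nio.maxCachedBufferSize` property added in JDK 9 does.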

How both TCP and Ethernet checksums fail

Mon, 05 Oct 2015 20:15:12 +0000

At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped, and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One "lucky" packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared. [Update 2016-02-12: Root cause found: this also involved a kernel bug!]

I was very excited to hear about this error, because it is a real-world example of something I wrote about seven years ago: the TCP checksum is weak. However, the Ethernet CRC is strong, so how could a corrupt packet pass both checks? The answer is that the Ethernet CRC is recalculated by switches. If the switch corrupts the packet and the packet keeps the same TCP checksum, the hardware blindly computes a new, valid Ethernet CRC when it goes out. As Mark Callaghan pointed out, this is a very rare scenario, and you should never blame the network without strong evidence. However, it isn't impossible, and others have written about similar incidents.

My conclusion is that if you are creating a new network protocol, please append a 4-byte CRC (I suggest CRC32C, implemented in hardware on recent Intel, AMD, and ARM CPUs). An alternative is to use an encryption protocol (e.g. TLS), since these include cryptographic hashes (which fixed a similar incident). The rest of this article describes the details of how this is possible, mostly so I don't forget them.

Properties of the TCP checksum

The TCP checksum is two bytes long. It can detect any burst error of 15 bits, and most burst errors of 16 bits (excluding a swap of 0x0000 and 0xffff). This means that to keep the same checksum, a packet must be corrupted in at least two locations, at least 2 bytes apart.
If the corruption is purely random, we should expect approximately 1 in 2^16 (about 0.0015%) of corrupt packets to go undetected. This seems small, but on one Gigabit Ethernet connection, that could be as many as 15 packets per second. For details about how to compute the TCP checksum and its error properties, see RFC 1071.

Properties of the Ethernet CRC

The Ethernet CRC is substantially stronger, partly because it is twice as long (4 bytes), and partly because CRCs have "good" mathematical properties, such as detecting all 3-bit errors in 1500-byte Ethernet packets (understanding why is beyond my math skills). It appears that most switches discard packets with invalid CRCs when they are received, and recalculate the CRC when the packet goes back out. This means the CRC really only protects against corruption on the wire, not inside the switch. Why not just re-send the existing CRC? Modern switch chips have features that modify packets, such as VLANs or explicit congestion notification, so it is simpler to always recompute the CRC. For a detailed description, see Denton Gentry's description of how the Ethernet CRC doesn't protect very much.

There is one small complication that does not change this cause of failure, but does change how you might detect it. Some switches support cut-through switching, where packets begin being forwarded as soon as the destination address is read, without waiting for the entire packet. In this case, the switch is already sending the packet before it can validate it, so it absolutely cannot recalculate the CRC. These switches typically support something called "CRC stomping" to ensure the outgoing CRC is invalid, so the ultimate receiver will eventually discard it. This gets more complicated when a destination port is already busy when a new packet arrives. In this case, cut-through switches must buffer packets, and then act like [...]
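The "two locations" property is easy to see with a small implementation of the 16-bit Internet checksum from RFC 1071: sum the data as 16-bit words in ones'-complement arithmetic, then take the ones' complement of the result. This is a sketch for illustration, not an optimized implementation.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum per RFC 1071 (as used by TCP)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF
```

Because the checksum is just a sum, two compensating corruptions cancel out: incrementing one 16-bit word and decrementing another leaves the checksum unchanged, which is exactly the undetected-corruption scenario described above.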

Naive Retries Considered Harmful

Mon, 28 Sep 2015 13:29:13 +0000

In distributed systems, a standard technique to recover from errors is to resend a message if a response was not received before a timeout. Protocols such as 802.11, TCP, and many applications built on HTTP rely on this mechanism. However: you absolutely must not naively retry all requests as soon as a timeout expires. This common mistake creates a feedback loop that makes every slightly overloaded service get swamped with a huge spike of requests. Instead, you must "back off" to avoid overloading the destination during a failure. At the moment, I think a good policy is to send a "backup request" after the 95th percentile latency, wait for both until an appropriate timeout, and never retry more than 10% of requests within a 5 minute interval. I'll explain why the naive policy is bad with my favorite technique: proof by example, using real-world failures from Twitter and AWS.

Let's first consider a very common but naive retry policy. The client sends the request and waits up to 100 milliseconds. If the timeout expires, it cancels that request and sends another, waiting another 100 ms (e.g. Finagle's RetryingService with a TimeoutFilter). As traffic to the service increases, the response time also increases, eventually causing some requests to time out. The client retries, so the server receives more requests, making it even slower and causing more timeouts. This feedback loop ends with the server receiving twice the number of requests, causing it to be catastrophically overloaded.

As a real-world example, the graph below shows the requests per second arriving at a service at Twitter, during a situation where I got paged. The top line shows the total requests per second, while the others are the rates from each "upstream" service. Around 11:30, the traffic increases dramatically because another data center failed over for scheduled maintenance. Everything seems fine for nearly 30 minutes, although the traffic from Service B is actually slowly increasing.
I believe this increase was caused by Service B timing out and retrying. At 12:00, this service had become slow enough to cause nearly all requests from Service B to time out, causing a massive spike in traffic. This caused most of the instances of this service to die (I no longer recall exactly why). The total traffic then drops, since there is nothing to receive and record it. Thankfully, Service B has a policy where after enough requests fail, it stops sending entirely, then slowly ramps back up. This allowed the instances to restart after the failure. However, the service was then in an "interesting" state, where the traffic from Service B would slowly ramp up, overload the service, then hammer it with a huge spike. This would cause lots of failures, so it would back off and repeat the process. Each spike caused a "downstream" service to get overloaded and restart. This situation persisted for two hours, until we figured out how to limit the load on the downstream service. This allowed it to survive the next spike by rejecting most of the requests, and the system stabilized.

Service B's retry policy in this case was both good and bad. The good part is that after many requests fail, it stops sending, then ramps back up slowly (a bit like TCP congestion control). This allowed the service to recover after each spike. The bad part is that each spike sent double the normal traffic, and killed our service.

[Update 2015-09-28]: As a second real-world example, a few days before I wrote this, Amazon's DynamoDB key-value store service had a serious outage. In their postmortem, they describe how retries during this outage caused heavy load, so that "healthy storage servers [...] were having subsequent renewals fail and were transitioning back to an unavailable state.&quo[...]
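A less naive policy than retry-on-timeout can be sketched in a few lines: bound the number of attempts, and space retries with exponential backoff plus jitter so clients don't retry in synchronized waves. The parameter values here are illustrative, not a recommendation, and a production client would also enforce a global retry budget like the 10%-per-interval rule above.

```python
import random

def backoff_delays(base=0.1, factor=2.0, attempts=4, rng=random.random):
    """Exponentially growing delays, each jittered to between 50% and 100%
    of its nominal value, so retries from many clients spread out in time."""
    return [base * (factor ** i) * (0.5 + 0.5 * rng()) for i in range(attempts)]

def call_with_retries(operation, max_attempts=3):
    """Bounded retries: give up after max_attempts instead of hammering a
    struggling server with an unbounded stream of resends."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError as error:
            last_error = error
            # A real client would sleep for backoff_delays()[attempt] here,
            # and consult a retry budget before resending at all.
    raise last_error
```

The crucial difference from the naive policy is that total load is bounded: in the worst case the server sees `max_attempts` times the offered load briefly, rather than a sustained doubling that feeds back on itself.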