Posts on Dieter's blog


Practical fault detection: redux. Next-generation alerting now as presentation

Sat, 10 Dec 2016 19:13:03 +0000

This summer I had the opportunity to present my practical fault detection concepts and hands-on approach as conference presentations: first at Velocity, then at SRECon16 Europe. The latter page also contains the recorded video. If you're at all interested in tackling non-trivial timeseries alerting use cases (e.g. working with seasonal or trending data), this video should be useful to you. It's basically me trying to convey, in a concrete way, why I think the big-data and math-centered algorithmic approaches come with a variety of problems that make them unrealistic and unfit, whereas the real breakthroughs happen when tools recognize the symbiotic relationship between operators and software, and focus on supporting a collaborative, iterative process for managing alerting over time. There should be a harmonious relationship between operator and monitoring tool, leveraging the strengths of both sides, with minimal factors harming the interaction.

From what I can tell, bosun is pioneering this concept of a modern alerting IDE and is far ahead of other alerting tools in terms of providing high alignment between alerting configuration, the infrastructure being monitored, and individual team members, all of which are moving targets, often fast-moving ones. In my experience this results in high signal/noise alerts and a happy team. (According to Kyle, the bosun project leader, my take is a useful one.) That said, figuring out the tool and using it properly has been, and remains, rather hard. I know many who would rather not fight the learning curve. Recently the bosun team has been making strides at making it easier for newcomers - e.g. reloadable configuration and Grafana integration - but there is lots more to do. Part of the reason is that some of the UI tabs aren't implemented for non-OpenTSDB databases, and integrating Graphite, for example, into the tag-focused system that is bosun is bound to be a bit weird. (That's on me.)

For an interesting juxtaposition, we released Grafana v4 with alerting functionality, which approaches the problem from the complete other side: simplicity and a unified dashboard/alerting workflow first, more advanced alerting methods later. I'm doing what I can to make the ideas of both projects converge, or at least make the projects take inspiration from each other and combine the good parts. (Just as I hope to bring the ideas behind graph-explorer into Grafana, eventually…)

Note: one thing that somebody correctly pointed out to me is that I've been inaccurate with my terminology. Machine learning and anomaly detection can be as simple or complex as you want to make them. In particular, what we're doing with our alerting software (e.g. bosun) can rightfully also be considered machine learning, since we construct models that learn from data and make predictions. It may not be what we think of at first, but even a simple linear regression is a machine learning model. So most of my critique was about the big-data approach to machine learning, rather than machine learning itself. As it turns out, the key to applying machine learning successfully is tooling that assists the human operator in every possible way, which is what IDEs like bosun do, and that is how I should have phrased it, rather than presenting it as an alternative to machine learning. [...]

Restoring accidental git force push overwrite on GitHub if you don't have the needed commits locally

Mon, 14 Nov 2016 11:33:03 +0200

I like cleaning git history, in feature branches at least. The goal is a set of logical commits without other cruft, that can be cleanly merged into master. This can be easily achieved with git rebase and force pushing to the feature branch on GitHub. Today I had a little accident and found myself in this situation: I accidentally ran

git push origin -f

instead of my usual

git push origin -f branchname
git push origin -f HEAD

This meant that I not only overwrote the branch I wanted to update, but also, by accident, a feature branch (called httpRefactor in this case) to which a colleague had been force pushing various improvements which I did not have on my computer. And my colleague is on the other side of the world, so I didn't want to wait until he woke up. (If you can talk to someone who has the commits, just have them re-force-push; that's quite a bit easier than this.) It looked something like so:

$ git push origin -f
+ 92a817d...065bf68 httpRefactor -> httpRefactor (forced update)

Oops! So I wanted to reset the branch on GitHub to what it should be, and it would also be nice to update the local copy on my computer while we're at it. Note that the commit (or rather the abbreviated hash) on the left refers to the commit that was the latest version on GitHub, i.e. the one I did not have on my computer. A little strange if you're too accustomed to git diff and git log output showing hashes you have in your local repository.

Normally in a git repository, the objects dangle around until git gc is run, which clears any commits except those reachable by any branches or tags. I figured the commit was probably still in the GitHub repo (either because it's dangling, or perhaps there's a reference to it that's not public, such as a remote branch), so I just needed a way to attach a regular branch to it (either on GitHub, or fetch it somehow to my computer, attach the branch there and re-force-push). Step one was finding it on GitHub.

The first obstacle was that GitHub wouldn't recognize this abbreviated hash anymore: browsing to the commit URL resulted in a 404 commit not found. Now, we use CircleCI, so I could find the full commit hash in the CI build log, and once I had it, the commit page on GitHub showed it. An alternative way of opening a view of the dangling commit we need is using the reflog syntax. Git reflog is a pretty sweet tool that often comes in handy when you've made a bit too much of a mess in your local repository, but it also works on GitHub: if you navigate to the commits view for httpRefactor@{1}, you will be presented with the commit that the branch head was at before the last change, i.e. the missing commit, 92a817d in my case.

Then follows the problem of re-attaching a branch to it. Running git fetch --all on my laptop doesn't seem to fetch dangling objects, so I couldn't bring the object in. Then I tried to create a tag for the non-existent object. I figured the tag may not reference an object in my repo, but it would on GitHub, so if only I could create the tag - manually if needed (it seems to be just a file containing a commit hash) - and push it, I should be good. So:

~/g/s/g/r/metrictank ❯❯❯ git tag recover 92a817d2ba0b38d3f18b19457f5fe0a706c77370
fatal: cannot update ref 'refs/tags/recover': trying to write ref 'refs/tags/recover' with nonexistent object 92a817d2ba0b38d3f18b19457f5fe0a706c77370
~/g/s/g/r/metrictank ❯❯❯ echo 92a817d2ba0b38d3f18b19457f5fe0a706c77370 > .git/refs/tags/recover
~/g/s/g/r/metrictank ❯❯❯ git push origin --tags
error: refs/tags/recover does not point to a valid object!
Everything up-to-date

So this approach won't work. I can create the tag, but not push it, even though the object exists on the remote. So I was looking for a way to attach a tag or [...]
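One way to attach a ref to a dangling commit without having it locally is GitHub's create-a-reference API, which accepts any SHA the remote already has. Here is a minimal sketch in Go; OWNER, REPO, the branch name and the token env var are placeholders, and since the post is cut off, this is not necessarily the route it ends up taking:

// Sketch: re-attach a branch to a dangling commit on GitHub via the
// create-a-reference API (POST /repos/{owner}/{repo}/git/refs).
// The SHA must exist server-side, e.g. recovered from a CI log or the
// branch@{1} reflog view described above.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	body, _ := json.Marshal(map[string]string{
		"ref": "refs/heads/httpRefactor-recovered", // branch to (re)create
		"sha": "92a817d2ba0b38d3f18b19457f5fe0a706c77370",
	})
	req, _ := http.NewRequest("POST",
		"https://api.github.com/repos/OWNER/REPO/git/refs", bytes.NewReader(body))
	req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect "201 Created" if the object exists on the remote
}

Once the ref exists on GitHub, a plain git fetch origin brings the missing commits down locally.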

25 Graphite, Grafana and statsd gotchas

Tue, 15 Mar 2016 16:22:03 +1000

This is a crosspost of an article I wrote on the raintank blog. For several years I've worked with Graphite, Grafana and statsd on a daily basis and have been participating in the community. All three are fantastic tools and solve very real problems, hence my continued use and recommendation. However, between colleagues, random folks on IRC, and personal experience, I've seen a plethora of often subtle issues, gotchas and insights, which today I'd like to share. I hope this will prove useful to users while we, open source monitoring developers, work on ironing out these kinks. At raintank we're addressing a bunch of these as well, but we have a long road ahead of us.

Before we begin: when trying to debug what's going on with your metrics (whether they're going into or out of these tools), don't be afraid to dive into the network traffic or the whisper files. They can often be invaluable in understanding what's up. For network sniffing, I almost always use these commands:

ngrep -d any -W byline port 2003 # carbon traffic
ngrep -d any -W byline port 8125 # statsd traffic

Getting the JSON output from Graphite (just append &format=json) can be very helpful as well. Many dashboards, including Grafana, already do this, so you can use the browser network inspector to analyze requests. For the whisper files, Graphite comes with various useful utilities such as whisper-info, whisper-dump, whisper-fetch, etc. Here we go:

1. OMG! Did all our traffic just drop?
2. Null handling in math functions.
3. Null handling during runtime consolidation.
4. No consolidation or aggregation for incoming data.
5. Limited storage aggregation options.
6. Runtime consolidation is detached from storage aggregation.
7. Grafana consolidation is detached from storage aggregation and runtime consolidation.
8. Aggregating percentiles.
9. Deriving and integration in Graphite.
10. Graphite quantization.
11. statsd flush offset depends on when statsd was started.
12. Improperly time-attributed metrics.
13. The relation between timestamps and the intervals they describe.
14. The statsd timing type is not only for timings.
15. The choice of metric keys depends on how you deployed statsd.
16. statsd is "fire and forget", "non-blocking" & UDP sends "have no overhead".
17. statsd sampling.
18. As long as my network doesn't saturate, statsd graphs should be accurate.
19. Incrementing/decrementing gauges.
20. You can't graph what you haven't seen.
21. Don't let the data fool you: nulls and deleteIdleStats options.
22. keepLastValue works… almost always.
23. statsd counters are not counters in the traditional sense.
24. What can I send as input?
25. Hostnames and ip addresses in metric keys.

1) OMG! Did all our traffic just drop?

Probably the most common gotcha, and one I run into periodically. Graphite will return data up until the current time, according to your data schema. Example: if your data is stored at 10s resolution, then at 10:02:35 it will show data up until 10:02:30. Once the clock hits 10:02:40, it will also include that point (10:02:40) in its response. It typically takes some time for your services to send data for that timestamp, and for it to be processed by Graphite, so Graphite will typically return a null here for that timestamp. Depending on how you visualize (see for example the "null as null/zero" option in Grafana), or if you use a function such as sumSeries (see below), it may look like a drop in your graph, and cause panic. You can work around this with "null as null" in Grafana, transformNull() or keepLastValue() in Graphite, or plotting until a few seconds ago instead of now. See below for some other related issues.

2) Null handling in math functions.

Graphite functions don't exactly follow the rules of logic (by design). When you request something like sumSeries(diskspace.server_*.bytes_free), Graphite returns the sum of al[...]
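As a concrete illustration of the workarounds for gotcha 1, here is a minimal sketch that builds a render request ending a few seconds before now, with transformNull() applied; the host and target are placeholders:

// Sketch: request Graphite data only up to a moment before "now", so the
// still-filling bucket at the right edge never shows up as a fake drop.
// graphite.example.com and the target expression are illustrative.
package main

import (
	"fmt"
	"net/url"
)

func main() {
	v := url.Values{}
	v.Set("target", "transformNull(sumSeries(diskspace.server_*.bytes_free), 0)")
	v.Set("from", "-1h")
	v.Set("until", "-30s") // stop a few seconds short of now
	v.Set("format", "json")
	// pass this URL to an HTTP GET, or paste it in a browser
	fmt.Println("http://graphite.example.com/render?" + v.Encode())
}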

Interview with Matt Reiferson, creator of NSQ

Fri, 02 Oct 2015 10:25:02 +0200

I'm a fan of the NSQ message processing system written in Go. I've studied the code, transplanted its diskqueue code into another project, and have used NSQ by itself. The code is well thought out, well organized and well written.

Inspired by the book Coders at Work and the Systems Live podcast, I wanted to try something I'd never done before: spend an hour talking to Matt Reiferson - the main author of NSQ - about software design and Go programming patterns, and post the video online for whoever might be interested.

We talked about Matt's background, starting the NSQ project at Bitly as his first (!) Go project, (code) design patterns in NSQ - the nsqd diskqueue in particular - and the new WAL (write-ahead log) approach, in terms of design and functionality.

You can watch it on YouTube.

Unfortunately, the video got cut a bit short. In the cut-off part I basically asked about the new Go internal packages convention that prevents importing packages that live in an internal subdirectory. Matt wants to make it very clear that certain implementation details are not supported (by the NSQ team) and may change, whereas my take was that it's annoying when I want to reuse some code I find in a project. We ultimately both agreed that while a bit clunky, it gets the job done, and it is probably a bit crude because there is also no proper package management yet.

I'd like to occasionally interview other programmers in a similar way and post the videos on my site.

Transplanting Go packages for fun and profit

Wed, 02 Sep 2015 19:25:02 +0300

A while back I read Coders at Work, a book of interviews with some great computer scientists who earned their stripes, the questions just as thoughtful as the answers. For one thing, it re-ignited my interest in functional programming; for another, I got interested in literate programming; but most of all, it struck me how common a recommendation it was to read other people's code as a means to become a better programmer. (It also has a good section of Brad Fitzpatrick describing his dislike for programming languages and dreaming about his ideal language. This must have been shortly before Go came about and he became a maintainer.)

I hadn't been doing a good job reading/studying other code, out of fear that inferior patterns/style would rub off on me. But I soon realized that was an irrational, perhaps slightly absurd excuse. So I made the decision to change. Contrary to my presumption, I found that by reading code that looks bad you can challenge and re-evaluate your mindset, and come out with a more nuanced understanding and awareness of the pros and cons of various approaches. I also realized that if code is proving too hard to get into or is of too low quality, you can switch to another code base with negligible effort, and end up spending almost all of your time reading code that is worthwhile and has plenty of learnings to offer. There is a lot of high quality Go code, easy to find through sites like GitHub or Golang Weekly; just follow your interests and pick a project to start reading.

It gets really interesting, though, once you find bodies of code that are not only a nice learning resource, but can be transplanted into your code with minimal work to solve a problem you're having, in a different context than the one the author originally designed it for. Components often grow and mature in the context of an application without being promoted as reusable libraries, but you can often use them as if they were. I would like to share two such success cases below.

NSQ's diskqueue code

I've always had an interest in code that manages the same binary data both in memory and on a block device. Think filesystems, databases, etc. There are some interesting concerns, like robustness in light of failures combined with optimizing for performance (infrequent syncs to disk, maintaining the hot subset of data in memory, etc.) and for various access patterns; this can be a daunting topic to get into. Luckily there's a use case that I see all the time in my domain (telemetry systems) that covers just enough of the problems to be interesting and fun, but not enough to be overwhelming. And that is: for each step in a monitoring data pipeline, you want to be able to buffer data if the endpoint goes down - in memory, and to disk if the amount of data gets to be too much. Especially to disk if you're also concerned with your software crashing or the machine power cycling. This is such a common problem, applying to all metrics agents, relays, etc., that I was longing for a library that just takes care of spooling data to disk for you, without really affecting much of the rest of your software. All it needs to do is sequentially write pieces of data to disk and have a sequential reader catching up and reading newer data as it finishes processing the older. NSQ is a messaging platform from Bitly, and it has diskqueue code that does exactly that. And it does so oh so elegantly.

I had previously found a beautiful pattern in Bitly's Go code that I blogged about, and again I found a nice and elegant design that builds further on this pattern: concurrent access to data is protected via a single instance of a for loop running a select block, which assures only one piece of code can make changes to the data at the same time (see the bottom of the file), not unlike ioloops in other languages. Method calls such as Put() provide a clean external interface[...]
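To make the pattern concrete, here is a minimal, self-contained sketch of that ioloop style; the names are illustrative, not the actual NSQ diskqueue API:

// Sketch: all mutable state is owned by a single goroutine running a
// for/select loop, and methods like Put() talk to it over channels,
// so no mutexes are needed. Loosely modeled on NSQ's diskqueue.
package main

import "fmt"

type queue struct {
	writeChan     chan []byte
	writeResponse chan error
	exitChan      chan struct{}
	depth         int64 // only ever touched by ioLoop
}

func newQueue() *queue {
	q := &queue{
		writeChan:     make(chan []byte),
		writeResponse: make(chan error),
		exitChan:      make(chan struct{}),
	}
	go q.ioLoop()
	return q
}

// Put is the clean external interface: it hands data to the ioLoop and
// waits for the result, but never mutates state itself.
func (q *queue) Put(data []byte) error {
	q.writeChan <- data
	return <-q.writeResponse
}

func (q *queue) Close() { close(q.exitChan) }

func (q *queue) ioLoop() {
	for {
		select {
		case data := <-q.writeChan:
			// in the real thing: write to disk, sync periodically, etc.
			q.depth++
			fmt.Printf("wrote %d bytes, depth=%d\n", len(data), q.depth)
			q.writeResponse <- nil
		case <-q.exitChan:
			return
		}
	}
}

func main() {
	q := newQueue()
	_ = q.Put([]byte("hello"))
	q.Close()
}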

Focusing on open source monitoring. Joining raintank.

Fri, 03 Jul 2015 09:22:02 -0700

Goodbye Vimeo

It's never been as hard saying goodbye to the people and the work environment as it is now. Vimeo was created by dedicated film creators and enthusiasts just over 10 years ago, and today it still shows: from the quirky, playful office culture and the staff-created short films, to the tremendous curation effort and staff picks (including monthly staff screenings where we get to see the best of the best videos on the Internet), to the dedication towards building the best platform and community on the web to enjoy videos, and the uncompromising commitment to supporting movie creators and working in their best interest. Engineering-wise, there has been plenty of opportunity to make an impact and learn. Nonetheless, I have to leave, and I'll explain why. First I want to mention a few more things.

In Belgium I used to hitchhike to and from work, so that each day brought me opportunities to have conversations with a diverse, fantastic assortment of people. I still fondly remember some of those rides. (It was also usually faster than taking the bus!) Here in NYC this isn't really feasible, so I tried the next best thing: a mission to have lunch with every single person in the company, starting with those I don't typically interact with. I managed to have lunch with 95 people, get to know them a bit, find some gems of personalities and anecdotes, and have conversations on a tremendous variety of subjects, some light-hearted, some deep and profound. It was fun, and I hope to be able to keep doing such social experiments in my new environment.

Vimeo is also part of my life in an unusually personal way. When I came to New York (my first ever visit to the US) in 2011 to interview, I also met a pretty fantastic woman in a random bar in Williamsburg. We ended up traveling together in Europe, I decided to move to the US, and we moved in together. I've had the pleasure of being submerged in both American and Greek culture for the last few years, but the best part is that today we are engaged, and I feel like the luckiest guy in the world. While I've tried to keep work and personal life somewhat separate, Vimeo has made an undeniable, everlasting impact on my life that I'm very grateful for.

At Vimeo I found an area where a bunch of my interests converge: operational best practices, high performance systems, number crunching, statistics and open source software. Specifically, timeseries metrics processing in the context of monitoring. While I have enjoyed my opportunity to make contributions in this space to help our teams and other companies who end up using my tools, I want to move out of the cost center of the company; I want to be in the department that creates the value. If I want to focus on open source monitoring, I should align my incentives with those of my employer, both for my sake and theirs. I want to make more profound contributions to the space. The time has come for me to join a company whose main focus is making open source monitoring better.

Hello raintank!

Over the past two years or so I've talked to many people in the industry about monitoring, many of them trying to bring me onto their team. I never found a perfect fit, but as we transitioned from 2014 into 2015, the stars seemingly aligned for me. Here's why I'm very excited to join the raintank crew: I'm a strong believer in open source. I believe fundamental infrastructure tooling should be modifiable and in your control. It's partially a philosophical argument, but also what I believe will separate long-lasting businesses from short-term successes. SaaS is great, but not if it's built to lock you in. In this day and age, I think you should "lock in" your customers by providing a great experience they can't say no to, not by having them build technical debt as they integrate into your stack. That said, integrating with open source also incurs technical debt, and some closed so[...]

Moved blog to hugo, fastly and comma

Thu, 02 Jul 2015 16:35:02 -0700

  • I noticed what a disservice I was doing my readers when I started monitoring my site using litmus. A dynamic website in Python on a cheap Linode… what do you expect? So I now serve through Fastly and use a static site generator.

  • pyblosxom was decent while it lasted. It can generate sites statically, but the project never got a lot of traction and is slowly fading out. There were a bit too many moving parts, so …

  • I now use the hugo static site generator, which is powerful, quite complete and gaining momentum. Fast and simple to use.

  • Should also keep an eye on the caddy webserver since it has some nice things such as git integration which should work well with hugo.

  • Trying to get Disqus going was frustrating. Self-hosted options like talkatv and isso were too complex, and kaiju is just not there yet and also pretty complex. So I wrote comma, a simple comment server in Go. Everything I need in 100 lines of Go and 50 lines of JavaScript! Let me know if you see anything funky.

  • Migrated all content.

Practical fault detection on timeseries part 2: first macros and templates

Mon, 27 Apr 2015 09:05:02 -0400

In the previous fault detection article, we saw how we can cover a lot of ground in fault detection with simple methods and technology that is available today. It had an example of a simple but effective approach to find sudden spikes (peaks and drops) within fluctuating time series. This post explains the continuation of that work and provides you the means to implement this yourself with minimal effort. I'm sharing with you:

  • Bosun macros which detect our most common not-trivially-detectable symptoms of problems
  • a Bosun notification template which provides a decent amount of information
  • Grafana and Graph-Explorer dashboards and integration for further troubleshooting

We reuse this stuff for a variety of cases where the data behaves similarly, and I suspect that you will be able to apply this to a bunch of your monitoring targets as well.

Target use case

As in the previous article, we focus on the specific category of timeseries metrics driven by user activity. Those series are expected to fluctuate in at least some kind of (usually daily) pattern, but are expected to have a certain smoothness to them. Think web requests per second or uploads per minute. There are a few characteristics that are considered faulty or at least worth our attention (illustrated in the original article):

  • looks good: consistent pattern, consistent smoothness
  • sudden deviation (spike): almost always something broke or choked; could also be pointing up. ~ peaks and valleys
  • increased erraticness: sometimes natural, often the result of performance issues
  • lower values than usual (in the third cycle): often caused by changes in code or config, sometimes innocent. But best to alert the operator in any case [*]

[*] Note that some regular patterns can look like this as well, for example weekend traffic lower than weekdays, etc. We see this a lot. The illustrations don't portray this, for simplicity, but the alerting logic below supports it just fine by comparing to the same day last week instead of yesterday, etc.

Introducing the new approach

The previous article demonstrated using Graphite to compute the standard deviation. This let us alert on the erraticness of the series in general and, as a particularly interesting side-effect, on spikes up and down. The new approach is more refined and concrete, leveraging some of bosun's and Grafana's strengths. We can't always detect the last case above via erraticness checking (a lower amount may be introduced gradually, not via a sudden drop), so now we monitor for that as well, covering all the cases above. We use:

  • Bosun macros which encapsulate all the querying and processing
  • a Bosun template for notifications
  • a generic Grafana dashboard which aids in troubleshooting

We can then leverage this for various use cases, as long as the expectations of the data are as outlined above. We use this for web traffic, volume of log messages, uploads, telemetry traffic, etc. For each case we simply define the graphite queries and some parameters, and leverage the existing Bosun and Grafana configuration mentioned above.

The best way to introduce this is probably by showing what a notification looks like: (image redacted to hide confidential information; the numbers are not accurate and for demonstration purposes only)

As you can tell by the sections, we look at some global data (for example "all web traffic", "all log messages", etc.), and also at data segregated by a particular dimension (for example web traffic by country, log messages by key, etc.). To cover all problematic cases outlined above, we do 3 different checks (note: everything is parametrized so you can tune it, see further down). Global volume: comparing the median value of the last 60 minutes or so against the corresponding 60 minutes last week, and expressing it as a "strength ratio". Anything below a given threshold, such as 0.8, is alerted on. Global erraticness: to find all forms of erraticness[...]
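To illustrate the global volume check outside of bosun, here is a minimal sketch of the underlying logic; the sample data is made up, and 0.8 is the example threshold from above:

// Sketch: compare the median of the last hour against the same hour a
// week earlier and alert when the "strength ratio" drops below 0.8.
// Data loading and window handling are stubbed out.
package main

import (
	"fmt"
	"sort"
)

func median(vals []float64) float64 {
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

func main() {
	current := []float64{60, 62, 61, 64}      // e.g. last 60 min of requests/s
	lastWeek := []float64{118, 119, 120, 125} // same 60 min, 7 days earlier
	ratio := median(current) / median(lastWeek)
	if ratio < 0.8 {
		fmt.Printf("ALERT: volume strength ratio %.2f is below 0.8\n", ratio)
	} else {
		fmt.Printf("ok: strength ratio %.2f\n", ratio)
	}
}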

Practical fault detection & alerting. You don't need to be a data scientist

Thu, 29 Jan 2015 09:08:02 -0400

As we try to retain visibility into our increasingly complicated applications and infrastructure, we're building out more advanced monitoring systems. Specifically, a lot of work is being done on alerting via fault and anomaly detection. This post covers some common notions around these new approaches, debunks some of the myths that ask for over-complicated solutions, and provides some practical pointers that any programmer or sysadmin can implement and that don't require becoming a data scientist.

It's not all about math

I've seen smart people who are good programmers decide to tackle anomaly detection on their timeseries metrics. (Anomaly detection is about building algorithms which spot "unusual" values in data, via statistical frameworks.) This is a good reason to brush up on statistics, so you can apply some of those concepts. But ironically, in doing so, they often seem to think that they are now only allowed to implement algebraic mathematical formulas. No more if/else, only standard deviations of numbers. No more for loops, only moving averages. And so on. When going from thresholds to something (anything) more advanced, suddenly people only want to work with mathematical formulas. Meanwhile we have entire Turing-complete programming languages available, which allow us to execute any logic, as simple or as rich as we can imagine. Using only math massively reduces our options in implementing an algorithm.

For example, I've seen several presentations in which authors demonstrate how they try to fine-tune moving average algorithms to get a robust base signal to check against, one which is not affected too much by previous outliers (which raise the moving average and might mask subsequent spikes). (Example from "A Deep Dive into Monitoring with Skyline".) But you can't optimize both, because a mathematical formula at any given point can't make the distinction between past data that represents "good times" versus "faulty times". However: we wrap the output of any such algorithm with some code that decides what is a fault (or "anomaly", as labeled here) and alerts against it, so why would we hold ourselves back from feeding this useful information back into the algorithm? I.e. assist the math with logic by writing some code to make it work better for us. In this example, we could modify the code to just retain the old moving average from before the time-frame we consider to be faulty. That way, when the anomaly passes, we resume "where we left off". For timeseries that exhibit seasonality and a trend we need to do a bit more, but the idea stays the same. Restricting ourselves to only math and statistics cripples our ability to detect actual faults (problems).

Another example: during his Monitorama talk, Noah Kantrowitz made the interesting and thought-provoking observation that Nagios flap detection is basically a low-pass filter. A few people suggested re-implementing flap detection as a low-pass filter. This seems backwards to me, because reducing the problem to a pure mathematical formula loses information. The current code has the high-resolution view of above/below threshold and can visualize as such. Why throw that away and limit your visibility?

Unsupervised machine learning... let's not get ahead of ourselves

Etsy's Kale has ambitious goals: you configure a set of algorithms, and those algorithms get applied to all of your timeseries. Out of that should come insights into what's going wrong. The premise is that the found anomalies are relevant and indicative of faults that require our attention. I have quite a variety amongst my metrics. For example, diskspace metrics exhibit a sawtooth pattern (due to constant growth and periodic cleanup), crontabs cause (by definition) periodic spikes in activity, and user activity causes a fairly smooth graph which is characterized by its da[...]
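Coming back to the moving-average example above, here is a minimal sketch of "assisting the math with logic": the baseline simply isn't updated during points we consider faulty, so it resumes where it left off. The window size, tolerance, and data are arbitrary illustrations:

// Sketch: a running average that freezes while a fault is flagged,
// so outliers don't inflate the baseline and mask subsequent spikes.
package main

import "fmt"

func main() {
	const minPoints = 5   // need this many good points before flagging faults
	const tolerance = 2.0 // flag values more than 2x the current baseline
	series := []float64{10, 11, 10, 12, 11, 50, 55, 11, 10}

	var sum, baseline float64
	var n int
	for i, v := range series {
		if n >= minPoints && v > baseline*tolerance {
			// don't feed the faulty point into the average: the baseline
			// stays at its pre-fault value, so when the anomaly passes
			// we resume "where we left off"
			fmt.Printf("t=%d value=%.0f FAULT (baseline held at %.1f)\n", i, v, baseline)
			continue
		}
		sum += v
		n++
		baseline = sum / float64(n)
		fmt.Printf("t=%d value=%.0f ok (baseline now %.1f)\n", i, v, baseline)
	}
}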

IT-Telemetry Google group. Trying to foster more collaboration around operational insights.

Sat, 06 Dec 2014 16:01:02 -0400

The discipline of collecting infrastructure & application performance metrics, aggregation, storage, visualizations and alerting has many terms associated with it: telemetry, insights engineering, operational visibility. I've seen a bunch of people present their work in advancing the state of the art in this domain: from Anton Lebedevich's statistics for monitoring series, Toufic Boubez' talks on anomaly detection and Twitter's work on detecting mean shifts, to projects such as flapjack (which aims to offload the alerting responsibility from your monitoring apps), the metrics 2.0 standardization effort, and Etsy's Kale stack, which tries to bring interesting changes in timeseries to your attention with minimal configuration.

Much of this work is being shared via conference talks and blog posts, especially around anomaly and fault detection, but I couldn't find a place for collaboration, quicker feedback, and discussions on more abstract (algorithmic/mathematical) topics or those that cross project boundaries. So I created the IT-telemetry Google group. If I missed something that already exists, let me know; I can shut this down and point to it instead. Either way, I hope this kind of avenue proves useful to people working on these kinds of problems.

A real whisper-to-InfluxDB program.

Tue, 30 Sep 2014 08:37:48 -0400

The whisper-to-InfluxDB migration script I posted earlier is pretty bad: a shell script, without concurrency, and with an undiagnosed performance issue. I hinted that one could write a Go program using the unofficial whisper-go bindings and the InfluxDB Go client library. That's what I've done now. The program uses configurable numbers of workers for both whisper fetches and InfluxDB commits, but it's still a bit naive in the sense that it commits to InfluxDB one series at a time, irrespective of how many records are in it. My series, and hence my commits, have at most 60k records, and presumably InfluxDB could handle a lot more per commit, so we might leverage better batching later. Either way, this way I can consistently commit about 100k series every 2.5 hours (or about 10/s), where each series has a few thousand points on average, with peaks up to 60k points. I usually play with 1 to 30 InfluxDB workers. Even though I've hit a few InfluxDB issues, this tool has enabled me to fill in gaps after outages and to do a restore from whisper after a complete database wipe.
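The fan-out described here boils down to a classic Go worker-pool layout. Here is a minimal sketch; the fetch and commit bodies are stubbed, and the names are illustrative rather than the tool's actual code:

// Sketch: a configurable number of whisper-fetch workers feed a channel
// of series, and a configurable number of InfluxDB-commit workers drain
// it, one series per commit.
package main

import (
	"fmt"
	"sync"
)

type series struct {
	name   string
	points [][2]float64 // (timestamp, value) pairs
}

func main() {
	const fetchWorkers, commitWorkers = 4, 10

	names := make(chan string)
	fetched := make(chan series)

	var fetchWG, commitWG sync.WaitGroup
	for i := 0; i < fetchWorkers; i++ {
		fetchWG.Add(1)
		go func() {
			defer fetchWG.Done()
			for name := range names {
				// stub: read the points from the .wsp file here
				fetched <- series{name: name}
			}
		}()
	}
	for i := 0; i < commitWorkers; i++ {
		commitWG.Add(1)
		go func() {
			defer commitWG.Done()
			for s := range fetched {
				// stub: commit one series per call, as described above
				fmt.Printf("committed %s (%d points)\n", s.name, len(s.points))
			}
		}()
	}

	for _, n := range []string{"servers.web1.cpu", "servers.web2.cpu"} {
		names <- n
	}
	close(names)
	fetchWG.Wait()
	close(fetched)
	commitWG.Wait()
}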

InfluxDB as a graphite backend, part 2

Wed, 24 Sep 2014 07:56:01 -0400

Updated Oct 1, 2014 with a new "Disk space efficiency" section, which fixes some mistakes and adds more clarity.

The Graphite + InfluxDB series continues. In part 1, "On Graphite, Whisper and InfluxDB", I described the problems of Graphite's whisper and ceres, why I disagree with common graphite clustering advice as being the right path forward, what a great timeseries storage system would mean to me, why InfluxDB - despite being the youngest project - is my main interest right now, and introduced my approach for combining both and leveraging their respective strengths: InfluxDB as an ingestion and storage backend (and at some point, realtime processing and pub-sub), and Graphite for its renowned data processing-on-retrieval functionality. Furthermore, I introduced some tooling: carbon-relay-ng, to easily route streams of carbon data (metrics datapoints) to storage backends, allowing me to send production data to Carbon+whisper as well as InfluxDB in parallel, and graphite-api, the simpler Graphite API server, with graphite-influxdb to fetch data from InfluxDB. Not Graphite related, but I also wrote influx-cli, which I introduced earlier. It makes it easy to interface with InfluxDB and measure the duration of operations, which will become useful for this article. In the "Graphite & InfluxDB intermezzo" I shared a script to import whisper data into InfluxDB and noted some write performance issues I was seeing, but the better part of the article described the various improvements done to carbon-relay-ng, which is becoming an increasingly versatile and useful tool. In part 2, which you are reading now, I'm going to describe recent progress, share more info about my setup and testing results, the state of affairs, and ideas for future work.

Progress made

InfluxDB saw two major releases: 0.7 (and follow-ups), which was mostly about some needed features and bug fixes, and 0.8, which was all about bringing some major refactorings into the hands of early adopters/testers: support for multiple storage engines, configurable shard spaces, rollups and retention schemes. There was some other useful stuff, like speed and robustness improvements for the graphite input plugin (by yours truly), and various things like regex filtering for 'list series'. Note that a bunch of older bugs remained open throughout this release (most notably the broken derivative aggregator), and a bunch of new ones appeared. Maybe this is why the release was mostly in the dark. In this context it's not so bad, because we let graphite-api do all the processing, but if you want to query InfluxDB directly you might hit some roadblocks. An older fix, but worth mentioning: series names can now contain any character, which means you can easily use metrics2.0 identifiers. This is a welcome relief after having struggled with Graphite's restrictions on metric keys.

graphite-api received various bug fixes and support for templating, statsd instrumentation and caching. Much of this was driven by graphite-influxdb: the caching allows us to cache metadata, and the statsd integration gives us insight into the performance of the steps it goes through when building a graph (getting metadata from InfluxDB, querying InfluxDB, interacting with the cache, post-processing data, etc.). The progress on InfluxDB and graphite-api in turn enabled graphite-influxdb to become faster and simpler (note: graphite-influxdb requires InfluxDB 0.8). Furthermore, you can now configure series resolutions (different retentions per series are on the roadmap, see "State of affairs and what's coming"), and of course it also got a bunch of bugfixes. Because of all these improvements, all involved components are now ready for serious use.

Putting it all together, with Docker

Docker probably needs no introduction; it's a nifty tool t[...]

Graphite & Influxdb intermezzo: migrating old data and a more powerful carbon relay

Sat, 20 Sep 2014 15:18:32 -0400

Migrating data from whisper into InfluxDB

"How do I migrate whisper data to InfluxDB?" is a question that comes up regularly, and I've always replied that it should be easy to write a tool to do this. I personally had no need for it, until a recent small InfluxDB outage where I wanted to sync data from our backup server (running graphite + whisper) to InfluxDB, so I wrote a script:

#!/bin/bash
# whisper dir without trailing slash.
wsp_dir=/opt/graphite/storage/whisper
start=$(date -d 'sep 17 6am' +%s)
end=$(date -d 'sep 17 12pm' +%s)
db=graphite
pipe_path=$(mktemp -u)
mkfifo $pipe_path
function influx_updater() {
    influx-cli -db $db -async < $pipe_path
}
influx_updater &
while read wsp; do
    series=$(basename ${wsp//\//.} .wsp)
    echo "updating $series ..."
    whisper-fetch --from=$start --until=$end $wsp_dir/$wsp.wsp | grep -v 'None$' | awk '{print "insert into \"'$series'\" values ("$1"000,1,"$2")"}' > $pipe_path
done < <(find $wsp_dir -name '*.wsp' | sed -e "s#$wsp_dir/##" -e "s/.wsp$//")

It relies on the recently introduced asynchronous inserts feature of influx-cli - which commits inserts in batches to improve speed - and the whisper-fetch tool. You could probably also write a Go program using the unofficial whisper-go bindings and the InfluxDB Go client library, but I wanted to keep it simple. Especially when I found out that whisper-fetch is not a bottleneck: starting whisper-fetch and reading out - in my case - 360 datapoints of a file always takes about 50ms, whereas InfluxDB at first only needed a few ms to flush hundreds of records, but that soon increased to seconds. Maybe it's a bug in my code; I didn't test this much, because I didn't need to. But people keep asking for a tool, so here you go. Try it out and maybe you can fix a bug somewhere. Something about the write performance here must be wrong.

A more powerful carbon-relay-ng

carbon-relay-ng received a bunch of love and has been a great help in my graphite+InfluxDB experiments. Here's what changed:

  • First I made it so that you can adjust routes at runtime, while data is flowing through, via a telnet interface.
  • Then Paul O'Connor built an embedded web interface to manage your routes in an easier and prettier way (pictured above).
  • The relay now also emits performance metrics via statsd. (I want to make this better by using go-metrics, which will hopefully get expvar support at some point - any takers?)
  • Last but not least, I borrowed the diskqueue code from NSQ, so now we can also spool to disk to bridge downtime of endpoints and re-fill them when they come back up.

Besides our metrics storage, I also plan to put our anomaly detection (currently playing with heka and kale) and carbon-tagger behind the relay, centralizing all routing logic, making things more robust, and simplifying our system design. The spooling should also help to deploy to our metrics gateways at other datacenters, to bridge outages of the datacenter interconnects. I used to think of carbon-relay-ng as the Python carbon-relay but on steroids; now it reminds me more of something like nsqd, but with the ability to make packet routing decisions by introspecting the carbon protocol - or perhaps Kafka, but much simpler, single-node (no HA), and optimized for the domain of carbon streams. I'd like the HA stuff though, which is why I spend some of my spare time figuring out the intricacies of the increasingly popular raft consensus algorithm. It seems opportune to have a simpler Kafka-like thing, in Go, using raft, for carbon streams. (Note: InfluxDB might introduce such a component, so I'm also a bit waiting to see what the[...]

Influx-cli: a commandline interface to InfluxDB.

Mon, 08 Sep 2014 08:36:36 -0400

Time for another side project: influx-cli, a commandline interface to InfluxDB. Nothing groundbreaking; it behaves pretty much as you would expect if you've ever used the mysql, pgsql, vsql, etc. tools before. But I did want to highlight a few interesting features.

You can do things like user management via SQL, even though InfluxDB doesn't have an SQL interface for this. This is much easier than doing curl HTTP requests!

influx> create admin test test
influx> list admin
## 0
name root
## 1
name test

You can change parameters and re-bind with the new values:

influx> \user test
influx> \pass test
influx> \db graphite
influx> bind

Write your variables (user, pass, host, db, ...) to ~/.influxrc:

influx> writerc

You can even do inserts via SQL, instead of HTTP posts. I use this often; it is very useful to script test cases for bug reports etc.

influx> create db issue-1234
influx> \db issue-1234
influx> bind
influx> insert into demo (time, value, tag) values (120000, 10, "hi")
influx> insert into demo (time, value, tag) values (180000, 20, "hi again")
influx> select * from demo
## demo
time sequence_number value tag
120000.000000 70001.000000 10 "hi"
180000.000000 80001.000000 20 "hi again"
influx> delete db issue-1234

You can send queries on standard input, which is useful in shell commands and scripts:

$ echo 'list series' | influx-cli | wc -l
194722
$ influx-cli list series > list-series.txt

(Note: the discrepancy of one line is due to the Go readline library echoing the query.)

You can also toggle options, such as compression or display of timings. This can be very useful to easily get insight into the performance of different operations:

influx> \t
timing is now true
influx> select * from foo | wc -l
64637
timing> query+network: 1.288792048s
displaying : 457.091811ms
influx> \comp
compression is now disabled
influx> select * from foo | wc -l
64637
timing> query+network: 969.322374ms
displaying : 670.736018ms
influx> list series >/dev/null
timing> query+network: 3.109178142s
displaying : 65.712027ms

This has enabled me to pinpoint slow operations and provide evidence when creating tickets.

Executing queries and debugging their result data format works too. This is useful when you want to understand the API better, or if the database gets support for new queries with a different output format that influx-cli doesn't understand yet:

influx> raw select * from foo limit 1
([]*client.Series) (len=1 cap=4) {
 (*client.Series)(0xc20b4f0480)({
  Name: (string) (len=51) "foo",
  Columns: ([]string) (len=3 cap=4) {
   (string) (len=4) "time",
   (string) (len=15) "sequence_number",
   (string) (len=5) "value"
  },
  Points: ([][]interface {}) (len=1 cap=4) {
   ([]interface {}) (len=3 cap=4) {
    (float64) 1.410148588e+12,
    (float64) 1,
    (float64) 95.549995
   }
  }
 })
}

And that's about it. I've found this to be a much easier way to interface with InfluxDB than using the web interface and curl, but YMMV. If you were wondering: this is of course built on top of the InfluxDB Go client library, which was overall pretty pleasant to work with. Some ideas for future work:

  • bulk insert performance could be better
  • once InfluxDB can report query execution time, and hopefully also serialization time, the timing output can be more useful. Right now we can only measure query execution + serialization + network transfer time combined
  • my gut feeling says that using something like msgpack instead of JSON, and/or even streaming the resultset as it is being generated (instead of[...]

Darktable: a magnificent photo manager and editor

Tue, 12 Aug 2014 08:36:36 -0400

A post about the magnificent darktable photo manager/editor, and why I'm abandoning pixie.

When I wrote pixie, I was aware of darktable. It looked like a neat application with the potential to be pretty much what I was looking for, although it also looked complicated, mainly due to terminology like "darkroom" and "lighttable", which was a bit off-putting to me and made me feel like the application was meant for photo professionals and probably wouldn't work well with the ideals of a techie with some purist views on how to manage files and keep his filesystems clean. Basically, I didn't want to give the application a proper chance, and then rationalized the decision after I made it. I'm sure psychologists have a term for this behavior. I try to be aware of these cases and not fall into the trap, but this time I was very aware of it and still proceeded; I think I had a reasonable excuse, though. I wanted an app that behaves exactly how I like, I wanted to play with angularjs, and it seemed like a fun learning exercise to implement a full-stack program backed by a Go API server and an angularjs interface, with some keybind features and vim-like navigation sprinkled on top. Pixie ended up working, but I got fed up with some angularjs issues, slow JS performance, and a list of to-dos I would need to address before I would consider pixie feature complete. So only a few days ago did I start giving darktable the chance it had deserved from the beginning. As it turns out, darktable is actually a fantastic application, and despite some imperfections, the difference is clear enough for me to abandon pixie. Here's why I like it:

  • It stays true to my ideals: it doesn't modify your files at all. This is a must for easily synchronizing photo archives with each other and with devices. You can tag, assign metadata, create edits, etc., and re-size on export. It stores metadata in a simple sqlite database, and also in xmp files which it puts alongside the original files, but luckily you can easily ignore those while syncing. (I have yet to verify whether you can adjust dates or set GPS info without modifying the actual files, but I had no solution for that either.) Basically, it's just well thought out and works well.
  • The terminology thing is a non-issue. Just realize that lighttable means the set of pictures in your collection you want to work with, darkroom is the editor where you edit the image, and film roll is a directory with imported images. Everything else is intuitive.
  • It has decent tag editing features, and a powerful mechanism to build a selection of images using a variety of criteria: exif data, tags, GPS info, labels, etc.
  • You can make duplicates of an image, make different edits, and treat them as images in their own right.
  • It has pretty extensive key binding options, and even provides a lua api so you can hook in your own plugins. People are working on a bunch of scripts already.
  • It's fast. Navigating a 33k file archive, adjusting thumbnail sizes on the fly, iterating quickly: it all works well.
  • It has good support for non-destructive editing, and a variety of editing possibilities, as if it were commercial software.
  • It has complete documentation, a great blog with plenty of tutorial articles, and tutorial videos.

I did notice some bugs (including a few crashes), but there are always a few developers and community members active on IRC and the bug tracker, so it's a pretty active project and I'm confident/hopeful my issues will be resolved soon. I also have a few more ideas for features that would make it closer to my ideals, but as it stands, darktable is already a great application and I'm happy I can deprecate pixie at this point. I [...]