Published: Thu, 07 Jan 2016 20:44:53 AEDT
Last Build Date: Thu, 07 Jan 2016 20:44:53 AEDT
Thu, 13 Aug 2015 00:00:00 AEST
Whether you’re a TDD zealot, or you just occasionally write a quick script to reproduce some bug, it’s a rare coder who doesn’t see value in some sort of automated testing. Yet, somehow, in all of the new-age “Infrastructure as Code” mania, we appear to have forgotten this, and the tools that are commonly used for implementing “Infrastructure as Code” have absolutely woeful support for developing your Infrastructure Code. I believe this has to change.
At present, the state of the art in testing system automation code appears to be, “spin up a test system, run the manifest/state/whatever, and then use something like serverspec or testinfra to SSH in and make sure everything looks OK”. It’s automated, at least, but it isn’t exactly a quick process. Many people don’t even apply that degree of rigour to their system config systems, and rely on manual testing, or even just “doing it live!”, to shake out the bugs they’ve introduced.
Speed in testing is essential. As the edit-build-test-debug cycle gets longer, frustration grows exponentially. If it takes two minutes to get a “something went wrong” report out of my tests, I’m not going to run them very often. If I’m not running my tests very often, then I’m not likely to write tests much, and suddenly… oops. Everything’s on fire. In “traditional” software development, the unit tests are the backbone of the “fast feedback” cycle. You’re running thousands of tests per second, ideally, and an entire test run might take 5-10 seconds. That’s the sweet spot, where you can code and test in a rapid cycle of ever-increasing awesomeness.
Interestingly, when I ask the users of most infrastructure management systems about unit testing, they either get a blank look on their face, or, at best, point me in the direction of the likes of Test Kitchen, which is described quite clearly as an integration platform, not a unit testing platform.
Puppet has rspec-puppet, which is a pretty solid unit testing framework for Puppet manifests – although it isn’t widely used. Others, though… nobody seems to have any ideas. The “blank look” is near-universal.
If “infrastructure developers” want to be taken seriously, we need to learn a little about what’s involved in the “development” part of the title we’ve bestowed upon ourselves. This means knowing what the various types of testing are, and having tools which enable and encourage that testing. It also means things like release management, documentation, modularity and reusability, and considering backwards compatibility.
All of this needs to apply to everything that is within the remit of the infrastructure developer. You don’t get to hand-wave away any of this just because “it’s just configuration!”. This goes double when your “just configuration!” is a hundred lines of YAML interspersed with templating directives (SaltStack, I’m looking at you).
Mon, 03 Aug 2015 00:00:00 AEST
This is the entire complaint that was received:
We are an IT security company from Spain.
We have detected sensitive information belonging to Banesco Banco Universal, C.A. clientes.
As authorized representative in the resolution of IT security incidents affecting Banesco Banco Universal, C.A., we demand the deletion of the content related to Banesco Banco Universal, C.A, clients. This content violates the law about electronic crime in Venezuela (see “Ley Especial sobre Delitos Informáticos de Venezuela”, Chapter III, Articles 20 y 23).
Note the complete lack of any information regarding what URLs, or even which site(s), they consider to be problematic. Nope, just “delete all our stuff!” and a wave of the “against the law somewhere!” stick.
Tue, 14 Jul 2015 00:00:00 AESTIn a comment to my previous post, Daniele asked the entirely reasonable question, Would you like to comment on why you think that DNSSEC+DANE are not a possible and much better alternative? Where DANE fails to be a feasible alternative to the current system is that it is not “widely acknowledged to be superior in every possible way”. A weak demonstration of this is that no browser has implemented DANE support, and very few other TLS-using applications have, either. The only thing I use which has DANE support that I’m aware of is Postfix – and SMTP is an application in which the limitations of DANE have far less impact. My understanding of the limitations of DANE, for large-scale deployment, are enumerated below. DNS Is Awful Quoting Google security engineer Adam Langley: But many (~4% in past tests) of users can’t resolve a TXT record when they can resolve an A record for the same name. In practice, consumer DNS is hijacked by many devices that do a poor job of implementing DNS. Consider that TXT records are far, far older than TLSA records. It seems likely that TLSA records would fail to be retrieved greater than 4% of the time. Extrapolate to the likely failure rate for lookup of TLSA records would be, and imagine what that would do to the reliability of DANE verification. It would either be completely unworkable, or else would cause a whole new round of “just click through the security error” training. Ugh. This also impacts DNSSEC itself. Lots of recursive resolvers don’t validate DNSSEC, and some providers mangle DNS responses in some way, which breaks DNSSEC. Since OSes don’t support DNSSEC validation “by default” (for example, by having the name resolution APIs indicate DNSSEC validation status), browsers would essentially have to ship their own validating resolver code. Some people have concerns around the “single point of control” for DNS records, too. While the “weakest link” nature of the CA model is terribad, there is a significant body of opinion that replacing it with a single, minimally-accountable organisation like ICANN isn’t a great trade. Finally, performance is also a concern. Having to go out-of-band to retrieve TLSA records delays page generation, and nobody likes slow page loads. DNSSEC Is Awful Lots of people don’t like DNSSEC, for all sorts of reasons. While I don’t think it is quite as bad as people make out (I’ve deployed it for most zones I manage, there are some legitimate issues that mean browser vendors aren’t willing to rely on DNSSEC. 1024 bit RSA keys are quite common throughout the DNSSEC system. Getting rid of 1024 bit keys in the PKI has been a long-running effort; doing the same for DNSSEC is likely to take quite a while. Yes, rapid rotation is possible, by splitting key-signing and zone-signing (a good design choice), but since it can’t be enforced, it’s entirely likely that long-lived 1024 bit keys for signing DNSSEC zones is the rule, rather than exception. DNS Providers are Awful While we all poke fun at CAs who get compromised, consider how often someone’s DNS control panel gets compromised. Now ponder the fact that, if DANE is supported, TLSA records can be manipulated in that DNS control panel. Those records would then automatically be DNSSEC signed by the DNS provider and served up to anyone who comes along. Ouch. In theory, of course, you should choose a suitably secure DNS provider, to prevent this problem. Given that there are regular hijackings of high-profile domains (which, presumably, the owners of those domains would also want to prevent), there is something in the DNS service provider market which prevents optimal consumer behaviour. Market for lemons, perchance? Conclusion None of these problems are unsolvable, although none are trivial. I like DANE as a concept, and I’d really, really like to see it succeed. However, the problems I’ve listed above are all reasonable objections, made by people who have their hands in br[...]
Tue, 07 Jul 2015 00:00:00 AEST
The Internet is going encrypted. Revelations of mass-surveillance of Internet traffic has given the Internet community the motivation to roll out encrypted services – the biggest of which is undoubtedly HTTP.
The weak point, though, is SSL Certification Authorities. These are “trusted third parties” who are supposed to validate that a person requesting a certificate for a domain is authorised to have a certificate for that domain. It is no secret that these companies have failed to do the job entrusted to them, again, and again, and again. Oh, and another one.
However, at this point, doing away with CAs and finding some other mechanism isn’t feasible. There is no clear alternative, and the inertia in the current system is overwhelming, to the point where it would take a decade or more to migrate away from the CA-backed SSL certificate ecosystem, even if there was something that was widely acknowledged to be superior in every possible way.
This is where Certificate Transparency comes in. This protocol, which works as part of the existing CA ecosystem, requires CAs to publish every certificate they issue, in order for the certificate to be considered “valid” by browsers and other user agents. While it doesn’t guarantee to prevent misissuance, it does mean that a CA can’t cover up or try to minimise the impact of a breach or other screwup – their actions are fully public, for everyone to see.
Much of Certificate Transparency’s power, however, is diminished if nobody is looking at the certificates which are being published. That is why I have launched sslaware.com, a site for searching the database of logged certificates. At present, it is rather minimalist, however I intend on adding more features, such as real-time notifications (if a new cert for your domain or organisation is logged, you’ll get an e-mail about it), and more advanced searching capabilities.
If you care about the security of your website, you should check out SSL Aware and see what certificates have been issued for your site. You may be unpleasantly surprised.
Sun, 15 Feb 2015 00:00:00 AEDTEver worked at a company (or on a codebase, or whatever) where it seemed like, no matter what the question was, the answer was written down somewhere you could easily find it? Most people haven’t, sadly, but they do exist, and I can assure you that it is an absolute pleasure. On the other hand, practically everyone has experienced completely undocumented systems and processes, where knowledge is shared by word-of-mouth, or lost every time someone quits. Why are there so many more undocumented systems than documented ones out there, and how can we cause more well-documented systems to exist? The answer isn’t “people are lazy”, and the solution is simple – though not easy. Why Johnny Doesn’t Read When someone needs to know something, they might go look for some documentation, or they might ask someone else or just guess wildly. The behaviour “look for documentation” is often reinforced negatively, by the result “documentation doesn’t exist”. At the same time, the behaviours “ask someone” and “guess wildly” are positively reinforced, by the results “I get my question answered” and/or “at least I can get on with my work”. Over time, people optimise their behaviour by skipping the “look for documentation” step, and just go straight to asking other people (or guessing wildly). Why Johnny Doesn’t Write When someone writes documentation, they’re hoping that people will read it and not have to ask them questions in order to be productive and do the right thing. Hence, the behaviour “write documentation” is negatively reinforced by the results “I still get asked questions”, and “nobody does things the right way around here, dammit!” Worse, though, is that there is very little positive reinforcement for the author: when someone does read the docs, and thus doesn’t ask a question, the author almost certainly doesn’t know they dodged a bullet. Similarly, when someone does things the right way, it’s unlikely that anyone will notice. It’s only the mistakes that catch the attention. Given that the experience of writing documentation tends to skew towards the negative, it’s not surprising that eventually, the time spent writing documentation is reallocated to other, more utility-producing activities. Death Spiral The combination of these two situations is self-reinforcing. While a suitably motivated reader might start by strictly looking for documentation, or an author initially be enthused to always fully documenting their work, over time the “reflex” will be for readers to just go ask someone, because “there’s never any documentation!”, and for authors to not write documentation because “nobody bothers to read what I write anyway!”. It is important to recognise that this iterative feedback loop is the “natural state” of the reader/author ecosystem, resulting in something akin to thermodynamic entropy. To avoid the system descending into chaos, energy needs to be constantly applied to keep the system in order. The Solution Effective methods for avoiding the vicious circle can be derived from the things that cause it. Change the forces that apply themselves to readers and authors, and they will behave differently. On the reader’s side, the most effective way to encourage people to read documentation is for it to consistently exist. This means that those in control of a project or system mustn’t consider something “done” until the documentation is in a good state. Patches shouldn’t be landed, and releases shouldn’t be made, unless the documentation is altered to match the functional changes being made. Yes, this requires discipline, which is just a form of energy application to prevent entropic decay. Writing documentation should be an explicit and well-understood part of somebody’s job description. Whoever is responsible for documentation needs to be given the time to do it properly. Writing well take[...]
Sun, 23 Nov 2014 00:00:00 AEDT
You may have heard that Uber has been under a bit of fire lately for its desires to hire private investigators to dig up “dirt” on journalists who are critical of Uber. From using users’ ride data for party entertainment, putting the assistance dogs of blind passengers in the trunk, adding a surcharge to reduce the number of dodgy drivers, or even booking rides with competitors and then cancelling, or using the ride to try and convince the driver to change teams, it’s pretty clear that Uber is a pretty good example of how companies are inherently sociopathic.
However, most of those examples are internal stupidities that happened to be made public. It’s a very rare company that doesn’t do all sorts of shady things, on the assumption that the world will never find out about them. Uber goes quite a bit further, though, and is so out-of-touch with the world that it blogs about analysing people’s sexual activity for amusement.
You’ll note that if you follow the above link, it sends you to the Wayback Machine, and not Uber’s own site. That’s because the original page has recently turned into a 404. Why? Probably because someone at Uber realised that bragging about how Uber employees can amuse themselves by perving on your one night stands might not be a great idea. That still leaves the question open of what sort of a corporate culture makes anyone ever think that inspecting user data for amusement would be a good thing, let alone publicising it? It’s horrific.
Thankfully, despite Uber’s fairly transparent attempt at whitewashing (“clearwashing”?), the good ol’ Wayback Machine helps us to remember what really went on. It would be amusing if Uber tried to pressure the Internet Archive to remove their copies of this blog post (don’t bother, Uber; I’ve got a “Save As” button and I’m not afraid to use it).
In any event, I’ve never used Uber (not that I’ve got one-night stands to analyse, anyway), and I’ll certainly not be patronising them in the future. If you’re not keen on companies amusing themselves with your private data, I suggest you might consider doing the same.
Thu, 20 Nov 2014 00:00:00 AEDTUnless you’ve been living under a firewalled rock, you know that IPv6 is coming. There’s also a good chance that you’ve heard that IPv6 doesn’t have NAT. Or, if you pay close attention to the minutiae of IPv6 development, you’ve heard that IPv6 does have NAT, but you don’t have to (and shouldn’t) use it. So let’s say we’ll skip NAT for IPv6. Fair enough. However, let’s say you have this use case: A bunch of containers that need Internet access… That are running in a VM… On your laptop… Behind your home router! For IPv4, you’d just layer on the NAT, right? While SIP and IPsec might have kittens trying to work through three layers of NAT, for most things it’ll Just Work. In the Grand Future of IPv6, without NAT, how the hell do you make that happen? The answer is “Prefix Delegation”, which allows routers to “delegate” management of a chunk of address space to downstream routers, and allow those downstream routers to, in turn, delegate pieces of that chunk to downstream routers. In the case of our not-so-hypothetical containers-in-VM-on-laptop-at-home scenario, it would look like this: My “border router” (a DNS-323 running Debian) asks my ISP for a delegated prefix, using DHCPv6. The ISP delegates a /561. One /64 out of that is allocated to the network directly attached to the internal interface, and the rest goes into “the pool”, as /60 blocks (so I’ve got 15 of them to delegate, if required). My laptop gets an address on the LAN between itself and the DNS-323 via stateless auto-addressing (“SLAAC”). It also uses DHCPv6 to request one of the /60 blocks from the DNS-323. The laptop puts one /64 from that block as the address space for the “virtual LAN” (actually a Linux bridge) that connects the laptop to all my VMs, and puts the other 15 /64 blocks into a pool for delegation. The VM that will be running the set of containers under test gets an address on the “all VMs virtual LAN” via SLAAC, and then requests a delegated /64 to use for the “all containers virtual LAN” (another bridge, this one running on the VM itself) that the containers will each connect to themselves. Now, almost all of this Just Works. The current releases of ISC DHCP support prefix delegation just fine, and a bit of shell script plumbing between the client and server seals the deal – the client needs to rewrite the server’s config file to tell it the netblock from which it can delegate. Except for one teensy, tiny problem – routing. When the DHCP server delegates a netblock to a particular machine, the routing table needs to get updated so that packets going to that netblock actually get sent to the machine the netblock was delegated to. Without that, traffic destined for the containers (or the VM) won’t actually make it to its destination, and a one-way Internet connection isn’t a whole lot of use. I cannot understand why this problem hasn’t been tripped over before. It’s absolutely fundamental to the correct operation of the delegation system. Some people advocate running a dynamic routing protocol, but that’s a sledgehammer to crack a nut if ever I saw one. Actually, I know this problem has been tripped over before, by OpenWrt. Their solution, however, was to use a PHP script to scan logfiles and add routes. Suffice it to say, that wasn’t an option I was keen on exploring. Instead, I decided to patch ISC DHCP so that the server can run an external script to add the necessary routes, and perhaps modify firewall rules – and also to reverse the process when the delegation is released (or expired). If anyone else wants to play around with it, I’ve put it up on Github. I don’t make any promises that it’s the right way to do it, necessarily, but it works, and the script I’ve added in contrib/prefix-deleg[...]
Sun, 16 Nov 2014 00:00:00 AEDT
If you’re someone who doesn’t like Debian’s policy of automatically starting
on install (or its heinous cousin, the
ENABLE variable in
/etc/default/), then running an init system other than systemd
should work out nicely.
Wed, 15 Oct 2014 00:00:00 AEDT
For some reason, I seem to end up writing software for very esoteric use-cases. Today, though, I think I’ve outdone myself: I sat down and wrote a Ruby library to get and set process resource limits – those things that nobody ever thinks about except when they run out of file descriptors.
I didn’t even have a direct need for it. Recently I was grovelling through the EventMachine codebase, looking at the filehandle limit code, and noticed that the pure-ruby implementation didn’t manipulate filehandle limits. I considered adding it, then realised that there wasn’t a library available to do it. Since I haven’t berked around with FFI for a while, I decided to write rlimit. Now to find the time to write that patch for EventMachine…
Since I doubt there are many people who have a burning need to manipulate rlimits in Ruby, this gem will no doubt sit quiet and undisturbed in the dark, dusty corners of rubygems.org. However, for the three people on earth who find this useful: you’re welcome.
Sat, 30 Aug 2014 00:00:00 AEST
If you’ve noticed your chrome/chromium on Linux having problems since you upgraded to somewhere around version 35/36, you’re not alone. Thankfully, it’s relatively easy to workaround. It will hit people who keep their browser open for a long time, or who have lots of tabs (or if you’re like me, and do both).
To tell if you’re suffering from this particular problem, crack open your
~/.xsession-errors file (or wherever your system logs stdout/stderr from
programs running under X), and look for lines that look like this:
[22161:22185:0830/124533:ERROR:shared_memory_posix.cc(231)] Creating shared memory in /dev/shm/.org.chromium.Chromium.gFTQSy failed: Too many open files
[22161:22185:0830/124601:ERROR:host_shared_bitmap_manager.cc(122)] Cannot create shared memory buffer
If you see those errors, congratulations! The rest of this blog post will be of use to you.
There’s probably a myriad of bugs open about this problem, but the one I found was #367037: Shared memory-related tab crash. It turns out there’s a file handle leak in the chromium codebase somewhere, relating to shared memory handling. There’s no fix available, but the workaround is quite simple: increase the number of files that processes are allowed to have open.
System-wide, you can do this by creating a file
/etc/security/limits.d/local-nofile.conf, containing this line:
* - nofile 65535
You could also edit
/etc/security/limits.conf to contain the same line, if
you were so inclined. Note that this will only take effect next time you
login, or perhaps even only when you restart X (or, at worst, your entire
This doesn’t help you if you’ve got Chromium already open and you’d like to
stop it from crashing Right Now (perhaps restarting your machine would be a
terrible hardship, causing you to lose your hard-won uptime record), then
you can use a magical tool called
prlimit syscall is available if you’re running a Linux 2.6.36 or later
kernel, and running at least glibc 2.13. You’ll have a
line program if you’ve got util-linux 2.21 or later. If not, you can use
the example source code in the
prlimit(2) manpage, changing
RLIMIT_NOFILE, and then running it like this:
argument is taken from the first number in the log messages from
.xsession-errors – in the example above, it’s
And now, you can go back to using your tabs as ersatz bookmarks, like I do.