Brian's Waste of Time

Updated: 2016-03-28T23:25:55-07:00


Using the Real Network with Docker


Overlay networks are all the rage with Docker, but I generally prefer to just use the network the host is sitting on, rather than deal with NATing in and out of an overlay network. I just rebuilt my general utility box at home and set it up this way, so I figured I’d document it in case anyone else finds it useful, or better, can point out how to improve it! This is the same approach Stefan Schimanski described for Ubuntu. It’s super simple and works well.

To start with, we need to pick an IP block to give to the containers on this particular host. My home network uses a (mostly) flat network, so I picked out a block for containers on this host. The host in question is running vanilla CentOS 7 and has two NICs. I left the first NIC (eno1) alone, as that is what I was using to ssh in and muck about. The second one I wanted to put in a bridge interface for Docker.

So, first, I created a bridge, br0. This can be done lots of ways. Personally I used nmtui to make the bridge, then reconfigured it by editing the config file in /etc/. Probably didn’t need nmtui in there, but I haven’t done networking beyond “enable DHCP” in a Red Hat derivative since RH6 (the 1999 edition, not RHEL6).

    # /etc/sysconfig/network-scripts/ifcfg-bridge-br0
    DEVICE=br0
    TYPE=Bridge
    BOOTPROTO=none
    ONBOOT=yes
    STP=no
    IPADDR=
    PREFIX=16
    GATEWAY=
    DNS1=
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=no
    IPV6INIT=no
    NAME=bridge-br0
    UUID=97dcd4e2-0fdc-2301-8ffc-f0f60c835659

The bridge is configured to use the network (well, looking at it but the mask wipes out the 3), with the same details (gateway, dns, etc). This is exactly the network as relayed by DHCP, though statically configured. It might be possible to configure a bridge via DHCP, but I have no idea how.

The next step is to add the second NIC (eno2) to the bridge:

    # /etc/sysconfig/network-scripts/ifcfg-eno2
    TYPE=Ethernet
    BOOTPROTO=none
    BRIDGE=br0
    NAME=eno2
    UUID=e4e99e09-aa93-4d64-ab0d-5e2180e19c58
    DEVICE=eno2
    ONBOOT=yes

Note that this doesn’t have an IP of its own.
I don’t want it to do anything but relay packets through to the containers, and vice versa.

We then tell Docker to use the br0 bridge and allocate IPs from that block. This is done via --bridge=br0 --fixed-cidr= when starting the docker daemon:

    # /usr/lib/systemd/system/docker.service
    [Unit]
    Description=Docker Application Container Engine
    Documentation=
    docker.socket
    Requires=docker.socket

    [Service]
    Type=notify
    ExecStart=/usr/bin/docker daemon --bridge=br0 --fixed-cidr= -H fd://
    MountFlags=slave
    LimitNOFILE=1048576
    LimitNPROC=1048576
    LimitCORE=infinity
    TimeoutStartSec=0

    [Install]

Docker kindly looks at the bridge it is putting virtual NICs in to get the network settings, so this Just Works. Nice touch picking that up from the bridge, Docker!

Finally, we need to remember to enable IP forwarding:

    echo "1" > /proc/sys/net/ipv4/ip_forward

and to make it persistent, I added a file at /etc/sysctl.d/60-ip_forward.conf:

    # /etc/sysctl.d/60-ip_forward.conf
    net.ipv4.ip_forward=1

Et voila, docker is now handing out IPs on the local network! If I were setting this up for a datacenter, or in a VPC, I’d give more thought to how big a block to give each container host. The full /24 feels generous. Look at your expected container density and go from there.

Docker and Mesos, Sitting in a Tree...


Docker and Mesos are like the old Reese’s Peanut Butter Cup commercials.


Docker provides the best application packaging implementation I have seen. Mesos, while designed for building applications, happens to also provide the best generally available, general purpose cluster scheduler I know about. Put the two together, and you have the basis for any new application deployment and management tooling I’d build today.

Leaving Ning, Returning to the Valley


After eight years, I’m leaving Ning. I joined this little company called 24 Hour Laundry back in 2005 and promptly started arguing with Diego. They were good arguments, I learned a lot.

I could go on for pages about how great Ning was. I’ve had the privilege of working with a number of great people at several companies, but the sheer density of brilliant, knowledgeable, and productive folks at Ning was something that is hard to describe. In 2005 we built a platform as a service, before that term existed, before EC2 launched and the term “cloud computing” was coined. Today, eight years later, I watch as a number of the most exciting technologies and practices emerging are new implementations of things we’ve been using internally for years. I am immensely proud of the things we did and the people I worked with.

Four years ago I left the valley, though I stayed with Ning. It was challenging, as I worked mostly with infrastructure, and those teams were not distributed, unlike most other teams at Ning. Around that time Ning went through some turmoil. There was a big global recession, money got tight, and we had to shift our business model. This was when the quality of the folks at Ning, and the strength of our relationships, really showed. Even those who left remain close – we hang out in IRC together, get together frequently, work on each other’s open source libraries, and eagerly help each other out all the time. I guess this shouldn’t be surprising, coming out of a company focused on building and enabling online communities.

I’m still bullish on Ning and Glam. The work on Ning 3.0 is really good (and still going furiously!) and products in the pipeline are exciting. After eight years, though, it is time for something new. I will miss the folks there, and hope our practice of staying in close contact continues!

I am sad to be leaving Ning, but am excited to be returning to the valley. I usually describe the valley as a black hole for technologists; it eventually draws everyone into its orbit, if not into itself. It is as Firenze was during the Renaissance: the center of the world for practical purposes. The reasons I left still hold, but the reasons for returning are even stronger, it turns out. I cannot wait to see all my friends again face to face, instead of through IRC!

Setting up a Go Development Environment


I quite like Go, but some things about its standard practices are downright weird if you are coming from a more standard, project-centric approach to hackery. I don’t always (okay, usually don’t) follow these standard practices personally, but anyone coming to Go should probably understand them and work with them, so that things in the wider community make sense. Everything I say here is my understanding of practices, and is probably wrong, so correct me if I am, for the sake of anyone else finding this article!

Workspace

The first thing to grok is the idea of workspaces. I don’t know if this is any kind of commonly accepted term, but I have seen it used in #go-nuts and elsewhere, so I use it as well. A workspace is a space which holds projects, yours and anything you depend on (and usually a smattering of random things you experimented with and forgot about). You want to check out your project into your workspace, at the correct path. We’ll make our hypothetical workspace at ~/src/gospace:

    $ mkdir -p ~/src/gospace
    $ export GOPATH=~/src/gospace
    $ export PATH=~/src/gospace/bin:$PATH
    $ cd ~/src/gospace

Within this workspace we have three root directories: src, which holds source code; pkg, which holds compiled bits; and bin, which holds executables. We set GOPATH to point at the workspace, and add the workspace’s bin/ to our PATH. You can have multiple paths in your GOPATH if you want to, and I know of folks who do, but I haven’t figured out how it makes life easier yet, so I don’t. (Update: Charles Hooper pointed out that the src/, pkg/, and bin/ subdirectories in the workspace will be created automatically when they are needed.)
Your Project

Let’s say you want to work on a project called variant, and you will host the source code on Github. You will need to check this out to the matching path under ~/src/gospace/src/. This is awkward, so we’ll just check it out via go get and fix it up:

    $ cd ~/src/gospace
    $ go get
    $ cd src/
    $ git checkout master

This works if your project already exists; if it doesn’t, you’ll need to go make the directories and set up the project basics, something like:

    $ cd ~/src/gospace
    $ mkdir -p src/
    $ cd src/
    $ touch
    $ git init
    $ git remote add origin
    $ git push -u origin master

Assuming the github repo is waiting for you. This isn’t a git/github tutorial though :-)

Your project’s place in the workspace is intimately tied to the source repo for it, as Go’s dependency resolution mechanism relies on import paths matching up with source repos. It is kind of weird, and has some drawbacks, but is also really convenient (when you don’t hit those drawbacks).

Working on Your Project

For the package you are hacking on, cd into the dir of the package, hack, and run go test or go build or go install as you want. These default to the “current dir”, which does a reasonable job of figuring things out (as long as you don’t have symlinks involved; Go hates symlinks for some reason).

Dependencies

If you want to fetch something else, go get it. If you want to use the MyMySql driver/library you can go get it and Go will fetch and build it for you (and its dependencies). The source will land under ~/src/gospace/src/, just like your project sources. It will be a git clone of the repo. If you want a specific version (say the v1.4.5 tag), cd over to it and git checkout v1.4.5.

If you want to install a utility written in Go, go get it as well. For example, to install dotCloud’s Docker you can go get it. The target for the go get is the package name (that would be imported) for the main package of what you want. It will fetch it, build it, and put the binary in ~/src/gospace/bin for you. (update[...]

First Steps with Apache Mahout


I’ve been digging into a home-rolled machine learning classification system at $work recently, so decided to noodle around with Mahout to get some perspective on a different toolchain. Mahout is not the most approachable library in the world, so I picked up Mahout in Action and skipped right to the classifiers section. The book helped (a lot!) but it is still kind of tricky, so here are some notes to help anyone else get started.

First, the different classifiers seem mostly unrelated to each other. The book talks mostly about the SGD implementation, so I focused on that. It seems to be the easiest to work with sans hadoop cluster, so that is convenient.

Creating Vectors

The first problem is that you want to write out vectors somewhere so you can read them back in as you noodle on training. For my experiment I did the stereotypical two-category spam classifier (easy access to a data corpus). My corpus consists of mostly mixed plain text and html-ish samples, so to normalize them I ran each sample through jTidy and then used Tika’s very forgiving text extractor to get just the character data. Finally, I ran it through Lucene’s standard analyzer and downcase filter. I didn’t bother with stemming as the corpus is mixed language and I didn’t want to get into that for simple experimentation. The code isn’t optimized at all; it is just the simplest path to the output I want.
    Preconditions.checkArgument(title != null, "title may not be null");
    Preconditions.checkArgument(description != null, "description may not be null");
    try {
        final List<String> fields = Lists.newArrayList();
        for (String raw : Arrays.asList(description, title)) {
            Tidy t = new Tidy();
            t.setErrout(new PrintWriter(new ByteArrayOutputStream()));
            StringWriter out = new StringWriter();
            t.parse(new StringReader(raw), out);

            AutoDetectParser p = new AutoDetectParser();
            p.parse(new ByteArrayInputStream(out.getBuffer().toString().getBytes()),
                    new TextContentHandler(new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) throws SAXException {
                            CharBuffer buf = CharBuffer.wrap(ch, start, length);
                            String s = buf.toString();
                            fields.add(s);
                        }
                    }), new Metadata());
        }

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        StringReader in = new StringReader(Joiner.on(" ")
                                                 .join(fields)
                                                 .replaceAll("\\s+", " "));
        TokenStream ts = analyzer.tokenStream("content", in);
        ts = new LowerCaseFilter(Version.LUCENE_34, ts);
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        List<String> words = Lists.newArrayList();
        while (ts.incrementToken()) {
            char[] termBuffer = termAtt.buffer();
            int termLen = termAtt.length();
            String w = new String(termBuffer, 0, termLen);
            words.add(w);
        }
        this.scrubbedWords = words;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }

SGD works on identically sized vectors, so we need to convert that bag of words into a vector, and as this is training data, associate the value of the target variable (spam or ham) with the vector when we store it. We cannot encode the target variable in the vector, as it would then be used for training, but it turns out that Mahout has a useful NamedVector implementation. This could be used to encode an id or such for external lookup, but I just used it to encode the target variable. Mahout offers a lot of ways to vectorize text and no practical guidance on which way works.
I settled on using the AdaptiveWordValueEncoder for each word in the bag of text, in order to get some kind of useful frequenc[...]

Bazooka-Squirrel Solutions


Bazooka-Squirrel Solution: when you cannot track down a bug in a component, and it looks like you won’t be able to find it in less time than it will take to reimplement the component, you just reimplement the component.

From: “I cannot hit the squirrel in that tree with the BB gun, I’m going to just take a bazooka to the tree.”

RPC Over SSH and Domain Sockets


I really like using SSH for authentication and authorization when possible – it is very configurable, well understood, and more secure than anything I am likely to design. It is also generally pretty easy to have applications communicate over SSH. A nice model is to have the server listen on a domain socket in a directory with appropriate permissions, and clients connect over ssh and netcat to talk to it. Logically, on the client it is:

    $ ssh /usr/bin/nc -U /tmp/foo

And voila, your client (or shell in this case) is connected to the remote domain socket.

After finding Jeff Hodges’s wonderful writeup on go.crypto/ssh I sat down to make Go do this internally. It was fun, and pretty straightforward. The server is just a net/rpc server which listens on a domain socket and responds with a greeting:

    package main

    import (
        "fmt"
        "log"
        "net"
        "net/rpc"
        "os"
        "os/signal"
        "syscall"
    )

    // rpc response
    type Response struct {
        Greeting string
    }

    // rpc request
    type Request struct {
        Name string
    }

    // rpc host struct thing
    type Greeter struct{}

    // our remotely invocable function
    func (g *Greeter) Greet(req Request, res *Response) (err error) {
        res.Greeting = fmt.Sprintf("Hello %s", req.Name)
        return
    }

    // start up rpc listener at path
    func ServeAt(path string) (err error) {
        rpc.Register(&Greeter{})
        listener, err := net.Listen("unix", path)
        if err != nil {
            return fmt.Errorf("unable to listen at %s: %s", path, err)
        }
        go rpc.Accept(listener)
        return
    }

    // ./server /tmp/foo
    func main() {
        path := os.Args[1]
        err := ServeAt(path)
        if err != nil {
            log.Fatalf("failed: %s", err)
        }
        defer os.Remove(path)

        // block until we are signalled to quit
        wait()
    }

    func wait() {
        signals := make(chan os.Signal)
        signal.Notify(signals, syscall.SIGINT, syscall.SIGKILL, syscall.SIGHUP)
        <-signals
    }

The client is the fun part. It establishes an SSH connection to the server host, then fires off a Session against netcat, attaches an RPC client to that session, and does its stuff!
    package main

    import (
        ""
        "fmt"
        "io"
        "log"
        "net"
        "net/rpc"
        "os"
        "strings"
    )

    // RPC response container
    type Response struct {
        Greeting string
    }

    // RPC request container
    type Request struct {
        Name string
    }

    // It would be nice if ssh.Session was an io.ReaderWriter
    // proposal submitted :-)
    type NetCatSession struct {
        *ssh.Session // define Close()
        writer io.Writer
        reader io.Reader
    }

    // io.Reader
    func (s NetCatSession) Read(p []byte) (n int, err error) {
        return s.reader.Read(p)
    }

    // io.Writer
    func (s NetCatSession) Write(p []byte) (n int, err error) {
        return s.writer.Write(p)
    }

    // given the established ssh connection, start a session against netcat and
    // return a io.ReaderWriterCloser appropriate for rpc.NewClient(...)
    func StartNetCat(client *ssh.ClientConn, path string) (rwc *NetCatSession, err error) {
        session, err := client.NewSession()
        if err != nil {
            return
        }
        cmd := fmt.Sprintf("/usr/bin/nc -U %s", path)
        in, err := session.StdinPipe()
        if err != nil {
            return nil, fmt.Errorf("unable to get stdin: %s", err)
        }
        out, err := session.StdoutPipe()
        if err != nil {
            return nil, fmt.Errorf("unable to get stdout: %s", err)
        }
        err = session.Start(cmd)
        if err != nil {
            return nil, fmt.Errorf("unable to start '%s': %s", cmd, err)
        }
        return &NetCatSession{session, in, out}, nil
    }

    // ./client localhost:/tmp/foo Brian
    func main() {
        parts := strings.Split(os.Args[1], ":")
        host := parts[0]
        path := parts[1]
        name := os.Args[2]

        // SSH setup, we assume current username and use the ssh agent
        // for auth
        agent_sock, err := net.Dial("unix", os.[...]

Some JDBI 3 Noodling


I grabbed the most recent build of jdk8 w/ lambdas tonight and started noodling on jdbi 3, which will require Java 8.

    Set<Something> things = jdbi.withHandle(h -> {
        h.execute("insert into something (id, name) values (?, ?)", 1, "Brian");
        h.execute("insert into something (id, name) values (?, ?)", 2, "Steven");
        return h.query("select id, name from something")
                .map(rs -> new Something(rs.getInt(1), rs.getString(2)))
                .into(new HashSet<Something>());
    });

    assertThat(things).isEqualTo(ImmutableSet.of(new Something(1, "Brian"),
                                                 new Something(2, "Steven")));

The Stream interface is kind of heavy to implement as it stands right now, and I couldn’t get IDEA 12 and the JDK to agree on valid syntax. Neither one wants to let me omit the type parameter in the .into(new HashSet<Something>()); line, which the most recent State of the Collections implies I should be able to.

It would be really nice if the lambda syntax sugar would quietly drop return values when it is auto-converting to a Block without the { ... }. I had to make some things accept a Function rather than a Block even though I ignore the return value; this will then bite you when you use something that doesn’t have a return value. Java has side effects; sometimes we call a function which returns a value just for the side effects.

All told, I like the changes so far quite a bit despite my quibbles :-)

Go is PHP for the Backend


I’ve had the opportunity to use Go for my most recent project at work. The stuff I’ve done in Go is a minimally distributed system (two types of servers, tens of instances max) optimized for byte slinging throughput. The relationship with Go started out a bit rocky but got turned around.

After using it for a couple weeks, I described Go to my friend David as “PHP for the backend.” Despite my pretty low opinion of PHP, this was intended as a compliment. Regardless of the quality of the execution of PHP, the intent seems to have been to get out of your way and make building web pages easy. Go feels like that, but for services.

PHP is horribly inconsistent, breaks all the rules about programming language design, and is infuriating. Despite all that, it’s still the most widely used language for building web apps. Go is rather similar – it is inconsistent, ignores anything a modern programming language is supposed to include, and doesn’t use whitespace, except to disallow it in reasonable places (say, as a newline before a {). It offers nice first class functions, but then cripples them by having a strong type system which seems to ignore everything that has been done with type systems for the last couple decades. You cannot even write a proper map(...) because Go is strongly typed with no type parameterization. Go really wants to use tabs. To top it off, errors are return values, and they are return values which are easy to ignore. Idiomatic Go is to have several lines of boilerplate after every single function invocation which can possibly fail.

I got really annoyed, flamed Go on Twitter, and went for a walk. When I came back, several friends, in particular Toby, had commented in IM about my issues, pointing out ways of trying to handle what I was being annoyed by.
They were all very reasonable, but basically came down to something along the lines of, “Go doesn’t do what you are trying to do; there are some brutal hacks to approximate it, like how you do functional-ish programming in Java, but you are fighting the system.”

Calmed down, I stepped back. I know of folks having great success with Go, and it offers a lot that I want (native code, UNIX friendly, higher level than C, lower level than Python or Ruby, garbage collected, strongly typed, good performance, good concurrency support, etc), so I tried to stop programming my way, and start programming Go’s way. Go has a way of programming. Go is totally optimized for that way of programming, in fact. Programming any way other than Go’s way, with Go, will be that recipe for frustration I bounced my skull against.

Go’s way is not pretty to someone indoctrinated with the modern functional aesthetic, but it works, and works well. Really well. Go’s inconsistencies and limitations hold together, bizarrely enough. They steer code towards particular structures and behavior that are good. Based on my limited experience (I am still a Go novice, I have been using it in anger for only about three weeks), Go seems to be as much, or more, about how it is used as how the language itself works. It seems to be optimized for solving design issues a particular way, and for solving issues around programming (again, a particular way), rather than for being a maximally expressive or powerful language.

This, of course, should not have been a surprise to me. Every presentation, description of purpose, etc, about Go says this. I had read them and said, “that makes sense, sure.” I still went into it looking at the language and wanting to use the language to solve the problems I had in the way I conceptualized them. That failed. When I adopted Go’s way of working (as I slowly started to see it) things succeeded. I also relearned some fundamental things I already knew but had apparentl[...]

Private Apt Repos in S3


Setting up a private apt repository in S3 is actually not too bad. This HOWTO sets up two minimal repositories, one public and one private. You need both. All work is to be done on a Debian or Ubuntu machine with an architecture matching what you are deploying to (ie, amd64).

Kyle Shank did the heavy lifting for us by making an s3:// transport scheme for apt. Sadly, that package isn’t in any reliable public repos I know of, so to be safe this HOWTO will have you host it in a repo you control. The process is therefore: setting up the public repo to hold the s3 apt handler, installing that handler, then setting up a private repo which uses authenticated s3 connections to access your debs.

The first repo is a public repo which exists to hold the apt-transport-s3 package. Check out a fork of apt-s3 and build it using make deb. You probably will want to nest the checkout in a dedicated directory, as the build drops the deb one directory up from the one you build from. Go figure. It requires libapt-pkg-dev and libcurl4-openssl-dev be installed; see the README for details, it is pretty good. Once you have that built, you’ll need to put it into our public repo. Doing this looks like:

    $ mkdir repos
    $ cd repos
    $ mkdir -p public-repo/binary
    $ cp ~/src/borkage/apt-transport-s3_1.1.1ubuntu2_amd64.deb public-repo/binary
    $ cd public-repo
    $ dpkg-scanpackages binary /dev/null | gzip -9c > binary/Packages.gz
    $ dpkg-scansources binary /dev/null | gzip -9c > binary/Sources.gz

We made a repos directory which will hold both of our repos, then made a public-repo/binary directory for our binary artifacts. We copy in our apt-transport-s3 deb and build both package and source indexes. Be sure not to add a trailing slash to the binary bit in the dpkg-* incantations; it will throw off locations in the index. We build a Sources.gz, which will be empty, so that using add-repository doesn’t freak out. We now have a local copy of our repo, yea!
We want to push this up to s3, so make yourself a bucket for the public repo; I’ll call the demo one demo-public-repo. We’re going to sync this up using s3cmd. You should install it and configure it:

    $ sudo apt-get install s3cmd
    $ s3cmd --configure

Follow the instructions when configuring. Now we’ll use s3cmd to sync our repo up:

    $ cd repos
    $ s3cmd sync -P public-repo/ s3://demo-public-repo

Note the -P – we need the artifact herein to be public so that we can install it.

Okay, the private repo will be just like the public repo, except we’ll use a different bucket and not make it world readable:

    $ cd repos
    $ mkdir -p private-repo/binary
    $ cp ~/src/secret-stuff/target/secret_0.0.1.deb private-repo/binary
    $ cd private-repo
    $ dpkg-scanpackages binary /dev/null | gzip -9c > binary/Packages.gz
    $ dpkg-scansources binary /dev/null | gzip -9c > binary/Sources.gz
    $ cd ..
    $ s3cmd sync private-repo/ s3://demo-private-repo

Now log into the host which needs to use the private repo and add the following line to /etc/apt/sources.list:

    deb binary/

Of course, your URL will vary – find the HTTP URL for your bucket root and use that. This one happens to be in us-west-2 (Oregon); yours will most likely not be. Once it is added, install apt-transport-s3 via:

    $ sudo apt-get update
    $ sudo apt-get install -y --force-yes apt-transport-s3

We need the --force-yes as we didn’t sign the deb. Now, the magic: this allows us to add a repo url of the form:

    deb s3://:[] binary/

to /etc/apt/sources.list where you, again, get the right region and bucket information, and fill in your actual aws access and secret keys. The [...]

Rethinking Some Galaxy Core Assumptions


Galaxy has been very successful, in multiple companies, but I think it can actually be simplified and made more powerful. The first touches on the heart of Galaxy, the second on the often-argued configuration aspect.

RC Scripts

I described the heart of Galaxy as a tarball with an rc script. I think it likely that the rc script should give way to a simple run script. The difference is that the run script doesn’t daemonize – it simply keeps running. An implementation will probably need to have a controller process which does daemonize (or defer to upstart or its ilk for process management).

While writing an rc script is far from rocket surgery, it turns out that the nuances are annoying to implement again, and again, and again. The main nuance is daemonizing things correctly. I’d prefer to provide that for software rather than force applications to get it right. Many app servers handle daemonization well, but they also all (that I know of, anyway) provide a mechanism for running in the foreground as well.

Unfortunately, a run script model makes a no-install implementation much trickier. The lore on daemonizing from bash is tricky, and even assuming bash is available is itself tricky. Using something like daemonize is nice, but then it requires an installation. Grrr. This is an implementation problem though, and requiring some kind of installation on the appserver side may be worth it for simplifying the model.


In a moment of blinding DUH, I came back to environment variables for environmental information. I mean, it works for everyone on Heroku or Cloud Foundry.

There has been a trend in Galaxy implementations (and elsewhere) to use purely file based configuration. This is great for application configuration, but is meh for environmental configuration. This has led to most Galaxy implementations supporting some model of URL-to-path mapping, for placing configuration files into deployed applications. These mechanisms are a great way to provide escape hatch/override configuration, but they play against the goal of making deployments self contained, which I like. This punts on going all the way and putting environment information into the deploy bundle, which Dan likes to advocate, but I am not sold on this myself :-)

Regardless, a general purpose implementation probably needs to support both env var configuration and file based, but you can certainly recommend one way of making use of it.

What is Galaxy?


At Ning, in 2006, Martin and I wrote a deployment tool called Galaxy. Since that time I know of at least three complete reimplementations, two major forks, and half a dozen more partial reimplementations. In a bizarre twist of fate, I learned yesterday from Adrian that my friend James also has a clean room implementation, using Fabric, called Process Manager. Holy shit. Beyond reimplementations and forks from ex-Ninglets who are using a Galaxy derivative, I frequently hear from ex-Ninglets who are not and wish they could. We clearly got something right, it seems.

Fascinatingly, folks all seem to focus on different aspects of Galaxy in terms of what they love about it. They also tend to have a common set of complaints about how Ning’s version worked, and have adapted theirs to accommodate them.

To me, the heart of Galaxy is the concept of the galaxy bundle: a tarball with the application and its dependencies, coupled with an RC script at a known location inside the bundle. Given such a bundle, a Galaxy implementation is then the tooling for deploying and managing those bundles across a set of servers. From personal, and second hand, experience this simple setup can keep things happy well into the thousands of servers.

To many others, the heart of Galaxy seems to be the tooling itself, and the fairly nice way of managing applications separately from servers. At least one major user even ignores the idea of putting the applications and their dependencies in the bundle, and uses Galaxy to install RPMs! (I personally think this approach is not so great, but the person doing it is one of the best engineers I know, so am happy to believe I may be wrong.) Different folks have also drawn the line of what the Galaxy implementation should manage in quite different places.
In the original implementation, Galaxy included bundle and configuration repositories, along with how those repos were structured; an agent to control the application on the server; a console to keep track of it all; and a command line tool to query and take actions on the system. On the other hand, the Proofpoint/Airlift implementation weakens the contracts on configuration (in a good way), requires a Maven repository for bundles, supports an arbitrary number of applications per host, and has Galaxy handle server provisioning as well as application deployment. The Ness and, I believe, Metamarkets, implementations change the configuration contract significantly, also support several applications per host, and include much more local server state in what Galaxy itself manages. The other (generally minor) implementations and experiments have taken it in quite a few different directions, ranging from Pierre’s reimplementation using Erlang and CouchDB, to my reimplementation with no agents or console.

There seems to be an awful lot of experimentation around the concepts in Galaxy, which is awesome! Unfortunately, only the original implementation is very well documented at this point, so it is tough to use Galaxy unless you have used it before (hence my shock at James even knowing about it). I guess it’s time to start documenting and try to save other folks some work!

Learning to Code


Rewinding my career a ways, I want to weigh in on the Learn to Code debate as a non-programmer who coded. I am a professional programmer now, but my previous career was teaching English in High School (you are not allowed to take that as license to mock my grammar).

Programming and the Profession of Programming are quite different things. Programming is being able to efficiently tell a computer exactly what to do in a repeatable manner. The profession of programming is being able to efficiently convert business requirements into bugs.

As an English teacher I programmed regularly in order to make my life easier. I generated vocabulary quizzes (and grading sheets for them), I created interactive epic poetry (I shit you not) with my classes (those students really grokked epic poetry thereafter), I wrote hundreds of small scripts to calculate various things (many of which could have been done in Excel, but I knew Perl, not Excel), I turned at least one student onto cypherpunks during a study hall, and I built various one-off web applications for teachers, classes, groups, etc. I calculated lots of statistics on student performance, tests, and so on, so I could better understand and calibrate things (teachers may not always grade on a curve explicitly, but new teachers always do, at least hand-wavily, as they don't have their tests and teaching well calibrated yet).

Programming is a tool that let me be more efficient, that allows you to automate boring things, and that sometimes opens up options which would otherwise be unavailable. I later left teaching, went into technical writing, and then (back) into the profession of programming full time. As Zed Shaw put it well, "Programming as a profession is only moderately interesting. … You're much better off using code as your secret weapon in another profession." I happen to love programming for itself, so programming as a profession works well for me. Code is ephemeral though, and most folks don't like to "see your works become replaced by superior ones in a year. unable to run at all in a few more" as _why described. Programming is an exceptionally powerful tool for accomplishing other things.

Java URL Handlers


There are two ways to register your own URL scheme handler in Java, but they both kind of suck. The first is to set a system property to a list of packages and then name subpackages and classes therein just right; the second is to register a handler factory. The handler factory approach seems great, except that you can register one, once, ever – and, oh yeah, Tomcat registers one.

Given the complete brokenness of the factory registration, the sane way is to align packages and class names perfectly. This is annoying to do, so like all good programmers, I wrote a library to put a nice facade on the process: URL Scheme Registry. Using it is about as simple as I could figure out – because URL handlers need to be global (given how URL works) there is a single static method. To register the dinner scheme handler you just do:

UrlSchemeRegistry.register("dinner", DinnerHandler.class);

Et voila, you can now use dinner://okonomiyaki and an instance of DinnerHandler, which must extend URLStreamHandler, will be used if you retrieve the resource. Note that you can only register a given scheme once, and your handler must have a no-arg constructor.

Under the covers the library adds a particular package to the needed system property for the package/class lookup method and creates a subclass of the class you pass it, using the super convenient CGLib. This way it can leave the handler factory setting alone (so you can use it in Tomcat or others that register the URL handler factory), and you don't need to manually set system properties.

You can fetch it via Maven (search for current version) in two forms. The first has a normal dependency on cglib:

<dependency>
  <groupId>org.skife.url</groupId>
  <artifactId>url-scheme-registry</artifactId>
  <version>0.0.1</version>
</dependency>

The second puts cglib into a new namespace and bundles it, in case there are still cases of colliding cglib versions in the same namespace out there:

<dependency>
  <groupId>org.skife.url</groupId>
  <artifactId>url-scheme-registry</artifactId>
  <version>0.0.1</version>
  <classifier>nodep</classifier>
</dependency>

Have fun![...]
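For readers curious what the underlying factory mechanism looks like without any library, here is a minimal, self-contained sketch of the factory approach (the once-per-JVM limitation of `setURLStreamHandlerFactory` is exactly why it collides with Tomcat). The `dinner` scheme and its canned response are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.util.Scanner;

public class DinnerDemo
{
    // Registers a factory for the made-up "dinner" scheme. This call is
    // only legal once per JVM -- a second setURLStreamHandlerFactory()
    // throws an Error, which is the limitation discussed above.
    public static void install()
    {
        URL.setURLStreamHandlerFactory(protocol ->
            "dinner".equals(protocol) ? new URLStreamHandler()
            {
                @Override
                protected URLConnection openConnection(URL u)
                {
                    return new URLConnection(u)
                    {
                        @Override
                        public void connect() { }

                        @Override
                        public InputStream getInputStream()
                        {
                            // canned response, for illustration only
                            return new ByteArrayInputStream(("serving " + u.getHost()).getBytes());
                        }
                    };
                }
            } : null); // returning null falls back to the built-in handlers
    }

    public static String fetch(String url) throws Exception
    {
        try (InputStream in = new URL(url).openStream();
             Scanner scanner = new Scanner(in))
        {
            return scanner.nextLine();
        }
    }

    public static void main(String[] args) throws Exception
    {
        install();
        System.out.println(fetch("dinner://okonomiyaki")); // prints "serving okonomiyaki"
    }
}
```

The library described above avoids all of this by piggybacking on the package-naming mechanism instead, leaving the single factory slot free.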

Hello Pummel


I’ve been doing some capacity modeling at $work recently and found myself needing to do a “find the concurrency limit at which the 99th percentile stays below 100 milliseconds” type thing. So I wrote a tool, pummel, to do it for me.

$ pummel limit --labels ./urls.txt 
clients	tp99.0	mean	reqs/sec
1	2.00	1.03	967.59
2	2.00	1.04	1932.37
4	3.00	1.43	2799.16
8	16.00	3.64	2199.13
16	130.00	7.81	2049.02
8	17.00	3.54	2262.44
12	73.00	5.62	2135.61
16	129.00	7.58	2110.93
12	71.00	5.57	2155.99
14	117.97	6.58	2127.79
12	71.00	5.57	2155.99

By default it is looking for that “tp99 < 100ms” threshold, which in this case it found at 12 requests in flight at the same time.
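The search pattern visible in that output – double the client count until the threshold is blown, then bisect between the last good and first bad levels – can be sketched roughly like this. Here `measureTp99` is a stand-in for actually driving load at a given concurrency and reporting the 99th percentile latency; it is not part of pummel itself:

```java
import java.util.function.IntToDoubleFunction;

public class LimitSearch
{
    // Find the highest concurrency whose tp99 stays under the threshold:
    // phase 1 doubles concurrency until the threshold is exceeded,
    // phase 2 binary-searches between the last good and first bad levels.
    public static int findLimit(IntToDoubleFunction measureTp99, double thresholdMillis)
    {
        int good = 0, bad = 1;
        while (measureTp99.applyAsDouble(bad) < thresholdMillis) {
            good = bad;
            bad *= 2;
        }
        while (bad - good > 1) {
            int mid = (good + bad) / 2;
            if (measureTp99.applyAsDouble(mid) < thresholdMillis) {
                good = mid;
            } else {
                bad = mid;
            }
        }
        return good;
    }
}
```

With a synthetic latency curve like `c -> c * c` and a 100 ms threshold, this lands on 9 – the last level whose measured tp99 stays under the limit.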

Also really useful is the step command, which just increases concurrent load and reports on times:

$ pummel step --limit 20 --max 5000 ./urls.txt 
1	2.00	1.05	956.21
2	3.00	1.08	1854.26
3	4.00	1.24	2415.46
4	3.00	1.41	2836.48
5	6.00	2.05	2444.27
6	9.00	2.54	2358.31
7	11.00	2.92	2398.74
8	16.00	3.38	2364.49
9	23.99	3.99	2257.22
10	35.99	4.43	2258.05
11	54.99	5.10	2157.37
12	72.98	5.94	2020.61
13	87.97	5.94	2187.89
14	125.99	6.40	2187.64
15	125.00	6.85	2188.50
16	130.00	7.39	2163.68
17	134.00	7.86	2163.13
18	143.98	8.35	2156.93
19	156.00	8.92	2129.52

Assuming you put this output in data.csv, this also plots very nicely with gnuplot:

set terminal png size 640,480
set xlabel 'concurrency'
set ylabel 'millis'

set output 'tp99.png'
plot 'data.csv' using 2 with lines title 'tp99 response time'

set output 'mean.png'
plot 'data.csv' using 3 with lines title 'mean response time'

set output 'requests_per_second.png'
plot 'data.csv' using 4 with lines title 'requests/second'

Nothing super fancy, but it is kind of fun :-)

Reworking the Atlas CLI


The current CLI for Atlas was always a stopgap measure – something to let me invoke it on the command line during development. The time has come for a real user interface though. This post is really just me thinking through how this should work – it is rambly, but capturing the reasoning is helpful :-)

Project Centric

It has become apparent that Atlas is project-centric, much as git is. A common and useful pattern for project-centric tools is to put everything into a version-controllable directory, and to provide a nice way to kickstart a new project. So, we have the beginning of our interactions:

$ atlas init

This will populate the current directory with a barebones atlas project, and initialize atlas's state keeping. We'll encourage the separation of environment and system model, so let's drop two files: a system model and a development environment description. As Atlas exists right now, that would be something like system.rb and env-dev.rb.

System and Environment Model Definitions

These two things, the system model and environment description, are the heart of an atlas project, so we want them to be front and center when you look at a project. I'd like to drop them right in the root of the project, but this conflicts with another desire – that we be able to detect and load system models and environment descriptors for a project.

Once a system starts growing, but before it really has a number of teams working on it or using it, I expect it will be very common to drop all the parts of the system into one project via separate files, organized as makes sense for the system. Right now that is awkward: any files aside from the root must be explicitly included as external elements in the system model. This works, but in the case of the environment descriptor, when you have multiple environments, it means checking out the project multiple times and initializing each checkout for a different environment. Yuck.
To complicate it further, it is becoming apparent that in order to make systems composable from external models (i.e., if Basho were to provide a descriptor for a Riak system, or Puppet Labs for a Puppet one), it would be nice to just have to reference one URL for it. It is convenient to allow one file to describe both the environment and the system in this case. Additionally, it has become apparent that sometimes an environment needs to define system elements. In the EC2 case, for example, this might be creating a VPN instance so the machine running atlas can connect to instances inside a security group.

Given that we want to have arbitrary environment and system models for a project, we'd like them to take center stage in the root of the project, and finally we might want other convenience scripts in the root of the project. One option seems to be just changing their extension and globbing them together to load – i.e., system.atlas and dev-env.atlas. Given that atlas descriptors are Ruby files, this is inconvenient for automatic file type detection for code highlighting, which is unfortunate.

Another option is to give up on the root and drop in a model/ directory which contains the system and environment descriptors. Model is an overloaded term, due to the use of the word to describe database access code in Rails-inspired web frameworks; in this case, it is used to represent a model of the system and the environments in which the system will run. Using a model/ directory gives a slight affordance to new users in that when they open a descriptor, it is nicely highlighted. This is small, but frankly really handy. I don't have my .emacs on every machine I may need to muck around with models from a[...]

Java Daemonization with posix_spawn(2)


The traditional way of daemonizing a process involves forking the process and daemonizing it from the current running state. This doesn't work so well in Java because the JVM relies on several additional worker threads, and fork only keeps the thread calling fork. So, basically, you need to start a new JVM from scratch to create a child process.

The traditional way of launching a new program is to fork and exec the program you wish to start. Sadly, this also fails on Java because the calls to fork and exec are separate, non-atomic (in the platonic sense, not the JMM sense) operations. There is no guarantee that the exec will be reached, or that the memory state of the JVM will even be sound when the exec is reached, because you could be in the middle of a garbage collection and pointers could be all over. In practice, this would happen exceptionally rarely, at least. Charles Nutter has talked about this problem in JRuby as well.

The best method I know of to daemonize a Java process, then, is to use posix_spawn. This function launches a new process based on the program image passed to it – it is like fork and exec in one system call. Over the weekend I took some time to put together a small library to do this, Gressil, using jnr-ffi.

It turned out that the hardest part, by far, was reconstructing the full ARGV array. On Linux, with procfs, it is pretty straightforward. On OS X it is supposed to be possible, but all the code samples I have found seem to be based on that snippet, and they look forward into the "string area" which usually has ARGV laid out in the correct order, but sometimes doesn't. Trying to look backward from the pointer returned by the sysctl would be straightforward in C (decrement the pointer), but jnr-ffi seems to copy the memory space into a Java byte array, so there is no going backward.
Attempting to guess how far back to jump the pointer to get a new block of memory usually segfaults for me :-) If anyone has thoughts, a version based on the above reference is in Gressil, as MacARGVFinder.

Anyway, it turns out you can get most of this information from the Sun JVM. The only dodgy part of that is the program arguments (the stuff passed into Java's public static void main(String[] args) method). Luckily, those are passed to main, so we can just ask for them in the library. Given that, you can use Gressil to daemonize Java processes via re-launching like so:

package org.skife.gressil.examples;

import org.skife.gressil.Daemon;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Date;

import static org.skife.gressil.Daemon.remoteDebugOnPort;

public class ChattyDaemon
{
    public static void main(String[] args) throws IOException
    {
        new Daemon().withMainArgs(args)
                    .withPidFile(new File("/tmp/"))
                    .withStdout(new File("/tmp/chatty.out"))
                    .withExtraMainArgs("hello", "world,", "how are you?")
                    .withExtraJvmArgs(remoteDebugOnPort(5005))
                    .daemonize();

        while (!Thread.currentThread().isInterrupted()) {
            System.out.println(new Date() + " " + Arrays.toString(args));
            try {
                Thread.sleep(1000);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

In the parent process the call to Daemon#daemonize() will call System.exit(); in the child process it will return normally. The child process, in this case, [...]

Configuration Models


I have been noodling on the best general purpose application configuration model, and mechanism, I can find lately – that is, trying to find something general purpose that I don't think is miserably bad. Specifically, configuration of heterogeneous applications in a hosted/SaaS/distributed/blah type system. I've worked with all kinds of stuff – not everything under the sun, for sure, but quite a few models. For purposes of this discussion, I'll start with the one I have been working with most closely for the last six years – Ning's Galaxy Repository.

Galaxy Classic: The Discussion Baseline

Galaxy uses a hierarchical configuration repository. See the "Gepo" and "Config Path" parts of the readme for a description of how it works. The mechanism does work pretty well, and it's gotten us a long way, but it has some horrible warts that I would like to avoid. (Really, go read that page if you are not familiar with Galaxy and want any of the rest of this to make sense. It's worth it; Galaxy has been a Force for Good.)

The first wart is that the galaxy agent relies on the configuration repository to determine the binary to install. It takes the configuration path for the deployment (I told you, just go read that docu and come back) and constructs a URL for the binary. The second is that after finding the binary in the config repo, it then passes information about this config repo to the deployment bundle when it deploys it. The agent has a hard dependency on the config repo, and it leaks the dependency into the deployment bundle, which then uses its knowledge of the configuration repository to pull down the application configuration.

Proofpoint's Galaxy: Another Take

Some very good folks at Proofpoint, who all worked with (and in Martin's case, helped build) Ning's Galaxy, reimplemented it to fix a number of behaviors that were suitable at Ning, but not Proofpoint. One of the changes was to the configuration mechanism.
The deployment command receives a set of coordinates for the binary and a configuration. These coordinates are converted to two URLs by the time they reach the agent. The configuration URL, in the Proofpoint system, references a resource which is a set of (path, URL) pairs. When the agent deploys the binary bundle, it then pulls down each resource specified in the configuration resource, and puts it at the path specified for it in the deployment. In this model, the agent just receives URLs, though it understands the resources those URLs point to – the binary one because it is a galaxy package, and deploying those is what it does; the configuration one because it needs to pull down the configuration files.

The Proofpointers are considering reworking that configuration mechanism to be a URL to a configuration tarball which contains all of the config files needed, and which is expanded to a known location (probably /env/) inside the deployment. This would remove one layer of indirection, and makes the configuration a write-once artifact.

Sculptor: Yet Another Take

In parallel, I have been experimenting with another galaxy implementation, named Sculptor, in particular to play nicely with Atlas. Sculptor also uses (path, URL) pairs, but these pairs are specified as either part of the environment (so properties of the agent) and/or part of the deployment. You can watch a screencast which uses the environment configuration, but the deployment side is just in the noodling stages. In sculptor a deployment resource (submitted to the agent) would look something like: { "u[...]

POSIX from Java


I have been doing more traditionally unix-y stuff from Java lately, and one of the things I have needed is proper access to POSIX and libc system calls. Luckily, there are now a couple fabulous libraries to make this easy – no more need to do your own JNI muckery. I've been using jnr-posix with great success. Using it for something like execv(3) looks like:

POSIX posix = POSIXFactory.getPOSIX(new POSIXHandler()
{
    @Override
    public void error(Errno errno, String s) { }

    @Override
    public void unimplementedError(String s) { }

    @Override
    public void warn(WARNING_ID warning_id, String s, Object... objects) { }

    @Override
    public boolean isVerbose() { return false; }

    @Override
    public File getCurrentWorkingDirectory() { return new File("."); }

    @Override
    public String[] getEnv() { return new String[0]; }

    @Override
    public InputStream getInputStream() { return System.in; }

    @Override
    public PrintStream getOutputStream() { return System.out; }

    @Override
    public int getPID() { return 0; }

    @Override
    public PrintStream getErrorStream() { return System.err; }
}, true);

String[] args = new String[] { "/usr/bin/ssh", "" };
posix.execv("/usr/bin/ssh", args);

The bulk of that snippet is setting up the POSIXHandler, which provides nice callbacks for the things where it is not obvious what to do, or which might want to be overridden in a specific environment. The boolean flag at the end says to use the native POSIX implementation rather than emulating it in Java. The library will sniff your system and dynamically link the right things – it is very nice.

The library doesn't properly declare its dependencies in its pom, so if you want to use it you need to depend on each of:

<dependency>
  <groupId>com.github.jnr</groupId>
  <artifactId>jnr-posix</artifactId>
  <version>2.0</version>
</dependency>
<dependency>
  <groupId>com.github.jnr</groupId>
  <artifactId>jnr-ffi</artifactId>
  <version>0.6.0</version>
</dependency>
<dependency>
  <groupId>com.github.jnr</groupId>
  <artifactId>jnr-constants</artifactId>
  <version>0.8.2</version>
</dependency>

The jnr-posix pom lists jnr-constants and jnr-ffi as provided for some reason. Hopefully that will be remedied in $[...]

Some Atlas Thoughts


Back in June (that long ago, really? wow!) I first talked about Atlas and my, time has flown. While originally created to help address a very specific problem we were facing at work, the general utility of the approach in Atlas is pretty intriguing. I want to be able to describe a system, such as:

system "blog" do
  server "load-balancer:blog"
  server "wordpress", {
    base: "linux-server",
    cardinality: 2,
    install: ["wordpress?db=blog-db&caches=blog-cache", "lb-add:blog"]
  }
  server "caches", {
    base: "linux-server",
    cardinality: 2,
    install: ["memcached:blog-cache"]
  }
  server "database", base: "mysql:blog-db"
end

Which basically describes how to wire up the abstract servers in a system. This description is paired with an environment definition that explains what things like linux-server are and what exactly is done to install wordpress in a particular environment. Atlas will need some kind of virtual installer (which it doesn't have right now) to get the nice clean descriptor above, as the system described there actually looks like:

wp_url = ""
mcp_url = ""

system "blog" do
  server "load-balancer", { base: "load-balancer:blog?from=80&to=80" }
  server "wordpress", {
    cardinality: 2,
    base: "apache-server",
    install: ["tgz:#{wp_url}?to=/var/www/&skiproot=wordpress",
              "zip:#{mcp_url}?to=/var/www/wp-content/&skiproot=memcached",
              "exec: yes | sudo pecl install memcache",
              "exec: sudo echo '' >> /etc/php5/apache2/php.ini",
              "wait-for:wordpress-db",
              "erb: wp-config.php.erb > /var/www/wp-config.php",
              "exec: sudo service apache2 restart",
              "elb-add:blog"]
  }
  server "memcached", {
    cardinality: 2,
    base: "server",
    install: ["scratch:memcached=@",
              "apt:memcached",
              "file:memcached.conf > /etc/memcached.conf",
              "exec: sudo service memcached restart"]
  }
  server "database", { base: "mysql:blog", install: "scratch:wordpress-db=@" }
end

To get to the first system descriptor it will need some concept of a virtual installer, which would be defined in the environment descriptor, and might look like:

environment "ec2" do
  installer "wordpress", {
    virtual: ["tgz:#{wp_url}?to=/var/www/&skiproot=wordpress",
              "zip:#{mcp_url}?to=/var/www/wp-content/&skiproot=memcached",
              "exec: yes | sudo pecl install memcache",
              "exec: sudo echo '' >> /etc/php5/apache2/php.ini",
              "wait-for:{install[db]}",
              "erb: wp-config.php.erb > /var/www/wp-config.php",
              "exec: sudo service apache2 restart"]
  }
  # ... rest of the env descriptor
end

In a non-trivial system I would use Galaxy for deploying services, probably Dain's fork for something new. But for this example I think it[...]

In Clauses


The most common feature request I get for jDBI, going back at least seven years now, is for automagic in-clause expansion, or the equivalent. Sadly, there is no correct general case solution for this. There are lots of solutions, but which is the right thing to do is very context dependent. Let's look at some options.

Database Specific Functionality

The first option is to use database specific functionality to achieve this. The easiest of these I know of is with PostgreSQL, which has very nice SQL array support. We'll start with how we want to use the feature – ideally we could just bind a collection, but in this case we need to know the type of the things in the collection, and we should handle empty collections, so we'll do something like:

ImmutableSet<String> rs = h.createQuery("select name from something where id = any (:ids)")
                           .map(StringMapper.FIRST)
                           .bind("ids", arrayOf(Integer.class, 2, 3))
                           .list(ImmutableSet.class);

The bind is using a helper function to create an instance of SqlArray, which just captures the information being bound:

public class SqlArray<T>
{
    private final Object[] elements;
    private final Class<T> type;

    public SqlArray(Class<T> type, Iterable<T> elements)
    {
        this.elements = Iterables.toArray(elements, Object.class);
        this.type = type;
    }

    public static <T> SqlArray<T> arrayOf(Class<T> type, T... elements)
    {
        return new SqlArray<T>(type, asList(elements));
    }

    public static <T> SqlArray<T> arrayOf(Class<T> type, Iterable<T> elements)
    {
        return new SqlArray<T>(type, elements);
    }

    public Object[] getElements()
    {
        return elements;
    }

    public Class<T> getType()
    {
        return type;
    }
}

When binding a SqlArray it will wind up invoking the Handle#bind(String, Object) method. This is fine, as we will intercept the actual binding with an ArgumentFactory.
The one here is a toy, specialized solely for binding arrays of integers, but it should be straightforward to generalize:

public class PostgresIntegerArrayArgumentFactory implements ArgumentFactory<SqlArray<Integer>>
{
    public boolean accepts(Class<?> expectedType, Object value, StatementContext ctx)
    {
        return value instanceof SqlArray
               && ((SqlArray) value).getType().isAssignableFrom(Integer.class);
    }

    public Argument build(Class<?> expectedType, final SqlArray<Integer> value, StatementContext ctx)
    {
        return new Argument()
        {
            public void apply(int position, PreparedStatement statement, StatementContext ctx) throws SQLException
            {
                // in postgres no need to (and in fact cannot) free arrays
                Array ary = ctx.getConnection()
                               .createArrayOf("integer", value.getElements());
                statement.setArray(position, ary);
            }
        };
    }
}

We need to register our argument factory on the DBI, Handle, or SqlStatement for it to be used; we'll just put it on the handle:

Handle h = dbi.open();
h.registerArgumentFactory(new PostgresIntegerArrayArgumentFactory());
h.registerContainerFactory(new ImmutableSetContainerFactory());

While we were at it we registered a ContainerFactory that kn[...]

Library Versioning


Libraries should be versioned and packaged such that they are easy to use over time, and in combination. The best way I have found to do this is to abide by three rules: use APR versioning, re-namespace on major version changes, and change the artifact ID on major version changes.

Use APR style versioning

APR versioning basically defines the meanings of changes for versions like {major}.{minor}.{bugfix}. A bugfix release is both forwards and backwards compatible. It is a drop in, binary compatible, replacement for anything with the same {major}.{minor} numbers. Going from 2.28.0 to 2.28.1 would be a bugfix release. A minor release is backwards compatible but not forwards compatible. That is, a 2.29.7 version can be dropped in to replace any other 2.29.X, or earlier minor version numbers such as 2.27.4 or 2.1.0. It would not be a drop in replacement for 2.30.0, though. Typically minor releases add new functionality through additions to the API. A major release is not backwards compatible with anything lower – a 3.0.0 cannot be dropped in to replace a 2.30.7 – it has a different API. Nor can it replace a 4.2.89 release, which has a higher major version number. Version numbers are used to encode compatibility for the API.

Re-namespace on major version changes

When making a major version change, that is, a backwards incompatible change, always use a new namespace. In Java or C# use a new package name, in Ruby use a new module name, in C use a new function prefix, etc. Re-namespacing allows you to use both the old and new versions in the same process. This is particularly important for transitive dependencies.

To look at a concrete counter-example demonstrating the pain of not doing this, let's look at a personal mistake I made in jDBI. jDBI uses a StatementContext to make information available to extensions, such as custom statement rewriters. Between 2.15 and 2.16 I changed StatementContext from an abstract class to an interface, but did not change its API.
I believed this was a backwards compatible change because I thought the same bytecode was generated for method invocations against interfaces and abstract classes. I was wrong; different bytecode is generated. Heavy users of jDBI tend to create small libraries which bundle up their extensions, and then they rely on their small libraries. At Ning we call ours ning-jdbi. If I rely on ning-jdbi 1.3.2, which relies on jdbi 2.14, and I also rely on jdbi 2.28, then I am in trouble, as ning-jdbi will get runtime errors when trying to run against the more recent version of jDBI. I have to go cut a new version of ning-jdbi, which is now backwards incompatible, and the chain continues. By introducing an accidental backwards incompatible binary change I forced backwards incompatible changes on the whole dependency chain. Oops.

Change the artifact ID on major version changes

Using a separate namespace for backwards incompatible changes is not enough on its own in most circumstances. Yes, both versions can coexist in the same process, but most build and packaging tools cannot handle loading two libraries with the same name and different versions. As I don't want to write yet-another-dpkg or yet-another-build-tool merely to work around this issue, you save tons of grief by also changing the library identifier as far as build and packaging goes. Once again, transitive dependencies are the main driver here. If I depend on and a library I use depends on com.go[...]
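The APR drop-in rule described above is mechanical enough to encode. The following is a hypothetical helper, not part of any library mentioned here; it simply restates the rule that a candidate version can replace a required one when the majors match and the candidate's minor is at least as high (bugfix differences are always compatible):

```java
public class AprVersion
{
    final int major, minor, bugfix;

    AprVersion(String v)
    {
        // parse a {major}.{minor}.{bugfix} version string
        String[] parts = v.split("\\.");
        major = Integer.parseInt(parts[0]);
        minor = Integer.parseInt(parts[1]);
        bugfix = Integer.parseInt(parts[2]);
    }

    // Can `candidate` be dropped in where `required` is expected?
    static boolean dropInReplacement(String candidate, String required)
    {
        AprVersion c = new AprVersion(candidate);
        AprVersion r = new AprVersion(required);
        return c.major == r.major && c.minor >= r.minor;
    }
}
```

Running the post's own examples through it: 2.28.1 replaces 2.28.0, 2.29.7 replaces 2.27.4, but 2.29.7 does not replace 2.30.0 and 3.0.0 replaces neither 2.30.7 nor 4.2.89.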

Maybe in Java


One of the more elegant concepts in a lot of functional languages is that of Maybe. Haskell drove home the magic of Maybe for me. The general idea is representing the possibility of a value. In Haskell, you'll generally use it such that processing will continue if the value is needed and present, or short circuit if the value is needed and not present. The end result of a computation which includes a Maybe will also be a Maybe, but one representing the result of the computation. In expression-oriented languages like Haskell, that works out very nicely, but I spend most of my time working in Java, which is decidedly statement-oriented. Concepts will sometimes move nicely between worlds, sometimes not. When I started working on Atlas recently I decided to see how well Maybe ported.

I started with Nat Pryce's maybe-java and took off from there. Nat's class encourages a model analogous to Haskell's, executing within the context of a Maybe instance by passing functions into the Maybe instance, for example:

Maybe<String> name = Maybe.definitely("Brian");
Maybe<String> message = name.to(new Function<String, String>()
{
    public String apply(String s)
    {
        return "hello, " + s;
    }
});
System.out.println(message.otherwise("No one here!"));

This treats the sequence of Maybe instances as an evaluation context or control flow, which works nicely in some languages, but sadly, as with most attempts to do higher-order functions in Java, it got awkward rather quickly. Part of it is purely syntactic – the syntax isn't optimized for it – but part of it is semantic as well. Idiomatic Java uses exceptions for non-happy-path control flow, and most of the libraries which provide the main reason for using Java behave this way. Given that, I switched from using Maybe to control evaluation to using Maybe purely to represent the possibility of a value, and things fell into place very nicely – even playing within Java's exception idioms.
Take for example this snippet:

SSHCredentials creds = space.lookup(credentialName, SSHCredentials.class)
    .otherwise(space.lookup(SSHCredentials.DEFAULT, SSHCredentials.class))
    .otherwise(new IllegalStateException("unable to locate any ssh credentials"));

In this case there may exist named credentials; if not, there may exist some default credentials; and if there is neither, the world explodes. In the typical case you would see either a test for existence and then use, or a fetch and a check for null. Both of those are, to my mind, less clear and certainly more error prone (largely in needing to remember to check everywhere, particularly in the case of a this-or-that situation, etc.). Other bits of using Maybe extensively are not completely clear, but I am pretty confident that I will be using some evolution of this flavor of Maybe in most of my Java-based code going forward.[...]
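For the curious, the value-flavored Maybe sketched above can be boiled down to something like the following. This is a from-scratch illustration, not the actual class from Atlas or maybe-java; the exception-taking otherwise is the piece that plays nicely with Java's exception idioms:

```java
// A stripped-down Maybe: it only represents the possibility of a value.
// otherwise(T) supplies a default, otherwise(Maybe) chains a fallback,
// and otherwise(RuntimeException) throws when no value is present.
public abstract class Maybe<T>
{
    public abstract T otherwise(T defaultValue);

    public abstract Maybe<T> otherwise(Maybe<T> fallback);

    public abstract T otherwise(RuntimeException e);

    public static <T> Maybe<T> definitely(final T value)
    {
        return new Maybe<T>()
        {
            @Override
            public T otherwise(T defaultValue) { return value; }

            @Override
            public Maybe<T> otherwise(Maybe<T> fallback) { return this; }

            @Override
            public T otherwise(RuntimeException e) { return value; }
        };
    }

    public static <T> Maybe<T> unknown()
    {
        return new Maybe<T>()
        {
            @Override
            public T otherwise(T defaultValue) { return defaultValue; }

            @Override
            public Maybe<T> otherwise(Maybe<T> fallback) { return fallback; }

            @Override
            public T otherwise(RuntimeException e) { throw e; }
        };
    }
}
```

With this shape, the this-or-that-or-explode chain from the credentials example falls out directly: chain lookups with the Maybe-taking otherwise, and end with either a default value or an exception.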

Using s3 URLs with Ruby's open-uri


Ruby’s open-uri is a wonderful hack, and I recently got to figure out how to plug in additional URL schemes. Here is a quick and dirty way to allow URLs of the form s3://<bucket>/<key>:

require 'aws/s3'

module URI

  class S3 < Generic
    def initialize(*args)
      super
      @bucket, @file = args[2], args[5][1, args[5].length]
    end

    def open(&block)
      http_url = AWS::S3::S3Object.url_for @file, @bucket
      URI.parse(http_url).open(&block)
    end
  end

  @@schemes['S3'] = S3
end

It uses the AWS::S3 library, but could be adapted pretty easily to the AWS SDK for Ruby. It does require the normal initialization, but then it just works :-)

open("s3://skife/whiteboard.jpg") do |f|
  # do stuff with the contents...
end

Fundamental Components in a Distributed System


In the last several weeks I have had a surprising number of conversations about the fundamental building blocks of a large web-based system. I thought I'd write up the main bits of a good way to do it. This is far from the only way, but most reasonably large systems will wind up with most of this stuff. We'll start at the base and work our way up.

Operational Platform

At the very base of the system you need to have networking gear, servers, the means to put operating systems onto the servers, bring them up to a baseline configuration, and monitor their operational status (disk, memory, cpu, etc). There are lots of good tools here. Getting the initial bits onto disk will usually be determined by the operating system you are using, but after that Chef or Puppet should become your friend. You'll use these to know what is out there and bring servers up to a baseline. I personally believe that Chef or Puppet should be used to handle things like accounts, DNS, and stable things common to a small number of classes of server (app server, database server, etc). The operational platform provides the raw material on which the system will run, and the tools here are chosen to manage that raw material. This is different than application management.

Deployment

The first part of application management is a means of getting application components onto servers and controlling them. I generally prefer deploying complete, singular packages which bundle up all their volatile dependencies. Tools like Galaxy and Cast do this nicely. Think hard about how development, debugging, and general work with these things will go, as being pleasant to work with during dev, test, and downtime will trump idealism in production.

Configuration

Your configuration system is going to be intimately tied to your deployment system, so think about these things together. Aside from separating the types of configuration you want, there are a lot of tradeoffs.
In general, I like immutable configuration obtained at startup or deployment time. A new set of configs means a restart. In this case, you can either have the deployment system provide it to the application, or have the application itself fetch it. Some folks really like dynamic configuration; in that case Zookeeper is going to be your friend. Most things don’t reload config well without a restart though, and I like having a local copy of the config, so… YMMV.

Application Monitoring

Application-level monitoring and operational-level monitoring are very similar, and can frequently be combined in one tool, but are conceptually quite different. For one thing, operational monitoring is usually available out of the box, from good old Nagios to newer tools like ’noit. Generally you will want to track the same kinds of things, but how you get them, and what they mean, will vary by the application. Monitoring is a huge topic, go google it :-)

Discovery

Assuming you have somewhere to deploy, and the ability to deploy, your services need to be able to find each other. I prefer dynamic, logical service discovery, where services register their availability and connection information (host and port, base url, etc) and then everything finds each other via the discovery service. A lot of folks use Zookeeper for this nowadays, and most everyone I know who has used it loves it. One of the best architecty type guys I know would probably have its[...]
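The register-and-lookup contract can be sketched as a toy in-memory registry. This is a single-process illustration only; a real deployment would put this behind something like Zookeeper, and all class and method names here are invented:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-memory service registry illustrating the discovery contract.
public class Discovery {
    // logical service name -> announced endpoints ("host:port", base url, etc.)
    private final Map<String, List<String>> services = new ConcurrentHashMap<>();

    // a service announces its availability and connection info at startup
    public void register(String service, String endpoint) {
        services.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>())
                .add(endpoint);
    }

    // a service going away withdraws its announcement
    public void unregister(String service, String endpoint) {
        List<String> endpoints = services.get(service);
        if (endpoints != null) {
            endpoints.remove(endpoint);
        }
    }

    // consumers find peers by logical name, never by hardcoded host
    public List<String> lookup(String service) {
        return services.getOrDefault(service, Collections.emptyList());
    }
}
```

Consumers ask by logical name rather than baking hosts into config, which is what lets you move things around without redeploying everyone.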

Yet More of the Long Tail Treasure Trove


Another edition of the long tail treasure trove, in blog form.


JDBC driver for sqlite which embeds the mac, linux, and windows binaries for sqlite. It will load the C library on demand, and you just go about your merry way. I love SQLite, I work in Java frequently. Win!


MySQL server in a jar. Seriously, just embed MySQL in your Java stuff. Magical for testing, etc.


Exactly what the name says, concurrent linked hash map. Martin and I really wanted this back in the day. Now Ben Manes (sorry, don’t have a good link for him) wrote a really good implementation.

Greplin’s Bloom Filter Library

Nice bloom filter library for Java.


Young project, but very easy to use library for SSH in Java.


Better readline for Java


High perf Java reflection via bytecode gen.


Not-sucky YAML in Java


Crazy Bob’s one-class on disk FIFO queue.


Ah, finally something non-Java! Mail is a very pleasant email library for ruby.


The C Minimal Perfect Hashing Library. Perfect hashes are fun. This finds them for you.

Diff Match and Patch

Diff, fuzzy matching, and patching in C++, C#, Java, Javascript, lua, Objective-C, and Python.

Making Really Executable Jars


One of the more annoying things about writing command line applications in Java is that the Java model of an executable is the so-called executable jar, which is executed via an incantation like

$ java -jar ./waffles-1.2.3.jar --some-flag=blue hello 

There has long been a hack known in some circles, but not widely known, to make jars really executable, in the chmod +x sense. The hack takes advantage of the fact that jar files are zip files, and zip files allow arbitrary cruft to be prepended to the zip file itself (this is how self-extracting zip files work).
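That zip property is easy to check from Java, whose ZipFile locates the central directory from the end of the file. This sketch (entry and file names invented for the demo) builds a tiny zip, prepends launcher-script cruft, and shows the archive still opens:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class PrependDemo {
    // builds a tiny zip, prepends shell-script cruft, and checks the zip still reads
    public static boolean demo() throws Exception {
        // build a tiny zip in memory
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipBytes)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hi".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }

        // write it to disk with launcher-script cruft prepended, like a really executable jar
        File f = File.createTempFile("demo", ".jar");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("exec java -jar $0 \"$@\"\n\n\n".getBytes(StandardCharsets.UTF_8));
            out.write(zipBytes.toByteArray());
        }

        // ZipFile finds the central directory from the end of the file,
        // so the prepended bytes do not bother it at all
        try (ZipFile zf = new ZipFile(f)) {
            return zf.getEntry("hello.txt") != null;
        }
    }
}
```

java -jar on the concatenated file works for exactly the same reason.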

To do this for jar files, on unix-like operating systems, create a little shell script which looks like:


exec java -jar $0 "$@"

You can make it fancier, doing things like looking for JAVA_HOME and so on, but the above is enough to get started. Make sure to add a few newlines at the end, they are very important. If you leave them out it will not work.

Now that you have your little shell script, cat the executable jar you want onto the end of it, set the script +x, and go to town. If your script is named waffles, then you would do that like:

$ cat ./waffles-1.2.3.jar >> ./waffles
$ chmod +x ./waffles
$ ./waffles --some-flag=blue hello

and there you go! I have a little maven plugin that will do this for you automagically, but haven’t had a chance to get it into central yet. I guess I should probably stop writing and go do so…


  • David Phillips suggested putting the $@ in parens as it can contain spaces. I have updated the post to do so.

  • Sven Schober pointed out a bug in the original form of the shell script I posted. I forgot the extremely important $0. That is what I get for writing from memory and not unit testing my blog posts! The post has been fixed.

  • Jeffrey McManus found a typo, I had chomd instead of chmod. Fixed, thank you! I really need to find a way to unit test blog posts!

Emacs Client on Mac OS X


I have been using J Aaron Farr’s approach to an emacsclient app on the Mac for a while, but it has always bugged me that it behaved differently than other apps. It would open a frame any time the app was launched (say via quicksilver or its ilk).

I finally took some time to munge the applescript to get the normal behavior, that is open a frame if there is no open frame, otherwise bring emacs to the front. Along the way I added “and start the emacs server if it is not already running.”

tell application "Terminal"
	try
		-- we look for <= 2 because Emacs --daemon seems to always have an entry in visible-frame-list even if there isn't
		set frameVisible to do shell script "/Applications/ -e '(<= 2 (length (visible-frame-list)))'"
		if frameVisible is not "t" then
			-- there is not a visible frame, launch one
			do shell script "/Applications/ -c -n"
		end if
	on error
		-- daemon is not running, start the daemon and open a frame
		do shell script "/Applications/ --daemon"
		do shell script "/Applications/ -c -n"
	end try
end tell

-- bring the visible frame to the front
tell application "Emacs" to activate

The script assumes that is version 23 or higher and is installed under /Applications – if it isn’t you will need to modify it accordingly. Assuming it is, you can use this built version, otherwise grab the source from github and adjust strings accordingly.

jDBI 2.12 and the SQL Object API


The latest release of jDBI, 2.12, includes a new set of APIs I have been mulling over for a couple of years now, ever since JDBC 4.0 dropped the “ease of development” features. The sql object API lets you define annotated interfaces which generate all the needed rigamarole for you. Take, for example:

interface TheBasics
{
    @SqlUpdate("insert into something (id, name) values (:id, :name)")
    int insert(@BindBean Something something);

    @SqlQuery("select id, name from something where id = :id")
    Something findById(@Bind("id") long id);
}

This snippet defines two methods, and annotates them with the SQL and how to bind the arguments into the generated prepared statements. Using them is equally easy:

DBI dbi = new DBI(dataSource);
dbi.registerMapper(new SomethingMapper());

TheBasics dao = dbi.onDemand(TheBasics.class);

dao.insert(new Something(7, "Martin"));
Something martin = dao.findById(7);

In this case, we open an on-demand sql object. On demand means that it will obtain and release connections as needed, generally immediately before and after each method call. We then just call methods and we get database interactions. We used a registered result set mapper here as well. This is also a new feature in 2.12, available both in the fluent api and in the sql object api. Basically it just lets you register result set mappers which will be used to transform each row of the result set into some object, one to one. You can add an explicit mapper as well, via the @Mapper annotation, or by defining your own mapping annotation. To get access to additional functionality, such as transactions or access to the underlying handle, the sql object API has the idea of mixin interfaces.
These are interfaces defined as part of the library which will be implemented on the sql object. For instance, to use transactions with a sql object you would define your object as:

interface UsesTransactions extends Transactional
{
    @SqlUpdate("insert into something (id, name) values (:id, :name)")
    void insert(@BindBean Something something);

    @SqlUpdate("update something set name = :name where id = :id")
    int update(@BindBean Something s);

    @SqlQuery("select id, name from something where id = :it")
    Something findById(@Bind int id);
}

The Transactional interface defines begin(), commit(), rollback(), and checkpoint related friends, as well as a callback receiver which wraps the callback in a transaction:

public void testExerciseTransactional() throws Exception
{
    UsesTransactions one = dbi.onDemand(UsesTransactions.class);
    UsesTransactions two = dbi.onDemand(UsesTransactions.class);

    one.insert(new Something(8, "Mike"));
    one.begin();
    one.update(new Something(8, "Michael"));

    assertEquals("Mike", two.findById(8).getName());
    one.commit();
    assertEquals("Michael", two.findById(8).getName());
}

public void testExerciseTransactionalWithCallback() throws Exception
{
    UsesTransactions dao = dbi.onDemand(UsesTransactions.class);
    dao.insert(new Something(8, "Mike"));

    int rows_updated = dao.inTransaction(new Transaction() {
        public Integer inTransaction(UsesTransaction[...]



Gnuplot is a handy-dandy tool for drawing graphs, which not enough people know about. One of its nicest features is that it is fast for large data sets. Take, for example, this plot of response times for calls to an external service, which I scrubbed out of some logs a while back:


It was generated in gnuplot from a six hundred thousand or so line, tab-delimited file of times and durations, which looked like:

2010-07-20T01:10:05	368
2010-07-20T01:10:24	368
2010-07-20T01:10:40	332
2010-07-20T01:10:58	328
2010-07-20T01:11:15	518
2010-07-20T01:12:02	131
2010-07-20T01:12:02	167
2010-07-20T01:12:02	445
2010-07-20T01:12:09	105
2010-07-20T01:12:09	274

I like to work in interactive mode, so this was rendered via the following commands, in the interactive prompt:

gnuplot> set xdata time
gnuplot> set timefmt '%Y-%m-%dT%H:%M:%S'
gnuplot> set format x '%R'
gnuplot> set terminal png size 640,480
gnuplot> set output '/tmp/graph.png'
gnuplot> plot 'zps.tsv' using 1:2 with dots

To summarize, line by line: we tell it that the X axis is time, give it the time format to read, set the display format for the X axis to be just times, specify png output and size, specify the output file, and then tell it to plot.

Gnuplot has lots more that it can do, one easy and common one is to switch to log scale. Given our previous commands, we can re-render with the Y axis in log scale via:

gnuplot> set logscale y
gnuplot> set output '/tmp/graph-logscale.png'
gnuplot> plot 'zps.tsv' using 1:2 with dots

This gives us something where it is a bit easier to see the values outside the outliers around the downtime near the end.


It can do some analysis as well, though not nearly what R can do, really only what relates to plotting graphs. It can do other fun things too, like three dimensional plots, heat maps, etc. Check out the documentation and have fun!

Deployment Systems - Packaging


Picking back up on the earlier discussion of deployment in non-trivial systems, I’d like to suggest another useful pattern of behavior I’ve observed.

Deploy Complete, Singular Packages

Package up your service and its dependencies into a single file, then always test and deploy that package. When an artifact moves into test, be it automated or exploratory, the binary package is what is tested, not a tag, or branch, or so on. Tags can be used to tell the automated build infrastructure to produce a candidate, or can be used by the automated build infrastructure to mark what was used to build a package, but they are used by developers and the build process, not by the deployment process.

At Ning we use a tarball containing the service, needed libraries, any containing server it runs in (such as Apache or Jetty), and the service’s post-deploy and control (rc) scripts. The project build produces this, interestingly, via a maven plugin, which we also use to build Apache/PHP based components at this point! Our deployment system, galaxy, defines the contract for this package. Aside from personal experience, my understanding is that Google statically compiles everything into a monolithic binary for most of their services (go C++). Similarly, my understanding is that Apple deploys cpio bundles.

My rule of thumb for what needs to go into the bundle is that if it is configured for your service, you rely on a specific version, or you rely on something that is evolving rapidly (cough node cough redis cough) it goes in the bundle. This is why we package up even very stable things like Apache into our tarballs. Operational automation and configuration management (chef, etc) can handle all of the other dependencies.

Deployment Systems - Configuration


When you are putting together deployment automation for a non-trivial system, or a system you expect to become non-trivial, there seem to be some common patterns to the better ones I have seen or learned about, so I’ll try to brain-dump what I have observed. This first bit is about configuration.

Separate Business, Environment, and Override Configuration

This is one of the most important as your system matures, but also one which I have often done pretty poorly. The fundamental idea is that any configuration setting which varies based on the environment the system is running in should be managed separately from configuration which only varies by the released version. For instance, if your system knows the domain it runs on, then the domain name is generally a configuration property, and varies by environment. In development it may be “” as you run everything on your local machine, “test1.example.internal” in the first test environment, and “” in production. Thread pool sizes, on the other hand, don’t usually vary by environment and are generally changed only when a new version is being pushed which needs to fiddle them. This configuration is the same between all environments, so is considered business configuration (or application configuration, whatever, pick the qualifier you prefer). To the deployed application the configuration space may be unified, but when you set up how you manage your configuration, keep separate config trees for environmental and business configuration, and do your darndest to minimize your environmental configuration. This drastically reduces the risk of “oh, farp, we forgot to change this configuration property between test and production” type incidents.
It drastically reduces the “hey, Joe Dev, what is this new example.wiffle.age property supposed to be?” when Sam Hack pulls Joe’s changes, as Joe just added it to the business config which is shared (yes, he should add docs as well, but we all know that even if he did add docs, which is unlikely, Sam is going to ask before searching the wiki for them).

Finally, provide a means to override or overlay the business configuration in a particular environment or deployment. This is pretty much needed when you want to flip your recurring job to a one minute interval instead of a one day interval in order to track down some nasty bug in it. Any overlaid property needs to be considered temporary – it should never survive the next deployment. If it needs to last past the next deployment, it is no longer an override, it is the configuration value, and should be in the normal business configuration.

Implementing this can be as simple as maintaining three sets of properties files which are combined at deployment time. The first is bundled with the application at build time, and contains the business properties. The second is part of the environment, and is provided at deployment time. The last is the overrides, which is also provided at deployment time, and contains values to override the environment and business properties. You can get much more arcane, but really, this is all you need most of the time.[...]
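A minimal sketch of that three-layer combine using java.util.Properties, assuming later layers win; the class and key names here are invented:

```java
import java.util.Properties;

public class ConfigStack {
    // business config is the base; environment and overrides are layered on top,
    // so a key set in a later layer wins over the same key in an earlier one
    public static Properties merge(Properties business, Properties environment, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(business);     // bundled with the app at build time
        merged.putAll(environment);  // provided at deployment time
        merged.putAll(overrides);    // temporary, must not survive the next deploy
        return merged;
    }
}
```

The deploy step runs the merge and hands the application one unified config, which keeps the separation an operational concern rather than an application one.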

The TCK Trap


You want to fork the OpenJDK. You look at the license, see that it is GPLv2, say “woot!” and start hacking. You add the important optimization to your fork which you need, and now want to release it.

If you don’t care about calling it Java, you can, under copyright law and the GPLv2, just cut the release, publish it, and go about your business. The catch is that there are tons of patents all over the JVM, and the GPLv2 does not include any patent protections. So, while you are clear from a copyright point of view, anyone that has contributed intellectual property to the JVM/JDK, ever, is free to sue both you and anyone using your distribution for infringing any patents they hold on their contributions. Aside from breaking the law, getting yourself, and your users, sued is not generally a good thing, so we look at option number two, passing the TCK.

Passing the TCK, which is a suite of tests used to verify correct implementation, grants you patent rights to everything folks have contributed to the Java spec over the years. So, you apply, and Oracle will give you the TCK at no charge, though it will be under NDA, and the license will tell you that the TCK cannot be used to verify that your implementation is okay on embedded devices, such as mobile phones, kiosks, or cash registers.

You smile, nod, and run the TCK, tweak some stuff in your code and behold, it passes! Now you have a problem. You can make a release which includes the patent protections (you passed the TCK) for some usages, but not for other usages (say, in a kiosk or cash register, or mobile phone). If you say “you may not use this in those cases” you are violating the GPLv2, which does not allow you to put those kinds of restrictions on your release. If you don’t put those restrictions in place you are violating patent law and open yourself, and your users, up to patent infringement lawsuits as the TCK license you were granted specifically excludes certification on them.

So, you have a choice when you publish your release, you can violate copyright law and not abide by the GPLv2, or you can violate patent law and not have rights to patents you knowingly infringe.

Welcome to Java.

My Favorite Interview Question


As Ning is ramping up recruiting again, I need to brush off my interrogation techniques interview questions. Sadly, one of my favorites is no longer so useful, as originally designed, due to technical advances in hard drives. I figured I’d share and discuss how I use it. Hopefully folks can give me some feedback on how to better find out what I am looking for. The question goes like this:

Given a hard drive with one terabyte of data, arranged in 2^32 key/value pairs, where the keys and values each have lengths of 128 bytes, you need to design, build, and deploy (by yourself) a system that lets you look up the value for a given key, over the internet, at a peak rate of 5000 lookups per second. The data never changes. Let’s design that system.

The question was designed (4 or 5 years ago now) to just barely require building a distributed system. With the widespread understanding and availability of solid state drives, it is fairly trivial to do on a single box now. There is additional information available if the candidate asks for it: things like requiring responses in 100 millis at the 90th percentile; that the budget is, “well, we don’t know how much it is going to be worth until we see it in use for a while, so try to do it cheap, if it is too much we’ll just not bother building it”; and that we have a datacenter and a switched network we can put it on, but no pre-specified servers. We want 99.9% availability, measured on a monthly basis, but are not offering an SLA to consumers. The keys are not distributed evenly within the keyspace. Requests are distributed evenly (and randomly) across the keys (I do this to make the problem easier). Etc.

For most good candidates, designing such a system is very straightforward. It is an interesting design exercise for junior folks, or folks coming from different areas of programming (desktop apps, embedded, etc), but for most candidates it should not be difficult.
I’m looking for quite a few things as we go through the question. The first is their opinion of “a terabyte of data” and “5000 lookups per second.” Do they consider this to be a lot of data, or a fairly boring amount, and the same for the lookups per second? Leaning either way isn’t a failure, it is just information gathering for me, to reference against how you represented yourself in your resume, cover letter, and phone screen.

I’m looking to see what additional information the candidate wants. Again, this is mostly to try to understand the candidate, and I don’t expect any barrage of questions out of the gate – they usually dribble in at forks in the design discussion.

I expect the candidate to design something that will work. Gimmick answers (put it in S3 and put up a web server that 301’s over to S3, etc) are valid, and you get some points for reasonable ones, but you still have to design an in-house version. I expect the solution to be within reasonable bounds for hardware, etc. Most folks do some kind of hashing scheme on the key in a front end server, and fan out to some database (or database-like) servers behind that. This is [...]
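The front-end hashing scheme most folks reach for amounts to a few lines: hash the key deterministically, take it modulo the number of backends. This is a hypothetical sketch, not anything from an actual interview answer:

```java
public class Shards {
    // map a 128-byte key to one of numShards backend servers by hashing;
    // every front end computes the same mapping, so no lookup table is needed
    public static int shardFor(byte[] key, int numShards) {
        int h = 0;
        for (byte b : key) {
            h = 31 * h + (b & 0xff);  // simple deterministic polynomial hash
        }
        return Math.floorMod(h, numShards);
    }
}
```

Hashing the key evens out the skewed keyspace across the backends, which is why the uneven key distribution in the problem statement mostly stops mattering.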

Maven GPG Plugin Fixage


If you are trying to use maven’s gpg plugin, and maven just hangs when it gets to the signing part, there is a workaround.

The easiest way is to add some configuration to the build plugins section of your pom:


You can see this in JDBI’s pom.xml. With that, it should properly ask you for your gpg passphrase.

Library Dump


So, dumping a bunch of libraries and library-like-things I have been meaning to write about for months…

At PhillyETE during my talk the JRugged library was recommended to me. It provides a decent looking, literal implementation of Nygard’s Circuit Breaker. I haven’t used it, but from their example it looks like:

public class Service implements Monitorable {
    private CircuitBreaker cb = new CircuitBreaker();

    public String doSomething(final Object arg) throws Exception {
        return cb.invoke(new Callable() {
            public String call() {
                // make the call ...
            }
        });
    }

    public Status getStatus() {
        return cb.getStatus();
    }
}

The next is from my friend Bob, who wanted a more pluggable (and HAML friendly) static site generator, so made Awestruct, which looks so very nice, indeed! Adam keeps telling me about all the fun hackery he has been doing with libusb. Sadly, I have not done any myself… YET! Not a library, but awesome, is Gephi for exploring your big dot graphs. Dain put me onto David Blevin’s xbean-finder for doing all the newfangled classpath scanning in Java. It works very well :-) Don’t remember how I found Mail, but it is derned nice for all your ruby based email needs (except IMAP, sadly). Okay, maybe not all your ruby email needs; having watched Martin slog through figuring these out a couple years ago, just use MMS2R to parse the random crap various carriers call MMS. Just in case you missed it, google-diff-match-patch for all your C++, C#, Java, Javascript, Lua, and Python diff related needs. Finally, Chris’s repl is not really a library either, but is shockingly useful.[...]



Bulkheads are used in ships to create separate watertight compartments which serve to limit the effect of a failure – ideally preventing the ship from sinking. The bold vertical lines in Samuel Halpern’s diagram illustrate them. If water breaks through the hull in one compartment, the bulkheads prevent it from flowing into other compartments, limiting the scope of the failure.

This same concept is useful in the architecture of large systems for the same reason – limiting the scope of failure. Consider a very simple system, say something that easily partitions by user, like a wish list of some kind. We can put bulkheads between sets of app servers talking to distinct databases, so that a given app server only talks to the database in its partition. Given this setup, if a single app server goes berserk and starts lashing out with a TCP hatchet at everything it talks to, no matter how angry it gets it only takes out a vertical slice of the system; the rest goes about business happily.

If we take a slightly fancier system (ie, slightly more realistic) we can see we develop (mostly) identical vertical slices. On a ship we’d call the groups compartments, but we’ll call them clusters, because each vertical bunch of stuff forms a logical unit which can be thought of as one thing (say, a cluster!). In this setup, if one of the caches started blackholing requests, the damage done (hopefully just a small latency bump up to a reasonable timeout) would stop at the bulkheads around the cluster. Yea!
If we look closely at the slightly fancier system, we note that a cluster consists of:

  • 3 App Servers
  • 2 Caches
  • 2 Log Servers
  • 4 Somethings
  • 1 Database

Typically, you can use clusters as units by which to add capacity. The exact contents of the cluster will be determined by finding the limiting element (usually the one which needs to maintain lots of state) on the most constrained axis of scale embodied in the cluster, and sizing out the rest of the elements based on their capacity relative to the limiting element. Add to this enough capacity to handle spikes, provide acceptable redundancy, and voila, you have a cluster. In theory.

In practice, some things simply do not work well with hard boundaries like this. In this example, note that the load balancers are not part of a cluster, but span clusters – they need to, as they are responsible for determining which cluster can handle a given request! It gets worse: notice that we have two log servers per cluster. Given a reasonable number of clusters, say 25, that amounts to 50 log servers. A single log server (in this case) is capable of servicing about 1000 app servers, but logs are really important, so we need to run them redundantly, hence two per cluster. Given 25 clusters and three app servers per cluster, a single log server has plenty of capacity, yet we have 50 for fault isolation in this setup. The accountants are not happy. Another variant on the inefficiency problem are the Somethings. Somethings utilization is very bursty. Under average cond[...]
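The sizing rule of thumb reduces to ceiling division against the limiting element, with a floor for redundancy. A hypothetical sketch (the capacities in the test are invented):

```java
public class ClusterMath {
    // how many instances of an element each cluster needs, given the demand the
    // cluster puts on it and one instance's capacity, rounding up, with a
    // minimum count for redundancy
    public static int perCluster(int clusterDemand, int elementCapacity, int minForRedundancy) {
        int needed = (clusterDemand + elementCapacity - 1) / elementCapacity; // ceiling division
        return Math.max(needed, minForRedundancy);
    }
}
```

For the log servers above, capacity says one per cluster is far more than enough (3 app servers against a roughly 1000 app server capacity), but the redundancy floor of two wins, and across 25 clusters that is the accountant-annoying 50.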

Treasure Trove: jmxutils


Gianugo and I used to do a talk at JavaOne called “The Long Tail Treasure Trove.” The goal of the talk was to introduce at least thirty or so small, open source, useful libraries which the majority of attendees had never heard of. They were great talks. We haven’t done one in a while, so at Henri’s prompting (very belatedly), I’m just going to start blogging them!

So, we’ll start with a great one – Martin’s JMX exporting library, jmxutils. Writing JMX beans tends to be agonizing, so unless someone is holding a blowtorch to your toes, you avoid it. Well, it doesn’t need to be. Examine this:

public class Something {
    private volatile String color;

    @Managed(description="Favorite Color")
    public String getColor() {
        return color;
    }

    @Managed
    public void setColor(String color) {
        this.color = color;
    }

    @Managed
    public void pickOne(String first, String second) {
        this.color = second;
    }
}

MBeanExporter exporter = new MBeanExporter(ManagementFactory.getPlatformMBeanServer());
Something s = new Something();
exporter.export("org.skife.example:name=Something", s);

This little tidbit exports a JMX bean, building a model mbean by inspecting the annotations. This particular one exports a mutable property (color) and an operation (pickOne). Nicely, it uses Paranamer (subject for another post) to even get the parameter names on the operation right! Now, to get it into maven central…[...]

Emacs 23.1, for the Designers



Embedding Clojure


There is some information out there on embedding Clojure in Java, but it isn’t the easiest to find, and the examples don’t tend to come with explanations, so… here is yet another! Let’s take a silly example and say we want to embed clojure as a validation language on something, so that it looks something like this:

public class Thing {
    private int num = 0;

    @Validate("(> num 0)")
    public void setNum(@Name("num") Integer num) {
        this.num = num;
    }

    @Validate("(< first second)")
    public void setInOrder(@Name("first") Integer first, @Name("second") Integer second) {
        this.num = first + second;
    }
}

We want the validation function, expressed in the @Validate annotation, to be invoked on every call to the method, binding the appropriate parameters to their @Name, etc. That is, for the second one, we want to ensure that first is less than second, and so forth. The validation will be called on every invocation of the validated method, so we need it to be really fast. While fairly contrived, and rather absurd, it makes a nice example :-)

What we’d like to do is hold a reference to an otherwise anonymous clojure function (we don’t want to pollute the global namespace) and invoke it on every method call with some kind of method interceptor. We can create the Clojure function reference with something like:

public IFn define(String func) throws Exception {
    String formish = String.format("(fn [val] (true? %s))", func);
    return (IFn) clojure.lang.Compiler.load(new StringReader(formish));
}

/* ... */

IFn fn = define("(> val 0)");
assertTrue((Boolean) fn.invoke(7));

The clojure compiler (inconveniently, in Java 6) is named Compiler, and provides a handy load function which will read and evaluate a String, returning whatever it evaluates to. In this case we return a function which wraps our validation function in a test for true-ishness. In this example, our passed-in value has a hard coded name, val, which is unfortunate, but can be worked around.
We can invoke this function directly via one of its invoke methods – it has a ton of overloads for different argument counts. This approach will generate a Java class (well, a .class, anyway) implementing our function.

To wrap the behavior of a class, rather than an interface, and in a performant way, we’ll break out the ever-scary-but-awesome CGLIB and create a runtime extension of the class being validated. CGLIB is fast, but you pay for that with some gnarly low-level-feeling hackery. Not as low as ASM, though :-)

Our object factory looks like:

public T build(Class type) throws Exception {
    Enhancer e = new Enhancer();
    e.setSuperclass(type);
    List callbacks = new ArrayList();
    callbacks.add(NoOp.INSTANCE);
    final Map

Setting up TokyoCabinet and Ruby


I ran into a couple weirdnesses setting up tokyocabinet and the Ruby API, so am adding this to my external memory. Hopefully it will help anyone else bumping into the same issue.

Assuming you install tokyocabinet at a non-standard location, such as /Users/brianm/.opt/tokyocabinet-1.4.27 and then want to build the ruby bindings for it via a gem, the trick is to add the bin/ directory for the tokyocabinet install dir to your $PATH (in my case, that is just export PATH=/users/brianm/.opt/tokyocabinet-1.4.27/bin:$PATH). The ruby API’s extconf.rb shells out to tc’s tcucodec to find paths to libraries, etc. Alternately you could modify the extconf.rb, which is very short and sweet, but I hate doing that for aesthetic reasons.

To build the gem, you need to build via extconf but not install. After the build, use the normal gem build tokyocabinet.gemspec command to build a gem. Install the gem (in my case, via rip) and Bob’s your uncle.

Now to figure out if anyone has done a convenience API wrapper around the table database in TC…

Borrowing Mark Reid's Styling


I am playing with new layouts, using Mark Reid’s wonderfully readable stylesheets as a basis. I’m going ahead and pushing it out, despite it being a work in progress. For now it is changed very little, in fact the main css is identical, but it will evolve as I have time. I’ve taken another cue from him in using markdown for posts with code in them. Something about redcloth doesn’t play nicely with pygments processing of inline code, whereas the markdown processor does play nicely. So, not really caring about which one I use, I swapped out to markdown for posts with code. Yea! [1,2,3].map { |i| i * i }.inject([]) { |a, i| a << i } {|i| i - 99}.select {|i| i + 99 == 0} =begin heh =end [1,2,3].map { |i| i * i }.inject([]) { |a, i| a << i } {|i| i - 99}.select {|i| i + 99 == 0} I particularly like how Mark’s styling handles long code lines :-) Along the way I killed the search box, it will come back, but it does highlight Toby’s comment that I should have actual, you know, links to my archives. Eventually…[...]

Dataflow Programming


Ever since the idea of dataflow programming clicked for me while reading the excellent Concepts, Techniques, and Models of Computer Programming (affiliate link), I have been trying to figure out the best way to apply it at the library level rather than the language level. Having it at the language level is fine and dandy, but I’m happy to sacrifice a little elegance for something that is just easy to use for building up a page or response in a webapp from a bunch of remote services.

When rendering (heh, originally typoed that as rending, kind of appropriate) a typical page in Rails, PHP, JSP, whatever, if you are nice and clean you fetch all the data you need and shove it into some kind of container which is then used to populate a template. In a complex system it is not unusual to make 20+ remote calls to render a single response. These go to caches, databases, other services, and sometimes pigeons passing by with telegrams on their legs. A couple years ago we used a reactor style dataflow tool Tim and I wrote for javascript. I rather miss having it when wiring together backend services.

I have done a number of ad-hoc versions in Java services, using an executor and passing around references to futures, but I don’t have anything that really matches the rather nice push-and-react style thing we had in javascript. I can imagine something using Doug Lea’s jsr166y fork/join tools, but every time I start to poke into them… well, maybe mapping it into a library really is kind of ugly. It certainly is screaming for anonymous functions; oh well, guess I am not holding my breath.

So, switching to the other languages I hack in nowadays, we have Ruby (oops, no threads), C (umh, no, wrong level of abstraction), and Lua (hey, actually not bad, particularly with how LuaSocket and coroutines play together…).
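Even with Ruby’s green threads (this is the 1.8 era), the core dataflow idea — a write-once variable whose readers block until it is bound — can be sketched at the library level. All the names here are mine, just for illustration:

```ruby
require 'thread'

# A write-once dataflow variable: readers block until someone binds a value.
# Illustrative sketch only, not any particular library's API.
class DataflowVar
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @bound = false
  end

  # Bind the variable exactly once; a second bind raises.
  def bind(value)
    @mutex.synchronize do
      raise 'already bound' if @bound
      @value = value
      @bound = true
      @cond.broadcast   # wake every blocked reader
    end
  end

  # Block until the variable is bound, then return its value.
  def value
    @mutex.synchronize do
      @cond.wait(@mutex) until @bound
      @value
    end
  end
end

x = DataflowVar.new
reader = Thread.new { x.value * 2 }   # blocks until x is bound
x.bind(21)
puts reader.value                     # => 42
```

The nice part is that the consumer just asks for the value and doesn’t care whether the remote call has finished yet, which is most of what I want from the push-and-react style.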

Properly, I should now shut up and go hack. On that note, off to hack!

Teh New Ruby Evil


Found a beauty I don’t know how I missed before:

bar = 'hello world'
foo = 'well, hello world'
foo =~ /#{bar}/

I didn’t realize you could do interpolation into regex literals. I don’t know how I lasted this long without finding out!
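One wrinkle worth noting (example mine): the interpolated string is spliced in as regex source, so if it can contain metacharacters you probably want Regexp.escape first.

```ruby
pat = 'a.b'

p('axb' =~ /#{pat}/)                   # => 0   ('.' matches any character)
p('axb' =~ /#{Regexp.escape(pat)}/)    # => nil (the dot is now literal)
p('a.b' =~ /#{Regexp.escape(pat)}/)    # => 0
```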

Added Disqus Comments


Added commenting via disqus — we’ll see how it works out. Just felt weird not having any comments, and they seem straightforward, provide a comment feed, etc.

New Blog Tooling


To help out with the new blogging system I had fun hacking up some support scripts. I tried a new style this time: one base script and a bunch of plugins, so I do things now as

```
$ blog create new_blog_tooling
$ blog edit new_blog_tooling
```

Which is kind of shiny. I did it in ruby as I wanted to just get it done, but as I futzed I realized I really wanted a module system more like Lua’s or Erlang’s - I didn’t want to know the module name, but wanted to access stuff on it. The closest I got was a pretty gross eval hack, which looks like

```ruby
def load_command command
  it = File.open(File.join(CommandDir, "#{command}.rb")) do |f|
    f.readlines.join("")
  end
  ms = <<-EOM
    Class.new do
      #{it}
    end
  EOM
  eval(ms).new
end
```

Which creates an instance of an anonymous class and lets me call methods on it. The “plugins” then are just bare method definitions, like

```ruby
def execute
  draft_dir = File.join(BaseDir, "_drafts")
  name = ARGV[1]
  exec "#{ENV['EDITOR']} #{File.join(draft_dir, "#{name}.textile")}"
end

def usage
  "<name>"
end

def help
  "open <name> in $EDITOR"
end
```

Which are used to generate help and execute the actual commands:

```
$ blog -h
Usage: blog <command> [additional]
  blog create <name>    creates a new draft with name <name>
  blog drafts           list all drafts
  blog edit <name>      open <name> in $EDITOR
  blog kill <name>      destroy the draft <name>
$
```

Annoyed at having to use an eval hack, but hey, it works.
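For what it’s worth, the heredoc-assembly part of the hack can be avoided by handing the plugin source straight to class_eval on an anonymous class. A sketch of that variant (CommandDir is a stand-in for wherever the plugin files live):

```ruby
# Variant of the plugin loader that skips building an eval string by hand:
# create an anonymous class and class_eval the plugin source into it.
# CommandDir is a hypothetical constant naming the plugin directory.
def load_command(command)
  source = File.read(File.join(CommandDir, "#{command}.rb"))
  klass  = Class.new
  # The second argument is used as the file name in backtraces,
  # which makes plugin errors much easier to track down.
  klass.class_eval(source, "#{command}.rb")
  klass.new
end
```

Still eval at heart, but the intent (an anonymous class holding bare method definitions) is stated directly rather than reassembled as a string.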

Switched Over


Well, I switched over to jekyll from blosxom, finally. All the old posts still exist, at the urls they existed at before. I haven’t worked out the mod_rewrite magic to get comments showing up for old posts, but will, eventually, I hope :-)

In the meantime, I opted not to have an auto-publish on checkin, but to include the scripts to rebuild and republish the blog in the repo. Publishing is now an explicit choice, which works fine. I would still like to automate it, though, so that I don’t need all the tools to publish from whichever machine I am writing on.

As is obvious, comments are disabled at the moment. I am not sure what I am going to do about them. Cliff thinks I should just leave them off (“more trouble than they’re worth”), but I am not sure I agree; I have had good discussions in my comments before. Most of what would have been comment discussion has moved to twitter, though, with no obvious way to make that connection. It would be entertaining to find a way (random hashtags and a form which posts to twitter on your behalf, maybe, but that would be a nasty hashtag abuse).

On Jekyll: it is a pretty decent publishing tool. I don’t especially like the idiom it uses for post naming (this post is 2009-04-04-switched-over.textile, for example), but given the “play nicely with git” goal/requirement, the date being part of the filename is reasonable. I kind of like, conceptually, the webgen style of putting the publication date in the front matter, but I can see the usefulness of this format for find and so on. Will play and see how it goes.

Anyway, speaking of playing, going to go play with it some, hopefully the feed still works correctly!

Using Git to Manage the Blog


So I am experimenting with using git to manage the new blog. I have a published branch up on my server which will be set up to auto-deploy itself when things get checked in. It means I probably need to do some post-merge hook mungery to detect which branches were affected and take appropriate action, but that is okay :-)

I am also thinking about just using a different repo for the published stuff. I’d have one repo on the server anyway, as an intermediate sync point between laptop, desktop, etc, but could then set up a different remote repo which holds the published version. Not sure which I like more. Need to play.

Hello Jekyll


Setting up my new blog using Jekyll. This will probably take a while, but I am going to go incrementally and switch over as soon as I have basic posting and commenting in place.