
Dave Beckett's blog


Making Debian Docker Images Smaller


TL;DR: Use one RUN command to prepare, configure, make, install and clean up. Clean up with:

apt-get remove --purge -y $BUILD_PACKAGES $(apt-mark showauto) && rm -rf /var/lib/apt/lists/*

I've been packaging the nghttp2 HTTP/2.0 proxy and client by Tatsuhiro Tsujikawa both for Debian and with Docker, and noticed it takes some time to fetch the build dependencies (C++, cough) as well as to do the build. In the Debian packaging case it's easy to create minimal dependencies thanks to pbuilder, and to ensure the binary package contains only the right files. See debian nghttp2. For Docker, since you work with containers, it's harder to see what changed, but you still really want the containers as small as possible, since you have to download them to run the app, as well as for the disk use.

While doing this I kept seeing huge images (480 MB), way larger than the base image I was using (123 MB). That didn't make sense, since I was just packaging a few binaries and some small files, plus their dependencies; my estimate was that the delta should be well under 100 MB. I pored over multiple blog posts about Docker images and how to make them small. I even looked at some of the squashing commands like docker-squash that involve import and export, but those seemed not quite the right approach.

It took me a while to really understand that each Dockerfile command creates a new container layer holding the deltas. So when you see all those downloaded layers in a docker pull of an image, it is sometimes a lot of data which is mostly unused. If you want a small image, you need to make each Dockerfile command touch the smallest number of files, and build from a standard base image so most people do not have to download your custom l33t base. It doesn't matter if you rm -rf the files in a later command; they continue to exist in some intermediate layer. So: prepare, configure, make, install and clean up in one RUN command if you can. If the lines get too long, put the steps in separate scripts and call them.
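As a sketch of that structure (the package list and source URL here are illustrative placeholders, not the actual nghttp2 build), a Dockerfile collapsing fetch, build and cleanup into one RUN might look like:

```dockerfile
FROM debian:jessie

# Everything from fetch to cleanup happens in ONE layer, so the build
# dependencies never persist in any intermediate image.
# (Package list and source URL are illustrative placeholders.)
RUN BUILD_PACKAGES="build-essential curl ca-certificates libxml2-dev" && \
    apt-get update && \
    apt-get install -y $BUILD_PACKAGES && \
    curl -fsSL https://example.org/app-1.0.tar.gz | tar xz && \
    cd app-1.0 && ./configure && make && make install && \
    cd .. && rm -rf app-1.0 && \
    apt-get remove --purge -y $BUILD_PACKAGES $(apt-mark showauto) && \
    rm -rf /var/lib/apt/lists/*
```

Because every command above shares a single RUN, the deleted build tree and purged packages never appear in any layer of the final image.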
Lots of Docker images are based on Debian images because they are a small and practical base. The debian:jessie image is smaller than the Ubuntu (and CentOS) images. I haven't checked out the fancy 'cloud' images too much: Ubuntu Cloud Images, Snappy Ubuntu Core, Project Atomic, ...

In a Dockerfile building from some downloaded package, you generally need wget or curl and maybe git. When you install, for example, curl and ca-certificates to get TLS/SSL certificates, it pulls in a lot of extra packages, such as openssl in the standard Debian curl build. You are pretty unlikely to need curl or git after the build stage of your package. So if you don't need them, you could - and you should - remove them, but that's one of the tricky parts.

If $BUILD_PACKAGES contains the list of build dependency packages, such as libxml2-dev and so on, you would think that this would get you back to the start state:

$ apt-get install -y $BUILD_PACKAGES
$ apt-get remove -y $BUILD_PACKAGES

However this isn't enough; you missed the dependencies that got automatically installed, and their dependencies. You could try

$ apt-get autoremove -y

but that also doesn't grab them all; it's not yet clear to me why. What you actually need is to remove all automatically added packages, which you can find with apt-mark showauto. So what you really need is:

$ AUTO_ADDED_PACKAGES=`apt-mark showauto`
$ apt-get remove --purge -y $BUILD_PACKAGES $AUTO_ADDED_PACKAGES

I added --purge too, since we don't need any config files in /etc for build packages we aren't using. Having done that, you might have removed some runtime package dependencies of something you built. Those are harder to find automatically, so you'll have to list and install them by hand:

$ RUNTIME_PACKAGES="...."
$ apt-get install -y $RUNTIME_PACKAGES

Finally you need to clean up apt, which you should do with:

$ rm -rf /var/lib/apt/lists/*

This removes all the index files that apt-get update installed.
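Inside the Dockerfile, that whole sequence belongs in the tail of the same RUN step as the build itself (a sketch only; $BUILD_PACKAGES and $RUNTIME_PACKAGES are the hand-maintained lists described above, and the build commands are stand-ins):

```dockerfile
# Tail of the single build RUN step: purge the build deps and everything
# apt auto-installed, re-install the runtime deps the purge removed,
# then drop the apt index files created by apt-get update.
RUN ./configure && make && make install && \
    apt-get remove --purge -y $BUILD_PACKAGES $(apt-mark showauto) && \
    apt-get install -y $RUNTIME_PACKAGES && \
    rm -rf /var/lib/apt/lists/*
```

The ordering matters: the runtime packages are installed after the purge (which would otherwise remove them as auto-installed), and the apt lists are removed last so the re-install still has indexes to work from.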
This is in many best practice documents and example Dockerfiles. You could add apt[...]

Off to Rackspace


I'm happy to announce that today I started work at Rackspace in San Francisco in a senior engineering role. I am excited about these aspects:

  1. The company: a fun, fast moving organization with a culture of innovation and openness
  2. The people: lots of smart engineers and devops to work with
  3. The technologies: Openstack cloud, Hadoop big data and lots more
  4. The place: San Francisco technology world and nearby Silicon Valley

My personal technology interests and coding projects such as Planet RDF, the Redland librdf libraries and Flickcurl will continue in my own time.



Digg just announced that the Digg Engineering Team Joins SocialCode, and The Washington Post reported that SocialCode hired 15 employees from Digg.

This acquihire does NOT include me. I will be changing jobs shortly but have nothing further to announce at this time.

I wish my former Digg colleagues the best of luck in their new roles. I had a great time at Digg and learned a lot about working in a small company, social news, analytics, public APIs and the technology stack there.

Releases = Tweets


I got tired of posting release announcements to my blog, so I just emailed the announcements to the redland-dev list, tweeted a link to them from @dajobe and announced them on Freshmeat, which a lot of places still pick up. Here are the tweets for the 13 releases I didn't blog since the start of 2011:

  • 3 Jan: Released Raptor RDF syntax library 2.0.0 at only 10 years in the making :)
  • 12 Jan: Released Rasqal RDF Query Library 0.9.22: Raptor 2 only, ABI/API break, 16 new SPARQL Query 1.1 builtins and more #rdf
  • 27 Jan: Rasqal 0.9.23 RDF query library released with SPARQL update query structure fixes (for @theno23 and 4store)
  • 1 Feb: Released Redland librdf 1.0.13 C RDF API and Triplestores with Raptor 2 support + more
  • 22 Feb: Released Rasqal RDF Query Library 0.9.25 with many SPARQL 1.1 new things and fixes. RAND() and BIND() away!
  • 20 Mar: Raptor RDF Syntax Library 2.0.1 released with minor fixes for N-Quads serializer and internal librdfa parser
  • 26 Mar: Released my Flickcurl C API to Flickr 1.21 with some bug fixes and Raptor V2 support (optional)
  • 1 Jun: Released Raptor 2.0.3 RDF syntax library: a minor release adding raptor2.h header, Turtle / TRiG and other fixes.
  • 27 Jun: Rasqal RDF query library 0.9.26 released with better UNION execution, SPARQL 1.1 MD5, SHA* digests and more
  • 23 Jul: Released Redland librdf RDF API / triplestore C library 1.0.14: core code cleanups, bug fixes and a few new APIs.
  • 25 Jul: Raptor RDF Syntax C library 2.0.4 released with YAJL V2 and latest curl support, SSL client certs, bug fixes and more

(Yes, that's 13 releases; I didn't tweet 2 of them: Rasqal 0.9.24 and Raptor 2.0.2.)

You know, it's quite tricky to collapse months of changelogs (GIT history) into release notes, compress those further into a news summary of a few lines, and even harder to compress that into less than 140 characters. It is way less if you include room for a link URL and space for retweeting, and sometimes you need a hashtag for context. So how do you measure a release? Let's try!
Tarballs

Released tarball files from the Redland download site. Sizes are in bytes.

  date        package    old ver  new ver  old size   new size   byte diff  %diff
  2011-01-03  raptor     1.4.21   2.0.0    1,651,843  1,635,566  -16,277    -0.99%
  2011-01-12  rasqal     0.9.21   0.9.22   1,356,923  1,398,581  +41,658    +3.07%
  2011-01-27  rasqal     0.9.22   0.9.23   1,398,581  1,404,087  +5,506     +0.39%
  2011-01-30  rasqal     0.9.23   0.9.24   1,404,087  1,412,165  +8,078     +0.58%
  2011-02-01  redland    1.0.12   1.0.13   1,552,241  1,554,764  +2,523     +0.16%
  2011-02-22  rasqal     0.9.24   0.9.25   1,412,165  1,429,683  +17,518    +1.24%
  2011-03-20  raptor     2.0.0    2.0.1    1,635,566  1,637,928  +2,362     +0.14%
  2011-03-20  raptor     2.0.1    2.0.2    1,637,928  1,633,744  -4,184     -0.26%
  2011-03-26  flickcurl  1.20     1.21     1,775,246  1,775,999  +753       +0.04%
  2011-06-01  raptor     2.0.2    2.0.3    1,633,744  1,652,679  +18,935    +1.16%
  2011-06-27  rasqal     0.9.25   0.9.26   1,429,683  1,451,819  +22,136    +1.55%
  2011-07-23  raptor     2.0.3    2.0.4    1,652,679  1,660,320  +7,641     +0.46%
  2011-07-25  redland    1.0.13   1.0.14   1,554,764  1,581,695  +26,931    +1.73%

Releases that stand out here are Raptor 2.0.0, which was a major release with lots of changes, and Rasqal 0.9.22, which changed a lot upwards since it was both an API break and added lots of new functionality.

Sources

Taken from my GitHub repositories by extracting the tagged releases, excluding ChangeLog* files, and running diffstat over the output of a recursive diff -uRN.

  date        package  old ver  new ver  files changed  lines inserted  lines deleted  lines total
  2011-01-03  raptor   1.4.21   2.0.0    215            34,018          30,348         64,366
  2011-01-12  rasqal   0.9.21   0.9.22   94             11,641          5,712          17,353[...]

Rasqal 0.9.21 and SPARQL 1.1 Query aggregation


Rasqal 0.9.21 was just released on Saturday 2010-12-04 (announcement) containing the following new features:

  • Updated to handle aggregate expression execution as defined by the SPARQL 1.1 Query W3C working draft of 14 October 2010
  • Executes grouping of results: GROUP BY
  • Executes aggregate expressions: AVG, COUNT, GROUP_CONCAT, MAX, MIN, SAMPLE, SUM
  • Executes filtering of aggregate expressions: HAVING
  • Parses new syntax: BINDINGS, isNUMERIC(), MINUS, sub SELECT and SERVICE
  • The syntax format for parsing data graphs at URIs can be explicitly declared
  • The roqet utility can execute queries over the SPARQL HTTP Protocol and operate over data from stdin
  • Added several new APIs
  • Fixed Issue: #0000388

See the Rasqal 0.9.21 Release Notes for the full details of the changes.

I'd like to emphasize a couple of the changes to the roqet(1) utility program: you can now use it to query over data from standard input, i.e. use it as a filter, but only if you are querying over one graph. You can also specify the format of the data graphs, either on standard input or from URIs, if the format can't be determined or guessed from the MIME type, name or bytes. Finally, roqet(1) can execute remote queries at a SPARQL Protocol HTTP service, sometimes called a "SPARQL endpoint".

The new support for SPARQL Query 1.1 aggregate queries (and other features) led me to make comments to the SPARQL working group about the latest SPARQL Query 1.1 working draft, based on the implementation experience. The comments (below) were based on the implementation I previously outlined in Writing an RDF query engine. Twice at the end of October 2010.

The implementation work to create the new features was substantial but relatively simple to describe: new rowsources were added for each of the aggregation operations and then included in the execution plan when the query structure indicated their presence after parsing.
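For concreteness, this is the shape of query those features now execute (a generic SPARQL 1.1 aggregate query of my own for illustration, not one from the release notes; the prefix and predicate names are made up):

```sparql
PREFIX ex: <http://example.org/ns#>

# Average score per item, keeping only groups whose average exceeds 10
SELECT ?item (AVG(?score) AS ?avgScore)
WHERE {
  ?item ex:score ?score .
}
GROUP BY ?item
HAVING (AVG(?score) > 10)
```

GROUP BY drives the grouping rowsource, AVG() the aggregation rowsource, and HAVING the post-grouping filter described below.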
There was some additional glue code that needed to be added to allow rows to indicate their presence in a group; a simple integer group ID was sufficient. The ID value has no semantics: just checking for a change of ID is enough to know a group started or ended. I also introduced internal variables to bind the results of SELECT aggregate expressions after grouping ($$aggID$$ style names, which are illegal variable names in SPARQL). I then used those to replace the aggregate expressions in the SELECT and HAVING expressions and used the normal evaluation mechanisms. As I understand it, the SPARQL WG is now considering adding a way to name these explicitly in the GROUP statement; a happy coincidence, since I had implemented it without knowing that.

To prepare this I did think about the approach a lot and developed a couple of diagrams for the grouping and aggregation rowsources that might help in understanding how they work, and how they can be implemented and tested as standalone unit modules, which they were.

[Diagram: Rasqal Group By Rowsource]

As always, the above isn't quite how it is implemented. There is no group-by node if there is an implicit group, when GROUP BY is missing but an aggregate expression is used; instead the Rasqal rowsource class synthesizes one group around the incoming rows when grouping is requested.

[Diagram: Rasqal Aggregation Rowsource]

This shows the guts of the aggregate expression evaluation, where internal variables are introduced, substituted into the SELECT and HAVING expressions, and then bound as the aggregate expressions are executed over the groups.

The rest of this post is my detailed thoughts on this draft of SPARQL 1.1 Query, as posted to the SPARQL WG comments list but turned into HTML markup here.
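The group-ID mechanism can be sketched in a few lines (a Python illustration of the idea, not Rasqal's actual C code): rows arrive tagged with an integer group ID, and the aggregation step only needs to notice when the ID changes.

```python
def aggregate_rowsource(rows, agg):
    """Sketch of an aggregation rowsource.

    `rows` is an iterable of (group_id, value) pairs from the group-by
    rowsource; `agg` folds one group's values into one result.  The
    group_id values carry no meaning -- only a *change* of ID marks a
    group boundary, exactly as described above."""
    current_id = None
    values = []
    for group_id, value in rows:
        if current_id is not None and group_id != current_id:
            yield agg(values)   # group ended: emit its aggregate
            values = []
        current_id = group_id
        values.append(value)
    if values:
        yield agg(values)       # flush the final group

# e.g. AVG over two groups with arbitrary IDs 7 and 3
rows = [(7, 10), (7, 20), (3, 5)]
avgs = list(aggregate_rowsource(rows, lambda vs: sum(vs) / len(vs)))
# avgs == [15.0, 5.0]
```

Because only ID changes matter, the rowsource stays streaming: it never needs the whole input in memory, just the current group.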
Dave Beckett comments on SPARQL 1.1 Query Language W3C WD 2010-10-14 These are my personal comments (not speaking for any past or current employer) on: SPARQL 1.1 Query Language W3C Working Draft 14 October 2010 My comments are based on the work I did to add some SPARQL 1.1 query and update support to my Rasqal RDF query library (e[...]

End of life of Raptor V1. End of support for Raptor V1 in Rasqal and librdf


Raptor V1 was last released in January 2010 and Raptor V2 seems pretty stable and working. I am therefore announcing that from early 2011, Raptor V2 will replace Raptor V1 and be a requirement for Rasqal and librdf.

End of life timeline

Now:
  • Raptor V1's last release remains 1.4.21 of 2010-01-30.
  • Raptor V2 release 2.0.0 will happen "soon".
  • The next Rasqal release will support Raptor V1 and Raptor V2.
  • The next librdf release will support Raptor V1 and Raptor V2 (and requires a Rasqal built with the same Raptor version).

2011:
  • The next Rasqal release will support Raptor V2 only.
  • The next librdf release will support Raptor V2 only (and require a Rasqal built with Raptor V2).

In the style of open source I've been using for the Redland libraries, which might be described as "release when it's ready, not release by date", these dates may slip a little, but the intention is that Raptor V2 becomes the mainline. I do NOT rule out another Raptor V1 release, but it would be ONLY for security issues: it would contain minimal changes and not add any new features or fix any other type of bug.

Developer Actions

If you use the Raptor V1 ABI/API directly, you will need to upgrade. If you want to write conditional code, that's possible. The redland librdf GIT source (or 1.0.12) uses the approach of macros that rewrite V2 calls into V1 forms, and I recommend this way, since dropping Raptor V1 support then amounts to removing the macros. The Raptor V2 API documentation has a detailed section on the changes and there is also an upgrading document, plus it points to a perl script in docs/ (also in the Raptor V2 distribution) that automates some of the work (renames mostly) and leaves markers where a human has to fix things up. The Raptor V1 API documentation will remain available in a frozen state.

Packager Actions

If you are a packager of the redland libraries, you need to prepare for the Raptor V1 / Raptor V2 transition, which can vary depending on your distribution's standards.
The two versions share two files: the rapper binary and the rapper.1 man page. I do not want to rename them to rapper2 etc., since rapper is a well-known utility name in RDF and I want 'rapper' to provide the latest version. In the Debian packaging, which I maintain, these are already planned to be in separate packages, so that both libraries can be installed and you can choose the raptor-utils2 package over raptor-utils (V1). In other distributions where everything is in one package (BSD Ports, for example) you may have to move the rapper/rapper.1 files to the raptor V2 package and create a new raptor1 package without them, i.e. something like this:

Raptor V1 package 1.4.21-X:
  • /usr/lib/* ...
  • (no /usr/bin/rapper or /usr/share/man/man1/rapper.1)

Raptor V2 package 2.0.0-1:
  • /usr/lib/* ...
  • /usr/bin/rapper
  • /usr/share/man/man1/rapper.1
  • conflicts with older raptor1 packages before 1.4.21-X

The other thing to deal with is that when Rasqal is built against Raptor V2, it has internal changes that mean librdf also has to be built against rasqal-with-raptor2. This needs enforcing with packaging dependencies. This packaging work can be done/started as soon as Raptor V2 2.0.0 is released, which will be "soon".[...]

Writing an RDF query engine. Twice


"You'll end up writing a database" said Dan Brickley prophetically in early 2000. He was, of course, correct.

What started as an RDF/XML parser and a BerkeleyDB-based triple store and API ended up as a much more complex system that I named Redland, with the librdf API. It does indeed have persistence, transactions (when using a relational database) and querying. However, RDF query is not quite the same thing as SQL, since the data model is schemaless and graph-centric, so when RDQL and later SPARQL came along, Redland gained a query engine component in 2003 named Rasqal: the RDF Syntax and Query Library for Redland. I still consider it not a 1.0 library after over 7 years of work.

Query Engine The First

The first query engine was written to execute RDQL, which today looks like a relatively simple query language. There is one type of SELECT query, returning sequences of sets of variable bindings in a tabular result like SQL. The query is a fixed pattern and doesn't allow any optional, union or conditional pattern matching. This was relatively easy to implement in what I've called a static execution model:

  1. Break the query up into a sequence of triple patterns: triples that can include variables in any position, which will be bound by matching against triples. A triple pattern returns a sequence of sets of variable bindings.
  2. Match each of the triple patterns in order, top to bottom, to bind the variables.
  3. If there is a query condition like ?var > 10, check that it evaluates to true.
  4. Return the result.
  5. Repeat at step #2.

The only state that needed saving was where in the sequence of triple patterns the execution had got to - pretty much an integer - so that the looping could continue. When a particular triple pattern was exhausted it was reset, the previous one incremented and the execution continued. This worked well and executes all of RDQL, no problem. In particular, it was a lazy execution model: it only did work when the application asked for an additional result.
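That static model is essentially a lazy nested-loop join over the triple patterns. A minimal Python sketch of the idea (my illustration, not Rasqal's C implementation; the "store" here is just a list of tuples):

```python
def match_pattern(triples, pattern, bindings):
    """Match one triple pattern (strings starting with '?' are variables)
    against the store, yielding extended variable-binding dicts."""
    for triple in triples:
        new = dict(bindings)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):            # variable position
                if new.get(p, t) != t:       # clashes with earlier binding?
                    break
                new[p] = t
            elif p != t:                     # constant must match exactly
                break
        else:
            yield new

def execute(triples, patterns, condition=lambda b: True):
    """Lazy nested-loop execution: recurse down the pattern list,
    yielding one result row at a time, like query engine #1 did."""
    def run(i, bindings):
        if i == len(patterns):
            if condition(bindings):          # e.g. ?var > 10 checks
                yield bindings
            return
        for b in match_pattern(triples, patterns[i], bindings):
            yield from run(i + 1, b)
    return run(0, {})

# Tiny example store and a two-pattern query
store = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
rows = list(execute(store, [("?x", "knows", "?y"), ("?y", "knows", "?z")]))
# rows == [{"?x": "alice", "?y": "bob", "?z": "carol"}]
```

The generator-based recursion is what makes it lazy: nothing past the next result is computed until the caller asks, which is the property the post credits to engine #1.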
However, in 2004 RDF query standardisation started and the language grew.

Enter The Sparkle

The new standard RDF query language, which was named SPARQL, had many additions to the static patterns of the RDQL model. In particular it added OPTIONAL, which allowed optionally (sic) matching an inner set of triple patterns (a graph pattern) and binding more variables. This is useful in querying heterogeneous data, when there are sometimes useful bits of data that can be returned, but not every graph has them.

This meant that the engine had to be able to match multiple graph patterns - the outer one and any inner optional graph pattern - as well as be able to reset execution of graph patterns when optionals were retried. Optionals could also be nested to an arbitrary depth. This combination meant that the state that had to be preserved for getting the next result became a lot more complex than an integer.

Query engine #1 was updated to handle one level of nesting: a combination of an outer fixed graph pattern plus one optional graph pattern. This mostly worked, but it was clear that having the entire query use a fixed state model was not going to work as queries got more complex and dynamic. So query engine #1 could not handle the full SPARQL OPTIONAL model and would never implement UNION, which required more state tracking. This meant that Query Engine #1 (QE1) needed replacing.

Query Engine The Second

The first step was a lot of refactoring. In QE1 there was a lot of shared state that needed pulling apart: the query itself (graph patterns, conditions - the result of the parse tree), the engine that executed it, and the query result (a sequence of rows of variable bindings). These needed separating so that the query engine could be changed independently of the query or results. Rasqal 0.9.15 at the end of 2007 was the first release with the start of the refactoring. During the work f[...]

Redland librdf 1.0.11 released


I have just released Redland librdf library version 1.0.11 which has been in progress for some time, delayed by the large amount of work to get out Raptor V2 as well as initial SPARQL 1.1 draft work for Rasqal 0.9.20.

The main features in this release are as follows:

See the Redland librdf 1.0.11 Release Notes for the full details of the changes.

Note that the Redland language bindings work fine with Redland librdf 1.0.11, but the bindings will soon have a release to match.

Leaving Yahoo - Joining Digg


I'm heading to a new adventure at Digg in San Francisco to be a lead software engineer working on APIs and syndication.

I've been at Yahoo! nearly 5 years so it is both a happy and sad time for me, and I wish all the excellent people I worked with the best of luck in future.

Here is a summary of the main changes:

  • Silicon Valley -> San Francisco
  • 15,000 staff -> 100 staff
  • Architect -> Software engineer
  • strategizing, meeting -> coding
  • Powerpoint, OmniGraffle, twiki -> emacs, eclipse, ...?
  • (No coding!) -> Python, Java, Hadoop, Cassandra, ...?
  • Sunny days -> Foggy days
  • 15 min commute -> 2.5hr commute (until I move to SF)
  • Public company -> private company


Rasqal RDF Query Library 0.9.20


I just released a new version of my Rasqal RDF Query Library for two main new features:

  • Support for more of the new W3C SPARQL working drafts of 1 June 2010 for SPARQL 1.1 Query and SPARQL 1.1 Update.
  • Support for building with the Raptor V2 API as well as the Raptor V1 API.

The main change is to start adding to Rasqal's APIs and making query engine changes for the new SPARQL 1.1 working drafts. This release adds support for the syntax of all the changes for Query and Update. The new draft syntax is available via the 'laqrs' query language name until the SPARQL 1.1 syntax is finalized; the 'sparql' query language provides SPARQL 1.0 support.

On Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN(), which are now usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT() but remains syntax-only. Finally, the new GROUP BY with HAVING conditions were added to the syntax, with consequent API updates, but no query engine execution of them yet.

For Update 1.1, the full set of update operation syntax was added, creating API structures. Note, however, that there seem to be some ambiguities in the draft syntax, especially around multiple optional tokens in a row near WITH, which are particularly hard to implement in flex and bison (aka "lex and yacc").

The main non-SPARQL-1.1 change is to allow building Rasqal with the Raptor V2 APIs rather than V1. Raptor V2 is in beta, so this is not a final API; it is therefore not the default build and has to be enabled with --enable-raptor2 at configure time. When Raptor V2 is stable (2.0.0), Rasqal will require it.
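As an illustration of the kind of expression these builtins enable (my own example query, not one from the release notes; the prefix and predicates are made up):

```sparql
PREFIX ex: <http://example.org/ns#>

# IF() picks a fallback label; IN() tests membership in a fixed list
SELECT ?item (IF(BOUND(?name), ?name, "unnamed") AS ?label)
WHERE {
  ?item ex:status ?status .
  OPTIONAL { ?item ex:name ?name }
  FILTER (?status IN ("new", "open"))
}
```

Both IF() and IN() here run through the normal expression evaluation path mentioned above, rather than needing special query engine structures.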
The changes to Rasqal in this release, in summary, are:

  • Updated to handle more of the new syntax defined by the SPARQL 1.1 Query and SPARQL 1.1 Update W3C working drafts of 1 June 2010
  • Added execution support for the new SPARQL 1.1 query built-in expressions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN()
  • Added an 'html' query result table format, from a patch by Nicholas J Humfrey
  • Added API support for GROUP BY HAVING expressions
  • Added XSD Date comparison support
  • Support building with the Raptor V2 API if configured with --with-raptor2
  • Many other bug fixes and improvements

Fixed Issues: #0000352, #0000353, #0000354, #0000360, #0000374, #0000377 and #0000378

See the Rasqal 0.9.20 Release Notes for the full details of the changes. Get it at

PS The source code control has also moved to GIT and is hosted at GitHub.[...]