Маниакальный веблог



Ivan Sagalaev on programming and web development



Updated: 2017-05-13T11:00:53-07:00

 



Treat HTTP status codes as strings
I usually see HTTP status codes being checked as integer numbers: if ((status >= 200) && (status < 300)) // or `<= 299` This is rather verbose. But if you think about it, it's the first digit in the number that has a special value, and there are only five ...

I usually see HTTP status codes being checked as integer numbers:

if ((status >= 200) && (status < 300)) // or `<= 299`

This is rather verbose. But if you think about it, it's the first digit in the number that has a special value, and there are only five of them:

  • 1xx: server programmer is too smart
  • 2xx: success
  • 3xx: your client HTTP library is too dumb
  • 4xx: you screwed up
  • 5xx: server screwed up

When treated as strings, checking for the error class looks a bit better:

if (status[0] == '2')

Unfortunately, the ensuing party is pre-pooped by most client HTTP libraries helpfully casting the status to int.

Dear HTTP client libraries! Consider adding response.error_class to your responses.
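
If a library did want to expose that, the whole thing could be as small as this sketch (my illustration in Rust; the names are made up, not from any real HTTP client):

#[derive(Debug, PartialEq)]
enum ErrorClass {
    Informational, // 1xx
    Success,       // 2xx
    Redirection,   // 3xx
    ClientError,   // 4xx
    ServerError,   // 5xx
}

// Hypothetical helper: the "first digit" check without going through strings.
fn error_class(status: u16) -> Option<ErrorClass> {
    match status / 100 {
        1 => Some(ErrorClass::Informational),
        2 => Some(ErrorClass::Success),
        3 => Some(ErrorClass::Redirection),
        4 => Some(ErrorClass::ClientError),
        5 => Some(ErrorClass::ServerError),
        _ => None,
    }
}

fn main() {
    assert_eq!(error_class(204), Some(ErrorClass::Success));
    assert_eq!(error_class(503), Some(ErrorClass::ServerError));
}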




Status update: 2016
I thought I'd post something for no reason… We're still living in Washington state, I'm still into programming, with much the same hobbies as before. However these days I actually work for money. The company is Shutterstock, and I'm a resident "refactorer" on one of the teams in Search. So ...

I thought I'd post something for no reason…

(image)

We're still living in Washington state, I'm still into programming, with much the same hobbies as before. However, these days I actually work for money. The company is Shutterstock, and I'm a resident "refactorer" on one of the teams in Search. So for the last few months I mostly write and re-write Python code. Which feels scarily comfortable. The scare is due to my understanding that, from the point of view of self-development, feeling comfortable is a road to nowhere. But I'm not scared too much, as it has only started and I'm constantly thinking of how to make my work life more difficult. I've been missing that for the past few years.

(image)

My employment is somewhat unusual because the company, though gladly accepting remote workers, still adheres mostly to New York working hours, which results in a funny schedule for me living three time zones behind it. So I enjoy doing my groceries on weekdays, and running, and an occasional beach trip! Work is work though, and it inevitably took time out of other things, and since my other priorities (also inevitably) lie with my family, the things I chose to sacrifice are Rust and highlight.

The latter isn't actually dead though, thanks to the established contributor community providing patches at a rate none of the core team has any hope of keeping up with, and to the well-greased release process.

Speaking of which… The latest release of highlight was the first, after a pretty long streak of completely automatic successful deployments, that… didn't deploy automatically. The reason was that I had updated the code of both highlightjs.org and softwaremaniacs.org to something resembling a current-century setup. The projects are now running in their own virtual environments, both use Django 1.10 and Python 3 (yes, I don't have Python 2 code on my server anymore). I took some cues from The 12-factor App, and my settings are less insane now. Also, the server can now survive reboots without assistance. And I moved it from London to the US East Coast, which is faster for me.

(image)

What else… I tried working on a Mac, didn't like it. There are two Thinkpad X1s on my desk right now! Both running Ubuntu. Started and paused learning guitar (again). I own a Les Paul, if you're interested… I drive a car with a stick shift, a Ford Fiesta ST. If you're a car nut, I advise you to indulge yourself with the new gig of Clarkson and Co — The Grand Tour. I watched the first episode with a happy grin on my face from the start to the very end! Also, it's good that it's on Amazon, so I don't need to hunt torrents or buy some stupid subscription with a lot of TV I don't need. Disappointed with the Trump thing, like everyone I know… in my echo chamber. But no despair, democracy should work out in the end!

As usual, I hope to return to more or less regular writing… but who knows if and how. See ya!




Liberal JSON
Tim Bray beat me to writing about this with some very similar thoughts to mine: Fixing JSON. I especially like his idea about native times, along with prefixing them with @ as a parser hint. I'd like to propose some tweaks however, based on my experience of writing JSON parsers ...

Tim Bray beat me to writing about this with some very similar thoughts to mine: Fixing JSON. I especially like his idea about native times, along with prefixing them with @ as a parser hint. I'd like to propose some tweaks however, based on my experience of writing JSON parsers (twice).

Commas and colons

Not only do you not need either of them, they actually make parsing more complicated. When you're inside an array or an object, you already know when to expect the next value or key, but you have to diligently check for commas and colons for the sole purpose of signaling errors if you don't find them where expected. Add to that edge cases with trailing commas and empty containers, and you get a really complicated state machine with no real purpose.

My proposal is simpler than Tim's, though: no need to actually remove them, just equate them to whitespace. As in: whitespace = ['\t', '\n', '\r', ' ', ',', ':']. That's it.

It removes all the complications from parsing, and humans can still write them for aesthetics. And by the way, this approach works fine in Clojure for vectors and maps.
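
Here's a minimal sketch of what that means for a lexer (my own illustration, not taken from any existing parser): the separators simply join the whitespace set and get skipped.

fn is_whitespace(b: u8) -> bool {
    // commas and colons are treated exactly like spaces
    matches!(b, b'\t' | b'\n' | b'\r' | b' ' | b',' | b':')
}

fn skip_whitespace(buf: &[u8], mut pos: usize) -> usize {
    while pos < buf.len() && is_whitespace(buf[pos]) {
        pos += 1;
    }
    pos
}

fn main() {
    let buf = br#"{"a": 1, "b": 2}"#;
    assert_eq!(skip_whitespace(buf, 4), 6); // skips both ':' and ' '
}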

\uXXXX

JSON is defined as a UTF-8 encoded stream of bytes. This is already enough for encoding the entire Unicode range. Yet, on top of that there's another encoding scheme using \uXXXX. One could probably speculate that it was added to enable authoring tools that can only operate in the ASCII subset of UTF-8, but thankfully we've moved away from those dark ages already.

Handling those is a pain in the ass for a parser, especially a streaming one. Dealing with single-letter escapes like \n is easy, but with \uXXXX you need an extra buffer, you need to check for edge cases with not-yet-enough characters, and you're probably going to need a whole separate class of errors for those. Gah…

Just do away with the thing.




highlight turns 10
Almost exactly ten years ago on August 14 I wrote on this very blog (albeit in a different language): So on yesterday's night I got worked up and decided to try and write [it]. But on a condition of not dragging it on for many days if it didn't work ...

Almost exactly ten years ago on August 14 I wrote on this very blog (albeit in a different language):

So on yesterday's night I got worked up and decided to try and write [it]. But on a condition of not dragging it on for many days if it didn't work out on the first take, I've got enough on my mind as it is.

It did work out. Which makes August 14 the official birthday of highlight! Although it wasn't until 5 days later that the first meaningful commit was recorded. Using any form of source control was only an afterthought for me back then :-)

Quick flashback

  • Switched through 3 version control systems (Subversion, Bazaar, Git).
  • Made 71 (seventy-one!) public releases, with a regular 6-week cadence for the past year.
  • 166 languages and 77 styles created by 216 contributors and 3 core developers.
  • Accumulated 8062 stars on GitHub.
  • Went from being a single file to being provided as a custom-built package, a node library and served from two independent CDNs.
  • Acquired a mighty 490-strong unit test suite.

Identity

With the obligatory self-congratulatory stuff out of the way, let me now get to the main purpose of this anniversary post: explaining what makes highlight different among other highlighters. I'm not going to talk about obvious features listed on the front page of highlightjs.org. I'll try to document the philosophy that up until this point I was only referring to in various places but never was able to put together. I'll try to keep it short (otherwise I'll never finish this post!)

It is my deep conviction that highlighting should make code more readable instead of simply making it… fun, for the lack of a better word. Let me explain by example. Here are some things that serve towards better readability when highlighted:

  • Keywords, because they define the overall structure of the code and need prominent highlighting simply because they otherwise look too much like user variables.
  • Function and class titles at the place of declaration, because they effectively define a domain-specific language, an API. They have very distinct semantics.
  • Built-ins and special literals, because it helps to know what in the code belongs to the language and what is defined by the user.

And these are the things that make no sense to highlight, in my humblest opinion:

  • CamelCase identifiers, because it's not consistent: you get identifiers of the same nature either highlighted or not simply because they happen to be named differently.
  • .method() calls, because I, frankly, can't even invent a plausible reason why they should be highlighted in any way.
  • Punctuation, because it significantly increases the amount of color clutter in any given snippet, which makes it hard on the eyes.

I have a hypothesis that the only reason why these things get highlighted traditionally is simply due to the fact that they can easily be picked up by a regexp :-)

In highlight we sometimes go to great lengths to highlight what makes sense instead of what's easy ("semantics highlighting?"). In lisps we highlight the first thing in parentheses, regardless of it being or not being a built-in, and we have special rules to not highlight them in quoted lists and even in argument lists in lambdas in Scheme. In VimScript we try our best to distinguish between strings and line comments even though they seem to be deliberately designed to trip up parsers. And we recognize quite a few ways of spelling out attributes in HTML.

The downside of this is that highlight is heavier and probably slower than it could've been. These were the reasons why we recently lost a bid on replacing the incumbent highlighting library on Stack Overflow. I still think they made a mistake :-) Because quality beats lightness!

Come join us! [...]



Why Rust's ownership/borrowing is hard
Working with pure functions is simple: you pass arguments, you get a result — no side effects happen. If, on the other hand, a function does have side effects, like mutating its arguments or global objects, it's harder to reason about. But we've got used to those too: if you ...

Working with pure functions is simple: you pass arguments, you get a result — no side effects happen. If, on the other hand, a function does have side effects, like mutating its arguments or global objects, it's harder to reason about. But we've got used to those too: if you see something like player.set_speed(5) you can be reasonably certain that it's going to mutate the player object in a predictable way (and maybe send some signals somewhere, too).

Rust's ownership/borrowing system is hard because it creates a whole new class of side effects.

Simple example

Consider this code:

let point = Point {x: 0, y: 0};
let result = is_origin(point);
println!("{}: {}", point, result);

Nothing in the experience of most programmers would prepare them for point suddenly ceasing to work after being passed to is_origin()! The compiler won't let you use it in the next line. This is the side effect I'm talking about — something has happened to the argument — but not the kind you've seen in other languages. Here it happens because point gets moved (instead of being copied) into the function, so the function becomes responsible for destroying it and the compiler prevents you from using it after that point.

The way to fix it is to either pass the argument by reference or to teach it how to copy itself. It makes total sense once you've learned about "move by default". But these things tend to jump out at you in a seemingly random fashion while you're doing some innocent refactorings or, say, adding logging.

Complicated example

Consider a parser that takes some bits of data from an underlying lexer and maintains some state:

struct Parser {
    lexer: Lexer,
    state: State,
}

impl Parser {
    fn consume_lexeme(&mut self) -> Lexeme {
        self.lexer.next()
    }

    pub fn next(&mut self) -> Event {
        let lexeme = self.consume_lexeme(); // read the next lexeme
        if lexeme == SPECIAL_VALUE {
            self.state = State::Closed // update state of the parser
        }
    }
}

The seemingly unnecessary consume_lexeme() is just a convenience wrapper around a somewhat longer string of calls that I have in the actual code. The lexer.next() returns a self-sufficient lexeme by copying data from the lexer's internal buffer.

Now, we want to optimize it so lexemes would only hold references into that data and avoid copying. We change the method declaration to:

pub fn next<'a>(&'a mut self) -> Lexeme<'a>

The 'a thingy effectively says that the lifetime of a lexeme is now tied to the lifetime of the lexer reference on which we call .next(). It can't live all by itself but depends on data in the lexer's buffer. The 'a just spells it out explicitly here.

And now Parser::next() stops working:

error: cannot assign to `self.state` because it is borrowed [E0506]
    self.state = State::Closed
    ^~~~~~~~~~~~~~~~~~~~~~~~~~
note: borrow of `self.state` occurs here
    let lexeme = self.consume_lexeme();
    ^~~~

In plain English, Rust tells us that as long as we have lexeme available in this block of code it won't let us change self.state — a different part of the parser. And this does not make any sense whatsoever!

The culprit here is the consume_lexeme() helper. Although it only actually needs self.lexer, to the compiler we say that it takes a reference to the entire parser (self). And because it's a mutable reference, the compiler won't let anyone else touch any part of the parser lest they might change the data that lexeme currently depends on.

So here we have this nasty side effect again: though we didn't change actual types in the function signature and the code is still sound and sho[...]
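
For reference, a minimal sketch of the two fixes mentioned above for the simple example (my own compilable version, using Debug formatting; not code from the post itself):

#[derive(Copy, Clone, Debug)] // fix 2: a Copy type is copied into the function, not moved
struct Point {
    x: i32,
    y: i32,
}

fn is_origin(point: &Point) -> bool { // fix 1: borrow instead of taking ownership
    point.x == 0 && point.y == 0
}

fn main() {
    let point = Point { x: 0, y: 0 };
    let result = is_origin(&point);
    println!("{:?}: {}", point, result); // `point` is still usable with either fix
}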



ijson in Rust: typed lexer
Catching up on my Rust learning diaries… Today I'm going to tell you about relieving my Lexer from its Pythonic legacy and what tangible results it produced, besides just being The Right Thing™. Basic idea The original lexer yielded three kinds of lexemes: strings enclosed in quotes: "..." multi-character literals ...

Catching up on my Rust learning diaries… Today I'm going to tell you about relieving my Lexer from its Pythonic legacy and what tangible results it produced, besides just being The Right Thing™.

Basic idea

The original lexer yielded three kinds of lexemes:

  • strings enclosed in quotes: "..."
  • multi-character literals and numbers: [a-z0-9eE\.\+-]+
  • single-character lexemes: brackets, braces, commas, colons, etc.

Type-wise all of them were strings, and it was the job of the parser to check what kind of lexemes they were: a known literal, something starting with a quote or something parsable as a number… or an error, failing all else. This made total sense in Python where I, for example, just used a single generalized regexp to parse all non-string lexemes. It allowed for very simple and readable code, and it's in fact the only right way to parse byte buffers in an untyped GC-powered language where dealing with individual bytes introduces too much performance overhead.

In Rust though it simply felt foreign, because the lexer already has an intimate understanding of what it is that it parses — something starting with ", or +|-|0..9, or {, … — it has to explicitly check them all anyway. Hence it seemed silly to just drop this intrinsic type information on the floor and clump everything back into strings. Also I had a suspicion that it should affect performance quite significantly, as I had to allocate memory for and copy all those small string pieces. Lots of allocations and copying is never good!

Process

I started by introducing a dedicated Lexeme type distinguishing between strings, single-character lexemes and everything else under the umbrella term "scalar" (don't grumble about the name, it was destined to go away in any case):

#[derive(Debug, PartialEq)]
pub enum Lexeme {
    String(String),
    Scalar(String),
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
}

If anything, it made the code uglier as there were now two paradigms sitting in the code side by side: typed values and "scalars" that I had to handle the old way:

match lexeme {
    Lexeme::OBracket => Event::StartArray,
    Lexeme::OBrace => Event::StartMap,
    Lexeme::CBracket => Event::EndArray,
    Lexeme::CBrace => Event::EndMap,
    Lexeme::String(s) => Event::String(try!(unescape(s))),
    // The Ugliness boundary :-)
    Lexeme::Scalar(ref s) if s == "null" => Event::Null,
    Lexeme::Scalar(ref s) if s == "true" => Event::Boolean(true),
    Lexeme::Scalar(ref s) if s == "false" => Event::Boolean(false),
    Lexeme::Scalar(s) => {
        Event::Number(try!(s.parse().map_err(|_| Error::Unknown(s))))
    },
    _ => unreachable!(),
}

Next, the string un-escaping business has been moved entirely into the lexer. Even though it was a pretty much verbatim move of a bunch of code from one module to another, it made it obvious that I was actually processing escaped characters twice: first simply to correctly find a terminating " and then to decode all those escapes into raw characters. This proved to be a good optimization opportunity later. It never ceases to amaze me how such simple refactorings sometimes give you a much better insight! Do not ignore them.

Finally, I split Lexeme::Scalar into honest numbers, booleans and the null. The code got more readable and more idiomatic all over, and there was much rejoicing!

Bumps along the road

During all those refactorings I had to constantly fiddle with error definitions (of which I wasn't a fan to begin with). Changing wrapped error types and types of error parameters — all this really fun stuff, you k[...]
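
For illustration, here's roughly what the fully typed enum could look like after that final split (a sketch; the actual shape in ijson may differ):

#[derive(Debug, PartialEq)]
pub enum Lexeme {
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
}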



Cadence for highlight
We're now doing releases of highlight on a cadence of 6 weeks. The latest release 8.8 was the second in a row (which is what technically allows me to write "are now doing"). The reason for that is we (well, mostly me) had a certain difficulty deciding when to actually ...

We're now doing releases of highlight on a cadence of 6 weeks. The latest release 8.8 was the second in a row (which is what technically allows me to write "are now doing").

The reason for that is we (well, mostly me) had a certain difficulty deciding when to actually release something. We don't develop new grand features on a regular basis; all that's happening is bug fixes, new language definitions and new styles. And releasing a new version for every little change is going to annoy end users and drive downstream maintainers mad. So releases tended to happen pretty much by chance. Like, someone would ask on a random GitHub issue when the next release was coming, and I would think, why not right now?

This anarchic approach actually worked for some time while the project wasn't moving too fast. But this has changed in the recent couple of years, and since I'd left users stranded waiting for a new release for months on a couple of occasions, I thought it was time to get more serious.

Our release process is now quite simple, too. A maintainer only has to document the changes, update the version number and push it all to GitHub. GitHub then pings a certain API handler on highlightjs.org and the site does everything else:

  • updates the code,
  • builds a CDN package and pushes it to GitHub from where two independent CDN providers pick it up, also automatically,
  • builds and pushes a package to npmjs.org,
  • updates the live demo and various metadata (version number, language count, etc),
  • pre-builds site's caches used for dynamic custom builds,
  • publishes version-related news from the CHANGES file,
  • restarts itself,
  • goes on social media and spends a day generating over-excited buzz about the release (OK, probably not this :-) ).

The process is still fragile but bugs are getting fixed and it's anyway immensely simpler than doing it all manually.

See you next on October 20th!




ijson in Rust: errors
While I fully expected to have difficulty switching from the paradigm of raising and catching exceptions to checking return values, I wasn't ready for Rust requiring so much code to implement it properly (i.e., by the bible). So this is the one where I complain… Non-complaint I won't be talking ...

While I fully expected to have difficulty switching from the paradigm of raising and catching exceptions to checking return values, I wasn't ready for Rust requiring so much code to implement it properly (i.e., by the bible). So this is the one where I complain…

Non-complaint

I won't be talking about exceptions vs. return values per se. For a language that won't let you omit cases in a match and where type safety is paramount it totally makes sense to make programmers deal with errors explicitly. Even if it's just saying "drop everything on the floor right here", it's done with an explicit call to panic! or unwrap(), so you can go over them later with a simple text search and replace them with something more sensible.

So if you're coming to it from a dynamic language, like I am, my best advice is to not get upset every time you have to stop and reset the pipeline inside your brain to think about every little unwelcome Result::Err that just refuses to go away. Get used to it :-)

As a result, my code changed a lot in a myriad of places (not even counting a whole new "errors" module). And it prompted me to consider enforcing tighter invariants on the module boundaries. For example, I now see that instead of dispatching lexemes as Vec<u8> and leaving the handling of potential utf-8 conversion errors to multiple consumers, it's better to contain conversion within the lexer module and only handle this kind of error in one place (I haven't done it yet).

Boilerplate

To the bad part, then… It is imperative that a library should wrap all the different kinds of errors that might occur within it into an overarching library-specific error type. Implementing it in Rust is straightforward but very laborious. This is my Error type:

pub enum Error {
    IO(io::Error),
    Utf8(str::Utf8Error),
    Unterminated,
    Escape(String),
    Unexpected(String),
    MoreLexemes,
    Unmatched(char),
    AdditionalData,
}

To make it actually useful I had to:

  • For all eight variants, write a line converting it into a string for display purposes (Display::fmt). All of those lines are unsurprisingly similar looking.
  • Associate with all of them a short textual description that is slightly different from the one above for no apparent reason.
  • For the first two variants that wrap lower level errors, explicitly write logic saying that their wrapped errors are in fact their immediate causes.
  • For the same two lower level errors, explicitly state that they are convertible into my Error type using those two first variants. That means a separate single-method 4-line impl for each.

This took 62 lines of mostly boilerplate and repetitive code. I do feel though that all of this not only should, but could be implemented as some heavily magical #[derive(Error)] macro, at least to an extent. Might be a good project in itself…

There is no try!

The try! macro goes a long way towards relieving you of the burden of doing the obvious thing most of the time you encounter an unexpected error, namely returning it immediately up the stack:

fn foo() -> Result {
    let x = try!(bar()); // checks if bar() resulted in an error and `return`s it, if yes
    // work with an unwrapped value of x safely
}

However, since it expands into code containing return Result::Err(...), it only works inside a function that returns Result. Alas, the core method of Iterator — next() — is defined to return a different type, Option. Which means that you can't use try! if you're implementing an iterator. So I had to write my own local variety — itry!.

Stopping iterators

Another probl[...]
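
For illustration, a sketch of what such an itry! could look like (my own guess at the shape, assuming the iterator yields Results; the macro in ijson may differ):

// Like `try!`, but wraps the early-returned error in `Some(...)` so it can be
// used inside `Iterator::next()`, which returns an Option.
macro_rules! itry {
    ($expr:expr) => {
        match $expr {
            Ok(value) => value,
            // assumes the iterator's Item is a Result<_, Error>
            Err(err) => return Some(Err(err.into())),
        }
    };
}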



Versioning REST: another angle
I've got an interesting comment on "Versioning REST APIs" that boils down to these points: Sometimes you can't really afford breaking clients (ever, or long enough to make no matter). A global version allows to freeze an old code base and have new implementation to be completely independent. This is ...

I've got an interesting comment on "Versioning REST APIs" that boils down to these points:

  1. Sometimes you can't really afford breaking clients (ever, or for long enough that it makes no difference).
  2. A global version allows you to freeze an old code base and have the new implementation be completely independent.

This is a different situation from the one I had in mind where breaking changes do eventually happen at some point.

Technically…

Technically, no matter how you look at the issue, per-resource versioning gives you more flexibility than a global version number: changing the version of every resource representation has the same effect as changing the global number.

From a practical standpoint it is as simple as having code somewhere checking the Accept header in all requests:

Accept: application/mytype+json; version=2

… and doing different things or even dispatching to completely different services depending on the version.

Even if you want to invent a completely different URL scheme for your API, it's still technically either changing representations of existing resources or adding new ones (we can't remove anything under condition 1).
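
As a rough, framework-free sketch of that dispatching (illustrative names only, not tied to any particular HTTP library):

// Pull the `version` parameter out of the Accept header and route on it.
fn accept_version(accept: &str) -> Option<u32> {
    accept
        .split(';')
        .map(str::trim)
        .find_map(|p| p.strip_prefix("version="))
        .and_then(|v| v.parse().ok())
}

fn handle(accept: &str) -> &'static str {
    match accept_version(accept).unwrap_or(1) {
        1 => "dispatch to the legacy implementation",
        _ => "dispatch to the new implementation",
    }
}

fn main() {
    assert_eq!(accept_version("application/mytype+json; version=2"), Some(2));
    println!("{}", handle("application/mytype+json; version=2"));
}

A real service would do the same inside whatever routing layer it already has, possibly per resource.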

However…

I could see a tangential benefit in having a global version only in cases such as these:

  • You want to abandon the old code base and don't even want to maintain header-based routing code in your new one.
  • You're changing semantics of resources, effectively making it a completely different product.

But replacing the whole Universe with a new one does (or should) happen much less often than resource-local breaking changes that could be handled by per-resource versioning without affecting the whole API. If a global version is your only mechanism, then you have to change worlds all the time, even when you don't have to.




Versioning: follow-up
After reading a few comments on reddit and by email about my post on versioning of REST APIs I see that I wasn't clear on terminology and have left out some context. That's okay! Overthinking details is the main killer of all my interesting thoughts :-) I'd rather post more ...

After reading a few comments on reddit and by email about my post on versioning of REST APIs I see that I wasn't clear on terminology and have left out some context. That's okay! Overthinking details is the main killer of all my interesting thoughts :-) I'd rather post more and repeat myself on occasion.

What's "backwards incompatible", exactly?

The main point of contention was that there's no problem, really: if an API changes incompatibly, existing clients could just use the old version of the API. Hence, a version bump in the URL is seen as a solution, letting many versions co-exist in the public space. There was even a suggestion to keep all versions indefinitely.

Well, consider these situations, off the top of my head:

  • You want to remove a feature from an API for business reasons. Keeping an old version around simply defeats the purpose.

  • Someone found a security hole in an old version. Even if there's no bug anymore in the current version, you will have to dig through all the active old ones and patch them. The more of them there are, and the older they are, the more costly it becomes. Sometimes a fix would make a version backwards incompatible which, again, defeats the purpose of keeping it.

  • With growing load, your API needs a performance overhaul by, say, replacing granular update operations with a bulk one. Even if you could technically keep the old version active, you don't want that, because the point is to get rid of the performance bottleneck.

I'm going to go ahead and even postulate this:

if you are able to keep an old version of the API working alongside a new one, you shouldn't make the change backwards incompatible at all.

Why invalidate clients

So, when old functionality eventually dies, even after you supported it for some time, the clients that use it are invalidated. It is not a question of "if".

The question is, then, "how":

  • You either explicitly change the version number in the root of your URLs, or
  • you change versions and/or representations of only those resources that have suffered the change.

The crux of my argument in the previous post was that the latter is better because it avoids invalidating clients that don't use the broken resource.

An easy-to-imagine example would be a read/write API, like Twitter's. If you change the writing side, your clients that only read don't even need to know about it. Which may be a majority of them. Or a change in the authentication protocol that doesn't affect anonymous clients.

Miscellaneous

This bit caused some confusion, too:

It constrains development on the server as you have to synchronize releases of independent backwards incompatible features to fit them into one version change.

Here, I shouldn't have used the word "have". What I meant is that since you want to minimize the number of breaking changes, you might want to shove many little breaking changes that have been cooking up in your code base into one public update. This requires synchronizing.

But I admit that, while completely real, it's rather a weak point.




Versioning of REST APIs
Don't version APIs, version resources. I.e., this is wrong: https://example.com/api/v1/resource Global version number has a few problems: A backwards incompatible change to any one resource invalidates all clients, even those who don't use this particular resource. This is unnecessary maintenance burden on client developers. It constrains development on the server ...

Don't version APIs, version resources. I.e., this is wrong:

https://example.com/api/v1/resource

A global version number has a few problems:

  • A backwards incompatible change to any one resource invalidates all clients, even those who don't use this particular resource. This is unnecessary maintenance burden on client developers.

  • It constrains development on the server as you have to synchronize releases of independent backwards incompatible features to fit them into one version change.

  • You have to maintain several versions of your whole API code base simultaneously.

Maybe there's more, but the first problem is quite enough, if you ask me. And I can't think of a single disadvantage of versioning resources independently.

Versioning resources

Technically, you can do this:

Content-type: application/myformat+json; version=2.0

… or this:

{
  "version": "2.0",
  "payload": { ... }
}

… or even this:

https://example.com/api/resource?version=2.0

It doesn't matter. Since we don't live in the pure-REST utopia predominantly using well-defined MIME types, your representation formats are probably custom anyway, so your versioning is going to be custom as well. As long as it's documented, it's fine.

Implicit versioning

Instead of giving your format an explicit version number, you can change the data in a way that old clients simply can't use it. For example, if you've subtly changed the format of a date field, change the name of the field too:

Version 1:

{"time": "2012-01-01T00:00"}

Version 2:

{"utctime": "2012-01-01T08:00"}

You could even support both fields for some time with a deprecation warning in the docs.
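
A sketch of what a tolerant client could do during such a transition (this assumes the serde_json crate and the hypothetical field names from the example above):

fn read_time(obj: &serde_json::Value) -> Option<&str> {
    obj.get("utctime")                // prefer the new field
        .or_else(|| obj.get("time"))  // fall back to the deprecated one
        .and_then(|v| v.as_str())
}

Old payloads keep working, new payloads are preferred, and the fallback can be deleted when the deprecation window closes.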

Even though I like this approach more than numbering, I'm not going to defend it to the death. I think it boils down to the kind of developers you want to cater to (including your own). Some people believe that everything should be declared in advance, preferably with a formal schema. But schemas are harder to maintain. Other people rely on catching exceptions at run-time and introspection. But run-time sometimes means "production" and bad things might happen more often than you'd like.




ijson in Rust: object builder
Object builder is what makes a parser actually useful: it takes a sequence of raw parser events and creates strings, numbers, arrays, maps, etc. With this, ijson is functionally complete. Filtering Ijson can filter parser events based on a path down JSON hierarchy. For example, in a file like this: ...

Object builder is what makes a parser actually useful: it takes a sequence of raw parser events and creates strings, numbers, arrays, maps, etc. With this, ijson is functionally complete.

Filtering

Ijson can filter parser events based on a path down the JSON hierarchy. For example, in a file like this:

[
  {
    "name": "John",
    "friends": [ ... ]
  },
  {
    "name": "Mary",
    "friends": [ ... ]
  }
  ... gazillion more records ...
]

… you would get events with their corresponding paths:

""                   StartArray
"item"               StartMap
"item"               Key("name")
"item.name"          String("John")
"item"               Key("friends")
"item.friends"       StartArray
"item.friends.item"  ...
"item.friends"       EndArray
"item"               EndMap
...
""                   EndArray

In Python I implemented this by simply pairing each event with its path string in a tuple (path, event). In Rust though it feels wrong. The language makes you very conscious of every memory allocation you make, so before even having run a single benchmark I already worry about creating a new throwaway String instance for every event.

Instead, my filtering parser now accepts a target path, splits it and keeps it as a vector of strings, which it then compares with a running path stack that it maintains through the whole iterating process. Maintaining a path stack — also a vector of strings — still feels slow, but at least I don't join those strings constantly for the sole purpose of comparing.

By the way, I was pleasantly surprised to find two handy functions in Rust's stdlib:

  • String::split_terminator, which works better than the regular String::split for empty strings, as I want an empty vector in this case:

    "".split(".")            // [""]
    "".split_terminator(".") // []

  • Vec::starts_with, which has the same semantics as String::starts_with but compares values in a vector. Python doesn't have it, so I somewhat hastily implemented it myself only to find it in the docs after it was done. Oh well :-)

Building objects

By now I've flexed my instincts enough that I could write the builder function recursively. It might not seem like a big achievement but I still remember the times just a few weeks ago when I just couldn't persuade the borrow checker to let me do something very similar while I was writing the parser! Now I can't even remember what the problem was. Something silly, for sure…

The function itself is short but convoluted, with slightly ugly differences between handling arrays and maps (the latter even has the unreachable! kludge to satisfy the compiler).

Magical unwrapping

There's a general problem with deserializing any stream of bytes in a statically typed language: what type should a hypothetical parse_json(blob) return? The answer is, it depends on whatever is in the "blob", and you don't know that in advance. As far as I know there are two ways of dealing with it:

  1. Wrap all possible value types in a tagged union and confine yourself to tedious unwrapping of values on every access: value.as_array().get(0).as_map().get("key").as_int().
  2. Provide a schema for every format you expect from the wire and let some tool generate typed code deserializing bytes into native values of known types.

Since I'm writing a generic JSON parser I went ahead with wrapped values, leaving unwrapping to a consumer of the library. But then I've found a magical (if badly named) library — rustc-serialize that can automaticall[...]
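
A small sketch of the path handling described above (my own illustration, not ijson's actual code): split the target path once, keep a running stack, and compare vectors instead of joining strings for every event.

fn on_target_path(target: &[String], stack: &[String]) -> bool {
    stack.starts_with(target)
}

fn main() {
    let target: Vec<String> =
        "item.friends".split_terminator('.').map(|s| s.to_string()).collect();
    let stack = vec!["item".to_string(), "friends".to_string(), "item".to_string()];
    assert!(on_target_path(&target, &stack));

    // split_terminator gives an empty vector for the root path "",
    // which is exactly what we want here.
    let root: Vec<String> = "".split_terminator('.').map(|s| s.to_string()).collect();
    assert!(root.is_empty());
}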



ijson in Rust: unescape
Today marks a milestone: with implementation of string unescaping my json parser actually produces entirely correct output! Which doesn't necessarily mean it's easy to use or particularly fast yet. But one step at a time :-) The code came out rather small, here's the whole function (source): fn unescape(s: &str) ...

Today marks a milestone: with the implementation of string unescaping my json parser actually produces entirely correct output! Which doesn't necessarily mean it's easy to use or particularly fast yet. But one step at a time :-)

The code came out rather small, here's the whole function (source):

fn unescape(s: &str) -> String {
    let mut result = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(ch) = chars.next() {
        result.push(
            if ch != '\\' {
                ch
            } else {
                match chars.next() {
                    Some('u') => {
                        let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
                        char::from_u32(value).unwrap()
                    }
                    Some('b') => '\x08',
                    Some('f') => '\x0c',
                    Some('n') => '\n',
                    Some('r') => '\r',
                    Some('t') => '\t',
                    Some(ch) => ch,
                    _ => panic!("Malformed escape"),
                }
            }
        )
    }
    result
}

Likes

Luckily I can make an educated guess about how much memory my resulting string will occupy and allocate it at once with String::with_capacity(). It works because s.len() gives me the length of a UTF-8 string in bytes, so my output is guaranteed to be equal to or smaller than the source, because:

  • raw UTF-8 characters are left intact
  • \n, \t, etc. are translated into one byte from two
  • \uXXXX becomes a UTF-8 sequence which occupies less than or equal to the original 6 bytes

Look ma, no re-allocations!

Char by char iteration

I seriously don't like having to result.push every single byte even for strings containing no \-escapes whatsoever (which is the vast majority of strings in real-world JSON). I'd like to be able to walk through a source string and either a) copy chunks between \ in bulk or b) if there's none found, simply return the source slice, converting it to an owned string with to_owned(). But I wasn't yet able to figure out how to approach that.

By the way, I find while let Some(ch) = chars.next() rather brilliant! It loops as long as the iterator returns something that can be destructured into a usable value and handily binds the latter to a local var.

Also, XMPPwocky at the #rust IRC channel suggested "to write something on top of a Reader" and "specifically something over a Cursor>, actually". Though that was prompted by an entirely different discussion.

Non-obvious .by_ref()

There's this long line in the middle that converts four bytes after \u into a corresponding char:

let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());

What happened without by_ref() was that this line stole ownership of chars from the outer while loop, and Rust didn't let me use chars anywhere else. If you aren't familiar with the concept of ownership in Rust, head over to the official explanation.

That was rather surprising because my gut feeling is (or was) that .take(4) is hardly any different than calling .next() four times in a loop, and yet the latter leaves the original iterator alone with its owner.

Hex conversion

You may notice that I convert hex numbers into chars manually with .fold() (aka "reduce" in other languages) even though Rust has from_str_radix(16) for that. I used it at fir[...]
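
As a side note, here's a tiny standalone illustration of the .by_ref() point (not from the parser itself): .take(4) consumes its iterator by value, so without .by_ref() it would move chars out of the surrounding scope.

fn main() {
    let s = "0041rest";
    let mut chars = s.chars();
    let first_four: String = chars.by_ref().take(4).collect();
    assert_eq!(first_four, "0041");
    // `chars` is still usable here thanks to by_ref():
    assert_eq!(chars.as_str(), "rest");
}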



ijson in Rust: the parser
It took me a while but I've finally implemented a working parser on top of the lexer in my little Rust learning project. I learned a lot and feel much more comfortable with the language by now. In the meantime I even managed to get out to a Rust Seattle ...

It took me a while but I've finally implemented a working parser on top of the lexer in my little Rust learning project. I learned a lot and feel much more comfortable with the language by now. In the meantime I even managed to get out to a Rust Seattle meetup, meet new folks and share an idea about doing some coding together in the future. Let's see how it'll work out.

Power of yield

First, a digression. It's not about Rust at all, but I'll get to my point eventually, I promise!

When I was coding the same problem in Python I didn't fully appreciate the expressive power of generators, I simply used them because it seemed to be the most natural way. Have a look:

def parse_value(lexer, symbol=None):
    # ...
    elif symbol == '[':
        yield from parse_array(lexer)
    elif symbol == '{':
        yield from parse_object(lexer)
    # ...

def parse_array(lexer):
    yield ('start_array', None)
    symbol = next(lexer)
    if symbol != ']':
        while True:
            yield from parse_value(lexer, symbol)
            symbol = next(lexer)
            if symbol == ']':
                break
            if symbol != ',':
                raise UnexpectedSymbol(symbol)
            symbol = next(lexer)
    yield ('end_array', None)

Parsing JSON (or almost anything, for that matter) requires keeping a state describing where in the structure you are now: are you expecting a scalar value, or an object key, or a comma, etc. Also, since arrays and objects can be nested you have to keep track of them opening and closing in the correct order in some sort of stack. The magic of the yield keyword lets you leave a function and then return to the same place, implicitly giving you both the state and the stack for free:

  • The state is represented by an execution point. For example, after you yield a 'start_array' event, the next iteration will continue from the same place, ready to check for a closing bracket or the first value in the array. In other words, the current state is described by the last executed yield.
  • The stack is represented by, well, your runtime's call stack: you call parse_array from parse_value and you'll be back whenever that nested array finishes parsing. No need to check for wrong values or an empty stack.

With both of those facilities out of the way the code simply represents the grammar in the natural order. All of that thanks to the semantics of yield.

Going the hard, explicit, low-level way

Rust doesn't have yield. Which means iteration is implemented by repeatedly calling the next() function, and you have to explicitly keep both the state and the stack in the iterator object between the calls. This is where I spent some number of days trying different approaches, figuring out which states I need, whether I need a loop processing non-significant lexemes like : and , or whether a recursive call to next() would do the job, stuff like that.

It was hard, partly because of my unfamiliarity with the language and partly because I'm definitely not the best algorithmist out there. But ultimately there's always a price you pay in productivity when working in a typed language, especially one with a strict policy on using pointers.

In Python, I tend to work top-down, starting with roughly sketching the whole algorithm, making it just barely runnable to see if it works at all as early as possible. I rely on the language letting me be sloppy with types and error handling and leaving whole parts of the code essentially no[...]
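
For illustration, a minimal sketch (not ijson's actual types) of what a yield-less parser ends up carrying around explicitly: the current state and the container stack.

#[derive(Debug)]
enum State {
    Value,  // expecting any value
    Key,    // expecting an object key or a closing brace
    Closed, // the top-level value has been fully parsed
}

#[derive(Debug)]
enum Container {
    Array,
    Object,
}

#[derive(Debug)]
struct ParserState {
    state: State,
    stack: Vec<Container>, // stands in for the call stack that `yield` gave us in Python
}

fn main() {
    // At the start we expect a single top-level value and are inside nothing.
    let p = ParserState { state: State::Value, stack: Vec::new() };
    println!("{:?}", p);
}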



Styles unification: first results
Yesterday I gathered some willpower and began working on a long awaited (by myself, at the least) style unification in highlight. Here's the first taste of why I think it is important. Let's take one of the recently added style — the "Android Studio" — and see how it displays ...

Yesterday I gathered some willpower and began working on a long-awaited (by myself, at least) style unification in highlight. Here's a first taste of why I think it is important.

Let's take one of the recently added styles — "Android Studio" — and see how it displays two config languages that happen not to count as "hot" these days: Apache and .Ini:

(image)

(image)

  • Section headers, variable expansions, rewrite flags aren't highlighted at all.
  • Pre-defined literals ("True", "on") are highlighted in .Ini using the same color as directive names in Apache.

To fix this particular case I had to define semantics for the classes "section", "meta", "variable", "name" and "literal", and drop all the Apache- and .Ini-specific rules from the styles.

Here's how it looks now, nice and consistent:

(image)

(image)

There's a looong road ahead, but once it's done, designing a new style will be a matter of using a relatively short list of well-documented classes, with a good guarantee that all languages will look decent.




ijson in Rust
I had this idea for a while to learn a few modern languages by porting my iterative JSON parser written in Python as a test case. It only makes sense because, unlike most tutorials, it provides you with a real real-world problem and in the end you might also get ...

I had this idea for a while to learn a few modern languages by porting my iterative JSON parser written in Python as a test case. It only makes sense because, unlike most tutorials, it provides you with a real real-world problem and in the end you might also get a useful piece of code. I started with Rust, and I already have plans to do the same with Go and Clojure afterwards.

I won't be giving you any introduction to Rust though, as there are a lot of those around the Web. I'll try to share what I didn't find in those.

Resources

The online book is very good starting material, it gives a wide shallow overview of the language principles and provides pointers on where to go next.

The API docs are essential but are hard to navigate for a beginner because Rust tends to implement everything in myriads of small interfaces. You can't simply have a flat list of everything you can do with, say, a String. Instead, you get the top level of a non-obvious hierarchy of features. One of the ways around that is asking Google things like "rust convert int to str" or "rust filter sequence".

When Google doesn't help you've got a user forum and the IRC channel #rust on irc.mozilla.org. Both are very much alive and haven't yet failed me a single time!

Lexer

After a few days of fumbling around, feeling incredibly dense and switching from condemning the language to praising it every 15 minutes, I've got a working JSON lexer. It's still in the playground mode: short, clumsy and not structured in any meaningful way. Following are some notes on the language.

Complexity

The amount of control you have over things is staggering. Which is another way of saying that the language is rather complicated. From the get-go you worry about the difference between strings and string slices (pointers into strings), values on the heap ("boxed") vs. values on the stack, and dynamic vs. static dispatch for calling methods of traits. Feels very opposite to what I'm used to in Python, but on the other hand I did try to come out of my comfort zone :-)

Here's a little taste of that. The Lexer owns a fixed buffer of bytes within which it searches for lexemes and returns them one by one as steps of an iteration. So my first idea was to define an iterator of pointers ("slices" in Rust parlance) into that buffer to avoid copying each lexeme into its own separate object:

impl Iterator for Lexer {
    type Item = &[u8]; // a "pointer" to an array of unsigned bytes
    // ...
}

This turns out to be impossible, because Rust wants to know the lifetime of pointers but in this case it simply can't tell how a yielded pointer is related to the lifetime of the Lexer's internal buffer. It doesn't know who created that Lexer object and how, it is not guaranteed to be the same block of code that now iterates over it. Since you can't have a pointer to something in limbo, you have to construct a dynamic, ownable vector of bytes and return it from an iteration step, so a consumer would hold onto it independently of the source buffer:

impl Iterator for Lexer {
    type Item = Vec<u8>; // a growable vector of bytes

    fn next(&mut self) -> Option<Vec<u8>> { // don't mind the Option<> part
        let mut result = vec![];
        // ....
        result.extend(self.buf[start..self.pos].iter().cloned()); // more on that later…
        Some(result)
    }
}

By the way, this is the kind [...]



Problem with JSON encoding
JSON spec says that a UTF shall be used to encode it as a byte stream (with a strong preference to UTF-8). However it also allows characters to be encoded in the form of \uXXXX. This poses a problem on the parser implementation side: You have to decode a byte ...

The JSON spec says that a UTF encoding shall be used to encode it as a byte stream (with a strong preference for UTF-8). However, it also allows characters to be encoded in the form of \uXXXX. This poses a problem on the parser implementation side:

  • You have to decode a byte stream twice: first convert a UTF encoding to Unicode code points and then replace any \uXXXX sequences with a single code point.

    The \u encoding is redundant as you can encode all of Unicode in a UTF just fine. The only technical reason that I can see in the RFC for this is that a UTF encoding is a SHALL, not a MUST.

  • Your language runtime probably doesn't make the second step easy.

    Modern languages that I'm familiar with use distinct data types for UTF-encoded sequences of bytes and for Unicode characters. So even if your runtime has a built-in codec for \uXXXX escapes, it probably expects a sequence of bytes as input to produce a sequence of Unicode characters as output. But treating your input stream first as UTF-encoded produces those \uXXXX as characters already, not bytes. So you can't use your library codec and have to decode those manually, which is brittle and silly (see the small illustration after this list).
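
A tiny standalone illustration of those two steps (an assumed input string, not a real parser):

// After the byte stream is decoded as UTF-8, the \uXXXX escape is still sitting
// there as plain characters and needs a second, hand-written pass.
fn main() {
    let raw = r#""\u0416 is the same letter as Ж""#.as_bytes();
    let decoded = std::str::from_utf8(raw).unwrap(); // step 1: bytes -> characters
    assert!(decoded.contains(r"\u0416"));            // step 2 still has work to do
}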

I'm writing this as a comment to Tim Bray's "I-JSON", which is an emerging description of a stricter JSON aimed at avoiding interoperability problems with JSON (RFC 7493). So here's my comment/proposal in a nutshell:

  1. Mandate a UTF encoding with a "MUST", not a "SHALL".
  2. Drop the \uXXXX escapes.

Makes sense?




I learned C# in 4 days!
You know those crazy books, "Learn whatever programming in 21 days"? I mean, who can afford spending that much time, right? Some background I have a friend who employs a very particular workflow for dealing with his digital photos. It often involves renaming and merging files from different cameras into ...

You know those crazy books, "Learn whatever programming in 21 days"? I mean, who can afford spending that much time, right?

Some background

I have a friend who employs a very particular workflow for dealing with his digital photos. It often involves renaming and merging files from different cameras into a single chronologically ordered event, relying on natural sorting of file names in Windows Explorer. File names are constructed of picture time fields and running counters, like "2015-02-06_001.jpg".

This is of course too tedious to do by hand, so he was very happy with a small specialized Windows utility that I wrote for him a few years ago, when Windows XP ruled the world and I still programmed in Delphi. The program worked fine until, with the natural flow of time, the world switched to Unicode and newer Windows started to display question marks in place of Cyrillic characters in the program's UI. This made it rather unusable. There were also other small and not so small imperfections about the program that, as I understand, added a considerable factor of irritation to the act of processing photos. ("And when it happens upon a panoramic shot you can as well go and pour yourself some coffee, because the UI is frozen for minutes while loading the preview…")

So a year ago, when we were visiting his family for Christmas, he nagged me, politely but emphatically, about at least making the UI readable again and also, just maybe, fixing some of the most outrageous annoyances uncovered over the years of usage.

The only problem was… I've lost the source code! I know, it might sound utterly unbelievable these days but it was written in the era before GitHub, and back in those days I've been using — wait for it — Zip drives to store my backups. Which in hindsight turned out to be suboptimal: they fail.

All this, however, provided me with a unique opportunity for making a really good Christmas gift this year… I suppose there exist people out there who could come up instantly with a perfect gift idea for any of their dozens of friends upon being woken up in the middle of the day, but most of us seem to be destined to endure the agony of scratching the bottom of the void bowl of "what on Earth should we give them this time that won't suck like the last time!" So I was pretty much stoked when, some weeks before we were about to leave for the trip, it hit me that I actually could write the same program from scratch!

And I'm happy to say that ultimately the idea did work out as intended, and at some point it has even been uttered that it was "the best gift ever!" The best thing though is that now I can actually maintain the code (which I'm doing once a week these days) and not feel sorry for writing another half-working utility. Software is a process, after all.

The endeavor

So I had to learn how to write Windows GUI apps, again. Going back to Delphi was pretty much out of the question as even back in the day it was already losing mind share to the quickly rising C#, and I simply assumed that by now this process has completed. Besides, I actually wanted to learn how Windows GUI programming is "officially" done these days. (Notwithstanding the fact that we're still talking about traditional desktop software, not Metro tiles.)

The lazy evaluation phase took me a couple of [...]



On automation tools
Ever since I made an automatic publishing infrastructure for highlight releases… there wasn't a single time when it really worked automatically as planned! There was always something: directory structure changes that require updates to the automation tool itself, botched release tagging, out of date dependencies on the server, our CDN ...

Ever since I made an automatic publishing infrastructure for highlight releases… there wasn't a single time when it really worked automatically as planned!

There was always something: directory structure changes that require updates to the automation tool itself, botched release tagging, out of date dependencies on the server, our CDN partners having their own bugs with automatic updates, etc. In my own words:

But! It still makes sense because by introducing automation you can do more complex things at the next level and keep maintenance essentially constant. In fact, if your maintenance is much lower than you can afford it means that you don't do enough interesting stuff :-).




"Shallow" reviews
Here's the question: Why write articles like "I spent two days with that tech and here's what I think" and why they seem to be so popular? The "Rust and Go" is a recent example of such an article. It's not really shallow I, for one, think it's an excellent ...

Here's the question: why write articles like "I spent two days with that tech and here's what I think", and why do they seem to be so popular? The "Rust and Go" post is a recent example of such an article.

It's not really shallow

I, for one, think it's an excellent article. Though it doesn't say anything that you wouldn't learn within the first hour or two of simply researching those languages on the Web, it expresses a few things remarkably well:

  • The title puts forth the idea that those two languages are directly comparable. There are many new languages out there; we seem to live in a sort of Cambrian explosion of them at the moment. And it's important for an engineer who tries to look outside of the familiar toolset to know what to look at first. The article is not called "Rust and CoffeeScript" or "Clojure and Go", it shows two comparable things that are indeed worth looking at.

  • The article doesn't just list language features for the sake of mentioning them. It provides useful analogies for people versed in other technologies and highlights the fact that Go and Rust live at different levels of abstraction: Go is a simpler, less restrictive, trust-the-programmer language; Rust is harder to learn, with a clever compiler effectively eliminating whole classes of common memory and condition-checking errors. It gives me, as a newcomer, a starting point with familiar attributes that I can use to dig deeper.

Education matters

Being an expert in anything makes it harder to appreciate what's useful to people at earlier stages of learning. But those stages are still important, because you can't become an expert without going through all of them. This is why it makes sense to write about something you're currently learning. Not because you know the subject better than anyone else but because you still remember what you didn't know a few days ago.

Move along

It's easy, really. If you've stumbled upon an article that seems obvious to you, congratulate yourself and move along. No need to bitch about it :-).