Updated: 2016-05-12T20:06:53+02:00


Safe String Theory for the Web


One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text-based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that still plague us are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.

Almost two years ago I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture. But I'm going to try anyway. The text is aimed at people who have had to deal with these issues and who are looking for a bit of formalism to frame their own solutions in.

The Problem

At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of data you are using trigger unknown effects, because they have an unexpected meaning to the client.

For example, this little PHP snippet, repeated in variations across the web:

<?php echo $user->name ?>'s profile

If $user->name contains malicious JavaScript, your users are screwed.

What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...

The Humble String

What exactly is a string? It seems like a trivial question, but I think a lot of people don't really know a good answer to it.
Here's mine:

A string is an arbitrary sequence (1) of characters composed from a given character set (2), which acquires meaning when placed in an appropriate context (3).

This definition covers three important aspects of strings:

- They have no intrinsic restrictions on their content.
- They are useless blobs of data unless you know which symbols they represent.
- The represented symbols are meaningless unless you know the context to interpret them in.

This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:

A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.

This latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function. So let's take a closer look at the three points above.

1. Representation of Symbols

They are useless blobs of data unless you know which symbols they represent.

This issue is relatively well known these days and is commonly described in terms of encodings and character sets. A character set is simply a huge, numbered list of characters to draw from; an 'alphabet' like ASCII or Unicode. An encoding is a mechanism for turning characters (i.e. numbers) into sequences of bits. For example, Latin-1 uses one byte per character, and UTF-8 uses 1-4 bytes per character. Theoretically, encodings and character sets are independent of each other, but in practice the two terms are used interchangeably to describe one particular pair.

You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great crash course on Unicode, which explains it much better than I could. The main thing to take away is that every legacy character set in use can be converted to Unicode and back again without loss. That makes Unicode the only sensible choice as the internal character set of a program, and o[...]
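
As a rough illustration of the character set versus encoding distinction described above, here is a minimal Python sketch. Python and the sample word are my own choices for illustration and do not come from the original text. It shows the same characters as numbers (Unicode code points), then as bytes under Latin-1 and UTF-8, and what happens when bytes are decoded with the wrong encoding.

```python
# Illustrative sketch only: the sample text and variable names are assumptions.
# "caf\u00e9" is the word "cafe" with an acute accent on the final e.
text = "caf\u00e9"

# Character set view: every character is just a number (a Unicode code point).
code_points = [ord(ch) for ch in text]

# Encoding view: the same numbers serialized into bytes in two different ways.
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # one to four bytes per character

print(code_points)        # [99, 97, 102, 233]
print(len(latin1_bytes))  # 4
print(len(utf8_bytes))    # 5

# Mismatched encodings: reading UTF-8 bytes as Latin-1 does not fail,
# it silently produces mangled text (two characters instead of one).
mangled = utf8_bytes.decode("latin-1")
print(len(mangled))       # 5

# Round trip: legacy bytes -> Unicode string -> legacy bytes, without loss.
round_trip = latin1_bytes.decode("latin-1").encode("latin-1")
print(round_trip == latin1_bytes)  # True
```

The mangled result is the kind of garbage that shows up on pages where the declared encoding and the actual bytes disagree: nothing raises an error, the symbols are simply interpreted wrongly.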