Updated: 2016-05-12T20:06:53+02:00


Safe String Theory for the Web


One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text-based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that still plague us are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.

Almost two years ago I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture. But I'm going to try anyway. The text is aimed at people who have had to deal with these issues and who are looking for a bit of formalism to frame their own solutions in.

The Problem

At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of data you are using are triggering unintended effects, because they have an unexpected meaning to the client.

For example, this little PHP snippet, repeated in variations across the web:

    <?php print $user->name ?>'s profile

If $user->name contains malicious JavaScript, your users are screwed.

What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...

The Humble String

What exactly is a string? It seems like a trivial question, but I think a lot of people don't know a good answer to it.
Here's mine:

    A string is an arbitrary sequence (1) of characters composed from a given character set (2), which acquires meaning when placed in an appropriate context (3).

This definition covers three important aspects of strings:

(1) They have no intrinsic restrictions on their content.
(2) They are useless blobs of data unless you know which symbols they represent.
(3) The represented symbols are meaningless unless you know the context to interpret them in.

This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:

    A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.

This latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function. So let's take a closer look at the three points above.

1. Representation of Symbols

    They are useless blobs of data unless you know which symbols they represent.

This issue is relatively well known these days and is commonly described as encodings and character sets. A character set is simply a huge, numbered list of characters to draw from; an 'alphabet' like ASCII or Unicode. An encoding is a mechanism for turning characters (i.e. numbers) into sequences of bits. For example, Latin-1 uses one byte per character, and UTF-8 uses 1-4 bytes per character. Theoretically, encodings and character sets are independent of each other, but in practice the two terms are used interchangeably to describe one particular pair.

You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great crash course on Unicode, which explains it much better than I could. The main thing to take away is that every legacy character set in use can be converted to Unicode and back again without loss. That makes Unicode the only sensible choice as the internal character set of a program, and o[...]
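
To make the character set versus encoding distinction from section 1 concrete, here is a minimal PHP sketch. It assumes PHP 7 or later with the mbstring extension loaded; the example character and all variable names are illustrative choices, not taken from the original text.

```php
<?php
// One character from the Unicode character set: U+00E9 (e with acute accent).
// PHP 7+ turns the \u{...} escape into that character's UTF-8 bytes.
$utf8 = "\u{00E9}";

// The same character, encoded as Latin-1 (ISO-8859-1) instead.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // c3a9 (two bytes)
echo bin2hex($latin1), "\n";  // e9   (one byte)

// Converting the legacy encoding to Unicode and back loses nothing.
$back = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($back === $utf8);    // bool(true)

// Mismatched encodings: treating UTF-8 bytes as if they were Latin-1
// yields two characters instead of one, the familiar mangled text.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mangled), "\n"; // c383c2a9
```

The bytes alone never say which reading is right; only the declared encoding does, which is why the same blob of data can come out correct or mangled.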

XSS & friends: Text Handling in PHP applications


Update: I jotted down some initial theory in my Safe String Theory for the Web post.

For a while now, there has been a lot of talk about XSS, aka cross-site scripting. In October 2005, an XSS worm nearly took down MySpace. Most XSS attacks, however, are not as benign as that one. They can be used to steal passwords and other sensitive information, perform distributed denial-of-service attacks on sites or generate fraudulent advertising income.

XSS problems are still rampant in many web applications today, with PHP applications being especially vulnerable. This has caused some to conclude that XSS problems are impossible to avoid, or at least impractical to audit for completely. However, from a purely technical standpoint, XSS problems are not unique at all. They belong to a wider class of security problems that stem from incorrect handling of user-supplied data (e.g. SQL command injection or e-mail header injection).
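
As a rough illustration of what "incorrect handling of user-supplied data" looks like in the HTML case, the sketch below contrasts plain concatenation with escaping for the output context. The variable names, the sample payload and the helper `html_text()` are hypothetical; `htmlspecialchars()` is PHP's built-in function for this, and this is only one possible way to apply it.

```php
<?php
// Hypothetical user-supplied value, e.g. a profile name from a form.
$name = '<script>alert(document.cookie)</script>';

// Unsafe: the data is pasted into HTML as-is, so the browser
// interprets it as markup and runs the script.
$unsafe = '<h2>' . $name . "'s profile</h2>";

// Hypothetical helper: escape a plain-text string for use as HTML text.
function html_text(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Safer: the same data, escaped for the HTML context it is placed in.
$safe = '<h2>' . html_text($name) . "'s profile</h2>";

echo $safe, "\n";
// <h2>&lt;script&gt;alert(document.cookie)&lt;/script&gt;'s profile</h2>
```

The data itself is not changed or rejected here; it is only represented differently so that the client reads it as text rather than as markup.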

So, what makes the web so tricky to secure? Is it because web programmers are inherently 'stupid' and can't 'code properly'? I don't think so.

However, I do think that most web languages (such as PHP) tend to promote a bad approach to coding and, by extension, to security. By letting the programmer jump in directly, learning as they go, most people never build up a complete overview of the programming environment, but simply tweak code 'until it works'. The same applies to security issues: when a bug is found, those people will just tweak a particular line of code until the problem goes away. They won't see the big picture and will make similar mistakes later.

Another serious problem in my opinion is that there is no well-defined vocabulary for the tools used to solve these problems. Umbrella words such as 'filtering' are all too often used and stand in the way of a more precise description. With only vague notions about 'validation', 'special characters' and 'escaping', you cannot understand what's really going on. Such a lack of insight also prevents people from seeing beyond individual issues.
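
To show how much these loosely used words can differ in practice, here is a small PHP sketch of one possible reading of 'validation', 'filtering' and 'escaping'. These readings are assumptions made for illustration, not definitions from this post, and the function `is_valid_username()` and its rule are hypothetical.

```php
<?php
$input = 'Tom & <b>Jerry</b>';

// Validation (as read here): accept or reject, never modify.
// Hypothetical rule: 1 to 32 ASCII letters, digits or underscores.
function is_valid_username(string $s): bool
{
    return preg_match('/^[A-Za-z0-9_]{1,32}$/', $s) === 1;
}
var_dump(is_valid_username($input)); // bool(false)

// Filtering (as read here): remove parts of the data. This is lossy.
$filtered = strip_tags($input);
echo $filtered, "\n"; // Tom & Jerry

// Escaping (as read here): re-encode the data for one specific context
// (HTML text in this case). Nothing is lost; it can be decoded again.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // Tom &amp; &lt;b&gt;Jerry&lt;/b&gt;
var_dump(htmlspecialchars_decode($escaped, ENT_QUOTES) === $input); // bool(true)
```

Three operations that often get lumped together produce three different results from the same input, which is why a more precise vocabulary helps.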

So I've decided I want to build up a more formalized explanation of text handling. Expect one or more blog posts about this in the future. At least the next time people "lock up" on me, I can point them somewhere.