
Updated: 2016-05-12T20:06:53+02:00


Safe String Theory for the Web


One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text-based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that still plague us are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.

Almost two years ago I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture. But I'm going to try anyway. The text is aimed at people who have had to deal with these issues and who are looking for a bit of formalism to frame their own solutions in.

The Problem

At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of data you are using are triggering unknown effects, because they have an unexpected meaning to the client.

For example this little PHP snippet, repeated in variations across the web:

  <?php print $user->name ?>'s profile

If $user->name contains malicious JavaScript, your users are screwed.

What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...

The Humble String

What exactly is a string? It seems like a trivial question, but I really think a lot of people don't know a good answer to this.
Here's mine:

A string is an arbitrary sequence (1) of characters composed from a given character set (2), which acquires meaning when placed in an appropriate context (3).

This definition covers three important aspects of strings:

  1. They have no intrinsic restrictions on their content.
  2. They are useless blobs of data unless you know which symbols they represent.
  3. The represented symbols are meaningless unless you know the context to interpret them in.

This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:

A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.

This latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function. So let's take a closer look at the three points above.

1. Representation of Symbols

They are useless blobs of data unless you know which symbols they represent.

This issue is relatively well known these days and is commonly described as encodings and character sets. A character set is simply a huge, numbered list of characters to draw from; an 'alphabet' like ASCII or Unicode. An encoding is a mechanism for turning characters (i.e. numbers) into sequences of bits. For example, Latin-1 uses one byte per character, and UTF-8 uses 1-4 bytes per character. Theoretically encodings and character sets are independent of each other, but in practice the two terms are used interchangeably to describe one particular pair.

You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great crash course on Unicode, which explains it much better than I could. The main thing to take away is that every legacy character set in use can be converted to Unicode and back again without loss. That makes Unicode the only sensible choice as the internal character set of a program, and o[...]
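The difference between characters and their encoded bytes, as described above, can be sketched in a few lines of PHP. This is only an illustration: it assumes the mbstring extension is available, and the sample word and variable names are mine, not part of the original notes.

```php
<?php
// The same five-character word in two encodings.
// Byte escapes are used so the example does not depend on how this file is saved.
$utf8   = "na\xC3\xAFve";  // "naive" with a diaeresis; in UTF-8 the i-diaeresis takes two bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');  // in Latin-1 it is the single byte EF

// strlen() counts bytes, not characters.
$utf8_bytes   = strlen($utf8);
$latin1_bytes = strlen($latin1);

// mb_strlen() counts characters, but only because we tell it the encoding.
$utf8_chars   = mb_strlen($utf8, 'UTF-8');
$latin1_chars = mb_strlen($latin1, 'ISO-8859-1');

// Latin-1 -> Unicode (as UTF-8) -> Latin-1 should give back the same bytes.
$as_unicode = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$round_trip = mb_convert_encoding($as_unicode, 'ISO-8859-1', 'UTF-8');

printf("UTF-8: %d bytes, %d characters\n", $utf8_bytes, $utf8_chars);
printf("Latin-1: %d bytes, %d characters\n", $latin1_bytes, $latin1_chars);
```

The byte counts differ while the character counts agree, which is the point: without knowing the encoding, the bytes alone do not tell you what the string contains.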

Vancouver PHP Conference


Ahoy from the Vancouver PHP conference. I gave a talk titled "A Closer Look at Drupal 5" earlier. Overall response was positive, although according to Boris I wouldn't have managed to squeeze everything into one hour if I hadn't put on my zippy fast presentation speaking voice, so there might have been some information overload at times.

Oh well.. I figure anyone generally only remembers at most 50% of a talk, so I might as well blast you with a bunch of things and hope some of it sticks ;).

Thanks to Dries and James for letting me use their earlier presentations as a base.

The slides are no longer available by Dries' request, as he has had problems with people stealing slides without permission before. Sorry.

XSS & friends: Text Handling in PHP applications


Update: I jotted down some initial theory in my Safe String Theory for the Web post.

For a while now, a lot of talk has been going on about XSS, aka Cross Site Scripting. In October 2005, an XSS worm nearly took down MySpace. Most XSS attacks however are not as benevolent as that. They can be used to steal passwords and other sensitive information, perform distributed Denial-of-Service attacks on sites or generate fraudulent advertisement income.

XSS problems are still rampant in many web applications today though, with PHP applications being especially vulnerable. This has caused some to conclude that XSS problems are even impossible to avoid or at least impractical to completely audit for. However, from a purely technical standpoint, XSS problems are not unique at all. They belong to a wider class of security problems which stem from incorrect handling of user-supplied data (e.g. SQL command injection or e-mail header injection).

So, what makes the web so tricky to secure? Is it because web programmers are inherently 'stupid' and can't 'code properly'? I don't think so.

However, I do think that most web languages (such as PHP) tend to promote a bad approach to coding and, by extension, to security. By letting the programmer jump in directly, learning as they go, most people never build up a complete overview of the programming environment, but simply tweak code 'until it works'. The same applies to security issues: when a bug is found, those people will just tweak a particular line of code until the problem goes away. They won't see the big picture and will make similar mistakes later.

Another serious problem in my opinion is that there is no well-defined vocabulary for the tools used to solve these problems. Umbrella words such as 'filtering' are all too often used and stand in the way of a more precise description. With only vague notions about 'validation', 'special characters' and 'escaping', you cannot understand what's really going on. Such a lack of insight also prevents people from seeing beyond individual issues.

So I've decided I want to build up a more formalized explanation of text handling. Expect one or more blog posts about this in the future. At least the next time people "lock up" on me, I can point them somewhere.
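To give 'escaping' a more precise meaning than the umbrella terms above, here is a minimal sketch of the same piece of user-supplied data being prepared for two different output contexts. The sample value and variable names are hypothetical; htmlspecialchars() and rawurlencode() are standard PHP functions.

```php
<?php
// Hypothetical user-supplied value; in a real application it would come from a form or a database.
$name = '<script>alert("xss")</script> & friends';

// HTML text context: < > & and quotes are special here, so they are turned into entities.
$for_html = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// URL query-string context: a different set of characters is special, so a different escaping applies.
$for_url = rawurlencode($name);

echo '<p>Hello, ' . $for_html . "</p>\n";
// The percent-encoded value contains no HTML-special characters, so it can sit inside the attribute.
echo '<a href="/search?q=' . $for_url . "\">search</a>\n";
```

Neither result is 'the safe version' of the string. Each one is only correct for the context it was escaped for, which is why a single generic 'filter' step cannot replace knowing where the data ends up.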

Summer of Code - Ajax Functionality for Drupal


This last summer I was sponsored by Google as part of their Summer of Code program to work on Drupal. My goal was to introduce various AJAX functionalities to Drupal.

The official project description was:

"Drupal has recently begun to find meaningful ways to introduce AJAX functionality with the goal of improving the user experience. Work with Drupal's usability experts to identify the next steps and help implement new dynamic functions based on interaction with the XMLHttpRequest object."

I focused on the following Ajax-powered features:

  • Inline Editing of posts: Though I built a working prototype module, I decided not to develop this feature further because it is not flexible enough to work as a generic Drupal module. It would break on too many configurations and has limited usefulness anyhow.
  • Uploading of files: allows you to attach files to Drupal nodes (with upload.module) without having to reload the page.
  • Sorting tables inside a page: this changes the sort order of a table without reloading the entire page. It is not client-side sorting as you'd expect at first sight: because most tables in Drupal are spread across multiple pages, client-side sorting is not very useful.
  • Switching between multiple pages: this was implemented on top of the sorting functionality, and only works on paged tables (this covers most of the useful pagers though).
  • Progressbar widget: a typical progressbar that fetches the status from the server through Ajax.

The resulting code can be found in my sandbox in the Drupal contributions repository. Note however that most of the code is in patches against the (rapidly changing) Drupal HEAD, so they are likely to go out of date soon.

The file uploader is already part of Drupal HEAD, and at least the tablesorter is sitting in the patch queue being reviewed. I will try and keep them up to date.

A big thanks goes to Google for organising the Summer of Code!

Proposal for Implementing Unicode in PHP


On the Drupal team, I am known as an encoding nut: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me is accompanied by a lot of cursing to the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode". To which the reply is, more often than not: "Why don't you fix it yourself?".

Well, I'm not a PHP language developer. To be honest I have no interest in or time for becoming one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.

Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.

About encodings and Unicode

I recommend that anyone reading this article first read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. It is an excellent introduction to Unicode and encodings in general. Note also that the article was written in 2003 and specifically mentions PHP's Unicode support being hopeless. It is now two years later and the situation has not changed much.

The only important thing about Unicode which isn't explained in Joel's article is that Unicode is in fact more than just a big table which maps characters to numbers: it is also a set of character properties, recommendations and algorithms on how those characters should be used. And this is why Unicode needs (and deserves!) much more attention than any other character set.

What is the current situation?
As far as PHP is concerned at the moment, a character consists of 8 bits and a string is a series of characters. This is good enough for legacy 8-bit encodings (like the common ISO-8859-1 or Latin-1 encoding used in Western Europe), but does not cater to more complicated encodings.

To accommodate those, the multibyte string extension (Mbstring) can be used. This extension was originally developed for handling Japanese encodings, but it has now been extended to support many more encodings, including the Unicode Transformation Formats (like the popular UTF-8). Mbstring provides encoding-aware versions of many of PHP's string functions (substr(), strlen(), ereg(), ...). Through a feature called overloading, you can tell PHP to always use the Mbstring version of a function if there is one.

Aside from Mbstring, there are a few other libraries and extensions which may be used to provide encoding- and Unicode-related services, like Imap, Iconv or GNU Recode.

What problems are there with the current approach?

PHP itself still doesn't know anything about encodings or Unicode. Aside from function calls, there are other ways of interacting with strings in PHP. For example, there is the {} operator for selecting characters from strings, as if they were arrays. And like in most programming languages, you can define strings in code with the familiar quote syntax. But all of these methods work with literal bytes, not with actual encoded characters.

PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly. If you want to store a character in a variable, you either have to use a short string of bytes (the encoded representation of the character) or an integer representing the character's Unicode codepoint. But converting between a codepoint and its encoded representation requires ugly work-arounds and wrappers, as PHP itself provides no easy mechanism for doing this.

PHP does not guarantee anything about th[...]
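The kind of wrapper mentioned above, converting a codepoint to its encoded representation by hand, might look like the following sketch. The function name codepoint_to_utf8() is hypothetical and not a PHP built-in; it does no validation (surrogates and out-of-range values are not rejected). The [] offset syntax is used instead of {} because newer PHP versions no longer accept the curly-brace form.

```php
<?php
// Hypothetical helper (not a PHP built-in): encode one Unicode codepoint as UTF-8 bytes.
function codepoint_to_utf8($cp) {
  if ($cp < 0x80) {
    return chr($cp);                                   // 1 byte: plain ASCII
  }
  if ($cp < 0x800) {
    return chr(0xC0 | ($cp >> 6))                      // 2 bytes
         . chr(0x80 | ($cp & 0x3F));
  }
  if ($cp < 0x10000) {
    return chr(0xE0 | ($cp >> 12))                     // 3 bytes
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
  }
  return chr(0xF0 | ($cp >> 18))                       // 4 bytes
       . chr(0x80 | (($cp >> 12) & 0x3F))
       . chr(0x80 | (($cp >> 6) & 0x3F))
       . chr(0x80 | ($cp & 0x3F));
}

$euro = codepoint_to_utf8(0x20AC);        // U+20AC EURO SIGN
$text = 'price: ' . $euro . '5';

$byte_count = strlen($text);              // counts bytes
$char_count = mb_strlen($text, 'UTF-8');  // counts characters (needs mbstring)
$first_byte = $text[7];                   // offset access returns one byte, not one character

printf("%d bytes, %d characters, byte at offset 7: %s\n",
  $byte_count, $char_count, bin2hex($first_byte));
```

The offset access returns only the first byte of the euro sign, which on its own is not a valid character in any Unicode encoding. That is the "literal bytes, not encoded characters" behaviour described above.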

PHP, Unicode and ostriches.


Update: I've written a follow-up post that describes how I would like PHP's encoding support to be.

As the resident encoding geek on the Drupal team, it's usually my job to make sure Drupal handles encodings and Unicode correctly. I don't mind doing this, but PHP doesn't exactly make it easy. With the new search.module for Drupal 4.6 being Unicode-aware, this has become very obvious, as we've had to bump up the minimum required version of PHP to 4.3.3. The UTF-8 support in the Perl-compatible regular expressions in PHP 4.3.2 and earlier is completely broken. And now I've had a bug report about someone on PHP 4.3.8 who still had problems getting it to work.

I don't know why exactly, but as far as encodings go PHP is still in the Stone Age. This is odd, as you'd expect a web-oriented scripting language to have excellent support for sharing and exchanging textual information. There is a multi-byte string extension available, but it's not available on 90% of PHP hosts out there, and it's more of a black-box library anyway: it does not present your strings to you as Unicode character codepoints, but still as an array of bytes. Furthermore, if you actually enable the mbstring overrides, you lose the ability to work with bytes at will. Apparently, the PHP team still hasn't figured out that bytes and characters are not the same. The other extensions which deal with encodings (iconv, recode) are also unavailable on the majority of PHP installs out there.

This means that if you want to make a PHP application which supports any language and runs on the average PHP host out there, there's only one option: use UTF-8 internally, and write your own functions for string truncation, email header encoding, validation, etc. Using UTF-8 ensures that you only have one encoding to worry about, and because it's Unicode it is guaranteed to be able to represent any language. Of course, you will no longer be able to do something as simple as upper/lowercasing a string, as these PHP functions don't handle UTF-8 at all.
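As an example of the kind of hand-written function meant here, the following is a minimal sketch of byte-limited truncation that never cuts a UTF-8 character in half. The name utf8_truncate_bytes() is hypothetical; the sketch assumes the input is valid UTF-8 and the limit is not negative, and it uses only byte-level functions that every PHP install has.

```php
<?php
// Hypothetical helper: cut a UTF-8 string to at most $max_bytes bytes
// without splitting a multi-byte character. Assumes valid UTF-8 input.
function utf8_truncate_bytes($string, $max_bytes) {
  if (strlen($string) <= $max_bytes) {
    return $string;
  }
  $cut = $max_bytes;
  // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
  // dropped byte is one of those, the cut falls inside a character, so
  // back up until the cut sits in front of that character's lead byte.
  while ($cut > 0 && (ord($string[$cut]) & 0xC0) === 0x80) {
    $cut--;
  }
  return substr($string, 0, $cut);
}

$word = "caf\xC3\xA9s";  // five characters, six bytes: the e-acute takes two bytes in UTF-8

echo bin2hex(substr($word, 0, 4)), "\n";             // plain substr() leaves a dangling lead byte
echo bin2hex(utf8_truncate_bytes($word, 4)), "\n";   // the helper drops the whole character instead
```

Truncating by character count rather than byte count would need a similar loop that counts lead bytes; the byte-limited form is shown because it is the one needed for fixed-size database columns.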

What PHP needs is Unicode string support in the core, along with a good library of useful functions for handling the very large Unicode character range efficiently. ASP, Perl, Python, Java all have it... for me, it's the only thing that would've made PHP5 worth upgrading to.

It's as if the entire PHP team has stuck their head in the ground, hoping that all this Unicode stuff will somehow blow over. It won't.

UFPDF: Unicode/UTF-8 extension for FPDF


Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome, but I don't have much time to maintain this.

FPDF is a PHP class for generating PDF files on-the-fly. Unfortunately it does not support Unicode. So I've coded UFPDF, an extension of FPDF which accepts input in UTF-8.

Only TrueType fonts are supported for now. To embed .TTF files, you need to extract the font metrics and build the required tables using the provided utilities (see README.txt). Included is a modified version of TTF2PT1 which extracts the Unicode glyph info.

UFPDF works the same as FPDF, except that all text is in UTF-8, so consult the FPDF documentation for usage.

Download UFPDF Example PDF

PHP 5 references fun: clone for PHP4.


An issue that's popped up recently in Drupal is PHP5 compatibility. At first, this looks like a no-brainer. Drupal does not use any advanced OO features, so most code runs on both PHP4 and PHP5.

There is however a nasty change in PHP5: objects are now always passed by reference. Variables hold a handle to the object rather than the object itself. This brings PHP more in line with other OO languages (like Java) and removes some of the ugliness from PHP OO code, but it also means that objects are treated differently from all the other types. Old code that depends on having objects copied when not passed by reference will break.
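The handle behaviour described above can be seen in a short sketch. The class and function names are hypothetical and exist only for this demonstration; the behaviour shown is that of PHP5 and later.

```php
<?php
// Hypothetical class used only for this demonstration.
class Node {
  public $title = 'original';
}

// Hypothetical function that modifies its argument. PHP4-era code often did
// this on the assumption that it was working on a private copy.
function retitle($node) {
  $node->title = 'changed inside function';
}

$a = new Node();
$b = $a;                        // copies the handle, not the object
$b->title = 'changed via $b';
$after_assignment = $a->title;  // $a sees the change made through $b

retitle($a);                    // no & in the signature, yet the caller's object changes
$after_call = $a->title;

echo $after_assignment, "\n", $after_call, "\n";
```

Under PHP4 both the assignment and the function call would have worked on copies, leaving $a untouched. That difference is exactly what breaks old code.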

Fixing this is tricky. PHP5 gives you the clone keyword to copy objects with:

  $copy = clone $object;

And surprise, surprise, PHP4 does not consider this to be valid code. It doesn't even parse, so you can't wrap it in a version check. To get around this, you need a rather ugly hack.

The following code works the same as the above, in PHP5:

  $copy = clone($object);

PHP 4 on the other hand will think clone() is a function. The obvious next step is to conditionally declare this function if PHP4 is running. The only problem there is that the function definition will not parse in PHP5 because clone is a special keyword. To get around that, we have to use eval() to postpone parsing. Here's the finished hack:

  if (version_compare(phpversion(), '5.0') < 0) {
    // PHP4 only: eval() postpones parsing, so PHP5 never sees "function clone".
    eval('function clone($object) { return $object; }');
  }

In PHP5, the native clone keyword will clone the object, while in PHP4 the cloning will happen when the object is passed by value to clone().

We still need to go over the Drupal code and check for reference problems, but at least now we can clone objects consistently.