Preview: A Neighborhood of Infinity

Last Build Date: Wed, 22 Nov 2017 14:48:23 +0000

A tail we don't need to wag

Sat, 14 Oct 2017 20:42:00 +0000

IntroductionI've been reading a little about concentration inequalities recently. I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include random walks, or queues. The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say ) times we expect the number of heads to lie within of the mean about 99.99% of the time despite there being different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of tosses, there are possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program. I (and many others) have written about the "vector space" monad that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating coin tosses it needs to track different possiblities, even though we're only interested in the possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of Eq or Ord. This conflicts with the type of Monad which requires that we can use the >>= :: m a -> (a -> m b) -> m b and return :: a -> m a functions with any types a and b. I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for efficiently implementing the Set monad. Instead of Set I'm going to use the Map type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the Map type instead of the Set type. > {-# LANGUAGE GADTs, FlexibleInstances #-}> {-# LANGUAGE ViewPatterns #-}> module Main where> import Control.Monad> import Control.Arrow> import qualified Data.Map as M> import qualified Data.List as LThe following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is P: > data P p a where> POrd :: Ord a => p -> M.Map a p -> P p a> PAny :: p -> [(a, p)] -> P p aBut note how the constructors take two arguments - a number that is a probability, in addition to a weighted Map or list. For now pretend that first argument is zero and that the functions called trimXXX act similarly to the identity: > instance (Ord p, Num p) => Functor (P p) where> fmap = liftM> instance (Ord p, Num p) => Applicative (P p) where> pure = return> (<*>) = ap> instance (Ord p, Num p) => Monad (P p) where> return x = PAny 0 [(x, 1)]> m >>= f = > let (e, pdf) = unP m> in trimAdd e $ collect $ map (f *** id) pdf> returnP :: (Ord p, Num p, Ord a) => a -> P p a> returnP a = POrd 0 $ M.singleton a 1> unP :: P p a -> (p, [(a, p)])> unP (POrd e pdf) = (e, M.toList pdf)> unP (PAny e pdf) = (e, pdf)> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p> fromList = M.fromListWith (+)> union :: (Num p, Ord a) =&g[...]
What is a photon?

Sat, 12 Aug 2017 03:22:00 +0000

IntroductionPopular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics. One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word photon plays by different linguistic rules to billiard ball. I hope to explain why. One of my goals is to avoid saying anything original. I'm largely going remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.) The simple harmonic oscillatorHere's a mass hanging on a spring: Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time: It's actually a sine wave but that detail doesn't matter for us right now. An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a simple harmonic oscillator and its oscillation is always described by a sine wave. Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems. Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity: At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms. If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The spectrum of possible energies of one of these oscillators is continuous. (Note that the word spectrum here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy. Quantum mechanicsEverything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever: We usually call the lowest energy level the ground state or vacuum state and call the hig[...]
Self-referential logic via self-referential circuits

Sat, 15 Jul 2017 16:09:00 +0000

IntroductionTL;DR The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits. Here's a famous paradoxical sentence: This sentence is falseIf it's false then it's true and if it's true then it's false. Here's a paradoxical electronic circuit: The component in the middle is an inverter. If the output of the circuit is high then its input is high and then it's output must be low, and vice versa. There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts. In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert it's own falsity. Instead, a statement can assert its own unprovability and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved. In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an oscillator circuit that can be thought of as outputting a stream of bits. The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when logicians or robots play games like Prisoner's Dilemma. I'll be making use of results found in Boolos' book on The Logic of Provability and some ideas I borrowed from Smoryński's paper on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with ithe ideas behind provability logic. Provability LogicThere are many descriptions of provability logic (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the appendix below and I'm going to give a very brief reminder now. Start with (classical) propositional calculus which has a bunch of variables with names like and connectives like for AND, for OR, for NOT and for implication. (Note that .) Provability logic extends propositional calculus by adding a unary operator . (I apologise, that's meant to be a □ but it's coming out like in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that asserts that is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have and as well as a rule that allows us to deduce from . We also have this fixed point property: Let be any predicate we can write in the language of GL involving the variable , and suppose that every appearance of in is inside a , e.g. . Then there is a fixed point, i.e. a proposition that makes no mention of such that is a theorem. In effect, for any such , is a proposition that asserts . See the appendix for a brief mention of why we should expect this to be true. From the fixed point property we can deduce Löb's theorem: . There is a proof at wikipedia that starts from the fixed point property. We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact you can argue that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. (That's also, sort of, the role played by the loeb function I defined way back. But note that loeb isn't really a proof of Löb's theorem, it just has formal similarities.) Back to electronic circuitsIn order [...]
A relaxation technique

Wed, 07 Jun 2017 03:32:00 +0000

IntroductionSometimes you want to differentiate the expected value of something. I've written about some tools that can help with this. For example you can use Automatic Differentiation for the derivative part and probability monads for the expectation. But the probability monad I described in that article computes the complete probability distribution for your problem. Frequently this is intractably large. Instead people often use Monte Carlo methods. They'll compute the "something" many times, substituting pseudo-random numbers for the random variables, and then average the results. This provides an estimate of the expected value and is ubiquitous in many branches of computer science. For example it's the basis of ray-tracing and path-tracing algorithms in 3D rendering, and plays a major role in machine learning when used in the form of stochastic gradient descent. But there's a catch. Suppose we want to compute where each of the belong to the Bernoulli distribution . I.e. each has a probability of being 1 and probability of being 0. If we compute this using a Monte Carlo approach we'll repeatedly generate pseudo-random numbers for each of the . Each one will be 0 or 1. This means that our estimate depends on via subexpressions that can't meaningfully be differentiated with respect to . So how can we use automatic differentiation with the Monte Carlo method? I'm proposing an approach that may or may not already be in the literature. Whether it is or not, I think it's fun to get there by combining many of the things I've previously talked about here, such as free monads, negative probabilities and automatic differentiation. I'm going to assume you're familiar with using dual numbers to compute derivatives as I've written about this before and wikipedia has the basics. A probability monadI want to play with a number of different approaches to using monads with probability theory. Rather than define lots of monads I think that the easiest thing is to simply work with one free monad and then provide different interpreters for it. First some imports: > import Control.Monad> import qualified System.Random as R> import qualified Data.Map.Strict as MI'm going to use a minimal free monad that effectively gives us a DSL with a new function that allows us to talk about random Bernoulli variables: > data Random p a = Pure a | Bernoulli p (Int -> Random p a)The idea is that Pure a represents the value a and Bernoulli p f is used to say "if we had a random value x, f x is the value we're interested in". The Random type isn't going to do anything other than represent these kinds of expressions. There's no implication that we actually have a random value for x yet. > instance Functor (Random p) where> fmap f (Pure a) = Pure (f a)> fmap f (Bernoulli p g) = Bernoulli p (fmap f . g)> instance Applicative (Random p) where> pure = return> (<*>) = ap> instance Monad (Random p) where> return = Pure> Pure a >>= f = f a> Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)We'll use bernoulli p to represent a random Bernoulli variable drawn from . > bernoulli :: p -> Random p Int> bernoulli p = Bernoulli p returnSo let's write our first random expression: > test1 :: Random Float Float> test1 = do> xs <- replicateM 4 (bernoulli 0.75)> return $ fromIntegral $ sum xsIt sums 4 Bernoulli random variables from and converts the result to a Float. The expected value is 3. We don't yet have a way to do anything with this expression. So let's write an interpreter that can substitute pseudo-random values for each occurrence of bernoulli p: It's essentially interpreting our free monad as a state monad where the state is the random number seed: > interpret1 :: (Ord p, R.Random p, R.RandomGen g) => Random p a -> g -> (a, g)> interpret1 (Pure a) seed = (a, seed)> interpret1 (Bernoulli prob f) seed = > let (r, seed') = [...]
Logarithms and exponentials of functions

Sun, 05 Feb 2017 18:30:00 +0000

IntroductionA popular question in mathematics is this: given a function , what is its "square root" in the sense that . There are many questions about this on mathoverflow but it's also a popular subject in mathematics forums for non-experts. This question seems to have a certain amount of notoriety because it's easy to ask but hard to answer fully. I want to look at an approach that works nicely for formal power series, following from the Haskell code I wrote here. There are some methods for directly finding "functional square roots" for formal power series that start as , but I want to approach the problem indirectly. When working with real numbers we can find square roots, say, by using . I want to use an analogue of this for functions. So my goal is to make sense of the idea of the logarithm and exponential of a formal power series as composable functions. Warning: the arguments are all going to be informal. NotationThere's potential for a lot of ambiguous notation here, especially as the usual mathematical notation for th powers of trig functions is so misleading. I'm going to use for composition of functions and power series, and I'm going to use the notation to mean the th iterate of . So and . As I'll be working mostly in the ring of formal power series for some ring , I'll reserve the variable to refer only to the corresponding element in this ring. I'll also use formal power series somewhat interchangeably with functions. So can be thought of as representing the identity function. To make sure we're on the same page, here are some small theorems in this notation: .That last one simply says that adding one times is the same as adding . As I'm going to have ordinary logarithms and exponentials sitting around, as well as functional logarithms and exponentials, I'm going to introduce the notation for functional logarithm and for functional exponentiation. PreliminariesThe first goal is to define a non-trivial function with the fundamental property that First, let's note some basic algebraic facts. The formal power series form a commutative ring with operations and (ordinary multiplication) and with additive identity and multiplicative identity . The formal power series form a ring-like algebraic structure with operation and partial operation with additive identity and multiplicative identity . But it's not actually ring or even a near-ring. Composition isn't defined for all formal power series and even when it's defined, we don't have distributivity. For example, in general , after all there's no reason to expect to equal . We do have right-distributivity however, i.e. , because , more or less by definition of . We can't use power series on our power seriesThere's an obvious approach, just use power series of power series. So we might tentatively suggest that . Note that I consider rather than because is the multiplicative identity in our ring-like structure. Unfortunately this doesn't work. The reason is this: if we try to use standard reasoning to show that the resulting function has the fundamental property we seek we end up using distributivity. We don't have distributivity. Sleight of handThere's a beautiful trick I spotted on mathoverflow recently that allows us to bring back distributivity. (I can't find the trick again, but when I do I'll come back and add a link and credit here.) Consider the function defined by . In other words is right-composition by . (Ambiguity alert, I'm using here to mean right. It has nothing to do with the ring underlying our formal power series.) Because we have right-distributivity, is a bona fide linear operator on the space of formal power series. If you think of formal power series as being infinitely long vectors of coefficients then can be thought of as an infinitely sized matrix. This means that as long as we have convergence, we can get away with using power series to compute with the property that . Define: . We have: where I'm using to[...]
Building free arrows from components

Mon, 09 Jan 2017 16:33:00 +0000

IntroductionGabriel Gonzalez has written quite a bit about the practical applications of free monads. And "haoformayor" wrote a great stackoverflow post on how arrows are related to strong profunctors. So I thought I'd combine these and apply them to arrows built from profunctors: free arrows. What you get is a way to use arrow notation to build programs, but defer the interpretation of those programs until later. HeteromorphismsUsing the notation here I'm going to call an element of a type P a b, where P is a profunctor, a heteromorphism. A product that isn't much of a productAs I described a while back you can compose profunctors. Take a look at the code I used, and also Data.Functor.Composition. data Compose f g d c = forall a. Compose (f d a) (g a c)An element of Compose f g d c is just a pair of heteromorphisms, one from each of the profunctors, f and g, with the proviso that the "output" type of one is compatible with the "input" type of the other. As products go it's pretty weak in the sense that no composition happens beyond the two objects being stored with each other. And that's the basis of what I'm going to talk about. The Compose type is just a placeholder for pairs of heteromorphisms whose actual "multiplication" is being deferred until later. This is similar to the situation with the free monoid, otherwise known as a list. We can "multiply" two lists together using mappend but all that really does is combine the elements into a bigger list. The elements themselves aren't touched in any way. That suggests the idea of using profunctor composition in the same way that (:) is used to pair elements and lists. Free ArrowsHere's some code: > {-# OPTIONS -W #-}> {-# LANGUAGE ExistentialQuantification #-}> {-# LANGUAGE Arrows #-}> {-# LANGUAGE RankNTypes #-}> {-# LANGUAGE TypeOperators #-}> {-# LANGUAGE FlexibleInstances #-}> import Prelude hiding ((.), id)> import Control.Arrow> import Control.Category> import Data.Profunctor> import Data.Monoid> infixr :-> data FreeA p a b = PureP (a -> b)> | forall x. p a x :- FreeA p x bFirst look at the second line of the definition of FreeA. It says that a FreeA p a b might be a pair consisting of a head heteromorphism whose output matches the input of another FreeA. There's also the PureP case which is acting like the empty list []. The reason we use this is that for our composition, (->) acts a lot like the identity. In particular Composition (->) p a b is isomorphic to p a b (modulo all the usual stuff about non-terminating computations and so on). This is because an element of this type is a pair consisting of a function a -> x and a heteromorphism p x b for some type x we don't get to see. We can't project back out either of these items without information about the type of x escaping. So the only thing we can possibly do is use lmap to apply the function to the heteromorphism giving us an element of p a b. Here is a special case of PureP we'll use later: > nil :: Profunctor p => FreeA p a a> nil = PureP idSo an element of FreeA is a sequence of heteromorphisms. If heteromorphisms are thought of as operations of some sort, then an element of FreeA is a sequence of operations waiting to be composed together into a program that does something. And that's just like the situation with free monads. Once we've build a free monad structure we apply an interpreter to it to evaluate it. This allows us to separate the "pure" structure representing what we want to do from the code that actually does it. The first thing to note is our new type is also a profunctor. We can apply lmap and rmap to a PureP function straightforwardly. We apply lmap directly to the head of the list and we use recursion to apply rmap to the PureP at the end: > instance Profunctor b => Profunctor (FreeA b) where> lmap f (PureP g) = PureP (g . f)> lmap f (g :- h) = (lmap f g)[...]
Addressing Pieces of State with Profunctors

Sat, 07 Jan 2017 21:46:00 +0000

Attempted segueSince I first wrote about profunctors there has been quite a bit of activity in the area so I think it's about time I revisited them. I could just carry on from where I left off 5 years ago but there have been so many tutorials on the subject that I think I'll have to assume you've looked at them. My favourite is probably Phil Freeman's Fun with Profunctors. What I intend to do here is solve a practical problem with profunctors. The problemArrows are a nice mechanism for building circuit-like entities in code. In fact, they're quite good for simulating electronic circuits. Many circuits are very much like pieces of functional code. For example an AND gate like this can be nicely modelled using a pure function: c = a && b. But some components, like flip-flops, have internal state. What comes out of the outputs isn't a simple function of the inputs right now, but depends on what has happened in the past. (Alternatively you can take the view that the inputs and outputs aren't the current values but the complete history of the values.) We'll use (Hughes) arrows rather than simple functions. For example, one kind of arrow is the Kleisli arrow. For the case of Kleisli arrows built from the state monad, these are essentially functions of type a -> s -> (b, s) where s is our state. We can write these more symmetrically as functions of type (a, s) -> (b, s). We can think of these as "functions" from a to b where the output is allowed to depend on some internal state s. I'll just go ahead and define arrows like this right now. First the extensions and imports: > {-# OPTIONS -W #-}> {-# LANGUAGE Arrows #-}> {-# LANGUAGE RankNTypes #-}> {-# LANGUAGE FlexibleInstances #-}> import Prelude hiding ((.), id)> import Control.Arrow> import Control.Category> import Data.Profunctor> import Data.TupleAnd now I'll define our stateful circuits. I'm going to make these slightly more general than I described allowing circuits to change the type of their state: > newtype Circuit s t a b = C { runC :: (a, s) -> (b, t) }> instance Category (Circuit s s) where> id = C id> C f . C g = C (f . g)> instance Arrow (Circuit s s) where> arr f = C $ \(a, s) -> (f a, s)> first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)> in ((b, x), t)This is just a more symmetrical rewrite of the state monad as an arrow. The first method allows us to pass through some extra state, x, untouched. Now for some circuit components. First the "pure" operations, a multiplier and a negater: > mul :: Circuit s s (Int, Int) Int> mul = C $ \((x, y), s) -> (x*y, s)> neg :: Circuit s s Int Int> neg = C $ \(x, s) -> (-x, s)And now some "impure" ones that read and write some registers as well as an accumulator: > store :: Circuit Int Int Int ()> store = C $ \(x, _) -> ((), x)> load :: Circuit Int Int () Int> load = C $ \((), s) -> (s, s)> accumulate :: Circuit Int Int Int Int> accumulate = C $ \(a, s) -> (a, s+a)I'd like to make a circuit that has lots of these components, each with its own state. I'd like to store all of these bits of state in a larger container. But that means that each of these components needs to have a way to address its own particular substate. That's the problem I'd like to solve. Practical profunctor opticsIn an alternative universe lenses were defined using profunctors. To find out more I recommend Phil Freeman's talk that I linked to above. Most of the next paragraph is just a reminder of what he says in that talk and I'm going to use the bare minimum to do the job I want. Remember that one of the things lenses allow you to do is this: suppose we have a record s containing a field of type a and another similar enough kind of record t with a field of type b. Among other things, a lens gives a way to take a ru[...]
Expectation-Maximization with Less Arbitrariness

Sun, 16 Oct 2016 23:04:00 +0000

IntroductionThere are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are: Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?Substituting in the middle of an expression. Again, you can use just about anywhere. Why choose this at this time? Similarly I found derivations that insert a into an expression.Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer. Generalities about EMThe EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe. First assume no hidden variables. We assume there is a vector of parameters that defines some model. We make some observations . We have a probability density that depends on . The likelihood of given the observations is . The maximum likelhood estimator for is the choice of that maximises for the we have observed. Now suppose there are also some variables that we didn't get to observe. We assume a density . We now have where we sum over all possible values of . The MLE approach says we now need to maximise One of the things that is a challenge here is that the components of might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of , then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise One reason this would be to our advantage is that often takes the form where is a simple function to optimise. In addition, may break up as a sum of terms, each with its own block of 's. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the for each is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in. Maximisation by proxySometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose is a differentiable function in a neighbourhood of . Then around we have We can try optimising with respect to within a neighbourhood of . If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing because it matches both the value and derivatives of at . We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration. We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy[...]
Dimensionful Matrices

Sun, 07 Aug 2016 02:23:00 +0000

IntroductionProgramming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now defunct Fortress) all tend to treat their data as plain numbers meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them. I've been wondering how much dimensions (in the sense of dimensional analysis) and units could help with numerical programming. As I pointed out on G+ recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler. There are quite a few libraries to statically check dimensions and units now. For example Boost.Units for C++, units for Haskell and even quantities for Idris. A matrix that breaks thingsEven if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the metric tensor. There is a matrix which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension . This will break many libraries that support units. An obvious workaround is to switch to use natural units, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages. Heterogeneous vectors and matricesAccording to a common convention in physics, a 4-vector has dimensions where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here is time and is length. The metric tensor is used like this: (where I'm using the Einstein summation convention so the 's and 's are summed over). If we think of having units of length squared (it is a pseudo-Riemannian metric after all) then it makes sense to think of having dimensions given by We can write this more succinctly as where is the usual outer product. I'll use the notation to mean is of type . So, for example, . I'll also use pointwise notation for types such as and . Now I can give some general rules. If is a matrix, and are vectors, and is a scalar, then only makes sense if . Similarly the "inner product" only makes sense if . Generic vectors and matricesAlthough these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this generically as possible means we should write it to operate on a matrix of type , say. The result should have type . If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type A, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It woul[...]
Cofree meets Free

Sat, 24 May 2014 05:21:00 +0000

> {-# LANGUAGE RankNTypes, MultiParamTypeClasses, TypeOperators #-}IntroductionAfter I spoke at BayHac 2014 about free monads I was asked about cofree comonads. So this is intended as a sequel to that talk. Not only am I going to try to explain what cofree comonads are. I'm also going to point out a very close relationship between cofree comonads and free monads. At the beginning of the talk the Google Hangout software seems to have switched to the laptop camera so you can't see the slides in the video. However the slides are here. Cothings as machinesI often think of coalgebraic things as machines. They have some internal state and you can press buttons to change that internal state. For example here is a type class for a machine with two buttons that's related to a magma: > class TwoButton a where> press :: a -> (a, a)The idea is that the state of the machine is given by some type a and you could press either the left button or the right button. The result of pressing one or other button is given by these two functions: > pressLeft, pressRight :: TwoButton a => a -> a> pressLeft = fst . press> pressRight = snd . press(As with many metaphors used to explain Haskell type classes your mileage may vary. Sometimes you'll have to stretch your imagination to see what the set of buttons is for a particular cothing.) ComonadsJust as monads are a kind of generalised algebraic structure (for example see my talk), comonads are a generalised kind of machine. The idea is that for any state of the machine there is a bunch of buttons we could press. But we don't have two buttons, or any fixed number of buttons. We instead have a functorful of buttons (if you think of functors by analogy with containers). We also don't get to directly see the internal state of the machine but instead we get to make observations. Here's the type class: > class Comonad w where> extract :: w a -> a> duplicate :: w a -> w (w a)The state of the machine is given by w a. We observe the state using the extract function. And when we come to press a button, we have a functorful of new states that it could end up in. The duplicate function gives the container of those new states. For example, various kinds of zipper give rise to comonads. Zippers allow you to "focus" on a part of a data structure. The extract operation allows you to observe the point that currently has focus. There is one button for every position in the structure where the focus could be. Pressing the corresponding button moves the focus to that point. Similarly the Store comonad has one button for each value you can store in the field it represents. Press the button and the value gets stored in the field. Cofreeness as a way to memoiseCofree coalgebras can be thought of as memoised forms of elements of coalgebras. For example, the TwoButton machine above has a function, press, as part of its definition. Memoising an element of such a thing means tabulating everything that could possibly happen if you pressed the buttons so we no longer need the press function. One approach is to try something like this: data CofreeTwoButton = Memo CofreeTwoButton CofreeTwoButtonThe structure contains two CofreeTwoButtons, each giving the result of pressing one of the two buttons. Any element of CofreeTwoButton may now be memoised like so: memoiseTwoButton :: TwoButton m => m -> CofreeTwoButtonmemoiseTwoButton m = Memo (memoiseTwoButton (pressLeft m)) (memoiseTwoButton (pressRight m))It definitely tabulates the result of pressing buttons. But it has a major flaw. We have no way of seeing what's stored in the table! To make this useful we want to also store some data in the table that we can peek at. So here is a better definition: > data CofreeTwoButton a = Memo a (CofreeTwoButton a) (CofreeTwoButton a)> memoise[...]
Types, and two approaches to problem solving

Sat, 17 May 2014 15:22:00 +0000

IntroductionThere are two broad approaches to problem solving that I see frequently in mathematics and computing. One is attacking a problem via subproblems, and another is attacking a problem via quotient problems. The former is well known though I’ll give some examples to make things clear. The latter can be harder to recognise but there is one example that just about everyone has known since infancy.SubproblemsConsider sorting algorithms. A large class of sorting algorithms, including quicksort, break a sequence of values into two pieces. The two pieces are smaller so they are easier to sort. We sort those pieces and then combine them, using some kind of merge operation, to give an ordered version of the original sequence. Breaking things down into subproblems is ubiquitous and is useful far outside of mathematics and computing: in cooking, in finding our path from A to B, in learning the contents of a book. So I don’t need to say much more here.Quotient problemsThe term quotient is a technical term from mathematics. But I want to use the term loosely to mean something like this: a quotient problem is what a problem looks like if you wear a certain kind of filter over your eyes. The filter hides some aspect of the problem that simplifies it. You solve the simplified problem and then take off the filter. You now ‘lift’ the solution of the simplified problem to a solution to the full problem. The catch is that your filter needs to match your problem so I’ll start by giving an example where the filter doesn’t work.Suppose we want to add a list of integers, say: 123, 423, 934, 114. We can try simplifying this problem by wearing a filter that makes numbers fuzzy so we can’t distinguish numbers that differ by less than 10. When we wear this filter 123 looks like 120, 423 looks like 420, 934 looks like 930 and 114 looks like 110. So we can try adding 120+420+930+110. This is a simplified problem and in fact this is a common technique to get approximate answers via mental arithmetic. We get 1580. We might hope that when wearing our filters, 1580 looks like the correct answer. But it doesn’t. The correct answer is 1594. This filter doesn’t respect addition in the sense that if a looks like a’ and b looks like b’ it doesn’t follow that a+b looks like a’+b’.To solve a problem via quotient problems we usually need to find a filter that does respect the original problem. So let’s wear a different filter that allows us just to see the last digit of a number. Our original problem now looks like summing the list 3, 3, 4, 4. We get 4. This is the correct last digit. If we now try a filter that allows us to see just the last two digits we see that summing 23, 23, 34, 14 does in fact give the correct last two digits. This is why the standard elementary school algorithms for addition and multiplication work through the digits from right to left: at each stage we’re solving a quotient problem but the filter only respects the original problem if it allows us to see the digits to the right of some point, not digits to the left. This filter does respect addition in the sense that if a looks like a’ and b looks like b’ then a+b looks like a’+b’.Another example of the quotient approach is to look at the knight’s tour problem in the case where two opposite corners have been removed from the chessboard. A knight’s tour is a sequence of knight’s moves that visit each square on a board exactly once. If we remove opposite corners of the chessboard, there is no knight’s tour of the remaining 62 squares. How can we prove this? If you don’t see the trick you can get get caught up in all kinds of complicated reasoning. So now put on a filter that removes your ability to see the spatial relationships between the squares so you can only[...]
The Monad called Free

Sat, 26 Apr 2014 04:42:00 +0000

IntroductionAs Dan Doel points out here, the gadget Free that turns a functor into a monad is itself a kind of monad, though not the usual kind of monad we find in Haskell. I'll call it a higher order monad and you can find a type class corresponding to this in various places including an old version of Ed Kmett's category-extras. I'll borrow some code from there. I hunted around and couldn't find an implementation of Free as an instance of this class so I thought I'd plug the gap. > {-# LANGUAGE RankNTypes, FlexibleContexts, InstanceSigs, ScopedTypeVariables #-}> import Control.Monad> import Data.MonoidTo make things unambiguous I'll implement free monads in the usual way here: > data Free f a = Pure a | Free (f (Free f a))> instance Functor f => Functor (Free f) where> fmap f (Pure a) = Pure (f a)> fmap f (Free a) = Free (fmap (fmap f) a)> instance Functor f => Monad (Free f) where> return = Pure> Pure a >>= f = f a> Free a >>= f = Free (fmap (>>= f) a)The usual Haskell typeclass Monad corresponds to monads in the category of types and functions, Hask. We're going to want monads in the category of endomorphisms of Hask which I'll call Endo. The objects in Endo correspond to Haskell's Functor. The arrows in Endo are the natural transformations between these functors: > type Natural f g = (Functor f, Functor g) => forall a. f a -> g aSo now we are led to consider functors in Endo. > class HFunctor f whereA functor in Endo must map functors in Hask to functors in Hask. So if f is a functor in Endo and g is a functor in Hask, then f g must be another functor in Hask. So there must be an fmap associated with this new functor. There's an associated fmap for every g and we collect them all into one big happy natural family: > ffmap :: Functor g => (a -> b) -> f g a -> f g bBut note also that by virtue of being a functor itself, f must have its own fmap type function associated with it. The arrows in Endo are natural transformations in Hask so the fmap for HFunctor must take arrows in Endo to arrows in Endo like so: > hfmap :: (Functor g, Functor h) => Natural g h -> Natural (f g) (f h)Many constructions in the category Hask carry over to Endo. In Hask we can form a product of type types a and b as (a, b). In Endo we form the product of two functors f and g as > data Product f g a = Product (f (g a))Note that this product isn't commutative. We don't necessarily have an isomorphism from Product f g to Product g f. (This breaks many attempts to transfer constructions from Hask to Endo.) We also won't explicitly use Product because we can simply use the usual Haskell composition of functors inline. We can implement some functions that act on product types in both senses of the word "product": > left :: (a -> c) -> (a, b) -> (c, b)> left f (a, b) = (f a, b)> right :: (b -> c) -> (a, b) -> (a, c)> right f (a, b) = (a, f b)> hleft :: (Functor a, Functor b, Functor c) => Natural a c -> a (b x) -> c (b x)> hleft f = f> hright :: (Functor a, Functor b, Functor c) => Natural b c -> a (b x) -> a (c x)> hright f = fmap f(Compare with what I wrote here.) We have something in Endo a bit like the type with one element in Hask, namely the identity functor. The product of a type a with the one element type in Hask gives you something isomorphic to a. In Endo the product is composition for which the identity functor is the identity. (Two different meanings of the word "identity" there.) We also have sums. For example, if we define a functor like so > data F a = A a | B a awe can think of F as a sum of two functors: one with a single constructor A and another with co[...]
Reinversion Revisited

Sun, 02 Feb 2014 01:53:00 +0000

IntroductionA while back I talked about the idea of reinversion of control using the continuation monad to wrest control back from an interface that only wants to call you, but doesn't want you to call them back. I want to return to that problem with a slightly different solution. The idea is that we build an interpreter for an imperative language that's an embedded Haskell DSL. You arrange that the DSL does the work of waiting to be called by the interface, but from the point of view of the user of the DSL it looks like you're calling the shots. To do this I'm going to pull together a bunch of techniques I've talked about before. This approach is largely an application of what apfelmus described here. The codeWe'll start with some administrative stuff before getting down to the real code: > {-# LANGUAGE TemplateHaskell #-}> import Control.Lens> import Control.Monad> import Control.Monad.LoopsWe'll make our DSL an imperative wrapper around Gloss: > import Graphics.Gloss.Interface.Pure.GameWe'll define a structure that can be used to represent the abstract syntax tree (AST) of our DSL. Our DSL will support the reading of inputs, adding pictures to the current picture, and clearing the screen. First we'll need a wrapper that allows us to represent ordinary Haskell values in our DSL: > data Basic a = Return aNow we want an expression that represents events given to us by Gloss. Internally we'll represent this by a function that says what our program does if it's given an event. It says what our program does by returning another AST saying what happens when the input is received. (I've previously talked about these kinds of expression trees here). > | Input (Event -> Basic a)We have a command to render some graphics. It appends a new Picture to the current picture. Again, part of the AST muct be another AST saying what happens after the picture is rendered: > | Render Picture (Basic a)And lastly here's the AST for a clear screen command: > | Cls (Basic a)Our AST will form a monad. This will allow us to build ASTs using ordinary Haskell do-notation. This technique is what I described previously here. > instance Monad Basic where> return = Return> Return a >>= f = f a> Input handler >>= f = Input (\e -> handler e >>= f)> Render p a >>= f = Render p (a >>= f)> Cls a >>= f = Cls (a >>= f)You can think of the expression x >>= f as x with the tree f a grafted in to replace any occurrence of Return a in it. This is exactly what Return a >>= f does. But applying >>= f to the other ASTs simply digs down "inside" the ASTs to find other occurrences of Return a. It's convenient to uses lenses to view Gloss's game world: > data World = World { _program :: Basic (), _picture :: Picture }> $(makeLenses ''World)And now we have some wrappers around the interpreter's commands. The return () provides the convenient place where we can graft subtrees into our AST. > input = Input return> render p = Render p (return ())> cls = Cls (return ())Now we can start coding. Here's a test to see if a Gloss event is a key down event: > keydown (EventKey (Char key) Down _ _) = True> keydown (EventKey (SpecialKey KeySpace) Down _ _) = True> keydown _ = FalseAnd now here's a complete program using our DSL. It's deliberately very imperative. It simply iterates over a nested pair of loops, collecting keystrokes and displaying them. It reads a lot like an ordinary program written in a language like Python or Basic: > mainProgram = do> render (Color white $ Scale 0.2 0.2 $ Text "Type some text")> forM_ [780, 760..] $ \ypos [...]
Distributed computing with alien technology

Sat, 26 Oct 2013 04:00:00 +0000

IntroductionSuppose we are given a function of boolean arguments that returns a boolean result. Alice has bits, and Bob has another bits . Alice and Bob are widely separated and don't know each other's bits. What is the total number of bits that Alice has to send to Bob and that Bob has to send to Alice so that between them they can compute ? Think about how complex might get. The and might each describe half of a "voxelised" region of space and might answer a question about a computational fluid dynamics (CFD) simulation running in that space. CFD simulations can be chaotic and so we might expect that in the worst case many bits have to be transferred back and forth between Alice and Bob. In the worst case we might expect that Alice has to send Bob all of her bits, or vice versa. But in fact Alice needs to send Bob just one bit. A loopholeTo get the communication requirements down to one bit we need to use a loophole. But I hope to (1) justify the cheat to some extent and (2) justify that it's even worthwhile to think about cheats. Alice and Bob have access to some Ancient technology. They each have one of a pair of boxes. At prearranged times, Alice puts a bit into her box, and Bob puts a bit into his box. A bit pops back out of Alice's box and a bit pops back out of Bob's box. Whatever the input, both Alice and Box have a 0.5 chance of seeing a one or zero pop out of their respective boxes. But when the two outputs are XORed together the result is the logical AND of the two inputs. With such boxes, Alice can compute after Bob sends a single bit down a conventional communication channel. "But this is a total cheat!" you complain before I even start to explain their technique. It seems Alice receives a bit that depends on what Bob input, and so Bob is communicating with Alice. But look closely and you'll see that the boxes don't allow any communication. No matter what Bob inputs, Alice has a 0.5 chance of getting zero or one. There is no way Bob can use this to communicate anything. It's like intercepting a message encrypted with a one time pad. Without the pad, the message is basically a sequence of random bits. Nonetheless, it is true that the outputs that Alice and Bob see are correlated. I hope I've convinced you that Alice and Bob can't send any bits with these boxes. Despite this, it is pretty clear that the behaviour of the boxes is non-local. We'll call any kind of boxes that allow instantaneous long range correlations that can't be explained by purely local behaviour non-local boxes. Boxes that can't be used for message sending are called non-signalling local boxes. And the particular non-local box I describe above is called a PR box (eg. see here). (BTW As an aside note that as the box results in widely separated outputs that are correlated, but doesn't allow communication, it's an example of how non-locality doesn't imply communication. Usually when people want to give examples of such a thing they talk about quantum mechanics. But there's no need to mention quantum mechanics to explain the behaviour of these particular non-local boxes.) The methodAny single bit boolean function of a finite sequence of bits can be written as a polynomial modulo 2. Each monomial in the polynomial can be written as a product of terms involing just the and terms involving just the , ie. where depends only on the , depends only on the and is drawn from some finite set. Alice can compute the and Bob can compute the . Now Alice and Bob, in parallel, feed and respectively into their PR boxes. We know that we could evaluate each term in the sum we want by adding Alice's output to Bob's output. But that would require sending one one-bit message for each . But we don'[...]
What stops us defining Truth?

Sat, 12 Oct 2013 16:12:00 +0000

IntroductionRecall the standard cartoon sketch of the proof of Gödel's first incompleteness theorem. We start by defining a predicate, , that is true if and only if its argument is provable. (Or more accurately, is true if is the Gödel number of a provable proposition.) With some quining we can use this to construct the proposition which says . The proposition asserts its own unprovability. Suppose instead we define a predicate which holds if its argument is true. We can use this to construct the proposition which says . Then if is true it must also be false and if it's false then it must be true. We seem to have a paradox. The loophole is that we assumed the existence of the predicate . So this argument demonstrates that there is actually no such predicate. This is Tarski's undefinability theorem. But what exactly stops us defining ? What goes wrong if we attempt to define a predicate that analyses the parts of a proposition to tell us whether or not it is true? NoteThis article is written in English. But as is standard in much of mathematics, unless I state otherwise, I'm using English largely as shorthand for an argument that could, in principle, be written in the formal language of Set Theory. So I will allow myself to use all of the usual reasoning methods that are available in ZF, even when talking about other formal systems such as Peano Arithmetic. Defining Truth for Propositional CalculusSuppose we're given a proposition from propositional calculus like . We can use a syntactic approach to determining whether or not it is true. We determine whether or not is true, then whether or not is true, and then the whole proposition is true if both and are true. Similarly is true if either or is true. Of course and might themselves be compound propositions using , and . But that's fine, that simply means that to define truth for such propositions we need to employ recursion. In fact, we can straightforwardly turn such a definition into a recursive computer program. (Ultimately with propositional calculus we hit the leaves which are atomic propositions like . Typically when we ask about the truth of a proposition in propositional calculus we've already made an assignment of truth values to the atomic propositions. So the base case for the recursion is straightforward.) We can illustrate the process with a diagram: The truth value of a node in the tree is determined by the truth of the propositions hanging underneath it. We have a parent-child relation between a proposition and its subexpressions. Recursion allows us to make a definition by defining what happens on the leaves of such a tree, and by saying how the definition at a node is built from that of its children. Defining truth for Peano ArithmeticWe can go further and attempt this approach with Peano Arithmetic (PA). The catch is that we need to consider quantifiers. For example, consider this proposition from Peano arithmetic: . This proposition is true if and only if is true whatever number we substitute for in the expression. The proposition at the top of the tree above is true if all of the immediate children are true and their truth is in turn determined by the truth of the propositions immediately below them. With some work this eventually leads to a perfectly good definition of truth for propositions in PA. Because we have nodes with infinitely many children we don't get an algorithm guaranteed to terminate, but that's not a problem for a definition in ZF. Note that we don't literally prove the infinitely many child propositions one at a time. Instead what happens is that to define the truth of we define it in terms of the truth of some infinite family of propo[...]
Why Heisenberg can't stop atomic collapse

Mon, 15 Apr 2013 01:08:00 +0000

TL;DRA heuristic argument to show that hydrogen atoms are stable and have a minimum energy level is wrong. I will assume undergraduate level quantum mechanics in the discussion.IntroductionThere's a popular argument used to explain why atoms are stable. It shows there is a lower bound on the energy level of an electron in the atom that makes it impossible for electrons to keep "falling" forever all the way down to the nucleus. You'll find it not only in popular science books but in courses and textbooks on quantum mechanics.A rough version of the argument goes like this: the closer an electron falls towards the nucleus the lower its potential energy gets. But the more closely bound to the nucleus it is, the more accurately we know its position and hence, by Heisenberg's uncertainty principle (HUP), the less accurately we know its momentum. Increased variance in the momentum corresponds to an increase in kinetic energy. Eventually the decrease in potential energy as the electron falls is balanced by an increase in kinetic energy and the electron has reached a stable state.The problem is, this argument is wrong. It's wrong related to the kind of heuristic reasoning about wavefunctions that I've talked about before.Before showing it's wrong, let's make the argument a bit more rigorous.Bounding wavefunctionsThe idea is to show that for any possible normalised wavefunction ψ of an electron in a Coulomb potential, the expected energy is bounded below by some constant. So we need to show thatis bounded below whereand p is momentum.Consider a wavefunction that is confined mainly around the nucleus soThe first fact we need is that Heisenberg uncertainty principle tells us that (assuming we're in a frame of reference where the expected values of p and x are zero).If the wavefunction is spread out with a standard deviation of a then the electron is mostly around a distance a from the nucleus. So the second fact is that we can roughly approximate the expected value of 1/r as 1/a.Combine these two facts and we get, roughly, thatI hope you can see that the right hand side, as a function of a, is bounded below. The graph of the right hand side as a function of a looks like:It's now an exercise in calculus to find a lower bound on the expected energy. You can find the details in countless places on the web. Here a link to an example from MIT, which may have come directly from Feynman's Lectures on Physics.The problemThe above discussion assumes that the wavefunction is basically a single packet confined around a distance a from the nucleus, something like that graphed above. But if a lower energy state can be found with a different wavefunction the electron will eventually find it, or an even lower energy state. In fact, by using a wavefunction with multiple peaks we will find that the Heisenberg uncertainty principle doesn't give a lower bound at all.We'll use a wavefunction like this:It has a packet around the origin just like before but it also has a sharp peak around r=l. As I'm showing ψ as a function of r this means we have a shell of radius l.Let's saywhere ψ1 is normalized and peaked near the original and ψ2 is our shell of radius l. Assume no overlap between ψ1 and ψ2.In this case you can see that we can makeas large as we like by making l as large as we like while still leaving us free to make the central peak whatever shape we want. This means that the estimate of coming from HUP can be made as small as we like while making the central peak as close to a Dirac delta as we want. Informally, HUP controls of the overall sp[...]
Aliasing and the Heisenberg uncertainty principle.

Mon, 14 Jan 2013 00:59:00 +0000

TL;DRThe Dirac comb is an example of a wavefunction whose position and momentum aren't fuzzy.IntroductionThe Heisenberg uncertainty principle says that if you have a particle in some state and observe either its momentum or its position then the products of the standard deviations of distributions of the outcomes satisfy this identity:I think many people have a mental picture a bit like this:You can know the position and momentum with some degree of fuzziness and you can trade the fuzziness between the two measurements as long as the product of their sizes is larger than ℏ/2.Here's another way of thinking about that kind of picture (assuming some units I haven't specified):position=123.4???momentum=65?.???The idea is that the question mark represents digits we don't know well. As you move towards the right in the decimal representation our certainty in the accuracy of the digit quickly goes downhill to the point where we can't reasonably write digits.But this picture is highly misleading. For example, the following state of affairs is also compatible with the uncertainty principle, in suitably chosen units:position=...???.123...momentum=...???.654...In other words, it's compatible with the uncertainty principle that we could know the digits beyond the decimal point to as much accuracy as we like as long as we don't know the digits before the point. It trivially satisfies Heisenberg's inequality because the variance of the position and the momentum aren't even finite quantities.But being compatible with Heisenberg uncertainty isn't enough for something to be realisable as a physical state. Is there a wavefunction that allows us to know the digits to the right of the decimal point as far as we want for both position and momentum measurements?Sampling audio and graphicsMaybe surprisingly, the worlds of audio and graphics can help us answer this question. Here's what a fraction of a second of music might look like when the pressure of the sound wave is plotted against time:But if we sample this signal at regular intervals, eg. at 44.1KHz for a CD, then we can graph the resulting signal as something like this:The red curve here is just to show what the original waveform looked like. The black vertical lines correspond to regular samples and we can represent them mathematically with Dirac delta functions multiplied by the amplitude measured at the sample.There is a well known problem with sampling like this. If you sample a signal that is a sine wave sin(ωt) at rate f then the signal sin((ω+2πnf)t) will generate exactly the same samples for any integer n. The following illustration shows what might happen:The two waveforms are sampled at the same regular intervals (shown by vertical lines) and give exactly the same amplitudes at those samples.This forms the basis for the famous Nyquist-Shannon sampling theorem. You can reconstruct the original signal from regularly spaced samples only if it doesn't contain frequency components higher than half your sampling rate. Otherwise you get ambiguities in the form of high frequency parts of the signal masquerading as low frequency parts. This effect is known as aliasing. As a result, the Fourier transform of a sampled function is periodic with the "repeats" corresponding to the aliasing.In the audio world you need to filter your sound to remove the high frequencies before you sample. This is frequently carried out with an analogue filter. In the 3D rendering world you need to do something similar. Ray tracers will send out many rays for each pixel, in effect forming a much higher resolution image than the resolution of the final result, and that high re[...]
Shuffles, Bayes' theorem and continuations.

Sun, 30 Dec 2012 16:49:00 +0000

IntroductionBack in the 80s when I was a kid I came across a program for the BBC Micro that could tell what card you had picked from a deck of cards even though you'd buried your card within the deck wherever you wanted and had cut and shuffled the deck. I thought I'd try to implement a slightly more sophisticated version of the same trick that could handle multiple shuffles, and multiple types of shuffle. The idea is that we prepare a deck of cards in some known sequence and have someone pick out a card and place it at the top of the deck. They perform some kind of randomisation procedure on the cards, eg. cut and shuffle it a couple of times, and then you get to look at the final sequence of cards. Can we tell which card was picked out? Some probability theoryLet's formalise this a bit. Our decks will have cards. There is a small number of initial states our deck can be in, corresponding to the known sequence with a single card moved to the top. Let's label these initial states . There is a (usually large) number of permutations that could be applied through shuffling. We'll label these . We'll try to do arrange that this isn't simply the set of all permutations (though it's not necessarily a disaster if it is). We want to figure out the initial state given some final state . In other words we want We can use Bayes theorem to get: Now is the sum over all ways of starting with and ending up with . So where the sum is over all such that . I'm assuming that the shuffles are independent of the initial sequence of cards. This gives us an algorithm. We do a brute force simulation of every possible shuffle that we're considering applied to each possible initial state. After each shuffle we sum the corresponding probability for those shuffles that give our known final state . Each shuffle is going to be built up as the product of a sequence of building blocks with each block randomly selected based on what happened before. Let's call the blocks names like . So if then . As we work through the shuffle we will accumulate the probability. After the first block we have a probability of . The probability after the second is and so on. At any point we'll call the probability accumulated so far the importance. I've borrowed that name from the world of rendering because this algorithm has a remarkable similarity to recursive ray-tracing. Some computer scienceI'd like to be able to chain a sequence of shuffles. But wait! There's a catch! Today's the day I finally want to get around to checking out the lambda expression support in C++. I've been putting this off for years. (I'm using gcc 4.7.) So I'm not going to have a Haskell non-determinism monad to make life easy. Suppose I have two types of shuffle, type and type . I could easily write a loop to iterate over all shuffles of type , and in the innermost part of the loop I could call another loop over all shuffles of type . But then if I want to replace with I have to change the code to replace the inner part with code for . That's no good. I'd like to be able to replace the innermost part of the outer loop with any code I want without actually editing that part of the code. It's easy with lambda expressions. I write the type loop code so that it takes as argument a lambda function representing what I want done inside the loop. There's another way of looking at this. You can skip this paragraph if you don't care about the connection to Haskell. But in Haskell you might do something like this by using a non-determinism monad, or even a probability monad. But as I pointed out a while back, you can fake every monad using[...]
A pictorial proof of the hairy ball theorem

Mon, 19 Nov 2012 00:10:00 +0000

The hairy-ball theorem says that there is no continuous non-zero vector field on the surface of a sphere. There are lots of popular accounts that tell you what this means, giving great examples. Here's a Youtube video for example: allowFullScreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' width='320' height='266' src='https://www.youtube.com/embed/B4UGZEjG02s?feature=player_embedded' FRAMEBORDER='0' />My goal is to show why it's always true.A simply connected domain in the plane is one with the property that any loop in it can be shrunk down to a point. Here's an example of a domain D with an example loop L being shrunk down to a point P:Here's an example of a domain that's not simply connected. It has a hole in the middle. I've drawn a L loop around the hole. You can't shrink that loop to a point because the hole gets in the way:Here's a simply connected domain with a vector field on it:Think of the vectors as being drawn literally in the surface so that if we were to pick up the surface and stretch it like a piece of rubber the vectors would get stretched with it. Remember that a vector field is defined everywhere in the domain so the arrows are just a random sprinkling of examples to show what's going on. For this to be an accurate picture you want to imagine an infinity of arrows, one at every single point of the domain.Let's put a loop, starting and ending at P, in our simply-connected domain:Now imagine travelling along the loop, starting at P and ending at P. As you move along there's an arrow at each point in your journey. Here's what the arrows look like as you travel from P to P anti-clockwise, plotted as a kind of graph:The vectors start off pointing to the right. They swing anti-clockwise by about 45º and then swing back to where they started. As the journey is a loop they clearly must end where they started. A different, really swirly vector field, might have resulted in arrows that that rotated around hundreds of times along your journey. But by time you reach the end of the journey they must swing back to where they started. What's slightly less obvious is that they'd also have to rotate back to cancel out the hundreds of swings. You might think "the vectors could rotate round a hundred times but as long as they make exactly 100 turns they'll return to where they started and there's no need for them to unwind". But actually, every bit of rotation in the journey must be unwound. The total amount of rotation, adding all the positive rotations, and subtracting off the negative rotations, is called the winding number for the loop. We count anti-clockwise rotation as positive and clockwise as negative. So I'm claiming that the winding number for a closed loop in a simply-connected domain is always zero.(Note: in most books the winding number normally refers to how many times the loop itself winds around a point. I'm using it to refer to how many times the vector winds around itself you follow the loop. To help with your intuition: the hour hand of a working clock normally accumulates a winding number of -2 in one day. If it ran forward for a day, but then ran backwards for half a day, the winding number would be -1.)Here's why the winding number for simply connected domains must be zero: firstly - it's pretty clear that the winding number for any loop must be an integer. If the winding number was a half, say, the arrow wouldn't end up pointing 180º from where it started which makes no sense for a closed loop. Now the domain is simply connected, so the loop can be shrunk to a point. Now imagine doing the shri[...]
Generalised entropy

Sat, 07 Apr 2012 23:22:00 +0000

IntroductionThe entropy of a probability distribution can be seen as a measure of its uncertainty or a measure of the diversity of samples taken from it. Over the years I've talked lots about how probability theory gives rise to a monad. This suggests the possibility that maybe the notion of entropy can be generalised to monads other than probability. So here goes...> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, GeneralizedNewtypeDeriving #-}> {-# LANGUAGE FunctionalDependencies, TypeSynonymInstances #-}> import Control.Monad> import Control.Monad.Writer hiding (lift)Shannon entropyI've talked in the past about how there is some trickiness with defining the probability monad in Haskell because a good implementation requires use of the Eq typeclass, and hence restricted monads. Restricted monads are possible through a bunch of methods, but this time I don't want them.It's common to represent probability distributions on finite sets as lists of pairs where each pair (p, x) means x has a probability p. But I'm going to allow lists without the restriction that each x appears once and make my code work with these generalised distributions. When I compute the entropy, say, it will only be the usual entropy in the case that each x in the list is unique.So here's our type and some instances for it:> data P a = P [(a, Float)] deriving Show> instance Functor P where> fmap f (P xs) = P [(f a, p) | (a, p) <- xs]> instance Monad P where> return x = P [(x, 1)]> P xss >>= f = P [(y, p*q) | (pxs, p) <- xss, let P ys = f pxs, (y, q) <- ys]We can easily compute the expected value of a distribution, and its entropy, like this:> expectation0 (P xs) = sum [x*p | (x, p) <- xs]> entropy0 (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]An important property of entropy is known as the grouping property which can be illustrated through an example tree like this:The entropy for the probability distribution of the final leaves is the sum of two components: (1) the entropy of the branch at the root of the tree and (2) the expected entropy of the subtrees. Here's some corresponding code. First simple bernoulli trials:> bernoulli p a b = P [(a, p), (b, 1-p)]Now the branch at the root of the tree:> root = bernoulli 0.3 False TrueWe can compute the entropy for the distrbution on the leaves:> test1 = entropy0 $ do> x <- root> if x> then bernoulli 0.2 3 4> else bernoulli 0.4 5 6Or the sum of the root entropy and the expected subtree entropy:> test2 = entropy0 root + (expectation0 $ do> x <- root> if x> then return $ entropy0 (bernoulli 0.2 3 4)> else return $ entropy0 (bernoulli 0.4 5 6))You can confirm for yourself that test1 == test2.We can rewrite that a little. We're drawing True or False from root only to decide which distribution to use at the next stage. But we may as will pick the distribution itself at random. So define:> dist = bernoulli 0.3 (bernoulli 0.4 5 6) (bernoulli 0.2 3 4)And now we expect the equality of test3 and test4:> test3 = entropy0 $ do> x <- dist> x> test4 = entropy0 dist + (expectation0 $ do> x <- dist> return $ entropy0 x)There's a more elegant way of writing this. Define:> left0 dist = entropy0 (join dist)> right0 dist = entropy0 dist+expectation0 (fmap entropy0 dist)Now we expect left0 dist and right0 dist to always be equal. We've almost generalised to something that makes sense in the context of monads other th[...]
Overloading Python list comprehension

Sat, 17 Mar 2012 20:30:00 +0000

IntroductionPython is very flexible in the way it allows you to overload various features of its syntax. For example most of the binary operators can be overloaded. But one part of the syntax that can't be overloaded is list comprehension ie. expressions like [f(x) for x in y].What might it mean to overload this notation? Let's consider something simpler first, overloading the binary operator +. The expression a+b is interpreted as a.__add__(b) if a is of class type. So overloading + means nothing more than writing a function. So if we can rewrite list comprehensions in terms of a function (or functions) then we can overload the notation by providing alternative definitions for those functions. Python doesn't provide a facility for doing this directly, but we can at least think about what it might mean to do this. Later we'll see how to tweak the Python interpreter to make it possible.mapConsider the expression[a for x in y]Here the single letter variables are 'metavariables' representing fragments of Python code. To a good approximation this is equal to:map(lambda x: a, y)(BTW Everything I say here is "to a good approximation". Python is an incredibly complex language and I'm not good enough at it to make any categorical statements about when one fragment of code is the same as another.)So it's tempting to see list comprehensions as syntactic sugar for map, in which case one approach to overloading comprehension is to consider interpreting it in terms of replacements for map. But this isn't a very powerful overloading. It just gives us a slightly different way to write something that's already straightforward.concatMapAnother reason for not simply seeing list comprehension in terms of map is that nested list comprehensions need another operation. Consider[(y, z) for y in [1, 2] for z in ['a', 'b']]This isn't quite the same as[[(y, z) for z in ['a', 'b']] for y in [1, 2]]but it's close. The latter produces nested lists whereas the first gives one flat list. We can think of nested comprehensions as applying a flattening operation. Let's use list comprehension to implement flattening:def concat(xs): return [y for x in xs for y in x]We now write our nested comprehension as:concat([[(y, z) for z in ['a', 'b']] for y in [1, 2]])We know how to write non-nested comprehensions using map so we get:concat(map(lambda y: [(y, z) for z in ['a', 'b']], [1, 2]))And rewriting the inner comprehension we get:concat(map(lambda y: map(lambda z: (y, z), ['a', 'b']), [1, 2]))Every time we add another level of nesting we're going to need another concat. But the innermost map doesn't have a concat. Purely for reasons of symmetry we can ensure every map has a concat by enclosing the innermost element as a singleton list:concat(map(lambda y: concat(map(lambda z: [(y, z)], ['a', 'b'])), [1, 2]))Every map has a concat so we can simplify slightly. Let's define:def concatMap(f, xs): return [f(y) for x in xs for y in x]def singleton(x): return [x]Our expression becomes:concatMap(lambda y: concatMap(lambda z: singleton((y, z)), ['a', 'b']), [1, 2])Importantly we've completely rewritten the comprehension in terms of concatMap and singleton. By changing the meaning of these functions we can change the meaning of comprehension notation, or at least we could if the Python interpreter defined comprehension this way. It doesn't, but we can still reason about it. Although any comprehension that doesn't use ifs can be rewritten to use these functions, I won't give a formal description of the procedure. Instead I'll provide [...]
Using Lawvere theories to combine effects

Sat, 11 Feb 2012 15:47:00 +0000

> {-# LANGUAGE MultiParamTypeClasses, ExplicitForAll, RankNTypes, FlexibleInstances, FlexibleContexts, TypeSynonymInstances #-}> import Data.Monoid> import Data.Functor.Identity> import Control.Monad.WriterIn an earlier post I talked about how monads arise from free algebras. Let me recap a bit.In Part 1 I described algebras. They're sets with operations on them satisfying some laws. We can build new elements of an algebra from old ones by using its operations. Eg. if x and y are in an algebra then x `mappend` y must be in it too. Starting with a bunch of symbols, thought of as leaves, we can consider the set of all expressions trees we can build from them. If we consider pairs of trees to be equivalent if the laws say the corresponding expressions are equal, then the set of trees itself forms an algebra known as a free algebra (for the given theory).Let's start with some code. This type class says that the type b has leaves of type a:> class Free a b where> leaf :: a -> bEffects from monoidsNow we can make the type of all trees built from Monoid operations and including all leaves of type a:> data FreeMonoid a = FreeMonoid (forall b. (Monoid b, Free a b) => b)And we have:> instance Monoid (FreeMonoid a) where> mempty = FreeMonoid mempty> FreeMonoid a `mappend` FreeMonoid b = FreeMonoid (a `mappend` b)Unfortunately elements like e1 and e2 two ought to be equal but Haskell doesn't know this:> e1, e2 :: FreeMonoid Char> e1 = FreeMonoid (leaf 'a' `mappend` (leaf 'b' `mappend` leaf 'c'))> e2 = FreeMonoid ((leaf 'a' `mappend` leaf 'b') `mappend` leaf 'c')Instead we can manually construct a type that does respect equality in monoids. Elements of FreeMonoid are binary trees with a `mappend` at each node. Associativity means that we can always replace a tree with an equivalent one where the left branch is a leaf. We can also use the laws to eliminate any occurrence of mempty. So every element of FreeMonoid a is equivalent to one of the form:Leaf x1 `mappend` (Leaf x2 `mappend` (... mempty))In other words, free monoids are lists. We can make this explicit. The standard prelude already makes [] an instance of Monoid so we just need:> instance Free a [a] where> leaf x = [x]Here's the isomorphism (modulo tree equivalence):> iso1 :: FreeMonoid a -> [a]> iso1 (FreeMonoid x) = x> iso1' :: [a] -> FreeMonoid a> iso1' [] = FreeMonoid mempty> iso1' (a : as) = let FreeMonoid r = iso1' as> in FreeMonoid (leaf a `mappend` r)As I talked about in that earlier article, free algebras give monads and the trees representing expressions in the algebra can be thought of as abstract syntax trees for domain specific languages. In this case it's the usual list monad. So the Monoid type class gives us a language for talking about non-determinism. The operation mappend gives us a way to "fork" a process and mempty gives as a way to "kill a thread". Here's an example using non-determinism to search for some Pythagorean triples:> test1 :: [(Int, Int, Int)]> test1 = do> a <- return 3 `mappend` return 4> b <- return 4 `mappend` return 5> c <- return 5 `mappend` return 6> if a*a+b*b==c*c then return (a, b, c) else memptyEffects form M-setsWe can do exactly the same for -sets.> class Monoid m => MSet m s where> act :: m -> s -> s> data FreeMSet w a = FreeMSet (forall b. (MSet w b, Free a b) => b)> instance Monoid w => MSet w (Fre[...]
Lawvere theories made a bit easier

Sun, 05 Feb 2012 15:57:00 +0000

IntroductionI still don't think anyone has found a completely satisfactory way to combine effects in Haskell. (That's computational effects, not visual effects.) Monads are great for one kind of effect at a time, but it's not clear how to combine two arbitrary monads. Instead of monads we can work with monad transformers, but they are tricky both to implement and to use.I want to sketch a different, though incomplete approach to combining effects. There are a bunch of papers that describe this approach, and even some code that implements part of it. Almost everything I say is from a paper by Hyland and Powers that I read a few years ago though I recently ran into this helpful answer by Andrej Bauer on mathoverflow. Even if we don't get code we can immediately use, we still get a good way to think about and analyse effects.I'll get onto the effects part in another post. This one concentrates on what are known as Lawvere theories.Monoids> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, FlexibleContexts, ExplicitForAll #-}> import Data.MonoidThe (finite) lists of integers form a monoid. Here are some functions that operate on such lists:> f1 xs = map (+1) $ reverse xs> f2 xs = xs `mappend` xsThey can both be given signature [Integer] -> [Integer]. But there's one very important difference between these functions. f2 has been written using only operations from the type class Monoid. It's a sort of universal function that can be applied to any monoid. On the other hand, the function f1 can only applied to very specific monoids. In fact, the type signature of f2 can be written as:> f2 :: forall a. Monoid a => a -> aThat type signature essentially says that we're not going to do anything with elements of a except use the interface defined by the Monoid type class.(Although Haskell type classes can't enforce them, we also assume that any instance of Monoid satisfies the axioms for a monoid.)We can also define functions on tuples of monoids. Here are some examples:> g1 :: forall a. Monoid a => (a, a, a) -> (a, a)> g1 (xs, ys, zs) = (mempty, ys `mappend` xs)> g2 :: forall a. Monoid a => (a, a) -> (a, a)> g2 (xs, ys) = (xs `mappend` ys, ys `mappend` xs)Notice that we can compose these functions. So we haveg2 . g1 :: forall a. Monoid a => (a, a, a) -> (a, a)We also have have identity functions for tuples. Armed with functions, identities and compositions can now form a category that I'll call . I'll call the (distinct) objects of this category , , and so on. The arrows from to are the total functions of type . So, for example, g1 is an arrow from to . Note that it doesn't matter what the objects are (as long as they're distinct). They're just placeholders between which we can string our arrows. Note how because of the universal quantifier, the functions we use have a type not of the form A -> B. So we can't represent the objects of our category as types in the usual way. We can think of mempty as a 0-ary operator, ie. an element of forall a. Monoid a => () -> a.But there's one more detail I want. I'll consider two arrows to be equal if they can be proved equal using the axioms for monoids. For example, these two Haskell functions represent the same arrow in because of the associativity law:> h1 (x, y, z) = (x `mappend` y) `mappend` z> h2 (x, y, z) = x `mappend` (y `mappend` z)We now have a bona fide category. Note that contains lots of arrows. Anything you can build us[...]
Some parallels between classical and quantum mechanics

Sat, 21 Jan 2012 21:38:00 +0000

IntroductionThis isn't really a blog post. More of something I wanted to interject in a discussion on Google plus but wouldn't fit in the text box.I've always had trouble with the way the Legendre transform is introduced in classical mechanics. I know I'm not the only one. Many mathematicians and physicists have recognised that it seems to be plucked out of a hat like a rabbit and have even written papers to address this issue. But however much an author attempts to make it seem natural, it still looks like a rabbit to me.So I have to ask myself, what would make me feel comfortable with the Legendre transform?The Legendre transform is an analogue of the Fourier transform that uses a different semiring to the usual. I wrote briefly about this many years ago. So if we could write classical mechanics in a form that is analogous to another problem where I'd use a Fourier transform, I'd be happier. This is my attempt to do that.When I wrote about Fourier transforms a little while back the intention was to immediately follow it with an analogous article about Legendre transforms. Unfortunately that's been postponed so I'm going to just assume you know that Legendre transforms can be used to compute inf-convolutions. I'll state clearly what that means below, but I won't show any detail on the analogy with Fourier transforms.Free classical particlesLet's work in one dimension with a particle of mass whose position at time is . The kinetic energy of this particle is given by . Its Lagrangian is therefore .The action of our particle for the time from to is thereforeThe particle motion is that which minimises the action.Suppose the position of the particle at time is and the position at time is . Then write for the action minimising path from to . Sowhere we're minimising over all paths such that .Now suppose our system evolves from time to . We can consider this to be two stages, one from to followed by one from to . Let be the minimised action analogous to for the period to . The action from to is the sum of the actions for the two subperiods. So the minimum total action for the period to is given byLet me simply that a little. I'll use where I previously used and for . So that last equation becomes:Now suppose is translation-independent in the sense that . So we can write . Then the minimum total action is given byInfimal convolution is defined byso the minimum we seek isSo now it's natural to use the Legendre transform. We have the inf-convolution theorem:where is the Legendre transform of given byand so (where we use to represent Legendre transform with respect to the spatial variable).Let's consider the case where from onwards the particle motion is free, so . In this case we clearly have translation-invariance and so the time evolution is given by repeated inf-convolution with and in the "Legendre domain" this is nothing other than repeated addition of .Let's take a look at . We know that if a particle travels freely from to over the period from to then it must have followed the minimum action path and we know, from basic mechanics, this is the path with constant velocity. Soand hence the action is given bySo the time evolution of is given by repeated inf-convolution with a quadratic function. The time evolution of is therefore given by repeated addition of the Legendre transform of a quadratic function. It's not hard to prove that the Legendre transform of a quadratic fu[...]
Lossless decompression and the generation of random samples

Sat, 07 Jan 2012 17:40:00 +0000

Let S be some finite set with a probability distribution on it. Here's a diagram showing some example probabilities for the set S={A, B, C, D, E}:How can we generate lots of random samples from this distribution? A popular method involves first storing the cumulative distribution function in a table like so:We then use any of a number of popular methods to generate uniform pseudorandom numbers in the range [0,1) and for each one walk through the table until we find the first entry greater than our pseudorandom number. The symbol above the number is the one we generate. So if we generated 0.512, we'd pick symbol C. It's straightforward to prove this gives the correct probability distribution.As described this algorithm can be slow. If the size of the table is N we may have to walk through up to N entries to generate each sample.One approach to accelerating this algorithm is to quantise our pseudorandom number and use it to look up, in a precomputed table, a jump into our cumulative distribution table. I've used this several times in my visual effects career. But today I'm going to take a different approach.Another natural way to speed up sample generation is to use a binary search to find the appropriate point in the cumulative distribution, for example the C++ standard template library's upper_bound method will do the job.But what is a binary search algorithm going to do? Typically it's going to start by comparing our pseudorandom number with the midpoint of the table. If our number is bigger then it will recursively use binary search on the left (looking next at the midpoint of the left half), otherwise on the right, and so on. If we're generating many samples from the same distribution we're going to be repeatedly looking up the same midpoints in the table. At the end of the day, the process can be described by a decision tree. So we may as well throw away the table and build the decision tree up front. Here's what it might look look like:But maybe we can do better. For example C is three times as likely as A but they both take the same amount of time to generate as they both require walking down to depth 2. Meanwhile D has a probability of 0.25 and requires walking to depth 3. We'd like to rebalance the tree. Note also that there's no reason to list our original PDF in the order I gave. Different orderings might give different trees.It's straightforward to describe what we want to optimise. We want to place sample i at depth di so that the expected value of di is as small as possible. In other words we want to minimise Σpidi. But this is precisely the problem solved by Huffman coding.And that's the conclusion I wanted to reach: Huffman coding gives the optimal decision tree to generate random samples. It also tells us this interesting consequence: if we use a decision tree method then the performance of our algorithm is bounded by the entropy of the probability distribution. I find this connection between entropy and algorithmic complexity pretty surprising.I learnt the above during my interview at Google!Why is there this connection between a compression algorithm and the generation of random samples? It took me a little longer to realise why but it's quite straightforward.Huffman coding tries to compress text one letter at a time on the assumption that each letter comes from some fixed and known probability distribution. If the algorithm is succe[...]