Subscribe: /~colmmacc/
http://www.stdlib.net/~colmmacc/feed/atom/
An Irishman's Fiery



Updated: 2011-05-15T17:01:26Z

 



Weighty matters

2011-05-15T17:01:26Z

One of my favourite programming exercises is to write a text generator using Markov Chains. It usually takes a relatively small amount of code and is especially useful for learning new programming languages. A useful goal is to be able to have a tool that ingests a large body of text, breaks it into N-tuples (where N is chosen by the operator), and then emits a random text of length L where that text uses those tuples at the same frequency that the original body did.

On its own, that task can produce some fun results, but it’s also a very repurpose-able technique. You can use this kind of statistical generation to simulate realistic network jitter (record some N-tuples of observed RTT with ping), or awesome simulated-user fuzz tests (record some N-tuples of observed user inputs). It’s surprising that it isn’t more common.

But when approaching these problems, from experience of working with newcomers, what seems to be a common first tripping point is how to do weighted selection at all. Put most simply, if we have a table of elements:

    element   weight
    A         2
    B         1
    C         1

how do we write a function that will choose A about half the time, and B and C about a quarter each?

It also turns out that this is a really interesting design problem. We can choose to implement a random solution, a non-random solution, a solution that runs in constant time, a solution that runs in linear time and a solution that runs in logarithmic time. This post is about those potential solutions.

Non-random solutions

When coming to the problem, the first thing to decide is whether we really want the selection to be random or not.
One could imagine a function that tries to keep track of previous selections, for example:

    elements = [ { 'name' : 'A' , 'weight' : 2 },
                 { 'name' : 'B' , 'weight' : 1 },
                 { 'name' : 'C' , 'weight' : 1 } ]

    def select_element(elements, count):
        # "count" cycles through 0 .. total_weight - 1, so the modulus
        # must be the sum of the weights (4 here), not the number of
        # elements.
        total_weight = sum(element['weight'] for element in elements)
        cumulative = 0
        for element in elements:
            cumulative += element['weight']
            if count < cumulative:
                return ( element,
                         (count + 1) % total_weight )

    # select some elements
    c = 0
    (i, c) = select_element(elements, c)
    (j, c) = select_element(elements, c)
    (k, c) = select_element(elements, c)

This function uses the “count” parameter to remember where it left off last time, and runs in time proportionate to the total number of elements. Used correctly, the first two times it’s called this function will return A, followed by B, followed by C, followed by A twice again and so on. Hopefully that’s obvious.

There’s a very simple optimisation possible to make it run in constant time, just sacrifice memory proportionate to the total sum of weights:

    elements = [ { 'name' : 'A' },
                 { 'name' : 'A' },
                 { 'name' : 'B' },
                 { 'name' : 'C' } ]

    def select_element(elements, count):
        return ( elements[count],
                 (count + 1) % len(elements) )

This basic approach is surprisingly common; it’s how a lot of networking devices implement weighted path selection.

Random solutions

But if we’re feeding into some kind of simulation, or a fuzz-test, then randomised sel[...]
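The simplest randomised variant follows the same cumulative-weight walk: draw a point in [0, total_weight) and return the element whose interval contains it. A minimal sketch of that linear-time approach (the function name here is illustrative, not from the original post):

```python
import random

elements = [ { 'name' : 'A' , 'weight' : 2 },
             { 'name' : 'B' , 'weight' : 1 },
             { 'name' : 'C' , 'weight' : 1 } ]

def weighted_choice(elements):
    """Pick an element with probability proportional to its weight.
    Runs in time linear in the number of elements."""
    total_weight = sum(element['weight'] for element in elements)
    # Draw a point in [0, total_weight) and walk the cumulative
    # weights until we find the interval containing it.
    point = random.uniform(0, total_weight)
    cumulative = 0
    for element in elements:
        cumulative += element['weight']
        if point < cumulative:
            return element
    return elements[-1]  # guard against floating-point edge cases
```

Over many draws this returns A about half the time, and B and C about a quarter each; precomputing the cumulative weights into a sorted prefix-sum array and bisecting it gives the logarithmic-time variant.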



Prime and Proper

2010-09-02T06:24:59Z

A common software engineering pattern is the need to test elements for set membership. Practical questions like “is this item in my cache” arise surprisingly often. Bloom Filters are a great mechanism for performing this test, and if you haven’t encountered them before – they’re well worth knowing about. But over the last year, I’ve been experimenting with a different method for these kinds of membership tests on enumerated sets, and thought it was worth sharing. It’s definitely not the first time the technique has been used, Cian found some papers on its use. But it’s not a common pattern, which is a pity – because it’s incredibly simple, and lends itself very well to modern hardware.

First things first, some huge disclaimers:

The approach only works on enumerated sets. Each element in the problem space has to be assigned a unique counted number. Hashes won’t do.
The method is constrained by the size of the product of the total number of elements and the maximum size of any set.

So, it’s very simple: take each element in the problem space, assign it a unique prime number, and then represent sets as the products of those primes. This works because of the fundamental theorem of arithmetic. Here’s a summary of operations:

Computing a set: compute the product of all of the elemental primes in the set. s = a * b * c
Testing for element membership: test the modulus of the set number. (s % a) == 0
Adding to the set: multiply the existing set product by the new element’s prime number. s *= d
Removing from the set: divide the existing set product by the element’s prime number. s /= a

Now at this point, you’re probably skeptical about the usefulness of the technique, given the constraints. Obviously other operations like unions and intersections between sets are possible, but they require factorisation – and so are not particularly efficient (though you’d be surprised how quickly they do run). But look at the benefits: unlike Bloom filters, there are zero false positives, and zero false negatives. The method is 100% precise. And due to their use in cryptography, libraries, CPUs and other hardware have arisen that are highly efficient at computing the products and moduli of very large numbers.

As a case-in-point, let’s use the internet as an example. Suppose you want to model the internet as a system of interconnected autonomous systems and paths. A really common test when using such models is to determine whether or not a particular system is on a particular path. If we take the internet as 100,000 autonomous systems, and the longest practical path on the internet as containing 20 such systems (longer paths exist, but are oddities – in real terms these numbers are both over-estimates), the largest product we would ever need to compute is smaller than 2^410. That’s much smaller than even modern cryptographic primes themselves, and much much smaller than their products. There is a lot more room.

Actually, that was my first real-world use-case for this technique – modeling the internet in a few hundred lines of python, on a laptop – as a prototype. Surprisingly, a Macbook Air could perform well over one million such membership tests per second without skipping a beat – after reordering to assign the most popular AS numbers the lowest prime-numbers. And incidentally, if that kind of problem interests you – Cloudfront are hiring.

Now a question, for the information theorists: how is it that this method can seemingly perfectly encode N bits of information in fewer than N bits? For exa[...]
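The operations above take only a few lines in any language with big-integer support; Python’s arbitrary-precision ints keep the products exact at any size. A minimal sketch (the `primes` helper and the element names are illustrative, not from the post):

```python
def primes(n):
    """Return the first n primes (simple trial division; fine for small n)."""
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

# Assign each element of the enumerated problem space a unique prime.
elements = ['a', 'b', 'c', 'd']
prime_of = dict(zip(elements, primes(len(elements))))  # a->2, b->3, c->5, d->7

# Computing a set: the product of its members' primes.
s = prime_of['a'] * prime_of['b'] * prime_of['c']

# Testing for membership: does the element's prime divide the set product?
assert s % prime_of['a'] == 0   # 'a' is a member
assert s % prime_of['d'] != 0   # 'd' is not

# Adding and removing: multiply and divide. Integer division stays exact,
# by the fundamental theorem of arithmetic.
s *= prime_of['d']
s //= prime_of['a']
```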



Fact-checking Ireland Offline

2010-03-29T19:43:33Z

One of the saddest things about Ireland and technology is the degree to which the incumbent telco, Eircom, has been permitted to dominate and skew the market. The privatisation of Eircom was very very badly designed, effectively creating a monopoly for important national infrastructure. There have been multiple legal cases between Eircom and the Regulator, and many many complaints from consumers and competitors, but over the years little has happened.

Sadder still, and not unrelated, is that the most prominent activist group on this issue, Ireland Offline, have consistently been one of the most ineffective, maladministered bunches of ill-informed whinge-baggers going. Despite media prominence, many many column inches, a strong membership, public meetings, and more, I can’t think of a single achievement Ireland Offline could point to as a success. Some of their goals may have been achieved over the years, but only after geological delay and through coincidence. On the contrary, I can think of many many falsehoods, myths and bizarre theories the group has promoted. My favourite being the simple falsehood that the majority of schools in Ireland have satellite broadband connectivity.

But the latest effort takes the prize. In an anonymous, unattributed blog-post with no contact details; http://irelandoffline.org/2010/03/eircom-and-next-generation-broadband/, Ireland Offline are making some very strange claims.

    eircom today have announced their “Next Generation Broadband” promising speeds of up to 8MB to all subscribers, they claim to have also finally addressed the contention issue (which they call congestion). What is not in the headlines is the subtle but important change from a speed based model to a usage based model of charging.

In reality, there has always been a usage-based model of charging. In fact this is why Ireland Offline was founded, when Esat decided to place caps on their “no limits” service. Also in reality, the real costs of operating a network are aligned with usage and are not fixed. Anyone with even a modicum of familiarity with how networks and economics work can realise that the only long-term sustainable business models for Internet consumption are going to be usage-based. The only alternative would be for the light users to effectively subsidise heavier ones. Does that really make sense? Is that equitable? Is there really a solid case for internet socialism? Complaints about the pricing are justified, but the model? This kind of demand is very alienating, and in place of such childishness, it would be much better to see a concerted economic argument for better broadband. What is the lack of good broadband doing for business, and so on.

    The only thing “Next Generation” about this product is the charging policy. This is a 200% increase on the average user’s bill for those who can least afford it. This product is only Dublin based for the moment.

Nowhere can I find any evidence of this 200% increase. My bill would decrease; I don’t know if I’m an average user or not, but I’d have thought I am.

    This is yet another attack on the hard pressed telecommunications consumers of Ireland. Not only do they have to endure the highest line-rental on the planet they now have to endure a €2 charge for each Gigabyte they go over their paltry caps. For instance the “Broadband Basic” has a minuscule 10GB cap barely enough for the average “tech savy” family. In reality this package will cost €50 for all but the lightest users. That’s €50 + line rental making the total €75.36 per month. This is hardly a giant leap forward it’s more li[...]



Holy Orders

2010-01-03T15:14:14Z

So, yesterday marked the day that the new Blasphemy law came into force in Ireland. It also marked the day when Atheist Ireland published 25 “blasphemous” quotes, in a supposed act of defiance. Atheist Ireland have gone about it in a very strange way; the url in their blog post is not a hyperlink, and the quotes aren’t simply in their post or on their front page. That’s their prerogative, of course, but I do wonder if it’s a sign of overt caution. Either way, the circus seems to have worked, and CNN and the BBC have both picked it up, among many more. As a press-stunt, it’s genius. Getting the attention of the international media like that is not easy, and through careful choice of celebrities to quote, and the right tone, Atheist Ireland have pulled off a well-executed PR coup.

But as an act of online advocacy, and of effecting political change, frankly it’s stupid. And I think that Atheist Ireland will have approximately zero success. Such is the magnitude of their “not getting it” that they are probably forever doomed to an existence of committeeism and tokenism in near-equal measure. Firstly, the action itself is ineffective; it does not – in my opinion – constitute any kind of an offence. Let’s go to the source, the 2009 Defamation Act:

    (2) For the purposes of this section, a person publishes or utters blasphemous matter if—
    (a) he or she publishes or utters matter that is grossly abusive or insulting in relation to matters held sacred by any religion, thereby causing outrage among a substantial number of the adherents of that religion, and
    (b) he or she intends, by the publication or utterance of the matter concerned, to cause such outrage.

    (3) It shall be a defence to proceedings for an offence under this section for the defendant to prove that a reasonable person would find genuine literary, artistic, political, scientific, or academic value in the matter to which the offence relates.

Clearly there is political, scientific and academic value in Atheist Ireland’s statements. If they really wanted to engage in civil disobedience, it would take something akin to cartoons of the prophet, getting it on in a threesome with Buddha and Jesus while being bitten by some vampire Jehovah’s Witnesses. That’s what the law was designed to impede. To be clear, I don’t agree with that law, I think freedom of speech is more important than a bizarre right not to be offended, but the law simply isn’t of the nature that Atheist Ireland and others imply.

I’m not sure if it’s willful misrepresentation, but by conveying the impression that the new law is designed to outlaw such trivial, and relatively innocuous, statements of the sort that Atheist Ireland quotes, it is only serving to undermine Atheist Ireland’s own credibility. In light of the actual facts, it’s hard to take them seriously on the issue. Sacrificing integrity and accuracy for punchy PR is never a great idea, long term.

To be fair to Atheist Ireland, they don’t actually claim that their statements are prosecutable, and are making the broader (but again, tokenistic) point that the law is simply silly and unworkable. Great, that’s an important point – but what use is making so much noise about it? This actually entrenches politicians and makes them less and less likely to respond favourably. It’s ironic that this is the kind of reactionary “we must be listened to” tokenism that led to the law in the first place. PR-led advocacy groups, which are really pressure groups, almost never work in Ireland. There simply isn’[...]



Point Break

2009-12-28T21:12:52Z


So, about 2 minutes after taking this photo;

I slipped on the paraffin-laden solid granite pavement, everything went flying, and I ended up getting a very different kind of photo entirely;

(image)

That’s my right arm, and somewhere in there a minimally-deforming radial fracture that I don’t remotely have the training to actually see. Though I definitely don’t like the look of that suspiciously lumpy bit of bone. Here’s another look;

(image)

As fractures go, I think I got the best kind. It’s now 3 weeks later and I’m almost back to a full range of movement and strength is starting to return. I can write, type, take photos and most importantly, play music once again. The people at St. James’s Hospital have been very good, and if you ever plan to break anything I can recommend doing it near them.

Science is really really cool. Randomised controlled trials have shown that a simple millennia-old sling (not a cast) is the best treatment, so that’s what I got. About a century of rigorous bio-chemical engineering has led to little pills I could buy in a pharmacy that magically suppress complex sources of pain with minimal side-effects.

Over that same century we’ve progressed from a naive understanding of “Röntgen radiation” as mysterious emanations from a vacuum tube, to a complex quantum-mechanical model of X-Ray interactions that allows us to record these highly-ionising photon jet-streams on a semi-conductor, convert the impression into a digital image and then have it pop up on my clinician’s desktop. Right now, reading a random blog on the internet, you can see inside my body – as easily as you might look out the window.

And social science is used too. Because people frequently forget their appointments, there are text message reminders; a few days, and one day before anything that’s scheduled. Again, a trial has shown that this is both cost-effective and clinically beneficial. And when I go to the physio-therapy clinic, I get given a simple set of tried and tested exercises that have been shown to lead to improvements.

Cooler still is that despite all of the science, and care and attention involved in the whole process – really it’s the body doing the work. Through some magic – mostly unknown – system of DNA signaling, controlled protein unfolding, stem-cells and “stuff”, tiny microscopic organic “bits” somehow communicate and coordinate the building of whole new bone, mostly in the right place. So without having to do all that much, I can play Fussball and write dumb blog entries again.

It’s almost worth doing just for the experience.




Con-science

2009-12-16T17:43:38Z

I’m sitting on a train, the Enterprise, moving along at about 140 kilometers per hour between Drogheda and Dublin. Something is causing practically every particle in my body to spark into a new life at a different position in space one instant to the next. For each one of those particles, there’s some chance it could just pop into life on the far side of the moon, or even the universe. But something, I guess “motion”, is averaging everything out and look, there the sum of me pops, every tiny instant getting slightly closer to Dublin.

If the train was in the vacuum of space – or somehow free from the effects of friction – it wouldn’t even take anything to keep this atomic leap-frogging going. Each small micro-part of me would just keep on dancing. And somehow at the same time, it’s exactly as if I were staying still all along – there’s no real difference. This has been tested. But if for any reason a change of tempo was required, to move from a waltz to a samba, something intangible and ephemeral – mysteriously lumped into the word “energy” – is required. It’s weird and mystical and it doesn’t make a whole lot of sense, but it can be measured and described – and relied upon.

And as I sit here and write, using the word “I” like that, it feels like there’s a “me”. It seems as if there’s a real sense of ownership over my thinking, I can’t tell where my thoughts come from, but they are mine. I feel like I’m free to choose things, when I get off of the train I could get straight into a taxi or walk to a tram.
I don’t know which I’ll do yet, there are a lot of factors to consider, but neither choice seems pre-ordained, and once it happens it will feel like “I” owned that choice. “Free will” seems as real to me as the forward motion of this train, it’s an inexplicable dance, but it’s my tune. And that makes less sense.

Reality is describable by all of these symbols and equations, and we can make predictions with them. This has been tested. There are even divisions between what is predictable in detail, and what is predictable only in the aggregate, random and unknowable at the finer grain. Here in my head, it doesn’t seem like my thoughts are predictable, or could be. If they are just an inevitable cadence, then that “I” is merely an illusion. It also doesn’t seem like thoughts are random or unknowable, to me anyway. I have patterns of thinking, recurring themes, and a detectable personality. It can’t be the same dance.

How many miracles are there on this train? Motion is miraculous enough, that “I” can think, that the universe even exists seems miraculous too. But most astonishing, is that these simple realities are seemingly contradictory. I’m still on that train, I haven’t gone anywhere, there haven’t been any angels visiting me, or deitious interventions. These thoughts haven’t led anyone to complicated dogmas, wars or ceremonies.

Some say science is cold. Maybe the dance is a myth. And if this mental dance should one day just end, where are the real love-songs? Songs that speak to the real meaning, the genuine warmth, of spending a fleeting, passing, bittersweet whirl around the floor with someone. Not an infinitesimal slice of eternity living in hope of a better dance; that’s cold. This has been tested. [...]



Role Models

2009-12-13T20:24:14Z

Consciously, I’ve never been keen on the idea of role-models. Thinking it synonymous with hero-worship, it has always seemed a bit of an anti-pattern to me. Why try to emulate anyone? There are enough people in the world behaving the same as someone else; being different and original is definitely more useful, even if it makes you a bit crazy. When I did a dubious “leadership style” test I came up as “anti-follower”, so maybe it’s just another form of contrarianism on my part.

Over time, I’ve found that the best way to learn is by example, even if it’s a process of unconscious osmosis. And when I’ve spent time on what is sometimes called “personal development” I’ve found that there is real benefit in reading the biographies and the writings of truly awesome people. It certainly seems more productive than reading self-help books that are written in truisms and marketing crap. I thought I’d share some of the people who I’ve really benefited from reading about, truly amazing people.

Richard P. Feynman

RPF is a legend; a Nobel laureate physicist with an uncanny ability to explain complex ideas, an anti-authoritarian maverick who loved to screw with officialdom, but most of all an incredibly generous, warm, loving guy (even if a womaniser at times). His writings on physics and his letters to his first, dying, wife are an inspiration.

Robert M. Pirsig

Pirsig, someone genuinely crazy enough to have been institutionalised, still managed to write one of the best sellers of the 20th century and to invent a philosophical system that many consider to have merit. ZMM is amazingly well written, all the more so when you consider that every paragraph was planned out in advance on index cards. Worryingly, his narrator in ZMM is the only literary character I’ve ever strongly identified with.

Grace Hopper

Grace Hopper signed up for the US Navy during World War 2, and rose (primarily as a reservist) to the rank of commodore/rear-admiral back when this was incredibly unusual. But more than this, she was an excellent experimenter, and kept a rigorous lab-book, despite being mainly a computer scientist. She was a strong believer in getting things done, and coined the phrases “dare to do” and “It’s easier to ask forgiveness than it is to get permission”. Seriously awesome woman. Oh yeah, and she invented the compiler.

Doc Watson

Doc Watson has been blind nearly his entire life, but that doesn’t stop him from being the truly most amazing guitar picker the world has ever seen, or doing crazy things like mending the tiles on his roof. His solo runs and accompaniment are incredibly good, and he’s somehow maintained humility in the face of multiple Grammy awards and playing for the president on a regular basis. Another doer, he just kept going and became more productive after the tragic death of his duet partner and son Merle.

Dolly Parton

Dolly is a self-described misfit, but she is also a very very shrewd businesswoman as well as being a dedicated humanitarian and gifted songwriter. She is one of the really great singers, and is emotionally invested in every song she sings (even the ones that sound like bubblegum; listen to how sad she is in “Here you come again”).

CP Snow

CP Snow was basically a troll, but a very very good one. His arguments, lectures and writings weren’t always rigorous and balanced, but they were always enlightening, thought-provoking and forward thinking. Most famously he ident[...]



Period Pain 3

2009-11-26T22:54:36Z

As promised, though it’s been a while coming, I wrote that there’d be a followup on scheduling periodic tasks. The most important point to get is that for nearly all real-world use cases the actual time that a “scheduled” task runs at doesn’t matter. Tasks that have to occur at a specific time on Tuesday are vanishingly rare. cron is one of the most abused tools going; rather than encode specific times, it would make more sense to let a scheduler decide when the tasks should be run based on criteria such as overall network and system load. It’s not even hard. There are plenty of scheduling algorithms and theory to borrow from, and many large organisations even have private implementations that let you be a bit more fuzzy about when tasks are run.

But there’s another level to the periodicity problem that is worth thinking about. Rather than simply using numbers and values that come readily to humans, it can be worth putting more effort and research into the values of the periods themselves. This isn’t meant in some fetishistic sense. Yes, for say virus updates, it’s possible to produce a gigantic linear algebra equation, with 100s of parameters, that would balance the likelihood and cost of a security breach against the cost and frequency of checking for updates, and it would come out with some answer, but that’s a lot of work for little gain. More interesting, and more tractable, are the effects that arise when multiple periodic tasks coincide. These are really common in distributed systems, and a real pain to debug and diagnose.
It could be as simple as the case we’ve been looking at; a cron job that runs once a day, but across many systems, or it could be as complex as a full-blown peer-to-peer app that’s got a control loop with multiple peers, a supernode or two and a user-interface polling loop. And a pattern that’s repeated over and over again is that people choose “convenient” values for the periods – and these choices are so common that when the periods end up in phase with each other we get constructive interference and elevated load events when the tasks coincide. Take for example 3 loops – one with a period of 5 minutes, one with 10 minutes and one with 30 minutes. If the loops end up in phase then every 30 minutes we have all three tasks running at once.

It’s a mess, and it’s an easy one to prevent – use prime numbers for the values of the periods; at least that way the number of coincidental events is minimised, and if any load events show up with a periodicity it is very straightforward to identify what single event, or combination of events, should be responsible. I can easily imagine a cron replacement that runs exactly like this, but never get around to writing it.

And these sorts of loops show up in places you might not necessarily think of. Caches are a good example. If you serve every piece of content on your website with the same Max-Age, then you can expect a thundering herd of requests whenever a browser or proxy expires them all at the same time. On the other hand, if you use prime-number cache lifetimes for each resource, you’ll get a much more nicely staggered and spread-out series of requests. It’s a really simple, neat optimisation. Tuning things doesn’t have to be hard. [...]
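The interference arithmetic is easy to check: loops that start in phase all fire together at every multiple of the lcm of their periods, so "convenient" periods collide constantly while near-prime periods push the full pile-up out for many hours. A quick sketch (helper names are mine), assuming minute-granularity ticks:

```python
from functools import reduce
from math import gcd

def lcm(*periods):
    """Least common multiple: the interval at which in-phase periodic
    tasks all fire at once."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def coincidences(periods, horizon):
    """Count the ticks in [1, horizon] where every loop fires together,
    assuming all loops started in phase at t=0."""
    return sum(1 for t in range(1, horizon + 1)
               if all(t % p == 0 for p in periods))

day = 24 * 60  # minutes

# "Convenient" periods: all three tasks pile up every 30 minutes.
print(lcm(5, 10, 30), coincidences([5, 10, 30], day))    # 30 48

# Prime periods of similar size: one triple pile-up per 1001 minutes.
print(lcm(7, 11, 13), coincidences([7, 11, 13], day))    # 1001 1
```

Forty-eight triple-load events a day versus one: the prime periods don't eliminate coincidences, they just spread them so thin that any remaining periodic load spike is easy to attribute.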



Period Pain part 2

2009-09-27T12:52:45Z

Last week I wrote about problems with periodicity but it was only half of the problem. But before moving on to the second half, it seems like a good time to post some clarifications. I wrote that using some locally unique well-distributed value, such as a MAC address, was better than choosing a random number once. But crucially, I left out how to do such a thing. A few commenters asked what the best way might be, including some good examples.

To be a bit more rigorous about it, and make sure, the great people at HEAnet provided me with an anonymous list (the prefixes had been stripped) of over 200,000 IPv6 addresses that have used ftp.heanet.ie in the last month. Included in that list were over 150,000 EUI-64 style addresses, which look like this:

    2001:880:18:1:214:4fff:fe02:e6ee

The last four 16-bit groups (64 bits) include a slightly modified version of the user’s MAC address. The details are straightforward, but you can take it from me that “214:4fff:fe02:e6ee” corresponds to a MAC address of “00:14:4F:02:E6:EE”, and that the md5sum of that string is “d32227ed9a3bf7d8714590f837884286”. MAC addresses, and the hash, are both really just numbers: a 48-bit number and a 128-bit number respectively.
Bash can handle these kinds of numbers natively, and if you need a well-distributed number between say 0 and 999 then the mod operator is perfect:

    # Prove that bash can handle even 128-bit numbers
    colmmacc@infiltrator (~) $ echo $(( 0xd32227ed9a3bf7d8714590f837884286 ))
    8162089295436857990

    # Use the MAC address directly to pick a number
    colmmacc@infiltrator (~) $ MACADDR=`/sbin/ifconfig | grep HWad \
        | awk '{print $5 }' | head -1`
    colmmacc@infiltrator (~) $ echo $(( 16#$MACADDR % 1000 )) | sed 's/^-//'
    174

    # Use the md5sum of the MAC address to pick a number
    colmmacc@infiltrator (~) $ echo $(( `echo $MACADDR | md5sum | \
        cut -d\  -f1 | sed 's/^/0x/'` \
        % 1000 )) | sed 's/^-//'
    363

As per one of the comments on the previous post, from brady, getting rid of any minus sign (the last sed operation) is a cheap form of abs(). But which is better: randomness, MAC addresses, or the md5sums? To get rid of any temporal bias, I’ve graphed the distribution of the above operations for 18,365 real-world MAC addresses from one day’s worth of requests to ftp.heanet.ie. MD5sums come out slightly ahead (stddev of 4.6 compared to 4.81), and essentially performed just as well as random numbers should (around a stddev of 4.55). Using the MAC addresses on their own, without md5summing, should be good enough for most purposes too.

So why not use a random number? Well, two reasons. It’s harder – you have to store the state somewhere. A MAC address, on the other hand, is already stored state; you can look it up each time and it will be relatively stable. A lot of the time automated tasks are being installed at provisioning time – there isn’t actually that much real entropy available, so the randomness either tends to be weak, or you contribute to exhausting entropy and denying it to more useful things. And lastly, James pointed out that from the point of view of a single host half an hour of jitter doesn’t really matter.
He's dead right – and of course the combined effects do matter for the distributed system – and the next post will be about how to exploit that property to get better scheduling.



Period Pain

2009-09-14T16:21:21Z

Imagine it was your job – along with 1,000 other people – to pick a number between 1 and 60. You can use any method you like (though you must all use the same one), but if more than 30 of you choose the same number, those of you who did would be shot. Would you let the group pick numbers at random? Probably not – there's always a chance it could go horribly wrong. And that chance? We could derive the correct p-value for having to shoot people easily enough from the uniform distribution, but forget that, let's do it with code. It's a lot easier to understand – and it's a good habit too.

```python
import random

count = 0

# Simulate 10,000 runs of 1,000 people
# picking a number between 0 and 59.
for runs in range(10000):
    numbers = [0] * 60
    for people in range(1000):
        r = random.randint(0, 59)
        numbers[r] += 1
    if sorted(numbers)[-1] >= 30:
        count += 1

print(count / 10000.0)
```

If you run this, hopefully you'll get an answer that's around 0.1. In other words, about 10% of the time we expect at least one number to be chosen by 30 or more people. Those don't seem like great odds, and I wouldn't play Russian roulette with lives like that. Yet this is almost exactly what we do with a lot of automated tasks in some large-scale distributed systems. A pattern that can be observed over and over again is that someone writes a scheduled task that runs once an hour, day, month or whatever, but then sleeps a random amount of time (typically between 0 and 3,600 seconds) when it starts – in a very naive attempt to distribute impact evenly across the larger system.
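That simulated ~0.1 can also be sanity-checked without simulation. Each individual number is chosen by a Binomial(1000, 1/60) count of people, and a union bound over the 60 numbers gives an upper estimate of the probability that any of them reaches 30. A sketch (stdlib only; this is an approximation, since the 60 counts are not independent):

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Union bound: P(some number is chosen >= 30 times)
# is at most 60 * P(one particular number is).
bound = 60 * binom_tail(1000, 1 / 60, 30)
print(round(bound, 3))  # close to the ~0.1 the simulation reports
```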
The impacts can be pretty serious – it might be extended load on a dependent service, or it might simply be too many busy nodes in a cluster. Or it might be a two-day global outage of a popular VoIP service. Too many things happening at once is usually a bad thing in distributed systems. Security and anti-virus updates, apt/yum/ports repositories and auditing tools in particular seem to get this pattern wrong. Here's a good example, from Ubuntu's daily apt script:

```bash
# sleep for a random intervall of time (default 30min)
# (some code taken from cron-apt, thanks)
random_sleep()
{
    RandomSleep=1800
    eval $(apt-config shell RandomSleep APT::Periodic::RandomSleep)
    if [ $RandomSleep -eq 0 ]; then
        return
    fi
    if [ -z "$RANDOM" ] ; then
        # A fix for shells that do not have this bash feature.
        RANDOM=$(dd if=/dev/urandom count=1 2> /dev/null | cksum | cut -c"1-5")
    fi
    TIME=$(($RANDOM % $RandomSleep))
    sleep $TIME
}
```

This is a very bad pattern – I'd go so far as to say that it's actually worse than letting everything happen at the same time in the first place. It has two particularly dangerous qualities.

It increases the effective period of the task. If a task is running once an hour, it's probably because we need to do something once an hour. That may seem tautological, but there's subtlety. The clock-time hours we define as humans are arbitrary; by once an hour we shou[...]
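That first dangerous quality is easy to demonstrate: if an "hourly" task sleeps a fresh random 0–1800 seconds each run, the gap between two consecutive runs is anywhere from about 30 to about 90 minutes, not one hour. A quick sketch of the spread of gaps (an assumed setup, mirroring the apt script's 1800-second default):

```python
import random

RANDOM_SLEEP = 1800  # apt's default, in seconds

# Actual start times of an "hourly" task over a week: cron fires
# on the hour, then the task sleeps a random interval first.
starts = [hour * 3600 + random.randint(0, RANDOM_SLEEP - 1)
          for hour in range(24 * 7)]

# Gaps between consecutive runs range across [1801, 5399] seconds
# instead of being a steady 3600.
gaps = [b - a for a, b in zip(starts, starts[1:])]
print(min(gaps), max(gaps))
```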