
Saaien Tist

On data visualization, bioinformatics, quantified health and personal productivity

Updated: 2017-09-17T00:33:12.365+02:00


I believe...


Ryo Sakai reminded me a couple of weeks ago about Simon Sinek's excellent TED talk "Start With Why - How Great Leaders Inspire Action", which inspired this post... Why do I do what I do?

The way data can be analysed has been automated more and more in the last few decades. Advances in machine learning and statistics make it possible to gain a lot of information from large datasets. But are we starting to rely too much on those algorithms? Several issues seem to pop up more and more. For one thing, research in algorithm design has enabled many more applications, but has at the same time made these algorithms so complex that they start to operate as black boxes - not only for the end-user who provides the data, but even for the algorithm developer. Another issue with pre-defined algorithms is that having them around prevents us from identifying unexpected patterns: if the algorithm or statistical test is not specifically written to find a certain type of pattern, it will not find it. A third issue: (arbitrary) cutoffs. Many algorithms rely heavily on the user (or even worse: the developer) defining a set of cutoff values. This is true in machine learning as well as in statistics. A statistical test returning a p-value of 4.99% is considered "statistically significant", but you'd throw away your data if that p-value were 5.01%. What is so intrinsic about 5% that makes you have to choose between "yes, this is good" and "let's throw our hypothesis out the window"? All in all, much of this comes back to the fragility of using computers (hat tip to Toni for the book by Nassim Taleb): you have to tell them what to do and what to expect. They're not resilient to changes in setting, data, prior knowledge, etc; at least not as much as we are.
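The arbitrariness of that 5% cutoff can be made concrete with a tiny sketch (hypothetical z-scores, standard normal assumed): two experiments with almost identical evidence land on opposite sides of the line.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return math.erfc(z / math.sqrt(2))

# Two hypothetical experiments with nearly identical evidence:
p_a = two_sided_p(1.97)  # lands just below the 5% cutoff
p_b = two_sided_p(1.95)  # lands just above it

print("A significant?", p_a < 0.05)  # True
print("B significant?", p_b < 0.05)  # False
```

A difference of 0.02 in the z-score flips the verdict from "publishable finding" to "discard the hypothesis" - nothing about the underlying data changed meaningfully.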

So where does this bring us? It's my firm belief that we need to put the human back in the loop of data analysis. Yes, we need statistics. Yes, we need machine learning. But also: yes, we need a human individual to actually make sense of the data and drive the analysis. To make this possible, I focus on visual design, interaction design, and scalability. Visual design because the representation of data in many cases needs improvement to be able to cope with high-dimensional data; interaction design because it's often by "playing" with the data that the user can gain insights; and scalability because it's not trivial to process big data fast enough that we can get interactivity.

I'll do Angelina Jolie



"I'll do Angelina Jolie". Never thought I'd say that phrase while talking to well-known Belgian cartoonists, and actually be taken seriously.

Backtrack about one year. We're at the table with the crème-de-la-crème of Belgium's cartoon world (Zaza, Erwin Vanmol, LECTRR, Eva Mouton, ...), in a hotel in Knokke near the coast. "We" is a gathering of researchers covering genetics, bioinformatics, ethics, and law. The setup: the Knokke-Heist International Cartoon Festival. This very successful festival centers each year around a particular topic. 2013 was "Love is..."; 2014 is about genetics. Hence our presence at the site. On the program for day 1: explaining genetics and everything that gets dragged into it (privacy, etc) to the cartoonists. Day 2: discussion on which messages we should bring at the festival, and a quick pictionary to check if we actually explained the concepts well. (As I was doodling myself at the time, I briefly got to be a "cartoonist" as well and actually drew one of those :-)
"So what's the thing with Angelina Jolie?", you ask? We figured that she'd be the topic of part of the cartoonfestival installation (talking about breast cancer, obviously), and I volunteered to help out setting up that section...


Fast forward to this late summer. The cartoonfestival is in full swing, and I'm trying to explain the genetic dogma and codon table to a bunch of 8-13 year-olds in the Children's University. I thought it'd be nice to let them muck about with strawberries to get the DNA out, and write their names in secret code (well: just considering each letter to be an amino acid...). I was really nervous in the days/weeks before the actual event; kids can be a much harsher public than university students. Or so I thought; it was quite the opposite: feedback from the children was marvellous and I really enjoyed their enthusiasm. To be repeated... :)
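The "write your name in secret code" exercise (each letter treated as an amino acid) can be sketched in a few lines. The codon choices below are valid entries from the standard genetic code, but each amino acid really has 1-6 codons, so the specific picks are arbitrary; letters with no amino-acid code (B, J, O, U, X, Z) are simply skipped.

```python
# One valid codon per amino acid (choices are arbitrary but correct).
CODON = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTT",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}

def name_to_dna(name):
    """Encode a name as DNA, skipping letters with no amino-acid code."""
    return "".join(CODON[c] for c in name.upper() if c in CODON)

print(name_to_dna("Eva"))  # GAAGTTGCT
print(name_to_dna("Jan"))  # GCTAAT  (the J has no amino acid, so it drops out)
```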

I know this post is way overdue (especially given the fact that the cartoonfestival actually closed last weekend). But with this I hope to resurrect this blog from the comatose state it's been in since I started my current position 4 years ago...

Available: Research position Biological Data Visualization and Visual Analytics


We could still use more applicants for this position, so bumping the open position...

Available: Research position Biological Data Visualization and Visual Analytics

Keywords: biological data visualization; visual analytics; data integration; genomics; postdoc

Are you well-versed in the language of Tufte? Do you believe that visualization plays a key role in understanding data? Do you like to work in close collaboration with domain experts using short iterations? And do you want to use your visualization skills to help us understand what makes a cancer a cancer, and what distinguishes a healthy embryo from one that is not?

We're looking for a motivated data visualization specialist to help biological researchers understand variation within the human genome. Methodologies exist for analyzing this type of data, but are still immature and return very different results depending on what assumptions are made. The type of data can also be used for a huge amount of different research questions, which necessitates developing very exploratory tools to support hypothesis generation.

Profile

The ideal candidate is well-motivated, holds a PhD (or at least MSc) degree in computer science or bioinformatics, and has experience in data visualization (e.g. using tools like D3 or Processing). Prior experience working with DNA sequencing data and genome-wide detection of genetic variation would be an advantage but is not crucial. Good communication skills are important for this role.

You will collaborate closely with biologists and contribute to the reporting of the project. You will be able to work semi-independently under the supervision of a senior investigator, mentor PhD students, and contribute to the acquisition of new funding. A three-year commitment is expected. Start date is as soon as possible.

Relevant publications

Medvedev P, Stanciu M & Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(11):S13-S20 (2009)
Nielsen CB, Cantor M, Dubchak I, Gordon D & Ting W. Visualizing genomes: techniques and challenges. Nat Methods 7:S5-S15 (2010)
Bartlett C, Cheong S, Hou L, Paquette J, Lum P, Jager G, Battke F, Vehlow C, Heinrich J, Nieselt K, Sakai R, Aerts J & Ray W. An eQTL biological data visualization challenge and approaches from the visualization community. BMC Bioinformatics 13(8):S8 (2012)

Application

For more information and to apply, please contact Jan Aerts (@jandot, +Jan Aerts). If possible, also send screenshots and/or screencasts of previous work.

Postdoc position available: visualization and genomic structural variation discovery

SymBioSys is a consortium of computational scientists and molecular biologists at the University of Leuven, Belgium, focusing on how individual genomic variation leads to disease through cascading effects across biological networks (in specific types of constitutional disorders and cancers). We develop innovative computational strategies for next-generation sequencing and biological network analysis, with demonstrated impact on actual biological breakthroughs.

The candidate will be a key player in the SymBioSys workpackage that focuses on genomic variation detection based on next-generation sequencing data (454, Illumina, PacBio) using a visual analytics approach (i.e. combining machine learning with interactive data visualization). This includes applying and improving existing algorithms and tools for the detection of structural genomic variation (insertions, deletions, inversions and translocations), as well as developing interactive data visualizations in order to investigate the parameter space of these algorithms. These methods will be applied to specific genetic disorders in day-to-day collaboration with the human geneticists within the consortium.

We offer a competitive package and a fun, dynamic environment with a top-notch consortium of young leading scientists in bioinformatics, human genetics and cancer. Our consortium offers a rare level of interdisciplinarity, from machine learning algorithms and data visualization to fundamental advances in molecular biology, to direct access to the clinic. The University of Leuven is one of Europe's leading research universities, with English as the working language for research. Leuven lies just east of Brussels, at the heart of Europe.

Profile

The ideal candidate holds a PhD degree in bioinformatics/genomics and has good analytical, algorithmic and mathematical skills. Programming and data analysis experience is essential. Prior experience working with sequencing data, in particular alignment of next-generation data as well as genome-wide detection of genetic variation, would be a distinct advantage. Experience in data visualization - e.g. using tools like D3 or Processing - would also be considered a big plus. Good communication skills are important for this role.

The candidate will collaborate closely with researchers across the consortium and contribute to the reporting of the project. Qualified candidates will be offered the opportunity to work semi-independently under the supervision of a senior investigator, mentor PhD students, and contribute to the acquisition of new funding. A three-year commitment is expected from the candidate. Preferred start date is November/December 2012, so please let us know asap.

Relevant publications

Conrad D, Pinto D, Redon R, Feuk L, Gokumen O, Zhang Y, Aerts J, Andrews D, Barnes C, Campbell P et al. Origins and functional impact of copy number variation in the human genome. Nature 464:704-712 (2010)
Medvedev P, Stanciu M & Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(11):S13-S20 (2009)
Nielsen CB, Cantor M, Dubchak I, Gordon D & Ting W. Visualizing genomes: techniques and challenges. Nat Methods 7:S5-S15 (2010)

Application

Please send in PDF: (1) a CV including education (with Grade Point Average, class rank, honors, etc.), research experience, and bibliography, (2) a one-page research statement, and (3) two references (with phone and email) to Dr Jan Aerts, cc Dr Yves Moreau and Ms Ida Tassens.

Quantified Health, and my frustration with genetics


Since the publication of the human genome sequence about a decade ago, the popular press has reported on many occasions about genes allegedly found for things ranging from breast size, intelligence, popularity and homosexuality to fidgeting. The general population is constantly told that the revolution is just around the corner. But in the last year or so, articles have started to pop up in the popular press saying that genomics and genetics will not be able to deliver what they promised (or what people thought they promised) a couple of years ago. The technology of (next-generation) sequencing is clearly following the Gartner Hype Cycle, and we're probably nearing the top of the "peak of inflated expectations".

Gartner Hype Cycle

As a researcher myself, I also am not insensitive to this sensation. Even though (or because) I have contributed to some large genome sequencing projects (in chicken and cow) and have worked closely with the 1000Genomes team while at the Wellcome Trust Sanger Institute, I feel a growing frustration with genetics and genomics. It's all very interesting to build genomes, find associations of genes with disease, etc, but what can I as an individual do with this information? Yes, our research helps us understand the core of biology, and we do help (parents of) people with rare diseases diagnose those diseases or find the gene that causes a particular congenital condition. But how can this information help the vast majority of the population in their day-to-day life?

My frustration with what genetics can tell me

Under the umbrella of GenomesUnzipped, I had about half a million SNPs genotyped by 23andme (for data: see here). Based on that data and the scientific literature, for example, they state that I have a higher chance of getting venous thromboembolism (VTE). Almost 18% of people of European descent with my genotype will develop VTE before they're 80, compared to 12.3% of the general European male population. The heritability of VTE, by the way, is about 55%. So what does this tell me? I should eat healthy, not smoke, and do more exercise. Still: my genotype cannot actually tell me when I will get this, not even just before the event.

The issue is that our genomes are just the blueprints for who we are; they're not us. For that, we need to look at other omes: our transcriptomes, but most of all our proteomes and metabolomes, and our environment. Whatever our genotype is, it has no effect whatsoever except through how it affects the constitution of the enzymes and other molecules in our cells: proteins could for example not work, or be present in non-optimal amounts.

Meanwhile...

When you go to a doctor hoping to get rid of frequent headaches, what does the doctor base his diagnosis on? Those symptoms? More often than not: your memory. "I think I had those headaches last Friday and Monday, and if I recall correctly the one on Monday was a bit worse than the other one". Doctor: "Is it always after eating something specific?". You: "I can't remember". Wouldn't it be great if in such cases your GP could diagnose your disease based on actual data?

But then comes the next step: taking drugs. The amount of a particular drug (and actually the drug to be taken itself) is based on population-wide averages of efficiency and occurrence of side effects of that drug. It's not based on how you react to that particular drug.

Quantified Health

Enter personalized medicine, and more specifically: P4Medicine (predictive, preventive, personalized and participatory). I'd like to look at this from a little broader perspective, as quantified health: data-driven health. For quite a while I've been following the Quantified Self movement. The aim of this movement is to improve self knowledge through self tracking: collecting data about yourself to identify trends (e.g. weight loss) and correlations (e.g. [...]
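The VTE numbers quoted above work out to only a modest shift in risk, which is exactly the frustration: the arithmetic is trivial, but it changes almost nothing about what I should do day to day.

```python
my_risk = 0.18           # lifetime VTE risk with my genotype (to age 80)
population_risk = 0.123  # general European male population

relative_risk = my_risk / population_risk
absolute_increase = my_risk - population_risk

print(f"relative risk: {relative_risk:.2f}")           # ~1.46x
print(f"absolute increase: {absolute_increase:.1%}")   # ~5.7 percentage points
```

A 1.46-fold relative risk sounds dramatic in a headline; an extra 5.7 percentage points of lifetime risk, with no information about when or whether, much less so.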

Clojure, visualization, and scripts


Bit of a technical post for my own reference, about visualization and scripting in Clojure.

Clojure and visualization

Being interested in Clojure, a tweet by Francesco Strozzi (@fstrozzi) caught my attention last week: "A D3 like #dataviz project for #clojure. Codename C2 and looks promising. They need contribs so spread the word!" I tried a while ago to do some stuff in D3, but the javascript got in the way so I gave up after a while. But I was still pulled towards something html5+css rather than java applets. Although still in a very early stage, C2 is already very powerful. Rapid development of visualizations is aided by the visual repl: the webpage at localhost:8987 automatically loads the last modified file in the samples directory.

(ns rectangles-hover
  (:use [c2.core :only [unify]]))

(def css "body { background-color:white;}
rect { -webkit-transition: all 1s ease-in-out; fill:rgb(255,0,0);}
rect:hover { -webkit-transform: translate(3em,0); fill:rgb(0,255,0);}
circle { opacity: 1; -webkit-transition: opacity 1s linear; fill:rgb(0,0,255);}
circle:hover { opacity: 0; }")

[:svg
 [:style {:type "text/css"} (str "<![CDATA[" css "]]>")]
 (unify {"A" 50 "B" 120}
        (fn [[label val]] [:rect {:x val :y val :height 50 :width 60}]))
 (unify {"C" 180 "D" 240}
        (fn [[label val]] [:circle {:cx val :cy val :r 15}]))]

This bit of code draws 2 red rectangles and 2 blue circles. Hovering the mouse over any of the rectangles will move it to the right and change its colour to green; hovering over a circle will make that circle transparent. Some more scripts that I've used to build up simple things and learn C2 are on github.

Although interactions are not covered in C2 itself, simple transitions can be handled in the CSS part (see the example above). Brushing, linking and other types of interaction would be interesting to have available as well, though. But the developer Kevin Lynagh is very responsive.

I haven't looked yet into how to run C2 without the visual repl; still on my to-do list. (UPDATE: see end of post)

Clojure and scripting

And today, I saw this. Leiningen 2 will allow you to easily execute little Clojure scripts without the whole setup of a project. That makes it amenable for pipelining just like you would do with little perl/ruby/python scripts. The completely-useless-but-good-enough-as-proof-of-principle little example below attaches some dashes to the front and stars at the back of anything you throw at it from STDIN.

#!/bin/bash lein exec
(doseq [line (line-seq (java.io.BufferedReader. *in*))]
  (println (str "----" line "****")))

Pipe anything into this:

ls ~ | ./proof-of-principle.clj

Dependencies are now stored in ~/.m2 rather than in the project directory, and you can load libraries such as Incanter like this:

#!/bin/bash lein exec
(use '[leiningen.exec :only (deps)])
(deps '[[incanter "1.3.0"]])
(use '(incanter core charts stats datasets))
(save (histogram (sample-normal 1000)) "plot.png")

This also works in the interactive repl ("lein repl").

Bringing the two together

It's really easy to combine these two (after a pointer from C2's Kevin - thanks!). You need an additional dependency on hiccup to convert to html, but that's it. Here's a script that, when executed with "lein exec this-script.clj", will generate a html file with the interactive picture shown above.

#!/bin/bash lein exec
(use '[leiningen.exec :only (deps)])
(deps '[[com.keminglabs/c2 "0.1.1"] [hiccup "1.0.0"]])
(use '[c2.core :only (unify)])
(use 'hiccup.core)
(def css "body { background-color:white;}
rect { -webkit-transition: all 1s ease-in-out; fill:rgb(255,0,0);}
rect:hover { -webkit-transform: translate(3em,0); fill:rgb(0,255,0);}
circle { opacity: 1; -webkit-transition: opacity 1s linear; fill:rgb(0,0,255);}
circle:hover { opacity: 0; }")
(def svg [:svg [:style {:type "text/css"} (str "")] (unify {"A" 50 "B" 120} (fn [[label val]] [:rect {:x val :y val [...]

Biovis/Visweek recap


Finally time to write something about the biovis/visweek conference I attended about a week ago in Providence (RI)... And I must say: they'll see me again next year. (Hopefully @infosthetics will be able to join me then.) Meanwhile, several blog posts are popping up discussing it (see here and here, for example).

This was the first time that biovis (aka the IEEE Symposium on Biological Data Visualization) was organized. It's similar to the 2-year old vizbi, but has an agenda that is more focused on research in visualization rather than application of visualization in the biological sciences. Really interesting talks, posters and people.

The biovis contest

This first installment of biovis included a data visualization contest, focusing on "specific biological problem domains, and based on realistic domain data and domain questions". The topic this year was eQTL (expression quantitative trait loci) data, and I'm really happy that Ryo Sakai - now a PhD student in my lab - won the "biologists' favourite" award!! The biologist jury was impressed with the ease with which his visualizations of the eQTL data highlighted and confirmed prior knowledge, and how they suggested directions for further experiments. It was interesting to see that there was huge variation in the submissions, going from just showing the raw data in an interesting way (which Sakai-san did) to advanced statistical and algorithmic munging of the input data and visualizing the end result (which the winner of the "dataviz professionals' favourite" award did). See how this relates to my previous post on humanizing bioinformatics?

Interesting talks - amazing (good & bad) talks

As this was the first time that I attended visweek, I was really looking forward to the high quality presentations/papers and posters. Overall, I got what I wanted. But there were some examples of papers and posters that I have major doubts about (taking into account that I have to be humble here, talking about people who have worked in the field for far longer than I have).

One example that seemed pretty counterintuitive was a presentation by Basak Alper from Microsoft about a new set visualization technique that they baptized LineSets. The main issue that they want to solve is the visualization of intersections of >3 sets (up to 3 you'd just use Venn diagrams). Their approach is to connect the different elements from a set by a line; hence: linesets. However, I (and many others with me) felt that this approach has some very serious drawbacks. Most of all, it suggests that there is an implicit ordering of the elements, which there isn't. In the image below, for example, line sets were used to connect Italian restaurants (in orange) and Mexican restaurants (in purple). That's the only thing this visualization wants to do: tell me which of the restaurants are Italian and which are Mexican. But give this picture to 10 people, and every single one of them will think that the lines are actually paths or routes between these restaurants. Which they're not... The example below shows data that has specific positions on a map, but they demonstrate this approach on social networks as well.

LineSets

Another example comes from the biovis conference: TIALA, or Time Series Alignment Analysis. Suppose you have the time-dependent gene expression values for a single gene, which you'd plot using a line plot. Now what would you do if you have that type of data for 100 genes? Would you put those plots into 3D? I know I wouldn't... And better still: would you then connect these plots so you end up with some sort of 3D landscape? That's like connecting the tops of a barchart displaying categorical data with a line...

TIALA - Time Series Alignment Analysis

But of course there were plenty of really good talks as well. Some of the talks I really enjoyed are those about HiTSEE (by Bertini et al) on the analysis of high-throughput screen[...]

Humanizing Bioinformatics


I was invited last week to give a talk at this year's meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks, by some bright PhD students and postdocs. There were 2 keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.

My talk was titled "Humanizing Bioinformatics" and was received quite well (at least some people still laughed at my jokes (if you can call them that), even at the end). I put the slides up on slideshare, but I thought I'd explain things here as well, because those slides will probably not convey the complete story.

Let's ruin the plot by mentioning it here: we need data visualization to counteract the alienation that's happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to make bioinformatics human again.

Jim Gray from Microsoft wrote a very interesting book, "The Fourth Paradigm - Data-Intensive Scientific Discovery". Get it. Read it. He describes how the practice of research has changed over the centuries. In the First Paradigm, science was very much about describing things; the Second Paradigm (the last couple of centuries) saw a more theoretical approach, with people like Kepler and Newton defining "laws" that described the universe around them. The last few decades saw the advent of computation in the research field, which allowed us to take a closer look at reality by simulating it (the Third Paradigm). But just recently - so Jim Gray says - we're moving into yet another fundamental way of doing science. We have moved into an age where there is just so much data generated that we don't know what to do with it. This Fourth Paradigm is that of data exploration. As I see it (but that's just one way of looking at it, and it doesn't want to say anything about what's "better" than what), this might be a definition for the difference between computational biology and bioinformatics: computational biology fits within the Third Paradigm, while bioinformatics fits in the Fourth.

Being able to automatically generate these huge amounts of data (e.g. in genome sequencing) does mean that biologists have to work with ever bigger datasets, using ever more advanced algorithms that use ever more complicated data structures. This is not about just some summary statistics anymore; it's support vector machine recursive feature elimination, manifold learning and adaptive cascade sharing trees and stuff. Result: the biologist is at a loss. Remember Dr McCoy in Star Trek saying "Dammit Jim, I'm a doctor, not an electrician/cook/nuclear physicist" whenever the captain let him do stuff that is - well - not doctorly? (Great analogy found by Christophe Lambert.) It's exactly the same for a clinician nowadays. In order to do a (his job: e.g. decide on a treatment plan for a cancer patient), he has to first do b (set up hardware that can handle the 100s of Gb of data) and c (devise some nifty data mining trickery to get his results). Neither of which he has the time or training for. "Dammit Jim, I'm a doctor, not a bioinformatician". Result: we're alienating the user. Data mining has become so complicated and advanced that the clinician is at a complete loss. Heck, I'm working at a bioinformatics department and don't understand half of what they're talking about. So what can the clinician do? His only option is to trust some bioinformatician to come up with some results. But this is a blind trust: he has no way of assessing the results he gets back. This trust is even more blind than the one you give the guy who repairs your car.

As I see it, there are (at least) four issues.

What's the question?

Data generation used to be really geared towards p[...]

Visualize This (by Nathan Yau) arrived...


Last Friday I received my long-anticipated copy of "Visualize This" by Nathan Yau. On its website it is described as a "practical guide on visualization and how to approach real-world data". You can guess what my weekend looked like :-)

Overall, I believe this book is a very good choice for people interested in getting started in data visualization. Not only does it provide the context in which to create visualizations (chapters 1, 2 and 9), it also covers different tools for creating them: R, protovis, flash... Apart from chapter 3, which is dedicated entirely to that topic, different examples in the book were created using different tools, which gives people a good feel of what's possible in each and how "hard" or "easy" the coding itself is for the different options. Different chapters discuss different types of data that you could encounter: patterns over time, proportions, relationships, ...

There were some minor points in the book that I'd mention if they asked me to review it (but that's according to me, and I don't want to pretend to be an expert). First of all, it would have been nice if Nathan had gone a little bit deeper into the theories behind what is seen as good visualization. In the first chapter ("Telling Stories with Data") he does mention Cleveland & McGill in a side-note, but I think that information (along with Gestalt laws, etc) definitely deserves one or two full paragraphs, if not half a chapter. I also don't completely agree with the use of a stacked barchart (about page 109). From my experience, they're worth less than the time it takes to create them. After all, it's impossible to compare any groups other than the one that is at the bottom (and therefore has a common "zero"-line). For example: look at the first picture below. This shows the number of "stupid things done" by women and men, stratified over different groups (A-F). Although it is easy to compare total stupidity per group (group C is doing particularly badly), as well as that for men, we can't see which of the groups A, D or F scores the worst for women. And that's because they don't have a common origin. We could of course put the women next to the men, but then we'd lose the total numbers.

In the second plot, however, it is possible to compare women, men and totals. The bars for women are put next to those for men, but I've added a shaded larger bar at the back that shows the sum of the two. This plot was originally created in R using ggplot2, but I'm afraid I can't find the reference that explained how to do this... Let me know if you can find it.

The contents of the book of course are not world-shattering. But that's not the point of the book. For people new to the field it's a great addition to their library (and I learned a thing or two myself as well). If you're interested in data visualization, go out and get it.
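The layout described above (a wide, shaded "total" bar behind a narrower pair of bars that all start at zero) can be sketched as pure geometry, independent of R or ggplot2. The counts below are hypothetical stand-ins, not the numbers from the book's figure.

```python
# Hypothetical counts of "stupid things done" per group (A-F).
women = {"A": 14, "B": 8, "C": 22, "D": 15, "E": 6, "F": 13}
men   = {"A": 9,  "B": 17, "C": 25, "D": 11, "E": 10, "F": 12}

def grouped_with_total(women, men, width=0.8):
    """For each group: one wide background bar showing the total, plus two
    narrower foreground bars (women, men). Every bar starts at y=0, so any
    bar can be compared directly against any other.
    Returns tuples of (group, series, x_center, bar_width, height)."""
    bars = []
    for i, g in enumerate(sorted(women)):
        total = women[g] + men[g]
        bars.append((g, "total", i, width, total))              # shaded, behind
        bars.append((g, "women", i - width / 4, width / 2, women[g]))
        bars.append((g, "men",   i + width / 4, width / 2, men[g]))
    return bars

for bar in grouped_with_total(women, men)[:3]:
    print(bar)
```

Feeding these rectangles to any plotting library reproduces the second plot: unlike a stacked barchart, women, men, and totals all share the zero baseline.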

Visualizing the Tour de France


UPDATE: I encountered a blog post by Martin Theus describing a very similar approach for looking at this same data (see here).

Disclaimer 1: This is a (very!) quick hack. No effort was put into it whatsoever regarding aesthetics, interactivity, scaling (e.g. in the barcharts), ... I just wanted to get a very broad view of what happened during the Tour de France (= the biggest cycling event each year).
Disclaimer 2: I don't know anything about cycling. It was actually my wife who had to point out to me which riders could be interesting to highlight in the visualization. But that also meant that this could become interesting for me to learn something about the Tour.

Data was copied from the Tour de France website (e.g. for the 1st stage). Visualization was created in processing.

The parallel coordinate plot shows the standings of all riders over all 21 stages. No data was available for stage 2, because that was a team time-trial (so discard that one). At the top is the rider who came first; at the bottom, the one who came last. Below the coordinate plot are little barcharts displaying the distribution of arrival times (in "number of seconds later than the winner") for all riders in that stage.

The highlighted riders are: Cavendish (red), Evans (orange), Gilbert (yellow), Andy Schleck (light blue) and Frank Schleck (dark blue).

So what was I able to learn from this?

  • Based on the barcharts you can guess which trips were in the mountains, and which weren't. You'd expect that the riders become much more separated in the mountains than on the flat. In the very last stage in Paris, for example, everyone seems to have arrived in one big group. Whereas for stages 12-14 the riders were much more spread. So my guess (and that's confirmed by checking this on the TourDeFrance website :-) is that those were mountain stages.
  • You can see clear groups of riders who behave the same. There is for example a clear group of riders who performed quite badly in stage 19 but much better in stage 20 (and bad in 21 again).
  • As the parallel coordinate plots were scaled according to the initial number of riders, we can clearly see how people left the Tour because the "bottom" of the later stages is empty.
  • We see that Cavendish (red) has a very erratic performance, and it seems to coincide with stages where the arrival times are spread out (= mountain stages?). This could mean that Cavendish is good on the flats but bad in the mountains. Question to those who know something about cycling: is that true?
  • Philippe Gilbert started well (both on the flats and in the mountains), but became more erratic halfway through the Tour.
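The "spread in arrival times" heuristic from the first bullet can be sketched in a few lines of Ruby. This is my own illustration, not the Processing code behind the visualization, and the stage data is made up: a stage whose spread is well above the average spread is probably a mountain stage.

```ruby
# Made-up arrival delays (seconds behind the stage winner) per stage;
# the real numbers would be scraped from the Tour de France site.
def spread(delays)
  delays.max - delays.min
end

# flag stages whose spread is well above the average spread
def likely_mountain_stages(stages, factor = 1.5)
  spreads = Hash[stages.map { |nr, delays| [nr, spread(delays)] }]
  avg = spreads.values.inject(:+).to_f / spreads.size
  spreads.select { |nr, s| s > factor * avg }.keys.sort
end

stages = {
  12 => [0, 340, 780, 1560],  # riders spread out => mountains?
  21 => [0, 2, 3, 5]          # everyone in one big group => flat
}
puts likely_mountain_stages(stages).inspect  # => [12]
```

The cutoff factor is of course arbitrary (see the earlier post on arbitrary cutoffs); eyeballing the barcharts does the same job.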

TenderNoise - visualizing noise levels


A couple of days ago I bumped into this tweet by Benjamin Wiederkehr (@datavis): "Article: TenderNoise". The article describes a visualization by Stamen Design and others displaying noise levels at different intersections in San Francisco. They recorded these levels over a period of a few days to get an idea of auditory pollution. More information is here.

Although this particular visualization might be very useful for the people involved, I would like to explain some of the issues that I have with it, coming from a data-visualization-for-pattern-finding viewpoint.

I think there are many things that might be gleaned from this data which are not possible with the current visualization:

  • Is there a relationship between the noise patterns at different intersections? Based on the graphic at the bottom, we can conclude that on average the noise level goes down during the night and up during the daytime, but it would be nice if the visualization also gave an indication of any aberrant patterns. Are there intersections that behave differently from others?
  • I don't see a real use for changing the graphic over time. I suspect that small multiples of area charts would work better to demonstrate the change over time (as e.g. the visual used here). Using the current approach it is very difficult to see how particular intersections change over time because (a) the display changes and you lose temporal context, and (b) the resolution is so high that the blobs just flicker.
  • Concerning that flicker, it might be an option to bin the data in larger time blocks. For calculating the value in each block different approaches should be investigated, like the average value, the maximum, the minimum, or the most extreme value (be it maximum or minimum, based on comparison with the average).
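To make the binning idea in the last bullet concrete, here is a small Ruby sketch. The bin_values helper and the sample readings are my own invention, not part of the TenderNoise code; it summarizes each block of samples with a pluggable aggregation rule, including the "most extreme value" variant.

```ruby
# Bin noise samples (dB values) into larger time blocks, with a choice
# of aggregation: :avg, :max, :min, or :extreme (the value in the block
# that lies furthest from the overall average).
def bin_values(samples, block_size, method = :avg)
  overall_avg = samples.inject(:+).to_f / samples.size
  samples.each_slice(block_size).map do |block|
    case method
    when :avg     then block.inject(:+).to_f / block.size
    when :max     then block.max
    when :min     then block.min
    when :extreme then block.max_by { |v| (v - overall_avg).abs }
    end
  end
end

readings = [62, 64, 90, 61, 60, 59]            # made-up samples
puts bin_values(readings, 3, :max).inspect     # => [90, 61]
puts bin_values(readings, 3, :extreme).inspect # => [90, 59]
```

Note how :max and :extreme disagree on the second block: the quietest moment (59 dB) is further from the overall average than the loudest one, which is exactly the kind of choice that would need investigating.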

It'd be interesting to get hold of these data and work on some alternatives (given the time...)

Why did I move into data visualization?


Preamble: It's been very quiet on this blog since I left the Wellcome Trust Sanger Institute in the UK and took up my position here at Leuven University in Belgium last October. Truth is that the type of work changed so profoundly that it takes a while to give it all a place in your head, let alone on a blog. Until I remembered this morning why I started this blog in the first place: to help me order my thoughts. So it might actually have sped things up instead...

Anyway... In this post I'd like to explain why I'm moving into the *data visualization* field. And it's not just because it's always nice to look at pretty pictures.

Statistics are great, but...

Ben Shneiderman, Professor at the University of Maryland, very eloquently stated that "The great fun of information visualization is that it gives you answers to questions you didn't know you had". I'd rephrase that a bit to "The great use of data visualization is that it gives you clues to questions you didn't know you had". To me, it's as much about finding the questions to ask as about finding the answers to those questions. Data visualization should not be used to "prove" things; that's what statistics is for. But the visualization can give you ideas on what statistical models to test. Like many others, I see a strong connection between statistics and data visualization. Taking a bit of a shortcut here, you could say that statistics is about proving what you expect, while visualization is about discovering what you didn't expect and refining those expectations.

From my own experience, I've seen that many (but not all!) statisticians look down upon data visualization with the argument that it can't prove anything. That's true. But their reaction then often becomes to throw the baby out with the bath water, instead of trying to see how both fields can benefit from each other.
It's not always equally simple to convince people of the effectiveness of visualizations, but we're getting there...

Explain and explore

In the data visualization field, there is often a tension between explanation and exploration. The work I'll be doing here in Leuven will cover the whole spectrum. In the explanation corner, there is trying to make sense of complex data; for example helping cancer genetics researchers understand how tumours evolve (e.g. the phylogeny of cancer cells) or what the rearranged genome in those tumours looks like. This type of visualization sits downstream from the data analysis, after the data is churned. On the other hand, there are the exploration projects, where we focus on showing the raw(ish) data to help us decide on what type of analysis to perform, for example when investigating parameter space for an algorithm. Of course many projects will fit somewhere in the middle...

The visualization model

Jarke van Wijk's paper "The Value of Visualization" (doi: 10.1109/VISUAL.2005.1532781) is a masterpiece in that it describes a comprehensive model of what visualization is and how we can quantify its effectiveness (cost). I'll just leave the picture here for you to contemplate:

My inspiration

There are several people whose work I keep in mind when discussing what I want to do in my group; my sources of inspiration, so to speak. They have one very important thing in common: they don't take shortcuts in their work and are not afraid to really think about what their visualizations are intended to do. These include:

  • Cydney Nielsen: ABySS-Explorer - a sequence assembly visualization tool
  • Miriah Meyer: Pathline - a tool for comparative functional genomics
  • Martin Krzywinski: Hive plots - rational network visualization

It's really exciting to work in this field; I'm looking forward to what the next few years will bring :-)[...]

VizBi 2011 - looking back


It has been a while (again) since my last post. It seems that the demands on my time are just a little bit different from those in my previous position... But I'd like to share a little bit about the VizBi conference that I attended 2 weeks ago.

This second installment of the VizBi conference was held at the Broad Institute in Cambridge, MA. The workshop focuses on visualizing biological data and harbours a new community of people with backgrounds ranging from bioinformatics and genomics to pure data visualization. Biological visualization is a very broad field, going all the way from abstract data visualization to enable finding patterns in data, to outreach and education, creating movies that explain things like DNA replication to the general public using Hollywood-studio tools (for example molecularmovies and the work by Drew Berry).

Probably the best description of the relevance of this conference was given in the opening keynote by Eric Lander, professor at MIT, one of the driving forces behind the Human Genome Project and - among others - co-chair of the Council of Advisors on Science and Technology in the Obama administration. Some of his quotes:
- "If all data fitted nicely within one clear paradigm, we wouldn't need VizBi."
- "Things that are beautiful have huge communicative value"
- "Nowhere is visualization as important as in biology, because it's the bleeding edge right now, and has the messiest data and the messiest problems."

And my favourite: "We need to work on exchange rates; if one picture is worth only a thousand words, we're screwed."

I can highly recommend having a look at Tamara Munzner's keynote on visualization principles, available here.

Slides and videos of the talks will be available on the website.

Open Research Computation - a new journal from BioMedCentral



As a colleague of mine said a couple of weeks ago: "if you don't publish it, it didn't happen". Scientific publications are the currency to advance a researcher's career. Looking for a new job? You better make sure your publication list is littered with first or second author papers in good (read: high impact factor) journals. Hoping to have your tenure track lead to tenure? Idem. Publish or perish.

Meanwhile, many bioinformaticians spend huge amounts of time developing software to make genetic or genomic research possible; research that just wouldn't happen if it were not for their custom-written tools, scripts and pipelines. Unfortunately, you often need the find function of your web browser or PDF reader to pinpoint the lone bioinformatician in the author list. He's not the first or second author; his work exists in service of someone else's research.

Just like many others (see e.g. the alt-metrics manifesto), I feel quite strongly about the limitations of impact factors for judging a researcher's contribution to science. What about those papers that are published in "lower" journals but are actually read more? What about blogs? What about your contributions on FriendFeed, seqanswers, quora, ...? How about all that software you wrote that everyone can access on github? Don't you think these have a teeny-weeny little value in the scientific discourse as well?

It will take time to change this skewed way of appreciating a researcher's work. We'll have to fight this from different angles, and subtly. We'll have to rethink what the term "publication" means, and how we can reference each other's work, including data that we made available in repositories and contributions to discussion forums. To start this process and give credit for the work of the computational scientist, BioMed Central launches a new journal today: Open Research Computation with Cameron Neylon at the helm as the editor-in-chief (and I'm grateful for having been asked to act as an editor). The aim of this journal is to provide a venue for programmers in research, who - as Cameron says it in his accompanying blog post - "either see themselves as software engineers or as researchers who code". The focus of the journal will be on making the tools available to the wider community of research programmers, with strict guidelines on code quality, code reusability and documentation.

I can only suggest you read Cameron's post and have a look at the website. The rest is up to you. Let's give the lone bioinformatician a venue to bring his work out in the public.

VCF, tab-delimited files and bioclojure


A lot of the work I do involves extracting data from VCF files ("Variant Call Format"). It's tab-delimited, but not quite: some of the columns contain structured data rather than just a value, and the format of these columns might even be different for every single line.

An example line (with the header):

#CHROM POS   ID REF ALT QUAL   FILTER INFO                                            FORMAT   SAMPLE1
1      12345 .  A   G   249.00 0      MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89 GT:DP:GQ 1/1:89:99.00

The INFO field is actually a list of tag/value pairs (except when it's just a tag), and the meaning of the data in the SAMPLE1 column is explained in the FORMAT column. Not only can different INFO tags be present on different lines, but the FORMAT can change line-by-line. Let it be clear that this is quite a bit of a pain to parse. What if I want to know the distribution of the depth at which each SNP is covered (i.e. the DP tag in the INFO field)?
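As a sketch of what such parsing involves, here are two small helpers in plain Ruby (my own illustration, not the bioclojure code this post is about): one flattens the INFO field into a hash, the other uses the FORMAT column to give meaning to the sample column.

```ruby
# Flatten the INFO column of one VCF line into a hash; flag-only tags
# such as DB get the value true. Helper names are my own.
def parse_info(info)
  info.split(';').inject({}) do |h, field|
    tag, value = field.split('=', 2)
    h[tag] = value.nil? ? true : value
    h
  end
end

# FORMAT explains the meaning of the sample column, so zip them together
def parse_sample(format, sample)
  Hash[format.split(':').zip(sample.split(':'))]
end

line = "1\t12345\t.\tA\tG\t249.00\t0\t" \
       "MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89\t" \
       "GT:DP:GQ\t1/1:89:99.00"
info, format, sample = line.split("\t")[7, 3]
puts parse_info(info)['DP']              # => 89
puts parse_sample(format, sample)['GQ']  # => 99.00
```

Answering the DP-distribution question then reduces to collecting parse_info(info)['DP'] over all lines.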

So the first thing anyone does before working with such files: convert to "real" tab-delimited: the INFO field is spread over multiple columns and the same goes for the data in SAMPLE1. There is a vcf2tsv script available in vcftools, but it only extracts part of the data in the INFO field; not all tags in the INFO field will be represented in the tab-delimited output file.

I've tried to write a vcf2tsv method in clojure. I'm now able to read/write VCF files without the tsv-intermediate using incanter (an R-like statistics environment built on clojure), and also have a vcf2tsv command-line script that converts a VCF file to its tab-delimited counterpart, but this time including all the data that is in the file instead of a fixed selection of tags.

The vcf2tsv script converts the above sample data into:

1 12345 . A G 249.00 0 23.66 1 26 0.2921,0.2921,89 1/1 89 99.00

So to generate that histogram of quality scores:

(use 'bioclojure)
(def snps (load-vcf "data.vcf"))
(with-data snps
  (view (histogram ($ :QUAL) :nbins 50)))

The result:

You can download the bioclojure library (which now only has this functionality :-) from github. That page also shows how to use the library. Feel free to fork and add new functionality; I only have time for very small incremental additions :-)

Postdoc position - Genomic variation discovery and visualization


Just a short note...

Even though my position in Leuven only starts in October, I've already been involved in writing and defending a major grant. We've set up a consortium in Leuven (SymBioSys 2) consisting of 6 PIs "focusing on how individual genomic variation leads to disease through cascading effects across biological networks". This should be a good stepping stone to get my own lab running.

The grant covers six workpackages, ranging from the analysis of raw next-gen sequencing data, through the application of network algorithms for gene prioritization, to the ultimate application in a couple of disease fields. I'll be in charge of the workpackage "Genomic variation discovery and visualization". Within this WP, we will apply and improve existing methods for the discovery, annotation and prioritization of SNPs and structural variation, based on data available from the Leuven University hospital (HiSeq and 454). We'll develop a pipeline that can be used within and outside of the university. In addition - and I envision this to be the bigger part of the work - we'll work on visualization of these data. Several issues remain in this field: data can be too big to be readily visualized (e.g. locations of aberrantly mapped read pairs), or too complex (e.g. structural variation between individuals or family structures).

So I have a postdoc vacancy for someone with a good understanding of genetics and next-generation sequencing to work on this project. He/she should have good programming and statistics skills. Ideally, he/she will also have visualization experience, or be very keen to sink his/her teeth into the subject. By visualization experience I mean for statistical purposes (e.g. using R), but also more generally (e.g. using Processing). Expect to be talking a lot about indexing methods, mapreduce, visual encoding, ...

The SymBioSys2 consortium consists of young scientists in the fields of bioinformatics, human genetics and cancer. As I'll only be starting my own group this fall the person doing the postdoc will have the chance to have a real impact on where my group will be going :-)

Encounter with incanter - about clojure, incanter and bioinformatics


I have been a bit frustrated lately by the fact that for many of my analyses I have to write a ruby script to mangle my data first, then resort to R to add a statistic to each of the datapoints, go back to ruby to mangle the result, repeat, rinse, and finally make plots in R. Of course as a bioinformatician you're used to that, and if necessary you write wrapper/pipeline scripts to handle this all for you if you know that this won't be the only time you have to do the analysis. But often you don't know. And breaking the analyses up into chunks based purely on the tool you use makes you lose the overview.

Being relatively new to R (less than 2 years), some of its idiosyncrasies keep biting me. Just last week I had a dataframe to which I wanted to add a column that was the mean of two other columns. So from

1  9
8 16
2  4

I wanted to go to:

1  9  5
8 16 12
2  4  3

So you'd think (at least I would): df$new_column [...]

Threads in ruby: probably not how to use them


I should create an online labbook with code examples of how I do things. I keep going back to an example script to copy/paste the code for handling multiple threads in ruby. But I'll put it here for the moment :-)

Suppose I have a file with several million lines containing information on SNPs. And suppose I have a database that already contains data for those SNPs. And suppose I want to update the entries in the database with the data from the input file.

Please note: this is a quick hack.

require 'rubygems'
require 'progressbar'

MAX_NR_OF_THREADS = 5

nr_of_lines = `wc -l input_file.tsv`.split(/ /)[0].to_i
pbar = ProgressBar.new('processing', nr_of_lines)

File.open('input_file.tsv').each_slice(MAX_NR_OF_THREADS) do |slice|
  threads = Hash.new
  slice.each do |line|
    threads[line] = Thread.new do
      # do the actual line parsing, DB lookup and DB updates for this line
    end
  end
  threads.values.each do |thread|
    thread.join
  end
  pbar.inc(slice.size)
end
pbar.finish
I know this is far from perfect:
  • I shouldn't need to create that array.
  • This way all concurrent threads wait for each other before the next slice is taken from the input file. If one of the 5 threads takes a really long time, the other ones will wait but could instead start parsing the next lines in the input file.
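For that second point, a queue-based worker pool would let fast threads move on while a slow one finishes: a fixed number of threads pull lines from a queue as soon as they are free, instead of advancing slice by slice. A hedged sketch (process_with_pool is my own stand-in; the upcase stands in for the real parsing/DB work):

```ruby
# Worker-pool variant: nr_of_threads workers consume lines from a queue,
# so a slow line no longer holds up the other workers.
def process_with_pool(lines, nr_of_threads = 5)
  queue   = Queue.new
  results = []
  mutex   = Mutex.new

  lines.each { |line| queue << line }
  nr_of_threads.times { queue << :done }  # one poison pill per worker

  workers = (1..nr_of_threads).map do
    Thread.new do
      while (line = queue.pop) != :done
        result = line.upcase              # stand-in for the real work
        mutex.synchronize { results << result }
      end
    end
  end
  workers.each { |t| t.join }
  results
end

puts process_with_pool(%w(aa bb cc)).sort.inspect  # => ["AA", "BB", "CC"]
```

The poison pills at the end of the queue give each worker a clean way to stop once the input is exhausted.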

Don't think less of me for this code...

Tipping my toes in mongodb with ruby


Reading the excellent post by Neil Saunders on using ruby and mongodb to archive his posts on FriendFeed prompted me to finally write down my own experiences with mongodb. So here goes...

Let's have a look at the pilot SNP data from the 1000genomes project. The data released in April 2009 contain lists of SNPs from a low-coverage sequencing effort in the CEU (European descent), YRI (African) and JPTCHB (Asian) populations. SNPs can be downloaded from here; get the files called something.sites.2009_04.gz. The exercise that we'll be performing here is to get an idea of how many SNPs are in common between those populations.

The input data

The input data contain chromosome, position, reference allele (based on the reference sequence), alternative allele and allele frequency in that population. Some sample lines from CEU.sites.2009_04.gz:

1 223336 C G 0.017544
1 224176 C T 0.052632
1 224344 T A 0.824561
1 224419 C T 0.008772
1 224438 C T 0.412281
1 224472 T C 0.122807

What will it look like in a database?

If we loaded this into a relational database, we could create a table with the following columns: chromosome, position, ceu_ref, ceu_alt, ceu_maf, yri_ref, yri_alt, yri_maf, jptchb_ref, jptchb_alt, jptchb_maf. However, we would end up with a lot of NULLs in that table. For example: there is a SNP at position 522,311 on chromosome 1 in the CEU and YRI populations, but not in the JPTCHB population.

chr pos    ceu_maj ceu_min ceu_maf yri_maj yri_min yri_maf jptchb_maj jptchb_min jptchb_maf
1   223336 C       G       0.01754 G       C       0.47321 C          G          0.22034
1   522311 C       A       0.05263 C       A       0.33036 NULL       NULL       NULL

(In this example, I have already recoded ref/alt alleles to major/minor alleles. In some cases, such as the SNP at 223,336, the major allele is not the same in each population.)

Actually, of the 21,742,359 SNPs, only 4,836,814 (about 22%) are present in all three populations. Moreover, there are 13,957,866 (64%!) SNPs that are present in only one of the populations.
Ergo: loads of NULLs. This is where you can start thinking of using a document-oriented database for storing these SNP data: each document will be tailored to a specific SNP and will e.g. not refer to the JPTCHB population if the SNP is not present in that population. Enter mongodb. Apparently the "mongo" comes from "humongous". I hope it will live up to that name... The above two SNPs could be represented in two json documents like this:

{ "_id" : "1_223336", "chr" : "1", "pos" : 223336,
  "kg" : { "ceu"    : { "major" : "C", "minor" : "G", "maf" : 0.017544 },
           "yri"    : { "major" : "G", "minor" : "C", "maf" : 0.473214 },
           "jptchb" : { "major" : "C", "minor" : "G", "maf" : 0.220339 } } }

{ "_id" : "1_522311", "chr" : "1", "pos" : 522311,
  "kg" : { "ceu" : { "major" : "C", "minor" : "A", "maf" : 0.05263 },
           "yri" : { "major" : "C", "minor" : "A", "maf" : 0.33036 } } }

Getting the data in the database

So how do we load this data into a mongo database? You'll obviously need a mongo daemon running, but I'll leave that as an exercise to the reader as it's very easy and well-explained on the mongodb website. My approach was to first import all CEU SNPs straight from a json-formatted file, and then parse the YRI and JPTCHB files.

1. Loading the CEU SNPs

The mongoimport tool takes a json-formatted file and - guess what - imports it into a mongo database. Just reformat the lines in the original CEU.sit[...]
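Building such a nested document from one line of a population file is straightforward. Here is a hedged Ruby sketch: snp_document is my own helper, it skips the major/minor recoding step mentioned above, and the actual insert (upserting the per-population subdocument with the mongo gem so the three files merge into one document per SNP) is left out.

```ruby
# Build the nested per-SNP document for one line of a population file
# ("chr pos ref alt maf"). The population name becomes a subdocument
# under "kg", so documents from different populations can be merged.
def snp_document(population, line)
  chr, pos, ref, alt, maf = line.split
  { "_id" => "#{chr}_#{pos}",
    "chr" => chr,
    "pos" => pos.to_i,
    "kg"  => { population => { "major" => ref,
                               "minor" => alt,
                               "maf"   => maf.to_f } } }
end

doc = snp_document("ceu", "1 223336 C G 0.017544")
puts doc["_id"]  # => 1_223336
```

Counting the SNPs shared between populations then becomes a query on how many keys the "kg" subdocument has.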

Trying out mapreduce - on the farm


Received an email this week from the Sanger helpdesk that they installed a test hadoop system on the farm with 2 nodes. Thanks guys! The first thing to do, obviously, was to repeat the streaming mapreduce exercise I did on my own machine (see my previous post). The only difference from my local setup is that this time I had to handle HDFS.

As a recap from my previous post: I'll be running an equivalent of the following:
cat snps.txt | ruby snp_mapper.rb | sort | ruby snp_reducer.rb

Setting up
First thing the Sanger wiki told me was to format my HDFS space:
hadoop namenode -format
This apparently only affects my own space... After that I could start playing with hadoop.

Where am I?
It looks like hadoop installs its own complete filesystem: even though my path on the server is /home/users/a/aerts, a
hadoop fs -lsr /
shows that in the HDFS system I'm at /user/aerts.

Preparing the run
First off: create a directory to work in. Let's call that "locustree" with two subdirectories, called "input" and "output".
hadoop fs -mkdir locustree
hadoop fs -mkdir locustree/input
hadoop fs -mkdir locustree/output
And copy your datafile to the input directory:
hadoop fs -put snps.txt locustree/input/
Running the run
The Hadoop documentation mentions that any shebang line in the mapper and reducer scripts is likely not to work, so you have to call ruby explicitly. Provide both scripts as "-file" arguments to hadoop. Finally, from the local directory containing the scripts, run the following command to start the mapreduce job:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-input locustree/input/snps.txt \
-mapper "/usr/local/bin/ruby snp_mapper.rb" \
-reducer "/usr/local/bin/ruby snp_reducer.rb" \
-output locustree/output/snp_index \
-file snp_mapper.rb \
-file snp_reducer.rb

Et voilà: the directory locustree/output/snp_index now contains the file spit out by the snp_reducer.rb script.

Trying out mapreduce


Photo by niv available from Flickr

I have long been interested in trying out mapreduce in my data pipelines. The Wellcome Trust Sanger Institute has several huge compute farms that I normally use, but they don't support mapreduce jobs. Quite understandable from IT's and the institute's point of view, because it's a mammoth task to keep those things running. But it also means that I can't put a foot on the mapreduce path.

There are options to run mapreduce on your own, however. One is to install hadoop locally, with the restriction that you can only use one node: your own machine. This walkthrough will get you started with that. Another approach is to use Amazon Elastic MapReduce (AWS EMR).

First off, I must admit that I'm far from familiar with mapreduce at this point in time. As far as I understand, it's ideal for aggregating data when the algorithm can be run on a subset of the data as well as on the complete dataset. The mapping step of the framework will create different subsets of that data, run the required program on them and return partial results. These results should be in the form of key/value pairs and will serve as the input for a reduce step that combines all intermediate results into one.

What's held me back from using AWS EMR is the cost. It's practically nothing, but anything I run on their servers at this moment is paid for with my own money. So today I installed hadoop locally and started coding.

The pilot

One of the projects I'm working on is LocusTree: a way of indexing genomic features that allows for quick visualization at different resolutions. In this data structure the genome is first divided into bins of 4,000 bp (called level 0). The next step sees the creation of new "parent" bins that each hold 2 of the original ones (so each covering 8,000 bp; called level 1). You keep on grouping 2 bins into a parent bin until you end up with a single bin for a complete chromosome.
For chromosome 1 that means 17 levels, I think. Each of the features - for example a list of 15 million SNPs or log2ratios for 42 million probes - is put in the smallest node in which it fits entirely. A feature ranging from position 10 to 20 will fit in the first bin on level 0; but a feature from position 3,990 to 4,050 will have to go in the first bin on level 1. Subsequently a count and, if applicable, some aggregate value like sum, min or max is stored in that node as well as in every parent node. So for each SNP/probe the indexing script has to find the smallest enclosing node and update the aggregate values for each parent, all the way up.

The setup

I'm not good at java, so I used the hadoop streaming approach for this indexing. The input file contains about 15 million SNPs and looks like this:

snp6216929 3 177718011 177718021
snp6216930 3 177718267 177718277
snp6216931 3 177718268 177718278
snp6216932 3 177718294 177718304
snp6216933 3 177718612 177718622
snp6216934 3 177718629 177718639
snp6216935 3 177718956 177718966
snp6216936 3 177719529 177719539
snp6216937 3 177719623 177719633

I want to get a list of nodes with the count of SNPs within that node and all child-nodes. Each node should also list the SNP names for which that node is the smallest enclosing node. So the output looks like:

3:0:44429 9 snp6216929,snp6216930,snp6216931,snp6216932,snp6216933,snp6216934,snp6216935,snp6216936,snp6216937
3:0:44430 1 snp6216938
3:1:22214 9
3:1:22215 1
3:2:11107 10
3:3:5553 10
3:4:2776 10
3:5:1388 10
3:6:694 10
3:7:347 10
3:8:173 10
3:9:86 10
3:1[...]
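A hedged sketch of what the two streaming scripts could look like (this is my own simplified reconstruction, not the actual code used on the farm; the real scripts also store aggregate values). The mapper finds each SNP's smallest enclosing node and emits it together with all parent nodes; the reducer sums the counts and collects the names per key.

```ruby
# Key format and bin size follow the example output above:
# "chr:level:bin", with level-0 bins of 4,000 bp that double in size
# at every level up.
BIN_SIZE = 4000
TOP_LEVEL = 17

# lowest level at which start and stop fall into the same bin
def smallest_node(start, stop)
  level = 0
  level += 1 while start / (BIN_SIZE << level) != stop / (BIN_SIZE << level)
  [level, start / (BIN_SIZE << level)]
end

# mapper: one "key<TAB>count:name" line per node; parents get empty names
def map_snp(line)
  name, chr, start, stop = line.split
  level, bin = smallest_node(start.to_i, stop.to_i)
  out = ["#{chr}:#{level}:#{bin}\t1:#{name}"]
  (level + 1).upto(TOP_LEVEL) do |l|
    bin /= 2
    out << "#{chr}:#{l}:#{bin}\t1:"
  end
  out
end

# reducer: aggregate the (sorted) mapper output per node key
def reduce_nodes(mapped_lines)
  mapped_lines.inject(Hash.new { |h, k| h[k] = [0, []] }) do |acc, l|
    key, payload = l.split("\t")
    count, name = payload.split(':', 2)
    acc[key][0] += count.to_i
    acc[key][1] << name unless name.to_s.empty?
    acc
  end
end

puts map_snp("snp6216929 3 177718011 177718021").first
# => 3:0:44429	1:snp6216929
```

For the first SNP above this reproduces the example keys: 177,718,011 / 4,000 = 44,429 at level 0, whose parent at level 1 is bin 22,214.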

First test release of circular genome browser


I've worked a couple of days on pARP, the circular genome browser, and I think it's ready to be tested by others. Consider this an alpha release: expect a lot of issues. It's easy to create regions with a negative length, for example. Also, I didn't focus yet on user-friendliness or general input files. Ways of interacting are not yet made clear to new users, and the input files still need to have fixed names and be stored in a particular folder.

pARP is designed to be a genome browser for features that are linked to other features on a genome (e.g. readpair mappings). Using a circular display, lines can be drawn connecting these features.

pARP always shows the whole genome. You can zoom into selected regions, but the rest is still shown, albeit squeezed a bit more together. The reason for this is that I want to show the context at all times. Suppose you'd zoom into two regions A and B that are linked by a large number of readpairs. If the part of the genome that is not A or B were not shown, any readpair that has only one of its reads in A or B would just not be shown. By showing the whole genome, even squeezed into a few pixels, you can at least see that some reads are linked outside of A and B.

I've put some information on the github wiki page, such as how to interact with it and what the datafiles should look like. For a little taste, here's a very brief screencast:

A lot of things still need to happen:
  • Catch a lot of edge cases
  • Incorporate a library for fast loading of features (i.e. LocusTree, which doesn't exist yet)
  • Make interaction more straightforward: use the mouse for panning/zooming, for example
  • About 1,472 other things that I currently forget

Also: I'm looking for a new name for pARP. pARP stands for "processing abnormal readpairs" (which is what it was meant for originally), but it's actually just a genome browser using a circular representation to show linked features.
Suggestions I already got are encircle and SqWheel or Squeal (the last two based on sequence-wheel; Squeal was my own idea, so I like that one most at the moment :-) ).

A very, very big thanks goes to Jeremy Ashkenas, the author of ruby-processing. With pARP I have been pushing the boundaries of what that library does, and he has adapted it to my needs as I went. See here for his ruby-processing library. Other thanks go to my colleagues Erin, Klaudia, Jon, Nelo and Chris for their ideas.

pARP can be downloaded or cloned from github. Mac, Windows and linux versions are available there as well.[...]

LocusTree - searching genomic loci


"Contigs should not know where they are." That's a phrase uttered by James Bonfield when presenting his work on gap5, the successor to gap4, a much-used assembly software suite. So you think: "Wait a second: you're talking about assembly, and the contigs should not store their position?"

This statement addresses a problem that we encounter often when working with genomic data: how to handle features. The approach often used is to give the feature a 'chromosome', 'start position' and 'stop position'. Seems reasonable, right? So if you want all features on chromosome 1 between positions 6,124,627 and 6,827,197, you just loop over all features and check if their range overlaps with this query range. Indeed: seems reasonable. Unless your collection of features goes into the millions. Suppose you have arrayCGH data with a resolution of 1 datapoint per 50 basepairs (the green/red bars in the picture). If you wanted to search for this region, you could focus on chromosome 1 (if you were so smart as to create different collections per chromosome) and start comparing the start and stop positions of each feature from the beginning. Chromosome 1 is almost 250Mb, so it would contain 5 million array probes. That's a lot of features to check.

What James alluded to was that you should retain the position information at a higher level. He uses an adaptation of an R-Tree, which is a data structure that groups all raw data into bins of a certain size (say 25 elements). Those bins themselves are binned again into larger bins, and so forth, until only one bin remains: the root. Each bin knows its start and stop position relative to the bin it's contained in. To search for the same region as above, you start at the root bin, which (by definition) overlaps with the query. So you go to the next level and check each of the child bins of root. You can reject those that don't overlap with your range (the red bins) and only focus on the ones that do.
Continue until you reach the bottom-most layer, which contains your actual features.

This has two advantages for my work on pARP, the visualization tool for aberrant read pair mappings. First of all, it's faster to get the features in a given region. A major feature of the pARP tool is that you can zoom into any region. With more than 6 million values for readdepth, speed is a big issue. But that's not all. Think about this: if I have a screen resolution of 1200 pixels wide, why would I try to plot 6 million points next to each other? Twelve hundred would be enough, wouldn't it? R-Tree to the rescue again. In building the tree, each bin stores some aggregate data from its members; for example the average readdepth. To draw the readdepth across the whole genome, I only have to start at the root of the LocusTree and go down to the level that contains about 1200 bins. No need to load the complete dataset. At the moment LocusTree bins only take the average of whatever value their members have, but this can be expanded to contain any type of aggregation (e.g. number of objects).

LocusTree is not the fastest library around (being written in ruby, and by me), but I think it does what I need it to do. This means that it might not do exactly what you need it to do. But the code is at github, so feel free to fork and improve. I did add a list of to-dos at the bottom of the README file...

UPDATE: I have now realized that - due to DataMapper - this library does not work under jruby, which is necessary for pARP, which uses ruby-processing. A bette[...]
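The "go down to the level with about 1200 bins" step can be sketched in a few lines (plain Ruby, my own helper names, not the actual LocusTree API; level-0 bin size of 4,000 bp as in LocusTree):

```ruby
# Given a chromosome length and the screen width in pixels, find the
# lowest level whose bin count fits on screen; one aggregate value per
# bin at that level is all that needs to be drawn.
BIN_SIZE = 4000  # level-0 bin size, as in LocusTree

def bins_at(chromosome_length, level)
  (chromosome_length.to_f / (BIN_SIZE << level)).ceil
end

def drawing_level(chromosome_length, pixels = 1200)
  level = 0
  level += 1 while bins_at(chromosome_length, level) > pixels
  level
end

# chromosome 1 is almost 250Mb
puts drawing_level(250_000_000)  # => 6
puts bins_at(250_000_000, 6)     # => 977 bins across 1200 pixels
```

So for a 1200-pixel screen the drawing never has to descend below level 6 when the whole of chromosome 1 is in view; zooming in simply repeats the calculation with the length of the visible region.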

The good and bad of genome viewers


Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: chromosomes are linear. Using different tracks to show different types of annotation is the next logical step.

But there is not just one human genome on earth; according to Wikipedia there were about 6.76 billion copies as of March 2009. So instead of talking about "the human genome" in those browsers, we talk about "the reference genome". Each person on earth is different, and so is each human genome. (That puts the reasoning on its head, but never mind.) Differences between humans such as SNPs and microsatellites can still be shown in the track-based browsers.

Things get more difficult when you're looking at structural variation. Structural variation messes up the backbone of the linear genome browser: you can't show differences between individuals in one straight line. Suppose you want to investigate a copy-number variant (CNV) and consult UCSC. You'd find tracks such as this:

Although this does give you quite some information on the CNV in question, it's not an adequate representation of what the different alleles actually look like. It also highlights another issue: the concept of "the reference genome". As more and more genomes are sequenced, is the one that was picked first the best for visualization and indeed, the reference? To be able to handle the different MHC haplotypes, for example, the Ensembl database contains a table called "assembly_exceptions" that holds the alternative assemblies for each haplotype.

I believe that further down the line (although it might be quite a while) we might need to forget the whole notion of a reference genome. Two options come to mind. First of all, we could create an artificial reference that contains all sequence, and let each real sequence we want to look at, well, reference that artificial assembly.
That would mean that the different MHC haplotypes, for example, would all be in the same sequence. Similarly, a copy-number variant with, let's say, 3 to 8 copies would have all 8 included in the mock assembly. Unfortunately this still cannot cover structural variation like inter-chromosomal translocations. We can't build a single artificial assembly that would incorporate those.

So here's the alternative: de Bruijn graphs. Instead of creating a single linear representation of a reference, let's just not. We could use building blocks to build up each individual. Take a look at this picture:

Suppose that each block is a part of a chromosome and the red and blue lines represent the path to follow to build up the chromosome for a particular individual. In this picture the red individual misses a part of that chromosome that is present in the blue individual, and another part is inverted. Notice that we don't make any (arbitrary) decision about what the reference sequence is. By dragging the blocks we can place either all red connections or all blue ones on a single line, making that individual look like the reference.

If we then added annotations like genes to this picture, we'd be able to display fusion genes. Suppose that the densely-striped block is on chromosome 7 in the red individual but on chromosome 12 in the blue one. If there's a gene on the right breakpoints, we end up with a fusion gene.

Time permitting, I'm going to investigate how useful this will be in projects like [...]
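The picture above can be mimicked in a few lines of Ruby. This is purely illustrative (the block names, sequences and paths are invented, and a real genome graph would of course be far richer): each individual is just an ordered path through shared blocks, with an orientation flag, and no block is designated "the reference".

```ruby
# Hypothetical blocks of chromosome sequence, shared between individuals.
blocks = {
  "A" => "ACGT", "B" => "GGCC", "C" => "TTAC", "D" => "CGCG"
}

# Each individual is an ordered path through blocks; "+" / "-" gives the
# orientation, so an inversion is the same block traversed on "-".
paths = {
  "red"  => [["A", "+"], ["C", "-"], ["D", "+"]],            # B deleted, C inverted
  "blue" => [["A", "+"], ["B", "+"], ["C", "+"], ["D", "+"]]
}

# Reconstruct the linear chromosome sequence for one individual by
# walking its path, reverse-complementing any "-" block.
def sequence(path, blocks)
  path.map { |name, strand|
    seq = blocks[name]
    strand == "+" ? seq : seq.reverse.tr("ACGT", "TGCA")
  }.join
end

red  = sequence(paths["red"], blocks)
blue = sequence(paths["blue"], blocks)
```

Laying either path out left-to-right is exactly the "drag the blocks onto one line" operation: whichever individual you flatten becomes the apparent reference, without that choice being baked into the data.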

Who-o-o are you? Who who? Who who?


Image by Danny McL via Flickr

There's been quite a lot of discussion going on lately about author identification: Raf Aerts' correspondence piece in Nature (doi:10.1038/453979b), discussions on FriendFeed, ... The issue is that it can be hard to identify who the actual author of a paper is if their name is very common. If your name is Gudmundur Thorisson ("hi, mummi") you're in luck. But if you are a Li Y, a Zhang L or even an Aerts J, it's a bit harder. Searching PubMed for "Aerts J" returns 299 papers. I surely don't remember writing that many. I wish... So if a future employer searched PubMed for my name, they would not get a list of my papers, but a list of papers by authors that share my name. Also, some of my papers mention as the contact email. Well: you're out of luck, I'm afraid. That email address doesn't exist anymore because I changed jobs.

The idea exists to create a unique ID for each author, similar to the doi ("digital object identifier") for a paper. Thomson Reuters have created ResearcherID, but because doi's are handled through the not-for-profit CrossRef, let's call the unique author ID a dsi ("digital scientist identifier"). This dsi can then be used by a scientist to identify himself wherever he needs to. Here I'll try to explain how I think this could work.

But first of all: what are the prerequisites for a dsi-based environment? Obviously, the journals would need to request the dsi of authors on submission rather than just their names and email addresses; they can get names and email addresses through the dsi. And secondly, we need a service that assigns dsi's and where scientists can update their details and add information.

The service/website

Let there be a website (for argument's sake) that assigns new dsi's to new authors (only one dsi per author). So I could for example be dsi.12345. This service should have additional functionality such as a list of contributions, a curriculum vitae, contact details and a network.
It should also provide a homepage or profile page for each scientist, listing at least their name, affiliation and literature list (i.e. what you would get from a PubMed search). On my profile page you'd see at least my name, the address of the institute where I work and a list of papers that I co-authored.

Getting a dsi

It's critical that one researcher only gets one dsi. This is less than straightforward, because I believe many researchers will not be interested enough in the whole identity story to even remember whether they already have a dsi or not. So if I were to go to the dsi website and request an ID, the website would ask for my name first. It'd also ask if I have used different names in author lists (e.g. I'm a woman, got married and started using my married name instead of my maiden name). Using that information, the service would then search PubMed for papers that are authored by someone with my name (who might be me). It could present that list to me and ask if I'm actually that same person or not. This way we'd build up a minimal list of papers. That minimal list would then be checked against the dsi database to see if there isn't already someone with my name who has claimed these papers. Logically that person would be me, and it would appear that I already have a dsi. If no dsi has this name and these papers associated the[...]
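The duplicate check at the heart of that registration flow could be sketched like this. Everything here is hypothetical (the registry layout, the `pmid:` identifiers and the function name are mine, not part of any real dsi service); the point is just the rule described above: same name plus overlapping claimed papers suggests an existing dsi.

```ruby
# Toy dsi registry: each dsi maps to a name and the papers claimed under it.
registry = {
  "dsi.12345" => { name: "Aerts J", papers: ["pmid:111", "pmid:222"] }
}

# Before minting a new dsi, look for an existing entry with the same
# author name AND at least one paper in common with the applicant's
# claimed list; if found, that dsi probably already belongs to them.
def find_existing_dsi(registry, name, claimed_papers)
  registry.find { |_dsi, record|
    record[:name] == name && !(record[:papers] & claimed_papers).empty?
  }&.first
end

find_existing_dsi(registry, "Aerts J", ["pmid:222", "pmid:333"])
# overlap on pmid:222 points back to dsi.12345
```

A real service would of course need fuzzier matching (name variants, affiliations), but even this crude rule catches the common case of someone re-registering after forgetting their dsi.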