Fred Thoughts on Startups, UX etc | The UX Ray






An Advanced Elasticsearch Architecture for High-volume Reindexing

Sun, 30 Oct 2016 18:29:32 +0000

I’ve found a new and funny way to play with Elasticsearch to reindex a production cluster without disturbing our clients. If you haven’t already, you might enjoy what we did last summer reindexing 36 billion documents in 5 days within the same cluster. Reindexing that cluster was easy because it was not in production yet. Reindexing a whole cluster where regular clients expect to get their data in real time offers new challenges and more problems to solve.

As you can see on the screenshot below, our main bottleneck the first time we reindexed Blackhole, the well named, was the CPU. Having the whole cluster at 100% and a load of 20 is not an option, so we needed to find a workaround.

This time, we won’t reindex Blackhole but Blink. Blink stores the data we display in our clients’ dashboards. We need to reindex it every time we change the mapping to enrich that data and add the new features our clients and colleagues love.

A glimpse at our infrastructure

Blink is a group of 3 clusters built around 27 physical hosts each, every host having 64GB RAM and a 4 core / 8 thread Xeon D-1520. They are small, affordable and disposable hosts. The topology is the same for each cluster:

- 3 master nodes (2 in our main data center and 1 in our backup data center, plus a virtual machine ready to launch in case of a major outage)
- 4 http query nodes (2 in each data center)
- 20 data nodes (10 in each data center)

The data nodes have 4*800GB SSD drives in RAID0, about 58TB per cluster. The data nodes are configured with Elasticsearch zone awareness. With 1 replica for each index, that makes sure we have 100% of the data in each data center, so we’re crash proof. We didn’t allocate the http query nodes to a specific zone for a reason: we want to use the whole cluster when possible, at the cost of 1.2ms of network latency. From the Elasticsearch documentation:

When executing search or GET requests, with shard awareness enabled, Elasticsearch will prefer using local shards — shards in the same awareness group — to execute the request. This is usually faster than crossing racks or awareness zones.

In front of the clusters, we have a layer 7 load balancer made of 2 servers, each running Haproxy and holding various virtual IP addresses (VIP). A keepalived ensures the active load balancer holds the VIP. Each load balancer runs in a different data center for fault tolerance. Haproxy uses the allbackups configuration directive so we access the query nodes in the second data center only when the first two are down.

frontend blink_01
  bind 10.10.10.1:9200
  default_backend be_blink01

backend be_blink01
  balance leastconn
  option allbackups
  option httpchk GET /_cluster/health
  server esnode01 10.10.10.2:9200 check port 9200 inter 3s fall 3
  server esnode02 10.10.10.3:9200 check port 9200 inter 3s fall 3
  server esnode03 10.10.10.4:9200 check port 9200 inter 3s fall 3 backup
  server esnode04 10.10.10.5:9200 check port 9200 inter 3s fall 3 backup

So our infrastructure diagram becomes:

In front of the Haproxy, we have an applicative layer called Baldur. Baldur was developed by my colleague Nicolas Bazire to handle multiple versions of a same Elasticsearch index and route queries amongst multiple clusters.

There’s a reason why we had to split the infrastructure into multiple clusters even though they all run the same version of Elasticsearch, the same plugins, and do exactly the same things. Each cluster supports about 10,000 indices and 30,000 shards. That’s a lot, and Elasticsearch master nodes have a hard time dealing with so many indexes and shards.

Baldur is both an API and an applicative load balancer built on Nginx with the LUA plugin. It connects to a MySQL database and uses Nginx memory for caching. Baldur was built for 2 reasons:

- to tell our API the active index for a dashboard
- to tell our indexers which indexes they should write to, since we manage multiple versions of the same index.

In Elasticsearch, each index has a defined naming: _ In[...]
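
The zone awareness mentioned above combines a custom node attribute with a cluster-level allocation setting. Here is a minimal sketch of what that looks like, not our actual configuration: the attribute name datacenter, the value eu-dc1 and the configuration path are placeholders.

# Tag each node with the data center it lives in, then tell Elasticsearch
# to spread primaries and replicas across that attribute.
echo "node.datacenter: eu-dc1" >> /etc/elasticsearch/elasticsearch.yml
echo "cluster.routing.allocation.awareness.attributes: datacenter" >> /etc/elasticsearch/elasticsearch.yml
service elasticsearch restart

With 1 replica per index and two possible values for the attribute, each data center ends up holding a full copy of the data, which is what makes the crash proof setup described above work.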



That battle for Web standards we used to fight

Sat, 15 Oct 2016 14:30:55 +0000

Do you remember when fighting for the Web standards was cool and the W3C HTML validator was a thing? I do, and that’s great if you don’t. It means you’re younger than me and that long, exhausting battle against a Web designed for Internet Explorer 6 is a thing of the past.

I became a Web standards advocate somewhere between 2002 and 2003. Back then, I was running Linux as a desktop and was furiously pissed off by Web sites that did not display properly under the Mozilla suite because 95% of the world was using Internet Explorer 6. We had to test our Web sites against Internet Explorer 5.0, 5.2 for Mac, 5.5, 6.0, various flavors of Netscape Navigator, Mozilla, Opera and Safari. And none of them rendered CSS the same way. Imagine a world where you need rounded GIFs because your browser can’t display rounded corners, a world where padding in floating elements does not behave the same way in all browsers, and where CSS adoption still has a long way to go. Why would you use CSS when designing with tables and inline styles does the job?

As a Web developer, all my management required of me was developing for Internet Explorer 6 and Internet Explorer 6 only. Every other browser was considered worthless, and with a monopolistic market share, Microsoft did not have to worry about releasing a more modern Web browser.

In 2002, the Web standards community was still small and resources were scarce. Eric Meyer on CSS was the Web developer Bible, Molly Holzschlag our goddess and the CSS Zen Garden our lighthouse. My best memories as a Web standards advocate are meeting Molly at her hotel in Paris, eventually getting my copy of The Zen of CSS Design signed, and giving a talk with David Larlet at Paris Web 2008. We were a group of idealists, fighting daily for the Web we wanted to live in, teaching and evangelising our colleagues and families about dropping that outdated IE6 for Firefox.

14 years later, Internet Explorer 6 is dead, the battle for CSS adoption is a thing of the past, but there’s still a lot to do. The fight has moved to accessibility. For many of us, accessibility was already a concern back then, but there’s a long way to go. In 2016, many people still can’t use the Web because of a disability or lack of access to a broadband connection.

The advent of unobtrusive Javascript and the rise of the frameworks were a great thing for accessibility. They allowed Web developers to write (more) accessible modern Web applications without even thinking or knowing about it. Prototype and Scriptaculous, the first widespread Javascript frameworks, were anything but accessible, but their wow effect drove people away from their homemade Javascript stacks. Their successors did better and better.

The second battle towards a more accessible Web lies at school. I remember teaching future Web developers about producing valid HTML and splitting structure and design. This is not a thing anymore, and teaching good practices at school is now the thing. I respect the people at Opquast tremendously for providing a handy referential, great tools to validate a Web site’s accessibility, and for their continuous evangelisation work.

The next step is making the payers understand that accessibility is not about targeting the 10% of people with disabilities but making sure everyone can use their Web site. It requires Web developers’ time to do their job correctly, and upfront thinking not everybody is willing to invest in. Law enforcement was a good thing too. Forcing public organisations and large corporations to provide accessible Web sites was a great step, but only when it comes with real penalties for those who don’t comply.

Well, these were my (not so) nostalgic memories of the weekend. The memories of a small but awesome community, shared moments and a daily struggle that’s hopefully not a thing anymore.[...]



Getting rid of the phantom indexes menace on Elasticsearch zombie masters

Tue, 11 Oct 2016 08:01:52 +0000


Split brain is a recurring problem when running any kind of cluster. A sudden server crash or network partition might lead to an inconsistent state and data corruption. Elasticsearch addresses this problem by allowing multiple nodes to be configured as master. Running an odd number of master nodes and properly setting discovery.zen.minimum_master_nodes to (number of master nodes / 2) + 1 is an easy way to prevent split brain disasters.
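
Since discovery.zen.minimum_master_nodes is a dynamic setting, it can also be applied to a running cluster through the cluster settings API. A minimal sketch with 3 master-eligible nodes; the host esmaster01 is a placeholder:

# (number of master-eligible nodes / 2) + 1 = 2 when running 3 master nodes
curl -XPUT 'http://esmaster01:9200/_cluster/settings' -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'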

However, there’s still a case where your cluster might find itself in an inconsistent state.

When your master node leaves the cluster for some reason and won’t reconnect by itself, it keeps a list of the indexes that existed before the split. Our clusters are living things, and we create and delete indexes all day long. When your long lost master comes back from the dead, you’ll notice some strange messages in the logs:

[2016-10-09 16:35:12,071][INFO ][gateway.local.state.meta ] [esmaster01] [183524] dangling index, exists on local file system, but not in cluster metadata, auto import to cluster state [YES]

These are the indexes your master used to know about before the split. Elasticsearch considers that these indexes actually exist and will import them into the elected master.

[2016-10-09 16:35:16,715][DEBUG][gateway.local.state.meta ] [esmaster01] [183524] no longer dangling (created), removing

That’s the moment your cluster turns red and newly created indexes appear when running GET /_cat/indices, except the data doesn’t exist anymore. The only way to bring the cluster back to green is to delete those phantom indexes one by one using DELETE. Nothing complicated, except that a large number of freshly created indexes might bring your elected master to its knees.
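
A minimal sketch of that cleanup, assuming the elected master answers on esmaster01:9200 (a placeholder) and that the phantom indexes are exactly the ones showing up red; double check the list before deleting anything:

# Health is the first column of _cat/indices and the index name the third:
# list the red indexes, then delete them one by one with a generous master timeout.
for index in $(curl -s -XGET esmaster01:9200/_cat/indices | awk '/^red/ {print $3}'); do
  curl -XDELETE "esmaster01:9200/${index}?master_timeout=120s"
done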

According to Elasticsearch documentation, this feature has 2 purposes:

- If a new master node is started which is unaware of the other indices in the cluster, adding the old nodes will cause the old indices to be imported, instead of being deleted.
- An old index can be added to an existing cluster by copying it to the data/ directory of a new node, starting the node and letting it join the cluster. Once the index has been replicated to other nodes in the cluster, the new node can be shut down and removed.

Elasticsearch behaviour can be controlled using gateway.local.auto_import_dangled, which is set to yes by default.
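
If you’d rather never import dangling indexes, the setting can be turned off on the master nodes. A minimal sketch, assuming a Debian-style /etc/elasticsearch/elasticsearch.yml and the init script used elsewhere on this blog:

# Stop importing dangling indexes automatically (the default is yes),
# then restart the node since this is a static setting.
echo "gateway.local.auto_import_dangled: no" >> /etc/elasticsearch/elasticsearch.yml
service elasticsearch restart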

However, to avoid any surprise after a master node crash, I prefer to shut down Elasticsearch, delete the whole data directory and start the node as a fresh one. It might not fit every case, but it avoids most conflicts caused by a zombie node coming back from the dead.

Photo: Brain, by Adeel Anwer, CC.




From France 2002 to USA 2016

Sun, 09 Oct 2016 07:19:04 +0000

I don’t write about politics often. I stopped being interested in local politics after I dropped out of my political science school back in 2001, with 2 exceptions. The first one was Barack Obama’s first election, because a black man being elected president of a country with a long history of institutionalised racism was a thing. The second is this year’s campaign, because Donald Trump being the Republican nominee rings a very unpleasant bell of déjà vu.

If you’re into French politics, you might remember our 2002 presidential election. If you’re not, let me tell you a story.

The French presidential election system differs from the American one in many ways. To become a candidate, you have to be endorsed by at least 500 elected officials, from small village mayors to senators. It’s a direct election, so you vote for your favorite candidate. If they get more than 50% of the votes, they win; otherwise you vote a second time to choose between the 2 winners of the first round.

In 2002, everyone was expecting a second round between the moderate right President Chirac and his socialist Prime Minister Lionel Jospin. Because in France, you can have a Prime Minister who’s a political opponent of the President. I know this sounds stupid but it has happened a couple of times. During the past 5 years, President Chirac had been a terrible president, but he was an outstanding, born to win candidate. Jospin had been a despicable Prime Minister and a terrible candidate who did little if any campaigning, being sure Chirac’s unpopularity would lead him to win without an effort.

On Sunday April the 21st 2002, I was working in a lab at my computer engineering school when I heard people shouting in the corridor. There was a TV set there, and I could see the devastated face of Prime Minister Lionel Jospin learning he didn’t make it to the second round. I couldn’t refrain myself from smiling before I understood it meant President Chirac’s opponent would be Jean-Marie Le Pen.

Jean-Marie Le Pen is a right wing, racist, revisionist, populist French politician. He’s well known for saying that the gas chambers were “a detail in WWII history”, among other things. And he was one step from being the next French President. 2 weeks later, Chirac was reelected with 82.2% of the votes. But it was too late. Chirac didn’t win because he was the best candidate but to ensure Le Pen would lose. It was not about Chirac’s victory, but Le Pen’s defeat. And by reaching the second round for the first time in 36 years, Le Pen had already won.

From a French point of view, there are lots of similarities between our presidential election and the Clinton vs Trump campaign. I’ve been reading a lot that Trump being elected President would be a shame for America. That’s wrong. Just as Le Pen reaching the second round was a day of shame for France, the USA have already known their day of shame by nominating Trump as the Republican candidate.

In 2008, the American presidential election was about knowing whether or not America would elect their first black president, and they did. I remember how happy I was. I woke up at 5AM, logged into Seesmic and recorded a video message to tell my American friends how happy I was for them. Whatever happened during Barack Obama’s two terms, that Tuesday November the 4th made history.

8 years later, America is about to elect their first woman as President. November the 8th 2016 should be another historical day, but that’s not what history will remember. History won’t remember this day as the day a woman became President, but as the day people had to vote for or against Donald Trump. From that point of view, the attention-whore celebrity candidate has already won.[...]



How we reindexed 36 billion documents in 5 days within the same Elasticsearch cluster

Thu, 06 Oct 2016 07:00:00 +0000

At Synthesio, we use Elasticsearch in various places to run complex queries that fetch up to 50 million rich documents out of tens of billions in the blink of an eye. Elasticsearch makes it fast and easily scalable, where running the same queries over multiple MySQL clusters would take minutes and crash a few servers on the way. Every day, we push Elasticsearch’s boundaries further, and going deeper and deeper into its internals leads to even more love.

Last week, we decided to reindex a 136TB dataset with a brand new mapping. Updating an Elasticsearch mapping on a large index is easy until you need to change an existing field type or delete one. Such updates require a complete reindexing into a separate index created with the right mapping, so there was no easy way out for us.

The “Blackhole” cluster

We’ve called our biggest Elasticsearch cluster “Blackhole”, because that’s exactly what it is: a hot, ready to use datastore able to contain virtually any amount of data. The only difference with a real black hole is that we can get our data back at the speed of light.

When we designed Blackhole, we had to choose between 2 different models:

- A few huge machines with 4 * 12 core CPUs, 512GB of memory and 36 800GB SSD drives, each of them running multiple instances of Elasticsearch.
- A lot of smaller machines we could scale horizontally as the cluster grows.

We opted for the latter since it would make scaling much easier and didn’t require spending too much money upfront.

Blackhole runs on 75 physical machines:

- 2 http nodes, one in each data center, behind a HAProxy to load balance the queries.
- 3 master nodes located in 3 different data centers.
- 70 data nodes spread over 2 different data centers.

Each node has a quad core Xeon D-1521 CPU running at 2.40GHz and 64GB of memory. The data nodes have a RAID0 over 4*800GB SSD drives with XFS. The whole cluster runs a systemd-less Debian Jessie with a 3.14.32 vanilla kernel. The current version of the cluster has 218.75TB of storage and 4.68TB of memory, with 2.39TB allocated to the Elasticsearch heap. That’s all for the numbers.

Elasticsearch configuration

Blackhole runs Elasticsearch 1.7.5 on Java 1.8. Indexes have 12 shards and 1 replica. We ensure each data center hosts 100% of our data using Elasticsearch’s rack awareness feature. This setup allows us to lose a whole data center with neither data loss nor downtime, which we test every month.

All the filtered queries are run with _cache=false. Elasticsearch caches the filtered query results in memory, making the whole cluster explode at the first search. Running queries on 100GB shards, this is not something you want to see.
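
In Elasticsearch 1.x, the filter cache is controlled per filter rather than with a global flag, so the _cache=false mentioned above typically ends up inside the filter itself. A minimal sketch of such a query; the host, index and field names are made up for the example:

curl -XGET 'http://blackhole-http01:9200/documents/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "lang": ["en", "fr"],
          "_cache": false
        }
      }
    }
  }
}'
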
When running in production, our configuration is:

routing:
  allocation:
    node_initial_primaries_recoveries: 20
    node_concurrent_recoveries: 20
    cluster_concurrent_rebalance: 20
    disk:
      threshold_enabled: true
      watermark:
        low: 60%
        high: 78%

index:
  number_of_shards: 12
  number_of_replicas: 1
  merge:
    scheduler:
      max_thread_count: 8
      type: 'concurrent'
    policy:
      type: 'tiered'
      max_merged_segment: 100gb
      segments_per_tier: 4
      max_merge_at_once: 4
      max_merge_at_once_explicit: 4
  store:
    type: niofs
  query:
    bool:
      max_clause_count: 10000

action:
  auto_create_index: false

indices:
  recovery:
    max_bytes_per_sec: 2048mb
  fielddata:
    breaker:
      limit: 80%
    cache:
      size: 25%
      expire: 1m
  store:
    throttle:
      type: 'none'

discovery:
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: ["master01","master02","master03"]

threadpool:
  bulk:
    queue_size: 3000
    type: cached
  index:
    queue_size: 3000
    type: cached

bootstrap:
  mlockall: true

[...]



Happy birthday Dr Frankenstein

Wed, 05 Oct 2016 07:00:00 +0000


200 years ago was written what would become one of the most important fantastic and, in some ways, philosophical novels: Mary Shelley’s Frankenstein. Despite its old-fashioned, Victorian-era style, Frankenstein is still worth reading and studying in the light of today’s progress and madness in artificial intelligence (AI).

I read Frankenstein the same summer I discovered Asimov’s Robots cycle, and I can’t help but relate the 2 centuries old novel to The Naked Sun. If you had to read only one of Asimov’s Robots novels, pick up The Naked Sun. It revolves around a paradox in Asimov’s 3 Laws of Robotics where a robot ends up killing a human.

Did you ever notice how often we call Frankenstein’s creature by the name of its creator? Doing so, we both identify the monster with the one who created it and push him aside into the shadows, so we forget that mankind can engender such a monstrosity. Or maybe the name Frankenstein itself sounds like that of a monster, while Albert Einstein reminds us of the old tongue-out genius we love to show on our t-shirts, and one of the fathers of the Bomb.

All comparisons aside, Frankenstein still has a few lessons to give. As deep learning and machine learning have us pushing the boundaries of artificial intelligence, creating virtual personal assistants, support chat bots and even surgeon replacement prototypes, there’s still a chance we end up creating a monster. By making his own soulless self out of science, Dr Frankenstein wanted to become as powerful as his creator. Developing AI, but also trying to push genetic research further, we’re repeating the same process and the risks are exactly the same. One doesn’t become a creator in his own right without the danger of being destroyed by his own creature. Before we reach this point, there are many questions about pride, ethics, and the existence of the soul we need to answer.




The deadly difference between hiding the symptoms and solving the problem

Tue, 04 Oct 2016 07:00:00 +0000


There’s a common misconception between solving a problem and hiding the symptoms. The tech world is full of examples, both because it’s an easy trap to fall into and because of the move fast culture.

Your application goes down for short periods several times a day because your database server can’t cope with the load. The fastest workaround is throwing hardware at the problem. Adding new or faster servers, maybe faster disks, will stop your application from crashing. Problem solved.

At least from a management point of view.

What you did, though, was not solving the problem, only making the symptoms disappear. It’s an important step because it gives you the time to focus on the problem before it comes back. To solve the problem, you will probably have to rewrite complex queries obviously written by a bunch of drunken otters, add indexes to billion-record tables, write and deploy a caching layer, or change your whole technology stack.

Making the symptoms disappear only buys you the time to solve your problem.

That’s not so bad actually, when your management is willing to hear it. A common question I’ve heard over the years was “why should I throw money at fixing something that works and won’t bring money?” That’s another easy, deadly trap to fall into, until the problem rises again. And “I told you so” won’t solve it.




An Elasticsearch cheat sheet

Mon, 03 Oct 2016 07:00:00 +0000

I’m using Elasticsearch a lot, which brings me to run the same commands again and again to manage my clusters. Even though they’re now all automated in Ansible, I thought it would be interesting to share them here.

Mass index deletion with pattern

I often have to delete hundreds of indexes at once. Their names usually follow some patterns, which makes batch deletion easier.

for index in $(curl -XGET esmaster:9200/_cat/indices | awk '/pattern/ {print $3}'); do
  curl -XDELETE "esmaster:9200/${index}?master_timeout=120s"
done

Mass optimize, indexes with the most deleted docs first

Lucene, which powers Elasticsearch, has a specific behavior when it comes to deleting or updating documents. Instead of actually deleting or overwriting the data, it flags the document as deleted and writes a new one. The only way to get rid of a deleted document is to run an optimize on your indexes. This snippet sorts your existing indexes by the number of deleted documents before it runs the optimize.

for indice in $(curl -XGET esmaster:9200/_cat/indices | sort -rk 7 | awk '{print $3}'); do
  curl -XPOST "http://esmaster:9200/${indice}/_optimize?max_num_segments=1"
done

Restart a cluster using rack awareness

Using rack awareness allows you to split your replicated data evenly between hosts or data centers. It’s convenient to restart half of your cluster at once instead of host by host.

curl -XPUT 'host:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable": "none" }}'
for host in $(curl -XGET esmaster:9200/_cat/nodeattrs?attr | awk '/rack_id/ {print $2}'); do
  ssh $host service elasticsearch restart
done
sleep 60
curl -XPUT 'host:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable": "all" }}'

Optimize your cluster restart

There’s a simple way to accelerate your cluster restart. Once you’ve brought your masters back, run this snippet. Most of the options are self explanatory:

curl -XPUT 'http://escluster:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 20,
    "indices.recovery.concurrent_streams": 20,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 20,
    "cluster.routing.allocation.node_concurrent_recoveries": 20,
    "indices.recovery.max_bytes_per_sec": "2048mb",
    "cluster.routing.allocation.disk.threshold_enabled" : true,
    "cluster.routing.allocation.disk.watermark.low" : "90%",
    "cluster.routing.allocation.disk.watermark.high" : "98%",
    "cluster.routing.allocation.enable": "primary"
  }
}'

Then, once your cluster is back to yellow, run that one:

curl -XPUT 'http://escluster:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable": "all"
  }
}'

Get useful information about your cluster

Nodes information

This snippet gets the most useful information from your Elasticsearch nodes:

- hostname
- role (master, data, nothing)
- free disk space
- heap used
- ram used
- file descriptors used
- load

curl -XGET "https://escluster/_cat/nodes?v&h=host,r,d,hc,rc,fdc,l"

host          r d       hc     rc     fdc   l
192.168.1.139 d 1tb     9.4gb  58.2gb 20752 0.20
192.168.1.203 d 988.4gb 16.2gb 59.3gb 21004 0.12
192.168.1.146 d 1tb     14.1gb 59.2gb 20952 0.18
192.168.1.169 d 1tb     14.3gb 58.8gb 20796 0.10
192.168.1.180 d 1tb     16.1gb 60.5gb 21140 0.17
192.168.1.188 d 1tb     9.5gb  59.4gb 20928 0.19

Then, it’s easy to sort the output to get interesting information.

Sort by free disk space:

curl -XGET "https://escluster/_cat/nodes?h=host,r,d,hc,rc,fdc,l" | sort -hrk 3

Sort by heap occupancy:

curl -XGET "https://escluster/_cat/nodes?h=host,r,d,hc,rc,fdc,l" | sort -hrk 4

And so on.

Indices information

This snippet gets most information you nee[...]



Don't fire the underperformers (yet)

Sat, 01 Oct 2016 14:56:07 +0000


Sooner or later, every company ends up hiring underperformers. Often unnoticed in large corporations, they can be fatal to small businesses where every single person counts.

The main problem with underperformers is that they sometimes take months to detect. No one can join an existing company and go full steam on day one. You need to learn the company’s culture, the tools, how to work with your colleagues, and the job you’ve been hired for. In a tech department, it takes up to 3 months before you realise you’ve hired an underperformer; in a sales team, sometimes more, depending on how long your sales cycle lasts.


The obvious solution is to fire them. But not yet. You might need them somewhere else in the company.

I’ve been an underperformer myself. When I graduated, I decided I wanted to be a Web project manager. I didn’t know what the job was actually about, but I bent to the social pressure telling me that being a tech was like being a factory worker in the 19th century. I learnt the job the hard way and hated it. I had to work twice as much as my colleagues to do a poor job, and switch companies often to hide my continuous failure.

Until the day I got fired after 5 months of struggling in a small Parisian Web agency. I had good knowledge of Web development best practices, so the startup we were sharing our office with hired me as a quality manager. Once again, I was not made for that job, but I had to pay the bills, so I went for it. I lived a nightmare for two and a half years. Two and a half years knowing I was on the verge of being fired if they ever needed a fuse to blow. Until that day.

On August the 1st 2010, I came back from my yearly vacation with a resignation letter in my pocket when my manager offered to let me take over the infrastructure. The sysadmin had left the day before, and I had done sysadmin work in the past. I accepted, and we agreed on a 3 month probation period. I took the job and started to overperform way beyond my boss’s expectations. At the end of the probation, I got a 20% raise and kept overperforming, becoming a key asset for the company.

If there’s one question to ask before firing underperformers, it’s “how can they be useful somewhere else in the company?” After a few months, they know the company’s job, culture and products, and if they went through the whole hiring process, they might be useful somewhere. An average salesperson can overperform as an account manager because he has everything but the teeth his job requires. A poor project manager can be a great product manager because he lacks the client facing skills his job requires but is great at managing and designing products. And a crappy quality assurance manager can become an overperforming sysadmin when given the right challenge.

And in the end, if there’s nothing else left, only then should they be fired.




The obvious source of failure no one talks about

Fri, 20 May 2016 21:14:44 +0000


Every time I see a startup dying, I can’t help but try to understand what went wrong. Unfortunately, you can’t turn back time or get a time lapse of several years of history.

Unlike success, a startup failure might be hard to understand. Obvious reasons exist: lack of product / market fit, co-founder issues, poor execution, lack of culture, failure to build a great team, but they don’t explain everything.

Before 2011, Twitter was well known for its “fail whale” and inability to scale. Early Google was relying on (poor) competitors’ data before it had a big enough index. And Docker was once a Platform as a Service (PaaS) before they pivoted to focus on Docker itself. All of this before they became the success stories we hear about all over the Internet.

Semi-failure is even harder to analyse. How can you know what made a promising company barely survive after 5 or 10 years without diving into an insane amount of data? Besides the company internals, such as product evolution, business model pivots, exec turnover (a sign something’s fishy, not necessarily the reason) and poor selling strategies, you need to analyse the evolution of the market, their clients and indeed their competitors.

There’s something else no one talks about when analysing failure. Something so obvious it sounds ridiculous until you face it.

Yesterday I wanted to see a friend whose startup has only 1 or 2 months of cash left. Yesterday was also an optional bank holiday in France, but I didn’t expect their office door to be closed.

I was shocked. If my company was about to die, I would spend the 30 or 45 remaining days trying to save it by all means. I’d run everywhere trying to find some more cash. I’d have the product team release that deal breaking, differentiating feature. I’d try to find a quick win pivot. I’d even try to sell to my competitors in order to save some jobs. But I’d certainly not take a bank holiday.

Then I remembered every time I went there during the past 2 years, sometimes dropping by for a coffee, sometimes using their Internet connection when I was working remotely and did not want to stay at home. There was no one before 10:00AM, and there was no one after 7:00PM. There were always people playing foosball, playing the PlayStation or watching a game on TV. Not like they were thousands of people, more like a dozen. I remember late lunches and early parties.

Despite a fair product and a promising business plan, they missed something critical. “Work hard, play harder” reads from left to right, not the other way around. In the startup world, the obvious source of failure no one talks about is the lack of work.