The dangers of using the wrong tool for the job

Christian Gilson, Head of Data Science

Last week it was reported that an “IT glitch” led to 15,841 positive COVID tests being lost within the UK’s test and trace system.

The culprit was a data preparation process that was meant to populate Excel spreadsheets automatically. As more testing data flowed in, the spreadsheets hit a row limit and the data pipeline seems to have failed silently: the mistake was not identified for around a week. Over the course of that week it spat out official government statistics significantly underestimating the number of positive COVID tests and left thousands of people unaware that they had been in contact with others who had tested positive.
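The details of the real pipeline are not public, so the following is only a minimal sketch of the general idea: a write step that refuses to go past a row limit, plus a reconciliation check that compares rows received with rows stored, so that a shortfall raises an error instead of passing silently. The names `append_rows`, `reconcile` and `PipelineError` are hypothetical, and a plain list stands in for a worksheet; the 65,536-row figure is the worksheet limit of the legacy XLS format.

```python
XLS_MAX_ROWS = 65_536  # worksheet row limit of the legacy .xls format


class PipelineError(Exception):
    """Raised when the pipeline would otherwise lose rows (hypothetical name)."""


def append_rows(sheet, new_rows, max_rows=XLS_MAX_ROWS):
    """Append new_rows to sheet, refusing to write past max_rows.

    `sheet` is a plain list standing in for a worksheet (an assumption
    made for illustration, not how any real pipeline stores its data).
    """
    if len(sheet) + len(new_rows) > max_rows:
        raise PipelineError(
            f"{len(new_rows)} rows would exceed the {max_rows}-row limit "
            f"(sheet already holds {len(sheet)})"
        )
    sheet.extend(new_rows)


def reconcile(rows_received, rows_stored):
    """Fail loudly if the stored row count differs from the received count."""
    if rows_received != rows_stored:
        raise PipelineError(
            f"received {rows_received} rows but stored {rows_stored}"
        )


# Usage: a small limit makes the behaviour easy to see.
sheet = []
append_rows(sheet, ["case-1", "case-2", "case-3"], max_rows=5)
reconcile(rows_received=3, rows_stored=len(sheet))
```

Neither check makes a spreadsheet the right storage layer. The point is only that a pipeline should stop and alert someone the moment the numbers in and the numbers out disagree.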

Excel should not be used in automated data preparation pipelines, and certainly not when dealing with that volume of data. The incident is about as high-profile an example as you can find of the dangers of using the wrong tool for the job, and it reminds us of the need for crisp, long-term thinking when designing data pipelines.

But the reality is that this stuff happens all the time, especially when people are working under time pressure. So although there was a lot of worthy outrage and shock on social media from data scientists and data engineers about how insane it is to have used Excel in this way, I suspect they weren’t really that surprised, and deep down they were probably thinking “there but for the grace of God go I”.

At Hivemind we see it every week when talking to companies about the way they handle their data. Data pipelines are cobbled together with whatever tools are lying around – Excel, Jira, Slack, email alerts, or legacy systems. Everyone tacitly admits these aren’t the right tools for the job, but it’s tempting to use them because you can throw together a reasonable proof of concept in a very short time and earn some internal buy-in. It seems like the pragmatic middle ground in the age-old build/buy dilemma: a way to avoid either asking engineering for time to build a fully functional solution or fighting a budget battle to get a new purchase signed off.

The problem is that it’s very easy to wake up one day and find your quickly cobbled-together prototype running as a production system. And, in case it wasn’t obvious enough, that is not a position you want to be in. From day one it lands you with a massive amount of technical debt and a process that clearly isn’t scalable, robust or thoroughly tested. In this situation significant data errors are inevitable, and your team firefights every day to keep the system running in the face of the uncertainty and inconsistency of real-world data.

So why do data pipelines end up being built like that? Although I don’t know the details of what happened with the PHE test and trace data pipeline, I don’t believe they were using Excel (and the obsolete XLS format at that) because the developers on the project felt it was the right tool for the job. They clearly made a mistake with the implementation, but describing the problem as an “IT glitch” not only downplays the potential impact of the mistake but also misdiagnoses its fundamental cause.

In my experience, major errors in pipeline design happen not because someone on the technical team made an isolated mistake, but because of a mixture of institutional culture, pressure, bad project management and poor communication. Those on the implementation side don’t want to support legacy systems or inappropriate technology, but they are often forced into doing so by unrealistic project deadlines or budgetary constraints. Decision makers don’t want flaky data pipelines, but nor are they necessarily aware of the costs and risks associated with a cobbled-together solution. Often the crux of the problem is a failure to think and communicate clearly about the build/buy dilemma, so the cobbled-together solution persists as the only option being considered.

Data is a company’s lifeblood; it’s oil for the machine or food for the corporate body, whichever metaphor you prefer. People have long recognised the “garbage in, garbage out” wisdom: accurate outcomes rely on data quality. To be of high quality, the data needs a fit-for-purpose pipeline that delivers, normalises, standardises, structures and cleans it. If you allow a cobbled-together prototype to be an operational part of that process, whether by design or through a failure of project management, it suggests you simply aren’t valuing your data enough.
