What can you learn from an isolated dataset in trading?

Chrissie Cormack Wood

The short answer to the question above is: not very much. To design effective, differentiated trading strategies, you need to be able to take advantage of the broad range of datasets and data vendors available to you.

This article discusses taking a systematic approach to joining your different datasets together—a process known as Entity Mapping, Entity Resolution, or Company Name Matching—and why that’s imperative to success.

I’ll start with a simple toy example to give you a feel for what Entity Mapping is and what some of the different mapping problems can be. Let's start with dataset one. This can be any time series whatsoever, but let's say that it's quarterly revenue data and that we have it for two companies.

Now let's introduce dataset two. Say you've bought some credit card transaction data and modelled it to predict quarterly revenues, and now you want to test your predicted revenues against the ground truth. You want to systematically combine dataset one and dataset two for each company, and if you do that correctly, you'll have a result that looks like this.

However, mapping errors are common. Links are often mismatched, which introduces spurious relationships into your augmented dataset and leaves you unsure which company is which.
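As a minimal sketch of the toy example, the snippet below joins two small datasets on a shared key and reports which rows failed to match. The company keys and revenue figures are invented for illustration, and a plain dictionary join stands in for whatever join tooling you actually use.

```python
# Hypothetical toy data: the keys and values are invented for illustration.
reported_revenue = {"ACME": 120.0, "GLOBEX": 95.0}    # dataset one: ground truth
predicted_revenue = {"ACME": 118.5, "INITECH": 40.0}  # dataset two: model output


def join_on_key(left, right):
    """Inner-join two dicts on their keys and report what failed to match."""
    matched = {k: (left[k], right[k]) for k in left.keys() & right.keys()}
    unmatched_left = sorted(left.keys() - right.keys())
    unmatched_right = sorted(right.keys() - left.keys())
    return matched, unmatched_left, unmatched_right


matched, only_reported, only_predicted = join_on_key(reported_revenue, predicted_revenue)
print("matched:", matched)
print("unmatched in dataset one:", only_reported)
print("unmatched in dataset two:", only_predicted)
```

Tracking the unmatched rows on both sides makes data loss visible. A mismatched link is harder to spot, because it still produces a row that looks perfectly valid.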

You might ask why you should care about this. Well, if every single data point that you use in your business comes straight out of a Bloomberg Terminal, then you don't need to worry. Bloomberg provides really high-quality datasets and does a great job of accurately mapping point-in-time identifiers like ISINs and CUSIPs.

However, just one source isn't enough. You can't learn that much from a single dataset in isolation.

But layering lots of different datasets from multiple vendors brings about its own problems. Mapping problems compound very quickly and they're tricky to unpick. Say you've just bought yourself some geolocation data and some venue data and you'd like to try and predict sales. Well, that's going to involve three different mapping processes.

Firstly, you're going to have to join the geolocation data to the venue data. Then you'll join on the sales data, because that's your ground truth. And then you'll link in your market data, because that contains your returns. With each of these joins, you lose data to unmatched relationships and you introduce spurious relationships, as we saw in the toy example. It's not uncommon to start with four high-quality, 95% accurate datasets and produce an augmented dataset whose accuracy is far closer to 80%.
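One simplified way to read that 95%-to-80% figure is as compounding. The sketch below assumes that a combined row is correct only when the contribution from every dataset is correct, and that errors are independent. Both are modelling assumptions made for illustration, not measured properties of any real vendor data.

```python
import math


def compounded_accuracy(accuracies):
    """Accuracy of a combined row, assuming independent errors across datasets."""
    return math.prod(accuracies)


four_datasets = [0.95, 0.95, 0.95, 0.95]
print(f"{compounded_accuracy(four_datasets):.1%}")  # about 81.5%
```

Under these assumptions, each extra dataset multiplies in another factor below one, which is why mapping quality matters more as you layer in more sources.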

What are the underlying issues causing these problems?

Datasets with symbols (ISINs, tickers, etc.) often originate from vendors who don't respect point-in-time identifiers: they forward-adjust everything to the present day, and that causes problems. Additionally, different vendors interpret corporate actions differently because they follow different methodologies. There may also be a lack of common symbols between the datasets.

Datasets that don’t have symbols give rise to a different set of issues. One reason the data might not contain symbols is that you're starting with just a human-entered string, so you inherit human interpretation: typos, and abbreviations like IBM for International Business Machines. Attacking this at scale requires a systematic matching process.
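A first step in many systematic matching processes is to normalise the raw strings before comparing them. The sketch below is one minimal way to do that. The suffix list and the abbreviation table are hypothetical and deliberately tiny, and a production process would need far richer rules.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "plc", "llc",
                  "company", "companies", "corporation", "incorporated"}
KNOWN_ABBREVIATIONS = {"ibm": "international business machines"}


def normalise_name(raw):
    """Lower-case, strip punctuation and legal suffixes, expand known abbreviations."""
    text = raw.lower().replace(".", "").replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # any other punctuation becomes a space
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    name = " ".join(tokens)
    return KNOWN_ABBREVIATIONS.get(name, name)


print(normalise_name("I.B.M. Corp"))
print(normalise_name("International Business Machines Corp."))
```

Normalisation narrows the gap between different spellings of the same company, but it cannot resolve names that are genuinely ambiguous without further context.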

A good example of this, from my time at the global investment management group Winton, is Travelers. Travelers is over 150 years old and has been through many mergers, acquisitions, spinoffs and company renames.

- Travelers originated in 1864

- In 1998 it went through a merger with Citicorp, and produced Citigroup

- In 2002 Travelers Property & Casualty Corp was spun-off

- They then merged with St. Paul, forming St. Paul Travelers Companies Inc

- It was then renamed to the Travelers Companies Inc that you see today in The Dow 30

An important point to note here is that a good reference dataset should contain precise point-in-time attributes, precise dates, precise tickers, and precise CUSIPs. But when we looked through the relevant articles and unstructured data sources that contained the events like mergers and acquisitions, we had something that looked like this:

So, we ran a named-entity recognition algorithm on top of the source to pull out the company names, and we linked a company name to a ticker.

We then ran Travelers through a hybrid string-matching algorithm against Hivemind's internal reference database and quickly saw that string matching alone is not enough.

As you can see, the top score was Travelzoo Inc, which is definitely not right. It is therefore not advisable to run a string-matching algorithm and simply accept the top-ranked candidate for every match against your reference data. You'll end up with terrible results. It's also not clear from the remaining candidates which one is the right Travelers, because without dates you can't cut through 150 years of mergers, spinoffs and renames. You just don't have enough context to deal with that.
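To get a feel for how a pure string scorer behaves, the sketch below ranks a handful of candidate names against the query using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration, not the hybrid algorithm described above, and the candidate list is a small illustrative sample rather than a real reference database. Scores will differ from one scorer to another.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative candidates only; a real reference database is far larger.
candidates = [
    "Travelzoo Inc",
    "Travelers Companies Inc",
    "St. Paul Travelers Companies Inc",
    "Travelers Property Casualty Corp",
]

query = "Travelers"
ranked = sorted(((similarity(query, name), name) for name in candidates), reverse=True)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

Whatever the exact scores, a ranking like this is built from characters alone. Nothing in it says which entity existed, or under which name, on the date you care about.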

So, what’s the solution? The solution is threefold:

1. Pre-filter on date before you apply the algorithm.

2. Use a precise point-in-time reference dataset to produce point-in-time candidates—this will save you a lot on compute cost.

3. Use humans to select the correct match and finish the task.
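The three steps above can be sketched as a small pipeline: filter a point-in-time reference set by date, score only the surviving candidates, and hand a shortlist to a human reviewer. Everything in the example is an assumption made for illustration: the record layout, the company names and validity dates are hypothetical, and `difflib.SequenceMatcher` again stands in for a real matching algorithm.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ReferenceRecord:
    name: str
    valid_from: date
    valid_to: Optional[date]  # None means the name is still in use


# Entirely hypothetical reference data, invented for this sketch.
REFERENCE = [
    ReferenceRecord("Harbor Insurance Group", date(1990, 1, 1), date(2003, 12, 31)),
    ReferenceRecord("Harbor & Main Insurance Inc", date(2004, 1, 1), date(2006, 12, 31)),
    ReferenceRecord("Harbor Companies Inc", date(2007, 1, 1), None),
    ReferenceRecord("Harborview Media Inc", date(1999, 1, 1), None),
]


def candidates_as_of(records: List[ReferenceRecord], as_of: date) -> List[ReferenceRecord]:
    """Steps 1 and 2: keep only the names that were valid on the given date."""
    return [r for r in records
            if r.valid_from <= as_of and (r.valid_to is None or as_of <= r.valid_to)]


def rank(query: str, records: List[ReferenceRecord]) -> List[Tuple[ReferenceRecord, float]]:
    """Score the surviving candidates against the query, best first."""
    scored = [(r, SequenceMatcher(None, query.lower(), r.name.lower()).ratio())
              for r in records]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def shortlist_for_review(query: str, as_of: date,
                         records: List[ReferenceRecord],
                         top_n: int = 3) -> List[Tuple[ReferenceRecord, float]]:
    """Step 3: return a short list for a human to choose from; no automatic pick."""
    return rank(query, candidates_as_of(records, as_of))[:top_n]


shortlist = shortlist_for_review("Harbor", date(2005, 6, 30), REFERENCE)
for record, score in shortlist:
    print(f"{score:.3f}  {record.name}")
```

Filtering before scoring means the scorer only sees names that existed on the date in question, which keeps the list short for the reviewer and reduces the number of comparisons to compute.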

On point 3, I’m sure it took you less than a second to realise that the correct mapping, even though it wasn't the top rank, was ‘Travelers Company Inc’. The reason you were able to do that so quickly is because humans are intuitively brilliant at applying context and heuristics (but we’re terrible at doing what computers do well, such as filtering long lists at speed).

So, it’s the combination of man and machine that provides the best results. This collective intelligence approach allows automated computational processes to do the heavy lifting, and human cognitive processes to provide the fine-tuning.

And this is exactly what Hivemind does. We orchestrate workflows that combine the automated and cognitive processes of man and machine—at scale. Our particular specialisation is in diverse unstructured datasets and we solve problems for our clients in roughly five different ways:

1. Mapping datasets accurately, as I've briefly discussed in this article

2. Cleaning datasets systematically

3. Building structured datasets from unstructured sources

4. Monitoring your internal data assets, like CRM databases

5. Creating training datasets for Machine Learning

If you have mapping challenges, we'd be happy to talk to you. Or, find out more about Hivemind.