Every business that cares about machine learning needs its Sandor Straus. Cleaning and enriching data to make it more useful is the secret ingredient to every successful AI strategy.
Sandor Straus was Renaissance Technologies' data guru, responsible for cleaning, storing and enriching the data used in machine learning models. Straus was obsessive about two things. First, he took great pains over data cleaning.
No one had told Straus to worry so much about the prices, but he had transformed into a data purist, foraging and cleaning data the rest of the world cared little about.
Second, at a time when investors, including Renaissance, relied only on stock opening and closing data, Straus dug into more granular data: the tick data featuring intraday volume and pricing information for various futures. Later on, Straus began enriching the data. For instance, to deal with gaps in the historical data, he used computer models to make educated guesses as to what was missing.
Straus's efforts paid off. The early models searched for repeating price patterns among securities across a large swath of time. If the data had not been clean, the algorithms would either have missed authentic patterns or picked up spurious ones. Later, when computing power became available, the granular price data generated thousands of statistically significant observations that helped reveal previously undetected pricing patterns.
The story of Renaissance Technologies shows us the importance of data management in the research and development of algorithmic trading systems.
Overall, there are three types of market data events available for analysis:
- Tick
- Quote
- FIX (Financial Information eXchange) message
A FIX message includes detailed information about a market event: time, price, volume, market participant, venue, etc.
Tick data is also called Trade or Time-and-Sales data and contains information about the orders that have been executed on the exchange. The basic pieces of information of interest that a transaction tick contains are:
- Order timestamp
- Trade price
- Trade volume
- Aggressor flag (buy or sell). The aggressor flag indicates who initiated a trade: the buyer or the seller. Some exchanges provide aggressor-side data only if you buy the data directly from the exchange (for example, CME); others provide it in their real-time feed (various crypto exchanges such as OKEx, BitMEX, Binance and Kraken).
There are several issues which may be present in tick data:
- Zero volumes. Most futures data from before 2002/2006 may have zero volumes because no electronic trading existed at that time. Data providers were able to record prices from the trading pit, but not volumes.
- Zero or negative prices. An outage in a data provider's IT infrastructure may corrupt the data, leading to zero or negative prices on assets whose price cannot be negative.
- Sudden price spikes or drops. For example, the tick price of a security trading in the 1,500-2,000 USD range may drop sharply to 5 USD, which is clearly not realistic. These issues are particularly important to clean and difficult to detect. In this post, we will discuss how to detect and clean sudden tick price spikes.
Quote data is also called BBO (Best Bid and Offer) or Top of the Book data. Quote data consists of several fields:
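The first two issues in the list above can be screened with simple per-tick checks. The following is a minimal sketch, assuming each tick is a record with `timestamp`, `price` and `volume` fields; the `Tick` type, the function name and the `allow_negative_prices` switch are hypothetical and only serve the illustration.

```python
from typing import List, NamedTuple


class Tick(NamedTuple):
    timestamp: float  # assumed representation: seconds since epoch
    price: float
    volume: float


def basic_issue_flags(ticks: List[Tick], allow_negative_prices: bool = False) -> List[str]:
    """Return one label per tick: 'ok', 'bad_price' or 'zero_volume'."""
    flags = []
    for t in ticks:
        # A zero price is always suspect; a negative one only when the
        # asset cannot trade below zero.
        if t.price == 0 or (t.price < 0 and not allow_negative_prices):
            flags.append("bad_price")
        elif t.volume == 0:
            flags.append("zero_volume")
        else:
            flags.append("ok")
    return flags


ticks = [
    Tick(0.0, 1712.5, 3),
    Tick(1.0, 0.0, 2),      # zero price
    Tick(2.0, -1712.5, 1),  # negative price
    Tick(3.0, 1713.0, 0),   # zero volume
]
print(basic_issue_flags(ticks))
```

Sudden spikes and drops cannot be caught by checks like these, because the faulty price is still a positive number; they need a comparison against neighboring ticks.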
- Quote timestamp
- Bid Price - the highest price that a buyer (i.e., bidder) is willing to pay.
- Bid Size - current number of units the buyer is willing to buy.
- Ask Price - the lowest price that a seller is willing to accept.
- Ask Size - current number of units the seller is willing to sell.
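As a rough illustration of the quote fields listed above, the record below stores one top-of-book update and derives the bid/ask spread from it. The class name, field names and timestamp representation are assumptions made for this sketch, not a format used by any particular data provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    timestamp: float  # assumed representation: seconds since epoch
    bid_price: float
    bid_size: float
    ask_price: float
    ask_size: float

    @property
    def spread(self) -> float:
        """Bid/ask spread: lowest ask minus highest bid."""
        return self.ask_price - self.bid_price

    @property
    def mid(self) -> float:
        """Mid price between the best bid and the best ask."""
        return (self.ask_price + self.bid_price) / 2.0


q = Quote(timestamp=0.0, bid_price=191.5, bid_size=300, ask_price=192.0, ask_size=200)
print(q.spread, q.mid)
```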
The importance of clean tick data
Consider a high-frequency strategy trading XYZ stock that uses a short simple moving average (SMA) to enter a position with tight take-profit and stop-loss levels. The XYZ stock price ranges between 150 and 200 USD, and the average position holding time is 10-15 seconds. Let's see what happens if the strategy is backtested on tick data with a sudden price drop:
Here we face two problems:
1. A sudden price spike or drop breaks the SMA calculation; as a result, the strategy generates wrong signals.
2. Suppose the strategy generated a short-sell signal on the tick with a price of 5 USD, using half of the available capital. Two ticks later the stop-loss was triggered because the price "reverted" back to 192 USD. As a result, the backtest would show a -47.3% drawdown on the signal. The cause of the dramatic drop in the equity curve is not a wrong strategy signal but a faulty tick, which would not be present in real-time trading.
The example shows how poor data quality may lead us to reject a potentially profitable strategy. That is why clean tick data is a vital part of research infrastructure, especially for high-frequency strategies.
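To make the first problem concrete, the sketch below computes a short SMA over a toy price series with and without a single faulty 5 USD tick. The series and the window length of 3 are made up for illustration; the point is that one bad tick contaminates as many SMA values as the window is long.

```python
def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


clean = [192.0, 192.0, 192.0, 192.0, 192.0, 192.0]
dirty = [192.0, 192.0, 5.0, 192.0, 192.0, 192.0]  # one faulty tick at index 2

clean_sma = sma(clean, 3)
dirty_sma = sma(dirty, 3)

# Indices where the faulty tick changed the moving average.
distorted = [i for i, (a, b) in enumerate(zip(clean_sma, dirty_sma)) if a != b]
print(distorted)
print(dirty_sma)
```

Any signal that compares the price with the SMA would be affected on every one of those distorted values, not only on the faulty tick itself.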
Cleaning tick data
Several algorithms have been proposed in the literature for removing erroneous observations; cf. Dacorogna et al. (2001), Zhou (1996). In this post we discuss the algorithm from the paper "Financial Econometric Analysis at Ultra-High Frequency: Data Handling Concerns" by Christian T. Brownlees and Giampiero M. Gallo.
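Let $\{p_i\}_{i=1}^{N}$ be an ordered tick-by-tick price series. The procedure proposed in the paper keeps observation $i$ if the following condition holds and removes it as an outlier otherwise:

$$|p_i - \bar{p}_i(k)| < 3\, s_i(k) + \gamma$$

where $\bar{p}_i(k)$ and $s_i(k)$ denote respectively the 10% trimmed sample mean and sample standard deviation of a neighborhood of $k$ observations around $i$, and $\gamma$ is a granularity parameter.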
The neighborhood of observations is always chosen so that a given observation is compared with observations belonging to the same trading day. That is, the neighborhood of the first observation of the day is the first $k$ ticks of the day, the neighborhood of the last observation of the day is the last $k$ ticks of the day, the neighborhood of a generic transaction in the middle of the day is made up of approximately the $k/2$ preceding ticks and the $k/2$ following ones, and so on. The idea behind the algorithm is to assess the validity of an observation on the basis of its relative distance from a neighborhood of the closest valid observations.
The role of the granularity parameter $\gamma$ is particularly important. Ultra-high frequency series often contain sequences of equal prices, which would lead to a zero variance, so it is useful to introduce a positive lower bound on the price variations that are always considered admissible.
The neighborhood size $k$ should be chosen on the basis of the level of trading intensity. If trading is not very active, $k$ should be "reasonably small", so that the window of observations does not contain prices that are too distant. On the other hand, if trading is very active, $k$ should be "reasonably large", so that the window contains enough observations to obtain precise estimates of the local price characteristics. The value of $\gamma$ should be a multiple of the minimum price variation allowed for the specific asset.
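The rule and its parameters can be sketched in plain Python as follows. This is an illustrative reading of the procedure, not the authors' reference implementation. It assumes that the input holds the prices of a single trading day, that the neighborhood includes the observation itself, and that "10% trimmed" means dropping 10% of the window from each tail; the function names are hypothetical.

```python
import math


def neighborhood(n, i, k):
    """Index range of about k ticks around i, clipped to one day of n ticks."""
    k = min(k, n)
    start = max(0, min(i - k // 2, n - k))
    return start, start + k


def trimmed_mean_std(values, trim=0.10):
    """Mean and sample std after dropping a `trim` fraction from each tail."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    mean = sum(kept) / len(kept)
    if len(kept) < 2:
        return mean, 0.0
    var = sum((x - mean) ** 2 for x in kept) / (len(kept) - 1)
    return mean, math.sqrt(var)


def brownlees_gallo_outliers(prices, k, gamma, trim=0.10):
    """Indices i for which |p_i - mean_i(k)| < 3 * s_i(k) + gamma fails."""
    n = len(prices)
    outliers = []
    for i, p in enumerate(prices):
        lo, hi = neighborhood(n, i, k)
        mean, std = trimmed_mean_std(prices[lo:hi], trim)
        if not abs(p - mean) < 3 * std + gamma:
            outliers.append(i)
    return outliers


# Toy day: prices alternate between 192.00 and 192.25, with one faulty tick.
prices = [192.0, 192.25] * 6
prices[5] = 5.0
print(brownlees_gallo_outliers(prices, k=10, gamma=0.25))
```

In this toy series the trimming is what makes the detection work: with `trim=0.0` the faulty tick inflates the standard deviation of its own window so much that it passes the test. The value `gamma=0.25` stands for one minimum price increment of a hypothetical instrument.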
Mineo and Romito (2007) propose a slight modification of the method of Brownlees and Gallo, based on the following rule:

$$|p_i - \bar{p}_{-i}(k)| < 3\, s_{-i}(k) + \gamma$$

where $\bar{p}_{-i}(k)$ and $s_{-i}(k)$ denote respectively the mean and the standard deviation of a neighborhood of $k$ observations around $i$, computed without the $i$-th observation.
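A minimal sketch of this leave-one-out variant is shown below, again assuming a single trading day of prices. The function name is hypothetical, and the way the window is centered and clipped at the start and end of the day is an assumption of this sketch.

```python
import statistics


def mineo_romito_outliers(prices, k, gamma):
    """Indices failing |p_i - mean_-i(k)| < 3 * s_-i(k) + gamma."""
    n = len(prices)
    width = min(k + 1, n)  # k neighbors plus the observation itself
    outliers = []
    for i, p in enumerate(prices):
        start = max(0, min(i - k // 2, n - width))
        # Neighborhood around i, with the i-th observation left out.
        others = [prices[j] for j in range(start, start + width) if j != i]
        if not others:
            continue
        mean = statistics.mean(others)
        std = statistics.stdev(others) if len(others) > 1 else 0.0
        if not abs(p - mean) < 3 * std + gamma:
            outliers.append(i)
    return outliers


# Toy day: prices alternate between 192.00 and 192.25, with one faulty tick.
prices = [192.0, 192.25] * 6
prices[5] = 5.0
print(mineo_romito_outliers(prices, k=10, gamma=0.25))
```

Because the tested observation is excluded from its own statistics, an isolated faulty tick cannot inflate the standard deviation it is compared against.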
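One can apply the algorithm described above to clean quote data. However, instead of the tick price, the bid/ask spread is used to compute the mean and standard deviation. As a result, the formula for cleaning quote data becomes:

$$|q_i - \bar{q}_i(k)| < 3\, s_i(k) + \gamma$$

where $q_i$ denotes the bid/ask spread of the $i$-th quote (ask price minus bid price), and $\bar{q}_i(k)$ and $s_i(k)$ are computed from the spreads in the neighborhood of $k$ observations around $i$.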
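The same idea applied to quotes can be sketched as follows. The text does not say which of the two variants above is meant for quotes, so this sketch assumes the leave-one-out statistics; the function name and the `(bid, ask)` tuple layout are hypothetical.

```python
import statistics


def spread_outliers(quotes, k, gamma):
    """quotes: (bid, ask) pairs of one trading day -> indices of suspect quotes."""
    spreads = [ask - bid for bid, ask in quotes]
    n = len(spreads)
    width = min(k + 1, n)
    suspects = []
    for i, s in enumerate(spreads):
        start = max(0, min(i - k // 2, n - width))
        # Spreads around i, with the i-th spread left out.
        others = [spreads[j] for j in range(start, start + width) if j != i]
        if not others:
            continue
        mean = statistics.mean(others)
        std = statistics.stdev(others) if len(others) > 1 else 0.0
        if not abs(s - mean) < 3 * std + gamma:
            suspects.append(i)
    return suspects


# Toy quotes: spread alternates between 0.25 and 0.50, with one ask spike.
quotes = [(191.75, 192.0), (191.75, 192.25)] * 6
quotes[5] = (191.75, 250.0)
print(spread_outliers(quotes, k=10, gamma=0.25))
```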
Corn futures example
Let's see how the algorithm works on corn futures (C*). First, there is a relatively fast way to check whether price spikes or drops are present in tick data: compress the data into daily bars and plot the open, high, low and close prices on the same chart. If dramatic price changes are present in the dataset, the high and low prices will reflect them. Let's see how it works on the CZ2016 futures contract:
As we can see, there is strange behaviour during the 2015.08 - 2016.03 period, which is explained not by market dynamics but by poor data quality. Let's now see how the algorithm captures faulty ticks. In our example, $k$ equals 500, meaning that the latest 500 ticks are used to estimate the rolling mean and standard deviation, and $\gamma$ equals 5.
As you can see, the algorithm successfully captured tick outliers for both sudden spikes and drops. The clean tick data can now be used in both strategy research and backtesting.
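The daily-bar screening used at the start of this example can be sketched as follows. Instead of plotting, the sketch flags days whose high-low range is large relative to the close. The 20% threshold, the function names and the toy prices are assumptions chosen for illustration, not values taken from the CZ2016 data.

```python
from datetime import datetime


def daily_ohlc(ticks):
    """ticks: (datetime, price) pairs sorted by time -> {date: [open, high, low, close]}."""
    bars = {}
    for ts, price in ticks:
        day = ts.date()
        if day not in bars:
            bars[day] = [price, price, price, price]
        else:
            bar = bars[day]
            bar[1] = max(bar[1], price)  # high
            bar[2] = min(bar[2], price)  # low
            bar[3] = price               # close is the latest price seen
    return bars


def suspicious_days(bars, max_range_ratio=0.2):
    """Days whose (high - low) / close exceeds a hypothetical threshold."""
    return [
        day
        for day, (_open, high, low, close) in bars.items()
        if close > 0 and (high - low) / close > max_range_ratio
    ]


ticks = [
    (datetime(2015, 8, 3, 9, 30), 380.0),
    (datetime(2015, 8, 3, 11, 0), 381.25),
    (datetime(2015, 8, 3, 13, 15), 379.5),
    (datetime(2015, 8, 4, 9, 30), 380.5),
    (datetime(2015, 8, 4, 10, 0), 38.0),   # faulty tick
    (datetime(2015, 8, 4, 13, 15), 381.0),
]
bars = daily_ohlc(ticks)
print(suspicious_days(bars))
```

A screen like this only says that a day deserves a closer look; the tick-level filter is still needed to find and remove the individual faulty observations.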
High-end data providers use proprietary data-cleaning algorithms to deliver spike-free ticks and quotes. However, the research team still needs to make sure that a dataset is clean and ready to be used. Furthermore, most algorithmic trading hedge funds collect and store market data themselves, which makes data-cleaning routines even more important.
In this article, we have discussed a data-cleaning technique based on a rolling mean and standard deviation, and described how a data engineering team can run a fast and easy data quality screening using daily OHLC bars. We have also described how the procedure can be extended to cleaning quote data, which also suffers from bid/ask spikes.
- Brownlees C., Gallo G. (2006) Financial econometric analysis at ultra-high frequency: data handling concerns, Computational Statistics & Data Analysis, 51, 2232–2245.
- Dunis C., Gavridis M., Harris A., Leong S., Nacaskul P. (1998) An Application of genetic algorithms to high frequency trading models: a case study, in: Nonlinear Modelling of High Frequency Financial Time Series, Dunis C. & Zhou B. (Eds.), Wiley, 247-278.
- Engle R., Russell J.R. (1998) Autoregressive conditional duration: a new model for irregularly spaced transaction data, Econometrica, 66, 1127–1162.
- Mineo A.M., Romito F. (2007) A method to “clean up” ultra high-frequency data, Statistica & Applicazioni, 5, 167–186.