Every business that cares about machine learning needs its Sandor Straus. Cleaning and enriching data to make it more useful is the secret ingredient to every successful AI strategy.
Sandor Straus was Renaissance Technologies' data guru, responsible for cleaning, storing, and enriching the data used in the firm's machine learning models. Straus was obsessive about two things. First, he took painstaking care in cleaning the data.
No one had told Straus to worry so much about the prices, but he had transformed into a data purist, foraging and cleaning data the rest of the world cared little about.
Second, at a time when investors, Renaissance included, relied only on stock opening and closing prices, Straus dived into more granular data: tick data featuring intraday volume and pricing information for various futures. Later on, Straus turned to enriching the data. For instance, to deal with gaps in the historical record, he used computer models to make educated guesses about what was missing.
Straus’s efforts paid off. The early models involved searching for repeating price patterns among securities across a large swath of time. Had the data not been clean, the algorithms would either have missed authentic patterns or picked up spurious ones. Later, when more computing power became available, the granular price data yielded thousands of statistically significant observations that helped reveal previously undetected pricing patterns.
The story of Renaissance Technologies shows us the importance of data management in algorithmic trading system research and development.
Overall, there are three types of market data events available for analysis: 1. Tick 2. Quote 3. FIX (Financial Information eXchange) message. A FIX message includes the most detail about a market event: time, price, volume, market participant, venue, etc.
Tick data, also called Trade or Time-and-Sales data, contains information about the orders that have been executed on the exchange. The basic fields of interest in a transaction tick are:
- Trade timestamp
- Price
- Volume
- Aggressor flag (buy or sell). The aggressor flag indicates who initiated a trade, the buyer or the seller. Some exchanges provide aggressor-side data only if you buy the data directly from the exchange (for example, CME), while others provide it in the real-time feed (various crypto exchanges: OKEx, BitMEX, Binance, Kraken, etc.).
There are various data problems which may be present in tick data:
- Zero volumes. Much of the futures data before 2002–2006 may have zero volumes because no electronic trading existed at that time: the data provider could record price data from the trading pit, but not volumes.
- Zero or negative prices. An outage in a data provider's IT infrastructure may corrupt the data, producing negative prices for assets whose prices cannot be negative.
- Sudden price spikes or drops. For example, the tick price of a security trading in the $1,500–2,000 range may "decline" sharply to $5, which is clearly unrealistic. These data issues are especially important to clean and difficult to catch. In this blog post we discuss how to detect and clean sudden tick price spikes.
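A first-pass screen for the defects above can be sketched in a few lines of pandas. This is an illustrative example, not the algorithm discussed later in the post: the DataFrame columns and the 50%-from-rolling-median spike threshold are hypothetical choices made here for demonstration.

```python
import pandas as pd

def flag_bad_ticks(ticks: pd.DataFrame) -> pd.DataFrame:
    """Flag common tick-data defects: zero volumes, non-positive prices,
    and prices implausibly far from a rolling median."""
    out = ticks.copy()
    out["zero_volume"] = out["volume"] == 0
    out["bad_price"] = out["price"] <= 0
    # Crude screen: flag prices more than 50% away from the rolling median.
    med = out["price"].rolling(100, min_periods=1).median()
    out["suspect_spike"] = (out["price"] - med).abs() > 0.5 * med
    return out

sample = pd.DataFrame({"price": [100.0, 101.0, 0.0, 5.0, 100.0],
                       "volume": [10, 0, 5, 5, 5]})
print(flag_bad_ticks(sample)[["zero_volume", "bad_price", "suspect_spike"]])
```

A screen like this only surfaces candidates for inspection; the statistically grounded filter described below does the actual cleaning.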
Quote data is also called BBO (Best Bid and Offer) or Top of the Book data. A quote consists of several fields:
- Quote timestamp
- Bid Price - the highest price that a buyer (i.e., bidder) is willing to pay.
- Bid Size - the current number of units the buyer is willing to buy at the bid price.
- Ask Price - the lowest price a seller of a stock is willing to accept for a share of that given stock.
- Ask Size - the current number of units the seller is willing to sell at the ask price.
The importance of clean tick data
Consider the example of a high-frequency strategy trading XYZ stock. The stock trades in the $150–200 range and the average position holding time is 10–15 seconds. The strategy is backtested on tick data and uses a short SMA for signal generation. Suppose the researcher uses tick data containing a sudden price spike:
Here we face two problems: 1. A sudden price spike/drop breaks the SMA calculation, so the strategy generates signals incorrectly. 2. Suppose the strategy generated a short-sell signal on the $5 tick using half of the available capital. Two ticks later the stop-loss was triggered as the price "reverted" back to $192, so the backtest would show a -47.3% drawdown on the signal. The dramatic equity-curve drop is caused not by a genuinely wrong strategy signal but by a faulty tick that would not be present in real-time trading.
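The SMA distortion is easy to reproduce on a toy example. The prices below are hypothetical, chosen only to mimic the scenario described above:

```python
import pandas as pd

# Hypothetical tick prices: a stock trading near $192 with one faulty $5 print.
prices = pd.Series([192.0, 192.1, 191.9, 5.0, 192.0, 192.2])

sma = prices.rolling(3).mean()
# One bad tick drags the 3-tick SMA from ~192 down to ~130, so any
# "price vs. SMA" crossover rule starts firing spurious signals.
print(sma.tolist())

# Mark-to-market of the phantom short: sell at $5, stop out at $192.2.
entry, exit_ = 5.0, 192.2
loss_multiple = (exit_ - entry) / entry  # loss as a multiple of the proceeds
print(loss_multiple)
```

A short sold at the faulty $5 print and covered at $192.2 loses about 37 times the sale proceeds, which is how a single bad tick can wreck an otherwise sound backtest.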
Cleaning tick data
Several algorithms have been proposed in the literature for washing away wrong observations (cf. Dacorogna et al. (2001), Zhou (1996)). In this blog post we discuss the algorithm from the paper "Financial Econometric Analysis at Ultra-High Frequency: Data Handling Concerns" by Christian T. Brownlees and Giampiero M. Gallo.
Let $\{p_i\}_{i=1}^{N}$ be an ordered tick-by-tick price series. The procedure proposed in the paper removes outliers using the following rule:

$$\left(|p_i - \bar{p}_i(k)| < 3\, s_i(k) + \gamma\right) \;\Rightarrow\; p_i \text{ is kept, otherwise } p_i \text{ is removed,}$$

where $\bar{p}_i(k)$ and $s_i(k)$ denote respectively the 10% trimmed sample mean and sample standard deviation of a neighborhood of $k$ observations around $i$, and $\gamma$ is a granularity parameter.
The neighborhood of observations is always chosen so that a given observation is compared with observations belonging to the same trading day. That is, the neighborhood of the first observation of the day is the first $k$ ticks of the day, the neighborhood of the last observation of the day is the last $k$ ticks of the day, and the neighborhood of a generic transaction in the middle of the day consists of approximately the $k/2$ preceding ticks and the $k/2$ following ones. The idea behind the algorithm is to assess the validity of an observation on the basis of its relative distance from a neighborhood of the closest valid observations.
The role of the parameter $\gamma$ is particularly important. Ultra-high-frequency series often contain sequences of equal prices, which would lead to a zero variance; it is therefore useful to introduce a positive lower bound on the price variations that are always considered admissible.
The parameter $k$ should be chosen on the basis of the level of trading intensity. If trading is not very active, $k$ should be "reasonably small", so that the window of observations does not contain prices that are too distant in time. On the other hand, if trading is very active, $k$ should be "reasonably large", so that the window contains enough observations to obtain precise estimates of the local characteristics of the price. The choice of $\gamma$ should be a multiple of the minimum price variation allowed for the specific asset.
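A sketch of the filter in Python may make the rule concrete. This is an illustrative implementation, not the authors' code: it uses a trailing window of up to $k$ ticks as the neighborhood, and the parameter defaults below are hypothetical.

```python
import numpy as np

def trimmed_mean_std(x, trim=0.10):
    """10% trimmed mean/std: drop the lowest and highest `trim` fraction
    of the window before computing the moments."""
    x = np.sort(np.asarray(x, dtype=float))
    cut = int(len(x) * trim)
    core = x[cut:len(x) - cut] if cut > 0 else x
    return core.mean(), core.std(ddof=1)

def clean_ticks(prices, k=500, gamma=0.25, trim=0.10):
    """Keep tick i iff |p_i - trimmed_mean| < 3 * trimmed_std + gamma,
    with the neighborhood taken as the trailing window of up to k ticks."""
    prices = np.asarray(prices, dtype=float)
    keep = np.ones(len(prices), dtype=bool)
    for i in range(len(prices)):
        window = prices[max(0, i - k):i]
        if len(window) < 2:          # not enough history to judge this tick
            continue
        m, s = trimmed_mean_std(window, trim)
        keep[i] = abs(prices[i] - m) < 3.0 * s + gamma
    return keep

# Hypothetical series: ticks near 192 with one corrupted 5.0 print.
prices = [192.0 + 0.01 * (i % 5) for i in range(60)]
prices[30] = 5.0
keep = clean_ticks(prices, k=50, gamma=0.25)
print(np.flatnonzero(~keep))  # indices of removed ticks
```

Note the interplay of the two parameters: the trimming keeps a past outlier from contaminating the window statistics, while $\gamma$ keeps runs of identical prices (zero variance) from flagging every subsequent tick.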
Mineo and Romito (2007) propose a slight modification of the method of Brownlees and Gallo, based on the following rule:

$$\left(|p_i - \bar{p}_{-i}(k)| < 3\, s_{-i}(k) + \gamma\right) \;\Rightarrow\; p_i \text{ is kept, otherwise } p_i \text{ is removed,}$$

where $\bar{p}_{-i}(k)$ and $s_{-i}(k)$ denote respectively the mean and the standard deviation of a neighborhood of $k$ observations around $i$, computed without the $i$-th observation itself.
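The leave-one-out idea can be sketched as a per-tick test. Again this is an illustrative version with hypothetical defaults, not the authors' code:

```python
import numpy as np

def mineo_romito_keep(prices, i, k=20, gamma=0.25):
    """Keep tick i iff it lies within 3*std + gamma of the mean of roughly
    k surrounding ticks, computed WITHOUT observation i itself."""
    lo = max(0, i - k // 2)
    hi = min(len(prices), i + k // 2 + 1)
    nbr = np.delete(np.asarray(prices[lo:hi], dtype=float), i - lo)
    return abs(prices[i] - nbr.mean()) < 3.0 * nbr.std(ddof=1) + gamma

# Hypothetical series with a corrupted 5.0 print at index 10.
prices = [192.0 + 0.01 * (j % 5) for j in range(40)]
prices[10] = 5.0
print(mineo_romito_keep(prices, 10), mineo_romito_keep(prices, 11))
```

Excluding the observation under test means a spike cannot inflate its own neighborhood statistics and thereby validate itself.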
The procedure described above can also be applied to quote data. However, instead of the tick price, the bid/ask spread is used for the mean and standard deviation. The formula for cleaning quote data thus becomes:

$$\left(|q_i - \bar{q}_i(k)| < 3\, s_i(k) + \gamma\right) \;\Rightarrow\; q_i \text{ is kept, otherwise removed,}$$

where $q_i = a_i - b_i$ denotes the bid/ask spread of the $i$-th quote.
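Applying the same rule to quotes amounts to swapping the price series for the spread series. A sketch, again with an illustrative trailing window and hypothetical defaults:

```python
import numpy as np

def clean_quotes(bids, asks, k=50, gamma=0.05):
    """Filter quotes on the bid/ask spread: keep quote i iff its spread lies
    within 3*std + gamma of the trailing-window mean spread."""
    spreads = np.asarray(asks, dtype=float) - np.asarray(bids, dtype=float)
    keep = np.ones(len(spreads), dtype=bool)
    for i in range(len(spreads)):
        window = spreads[max(0, i - k):i]
        if len(window) < 2:
            continue
        keep[i] = abs(spreads[i] - window.mean()) < 3.0 * window.std(ddof=1) + gamma
    return keep

# Hypothetical BBO stream with one blown-out ask quote.
bids = [192.00] * 60
asks = [192.02] * 60
asks[30] = 197.00
keep = clean_quotes(bids, asks)
print(np.flatnonzero(~keep))  # indices of removed quotes
```

Here $\gamma$ does the heavy lifting: a constant-spread stream has zero variance, and only the positive bound keeps normal quotes admissible.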
Corn futures example
Let's see how the algorithm works on the example of corn futures (C*). First of all, there is a relatively fast way to check whether price spikes/drops are present in tick data: compress the data into daily bars and plot the open, high, low and close prices on the same chart. If dramatic price changes are present in the dataset, the high/low prices will reveal them, meaning that we need to apply the tick-cleaning algorithm. Let's see how it works on the CZ2016 futures contract:
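The screening step can be sketched with pandas on synthetic data; the timestamps, price level, and injected bad tick below are all hypothetical stand-ins for a real tick file:

```python
import pandas as pd

# Hypothetical tick stream: one session of second-by-second prices near 350
# (a corn-like level, in cents) with a single corrupted 5.0 print injected.
idx = pd.date_range("2016-09-01 09:30", periods=200, freq="s")
prices = pd.Series(350.0, index=idx)
prices.iloc[120] = 5.0

# Compress ticks to daily OHLC bars: a low of 5 against an open/close
# near 350 is the tell-tale signature of bad ticks in that session.
daily = prices.resample("D").ohlc()
print(daily)
```

Scanning the daily high/low columns this way is far cheaper than inspecting millions of raw ticks, and it pinpoints which sessions need the tick-level cleaning pass.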
As we can see, there is strange behaviour that is clearly explained not by market dynamics but by bad data quality in that period. Let's now see how the algorithm manages to capture those ticks. In our example $k$ equals 500, meaning that the latest 500 ticks are used to estimate the rolling mean and standard deviation, and $\gamma$ equals 5.
As you can see, the algorithm successfully captured the tick outliers, both sudden spikes and drops. With cleaned tick data, a researcher can now use the dataset to research and backtest intraday strategies.
Some data providers use proprietary data-cleaning algorithms to deliver spike-free tick and quote data; however, the research team still needs to make sure that a dataset is clean and ready to be used in research. Furthermore, most algorithmic trading hedge funds collect and store market data themselves, which makes data-cleaning routines even more important.
In this article we have discussed a data-cleaning technique based on a rolling mean and standard deviation, and described how a data engineering team can perform fast and easy data-quality screening using daily OHLC bars. We have also described how the procedure can be extended to cleaning quote data, which suffers from bid/ask spread spikes.
- Brownlees C., Gallo G. (2006) Financial econometric analysis at ultra-high frequency: data handling concerns, Computational Statistics & Data Analysis, 51, 2232–2245.
- Dunis C., Gavridis M., Harris A., Leong S., Nacaskul P. (1998) An application of genetic algorithms to high frequency trading models: a case study, in: Nonlinear Modelling of High Frequency Financial Time Series, Dunis C. & Zhou B. (Eds.), Wiley, 247–278.
- Engle R., Russell J.R. (1998) Autoregressive conditional duration: a new model for irregularly spaced transaction data, Econometrica, 66, 1127–1162.
- Mineo A.M., Romito F. (2007) A method to "clean up" ultra high-frequency data, Statistica & Applicazioni, 5, 167–186.