Using HighFrequencyCovariance

Loading in data

We first load our data into a DataFrame. As an example, consider the DataFrame of price updates (df in the code block below). We can then convert it into a SortedDataFrame by passing the DataFrame and the names of the time, label and value columns to the constructor:

using HighFrequencyCovariance
using DataFrames
# Making example data
df = DataFrame(:stock => [:A,:B,:A,:A,:A,:B,:A,:B,:B], :time => [1,2,3,4,5,5,6,7,8],
               :logprice => [1.01,2.0,1.011,1.02,1.011,2.2,1.0001,2.2,2.3])
# Making a SortedDataFrame
ts = SortedDataFrame(df, :time, :stock, :logprice)

In a real setting, this is how we would turn a DataFrame of tick data into a SortedDataFrame.

For the succeeding sections it is useful to have more realistic time series data, so we will generate some Monte Carlo data using the generate_random_path function. This draws a random correlation matrix, volatilities, price update times and microstructure noises, and returns a SortedDataFrame containing a random time series consistent with these parameters (along with the true parameters themselves).

using HighFrequencyCovariance
using Random
dims = 4
ticks = 10000
ts, true_covar, true_micro_noise, true_update_rates = generate_random_path(dims, ticks, twister = MersenneTwister(2))
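
generate_random_path returns four objects: the SortedDataFrame of ticks, the true CovarianceMatrix, and Dicts of the true microstructure noise variances and update rates. As a quick sanity check we can inspect them (a minimal sketch; it assumes the CovarianceMatrix fields used later in this tutorial and that the SortedDataFrame stores its data in a df field):

# Inspecting the simulated ticks and the true parameters behind them.
print(first(ts.df, 5))        # the first few price updates
print(true_covar.labels)      # the asset labels
print(true_covar.volatility)  # the true volatility of each asset
print(true_covar.correlation) # the true correlation matrix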

Estimating volatility

We can use the SortedDataFrame we have generated (in the ts variable) to estimate the volatility of each asset:

assets         = get_assets(ts)
simple_vol     = estimate_volatility(ts, assets, :simple_volatility)
two_scales_vol = estimate_volatility(ts, assets, :two_scales_volatility)
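
Each of these calls returns its estimates keyed by asset label (the DataFrame construction below relies on this), so an individual asset's estimated volatility can be looked up directly:

# Looking up the estimated volatility of one asset by its label.
simple_vol[:asset_1]
two_scales_vol[:asset_1]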

The true volatility is contained in true_covar.volatility, so we can present these true volatilities alongside the two sets of estimates:

using DataFrames
true_volatility = Dict{Symbol,Float64}(true_covar.labels .=> true_covar.volatility)
summary_frame = vcat(DataFrame.([true_volatility, simple_vol, two_scales_vol])...)
summary_frame = hcat(DataFrame(Dict(:estimation => ["True", "Simple", "2 Scales"])), summary_frame)
print(summary_frame)
# │ Row │ estimation │ asset_1   │ asset_2   │ asset_3    │ asset_4   │
# │     │ String     │ Float64   │ Float64   │ Float64    │ Float64   │
# ├─────┼────────────┼───────────┼───────────┼────────────┼───────────┤
# │ 1   │ True       │ 0.0157077 │ 0.0137856 │ 0.00484516 │ 0.0142265 │
# │ 2   │ Simple     │ 0.0178855 │ 0.0284619 │ 0.00502814 │ 0.0129842 │
# │ 3   │ 2 Scales   │ 0.0173682 │ 0.0192129 │ 0.00605092 │ 0.0149015 │

We can see that the accuracy of the simple method was particularly bad for asset_2.
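
We can quantify this with the per-asset absolute errors of each method, a minimal sketch reusing the true_volatility dictionary constructed above:

# Per-asset absolute errors of each volatility estimation method.
simple_errors     = Dict(a => abs(simple_vol[a] - true_volatility[a]) for a in assets)
two_scales_errors = Dict(a => abs(two_scales_vol[a] - true_volatility[a]) for a in assets)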

This is due to microstructure noise, which we can estimate as:

noise          = estimate_microstructure_noise(ts, assets)

Tabulating the estimated and true microstructure noise, we can see that there was more microstructure noise for asset_2 than for the other assets.

using DataFrames
summary_frame = vcat(DataFrame.([true_micro_noise, noise])...)
summary_frame = hcat(DataFrame(Dict(:estimation => ["True", "2 Scales noise estimate"])), summary_frame)
print(summary_frame)
# 2×5 DataFrame
# │ Row │ estimation              │ asset_1    │ asset_2    │ asset_3     │ asset_4     │
# │     │ String                  │ Float64    │ Float64    │ Float64     │ Float64     │
# ├─────┼─────────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
# │ 1   │ True                    │ 0.00216696 │ 0.0092135  │ 0.000226909 │ 0.000938589 │
# │ 2   │ 2 Scales noise estimate │ 0.0021294  │ 0.00854816 │ 0.000226053 │ 0.000871175 │

Estimating a covariance matrix

As this is a Monte Carlo experiment, we already have the true CovarianceMatrix in the true_covar variable. Since we would not have this in an applied setting, we will disregard it for now and try to estimate it using our generated tick data stored in the SortedDataFrame named ts:

assets              = get_assets(ts)
simple_estimate     = estimate_covariance(ts, assets, :simple_covariance)
bnhls_estimate      = estimate_covariance(ts, assets, :bnhls_covariance)
spectral_estimate   = estimate_covariance(ts, assets, :spectral_covariance)
preav_estimate      = estimate_covariance(ts, assets, :preaveraged_covariance)
two_scales_estimate = estimate_covariance(ts, assets, :two_scales_covariance)

We may alternatively use the functions corresponding to each method directly. This has the same result:

bnhls_estimate2     = bnhls_covariance(ts, assets)
spectral_estimate2  = spectral_covariance(ts, assets)
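
These should match the earlier estimates made through estimate_covariance. We can verify this by comparing the fields of the two CovarianceMatrix structs (a sketch assuming the correlation and volatility fields used elsewhere in this tutorial):

# The dispatching function and the direct call should give the same estimate.
bnhls_estimate.correlation ≈ bnhls_estimate2.correlation
bnhls_estimate.volatility ≈ bnhls_estimate2.volatility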

Now we may be particularly interested in one of the estimates, for instance the bnhls_estimate. We can first check whether the correlation matrix it produces is valid (positive semi-definite, with a unit diagonal and all off-diagonal entries between -1 and 1):

valid_correlation_matrix(bnhls_estimate)
# true
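
Fortunately it is. We can examine the other estimates in the same way, for instance by looping over all of them:

# Checking that every estimate delivers a valid correlation matrix.
estimates = [simple_estimate, bnhls_estimate, spectral_estimate,
             preav_estimate, two_scales_estimate]
all(valid_correlation_matrix.(estimates))
# true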

They all deliver valid correlation matrices. One thing we might try then is to average over all of the more advanced methods and use the result as our correlation matrix estimate. This is easy to achieve by using the combine_covariance_matrices function.

matrices = [spectral_estimate, preav_estimate, two_scales_estimate, bnhls_estimate]
combined_estimate = combine_covariance_matrices(matrices)

Now we can compare how close each of the estimates is to the true CovarianceMatrix. We can do this by examining the mean absolute differences between the estimated and true correlations and volatilities.

calculate_mean_abs_distance(true_covar, combined_estimate)
# (Correlation_error = 0.38723691161754376, Volatility_error = 0.002500211816000063)
calculate_mean_abs_distance(true_covar, simple_estimate)
# (Correlation_error = 0.5321534542489482, Volatility_error = 0.010511960080115556)
calculate_mean_abs_distance(true_covar, bnhls_estimate)
# (Correlation_error = 0.7422120933301078, Volatility_error = 0.006815323622470541)
calculate_mean_abs_distance(true_covar, spectral_estimate)
# (Correlation_error = 0.5227424813357473, Volatility_error = 0.007669889385330695)
calculate_mean_abs_distance(true_covar, preav_estimate)
# (Correlation_error = 0.1840684108352901, Volatility_error = 0.0022421828719004925)
calculate_mean_abs_distance(true_covar, two_scales_estimate)
# (Correlation_error = 0.38238270061443486, Volatility_error = 0.0022421828719004925)
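
If we wanted to rank the methods programmatically, we could collect these errors into a dictionary; a minimal sketch using the estimates from above:

# Ranking all of the estimates by their mean absolute correlation error.
all_estimates = Dict(:combined => combined_estimate, :simple => simple_estimate,
                     :bnhls => bnhls_estimate, :spectral => spectral_estimate,
                     :preav => preav_estimate, :two_scales => two_scales_estimate)
errors = Dict(k => calculate_mean_abs_distance(true_covar, v).Correlation_error
              for (k, v) in all_estimates)
sort(collect(errors), by = last)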

We can see that in this particular case the correlation matrix calculated with the preaveraging method performed the best.

Examining the data, we can also see that some assets trade more frequently than others.

ticks_per_asset(ts)
# Dict{Symbol, Int64} with 4 entries:
#   :asset_4 => 3454
#   :asset_3 => 3242
#   :asset_2 => 1340
#   :asset_1 => 1964

While we have 3454 price updates for asset_4, we only have 1340 for asset_2. We could potentially improve on the bnhls estimate by using a blocking and regularisation technique (Hautsch, Kyj and Oomen 2012).

We can start by making a DataFrame detailing which assets should be in which block. We will generate a new block whenever the minimum number of ticks in the new block would be more than 20% greater than the minimum number of ticks in the previous block:

new_block_threshold = 1.2
blocking_frame = put_assets_into_blocks_by_trading_frequency(
                        ts, new_block_threshold, :bnhls_covariance)
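
We can take a quick look at this frame before altering it, for instance at which assets were grouped together and which estimator is currently assigned to each block (using the :assets, :f and :number_of_assets columns described below):

# A quick look at the blocking structure before we alter it.
print(blocking_frame[:, [:assets, :f, :number_of_assets]])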

This blocking_frame is a regular DataFrame with six columns, where each row represents a separate estimation. The order of the rows is the order of the estimations (so the results of later estimations may overwrite those of earlier ones). The first column, named :assets, has type Set{Symbol} and records which assets are included in each estimation. The second column, named :f, contains a symbol representing the function that will be used to estimate that block. The third column, named :optional_parameters, has type NamedTuple and can supply optional parameters to the covariance function named in the second column. Every covariance estimation function has a signature with two common arguments before the semicolon (a SortedDataFrame and a vector of symbols indicating which assets to use), followed by a number of named optional arguments which can be sourced from a NamedTuple. The blockwise_estimation function then estimates a block with the line

blocking_frame[i,:f](ts, collect(blocking_frame[i,:assets]);
                     blocking_frame[i,:optional_parameters]... )

Thus a user can insert a named tuple containing whatever optional parameters are used by the function.
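
For instance, optional parameters for the estimation of the first block could be attached as below. Note that the keyword name example_kwarg is purely hypothetical; the named arguments that are accepted depend on the covariance function in that row's :f column.

# Attaching a NamedTuple of optional parameters to the first block's estimation.
# example_kwarg is a hypothetical name; substitute the named arguments that the
# chosen covariance function actually accepts.
blocking_frame[1, :optional_parameters] = (example_kwarg = 0.5,)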

The fourth, fifth and sixth columns contain the number of assets in the block, the mean number of ticks in the block and the mean time per tick. These columns are not used by the subsequent blockwise_estimation function but can be used to alter the DataFrame. In the current case we may decide to estimate the block containing all assets using the spectral_covariance method:

all_assets_rows = findall(blocking_frame[:, :number_of_assets] .== 4)
blocking_frame[all_assets_rows, :f] .= :spectral_covariance

We can now compute the blockwise CovarianceMatrix estimate:

block_estimate = blockwise_estimation(ts, blocking_frame)
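
We can check whether this blockwise estimate delivers a valid correlation matrix before using it:

# If this returns false, the matrix should be regularised before use.
valid_correlation_matrix(block_estimate)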

After a blockwise estimation the result may well not be positive semi-definite (PSD), so we could regularise at this point:

reg_block_estimate = regularise(block_estimate, ts, :nearest_correlation_matrix)

Finally, we might want to use one of our estimated CovarianceMatrix structs to calculate an actual covariance matrix over some interval. This can be done with the code:

covariance_interval = 1000
covar = covariance(combined_estimate, covariance_interval)

Note that the covariance_interval here should be expressed in the same time units as the CovarianceMatrix struct's volatility, which are in turn the units of the time dimension of the SortedDataFrame used to estimate the CovarianceMatrix.
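
As an illustration of what this matrix can then be used for, the variance of a portfolio over the interval is w'Σw. A minimal sketch with arbitrary equal weights (assuming covariance returns the covariance matrix as an ordinary matrix, with assets ordered as in combined_estimate.labels):

using LinearAlgebra
# The variance over the interval of an (arbitrary) equally weighted portfolio.
w = fill(0.25, 4) # example weights, in the order of combined_estimate.labels
portfolio_variance = w' * covar * w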