## Saturday, 24 January 2015

### A brief introduction to copulas

Copulas are a fascinating innovation and especially useful in the analysis of multivariate distributions. A copula is a kind of decomposition which makes it possible to consider a joint distribution in terms of its marginal distributions and a function which connects them together (the copula). It's grad-level stuff, so it comes with the compulsory impenetrable vocabulary and "obvious" assumptions. This post will endeavour to explain what a copula is, how it is applied and how one is derived.

Copulas derive from a result known as Sklar's theorem, which states that any joint distribution $F(x_1,...,x_d)$ can be expressed as a function (the copula) of its marginal distributions $F_1(x_1),...,F_d(x_d)$, i.e. $C(F_1(x_1),...,F_d(x_d)) = F(x_1,...,x_d)$. A prime motivator for using copulas is the ability to consider the dependence structure of a joint distribution separately from its marginals. This makes it possible to analyse the behaviour of systems under different structural assumptions. It also makes it possible to change the marginals whilst retaining the structure. These turn out to be very useful features. You may, for example, build a regime change model which combines implied options probabilities from major markets with several alternative copulas representing trending and stationary market conditions. You may then run log-likelihood tests to determine which state makes the current market prices most likely and use this to inform subsequent decisions.
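As a small worked instance of Sklar's theorem, consider the simplest possible case of two independent variables (an assumption made purely for illustration). Independence means the joint distribution factorises into the product of the marginals, and writing $u_1 = F_1(x_1)$ and $u_2 = F_2(x_2)$ gives the copula directly:

$$F(x_1,x_2) = F_1(x_1)F_2(x_2) \quad\Rightarrow\quad C(u_1,u_2) = u_1 u_2$$

Note that the marginals $F_1$ and $F_2$ can be anything at all here; only the way they are joined together is fixed by the copula.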

Let $u_i = F_i(x_i)$, so that $x_i = F^{-1}_i(u_i)$, where $F^{-1}_i$ is the inverse of the $i$-th marginal distribution function. Substituting into Sklar's theorem gives an expression which can help us work out the copula function: $C(u_1,...,u_d) = F(F^{-1}_1(u_1),...,F^{-1}_d(u_d))$. This suggests that if we know the joint distribution function and the inverse functions of the marginal distributions, then we can derive an expression for the copula. Copulas obtained this way are sometimes called "implicit" copulas and are derived from known distributions (e.g. the multivariate Gaussian). Another way to derive the copula is to model the dependence relationship in the data directly. You may have noticed that each input to the copula function satisfies $u_i \in [0,1]$, since $u_i$ is a probability. Thus our task becomes to create a function $C:[0,1]^d \to [0,1]$ where $d$ is the number of dimensions. Copulas which attempt this solution are sometimes called "explicit" or "associative".
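To make the "implicit" route a little more concrete, here is a minimal sketch of sampling from a bivariate Gaussian copula and then swapping in different marginals. It uses only the Python standard library. The function names, the correlation of 0.7 and the exponential marginals are all assumptions chosen for illustration, not part of any particular library.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs (u1, u2) from a bivariate Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    phi = NormalDist()  # standard normal; phi.cdf plays the role of F_i
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Construct z2 so that corr(z1, z2) = rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # u_i = F_i(z_i): each coordinate becomes uniform on [0, 1].
        pairs.append((phi.cdf(z1), phi.cdf(z2)))
    return pairs


def exponential_inverse_cdf(u, rate):
    """Inverse CDF of an exponential distribution: x = -ln(1 - u) / rate."""
    return -math.log1p(-u) / rate


def ranks(values):
    """Rank of each value (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


uniforms = gaussian_copula_sample(n=2000, rho=0.7)

# Change the marginals whilst retaining the structure:
# x_i = F_i^{-1}(u_i) with exponential marginals of different rates.
exp_pairs = [
    (exponential_inverse_cdf(u1, 1.0), exponential_inverse_cdf(u2, 0.5))
    for u1, u2 in uniforms
]

u1s, u2s = zip(*uniforms)
x1s, x2s = zip(*exp_pairs)
# The inverse CDFs are increasing, so the ranks (and hence the rank
# correlation) should be the same before and after the change of marginals.
print(spearman(u1s, u2s), spearman(x1s, x2s))
```

The rank correlation is used here because it depends only on the ordering of the observations, which is exactly the part of the joint distribution the copula describes.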

I'll now explain a useful non-parametric copula which proceeds directly from the empirical distribution function. The empirical CDF is given by $F(t) = \frac{1}{n}\sum_{i=1}^n\textbf{1}\{x_i \le t\}$ where, confusingly, the $\textbf{1}$ stands for an indicator function which equals 1 when $\{x_i \le t\}$ is true and 0 otherwise, so the sum counts the number of observations at or below $t$. You can read more about it and why it works here. To obtain the copula from a set of multivariate data with $n$ observations of $d$ variables, we first transform each variable $x_1,...,x_d$ through its own empirical CDF, so that observation $i$ of variable $j$ becomes $U_{i,j} = F_j(x_{i,j})$. The copula then proceeds from the joint empirical distribution as follows.

$$C(u_1,...,u_d) = F(F_1^{-1}(u_1),...,F_d^{-1}(u_d)) = \frac{1}{n}\sum_{i=1}^n\textbf{1}\{U_{i,1} \le u_1,...,U_{i,d} \le u_d\}$$
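The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, assuming the data arrive as a list of rows with no missing values; the names `pseudo_observations` and `empirical_copula` and the toy data set are made up for this example.

```python
from bisect import bisect_right


def pseudo_observations(column):
    """Map each value x to its empirical CDF value (1/n) * #{x_j <= x}."""
    n = len(column)
    sorted_column = sorted(column)
    # bisect_right counts how many sorted values are <= x.
    return [bisect_right(sorted_column, x) / n for x in column]


def empirical_copula(data):
    """Build C(u_1, ..., u_d) from n rows of d-dimensional data."""
    n = len(data)
    columns = list(zip(*data))
    # Transform every variable through its own empirical CDF.
    u_columns = [pseudo_observations(column) for column in columns]
    u_rows = list(zip(*u_columns))  # row i holds (U_{i,1}, ..., U_{i,d})

    def copula(*u):
        # Indicator is 1 only when every coordinate of row i is <= u.
        count = sum(
            all(u_ij <= u_j for u_ij, u_j in zip(row, u)) for row in u_rows
        )
        return count / n

    return copula


# Toy data: 4 observations of 2 variables.
data = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (4.0, 40.0)]
C = empirical_copula(data)

print(C(0.5, 0.5))    # 0.25
print(C(0.75, 0.75))  # 0.75
print(C(1.0, 1.0))    # 1.0
```

Each call to `C` scans all $n$ rows, which is fine for a sketch but would want a smarter data structure for large samples.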

That's it! I hope that has made copulas a bit clearer than they previously were. Please use the comments section for questions.