empirical copula 3d GAN copula 3d

CopulaGAN (the 3d case) - A first naive attempt

In this blog, we start the research effort around CopulaGAN.

The goal is to sample from a realistic dependence structure of $n$ variables (e.g. stocks returns, credit default swaps spreads changes).

Whereas CorrGAN is well suited to work with hundreds of variables, I hope that CopulaGAN will be able to scale to deal with the dozens of variables. However, CopulaGAN should be able to describe much more precisely the dependence structure of $n$ variables than a $n \times n$ correlation matrix generated by CorrGAN would since a correlation matrix discards a lot of information:

a correlation coefficient is a simplistic summary of a bivariate distribution between two variables,
only pairwise relationship between the variables is considered (though the PSD-ness of the matrix is enforcing some multivariate constraints).

Copulas, unlike correlation matrices, encode the whole dependence distribution.

We start exploring the possibility of sampling realistic vectors using GANs (or VAEs) from high-dimensional copulas by playing with a simple case: $d = 3$.

We follow the steps of the study we did earlier for CorrGAN in 3 dimensions.

TL;DR CopulaGAN 3D seems to approximately work using off-the-shelf GAN code. However, it has flaws, and won’t probably extend to higher dimensions (as it was the case for CorrGAN).

Like CorrGAN, CopulaGAN faces the permutation invariance problem. In the case of CorrGAN, we solved it by using a natural ordering of the data (cf. this blog for the permutation invariance discussion, and this blog for the ordering trick using hierarchical clustering). In the case of CopulaGAN, it is not clear what would be a proper ordering of the empirical measures… Stochastic Deep Networks offer a promising avenue to solve this issue.

%matplotlib inline

import pandas as pd
import numpy as np
from scipy.stats import rankdata
from numpy.random import seed
import tensorflow as tf
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import animation
import matplotlib.pyplot as plt

seed(42)

The data loaded below was built by choosing 3 stocks among approximately 500, then ranking their returns (time series), and finally normalizing the ranks into $[0, 1]^3$. We end up with 3 uniform margins whose joint distribution is the empirical copula.

copulas = np.load('copulas-3d/copula_sample.npy')

copulas.shape

(897120, 3)

we display below what the 3d empirical copula looks like:

xs = copulas[:20000, 0]
ys = copulas[:20000, 1]
zs = copulas[:20000, 2]

fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')
ax.scatter(xs, ys, zs, color='royalblue', alpha=0.2)
ax.scatter(1, 1, 1, color='red', s=100)
ax.view_init(30, 45)
plt.tight_layout()
plt.show()

empirical copula 3d

We can notice that the distribution is rather symmetric around the diagonal ((0,0,0), (1,1,1)).

The red dot is locating the point (1, 1, 1), and has no other meaning. It is just for visual convenience in the following animations.

def init():
    ax.scatter(xs, ys, zs, color='royalblue', alpha=0.2)
    ax.scatter(1, 1, 1, color='red', s=100)
    return fig,

def animate_mpg(i):
    if i < 90:
        ax.view_init(elev=10, azim=i*4)
    elif i < 180:
        ax.view_init(elev=10+2*(i%90), azim=(i%90)*4)
    else:
        ax.view_init(elev=190-2*(i%90), azim=0)
    return fig,

def animate_gif(i):
    if i < 90:
        ax.view_init(elev=10, azim=-60+i*4)
    elif i < 133:
        ax.view_init(elev=10+(i%90)*4, azim=-60)
    else:
        ax.view_init(elev=-180+(10+(i%90)*4)%180, azim=-60)
    return fig,

An animation of the 3d point cloud representing the copula:

fn = 'figs/empirical_copula_3d'

anim = animation.FuncAnimation(
    fig, animate_mpg, init_func=init, frames=270, interval=50,
    blit=True)

anim.save(fn + '.mp4', writer='ffmpeg', fps=1000 / 50)

anim = animation.FuncAnimation(
    fig, animate_gif, init_func=init, frames=180, interval=50,
    blit=True)

anim.save(fn + '.gif', writer='imagemagick', fps=1000 / 50)

We are now defining a basic GAN, whose goal is to sample 3d vectors from this distribution. We are using the basic code that got us started for CorrGAN first experiments:

def generator(Z, hsize=[64, 64, 16], reuse=False):
    with tf.variable_scope("GAN/Generator", reuse=reuse):
        h1 = tf.layers.dense(Z, hsize[0], activation=tf.nn.leaky_relu)
        h2 = tf.layers.dense(h1, hsize[1], activation=tf.nn.leaky_relu)
        h3 = tf.layers.dense(h2, hsize[2], activation=tf.nn.leaky_relu)
        out = tf.layers.dense(h3, 3)

    return out

def discriminator(X, hsize=[64, 64, 16], reuse=False):
    with tf.variable_scope("GAN/Discriminator", reuse=reuse):
        h1 = tf.layers.dense(X, hsize[0], activation=tf.nn.leaky_relu)
        h2 = tf.layers.dense(h1, hsize[1], activation=tf.nn.leaky_relu)
        h3 = tf.layers.dense(h2, hsize[2], activation=tf.nn.leaky_relu)
        h4 = tf.layers.dense(h3, 3)
        out = tf.layers.dense(h4, 1)

    return out, h4

X = tf.placeholder(tf.float32, [None, 3])
Z = tf.placeholder(tf.float32, [None, 3])

G_sample = generator(Z)
r_logits, r_rep = discriminator(X)
f_logits, g_rep = discriminator(G_sample, reuse=True)

disc_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        logits=r_logits, labels=tf.ones_like(r_logits)) +   
    tf.nn.sigmoid_cross_entropy_with_logits(
        logits=f_logits, labels=tf.zeros_like(f_logits)))

gen_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        logits=f_logits, labels=tf.ones_like(f_logits)))

gen_vars = tf.get_collection(
    tf.GraphKeys.GLOBAL_VARIABLES, scope="GAN/Generator")
disc_vars = tf.get_collection(
    tf.GraphKeys.GLOBAL_VARIABLES, scope="GAN/Discriminator")

gen_step = tf.train.RMSPropOptimizer(
    learning_rate=0.0001).minimize(gen_loss, var_list=gen_vars)
disc_step = tf.train.RMSPropOptimizer(
    learning_rate=0.0001).minimize(disc_loss, var_list=disc_vars)

The GAN model is now defined.

To monitor how the training goes, we will also look at the projections on the 2d faces of the unit cube $[0, 1]^3$, i.e. the empirical copulas of $(x_1, x_2)$, $(x_1, x_3)$, $(x_2, x_3)$.

def plot_faces(e_plot, g_plot):
    plt.figure(figsize=(18, 14))
    nb_bins = 50
    for count, (idx1, idx2) in enumerate([[0, 1], [0, 2], [1, 2]]):
        plt.subplot(3, 3, 3 * count + 1)
        plt.hist2d(e_plot[:, idx1], e_plot[:, idx2],
                   bins=nb_bins)
        plt.title('Cube Face {} - empirical'.format(count + 1))
        plt.subplot(3, 3, 3 * count + 2)
        plt.hist2d(g_plot[:, idx1], g_plot[:, idx2],
                   bins=nb_bins)
        plt.title('Cube Face {} - GAN'.format(count + 1))
        plt.subplot(3, 3, 3 * count + 3)
        plt.hist2d(rankdata(g_plot[:, idx1]), rankdata(g_plot[:, idx2]),
                   bins=nb_bins)
        plt.title('Cube Face {} - GAN + Rank'.format(count + 1))
    plt.show()

We will sample $2^{14} = 16384$ points for the display of the real empirical distribution, and the GAN generated one.

def sample_Z(m, n):
    return np.random.uniform(-1., 1., size=[m, n])

def sample_from_copula(n=10000):
    idx = [randint(0, len(copulas) - 1)
           for j in range(n)]
    
    return copulas[idx, :]

n_dots = 2**14
x_plot = sample_from_copula(n=n_dots)
Z_plot = sample_Z(n_dots, 3)

Batches contain 512 points.

sess = tf.Session()
tf.global_variables_initializer().run(session=sess)

batch_size = 2**9
nd_steps = 5
ng_steps = 5

And, finally, the main loop for training the GAN.

Regularly, we display the GAN distribution (in orange), and the real empirical one (in blue). We can see how the orange starts to progressively match the blue one.

We also display 3 faces, and for each face 3 heatmaps:

(leftmost) real empirical 2d distribution (after projection on that face),
(center) GAN generated 2d distribution (after projection on that face),
(rightmost) 2d distribution of the ranked GAN generated points.

for i in range(50000 + 1):
    X_batch = sample_from_copula(n=batch_size)
    Z_batch = sample_Z(batch_size, 3)
    
    for _ in range(nd_steps):
        _, dloss = sess.run([disc_step, disc_loss],
                            feed_dict={X: X_batch, Z: Z_batch})
    rrep_dstep, grep_dstep = sess.run([r_rep, g_rep],
                                      feed_dict={X: X_batch, Z: Z_batch})

    for _ in range(ng_steps):
        _, gloss = sess.run([gen_step, gen_loss],
                            feed_dict={Z: Z_batch})
    
    rrep_gstep, grep_gstep = sess.run([r_rep, g_rep],
                                      feed_dict={X: X_batch, Z: Z_batch})
    
    # analytics display
    if (i <= 1000 and i % 100 == 0) or (i % 10000 == 0):
        print("Iterations: %d\t Discriminator loss: %.4f\t Generator loss: %.4f"
            % (i, dloss, gloss))
        
        fig = plt.figure(figsize=(10, 8))
        g_plot = sess.run(G_sample, feed_dict={Z: Z_plot})
        ax = fig.add_subplot(111, projection='3d')
        ax.scatter(x_plot[:, 0], x_plot[:, 1], x_plot[:, 2], alpha=0.05)
        ax.scatter(g_plot[:, 0], g_plot[:, 1], g_plot[:, 2], alpha=0.2)

        plt.legend(["Empirical measures", "Generated measures"])
        plt.title('Samples at Iteration %d' % i)
        plt.tight_layout()
        plt.show()
        plt.close()
        
        plot_faces(x_plot, g_plot)

Iterations: 0	 Discriminator loss: 1.3303	 Generator loss: 0.6944

gan copula 3d

Iterations: 100	 Discriminator loss: 1.3862	 Generator loss: 0.7905

gan copula 3d

Iterations: 200	 Discriminator loss: 1.3827	 Generator loss: 0.6865

gan copula 3d

Iterations: 300	 Discriminator loss: 1.3778	 Generator loss: 0.7033

gan copula 3d

Iterations: 400	 Discriminator loss: 1.3750	 Generator loss: 0.6767

gan copula 3d

Iterations: 500	 Discriminator loss: 1.3625	 Generator loss: 0.6953

gan copula 3d

Iterations: 600	 Discriminator loss: 1.3821	 Generator loss: 0.7045

gan copula 3d

Iterations: 700	 Discriminator loss: 1.3709	 Generator loss: 0.7060

gan copula 3d

Iterations: 800	 Discriminator loss: 1.3778	 Generator loss: 0.7035

gan copula 3d

Iterations: 900	 Discriminator loss: 1.3773	 Generator loss: 0.7076

gan copula 3d

Iterations: 1000	 Discriminator loss: 1.3819	 Generator loss: 0.6921

gan copula 3d

Iterations: 10000	 Discriminator loss: 1.3823	 Generator loss: 0.6756

gan copula 3d

Iterations: 20000	 Discriminator loss: 1.3836	 Generator loss: 0.7046

gan copula 3d

Iterations: 30000	 Discriminator loss: 1.3753	 Generator loss: 0.7269

gan copula 3d

Iterations: 40000	 Discriminator loss: 1.3840	 Generator loss: 0.7027

gan copula 3d

Iterations: 50000	 Discriminator loss: 1.3743	 Generator loss: 0.7234

gan copula 3d

We can see that this basic GAN has managed to roughly match the original distribution.

We show the 3d point cloud (after ranking and normalizing into $[0, 1]^3$) produced by the last iteration of the GAN training (in orange; original distribution in blue; red dot as visual clue for (1, 1, 1)):

xs = x_plot[:, 0]
ys = x_plot[:, 1]
zs = x_plot[:, 2]

xgan = g_plot[:, 0]
ygan = g_plot[:, 1]
zgan = g_plot[:, 2]

xrgan = rankdata(xgan) / len(xgan)
yrgan = rankdata(ygan) / len(ygan)
zrgan = rankdata(zgan) / len(zgan)

fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')
ax.scatter(xs, ys, zs,
           color='royalblue', alpha=0.05)
ax.scatter(xrgan, yrgan, zrgan,
           color='orange', alpha=0.2)
ax.scatter(1, 1, 1, color='red', s=100)
ax.view_init(30, 45)
plt.tight_layout()
plt.show()

gan copula 3d

We can see that some corners are missing. I have been able to obtain (visually) better results on different runs, but it is rather difficult to decide when to stop training on an objective criterion.

def init():
    ax.scatter(xs, ys, zs,
               color='royalblue', alpha=0.05)
    ax.scatter(xrgan, yrgan, zrgan,
               color='orange', alpha=0.2)
    ax.scatter(1, 1, 1, color='red', s=100)
    return fig,

def animate_mpg(i):
    if i < 90:
        ax.view_init(elev=10, azim=i*4)
    elif i < 180:
        ax.view_init(elev=10+2*(i%90), azim=(i%90)*4)
    else:
        ax.view_init(elev=190-2*(i%90), azim=0)
    return fig,

def animate_gif(i):
    if i < 90:
        ax.view_init(elev=10, azim=-60+i*4)
    elif i < 133:
        ax.view_init(elev=10+(i%90)*4, azim=-60)
    else:
        ax.view_init(elev=-180+(10+(i%90)*4)%180, azim=-60)
    return fig,

And, to conclude, the associated animation:

fn = 'figs/gan_copula_3d'

anim = animation.FuncAnimation(
    fig, animate_mpg, init_func=init, frames=270, interval=50,
    blit=True)

anim.save(fn + '.mp4', writer='ffmpeg', fps=1000 / 50)

anim = animation.FuncAnimation(
    fig, animate_gif, init_func=init, frames=180, interval=50,
    blit=True)

anim.save(fn + '.gif', writer='imagemagick', fps=1000 / 50)

Conclusion: Results are encouraging to extend to higher dimensions. One promising research avenue is to try Stochastic Deep Networks for this problem. We also need to develop a better understanding of the characteristics of these empirical distributions (aka stylized facts), which can both help design a suitable network and check whether its results are relevant.