
Potential issues

There are a few things you should know when using ennemi in your projects. In case you encounter an issue not mentioned here, please get in touch!

Negative MI

estimate_mi() might return a negative number near zero (like -0.03). This is usually not an error:

Mathematically, mutual information is always non-negative, but the continuous-case estimation algorithm is not exact. Some estimation error is expected, especially with small sample sizes.

Near-zero values can thus be interpreted as zero. However, if the result is further from zero (something like -0.5, or even -inf), that certainly is an issue. Some possible causes are explained below.
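As a quick illustration of this benign estimation error (a minimal sketch; the exact value depends on the random seed and sample size), the true MI between two independent variables is exactly zero, yet the estimate often lands slightly on either side of it:

from ennemi import estimate_mi
import numpy as np

rng = np.random.default_rng(1234)

# Two independent variables: the true MI is exactly zero
x = rng.normal(size=200)
y = rng.normal(size=200)

print(estimate_mi(y, x))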

Skewed or high-variance distributions

Mutual information is invariant under strictly monotonic transformations of the variables. Such transformations include, for example, taking logarithms of positive-valued variables, exponentiation, and multiplication by a positive constant.

However, with sample sizes less than $\infty$, the estimation algorithm may produce different results for the original and transformed variables. For this reason it is recommended to transform the variables to have roughly symmetric distributions. This transformation will not change the true MI, but it will improve the accuracy of the estimate.

Entropy is not transformation-invariant, and therefore this guidance does not apply to estimate_entropy().

For example, here is the bivariate Gaussian example modified to be lognormal in both marginal directions, with greater variance in the x variable. As mentioned in the tutorial, $\mathrm{MI}(X, Y) \approx 0.51$ and the exponentiation does not change this.

from ennemi import estimate_mi
import numpy as np

rng = np.random.default_rng(1234)
rho = 0.8
cov = np.array([[1, rho], [rho, 1]])

data = rng.multivariate_normal([0, 0], cov, size=800)
# Make both marginal distributions lognormal; x gets a much larger spread
x = np.exp(5 * data[:,0])
y = np.exp(data[:,1])

print(estimate_mi(y, x))

Running the code outputs a value similar to

[[0.18815223]]

The result is clearly below the true value of $\approx 0.51$. This demonstrates that you should make the variables roughly symmetric, here by taking logarithms, before running the estimation.
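With the logarithms taken, the estimate improves considerably. A minimal sketch, continuing the code above:

# Undo the exponentiation before estimating; this does not change the true MI
print(estimate_mi(np.log(y), np.log(x)))

This should print a value close to the expected $\approx 0.51$.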

By default, ennemi rescales all variables to have unit variance, but the package does not perform any symmetrization. The rescaling can be disabled by setting the preprocess parameter to False.

Discrete/duplicate observations

The estimation algorithm assumes that the variables are sampled from a continuous distribution. This assumption can be violated in three ways: the variable may be genuinely discrete, it may be recorded at a low resolution so that many observations share the same value, or it may be censored to a fixed bound (as in the example below).

For low-resolution or censored data, the suggestion of Kraskov et al. (2004) is to add some low-amplitude noise to the non-continuous variable. ennemi does this automatically unless the preprocess parameter is set to False.

As an example, we can consider a censored version of the bivariate Gaussian example. We sample data points from a centered, correlated normal distribution, but move all negative values to the corresponding axis (or origin).

from ennemi import estimate_mi
import numpy as np

rng = np.random.default_rng(1234)
rho = 0.8
cov = np.array([[1, rho], [rho, 1]])

data = rng.multivariate_normal([0.5, 0.5], cov, size=800)
x = np.maximum(0, data[:,0])
y = np.maximum(0, data[:,1])

print("MI:", estimate_mi(y, x, preprocess=False))
print("On one or both axes:", np.mean((x == 0) | (y == 0))*100, "%")
print("At origin:", np.mean((x == 0) & (y == 0))*100, "%")

This code prints

MI: [[-inf]]
On one or both axes: 40.25 %
At origin: 22 %

Because many of the observations lie on the x or y axis, and over half of those at the origin $(0, 0)$, the algorithm produces a clearly incorrect result. The fix is to either add

# Add low-amplitude noise to break up the duplicate values
x += rng.normal(0, 1e-6, size=800)
y += rng.normal(0, 1e-6, size=800)

before the call to estimate_mi(), or remove the preprocess=False parameter. With this fix, the code now prints

MI: [[0.41881861]]

a better approximation of the true value ($\approx 0.425$).

Autocorrelation

The estimation method requires that the observations are independent and identically distributed. If the samples have significant autocorrelation, the independence assumption does not hold, and the algorithm may return MI values that are too high.

In the following example, each point appears three times: the two additional copies are perturbed by a small amount of random noise. This simulates autocorrelation between the samples.

from ennemi import estimate_mi
import numpy as np

rng = np.random.default_rng(1234)
rho = 0.8
cov = np.array([[1, rho], [rho, 1]])

data = rng.multivariate_normal([0, 0], cov, size=800)
x = np.concatenate((data[:,0],
    data[:,0] + rng.normal(0, 0.01, size=800),
    data[:,0] + rng.normal(0, 0.01, size=800)))
y = np.concatenate((data[:,1],
    data[:,1] + rng.normal(0, 0.01, size=800),
    data[:,1] + rng.normal(0, 0.01, size=800)))

print(estimate_mi(y, x))

Running the code outputs

[[1.02554819]]

a significantly higher value than the $\approx 0.51$ expected of non-autocorrelated samples.

The reason for this error is that the approximated density is peakier than it should be. Each point has effectively three times the density it should have, while the gaps between points are unaffected. We can see this in a Gaussian kernel density estimate of the sample. For simplicity, we plot only the marginal y density.

import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Continues the previous example: 'data' and 'y' are defined above
baseline_kernel = gaussian_kde(data[:,1])
autocorrelated_kernel = gaussian_kde(y)

t = np.linspace(-2, 2, 100)
plt.plot(t, baseline_kernel(t), "--", label="Baseline")
plt.plot(t, autocorrelated_kernel(t), label="Autocorrelated")
plt.legend()
plt.show()

The autocorrelated density has more bumps than the roughly Gaussian baseline.

The way of fixing this depends on your data. Does it make sense to look at deseasonalized or differenced data? Can you reduce the sampling frequency so that the autocorrelation is smaller?
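In this constructed example, the counterpart of reducing the sampling frequency is simply keeping one copy of each observation. A minimal sketch, continuing the code above:

# Keep only the first 800 rows, i.e. one copy of each original point
print(estimate_mi(y[:800], x[:800]))

This brings the estimate back close to the expected $\approx 0.51$.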

This point is addressed in the case study tutorial. You can also see Section 4.2 of the Atmosphere article doi:10.3390/atmos13071046, which includes example code in the Supplementary Material.

High MI values

As noted by Gao et al. (2015), the nearest-neighbor estimation method is not accurate when the variables are strongly dependent; roughly, this means MI values above 1.5. They also present a correction term.

Because the primary goal of ennemi is general correlation detection, and the affected MI values are equivalent to correlation coefficients above 0.97, we do not consider this issue to be significant.
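This equivalence follows from the bivariate Gaussian relationship $\mathrm{MI}(X, Y) = -\frac{1}{2}\ln(1 - \rho^2)$: solving for the correlation coefficient gives $\rho = \sqrt{1 - e^{-2\,\mathrm{MI}}}$, so $\mathrm{MI} = 1.5$ corresponds to $\rho \approx 0.97$.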

Improving performance

ennemi uses fast, compiled algorithms from the NumPy and SciPy packages. This makes the estimation performance good, while letting you write simple and portable Python code. There are still some things to keep in mind to get the most out of your resources.

Basics

Parallelize the calculation

If you call estimate_mi() with several variables and/or time lags, the calculation is split into as many concurrent tasks as you have processor cores (unless the tasks are very small). The efficiency improves with sample size: with large data sets, most of the time is spent in compiled algorithms.

This means that instead of writing

# This code is not optimal
for lag in all_the_lags:
    for var in variables:
        estimate_mi(y, data[var], lag)

you should write

estimate_mi(y, data, all_the_lags)

Similarly, you should use pairwise_mi when you want to calculate, well, the pairwise MI between several variables. The calculation is parallelized and automatically takes advantage of the symmetry of MI.
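A minimal sketch, assuming data is a 2-D NumPy array (or pandas DataFrame) with one column per variable:

from ennemi import pairwise_mi

# Returns a symmetric matrix with the MI between each pair of columns
mi_matrix = pairwise_mi(data)
print(mi_matrix)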

If you have only some processor cores available for Python, you can override the max_threads parameter. You should not pass a value greater than the number of cores available; the additional threads will not make the estimation faster, but only increase the overhead of switching between threads.
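For example, if you know that only two cores are free for Python (a hypothetical situation), you could write:

estimate_mi(y, data, all_the_lags, max_threads=2)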

If you pass the callback parameter, make sure that the callback returns quickly. Because of Python's Global Interpreter Lock, only one thread of Python code can run at a time. The heavy computation is done in compiled code that can run in parallel with Python code, but a slow callback blocks the execution of other callbacks and some of ennemi's internal code.