

Non-linear correlation detection with mutual information

API reference

This page is an extended version of the documentation strings included for each method. All methods are available in the ennemi namespace.


The array_like data type is interpreted as in NumPy documentation: it is either a numpy.ndarray or anything with roughly the same shape. Python lists and tuples are array-like. Two-dimensional arrays are created by nesting sequences: [(1, 2), (3, 4), (5, 6)] is equivalent to an array with 3 rows and 2 columns. Numbers are interpreted as zero-dimensional arrays. Arrays can be automatically extended to higher dimensions if the shapes match.
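For example (using NumPy directly), the nested sequence quoted above produces a 3-by-2 array, and a plain number becomes a zero-dimensional array:

```python
import numpy as np

# The nested sequence from the text: three rows, two columns.
x = np.asarray([(1, 2), (3, 4), (5, 6)])
print(x.shape)  # (3, 2)

# A plain number is interpreted as a zero-dimensional array.
s = np.asarray(5)
print(s.ndim)  # 0
```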

The calculation algorithms are described in

The algorithms assume, among other things, that the observations are independent of each other. Note especially this last assumption: highly autocorrelated time series data will produce unrealistically high MI values.


Estimate the entropy of one or more continuous random variables.

Returns the estimated entropy in nats. If x is two-dimensional, each marginal variable is estimated separately by default. If the multidim parameter is set to True, the array is interpreted as a single $m$-dimensional random variable.

If x is a pandas DataFrame or Series, the result is a DataFrame with column names matching x.
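As a sketch of the difference between the two modes, consider a Gaussian, whose differential entropy has the closed form \(\frac{1}{2}\ln\left((2\pi e)^m \det \Sigma\right)\). The default mode corresponds to computing the entropy of each column on its own; multidim=True corresponds to the joint entropy of all columns together (the two agree only when the columns are independent):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a multivariate normal with covariance cov."""
    cov = np.atleast_2d(cov)
    m = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** m * np.linalg.det(cov))

cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

# Marginal entropies: what the default, column-by-column mode corresponds to.
marginals = [gaussian_entropy(cov[i, i]) for i in range(2)]

# Joint entropy: what multidim=True corresponds to.
joint = gaussian_entropy(cov)

# With correlated columns, the joint entropy is less than the sum of the
# marginal entropies; the gap is exactly the mutual information.
print(sum(marginals) - joint)  # -0.5 * ln(1 - 0.8**2) ≈ 0.511
```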


Optional keyword parameters


Equivalent to estimate_mi with normalize=True; please refer to the documentation below.

The correlation coefficient is calculated using the formula \(\rho = \sqrt{1 - \exp(-2\, \mathrm{MI}(X;Y))}.\)
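The formula inverts the exact MI of a bivariate normal distribution, \(\mathrm{MI} = -\frac{1}{2}\ln(1 - \rho^2)\), so for normally distributed data the coefficient recovers the absolute Pearson correlation. A minimal check of that round trip:

```python
import math

def mi_to_correlation(mi):
    # rho = sqrt(1 - exp(-2 * MI)); maps MI in [0, inf) onto [0, 1).
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

# For a bivariate normal with Pearson correlation r, the true MI is
# -0.5 * ln(1 - r^2); the formula recovers |r|.
r = -0.6
mi = -0.5 * math.log(1.0 - r**2)
print(mi_to_correlation(mi))  # ≈ 0.6 (the sign is lost)
```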


Estimate the mutual information between y and each x variable.

Returns the estimated mutual information in nats, or if normalize is set, as a correlation coefficient. The result is a 2D ndarray where the first index represents lag values and the second index represents x columns. If x is a pandas DataFrame or Series, the result is a DataFrame with column names and lag values as the row indices.

The time lag $\Delta$ is interpreted as $y(t) \sim x(t - \Delta) | z(t - \Delta_{\mathrm{cond}})$. The time lags are applied to the x and cond arrays such that the y array stays the same every time. This means that y is cropped to y[max(max_lag,0) : N+min(min_lag,0)]. The cond_lag parameter specifies the lag for the cond array separately from the x lag.
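The cropping rule can be sketched in plain NumPy; the slice below is exactly the one quoted above, applied to a mix of positive and negative lags:

```python
import numpy as np

y = np.arange(10)          # observations y(0) ... y(9), so N = 10
N = len(y)
lags = [-1, 0, 2]          # x is compared at x(t - lag) for each lag
max_lag, min_lag = max(lags), min(lags)

# y stays the same for every lag: crop it once so that every shifted
# x (and cond) index stays within bounds.
y_used = y[max(max_lag, 0) : N + min(min_lag, 0)]
print(y_used)  # [2 3 4 5 6 7 8]
```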

If the mask parameter is set, only those y observations with the matching mask element set to True are used for estimation.

Positional or keyword parameters

Optional keyword parameters


Normalize mutual information values to the unit interval. Equivalent to passing normalize=True to the estimation methods.

The normalization formula \(\rho = \sqrt{1 - \exp(-2\, \mathrm{MI}(X;Y))}\) is based on the MI of a bivariate normal distribution. The value matches the absolute Pearson correlation coefficient in a linear model. Because MI is preserved by all monotonic transformations, in a non-linear model the value matches the Pearson correlation of the linearized model. The value is positive regardless of the sign of the correlation.

Negative values are kept as-is: true mutual information is always non-negative, but estimate_mi may produce small negative values due to estimation error. Large negative values may indicate that the data does not satisfy the assumptions of the estimation algorithm.

The normalization is not applicable to discrete variables: it is not possible to get coefficient 1.0 even when the variables are completely determined by each other. The formula assumes that both variables have an infinite amount of entropy.
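To see why, note that for discrete variables \(\mathrm{MI}(X;Y) \le \min(H(X), H(Y))\), which is finite. A sketch with a fair binary variable paired with itself, the most favorable case possible:

```python
import math

# X is a fair coin; Y = X, so the variables fully determine each other.
h = math.log(2)   # discrete entropy of X in nats
mi = h            # MI(X; X) equals H(X) for a discrete variable

rho = math.sqrt(1.0 - math.exp(-2.0 * mi))
print(rho)  # ≈ 0.866, not 1.0
```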



Equivalent to pairwise_mi with normalize=True; please refer to the documentation below.


Estimate the pairwise MI between each pair of variables.

Returns a matrix where the $(i,j)$’th element is the mutual information between the $i$’th and $j$’th columns in the data. The values are in nats or in the normalized scale depending on the normalize parameter. The diagonal contains NaNs (for better visualization, as the auto-MI should be $\infty$ nats or correlation $1$).
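A sketch of the result layout. The inner estimator here is an illustrative stand-in based on the Gaussian MI formula, not the estimator ennemi actually uses; only the matrix structure (symmetry, NaN diagonal) is the point:

```python
import numpy as np

def pair_mi_gaussian(a, b):
    # Stand-in MI estimate assuming joint normality: -0.5 * ln(1 - r^2).
    r = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1.0 - r**2)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))     # 500 observations of 3 variables
n = data.shape[1]

result = np.full((n, n), np.nan)     # the diagonal stays NaN
for i in range(n):
    for j in range(n):
        if i != j:
            result[i, j] = pair_mi_gaussian(data[:, i], data[:, j])

# The matrix is symmetric, with NaNs on the diagonal.
print(np.isnan(np.diag(result)).all())  # True
```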

Positional or keyword parameters

Optional keyword parameters