
Module pyqt_fit.kde

Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>

Module implementing kernel-based estimation of probability densities.

Given a kernel \(K\), the density function is estimated from a sampling \(X = \{X_i \in \mathbb{R}^n\}_{i\in\{1,\ldots,m\}}\) as:

\[f(\mathbf{z}) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} K\left(\frac{X_i-\mathbf{z}}{h\lambda_i}\right)\]

\[W = \sum_{i=1}^m w_i\]

where \(h\) is the bandwidth of the kernel, \(w_i\) are the weights of the data points and \(\lambda_i\) are the adaptation factors of the kernel width.
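The estimator above can be sketched in a few lines of plain Python for the 1D case. This is only an illustration of the formula, not the library's implementation: the names gaussian_kernel and kde_1d are hypothetical, and the Gaussian is assumed as the kernel \(K\).

```python
import math

def gaussian_kernel(u):
    """Standard normal density, assumed here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(z, xs, h, weights=None, lambdas=None):
    """Evaluate the weighted, adaptive estimate f(z) for 1D samples xs."""
    m = len(xs)
    weights = [1.0] * m if weights is None else weights
    lambdas = [1.0] * m if lambdas is None else lambdas
    total_weight = sum(weights)  # W in the formula
    acc = 0.0
    for x_i, w_i, lam_i in zip(xs, weights, lambdas):
        # Each point contributes (w_i / lambda_i) * K((X_i - z) / (h * lambda_i)).
        acc += (w_i / lam_i) * gaussian_kernel((x_i - z) / (h * lam_i))
    return acc / (h * total_weight)

xs = [1.0, 2.0, 3.0]
print(kde_1d(2.0, xs, h=0.5))
```

Because every term is divided by \(W\), multiplying all weights by the same constant leaves the estimate unchanged.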

The kernel is a function of \(\mathbb{R}^n\) such that:

\[\begin{split}\begin{array}{rclcl} \idotsint_{\mathbb{R}^n} K(\mathbf{z}) d\mathbf{z} & = & 1 & \Longleftrightarrow & \text{$K$ is a probability density}\\ \idotsint_{\mathbb{R}^n} \mathbf{z}K(\mathbf{z}) d\mathbf{z} &=& \mathbf{0} & \Longleftrightarrow & \text{$K$ is centered}\\ \forall \mathbf{u}\in\mathbb{R}^n, \|\mathbf{u}\| = 1\qquad\int_{\mathbb{R}} t^2K(t \mathbf{u}) dt &\approx& 1 & \Longleftrightarrow & \text{The covariance matrix of $K$ is close to the identity.} \end{array}\end{split}\]

The constraint on the covariance is only required to provide a uniform meaning for the bandwidth of the kernel.
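As a quick sanity check, the three conditions above can be verified numerically for a candidate kernel in 1D, where the last condition reduces to the variance being close to 1. The sketch below assumes a standard normal kernel and a simple midpoint rule; normal_kernel and moment are hypothetical helper names.

```python
import math

def normal_kernel(t):
    """Standard normal density used as the candidate kernel."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def moment(kernel, order, lo=-10.0, hi=10.0, steps=20000):
    """Approximate the integral of t**order * kernel(t) with the midpoint rule."""
    dt = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * dt
        total += t ** order * kernel(t)
    return total * dt

print(moment(normal_kernel, 0))  # total mass, close to 1
print(moment(normal_kernel, 1))  # mean, close to 0
print(moment(normal_kernel, 2))  # variance, close to 1
```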

If the domain of the density estimation is bounded to the interval \([L,U]\), the density is then estimated with:

\[f(x) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} \hat{K}(x;X,\lambda_i h,L,U)\]

where \(\hat{K}\) is a modified kernel that depends on the exact method used. Currently, only 1D KDE supports bounded domains.
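To make the idea of a modified kernel concrete, here is a sketch of one common choice, reflection at the boundaries. It is an assumption for illustration only: the text above does not say which method is the default, and reflected_kernel and bounded_kde are hypothetical names. Weights and adaptation factors are taken equal to 1.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def reflected_kernel(x, x_i, width, lower, upper):
    """One possible K-hat: mirror the kernel at each finite bound."""
    if x < lower or x > upper:
        return 0.0  # no density outside the domain
    value = normal_kernel((x - x_i) / width)
    if math.isfinite(lower):
        # Contribution of the data point mirrored across the lower bound.
        value += normal_kernel((x - (2.0 * lower - x_i)) / width)
    if math.isfinite(upper):
        # Contribution of the data point mirrored across the upper bound.
        value += normal_kernel((x - (2.0 * upper - x_i)) / width)
    return value

def bounded_kde(x, xs, h, lower=-math.inf, upper=math.inf):
    """Unweighted estimate (w_i = lambda_i = 1) on [lower, upper]."""
    total = sum(reflected_kernel(x, x_i, h, lower, upper) for x_i in xs)
    return total / (h * len(xs))

xs = [0.2, 0.5, 1.0]
print(bounded_kde(0.0, xs, h=0.3, lower=0.0))
print(bounded_kde(-0.1, xs, h=0.3, lower=0.0))
```

With a single finite bound, the mass that an unbounded kernel would place outside the domain is folded back inside, so the estimate still integrates to 1 over the domain.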

Kernel Density Estimation Methods

class pyqt_fit.kde.KDE1D(xdata, **kwords)

Perform a kernel-based density estimation in 1D, possibly on a bounded domain \([L,U]\).

Parameters:
  • xdata (ndarray) – 1D array with the data points
  • kwords (dict) –

    Attributes to set at construction time. Passing a named argument is equivalent to setting the corresponding property after construction. For example:

    >>> xs = [1,2,3]
    >>> k = KDE1D(xs, lower=0)
    

    will be equivalent to:

    >>> k = KDE1D(xs)
    >>> k.lower = 0
    

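The calculation is separated into three parts:

  • the kernel (kernel)
  • the bandwidth or covariance estimation (bandwidth, covariance)
  • the estimation method (method)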

__call__(points, out=None)

This method is an alias for KDE1D.evaluate().

bandwidth

Bandwidth of the kernel. It can be set either to a fixed value or to a bandwidth calculator, that is, a function of signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.
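A bandwidth calculator is simply a function that takes the data and returns one number. The sketch below shows what such a function might look like; normal_reference_bandwidth is a hypothetical name, and the rule it applies (sample standard deviation times m to the power -1/5) is chosen only for illustration.

```python
import math

def normal_reference_bandwidth(xdata):
    """Hypothetical calculator: sample standard deviation times m**(-1/5)."""
    m = len(xdata)
    mean = sum(xdata) / m
    variance = sum((x - mean) ** 2 for x in xdata) / (m - 1)
    return math.sqrt(variance) * m ** (-1.0 / 5.0)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
h = normal_reference_bandwidth(xs)
print(h)

# With pyqt_fit installed, the function itself could be given instead of a number,
# following the keyword convention described for the constructor:
# k = KDE1D(xs, bandwidth=normal_reference_bandwidth)
```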

cdf_grid(N=None, cut=None)

Compute the CDF on a grid of N points, integrating the density from the lower bound of the domain.
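One way to picture a CDF on a grid is to evaluate the density on a regular mesh and accumulate it with the trapezoid rule. The sketch below does this for an unbounded Gaussian KDE. It is not the library's algorithm, and the meaning given to cut here (the number of bandwidths added on each side of the data) is an assumption, since the parameters are not described above; kde_pdf and cdf_on_grid are hypothetical names.

```python
import math

def kde_pdf(x, xs, h):
    """Unweighted Gaussian KDE evaluated at x."""
    norm = h * len(xs) * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - x_i) / h) ** 2) for x_i in xs) / norm

def cdf_on_grid(xs, h, n_points=512, cut=5.0):
    """Return (mesh, cdf) with the CDF accumulated by the trapezoid rule."""
    lo = min(xs) - cut * h  # assumed meaning of cut: extra bandwidths per side
    hi = max(xs) + cut * h
    step = (hi - lo) / (n_points - 1)
    mesh = [lo + k * step for k in range(n_points)]
    pdf = [kde_pdf(x, xs, h) for x in mesh]
    cdf = [0.0]
    for k in range(1, n_points):
        cdf.append(cdf[-1] + 0.5 * (pdf[k - 1] + pdf[k]) * step)
    return mesh, cdf

mesh, cdf = cdf_on_grid([1.0, 2.0, 3.0], h=0.5)
print(cdf[0], cdf[-1])  # starts at 0 and ends close to 1
```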

closed

True if the density domain is closed, i.e. lower and upper are both finite.

copy()

Return a shallow copy of the KDE object.

covariance

Covariance of the Gaussian kernel. It can be set either to a fixed value or to a bandwidth calculator, that is, a function of signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.

evaluate(points, out=None)

Compute the PDF of the distribution at the given points.

fit()

Compute the various parameters needed by the KDE method.

grid(N=None, cut=None)

Evaluate the density on a grid of N points spanning the whole dataset.

Returns: a tuple with the mesh on which the density is evaluated and the density itself

icdf_grid(N=None, cut=None)

Compute the inverse cumulative distribution (quantile) function on a grid.
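The quantile function can be approximated from any tabulated CDF by inverse linear interpolation. The sketch below illustrates the idea and is not the library's code: quantiles_from_cdf is a hypothetical helper, and the standard normal CDF stands in for a KDE-based one.

```python
import bisect
import math

def quantiles_from_cdf(mesh, cdf, probabilities):
    """Invert a tabulated, increasing CDF by linear interpolation."""
    result = []
    for p in probabilities:
        k = bisect.bisect_left(cdf, p)  # first index with cdf[k] >= p
        if k == 0:
            result.append(mesh[0])      # p at or below the tabulated range
        elif k >= len(cdf):
            result.append(mesh[-1])     # p above the tabulated range
        else:
            frac = (p - cdf[k - 1]) / (cdf[k] - cdf[k - 1])
            result.append(mesh[k - 1] + frac * (mesh[k] - mesh[k - 1]))
    return result

# Tabulate the standard normal CDF as a stand-in for a KDE-based CDF.
mesh = [-6.0 + 0.01 * k for k in range(1201)]
cdf = [0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in mesh]
median, upper = quantiles_from_cdf(mesh, cdf, [0.5, 0.975])
print(median, upper)
```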

kernel

Kernel object. This must be an object modeled on pyqt_fit.kernels.Kernel1D. It is recommended to inherit from this class, as it provides numerical approximations for all methods.

By default, the kernel is an instance of pyqt_fit.kernels.normal_kernel1d

lambdas

Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.

When deleted, the lambdas are reset to 1.

lower

Lower bound of the density domain. If deleted, it is reset to \(-\infty\).

method

Select the method to use. The method should be an object modeled on pyqt_fit.kde_methods.KDE1DMethod, and it is recommended to inherit from that class.

The available methods are found in the pyqt_fit.kde_methods sub-module.

Default: pyqt_fit.kde_methods.default_method

upper

Upper bound of the density domain. If deleted, it is reset to \(\infty\).

weights

Weights associated with each data point. It can be either a single value or an array with one value per data point. If a single value is provided, the weights will always be set to 1.

Bandwidth Estimation Methods

pyqt_fit.kde.variance_bandwidth(factor, xdata)

Returns the covariance matrix:

\[\mathcal{C} = \tau^2 cov(X)\]

where \(\tau\) is a correcting factor that depends on the method.
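In 1D the covariance formula reduces to \(\tau^2\) times the variance of the data. The sketch below illustrates this with the two correcting factors documented for the Silverman and Scott rules in this section. The function names are hypothetical, and the use of the unbiased sample variance (division by n - 1) is an assumption.

```python
def sample_variance(xdata):
    """Unbiased sample variance (assumed normalisation: n - 1)."""
    n = len(xdata)
    mean = sum(xdata) / n
    return sum((x - mean) ** 2 for x in xdata) / (n - 1)

def silverman_factor(n, d=1):
    """tau = (n * (d + 2) / 4) ** (-1 / (d + 4))"""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

def scott_factor(n, d=1):
    """tau = n ** (-1 / (d + 4))"""
    return n ** (-1.0 / (d + 4))

def variance_bandwidth_1d(factor, xdata):
    """Return tau**2 * var(X), the 1D analogue of the covariance formula."""
    return factor ** 2 * sample_variance(xdata)

xs = [1.0, 2.0, 3.0]
print(variance_bandwidth_1d(silverman_factor(len(xs)), xs))
print(variance_bandwidth_1d(scott_factor(len(xs)), xs))
```

For d = 1 the Silverman factor is the Scott factor multiplied by (3/4) to the power -1/5, so it always gives a slightly larger covariance.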

pyqt_fit.kde.silverman_covariance(xdata, model=None)

The Silverman bandwidth is defined as a variance bandwidth with factor:

\[\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}\]

Here \(n\) is the number of data points and \(d\) the dimension of the data.

pyqt_fit.kde.scotts_covariance(xdata, model=None)

Scott's bandwidth is defined as a variance bandwidth with factor:

\[\tau = n^\frac{-1}{d+4}\]

pyqt_fit.kde.botev_bandwidth(N=None, **kword)

Implementation of the KDE bandwidth selection method outlined in:

Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.

Based on the implementation of Daniel B. Smith, PhD.

The object is a callable returning the bandwidth for a 1D kernel.