\[\DeclareMathOperator{\erf}{erf} \DeclareMathOperator{\argmin}{argmin} \newcommand{\R}{\mathbb{R}} \newcommand{\n}{\boldsymbol{n}}\]

Module pyqt_fit.kde_methods

Author:Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>

This module contains a set of methods to compute univariate KDEs. See the objects in the pyqt_fit.kde module for more details on these methods.

These methods provide various variations on \(\hat{K}(x;X,h,L,U)\), the modified kernel evaluated on the point \(x\) based on the estimation points \(X\), a bandwidth \(h\) and on the domain \([L,U]\).

The definitions of the methods rely on the following definitions:

\[\begin{split}\begin{array}{rcl} a_0(l,u) &=& \int_l^u K(z) dz\\ a_1(l,u) &=& \int_l^u zK(z) dz\\ a_2(l,u) &=& \int_l^u z^2K(z) dz \end{array}\end{split}\]

These definitions correspond to:

  • \(a_0(l,u)\) – The partial cumulative distribution function
  • \(a_1(l,u)\) – The partial first moment of the distribution. In particular, \(a_1(-\infty, \infty)\) is the mean of the kernel (i.e. and should be 0).
  • \(a_2(l,u)\) – The partial second moment of the distribution. In particular, \(a_2(-\infty, \infty)\) is the variance of the kernel (i.e. which should be close to 1, unless using higher order kernel).

References:

[1](1, 2) Jones, M. C. 1993. Simple boundary correction for kernel density estimation. Statistics and Computing 3: 135–146.

Univariate KDE estimation methods

The exact definition of such a method is found in pyqt_fit.kde.KDE1D.method

pyqt_fit.kde_methods.generate_grid(kde, N=None, cut=None)[source]

Helper method returning a regular grid on the domain of the KDE.

Parameters:
  • kde (KDE1D) – Object describing the KDE computation. The object must have been fitted!
  • N (int) – Number of points in the grid
  • cut (float) – for unbounded domains, how far past the maximum should the grid extend to, in term of KDE bandwidth
Returns:

A vector of N regularly spaced points

pyqt_fit.kde_methods.compute_bandwidth(kde)[source]

Compute the bandwidth and covariance for the model, based of its xdata attribute

class pyqt_fit.kde_methods.KDE1DMethod[source]

Base class providing a default grid method and a default method for unbounded evaluation of the PDF and CDF. It also provides default methods for the other metrics, based on PDF and CDF calculations.

Note:
  • It is expected that all grid methods will return the same grid if used with the same arguments.
  • It is fair to assume all array-like arguments will be at least 1D arrays.

The following methods are interface methods that should be overriden with ones specific to the implemented method.

fit(kde)[source]

Method called by the KDE1D object right after fitting to allow for one-time calculation.

Parameters:kde (pyqt_fit.kde.KDE1D) – KDE object being fitted
Default:Compute the bandwidth and covariance if specified as functions
pdf(kde, points, out)[source]

Compute the PDF of the estimated distribution.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the distribution on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the PDF.

Default:

Direct implementation of the formula for unbounded pdf computation.

__call__(kde, points, out)[source]

Call the pdf() method.

grid(kde, N=None, cut=None)[source]

Evaluate the PDF of the distribution on a regular grid with at least N elements.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the PDF has been estimated on, and the estimations.

Default:

Evaluate \(pdf(x)\) on a grid generated using generate_grid()

cdf(kde, points, out)[source]

Compute the CDF of the estimated distribution, defined as:

\[cdf(x) = P(X \leq x) = \int_l^x p(t) dt\]

where \(l\) is the lower bound of the distribution domain and \(p\) the density of probability

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the CDF on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the CDF.

Default:

Direct implementation of the formula for unbounded CDF computation.

cdf_grid(kde, N=None, cut=None)[source]

Evaluate the CDF of the distribution on a regular grid with at least N elements.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the CDF has been estimated on, and the estimations.

Default:

Evaluate \(cdf(x)\) on a grid generated using generate_grid()

icdf(kde, points, out)[source]

Compute the inverse cumulative distribution (quantile) function, defined as:

\[icdf(p) = \inf\left\{x\in\mathbb{R} : cdf(x) \geq p\right\}\]
Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the iCDF on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the iCDF.

Default:

First approximate the result using linear interpolation on the CDF and refine the result numerically using the Newton method.

icdf_grid(kde, N=None, cut=None)[source]

Compute the inverse cumulative distribution (quantile) function on a grid.

Note:

The default implementation is not as good an approximation as the plain icdf default method.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the CDF has been estimated on, and the estimations.

Default:

Linear interpolation of the inverse CDF on a grid

sf(kde, points, out)[source]

Compute the survival function, defined as:

\[sf(x) = P(X \geq x) = \int_x^u p(t) dt = 1 - cdf(x)\]
Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the survival function on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the survival function.

Default:

Compute explicitly \(1 - cdf(x)\)

sf_grid(kde, N=None, cut=None)[source]

Compute the survival function on a grid.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the survival function has been estimated on, and the estimations.

Default:

Compute explicitly \(1 - cdf(x)\)

isf(kde, points, out)[source]

Compute the inverse survival function, defined as:

\[isf(p) = \sup\left\{x\in\mathbb{R} : sf(x) \leq p\right\}\]
Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the iSF on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the inverse survival function.

Default:

Compute \(icdf(1-p)\)

isf_grid(kde, N=None, cut=None)[source]

Compute the inverse survival function on a grid.

Note:

The default implementation is not as good an approximation as the plain isf default method.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the CDF has been estimated on, and the estimations.

Default:

Linear interpolation of the inverse survival function on a grid

hazard(kde, points, out)[source]

Compute the hazard function evaluated on the points.

The hazard function is defined as:

\[h(x) = \frac{p(x)}{sf(x)}\]

where \(p(x)\) is the probability density function and \(sf(x)\) is the survival function.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the hazard function on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the hazard function

Default:

Compute explicitly \(pdf(x) / sf(x)\)

hazard_grid(kde, N=None, cut=None)[source]

Compute the hazard function on a grid.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the hazard function has been estimated on, and the estimations.

Default:

Compute explicitly \(pdf(x) / sf(x)\)

cumhazard(kde, points, out)[source]

Compute the cumulative hazard function evaluated on the points.

The hazard function is defined as:

\[ch(x) = \int_l^x h(t) dt = -\ln sf(x)\]

where \(l\) is the lower bound of the domain, \(h\) the hazard function and \(sf\) the survival function.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • points (ndarray) – Points to evaluate the cumulative hazard function on
  • out (ndarray) – Result object. If must have the same shapes as points
Return type:

ndarray

Returns:

Returns the out variable, updated with the cumulative hazard function

Default:

Compute explicitly \(-\ln sf(x)\)

cumhazard_grid(kde, N=None, cut=None)[source]

Compute the hazard function on a grid.

Parameters:
  • kde (pyqt_fit.kde.KDE1D) – KDE object
  • N (int) – minimum number of element in the returned grid. Most methods will want to round it to the next power of 2.
  • cut (float) – for unbounded domains, how far from the last data point should the grid go, as a fraction of the bandwidth.
Return type:

(ndarray, ndarray)

Returns:

The array of positions the hazard function has been estimated on, and the estimations.

Default:

Compute explicitly \(-\ln sf(x)\)

name
Type:str

Specify a human-readable name for the method, for presentation purposes.

But the class also provide a number of utility methods to help implementing you own:

numeric_cdf(kde, points, out)[source]

Provide a numeric approximation of the CDF based on integrating the pdf using scipy.integrate.quad().

numeric_cdf_grid(kde, N=None, cut=None)[source]

Compute the CDF on a grid using a trivial, but fast, numeric integration of the pdf.

Estimation methods

Here are the methods implemented in pyqt_fit. To access these methods, the simplest is to use the instances provided.

pyqt_fit.kde_methods.unbounded

Instance of the KDE1DMethod class.

pyqt_fit.kde_methods.renormalization

Instance of the RenormalizationMethod class.

pyqt_fit.kde_methods.reflection

Instance of the ReflectionMethod class.

pyqt_fit.kde_methods.linear_combination

Instance of the LinearCombinationMethod class.

pyqt_fit.kde_methods.cyclic

Instance of the CyclicMethod class.

pyqt_fit.kde_methods.transformKDE1D(trans, method=None, inv=None, Dinv=None)[source]

Creates an instance of TransformKDE1DMethod

pyqt_fit.kde_methods.default_method

Method used by pyqt_fit.kde.KDE1D by default. :Default: reflection

Classes implementing the estimation methods

class pyqt_fit.kde_methods.RenormalizationMethod[source]

This method consists in using the normal kernel method, but renormalize to only take into account the part of the kernel within the domain of the density [1].

The kernel is then replaced with:

\[\hat{K}(x;X,h,L,U) \triangleq \frac{1}{a_0(u,l)} K(z)\]

where:

\[z = \frac{x-X}{h} \qquad l = \frac{L-x}{h} \qquad u = \frac{U-x}{h}\]
class pyqt_fit.kde_methods.ReflectionMethod[source]

This method consist in simulating the reflection of the data left and right of the boundaries. If one of the boundary is infinite, then the data is not reflected in that direction. To this purpose, the kernel is replaced with:

\[\hat{K}(x; X, h, L, U) \triangleq K(z) + K\left(\frac{x+X-2L}{h}\right) + K\left(\frac{x+X-2U}{h}\right)\]

where:

\[z = \frac{x-X}{h}\]

See the pyqt_fit.kde_methods for a description of the various symbols.

When computing grids, if the bandwidth is constant, the result is computing using CDT.

class pyqt_fit.kde_methods.LinearCombinationMethod[source]

This method uses the linear combination correction published in [1].

The estimation is done with a modified kernel given by:

\[\hat{K}(x;X,h,L,U) \triangleq \frac{a_2(l,u) - a_1(-u, -l) z}{a_2(l,u)a_0(l,u) - a_1(-u,-l)^2} K(z)\]

where:

\[z = \frac{x-X}{h} \qquad l = \frac{L-x}{h} \qquad u = \frac{U-x}{h}\]
class pyqt_fit.kde_methods.CyclicMethod[source]

This method assumes cyclic boundary conditions and works only for closed boundaries.

The estimation is done with a modified kernel given by:

\[\hat{K}(x; X, h, L, U) \triangleq K(z) + K\left(z - \frac{U-L}{h}\right) + K\left(z + \frac{U-L}{h}\right)\]

where:

\[z = \frac{x-X}{h}\]

When computing grids, if the bandwidth is constant, the result is computing using FFT.

pyqt_fit.kde_methods.create_transform(obj, inv=None, Dinv=None)[source]

Create a transform object.

Parameters:
  • obj (fun) – This can be either simple a function, or a function-object with an ‘inv’ and/or ‘Dinv’ attributes containing the inverse function and its derivative (respectively)
  • inv (fun) – If provided, inverse of the main function
  • Dinv (fun) – If provided, derivative of the inverse function
Return type:

Transform

Returns:

A transform object with function, inverse and derivative of the inverse

The inverse function must be provided, either as argument or as attribute to the object. The derivative of the inverse will be estimated numerically if not provided.

Note:All the functions should accept an out argument to store the result.
class pyqt_fit.kde_methods.TransformKDE1DMethod(trans, method=None, inv=None, Dinv=None)[source]

Compute the Kernel Density Estimate of a dataset, transforming it first to a domain where distances are “more meaningful”.

Often, KDE is best estimated in a different domain. This object takes a KDE1D object (or one compatible), and a transformation function.

Given a random variable \(X\) of distribution \(f_X\), the random variable \(Y = g(X)\) has a distribution \(f_Y\) given by:

\[f_Y(y) = \left| \frac{1}{g'(g^{-1}(y))} \right| \cdot f_X(g^{-1}(y))\]

In our term, \(Y\) is the random variable the user is interested in, and \(X\) the random variable we can estimate using the KDE. In this case, \(g\) is the transform from \(Y\) to \(X\).

So to estimate the distribution on a set of points given in \(x\), we need a total of three functions:

  • Direct function: transform from the original space to the one in which the KDE will be perform (i.e. \(g^{-1}: y \mapsto x\))
  • Invert function: transform from the KDE space to the original one (i.e. \(g: x \mapsto y\))
  • Derivative of the invert function

If the derivative is not provided, it will be estimated numerically.

Parameters:
  • trans – Either a simple function, or a function object with attributes inv and Dinv to use in case they are not provided as arguments. The helper create_transform() will provide numeric approximation of the derivative if required.
  • method – instance of KDE1DMethod used in the transformed domain. Default is pyqt_fit.kde_methods.KDE1DMethod
  • inv – Invert of the function. If not provided, trans must have it as attribute.
  • Dinv – Derivative of the invert function.
Note:

all given functions should accept an optional out argument to get a pre-allocated array to store its result. Also the out parameter may be one of the input argument.