| Title: | Kernel Density Estimation with Parametric Starts and Asymmetric Kernels |
|---|---|
| Description: | Handles univariate non-parametric density estimation with parametric starts and asymmetric kernels in a simple and flexible way. Kernel density estimation with parametric starts involves fitting a parametric density to the data before making a correction with kernel density estimation, see Hjort & Glad (1995) <doi:10.1214/aos/1176324627>. Asymmetric kernels make kernel density estimation more efficient on bounded intervals such as (0, 1) and the positive half-line. Supported asymmetric kernels are the gamma kernel of Chen (2000) <doi:10.1023/A:1004165218295>, the beta kernel of Chen (1999) <doi:10.1016/S0167-9473(99)00010-9>, and the copula kernel of Jones & Henderson (2007) <doi:10.1093/biomet/asm068>. User-supplied kernels, parametric starts, and bandwidths are supported. |
| Authors: | Jonas Moss [aut, cre] (ORCID: <https://orcid.org/0000-0002-6876-6964>), Martin Tveten [ctb] |
| Maintainer: | Jonas Moss <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.0 |
| Built: | 2026-05-05 08:58:56 UTC |
| Source: | https://github.com/jonasmoss/kdensity |
The available options for bandwidth selectors, passed as the bw
argument to kdensity.
x |
The input data. |
kernel_str |
A string specifying the kernel, e.g. "gaussian." |
start_str |
A string specifying the parametric start, e.g. "normal". |
support |
The domain of definition for the kernel. (-Inf, Inf) for symmetric kernels. |
The bandwidth functions are not exported. They are members of the
environment bw_environments, and can be accessed by
kdensity:::bw_environments.
"nrd0", "nrd", "bcv", "SJ": Bandwidth selectors from stats.
They are documented in [bandwidth][stats::bandwidth] stats:bandwidth.
"nrd0" is the standard bandwidth selector for symmetric kernels with
constant parametric starts.
"ucv": Unbiased cross validation. The standard option for
asymmetric kernels.
"HS": Hallberg Szabadváry's closed-form reference rule for the
beta kernel on the unit interval. This is the default for
kernel = "beta" with a uniform or constant start.
"RHE": Selector for parametric starts with a symmetric kernel,
based on a reference rule with Hermite polynomials.
Described in Hjort & Glad (1995). The default method in kdensity when a parametric
start is supplied and the kernel is symmetric.
"JH": Selector for the Gaussian copula kernel, based on
normal reference rule. Described in Jones & Henderson. The default method when
the gcopula kernel is used in kdensity.
The bandwidth selector is a function of four arguments: The data
x, a kernel string kernel, a start string start,
and a support vector support. To obtain the functions associated
with these strings, use get_kernel and get_start. The
function should return a double.
Jones, M. C., and D. A. Henderson. "Kernel-type density estimation on the unit interval." Biometrika 94.4 (2007): 977-984. Hjort, Nils Lid, and Ingrid K. Glad. "Nonparametric density estimation with a parametric start." The Annals of Statistics (1995): 882-904. Hallberg Szabadváry, Johan. "A Fast, Closed-Form Bandwidth Selector for the Beta Kernel Density Estimator." arXiv preprint arXiv:2601.19553 (2026).
kdensity(), stats::bandwidth.kernel() for the
bandwidth selectors of stats::density(). In addition,
kernels(); parametric_starts()
## Not a serious bandwidth function. silly_width <- function(x, kernel = NULL, start = NULL, support = NULL) { rexp(1) } kdensity(mtcars$mpg, start = "gumbel", bw = silly_width)## Not a serious bandwidth function. silly_width <- function(x, kernel = NULL, start = NULL, support = NULL) { rexp(1) } kdensity(mtcars$mpg, start = "gumbel", bw = silly_width)
kdensity computes a parametrically guided kernel density estimate
for univariate data. It supports asymmetric kernels and parametric starts
through the kernel and start arguments.
kdensity( x, bw = NULL, adjust = 1, kernel = NULL, start = NULL, support = NULL, na.rm = FALSE, normalized = TRUE, tolerance = 0.01 )kdensity( x, bw = NULL, adjust = 1, kernel = NULL, start = NULL, support = NULL, na.rm = FALSE, normalized = TRUE, tolerance = 0.01 )
x |
Numeric vector containing the data. |
bw |
A bandwidth function. Can be either a string, a custom-made
function, or a double. The supported bandwidth functions are documented
in |
adjust |
An adjustment constant, so that |
kernel |
The kernel function. Can be chosen from the list of built-in
kernels or be custom-made. See |
start |
Parametric start. Can be chosen from the list of built-in
parametric starts or be custom-made. See |
support |
The support of the data. Must be compatible with the supplied
|
na.rm |
Logical; if |
normalized |
Logical; if |
tolerance |
Numeric; the relative error to tolerate in normalization. |
The default values for bw, kernel, start, and
support are interdependent, and are chosen to make sense. E.g.,
the default value for support when start = beta is
c(0, 1).
The start argument defaults to uniform, which corresponds
to ordinary kernel density estimation. The typical default value for
kernel is gaussian.
If normalized is FALSE and start != "uniform", the resulting
density will not integrate to 1 in general.
kdensity returns an S3 function object of
base::class() "kdensity". This is a callable function with the
following elements, accessible by '$':
xThe data supplied in x.
bw_str, bw, adjust, hThe bandwidth function, the resulting
bandwidth, the adjust argument, and the adjusted
bandwidth.
kernel_str, kernel, start, start_str, supportName of the kernel, the kernel object, name of the parametric start, the start object, and the support of the density.
data.name, n, range, has.na, na.rm, normalizedName of the data, number of
observations, the range of the data, whether the data
x contained NA values, whether na.rm is TRUE
or not, and whether the density is normalized.
callThe call to kdensity.
estimatesNamed numeric vector containing the parameter estimates from the parametric start.
logLikThe log-likelihood of the parametric starts. Is NA
for the uniform start.
Hjort, Nils Lid, and Ingrid K. Glad. "Nonparametric density estimation with a parametric start." The Annals of Statistics (1995): 882-904.
Jones, M. C., and D. A. Henderson. "Miscellanea kernel-type density estimation on the unit interval." Biometrika 94.4 (2007): 977-984.
Chen, Song Xi. "Probability density function estimation using gamma kernels." Annals of the Institute of Statistical Mathematics 52.3 (2000): 471-480.
Silverman, Bernard W. Density estimation for statistics and data analysis. Vol. 26. CRC press, 1986.
The stats package function stats::density().
## Use gamma kernels to model positive data, the concentration of ## theophylline concentration <- Theoph$conc + 0.001 plot(kdensity(concentration, start = "gamma", kernel = "gamma", adjust = 1 / 3), ylim = c(0, 0.15), lwd = 2, main = "Concentration of theophylline" ) lines(kdensity(concentration, start = "gamma", kernel = "gaussian"), lty = 2, col = "grey", lwd = 2 ) lines(kdensity(concentration, start = "gaussian", kernel = "gaussian"), lty = 3, col = "blue", lwd = 2 ) lines(kdensity(concentration, start = "gaussian", kernel = "gamma", adjust = 1 / 3), lty = 4, col = "red", lwd = 2 ) rug(concentration)## Use gamma kernels to model positive data, the concentration of ## theophylline concentration <- Theoph$conc + 0.001 plot(kdensity(concentration, start = "gamma", kernel = "gamma", adjust = 1 / 3), ylim = c(0, 0.15), lwd = 2, main = "Concentration of theophylline" ) lines(kdensity(concentration, start = "gamma", kernel = "gaussian"), lty = 2, col = "grey", lwd = 2 ) lines(kdensity(concentration, start = "gaussian", kernel = "gaussian"), lty = 3, col = "blue", lwd = 2 ) lines(kdensity(concentration, start = "gaussian", kernel = "gamma", adjust = 1 / 3), lty = 4, col = "red", lwd = 2 ) rug(concentration)
Kernel functions are an important part of kdensity. This document
lists the available built-in functions and the structure of them. Any kernel
in the list can be used in kdensity by using kernel = "kernel"
for the intended kernel.
Be careful combining kernels with compact support with parametric starts,
as the normalizing integral typically fails to converge. Use gaussian
instead.
gaussian, normal: The Gaussian kernel. The default argument when
starts is supported on R.
epanechnikov, rectangular (uniform), triangular, biweight, cosine, optcosine: Standard symmetric kernels, also used in
stats::density().
tricube, triweight: Standard symmetric kernels. Not supported by
stats::density().
laplace: Uses the Laplace density, also known as the double
exponential density.
gamma, gamma_biased: The gamma kernel of Chen (2000). For use on the positive
half-line. gamma is the recommended biased-corrected kernel.
gcopula: The Gaussian copula kernel of Jones & Henderson (2007). For use
on the unit interval.
beta, beta_biased: The beta kernel of Chen (1999). For use on the unit interval.
beta is the recommended bias-corrected kernel.
A kernel is a list containing two mandatory elements and one optional
element. The mandatory element 'kernel' is the kernel function.
It takes arguments y, x, h, where x is the data supplied
to kdensity and y is the point of evaluation. h is
the bandwidth. Internally, the kernel function is evaluated as
1/h*kernel(y, x, h). It should be vectorized in x, but
vectorization in y is not needed.
The second mandatory element is support, stating the domain of
definition for the kernel. This is used to distinguish kernels on the
unit interval / positive half-line from kernels on R.
sd is used for symmetric kernels, and states the standard error
of the kernel. This is used to make kernels comparable to the Gaussian
kernel when calculating bandwidths.
Chen, Song Xi. "Probability density function estimation using gamma kernels." Annals of the Institute of Statistical Mathematics 52.3 (2000): 471-480. Jones, M. C., and D. A. Henderson. "Kernel-type density estimation on the unit interval." Biometrika 94.4 (2007): 977-984. Chen, Song Xi. "Beta kernel estimators for density functions." Computational Statistics & Data Analysis 31.2 (1999): 131-145.
kdensity(); parametric_starts();
bandwidths().
gaussian <- list( kernel = function(y, x, h) stats::dnorm((y - x) / h), sd = 1, support = c(-Inf, Inf) ) gcopula <- list( kernel = function(y, x, h) { rho <- 1 - h^2 inside <- rho^2 * (qnorm(y)^2 + qnorm(x)^2) - 2 * rho * qnorm(y) * qnorm(x) exp(-inside / (2 * (1 - rho^2))) }, support = c(0, 1) )gaussian <- list( kernel = function(y, x, h) stats::dnorm((y - x) / h), sd = 1, support = c(-Inf, Inf) ) gcopula <- list( kernel = function(y, x, h) { rho <- 1 - h^2 inside <- rho^2 * (qnorm(y)^2 + qnorm(x)^2) - 2 * rho * qnorm(y) * qnorm(x) exp(-inside / (2 * (1 - rho^2))) }, support = c(0, 1) )
A parametric start is a density function with an associated estimator which
is used as a starting point in kdensity. Several parametric starts
are implemented, all with maximum likelihood estimation. Custom-made
parametric starts are possible, see the Structure section.
The parametric start contains three elements: The density function, an
estimation function, and the support of the density. The parameters of
the density function must partially match the parameters of the estimator
function. The estimator function takes one argument, a numeric vector,
which is passed from kdensity.
kdensity supports more than
20 built-in starts from the univariateML package, see
univariateML::univariateML_models for a list. Densities with variable
support, power, are not supported. The pareto density has its
support fixed to (1,Inf). The
options uniform, constant makes kdensity estimate a kernel
density without parametric starts.
kdensity(); kernels(); bandwidths()
start_exponential <- list( density = stats::dexp, estimator = function(data) { c(rate = 1 / mean(data)) }, support = c(0, Inf) ) start_inverse_gaussian <- list( density = extraDistr::dwald, estimator = function(data) { c( mu = mean(data), lambda = mean(1 / data - 1 / mean(data)) ) }, support = c(0, Inf) )start_exponential <- list( density = stats::dexp, estimator = function(data) { c(rate = 1 / mean(data)) }, support = c(0, Inf) ) start_inverse_gaussian <- list( density = extraDistr::dwald, estimator = function(data) { c( mu = mean(data), lambda = mean(1 / data - 1 / mean(data)) ) }, support = c(0, Inf) )
The plot method for kdensity objects.
## S3 method for class 'kdensity' plot(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...) ## S3 method for class 'kdensity' lines(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...) ## S3 method for class 'kdensity' points(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...)## S3 method for class 'kdensity' plot(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...) ## S3 method for class 'kdensity' lines(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...) ## S3 method for class 'kdensity' points(x, range = NULL, plot_start = FALSE, zero_line = TRUE, ...)
x |
a |
range |
range of x values. |
plot_start |
logical; if |
zero_line |
logical; if |
... |
further plotting parameters. |
None.
## Using the data set "precip" to eye-ball the similarity between ## a kernel fit, a parametric fit, and a kernel with parametric start fit. kde_gamma <- kdensity(precip, kernel = "gaussian", start = "gamma") kde <- kdensity(precip, kernel = "gaussian", start = "uniform") plot(kde_gamma, main = "Annual Precipitation in US Cities") lines(kde_gamma, plot_start = TRUE, lty = 2) lines(kde, lty = 3) rug(precip)## Using the data set "precip" to eye-ball the similarity between ## a kernel fit, a parametric fit, and a kernel with parametric start fit. kde_gamma <- kdensity(precip, kernel = "gaussian", start = "gamma") kde <- kdensity(precip, kernel = "gaussian", start = "uniform") plot(kde_gamma, main = "Annual Precipitation in US Cities") lines(kde_gamma, plot_start = TRUE, lty = 2) lines(kde, lty = 3) rug(precip)