R/pglmm_compare.R
pglmm_compare.Rd
pglmm_compare
performs linear regression for Gaussian, binomial and Poisson
phylogenetic data, estimating regression coefficients with approximate standard
errors. It simultaneously estimates the strength of phylogenetic signal in the
residuals and gives an approximate conditional likelihood ratio test for the
hypothesis that there is no signal. Therefore, when applied without
predictor (independent) variables, it gives a test for phylogenetic signal.
pglmm_compare
is a wrapper for pglmm
tailored for comparative data in
which each value of the response (dependent) variable corresponds to a single tip
on a phylogenetic tree. If there are multiple measures for each species, pglmm
will be helpful.
pglmm_compare(
formula,
family = "gaussian",
data = list(),
phy,
REML = TRUE,
optimizer = c("nelder-mead-nlopt", "bobyqa", "Nelder-Mead", "subplex"),
add.obs.re = TRUE,
verbose = FALSE,
cpp = TRUE,
bayes = FALSE,
reltol = 10^-6,
maxit = 500,
tol.pql = 10^-6,
maxit.pql = 200,
marginal.summ = "mean",
calc.DIC = FALSE,
prior = "inla.default",
prior_alpha = 0.1,
prior_mu = 1,
ML.init = FALSE,
s2.init = 1,
B.init = NULL
)
A two-sided linear formula object describing the
fixed-effects of the model; for example, Y ~ X. Binomial data can be either
presence/absence, or a two-column array of 'successes' and 'failures'.
For both binomial and Poisson data, we add an observation-level
random term by default via add.obs.re = TRUE
.
Either "gaussian" for a Linear Mixed Model, or
"binomial" or "poisson" for Generalized Linear Mixed Models.
family
should be specified as a character string (i.e., quoted). For binomial and
Poisson data, we use the canonical logit and log link functions, respectively.
Binomial data can be either presence/absence, or a two-column array of 'successes' and 'failures'.
For both Poisson and binomial data, we add an observation-level
random term by default via add.obs.re = TRUE
. If bayes = TRUE
there are
two additional families available: "zeroinflated.binomial", and "zeroinflated.poisson",
which add a zero inflation parameter; this parameter gives the probability that the response is
a zero. The rest of the parameters of the model then reflect the "non-zero" part
of the model. Note that "zeroinflated.binomial" only makes sense for success/failure
response data.
A data frame containing the variables named in formula. It must has the tip labels of the phylogeny as row names; if they are not in the same order, the data frame will be arranged so that row names match the order of tip labels.
A phylogenetic tree as an object of class "phylo".
Whether REML or ML is used for model fitting the random effects. Ignored if
bayes = TRUE
.
nelder-mead-nlopt (default), bobyqa, Nelder-Mead, or subplex.
Nelder-Mead is from the stats package and the other optimizers are from the nloptr package.
Ignored if bayes = TRUE
.
Whether to add observation-level random term for binomial and Poisson
families. Normally it would be a good idea to add this to account for overdispersion,
so add.obs.re = TRUE
by default.
If TRUE
, the model deviance and running
estimates of s2
and B
are plotted each iteration
during optimization.
Whether to use C++ function for optim. Default is TRUE. Ignored if bayes = TRUE
.
Whether to fit a Bayesian version of the PGLMM using r-inla
. We recommend
against Bayesian fitting for non-Gaussian data unless sample sizes are large (>1000), because
the phylogenetic variance tends to get trapped near zero.
A control parameter dictating the relative tolerance
for convergence in the optimization; see optim
.
A control parameter dictating the maximum number of
iterations in the optimization; see optim
.
A control parameter dictating the tolerance for
convergence in the PQL estimates of the mean components of the
GLMM. Ignored if family = "gaussian"
or bayes = TRUE
.
A control parameter dictating the maximum number
of iterations in the PQL estimates of the mean components of the
GLMM. Ignored if family = "gaussian"
or bayes = TRUE
.
Summary statistic to use for the estimate of coefficients when
doing a Bayesian PGLMM (when bayes = TRUE
). Options are: "mean",
"median", or "mode", referring to different characterizations of the central
tendency of the Bayesian posterior marginal distributions. Ignored if bayes = FALSE
.
Should the Deviance Information Criterion be calculated and returned,
when doing a Bayesian PGLMM? Ignored if bayes = FALSE
.
Which type of default prior should be used by pglmm
?
Only used if bayes = TRUE
. There are currently four options:
"inla.default", which uses the default INLA
priors; "pc.prior.auto", which uses a
complexity penalizing prior (as described in
Simpson et al. (2017)) designed to automatically
choose good parameters (only available for gaussian and binomial responses); "pc.prior", which
allows the user to set custom parameters on the "pc.prior" prior, using the prior_alpha
and prior_mu
parameters (Run INLA::inla.doc("pc.prec")
for details on these
parameters); and "uninformative", which sets a very uninformative prior
(nearly uniform) by using a very flat exponential distribution. The last option is generally
not recommended but may in some cases give estimates closer to the maximum likelihood estimates.
"pc.prior.auto" is only implemented for family = "gaussian"
and family = "binomial"
currently.
Only used if bayes = TRUE
and prior = "pc.prior"
, in
which case it sets the alpha parameter of INLA
's complexity penalizing prior for the
random effects.The prior is an exponential distribution where prob(sd > mu) = alpha,
where sd is the standard deviation of the random effect.
Only used if bayes = TRUE
and prior = "pc.prior"
, in
which case it sets the mu parameter of INLA
's complexity penalizing prior for the
random effects.The prior is an exponential distribution where prob(sd > mu) = alpha,
where sd is the standard deviation of the random effect.
Only relevant if bayes = TRUE
. Should maximum
likelihood estimates be calculated and used as initial values for
the bayesian model fit? Sometimes this can be helpful, but most of the
time it may not help; thus, we set the default to FALSE
. Also, it
does not work with the zero-inflated families.
An array of initial estimates of s2. If s2.init is not provided for
family="gaussian"
, these are estimated using lm
assuming
no phylogenetic signal. If s2.init
is not
provided for family = "binomial"
, these are set to 0.25.
Initial estimates of \(B\), a matrix containing
regression coefficients in the model for the fixed effects. This
matrix must have dim(B.init) = c(p + 1, 1)
, where p
is the
number of predictor (independent) variables; the first element of
B
corresponds to the intercept, and the remaining elements
correspond in order to the predictor (independent) variables in the
formula. If B.init
is not provided, these are estimated
using lm
or glm
assuming no phylogenetic signal.
An object (list) of class pglmm_compare
with the following elements:
the formula for fixed effects
the formula for both fixed effects and random effects
the dataset
either gaussian
or binomial
or poisson
depending on the model fit
estimates of the regression coefficients
approximate standard errors of the fixed effects regression coefficients.
This is set to NULL if bayes = TRUE
.
approximate bayesian credible interval of the fixed effects regression coefficients.
This is set to NULL if bayes = FALSE
approximate covariance matrix for the fixed effects regression coefficients
approximate Z scores for the fixed effects regression coefficients. This is set to NULL if bayes = TRUE
approximate tests for the fixed effects regression coefficients being different from zero. This is set to NULL if bayes = TRUE
random effects' standard deviations for the covariance matrix \(\sigma^2V\) for each random effect in order. For the linear mixed model, the residual variance is listed last
random effects variances for non-nested random effects
random effects variances for nested random effects
for linear mixed models, the residual variance
Bayesian credible interval for random effects variances for non-nested random effects.
This is set to NULL if bayes = FALSE
Bayesian credible interval for random effects variances for nested random effects.
This is set to NULL if bayes = FALSE
Bayesian credible interval for linear mixed models, the residual variance.
This is set to NULL if bayes = FALSE
for linear mixed models, the log-likelihood for either the restricted likelihood (REML=TRUE
) or the overall likelihood (REML=FALSE
). This is set to NULL for generalised linear mixed models. If bayes = TRUE
, this is the marginal log-likelihood
for linear mixed models, the AIC for either the restricted likelihood (REML=TRUE
) or the overall likelihood (REML=FALSE
). This is set to NULL for generalised linear mixed models
for linear mixed models, the BIC for either the restricted likelihood (REML=TRUE
) or the overall likelihood (REML=FALSE
). This is set to NULL for generalised linear mixed models
for bayesian PGLMM, this is the Deviance Information Criterion metric of model fit. This is set to NULL if bayes = FALSE
.
whether or not REML is used (TRUE
or FALSE
).
whether or not a Bayesian model was fit.
The specified summary statistic used to summarise the Bayesian marginal distributions.
Only present if bayes = TRUE
the user-provided initial estimates of s2
the user-provided initial estimates of B
the response (dependent) variable returned in matrix form
the predictor (independent) variables returned in matrix form (including 1s in the first column)
the residuals. For linear mixed models, this does not account for random terms,
To get residuals after accounting for both fixed and random terms, use residuals()
.
For the generalized linear mixed model, these are the predicted residuals in the
logit -1 space.
the inverse of the covariance matrix. This is NULL if bayes = TRUE
.
predicted mean values for the generalized linear mixed model (i.e. similar to fitted(merMod)
).
Set to NULL for linear mixed models, for which we can use fitted()
.
the design matrix for random effects. This is set to NULL if bayes = TRUE
diagonal matrix that maps the random effects variances onto the design matrix
the convergence code provided by optim
. This is set to NULL if bayes = TRUE
number of iterations performed by optim
. This is set to NULL if bayes = TRUE
Model object fit by underlying inla
function. Only returned
if bayes = TRUE
pglmm_compare
in the package phyr
is similar to binaryPGLMM
in
the package ape
, although it has much broader functionality, including
accepting more than just binary data, implementing Bayesian analyses, etc.
For non-Gaussian data, the function estimates parameters for the model
$$Pr(Y = 1) = \theta $$ $$\theta = inverse.link(b0 + b1 * x1 + b2 * x2 + \dots + \epsilon)$$ $$\epsilon ~ Gaussian(0, s2 * V) $$
where \(V\) is a covariance matrix derived from a phylogeny
(typically under the assumption of Brownian motion evolution). Although
mathematically there is no requirement for \(V\) to be ultrametric,
forcing \(V\) into ultrametric form can aide in the interpretation of the
model. This is especially true for binary data, because in regression for
binary dependent variables, only the off-diagonal elements (i.e., covariances)
of matrix \(V\) are biologically meaningful (see Ives & Garland 2014).
The function converts a phylo tree object into a covariance matrix,
and further standardizes this matrix to have determinant = 1. This in effect
standardizes the interpretation of the scalar s2
. Although mathematically
not required, it is a very good idea to standardize the predictor
(independent) variables to have mean 0 and variance 1. This will make the
function more robust and improve the interpretation of the regression
coefficients.
For Gaussian data, the function estimates parameters for the model
$$Y = b0 + b1 * x1 + b2 * x2 + \dots + \epsilon)$$ $$\epsilon ~ Gaussian(0, s2 * V + s2resid * I) $$
where \(s2resid * I\) gives the non-phylogenetic residual variance. Note that this is equivalent to a model with Pagel's lambda transformation.
Ives, A. R. and Helmus, M. R. (2011) Generalized linear mixed models for phylogenetic analyses of community structure. Ecological Monographs, 81, 511--525.
Ives, A. R. and Garland, T., Jr. (2014) Phylogenetic regression for binary dependent variables. Pages 231--261 in L. Z. Garamszegi, editor. Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology. Springer-Verlag, Berlin Heidelberg.
pglmm
; package ape and its function binaryPGLMM
;
package phylolm and its function phyloglm
; package MCMCglmm
## Illustration of `pglmm_compare` with simulated data
# Generate random phylogeny
library(ape)
n <- 100
phy <- compute.brlen(rtree(n=n), method = "Grafen", power = 1)
# Generate random data and standardize to have mean 0 and variance 1
X1 <- rTraitCont(phy, model = "BM", sigma = 1)
X1 <- (X1 - mean(X1))/var(X1)
# Simulate binary Y
sim.dat <- data.frame(Y = array(0, dim = n), X1 = X1, row.names = phy$tip.label)
sim.dat$Y <- ape::binaryPGLMM.sim(Y ~ X1, phy = phy, data=sim.dat, s2 = 1,
B = matrix(c(0, .25), nrow = 2, ncol = 1),
nrep = 1)$Y
# Fit model
pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat)
#> Generalized linear mixed model for binomial data fit by restricted maximum likelihood
#>
#> Call:Y ~ X1
#> <environment: 0x000001ff75a42b00>
#>
#> logLik AIC BIC
#> -52.56 113.13 118.97
#>
#> Phylogenetic random effects variance (s2):
#> Variance Std.Dev
#> s2 0.06239 0.2498
#>
#> Fixed effects:
#> Value Std.Error Zscore Pvalue
#> (Intercept) 1.18920 0.65037 1.8285 0.06747 .
#> X1 -0.20456 0.36923 -0.5540 0.57957
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
# Compare with `binaryPGLMM`
ape::binaryPGLMM(Y ~ X1, phy = phy, data = sim.dat)
#>
#>
#> Call:Y ~ X1
#> <environment: 0x000001ff75a42b00>
#>
#> Random effect (phylogenetic signal s2):
#> s2 Pr
#> 1 1.143 0.009459
#>
#> Fixed effects:
#> Value Std.Error Zscore Pvalue
#> (Intercept) 1.18937 0.65263 1.8224 0.06839 .
#> X1 -0.20599 0.37001 -0.5567 0.57772
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
# Compare with `phyloglm`
summary(phylolm::phyloglm(Y ~ X1, phy = phy, data = sim.dat))
#> Warning: the estimate of 'alpha' (54.5427537597569) reached the upper bound (54.5981500331442).
#> This may simply reflect a flat likelihood at large alpha values,
#> meaning that the phylogenetic correlation is estimated to be negligible.
#> Warning: phyloglm failed to converge.
#>
#> Call:
#> phylolm::phyloglm(formula = Y ~ X1, data = sim.dat, phy = phy)
#> AIC logLik Pen.logLik
#> 117.37 -55.68 -52.75
#>
#> Method: logistic_MPLE
#> Mean tip height: 1
#> Parameter estimate(s):
#> alpha: 54.54275
#>
#> Coefficients:
#> Estimate StdErr z.value p.value
#> (Intercept) 1.03964 0.25772 4.034 5.482e-05 ***
#> X1 0.12630 0.20636 0.612 0.5405
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Note: Wald-type p-values for coefficients, conditional on alpha=54.54275
#>
# Compare with `glm` that does not account for phylogeny
summary(glm(Y ~ X1, data = sim.dat, family = "binomial"))
#>
#> Call:
#> glm(formula = Y ~ X1, family = "binomial", data = sim.dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.8498 0.6304 0.6880 0.7353 0.8395
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.2173 0.2395 5.084 3.71e-07 ***
#> X1 0.1489 0.1920 0.776 0.438
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 107.86 on 99 degrees of freedom
#> Residual deviance: 107.26 on 98 degrees of freedom
#> AIC: 111.26
#>
#> Number of Fisher Scoring iterations: 4
#>
# Compare with logistf() that does not account
# for phylogeny but is less biased than glm()
logistf::logistf(Y ~ X1, data = sim.dat)
#> Error in loadNamespace(x): there is no package called 'logistf'
## Fit model with bayes = TRUE
# pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat,
# bayes = TRUE, calc.DIC = TRUE)
# Compare with `MCMCglmm`
V <- vcv(phy)
V <- V/max(V)
detV <- exp(determinant(V)$modulus[1])
V <- V/detV^(1/n)
invV <- Matrix::Matrix(solve(V),sparse = TRUE)
sim.dat$species <- phy$tip.label
rownames(invV) <- sim.dat$species
nitt <- 43000
thin <- 10
burnin <- 3000
prior <- list(R=list(V=1, fix=1), G=list(G1=list(V=1, nu=1000, alpha.mu=0, alpha.V=1)))
# commented out to save time
# summary(MCMCglmm::MCMCglmm(Y ~ X1, random = ~species, ginvers = list(species = invV),
# data = sim.dat, slice = TRUE, nitt = nitt, thin = thin, burnin = burnin,
# family = "categorical", prior = prior, verbose = FALSE))