Phylogenetic Generalized Linear Mixed Model for Comparative Data

pglmm_compare performs linear regression for Gaussian, binomial and Poisson phylogenetic data, estimating regression coefficients with approximate standard errors. It simultaneously estimates the strength of phylogenetic signal in the residuals and gives an approximate conditional likelihood ratio test for the hypothesis that there is no signal. Therefore, when applied without predictor (independent) variables, it gives a test for phylogenetic signal. pglmm_compare is a wrapper for pglmm tailored for comparative data in which each value of the response (dependent) variable corresponds to a single tip on a phylogenetic tree. If there are multiple measures for each species, pglmm will be helpful.

pglmm_compare(
  formula,
  family = "gaussian",
  data = list(),
  phy,
  REML = TRUE,
  optimizer = c("nelder-mead-nlopt", "bobyqa", "Nelder-Mead", "subplex"),
  add.obs.re = TRUE,
  verbose = FALSE,
  cpp = TRUE,
  bayes = FALSE,
  reltol = 10^-6,
  maxit = 500,
  tol.pql = 10^-6,
  maxit.pql = 200,
  marginal.summ = "mean",
  calc.DIC = FALSE,
  prior = "inla.default",
  prior_alpha = 0.1,
  prior_mu = 1,
  ML.init = FALSE,
  s2.init = 1,
  B.init = NULL
)

Arguments

formula: A two-sided linear formula object describing the fixed-effects of the model; for example, Y ~ X. Binomial data can be either presence/absence, or a two-column array of 'successes' and 'failures'. For both binomial and Poisson data, we add an observation-level random term by default via add.obs.re = TRUE.
family: Either "gaussian" for a Linear Mixed Model, or "binomial" or "poisson" for Generalized Linear Mixed Models. family should be specified as a character string (i.e., quoted). For binomial and Poisson data, we use the canonical logit and log link functions, respectively. Binomial data can be either presence/absence, or a two-column array of 'successes' and 'failures'. For both Poisson and binomial data, we add an observation-level random term by default via add.obs.re = TRUE. If bayes = TRUE there are two additional families available: "zeroinflated.binomial", and "zeroinflated.poisson", which add a zero inflation parameter; this parameter gives the probability that the response is a zero. The rest of the parameters of the model then reflect the "non-zero" part of the model. Note that "zeroinflated.binomial" only makes sense for success/failure response data.
data: A data frame containing the variables named in formula. It must has the tip labels of the phylogeny as row names; if they are not in the same order, the data frame will be arranged so that row names match the order of tip labels.
phy: A phylogenetic tree as an object of class "phylo".
REML: Whether REML or ML is used for model fitting the random effects. Ignored if bayes = TRUE.
optimizer: nelder-mead-nlopt (default), bobyqa, Nelder-Mead, or subplex. Nelder-Mead is from the stats package and the other optimizers are from the nloptr package. Ignored if bayes = TRUE.
add.obs.re: Whether to add observation-level random term for binomial and Poisson families. Normally it would be a good idea to add this to account for overdispersion, so add.obs.re = TRUE by default.
verbose: If TRUE, the model deviance and running estimates of s2 and B are plotted each iteration during optimization.
cpp: Whether to use C++ function for optim. Default is TRUE. Ignored if bayes = TRUE.
bayes: Whether to fit a Bayesian version of the PGLMM using r-inla. We recommend against Bayesian fitting for non-Gaussian data unless sample sizes are large (>1000), because the phylogenetic variance tends to get trapped near zero.
reltol: A control parameter dictating the relative tolerance for convergence in the optimization; see optim.
maxit: A control parameter dictating the maximum number of iterations in the optimization; see optim.
tol.pql: A control parameter dictating the tolerance for convergence in the PQL estimates of the mean components of the GLMM. Ignored if family = "gaussian" or bayes = TRUE.
maxit.pql: A control parameter dictating the maximum number of iterations in the PQL estimates of the mean components of the GLMM. Ignored if family = "gaussian" or bayes = TRUE.
marginal.summ: Summary statistic to use for the estimate of coefficients when doing a Bayesian PGLMM (when bayes = TRUE). Options are: "mean", "median", or "mode", referring to different characterizations of the central tendency of the Bayesian posterior marginal distributions. Ignored if bayes = FALSE.
calc.DIC: Should the Deviance Information Criterion be calculated and returned, when doing a Bayesian PGLMM? Ignored if bayes = FALSE.
prior: Which type of default prior should be used by pglmm? Only used if bayes = TRUE. There are currently four options: "inla.default", which uses the default INLA priors; "pc.prior.auto", which uses a complexity penalizing prior (as described in Simpson et al. (2017)) designed to automatically choose good parameters (only available for gaussian and binomial responses); "pc.prior", which allows the user to set custom parameters on the "pc.prior" prior, using the prior_alpha and prior_mu parameters (Run INLA::inla.doc("pc.prec") for details on these parameters); and "uninformative", which sets a very uninformative prior (nearly uniform) by using a very flat exponential distribution. The last option is generally not recommended but may in some cases give estimates closer to the maximum likelihood estimates. "pc.prior.auto" is only implemented for family = "gaussian" and family = "binomial" currently.
prior_alpha: Only used if bayes = TRUE and prior = "pc.prior", in which case it sets the alpha parameter of INLA's complexity penalizing prior for the random effects.The prior is an exponential distribution where prob(sd > mu) = alpha, where sd is the standard deviation of the random effect.
prior_mu: Only used if bayes = TRUE and prior = "pc.prior", in which case it sets the mu parameter of INLA's complexity penalizing prior for the random effects.The prior is an exponential distribution where prob(sd > mu) = alpha, where sd is the standard deviation of the random effect.
ML.init: Only relevant if bayes = TRUE. Should maximum likelihood estimates be calculated and used as initial values for the bayesian model fit? Sometimes this can be helpful, but most of the time it may not help; thus, we set the default to FALSE. Also, it does not work with the zero-inflated families.
s2.init: An array of initial estimates of s2. If s2.init is not provided for family="gaussian", these are estimated using lm assuming no phylogenetic signal. If s2.init is not provided for family = "binomial", these are set to 0.25.
B.init: Initial estimates of $B$, a matrix containing regression coefficients in the model for the fixed effects. This matrix must have dim(B.init) = c(p + 1, 1), where p is the number of predictor (independent) variables; the first element of B corresponds to the intercept, and the remaining elements correspond in order to the predictor (independent) variables in the formula. If B.init is not provided, these are estimated using lm or glm assuming no phylogenetic signal.

Value

An object (list) of class pglmm_compare with the following elements:

formula: the formula for fixed effects
formula_original: the formula for both fixed effects and random effects
data: the dataset
family: either gaussian or binomial or poisson depending on the model fit
B: estimates of the regression coefficients
B.se: approximate standard errors of the fixed effects regression coefficients. This is set to NULL if bayes = TRUE.
B.ci: approximate bayesian credible interval of the fixed effects regression coefficients. This is set to NULL if bayes = FALSE
B.cov: approximate covariance matrix for the fixed effects regression coefficients
B.zscore: approximate Z scores for the fixed effects regression coefficients. This is set to NULL if bayes = TRUE
B.pvalue: approximate tests for the fixed effects regression coefficients being different from zero. This is set to NULL if bayes = TRUE
ss: random effects' standard deviations for the covariance matrix $\sigma^2V$ for each random effect in order. For the linear mixed model, the residual variance is listed last
s2r: random effects variances for non-nested random effects
s2n: random effects variances for nested random effects
s2resid: for linear mixed models, the residual variance
s2r.ci: Bayesian credible interval for random effects variances for non-nested random effects. This is set to NULL if bayes = FALSE
s2n.ci: Bayesian credible interval for random effects variances for nested random effects. This is set to NULL if bayes = FALSE
s2resid.ci: Bayesian credible interval for linear mixed models, the residual variance. This is set to NULL if bayes = FALSE
logLik: for linear mixed models, the log-likelihood for either the restricted likelihood (REML=TRUE) or the overall likelihood (REML=FALSE). This is set to NULL for generalised linear mixed models. If bayes = TRUE, this is the marginal log-likelihood
AIC: for linear mixed models, the AIC for either the restricted likelihood (REML=TRUE) or the overall likelihood (REML=FALSE). This is set to NULL for generalised linear mixed models
BIC: for linear mixed models, the BIC for either the restricted likelihood (REML=TRUE) or the overall likelihood (REML=FALSE). This is set to NULL for generalised linear mixed models
DIC: for bayesian PGLMM, this is the Deviance Information Criterion metric of model fit. This is set to NULL if bayes = FALSE.
REML: whether or not REML is used (TRUE or FALSE).
bayes: whether or not a Bayesian model was fit.
marginal.summ: The specified summary statistic used to summarise the Bayesian marginal distributions. Only present if bayes = TRUE
s2.init: the user-provided initial estimates of s2
B.init: the user-provided initial estimates of B
Y: the response (dependent) variable returned in matrix form
X: the predictor (independent) variables returned in matrix form (including 1s in the first column)
H: the residuals. For linear mixed models, this does not account for random terms, To get residuals after accounting for both fixed and random terms, use residuals(). For the generalized linear mixed model, these are the predicted residuals in the logit -1 space.
iV: the inverse of the covariance matrix. This is NULL if bayes = TRUE.
mu: predicted mean values for the generalized linear mixed model (i.e. similar to fitted(merMod)). Set to NULL for linear mixed models, for which we can use fitted().
Zt: the design matrix for random effects. This is set to NULL if bayes = TRUE
St: diagonal matrix that maps the random effects variances onto the design matrix
convcode: the convergence code provided by optim. This is set to NULL if bayes = TRUE
niter: number of iterations performed by optim. This is set to NULL if bayes = TRUE
inla.model: Model object fit by underlying inla function. Only returned if bayes = TRUE

Details

pglmm_compare in the package phyr is similar to binaryPGLMM in the package ape, although it has much broader functionality, including accepting more than just binary data, implementing Bayesian analyses, etc.

For non-Gaussian data, the function estimates parameters for the model

$$Pr(Y = 1) = \theta $$ $$\theta = inverse.link(b0 + b1 * x1 + b2 * x2 + \dots + \epsilon)$$ $$\epsilon ~ Gaussian(0, s2 * V) $$

where $V$ is a covariance matrix derived from a phylogeny (typically under the assumption of Brownian motion evolution). Although mathematically there is no requirement for $V$ to be ultrametric, forcing $V$ into ultrametric form can aide in the interpretation of the model. This is especially true for binary data, because in regression for binary dependent variables, only the off-diagonal elements (i.e., covariances) of matrix $V$ are biologically meaningful (see Ives & Garland 2014). The function converts a phylo tree object into a covariance matrix, and further standardizes this matrix to have determinant = 1. This in effect standardizes the interpretation of the scalar s2. Although mathematically not required, it is a very good idea to standardize the predictor (independent) variables to have mean 0 and variance 1. This will make the function more robust and improve the interpretation of the regression coefficients.

For Gaussian data, the function estimates parameters for the model

$$Y = b0 + b1 * x1 + b2 * x2 + \dots + \epsilon)$$ $$\epsilon ~ Gaussian(0, s2 * V + s2resid * I) $$

where $s2resid * I$ gives the non-phylogenetic residual variance. Note that this is equivalent to a model with Pagel's lambda transformation.

References

Ives, A. R. and Helmus, M. R. (2011) Generalized linear mixed models for phylogenetic analyses of community structure. Ecological Monographs, 81, 511--525.

Ives, A. R. and Garland, T., Jr. (2014) Phylogenetic regression for binary dependent variables. Pages 231--261 in L. Z. Garamszegi, editor. Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology. Springer-Verlag, Berlin Heidelberg.

Author

Anthony R. Ives

Examples


## Illustration of `pglmm_compare` with simulated data

# Generate random phylogeny

library(ape)

n <- 100
phy <- compute.brlen(rtree(n=n), method = "Grafen", power = 1)

# Generate random data and standardize to have mean 0 and variance 1
X1 <- rTraitCont(phy, model = "BM", sigma = 1)
X1 <- (X1 - mean(X1))/var(X1)

# Simulate binary Y
sim.dat <- data.frame(Y = array(0, dim = n), X1 = X1, row.names = phy$tip.label)
sim.dat$Y <- ape::binaryPGLMM.sim(Y ~ X1, phy = phy, data=sim.dat, s2 = 1,
                             B = matrix(c(0, .25), nrow = 2, ncol = 1), 
                             nrep = 1)$Y

# Fit model
pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat)
#> Generalized linear mixed model for binomial data fit by restricted maximum likelihood
#> 
#> Call:Y ~ X1
#> <environment: 0x000001ff75a42b00>
#> 
#> logLik    AIC    BIC 
#> -52.56 113.13 118.97 
#> 
#> Phylogenetic random effects variance (s2):
#>    Variance Std.Dev
#> s2  0.06239  0.2498
#> 
#> Fixed effects:
#>                Value Std.Error  Zscore  Pvalue  
#> (Intercept)  1.18920   0.65037  1.8285 0.06747 .
#> X1          -0.20456   0.36923 -0.5540 0.57957  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 

# Compare with `binaryPGLMM`
ape::binaryPGLMM(Y ~ X1, phy = phy, data = sim.dat)
#> 
#> 
#> Call:Y ~ X1
#> <environment: 0x000001ff75a42b00>
#> 
#> Random effect (phylogenetic signal s2):
#>      s2       Pr
#> 1 1.143 0.009459
#> 
#> Fixed effects:
#>                Value Std.Error  Zscore  Pvalue  
#> (Intercept)  1.18937   0.65263  1.8224 0.06839 .
#> X1          -0.20599   0.37001 -0.5567 0.57772  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 

# Compare with `phyloglm`
summary(phylolm::phyloglm(Y ~ X1, phy = phy, data = sim.dat))
#> Warning: the estimate of 'alpha' (54.5427537597569) reached the upper bound (54.5981500331442).
#>  This may simply reflect a flat likelihood at large alpha values,
#>  meaning that the phylogenetic correlation is estimated to be negligible.
#> Warning: phyloglm failed to converge.
#> 
#> Call:
#> phylolm::phyloglm(formula = Y ~ X1, data = sim.dat, phy = phy)
#>        AIC     logLik Pen.logLik 
#>     117.37     -55.68     -52.75 
#> 
#> Method: logistic_MPLE
#> Mean tip height: 1
#> Parameter estimate(s):
#> alpha: 54.54275 
#> 
#> Coefficients:
#>             Estimate  StdErr z.value   p.value    
#> (Intercept)  1.03964 0.25772   4.034 5.482e-05 ***
#> X1           0.12630 0.20636   0.612    0.5405    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Note: Wald-type p-values for coefficients, conditional on alpha=54.54275
#> 

# Compare with `glm` that does not account for phylogeny
summary(glm(Y ~ X1, data = sim.dat, family = "binomial"))
#> 
#> Call:
#> glm(formula = Y ~ X1, family = "binomial", data = sim.dat)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.8498   0.6304   0.6880   0.7353   0.8395  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   1.2173     0.2395   5.084 3.71e-07 ***
#> X1            0.1489     0.1920   0.776    0.438    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 107.86  on 99  degrees of freedom
#> Residual deviance: 107.26  on 98  degrees of freedom
#> AIC: 111.26
#> 
#> Number of Fisher Scoring iterations: 4
#> 

# Compare with logistf() that does not account
# for phylogeny but is less biased than glm()
logistf::logistf(Y ~ X1, data = sim.dat)
#> Error in loadNamespace(x): there is no package called 'logistf'

## Fit model with bayes = TRUE
# pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat, 
#               bayes = TRUE, calc.DIC = TRUE)

# Compare with `MCMCglmm`

V <- vcv(phy)
V <- V/max(V)
detV <- exp(determinant(V)$modulus[1])
V <- V/detV^(1/n)

invV <- Matrix::Matrix(solve(V),sparse = TRUE)
sim.dat$species <- phy$tip.label
rownames(invV) <- sim.dat$species

nitt <- 43000
thin <- 10
burnin <- 3000

prior <- list(R=list(V=1, fix=1), G=list(G1=list(V=1, nu=1000, alpha.mu=0, alpha.V=1)))
# commented out to save time
# summary(MCMCglmm::MCMCglmm(Y ~ X1, random = ~species, ginvers = list(species = invV),
#     data = sim.dat, slice = TRUE, nitt = nitt, thin = thin, burnin = burnin,
#    family = "categorical", prior = prior, verbose = FALSE))