correlated due to sharing a random intercep
This correlation only exists if b_i varies over i = 1. , ... sampling units, i.e., sigma_bi > 0.
correlated due to sharing a random intercep
This correlation only exists if b_i varies over i = 1. , ... sampling units, i.e., sigma_bi > 0.
you might consider
It seems like you "need to", not "might", consider adding a random effect. The repeated observations is a sampling design parameter as it is restricting the patient-time pairing that could be sampled in a simple random sample. You might consider not including a random intercept if there is clearly no impact from the clustering factor. Even then the observations are only behaving as if they are independent because of no cluster (e.g., lake) effect; they still are correlated observations from sampling perspective.
random effects induces
I get the impression that adding the random effects to the statistical model induces correlation, but isn't it the case that the study design induces the correlation and the model is trying to do just that, model the correlation.
corrlation
"correlation"
Thus, a common recommendation is to use ML when comparing models (e.g., using likelihood ratio tests) that differ only in their fixed effects
Piniero & Bates p.87 do not recommend using likelhood ratio tests for fixed effects, but rather use t-tests from summary(lme).
would
"would be"
If this seems confusing, you are not alone. This is another situation where life just seems easier if you are a Bayesian since all parameters are treated as random variables and assigned probability distributions.
I find these sentences unhelpful. There are times when it is appropriate to consider results as a fixed effect rather than a random effect. For example for Se fish, what if those 9 lakes sampled were all the lakes "nearby"? Then assuming a model where they are just a sample from a larger population will lead to poorer estimates than the correct fixed effects model. One could argue that it is more important to do things correctly, rather than just the easy way...
The deviations from the average intercept and slope, b0ib0ib_{0i} and b1ib1ib_{1i}, are also missing in action. What gives?
These can be found quite easily. I am not sure why you include this sentence rather than a helpful sentence on how easy it is to generate these values, which you show just a paragraph later.
and
I think this is an important "and" as it applies to all scenarios described in Fig 18.5
within cluster variability is low
Do you mean when trends are strong, i.e., variability of residuals for within cluster regression is low? It seems likely that if cluster variability is low you would go with random intercept model.
Schielzeth & Forstmeier
At least for this paper, the authors compared random intercept vs random slope model in a simulation where a random slope was included. What if there was no random slope in the true model? I suspect the random intercept model estimator would be better. Generally speaking the best model is the correct model. How well a model does under alternative truths has not been studied much as it depends on too many alternative truths. One issue to note is that if there is a big difference between between- and within-subject variance, iid N models will work well. If the between-subject variance is small compared to within-subject variance then the model should be at the subject level (fish), but if it is large, then model should be at the lake level (and take averages of the fish within lakes).
In this model, each site has its own intercept β0iβ0i\beta_{0i} and slope β1iβ1i\beta_{1i}
I think it is important to note that if you take this approach you are no longer estimating how the population mean for each lake varies with Se in lake water, but how B_0 and B_1 vary with Se in lake water. B_0 may or may not equate well with lake population means.
predictor variable varies within a lake
and assuming impacts the response in some way.
powerful
"useful"?
We see that the estimated standard errors for β0β0\beta_0 and β1β1\beta_1 are much narrower in the naive linear model, which is not a surprise since this approach treats the data as though they were independent
Wouldn't that be a good thing? It is bad only if they are biased, but you do not present any evidence that is the case or at least worse than lme model. In any case if the true model had no Lake effect, SE estimates would be much to large for mixed effects model and iid N case would be fine.
Naive linear model
Aren't the true values 1.16 for intercept and 0.26 for Slope? Would this not indicate that "Naive" model is least biased method?
mixed-effect model fit using lmer, which was the model used to generate the data
Would the mixed effects model work better than the naive model if we had simulated the naive model? We need to remember that the best model is always going to be the correct model. The naive model is simply an estimator using the incorrect model, it has nothing to do being a poor or naive estimator.
often
Often? When is it not reasonable?
data are balanced
You will have the sample problems if sample sizes are balanced, just less egregious ; as noted previously, the issue is the sample is not a simple random sample, not unequal sample sizes.
estimator of the intercept is biased
This is a biased comparison of the two models. In linear regression there is no adjustment for differences between lakes, however awkward that may be, as there is in the lme model. For a fair comparison lakes factor needs to be included in the model; of course an estimator needs to be selected carefully so they both are estimates of population mean (or at least the same thing).
SEs estimate the standard deviation
I believe a better estimate would be sqrt(mean(se.hat[,1]^2)) as we would rather take mean of variances than stdDev.
repeated measures data
I don't believe this data fit the description of what is usually thought of as repeated measures data. The sampling unit is fish and we only take one observation per fish. Based on simulation this is a kind of stratified random sample or cluster sampling, depending on how many lakes there are in the population, with population of fish stratified by randomly selecting lakes and then "random" samples taken from each lake; but sampling plan was not described in the text. The sampling plan can not be ignored when discussing the nature of the model.
factors
is this statistical factors or ecological factors?
naive estimates of uncertainty will be too small
I don't understand this. If the sample of fish are a simple random sample from the population of 9 (all?) lakes, then estimates will be unbiased. As it turns out the simulation given as an example is not a simple random sample, so there is an error in model specification.
I am not sure I know what a "naive" estimate is. If Lake is not important is the mixed effects model the naive estimator?
the number of fish sampled per lake does not depend on these other factors
This is not what is happening in the simulation given. Lakes are first selected, then fish are caught in those lakes. If there were a lot of lakes, 10000 in example, we would rarely catch more than 1 fish from any given lake if a simple random sample (SRS) was taken. Instead we have about 9 fish caught per selected lakes. Hence not all fish are equally likely to be caught.
sample sizes are random
I am not sure why this is important. The question is whether the data are collected as a simple random sample from the population of fish across all lakes defining the population of fish.
Well, it depends on the process that leads to the imbalance
It also depends on the question one is trying to answer. If you want an estimate of the population mean, it will matter. If you want to understand the relationship between lake SE and bioaccumulation, then not so much, except in the efficiency of your estimator; in which case equal sample sizes will be most efficient.
not independent
Independence is a very hard thing to prove or disprove. There could be difference in lakes that lead to different bioaccumulation rates, e.g., the fish caught in different lakes may be different species and have different bioaccumulation rates. The sampling plan, which is not described, would tell us whether fish were selected independently. (As best we can tell given the complexities of sampling wildlife.). Even if the fish were a true simple random sample from the population of fish from nearby lakes, results could look like what is given in the text. Keep in mind what we are observing in the figures are the residuals, not the errors. What is missing here is Lake as a factor in the model, so assuming a simple random sample took place it is model error (lack of fit) not lack of independence that is leading to this picture. On the other hand if Lake has no effect, then a stratified random sample will look like independent data. The issue is that Lake is presumed to have an effect. Note Lake and log(Se) will be confounded as we only have one observation of Se in water per lake.
locations
Based on the text "to determine whether Selenium concentrations in nearby lakes are bioaccumulating in fish" the population of fish in these lakes are of interest and are the sampling units, not locations, at least at this point in the conversation. So it is the sampled fish that need to be independent. This may or may not have occurred, but a sampling plan has not been stated.
nearby
Are the 9 sampled lakes all the "nearby" lakes? How many nearby lakes are there? This is an important question that is not directly assessed.
concentrations
delete? Selenium can bioaccumulate, but not Se concentrations.
Based on the significant interactions
Could these interactions be due to sparsity of data and large number of nesting boxes?
Population proportion of skaters
This is the sample proportion as it is based on summarize(PopProportion = mean(Y) rather than the parameters B_0 and B_1 correct?
model’s design matrix
The response variable is not clearly stated. Is it presence/absence of eggs or clutch size? I believe both have been discussed. The model implies it is presence/absence, but this should be clarified.
we load the data
I am not sure I understand this sentence. Before we load the data we convert several variables? How can we do that before we load the data? What does "loading" the data mean?
test statistics, and AIC’s
test statistics and AICs are not presented for both Random Intercept and GLS in the table above this sentence.
were
was?
estimates
of the parameter values? Are estimates of SEs of parameter values impacted if assumptions about variance/covariance matrix are incorrect?
Thus, ∂μi∂βVi=Xi∂μi∂βVi=Xi\frac{\partial \mu_i}{\partial \beta}V_i = X_i
V^-1, i.e., V inverse in equation here?
[more on this later…reference last section of book…if I get there:)]
delete
This is one of the reasons why I like to cover zero-inflated models
I find this reasoning confusing. In next example, it appears you carry out analysis without Bayesian priors. In the following section you implement using Bayesian. As far as I can tell you get the same answer for both (the intercept is different but I assume that is due to switch of definition of zero-inflation (which is confusing in of itself). Since the answers are the same, it seems the non-Bayesian approach in Sec 17.6 is "easier" as it requires fewer assumptions and less complicated modeling process.
zero-inflation
Has this been defined?
we should expect fewer zeros
why should we expect fewer zeroes? It seems this is a conditional distribution problem. First there is the presence absence condition. Second, there is the distribution of values when "present". At least theoretically, even under present" conditions, you may still get zero values. As the mean of the distribution under "present" conditions increases from zero the frequency of zeroes will decline to a point where getting a zero is unlikely. So whether zeroes increase with mean or not, the zero-inflated model could be appropriate.
95% credible interval
One limitation to use of two sided intervals, confidence or credible, is that it removes 0 probability from any estimated interval (of interest, i.e., discounting bounds of +/-Inf); yet, all we have seen are 0s. Another approach is to note that for p = P(Y = 1) > 0.05, the chance of getting 59 0s in a row is less than "p-value" = 5%. Hence either p > 0.05 and something very unusual happened ("p-value"< 0.05) or p < 0.05 with the most likely value being zero (given no further information).
primarily
I am confused. What value is a non-informative prior? If it is completely non-informative wouldn't the answers be determined completely by the data? This seems to be the case as Bayesian results generally mirror frequentist results when uninformed priors are used. In Sec 11.2 it is implied that we need informed priors (from data) otherwise the probability is subjective. In the example just given, we need an informed prior about beta_0 to achieve an uninformed prior about p?? So does this model have an informative or uninformative prior?
it turns out that a N(0,3)N(0,3)N(0, 3) distribution is a much better option
What prior information leads us to believe that beta_0 has a distribution with mean 0 and std dev 3?
Contrast this process with fitting a model using glm where users may not even know that the mean is being modeled on a transformed scale
I could also write this as "Note how in JAGS we have to spell everything out in tedious detail, but glm does all that for us!"
an
and?
get will get the
delete?
included
"include"
we will no longer be able to write our models in terms of a “signal + error”
I don't see why we can not continue to write models as signal + error. Just because the variance may be a function of the mean doesn't mean we can't view it as signal + error. The link functions are essentially the signal function, Hence, the deviations we observe, i.e., the residuals, are estimates of the "errors". I believe, what is really happening is we are getting more nuanced about modelling the "error"; now we have different distributions, then "random effects models", measurement error, etc. It is important to keep in mind that the "error" is presumed additive, i.e., signal + error. (Except when it is not, of course, but then we do something like a transform to make it additive.)
Poisson regression
I think it would be useful to add "models" to the end of the sentence.
linear regression
This is a false comparison between "linear regression", by which I presume the iid N subset of linear regression models is meant, and the Poisson model. In one the signal function is linear and the other it is exponential function. Depending on the truth one of these is going to be wrong. A better comparison is between log(Y) = beta_0 + ... and Poisson.
interpretation of regression coefficients
This is a false indication of the capabilities of linear regression, by which I presume you are referring to the iid N subset of linear regression models. In the Poisson model you are stating that response is an exponential function of the predictors. Yet log(response) with iid N is not considered or compared to the Poisson model. This is like saying a model is Y = X^2, but not allowing linear regression to use polynomials. I am not claiming log(response) is going to be better or worse, but this is an unfair comparison.
can thus only take on integer values
but this really only matters if the integer range is small and particularly so if near near zero. This is important information when choosing models.
assumed model
What is the assumed model? I presume it is the model assumed by the null hypothesis.
other methods might include
Why not include log(response) iid Normal case? This should be considered as a possibility.
the easy way
Sorry, but frequentist model gives the same answer and seems way easier than going through all this.
Confidence
Again, is this meant to be "Credible"? If not what is a Bayesian Confidence Interval?
If you get a p-value near 0.5, that is ideal as it suggests that the observed data look very similar to the data generated under the assume model (at least when it comes to our chosen test statistic)
What if you get a value near 1? What does that mean?
assumed model
What is the assumed model? I assume it is the one behind the null hypothesis. These hypotheses are either True or False just like frequentist hypotheses, are they not?
By contrast, a frequentist will say that this particular interval either does contain ppp or it does not contain ppp (with probability = 1). The parameter itself is a fixed, but unknown quantity and the interval is random (it depends on the observed data). To interpret the frequentist interval in terms of a probability statement, we need consider creating many such intervals across repeated events (collect data, form interval). We expect that close to 95% of these intervals would contain the fixed and unknown ppp (see Section 11.3.2). Thus, we say that we are 95% sure that our interval contains the true parameter ppp
I find all of this confusing and contradictory; see previous comment as well. The first probability is one of little interest as it is essentially a tautology. The true value is either in the interval or it is not - once you know the truth. We do not, so rely on the second definition to guide our uncertainty in the parameter value. The first probability also holds for Bayesian credible interval as probability is the "belief" that the true parameter is in the interval. So for a give credible interval the true value is either in or outside the interval.
This statement summarizes our belief about reasonable values of ppp given the observed data
I am puzzled why a frequentist assuming their underlying model, e.g., normal distribution, is correct, can't say that there is a 95% chance that the 95%CI interval covers the true value, i.e., "probability that p is between 0.39 and 0.56 is 0.95." Isn't this the frequentist probability belief model? The fact we have chosen specific values, 0.39 and 0.56, to identify the CI doesn't matter, the CI itself has a 95% chance of covering the true parameter, it just so happens the one we observe was labelled with those specific values.
how well do these frequentist intervals work
Why not ask how well do these Bayesian Intervals work?
and
"as"? I think starting at "Inference" would be better.
one obtains more and more data
the experiment contains more and more data. Data collected piecemeal presumably is captured by increasingly important priors.
result in informed beliefs (i.e., informed by data)
This is an integral part of Bayesian estimation and inference, but not a word on the challenges of building an informed prior. The text covers all the blemishes of frequentist modelling, so why not mention this important challenge in Bayesian analysis? These challenges should be acknowledged if not briefly described. If you know of any papers using a good informed prior, they should be at least cited.
belief
about what?
probability represents something a little different
when it comes to assessing uncertainty in estimation and inference.
Data are used to test hypotheses which are either True or False
Does this not hold for Bayesian hypothesis testing? That is whatever hypothesis you state or however you state it is either true or false?
n frequentist statistics
Don't Bayesians define their probability models in the same way? For example, would a Bayesian not use a binomial to assess the probability of getting 7 or more Heads from 10 tosses of a fair coin? Isn't the root of this model the relative frequency of some event? I agree that there are differences in defining and assessing uncertainty in parameter estimates, but this is a whole different principle than probability in general.
of hatθhatθhat{\theta}
hat notation is not correct.
have many desirable properties
assuming the choice of distribution used in ML is correct.
If you find that the results from using optim depend on the starting value, then you should be cautious, try multiple starting values, and select the one that results in the smallest negative log-likelihood
It might also imply the likelihood function is very flat.
be
delete
can
"which can"
returns
space after "returns" needed.
Therefore, it is more common to work with the log-likelihood
I believe it is near universal; why not skip the example with likelihood function and just start here?
null
"the null"
we have
and y values are non-negative.
Fortunately, there are many websites that can perform differentiation if you get stuck
Also, this has been worked out for all the common distributions.
Depending on how rusty your math is, you may or may not recall ways to simplify this expression.
I don't feel this sentence is necessary.
given by
assuming tiles are all the same size
unknown parameters with the data being fixed
I prefer "... unknown parameters conditional on observed data." or "... unknown parameters given the observed data."
Alternative mode
The notation between Null and Alternative models is confusing. Subscript j from 1 to 80 for null and i from 1 to 80 in Alternative, even though there are not 80 measurements in each of the j sites. Also, I don't believe you want individual parameter values for each observation in Alternative, so lambda should be j = 1, 2 as noted in following sentence.
our large sample sizes are large
or the data are actually normally distributed, which is also common.
Model for when a large number of factors act in an additive way
Not sure this a good description. We are discussing a single random variable.
Number of events in a fixed time interval or spatial unit
Is this correct?
continuous
or take on a wide range of values so the data are essentially continuous
F distribution
I have not seen this formulation for displaying df for F distribution before; usually I see F_k1,k2 or F(k1,k2).
distribution is often encountered
It is also the distribution of estimates of variance when data are normally distributed.
the uniform distribution is often used as a model of ignorance for prior distributions
It is also the distribution of p-values when the null hypothesis and model assumptions are true.
The last trial will be a success
Technically the last trial will be a success with probability 1 as it is a requirement of the study design that the last trial be a success. The model still works because presumably we chose r before the study starts as a stopping rule. Hence we are just sampling an infinite sequence of successes and failures. Now if we change r midstream, then we have problems.
a
"as a"
derive
We can derive the mass function regardless of what is observed. Perhaps you mean "estimate" or simply want to say the geometric probability mass function has y failures followed a success.
the
"then"
Binomial distribution
R has a Bernoulli distribution; it is a binomial distribution with n = 1.
returns the value of the probability density function
While stated above, I think it would be useful to make clear that mean 0 and std dev 1 are implied in the function.
and
in next line "0|x=1"
taken
"take"
frequentist statistics
Don't Bayesians' use these probabilities as well?
postivie
"positive"
population
"general population"
but it is also often used for frequentist inference
I am not sure what being a frequentist or a Bayesian has to do with this problem. We "know" the underlying probabilities 1% Down's syndrome, 90% true positive and 1% false positive. The rest is calculating the probability that a positive test will be correct. Why should it be a surprise frequentists use Bayes theorem?
Multiplicative rule:
Isn't this the same as the conditional probability rule?
needed
"need"
have no ability to discover important associations
But we learn that changes in observed predictors do not impact the binary data. This is no small piece of information.
use
delete "use"
Because the model is pre-specified, we can trust our inferences using confidence intervals and p-values as long as the assumptions are not violated.
I am not sure this is true. Consider an example with two highly correlated predictors and only one is cause of changes in response.
simple simulation example
This highlights my point above that lots of predictors is a multiple comparison problem as well. Out of 100 variables 5 or 6 are marginally significant, so why should I believe any are truly significant. A look at anova of full vs intercept model shows no significance. Now add one variable that does have a relationship, say y = X1. We see that the overall anova for test is not significant, but X1 is pretty significant (I get 0.0004). How do you manage expectations in these large variable models? One possibility is false discovery rates. So this is a problem with large sets of variables, but I am not convinced it is a big issue with small (say < 10) variables, which most studies are likely to have observed. In the end multiple comparisons and multicollinearity are the big issues in model selection.
A variable is more likely to be included if by chance its importance in the original sample was overestimated than if it was underestimated (Copas & Long, 1991).
It seems to me that the estimated parameter value and SE are both equally at issue here. If the estimate is near 0, but SE is small, then we may incorrectly include the parameter in the model.
of
delete "of"
a
"an"
two approaches
This is a great example, but the professor set up a false dichotomy in laying out your choices. You need to know the subject matter in both cases. Your poor model selection highlights that need.
our
"your"
Modeling
I think Model selection strategies is a better title. Also, I think this more naturally follows Ch. 6. See also bertv
unless we include ZZZ
I am confused. I am not sure what it means to exclude a variable. If you mean you have X -> Y then it is "open" (and no longer a chain). If Z is in path as described, then does not change in X lead to a change in Z which leads to a change in Y. So is that not open as water can get from X to Y. If Z can change on its own to impact Y is another question.
For the fork, if we exclude Z there is no connection between X and Y, i.e., no change in Z leads to no change in either X or Y. So seems like Z is necessary to have an open path, but the analogy falters as water is not going from X TO Y, but going to X AND Y.
From a correlation perspective are not chains and forks the same? After all, correlation does not care whether it is X -> Z or X <- Z.
confounding variables
Is this not multicollinearity? I think it would it be helpful to make this connection here.
together
"each"?
not
"no"
parameter estimates may change in magnitude (and even sign) depending on what over variables are included in the model
This might be a good time to de-emphasize the actual estimated values coming out of the model. The goal is really to just understand relationships. This includes the relationship between predictors as well. We should not worry about the specific number because 1) it is only an estimate, so is a ballpark figure (hopefully) in the first place, and 2) can change with model (multicollinearity), population, etc.
such data-driven approaches can lead to overfitting
This seems to imply that only dropping variables can lead to overfitting. Keeping them in can lead to overfitting as well.
it
"it is"
case to
Should this be "male" to match the model in the previous section?
he gls function
Technically, this is not a generalized least squares estimator as function uses ML or REML to get estimates. Generalized least squares requires all values of covariance matrix to be known up to a constant.
library(here) # Read in escapement data sdata<-read.csv(here("data", "Sockeyedata.csv"), header=T)
I couldn't get this to work. Where is this data?
be
"by"
than a linear model
This is not a fair comparison of the linear model to the non-linear approach because you are not comparing the best of the splines with the best of the linear models, i.e., log(Species) vs log(Area). They may not be directly comparable via AIC due to transform of response variable, but that doesn't mean it is not a good (or even better) model. Besides, which model is easier to understand at the end of the day?
better
A "different" approach. In Sec. 15, the log response model works equally well with negative binomial GLM.
This question brings up the thorny issue of multiple comparisons
This problem also holds if you have lots of different variables in the model.
It turns out that there are other ways to code
This might be a good type to introduce other contrast ideas such as (-0.5, 0.5) to have intercept represent the mean and slope the difference.
β0+β1X1,i+β2X2,i+…
Should these betas be hatted?
It
"In"
Without this assumption, the sampling distribution of our estimators9 does not follow a t-distribution
With or without this assumption, the sampling distribution will or will not follow a t-distribution. It is what it is. If our assumptions about the model, i.e., t-distribution, are incorrect inference about intercept and slope may not be accurate. Luckily, the CLT says we will be in the ballpark most of the time. This holds true for Bayesian models too.
fewer
I prefer "different". Not all assumptions are the same, and one big assumption may outweigh 3 little ones.
modern
I do not like the use of "modern", it connotes the iid N model is outdated. The other models are simply expansions on the general principles of the iid case.
linear regression models
I believe Zuur et al really mean iid Normal case of linear regression. This confusion is continued in the next sentence when include random effects models in list. Random effects models are linear regression models too.
Technically, this is an incorrect statement, at least as far as frequentist statistics go
I don't believe this is an incorrect statement, as I am viewing the probability as the chance my randomly observed experiment has in containing the true parameter based on my model assumptions (and the data validating those assumptions).
should ideally
I believe 95% CIs do contain the true parameter 95% of the time. Estimators of 95% confidence intervals may be biased, so the expected value of the estimator may be 92%, but then it is a 92% confidence interval.
must
I don't think "must" applies as there are alternative methods for explaining sampling distributions. I prefer an alternative view in that the observed experiment is simply one "randomly" selected experiment from the implied series of identical experiments. The sampling distribution describes the potential outcomes (and their probabilities) of this series of experiments. The outcome is the same. The sampling distribution is often determined from assumptions made in the regression model, sometimes it is derived empirically.
I agree that simulations are an excellent way to study sampling distributions.
lm(formula = age ~ proportion.black, data = LionNoses)
The use of percent.black was proposed as a better model in previous section. Why not use it here?
Independence of the errors
I believe this assumption needs to be stated explicitly, it is not implied in the statements above.