
My student asked today how to interpret the AIC (Akaike's Information Criterion) statistic for model selection. We ended up bashing out some R code to demonstrate how to calculate the AIC for a simple GLM (general linear model). I always think that if you understand the derivation of a statistic, it is much easier to remember how to use it.

Now, if you google the derivation of the AIC, you are likely to run into a lot of maths. But the principles are really not that complex. So here we will fit some simple GLMs, then derive a means to choose the 'best' one.

Skip to the end if you just want to go over the basic principles.

Before we can understand the AIC, though, we need to understand the statistical methodology of likelihoods.

Explaining likelihoods

Say you have some data that are normally distributed with a mean of 5 and an SD of 3:

set.seed(126)
n <- 50  # sample size (n = 50 reproduces the standard errors in the summary below)
y <- rnorm(n, mean = 5, sd = 3)

Now we want to estimate some parameters for the population that was sampled from, like its mean and standard deviation (which we know here to be 5 and 3, but in the real world you won't know that).

We are going to use frequentist statistics to estimate those parameters. Philosophically this means we believe that there is 'one true value' for each parameter, and the data we observed are generated by this true value.

m1 <- glm(y ~ 1, family = "gaussian")

The estimate of the mean is stored in coef(m1) = 4.38, the estimated variance in summary(m1)$dispersion = 5.91, and the SD is its square root, sqrt(summary(m1)$dispersion) = 2.43. Just to be totally clear, we also specified that we believe the data follow a normal (AKA "Gaussian") distribution.

We just fit a GLM asking R to estimate an intercept parameter ((Intercept)), which is simply the mean of y. We also get out an estimate of the SD (= the square root of the variance). You might think it's overkill to use a GLM to estimate the mean and SD, when we could just calculate them directly.
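If you want to convince yourself of the equivalence, compare the direct calculations with the GLM's estimates (a quick check, using the same y and m1 as above):

```r
mean(y)  # identical to coef(m1)
sd(y)    # identical to sqrt(summary(m1)$dispersion) for this intercept-only model
```

This works because for an intercept-only Gaussian GLM the dispersion is the residual sum-of-squares divided by n - 1, which is exactly the sample variance.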

Well, notice now that R also estimated some other quantities, like the residual deviance and the AIC statistic.

summary(m1)
##
## Call:
## glm(formula = y ~ 1, family = "gaussian")
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -5.7557 -0.9795  0.2853  1.7288  3.9583
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.3837     0.3438   12.75   <2e-16 ***

You might also be aware that the deviance is a measure of model fit, much like the sums-of-squares. Note also that the value of the AIC is suspiciously close to the deviance. Despite its odd name, the concepts underlying the deviance are quite simple.
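You can verify the link to the sums-of-squares directly (a quick check using the model above): for a Gaussian GLM the deviance is exactly the residual sum-of-squares.

```r
deviance(m1)             # residual deviance reported by summary(m1)
sum((y - fitted(m1))^2)  # the same number, computed as a sum-of-squares
```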

As I said above, we are observing data that are generated from a population with one true mean and one true SD. Given we now have estimates of these quantities that define a probability distribution, we could also estimate the likelihood of measuring a new value of y that, say, = 7.

To do this, we simply plug the estimated values into the equation for the normal distribution and ask for the relative likelihood of 7. We do this with the R function dnorm():

sdest <- sqrt(summary(m1)$dispersion)
dnorm(7, mean = coef(m1), sd = sdest)

Formally, this is the relative likelihood of the value 7 given the values of the mean and the SD that we estimated (= 4.38 and 2.43 respectively, if you are using the same random seed as me).

You might ask why the likelihood can be greater than 1; surely, as it comes from a probability distribution, it should be less than 1. The reason is that it is a probability density, not a probability: the normal distribution is continuous, so there are an infinite number of possible y values, and the probability of any single exact value is zero. The relative likelihood, on the other hand, can be used to calculate the probability of a range of values.
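The distinction is easy to check (a small sketch using the fitted values from above): probabilities come from integrating the density over a range, and those always stay between 0 and 1:

```r
# Probability that a new y falls between 6 and 8, under the fitted normal
pnorm(8, mean = coef(m1), sd = sdest) - pnorm(6, mean = coef(m1), sd = sdest)

# Integrating the density over the whole real line gives exactly 1
integrate(dnorm, -Inf, Inf, mean = coef(m1), sd = sdest)
```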

So you might realise that calculating the likelihood of all the data would be a sensible way to measure how well our 'model' (just a mean and SD here) fits the data.

Here’s what the likelihood looks like:

plot(y, dnorm(y, mean = coef(m1), sd = sdest), ylab = "Likelihood")

Which is just the normal distribution.

To get the likelihood of all the data together, think about how you would calculate the probability of multiple (independent) events. Say the chance I ride my bike to work on any given day is 3/5, and the chance it rains is 161/365 (like Vancouver!); then the chance I will ride in the rain is 3/5 * 161/365 = about 1/4, so I had best wear a coat if riding in Vancouver.
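The same arithmetic in R:

```r
p_ride <- 3/5       # chance I ride my bike on a given day
p_rain <- 161/365   # chance of rain, like Vancouver
p_ride * p_rain     # chance of riding in the rain: about 0.26, roughly 1/4
```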

We can do the same with likelihoods: simply multiply the likelihood of each individual y value and we have the total likelihood. This will be a very small number, because we are multiplying lots of small numbers by each other. So one trick we use is to sum the log of the likelihoods instead of multiplying them:

y_lik <- dnorm(y, mean = coef(m1), sd = sdest, log = TRUE)
sum(y_lik)

The larger (the less negative) the likelihood of our data given the model's estimates, the 'better' the model fits the data. The deviance is calculated from the likelihood, and for the deviance smaller values indicate a closer fit of the model to the data.

The parameter values that give us the smallest value of the negative log-likelihood are termed the maximum likelihood estimates.
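We can check this numerically (a sketch, assuming the same y as above): ask optim() to find the mean and SD that minimise the negative log-likelihood, and it lands almost exactly on the sample mean and SD:

```r
# Negative log-likelihood of the data as a function of the parameters (mean, SD)
negll <- function(p) -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))

# Minimise it numerically from a rough starting guess
fit <- optim(c(0, 1), negll)
fit$par  # close to mean(y) and sd(y)
```

(The maximum likelihood estimate of the SD divides by n rather than n - 1, so it comes out slightly smaller than sd(y).)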

Comparing alternate hypotheses with likelihoods

Now say we have some measurements and two covariates, x1 and x2, either of which we think might affect y:

a <- 5          # true intercept
b <- 3          # true slope for x1
x1 <- rnorm(n)
x2 <- rnorm(n)  # the 'fake' covariate: it has no effect on y
y <- a + b * x1 + rnorm(n, sd = 3)

So x1 is a cause of y, but x2 does not affect y. How would we choose which hypothesis is most likely? Well, one way would be to compare models with different combinations of covariates:

m1 <- glm(y ~ x1)
m2 <- glm(y ~ x2)
m3 <- glm(y ~ x1 + x2)

Now we are fitting a line to y, so our estimate of the mean is now the line of best fit; it varies with the value of x1. To visualise this:

plot(x1, y)
lines(x1, predict(m1))

predict(m1) gives the line of best fit, i.e. the mean value of y given each x1 value. We then use predict() to get the likelihoods for each model:

sm1 <- summary(m1)
sm2 <- summary(m2)
sm3 <- summary(m3)
lik1 <- sum(dnorm(y, mean = predict(m1), sd = sqrt(sm1$dispersion), log = TRUE))
lik2 <- sum(dnorm(y, mean = predict(m2), sd = sqrt(sm2$dispersion), log = TRUE))
lik3 <- sum(dnorm(y, mean = predict(m3), sd = sqrt(sm3$dispersion), log = TRUE))

The likelihood of m1 is larger than that of m2, which makes sense because m2 has the 'fake' covariate in it. The likelihood for m3 (which has both x1 and x2 in it) is fractionally larger than the likelihood for m1, so should we judge that model as giving nearly as good a representation of the data?

Because the likelihood is only a tiny bit larger, the addition of x2 has only explained a tiny amount of the variance in the data. But where do you draw the line between including and excluding x2? You run into a similar problem if you use R^2 for model selection.

So what if we penalize the likelihood by the number of parameters we have to estimate to fit the model? Then if we include more covariates (and we estimate more slope parameters), only those that account for a lot of the variation will overcome the penalty.

What we want is a statistic that helps us select the most parsimonious model.

The AIC as a measure of parsimony

One way we could penalize the likelihood by the number of parameters is to add an amount to it that is proportional to the number of parameters. First, let's multiply the log-likelihood by -2, so that it is positive and smaller values indicate a closer fit.

LLm1 <- sum(dnorm(y, mean = predict(m1), sd = sqrt(summary(m1)$dispersion), log = TRUE))

Why it's -2, not -1, I can't quite remember, but I think just historical reasons.

Then add 2*k, where k is the number of estimated parameters.

-2*LLm1 + 2*3
## [1] 257.2428

For m1 there are three parameters: one intercept, one slope and one standard deviation. Now let's calculate the AIC for all three models:

-2*LLm1 + 2*3
## [1] 257.2428

LLm2 <- sum(dnorm(y, mean = predict(m2), sd = sqrt(summary(m2)$dispersion), log = TRUE))
-2*LLm2 + 2*3

LLm3 <- sum(dnorm(y, mean = predict(m3), sd = sqrt(summary(m3)$dispersion), log = TRUE))
-2*LLm3 + 2*4

We see that model 1 has the lowest AIC and therefore has the most parsimonious fit. Model 1 now outperforms model 3, which had a slightly higher likelihood but, because of the extra covariate, also has a higher penalty.
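As a sanity check, R's built-in AIC() function gives almost the same numbers as our hand-rolled version (R's logLik() uses the maximum-likelihood estimate of the variance, dividing the residual sum-of-squares by n rather than by the residual degrees of freedom, so the values differ by a small amount):

```r
# Built-in AIC, for comparison with the -2*log-likelihood + 2*k values above
AIC(m1)
AIC(m2)
AIC(m3)
```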

AIC basic principles

So to summarize, the basic principles that guide the use of the AIC are:

  1. Lower indicates a more parsimonious model, relative to a model fit with a higher AIC.

  2. It is a relative measure of model parsimony, so it only has meaning if we compare the AIC for alternate hypotheses (= different models of the data).

  3. We can compare non-nested models. For instance, we could compare a linear to a non-linear model.

  4. The comparisons are only valid for models that are fit to the same response data (i.e. the same values of y).

  5. Model selection conducted with the AIC will choose the same model as leave-one-out cross-validation (where we leave out one data point and fit the model, then evaluate its fit to that point) for large sample sizes.

  6. You shouldn't compare too many models with the AIC. You will run into the same problems with multiple model comparison as you would with p-values, in that you might by chance find a model with the lowest AIC that isn't truly the most appropriate model.

  7. When using the AIC you might end up with multiple models that perform similarly to each other. In that case you have similar evidence weights for the different alternate hypotheses. In the example above m3 is actually about as good as m1.

  8. You should correct for small sample sizes if you use the AIC with small sample sizes, by using the AICc statistic.
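On that last point, the AICc correction is a simple formula (a sketch of the standard formula, with k estimated parameters and n observations):

```r
# AICc = AIC + 2k(k+1)/(n - k - 1); the extra penalty vanishes as n grows
aicc <- function(aic, k, n) aic + (2 * k * (k + 1)) / (n - k - 1)

aicc(257.2428, k = 3, n = 50)  # m1's AIC from above, only slightly inflated
```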

Assuming it rains all day, which is reasonable for Vancouver.

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.