Talk:Akaike information criterion


AICc formula

Hi all, I recently made a small change to the AICc formula to add the simplified version of the AICc (i.e. not in terms of the AIC equation above, but the direct formula) so that readers could see both the AICc's relation to the AIC (as a "correction") and as the formula recommended by Burnham and Anderson for general use [edit number 618051554].

An unnamed user reverted the edit, citing "invalid simplification." Does anyone (the reverting editor from IP 2.25.180.99 included) have any reason not to want the edit I made to stand? It adds the second line to the equation below:

AICc = AIC + \frac{2k(k+1)}{n-k-1}
     = -2\ln(L) + \frac{2kn}{n-k-1}

The main point I have in favor of the change is that Burnham and Anderson point out in multiple locations that the AIC should be thought of as the asymptotic version of the AICc, rather than thinking of the AICc as a correction to the AIC. This second equation shows easily (by comparison with the AIC formula) how that asymptotic relationship holds, and it shows how to compute the AICc itself directly.

The point against it, as far as I can see, is just that there is now a second line of math (which I can imagine some people being opposed to...)

Any opinions? If not, I'll restore my edit in a few days/weeks. Dgianotti (talk) 18:03, 31 July 2014 (UTC)

The second equation is certainly valid. I do not see why the proposed new formula is a simplification, though; for me, the first (current) formula is simpler. Regarding the asymptotic relationship, this is shown much more clearly by the first formula: as n → ∞, the second term plainly goes to 0, so the formula becomes just AIC, as Burnham and Anderson state. Hence I prefer the current formula. 86.152.236.37 (talk) 21:11, 13 September 2014 (UTC)
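Both forms discussed in this thread, the correction form AICc = AIC + 2k(k+1)/(n−k−1) and the direct form AICc = −2 ln(L) + 2kn/(n−k−1), can be checked numerically. The following is a minimal Python sketch, not taken from the article; the function names and the log-likelihood value are made up for illustration, and AIC is assumed to be 2k − 2 ln(L).

```python
import math

def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 ln(L), with log_lik the maximized log-likelihood."""
    return 2 * k - 2 * log_lik

def aicc_correction_form(log_lik: float, k: int, n: int) -> float:
    """AICc written as AIC plus a small-sample correction term."""
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)

def aicc_direct_form(log_lik: float, k: int, n: int) -> float:
    """AICc written directly, without reference to AIC."""
    return -2 * log_lik + 2 * k * n / (n - k - 1)

# Hypothetical values chosen only for illustration.
log_lik, k = -120.0, 3
for n in (10, 100, 10_000):
    a = aicc_correction_form(log_lik, k, n)
    b = aicc_direct_form(log_lik, k, n)
    # The two forms agree; the gap to plain AIC shrinks as n grows.
    print(n, math.isclose(a, b), a - aic(log_lik, k))
```

The two functions return the same value for any n > k + 1, so the choice between them is one of presentation: the first makes the limit n → ∞ visible, the second is the formula one would type in directly.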

Fitting statistical models

When analyzing data, a general question of both absolute and relative goodness-of-fit of a given model arises. In the general case, we are fitting a model of K parameters to the observed data x_1, x_2 ... x_N. In the case of fitting AR, MA, ARMA, or ARIMA models, the question we are concerned with is what K is, i.e. how many parameters to include in the model.

The parameters are routinely estimated by minimizing the residual sum of squares, or by maximizing the log-likelihood of the data. For normally distributed errors, the least-squares method and the maximum-likelihood method yield identical parameter estimates.

These techniques are, however, unusable for estimation of optimal K. For that, we use information criteria which also justify the use of the log likelihood above.
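To make the point concrete, here is a small Python sketch of choosing the number of parameters by AIC. It swaps in polynomial regression for the AR/ARMA models mentioned above, purely to keep the example short; the data are synthetic, and the least-squares form AIC = n ln(RSS/n) + 2k (valid up to an additive constant under i.i.d. Gaussian errors) is an assumption of the sketch.

```python
import numpy as np

def aic_least_squares(rss: float, n: int, k: int) -> float:
    """AIC up to an additive constant, assuming i.i.d. Gaussian errors."""
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)  # fixed seed; illustrative data only
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=n)

results = {}
for degree in range(6):
    coeffs = np.polyfit(x, y, degree)            # least-squares fit
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    k = degree + 1                               # number of fitted coefficients
    results[degree] = (rss, aic_least_squares(rss, n, k))

best_degree = min(results, key=lambda d: results[d][1])
for degree, (rss, aic_value) in results.items():
    print(f"degree={degree}  RSS={rss:.4f}  AIC={aic_value:.2f}")
print("degree with lowest AIC:", best_degree)
```

The RSS can only go down as parameters are added, which is why it cannot be used on its own to pick K; the 2k penalty is what lets the criterion stop.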

More TBA...

More TBA... Note: The entropy link should be changed to Information entropy

Deviance Information Criterion

I've written a short article on DIC, please look it over and edit. Bill Jefferys 22:55, 7 December 2005 (UTC)[reply]

Great, thanks very much. I've edited a bit for style, no major changes though. Cheers, --MarkSweep (call me collect) 23:18, 7 December 2005 (UTC)[reply]

Justification/derivation

Is AIC *derived* from anything, or is it just a hack? BIC is at least derivable from some postulate. Why would you ever use AIC over BIC or, better, cross-validation?

There is a link on the page ([1]) which shows a proof that AIC can be derived from the same postulate as BIC and vice versa. Cross validation is good but computationally expensive compared to A/BIC - a problem for large scale optimisations. The actual discussion over BIC/AIC as a weapon of choice seems to be long, immensely technical/theoretical and not a little boring 128.240.229.7 12:37, 28 February 2007 (UTC)[reply]

Does the definition of AIC make sense with respect to dimension? That is...why would the log of the likelihood function have the same dimension as the number of parameters, so that subtracting them would make sense? Cazort 20:00, 14 November 2007 (UTC)[reply]

AIC is not a "hack" but it is not a general method either. It relates to Shannon entropy, which is self-information, and as such it can only compare various uses of single data objects. Physically, entropy has units of energy divided by temperature. Temperature relates to the relative information content of two different data objects, which latter, relative information content, is the basis for comparison between data objects. The reference in this Wikipedia article to this effect is obtuse. That AIC is related to "information theory" is vague enough to qualify as what physicists call "hand-waving." Namely, when one runs out of logical explanation the "waving of hands in the air" takes over. The relationship is only to Shannon entropy, which in turn may have some historical relevance, but is only a small part of information theory. Thus, it is like saying AIC is related to "statistics." It is too vague a statement to be of any use, and it does not explain anything. CarlWesolowski (talk) 21:40, 4 October 2016 (UTC)[reply]

An external link http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf mentioned in the current article gives a coherent explanation of the goal of the AIC and the "corrected" AIC. (See the section "Model Selection Criterion" on page 7) Using general concepts, it defines a quantity that we seek to maximize. The AIC and the "corrected" AIC formulae are presumably ways to compute something proportional to that quantity in special situations. Tashiro~enwiki (talk) 17:38, 9 December 2017 (UTC)[reply]
It took me years to understand what the text of this article is actually about; pity the poor reader. One major source of confusion is the omission of a crisp definition of the 'statistical models' that AIC applies to: it applies only to 1D random variates, which is a small subset of the uses of density models. For example, if we want to fit a curve of drug concentration in blood with a 'statistical model' like a gamma distribution, GD(t|a,b), we would minimize the error of C(t) ≈ AUC GD(t|a,b), where C(t) are the 2D concentration "blood samples" in time and AUC, in this case, is the area under the concentration curve AUC GD(t|a,b) for time in [0, ∞), since the area under GD(t|a,b) itself is 1. Note that AIC values can be obtained from concentration curves, but only indirectly. For example, maximum likelihood fitting can be performed, but one would apply maximum likelihood to a 1D list of residuals from {C(t), AUC GD(t|a,b)} and not to the 2D concentration data itself. One could call this process 'residual maximum likelihood.' Now, the way this article presents AIC and 'statistical models' excludes discussion of all other density functions which, like concentration density functions, are not probability density functions (PDFs), and that is the majority of models out there. Using the definitions implied by, but not explained in, this article, the gamma distribution usage above would not be called a statistical model, which is not a good idea; it is too confusing, especially given that its residuals can be 'statistical.' To be clear, a gamma distribution is a density function; density functions can be applied to random variates, and in that case, and as far as I can tell only in that case, can we speak of density functions as being PDFs.
Models are never 'statistical' or 'not statistical' only data can be statistical in the sense of being random, and one should not consider a PDF, or a PMF for that matter, as being anything other than deterministic; they just are not random; they have no noise. Thus, I highly recommend not saying 'statistical models,' when you mean models applied to statistical data. That is a grammatical error called a misplaced modifier, for other examples see https://www.scribbr.com/language-rules/misplaced-modifier/. There is no such literal thing as a 'statistical model' as that says that the model is randomized. Statistical modelling is the modelling of statistics, it does not use 'statistical models,' but rather models of statistical processes. CarlWesolowski (talk) 06:38, 24 July 2022 (UTC)[reply]

Origin of Name

AIC was said to stand for "An Information Criterion" by Akaike, not "Akaike information Criterion" Yoderj 19:39, 16 February 2007 (UTC)[reply]

This is similar to Peter Ryom developing the RV-index for the works of Vivaldi - instead of the official Répertoire Vivaldi, his index of course became known as the Ryom Verzeichnis... I am inclined to believe this is not unintentional on the part of the developer; it's probably (false?) modesty. Classical geographer 12:18, 2 April 2007 (UTC)[reply]

This criterion is alternately called the WAIC: Watanabe-Akaike Information Criterion[2], or the widely-applicable information criterion [3][4]. — Preceding unsigned comment added by 167.220.148.12 (talk) 10:48, 29 April 2016 (UTC)[reply]

WAIC and AIC are not the same. I have now listed a paper for WAIC in the "Further reading" section. SolidPhase (talk) 07:31, 16 May 2016 (UTC)[reply]

Travis Gee

I have sent an e-mail to this Mr. Gee who is cited as a possible reference, with the following text:

Dear Mr. Gee,
For some time now your name is mentioned in the Wikipedia article on the Akaike Information Criterion (http://en.wikipedia.org/wiki/Akaike_information_criterion). You are cited as having developed a pseudo-R2 derived from the AIC. However, no exact reference is given. I'd be glad to hear from you whether you actually developed this, and where, if anywhere, you have published this measure.
Thank you for your cooperation.

However, he has not answered. I will remove the reference. Classical geographer 12:18, 2 April 2007 (UTC)[reply]

That measurement ( R^2_{AIC}= 1 - \frac{AIC_0}{AIC_i} ) doesn't make sense to me. R^2 values range from 0 to 1. If the model's AIC is better than the null model's, it should be smaller. If the numerator is larger than the denominator, the ratio exceeds 1 and R^2_{AIC} will be less than 0. This is saying that better models will generate a negative R^2_{AIC}.

It would make sense if the model were: R^2_{AIC}= 1 - \frac{AIC_i}{AIC_0}
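A quick numeric check of the two candidate formulas, as a hedged Python sketch. The AIC values are invented, the function names are hypothetical, and both AIC values are assumed positive; AIC can be negative in practice, in which case neither ratio behaves like an R².

```python
def r2_aic_as_cited(aic_null: float, aic_model: float) -> float:
    """The form questioned above: 1 - AIC_0 / AIC_i."""
    return 1 - aic_null / aic_model

def r2_aic_as_proposed(aic_null: float, aic_model: float) -> float:
    """The form suggested above: 1 - AIC_i / AIC_0."""
    return 1 - aic_model / aic_null

# Hypothetical values: the fitted model improves on the null model (lower AIC).
aic_null, aic_model = 100.0, 80.0
print("as cited:   ", r2_aic_as_cited(aic_null, aic_model))     # negative
print("as proposed:", r2_aic_as_proposed(aic_null, aic_model))  # between 0 and 1
```

With these inputs the cited form is negative for the better model, while the proposed form lands between 0 and 1, which matches the objection raised in the comment.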

Denoting pronunciation

Please write the pronunciation using the International phonetic alphabet, as specified in Wikipedia:Manual of style (pronunciation). -Pgan002 05:09, 10 May 2007 (UTC)[reply]

Bayes' Factor

Query - should this page not also link to Bayes' Factor[5]?

I'm not an expert in model selection, but in my field (molecular phylogenetics) model selection is an increasingly important problem in methods involving Bayesian inference (e.g. MrBayes, BEAST), and AIC is apparently 'not appropriate' for these models [6]

Any thoughts anyone? I've also posted this on the model selection[7] page. Thanks.--Comrade jo (talk) 12:19, 19 December 2007 (UTC)[reply]

I agree. The opening statement: "Hence, AIC provides a means for model selection." should read "Hence, AIC provides a means for model selection, in certain circumstances." Circumstances in which it is not appropriate abound; see the introduction in [1] — Preceding unsigned comment added by CarlWesolowski (talk · contribs) 21:49, 12 November 2016 (UTC)

Confusion

The RSS in the definition is not a likelihood function! However, it turns out that the log likelihood looks similar to RSS. —Preceding unsigned comment added by 203.185.215.144 (talk) 23:12, 7 January 2008 (UTC)[reply]

I agree. What's written is actually the special case of AIC with least squares estimation with normally distributed errors. (As stated in Burnham, Anderson, "Model selection and inference". p48) Furthermore you can factor out ln(2pi)*n and an increase in K due to using least squares. These are both constant when available data is given, so they can be ignored. The AIC as Burnham and Anderson present it, is really a tool for ranking possible models, with the one with the lowest AIC being the best, the actual AIC value is of less importance. EverGreg (talk) 15:27, 29 April 2008 (UTC)[reply]

Relevance to χ² fitting

I have contributed a modified AIC, valid only for models with the same number of data points. It is quite useful though. Velocidex (talk) 09:17, 8 July 2008 (UTC)[reply]

Could you please supply some references on this one? Many variations of AIC have been proposed, but as e.g. Burnham and Anderson stress, only a few of these are grounded in the likelihood theory that AIC is derived from. I'm away from my books and in "summer-mode", so it could very well be that I just can't see how the section's result follows smoothly from the preceding derivation using RSS. :-)

EverGreg (talk) 11:22, 8 July 2008 (UTC)[reply]

For χ² fitting, the likelihood is given by
L = \left[\prod_{i=1}^{n} \left(\frac{1}{2\pi\sigma_i^2}\right)^{1/2}\right] \exp\left(-\sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{2\sigma_i^2}\right)
i.e.
\ln L = \ln\left[\prod_{i=1}^{n} \left(\frac{1}{2\pi\sigma_i^2}\right)^{1/2}\right] - \frac{1}{2}\sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{\sigma_i^2} = C - \frac{\chi^2}{2}
, where C is a constant independent of the model used, and dependent only on the use of particular data points, i.e. it does not change if the data do not change.
The AIC is given by AIC = 2k - 2\ln L = 2k - 2C + \chi^2. As only differences in AICc are meaningful, this constant can be omitted provided n does not change. This is the result I had before, which was correct. Velocidex (talk) 19:39, 18 May 2009 (UTC)
I should also say RSS is used by people who can't estimate their errors. If any error estimate is available for the data points, χ² fitting should be used. Unweighted linear regression is dangerous because it uses the data points to estimate the errors by assuming a good fit. You get no independent estimate of the probability that your fit is good, Q. Velocidex (talk) 19:53, 18 May 2009 (UTC)

I am unhappy with this section. It says "where C is a constant independent of the model used, and dependent only on the use of particular data points, i.e. it does not change if the data do not change."

But this is only true if the σ_i:s are the same for the two models. And under "Equal-variances case" it explicitly says that σ is unknown, hence it is estimated by the models. For instance, if we compare two nested linear models, then the larger will estimate σ to be a smaller value than the smaller model will. In this case it is the converse: the "constant" C will differ between models, whereas the terms with the exponentials will cancel out (they will both be exp(−1)).

The formula with RSS is correct, but the derivation is wrong for the above reason.

All this needs to be fixed. (Harald Lang, 9/12/2015) — Preceding unsigned comment added by 46.39.98.125 (talk) 11:36, 9 December 2015 (UTC)[reply]

Your point seems valid to me. Additionally, it is notable that the subsection "General case" does not have any references (unlike the subsection "Equal-variances case"). Moreover, I have just skimmed through Burnham & Anderson (2002), and did not see any supportive discussion that could be cited.
The editor who first added the text for the "General case" has been inactive for over a year; so asking them would probably not lead anywhere. Does anyone have a justification for keeping the "General case" subsection? If not, I will delete that subsection, and revise the "Equal-variances case" subsection.
SolidPhase (talk) 22:43, 10 December 2015 (UTC)[reply]
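The objection raised in this thread can be made explicit for the equal-variances case. The following is a sketch of the standard calculation, assuming i.i.d. Gaussian errors with a common unknown variance σ² estimated by maximum likelihood; the residual symbol r_i is introduced here for illustration only.

```latex
\begin{align*}
\ln L(\theta, \sigma^2)
  &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} r_i(\theta)^2 ,
  \qquad r_i(\theta) = y_i - f(x_i;\theta) \\
\hat\sigma^2
  &= \frac{1}{n}\sum_{i=1}^{n} r_i(\hat\theta)^2 = \frac{\mathrm{RSS}}{n} \\
\ln \hat L
  &= -\frac{n}{2}\ln\!\left(\frac{2\pi\,\mathrm{RSS}}{n}\right) - \frac{n}{2} \\
\mathrm{AIC}
  &= 2k - 2\ln \hat L
   = 2k + n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + n\,(\ln 2\pi + 1)
\end{align*}
```

Once σ̂² is substituted, the exponential part of the likelihood contributes the same −n/2 to every model, while the normalizing prefactor carries the dependence on RSS. The final RSS formula is therefore sound, but it arises from the prefactor rather than from a model-independent constant C.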

Link spam?

I think the link that appeared at the bottom "A tool for fitting distributions, times series and copulas using AIC with Excel by Vose Software" is not too relevant and only one of many tools that may incorporate AIC. I am not certain enough to remove it myself. Dirkjot (talk) 16:36, 17 November 2008 (UTC)[reply]

Confusion 2

The equation given here for determining AIC when error terms are normally distributed does not match the equation given by Burnham and Anderson on page 63 of their 2002 book. Burnham and Anderson's equation is identical except that it does not include a term with pi. Anyone know why this is? Tcadam (talk) 03:13, 17 December 2008 (UTC)Tcadam (talk) 03:14, 17 December 2008 (UTC)[reply]

Hi, I took the liberty of formatting your question. This is touched on in the "Confusion" section above. I assume you mean this equation:
AIC = 2k + n\left[\ln\left(\frac{2\pi \, RSS}{n}\right) + 1\right]
We should really fix it. Since for logarithms ln(x*y) = ln(x) + ln(y), you can factor out the 2pi term so that AIC = Burnham and Anderson's equation + 2pi term. Since the 2pi term is a constant, it can be removed. This is because AIC is used to rank alternatives as best, second best, etc. Adding or subtracting a constant from the AIC score of all alternatives can't change the ranking between them. EverGreg (talk) 12:18, 17 December 2008 (UTC)
Added the simplified version in the article and emphasized ranking-only some more. By the way, did Burnham and Anderson skip the + 1 term too? EverGreg (talk) 12:39, 17 December 2008 (UTC)
I don't understand why you are using the term RSS/n. I don't see that in at least two of the references I am looking at. It is just RSS. 137.132.250.11 (talk) 09:29, 29 April 2010 (UTC)
Exactly. The 1/n term can be factored out and removed just like the 2pi term, using ln(x/y) = ln(x) - ln(y). It makes no difference if it's there or not, so most books should really go with AIC = 2k + n \ln(RSS), as we have done in the article. The reason we see 2pi and 1/n at all is that they turn up when you take the general formula and add the RSS assumption. We should probably add how this is derived, but I don't have a source on that nearby. EverGreg (talk) 13:19, 29 April 2010 (UTC)
Oh, I didn't check what you did on the article page. Thanks for spotting that! EverGreg (talk) 13:22, 29 April 2010 (UTC)[reply]
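The claim that the 2π, +1 and 1/n terms cannot affect the ranking is easy to verify numerically. Below is a hedged Python sketch; the candidate models and their RSS values are invented, and the "full" form is assumed to be 2k + n[ln(2π·RSS/n) + 1], as discussed in this thread.

```python
import math

def aic_full(rss: float, n: int, k: int) -> float:
    """Gaussian least-squares AIC with all constants kept."""
    return 2 * k + n * (math.log(2 * math.pi * rss / n) + 1)

def aic_simplified(rss: float, n: int, k: int) -> float:
    """Same criterion with the terms that depend only on n dropped."""
    return 2 * k + n * math.log(rss)

n = 40  # all candidates must be fitted to the same n data points
# Hypothetical candidates: name -> (RSS, number of parameters k)
candidates = {"A": (12.5, 2), "B": (9.8, 3), "C": (9.7, 5)}

full = {name: aic_full(rss, n, k) for name, (rss, k) in candidates.items()}
simple = {name: aic_simplified(rss, n, k) for name, (rss, k) in candidates.items()}

# The offset between the two forms is identical for every candidate ...
offsets = {name: full[name] - simple[name] for name in candidates}
# ... so sorting by either form gives the same order.
ranking_full = sorted(full, key=full.get)
ranking_simple = sorted(simple, key=simple.get)
print(offsets)
print(ranking_full, ranking_simple)
```

The offset equals n(ln 2π − ln n + 1), which is why the simplification is only safe when every model is fitted to the same number of data points.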

Further confusion: Is there a discrepancy between AIC defined from the : and the RSS version: ? Don't they differ with an extra ? —Preceding unsigned comment added by 152.78.192.25 (talk) 15:27, 13 May 2011 (UTC)[reply]

Both formulas are valid, but the second one uses the additional assumption of a linear model. You can read that in [2]. It is a bit confusing because they do not state the assumption of the linear model on p. 63, but on p. 12 they derive the log-likelihood for the case of linear models and that makes it clear (I'm referring to page numbers in the second edition as found here). Frostus (talk) 11:04, 16 September 2014 (UTC)
I reverted my change with the linear model. Although it is shown for a linear model in Burnham & Anderson, 2002,[3] this assumption is not needed to derive the equation . I removed the part with "if the RSS is available", as it can always be calculated Frostus (talk) 14:12, 18 September 2014 (UTC)[reply]
It still looks wrong. There should be no ln in the ln(RSS) term after the "=" in this expression: There might very well be such expressions in the literature, but that expression as it stands here is not mathematically valid.

I suspect that the whole derivation concerning chi-square is wrong, since it uses the likelihood function instead of the maximum of the likelihood function in the AIC. — Preceding unsigned comment added by 141.14.232.254 (talk) 19:22, 14 February 2012 (UTC)[reply]

I expect the maximum of the log of the likelihood function to be the same as the log of the maximum of the likelihood function - since log is a monotonically growing function?
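The expectation stated in the last comment can be checked on a toy example. The sketch below uses an invented binomial data set (7 successes, 3 failures) and a grid search rather than anything from the article: it confirms that the likelihood and the log-likelihood peak at the same parameter value and that log(max L) equals max(log L).

```python
import math

successes, failures = 7, 3                  # hypothetical data
grid = [i / 1000 for i in range(1, 1000)]   # candidate values of p in (0, 1)

def likelihood(p: float) -> float:
    return p**successes * (1 - p) ** failures

def log_likelihood(p: float) -> float:
    return successes * math.log(p) + failures * math.log(1 - p)

p_max_lik = max(grid, key=likelihood)
p_max_loglik = max(grid, key=log_likelihood)

# Because log is strictly increasing, both searches select the same p,
# and taking the log before or after maximizing gives the same number.
print(p_max_lik, p_max_loglik)
print(math.log(likelihood(p_max_lik)), log_likelihood(p_max_loglik))
```

This only shows that the order of "log" and "max" is interchangeable. It does not settle the separate concern raised above, namely whether the derivation evaluates the likelihood at its maximum at all.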


Controversy?!

What on earth is this section? It should be properly explained, with real references, or permanently deleted! I would like to see a book on model selection which describes AIC in detail, but also points out these supposed controversies! True bugman (talk) 11:50, 7 September 2010 (UTC)[reply]

This is a good question, but there is probably something that does need to be said about properties of AIC, not necessarily under "Controversy". For example, in this online dissertation, I found "AIC and other constant penalties notoriously include too many irrelevant predictors (Breiman and Freedman, 1983)" with the reference being: L. Breiman and D. Freedman. "How many variables should be entered in a regression equation?" Journal of the American Statistical Association, pages 131–136, 1983. There are similar results for using AIC to select a model order in time series analysis. But these results just reflect the penalty on large models that is inherent in AIC, and arises from the underlying derivation of AIC as something to optimise. Melcombe (talk) 16:04, 7 September 2010 (UTC)[reply]
Time series data is only one of many data types where modelling and AIC are used together. Something like this should be included in a special section dedicated to time series data. True bugman (talk) 12:09, 8 September 2010 (UTC)[reply]
The point was that the supposed problem with AIC is known to occur for both regression and time series, in exactly the same way, so it would be silly to have to say it twice in separate sections. Melcombe (talk) 16:51, 8 September 2010 (UTC)
Regression, as an example, is not covered in this article. Neither is time series. But yes, AIC is not perfect, and yes this should probably be discussed. But in a neutral way, this is by no means a controversy. I believe the entire 'controversy' section should be deleted. These are all recent changes from different IP addresses (110.32.136.51, 150.243.64.1, 99.188.106.28, 130.239.101.140) unsupported by citation, irrelevant for the article (the controversial topics discussed are not even in the article), and it is very poorly written (again there is no connection to the article). What does "crossover design", "given to us a priori by pre-testing", and "Monte Carlo testing" even mean? This section is written as an attack on the technique rather than a non-biased source of information. It is not verifiable WP:V nor written with a neutral point of view WP:NPOV. It must go. True bugman (talk) 17:19, 8 September 2010 (UTC)[reply]

Takeuchi information criterion

I removed the part on Takeuchi information criterion (based on matrix trace), because this seemed to give credit to Claeskens & Hjort. There could be a new section on TIC, if someone wanted to write one; for now, I included a reference to the 1976 paper. Note that Burnham & Anderson (2002) discuss TIC at length, and a section on TIC should cite their discussion. TIC is rarely useful in practice; rather, it is an important intermediate step in the most-general derivation of AIC and AICc.  86.170.206.175 (talk) 16:24, 14 April 2011 (UTC)[reply]

Biased Tone of BIC Comparison Section

I made a few minor edits in the BIC section to try to keep it a *little* more neutral, but it still reads with a very biased tone. I imagine a bunch of AIC proponents had a huge argument with BIC proponents and then decided to write that section as pro-AIC propaganda. You can find just as many papers in the literature that unjustifiably argue that BIC is "better" than AIC, as you can find papers that unjustifiably argue AIC is "better" than BIC. Furthermore, if AIC can be derived from the BIC formalism by just taking a different prior, then one might argue AIC is essentially contained within "generalized BIC", so how can BIC, in general, be "worse" than AIC if AIC can be derived through the BIC framework?

The truth is that neither AIC nor BIC is inherently "better" or "worse" than the other until you define a specific application (and by AIC, I include AICc and minor variants, and by BIC I include variants also to be fair). You can find applications where AIC fails miserably and BIC works wonderfully, and vice versa. To argue that this or that method is better in practice, because of asymptotic results or because of a handful of research papers, is flawed since, for most applications, you never get close to the fantasy world of "asymptopia" where asymptotic results can actually be used for justification, and you can almost always find a handful of research papers that argue method A is better than method B when, in truth, method A is only better than method B for the specific application they were working on. — Preceding unsigned comment added by 173.3.109.197 (talk) 17:44, 15 April 2012 (UTC)[reply]

The difference between AIC and BIC is not explored in this biased article. Some of these differences have been discussed by rather more knowledgeable people, including, for example, Rob Hyndman, who relates:

   AIC is best for prediction as it is asymptotically equivalent to cross-validation.
   BIC is best for explanation as it allows consistent estimation of the underlying data generating process.

Please follow the link http://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other CarlWesolowski (talk) 17:39, 1 October 2016 (UTC)[reply]
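For readers comparing the two criteria, the only formal difference is the penalty term. The Python sketch below assumes the standard definitions AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L); the log-likelihood and k are invented. It says nothing about which criterion is preferable for a given application, which is the actual subject of this thread.

```python
import math

def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik

def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik

log_lik, k = -50.0, 3   # hypothetical fit
for n in (5, 7, 8, 100, 10_000):
    # Per-parameter penalty: 2 for AIC, ln(n) for BIC.
    print(f"n={n:>6}  AIC={aic(log_lik, k):.2f}  BIC={bic(log_lik, k, n):.2f}  "
          f"ln(n)={math.log(n):.3f}")
```

BIC penalizes each parameter more heavily than AIC once ln(n) > 2, i.e. for n ≥ 8, so on all but very small samples it favors smaller models.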

AIC for nested models?

The article states that AIC is applicable for nested and non-nested models, with a reference to Anderson (2008). However, looking up the source, there's no explicit indication that the AIC should be used for nested models. Instead, the indicated reference just states that the AIC can be valuable for non-nested models. Are there other sources that might be more explicit? — Preceding unsigned comment added by Redsilk09 (talk · contribs) 10:11, 18 July 2012 (UTC)

I agree with the above comment. I've tried using the AIC for nested models as specified by the article, and the results were nonsensical. — Preceding unsigned comment added by 152.160.76.249 (talk) 20:14, 1 August 2012 (UTC)[reply]

I agree, and provide a counterexample: https://stats.stackexchange.com/q/369850/99274 CarlWesolowski (talk) 06:30, 31 March 2020 (UTC)

BIC Section

The BIC section claims Akaike derived BIC independently and credits him as much as anyone else in discovering BIC. However, I have always read in the history books that Akaike was very excited when he first saw a BIC derivation (Schwarz's?), and that seeing it inspired him to develop his own Bayesian version of AIC. I thought it was well-documented historically that this was the case, and that he was a very gracious man who didn't think of BIC as a competitor to him, but thought of it as just yet another very useful and interesting result. His only disappointment, many accounts do claim, was that he didn't think of it himself earlier. Isn't that the standard way that all the historical accounts read?

Your version of events seems right to me. Akaike found his Bayesian version of AIC after seeing Schwarz's BIC. BIC and the Bayesian version of AIC turned out to be the same thing. (Maybe you should edit the article?) — Preceding unsigned comment added by 86.156.204.205 (talk) 14:10, 14 December 2012 (UTC)

Removed confusing sentence

I removed the following sentence: "This form is often convenient, because most model-fitting programs produce χ² as a statistic for the fit." The statistic produced by many model-fitting programs is in fact the RSS (e.g. Origin [8]). But the RSS cannot simply replace χ² in these equations. Either the σ_i have to be known, or the following formula should be used: AIC = n ln(RSS/n) + 2k + C. — Preceding unsigned comment added by 129.67.70.165 (talk) 14:34, 21 February 2013 (UTC)

Example?

The example from U. Georgia is no longer found; so I deleted it. It was:

I added the best example I could find with a Google search: [Akaike example filetype:pdf]. Done. Charles Edwin Shipp (talk) 13:31, 11 September 2013 (UTC)


Error in Reference

AIC was introduced by Akaike in 1971/1972 in "Information theory and an extension of the maximum likelihood principle", not in 1974. Please correct it. — Preceding unsigned comment added by 31.182.64.248 (talk · contribs) 01:05, 23 November 2014

Recent edits by Tayste

@SolidPhase: Explain yourself. The "relative quality of a model" is ungrammatical - relative to what? This must be "models" plural. As for measuring "quality" - this sounds like higher values of AIC mean greater quality, but the reverse is true, so this should be made clear up front in the lead. Why remove that? Thirdly, WP:HEADINGS states that "Headings should not refer redundantly to the subject of the article". Lastly, it seems (to me) better to talk about how AIC works before discussing its limitations. Tayste (edits) 18:07, 18 June 2015 (UTC)[reply]

@Tayste: Okay, how about removing the word "relative" from the first sentence?
About "models" plural, will you elaborate? AIC gives the value for a single model; so I think that singular is appropriate.
I disagree with mentioning about higher/lower values in the lead. Following WP:LEAD, this looks like clutter that someone who read only the lead would not benefit from. The issue is discussed in the second paragraph of the body: in the first sentence, and italicized.
Which heading referred redundantly to the subject of the article?
Does your last point pertain to my edit?
SolidPhase (talk) 18:28, 18 June 2015 (UTC)[reply]

As an interim measure (only), I have restored the body to my last edit, but kept your lead section. SolidPhase (talk) 19:47, 18 June 2015 (UTC)[reply]

It has now been over four days.
Regarding the sentence "Lower values of AIC indicate higher quality and therefore better models", as above I think that including this is clutter, which will be especially distracting for people who only read the lead. Additionally, there are many activities where the minimum is the optimum, e.g. golf. Moreover, in the field of Optimization, the canonical examples are minimization. I definitely believe that the sentence should be removed; so I have now done that.
Regarding the grammatical changes that you made, I do not agree. Back in March, though, you found a grammatical error: and you were correct, of course. Hence I get the impression that you have a really good grammatical knowledge. I do not understand what you find grammatically wrong about the previous version, though, or why your version is correct. Simply put, I am confused about this(!). Your edits to the grammar remain as you made them, but I would really appreciate it if we could discuss this issue further. Will you explain the reasons for your grammatical change more?
SolidPhase (talk) 19:19, 22 June 2015 (UTC)[reply]

I've stayed away (partly) to give other editors an opportunity to chip in. The AIC value for a single model is completely meaningless in isolation. It tells absolutely nothing about the quality of that model. AIC values are only useful when the differences in values are taken for pairs of models fitted to the same data set. So the word "relative" must be there. Thank you for retaining the plural in the first sentence.
Despite your specific counter examples, the generally understood meaning of the verb to measure is that it assigns higher numbers for greater amounts of the aspect being measured. In terms of relative measurement, AIC measures not the quality of models but their lack of quality, since higher values mean worse. I disagree that it clutters the lead to state the direction in which AIC works. It is a fundamentally important point to get across early for anyone wishing to understand what AIC is. Tayste (edits) 20:56, 22 June 2015 (UTC)[reply]
To me (a far-from-expert in grammar), the phrase “relative quality of statistical models” seems inappropriate, because “quality” is singular and “models” is plural.
The lead currently states that AIC “offers a relative estimate of the information lost”; so the less information lost, the better. Regarding measure, this is a formal term in mathematics, and the definition requires that all measures be nonnegative. What about replacing the term “measure” by something else?—e.g. “AIC provides a means for assessing the relative quality of statistical models”. Could something like that be okay?
SolidPhase (talk) 22:17, 22 June 2015 (UTC)[reply]
"Quality" is indeed singular, but "relative quality" necessitates a comparison involving at least a pair of models. I'd be happy with "means" but I think "measure" here was being used in the more general sense (anywhere on the Real line) rather than that specific mathematical definition. The point about the information lost is actually a better definition than "quality". Tayste (edits) 22:44, 22 June 2015 (UTC)[reply]
Quick comments. The lead is confusing and misrepresents. AIC is a number that is calculated without reference to other models. It is a metric that does not depend on other models. That is, the value of AIC ignores all other models. AIC's value is not "relative to each of the other models".
AIC can be used to rank different models under the AIC metric. Using that metric does not guarantee the earlier statement that "Lower values of AIC indicate higher quality and therefore better models." The notion of "better" is tempered by the metric. AIC might deprecate an exact model due to its complexity.
I don't care much about which scores are better. For fits, lower chi square values are better, so smaller is better is not a foreign concept.
I don't know if AIC has some value as an absolute metric. For example, if input variances are known, then reduced chi square near 1 suggests a good model.
Glrx (talk) 05:46, 23 June 2015 (UTC)[reply]
I agree that the lead is confusing. I have spent 1–2 hours trying to come up with something better, but so far I have got nothing constructive to propose. Including the sentence about lower values indicating higher quality makes the lead more confusing, which is why I have been advocating keeping the sentence out.
I agree that metric is more appropriate than measure, considering the formal mathematical definitions. I also think that the mathematical definitions are highly relevant, given that AIC is part of some fairly advanced mathematical statistics. One problem with "metric", though, is that the word is not commonly known. Hence, if the word were used, people without the requisite mathematical background would be confused.
SolidPhase (talk) 09:42, 23 June 2015 (UTC)[reply]
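
For concreteness, the point made in this thread (each AIC value is computed for one model on its own, but only differences between models fitted to the same data are interpretable) can be sketched as follows. This is a minimal illustration, assuming the standard definition AIC = 2k − 2 ln(L̂); the model names, parameter counts and log-likelihood values are invented for the example and do not come from the article.

```python
import math

def aic(k, log_likelihood):
    """Standard definition: AIC = 2k - 2 ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical candidates fitted to the SAME data set:
# name -> (number of parameters k, maximised log-likelihood)
candidates = {
    "model_a": (2, -120.0),
    "model_b": (3, -117.5),
    "model_c": (5, -117.0),
}

# Each score is computed without reference to the other models.
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}

# Only the differences from the smallest score carry meaning.
best = min(scores.values())
deltas = {name: s - best for name, s in scores.items()}

# Relative likelihood of each model with respect to the best one.
rel_lik = {name: math.exp(-d / 2) for name, d in deltas.items()}

for name in sorted(scores, key=scores.get):
    print(name, scores[name], deltas[name], round(rel_lik[name], 4))
```

Adding the same constant to every score leaves `deltas` unchanged, which is one way to see why a single AIC value says nothing in isolation.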

Are you quite sure that AIC is ranked with lowest as best? Here are rankings from a Mathematica case of the 5 best models for a problem that uses BIC for ranking:

BIC AIC HQIC
3.841 3.857 3.845
3.815 3.825 3.818
3.735 3.746 3.738
3.732 3.742 3.735
3.458 3.468 3.461

Note that they go from highest as best to lowest as worst. I checked on this and it seems that some programs output -AIC, not AIC. However, the word is "index." In addition to being accurate, it is in common usage and people with no higher mathematical training understand it. Please change this or post an objection to the change; otherwise I will change it. If you then change it back without discussion, which is typical, we will have a dispute, as I will keep changing it back until there is a dispute settlement. CarlWesolowski (talk) 14:21, 11 July 2016 (UTC) CarlWesolowski (talk) 18:37, 14 July 2016 (UTC)[reply]

I strongly oppose using the word "index", because the word would be confusing here.
Your claim that I undo your edits "without discussion, which is typical" is false, and slanderous.
Your threat to start an edit war is in violation of Wikipedia norms.
SolidPhase (talk) 14:30, 15 July 2016 (UTC)[reply]

No surprise that you object to the word "index." However, not only is AIC an index, but as it is based on Shannon entropy, it is a data-specific index, namely self-information. So, most indices would smoke it, and they tend to be at least somewhat comparable between data sets. I am just asking for a more objective presentation. Moreover, I am convinced that you cannot take this article to the next level. I have found some of the advocates for AIC use to be unusually partisan, which, given AIC's very limited applicability due to the restrictive assumptions not being met as a common occurrence, is somewhat difficult for me to reconcile with objectivity. CarlWesolowski (talk) 22:36, 12 September 2016 (UTC)[reply]

Assessment comment[edit]

The comment(s) below were originally left at Talk:Akaike information criterion/Comments, and are posted here for posterity. Following several discussions in past years, these subpages are now deprecated. The comments may be irrelevant or outdated; if so, please feel free to remove this section.

Hello,

I am not a statistician and therefore can only provide remarks about things I was unable to understand. My concern is about "k":

In the paragraph "Definition" it is said that k is the number of parameters in the statistical model

In the paragraph "AICc and AICu" it is said that k denotes the number of model parameters + 1.

If these two ks are different, then why give them the same name? If they are not, is there a problem of definition somewhere?

Last edited at 13:27, 7 October 2009 (UTC). Substituted at 19:44, 1 May 2016 (UTC)

Personal attack by SolidPhase upon CarlWesolowski with reversal of all edits[edit]

Some of the edits are ungrammatical, e.g. "AIC use as one means of model selection". Some of the edits introduce technical invalidity, e.g. "each candidate model has residuals that are normal distributions". I have undone the edits.

CarlWesolowski has been sporadically making edits to this article since at least 20 March 2015. Each time, those edits have been undone. My suggestion is this: if CarlWesolowski wants to make changes to the article, then he should discuss those changes on this Talk page, and get a consensus of editors to agree to the changes.
SolidPhase (talk) 06:18, 8 July 2016 (UTC)[reply]

That my edits can and have been reversed is not surprising. That SolidPhase takes exception to my person is inexcusable, calling me "ignorant" on my talk page. This article is misleading. The premise "Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection." is not logical. For example, BIC will yield better quality of fit than AIC. BIC is more appropriate for model selection than AIC. When a model is already selected, AIC will provide better estimates of that model's parameter values than BIC.

AIC is only one of many criteria for model selection, and often suggested for use when it is inappropriate. "The AIC is not a consistent model selection method"-point 10 of Facts and fallacies of the AIC by Rob J Hyndman [1]. The introduction reads like a commercial for cigarettes. Mathematica uses BIC, as one of several tests whose combined score ranks models, and, there are lots of good folks who may compute AIC but not use it to rank. In addition to AIC, other methods used include BIC, step-wise partial probability ANOVA, HQIC, log likelihood, complexity error, factor analysis, and goodness of fit testing with Pearson Chi-squared, Cramer Von Mises probabilities and others. And, without looking at those other measurements, any pronouncements made with respect to model selection using AIC should be ignored. It is said that AIC does not assume that there is a true model, but BIC does. BIC is also more self-consistent. Neither of these maximum likelihood approaches is appropriate for model selection when the objective is extrapolation, not interpolation, as the goodness-of-extrapolation makes goodness-of-fit irrelevant.

One section says, in rather poor quality English, "Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting." Let us take this one statement at a time.

Candidate models do not "assume." People assume. The requirement for normally distributed residuals is unnecessary; that happens approximately 10% of the time. Normally distributed residuals are not a requirement for AIC any more than they are for maximum likelihood. Again quoting Hyndman-point 3-"The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead. The AIC is the penalized likelihood, whichever likelihood you choose to use." Again with inanimate objects making assumptions, tsk, tsk.

"That gives rise to least squares model fitting." Well, no it doesn't. Other assumptions for OLS can include homoscedasticity, and fixed intervals on the x-axis. Otherwise, OLS fit parameters are biased and only approximate. Summarizing, I really think that one should consider pulling back from the claims herein and injecting some perspective into this sloppy article. You will not let me fix this article, so fix it yourselves. — (talkcontribs) 03:08, 9 July 2016 (UTC) CarlWesolowski (talk) 07:51, 29 January 2017 (UTC)[reply]

A more general treatment of the fit problem that may be worth mentioning is QML, Quasi-Maximum Likelihood, based upon [2]. It is currently totally unclear what the statistical use of AIC is in the article, so fix it. The current article is dangerous; it promotes AIC without sufficient insight as to appropriate usage. CarlWesolowski (talk) 17:09, 10 July 2016 (UTC) CarlWesolowski (talk) 07:51, 29 January 2017 (UTC)[reply]

The word "a" is used as an indefinite article. Thus, it is clear that there might be more than one. The paragraph also links to model selection, which lists 13.
Some of your remarks about model assumptions might be appropriate for the article on model selection. They are, however, not specific to AIC.
It is colloquial to talk about models (rather than people) assuming something.
It is common to assume that "the residuals are distributed according to independent identical normal distributions (with zero mean)". That assumption "gives rise to least squares model fitting", as the article states. Other assumptions can also give rise to least squares model fitting, but that is irrelevant in the context.
SolidPhase (talk) 19:15, 10 July 2016 (UTC)[reply]

Thank you for responding. However, the "a" is too soft. In the matter of implication, hinting at something is not as good as saying it. This article has that problem throughout, and it is less useful in that form than it would be if it were more clearly written. For example, let us take the infamous sentence "Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting." It took me a very long time to figure out what you are trying to say and, BTW, do not. Consider for "that gives rise to least squares..." it is unclear that it does, and most people having studied least squares would still not know what you are getting on about. Consider saying something relevant rather than making the reader study the phrase to make any sense out of it, namely, note [3] that "There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed." You do not say what you are assuming, and that is not a problem for the editors, but, it is a big problem for the readers.

When you use the "colloquialism" as you call it, you depreciate not only the language, but mask the fact that you have imposed an assumption, which does not help the reader understand what you are saying. The phrase "the residuals are distributed according to independent identical normal distributions (with zero mean)" is so inaccurate that it is nearly unintelligible. I think perhaps that you are obliquely referring to ML ~ND, where and , or some such. Take a look at [4]. It is much more clearly written than this Wikipedia entry. It is not misleading, it is not oversold, and it gives a much better indication of where AIC is in the universe of methods. Try to emulate that level of clarity, please. What happens when the residuals are not ND? Surely you realize that that is most of the time. AIC, BIC and maximum likelihood can be, and should be, defined in that broader context, in which case there is no direct relationship to OLS, such that the relationship to OLS for normal residuals is an aside that does more to confuse than to clarify. CarlWesolowski (talk) 23:50, 10 July 2016 (UTC) CarlWesolowski (talk) 00:54, 11 July 2016 (UTC) CarlWesolowski (talk) 18:51, 14 July 2016 (UTC) CarlWesolowski (talk) 21:23, 6 February 2017 (UTC)[reply]

@SolidPhase The sentence "Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting." is incorrect, because 1) Models do not make assumptions, and when you write as if they do, you confuse not only the reader but also yourself. To wit, 2) When the residuals are actually normally distributed, only then are OLS and AIC, as both applied to normal residuals, the same. However, 3) AIC does not assume normal residuals because A) AIC can be applied to non-normal residual structure, and B) The assumption of normal residuals is clearly yours as it is not a requirement for AIC. CarlWesolowski (talk) 22:52, 10 November 2016 (UTC)[reply]
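
The relationship disputed in this section can be written out explicitly. The following is a sketch under the stated assumption of i.i.d. zero-mean normal errors with unknown variance; the notation (n observations, residuals e_i, RSS, and k counted parameters) is chosen here for illustration and is not taken from the article.

```latex
% Log-likelihood under i.i.d. N(0, \sigma^2) errors, with residuals e_i(\theta):
\ln L(\theta, \sigma^2)
  = -\frac{n}{2}\ln\!\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} e_i(\theta)^2 .

% For any fixed \sigma^2, maximising over \theta is the same as minimising
\mathrm{RSS}(\theta) = \sum_{i=1}^{n} e_i(\theta)^2 ,
% i.e. least squares. Maximising over \sigma^2 then gives
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n},
\qquad
\ln \hat{L} = -\frac{n}{2}\left[\ln(2\pi) + \ln\frac{\mathrm{RSS}}{n} + 1\right].

% Substituting into AIC = 2k - 2\ln\hat{L}:
\mathrm{AIC} = 2k + n\ln\frac{\mathrm{RSS}}{n} + n\left[\ln(2\pi) + 1\right].
```

Under this particular assumption the maximum-likelihood fit coincides with the least-squares fit. The definition AIC = 2k − 2 ln(L̂) itself does not require it: with a different likelihood the first line changes and the link to RSS disappears.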

Recent edit war by anonymous IPs[edit]

This section is created to discuss the recent edits by anonymous IPs. SolidPhase (talk) 10:28, 31 October 2016 (UTC)[reply]

Number of parameters?[edit]

In the article, it is written "If the model under consideration is a linear regression, k is the number of regressors, including the intercept". This is wrong, isn't it? What about the error variance? Shouldn't the error variance also count as a parameter? — Preceding unsigned comment added by 193.174.15.2 (talk) 09:30, 3 January 2017 (UTC)[reply]
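
As a rough numerical illustration of the question above, the sketch below compares two counting conventions for k in a Gaussian linear regression: coefficients only, and coefficients plus the error variance. It assumes i.i.d. normal errors with the variance estimated by maximum likelihood; the data and function names are invented for the example.

```python
import math

def gaussian_max_loglik(rss, n):
    # Maximised log-likelihood for i.i.d. normal errors,
    # with the variance estimated as rss / n.
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def rss_intercept_only(y):
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def rss_simple_linear(x, y):
    # Closed-form least-squares fit of y = intercept + slope * x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((v - intercept - slope * u) ** 2 for u, v in zip(x, y))

def aic(k, loglik):
    return 2 * k - 2 * loglik

# Made-up data, roughly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
n = len(y)

ll0 = gaussian_max_loglik(rss_intercept_only(y), n)   # intercept only
ll1 = gaussian_max_loglik(rss_simple_linear(x, y), n)  # intercept + slope

# Convention A: k counts regression coefficients only.
diff_a = aic(1, ll0) - aic(2, ll1)
# Convention B: k also counts the error variance.
diff_b = aic(2, ll0) - aic(3, ll1)

print(diff_a, diff_b)
```

Because convention B adds 1 to k for every candidate model, every AIC value shifts by 2 and the differences between models are the same under both conventions. That no longer holds for the small-sample correction AICc, whose correction term depends on k nonlinearly, so there the convention needs to be stated.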

External links modified[edit]

Hello fellow Wikipedians,

I have just modified one external link on Akaike information criterion. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Cheers.—InternetArchiveBot (Report bug) 00:45, 29 June 2017 (UTC)[reply]

AIC asymptotics, misleadingly phrased?[edit]

@BetterMath:, could you please explain why the sentence “Asymptotically, AIC selects the model that minimizes the mean squared error of (out-of-sample) prediction,” is wrong and/or misleading? By definition, an asymptotically efficient (information) criterion chooses the (candidate) model that minimises the mean squared error of prediction, and by Stone (1977) AIC is asymptotically efficient. --bender235 (talk) 23:56, 16 January 2018 (UTC)[reply]

The statement is true in some cases, but it is not known to be true in general, e.g. for panel data. Stone (1977) considers regression only, and the article does say that "in the context of regression … AIC is asymptotically optimal for selecting the model with the least mean squared error". Even for univariate time series, the status of the statement is only partially known (Ing & Wei, Annals of Statistics, 2005). If you (or someone else) want to study this issue more, perhaps also consider the cited work of Akaike (1985).  BetterMath (talk) 00:50, 17 January 2018 (UTC)[reply]

Discussion of lead[edit]

This section is created pursuant to WP:BRD, to discuss recent proposed changes to the lead. (Consider also WP:Lead.)

Having the lead use the word "score" in this context seems wrong, because score has a technical meaning in statistics that does not apply in this context. Having the lead mention that a lower AIC is better seems inappropriate to me, because it is a technical detail—a detail, moreover, that is well discussed in the Definition section; it seems much more appropriate to have the lead tell what AIC does, rather than tell how AIC does things. Having the lead claim "Thus, AIC provides a means for model selection that deals with the trade-off between the goodness of fit of the model and the simplicity of the model" seems wrong, because the "Thus" is not logically supported by the context. Having the first paragraph of the lead worded as proposed does not even suggest that AIC can only evaluate relative quality, which is surely inappropriate and confusing.

For the above reasons, I have reverted to the prior version.  SolidPhase (talk) 16:19, 29 June 2018 (UTC)[reply]

Hi, thanks for the remarks about the change I proposed! I've made a new edit that incorporates the points you made. Please do improve it directly (plus talk discussion), rather than reverting -- I prefer discussion combined with a WP:BOLD, BOLD, BOLD cycle. As WP:ROWN says: "It is usually preferable to make an edit that retains at least some elements of a prior edit than to revert the prior edit."
Changes with respect to my first proposal:
  • I've replaced the word 'scores' with 'numbers'. 'Scores' might still be the better word, though: in statistics 'scoring' also has the general meaning of 'assigning a number'. Compare the z-score a.k.a. standard score, propensity score matching, test scores, and the notion of 'scoring questions'. What do you think?
  • I've kept the higher/lower words. If somebody who does not know the AIC encounters one and looks up the article, 'lower AICs are better' is the first thing they will want to know.
  • I've kept the 'Thus', because that sentence describes the high-level consequences of the low-level properties described in the preceding sentence.
  • You are right that AIC is only a relative measure. I've made the last sentence refer to multiple models, to make this more clear. The word 'relative' already appears in the first sentence, and in the third paragraph of the intro, so now it's triply clear.

Let me know what you think!

--Sietse (talk) 15:17, 12 July 2018 (UTC)[reply]
I believe that the changes you made overall make the lead more confusing. For example, a sentence with two parenthesized phrases is awkward. Overall, the first paragraph is bordering on incomprehensible.
An additional issue is that you flagged your edit as minor. A minor edit should follow WP:Minor: an edit that the editor believes requires no review and could never be the subject of a dispute. Plainly, your edit could be subject to dispute.  SolidPhase (talk) 19:17, 12 July 2018 (UTC)[reply]
I did not mark my main edit, at 16:09, 12 July 2018, as minor. The edit I marked as minor was a subsequent one that fixed a typo. I would hate for you to accidentally think I was engaging in bad faith :-)
As for your opinion that my changes make the lead more confusing and 'bordering on incomprehensible': I have reread my last proposal with an open mind, and I respectfully disagree. So let's work on the text. Here is the last text I proposed, and which you reverted:
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. It rewards (gives lower AICs to) models with high likelihood, but penalizes (gives higher AICs to) models with more parameters. Thus, AIC provides a means for choosing between models that deals with the trade-off between the goodness of fit of the models and the simplicity of the models.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data.
I prefer this text (which incorporates your suggestions) over the status quo ante because:
  • it makes clear to the layman how to interpret an AIC (lower is better),
  • it gives the layman an idea of how AIC functions (goodness of fit is rewarded, parameters are penalized)
Your latest objections were only to the phrasing. Let's avoid falling into a cycle where I only propose and you only reject. Could you propose a better phrasing which also incorporates this content?
Cheers, Sietse (talk) 23:18, 12 July 2018 (UTC)[reply]
My apologies about the minor comment; you are correct.
There was some discussion about the lead back in June 2015 (see above). Then, User:Glrx said that "The lead is confusing" and I replied "I agree". We could not come up with anything that we thought was better though. Since then, the lead has improved a little, but it is still confusing.
Your proposed changes clearly make the confusion worse. I previously commented about how a sentence with two parenthesized phrases is awkward. There is also the use of "likelihood"—not even wikilinked—even though not everyone will understand what that means. Your proposed first paragraph contains far more technical information than the current paragraph. Despite that, it does not convey the essence of what AIC does nearly as well as the current version.
Additionally, your proposed use of "Thus" is not logically proper, as I commented before.
SolidPhase (talk) 15:11, 13 July 2018 (UTC)[reply]
Since we agree that the current lede could be better, but disagree about the relative quality of the status quo and the quality of the proposal, and since your opinion is as good as mine and mine as good as yours, shall we ask for a third opinion? I'd be curious to hear what an outside party would recommend. Sietse (talk) 19:27, 13 July 2018 (UTC)[reply]
My belief is that you should first address the points that I have raised. SolidPhase (talk) 17:39, 14 July 2018 (UTC)[reply]
  1. Wikilinking 'likelihood' is fine, that's a good suggestion! It's a good word, though: one that statisticians understand precisely, and laymen understand approximately.
  2. The proposal does contain more technical information. It is, however, a description of the mechanism in simple language. I therefore believe that it does a better job of presenting the AIC, by making it easy for the reader to understand what it does. I realise we disagree on this: hence my proposal to get a third opinion.
  3. The word 'thus' also means 'in this way or manner'. Again, I realise that you claim it will be interpreted as a logical 'therefore', while I claim readers will realise the lede is not a formal argument. Let's avoid the discussion entirely by writing 'In this way'. (You could have proposed that, too!)
  4. Remember WP:BRD-NOT: "BRD is not a valid excuse for reverting good-faith efforts to improve a page simply because you don't like the changes. BRD is never a reason for reverting. Unless the reversion is supported by policies, guidelines or common sense exists, the reversion is not part of BRD cycle."
  5. I don't think you and I are any closer to agreeing on the fundamental 'is this lede clearer, or more difficult to read' point. So let's get somebody else in, and see what they think. I've added a subsection below, and will post a request on WP:3O.
-- Sietse (talk) 14:49, 15 July 2018 (UTC)[reply]


  1. The problem with "likelihood" is that some people confuse it with probability. Your claim that "laymen understand approximately" is, in general, false. Worse, many laymen will erroneously believe that they understand better than they actually do. Thus, the proposed paragraph would mislead laymen.
  2. You acknowledge that your proposed paragraph contains more technical information. Thus, it is more difficult for most people to understand. You claim that the proposed paragraph "does a better job of presenting the AIC"; I believe that the opposite is true. In particular, I cannot comprehend the proposed first paragraph on a single reading. The lead, and especially the first paragraph, should be readily comprehensible to non-specialists, as per WP:LEAD; your proposal does not do that. Additionally, as I noted earlier, it seems much more appropriate to have the lead tell what AIC does, rather than tell how AIC does things: one of the reasons for this is that non-specialists will not be able to understand how. The proposed paragraph is so confusing that even reading it a few times, I find it difficult to tell what AIC does.
  3. Both "thus" and "in this way" fail with your proposal. The current logic is clear: a higher quality implies preferred selection. The logic in your proposal is not clear; if you are going to claim otherwise, please give details.
The general issue of readability has been studied extensively. There are automatic estimates of readability that have proved helpful in practice. The estimate that is most commonly used, by far, is the Flesch–Kincaid grade level. (This estimate has been shown to correlate with actual readability at about 90%; it is used in the military, in some legal situations, by many businesses, etc.; it is also incorporated in programs such as Microsoft Word, which I have used here.) The grade level for the current first paragraph is 11.6; the grade level for the proposed first paragraph is 13.2. Thus, there is strong objective evidence that the proposed paragraph is substantially more confusing than the current paragraph.
Moreover, the grade levels here surely underestimate the difference in actual readability. First, because the Flesch–Kincaid grade level does not consider the double parenthesization of phrases, which obviously complicates the sentence. Second, because the Flesch–Kincaid grade level assumes that readers well understand the meanings of all the words, which is not true for "likelihood".
SolidPhase (talk) 18:31, 16 July 2018 (UTC)[reply]
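
For readers who want to check grade levels like those quoted above, here is a minimal sketch of the Flesch–Kincaid grade-level formula. The formula is the published one; the sentence and syllable counting in `crude_counts` is a rough heuristic written for this example, so its results should not be expected to match Microsoft Word or the figures given in this thread.

```python
import re

def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def crude_counts(text):
    """Very rough word, sentence and syllable counts (heuristic only)."""
    # A sentence is approximated by a run of terminal punctuation.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    # A syllable is approximated by a run of vowels, at least one per word.
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return len(words), sentences, syllables

sample = "The cat sat. It ran!"
w, s, syl = crude_counts(sample)
print(w, s, syl, round(fk_grade(w, s, syl), 2))
```

As noted in the comment above, the formula only sees sentence length and syllable counts, so it cannot account for parenthesized phrases or for unfamiliar technical words.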


Third opinion[edit]

Sietse (talk)'s request for a third opinion on the following (see discussion above): of the two proposed lede paragraphs below, which do you recommend we use (with or without changes)? (SolidPhase, I think that is also what the dispute is about from your point of view. But if you'd also like the third opinion to address a different question, feel free to add that here.)

  • The current version:
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. (In doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
  • An alternative proposal:
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. It rewards (gives lower scores to) models with high likelihood, but penalizes (gives higher scores to) models with more parameters. In this way, AIC provides a means for choosing between models that deals with the trade-off between the goodness of fit of the models and the simplicity of the models.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data.

3O Response: I feel that I understand the topic better after reading both the current and proposed leads. I agree that:

  • It's probably not a bad idea to communicate that 'a lower result is better'. Some readers may visit the article because AIC estimates are used with little explanation in another article – or a sister project – and that's information they'd want to obtain from the lead.
  • Multiple parenthetic statements do make that one sentence a bit complex for the lead. And though likelihood is wikilinked, suggesting a specific meaning, that meaning is not clear and it's preferable to not make readers chase links.

The 3O asks which I recommend, and my choice would be the current version. But that's the sort of revert/no-revert choice that the requester doesn't like. I feel that there is room to build. The lead is short and some information could be added. I feel the 'lower is better' concept could be added as well as summarizing other parts of the article like advantages/disadvantages of using AIC and its history (mentioning Hirotugu Akaike and the year he published). This information should probably not go in the MOS:LEADPARAGRAPH. I suspect that given a little more room outside that first paragraph, it'd be easier to state it without that awkwardness.
For what it's worth, that's my non-binding opinion. I'll try to keep an eye open for any follow-up. – Reidgreg (talk) 21:36, 17 July 2018 (UTC)[reply]

Thank you for your time and your advice! You have written many helpful things: advice on which version to start working from, advice on the strengths and weaknesses of both versions, and advice on lead paragraphs in general. We're going to do good things with this :-D
Kind regards, Sietse (talk) 08:55, 19 July 2018 (UTC)[reply]


Proposed revision[edit]

A proposed revision is below. The first paragraph is the same as before; the second paragraph is expanded; the third paragraph is the same as before; the fourth paragraph is new (and it mentions Hirotugu Akaike, as recommended by User:Reidgreg).

The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on information theory: it estimates the relative information lost when a given model is used to represent the process that generated the data. The less information is lost, the higher is the quality of the model. (In making an estimate of the information lost, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
AIC does not provide a test of a model in the sense of testing a null hypothesis. It tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that.
AIC was formulated by the statistician Hirotugu Akaike. It now forms the basis of a paradigm for the foundations of statistics; as well, it is widely used for statistical inference.


Perhaps the third paragraph should be moved to the Definition section.

@Glrx:
SolidPhase (talk) 09:45, 19 July 2018 (UTC)[reply]

Further discussion of lead[edit]

The prior section was for discussion of the wording of the lead. This section is for further discussion. It was created pursuant to a comment from User:Glrx that "the [second] paragraph is difficult and should be rewritten". As per the prior section, recommendations for rewording the lead are solicited. (Note that, in the prior section, I twice wikilinked Glrx.)

Additionally, Glrx has now changed a word in the lead: "representation" was changed to "model". The context is this clause: "When a statistical model is used to represent the process that generated the data, the representation will almost never be exact" versus "When a statistical model is used to represent the process that generated the data, the model will almost never be exact". My preference is for "representation".

SolidPhase (talk) 23:08, 2 December 2018 (UTC)[reply]

I don't have time for this debate; too much is going on in RL. Your task here was to garner consensus for your edit before reinserting it. See WP:BRD. I do not support the edit and you have not obtained any other editor's support.
The paragraph is extremely poor and is using "represent" as a synonym for model. The contorted sentence says "a statistical model is used to model the process that generated the data". See my edit comments. The sentence is awkward and verbose. A simpler, clearer statement is "a statistical model will almost never be exact". Flush "represent" completely; it is not needed. It's only there to hide the circularity of the statement.
Glrx (talk) 03:33, 7 December 2018 (UTC)[reply]
Your suggestion, “a statistical model will almost never be exact”, does not make sense to me: exact what? The sentence needs to say what is inexact. Your claim that the statement has “circularity” is not something I understand.
You still do not seem to realize what a statistical model is. Please read the article statistical model. In particular, as stated in that article, a statistical model is a representation. Your repeated failure to grasp that basic aspect of the definition has led to your comments so far having been repeatedly nonsensical. For example, your comment above complains about “using "represent" as a synonym for model”. It is ironic that you would fail that way given that your first edit summary was “model is technical term”: yes, model is indeed a technical term, and you should find out what it means!
As for WP:BRD, you did not participate in the discussion (which I opened). Hence your complaint seems ill-founded.
You also complain that the second “paragraph is extremely poor”. I very much welcome (sensible) recommendations, as I have strongly demonstrated in the prior section. I spent hours writing the second paragraph. I am sure that it can be improved further though; so if you can recommend something, please do.
(An idea that I had was to take the last, parenthetical, sentence and put that in a separate paragraph. I wanted to avoid a single-sentence paragraph though. Perhaps you have an idea around that?)
Note that AIC is used for selecting a statistical model: that is stated, repeatedly and clearly, in the first paragraph. As such, it is assumed afterward that the reader has some familiarity with statistical models. Your not knowing even the definition of a statistical model is presumably part of the reason that you find the second paragraph difficult to understand.
It would also be helpful if you accurately understood what “representation” means. There are definitions at e.g. Wiktionary and OxfordDictionaries.com.
SolidPhase (talk) 23:47, 10 December 2018 (UTC)[reply]

splitting long equation[edit]

There is a long equation in the section "Replicating Student's t-test". The equation is so long that it is split over two lines. Should the equals sign go on the first line (at the end) or on the second line (at the beginning)?

I did not find a relevant policy or guideline on this. I prefer that the equals sign be on the first line, for two reasons. First, because it makes the first line clearer: when someone reads the first line, they know immediately that the next line is the other side of an equation, and that eases reading of the whole equation. Second, because it lessens the overall display width.

@Michael Hardy:

BetterMath (talk) 10:43, 27 July 2019 (UTC)[reply]

"relative quality of statistical models"[edit]

@BetterMath: I'm not going to revert again because I'm tired of your one-sentence explanations for removing my contribution. I gave two reliable sources that state verbatim what I added to the article; here are three more sources that confirm that this definition applies to time series models as well (which are regression, too, anyways, but let's leave this discussion aside). So after I've done my due diligence, how about you do your part: (i) please show me a reliable source that contradicts any of the five that I've found, and more importantly (ii) explain to me what "relative quality of statistical models" is supposed to mean: what "quality" are we talking about? Best looking? Fastest converging? --bender235 (talk) 14:07, 18 September 2019 (UTC)[reply]

The quality of a statistical model is explained in the second paragraph of the article: in particular, "the less information a model loses, the higher the quality of that model". The explanation is elaborated in the section "Definition".
I will look at the references you cite and get back to you on them.
BetterMath (talk) 18:03, 18 September 2019 (UTC)[reply]
This "information loss" definition is absurdly vague. What AIC weighs models by is predictive power, i.e. minimal forecasting (or out-of-sample) error. As described in all five sources I've named. --bender235 (talk) 20:19, 18 September 2019 (UTC)[reply]
A time series is different from a regression. With a regression, the log-likelihood function is additive: 𝓁(⟨ y1, …, yn⟩) = Σi 𝓁(yi). Such additivity does not generally hold for time series, nor for most other non-regression models. Hence, citing sources that only apply to regressions is invalid.
The three more sources that you cite (Lütkepohl, Hyndman & Athanasopoulos, Chatfield) do not seem to mention the word “deviance”.
AIC is fundamentally founded on information theory. That is why it is called an information criterion. The Definition section makes the foundation clear, including citing the original paper of Akaike. The issue is discussed in much detail by Burnham & Anderson (2002)—a book about AIC that has over 47000 citations on Google Scholar.
Your claim about the Definition section being “absurdly vague” makes no sense. What parts of it are vague to you?
You obviously do not have familiarity with the relevant literature. Maybe you should spend some time reading it before doing more on this.
BetterMath (talk) 21:11, 18 September 2019 (UTC)[reply]
Thanks for the personal attack. I take it as a badge of honor, given that you have yet to cite any literature contradicting what the five sources I mentioned clearly state.
Also you seem to not understand what "regression" means. Log-likelihoods are additive when observations are independently distributed. This assumption is violated when serial correlation is present (as is the case in time series or panel data), but—putting the necessary tweaks to the loglikelihood function aside—we'd still call it a regression. However, that's not even the point, so let me move on.
AIC is a model selection criterion that "ranks" models according to their predictive (out-of-sample) accuracy. It gives an estimator of the same "measure" that comes from cross-validation, only without actually having to do the computationally burdensome CV. This fact was derived by Akaike using information theory, but that doesn't mean we have to stick to an ominously vague definition of AIC when several modern textbooks give us clear and easy-to-understand definitions.
PS: Lütkepohl, Hyndman & Athanasopoulos, and Chatfield mention forecast errors or mean-square errors as criteria, which are obviously related to deviance (which you would know if you had read the article). Generally deviance is about how far off a model's prediction is from the real observation. RSS et al. are special cases of this idea. --bender235 (talk) 22:30, 18 September 2019 (UTC)[reply]
I believe that the article would benefit from a brief discussion of prediction errors. Such a discussion would need to be technically valid and properly sourced. Maybe put a draft version here on Talk and then get a consensus of editors to accept that.
BetterMath (talk) 03:27, 19 September 2019 (UTC)[reply]
Why would I waste my time on this? I've already named five reliable sources for the half-a-sentence correction I've proposed (compared to precisely zero from your side), and you just keep reverting my contributions anyways. --bender235 (talk) 21:01, 20 September 2019 (UTC)[reply]


@Bender235: I thought that I should come back to this. I’ve used statistical deviance very little in my own studies, and didn’t previously notice an error in your reference.

The reference McElreath (2015) does say that AIC is an “estimate of the average out-of-sample deviance”. McElreath, however, is wrong. Indeed, we could easily reduce the deviance to 0, by increasing the number of parameters—but that would lead to overfitting. This issue is discussed in the Definition section.

The error by McElreath points to a more general issue. In most science-related fields, it is reasonable to cite undergraduate-level references with the assumption that the reference is valid. In some fields, however, that assumption is not reasonable. Thermodynamics is one such field. Statistics is another. Many statistics references contain serious errors. So if the references are cited blindly, those errors will corrupt Wikipedia.

That is what almost happened here. You made an edit in good faith (I assume), and cited a reference that easily meets WP:RS. If your edit had not been undone, then the article would’ve contained an error. The way to avoid such errors is to ensure that the editor (in this case, you) has a deep comprehension of the topic and the reference.

Something similar happened before. You edited this article in January 2018, including a citation of Stone (1977). That edit too was based on your comprehension of the reference not being deep enough (as I explained on this Talk page at the time).

Just now, I was looking at the article Parametric model. There was a problem in the definition given there. I fixed the problem, and looked at the article history. The prior edit to the article was made by you. I checked and discovered that it was your edit that introduced the problem.

The above are three illustrations of a general issue. It is important to make textual edits to a statistical article only if the editor has a deep comprehension of the topic.

BetterMath (talk) 22:36, 5 October 2019 (UTC)[reply]

@BetterMath: The reference McElreath (2015) does say that AIC is an “estimate of the average out-of-sample deviance”. McElreath, however, is wrong. Indeed, we could easily reduce the deviance to 0, by increasing the number of parameters—but that would lead to overfitting.
You're wrong. Adding model parameters reduces in-sample deviance arbitrarily (and all the way to zero if the number of parameters is equal to the number of observations). However, McElreath talks about out-of-sample deviance. Which is correct. And adding parameters obviously does not decrease out-of-sample deviance (to the contrary). Decreasing out-of-sample deviance arbitrarily by adding parameters (or any other way) is not possible; if it were, neural networks could be fit to 100% accuracy, stock prices could be predicted perfectly, etc. (which, in case you didn't notice, hasn't happened yet). The rest of your post is not worth commenting on, because you seem not to understand the issue at hand if you confuse in- and out-of-sample statistics. Just for the record, though, let me tell you that neither McElreath nor Taddy are undergraduate textbooks. As one can read in the preface, McElreath's book is meant for first-year PhD students; Taddy's book is meant for (and used in) data science classes in the MBA program at Chicago Booth. --bender235 (talk) 01:24, 6 October 2019 (UTC)[reply]
P.S.: I had all but forgotten my edit from January 2018. Indeed I added the important out-of-sample target of the AIC, which unfortunately you don't understand today more than you did two years ago. --bender235 (talk) 01:37, 6 October 2019 (UTC)[reply]
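For readers following this thread, the in-sample versus out-of-sample distinction can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not taken from McElreath or any other cited source: the data-generating quadratic, the noise level, the sample sizes, and the helper names `gaussian_deviance` and `fit_and_score` are all hypothetical choices. It fits nested polynomial regressions with Gaussian errors and reports the in-sample deviance, the deviance on fresh data from the same process, and AIC computed as in-sample deviance plus 2k.

```python
import numpy as np


def gaussian_deviance(y, mu, sigma2):
    """-2 * log-likelihood of y under independent N(mu, sigma2) errors."""
    n = y.size
    return n * np.log(2 * np.pi * sigma2) + np.sum((y - mu) ** 2) / sigma2


def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial by least squares; return (in-sample deviance,
    out-of-sample deviance, AIC)."""
    coefs = np.polyfit(x_train, y_train, degree)
    fitted = np.polyval(coefs, x_train)
    sigma2 = np.mean((y_train - fitted) ** 2)  # maximum likelihood variance
    k = degree + 2  # degree + 1 coefficients, plus the error variance
    d_in = gaussian_deviance(y_train, fitted, sigma2)
    d_out = gaussian_deviance(y_test, np.polyval(coefs, x_test), sigma2)
    return d_in, d_out, d_in + 2 * k


# Hypothetical data-generating process (an assumption for illustration only).
rng = np.random.default_rng(0)


def simulate(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0.0, 0.3, n)
    return x, y


x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

degrees = list(range(1, 9))
results = [fit_and_score(d, x_train, y_train, x_test, y_test) for d in degrees]

for d, (d_in, d_out, aic) in zip(degrees, results):
    # Deviances are reported per observation so the two samples are comparable.
    print(f"degree={d}  in-sample={d_in / 30:.3f}  "
          f"out-of-sample={d_out / 1000:.3f}  AIC={aic:.2f}")
```

Because the models are nested and fit by least squares, the in-sample deviance cannot increase as the degree grows; the out-of-sample column carries no such guarantee, which is the distinction at issue in the comments above.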
Yes, I had misunderstood on out-of-sample. There are still the prior problems that I raised though. First, mentioning the term deviance is poor practice, because it is not known to most readers. Using the term in the body of the article would be fine, but it should not be used in the opening sentence. The goal is to explain the topic in a way that readers will tend to find as easy to read as practicable. Second, the claim about out-of-sample deviance is not known to be true, in general. In particular, there are issues when the log-likelihood function is not additive. There is some relevant discussion in the cited paper of Ing & Wei.  BetterMath (talk) 20:17, 7 October 2019 (UTC)[reply]
AIC is a technical term that will inevitably require other technical terms to properly define it. That's how it works. I'm opposed to using a vague, nonsensical definition ("relative quality") instead of a proper one, because contrary to your beliefs the current definition poses more questions to the reader than it answers ("what is quality?"). And unlike for deviance, there is no article defining it (mostly because it's a meaningless term to begin with).
Second, the out-of-sample deviance target for AIC is true. It only fails when the assumptions behind AIC aren't met, but then AIC is improperly applied to begin with. That paper you mentioned, Ing & Wei, does not mention log-likelihoods anywhere (let alone a supposed issue with non-additivity), so I'd like to see a proper citation for your claim. --bender235 (talk) 20:46, 7 October 2019 (UTC)[reply]
There are two main derivations of AIC. One requires strong assumptions: this is the original derivation by Akaike. The other is very general: this is based on the work of Takeuchi. (Some discussion of those is in the History section.)
Quality is precisely defined as K–L discrepancy from the true model, and is the basis for AIC, regardless of the derivation. That’s discussed in the Definition section. The first two paragraphs give an intuitive explanation of that, which most readers will (I think) comprehend.
Requiring most readers to click on a wikilink in order to comprehend the opening sentence is surely a poor practice.
Ing & Wei discuss some problems that arise when observations are not independent. Independence of observations is equivalent to additivity of the log-likelihood function. Additivity of the log-likelihood function is assumed by McElreath when defining deviance: see his definition of deviance (in section 6.2.4).
BetterMath (talk) 22:18, 8 October 2019 (UTC)[reply]
@BetterMath: I'm not sure that I understand what you mean by "independence of observations is equivalent to additivity of the log-likelihood function," because log-likelihood-based models are ubiquitous in time series analysis where, obviously, observations are not independent, and yet because of the likelihood underpinning, AIC is used frequently in model selection. Of the books I cited above most are specifically from time series econometrics. Lutkepohl, for instance, discusses AIC and its use for time series models used for forecasting (in which case AIC targets the forecasting error).
Are you claiming that all of these texts are wrong? That all of these authors, not to mention the ones relying on them, are using AIC wrong because they somehow forgot about the serial dependence of their time series observations? If so, that is a pretty bold statement that sure as hell needs a reliable source. So far you haven't presented any. I'm waiting. In particular, I'm looking for (i) a definition of "additivity" in this context, and (ii) why only independent observations lead to an "additive" log-likelihood.
Apart from all that, I find it absurd that you think "out-of-sample deviance" is too technical, but "quality defined as K–L discrepancy" is fine. Do you really believe that the modal reader has a more intuitive understanding of information theory and Kullback–Leibler divergence than he has of forecasting errors? --bender235 (talk) 22:19, 11 October 2019 (UTC)[reply]

Maybe this discussion might additionally consider WP:Lead? TheSeven (talk) 03:57, 12 October 2019 (UTC)[reply]

MOS:LEAD is always a concern for articles. But what in particular do you mean? --bender235 (talk) 17:40, 13 October 2019 (UTC)[reply]


McElreath defines deviance (in section 6.2.4) as −2 Σi log(qi), and states “i indexes each observation (case), and each qi is just the likelihood of case i”. The quoted statement does not make sense, because a likelihood is a function of the model parameter, not of an observation.
In prior sections (6.2.2–6.2.3), McElreath actually defines qi as the probability of case i. Thus, in the quoted statement, McElreath seems to be using likelihood to mean probability (some people do confuse the two). My earlier writing copied his usage. Although his usage is wrong, I didn’t want to distract from the point that I was making by raising the issue.
The definition of independence is that Pr(A and B) = Pr(A) ⋅ Pr(B). Assuming the probabilities are non-zero, that equation is equivalent to log Pr(A and B) = log Pr(A) + log Pr(B).
Deviance is usually defined as −2 log(Pr(x | θ)) with θ set to the maximum likelihood estimate. If the data, x, consists of independent observations, then we get the same formula as McElreath, but with qi denoting the probability of observation i. The formula given by McElreath is wrong if the observations are not independent.
As to Lütkepohl, the part that you cite is based on FPE, which is due to Akaike. Akaike’s approach is considered by Ing & Wei, and shown to be in error.
The work of Ing & Wei seems to be seminal, and it has been discussed in many subsequent papers (see Google Scholar). I’ve looked at only a few of those papers. One example is doi:10.1017/S0266466609990107, by Ing et al.: the authors find the mean-squared prediction error for a certain class of time series. Another example is doi:10.1111/j.1467-842X.2007.00487.x. Both those examples show how difficult it is to understand predictions, when the observations are not independent.
A third example is doi:10.1016/j.jmva.2015.01.004. An extract from that is:
One of the primary obstacles to deriving the properties of out-of-sample forecasts is the potential dependence between the variable to be forecast and the sample used in model estimation. Often the dependence problem is circumvented altogether by using the so-called independent realization (IR) condition, whereby the forecast variable is assumed to be a statistically independent replicate of the process used for model estimation. Such an assumption is unnatural in many applications. For example, in autoregressive (AR) time series forecasting, future realizations of the time series are dependent on earlier data used to estimate the AR model.
A more appropriate condition in many applications is same-sample realization (SSR), whereby the variable to be forecast is generated by the same process as the data used to estimate the model. However, under sufficient restrictions on the dependence in the data, the IR assumption often delivers a sufficiently accurate approximation of SSR forecast loss. For a class of short memory processes, Ing and Wei show that the mean square forecast error (MSFE) of a least squares autoregression under IR is equivalent to the MSFE under SSR up to an o(kT⁻¹) approximation.
.........
Panel data models are increasingly being used in forecasting applications. The purpose of this paper is to investigate whether IR provides a similar shortcut for deriving the properties of panel data forecasts. Due to parameterized heterogeneity in the forecasting model (such as fixed effects), the relevant approximation of the LS MSFE under SSR is not equivalent to the MSFE under the IR assumption. This is because the LS transformation used to partial out these heterogeneous parameters generates dependence in the transformed time series comprising the panel. Because SSR is a more realistic condition in many empirical forecasting applications, the IR assumption should only be employed with caution when deriving the properties of panel data forecasting models.
That’s essentially the main point that I made in my remark of 17 January 2018. (And I had not seen the paper when I made the remark.)
As to your last paragraph, my comments referred to the opening sentence of the article. K–L discrepancy is not discussed in the opening sentence, nor should it be. The same is true for deviance.
Thanks to TheSeven for citing WP:Lead. I see it has a section about the First sentence. The section well supports my comments pertaining to the opening/first sentence.
BetterMath (talk) 16:18, 14 October 2019 (UTC)[reply]
I'm not quite sure why you are so obsessed with disproving McElreath, especially since you do not seem to understand the content of his book. The sentence you quote from page 182 is unquestionably correct, since his (contrived but nonetheless valid) example estimates one (binomial) probability parameter per event (see p. 178), in which case i indexes both the parameter vector and the observations. McElreath does not confuse likelihood and probability; his description of probability and likelihood (and AIC, for that matter) are perfectly accurate.
Second, and more important than this nitpicking of a valid source: you have yet to explain (i) why only independence leads to "additive likelihoods," and (ii) why that matters in the first place. Your toy example with two independent events makes no sense (if B is not independent of A, you have P(A and B) = P(A)P(B|A) and thus log[P(A)] + log[P(B|A)], which you could generalize ad infinitum, so where exactly is the problem?). Apart from that, everybody and their mother uses AIC to evaluate models of non-independent (time series) data. If you really think they are all wrong, I suggest you publish a peer-reviewed article on the issue. But to save you both effort and embarrassment, just have a look at Lutkepohl, who lists numerous examples of likelihood functions of time series models, for instance here the log-likelihood of a VAR process, which obviously is not constructed under the assumption of independence. Is Lutkepohl's equation (3.4.5) an "additive likelihood" according to whatever definition you apply or not?
Finally, we would very well be within the guidelines of MOS:LEAD if we define and explain AIC as an estimate of out-of-sample (or forecasting) error. We gain an enormous amount of precision and intuition in exchange for just a minor increase in technicality. --bender235 (talk) 20:45, 14 October 2019 (UTC)[reply]
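The two notions of "additive" being argued over in this thread can be compared numerically. The following is a minimal sketch, assuming a stationary Gaussian AR(1) process with hypothetical parameter values (phi = 0.7, innovation variance 1); the function names are made up for illustration and are not from Lütkepohl or any other cited source. It evaluates the joint log-likelihood three ways: directly from the multivariate normal density, as a chain-rule sum log P(x1) + Σ log P(xt | xt−1), and as a sum of marginal log-densities that ignores the dependence.

```python
import numpy as np


def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var), evaluated elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik_chain(x, phi, sigma2):
    """Chain rule: log P(x1) + sum over t of log P(x_t | x_{t-1})."""
    stationary_var = sigma2 / (1 - phi ** 2)
    first = normal_logpdf(x[0], 0.0, stationary_var)
    rest = normal_logpdf(x[1:], phi * x[:-1], sigma2)
    return first + rest.sum()


def ar1_loglik_joint(x, phi, sigma2):
    """Joint multivariate normal log-density with the AR(1) covariance."""
    n = x.size
    idx = np.arange(n)
    lags = np.abs(idx[:, None] - idx[None, :])
    cov = sigma2 / (1 - phi ** 2) * phi ** lags
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


def ar1_loglik_marginals(x, phi, sigma2):
    """Sum of marginal log-densities, i.e. pretending observations are independent."""
    return normal_logpdf(x, 0.0, sigma2 / (1 - phi ** 2)).sum()


# Hypothetical parameter values, chosen only for illustration.
phi, sigma2, n = 0.7, 1.0, 50
rng = np.random.default_rng(1)
x = np.empty(n)
x[0] = rng.normal(0.0, np.sqrt(sigma2 / (1 - phi ** 2)))
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(0.0, np.sqrt(sigma2))

loglik_joint = ar1_loglik_joint(x, phi, sigma2)
loglik_chain = ar1_loglik_chain(x, phi, sigma2)
loglik_marginals = ar1_loglik_marginals(x, phi, sigma2)

print("joint density:        ", loglik_joint)
print("chain-rule sum:       ", loglik_chain)
print("sum of marginal terms:", loglik_marginals)
```

Under these assumptions the chain-rule sum reproduces the joint log-likelihood, so the log-likelihood is a sum of conditional terms even with dependent data; what generally fails under dependence is the sum of per-observation marginal terms.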
Your reply claims that AIC is an estimate of out-of-sample forecasting error. My prior comment explained why that claim is not known to be true when the observations are not independent. It also gave multiple references that discussed that issue. It even quoted from one of those references, including the phrase "out-of-sample forecasts". Etc. Your reply ignores most of my comment. BetterMath (talk) 20:27, 15 October 2019 (UTC)[reply]
I ignored most of your comment because you haven't yet addressed the very basic issues, like what the heck you mean by "additive likelihood," and how dependent random variables would somehow not generate such a likelihood. Now please define that term, or cite a reliable source that does.
You see, the problem it seems to me is that you don't understand even the literature that you are citing. Above you claim Ing & Wei (2005) to be a source for why AIC does not apply to time series models, when in fact the abstract of that very paper states: "We present the first theoretical verification that AIC and its variants are still asymptotically efficient for same-realization predictions," and they define "same-realization prediction" as "predicting the future of the observed time series." What else would prediction and forecasting refer to anyways? --bender235 (talk) 01:56, 16 October 2019 (UTC)[reply]
Your second paragraph asserts that I “claim Ing & Wei (2005) to be a source for why AIC does not apply to time series models”. Yet I’ve never made such a claim. (Moreover, in my work, I’ve many times used AIC when analyzing time series.)
As to your first paragraph, I suggest that we focus on one issue. The issue is this: most references that discuss the properties of predictions (from models selected via AIC) do not adequately address the difficulties that arise when the observations are not independent. My comments have given four reliable sources pertaining to this issue. Your comments haven’t adequately considered that.
BetterMath (talk) 17:41, 16 October 2019 (UTC)[reply]
There is a clear disconnect between what those sources you've listed are saying, and what you think they are saying. So let's go at it one-by-one: if observations are not independently distributed, what would a likelihood function look like in your opinion (in particular, how would it not be "additive," whatever you define this to be)? --bender235 (talk) 17:45, 16 October 2019 (UTC)[reply]
Your first sentence seems to be based on ignoring the first paragraph of my last comment.
About additivity, McElreath defines deviance (in section 6.2.4) as −2 Σi log(qi). The Σ denotes addition: I assume you know that. The usual definition of deviance is −2 log(Pr(x | θ)) with θ set to the maximum likelihood estimate. The usual definition is not the same as McElreath’s definition for all statistical models (i.e. in general), only for some statistical models.
If you want to go back earlier, start with my first comment, of 17 January 2018. You previously indicated that you didn’t understand that comment. Your reply of 16 October 2019, though, quoted from the paper of Ing & Wei. Since you have now looked at the paper, I suggest that you reread my first comment.
BetterMath (talk) 21:28, 17 October 2019 (UTC)[reply]
Ok, let me take you by the hand and derive it for you: in a Bernoulli distribution, 0 ≤ q ≤ 1 is the parameter indicating the probability of success, i.e. of the random variable x being, say, 1 rather than 0. In other words, Pr(x = 1) = q. If you have multiple observations of said variable x, then Pr(x1 = 1, …, xn = 1) = q ⋅ q ⋯ q = q^n. The last step obviously resulted from multiplying the success probability n times, and if you take the log you'd get n log(q). Now assume that each observation comes from a Bernoulli distribution with its very own qi, i.e. Pr(xi = 1) = qi. Then Pr(x | θ) = Πi qi, where θ = (q1, …, qn) is the parameter vector. Then hopefully you'll agree that log(Pr(x | θ)) = Σi log(qi), where the Σ denotes addition. Do you see now (finally!) that −2 log(Pr(x | θ)) is in fact equal to −2 Σi log(qi) if, as in McElreath's example, the relevant probability is Bernoulli? --bender235 (talk) 23:41, 17 October 2019 (UTC)[reply]
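The equality claimed in this comment, that for independent Bernoulli observations with their own success probabilities qi the usual deviance −2 log(Pr(x | θ)) equals −2 Σi log(qi), can be checked with a few lines of arithmetic. This is only a sketch: the probabilities below are hypothetical values, not taken from McElreath's example, and every observation is assumed to be a success, as in the derivation.

```python
import math

# Hypothetical per-observation success probabilities q_i (illustration only).
q = [0.9, 0.6, 0.75, 0.3]

# Joint probability of observing a success in every case, assuming independence.
joint_probability = math.prod(q)

# Usual definition: -2 log Pr(x | theta).
dev_joint = -2 * math.log(joint_probability)

# Per-observation form: -2 * sum_i log(q_i).
dev_sum = -2 * sum(math.log(qi) for qi in q)

print(dev_joint, dev_sum)
```

The two quantities agree up to floating-point rounding because the logarithm turns the product of independent probabilities into a sum; the check says nothing about the dependent case.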
The derivation that your reply gives (to show equality) is correct and obvious. And, it is consistent with the second paragraph of my prior comment.
Your reply is not based on what my prior comment said. You have previously also given replies that were not based on what my comments said.
BetterMath (talk) 03:39, 19 October 2019 (UTC)[reply]
I am losing my mind here. First you claim McElreath used a "wrong definition," and after I demonstrate that he was correct, you call it "correct and obvious." Are you just trolling me? Seriously, I had enough. So far you twice claimed the sources I named contained errors, which in both cases resulted from basic misunderstanding of statistical concepts (like "in-sample" vs. "out-of-sample" statistics). At this point you have lost any credibility in terms of expertise on this subject. I had enough of your ownership behavior, and I'm tired of feeding the troll. I will restore my contribution to the lead, because it is factually correct, well sourced, and intuitively easy to understand. --bender235 (talk) 18:33, 19 October 2019 (UTC)[reply]

I do not know how to deal with someone who exhibits your level of reasoning capabilities. Your choice is to select a method of Dispute Resolution or have me report you. BetterMath (talk) 18:47, 19 October 2019 (UTC)[reply]

Filed. This needs a third opinion. --bender235 (talk) 20:02, 19 October 2019 (UTC)[reply]
External editor here: I didn't read all the arguments, but I think that the lede needs some rewriting (not necessarily limited to what has been done/attempted). Per MOS:FIRST, the first sentence should be accessible to the nonspecialist reader (not necessarily a layman), in plain English, and standalone (most info about the topic should be graspable by reading the lede). However, it should also be a concise summary of the most pertinent characteristics of this concept, so accuracy is important to take into account. My remarks: 1. the lede seems too broad IMO, as it mostly focuses on what its usage is (model selection), rather than what it actually is and how it works (for instance, I wouldn't be able to say what differentiates AIC from other model selection criteria); maybe some inspiration could be drawn from F1 score's lede? 2. any new info should be first implemented in a section, rather than in the lede, as the lede is simply a summary of the entry's content, so I feel this discussion should first be redirected into reworking the Definition section (and maybe others), before adding this info to the lede (notably the "out-of-sample" info, which is not at all present in the entry currently). That's all; I may try to read the whole arguments if I have enough time and maybe provide additional comments --Signimu (talk) 18:47, 20 October 2019 (UTC)[reply]
I much appreciate this. About the first sentence, I am still thinking about a constructive way forward. About your point #2, I strongly agree. About your point #1....
Regarding what AIC actually is, the second paragraph discusses the concept of information and then says “AIC estimates the relative amount of information lost by a given model”; so that, I thought, is a reasonable explanation of what AIC is. (Perhaps it would be better as “AIC is an estimator of the relative amount of ...”?) Would you elaborate on your reasons for finding the second paragraph inadequate?
AIC is used for model selection, and that’s about it. Perhaps, though, the lead does focus on that too much, as your comment says. What would you think of deleting the third paragraph (or merging the third paragraph into the Definition section)?
BetterMath (talk) 16:58, 21 October 2019 (UTC)[reply]
@BetterMath: could you finally please give us a single concrete example of a log-likelihood that is not additive, or equivalently a likelihood that is not multiplicative, so as to rewrite the law of probability that states P(A ∩ B) = P(A)P(B|A). I'm tired of this evasion. --bender235 (talk) 18:54, 21 October 2019 (UTC)[reply]
About my 1st suggestion, by reading the entry, I understand AIC is a way to select the model that minimizes the error (like any model selection criterion), but here using information error to be more precise. Is that all there is to it? It seems very broad to me, as there are lots and lots of ways to measure information. Is AIC tied to a specific information metric, or does it cover all instances? E.g., if I used KLD, or MI, or NMI as the metric to measure the model's error, is this in all cases an AIC? I however agree that the lede needs to be accessible, but it also needs to be accurate. Also it does not necessarily need to target lay people IMO, but the audience that is the most likely to end up on this page, so most likely people that are interested in statistics or machine learning. bender235 please take a moment to read WP:CALC and [9]. Using a mathematical demonstration can be useful in the talk page to select appropriate sources and information to include in the article, but a reliable source is still needed, and the source needs to describe your end point. If for example a source provides only the premises and then the conclusion is one only you derive, this cannot be included in the article (but it can be used in the talk page to argue about which source to choose to include in the article). That said, I did not read everything on this talk page and I don't have the competence to check the mathematical demonstrations, but the requirement to provide a reliable source for any content on WP's mainspace is mandatory, just make sure to comply with that --Signimu (talk) 21:48, 21 October 2019 (UTC)[reply]
BetterMath, about your suggestion, yes I think the 2nd and 3rd paragraphs could be merged as they are a bit redundant (e.g., no need to write that "AIC is founded on information theory", simply link "information" to "information theory" and that's it). Here's my suggestion (feel free to edit, particularly if I introduced an inaccuracy):
When a statistical model is used to represent the process that generated the data, the representation will almost never be exact, as some information will be lost when using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model, which allows AIC to deal with the trade-off between the model's goodness of fit and simplicity. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.
That said, I think the fact that the lede has no references attached anywhere is a good indicator that it needs rewriting. In a similar situation, I would first try to rework the article's body and solve any dispute in the most pertinent section (here I'd suggest in definition) and then reuse that in the lede. There is BTW nothing that prevents the Definition section from being entirely mathematical, in fact if the lede contains a "layman" or at least easy to access description of AIC, it is often advised to be also somewhere in the article's content, per MOS:LEADREL. Just my 2 cents, hope this can help --Signimu (talk) 22:04, 21 October 2019 (UTC)[reply]
@Signimu:, you should've indeed read the above discussion before suggesting WP:CALC. I presented two RS for the definition of AIC; two sources that BetterMath claims to be wrong, because they supposedly assume "additive log-likelihood," a concept BetterMath has yet to define (let alone name a source for). I've been asking him to define this concept for weeks now; at this point, I'd be satisfied with just a simple example of a "non-additive log-likelihood". Just so we finally understand where BetterMath's remaining misconceptions are (again, Signimu, if you read the above discussion, you'll see there are plenty). --bender235 (talk) 22:53, 21 October 2019 (UTC)[reply]
Yes, it was only a potentially useful reminder, but I did not mean that you or BetterMath broke it ;-) (and the 2nd link provides a very interesting discussion on when original research is permitted and even useful, and I think your discussion here falls in this category). I have read a bit more of the rest of the discussion, that's very interesting, and I think I isolated the 2 main roots of this disagreement:
  1. Accuracy vs accessibility of the lede: BetterMath argues that the lede should be simple enough to be accessible to any reader, whereas Bender235 argues that it should be accurate. MOS:FIRST reads: "The first sentence should tell the nonspecialist reader what, or who, the subject is. It should be in plain English" and "If its subject is definable, then the first sentence should give a concise definition: where possible, one that puts the article in context for the nonspecialist. Similarly, if the title is a specialised term, provide the context as early as possible". Emphasis is mine. As I read it, the lede should indeed be accessible, but not necessarily to the lay person, we can hence assume some reasonable degree of familiarity with the necessary foundational concepts, with no need to delve too much into off-topic details. For other concepts that may be necessary but where there is doubt whether the reader would be familiar with them, wikilinks can be used (eg, deviance). The lede should however provide the context, so a mention of AIC being a model selection criterion is necessary, as it is now, which seems good. In conclusion, I would suggest dropping this constraint: don't limit yourselves to writing for lay people, aim for an accurate description written in simple English for a reader who presumably has some statistical literacy.
  2. On the definition of AIC in terms of forecasting error: there is a disagreement between both authors on whether the sources that define AIC as a forecasting error measure (to oversimplify) are correct or not. This discussion is interesting, edging on WP:OR but is permitted on talk pages per consensus[10] if it helps select which sources are valid and hence usable for the encyclopedia. I don't have the competence to help on this, so I would advise 2 things, that were already advised: 1. first try to work and reach a consensus for other sections such as Definition, and later work on the lede depending on what result you get (and reuse the same refs), 2. contact the WikiProject Statistics for a 3rd opinion, the DRN already made a request[11], hopefully someone will show up... --Signimu (talk) 01:09, 22 October 2019 (UTC)[reply]
As a first concrete step, I would suggest that both of you lay down a simple list of a few reliable sources, along with quotes (not too long) from each of them where they support the point you are arguing, with succinct or no comment on your part. This would give a clear starting/reset ground to discuss further, for you and for potential third parties. --Signimu (talk) 01:16, 22 October 2019 (UTC)[reply]
Ok I've read everything. It's less complicated than I thought. I would suggest you both lay down a clear list of the reliable sources that support your point, as a way to recap, and we'll see from there. Thank you for your patience, we'll make this work! --Signimu (talk) 01:45, 22 October 2019 (UTC)[reply]
@Signimu: I agree with MOS:LEAD 100%. So, as somebody without a deep knowledge in statistics, you tell me if "prediction error" is a concept that is too technical for the lead. --bender235 (talk) 02:53, 22 October 2019 (UTC)[reply]
Bender235 Indeed, no, I don't think this requires deep knowledge in statistics, and I'm not a statistician, although I have some statistical literacy. But anyway there's a more foolproof way to reach a consensus on this point, by looking at other consensus: Hidden_Markov_model, Nearest-neighbor_chain_algorithm and Theil–Sen_estimator are all good articles in the Statistics WikiProjects, and they all contain a fairly accurate lede that is not accessible to the layperson, but is accessible to statistically literate individuals. This further confirms that this point is moot. What remains is what accurate definition/description to use, and for that the reliable sources take precedence --Signimu (talk) 15:30, 22 October 2019 (UTC)[reply]
Agreed. Here's what the sources I named write: "AIC provides a surprisingly simple estimate of the average out-of-sample deviance." (McElreath) and "The AIC is an estimate for OOS deviance." (Taddy). Deviance is a more general term than prediction error, but the idea is the same. --bender235 (talk) 15:36, 22 October 2019 (UTC)[reply]

Thank you for providing these sources, I quickly read the context of the quotes you provided, I think they would be pertinent additions to the definition and lede. And they answer my initial questions, of what is AIC and how it differs from other information criteria. Here is what caught my eye: From McElreath:

AIC provides an approximation of predictive accuracy, as measured by out-of-sample deviance. All information criteria aim at the same target, but are derived under more and less general assumptions. AIC is just the oldest and most restrictive. AIC is an approximation that is reliable only when...

From Taddy:

The AIC is an estimate for OOS deviance. It is targeting the same statistic that you are estimating in a CV experiment: what your deviance would be on another independent sample of size n. You know that the IS deviance is too small—since the model is tuned to this data, the IS errors are an underestimate of the OOS errors. Some more deep theory shows that IS minus OOS deviance will be approximately equal to 2df, and this is the basis for Akaike’s AIC. [...] Basically, while AIC and AICc are trying to optimize prediction, the BIC is attempting to get at a “true” model. This leads the BIC to be more conservative, and in small to medium-sized samples it behaves much like the CV-1se rule. However, in large samples we find that it tends to underfit for prediction purposes—it chooses λ that are too big and models that are too simple.

In Taddy there is also a paragraph about the intuition behind the "corrected AIC" which I think is interesting and would be pertinent to add to extend the AICc section.

BetterMath, could you please provide reliable sources and concise quotes for your point too? Thank you in advance --Signimu (talk) 17:37, 22 October 2019 (UTC)[reply]
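As a rough illustration of the "2df" gap quoted from Taddy above, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from either source: a mean-only Gaussian model with known unit variance (so df = 1 and the deviance is the residual sum of squares up to a constant), independent observations, and the hypothetical helper name `deviance_gap`. Under those assumptions the average gap between out-of-sample and in-sample deviance should come out near 2 in magnitude.

```python
import random


def deviance_gap(n=20, reps=20000, seed=0):
    """Average (out-of-sample minus in-sample) deviance.

    Assumed setup (illustrative only): i.i.d. N(0, 1) data, a model that
    estimates just the mean (df = 1), and known unit variance, so the
    deviance equals the residual sum of squares up to an additive constant.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(reps):
        train = [rng.gauss(0.0, 1.0) for _ in range(n)]
        test = [rng.gauss(0.0, 1.0) for _ in range(n)]  # fresh independent sample
        mu_hat = sum(train) / n                          # maximum likelihood estimate
        in_sample = sum((y - mu_hat) ** 2 for y in train)
        out_sample = sum((y - mu_hat) ** 2 for y in test)
        total_gap += out_sample - in_sample
    return total_gap / reps


if __name__ == "__main__":
    # The theoretical gap under these assumptions is 2 * df = 2.
    print(deviance_gap())
```

The sketch covers only the independent-observation case; it says nothing about the dependent-data settings that the rest of this thread is about.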

For a summary, kindly see "Summary of dispute by BetterMath" on the Dispute resolution noticeboard. The main issue is that predictions are difficult to understand when the observations are not independent.
McElreath assumes that the observations are statistically independent; so it is not relevant here. I have not looked at Taddy, but assume that it is the same.
My above comments noted four reliable sources: Ing & Wei (doi:10.1214/009053605000000525); doi:10.1017/S0266466609990107; doi:10.1111/j.1467-842X.2007.00487.x; doi:10.1016/j.jmva.2015.01.004. For the last source, a relevant extract is in my above comment of 14 October.
BetterMath (talk) 21:07, 22 October 2019 (UTC)[reply]
Thank you BetterMath I will try to summarize each source you provided, please correct me if I misrepresent them or the point/quote you would extract from them:
  • [12]: studies AIC under same-realization predictions (in other words: original AIC assumed independent observations, here they assume dependent observations). Quote: "This study shows that AIC also yields a satisfactory same-realization prediction in finite samples. On the other hand, a limitation of AIC in same-realization settings is pointed out. It is interesting to note that this limitation of AIC does not exist for corresponding independent cases."
  • [13] extends the previous, but from assumed stationarity to non-stationary same-realization predictions. Quote: "[...] which shows that the asymptotic efficiency (see (32) in Section 3 of the present paper) of AIC and a two-stage information criterion of Ing (2007) in various stationary time series models carries over to nonstationary cases."
  • [14] proposes another information-based criterion, FIC. Quote: "We illustrated, by means of simulations, that the FIC selects models that give predictions with a comparable MSE to that of the AIC over the entire parameter space."
  • [15] I could not find a direct link with AIC, maybe you can clarify BetterMath?
My comment: given the sources you cite BetterMath, I in fact see no conflict with what bender235 wrote? The sources seem to conclude that AIC and its interpretation as an estimate of predictive error on the test set can be extended to same-realization predictions and to non-stationary cases. I note however that the first source makes a caveat explicit:
"On the other hand, a limitation of AIC in same-realization settings is demonstrated. Empirical results, given in Table 2 in Section 4, reveal that it seems very difficult for AIC to possess strong asymptotic efficiency; [...] If the order of the true model is finite, then the BIC-like criterion, for example, BIC [24] and HQ [13], can choose the smallest true model with probability tending to 1, but AIC does not possess this optimal property (see [26]). Therefore, to achieve optimal same-realization predictions in situations where the underlying AR model has a possibly finite order, further investigation is still required."
This is indeed very interesting and I think pertinent to the entry, but as I understand it, this states that AIC lacks "strong asymptotic efficiency", but there is still some asymptotic efficiency, although less than BIC-like criteria. As the 2nd source writes:
" Under a less stringent assumption on k than that of Gerencser (1992), Ing and Wei (2003, 2005) obtained an asymptotic expression for the MSPE of the least squares predictor and showed that AIC and its variants are still asymptotically efficient for same-realization predictions."
It looks to me like this goes along with AIC being defined in terms of prediction error, or more precisely out-of-sample deviance, or rather these sources do not contradict it, but they provide caveats on when AIC may not work as efficiently as expected, which is for sure of interest. If I am misunderstanding your point BetterMath, please correct me, but if that is correct, I would suggest that in fact all of your sources, Bender235 and BetterMath, be used to expand the entry and then the lede. I would suggest starting with McElreath and Taddy as the intro/textbook definition, and then the other sources can be used to inform about AIC extensions and caveats under dependency and non-stationarity. What do you guys think? --Signimu (talk) 23:24, 22 October 2019 (UTC)[reply]
@Signimu: did you notice how BetterMath evaded my question yet again? How often am I supposed to ask it? --bender235 (talk) 00:40, 23 October 2019 (UTC)[reply]
@BetterMath: could you finally please give us a single concrete example of a log-likelihood that is not additive, or equivalently a likelihood that is not multiplicative, so as to rewrite the law of probability that states P(A ∩ B) = P(A)P(B|A). I'm tired of this evasion. --bender235 (talk) 00:40, 23 October 2019 (UTC)[reply]
I understand your impatience Bender235, but please let's not get sidetracked, I promise I'll do everything I can to help solve this as quickly as possible. This question might be of interest, but I think for the moment we may be able to solve this issue without it, by relying on reliable sources, as WP:V requires. And so far, what I see is that all sources provided are of interest and pertinent, and not contradictory but rather complementary. I'd like to hear what BetterMath thinks of the descriptions I wrote from his/her sources and of my proposition of a plan to rewrite the article with all the sources provided by both of you --Signimu (talk) 00:58, 23 October 2019 (UTC)[reply]

@Signimu: The general situation seems to be as follows....  It used to be believed that AIC had a nice property: AIC selects the model that has the minimal expected error when making predictions. Then, in 2005, Ing & Wei showed that the analysis underlying that belief was invalid when the observations are not independent.
That leads to a Question: when the observations are not independent, what can be said about the expected error from a model selected via AIC?
Ing & Wei gave a partial answer to the Question. They introduced a technique that they called independent realization. They then applied the technique to one special case: a certain class of autoregressive models. For that special case, they showed that AIC essentially still had the above-noted nice property.
Ing & Wei did not, however, attempt to answer the Question for arbitrary models. About classes of models other than the special case, Ing & Wei said essentially nothing—except that prior analyses were invalid.
Many researchers have worked on extending the analysis of Ing & Wei, in attempts to answer the Question more generally. As an example, the second source that my comment cited extends the work of Ing & Wei to autoregressive models that are nonstationary (as your comment rightly summarized). As another example, the fourth source that my comment cited argues that the independent-realization technique is generally inappropriate for linear models of panel data.
For a list of the substantial literature that Ing & Wei has generated, see Google Scholar. As the list demonstrates, and the above examples illustrate, there has been a substantial amount of work, but only little progress.
To summarize, researchers are not even close to answering the Question for general models, despite substantial effort. The special case of autoregressive models has been analyzed, but there seems to have been nothing done for most other important classes of models (e.g. nonlinear, threshold, etc.). Thus, when the observations are not independent, prediction errors are generally poorly understood.
BetterMath (talk) 18:08, 23 October 2019 (UTC)[reply]
I'm gonna ask this one time and one time only: in situations where observations "are not independent of the previous data" (Ing & Wei), what does AIC try to measure/estimate, regardless of whether it does so correctly or wrongly? --bender235 (talk) 19:05, 23 October 2019 (UTC)[reply]
For a given candidate model, AIC provides a (relative) estimate of the K–L divergence from the true model. This is discussed in the Definition section.  BetterMath (talk) 20:02, 23 October 2019 (UTC)[reply]
And how in the world is that different from deviance? --bender235 (talk) 20:55, 23 October 2019 (UTC)[reply]
Today, after a few weeks of hiatus, Bender235 proceeded to update the lede, and BetterMath subsequently reverted. In the absence of BetterMath's replies, Bender235 was right in being BOLD. BetterMath, I appreciate your clarification, but I remain unconvinced that a model's limitation is a sufficient reason to not describe the model's definition or theoretical goal. It seems to me it would be like writing that Naive Bayes can't estimate the maximum likelihood because the model can't cope with dependent observations. In both cases, there are limiting assumptions, but that doesn't change the theoretical goal/measure target. IMO the most reasonable thing to do in such a case would be to 1) describe the intended model's goal, per the sources provided by Bender235, 2) add a sentence about caveats, properly backed by sources. As the lede is right now, it's indistinguishable from the description of any other model selection criterion ("relative quality" is too vague). --Signimu (talk) 04:15, 9 November 2019 (UTC)[reply]
The theoretical goal, per se, is not to get the prediction with the least error. The theoretical goal is the goal given in the Definition section. BetterMath (talk) 07:57, 9 November 2019 (UTC)[reply]
BetterMath what is in the definition section does not preclude a more succinct description in the lede such as those provided in the sources of Bender235. You only restate here your own previous arguments, that's not going to convince anyone further (WP:BRDD)! For the moment, Bender235 provided direct quotes from reliable sources supporting the edit, whereas you only provide your own statement, since from what I could read from your sources, they do not contradict Bender235 sources' definition. In other words, the burden of proof is on you for the moment. Thus, I find it quite distasteful for you to revert based only on your own conviction. You may be sure you are right, but Wikipedia is not interested in truth, but in verifiable content, so you need to provide a source with a direct quote for your statement if you would like to keep your position further. Another solution is to compromise, as I proposed several times above. Unfortunately, it seems from previous similar discussions you participated in that you never compromise nor consider the other editors' arguments[16][17][18][19], so unfortunately I am led to believe that you misunderstand the purpose of WP:BRD, as it is not a way for you to get your way, but to reach a consensus, even if it means it's an imperfect one. TL;DR bottom line: please provide a direct quote from a source supporting your refusal of Bender235's addition to the definition, or if you can't, please let the addition be done. --Signimu (talk) 13:22, 9 November 2019 (UTC)[reply]
An edit war is developing... BetterMath, please do not revert until you provide sources with direct quotes for your argument, and that we reach a consensus on this. Reverting a sourced content will yield no good... --Signimu (talk) 13:55, 9 November 2019 (UTC)[reply]
There should be no compromise on technical validity.
As for quotes, consider the quote in the comment by Bender235 at 01:56, 16 October, which is from the paper of Ing & Wei. That quote supports the position of Bender235. Then see the extract (i.e. quote) in my comment of 14 October. The extract explicitly explains that the quote from Bender235 is not for panel data.
BetterMath (talk) 13:57, 9 November 2019 (UTC)[reply]
BetterMath Thank you for taking the time to reply (even though you reverted again...). I think you are here talking about doi:10.1016/j.jmva.2015.01.004. First off, this paper does not mention Akaike information criterion, so applying it here is a bit of a WP:SYNTH. But I am of the opinion that logically it should apply and it is a pertinent addition (but don't be surprised if other editors disagree). Secondly, the source mentions specifically that it is studying out of sample forecasts. So if AIC is not an OOS forecast, how could this paper apply to it? So either Bender's source is correct, and thus your source is applicable (with a bit of a synthesis stretch), or Bender's source is wrong (AIC is not an OOS deviance estimator) and your source is not applicable either, so it looks paradoxical to use the source you propose to reject Bender's addition. If your argument is correct, surely there must be another source saying the same thing (and we should be wary when only a single source declares something), can you propose another source, if possible more directly supporting your claim BetterMath? --Signimu (talk) 15:36, 9 November 2019 (UTC)[reply]
In the end, I think all of this comes down to BetterMath's misunderstanding of the subject at hand. And I'm not talking about the "in-sample vs. out-of-sample" gaffe, but his failure to produce a definition of a "non-additive log-likelihood," which seemingly is the key to producing a counterexample to AIC estimating out-of-sample error. A likelihood function is, in the end, just a joint probability (interpreted as a function of the parameters), i.e. it factors into a product of (conditional) densities by the multiplicative law of probability (here an intro level text). Of course this product of densities turns into a sum once taking logarithm. So if that's the "additivity" BetterMath is referring to, it is always satisfied. Unfortunately I don't know if that's what BetterMath actually means, because so far (for the past two months!) he has failed to produce a definition. --bender235 (talk) 19:08, 9 November 2019 (UTC)[reply]
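For reference, the factorization appealed to in the comment above can be sketched as follows. The notation (density f, parameter θ, observations x_1, ..., x_n) is assumed here for illustration and is not quoted from any of the cited sources; the identity itself is just the multiplicative law P(A ∩ B) = P(A)P(B|A) applied repeatedly, and it does not require the observations to be independent.

```latex
\[
L(\theta \mid x_1,\dots,x_n)
  = f(x_1,\dots,x_n \mid \theta)
  = \prod_{i=1}^{n} f(x_i \mid x_1,\dots,x_{i-1};\theta),
\qquad
\log L(\theta \mid x_1,\dots,x_n)
  = \sum_{i=1}^{n} \log f(x_i \mid x_1,\dots,x_{i-1};\theta).
\]
```

When the observations are independent, each conditional density reduces to the marginal f(x_i | θ), which gives the familiar sum of per-observation log-likelihoods.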
Thank you Bender235 for the explanation, but so far it seems going down the route of mathematical demonstrations is likely not going to help reach a consensus seeing how it failed in the past, so I would humbly suggest this discussion simply sticks to reliable sources per WP:V. If a content can be sourced, it should be added, if it is not, then it should not shape the article, whether it's true or not. --Signimu (talk) 19:18, 9 November 2019 (UTC)[reply]
Signimu, I definitely agree with you on the application of our WP:V principle. It's just that I do agree with BetterMath insofar as that textbooks sometimes are wrong (for example, here in constrained optimization). However, such a claim either needs a WP:RS or a valid mathematical argument. So far I have seen neither from BetterMath. --bender235 (talk) 22:31, 9 November 2019 (UTC)[reply]

Information criteria are hardly a statistical paradigm[edit]

There is a misleading statement in the section on "Foundations of Statistics," where one editor claims that the Akaike Information Criterion (AIC) can form a foundation of statistics that is distinct from both frequentism and Bayesianism. Upon reviewing the cited reference (Burnham & Anderson, 2002 [p. 99]), it appears that the statement is not adequately supported. The relevant statement in the reference mentions that information criteria can be computed and interpreted without subjective judgment or the use of significance levels or Bayesian priors, but it does not suggest that information criteria constitute a separate statistical paradigm.

Additionally, after examining the book "Philosophy of Statistics," it is evident that no claim is made regarding AIC or information criteria forming a distinct paradigm within statistics. In the chapter specifically discussing AIC, BIC, and model selection, the authors treat AIC as a model selection rule that aids in statistical inference.

Based on these observations, it seems that the editor who wrote that particular part of the section may have aggregated and overextended the statements in the references. I recommend that the section be removed entirely, but gathering input from other editors would be valuable before making a final decision. Jourdy345 (talk) 15:04, 10 May 2023 (UTC)[reply]
