relative variable importance with unbalanced model set

Postby mherzog » Mon Mar 02, 2015 1:56 pm

While developing a set of models, we would like to use relative variable importance to show some weight of evidence for a particular variable. However, when not all variables are equally represented across the model set, there is reason for concern that the accumulated weight score is biased simply by how often each variable appears.

I know there is some philosophy behind the fact that this is ok, and it's really just a function of your predefined a priori hypotheses, and you just have to recognize that and present/infer your results within that universe.

That said, I think there is some value in obtaining an estimate of variable importance where this is corrected. Examples might include interaction terms, which require both variables to also be present in the model, or polynomial terms. We (Julie Yee and I) decided on: log(W/(1-W)) - log(Nv/(N-Nv)), where W is the sum of the model weights of the models that contain a given variable, Nv is the number of models that contain that variable, and N is the total number of models.
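As a rough illustration of the formula above, here is a minimal Python sketch. The model set, the variable names, and the weights are all made up for the example, and the function names are hypothetical; only the expression log(W/(1-W)) - log(Nv/(N-Nv)) comes from the post.

```python
import math

def adjusted_log_odds(weight_sum, n_with_var, n_models):
    """log(W/(1-W)) - log(Nv/(N-Nv)) for a single variable."""
    if not 0.0 < weight_sum < 1.0:
        raise ValueError("W must lie strictly between 0 and 1")
    if not 0 < n_with_var < n_models:
        raise ValueError("Nv must lie strictly between 0 and N")
    weight_log_odds = math.log(weight_sum / (1.0 - weight_sum))
    frequency_log_odds = math.log(n_with_var / (n_models - n_with_var))
    return weight_log_odds - frequency_log_odds

# Hypothetical, unbalanced model set: (terms in the model, model weight).
# The interaction only appears alongside both main effects.
models = [
    ({"age"}, 0.10),
    ({"age", "sex"}, 0.30),
    ({"age", "sex", "age:sex"}, 0.40),
    ({"sex"}, 0.15),
    (set(), 0.05),
]

def importance_table(models):
    """Return {variable: (W, Nv, adjusted log-odds)} for every term in the set."""
    n_models = len(models)
    variables = sorted(set().union(*(terms for terms, _ in models)))
    table = {}
    for v in variables:
        w = sum(weight for terms, weight in models if v in terms)
        n_v = sum(1 for terms, _ in models if v in terms)
        table[v] = (w, n_v, adjusted_log_odds(w, n_v, n_models))
    return table

for variable, (w, n_v, score) in importance_table(models).items():
    print(f"{variable:8s} W={w:.2f} Nv={n_v} adjusted log-odds={score:.3f}")
```

A score of 0 means the summed weight is exactly what the variable's frequency in the model set would give under equal model weights. Note that the expression is undefined when a variable is in every model or in none, or when W is exactly 0 or 1, which the sketch rejects rather than patches.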

Searching the forum prior to posting I noticed Darryl (Mackenzie) suggested something similar here:

http://www.phidot.org/forum/viewtopic.php?f=34&t=1228&p=3423&hilit=relative+variable+importance#p3423

Darryl's is basically an odds ratio, and we opted to use the log-odds to provide a somewhat more symmetric range. But the logic is the same.
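To make the symmetry point concrete, here is a small Python sketch. It is an assumption that the odds-ratio form is simply the exponentiated version of the log-odds expression in this post; Darryl's exact formula is in the linked thread and is not reproduced here. The numbers are invented.

```python
import math

def odds_ratio(weight_sum, n_with_var, n_models):
    """(W/(1-W)) / (Nv/(N-Nv)): the log-odds difference, exponentiated."""
    weight_odds = weight_sum / (1.0 - weight_sum)
    frequency_odds = n_with_var / (n_models - n_with_var)
    return weight_odds / frequency_odds

# Two hypothetical variables, each in 5 of 10 models:
# one with summed weight 0.8, one with summed weight 0.2.
favoured = odds_ratio(0.8, 5, 10)      # about 4
disfavoured = odds_ratio(0.2, 5, 10)   # about 0.25

# On the odds-ratio scale, support lives in (1, infinity) and lack of
# support is squeezed into (0, 1). On the log scale the two cases are
# mirror images around 0.
print(favoured, disfavoured)
print(math.log(favoured), math.log(disfavoured))
```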

Question/Discussion topic here is two-fold:
1. How do folks feel about the use of the log-odds or odds ratio as an approach to correct for this?
2. Is there anything out there that is citeable for using such an approach? We don't want to have to spend too much time in each paper trying to defend this approach... and perhaps we can do that once and then cite that paper each time going forward, but would definitely prefer to have something more specific to cite if it's available.

Thanks,
Mark
Last edited by mherzog on Wed Mar 04, 2015 12:34 pm, edited 1 time in total.

Re: relative variable importance with unbalanced model set

Postby cooch » Mon Mar 02, 2015 8:16 pm

mherzog wrote:Question/Discussion topic here is two-fold:
1. How do folks feel about the use of the log-odds or odds ratio as an approach to correct for this?
2. Is there anything out there that is citeable for using such an approach? We don't want to have to spend too much time in each paper trying to defend this approach... and perhaps we can do that once and then cite that paper each time going forward, but would definitely prefer to have something more specific to cite if it's available.

Thanks,
Mark


Ken Burnham and David Anderson are currently working on a MS about this very problem. Indeed, you (and Darryl) are 'correct' in that the 'solution' (as it were) lies in something analogous to log odds ratios. The over-arching problems as I see them are the ever-messy interaction terms (heaven forbid if you have interactions of 'factors' and 'linear covariates', let alone interactions of two linear covariates), and varying degrees of multicollinearity among the regressors.

As in all good examples (i.e., in B&A's book, and most papers stemming from that), the regressors in the linear models are not evaluated with interaction terms in the model, and are as a set largely orthogonal (independent). So, the reasonable approach to demonstrate the idea was to use data that meet all the parametric assumptions.

Real world applications are, typically, uglier. For the moment, until Ken and David finish up their paper, the best you can do is an ad hoc approach, as you've described [and don't be insulted by the use of 'ad hoc'. A lot of things are based on 'ad hoc' assumptions. For example, the Bayesian credible interval is typically based on an ad hoc use of simple 95% cutoffs of the posterior density, which is robust, provided that the posterior distribution is symmetrical. In that case, there is only one 95% interval (based on frequencies) where there is equal mass in both tails. But, if the posterior is strongly asymmetrical (as it can be), then what constitutes the right 95% interval to use is not an easy problem, and leads to considerations of 'highest posterior density' approaches, and potentially nasty things like that -- for those who are interested, have a look at a very short treatment of the subject on pp. 12-13 in Appendix E of the MARK book].
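For anyone who wants to see the credible-interval point numerically, here is a minimal Python sketch comparing an equal-tailed 95% interval with a shortest-interval ('highest posterior density') one. The 'posterior' is just a set of simulated draws from a right-skewed gamma distribution, chosen purely for illustration; the function names are hypothetical, and the shortest-interval shortcut assumes a unimodal posterior.

```python
import math
import random

def equal_tailed_interval(samples, mass=0.95):
    """Interval covering `mass` of the draws with (roughly) equal tail counts."""
    ordered = sorted(samples)
    n = len(ordered)
    m = math.ceil(mass * n)          # number of draws the interval must cover
    start = (n - m) // 2             # same number of draws cut from each tail
    return ordered[start], ordered[start + m - 1]

def hpd_interval(samples, mass=0.95):
    """Shortest interval covering `mass` of the draws (unimodal case only)."""
    ordered = sorted(samples)
    n = len(ordered)
    m = math.ceil(mass * n)
    # Slide a window of m consecutive sorted draws and keep the narrowest one.
    best = min(range(n - m + 1), key=lambda i: ordered[i + m - 1] - ordered[i])
    return ordered[best], ordered[best + m - 1]

rng = random.Random(1)               # fixed seed so the run is repeatable
draws = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]  # skewed 'posterior'

et = equal_tailed_interval(draws)
hpd = hpd_interval(draws)
print("equal-tailed:", et, "width", et[1] - et[0])
print("shortest    :", hpd, "width", hpd[1] - hpd[0])
```

For a symmetric posterior the two intervals essentially coincide; for a skewed one like this, the shortest interval shifts toward the mode and is narrower, which is the ambiguity described above.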

So, in short, I would suggest:

1\ using your ad hoc approach (which Darryl described in his earlier post in a different thread - kudos on searching for and finding relevant bits of the 'knowledge base').

2\ describing it as such in your paper(s), and simply pointing out that for the moment, we don't have a good way to evaluate 'factor importance' in a multi-model inference framework if there are interactions of some terms, and not others (not to mention the complication of collinearity). You're on the bleeding edge. Period. Nothing you can do about that. If the editor/reviewers balk, simply ask them 'so, you have a better suggestion?'. Of course the answer is 'no' (unless, perhaps, said editors/reviewers happen to be Burnham or Anderson). But, any reviewer or journal that rejects because there isn't a canonical approach to a certain problem would fast find themselves out of a *lot* of papers (e.g., GOF testing - only available for certain types of models. So, in the absence of an estimate for c-hat, you do what? Simply point out there is no way to estimate c-hat for some data types, and then you do some post hoc evaluation of how much a lack of fit might make a difference, if in fact you could estimate the magnitude of that lack of fit).

3\ waiting patiently for the smart folks to start working on this as a problem of interest. The lack of prior consideration may also reflect a sense that to 'do it right' (it being estimating factor importance as a function of accumulated AIC weights) often involves building (very) large sets of models, and accumulating AIC weights over that very large set -- the size of the set being driven by the need at some levels for symmetry of factors in models (there is a reason MARK limits you to playing this game easily, using built-in capabilities, with no more than 10 factors). This starts to get dangerously close to data dredging, which is probably why B&A don't focus much on the issue -- at least in the book.

But, this too shall pass.

Re: relative variable importance with unbalanced model set

Postby jCeradini » Thu Sep 17, 2015 11:46 pm

Thoughts on Brian Cade's relative variable importance approach in his new Ecology article (Model averaging and muddled multimodel inferences)? Most of the paper covers model averaging coefficients (which this forum has covered and advised against), but his approach to relative variable importance begins on p. 2377. Model averaging t statistics (weighted by AIC model weights) that are scaled between 0 and 1 sounds intuitive to me, but my experience is limited in this realm. The model set does not need to be balanced. There's even some R code that walks through the process. I believe the same limitations apply when it comes to model sets with interactions.

Joe

