Aims: This study aims to determine which information criterion is most appropriate for mixture model selection when the data sets have both categorical and numerical clustering base variables (the mixed case).
Study Design: To identify which of eleven information criteria best supports selecting the correct number of clusters, we conduct a simulation study based on generated mixtures of multinomial and multivariate normal data.
Place and Duration of Study: Simulation: Instituto Superior de Ciências Sociais e Políticas (ISCSP), Universidade de Lisboa, 2012.
Methodology: The experimental design controls the number of normal variables (two and four), the number of multinomial variables (two and four), the number of clusters (two, four and six), the level of cluster separation (ill-separated and well-separated), and the sample size (400, 1200 and 2000).
The data sets were therefore simulated with three two-level factors (number of normal variables, number of multinomial variables and cluster separation) and two three-level factors (number of clusters and sample size), giving a 2³ × 3² factorial design with 72 cells. With five replications (data sets) per cell, we generate a total of 2³ × 3² × 5 = 360 experimental data sets.
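The cell and data set counts can be checked by enumerating the design. The following is a minimal sketch, assuming only the factor levels listed in the Methodology; the dictionary keys and variable names are illustrative labels, not names used by the authors.

```python
from itertools import product

# Factor levels taken from the Methodology paragraph.
# The key names are illustrative labels (an assumption of this sketch).
factors = {
    "n_normal_vars": [2, 4],
    "n_multinomial_vars": [2, 4],
    "separation": ["ill", "well"],
    "n_clusters": [2, 4, 6],
    "sample_size": [400, 1200, 2000],
}
replications = 5  # data sets generated per cell

# One dict per cell of the full factorial design.
cells = [dict(zip(factors, levels)) for levels in product(*factors.values())]

# One (cell, replication) pair per simulated data set.
datasets = [(cell, rep) for cell in cells for rep in range(1, replications + 1)]

print(len(cells))     # 72
print(len(datasets))  # 360
```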
Results: The best overall performance goes to AIC3 (58%), followed by AICu (56%) and AICc (54%). These three criteria show a good compromise between underfitting and overfitting: AIC3, AICc and AICu underfit in 11%, 7% and 14% of cases and overfit in 21%, 18% and 18%, respectively. The criterion that underfits most is NEC (48%), and the one that overfits most is AIC (42%).
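To illustrate how a criterion can underfit or overfit, the sketch below computes four of the criteria named above and picks the number of clusters that minimizes each. The abstract does not state the formulas, so the forms used here (as commonly quoted in the model selection literature) are an assumption, as are the function names and the log-likelihood values, which are invented for illustration and are not results from the study.

```python
import math

def information_criteria(log_lik, k, n):
    """Penalized-likelihood criteria (smaller is better).

    log_lik: maximized log-likelihood, k: number of free parameters,
    n: sample size. Formulas are assumed standard forms, not taken
    from the abstract.
    """
    if n - k - 1 <= 0:
        raise ValueError("n must exceed k + 1 for AICc and AICu")
    aic = -2.0 * log_lik + 2.0 * k
    aic3 = -2.0 * log_lik + 3.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
    aicu = aicc + n * math.log(n / (n - k - 1))
    return {"AIC": aic, "AIC3": aic3, "AICc": aicc, "AICu": aicu}

def select_n_clusters(fits, n, criterion):
    """fits maps a candidate number of clusters to (log_lik, k)."""
    return min(fits, key=lambda g: information_criteria(*fits[g], n)[criterion])

def classify(selected, true_g):
    """Label a selection relative to the true number of clusters."""
    if selected < true_g:
        return "underfit"
    if selected > true_g:
        return "overfit"
    return "correct"

# Hypothetical fits for one data set whose true number of clusters is 2.
fits = {1: (-1000.0, 5), 2: (-900.0, 11), 3: (-893.0, 17)}
for name in ("AIC", "AIC3"):
    g = select_n_clusters(fits, n=400, criterion=name)
    print(name, g, classify(g, true_g=2))
```

With these invented values, AIC selects three clusters (overfit) while the heavier penalty of AIC3 selects two (correct).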
Conclusion: We ran the Friedman test on all the criteria to test the null hypothesis that the eleven population distribution functions are identical. We rejected the null hypothesis in favour of the alternative (Monte Carlo p-value < 0.001), concluding that performance was not identical across the eleven criteria, and proceeded to multiple comparisons.
These comparisons showed that AIC3 and AICc perform significantly differently, whereas AIC3 and AICu perform similarly. We therefore conclude that AIC3 and AICu are the best of the information criteria considered for selecting the true number of clusters in finite mixture models with mixed data.
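A Friedman test with a Monte Carlo p-value can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it treats each row as a block (for example a design cell) and each column as a criterion, omits the tie correction, and estimates the p-value by permuting scores within blocks. The scores are toy values, not results from the study.

```python
import random

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: list of rows, one per block, each holding k treatment scores.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)

def monte_carlo_p_value(blocks, n_sim=2000, seed=0):
    """Estimate the p-value by shuffling scores within each block."""
    rng = random.Random(seed)
    observed = friedman_statistic(blocks)
    hits = 0
    for _ in range(n_sim):
        shuffled = [rng.sample(row, len(row)) for row in blocks]
        if friedman_statistic(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one estimate, so the reported p-value is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)

# Toy scores (hypothetical): 6 blocks, 3 criteria, same ordering in each block.
toy = [[0.60, 0.50, 0.20], [0.70, 0.55, 0.30], [0.65, 0.60, 0.25],
       [0.80, 0.50, 0.10], [0.75, 0.45, 0.35], [0.62, 0.58, 0.22]]
stat, p = monte_carlo_p_value(toy)
print(stat, p)
```

The add-one estimate shows why a Monte Carlo p-value printed as 0.000 is better read as "below the resolution of the simulation" than as exactly zero.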