A Principled Approach to Model Validation in the Context of Estimating Age-Specific Rates

Ameer Dharamshi, University of Washington
Daniela Witten, University of Washington
Monica Alexander, University of Toronto

A number of statistical models are available to estimate age-specific demographic rates in a wide range of populations, particularly in contexts where the data are sparse or unreliable. However, model development and testing have largely focused on in-sample validation, with less emphasis on validating models on new data. Classical train/test sample splitting is not appropriate when estimating age-specific rates, because demographic outcomes across ages are not independent and identically distributed. In this paper, we show that data thinning, a collection of randomization strategies from the statistical learning literature, can be used to generate independent train and test sets by partitioning the population of each individual age group into two (or more) subpopulations subject to the same latent risks. Because all age groups are represented in both folds, model fitting and testing replicate the intended use of the model: they assess whether the latent risk curve extracted from the observed data generalizes to an independent realization from the same age groups. We illustrate the advantages of data thinning over sample splitting in a simulation study based on US state-level mortality data.
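
As a rough illustration of the thinning idea described above, the following Python sketch splits age-specific death counts into two folds. It assumes, purely for illustration, that deaths in each age group are Poisson distributed with mean equal to exposure times a latent rate; the abstract does not state which distribution the authors use. The function name `thin_poisson_counts`, the fraction `eps`, and the simulated exposures and rates are all hypothetical. Under the Poisson assumption, allocating each observed death to the train fold with probability `eps` yields two independent Poisson counts that share the same latent rate.

```python
import numpy as np


def thin_poisson_counts(deaths, eps=0.5, rng=None):
    """Split Poisson counts into independent train/test folds.

    Assumption (not from the abstract): deaths ~ Poisson(mu) per age group.
    Then train ~ Poisson(eps * mu) and test ~ Poisson((1 - eps) * mu),
    and the two folds are independent of each other.
    """
    rng = np.random.default_rng() if rng is None else rng
    deaths = np.asarray(deaths, dtype=np.int64)
    # Each observed death goes to the train fold with probability eps.
    train = rng.binomial(deaths, eps)
    test = deaths - train
    return train, test


# Hypothetical simulated data: 10 age groups with a rising mortality rate.
rng = np.random.default_rng(0)
eps = 0.5
exposure = np.full(10, 50_000.0)        # person-years per age group
true_rate = np.linspace(0.001, 0.05, 10)  # latent age-specific rates
deaths = rng.poisson(exposure * true_rate)

train, test = thin_poisson_counts(deaths, eps=eps, rng=rng)

# Each fold carries a share of the exposure, so both folds estimate
# the same latent rate for every age group.
train_rate = train / (eps * exposure)
test_rate = test / ((1 - eps) * exposure)
```

Unlike sample splitting, which would hold out entire age groups, every age group appears in both folds here, so a model fitted to `train_rate` can be scored against `test_rate` across the full age range.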

Presented in Session P2. Health, Mortality, Ageing - Aperitivo