

The cross-cutting theme on Biostatistics and Data Science aims to provide statistical support and develop novel quantitative methods to enable researchers to probe the rich data resources available at the Centre, to analyse multiple complex data sources in a coherent and robust manner, and to reveal the complex interactions and pathways between exposures and outcomes in environment and health studies.

Our research in this area includes the development of statistical methodology and innovative data analytics, anchored in the Bayesian hierarchical modelling paradigm and machine learning, to improve statistical inference. We employ flexible, spatio-temporal semi- and non-parametric models for large complex data from environmental and epidemiological studies and multi-omic high-throughput platforms, and strategies to address the computational challenges of high-dimensional inference and estimation. Such methods are particularly relevant for environment-health assessments and surveillance studies.

Theme Lead

Principal Team

Associated Team

Key Projects and Papers

• Statistical methods and data science techniques to improve the characterisation of air pollution. In particular: 1) we are developing Bayesian models that integrate data from measurements, numerical model outputs and satellites, accounting for the misalignment that naturally occurs when combining separate sources, and we are extending this framework to a multi-pollutant approach; 2) we are employing machine learning techniques to apportion particulate matter into its sources and to investigate the link with health outcomes (MRC Methodology grant; PI: Blangiardo).

• Spatio-temporal methods for 1) understanding the interrelations between climate change, local environmental conditions and arboviral disease dynamics in Brazil (Wellcome Trust grant; PI: Pirani), and 2) quantifying the temperature-related respiratory disease burden in England and Wales (MRC Centre-funded fellowship).

• Statistical methods to account for spatial uncertainty in small-area data, in particular population census data, and to visualise uncertainty from multiple sources. Collaboration with Emory University (Lance Waller) and Harvard University (Nancy Krieger). (Overall PI: Waller, ICL PI: Piel).

• Comparison of statistical profiling and data analytics for exposome data, including how to incorporate interactions in high-dimensional profiling. We devised a multivariate normal approach to analyse exposome data generated using complex study designs with multiple observations per participant and applied it to EXPOsOMICS data. We proposed a series of partial least squares (PLS) models to explore the multivariate effect of exposure mixtures on inflammatory markers.

• Development of the Metabolome Wide Significance Level as an approach to correct for multiple testing in metabolome-wide association studies and a new method for power and sample size calculations for metabolomic studies. We also contributed to the EU-funded PhenoMeNal programme cloud-based infrastructure for computational metabolomics.

• We also have a strong interest in causal inference, in particular how genetic information can be used as an instrumental variable in the Mendelian randomization paradigm to infer causal effects of high-dimensional exposures on outcomes of public health interest.
• A Systematic Comparison of Linear Regression-Based Statistical Methods to Assess Exposome-Health Associations. Agier L, Portengen L, Chadeau-Hyam M, Basagaña X, Giorgis-Allemand L, Siroux V, Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R, Vermeulen R. Environ Health Perspect. 2016 Dec;124(12):1848-1856.

• Blood transcriptional and microRNA responses to short-term exposure to disinfection by-products in a swimming pool. Espín-Pérez A, Font-Ribera L, van Veldhoven K, Krauskopf J, Portengen L, Chadeau-Hyam M, Vermeulen R, Grimalt JO, Villanueva CM, Vineis P, Kogevinas M, Kleinjans JC, de Kok TM. Environ Int. 2018 Jan;110:42-50.

• Improving Visualization and Interpretation of Metabolome-Wide Association Studies: An Application in a Population-Based Cohort Using Untargeted 1H NMR Metabolic Profiling. Castagné R, Boulangé CL, Karaman I, Campanella G, Santos Ferreira DL, Kaluarachchi MR, Lehne B, Moayyeri A, Lewis MR, Spagou K, Dona AC, Evangelos V, Tracy R, Greenland P, Lindon JC, Herrington D, Ebbels TMD, Elliott P, Tzoulaki I, Chadeau-Hyam M. J Proteome Res. 2017 Oct 6;16(10):3623-3633.

• Tan, L. S. L., A. Jasra, M. De Iorio and T. M. D. Ebbels (2017). "Bayesian Inference for Multiple Gaussian Graphical Models with Application to Metabolic Association Networks." Annals of Applied Statistics 11(4): 2222-2251.

• Power Analysis and Sample Size Determination in Metabolic Phenotyping. Blaise BJ, Correia G, Tin A, Young JH, Vergnaud AC, Lewis M, Pearce JT, Elliott P, Nicholson JK, Holmes E, Ebbels TM. Anal Chem. 2016 May 17;88(10):5179-88.

• Ebbels TM, Pearce JTM, Sadawi N, Gao J, Glen RC. Chapter 11 - Big Data and Databases for Metabolic Phenotyping. The Handbook of Metabolic Phenotyping. Editor(s): Lindon JC, Nicholson JK, Holmes E. Elsevier, 2019. Pages 329-367.

• Blangiardo M, Cameletti M. Spatial and Spatio-temporal Bayesian Models with R – INLA. Wiley, 2015.

• R2GUESS: A Graphics Processing Unit-Based R Package for Bayesian Variable Selection Regression of Multivariate Responses. Liquet B, Bottolo L, Campanella G, Richardson S, Chadeau-Hyam M. J Stat Softw. 2016 Jan 29;69(2).

• A Bayesian mixture modeling approach for public health surveillance. Boulieri A, Bennett JE, Blangiardo M. Biostatistics. 2018 Sep 25. doi: 10.1093/biostatistics/kxy038

• A hierarchical modelling approach to assess multi pollutant effects in time-series studies. Blangiardo M, Pirani M, Kanapka L, Hansell A, Fuller G. PLoS One. 2019 Mar 4;14(3):e0212565.

• Bayesian modeling for spatially misaligned health and air pollution data through the INLA-SPDE approach. Cameletti M, Gomez-Rubio V, Blangiardo M. Spatial Statistics 31, 2019 April.

• Bayesian spatial modelling for quasi-experimental designs: An interrupted time series study of the opening of Municipal Waste Incinerators in relation to infant mortality and sex ratio. Freni-Sterrantino A, Ghosh RE, Fecht D, Toledano MB, Elliott P, Hansell AL, Blangiardo M. Environ Int. 2019 Jul;128:109-115.

• Error in air pollution exposure model determinants and bias in health estimates. Vlaanderen J, Portengen L, Chadeau-Hyam M, Szpiro A, Gehring U, Brunekreef B, Hoek G, Vermeulen R. J Expo Sci Environ Epidemiol. 2019 Mar;29(2):258-266.

• Wang, Y., Pirani, M., Hansell, A. L., Richardson, S., & Blangiardo, M. (2019). Using ecological propensity score to adjust for missing confounders in small area studies. Biostatistics, 20(1), 1-16.

• Analysing the health effects of simultaneous exposure to physical and chemical properties of airborne particles. Pirani M, Best N, Blangiardo M, Liverani S, Atkinson RW, Fuller GW. Environ Int. 2015 Jun;79:56-64.

• Forlani C., Bhatt S., Cameletti M., Krainski E., Blangiardo M. A Joint Bayesian Space-Time Model to Integrate Spatially Misaligned Air Pollution Data in R-INLA. Accepted in Environmetrics.