Causal Inference with Big Data

At increasing velocity, volume, and variety, we are generating, recording, and storing unprecedented amounts of data. Big Data present exciting opportunities to better understand risk factors, to build improved predictors, and to examine the causal relationships between variables. Still, there are many sources of association between variables, including direct effects, indirect effects, measured confounding, unmeasured confounding, and selection bias. The need for methods that distinguish causation from correlation is perhaps more pressing now than ever.

Super & Targeted Learning for Superior Prediction & Effect Estimation

Machine learning can improve risk prediction by relaxing the modeling assumptions made by standard approaches. A core strength of our research is the application of Super Learner, an ensemble method, to develop flexible prediction algorithms. Another strength of our research is the incorporation of machine learning to avoid unsubstantiated assumptions when estimating causal effects. We have expertise in the extension and application of targeted maximum likelihood estimation (TMLE), a general approach to semi-parametric efficient estimation that naturally integrates machine learning and formal statistical inference. 
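
To make the idea of targeting concrete, here is a minimal sketch of a TMLE for the average treatment effect with a binary treatment and a binary outcome. It is an illustration under stated assumptions, not our production code: the simulated data, the function names (`simulate`, `tmle_ate`), the deliberately crude initial outcome model, and the stratified propensity score are all hypothetical choices made to keep the example short. In practice the initial estimates would typically come from a flexible learner such as Super Learner.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate(n, seed=0):
    """Hypothetical data: binary confounder W, treatment A, outcome Y.

    The true average treatment effect is 0.2 by construction.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < (0.7 if w else 0.3) else 0
        y = 1 if rng.random() < 0.2 + 0.2 * a + 0.3 * w else 0
        data.append((w, a, y))
    return data


def tmle_ate(data):
    """Return (estimate, standard error) for the average treatment effect."""
    n = len(data)

    # Step 1: initial outcome regression. Deliberately crude: it ignores W.
    q_init = {}
    for a in (0, 1):
        ys = [y for (_, ai, y) in data if ai == a]
        q_init[a] = sum(ys) / len(ys)

    # Step 2: propensity score P(A = 1 | W), estimated within strata of W.
    g = {}
    for w in (0, 1):
        treatments = [a for (wi, a, _) in data if wi == w]
        g[w] = sum(treatments) / len(treatments)

    # Step 3: the "clever covariate" built from the propensity score.
    def clever(a, w):
        return a / g[w] - (1 - a) / (1 - g[w])

    # Step 4: targeting step. Fit the fluctuation parameter eps by logistic
    # regression of Y on the clever covariate, with logit(initial fit) as
    # offset, using Newton's method on a single parameter.
    eps = 0.0
    for _ in range(50):
        score = 0.0
        info = 0.0
        for w, a, y in data:
            h = clever(a, w)
            p = expit(logit(q_init[a]) + eps * h)
            score += h * (y - p)
            info += h * h * p * (1 - p)
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break

    # Step 5: plug the updated predictions into the target parameter.
    def q_star(a, w):
        return expit(logit(q_init[a]) + eps * clever(a, w))

    ate = sum(q_star(1, w) - q_star(0, w) for (w, _, _) in data) / n

    # Standard error from the sample variance of the estimated influence curve.
    ic = [
        clever(a, w) * (y - q_star(a, w)) + q_star(1, w) - q_star(0, w) - ate
        for (w, a, y) in data
    ]
    se = math.sqrt(sum(v * v for v in ic) / n) / math.sqrt(n)
    return ate, se


if __name__ == "__main__":
    estimate, std_err = tmle_ate(simulate(5000))
    print(f"TMLE estimate: {estimate:.3f} (SE {std_err:.3f})")
```

In this toy setup the initial outcome model is confounded, and the targeting step uses the propensity score to correct it. The influence-curve-based standard error is what allows a confidence interval to be attached to the estimate.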

Robust Inference with Missing Data

Population-level estimates of disease prevalence and control are needed to assess the effectiveness of prevention and treatment strategies. However, individuals whose status is measured are likely to differ meaningfully from those without measurements. Further complications arise because social interactions between individuals make their outcomes dependent. Both theoretically and with simulations, we have demonstrated the importance of flexibly controlling for baseline and time-varying causes of missingness, while rigorously accounting for the dependence of observations within a cluster (e.g., a community).
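
The following sketch illustrates the basic problem on simulated data. It assumes a single binary baseline cause of missingness `w` (the work described above also handles time-varying causes, which this example does not), and it treats the community as the independent unit when computing the standard error. All names and numbers are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def simulate_communities(n_clusters=40, n_per=200, seed=1):
    """Hypothetical rows of (community, w, measured, y or None).

    True prevalence is 0.22 on average. People with w = 1 have higher risk
    and are also more likely to have their outcome measured.
    """
    rng = random.Random(seed)
    rows = []
    for c in range(n_clusters):
        shift = rng.uniform(-0.05, 0.05)  # shared community effect
        for _ in range(n_per):
            w = 1 if rng.random() < 0.4 else 0
            y = 1 if rng.random() < 0.1 + 0.3 * w + shift else 0
            measured = rng.random() < (0.9 if w else 0.5)
            rows.append((c, w, measured, y if measured else None))
    return rows


def complete_case_prevalence(rows):
    """Naive estimate: mean outcome among those measured."""
    ys = [r[3] for r in rows if r[2]]
    return sum(ys) / len(ys)


def community_prevalence(rows):
    """Standardize measured outcomes over strata of w within one community."""
    n = len(rows)
    estimate = 0.0
    for w in (0, 1):
        stratum = [r for r in rows if r[1] == w]
        if not stratum:
            continue
        measured = [r[3] for r in stratum if r[2]]
        if not measured:
            raise ValueError("no measured outcomes in a covariate stratum")
        estimate += (len(stratum) / n) * (sum(measured) / len(measured))
    return estimate


def adjusted_prevalence(rows):
    """Average the community-level estimates; SE uses between-community spread."""
    by_community = {}
    for r in rows:
        by_community.setdefault(r[0], []).append(r)
    estimates = [community_prevalence(v) for v in by_community.values()]
    mean = statistics.mean(estimates)
    se = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, se


if __name__ == "__main__":
    rows = simulate_communities()
    adjusted, se = adjusted_prevalence(rows)
    print(f"complete-case: {complete_case_prevalence(rows):.3f}")
    print(f"adjusted:      {adjusted:.3f} (SE {se:.3f})")
```

Because measurement is more likely among higher-risk individuals in this simulation, the complete-case mean overstates prevalence, while standardizing over `w` removes that bias. Basing the standard error on community-level estimates is one simple way to respect within-community dependence.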

Pragmatic Trials to Translate Research into Practice

Pragmatic trials focus on learning how interventions work, and what effect they have, in real-world settings. Cluster (group) randomized trials further help us to learn the population-level effectiveness of interventions with proven individual-level efficacy. We tackle key questions in the design and analysis of these trials. In particular, we have demonstrated the gains in efficiency, power, and interpretation from pair-matching over complete randomization, targeting the sample effect instead of a population average parameter, and data-adaptive adjustment through a pre-specified analysis.
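
As a rough illustration of why pair-matching can improve efficiency, the sketch below simulates community-level outcomes for matched pairs and compares the standard error of the within-pair difference estimator with one that ignores the pairing. The data-generating process, the effect size, and the function names are assumptions made for this example. It does not implement the sample-effect target or the data-adaptive adjustment described above.

```python
import math
import random
import statistics


def simulate_pairs(n_pairs=20, effect=-0.05, seed=2):
    """Hypothetical community-level outcomes as (treated, control) per pair.

    Communities in a pair share a baseline risk, which is what matching on
    baseline characteristics aims to achieve.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        base = rng.uniform(0.1, 0.4)
        control = base + rng.gauss(0, 0.02)
        treated = base + effect + rng.gauss(0, 0.02)
        pairs.append((treated, control))
    return pairs


def paired_estimate(pairs):
    """Mean within-pair difference and its standard error."""
    diffs = [t - c for t, c in pairs]
    estimate = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return estimate, se


def unpaired_se(pairs):
    """Standard error of the difference in arm means, ignoring the pairing."""
    treated = [t for t, _ in pairs]
    control = [c for _, c in pairs]
    n = len(pairs)
    return math.sqrt(
        statistics.variance(treated) / n + statistics.variance(control) / n
    )


if __name__ == "__main__":
    pairs = simulate_pairs()
    estimate, se = paired_estimate(pairs)
    print(f"paired estimate: {estimate:.3f} (SE {se:.4f})")
    print(f"SE ignoring pairs: {unpaired_se(pairs):.4f}")
```

When matched communities share much of their baseline risk, as assumed here, the within-pair differences vary far less than the raw outcomes, so the paired standard error is smaller. How large the gain is in a real trial depends on how well the matching variables predict the outcome.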