Fall 2017 Colloquia Abstracts
Toward the Era of “Large p², Medium n”
Xing Qiu, PhD
University of Rochester
In the past two decades or so, the emergence of many different types of high-throughput data, such as whole-transcriptome gene expression, genotyping, and microbiota abundance data, has revolutionized medical research. One common property shared by these “Omics” data is that they typically have far more features than independent measurements (sample size). This property, known as the “large p, small n” property in the research community, has motivated many instrumental statistical innovations, including the Benjamini-Hochberg FDR-controlling multiple testing procedure, Fan and Lv’s sure independence screening, a host of advanced penalized regression methods, and sparse matrix and tensor decomposition techniques, just to name a few. Due to rapid advances in biotechnology, the unit cost of generating high-throughput data has decreased significantly in recent years. Consequently, the sample size in a respectable study is now roughly n = 100–500, which I consider “medium n” and which is certainly a huge improvement over the old “small n” studies in which n < 10 was the norm. With the increased sample size, medical investigators are starting to ask more sophisticated questions: feature selection based on hypothesis testing and regression analysis is no longer the end, but the new starting point for secondary analyses such as network analysis, multi-modal data association, and gene set analyses. The overarching theme of these advanced analyses is that they all require statistical inference for models that involve p² parameters. In my opinion, it takes a combination of proper data preprocessing, feature selection, dimension reduction, model building and selection, as well as domain knowledge and computational skills, to do it right. Despite the technical difficulties of designing and performing these avant-garde analyses, I believe that they will soon become mainstream and inspire a generation of young statisticians and data scientists to invent the next big breakthroughs in statistical science. In this talk, I will share some of my recent methodology and collaborative research that involves “large p²” models, and list a few potential extensions of these methods that may be used in other areas of statistics.
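As background for one of the classical “large p, small n” tools named above, the sketch below implements the standard Benjamini-Hochberg step-up procedure in Python. It is a generic illustration, not code from the talk; the function and variable names are our own.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha
    using the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # ranks of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # step-up comparison p_(k) <= k*alpha/m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])   # largest k with p_(k) <= k*alpha/m
        reject[order[:k_max + 1]] = True       # reject all hypotheses up to that rank
    return reject

# Toy usage: 10,000 features, a handful with real signal
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=9990), rng.uniform(0, 1e-4, size=10)])
print(benjamini_hochberg(pvals, alpha=0.05).sum(), "features pass the 5% FDR threshold")
```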
Thursday, November 30, 2017
Automated Model Building and Deep Learning
Xiao Wang, PhD
Purdue University
Analysis of big data demands computer-aided, or even automated, model building; it is extremely difficult to analyze such data with traditional statistical models and model-building methods. Deep learning has proved successful for a variety of challenging problems such as AlphaGo, driverless cars, and image classification. Our understanding of deep learning, however, remains limited, which makes it difficult to develop the methodology fully. In this talk, we focus on neural network models with one hidden layer. We provide an understanding of deep learning from an automated modeling perspective. This understanding leads to a sequential method of constructing deep learning models that is adaptive to the unknown underlying model structure. This is joint work with Chuanhai Liu.
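For a concrete picture of the model class discussed in this talk, here is a minimal single-hidden-layer network fitted by full-batch gradient descent in NumPy. It illustrates the architecture only; it is not the sequential construction method of Wang and Liu, and the toy data and settings are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y = sin(3x) + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# One hidden layer with H tanh units: f(x) = tanh(x W1 + b1) w2 + b2
H, lr = 10, 0.05
W1 = 0.5 * rng.standard_normal((1, H)); b1 = np.zeros(H)
w2 = 0.5 * rng.standard_normal(H);      b2 = 0.0

for step in range(5000):
    A = np.tanh(X @ W1 + b1)                        # hidden activations, shape (n, H)
    pred = A @ w2 + b2                              # network output, shape (n,)
    g_out = 2 * (pred - y) / len(y)                 # gradient of MSE w.r.t. output
    grad_w2 = A.T @ g_out
    grad_b2 = g_out.sum()
    g_hidden = np.outer(g_out, w2) * (1 - A ** 2)   # back-propagate through tanh
    grad_W1 = X.T @ g_hidden
    grad_b1 = g_hidden.sum(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

print("final training MSE:", np.mean((np.tanh(X @ W1 + b1) @ w2 + b2 - y) ** 2))
```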
Thursday, November 9, 2017
Sieve Analysis using the Number of Infecting Pathogens
Dean Follmann, PhD
National Institute of Allergy & Infectious Diseases
Assessment of vaccine efficacy as a function of the similarity of the infecting pathogen to the vaccine is an important scientific goal. Characterization of pathogen strains for which vaccine efficacy is low can increase understanding of the vaccine's mechanism of action and offer targets for vaccine improvement. Traditional sieve analysis estimates differential vaccine efficacy using a single identifiable pathogen for each subject. The similarity between this single entity and the vaccine immunogen is quantified, for example, by exact match or number of mismatched amino acids. With new technology we can now obtain the actual count of genetically distinct pathogens that infect an individual. Let F be the number of distinct features of a species of pathogen. We assume a log-linear model for the expected number of infecting pathogens with feature f, f = 1, …, F. The model can be used directly in studies with passive surveillance of infections, where the count of each type of pathogen is recorded at the end of some interval, or active surveillance, where the time of infection is known. For active surveillance we additionally assume that a proportional intensity model applies to the time of potentially infectious exposures and derive product and weighted estimating equation (WEE) estimators for the regression parameters in the log-linear model. The WEE estimator explicitly allows for waning vaccine efficacy and time-varying distributions of pathogens. We give conditions where sieve parameters have a per-exposure interpretation under passive surveillance. We evaluate the methods by simulation and analyze a phase III trial of a malaria vaccine.
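To make the log-linear modeling setup concrete, the sketch below fits a Poisson log-linear model for the count of infecting pathogens carrying a given feature as a function of vaccine assignment, using statsmodels. The data, variable names, and parameterization are invented for illustration and are not from the trial analyzed in the talk.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical passive-surveillance data: counts of infecting pathogens carrying
# feature f for vaccinees (z = 1) and placebo recipients (z = 0).
n = 300
z = rng.integers(0, 2, size=n)                     # vaccine indicator
counts = rng.poisson(lam=np.exp(0.5 - 1.0 * z))    # lower rate among vaccinees

# Log-linear (Poisson) model: log E[count] = beta0 + beta1 * z
X = sm.add_constant(z.astype(float))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.params)                                  # beta1 estimates the log rate ratio
print("estimated feature-specific efficacy:", 1 - np.exp(fit.params[1]))
```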
Thursday, October 26, 2017
N-of-1 Trials for Making Personalized Treatment Decisions with Personalized Designs Using Self-Collected Data
Christopher Schmid, PhD
Brown University
N-of-1 trials hold great promise for enabling participants to create personalized protocols to make personalized treatment decisions. Fundamentally, N-of-1 trials are single-participant multiple-crossover studies for determining the comparative effectiveness of two or more treatments for one individual. An individual selects treatments and outcomes of interest, carries out the trial, and then makes a final treatment decision, with or without a clinician, based on results of the trial. Established in a clinical environment, an N-of-1 practice provides data on multiple trials from different participants. Such data can be combined using meta-analytic techniques to inform both individual and population treatment effects. When participants undertake trials with different treatments, the data form a treatment network and suggest use of network meta-analysis methods. This talk will discuss ongoing and completed clinical research projects using N-of-1 trials for chronic pain, atrial fibrillation, inflammatory bowel disease, fibromyalgia, and attention deficit hyperactivity disorder. Several of these trials collect data from participants using mobile devices. I will describe design, data collection, and analytic challenges as well as unique aspects deriving from use of the N-of-1 design and mobile data collection for personalized decision-making. Challenges involve defining treatments, presenting results, assessing model assumptions, and combining information from multiple participants to provide a better estimate of each individual’s effect than from his or her own data alone.
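As a toy illustration of the last point, borrowing strength across participants, the sketch below applies a simple random-effects (DerSimonian-Laird) shrinkage to per-participant treatment-effect estimates. It is a generic meta-analytic device, not the models used in the projects described in the talk, and all numbers are invented.

```python
import numpy as np

# Hypothetical per-participant effect estimates (treatment A minus B) and their
# variances from individual N-of-1 trials.
effects = np.array([1.8, 0.4, 2.5, -0.3, 1.1])
variances = np.array([0.9, 0.6, 1.4, 0.8, 0.5])

# DerSimonian-Laird estimate of between-participant heterogeneity tau^2
w = 1.0 / variances
mu_fixed = np.sum(w * effects) / np.sum(w)
Q = np.sum(w * (effects - mu_fixed) ** 2)
k = len(effects)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects pooled mean and shrunken ("borrowed strength") individual estimates
w_re = 1.0 / (variances + tau2)
mu_re = np.sum(w_re * effects) / np.sum(w_re)
if tau2 > 0:
    shrunk = (effects / variances + mu_re / tau2) / (1.0 / variances + 1.0 / tau2)
else:
    shrunk = np.full(k, mu_re)   # no heterogeneity: everyone gets the pooled estimate

print("pooled effect:", mu_re)
print("shrunken individual effects:", shrunk)
```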
Thursday, October 5, 2017
A New Clustering Method for Single-Cell Data Analysis
Lynn Lin, PhD
The Pennsylvania State University
Advances in single-cell technologies have enabled high-dimensional measurement of individual cells in a high-throughput manner. A key first step in analyzing this wealth of data is to identify distinct cell subsets from a mixed-population sample. In many clinical applications, cell subsets of interest are often found in very low frequencies, which poses challenges for existing clustering methods. To address this issue, we propose a new mixture model called the Hidden Markov Model on Variable Blocks (HMM-VB) and a new mode search algorithm called Modal Baum-Welch (MBW) for mode-association clustering. HMM-VB leverages prior information about chain-like dependence among groups of variables to achieve the effect of dimension reduction as well as incisive modeling of the rare clusters. When such a dependence structure is unknown, or is assumed merely for the sake of parsimonious modeling, we have developed a recursive search algorithm based on BIC to optimize the formation of ordered variable blocks. In addition, we provide theoretical investigations of the identifiability of HMM-VB and of the consistency of our approach to searching for the block partition of variables in a special case. In a series of experiments on simulated and real data, HMM-VB outperforms other widely used methods.
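For intuition about mode-association clustering in general (independently of HMM-VB and MBW), the sketch below performs modal clustering with a plain Gaussian mixture: each point hill-climbs the fitted density, and points whose ascents reach the same mode share a cluster. This is a simplified analogue using scikit-learn, not the authors' algorithm, and the toy data are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Toy data: a large population plus a rare subpopulation (~2% of cells)
X = np.vstack([rng.normal(0, 1, size=(980, 2)),
               rng.normal(4, 0.3, size=(20, 2))])

# Fit a Gaussian mixture with a shared (tied) covariance matrix
gmm = GaussianMixture(n_components=5, covariance_type="tied", random_state=0).fit(X)

def climb_to_mode(x, gmm, n_iter=200):
    """Modal-EM style hill climbing: with a tied covariance, each ascent step
    moves x to the responsibility-weighted average of the component means."""
    x = x.copy()
    for _ in range(n_iter):
        resp = gmm.predict_proba(x.reshape(1, -1))[0]   # responsibilities at x
        x_new = resp @ gmm.means_                       # weighted mean of means
        if np.linalg.norm(x_new - x) < 1e-8:
            break
        x = x_new
    return x

modes = np.array([climb_to_mode(x, gmm) for x in X])
# Points whose ascents end at (numerically) the same mode form one cluster
labels = np.unique(np.round(modes, 3), axis=0, return_inverse=True)[1]
print("cluster sizes:", np.bincount(labels))
```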
Thursday, September 28, 2017
A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models
Martin Wells, PhD
Cornell University
A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture, allowing each explanatory variable to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, in addition to treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a modified form of the EM algorithm, which is scalable and leads to significantly faster convergence compared with simulation-based fully Bayesian methods.
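To illustrate the flavor of a three-component mixture on effect sizes (negative, null, or positive), the sketch below runs a small EM algorithm on noisy univariate effect estimates, with the null component's mean fixed at zero. It is a toy normal-means analogue with invented data and names, not the authors' GLM algorithm.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Hypothetical noisy effect estimates: most are null, a few are positive or negative
beta_hat = np.concatenate([rng.normal(0.0, 0.3, 940),     # null effects + noise
                           rng.normal(2.0, 0.3, 40),      # positive effects
                           rng.normal(-2.0, 0.3, 20)])    # negative effects

# Three-component mixture: N(mu_neg, s2), N(0, s2), N(mu_pos, s2); middle mean fixed at 0
pi = np.array([0.1, 0.8, 0.1])      # mixing proportions (negative, null, positive)
mu = np.array([-1.0, 0.0, 1.0])     # component means; mu[1] stays at 0
s2 = 1.0                            # common variance

for _ in range(200):
    # E-step: posterior probability that each estimate belongs to each component
    dens = np.column_stack([pi[k] * norm.pdf(beta_hat, mu[k], np.sqrt(s2))
                            for k in range(3)])
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update weights, the two free means, and the common variance
    pi = resp.mean(axis=0)
    mu[0] = np.sum(resp[:, 0] * beta_hat) / np.sum(resp[:, 0])
    mu[2] = np.sum(resp[:, 2] * beta_hat) / np.sum(resp[:, 2])
    s2 = np.sum(resp * (beta_hat[:, None] - mu) ** 2) / len(beta_hat)

print("estimated proportions (neg, null, pos):", np.round(pi, 3))
print("estimated non-null means:", np.round(mu[[0, 2]], 2))
```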
Thursday, September 14, 2017