The 10th China-R Conference (Beijing): Speaker Introductions (Part 2)

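　　2017 marks the tenth year of the China-R Conference, a milestone worth celebrating. This year's conference will be held on May 19-21 at the beautiful Tsinghua University. At this memorable moment, let us gather at the Center for Statistical Science at Tsinghua University for the conference's tenth-anniversary celebration and this feast of data and statistics! The conference covers many areas of data science. We very much look forward to your arrival, and we hope your talk will benefit the audience and make the conference even better!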

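　　The China-R Conference is a distinctive data science conference initiated by Capital of Statistics (统计之都) and organized jointly with universities in China. The first conference was held at Renmin University of China in 2008. By 2016 it had grown into a series held in 9 cities over the course of the year, serving tens of thousands of students, faculty, and industry practitioners in data science, with content covering many industries related to data science. The conference has been fortunate to witness the vigorous growth of data science in China.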

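　　In 2017, the Center for Statistical Science at Tsinghua University, the Business Intelligence Research Center at Peking University, and Capital of Statistics are jointly hosting the 10th China-R Conference. Topics include healthcare, bioinformatics, consumer finance, quantitative investment, industrial engineering, intelligent manufacturing, software tools, computing platforms, probability and statistics, machine learning, artificial intelligence, natural language, astronomy and geography, urban planning, environmental science, social networks, government data, business statistics, the humanities, and more. The invited talks on May 19 will take place in the New Tsinghua Xuetang at Tsinghua University, and parallel sessions on the topics above will be held on May 20-21.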

　　The list of talks received so far is available at http://china-r.org/bj2017/lectures.html and you are welcome to take a look!

　　Below we introduce the speakers in the [Bioinformatics], [Statistical Theory A], and [Statistical Theory B] sessions of this conference:

　　Bioinformatics

　　Adaptive False Discovery Rate regression with application in integrative analysis of large-scale genomic data

Assistant Professor, Department of Mathematics, Hong Kong Baptist University: Yang Can (杨灿)

　　[Talk Abstract]

　　To address scientific questions, we often design experiments and collect data from them. Conventionally, we focus on the data set at hand and improve analysis results by refining models. The rise of Big Data may change the way research is done: what if we combine the data at hand with other existing information hidden in the Big Data mountain? In this talk, we consider a large-scale testing problem in genomic data analysis. Recent international projects, such as the Encyclopedia of DNA Elements (ENCODE) project, the Roadmap project, and the Genotype-Tissue Expression (GTEx) project, have generated vast amounts of genomic annotation data, e.g., epigenome and transcriptome. There is great demand for effective statistical approaches to integrate genomic annotations with the results from genome-wide association studies (GWAS). To explore the genetic architecture of human complex phenotypes, rather than relying only on GWAS, we introduce Adaptive False Discovery Rate (AdaFDR) regression to integrate genomic annotations with GWAS. For a given phenotype, AdaFDR not only increases the power of mapping its risk variants, but also adaptively incorporates relevant annotations for prioritization of genetic risk variants, allowing nonlinear effects among these annotations, such as interaction effects between genomic features. The developed algorithm is scalable to genome-wide analysis. Using AdaFDR, we performed an integrative analysis of genome-wide association studies on human complex phenotypes and genome-wide annotation resources, e.g., the Roadmap epigenome. The analysis results revealed interesting regulatory patterns of risk variants, offering new biological insights into the genetic architectures of complex phenotypes.
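
　　For readers new to false discovery rate control in large-scale testing, the following is a minimal sketch of the classical Benjamini-Hochberg step-up procedure. It is not the AdaFDR method described in the abstract, whose details are not given here; the function name and the example p-values are illustrative assumptions only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Classical Benjamini-Hochberg step-up procedure (background only, NOT AdaFDR).

    Returns a list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at level alpha
    (under independence of the test statistics).
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values, purely for illustration.
example = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], alpha=0.05)
print(example)
```

　　In the made-up example, the two smallest p-values miss their own thresholds, yet all three small p-values are rejected because the third one passes. This is the "step-up" behavior. Methods that use annotations, such as the one in the abstract, aim to gain power beyond this annotation-free baseline.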

　　Extending the adjusting-heritable-trait GWAS to bivariate analysis can help identify novel loci

Associate Professor, School of Mathematics, Sun Yat-sen University: Guo Xiaobo (郭小波)

　　[Talk Abstract]

　　In recent years, a number of published large-scale genome-wide association studies (GWASs) for human diseases or traits have adjusted for other heritable traits (adjusting-heritable-trait GWASs). However, it is known that this strategy might lead to biased genetic estimates or even false positives, causing problems of interpretation in applications. In this study, we provide a method termed ‘ETB’ that extends the adjusting-heritable-trait GWASs to bivariate analyses by integrating the summary data from the adjusting-heritable-trait GWASs and the GWASs for the adjusted heritable trait. We apply ETB to bivariate analyses of the summary data of anthropometric traits in large-scale meta-GWASs, and identify 4 loci that are novel to the literature. We also show that bivariate analyses of real data might help reveal novel loci compared with univariate analyses. Theoretical results and simulations confirm the validity and efficiency of the proposed method.

　　Hepatocellular carcinoma study based on HBV next generation sequencing

Associate Professor, School of Mathematical Sciences, Fudan University: Zhang Shuqin (张淑芹)

　　[Talk Abstract]

　　Hepatocellular carcinoma (HCC) is one of the most common types of cancer in our country, and there have been many studies on it. In this talk, we introduce our recent work on HCC classification based on HBV next-generation sequencing data. The clinical phenotype data are also analyzed, and their relations with HBV are studied.

　　Prediction analysis for microbiome sequencing data

Distinguished Research Fellow, Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University: Wang Tao (王涛)

　　[Talk Abstract]

　　One primary goal of human microbiome studies is to predict host traits based on human microbiota. However, microbial community sequencing data present significant challenges to the development of statistical methods. In particular, the samples have different library sizes, the data contain many zeros and are often over-dispersed. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR). We demonstrate the advantages of PAMIR through numerical studies.

　　Conditional random fields and their applications in bioinformatics

Research Professor, Academy of Mathematics and Systems Science, Chinese Academy of Sciences: Wu Lingyun (吴凌云)

　　[Talk Abstract]

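　　Massive molecular biology data and complex data structures pose great challenges to existing bioinformatics models and algorithms. Conditional random fields are an important class of probabilistic graphical models and a generalization of hidden Markov models, with a broader range of applicability and better performance. They are already very widely used in fields such as language recognition and image processing. This talk introduces the models and algorithms of conditional random fields, the R package CRF that we developed, and applications of conditional random fields in bioinformatics.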

　　Fast feature generation and visualization in biological sequence classification

Associate Professor, School of Computer Science, Tianjin University: Du Pufeng (杜朴风)

　　[Talk Abstract]

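　　In biological sequence classification, we need to generate features quickly, and we also need visualization to help design and select classification algorithms. In this talk, we discuss some commonly used feature generation techniques, as well as feature visualization using R.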

　　Statistical Theory A

　　Banded Spatio-Temporal Autoregressions with Application to Forecasting PM2.5

Assistant Professor, School of Economics and Management, Beihang University: Ma Yingying (马莹莹)

　　[Talk Abstract]

　　We propose a new class of spatio-temporal models with unknown and banded autoregressive coefficient matrices. The setting represents a sparse structure for high-dimensional spatial panel dynamic models when panel members represent economic (or other types of) individuals at many different locations. The structure is practically meaningful when the order of panel members is arranged appropriately. Note that the implied autocovariance matrices are unlikely to be banded, and therefore the proposal is radically different from the existing literature on inference for high-dimensional banded covariance matrices. Due to the innate endogeneity, we apply the least squares method based on a Yule-Walker equation to estimate the autoregressive matrices. A ratio-based method for determining the bandwidth of the autoregressive matrices is also proposed. Some asymptotic properties of the inference methods are established. The proposed methodology is further illustrated using both simulated and real data sets.

　　On a vector double autoregressive model

Deputy Head, Department of Statistics, School of Economics and Statistics, Guangzhou University: Zhang Xingfa (张兴发)

　　[Talk Abstract]

　　Motivated by the double autoregressive (DAR) model, in this talk we study a vector double autoregressive (VDAR) model. The model is a straightforward extension from the univariate case to the multivariate case. Sufficient ergodicity conditions are given for the model. Without second-moment conditions on the observed time series, the quasi maximum likelihood estimator (QMLE) of the parameter in the model is shown to be asymptotically normal, which does not hold for the classic vector autoregressive (VAR) model with i.i.d. errors. Simulation results confirm that our estimators perform well. An empirical study suggests that the proposed model has potential applications in practice.

　　Keywords: Vector double autoregressive model, quasi maximum likelihood estimation

　　Prediction Interval for Autoregressive Time Series via Oracally Efficient Estimation of Multi-Step Ahead Innovation Distribution Function

Associate Professor, School of Mathematical Sciences, Soochow University: Gu Lijie (顾莉洁)

　　[Talk Abstract]

　　A kernel distribution estimator (KDE) is proposed for the multi-step-ahead prediction error distribution of an autoregressive time series, based on prediction residuals. Under general assumptions, the KDE is proved to be oracally efficient, that is, as efficient as the infeasible KDE and the empirical cdf based on unobserved prediction errors. A quantile estimator is obtained from the oracally efficient KDE, and a prediction interval for the multi-step-ahead future observation is constructed using the estimated quantiles and shown to achieve the nominal confidence level asymptotically. Simulation examples corroborate the asymptotic theory.
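
　　The general idea of a kernel distribution estimator and the quantile-based interval built from it can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: the Gaussian kernel, the fixed bandwidth, the function names, and the toy residuals are all hypothetical choices, and the abstract does not specify how the prediction residuals or the bandwidth are obtained.

```python
import math


def kernel_cdf_estimate(x, residuals, bandwidth):
    """Smooth estimate of P(error <= x): the average of Gaussian CDFs
    centered at each residual (Gaussian kernel is an assumption here)."""
    n = len(residuals)
    total = 0.0
    for e in residuals:
        z = (x - e) / bandwidth
        total += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return total / n


def quantile_from_kde(alpha, residuals, bandwidth, tol=1e-8):
    """Invert the smooth CDF estimate by bisection (it is increasing in x)."""
    lo = min(residuals) - 10.0 * bandwidth
    hi = max(residuals) + 10.0 * bandwidth
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf_estimate(mid, residuals, bandwidth) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def prediction_interval(point_forecast, residuals, bandwidth, level=0.95):
    """Interval = point forecast plus lower/upper estimated error quantiles."""
    tail = (1.0 - level) / 2.0
    lower = point_forecast + quantile_from_kde(tail, residuals, bandwidth)
    upper = point_forecast + quantile_from_kde(1.0 - tail, residuals, bandwidth)
    return lower, upper


# Toy residuals and bandwidth, invented for illustration only.
toy_residuals = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(prediction_interval(10.0, toy_residuals, bandwidth=0.3))
```

　　Compared with the empirical cdf, the kernel version is smooth in x, so its quantiles can be found by simple root-finding as above.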

　　Simultaneous confidence bands for mean and variance function based on deterministic design

Ph.D., School of Mathematical Sciences, Soochow University: Cai Li (蔡利)

　　[Talk Abstract]

　　Asymptotically correct simultaneous confidence bands (SCBs) are proposed for the mean and variance functions of a nonparametric regression model based on deterministic designs. The variance estimator is as efficient, up to order $n^{-1/2}$, as the infeasible estimator that would be available if the mean function were known. Simulation experiments provide strong evidence that corroborates the asymptotic theory. The proposed SCBs are used to analyze two sets of strata pressure data from the Bullianta Coal Mine in Erdos City, Inner Mongolia, China.

　　A smooth simultaneous confidence band for correlation curve

Master's, School of Mathematical Sciences, Soochow University: Zhang Yuanyuan (张园园)

　　[Talk Abstract]

　　A smooth simultaneous confidence band (SCB) is proposed for a local measure of variance explained by regression, termed the correlation curve in Doksum et al. (1994), based on local quadratic estimation. The proposed estimator of the correlation curve is oracally efficient in the sense that it is as efficient as an infeasible correlation estimator with the variance function known. Simulated and real-data examples are provided to illustrate the usefulness of the proposed oracle SCB.

　　Statistical Theory B

　　FACTOR AND RESIDUAL EMPIRICAL PROCESSES

Lecturer, School of Science / Institute of Statistics and Big Data, Nanjing Audit University: Wang Jiangyan (王江艳)

　　[Talk Abstract]

　　The distributions of the factor return and specific error for an individual variable are important in forecasting and applications. However, they are not identified with low-dimensional time series observations. Using the recently developed theory for large dimensional approximate factor model for large panel data, the factor return and specific error can be estimated consistently. Based on the estimated factor returns and residual errors, we construct the empirical processes for estimation of the distribution functions of the factor return and specific error, respectively. We prove that the two empirical processes are oracle efficient when p≥CT^{3/2} where p and T are the dimensionality and sample size, respectively. This demonstrates that the factor and residual empirical processes behave as well as the empirical processes pretending that the factor returns and specific errors for an individual variable are directly observable. Based on this oracle property, we construct the simultaneous confidence bands for the distributions of the factor return and specific error. Extensive simulation studies check that the estimated bands have good coverage probabilities. Our real data analysis shows that the factor return distribution has a structural change during the crisis in 2008.

　　Free-knot spline for Generalized Regression Models

Associate Professor, Department of Statistics, University of Illinois at Chicago: Wang Jing (王静)

　　[Talk Abstract]

　　A computational study of bootstrap confidence bands based on the free-knot spline technique is carried out for generalized regression models, typically logistic regression. A parametric bootstrap is used to study the proposed estimator and to construct confidence bands for the unknown predictor function. In free-knot spline regression, treating the knot locations as additional parameters offers greater flexibility and the potential to account for rapid shifts or structural changes in the target functions. However, the lethargy property of the optimization objective function results in replicate knot solutions. Penalized estimating equations are proposed based on Jupp's transformation and on a penalty added directly to the distances between knots (Lindstrom (1999)). Another approach for selecting knot locations is proposed by employing variable selection procedures subject to constraints on the spline derivatives. The finite-sample behavior of the proposed method is also investigated through a real example.

　　Spatially Varying Coefficient Models

Assistant Professor, Department of Mathematics, College of William & Mary: Wang Guannan (王冠男)

　　[Talk Abstract]

　　In this paper, we study the estimation of spatially varying coefficient models for data distributed over complex domains. We use bivariate splines over triangulations to represent the coefficient functions. The estimators of the coefficient functions are shown to be consistent, and their rate of convergence is established. A penalized least squares method is proposed to estimate the model. We also propose hypothesis tests to examine whether the coefficient functions really vary over space. The proposed method is computationally expedient, and thus usable for analyzing massive datasets. The performance of the estimators and the proposed tests is evaluated through several simulation examples and a real data analysis.

　　Quantile Regression Outlier Diagnostic: R package `quokar`

Ph.D., School of Statistics, Renmin University of China: Wang Wenjing (王文静)

　　[Talk Abstract]

　　An extensive toolbox for estimation and inference in quantile regression has been developed over the past decades. Recently, researchers have studied tools for quantile regression model diagnostics. We built the R package `quokar` to implement outlier diagnostic methods in the R language. This talk offers a brief tutorial introduction to the package. Package `quokar` is open-source and can be freely downloaded from GitHub: http://www.github.com/wenjingwang/quokar. To move one step further, we also plot the diagnostic model in data space, using the R package `rggobi`, to observe how the model performs.

　　Which Kit Kat chocolate tastes best: Statistical ranking models and their implementation in R

Ph.D., Department of Biostatistics, University of Texas: Cao Ming (曹明)

　　[Talk Abstract]

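　　Ranking is a common need. Are the top few results that Google returns (PageRank) really what you want? Last season the Golden State Warriors set a record with 73 regular-season wins, yet did not win the championship in the end. Was their "true strength" really number one? We start from the Bradley-Terry model commonly used in sports analytics and, taking a recent and very interesting Johns Hopkins project on which Kit Kat chocolate tastes best as an example, discuss statistical models for ranking and several related R packages.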
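
　　As a rough illustration of the Bradley-Terry model mentioned above: each item i has a positive strength p_i, and item i beats item j with probability p_i / (p_i + p_j). The sketch below fits the strengths with the standard minorization-maximization (MM) iteration. The function name and the win counts are hypothetical, the talk's own data and R packages are not reproduced here, and the code assumes every item has at least one win and one loss.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths with the MM iteration.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths normalized to sum to 1.
    Assumes every item has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons between i and j, weighted by 1 / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical head-to-head taste-test counts for three flavors.
toy_wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
strengths = fit_bradley_terry(toy_wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(strengths, ranking)
```

　　With only two items and a 3-to-1 head-to-head record, the fitted strengths are 0.75 and 0.25, matching the observed win rate. With more items, the model combines all pairwise results into one ranking.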

　　Capital of Statistics (统计之都): a professional, people-oriented, and upright Chinese statistics community.
