StatLearn 2013 - Workshop on "Challenging problems in Statistical Learning"

Statistical learning nowadays plays a growing role in many scientific fields and must therefore face new kinds of problems. It is consequently important to propose statistical learning methods adapted to the modern problems raised by the various application domains. Beyond the accuracy of the proposed methods, they should also provide a better understanding of the observed phenomena. To foster contacts between the different communities and thereby help new ideas emerge, an international colloquium (held in English) on the theme "Challenging problems in Statistical Learning" was organised at Université Bordeaux Segalen on 8 and 9 April 2013. The recordings of the talks given at this colloquium can be found below.

  • 51 minutes 6 seconds
    Regularized PCA to denoise and visualize data (Julie Josse)
    Principal component analysis (PCA) is a well-established method commonly used to explore and visualize data. A classical PCA model is the fixed-effect model, where data are generated as a fixed structure of low rank corrupted by noise. Under this model, PCA does not provide the best recovery of the underlying signal in terms of mean squared error. Following the same principle as in ridge regression, we propose a regularized version of PCA that boils down to thresholding the singular values. Each singular value is multiplied by a term which can be seen as the ratio of the signal variance over the total variance of the associated dimension. The regularization term is derived analytically using asymptotic results and can also be justified by a Bayesian treatment of the model. Regularized PCA provides promising results in terms of the recovery of the true signal and the graphical outputs, in comparison with classical PCA and with a soft-thresholding estimation strategy. The method is illustrated through a simulation study and a real dataset coming from genetics. We will also highlight the ability of the method to handle missing values properly. (A toy numerical sketch of the shrinkage idea follows this entry.)
    16 May 2013, 10:00 pm
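    A minimal numpy sketch of the singular-value shrinkage described above, assuming a deliberately crude noise-variance estimate (the mean of the trailing eigenvalues); this illustrates the principle only and is not the authors' implementation.
      import numpy as np

      def regularized_pca(X, S):
          """Denoise X by shrinking each of the first S singular values by an
          estimate of (signal variance) / (total variance) of that dimension."""
          n, p = X.shape
          mean = X.mean(axis=0)
          U, d, Vt = np.linalg.svd(X - mean, full_matrices=False)
          lam = d ** 2 / (n - 1)                      # eigenvalues of the covariance
          sigma2 = lam[S:].mean()                     # crude noise-variance estimate
          shrink = np.clip((lam[:S] - sigma2) / lam[:S], 0.0, None)
          return (U[:, :S] * (d[:S] * shrink)) @ Vt[:S] + mean

      def truncated_pca(X, S):
          """Plain rank-S reconstruction, for comparison."""
          mean = X.mean(axis=0)
          U, d, Vt = np.linalg.svd(X - mean, full_matrices=False)
          return (U[:, :S] * d[:S]) @ Vt[:S] + mean

      # toy check: rank-2 signal corrupted by noise
      rng = np.random.default_rng(0)
      signal = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))
      X = signal + 0.5 * rng.normal(size=(100, 20))
      for name, rec in [("plain", truncated_pca(X, 2)), ("regularized", regularized_pca(X, 2))]:
          print(name, np.mean((rec - signal) ** 2))   # MSE of each reconstruction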
  • 57 minutes 42 seconds
    Strategies to analyze (Benoît Liquet)
    Recent technological advances in molecular biology have given rise to numerous large-scale datasets whose analysis raises serious methodological challenges, mainly relating to the size and complex structure of the data. Considerable experience has been gained over the past decade, mainly in genetics, from the Genome-Wide Association Study (GWAS) era, and more recently in transcriptomics and metabolomics. Building upon the corresponding wide literature, we present methods used to analyze OMICS data within each of the three main types of approaches: univariate models, dimension-reduction techniques, and variable-selection models. We focus on methods for which ready-to-use packages are available. In this context, we propose R2GUESS, an R package which interfaces a C++ implementation of a fully sparse Bayesian variable selection (BVS) approach for linear regression that can analyze single and multiple responses in an integrated way. A simulation study and an illustration in the context of GWAS are presented to show the performance of the BVS approach. (A brute-force toy version of Bayesian variable selection is sketched below.)
    16 May 2013, 10:00 pm
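    R2GUESS relies on an MCMC sampler written in C++; purely as a hedged illustration of the underlying idea, the sketch below enumerates all models of a tiny problem under Zellner's g-prior and reports posterior inclusion probabilities. The prior inclusion probability and the value of g are arbitrary choices for this toy example, not recommendations.
      import itertools
      import numpy as np

      def g_prior_log_bf(y, Xg, g):
          """Log Bayes factor of the model with predictors Xg (plus intercept)
          against the intercept-only model, under Zellner's g-prior."""
          n, p = Xg.shape
          yc = y - y.mean()
          Xc = Xg - Xg.mean(axis=0)
          beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
          r2 = 1.0 - np.sum((yc - Xc @ beta) ** 2) / np.sum(yc ** 2)
          return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

      # toy "GWAS": 8 candidate predictors, two of them truly associated with y
      rng = np.random.default_rng(1)
      n, p, g = 200, 8, 200.0
      X = rng.normal(size=(n, p))
      y = 1.0 * X[:, 0] - 0.8 * X[:, 3] + rng.normal(size=n)

      # enumerate all 2^p models (feasible only for tiny p; samplers are used in practice)
      prior_incl = 0.2                                      # prior inclusion probability
      log_post = {}
      for k in range(p + 1):
          for gamma in itertools.combinations(range(p), k):
              lbf = g_prior_log_bf(y, X[:, list(gamma)], g) if k > 0 else 0.0
              log_post[gamma] = lbf + k * np.log(prior_incl) + (p - k) * np.log(1 - prior_incl)
      logZ = np.logaddexp.reduce(list(log_post.values()))
      pip = np.zeros(p)                                     # posterior inclusion probabilities
      for gamma, lp in log_post.items():
          pip[list(gamma)] += np.exp(lp - logZ)
      print(np.round(pip, 3))                               # variables 0 and 3 should stand out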
  • 49 minutes 34 seconds
    Modular priors for partially identified models (Ioanna Manolopoulou)
    This work is motivated by the challenges of drawing inferences from presence-only data. For example, when trying to determine what habitat sea turtles "prefer", we only have data on where turtles were observed, not data about where the turtles actually are. Therefore, if we find that our sample contains very few turtles living in regions with tall sea grass, we cannot conclude that these areas are unpopular with the turtles, merely that we are unlikely to observe them there. Similar issues arise in forensic accounting: attempts to determine which companies are apt to misreport their official earnings, based on a history of which firms were censured by the SEC, are confounded by the fact that we only observe which firms got caught cheating, not which firms cheat (many of whom do not get caught). This sort of confounding is insurmountable from a point-estimation perspective, but the data are not entirely uninformative either. Our present work is devoted to parametrizing observation models in a way that isolates which aspects of the model are informed by the data and which aspects are not. This approach allows us to construct priors which are informative with respect to the unidentified parts of the model without simultaneously (and unintentionally) biasing posterior estimates of the identified parameters; these priors do not "fight against" the data. In addition, their modularity allows for convenient sensitivity analysis, in order to examine the extent to which our ultimate conclusions are driven by prior assumptions as opposed to our data. Joint work with Richard Hahn and Jared Murray. (A toy illustration of this kind of partial identification follows below.)
    16 May 2013, 10:00 pm
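    The sketch below is a toy illustration (not the model from the talk) of why such priors matter: a detection at a site occurs with probability psi * delta (presence times detection), so only the product is identified; the posterior for psi alone moves with the prior placed on delta, while the posterior for the identified product barely does.
      import numpy as np
      from scipy import stats

      # toy confounded observation model: each of n_sites holds the species with
      # probability psi and, if present, yields a detection with probability delta,
      # so the detection count only identifies the product psi * delta
      n_sites, detections = 500, 120
      psi = np.linspace(0.001, 0.999, 400)
      delta = np.linspace(0.001, 0.999, 400)
      P, D = np.meshgrid(psi, delta, indexing="ij")
      loglik = stats.binom.logpmf(detections, n_sites, P * D)

      # same data, two different priors on the unidentified detection probability
      for a, b in [(1, 1), (8, 12)]:                        # flat vs. informative Beta(8, 12)
          logpost = loglik + stats.beta.logpdf(D, a, b)     # flat prior on psi
          post = np.exp(logpost - logpost.max())
          post /= post.sum()
          psi_mean = np.sum(P * post)
          prod_mean = np.sum(P * D * post)
          print(f"delta ~ Beta({a},{b}):  E[psi | data] = {psi_mean:.3f}"
                f"   E[psi*delta | data] = {prod_mean:.3f}")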
  • 58 minutes 6 seconds
    New challenges for (biological) network inference with sparse Gaussian graphical models (Julien Chiquet)
    Network inference methods based upon sparse Gaussian graphical models (GGMs) have recently emerged as a promising exploratory tool in genomics. They give a sound representation of direct relationships between genes and come with sparse inference strategies well suited to the high-dimensional setting. They are also versatile enough to include prior structural knowledge to drive the inference. Still, GGMs are now in need of a second wind after showing some limitations: among other questions, the state-of-the-art reconstruction strategies often suffer from a lack of robustness and are not fully appropriate for treating heterogeneous data. From that perspective, we will discuss recent approaches that try to overcome the limitations essentially induced by the nature of genomic data and of the underlying biological mechanisms. (A minimal sparse-GGM example follows this entry.)
    16 May 2013, 10:00 pm
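    A minimal example of the vanilla sparse-GGM estimator that the talk takes as its starting point, using scikit-learn's graphical lasso with a cross-validated penalty; edges are read off the non-zero entries of the estimated precision matrix. The data dimensions and the true precision matrix are made up for the illustration.
      import numpy as np
      from sklearn.covariance import GraphicalLassoCV

      # simulate "genes" with a few true direct (conditional) dependencies
      rng = np.random.default_rng(2)
      p = 10
      prec = np.eye(p)
      prec[0, 1] = prec[1, 0] = 0.4
      prec[2, 3] = prec[3, 2] = 0.4
      X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(prec), size=200)

      # sparse GGM: non-zero off-diagonal entries of the estimated precision
      # matrix are interpreted as edges of the gene network
      model = GraphicalLassoCV().fit(X)
      edges = (np.abs(model.precision_) > 1e-4) & ~np.eye(p, dtype=bool)
      print(np.argwhere(np.triu(edges)))                    # recovered edge list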
  • 1 hour 6 minutes
    Learning with the Online EM Algorithm (Olivier Cappé)
    The Online Expectation-Maximization (EM) algorithm is a generic algorithm that can be used to estimate the parameters of latent data models incrementally from large volumes of data. The general principle of the approach is to use a stochastic approximation scheme, in the domain of sufficient statistics, as a proxy for a limiting, deterministic, population version of the EM recursion. In this talk, I will briefly review the convergence properties of the method and discuss some applications and extensions of the basic approach. (A short sketch of the recursion for a Gaussian mixture follows below.)
    16 May 2013, 10:00 pm
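    A short sketch of the online EM recursion for a univariate Gaussian mixture, assuming ad-hoc initial sufficient statistics and step sizes: each new observation triggers an E-step under the current parameters, a stochastic-approximation update of the expected sufficient statistics, and the usual M-step mapping.
      import numpy as np

      def online_em_gmm(stream, K):
          """Online EM for a univariate K-component Gaussian mixture (sketch).
          Sufficient statistics per component: E[z], E[z*y], E[z*y^2]."""
          s0 = np.full(K, 1.0 / K)                    # ad-hoc initial statistics
          s1 = np.linspace(-0.5, 0.5, K)
          s2 = s1 ** 2 + 1.0
          for t, y in enumerate(stream):
              # current parameters from the statistics (M-step mapping)
              w, mu = s0 / s0.sum(), s1 / s0
              var = np.maximum(s2 / s0 - mu ** 2, 1e-6)
              # E-step for the new observation under the current parameters
              logr = np.log(w) - 0.5 * np.log(var) - 0.5 * (y - mu) ** 2 / var
              r = np.exp(logr - logr.max())
              r /= r.sum()
              # stochastic-approximation update of the sufficient statistics
              g = (t + 10.0) ** -0.6                  # decreasing step size
              s0 = (1 - g) * s0 + g * r
              s1 = (1 - g) * s1 + g * r * y
              s2 = (1 - g) * s2 + g * r * y ** 2
          return s0 / s0.sum(), s1 / s0, s2 / s0 - (s1 / s0) ** 2

      # toy stream: mixture 0.3 * N(-2, 1) + 0.7 * N(2, 0.25)
      rng = np.random.default_rng(3)
      n = 20_000
      data = np.where(rng.random(n) < 0.3, rng.normal(-2, 1, n), rng.normal(2, 0.5, n))
      w, mu, var = online_em_gmm(data, K=2)
      print(np.round(w, 2), np.round(mu, 2), np.round(var, 2))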
  • 54 minutes 1 second
    Investigating nonlinear relationships in high-dimensional settings (Frédéric Ferraty)
    The high-dimensional setting is a modern and dynamic research area in statistics. It covers numerous situations where the number of explanatory variables is much larger than the sample size. This is the case in genomics, when one observes (dozens of) thousands of gene expressions; typically one has at hand a small sample of high-dimensional vectors derived from a large set of covariates. Such datasets will be abbreviated as HDD-I, for High Dimensional Data of type I. Another setting corresponds to the observation of a collection of curves, surfaces, etc., sampled at high frequencies (design points); these sets of data are gathered under the terminology of functional data (or functional variables) and will be abbreviated as HDD-II (High Dimensional Data of type II). The main feature of HDD-II (and difference with HDD-I) is the existence of high collinearities between explanatory variables, which reduces the overall dimensionality of the data. The last twenty years have been devoted to developing successful methodologies able to manage such high-dimensional data. Essentially, sparse linear modelling involving variable-selection techniques has been proposed to investigate HDD-I, whereas non-selective functional linear approaches have mainly been introduced to handle HDD-II. However, as in the standard multivariate setting, the linearity assumption may be too restrictive and hide relevant nonlinear aspects. This is why, in the last decade, flexible methodologies taking nonlinear relationships into account have been developed to better understand the structure of such high-dimensional data. The aim of this talk is therefore to present, and illustrate on various examples, recent approaches connecting nonparametric, selective and functional techniques in order to handle nonlinear relationships in HDD-I or HDD-II settings, which allows us to tackle various challenging issues. (A small functional kernel-regression sketch follows this entry.)
    16 May 2013, 10:00 pm
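    One simple way to capture a nonlinear relationship in an HDD-II setting is a Nadaraya-Watson type estimator built on a distance between curves; the sketch below uses an approximate L2 distance, a Gaussian kernel and a hand-picked bandwidth (no data-driven selection), purely to illustrate the flavour of such methods.
      import numpy as np

      def functional_kernel_regression(X_train, y_train, X_new, h):
          """Nadaraya-Watson estimator for curve predictors: a Gaussian kernel on an
          approximate L2 distance between curves sampled on a common grid (sketch)."""
          diff = X_new[:, None, :] - X_train[None, :, :]
          d = np.sqrt((diff ** 2).mean(axis=2))             # pairwise curve distances
          w = np.exp(-0.5 * (d / h) ** 2)
          return (w @ y_train) / w.sum(axis=1)

      # toy HDD-II data: curves on a fine grid, response depends nonlinearly on amplitude
      rng = np.random.default_rng(4)
      t = np.linspace(0, 1, 100)
      amp = rng.uniform(0.5, 2.0, size=300)
      X = amp[:, None] * np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=(300, 100))
      y = np.sin(3 * amp) + 0.05 * rng.normal(size=300)

      pred = functional_kernel_regression(X[:200], y[:200], X[200:], h=0.2)
      print("held-out MSE:", np.mean((pred - y[200:]) ** 2))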
  • 48 minutes 50 seconds
    Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator (Arnaud Doucet)
    When an unbiased estimator of the likelihood is used within a Markov chain Monte Carlo (MCMC) scheme, it is necessary to trade off the number of samples used against the computing time. Many samples for the estimator will result in an MCMC scheme which has similar properties to the case where the likelihood is exactly known, but will be expensive. Few samples for the construction of the estimator will result in faster estimation, but at the expense of slower mixing of the Markov chain. We explore the relationship between the number of samples and the efficiency of the resulting MCMC estimates. Under specific assumptions about the likelihood estimator, we are able to provide guidelines on the number of samples to select for a general Metropolis-Hastings proposal. We provide theory which justifies the use of these assumptions for a large class of models. On a number of examples, we find that the assumptions on the likelihood estimator are accurate. This is joint work with Mike Pitt (University of Warwick) and Robert Kohn (UNSW). (A toy pseudo-marginal sampler illustrating this trade-off follows below.)
    16 May 2013, 10:00 pm
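    A toy pseudo-marginal Metropolis-Hastings sampler illustrating the trade-off discussed above: the likelihood of a simple latent-variable model (chosen here only for illustration) is replaced by an N-sample unbiased estimate, and the acceptance rate degrades as N shrinks. The estimate at the current state is recycled rather than recomputed, which is what keeps the invariant distribution exact.
      import numpy as np

      rng = np.random.default_rng(5)

      def log_lik_hat(theta, y, N):
          """Unbiased importance-sampling estimate of the likelihood of a toy
          latent-variable model: y_i | z_i ~ N(z_i, 1) with z_i ~ N(theta, 1),
          using N latent draws per observation."""
          z = rng.normal(theta, 1.0, size=(N, len(y)))
          log_w = -0.5 * (y - z) ** 2 - 0.5 * np.log(2 * np.pi)
          m = log_w.max(axis=0)
          return np.sum(m + np.log(np.mean(np.exp(log_w - m), axis=0)))

      def pseudo_marginal_mh(y, N, n_iter=5000, step=0.3):
          """Random-walk MH in which the exact likelihood is replaced by its unbiased
          estimate; the estimate at the current state is recycled, never recomputed."""
          theta, ll = 0.0, log_lik_hat(0.0, y, N)
          chain, accepted = [], 0
          for _ in range(n_iter):
              prop = theta + step * rng.normal()
              ll_prop = log_lik_hat(prop, y, N)
              log_prior_ratio = -0.5 * (prop ** 2 - theta ** 2) / 10.0   # N(0, 10) prior
              if np.log(rng.random()) < ll_prop - ll + log_prior_ratio:
                  theta, ll = prop, ll_prop
                  accepted += 1
              chain.append(theta)
          return np.array(chain), accepted / n_iter

      y = rng.normal(1.5, np.sqrt(2.0), size=50)            # data simulated at theta = 1.5
      for N in (1, 5, 50):                                  # fewer samples: cheaper but stickier
          chain, acc = pseudo_marginal_mh(y, N)
          print(f"N = {N:3d}   acceptance = {acc:.2f}   posterior mean = {chain[2500:].mean():.2f}")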
  • 55 minutes 48 seconds
    Clustering of variables combined with variable selection using random forests: application to gene expression data (Robin Genuer & Vanessa Kuentz-Simonet)
    The main goal of this work is to tackle the problem of dimension reduction for high-dimensional supervised classification. The motivation is to handle gene expression data. The proposed method works in two steps. First, one eliminates redundancy using clustering of variables, based on the R package ClustOfVar. This first step is based only on the explanatory variables (genes). Second, the synthetic variables (summarizing the clusters obtained at the first step) are used to construct a classifier (e.g. logistic regression, LDA, random forests). We stress that the first step reduces the dimension and gives linear combinations of the original variables (synthetic variables); this step can be considered as an alternative to PCA. A selection of predictors (synthetic variables) in the second step then gives a set of relevant original variables (genes). Numerical performances of the proposed procedure are evaluated on gene expression datasets, on which we compare our methodology with the LASSO and with sparse PLS discriminant analysis. (A rough two-step sketch follows this entry.)
    16 May 2013, 10:00 pm
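    A rough Python sketch in the spirit of the two-step procedure (not the ClustOfVar-based implementation): variables are clustered hierarchically on a correlation-based distance, each cluster is summarised by its first principal component, and a random forest is trained on these synthetic variables. The simulated group structure and all tuning choices are assumptions of the toy example.
      import numpy as np
      from scipy.cluster.hierarchy import fcluster, linkage
      from sklearn.decomposition import PCA
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      # simulated data: 10 groups of 20 strongly correlated variables; the class
      # depends on the latent factors behind groups 0 and 3 only
      rng = np.random.default_rng(6)
      n, n_groups, group_size = 150, 10, 20
      latent = rng.normal(size=(n, n_groups))
      X = np.repeat(latent, group_size, axis=1) + 0.5 * rng.normal(size=(n, n_groups * group_size))
      y = (latent[:, 0] - latent[:, 3] + 0.5 * rng.normal(size=n) > 0).astype(int)

      # step 1: cluster the variables and build one synthetic variable per cluster
      dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
      Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
      labels = fcluster(Z, t=n_groups, criterion="maxclust")
      synthetic = np.column_stack([
          PCA(n_components=1).fit_transform(X[:, labels == k]).ravel()
          for k in np.unique(labels)
      ])

      # step 2: classify from the synthetic variables and inspect their importances
      rf = RandomForestClassifier(n_estimators=300, random_state=0)
      print("CV accuracy:", cross_val_score(rf, synthetic, y, cv=5).mean().round(2))
      print("importances:", np.round(rf.fit(synthetic, y).feature_importances_, 2))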
  • 1 hour 1 minute
    Bayesian inference for the exponential random graph model (Nial Friel)
    The exponential random graph model is arguably the most popular model for the statistical analysis of network data. However, despite its widespread use, it is very complicated to handle from a statistical perspective, mainly because the likelihood function is intractable for all but trivially small networks. This talk will outline some recent work in this area to overcome this intractability. In particular, we will outline some approaches to carrying out Bayesian parameter estimation and show how this can be extended to estimate Bayes factors between competing models. (A toy exchange-algorithm sketch follows below.)
    16 May 2013, 10:00 pm
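    One popular way around the intractable normalising constant is the exchange algorithm, which simulates an auxiliary network at each proposed parameter value so that the constants cancel in the acceptance ratio. The sketch below applies an approximate version of that idea (short Gibbs auxiliary runs, edge and triangle statistics, a small made-up network) purely as an illustration, not as the method presented in the talk.
      import numpy as np

      rng = np.random.default_rng(7)
      n_nodes = 12

      def stats(A):
          """ERGM sufficient statistics: (number of edges, number of triangles)."""
          return np.array([A.sum() / 2.0, np.trace(A @ A @ A) / 6.0])

      def simulate_ergm(theta, A, sweeps):
          """Approximate draw from the ERGM by Gibbs-sampling each dyad in turn."""
          A = A.copy()
          for _ in range(sweeps):
              for i in range(n_nodes):
                  for j in range(i + 1, n_nodes):
                      common = A[i] @ A[j]                  # shared partners of i and j
                      delta = np.array([1.0, common])       # change statistics for dyad (i, j)
                      p_on = 1.0 / (1.0 + np.exp(-(theta @ delta)))
                      A[i, j] = A[j, i] = float(rng.random() < p_on)
          return A

      # "observed" network: a draw at a known parameter value (toy data)
      theta_true = np.array([-1.5, 0.2])
      A_obs = simulate_ergm(theta_true, np.zeros((n_nodes, n_nodes)), sweeps=50)
      s_obs = stats(A_obs)

      # exchange algorithm: simulating an auxiliary network at the proposed value
      # makes the intractable normalising constants cancel in the acceptance ratio
      theta, chain = np.zeros(2), []
      for _ in range(3000):
          prop = theta + 0.1 * rng.normal(size=2)           # random-walk proposal
          s_aux = stats(simulate_ergm(prop, A_obs, sweeps=5))
          log_prior_ratio = -0.5 * (prop @ prop - theta @ theta) / 25.0   # N(0, 25) prior
          if np.log(rng.random()) < (prop - theta) @ (s_obs - s_aux) + log_prior_ratio:
              theta = prop
          chain.append(theta.copy())
      print("posterior mean (edges, triangles):", np.round(np.mean(chain[1000:], axis=0), 2))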