Events Calendar

Data Science Project Match Event

Speakers, Conferences & Workshops
Event time: 
Tuesday, December 13, 2022 - 2:00pm
Yale Institute for Network Science
17 Hillhouse Avenue, 3rd floor
New Haven, CT
Event description: 

Data Science Project Match

In-person event, also available via webcast.

Introduction by Daniel Spielman
Sterling Professor of Computer Science; Professor of Statistics & Data Science, and of Mathematics


Drago Radev
A. Bartlett Giamatti Professor of Computer Science

Natural Language Processing

David van Dijk, Ph.D.
Assistant Professor of Medicine, Yale School of Medicine
Assistant Professor of Computer Science

Spatiotemporal graph-neural networks for brain dynamics and spatial genomics

Graph-neural networks (or geometric deep learning) are revolutionizing machine learning and data science. They combine ideas from graph theory, geometry, topology, and deep learning to learn powerful non-linear models on graph-structured data. In the Van Dijk Lab we are developing several new types of graph neural networks, based on ideas from dynamical systems, computer vision, and natural language processing, and applying them to diverse biomedical problems. In one application, we are using graph neural networks to model spatiotemporal brain activity data, such as whole-cortex calcium imaging and fMRI recordings. In a second application, we are using graph neural networks to model spatial transcriptomic data, which come from a new technology for measuring high-dimensional gene expression at the single-cell level with spatial resolution. Using our algorithm, we infer cell-cell interactions in measurements from kidney cancer and from brain tissue of multiple sclerosis patients. In these projects, there is the opportunity to focus more on the algorithmic side or on the application, and you will work closely with postdocs and grad students in the lab.

Mark Gerstein
Albert L Williams Professor of Biomedical Informatics
Professor of Molecular Biophysics & Biochemistry, of Computer Science, and of Statistics & Data Science
Project presented by Can Koçkan, Postdoctoral Associate, Gerstein Lab

Privacy-Preserving and Secure Genome Analysis

We are currently approaching the broad topic of genome privacy and security from two different angles. First, we are trying to assess the risks that public genome databases and other genomic information in circulation pose to individuals. We are interested in ways to quantify the information leakage in different scenarios, such as the minimum number of SNPs needed to accurately identify a person from a DNA sample taken from a coffee cup they recently left behind. Second, we are evaluating the performance of different techniques for privacy-preserving genome analysis in cloud computing environments such as AWS, Azure, and Google Cloud Platform. These techniques include Homomorphic Encryption, Secure Multiparty Computation, and Trusted Execution Environments such as Intel SGX. Each of these techniques has advantages and disadvantages compared to the others in terms of the strength of its security and protection guarantees, runtime performance, ease of use, etc. We are always trying to come up with efficient algorithms and data structures that allow these techniques to be used for core bioinformatics tasks such as genotype imputation and genome-wide association studies.
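
As a toy illustration of the secure multiparty idea mentioned above, the sketch below uses additive secret sharing so that several sites can compute the total of their private counts (for example, allele counts) without any site revealing its own number. This is only an assumption about what a minimal example might look like, not the lab's method; the function names, the modulus, and the three example counts are hypothetical, and a real protocol would also need secure channels and protection against misbehaving parties.

```python
import secrets

MODULUS = 2**61 - 1  # any modulus larger than the true total works


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Add up private values while each party only ever sees random shares."""
    n = len(private_values)
    # Row i holds the shares produced by party i; share j is sent to party j.
    shares = [make_shares(v, n) for v in private_values]
    # Each party adds the shares it received, one from every party.
    partial_sums = [sum(shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial_sums) % MODULUS


# Three hypothetical sites, each holding a private count.
print(secure_sum([12, 7, 30]))
```

Each individual share is uniformly random, so a single share says nothing about the count it came from; only the combined total is recoverable.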

Julian Posada
Incoming Assistant Professor, American Studies

Studying an Invisible Population: Mapping the Availability of Online Data Work

Many organizations that develop data-intensive technologies outsource data generation, annotation, and verification processes through online platforms like Amazon Mechanical Turk, Remotasks, and Appen. Data workers are an invisible population because platform companies often conceal information about their contract workers from researchers. Other options for studying this invisible population with computational social research methods exist, including analyzing web traffic data to understand the geography of data work. This talk presents an ongoing project to study the coloniality of data work, a crucial step in understanding the social conditions in which many data sets are developed. The preliminary results show the existence of a North-South divide in data production, in which most of the demand for data work comes from the United States and the supply from a handful of countries with particular social, economic, and infrastructural conditions. I collected web traffic data from 93 platforms in 2019-2021 and found that most of the workers’ traffic comes from Venezuela, a country experiencing the highest levels of inflation in the world. However, traffic from Venezuela declined in favor of Kenya during the last recorded time period (Summer 2021). Future work will consider how the Russian invasion of Ukraine and the impending global financial crisis have affected and will affect the geography of data work in 2022 and early 2023.

Nils Rudi
Professor, Yale School of Management

Soccer in-play probabilities

The question this project looks at is how to forecast win probabilities and goal distributions during a game of soccer. The difference of two Poisson processes calibrated from betting odds provides a simple and quite good model for estimating soccer in-play win probabilities. In earlier work, we studied such models, including some analytical properties and extensions to time-dependent and score-dependent goal arrival rates. This project aims to extend these models to take into account various elements, including:
– Red cards
– Other factors for pre-game prediction
– Uncertainty (error) in parameter estimates
– Updating
– Metrics for forecasting accuracy, mainly based on scoring rules (the Brier score is a simple example)
The project will require good mastery of R and preferably Python, and will mainly use methods from stochastic processes, Markov chains, and Bayesian methods.
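
As a rough illustration of the difference-of-two-Poissons model described above, the sketch below computes in-play home-win, draw, and away-win probabilities. It assumes constant goal rates over a 90-minute match (the simplest case, without the time-dependent or score-dependent extensions), and the rates 1.5 and 1.1 are made-up placeholders rather than values calibrated from betting odds. The function names are hypothetical. The Brier score shown is the multi-category form, summed over the three outcomes.

```python
import math


def poisson_pmf(k, mu):
    """Probability of exactly k goals when the expected count is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def in_play_probabilities(rate_home, rate_away, minute,
                          score_home=0, score_away=0,
                          match_length=90.0, max_goals=25):
    """Return (home win, draw, away win) probabilities at a given minute.

    rate_home and rate_away are expected goals over a full match. They are
    assumed constant in time, so each team's remaining goals are independent
    Poisson counts scaled by the fraction of the match that is left.
    """
    remaining = max(0.0, 1.0 - minute / match_length)
    mu_home = rate_home * remaining
    mu_away = rate_away * remaining
    home_win = draw = away_win = 0.0
    # Sum the joint distribution of remaining goals, truncated at max_goals.
    for h in range(max_goals + 1):
        p_h = poisson_pmf(h, mu_home)
        for a in range(max_goals + 1):
            p = p_h * poisson_pmf(a, mu_away)
            diff = (score_home + h) - (score_away + a)
            if diff > 0:
                home_win += p
            elif diff == 0:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


def brier_score(forecast, outcome_index):
    """Sum of squared errors between a forecast and the observed outcome."""
    return sum((p - (1.0 if i == outcome_index else 0.0)) ** 2
               for i, p in enumerate(forecast))


# Hypothetical match: level at 1-1 after 60 minutes.
forecast = in_play_probabilities(1.5, 1.1, minute=60,
                                 score_home=1, score_away=1)
print(forecast)
print(brier_score(forecast, outcome_index=0))  # score if the home team wins
```

Summing the joint distribution directly keeps the example dependency-free; the same quantities could be read off a Skellam distribution, which is the distribution of the difference of two independent Poisson counts.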

Zhuoran Yang
Assistant Professor, Statistics and Data Science

Navigating a maze efficiently with deep reinforcement learning 

How do we use a map to guide us to a desired place? First, we locate ourselves on the map by looking at the surroundings. Then, we find a path on the map that connects our current spot to the target spot, and follow the path. But what if we don’t have a map at hand? Deep reinforcement learning is a class of algorithms that deals with exactly this challenging scenario.

Specifically, deep reinforcement learning consists of three components: (i) exploration, (ii) representation learning, and (iii) decision-making. The exploration module specifies how the algorithm collects data, like wandering around to see the neighborhood. The representation learning module extracts useful representations from the observations, like building a map in our mind from the gathered data. The decision-making module learns the optimal decisions based on the learned representation, like finding a path on the learned map and starting to navigate.

In this project, we aim to implement deep reinforcement learning algorithms for solving a stylized navigation problem on partially observable mazes, where the observations are the surroundings. We will explore how different choices for the three components affect the performance of the deep reinforcement learning algorithm. For example, for exploration, we can use epsilon-greedy, UCB, or Thompson sampling; for representation learning, we can use a VAE or contrastive learning; and for decision-making, we can use actor-critic or Q-learning. We aim to gain some understanding of how to design an efficient deep reinforcement learning algorithm in a principled manner.
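
To make two of the named choices concrete, the sketch below pairs epsilon-greedy exploration with Q-learning on a tiny maze. It is a deliberately simplified stand-in and not the project's setup: the maze is fully observable, the "representation" is just a lookup table rather than a learned deep model, and the maze layout, rewards, and hyperparameters are all hypothetical choices made for illustration.

```python
import random

# Hypothetical 4x4 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = ["S.#.",
        ".##.",
        "....",
        "#..G"]
START, GOAL = (0, 0), (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell; hitting a wall or the border leaves the state unchanged."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0])
    if not inside or MAZE[new_row][new_col] == "#":
        new_row, new_col = row, col
    done = (new_row, new_col) == GOAL
    reward = 1.0 if done else -0.01  # small step cost favors short paths
    return (new_row, new_col), reward, done


def train(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = START
        for _ in range(200):  # cap on episode length
            values = q.setdefault(state, [0.0] * len(ACTIONS))
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))  # explore
            else:
                best = max(values)  # exploit, breaking ties at random
                action = rng.choice(
                    [a for a, v in enumerate(values) if v == best])
            next_state, reward, done = step(state, action)
            next_values = q.setdefault(next_state, [0.0] * len(ACTIONS))
            target = reward + (0.0 if done else gamma * max(next_values))
            values[action] += alpha * (target - values[action])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from START until GOAL or max_steps."""
    state, path = START, [START]
    for _ in range(max_steps):
        values = q.get(state, [0.0] * len(ACTIONS))
        state, _, done = step(state, values.index(max(values)))
        path.append(state)
        if done:
            break
    return path


print(greedy_path(train()))
```

Swapping the exploration rule (for example, to UCB) only changes how `action` is picked inside `train`, which is one way to compare component choices while holding the rest fixed.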

Jennifer Marlon
Senior Research Scientist, School of the Environment
Director of Data Science, Yale Program on Climate Change Communication
Lecturer, Department of Molecular, Cellular and Developmental Biology

Predicting public climate change beliefs, attitudes, policy support, and behavior

Two project options exist. The first is already underway and focuses on assessing the effect of extreme weather on individual and neighborhood-level climate change vulnerability and adaptation efforts. The second starts in January and includes a meta-analysis of hundreds of RCT climate change message experiments across the US and predictive modeling of the effectiveness of national climate change advertising campaigns, including behavioral outcomes.

Description for Project #1:

In the US, ideology and demographic factors are the primary determinants of individuals’ climate-change-related attitudes and behaviors; for example, Democrats are more worried than Republicans, and younger Americans are more worried than older Americans. Other factors, including personal experience and local environmental changes, have smaller but significant influences. Importantly, unlike ideology and demographics, these latter factors change over time and may provide windows of opportunity to encourage individual and collective adaptation and mitigation on climate change. In addition, such factors play a much larger role in determining attitudes and behaviors outside the US. This project aims to identify 1) how much variation in Americans’ climate attitudes is explained by geography and local environmental changes; 2) which climate hazards are perceived as most serious in different regions; 3) how perceptions compare with actual risks; and 4) how these gaps affect preparedness and adaptation behaviors for different subpopulations. Analysis will be based on a rich US survey dataset (N > 11,000) of georeferenced respondents coupled with census, economic, political, weather, geographic, and other data. Machine learning will be used to develop and validate multilevel regression and poststratification (MRP) models for different hazards and subpopulations. The models will identify individual-level and higher-spatial-level contextual predictors of risk and responses at multiple spatial scales. The MRP models will then be applied to create neighborhood-level vulnerability maps based on perception as well as traditional data (i.e., physical and sociodemographic factors). This project is focused on the US, but will have data from India and the UK for similar analyses.

Details for Project #2 are in development!
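
For readers new to the MRP models mentioned for Project #1, the poststratification step reduces to a population-weighted average of cell-level estimates. The sketch below shows only that step, under stated assumptions: the multilevel regression that would produce the cell estimates is not shown, the function name is hypothetical, and every number is a made-up placeholder rather than survey or census data.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell-level estimates.

    cell_estimates maps each demographic cell to a model-based estimate
    (for example, the predicted share of residents worried about a hazard);
    cell_counts maps the same cells to population counts for one area.
    """
    total = sum(cell_counts[cell] for cell in cell_estimates)
    weighted = sum(cell_estimates[cell] * cell_counts[cell]
                   for cell in cell_estimates)
    return weighted / total


# Made-up cells (age group, education) for one hypothetical neighborhood.
estimates = {
    ("18-44", "college"): 0.70,
    ("18-44", "no college"): 0.60,
    ("45+", "college"): 0.55,
    ("45+", "no college"): 0.45,
}
counts = {
    ("18-44", "college"): 1200,
    ("18-44", "no college"): 1800,
    ("45+", "college"): 900,
    ("45+", "no college"): 2100,
}
print(poststratify(estimates, counts))
```

Repeating the calculation with each neighborhood's own population counts is what turns one fitted model into a map of local estimates.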

Walter Jetz
Professor of Ecology & Evolutionary Biology, Forestry & Environmental Studies, Yale Center for Biodiversity and Global Change
Max Planck-Yale Center for Biodiversity Movement and Global Change
Presented and managed by Kevin Winner
Postdoc and Modeling Project Lead, Yale Center for Biodiversity and Global Change

Modeling the geographical and environmental distribution of all species on the planet at 1km^2

Much of modern ecology and conservation relies on accurate and reliable predictions of the current spatial distribution of species, the environmental “niche” of those species, and how both might change under potential climate change scenarios. On my team, and alongside our partners, we are well underway with the effort to produce these models (Species Distribution Models, SDMs) across all terrestrial plant and animal species in North America at 1km^2. Modeling species distributions at such a (relatively) fine resolution means that ecological processes such as dispersal limitations, biotic interactions, and others play a more significant role than in coarser-resolution models, so we need to develop more sophisticated statistical models capable of capturing these processes. At the same time, the broad scope of our study has introduced a wide array of computational challenges to making this entire effort tractable. Our team also uses these SDM products to quantify species’ vulnerability to climate change and habitat loss and to derive optimal conservation strategies jointly across all species and ecosystems. These analyses, as well as our primary SDMs, are also made available to our governmental partners and the public via the Yale Map of Life at

Emma Xiaolu Zang, Ph.D.
Assistant Professor of Sociology, Biostatistics (Secondary), and Global Affairs (Secondary)

Innovative Approaches to Modeling Migration During the Pandemic 

In this project, we will develop innovative approaches to modeling US domestic migration during the pandemic, using multiple data sources. These data sources include VIIRS night light data, cell phone data, Zillow housing price data, longitudinal survey data, USPS change-of-address data, and potentially a couple of others. Ideal candidates should have expertise and a strong interest in spatial analyses, data visualization, and panel analyses using R, Stata, ArcGIS, or Python. I will also introduce other datasets that our team holds for other projects, which can be explored if the candidate has an interesting idea.