Tutorial: Retrieval Augmented Generation with Citation (RAG+C)

As a final project for my Deep Learning (GRAD-E1394) course, I partnered with Kai Foerster and Amin Oueslati to create a tutorial on how to implement a Retrieval Augmented Generation with Citation (RAG+C) chatbot to address the challenge of knowledge management in government. In essence, RAG+C provides Large Language Models (LLMs) with additional contextual information sourced from an external database, which can significantly improve response accuracy and reduce hallucinations, particularly on highly domain-specific topics.

As a demonstration, we develop a RAG+C chatbot to answer questions about the U.S. Federal Acquisition Regulation (FAR), the rule book for public procurement in the United States. We show you how to load and process FAR documents, store them in a database, and build your very own RAG+C pipeline. Lastly, this tutorial shows how to deploy the model as a Conversational User Interface (CUI) (think chatbot) on a web server. You can access our “FAR-Chat” hosted on Hugging Face Spaces here.
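
To make the retrieve-then-cite idea concrete, here is a minimal sketch rather than the tutorial's actual pipeline. It assumes a simple bag-of-words cosine similarity in place of a real embedding model and vector database, and it stops at building the prompt instead of calling an LLM. The passages, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, paraphrased snippets for illustration only; not actual FAR text.
PASSAGES = [
    {"source": "Passage A", "text": "Contracting officers shall promote full and open competition when awarding contracts."},
    {"source": "Passage B", "text": "Small business set-asides reserve certain acquisitions for small business concerns."},
    {"source": "Passage C", "text": "Sealed bidding requires public opening of bids at a stated time."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(q, Counter(tokenize(p["text"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Assemble a prompt that asks the model to cite its numbered sources."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


question = "When is full and open competition required?"
top = retrieve(question, PASSAGES)
prompt = build_prompt(question, top)
print(prompt)
```

Numbering the retrieved passages in the prompt is what makes citation possible: the model can point to `[1]` or `[2]`, and the interface can map those markers back to the source documents.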

➡️ See the code

RAG+C Chatbot


EU Procurement Networks

As part of the Applied Network Analysis (GRAD-E1426) course at the Hertie School, I worked with fellow students Kai Foerster, Danial Riaz, and Lukas Lehmann to examine the network dynamics within EU public procurement markets. Our dataset consisted of 234 networks across 26 European countries from 2008 to 2016, and our analysis focused on a subset of countries and years. The networks comprised issuers or “buyers” (such as public hospitals and ministries) and winners or “suppliers” of public contracts (mainly private-sector firms). We explored network characteristics related to different levels of perceived government corruption.

To give a specific example, the network graph below depicts Slovakia’s public procurement market in 2014. It features issuers and winners as gray and white nodes, respectively. The connections between these nodes are colored based on the single bidding rate, a metric linked to government corruption. Red links represent a higher-than-average single bidding rate, suggesting potential corruption risks, while blue links indicate a rate below average. Notably, several clusters of red links in the graph signal heightened corruption risk in those contracts.
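
The coloring rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the contract records are made up, the single bidding rate is taken to be the share of contracts on a link that received exactly one bid, and "average" is assumed to mean the market-wide rate across all contracts.

```python
from collections import defaultdict

# Hypothetical contract records: (issuer, winner, number_of_bids_received).
contracts = [
    ("Ministry A", "Firm X", 1),
    ("Ministry A", "Firm X", 1),
    ("Ministry A", "Firm Y", 3),
    ("Hospital B", "Firm Y", 2),
    ("Hospital B", "Firm Z", 1),
    ("Hospital B", "Firm Z", 4),
]


def single_bidding_rates(records):
    """Share of contracts on each issuer-winner link that attracted exactly one bid."""
    totals = defaultdict(int)
    singles = defaultdict(int)
    for issuer, winner, n_bids in records:
        totals[(issuer, winner)] += 1
        if n_bids == 1:
            singles[(issuer, winner)] += 1
    return {edge: singles[edge] / totals[edge] for edge in totals}


def color_edges(records):
    """Color a link red if its rate exceeds the market-wide rate, otherwise blue."""
    market_rate = sum(1 for _, _, n in records if n == 1) / len(records)
    rates = single_bidding_rates(records)
    colors = {edge: ("red" if rate > market_rate else "blue") for edge, rate in rates.items()}
    return colors, market_rate


edge_colors, market_rate = color_edges(contracts)
for edge, color in edge_colors.items():
    print(edge, color)
```

The resulting dictionary could then be passed as edge attributes to whichever graph library draws the network.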

➡️ See the code

EU Procurement


News coverage of 2023 Israel-Hamas War

Following Hamas’s surprise October 7 attack on Israel and the subsequent Israeli military campaign in the Gaza Strip, the media’s response quickly became polarized. I collaborated with my classmates Kai Foerster, Danial Riaz, and Max Eckert to investigate (1) how news coverage of these events evolved over time and (2) the extent to which news coverage varied by news outlet. To answer these questions, we sourced 4,000+ articles from various news outlets, including the New York Times, Al Jazeera, and Die Welt, which were published in the weeks immediately preceding and following the October 7 attack. We employed topic modeling to identify topics associated with Israel, Hamas, and Palestine, as well as sentiment analysis to measure the extent to which the emotional valence of news coverage varied by group.

➡️ See the code

Sentiment analysis of news coverage of Israel-Hamas War


2022-23 Mpox Outbreak: Global Trends Report

I helped to build and maintain WHO’s 2022-23 Mpox (Monkeypox) Outbreak: Global Trends Report.


Description of the First Global Outbreak of Mpox: An Analysis of Global Surveillance Data

Building on my mpox data analytics experience, I collaborated with WHO on a Lancet Global Health publication describing the epidemiological characteristics, demographic trends, and risk factors surrounding the 2022-23 mpox outbreak. Leveraging the most extensive dataset of mpox cases available, the paper provides invaluable data-driven insights into the dynamics of the outbreak.


Mpox in Children and Adolescents during Multicountry Outbreak, 2022–2023

Continuing my work with WHO, I served as co-first author on a publication in the U.S. CDC’s Emerging Infectious Diseases Journal in which our team examined mpox cases among children and adolescents during the 2022-23 mpox outbreak. While the outbreak predominantly affected adult men, 1.3% of reported cases were in children and adolescents <18 years of age. Analysis of global surveillance data showed one hospital intensive care unit admission and zero deaths in that age group. Transmission routes and clinical manifestations varied across age subgroups.

Mpox in Children and Adolescents during Multicountry Outbreak


Tutorial: Interactive Graphics with plotly

As part of the Introduction to Data Science (I2DS) Tools for Data Science Workshop hosted by the Hertie School on 4 November 2021, I partnered with Julian Kath to present on Interactive Graphics with plotly. High-quality data visualization is essential for effective communication of findings. plotly offers a straightforward way to build captivating graphics that will impress your audience. The goals of this tutorial are to:

  1. Equip participants with conceptual knowledge about the plotly package and data visualization workflow
  2. Demonstrate key capabilities of the package
  3. Provide participants with practice material and further resources

While the interactive part of the session was held live at the Hertie School, the online portion can be viewed here.

➡️ See the code

Interactive Graphics with `plotly`


Detecting COVID-related Fake News with NLP

Alarmed by the amount of COVID-related fake news circulating through social media platforms, my classmates Hannah Schweren, Marco Schildt, and I set out to construct our own fake news detection algorithm. We used Patwa et al.’s Covid-19 Fake News dataset comprising 10,700 Covid-related social media posts labeled either “real” or “fake” to develop a competitive prediction model. We also examined the extent to which our fake news detection algorithm degraded over time due to the ever-evolving nature of fake news.
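
One simple way to look at degradation over time is to score a fixed, already-trained model on posts grouped by the period in which they were published. The sketch below assumes that approach; the records are invented for illustration and are not results from our project, and `accuracy_by_period` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical evaluation records: (month_posted, true_label, predicted_label).
# These values are made up and do not reflect the project's findings.
predictions = [
    ("2020-08", "fake", "fake"),
    ("2020-08", "real", "real"),
    ("2020-08", "fake", "fake"),
    ("2021-02", "fake", "real"),
    ("2021-02", "real", "real"),
    ("2021-08", "fake", "real"),
    ("2021-08", "real", "fake"),
    ("2021-08", "real", "real"),
    ("2021-08", "fake", "fake"),
]


def accuracy_by_period(rows):
    """Share of correct predictions within each period, in chronological order."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for period, truth, predicted in rows:
        total[period] += 1
        correct[period] += int(truth == predicted)
    return {period: correct[period] / total[period] for period in sorted(total)}


period_accuracy = accuracy_by_period(predictions)
for period, accuracy in period_accuracy.items():
    print(period, f"{accuracy:.2f}")
```

A downward trend across periods would suggest that the vocabulary and themes of fake news have drifted away from what the model saw during training.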

Here’s the blog post we prepared for the Hertie Data Science Lab detailing our project.

➡️ See the code

Detecting COVID-related Fake News


UK COVID Dashboard Disaggregated by Age

Together with fellow master’s students Kai Foerster and Dominik Cramer, I created a dashboard to track COVID case rates in England with the option to disaggregate by age group. We extracted our data using the API of the official UK government website for data and insights on COVID. To build our map, we downloaded boundary data for all 309 local authority districts from the Office for National Statistics’ web portal. As seen below, the user can observe COVID case rates for selected age groups during different time periods.
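
For readers curious about the age disaggregation, here is a rough sketch of how a pooled case rate for selected age groups could be computed. The record layout, field names, and numbers are assumptions made for illustration and do not mirror the actual API response or our dashboard code.

```python
# Hypothetical records for one local authority district and one week.
records = [
    {"age_group": "00_04", "cases": 12, "population": 6000},
    {"age_group": "05_09", "cases": 30, "population": 6500},
    {"age_group": "10_14", "cases": 45, "population": 7500},
    {"age_group": "60_64", "cases": 9, "population": 5000},
]


def case_rate_per_100k(rows, selected_age_groups):
    """Cases per 100,000 residents, pooled over the selected age groups."""
    chosen = [r for r in rows if r["age_group"] in selected_age_groups]
    cases = sum(r["cases"] for r in chosen)
    population = sum(r["population"] for r in chosen)
    if population == 0:
        raise ValueError("no population for the selected age groups")
    return 100_000 * cases / population


school_age_rate = case_rate_per_100k(records, {"05_09", "10_14"})
print(round(school_age_rate, 1))
```

Pooling cases and population before dividing, rather than averaging per-group rates, keeps larger age groups from being underweighted.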

➡️ See the code

UK COVID Dashboard


Validating COVID Self-Tests with Image Recognition

With the understanding that COVID rapid tests will continue to play a key role in the post-pandemic world, I once again found myself working alongside Kai Foerster and Dominik Cramer to develop a program capable of validating COVID self-tests using image recognition. We used data from the MNIST database of handwritten digits, a collection of 70,000 examples, to train our machine learning algorithm to recognize handwritten serial numbers and COVID test results.

➡️ See the code

Validating COVID Test Results
