Tutorial: Retrieval Augmented Generation with Citation (RAG+C)
As a final project for my Deep Learning (GRAD-E1394) course, I partnered with Kai Foerster and Amin Oueslati to create a tutorial on how to implement a Retrieval Augmented Generation with Citation (RAG+C) chatbot to address the challenge of knowledge management in government. In essence, RAG+C provides Large Language Models (LLMs) with additional contextual information sourced from an external database, which can improve response accuracy and reduce hallucinations, particularly on highly domain-specific topics.
As a demonstration, we develop a RAG+C chatbot to answer questions about the U.S. Federal Acquisition Regulation (FAR), the rule book for public procurement in the United States. We show you how to load and process FAR documents, store them in a database, and build your very own RAG+C pipeline. Lastly, this tutorial shows how to deploy models as a Conversational User Interface (CUI), in other words a chatbot, on a web server. You can access our “FAR-Chat” hosted on Hugging Face Spaces here.
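To make the retrieve-then-cite idea concrete, here is a minimal sketch. It is not the code from our tutorial: the passages are invented placeholders rather than real FAR text, the word-count cosine similarity stands in for whatever embedding search a real pipeline would use, and the names `retrieve` and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

# Invented placeholder passages; they are NOT quotations from the FAR.
PASSAGES = [
    {"id": "placeholder-1", "text": "Sealed bids are opened in public at the stated time."},
    {"id": "placeholder-2", "text": "Small purchases may use simplified procedures."},
    {"id": "placeholder-3", "text": "Contract records are kept in the contract file."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return up to k passages that share words with the question, best first."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(p["text"]))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question, hits):
    """Number each retrieved passage so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({p['id']}) {p['text']}" for i, p in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    question = "When are sealed bids opened?"
    print(build_prompt(question, retrieve(question, PASSAGES)))
```

The prompt string would then be sent to an LLM of your choice; because every source carries a number and an identifier, the citations in the answer can be traced back to the stored passages.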
➡️ See the code
EU Procurement Networks
As part of the Applied Network Analysis (GRAD-E1426) course at the Hertie School, I worked with fellow students Kai Foerster, Danial Riaz, and Lukas Lehmann to examine the network dynamics within EU public procurement markets. Our dataset consisted of 234 networks across 26 European countries from 2008 to 2016, focusing on a subset of countries and years. The networks comprised issuers or “buyers” (such as public hospitals and ministries) and winners or “suppliers” of public contracts (mainly private-sector firms). We explored network characteristics related to different levels of perceived government corruption.
To give a specific example, the network graph below depicts Slovakia’s public procurement market in 2014. It features issuers and winners as gray and white nodes, respectively. The connections between these nodes are colored based on the single bidding rate, a metric linked to government corruption. Red links represent a higher-than-average single bidding rate, suggesting potential corruption risks, while blue links indicate a rate below average. Notably, several clusters of red links shown in the graph signal heightened corruption risk in those contracts.
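The link coloring can be sketched in a few lines. This is an illustration under my own assumptions, not the project code: the contract records are made up, the function name `edge_colors` is hypothetical, and "average" is taken to mean the single bidding rate across all contracts in the market.

```python
from collections import defaultdict


def edge_colors(contracts):
    """Color each buyer-supplier link by its single bidding rate.

    contracts: iterable of (buyer, supplier, number_of_bids) tuples.
    A link is "red" when its share of single-bid contracts is above the
    market-wide share (an assumed definition of "average"), else "blue".
    """
    totals = defaultdict(int)   # contracts per link
    singles = defaultdict(int)  # single-bid contracts per link
    for buyer, supplier, n_bids in contracts:
        edge = (buyer, supplier)
        totals[edge] += 1
        if n_bids == 1:
            singles[edge] += 1

    if not totals:
        return {}

    market_rate = sum(singles.values()) / sum(totals.values())
    return {
        edge: "red" if singles[edge] / totals[edge] > market_rate else "blue"
        for edge in totals
    }


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        ("Hospital", "FirmA", 1),
        ("Hospital", "FirmA", 1),
        ("Ministry", "FirmB", 3),
        ("Ministry", "FirmA", 1),
        ("Ministry", "FirmA", 4),
    ]
    for edge, color in edge_colors(example).items():
        print(edge, color)
```

The resulting dictionary can be passed to a graph library as an edge attribute when drawing the network.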
➡️ See the code
News coverage of 2023 Israel-Hamas War
Following Hamas’s surprise October 7 attack on Israel and the subsequent Israeli military campaign in the Gaza Strip, the media’s response quickly became polarized. I collaborated with my classmates Kai Foerster, Danial Riaz, and Max Eckert to investigate (1) how news coverage of these events evolved over time and (2) the extent to which news coverage varied by news outlet. To answer these questions, we sourced 4,000+ articles from various news outlets, including the New York Times, Al Jazeera, and Die Welt, which were published in the weeks immediately preceding and following the October 7 attack. We employed topic modeling to identify topics associated with Israel, Hamas, and Palestine, as well as sentiment analysis to measure the extent to which the emotional valence of news coverage varied by group.
➡️ See the code
2022-23 Mpox Outbreak: Global Trends Report
I helped to build and maintain WHO’s 2022-23 Mpox (Monkeypox) Outbreak: Global Trends Report.
With over 86,000 total cases & 96 related deaths, #mpox outbreak has slowed down but is still ongoing in several countries.
— World Health Organization (WHO) (@WHO) February 21, 2023
Maintained surveillance & enhanced access to diagnostics, vaccines and treatments are key to stop the disease transmission globally https://t.co/mLHsVoWKBY pic.twitter.com/PcLFQjjrz7
Description of the First Global Outbreak of Mpox: An Analysis of Global Surveillance Data
Building on my mpox data analytics experience, I collaborated with WHO on a Lancet Global Health publication describing the epidemiological characteristics, demographic trends, and risk factors surrounding the 2022-23 mpox outbreak. Leveraging the most extensive dataset of mpox cases available, the paper provides invaluable data-driven insights into the dynamics of the outbreak.
Epidemiological and clinical characteristics, as well as risk factors for hospitalization, of #mpox cases during the 2022-23 multi-country outbreak, from the @WHO global surveillance system, published in @LancetGH.
— Ana Hoxha (@AnaHoxhaEpi) June 22, 2023
👏 & 🙏to countries for sharing their data. pic.twitter.com/OcBEA00K00
Mpox in Children and Adolescents during Multicountry Outbreak, 2022–2023
Continuing my work with WHO, I served as co-first author on a publication in the U.S. CDC’s Emerging Infectious Diseases Journal in which our team examined mpox cases among children and adolescents during the 2022-23 mpox outbreak. While the outbreak predominantly affected adult men, 1.3% of reported cases were in children and adolescents <18 years of age. Analysis of global surveillance data showed one hospital intensive care unit admission and zero deaths in that age group. Transmission routes and clinical manifestations varied across age subgroups.
Tutorial: Interactive Graphics with plotly
As part of the Introduction to Data Science (I2DS) Tools for Data Science Workshop hosted by the Hertie School on 4 November 2021, I partnered with Julian Kath to present on Interactive Graphics with plotly. High-quality data visualization is essential for effective communication of findings. plotly offers a straightforward way to build captivating graphics that will impress your audience. The goals of this tutorial are to:
- Equip participants with conceptual knowledge about the plotly package and data visualization workflow
- Demonstrate key capabilities of the package
- Provide participants with practice material and further resources
While the interactive part of the session was held live at the Hertie School, the online portion can be viewed here.
➡️ See the code
Detecting COVID-related Fake News with NLP
Alarmed by the amount of COVID-related fake news circulating through social media platforms, my classmates Hannah Schweren, Marco Schildt, and I set out to construct our own fake news detection algorithm. We used Patwa et al.’s Covid-19 Fake News dataset comprising 10,700 Covid-related social media posts labeled either “real” or “fake” to develop a competitive prediction model. We also examined the extent to which our fake news detection algorithm degraded over time due to the ever-evolving nature of fake news.
Here’s the blog post we prepared for the Hertie Data Science Lab detailing our project.
➡️ See the code
UK COVID Dashboard Disaggregated by Age
Together with fellow master’s students Kai Foerster and Dominik Cramer, I created a dashboard to track COVID case rates in England with the option to disaggregate by age group. We extracted our data using the API for the official UK government website for data and insights on COVID. To build our map, we downloaded boundary data for all 309 local authority districts from the Office for National Statistics’ web portal. As seen below, the user can observe COVID case rates for selected age groups during different time periods.
➡️ See the code
Validating COVID Self-Tests with Image Recognition
With the understanding that COVID rapid tests will continue to play a key role in the post-pandemic world, I once again found myself working alongside Kai Foerster and Dominik Cramer to develop a program capable of validating COVID self-tests using image recognition. We used data from the MNIST database of handwritten digits, a collection of 70,000 examples, to train our machine learning algorithm to recognize handwritten serial numbers and COVID test results.
➡️ See the code