ΑΙhub.org
 

Generating a biomedical knowledge graph question answering dataset


by
17 January 2025



share this:

By Xi Yan

The biomedical domain is a complex network of interconnected knowledge, encompassing genetics, diseases, drugs, and biological processes. While knowledge graphs (KGs) excel at organizing and linking this information, their complexity often makes them difficult for users to query. Ideally, users should be able to ask questions in natural language and receive precise answers directly from the KG, without needing specialized query expertise. However, enabling deep learning-based systems to query KGs using natural language remains a major challenge. Existing biomedical knowledge graph question answering (BioKGQA) datasets are small and limited in scope, typically containing only a few hundred question answering (QA) pairs. This scarcity of data hinders the development of robust and scalable QA systems, which are essential for critical applications such as clinical decision support, personalized medicine, and drug discovery.

PrimeKGQA addresses these challenges with a novel, scalable approach to dataset generation, harnessing the power of large language models (LLMs). Built on PrimeKG—a precision medicine-oriented knowledge graph that integrates data from 20 of the most-cited biomedical databases spanning ten biological scales, including genes, diseases, and drugs—PrimeKGQA leverages a generalizable, scalable, and training-free data generation framework. Using few-shot learning with LLMs, the framework transforms KG subgraphs (based on network motifs, see Figure 1) into SPARQL queries, which are subsequently converted into natural language question-answer pairs. The resulting PrimeKGQA dataset encompasses a wide array of biomedical concepts and reasoning complexities, ranging from straightforward factual queries to intricate multi-hop reasoning paths, providing a comprehensive resource for advancing biomedical question-answering systems.

network motifsFigure 1: All types of network motifs for graphs with node numbers from two to four. N3_1 stands for “node number 3 subgraph type 1”. Note that for 3-node-subgraphs, we discard N3_5, N3_6, N3_9, N3_10, N3_11, N3_12 and N3_13.

PrimeKGQA stands out not just for its size but also for its comprehensiveness. With 83,999 QA pairs, it is 1,000 times larger than the next largest BioKGQA dataset. The dataset includes questions generated from 2- to 4-node subgraphs, offering a balanced mix of simple and complex reasoning tasks. The questions are evaluated for linguistic correctness, semantic fidelity, and grammatical accuracy, ensuring a strong alignment with the biomedical KG facts it represents.

The creation of PrimeKGQA follows an innovative pipeline. 1. Subgraph Sampling: Subgraphs from PrimeKG are extracted based on network motifs, ranging from simple 2-node structures to more complex 4-node configurations. 2. SPARQL Validation: SPARQL queries are used to validate the answers extracted from the subgraphs, ensuring correctness of the answer. 3. Question Generation: Pre-trained language models like GPT3, Mistral, and LLaMA are prompted to generate natural language questions based on the subgraph structures and validated answers. A detailed pipeline could be found in figure 2.

Figure 2. Our pipeline for automatic generation of PrimeKGQA. The Pink blocks are the composing elements of the dataset, i.e., natural question, SPARQL, Correct Answer from the KG.

LLM-generated data is known to suffer from hallucination, which in our task is evaluated across three dimensions: grammaticality, consistency (whether the question and answer correspond), and coverage (whether the question and SPARQL query align). To assess this, we use both automatic and manual evaluation methods. Established benchmarks like BLEU, ROUGE, and METEOR are employed to measure linguistic quality, while LLM-based metrics like BERTScore ensure semantic alignment. Additionally, domain experts evaluate the samples for grammaticality, consistency, and coverage, providing human validation of the dataset’s overall quality.

While PrimeKGQA establishes a new benchmark for BioKGQA datasets, the work is far from complete. There are still opportunities for improvement, such as post-editing and correcting problematic questions. Additionally, no models have yet been tested on this dataset. Future work will focus on: refining the question generation process to capture more nuanced and exploratory queries, and using PrimeKGQA to benchmark existing QA systems to evaluate their effectiveness in real-world biomedical tasks.

Want to explore PrimeKGQA for yourself? The dataset and models are available on GitHub.

Read the work in full

Bridging the Gap: Generating a Comprehensive Biomedical Knowledge Graph Question Answering Dataset, Xi Yan, Patrick Westphal, Jan Seliger, and Ricardo Usbeck, ECAI 2024.


This research was presented at ECAI 2024.



tags: , ,


Xi Yan is a PhD student at Universität Hamburg
Xi Yan is a PhD student at Universität Hamburg

            AIhub is supported by:



Subscribe to AIhub newsletter on substack



Related posts :

What the Moltbook experiment is teaching us about AI

An experimental social media platform where only AI bots can post reveals surprising lessons about artificial intelligence behaviour and safety.

The malleable mind: context accumulation drives LLM’s belief drift

  09 Mar 2026
LLMs change their "beliefs" over time, depending on the data they are given.

RWDS Big Questions: how do we balance innovation and regulation in the world of AI?

  06 Mar 2026
The panel explores the tensions, trade-offs and practical realities facing policymakers and data scientists alike.

Studying multiplicity: an interview with Prakhar Ganesh

  05 Mar 2026
What is multiplicity, and what implications does it have for fairness, privacy and interpretability in real-world systems?

Top AI ethics and policy issues of 2025 and what to expect in 2026

, and   04 Mar 2026
In the latest issue of AI Matters, a publication of ACM SIGAI, Larry Medsker summarised the year in AI ethics and policy, and looked ahead to 2026.

The greatest risk of AI in higher education isn’t cheating – it’s the erosion of learning itself

  03 Mar 2026
Will AI hollow out the pipeline of students, researchers and faculty that is the basis of today’s universities?

Forthcoming machine learning and AI seminars: March 2026 edition

  02 Mar 2026
A list of free-to-attend AI-related seminars that are scheduled to take place between 2 March and 30 April 2026.
monthly digest

AIhub monthly digest: February 2026 – collective decision making, multi-modal learning, and governing the rise of interactive AI

  27 Feb 2026
Welcome to our monthly digest, where you can catch up with AI research, events and news from the month past.



AIhub is supported by:







Subscribe to AIhub newsletter on substack




 















©2026.02 - Association for the Understanding of Artificial Intelligence