What is Dense Passage Retrieval?
Dense Passage Retrieval (DPR) is a technique for open-domain question answering (ODQA) that retrieves relevant passages from a large corpus of unstructured text. Unlike traditional information retrieval (IR) techniques, which rely on sparse representations, DPR uses dense vector representations learned by deep neural networks to encode both text passages and questions.
The basic idea of DPR is to precompute dense vector representations of the passages and store them in a search index. Given a user's query, DPR encodes it with a neural question encoder and retrieves the most relevant passages based on the similarity between their vectors and the query vector. Once the relevant passages are retrieved, a downstream model can extract the answer to the question.
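As an illustration, here is a minimal sketch of this bi-encoder setup using the pretrained DPR checkpoints published on Hugging Face (facebook/dpr-ctx_encoder-single-nq-base and facebook/dpr-question_encoder-single-nq-base); the toy passages and question are invented for the example:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Load the pretrained DPR bi-encoder (trained on Natural Questions).
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passages = [
    "The Eiffel Tower is located in Paris, France.",
    "The Great Wall of China is thousands of kilometres long.",
]

with torch.no_grad():
    # Passage vectors are computed once, offline, and stored in an index.
    p_inputs = ctx_tok(passages, padding=True, truncation=True, return_tensors="pt")
    p_vecs = ctx_enc(**p_inputs).pooler_output        # shape: (num_passages, 768)

    # Only the question is encoded at query time.
    q_inputs = q_tok("Where is the Eiffel Tower?", return_tensors="pt")
    q_vec = q_enc(**q_inputs).pooler_output           # shape: (1, 768)

# DPR scores passages by the dot product between question and passage vectors.
scores = torch.matmul(q_vec, p_vecs.T).squeeze(0)
print(passages[scores.argmax().item()])
```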
Why DPR?
DPR has several advantages over traditional IR techniques for ODQA. First, dense vectors capture refined, contextualized semantic information, so DPR can match a question to a passage even when the two share few exact words, which leads to more precise retrieval. Second, by precomputing the passage representations and storing them in an index, DPR can achieve faster retrieval times than techniques that compute similarity on the fly. Beyond question answering, DPR can be used to improve a variety of text-based applications.
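The indexing step can be sketched with FAISS, the similarity-search library the original DPR work itself relies on; in this sketch, random vectors stand in for passage embeddings produced offline by the context encoder:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 768                 # DPR embeddings are 768-dimensional
num_passages = 10_000

# Stand-in for passage vectors precomputed by the context encoder.
passage_vecs = np.random.rand(num_passages, dim).astype("float32")

# A flat inner-product index; DPR uses dot-product similarity.
index = faiss.IndexFlatIP(dim)
index.add(passage_vecs)

# At query time, retrieval reduces to a fast k-nearest-neighbour search.
query_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query_vec, 5)
print(ids[0])             # indices of the 5 most similar passages
```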
What is Open Domain Question Answering?
Open Domain Question Answering (ODQA) is a language task in which a model is asked to produce answers to factoid questions in natural language. The true answer is objective, so it is straightforward to evaluate model performance. The "open domain" part refers to the fact that no relevant context is provided for the arbitrarily asked factual question.
What is an ODQA model?
An ODQA model may work with or without access to an external source of knowledge. In the ODQA task, questions can be about nearly anything and rely on world knowledge. The challenge is that the context containing the information relevant to the question is not provided. This is in contrast to the standard reading comprehension task, in which a passage containing the answer span is supplied along with the question.
How do you build retrievers for question-answering?
Given a factoid question, a language model that has no context and is not large enough to have memorized the relevant facts from its training data is unlikely to guess the correct answer. In an open-book exam, students are allowed to refer to external resources like notes and books while answering test questions. Similarly, an ODQA system can be paired with a rich knowledge base and identify relevant documents as evidence for its answers. We can decompose the process of finding answers to given questions into two stages (sketched in code after the figure):
- Find the context in an external repository of knowledge;
- Process the retrieved context to extract an answer.
Fig. 2. The retriever-reader QA framework combines information retrieval with machine reading comprehension.
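Here is a minimal sketch of the two stages. A simple TF-IDF retriever stands in for stage one (a DPR bi-encoder would slot into the same place), and an extractive reader from Hugging Face (the deepset/roberta-base-squad2 checkpoint) handles stage two; the two-document corpus is invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

corpus = [
    "Paris is the capital and most populous city of France.",
    "Mount Everest is Earth's highest mountain above sea level.",
]
question = "What is the capital of France?"

# Stage 1: find the most relevant context in the external repository.
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)
q_vec = vectorizer.transform([question])
best_doc = corpus[cosine_similarity(q_vec, doc_vecs).argmax()]

# Stage 2: read the retrieved context and extract the answer span.
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = reader(question=question, context=best_doc)
print(result["answer"])   # expected: "Paris"
```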
Such a retriever + reader framework was first proposed in DrQA ("Document Retriever Question-Answering", Chen et al., 2017). The retriever and the reader components can be set up and trained independently, or trained jointly from end to end (see the loss sketch below).
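When the retriever is trained, the DPR paper optimizes an in-batch-negative objective: each question is scored against every passage in the batch, and its own positive passage must win. A minimal PyTorch sketch, with random tensors standing in for the encoder outputs:

```python
import torch
import torch.nn.functional as F

batch_size, dim = 8, 768
# Stand-ins for outputs of the question and context encoders.
q_vecs = torch.randn(batch_size, dim, requires_grad=True)   # question embeddings
p_vecs = torch.randn(batch_size, dim, requires_grad=True)   # positive passage embeddings

# Row i of the score matrix compares question i to every passage in the batch;
# the diagonal entries are the positives, the rest act as negatives.
scores = q_vecs @ p_vecs.T                 # (B, B) dot-product similarities
targets = torch.arange(batch_size)         # positive passage index per question
loss = F.cross_entropy(scores, targets)    # negative log-likelihood of positives
loss.backward()                            # gradients flow into both encoders
```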
Datasets to train your DPR models:
- Wikipedia: It is an online encyclopedia with articles on a wide range of topics. Many DPR models are trained on subsets of Wikipedia articles or on the entire corpus.
- Natural Questions: It is a dataset of real user questions issued to Google Search, paired with annotated answers. It is widely used for evaluating the effectiveness of question-answering systems.
- MS MARCO: The Microsoft Machine Reading Comprehension (MS MARCO) dataset is a large collection of human-generated queries and relevant passages. It was created as part of a research effort to advance the state of the art in passage retrieval.
Dataset link: https://microsoft.github.io/msmarco/Datasets.html
- TREC: The Text Retrieval Conference (TREC) is an annual event that includes a series of information retrieval tasks. The TREC datasets include collections of news articles and other documents, along with queries and relevance judgments.
GitHub link: https://github.com/microsoft/msmarco/blob/master/TREC-Deep-Learning-2021.md
- OpenWebText: It is a dataset consisting of approximately 40GB of text from web pages crawled in 2019. It has been used to pretrain large language models.
Papers with Code link: https://paperswithcode.com/dataset/openwebtext
- Aristo: This dataset is a collection of science exam questions and their answers. It is commonly used for evaluating the ability of question-answering systems to reason about scientific concepts.
Papers with Code link: https://paperswithcode.com/dataset/aristo-v4
Launch an A100 80GB Cloud GPU on E2E Cloud to train your DPR model for open-domain question answering:
- Log in to MyAccount.
- Go to Compute > GPU > NVIDIA A100 80GB.
- Click on “Create” and choose your plan.
- Choose your required security, backup, and network settings and click on “Create My Node”.
- The launched plan will appear on your dashboard once it starts running.
After launching the A100 80GB Cloud GPU from the MyAccount portal, you can deploy any DPR model for open-domain question answering.
E2E Networks is a leading accelerated cloud computing provider that offers the latest Cloud GPUs at great value. Connect with us at sales@e2enetworks.com.
Request a free trial here: https://zfrmz.com/LK5ufirMPLiJBmVlSRml