Introduction

This is the second post for RAG Time, a 7-part educational series on retrieval-augmented generation (RAG). Read the first post of this series and access all videos and resources in our GitHub repo.

Journey 2 covers indexing and retrieval techniques for RAG:

- Data ingestion approaches: use Azure AI Search to upload, extract, and process documents using Azure Blob Storage, Document Intelligence, and integrated vectorization.
- Keyword and vector search: compare traditional keyword matching with vector search.
- Hybrid search: apply keyword and vector search techniques together with Reciprocal Rank Fusion (RRF) for better-quality results across more use cases.
- Semantic ranker and query rewriting: see how reordering results using semantic scoring and enhancing queries through rewriting can dramatically improve relevance.

Data Pipeline

What is data ingestion?

When building a RAG framework, the first step is getting your data into the retrieval system and processed so that it's primed for the LLM to understand. The following sections cover the fundamentals of data ingestion; a future RAG Time post will cover more advanced topics.

Integrated Vectorization

Azure AI Search offers integrated vectorization, a built-in feature that automatically converts your ingested text (and even images) into vectors using models such as OpenAI's text-embedding-3-large, or custom models of your own. This transformation means that every document, and every segment of it, is immediately prepared for semantic analysis, with the entire process tied into your ingestion pipeline. No manual intervention is required, which means fewer bottlenecks and a more streamlined workflow.

Parsing documents

The first step of the data ingestion process is uploading your documents from various sources, whether that's Azure Blob Storage, Azure Data Lake Storage Gen2, or OneLake.
Once the data is in the cloud, services such as Azure Document Intelligence and Azure Content Understanding extract the useful information: text, tables, structural details, and even images embedded in your PDFs, Office documents, JSON files, and more. In addition, Azure AI Search supports change tracking, so your documents stay up to date without extra effort.

Chunking Documents

A critical component of integrated vectorization is chunking. Most language models have a limited context window, which means feeding in too much unstructured text can dilute the quality of your results. By splitting larger documents into smaller, manageable chunks based on sentence boundaries or token counts, while allowing overlaps to preserve context, you ensure that key details aren't lost. Overlap is especially important for maintaining continuity of thought, such as preserving table headers or the transition between paragraphs, which in turn boosts retrieval accuracy and overall performance. With integrated vectorization, you lay a solid foundation for a highly effective RAG system that not only understands your data but leverages it to deliver precise, context-rich search results.

Retrieval Strategies

Here are some common, foundational search strategies used in retrieval systems.

Keyword Search

Traditional keyword search is the foundation of many search systems. It works by creating an inverted index: a mapping of each term to the documents in which it appears. For instance, imagine you have a collection of documents about fruits. A simple keyword search might count the occurrences of words like "apple," "orange," or "banana" to determine the relevance of each document.
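To make the idea concrete, here is a toy sketch of an inverted index with occurrence counting. This is purely illustrative and is not how Azure AI Search implements its index internally (real engines add tokenization, stemming, and scoring such as BM25):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: occurrence_count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def keyword_search(index, query):
    """Score documents by total occurrences of the query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "apple orange apple",
    "d2": "banana orange",
    "d3": "apple banana banana",
}
index = build_inverted_index(docs)
print(keyword_search(index, "apple banana"))  # d3 scores highest (3 hits)
```

Because the index maps terms directly to documents, a query only touches the postings for its own terms instead of scanning every document.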
This approach is particularly effective when you need literal matches, such as pinpointing a flight number or a specific code where precision is crucial. Even as newer search technologies emerge, keyword search remains a robust baseline: it efficiently matches exact terms found in text, ensuring that when specific information is needed, the results are both fast and accurate.

Vector Search

While keyword search provides exact matches, it may not capture the full context or nuanced meaning behind a query. This is where vector search shines. In vector search, both queries and document chunks are transformed into high-dimensional embeddings using models like OpenAI's text-embedding-3-large. These embeddings capture the semantic essence of words and phrases as multi-dimensional vectors. Once everything is converted into vectors, the system performs a k-nearest-neighbor search using cosine similarity. This allows the search engine to find documents that are contextually similar, even if they don't share exact keywords. For example, in our demo code, a query like "what is Contoso?" returned not only literal matches but also contextually related documents, demonstrating semantic understanding of the subject.

In summary, combining keyword search with vector search in your RAG system pairs the precision of text-based matching with the nuanced insight of semantic search. This dual approach ensures that users receive both exact answers and related information that enhances the overall retrieval experience.

Hybrid Search

Hybrid search blends the precision of keyword search with the nuanced, context-aware capabilities of vector search, leveraging the strengths of both strategies. On one hand, keyword search excels at delivering exact matches, which is critical when you're looking for precise information like flight numbers, product codes, or specific numerical data.
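The k-nearest-neighbor search with cosine similarity described above can be sketched in a few lines. The 3-dimensional vectors here are made-up toy values; real embeddings come from a model such as text-embedding-3-large and have thousands of dimensions, and production systems use approximate nearest-neighbor indexes rather than this exhaustive scan:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn_search(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs closest to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]

# Toy "embeddings": the two Contoso documents point in a similar
# direction, the fruit document in a different one.
doc_vecs = {
    "about-contoso": [0.9, 0.1, 0.0],
    "contoso-history": [0.8, 0.3, 0.1],
    "fruit-recipes": [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.2, 0.05]  # imagined embedding of "what is Contoso?"
print(knn_search(query_vec, doc_vecs, k=2))
```

Note that the fruit document is excluded even though no keywords were compared at all; proximity in embedding space stands in for topical similarity.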
On the other hand, vector search digs deeper by transforming your queries and documents into embeddings, allowing the system to interpret the underlying semantics of the content. By combining the two, hybrid search ensures that both literal and contextually similar results are retrieved.

Reciprocal Rank Fusion (RRF) is the technique used to merge the results from the keyword and vector searches into one cohesive set. Essentially, it reorders and integrates the result lists from each method, amplifying the highest-quality matches from both sides. The outcome is a single ranked list where the most relevant document chunks are prioritized. By incorporating hybrid search into your retrieval system, you get the best of both worlds: the precision of keyword matching alongside the semantic depth of vector search, all working together to deliver an optimal search experience.

Reranking

Reranking is a post-retrieval step that sorts the retrieved documents so the most relevant appear first. Semantic ranker in Azure AI Search uses a cross-encoder model to re-score every retrieved document on a normalized scale from 0 to 4, reflecting how well the document semantically matches the query. You can use this score to set a minimum threshold that filters out low-quality or "noisy" documents, ensuring that only the best passages are sent along for further processing. The re-ranking model is trained on data commonly seen in RAG applications, across multiple industries, languages, and data types.

Query transformations

Sometimes a user's original query is imprecise or too narrow, which can cause relevant content to be missed. Pre-retrieval, you can transform, augment, or modify the search query to improve recall. Query rewriting in Azure AI Search is a pre-retrieval feature that transforms the initial search query into alternative expressions. For example, a question like "What underwater activities can I do in the Bahamas?"
might be rephrased as "water sports available in the Bahamas" or "snorkeling and diving in the Bahamas." This expansion creates additional candidate queries that help surface documents the original wording would have missed. By optimizing across the entire query pipeline, not just the retrieval phase, you have more tools to deliver relevant information to the language model. Azure AI Search makes it possible to fine-tune the retrieval process, filtering out noise and capturing a wider range of relevant content, even when the initial query isn't perfect.

Continue your RAG Journey: Wrapping Up & Looking Ahead

Let's take a moment to recap the journey you've embarked on today. We started with the fundamentals of data ingestion, where you learned how to use integrated vectorization to extract valuable information. Next, we moved into search strategies, comparing keyword search, which offers literal matching ideal for precise codes or flight details, with the more dynamic vector search, which captures the subtle nuances of language through semantic matching. Combining these methods with hybrid search, and using Reciprocal Rank Fusion to merge results, provides a balanced approach: the best of both worlds in one robust retrieval system. To further refine your results, we looked at the semantic ranker, which re-scores and reorders documents based on their semantic fit with your query, and query rewriting, which transforms your original query into alternative formulations to catch every potential match. These enhancements ensure that your pipeline isn't just comprehensive; it's designed to deliver only the most relevant, high-quality content. Now that you've seen how each component works together in a state-of-the-art RAG system, it's time to take the next step in your journey. Explore our repository for full code samples and detailed documentation.
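As a final recap of the fusion step: RRF gives a document a score of 1/(k + rank) for each result list it appears in and sums those scores, so documents ranked well by both keyword and vector search rise to the top. The sketch below uses k=60, a commonly cited smoothing constant; the exact constant and tie-breaking behavior inside Azure AI Search may differ:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["d1", "d3", "d2"]  # ranked by keyword search
vector_results = ["d3", "d4", "d1"]   # ranked by vector search
print(reciprocal_rank_fusion([keyword_results, vector_results]))
```

Here d3 wins: appearing near the top of both lists beats d1's single first-place finish, which is exactly the "agreement between retrievers" signal hybrid search exploits.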
And don't miss future RAG Time sessions, where we continue to share the latest best practices and innovations in retrieval-augmented generation. Getting started with RAG on Azure AI Search has never been simpler, and your journey toward building even more effective retrieval systems is just beginning. Embrace the next chapter and continue to innovate!

Next Steps

Ready to explore further? Check out these resources, all available in our centralized GitHub repo:

- Watch Journey 2
- RAG Time GitHub Repo (hands-on notebooks, documentation, and detailed guides to kick-start your RAG journey)
- Azure AI Search Documentation
- Azure AI Foundry

Have questions, thoughts, or want to share how you're using RAG in your projects? Drop us a comment below or open a discussion in our GitHub repo. Your feedback shapes our future content!