RAG Isn’t Plug-and-Play

8 Mins
21.7.2025

In a previous article, we explained what a Retrieval-Augmented Generation (RAG) system is and how it works in plain terms. If you’re new to RAG, you might start there for the fundamentals. In this piece, we’ll assume you know the basics and focus on practical tips for actually building a useful RAG system. Each section below covers an important factor to keep in mind as you implement RAG in your organization.

Hallucinations Can Still Happen

RAG is great at grounding AI answers in your own data, but it doesn’t completely eliminate AI “hallucinations.” An AI hallucination is when the model confidently generates information that isn’t true or isn’t actually in the provided sources. 

Even with a RAG setup, the underlying large language model (LLM) is still doing the talking. It takes your question along with the retrieved documents as context and then tries to formulate a response. If the context doesn’t perfectly answer the question, the LLM may fill in gaps by guessing, sometimes producing a plausible-sounding but incorrect statement. 

Why does this happen if the AI has your documents at hand? One reason is that the model might misinterpret the context. It could combine bits of information from the documents in an incorrect way. 

For example, if two retrieved documents have related but not identical facts, the AI might blend them together and produce an answer that neither document explicitly supports. Additionally, if the retrieved text is complex or ambiguous, the AI could confidently assert something that isn’t actually confirmed by those sources. 

It’s important to stay vigilant and review outputs from your RAG system, especially in high-stakes applications. RAG significantly reduces the risk of wild, off-base answers compared to an unaugmented model, but it’s not 100% hallucination-proof. 

You can further minimize hallucinations by configuring the generation settings (more on that later) to make the AI stick closer to the source material. In practice, this means keeping the AI’s creativity in check, for example by using more conservative sampling settings (such as a lower temperature or top_p) so it doesn’t stray beyond the given information.

The bottom line: always treat the AI’s answers as suggestions, not absolute truth, and make sure there’s a source document backing up any important facts or figures it provides.

Gathering Source Documents & References

One of the biggest advantages of a RAG system is that it can point to where an answer is coming from. As you build your RAG, make sure you gather all relevant source documents and enable the system to cite or list its references. 

This practice isn’t just for end-user transparency (though that’s a huge benefit); it also helps you as the builder. Seeing which documents and snippets the AI pulled in as context for a given answer lets you trace the data used and troubleshoot issues in your RAG pipeline. 

Imagine you ask your RAG-powered assistant a question about a policy, and it retrieves a paragraph from an old guidelines document. If the answer seems off, you can look at that source snippet and immediately start diagnosing: Was the policy outdated in that document? Should the system have retrieved a different file? Is the search query or embedding favoring the wrong keywords? 

By logging and reviewing the source references for each answer, you gain insight into how well each component is working, from document chunking to the relevance of the search results.

As you gather and prepare these documents, it’s also useful to attach metadata to each chunk. Metadata is extra information that tells you more about where the content came from, such as the document title, author, source system, or creation date. This is especially important if you're pulling in online content, since you can store the original URL as part of the metadata. That way, when a chunk is retrieved, the system can include not just the answer but also a direct reference back to the source. This supports traceability, builds trust, and allows users or developers to review the full context when needed. Metadata can also support smarter filtering during retrieval, such as prioritizing newer content or excluding certain document types.
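
To make that concrete, here is a minimal sketch of one chunk with metadata attached. The field names, URL, and values are made-up examples; most vector databases and RAG frameworks let you store an arbitrary metadata dictionary alongside each chunk, so adapt the schema to your own toolkit.

```python
# A minimal sketch of one chunk with attached metadata.
# Field names and the URL are illustrative; adapt them to your vector store's schema.
chunk = {
    "text": (
        "Annual subscribers can request a full refund within 30 days "
        "of renewal by contacting support."
    ),
    "metadata": {
        "title": "Customer Refund Policy",
        "source_url": "https://example.com/policies/refunds",  # placeholder URL
        "author": "Finance Team",
        "created": "2025-03-01",
        "doc_type": "policy",
    },
}

# When this chunk is retrieved, the metadata lets the system show a reference
# next to the answer and filter by fields such as doc_type or creation date.
print(chunk["metadata"]["title"], "-", chunk["metadata"]["source_url"])
```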

Gathering source documents also means curating the right data upfront. Make sure the content you feed into the RAG system is authoritative and up-to-date for your domain (e.g. the latest product FAQs, current pricing sheets, updated HR policies, etc.). 

Then, during retrieval, always have the system return the document name or ID alongside the answer. This way, every answer comes with a breadcrumb trail. Not only does this build trust with users (“the assistant isn’t just making this up – it came from here”), but it also makes maintenance easier. If something goes wrong, you can pinpoint whether the fault lay in the retrieval step or the generation step by examining those references. 
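
As a rough illustration, the sketch below turns the metadata of retrieved chunks into a visible list of sources next to the answer. The chunks, IDs, and answer text are placeholders for whatever your pipeline actually returns.

```python
# A sketch of attaching a breadcrumb trail of sources to an answer.
# `retrieved_chunks` and `answer` stand in for what your pipeline returns.
retrieved_chunks = [
    {"text": "Refunds are available within 30 days of renewal.",
     "metadata": {"title": "Customer Refund Policy", "doc_id": "DOC-042"}},
    {"text": "Support handles refund requests via the billing portal.",
     "metadata": {"title": "Support Handbook", "doc_id": "DOC-108"}},
]
answer = "Annual subscribers can get a refund within 30 days of renewal."

# List each source once so users (and you, when debugging) can trace the answer.
sources = {c["metadata"]["doc_id"]: c["metadata"]["title"] for c in retrieved_chunks}
print(answer)
print("Sources: " + "; ".join(f"{title} ({doc_id})" for doc_id, title in sources.items()))
```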

In summary: keep your source data organized, and have your RAG system show its work. It will pay off in both user confidence and easier debugging.

Document Splitting (Chunking Strategies)

The first technical step in setting up a RAG system is breaking your documents into smaller pieces, called chunks. This process, known as document splitting or chunking, prepares the content for embedding and retrieval.

When dividing up your documents, it’s important to know that one size does not fit all. Different types of content require different chunking strategies. It’s worth taking time to review your source materials and think through how to split them before creating embeddings. The goal is to find a sweet spot where each chunk is large enough to be meaningful but small enough to stay specific.

Start by considering the structure of each document type. For a long text document (like a report or manual), you might split it by sections or paragraphs. Look at the headings and subheadings: if each section covers a distinct subtopic, you’d want those in separate chunks. Ask yourself, “If this section were given to the AI as context, would it be self-contained and understandable?” 

If a section is too lengthy or covers multiple ideas, consider splitting it further, perhaps into two chunks with a slight overlap in content. (Overlapping a few sentences between chunks can help preserve continuity of thought, so the AI doesn’t lose context at chunk boundaries.) 

On the other hand, if chunks are too short (for instance, single sentences cut out of context), the AI might not have enough information to work with and could misinterpret the fragment. Always imagine feeding a chunk to the AI on its own. Would it make sense and be helpful in answering a question? If not, adjust your chunk size or boundaries.
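
For illustration, here is a minimal sketch of splitting text into fixed-size chunks with a small overlap between them. The sizes are arbitrary examples, and in practice you would likely use your framework’s text splitter (for instance, one that respects paragraph or section boundaries), but the underlying idea is the same.

```python
def split_with_overlap(text: str, max_chars: int = 800, overlap_chars: int = 100) -> list[str]:
    """Split text into chunks of roughly max_chars, overlapping by overlap_chars."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step forward, but re-include the last overlap_chars characters so
        # ideas that straddle a boundary appear in both neighboring chunks.
        start = end - overlap_chars
    return chunks

sample = "This is a stand-in for a long report section. " * 80
print(len(split_with_overlap(sample)))  # number of chunks produced
```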

Now, different data formats need tailored treatment. Structured data, like spreadsheets or databases, should be chunked along their logical units. For example, if you’re incorporating an Excel file of financial data or product specs, you might treat each row as a chunk (since each row might represent a record or entry). In some cases, even a single cell could be a chunk if it contains a standalone piece of information (for instance, a cell with a specific metric or definition). 

The key is that each chunk, whether it’s a row, a table cell, a paragraph, or a Q&A pair from an FAQ, should represent a coherent thought or answer. For an email thread, you might chunk by individual message; for a slide deck, perhaps one chunk per slide. And for a long policy PDF, maybe chunk by subsection or bullet list. 
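
As an example of chunking structured data, the sketch below treats each row of a (hypothetical) spreadsheet export as its own chunk. The columns and values are made up for illustration; the point is that each row becomes a small, self-contained statement with its own metadata.

```python
import csv
import io

# Hypothetical spreadsheet export; the columns and values are made up.
raw_csv = """id,product,max_load_kg,warranty_years
A100,Compact Hoist,250,2
B200,Heavy Hoist,1000,5
"""

chunks = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    # Render the row as a small, self-contained statement so the chunk
    # still makes sense when handed to the LLM on its own.
    text = ", ".join(f"{column}: {value}" for column, value in row.items())
    chunks.append({
        "text": text,
        "metadata": {"source": "product_specs.xlsx", "row_id": row["id"]},
    })

print(chunks[0]["text"])  # -> "id: A100, product: Compact Hoist, ..."
```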

It’s also wise to use different chunk overlaps or sizes for different content. Technical documentation might need larger overlapping chunks to ensure context isn’t lost, whereas a collection of short customer reviews might be fine with no overlap and smaller chunks. 

In practice, you may need to experiment: try splitting one way, see if the retrieval finds relevant chunks when answering sample questions (more on testing next), and adjust if needed. The takeaway: don’t chunk documents blindly or uniformly. Use the structure inherent in the content as a guide, and remember that the end goal is to feed the LLM with chunks that are as helpful and specific as possible to answer user queries.

Benchmarking Your RAG System

Building a RAG system isn’t a “set and forget” task. You’ll want to test it and tune it to make sure it’s actually delivering the results you expect. One effective approach is to set up benchmark evaluations for your RAG’s retrieval component. 

In plain terms, this means creating a little test suite of queries (questions) for which you already know the correct answers and which document should have the answer. By running these test queries on your system, you can see if the RAG retrieves the right information and responds correctly. This kind of benchmark acts as an early warning system: if a change you make (say, adjusting chunk size, updating the vector database, or changing the search queries used by the agent) causes the system to start missing the right documents, you’ll catch it before your users do.

To get started, you might take a handful of real-world questions your team cares about. For each question, identify the “golden” source: the document or section of text that contains the answer you’d want the RAG to use. For example, if one question is “What’s our refund policy for annual subscribers?”, you’d note that Document X, section 4.2 is the ideal source. Now you have a set of Q&A pairs (or rather, Q & source-reference pairs) to test against. Run these questions through your RAG setup and see what happens: Did it retrieve the correct section from Document X? Did it maybe bring back some irrelevant snippets? Did the final answer actually use the information from the source correctly? 

By systematically checking these, you can measure the retrieval performance. If the system isn’t fetching the known relevant chunk as one of the top results, that’s a sign you might need to adjust something: perhaps the embedding relevance threshold, the way the query is formulated, or the data itself (maybe the document needs re-chunking or metadata tagging). 
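
A retrieval benchmark doesn’t need to be elaborate; a sketch like the one below already gives you a hit rate you can track over time. The corpus, the test question, and the keyword-overlap retriever are all placeholders; in practice you would call your real retrieval step instead.

```python
# Toy benchmark: does the known "golden" document show up in the top results?
# The corpus, questions, and keyword-overlap retriever are placeholders;
# swap in your real retrieval call.
corpus = {
    "DOC-X-4.2": "Annual subscribers may request a refund within 30 days of renewal.",
    "DOC-Y-1.0": "Our office closes at 17:00 on Fridays.",
}

benchmark = [
    {"question": "What's our refund policy for annual subscribers?", "golden": "DOC-X-4.2"},
]

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Stand-in retriever: rank documents by how many words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(words & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

hits = sum(case["golden"] in retrieve(case["question"]) for case in benchmark)
print(f"Retrieval hit rate: {hits}/{len(benchmark)}")
```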

It’s a bit like having unit tests for a software system, but here you’re testing the AI’s ability to find information. Keep these benchmarks and run them regularly, especially whenever you tweak the system (for instance, if you change the vector database configuration or add a lot of new documents). Over time, you might expand this test set to cover more scenarios. 

It’s worth noting that this kind of benchmarking is specific to the retrieval aspect of RAG. It complements testing the AI’s overall answer quality or an AI agent’s decision-making. In other words, you’re isolating whether the right supporting information is being pulled in. 

This is extremely useful when experimenting with different strategies. Say you try a new chunking method or adjust the number of documents the system retrieves; your benchmark queries will immediately show if those changes helped (e.g., more questions now surface the correct document) or hurt (e.g., the system started missing some answers it used to get right). Having these reference points takes a lot of the guesswork out of improving your RAG: you'll have data to back up that one approach works better than another. 

In summary, treat RAG development as an iterative, test-driven process: set up known test queries, track the system’s performance, and use those insights to continually refine your document processing and retrieval techniques.

Tuning Key Settings (top_k, top_p, etc.)

Finally, remember that most RAG systems (especially those built with libraries or APIs) offer technical settings you can tweak to improve performance. It’s worth getting familiar with a few of these knobs and dials, as small adjustments can make a big difference in your results. Three important parameters to consider are top_k, top_p, and various search parameters often collectively referred to as search_kwargs (short for search “keyword arguments”); a brief configuration sketch follows the list:

  • top_k (Number of Results to Retrieve): This setting controls how many document chunks the system pulls from the knowledge base for each query. In other words, if top_k is 5, the system will retrieve the 5 most relevant pieces of content to feed into the LLM. If it’s set to 1, it will only grab the single highest-ranked chunk.

    Tuning this can impact both accuracy and efficiency. A higher top_k means the AI has more material to work with (which can help if the answer actually spans multiple documents or sections), but it also means more noise if those extra chunks aren’t truly relevant.

    If your RAG sometimes gives incomplete answers, you might try increasing top_k to ensure it’s seeing enough context. Conversely, if you find the AI is getting confused or straying off-topic, decreasing top_k could help it focus only on the most relevant info.

    Finding the right number often requires testing; for many applications, a small handful (3-5) of top chunks is a good starting point, but your use case might need more or fewer.

  • top_p (Nucleus Sampling for Generation): Whereas top_k deals with retrieval, top_p is a setting that influences the AI model’s generation of the answer.

    This parameter (between 0 and 1) is part of the text generation controls and works hand-in-hand with temperature to manage the randomness or creativity of the output. In simple terms, top_p (often called nucleus sampling) limits the model to considering only the most probable words as it writes an answer. For example, top_p = 0.9 means “only use the set of words that collectively have 90% of the probability mass at each step.”

    A lower top_p (closer to 0) makes the model more conservative: it will stick to highly likely word choices (usually meaning it stays closer to facts and rephrases the source material more directly).

    A higher top_p (closer to 1) allows more variety: the model might pick less obvious words or phrases, which can sometimes introduce creative or extraneous details.

    In a RAG system, you generally want the model to stay grounded in the retrieved text, so leaning toward a somewhat lower top_p (and/or a lower temperature) often yields better factual accuracy. However, too low can make responses terse or overly literal.

    It’s about balance: if your answers seem a bit too dry or exact, nudging top_p up can make them read more naturally. If you catch the AI embellishing or going beyond the source, dial top_p down to keep it honest.

  • search_kwargs (Search Parameters and Filters): This isn’t one single setting but rather a bundle of options that control how the retrieval step searches your document index. Depending on the RAG toolkit or platform you use, you might have parameters for things like filtering by metadata, setting a minimum relevance score, or choosing a specific search algorithm.

    For instance, you could restrict the search to certain document categories (e.g. only search within “technical docs” if the query clearly asks about a technical topic), which would be done via a filter in the search parameters. Or you might adjust something like search_depth or use hybrid search (combining keyword search with vector similarity) through these settings. The key point is that you have control over the retrieval behavior beyond just the query text – and using these controls can significantly improve results.

    If users are getting a lot of near-misses in retrieval (e.g., documents that are somewhat related but not exactly what they need), consider refining the search_kwargs. This could mean adding a filter (such as date range, document type, or author), or tweaking the similarity cutoff so that only chunks with a very high relevance score are returned. On the flip side, if the system is too narrowly focused and sometimes says “I don’t know” even though the info exists somewhere, you might relax the search constraints a bit.

    Work with your developers or platform settings to find the right mix; sometimes it’s trial and error, but the ability to fine-tune these parameters is there to help your RAG perform optimally for your specific data.
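
To make this concrete, here is a minimal sketch of where these settings typically live in a pipeline. The names mirror common conventions (LangChain-style search_kwargs, OpenAI-style top_p and temperature), but they are placeholders: the stub retriever stands in for your real vector store, and your own toolkit’s parameter names may differ.

```python
from dataclasses import dataclass

# Retrieval side: how many chunks to fetch and which filters to apply.
# Names follow common conventions but are placeholders; check your toolkit's docs.
search_kwargs = {
    "k": 5,                               # top_k: number of chunks to retrieve
    "score_threshold": 0.75,              # drop weakly related chunks (if supported)
    "filter": {"doc_type": "technical"},  # metadata filter, e.g. technical docs only
}

# Generation side: keep the model close to the retrieved material.
generation_settings = {
    "temperature": 0.2,  # low randomness -> more literal, fact-focused answers
    "top_p": 0.8,        # nucleus sampling: only the most probable words
}

@dataclass
class StubRetriever:
    """Stand-in for your real vector store / retriever."""
    docs: list

    def search(self, query: str, k: int, **_ignored) -> list:
        # Real retrievers rank by embedding similarity; this just returns k docs.
        return self.docs[:k]

def build_prompt(question: str, chunks: list) -> str:
    """Show how retrieved context and the question are combined for the LLM."""
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

retriever = StubRetriever(docs=[{"text": "Widgets ship within 5 business days."}])
chunks = retriever.search("How fast do widgets ship?", **search_kwargs)
print(build_prompt("How fast do widgets ship?", chunks))
# generation_settings would then be passed to your LLM call, for example
# client.chat.completions.create(..., **generation_settings) in the OpenAI SDK.
```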

Summary

In summary, don’t be afraid to adjust these technical dials. They are provided so you can customize the RAG system’s behavior. Business users don’t need to worry about the code, but as a decision-maker it’s useful to know that these levers exist. You can ask your technical team or vendor: “Have we set the retrieval top_k to an appropriate number? What’s our strategy for the generation settings like top_p to ensure the answers stay factual? Are we using any filters in the search step to limit or expand the scope appropriately?” 

Those kinds of questions will help ensure that the system is tuned for your needs. The defaults might be okay, but a well-tuned system can be the difference between an AI assistant that’s just okay and one that’s genuinely helpful and reliable.
