Mert Ozgun

Low-Hanging Fruit of LLM Security

July 17, 2025
4 min read
agents llm-security

So you’re thinking of fine-tuning your favorite LLM, or working on alignment? Let’s get the low-hanging fruit out of the way first.

First of all, what are guardrails in LLM agents context?

Guardrails are mechanisms designed to prevent large language models (LLMs) from generating harmful, biased, or unsafe content.

This is not a complete definition, but it captures the idea. Guardrails are checks or validations you perform before sending the query to your actual LLM agent; they act as shortcuts in front of your main agent. They can serve many purposes, such as ensuring the data passed in is valid, or making sure the query is not malicious.

Let’s consider the scenario where you have a real estate agent. The first guardrail you can think of here is one that prevents the agent from answering anything unrelated to its main purpose. Of course, you can also prompt your agent to never talk about the meaning of life, but these are nondeterministic creatures, so we may need a bit more rigor. What if we add a lightweight check to make sure the user’s query is relevant to the agent’s context? This could be a very fast, lightweight LLM that looks at the chat history and decides whether to deny the request. If this guardrail is triggered, we can quickly return a response (possibly hardcoded, like how ChatGPT says it can’t help with certain requests) without wasting resources on the main agent.
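As a sketch of the idea, here is what that shortcut could look like. The function and agent names are hypothetical, and the "small model" is stubbed with a keyword heuristic so the example runs; in practice you would call an actual fast, cheap LLM here.

```python
# Relevance guardrail in front of a hypothetical real-estate agent.
REFUSAL = "Sorry, I can only help with real-estate questions."

def small_model_is_relevant(history: list[str], query: str) -> bool:
    """Stand-in for a lightweight classifier LLM over the chat history."""
    on_topic = ("house", "apartment", "rent", "listing", "mortgage")
    return any(word in query.lower() for word in on_topic)

def handle_query(history: list[str], query: str) -> str:
    if not small_model_is_relevant(history, query):
        return REFUSAL  # hardcoded shortcut: the main agent never runs
    return expensive_main_agent(history, query)

def expensive_main_agent(history: list[str], query: str) -> str:
    # Placeholder for the real (slow, costly) agent call.
    return f"[agent answer about: {query}]"
```

The key property is that the refusal path returns before any expensive call is made.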

In general, there are two types of guardrails: code-based and AI-based. Examples of code-based guardrails are data-structure validation through schemas and regex checks. These are very cheap to execute and help catch anything that’s potentially problematic for your main agent. I find these checks useful when my agent requires a structured input whose properties I can validate easily with a tool like Pydantic before passing it to the agent. AI-based guardrails, on the other hand, are more expensive, like the query-relevancy example above. They’re necessary when you want to detect obscure malicious intent in the input, or when you need to hint the agent toward certain tools in its toolset. It’s important to use AI-based guardrails with caution in production, as they slow down time to first token (TTFT), affecting user experience.
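A code-based guardrail can be as simple as validating a schema before the agent ever sees the input. The post mentions Pydantic; this stdlib-only sketch shows the same idea with a dataclass and explicit checks (the field names and the regex are illustrative assumptions):

```python
import re
from dataclasses import dataclass

ZIP_RE = re.compile(r"^\d{5}$")  # assumed US-style 5-digit zip

@dataclass
class ListingQuery:
    zip_code: str
    max_price: int

    def validate(self) -> list[str]:
        """Return a list of validation errors; empty means the input is safe."""
        errors = []
        if not ZIP_RE.match(self.zip_code):
            errors.append(f"invalid zip code: {self.zip_code!r}")
        if self.max_price <= 0:
            errors.append("max_price must be positive")
        return errors

def guarded_call(query: ListingQuery) -> str:
    errors = query.validate()
    if errors:
        # Cheap rejection: the main agent is never invoked.
        return "Invalid request: " + "; ".join(errors)
    return f"[agent searches listings in {query.zip_code} under {query.max_price}]"
```

With Pydantic you would get the same behavior by catching `ValidationError` on model construction instead of calling `validate()` by hand.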

There’s also a less popular use of guardrails that I like to have in my agents. Let’s say you are building an appointment agent. When do we confirm an appointment? Should we just let the agent handle it? Or should we check for acceptance using a smaller model, enrich the user’s input with the correct appointment date/time, and then feed it into the agent? In my experience, in ambiguous situations like appointment scheduling, where multiple options are discussed, it’s crucial to first pin down the exact details of the user’s acceptance; this improves evals significantly.
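The enrichment step above can be sketched as a preprocessor that runs before the main agent. The acceptance detector and the slot resolver are both stubs here (a keyword check and a "take the last offered slot" rule); in a real system each would be a small-model call.

```python
from datetime import datetime

def small_model_detects_acceptance(message: str) -> bool:
    """Stand-in for a cheap classifier LLM deciding if the user accepted."""
    cues = ("yes", "sounds good", "confirm", "that works")
    return any(cue in message.lower() for cue in cues)

def resolve_accepted_slot(offered: list[datetime], message: str) -> datetime:
    """Toy resolver: assume the user accepted the most recently offered slot."""
    return offered[-1]

def preprocess(offered: list[datetime], message: str) -> str:
    if small_model_detects_acceptance(message):
        slot = resolve_accepted_slot(offered, message)
        # Enrich the message so the main agent sees the exact date/time,
        # rather than having to infer it from the whole conversation.
        return f"{message} [user accepted: {slot.isoformat()}]"
    return message
```

The main agent then receives an unambiguous input, which is what makes the evals easier to pass.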

I believe that as we build more complex agents, we will need to reduce that complexity by offloading it into lightweight guardrails, making the agent more robust without necessarily causing prompt creep.