I have been experimenting with voice agents for the past three months. Along the way I came across ElevenLabs and wrote an article about it, which you can read [here].

While exploring, I grew curious about how ElevenLabs agents really work under the hood. Since I had already built a similar system from scratch, I wanted to compare their architecture with mine and identify the key differences.

Why understanding ElevenLabs matters

Understanding how ElevenLabs works, rather than treating it as a black box, helps us optimize usage and integrate it more effectively.

My findings

I began by digging deeper into their product and tracing its behavior. One feature that caught my attention was the Custom LLM option in their conversation agent, which lets users point the agent at their own hosted LLM by providing a base URL and model name. To test this, I created a simple LLM proxy (you can check out the repository [here]). The proxy accepts a target URL, forwards each request to it, adds authentication headers (e.g., for Google APIs), and stores the requests in files for later analysis.
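To give a sense of the idea, here is a minimal sketch of such a proxy in Flask. The target URL, API key, route, and file layout are placeholders I chose for illustration; the actual repository linked above differs in the details (for example, a production proxy would also need to handle streaming responses).

```python
# Minimal sketch of an LLM proxy: forward the incoming request to a target
# URL, add an auth header, and save the request body to a file for analysis.
# TARGET_URL, API_KEY, and the route are illustrative placeholders.
import json
import time
from pathlib import Path

import requests
from flask import Flask, Response, request

app = Flask(__name__)

TARGET_URL = "https://example-llm-endpoint/v1/chat/completions"  # your real LLM endpoint
API_KEY = "YOUR_API_KEY"                                          # e.g. a Google API key
LOG_DIR = Path("captured_requests")
LOG_DIR.mkdir(exist_ok=True)


@app.post("/v1/chat/completions")
def proxy_chat_completions():
    body = request.get_json(force=True)

    # Store the raw request so the system prompt and tools can be inspected later.
    log_file = LOG_DIR / f"request_{int(time.time() * 1000)}.json"
    log_file.write_text(json.dumps(body, indent=2))

    # Forward the request to the real LLM endpoint with our own auth header.
    upstream = requests.post(
        TARGET_URL,
        json=body,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    return Response(
        upstream.content,
        status=upstream.status_code,
        content_type=upstream.headers.get("Content-Type", "application/json"),
    )


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```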

After pointing the agent at my proxy and running a few test conversations, I could inspect the requests coming from ElevenLabs. Analyzing the system prompts, I discovered that they prepend a predefined set of instructions to whatever system prompt is configured in their UI:

Task description: You are an AI agent. Your character definition is provided below, stick to it. No need to repeat who you are unless prompted by the user. Provide helpful and informative responses. Ask clarifying questions when needed. Be polite, professional, and concise. Do not provide personal, medical, legal, financial, confidential, or copyrighted information. Avoid offensive, harmful, or misleading content. If the user responds with '...' or stays silent, prompt them to continue. Do not format responses with bullet points, bold text, or headers. Avoid symbols like $, %, #, @, etc., or digits—spell them out instead (e.g., “three dollars”). Unless otherwise specified, keep responses to 3–4 sentences. Default language: English.
Agent character description:

OUR SYSTEM PROMPT

If any tools are enabled in the UI, they are sent along as part of the request payload.
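Putting it together, the captured requests look roughly like a standard OpenAI-style chat completion call. The sketch below is illustrative: the model name, user message, and tool definition are made up, and only the structure of the system message (predefined instructions, then "Agent character description:", then the prompt from the UI) reflects what I observed.

```python
# Illustrative shape of a captured request. Values are examples, not real data.
captured_request = {
    "model": "gemini-2.0-flash",  # the model name entered in the ElevenLabs UI (example)
    "messages": [
        {
            "role": "system",
            "content": (
                "Task description: You are an AI agent. Your character definition "
                "is provided below, stick to it. ...\n"
                "Agent character description:\n"
                "OUR SYSTEM PROMPT"
            ),
        },
        {"role": "user", "content": "Hi, can you help me?"},
    ],
    # Present only when tools are enabled in the UI; this tool is hypothetical.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "end_call",
                "description": "End the conversation.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
}
```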

How voice agents work in general

If you’ve read my 2025 Voice Agent Guide, this will sound familiar. Here’s the high-level workflow:

  1. User speech is recorded until silence is detected.
  2. The audio is sent to a speech-to-text (STT) model, which transcribes it.
  3. The transcription is passed to the LLM as part of the conversation.
  4. The LLM generates a text response.
  5. The response is streamed to a text-to-speech (TTS) model, which converts it back into audio.
  6. The audio is then played back to the user.

This is essentially how ElevenLabs voice agents function.
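For readers who prefer code, here is a minimal, provider-agnostic sketch of that loop. Every function is a placeholder you would replace with a real microphone/VAD, STT, LLM, and TTS integration; none of this is ElevenLabs' actual implementation.

```python
# Skeleton of the record -> transcribe -> generate -> synthesize -> play loop.
# Each helper is a stub to be swapped for a real provider call.

def record_until_silence() -> bytes:
    """Capture microphone audio until silence is detected (placeholder)."""
    raise NotImplementedError("hook up a mic and a VAD library here")

def speech_to_text(audio: bytes) -> str:
    """Transcribe audio with an STT model (placeholder)."""
    raise NotImplementedError("call your STT provider here")

def generate_reply(history: list[dict]) -> str:
    """Send the conversation to an LLM and return its text reply (placeholder)."""
    raise NotImplementedError("call your LLM endpoint here")

def text_to_speech(text: str) -> bytes:
    """Synthesize speech from text with a TTS model (placeholder)."""
    raise NotImplementedError("call your TTS provider here")

def play_audio(audio: bytes) -> None:
    """Play the synthesized audio back to the user (placeholder)."""
    raise NotImplementedError("send audio to the speaker or client here")

def conversation_loop(system_prompt: str) -> None:
    history = [{"role": "system", "content": system_prompt}]
    while True:
        audio = record_until_silence()                           # 1. record until silence
        user_text = speech_to_text(audio)                        # 2. STT
        history.append({"role": "user", "content": user_text})   # 3. add turn to conversation
        reply = generate_reply(history)                          # 4. LLM generates text
        history.append({"role": "assistant", "content": reply})
        play_audio(text_to_speech(reply))                        # 5-6. TTS and playback
```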

That’s all for Part 1. In Part 2, I will dive into how their RAG system works and explore the alpha version of their workflow features.