Data Processing Pipelines
Unstructured Data for LLM
Unstructured: a platform that transforms complex, unstructured data into clean, structured data. Securely. Continuously. Effortlessly.
MarkItDown
Python tool for converting files and office documents to Markdown.
Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion
When we attach multiple tools to an LLM, it can easily get confused. To avoid this, we store the collection of tool documents in a database (RAG over tools) and give the model a single retrieval tool that looks up whichever tool is needed for the current query.
The tool document in the library contains:
- The tool name (e.g., “Get Record”).
- A description of what the tool does.
- The argument schema (parameters the tool needs).
- Synthetic queries that are examples of what the tool can do (e.g., “How do I retrieve a record from this database?”).
- Key topics that summarize what the tool is about.
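As a minimal sketch, one such tool document could be represented as a plain Python dict (the field names here are illustrative, not from any specific framework):

```python
# A hypothetical tool document for the Toolshed knowledge base.
tool_doc = {
    "name": "Get Record",
    "description": "Retrieves a single record from the database by its ID.",
    "argument_schema": {
        "record_id": {"type": "string", "required": True},
        "table": {"type": "string", "required": True},
    },
    "synthetic_queries": [
        "How do I retrieve a record from this database?",
        "Fetch the customer row with ID 42.",
    ],
    "key_topics": ["database", "read", "lookup"],
}

# The text that gets embedded for retrieval concatenates the fields.
searchable_text = " ".join([
    tool_doc["name"],
    tool_doc["description"],
    *tool_doc["synthetic_queries"],
    *tool_doc["key_topics"],
])
print(searchable_text)
```

Concatenating the name, description, synthetic queries, and key topics into one searchable string is what lets a single embedding capture all the ways a user might ask for this tool.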
Steps
1. Pre-retrieval (Indexing)
Before the agent can even search for tools, it has to prepare the tools so they’re easier to retrieve. To do this:
- Each tool gets enhanced metadata, including:
- A long and high-quality description.
- Synthetic questions that help the agent understand what the tool can do.
- Key topics that give the agent more context.
This process ensures that when the agent searches for a tool, it knows exactly what the tool can do and how to use it.
2. Intra-retrieval (Inference-time)
This is where the agent actually searches for the tool it needs in real-time. Here’s how it works:
- Rewriting the query: If a user’s query is unclear or has mistakes (like typos or ambiguous words), the agent can clean it up first.
- Query decomposition: If the user asks a complex question (like, “How do I get a record and then update it?”), the agent breaks it down into smaller, easier parts.
- Query expansion: The agent creates multiple versions of the query to capture all possible meanings and retrieve the most relevant tools.
- The agent then searches the Toolshed Knowledge Base for the best tools.
3. Post-retrieval
Once the tools are retrieved, the agent filters out the irrelevant ones:
- It uses a reranker to re-score the candidates and keep only the best-matching tools.
- If the agent decides relevant tools are still missing, it can self-reflect and search again.
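Putting the retrieval step together, here is a toy sketch. Word-overlap (Jaccard) similarity stands in for both the embedding search and the reranker, and the tool texts are made up; all of these are assumptions for illustration only.

```python
def jaccard(a, b):
    """Word-overlap similarity: a crude stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Toy Toolshed knowledge base: tool name -> searchable description text.
toolshed = {
    "Get Record": "retrieve fetch a record row from the database by id",
    "Update Record": "update modify change a record row in the database",
    "Send Email": "send an email message to a user",
}

def retrieve(query, k=2):
    """Score every tool against the query and keep the top-k."""
    scored = sorted(toolshed.items(),
                    key=lambda kv: jaccard(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Query decomposition: one complex request becomes two sub-queries,
# each retrieved independently, then the results are merged.
subqueries = ["how do I get a record from the database",
              "how do I update a record in the database"]
tools = {t for q in subqueries for t in retrieve(q, k=1)}
print(tools)
```

Decomposing first and merging afterwards is what lets a query like "get a record and then update it" surface both tools even though neither sub-query alone would rank both highly.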
Resources
- https://eugeneyan.com/
- Technical articles about graph neural networks, large language models, and convex optimization.
- The Turing Lectures: The future of generative AI
- https://genai-handbook.github.io/
Learning resource
Abstract Meaning Representation
Abstract Meaning Representation (AMR) is a way to represent the meaning of a sentence using a structured, graph-based format. Instead of focusing on the exact words used in a sentence, AMR captures the underlying concepts and their relationships.
Think of it as a simplified meaning representation that removes unnecessary linguistic complexity while preserving the core message.
“The boy wants to eat pizza.” is converted to:

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (e / eat-01
      :ARG0 b
      :ARG1 (p / pizza)))
```

Note that the variable `b` is reused: the boy is both the one who wants and the one who eats (a re-entrancy in the graph).
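As a rough illustration in plain Python (no AMR libraries, and the variable labels `w`, `b`, `e`, `p` are just conventional shorthand), the same graph can be held as a set of triples and queried:

```python
# The AMR graph as (source, role, target) triples.
triples = [
    ("w", ":instance", "want-01"),
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "e"),
    ("b", ":instance", "boy"),
    ("e", ":instance", "eat-01"),
    ("e", ":ARG0", "b"),   # re-entrancy: the boy is also the eater
    ("e", ":ARG1", "p"),
    ("p", ":instance", "pizza"),
]

def concept(var):
    """Look up the concept a variable instantiates."""
    return next(t for s, r, t in triples if s == var and r == ":instance")

def args(var, role):
    """Return the targets of a given role on a variable."""
    return [t for s, r, t in triples if s == var and r == role]

print(concept("w"))                     # want-01
print(concept(args("w", ":ARG0")[0]))   # boy
```

Storing the graph as triples makes the re-entrancy explicit: both `("w", ":ARG0", "b")` and `("e", ":ARG0", "b")` point at the same node.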
Summarization of long text
Text Segmentation
Text segmentation is the process of dividing a large piece of text into smaller, meaningful segments like paragraphs, topics, or sentences based on their semantic (meaning) boundaries, not just formatting.
- Sentence segmentation – e.g., splitting text into sentences using punctuation such as periods.
- Topic segmentation – splitting long transcripts (e.g., YouTube captions, meeting notes, books) into coherent sections where the subject changes.
- Discourse segmentation – dividing based on logical structure (claims, arguments, etc.).
How it works
Imagine we have a long transcript like this:
[0] "Welcome to the AI podcast."
[1] "Today, we'll discuss how neural networks learn."
[2] "Backpropagation is a key mechanism."
[3] "Let's now switch to reinforcement learning."
[4] "Agents learn by rewards and punishment."
[5] "Thank you for listening!"
The goal is to break this into coherent segments. But we don’t want to:
- Split after every sentence (too fine-grained)
- Use fixed-length chunks (e.g., 3 sentences), because topic changes aren’t always evenly spaced
Embed Each Sentence
We use a model like Sentence-BERT to convert each sentence into a vector in high-dimensional space.
We then compute the cosine similarity between adjacent sentence embeddings:
| Pair | Cosine Similarity | Note |
|---|---|---|
| (0,1) | 0.87 | |
| (1,2) | 0.90 | |
| (2,3) | 0.45 | ← Drop |
| (3,4) | 0.92 | |
| (4,5) | 0.83 | |
A sudden drop in similarity (like between 2 and 3) usually signals a topic shift → segment boundary!
- Split the input into units (sentences, lines, etc.)
- Embed each unit (Sentence-BERT or similar)
- Compute similarity scores between adjacent units
- Score each potential boundary (i.e., places between units)
- Optionally apply distance-based penalty: prevent very short or very long segments
- Use greedy algorithm (or dynamic programming) to pick best breakpoints
- Greedy = pick the next biggest drop in similarity if distance constraint is met
- Output segments
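The steps above can be sketched in plain Python. The tiny hand-made 2-D vectors below stand in for Sentence-BERT output (an assumption for illustration); in practice you would embed each sentence with a real model, and the 0.6 threshold is arbitrary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def segment(embeddings, threshold=0.6, min_len=1):
    """Greedy boundary picking: cut wherever adjacent similarity
    drops below `threshold`, respecting a minimum segment length."""
    boundaries = []
    start = 0
    for i in range(len(embeddings) - 1):
        sim = cosine(embeddings[i], embeddings[i + 1])
        if sim < threshold and (i + 1 - start) >= min_len:
            boundaries.append(i + 1)  # next segment starts at i + 1
            start = i + 1
    # Convert boundary indices to (start, end) ranges.
    cuts = [0] + boundaries + [len(embeddings)]
    return [(cuts[j], cuts[j + 1]) for j in range(len(cuts) - 1)]

# Toy "embeddings": units 0-2 point one way, units 3-5 another,
# mimicking the topic shift between sentences 2 and 3 above.
emb = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.15),
       (0.1, 1.0), (0.2, 0.9), (0.15, 1.0)]

print(segment(emb))  # [(0, 3), (3, 6)]
```

The greedy pass handles the common case; dynamic programming becomes worth it when you want a globally optimal set of boundaries under length penalties rather than the first acceptable cut at each drop.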