AI Agents Explained: Their Role in Modern AI Systems — Learning Notes (Part II)
What is an Agent?
An agent is an AI-driven system that interacts with its environment to achieve a user-defined objective. It combines reasoning, planning, and executing actions, often leveraging external tools. The backbone of an agent is a Large Language Model (LLM), which enables it to process and respond to natural language inputs effectively.
How Does an Agent Work?
An agent serves as a coordinator for executing specialized tasks within a predefined environment equipped with various tools. The LLM acts as the core component, interpreting natural language inputs, maintaining state, and determining the next course of action. If a task requires external data or computation, the agent invokes the appropriate tools, processes the results, and feeds them back to the LLM so it can update its state and decide what to do next.
The Agent Workflow:
1. Understand: The agent receives a user query and processes it using an LLM.
2. Decide: If the task requires external tools (e.g., fetching real-time data), the agent determines which tools to use.
3. Execute: The agent invokes the necessary tool(s), retrieves the results, and integrates them into the conversation.
4. Respond: The agent returns a meaningful response while maintaining context for future interactions.
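A minimal sketch of this four-step loop in Python. Both call_llm and the get_time tool are hypothetical placeholders: a real agent would call an actual model API, and tool selection is usually delegated to the LLM itself (for example, via function calling) rather than keyword matching.

import datetime

def get_time(query: str) -> str:
    # Tool: returns the current time, information an LLM cannot know on its own.
    return datetime.datetime.now().isoformat()

TOOLS = {"time": get_time}

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real model call.
    return f"(LLM answer based on: {prompt})"

def run_agent(query: str) -> str:
    # 1. Understand: the LLM would normally interpret the query here.
    # 2. Decide: naive keyword routing stands in for the LLM's tool choice.
    for name, tool in TOOLS.items():
        if name in query.lower():
            # 3. Execute: invoke the tool and fold its result into the context.
            result = tool(query)
            return call_llm(f"{query}\nTool '{name}' returned: {result}")
    # 4. Respond: no tool needed, so answer directly.
    return call_llm(query)

print(run_agent("What time is it?"))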
Why Do We Need Agents?
LLMs primarily function as generative models, meaning they generate text based on input prompts. However, they do not inherently react to real-world events or provide up-to-date information. This is where agents and tools come into play.
For example, if a user asks about today’s weather, an LLM alone would rely on its training data, which may be outdated. An agent, on the other hand, can invoke a weather API to fetch real-time data and provide an accurate response.
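A sketch of what such a weather tool might look like; the endpoint and its parameters are hypothetical stand-ins for whatever weather API the agent is actually wired to.

import requests

def get_weather(city: str) -> dict:
    # Fetch live conditions so the agent can answer with current data,
    # not whatever its training corpus last saw.
    response = requests.get(
        "https://api.example-weather.com/v1/current",  # hypothetical endpoint
        params={"city": city},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()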
Understanding Large Language Models (LLMs)
An LLM is an AI model that excels at understanding and generating human language. These models are trained on vast datasets, learning patterns, structures, and nuances in language. Most LLMs today are based on the Transformer architecture, which Google introduced in 2017 and which has revolutionized natural language processing; BERT (2018) was one of its first widely adopted models.
Types of Transformer Models:
1. Encoders: Convert input text into a dense representation (embedding), useful for tasks like classification and retrieval (e.g., BERT).
2. Decoders: Generate text token by token (e.g., GPT).
3. Seq2Seq (Encoder-Decoder): Combine both for tasks like translation and summarization (e.g., T5, BART).
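The transformers library exposes all three variants through a common pipeline interface. A minimal sketch; the model names are just common public examples:

from transformers import pipeline

# Encoder: turns text into a dense representation (embedding).
encoder = pipeline("feature-extraction", model="bert-base-uncased")
embedding = encoder("AI agents are useful.")  # nested list: [batch][token][hidden_size]

# Decoder: generates text token by token.
decoder = pipeline("text-generation", model="gpt2")
print(decoder("AI agents are", max_new_tokens=10)[0]["generated_text"])

# Seq2Seq: encoder-decoder, suited to translation and summarization.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = ("AI agents bridge the gap between LLMs and real-world interactivity. "
           "They combine reasoning, planning, and tool use to act on a user's behalf.")
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])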
The Role of Prompting
Prompting is a crucial component of LLM interactions, determining how the model predicts and generates responses. When a user interacts with an LLM, they provide a prompt, which often includes prior conversation history to maintain context.
Example of Message History in a Prompt:
message_history = [
    {
        "role": "system",
        "content": "You are a stock market analyst and will help me pick stocks."
    },
    {
        "role": "user",
        "content": "Please help me pick a stock."
    }
]
This structured conversation ensures that the LLM understands its role and provides relevant responses.
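One common way to send this history to a model, sketched here with the openai Python client; the model name is an assumption, and any OpenAI-compatible endpoint looks the same:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

message_history = [
    {"role": "system", "content": "You are a stock market analyst and will help me pick stocks."},
    {"role": "user", "content": "Please help me pick a stock."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute whatever your endpoint serves
    messages=message_history,
)
print(response.choices[0].message.content)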
Special Tokens in LLMs
LLMs use special tokens to structure and interpret input data. Under the hood, the message array above is converted into a single string using delimiters specific to each LLM; you can think of this as a message protocol, the format used to talk to the model. The transformers library takes care of this translation (see the sketch after the list below). Common special tokens include:
- End of Sequence (EOS): Signals where the response should stop.
- Beginning of Sequence (BOS): Marks the start of input.
- Padding Token: Ensures consistent input length.
- Masking Token: Used in models like BERT to predict missing words.
- Separation Token: Differentiates instructions, user queries, and system responses.
- Special Instruction Tokens: Define task boundaries for instruction-based models.
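To see this translation concretely, the transformers library renders a message list into the model's own delimited string via its chat template. A sketch using one public chat model's tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a stock market analyst and will help me pick stocks."},
    {"role": "user", "content": "Please help me pick a stock."},
]

# tokenize=False returns the raw string so the special tokens are visible.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly: <|system|>\n...</s>\n<|user|>\n...</s>\n<|assistant|>\n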
What Are AI Tools?
AI tools are specialized functions integrated with an LLM to extend its capabilities. These tools perform tasks beyond the LLM’s native abilities, such as fetching real-time data or performing mathematical computations.
Common AI Tools and Their Functions:
- Web Search: Retrieves current data from the internet.
- Image Generation: Creates images based on text input.
- Retrieval: Fetches relevant documents from a database.
- API Interface: Interacts with external APIs (e.g., GitHub, YouTube).
For example, if an LLM needs to perform arithmetic calculations, an integrated calculator tool will yield more accurate results than relying on the LLM’s text-based predictions.
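A minimal calculator tool, sketched here as a safe arithmetic evaluator built on Python's ast module; the tool's name and scope are illustrative:

import ast
import operator

OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    # Evaluate basic arithmetic without the risks of eval().
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

print(calculator("1234 * 5678"))  # 7006652 (exact, unlike a token-by-token guess)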
Conclusion
AI agents bridge the gap between LLMs and real-world interactivity. While LLMs generate human-like text, agents empower them to interact with their environment, retrieve real-time information, and execute specialized tasks. By integrating external tools, AI agents extend the capabilities of LLMs, making them more practical and responsive in real-world applications.