Files and Resources with MCP - Part 1
Introduction
As AI assistants become more embedded, they need reliable ways to work with files, images, and other content types. Whether you’re building autonomous agents or simpler workflow automation and extensions, handling diverse content effectively is crucial. The Model Context Protocol (MCP) provides a standardized approach for managing these interactions between Users, AI Assistants, and external tools.
This is the first of two articles exploring content handling in MCP. Here we examine current implementation patterns and practical examples with Claude Desktop. The second article will look at more general patterns and parts of the specification, particularly for building more sophisticated plug-and-play agentic systems.
Components and Terminology
For clarity, we distinguish between the Host application (which manages the overall user experience) and the MCP Client (a library the Host uses to communicate with MCP Servers).
Component | Description |
---|---|
Host | The primary application (e.g. Claude Desktop, LibreChat) managing user-assistant conversations |
Assistant | The AI model generating responses and requesting tool operations |
Assistant API | Handles message processing, tokenization, and tool availability |
MCP Server | Executes tool operations requested by the assistant |
MCP Client | A library used by the Host to interact with MCP Servers |
Understanding these components’ roles is essential as we explore how different content types flow through the system.
Content Types and Processing
Large Language Models (LLMs) primarily process text, including formats like Markdown, JSON, and source code. Modern multi-modal models support additional content types:
- Claude Sonnet 3.5: Text, Images (Vision), and PDFs
- OpenAI GPT-4-Audio-Preview: Text and Audio processing
- Google Gemini 2.0: Text, Image (Vision and Generation), and audio generation
When users share non-text content, the Assistant API creates a Content Block and tokenizes it for the model1. This process is efficient - for example, images typically require only 1,500 tokens, equivalent to approximately 1,000 words2.
The host application manages the presentation and handling of all supported content types.
Tool Use: Vision and Generation
Modern LLMs handle images differently depending on whether they’re processing or generating them. Let’s examine this through a practical example.
OpenAI’s GPT-4o and o1 models have vision capabilities - they can analyze uploaded images, but cannot generate them. However, ChatGPT users can create images because the application provides access to image generation through tools.
Here’s how image generation works in ChatGPT:
- The Host application tells the Assistant that an Image Generator tool is available
- When a user requests an image, the Assistant creates a suitable prompt
- The Assistant requests the Host to call the DALL-E Image Generator tool
- The Host executes this request and displays the result to the user
Importantly, generated images are ‘attached’ rather than ’embedded’ in the conversation. You can verify this yourself: ask ChatGPT detailed questions about an image it just generated - it will only be able to guess based on the prompt it created. However, if you upload the generated image back to the chat, it can then analyze specific details. This architectural choice treats generated content as user-facing output rather than part of the conversation context.
While the ChatGPT example demonstrates basic tool usage, MCP implementations offer a more comprehensive framework for handling both tools and resources. These patterns apply equally to other content types such as PDFs, structured data, or audio files. The key principles of tokenization, resource handling, and tool responses remain consistent regardless of the underlying data format.
Claude Desktop and MCP Implementation
Claude Desktop (v0.78) currently offers the most sophisticated implementation of MCP features. MCP Servers expose Resources, Prompts and Tools to the Host application - Resources and Prompts to return data to the Host and Assistant, while Tools allow the Assistant to make requests to the MCP Server.
General MCP Guidance
When building MCP Servers, several fundamental constraints and patterns shape both Tool and Resource implementations:
- Output token limits constrain argument size for Tool calls, although substantial text is possible
- Tool Call requests and arguments consume Context Window space, favouring concise interactions
- Use the model’s inference capabilities appropriately - avoid redundant content transmission
- Tokenized content (e.g. uploaded images or PDFs) cannot be reconstructed for sending to the MCP Server
- Prefer using URIs over embedded content for large files, or if the content is not needed in the conversation context
These fundamental constraints inform how we handle both Tools and Resources in practice, as we’ll see in the following sections.
Resource Handling
MCP Resources represent data that can be accessed by the Assistant through the MCP Server. Unlike simple file attachments, Resources can be dynamic (like database queries) or static (like files). Resources are exposed to the Host application and can be attached to messages, allowing the Assistant to analyze their content.
To understand how these components work together in practice, let’s examine a typical resource handling flow:
sequenceDiagram actor User participant Host as Host and MCP Client
(e.g. Claude Desktop) participant MCP Server participant Assistant API participant Assistant as Assistant LLM
(e.g. Claude) Note over User,Assistant: MCP Resource Usage in a Conversation User->>+Host: View Resources Host->>+MCP Server: List Resources MCP Server-->>-Host: Available Resources Host-->>-User: Display Resources User->>+Host: "Can you analyze this file?"
+ Selected Resource Host->>+MCP Server: Read Resource MCP Server-->>-Host: Resource Content Note over Host: Compose Message:
1. Text Content Block
"Can you analyze..."
2. Resource Content Block
Base64: "...Y0IGVuY29kZWQ=" Host->>+Assistant API: Send message array:
{role: "user", content: [...]} Note over Assistant API: Tokenize content blocks
for model consumption Assistant API->>+Assistant: Tokenized messages Assistant->>-Assistant API: Generated response Assistant API-->>-Host: {role: "assistant", content: "..."} Host-->>-User: Display formatted response
Current Limitations
While Claude Desktop handles both File (“Paperclip”) and Resource (“Connector”) attachments similarly, there are important differences:
- MCP Resources can represent dynamic data like database queries, not just files
- Claude Desktop’s resource handling has size limitations - large images work via File attachment but cause ‘stack size’ errors as Resources
- Content returned from MCP Servers must be under 1MB in size.
For experimenting with larger resources and content types, mcp-hfspace with Claude Desktop mode set to false is a convenient way of doing so.
File Handling in MCP Servers
MCP Servers need two key elements to handle files effectively:
- Physical access to the files on the server
- A way to communicate available files to the Assistant for tool calls (Resource Discovery)
MCP Servers started by Claude Desktop run with the user’s account permissions and full environment access3. While this provides convenient filesystem and network access during development, it raises important security considerations for production deployments.
Best practices for secure file handling include:
- Anticipating tighter sandboxing in deployment
- Preferring MCP protocol features for resource access
- Designing for compatibility with remote MCP Servers using SSE transport
- Configuring the MCP Server with specific, allowed file directories during startup (to later be replaced with Roots)
Resource Discovery
For effective resource handling, the Assistant needs to know which resources are available. The current version of Claude Desktop doesn’t automatically expose listed Resources to the Assistant. To bridge this gap, there are three main approaches:
- Direct user entry of resource identifiers
- Tool calls that return the Resource List, enabling the Assistant to discover resources automatically
- Prompts that return the Resource List for User review and selection
The MCP Specification encourages meaningful resource descriptions:
A description of what this resource represents. This can be used by clients to improve the LLM’s understanding of available resources. It can be thought of like a ‘hint’ to the model.
This reinforces the importance of structuring Resource Lists in a way that LLMs can effectively process (and the expectation that the Assistant will have access to them).
Content Handling and Return Types
Tool Responses contain blocks of Text, Images or Embedded Resources.
When content is returned:
- Text and Images are tokenized and added to the conversation context
- Images become available to Claude’s Vision capabilities
- Non-text content types are not processed (Claude Desktop displays
Unsupported image type: audio/wav
for an Embedded Resource). - Must be under 1MB in size4.
For content not intended for the conversation context (like large files or binary data), the MCP Server can save files to a configured directory and return a message to the User with the location.
Embedded User Interfaces
MCP Servers can provide their own user interfaces through embedded web servers, enabling interactive access to resources that might not be directly available through the filesystem. The mcp-webcam (@llmindset/mcp-webcam
) project demonstrates this approach.
When Claude Desktop launches the MCP Server, it starts a local web server hosting an interface for webcam interaction. The server provides both a tool call allowing Claude to capture webcam images in the conversation, and an ephemeral Resource for capturing frames. Screenshot capture happens through direct user interaction with the web interface.
This pattern extends naturally to other scenarios, where an MCP Server might:
- Provide secure interfaces to database queries
- Handle user authentication flows
- Enable user approval workflows
- Manage interactive request queues
- Surface local system resources safely
For multi-user environments, this approach currently requires session key management between the tool calls and web interface. This will be simplified once the MCP protocol implements shared authentication between Servers and Hosts.
Looking Ahead
The patterns and implementations we’ve explored demonstrate MCP’s current capabilities for content handling. While there are some limitations, particularly around resource sizes and authentication, the protocol provides a solid foundation for building AI-enabled applications.
In the next article, we’ll examine emerging patterns for working with MCP, including improvements to resource templates, audience handling, and content type management. We’ll also look at how the specification might evolve to better support multi-modal interactions and enterprise deployment patterns.
-
Commonly, binary resources will be encoded as Base64 content blocks. Some APIs allow the use of URIs instead. Another pattern is replacing encoded content with a temporary identifier to keep multi-turn conversations efficient - see here. ↩︎
-
Both the Claude Vision and OpenAI Vision documentation describe this process well. It is worth noting that the API may resize images on input. ↩︎
-
It is possible to deploy MCP Servers using Docker (see here and here). This makes managaging deployment dependencies easier, as well as providing options to sandbox deployments. This post shows a clever use of Docker and LLMs for automating deployment of MCP Servers. ↩︎
-
Ideally, the Image would be passed to the Assistant API to handle tokenization. ↩︎