Here's how tool use actually works under the hood. When you send a message to an API like Anthropic's (Claude) or OpenAI's (GPT-4), you also send a list of tool definitions — each one a JSON schema describing the function name, its parameters (with types and descriptions), and what it does. The model reads these definitions as part of its context, and when it determines that calling a tool would help answer the user's question, it stops generating text and instead outputs a structured tool-call object: the function name and the arguments it wants to pass. Your application code then executes that function (hitting an API, querying a database, running a calculation), and sends the result back to the model as a new message. The model reads the result and continues generating its response. This is not the model "running code" — it's the model producing structured output that your application interprets and acts on.
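The loop above can be sketched in a few lines. This is a minimal simulation, not real SDK code: the field names are Anthropic-flavored but approximate, the model's output is faked to show its shape, and `get_weather` is a hypothetical stand-in for a real function.

```python
import json

# 1. A tool definition sent alongside the user's message: name,
#    description, and a JSON schema for the parameters.
TOOLS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    # Stand-in for a real weather API call.
    return json.dumps({"city": city, "temp_c": 18})

# 2. When the model decides a tool would help, it stops generating text
#    and emits a structured tool-call object (faked here to show the shape).
model_output = {"type": "tool_use", "name": "get_weather",
                "input": {"city": "Lisbon"}}

# 3. Application code -- not the model -- looks up and runs the function...
registry = {"get_weather": get_weather}
result = registry[model_output["name"]](**model_output["input"])

# 4. ...and the result goes back as a new message, which the model reads
#    before continuing its text response.
tool_result_message = {
    "role": "user",
    "content": [{"type": "tool_result", "content": result}],
}
```

The registry dict is the key design point: the model only ever names a function; your code decides what actually runs.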
The quality of your tool definitions matters enormously. Models pick tools based on their names and descriptions, so a tool called search_docs with the description "Search the internal knowledge base for relevant documents given a natural language query" will get used appropriately, while a tool called sd with no description will confuse the model. Parameter descriptions are equally important — if you have a date parameter, specify the expected format ("ISO 8601, e.g. 2025-03-15") or the model will guess. In the Claude API, you can also add a tool_choice parameter to force the model to use a specific tool, let it choose freely, or prevent tool use entirely. OpenAI's API has equivalent controls. Getting these definitions right is often the difference between a tool-use integration that works reliably and one that breaks on edge cases.
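Here is what a well-described definition looks like in practice — a sketch in the Anthropic `input_schema` style (OpenAI's `parameters` field is analogous); the knowledge-base tool and its `after_date` parameter are hypothetical.

```python
# A tool definition the model can actually act on: a descriptive name,
# a full-sentence description, and per-parameter descriptions that pin
# down formats instead of leaving the model to guess.
search_docs = {
    "name": "search_docs",
    "description": ("Search the internal knowledge base for relevant "
                    "documents given a natural language query."),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query.",
            },
            "after_date": {
                "type": "string",
                "description": ("Only return documents modified after this "
                                "date. ISO 8601, e.g. 2025-03-15."),
            },
        },
        "required": ["query"],  # after_date is optional
    },
}
```

Note that the date format lives in the parameter description, not just in your docs — the schema is the only documentation the model sees.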
Parallel tool calling is a feature that's easy to overlook but significant for performance. When a model needs to gather information from multiple sources — say, checking the weather in three cities — it can emit multiple tool calls in a single response. Your application executes them concurrently and sends all results back at once. Claude, GPT-4, and Gemini all support this. The alternative (sequential calls, one per round trip) adds latency that compounds quickly. If you're building a tool-use integration, design your execution layer to handle arrays of tool calls from the start.
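An execution layer built for arrays might look like the following sketch — the call format, IDs, and `get_weather` stub are assumptions, but the structure (submit everything, then collect results in order) carries over to real integrations.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(city: str) -> str:
    # Stand-in for a network call; the thread pool hides its latency.
    return json.dumps({"city": city, "temp_c": 18})

REGISTRY = {"get_weather": get_weather}

def execute_tool_calls(calls: list[dict]) -> list[dict]:
    """Run every tool call in the batch concurrently; keep result order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(REGISTRY[c["name"]], **c["input"])
                   for c in calls]
        return [{"tool_use_id": c["id"], "content": f.result()}
                for c, f in zip(calls, futures)]

# Three calls the model might emit in a single response:
calls = [{"id": str(i), "name": "get_weather", "input": {"city": c}}
         for i, c in enumerate(["Lisbon", "Oslo", "Kyoto"])]
results = execute_tool_calls(calls)
```

Because the function already accepts a list, a response containing one tool call is just the degenerate case — there's no separate sequential code path to maintain.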
A common gotcha is that tool use is not deterministic. The same prompt with the same tools might lead the model to call different tools, pass different arguments, or choose not to use tools at all. This matters for testing and reliability. Production systems typically include validation logic on the tool-call output — checking that required parameters are present, that values are in expected ranges, that the function name matches a known tool. Some teams add a retry mechanism: if the model emits a malformed tool call, the error is sent back as a tool result and the model gets a chance to try again. This "self-correction" pattern works surprisingly well in practice and is much cheaper than trying to prevent all errors upfront.
The history of tool use in AI models is surprisingly short. OpenAI introduced "function calling" in June 2023 with GPT-3.5 and GPT-4, and it immediately changed what was possible to build. Before that, developers used prompt engineering to get models to output JSON in a particular format, then parsed it with fragile regex — it worked, but it was brittle. Anthropic shipped tool use for Claude in 2024, followed by Google for Gemini. The APIs have converged on very similar designs: you define tools as JSON schemas, the model outputs structured calls, and you handle execution. The introduction of MCP (the Model Context Protocol) in late 2024 then added a standardized discovery and transport layer on top of this mechanism, so tools could be shared across applications without redefining them for each one.