Claude API Guide: Streaming, Tools and System Prompts
The Claude API from Anthropic gives you three levers that do most of the heavy lifting in any real application: streaming for responsive output, tools for letting the model act on the world, and system prompts for shaping how the model behaves. This guide walks through each one with practical patterns you can drop into production code.
All examples use the Messages API and the official SDKs (anthropic for Python, @anthropic-ai/sdk for TypeScript). Set your key via the ANTHROPIC_API_KEY environment variable rather than hardcoding it.
Choosing a model first
Before you write a request, pick a model. The current lineup is optimized for different trade-offs:
- Opus (claude-opus-4-8): the most capable tier, best for hard reasoning, complex agentic loops, and code.
- Sonnet (claude-sonnet-4-6): the balanced workhorse for most production traffic.
- Haiku (claude-haiku-4-5-20251001): fastest and cheapest, ideal for high-volume classification, extraction, and routing.
A common architecture is to route cheap, well-defined tasks to Haiku and escalate ambiguous or high-stakes work to Sonnet or Opus. Because they share the same API surface, swapping the model string is usually the only change required.
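As a rough sketch of that routing idea, the function below maps a task label to a model ID. The task labels, the high_stakes flag, and the name pick_model are illustrative assumptions rather than anything the API defines; the model IDs are the ones listed above.

```python
# Model IDs as listed in this guide.
HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-8"

# Hypothetical labels for cheap, well-defined work.
CHEAP_TASKS = {"classification", "extraction", "routing"}


def pick_model(task_type: str, high_stakes: bool = False) -> str:
    """Return a model ID, escalating when the work is risky or open-ended."""
    if high_stakes:
        return OPUS
    if task_type in CHEAP_TASKS:
        return HAIKU
    # Anything not explicitly marked cheap goes to the balanced tier.
    return SONNET
```

The returned string can be passed straight through as the model argument, so the rest of the request stays the same whichever tier is chosen.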
System prompts: setting the ground rules
The system prompt is a separate top-level parameter — not a message with role: "system". It establishes persistent instructions, persona, and constraints that apply to the whole conversation.
    from anthropic import Anthropic

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        # system is a top-level parameter, not an entry in messages
        system=(
            "You are a senior Python reviewer. Respond only with actionable "
            "feedback in bullet points. Never rewrite the whole file."
        ),
        messages=[{"role": "user", "content": "Review this function: ..."}],
    )
    print(resp.content[0].text)
Practical advice for system prompts:
- Be specific about format. "Return valid JSON with keys summary and risks" beats "summarize this."
- State what not to do. Negative constraints ("do not invent citations") are as important as positive ones.
- Put stable content first. If you reuse a long system prompt across many requests, mark it with prompt caching so you don't pay full input cost every time (see below).
- Keep persona and task separate. Persona in the system prompt; the actual data and question in the user message. This keeps your prompts reusable.
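The format and negative-constraint tips can be sketched as a system prompt that names the exact JSON keys, paired with a small check on the reply. REVIEW_SYSTEM_PROMPT and parse_review are hypothetical names, and the prompt wording is only one possible phrasing; the key names come from the example in the list above.

```python
import json

# Hypothetical system prompt: names the output keys and states what not to do.
REVIEW_SYSTEM_PROMPT = (
    "You are a senior Python reviewer. "
    'Return valid JSON with keys "summary" (string) and "risks" (list of strings). '
    "Do not invent citations. Do not wrap the JSON in prose or code fences."
)

REQUIRED_KEYS = {"summary", "risks"}


def parse_review(reply_text: str) -> dict:
    """Parse the model's reply and confirm the promised keys are present."""
    data = json.loads(reply_text)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data
```

In use, the prompt would go in the system parameter and the reply text (for example resp.content[0].text) would be handed to parse_review, so a malformed answer fails loudly instead of propagating downstream.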
You can pass system as a plain string or as a list of content blocks. The block form is what you need for caching.
Streaming: responsive output for users
For anything a human waits on, stream. Instead of blocking until the full response is generated, you receive incremental events as tokens are produced. Total generation time is unchanged, but the user sees the first tokens almost immediately, which cuts perceived latency.
    # Reuses the client created in the system prompt example.
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain TCP slow start."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        # Read the accumulated message before leaving the context manager.
        final = stream.get_final_message()
In TypeScript the shape is similar:
    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

    const stream = client.messages.stream({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Explain TCP slow start." }],
    });
    stream.on("text", (delta) => process.stdout.write(delta));
    const final = await stream.finalMessage();
Things to know about streaming:
- The raw event stream is server-sent events (SSE). The SDK helpers (text_stream, .on("text")) hide the plumbing, but under the hood you get events such as message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
- Always capture the final message. You need it for the stop_reason, token usage, and, critically, any tool calls the model made.
- Streaming and tools compose. Tool inputs arrive as input_json_delta events that you accumulate into a complete JSON object by the time the block closes.
- For very long generations, streaming also avoids request timeouts that a single blocking call might hit.
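To make the accumulation step concrete, here is a minimal sketch that rebuilds tool inputs from partial JSON fragments. It assumes the events are plain dicts shaped like the raw SSE payloads (the SDK exposes typed objects instead), and collect_tool_inputs and the sample values are made up for illustration.

```python
import json


def collect_tool_inputs(events):
    """Rebuild each tool_use block's input from streamed partial JSON fragments."""
    pending = {}   # content block index -> {"id", "name", "chunks"}
    finished = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = event["content_block"]
            if block["type"] == "tool_use":
                pending[event["index"]] = {
                    "id": block["id"], "name": block["name"], "chunks": [],
                }
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "input_json_delta" and event["index"] in pending:
                pending[event["index"]]["chunks"].append(delta["partial_json"])
        elif kind == "content_block_stop" and event["index"] in pending:
            call = pending.pop(event["index"])
            raw = "".join(call["chunks"])
            # The fragments only form valid JSON once the block has closed.
            # An empty string (no arguments streamed) is treated as {}.
            finished.append({
                "id": call["id"],
                "name": call["name"],
                "input": json.loads(raw) if raw else {},
            })
    return finished


# Simulated events; the id and fragment boundaries are invented.
SAMPLE_EVENTS = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Os'}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": 'aka"}'}},
    {"type": "content_block_stop", "index": 0},
]

print(collect_tool_inputs(SAMPLE_EVENTS))
```

With the SDK helpers you rarely need this by hand, since the final message already carries the parsed tool calls; the sketch mainly shows why a fragment cannot be parsed on its own.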
Tools: letting Claude take actions
Tool use (function calling) is how you connect Claude to real systems — databases, search, calculators, your own APIs. You describe the tools; Claude decides when to call them and with what arguments; you execute them and return the results.
Define each tool with a name, a description, and a JSON Schema for its input:
    tools = [
        {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ]
The interaction is a loop. A single round trip looks like this:
    messages = [{"role": "user", "content": "What's the weather in Osaka?"}]
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

    if resp.stop_reason == "tool_use":
        tool_call = next(b for b in resp.content if b.type == "tool_use")
        result = run_get_weather(**tool_call.input)  # your real function
        # Echo the assistant turn, then answer it with a matching tool_result.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": str(result),
            }],
        })
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

    print(resp.content[0].text)
Key practices for tools:
- Descriptions are the interface. The model chooses tools based on the description and schema. Invest in clear, unambiguous descriptions and note edge cases ("returns null if the city is unknown").
- Echo the assistant's tool_use block back unchanged before appending the tool_result. The tool_use_id must match.
- Loop until stop_reason is not tool_use. Claude may chain multiple tool calls before answering. Wrap the whole thing in a while loop with a sane iteration cap.
- Use tool_choice to control behavior. Set it to {"type": "auto"} (default), {"type": "any"} to force some tool, or {"type": "tool", "name": "..."} to force a specific one, which is handy for structured extraction.
- Return errors as tool results, not exceptions. Set is_error: true in the tool_result block so the model can recover gracefully.
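Several of these practices can be combined into one capped loop. The sketch below is an assumption-laden outline rather than a reference implementation: run_tool_loop and the handlers mapping (tool name to a Python callable) are hypothetical, the client is passed in so any object with a messages.create method works, and a bounded for loop stands in for the while loop with a cap.

```python
def run_tool_loop(client, model, tools, handlers, messages, max_rounds=10):
    """Call the model until it stops requesting tools or the round cap is hit."""
    for _ in range(max_rounds):
        resp = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if resp.stop_reason != "tool_use":
            return resp  # final answer (or another terminal stop reason)

        # Echo the assistant turn back unchanged so the tool_use ids line up.
        messages.append({"role": "assistant", "content": resp.content})

        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            try:
                output = handlers[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output),
                })
            except Exception as exc:
                # Report the failure to the model instead of raising.
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": f"{type(exc).__name__}: {exc}",
                    "is_error": True,
                })
        # All results for this turn go back in a single user message.
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"tool loop exceeded {max_rounds} rounds")
```

A call might look like run_tool_loop(client, "claude-sonnet-4-6", tools, {"get_weather": run_get_weather}, messages). Catching every exception is a deliberate simplification here; a real service would likely distinguish retryable failures from programming errors.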
Prompt caching to cut cost and latency
If your system prompt, tool definitions, or a large document are reused across requests, mark them with cache_control to reuse the processed prefix:
    system=[{
        "type": "text",
        "text": LONG_STABLE_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},
    }]
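Cached reads are billed at a fraction of the normal input-token price and are processed faster, while the request that first writes the cache pays a modest premium. Ephemeral entries are short-lived (five minutes by default, refreshed each time they are hit), so caching pays off when the same prefix recurs frequently. Put the stable, repeated content at the front and the variable content (the user's actual query) at the end so as much prefix as possible stays cacheable.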
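One way to check that caching is actually taking effect is to look at the usage block on each response, which reports cache reads and cache writes separately from ordinary input tokens. The helper below is a small sketch; cache_summary is a hypothetical name, and it assumes a usage object exposing input_tokens, cache_read_input_tokens, and cache_creation_input_tokens.

```python
def cache_summary(usage) -> dict:
    """Summarize how much of one request's prompt was served from cache."""
    # The cache fields may be absent or None when caching is not in play.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens  # input tokens that were neither read from nor written to cache
    total = read + written + fresh
    return {
        "total_input_tokens": total,
        "cache_read_tokens": read,
        "cache_write_tokens": written,
        "hit_ratio": read / total if total else 0.0,
    }
```

Calling cache_summary(resp.usage) on consecutive requests should show a cache write on the first one and reads afterwards; a hit ratio that stays at zero usually means the prefix is changing between calls or is too short to be cached.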
Putting it together
A production request often uses all three features at once: a cached system prompt that sets the persona and rules, a set of tools the model can invoke, and streaming so the user sees output immediately. The mental model is simple — the system prompt shapes how Claude behaves, tools define what it can do, and streaming controls how you deliver the result.
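A combined request could be assembled along these lines. This is only a sketch: build_production_request is a hypothetical helper, and it simply returns keyword arguments so the same dict can be handed to either the blocking or the streaming call.

```python
def build_production_request(stable_instructions: str, tools: list, user_query: str) -> dict:
    """Assemble kwargs: cached system prompt first, tools, then the variable query."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        # Block form of system, marked so the stable prefix can be cached.
        "system": [{
            "type": "text",
            "text": stable_instructions,
            "cache_control": {"type": "ephemeral"},
        }],
        "tools": tools,
        # The part that changes per request goes last.
        "messages": [{"role": "user", "content": user_query}],
    }


# Usage sketch (needs a real client and API key, so it is left commented out):
# kwargs = build_production_request(LONG_STABLE_INSTRUCTIONS, tools, "What's the weather in Osaka?")
# with client.messages.stream(**kwargs) as stream:
#     for text in stream.text_stream:
#         print(text, end="", flush=True)
#     final = stream.get_final_message()  # check stop_reason for tool_use here
```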
FAQ
Is the system prompt a message?
No. It's the separate system parameter. Don't put a role: "system" object in the messages array — that isn't a supported role.
Can I stream and use tools at the same time?
Yes. Tool inputs arrive incrementally as input_json_delta events. Accumulate them and read the finalized tool calls from the final message once the stream completes.
How do I force Claude to always call a tool?
Set tool_choice to {"type": "any"} to require some tool, or {"type": "tool", "name": "..."} to require a specific one. This is the cleanest way to get structured JSON output.
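As an illustration of forcing a specific tool for structured output, the sketch below defines an extraction tool and reads the arguments back from the response. The tool record_contact, its schema, and both helper names are invented for this example.

```python
# Hypothetical extraction tool; its input schema doubles as the output format.
EXTRACT_TOOL = {
    "name": "record_contact",
    "description": "Record a contact extracted from free text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
        },
        "required": ["name", "email"],
    },
}


def extraction_kwargs(text: str) -> dict:
    """Build request kwargs that require the model to call the extraction tool."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 512,
        "tools": [EXTRACT_TOOL],
        "tool_choice": {"type": "tool", "name": EXTRACT_TOOL["name"]},
        "messages": [{"role": "user", "content": text}],
    }


def extracted_input(resp) -> dict:
    """Return the forced tool call's arguments, which are the structured result."""
    for block in resp.content:
        if block.type == "tool_use" and block.name == EXTRACT_TOOL["name"]:
            return block.input
    raise ValueError("response contained no call to the extraction tool")
```

Here the tool is never executed; its arguments are the result, so extracted_input(client.messages.create(**extraction_kwargs(text))) yields the parsed dict directly.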
What should I return when a tool fails?
Return a tool_result block with is_error: true and a short message describing what went wrong. Claude can then retry, pick a different tool, or explain the failure to the user.
How do I stop an infinite tool loop?
Cap the number of tool round-trips in your own loop (for example, 10 iterations) and break out with a fallback message. Also inspect stop_reason — once it's end_turn, you have a final answer.
Which model should I start with?
Start with Sonnet (claude-sonnet-4-6) for general development. Drop to Haiku for high-volume, well-scoped tasks, and move to Opus (claude-opus-4-8) when you hit reasoning or coding limits.
Does streaming change how I'm billed?
No. Billing is based on input and output tokens regardless of whether you stream. Streaming only changes delivery, not cost.