The Claude API from Anthropic gives you three levers that do most of the heavy lifting in any real application: streaming for responsive output, tools for letting the model act on the world, and system prompts for shaping how the model behaves. This guide walks through each one with practical patterns you can drop into production code.

All examples use the Messages API and the official SDKs (anthropic for Python, @anthropic-ai/sdk for TypeScript). Set your key via the ANTHROPIC_API_KEY environment variable rather than hardcoding it.

Choosing a model first

Before you write a request, pick a model. The current lineup is optimized for different trade-offs:

Opus (claude-opus-4-8) — the most capable tier, best for hard reasoning, complex agentic loops, and code.
Sonnet (claude-sonnet-4-6) — the balanced workhorse for most production traffic.
Haiku (claude-haiku-4-5-20251001) — fastest and cheapest, ideal for high-volume classification, extraction, and routing.

A common architecture is to route cheap, well-defined tasks to Haiku and escalate ambiguous or high-stakes work to Sonnet or Opus. Because they share the same API surface, swapping the model string is usually the only change required.
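The routing idea above can be sketched as a small lookup. This is a minimal illustration, not a recommended policy: the model IDs are copied from the lineup above, while the task categories and the `pick_model` helper are hypothetical names chosen for the example.

```python
# Model IDs are copied from the lineup above; the task categories are hypothetical.
HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-8"

# Well-defined, high-volume task kinds that can go to the cheapest tier.
CHEAP_TASKS = {"classification", "extraction", "routing"}


def pick_model(task_kind: str, high_stakes: bool = False) -> str:
    """Return a model ID for a task, escalating when the stakes are high."""
    if high_stakes:
        return OPUS
    if task_kind in CHEAP_TASKS:
        return HAIKU
    return SONNET
```

Because only the model string changes, the result can be passed straight into the `model` field of any request in this guide.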

System prompts: setting the ground rules

The system prompt is a separate top-level parameter — not a message with role: "system". It establishes persistent instructions, persona, and constraints that apply to the whole conversation.

from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=(
        "You are a senior Python reviewer. Respond only with actionable "
        "feedback in bullet points. Never rewrite the whole file."
    ),
    messages=[{"role": "user", "content": "Review this function: ..."}],
)
print(resp.content[0].text)

Practical advice for system prompts:

Be specific about format. "Return valid JSON with keys summary and risks" beats "summarize this."
State what not to do. Negative constraints ("do not invent citations") are as important as positive ones.
Put stable content first. If you reuse a long system prompt across many requests, mark it with prompt caching so you don't pay full input cost every time (see below).
Keep persona and task separate. Persona in the system prompt; the actual data and question in the user message. This keeps your prompts reusable.
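When the system prompt asks for JSON with specific keys, it is worth checking the reply before trusting it. The sketch below assumes the model returns a bare JSON object with no surrounding prose; `parse_reply` and `REQUIRED_KEYS` are hypothetical names, and the keys come from the example instruction above.

```python
import json

REQUIRED_KEYS = {"summary", "risks"}  # keys named in the example instruction above


def parse_reply(text: str) -> dict:
    """Parse a model reply that should be a JSON object with the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data
```

A failed check is a reasonable trigger for one retry with the error message appended to the conversation.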

You can pass system as a plain string or as a list of content blocks. The block form is what you need for caching.
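As a rough illustration of the two forms, the helpers below convert a plain string into the block form and mark the last block as cacheable. Both function names are hypothetical; the block shape and the `cache_control` marker follow the caching section of this guide.

```python
from typing import Union


def as_system_blocks(system: Union[str, list]) -> list:
    """Normalize a system prompt to the list-of-blocks form."""
    if isinstance(system, str):
        return [{"type": "text", "text": system}]
    return [dict(block) for block in system]  # shallow copies; caller's list is untouched


def mark_cacheable(blocks: list) -> list:
    """Return copies of the blocks with a cache marker on the last one."""
    out = [dict(block) for block in blocks]
    if out:
        out[-1]["cache_control"] = {"type": "ephemeral"}
    return out
```

The result of either helper can be passed as the `system` argument in place of the plain string.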

Streaming: responsive output for users

For anything a human waits on, stream. Instead of blocking until the full response is generated, you receive incremental events as tokens are produced. Total generation time stays about the same, but the first words appear almost immediately, which cuts perceived latency.

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain TCP slow start."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Read the final message while the stream is still open.
    final = stream.get_final_message()

In TypeScript the shape is similar:

const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain TCP slow start." }],
});

stream.on("text", (delta) => process.stdout.write(delta));
const final = await stream.finalMessage();

Things to know about streaming:

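The raw event stream is server-sent events (SSE). The SDK helpers (text_stream, .on("text")) hide the plumbing, but under the hood each response is a message_start event, then content_block_start, content_block_delta, and content_block_stop events for every content block, then message_delta and message_stop, with occasional ping events in between.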
Always capture the final message. You need it for the stop_reason, token usage, and — critically — any tool calls the model made.
Streaming and tools compose. Tool inputs arrive as input_json_delta events that you accumulate into a complete JSON object by the time the block closes.
For very long generations, streaming also avoids request timeouts that a single blocking call might hit.
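To make the accumulation of input_json_delta events concrete, here is a minimal sketch that folds already-decoded events into text and tool calls. The SDK does this for you; the sketch only illustrates the mechanics for readers consuming raw SSE. `collect_stream` is a hypothetical helper, and it assumes each event has been parsed into a dict.

```python
import json


def collect_stream(events):
    """Fold raw stream events into (text, tool_calls).

    `events` is an iterable of dicts shaped like the decoded SSE payloads.
    """
    text_parts = []
    open_tools = {}  # block index -> {"id", "name", "json": [fragments]}
    tool_calls = []

    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = event["content_block"]
            if block["type"] == "tool_use":
                open_tools[event["index"]] = {
                    "id": block["id"], "name": block["name"], "json": [],
                }
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                text_parts.append(delta["text"])
            elif delta["type"] == "input_json_delta":
                open_tools[event["index"]]["json"].append(delta["partial_json"])
        elif kind == "content_block_stop":
            tool = open_tools.pop(event["index"], None)
            if tool is not None:
                raw = "".join(tool["json"])
                # An empty string means the tool was called with no arguments.
                tool_calls.append({
                    "id": tool["id"],
                    "name": tool["name"],
                    "input": json.loads(raw) if raw else {},
                })

    return "".join(text_parts), tool_calls
```

The JSON fragments are only parsed once the block closes, because an individual partial_json chunk is usually not valid JSON on its own.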

Tools: letting Claude take actions

Tool use (function calling) is how you connect Claude to real systems — databases, search, calculators, your own APIs. You describe the tools; Claude decides when to call them and with what arguments; you execute them and return the results.

Define each tool with a name, a description, and a JSON Schema for its input:

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]
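The loop in this section calls run_get_weather, which stands for your own implementation. For trying the flow locally, a hypothetical stub could look like the following; the readings are made up and exist only so the example runs without a real weather service.

```python
# Hypothetical stand-in for a real weather lookup; the readings are made up.
_FAKE_READINGS_C = {"osaka": 21.0, "oslo": 4.0}


def run_get_weather(city: str, unit: str = "celsius") -> dict:
    """Return a canned reading that matches the get_weather input schema."""
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unsupported unit: {unit}")
    temp_c = _FAKE_READINGS_C.get(city.strip().lower())
    if temp_c is None:
        raise KeyError(f"unknown city: {city}")
    temp = temp_c if unit == "celsius" else temp_c * 9 / 5 + 32
    return {"city": city, "temperature": round(temp, 1), "unit": unit}
```

The parameter names mirror the schema's properties, so the tool input can be unpacked directly with `**`.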

The interaction is a loop:

messages = [{"role": "user", "content": "What's the weather in Osaka?"}]

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

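if resp.stop_reason == "tool_use":
    # Claude can request several tools in one turn; every tool_use block
    # needs a matching tool_result in the next user message.
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = run_get_weather(**block.input)   # your real function
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })

    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": tool_results})

    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

# The reply can hold several blocks; print only the text ones.
print("".join(b.text for b in resp.content if b.type == "text"))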

Key practices for tools:

Descriptions are the interface. The model chooses tools based on the description and schema. Invest in clear, unambiguous descriptions and note edge cases ("returns null if the city is unknown").
Echo the assistant's tool_use block back unchanged before appending the tool_result. The tool_use_id must match.
Loop until stop_reason is not tool_use. Claude may chain multiple tool calls before answering. Wrap the whole thing in a while loop with a sane iteration cap.
Use tool_choice to control behavior. Set it to {"type": "auto"} (default), {"type": "any"} to force some tool, or {"type": "tool", "name": "..."} to force a specific one — handy for structured extraction.
Return errors as tool results, not exceptions. Set is_error: true in the tool_result block so the model can recover gracefully.
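The practices above can be combined into one capped loop. This is a sketch under stated assumptions rather than a reference implementation: `run_tool_loop` and the `handlers` registry are hypothetical, `create` is expected to be a callable such as client.messages.create, and the response objects are assumed to expose stop_reason and content blocks as in the earlier examples.

```python
def run_tool_loop(create, handlers, messages, max_rounds=10, **request):
    """Call `create` repeatedly until the model stops asking for tools.

    create:   callable such as client.messages.create
    handlers: dict mapping tool name to a Python function (hypothetical registry)
    request:  remaining request fields (model, max_tokens, tools, ...)
    """
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_rounds):
        resp = create(messages=messages, **request)
        if resp.stop_reason != "tool_use":
            return resp, messages

        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            try:
                output = handlers[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output),
                })
            except Exception as exc:  # report the failure to the model instead of crashing
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": f"{type(exc).__name__}: {exc}",
                    "is_error": True,
                })

        # Echo the assistant turn unchanged, then answer every tool_use block.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"no final answer after {max_rounds} tool rounds")
```

Passing `create` in as an argument keeps the loop testable with a fake and leaves the choice of client, model, and tools to the caller.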

Prompt caching to cut cost and latency

If your system prompt, tool definitions, or a large document are reused across requests, mark them with cache_control to reuse the processed prefix:

system=[{
    "type": "text",
    "text": LONG_STABLE_INSTRUCTIONS,
    "cache_control": {"type": "ephemeral"},
}]

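Cached reads are billed at a fraction of the normal input price and process faster. The request that first writes the cache pays a small premium over normal input, and an ephemeral entry expires after about five minutes without reuse by default, so caching pays off when the same prefix is requested again soon. Put the stable, repeated content at the front and the variable content (the user's actual query) at the end so as much prefix as possible stays cacheable.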
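One way to confirm that caching is working is to inspect the usage object on each response. The sketch below assumes the usage fields input_tokens, cache_creation_input_tokens, and cache_read_input_tokens, and that input_tokens counts only the uncached part of the prompt; `cache_stats` is a hypothetical helper.

```python
def cache_stats(usage) -> dict:
    """Summarize how much of a request's input was served from cache.

    `usage` is the usage object attached to a Messages API response.
    """
    fresh = getattr(usage, "input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = fresh + written + read
    return {
        "total_input_tokens": total,
        "cache_read_tokens": read,
        "cache_write_tokens": written,
        "hit_ratio": read / total if total else 0.0,
    }
```

A first request typically shows cache writes and no reads; if later requests with the same prefix still show no reads, the prefix is probably changing between calls.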

Putting it together

A production request often uses all three features at once: a cached system prompt that sets the persona and rules, a set of tools the model can invoke, and streaming so the user sees output immediately. The mental model is simple — the system prompt shapes how Claude behaves, tools define what it can do, and streaming controls how you deliver the result.

FAQ

Is the system prompt a message? No. It's the separate system parameter. Don't put a role: "system" object in the messages array — that isn't a supported role.

Can I stream and use tools at the same time? Yes. Tool inputs arrive incrementally as input_json_delta events. Accumulate them and read the finalized tool calls from the final message once the stream completes.

How do I force Claude to always call a tool? Set tool_choice to {"type": "any"} to require some tool, or {"type": "tool", "name": "..."} to require a specific one. This is the cleanest way to get structured JSON output.
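A forced tool call for structured extraction might be set up as sketched here. Both helpers are hypothetical, and the response is assumed to carry tool_use blocks as in the tools section; the tool definition itself is whatever schema you want the output to follow.

```python
def extraction_request(tool: dict, text: str, model: str = "claude-sonnet-4-6") -> dict:
    """Build request fields that force the model to call `tool` on `text`."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [tool],
        "tool_choice": {"type": "tool", "name": tool["name"]},
        "messages": [{"role": "user", "content": text}],
    }


def extracted_input(resp, tool_name: str) -> dict:
    """Return the arguments of the forced tool call from a response."""
    for block in resp.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise ValueError(f"response contains no {tool_name} call")
```

With a real client this would be used as `resp = client.messages.create(**extraction_request(tool, text))`, followed by `extracted_input(resp, tool["name"])`.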

What should I return when a tool fails? Return a tool_result block with is_error: true and a short message describing what went wrong. Claude can then retry, pick a different tool, or explain the failure to the user.

How do I stop an infinite tool loop? Cap the number of tool round-trips in your own loop (for example, 10 iterations) and break out with a fallback message. Also inspect stop_reason: end_turn means you have a final answer, while max_tokens means the reply was cut off before the model finished.

Which model should I start with? Start with Sonnet (claude-sonnet-4-6) for general development. Drop to Haiku for high-volume, well-scoped tasks, and move to Opus (claude-opus-4-8) when you hit reasoning or coding limits.

Does streaming change how I'm billed? No. Billing is based on input and output tokens regardless of whether you stream. Streaming only changes delivery, not cost.

Claude API Guide: Streaming, Tools and System Prompts
