OpenAI API with Python — Complete Guide (Chat, Streaming, Function Calling)

The OpenAI Python SDK is the fastest path from idea to production LLM app. This guide covers everything: basic chat completions, streaming responses, function calling, embeddings, vision, and patterns for handling rate limits and retries in production.

Installation and Setup

pip install openai

# Set API key (prefer environment variable over hardcoding)
export OPENAI_API_KEY="sk-..."
from openai import OpenAI

# Client reads OPENAI_API_KEY from environment automatically
client = OpenAI()

# Or pass explicitly (not recommended for production)
client = OpenAI(api_key="sk-...")

Basic Chat Completions

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators in 2 sentences."}
    ],
    temperature=0.7,      # 0 = near-deterministic (greedy), 2 = very random
    max_tokens=200,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Multi-turn Conversations

conversation = [
    {"role": "system", "content": "You are a Python tutor."}
]

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation,
    )

    assistant_reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

print(chat("What is a list comprehension?"))
print(chat("Can you show me an example with filtering?"))
# Context is preserved across turns

Streaming Responses

Streaming shows tokens as they're generated, improving perceived latency significantly for long responses:

def stream_chat(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    print()  # newline after stream ends

stream_chat("Write a haiku about Python programming.")

# For async streaming (FastAPI, etc.) use AsyncOpenAI
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_chat_async(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content

Function Calling (Tool Use)

Function calling lets the model decide when to invoke your Python functions, parse structured arguments, and incorporate results:

import json

# Define your functions as JSON Schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Berlin'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

# Your actual function
def get_weather(city: str, unit: str = "celsius") -> dict:
    # In reality, call a weather API
    return {"city": city, "temp": 22, "unit": unit, "condition": "sunny"}

def run_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    # First call -- model may request a tool
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    message = response.choices[0].message

    # If model called a tool, execute it and continue
    if message.tool_calls:
        messages.append(message)

        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)

            # Call the actual function
            if func_name == "get_weather":
                result = get_weather(**func_args)
            else:
                result = {"error": f"Unknown function: {func_name}"}

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

        # Second call -- model uses tool result to form final response
        final = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        return final.choices[0].message.content

    return message.content

print(run_with_tools("What's the weather in Tokyo right now?"))

Structured Output with Pydantic

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ArticleSummary(BaseModel):
    title: str
    key_points: list[str]
    sentiment: str  # positive, neutral, negative
    word_count_estimate: int

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # structured output requires this model+
    messages=[
        {"role": "system", "content": "Extract structured info from articles."},
        {"role": "user", "content": "Python 3.13 was released with a free-threaded mode that removes the GIL under a special build flag, enabling true multi-core parallelism for CPU-bound Python code."}
    ],
    response_format=ArticleSummary,
)

summary = response.choices[0].message.parsed
print(summary.title)       # Python 3.13 Introduces Free-Threaded Mode
print(summary.key_points)  # ['Removes GIL...', ...]
print(summary.sentiment)   # positive

Embeddings

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

# Batch embeddings (more efficient)
def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

# Cosine similarity search
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build a simple semantic search
documents = ["Python is a programming language", "Dogs are mammals", "FastAPI is fast"]
doc_embeddings = get_embeddings(documents)

query = "What coding language should I learn?"
query_embedding = get_embedding(query)

similarities = [(doc, cosine_similarity(query_embedding, emb))
                for doc, emb in zip(documents, doc_embeddings)]
best = max(similarities, key=lambda x: x[1])
print(f"Most relevant: {best[0]} (score: {best[1]:.3f})")

Vision (Image Analysis)

import base64
from pathlib import Path

def analyze_image_file(image_path: str, question: str) -> str:
    """Analyze a local image file."""
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    ext = Path(image_path).suffix.lstrip(".")
    mime = f"image/{ext}" if ext != "jpg" else "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime};base64,{image_data}",
                        "detail": "high"  # "low", "high", or "auto"
                    }
                }
            ]
        }]
    )
    return response.choices[0].message.content

result = analyze_image_file("screenshot.png", "What UI issues do you see in this design?")

Rate Limits and Retries

import time
from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI(
    max_retries=3,       # automatic retry on 429/5xx (built into the SDK)
    timeout=30.0,        # request timeout in seconds
)

# Manual exponential backoff for custom logic
def robust_completion(messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            wait = (2 ** attempt) + (0.1 * attempt)  # exponential backoff + jitter
            print(f"Rate limited. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except APIStatusError as e:
            # APIStatusError carries status_code; the base APIError does not
            if e.status_code in (500, 502, 503) and attempt < max_attempts - 1:
                time.sleep(2 ** attempt)
            else:
                raise

Cost Optimization Tips

  • Use gpt-4o-mini for simple tasks -- 15x cheaper than gpt-4o, suitable for classification, extraction, summarization
  • Limit max_tokens -- set it to what you actually need; if unset, the model may generate up to its maximum output length
  • Cache repeated prompts -- OpenAI's prompt caching gives 50% discount on repeated prompt prefixes >1024 tokens
  • Batch API -- 50% discount for non-real-time workloads via client.batches.create()
  • Embeddings: text-embedding-3-small -- 5x cheaper than ada-002, better quality
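To make the tradeoffs above concrete, here is a small sketch that estimates the cost of a request from the token counts in `response.usage`. The per-million-token prices in the table are illustrative assumptions for this example; verify them against the current OpenAI pricing page before relying on the numbers.

```python
# Illustrative per-1M-token prices in USD -- assumptions, check the
# current OpenAI pricing page before using these figures.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost of one request from the token counts in response.usage."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# Compare a 1,000-in / 500-out request on each model:
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 1000, 500):.6f}")
```

Under these assumed prices, the same request costs roughly 15-17x more on gpt-4o than on gpt-4o-mini, which is why routing simple tasks to the smaller model dominates most cost budgets.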

Frequently Asked Questions

Which model should I use: gpt-4o or gpt-4o-mini?

Start with gpt-4o-mini for most tasks (classification, summarization, extraction, simple Q&A). Upgrade to gpt-4o only when you need stronger reasoning, complex instructions, or multilingual accuracy. The cost difference is 15x.

How do I handle context window limits?

GPT-4o has a 128k token context window. For conversations exceeding this, implement a sliding window (keep last N messages) or summarize old context into a system message. Use tiktoken to count tokens before sending.
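A minimal sliding-window sketch, assuming you cap history by message count (for a token-accurate cap, count each message with tiktoken instead and trim until the total fits):

```python
def trim_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep any system messages plus the last `keep_last` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

# Simulate a long conversation
history = [{"role": "system", "content": "You are a tutor."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, keep_last=6)
print(len(trimmed))  # 7: the system message + last 6 messages
```

Trimming by message count is crude but predictable; summarizing dropped turns into the system message preserves more context at the cost of an extra API call.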

Is the OpenAI Python SDK thread-safe?

Yes. The OpenAI client is safe to share across threads. For async code, use AsyncOpenAI instead of OpenAI.

How do I test without spending API credits?

Use OpenAI's Evals framework for automated evaluation. For unit tests, mock the client with unittest.mock so no network calls are made. For integration tests, use a cheaper model like gpt-4o-mini with a small max_tokens.
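A sketch of the mocking approach: the function under test takes the client as a parameter, and the test substitutes a MagicMock whose attribute chain mirrors the real SDK (`client.chat.completions.create`), so no credits are spent. The `summarize` helper here is a hypothetical example, not part of the SDK.

```python
from unittest.mock import MagicMock

def summarize(client, text: str) -> str:
    """Hypothetical function under test -- calls the chat API."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

# Fake client whose create() returns a canned response object
fake_client = MagicMock()
fake_client.chat.completions.create.return_value = MagicMock(
    choices=[MagicMock(message=MagicMock(content="A short summary."))]
)

result = summarize(fake_client, "Long article text...")
assert result == "A short summary."
fake_client.chat.completions.create.assert_called_once()
```

Passing the client in (rather than importing a module-level singleton) is what makes this substitution trivial; with a global client you would need `unittest.mock.patch` instead.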
