Enable streaming responses

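Models can sometimes take a while to generate a response. Setting the stream option to true returns the response as a stream of chunks, so you can display results incrementally instead of waiting for the entire response to be generated. Streaming output is supported for all hosted models. Streaming is especially recommended with reasoning models, because a non-streaming request can time out if the model thinks for too long before it starts producing output.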

Python
Bash

import openai

client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',
    api_key="<your-api-key>",  # Create an API key at https://wandb.ai/settings
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Tell me a rambling joke"}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    else:
        print(chunk)  # A chunk with no choices; shows the CompletionUsage object

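If you need the complete text once streaming finishes (for logging or post-processing, say), the loop above can be wrapped in a small helper. The following is a minimal sketch, not part of the API: `collect_stream` and `_fake_chunk` are hypothetical names, and the fake chunks only imitate the `choices[0].delta.content` and `usage` fields used in the example above so that the snippet runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the streamed text deltas and return (text, usage).

    `stream` is any iterable of chunk-like objects with a `choices` list
    and, on a chunk without choices, an optional `usage` attribute.
    """
    parts = []
    usage = None
    for chunk in stream:
        if chunk.choices:
            # delta.content can be None (for example on a role-only chunk)
            parts.append(chunk.choices[0].delta.content or "")
        elif getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Hypothetical stand-in for a streamed chunk, for offline illustration only."""
    choices = [] if text is None else [
        SimpleNamespace(delta=SimpleNamespace(content=text))
    ]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(", "),
    _fake_chunk("world"),
    _fake_chunk(usage={"total_tokens": 3}),  # made-up usage payload
]
text, usage = collect_stream(fake_stream)
print(text)
```

In real use you would pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.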
curl https://api.inference.wandb.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      { "role": "user", "content": "Tell me a rambling joke" }
    ],
    "stream": true
  }'
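With curl, the streamed body arrives as raw lines rather than parsed objects. The sketch below assumes the endpoint uses the server-sent events framing common to OpenAI-compatible APIs, where each event is a `data: <json>` line and the stream ends with `data: [DONE]`; this page does not confirm that framing, so treat it as an assumption. `parse_sse_lines` is a hypothetical helper and `sample_lines` is made-up input.

```python
import json


def parse_sse_lines(lines):
    """Yield decoded JSON payloads from `data: ...` lines until `[DONE]`.

    Assumes OpenAI-style server-sent events framing (an assumption,
    not something this page confirms).
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Made-up sample resembling a streamed response body
sample_lines = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    "",
    "data: [DONE]",
]

sse_text = "".join(
    event["choices"][0]["delta"].get("content") or ""
    for event in parse_sse_lines(sample_lines)
    if event.get("choices")
)
print(sse_text)
```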
