# Responses
The responses endpoint is a stateful alternative to /v1/chat/completions. The gateway stores each turn server-side, so clients only need to send the new user message on subsequent requests — previous conversation history is reconstructed automatically from previous_response_id.
## Endpoint

```
POST /v1/responses
```

Requires authentication. See Authentication.
## Request body

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | The model ID to use. Get available IDs from List Models. |
| `input` | string or array | Yes | The new user message: either a plain string or a messages array (each item with `role` and `content`). |
| `previous_response_id` | string | No | The `id` of the previous response. When provided, the gateway prepends the full conversation history automatically. |
| `instructions` | string | No | A system prompt. On the first turn it is stored alongside the conversation; on subsequent turns it can be omitted and the original instructions are reused. |
| `tools` | array | No | Tool definitions the model may call. Same format as `/v1/chat/completions`. |
| `tool_choice` | string or object | No | `"auto"`, `"none"`, `"required"`, or a specific tool object. |
| `stream` | boolean | No | If `true`, the response is streamed using the event format described below. Default: `false`. |
| `store` | boolean | No | If `false`, the response is not persisted and `previous_response_id` chaining is disabled for this turn. Default: `true`. |
| `temperature` | number | No | Sampling temperature. |
| `max_output_tokens` | integer | No | Maximum number of tokens in the response. |
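Since optional fields should be omitted rather than sent as `null`, it can help to build the request body with a small helper. The sketch below is illustrative, not part of any official SDK; the function name and keyword arguments are our own:

```python
def build_responses_request(model, user_input, *, instructions=None,
                            previous_response_id=None, stream=None,
                            store=None, temperature=None,
                            max_output_tokens=None):
    """Build a JSON-serializable body for POST /v1/responses.

    `model` and `input` are required; every optional field is dropped
    entirely when not set, instead of being serialized as null.
    """
    body = {"model": model, "input": user_input}
    optional = {
        "instructions": instructions,
        "previous_response_id": previous_response_id,
        "stream": stream,
        "store": store,
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

For example, `build_responses_request("llama-3.1-8b", "What is the capital of France?", instructions="You are a helpful assistant.")` produces exactly the first-turn body shown below, ready to pass to any HTTP client as JSON.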
## Basic example

First turn (no `previous_response_id`):

```json
{
  "model": "llama-3.1-8b",
  "input": "What is the capital of France?",
  "instructions": "You are a helpful assistant."
}
```
Response:

```json
{
  "id": "resp_a1b2c3d4e5f6...",
  "object": "response",
  "created_at": 1718000000,
  "model": "llama-3.1-8b",
  "status": "completed",
  "output": [
    {
      "type": "message",
      "id": "abc123",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "The capital of France is Paris." }
      ]
    }
  ],
  "usage": {
    "input_tokens": 28,
    "output_tokens": 9,
    "total_tokens": 37
  }
}
```
Second turn (continuing the conversation):

```json
{
  "model": "llama-3.1-8b",
  "input": "And what about Germany?",
  "previous_response_id": "resp_a1b2c3d4e5f6..."
}
```
The gateway looks up the prior response, reconstructs the conversation history, and returns a response as if the full history had been sent.
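Conceptually, the reconstruction step is equivalent to assembling a chat-completions-style messages list from the stored turns. The sketch below illustrates those semantics only; the gateway's actual internal storage format is not documented here and this is not its real implementation:

```python
def reconstruct_history(stored_turns, instructions, new_input):
    """Illustrate the history the gateway rebuilds for a chained turn.

    stored_turns: prior (role, text) pairs saved with the conversation.
    instructions: the system prompt stored on the first turn, if any.
    new_input: the `input` string from the current request.
    Returns a chat-completions-style messages list.
    """
    messages = []
    if instructions:
        messages.append({"role": "system", "content": instructions})
    for role, text in stored_turns:
        messages.append({"role": role, "content": text})
    messages.append({"role": "user", "content": new_input})
    return messages
```

For the two-turn example above, the model would effectively see the system prompt, the Paris question and answer, and then "And what about Germany?" as the final user message.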
## Response object

| Field | Description |
|---|---|
| `id` | Unique response identifier (`resp_...`). Pass this as `previous_response_id` on the next turn. |
| `object` | Always `"response"`. |
| `created_at` | Unix timestamp. |
| `model` | The model that generated the response. |
| `status` | Always `"completed"` for non-streaming responses. |
| `output` | Array of output items. Each item has a `type` of `"message"` (text) or `"function_call"` (tool call). |
| `usage.input_tokens` | Prompt token count. |
| `usage.output_tokens` | Completion token count. |
| `usage.total_tokens` | Combined token count. |
## Tool calls

If the model requests a tool call, the `output` array contains a `function_call` item instead of a message:

```json
{
  "output": [
    {
      "type": "function_call",
      "call_id": "call_xyz",
      "name": "get_weather",
      "arguments": "{\"city\":\"London\"}"
    }
  ]
}
```
Execute the function and send the result back as the `input` on the next turn, setting `previous_response_id` to the `id` of the response that contained the tool call.
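The round trip can be sketched as follows. Note the `function_call_output` item shape is an assumption (mirroring the common Responses-style convention) and is not confirmed by this document; check what input format the gateway actually accepts for tool results:

```python
import json


def handle_function_calls(output_items, tool_registry):
    """Execute each function_call item and build the next-turn input.

    output_items: the `output` array from a response.
    tool_registry: maps tool names to local Python callables.
    Returns a list of result items to send back as `input`; the
    "function_call_output" shape here is an assumed convention.
    """
    results = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue  # plain messages need no tool execution
        fn = tool_registry[item["name"]]
        args = json.loads(item["arguments"])  # arguments arrive as a JSON string
        results.append({
            "type": "function_call_output",
            "call_id": item["call_id"],
            "output": json.dumps(fn(**args)),
        })
    return results
```

Matching results to calls by `call_id` lets the model pair each output with the request it made, even when several tools are called in one turn.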
## Streaming

Set `"stream": true` to receive the response as Server-Sent Events:

```
event: response.created
data: {"type":"response.created","response":{"id":"resp_...","status":"in_progress",...}}

event: response.output_item.added
data: {"type":"response.output_item.added","output_index":0,"item":{"type":"message","role":"assistant","content":[]}}

event: response.output_text.delta
data: {"type":"response.output_text.delta","output_index":0,"content_index":0,"delta":"The capital"}

event: response.output_text.delta
data: {"type":"response.output_text.delta","output_index":0,"content_index":0,"delta":" of France is Paris."}

event: response.output_text.done
data: {"type":"response.output_text.done","output_index":0,"content_index":0,"text":"The capital of France is Paris."}

event: response.output_item.done
data: {"type":"response.output_item.done","output_index":0,"item":{...}}

event: response.completed
data: {"type":"response.completed","response":{...}}

data: [DONE]
```
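A client consuming this stream typically filters for `data:` lines, stops at the `[DONE]` sentinel, and concatenates the `response.output_text.delta` payloads. A minimal sketch, operating on raw lines such as those produced by an HTTP client's line iterator:

```python
import json


def collect_output_text(sse_lines):
    """Accumulate assistant text from a /v1/responses SSE stream.

    sse_lines: an iterable of decoded lines from the event stream.
    Only `data:` lines are parsed; `event:` lines and blanks are
    skipped, and the terminal `[DONE]` sentinel ends the loop.
    """
    pieces = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip `event:` framing lines and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("type") == "response.output_text.delta":
            pieces.append(event["delta"])
    return "".join(pieces)
```

Applied to the example stream above, this yields the full sentence "The capital of France is Paris."; the `response.completed` event still carries the final response object (including `usage`) if you need it.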
## Opting out of persistence

Set `"store": false` to process the request without saving it. The response will not be retrievable via `previous_response_id`, and no conversation record will be created.
## Authentication note

This endpoint accepts both JWT bearer tokens and API keys. Conversations are linked to the authenticated identity: JWT tokens link to a user account, while API keys link to the key itself.