QYAI API Documentation

OpenAI-compatible API gateway to DeepSeek models. Drop-in replacement — just change the base URL and API key.

DeepSeek V4 · 1M context · OpenAI-compatible · Streaming (SSE)

Quick Start

Get up and running with QYAI in three steps.

1. Get your API key

Contact us at qiyuanai@163.com to get an API key. Keys start with sk-.

2. Configure your client

Point your OpenAI SDK, or any OpenAI-compatible client, at the QYAI base URL: http://121.41.214.247/v1.

3. Send your first request

Use any model ID from Available Models below, including the compatibility aliases, as in this example:

Python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-qyai-key",
    base_url="http://121.41.214.247/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello, QYAI!"}]
)

print(response.choices[0].message.content)

Authentication

All API requests require authentication via a Bearer token in the Authorization header.

HTTP Header
Authorization: Bearer sk-your-qyai-key

Keep your API key secret. Never expose it in client-side code or public repositories. If your key is compromised, regenerate it immediately from the dashboard.
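One way to keep the key out of source code is to read it from an environment variable at runtime and build the Authorization header from it. The sketch below does that with the Python standard library; the variable name QYAI_API_KEY and the helper functions are hypothetical, not part of the QYAI API.

```python
import os
import urllib.request

BASE_URL = "http://121.41.214.247/v1"


def load_api_key(env_var: str = "QYAI_API_KEY") -> str:
    """Read the API key from the environment (QYAI_API_KEY is a hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} to your QYAI API key")
    if not key.startswith("sk-"):
        raise ValueError("QYAI API keys start with 'sk-'")
    return key


def auth_headers(api_key: str) -> dict:
    """Headers for every request: the Bearer token plus a JSON content type."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated GET request, e.g. for /models."""
    return urllib.request.Request(BASE_URL + path, headers=auth_headers(api_key))
```

To send it, pass the request to urllib.request.urlopen, for example urlopen(build_request("/models", load_api_key())).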

Available Models

Official Models

| Model ID | Description | Context | Thinking |
|---|---|---|---|
| deepseek-v4-pro | Flagship reasoning model for complex reasoning, code, and agent workflows | 1M | Supported |
| deepseek-v4-flash | High-efficiency model for daily tasks and high-throughput scenarios | 1M | Supported |

Compatibility Aliases

| Alias | Maps To | Notes |
|---|---|---|
| deepseek-chat | deepseek-v4-flash | Legacy name, non-thinking mode |
| deepseek-reasoner | deepseek-v4-flash | Legacy name, thinking mode enabled |
| gpt-4o | deepseek-v4-flash | OpenAI-compatible alias |
| gpt-4o-mini | deepseek-v4-flash | OpenAI-compatible alias |

Model Comparison

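| Model | Context | Max Output | Thinking Mode | Best For | Input Price (cache miss) | Output Price |
|---|---|---|---|---|---|---|
| deepseek-v4-pro | 1M | 384K | Yes | Complex reasoning, code, agents | $2.80 / 1M tokens | $5.60 / 1M tokens |
| deepseek-v4-flash | 1M | 384K | Yes | Daily tasks, high throughput | $0.22 / 1M tokens | $0.44 / 1M tokens |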

Chat Completions

Send a chat completion request to generate a response from a model.

HTTP
POST http://121.41.214.247/v1/chat/completions

Parameters

| Parameter | Required | Type | Default | Description |
|---|---|---|---|---|
| model | Yes | string | n/a | Model ID to use (e.g. deepseek-v4-flash, deepseek-v4-pro) |
| messages | Yes | array | n/a | Array of message objects, each with role and content |
| stream | No | boolean | false | Enable Server-Sent Events (SSE) streaming |
| temperature | No | number | 1 | Sampling temperature, 0 to 2; higher values give more random output |
| max_tokens | No | integer | 384K | Maximum number of tokens to generate in the completion |
| top_p | No | number | 1 | Nucleus sampling threshold, 0 to 1 |
| frequency_penalty | No | number | 0 | −2 to 2; positive values penalize tokens in proportion to how often they have already appeared |
| presence_penalty | No | number | 0 | −2 to 2; positive values penalize tokens that have already appeared at all, encouraging new topics |
| stop | No | string or array | null | Up to 4 sequences at which generation stops |
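The limits in this table can also be checked on the client before a request is sent, which surfaces mistakes earlier than a 400 response. The following is a minimal sketch; build_chat_body is a hypothetical helper, and the server still performs its own validation.

```python
import json

# Documented ranges from the Parameters table above.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def build_chat_body(model, messages, **options):
    """Return a JSON body for POST /v1/chat/completions after checking documented limits.

    Hypothetical client-side helper, not part of the QYAI API.
    """
    if not model:
        raise ValueError("model is required")
    if not messages or not all("role" in m and "content" in m for m in messages):
        raise ValueError("messages must be a non-empty list of {role, content} objects")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    stop = options.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        raise ValueError("stop accepts at most 4 sequences")
    # Only options the caller set are sent; omitted ones use the server defaults.
    return json.dumps({"model": model, "messages": messages, **options})
```

Leaving unset parameters out of the body, rather than filling in the defaults from the table, keeps the server as the single source of truth for defaults.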

Request Example

JSON
{
  "model": "deepseek-v4-flash",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one paragraph."}
  ],
  "temperature": 0.7,
  "stream": false
}

Response Example

JSON
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746000000,
  "model": "deepseek-v4-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing harnesses the principles..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 87,
    "total_tokens": 105
  }
}
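If you call the endpoint without an SDK, the fields you usually need are the first choice's content, its finish_reason, and the usage counts. A minimal sketch, assuming the response body has exactly the shape of the example above; summarize_completion is a hypothetical helper name.

```python
import json


def summarize_completion(raw: str) -> dict:
    """Extract the commonly used fields from a non-streaming chat completion body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body.get("usage", {})
    return {
        "content": choice["message"]["content"],
        # "stop" means the model finished naturally; in the OpenAI format,
        # "length" means generation was cut off by max_tokens.
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }
```

In the example response, total_tokens is the sum of the other two counts: 18 + 87 = 105.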

Thinking Mode

Both deepseek-v4-pro and deepseek-v4-flash support thinking mode, which returns the model's chain-of-thought reasoning via the reasoning_content field before the final answer.

How it works: When thinking mode is active, the response message object includes a reasoning_content field containing the model's internal reasoning, followed by content with the final answer.

Response with Thinking

JSON
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "Let me think about this step by step...\nFirst, I need to consider...\nTherefore, the answer is...",
        "content": "The answer is 42."
      }
    }
  ]
}

Important: When continuing a multi-turn conversation, do not include reasoning_content in subsequent messages. Only include role and content from assistant responses in the conversation history.
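A minimal sketch of following that rule when building conversation history from message objects shaped like the example above: the reasoning is returned separately (for display or logging) and only role and content are appended. The helper names are hypothetical.

```python
def split_reply(message: dict) -> tuple:
    """Separate the chain-of-thought (may be absent) from the final answer."""
    return message.get("reasoning_content"), message["content"]


def append_assistant_turn(history: list, message: dict) -> list:
    """Append an assistant reply to the history, keeping only role and content.

    reasoning_content is deliberately dropped, as required for multi-turn requests.
    """
    history.append({"role": message["role"], "content": message["content"]})
    return history
```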

Using the deepseek-reasoner Alias

The legacy alias deepseek-reasoner automatically maps to deepseek-v4-flash with thinking mode enabled. This provides backward compatibility for existing integrations.
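With the OpenAI Python SDK, the alias is used like any other model name. Because reasoning_content is not a declared field of the SDK's message type, the sketch below reads it with getattr and a default, so it returns None if the field is absent. ask_reasoner is a hypothetical helper; client is assumed to be an openai.OpenAI instance configured with the QYAI base URL.

```python
def ask_reasoner(client, prompt: str):
    """Send one prompt through the deepseek-reasoner alias; return (reasoning, answer).

    `client` is assumed to be an OpenAI SDK client pointed at the QYAI base_url.
    """
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": prompt}],
    )
    message = response.choices[0].message
    # reasoning_content is a non-standard field, so fall back to None if missing.
    return getattr(message, "reasoning_content", None), message.content
```

Usage: create client = OpenAI(api_key="sk-your-qyai-key", base_url="http://121.41.214.247/v1"), then call reasoning, answer = ask_reasoner(client, "Is 97 prime?").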

Streaming

Enable streaming by setting "stream": true in your request. The server then returns Server-Sent Events (SSE): each event is a line beginning with data: followed by a JSON chunk, and events are separated by blank lines. The stream ends with data: [DONE].

Streaming Response Format

SSE
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1746000000,"model":"deepseek-v4-flash","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1746000000,"model":"deepseek-v4-flash","choices":[{"index":0,"delta":{"content":"Quantum"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1746000000,"model":"deepseek-v4-flash","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1746000000,"model":"deepseek-v4-flash","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

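Tip: If you set stream_options (in the OpenAI API, "stream_options": {"include_usage": true}), the last chunk before [DONE] may include a usage field with token counts. In the OpenAI format, that chunk has an empty choices array.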
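Without an SDK, a stream can be reassembled by reading data: lines, stopping at [DONE], and concatenating each delta's content. The sketch below does this for lines like those in the example above; collect_stream is a hypothetical helper, and it tolerates chunks with an empty choices list.

```python
import json


def collect_stream(lines):
    """Rebuild the full reply from raw SSE lines.

    Returns (text, finish_reason, usage); usage stays None unless a chunk carries it.
    """
    parts, finish_reason, usage = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):  # empty on a usage-only chunk
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason, usage
```

With urllib, an HTTP response object iterates line by line as bytes, so collect_stream(line.decode("utf-8") for line in resp) is one way to feed it.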

List Models

Retrieve the list of available models.

HTTP
GET http://121.41.214.247/v1/models

Response

JSON
{
  "object": "list",
  "data": [
    {
      "id": "deepseek-v4-pro",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    },
    {
      "id": "deepseek-v4-flash",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    },
    {
      "id": "deepseek-chat",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    },
    {
      "id": "deepseek-reasoner",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    },
    {
      "id": "gpt-4o",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    },
    {
      "id": "gpt-4o-mini",
      "object": "model",
      "created": 1743465600,
      "owned_by": "qyai"
    }
  ]
}
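A minimal sketch of fetching this list with the Python standard library and pulling out the IDs, for example to check that a model name is available before using it. The helper names are hypothetical; fetch_model_ids sends a real network request.

```python
import json
import urllib.request


def model_ids(body: dict) -> list:
    """Return the id of every model entry in a list-models response."""
    return [m["id"] for m in body.get("data", []) if m.get("object") == "model"]


def fetch_model_ids(api_key: str, base_url: str = "http://121.41.214.247/v1") -> list:
    """GET /v1/models with Bearer authentication and return the model IDs."""
    req = urllib.request.Request(
        f"{base_url}/models", headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return model_ids(json.load(resp))
```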

Billing Usage

Query your API key's usage statistics.

HTTP
GET http://121.41.214.247/v1/dashboard/billing/usage

Response

JSON
{
  "object": "billing.usage",
  "key_prefix": "sk-abc...",
  "tier": "pro",
  "monthly_requests": 142,
  "monthly_limit": 3000,
  "remaining": 2858,
  "usage_percent": 4.7,
  "total_prompt_tokens": 52340,
  "total_completion_tokens": 12890
}
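The fields in this response are related: remaining is monthly_limit minus monthly_requests (3000 − 142 = 2858), and usage_percent is their ratio (142 / 3000 ≈ 4.7%). The sketch below recomputes them and flags when the quota runs low; quota_status and its warn_at threshold are hypothetical client-side additions, not API settings.

```python
def quota_status(usage: dict, warn_at: float = 80.0) -> str:
    """Summarize a billing.usage response and flag a nearly exhausted quota."""
    used, limit = usage["monthly_requests"], usage["monthly_limit"]
    percent = round(100 * used / limit, 1) if limit else 0.0
    remaining = limit - used
    state = "WARN" if percent >= warn_at else "OK"
    return f"{state}: {used}/{limit} requests used ({percent}%), {remaining} remaining"
```

For the example response above, this returns "OK: 142/3000 requests used (4.7%), 2858 remaining".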

Pricing

Simple, transparent pricing. Pay only for what you use — no minimums, no commitments.

| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| deepseek-v4-flash | $0.22 / 1M tokens | $0.005 / 1M tokens | $0.44 / 1M tokens |
| deepseek-v4-pro | $2.80 / 1M tokens | $0.024 / 1M tokens | $5.60 / 1M tokens |

Cache hit pricing: When your prompt prefix matches a previously cached prefix, you benefit from drastically reduced input pricing. Cache hits are automatic — no configuration needed.
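The cost of a request is the sum of three parts: uncached prompt tokens at the cache-miss rate, cached prompt tokens at the cache-hit rate, and completion tokens at the output rate. A minimal estimator using the prices above; it assumes you know how many prompt tokens were cache hits, and it assumes the aliases are billed at the deepseek-v4-flash rate they map to. estimate_cost is a hypothetical helper.

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "deepseek-v4-flash": {"miss": 0.22, "hit": 0.005, "output": 0.44},
    "deepseek-v4-pro": {"miss": 2.80, "hit": 0.024, "output": 5.60},
}

# Assumption: aliases are billed like the model they map to.
ALIASES = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
    "gpt-4o": "deepseek-v4-flash",
    "gpt-4o-mini": "deepseek-v4-flash",
}


def estimate_cost(model, prompt_tokens, cached_tokens, completion_tokens):
    """Estimate the USD cost of one request; cached_tokens is the cache-hit part of the prompt."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    p = PRICES[ALIASES.get(model, model)]
    miss_tokens = prompt_tokens - cached_tokens
    return (
        miss_tokens * p["miss"]
        + cached_tokens * p["hit"]
        + completion_tokens * p["output"]
    ) / 1_000_000
```

For example, a deepseek-v4-flash request with a 100,000-token prompt of which 80,000 tokens hit the cache, plus 2,000 output tokens, costs (20,000 × 0.22 + 80,000 × 0.005 + 2,000 × 0.44) / 1,000,000 = $0.00568.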

Price Comparison

See how QYAI compares to other major API providers (input / output per 1M tokens).

| Provider | Model | Input / 1M | Output / 1M | Context |
|---|---|---|---|---|
| QYAI | V4-Flash | $0.22 | $0.44 | 1M |
| QYAI | V4-Pro | $2.80 | $5.60 | 1M |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |

Error Codes

The API returns standard HTTP status codes. Error responses include a JSON body with details.

| Code | Type | Description |
|---|---|---|
| 400 | invalid_request_error | The request body is malformed or missing required parameters |
| 401 | authentication_error | Invalid or missing API key in the Authorization header |
| 403 | permission_error | Your API key does not have access to the requested resource |
| 404 | not_found_error | The requested endpoint or model does not exist |
| 429 | rate_limit_error | You have exceeded your rate limit or monthly quota |
| 500 | internal_error | An unexpected internal server error occurred |
| 502 | upstream_error | Failed to connect to the upstream model provider |
| 503 | service_unavailable | The service is temporarily unavailable; try again later |
| 504 | upstream_timeout | The upstream model provider timed out |

Error Response Format

JSON
{
  "error": {
    "message": "Invalid API key provided",
    "type": "authentication_error"
  }
}
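A common client strategy is to retry only transient failures (429 and the 5xx codes) with exponential backoff and jitter, and to surface 400, 401, 403, and 404 immediately, since resending the same request will not fix them. Note that retrying a 429 helps only with a per-minute rate limit; an exhausted monthly quota keeps failing until it resets. The sketch below shows the pieces under those assumptions; the names are hypothetical.

```python
import json
import random

# Status codes from the table above that usually indicate a transient problem.
RETRYABLE = {429, 500, 502, 503, 504}


def parse_error(body: str) -> tuple:
    """Return (type, message) from an error body shaped like the example above."""
    try:
        data = json.loads(body)
    except ValueError:
        return "unknown_error", body  # non-JSON body, e.g. from an intermediate proxy
    err = data.get("error", {}) if isinstance(data, dict) else {}
    return err.get("type", "unknown_error"), err.get("message", "")


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield exponential backoff delays in seconds with full jitter."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

In a retry loop, sleep for each yielded delay after a response whose status is in RETRYABLE, and give up when the generator is exhausted.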

Code Examples

Python (OpenAI SDK)

Non-streaming

Python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-qyai-key",
    base_url="http://121.41.214.247/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about programming."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming

Python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-qyai-key",
    base_url="http://121.41.214.247/v1"
)

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Explain recursion in 3 sentences."}
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. a final usage-only chunk) may have an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()  # newline

JavaScript / Node.js

Non-streaming

JavaScript
// Top-level await requires an ES module (e.g. a .mjs file or "type": "module" in package.json).
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-qyai-key",
  baseURL: "http://121.41.214.247/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek-v4-flash",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Write a haiku about programming." },
  ],
  temperature: 0.7,
});

console.log(response.choices[0].message.content);
console.log(`Tokens used: ${response.usage.total_tokens}`);

Streaming

JavaScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-qyai-key",
  baseURL: "http://121.41.214.247/v1",
});

const stream = await client.chat.completions.create({
  model: "deepseek-v4-flash",
  messages: [
    { role: "user", content: "Explain recursion in 3 sentences." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  process.stdout.write(content);
}
console.log(); // newline

cURL

Non-streaming

bash
curl http://121.41.214.247/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-qyai-key" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "temperature": 0.7
  }'

Streaming

bash
curl -N http://121.41.214.247/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-qyai-key" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "Explain recursion in 3 sentences."}
    ],
    "stream": true
  }'

FAQ

Is QYAI compatible with the OpenAI SDK?

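Yes. QYAI is OpenAI-compatible: change the base_url and api_key in your existing OpenAI SDK client and keep the rest of your code. Request and response formats follow the OpenAI Chat Completions API; the one addition is the reasoning_content field returned in thinking mode.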

What models are available behind the gpt-4o and gpt-4o-mini aliases?

Both aliases map to deepseek-v4-flash. This allows you to switch providers without changing any model names in your code. You'll get DeepSeek V4-Flash quality at a fraction of the OpenAI price.

How does thinking mode work?

When thinking mode is active, the model returns a reasoning_content field in the response containing its chain-of-thought, followed by content with the final answer. Use deepseek-v4-pro for the best reasoning quality. Do not include reasoning_content in subsequent messages.

What is cache hit pricing?

DeepSeek's API automatically caches prompt prefixes. When a subsequent request's prefix matches a cached one, the input price drops dramatically (e.g., $0.005/M for V4-Flash vs $0.22/M for cache miss). This is automatic — no configuration needed.

What are the rate limits?

Rate limits depend on your plan tier. The basic tier allows 1,000 requests/month and 10 RPM; pro allows 3,000 requests/month and 30 RPM; enterprise allows 10,000 requests/month and 100 RPM. Contact us for custom limits.
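To stay under your tier's requests-per-minute cap (10, 30, or 100 RPM), a client can throttle itself before each call. The sketch below is a sliding-window limiter that remembers the timestamps of the last minute's requests; RpmLimiter is a hypothetical helper, and the clock parameter exists only so it can be tested without waiting.

```python
import time
from collections import deque


class RpmLimiter:
    """Client-side sliding-window limiter for a requests-per-minute cap."""

    def __init__(self, rpm: int, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock
        self.sent = deque()  # timestamps of requests within the last 60 seconds

    def wait_time(self) -> float:
        """Seconds to wait before another request would stay under the cap."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()  # forget requests older than one minute
        if len(self.sent) < self.rpm:
            return 0.0
        return 60 - (now - self.sent[0])

    def acquire(self) -> None:
        """Block until a request slot is free, then record the request."""
        while (delay := self.wait_time()) > 0:
            time.sleep(delay)
        self.sent.append(self.clock())
```

Usage: create limiter = RpmLimiter(rpm=30) for the pro tier and call limiter.acquire() before each API request. This only smooths the per-minute rate; the monthly request quota still applies.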

Is my data sent to China?

QYAI proxies requests to DeepSeek's API infrastructure. DeepSeek is a Chinese company and their data processing is subject to Chinese regulations. We recommend not sending personally identifiable or sensitive data through any third-party AI API.