We’ll start with implementing the non-streaming bit. Let’s start with modeling our request:
from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
The Pydantic model represents the request from the client and aims to replicate the API reference. For the sake of brevity, this model does not implement the entire spec, but only the bare bones needed for it to work. If you’re missing a parameter that is part of the API spec (like top_p), you can simply add it to the model.
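For example, adding top_p is just one more optional field; a minimal sketch (the 1.0 default mirrors OpenAI’s documented default for this parameter):

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    top_p: Optional[float] = 1.0  # newly added; other spec parameters work the same way
    stream: Optional[bool] = False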
The ChatCompletionRequest models the parameters OpenAI uses in its requests. The chat API spec requires specifying a list of ChatMessage objects (like a chat history; the client is usually in charge of keeping it and feeding it back in on every request). Each chat message has a role attribute (usually system, assistant, or user) and a content attribute containing the actual message text.
Next, we’ll write our FastAPI chat completions endpoint:
import time
from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages and request.messages[0].role == "user":
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }
That simple.
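As an optional sanity check, FastAPI’s TestClient lets us call the endpoint in-process, without starting a server. This is just a sketch; TestClient relies on httpx, which the openai package we install below pulls in as a dependency:

from fastapi.testclient import TestClient

# in-process client -- no running server needed
test_client = TestClient(app)
payload = {"messages": [{"role": "user", "content": "hello"}]}
resp = test_client.post("/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])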
Testing our implementation
Assuming both code blocks are in a file called main.py, we’ll install the required Python libraries in our environment of choice (it’s always best to create a new one):

pip install fastapi uvicorn openai

and launch the server from a terminal:

uvicorn main:app
Using another terminal (or by launching the server in the background), we will open a Python console and copy-paste the following code, taken straight from OpenAI’s Python Client Reference:
from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)
If you’ve done everything correctly, the server’s response should be printed. It’s also worth inspecting the chat_completion object to check that all the relevant attributes match what our server sent back. Given our mock endpoint, the printed content should read roughly like this:
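As a mock AI Assistant, I can only echo your last message: Say this is a test

The other attributes should match the values our endpoint returned: chat_completion.id should be "1337", chat_completion.object should be "chat.completion", and chat_completion.model should be "gpt-1337-turbo-pro-max" (the model name we passed in the request).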
As LLM generation tends to be slow (computationally expensive), it’s worth streaming your generated content back to the client, so that the user can see the response as it’s being generated, without having to wait for it to finish. If you recall, we gave ChatCompletionRequest a boolean stream property; this lets the client request that the data be streamed back to it, rather than sent all at once.
This makes things just a bit more complex. We will create a generator function to wrap our mock response (in a real-world scenario, we will want a generator that is hooked up to our LLM generation; a sketch of that follows the mock version below):
import asyncio
import json

async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    yield "data: [DONE]\n\n"
And now, we’ll modify our original endpoint to return a StreamingResponse when stream == True:
import time
from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"

    if request.stream:
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }
Testing the streaming implementation
After restarting the uvicorn server, we’ll open up a Python console and paste in this code (again, taken from OpenAI’s library docs):
from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "")
You should see each word of the server’s response slowly being printed, mimicking token generation. We can also inspect the raw chunks the server emits; the last token chunk and the terminating sentinel should look roughly like this (the id, created timestamp, and final token depend on the input):
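data: {"id": 16, "object": "chat.completion.chunk", "created": 1701234567.0, "model": "blah", "choices": [{"delta": {"content": "test "}}]}

data: [DONE]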
Putting it all together
Finally, in the gist below, you can see the entire code for the server.