LLM Model Proxy Cache
Develop against large language models without the big bill
This application serves as a reverse proxy with caching capabilities, specifically tailored for language model model API requests. Built with Golang, it facilitates interactions with models hosted on platforms like OpenAI by caching responses and minimizing redundant external API calls.
The goal is to allow for you to develop against llm api's without running up a bill.
Features
- Reverse Proxy Functionality: Directs API requests to the appropriate machine learning model service provider. Currently only supports OpenAI.
- Caching Mechanism: Stores successful responses to reduce API calls and improve performance. Cache hits serve responses directly from the cache.
- Token Counting: Leveraging
tiktoken-go
, the service estimates the number of tokens in each request's payload to keep track of usage.
- Dynamic Service Resolution: Looks up the configured model service provider based on the requested model in the API call.
- Extensibility: Supports registering multiple model providers through the
IModelProvider
interface, each with its own set of API models and endpoints.
How It Works
The entry point main()
initiates the application by setting up a response cache and a service resolver that includes an OpenAIProvider
responsible for handling OpenAI API requests. The HTTP server listens on port 8080
and processes incoming requests through a handler which:
- Generates a cache key based on the request's path, body, and header.
- Checks the cache for a stored response corresponding to the cache key.
- If a cache hit occurs, it serves the response directly from the cache.
- On a cache miss, it determines the correct service provider and reverse proxies the request to the target machine learning model API.
API
The application exposes a single HTTP endpoint /
that accepts requests with model specifications in the body. Examples of supported model names include "gpt-4", "gpt-3.5-turbo", and "text-embedding-3-large".
Setup
To run the service, ensure you have the following:
- Golang installed and configured.
tiktoken-go
library installed (go get github.com/pkoukk/tiktoken-go
).
To start the server, execute:
go run main.go
The server will listen on http://localhost:8080
.
Installation
Download the repository and install the dependencies:
go get
Then install via:
go install
Usage
Make an API request to the service with the desired machine learning model name and payload:
curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
Replace the model
value and the messages
array content with your specific requirements.
Nodejs
You can drop in the proxt via the
const llm = new OpenAI({
baseURL: "http://localhost:8080/v1",
});
const res = await llm.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Hello!" },
],
model: "gpt-3.5-turbo",
temperature: 1,
});
Caching Key Generation
ls $GOPATH/bin
The generateCacheKey()
function creates a unique cache key by hashing the request's path, body, and the 'Authorization' bearer token if present.
Token Counting
The CountTokens()
method, part of the OpenAIProvider
implementation of IModelProvider
, counts tokens in the request content for a given model, aiding in managing token usage.
Note
This application serves as an example and may require additional security and error handling features to be production-ready.