Shipping Small but Useful AI Tools: A Practical Stack for 0 → 1 Without Heavy MLOps
I've built three AI products in the last year. None of them use Kubernetes, vector databases, or custom-trained models. They all make money, handle thousands of users, and cost less than $500/month to run.
The AI tooling landscape makes you think you need a complex stack to ship something useful. You don't.
Here's the stack I actually use to go from idea to production in days, not months.
The Stack
- Backend: Python + FastAPI
- AI: OpenAI/Anthropic APIs
- Database: Postgres (via Supabase)
- Queue: Simple in-process, or Redis if needed
- Hosting: Railway or Render
- Frontend: Next.js (but honestly, anything works)
Total complexity: Low
Total capability: High enough for 95% of AI tools
Let me break down why I chose each piece and how to use it.
FastAPI: The AI Tool Backend
FastAPI is perfect for AI products because:
- Async by default (good for LLM API calls)
- Automatic API docs
- Type hints prevent bugs
- Fast to develop, fast to run
Basic Setup:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class GenerateRequest(BaseModel):
    prompt: str
    user_id: str
    max_tokens: Optional[int] = 500

class GenerateResponse(BaseModel):
    result: str
    tokens_used: int

@app.post("/generate", response_model=GenerateResponse)
async def generate_content(request: GenerateRequest):
    try:
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
        )
        result = response.choices[0].message.content
        tokens = response.usage.total_tokens

        # Log usage for billing/analytics (log_usage is your own helper)
        log_usage(request.user_id, tokens)

        return GenerateResponse(result=result, tokens_used=tokens)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
This is the core pattern for 90% of AI tools:
- Accept user input
- Call LLM API
- Return result
- Log for analytics/billing
You can ship this in an hour.
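If you want a quick sanity check before wiring up a frontend, something like this works once the server is running locally (a sketch: the prompt, the user_id, and uvicorn's default port 8000 are placeholder assumptions):

import httpx

# Hypothetical local smoke test against the /generate endpoint above
resp = httpx.post(
    "http://localhost:8000/generate",
    json={"prompt": "Write a tagline for a coffee subscription", "user_id": "user_123"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # {"result": "...", "tokens_used": ...}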
LLM APIs: Don't Train Your Own (Yet)
The biggest mistake I see: people trying to fine-tune or train models before validating their product.
Use GPT-4/Claude API until:
- You have 10,000+ users
- You can't achieve quality with prompts alone
- You have proprietary data that provides a real advantage
- You've calculated the cost/quality tradeoff
Before that, just use the API. It's fast, cheap enough, and incredibly capable.
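Because the stack treats OpenAI and Anthropic as interchangeable, it's worth keeping the provider behind one small function so endpoints never care which model answered. A minimal sketch, assuming the current openai and anthropic Python SDKs; the model names are illustrative:

from openai import AsyncOpenAI
from anthropic import AsyncAnthropic

openai_client = AsyncOpenAI()        # reads OPENAI_API_KEY
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY

async def complete(prompt: str, provider: str = "openai", max_tokens: int = 500) -> str:
    """One seam for swapping LLM providers; endpoints only call this."""
    if provider == "openai":
        response = await openai_client.chat.completions.create(
            model="gpt-4",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content
    else:
        response = await anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",  # illustrative model name
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text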
Prompt Engineering > Fine-Tuning:
def create_system_prompt(user_profile):
    """Customize behavior through prompts, not training."""
    return f"""
You are a content assistant for {user_profile['name']}, a {user_profile['niche']} creator.

Their style:
- {user_profile['tone']} tone
- {user_profile['length']} content length
- Focuses on {user_profile['topics']}

Examples of their past content:
{format_examples(user_profile['past_content'])}

Generate new content matching this exact style.
"""

async def generate_for_user(user_id, prompt):
    # get_user_profile and format_examples are your own helpers
    user_profile = get_user_profile(user_id)
    system_prompt = create_system_prompt(user_profile)

    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
This gives you personalization without training. It works remarkably well.
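For reference, here's the rough shape of profile data that prompt assumes; the values are hypothetical and would normally come from your users table:

# Hypothetical profile record matching the keys used in create_system_prompt
example_profile = {
    "name": "Dana",
    "niche": "personal finance",
    "tone": "casual but direct",
    "length": "short-form",
    "topics": "budgeting, index funds, and side income",
    "past_content": [
        "Your budget isn't a punishment. It's a permission slip.",
        "Index funds are boring. That's the point.",
    ],
}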
Cost Management:
from datetime import datetime
from fastapi import HTTPException

# Simple in-memory rate limiting (resets on restart; move to Redis/Postgres once it matters)
user_limits = {}

async def check_usage_limit(user_id: str, tier: str = "free"):
    limits = {
        "free": {"requests_per_day": 10, "tokens_per_day": 10000},
        "pro": {"requests_per_day": 1000, "tokens_per_day": 1000000},
    }

    today = datetime.utcnow().date()
    key = f"{user_id}_{today}"
    usage = user_limits.setdefault(key, {"requests": 0, "tokens": 0})
    limit = limits[tier]

    if usage["requests"] >= limit["requests_per_day"]:
        raise HTTPException(status_code=429, detail="Daily request limit reached")
    if usage["tokens"] >= limit["tokens_per_day"]:
        raise HTTPException(status_code=429, detail="Daily token limit reached")

    usage["requests"] += 1  # count this request; add tokens after the LLM call returns
    return True
This prevents runaway costs. Critical when you're charging $10/month but GPT-4 costs add up.
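It also helps to know roughly what each request costs before you pick a price. A back-of-the-envelope helper is plenty; the per-token rates below are placeholders, so plug in your provider's current pricing:

# Placeholder per-1K-token rates; check your provider's current pricing
PRICE_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of a single request; good enough for margin math."""
    rates = PRICE_PER_1K[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

# At these placeholder rates, a 1,000-in / 500-out request is ~$0.06,
# so roughly 165 of them eat an entire $10/month subscription.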
Postgres: Your AI Tool Database
You don't need a vector database, graph database, or time-series database for most AI tools.
Postgres handles:
- User data
- Prompts and results
- Usage tracking
- Simple search (full-text is good enough)
Schema for AI Tool:
-- Users and auth
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
tier TEXT DEFAULT 'free',
settings JSONB DEFAULT '{}'
);
-- AI generations
CREATE TABLE generations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
prompt TEXT NOT NULL,
result TEXT NOT NULL,
tokens_used INTEGER,
model TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- Usage tracking
CREATE TABLE daily_usage (
user_id UUID REFERENCES users(id),
date DATE,
requests INTEGER DEFAULT 0,
tokens_used INTEGER DEFAULT 0,
PRIMARY KEY (user_id, date)
);
-- Simple analytics
CREATE INDEX idx_generations_user_date ON generations(user_id, created_at);
-- daily_usage lookups are already covered by its composite primary key (user_id, date)
That's it. Covers 95% of AI tools.
Why Not Vector Databases?
You only need vector databases if you're doing semantic search or RAG at scale.
For most AI tools, you're:
- Storing user inputs/outputs → Postgres
- Calling LLM APIs → No storage needed
- Maybe searching past generations → Postgres full-text search works fine (see the sketch below)
If you're actually building RAG (see my other post on pragmatic RAG), then yes, add the pgvector extension to Postgres or use Pinecone. But not on day one.
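To make "full-text search works fine" concrete: searching past generations needs nothing beyond the tables above. A sketch using the Supabase Python client's text_search filter (the query string is Postgres tsquery syntax; without a dedicated tsvector column this won't use an index, which is fine at small scale):

import os
from supabase import create_client

supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_KEY"))

def search_generations(user_id: str, query: str):
    """Full-text search over stored generations; no extra search infrastructure."""
    return (
        supabase.table("generations")
        .select("*")
        .eq("user_id", user_id)
        .text_search("result", query, options={"config": "english"})
        .limit(20)
        .execute()
        .data
    )

# e.g. search_generations(user_id, "'pricing' & 'email'")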
Queues: When and Why
Most AI tools don't need a queue initially. But when you do, here's the pattern:
Without Queue (Simple, Start Here):
@app.post("/generate")
async def generate(request: GenerateRequest):
result = await call_llm(request.prompt)
return {"result": result}
Client waits for response. Works fine if LLM calls are <10 seconds.
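If you stay synchronous, cap how long that wait can be; the current OpenAI SDK (1.x) accepts client-wide timeout and retry settings, roughly like this:

from openai import AsyncOpenAI

# Fail fast instead of letting a stuck call hang the request indefinitely
client = AsyncOpenAI(timeout=30.0, max_retries=2)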
With Queue (When Needed):
from redis import Redis
from rq import Queue

redis_conn = Redis()
queue = Queue(connection=redis_conn)

@app.post("/generate")
async def generate(request: GenerateRequest):
    # call_llm should be a regular (sync) function here; RQ workers run jobs synchronously
    job = queue.enqueue(call_llm, request.prompt, job_timeout=60)
    return {"job_id": job.id}

@app.get("/status/{job_id}")
async def check_status(job_id: str):
    job = queue.fetch_job(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="Job not found")
    if job.is_finished:
        return {"status": "complete", "result": job.result}
    elif job.is_failed:
        return {"status": "failed", "error": str(job.exc_info)}
    else:
        return {"status": "processing"}
Now long-running jobs don't block the API. (You'll also need an RQ worker process running alongside the web app to actually execute them.)
When to Add a Queue:
- LLM calls take >30 seconds
- You're doing batch processing
- You need retry logic for failures (a simple queue-free version is sketched below)
- You want background jobs
Before that, async/await is enough.
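Even without a queue, a few lines of retry-with-backoff cover most transient API failures. A minimal sketch, reusing the same call_llm helper as the endpoints above:

import asyncio

async def call_llm_with_retry(prompt: str, attempts: int = 3):
    """Retry transient LLM API failures with exponential backoff; no queue needed."""
    for attempt in range(attempts):
        try:
            return await call_llm(prompt)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s after the first failure, 2s after the second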
Logging and Monitoring: Simple But Critical
You don't need Datadog or New Relic. You need structured logs and basic alerts.
Logging Pattern:
import json
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)  # make sure INFO-level logs reach stdout

def log_generation(user_id, prompt, result, tokens, latency_ms):
    log_entry = {
        "event": "generation",
        "user_id": user_id,
        "prompt_length": len(prompt),
        "result_length": len(result),
        "tokens": tokens,
        "latency_ms": latency_ms,
        "timestamp": datetime.utcnow().isoformat(),
    }
    logging.info(json.dumps(log_entry))
# This logs to stdout → Railway/Render/whatever captures it
# Now you can search: "show me all generations for user X"
# Or: "show me slow requests (latency_ms > 5000)"
Basic Metrics Dashboard:
@app.get("/admin/metrics")
async def get_metrics(admin_token: str):
# Verify admin
if admin_token != os.getenv("ADMIN_TOKEN"):
raise HTTPException(status_code=403)
# Query simple metrics
today = datetime.utcnow().date()
metrics = {
"users_active_today": await db.count_users_active_since(today),
"generations_today": await db.count_generations_since(today),
"tokens_used_today": await db.sum_tokens_since(today),
"error_rate": await db.error_rate_since(today)
}
return metrics
Check this once a day. If errors spike or usage drops, investigate.
That's enough monitoring for the first 6 months.
Deployment: Railway or Render
Don't overthink hosting. Both Railway and Render work great for AI tools:
Railway:
- Connect GitHub repo
- Add Postgres and Redis (if needed)
- Deploy on push
- Cost: ~$20-50/month
Render:
- Similar to Railway
- Slightly more configuration
- Cost: ~$20-50/month
Both handle:
- Auto-deploy from Git
- Environment variables
- SSL certificates
- Scaling (when you need it)
My deployment setup:
# railway.toml
[build]
builder = "NIXPACKS"
[deploy]
startCommand = "uvicorn main:app --host 0.0.0.0 --port $PORT"
healthcheckPath = "/health"
restartPolicyType = "ON_FAILURE"
Push to main, it deploys. That's it.
The Complete Minimal AI Tool
Putting it all together, here's a working AI tool in ~100 lines:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI
from supabase import create_client
import os
from datetime import datetime
import json
import logging

logging.basicConfig(level=logging.INFO)

app = FastAPI()

# Setup
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
supabase = create_client(
    os.getenv("SUPABASE_URL"),
    os.getenv("SUPABASE_KEY"),
)

# Models
class GenerateRequest(BaseModel):
    prompt: str
    user_id: str

class GenerateResponse(BaseModel):
    result: str
    tokens_used: int

# Rate limiting
async def check_rate_limit(user_id: str):
    today = datetime.utcnow().date()
    result = supabase.table("daily_usage").select("*").match({
        "user_id": user_id,
        "date": today.isoformat(),
    }).execute()

    if result.data:
        usage = result.data[0]
        if usage["requests"] >= 10:  # Free tier limit
            raise HTTPException(429, "Daily limit reached")

# Main endpoint
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    start_time = datetime.utcnow()

    # Check limits
    await check_rate_limit(request.user_id)

    # Generate
    try:
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=500,
        )
        result = response.choices[0].message.content
        tokens = response.usage.total_tokens
    except Exception as e:
        logging.error(f"OpenAI error: {e}")
        raise HTTPException(500, "Generation failed")

    # Save generation
    supabase.table("generations").insert({
        "user_id": request.user_id,
        "prompt": request.prompt,
        "result": result,
        "tokens_used": tokens,
    }).execute()

    # Update usage (increment_usage is a small Postgres function you define in Supabase)
    today = datetime.utcnow().date()
    supabase.rpc("increment_usage", {
        "user_id": request.user_id,
        "date": today.isoformat(),
        "tokens": tokens,
    }).execute()

    # Log
    latency_ms = (datetime.utcnow() - start_time).total_seconds() * 1000
    logging.info(json.dumps({
        "event": "generation",
        "user_id": request.user_id,
        "tokens": tokens,
        "latency_ms": latency_ms,
    }))

    return GenerateResponse(result=result, tokens_used=tokens)

@app.get("/health")
async def health():
    return {"status": "ok"}
This handles:
- User requests
- Rate limiting
- LLM calls
- Database storage
- Usage tracking
- Logging
Total lines: ~100. Total capability: a production-ready AI tool.
When to Add Complexity
Start with this simple stack. Add complexity only when:
Add Redis Queue When:
- LLM calls take >30 seconds
- You're doing batch jobs
- You need better retry logic
Add Vector DB When:
- You're building RAG with 10K+ documents
- You're doing semantic search at scale
- Postgres full-text search isn't good enough
Add Custom Model When:
- GPT-4 can't achieve the quality you need (rare)
- You have 10K+ users and cost is significant
- You have proprietary data that creates a moat
Add Kubernetes When:
- You have multiple services with different scaling needs
- You have DevOps expertise
- Simple hosting is actually becoming expensive
Most AI tools never need these. The ones that do can add them later.
The Real Stack: Speed to Ship
The best stack is the one you can ship with.
I've watched founders spend 3 months setting up:
- Kubernetes
- Custom vector databases
- Fine-tuned models
- Complex MLOps pipelines
Then realize their product idea doesn't work and they have to start over.
Meanwhile, I ship in a week with:
- FastAPI
- OpenAI API
- Postgres
- Railway
If it works, I iterate. If it doesn't, I pivot without being tied to infrastructure.
The goal isn't "best practices" or "production-grade architecture." It's learning if your product solves a real problem.
Ship the simplest thing that works. Add complexity when simplicity breaks.
That's how you go from 0 to 1.