A self-hosted PDF OCR API powered by PaddleOCR and the PaddleOCR-VL model. Runs on GPU via Docker, processes PDFs page-by-page, and returns markdown content in JSON responses. Good support (not perfect) for Latvian and Lithuanian languages.

| Component | Details |
|---|---|
| Model | PaddleOCR-VL-1.5 |
| Parameters | 0.9B |
| Layout detection | PP-DocLayoutV3 |
| GPU VRAM | ~8.5 GB |

- Docker with NVIDIA Container Toolkit
- NVIDIA GPU with ~8.5GB VRAM
Using Docker Hub image:

    services:
      paddleocr:
        image: edgaras0x4e/paddleocr-pdf-api:latest
        ports:
          - "8099:8000"
        volumes:
          - ocr-data:/data
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
        restart: unless-stopped

    volumes:
      ocr-data:

Then start the container:

    docker compose up -d

Or build from source:

    git clone https://github.com/Edgaras0x4E/paddleocr-pdf-api.git && cd paddleocr-pdf-api
    docker compose up --build -d

The API will be available at http://localhost:8099. On first startup, the model (~2 GB) is downloaded and loaded into GPU memory. The API accepts requests immediately, but jobs start processing only once the model is ready.

Upload a PDF for processing:

    curl -X POST http://localhost:8099/ocr -F "file=@document.pdf"

    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "filename": "document.pdf",
      "status": "queued"
    }

Get job status and progress:

    curl http://localhost:8099/ocr/{job_id}

    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "filename": "document.pdf",
      "status": "processing",
      "total_pages": 185,
      "processed_pages": 42,
      "error": null
    }

Get markdown for a specific page:

    curl http://localhost:8099/ocr/{job_id}/pages/1

    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "page_num": 1,
      "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit..."
    }

Get all completed pages:

    curl http://localhost:8099/ocr/{job_id}/result

    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "filename": "document.pdf",
      "status": "completed",
      "total_pages": 185,
      "processed_pages": 185,
      "pages": [
        {"page_num": 1, "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet..."},
        {"page_num": 2, "markdown": "..."}
      ]
    }

List all jobs:

    curl http://localhost:8099/jobs

    {
      "jobs": [
        {
          "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
          "filename": "document.pdf",
          "status": "completed",
          "total_pages": 185,
          "processed_pages": 185
        }
      ]
    }

Cancel a queued or running job:

    curl -X POST http://localhost:8099/ocr/{job_id}/cancel

    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "status": "cancelling"
    }

Delete a job and its data:

    curl -X DELETE http://localhost:8099/ocr/{job_id}

    {
      "status": "deleted"
    }

| Method | Endpoint | Description |
|---|---|---|
| POST | /ocr | Upload a PDF for processing |
| GET | /ocr/{job_id} | Get job status and progress |
| GET | /ocr/{job_id}/pages/{page_num} | Get markdown for a specific page |
| GET | /ocr/{job_id}/result | Get all completed pages |
| POST | /ocr/{job_id}/cancel | Cancel a queued or running job |
| DELETE | /ocr/{job_id} | Delete a job and its data |
| GET | /jobs | List all jobs |

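For scripted use, the upload and status endpoints can be driven from Python's standard library. The following is a minimal sketch rather than an official client: the helper names (`build_multipart`, `upload_pdf`, `wait_for_job`) are made up for illustration, the base URL assumes the `8099:8000` port mapping shown earlier, and treating `queued`, `processing`, and `cancelling` as the only in-progress states is an assumption based on the example responses.

```python
import json
import os
import time
import urllib.request
import uuid

BASE_URL = "http://localhost:8099"  # assumes the 8099:8000 port mapping from the compose file

# Assumption: any status outside this set is treated as final.
IN_PROGRESS = ("queued", "processing", "cancelling")


def build_multipart(field: str, filename: str, data: bytes) -> tuple:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def get_json(url: str) -> dict:
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def upload_pdf(path: str) -> str:
    """POST the file to /ocr (form field "file", as in the curl example); return job_id."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    req = urllib.request.Request(
        f"{BASE_URL}/ocr", data=body, method="POST",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]


def wait_for_job(job_id: str, fetch=get_json, interval: float = 5.0, sleep=time.sleep) -> dict:
    """Poll GET /ocr/{job_id} until the status leaves the in-progress set."""
    while True:
        status = fetch(f"{BASE_URL}/ocr/{job_id}")
        if status["status"] not in IN_PROGRESS:
            return status
        sleep(interval)
```

A typical call sequence would be `job_id = upload_pdf("document.pdf")`, then `wait_for_job(job_id)`, then `get_json(f"{BASE_URL}/ocr/{job_id}/result")`. The `fetch` and `sleep` parameters exist only so the polling loop can be exercised without a running server.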
Environment variables set in docker-compose.yml:

| Variable | Default | Description |
|---|---|---|
| API_KEY | (empty) | Optional API key. When set, all requests must include an X-API-Key header |
| OCR_DPI | 200 | DPI for PDF page rendering |
| DB_PATH | /data/ocr.db | SQLite database path |
| UPLOAD_DIR | /data/uploads | Upload storage path |

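When `API_KEY` is set on the server, a client has to attach the `X-API-Key` header to every request. A small helper along these lines keeps that in one place. It is only a sketch: `build_request` and the client-side variable `OCR_API_KEY` are hypothetical names, not part of the project.

```python
import os
import urllib.request
from typing import Optional


def build_request(url: str, method: str = "GET",
                  api_key: Optional[str] = None) -> urllib.request.Request:
    """Create a request, adding X-API-Key only when a key is configured."""
    headers = {}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(url, method=method, headers=headers)


# OCR_API_KEY is a hypothetical client-side variable; the server itself reads API_KEY.
req = build_request("http://localhost:8099/jobs", api_key=os.environ.get("OCR_API_KEY"))
```

Passing `req` to `urllib.request.urlopen` then sends the header. If the variable is unset, no header is added, which matches a server running with an empty `API_KEY`.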
To require an API key, uncomment the environment section in docker-compose.yml:

    environment:
      - API_KEY=your-secret-key

Then restart:

    docker compose down && docker compose up -d

All requests must then include the header:

    curl -H "X-API-Key: your-secret-key" http://localhost:8099/jobs

Full docker-compose.yml for building from source:

    services:
      paddleocr:
        build: .
        ports:
          - "8099:8000"
        # environment:
        #   - API_KEY=your-secret-key
        volumes:
          - ocr-data:/data
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
        restart: unless-stopped

    volumes:
      ocr-data:

How a job is processed:

- A PDF is uploaded and saved to disk
- A background worker picks up queued jobs in order
- Each page is rendered to an image using pypdfium2
- PaddleOCR-VL extracts text and converts it to markdown
- HTML tags and image placeholders are stripped from the output
- Results are stored in SQLite and available per-page as they complete
- Jobs interrupted by a restart are automatically re-queued
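
The clean-up step in the list above (stripping HTML tags and image placeholders) could look roughly like this. It is an illustrative sketch, not the project's actual code: the function name `clean_markdown`, the regular expressions, and the assumption that image placeholders appear as Markdown image links are all guesses about what the model output contains.

```python
import re

# Matches opening/closing tags such as <div align="center"> or </b>.
# Requiring a letter right after "<" leaves text like "a < b" untouched.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")

# Assumption: image placeholders look like Markdown images, ![alt](path).
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def clean_markdown(text: str) -> str:
    """Remove image placeholders and HTML tags, then tidy leftover blank lines."""
    text = MD_IMAGE.sub("", text)
    text = HTML_TAG.sub("", text)
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs left by removals
    return text.strip()
```

Regex-based stripping is deliberately simple. It does not parse HTML, so unusual markup (for example a `>` inside an attribute value) may need a real parser.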
The /data volume stores the SQLite database and uploaded PDFs. This is a named Docker volume (ocr-data) that persists across container restarts and rebuilds.
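
Because job state lives in SQLite on a persistent volume, re-queueing interrupted jobs after a restart can be a single `UPDATE` at startup. The sketch below illustrates the idea with a hypothetical `jobs` table. The real schema, column names, and function names in the project may differ, and only the status values `queued` and `processing` are taken from the API examples.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical jobs table (not the project's actual schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id     TEXT PRIMARY KEY,
               filename   TEXT NOT NULL,
               status     TEXT NOT NULL DEFAULT 'queued',
               created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.commit()


def requeue_interrupted(conn: sqlite3.Connection) -> int:
    """At startup, put jobs left in 'processing' back in the queue; return the count."""
    cur = conn.execute("UPDATE jobs SET status = 'queued' WHERE status = 'processing'")
    conn.commit()
    return cur.rowcount


def next_job(conn: sqlite3.Connection):
    """Return the oldest queued job_id, or None if the queue is empty."""
    row = conn.execute(
        "SELECT job_id FROM jobs WHERE status = 'queued' "
        "ORDER BY created_at, rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

A worker would call `requeue_interrupted` once when the container starts and then loop on `next_job`, so a job that was mid-run during a crash is picked up again in its original order.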
License: MIT