Self-Hosted PDF OCR API for Large Documents

A self-hosted PDF OCR API powered by PaddleOCR and the PaddleOCR-VL model. It runs on a GPU via Docker, processes PDFs page by page, and returns markdown content in JSON responses. Latvian and Lithuanian are supported well, though not perfectly.

Model


| Property | Value |
|---|---|
| Model | PaddleOCR-VL-1.5 |
| Parameters | 0.9B |
| Layout detection | PP-DocLayoutV3 |
| GPU VRAM | ~8.5GB |

Requirements

- Docker with the NVIDIA Container Toolkit
- NVIDIA GPU with ~8.5GB VRAM

Quick start

Using Docker Hub image:

services:
  paddleocr:
    image: edgaras0x4e/paddleocr-pdf-api:latest
    ports:
      - "8099:8000"
    volumes:
      - ocr-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ocr-data:

docker compose up -d

Or build from source:

git clone https://github.com/Edgaras0x4E/paddleocr-pdf-api.git && cd paddleocr-pdf-api
docker compose up --build -d

The API will be available at http://localhost:8099. On first startup the model (~2GB) is downloaded and loaded into GPU memory. The API accepts requests immediately, but queued jobs only start processing once the model is ready.

Usage

Submit a PDF

curl -X POST http://localhost:8099/ocr -F "file=@document.pdf"

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "queued"
}
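The same upload can be done from Python. The following is a minimal sketch using only the standard library; the helper names `encode_pdf_upload` and `submit_pdf` are illustrative and not part of this project, and the base URL assumes the port mapping from the quick start.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def encode_pdf_upload(filename: str, data: bytes, field: str = "file") -> tuple[bytes, str]:
    """Build a multipart/form-data body, like curl's -F "file=@document.pdf"."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def submit_pdf(path: str, base_url: str = "http://localhost:8099") -> dict:
    """POST the PDF to /ocr and return the parsed JSON response."""
    pdf = Path(path)
    body, content_type = encode_pdf_upload(pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/ocr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `submit_pdf("document.pdf")["job_id"]` should then give the job ID used by the other endpoints. If `API_KEY` is set, the request would also need the `X-API-Key` header.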

Check progress

curl http://localhost:8099/ocr/{job_id}

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "processing",
  "total_pages": 185,
  "processed_pages": 42,
  "error": null
}
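For long documents it can be convenient to poll this endpoint until the job leaves the `queued` and `processing` states. A rough sketch, assuming any other status value means the job has stopped; the function names and the 5-second interval are arbitrary choices, not part of the API:

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(job_id: str, base_url: str = "http://localhost:8099") -> dict:
    """GET /ocr/{job_id} and return the parsed JSON status."""
    with urllib.request.urlopen(f"{base_url}/ocr/{job_id}") as response:
        return json.load(response)


def progress_percent(status: dict) -> float:
    """Percentage of pages processed; 0.0 while the page count is unknown."""
    total = status.get("total_pages") or 0
    done = status.get("processed_pages") or 0
    return 100.0 * done / total if total else 0.0


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status is neither "queued" nor "processing"."""
    while True:
        status = fetch(job_id)
        if status["status"] not in ("queued", "processing"):
            return status
        print(f"{status['status']}: {progress_percent(status):.1f}%")
        sleep(interval)
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.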

Get a single page

curl http://localhost:8099/ocr/{job_id}/pages/1

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "page_num": 1,
  "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit..."
}

Get all pages

curl http://localhost:8099/ocr/{job_id}/result

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "completed",
  "total_pages": 185,
  "processed_pages": 185,
  "pages": [
    {"page_num": 1, "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet..."},
    {"page_num": 2, "markdown": "..."}
  ]
}
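A client will often want the whole document as a single markdown string. The sketch below works on the parsed JSON shown above; it assumes page numbers start at 1 (as in the examples) and that the response may hold only the pages completed so far. Both helper names are illustrative.

```python
def join_pages(result: dict, separator: str = "\n\n") -> str:
    """Concatenate page markdown in page order."""
    pages = sorted(result.get("pages", []), key=lambda page: page["page_num"])
    return separator.join(page["markdown"] for page in pages)


def missing_pages(result: dict) -> list[int]:
    """Page numbers (1-based, assumed) not yet present in the result."""
    total = result.get("total_pages") or 0
    present = {page["page_num"] for page in result.get("pages", [])}
    return [n for n in range(1, total + 1) if n not in present]
```

Checking `missing_pages` before writing the joined text to disk is one way to avoid saving a partial document while a job is still running.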

List all jobs

curl http://localhost:8099/jobs

{
  "jobs": [
    {
      "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
      "filename": "document.pdf",
      "status": "completed",
      "total_pages": 185,
      "processed_pages": 185
    }
  ]
}

Cancel a job

curl -X POST http://localhost:8099/ocr/{job_id}/cancel

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "status": "cancelling"
}

Delete a job

curl -X DELETE http://localhost:8099/ocr/{job_id}

{
  "status": "deleted"
}

API reference

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/ocr` | Upload a PDF for processing |
| `GET` | `/ocr/{job_id}` | Get job status and progress |
| `GET` | `/ocr/{job_id}/pages/{page_num}` | Get markdown for a specific page |
| `GET` | `/ocr/{job_id}/result` | Get all completed pages |
| `POST` | `/ocr/{job_id}/cancel` | Cancel a queued or running job |
| `DELETE` | `/ocr/{job_id}` | Delete a job and its data |
| `GET` | `/jobs` | List all jobs |

Configuration

Environment variables set in docker-compose.yml:

| Variable | Default | Description |
|---|---|---|
| `API_KEY` | (empty) | Optional API key. When set, all requests must include an `X-API-Key` header |
| `OCR_DPI` | `200` | DPI for PDF page rendering |
| `DB_PATH` | `/data/ocr.db` | SQLite database path |
| `UPLOAD_DIR` | `/data/uploads` | Upload storage path |
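To illustrate how these defaults behave, here is one way a service could read the variables. This is a sketch only; the `Settings` class and `load_settings` function are hypothetical, and the actual code in api.py may be organised differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str]  # None means authentication is disabled
    ocr_dpi: int
    db_path: str
    upload_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        api_key=env.get("API_KEY") or None,  # an empty string counts as unset
        ocr_dpi=int(env.get("OCR_DPI", "200")),
        db_path=env.get("DB_PATH", "/data/ocr.db"),
        upload_dir=env.get("UPLOAD_DIR", "/data/uploads"),
    )
```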

Enabling API key authentication

Uncomment the environment section in docker-compose.yml:

environment:
  - API_KEY=your-secret-key

Then restart:

docker compose down && docker compose up -d

All requests must then include the header:

curl -H "X-API-Key: your-secret-key" http://localhost:8099/jobs
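From Python, the header can be attached to every request in one place. A minimal sketch with the standard library; `build_request` and `list_jobs` are illustrative names, and the base URL assumes the default port mapping:

```python
import json
import urllib.request
from typing import Optional


def build_request(
    path: str,
    api_key: Optional[str] = None,
    base_url: str = "http://localhost:8099",
    method: str = "GET",
) -> urllib.request.Request:
    """Create a request, adding X-API-Key only when a key is given."""
    headers = {"X-API-Key": api_key} if api_key else {}
    return urllib.request.Request(f"{base_url}{path}", headers=headers, method=method)


def list_jobs(api_key: Optional[str] = None) -> dict:
    """GET /jobs and return the parsed JSON response."""
    with urllib.request.urlopen(build_request("/jobs", api_key)) as response:
        return json.load(response)
```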

docker-compose.yml

services:
  paddleocr:
    build: .
    ports:
      - "8099:8000"
    # environment:
    #   - API_KEY=your-secret-key
    volumes:
      - ocr-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ocr-data:

How it works

1. A PDF is uploaded and saved to disk
2. A background worker picks up queued jobs in order
3. Each page is rendered to an image using pypdfium2
4. PaddleOCR-VL extracts text and converts it to markdown
5. HTML tags and image placeholders are stripped from the output
6. Results are stored in SQLite and available per-page as they complete
7. Jobs interrupted by a restart are automatically re-queued
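The clean-up step that strips HTML tags and image placeholders could look roughly like the sketch below. The regular expressions are assumptions for illustration (markdown image syntax for placeholders, simple tag matching for HTML) and are not taken from api.py.

```python
import re

# Assumed patterns; the project's actual rules may differ.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
IMAGE_PLACEHOLDER = re.compile(r"!\[[^\]]*\]\([^)]*\)")
EXTRA_BLANK_LINES = re.compile(r"\n{3,}")


def clean_markdown(text: str) -> str:
    """Drop image placeholders and HTML tags, then collapse leftover blank lines."""
    text = IMAGE_PLACEHOLDER.sub("", text)
    text = HTML_TAG.sub("", text)
    return EXTRA_BLANK_LINES.sub("\n\n", text).strip()
```

Requiring a letter right after `<` keeps plain comparisons such as `a < b` intact, at the cost of missing unusual markup.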

Data persistence

The /data volume stores the SQLite database and uploaded PDFs. This is a named Docker volume (ocr-data) that persists across container restarts and rebuilds.
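Because job state lives in the persisted SQLite database, interrupted jobs can be found again after a restart. The following sketch shows one way such a re-queue could work; the `jobs` table layout and the function names are hypothetical and are not taken from the project's schema.

```python
import sqlite3
from typing import Optional


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create a simplified, hypothetical jobs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY, filename TEXT, status TEXT)"
    )
    return conn


def requeue_interrupted(conn: sqlite3.Connection) -> int:
    """On startup, put jobs left in 'processing' back in the queue."""
    with conn:  # commits on success
        cursor = conn.execute(
            "UPDATE jobs SET status = 'queued' WHERE status = 'processing'"
        )
    return cursor.rowcount


def next_queued(conn: sqlite3.Connection) -> Optional[str]:
    """Return the oldest queued job ID (insertion order), or None."""
    row = conn.execute(
        "SELECT job_id FROM jobs WHERE status = 'queued' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```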

License

MIT
