Document Processing with Hyrex
Need to extract text from thousands of documents? PDFs, Word docs, images with text - they all need processing, but doing it sequentially takes forever. Time to parallelize with Hyrex!
With Hyrex, you can build powerful document processing pipelines that handle multiple file formats simultaneously. Whether you're dealing with scanned documents, PDFs, Word files, or images, Hyrex distributes the work across multiple workers, turning hours of processing into minutes.
Step 1: Define Document Processing Tasks
Create specialized Hyrex tasks for different document types. Each task handles a specific format (PDF, DOCX, images) and can run in parallel across your worker fleet for maximum throughput.
from hyrex import HyrexRegistry
from typing import List, Optional
import os
import PyPDF2
import docx
from PIL import Image
import pytesseract

hy = HyrexRegistry()

@hy.task
def store_document_text(file_path: str, text: str):
    # Persist the extracted text for indexing or analysis.
    # Replace this stub with a write to your own database.
    ...

@hy.task
def extract_text_from_pdf(file_path: str) -> str:
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    # Hand the result off for storage: .send() enqueues a task and
    # returns a handle, so the caller never sees the extracted text
    store_document_text.send(file_path, text)
    return text

@hy.task
def extract_text_from_docx(file_path: str) -> str:
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    store_document_text.send(file_path, text)
    return text

@hy.task
def extract_text_from_image(file_path: str) -> str:
    image = Image.open(file_path)
    text = pytesseract.image_to_string(image)
    store_document_text.send(file_path, text)
    return text

@hy.task
def process_document(file_path: str, file_type: str):
    # Route each file to the matching extraction task. Each extractor
    # stores its own result, since .send() returns a task handle
    # rather than the extracted text.
    if file_type == "pdf":
        extract_text_from_pdf.send(file_path)
    elif file_type == "docx":
        extract_text_from_docx.send(file_path)
    elif file_type in ["jpg", "jpeg", "png"]:
        extract_text_from_image.send(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")

@hy.task
def batch_process_documents(folder_path: str, file_types: Optional[List[str]] = None):
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if not os.path.isfile(file_path):
            continue
        file_ext = filename.split('.')[-1].lower()
        if file_types and file_ext not in file_types:
            continue
        process_document.send(file_path, file_ext)

Step 2: Build Processing APIs
Create REST endpoints that accept document uploads or file paths and dispatch processing tasks to your Hyrex workers. Support both single document processing and batch operations for efficiency.
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
from typing import List
import os
from .tasks import process_document, batch_process_documents

app = FastAPI()

class ProcessingRequest(BaseModel):
    file_path: str
    file_type: str

class BatchRequest(BaseModel):
    folder_path: str
    file_types: List[str] = ["pdf", "docx", "jpg", "png"]

@app.post("/process/document")
async def process_single_document(request: ProcessingRequest):
    # Send task to process a single document
    task = process_document.send(request.file_path, request.file_type)
    return {
        "message": "Document processing started",
        "task_id": task.task_id,
        "file_path": request.file_path
    }

@app.post("/process/upload")
async def upload_and_process(file: UploadFile = File(...)):
    # Save uploaded file
    file_path = f"uploads/{file.filename}"
    os.makedirs("uploads", exist_ok=True)

    with open(file_path, "wb") as buffer:
        content = await file.read()
        buffer.write(content)

    # Determine file type and process
    file_type = file.filename.split('.')[-1].lower()
    task = process_document.send(file_path, file_type)

    return {
        "message": "File uploaded and processing started",
        "task_id": task.task_id,
        "filename": file.filename
    }

@app.post("/process/batch")
async def batch_process(request: BatchRequest):
    # Send task to batch process documents in a folder,
    # filtering to the requested file types
    task = batch_process_documents.send(request.folder_path, request.file_types)
    return {
        "message": "Batch processing started",
        "task_id": task.task_id,
        "folder_path": request.folder_path
    }

That's it - you're processing at scale!
Your document processing pipeline is now running in parallel across multiple workers. Upload hundreds of documents and watch them get processed simultaneously, with extracted text ready for indexing, analysis, or AI processing.
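To try it out, drive the API from any HTTP client. Here's a minimal sketch using the requests library to upload every file in a local folder to the /process/upload endpoint defined above; the localhost:8000 address and the documents/ folder are assumptions for illustration.

import os
import requests

API_URL = "http://localhost:8000"  # assumes the FastAPI app runs locally

def upload_folder(folder_path: str):
    # POST each file to the upload endpoint; the server responds
    # immediately with a task_id while Hyrex workers process in parallel
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, "rb") as f:
            response = requests.post(
                f"{API_URL}/process/upload",
                files={"file": (filename, f)},
            )
        print(filename, "->", response.json()["task_id"])

upload_folder("documents/")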
Want to take it further? Add support for more file formats, implement OCR preprocessing for better text extraction, or connect the output directly to your search index or AI pipeline for seamless automation.
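For instance, a little image cleanup before OCR often improves accuracy on noisy scans. Below is a minimal sketch of a preprocessing variant you could register alongside the Step 1 tasks (reusing the same hy registry); extract_text_from_scanned_image is a hypothetical name, and the 140 threshold is an illustrative value to tune for your documents.

from PIL import Image, ImageOps
import pytesseract

@hy.task
def extract_text_from_scanned_image(file_path: str) -> str:
    image = Image.open(file_path)
    # Grayscale, then stretch contrast so faint text stands out
    image = ImageOps.grayscale(image)
    image = ImageOps.autocontrast(image)
    # Binarize: pixels above the threshold become white, the rest black
    image = image.point(lambda p: 255 if p > 140 else 0)
    return pytesseract.image_to_string(image)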
Explore other use cases
AI-Ready Datasets
Build real-time indexes of your data so AI agents have fresh data.
Agent Actions
Execute AI agent actions as durable, observable tasks.
Context Engineering
Orchestrate LLM context preparation with parallel processing.
Background Tasks
Schedule and run long-running jobs with automatic retries and monitoring.