Building an On-Premise Document Intelligence Stack with Docling, Ollama, Phi-4 | ExtractThinker
In this new era of LLMs, banks and financial institutions are at a disadvantage: frontier models are close to impossible to run on-premise given their hardware requirements. At the same time, the sensitive nature of banking data poses significant privacy concerns, especially when these models are only available as cloud services. To address these challenges, organizations can turn to on-premise or Small Language Model (SLM) setups that keep data in-house and avoid potential leakage of sensitive information. This approach lets you take advantage of advanced LLMs (locally or with minimal external calls) while ensuring strict compliance with regulations such as GDPR, HIPAA, or various financial directives.
This article showcases how you can build a fully on-premise Document Intelligence solution by combining:
- ExtractThinker — an open-source framework orchestrating OCR, classification, and data extraction pipelines for LLMs
- Ollama — a local deployment solution for language models like Phi-4 or Llama 3.x
- Docling or MarkItDown — flexible libraries to handle document loading, OCR, and layout parsing
Whether you’re operating under strict confidentiality rules, dealing with scanned PDFs, or simply want advanced vision-based extraction, this end-to-end stack provides a secure, high-performance pipeline fully within your own infrastructure.
1. Picking the Right Model (Text vs. Vision)
When building a Document Intelligence stack, it’s important to first decide whether you need a text-only model or a vision-capable model. Text-only models are often preferred for on-premise solutions because they’re widely available and less restrictive. However, vision-enabled models can be crucial for advanced splitting tasks, particularly when documents rely on visual cues — like layout, color schemes, or distinct formatting.
In some scenarios, you can pair different models for different stages. For instance, a smaller moondream model (0.5B parameters) might handle splitting, while the Phi-4 14B model manages classification and extraction. Many large institutions prefer deploying a single, more powerful model (e.g., Llama 3.3 or Qwen 2.5 in the 70B range) to cover all use cases. If you only need English-centric IDP, you could simply use Phi-4 for most tasks and keep a lightweight moondream model on standby for edge-case splitting. It all depends on your specific requirements and available infrastructure.
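As a rough sketch of what such a pairing might look like with ExtractThinker and Ollama (using the same model tags that appear in the full example later in this article):
import os
from extract_thinker import Extractor, Process, ImageSplitter
## The heavier text model handles classification and extraction
extractor = Extractor()
extractor.load_llm("ollama/phi4")
## A lightweight vision model only decides where documents split
process = Process()
process.load_splitter(ImageSplitter(model="ollama/moondream:v2"))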
2. Processing Documents: MarkItDown vs. Docling
For document parsing, two popular libraries stand out:
MarkItDown
- Simple and straightforward, backed by Microsoft
- Perfect for direct text-based tasks where you don’t require multiple OCR engines
- Easy to install and integrate
Docling
- More advanced, with multi-OCR support (Tesseract, AWS Textract, Google Document AI, etc.)
- Excellent for scanning workflows or robust extraction from image PDFs
- Detailed documentation, flexible for complex layouts
ExtractThinker lets you swap in either DocumentLoaderMarkItDown or DocumentLoaderDocling depending on your needs: simple digital PDFs or multi-engine OCR.
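Switching between the two is a one-line change. A minimal sketch, using the loaders that appear in the full example later in this article:
from extract_thinker import Extractor, DocumentLoaderDocling, DocumentLoaderMarkItDown
extractor = Extractor()
## Scanned or image-heavy PDFs: Docling with its OCR backends
extractor.load_document_loader(DocumentLoaderDocling())
## Clean, digital-born PDFs: the lighter MarkItDown loader
## extractor.load_document_loader(DocumentLoaderMarkItDown())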
3. Running Local Models
Although Ollama is a popular tool for hosting LLMs locally, there are now several solutions for on-prem deployments that can integrate seamlessly with ExtractThinker:
- LocalAI — An open-source platform that mimics OpenAI’s API locally. It can run LLMs on consumer-grade hardware (even CPU-only), such as Llama 2 or Mistral, and provides a simple endpoint to connect with.
- OpenLLM — A project by BentoML that exposes LLMs via an OpenAI-compatible API. It’s optimized for throughput and low latency, suitable for both on-prem and cloud, and supports a wide range of open-source LLMs.
- Llama.cpp — A lower-level approach for running Llama models with advanced custom configurations. Great for granular control or HPC setups, albeit with more complexity to manage.
Ollama is often a first choice thanks to its ease of setup and simple CLI. However, for enterprise or HPC scenarios, a Llama.cpp server deployment, OpenLLM, or a solution like LocalAI might be more appropriate. All of these can be integrated with ExtractThinker by pointing the base URL (or the corresponding environment variable) in your code at your local LLM endpoint, as sketched below.
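Whichever host you choose, the wiring in ExtractThinker looks the same as in the full example in section 5. A minimal sketch (the endpoint URL and model tag are illustrative):
import os
from extract_thinker import Extractor
extractor = Extractor()
## Ollama's default local endpoint
os.environ["API_BASE"] = "http://localhost:11434"
extractor.load_llm("ollama/phi4")
## For an OpenAI-compatible server (LocalAI, OpenLLM, a llama.cpp server),
## point API_BASE at that server's URL and pass the model name it serves.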
4. Tackling Small Context Windows
When working with local models that have limited context windows (e.g., ~8K tokens or less), it becomes critical to manage two things: how you split documents and how you handle partial responses.
Splitting Documents
To avoid exceeding the model’s input capacity, Lazy Splitting is ideal. Rather than ingesting the entire document at once:
- You incrementally compare pages (e.g., pages 1–2, then 2–3), deciding if they belong to the same sub-document.
- If they do, you keep them together for the next step. If not, you start a new segment.
- This approach is memory-friendly and scales to very large PDFs by only loading and analyzing a couple of pages at a time (a conceptual sketch follows below).
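Conceptually, lazy splitting looks like the sketch below. This is an illustration of the idea, not ExtractThinker's internal implementation, and belong_together stands in for the pairwise model call:
def lazy_split(pages, belong_together):
    """Group pages into sub-documents by comparing consecutive pages only."""
    if not pages:
        return []
    segments = [[pages[0]]]
    for prev_page, page in zip(pages, pages[1:]):
        if belong_together(prev_page, page):  # e.g., a small LLM/vision call on just two pages
            segments[-1].append(page)
        else:
            segments.append([page])  # start a new sub-document
    return segments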
Handling Partial Responses
For smaller local models, each response also risks truncation if the prompt is large. PaginationHandler elegantly addresses this by:
- Splitting the document’s pages for separate requests (one page per request).
- Merging page-level results at the end, with optional conflict resolution if pages disagree on certain fields (a simple merge sketch follows the note below).
Note: Concatenate is ideal when you have a higher token allowance; Paginate is preferred for limited windows.
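The merge itself is handled for you by PaginationHandler. Purely as an illustration of the idea (not the library's implementation), a naive merge over per-page field dictionaries might look like this:
def merge_page_results(page_results):
    """Naively merge per-page field dicts, collecting conflicting values per field."""
    merged, conflicts = {}, {}
    for fields in page_results:
        for name, value in fields.items():
            if value in (None, ""):
                continue  # a page that lacks the field should not overwrite others
            if name not in merged:
                merged[name] = value
            elif merged[name] != value:
                conflicts.setdefault(name, [merged[name]]).append(value)
    return merged, conflicts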
Quick Example Flow
- Lazy Split the PDF so each chunk/page remains below your model’s limit.
- Paginate across pages: each chunk’s result is returned separately.
- Merge the partial page results into the final structured data.
This minimal approach ensures you never exceed the local model’s context window — both in how you feed the PDF and in how you handle multi-page responses.
5. ExtractThinker: Building the Stack
Below is a minimal code snippet showing how to integrate these components. First, install ExtractThinker:
pip install extract-thinker
Document Loader
As discussed above, we can use MarkItDown or Docling.
from extract_thinker import DocumentLoaderMarkItDown, DocumentLoaderDocling
## DocumentLoaderDocling or DocumentLoaderMarkItDown
document_loader = DocumentLoaderDocling()
Defining Contracts
We use Pydantic-based Contracts to specify the structure of data we want to extract. For example, invoices and driver licenses:
from extract_thinker.models.contract import Contract
from pydantic import Field
class InvoiceContract(Contract):
    invoice_number: str = Field(description="Unique invoice identifier")
    invoice_date: str = Field(description="Date of the invoice")
    total_amount: float = Field(description="Overall total amount")

class DriverLicense(Contract):
    name: str = Field(description="Full name on the license")
    age: int = Field(description="Age of the license holder")
    license_number: str = Field(description="License number")
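With a contract in place, a single-document extraction is one call. A minimal sketch, assuming the Extractor.extract(source, contract) call from ExtractThinker's documentation and an illustrative file path:
from extract_thinker import Extractor
extractor = Extractor()
extractor.load_document_loader(document_loader)  # Docling or MarkItDown, as above
extractor.load_llm("ollama/phi4")                # local model served by Ollama
## Returns a populated InvoiceContract instance
invoice = extractor.extract("path/to/invoice.pdf", InvoiceContract)
print(invoice.invoice_number, invoice.total_amount)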
Classification
If you have multiple document types, define Classification objects. You can specify:
- The name of each classification (e.g., “Invoice”).
- A description.
- The contract it maps to.
from extract_thinker import Classification
TEST_CLASSIFICATIONS = [
    Classification(
        name="Invoice",
        description="This is an invoice document",
        contract=InvoiceContract
    ),
    Classification(
        name="Driver License",
        description="This is a driver license document",
        contract=DriverLicense
    )
]
Putting It All Together: Local Extraction Process
Below, we create an Extractor that uses our chosen document_loader
and a local model (Ollama, LocalAI, etc.). Then we build a Process to load, classify, split, and extract in a single pipeline.
import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    SplittingStrategy,
    CompletionStrategy,
    ImageSplitter,
    TextSplitter
)
## Load environment variables (if you store LLM endpoints/API_BASE, etc. in .env)
load_dotenv()
## Example path to a multi-page document
MULTI_PAGE_DOC_PATH = "path/to/your/multi_page_doc.pdf"
def setup_local_process():
    """
    Helper function to set up an ExtractThinker process
    using local LLM endpoints (e.g., Ollama, LocalAI, OnPrem.LLM, etc.)
    """
    # 1) Create an Extractor
    extractor = Extractor()

    # 2) Attach our chosen DocumentLoader (Docling or MarkItDown)
    extractor.load_document_loader(document_loader)

    # 3) Configure your local LLM
    #    For Ollama, you might do:
    os.environ["API_BASE"] = "http://localhost:11434"  # Replace with your local endpoint
    extractor.load_llm("ollama/phi4")  # or "ollama/llama3.3" or your local model

    # 4) Attach extractor to each classification
    TEST_CLASSIFICATIONS[0].extractor = extractor
    TEST_CLASSIFICATIONS[1].extractor = extractor

    # 5) Build the Process
    process = Process()
    process.load_document_loader(document_loader)
    return process
def run_local_idp_workflow():
    """
    Demonstrates loading, classifying, splitting, and extracting
    a multi-page document with a local LLM.
    """
    # Initialize the process
    process = setup_local_process()

    # (Optional) You can use ImageSplitter(model="ollama/moondream:v2") for the split
    process.load_splitter(TextSplitter(model="ollama/phi4"))

    # 1) Load the file
    # 2) Split into sub-documents with the LAZY strategy
    # 3) Classify each sub-document with our TEST_CLASSIFICATIONS
    # 4) Extract fields based on the matched contract (Invoice or DriverLicense)
    result = (
        process
        .load_file(MULTI_PAGE_DOC_PATH)
        .split(TEST_CLASSIFICATIONS, strategy=SplittingStrategy.LAZY)
        .extract(vision=False, completion_strategy=CompletionStrategy.PAGINATE)
    )

    # 'result' is a list of extracted objects (InvoiceContract or DriverLicense)
    for item in result:
        # Print or store each extracted data model
        if isinstance(item, InvoiceContract):
            print("[Extracted Invoice]")
            print(f"Number: {item.invoice_number}")
            print(f"Date: {item.invoice_date}")
            print(f"Total: {item.total_amount}")
        elif isinstance(item, DriverLicense):
            print("[Extracted Driver License]")
            print(f"Name: {item.name}, Age: {item.age}")
            print(f"License #: {item.license_number}")

## For a quick test, just call run_local_idp_workflow()
if __name__ == "__main__":
    run_local_idp_workflow()
6. Privacy and PII: LLMs in the Cloud
Not every organization can — or wants to — run local hardware. Some prefer advanced cloud-based LLMs. If so, keep in mind:
- Data Privacy Risks: Sending sensitive data to the cloud raises potential compliance issues.
- GDPR/HIPAA: Regulations may restrict data from leaving your premises at all.
- VPC + Firewalls: You can isolate cloud resources in private networks, but this adds complexity.
Note: Many LLM APIs (e.g., OpenAI) offer GDPR-compliant options. But if you're heavily regulated or want the freedom to switch providers easily, consider local or masked-cloud approaches.
PII Masking
A robust approach is to build a PII masking pipeline. Tools like Presidio can automatically detect and redact personal identifiers before the data is sent to the LLM. This way, you remain model-agnostic while maintaining compliance.
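A minimal sketch with Presidio's default recognizers (the sample text is illustrative; in practice you would run this over each page or chunk before it reaches the model):
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "Account holder John Smith, phone +1-555-010-9999, IBAN DE89 3704 0044 0532 0130 00."
## Detect PII entities, then replace them with placeholders before any LLM call
results = analyzer.analyze(text=text, language="en")
masked = anonymizer.anonymize(text=text, analyzer_results=results)
print(masked.text)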
7. Conclusion
By combining ExtractThinker with a local LLM (such as Ollama, LocalAI, or OnPrem.LLM) and a flexible DocumentLoader (Docling or MarkItDown), you can build a secure, on-premise Document Intelligence workflow from the ground up. If regulatory requirements demand total privacy or minimal external calls, this stack keeps your data in-house without sacrificing the power of modern LLMs.