Turn messy documents into clean, structured data.

Tempra is an AI-powered pipeline that ingests PDFs, CSVs, and Excel files and outputs validated, analysis-ready data. No configuration needed.

Open source · Apache 2.0 · 336 tests passing

See the difference

Raw, messy data goes in. Clean, structured, validated data comes out.

Before: Raw input

| name          | age      | salary         | hire_date      | email              |
|---------------|----------|----------------|----------------|--------------------|
| John Smith    | thirty   | SIXTY THOUSAND | 04/15/2019     | john@company       |
| JANE DOE      |  42      | €85.000,00     | 2020-Jan-8     | jane.doe@email.com |
| bob williams  | NAN      | 72000          | ERROR          | UNKNOWN            |
|  Alice Brown  | 28       | $91,500        | April 5 2018   | alice@@brown.com   |

After: Tempra output

| name          | age | salary | hire_date  | email              |
|---------------|-----|--------|------------|--------------------|
| John Smith    | 30  | 60000  | 2019-04-15 | john@company.com   |
| Jane Doe      | 42  | 85000  | 2020-01-08 | jane.doe@email.com |
| Bob Williams  | 38  | 72000  | null       | null               |
| Alice Brown   | 28  | 91500  | 2018-04-05 | null               |

Quality Score: 0.972 → 1.000 | Grade: C → A

How it works

01

Ingest

Upload any document. PDF, CSV, Excel, JSON, XML, Parquet, Avro — Tempra reads them all.

02

Clean

AI agents detect and fix issues: bad dates, mixed formats, duplicates, outliers, sentinel values. Zero config.

03

Validate

Get structured output with a quality report, data profiling, and schema validation. Ready for your pipeline.

What gets fixed

Every common data quality issue, handled automatically.

Sentinel Values

Converts ERROR, UNKNOWN, N/A, NULL to proper nulls.
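
As a rough illustration of sentinel handling, the sketch below maps the tokens listed above to `None`. The function name `to_null`, the case-insensitive matching, and treating an empty string as a sentinel are assumptions for this example, not Tempra's actual implementation.

```python
from typing import Optional

# Tokens taken from the description above; the empty string is an added assumption.
SENTINELS = {"ERROR", "UNKNOWN", "N/A", "NULL", ""}


def to_null(value: Optional[str]) -> Optional[str]:
    """Return None for sentinel tokens, otherwise the whitespace-trimmed value."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned.upper() in SENTINELS:
        return None
    return cleaned


row = {"hire_date": "ERROR", "email": "UNKNOWN", "salary": "72000"}
cleaned_row = {key: to_null(val) for key, val in row.items()}
print(cleaned_row)
```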

Format Chaos

Standardizes €2.954,50 and $2,954.50 and "SIXTY THOUSAND" to 60000.
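
One way to picture this normalization is the sketch below. It assumes a simple heuristic: when both `.` and `,` appear, the later one is the decimal mark, and a single separator followed by exactly three digits is a thousands separator. The helpers `parse_amount` and `words_to_number` are hypothetical names, and the word vocabulary is deliberately small.

```python
import re

# Hypothetical vocabulary for spelled-out numbers; extend as needed.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text: str) -> int:
    """Convert a phrase such as 'SIXTY THOUSAND' to an integer."""
    total, current = 0, 0
    for token in re.split(r"[\s-]+", text.strip().lower()):
        if token in UNITS:
            current += UNITS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {token!r}")
    return total + current


def parse_amount(raw: str) -> float:
    """Parse US-style, European-style, or spelled-out amounts into a float."""
    text = re.sub(r"[^\d.,\-]", "", raw)  # drop currency symbols and spaces
    if not any(ch.isdigit() for ch in text):
        return float(words_to_number(raw))
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the later one is the decimal mark.
        decimal = "." if last_dot > last_comma else ","
    elif last_dot != -1 or last_comma != -1:
        sep = "." if last_dot != -1 else ","
        digits_after = len(text) - text.rfind(sep) - 1
        # Heuristic: exactly three trailing digits means a thousands separator.
        decimal = None if digits_after == 3 else sep
    else:
        decimal = None
    if decimal is None:
        normalized = text.replace(".", "").replace(",", "")
    else:
        thousands = "," if decimal == "." else "."
        normalized = text.replace(thousands, "").replace(decimal, ".")
    return float(normalized)


for sample in ["€2.954,50", "$2,954.50", "SIXTY THOUSAND"]:
    print(sample, "->", parse_amount(sample))
```

The three-digit heuristic is ambiguous for values like `1.234`, which is one reason a production system would likely lean on column-level context rather than a single cell.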

Date Mayhem

Parses "April 5 2018", "04/05/2018", "2018-Jan-5" into ISO 8601.
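
A minimal sketch of multi-format date parsing, using only the standard library, might look like this. The format list and the month-first reading of `04/05/2018` are assumptions (the sample table's `04/15/2019` suggests month-first), and `to_iso_date` is a hypothetical name. Month names are matched in the current locale, so this assumes an English or C locale.

```python
from datetime import datetime
from typing import Optional

# Candidate formats, tried in order. Slash dates are assumed to be month-first.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d %Y", "%Y-%b-%d")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = " ".join(raw.split())  # collapse stray whitespace
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for sample in ["April 5 2018", "04/05/2018", "2018-Jan-5", "ERROR"]:
    print(sample, "->", to_iso_date(sample))
```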

Duplicate Rows

Detects and removes exact and fuzzy duplicates.
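
To make "exact and fuzzy" concrete, here is a small sketch that normalizes each row and compares it to rows already kept using `difflib.SequenceMatcher` from the standard library. The 0.9 similarity threshold and the `dedupe` helper are assumptions for illustration. Tempra's own matching strategy is not documented here.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def row_key(row: Sequence[str]) -> str:
    """Case-fold and collapse whitespace so trivial differences compare equal."""
    return " | ".join(" ".join(str(cell).split()).casefold() for cell in row)


def dedupe(rows: List[Row], threshold: float = 0.9) -> List[Row]:
    """Keep the first occurrence of each row; drop exact and near duplicates."""
    kept: List[Row] = []
    kept_keys: List[str] = []
    for row in rows:
        key = row_key(row)
        is_duplicate = any(
            key == other or SequenceMatcher(None, key, other).ratio() >= threshold
            for other in kept_keys
        )
        if not is_duplicate:
            kept.append(row)
            kept_keys.append(key)
    return kept


rows = [
    ("John Smith", "30"),
    ("john  smith", "30"),  # exact duplicate after normalization
    ("Jon Smith", "30"),    # fuzzy duplicate (one character missing)
    ("Jane Doe", "42"),
]
unique_rows = dedupe(rows)
print(unique_rows)
```

The pairwise comparison is quadratic in the number of rows, so this shape only suits small tables. Larger inputs usually need blocking or hashing before fuzzy comparison.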

Outlier Detection

Applies IQR-based winsorization to outliers and automatically skips financial columns.
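
For readers unfamiliar with the technique, the sketch below clamps values to the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The multiplier 1.5, the quartile method, and the name-based `is_financial` check are assumptions. How Tempra actually identifies financial columns is not stated here.

```python
import statistics
from typing import List

# Hypothetical column-name hints used to skip financial columns.
FINANCIAL_HINTS = ("salary", "price", "amount", "revenue", "cost")


def is_financial(column_name: str) -> bool:
    """Guess from the column name whether values are monetary."""
    return any(hint in column_name.lower() for hint in FINANCIAL_HINTS)


def winsorize_iqr(values: List[float], k: float = 1.5) -> List[float]:
    """Clamp values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


columns = {"age": [28, 30, 38, 42, 400], "salary": [60000, 72000, 85000, 91500]}
cleaned = {
    name: values if is_financial(name) else winsorize_iqr(values)
    for name, values in columns.items()
}
print(cleaned)
```

In this example the age of 400 is pulled down to the upper fence of 60, while the salary column is left untouched.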

Schema Validation

Validates output against JSON schemas with regex patterns.
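
The sketch below shows the general idea with a hand-rolled checker that understands only a small subset of JSON Schema (`required`, `type`, `pattern`). It is not a full validator and not Tempra's code. The schema, its regex patterns, and `validate_row` are assumptions built around the sample table's columns.

```python
import re
from typing import Any, Dict, List

# Illustrative schema; the patterns are assumptions, not Tempra's rules.
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["name", "hire_date"],
    "properties": {
        "name": {"type": "string"},
        "hire_date": {"type": ["string", "null"], "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "email": {"type": ["string", "null"], "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    },
}

TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "null": lambda v: v is None,
}


def validate_row(row: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in row:
            errors.append(f"{field}: missing required field")
    for field, rules in schema.get("properties", {}).items():
        if field not in row:
            continue
        value = row[field]
        allowed = rules.get("type", [])
        allowed = [allowed] if isinstance(allowed, str) else allowed
        if allowed and not any(TYPE_CHECKS[t](value) for t in allowed):
            errors.append(f"{field}: expected type {allowed}")
            continue
        pattern = rules.get("pattern")
        # As in JSON Schema, patterns apply only to strings and are unanchored.
        if pattern and isinstance(value, str) and not re.search(pattern, value):
            errors.append(f"{field}: does not match {pattern}")
    return errors


good = {"name": "Alice Brown", "hire_date": "2018-04-05", "email": None}
bad = {"name": "Alice Brown", "hire_date": "April 5 2018", "email": "alice@@brown.com"}
print(validate_row(good, SCHEMA))
print(validate_row(bad, SCHEMA))
```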

Validated on real-world dirty datasets

Average quality improvement: +0.018 across 13,000 rows from 4 public datasets.

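| Dataset    | Domain           | Rows   | Before | After | Improvement |
|------------|------------------|--------|--------|-------|-------------|
| HR Messy   | Employee records | 1,000  | 0.973  | 0.981 | +0.008      |
| Healthcare | Patient records  | 1,000  | 0.960  | 0.979 | +0.018      |
| Warehouse  | Inventory        | 1,000  | 0.983  | 1.000 | +0.017      |
| Cafe Sales | Transactions     | 10,000 | 0.972  | 1.000 | +0.028      |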
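
The headline figures can be cross-checked from the table. The short sketch below assumes the average is a simple per-dataset mean of the Improvement column as printed (not weighted by row count), which does reproduce +0.018 and 13,000 rows. `summarize` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# (dataset, rows, improvement) copied from the table above.
RESULTS = [
    ("HR Messy", 1_000, Decimal("0.008")),
    ("Healthcare", 1_000, Decimal("0.018")),
    ("Warehouse", 1_000, Decimal("0.017")),
    ("Cafe Sales", 10_000, Decimal("0.028")),
]


def summarize(results):
    """Return total rows and the unweighted mean improvement, rounded to 3 places."""
    total_rows = sum(rows for _, rows, _ in results)
    mean = sum(delta for _, _, delta in results) / len(results)
    return total_rows, mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


total_rows, mean_improvement = summarize(RESULTS)
print(total_rows, mean_improvement)
```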

Under the hood

Upload → Parse → Profile → [AI Agents] → Clean → Validate → Output
                              ├── Structuring Agent
                              ├── Normalization Agent
                              ├── Layout Agent
                              └── Human-in-the-loop

Python · FastAPI · Pandas · scikit-learn · OpenAI · Docker · Kubernetes

Get early access

Tempra's hosted platform is launching soon. Join the waitlist to be first in line.

Or self-host now: View on GitHub