Turn messy documents into clean, structured data.
Tempra is an AI-powered pipeline that ingests PDFs, CSVs, and Excel files and outputs validated, analysis-ready data. No configuration needed.
Open source · Apache 2.0 · 336 tests passing
See the difference
Raw, messy data goes in. Clean, structured, validated data comes out.
Before:

| name | age | salary | hire_date | email |
|---|---|---|---|---|
| John Smith | thirty | SIXTY THOUSAND | 04/15/2019 | john@company |
| JANE DOE | 42 | €85.000,00 | 2020-Jan-8 | jane.doe@email.com |
| bob williams | NAN | 72000 | ERROR | UNKNOWN |
| Alice Brown | 28 | $91,500 | April 5 2018 | alice@@brown.com |

After:

| name | age | salary | hire_date | email |
|---|---|---|---|---|
| John Smith | 30 | 60000 | 2019-04-15 | john@company.com |
| Jane Doe | 42 | 85000 | 2020-01-08 | jane.doe@email.com |
| Bob Williams | 38 | 72000 | null | null |
| Alice Brown | 28 | 91500 | 2018-04-05 | null |
How it works
Ingest
Upload any document: PDF, CSV, Excel, JSON, XML, Parquet, Avro. Tempra reads them all.
Clean
AI agents detect and fix issues: bad dates, mixed formats, duplicates, outliers, sentinel values. Zero config.
Validate
Get structured output with a quality report, data profiling, and schema validation. Ready for your pipeline.
What gets fixed
Every common data quality issue, handled automatically.
Sentinel Values
Converts ERROR, UNKNOWN, N/A, NULL to proper nulls.
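As a rough illustration of the idea (not Tempra's actual implementation), sentinel handling can be sketched as a lookup against a set of known placeholder strings. The function name `to_null` and the exact sentinel list are assumptions; the list only contains the four values named above.

```python
from typing import Optional

# Assumed sentinel list: only the four placeholders named on this page.
SENTINELS = {"ERROR", "UNKNOWN", "N/A", "NULL"}


def to_null(value: Optional[str]) -> Optional[str]:
    """Return None for sentinel strings, otherwise the whitespace-stripped value."""
    if value is None:
        return None
    cleaned = value.strip()
    # Compare case-insensitively so "unknown" and "UNKNOWN" are treated alike.
    return None if cleaned.upper() in SENTINELS else cleaned


row = {"hire_date": "ERROR", "email": "UNKNOWN", "salary": "72000"}
cleaned_row = {key: to_null(val) for key, val in row.items()}
```

Here `cleaned_row` keeps the salary string and maps the two placeholders to `None`.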
Format Chaos
Standardizes mixed number formats: €2.954,50 and $2,954.50 become the same plain number, and "SIXTY THOUSAND" becomes 60000.
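One way to reconcile European and US number formats is to treat the last `.` or `,` as the decimal separator only when it is followed by one or two digits. The sketch below assumes that heuristic; `parse_amount` is a hypothetical helper, not a Tempra function, and it does not cover spelled-out numbers such as "SIXTY THOUSAND".

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse a currency-like string such as '€2.954,50' or '$2,954.50'."""
    # Keep only digits, separators, and a minus sign.
    digits = re.sub(r"[^\d.,-]", "", text)
    sep_pos = max(digits.rfind("."), digits.rfind(","))
    # Heuristic (an assumption): a final separator followed by 1-2 digits
    # is a decimal separator; anything else is a thousands separator.
    if sep_pos != -1 and len(digits) - sep_pos - 1 in (1, 2):
        integer_part = re.sub(r"[.,]", "", digits[:sep_pos])
        return Decimal(f"{integer_part}.{digits[sep_pos + 1:]}")
    return Decimal(re.sub(r"[.,]", "", digits))


euro = parse_amount("€2.954,50")
dollar = parse_amount("$2,954.50")
```

Both calls yield the same `Decimal` value. The heuristic is ambiguous for inputs like `1,5` versus `1,500`, so a real pipeline would likely also use column-level context.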
Date Mayhem
Parses "April 5 2018", "04/05/2018", "2018-Jan-5" into ISO 8601.
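A minimal sketch of multi-format date parsing tries a fixed list of `strptime` formats in order. The format list and the name `to_iso_date` are assumptions for illustration. Slash dates are read month-first here, matching the 04/15/2019 example earlier on this page, and month names assume an English locale.

```python
from datetime import datetime
from typing import Optional

# Assumed candidate formats, tried in order.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d %Y", "%Y-%b-%d")


def to_iso_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next candidate format.
    return None


parsed = [to_iso_date(s) for s in ("April 5 2018", "04/05/2018", "2018-Jan-5", "ERROR")]
```

Unparseable inputs such as `"ERROR"` come back as `None` rather than raising, which lines up with the null handling described above.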
Duplicate Rows
Detects and removes exact and fuzzy duplicates.
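Exact and fuzzy duplicate detection can be approximated with the standard library's `difflib.SequenceMatcher`. This is a sketch under assumptions: the 0.9 similarity threshold, the normalization step, and the name `drop_duplicates` are all hypothetical, and the pairwise comparison is quadratic, so it only suits small tables.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def drop_duplicates(rows: Sequence[Tuple[str, ...]], threshold: float = 0.9) -> List[Tuple[str, ...]]:
    """Keep the first occurrence of each row; drop exact and near-identical repeats."""
    kept: List[Tuple[str, Tuple[str, ...]]] = []
    for row in rows:
        # Normalize case and whitespace so "JOHN SMITH" equals "john smith".
        key = " ".join(str(value).strip().lower() for value in row)
        is_duplicate = any(
            SequenceMatcher(None, key, seen_key).ratio() >= threshold
            for seen_key, _ in kept
        )
        if not is_duplicate:
            kept.append((key, row))
    return [row for _, row in kept]


rows = [
    ("John Smith", "30"),
    ("john smith", "30"),   # exact duplicate after normalization
    ("Jon Smith", "30"),    # fuzzy duplicate (one character missing)
    ("Alice Brown", "28"),
]
unique_rows = drop_duplicates(rows)
```

With this threshold, only the first John Smith row and the Alice Brown row survive.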
Outlier Detection
Applies IQR-based winsorization and automatically skips financial columns.
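IQR-based winsorization clamps values that fall outside the fences Q1 - k·IQR and Q3 + k·IQR instead of dropping them. The sketch below uses the conventional k = 1.5, which is an assumption, as is the name `winsorize_iqr`. Skipping financial columns is not shown.

```python
from statistics import quantiles
from typing import List, Sequence


def winsorize_iqr(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Clamp values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


ages = [10, 12, 11, 13, 12, 11, 1000]
clamped = winsorize_iqr(ages)
```

In `clamped`, the six in-range values are unchanged and the extreme value is pulled down to the upper fence.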
Schema Validation
Validates output against JSON schemas with regex patterns.
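To show what pattern-based schema checks look like, here is a deliberately tiny validator that understands only the `required` and `pattern` keywords. It is not a full JSON Schema implementation and not Tempra's validator. The schema contents and the name `validate_row` are assumptions for illustration.

```python
import re
from typing import Any, Dict, List

# Hypothetical schema for the employee table shown earlier on this page.
SCHEMA: Dict[str, Any] = {
    "required": ["name", "hire_date"],
    "properties": {
        "hire_date": {"pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "email": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    },
}


def validate_row(row: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the row passed."""
    errors = [f"{field}: missing" for field in schema.get("required", []) if field not in row]
    for field, rules in schema.get("properties", {}).items():
        value = row.get(field)
        pattern = rules.get("pattern")
        # As in JSON Schema, "pattern" applies only to strings; nulls pass through.
        if pattern and isinstance(value, str) and not re.search(pattern, value):
            errors.append(f"{field}: does not match {pattern}")
    return errors


good = validate_row({"name": "Alice Brown", "hire_date": "2018-04-05", "email": None}, SCHEMA)
bad = validate_row({"name": "Alice Brown", "hire_date": "April 5 2018", "email": "alice@@brown.com"}, SCHEMA)
```

The first row passes with no errors. The second reports both the non-ISO date and the malformed email.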
Validated on real-world dirty datasets
Average quality improvement: +0.018 (unweighted mean of the four per-dataset improvements), measured on 13,000 rows from 4 public datasets.
| Dataset | Domain | Rows | Before | After | Improvement |
|---|---|---|---|---|---|
| HR Messy | Employee records | 1,000 | 0.973 | 0.981 | +0.008 |
| Healthcare | Patient records | 1,000 | 0.960 | 0.979 | +0.018 |
| Warehouse | Inventory | 1,000 | 0.983 | 1.000 | +0.017 |
| Cafe Sales | Transactions | 10,000 | 0.972 | 1.000 | +0.028 |
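The headline average can be reproduced from the Improvement column. The short calculation below uses the figures exactly as reported in the table. It shows that +0.018 corresponds to the simple mean over the four datasets, while weighting by row count would give a higher figure because Cafe Sales contributes most of the rows.

```python
from decimal import Decimal

# (dataset, rows, improvement) copied from the table above.
RESULTS = [
    ("HR Messy", 1000, Decimal("0.008")),
    ("Healthcare", 1000, Decimal("0.018")),
    ("Warehouse", 1000, Decimal("0.017")),
    ("Cafe Sales", 10000, Decimal("0.028")),
]

total_rows = sum(rows for _, rows, _ in RESULTS)

# Simple mean: every dataset counts equally.
simple_mean = sum(delta for _, _, delta in RESULTS) / len(RESULTS)

# Row-weighted mean: every row counts equally.
weighted_mean = sum(rows * delta for _, rows, delta in RESULTS) / total_rows

print(total_rows, simple_mean, round(weighted_mean, 3))
```

The simple mean is 0.01775, which rounds to the reported +0.018. The row-weighted mean is roughly 0.025.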
Under the hood
Upload → Parse → Profile → [AI Agents] → Clean → Validate → Output
├── Structuring Agent
├── Normalization Agent
├── Layout Agent
└── Human-in-the-loop
Get early access
Tempra's hosted platform is launching soon. Join the waitlist to be first in line.
Or self-host now: View on GitHub