
Helping AI understand your data

Learn how to write semantic definitions that help the AI understand your documents and extract the right data

Updated today

Overview

The AI in Kodexa uses semantic definitions (prompts) to understand what data you want to extract from documents. By providing clear, descriptive prompts, you guide the AI to accurately identify and extract the specific information you need.

What is a Semantic Definition?

A semantic definition is a natural language description that tells the AI:

  • What to look for - The type of information you want to extract

  • Where to find it - Location hints or context clues

  • How to interpret it - Format expectations or validation rules

  • Why it matters - The business meaning of the data

Example: Instead of just labeling a field "Date", you might write: "The invoice issue date, typically found near the top of the document in MM/DD/YYYY format"
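The four parts above can be sketched as a small helper that assembles a prompt. This is only an illustration for drafting prompts outside the product: `SemanticDefinition` and `to_prompt` are hypothetical names, not part of Kodexa, and the Semantics tab simply accepts the finished text.

```python
from dataclasses import dataclass


@dataclass
class SemanticDefinition:
    """Hypothetical helper holding the four parts of a semantic definition."""
    what: str        # the information to extract
    where: str = ""  # location hints or context clues
    how: str = ""    # format expectations
    why: str = ""    # business meaning of the data

    def to_prompt(self) -> str:
        # Append only the parts that were filled in, in a fixed order.
        prompt = self.what
        if self.where:
            prompt += f", typically found {self.where}"
        if self.how:
            prompt += f", in {self.how}"
        if self.why:
            prompt += f". {self.why}"
        return prompt


invoice_date = SemanticDefinition(
    what="The invoice issue date",
    where="near the top of the document",
    how="MM/DD/YYYY format",
)
print(invoice_date.to_prompt())
# The invoice issue date, typically found near the top of the document, in MM/DD/YYYY format
```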

Where to Add Semantic Definitions

To add a semantic definition to a data element:

  1. Open your Data Definition

  2. Select the data element you want to describe

  3. Navigate to the Semantics tab

  4. Enter your description in the text area

  5. The AI will use this prompt when extracting data

The Semantics Tab

The Semantics tab provides different input methods depending on the data source type of the element:

  • Document (default) - Text area for natural language prompt

  • Formula - Formula editor with available elements

  • Expression - Code editor for Groovy expressions

  • Review - Template editor for review workflows

  • External - Expression editor for external data sources

Each data source type has a UI tailored to the kind of definition it needs.

AI Generation Feature

The Semantics tab includes an AI generation feature that can suggest semantic definitions based on:

  • The element's name and label

  • Its position in the Data Definition hierarchy

  • The data type you selected

  • Related elements in the same group

This helps you get started quickly with a well-formed prompt that you can then refine.

Writing Effective Prompts

Be Specific

Weak: "Get the date"

Better: "The invoice date, usually found in the header section"

Best: "The invoice issue date (not the due date), typically found in the upper right corner of the document near the invoice number, in MM/DD/YYYY format"

Provide Context

  • Location - "Found in the header", "Near the bottom of the first page"

  • Relationship - "The date directly below the invoice number"

  • Visual cues - "Listed in a table", "In bold text"

  • Format - "As a currency with two decimal places", "As YYYY-MM-DD"

Distinguish Similar Fields

When you have multiple similar fields, explain the difference:

  • "The issue date (when the invoice was created)"

  • "The due date (when payment is required)"

  • "The service date (when the service was provided)"

Include Examples

When the format varies, provide examples:

"The total amount, which may appear as '$1,234.56' or '1234.56 USD' or '1,234.56'"
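If you post-process extracted values yourself, the three variations in that prompt can be normalized to one number. The sketch below is an assumption about downstream handling, not something Kodexa does for you: `parse_amount` is a hypothetical function, and it assumes US-style formatting (comma as thousands separator, period as decimal point).

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Normalize strings such as '$1,234.56' or '1234.56 USD' to a Decimal."""
    # Find the first run of digits, allowing commas and one decimal part.
    # Currency symbols and codes around the number are ignored.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


for sample in ["$1,234.56", "1234.56 USD", "1,234.56"]:
    print(sample, "->", parse_amount(sample))
```

All three samples normalize to the same value, which makes it easier to compare extractions across documents that format totals differently.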

Common Prompt Patterns

For Text Fields

"The [field name], typically found [location], appearing as [format or pattern]"

Example: "The vendor name, typically found at the top of the invoice, appearing as the first bold text after any logo"

For Numeric Fields

"The [field name] amount, usually [location], formatted as [currency/number format]"

Example: "The subtotal amount, usually found above the tax and total lines, formatted as a currency with two decimal places"

For Date Fields

"The [type] date, located [position], in [format] format"

Example: "The invoice date, located in the upper right corner near the invoice number, in MM/DD/YYYY format"

For Table/Line Items

"The [field name] for each line item in the [table name] table"

Example: "The quantity for each line item in the items table, appearing in the second column after the description"

For Addresses

"The [type] address, including [components], typically [location]"

Example: "The shipping address, including street, city, state, and ZIP code, typically found in a box on the left side of the document"
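The patterns above are fill-in-the-blank templates, so they can be kept in one place and reused when drafting prompts for many elements. A minimal sketch, assuming you draft prompts in a script before pasting them into the Semantics tab; `PROMPT_PATTERNS` and `build_prompt` are hypothetical names.

```python
# Templates copied from the patterns above; placeholder names are illustrative.
PROMPT_PATTERNS = {
    "text": "The {field}, typically found {location}, appearing as {format}",
    "numeric": "The {field} amount, usually {location}, formatted as {format}",
    "date": "The {field} date, located {location}, in {format} format",
    "line_item": "The {field} for each line item in the {table} table",
    "address": "The {field} address, including {components}, typically {location}",
}


def build_prompt(kind: str, **values: str) -> str:
    """Fill the template for `kind`; raises KeyError if a placeholder is missing."""
    return PROMPT_PATTERNS[kind].format(**values)


print(build_prompt(
    "date",
    field="invoice",
    location="in the upper right corner near the invoice number",
    format="MM/DD/YYYY",
))
# The invoice date, located in the upper right corner near the invoice number, in MM/DD/YYYY format
```

Using one template per field type also keeps similar elements phrased consistently.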

Improving AI Accuracy

1. Clean and Structure Your Data Definition

  • Use clear, descriptive element names

  • Group related elements together

  • Use appropriate data types (number for quantities, date for dates)

  • Organize hierarchically (Invoice > Header > Invoice Number)
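To make the hierarchy idea concrete, here is a small sketch of a Data Definition as a tree of named, typed elements. It is a planning aid only: the `Element` class and `paths` function are hypothetical and do not reflect how Kodexa stores Data Definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for one element in a Data Definition."""
    name: str
    data_type: str = "group"  # e.g. "string", "number", "date"
    children: List["Element"] = field(default_factory=list)


def paths(element: Element, prefix: str = "") -> List[str]:
    """List the full path of every leaf element, e.g. 'Invoice > Header > Invoice Number'."""
    path = f"{prefix} > {element.name}" if prefix else element.name
    if not element.children:
        return [path]
    result = []
    for child in element.children:
        result.extend(paths(child, path))
    return result


invoice = Element("Invoice", children=[
    Element("Header", children=[
        Element("Invoice Number", "string"),
        Element("Invoice Date", "date"),
    ]),
    Element("Line Items", children=[Element("Quantity", "number")]),
])
for p in paths(invoice):
    print(p)
```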

2. Use Project Templates

  • Templates include pre-configured Data Definitions

  • They provide optimized prompts for common document types

  • They use proven patterns that work well with AI

  • Starting from a template saves time and improves initial accuracy

3. Review and Correct Extractions

  • The AI learns from your corrections

  • Consistently correcting mistakes improves future accuracy

  • Use the review interface to validate extractions

  • Provide feedback on what was extracted correctly and incorrectly

4. Test with Sample Documents

  • Upload a variety of document formats

  • Test edge cases (missing fields, unusual formats)

  • Refine prompts based on actual extraction results

  • Iterate until accuracy is consistently high
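One way to judge whether accuracy is "consistently high" is to measure it per field against hand-checked values for your sample documents. The sketch below assumes you have exported extracted values and the expected values as plain dictionaries; `field_accuracy` and the sample field names are hypothetical.

```python
from collections import defaultdict


def field_accuracy(extracted, expected):
    """Return, per field, the share of documents where the extracted value matched."""
    correct = defaultdict(int)
    total = defaultdict(int)
    # Documents are paired by position: extracted[i] belongs to expected[i].
    for got, want in zip(extracted, expected):
        for name, value in want.items():
            total[name] += 1
            if got.get(name) == value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


expected = [
    {"invoice_date": "2024-01-15", "total": "100.00"},
    {"invoice_date": "2024-02-01", "total": "250.00"},
]
extracted = [
    {"invoice_date": "2024-01-15", "total": "100.00"},
    {"invoice_date": "2024-02-10", "total": "250.00"},  # wrong date
]
print(field_accuracy(extracted, expected))
# {'invoice_date': 0.5, 'total': 1.0}
```

A field with low accuracy is a good candidate for a more specific prompt.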

5. Use Validation Rules

  • Define expected formats (regex patterns)

  • Set value ranges (min/max for numbers)

  • Specify required fields

  • Validation catches extraction errors automatically
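The three kinds of rule listed above (format, range, required) can be sketched in a few lines. This illustrates the idea only and is not Kodexa's validation engine or its configuration format; `Rule`, `validate`, and the `INV-` number pattern are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical validation rule for one extracted field."""
    required: bool = False
    pattern: Optional[str] = None    # regex the whole value must match
    minimum: Optional[float] = None  # inclusive lower bound
    maximum: Optional[float] = None  # inclusive upper bound


def validate(value, rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value passed."""
    if value is None or value == "":
        return ["missing required value"] if rule.required else []
    problems = []
    if rule.pattern and not re.fullmatch(rule.pattern, str(value)):
        problems.append("does not match expected format")
    if rule.minimum is not None or rule.maximum is not None:
        try:
            number = float(value)
        except (TypeError, ValueError):
            return problems + ["not a number"]
        if rule.minimum is not None and number < rule.minimum:
            problems.append("below minimum")
        if rule.maximum is not None and number > rule.maximum:
            problems.append("above maximum")
    return problems


print(validate("INV-00123", Rule(required=True, pattern=r"INV-\d{5}")))  # []
print(validate("150", Rule(minimum=0, maximum=100)))  # ['above maximum']
```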

Document-Specific Guidance

For Invoices

  • Distinguish between subtotal, tax, and total amounts

  • Specify which date (invoice date vs. due date vs. service date)

  • Describe line item structure clearly

  • Note currency and decimal formatting

For Contracts

  • Identify parties clearly (vendor vs. customer)

  • Distinguish between effective date and execution date

  • Describe section structure (terms, conditions, signatures)

  • Note multi-page layout considerations

For Forms

  • Reference field labels on the form

  • Describe checkbox or radio button selections

  • Note handwritten vs. printed text differences

  • Specify expected value formats

For Receipts

  • Describe compact layouts clearly

  • Distinguish items from totals

  • Note abbreviations commonly used

  • Handle variable formatting (thermal printer output)

Common Challenges and Solutions

Challenge: Field Appears in Multiple Locations

Solution: Specify which instance: "The total amount at the bottom of the invoice (not the subtotal amounts within line items)"

Challenge: Format Varies Across Documents

Solution: List format variations: "The date, which may appear as MM/DD/YYYY, DD/MM/YYYY, or Month DD, YYYY"
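If you normalize extracted dates downstream, the same list of variations can drive the parsing. A minimal sketch using Python's standard `datetime.strptime`; `parse_date` and `DATE_FORMATS` are hypothetical names. Note that a value such as 03/04/2024 is valid as both MM/DD/YYYY and DD/MM/YYYY, so the order of the list decides how it is read.

```python
from datetime import datetime

# Tried in order; put the format your documents use most often first.
DATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]


def parse_date(text, formats=DATE_FORMATS):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


print(parse_date("03/04/2024"))     # 2024-03-04 (read as MM/DD/YYYY)
print(parse_date("25/12/2024"))     # 2024-12-25 (falls back to DD/MM/YYYY)
print(parse_date("March 4, 2024"))  # 2024-03-04
```

Stating the expected format in the prompt, where you know it, avoids relying on this kind of guesswork.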

Challenge: Field May Be Missing

Solution: Mark as optional and describe typical location: "The purchase order number, if present, usually found near the invoice number"

Challenge: Extracting from Tables

Solution: Describe table structure and column: "The unit price for each line item, found in the third column of the items table"

Challenge: Handwritten or Poor Quality

Solution: Note this in the prompt: "The signature date, which may be handwritten and require special attention"

Best Practices

  • Start simple - Begin with basic prompts and add detail as needed

  • Test incrementally - Add elements gradually and validate each

  • Use consistent naming - Similar elements should have similar prompt patterns

  • Leverage AI suggestions - Use the generation feature as a starting point

  • Document assumptions - Note any expectations about document structure

  • Review regularly - Update prompts based on real extraction results

  • Learn from templates - Study how templates describe common fields

The Continuous Improvement Cycle

  1. Define - Write clear semantic definitions for your data elements

  2. Extract - Process documents with the AI

  3. Review - Check extraction accuracy and quality

  4. Correct - Fix any inaccuracies in the extracted data

  5. Learn - The AI improves from your corrections

  6. Refine - Update prompts based on what you learned

  7. Repeat - Continue the cycle for ongoing improvement

Repeating this cycle helps the AI become more accurate over time as it learns your specific document patterns and extraction needs.

Tips

  • The more specific your prompt, the better the extraction accuracy

  • Use the AI generation feature to get started, then refine

  • Test with diverse document samples to catch edge cases

  • Group related elements together for better context

  • Use appropriate data types to help the AI understand format expectations

  • Regular review and correction significantly improves model accuracy

  • Project templates provide excellent examples of effective prompts

  • The Semantics tab adapts its interface based on your data source type
