Overview
The AI in Kodexa uses semantic definitions (prompts) to understand what data you want to extract from documents. By providing clear, descriptive prompts, you guide the AI to accurately identify and extract the specific information you need.
What is a Semantic Definition?
A semantic definition is a natural language description that tells the AI:
What to look for - The type of information you want to extract
Where to find it - Location hints or context clues
How to interpret it - Format expectations or validation rules
Why it matters - The business meaning of the data
Example: Instead of just labeling a field "Date", you might write: "The invoice issue date, typically found near the top of the document in MM/DD/YYYY format"
Where to Add Semantic Definitions
To add a semantic definition to a data element:
Open your Data Definition
Select the data element you want to describe
Navigate to the Semantics tab
Enter your description in the text area
Save your changes; the AI uses this description as its prompt when extracting data
The Semantics Tab
The Semantics tab provides different input methods depending on your data source:
Document (default) - Text area for natural language prompt
Formula - Formula editor with available elements
Expression - Code editor for Groovy expressions
Review - Template editor for review workflows
External - Expression editor for external data sources
Each data source type has a UI tailored to the kind of definition it needs.
AI Generation Feature
The Semantics tab includes an AI generation feature that can suggest semantic definitions based on:
The element's name and label
Its position in the Data Definition hierarchy
The data type you selected
Related elements in the same group
This helps you get started quickly with a well-formed prompt that you can then refine.
Writing Effective Prompts
Be Specific
Weak: "Get the date"
Better: "The invoice date, usually found in the header section"
Best: "The invoice issue date (not the due date), typically found in the upper right corner of the document near the invoice number, in MM/DD/YYYY format"
Provide Context
Location - "Found in the header", "Near the bottom of the first page"
Relationship - "The date directly below the invoice number"
Visual cues - "Listed in a table", "In bold text"
Format - "As a currency with two decimal places", "As YYYY-MM-DD"
Distinguish Similar Fields
When you have multiple similar fields, explain the difference:
"The issue date (when the invoice was created)"
"The due date (when payment is required)"
"The service date (when the service was provided)"
Include Examples
When the format varies, provide examples:
"The total amount, which may appear as '$1,234.56' or '1234.56 USD' or '1,234.56'"
Common Prompt Patterns
For Text Fields
"The [field name], typically found [location], appearing as [format or pattern]"
Example: "The vendor name, typically found at the top of the invoice, appearing as the first bold text after any logo"
For Numeric Fields
"The [field name] amount, usually [location], formatted as [currency/number format]"
Example: "The subtotal amount, usually found above the tax and total lines, formatted as a currency with two decimal places"
For Date Fields
"The [type] date, located [position], in [format] format"
Example: "The invoice date, located in the upper right corner near the invoice number, in MM/DD/YYYY format"
For Table/Line Items
"The [field name] for each line item in the [table name] table"
Example: "The quantity for each line item in the items table, appearing in the second column after the description"
For Addresses
"The [type] address, including [components], typically [location]"
Example: "The shipping address, including street, city, state, and ZIP code, typically found in a box on the left side of the document"
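The patterns above are plain sentence templates, so they can be filled in programmatically when you draft prompts for many similar elements. The following is a minimal sketch in Python, not part of Kodexa: the `PATTERNS` table and the `fill_pattern` helper are hypothetical names introduced only for illustration, and the resulting text would still be pasted into the Semantics tab by hand.

```python
# Hypothetical templates mirroring the prompt patterns described above.
PATTERNS = {
    "text": "The {field}, typically found {location}, appearing as {format}",
    "numeric": "The {field} amount, usually {location}, formatted as {format}",
    "date": "The {field} date, located {location}, in {format} format",
    "line_item": "The {field} for each line item in the {table} table",
}


def fill_pattern(kind: str, **values: str) -> str:
    """Fill one of the templates; raises KeyError if a placeholder is missing."""
    return PATTERNS[kind].format(**values)


# Rebuilds the date example given earlier in this section.
print(
    fill_pattern(
        "date",
        field="invoice",
        location="in the upper right corner near the invoice number",
        format="MM/DD/YYYY",
    )
)
```

Keeping the templates in one place makes it easier to give similar elements consistently worded prompts.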
Improving AI Accuracy
1. Clean and Structure Your Data Definition
Use clear, descriptive element names
Group related elements together
Use appropriate data types (number for quantities, date for dates)
Organize hierarchically (Invoice > Header > Invoice Number)
2. Use Project Templates
Templates include pre-configured Data Definitions
They provide optimized prompts for common document types
They use proven patterns that work well with AI
Starting from a template saves time and improves initial accuracy
3. Review and Correct Extractions
The AI learns from your corrections
Consistently correcting mistakes improves future accuracy
Use the review interface to validate extractions
Provide feedback on what was extracted correctly and incorrectly
4. Test with Sample Documents
Upload a variety of document formats
Test edge cases (missing fields, unusual formats)
Refine prompts based on actual extraction results
Iterate until accuracy is consistently high
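One way to judge whether accuracy is "consistently high" is to compare extracted values against hand-checked expected values, field by field. The sketch below assumes you have exported both sets as plain Python dictionaries; the `field_accuracy` function and the sample field names are hypothetical and do not correspond to any Kodexa API.

```python
from collections import defaultdict


def field_accuracy(expected_docs, extracted_docs):
    """Return, per field, the fraction of documents whose value matched exactly."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expected, extracted in zip(expected_docs, extracted_docs):
        for field, want in expected.items():
            totals[field] += 1
            # A field missing from the extraction counts as a miss.
            if extracted.get(field) == want:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


# Hand-checked values for two sample documents (illustrative data only).
expected = [
    {"invoice_date": "2024-01-15", "total": "1234.56"},
    {"invoice_date": "2024-02-01", "total": "99.00"},
]
# Values returned by the extraction for the same documents.
extracted = [
    {"invoice_date": "2024-01-15", "total": "1234.56"},
    {"invoice_date": "2024-02-10", "total": "99.00"},
]

print(field_accuracy(expected, extracted))
```

A field with a low score is a good candidate for a more specific prompt; re-run the comparison after each refinement to confirm the change helped.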
5. Use Validation Rules
Define expected formats (regex patterns)
Set value ranges (min/max for numbers)
Specify required fields
Validation catches extraction errors automatically
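To make the three kinds of rule concrete, here is a small Python sketch of how a format pattern, a value range, and a required flag might be checked against an extracted value. It only illustrates the idea: the `Rule` class and `validate` function are hypothetical and do not reflect how Kodexa defines or evaluates validation rules.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical rule combining the three checks listed above."""
    required: bool = False
    pattern: Optional[str] = None    # regular expression the whole value must match
    minimum: Optional[float] = None  # inclusive lower bound for numeric values
    maximum: Optional[float] = None  # inclusive upper bound for numeric values


def validate(value: Optional[str], rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value passed."""
    errors: List[str] = []
    if value is None or value.strip() == "":
        if rule.required:
            errors.append("value is required")
        return errors
    if rule.pattern and not re.fullmatch(rule.pattern, value):
        errors.append("value does not match expected format")
    if rule.minimum is not None or rule.maximum is not None:
        try:
            number = float(value)
        except ValueError:
            errors.append("value is not numeric")
            return errors
        if rule.minimum is not None and number < rule.minimum:
            errors.append("value is below minimum")
        if rule.maximum is not None and number > rule.maximum:
            errors.append("value is above maximum")
    return errors


date_rule = Rule(required=True, pattern=r"\d{2}/\d{2}/\d{4}")
print(validate("01/15/2024", date_rule))  # passes: no problems reported
print(validate("2024-01-15", date_rule))  # fails the format check
```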
Document-Specific Guidance
For Invoices
Distinguish between subtotal, tax, and total amounts
Specify which date (invoice date vs. due date vs. service date)
Describe line item structure clearly
Note currency and decimal formatting
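Currency and decimal formatting matters because the same amount can be printed as '$1,234.56', '1234.56 USD', or '1,234.56'. If you post-process extracted amounts yourself, a normalization step along these lines can help. This is a sketch under a stated assumption: the period is the decimal separator and the comma is a thousands separator, which does not hold for every locale. The `normalize_amount` helper is hypothetical.

```python
import re
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Strip currency symbols, codes, and thousands separators, then parse.

    Assumes '.' is the decimal separator. Raises decimal.InvalidOperation
    if nothing parseable remains.
    """
    # Keep only digits, the decimal point, and minus signs.
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    return Decimal(cleaned)


for sample in ["$1,234.56", "1234.56 USD", "1,234.56"]:
    print(sample, "->", normalize_amount(sample))
```

`Decimal` is used instead of `float` so that monetary values are not affected by binary rounding.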
For Contracts
Identify parties clearly (vendor vs. customer)
Distinguish between effective date and execution date
Describe section structure (terms, conditions, signatures)
Note multi-page layout considerations
For Forms
Reference field labels on the form
Describe checkbox or radio button selections
Note handwritten vs. printed text differences
Specify expected value formats
For Receipts
Describe compact layouts clearly
Distinguish items from totals
Note abbreviations commonly used
Handle variable formatting (thermal printer output)
Common Challenges and Solutions
Challenge: Field Appears in Multiple Locations
Solution: Specify which instance: "The total amount at the bottom of the invoice (not the subtotal amounts within line items)"
Challenge: Format Varies Across Documents
Solution: List format variations: "The date, which may appear as MM/DD/YYYY, DD/MM/YYYY, or Month DD, YYYY"
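When dates arrive in several formats, downstream code usually needs them in one form. The sketch below tries each of the formats named above in turn using Python's standard `datetime.strptime`. Two caveats: a value such as 03/04/2024 is valid in both numeric formats, so the order of `FORMATS` decides how it is read (month first here), and `%B` matches month names in the current locale, assumed to be English. The `parse_date` helper is hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# Tried in order; the first format that parses wins.
FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]


def parse_date(raw: str) -> Optional[date]:
    """Return the parsed date, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


for sample in ["01/15/2024", "15/01/2024", "January 15, 2024"]:
    print(sample, "->", parse_date(sample))
```

If your documents come from a region that writes the day first, put `%d/%m/%Y` ahead of `%m/%d/%Y`, and say so in the prompt as well.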
Challenge: Field May Be Missing
Solution: Mark the field as optional and describe its typical location: "The purchase order number, if present, usually found near the invoice number"
Challenge: Extracting from Tables
Solution: Describe table structure and column: "The unit price for each line item, found in the third column of the items table"
Challenge: Handwritten or Poor Quality
Solution: Note this in the prompt: "The signature date, which may be handwritten and require special attention"
Best Practices
Start simple - Begin with basic prompts and add detail as needed
Test incrementally - Add elements gradually and validate each
Use consistent naming - Similar elements should have similar prompt patterns
Leverage AI suggestions - Use the generation feature as a starting point
Document assumptions - Note any expectations about document structure
Review regularly - Update prompts based on real extraction results
Learn from templates - Study how templates describe common fields
The Continuous Improvement Cycle
Define - Write clear semantic definitions for your data elements
Extract - Process documents with the AI
Review - Check extraction accuracy and quality
Correct - Fix any inaccuracies in the extracted data
Learn - The AI improves from your corrections
Refine - Update prompts based on what you learned
Repeat - Continue the cycle for ongoing improvement
This cycle helps the AI become more accurate over time as it learns your specific document patterns and extraction needs.
Tips
More specific prompts generally produce more accurate extractions
Use the AI generation feature to get started, then refine
Test with diverse document samples to catch edge cases
Group related elements together for better context
Use appropriate data types to help the AI understand format expectations
Regular review and correction significantly improves model accuracy
Project templates provide excellent examples of effective prompts
The Semantics tab adapts its interface based on your data source type
