Overview
Defining your data in Kodexa means configuring data elements that specify what information to extract from documents. Each data element represents a piece of information you want to capture, such as an invoice number, customer name, or line item.
What is a Data Element?
A data element is a single piece of information within your Data Definition. It includes:
Identity - Name, label, and description
Type - What kind of data (text, number, date, etc.)
Source - Where the data comes from (document, formula, metadata)
Semantics - How to extract or calculate it
Validation - Rules to ensure data quality
Context - Additional information to help extraction
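The six facets above can be pictured as fields on a single record. The following is a minimal sketch in Python; the `DataElement` class and all of its field names are hypothetical and only mirror the list above, not Kodexa's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataElement:
    """Hypothetical model of one data element (not the Kodexa SDK)."""
    # Identity
    name: str
    label: str
    description: str = ""
    # Type and source
    data_type: str = "String"
    source: str = "Document"
    # Semantics: a prompt, formula, or expression, depending on the source
    semantics: str = ""
    # Validation rules and additional context
    validation_rules: List[str] = field(default_factory=list)
    additional_context: Dict[str, str] = field(default_factory=dict)


invoice_number = DataElement(
    name="invoice_number",
    label="Invoice Number",
    description="The unique identifier for this invoice",
    semantics="Find the invoice number, typically in the top right corner",
    validation_rules=["required"],
)
```

Each editor tab described in the following sections roughly corresponds to one group of fields in this sketch.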
Accessing the Data Element Editor
To define or edit a data element:
Open your Data Definition
Click on an existing element or create a new one
The data element editor opens with multiple tabs
Configure the element across these tabs
Editor Tabs
The data element editor provides six tabs for complete configuration:
1. Overview Tab
Basic identification and organization:
Display Label
The human-readable name shown in the UI
Appears in forms, tables, and reports
Can include spaces and special characters
Example: "Invoice Number", "Customer Name"
Description
Detailed explanation of what this element represents
Helps team members understand the purpose
Used by AI to understand context
Example: "The unique identifier for this invoice, typically found in the top right corner"
Data Group Checkbox
Enable to make this element a container for other elements
Groups organize related fields together
Example: "Address" group containing Street, City, State, ZIP
Can be repeating (like line items in an invoice)
Repeating Checkbox
Enable when multiple instances are expected
Use for line items, transactions, or any list
Creates an array of values instead of a single value
Example: Multiple line items in an invoice
Internal ID
System identifier used in code and APIs
Cannot contain spaces or special characters
Auto-generated option available
Example: "invoice_number", "customer_name"
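To illustrate the naming constraint, here is one way an internal ID could be derived from a display label. The function `suggest_internal_id` is a hypothetical helper written for this example; it is not the algorithm behind the auto-generated option, which may behave differently.

```python
import re


def suggest_internal_id(label: str) -> str:
    """Derive an identifier without spaces or special characters (illustrative only)."""
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower())
    # Drop underscores left over at either end.
    return slug.strip("_")


print(suggest_internal_id("Invoice Number"))  # invoice_number
print(suggest_internal_id("Customer Name"))   # customer_name
```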
External Name
The name used when exporting or accessing via API
Maps internal ID to external systems
Useful for integration with other platforms
2. Type Tab
Specify the data type and format:
Data Type Selection
| Type | Purpose | Examples |
| --- | --- | --- |
| String | Text values | Names, descriptions, addresses |
| Number | Numeric values | Quantities, IDs, counts |
| Currency | Monetary amounts | Prices, totals, payments |
| Date | Date values | Invoice dates, due dates |
| Date Time | Date and time | Timestamps, appointments |
| Boolean | True/false values | Flags, checkboxes, yes/no |
| Email Address | Email format | Contact emails |
| Phone Number | Phone format | Contact numbers |
| URL | Web addresses | Website links |
| Percentage | Percentage values | Tax rates, discounts |
| Selection | Predefined options | Status codes, categories |
Type Features
Each type has specific configuration options:
Long Text - For text fields, enable multi-line support
Max Text Rows - Limit display height for long text
Allow Markdown - Enable markdown formatting in text
Display Width - Set column width in tables
Expected - Mark if this field should always be present
3. Data Source Tab
Specify where the data comes from:
Available Sources
| Source | Purpose |
| --- | --- |
| Document | Extract from document content (default) |
| Metadata | Use document metadata (filename, upload date, etc.) |
| Formula | Calculate from other elements using formulas |
| Expression | Calculate using Groovy expressions |
| Review | Provide during manual review process |
| External | Fetch from external APIs or databases |
The Semantics tab changes based on the selected data source.
4. Semantics Tab
Define how to extract or calculate the data:
For Document Source
Text area for natural language prompt
Describe what to extract and where to find it
AI uses this prompt to locate and extract the data
The AI generation feature can suggest prompts
For Formula Source
Formula editor with syntax highlighting
Dropdown of available elements to reference
Real-time validation and error checking
Use for calculations like:
{Quantity} * {Price}
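To make the placeholder syntax concrete, the sketch below evaluates a formula of this shape in Python. It assumes that `{Name}` refers to another element's value and supports only `+`, `-`, `*`, and `/`; the function `evaluate_formula` is hypothetical and is not Kodexa's formula engine, whose syntax and functions may differ.

```python
import ast
import operator
import re
from decimal import Decimal

# Arithmetic operators this sketch understands.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula: str, values: dict) -> Decimal:
    """Evaluate a formula such as '{Quantity} * {Price}' (illustrative only)."""
    names = {}

    def to_var(match):
        # Replace each {Label} with a generated variable bound to its value.
        var = f"v{len(names)}"
        names[var] = Decimal(str(values[match.group(1)]))
        return var

    expr = re.sub(r"\{([^}]+)\}", to_var, formula)

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return names[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Decimal(str(node.value))
        raise ValueError("unsupported formula syntax")

    return walk(ast.parse(expr, mode="eval"))


total = evaluate_formula("{Quantity} * {Price}", {"Quantity": 3, "Price": "19.99"})
print(total)  # 59.97
```

Decimal arithmetic is used here so that monetary values do not pick up binary floating-point rounding errors.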
For Expression Source
Code editor for Groovy expressions
Advanced logic and transformations
Access to full programming capabilities
For Review Source
Template editor using Jinja syntax
Define review workflows
Guide human reviewers
For External Source
Expression editor for API calls
Connect to external data sources
Transform and map external data
5. Additional Context Tab
Provide extra information to improve extraction:
Additional Context
Help with record-based chunking
Define record boundaries and sections
Useful for multi-record documents
Context Types
Record Definition - Defines what constitutes a record
Record Start Marker - Indicates where records begin
Record End Marker - Indicates where records end
Record Section Starter - Marks section beginnings
Record Section End - Marks section endings
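The idea behind start and end markers can be sketched as a simple line splitter. This is only a conceptual illustration in Python, assuming markers are plain substrings; `split_records` is a hypothetical function, and Kodexa's record-based chunking is driven by the context you provide rather than by literal string matching like this.

```python
from typing import List, Optional


def split_records(lines: List[str], start_marker: str,
                  end_marker: Optional[str] = None) -> List[List[str]]:
    """Group lines into records using start and optional end markers (illustrative only)."""
    records: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in lines:
        if start_marker in line:
            # A new record begins; close any record still open.
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
            if end_marker and end_marker in line:
                # The end marker closes the record; later lines are skipped
                # until the next start marker.
                records.append(current)
                current = None
    if current is not None:
        records.append(current)
    return records


page = ["Header", "Invoice #1", "Item A", "Total 10",
        "Invoice #2", "Item B", "Total 20", "Footer"]
records = split_records(page, start_marker="Invoice #", end_marker="Total")
print(len(records))  # 2
```

Without an end marker, each record simply runs until the next start marker, which is why supplying both boundaries tends to give cleaner chunks.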
Lexical Relations
Define synonyms for classification
Define antonyms to distinguish concepts
Applied when embeddings are used for classification
Example: "Invoice" synonyms: "Bill", "Receipt", "Statement"
6. Validation Tab
Define rules to ensure data quality:
Validation Rules
Required field validation
Format validation (regex patterns)
Range validation (min/max values)
Custom validation logic
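The first three rule kinds listed above can be sketched as a single check function. This is a minimal Python illustration; the `validate` function, its parameters, and its messages are hypothetical and do not represent how validation rules are configured or evaluated in Kodexa.

```python
import re
from typing import List, Optional


def validate(value, required: bool = False, pattern: Optional[str] = None,
             minimum=None, maximum=None) -> List[str]:
    """Return a list of issues for one extracted value (illustrative only)."""
    issues: List[str] = []
    if value is None or value == "":
        # Required-field validation; other rules do not apply to empty values.
        if required:
            issues.append("value is required")
        return issues
    # Format validation: the whole value must match the regex pattern.
    if pattern is not None and not re.fullmatch(pattern, str(value)):
        issues.append(f"value {value!r} does not match pattern {pattern!r}")
    # Range validation.
    if minimum is not None and value < minimum:
        issues.append(f"value {value} is below minimum {minimum}")
    if maximum is not None and value > maximum:
        issues.append(f"value {value} is above maximum {maximum}")
    return issues


print(validate("INV-00123", required=True, pattern=r"INV-\d{5}"))  # []
print(validate("", required=True))  # ['value is required']
```

Returning a list of issues rather than raising on the first failure fits the goal of flagging values for review instead of rejecting the document.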
Benefits
Catch extraction errors automatically
Ensure data meets requirements
Flag issues for review
Improve data quality consistently
Creating Different Element Types
Simple Data Elements
For single values like invoice number or date:
Set a clear label: "Invoice Number"
Choose appropriate type: "String"
Select data source: "Document"
Write a semantic definition (prompt)
Add validation if needed
Data Groups
For organizing related fields like an address:
Enable "Data Group" checkbox
Set label: "Billing Address"
Add child elements: Street, City, State, ZIP
Each child has its own configuration
Repeating Groups
For lists like invoice line items:
Enable both "Data Group" and "Repeating" checkboxes
Set label: "Line Items"
Add child elements: Description, Quantity, Price, Total
AI extracts multiple instances automatically
Calculated Fields
For values derived from other fields:
Set data source to "Formula"
Choose appropriate type: "Currency"
Write formula in Semantics tab:
{Subtotal} * {TaxRate}
Formula calculates automatically
Best Practices
Use clear, descriptive labels - "Invoice Date" not "Date1"
Write detailed descriptions - Help your team understand the purpose
Choose appropriate types - Use Currency for money, Date for dates
Organize with groups - Keep related fields together
Write specific prompts - Guide the AI with clear instructions
Add validation rules - Catch errors early
Test incrementally - Add and test elements one at a time
Use consistent naming - Follow a naming convention
Enable the Expected flag - For fields that should always be present
Leverage AI generation - Use suggestions as starting points
Common Patterns
Invoice Data Definition
Header (Group)
  Invoice Number (String)
  Invoice Date (Date)
  Due Date (Date)
Vendor (Group)
  Name (String)
  Address (Group with Street, City, State, ZIP)
Line Items (Repeating Group)
  Description (String)
  Quantity (Number)
  Unit Price (Currency)
  Total (Currency, Formula: {Quantity} * {Unit Price})
Totals (Group)
  Subtotal (Currency)
  Tax (Currency)
  Total Amount (Currency)
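As a rough picture of what data shaped by this definition might look like, the Python sketch below uses a nested dictionary: groups become nested objects and the repeating group becomes a list. The layout and sample values are invented for illustration and are not Kodexa's export format.

```python
from decimal import Decimal

# Hypothetical extracted data: groups nest, repeating groups become lists.
invoice = {
    "Header": {"Invoice Number": "INV-00123", "Invoice Date": "2024-01-15"},
    "Line Items": [
        {"Description": "Widget", "Quantity": Decimal("2"), "Unit Price": Decimal("10.00")},
        {"Description": "Gadget", "Quantity": Decimal("1"), "Unit Price": Decimal("5.50")},
    ],
    "Totals": {},
}

# Per-line formula from the definition: {Quantity} * {Unit Price}
for item in invoice["Line Items"]:
    item["Total"] = item["Quantity"] * item["Unit Price"]

# Summing the line totals gives a value to compare against the extracted Subtotal.
subtotal = sum(item["Total"] for item in invoice["Line Items"])
invoice["Totals"]["Subtotal"] = subtotal
print(subtotal)  # 25.50
```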
Tips
Start with a template if one matches your document type
The Type tab adapts options based on the data type selected
The Semantics tab changes its UI based on your data source
Use "Use Generated ID" for automatic internal naming
External Name is useful for API integrations
Groups can be nested multiple levels deep
Repeating groups work well with table-structured data
Formula fields automatically recalculate when dependencies change
Validation rules run automatically during extraction
Additional Context helps with complex document layouts
