Defining your Data

Learn how you can define data in Kodexa

Overview

Defining your data in Kodexa means configuring data elements that specify what information to extract from documents. Each data element represents a piece of information you want to capture, such as an invoice number, customer name, or line item.

What is a Data Element?

A data element is a single piece of information within your Data Definition. It includes:

  • Identity - Name, label, and description

  • Type - What kind of data (text, number, date, etc.)

  • Source - Where the data comes from (document, formula, metadata)

  • Semantics - How to extract or calculate it

  • Validation - Rules to ensure data quality

  • Context - Additional information to help extraction
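The six parts listed above can be pictured as one record. The following is a minimal sketch only: the class and field names are hypothetical and chosen for illustration, and they do not reflect Kodexa's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical model for illustration only; not Kodexa's actual schema.
@dataclass
class DataElement:
    label: str                       # Identity: human-readable name
    internal_id: str                 # Identity: system identifier
    description: str = ""            # Identity: what the element represents
    data_type: str = "String"        # Type: text, number, date, etc.
    source: str = "Document"         # Source: document, formula, metadata
    semantics: str = ""              # Semantics: prompt or formula
    validation_rules: List[str] = field(default_factory=list)    # Validation
    additional_context: List[str] = field(default_factory=list)  # Context


invoice_number = DataElement(
    label="Invoice Number",
    internal_id="invoice_number",
    description="The unique identifier for this invoice, "
                "typically found in the top right corner",
    semantics="Extract the invoice number from the document header.",
)
```

The defaults for `data_type` and `source` in this sketch mirror the most common case described in this article: a text value extracted from the document.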

Accessing the Data Element Editor

To define or edit a data element:

  1. Open your Data Definition

  2. Click on an existing element or create a new one

  3. The data element editor opens with multiple tabs

  4. Configure the element across these tabs

Editor Tabs

The data element editor provides six tabs for complete configuration:

1. Overview Tab

Basic identification and organization:

Display Label

  • The human-readable name shown in the UI

  • Appears in forms, tables, and reports

  • Can include spaces and special characters

  • Example: "Invoice Number", "Customer Name"

Description

  • Detailed explanation of what this element represents

  • Helps team members understand the purpose

  • Used by AI to understand context

  • Example: "The unique identifier for this invoice, typically found in the top right corner"

Data Group Checkbox

  • Enable to make this element a container for other elements

  • Groups organize related fields together

  • Example: "Address" group containing Street, City, State, ZIP

  • Can be repeating (like line items in an invoice)

Repeating Checkbox

  • Enable when multiple instances are expected

  • Use for line items, transactions, or any list

  • Creates an array of values instead of a single value

  • Example: Multiple line items in an invoice
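To illustrate the difference between a single value and an array of values, here is a small sketch. The dictionary shapes and key names are assumptions made for this example; the format Kodexa actually exports may differ.

```python
# Hypothetical result shapes for illustration; the real export format may differ.
single = {"invoice_number": "INV-1001"}  # non-repeating element: one value

repeating = {  # repeating group: a list of instances
    "line_items": [
        {"description": "Widget", "quantity": 2},
        {"description": "Gadget", "quantity": 1},
    ]
}


def instances(value):
    """Return the instances of an element as a list, whether or not it repeats."""
    return value if isinstance(value, list) else [value]


# Code that consumes the data can then loop over both kinds of element.
total_quantity = sum(item["quantity"] for item in instances(repeating["line_items"]))
```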

Internal ID

  • System identifier used in code and APIs

  • Cannot contain spaces or special characters

  • Can be generated automatically with the "Use Generated ID" option

  • Example: "invoice_number", "customer_name"
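The naming constraint above can be sketched as a small helper that derives an identifier from a display label. This is an illustration of the rule only: `to_internal_id` is a hypothetical function, and the IDs that Kodexa generates may follow a different scheme.

```python
import re


def to_internal_id(label: str) -> str:
    """Derive a lowercase, underscore-separated identifier from a display label."""
    # Replace every run of characters other than letters and digits with "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_").lower()
    if not slug:
        raise ValueError("label contains no usable characters")
    return slug


print(to_internal_id("Invoice Number"))  # invoice_number
print(to_internal_id("Customer Name"))   # customer_name
```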

External Name

  • The name used when exporting or accessing via API

  • Maps internal ID to external systems

  • Useful for integration with other platforms

2. Type Tab

Specify the data type and format:

Data Type Selection

Type            Purpose             Example
--------------  ------------------  ------------------------------
String          Text values         Names, descriptions, addresses
Number          Numeric values      Quantities, IDs, counts
Currency        Monetary amounts    Prices, totals, payments
Date            Date values         Invoice dates, due dates
Date Time       Date and time       Timestamps, appointments
Boolean         True/false values   Flags, checkboxes, yes/no
Email Address   Email format        Contact emails
Phone Number    Phone format        Contact numbers
URL             Web addresses       Website links
Percentage      Percentage values   Tax rates, discounts
Selection       Predefined options  Status codes, categories

Type Features

Each type has specific configuration options:

  • Long Text - For text fields, enable multi-line support

  • Max Text Rows - Limit display height for long text

  • Allow Markdown - Enable markdown formatting in text

  • Display Width - Set column width in tables

  • Expected - Mark if this field should always be present

3. Data Source Tab

Specify where the data comes from:

Available Sources

Source      Purpose
----------  ---------------------------------------------------
Document    Extract from document content (default)
Metadata    Use document metadata (filename, upload date, etc.)
Formula     Calculate from other elements using formulas
Expression  Calculate using Groovy expressions
Review      Provide during manual review process
External    Fetch from external APIs or databases

The Semantics tab changes based on the selected data source.

4. Semantics Tab

Define how to extract or calculate the data:

For Document Source

  • Text area for a natural language prompt

  • Describe what to extract and where to find it

  • AI uses this prompt to locate and extract the data

  • The AI generation feature can suggest prompts

For Formula Source

  • Formula editor with syntax highlighting

  • Dropdown of available elements to reference

  • Real-time validation and error checking

  • Use for calculations like: {Quantity} * {Price}
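To make the {Element} reference syntax concrete, the sketch below substitutes element values into a formula and evaluates basic arithmetic. It is not Kodexa's formula engine: `evaluate_formula` is a hypothetical helper, and it supports only +, -, * and / as an assumption for this example.

```python
import ast
import operator
import re

# Only these arithmetic operators are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula: str, values: dict) -> float:
    """Evaluate a formula such as '{Quantity} * {Price}' against element values."""

    def substitute(match):
        # Swap each {Element Name} reference for its numeric value.
        name = match.group(1).strip()
        return f"({float(values[name])!r})"

    expression = re.sub(r"\{([^{}]+)\}", substitute, formula)

    def walk(node):
        # Walk the parsed expression and allow numbers and basic arithmetic only.
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


result = evaluate_formula("{Quantity} * {Price}", {"Quantity": 3, "Price": 2.5})
```

Parsing the expression instead of calling `eval` keeps the sketch limited to arithmetic, so a formula cannot run arbitrary code.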

For Expression Source

  • Code editor for Groovy expressions

  • Advanced logic and transformations

  • Access to full programming capabilities

For Review Source

  • Template editor using Jinja syntax

  • Define review workflows

  • Guide human reviewers

For External Source

  • Expression editor for API calls

  • Connect to external data sources

  • Transform and map external data

5. Additional Context Tab

Provide extra information to improve extraction:

Additional Context

  • Helps with record-based chunking

  • Define record boundaries and sections

  • Useful for multi-record documents

Context Types

  • Record Definition - Defines what constitutes a record

  • Record Start Marker - Indicates where records begin

  • Record End Marker - Indicates where records end

  • Record Section Starter - Marks section beginnings

  • Record Section End - Marks section endings
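As a rough illustration of what a record start marker does, the sketch below splits text into records wherever a line matches a pattern. This regex-based approach is a stand-in chosen for the example, not a description of how Kodexa performs chunking, and `split_records` is a hypothetical name.

```python
import re


def split_records(text: str, start_pattern: str) -> list:
    """Split text into records, each beginning at a line that matches start_pattern."""
    records = []
    current = None  # stays None until the first start marker is seen
    for line in text.splitlines():
        if re.match(start_pattern, line):
            if current is not None:
                records.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first marker are ignored in this sketch.
    if current is not None:
        records.append("\n".join(current))
    return records


statement = "Statement\nAccount: 001\nBalance: 10\nAccount: 002\nBalance: 20"
records = split_records(statement, r"Account:")
```

Here `records` holds two entries, one per account, and the "Statement" heading that precedes the first marker is left out.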

Lexical Relations

  • Define synonyms for classification

  • Define antonyms to distinguish concepts

  • Applied when classification uses embeddings

  • Example: "Invoice" synonyms: "Bill", "Receipt", "Statement"
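The idea behind synonyms can be sketched with a plain lookup table that maps each alternative term to its main concept. This is a deliberately simplified stand-in: the article describes classification with embeddings, while this example uses exact, case-insensitive matching, and all names in it are hypothetical.

```python
# Synonym list taken from the example above.
SYNONYMS = {"Invoice": ["Bill", "Receipt", "Statement"]}


def build_lookup(synonyms: dict) -> dict:
    """Map every term, including the main term itself, to its main concept."""
    lookup = {}
    for canonical, alternatives in synonyms.items():
        for term in [canonical, *alternatives]:
            lookup[term.casefold()] = canonical
    return lookup


def classify(term: str, lookup: dict):
    """Return the main concept for a term, or None when the term is unknown."""
    return lookup.get(term.strip().casefold())


lookup = build_lookup(SYNONYMS)
print(classify("bill", lookup))      # Invoice
print(classify("Contract", lookup))  # None
```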

6. Validation Tab

Define rules to ensure data quality:

Validation Rules

  • Required field validation

  • Format validation (regex patterns)

  • Range validation (min/max values)

  • Custom validation logic
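The first three rule types can be sketched as a single check function. The function name, parameters, and error messages below are assumptions made for illustration; rules in Kodexa are configured in the Validation tab, and this is not its rule syntax.

```python
import re


def validate(value, *, required=False, pattern=None, minimum=None, maximum=None):
    """Return a list of error messages; an empty list means the value passed."""
    errors = []
    if value is None or value == "":
        if required:
            errors.append("value is required")
        return errors  # nothing further to check on an empty value
    # Format validation: the whole value must match the regex pattern.
    if pattern is not None and re.fullmatch(pattern, str(value)) is None:
        errors.append(f"{value!r} does not match pattern {pattern!r}")
    # Range validation: compare against the optional bounds.
    if minimum is not None and value < minimum:
        errors.append(f"{value!r} is below minimum {minimum!r}")
    if maximum is not None and value > maximum:
        errors.append(f"{value!r} is above maximum {maximum!r}")
    return errors


print(validate("INV-1001", required=True, pattern=r"INV-\d+"))  # []
print(validate("", required=True))                              # ['value is required']
print(validate(150, minimum=0, maximum=100))                    # ['150 is above maximum 100']
```

Returning every failed rule, instead of stopping at the first one, gives a reviewer the full picture of what needs attention.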

Benefits

  • Catch extraction errors automatically

  • Ensure data meets requirements

  • Flag issues for review

  • Improve data quality consistently

Creating Different Element Types

Simple Data Elements

For single values like invoice number or date:

  1. Set a clear label: "Invoice Number"

  2. Choose appropriate type: "String"

  3. Select data source: "Document"

  4. Write a semantic definition (prompt)

  5. Add validation if needed

Data Groups

For organizing related fields like an address:

  1. Enable "Data Group" checkbox

  2. Set label: "Billing Address"

  3. Add child elements: Street, City, State, ZIP

  4. Each child has its own configuration

Repeating Groups

For lists like invoice line items:

  1. Enable both "Data Group" and "Repeating" checkboxes

  2. Set label: "Line Items"

  3. Add child elements: Description, Quantity, Price, Total

  4. AI extracts multiple instances automatically

Calculated Fields

For values derived from other fields:

  1. Set data source to "Formula"

  2. Choose appropriate type: "Currency"

  3. Write formula in Semantics tab: {Subtotal} * {TaxRate}

  4. Formula calculates automatically
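As a worked example of the {Subtotal} * {TaxRate} formula, the sketch below computes a tax amount with decimal arithmetic. It assumes the tax rate is stored as a fraction (0.0825 for 8.25%) and that results are rounded half up to two decimal places; both are assumptions for this example, not documented Kodexa behavior.

```python
from decimal import Decimal, ROUND_HALF_UP


def tax_amount(subtotal: str, tax_rate: str) -> Decimal:
    """Multiply subtotal by tax rate and round to two decimal places."""
    # Decimal avoids the binary floating-point errors that affect money values.
    raw = Decimal(subtotal) * Decimal(tax_rate)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 199.99 * 0.0825 = 16.499175, which rounds to 16.50
tax = tax_amount("199.99", "0.0825")
```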

Best Practices

  • Use clear, descriptive labels - "Invoice Date" not "Date1"

  • Write detailed descriptions - Help your team understand the purpose

  • Choose appropriate types - Use Currency for money, Date for dates

  • Organize with groups - Keep related fields together

  • Write specific prompts - Guide the AI with clear instructions

  • Add validation rules - Catch errors early

  • Test incrementally - Add and test elements one at a time

  • Use consistent naming - Follow a naming convention

  • Enable the Expected flag - For fields that should always be present

  • Leverage AI generation - Use suggestions as starting points

Common Patterns

Invoice Data Definition

  • Header (Group)

    • Invoice Number (String)

    • Invoice Date (Date)

    • Due Date (Date)

  • Vendor (Group)

    • Name (String)

    • Address (Group with Street, City, State, ZIP)

  • Line Items (Repeating Group)

    • Description (String)

    • Quantity (Number)

    • Unit Price (Currency)

    • Total (Currency, Formula: {Quantity} * {Unit Price})

  • Totals (Group)

    • Subtotal (Currency)

    • Tax (Currency)

    • Total Amount (Currency)
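The outline above can also be written as a nested structure, which makes the grouping explicit. This is a hypothetical representation for illustration, not Kodexa's storage or export format. The "[]" suffix marking the repeating group and the String type given to the address fields are assumptions.

```python
# Hypothetical representation of the outline above; not Kodexa's format.
INVOICE_DEFINITION = {
    "Header": {
        "Invoice Number": "String",
        "Invoice Date": "Date",
        "Due Date": "Date",
    },
    "Vendor": {
        "Name": "String",
        # Types for the address fields are assumed; the outline does not state them.
        "Address": {"Street": "String", "City": "String",
                    "State": "String", "ZIP": "String"},
    },
    "Line Items[]": {  # "[]" marks a repeating group in this sketch
        "Description": "String",
        "Quantity": "Number",
        "Unit Price": "Currency",
        "Total": "Currency",  # Formula: {Quantity} * {Unit Price}
    },
    "Totals": {
        "Subtotal": "Currency",
        "Tax": "Currency",
        "Total Amount": "Currency",
    },
}


def element_paths(definition: dict, prefix: str = "") -> list:
    """List the full path of every non-group element, descending into groups."""
    paths = []
    for name, spec in definition.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(spec, dict):
            paths.extend(element_paths(spec, path))  # a group: recurse
        else:
            paths.append(path)                       # a simple element
    return paths


paths = element_paths(INVOICE_DEFINITION)
```

Listing the paths this way is a quick check that nested groups, such as the vendor address, sit where you expect them.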

Tips

  • Start with a template if one matches your document type

  • The Type tab adapts options based on the data type selected

  • The Semantics tab changes its interface based on your data source

  • Use "Use Generated ID" for automatic internal naming

  • External Name is useful for API integrations

  • Groups can be nested multiple levels deep

  • Repeating groups work well with table-structured data

  • Formula fields automatically recalculate when dependencies change

  • Validation rules run automatically during extraction

  • Additional Context helps with complex document layouts
