Data Definition for AI | Kodexa Help Center

When working with data element groups, you can customize various features to guide the AI in finding and processing data in your documents effectively. This support article provides a step-by-step guide on how to set these features in the user interface.

Accessing the Features Tab

To begin, navigate to the data element group you want to configure and click on the “Features” tab. This tab allows you to view and edit the available settings for the selected data element group.

Configuring Ignore and Embedded Options

The “Ignore” and “Embedded” options are crucial settings to consider:

Ignore: Check this box if you want the AI to ignore this data element during processing.
Embedded: Check this box if the data element should be treated as embedded within another structure.

When you set a data element as embedded, the AI looks for it while processing the parent data group. As a result, you cannot configure it separately: the other options on this tab are unavailable while the “Embedded” option is checked.

In most cases, the embedded option is recommended because it is the most efficient way to capture data. However, if you have a large amount of data to capture and need to spread the work across multiple calls to the GenAI, you might choose not to use the embedded option.

Setting Cardinality

Cardinality determines how many times a data element can appear in the document. Select the appropriate cardinality from the dropdown menu:

Single: The data element appears only once.
Multiple: The data element can appear multiple times.

Setting the correct cardinality is important as it informs the AI whether to look for a single instance or multiple instances of the data element.
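
To make the effect of this setting concrete, here is a minimal Python sketch of how a cardinality value could be checked against extracted results. The `Cardinality` enum and the `cardinality_violated` function are hypothetical names used only for illustration; they are not part of the product.

```python
from enum import Enum
from typing import List


class Cardinality(Enum):
    # Values mirror the dropdown labels described above.
    SINGLE = "Single"      # the data element appears only once
    MULTIPLE = "Multiple"  # the data element can appear multiple times


def cardinality_violated(cardinality: Cardinality, instances: List[str]) -> bool:
    """Return True when more instances were captured than the setting allows."""
    return cardinality is Cardinality.SINGLE and len(instances) > 1


# Hypothetical extraction results, for illustration only.
print(cardinality_violated(Cardinality.SINGLE, ["INV-001"]))             # False
print(cardinality_violated(Cardinality.SINGLE, ["INV-001", "INV-002"]))  # True
print(cardinality_violated(Cardinality.MULTIPLE, ["a", "b"]))            # False
```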

Defining Classification Strategy

Classification is the process of determining which pages in a document contain information relevant to specific data elements. If your document is more than a few pages long, you should consider using classification. Choose a classification strategy from the dropdown menu:

None: No classification will be performed.
Data Element: Use the data element's definition (name and description) for classification.
Data Element and Children: Use the data element's definition along with its children's definitions for classification.

If you are looking for general concepts, using the data element definition alone may suffice. However, including the definitions of the children can help reduce the number of possible matches by providing additional information to the AI.
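
The difference between the two strategies can be pictured as a difference in how much descriptive text is used to match pages. The following Python sketch assumes that the definitions are simply joined into one block of text; the `DataElement` class, the `classification_text` function, and the sample definitions are hypothetical, and the actual matching may work differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataElement:
    # Hypothetical stand-in for a data element definition.
    name: str
    description: str
    children: List["DataElement"] = field(default_factory=list)


def classification_text(element: DataElement, strategy: str) -> Optional[str]:
    """Build the text that pages would be compared against."""
    if strategy == "None":
        return None  # no classification is performed
    if strategy not in ("Data Element", "Data Element and Children"):
        raise ValueError(f"unknown classification strategy: {strategy!r}")
    lines = [f"{element.name}: {element.description}"]
    if strategy == "Data Element and Children":
        # Children add detail that can narrow down the possible matches.
        lines += [f"{child.name}: {child.description}" for child in element.children]
    return "\n".join(lines)


income = DataElement(
    "Income Statement",
    "Revenue and expenses for a reporting period",
    [
        DataElement("Total Revenue", "Sum of all revenue lines"),
        DataElement("Net Income", "Profit after all expenses and taxes"),
    ],
)
print(classification_text(income, "Data Element and Children"))
```

With “Data Element”, only the first line of this output would be used; with “Data Element and Children”, the two child definitions are included as well.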

Specifying Classification Content

When working with the AI, there are multiple representations of the content that can be used for classification. Select the type of content to use from the following options:

Text: Use only the textual content for classification.
Bounding Boxes: Use the bounding boxes of the content for classification.
Images & Bounding Boxes: Use a combination of images and bounding boxes for classification.
Image: Use only the images for classification.

If your document contains mostly plain text, using the text option may be sufficient. However, if you have lots of tabular data, using bounding boxes, images, or a combination of both might yield better results.

Re-ranking Classification Matches

If you believe that the relevant data is limited to a specific number of pages, you can enable re-ranking of the classification results. Check the “Re-rank Classification Matches” box to activate this feature. Additional options will appear when re-ranking is enabled.

Setting Max Pages From Re-rank

If re-ranking is enabled, specify the maximum number of pages to return after the re-ranking process. This allows you to limit the number of pages the AI will process, focusing on the most relevant ones.
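
As a rough illustration of what a page limit does, the Python sketch below keeps only the highest-scoring pages. It assumes each page already has a relevance score from re-ranking; the `top_pages` function and the scores are hypothetical and are not taken from the product.

```python
from typing import Dict, List


def top_pages(scores: Dict[int, float], max_pages: int) -> List[int]:
    """Keep the max_pages highest-scoring pages, returned in document order."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    # Rank page numbers from most to least relevant.
    ranked = sorted(scores, key=lambda page: scores[page], reverse=True)
    # Cut off at the limit, then restore reading order.
    return sorted(ranked[:max_pages])


# Hypothetical relevance scores keyed by page number.
scores = {1: 0.12, 2: 0.91, 3: 0.40, 4: 0.88, 5: 0.05}
print(top_pages(scores, max_pages=2))  # [2, 4]
```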

Choosing Chunking Strategy

Chunking refers to the process of breaking down the document into smaller parts for the language model to process. Select the appropriate chunking strategy from the dropdown menu:

Whole Document: Feed the entire document to the AI without chunking.
Records: Chunk the document based on individual records.
Classified Content (Single): Chunk the document based on the classified content.

If you are using classification, choose the “Classified Content (Single)” option so that the AI processes only the most relevant parts of the document.
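
The Python sketch below contrasts two of these strategies. It rests on assumptions that this article does not confirm: that page numbers are 1-based, and that “Classified Content (Single)” combines the classified pages into one chunk. The “Records” strategy is left out because the article does not describe how records are delimited. The `select_chunks` function is a hypothetical name.

```python
from typing import List, Sequence


def select_chunks(
    pages: List[str], strategy: str, classified_pages: Sequence[int] = ()
) -> List[str]:
    """Return the chunks of text that would be sent to the language model."""
    if strategy == "Whole Document":
        # No chunking: every page goes into a single chunk.
        return ["\n".join(pages)]
    if strategy == "Classified Content (Single)":
        # Assumption: only the classified pages, combined into one chunk.
        selected = [pages[number - 1] for number in sorted(set(classified_pages))]
        return ["\n".join(selected)] if selected else []
    raise ValueError(f"strategy not covered by this sketch: {strategy!r}")


pages = ["cover letter", "income statement", "notes"]
print(select_chunks(pages, "Whole Document"))
print(select_chunks(pages, "Classified Content (Single)", classified_pages=[2]))
```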

Selecting Prompt Strategy

The prompt strategy determines how the content will be presented to the AI for data element extraction. Choose the appropriate strategy from the dropdown menu:

Text (Lines): Present the content as plain text lines.
Image & Bounding Boxes: Present the content as images along with bounding boxes.
Bounding Boxes: Present the content using only bounding boxes.

If your document contains mostly plain text without columns or tables, using the “Text (Lines)” option is sufficient. However, if your document includes tables or forms, using the “Image & Bounding Boxes” or “Bounding Boxes” option may yield better results.

Setting Image Width

If you are using image-based extraction, you can specify the image width to override the default image resizing. This allows you to reduce costs by limiting the size of the images processed by the AI.
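
To see why a smaller width reduces the amount of image data, consider the Python sketch below, which computes the dimensions of a page image after resizing. It assumes that the aspect ratio is preserved and that images are never enlarged; both are assumptions made for this example, and `resized_dimensions` is a hypothetical helper.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int, target_width: int) -> Tuple[int, int]:
    """Scale an image down to target_width while keeping its aspect ratio."""
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("all dimensions must be positive")
    if target_width >= width:
        return (width, height)  # assumption: never enlarge an image
    scale = target_width / width
    return (target_width, max(1, round(height * scale)))


# A 2550 x 3300 pixel page scan resized to a width of 1024 pixels.
new_width, new_height = resized_dimensions(2550, 3300, 1024)
print(new_width, new_height)  # 1024 1325
print(f"{new_width * new_height / (2550 * 3300):.0%} of the original pixels")
```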

Configuring Structure Review

Structure review involves using another GenAI model to analyze the captured data elements and identify any mistakes in the structure that need correction before labeling the document. Select a model for reviewing the structure from the dropdown menu:

None: No structure review will be performed.
Anthropic Claude 3.5 Sonnet
GPT 4o
GPT 4
Pi
Anthropic Claude 3 Haiku
Anthropic Claude 3 Sonnet
Cohere Command

If you are using a smaller model for data extraction, a structure review is especially worthwhile, because it gives a second model the chance to catch and correct mistakes in the data structure.

Example Configuration

Here's an example configuration for an “Income Statement” data element group:

Cardinality: Multiple
Classification Strategy: Data Element and Children
Classification Content: Text
Re-rank Classification Matches: Unchecked
Chunking Strategy: Classified Content (Single)
Prompt Strategy: Bounding Boxes
Structure Review: Anthropic Claude 3 Haiku
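
For readers who prefer to see the settings as data, the Python sketch below restates the same example and runs two consistency checks. The `GroupFeatures` class and the `check` function are hypothetical and do not represent a product API. The first check follows from the re-ranking section above; the second is an assumption based on the chunking recommendation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroupFeatures:
    # Field names mirror the labels on the Features tab.
    cardinality: str
    classification_strategy: str
    classification_content: str
    rerank_classification_matches: bool
    chunking_strategy: str
    prompt_strategy: str
    structure_review: str
    max_pages_from_rerank: Optional[int] = None


def check(features: GroupFeatures) -> List[str]:
    """Return a list of settings that contradict each other."""
    problems = []
    if features.max_pages_from_rerank is not None and not features.rerank_classification_matches:
        problems.append("Max Pages From Re-rank only applies when re-ranking is enabled")
    # Assumption: classified-content chunking needs classification results.
    if (features.chunking_strategy.startswith("Classified Content")
            and features.classification_strategy == "None"):
        problems.append("Classified Content chunking needs a classification strategy")
    return problems


income_statement = GroupFeatures(
    cardinality="Multiple",
    classification_strategy="Data Element and Children",
    classification_content="Text",
    rerank_classification_matches=False,
    chunking_strategy="Classified Content (Single)",
    prompt_strategy="Bounding Boxes",
    structure_review="Anthropic Claude 3 Haiku",
)
print(check(income_statement))  # []
```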

By following these steps and customizing the features according to your specific requirements, you can ensure that the AI processes your data element groups accurately and efficiently, yielding more relevant results.
