AI and Reference Data

By Malcolm Chisholm

AI is one of the newest and most exciting developments in the world of Information
Technology, although it is starting to age a little. By contrast, Reference Data is one of
the oldest types of data and is generally unloved. Reference Data is “code tables”, that is,
small database tables with codes, descriptions, and hopefully definitions (but alas
these are often missing).

What is special about code tables is that they are metadata. “GOLD”, “SILVER”, and
“BRONZE” Membership Levels mean something and drive business logic. And this is a
crucial fact for AI.

Where Have All the Semantics Gone?

We tend to be in the habit of thinking that database tables and columns, representing
entity types and data elements, are the things that have semantics in the world of data.
Of course, they do have semantics, but not all of the semantics.

Suppose an average Reference Data table has 20 records, and 20% of all tables in a
database are Reference Data tables. An average-sized database may have 100 tables,
so that is 20 Reference Data tables holding 400 Reference Data records, and they
represent a big chunk of semantics.

What It Means for AI

It is true that some Reference Data tables are based on international standards that
have well defined meanings, like NAICS Industry Codes. But a lot of Reference Data is
created by and is unique to the enterprise. There is no external source of information
that could possibly train AI to understand what these codes mean.

There are business glossaries and design documents that can be used to train AI about
the semantics of database structures, meaning tables and the columns that exist within
them. But Reference Data records are not “designed” in that way. They simply get
added or updated in production environments over time. To add to the complexity, code
values are very rarely deleted; instead, they are inactivated. When a code value is no
longer relevant, it cannot be used for new transactions but must be retained for historical
reporting. How inactivation is done can vary from table to table. How can AI
understand any of this?
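
To illustrate how inactivation can vary from table to table, here is a minimal Python sketch. The two table shapes, the PLATINUM, TOOLS, and LEGACY codes, and both conventions (a boolean flag in one table, an end date in the other) are assumptions invented for the example, not taken from any particular system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MembershipLevel:
    """Hypothetical code table that inactivates values with a boolean flag."""
    code: str
    is_active: bool


@dataclass
class ProductCategory:
    """Hypothetical code table that inactivates values with an end date."""
    code: str
    end_date: Optional[date]  # None means the code is still in use


def membership_usable(row: MembershipLevel) -> bool:
    # Convention 1: the flag alone decides.
    return row.is_active


def category_usable(row: ProductCategory, on: date) -> bool:
    # Convention 2: usable until (and including) the end date, if there is one.
    return row.end_date is None or on <= row.end_date


# Inactive rows are kept so that historical reporting still works.
levels = [MembershipLevel("GOLD", True), MembershipLevel("PLATINUM", False)]
categories = [
    ProductCategory("TOOLS", None),
    ProductCategory("LEGACY", date(2020, 12, 31)),
]

today = date(2024, 1, 1)
usable_levels = [r.code for r in levels if membership_usable(r)]
usable_categories = [r.code for r in categories if category_usable(r, today)]
print(usable_levels, usable_categories)
```

Looking only at the rows, nothing says which rule applies to which table. That knowledge lives in program logic or in people's heads unless someone writes it down.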

Reference Data for AI

If people want to chat with their data, or have AI agents perform tasks with their data,
then the issue of Reference Data is going to have to be addressed. The meaning of
Reference Data, including the way it drives program logic, needs to be documented and
made available to AI. This will require considerable effort given the poor state of
Reference Data Management in most enterprises today. These practices will also need
to improve going forward, which implies much stronger Data Governance in this area.
Or enterprises could just find out the hard way.
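
What documenting Reference Data for AI might look like can be sketched in a few lines of Python. This is only one possible shape, under stated assumptions: the `CodeValue` fields, the `describe_code_table` function, the MEMBERSHIP_LEVEL table name, the placeholder definitions, and the inactive status given to BRONZE are all hypothetical and would be replaced by what the enterprise actually knows about its codes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodeValue:
    """One documented record of a code table (hypothetical structure)."""
    code: str
    definition: str      # what the code means to the business
    drives: str          # the business logic that depends on the code
    active: bool = True  # inactive codes are retained for historical reporting


def describe_code_table(table: str, values: List[CodeValue]) -> str:
    """Render a code table as plain text that could be supplied to an AI as context."""
    lines = [f"Code table {table}:"]
    for v in values:
        status = "active" if v.active else "inactive (history only)"
        lines.append(f"- {v.code} [{status}]: {v.definition} Drives: {v.drives}")
    return "\n".join(lines)


# Placeholder content; real definitions must come from the business.
levels = [
    CodeValue("GOLD", "Placeholder definition.", "placeholder rule"),
    CodeValue("SILVER", "Placeholder definition.", "placeholder rule"),
    CodeValue("BRONZE", "Placeholder definition.", "placeholder rule", active=False),
]

text = describe_code_table("MEMBERSHIP_LEVEL", levels)
print(text)
```

The code is the easy part. Filling in the definitions and the logic each code drives is where the Data Governance effort lies.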
