Cataloging Unstructured Data for GenAI Projects
By Malcolm Chisholm
With the AI race on, there is great need for data sets to drive AI projects to success. There are many kinds of data, but today let’s consider unstructured data. As a quick reminder, examples of structured data sources are Excel files, database tables, and JSON documents, while unstructured data sources are typically Word documents, PDFs, and even video and audio files. Data Governance must have the goal of curating all of these data assets as accurately as possible in its data catalogs. The objective is to have a single reference point of truth for the data assets in the enterprise, and this includes unstructured data assets too.
Previously, unstructured data assets were not thought of as sources of data because they required human analysis to be understood. But now, with the latest GenAI tools, it is quite possible to extract meaning and understanding from them. Suddenly, the document troves in SharePoint instances scattered around the enterprise are much more valuable. That is not to say every document is useful or relevant for GenAI purposes, but there will be many gold nuggets among them. To curate unstructured data assets better, Data Governance must understand how that data will be used in AI projects.
1 – Different AI solution types require different kinds of data
What type of LLM solution is being built? There are multiple AI solution types, including Question Answering, Instruct Training, Human-Bot Conversations, and Text Summarization. ML Ops (Machine Learning Operations, typically the professionals implementing the AI solutions) will take data from unstructured sources and prepare it as structured files according to the solution type. For example, for Question Answering, the resulting file will have three columns: “question”, “answer”, and “context”. For Text Summarization, the resulting file will have two columns: “article” and “summary”. It is suggested that unstructured data be cataloged according to which solution types it is suitable for. Of course, descriptions, definitions, tags, relations, and more are vital to this effort as well.
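To make the column layouts concrete, here is a minimal Python sketch of the preparation step described above. The sample rows and the helper function are illustrative assumptions, not a standard ML Ops tool; the point is simply that each solution type dictates a different structured layout.

```python
import csv
import io

# Hypothetical rows extracted from unstructured documents.
# The column layouts follow the solution types described above.
qa_rows = [
    {"question": "What is the refund window?",
     "answer": "30 days",
     "context": "Refunds are accepted within 30 days of purchase."},
]
summarization_rows = [
    {"article": "The quarterly report details revenue growth across all regions.",
     "summary": "Revenue grew in all regions this quarter."},
]

def write_training_file(rows, fieldnames):
    """Serialize prepared rows into the structured CSV layout for one solution type."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Question Answering: three columns. Text Summarization: two columns.
qa_csv = write_training_file(qa_rows, ["question", "answer", "context"])
sum_csv = write_training_file(summarization_rows, ["article", "summary"])
```

A catalog entry that tags the source documents as "suitable for Question Answering" tells ML Ops in advance which of these layouts the data can feed.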
2 – Risks
For LLMs to be successful, they need not only traditional data quality checks but also checks of the unstructured data itself for sensitivity, toxicity, profanity, privacy, sparsity, and bias. Typically, there are AI tools to help with this, but Data Governance can provide a head start and support for the effort. Data Stewards are familiar with the content they steward and should be able to provide an initial check, or possibly a risk profile. These checks and risks can be curated as attributes in the Data Catalog.
3 – How?
As mentioned earlier, there are troves of documents from decades past on SharePoint instances and various file repositories. If someone with tribal knowledge is available, the best-case scenario is for them to catalog these assets. However, in many cases the people with tribal knowledge have retired or moved on. In that case, GenAI tools can help. Documents can be summarized, scanned for PII, and more, and then even auto-entered into the Data Catalog, where a human steward can curate them much more easily. It is important to note that this methodology carries a compute cost. However, costs will come down as GenAI algorithms improve and GenAI infrastructure becomes commoditized.
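The auto-cataloging step can be sketched as follows. The field names are illustrative, and the summarizer here is a deliberate stand-in (it just takes the first sentence) for the GenAI service a real pipeline would call; the key idea is that each document becomes a draft catalog record flagged for human review.

```python
from pathlib import Path

def summarize(text):
    """Stand-in for a GenAI summarizer: takes the first sentence only.
    A real pipeline would call an LLM service here (a compute cost)."""
    return text.split(".")[0].strip() + "."

def catalog_entry(path, text):
    """Build a draft Data Catalog record for one document.

    Field names are assumptions for illustration; a human steward reviews
    and enriches the record afterwards.
    """
    return {
        "asset_name": Path(path).name,
        "source_path": str(path),
        "summary": summarize(text),
        "steward_reviewed": False,  # flipped to True after human curation
    }

entry = catalog_entry(
    "sharepoint/policies/refunds.docx",
    "Refunds are accepted within 30 days. See appendix A for exceptions.",
)
```

Running a loop like this over a repository turns an unbrowsable document trove into draft catalog entries that stewards curate instead of creating from scratch.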
The goal is to provide ML Ops, and others, with a central resource where they can find relevant data as quickly and easily as possible, and be aware of its risks and caveats too.
If unstructured data is properly cataloged, then GenAI initiatives will be accomplished faster with a higher success rate.