PUBLISHED ON LINKEDIN
The Cambrian Explosion of Data Catalogs

By Malcolm Chisholm

The fossil record can be very puzzling.  One of the greatest mysteries is the appearance of large groups of diverse organisms during relatively short time periods.  The most famous of these is the “Cambrian Explosion”, some 540 million years ago, during which nearly all of the animal body plans we are familiar with today came into existence.  In addition, there were many bizarre forms that quickly went extinct.  During the subsequent half billion years, relatively little changed in terms of the fundamental body plans of the surviving groups, although variations on these body plans led to enormous diversity within the constraints set by their original architecture.

Figure 1: Fossil of Hallucigenia, a Cambrian Explosion Animal with No Known Modern Counterparts (from the Burgess Shale of Canada). By Han Zeng – http://n2t.net/ark:/65665/3908098df-4e21-46f7-931a-61fa3787dc80, CC0, https://commons.wikimedia.org/w/index.php?curid=132294399

Could we be seeing a Cambrian Explosion of Data Catalogs today?  There is good reason to think this may be the case.

Data Catalogs are a new class of metadata tool.  Broadly speaking, their goal is to unlock the value of the enterprise’s data resource for everyone who may need to work with data.  As such, they are definitely an enterprise class of tool, rather than a class of tool designed to support a specific type of business unit.  Being enterprise-wide in scope makes them extremely important, and strategic in nature.  It also means that the market for them is enormous.

Before 2019, an epoch that is the equivalent of the Precambrian period in this story, tools that resembled primitive Data Catalogs inhabited favorable environmental niches, like Data Governance units, or BI teams.  But in 2019, and continuing into 2020, a bewildering array of products came into the market, in many cases seemingly from nowhere.  I recently counted 42 of them, and the number appears to be growing.

Now, if all these Data Catalogs were clones of each other, or had very similar functionality, it would be easier to understand them.  But their functionality varies, and it is this diversity that makes the Cambrian Explosion a particularly good analogy for what is going on.  Just like good paleontologists, we need to start by classifying the fundamental types of forms we are looking at.  This is tricky and can easily be proven wrong at a later date, but let’s give it a try.

It seems to me that if we look at the fundamental paradigms of the current Data Catalogs, they have 3 major orientations that are likely to be the pathways for their future evolution, as shown in the illustration below:

Figure 2: Illustration of Fundamental Paradigms and Likely Future Evolutionary Pathways of Data Catalogs

Let’s explore the 3 fundamental paradigms shown in Figure 2:

  • Human Factors in Data. This covers all metadata at the level of business understanding, including metadata that guides human behavior around data.  A big part of this is what has been called the “Business Glossary”, which today covers much more than terms and definitions.  Collaboration and sharing are also included, as are rules, roles and responsibilities in dealing with data.
  • Technical Metadata Inventory. This covers all the technical metadata that is related to data.  Data Dictionaries that provide an understanding of databases and other data stores are one example.  Report metadata, data lineage, data discovery, and automated data classification are other areas of metadata covered by this functionality.
  • Active Data Management. This covers enabling people to directly work with data through the Data Catalog.  It is not just providing helpful information, but providing an environment where actual data manipulation can occur.  Also included is metadata engineering, which is the use of metadata to directly manipulate data.

Each Data Catalog product has some combination of all three of these fundamental orientations, but each product typically emphasizes one of them.  The likelihood is that each Data Catalog will continue to focus on this one orientation and build more and more functionality to support it.  Of course, this is a prediction, so we will see how it really turns out.
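
For readers who like to see a classification written down, the idea that every product blends all three orientations but emphasizes one can be sketched in a few lines of Python. This is only an illustration: the product name, the 0-to-1 emphasis scores, and the class and method names are all hypothetical and are not drawn from any real Data Catalog or vendor assessment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Paradigm(Enum):
    """The three fundamental paradigms described above."""
    HUMAN_FACTORS = "Human Factors in Data"
    TECHNICAL_INVENTORY = "Technical Metadata Inventory"
    ACTIVE_MANAGEMENT = "Active Data Management"


@dataclass
class CatalogProfile:
    """A product's (hypothetical) emphasis on each paradigm, scored 0 to 1."""
    name: str
    emphasis: Dict[Paradigm, float]

    def dominant_paradigm(self) -> Paradigm:
        # Every product has some of all three; pick the one it stresses most.
        return max(self.emphasis, key=self.emphasis.get)


# Made-up product and scores, purely for illustration.
example = CatalogProfile(
    name="ExampleCatalog",
    emphasis={
        Paradigm.HUMAN_FACTORS: 0.3,
        Paradigm.TECHNICAL_INVENTORY: 0.6,
        Paradigm.ACTIVE_MANAGEMENT: 0.1,
    },
)

print(example.name, "emphasizes:", example.dominant_paradigm().value)
```

Scoring a shortlist of products this way, even roughly, makes it easier to see which ones are really competing with each other and which belong to different groups.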

One other feature to note is that we can see a distinction between “Active” and “Passive” Data Catalogs.  Active Data Catalogs help users to create data products, usually oriented to some kind of analytics.  Passive Data Catalogs hold information that is used to understand, govern, manage, and use the enterprise data resource.  All Data Catalogs mix active and passive functionality to some degree.

From this discussion we can see that Data Catalogs belong to groups based on three different paradigms, and that individual products are likely to evolve in ways that further differentiate themselves based on the particular paradigm they have adopted.  No doubt there will continue to be new entrants, but if the Cambrian Explosion analogy holds, the number of new entrants will decline rapidly in the near future, followed by a long period with extinctions and variations on the themes already established.

Time will tell.
