Data Products Revisited
Data products have taken the data industry by storm in the past few years, and can now
clearly be seen to be more than a fad. The simplest view is that data products are
datasets, and this seems to be how businesspeople view them. In the past there was a
notion that the process to produce a dataset was also part of the data product. Of
course, the process is necessary to produce the product, but the common view now
appears to be that the output alone is the product.
This is all well and good, but is the data product concept really mature yet? Let’s look at
data products from the perspective of how things were done in the past.
Lessons from Pub/Sub
“Data product” may be a new term, but similar things have been done in the past. One
example was the Publish/Subscribe Hub (“Pub/Sub”). A Pub/Sub Hub was essentially a server
where batch files were placed, and from which they could be picked up by other
processes.
But here is an interesting twist. The datasets in the Pub/Sub paradigm were produced
via production processes that did not involve the team that developed them. Today,
there seems to be an expectation that the team that developed the data product is to
some extent involved in running the data pipelines to produce it.
In a Pub/Sub Hub there was no concept like a Data Product Owner. Of course, if
something went wrong, like a dataset not being produced within SLA, there was a way
to find out which application support team had to be contacted to fix the problem. But
there is a philosophical problem here too.
Development vs. Production Support
When I was in the early years of my career as a developer, my team lead asked me
what I thought the primary quality attribute of good code was.
I replied, “Ease of maintenance.”
“Wrong,” he replied. “Everything you develop has to be robust. You want to keep
developing new code. You don’t want to be involved in supporting anything in production.”
By definition, a Data Product Owner is involved in production support. It is not only
supporting the production of the dataset, but keeping the metadata up to date,
answering questions about the dataset, dealing with data quality, and so on. And this is
apart from the trend of data engineering teams actually running the pipelines to
generate the data products – as opposed to the older concept of a turnover to a
Production Control function.
Where Does This Leave Us?
The more data products a data engineering team produces, the greater its burden of
production processing relative to new development work. Also, there is an increased
risk that the data engineers become part of the process, because they never had to
make it robust enough to survive a handover to Production Control – the team that runs
processes in production. Further, being identified as a Data Product Owner carries far more
downside risk than prospect of reward. This would indicate that at a minimum
there are aspects of the data product concept that remain to be ironed out. Time will
tell.
By Malcolm Chisholm
