Data is the new oil – but without tight control, how can a business determine that its data is valid? Data catalogues are back in vogue
In his opening remarks at Tibco Live in London at the end of September, Tibco CTO Nelson Petracek discussed the value of data in driving innovation. “Innovation can occur anywhere you are,” he said. “Innovation has to do with technology, the way you develop software and aspects of culture. But data drives innovation, and the organisations that drive innovation have a handle on data.”
In Petracek’s experience, it is necessary not only to be able to connect to data sources, but also to be able to handle data at the rate it is being created. The company’s overall vision is to establish a platform on the Tibco Cloud to support the development and deployment of new, data-driven, cloud-native applications, he said.
But the idea of having a streamlined data pipeline poses risks to organisations, particularly from a data governance perspective.
Speaking to Computer Weekly before his keynote presentation at Tibco Live, Petracek said: “Without proper care and planning, data will quickly become an unmanageable environment. We are seeing renewed interest in data lineage, data catalogues and the mastering of metadata. Without a proper governance programme and data stewardship, you will have data all over the place and it will quickly get out of hand.”
For an organisation to become truly data-driven, the quality of data is critical, said Petracek. “I know customers who have loaded a bunch of data in Hadoop and have no idea if they loaded the data a year ago, if it is still valid and who is actually using it. If I want to look for a customer name, I need to understand what databases out in the organisation hold the customer name.”
The concept of a data catalogue has bounced around the industry for a number of years, but it is now regaining traction in the enterprise, said Petracek. Rather than try to identify every database holding a customer name field, the data catalogue provides a single reference point that data administrators can go to.
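The single-reference-point idea can be illustrated with a minimal Python sketch. It is not Tibco's product or API: the `DataCatalogue` class, the `FieldLocation` record and the database, table and column names are all hypothetical, chosen only to show how one lookup can replace a search across every database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldLocation:
    """Where a business term physically lives (hypothetical record type)."""
    database: str
    table: str
    column: str


class DataCatalogue:
    """Hypothetical in-memory catalogue mapping business terms to columns."""

    def __init__(self):
        # term (lower-cased) -> set of physical locations holding that term
        self._index = defaultdict(set)

    def register(self, term, database, table, column):
        """Record that a column in a given database holds the business term."""
        self._index[term.lower()].add(FieldLocation(database, table, column))

    def find(self, term):
        """Return every known location of the term, sorted for stable output."""
        found = self._index.get(term.lower(), set())
        return sorted(found, key=lambda loc: (loc.database, loc.table, loc.column))


# Illustrative entries only; none of these systems come from the article.
catalogue = DataCatalogue()
catalogue.register("customer name", "crm", "accounts", "cust_name")
catalogue.register("customer name", "billing", "invoices", "customer_full_name")
catalogue.register("order total", "billing", "invoices", "total")

locations = catalogue.find("Customer Name")
databases = [loc.database for loc in locations]
print(databases)
```

A production catalogue would also carry the lineage, ownership and freshness metadata Petracek describes, which this sketch leaves out.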
Beyond the data catalogue, Petracek said that heuristics – the technique used to identify unknown virus signatures – can be used with data virtualisation to identify unusual data usage patterns. Looking at such patterns is not limited to tracking unauthorised access, but can help data managers to identify whether data is being duplicated unnecessarily or to improve performance.
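Petracek did not describe a specific algorithm, so the following Python sketch swaps in a simple statistical stand-in: it flags any day whose read count sits more than a chosen number of standard deviations from the average. The function name `flag_unusual`, the threshold and the sample figures are assumptions made for illustration.

```python
from statistics import mean, pstdev


def flag_unusual(counts, threshold=2.0):
    """Return the keys whose count deviates from the mean by more than
    `threshold` population standard deviations (a simple z-score rule,
    used here as a stand-in for the heuristics mentioned in the article)."""
    values = list(counts.values())
    average = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # Every period looks identical, so nothing stands out.
        return []
    return [
        key for key, count in counts.items()
        if abs(count - average) / spread > threshold
    ]


# Hypothetical daily read counts for one virtualised data set.
daily_reads = {"Mon": 100, "Tue": 98, "Wed": 103, "Thu": 101, "Fri": 99, "Sat": 450}

unusual_days = flag_unusual(daily_reads)
print(unusual_days)
```

A flagged day is only a prompt for investigation: the spike might be unauthorised access, but it could equally be a duplicate copy job or a badly performing query.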
Such techniques could become more important as and when organisations put artificial intelligence (AI) at the centre of their data-driven application development process. There is much talk in the industry of innovating through data.
In 2018, McKinsey’s Notes from the AI frontier paper, which looked at uses for AI across industry sectors, found a correlation between the performance of traditional analytics applications and that of applications using deep learning AI techniques to process large data sets.
The report’s authors noted: “Making effective use of neural networks in most applications requires large labelled training data sets alongside access to sufficient computing infrastructure. As the size of the training data set increases, the performance of traditional techniques tends to plateau in many cases. However, the performance of advanced AI techniques using deep neural networks, configured and trained effectively, tends to increase.
“Furthermore, these deep learning techniques are particularly powerful in extracting patterns from complex, multidimensional data types, such as images, video, and audio or speech.”
Given that deep learning algorithms are adept at handling large, complex datasets, there is a growing belief across the IT industry that AI is at the heart of a new approach to designing and building data-driven applications. Handling data in such applications is not hard-coded – data is processed via an AI inference engine that learns and adapts the data model it uses, based on the data it ingests.
In his presentation at Tibco Live, the company’s chief operating officer, Matt Quinn, asked delegates to consider how they might be hindering their own innovation by only basing assertions on what they already know. “Today’s innovation is held back by yesterday’s constraints,” he said. “We are using principles that have existed since the dawn of computers. Even though tools today are vastly superior to those of yesteryear, we’re still putting the same principles of yesteryear on our systems today.”
Whereas programmers previously developed software “hard-coded” to make decisions based on what they expected from the data, industry experts believe that AI leads to a new approach to application development, said Quinn, adding: “We need to rethink the way we build systems.”