Why data governance is more important than ever
*Blog image by Jakub Szepietowski
There's no doubt that agile methodologies will gain momentum in the business analytics world. A key driver of this trend is technology enablement. Advances in self-service data exploration and visualisation tools, such as Tableau, Qlik and Power BI, have empowered business users without specialist knowledge of statistics or data analytics to query their corporate data while bypassing IT.
Similarly, self-service data blending tools like Alteryx and Trifacta are designed so intuitively that front-line analysts can now independently access, cleanse, and blend datasets using only drag-and-drop functionality. While these technologies accelerate the flow of ad-hoc business insights, they also create a set of data governance challenges that businesses have to address:
A new role for IT – becoming a data police force
Many businesses mistakenly think that self-service tools make BI or data science teams in the IT function irrelevant. In some ways the opposite is true. IT teams may no longer spend hours preparing BI reports for executives, but their role in governing the quality of corporate data has never been more important. With different people around the business now able to source, integrate, and explore data individually, it is crucial that the data in circulation is up to date, clean, and understood consistently.
I have seen many cases where departments within even a relatively small business define common, core business analytics terms differently. This often leads to markedly different conclusions from the same analytical report, a problem all Excel users are familiar with. The result is multiple versions of the truth, which stir arguments in meetings about whose data is correct and what it means. This is why mature data-driven companies invest time and resources in building strong data governance practices within IT data warehouse teams to ensure consistency and mutual understanding across the organization.
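One way to picture a shared definition is as a small glossary that every report must go through. The sketch below is illustrative only: the metric name `active_customers`, its rule, and the sample rows are hypothetical, not taken from any particular business.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MetricDefinition:
    """One agreed definition of a business term (hypothetical structure)."""
    name: str
    description: str
    compute: Callable[[List[dict]], float]


def _active_customers(orders: List[dict]) -> float:
    # Assumed rule for illustration: a customer is "active" if they have
    # at least one order that was not cancelled.
    return float(len({o["customer_id"] for o in orders if o["status"] != "cancelled"}))


# The single place where the term is defined for every department.
GLOSSARY: Dict[str, MetricDefinition] = {
    "active_customers": MetricDefinition(
        name="active_customers",
        description="Distinct customers with at least one non-cancelled order.",
        compute=_active_customers,
    ),
}


def compute_metric(name: str, rows: List[dict]) -> float:
    """Look up the agreed definition and fail loudly on undefined terms."""
    if name not in GLOSSARY:
        raise KeyError(f"No agreed definition for metric '{name}'")
    return GLOSSARY[name].compute(rows)


orders = [
    {"customer_id": "c1", "status": "paid"},
    {"customer_id": "c1", "status": "paid"},
    {"customer_id": "c2", "status": "cancelled"},
    {"customer_id": "c3", "status": "paid"},
]
result = compute_metric("active_customers", orders)
```

Because every department calls `compute_metric` rather than writing its own formula, a disagreement about what "active" means becomes a change to the glossary instead of an argument in a meeting.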
How are you monitoring your data sources?
Another area of focus in data governance is monitoring access to data: who has access to what. A distinction must be made between corporate data and enrichment data from external sources. While data management teams should be responsible for the corporate data estate, they should not necessarily have full responsibility for external enrichment data, unless this has been strictly defined and understood. This does not mean that enrichment data should be unavailable to business analysts as a valuable resource, but the business needs to understand that this data may not be as consistent or as well defined as its corporate data.
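As a rough illustration of the distinction above, a register of data sources can record whether each source is corporate or enrichment data, who owns it, and which roles may read it, while logging every access check. All names here (`DataSource`, `AccessRegister`, the roles and sources) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DataSource:
    name: str
    kind: str                      # "corporate" or "enrichment"
    owner: str                     # team accountable for the source
    allowed_roles: Set[str] = field(default_factory=set)


@dataclass
class AccessRegister:
    sources: Dict[str, DataSource] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def register(self, source: DataSource) -> None:
        self.sources[source.name] = source

    def can_read(self, role: str, source_name: str) -> bool:
        """Check access and record the attempt, whether allowed or not."""
        source = self.sources.get(source_name)
        allowed = source is not None and role in source.allowed_roles
        self.audit_log.append((role, source_name, allowed))
        return allowed

    def caveat(self, source_name: str) -> str:
        """Tell analysts how much consistency to expect from a source."""
        if self.sources[source_name].kind == "enrichment":
            return "External enrichment data: definitions may be inconsistent."
        return "Governed corporate data."


register = AccessRegister()
register.register(DataSource("sales_dw", "corporate", "data-management", {"analyst", "finance"}))
register.register(DataSource("weather_feed", "enrichment", "marketing", {"analyst"}))

analyst_ok = register.can_read("analyst", "weather_feed")
finance_ok = register.can_read("finance", "weather_feed")
```

The audit log answers "who tried to access what", and the `caveat` text makes the difference in data quality expectations visible to the analyst at the point of use.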
Finding needles in the haystack of big data: store now, select later
Until recently, limited data processing and storage capacity meant that companies had to sift through the influx of big data and store only what they deemed necessary. A major development in data management technology has revolutionised this. Businesses no longer need to decide up front what data to keep, because the cost and implementation effort of storage with a Hadoop or equivalent stack are no longer a deterrent. Today, companies can easily dump all their data into commodity storage and then select and analyze it later.
The idea that you can first store all the data you generate and later let everyone pull out the data relevant to them is empowering. Yet it does not work without data governance. When the NoSQL and big data vendors first became popular, their aim was to enable cost-effective storage of large, fast and varied data. As the big data stack has matured, the challenge has moved to how companies can extract valuable information from all of their stored data in a timely and consistent manner.
Re-thinking the data landscape
This has to begin with data governance that defines the schemas and meaning of an organization's data. As the technology has matured, so have the standards of user experience. Users are demanding easier ways to leverage this data consistently. This has propelled the development of a host of tools that sit on the Hadoop stack (e.g. HCatalog, Hive, Spark, Impala, Apache Drill, Apache HAWQ) to assist in defining schemas for this data, allowing analysts to query the data with some consistency.
With these challenges in mind, businesses must think through their data governance strategy early in their analytics journey. Defining a new focus for the IT function will be a critical part of the process, which I will explore in my next blog.
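To make the idea of applying a schema to data that was stored first and selected later more concrete, here is a minimal plain-Python sketch. It does not use the API of any of the tools named above; the `SCHEMA` fields and the sample records are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical agreed schema: field name -> type used to interpret raw values.
SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "order_id": str,
    "amount": float,
    "country": str,
}


def apply_schema(
    record: Dict[str, Any], schema: Dict[str, Callable[[Any], Any]]
) -> Tuple[Dict[str, Any], List[str]]:
    """Interpret one raw record against the schema at read time.

    Returns the typed fields that could be read plus a list of problems,
    so inconsistent records are flagged rather than silently used.
    """
    typed: Dict[str, Any] = {}
    errors: List[str] = []
    for field_name, field_type in schema.items():
        if field_name not in record:
            errors.append(f"missing field: {field_name}")
            continue
        try:
            typed[field_name] = field_type(record[field_name])
        except (TypeError, ValueError):
            errors.append(f"bad value for {field_name}: {record[field_name]!r}")
    return typed, errors


# Raw records as they might sit in commodity storage, untyped.
raw_records = [
    {"order_id": "A1", "amount": "19.99", "country": "GB"},
    {"order_id": "A2", "amount": "n/a"},
]

clean_row, clean_errors = apply_schema(raw_records[0], SCHEMA)
bad_row, bad_errors = apply_schema(raw_records[1], SCHEMA)
```

The first record reads cleanly, while the second is reported with one unreadable value and one missing field. Every analyst who reads through the same schema gets the same interpretation of the same raw data.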
Have you got any challenges in data governance projects you would like to share? Join the discussion below.