Data governance is certainly not a new concept — as long as data has been collected, companies have needed some level of policy and oversight for its management. Yet it largely stayed in the background, handled by IT and never seeing the light of day, as businesses weren’t using data at a scale that required data governance to be top of mind.
In the last few years, data governance has shot to the forefront of discussions, both in the media and in the boardroom, as businesses take their first steps towards Enterprise AI. Increased government involvement in data privacy has no doubt played a part here.
However, companies are starting to realise that data governance was never really established in a way that can handle the massive shift toward democratised machine learning required in the age of AI.
Traditionally, data was something that belonged to IT, which enabled its storage and retrieval. Need customer data to do some churn analysis? It must be requested from IT. Need to do some analysis on fraud? Request the data from IT. Sure, the business lines own the analysis, but at the root of it all sat IT, fulfilling every data request. And this was the extent of data governance.
Today, the democratisation of data science across the enterprise and tools that put data into the hands of the many and not just the elite few (like data scientists or even analysts) means that companies are using more data in more ways than ever before. And that’s super valuable; in fact, the businesses that have seen the most success in using data to drive the business take this approach.
But it also presents new challenges: businesses’ IT organisations are often unable to handle the demands of data democratisation, which has created a sort of power struggle between IT and the business that slows overall progress towards Enterprise AI. The answer to this challenge, and the topic of this white paper, is a fundamental organisational shift to a new type of data governance, one that enables data use while also protecting the business from risk.
Here, we’ll explore the components of a modern data governance programme. While these may require organisational change to achieve, in the long run they will allow for Enterprise AI at a scale that is responsible and sustainable.
AI Governance, Defined
Traditionally, data governance includes the policies, roles, standards and metrics around the use of information in enabling a company to achieve its goals. It ensures the quality and security of an organisation’s data by clearly defining who is responsible for what data, and what actions they can take, using what methods.
With the rise of big data, machine learning and AI, it is tempting to think that a well-crafted data governance strategy is no longer needed.
Surely you can get the data into a big data lake as quickly as possible, so that data scientists and analysts can wrangle it to fit the needs of the business?
This thinking would be wrong. The need for data governance is greater than ever, as organisations make more decisions, with more data, at a greater frequency every day.
By not having effective governance and quality controls, all you are doing is kicking the can down the road for the analysts, data scientists and business users to deal with. Repeatedly. And in inconsistent ways. And this leads to a lack of trust at every stage of the data pipeline and with the end-users.
If people across an organisation do not trust the data, how can they confidently and accurately make the right decisions?
Who is responsible for governance?
As discussed in the introduction, traditional IT organisations have historically addressed only the data governance piece. As businesses move into the age of data democratisation, and stewardship, access and ownership become necessary, management has often (incorrectly) put IT teams in the position of also taking responsibility for information governance pieces that should be owned by business teams.
Why? Because the skill sets for each of these governance components are different. Those responsible for data governance will have expertise in data architecture, privacy, integration, and modelling. However, those on the information governance side should be business experts — they know what the data is, where it comes from, how and why it’s valuable to the business, and how it can (or should) be used to unlock its potential. In short, data governance needs to be a collaboration between IT and Business stakeholders.
From Data Governance to Data & AI Governance
A traditional data governance programme oversees a wide range of activities, including Data Security, Reference & Master Data Management, Data Quality, Data Architecture and Metadata Management (see Figure 1 below).
Figure 1: Moving from Traditional Data Governance to Data & AI Governance
Now, with the growing adoption of machine learning and AI, there are new components that should also sit under the data governance umbrella (see Figure 2 below): machine learning model management and the ethical use of data, both of which we expand upon below.
Machine Learning Model Management
Just as the use of data is governed by a data governance programme, the development and use of machine learning and AI models require clear, unambiguous policies, roles, standards and metrics.
These would aim to answer questions such as:
Who is responsible for the performance and maintenance of production machine learning models?
How are machine learning models updated/refreshed to account for model drift (deterioration in the model’s performance)?
What performance metrics are measured when developing and selecting models and what level of performance is acceptable to the business?
How are models monitored over time to detect model deterioration or unexpected, anomalous data and predictions?
How are models audited and are they explainable to those outside of the team developing them?
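To make these questions concrete, here is a minimal sketch of how the answers could be recorded alongside each production model and checked automatically. The ModelRecord fields, the 90-day refresh policy and the example values are our own assumptions for illustration, not part of any particular tool or standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ModelRecord:
    """Hypothetical registry entry describing one production model."""
    name: str
    owner: str               # team accountable for performance and maintenance
    metric: str              # agreed performance metric, e.g. "auc"
    min_acceptable: float    # lowest metric value the business will accept
    last_refreshed: date     # date the model was last retrained
    max_age_days: int = 90   # assumed refresh policy, set by the governance team


def review_reasons(record: ModelRecord, current_value: float, today: date) -> list[str]:
    """Return the policy rules the model currently breaches (empty list if none)."""
    reasons = []
    if current_value < record.min_acceptable:
        reasons.append(
            f"{record.metric} {current_value:.2f} is below {record.min_acceptable:.2f}"
        )
    if today - record.last_refreshed > timedelta(days=record.max_age_days):
        reasons.append(f"not refreshed in over {record.max_age_days} days")
    return reasons


# Illustrative values only.
churn = ModelRecord("churn-v3", "customer-analytics", "auc", 0.75, date(2024, 1, 10))
for reason in review_reasons(churn, current_value=0.71, today=date(2024, 6, 1)):
    print(f"{churn.name} (owner: {churn.owner}): {reason}")
```

Keeping the owner, the metric and the acceptable threshold in one record means that each question above has a single, auditable answer per model.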
The second new aspect for a modern data governance strategy is the governance and policies around the ethical use of data. This has been brought to the fore with questions, debate and discussions around ethics and AI, but is also something that perhaps has been overlooked by data governance in general.
Data should be used in line with the ethical standards of a company, whether machine learning and AI models are involved or not. Cambridge Analytica is one example of a company falling foul of what many would see as ethical standards around data privacy, by acquiring masses of Facebook users’ data without their knowledge.
Machine learning and AI have pushed ethics further up the agenda, given the ethical implications of decisions being made by models rather than humans.
Governance around ethics would aim to answer questions such as:
What are the protected characteristics that should be omitted from the model training process (such as ethnicity, gender, age and religion)?
How do we account for and mitigate model bias and unfairness against certain groups?
How do we respect the data privacy of our customers/employees/users/citizens?
How long can we legitimately retain data beyond its original intended use?
Are the means by which we collect and store data ethical?
Figure 2: Data & AI Governance
Most enterprises today identify data governance as a very important part of their data strategy, but more often than not, it’s because bad data governance is risky. And that’s not a bad reason to prioritise it. After all, complying with regulations and avoiding bad actors or security concerns is critical.
However, governance programmes aren’t just beneficial because they keep the company safe; their effects are much wider.
Save money: According to Gartner research, organisations believe that poor data quality is responsible for an average of $15 million per year in losses. The cost of security breaches can also be huge: an IBM report estimates the average cost of a data breach at $3.92 million. Robust data governance, covering data quality and security, can result in huge savings for a company.
Improve trust: Data governance, properly implemented, can improve trust in data at all levels of an organisation, allowing employees to be more confident in the decisions they make with company data. It can also improve trust in the analysis and models produced by data scientists, as improved data quality leads to greater accuracy.
Reduce risk: Data governance can reduce the risk to company reputation from data breaches and from public relations issues where data is seen to have been used unethically. With increased regulation around data, fines can also be incredibly damaging. GDPR is the prime example, with fines of up to €20 million or 4% of annual worldwide turnover, whichever is higher.
It’s not just about keeping the company safe: data and AI governance are essential to bringing the company up to today’s data standards, that is, democratisation.
The Five Components of a Modern Data & AI Governance Strategy
Now that the benefits of Data & AI Governance are clear, what practical steps can companies take for a modern Data & AI Governance strategy?
1. Top-Down And Bottom-Up Strategy
Every Data Governance programme needs executive sponsorship. Without strong support from leadership, it is unlikely a company will make the right, and often difficult, changes to improve data security, data quality and data management.
At the same time, individual teams have to take collective responsibility for the data they manage and the analysis they produce. There needs to be a culture of continuous improvement and ownership of data issues. This bottom-up approach can only be achieved with top-down communication and recognition of teams that have made improvements to data quality and security.
2. The Balance Between Governance and Enablement
Governance shouldn’t be a blocker to innovation; it should enable and support it. A distinction needs to be made between proofs-of-concept and industrialised data products. Space needs to be given for the former, but a decision also needs to be made about when a proof-of-concept should receive the funding, testing and assurance to become an industrialised solution.
3. Quality at its Heart
In many companies, data products produced by data science and business intelligence teams have not had the same commitment to quality seen in traditional software development, fostered by movements such as extreme programming and software craftsmanship. This is quickly changing. If insights are to be trusted and adopted by the business at scale, data products need the same high level of quality that traditional software achieves through code review, testing and continuous integration/continuous delivery (CI/CD).
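As a rough illustration of what testing can mean for a data product, the sketch below runs two simple data quality rules (required fields are populated, a key is unique) over a batch of records. The check_quality function, the field names and the sample rows are hypothetical; a real programme would agree its own rules and would more likely use a dedicated validation library.

```python
def check_quality(rows, required, unique_key):
    """Return a list of human-readable data quality issues found in rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Rule 1: every required field must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                issues.append(f"row {i}: missing {field}")
        # Rule 2: the key must not repeat across rows.
        key = row.get(unique_key)
        if key is not None:
            if key in seen:
                issues.append(f"row {i}: duplicate {unique_key} {key!r}")
            seen.add(key)
    return issues


# Made-up sample data for illustration.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 1, "email": "c@example.com"},
]
issues = check_quality(customers, required=["customer_id", "email"], unique_key="customer_id")
for issue in issues:
    print(issue)
```

Run as part of a CI/CD pipeline, a non-empty list of issues would fail the build, in the same way that a failing unit test blocks a software release.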
4. Model Management is a Key Factor
As machine learning and deep learning models become more widespread in the decisions made across industries, model management is becoming a key factor in any Data/AI Governance strategy. Models can degrade over time through model drift. Continuous monitoring, model refreshes and testing are needed to ensure the performance of models meets the needs of the business.
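One common way to monitor for drift in a model’s input data is the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in recent data. The sketch below is a simplified version under our own assumptions: ten equal-width bins taken from the baseline range, and a small floor on empty bins. It is meant to show the idea rather than serve as a production implementation.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all baseline values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # feature values seen at training time
recent = [0.5 + i / 200 for i in range(100)]   # recent values, shifted upwards
print(f"PSI, baseline vs itself: {psi(baseline, baseline):.3f}")
print(f"PSI, baseline vs recent: {psi(baseline, recent):.3f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold that triggers a model refresh is a policy decision for the governance programme to make.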
MLOps is an attempt to take the best of DevOps processes from software development and apply them to Data Science.
Figure 3 — MLOps
Open-source software like MLflow and DVC (Data Version Control) makes managing models easier than ever.
5. Ethics and Transparency are Essential
There is growing scrutiny on the decisions made by machine learning and deep learning models, and rightly so. Models are making decisions that impact many people’s lives every day. So understanding the ethical implications of the decisions they make and making the models explainable is essential.
Open source toolkits such as Aequitas, developed by the University of Chicago, make it simpler for machine learning developers, analysts, and policymakers to audit machine learning models for discrimination and bias.
Below is an example Aequitas bias and fairness report showing that a model used for identifying individuals likely to have police charges has bias across gender, marital status and race.
Figure 4 — Example Aequitas Bias and Fairness Audit Report
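To give a feel for the kind of measure such an audit reports, the sketch below computes the rate of positive predictions per group and each group’s disparity relative to a reference group. This is a hand-rolled illustration and not the Aequitas API; the function names, the field names and the sample predictions are all assumptions made for the example.

```python
from collections import defaultdict


def positive_rates(records, group_field, prediction_field="predicted"):
    """Share of positive predictions for each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_field]] += 1
        positives[r[group_field]] += int(r[prediction_field])
    return {g: positives[g] / totals[g] for g in totals}


def disparity(rates, reference_group):
    """Each group's positive rate divided by the reference group's rate.

    Assumes the reference group's rate is non-zero.
    """
    ref = rates[reference_group]
    return {g: rate / ref for g, rate in rates.items()}


# Made-up predictions for illustration.
predictions = [
    {"gender": "female", "predicted": 1},
    {"gender": "female", "predicted": 0},
    {"gender": "female", "predicted": 0},
    {"gender": "female", "predicted": 0},
    {"gender": "male", "predicted": 1},
    {"gender": "male", "predicted": 1},
    {"gender": "male", "predicted": 0},
    {"gender": "male", "predicted": 0},
]
rates = positive_rates(predictions, "gender")
ratios = disparity(rates, reference_group="male")
for group, ratio in ratios.items():
    print(f"{group}: positive rate {rates[group]:.2f}, disparity {ratio:.2f}")
```

A disparity far from 1.0 suggests the model treats groups differently and warrants investigation. Which metric to compare, and how much disparity is tolerable, are choices the governance programme needs to make explicitly.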
Data & AI Governance Pitfalls
Despite the clear importance and tangible benefit of having an effective data and AI governance programme, there are several pitfalls that organisations can fall into.
Lack of senior sponsorship
A governance programme is ineffective if there isn’t senior sponsorship and the policies have no “teeth”. Employees will often revert to the status quo if there are no top-down consequences when data governance policies aren’t adhered to, and no recognition when positive steps are taken to improve data governance.
A lack of clear communication around data governance policies, standards, roles and metrics can lead to a data governance programme being ineffective. If employees aren’t aware of, or educated about, the policies and standards, then how can they implement them? Using the best communication and education channels available, whether that be webinars, e-learning, online documentation, mass emails or videos, can help communicate the policies and goals of a data governance programme throughout an organisation.
Finally, if there isn’t a culture of ownership and commitment to improving the use and exploitation of data throughout the organisation, it is very difficult for a data governance strategy to be effective. As the saying goes, “culture eats strategy for breakfast”.
There are a few conclusions we hope you take away from this article:
Firstly, traditional data governance and all the areas underneath it, whether data quality, master data management or data security, are still important.
Secondly, machine learning and AI have added new aspects to data governance around model management and ethics.
Finally, the right sponsorship, investment, culture and communication are needed to make sure a data governance programme is effective and leads to continuous improvement across the organisation.