As Big Data continues to grow, businesses depend more than ever on data to inform the decisions they make. That makes the quality of that data more important as well, and it requires good data governance to ensure data integrity and security. Inaccurate data or inappropriate use of data can significantly limit a company's ability to leverage that data for business growth.
Data Characteristics Make Big Data Hard to Govern
Big Data differs from traditional data in several ways. The data is often unstructured and collected from a mix of internal and external sources. It's often gathered from streaming sources, without any validations applied. These characteristics make it difficult to know if the data is accurate, complete, or consistent. The same information can be included multiple times but expressed differently in different information sources.
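To make the duplicate problem concrete, here is a minimal Python sketch of one way to spot the same person recorded differently in different sources. The source names, the sample records, and the helper names `normalize_record` and `find_duplicates` are all hypothetical, and the matching rule (same normalized name and email) is an assumption chosen for illustration, not a complete entity-resolution method.

```python
import re
import unicodedata


def normalize_record(name, email):
    """Reduce a (name, email) pair to a canonical form for comparison."""
    # Strip accents, then lowercase and replace punctuation with spaces.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.sub(r"[^a-z0-9 ]+", " ", ascii_name.lower()).split()
    # Sort the words so "Smith, Jane" and "Jane Smith" compare equal.
    return " ".join(sorted(words)), email.strip().lower()


def find_duplicates(records):
    """Return (first_source, later_source) pairs that describe the same person."""
    seen = {}
    duplicates = []
    for source, name, email in records:
        key = normalize_record(name, email)
        if key in seen:
            duplicates.append((seen[key], source))
        else:
            seen[key] = source
    return duplicates


# Hypothetical records from three feeds: (source, name, email).
records = [
    ("crm", "Jane Smith", "jane.smith@example.com"),
    ("weblog", "SMITH, Jane", " Jane.Smith@Example.com "),
    ("survey", "John Doe", "jdoe@example.com"),
]

print(find_duplicates(records))
```

Exact matching on normalized fields only catches the easy cases. Misspellings, nicknames, and records with no shared identifier would need fuzzier matching than this sketch attempts.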
The way the business uses Big Data is also different from the way it uses other data. Traditional data has well-defined metadata, so the data is understood and "clean" before it feeds into any analytic or reporting programs. The data flows into databases that are supported by database specialists. Big Data projects, however, are often experimental or work through a process of discovery and exploration. The data cannot be understood until it is processed. Data scientists often have private sandboxes where the data is analyzed, and the nature of processing tools like Hadoop means that there are multiple copies of the data.
Data Gathering Goals Make Big Data Hard to Govern
In addition to the conflict between the nature of Big Data and governance, the reasons for collecting Big Data also can conflict with data governance. When seen through the eyes of Big Data, more data is always better. Deleting historic data means losing analytic insight. From a corporate compliance perspective, however, it's often better to keep data only as long as there's a legal requirement to do so.
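The compliance side of that trade-off can be expressed as a simple retention check. The sketch below is illustrative only: the category names and retention periods in `RETENTION_DAYS` are made-up placeholders (real values depend on the laws and contracts that apply to your business), and `is_expired` is a hypothetical helper.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days; real values come from legal review.
RETENTION_DAYS = {
    "clickstream": 90,
    "transactions": 365 * 7,
}


def is_expired(category, collected_on, today):
    """Return True when data has been kept longer than its retention period."""
    limit = RETENTION_DAYS.get(category)
    if limit is None:
        # Categories without a policy are never auto-expired; they need review.
        return False
    return today - collected_on > timedelta(days=limit)


today = date(2024, 6, 1)
print(is_expired("clickstream", date(2024, 1, 1), today))
print(is_expired("transactions", date(2024, 1, 1), today))
```

Treating unknown categories as "not expired" is a deliberate choice here: it avoids deleting data nobody has classified yet, at the cost of leaving it for a person to review.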
In addition, Big Data inevitably contains personally identifiable information (PII), some of it buried within unstructured files. The Big Data goal of remembering everything is in direct conflict with the legally mandated "right to be forgotten" that exists in some locales. With the upcoming General Data Protection Regulation (GDPR) in the European Union, you can only use individuals' data for purposes they agree to; that requirement is hard to satisfy when you cannot reliably identify their information in Big Data feeds.
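As a rough illustration of what finding PII in unstructured text involves, the Python sketch below scans free text for two well-formed patterns: email addresses and US Social Security numbers. The patterns, the labels, and the helper names `find_pii` and `redact` are assumptions made for this example. Simple regular expressions like these miss names, addresses, and anything formatted unusually, which is part of why the problem is hard.

```python
import re

# Illustrative patterns only; they match well-formed values and nothing more.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def redact(text):
    """Replace every pattern hit with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


sample = "Contact jane@example.com, SSN 123-45-6789."
print(find_pii(sample))
print(redact(sample))
```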
Implementing Big Data Governance
Policies and processes are an important part of your Big Data governance strategy, but that's not where you should begin. Looking at data governance tools isn't where you should begin, either.
The best way to start achieving data governance control over your Big Data is by talking with the people who collect and use the information: it's better if tools serve as an enabler rather than simply as an enforcer. Once you educate your Big Data team, they can begin to understand the importance of compliance controls and even see the benefit of those controls in their own work, since better-quality data means better-quality analytics.
Then you can look at tools that help you achieve the control you need, like the products from Veritas that provide data protection and help you find PII buried in unstructured data forms. With the right tools and buy-in from the Big Data team, data governance can increase the value of your data gathering and analytics projects.
Is your Big Data out of control? Contact dcVAST to learn more about data governance strategies and tools for large-scale datasets.