Data Quality: A necessity of the immediate future

arjun dhar
Mar 31, 2023 · 5 min read

Can we always trust our data?

If Data is the new Oil, then we need an Oil refinery!

The smaller the data, the easier it is for a human to validate and understand it. As long as data was limited to spreadsheets or single databases, domain experts and analysts could filter it manually and even call out mistakes made by representations and visualizations built over the underlying data. Until now, data augmented the understanding of domain experts via reports and visualization tools like Tableau and Power BI.

As the scale of data has increased manyfold and we progress further into the bold new world of AI & Machine Learning, validating it and mining insights and relationships have become harder.

The scale and complexity of data make it very hard to verify whether the data being fed to these AI models can be trusted.

Information is only valuable if it is of high quality.

Every Data Scientist worth their salt is aware of EDA (Exploratory Data Analysis). In this phase, the data is statistically analyzed for its main characteristics: outlier detection, categorical vs. numeric data, and so on. EDA is a crucial but painful, time-consuming process.
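To make the idea concrete, here is a minimal sketch of a first EDA pass in pandas, assuming a small hypothetical DataFrame (the column names and values below are made up purely for illustration):

```python
import pandas as pd

# Hypothetical sample data; in practice df would come from your own source.
df = pd.DataFrame({
    "amount": [120.5, 98.0, 101.3, 5000.0, 99.9],    # numeric column
    "currency": ["USD", "USD", "EUR", "USD", None],  # categorical column
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Main characteristics: summary stats, cardinality, missing values.
print(df[numeric_cols].describe())
print(df[categorical_cols].nunique())
print(df.isna().sum())

# Simple outlier detection using the 1.5 * IQR rule.
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df.loc[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr), col]
    print(col, "outliers:", outliers.tolist())
```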

EDA is mostly a manual process that requires domain experts to understand their own data: to determine which models are applicable and what metrics to use to measure a model's effectiveness, based on the problem the AI is trying to solve. For model selection later, you also have to consider bias-variance trade-offs and complexity vs. interpretability, among many other aspects that come from understanding your data first.

An important assumption of EDA is clean data. But how can you be sure the data is clean, especially at the scale of gigabytes or petabytes, with the data changing every day?

We rely more and more on our data, and its scale is ever increasing. Furthermore, while in the past data was used for inference only …

Data is now being used both to train the AI models and for inference. This implies that if your data is unreliable, you are pretty much !@@!

At this point, it should be clear to you that …

relying on your raw business data to train your models is perhaps more dangerous than not using it at all.

It requires a careful blend of domain, AI/ML, software, statistics and DevOps experts to get production-quality models providing trustworthy results (which may or may not be explainable). Furthermore, there is no guarantee of how these models will scale in the future.

Can I use tech like ChatGPT to solve these problems?

No!

Behind the hype of what technologies like ChatGPT can do, it's better to understand their actual use cases. You need your analysts, domain experts and data scientists to understand, clean and use the data. What you need to do is provide them with tools that improve their productivity and ensure the human factor does not become a bottleneck on the road to automation and digital transformation. Furthermore, these magic technologies come with no warranty or accountability, nor will they explain why they decided to say something. Which leads us to the next topic.

Who is accountable for bad decisions made by machines?

Even the best models make mistakes!

As explained in the previous section, it requires a team of humans to make this machine succeed. In the case of complex models, it is literally a black box. So the question is: who is accountable for the decisions made by the machine?

On a related topic, Professor Vasant Dhar talks about When Should We Trust Machines?

How to ensure you can trust your data

Using an enterprise-scale tool like the Data Quality (DQ) tool, one can measure the quality of data and even be notified and take action on poor data in business real time. The quality is measured along multiple dimensions.

I’ve had the pleasure of architecting this very tool with my team @ EACIIT and validating it with large enterprise clients. The tool is domain agnostic and assumes no contextual understanding of a domain, but it is worth mentioning that it has been applied to banking and other domains such as Oil & Gas, with more to come.

The DQ Tool: How does it work?

At its heart, the tool is able to integrate with the most popular SQL and NoSQL databases, and new DB adapters can be added in the future.
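The adapter API itself isn't described here, but conceptually an adapter is just a thin, uniform wrapper over each database driver that the rest of the tool talks to. A rough sketch of that idea (the class and method names are my own illustrative assumptions, not the tool's actual interface):

```python
from abc import ABC, abstractmethod
from typing import Iterable

class SourceAdapter(ABC):
    """Uniform interface the profiler and rule engine work against."""

    @abstractmethod
    def list_datasets(self) -> Iterable[str]:
        """Return the table/collection names available in this source."""

    @abstractmethod
    def sample(self, dataset: str, n: int = 1000) -> list[dict]:
        """Return up to n records as plain dicts for profiling."""

class PostgresAdapter(SourceAdapter):
    def __init__(self, conn):
        self.conn = conn  # e.g. a psycopg connection

    def list_datasets(self):
        cur = self.conn.cursor()
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public'"
        )
        return [row[0] for row in cur.fetchall()]

    def sample(self, dataset, n=1000):
        cur = self.conn.cursor()
        cur.execute(f'SELECT * FROM "{dataset}" LIMIT {int(n)}')
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]
```

A NoSQL source (say, a MongoDB collection) would simply provide another implementation of the same two methods.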

During the first phase, called Profiling, the tool crawls the multiple data sources and infers the general nature of the datasets (tables or collections), including their structure.
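In spirit, profiling boils down to sampling each dataset and inferring per-column types and basic statistics. A simplified sketch of what such inference could look like (my own illustration, not the tool's internals):

```python
from collections import Counter

def infer_column_profile(values):
    """Infer a rough type, null ratio and cardinality for one sampled column."""
    non_null = [v for v in values if v is not None]
    null_ratio = 1 - len(non_null) / len(values) if values else 0.0

    # The most common Python type among sampled values is taken as the inferred type.
    type_counts = Counter(type(v).__name__ for v in non_null)
    inferred_type = type_counts.most_common(1)[0][0] if non_null else "unknown"

    return {
        "inferred_type": inferred_type,
        "null_ratio": round(null_ratio, 3),
        "distinct_values": len(set(non_null)),
    }

# Example: one sampled column from a crawled dataset.
print(infer_column_profile([10, 12, None, 15, 15]))
# {'inferred_type': 'int', 'null_ratio': 0.2, 'distinct_values': 3}
```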

In the next stage, predefined and custom rules process the data at scale and determine its quality along each dimension. These scores are aggregated and rolled up at the dataset, project, domain and data-source level. One can also see how quality varies over time using a time series of the data-quality score at any level.
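One way to picture the roll-up: each rule yields a score per dimension for a dataset, and those scores are averaged upward to the project or data-source level. A toy sketch, with assumed dimension names and numbers:

```python
from statistics import mean

# Per-dataset scores by quality dimension (illustrative numbers only).
dataset_scores = {
    "customers":    {"completeness": 0.97, "validity": 0.91, "consistency": 0.88},
    "transactions": {"completeness": 0.99, "validity": 0.85, "consistency": 0.93},
}

# Roll up: a dataset score is the mean of its dimension scores,
# and a project score is the mean of its dataset scores.
def dataset_score(dims):
    return mean(dims.values())

project_score = mean(dataset_score(d) for d in dataset_scores.values())

for name, dims in dataset_scores.items():
    print(f"{name}: {dataset_score(dims):.3f}")
print(f"project: {project_score:.3f}")
```

Storing one such roll-up per run is what makes the time-series view of quality possible.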

Rules can operate at a simple level, like checking for consistency within a data type, or referential integrity across datasets and even databases. One can write custom rules to achieve even higher layers of abstraction and quality checking. The possibilities are endless, including plugging in well-tested AI itself to measure the quality.
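For a concrete feel of the two simple rule families mentioned above, here is a hedged pandas sketch of a data-type consistency check and a referential-integrity check (the table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical datasets; in practice these would arrive through the DB adapters.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["120.5", "abc", "99.9"],   # should be numeric; one bad value
    "customer_id": [10, 11, 99],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Rule 1: data-type consistency — every amount must parse as a number.
parsed = pd.to_numeric(orders["amount"], errors="coerce")
type_violations = orders[parsed.isna()]

# Rule 2: referential integrity — every customer_id in orders must exist in customers.
ri_violations = orders[~orders["customer_id"].isin(customers["customer_id"])]

# A simple per-rule score: the fraction of rows that pass.
print("type-consistency score:", 1 - len(type_violations) / len(orders))
print("referential-integrity score:", 1 - len(ri_violations) / len(orders))
```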

… all this without leaving the comfort of the GUI, and mostly with little or no database skills. Albeit, knowing how to write queries or custom scripts comes in useful for more complex custom-rule scenarios.

The tool also takes care of the atomicity of rule firing, to ensure the data measurements are consistent over a unit of time, along with many detailed features that have come from our experience of production-quality enterprise needs.
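The post does not spell out how this atomicity is enforced; one way to picture the concept is evaluating every rule against the same frozen snapshot, so all resulting scores share a single measurement timestamp (purely an illustration, not the tool's implementation):

```python
from datetime import datetime, timezone
import pandas as pd

def run_rules_atomically(rules, dataset):
    """Evaluate all rules against one frozen snapshot so scores share a timestamp."""
    measured_at = datetime.now(timezone.utc)
    snapshot = dataset.copy()  # the data every rule in this run will see
    scores = {rule.__name__: rule(snapshot) for rule in rules}
    return {"measured_at": measured_at.isoformat(), "scores": scores}

# Two trivial example rules (illustrative only).
def completeness(df):
    return float(1 - df.isna().mean().mean())

def non_negative_amount(df):
    return float((df["amount"] >= 0).mean())

df = pd.DataFrame({"amount": [10.0, None, 25.0]})
print(run_rules_atomically([completeness, non_negative_amount], df))
```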

…But it does not stop there. Once the quality is determined, one can use the built-in issue-management product, or integrate with your own business issue tracker, to remediate issues and take action to fix bad data.

Conclusion

Having a tool like the DQ tool is a must for any enterprise considering digital transformation, be it to visualize your complex data or to step into the bold new world of business AI with more confidence.

