Data validation on production for unsupervised classification tasks using a golden dataset
Abstract
Have you ever worked on an unsupervised task and wondered, “How do I validate my algorithm at scale?”
In unsupervised learning, unlike supervised learning, the validation set has to be created and checked manually, i.e. we have to go through the classifications ourselves and measure the classification accuracy or some other score. The obvious problem with manual classification is the time, effort, and work it requires, but that is the easy part of the problem.
Let’s assume that we developed an algorithm and tested it thoroughly by manually reviewing all the classifications. What about future changes to that algorithm? After every change we would have to check the classifications manually all over again. Meanwhile, the classified data not only changes over time, it can also grow to a huge scale as our product evolves and our customer base grows, making the manual classification problem far more difficult.
Have you started to worry about your production algorithms already? Well, you shouldn’t!
After reading this, you will be familiar with our proposed method for validating your algorithm’s score easily, adaptively, and effectively against any change in the data or the model.
So let's start detailing it from the beginning.
Why is it needed?
Continuous modifications to an algorithm always happen. For example:
- Runtime optimizations
- Model improvements
- Bug fixes
- Version upgrades
How do we deal with those modifications? We usually use QA tests to make sure the system keeps working. The best among us might even develop regression tests to verify, for several fixed scenarios, that the classifications do not change.
What about data integrity?
But what about the real classifications on prod? Who verifies how they change? We need to make sure that we won’t cause any disasters on prod when deploying new changes to the algorithm.
For that, we have two possible solutions:
- Naive solution - go through all the classifications on prod (which is of course not feasible)
- Practical solution - use a sample of each customer’s data on prod, sized using the margin of error equation.
Margin of error
To demonstrate, we are going to take a fixed sample from each customer’s data that represents the real distribution of that data with minimal deviation. We will size it using the margin of error equation, which you may recognize from election surveys, where sample sizes are often derived from it.
So, how does it work?
We start from the equation used for calculating the margin of error and invert it to extract the desired sample size.
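In its standard form for a sample proportion, with $\hat{p}$ the observed proportion and $n$ the sample size, this equation is:

$$MOE = z \cdot \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$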
We would like a maximum margin of error of 5%, and we use the constant value z = 1.96 for a 95% confidence level (this value can be changed for a different confidence level).
Solving for the required sample size gives the following equation:
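In its standard form, with $e$ the desired margin of error and $N$ the full data size, this is:

$$n = \frac{\dfrac{z^2\,\hat{p}\,(1-\hat{p})}{e^2}}{1 + \dfrac{z^2\,\hat{p}\,(1-\hat{p})}{e^2\,N}}$$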
This equation expands the one above: when we know the full data size N, the correction in the denominator makes the result more precise. Otherwise, we are left with only the numerator, which is also fine when the full data size is unknown.
The following is a minimal Python sketch of this equation, assuming the worst-case proportion p = 0.5 and a 95% confidence level by default (the function name and defaults are illustrative):
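```python
import math


def required_sample_size(population_size=None, margin_of_error=0.05,
                         z_score=1.96, proportion=0.5):
    """Return the sample size needed for the given margin of error.

    proportion=0.5 is the worst case and maximizes the required size;
    if population_size is known, the finite-population correction is applied.
    """
    numerator = (z_score ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    if population_size is None:
        # Without the full data size we keep only the numerator of the equation
        return math.ceil(numerator)
    return math.ceil(numerator / (1 + numerator / population_size))


# Example: a customer with 100,000 classified records needs ~383 samples
print(required_sample_size(population_size=100_000))
print(required_sample_size())  # ~385 when the full data size is unknown
```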
We can now freeze those samples, which we call a “golden dataset,” and use them as a supervised dataset for future modifications, serving as a data integrity validator on real data from prod.
We should mention that because prod data can change over time, we encourage you to refresh this golden dataset from time to time.
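As a rough illustration, building such a golden dataset might look like the sketch below, reusing the required_sample_size helper from above (the column names and file path are assumptions, not the actual pipeline):

```python
import pandas as pd


def freeze_golden_dataset(prod_df, out_path="golden_dataset.csv", random_state=42):
    """Draw a per-customer sample of prod classifications and freeze it to disk.

    prod_df is assumed to hold a customer_id column, feature columns, and the
    algorithm's label column; the frozen labels are then reviewed manually.
    Reuses the required_sample_size() helper sketched above.
    """
    samples = []
    for _, customer_df in prod_df.groupby("customer_id"):
        n = required_sample_size(population_size=len(customer_df))
        n = min(n, len(customer_df))  # small customers: keep everything they have
        samples.append(customer_df.sample(n=n, random_state=random_state))
    golden = pd.concat(samples)
    golden.to_csv(out_path, index=False)
    return golden
```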
The flow of work for end-to-end data integrity:
- Manual classification to create a golden dataset
- Maintaining a constant baseline of prod classifications
- Developing a suite of score comparison tests (see the sketch after this list)
- Integrating the quality check into the algorithm’s CI process
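A minimal sketch of such a score comparison check, assuming the golden dataset frozen above and a hypothetical accuracy baseline (the names, path, and threshold are illustrative):

```python
import pandas as pd

# Illustrative values; the real path and baseline come from your own pipeline
GOLDEN_DATASET_PATH = "golden_dataset.csv"
MIN_ACCURACY = 0.95


def check_against_golden_dataset(classify, path=GOLDEN_DATASET_PATH,
                                 min_accuracy=MIN_ACCURACY):
    """Compare the current algorithm's output with the frozen, verified labels.

    classify is the production classification function under test; the check
    raises AssertionError when accuracy drops below the agreed baseline, so it
    can be wrapped directly in a pytest test and run on every CI build.
    """
    golden = pd.read_csv(path)
    features = golden.drop(columns=["customer_id", "label"])
    predictions = pd.Series(
        [classify(row) for _, row in features.iterrows()], index=golden.index
    )
    accuracy = (predictions == golden["label"]).mean()
    assert accuracy >= min_accuracy, (
        f"Accuracy {accuracy:.3f} dropped below the {min_accuracy} baseline"
    )
    return accuracy
```

Wiring a check like this into the CI process means every change to the algorithm is validated against real, manually verified prod samples before it ships.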
So, how will it all work together? You can see that in the following GIF:
We may now push any change to our algorithm code, and remain protected, thanks to our data integrity shield!
For further questions about data integrity checks, or data science in general, don’t hesitate to reach out to me at [email protected].