Formstack Streamline’s Data Fabric allows you to connect workflows to your own existing data sources, ranging from SQL databases to Salesforce to Electronic Health Record systems like Epic. Since many of our customers deal with highly sensitive data, we’ve designed our platform with security in mind from the ground up.
Our security features go above and beyond the standard data protection measures that are table stakes for software vendors. Since our platform allows users to design their own automated processes, we strive to help users design workflows that handle sensitive data appropriately. To make potential security concerns visible, Formstack Streamline’s Data Fabric detects when connected data sources might contain important categories of sensitive information, such as Protected Health Information (PHI), Personally Identifiable Information (PII), and data that potentially raises sensitivity concerns under the Health Insurance Portability and Accountability Act (HIPAA). It then labels datasets accordingly:

To find these potential sensitivity issues, Formstack Streamline’s Data Catalog automatically scans data sources to determine what sorts of information they contain. For each field, it applies a label—visible in the “Data Classes” column—indicating what type of information it detected:

By classifying the contents of each field in a data source, the Data Catalog provides visibility into where sensitive or otherwise important information resides in your systems. Our platform is built to treat our customers’ data with care: we never store persistent copies of your data or use it for training our classification models. Formstack Streamline allows you to build workflows that read and write from your existing systems, thus helping automate processes while maintaining a high standard of data security.
Formstack’s technology team has years of experience doing data classification in the cybersecurity realm, and we’ve learned firsthand what works and what doesn’t. Classification has also advanced greatly in the past few years with developments in AI, and we’ve developed a solution that makes full use of modern machine learning techniques.
Data Classification Methods
The task of data classification is conceptually simple: determine what sort of data is stored in a given place, from a list of general types such as “First Name” and “Dosage Information.” For database tables, we classify data at the column level, since relational databases generally store a single type of information in each column. It works like this:


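In code, per-column classification can be sketched roughly as follows. The table, data classes, and matching rules here are purely illustrative toy examples, not Formstack Streamline’s actual implementation:

```python
import re

def classify_column(values):
    """Guess a data class for a column from a sample of its values (toy rules)."""
    if all(re.fullmatch(r"\d{3}-\d{2}-\d{4}", v) for v in values):
        return "US Social Security Number"
    if all(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) for v in values):
        return "Email Address"
    return "Unknown"

# A table represented as a mapping from column name to values.
table = {
    "col1": ["123-45-6789", "987-65-4321"],
    "col2": ["ana@example.com", "li@example.org"],
}

# One label per column, since each column holds one type of information.
labels = {name: classify_column(vals) for name, vals in table.items()}
```

The key point is the granularity: the classifier produces one label per column, not per cell.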
The term of art for this type of classification is column type annotation. Data classification works differently in other types of data like plain text, where there is less of a known structure to work with.
Traditional data classification systems work by detecting certain patterns that are defined ahead of time. For example, we know that US Social Security Numbers generally have the form DDD-DD-DDDD, where D is a digit. Regular expressions provide a fast and reliable method of detecting this sort of pattern, and they can be supplemented with additional code to test more complex conditions, such as the check digits used to validate credit card numbers.
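The two techniques just mentioned can be shown in a few lines of Python: a regular expression for the hyphenated SSN format, and the Luhn algorithm, the standard check-digit test for credit card numbers (the surrounding function names are illustrative):

```python
import re

# US SSNs in the common hyphenated form DDD-DD-DDDD.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")

def luhn_valid(number: str) -> bool:
    """Validate a card number with the Luhn check-digit algorithm."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9   # equivalent to summing the two digits of d
        checksum += d
    return checksum % 10 == 0
```

A rule like this runs in microseconds per value, which is why pattern matching remains the workhorse for well-defined formats.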
At first glance, this sort of pattern matching may seem to be all we need. Can’t we just check whether the values have the format we’re looking for, and leave it at that?
As it turns out, pattern-matching methods will only get you partway there. It’s true that some data types are virtually unmistakable; for example, one will rarely encounter anything that looks like an email address but isn’t.* But many of the data types we’re interested in have less distinctive formats that can be interpreted multiple ways.
Consider the string “289012716,” for example. This could be an SSN written without the hyphens—a fairly common format in databases—or it could be a driver’s license number from Greece, a customer account number, a quantity of currency, or who knows what else. In such cases, we need to look at the context, such as the column name and the surrounding columns, to determine what the digits mean.
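A minimal sketch of this kind of context-sensitive disambiguation might use the column name as a tiebreaker when the values alone are ambiguous (the heuristics below are toy examples; a real system weighs far more context):

```python
import re

NINE_DIGITS = re.compile(r"\d{9}")

def classify_nine_digit_column(column_name, values):
    """Disambiguate bare nine-digit values using the column name as context."""
    if not all(NINE_DIGITS.fullmatch(v) for v in values):
        return None
    name = column_name.lower()
    if "ssn" in name or "social" in name:
        return "US Social Security Number"
    if "account" in name or "acct" in name:
        return "Account Number"
    return "Ambiguous nine-digit identifier"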
Other data types require more advanced approaches. Consider the problem of detecting people’s names. Here we can’t rely wholly on the metadata: a field called “name” might contain the names of people or some other kind of name. Moreover, we would like to detect people’s names even when the column labels are vague, as in the above table, where they are just called “col1” and “col2.” How, then, can you tell if something is a person’s name? There are few hard and fast rules about what people can be named, so this is a difficult problem. Doing it well requires an international perspective, since naming practices differ geographically and culturally.
At this point, you might be wondering: why not just ask ChatGPT or Claude to do it? This approach is a start, but it has some drawbacks. First, using Large Language Models (LLMs) typically involves sending the data to third-party services, which can raise data privacy concerns. Second, LLMs can be very slow when dealing with large amounts of data. Finally, general-purpose LLMs are uneven in classification accuracy, and their errors are difficult to correct. We’ve found that we can get better results, and get them an order of magnitude faster, by training small-scale language models specifically for data classification. The next section explains how our method works.
*One exception I’ve encountered in the past is file names containing ‘@’ symbols, which are especially tricky with extensions like .COM—but this case is exceedingly rare.
Combining Language Models with Pattern Matching
Formstack Streamline’s classifier is designed to combine the advantages of traditional pattern matching with those of modern machine learning techniques. It has two main components.
The first is an expert system, consisting of human-coded rules for matching things like URLs and bank account numbers, based largely on the official documentation of these formats. The expert system gathers statistics about how many of the values in a random sample of data from the column match the rules. This statistical approach enables it to account for both false positives—the fact that a value will sometimes look like a credit card number or SSN by coincidence—and the possibility that some invalid values may appear in a database by mistake.
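The statistical step can be sketched as follows: rather than requiring every value to match, the system samples the column and labels it only when a strong majority of sampled values fit the rule (the pattern, threshold, and helper name are illustrative assumptions):

```python
import random
import re

# Toy rule: 13-19 bare digits, the length range of most card numbers.
CREDIT_CARD_RE = re.compile(r"\d{13,19}")

def match_rate(values, pattern, sample_size=100, seed=0):
    """Fraction of a random sample of a column that matches a rule.

    Tolerates a few coincidental matches or invalid entries instead of
    demanding an all-or-nothing answer.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    hits = sum(1 for v in sample if pattern.fullmatch(v))
    return hits / len(sample)

# A column that is mostly card numbers, with a few junk entries.
values = ["4111111111111111"] * 95 + ["n/a"] * 5
rate = match_rate(values, CREDIT_CARD_RE)
is_credit_card = rate >= 0.9   # threshold chosen for illustration
```

Thresholding on a sample rate, instead of requiring 100% matches, is what makes the expert system robust to both coincidental look-alikes and the occasional invalid value that slips into a database.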
The second component is a transformer language model. The language model helps in interpreting more open-ended formats that require an understanding of language, and it can also analyze column headings for clues about what they might contain. We use a pretrained model (currently RoBERTa) that is finetuned to work with data in a table format. The model employs the same transformer architecture as LLMs, but it is much smaller and, as our experiments have shown, as much as twenty times faster than large-scale models like Claude 3.5 Sonnet. The pretraining enables it to learn general linguistic capabilities, such as parsing grammar and understanding synonyms; the finetuning prepares it to deal with the particular formats and data types we expect to encounter.
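Before a transformer can analyze a column, the header and a sample of values must be flattened into a single text sequence. The serialization format below is an illustrative assumption, not the actual input scheme of Formstack Streamline’s model:

```python
def serialize_column(header, values, max_values=8):
    """Flatten a column header and sampled values into one text sequence
    suitable as input to a transformer-based column classifier."""
    shown = values[:max_values]   # cap the sample to bound sequence length
    return f"column: {header} | values: " + " ; ".join(shown)

text = serialize_column("patient_name", ["Ana Souza", "Wei Chen", "Fatima Khan"])
```

A sequence like this lets the model use both signals at once: linguistic cues in the header and the shape of the values themselves.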
We train our model on a combination of publicly available datasets and synthetic data. First, we define methods of generating examples for each of our data classes, along with example column names. We then assemble these into tables, using some general heuristics about what sorts of data tend to appear together. To ensure that our model can generalize beyond the training data, we validate it against entirely different column names that it has never seen before.
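The synthetic-data step might look like the following sketch: a generator and a set of plausible headers per data class, combined into labeled training columns. The classes, headers, and generators are hypothetical examples, not the real training pipeline:

```python
import random

def gen_ssn(rng):
    """Generate a plausible hyphenated SSN-like string."""
    return f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"

def gen_email(rng):
    """Generate a synthetic email address."""
    user = "".join(rng.choices("abcdefgh", k=6))
    return f"{user}@example.com"

# Each data class pairs candidate column headers with a value generator.
GENERATORS = {
    "US Social Security Number": (["ssn", "social_sec_no", "taxpayer_id"], gen_ssn),
    "Email Address": (["email", "contact", "e_mail_addr"], gen_email),
}

def make_training_column(data_class, n_rows=5, seed=0):
    """Build one labeled synthetic column: (header, values, label)."""
    rng = random.Random(seed)
    headers, gen = GENERATORS[data_class]
    header = rng.choice(headers)
    values = [gen(rng) for _ in range(n_rows)]
    return header, values, data_class
```

Holding out entirely different column names for validation, as described above, is what checks that the model learned the concept rather than memorizing the headers in the training set.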
Since one of our aims is to detect sensitive information, our classifier must work with data that has high security requirements. The classifier is designed such that the contents of customer data sources are never stored, sent to a third-party service, or used for training our models.
Conclusion
Detecting data that falls under complex regulatory frameworks like HIPAA is a hard problem that requires a multifaceted approach. While simple pattern-matching methods can help recognize clearly defined formats, they fall short when dealing with more open-ended data types, such as names or doctors’ notes. Our approach combines a rule-based expert system with a language model, giving us the best of both worlds: precise matching of known formats and flexibility in the handling of fuzzier data types.
By highlighting potential security concerns before they become problems, Formstack Streamline’s data classification system empowers users to handle sensitive information responsibly within their automated workflows. Along with the other features of our data fabric, it provides a robust and secure means of integrating our workflow builder with your existing data management systems.
To learn more about Formstack Streamline and how your organization can benefit from our data classification model, click here. To schedule time to speak with our team and request a demo, contact us here.