Data can be the lifeblood of a business or its most damaging liability. Important documents are often stored in ramshackle file systems or, worse, copied to the cloud with little to no process control. In fact, about 40 percent of the corporate data uploaded to the cloud goes through ad hoc file-sharing applications, more than any other category. As a result, timely information gets lost amid haphazard processes, business decisions are made on instinct, and the competition wins. Later, during an audit, it is not just the shareholders who are upset: industry regulators may fine you for compliance failures.

 

But it doesn’t have to be this bleak. Clearly, with sophisticated data tools, intelligent storage systems and artificial intelligence, there is no reason to mismanage data.

 

If the Cambridge Analytica scandal was not enough of a wake-up call, data privacy regulations — such as GDPR and CCPA — ensure that companies manage their information securely or face serious sanctions and fines.

 

In this series of posts, I will demonstrate how to map and secure sensitive data with the help of artificial intelligence. I will explain how to specify several types of sensitive information and provide concrete methods for detecting where they reside in organizational data.

Mapping Private Data

Three primary questions consistently rise to the top when it comes to appropriately mapping private data:

  1. Which types of information could be relevant?
  2. In what form or shape do they occur?
  3. How can their occurrences be detected accurately?

File-Level Data Management

Can you recall what personal information exists on your own PC? A brief glimpse at the ‘My Documents’ or ‘Downloads’ folders may give a hint of how quickly random documents and forms containing personal information can pile up. Legal agreements, medical examinations and personal letters are all lying there comfortably on your hard drive, untamed.

 

Now, imagine what a challenge it is for enterprises to manage documents across the entire organization, across all data silos: end-points, servers and cloud storage. Organizations must not only take proper care of personal information, but also secure sensitive business-related information.

 

The abstract term ‘sensitive data’ must therefore be broken down into a more concrete definition:

 

  1. Reason for sensitivity: the reasons can be seen as independent layers, namely personal data, data that is confidential to the organization or a department, and generally sensitive data (e.g. legal agreements).
  2. Common form: data from any of these layers can appear in content of varying size, whether a single word, a sentence, a paragraph or an entire document.

Once a clear definition of sensitive data is established, the process of detecting files and mapping sensitive information becomes more intuitive. For example, a given file in a cloud repository may contain a single word (e.g. SSN, for Social Security Number) that signals its potential sensitivity: the disclosure of personal data. Another file on a salesperson’s PC might contain a paragraph that reveals a company’s confidential operations process.
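
To make the two-part definition above more tangible, here is a minimal sketch of how a single detection could be recorded. The class names (`Reason`, `Form`, `Finding`), the helper `sensitivity_map` and the file paths are all hypothetical, chosen only for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    """Why a piece of content is sensitive (the independent 'layers')."""
    PERSONAL = "personal data"
    CONFIDENTIAL = "confidential by organization/department"
    GENERAL = "generally sensitive"


class Form(Enum):
    """How large the sensitive content is."""
    WORD = "single word"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    DOCUMENT = "entire document"


@dataclass(frozen=True)
class Finding:
    path: str      # hypothetical file location
    reason: Reason
    form: Form
    excerpt: str   # the text that triggered the finding


def sensitivity_map(findings):
    """Group findings by file, listing each file's reasons for sensitivity."""
    result = {}
    for finding in findings:
        result.setdefault(finding.path, set()).add(finding.reason)
    return result


# The two examples from the text, with made-up paths.
findings = [
    Finding("cloud/hr/form.pdf", Reason.PERSONAL, Form.WORD, "SSN"),
    Finding("pc/sales/notes.docx", Reason.CONFIDENTIAL, Form.PARAGRAPH,
            "confidential operations process"),
]
file_map = sensitivity_map(findings)
```

Keeping the reason and the form as separate fields reflects the idea that the layers are independent: one file can carry several reasons at once, each appearing at a different granularity.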

Methods For Detection

Manual labeling – time-consuming and error-prone.

 

Technically speaking, the most precise way to find sensitive information would probably be to manually label relevant occurrences. In other words, every file created will have to be passed through the eyes of a domain expert: a professional capable of determining whether the document may be sensitive, with respect to its context and domain. Ideally, this person would mark relevant words or sections, and specify precise reasons.

 

In practice, this approach is unrealistic, for it requires:

  • Massive dedication of time and attention to label documents
  • High familiarity with data protection regulations and enterprise confidentiality

Old, received or automatically generated documents also require examination by a relevant domain expert, and the whole process is highly error-prone.

 

As a point of reference, think of organizations that deliberately label every piece of data by hand upon creation. The examples that come to mind are usually national security agencies, military units and perhaps embassies. For them, the importance of data protection is clear, and the risk posed by information leakage is severe enough to justify the time and effort of manual labeling.

AI Labeling – Rule Based vs. Machine Learning

A more scalable and general approach relies on automatic methods that require as little human involvement as possible in detecting relevant information. Some methods are based on exact matches of terms or series of characters. Other methods rely on statistical models based on occurrences and adjusted frequencies of terms or entities within a document or paragraph. More advanced methods even attempt to use the context of every word in a sentence to predict whether it is relevant, using deep learning and natural language processing.
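
As one possible reading of "adjusted frequencies of terms", the sketch below computes TF-IDF weights in plain Python. TF-IDF is my choice of illustration, not necessarily the model meant here, and the function names and sample documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample documents.
docs = [
    "the patient diagnosis and the medical examination results",
    "the agreement between the parties and the licensee",
    "the quarterly sales report and the revenue forecast",
]
weights = tf_idf(docs)
```

Terms that occur in every document, such as "the", receive a weight of zero, while terms specific to one document, such as "patient", receive a positive weight. Weights like these can then serve as input features for a document classifier.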

 

Different methods can solve different tasks:

  • Rule-based methods can detect and match occurrences of various IDs, credit card numbers and email addresses.
  • Statistical machine learning methods are useful for categorizing documents according to their content.
  • Context-aware, deep learning methods can aid in extracting personal names from sentences or detecting context-dependent phrases.
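
The rule-based case in the list above can be sketched with a few regular expressions plus a checksum. The patterns below are deliberately simplified assumptions (a US-style SSN format, a loose email pattern, and 13 to 19 digit card numbers validated with the Luhn checksum); real-world rules would need to be far more thorough, and the sample values are well-known test data, not real records.

```python
import re

# Simplified, illustrative patterns (assumptions, not production rules).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Check a card number with the Luhn checksum, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect(text):
    """Return (label, matched text) pairs for each rule that fires."""
    findings = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    findings += [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    findings += [
        ("credit_card", m.group())
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())  # drop digit runs that fail the checksum
    ]
    return findings


sample = "Contact jane.doe@example.com, SSN 078-05-1120, card 4111 1111 1111 1111."
results = detect(sample)
```

The checksum step shows why pattern matching alone is rarely enough: many long digit sequences look like card numbers, and a cheap validation rule removes a good share of false positives.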

I’ll discuss rule-based versus machine learning detection in more detail in Part 2.

Final Thoughts

Managing and securing sensitive data is an ambitious and challenging goal. To comply with data privacy regulations and avoid heavy fines, organizations must be able to map sensitive information in its various forms and shapes. Yet, with the right set of advanced artificial intelligence tools, this goal can be achieved much faster.

 

I hope this post provided some insight into the concepts that lay the foundation for mapping sensitive information. In upcoming posts on the NetApp Cloud Central blog, I’ll dive deeper into practical methods for achieving data protection and compliance.

Adam Bali

Adam is a lead data who joined NetApp following the Cognigo acquisition. Adam specializes in applying machine learning and natural language processing solutions to help organizations map and protect personal and sensitive data. Adam holds an M.Sc. in Mathematics.