ALTR Brief: Snowflake Cybersecurity Investigation

What is Data Tokenization – A Complete Guide

3 models for tokenizing data for a secure journey to the cloud.

What is Data Tokenization? – a Definition

You may be familiar with the idea of encryption to protect sensitive data, but maybe the idea of tokenization is new. What is data tokenization? In the realm of data security, “tokenization” is the practice of replacing a piece of sensitive or regulated data (like PII or a credit card number) with a non-sensitive counterpart, called a token, that has no inherent value. The token maps back to the sensitive data through an external data tokenization system. Data can be tokenized and de-tokenized as often as needed with approved access to the tokenization system.

How Does Tokenization of Data Work?

Original data is mapped to a token using methods that make the token impractical or impossible to restore without access to the data tokenization system. Since there is no relationship between the original data and the token, there is no standard key that can unlock or reverse lists of tokenized data. The only way to undo tokenization of data is via the system that tokenized it. This requires the tokenization system to be secured and validated using the highest security levels for sensitive data protection, secure storage, audit, authentication and authorization. The tokenization system is the only vehicle for providing data processing applications with the authority and interfaces to request tokens or de-tokenize to the original sensitive data.

Replacing original data with tokens in data processing systems and applications like business intelligence tools minimizes exposure of sensitive data in those applications, stores, people and processes, reducing risk of compromise, breach or unauthorized access to sensitive or regulated data. Applications, except for a handful of necessary applications or users authorized to de-tokenize when strictly necessary for a required business purpose, can operate using tokens instead of live data,. Data tokenization systems may be operated within a secure isolated segment of the in-house data center, or as a service from a secure service provider.

What is Data Tokenization Used For?

Tokenization may be used to safeguard sensitive data including bank accountsfinancial statementsmedical recordscriminal recordsdriver’s licensesloan applications, stock tradesvoter registrations, and other types of personally identifiable information (PII). Data tokenization is most often used in credit card processing, and the PCI Council defines tokenization as “a process by which the primary account number (PAN) is replaced with a surrogate value called a token. De-tokenization is the reverse process of redeeming a token for its associated PAN value. The security of an individual token relies predominantly on the infeasibility of determining the original PAN knowing only the surrogate value”.

The choice of tokenization as an alternative to other data security techniques such as encryption, anonymization, or hashing will depend on varying regulatory requirements, interpretation, and acceptance by auditing or assessment entities. We cover the advantages and disadvantages of tokenization versus other data security solutions below.

Benefits of Data Tokenization

When it comes to solving these cloud migration challenges, tokenization of data has all the obfuscation benefits of encryption, hashing, and anonymization, while providing much greater usability. Let’s look at the advantages in more detail. 

  1. No formula or key: Tokenization replaces plain-text data with an unrelated token that has no value if breached. There’s no mathematical formula or key; a token vault holds the real data secure.    
  2. Acts just like real data: Users and applications can treat tokens the same as real data and perform high-level analysis it, without opening up the door to risk of leaks or loss. Anonymized data on the other hand provides only limited analytics capability because you’re working with ranges, while hashed and encrypted data are ineligible for analytics. With the right tokenization solution, you can share tokenized data from the data warehouse with any application, without requiring data to be unencrypted and inadvertently exposing it to users.   
  3. Granular analytics: Retaining the connection to the original data enables you to dig deeper into the data with more granular analytics than anonymization. Anonymized data is limited by the original parameters, such as age ranges or broad locations, which might not provide enough details or flexibility for future purposes. With tokenized data, analysts can create fresh segments of data as needed, down to the original, individual street address, age or health information. 
  4. Analytics plus protection: Tokenization delivers the advantages of analytics with the strong at-rest protection of encryption. For the strongest possible security, look for solutions that limit the amount of tokenized data that can be de-tokenization and also issue notifications and alerts when data is de-tokenized so you can ensure only approved users get the data. 

Tokenization Vs. Encryption

1. Tokens have no mathematical relationship to the original data, which means unlike encrypted data, tokenized data can’t be broken or returned to their original form.

While many of us might think encryption is one of the strongest ways to protect stored data, it has a few weaknesses, including this big one: the encrypted information is simply a version of the original plain text data, scrambled by math. If a hacker gets their hands on a set of encrypted data and the key, they essentially have the source data. That means breaches of sensitive PII, even of encrypted data, require reporting under state data privacy laws. Tokenizing data, on the other hand, replaces the plain text data with a completely unrelated “token” that has no value if breached. Unlike encryption, there is no mathematical formula or “key” to unlocking the data – the real data remains secure in a token vault.

2. Tokens can be made to match the relationships and distinctness of the original data so that meta-analysis can be performed on tokenized data.

When one of the main goals of moving data to the cloud is to make it available for analytics, tokenizing the data delivers a distinct advantage: actions such as counts of new users, lookups of users in specific locations, and joins of data for the same user from multiple systems can be done on the secure, tokenized data. Analysts can gain insight and find high-level trends without requiring access to the plain text sensitive data. Standard encrypted data, on the other hand, must be decrypted to operate on, and once the data is decrypted there’s no guarantee it will be deleted and not be forgotten, unsecured, in the user’s download folder. As companies seek to comply with data privacy regulations, demonstrating to auditors that access to raw PII is as limited as possible is also a huge bonus. Data tokenization allows you to feed tokenized data directly from Snowflake into whatever application needs it, without requiring data to be unencrypted and potentially inadvertently exposed to privileged users.

3. Tokens maintain a connection to the original data, so analysis can be drilled down to the individual as needed.

Anonymized data is a security alternative that removes the personally identifiable information by grouping data into ranges. It can keep sensitive data safe while still allowing for high-level analysis. For example, you may group customers by age range or general location, removing the specific birth date or address. Analysts can derive some insights from this, but if they wish to change the cut or focus in, for example looking at users aged 20 to 25 versus 20 to 30, there’s no ability to do so. Anonymized data is limited by the original parameters which might not provide enough granularity or flexibility. And once the data has been analyzed, if a user wants to send a marketing offer to the group of customers, they can’t, because there’s no relationship to the original, individual PII.

Three Risk-based Models for Tokenizing Data in the Cloud

Depending on the sensitivity level of your data or comfort with risk there are several spots at which you could tokenize data on its journey to the cloud. We see three main models – the best choice for your company will depend on the risks you’re facing:

Level 1: Tokenize data before it goes into a cloud data warehouse

  1. The first issue might be that you’re consolidating sensitive data from multiple databases. While having that data in one place makes it easier for authorized users, it might also make it easier for unauthorized users! Moving from multiple source databases or applications with their own siloed and segmented security and log in requirements to one central repository gives bad actors, hackers or disgruntled employees just one location to sneak into to have access to all your sensitive data. It creates a much bigger target and bigger risk.  
  2. And this leads to the second issue: as more and more data is stored in high-profile cloud data warehouses, they have become a bigger focus for bad actors and nation states. Why should they go after Salesforce or Workday or other discrete applications separately when all the same data can be found in one giant hoard?  
  3. The third concern might be about privileged access from Snowflake employees or your own Snowflake admins who could, but really shouldn’t, have access to the sensitive data in your cloud data warehouse.

If your company is facing any of these situations, it makes sense for you to choose “Level 1 Tokenization”: tokenize data just before it goes into the cloud. By tokenizing data that is stored in the cloud data warehouse, you ensure that only the people you authorize have access to the original, sensitive data.

Level 2: Tokenize data before moving it through the ETL process

As you’re mapping out your path to the cloud, you may want to make sure data is protected as soon as it leaves the secure walls of your datacenter. This is especially challenging for CISOs who’ve spent years hardening the security of perimeter only to have control wrested away as sensitive data is moved to cloud data warehouses they don’t control. If you’re working with an outside ETL (extract, transform, load) provider to help you prepare, combine, and move your data, that will be the first step outside your perimeter you want to safeguard. Even though you hired them, without years of built-up trust, you may not want them to have access to sensitive data. Or it may even be out of your hands—you may have agreements or contracts with your own customers that specify you can’t let any vendor or other entity have access without written consent. 

In this case, “Level 2 Tokenization” is probably the right choice. This takes one step back in the data transfer path and tokenizes sensitive data before it even reaches the ETL. Instead of direct connection to the source database, the ETL provider connects through the data tokenization software which returns tokens. ALTR partners with SaaS-based ETL providers like Matillion to make this seamless for data teams.  

Level 3: End-to-end on-premises-to-cloud data tokenization

If you’re a very large financial institution classified as “critical vendor” by the US government, you’re familiar with the arduous security required. This includes ensuring that ultra-sensitive financial data is exceedingly secure – no unauthorized users, inside or outside the enterprise, can have access to that data, no matter where it is. You already have this nailed down in your on-premises data stores, but we’re living in the 21st century and everyone from marketing to IT operations is saying “you have to go to the cloud.” In this case, you’ll need “Level 3 Tokenization”: full end-to-end data tokenization from all your onsite databases through to your cloud data warehouse.  

As you can imagine, this can be a complex task. It requires tokenization of data across multiple on-premises systems before even starting the data transfer journey. The upside is that it can also shine a light on who’s accessing your data, wherever it is. You’ll quickly hear from people throughout the company who relied on sensitive data to do their jobs when the next time they run a report all they get back is tokens. This turns into a benefit by stopping “dark access” to sensitive data.  

Conclusion

Data tokenization can provide unique data security benefits across your entire path to the cloud. ALTR’s SaaS-based approach to data tokenization-as-a-service means we can cover data wherever it’s located: on-premises, in the cloud or even in other SaaS-based software. This also allows us to deliver innovations like new token formats or new security features more quickly, with no need for users to upgrade. Our tokenization solutions also range from flexible and scalable vaulted tokenization all the way up to PCI Level 1 compliant, allowing companies to choose the best balance of speed, security, and cost for their business. We’ve also invested heavily in IP that enables our database driver to connect transparently and keep data usable while tokenized. The drivers can, for example, perform the lookups and joins needed to keep applications that are unused to tokenization running.

With data tokenization from ALTR, users can bring sensitive data safely into the cloud to get full analytic value from it, while helping meet contractual security requirements or the steepest regulatory challenges.