Protecting PII from LLM Training with Format-Preserving Encryption

As companies build custom AI models, protecting sensitive training data such as personally identifiable information (PII) is critical to maintaining a competitive edge.

In the age of artificial intelligence (AI), large language models (LLMs) have revolutionized the way we interact with technology. Generative AI is increasingly used in predictive modeling, chatbots, and even art. However, with the growing use of these systems comes an equally significant concern: the privacy of the data used to train these models. One of the biggest barriers to AI adoption is ensuring that sensitive data, such as PII, is not inadvertently embedded in LLMs.

A solution to this challenge lies in Format-Preserving Encryption (FPE). In this blog post, we’ll explore how FPE can play a critical role in data privacy: enabling the development of powerful AI models without compromising sensitive data such as PII. We’ll also look at how ALTR’s FPE technology can help companies achieve this protection and unleash powerful new AI capabilities. 

Understanding Format-Preserving Encryption (FPE) 

Format-Preserving Encryption is a type of encryption that allows data to be encrypted while maintaining its length, alphabet, and special characters. Unlike traditional encryption techniques that turn data into random-looking ciphertext, FPE ensures that the encrypted data remains in the same structure and length as the original input. This feature is particularly useful when dealing with structured data, such as credit card numbers, social security numbers, or other forms of PII that require a specific format for legitimate processing or storage. FPE is also deterministic, meaning that a given piece of data is always encrypted to the same value, ensuring that the data can still be used in analysis that requires referential integrity. 

For example, if the social security number “123-45-6789” is encrypted using FPE, the result might look like “135-79-2468,” preserving the format (a series of digits grouped by hyphens) while no longer revealing the real social security number.
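To make the idea concrete, here is a minimal Python sketch using the open-source pyffx library (one implementation of the FFX mode of AES). The hard-coded key and the helper functions are illustrative assumptions, not part of any particular product; a real deployment would source keys from a managed key service.

```python
# Minimal FPE sketch using the open-source pyffx library. The key and the
# encrypt_ssn/decrypt_ssn helpers are for illustration only.
import pyffx

KEY = b"example-secret-key"  # illustrative; use a managed key service in practice

# A cipher over 9-digit strings: ciphertext keeps the same alphabet and length.
cipher = pyffx.String(KEY, alphabet="0123456789", length=9)

def encrypt_ssn(ssn: str) -> str:
    """Encrypt an SSN while preserving its ddd-dd-dddd layout."""
    digits = ssn.replace("-", "")
    enc = cipher.encrypt(digits)
    return f"{enc[:3]}-{enc[3:5]}-{enc[5:]}"

def decrypt_ssn(ssn: str) -> str:
    """Recover the original SSN; possible only with the key."""
    digits = ssn.replace("-", "")
    dec = cipher.decrypt(digits)
    return f"{dec[:3]}-{dec[3:5]}-{dec[5:]}"

protected = encrypt_ssn("123-45-6789")
print(protected)                                # e.g. 507-41-8826: same shape, not the real SSN
print(encrypt_ssn("123-45-6789") == protected)  # True: FPE is deterministic
print(decrypt_ssn(protected))                   # 123-45-6789
```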

The Privacy Problem with LLM Training 

Large language models like GPT-4 are trained on vast datasets containing information scraped from a variety of sources. As organizations look to tailor LLMs to their own business needs and use cases, they need to customize models by training them on their own internal datasets. The challenge arises when sensitive data, such as an individual’s name, email address, or social security number, inadvertently gets embedded into the model during the training process. Once training is complete, this data becomes part of the model’s learned patterns and can potentially be retrieved or exposed by the LLM.

Given the sheer scale of these models and the sensitivity of the data involved, protecting PII during training has become a critical issue for developers, regulators, and privacy advocates alike. Exposing PII in this manner could lead to privacy violations, identity theft, and legal repercussions for organizations looking to take advantage of customized AI models. 

How Format-Preserving Encryption Can Protect PII 

Format-Preserving Encryption offers a unique and effective solution to this problem by allowing organizations to encrypt PII before using it in LLM training without altering its format. Here’s how FPE can help ensure that sensitive data is safeguarded throughout the AI development process: 

1. Pre-Training Data Protection 

Before training a language model, any PII within the dataset can be encrypted using FPE. For instance, if a dataset includes names, phone numbers, or email addresses, each of these can be encrypted so that they appear as random but valid-looking entries to the model. 

For example: 

  • A name like “JANE DOE” may be transformed into “KSUG JPY.” 
  • A phone number like “(123) 555-4567” might become “(927) 282-4819.” 

This ensures that even if the data is used to train the model, the AI does not learn or store any real-world personal identifiers. 
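As a sketch of what this pre-training step can look like in practice, the snippet below encrypts the PII columns of a pandas DataFrame before the data is handed to a training pipeline. The column names and pyffx-based helpers are hypothetical assumptions for illustration, not any vendor’s API.

```python
# Sketch: encrypt PII columns with FPE before the dataset reaches training.
import re
import pandas as pd
import pyffx

KEY = b"example-secret-key"  # illustrative; use a managed key in practice

def fpe_word(word: str) -> str:
    """Encrypt one uppercase word, preserving its length."""
    cipher = pyffx.String(KEY, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", length=len(word))
    return cipher.encrypt(word)

def fpe_name(name: str) -> str:
    """Encrypt each word of a name, keeping spaces and word lengths."""
    return " ".join(fpe_word(w) for w in name.split())

def fpe_phone(phone: str) -> str:
    """Encrypt only the digits of a phone number, keeping punctuation in place."""
    digits = re.sub(r"\D", "", phone)
    cipher = pyffx.String(KEY, alphabet="0123456789", length=len(digits))
    enc = iter(cipher.encrypt(digits))
    return re.sub(r"\d", lambda _: next(enc), phone)

df = pd.DataFrame({
    "name": ["JANE DOE", "JOHN SMITH"],
    "phone": ["(123) 555-4567", "(987) 555-0000"],
})
df["name"] = df["name"].map(fpe_name)
df["phone"] = df["phone"].map(fpe_phone)
print(df)  # same shapes and formats, no real identifiers
```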

2. Data Usability with Privacy Protection 

One of the primary benefits of FPE is that it retains the format of the original data, which means that encrypted data can still be processed, validated, and used by the system as if it were unencrypted. This is essential for training LLMs, which need to understand data structure, syntax, and relationships between different entities. It also ensures that any other users of the data, such as analysts or BI tools, can still operate on the encrypted data. 
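Determinism is what makes this possible: the same plaintext always encrypts to the same ciphertext under a given key, so referential integrity survives encryption. A brief sketch, reusing the hypothetical encrypt_ssn helper from the earlier example:

```python
# Deterministic FPE preserves referential integrity: encrypting the join key
# in two tables with the same key material still lets them join correctly.
# Assumes the encrypt_ssn helper from the earlier sketch.
import pandas as pd

customers = pd.DataFrame({"ssn": ["123-45-6789", "987-65-4321"],
                          "segment": ["premium", "standard"]})
orders = pd.DataFrame({"ssn": ["123-45-6789", "123-45-6789"],
                       "total": [42.50, 17.25]})

# Encrypt the join key in both tables with the same key.
customers["ssn"] = customers["ssn"].map(encrypt_ssn)
orders["ssn"] = orders["ssn"].map(encrypt_ssn)

# The ciphertext join returns the same rows a plaintext join would.
print(customers.merge(orders, on="ssn"))
```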

3. Minimizing Risks of Data Leakage 

Even if an encrypted dataset is accessed by unauthorized individuals or used for unintended purposes, the encryption ensures that PII is not exposed. Without the decryption key, it is computationally infeasible to reverse the encryption and retrieve the original information. This adds an additional layer of security, especially when handling datasets that contain potentially sensitive or personal data. 

The risk is even greater if an organization uses off-the-shelf tools such as ChatGPT. By default, consumer ChatGPT conversations may be used to train OpenAI’s models, so as soon as data such as PII is fed in, an organization loses control of that sensitive information. Obfuscating the data with techniques such as FPE is one of the few ways for an organization to maintain control over its data and eliminate the risk of leakage when using these tools.

4. Compliance with Data Privacy Regulations 

As privacy regulations like the GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act) evolve, organizations are increasingly required to demonstrate how they handle and protect PII in AI training datasets. By using FPE to encrypt PII, companies can significantly reduce the risk of violating these regulations. Encrypted data that retains its format helps meet privacy standards while allowing AI systems to be trained on realistic datasets.

FPE also helps organizations comply with requirements such as the GDPR’s right to be forgotten. If an AI model was trained on customer data, that data isn’t truly “forgotten” even when the customer’s information is removed from the source system – forgetting requires retraining the model, which demands a significant investment of time and resources. If FPE was used to protect the data before it was fed into the AI model, the model doesn’t need to be retrained to “forget” customer PII – it never knew it in the first place!

5. Enabling Selective Decryption 

In some cases, there might be a need for a user to decrypt the data for legitimate purposes (e.g., for auditing or compliance checks). With FPE, organizations can retain control over which parties have access to decryption keys, allowing them to manage who can view the original PII. This selective decryption ensures that sensitive information is only accessible to authorized users or systems. 
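A toy sketch of the idea, with a simple role check standing in for a real access-control layer (this is not how any particular product implements it, and decrypt_ssn is the hypothetical helper from the earlier sketch):

```python
# Toy sketch of selective decryption: only callers in an authorized role can
# recover plaintext. The role check is an illustrative stand-in for a real
# access-control system.
AUTHORIZED_ROLES = {"auditor", "compliance"}

def reveal_ssn(encrypted_ssn: str, role: str) -> str:
    if role not in AUTHORIZED_ROLES:
        raise PermissionError(f"role {role!r} is not allowed to decrypt PII")
    return decrypt_ssn(encrypted_ssn)

# reveal_ssn(protected, "analyst")  -> raises PermissionError
# reveal_ssn(protected, "auditor")  -> "123-45-6789"
```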

How ALTR’s FPE Can Protect Your Data 

For organizations looking to implement FPE in their data security strategy, ALTR’s FPE provides an advanced solution designed to meet the needs of modern data privacy and compliance requirements. ALTR’s platform enables users to integrate FPE into their existing workflows without deploying new infrastructure or writing code. ALTR also automates access control over encryption keys and the ability to decrypt data, making it easy to specify who can – and can’t – access sensitive data.

ALTR’s SaaS platform seamlessly integrates FPE throughout the data ecosystem, enabling organizations to incorporate data privacy across their databases, ETL tools, and cloud data warehouses. In this way, organizations can focus their time and resources on what matters – achieving their business goals – without having to worry about data privacy.

Conclusion 

As organizations fight to maintain their competitive edge using custom AI models, ensuring the privacy of the data used to train those models, such as personally identifiable information (PII), becomes increasingly important. Format-Preserving Encryption (FPE) offers a viable solution for protecting PII in training datasets by encrypting sensitive data while maintaining its original format. FPE allows organizations to quickly train, customize, and deploy new AI models without exposing private information.

ALTR’s FPE makes it effortless to deploy this technology, providing a robust, reliable, and compliant platform that safeguards data for AI development. Schedule a product tour to learn more about how our Format-Preserving Encryption technology can help you safeguard sensitive information in AI model training and more.