In today’s digital landscape, securing sensitive information is paramount for industries like fintech, healthcare, and e-commerce. These sectors frequently handle sensitive data such as Personally Identifiable Information (PII), Card Holder Data (CHD), Primary Account Numbers (PAN), other data in Payment Card Industry (PCI) scope, Social Security Numbers (SSN), and Protected Health Information (PHI). Ensuring data confidentiality and integrity is essential for regulatory compliance and maintaining customer trust.
Handling sensitive data can be challenging, especially with large volumes stored in CSV and Excel files. Traditional data handling methods may expose information to breaches, making robust security measures indispensable.
One effective way to mitigate these risks is through data tokenization, replacing sensitive data with non-sensitive identifiers. Tokenization enhances data privacy and protection and simplifies compliance with regulations.
In this post, we walk through a solution using Integrate.io to process CSV files stored in SFTP To Go and apply the Piiano Vault tokenization API to protect sensitive fields. We will then run the Integrate.io job to transform and store the processed, tokenized data securely back in SFTP To Go.
We will cover the steps to set up each component, integrate them seamlessly, and automate the data processing workflow. By the end of this guide, you will understand how to implement a secure data processing pipeline that protects sensitive information, complies with regulatory requirements, and simplifies data management workflows at scale.
Motivation For Tokenizing Sensitive Fields
Handling sensitive data requires stringent security measures to prevent breaches and comply with regulations like PCI-DSS, HIPAA, and GDPR. Tokenizing sensitive information reduces the risk of exposure by substituting sensitive data with non-sensitive tokens. This is crucial for any organization’s data processing pipeline: it protects the pipeline from unauthorized access and ensures that even if data is intercepted, it remains useless to malicious actors.
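To make the idea concrete, here is a minimal Python sketch of deterministic tokenization. `ToyVault` is a hypothetical in-memory stand-in written for this illustration; it is not how Piiano Vault works internally, and the key shown is a placeholder.

```python
import hashlib
import hmac


class ToyVault:
    """Hypothetical in-memory vault used only to illustrate the concept."""

    def __init__(self, secret_key: bytes) -> None:
        self._key = secret_key
        self._store: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Deterministic: the same input always yields the same token.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256)
        token = digest.hexdigest()[:32]
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only a holder of the vault can map a token back to the value.
        return self._store[token]


vault = ToyVault(secret_key=b"demo-key-not-for-production")
token = vault.tokenize("999-12-3456")
print(token)                    # a 32-character hex string
print(vault.detokenize(token))  # 999-12-3456
```

Anyone who intercepts the token alone learns nothing useful without access to the vault, which is the property the rest of this guide relies on.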
Solution Overview
We use a public dataset that is available for download at 100 Sample Synthetic Patient Records, CSV. It contains 100 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by Synthea™, an open-source patient population simulation made available by The MITRE Corporation.
For our use case, we would like to showcase a fairly large data set, so download the zipped file synthea_sample_data_csv_latest.zip and unzip it. To demonstrate sensitive data protection, our solution uses the dummy data in the file patients.csv. The file contains synthetic patient records in CSV format, including PII fields like birth date, address, SSN, and more.
💡 Note: The zip file also includes other synthetic data, such as allergies.csv, which includes PHI data per patient. Our solution can easily be extended to also process and tokenize those fields. Feel free to contact the Integrate.io team for help.
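Before uploading anything, it can help to check which columns in a file look sensitive. The sketch below reads only the CSV header and reports matches against a candidate list. The column names in `SENSITIVE_CANDIDATES` and in the sample are assumptions for illustration; compare them with the header of your own copy of patients.csv.

```python
import csv
import io

# Assumed column names for illustration only.
SENSITIVE_CANDIDATES = {"SSN", "BIRTHDATE", "ADDRESS", "DRIVERS", "PASSPORT"}


def find_sensitive_columns(csv_text: str) -> list[str]:
    """Return header names that match the candidate list, in file order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in header if name.strip().upper() in SENSITIVE_CANDIDATES]


sample = (
    "Id,BIRTHDATE,SSN,FIRST,LAST,ADDRESS\n"
    "1,1990-01-01,999-12-3456,Ann,Lee,1 Main St\n"
)
print(find_sensitive_columns(sample))  # ['BIRTHDATE', 'SSN', 'ADDRESS']
```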
The file will be uploaded to SFTP To Go storage. We will define a data flow in Integrate.io to process the file, call the Piiano Vault tokenization API to replace sensitive fields with secure data tokens, and store the processed data back in SFTP To Go. We will preview the processed file using the SFTP To Go web portal.
The following diagram illustrates our solution's architecture:
Solution architecture diagram
Prerequisites
- SFTP To Go account: Secure cloud storage as a service.
- Piiano Vault account: A powerful security engineering platform with an API for tokenizing and de-tokenizing sensitive data.
- Integrate.io account: A robust data integration platform.
Step-by-Step Guide to Process and Tokenize CSV Data
Let's set up SFTP To Go as a secure cloud storage to handle sensitive CSV files.
Setting Up SFTP To Go
1. Create an SFTP To Go Account: Head over to SFTP To Go and sign up for a 7-day free trial account. Follow the simple setup instructions to get your secure SFTP storage ready.
2. Create SFTP To Go Credentials: To adhere to the principle of least privilege, create dedicated credentials in SFTP To Go for Integrate.io to use when reading and writing the files.
Creating SFTP To Go credentials
3. Upload patients.csv File: With SFTP To Go, you can upload CSV files programmatically using the SFTP protocol. This makes data transfers smooth and automated, perfect for accepting sensitive files from external systems.
In our case, we’ll upload the file using the SFTP To Go integrated web portal (under the File Browser tab). Navigate to the /home/integrateio/ folder (which was set as the home directory of the credentials we created), and then click Upload files to begin the upload process. Add the patients.csv file from the unzipped folder, and click Upload 1 file.
Uploading files using the SFTP To Go web portal.
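For the programmatic route mentioned above, a minimal sketch using the third-party `paramiko` package might look like the following. The host, username, key path, and home directory are placeholders you would replace with your own credentials' values, and the function assumes the server's host key is already in your known_hosts file.

```python
import os
import posixpath


def remote_path(home_dir: str, local_path: str) -> str:
    """Join the remote home directory and the local file's name, POSIX style."""
    return posixpath.join(home_dir, os.path.basename(local_path))


def upload_file(host: str, username: str, key_path: str,
                local_path: str, home_dir: str) -> str:
    # paramiko is a third-party package (pip install paramiko). It is imported
    # here so that remote_path() stays usable without it.
    import paramiko

    client = paramiko.SSHClient()
    client.load_system_host_keys()  # unknown hosts are rejected by default
    client.connect(hostname=host, username=username, key_filename=key_path)
    try:
        sftp = client.open_sftp()
        target = remote_path(home_dir, local_path)
        sftp.put(local_path, target)
        return target
    finally:
        client.close()


print(remote_path("/home/integrateio", "data/patients.csv"))
```

Calling `upload_file(...)` with your own connection details would place the file in the same folder the web portal upload uses.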
Setting up Piiano Vault
Next, set up Piiano Vault for tokenization.
- Create a Piiano Vault Account: Head over to Piiano.com and sign up for a 7-day free trial of a managed Piiano Vault instance. This is where the tokenization will happen.
- Create a Vault: When you sign up for Piiano Vault, you automatically receive a default, ready-to-use Vault, so you can start without any initial setup. If you still want to create a Vault of your own, click New Vault, choose the Sandbox environment, and click Create Vault. Give it a few seconds for the Vault state to become Running so we can start using it.
💡 Piiano Vault offers several deployment types, including a PCI-compliant environment for storing payment data such as Card Holder Data and Primary Account Numbers.
Creating new Vault
- Create a Collection: Set up a collection in your Vault to serve as a secure container for tokenized data. Head over to Collections and click New collection. Name the collection patients.
- Define Collection Schema: Specify the schema for your collection. This ensures that the data structure is correctly validated and processed. For our solution, we will tokenize SSN, so add a new property named SSN with SSN as its data type. Notice that the property is Encrypted by default, which ensures the data is also encrypted in the underlying storage.
💡 Piiano Vault comes with various schemas and built-in semantic data types, helping it understand the meaning of your data and offering tailored, built-in features for each type.
- Click Create collection to create the collection.
Creating new collection
Setting up Integrate.io
Finally, set up Integrate.io for data processing.
1. Create an Integrate.io Account: Head over to Integrate.io and sign up for a 14-day free trial account. Follow the simple setup instructions to create a new organization and get your data integration platform ready.
2. Create an SFTP Connection: A connection is a reusable set of credentials and settings that can be used in data pipelines. For our solution, we will create a new SFTP connection with Public key authentication as the authentication method, as it’s the most secure way to connect:
Head over to Connections list and click New connection:
Integrate.io’s list of integrations
Creating SFTP connection
Once you set the authentication method, you will be provided with a unique public key that can be imported into SFTP To Go for authentication purposes. Head back to SFTP To Go, and import the provided SSH public key into the integrate-demo credentials we created earlier.
Importing SSH key into SFTP To Go credentials
Next, copy the Host and Username values into Integrate.io’s connection details, and name your connection SFTP To Go. Click Test connection to confirm the connection is working. Finally, click Create connection to save the connection details.
3. Create a Dataflow Package: A package is an executable, user-defined pipeline that moves data from one place to another. For our solution we will create a new blank Dataflow package: Head over to Packages list, and click on New package. Name the package Tokenize Demo and make sure the Type is set to Dataflow. Click on Create package to create the package.
Creating a dataflow package
4. Set Environment Variables: Before building the data pipeline, we define some secret environment variables that can be used to make authenticated Piiano Vault API calls. Click the three dots in the package designer to set variables, and navigate to Secrets tab. Add the following variables:
PIIANO_VAULT_API_KEY: For security purposes, we will use a user that can only tokenize data in our Vault (it cannot read or list any data). Head back to your Piiano Vault dashboard, and navigate to Identity and access management. Open the IAM TOML tab, and paste the following TOML content. Click Save IAM TOML to save the IAM configuration:
[idps]
[policies]
[policies.AdminPolicy]
operations = ["read", "write", "delete", "search", "tokenize", "detokenize", "encrypt", "decrypt", "hash", "stats", "invoke"]
policy_type = "allow"
reasons = ["*"]
resources = ["*"]
[policies.InvokeHTTPCallPolicy]
operations = ["invoke"]
policy_type = "allow"
reasons = ["AppFunctionality"]
resources = ["http_call"]
[policies.PatientsDetokenizePolicy]
operations = ["detokenize"]
policy_type = "allow"
reasons = ["AppFunctionality"]
resources = ["patients/properties/*"]
[policies.PatientsTokenizePolicy]
operations = ["tokenize"]
policy_type = "allow"
reasons = ["AppFunctionality"]
resources = ["patients/properties/*"]
[roles]
[roles.PublicPatientsTokenizerRole]
capabilities = ["CapTokensWriter"]
policies = ["PatientsTokenizePolicy"]
[roles.VaultAdminRole]
capabilities = ["CapSystem"]
policies = ["AdminPolicy"]
[roles.VaultHTTPCallAction]
capabilities = ["CapTokensDetokenizer"]
policies = ["PatientsDetokenizePolicy"]
[roles.WebAppRole]
capabilities = ["CapActionsInvoker"]
policies = ["InvokeHTTPCallPolicy"]
[users]
[users.PublicPatientsTokenizer]
disabled = false
role = "PublicPatientsTokenizerRole"
[users.VaultAdmin]
disabled = false
role = "VaultAdminRole"
[users.Webapp]
disabled = false
role = "WebAppRole"
💡 Note: Identity and access management is a powerful feature of Piiano Vault that governs access to APIs and data.
Next, navigate to the Users tab, open the options menu of the PublicPatientsTokenizer user, and regenerate its API key so we can use it in our data pipeline. Copy the new key and use it as the variable value.
Generating tokenizer API key
- PIIANO_VAULT_BASE_URL: This is your Vault’s endpoint. Head back to your Piiano Vault dashboard, and copy the Endpoint value. Paste it as the variable value.
- PIIANO_VAULT_PATIENTS_COLLECTION: This is the name of the collection the data will be stored in. For our use case, we call it patients.
Next, click Save to set the package variables.
Setting package secret variables
5. Build The Data Pipeline: We will now build the ETL data pipeline to process the patients file and tokenize data in the sensitive SSN field.
Extract - Read Source File: Click Add component to create a File Storage component.
- Name it patients.
- Choose SFTP To Go connection and click Next.
- In Source path, type /patients.csv. The default settings for File Storage should suffice in our case, so click Next.
- An attempt to read the file and determine its schema will now start. You should see a preview of the file and a list of detected fields. For our solution, we’ll select all fields. This step helps Integrate.io understand and correctly process the data.
- Next, click Save to update the component state.
Extracting patients data with file storage source component
6. Transform - Tokenize Sensitive Fields: For our solution, we will use the commonly used Select transformation component to call the Piiano Vault API once per CSV row. This is the step that sends the sensitive field for tokenization.
- To do so, hover over the patients component we just added, and click the plus button to open the list of transformation components. Pick the Select component.
- Next, click the new component to edit its settings. Name it tokenized. For our solution, we will retain only the Id field and a tokenized SSN field. In the Specify field expressions section, click the plus button to add a new field. Type Id as both Expression and Alias, and click the plus button to add another field. We’ll use this second field to tokenize the value of the SSN field.
- To tokenize SSN, we use a deterministic token type, so that running the pipeline again on the same data produces the same tokens. You can follow the Piiano Vault documentation for further reading on token types. The tokens will replace sensitive information, keeping your data secure and compliant.
To do so, paste the following expression into the Expression part of the field, and type SSN_Token in the Alias part of the field:
JsonExtractScalar(Curl('$PIIANO_VAULT_BASE_URL/api/pvlt/1.0/data/collections/$PIIANO_VAULT_PATIENTS_COLLECTION/tokens?reason=AppFunctionality', 'POST', '{"Authorization": "Bearer $PIIANO_VAULT_API_KEY", "Content-Type": "application/json"}', SPRINTF('[{ "type": "pci", "object": { "fields": { "SSN": "%s" }}, "props": ["SSN"]}]', SSN))#'body', '$[0].token_id')
Transformation component to tokenize SSN field
💡 Note: The expression is essentially a call to Piiano Vault’s Tokenize API with the package variables we defined earlier and extracting the token ID from the response.
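For readers who prefer to see the same call outside Integrate.io, here is a rough Python equivalent that builds (but does not send) the request. The URL path, headers, and payload are copied from the expression above; the base URL, API key, and the sample response body are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_tokenize_request(base_url: str, collection: str,
                           api_key: str, ssn: str) -> urllib.request.Request:
    """Build the POST request that the Curl(...) expression sends for one row."""
    url = (
        f"{base_url.rstrip('/')}/api/pvlt/1.0/data/collections/"
        f"{collection}/tokens?reason=AppFunctionality"
    )
    payload = [{"type": "pci", "object": {"fields": {"SSN": ssn}}, "props": ["SSN"]}]
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_token_id(response_body: str) -> str:
    # Same lookup as the JSON path '$[0].token_id' in the expression.
    return json.loads(response_body)[0]["token_id"]


# Placeholder values; nothing is sent over the network here.
request = build_tokenize_request(
    "https://vault.example.com", "patients", "YOUR_API_KEY", "999-12-3456"
)
print(request.get_method(), request.full_url)
print(extract_token_id('[{"token_id": "example-token"}]'))  # example-token
```

One difference from the SPRINTF version is that `json.dumps` escapes the field value, so a stray quote in the input cannot break the request body.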
7. Load - Store Processed Data: Connect a new File Storage destination as the final step to store the processed and tokenized data back into the SFTP storage. The processed data is stored under the /processed folder as a single CSV file. Open the destination component and set /processed as the Target directory. For our solution, we will not compress the output, as we want to preview it.
In the Destination action, select Use intermediate storage and copy files to a new directory in destination, and make sure to check Merge output to single file so that even large datasets produce a single file. The following image illustrates what the data pipeline looks like:
Full dataflow pipeline
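The three stages can be summarized in a few lines of Python. This is a simplified offline sketch of the extract, transform, and load steps, not what Integrate.io runs internally: `stand_in_tokenize` is a hypothetical placeholder for the vault call, and the sample rows are made up.

```python
import csv
import hashlib
import io
from typing import Callable


def run_pipeline(source_csv: str, tokenize: Callable[[str], str]) -> str:
    """Extract rows, keep Id plus a tokenized SSN, and return the output CSV."""
    reader = csv.DictReader(io.StringIO(source_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["Id", "SSN_Token"], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({"Id": row["Id"], "SSN_Token": tokenize(row["SSN"])})
    return out.getvalue()


def stand_in_tokenize(ssn: str) -> str:
    # Hypothetical placeholder so the sketch runs offline; the real pipeline
    # gets its tokens from the vault, not from a local hash.
    return hashlib.sha256(ssn.encode("utf-8")).hexdigest()[:16]


source = "Id,SSN,FIRST\n1,999-12-3456,Ann\n2,999-65-4321,Bo\n"
print(run_pipeline(source, stand_in_tokenize))
```

Every column other than Id and the token is dropped, which mirrors the Select component's field list.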
8. Run The Pipeline: Finally, we’ll run the pipeline. In the package designer, click Save & Run to save the latest version of the package and execute it. A job needs a cluster to run on, so create one if you don’t have one yet. Then, click Run job.
The job can take a few minutes to run, depending on the size of the cluster and the size of the file to process.
Choose a cluster
9. Preview The Processed File: We should now have a file with tokenized patient SSNs. To preview it, head back to your SFTP To Go file browser and navigate to /home/integrateio/processed. You will see a file named part-r-00000.csv. Click the generated file to preview it.
Preview the processed file
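Beyond eyeballing the preview, a quick automated check can confirm that no raw SSN slipped through. The sketch below assumes SSNs use the NNN-NN-NNNN format and that tokens never happen to match it; both are assumptions to verify against your own data.

```python
import re

# Assumed raw SSN format: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_raw_ssns(text: str) -> list[str]:
    """Return every substring that still looks like an untokenized SSN."""
    return SSN_PATTERN.findall(text)


processed = "Id,SSN_Token\n1,3f9a1c0e7b2d4a55\n"
leaky = "Id,SSN_Token\n1,999-12-3456\n"
print(find_raw_ssns(processed))  # []
print(find_raw_ssns(leaky))      # ['999-12-3456']
```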
Automating The Pipeline With Integrate.io's Scheduler
To ensure your data processing pipeline runs automatically at scheduled intervals, you can use Integrate.io's Scheduler. This will help automate the process of reading, tokenizing, and storing your data without manual intervention.
- Navigate to Schedules and click New schedule.
- Give the schedule a name (e.g. Daily Patients Processing ).
- Select Cron expression and set it to 0 8 * * * (to run every day at 8AM).
- Configure the cluster size for the job to use.
- Select packages to run, and pick our Tokenize Demo package. It will use the latest version unless you pin it to a specific version.
- Click Create schedule to create it.
Create a cron based schedule
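To double-check what the expression 0 8 * * * means, the following sketch computes the next firing time for a fixed minute and hour. It handles only this simple daily case, not general cron syntax, and it ignores time zones, so the zone the scheduler actually uses is outside its scope.

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, hour: int = 8, minute: int = 0) -> datetime:
    """Next firing time of 'minute hour * * *' strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2024, 1, 1, 7, 30)))  # 2024-01-01 08:00:00
print(next_daily_run(datetime(2024, 1, 1, 9, 0)))   # 2024-01-02 08:00:00
```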
Retrieving an SSN Given a Token
To retrieve an SSN from a token, you need to use the de-tokenization functionality provided by Piiano Vault. This process involves sending a request to the Piiano Vault API with the token, which then returns the original sensitive data if the requester has the necessary permissions.
- Ensure Proper Permissions: First, ensure that the user or API key used for de-tokenization has the appropriate permissions set in the Piiano Vault IAM configuration. The user must have the detokenize operation allowed for the relevant resource.
- Make the API Call: Use the following API call format to de-tokenize data:
curl -X POST $PIIANO_VAULT_BASE_URL/api/pvlt/1.0/data/collections/$PIIANO_VAULT_PATIENTS_COLLECTION/tokens/detokenize \
-H "Authorization: Bearer $PIIANO_VAULT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"token_id": "<TOKEN_ID>",
"props": ["SSN"]
}'
Replace <TOKEN_ID> with the actual token you wish to de-tokenize. If the token exists, the response will include the original SSN that corresponds to it.
- Handle the Response: The API response contains the original SSN in clear text, so treat it as sensitive data. Ensure that it is handled and stored securely to maintain compliance and data protection standards.
By following these steps, you can securely retrieve sensitive information like SSNs from tokens, allowing for necessary operations while maintaining high levels of data security.
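One practical way to handle a de-tokenized value carefully is to mask it before it reaches logs or screens. The helper below is a hypothetical sketch, not part of any of the three platforms; it keeps the last four digits only when the input has at least nine digits and masks everything otherwise.

```python
def mask_ssn(ssn: str) -> str:
    """Replace every digit except the last four with '*', keeping separators."""
    digits_total = sum(ch.isdigit() for ch in ssn)
    # Reveal the last four digits only for full-length values.
    keep = 4 if digits_total >= 9 else 0
    seen = 0
    masked = []
    for ch in ssn:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > digits_total - keep else "*")
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_ssn("999-12-3456"))  # ***-**-3456
```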
Conclusion
Integrating SFTP To Go, Integrate.io, and Piiano Vault provides a streamlined and secure solution for processing and tokenizing sensitive CSV data. This approach ensures compliance with data protection regulations, minimizes the risk of data breaches, and simplifies the overall data handling process. By leveraging these platforms, organizations across various industries can efficiently manage sensitive information, reinforcing their data security posture with ease.
Frequently Asked Questions (FAQ)
Q: What is data tokenization, and why is it important?
A: Tokenization of data is the process of substituting sensitive data with non-sensitive equivalents, known as tokens. This process enhances data privacy and protection, simplifies regulatory compliance, and minimizes the risk of data breaches.
Q: What types of sensitive data can be tokenized using Piiano Vault?
A: Piiano Vault can tokenize various types of sensitive data, including Personally Identifiable Information (PII), Card Holder Data (CHD), Primary Account Numbers (PAN), other data in Payment Card Industry (PCI) scope, Social Security Numbers (SSN), and Protected Health Information (PHI).
Q: Why should I use Integrate.io for data integration and processing?
A: Integrate.io is a robust data integration platform that offers easy-to-use tools for designing and automating complex data workflows. It supports various data sources and destinations, which keeps data handling and transformation smooth and helps you meet regulatory requirements.
Q: What is SFTP To Go, and how does it contribute to data security?
A: SFTP To Go is a secure cloud storage service that uses the SFTP protocol to handle file transfers. It ensures data security through encryption and access control, making it ideal for storing and automating the exchange of sensitive files.
Q: How does Piiano Vault ensure the security of tokenized data?
A: Piiano Vault uses encryption and access controls to secure tokenized data. It provides a robust and resilient security engineering platform for managing sensitive information and supports compliance with various regulatory standards.
Q: Can I extend this solution to tokenize other types of sensitive data?
A: Yes, the solution can be extended to tokenize other types of sensitive data. For example, you can process and tokenize fields in the other synthetic CSV files, such as allergies.csv, that contain PHI data.
Q: Is it necessary to use a public dataset, or can I use my own data?
A: While this guide uses a public dataset for demonstration purposes, you can certainly use your own CSV files with sensitive data. Ensure that you follow similar steps to set up and secure your data processing pipeline.
Q: What are the costs associated with using Integrate.io, Piiano Vault, and SFTP To Go?
A: Each platform offers free trials (Integrate.io for 14 days, Piiano Vault for 7 days, and SFTP To Go for 7 days). For detailed pricing, refer to the respective websites of Integrate.io, Piiano Vault, and SFTP To Go.
Q: How do I handle large datasets efficiently in Integrate.io?
A: Integrate.io supports scalable data processing and allows you to configure distributed processing clusters based on your dataset size, ensuring your dataflow package is optimized for large-scale data handling.
Q: Where can I get help if I encounter issues during setup?
A: If you encounter issues, you can reach out to the support teams of Integrate.io, Piiano Vault, and SFTP To Go. They offer comprehensive documentation and customer support to assist you.
Additional Resources
- Integrate.io Documentation
- Piiano Vault Documentation
- SFTP To Go Documentation
- Cron Expression To Go