There is a famous saying that goes:
“The journey is more important than the destination.”
Coincidentally, this is also true for data in modern times. The information we see in polished reports and charts, or that is displayed to users via an application, has passed through a long run of pipelines and strategies. Originating from different touchpoints, data witnesses several alterations and transformations throughout its journey. These transformations are the result of well-planned strategies, such as:
- Joining with information from other sources.
- Data type manipulation.
- Removal of irrelevant and unrequired information.
Such alterations are required to maintain data quality and to define a standard schema throughout the organization. However, at times it becomes important to keep track of these changes for various purposes. The changes that information undergoes throughout its journey are called data lineage. This article will serve as a guide for beginners as well as experienced professionals. But first, let’s talk more about what data lineage is.
Table of Contents
- What is Data Lineage?
- What Problems Does Data Lineage Solve?
- Where Does Data Lineage Fit into Data Governance?
- Methods & Techniques
- How Can Data Leaders Get Started With Data Lineage?
- Best Practices
- Automating Data Lineage & Its Impact
- Some Examples of Data Lineage Tools
- Exploring the Untapped Potential of Data
What is Data Lineage?
Data lineage is quite a self-explanatory term. Lineage refers to tracking one’s pedigree or ancestry, so, in the context of data, it means tracking data throughout its journey. The procedure keeps track of where the data originated, all the databases it has been stored in, all the transformations applied to it, and its final destination.
The key element that allows data lineage tools to surface important information about the data is its metadata. Metadata contains the complete description of what the data table contains, including schema information, creation date, modification date, file size, and authors. This helps users track the changes made to the stored data and when they were made. Data lineage tools act as a version control system (VCS) for the data, tracking every transformation the data has seen, highlighting key changes, and providing easy access to previous versions for debugging. These tools also help engineers with a visualization of the entire journey in the form of a map, linking each data stop to the next.
Data lineage may seem like an unnecessary effort for little gain, but it has more advantages than you might think. Let’s look over some of the problems it tackles.
What Problems Does Data Lineage Solve?
Data growth brings opportunities, but it also creates certain challenges. With colossal amounts of data, it becomes all the more difficult to diagnose and pinpoint where a problem occurred. Data lineage solves many such problems.
Better Understanding of Data
Data lineage gives users a snapshot view of all the versions of the data that have been previously created. By analyzing the journey, it becomes clearer to data engineers why the changes were made, which helps them understand the data’s potential. Being aware of the data’s origin helps stakeholders familiarize themselves with the data sources and touchpoints. This can bring infrastructural benefits such as eliminating data silos.
Having a better understanding of the data naturally leads to better use of it. When you know what information your data holds, you know how to drive value out of it. With the help of data lineage and analysis, organizations can explore new business opportunities and ventures.
Data lineage also helps organizations deal with the ripple effects that may occur from changing any step in the pipeline. With the complete data map at their disposal, data engineers can analyze how a certain action or change will affect the data downstream and what repercussions it will have for the entire database.
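As a rough illustration of this kind of impact analysis, the sketch below walks a lineage map downstream from a changed dataset. The dataset names and the `DOWNSTREAM` dictionary are hypothetical, and a real lineage tool would build this map for you rather than have it written by hand.

```python
from collections import deque

# Hypothetical lineage map: each key feeds the datasets listed in its value.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "raw_customers": ["clean_customers"],
    "clean_orders": ["orders_enriched"],
    "clean_customers": ["orders_enriched"],
    "orders_enriched": ["sales_report", "churn_features"],
}


def impacted_by(changed, downstream):
    """Return every dataset that depends, directly or indirectly, on `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(impacted_by("clean_customers", DOWNSTREAM))
# ['churn_features', 'orders_enriched', 'sales_report']
```

A breadth-first walk is used here only because it is simple; any graph traversal that visits each dependent dataset once would do.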
Better Error Handling
When your pipeline contains tens of data modification points, debugging becomes extremely difficult. You can see that the final output is not what you want, but you have no clue where things went wrong. Data lineage solves this by presenting you with the individual versions of the data resulting from each modification in the pipeline. This makes it easier to view the alterations applied and focus on the erroneous queries.
When a new employee joins the data team, the biggest challenge is transferring the domain knowledge. For most enterprises, data is scattered and undergoes multiple transformations for different uses. Data lineage means you will have an easy path for your trainee to follow. This centralized record of all sources, touchpoints, and alteration procedures aids the learning process.
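To make the debugging idea concrete, here is a minimal sketch that keeps the output of every pipeline step and then reports the first step whose output fails a check. The step functions, the sample rows, and the deliberate bug in `parse_amount` are all made up for the demo.

```python
def drop_missing_ids(rows):
    """Remove rows that have no id."""
    return [r for r in rows if r["id"] is not None]


def parse_amount(rows):
    """Convert amount strings to floats.

    Deliberate bug for the demo: values with a thousands separator become None.
    """
    parsed = []
    for r in rows:
        raw = r["amount"]
        parsed.append({**r, "amount": None if "," in raw else float(raw)})
    return parsed


def add_tax(rows):
    """Add a total column; rows without an amount get no total."""
    return [
        {**r, "total": None if r["amount"] is None else r["amount"] * 1.2}
        for r in rows
    ]


def run_with_snapshots(rows, steps):
    """Apply each step in order and keep the output of every step."""
    snapshots = [("input", rows)]
    for step in steps:
        rows = step(rows)
        snapshots.append((step.__name__, rows))
    return snapshots


def first_bad_step(snapshots, is_valid):
    """Return the name of the first step whose output fails the check, or None."""
    for name, rows in snapshots:
        if not all(is_valid(r) for r in rows):
            return name
    return None


def has_amount(row):
    return row["amount"] is not None


rows = [
    {"id": 1, "amount": "10.50"},
    {"id": None, "amount": "3.00"},
    {"id": 2, "amount": "1,200.00"},
]
snapshots = run_with_snapshots(rows, [drop_missing_ids, parse_amount, add_tax])
print(first_bad_step(snapshots, has_amount))  # parse_amount
```

Without the per-step snapshots, only the final output of `add_tax` would be visible, and the missing totals would give no hint that the parsing step introduced the problem.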
Where Does Data Lineage Fit into Data Governance?
Data governance is an important process in data infrastructures. It gives data a much-needed structure by defining user roles (owner, read-only access, etc.), processing rules, and policies. A defined structure makes the data easier to manage and maintain.
Data lineage can be seen as an aid to data governance. Where governance defines multiple rules and restrictions for data, lineage helps determine whether these are being complied with. Data lineage provides managers with an audit trail of all the transformations the data has gone through, which is used to verify that all processes are in order, as defined by the governance policies.
Methods & Techniques
There are multiple ways in which data lineage can be tracked. Let’s talk about them below.
Lineage By Data Tagging
Whichever transformation engine you use tags the data along its transformation journey. Each version of the data is tagged separately to create a link between them. These tags are then used to create a lineage for the data. The downside of this approach is that all transformations must be defined within that tool, and any transformation that occurs on the data externally will not be tracked.
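The tagging idea can be sketched in a few lines. `TaggingEngine` below is a hypothetical stand-in for a transformation engine, not any real product's API: every version it produces gets its own tag plus a pointer to the tag it was derived from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedData:
    """A version of the data plus the tag that identifies it."""
    rows: list
    tag: str
    parent_tag: Optional[str]
    step: str


class TaggingEngine:
    """Hypothetical transformation engine that tags every version it produces."""

    def __init__(self):
        self._next_id = 1
        self.registry = {}  # tag -> TaggedData

    def _register(self, rows, parent_tag, step):
        tag = f"v{self._next_id}"
        self._next_id += 1
        version = TaggedData(rows, tag, parent_tag, step)
        self.registry[tag] = version
        return version

    def load(self, rows, source_name):
        return self._register(list(rows), None, f"load:{source_name}")

    def apply(self, version, func):
        return self._register(func(version.rows), version.tag, func.__name__)

    def lineage(self, tag):
        """Follow parent tags back to the source and return the steps in order."""
        steps = []
        while tag is not None:
            version = self.registry[tag]
            steps.append((version.tag, version.step))
            tag = version.parent_tag
        return list(reversed(steps))


def drop_negative(rows):
    return [x for x in rows if x >= 0]


def double(rows):
    return [x * 2 for x in rows]


engine = TaggingEngine()
raw = engine.load([3, -1, 4], "sensor_feed")
final = engine.apply(engine.apply(raw, drop_negative), double)
print(final.rows)                 # [6, 8]
print(engine.lineage(final.tag))
# [('v1', 'load:sensor_feed'), ('v2', 'drop_negative'), ('v3', 'double')]
```

The limitation described above shows up directly: a change made to `final.rows` outside `engine.apply` would leave no tag behind.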
Pattern-Based Lineage
This form of lineage tracking is based on the metadata of the tables and the databases. It does not concern itself with any programming language or the code used in the transformations, but rather with the changes made to the data itself. It looks for patterns in these changes across the data to create a lineage for it.
It is a very simple approach that is environment independent since it only concerns itself with the data, so it can be used across multiple database systems such as MySQL and SQL Server. However, its simplicity makes it prone to errors, as there will always be a chance that it misses complex patterns that are very subtle.
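One possible way to picture this approach is value-overlap matching between two tables: a target column is linked to the source column that shares most of its values. This is only a toy heuristic chosen for illustration; the tables, column names, and the `min_overlap` threshold are assumptions, and both tables are assumed to be non-empty.

```python
def column_values(table, column):
    """Collect the distinct values of one column."""
    return {row[column] for row in table}


def infer_column_lineage(source, target, min_overlap=0.8):
    """Guess which source column each target column came from.

    A link is proposed when at least `min_overlap` of the target column's
    distinct values also appear in the source column.
    """
    links = {}
    for t_col in target[0].keys():
        t_values = column_values(target, t_col)
        best, best_score = None, 0.0
        for s_col in source[0].keys():
            s_values = column_values(source, s_col)
            score = len(t_values & s_values) / len(t_values)
            if score > best_score:
                best, best_score = s_col, score
        if best is not None and best_score >= min_overlap:
            links[t_col] = best
    return links


customers = [
    {"cust_id": 1, "email": "a@example.com", "country": "DE"},
    {"cust_id": 2, "email": "b@example.com", "country": "FR"},
    {"cust_id": 3, "email": "c@example.com", "country": "DE"},
]
report = [
    {"customer": 1, "contact": "a@example.com", "email_domain": "example.com"},
    {"customer": 3, "contact": "c@example.com", "email_domain": "example.com"},
]
print(infer_column_lineage(customers, report))
# {'customer': 'cust_id', 'contact': 'email'}
```

Note that `email_domain`, which was derived from `email`, gets no link at all. That is the weakness mentioned above: a derived column no longer shares values with its source, so a purely pattern-based match cannot see the relationship.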
Lineage By Parsing
This is the most effective method of lineage tracking. It deals directly with the logic written to perform the necessary transformation. The code logic is automatically read and used to keep track of all the alterations in the data. This form of lineage tracking is complex to implement because it requires knowledge of the programming environment and logic used for the transformations.
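As a very rough sketch of what reading the transformation logic can look like, the snippet below pulls the target and source tables out of a single `INSERT INTO ... SELECT` statement with regular expressions. The table names are invented, and real parsing-based tools use a full SQL grammar; this toy version does not handle subqueries, CTEs, comments, or quoted identifiers.

```python
import re

# Toy patterns: only plain "INSERT INTO <target> SELECT ... FROM/JOIN <source>".
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql):
    """Return (target_table, sorted source tables) for one INSERT ... SELECT."""
    target = TARGET_RE.search(sql)
    if target is None:
        raise ValueError("no INSERT INTO target found")
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return target.group(1), sources


sql = """
INSERT INTO analytics.orders_enriched
SELECT o.id, o.amount, c.country
FROM staging.orders AS o
JOIN staging.customers AS c ON c.id = o.customer_id
"""
print(parse_lineage(sql))
# ('analytics.orders_enriched', ['staging.customers', 'staging.orders'])
```

Even this small example hints at why the approach is hard to implement: every SQL dialect and every scripting language used for transformations needs its own parsing logic.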
How Can Data Leaders Get Started With Data Lineage?
The general approach to implementing a new concept is to first establish the need for it. For data, this is mostly not required because better data infrastructure benefits an organization one way or another. The same applies to data lineage. In 2018, the European Union’s (EU) General Data Protection Regulation (GDPR) came into effect. This data protection law requires organizations to pay close attention to how their data is handled, so it has become more important for leaders to implement lineage. Certain steps are required before data lineage can be put into practice.
Identify Important Data Tables: It is important to first gather information regarding all the data that is stored in its final form. This is important for tracking the lineage.
Track To The Source: For each data table, track all its transformations and intermediary forms. Create tags for all the transformations, the authors of the scripts, and all the storage locations.
Create a Map: Link all the tags to create a map that represents the data lineage throughout the organization.
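The three steps above can be sketched as follows. The `Tag` fields, table names, scripts, and authors are hypothetical placeholders for whatever your own inventory turns up, and the recursive walk assumes the lineage contains no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One recorded transformation: source, target, script, and author."""
    source: str
    target: str
    script: str
    author: str


# Step 1 (hypothetical): the final tables we care about.
FINAL_TABLES = ["warehouse.sales_report"]

# Step 2 (hypothetical): tags collected while tracing each table back.
TAGS = [
    Tag("crm.customers", "staging.customers", "load_customers.py", "dana"),
    Tag("shop.orders", "staging.orders", "load_orders.py", "lee"),
    Tag("staging.customers", "warehouse.sales_report", "build_report.sql", "dana"),
    Tag("staging.orders", "warehouse.sales_report", "build_report.sql", "dana"),
]


def build_map(tags):
    """Step 3: link the tags into a map of target -> list of incoming tags."""
    lineage_map = {}
    for tag in tags:
        lineage_map.setdefault(tag.target, []).append(tag)
    return lineage_map


def trace_to_sources(table, lineage_map):
    """Walk upstream from a table and return the original sources it depends on."""
    incoming = lineage_map.get(table, [])
    if not incoming:
        return {table}  # nothing feeds this table, so it is an origin
    sources = set()
    for tag in incoming:
        sources |= trace_to_sources(tag.source, lineage_map)
    return sources


lineage_map = build_map(TAGS)
for table in FINAL_TABLES:
    print(table, "<-", sorted(trace_to_sources(table, lineage_map)))
# warehouse.sales_report <- ['crm.customers', 'shop.orders']
```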
As vital as it may be, creating a data lineage map is a frustrating and daunting task, especially if your data is widespread.
Best Practices
Data lineage tracking is a tough task; however, certain practices make the overall process easier and more efficient.
Track On-The-Go: Lineage should be maintained as the data moves along the pipeline. Tags should be created as soon as data passes through a certain script. Waiting for the entire procedure to complete can make the tagging process challenging.
Automate Tracking: Keeping track of all the changes is a tough task, so automation is always the best solution. There are multiple tools and cloud services available that perform automatic tracking.
Put Lineage to Use: As lineage is being tracked, it should be used to support other data-related business needs, such as better error handling.
Automating Data Lineage & Its Impact
I know we have talked about the benefits of automated lineage tracking over manual, but in reality, there isn’t any comparison between the two methods. The manual approach is only good for a healthy discussion, but in practice, organizations must always use automated approaches. This is because modern firms manage hundreds of databases, each containing hundreds of tables. Each of these tables undergoes several transformations every day. A manual approach would require considerable resources and would not be feasible for large enterprises.
Automated data lineage tools track data, from all origins, throughout its journey. They extract metadata at each stage and mark the data accordingly. These markings are used to create a map of the data, and this entire map is available to the user for inspection. These tools also allow users to inspect the state of the data at each tag.
This automated tracking brings great benefits to organizations. Since lineage is tracked as the data moves, it can be used for business analysis without wasting any time. It can also be used by teams for constant monitoring of policies. Let’s take a look at some common data lineage tools.
Some Examples of Data Lineage Tools
Dremio: Dremio is a data-centric company that helps organizations structure their data. It is trusted by several industry giants, including 3 of the Fortune 5. It exposes the data stored in the data lake and its lineage to users, which can be queried normally and inspected at will.
Octopai: Octopai is a tool built specifically for data lineage and discovery. It offers three different types of lineage: cross-system lineage, end-to-end column lineage, and inner system lineage.
CloverDX (formerly CloverETL): CloverDX offers a variety of data-related solutions such as migration, warehousing, and quality management. They also offer easy deployments to popular clouds such as AWS and Azure.
Exploring the Untapped Potential of Data
Your data is your key to success if you know how to use it. Like any valuable asset, data requires special care in storage and quality management. Integrate.io is a data platform that helps you with all your data needs, including data warehousing and analysis. We offer hundreds of seamless integrations to establish robust pipelines and create your warehouse. Integrate.io serves a large client base, including some of the world’s leading brands, such as Nike and Salesforce.
Book your consultation, talk to our experts, and join our client base today.