Big data has massive potential, but to harness that potential, data processing teams must understand how to define the contents of their datasets. That process of definition involves identifying the data's key aspects in order to leverage it most effectively. These aspects are commonly known as the 5 Vs of Big Data, though some commentators count as many as 10.
Knowing these Vs for your dataset is the starting point of your data utilization strategy. Once you know this information, you can develop a seamless data pipeline to meet your business objectives.
Volume

The sheer amount of data is where "big data" gets its name. But do you know how much data you actually have and how much you produce? The volume of your data drives decisions about how you will manage and transform that information, whether for the current dataset or on an ongoing, automatic basis. That's particularly important as your business scales and technology develops, and it holds for small businesses and large companies alike, even though they manage very different data volumes today.
You may be able to process the amount of data you have now, but it's smart to think ahead to when your data grows exponentially. As an example, think only about the data that comes from interconnected devices. Imagine how the volume of data will expand when advances increase the number of connected devices from three or four to 20 or 200.
When does volume become a problem?
A quick web search reveals that a decent 10TB hard drive runs at least $300. A petabyte needs 100 of those drives, so that's 100 x $300 = $30,000 in disks. Maybe you'll get a discount, but even at 50% off, you're still at $15,000 in storage costs alone. And if you want to keep a redundant copy of the data for disaster recovery, you'll need twice the disk space. Volume becomes a problem, then, once data grows past the point where local storage devices are an efficient or affordable way to keep it.
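The arithmetic above can be sketched as a tiny cost model. This is an illustration only, using the article's assumed figures (10 TB drives at $300 each, no RAID or controller overhead):

```python
# Assumptions from the text: commodity 10 TB drives at roughly $300 each.
DRIVE_TB = 10
DRIVE_COST_USD = 300

def storage_cost(total_tb, redundancy=1, discount=0.0):
    """Rough cost of storing total_tb with N redundant copies,
    optionally applying a bulk discount. Uses ceiling division so
    partial drives still count as a whole drive."""
    drives = -(-total_tb * redundancy // DRIVE_TB)  # ceiling division
    return drives * DRIVE_COST_USD * (1 - discount)

petabyte = 1000  # 1 PB expressed in TB
print(storage_cost(petabyte))                 # 30000.0
print(storage_cost(petabyte, discount=0.5))   # 15000.0
print(storage_cost(petabyte, redundancy=2))   # 60000.0 with a DR copy
```

Even this toy model makes the point: redundancy doubles the bill, and discounts only soften it.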
Amazon Redshift, a managed cloud data warehouse service from AWS, is one of the popular options for storage. It stores data distributed across multiple nodes, which makes it resilient to disaster and faster for computations than on-premise relational databases like Postgres and MySQL. It is also easy to replicate data from relational databases into Redshift without any downtime.
Velocity

How rapidly can you access your data? Velocity refers to how quickly data comes in and how quickly you can make use of it. Turning data into business intelligence ideally happens in real time, but a number of factors determine how fast you can turn unstructured data into actionable analytics.
Those factors include the efficiency of your data pipeline. For example, in some organizations a data lakehouse is more efficient than a separate data lake and data warehouse, because it combines the functions of both into one system. A data lakehouse can increase the speed at which you can process and use data.
Big data speed has tangible business effects. It's probably best to demonstrate this with an example. A food delivery company may plan to launch a Google AdWords campaign but wants that campaign to reflect potential sales in order to maximize return on investment. Knowing that sports fans order food during big games, the delivery company monitors its sales volume over the first 45 minutes of the match to determine the projected volume and launches the ad campaign while the players are still on the field.
That rapid response requires almost real-time use of big data. It's an almost impossible task without real-time processing capability already in place.
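The mid-match decision in the delivery example boils down to a simple extrapolation. The numbers and threshold below are hypothetical; the point is that the calculation happens while the match is still running:

```python
def projected_orders(orders_so_far, minutes_elapsed, match_minutes=90):
    """Linear extrapolation of order volume over a full match."""
    return orders_so_far / minutes_elapsed * match_minutes

# Hypothetical figures: 1,200 orders in the first 45 minutes.
THRESHOLD = 2000  # assumed break-even order volume for the ad spend
launch_campaign = projected_orders(1200, 45) >= THRESHOLD
print(launch_campaign)  # True: launch while players are still on the field
```

A real pipeline would feed `orders_so_far` from a streaming source rather than a constant, but the decision logic is this simple.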
When does velocity become a problem?
High-velocity data sounds great because velocity x time = volume, volume leads to insights, and insights lead to money. However, this path to growing revenue is not without its costs.
Many questions arise. How do you inspect every packet of data that comes through your firewall for maliciousness? How do you process such high-frequency structured and unstructured data on the fly? Moreover, high-velocity data almost always means large swings in the amount of data processed every second; tweets on Twitter are far more active during the Super Bowl than on an average Tuesday. How do you handle that?
Fortunately, "streaming data" solutions have cropped up to the rescue. The Apache organization offers popular options such as Spark and Kafka: Spark handles both batch processing and stream processing, while Kafka runs on a publish/subscribe mechanism. Amazon Kinesis is another solution, with a set of related APIs designed to process streaming data, and Google Cloud Functions (Google Firebase also has a version of this) is a popular serverless function API. These are all great black-box solutions for managing complex processing of payloads on the fly, but they all require time and effort to build data pipelines.
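Under the hood, the core operation these streaming frameworks perform is windowed aggregation: grouping events into fixed time buckets as they arrive. Here is a single-process sketch of that idea in plain Python; real systems like Spark Streaming or Kinesis do the same thing distributed and fault-tolerantly:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=1):
    """Group (timestamp, payload) events into fixed-width windows
    and count events per window. A toy version of the tumbling-window
    aggregation that streaming engines perform at scale."""
    counts = defaultdict(int)
    for ts, _payload in events:
        counts[int(ts // window_seconds)] += 1
    return dict(counts)

events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.9, "d"), (2.95, "e")]
print(tumbling_window_counts(events))  # {0: 2, 1: 1, 2: 2}
```

The hard parts the frameworks solve, and this sketch ignores, are out-of-order arrival, backpressure during traffic spikes, and distributing the counts across machines.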
Now, if you don’t want to deal with the time and expense of creating your own data pipeline, that’s where something like FlyData could come in handy. FlyData seamlessly and securely replicates your Postgres, MySQL, or RDS data into Redshift in near real-time.
Variety

Data typically comes from a number of different sources, and that results in variety: your data may be structured, semi-structured, or unstructured. Developing consistency is an essential step before, and sometimes during, the data transformation process. Ensuring consistency is crucial when accessing your data from different sources, specifically data lakes (typically unstructured), data warehouses (typically structured), and data lakehouses.
When does variety become a problem?
When consuming a high volume of data, the data can arrive in many formats (JSON, YAML, xSV (x = C(omma), P(ipe), T(ab), etc.), XML) before you can massage it into a uniform type to store in a data warehouse. Processing becomes even more painful when the data columns or keys are not guaranteed to exist forever, for example when an API renames, introduces, or deprecates keys. So not only are you trying to squeeze a variety of data types into a uniform one, but the types themselves can also change over time.
One way to deal with a variety of data types is to record every transformation milestone applied along your data processing pipeline. First, store the raw data as-is in a data lake (a hyper-flexible repository of data collected and kept in its rawest form, like Amazon S3 file storage). Then transform that raw data, with all its different types, into an aggregated and refined state that can be stored in another location inside the data lake, and later loaded into a relational database or a data warehouse for data management.
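A common piece of such a transformation step is mapping records with drifting key names onto a fixed schema. The alias table and field names below are hypothetical, purely to illustrate the pattern:

```python
# Hypothetical aliases: upstream APIs renamed "userName" -> "user_name"
# and "ts" -> "timestamp" at different points in time.
ALIASES = {"userName": "user_name", "user": "user_name", "ts": "timestamp"}
SCHEMA = ["user_name", "timestamp", "amount"]

def normalize(record):
    """Map a raw record with varying key names onto a fixed schema,
    filling columns missing from this record with None."""
    renamed = {ALIASES.get(k, k): v for k, v in record.items()}
    return {col: renamed.get(col) for col in SCHEMA}

print(normalize({"userName": "ada", "ts": 1}))
# {'user_name': 'ada', 'timestamp': 1, 'amount': None}
```

Keeping the raw record in the lake means you can re-run `normalize` with an updated alias table whenever the upstream API changes again.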
Veracity

The strength of your data leads to confidence in the dataset. Veracity refers to the trustworthiness and importance of the data source, the reliability of the information, and its relevance to your business case. Although veracity may sound similar to accuracy, it is about more than just the number of errors in your raw dataset: it's about the quality of the data you are about to run through your transformation pipeline.
Veracity can change from organization to organization. A data source may have high veracity if it has a proven track record, or low veracity if it's unknown or has a less enviable record. For example, a business may learn there is a strong correlation between consumers who buy a certain product and the likelihood that those customers will sign up for an additional training program. If the end goal of your big data processing is to boost the training program business, that list of customers has high veracity for the marketing campaign.
When does veracity become a problem?
Consider the case of tweets on Twitter, which use hashtags, uncommon slang, abbreviations, typos, and colloquial speech. All of this data carries a lot of messiness, or noise, and as the volume of data increases, the noise increases with it, sometimes exponentially. That noise reduces overall data quality, affecting both data processing and the later management of the processed data.
If the data is not sufficiently trustworthy, it becomes important to extract only high-value data; it doesn't always make sense to collect everything you can, because doing so is expensive and takes more effort. Filter noise out of the data as early as possible in the pipeline, ideally during extraction. This leaves only the required, trustworthy data, which can then be transformed and loaded for data analytics.
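One simple way to filter noisy tweets early is to score how much of each tweet is plain text versus hashtags, mentions, and links, then drop low-signal records. The threshold and the notion of "noise" here are illustrative assumptions, not a production heuristic:

```python
import re

def signal_ratio(text):
    """Fraction of tokens that are plain words, i.e. not hashtags,
    @-mentions, or URLs."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noise = sum(1 for t in tokens if re.match(r"^[#@]|^https?://", t))
    return 1 - noise / len(tokens)

def filter_trustworthy(tweets, threshold=0.5):
    """Keep only tweets whose word content dominates the noise."""
    return [t for t in tweets if signal_ratio(t) >= threshold]

tweets = ["#win #goal #match", "great match tonight @ref #football"]
print(filter_trustworthy(tweets))
# ['great match tonight @ref #football']
```

Running this during extraction means the downstream transform and load stages only ever see the higher-value records.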
Value

Before you embark on your data transformation process, you should know if it's worth it. What does this data ultimately provide? Be prepared to distinguish between "nice to have" and "essential" information. While the "nice to have" can offer some return on investment, it's best to focus your data strategy on what's going to give you the best payoff according to your business objectives.
Consider the case of Netflix, where user viewing and browsing data is gathered from different sources, then extracted and transformed inside the data processing pipeline to generate only high-value information, such as user interests, that powers useful recommendations. This, in turn, helps Netflix reduce churn and attract even more users to its platform. Had the recommendations not satisfied users, the information generated would have been of low value. Hence, the value of big data shapes many business decisions and provides a competitive advantage.
A Few Extra Vs
Although those are the 5 Vs of big data, many commentators add a few more. These are additional elements to assess in your dataset before embarking on a new project: variability, visualization, validity, vulnerability, and volatility.
Variability

In addition to differences between structured and unstructured data, not all of your data behaves the same way. It may upload at different speeds. It will almost certainly contain different data types. It may include stray pieces of information that don't fit into a typical framework. Understanding the nature and extent of this variability helps you plan for data processing.
Visualization

Big data is one thing, but knowing what that data represents is quite another. If your dataset is typical, you may have millions or billions of pieces of information. That should translate into a picture that makes sense to users within your organization. Ask yourself how easy it is to transform your raw data into a visualization that is relevant and actionable.
The easiest way to understand this concept is to acknowledge the limitations of traditional visualization techniques when it comes to big data. It is easy to plot a simple, small dataset with a simple visualization strategy, using a standard software tool such as an Excel spreadsheet. An example is a graph tracking a stock price over a period of time. There are only two variables, date and price, which makes for a quick-and-easy graph.
When it comes to big data, you could have much more than just two points of data. You could easily have billions of relevant points. But even this can be made visual, with some work. Instead of a graph, one could use a treemap or cluster data into subsets to provide an accurate, and usable, picture.
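Clustering points into subsets usually means aggregating before plotting: billions of raw values are reduced to a handful of bins whose counts capture the shape of the data. A minimal sketch of that pre-aggregation step:

```python
def bin_points(values, bins=10, lo=None, hi=None):
    """Aggregate a large set of values into histogram bins so the
    overall shape can be plotted instead of every raw point."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value
        counts[idx] += 1
    return counts

data = [i % 100 for i in range(1_000_000)]  # a million synthetic points
print(bin_points(data, bins=10))  # ten buckets of 100,000 each
```

The same idea generalizes to treemaps: aggregate first, then hand the small summary to the visualization tool.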
Validity

One could view validity as having a slightly narrower meaning than veracity. It refers to the amount of erroneous information you will have to remove or fix during the data transformation process. Data accuracy is directly connected to the amount of time you will spend cleaning your data.
To determine how clean or dirty your data may be, you can analyze a small sampling. This may be done manually. In that case, a data scientist reads the data in order to determine its level of validity. Some organizations also have AI-powered data "scrubbing" solutions that use intelligent suggestions to uncover and remove probable errors.
After you have analyzed a data sample, you can make reasonable judgments. For example, data that comes from your Salesforce database may be less riddled with errors or false information than user-generated information coming from a customer-facing e-commerce website.
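Estimating validity from a sample can be as simple as validating a random subset and extrapolating the error rate. The validation rule below (a non-empty email field) is a hypothetical stand-in for whatever checks fit your data:

```python
import random

def estimated_error_rate(records, validate, sample_size=100, seed=42):
    """Estimate the fraction of dirty records by validating a random
    sample instead of scanning the entire dataset."""
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    dirty = sum(1 for r in sample if not validate(r))
    return dirty / len(sample)

# Hypothetical dataset: 90 clean records, 10 with a missing email.
records = [{"email": "a@b.com"}] * 90 + [{"email": ""}] * 10
rate = estimated_error_rate(records, lambda r: bool(r["email"]))
print(rate)  # 0.1
```

On a billion-row dataset you would keep `sample_size` small and accept some sampling error; the point is to get a cleaning-effort estimate cheaply before committing to a full scrub.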
Vulnerability

Your security team should already be on top of data security. They may be able to tell you whether a particular dataset, or particular stops along your data pipeline, are especially vulnerable to cyber attack. Security concerns are especially important if your dataset includes personal and private customer information and therefore falls under specific legal regimes.
Volatility

There are many ways to define "volatile," but for these purposes it comes down to: when does the data go bad? Data is precious, and older information may simply need to be archived. But it may also become stale, out-of-date, irrelevant, or incorrect. Keeping stale or irrelevant information, when it's not clearly identified as such, comes with many risks, from remarketing to a customer at an old address to planning a product rollout based on outdated demographic or sales information.
Thankfully, you can prevent these kinds of misuse of big data. Carefully review the age and relevance of your data and decide upon its expiration date and what you want to do with the older information.
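Deciding on an expiration date can be enforced mechanically: partition records by age against an agreed cutoff, then archive or flag the stale half. The field name `updated_at` and the 90-day cutoff are illustrative assumptions:

```python
from datetime import datetime, timedelta

def split_by_freshness(records, max_age_days, now=None):
    """Partition records into (fresh, stale) by an agreed maximum age.
    Stale records can then be archived or flagged rather than
    silently reused in campaigns or analytics."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=max_age_days)
    fresh = [r for r in records if r["updated_at"] >= cutoff]
    stale = [r for r in records if r["updated_at"] < cutoff]
    return fresh, stale

now = datetime(2024, 1, 1)
records = [
    {"id": 1, "updated_at": datetime(2023, 12, 20)},  # recent
    {"id": 2, "updated_at": datetime(2022, 6, 1)},    # well past cutoff
]
fresh, stale = split_by_freshness(records, max_age_days=90, now=now)
print([r["id"] for r in fresh], [r["id"] for r in stale])  # [1] [2]
```

Running a job like this on a schedule turns the "expiration date" decision into routine hygiene instead of a one-off cleanup.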
Why You Should Know the 5 Vs of Big Data
These elements of big data are more than an intellectual exercise. Knowing these factors is key to processing your data efficiently. Specifically, the Vs can help you find the right tools to manipulate your data, develop workflows based on new data, and set guidelines that maintain data reliability. This ensures that your big data does what you want it to do: provide the analytics and business intelligence you need to make strategic, profitable decisions.
Finding the right tools leads you to develop an optimal data pipeline. Your data pipeline will include ETL or ELT. Integrate.io is one of the market's most trusted platforms to complete this process. To learn how Integrate.io can improve your data pipeline through a user-friendly interface and superior platform features, contact us to schedule a demo.