What is Data Mining?

Data mining is the process of exploring large data sets to reveal previously undiscovered patterns, correlations, and rules. It is also known as knowledge discovery and is often carried out before data analytics.

Data mining usually happens before data analytics. Mining can uncover patterns that the business was not previously aware of, such as a correlation between two variables. Analytics can then be used to test a hypothesis based on the insights gleaned from mining.
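As a rough illustration of the kind of pattern mining can surface, the sketch below computes the Pearson correlation coefficient between two variables in plain Python. The variable names and the weekly figures are invented for this example and do not come from any real data set. A coefficient near 1 or -1 would only suggest a relationship worth testing in the analytics stage; it does not establish cause.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly figures, invented for illustration only.
ad_spend = [10, 20, 30, 40, 50]
units_sold = [12, 25, 31, 44, 48]

r = pearson_correlation(ad_spend, units_sold)
print(f"correlation between ad spend and units sold: {r:.2f}")
```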

What is the Data Mining Process?

Data mining is a complex process involving statistics, machine learning, and database techniques.

One of the most common process models for data mining is CRISP-DM (Cross-Industry Standard Process for Data Mining), an open standard that has been in use since the 1990s. This model breaks data mining down into six steps:

1. Business Understanding

At the initial stage, stakeholders discuss the scope and objectives of the data mining project. This conversation helps to identify which data sources to use, what business outcomes are required, and what resources will be made available to the data mining team.

2. Data Understanding

Next comes a phase of data exploration: a high-level examination of the available data sources. During this phase, promising trends are highlighted, and these become the targets for later mining. Tools such as Tableau or Grapher can help to perform this initial analysis.
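Where a visual tool is not to hand, a first look at a data source can also be taken in code. The following is a minimal sketch in plain Python that profiles one column at a time: how many values it has, how many are missing, and either its numeric range or its number of distinct values. The `profile_column` helper and the sample rows are hypothetical and exist only for this illustration.

```python
from statistics import mean

def profile_column(rows, column):
    """Summarize one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric and len(numeric) == len(present):
        # Every non-missing value is numeric: report range and average.
        summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    else:
        # Otherwise treat the column as categorical.
        summary["distinct"] = len(set(present))
    return summary

# Hypothetical sample rows, invented for illustration only.
rows = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 30},
    {"region": "north", "amount": None},
    {"region": "east", "amount": 20},
]

for column in ("region", "amount"):
    print(column, profile_column(rows, column))
```

A column with many missing values or an unexpected range is a signal to carry into the data preparation step.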

3. Data Preparation

Data is prepared as required to facilitate mining. This can include:

Data cleansing: Errors, duplicates, and other problematic values are removed from the data.
Data integration: Multiple, disparate sources are unified into a single source.
Data harmonization: Data is converted into a pre-defined schema.

This stage may pass through an ETL (extract, transform, load) layer to automate the data preparation process. An ETL platform like Integrate.io can prepare data from most common sources without the need for manual intervention.
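To make the cleansing, integration, and harmonization tasks described above more concrete, here is a minimal sketch in plain Python. It is not the API of Integrate.io or of any other ETL platform; the function names, field names, and sample records are all assumptions made for this example. Each source is first mapped onto a shared schema, the sources are then combined, and finally incomplete or duplicate rows are dropped.

```python
def harmonize(row, field_map):
    """Rename source fields to the target schema and coerce amount to float."""
    out = {target: row.get(source) for source, target in field_map.items()}
    if out.get("amount") is not None:
        out["amount"] = float(out["amount"])
    return out

def integrate(*sources):
    """Combine several lists of rows into a single list."""
    return [row for source in sources for row in source]

def cleanse(rows):
    """Drop rows with a missing id or amount, and repeated ids (first one wins)."""
    seen, cleaned = set(), []
    for row in rows:
        if row.get("id") is None or row.get("amount") is None:
            continue
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append(row)
    return cleaned

# Two hypothetical sources with different field names, invented for illustration.
crm = [{"customer_id": 1, "total": "19.99"}, {"customer_id": 2, "total": None}]
shop = [{"id": 1, "amount": 19.99}, {"id": 3, "amount": 5}]

crm_rows = [harmonize(r, {"customer_id": "id", "total": "amount"}) for r in crm]
shop_rows = [harmonize(r, {"id": "id", "amount": "amount"}) for r in shop]

prepared = cleanse(integrate(crm_rows, shop_rows))
print(prepared)
```

Harmonizing before integrating is a design choice in this sketch: duplicates can only be detected reliably once every source uses the same field names and types.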

4. Modeling

The data mining team will try a number of models to explore the available data. These models can include:

Linear regression: Fitting a linear relationship between one or more input variables and a numeric outcome, and then using that relationship to predict future values
Decision trees (including regression trees): A modeling technique that splits the data through a series of branching decisions, often binary, to reach a prediction
Neural networks: Machine learning models built from layers of connected nodes, trained by passing over the data repeatedly and adjusting their weights so that prediction error gradually falls

To test these models effectively, it may be necessary to revisit the data preparation process, in which case the data mining process moves back to step three.
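As an illustration of the first model type listed above, the sketch below fits a simple linear regression with one input variable using ordinary least squares, written in plain Python. The `fit_line` and `predict` helpers and the monthly sales figures are hypothetical, and the data has been made perfectly linear so that the result is easy to follow; real data would leave some error.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when every x value is the same")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical monthly sales, invented and deliberately linear for illustration.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

model = fit_line(months, sales)
forecast = predict(model, 6)
print(f"slope={model[0]:.1f}, intercept={model[1]:.1f}, month 6 forecast={forecast:.1f}")
```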

5. Evaluation

The results of each model are assessed to find the most appropriate candidate. Models must meet the following criteria:

Predictive: The model can make predictions from the available data
Accurate: Insights derived from the model must correspond with the data
Relevant: The model must produce results that deliver the agreed-upon business objectives

If no candidate model meets these criteria, the process may move back to step four, or back to step three if further data preparation is required.
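The evaluation loop described above can be sketched as follows. This example scores each candidate on held-out data using mean absolute error and returns the best one only if it stays within an agreed error threshold; returning `None` stands for the case where the process moves back to an earlier step. It covers only the accuracy criterion, since relevance to the business objectives is a judgment for stakeholders. The candidate models, the held-out figures, and the `max_error` threshold are assumptions made for this illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def select_model(candidates, xs, ys, max_error):
    """Score each candidate on held-out data; return (best name or None, scores)."""
    scores = {
        name: mean_absolute_error(ys, [model(x) for x in xs])
        for name, model in candidates.items()
    }
    best = min(scores, key=scores.get)
    # None signals that no candidate is accurate enough to deploy.
    return (best if scores[best] <= max_error else None), scores

# Hypothetical candidates and held-out data, invented for illustration.
candidates = {
    "trend": lambda x: 10 * x + 100,  # a fitted linear trend
    "flat": lambda x: 130,            # a naive constant baseline
}
holdout_months = [6, 7, 8]
holdout_sales = [160, 170, 185]

chosen, scores = select_model(candidates, holdout_months, holdout_sales, max_error=5)
print("chosen:", chosen)
print("scores:", scores)
```

Including a naive baseline such as `flat` among the candidates is a common sanity check: a model that cannot beat it is unlikely to be worth deploying.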

6. Deployment

The data mining model is deployed and put to work against the available data. The results should fulfill the project’s objectives and deliver insights that can inform the next steps in the organization’s data analytics strategy.

How is Data Mining Performed?

Steps 3 to 6 of the CRISP-DM model are generally only needed when data scientists are building their own mining algorithms.

In enterprise usage, the data team will generally use a Business Intelligence (BI) platform to perform data mining. Common platforms include Tableau Server, Looker, Amazon QuickSight, and Microsoft Power BI.

These platforms can also help with insight refinement and visualization. Data mining should ultimately produce something that is useful to the business, such as a new trend or an interesting correlation.

The final step of data mining is to present insights to relevant stakeholders. These interested parties will then decide:

Whether the data mining project met the stated goals
Whether the correct data sources were used, or whether other data sources should have been included
Whether the mining results represent new knowledge or fit with the existing understanding of the business
Whether the data team should proceed to perform more in-depth analytics

In some instances, further mining work might be required. This may involve returning to the beginning of the CRISP-DM process.

Frequently asked questions

How does data mining relate to data analytics?

Data mining usually happens before data analytics. Mining uncovers patterns a business might not be aware of, such as a correlation between two variables, and analytics can then test a hypothesis based on the insights that mining reveals.

What is the CRISP-DM model in data mining?

CRISP-DM is a common open-standard process model for data mining that has been in use since the 1990s. It breaks data mining into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment, with the process sometimes looping back to earlier steps.

What tools are used to perform data mining?

In enterprise use, data teams generally perform data mining with a business intelligence platform such as Tableau Server, Looker, Amazon QuickSight, or Microsoft Power BI. These platforms also help with insight refinement and visualization before results are presented to stakeholders.
