Data Mining

In a previous post, I discussed a framework that can be used to link the key financial processes in a business with Fintech solutions.

Data mining is part of this framework, and I would like to explore it a bit more in this post. I briefly touched on some aspects of data mining in my previous post, but I will discuss them more extensively here.

What is data mining?

Sometimes people call data “the new gold”. Whether that is true depends on what you do with it. Having a lot of data means little if you can’t use it: you first have to search it for something of value.

Let’s stay with the gold analogy. Digging for gold is not easy. You can own a goldmine, but the gold is not immediately visible. You need to actively mine for it, and in some cases you have to dig long and deep to find any. Data mining works the same way. It focuses on finding relevant datasets and information that you can then use for analytical and predictive purposes relevant to your business.

When you gather the data, it is still raw and you don’t know whether it is useful (like the big chunks of stone in which you suspect there is gold somewhere). To be able to use it, you need to refine the data and extract the bits of gold. This refining of raw data is also a part of data mining.

Useful data mining applications

Data mining can be applied in, for instance, the following business areas:

  • Sales and Marketing. Nowadays, organizations collect massive amounts of data about their customers and prospects. If you observe the online behavior and demographics of consumers, you can use that data to make your marketing campaigns more effective, refine your market segmentation and find more cross-sell opportunities. You can also use predictive analyses to set expectations/targets for your team.
  • Education. Data is used to understand student populations and the success rates of students. More and more courses are offered online, and these generate data more easily than traditional classes. Data such as the number of keystrokes, student profiles, classes and time spent can then be analyzed and turned into useful information.
  • Operational excellence (optimization). Process mining uses data mining techniques to reduce costs across operational functions (production, supply chain, etc.). This enables your business to perform more efficiently. It helps to identify costly bottlenecks in your process and can improve the level of your decision-making. 
  • Fraud detection. Anomalies in data can be very useful in detecting fraud. Banks and other financial institutions use this frequently, but it can also be used in cybersecurity. Anomaly detection can, for instance, flag fake user accounts in a database, which can then be removed.
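To make the fraud detection example a bit more concrete, here is a minimal sketch of how an anomaly could be flagged. It assumes a simple statistical rule (the modified z-score, based on the median) rather than whatever method a bank would actually use. The function name `flag_anomalies`, the threshold of 3.5 and the transaction amounts are all hypothetical and chosen for illustration only.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a measure of spread that outliers barely influence.
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        return []  # all values are (nearly) identical, nothing to compare against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, x in enumerate(amounts)
            if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical transaction amounts for one account (illustrative data only).
transactions = [52.0, 48.5, 50.0, 49.0, 51.5, 950.0, 47.0, 53.0]
suspicious = flag_anomalies(transactions)
print(suspicious)  # indices of the transactions worth a closer look
```

A flagged transaction is not proof of fraud. It is only a signal that someone should take a closer look.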

The data mining toolset

Like a miner digging for gold, a data miner needs a set of tools to find the gold (patterns in data) that helps answer your business questions and predict future trends for your business. Because there is so much data around us nowadays, it is very hard to find useful data manually. Fortunately, there are tools that can help you work through the mass of (raw) information to find the gold you are searching for.

The data that you “mine” can be analyzed through statistical data analysis, supported by data mining algorithms. It makes sense that statistics is the right tool for mining data: it is the science that has always involved data collection, data interpretation and data validation. It has become more and more important because of the huge amounts of data that are generated nowadays, a result of the big technological advancements of the past decades.

With statistical data analysis, you perform various statistical operations on the raw information you receive. This is done by a data scientist. If a vast amount of data is involved, it is not advisable to do the analysis completely manually. In that case, it is better to use software as a complementary tool. Well-known data mining platforms include SAS Visual Data Mining and Machine Learning, IBM Watson Discovery, RapidMiner Studio, and Alteryx.

The 5 steps of data mining

There are 5 primary steps of data mining. These are also the key steps in any statistical data analysis process:

Step 1: The identification of business issues to analyze data sources

This concerns, for instance, analyzing what specific kind of data you want to have available in databases and operational systems. Setting these business objectives can be very hard, and many businesses spend too little time on this step (which is in my opinion the most important one). They start collecting data without thinking about what they actually want to use it for. A data scientist/data analyst and a business stakeholder need to work together to define the business problem that data mining needs to solve. This can only be done by defining the information that is required and the right parameters. Sometimes data scientists/analysts have to do additional research to understand the business context (and the applicable business processes) properly.

Step 2: The collection and exploration of data

This also includes the sampling and profiling of the data sets. There are many ways to collect data. It can be done in a traditional way (interviews, focus groups, questionnaires, schedules and observations) or, as is mostly done nowadays, by using computers (data obtained through, for instance, Facebook, Google, etc.).
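As an illustration of what sampling and profiling can look like in practice, here is a minimal sketch in plain Python. The `orders` data set, the `amount` column and the `profile` function are hypothetical; real projects would usually rely on a dedicated data analysis tool for this.

```python
import random
from statistics import mean


def profile(rows, column):
    """Summarize one column: row count, missing values, distinct values and mean."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "mean": mean(present) if present else None,
    }


# Hypothetical order data: every tenth order has no amount recorded.
orders = [
    {"order_id": i, "amount": None if i % 10 == 0 else 20 + (i % 7)}
    for i in range(1, 101)
]

# Sampling: take a random subset to inspect before processing everything.
rng = random.Random(42)  # fixed seed so the sample is reproducible
sample = rng.sample(orders, k=20)

# Profiling: get a first impression of the quality of the data.
summary = profile(orders, "amount")
print(summary)
```

A profile like this quickly shows how many values are missing, which tells you how much cleaning the next step will require.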

Step 3: The preparation and transformation of data

Once the problem has been defined, it is easier for a data scientist and/or a data analyst to identify what kind of data set will help to answer the relevant business questions. When the required (raw) datasets have been acquired, the data must be cleaned. This means you refine the data by removing any noise (duplicates, corrupt data, incomplete values, etc.). Depending on the amount of data that has been gathered, an extra step may be needed: reducing the number of dimensions of the dataset, because too many features can slow down any follow-up computations. A data scientist or data analyst will try to keep the most important predictors, so that the data models lose as little accuracy as possible.
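The cleaning described above can be sketched as follows. This is a simplified illustration and not a complete cleaning pipeline: the functions `clean` and `drop_constant_columns` and the customer records are hypothetical. Dropping columns that never vary is only a very crude stand-in for real dimensionality reduction, which would look at how much each feature contributes to the prediction.

```python
def clean(rows, required):
    """Remove rows with incomplete values and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Incomplete: a required field is missing or empty.
        if any(row.get(col) in (None, "") for col in required):
            continue
        # Duplicate: the same combination of required fields was already kept.
        key = tuple(row.get(col) for col in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def drop_constant_columns(rows):
    """Drop columns that have the same value in every row (they carry no signal)."""
    if not rows:
        return rows
    keep = [col for col in rows[0] if len({row[col] for row in rows}) > 1]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical raw customer records.
raw = [
    {"customer": "A", "country": "NL", "spend": 120.0},
    {"customer": "A", "country": "NL", "spend": 120.0},  # duplicate
    {"customer": "B", "country": "NL", "spend": None},   # incomplete
    {"customer": "C", "country": "NL", "spend": 80.0},
]

cleaned = clean(raw, required=["customer", "country", "spend"])
reduced = drop_constant_columns(cleaned)
print(reduced)
```

In this example the `country` column disappears because every remaining record has the same value, so it cannot help to distinguish one customer from another.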

Step 4: The modeling of data and pattern mining

Data scientists or business analysts can investigate interesting relationships in the data, such as sequential patterns, correlations and association rules.

Association rule mining is a rule-based method for finding relationships between different variables in a specific dataset. Association rules are mainly used in market analysis to better understand the relationships between different products. If you understand the consumption habits of your customers better, your business can develop better cross-selling strategies.
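A rule such as “customers who buy bread also buy butter” is usually judged with three measures: support (how often both items appear together), confidence (how often the rule holds when the first item is present) and lift (how much more often the items appear together than you would expect by chance). Below is a minimal sketch of how these can be calculated. The function `rule_metrics` and the shopping baskets are hypothetical and only meant as an illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)           # baskets containing A
    count_c = sum(1 for t in transactions if c <= t)           # baskets containing C
    count_both = sum(1 for t in transactions if (a | c) <= t)  # baskets containing both

    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    # Lift above 1 means the items occur together more often than by chance.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

For these baskets the rule has a support of 0.60, a confidence of 0.75 and a lift of about 1.25, which suggests that bread and butter are a reasonable cross-selling pair.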

Sometimes the deviations in the data are more interesting, because they can expose potential risks and potential fraud. Deep learning algorithms can be used to classify or cluster a data set (which of the two depends on whether labeled data is available). If the input data is labeled (supervised learning), a classification model can be used to categorize data. If the data set isn’t labeled (unsupervised learning), the data points in the training set are compared with one another to discover underlying similarities, and they are clustered based on those characteristics.
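To show the difference between the two approaches, here is a small sketch. Note that it does not use deep learning: it swaps in two much simpler techniques, a nearest-centroid classifier for the labeled case and k-means clustering for the unlabeled case. All names and the customer data (visits per month, average basket value) are hypothetical.

```python
from statistics import mean


def centroid(points):
    """Average position of a group of points."""
    return tuple(mean(axis) for axis in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def fit_nearest_centroid(labeled):
    """Supervised: learn one centroid per known label."""
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in groups.items()}


def predict(model, point):
    """Assign the label whose centroid is closest to the point."""
    return min(model, key=lambda label: sq_dist(point, model[label]))


def k_means(points, initial_centroids, iterations=10):
    """Unsupervised: group points around centroids without using any labels."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points
        ]
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = centroid(members)
    return assignment


# Labeled data (supervised): (visits per month, average basket value) and a label.
labeled = [((1, 20), "occasional"), ((2, 25), "occasional"),
           ((9, 80), "loyal"), ((10, 90), "loyal")]
model = fit_nearest_centroid(labeled)
label = predict(model, (8, 70))

# The same points without labels (unsupervised).
unlabeled = [point for point, _ in labeled]
clusters = k_means(unlabeled, initial_centroids=[(1, 20), (10, 90)])
print(label, clusters)
```

In the supervised case the model returns a business label for a new customer. In the unsupervised case it only returns group numbers, and it is up to the analyst to interpret what each group means.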

Step 5: Deploying the data models and evaluating the results

Once the models have been built, the results can be evaluated and interpreted. Before these results are finalized, you need to be confident that they are valid, unique, useful, and that they make sense. When all these criteria are met, the business can use this knowledge to implement new strategies and achieve its intended objectives (this moves you back to Step 1, which is the most critical step of the process).
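One common way to check whether the results of a model are valid is to compare its predictions with known outcomes that were not used to build it. The sketch below assumes a hypothetical fraud model and calculates three widely used measures: accuracy, precision and recall. The function `evaluate` and the data are illustrative only.

```python
def evaluate(actual, predicted):
    """Compare predictions with known outcomes (True = fraud, False = no fraud)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)       # fraud correctly flagged
    fp = sum(1 for a, p in pairs if not a and p)   # false alarms
    fn = sum(1 for a, p in pairs if a and not p)   # fraud that was missed
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical held-out cases: what really happened vs. what the model predicted.
actual = [True, False, False, True, False, True, False, False]
predicted = [True, False, True, False, False, True, False, False]

scores = evaluate(actual, predicted)
print(scores)
```

Which measure matters most is a business decision. For fraud, a missed case (low recall) is often more expensive than a false alarm (low precision), which is one more reason to involve the business stakeholder from Step 1.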

Why is data mining useful?

Examples of what a well-designed data mining process can do for your company:

  • Improving lead conversion rates in sales and marketing
  • Building risk models and detecting fraud in finance
  • Improving safety and identifying quality issues in QHSE (Quality, Health, Safety and Environment)
  • Effectively managing supply chain operations and manufacturing 

Final thoughts

As you can see, data mining is not as easy as it looks. It can be quite a time-consuming and costly undertaking to generate the right data models that can be applied to the pre-formulated business objectives of an organization. In my opinion, without a structured process (the application of the 5 steps), the quality of your data will suffer.

This is why, in my opinion, data mining is not something you do now and then. You need to make it an integral part of the business, with dedicated resources: it’s serious science. In addition to the tools, you need dedicated support from the specialists who use these tools: data scientists and data analysts. If you don’t do that, you are wasting money because you don’t get the most out of it, and you run the risk that the data is wrongly interpreted or used. That can lead to wrong decisions with a dramatic (negative) effect on your business in the medium and/or long term.

Contact me with any questions, tips and/or requests for advice relating to data mining. If you want to stay in the loop when I upload a new post, don’t forget to subscribe to receive a notification by e-mail.

Gijs Groenland

I live in San Diego, USA together with my wife, son, and daughter. I work as Chief Financial and Information Officer (CFIO) at a mid-sized company.
