Data mining is a combination of database and artificial intelligence technologies. Although the AI field has taken a major dive in the last decade, this newly emerging field has shown that AI can make major contributions to existing fields in computer science. In fact, many experts believe that data mining is the third hottest field in the industry, behind the Internet and data warehousing.
Data mining is really just the next step in the process of analyzing data. Instead of answering queries on standard or user-specified relationships, data mining goes a step further by finding meaningful relationships in the data: relationships that were not thought to exist, or ones that give a more insightful view of the data. For example, a computer-generated graph may not give the user any insight, whereas data mining can find trends in the same data that show the user more precisely what is going on, including trends that the end user would never have thought to query the computer about. Without adding any more data, data mining greatly increases the value the database provides. It allows both technical and non-technical users to get better answers and make much more informed decisions, which can save their companies millions of dollars.
"Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques" (SPSS). More fundamentally, data mining turns databases into knowledge bases, which are one of the fundamental components of expert systems. Instead of blindly pulling data from a database, the computer is able to take all the data and interpret it, which is a huge step. Were it not for existing AI technologies, this field could not have emerged as quickly, if at all.
Data mining allows companies to focus on the most important information in their data warehouses. It can be broken down into two major categories: automated prediction of trends and behaviors, and automated discovery of previously unknown patterns. In the first category, data mining automates the process of finding predictive information in large databases. Questions that traditionally required exhaustive hands-on analysis can now be answered quickly and directly from the data. In the second category, data mining tools sweep through databases and identify previously hidden patterns in one step. This second category is where the major focus of research has been.
"Data mining is a rather new term for a challenge that has been growing for many years: how to scan very large databases to retrieve the high level conceptual information of the greatest interest" (Lindsay). With the advances in data acquisition and storage technologies, the problem of how to turn measured raw data into useful information has become an important one. Having reached sizes that defy even partial examination by humans, the data volumes are swamping users. For example, large US retail chains now mine their databases with sophisticated data mining programs to look for general trends and geographic clustering in purchases that are not easily visible in the huge multitude of products and sales.
Data mining evolved out of earlier ways of searching through data for useful business information. This evolution has four major steps: Data Collection, Data Access, Data Warehousing & Decision Support, and finally Data Mining (Pilot).
Data Collection started in the 1960s. It was a static data delivery system built on pulling information from computers, tapes, and disks. A sample question would be: What was the total revenue in the last five years? Data Access was the next step, and it started in the 1980s. It allowed dynamic data delivery at the record level, mainly through relational databases queried with SQL. A sample query would be: What were unit sales in Florida last October? Then in the 1990s came Data Warehousing and Decision Support, which allowed dynamic data delivery at multiple levels. This technology came about because of multidimensional databases and on-line analytic processing (OLAP). It lets the query above drill down as far as city-by-city figures within Florida. Finally came Data Mining, which allows proactive information delivery. Data mining uses advanced AI algorithms, multiprocessor computers, and massive databases. With data mining, a person can ask questions such as: What is likely to happen to Florida unit sales next month, and why? (Pilot).
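The difference between the Data Access and Decision Support steps can be sketched with a small query example. This is only an illustration: the table name `sales`, its columns, and every row in it are invented here and do not come from any of the cited sources.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, city TEXT, sale_month TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("FL", "Miami", "1996-10", 120),
        ("FL", "Tampa", "1996-10", 80),
        ("FL", "Miami", "1996-09", 95),
        ("GA", "Atlanta", "1996-10", 150),
    ],
)

# Data Access: one record-level question, "unit sales in Florida last October".
florida_units = conn.execute(
    "SELECT SUM(units) FROM sales WHERE state = ? AND sale_month = ?",
    ("FL", "1996-10"),
).fetchone()[0]

# Decision Support style drill-down: the same question broken out by city.
units_by_city = dict(
    conn.execute(
        "SELECT city, SUM(units) FROM sales "
        "WHERE state = ? AND sale_month = ? GROUP BY city",
        ("FL", "1996-10"),
    ).fetchall()
)
conn.close()

print(florida_units)
print(units_by_city)
```

Both queries still answer only what the user explicitly asked. Neither one predicts next month's sales or explains why they might change, which is the step data mining adds.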
Fundamentally, data mining does two things with data: it finds relationships and it makes forecasts. Within these two categories, data mining is good at producing the following six information types (Newquist): Classes, Clusters, Associations, Sequences, Forecasts, and Similar Sequences. Classes are the most common form of data mining and consist of shared characteristics, such as how many or what percentage of people over the age of 40 have checking and savings accounts but no investments in mutual funds. A data mining tool uses pattern recognition to create classes. Clusters are a subset of classes that consist of patterns and relationships that have not been predefined or were not previously known to exist. Data mining finds these relationships even though the user was not specifically looking for them.
Associations deal with events. That is, an association exists when the occurrence of one event implies the occurrence of another. For example, when people buy beer, 60 percent of the time they also buy some form of snack. Sequences also deal with events, but the events are linked over time. For example, credit card holders who ask for an increase in their limit usually buy a large item within the next two weeks.
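The strength of an association such as the beer-and-snack example is commonly measured as the share of transactions containing the first item that also contain the second. The sketch below computes that share; the function name `confidence` and the six baskets are made up for illustration and were chosen only so that the result matches the 60 percent figure in the text.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0  # the antecedent never occurs, so there is nothing to measure
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"beer", "snacks"},
    {"beer", "snacks", "bread"},
    {"beer", "snacks"},
    {"beer", "milk"},
    {"beer"},
    {"milk", "bread"},
]

beer_to_snacks = confidence(baskets, "beer", "snacks")
print(beer_to_snacks)  # 3 of the 5 beer baskets contain snacks
```

A real data mining tool would search over all item pairs rather than being handed one pair to test, which is what lets it surface associations the user never thought to ask about.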
Forecasts involve predicting the future from current data, and they are applicable to almost any corporate situation. A forecast is made by extracting all relevant data and adjusting it for relevant fluctuations. Similar sequences extend the concept of sequences by combining them conceptually with classes. For example, after discovering a sequence at a particular time, a user might want to find other sequences occurring at the same time or search for similar sequences over time (Newquist).
The most common techniques in data mining are artificial neural networks, decision trees, genetic algorithms, nearest neighbor method, and rule induction (Pilot).
Artificial neural networks are non-linear predictive models that learn through training and resemble biological neural networks in structure. Decision trees are tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Genetic algorithms are optimization techniques that use concepts from evolution such as combination, mutation, and natural selection. The nearest neighbor method is a technique that classifies each record in a dataset based on a combination of the classes of the records most similar to it in a historical dataset. Rule induction is the extraction of useful if-then rules from databases based on statistical significance (Pilot).
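Of these techniques, the nearest neighbor method is the simplest to show in a few lines. The following is a minimal sketch under stated assumptions, not the implementation used by any particular product: the function name `knn_classify`, the two features (age and account balance in thousands), the labels, and all records are hypothetical, and plain Euclidean distance with a majority vote stands in for whatever similarity measure and combination rule a real tool would use.

```python
from collections import Counter
import math


def knn_classify(history, record, k=3):
    """Label `record` by majority vote among its k most similar historical records."""
    # Sort the historical records by Euclidean distance to the new record.
    nearest = sorted(history, key=lambda item: math.dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: ((age, balance in thousands), outcome).
history = [
    ((22, 1.0), "declined"),
    ((25, 1.5), "declined"),
    ((30, 2.0), "declined"),
    ((45, 8.0), "bought"),
    ((50, 9.5), "bought"),
    ((55, 7.0), "bought"),
]

print(knn_classify(history, (48, 8.5)))
print(knn_classify(history, (24, 1.2)))
```

In practice the features would usually be rescaled first so that one large-valued field does not dominate the distance.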
The main reason automated computer systems are needed for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by business, scientific, and governmental organizations around the world is staggering. According to GTE research, scientific organizations store about 1 terabyte of new information each day (Mega Computer). It is impossible for human analysts to cope with such overwhelming amounts of data.
Two problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex dependencies in data, and the lack of objectivity in their analysis. In addition, one of the benefits of using automated data mining systems is that the process has a much lower cost than hiring an army of highly trained and highly paid professional statisticians. Although data mining does not completely eliminate the need for humans, it allows an analyst who has no programming or statistics skills to extract knowledge from databases (Mega Computer).
Data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. "The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems" (Pilot).
Data mining is important to large systems because it finds things in large data repositories that you did not know existed. "A simple metaphor would be finding two needles in a haystack that match. The haystack is the database, the individual lengths of the hay represent your data fields, and the needles represent data fields with a relationship worth more to you than all the hay put together" (Newquist).
Mega Computer. "Reasons for the growing popularity of data mining." Online. Internet. 3 Oct. 1997 Available: http://www.megaputer.ru/dmreason.html.
Lindsay, Clark. "Data Mining." Online. Internet. 3 Oct. 1997 Available: http://msia02.msi.se/~lindsay/datamine.html
Newquist, H.P. "Data mining: The AI metamorphosis." Online. Internet. 3 Oct. 1997 Available: http://www.dbpd.com/newquist.html.
Pilot Software. "An Introduction to Data Mining." Online. Internet. 3 Oct. 1997 Available: http://www.pilotsw/dmpaper/dmindex.htm.
SPSS. "SPSS' Approach: Open Comprehensive Data Mining." Online. Internet. 3 Oct. 1997 Available: http://www.spss.com/datamine/ocdm.html