Lecture One: Introduction to Data Mining

What you will learn

Lecture Menu
· An Overview of Data Mining Technology
· Data Mining & Data Warehousing
· Data Mining as a Part of the Knowledge Discovery Process
· Goals of Data Mining and Knowledge Discovery
Note: The material in this lecture has been developed by Knowledge Systems Institute. The lecture material and figures that are linked to from the lecture come from material supplied in conjunction with the required text for the course. The copyright for this material and some illustrations is held by Morgan Kaufmann. The slides are to be used only in conjunction with Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber.

· Over the last three decades, many organizations have generated large amounts of machine-readable data in the form of files and databases.
· To process this data, we have database technology that supports query languages such as SQL.
· The problem with SQL is that it is a structured language that assumes the user is aware of the database schema.
· SQL supports the operations of relational algebra, which allow a user to select from tables (rows and columns of data) or join related information from tables based on common fields.
· Data warehousing technology affords further types of functionality, such as consolidation, aggregation, and summarization of data.
· It also lets us view the same information along multiple dimensions.
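The select, join, and aggregation operations mentioned above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and rows are hypothetical and chosen only for illustration. Note that every query has to name tables and columns explicitly, which is the sense in which SQL assumes the user knows the schema.

```python
import sqlite3

# Hypothetical schema and data: names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE purchases (customer_id INTEGER, month TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'East'), (2, 'Bob', 'West');
    INSERT INTO purchases VALUES
        (1, '2024-01', 40.0), (1, '2024-02', 60.0), (2, '2024-01', 25.0);
""")

# Select: pick rows and columns from a single table.
east_customers = conn.execute(
    "SELECT name FROM customers WHERE region = 'East'"
).fetchall()

# Join: relate information from two tables through a common field.
joined = conn.execute("""
    SELECT c.name, p.month, p.amount
    FROM customers AS c JOIN purchases AS p ON p.customer_id = c.id
    ORDER BY c.name, p.month
""").fetchall()

# Aggregation along two dimensions (region and month), in the spirit of
# the summarization a data warehouse provides.
totals = conn.execute("""
    SELECT c.region, p.month, SUM(p.amount)
    FROM customers AS c JOIN purchases AS p ON p.customer_id = c.id
    GROUP BY c.region, p.month
    ORDER BY c.region, p.month
""").fetchall()

conn.close()
print(east_customers, joined, totals, sep="\n")
```

Queries like these answer questions the user already knows how to ask; they do not by themselves surface patterns the user did not anticipate.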
· Data mining refers to the mining or discovery of new information, in terms of patterns or rules, from vast amounts of data.
· To be practically useful, data mining must be carried out efficiently on large files and databases. To date, it is not well integrated with database management systems.

What are some specific examples of the use of data mining for applications in science and business?

· The past decade has seen an explosive growth in biomedical research, ranging from the development of new pharmaceuticals and advances in cancer therapies to the identification and study of the human genome by discovering large-scale sequencing patterns and gene functions. A great deal of biomedical research has focused on DNA data analysis. Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention, and treatment.

Visualization tools and genetic data analysis: Complex structures and sequencing patterns of genes are most effectively presented in graphs, trees, cuboids, and chains by various kinds of visualization tools. Such visually appealing structures and patterns facilitate pattern understanding, knowledge discovery, and interactive data exploration. Visualization therefore plays an important role in biomedical data mining.
· Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining.

Design and construction of data warehouses for multidimensional data analysis and data mining: As in many other applications, data warehouses need to be constructed for banking and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. For example, one may want to view the debt and revenue changes by month, by region, by sector, and by other factors, along with maximum, minimum, total, average, trend, and other statistical information. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses, and outlier analysis all play important roles in financial data analysis and mining.

An Overview of Data Mining Technology

· In reports such as the very popular Gartner Report, data mining has been hailed as one of the top technologies for the near future.
· In this section we relate data mining to the broader area called knowledge discovery and contrast the two by means of an illustrative example.
Data Mining and Data Warehousing

· The goal of a data warehouse is to support decision making with data.
· Data mining can be used in conjunction with a data warehouse to help with certain types of decisions.
· Data mining can be applied to operational databases with individual transactions.
· To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data.
· Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse.
· Data mining applications should therefore be strongly considered early, during the design of a data warehouse.
· Data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of database mining applications will depend first on the construction of a data warehouse.

Data Mining as a Part of the Knowledge Discovery Process

· Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining.
· The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Example:
- Consider a transaction database maintained by a specialty consumer goods retailer.
- Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount.
- A variety of new knowledge can be discovered by KDD processing on this client database.
- During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
- The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes.
- Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
- Data transformation and encoding may be done to reduce the amount of data. Examples:
  1- Item codes may be grouped in terms of product categories into audio, video, supplies, camera, accessories, and so on.
  2- Zip codes may be aggregated into geographic regions.
  3- Incomes may be divided into ten ranges, and so on.
- Data mining techniques are used to mine different rules and patterns.
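The preparation phases in the retail example (selection, cleansing, enrichment, and transformation) can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, field names, the purchased income table, the use of the first three zip digits as a "region", and the 10,000-wide income bands are all hypothetical choices, not part of the lecture. Cleansing here simply drops records with malformed zip codes, whereas a real process might try to correct them.

```python
# Toy records; all field names and values are hypothetical.
transactions = [
    {"name": "Ann", "zip": "60614", "item_code": "A100", "amount": 120.0},
    {"name": "Bob", "zip": "6061", "item_code": "V200", "amount": 300.0},
    {"name": "Cy", "zip": "94105", "item_code": "C300", "amount": 80.0},
    {"name": "Dee", "zip": "60601", "item_code": "S400", "amount": 15.0},
]
# Stand-in for externally purchased data used during enrichment.
purchased_income = {"Ann": 52000, "Cy": 98000, "Dee": 31000}
# Assumed encoding: the first letter of an item code names its category.
category_by_prefix = {"A": "audio", "V": "video", "C": "camera", "S": "supplies"}


def select(records, zip_prefix):
    """Data selection: keep only records from one area of the country."""
    return [r for r in records if r["zip"].startswith(zip_prefix)]


def cleanse(records):
    """Data cleansing: drop records whose zip code is not five digits."""
    return [r for r in records if len(r["zip"]) == 5 and r["zip"].isdigit()]


def enrich(records, income_by_name):
    """Enrichment: append income from another source (None if unknown)."""
    return [{**r, "income": income_by_name.get(r["name"])} for r in records]


def transform(records):
    """Transformation/encoding: coarser categories, regions, income bands."""
    encoded = []
    for r in records:
        income = r["income"]
        encoded.append({
            "name": r["name"],
            "region": r["zip"][:3],
            "category": category_by_prefix.get(r["item_code"][0], "other"),
            # Ten bands, 0 through 9; incomes of 90,000 and above share band 9.
            "income_band": None if income is None else min(income // 10_000, 9),
            "amount": r["amount"],
        })
    return encoded


prepared = transform(enrich(cleanse(select(transactions, "606")), purchased_income))
for row in prepared:
    print(row)
```

Only after steps like these would the encoded records be handed to a mining technique.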
- The result of mining may be to discover:
· Association rules: e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
· Sequential patterns: e.g., suppose a customer buys a camera and within three months buys photographic supplies; then within six months he or she is likely to buy an accessory item. A customer who buys more than twice in the lean periods may be likely to buy at least once during the Christmas period.
· Classification trees: e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.

Note:
· As this retail-store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.
· The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.

Goals of Data Mining and Knowledge Discovery
· Prediction: Data mining can show how certain attributes within the data will behave in the future.
· Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits.
· In such applications, business logic is used together with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
· Identification: Data patterns can be used to identify the existence of an item, an event, or an activity.
· For example, intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session.
· In biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence.
· The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class; it involves a comparison of parameters or images or signals against a database.
· Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
· For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers.
· This classification may be used in different analyses of customer buying transactions as a post-mining activity.
· Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler.
· For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems.
· Such categorization may be used to encode the data appropriately before subjecting it to further data mining.
· Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.
· As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.
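Goals such as prediction are ultimately served by concrete patterns like the association rule from the retail example (a customer who buys video equipment also buys another electronic gadget). As a minimal sketch of how such a rule can be checked against data, the code below computes two commonly used measures, support and confidence. These measures are not defined in this lecture and are introduced here as an assumption; the baskets and item names are made up for illustration.

```python
# Hypothetical market baskets: each set lists the item categories
# bought together in one transaction.
baskets = [
    {"video", "gadget"},
    {"video", "gadget", "supplies"},
    {"video"},
    {"camera", "supplies"},
    {"gadget"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        return 0.0
    return support(antecedent | consequent, baskets) / antecedent_support


# Rule under test: video => gadget
rule_support = support({"video", "gadget"}, baskets)
rule_confidence = confidence({"video"}, {"gadget"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data the rule holds in two of the three baskets that contain video equipment, so a statement like "whenever a customer buys video equipment" is better read as "often" than "always"; a threshold on these measures decides which rules are worth reporting.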
Types of Knowledge Discovered during Data Mining

· The term "knowledge" is very broadly interpreted as involving some degree of intelligence.
· Knowledge is often classified as inductive and deductive.
· Data mining addresses inductive knowledge.
· Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules or propositional logic.
· In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames.
· The knowledge discovered during data mining can be described in five ways, as follows.

1. Association rules: These rules associate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
2. Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.
3. Sequential patterns: A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries (blood vessels) and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships.
4. Patterns within time series: Similarities can be detected within positions of the time series.
Three examples follow, the first using stock market price data as a time series: (1) Stocks of a utility company ABC Power and a financial company XYZ Securities show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in Earth's atmospheric conditions.
5. Categorization and segmentation: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.

- Database technology has evolved from primitive file processing to the development of database management systems with query and transaction processing.
- Further progress has led to the increasing demand for efficient and effective data analysis and data understanding tools. This need is a result of the explosive growth in data collected from applications including business and management, government administration, science and engineering, and environmental control.
- Data mining is the task of discovering interesting patterns from large amounts of data where the data can be stored in databases, data warehouses, or other information repositories.
- It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing.
- Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.
- A knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
- Data patterns can be mined from many different kinds of databases, such as relational databases, data warehouses, and transactional, object-relational, and object-oriented databases.
- Interesting data patterns can also be extracted from other kinds of information repositories, including spatial, time-related, text, multimedia, and legacy databases, and the World Wide Web.
- A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making.
- The data are stored under a unified schema and are typically summarized. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP (On-Line Analytical Processing).
- Data mining functionalities include the discovery of concept/class descriptions, association, classification, prediction, clustering, trend analysis, deviation analysis, and similarity analysis. Characterization and discrimination are forms of data summarization.
- A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.
- Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted.
- Efficient and effective data mining in large databases poses numerous requirements and great challenges to researchers and developers.
The issues involved include data mining methodology, user interaction, performance and scalability, and the processing of a large variety of data types. Other issues include the exploration of data mining applications and their social impacts.

1. Define the following: Data Mining and Data Warehousing.
2. Discuss how data mining is described as a part of the Knowledge Discovery Process.

1- Is data mining another hype?
Answer: Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.

2- Is it a simple transformation of technology developed from databases, statistics, and machine learning?
Answer: No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.

3- Explain how the evolution of database technology led to data mining.
Answer: Database technology began with the development of data collection and database creation mechanisms that led to the development of effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity.
Required Readings:
Textbook: Chapter 1, “Introduction,” Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth, The MIT Press, ISBN 0-262-08290-X.