Data Mining and Information Retrieval

The term data mining refers loosely to the process of semiautomatically analyzing
large databases to find useful patterns. Like knowledge discovery in artificial
intelligence (also called machine learning) or statistical analysis, data mining
attempts to discover rules and patterns from data. However, data mining differs
from machine learning and statistics in that it deals with large volumes of data,
stored primarily on disk. That is, data mining deals with “knowledge discovery
in databases.”

Some types of knowledge discovered from a database can be represented by
a set of rules. The following is an example of a rule, stated informally: “Young
women with annual incomes greater than $50,000 are the most likely people to buy
small sports cars.” Of course such rules are not universally true, but rather have
degrees of “support” and “confidence.” Other types of knowledge are represented
by equations relating different variables to each other, or by other mechanisms
for predicting outcomes when the values of some variables are known.
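
To make "support" and "confidence" concrete, here is a minimal sketch in Python. It assumes the commonly used definitions: support is the fraction of all records that satisfy both the condition and the conclusion of the rule, and confidence is the fraction of records satisfying the condition that also satisfy the conclusion. The customer records, field layout, and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: (age_group, sex, annual_income, bought_small_sports_car)
customers = [
    ("young", "F", 62000, True),
    ("young", "F", 55000, True),
    ("young", "F", 71000, False),
    ("young", "M", 58000, False),
    ("older", "F", 90000, False),
    ("older", "M", 45000, True),
    ("young", "F", 30000, False),
    ("older", "M", 52000, False),
]


def matches_condition(record):
    """Condition of the informal rule: young woman with income above $50,000."""
    age_group, sex, income, _ = record
    return age_group == "young" and sex == "F" and income > 50000


def rule_support_confidence(records):
    """Return (support, confidence) for the rule 'condition => bought car'."""
    total = len(records)
    with_condition = [r for r in records if matches_condition(r)]
    with_both = [r for r in with_condition if r[3]]
    # Support: share of the whole population that satisfies the entire rule.
    support = len(with_both) / total if total else 0.0
    # Confidence: how often the conclusion holds when the condition holds.
    confidence = len(with_both) / len(with_condition) if with_condition else 0.0
    return support, confidence


support, confidence = rule_support_confidence(customers)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy data, three records satisfy the condition and two of them also bought the car, so the rule holds for a quarter of all customers and for two thirds of the customers it applies to.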

There are a variety of possible types of patterns that may be useful, and
different techniques are used to find different types of patterns. In Chapter 20 we
study a few examples of patterns and see how they may be automatically derived
from a database.

Usually there is a manual component to data mining, consisting of preprocessing
data to a form acceptable to the algorithms, and postprocessing of discovered
patterns to find novel ones that could be useful. There may also be more than
one type of pattern that can be discovered from a given database, and manual
interaction may be needed to pick useful types of patterns. For this reason, data
mining is really a semiautomatic process in real life. However, in our description
we concentrate on the automatic aspect of mining.

Businesses have begun to exploit the burgeoning data online to make better
decisions about their activities, such as what items to stock and how best to
target customers to increase sales. Many of their queries are rather complicated,
however, and certain types of information cannot be extracted even by using SQL.

Several techniques and tools are available to help with decision support.
Some data-analysis tools allow analysts to view data in different ways.
Other analysis tools precompute summaries of very large amounts of data, in
order to give fast responses to queries. The SQL standard contains additional
constructs to support data analysis.
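
The idea of precomputing summaries can be sketched as follows. This is an illustration under assumed data, not a description of any particular tool: the `sales` rows and the `precompute_totals` helper are hypothetical, and `None` is used here to mean "all values" of a dimension.

```python
from collections import defaultdict

# Hypothetical detail rows: (item, region, quantity_sold)
sales = [
    ("shirt", "north", 3),
    ("shirt", "south", 5),
    ("shoes", "north", 2),
    ("shirt", "north", 4),
    ("shoes", "south", 1),
]


def precompute_totals(rows):
    """Aggregate once over the detail rows at every level of grouping."""
    totals = defaultdict(int)
    for item, region, quantity in rows:
        totals[(item, region)] += quantity  # per item and region
        totals[(item, None)] += quantity    # per item, all regions
        totals[(None, region)] += quantity  # per region, all items
        totals[(None, None)] += quantity    # grand total
    return dict(totals)


summary = precompute_totals(sales)

# Later queries become dictionary lookups instead of scans of the detail rows.
shirt_total = summary[("shirt", None)]
north_total = summary[(None, "north")]
print(shirt_total, north_total, summary[(None, None)])
```

The trade-off is the usual one: the summary costs extra space and must be refreshed when the underlying data changes, in exchange for answering aggregate queries without rescanning the data.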

Large companies have diverse sources of data that they need to use for making
business decisions. To execute queries efficiently on such diverse data, companies
have built data warehouses. Data warehouses gather data from multiple sources
under a unified schema, at a single site. Thus, they provide the user a single
uniform interface to data.

Textual data, too, has grown explosively. Textual data is unstructured, unlike
the rigidly structured data in relational databases. Querying of unstructured
textual data is referred to as information retrieval. Information retrieval systems
have much in common with database systems—in particular, the storage and
retrieval of data on secondary storage. However, the emphasis in the field of
information retrieval is different from that in database systems, concentrating on
issues such as querying based on keywords; the relevance of documents to the
query; and the analysis, classification, and indexing of documents. 
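
A keyword query with a simple notion of relevance can be sketched as below. This is a minimal illustration, assuming a tiny in-memory collection and a deliberately naive relevance score (the number of times the query keywords occur in a document); the documents and the `build_index` and `search` names are hypothetical, and real systems use more refined ranking.

```python
from collections import Counter, defaultdict

# Hypothetical document collection: id -> text
documents = {
    "d1": "data mining finds patterns in large databases",
    "d2": "information retrieval ranks documents by relevance to keywords",
    "d3": "data warehouses gather data from multiple sources",
}


def build_index(docs):
    """Build an inverted index: word -> Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents containing any query keyword, best match first."""
    scores = Counter()
    for word in query.lower().split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Sort by descending score, breaking ties by document id.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in ranked]


index = build_index(documents)
results = search(index, "data sources")
print(results)
```

The index is built once, so each query touches only the entries for its keywords rather than scanning every document.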
