Knowledge Base Systems and Data Mining

Describe classification. Explain any two classification algorithms with examples.

Introduction

Classification is a data mining function that assigns the items in a collection to target categories or classes.
Once the classification is done, a prediction or decision can be made about the data.
It generally relies on historical data.
The goal is to construct a model from the historical data that accurately predicts the labels of unlabeled examples.
Classification models generally follow one of three approaches: the discriminative approach, the regression approach and the class-conditional approach.
A classification task begins with a data set in which the class assignments (the target values) are known.
Classifications are discrete and do not imply order.
Continuous, floating-point values would indicate a numerical, rather than a categorical, target.
A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.
The simplest type of classification problem is binary classification.
In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating.
Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.
In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target.
Different classification algorithms use different techniques for finding relationships.
These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a set of test data.
The historical data for a classification project is typically divided into two data sets: one for building the model; the other for testing the model.
Scoring a classification model results in class assignments and probabilities for each case.
For example, a model that classifies customers as low, medium, or high value would also predict the probability of each classification for each customer.
Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.
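
The build, test and scoring steps described above can be sketched in Python. This is a minimal illustration, not a real classification algorithm: the data in HISTORY is hypothetical, the "model" simply counts how often each class occurs for each predictor value, and the function names split, train, score and accuracy are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical historical data: (income band, known credit rating).
HISTORY = [
    ("high", "high"), ("high", "high"), ("high", "low"),
    ("low", "low"), ("low", "low"), ("low", "high"),
    ("high", "high"), ("low", "low"), ("high", "high"), ("low", "low"),
]

def split(records, test_fraction=0.3, seed=0):
    """Divide the historical data into a build set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def train(build_set):
    """Count how often each class occurs for each predictor value."""
    counts = defaultdict(Counter)
    for predictor, target in build_set:
        counts[predictor][target] += 1
    return dict(counts)

def score(model, predictor):
    """Return the predicted class and the probability of each class."""
    seen = model.get(predictor)
    if not seen:  # unseen predictor value: pool the counts of all values
        seen = sum(model.values(), Counter())
    total = sum(seen.values())
    probabilities = {cls: n / total for cls, n in seen.items()}
    return max(probabilities, key=probabilities.get), probabilities

def accuracy(model, test_set):
    """Compare predicted values with the known targets in the test data."""
    hits = sum(score(model, x)[0] == y for x, y in test_set)
    return hits / len(test_set)

build_set, test_set = split(HISTORY)
model = train(build_set)
print(score(model, "high"))
print(accuracy(model, test_set))
```

Scoring returns both a class assignment and a probability for each class, which mirrors the customer-value example above.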

Classification Algorithms
Two widely used classification algorithms are described below:

1. Decision Tree

Decision trees are predictive models that graphically organize information about the possible options, their consequences and the end values.
Each branch of the tree is a classification question, and the leaves of the tree are partitions of the data set with their classification.
The outcome of the test at a node determines which branch is followed.
A data item is classified by starting at the root node and following the assertions down until a terminal node (or leaf) is reached.
A decision is made when the terminal node is reached.
Decision trees can also be interpreted as a special form of rule set, characterized by the hierarchical organization of the rules.

Diagram: decision tree

The basic algorithm can be summarized as follows:

Input

Data partition D: a set of training tuples and their associated class labels.
Attribute list: the set of candidate attributes.
Attribute selection method: a procedure for determining the splitting criterion that best partitions the data tuples into individual classes.

Output

A decision tree built from the above input.

Method/ Steps for creating a decision tree

Create a node N.
If the tuples in D are all of the same class C, then
    return N as a leaf node labeled with class C.
If the attribute list is empty, then
    return N as a leaf node labeled with the majority class in D.
Apply the attribute selection method (D, attribute list) to find the best splitting criterion.
Label node N with the splitting criterion.
If the splitting attribute is discrete-valued and multiway splits are allowed, then
    remove the splitting attribute: attribute list = attribute list - splitting attribute.
For each outcome j of the splitting criterion, partition the tuples and grow a subtree for each partition:
    Let D_j be the set of data tuples in D that satisfy outcome j.
    If D_j is empty, then
        attach a leaf labeled with the majority class in D to node N.
    Otherwise, attach the node returned by generate decision tree (D_j, attribute list) to node N.
Return N.
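
The steps above can be sketched in Python as follows. The sketch assumes that every attribute is discrete-valued and that multiway splits are allowed. The source does not name an attribute selection method, so information gain is used here as an assumption. The training tuples, the attribute names and the helper names (entropy, majority_class, select_attribute, classify) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the class labels in rows."""
    counts = Counter(row[target] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def majority_class(rows, target):
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def select_attribute(rows, attributes, target):
    """Assumed selection method: minimizing the expected entropy after
    the split, which is the same as maximizing information gain."""
    def remainder(attr):
        groups = Counter(row[attr] for row in rows)
        return sum(
            n / len(rows) * entropy([r for r in rows if r[attr] == v], target)
            for v, n in groups.items()
        )
    return min(attributes, key=remainder)

def generate_decision_tree(rows, attributes, target, domains):
    classes = {row[target] for row in rows}
    if len(classes) == 1:                    # all tuples in D share class C
        return classes.pop()
    if not attributes:                       # attribute list is empty
        return majority_class(rows, target)
    best = select_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != best]  # remove splitting attribute
    node = {"split_on": best, "branches": {}}
    for value in domains[best]:              # each outcome j
        subset = [r for r in rows if r[best] == value]  # D_j
        if not subset:                       # D_j is empty
            node["branches"][value] = majority_class(rows, target)
        else:
            node["branches"][value] = generate_decision_tree(
                subset, remaining, target, domains)
    return node

def classify(tree, row):
    """Follow the branches from the root until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["split_on"]]]
    return tree

# Hypothetical training tuples with a credit rating as the class label.
D = [
    {"income": "high", "student": "no", "rating": "high"},
    {"income": "high", "student": "yes", "rating": "high"},
    {"income": "low", "student": "no", "rating": "low"},
    {"income": "low", "student": "yes", "rating": "high"},
    {"income": "low", "student": "no", "rating": "low"},
    {"income": "high", "student": "no", "rating": "high"},
]
domains = {"income": ["high", "low"], "student": ["yes", "no"]}
tree = generate_decision_tree(D, ["income", "student"], "rating", domains)
print(tree)
print(classify(tree, {"income": "low", "student": "yes"}))
```

On this data the root splits on income: high-income tuples form a pure leaf, while low-income tuples are split again on student.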

2. Bayesian classification

Bayesian classification is based on Bayes' theorem.
Bayesian classifiers are statistical classifiers.
They predict class membership probabilities, that is, the probability that a given record belongs to a particular class.
Bayesian classifiers are accurate and perform well on large databases.
The naive Bayesian classifier assumes class-conditional independence, which means that the effect of an attribute value on a given class is independent of the values of the other attributes.

Bayes Theorem

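Bayes' theorem is named after Thomas Bayes, who worked on probability in the 18th century.
It involves two types of probabilities: the posterior probability P(H/X) and the prior probability P(H).

Here X is a data tuple and H is some hypothesis, for example that X belongs to a given class.

According to Bayes' theorem: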
P(H/X) = P(X/H)P(H) / P(X)
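
As a worked example, the formula can be applied with the naive class-conditional independence assumption, under which P(X/H) is the product of the per-attribute probabilities. The sketch below is only illustrative: the training tuples in TRAINING, the class names and the function name posterior are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class a probability of zero.

```python
from collections import Counter

# Hypothetical training tuples: (attribute values, class label).
TRAINING = [
    ({"income": "high", "student": "no"}, "buys"),
    ({"income": "high", "student": "yes"}, "buys"),
    ({"income": "low", "student": "yes"}, "buys"),
    ({"income": "low", "student": "no"}, "skips"),
    ({"income": "high", "student": "no"}, "skips"),
    ({"income": "low", "student": "no"}, "skips"),
]

def posterior(x, training):
    """Return P(H/X) for every class H, using P(X/H)P(H) / P(X)."""
    class_counts = Counter(label for _, label in training)
    joint = {}
    for h, n_h in class_counts.items():
        prior = n_h / len(training)                    # P(H)
        likelihood = 1.0                               # P(X/H)
        for attr, value in x.items():
            matches = sum(1 for row, label in training
                          if label == h and row[attr] == value)
            likelihood *= matches / n_h                # naive independence
        joint[h] = likelihood * prior                  # P(X/H)P(H)
    evidence = sum(joint.values())                     # P(X), summed over classes
    return {h: p / evidence for h, p in joint.items()}

result = posterior({"income": "high", "student": "no"}, TRAINING)
print(result)
```

For this tuple, P(X/H)P(H) is 1/9 for "buys" and 1/6 for "skips", so the posteriors are 0.4 and 0.6 and the tuple is assigned to "skips".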

Bayesian Network

These networks specify joint probability distributions. They are also known as belief networks, Bayesian belief networks or probabilistic networks.
They allow class-conditional independencies to be defined between subsets of variables.
They provide a graphical model of causal relationships, on which learning can be performed.
A trained Bayesian network can be used for classification.

A Bayesian belief network is defined by two components:

1. Directed acyclic graph

Each node in this graph represents a random variable.
The variables can be continuous or discrete valued.
The variables correspond to the actual attributes given in the data.

Graphical representation of the acyclic graph

The arcs in the diagram allow the representation of causal knowledge.
For example, whether a person has diabetes is influenced by the person's family history and age.
The variable positive test is independent of whether the patient has a family history of diabetes and of the patient's age, given that we know whether the patient has diabetes.

2. Set of conditional probability tables

Each variable has one conditional probability table (CPT), which gives the conditional distribution of the variable for each possible combination of the values of its parent nodes.

For example, the table for the variable diabetes gives the probability of diabetes for each combination of family history and age.
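
The two components can be sketched in Python. The network structure follows the diabetes example (family history and age are parents of diabetes, and diabetes is the parent of positive test). All probability values are hypothetical, age is simplified to a true/false variable named older, and the function names bernoulli, joint and probability are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_FAMILY_HISTORY = 0.2   # P(FamilyHistory = True)
P_OLDER = 0.4            # P(Older = True)
# CPT for Diabetes: P(Diabetes = True | FamilyHistory, Older)
CPT_DIABETES = {
    (True, True): 0.5,
    (True, False): 0.3,
    (False, True): 0.2,
    (False, False): 0.05,
}
# CPT for PositiveTest: P(PositiveTest = True | Diabetes)
CPT_POSITIVE_TEST = {True: 0.9, False: 0.1}

def bernoulli(p_true, value):
    """Probability of a true/false value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(fh, older, diabetes, test):
    """Product of each variable's probability given its parents."""
    return (bernoulli(P_FAMILY_HISTORY, fh)
            * bernoulli(P_OLDER, older)
            * bernoulli(CPT_DIABETES[(fh, older)], diabetes)
            * bernoulli(CPT_POSITIVE_TEST[diabetes], test))

def probability(**fixed):
    """Sum the joint over all assignments consistent with `fixed`."""
    names = ("fh", "older", "diabetes", "test")
    total = 0.0
    for values in product((True, False), repeat=4):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(**assignment)
    return total

# P(PositiveTest | Diabetes) with and without knowing the family history.
with_history = (probability(test=True, diabetes=True, fh=True)
                / probability(diabetes=True, fh=True))
without_history = (probability(test=True, diabetes=True)
                   / probability(diabetes=True))
print(with_history, without_history)
```

Both values come out as 0.9 (up to floating-point rounding), which illustrates that once diabetes is known, the family history adds no information about the test result.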

Write short notes on:
a) Text mining
b) Data-visualization.

Introduction

A text database consists of a huge collection of documents.
This information is collected from various sources such as news articles, books, digital libraries, e-mail messages, web pages, etc.
Text databases are growing rapidly due to the increase in the amount of information.
In many text databases the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, publishing_date, etc.
But along with this structured data, the document also contains unstructured text components, such as the abstract and contents.
Without any knowledge of what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data.
Users need tools to compare documents and rank their importance and relevance.
Hence, text mining has become a popular and essential theme in data mining.

Information Retrieval

Information retrieval deals with the retrieval of information from a large number of text-based documents.
Some features of database systems are not usually present in information retrieval systems, because the two handle different kinds of data.
Examples of information retrieval systems include:
Online library catalog systems
Online document management systems
Web search systems, etc.
The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query.
Such a query consists of some keywords describing an information need.
In such search problems, the user takes the initiative to pull relevant information out of the collection.
This is appropriate when the user has an ad hoc information need, i.e., a short-term need.
If the user has a long-term information need, the retrieval system can instead take the initiative to push any newly arrived information item to the user.
This kind of access to information is called information filtering, and the corresponding systems are known as filtering systems or recommender systems.
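
A keyword query that pulls documents out of a collection can be sketched as follows. This is a deliberately simple illustration: the documents in DOCUMENTS are hypothetical, and ranking by the number of matching keywords is an assumption made for this example, not a method described in these notes.

```python
def search(query, documents):
    """Rank documents by how many query keywords they contain."""
    keywords = set(query.lower().split())
    ranked = []
    for name, text in documents.items():
        overlap = len(keywords & set(text.lower().split()))
        if overlap:  # keep only documents that match at least one keyword
            ranked.append((overlap, name))
    # Most matching keywords first; ties are broken by document name.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked]

# Hypothetical document collection.
DOCUMENTS = {
    "doc1": "data mining finds patterns in large data sets",
    "doc2": "text mining extracts information from text documents",
    "doc3": "a gantt chart shows a project schedule",
}

results = search("text mining", DOCUMENTS)
print(results)
```

Here doc2 matches both keywords and doc1 matches one, so doc2 is ranked first and doc3 is not retrieved.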

Basic Measures for Text Retrieval

When a system retrieves a number of documents on the basis of a user's input, the accuracy of the retrieval needs to be checked.
The set of documents relevant to a query is denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
The set of documents that are both relevant and retrieved is denoted as {Relevant} ∩ {Retrieved}.

This can be shown in the form of a Venn diagram, in which the overlap of the two sets is {Relevant} ∩ {Retrieved}.

The quality of text retrieval can be assessed by using three fundamental measures:

1. Precision
It is the percentage of retrieved documents that are in fact relevant to the query.

It can be defined as:

Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

2. Recall
It is the percentage of documents that are relevant to the query and were in fact retrieved.

It is defined as:

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

3. F-score

The F-score is a commonly used trade-off measure.
An information retrieval system often needs to trade off recall for precision, or vice versa.

It is defined as the harmonic mean of recall and precision as follows:

F-score = (recall x precision) / ((recall + precision) / 2)
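
The three measures can be computed directly from the two sets. In this sketch the document identifiers and the function name retrieval_scores are hypothetical, and returning 0.0 when both precision and recall are zero is an assumption made to avoid a division by zero.

```python
def retrieval_scores(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)      # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    if precision + recall == 0:           # nothing relevant was retrieved
        return precision, recall, 0.0
    f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score

# Hypothetical query result: four relevant documents, three retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall, f_score = retrieval_scores(relevant, retrieved)
print(precision, recall, f_score)
```

Two of the three retrieved documents are relevant, so precision is 2/3, recall is 2/4 and the F-score is 4/7, which lies between the two.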

Text mining applications
Text mining is used in the following applications:

1. Security applications – Text mining is used to analyze plain text sources such as internet news. The study of text encryption is also involved.
2. Biomedical applications – One of the best examples is PubGene, which combines biomedical text mining with network visualization as an internet service.
3. Marketing applications – Text mining is used in marketing, more specifically in analytical customer relationship management.
4. Software applications – Firms like IBM are working to further automate the mining and analysis processes in order to improve text mining results.
5. Online media applications – Text mining is generally used to clarify information and provide readers with better search experiences, which in turn increases site "stickiness" and revenue. On the back end, editors benefit from being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.
6. Sentiment analysis – This may involve, for example, analyzing movie reviews to estimate how favorable each review is toward a movie.

b) Data-visualization

Data visualization is viewed as a modern equivalent of visual communication.
It involves the creation of visual representations of data.
Its primary goal is to communicate information clearly and efficiently to users by means of statistical graphics, plots and information graphics.
It helps decision makers to see analytics presented visually, grasp difficult concepts and identify new patterns.
With interactive visualization, users can drill down into charts and graphs for more detail.

Importance

Data visualization is important because the human brain processes information faster when charts or graphs are used to visualize large amounts of data.
It is a quick and easy way to convey concepts in a universal manner.
Different scenarios can be explored by making slight adjustments.
It can also help in identifying the areas that need attention or improvement.
The factors that influence customer behavior are clarified.
It helps to build an understanding of which products need to be placed where.
Sales volumes can be predicted.

Use of data visualization

Comprehending the information quickly
Identifying relationships and patterns
Pinpointing emerging trends
Communicating the story to others

Characteristics of an effective graphical display
The graphical display should possess the following characteristics:

It should show the data.
It should induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else.
It should avoid distorting what the data has to say.
Many numbers should be presented in a small space.
Large data sets should be made coherent.
The eye should be encouraged to compare different pieces of data.
The data should be revealed at several levels of detail, from a broad overview to the fine structure.
A reasonably clear purpose should be served: description, exploration, tabulation or decoration.
It should be closely integrated with the statistical and verbal descriptions of a data set.

Diagrams used for data visualization

Bar chart
Histogram
Scatter plot
Scatter plot (3D)
Network
Streamgraph
Treemap
Gantt chart
Heat Map
