Decision Trees for Uncertain Data
Abstract:
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we find that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilised. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU-demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
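To make the contrast concrete, here is a minimal sketch (class and method names are illustrative, not taken from the paper) of an uncertain numerical attribute carried as a discrete pdf, alongside the single mean that an averaging approach would reduce it to:

```java
// Illustrative sketch: an uncertain attribute value stored as a discrete pdf
// (sample points with probabilities) rather than a single point value.
public class UncertainValue {
    final double[] points; // sample points of the pdf
    final double[] probs;  // probability mass at each point (sums to 1)

    UncertainValue(double[] points, double[] probs) {
        this.points = points;
        this.probs = probs;
    }

    // Expected value: the single number the Averaging approach keeps,
    // discarding the rest of the distribution.
    double mean() {
        double m = 0;
        for (int i = 0; i < points.length; i++) m += points[i] * probs[i];
        return m;
    }

    public static void main(String[] args) {
        // A reading of "about 5" blurred by quantization error.
        UncertainValue v = new UncertainValue(
            new double[]{4, 5, 6}, new double[]{0.25, 0.5, 0.25});
        System.out.println("mean = " + v.mean()); // prints "mean = 5.0"
    }
}
```

A distribution-based classifier keeps the whole `points`/`probs` representation instead of just `mean()`, which is exactly the "complete information" referred to above.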
Existing System:
In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. For the latter, a precise and definite point value is usually assumed. In many applications, however, data uncertainty is common. The value of a feature/attribute is thus best captured not by a single point value, but by a range of values giving rise to a probability distribution. Although previous techniques can improve efficiency by working with means, they do not consider the spatial relationship among cluster representatives, nor do they make use of the proximity between groups of uncertain objects to perform pruning in batch. A simple way to handle data uncertainty is to abstract probability distributions by summary statistics such as means and variances. We call this approach Averaging. Another approach is to consider the complete information carried by the probability distributions to build a decision tree. We call this approach Distribution-based.
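As a rough illustration of the difference between the two approaches (all names here are hypothetical, not the paper's code): Averaging collapses each tuple's pdf to its mean and routes the tuple to exactly one branch of a split, whereas a Distribution-based split divides a tuple whose pdf straddles the split point z between both branches, with weights P(x <= z) and P(x > z):

```java
// Illustrative sketch: how one uncertain tuple is routed at a split point z
// under Averaging versus the Distribution-based approach.
public class SplitDemo {
    // pdf of one uncertain tuple, as (point, probability) pairs
    static double[] points = {3, 5, 7};
    static double[] probs  = {0.2, 0.3, 0.5};

    static double mean() {
        double m = 0;
        for (int i = 0; i < points.length; i++) m += points[i] * probs[i];
        return m;
    }

    // Averaging: compare the mean against the split point; the whole
    // tuple goes to one branch (0 = left, 1 = right).
    static int averagingBranch(double z) {
        return mean() <= z ? 0 : 1;
    }

    // Distribution-based: the fractional weight P(x <= z) sent to the
    // left branch; the remainder goes right.
    static double leftWeight(double z) {
        double w = 0;
        for (int i = 0; i < points.length; i++) {
            if (points[i] <= z) w += probs[i];
        }
        return w;
    }

    public static void main(String[] args) {
        double z = 4.0;
        // mean = 5.6, so Averaging sends the entire tuple right (branch 1),
        // even though 20% of the probability mass lies left of z.
        System.out.println("Averaging branch: " + averagingBranch(z));
        System.out.println("Distribution-based left weight: " + leftWeight(z)
            + ", right weight: " + (1 - leftWeight(z)));
    }
}
```

The fractional routing is what lets the Distribution-based tree use the full pdf, and also what makes it more expensive: one tuple can appear, with reduced weight, in many nodes.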
Proposed System:
We study the problem of constructing decision tree classifiers on data with uncertain numerical attributes. Our goals are (1) to devise an algorithm for building decision trees from uncertain data using the Distribution-based approach; (2) to investigate whether the Distribution-based approach could lead to higher classification accuracy compared with the Averaging approach; and (3) to establish a theoretical foundation from which pruning techniques are derived that can significantly improve the computational efficiency of the Distribution-based algorithms.
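To see where the computational cost and the pruning opportunity come from: each candidate split is typically scored by the weighted entropy of the two resulting sides, where uncertain tuples contribute fractional class counts, and this score must be evaluated at many candidate split points per attribute. The sketch below (illustrative, not the paper's actual code) computes such a score; the paper's pruning techniques derive bounds on scores like this so that many candidate split points can be discarded without full evaluation.

```java
// Illustrative sketch: entropy-based scoring of a binary split where
// class counts may be fractional (because uncertain tuples are divided
// between branches in proportion to their probability mass).
public class EntropySketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a class-count vector; counts may be fractional.
    static double entropy(double[] counts) {
        double total = 0;
        for (double c : counts) total += c;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Weighted entropy of a binary split: the quantity a split-point
    // search tries to minimize.
    static double splitEntropy(double[] left, double[] right) {
        double nl = 0, nr = 0;
        for (double c : left) nl += c;
        for (double c : right) nr += c;
        double n = nl + nr;
        return (nl / n) * entropy(left) + (nr / n) * entropy(right);
    }

    public static void main(String[] args) {
        // Fractional class counts after splitting uncertain tuples:
        // left gets 1.7 of class A and 0.3 of class B, and vice versa.
        double[] left  = {1.7, 0.3};
        double[] right = {0.3, 1.7};
        System.out.println("split entropy = " + splitEntropy(left, right));
    }
}
```

A perfectly separating split has entropy 0 and a maximally mixed one has entropy 1 (for two classes), so a cheap lower bound on the achievable entropy within an interval of candidate split points is enough to prune that whole interval.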
Advantages:
Ø Estimates(i.e.
budget, schedule etc .) become more relistic as work progresses, because
important issues discoved earlier.
Ø It is more able to
cope with the changes that are software development generally entails.
Ø Software engineers
can get their hands in and start woring on the core of a project earlier.
Software Requirements:
Ø Operating System : Windows XP
Ø Coding Language : JAVA, Swing, RMI, J2ME (Wireless Toolkit)
Ø Tool Used : Eclipse 3.3
Hardware Requirements:
Ø System : Pentium IV 2.4 GHz
Ø Hard Disk : 250 GB
Ø Monitor : 15" VGA Colour
Ø Mouse : Logitech