Decision Trees for Uncertain Data


Abstract:
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as the mean and median), we find that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account its probability density function (pdf)) is utilised. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.



Existing System:
In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. For the latter, a precise and definite point value is usually assumed. In many applications, however, data uncertainty is common: the value of a feature/attribute is best captured not by a single point value, but by a range of values giving rise to a probability distribution. Although previous techniques improve the efficiency of computing means, they neither consider the spatial relationship among cluster representatives nor exploit the proximity between groups of uncertain objects to prune in batch. A simple way to handle data uncertainty is to abstract probability distributions by summary statistics such as means and variances; we call this approach Averaging. Another approach is to consider the complete information carried by the probability distributions when building the decision tree; we call this approach Distribution-based.
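To make the two approaches concrete, the following is a minimal sketch (the class and method names are illustrative, not taken from the paper's code) of an uncertain numerical attribute represented as a discrete pdf, i.e., sample points with associated probabilities. Averaging collapses the pdf to its expected value, while the Distribution-based approach keeps the full mass, e.g., the probability that the value falls on one side of a split point:

```java
// Sketch of an uncertain numerical attribute modelled as a discrete pdf.
// The names UncertainValue, expectedValue and massAtMost are assumptions
// made for illustration only.
class UncertainValue {
    private final double[] points; // sample values of the attribute
    private final double[] probs;  // probability mass at each sample point

    UncertainValue(double[] points, double[] probs) {
        this.points = points.clone();
        this.probs = probs.clone();
    }

    // Averaging approach: abstract the pdf by its mean.
    double expectedValue() {
        double sum = 0.0;
        for (int i = 0; i < points.length; i++) {
            sum += points[i] * probs[i];
        }
        return sum;
    }

    // Distribution-based approach: probability mass on the
    // "value <= threshold" side of a candidate split point.
    double massAtMost(double threshold) {
        double mass = 0.0;
        for (int i = 0; i < points.length; i++) {
            if (points[i] <= threshold) {
                mass += probs[i];
            }
        }
        return mass;
    }
}
```

For example, a reading with sample points {1.0, 2.0, 4.0} and probabilities {0.25, 0.5, 0.25} has expected value 2.25, but 75% of its mass lies at or below 2.0; Averaging would send the whole tuple down one branch, whereas a Distribution-based tree can split it fractionally across both.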

Proposed System:
We study the problem of constructing decision tree classifiers on data with uncertain numerical attributes. Our goals are (1) to devise an algorithm for building decision trees from uncertain data using the Distribution-based approach; (2) to investigate whether the Distribution-based approach could lead to a higher classification accuracy compared with the Averaging approach; and (3) to establish a theoretical foundation on which pruning techniques are derived that can significantly improve the computational efficiency of the Distribution-based algorithms.
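Under the Distribution-based approach, a candidate split is still scored by the usual weighted entropy, except that each tuple contributes fractional class counts to both branches according to the probability mass of its pdf on each side of the split point. The sketch below (names such as splitEntropy and leftMass are assumptions for illustration) shows this computation:

```java
// Sketch: weighted entropy of a candidate split under the
// Distribution-based approach. Tuples contribute fractionally to both
// branches in proportion to their probability mass on each side.
class SplitEntropy {
    // Shannon entropy (base 2) of a vector of (possibly fractional) counts.
    static double entropy(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) total += c;
        if (total == 0.0) return 0.0;
        double h = 0.0;
        for (double c : classCounts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    // leftMass[i]: probability that tuple i's attribute falls on the left
    // of the split point; label[i]: its class in 0..numClasses-1.
    static double splitEntropy(double[] leftMass, int[] label, int numClasses) {
        double[] left = new double[numClasses];
        double[] right = new double[numClasses];
        double leftTotal = 0.0, rightTotal = 0.0;
        for (int i = 0; i < leftMass.length; i++) {
            left[label[i]] += leftMass[i];          // fractional count left
            right[label[i]] += 1.0 - leftMass[i];   // remainder goes right
            leftTotal += leftMass[i];
            rightTotal += 1.0 - leftMass[i];
        }
        double n = leftMass.length;
        return (leftTotal / n) * entropy(left) + (rightTotal / n) * entropy(right);
    }
}
```

A split that cleanly separates the classes (all of class 0's mass left, all of class 1's mass right) yields entropy 0, while a split that leaves half of every tuple's mass on each side yields entropy 1 for two balanced classes. Because this score must be evaluated at many candidate split points per pdf, it is exactly the cost that the proposed pruning techniques aim to reduce.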

Advantages:
Ø Estimates (e.g., budget and schedule) become more realistic as work progresses, because important issues are discovered earlier.
Ø It is better able to cope with the changes that software development generally entails.
Ø Software engineers can get their hands in and start working on the core of a project earlier.

Software Requirements:
Ø Operating System : Windows XP
Ø Coding Language : Java, Swing, RMI, J2ME (Wireless Toolkit)
Ø Tool Used : Eclipse 3.3

Hardware Requirements:
Ø System : Pentium IV 2.4 GHz
Ø Hard Disk : 250 GB
Ø Monitor : 15" VGA colour
Ø Mouse : Logitech
Ø RAM : 2 GB
