Description
Data Mining I: Basic Methods and Techniques
Laboratory Assignment #4:
- Describe the Classification Rule method.
- Use the classification-rule learner (Classify tab → rules folder → JRip) on the weather.nominal data set. How many rules did it produce? Compare this to the decision tree produced on the same data. What is the difference between the two models?
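For intuition before the comparison: a rule learner such as JRip outputs an ordered list of if-then rules plus a default rule, tried top to bottom, whereas a decision tree routes an instance through one test per internal node. A minimal Python sketch of how such a rule list classifies an instance (the rules here are made-up illustrations, not what JRip will actually print for weather.nominal):

```python
# Sketch of how an ordered rule list (as produced by a learner like
# JRip/RIPPER) classifies an instance: rules are tried top to bottom,
# the first matching rule fires, and a default rule catches the rest.
# NOTE: these example rules are hypothetical, not JRip's real output.

def classify(instance, rules, default):
    for condition, label in rules:
        if condition(instance):
            return label
    return default

# Hypothetical rules over the weather.nominal attributes.
rules = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"] == "TRUE", "no"),
]

example = {"outlook": "sunny", "humidity": "high", "windy": "FALSE"}
print(classify(example, rules, default="yes"))  # first rule fires -> no
```

Contrast this with a tree: the rule list is read independently rule by rule, while every path in a tree shares the tests of its ancestors, which is one reason the two models can look quite different on the same data.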
- Describe the K-nearest neighbor method.
- Produce a K-NN model (classifiers.lazy.IBk) for the weather.numeric data set. The standard K-nearest-neighbor method can be found in the ‘lazy’ submenu of the list presented when you click ‘Choose’ in Explorer’s Classify window. It is called ‘IBk’. Select this and then click on IBk so you can modify the parameters. The default value of k is 1. Set it to 3 (or another value of your preference) and then click Start to run the classifier. What is the output? How many instances did it classify correctly and how many incorrectly?
· Try changing the parameter K – the number of neighbors. Did that influence the model’s performance?
· Try using different weighting schemes. Does this change influence the model’s performance?
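To see concretely what k and the weighting scheme change, here is a self-contained Python sketch of k-NN with an optional inverse-distance weighting (roughly analogous to IBk’s ‘weight by 1/distance’ option); the 2-D points are invented toy data, not the weather.numeric file:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3, weighting="none"):
    """Classify `query` by majority (optionally distance-weighted) vote
    among its k nearest training points, using Euclidean distance."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        if weighting == "inverse":      # weight closer neighbors more
            votes[label] += 1.0 / (d + 1e-9)
        else:                           # plain majority vote
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Toy 2-D numeric data (made up for illustration).
train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
         ((3.0, 3.0), "no"), ((3.2, 2.9), "no"), ((2.0, 2.1), "no")]

query = (1.9, 2.0)
print(knn_predict(train, query, k=1))                       # -> no
print(knn_predict(train, query, k=3))                       # -> yes
print(knn_predict(train, query, k=3, weighting="inverse"))  # -> no
```

The same query flips between classes: with k=1 the single closest “no” point wins, with k=3 the two “yes” neighbors outvote it, and inverse-distance weighting hands the decision back to the very close “no” point – exactly the kind of effect you should watch for when varying IBk’s settings.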
- Upload the soybean.arff data set. Before running any classifier, it is worth having a brief look at the data file: under the Preprocess tab, click the Edit button. Alternatively, you can look at the data file in a text editor (Notepad or WordPad would work).
Lines beginning with % are comments. Typically the beginning of the file
provides background information on the data set. This includes details of
the data itself and references to previous work using the data. The
Soybean file contains 683 examples, each of which has 35 attributes plus
the class attribute. The task is to assign examples to one of 19 disease
classes. Apply the k-nearest
neighbor classifier to the soybean data set.
What % of examples are correctly classified? Compare the result to that of the unpruned decision tree procedure. Try investigating the effect of repeating the run with different values of k. Compare and contrast the two methods and their outputs.
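One way to mirror the “different values of k” experiment outside Weka is a small leave-one-out loop that reports the percentage correctly classified. This sketch assumes the data has already been parsed into (numeric-features, class) pairs; the toy points below merely stand in for the soybean examples:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out: classify each example using all the others."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == label
               for i, (x, label) in enumerate(data))
    return hits / len(data)

# Toy stand-in for a loaded ARFF file: (features, class) pairs.
data = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
        ((1.0, 1.0), "b"), ((1.1, 0.9), "b"), ((0.9, 1.1), "b")]

for k in (1, 3, 5):
    print(f"k={k}: {loo_accuracy(data, k):.0%} correctly classified")
```

Note how an over-large k can hurt: once k exceeds the number of same-class neighbors an example has left, the other class outvotes its own neighborhood, which is the effect to look for when sweeping k on soybean.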