
Data Mining I: Basic Methods and Techniques

Laboratory Assignment #4:

  1. Describe the Classification Rule method.
  2. Use the classification rule production method (Classify tab – rules folder – JRip) on the Weather.nominal data set. How many rules did it produce? Compare this to the decision tree produced on the same data set. What is the difference between the two models?
  3. Describe the K-nearest neighbor method.
  4. Produce a K-NN model (classifiers.lazy.IBk) for the Weather.numeric data set.
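For question 2, it helps to see the shape of the model a rule learner produces. The sketch below is illustrative Python, not Weka's JRip (which implements the RIPPER algorithm): a classification rule model is an ordered list of if–then rules plus a default class, and the first rule whose conditions match fires. The attribute names and rules here are invented for a weather-style data set.

```python
# Illustrative sketch of a classification rule model (not JRip itself).
# Each rule is (conditions, class); rules are tried in order.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),   # hypothetical rule
    ({"outlook": "rainy", "windy": "true"}, "no"),      # hypothetical rule
]
DEFAULT = "yes"  # default (majority) class when no rule matches

def classify(instance, rules=RULES, default=DEFAULT):
    """Return the class of the first matching rule, else the default."""
    for conditions, label in rules:
        if all(instance.get(attr) == val for attr, val in conditions.items()):
            return label
    return default
```

JRip's output reads the same way: a short ordered list of if–then rules followed by a default, which is often more compact than the corresponding decision tree.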

The standard K-nearest neighbor method can be found in the ‘lazy’ submenu of the list presented when you click ‘Choose’ in Explorer’s Classify window. It is called ‘IBk’. Select it, and then click on the name ‘IBk’ in the box next to ‘Choose’ to open its parameter editor. The default value of k is 1. Set it to 3 (or another value of your preference) and then click Start to run the program.

What is the output? How many instances did it classify correctly and how many incorrectly?
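Conceptually, IBk stores the training instances and classifies a new point by a majority vote among its k nearest neighbors. The following is a minimal from-scratch sketch of that idea (plain Euclidean distance, unweighted vote); the function name and toy data are our own, not Weka's API.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs; distance is
    Euclidean. A toy sketch of what IBk does with default settings.
    """
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (temperature, humidity) -> play examples:
train = [((85, 85), "no"), ((80, 90), "no"), ((83, 86), "yes"),
         ((70, 96), "yes"), ((68, 80), "yes")]
```

On this toy data, the query (84, 88) gets a different label at k = 1 than at k = 3, which is exactly the kind of effect the exercises ask you to observe.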

·         Try changing the parameter K – the number of neighbors. Did that influence the model’s performance?

·         Try using different weighting schemes. Did this change influence the model’s performance?
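On the weighting question: instead of counting each of the k neighbors equally, IBk can weight votes by distance (by 1/distance or by 1 − distance), so that closer neighbors count more. A sketch of the idea, with invented helper names and toy data:

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3, weight="1/d"):
    """Weighted k-NN vote: closer neighbors count more.

    weight="1/d" mimics weighting by inverse distance; weight="1-d"
    mimics weighting by 1 - distance; anything else gives equal votes.
    """
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        if weight == "1/d":
            w = 1.0 / (d + 1e-9)      # avoid division by zero
        elif weight == "1-d":
            w = max(0.0, 1.0 - d)
        else:
            w = 1.0                   # plain majority vote
        scores[label] += w
    return max(scores, key=scores.get)

# Same hypothetical (temperature, humidity) -> play examples as before:
train = [((85, 85), "no"), ((80, 90), "no"), ((83, 86), "yes"),
         ((70, 96), "yes"), ((68, 80), "yes")]
```

For a query very close to a single ‘yes’ example, the unweighted vote can say ‘no’ (two more distant ‘no’ neighbors outvote it) while the 1/distance vote says ‘yes’ — weighting really can change the prediction.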

  5. Load the soybean.arff data set. Before running Weka, it is worth having a brief look at the data file: under the Preprocess tab, click the Edit button. Alternatively, you can take a look at the data file in a text editor (Notepad or WordPad would work). Lines beginning with % are comments. Typically the beginning of the file provides background information on the data set, including details of the data itself and references to previous work using it. The soybean file contains 683 examples, each of which has 35 attributes plus the class attribute. The task is to assign examples to one of 19 disease classes. Apply the k-nearest neighbor classifier to the soybean data set.
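When you open soybean.arff in a text editor, you will see the standard ARFF layout: % comment lines, an @relation line, one @attribute declaration per attribute, and the instances after @data. The fragment below is schematic, using attribute names from the small weather data set rather than from soybean:

```
% Comment lines describe the data set and cite prior work.
@relation weather.numeric

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
```

In soybean.arff the same layout appears with 36 @attribute lines (35 features plus the class) and 683 data rows; in ARFF files, missing values are written as ?.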

What % of examples are correctly classified? Compare the result to that of the unpruned decision tree procedure. Try investigating the effect of repeating the run with different values for k. Compare and contrast the two methods and their outputs.
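To investigate different values of k systematically, you can script the sweep. The sketch below pairs a from-scratch k-NN with leave-one-out evaluation on a toy two-cluster data set (Weka's Explorer would instead report 10-fold cross-validation accuracy for each k you try); all names and data here are illustrative.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point from all the others."""
    hits = sum(
        knn_predict(data[:i] + data[i + 1:], x, k) == y
        for i, (x, y) in enumerate(data)
    )
    return hits / len(data)

def sweep_k(data, ks=(1, 3, 5)):
    """Report leave-one-out accuracy for each candidate k."""
    return {k: loo_accuracy(data, k) for k in ks}

# Toy two-cluster data set (invented): three "a" points, three "b" points.
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

On this toy set, k = 1 and k = 3 classify every held-out point correctly, while k = 5 forces each vote to include three points from the opposite cluster and gets everything wrong — a reminder that a larger k is not automatically better.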

 

