K-means clustering on file19.txt from the HARTIGAN dataset directory


Description

This is file19.txt, the data file we need for our problem.

HARTIGAN is a dataset directory that contains test data for clustering algorithms. The data files are all simple text files, and the format of the data files is explained on the web page at https://people.sc.fsu.edu/~jburkardt/datasets/hartigan/hartigan.html

Perform K-means clustering on file19.txt on the above web page.

 

#  file19.txt

#

#  Reference:

#

#    John Hartigan,

#    Clustering Algorithms,

#    Wiley, 1975.

#    ISBN 0-471-35645-X

#    LC: QA278.H36

#    Dewey: 519.5'3

#

#  "Name" is the name of the animal.

#

#  "I", "i", "C", "c", "P", "p", "M", "m", is the tooth pattern, the

#  number of top incisors, bottom incisors, top canines, bottom canines,

#  top premolars, bottom premolars, top molars, and bottom molars.

#

"Dentition of Mammals, Hartigan page 170"

9 columns

66 rows

"Name"           "I" "i" "C" "c" "P" "p" "M" "m"

"Opossum"         5 4 1 1 3 3 4 4

"Hairy tail mole" 3 3 1 1 4 4 3 3

"Common mole"     3 2 1 0 3 3 3 3

"Star nose mole"  3 3 1 1 4 4 3 3

"Brown bat"       2 3 1 1 3 3 3 3

"Silver hair bat" 2 3 1 1 2 3 3 3

"Pigmy bat"       2 3 1 1 2 2 3 3

"House bat"       2 3 1 1 1 2 3 3

"Red bat"         1 3 1 1 2 2 3 3

"Hoary bat"       1 3 1 1 2 2 3 3

"Lump nose bat"   2 3 1 1 2 3 3 3

"Armadillo"       0 0 0 0 0 0 8 8

"Pika"            2 1 0 0 2 2 3 3

"Snowshoe rabbit" 2 1 0 0 3 2 3 3

"Beaver"          1 1 0 0 2 1 3 3

"Marmot"          1 1 0 0 2 1 3 3

"Groundhog"       1 1 0 0 2 1 3 3

"Prairie Dog"     1 1 0 0 2 1 3 3

"Ground Squirrel" 1 1 0 0 2 1 3 3

"Chipmunk"        1 1 0 0 2 1 3 3

"Gray squirrel"   1 1 0 0 1 1 3 3

"Fox squirrel"    1 1 0 0 1 1 3 3

"Pocket gopher"   1 1 0 0 1 1 3 3

"Kangaroo rat"    1 1 0 0 1 1 3 3

"Pack rat"        1 1 0 0 0 0 3 3

"Field mouse"     1 1 0 0 0 0 3 3

"Muskrat"         1 1 0 0 0 0 3 3

"Black rat"       1 1 0 0 0 0 3 3

"House mouse"     1 1 0 0 0 0 3 3

"Porcupine"       1 1 0 0 1 1 3 3

"Guinea pig"      1 1 0 0 1 1 3 3

"Coyote"          1 3 1 1 4 4 3 3

"Wolf"            3 3 1 1 4 4 2 3

"Fox"             3 3 1 1 4 4 2 3

"Bear"            3 3 1 1 4 4 2 3

"Civet cat"       3 3 1 1 4 4 2 2

"Raccoon"         3 3 1 1 4 4 3 2

"Marten"          3 3 1 1 4 4 1 2

"Fisher"          3 3 1 1 4 4 1 2

"Weasel"          3 3 1 1 3 3 1 2

"Mink"            3 3 1 1 3 3 1 2

"Ferrer"          3 3 1 1 3 3 1 2

"Wolverine"       3 3 1 1 4 4 1 2

"Badger"          3 3 1 1 3 3 1 2

"Skunk"           3 3 1 1 3 3 1 2

"River otter"     3 3 1 1 4 3 1 2

"Sea otter"       3 2 1 1 3 3 1 2

"Jaguar"          3 3 1 1 3 2 1 1

"Ocelot"          3 3 1 1 3 2 1 1

"Cougar"          3 3 1 1 3 2 1 1

"Lynx"            3 3 1 1 3 2 1 1

"Fur seal"        3 2 1 1 4 4 1 1

"Sea lion"        3 2 1 1 4 4 1 1

"Walrus"          1 0 1 1 3 3 0 0

"Grey seal"       3 2 1 1 3 3 2 2

"Elephant seal"   2 1 1 1 4 4 1 1

"Peccary"         2 3 1 1 3 3 3 3

"Elk"             0 4 1 0 3 3 3 3

"Deer"            0 4 0 0 3 3 3 3

"Moose"           0 4 0 0 3 3 3 3

"Reindeer"        0 4 1 0 3 3 3 3

"Antelope"        0 4 0 0 3 3 3 3

"Bison"           0 4 0 0 3 3 3 3

"Mountain goat"   0 4 0 0 3 3 3 3

"Musk ox"         0 4 0 0 3 3 3 3

"Mountain sheep"  0 4 0 0 3 3 3 3

 

2.2 K-means clustering (2.5 points divided evenly among the components)

Perform K-means clustering on file19.txt from the HARTIGAN dataset directory described above. This file contains a multivariate mammals dataset; there are 9 columns and 66 rows.

(a) Data cleanup (1 point divided evenly by components below)

 (i) Think of what attributes, if any, you may want to omit from the dataset when you do the clustering. Indicate all of the attributes you removed before doing the clustering.

 (ii) Does the data need to be standardized?

 (iii) You will have to clean the data to remove multiple spaces and make the comma character the delimiter. Please make sure you include your cleaned dataset in the archive file you upload.
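
The cleanup in part (a) could be sketched in base R as follows. This is only one possible approach, not the required one: the output name file19_clean.csv is a hypothetical choice, and the raw vector holds just a few lines copied from the listing above so that the sketch runs on its own. For the real file, replace it with readLines("file19.txt").

```r
# A few raw lines copied from the listing above; for the real file,
# replace this vector with readLines("file19.txt").
raw <- c(
  '#  file19.txt',
  '#',
  '"Dentition of Mammals, Hartigan page 170"',
  '9 columns',
  '66 rows',
  '"Name"           "I" "i" "C" "c" "P" "p" "M" "m"',
  '"Opossum"         5 4 1 1 3 3 4 4',
  '"Hairy tail mole" 3 3 1 1 4 4 3 3',
  '"Common mole"     3 2 1 0 3 3 3 3'
)

# Drop blank lines and "#" comments, then keep everything from the
# column-header row onward (this skips the title and the two count lines).
lines <- trimws(raw)
lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
lines <- lines[which(startsWith(lines, '"Name"'))[1]:length(lines)]

# With the default sep = "", read.table() treats any run of whitespace as one
# delimiter, and quote = '"' keeps multi-word names such as "Hairy tail mole"
# in a single field.
teeth <- read.table(text = lines, header = TRUE, quote = '"',
                    check.names = FALSE, stringsAsFactors = FALSE)

# Write the comma-delimited version (hypothetical output file name).
out_path <- "file19_clean.csv"
write.csv(teeth, out_path, row.names = FALSE)

# Inspect the numeric columns before deciding what to omit or standardize.
features <- teeth[, -1]   # everything except the Name label
print(sapply(features, range))
```

Printing the per-column ranges is meant only as an aid for questions (i) and (ii); whether to drop an attribute or call scale() remains a judgment to justify in the write-up.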

(b) Clustering (2 points divided evenly by components below)

 (i) Determine how many clusters are needed by examining the within-cluster sum of squares (WSS) or silhouette graph. Plot the graph using fviz_nbclust().

(ii) Once you have determined the number of clusters, run k-means clustering on the dataset to create that many clusters. Plot the clusters using fviz_cluster().

 (iii) How many observations are in each cluster?

(iv) What is the total SSE of the clusters?

 (v) What is the SSE of each cluster?

(vi) Analyze each cluster to determine how the mammals are grouped and whether the grouping makes sense. Act as the domain expert here; clustering has produced what you asked it to. Examine the results based on your knowledge of the animal kingdom and see whether they meet expectations. Provide a summary of your observations.

Hint: to get the indices of all animals in cluster 1, execute which(k$cluster == 1), assuming k is the variable that holds the output of the kmeans() function call.
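
The steps in part (b) can be sketched in R as shown below, under several assumptions: only ten rows copied from the listing above are used so that the code runs on its own, k_guess = 3 is a placeholder and not a recommended answer, and the plotting calls are skipped unless the factoextra package is installed. In kmeans() output, tot.withinss is the total within-cluster sum of squares and withinss holds the per-cluster values.

```r
# Ten rows copied from the listing above so the sketch runs on its own;
# for the real analysis, use read.csv() on the full cleaned file.
teeth <- read.csv(text = 'Name,I,i,C,c,P,p,M,m
Opossum,5,4,1,1,3,3,4,4
Brown bat,2,3,1,1,3,3,3,3
Armadillo,0,0,0,0,0,0,8,8
Beaver,1,1,0,0,2,1,3,3
Pack rat,1,1,0,0,0,0,3,3
Wolf,3,3,1,1,4,4,2,3
Weasel,3,3,1,1,3,3,1,2
Jaguar,3,3,1,1,3,2,1,1
Elk,0,4,1,0,3,3,3,3
Deer,0,4,0,0,3,3,3,3', check.names = FALSE, stringsAsFactors = FALSE)

# Use the animal names as row labels and cluster on the tooth counts only.
X <- teeth[, -1]
rownames(X) <- teeth$Name

set.seed(42)  # k-means starts from random centers; fix the seed to reproduce

# Total within-cluster sum of squares for 1 to 6 clusters (base R version of
# the quantity that the WSS graph displays).
wss <- sapply(1:6, function(n) kmeans(X, centers = n, nstart = 25)$tot.withinss)
print(wss)

k_guess <- 3  # placeholder only; choose the value suggested by your own graph
k <- kmeans(X, centers = k_guess, nstart = 25)

print(k$size)                        # (iii) observations per cluster
print(k$tot.withinss)                # (iv) total SSE
print(k$withinss)                    # (v) SSE of each cluster
print(names(which(k$cluster == 1)))  # animals assigned to cluster 1

# The plots requested in (i) and (ii), drawn only if factoextra is available.
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::fviz_nbclust(X, kmeans, method = "wss", k.max = 6))
  print(factoextra::fviz_cluster(k, data = X))
}
```

Cluster numbers are arbitrary labels, so which animals land in "cluster 1" can change with the seed; compare cluster membership, not cluster numbers, across runs.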

