Download zip file and extract it. Consider this data is a subset of full Reuters corpus to make it possible for you to process without the need of a powerful server.

data mining

Description

 1- Download zip file and extract it. Consider this data is a subset of full Reuters corpus to make it possible for you to process without the need of a powerful server.

You have access to the following assets using your CSID that might be useful:

·         Bluenose: Bluenose.cs.dal.ca (undergrad and grads)

·         2-Hector: hector.cs.dal.ca (only grad students)

·         3-Gitlab: https://git.cs.dal.ca

2- Each file contains some XML files. Explore XML files and find a list of all fields available there.

3- Write a function extract a Pandas's Dataframe containing: (1) headline, (2) text, (3) bip:topics,(4) dc.date.published, (5) itemid, (6) XMLfilename (4 points)

4- Write a python function to find all the possible values for bip:topics. Consider that each news can belong to more than one topic. (4 points)

5- Write a function to prepare your text data by methods such as removing stop words. You are allowed to use the NLTK library. You can find more information here: https://www.nltk.org/. (4 points)

6- Extract features from the text using any approach you like. Write a function that input the Dataframe in step 3 and generates a new Dataframe of your features and labels. (4 points)

7- Divide your data into a training and test set. You can use any method such as cross-validation. You need to provide a reason why you decide so here. (4 points function, 4 points explanation: 4+4=8 points)

8- Write a function to get the Dataframe of step 6 and a set of parameters to return a trained classifier to classify all labels that you get in step 4.(4 points)

9- Write a function to evaluate the quality of your classifier (like accuracy, F-score, AUC, ...). Explain why you think this function is the best choice. (4 points function, 4 points explanation: 4+4=8 points)

9- Generate five different classifiers (Random Forest, Decision Tree, Linear Regression, Neural Network, and SVM) using step 8. Tune them up for the best parameters. Find the best classifier. Explain why. (4 points each classifier, Tune up 4points, explanation of best classifier 4 points: 4X5+4+4=28 points)

10- Go to Brightspace and upload your notebook containing all of your work under the assignment 1 section.


Related Questions in data mining category