Required to be submitted:
1. Save your output for each question in a text or Word file (the file name is your full name followed by the question, e.g., Yuefeng_Li_Q2a.txt) and put all code into a folder (e.g., Yuefeng_Li_Q2a). Then zip all txt files and folders into a single zip file named “student ID_Surname_Asm1.zip”.
2. Submit your zip file for this assignment in BB before 11.59pm on 24 April 2020.
3. Answer all four questions (10 sub-questions).
4. All sub-questions are worth 2 marks each.
Data (RCV1v2 document collection)
• You will work with a sample dataset: a small subset of just 10 documents
drawn from the RCV1v2 document collection, a pre-tokenized version (provided
in this form for convenience and for copyright reasons). The dataset can be
downloaded from Blackboard.
Question 1. Design Python code for text pre-processing
(a) Parsing and tokenizing: read files from RCV1v2, find the documentID, and record it in a collection of BowDocument objects.
• The documentID is simply assigned by the ‘itemid’ in
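One possible starting point for part (a) is sketched below in Python. It is an illustration under stated assumptions, not a model answer: the names `BowDocument`, `doc_id`, `terms`, `add_term`, `parse_document` and `parse_collection` are hypothetical, the ID is assumed to appear in each file as an attribute of the form `itemid="..."`, the files are assumed to end in `.xml`, and the tokenizer (lower-cased runs of letters outside markup tags) is a deliberate simplification.

```python
import os
import re
from collections import Counter


class BowDocument:
    """Bag-of-words record for one document (hypothetical layout)."""

    def __init__(self, doc_id):
        self.doc_id = doc_id      # value taken from 'itemid'
        self.terms = Counter()    # term -> frequency in this document

    def add_term(self, term):
        self.terms[term] += 1


# Assumption: the ID is stored as an attribute such as itemid="12345".
ITEMID_RE = re.compile(r'itemid="(\d+)"')
TAG_RE = re.compile(r"<[^>]+>")      # any markup tag
WORD_RE = re.compile(r"[A-Za-z]+")   # simplistic token pattern


def parse_document(raw_text):
    """Build a BowDocument from the raw text of one file."""
    match = ITEMID_RE.search(raw_text)
    if match is None:
        raise ValueError("no itemid attribute found")
    doc = BowDocument(match.group(1))
    # Drop markup so tag and attribute names are not counted as terms.
    body = TAG_RE.sub(" ", raw_text)
    for token in WORD_RE.findall(body):
        doc.add_term(token.lower())
    return doc


def parse_collection(folder):
    """Parse every .xml file in folder into {doc_id: BowDocument}."""
    collection = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".xml"):  # assumed file extension
            continue
        path = os.path.join(folder, name)
        with open(path, encoding="utf-8") as handle:
            doc = parse_document(handle.read())
        collection[doc.doc_id] = doc
    return collection
```

Calling `parse_collection` on the folder that holds the downloaded sample files would return a dictionary keyed by document ID. Keeping the IDs as strings avoids assuming anything about their numeric range.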