


Beginner Python assignment 

Description

In this project you are going to index a set of documents in a Python open-source search engine called tinysearch, devise a set of test queries, and evaluate the system on those queries.

Indexing the Documents

• Download the search engine and the corpus from OWL -> Resources -> Assignment2_Files onto your machines and index the documents (file name: tinysearch.zip).
o Note that this corpus contains dumped Wikipedia documents and is a few years old.
o There are instructions concerning these steps further down in this document.

Provided Corpus

This corpus contains some of the Wikipedia documents that can be used for mirroring, personal use, informal backups, off-line use or database queries. All text content is licensed under the GNU Free Documentation License (GFDL).

Topics and Questions

In this part of the assignment, you are going to think of an application domain (i.e. a subject) that is of interest to you. For example, you could choose health, politics, sport, geography, music (any kind), etc. Now, create twenty queries in your chosen domain. For example:

Q1: species and dogs
Q2: Akita dogs
Q3: wolfdog

Note: These are just examples chosen by a student who was interested in dogs. You can choose any subject, provided that:
• it is covered by the documents you are using;
• you can think of some quite difficult queries on your chosen topic.

Retrieval Experiments

Test the performance of the provided search engine using TF-IDF by applying the following steps:
1. Run the queries, as prepared above, through the system and collect the first ten files (or so) returned for each.
2. Compute precision and recall at the following levels of n (where n is the number of documents considered): n=5, n=10.
3. To do this, for each query you need to look at (for example) the first ten results (i.e. files) returned and decide for each file whether it is Relevant or Not Relevant. A file is relevant if it contains the answer to your query. It does not matter where in the file the answer occurs, as long as it is present somewhere. Note that this is not as easy as it sounds, since there will be occasions when you are not sure. Make a note of the rationale for your final decision in cases of doubt.

Computing recall poses a problem in that we need to know, for each query, all the correct answers in the collection. Strictly, we cannot know that without inspecting every document in the collection. At TREC they use a pooling method, as discussed in the lecture. To get around the problem here, simply check the first n documents (n = 20) returned for each query, count the number of correct responses there, and assume that these are all the correct responses in the collection. Then use this information to compute recall at n=5 and n=10 as above (see the sketch after this section).
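The following is a minimal sketch of how precision and recall at n could be computed under the assignment's simplification (treating the relevant files among the first 20 results as all the relevant files in the collection). The function name and the sample judgements are illustrative, not part of the assignment materials.

def precision_recall_at_n(judgements, n, pool_depth=20):
    """Compute precision and recall at rank n for one query.

    judgements is a list of booleans, one per returned file in rank
    order: True if the file is Relevant, False otherwise. Per the
    assignment's simplification, the relevant files among the first
    pool_depth (20) results are assumed to be all the relevant files
    in the collection.
    """
    total_relevant = sum(judgements[:pool_depth])
    hits = sum(judgements[:n])
    precision = hits / n
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Illustrative judgements for the first 20 results of one query:
judged = [True, True, False, True, False, False, True, False, False, False,
          False, False, True, False, False, False, False, False, False, False]
for n in (5, 10):
    p, r = precision_recall_at_n(judged, n)
    print(f"n={n}: precision={p:.2f}, recall={r:.2f}")

For this sample, n=5 gives precision 0.60 and recall 0.60, while n=10 gives precision 0.40 and recall 0.80, showing the usual trade-off as n grows.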
Assignment Report

Write up your results in a short report USING THE TEMPLATE SUPPLIED, with the following headings exactly as shown in the template:

1. Cover Page
• Includes your (formal) name, ID and the program you are currently enrolled in.
2. Topic and Queries
• What topic you chose; how the queries were devised.
3. Indexing the Documents
• How was this done?
• What problems were encountered (if any) and how were they solved?
4. TF-IDF Performance
• Method - a short text outlining what you did.
• Results - a table summarizing the numerical results as above (review assignment 2 - appendix 1).
• Discussion - a short description of what the results show (was TF-IDF always better, always worse, or sometimes better/worse?), any interesting problem cases, any technical problems encountered, and so on.

Report Appendix 1

• Include the queries you used for your TF-IDF evaluation and the IDs of the right answers found for each (if any).
• Example (this is just a sample):

Num  Query        IDs of Answers
1    hot chicken  5003
…
20   chicken      0
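As background for the TF-IDF Performance section, the sketch below shows the textbook tf-idf scoring that engines of this kind typically implement. This is an assumption for illustration only: it is not tinysearch's actual code, and the exact weighting tinysearch uses may differ.

import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Rank documents against a query with textbook tf-idf.

    docs maps a document id to its list of tokens. Assumption: the
    standard tf * log(N/df) weighting, which may differ from what
    tinysearch actually implements.
    """
    n_docs = len(docs)
    df = Counter()                        # document frequency per term
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)              # term frequency in this doc
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query_terms
            if df[term]
        )
    return scores

# Toy example in the spirit of the sample dog queries:
docs = {
    "d1": "akita dogs are a japanese breed of dogs".split(),
    "d2": "wolves and wolfdog hybrids".split(),
}
print(tf_idf_scores(["akita", "dogs"], docs))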

