Description
For this first part you will have to create two Python programs. The first reads a text file and computes trigram, bigram
and unigram probabilities from the words it contains. This program should be called n-trigrams.py and should be
run like this
For each of the sentences it should output its probability to the screen. Therefore, if there are 10
sentences in the file, it will output 10 probabilities. The sentences are separated by a period. There can be more than
one sentence in each file.
The file must be a tab-separated file with two columns. The first column is the
n-gram: for a unigram, just the word; for a bigram or trigram, the words separated by a space. The second column is
the count of that n-gram. Add STOP as a special word. This will allow you to know how many sentences there are in
the corpus.
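As an illustration of the file format described above, here is a minimal sketch of writing and reading such a file. The function names save_counts and load_counts, and the choice of tuples of words as dictionary keys, are assumptions made for this example and are not required by the assignment.

```python
from collections import Counter


def save_counts(counts, path):
    """Write one n-gram per line: words joined by spaces, a tab, then the count."""
    with open(path, "w", encoding="utf-8") as f:
        for ngram, count in sorted(counts.items()):
            f.write(" ".join(ngram) + "\t" + str(count) + "\n")


def load_counts(path):
    """Read the two-column file back into a Counter keyed by word tuples."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            ngram, count = line.split("\t")
            counts[tuple(ngram.split(" "))] = int(count)
    return counts
```

With this layout, a bigram such as ("the", "cat") seen once is stored as the line "the cat", a tab, and "1". The count on the STOP unigram line is the number of sentences in the corpus.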
Here are some steps to help you modularize your code:
• create a function that replaces punctuation and other noise (accents, tildes, newlines, etc.) in a string. End-of-sentence punctuation should be replaced with ; accents and tildes can be replaced with their corresponding
English equivalent (this is optional). Newlines, commas and apostrophes should be replaced with a space.
• create a function that reads a file. For each line read, it should replace punctuation and then find unigrams,
bigrams and trigrams and add them to a dictionary (you can use collections.Counter here). The key to
the dictionary can be a string or a tuple, for example (word1,word2,word3) for trigrams.
• create a function to save the resulting dictionary to a file and a function to read that file. The file should conform
to the format in the instructions.
• Lastly, create a function to compute the probability of any bigram present in the corpus.
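The steps above could be sketched roughly as follows. This is one possible layout, not the required solution. The names clean, count_ngrams and bigram_probability are hypothetical. The sketch assumes that end-of-sentence punctuation is replaced by the special word STOP, that ".", "!" and "?" all end a sentence, and that the bigram probability is the maximum-likelihood estimate count(w1 w2) / count(w1). None of these choices is spelled out in the text.

```python
import re
import unicodedata
from collections import Counter

STOP = "STOP"  # assumption: the end-of-sentence marker is the special word STOP


def clean(text):
    """Replace punctuation and other noise in a string."""
    # Optional step: drop accents and tildes by decomposing characters
    # and removing the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: '.', '!' and '?' end a sentence and become STOP.
    text = re.sub(r"[.!?]+", f" {STOP} ", text)
    # Newlines, commas and apostrophes become spaces.
    return re.sub(r"[\n,'’]", " ", text)


def count_ngrams(lines, max_n=3):
    """Count unigrams, bigrams and trigrams, keyed by tuples of words."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def bigram_probability(counts, w1, w2):
    """Maximum-likelihood estimate: count(w1 w2) / count(w1)."""
    denominator = counts[(w1,)]
    return counts[(w1, w2)] / denominator if denominator else 0.0


counts = count_ngrams(["The cat sat. The dog sat."])
print(bigram_probability(counts, "The", "cat"))  # 1 of the 2 "The" is followed by "cat"
```

For simplicity, n-grams in this sketch may span a sentence boundary (for example, the bigram STOP followed by the first word of the next sentence). Splitting the word list at STOP before counting would avoid that.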