An IR engine should include at least the following major components: a text parser, an indexer, and a retrieval system. Your first programming assignment is to build the first component, the text parser, which will be used in subsequent assignments.
Document Preprocessing Steps:
• Tokenization, including the handling of numbers, punctuation marks, and letter case (upper/lower)
• Elimination of stopwords
• Stemming of the remaining words
• Selection of terms for the term dictionary
• Creating the dictionary file (Term Dictionary and Document Dictionary)
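
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the required design, and it rests on several assumptions that the assignment does not state: tokens are lowercased alphabetic runs (so digits and punctuation are dropped), `STOPWORDS` is a tiny made-up list, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, every surviving stem is selected as a term, and each dictionary is written as tab-separated `key<TAB>id` lines. All function names here are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption); use the list
# supplied with the assignment in practice.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only.

    Assumption: numbers and punctuation are discarded entirely.
    """
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


def build_dictionaries(docs):
    """Build (term_dict, doc_dict) from a mapping of document name -> text.

    IDs are assigned from 1 in sorted order (an assumed convention).
    """
    doc_dict = {name: i for i, name in enumerate(sorted(docs), start=1)}
    terms = set()
    for text in docs.values():
        terms.update(preprocess(text))
    term_dict = {term: i for i, term in enumerate(sorted(terms), start=1)}
    return term_dict, doc_dict


def write_dictionary(path, mapping):
    """Write one 'key<TAB>id' line per entry (assumed file format)."""
    with open(path, "w", encoding="utf-8") as f:
        for key, idx in mapping.items():
            f.write(f"{key}\t{idx}\n")
```

For example, `build_dictionaries({"d1.txt": "The parser", "d2.txt": "Indexing documents"})` yields a term dictionary over the stems `document`, `index`, and `parser`, and a document dictionary mapping `d1.txt` and `d2.txt` to IDs 1 and 2. Swapping `naive_stem` for a proper stemmer would change the stems but not the overall flow.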