This application imports electronic medical records (stored as text files) from a selected folder and runs each file through a natural language processing pipeline:
- Removing stop words
- Stemming words
- Creating an index
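The three stages above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the stop-word list, the `tokenize` and `build_index` names, and the identity `stem` hook are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; the application's real list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "was", "with", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents, stem=lambda word: word):
    """Map each stemmed, non-stop-word term to the set of file names containing it.

    `documents` is a dict of file name -> file contents.
    `stem` is a placeholder hook (identity by default) where a real stemmer would go.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            if token in STOP_WORDS:
                continue  # stage 1: drop stop words
            index[stem(token)].add(name)  # stages 2 and 3: stem, then index
    return index


docs = {
    "note1.txt": "The patient was given aspirin.",
    "note2.txt": "Pain in the ankle; aspirin prescribed.",
}
index = build_index(docs)
print(sorted(index["aspirin"]))
```

Here the index is a plain dictionary for brevity; any structure that maps a term to its documents could stand in for it.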
For stemming, the Porter Stemming Algorithm is implemented to reduce each word to its root form. It consists of six steps:
- Remove plurals and -ed or -ing endings
- Turn a terminal y into i when there is another vowel in the stem
- Map double suffixes to single ones
- Handle -ic-, -full, -ness, etc. endings with a strategy similar to step 3
- Remove -ant, -ence, etc. in the context <c>vcvc<v>
- Remove a final -e if the measure (the number of consonant sequences between positions 0 and j of the word) is greater than 1; for example, <c>vcvc<v> gives a measure of 2
Here c stands for a consonant sequence, v for a vowel sequence, and angle brackets mark a sequence that may or may not be present.
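To make the measure and two of the simpler steps concrete, here is a small sketch. It covers only plural removal and the final -e rule as stated above, not the full six-step algorithm, and the function names are illustrative rather than taken from the application.

```python
import re


def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or directly after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-sequence/consonant-sequence pairs, i.e. m in <c>(vc)^m<v>."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    return len(re.findall(r"v+c+", pattern))


def strip_plural(word):
    """Plural handling from the first step: sses -> ss, ies -> i, s -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def strip_final_e(word):
    """Final step as described above: drop a trailing e when the measure exceeds 1."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        return word[:-1]
    return word


print(measure("tree"), measure("trouble"), measure("troubles"))
print(strip_plural("caresses"), strip_plural("ponies"), strip_plural("cats"))
print(strip_final_e("probate"), strip_final_e("rate"))
```

The measure is computed by first reducing the word to a string of c and v markers, which keeps the counting logic to a single regular expression.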
A trie (prefix tree) is an ordered data structure used to store the index in memory. All descendants of a node share a common prefix under the same branch, so that prefix is stored only once rather than repeated for every word, which improves memory efficiency. For example, in a trie containing the words aspirin, ankle, and pain, aspirin and ankle share the node for the letter a and branch apart at their second letters, while pain occupies its own branch from the root.
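A trie of this kind might look like the following sketch. The `TrieNode` and `Trie` classes, and the choice to store file names on the node where a term ends, are assumptions for illustration; the application's own node layout is not described.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.documents = set()  # file names for a term ending at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, document):
        """Walk or create one node per character, then record the document."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.documents.add(document)

    def lookup(self, term):
        """Return the documents recorded for exactly this term (empty set if none)."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.documents)

    def node_count(self):
        """Count all nodes below the root."""
        count = 0
        stack = [self.root]
        while stack:
            node = stack.pop()
            count += len(node.children)
            stack.extend(node.children.values())
        return count


trie = Trie()
trie.insert("aspirin", "note1.txt")
trie.insert("ankle", "note2.txt")
trie.insert("pain", "note2.txt")
print(trie.lookup("aspirin"))
print(trie.node_count())
```

The three words contain 16 letters in total, but the trie needs only 15 nodes because aspirin and ankle share the node for their first letter. The saving grows as more terms with longer shared prefixes are added.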
When the user searches for keywords (Figure 1), spaces are treated as an AND condition and commas as an OR condition. A list is displayed showing the file names of the documents that contain the search terms. Double-clicking a document opens a new tab that displays the document's contents and highlights the keywords present in it (Figures 2 and 3). Mousing over a highlighted keyword displays its SNOMED and UMLS IDs, which are looked up in a separate trie data structure built from the SNOMED CT data.
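The AND/OR query rule can be sketched as below. The source does not say how the two operators combine, so this example assumes commas split the query into OR groups and the space-separated terms inside each group are ANDed together. The `search` function and the dictionary-based index are illustrative stand-ins, and a real query would presumably pass its terms through the same stop-word and stemming steps as the documents.

```python
def search(query, index):
    """Return the file names matching the query.

    `index` maps a term to the set of file names containing it.
    Assumed precedence: commas separate OR groups; spaces AND the terms in a group.
    """
    results = set()
    for group in query.split(","):
        terms = group.split()
        if not terms:
            continue  # ignore empty groups such as a trailing comma
        matches = set(index.get(terms[0], set()))
        for term in terms[1:]:
            matches &= index.get(term, set())  # AND: keep files containing every term
        results |= matches  # OR: merge the files from each group
    return results


example_index = {
    "aspirin": {"a.txt", "b.txt"},
    "pain": {"b.txt"},
    "ankle": {"c.txt"},
}
print(sorted(search("aspirin pain", example_index)))
print(sorted(search("aspirin, ankle", example_index)))
print(sorted(search("aspirin pain, ankle", example_index)))
```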