Natural Language Processing

Initial Starting State

This application imports electronic medical records (stored as text files) from the selected folder and runs each file through a natural language processing pipeline:

  • Removing stop words
  • Stemming words
  • Creating an index
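The three stages above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the stop-word list, the function names, and the sample documents are all hypothetical, and the stemmer is stubbed out with an identity function.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; the application's real list is not shown.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "with", "was", "is", "for", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(documents, stem=lambda word: word):
    """Map each (stemmed) term to the set of file names containing it.

    `documents` is a dict of file name -> text. `stem` defaults to the
    identity function as a stand-in for a real stemmer.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for token in remove_stop_words(tokenize(text)):
            index[stem(token)].add(name)
    return index

# Made-up sample records, for illustration only.
docs = {
    "note1.txt": "Patient was given aspirin for the pain in the ankle.",
    "note2.txt": "Ankle swelling with no pain.",
}
index = build_index(docs)
print(sorted(index["pain"]))     # ['note1.txt', 'note2.txt']
print(sorted(index["aspirin"]))  # ['note1.txt']
```

A plain dictionary is used here only to keep the sketch short; the application itself stores the index in a trie.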

Stemming uses the Porter Stemming Algorithm, which reduces a word to its root form in six steps:

  1. Get rid of plurals and -ed or -ing endings
  2. Turn terminal y to i when there is another vowel in the stem
  3. Map double suffixes to single ones
  4. Deal with -ic-, -ful, -ness, etc. endings similarly to step 3
  5. Take off -ant, -ence, etc. in the context <c>vcvc<v>
  6. Remove a final -e if the measure, the number of consonant sequences between 0 and j, is greater than 1; for example, <c>vcvc<v> gives 2
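The measure used in the last step can be illustrated with a short sketch. It counts vowel-to-consonant transitions, treating y as a vowel only when it follows a consonant, as in Porter's definition. The function names are hypothetical, and `strip_final_e` is a simplification that checks only the "measure greater than 1" condition described above, not the full rule set of the published algorithm.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-consonant sequences: the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a run of vowels has just been closed by a consonant
        prev_vowel = not cons
    return m

def strip_final_e(word):
    """Simplified final step: drop a trailing e when the remaining stem has measure > 1."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        return word[:-1]
    return word

print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
print(strip_final_e("probate"))  # probat
print(strip_final_e("rate"))     # rate
```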

A trie (prefix tree) is an ordered data structure used to store the index in memory. All descendants of a node share a common prefix, so that prefix is stored only once rather than repeated for every word, which reduces memory use. A simple representation containing the words aspirin, ankle, and pain is:

Trie Data Structure
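The same three-word trie can be built in a few lines of code. This is a minimal sketch: the class and method names are hypothetical, and storing a set of file names at the node where a word ends is an assumption about how the index might be attached to the trie.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.documents = set()  # assumed: files containing the word that ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, document):
        """Walk or create one node per character, then record the document."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.documents.add(document)

    def lookup(self, word):
        """Return the documents recorded for an exact word, or an empty set."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.documents)

trie = Trie()
for word in ("aspirin", "ankle", "pain"):
    trie.insert(word, "note1.txt")

print(sorted(trie.root.children))                # ['a', 'p']
print(sorted(trie.root.children["a"].children))  # ['n', 's']
print(sorted(trie.lookup("pain")))               # ['note1.txt']
```

The words aspirin and ankle share the single `a` node under the root and only branch at their second letter, which is where the prefix sharing comes from.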

When the user searches for keywords (Figure 1), spaces are treated as an AND condition and commas as an OR condition. A list shows the file names of the documents that contain the search terms. Double-clicking a document opens a new tab that displays the document's contents and highlights the keywords present in it (Figures 2 and 3). Mousing over a highlighted keyword displays its SNOMED and UMLS IDs, which are looked up in a separate trie built from the SNOMED CT data.
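The keyword search rule can be sketched as below. The precedence is an assumption: the query is split on commas first, so each comma-separated group is an AND of its space-separated terms and the groups are then ORed together. The function name, the `lookup` callback, and the sample index are hypothetical.

```python
def search(query, lookup):
    """Evaluate a query where spaces mean AND and commas mean OR.

    `lookup` is any function that maps a term to the set of file names
    containing it (for example, a trie lookup).
    """
    results = set()
    for group in query.split(","):
        terms = group.split()
        if not terms:
            continue  # ignore empty groups such as a trailing comma
        matches = set(lookup(terms[0]))
        for term in terms[1:]:
            matches &= lookup(term)  # AND: keep files containing every term
        results |= matches           # OR: merge the groups
    return results

# Made-up index, for illustration only.
index = {
    "aspirin": {"a.txt"},
    "pain": {"a.txt", "b.txt"},
    "ankle": {"c.txt"},
}

def lookup(term):
    return index.get(term, set())

print(sorted(search("aspirin pain", lookup)))         # ['a.txt']
print(sorted(search("aspirin, ankle", lookup)))       # ['a.txt', 'c.txt']
print(sorted(search("aspirin pain, ankle", lookup)))  # ['a.txt', 'c.txt']
```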

Figure 1: Pain Search Result
Figure 2: Encoding of Aspirin
Figure 3: Encoding of Pain