Natural Language Processing

Initial Starting State

This application imports electronic medical records (stored as text files) from a selected folder and runs each file through a natural language processing pipeline consisting of:

  • Removing stop words
  • Stemming words
  • Creating an index
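
As a rough illustration of the three stages above, the sketch below tokenizes each document, drops stop words, applies a stemmer, and records which files contain each term. The stop-word list, the names `tokenize` and `build_index`, and the identity function standing in for the stemmer are all assumptions made for this example, not the application's actual code.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "was", "with", "for", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents, stem=lambda word: word):
    """Map each stemmed, non-stop-word term to the set of files containing it.

    `documents` is a dict of {file_name: text}. `stem` is any word -> root
    function; the default is the identity, standing in for a Porter stemmer.
    """
    index = defaultdict(set)
    for file_name, text in documents.items():
        for token in tokenize(text):
            if token in STOP_WORDS:
                continue  # stop words carry little meaning and are not indexed
            index[stem(token)].add(file_name)
    return index

# Hypothetical records used only for illustration.
records = {
    "patient1.txt": "The patient was treated with aspirin for ankle pain.",
    "patient2.txt": "Pain in the ankle is reduced.",
}
index = build_index(records)
```

Here a plain dictionary holds the index to keep the example short; the application itself stores the index in a trie.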

For stemming, the Porter stemming algorithm is used to reduce each word to its root form. It consists of six steps:

  1. Remove plurals and -ed or -ing endings
  2. Turn a terminal y into i when there is another vowel in the stem
  3. Map double suffixes to single ones
  4. Handle -ic-, -ful, -ness, etc. endings in the same way as step 3
  5. Remove -ant, -ence, etc. in the context <c>vcvc<v>, where c is a consonant sequence, v is a vowel sequence, and <..> marks an optional part
  6. Remove a final -e if the measure (the number of consonant sequences between positions 0 and j) is greater than 1; for example, <c>vcvc<v> gives a measure of 2
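
The measure used in the last two steps can be sketched as follows. This is a minimal illustration written for this post, not the application's stemmer: the function names are made up, and `strip_final_e` covers only the "measure greater than 1" case described in step 6, leaving out the other conditions of the full algorithm.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1  # a vowel sequence has just been closed by a consonant
        prev_vowel = not consonant
    return m

def strip_final_e(word):
    """Simplified step 6: drop a final 'e' when the remaining stem has m > 1."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        return word[:-1]
    return word

print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
print(strip_final_e("probate"), strip_final_e("rate"))          # probat rate
```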

A trie (prefix tree) is an ordered tree data structure used here to hold the index in memory. All descendants of a node share the prefix spelled out by the path to that node, so a common prefix is stored only once, which saves memory. A simple representation containing the words aspirin, ankle, and pain is:

Trie Data Structure
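
A trie like the one pictured can be sketched in a few lines. The class and method names below are hypothetical, and storing a set of file names at the node where a term ends is an assumption about how the index might be attached to the trie.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the next TrieNode
        self.documents = set()  # file names; non-empty only where a term ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, file_name):
        """Walk or create one node per character, then record the file."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.documents.add(file_name)

    def search(self, term):
        """Return the files indexed under exactly this term (empty set if none)."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.documents)

    def node_count(self):
        """Count nodes below the root, to show how shared prefixes save space."""
        count, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            count += len(node.children)
            stack.extend(node.children.values())
        return count

trie = Trie()
for term in ("aspirin", "ankle", "pain"):
    trie.insert(term, "patient1.txt")
trie.insert("pain", "patient2.txt")

print(sorted(trie.search("pain")))  # ['patient1.txt', 'patient2.txt']
print(trie.node_count())            # 15: the 16 letters minus the shared 'a'
```

Because aspirin and ankle both start with "a", that letter is stored once, which is the saving the paragraph above describes.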

When the user searches for keywords (Figure 1), spaces are treated as an AND condition and commas as an OR condition. A list shows the file names of the documents that contain the search terms. Double-clicking a document opens a new tab that displays the document's contents and highlights the keywords present in it (Figures 2 and 3). Hovering over a highlighted keyword displays its SNOMED and UMLS IDs, which are looked up in a separate trie built from the SNOMED CT data.

Figure 1: Pain Search Result
Figure 2: Encoding of Aspirin
Figure 3: Encoding of Pain
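
The search semantics described earlier (spaces as AND, commas as OR) could be evaluated roughly as below. The function name `search`, the sample index, and the choice to let AND bind more tightly than OR are assumptions for this sketch; a plain dictionary of sets stands in for the trie so the example is self-contained.

```python
def search(query, index):
    """Evaluate a query where commas mean OR and spaces mean AND.

    `index` maps each term to the set of file names that contain it.
    """
    results = set()
    for group in query.split(","):          # each comma-separated group is OR-ed
        terms = group.split()
        if not terms:
            continue                        # ignore empty groups such as "pain,,"
        matches = set(index.get(terms[0], set()))
        for term in terms[1:]:              # terms inside a group are AND-ed
            matches &= index.get(term, set())
        results |= matches
    return sorted(results)

# Hypothetical index used only for illustration.
sample_index = {
    "pain": {"patient1.txt", "patient2.txt"},
    "aspirin": {"patient1.txt"},
    "ankle": {"patient2.txt", "patient3.txt"},
}

print(search("pain aspirin", sample_index))   # ['patient1.txt']
print(search("aspirin, ankle", sample_index)) # ['patient1.txt', 'patient2.txt', 'patient3.txt']
```

In practice the query terms would need the same stop-word removal and stemming as the indexed documents so that they match the stored roots.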