skip to Main Content

I want to search for smoking status and it’s variation from medical notes text using regex by searching "smoking" and its variations.

But some of the notes contain "never smoke", "denies any smoking","Denies ever using tobacco" and such phrase. Which mean they don’t smoke.

How do I exclude this phrase from the regex?

Currently, my regex looks like this : b(?:smoking|smoker|smoked|cigarette(s)?|tobacco)b

It’s searching for those words perfectly, but as I mentioned before it also includes phrases like "denies smoking" which means the patient don’t smoke

2

Answers


  1. You can use the same regex b(?:smoking|smoker|smoked|cigarette(s)?|tobacco)b and check to see if the same sentence have the words have ‘never’, ‘does not’ or ‘denies’ in the medical notes as well.

    I’m not sure how big these notes are, and there might be a note such as ‘Doesn’t drink. Is a smoker’ or something along those lines. If they are written in sentences you may want to split the medical notes into individual sentences with notes.split('.') and then check in each sentence.

    Even so, searching for words with regex is probably unreliable based on how much data you have. If you have notes that have a lot of variety in them then it would probably be better to use a machine learning api to go through each patient’s medical notes to extract the keywords that indicate that the patient is a smoker.

    Login or Signup to reply.
  2. Suppose you wish to wish to identify the following strings, provided they are not part of one of the excluding strings that I list subsequently.

    smoked
    smoking
    smoker
    lit up
    cigarette
    cigarettes
    

    The excluding strings are as follows.

    never smoked
    lit up a party
    hated cigarettes
    

    "smoked" is to be identified in the text, for example, unless it is part of the string "never smoked".

    We can then match text with the following regular expression (with the case-indifferent flag set).

    b(?:never smoked|lit up the party|hated cigarette|(smoked|smoker|lit up|cigarettes?))b
    

    Notice that in the alternation the excluding strings come first. These strings are to be matched but not captured. They are followed by the strings to be identified, which are captured (to group 1) as well as matched.

    Suppose the text were as follows.

    Bob smoked but Lois never smoked.
        cccccc          mmmmmmmmmmmm 
    
    Tim lit up his pipe and then lit up the party with his singing.        
        cccccc                   mmmmmmmmmmmmmmmm
    
    Lulu left because she hated cigarette smoke.
                          mmmmmmmmmmmmmmm
    

    As shown at regex101.com, strings underlined with c’s are both matched and captured; strings underlined with m’s are matched but not captured.

    The key, of course, is to list the excluding strings before the strings to be identified.

    The procedure is to (in code) skip over (disregard) matches that are not captured, keeping only matches that are captured.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search