I want to find smoking status in the text of medical notes, using a regex that searches for "smoking" and its variations.
But some of the notes contain phrases such as "never smoke", "denies any smoking" or "Denies ever using tobacco", which mean the patient doesn’t smoke.
How do I exclude these phrases from the regex?
Currently, my regex looks like this: \b(?:smoking|smoker|smoked|cigarette(s)?|tobacco)\b
It finds those words perfectly, but as I mentioned before, it also matches phrases like "denies smoking", which mean the patient doesn’t smoke.
2 Answers
You can use the same regex
\b(?:smoking|smoker|smoked|cigarette(s)?|tobacco)\b
and check whether the same sentence also contains words such as ‘never’, ‘does not’ or ‘denies’. I’m not sure how long these notes are, and a note might read something like ‘Doesn’t drink. Is a smoker’. If the notes are written in sentences, you may want to split them into individual sentences with
notes.split('.')
and then check each sentence. Even so, searching for words with a regex is probably unreliable, depending on how much data you have. If your notes vary a lot, it would probably be better to use a machine learning API to go through each patient’s medical notes and extract the keywords that indicate that the patient is a smoker.
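A minimal Python sketch of the sentence-level check described in this answer might look like the following. The negation list and the function name `smoking_sentences` are assumptions for illustration only, and splitting on '.' is a crude sentence splitter that will trip over abbreviations and decimals.

```python
import re

# The asker's pattern, with the optional plural written as "cigarettes?".
SMOKING_RE = re.compile(
    r"\b(?:smoking|smoker|smoked|cigarettes?|tobacco)\b", re.IGNORECASE
)
# Hypothetical negation cues; extend this list for real notes.
NEGATION_RE = re.compile(r"\b(?:never|denies|does not|doesn't)\b", re.IGNORECASE)


def smoking_sentences(notes):
    """Return sentences that mention smoking and contain no negation cue."""
    sentences = [s.strip() for s in notes.split(".") if s.strip()]
    return [
        s for s in sentences
        if SMOKING_RE.search(s) and not NEGATION_RE.search(s)
    ]


print(smoking_sentences("Doesn't drink. Is a smoker. Denies ever using tobacco."))
```

Because the check is per sentence, the negation in "Doesn't drink" does not suppress the match in "Is a smoker", which is the case the answer warns about.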
Suppose you wish to identify the following strings, provided they are not part of one of the excluding strings that I list subsequently.
The excluding strings are as follows.
"smoked" is to be identified in the text, for example, unless it is part of the string "never smoked".
We can then match text with the following regular expression (with the case-indifferent flag set).
Notice that in the alternation the excluding strings come first. These strings are to be matched but not captured. They are followed by the strings to be identified, which are captured (to group 1) as well as matched.
Suppose the text were as follows.
As shown at regex101.com, strings underlined with c’s are both matched and captured; strings underlined with m’s are matched but not captured.
The key, of course, is to list the excluding strings before the strings to be identified.
The procedure is to (in code) skip over (disregard) matches that are not captured, keeping only matches that are captured.
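As a rough illustration of that procedure in Python: the excluding strings and the pattern below are assumptions assembled from the phrases in the question, not the lists or the regular expression from this answer, and `find_smoking_terms` is a hypothetical helper name.

```python
import re

# Assumed excluding strings: matched but NOT captured.
EXCLUDE = r"never\s+smoked?|denies\s+(?:any\s+)?smoking|denies\s+ever\s+using\s+tobacco"
# Strings to be identified: matched AND captured to group 1.
TERMS = r"smoking|smoker|smoked|cigarettes?|tobacco"

# The excluding alternatives come first, so they consume text such as
# "never smoked" before the capturing alternative can see "smoked".
PATTERN = re.compile(rf"\b(?:{EXCLUDE})\b|\b({TERMS})\b", re.IGNORECASE)


def find_smoking_terms(text):
    """Keep only matches that were captured; skip matches that were not."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]


text = "He never smoked. His father was a smoker. Denies ever using tobacco but buys cigarettes."
print(find_smoking_terms(text))  # ['smoker', 'cigarettes']
```

Here "never smoked" and "Denies ever using tobacco" are matched by the first alternative and discarded, while "smoker" and "cigarettes" land in group 1 and are kept.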