
I want to split a large text into sentences.
I know how to do that with NLTK, but I do not know how to do it without it.

This is my text; it has 8 sentences:

import re
import nltk

text = """Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. 
        It is seen as a part of artificial intelligence. 
        Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. 
        Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. 
        A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. 
        The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. 
        Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain. As of 2020, many sources continue to assert that ML remains a subfield of AI. 
        Others have the view that not all ML is part of AI, but only an 'intelligent subset' of ML should be considered AI."""


sent_num = len(re.split("(?<=[^A-Z].[.?])(\s|\n)+(?=[A-Z])", text))
print("Number of sentences with regex:", sent_num)  # 15

sent_num = len(nltk.sent_tokenize(text))
print("Number of sentences with NLTK:", sent_num)  #8

I have written a regex that splits text based on this condition:
if a word ends with punctuation (.!?), and there is a space or newline after the punctuation, and the next word starts with a capital letter, then split there.

But I'm getting bad results: NLTK gives 8 (correct), while my regex gives 15 instead of 8.


Answers


  1. If you use re.findall as follows, you get 8 sentences:

    sentences = re.findall(r'\w+.*?[.?!]', text, flags=re.S)
    print(sentences)  # 8 sentences
    

    However, the above only happens to work because [.?!] appear only as end-of-sentence markers. Should these appear elsewhere (abbreviations, decimal numbers, quoted speech), they would skew the results. This is why using a library like NLTK is preferable: its tokenizer is trained to recognize contexts, such as abbreviations, where punctuation does not end a sentence.
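    To see the failure mode concretely, here is a small sketch (the sample sentence is hypothetical, not from the question): an abbreviation like Dr. ends in a period, so the findall pattern counts it as a sentence of its own:

```python
import re

# Hypothetical sample: "Dr." is an abbreviation, not the end of a sentence
sample = "Dr. Smith arrived. He sat down."

pieces = re.findall(r'\w+.*?[.?!]', sample, flags=re.S)
print(pieces)
# ['Dr.', 'Smith arrived.', 'He sat down.'] -- 3 "sentences" instead of 2
```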

  2. I agree with Tim that using NLTK is the correct solution. That said, the reason your existing code isn’t working is that you put a capturing group in your regex, and re.split will include the captured groups in the result, not just the strings between matches, per the docs:

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

    You ended up capturing a single space character between each pair of sentences, adding seven "sentences" that were all one-character strings.
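    The behaviour is easy to demonstrate in isolation (a minimal sketch, not from the original answer):

```python
import re

# A capturing group makes re.split keep the matched delimiter in the result
print(re.split(r"(\s)", "one two"))    # ['one', ' ', 'two']

# A non-capturing group returns only the pieces between delimiters
print(re.split(r"(?:\s)", "one two"))  # ['one', 'two']
```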

    The minimal fix is to stop re.split from preserving it by making it a non-capturing group with ?:, i.e. (?:\s|\n):

    sentences = re.split(r"(?<=[^A-Z].[.!?])(?:\s|\n)+(?=[A-Z])", text)
    

    but in this particular case you don’t need a group at all; the escapes are legal inside character classes, so you can just use [\s\n], which is both more succinct and potentially more performant (due to the lower complexity of classes over grouping). And in fact, \s already includes \n as a component (\n is a whitespace character), so you don’t even need the explicit character class at all:

    sentences = re.split(r"(?<=[^A-Z].[.!?])\s+(?=[A-Z])", text)
    

    Note that in both cases:

    • I’ve placed an r prefix on the regex to make it a raw-string literal; you got lucky here (the invalid str-literal escape \s is passed through with a DeprecationWarning you’re not opted in to see, and the \n escape produces a literal newline character, which re treats as equivalent to its own \n), but some day you’ll want \b for a word boundary, and if you’re not using raw-string literals you’ll be very confused when nothing matches (because the str literal converted \b to an ASCII backspace, and re never knew it was looking for a word boundary). ALWAYS use raw strings for regexes, and you won’t get bitten.
    • I’ve added ! to the character class in the lookbehind assertion (you say you want to allow sentences to end in ., ! or ?, but you were only allowing . or ?).
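    The raw-string point can be checked directly (an illustrative sketch, not part of the original answer):

```python
import re

# In a plain str literal, \b is parsed as the ASCII backspace character;
# in a raw-string literal it stays as the two characters backslash + b
print(len("\b"), len(r"\b"))  # 1 2

# re is handed a literal backspace here, so it never sees a word boundary
print(re.search("\bfoo\b", "a foo b"))   # None

# With a raw string, re receives \b and matches the word boundaries
print(re.search(r"\bfoo\b", "a foo b"))  # matches 'foo'
```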

    Update: In response to your edited example input where the last logical sentence isn’t split from the second-to-last, the reason the regex doesn’t split As of 2020, many sources continue to assert that ML remains a subfield of AI. from Others have the view that not all ML is part of AI, but only an 'intelligent subset' of ML should be considered AI. is because of your positive lookbehind assertion, (?<=[^A-Z].[.?]). Expanding that, the assertion translates to:

    (?<=[^A-Z].[.?])
     ^^^           ^ The splitter must be preceded by
        ^^^^^^       a character that is not an uppercase ASCII alphabetic character
              ^      followed by any character
               ^^^^  followed by a period or a question mark
    

    The first logical sentence here ends with AI.; the I matches the ., the . matches [.?], but A fails the [^A-Z] test.
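    You can verify this on a reduced input (the sample text here is hypothetical):

```python
import re

# 'A' sits in the [^A-Z] slot of the lookbehind, so the split after "AI." never fires
print(re.split(r"(?<=[^A-Z].[.?])\s+(?=[A-Z])", "It is AI. Others disagree."))
# ['It is AI. Others disagree.']
```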

    You could just make the look-behind assertion simplify to just the end of sentence character, e.g.:

    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    #                      ^^^^^^^^^^ Only checking that the whitespace is preceded by ., ! or ?
    

    and that will split your sentences completely, but it’s unclear if the additional components of that look-behind assertion are important, e.g. if you meant to prevent initials from being interpreted as sentence boundaries (if I cite S. Ranger that will be interpreted as a sentence boundary after S., but if I narrow it to exclude that case, then Who am I? I am I. Or am I? will decide both I? and I. are not sentence boundaries).
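    For instance, with the simplified lookbehind an initial becomes a false split point (a hypothetical input, not from the question):

```python
import re

# The period after the initial "S." is treated as a sentence boundary
pat = r"(?<=[.!?])\s+(?=[A-Z])"
print(re.split(pat, "I cite S. Ranger in my paper. He agrees."))
# ['I cite S.', 'Ranger in my paper.', 'He agrees.']
```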
