I want to split a large text into sentences.
I know how to do it with NLTK, but I do not know how to do it without NLTK.
This is my text; it has 8 sentences:
import re
import nltk
text = """Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data.
It is seen as a part of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning.
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain. As of 2020, many sources continue to assert that ML remains a subfield of AI.
Others have the view that not all ML is part of AI, but only an 'intelligent subset' of ML should be considered AI."""
sent_num = len(re.split("(?<=[^A-Z].[.?])(\s|\n)+(?=[A-Z])", text))
print("Number of sentences with regex:", sent_num) #15
sent_num = len(nltk.sent_tokenize(text))
print("Number of sentences with NLTK:", sent_num) #8
I wrote a regex that splits text based on this condition:
If a word ends with punctuation (.!?), and the punctuation is followed by a space or a newline, and the next word starts with a capital letter, then split there.
But I am getting bad results: NLTK gives 8 (correct), while my regex gives 15 instead of 8.
2 Answers
If you use re.findall as follows, you get 8 sentences:
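For example, a pattern along these lines, which simply treats any run of characters ending in ., ! or ? as one sentence (the exact pattern is a matter of taste):

import re

# Each match is a run of non-terminator characters followed by ., ! or ?
sentences = re.findall(r"[^.!?]+[.!?]", text)
sentences = [s.strip() for s in sentences]  # drop the whitespace left between sentences
print("Number of sentences with re.findall:", len(sentences))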
However, the above only happens to work because [.?!] only appear as end-of-sentence markers. Should those characters appear elsewhere, they would spoof the results. This is why using a library like NLTK is preferable: it can parse the grammar of the text and figure out the context of the punctuation.

I agree with Tim that using NLTK is the correct solution. That said, the reason your existing code isn't working is that you put a capturing group in your regex, and re.split will include the capture groups in the result, not just the strings outside the regex, per the docs (if there are capturing groups in the pattern, the text of all the groups is also returned as part of the resulting list). You ended up capturing a single space character between each of the sentences, adding seven "sentences" that were all length-one strings.
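You can see the effect in isolation with a tiny example (not from the question's text):

import re

# A capturing group makes re.split return the separators as list items too:
print(re.split(r"(,)", "a,b,c"))    # ['a', ',', 'b', ',', 'c']
# A non-capturing group keeps only the pieces between the separators:
print(re.split(r"(?:,)", "a,b,c"))  # ['a', 'b', 'c']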
The minimal fix is to stop re.split from preserving the separator by making it a non-capturing group with ?:, turning (\s|\n) into (?:\s|\n).
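Applied to the regex from the question, that would look something like the following sketch (with the r prefix and the extra ! from the notes below already included):

sent_num = len(re.split(r"(?<=[^A-Z].[.?!])(?:\s|\n)+(?=[A-Z])", text))
print("Number of sentences with regex:", sent_num)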
But in this particular case you don't need a group at all; the escapes are legal in character classes, so you can just use [\s\n], which is both more succinct and potentially more performant (due to the lower complexity of classes over grouping). And in fact, \s already includes \n as a component (\n is a whitespace character), so you don't even need the explicit character class at all.
Note that in both cases:

- I added the r prefix to the regex to make it a raw-string literal; you got lucky here (the invalid str literal escape \s is ignored with a DeprecationWarning you're not opted in to see, and \n happens to produce a raw newline character, which re treats as equivalent to \n), but some day you'll want to use \b for a word boundary, and if you're not using raw-string literals you'll be very confused when nothing matches (because the str literal converted the \b to an ASCII backspace, and re never knew it was looking for a word boundary). ALWAYS use raw strings for regexes, and you won't get bitten.
- I added ! to the character class in the lookbehind assertion (you say you want to allow sentences to end in ., ! or ?, but you were only allowing . or ?).

Update: In response to your edited example input, where the last logical sentence isn't split from the second-to-last, the reason the regex doesn't split
"As of 2020, many sources continue to assert that ML remains a subfield of AI."
from
"Others have the view that not all ML is part of AI, but only an 'intelligent subset' of ML should be considered AI."
is because of your positive lookbehind assertion, (?<=[^A-Z].[.?]).
Expanding that, the assertion requires the split point to be immediately preceded by a character that is not an uppercase letter, then any single character, then a . or ?. The first logical sentence here ends with "AI.": the I matches the wildcard ., the final . matches [.?], but the A fails the [^A-Z] test.
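You can see this on a small made-up snippet (not the question's full text):

import re

# "AI." puts an uppercase "A" where [^A-Z] has to match, so no split happens here:
print(re.split(r"(?<=[^A-Z].[.?])\s+(?=[A-Z])", "a subfield of AI. Others disagree."))
# ['a subfield of AI. Others disagree.']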
You could simplify the lookbehind assertion to just the end-of-sentence character, e.g.:
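One way to write that simplified split (keeping the raw-string prefix and the !):

sent_num = len(re.split(r"(?<=[.?!])\s+(?=[A-Z])", text))
print("Number of sentences with regex:", sent_num)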
and that will split your sentences completely, but it's unclear whether the additional components of that lookbehind assertion are important, e.g. whether you meant to prevent initials from being interpreted as sentence boundaries (if I cite "S. Ranger", that will be interpreted as a sentence boundary after "S.", but if I narrow the pattern to exclude that case, then "Who am I? I am I. Or am I?" will decide that both "I?" and "I." are not sentence boundaries).
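For instance, a quick illustrative check of the initials case (made-up input, not from the question):

import re

# The simplified lookbehind treats the period after an initial as a sentence end:
print(re.split(r"(?<=[.?!])\s+(?=[A-Z])", "I cite S. Ranger on this. See below."))
# ['I cite S.', 'Ranger on this.', 'See below.']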