I need some help solving this problem.
I have two disjoint lists of strings: list A = {a1, ..., an} and list B = {b1, ..., bn}.
An element of a list can be a single word like “artificial” or “intelligence”, or it can be composed of several words, like “artificial intelligence”.
I also have a sentence that contains many words. Some of them are in one of the two lists.
What I have to do is count how many times two strings from the two lists occur together in the same sentence.
The problem is that if the sentence contains something like “artificial intelligence”, the only entry to count should be “artificial intelligence” (and neither “artificial” nor “intelligence”).
I was thinking about adding every list entry contained in the sentence to a TreeSet, ordering it by length and taking only the longest entry, but I don’t think that solution is very nice or efficient.
At the moment the code looks like this (but it still has the problem I’m talking about):
    // iterate over the words from list A
    for (String a : A) {
        // if the phrase contains the word
        if (phrase.matches(".*\\b" + a + "\\b.*")) {
            // iterate over the words from list B
            for (String b : B) {
                // if the phrase contains the word
                if (phrase.matches(".*\\b" + b + "\\b.*")) {
                    // do stuff
                }
            }
        }
    }
Do you have any suggestions? Thanks!
4 Answers
I think I found a solution for this, and you helped me think it through.
What I could do is iterate over both lists separately and add the words I find in the sentence to two temporary maps (with a weight that counts the occurrences). After that I can iterate over these two maps, again separately, and if a string a1 contains a string a2, I decrement the weight of a2 by one. That leaves me with two maps containing the correct weights, and I can iterate over both of them, incrementing the co-occurrences of each pair.
I think this way it should work!
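A minimal sketch of that two-map idea, under a few assumptions that are not in the post: the names WeightedMatches, weights and addPairs are hypothetical, matching is done on word boundaries, and a shorter string is decremented by the number of times the longer string was found rather than by a flat one, so repeated phrases are handled too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeightedMatches {

    // Counts whole-word occurrences of term inside text.
    static int count(String text, String term) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(term) + "\\b").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Counts each list entry in the sentence, then removes the hits that
    // were really part of a longer entry from the same list.
    static Map<String, Integer> weights(List<String> terms, String sentence) {
        Map<String, Integer> found = new LinkedHashMap<>();
        for (String t : terms) {
            int n = count(sentence, t);
            if (n > 0) {
                found.put(t, n);
            }
        }
        // Longest entries first, so shorter ones are corrected with final weights.
        List<String> byLength = new ArrayList<>(found.keySet());
        byLength.sort((x, y) -> Integer.compare(y.length(), x.length()));
        for (int i = 0; i < byLength.size(); i++) {
            String longer = byLength.get(i);
            int w = found.get(longer);
            if (w <= 0) {
                continue;
            }
            for (int j = i + 1; j < byLength.size(); j++) {
                String shorter = byLength.get(j);
                found.put(shorter, found.get(shorter) - w * count(longer, shorter));
            }
        }
        found.values().removeIf(v -> v <= 0);
        return found;
    }

    // Adds one co-occurrence for every (a, b) pair that survived in this sentence.
    static void addPairs(Map<String, Integer> a, Map<String, Integer> b,
                         Map<String, Integer> pairCounts) {
        for (String x : a.keySet()) {
            for (String y : b.keySet()) {
                pairCounts.merge(x + " + " + y, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        String sentence = "artificial intelligence is cool in a bat cave";
        Map<String, Integer> a = weights(
                Arrays.asList("artificial", "intelligence", "artificial intelligence"), sentence);
        Map<String, Integer> b = weights(Arrays.asList("bat", "bat cave"), sentence);
        Map<String, Integer> pairs = new LinkedHashMap<>();
        addPairs(a, b, pairs);
        System.out.println(pairs); // {artificial intelligence + bat cave=1}
    }
}
```

Entries that only partially overlap each other (rather than one containing the other) are not handled by this sketch.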
I am not sure if I understood your requirements fully, but if you just need the count, you could give a weight to the strings in the list. For example if you have the entries
if the sentence contains “artificial intelligence”, all three will match giving a sum of the weights = 1.
This will need some preprocessing to calculate the correct weights for the strings.
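One way the preprocessing could look, as a rough sketch only. The weights here are an assumption chosen to reproduce the sum of 1 mentioned above: every entry gets 1 minus the weighted matches of the shorter entries it contains, which gives “artificial” = 1, “intelligence” = 1 and “artificial intelligence” = -1. The names EntryWeights, computeWeights and weightedCount are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntryWeights {

    // Counts whole-word occurrences of term inside text.
    static int count(String text, String term) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(term) + "\\b").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Preprocessing: shortest entries first, so that every entry, matched
    // against itself, yields a weighted sum of exactly 1.
    static Map<String, Integer> computeWeights(List<String> entries) {
        List<String> sorted = new ArrayList<>(new LinkedHashSet<>(entries));
        sorted.sort((x, y) -> Integer.compare(x.length(), y.length()));
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String entry : sorted) {
            int w = 1;
            for (Map.Entry<String, Integer> shorter : weights.entrySet()) {
                w -= shorter.getValue() * count(entry, shorter.getKey());
            }
            weights.put(entry, w);
        }
        return weights;
    }

    // Sums the weights of all matches; this is the number of "real" entries found.
    static int weightedCount(Map<String, Integer> weights, String sentence) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            sum += e.getValue() * count(sentence, e.getKey());
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = computeWeights(
                Arrays.asList("artificial", "intelligence", "artificial intelligence"));
        System.out.println(weights);
        System.out.println(weightedCount(weights, "artificial intelligence is cool")); // 1
    }
}
```

This only yields a count per list, not which entries matched, so it fits the "if you just need the count" case.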
My idea is to keep track of the words already considered, then clean up.
Try something like this:
Hope I understood your problem
You have 2 lists. For each list, build a map from the first word of every entry to the entries that start with that word. For example, if you have “artificial intelligence”, “bat cave” and “dog” in a list, you would store it as:
"artificial" => { "artificial intelligence" }
"bat" => { "bat cave" }
"dog" => { "dog" }
This is the first step: preprocessing each list to get a map from the first word to the entries that start with it.
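The preprocessing step could be sketched roughly like this. FirstWordIndex and build are hypothetical names, and splitting entries on whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstWordIndex {

    // Maps the first word of each entry to all entries starting with that word.
    static Map<String, List<String>> build(List<String> entries) {
        Map<String, List<String>> index = new HashMap<>();
        for (String entry : entries) {
            String firstWord = entry.trim().split("\\s+")[0];
            index.computeIfAbsent(firstWord, k -> new ArrayList<>()).add(entry);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index =
                build(Arrays.asList("artificial intelligence", "bat cave", "dog"));
        System.out.println(index.get("artificial")); // [artificial intelligence]
    }
}
```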
Now suppose your line contains a statement like “artificial intelligence is cool.” You split the line on non-word characters (for example with the regex \W+) and get the words. The first word we encounter is “artificial”. We look it up in both maps obtained previously and find a key for “artificial” in one of them. We know what the next word in the line is, and we always want the longest match, so we take the list of entries stored under “artificial” and look for the longest one that matches the words that follow. We find “artificial intelligence”. We then repeat the process for the second list. Whichever match is longer decides whether the entry belongs to list 1 or list 2. Here is some sample code.
The line to process was
"artificial intelligence is cool. bat."
Output from program is
There are lots of optimizations to do.
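For illustration, a compact sketch of the longest-match scan described in this answer. It is not the answerer's program: LongestMatchScanner, index, longestAt and scan are hypothetical names, the two lists in main are made up, and lower-casing plus splitting on \W+ are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LongestMatchScanner {

    // Maps the first word of each entry to the entry's full word sequence.
    static Map<String, List<String[]>> index(List<String> entries) {
        Map<String, List<String[]>> byFirstWord = new HashMap<>();
        for (String entry : entries) {
            String[] tokens = entry.toLowerCase().split("\\W+");
            byFirstWord.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens);
        }
        return byFirstWord;
    }

    // Length in words of the longest indexed entry starting at pos, or 0 if none.
    static int longestAt(String[] words, int pos, Map<String, List<String[]>> idx) {
        int best = 0;
        List<String[]> candidates = idx.getOrDefault(words[pos], Collections.emptyList());
        for (String[] cand : candidates) {
            if (cand.length > best && pos + cand.length <= words.length
                    && Arrays.equals(cand, Arrays.copyOfRange(words, pos, pos + cand.length))) {
                best = cand.length;
            }
        }
        return best;
    }

    // Walks the line once, keeping only the longest entry at each position.
    static List<String> scan(String line, List<String> listA, List<String> listB) {
        Map<String, List<String[]>> idxA = index(listA);
        Map<String, List<String[]>> idxB = index(listB);
        String[] words = line.toLowerCase().split("\\W+");
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (words[i].isEmpty()) { // split can yield a leading empty token
                i++;
                continue;
            }
            int a = longestAt(words, i, idxA);
            int b = longestAt(words, i, idxB);
            int len = Math.max(a, b);
            if (len == 0) {
                i++;
                continue;
            }
            String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + len));
            matches.add((a >= b ? "A: " : "B: ") + phrase);
            i += len; // skip the words consumed by the match
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> listA = Arrays.asList("artificial intelligence", "artificial");
        List<String> listB = Arrays.asList("bat cave", "bat", "intelligence");
        System.out.println(scan("artificial intelligence is cool. bat.", listA, listB));
        // [A: artificial intelligence, B: bat]
    }
}
```

Because the scan jumps past every matched entry, “intelligence” is never counted on its own once “artificial intelligence” has been taken.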