Counting co-occurrences of 2 words in a sentence (java) - Artificial Intelligence

Epi
August 17, 2013
234 views
2 votes
4 Answers

I need some help solving this problem.

I have two disjointed list of string: list A = {a1, ..., an} and list B = {b1, ..., bn}
an element of the list can be a simple word like “artificial” or “intelligence” or it can be compound by more words like “artificial intelligence”.
I have also a sentence that contains many words. Some of them are in one of the two list.

What I have to do is counting how many times two strings from the two lists occurr together in the same sentence.

The problem is that if I find in the sentence a word like artificial intelligence the correct word to consider would be only “artificial intelligence” (and not “artificial” nor “intelligence”)

I was thinking about adding every words from the list, contained in the sentence, in a treeset, then ordering it by length and taking only the longest word, but I don’t think the solution is very nice and efficient.

At the moment the code looks like this (but it still has the problem I’m talking about)

// iterates on the words from the list A
for (String a: A)
    // if the phrase contains the word
    if (phrase.matches(".*\b" + a + "\b.*")
        // iterates on the words from the list B
        for (String b: B)
            // if the phrase contains the word
            if (phrase.matches(".*\b" + b + "\b.*")
                // do stuffs

Do you have any suggestions? thanks!

Tags: java

Answers

Chosen as BEST ANSWER
- Epi
- August 17, 2013 at 8:35 pm
- 0 votes
0
I think I found a solution for this and you helped me thinking about it.

What I could do is iterate on both the list separately and add the words I find in the sentence in two temporary maps (with a weight that counts the occorrences). After that I can iterate these two maps always separately and if a string a1 contains a string a2, I decrement the weight of a2 by one. After that I will obtain 2 maps containing the correct weights and I can iterate both them incrementing the co-occurrences of each pairs.

I think this way it should work!

(Edit)

- Henry
- August 17, 2013 at 7:29 pm
- 0 votes
0
I am not sure if I understood your requirements fully, but if you just need the count, you could give a weight to the strings in the list. For example if you have the entries
```
artificial -> 1
intelligence -> 1
artificial intelligence -> -1
```
if the sentence contains “artificial intelligence”, all three will match giving a sum of the weights = 1.

This will need some preprocessing to calculate the correct weights for the strings.
Login or Signup to reply.

My idea is to keep track of considered word, then clean up.

Try something like this:

int counter = 0;
List<String[]> h = new ArrayList<String[]>();
HashSet<String> words = new HashSet<String>();

// iterates on the words from the list A
for (String a: A)
    // if the phrase contains the word
    if (phrase.matches(".*\b" + a + "\b.*"))
        // iterates on the words from the list B
        for (String b: B)
            // if the phrase contains the word
            if (phrase.matches(".*\b" + b + "\b.*")) {

                h.add(new String[]{a,b});
                words.add(a);
                words.add(b);
            }

// clean up:

// 1. clean words
for (String i:words) {
    // in words, keep only strings that are not contained by others
}

// 2. clean h
for (String[] i :  h) {
    // if i[0] or i[1] are both in words, then
    // increment counter... or whatever you want
}

Hope I understood your problem

You have 2 lists. For every word in the list make a map from first word to the rest of the words in the list. For example if you have “artificial intelligence” , “bat cave”, “dog” in this list, you would store it as :

"artificial" => { "artificial intelligence" }

"bat" => { "bat cave" }

"dog" => { "dog" }

This will be the first step. Preprocessing the list to get a map of firstword to the rest of the words in the list.

Now when your line contains a statement like “artificial intelligence is cool.” You split the line with w. You get words. The first word we encounter is “artificial”. We look up into both the Maps as obtained previously. So we see a key for artificial in one of the maps. We know what is the next word in the line. We nevertheless want to match against the longest match. So we compare get the list of words corresponding to artificial. And make the longest substring match. We find artificial intelliegence together since we are looking for the longest match. We nevertheless repeat the process for the second list. Depending on whichever is longer, we select whether it belongs to list 1 or list 2.

Here is some sample code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class WordSplits {
    public static Map<String, List<String>> first2rest(List<String> wordList) {
        Map<String, List<String>> first2RestWords = new HashMap<String, List<String>>();
        for (String word : wordList) {
            // TODO Make it use Pattern. Sample demo. Get the first word of
            // every string.
            String splits[] = word.split("\W");
            String firstWord = splits[0];
            List<String> restWords = first2RestWords.get(firstWord);
            if (restWords == null) {
                restWords = new ArrayList<String>();
            }
            restWords.add(word);
            // store the complete pattern nevertheless
            first2RestWords.put(firstWord, restWords);
        }

        return first2RestWords;
    }

    public static Map<String, List<Integer>> longestSubstring(String line,
            List<String> first, List<String> second) {
        Map<String, List<Integer>> occurences = new LinkedHashMap<String, List<Integer>>();
        Map<String, List<String>> first2RestWords = first2rest(first);
        Map<String, List<String>> second2RestWords = first2rest(second);

        Matcher wordMatcher = Pattern.compile("\w+").matcher(line);
        for (int start = 0; start < line.length() && wordMatcher.find(start);) {

            String word = wordMatcher.group();

            String maxWordFirst = "", maxWordSecond = "";
            if (first2RestWords.containsKey(word)) {
                maxWordFirst = longestMatch(
                        line.substring(wordMatcher.start()),
                        first2RestWords.get(word));
            }
            if (second2RestWords.containsKey(word)) {
                maxWordSecond = longestMatch(
                        line.substring(wordMatcher.start()),
                        second2RestWords.get(word));

            }

            if (maxWordFirst.length() > 0 || maxWordSecond.length() > 0) {
                if (maxWordFirst.equals(maxWordSecond)) {
                    System.out.println("Belongs to both the lists : " + maxWordFirst);
                } else {
                    if (maxWordFirst.length() > maxWordSecond.length()) {
                        System.out.println("Belongs to first list:  " + maxWordFirst);
                    } else if (maxWordSecond.length() > maxWordFirst.length()) {
                        System.out.println("Belongs to second list: " + maxWordSecond);
                    }
                }
            } else {
                System.out.println(word + " does not belong to any list");
            }
            // Take some action
            start = wordMatcher.start() + Math.max(maxWordFirst.length(), maxWordSecond.length()) + 1;
            start = Math.max(wordMatcher.end(), start);
        }

        return occurences;
    }

    public static String longestMatch(String line, List<String> wordList) {
        String maxWord = "";
        // poor way to compare
        for (String word : wordList) {
            if (line.startsWith(word) && word.length() > maxWord.length()) {
                maxWord = word;
            }
        }

        return maxWord;
    }

    public static void main(String[] args) {
        longestSubstring("artificial intelligence is cool. bat.",
                Arrays.asList("dog", "cow", "dog", "artificial intelligence", "bat"),
                Arrays.asList("artificial", "hound", "cool", "bat", "dog hound"));
    }
}

The line to process was "artificial intelligence is cool. bat."

l1 = `"dog", "cow", "dog", "artificial", "artificial intelligence", "bat"`

l2 = `"intelligence", "hound", "cool", "bat", "dog hound"`

Output from program is

Belongs to first list:  artificial intelligence
is does not belong to any list
Belongs to second list: cool
Belongs to both the lists : bat

There are lots of optimizations to do.

Please signup or login to give your own answer.

Click here to cancel reply.

Counting co-occurrences of 2 words in a sentence (java) – Artificial Intelligence

Answers