skip to Main Content

I need some help solving this problem.

I have two disjointed list of string: list A = {a1, ..., an} and list B = {b1, ..., bn}
an element of the list can be a simple word like “artificial” or “intelligence” or it can be compound by more words like “artificial intelligence”.
I have also a sentence that contains many words. Some of them are in one of the two list.

What I have to do is counting how many times two strings from the two lists occurr together in the same sentence.

The problem is that if I find in the sentence a word like artificial intelligence the correct word to consider would be only “artificial intelligence” (and not “artificial” nor “intelligence”)

I was thinking about adding every words from the list, contained in the sentence, in a treeset, then ordering it by length and taking only the longest word, but I don’t think the solution is very nice and efficient.

At the moment the code looks like this (but it still has the problem I’m talking about)

// iterates on the words from the list A
for (String a: A)
    // if the phrase contains the word
    if (phrase.matches(".*\b" + a + "\b.*")
        // iterates on the words from the list B
        for (String b: B)
            // if the phrase contains the word
            if (phrase.matches(".*\b" + b + "\b.*")
                // do stuffs

Do you have any suggestions? thanks!

4

Answers


  1. Chosen as BEST ANSWER

    I think I found a solution for this and you helped me thinking about it.

    What I could do is iterate on both the list separately and add the words I find in the sentence in two temporary maps (with a weight that counts the occorrences). After that I can iterate these two maps always separately and if a string a1 contains a string a2, I decrement the weight of a2 by one. After that I will obtain 2 maps containing the correct weights and I can iterate both them incrementing the co-occurrences of each pairs.

    I think this way it should work!


  2. I am not sure if I understood your requirements fully, but if you just need the count, you could give a weight to the strings in the list. For example if you have the entries

    artificial -> 1
    intelligence -> 1
    artificial intelligence -> -1
    

    if the sentence contains “artificial intelligence”, all three will match giving a sum of the weights = 1.

    This will need some preprocessing to calculate the correct weights for the strings.

    Login or Signup to reply.
  3. My idea is to keep track of considered word, then clean up.

    Try something like this:

    int counter = 0;
    List<String[]> h = new ArrayList<String[]>();
    HashSet<String> words = new HashSet<String>();
    
    // iterates on the words from the list A
    for (String a: A)
        // if the phrase contains the word
        if (phrase.matches(".*\b" + a + "\b.*"))
            // iterates on the words from the list B
            for (String b: B)
                // if the phrase contains the word
                if (phrase.matches(".*\b" + b + "\b.*")) {
    
                    h.add(new String[]{a,b});
                    words.add(a);
                    words.add(b);
                }
    
    // clean up:
    
    // 1. clean words
    for (String i:words) {
        // in words, keep only strings that are not contained by others
    }
    
    // 2. clean h
    for (String[] i :  h) {
        // if i[0] or i[1] are both in words, then
        // increment counter... or whatever you want
    }
    

    Hope I understood your problem

    Login or Signup to reply.
  4. You have 2 lists. For every word in the list make a map from first word to the rest of the words in the list. For example if you have “artificial intelligence” , “bat cave”, “dog” in this list, you would store it as :

    "artificial" => { "artificial intelligence" }

    "bat" => { "bat cave" }

    "dog" => { "dog" }

    This will be the first step. Preprocessing the list to get a map of firstword to the rest of the words in the list.

    Now when your line contains a statement like “artificial intelligence is cool.” You split the line with w. You get words. The first word we encounter is “artificial”. We look up into both the Maps as obtained previously. So we see a key for artificial in one of the maps. We know what is the next word in the line. We nevertheless want to match against the longest match. So we compare get the list of words corresponding to artificial. And make the longest substring match. We find artificial intelliegence together since we are looking for the longest match. We nevertheless repeat the process for the second list. Depending on whichever is longer, we select whether it belongs to list 1 or list 2.

    Here is some sample code.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    
    public class WordSplits {
        public static Map<String, List<String>> first2rest(List<String> wordList) {
            Map<String, List<String>> first2RestWords = new HashMap<String, List<String>>();
            for (String word : wordList) {
                // TODO Make it use Pattern. Sample demo. Get the first word of
                // every string.
                String splits[] = word.split("\W");
                String firstWord = splits[0];
                List<String> restWords = first2RestWords.get(firstWord);
                if (restWords == null) {
                    restWords = new ArrayList<String>();
                }
                restWords.add(word);
                // store the complete pattern nevertheless
                first2RestWords.put(firstWord, restWords);
            }
    
            return first2RestWords;
        }
    
        public static Map<String, List<Integer>> longestSubstring(String line,
                List<String> first, List<String> second) {
            Map<String, List<Integer>> occurences = new LinkedHashMap<String, List<Integer>>();
            Map<String, List<String>> first2RestWords = first2rest(first);
            Map<String, List<String>> second2RestWords = first2rest(second);
    
            Matcher wordMatcher = Pattern.compile("\w+").matcher(line);
            for (int start = 0; start < line.length() && wordMatcher.find(start);) {
    
                String word = wordMatcher.group();
    
                String maxWordFirst = "", maxWordSecond = "";
                if (first2RestWords.containsKey(word)) {
                    maxWordFirst = longestMatch(
                            line.substring(wordMatcher.start()),
                            first2RestWords.get(word));
                }
                if (second2RestWords.containsKey(word)) {
                    maxWordSecond = longestMatch(
                            line.substring(wordMatcher.start()),
                            second2RestWords.get(word));
    
                }
    
                if (maxWordFirst.length() > 0 || maxWordSecond.length() > 0) {
                    if (maxWordFirst.equals(maxWordSecond)) {
                        System.out.println("Belongs to both the lists : " + maxWordFirst);
                    } else {
                        if (maxWordFirst.length() > maxWordSecond.length()) {
                            System.out.println("Belongs to first list:  " + maxWordFirst);
                        } else if (maxWordSecond.length() > maxWordFirst.length()) {
                            System.out.println("Belongs to second list: " + maxWordSecond);
                        }
                    }
                } else {
                    System.out.println(word + " does not belong to any list");
                }
                // Take some action
                start = wordMatcher.start() + Math.max(maxWordFirst.length(), maxWordSecond.length()) + 1;
                start = Math.max(wordMatcher.end(), start);
            }
    
            return occurences;
        }
    
        public static String longestMatch(String line, List<String> wordList) {
            String maxWord = "";
            // poor way to compare
            for (String word : wordList) {
                if (line.startsWith(word) && word.length() > maxWord.length()) {
                    maxWord = word;
                }
            }
    
            return maxWord;
        }
    
        public static void main(String[] args) {
            longestSubstring("artificial intelligence is cool. bat.",
                    Arrays.asList("dog", "cow", "dog", "artificial intelligence", "bat"),
                    Arrays.asList("artificial", "hound", "cool", "bat", "dog hound"));
        }
    }
    

    The line to process was "artificial intelligence is cool. bat."

    l1 = `"dog", "cow", "dog", "artificial", "artificial intelligence", "bat"`
    
    l2 = `"intelligence", "hound", "cool", "bat", "dog hound"` 
    

    Output from program is

    Belongs to first list:  artificial intelligence
    is does not belong to any list
    Belongs to second list: cool
    Belongs to both the lists : bat 
    

    There are lots of optimizations to do.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search