
I have a large list of unique strings (~1000), for instance: [bbbhbbbh, jjjhhssa, eeeffus,…]

And a smaller list of substrings (~50); each unique string is made up of a pair of these: [bbbh, jjjh, hssa, eeef, fus,…]

I want to create a function that takes the large unique string list (~1000) as an argument and returns a dictionary with the unique string and the corresponding values of its two unique sub-strings.

For example:

result = {'bbbhbbbh': 'bbbh/bbbh', 
            'jjjhhssa': 'jjjh/hssa', 
            'eeeffus': 'eeef/fus',...}

I’ve tried a for loop, but it does not list a substring twice when it is repeated (for example, 'bbbhbbbh' only yields ['bbbh']). Is there a more concise way, perhaps with a list comprehension, that also returns the two substrings that make up each unique string? I only want to use the json package at this point and solve this without importing any other packages. Thank you for any help with this.

My current loop and output:

result = []    

for string in pair_list:
    matches = []
    for substring in sub_list:
        if substring in string:
            matches.append(substring)
    if matches:
        result.append(matches)

print(result)

[['bbbh'], ['jjjh', 'hssa'], ['eeef', 'fus'],...

2 Answers


  1. Judging by your expected output, you want a dictionary in which each long string is the key and its matched substrings, joined with "/", are the value.
    I modified your code to store the result in a dict and to append each matched substring to the value. Because repeated substrings must also be included, we can use the string's count method to repeat a substring as many times as it occurs.

    Code:

    pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus"]
    sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
    pair_mapping_result = dict() 
    
    for pair_string in pair_list:
        for sub_string in sub_list:
            if sub_string in pair_string:
                matched_sub_pairs = "/".join([sub_string] * pair_string.count(sub_string))
                pair_mapping_result[pair_string] = (r"{}/{}".format(pair_mapping_result[pair_string], 
                                                                   matched_sub_pairs) 
                                                    if pair_mapping_result.get(pair_string) else matched_sub_pairs)
    
    print(pair_mapping_result)
    

    Output

    {'bbbhbbbh': 'bbbh/bbbh', 'jjjhhssa': 'jjjh/hssa', 'eeeffus': 'eeef/fus'}
    

    The same result can be built with a dict comprehension:

    Code

    {pair_string: "/".join(["/".join([sub_string] * pair_string.count(sub_string)) 
                            for sub_string in sub_list 
                            if sub_string in pair_string]) 
                            for pair_string in pair_list}
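
If every long string is exactly two substrings placed side by side, as the question describes, another option is to try each cut point and check both halves against the substring list. The sketch below assumes that is the case; `split_pairs` is a hypothetical helper name, and strings with no valid split are simply left out of the result.

```python
def split_pairs(pair_list, sub_list):
    """Map each string to 'left/right' when both halves are known substrings."""
    known = set(sub_list)  # set membership tests are fast
    result = {}
    for pair_string in pair_list:
        # Try every position where the string could be cut in two.
        for cut in range(1, len(pair_string)):
            left, right = pair_string[:cut], pair_string[cut:]
            if left in known and right in known:
                result[pair_string] = f"{left}/{right}"
                break  # keep the first valid split only
    return result


pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus", "hssajjjh"]
sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
print(split_pairs(pair_list, sub_list))
```

Unlike the count-based versions, this keeps the two substrings in the order they appear in the string, so 'hssajjjh' maps to 'hssa/jjjh' rather than following the order of sub_list.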
    
  2. Previously, the code did not look for repeated occurrences of a substring: once it found a substring in the main string, it moved on to the next one. This version finds every occurrence of each substring with the regex method re.finditer.

    pair_list= ['bbbhbbbh', 'jjjhhssa', 'eeeffus', 'aaaabbbh', 'ccccdddd','eeefff']
    
    sub_list = ['bbbh', 'jjjh', 'hssa', 'eeef', 'fus']
    
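    import re
    
    result = []
    for string in pair_list:
        matches = []
        for substring in sub_list:
            # re.escape treats the substring literally; finditer yields one
            # match per non-overlapping occurrence, so repeats are kept.
            for _ in re.finditer(re.escape(substring), string):
                matches.append(substring)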
        if matches:
            result.append(matches)
    
    print(result)
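
To get from this list of matches to the dictionary format in the question, one possible adaptation is to record where each match starts and join the substrings in that order. This is a sketch rather than part of the code above; `map_pairs_by_position` is a made-up name, and `re.escape` is used so that substrings are matched literally.

```python
import re


def map_pairs_by_position(pair_list, sub_list):
    """Map each string to its matched substrings, joined by '/' in order of appearance."""
    result = {}
    for string in pair_list:
        found = []  # (start index, substring) for every occurrence
        for substring in sub_list:
            for match in re.finditer(re.escape(substring), string):
                found.append((match.start(), substring))
        if found:
            found.sort()  # order by position in the string
            result[string] = "/".join(sub for _, sub in found)
    return result


pair_list = ['bbbhbbbh', 'jjjhhssa', 'eeeffus', 'aaaabbbh', 'ccccdddd', 'eeefff']
sub_list = ['bbbh', 'jjjh', 'hssa', 'eeef', 'fus']
print(map_pairs_by_position(pair_list, sub_list))
```

Strings with no matching substring, such as 'ccccdddd', are omitted, and strings with only one match, such as 'aaaabbbh', get a single substring as their value.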
    

    I hope this might help you.
