
The code below is from "Using word2vec to classify words in categories", and I need some help with reading the input and saving the return. Any help would be greatly appreciated.

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow', 'red','green', 'oragne', 'purple'],
  'Places': ['tokyo','bejing','washington','mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories.keys()}

# Processing the query
def process(query):
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('jonny'))
print(process('green'))
print(process('park'))

And the return looks like:

Loaded 400000 word vectors.
{'Names': 7.965438079833984, 'Places': -0.3282392770051956, 'Colors': 1.803783965110779}
{'Names': 11.360316085815429, 'Places': 3.536876901984215, 'Colors': 21.82199630737305}
{'Names': 10.234728145599364, 'Places': 8.739515662193298, 'Colors': 10.761297225952148}

Below are the changes I want to make to this script, but I keep failing 🙁 Please help.

Question 1: The order of categories in data is Names, Colors, Places. Why does the return have the order Names, Places, Colors instead? This is not important, but I was wondering why.

Question 2: Instead of using print(process('jonny')), how can I input a list of words from a text file?

Question 3: Let's suppose the name of the input text file is TEST.txt. How can I save the return in a TEST.json or TEST.csv file? Basically, the input and output should have the same name.

Thank you so much!

2

Answers


  1. Chosen as BEST ANSWER

    Thanks a lot, @Driftr95

    The code below takes multiple text files as input and saves the return for each one in its own JSON file.

    import json

    inpFiles = ['text1.txt', 'text2.txt', 'text3.txt']
    # ifLen = len(inpFiles)
    for inpf in inpFiles: # for ifi, inpf in enumerate(inpFiles, 1):
        # print('', end=f'\r[{ifi} of {ifLen}] processing "{inpf}"...')
        with open(inpf) as f: inputList = f.read().splitlines()
        with open(f'{inpf[:-4]}.json', 'w') as f:
            json.dump({inp: process(inp) for inp in inputList}, f, indent=4)
    

  2. Question 1: The order of categories in data is Names, Colors, Places. Why does the return have the order Names, Places, Colors instead? This is not important, but I was wondering why.

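    It’s probably because of how the contents of ‘glove.6B.100d.txt’ are ordered. data_embeddings is built by looping over embeddings_index, so it follows the word order of the file, and process adds a category to scores the first time it meets a word from that category. A dict keeps its keys in insertion order, so if the first matching word in the file is a name, the next new category a place, and the last a color, the return comes out as Names, Places, Colors.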
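
A small sketch of that ordering effect, and of one way to get the scores back in the order of data. The list file_order and the score values are made-up stand-ins for illustration, not real GloVe data.

```python
# Hypothetical stand-ins: 'file_order' mimics the order in which words
# appear in the embeddings file; each word just adds 1.0 to its category.
data = {
    'Names': ['john', 'jay'],
    'Colors': ['yellow', 'red'],
    'Places': ['tokyo', 'mumbai'],
}
categories = {word: key for key, words in data.items() for word in words}
file_order = ['john', 'tokyo', 'red', 'jay', 'yellow', 'mumbai']

scores = {}
for word in file_order:
    category = categories[word]
    # A category becomes a key the first time one of its words is seen
    scores[category] = scores.get(category, 0) + 1.0

print(list(scores))   # ['Names', 'Places', 'Colors']

# Rebuild the dict so that its keys follow the order of 'data'
ordered = {category: scores[category] for category in data if category in scores}
print(list(ordered))  # ['Names', 'Colors', 'Places']
```

The same dict comprehension could be applied to the return value of process if the order of data matters for the saved file.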


    Question 2: Instead of using print(process('jonny')), how can I input a list of words from a text file? [Let's suppose the name of the input text file is TEST.txt.]

    Assuming ‘TEST.txt’ has an input in each line like

    jonny
    green
    park
    [input#4]
    [input#5]
    

    Then you could read them into a list of strings to loop through and apply process to:

    with open('TEST.txt') as f: 
        inputList = f.read().splitlines()
    
    # for inp in inputList: print(process(inp)) ## OR
    outputList = [process(inp) for inp in inputList] 
    for op in outputList: print(op) 
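
One thing to watch for: process looks the query up with embeddings_index[query], so a blank line or a word that is not in the embeddings file raises a KeyError. The sketch below shows one possible way to filter the input first. The helper read_inputs, the tiny embeddings_index, and the simplified process are hypothetical stand-ins so the example runs on its own. Lower-casing assumes the vocabulary is lower-case, which I believe holds for the glove.6B files.

```python
import io

# Tiny stand-ins so the sketch runs on its own; in the real script
# 'embeddings_index' and 'process' come from the question's code.
embeddings_index = {'green': [0.1, 0.2], 'park': [0.3, 0.4]}

def process(query):
    vec = embeddings_index[query]  # raises KeyError for unknown words
    return {'Demo': sum(vec)}

def read_inputs(f):
    """Return the non-empty lines of an open text file, stripped and lower-cased."""
    return [line.strip().lower() for line in f if line.strip()]

# io.StringIO stands in for open('TEST.txt')
fake_file = io.StringIO('Green\n\npark\nnotaword123\n')
inputList = read_inputs(fake_file)

# Split the inputs into words that can be scored and words that cannot
known = [inp for inp in inputList if inp in embeddings_index]
skipped = [inp for inp in inputList if inp not in embeddings_index]
outputList = [process(inp) for inp in known]

print(known)    # ['green', 'park']
print(skipped)  # ['notaword123']
```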
    

    Question 3: […] How can I save the return in TEST.JSON or TEST.csv file? Basically input and output as same name.

    To save as CSV, you could use the pandas DataFrame method .to_csv:

    import pandas as pd
    # pd.DataFrame(outputList, index=inputList).to_csv('TEST.csv') ## same as:
    # pd.DataFrame([process(i) for i in inputList], index=inputList).to_csv('TEST.csv')
     
    pd.DataFrame(
        [{'input': inp, **process(inp)} for inp in inputList]
    ).set_index('input').to_csv('TEST.csv')
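
If pandas is not available, the standard-library csv module can produce a similar file. This is only a sketch: the results dict holds values rounded from the output shown in the question, and io.StringIO stands in for the real output file. Passing fieldnames also fixes the column order, whatever order process returns the categories in.

```python
import csv
import io

# Values rounded from the question's output, in the shape process returns
results = {
    'green': {'Names': 11.36, 'Places': 3.54, 'Colors': 21.82},
    'park': {'Names': 10.23, 'Places': 8.74, 'Colors': 10.76},
}

buffer = io.StringIO()  # stands in for open('TEST.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=['input', 'Names', 'Colors', 'Places'])
writer.writeheader()
for inp, scores in results.items():
    # One row per input word, with the word itself in the first column
    writer.writerow({'input': inp, **scores})

print(buffer.getvalue())
```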
    

    To save as JSON, you can use json.dump (the two layouts are marked op1 and op2 in the comments):

    import json
    
    with open('TEST.json', 'w') as f: 
        # json.dump([{'input':inp, 'output': process(inp)} for inp in inputList], f) ## op1
        json.dump({inp: process(inp) for inp in inputList}, f) #, indent=4) ## op2
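
A possible pitfall, stated as an assumption because the thread does not report it: if process returns NumPy scalars such as np.float32 (which may depend on the NumPy version), json.dump raises a TypeError, since it only knows built-in number types. Passing default=float asks json to convert anything it cannot serialize with float(). In the sketch below, decimal.Decimal is only a stand-in for such a value so that the example runs without NumPy.

```python
import json
from decimal import Decimal

# Decimal stands in for a numeric type that json cannot serialize
scores = {'green': {'Names': Decimal('11.5'), 'Colors': Decimal('21.75')}}

try:
    json.dumps(scores)
    failed = False
except TypeError:
    failed = True  # json does not know how to encode Decimal

# 'default' is called for every object json cannot encode on its own
text = json.dumps(scores, default=float)
print(failed)  # True
print(text)    # {"green": {"Names": 11.5, "Colors": 21.75}}
```

The same default=float argument is accepted by json.dump when writing to a file.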
    

    Added EDIT:

    Let’s suppose I have a list of text files for this. How would I process all the text files at once and save each return under the same file name? For example, if I use text1.txt, text2.txt, and text3.txt, the returns will be text1.json, text2.json, and text3.json.

    inpFiles = ['text1.txt', 'text2.txt', 'text3.txt']
    # ifLen = len(inpFiles)
    for inpf in inpFiles: # for ifi, inpf in enumerate(inpFiles, 1):
        # print('', end=f'\r[{ifi} of {ifLen}] processing "{inpf}"...')
        with open(inpf) as f: inputList = f.read().splitlines()
        with open(f'{inpf[:-4]}.json', 'w') as f:   
            json.dump({inp: process(inp) for inp in inputList}, f, indent=4)
    

    [Using f'{inpf[:-4]}.json' assumes all file names in inpFiles end with ‘.txt’]
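
If the extensions are not guaranteed to be ‘.txt’, pathlib can swap the extension without slicing. A minimal sketch, where output_path is a hypothetical helper name:

```python
from pathlib import Path

def output_path(input_name, suffix='.json'):
    """Swap the extension of input_name for suffix (hypothetical helper)."""
    return Path(input_name).with_suffix(suffix)

print(output_path('TEST.txt'))           # TEST.json
print(output_path('text1.txt', '.csv'))  # text1.csv
print(output_path('notes.text'))         # notes.json
```

In the loop above, open(f'{inpf[:-4]}.json', 'w') would then become open(output_path(inpf), 'w').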
