NLTK language modeling confusion - Artificial Intelligence

PeymanTahghighi
March 2, 2019
103 views
0 votes
2 Answers

I want to train a language model using NLTK in Python, but I have run into several problems.
First of all, I don't know why my words turn into individual characters when I write something like this:

s = ("Natural-language processing (NLP) is an area of computer science "
     "and artificial intelligence concerned with the interactions "
     "between computers and human (natural) languages.")
s = s.lower()

paddedLine = pad_both_ends(word_tokenize(s), n=2)

train, vocab = padded_everygram_pipeline(2, paddedLine)
print(list(vocab))
lm = MLE(2)
lm.fit(train, vocab)

The printed vocab looks like this, which is clearly not correct (I don't want to work with characters!). This is part of the output:

'<s>', '<', 's', '>', '</s>', '<s>', 'n', 'a', 't', 'u', 'r', 'a', 'l', '-', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '</s>', '<s>', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', '</s>', '<s>', '(', '</s>', '<s>', 'n', 'l', 'p', '</s>', '<s>', ')', '</s>'

Why does my input turn into characters?
I also tried it another way, but with no luck:

paddedLine = pad_both_ends(word_tokenize(s), n=2)
#train, vocab = padded_everygram_pipeline(2, tokens)
#train = everygrams(paddedLine, max_len=2)

train = ngrams(paddedLine, 2)
vocab = Vocabulary(paddedLine, unk_cutoff=1)
print(list(train))

lm = MLE(2)
lm.fit(train, vocab)

When I run this code, my train is completely empty; it prints "[]"!
The weird thing is that when I comment out this line from the code above:

vocab = Vocabulary(paddedLine, unk_cutoff=1)

my train data is fine and looks like this, which is correct:

[('<s>', 'natural-language'), ('natural-language', 'processing'), ('processing', '('), ('(', 'nlp'), ('nlp', ')'), (')', 'is'), ('is', 'an'), ('an', 'area'), ('area', 'of'), ('of', 'computer'), ('computer', 'science'), ('science', 'and'), ('and', 'artificial'), ('artificial', 'intelligence'), ('intelligence', 'concerned'), ('concerned', 'with'), ('with', 'the'), ('the', 'interactions'), ('interactions', 'between'), ('between', 'computers'), ('computers', 'and'), ('and', 'human'), ('human', '('), ('(', 'natural'), ('natural', ')'), (')', 'languages'), ('languages', '.'), ('.', '</s>')]

What's wrong with it?
By the way, I should say that I'm not an expert in Python or NLTK; this is my first experience with both.
My next question is: how can I use Kneser-Ney smoothing or add-one smoothing when training the language model?
And am I training the language model the right way?
My training data is simple:

"Natural-language processing (NLP) is an area of computer science " 
    "and artificial intelligence concerned with the interactions " 
    "between computers and human (natural) languages."

Thanks.

Answers

The padded_everygram_pipeline function expects a list of list of n-grams. You should fix your first code snippet as follows. Also, Python generators are lazy sequences; you can't iterate over them more than once.

from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

# Parentheses let the adjacent string literals join into one string.
s = ("Natural-language processing (NLP) is an area of computer science "
     "and artificial intelligence concerned with the interactions "
     "between computers and human (natural) languages.")
s = s.lower()

paddedLine = [list(pad_both_ends(word_tokenize(s), n=2))]

train, vocab = padded_everygram_pipeline(2, paddedLine)

lm = MLE(2)

lm.fit(train, vocab)

print(lm.counts)
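
The remark about generators also appears to explain the empty `train` in the question's second snippet: the padded token stream is lazy, the vocabulary reads all of it, and the n-gram generator then finds nothing left. The sketch below reproduces that effect in plain Python. `pad` and `bigrams` are hypothetical stand-ins written for this illustration, not NLTK's actual implementations, and the token list is made up.

```python
from collections import Counter


def pad(tokens):
    """Hypothetical stand-in for padding with n=2: lazily yields padded tokens."""
    yield "<s>"
    yield from tokens
    yield "</s>"


def bigrams(sequence):
    """Hypothetical stand-in for a lazy bigram generator over adjacent pairs."""
    iterator = iter(sequence)
    previous = next(iterator, None)
    for current in iterator:
        yield (previous, current)
        previous = current


tokens = ["natural-language", "processing", "is", "fun"]

# Same order of operations as the question's second snippet.
padded = pad(tokens)
train = bigrams(padded)         # lazy: nothing has been read from `padded` yet
vocab_counts = Counter(padded)  # eagerly consumes every item of `padded`
broken = list(train)            # `padded` is already exhausted at this point

# Fix: materialize the padded tokens once, then reuse the list.
padded_list = list(pad(tokens))
fixed = list(bigrams(padded_list))
vocab_counts_fixed = Counter(padded_list)

print(broken)
print(fixed)
```

Turning the padded stream into a list before using it twice is the same idea as the `list(...)` call in the snippet above.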

- VarunVenkatesh
- June 3, 2022 at 2:14 am
- 0 votes
Multiple mistakes in the above answer:
1. "padded_everygram_pipeline function expects a list of list of n-grams": no. If you mean the input, it just needs a list of lists of tokenized words, not n-grams. It generates the n-grams itself.
2. You don't need to do the padding yourself, as padded_everygram_pipeline already does it for you, so there is no need to pad first and then call padded_everygram_pipeline(2, paddedLine).
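
A small plain-Python sketch may help show why the flat token list in the question came out as characters. It assumes only that the pipeline treats each element of its input as one sentence and iterates over it; `pad_each_sentence` is a hypothetical helper written for this illustration, not an NLTK function.

```python
def pad_each_sentence(text):
    """Hypothetical stand-in: treat every element of `text` as one sentence and pad it."""
    return [["<s>"] + list(sentence) + ["</s>"] for sentence in text]


tokens = ["natural-language", "processing"]

# Flat list of words: each word is taken as a "sentence", so iterating over it
# yields single characters.
flat = pad_each_sentence(tokens)

# List of lists: one sentence whose items are whole tokens.
nested = pad_each_sentence([tokens])

print(flat[1])
print(nested)
```

The second element of `flat` has the same shape as the `'<s>', 'p', 'r', 'o', ...` run in the question's output, while `nested` keeps the words intact.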
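One correct way to do this for a single sentence s:
```
from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokens = [list(word_tokenize(s))]  # a list containing one tokenized sentence
train_data_bigram, padded_sent_list = padded_everygram_pipeline(2, tokens)

# To check the everygrams in the result (prints unigrams and bigrams):
for ngramlize_sent in train_data_bigram:
    print(list(ngramlize_sent))
    print()
print('----')

# Prints the padded sentence, i.e. the sentence itself is padded:
print(list(padded_sent_list))

# Both results can only be iterated once and are used up by the checks above,
# so build them again before fitting.
train_data_bigram, padded_sent_list = padded_everygram_pipeline(2, tokens)
lm = MLE(2)
lm.fit(train_data_bigram, padded_sent_list)
print(lm.counts)
```
Similarly, you can change 2 to 3 throughout to test this with a trigram model.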
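
On the smoothing part of the question: as far as I know, nltk.lm also provides `Laplace` (add-one) and `KneserNeyInterpolated` classes that are constructed and fitted like `MLE`, so swapping the class name should be the main change; check the documentation for your NLTK version. The sketch below is not NLTK's implementation. It only shows, in plain Python, the add-one arithmetic for bigrams under the assumption that the vocabulary is exactly the set of padded tokens; the function name and the tiny sentence are made up for illustration.

```python
from collections import Counter


def add_one_bigram_prob(word, previous, unigram_counts, bigram_counts):
    """Add-one estimate: (c(previous, word) + 1) / (c(previous) + V)."""
    vocab_size = len(unigram_counts)
    numerator = bigram_counts[(previous, word)] + 1
    denominator = unigram_counts[previous] + vocab_size
    return numerator / denominator


# Illustrative, already padded sentence (an assumption, not the thread's data).
padded = ["<s>", "computers", "and", "human", "languages", "</s>"]
unigram_counts = Counter(padded)
bigram_counts = Counter(zip(padded, padded[1:]))

# A bigram that occurs once, and one that never occurs.
seen = add_one_bigram_prob("and", "computers", unigram_counts, bigram_counts)
unseen = add_one_bigram_prob("human", "computers", unigram_counts, bigram_counts)

print(seen, unseen)
```

With maximum likelihood the unseen bigram would get probability zero; add-one smoothing gives it a small non-zero share instead. NLTK's own vocabulary handling (for example its unknown-word token) means its numbers can differ from this hand calculation.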