So I have a JSON dataset that is in the following format:
[
  {
    "answer": "...",
    "question": "...",
    "context": "..."
  },
  ...
]
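For reference, I load the file into a pandas DataFrame roughly like this (the filename here is just a placeholder for my actual file):

import json
import pandas as pd

# "qa_data.json" is a placeholder name for a file in the format above
with open("qa_data.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)  # columns: answer, question, context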
All of the fields are normal plaintext. My goal is to train a pretrained BERT model on it for the task of question answering. Reviewing the HuggingFace docs, it seems that the data needs to match the SQuAD dataset format, the context needs to be truncated to a maximum of 384 tokens, and special tokens like [CLS] need to be added.
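As far as I can tell, the tokenizer handles the truncation and the special tokens on its own; a quick sanity check like the following (just a sketch with made-up strings) seems to confirm that:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = tokenizer(
    "What colour is the sky?",          # question
    "The sky is blue on a clear day.",  # context
    max_length=384,
    truncation="only_second",           # only the context gets truncated
)
print(tokenizer.decode(enc["input_ids"]))
# roughly: [CLS] what colour is the sky? [SEP] the sky is blue on a clear day. [SEP]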
While trying to follow the docs, I got to this site for question answering, where they provide a sample function to do the preprocessing. Reproduced here for simplicity:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
With the following sample code:
from transformers import AutoTokenizer
from datasets import Dataset

dataset = Dataset.from_pandas(df)  # df is a three-column DataFrame: | answer | question | context |
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_ds = dataset.map(preprocess_function, batched=True)
I get the following error when I try to run it.
File "c:/project.py", line 234, in preprocess_function
inputs = tokenizer(
TypeError: 'list' object is not callable
So my question: is this just a silly error on my part, or should I be preprocessing this dataset in a different manner? I am new to HuggingFace and NLP in general, so I am doing this as a fun project. Thanks in advance!
UPDATE 1: Following the tutorial more closely, I put my data into a Datasets object and used the map function properly, but it still says the tokenizer is the problem (TypeError: 'list' object is not callable), even though it is the exact same tokenizer used in the tutorial.
Answers
Solved it myself! My solution:
1. Load the df into a Datasets object so that I could actually call the map function.
2. Globally declare the tokenizer above the preprocessing function (thanks to this question for the help). A rough sketch of the layout for points 1 and 2 is just below.
3. Change the preprocessing function to match my dataset; the exact edits follow the sketch.
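The overall layout for points 1 and 2 ends up roughly like this (a sketch only; the function body here is a stand-in for the full version shown further down):

from transformers import AutoTokenizer
from datasets import Dataset

# point 2: the tokenizer lives at module level, above the function,
# so preprocess_function can see it when dataset.map() calls it
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    # stand-in body; the real (modified) version is further down
    return tokenizer(
        [q.strip() for q in examples["question"]],
        examples["context"],
        max_length=384,
        truncation="only_second",
        padding="max_length",
    )

# point 1: the DataFrame goes into a Dataset so that .map() is available
dataset = Dataset.from_pandas(df)  # df built from the JSON as in the question
tokenized_ds = dataset.map(preprocess_function, batched=True)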
answers = examples["answers"]
-->answers = examples["answer"]
created
contexts = examples["context"]
start_char = answer["answer_start"][0]
end_char = answer["answer_start"][0] + len(answer["text"][0])
-->
start_char = contexts[i].find(answer)
end_char = start_char + len(answer)
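Putting those edits together, the preprocessing function ends up looking roughly like this (same structure as the tutorial function, with only my changes applied; note I have not handled the case where the answer text is not found in the context):

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    contexts = examples["context"]  # kept around so the answer span can be located by hand
    inputs = tokenizer(
        questions,
        contexts,
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answer"]  # my column is "answer", not SQuAD-style "answers"
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        # my data has no answer_start field, so look the answer up in the context
        # (str.find returns -1 if the answer text is missing; not handled here)
        start_char = contexts[i].find(answer)
        end_char = start_char + len(answer)
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs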