
So I have a JSON dataset that is in the following format:

[{
    "answer": "...",
    "question": "...",
    "context": "..."
  },
 ...
]

All of the fields are plain text. My goal is to train a pretrained BERT model on it for the task of question answering. From reviewing the HuggingFace docs, it seems the data needs to match the SQuAD dataset format, the context needs to be truncated to a maximum of 384 tokens, and special tokens like [CLS] need to be added.
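
For reference, a single SQuAD-style record looks roughly like this (my sketch; the "answers" / "text" / "answer_start" field names come from the SQuAD dataset, not from my JSON):

# One SQuAD-style example: "answers" holds parallel lists, and
# "answer_start" is the character offset of the answer within "context".
squad_style_example = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "answers": {
        "text": ["William Shakespeare"],
        "answer_start": [31],
    },
}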

When trying to follow the docs, I got to this page for question answering, which provides a sample preprocessing function, reproduced here for simplicity:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

With the following sample code:

from transformers import AutoTokenizer
from datasets import Dataset

dataset = Dataset.from_pandas(df)  # where df is a 3-column DataFrame: | answer | question | context |

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_ds = dataset.map(preprocess_function, batched=True)

I get the following error when I try to run it.

File "c:/project.py", line 234, in preprocess_function
    inputs = tokenizer(
TypeError: 'list' object is not callable

So my question: Is this just a silly error on my part, or should I be preprocessing this dataset in a different manner? I am new to HuggingFace and NLP in general, so I am doing this as a fun project. Thanks in advance!

UPDATE 1: Following the tutorial more closely, I put my data into a Dataset object and used the map function properly, but now it says the tokenizer is the problem (TypeError: 'list' object is not callable), even though it is the exact same tokenizer used in the tutorial.

2 Answers


  1. Chosen as BEST ANSWER

    Solved it myself! My solution:

    1. Load the df into a Dataset object so that I could actually call the map function

    2. Globally declare the tokenizer above the preprocessing function

      thanks to this question for the help

    3. Change the preprocessing function to match my dataset (the full function with these changes applied is sketched after this list).

      • answers = examples["answers"] --> answers = examples["answer"]

      • created contexts = examples["context"]

      • start_char = answer["answer_start"][0]

        end_char = answer["answer_start"][0] + len(answer["text"][0])

        -->

        start_char = contexts[i].find(answer)

        end_char = start_char + len(answer)
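
    Putting those changes together, the adjusted function looks roughly like this (my sketch, which assumes each "answer" string appears verbatim in its "context"; I also added a guard for the case where .find() returns -1):

     from transformers import AutoTokenizer
     from datasets import Dataset
     import pandas as pd

     # Declared globally, above the preprocessing function
     tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

     def preprocess_function(examples):
         questions = [q.strip() for q in examples["question"]]
         contexts = examples["context"]
         answers = examples["answer"]

         inputs = tokenizer(
             questions,
             contexts,
             max_length=384,
             truncation="only_second",
             return_offsets_mapping=True,
             padding="max_length",
         )

         offset_mapping = inputs.pop("offset_mapping")
         start_positions = []
         end_positions = []

         for i, offset in enumerate(offset_mapping):
             answer = answers[i]
             # Locate the answer span by character offset within the context
             start_char = contexts[i].find(answer)
             end_char = start_char + len(answer)
             sequence_ids = inputs.sequence_ids(i)

             # Find the start and end of the context tokens
             idx = 0
             while sequence_ids[idx] != 1:
                 idx += 1
             context_start = idx
             while sequence_ids[idx] == 1:
                 idx += 1
             context_end = idx - 1

             # If the answer is missing or not fully inside the (truncated) context, label it (0, 0)
             if start_char == -1 or offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                 start_positions.append(0)
                 end_positions.append(0)
             else:
                 idx = context_start
                 while idx <= context_end and offset[idx][0] <= start_char:
                     idx += 1
                 start_positions.append(idx - 1)

                 idx = context_end
                 while idx >= context_start and offset[idx][1] >= end_char:
                     idx -= 1
                 end_positions.append(idx + 1)

         inputs["start_positions"] = start_positions
         inputs["end_positions"] = end_positions
         return inputs

     # Toy stand-in for the real 3-column DataFrame from the question
     df = pd.DataFrame({
         "question": ["Who wrote Hamlet?"],
         "context": ["Hamlet is a tragedy written by William Shakespeare."],
         "answer": ["William Shakespeare"],
     })
     dataset = Dataset.from_pandas(df)
     tokenized_ds = dataset.map(preprocess_function, batched=True)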


  2. from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

     questions = dataset['question'].tolist()
     contexts = dataset['context'].tolist()

     qas = preprocess_function({"question": questions, "context": contexts}, tokenizer)
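
     Note: this snippet appears to assume that preprocess_function has been modified to accept the tokenizer as an explicit second parameter instead of reading it from a global, and that dataset here is the pandas DataFrame (a column pulled from a datasets.Dataset is already a plain list with no .tolist()). A rough sketch of that signature change, not necessarily the exact code the answer had in mind:

     def preprocess_function(examples, tokenizer):
         questions = [q.strip() for q in examples["question"]]
         inputs = tokenizer(
             questions,
             examples["context"],
             max_length=384,
             truncation="only_second",
             return_offsets_mapping=True,
             padding="max_length",
         )
         # ... answer-span labelling as in the question's version ...
         return inputs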
    