I’m trying to fine-tune a GPT-2 model using the Hugging Face Transformers library, but I’m encountering a segmentation fault during training. Here’s a minimal reproducible example:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
from typing import Dict
import os

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load pre-trained model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the pad_token to the eos_token
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_name)

# Move model to GPU and enable bf16 precision
model = model.to(device, dtype=torch.bfloat16)

def preprocess_function_proofnet_simple(examples: Dict[str, list], tokenizer: GPT2Tokenizer, max_length: int = 512) -> Dict[str, torch.Tensor]:
    """
    Preprocess the input data for the proofnet dataset.

    Args:
        examples: The examples to preprocess.
        tokenizer: The tokenizer for encoding the texts.

    Returns:
        The processed model inputs.
    """
    inputs = [f"{examples['nl_statement'][i]}{tokenizer.eos_token}{examples['formal_statement'][i]}" for i in range(len(examples['nl_statement']))]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = model_inputs.input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

# Load the dataset
dataset_path = "hoskinson-center/proofnet"
dataset = load_dataset(dataset_path)

# Select only 10 examples for training and validation
small_train_dataset = dataset['validation'].select(range(10))
small_val_dataset = dataset['test'].select(range(10))

# Preprocess the dataset
train_dataset = small_train_dataset.map(lambda examples: preprocess_function_proofnet_simple(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])
val_dataset = small_val_dataset.map(lambda examples: preprocess_function_proofnet_simple(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])

# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# # Training arguments
# training_args = TrainingArguments(
#     output_dir=os.path.expanduser("~/tmp/gpt2_trainer"),
#     overwrite_output_dir=True,
#     num_train_epochs=3,  # Train for 3 epochs
#     per_device_train_batch_size=2,
#     save_steps=10_000,
#     save_total_limit=2,
#     bf16=True,  # Enable bf16 training only
#     logging_dir=os.path.expanduser("~/tmp/gpt2_trainer/logs"),
#     logging_steps=200,
#     report_to="none"  # Disable logging to WandB
# )
# Training arguments
from pathlib import Path
output_dir_train: Path = Path('~/tmp').expanduser()
output_dir_train.mkdir(parents=True, exist_ok=True)
training_args = TrainingArguments(
    output_dir=output_dir_train,
    max_steps=2,  # TODO get rid of this in favour of 1 or 2 or 3 epochs
    # num_train_epochs=num_train_epochs, 
    gradient_accumulation_steps=2,  # based on alpaca, allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
    gradient_checkpointing = True,  # TODO depending on hardware set to true?
    max_grad_norm=1.0, # TODO once real training change?
    lr_scheduler_type='cosine',  # TODO once real training change? using what I've seen most in vision 
    # logging_strategy='epoch', # TODO
    save_steps=100, # Save checkpoint every 100 steps
    save_total_limit=3, # save last 3
    logging_steps=10,  # Frequency of logging steps
    # evaluation_strategy='no',  # "no": no evaluation is done during training; this can help avoid memory issues.
    eval_strategy='no',  # "no": no evaluation is done during training; this can help avoid memory issues.
    # evaluation_strategy="steps",  # TODO Evaluate model at specified steps
    # eval_steps=110,  # TODO Evaluate every 110 steps
    # remove_unused_columns=False,  # TODO ,
    report_to='none',  # options I recommend: 'none', 'wandb'
    fp16=False,  # never ever set to True
    # full_determinism=True,  # TODO periphery, Ensure reproducibility
    # torchdynamo="nvfuser",  # TODO periphery, Use NVFuser backend for optimized torch operations
    # dataloader_prefetch_factor=2,  # TODO periphery, Number of batches to prefetch
    # dataloader_pin_memory=True,  # TODO periphery, Pin memory in data loaders for faster transfer to GPU
    # dataloader_num_workers=16,  # TODO Number of subprocesses for data loading

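)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the model
trainer.save_model()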


When I run this code, I get the following error:

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:06<00:00,  1.13s/it]
max_steps is given, it will override any value given in num_train_epochs
  0%|                                                                                                           | 0/2 [00:00<?, ?it/s]
/home/ubuntu/.virtualenvs/snap_cluster_setup/lib/python3.11/site-packages/torch/nn/parallel/ FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.device(device),, autocast(enabled=autocast_enabled):
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/ubuntu/.virtualenvs/snap_cluster_setup/lib/python3.11/site-packages/torch/_dynamo/ UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/home/ubuntu/.virtualenvs/snap_cluster_setup/lib/python3.11/site-packages/torch/nn/parallel/ UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
/home/ubuntu/.virtualenvs/snap_cluster_setup/lib/python3.11/site-packages/torch/utils/ FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Segmentation fault (core dumped)

The main issue is the last line:

Segmentation fault (core dumped)

There is no traceback, and I’m not directly using any C code.

Any ideas on what might be causing this segmentation fault and how to resolve it?

Note: the commented-out TrainingArguments block above seems to work.
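
For the missing traceback, one option is Python's standard-library faulthandler module, which prints the Python-level stack at the moment a native crash occurs. The sketch below only illustrates the mechanism: it crashes a child interpreter on purpose with ctypes.string_at(0) instead of running the training script, and run_with_faulthandler is a hypothetical helper name, not part of any library.

```python
import subprocess
import sys

# Child script that crashes on purpose: reading address 0 through ctypes
# triggers a segmentation fault (an access violation on Windows).
CRASHING_SCRIPT = "import ctypes; ctypes.string_at(0)"


def run_with_faulthandler(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with faulthandler enabled via -X."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


result = run_with_faulthandler(CRASHING_SCRIPT)
print("return code:", result.returncode)
# stderr now holds a "Fatal Python error" header followed by the Python stack
# of the crashing thread, instead of only "Segmentation fault (core dumped)".
print(result.stderr)
```

For a real script, the same effect should come from launching it as python -X faulthandler your_script.py (the script name is a placeholder) or from calling faulthandler.enable() at the top of the file. The dump shows which Python line was executing when the native code crashed, which can narrow the problem down to a specific library call.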




  1. I have encountered this error multiple times before. It is not related to C or to faulty code; it is caused by a PyTorch installation issue. Running torch on a GPU requires a CUDA version that matches the torch version. After determining which PyTorch version you need, you must also match the requirements for architecture, OS, and GPU. Refer to this link for the latest PyTorch installation, or to this link for installing older versions.

    You need to correctly match the following for the PyTorch version you are using:

    1. Architecture, e.g. x64, 32-bit, aarch64 (Jetson boards, Raspberry Pi), etc.
    2. Operating system, e.g. Linux, Windows, etc.
    3. NVIDIA drivers. Run nvidia-smi to check that the GPU is accessible.
    4. NVIDIA CUDA toolkit version. Check the compatible versions at the PyTorch links above.
    5. Python version.

    I would recommend using the conda package manager rather than pip, for simplicity. You can refer to this question for more info.

  2. This error is due to a mismatch in one of PyTorch's prerequisite packages when using it on a GPU. PyTorch binaries are very specific to a given architecture, OS, Python version, etc. You can also check the installed CUDA toolkit version with the command nvcc --version. A mismatch in any of these prerequisite versions can crash the PyTorch core, causing this error.
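
The versions mentioned in the answers above can be gathered in one place with a short script. This is a minimal sketch, not an official diagnostic: collect_versions and parse_nvcc_release are hypothetical helper names, and the script assumes that nvcc, if installed, is on PATH. It only reports the versions; deciding whether they are compatible still means checking them against the PyTorch installation matrix.

```python
import platform
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output: str):
    """Extract the 'release X.Y' version from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def collect_versions() -> dict:
    """Gather the version facts that a PyTorch GPU install has to agree on."""
    info = {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),  # e.g. x86_64 or aarch64
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": None,
        "nvcc_release": None,
    }
    try:
        import torch
    except ImportError:
        pass  # torch is not installed; leave its fields as None
    else:
        info["torch"] = torch.__version__
        info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()

    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        completed = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        info["nvcc_release"] = parse_nvcc_release(completed.stdout)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

If torch_cuda_build is None while a GPU is present, the installed wheel is a CPU-only build. A difference between torch_cuda_build and nvcc_release is only a hint to look closer, since it does not by itself prove the installation is broken.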

