Notes on solving a movie review sentiment analysis task.

Data Preprocessing

Dataset downloading

Download the IMDb movie review dataset from Hugging Face

from datasets import load_dataset

dataset = load_dataset("imdb")

Data Cleaning

As part of preprocessing, remove characters that carry no useful information (for example, leftover HTML tags such as <br /> in the IMDb reviews).
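A minimal sketch of the cleaning step, assuming the "useless characters" are mainly HTML tags, HTML entities, and redundant whitespace. The helper name `clean_text` and the exact rules are assumptions for illustration, not the procedure used in these notes.

```python
import html
import re

# Hypothetical helper: the cleaning rules below are assumptions and should
# be adapted to whatever noise actually appears in the data.
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    text = _TAG_RE.sub(" ", text)   # e.g. "<br />" becomes a space
    text = html.unescape(text)      # e.g. "&amp;" becomes "&"
    return _WS_RE.sub(" ", text).strip()


print(clean_text("Great movie!<br /><br />Loved   it &amp; more."))
```

With the `datasets` library, a function like this could be applied to every review through `dataset.map(lambda ex: {"text": clean_text(ex["text"])})`.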

Tokenize the data

Load a pretrained tokenizer

Tokenize and transform the data
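The tokenization step might look like the sketch below. It assumes a Hugging Face tokenizer, which can be called on a list of strings with `truncation`, `padding`, and `max_length`. The function name `tokenize_batch` and the `max_length` of 256 are assumptions; the notes do not state a value.

```python
from typing import Any, Callable, Dict, List


def tokenize_batch(
    batch: Dict[str, List[str]],
    tokenizer: Callable[..., Dict[str, Any]],
    max_length: int = 256,  # assumed value, not taken from the notes
) -> Dict[str, Any]:
    """Tokenize the "text" column, padding and truncating to max_length."""
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )


# Intended use (left commented out because it downloads the tokenizer):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenized = dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)
```

Padding every example to a fixed length keeps all tensors the same shape, which makes later batching straightforward at the cost of some wasted computation on short reviews.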

Split the data into training set, validation set and test set
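The IMDb dataset on Hugging Face ships with 25,000 labeled training reviews and 25,000 labeled test reviews but no validation split, so a validation set has to be carved out of the training data. The sketch below does this on indices only. The helper name `split_indices`, the 10% validation fraction, and the seed are assumptions.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, val_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle range(n) reproducibly and return (train, validation) indices."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_val = round(n * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(25_000)
print(len(train_idx), len(val_idx))
```

The index lists can then be passed to `dataset["train"].select(...)`. The `datasets` library also offers `train_test_split(test_size=..., seed=...)`, which does the same job in one call.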

Convert the dataset to a DataLoader

Create a custom dataset class (subclassing torch.utils.data.Dataset)

import torch
from torch.utils.data import Dataset


class SentimentDataset(Dataset):
    """Wraps a tokenized dataset and returns each example as tensors."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Convert the tokenized fields and the label to long tensors
        input_ids = torch.tensor(item["input_ids"], dtype=torch.long)
        attention_mask = torch.tensor(item["attention_mask"], dtype=torch.long)
        label = torch.tensor(item["label"], dtype=torch.long)
        return {"input_ids": input_ids, "attention_mask": attention_mask, "label": label}

Create a DataLoader (convenient for batch training)
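A sketch of the DataLoader step follows. It uses a tiny in-memory stand-in for the tokenized data (the `toy_data` list is made up for illustration) so that it runs without downloading anything. The batch size of 4 is an assumption chosen to match the training arguments in these notes.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SentimentDataset(Dataset):
    """Same idea as the class above, repeated so this sketch runs on its own."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        return {
            "input_ids": torch.tensor(item["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(item["attention_mask"], dtype=torch.long),
            "label": torch.tensor(item["label"], dtype=torch.long),
        }


# Made-up examples, already padded to the same length (6 tokens each)
toy_data = [
    {"input_ids": [101, 2023, 102, 0, 0, 0],
     "attention_mask": [1, 1, 1, 0, 0, 0],
     "label": i % 2}
    for i in range(10)
]

train_loader = DataLoader(SentimentDataset(toy_data), batch_size=4, shuffle=True)
batch = next(iter(train_loader))
print(batch["input_ids"].shape, batch["label"].shape)
```

The default collate function stacks the per-example tensors into one tensor per key, which only works when every example has the same length. A DataLoader like this is mainly useful for a hand-written training loop, since `Trainer` builds its own data loaders from the datasets it receives.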

Load and Fine-Tune a Pretrained Model

Load a pretrained transformer model

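from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # binary sentiment: negative / positive
    hidden_dropout_prob=0.4,  # higher dropout to reduce overfitting
    attention_probs_dropout_prob=0.4
)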

Fine-tune the model

Training arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # called eval_strategy in newer transformers releases
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-6,  # further lowered learning rate
    per_device_train_batch_size=4,  # reduced batch size
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,  # accumulate gradients: effective batch size = 8
    num_train_epochs=2,  # fewer epochs to avoid overfitting
    weight_decay=0.1,  # increased weight decay
    logging_dir="./logs",
    report_to="none",
    load_best_model_at_end=True  # required by EarlyStoppingCallback, which raises an error otherwise
)
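The arithmetic behind the batch size settings can be sketched as below. The helper names are hypothetical, a single device is assumed, and the step count is an approximation that assumes all 25,000 IMDb training reviews are used.

```python
import math


def effective_batch_size(
    per_device_batch_size: int, accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def approx_updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Rough number of optimizer updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)


eff = effective_batch_size(per_device_batch_size=4, accumulation_steps=2)
print(eff)                                    # 8
print(approx_updates_per_epoch(25_000, eff))  # 3125
```

Gradient accumulation trades time for memory: each forward and backward pass only holds 4 examples, but the weights are updated as if the batch contained 8.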

Trainer setup

from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]  # stop training early to avoid overfitting
)
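To make the effect of `early_stopping_patience` concrete, here is a simplified, stand-alone sketch of patience-based early stopping. It is not the library's implementation. It assumes the monitored metric is the evaluation loss (lower is better), and the function name `stopping_epoch` is hypothetical.

```python
from typing import List, Optional


def stopping_epoch(eval_losses: List[float], patience: int = 1) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on the best value
    seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0      # improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None


print(stopping_epoch([0.40, 0.35, 0.37, 0.30], patience=1))  # 3
```

With only two epochs and one evaluation per epoch, the stop can trigger no earlier than the final epoch, so in this configuration most of the practical benefit likely comes from `load_best_model_at_end=True` restoring the better of the two checkpoints.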

Train the model

trainer.train()

Evaluate the Model

Confusion Matrix
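A minimal sketch of a binary confusion matrix, written in plain Python so the counting logic is visible. It assumes label 1 means a positive review and label 0 a negative one, as in the IMDb dataset. The function name and the sample labels are made up for illustration.

```python
from typing import Dict, Sequence


def binary_confusion_matrix(
    labels: Sequence[int], preds: Sequence[int]
) -> Dict[str, int]:
    """Count true/false negatives and positives, with 1 as the positive class."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(labels, preds):
        if y == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


# Made-up labels and predictions
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]

cm = binary_confusion_matrix(labels, preds)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, precision, recall)
```

In practice `sklearn.metrics.confusion_matrix(labels, preds)` returns the same counts as a 2x2 array laid out as [[tn, fp], [fn, tp]]. The predictions could come from something like `trainer.predict(test_dataset).predictions.argmax(-1)`, where `test_dataset` is a hypothetical name for the held-out set.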

Compute metrics

Define a function to compute accuracy, precision, recall, and F1

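from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # predicted class = index of the largest logit
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Note: evaluate() only reports these metrics if compute_metrics=compute_metrics
# was passed when the Trainer was constructed; otherwise it reports the loss only.
eval_results = trainer.evaluate()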
