Notes from working through a movie review sentiment analysis task~
Data Preprocessing
Dataset Downloading
Download the movie review dataset (IMDb) from Hugging Face.
```python
from datasets import load_dataset
dataset = load_dataset("imdb")
```
Data Cleaning
This is part of the preprocessing step: remove useless characters from the raw reviews. For IMDb this mainly means HTML remnants such as <br /> tags, as in the sketch below.
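A minimal cleaning sketch, assuming the raw text column is named "text" (as in the Hugging Face IMDb release) and that stripping HTML line breaks and extra whitespace is all the cleaning intended here:

```python
import re

def clean_text(text):
    # IMDb reviews contain HTML line breaks such as <br />; replace them with spaces
    # and collapse any repeated whitespace.
    text = re.sub(r"<br\s*/?>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

dataset = dataset.map(lambda example: {"text": clean_text(example["text"])})
```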
Tokenize the data
Load a pretrained tokenizer and tokenize the text.
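A sketch of the tokenization step, assuming the bert-base-uncased tokenizer (to match the model loaded later); the max length of 256 is an arbitrary choice here:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad/truncate every review to a fixed length so batches can be stacked into tensors.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)
```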
Split the data into training, validation and test sets.
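IMDb ships with train and test splits only, so one way to get a validation set is to carve it out of the train split; the 90/10 ratio and the variable names below are assumptions:

```python
# Hold out 10% of the training data for validation; keep the official test split for final evaluation.
split = tokenized["train"].train_test_split(test_size=0.1, seed=42)
train_data, val_data = split["train"], split["test"]
test_data = tokenized["test"]
```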
Create a custom dataset class (inheriting from torch.utils.data.Dataset).
```python
import torch
from torch.utils.data import Dataset

class SentimentDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        input_ids = torch.tensor(item["input_ids"], dtype=torch.long)
        attention_mask = torch.tensor(item["attention_mask"], dtype=torch.long)
        label = torch.tensor(item["label"], dtype=torch.long)
        return {"input_ids": input_ids, "attention_mask": attention_mask, "label": label}
```
Create DataLoaders for convenient batch training, as in the sketch below.
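A sketch that wraps the splits from above in the SentimentDataset class and builds DataLoaders. Note that the Trainer used later constructs its own loaders internally, so these are mainly useful for manual inspection or a custom training loop; the batch size of 4 mirrors the TrainingArguments below.

```python
from torch.utils.data import DataLoader

# Wrap the tokenized splits (split names are assumptions carried over from the earlier sketch).
train_dataset = SentimentDataset(train_data)
val_dataset = SentimentDataset(val_data)
test_dataset = SentimentDataset(test_data)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4)
```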
Load and Fine-Tune a Pretrained Model
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    hidden_dropout_prob=0.4,
    attention_probs_dropout_prob=0.4,
)
```
Fine-tune the model.
Training Arguments Setup
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    weight_decay=0.1,
    logging_dir="./logs",
    report_to="none",
    load_best_model_at_end=True,
)
```
Trainer Setup
```python
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    # compute_metrics (defined in the evaluation section below) can also be passed here
    # so that trainer.evaluate() reports accuracy, precision, recall and F1.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
```
Train the model
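With the Trainer configured, training is a single call:

```python
trainer.train()
```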
Evaluate the Model
Confusion Matrix
Compute metrics
Define a function to compute accuracy, precision, recall and F1.
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

eval_results = trainer.evaluate()
```
Plot Confusion Matrix
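A plotting sketch using scikit-learn and matplotlib, assuming predictions are taken on the held-out test set defined earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Run the fine-tuned model over the test set and take the argmax class per review.
pred_output = trainer.predict(test_dataset)
y_true = pred_output.label_ids
y_pred = pred_output.predictions.argmax(-1)

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["negative", "positive"]).plot(cmap="Blues")
plt.title("Confusion Matrix on the IMDb Test Set")
plt.show()
```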
Visualize Classification Report
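The per-class breakdown can come straight from scikit-learn's classification_report, reusing y_true and y_pred from the confusion-matrix step; the original may have rendered this as a heatmap rather than plain text.

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```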
Visualize the Training Loss and Validation Loss
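A sketch that reads the loss curves from the Trainer's log history ("loss" entries come from the logging steps, "eval_loss" from the per-epoch evaluations):

```python
import matplotlib.pyplot as plt

history = trainer.state.log_history
train_points = [(h["step"], h["loss"]) for h in history if "loss" in h]
eval_points = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train_points), label="training loss")
plt.plot(*zip(*eval_points), marker="o", label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.title("Training vs. Validation Loss")
plt.show()
```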