How to Run a Text Classification Model

Environment Setup

Install Conda in the `/project` Directory

We recommend using Miniconda for a lightweight and efficient Python environment setup. To install Miniconda3 in the /project directory (recommended to avoid the 20 GB storage limitations in /home), run the following commands:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

These four commands download the 64-bit version of the Linux installer, rename it to a shorter file name, silently install it, and then delete the installer.

You may download a different version of Miniconda by following the instructions available at Miniconda Documentation.

And add to your system’s $PATH environment variable by running:

export PATH=~/miniconda3/bin:$PATH

To ensure that Miniconda is always available, add the above line to your shell configuration file (.bashrc) by running:

echo 'export PATH=~/miniconda3/bin:$PATH' >> ~/.bashrc
conda init
source ~/.bashrc

Creating a new Conda Environment

Create a new Conda environment with Python 3.10 by running the following command:

conda create -n myenv python=3.10
conda activate myenv

After activating the environment, install pytorch with CUDA 11.8 support and other necessary dependencies:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 scikit-learn -c pytorch -c nvidia
pip install transformers==4.35.0 evaluate==0.4.1 datasets

Running a Task:

We show an example of how to run a text classification task using pretrained language model and public dataset on Huggingface:

Download the classification python code from here, change branch to v4.35-release before downloading and put the scripts under your directory.

Preparing a SLURM submission script

Create a shell script file (e.g., by running touch run_job.sh in the terminal) and open it for editing. Below is an example script that you can copy:

#!/bin/bash
#SBATCH --job-name=text_classification       # Job name
#SBATCH --output=output_%j.txt               # Output log file (%j will be replaced by the job ID)
#SBATCH --error=error_%j.txt                 # Error log file
#SBATCH --ntasks=1                           # Number of tasks (typically 1 for a single Python script)
#SBATCH --cpus-per-task=4                    # Number of CPU cores per task
#SBATCH --gres=gpu:1                         # Request 1 GPU (adjust based on the available resources)
#SBATCH --mem=16G                            # Memory allocation
#SBATCH --time=12:00:00                      # Maximum run time (HH:MM:SS)
#SBATCH --partition=bigTiger                 # Partition to submit to (adjust based on availability)
# Set dataset and subset variables
dataset="imdb"
# Run the classification task using the dataset and subset variables
python run_classification.py \
    --model_name_or_path google-bert/bert-base-uncased \
    --dataset_name ${dataset} \
    --trust_remote_code True \
    --shuffle_train_dataset \
    --metric_name accuracy \
    --text_column_name text \
    --do_train \
    --do_eval \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --output_dir /tmp/${dataset}/

Submit the Task

Use the following command to submit the job:

sbatch run_job.sh

After submitting the job, you can view detailed information about a specific job (replace with your actual job ID) with the following commands:

scontrol show job <job_id>