How to Run a Text Classification Model
Environment Setup
Install Conda in the /project Directory
- We recommend using Miniconda for a lightweight and efficient Python environment setup. To install Miniconda3 in the /project directory (recommended, to avoid the 20 GB storage limit in /home), run the following commands. Note that they use ~/miniconda3 as the installation path; to follow the /project recommendation, replace ~/miniconda3 with a path under /project in each command:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
These four commands download the 64-bit Linux installer, save it under a shorter file name, install it silently, and then delete the installer.
You may download a different version of Miniconda by following the instructions in the Miniconda documentation.
Then add Miniconda to your system's $PATH environment variable by running:
export PATH=~/miniconda3/bin:$PATH
To ensure that Miniconda is always available, add the above line to your shell configuration file (.bashrc) by running:
echo 'export PATH=~/miniconda3/bin:$PATH' >> ~/.bashrc
conda init
source ~/.bashrc
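To verify the installation, you can run a quick sanity check; both commands should succeed and point into your Miniconda installation:
which conda
conda --version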
Creating a New Conda Environment
Create and activate a new Conda environment with Python 3.10 by running the following commands:
conda create -n myenv python=3.10
conda activate myenv
After activating the environment, install PyTorch with CUDA 11.8 support and the other necessary dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 scikit-learn -c pytorch -c nvidia
pip install transformers==4.35.0 evaluate==0.4.1 datasets
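Before submitting any GPU job, it is worth verifying that PyTorch can see CUDA. A minimal check from the command line (on a login node without GPUs, the second value will report False even when the installation is correct):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"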
Running a Task
We show an example of how to run a text classification task using a pretrained language model and a public dataset from Hugging Face.
Download the classification Python script run_classification.py from the Hugging Face Transformers examples (examples/pytorch/text-classification); switch the branch to v4.35-release before downloading, so the script matches the installed transformers==4.35.0, and put it under your working directory.
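For reference, one way to fetch the script directly is with wget, assuming the standard layout of the Transformers repository on the v4.35-release branch:
wget https://raw.githubusercontent.com/huggingface/transformers/v4.35-release/examples/pytorch/text-classification/run_classification.py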
Preparing a SLURM Submission Script
Create a shell script file (e.g., by running touch run_job.sh in the terminal) and open it for editing. Below is an example script that you can copy:
#!/bin/bash
#SBATCH --job-name=text_classification # Job name
#SBATCH --output=output_%j.txt # Output log file (%j will be replaced by the job ID)
#SBATCH --error=error_%j.txt # Error log file
#SBATCH --ntasks=1 # Number of tasks (typically 1 for a single Python script)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --gres=gpu:1 # Request 1 GPU (adjust based on the available resources)
#SBATCH --mem=16G # Memory allocation
#SBATCH --time=12:00:00 # Maximum run time (HH:MM:SS)
#SBATCH --partition=bigTiger # Partition to submit to (adjust based on availability)
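# Activate the Conda environment created earlier, so the batch job uses the
# right Python and packages (assumes Miniconda was installed to ~/miniconda3
# as shown above; adjust the path if you installed under /project)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv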
# Set the dataset variable
dataset="imdb"
# Run the classification task on the selected dataset
python run_classification.py \
--model_name_or_path google-bert/bert-base-uncased \
--dataset_name ${dataset} \
--trust_remote_code True \
--shuffle_train_dataset \
--metric_name accuracy \
--text_column_name text \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--output_dir /tmp/${dataset}/
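Note that /tmp on a compute node is usually node-local scratch, so anything written there may be cleaned up when the job ends. To keep the trained model and evaluation results, point --output_dir at persistent storage instead; for example (the path below is illustrative, adjust it to your own /project layout):
--output_dir /project/${USER}/${dataset}/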
Submit the Task
Use the following command to submit the job:
sbatch run_job.sh
After submitting the job, you can view detailed information about a specific job (replace <job_id> with the job ID printed by sbatch):
scontrol show job <job_id>
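While the job is running, you can also check its queue status and follow the log file (again replacing <job_id> with your job's ID):
squeue -u $USER
tail -f output_<job_id>.txt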