Chess puzzle SFT/RL on Qwen3.
Quick setup (one command):
bash setup_env.sh
Or manual setup:
# Create conda environment (Python 3.12)
conda create -n c1 python=3.12 -y
conda activate c1
# Install verl and LLaMA-Factory from git
pip install --no-deps --no-build-isolation \
"verl @ git+https://github.com/volcengine/verl.git@facd9fb50193522f87983b89f886afe8c0810acc" \
"llamafactory @ git+https://github.com/hiyouga/LLaMA-Factory.git@a711bce664faade03b540ad30c41707ba8c928ad"
# Install other dependencies (adjust path if needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install vllm flash-attn transformers "datasets>=2.16.0,<=4.0.0" python-chess
Key packages installed:
- torch: 2.8.0 + cu128
- vllm: 0.11.0
- flash-attn: 2.8.1
- transformers: 4.57.0
- datasets: >=2.16.0,<=4.0.0
- python-chess: 1.999 (chess 1.11.2)
- verl: git facd9fb
- llamafactory: git a711bce
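To confirm that an environment matches the versions listed above, a small check along these lines may help. It is a minimal sketch: the `EXPECTED` table and the `find_mismatches` helper are illustrative names, not part of this repository, and it assumes the packages are registered under their usual PyPI distribution names.

```python
from importlib import metadata

# Expected versions copied from the package list above. The keys are assumed
# to be the PyPI distribution names used at install time.
EXPECTED = {
    "torch": "2.8.0",
    "vllm": "0.11.0",
    "flash-attn": "2.8.1",
    "transformers": "4.57.0",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that differs.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "+cu128" are ignored in the comparison.
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

Running it inside the `c1` environment prints one line per package whose version differs or is missing, and nothing when everything matches.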
Create api_keys.json in the project root:
{
"openrouter": {
"api_key": "your-openrouter-key"
},
"wandb": {
"api_key": "your-wandb-key",
"entity": "your-wandb-entity"
},
"huggingface": {
"token": "your-hf-token"
}
}
⚠️ Never commit api_keys.json - it's already in .gitignore.
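For scripts that need these credentials, one way to read the file is sketched below. How the project's own scripts consume `api_keys.json` is not shown here, so `load_api_keys` and `export_to_env` are hypothetical helpers. `WANDB_API_KEY`, `WANDB_ENTITY`, and `HF_TOKEN` are the variables read by wandb and huggingface_hub; `OPENROUTER_API_KEY` is only a naming convention assumed for this example.

```python
import json
import os
from pathlib import Path

REQUIRED_SECTIONS = ("openrouter", "wandb", "huggingface")


def load_api_keys(path="api_keys.json"):
    """Read api_keys.json and check that the three top-level sections exist."""
    keys = json.loads(Path(path).read_text())
    missing = [name for name in REQUIRED_SECTIONS if name not in keys]
    if missing:
        raise KeyError(f"api_keys.json is missing sections: {missing}")
    return keys


def export_to_env(keys, env=os.environ):
    """Copy the keys into environment variables (hypothetical mapping)."""
    env["WANDB_API_KEY"] = keys["wandb"]["api_key"]
    env["WANDB_ENTITY"] = keys["wandb"]["entity"]
    env["HF_TOKEN"] = keys["huggingface"]["token"]
    # Assumed variable name; adjust to whatever the calling code expects.
    env["OPENROUTER_API_KEY"] = keys["openrouter"]["api_key"]
```

Calling `export_to_env(load_api_keys())` from the project root sets the variables for the current process and any child processes it starts.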
Training data is generated from chess puzzles using the Gemini API:
cd code
python 1_cot_generation.py --cfg configs/gemini3_flash.yaml # Gemini 3 Flash data
python 1_cot_generation.py --cfg configs/gemini3.5_flash.yaml # Gemini 3.5 Flash data
Then convert to LLaMA-Factory format:
python 2_format_matching.py --register
Data files:
- /data/train_sft_gemini-3-flash.json - Generated with Gemini 3 Flash
- /data/train_sft_gemini-3.5-flash.json - Generated with Gemini 3.5 Flash
Always use scripts/sft.sh to run training (never call llamafactory-cli directly):
conda activate c1
# 0.6B models
bash scripts/sft.sh configs/qwen3-0.6b-gemini3-flash.yaml # GPU 0-3
bash scripts/sft.sh configs/qwen3-0.6b-gemini3.5-flash.yaml # GPU 4-7
# 4B models
bash scripts/sft.sh configs/qwen3-4b-gemini3-flash.yaml # GPU 0-3
bash scripts/sft.sh configs/qwen3-4b-gemini3.5-flash.yaml # GPU 4-7
GPU Allocation:
- Training jobs: 4 GPUs each (DDP)
- Run parallel jobs on disjoint GPU sets (e.g., GPU 0-3 and GPU 4-7)
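The two-jobs-on-disjoint-GPUs pattern above can also be driven from Python. This is a sketch only: it assumes `scripts/sft.sh` honors `CUDA_VISIBLE_DEVICES` from its environment, and `gpu_env`, `check_disjoint`, and `launch_all` are hypothetical helpers rather than part of the repository.

```python
import os
import subprocess


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment restricted to the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


def check_disjoint(*gpu_sets):
    """Raise ValueError if any GPU id appears in more than one set."""
    seen = set()
    for gpus in gpu_sets:
        overlap = seen & set(gpus)
        if overlap:
            raise ValueError(f"GPUs assigned twice: {sorted(overlap)}")
        seen |= set(gpus)


def launch_all(jobs):
    """Start one sft.sh process per config and wait for all of them.

    `jobs` maps a config path to the list of GPU ids it should use.
    Returns the exit codes in the same order as `jobs`.
    """
    check_disjoint(*jobs.values())
    procs = [
        subprocess.Popen(["bash", "scripts/sft.sh", cfg], env=gpu_env(gpus))
        for cfg, gpus in jobs.items()
    ]
    return [p.wait() for p in procs]
```

For example, `launch_all({"configs/qwen3-0.6b-gemini3-flash.yaml": [0, 1, 2, 3], "configs/qwen3-0.6b-gemini3.5-flash.yaml": [4, 5, 6, 7]})` mirrors the two 0.6B commands shown earlier.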
Outputs:
- Checkpoints: /data1/C1/qwen3-{size}/sft_{dataset}/checkpoint-*
- Logs: /logs/sft_train_{name}_{timestamp}.log
- WandB: lilvjosephtang-university-of-toronto/c1_sft
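When picking a checkpoint out of a run directory, sorting by name puts `checkpoint-80` after `checkpoint-320`, so it helps to sort by the numeric step. The sketch below assumes the `checkpoint-*` suffix is the integer training step; `list_checkpoints` is an illustrative helper, not a script in this repository.

```python
import re
from pathlib import Path

# Matches directory names such as "checkpoint-320" and captures the step.
_STEP = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(run_dir):
    """Return checkpoint directories under run_dir, ordered by step number."""
    found = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = _STEP.match(path.name)
        if match and path.is_dir():
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found, key=lambda item: item[0])]
```

The last element of the returned list is the most recent checkpoint; an empty list means the run has not saved one yet.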
Training configs:
- Base models: /data1/models/Qwen/Qwen3-0.6B, /data1/models/Qwen/Qwen3-4B
- LoRA rank: 32
- Training steps: 320
Evaluate trained models on the test set:
conda activate c1
# Evaluate gemini3-flash model
CUDA_VISIBLE_DEVICES=0,1,2,3 python code/sft_eval.py \
--base_model_path /data1/models/Qwen/Qwen3-4B \
--lora_dir /data1/C1/qwen3-4b/sft_gemini3_flash \
--output_dir ../outputs/val_sft_gemini3_flash
# Evaluate gemini3.5-flash model
CUDA_VISIBLE_DEVICES=4,5,6,7 python code/sft_eval.py \
--base_model_path /data1/models/Qwen/Qwen3-4B \
--lora_dir /data1/C1/qwen3-4b/sft_gemini3.5_flash \
--output_dir ../outputs/val_sft_gemini3.5_flash
Parameters:
- --base_model_path: Path to base Qwen3 model
- --lora_dir: Path to trained LoRA adapters
- --output_dir: Where to save predictions
- --test_data_path: Test data path (default: ../data/test.parquet)
- --tensor_parallel_size: Number of GPUs for vLLM (default: 4)
- --temperature: Sampling temperature (default: 0.0 for deterministic)
- --top_p: Nucleus sampling (default: 1.0)
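The flags and defaults above correspond to an argument parser roughly like the following. This is a reconstruction from the parameter list, not the actual code in `sft_eval.py`; in particular, treating the three path arguments as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented sft_eval.py flags."""
    parser = argparse.ArgumentParser(description="Evaluate SFT LoRA checkpoints")
    # Marking these three as required is an assumption of this sketch.
    parser.add_argument("--base_model_path", required=True,
                        help="Path to base Qwen3 model")
    parser.add_argument("--lora_dir", required=True,
                        help="Path to trained LoRA adapters")
    parser.add_argument("--output_dir", required=True,
                        help="Where to save predictions")
    # Defaults below are the ones stated in the parameter list.
    parser.add_argument("--test_data_path", default="../data/test.parquet")
    parser.add_argument("--tensor_parallel_size", type=int, default=4)
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--top_p", type=float, default=1.0)
    return parser
```

With only the three paths supplied, the parsed namespace carries the documented defaults: 4 GPUs, temperature 0.0, and top_p 1.0.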
Outputs:
- Predictions: {output_dir}/checkpoint-*.jsonl
- Logs: /logs/sft_eval_{name}_{timestamp}.log
Make sure the conda environment is activated:
conda activate c1
Set model path and run:
export C1_MODEL_PATH=/data1/C1/qwen3-0.6b/sft_gemini3_flash
bash scripts/grpo_qwen-0.6b.sh # 0.6B model
bash scripts/grpo_qwen-4b.sh # 4B model
bash scripts/dapo_qwen-4b.sh # DAPO variant
WandB projects:
- SFT: c1_sft (entity: lilvjosephtang-university-of-toronto)
- RL (GRPO/DAPO): c1_rl
Override with WANDB_PROJECT env var if needed.
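The override behaves like a simple fallback: use `WANDB_PROJECT` when it is set, otherwise the stage's default project. The snippet below sketches that rule; `resolve_wandb_project` is a hypothetical helper, and the training scripts may implement the lookup differently.

```python
import os

# Default project names taken from the list above.
DEFAULT_PROJECTS = {"sft": "c1_sft", "rl": "c1_rl"}


def resolve_wandb_project(stage, env=None):
    """Return WANDB_PROJECT if set and non-empty, else the stage default."""
    env = os.environ if env is None else env
    return env.get("WANDB_PROJECT") or DEFAULT_PROJECTS[stage]
```

For example, running with `WANDB_PROJECT=c1_debug` in the environment would send both SFT and RL runs to `c1_debug` under this rule.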
C1/
├── api_keys.json # API keys (not in git)
├── code/
│ ├── 1_cot_generation.py # Data generation from Gemini API
│ ├── 2_format_matching.py # Convert to LLaMA-Factory format
│ ├── sft_eval.py # SFT evaluation script
│ └── configs/ # Config files for data generation
├── configs/
│ ├── qwen3-0.6b-gemini3-flash.yaml
│ ├── qwen3-0.6b-gemini3.5-flash.yaml
│ ├── qwen3-4b-gemini3-flash.yaml
│ └── qwen3-4b-gemini3.5-flash.yaml
├── data/
│ ├── test.parquet # Test set
│ └── train_sft_*.json # Training data
├── logs/ # Training and evaluation logs
├── outputs/ # Evaluation predictions
├── scripts/
│ ├── sft.sh # SFT training wrapper
│ ├── grpo_qwen-*.sh # GRPO training
│ └── dapo_qwen-*.sh # DAPO training
└── setup_env.sh # Environment setup script
- Always use scripts/sft.sh for training - it handles wandb config properly
- GPU allocation: Set CUDA_VISIBLE_DEVICES before running commands
- vLLM 0.11.0 required for Qwen3 support
- Use temperature=0.0 for deterministic evaluation