Deep Learning: CNNs, Transformers & BERT Fine-Tuning

Implementing core deep learning architectures from scratch in PyTorch, then fine-tuning pretrained models for real classification tasks.

Python

PyTorch

HuggingFace

torchaudio

NumPy

Overview

This project, completed for UC Berkeley's CS 189 (Introduction to Machine Learning), covers the modern deep learning stack from first principles to practical fine-tuning. It spans five sub-projects: a convolutional network and a ResNet built from scratch for image classification, a full Transformer implemented from its mathematical building blocks, a BERT-style model fine-tuned to classify DNA sequences, and a ConvNeXt model fine-tuned to classify urban sounds from audio spectrograms.

Architecture / Approach

CNNs on mini-ImageNet: built a convolutional classifier in PyTorch as custom nn.Module subclasses, trained on a 100-class subset of ImageNet-1k loaded through HuggingFace Datasets, with a hand-written training loop, custom Dataset/DataLoader pipeline, and loss/accuracy visualization.
ResNet from scratch: implemented residual blocks with skip connections and assembled them into a ResNet-18-style architecture (four stages of residual blocks from 64 to 512 channels plus a classification head), then trained and compared it against the plain CNN.
Transformer from scratch: implemented softmax, scaled dot-product attention, single attention heads, multi-head attention, positional encodings, feed-forward networks, and full encoder and decoder layers — then stacked them into a complete Transformer trained on next-token prediction, including writing the inference loop for autoregressive generation.
BERT fine-tuning (DNABERT): tokenized raw DNA sequences into k-mers, loaded a pretrained DNABERT-6 backbone, attached a custom classification head on the [CLS] embedding, and fine-tuned the model to classify which species (human, dog, or chimpanzee) a DNA sequence came from.
ConvNeXt on UrbanSound8K: converted audio clips into spectrograms with torchaudio and compared three fine-tuning paradigms on a ConvNeXt image model — training from scratch, freezing the pretrained backbone, and full fine-tuning — to classify urban sounds.

Results / What I learned

Both fine-tuning tasks concluded with class-wide Kaggle competitions on held-out test sets. The DNABERT classifier placed 32nd of 449 (top 8%) on the leaderboard for the 3-class DNA species task — a noisy genomics benchmark where random guessing yields 33% and my model scored 47%. The ConvNeXt sound classifier reached 94% test accuracy on UrbanSound8K, placing 76th of 401. The biggest takeaway was seeing how the same handful of primitives — attention, convolutions, residual connections, and a pretrained backbone with a task-specific head — power everything from image classification to genomics to audio.

Source code is not published because this was graded university coursework; this page describes the approach and results instead.