This article gives a step-by-step account of building a lightweight multimodal large language model from scratch. The author first implements a BPE tokenizer and the core Transformer modules by hand, then trains a text backbone at GPT-2 Medium scale. Later sections cover multi-round supervised fine-tuning (SFT), HiRA fine-tuning, adjustments to the data distribution, and task diagnostics, which together give the model basic dialogue and short-instruction capabilities. The article stands out for its practical depth, with code-level explanations of the matrix multiplications and attention mechanisms involved. For developers and researchers interested in the internals of multimodal models, it is a useful reference that connects theory to implementation. The HiRA fine-tuning approach and data-balancing strategies are particularly relevant to readers who want to improve performance on specific tasks without large compute budgets.
A complete walkthrough of building a lightweight multimodal LLM from scratch, covering the BPE tokenizer, Transformer modules, pretraining, SFT, and HiRA fine-tuning.