SmolVLA: Vision-Language-Action Model for Affordable Robotics

A compact, efficient, and community-driven Vision-Language-Action (VLA) model that enables natural language-driven perception and control for robotics applications.

Project Overview

SmolVLA is a small, open VLA model for language-conditioned robot perception and control. Unlike existing massive VLA models with billions of parameters, it is designed to be trained on a single GPU and deployed on consumer-grade hardware or even CPUs, making it accessible to a broader robotics community.

Key Features

  • Compact Architecture: Significantly smaller than existing VLAs while maintaining competitive performance
  • Efficient Training: Can be trained on a single GPU, reducing computational costs
  • Accessible Deployment: Runs on consumer-grade GPUs or CPUs
  • Asynchronous Inference: Decouples perception and action prediction from execution for higher control rates
  • Community-Driven: Built on public community datasets from affordable robotic platforms
  • Flow Matching: Uses advanced training techniques for improved action generation

Technical Architecture

Figure: SmolVLA architecture overview

SmolVLA consists of a compact pretrained vision-language model with a specialized Action Expert that processes three types of inputs:

  1. Language Instructions: Natural language commands for task execution
  2. RGB Images: Visual observations from robot cameras
  3. Robot Sensorimotor State: Current robot joint positions and sensor data

The action expert uses alternating cross-attention and self-attention blocks to generate low-level action sequences, enabling precise robotic control driven by natural language instructions.
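
The sketch below illustrates this interleaving in PyTorch. The layer width, depth, chunk size, action dimension, and class names are illustrative assumptions rather than the released configuration, and the time conditioning needed for flow matching is omitted for brevity.

# Minimal PyTorch sketch of the action expert's alternating attention blocks.
# Dimensions and layer counts are illustrative, not the released configuration.
import torch.nn as nn

class ActionExpertBlock(nn.Module):
    def __init__(self, dim=720, heads=8, cross=False):
        super().__init__()
        self.cross = cross  # cross-attention over VLM features vs. self-attention over action tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, action_tokens, vlm_features):
        q = self.norm1(action_tokens)
        kv = vlm_features if self.cross else q
        attn_out, _ = self.attn(q, kv, kv)
        x = action_tokens + attn_out
        return x + self.mlp(self.norm2(x))

class ActionExpert(nn.Module):
    """Alternates cross-attention over the VLM's features (language, images,
    robot state) with self-attention over the action chunk."""
    def __init__(self, dim=720, depth=8, chunk_size=50, action_dim=7):
        super().__init__()
        self.blocks = nn.ModuleList(
            ActionExpertBlock(dim, cross=(i % 2 == 0)) for i in range(depth)
        )
        self.action_in = nn.Linear(action_dim, dim)   # embed the (noisy) action chunk
        self.action_out = nn.Linear(dim, action_dim)  # project back to action space
        self.chunk_size = chunk_size

    def forward(self, noisy_actions, vlm_features):
        x = self.action_in(noisy_actions)             # (B, chunk_size, dim)
        for blk in self.blocks:
            x = blk(x, vlm_features)                  # vlm_features: (B, T, dim)
        return self.action_out(x)                     # predicted denoising velocity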

Key Technical Features

  • Lightweight Architecture: Only 450 million parameters (100M for action expert) vs. billions in other VLAs
  • Single GPU Training: Designed to train on a single GPU and deploy on consumer hardware
  • Layer Skipping: Uses features from earlier VLM layers to halve computational cost
  • Minimal Visual Tokens: Only 64 tokens per frame with pixel shuffle operation
  • Flow Matching: Better inductive bias for complex action distributions (see the training sketch after this list)
  • Asynchronous Inference: 30% faster task completion with decoupled processing
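
To make the flow-matching objective concrete, the following sketch trains a velocity predictor on action chunks with a straight-line (rectified-flow style) interpolation between Gaussian noise and the ground-truth chunk. The exact time distribution and parameterization used in SmolVLA may differ; model stands for a network like the action expert sketched above, additionally conditioned on the interpolation time.

# Hedged sketch of a conditional flow-matching loss for action chunks.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, vlm_features, actions):
    # actions: (B, chunk_size, action_dim) ground-truth action chunk
    eps = torch.randn_like(actions)                                # noise sample
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # time in [0, 1]
    x_t = (1.0 - t) * eps + t * actions            # point on the straight noise-to-actions path
    target_velocity = actions - eps                # constant velocity of that path
    pred_velocity = model(x_t, t, vlm_features)    # action expert conditioned on VLM features and t
    return F.mse_loss(pred_velocity, target_velocity)

At inference time, an action chunk is produced by starting from Gaussian noise and integrating the predicted velocity field over a small number of Euler steps.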

Architecture Components

  1. Compact Pretrained VLM: Uses the SmolVLM2 backbone, optimized for multi-image inputs
  2. Action Expert: Alternating cross-attention and self-attention blocks
  3. Three Input Types:
    • Natural language instructions
    • RGB camera observations
    • Robot sensorimotor state (projected to token space)
  4. Chunked Action Generation: Outputs n low-level actions per prediction (see the execution sketch after this list)
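
The sketch below shows one simplified way to combine chunked prediction with asynchronous execution: the robot keeps consuming actions from the current chunk while the next chunk is computed in a background thread. The policy and robot interfaces are hypothetical placeholders, not the lerobot API.

# Hedged sketch of asynchronous chunked execution. policy.predict_chunk,
# robot.get_observation, and robot.send_action are hypothetical placeholders,
# not the lerobot API; chunk aggregation and control-period timing are omitted.
import threading
from collections import deque

def run_async_control(policy, robot, max_steps=1000, refill_at=10):
    queue = deque(policy.predict_chunk(robot.get_observation()))  # first chunk of n actions
    pending, next_chunk = None, None

    for _ in range(max_steps):
        # Start predicting the next chunk before the current one runs out.
        if len(queue) <= refill_at and pending is None:
            next_chunk = deque()
            pending = threading.Thread(
                target=lambda obs, out: out.extend(policy.predict_chunk(obs)),
                args=(robot.get_observation(), next_chunk),
            )
            pending.start()
        # Swap in the freshly predicted chunk once it is ready.
        if pending is not None and not pending.is_alive():
            pending.join()
            queue, pending = next_chunk, None
        if queue:
            robot.send_action(queue.popleft())  # execution does not wait on inference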

Community-Driven Training

  • Public Datasets: Trained on community-contributed datasets from affordable platforms
  • Smaller Scale: Only ~30k episodes (10.6M frames) vs. millions in other VLAs
  • Data Standardization: Handles noisy annotations and variable camera conventions (illustrated in the sketch after this list)
  • Real-World Focus: Designed for practical deployment on low-cost robots
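
To make the standardization point concrete, here is a purely hypothetical sketch of the kind of cleanup community episodes need: unifying camera naming conventions and replacing empty or noisy task annotations. The alias table and field names are invented for illustration and do not reflect the actual lerobot preprocessing pipeline.

# Purely hypothetical illustration of community-data standardization:
# the alias table and field names are invented and do not reflect the
# actual lerobot preprocessing code.
CAMERA_ALIASES = {
    "cam_top": "top", "overhead": "top",
    "cam_wrist": "wrist", "gripper_cam": "wrist",
    "webcam": "front", "cam_front": "front",
}

def standardize_episode(episode):
    # Map heterogeneous camera names onto one shared convention.
    episode["images"] = {
        CAMERA_ALIASES.get(name, name): frames
        for name, frames in episode["images"].items()
    }
    # Replace empty or placeholder task annotations with a generic instruction.
    task = (episode.get("task") or "").strip()
    if not task or task.lower() in {"task", "test", "tbd"}:
        task = "complete the demonstrated manipulation task"
    episode["task"] = task
    return episode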

Performance Highlights

  • Competitive Results: Matches or exceeds VLAs 10× larger in size
  • Benchmark Success: Strong performance on LIBERO, Meta-World, SO-100, and SO-101
  • Knowledge Transfer: Effective multitask fine-tuning and cross-task learning
  • Real Robot Validation: Tested and validated on actual robotic platforms

Demonstration Video 1: SmolVLA in action
Demonstration Video 2: Advanced capabilities showcase

Resources & References

Academic Paper

Citation:

@article{shukor2025smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Shukor, Mustafa and Aubakirova, Dana and Capuano, Francesco and Kooijmans, Pepijn and Palma, Steven and Zouitine, Adil and Aractingi, Michel and Pascal, Caroline and Russi, Martino and Marafioti, Andres and Alibert, Simon and Cord, Matthieu and Wolf, Thomas and Cadene, Remi},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

Paper Link: https://arxiv.org/pdf/2506.01844

Code Repository

GitHub Repository: https://github.com/huggingface/lerobot

The repository contains the complete implementation of SmolVLA, with pretrained models and the community training datasets available through the Hugging Face Hub, making the work reproducible and accessible to the robotics research community.
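
As a starting point, a pretrained checkpoint can be loaded roughly as follows. The import path, checkpoint id (lerobot/smolvla_base), observation keys, and tensor shapes are assumptions that may not match the installed lerobot version, so check the repository documentation before use.

# Hedged sketch of loading a pretrained SmolVLA policy with lerobot. The import
# path, checkpoint id, observation keys, and tensor shapes are assumptions and
# may differ across lerobot versions; consult the repository for the exact API.
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# One observation: camera image(s), proprioceptive state, and a language instruction.
observation = {
    "observation.images.top": torch.zeros(1, 3, 256, 256),  # placeholder RGB frame
    "observation.state": torch.zeros(1, 6),                 # placeholder joint state
    "task": ["pick up the cube and place it in the box"],
}

with torch.no_grad():
    action = policy.select_action(observation)               # next low-level action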

Impact & Significance

SmolVLA represents a significant step toward democratizing robotics AI by making powerful vision-language-action capabilities accessible to researchers and developers with limited computational resources. Its open-source nature and efficient design contribute to the broader goal of transparent, reproducible robotics research that can accelerate progress in the field.