Vision-Language-Action (VLA)
VLAs are embodied foundation models that frame robot control as a multimodal sequence
modeling task. They take the current observation o and a language instruction
l and generate actions by modeling the conditional distribution p(a | o, l).
In this formulation, observation and language map directly to action. Semantic grounding is strong, but the policy remains fundamentally reactive: the future physical evolution of the scene is never explicitly represented.
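To make the formulation concrete, the sketch below shows one common way to realize p(a | o, l) as sequence modeling: image patches and language tokens are embedded into a single token sequence, and a transformer autoregressively decodes discretized action tokens. The class `ToyVLA`, all dimensions, and the greedy decoding loop are illustrative assumptions, not the architecture of any specific published model.

```python
# Minimal sketch of the VLA formulation p(a | o, l).
# Hypothetical architecture and dimensions; causal masking over action
# tokens is omitted for brevity.
import torch
import torch.nn as nn


class ToyVLA(nn.Module):
    """Autoregressive policy over discretized action tokens,
    conditioned on image-patch tokens (o) and language tokens (l)."""

    def __init__(self, vocab_size=512, n_action_bins=256, d_model=256,
                 n_heads=4, n_layers=4, action_dim=7):
        super().__init__()
        self.action_dim = action_dim
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)    # flattened 16x16 RGB patches
        self.lang_embed = nn.Embedding(vocab_size, d_model)    # language token ids
        self.action_embed = nn.Embedding(n_action_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_action_bins)   # logits over action bins

    def forward(self, patches, lang_ids, prev_action_ids):
        # patches: (B, P, 768), lang_ids: (B, L), prev_action_ids: (B, T)
        o = self.patch_embed(patches)
        l = self.lang_embed(lang_ids)
        a = self.action_embed(prev_action_ids)
        x = torch.cat([o, l, a], dim=1)          # one multimodal token sequence
        h = self.backbone(x)
        # Next-action-token logits from the last position: p(a_t | o, l, a_<t)
        return self.action_head(h[:, -1])


# Usage: greedily decode one 7-DoF action, one discretized dimension at a time.
model = ToyVLA()
patches = torch.randn(1, 64, 16 * 16 * 3)            # current observation o
lang_ids = torch.randint(0, 512, (1, 12))             # instruction l
action_ids = torch.zeros(1, 1, dtype=torch.long)      # start-of-action token
for _ in range(model.action_dim):
    logits = model(patches, lang_ids, action_ids)
    next_tok = logits.argmax(dim=-1, keepdim=True)
    action_ids = torch.cat([action_ids, next_tok], dim=1)
print(action_ids[:, 1:])  # 7 discretized action tokens
```

Note that nothing in this mapping predicts the next observation: the model conditions only on the present o and the instruction l, which is exactly why the policy is reactive rather than predictive.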