Tencent and Qwen Trained Robot Policies Without Teleoperation

Weekly Physical AI roundup.

Jun 18, 2026

Tencent and Qwen each built their training sets from human demonstrations and open data rather than fresh robot teleoperation, and both report cross-embodiment transfer from it. Tencent shipped its full stack under Apache-2.0; Qwen’s weights are still behind a pilot.

Two open stacks, trained off the robot

Tencent’s Hy-Embodied-0.5-VLA is the more complete release: Apache-2.0 weights, training code, and an open 2,163-hour slice of a roughly 10,000-hour dataset collected with a custom fingertip UMI rig tracked by optical mocap. The model pairs a mixture-of-transformers backbone with a flow-matching action expert and a relative end-effector action representation meant to carry across embodiments, with a reward-free offline RL stage (FlowPRO) on top. It reports around 90% on RoboTwin 2.0, which leads the open VLAs. The claim worth testing is cross-embodiment manipulation trained from human UMI data with no teleoperation on the target robot.
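To make "relative end-effector action" concrete, here is a minimal sketch of one common way to express it: the next gripper pose written in the current gripper's frame. This is an illustration, not Tencent's parameterization, which the summary above does not spell out; the helper names (`pose`, `relative_action`, `apply_action`) and the example poses are assumptions made up for the demo.

```python
import numpy as np

def pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_action(T_now, T_next):
    """Next end-effector pose expressed in the current end-effector frame."""
    return np.linalg.inv(T_now) @ T_next

def apply_action(T_now, delta):
    """Recover the absolute target pose from a relative action."""
    return T_now @ delta

# Two consecutive end-effector poses recorded in some world frame (made-up values).
T0 = pose(0.10, [0.40, 0.00, 0.30])
T1 = pose(0.25, [0.42, 0.03, 0.28])
delta = relative_action(T0, T1)

# The same motion observed from a different base frame, e.g. another robot's base.
base_shift = pose(1.20, [-0.50, 0.80, 0.10])
delta_shifted = relative_action(base_shift @ T0, base_shift @ T1)

# The relative action does not depend on where the base frame sits.
print(np.allclose(delta, delta_shifted))
```

Because the base transform cancels out, the same action label describes the motion whether it came from a handheld rig or from any arm, which is the property a cross-embodiment representation needs.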

Qwen-RobotSuite makes the same bet at a larger scale. RobotManip is a VLA built on Qwen3.5-4B and trained on about 38,100 hours assembled only from open robot datasets, human video, and human-to-robot synthesis; it reports 3x the prior cross-embodiment state of the art. The suite also includes RobotWorld, a language-conditioned video world model meant to generate synthetic training data, and RobotNav for navigation. The catch is availability: two of the three Qwen models reportedly have public repos, but downloadable weights are still gated behind an enterprise pilot, so for now Tencent’s is the one you can actually run.

The shared question is whether the no-teleop transfer survives on third-party robots. If it does, a lot of the field’s collection effort moves off the robot and onto human video.

Research

MolmoMotion: a fully open motion-forecasting model

Ai2’s MolmoMotion predicts where 3D points on an object will move given a frame, the marked points, and a text instruction, built on the Molmo 2 backbone. The release is complete, with weights, the 1.16M-video MolmoMotion-1M dataset, and the PointMotionBench benchmark, which makes it one of the few things this week you can pick up and run end to end.

ENPIRE: hand the RL loop to a coding agent

NVIDIA GEAR’s ENPIRE wires frontier coding agents into the real-world robot-RL loop: the agent resets the scene, runs rollouts, reads papers, rewrites the training code, and repeats with no human in the loop. Run across Codex, Claude Code, and Kimi, it reports roughly 99% on four contact-rich tasks. It also describes a “physical scaling” effect where eight robots coordinating over Git cut Push-T training from five hours to two, a 2.5x speedup. No code or weights have been released and the numbers are single-setup, so read it as a provocation rather than a result.

FRS: steer a flow policy at inference

Flow Reversal Steering (Finn, Levine, and colleagues) takes a mediocre action, runs it backward through a flow-matching policy to recover its noise, then remaps it to a nearby better mode, turning coarse human or VLM guidance into usable actions. Distilling that back into the policy reportedly buys up to a 95-point gain in success rate in under a minute of training.
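The "run it backward to recover its noise" step rests on the fact that a flow policy samples by integrating an ODE, and an ODE can be integrated in reverse. The sketch below shows only that inversion on a toy problem; it is not the FRS algorithm, and the remapping to a better mode is left out because the summary above does not describe it. The velocity field, `MODE`, and `integrate` are hypothetical stand-ins for a learned policy.

```python
import numpy as np

# Toy stand-in for a learned velocity field: it pulls samples toward one mode.
MODE = np.array([0.5, -0.2])

def velocity(x, t):
    """Velocity of the flow at state x and time t (time-independent in this toy)."""
    return MODE - x

def integrate(x, t_start, t_end, steps=1000):
    """Euler-integrate dx/dt = velocity(x, t) from t_start to t_end."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

noise = np.array([1.0, -1.0])
action = integrate(noise, 0.0, 1.0)             # sampling: noise -> action
recovered_noise = integrate(action, 1.0, 0.0)   # reversal: action -> noise

# The round trip recovers the starting noise up to Euler discretization error.
print(np.allclose(recovered_noise, noise, atol=1e-2))
```

Once an action has a noise coordinate, steering can in principle happen in noise space and be pushed forward through the same flow, which is why the inversion matters.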

WEAVER: a world model that moves real-robot numbers

CMU’s WEAVER is a multi-view flow-matching world model that predicts future latents and reward. It reports 0.870 correlation with real success for policy evaluation, a 38% gain when used to improve a foundation policy, and faster test-time planning, with code and models released.

μ₀: pretrain on motion traces, not pixels or actions

μ₀ forecasts smooth 3D trajectories of interaction keypoints as an embodiment-agnostic interface, and mines that supervision from human and robot video with a data engine called TraceExtract. A frozen μ₀ plus a small action expert reaches parity with π0 despite using no action labels in pretraining.

SPARC: auto-label demonstrations at scale

SPARC annotates robot demos with bounding boxes, object trajectories, and manipulation-phase labels, each with a reliability score, and keeps about 3x more usable samples at high-precision thresholds than prior labelers. Code, data, and models are out.

FTP-1: one tactile policy across sensors

FTP-1 is a tactile-manipulation foundation model that maps 21 different sensors into a shared representation, pretrained on 3,000 hours, and reports a 31% success gain on sensors it never saw in training. Pretrained models, data, and code are released.

HUG and YUBI: collecting dexterous data off the robot

Two releases push hand-data collection away from teleoperation. HUG trains a grasp model on about 1M frames of egocentric smart-glasses video and retargets to robot hands zero-shot, shipping code, data, a benchmark, and a demo. YUBI takes the hardware route with a finger-aligned gripper that maps human finger motion directly to the jaws, releasing the rig, the software, and an 8,434-hour bimanual dataset that transfers across UR, Franka, and ELEY arms.

Quick hits

QGF (Levine again) steers flow policies at test time with a value gradient and no extra training, a natural companion to FRS this week. Hi-VLA from DeepMind benchmarks hierarchical VLA design on a real ALOHA robot and finds that a principled hierarchy beats both flat control and naive hierarchies. FACTR 2 estimates external joint torques on commodity arms with no force sensor, trained from ten minutes of free motion. ATOM-Bench is a real-world benchmark showing that policies that nail atomic skills still fall apart on unseen compositions. CHORUS LoRA-tunes a single π0.5 so a multi-robot team coordinates with no communication at inference. ORCA is an open, LeRobot-integrated stack for dexterous-hand research, and NVIDIA published a readable survey of world-action models as an alternative to VLAs.

Industry

NEURA Robotics raises up to $1.4B

German humanoid maker NEURA Robotics closed a Series C of up to $1.4B at a roughly $7B valuation, led by Tether, with Qualcomm, Amazon, NVIDIA, and Bosch also investing. It’s the largest Physical AI round on record and the biggest VC round in German history, which puts a European humanoid program on the same capital footing as the US and Chinese ones.

Genesis AI launches Eno

Genesis AI unveiled Eno, a general-purpose mobile manipulator co-designed with its GENE foundation model and pitched on long-horizon, memory-using task execution. There are no technical specs or training details yet; industrial deployments are slated for late 2026.

Worth Watching

RSS 2026 (Sydney, July 13–17): paper notifications and project pages start landing this month.
