EVAR: Edge Visual Autoregressive Models
via Principled Pruning

Zefang Wang1,2, Yanyu Li3, Mingluo Su1, Simin Xu1, Guanzhong Tian2,*, Huan Wang1,*

1Westlake University     2Zhejiang University     3Snap Inc.

*Corresponding authors


Abstract

Recent advances in generative modeling have catalyzed demand for on-device single-image synthesis. However, the stringent compute and memory budgets of resource-constrained edge hardware hinder the deployment of large-scale models. Next-scale visual autoregressive (VAR) models—which predict finer-scale content conditioned on coarser resolutions—offer strong fidelity, generalization, and improved inference efficiency, yet remain costly to run on such devices.

We introduce EVAR, an efficient structured-pruning framework tailored to next-scale VAR models and edge deployment. EVAR instantiates a principled pruning paradigm: it couples Optimal Brain Surgeon–guided, Hessian-aware sensitivity estimation with closed-form weight updates, and augments them with scale-aligned calibration and compensation. By grounding pruning decisions in second-order optimality and executing updates analytically, EVAR mitigates compression-induced degradation while preserving next-scale conditioning—turning sparsification from a heuristic into a disciplined procedure. To further address scale-wise gradient and loss imbalance during fine-tuning, we propose Progressive Scale-Aware Distillation (PSAD), which leverages VAR's multi-scale generative hierarchy to reweight scales and enforce cross-scale consistency in the pruned model.
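To make the OBS-guided step concrete, here is a minimal sketch of the classic Optimal Brain Surgeon formulas the abstract refers to: the Hessian-aware saliency of removing weight q is w_q² / (2·[H⁻¹]_qq), and the closed-form compensation is δw = −(w_q / [H⁻¹]_qq)·H⁻¹ e_q. This toy single-weight version is illustrative only; EVAR applies the idea at the granularity of attention heads and FFN channels, which this sketch does not model.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One Optimal Brain Surgeon step: remove the weight whose deletion
    minimally increases the (locally quadratic) loss, and compensate the
    surviving weights with the closed-form update."""
    # Saliency of removing weight q: s_q = w_q^2 / (2 * [H^-1]_qq)
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    # Closed-form compensation: delta_w = -(w_q / [H^-1]_qq) * H^-1[:, q]
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = 0.0  # enforce an exact zero on the pruned weight
    return w_new, q

# Toy quadratic loss with Hessian H and weights w
H = np.array([[2.0, 0.5], [0.5, 1.0]])
H_inv = np.linalg.inv(H)
w = np.array([0.1, 1.5])
w_pruned, idx = obs_prune_one(w.copy(), H_inv)
```

The small weight (index 0) has the lowest saliency and is pruned, while the remaining weight is nudged analytically to absorb the error, rather than being left to fine-tuning alone.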

On ImageNet single-image generation benchmarks, EVAR reduces parameter count, memory footprint, and end-to-end latency while retaining competitive generative quality. On an iOS deployment, EVAR further cuts single-image latency from 494 ms to 277 ms (1.8× speedup), with FID changing only marginally.

Method Overview

Method Pipeline

Figure 1: EVAR Framework. We introduce the first principled pruning framework tailored to next-scale visual autoregressive (VAR) models, enabling efficient edge deployment. Our approach makes three key contributions. (1) OBS-Guided Adaptive Pruning: an Optimal Brain Surgeon (OBS)-guided pruning and compensation framework with Hessian-aware sensitivity estimation that applies closed-form weight updates to prune attention heads and FFN channels while minimizing reconstruction error. (2) Progressive Scale-Aware Distillation (PSAD): a distillation scheme that addresses the scale-wise gradient and loss imbalance inherent to next-scale VAR architectures, leveraging VAR's multi-scale generative hierarchy through progressive scale unlocking and a scale-weighted distillation loss to enforce cross-scale consistency during fine-tuning. (3) End-to-End Edge Deployment: a complete on-device deployment and evaluation pipeline for single-image VAR generation on iOS devices using the CoreML inference engine, demonstrating practical, real-world performance gains. The pipeline also employs pre-encode calibration: real images are encoded through VAR's VQ-VAE to build calibration sets that faithfully reflect inference-time token distributions, grounding pruning decisions in second-order optimality.

Video Demo

Real-time Demo: EVAR running on iPad Pro (M4), showing the 1.8× speedup and high-quality image generation