EVAR: Edge Visual Autoregressive Models
via Principled Pruning

Zefang Wang1,2, Yanyu Li3, Mingluo Su1, Simin Xu1, Guanzhong Tian2,*, Huan Wang1,*

1Westlake University     2Zhejiang University     3Snap Inc.

*Corresponding authors


Abstract

Recent advances in generative modeling have catalyzed demand for on-device single-image synthesis. However, the stringent compute and memory budgets of resource-constrained edge hardware hinder the deployment of large-scale models. Next-scale visual autoregressive (VAR) models—which predict finer-scale content conditioned on coarser resolutions—offer strong fidelity, generalization, and improved inference efficiency, yet remain costly to run on such devices.

We introduce EVAR, an efficient structured-pruning framework tailored to next-scale VAR models and edge deployment. EVAR instantiates a principled pruning paradigm: it couples Optimal Brain Surgeon–guided, Hessian-aware sensitivity estimation with closed-form weight updates, and augments them with scale-aligned calibration and compensation. By grounding pruning decisions in second-order optimality and executing updates analytically, EVAR mitigates compression-induced degradation while preserving next-scale conditioning—turning sparsification from a heuristic into a disciplined procedure. To further address scale-wise gradient and loss imbalance during fine-tuning, we propose Progressive Scale-Aware Distillation (PSAD), which leverages VAR's multi-scale generative hierarchy to reweight scales and enforce cross-scale consistency in the pruned model.
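To make the OBS-guided step concrete, here is a minimal sketch of the classic Optimal Brain Surgeon formulas the abstract refers to: the Hessian-aware saliency of removing weight q is w_q² / (2·[H⁻¹]_qq), and the closed-form compensation is δw = −(w_q / [H⁻¹]_qq)·H⁻¹ e_q. This toy single-weight version is illustrative only; EVAR applies the idea at the granularity of attention heads and FFN channels, which this sketch does not model.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One Optimal Brain Surgeon step: remove the weight whose deletion
    minimally increases the (locally quadratic) loss, and compensate the
    surviving weights with the closed-form update."""
    # Saliency of removing weight q: s_q = w_q^2 / (2 * [H^-1]_qq)
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    # Closed-form compensation: delta_w = -(w_q / [H^-1]_qq) * H^-1[:, q]
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = 0.0  # enforce an exact zero on the pruned weight
    return w_new, q

# Toy quadratic loss with Hessian H and weights w
H = np.array([[2.0, 0.5], [0.5, 1.0]])
H_inv = np.linalg.inv(H)
w = np.array([0.1, 1.5])
w_pruned, idx = obs_prune_one(w.copy(), H_inv)
```

The small weight (index 0) has the lowest saliency and is pruned, while the remaining weight is nudged analytically to absorb the error, rather than being left to fine-tuning alone.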

On ImageNet single-image generation benchmarks, EVAR reduces parameter count, memory footprint, and end-to-end latency while retaining competitive generative quality. On an iOS deployment, EVAR further cuts single-image latency from 494 ms to 277 ms (1.8× speedup), with FID changing only marginally.

Method Overview

Method Pipeline

Figure 1: EVAR Framework. We introduce the first principled pruning framework tailored to next-scale visual autoregressive (VAR) models, enabling efficient edge deployment. Our approach makes three key contributions. (1) OBS-Guided Adaptive Pruning: an Optimal Brain Surgeon (OBS)-guided pruning and compensation framework with Hessian-aware sensitivity estimation that applies closed-form weight updates to prune attention heads and FFN channels while minimizing reconstruction error. (2) Progressive Scale-Aware Distillation (PSAD): a distillation scheme that addresses the scale-wise gradient and loss imbalance inherent to next-scale VAR architectures, leveraging VAR's multi-scale generative hierarchy through progressive scale unlocking and a scale-weighted distillation loss to enforce cross-scale consistency during fine-tuning. (3) End-to-End Edge Deployment: a complete on-device deployment and evaluation pipeline for single-image VAR generation on iOS devices using the CoreML inference engine, demonstrating practical, real-world performance gains. The pipeline also employs pre-encode calibration: real images are encoded through VAR's VQ-VAE to build calibration sets that faithfully reflect inference-time token distributions, grounding pruning decisions in second-order optimality.

Video Demo

Real-time Demo: EVAR running on iPad Pro (M4), showing the 1.8× speedup and high-quality image generation