This research proposes POaaS-VL, an extension of Prompt Optimization as a Service (POaaS) for multimodal on-device vision-language models (VLMs). POaaS-VL addresses challenges such as visual feature degradation and cross-modal misalignment with a minimal-edit approach, employing dedicated specialists for visual grounding, cross-modal alignment, and image context enrichment. The system is designed to operate within strict mobile resource constraints, aiming to improve task accuracy and reduce visual hallucinations.
Key findings
POaaS-VL extends POaaS with vision-language specialists for on-device VLMs.
The system includes a Visual Grounding Agent, Cross-Modal Aligner, and Image Context Enricher.
POaaS-VL employs a capacity-aware orchestrator for selective specialist activation based on prompt complexity.
The research anticipates a 2-4% improvement in task accuracy and a 3-5% reduction in visual hallucinations on mobile VLMs.
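The capacity-aware orchestration described above could work roughly as follows: score an incoming prompt for complexity, then activate only the specialists whose threshold that score meets, within a device latency budget. The sketch below is illustrative only; the specialist names mirror the components listed above, but the cost figures, complexity scoring, and selection policy are assumptions, not details from the proposal.

```python
from dataclasses import dataclass

@dataclass
class Specialist:
    """Hypothetical registry entry; costs and thresholds are made up."""
    name: str
    cost_ms: float          # assumed per-invocation latency on device
    min_complexity: float   # activate only for prompts at least this complex

SPECIALISTS = [
    Specialist("visual_grounding_agent", cost_ms=40.0, min_complexity=0.3),
    Specialist("cross_modal_aligner",    cost_ms=60.0, min_complexity=0.5),
    Specialist("image_context_enricher", cost_ms=25.0, min_complexity=0.2),
]

def prompt_complexity(prompt: str, num_image_regions: int) -> float:
    """Toy complexity score in [0, 1]: longer prompts and more detected
    image regions are assumed to benefit from more specialist help."""
    text_score = min(len(prompt.split()) / 50.0, 1.0)
    visual_score = min(num_image_regions / 10.0, 1.0)
    return 0.5 * text_score + 0.5 * visual_score

def select_specialists(prompt: str, num_image_regions: int,
                       latency_budget_ms: float) -> list[str]:
    """Capacity-aware selection: activate eligible specialists, cheapest
    first, until the latency budget is exhausted."""
    score = prompt_complexity(prompt, num_image_regions)
    eligible = [s for s in SPECIALISTS if score >= s.min_complexity]
    chosen, spent = [], 0.0
    for s in sorted(eligible, key=lambda s: s.cost_ms):
        if spent + s.cost_ms <= latency_budget_ms:
            chosen.append(s.name)
            spent += s.cost_ms
    return chosen
```

Under this policy, a trivial prompt activates no specialists at all, which is what keeps the approach viable under mobile constraints: the orchestrator spends latency only where the complexity score suggests it will pay off.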
Limitations & open questions
The proposed system has not yet been implemented or tested on actual mobile devices.
The evaluation plan relies on simulated deployment constraints, which may not fully reflect real-world device conditions such as thermal throttling and memory pressure.