This research proposes POaaS-VL, an extension of Prompt Optimization as a Service (POaaS) for multimodal on-device vision-language models (VLMs). POaaS-VL addresses challenges such as visual feature degradation and cross-modal misalignment with a minimal-edit approach, employing dedicated specialists for visual grounding, cross-modal alignment, and image context enrichment. The system is designed to operate within strict mobile resource constraints, aiming to improve task accuracy and reduce visual hallucinations.
Key findings
POaaS-VL extends POaaS with vision-language specialists for on-device VLMs.
The system includes a Visual Grounding Agent, Cross-Modal Aligner, and Image Context Enricher.
POaaS-VL employs a capacity-aware orchestrator for selective specialist activation based on prompt complexity.
The research anticipates a 2-4% improvement in task accuracy and a 3-5% reduction in visual hallucinations on mobile VLMs.
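The capacity-aware orchestration described above could work roughly as follows: score an incoming prompt for complexity, then activate only the specialists whose threshold that score meets, within a device latency budget. The sketch below is illustrative only; the specialist names mirror the components listed above, but the cost figures, complexity scoring, and selection policy are assumptions, not details from the proposal.

```python
from dataclasses import dataclass

@dataclass
class Specialist:
    """Hypothetical registry entry; costs and thresholds are made up."""
    name: str
    cost_ms: float          # assumed per-invocation latency on device
    min_complexity: float   # activate only for prompts at least this complex

SPECIALISTS = [
    Specialist("visual_grounding_agent", cost_ms=40.0, min_complexity=0.3),
    Specialist("cross_modal_aligner",    cost_ms=60.0, min_complexity=0.5),
    Specialist("image_context_enricher", cost_ms=25.0, min_complexity=0.2),
]

def prompt_complexity(prompt: str, num_image_regions: int) -> float:
    """Toy complexity score in [0, 1]: longer prompts and more detected
    image regions are assumed to benefit from more specialist help."""
    text_score = min(len(prompt.split()) / 50.0, 1.0)
    visual_score = min(num_image_regions / 10.0, 1.0)
    return 0.5 * text_score + 0.5 * visual_score

def select_specialists(prompt: str, num_image_regions: int,
                       latency_budget_ms: float) -> list[str]:
    """Capacity-aware selection: activate eligible specialists, cheapest
    first, until the latency budget is exhausted."""
    score = prompt_complexity(prompt, num_image_regions)
    eligible = [s for s in SPECIALISTS if score >= s.min_complexity]
    chosen, spent = [], 0.0
    for s in sorted(eligible, key=lambda s: s.cost_ms):
        if spent + s.cost_ms <= latency_budget_ms:
            chosen.append(s.name)
            spent += s.cost_ms
    return chosen
```

Under this policy, a trivial prompt activates no specialists at all, which is what keeps the approach viable under mobile constraints: the orchestrator spends latency only where the complexity score suggests it will pay off.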
Limitations & open questions
The proposed system has not yet been implemented or tested on actual mobile devices.
The evaluation plan relies on simulated deployment constraints, which may not fully reflect real-world device conditions such as thermal throttling and memory pressure.