WOW-Seg: A Word-free Open World Segmentation Model

May 16, 2026 · Grace Period · 🏛 ICLR 2026

⏳ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Danyang Li, Tianhao Wu, Bin Li, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, Xiang Li
arXiv ID: 2605.16903
Category: cs.CV (Computer Vision)
Citations: 0
Venue: ICLR 2026
Abstract
Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.
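The abstract does not say how the Cascade Attention Mask is constructed. As a rough illustration of the general idea it names (keeping tokens that belong to different instances from attending to one another), here is a minimal sketch. It is not the authors' implementation. The function name, the `SHARED` label, the causal ordering, and the token layout are all assumptions made for this example.

```python
SHARED = -1  # hypothetical label for tokens every instance may read (e.g. image or prompt tokens)


def build_instance_attention_mask(instance_ids):
    """Return an n x n boolean matrix; mask[i][j] is True if token i may attend to token j.

    Assumed rule (not taken from the paper): a token sees earlier tokens only,
    and only those that are shared context or belong to its own instance.
    """
    n = len(instance_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the current and earlier positions
            same_instance = instance_ids[j] == instance_ids[i]
            shared_context = instance_ids[j] == SHARED
            mask[i][j] = same_instance or shared_context
    return mask


# Two shared context tokens followed by two tokens each for instances 0 and 1.
ids = [SHARED, SHARED, 0, 0, 1, 1]
mask = build_instance_attention_mask(ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for instance 1 have no marks in the columns that hold instance 0's tokens, so under this assumed rule one instance's tokens cannot influence another's. That is the kind of inter-instance interference the abstract says the mask is meant to reduce.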

📜 Similar Papers

In the same crypt: Computer Vision

🌅 Old Age

Fast R-CNN

Ross Girshick

cs.CV · 🏛 ICCV · 📚 27.7K cites · 11 years ago