Visual Coreference Resolution in Visual Dialog using Neural Module Networks

Kottur, Satwik; Moura, José M. F.; Parikh, Devi; Batra, Dhruv; Rohrbach, Marcus

arXiv:1809.01816 (cs)

[Submitted on 6 Sep 2018]

Abstract: Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as a one-round dialog, visual dialog encompasses several more. We focus on one such problem, visual coreference resolution, which involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., 'it'), as the dialog agent must first link the pronoun to a previous coreference (e.g., 'boat'), and only then can it rely on the visual grounding of the coreference 'boat' to reason about the pronoun 'it'. Prior work in visual dialog models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question, but not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules, Refer and Exclude, that perform explicit, grounded coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near-perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches and is qualitatively more interpretable, grounded, and consistent.
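The linking step described in the abstract (resolve 'it' to an earlier entity such as 'boat', then reuse that entity's visual grounding) can be illustrated with a small toy sketch. This is not the authors' implementation: the reference pool layout, the dot-product scoring, the two-dimensional embeddings, and the function name `refer` are all assumptions made for illustration, and the Exclude module is not sketched here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refer(query, pool):
    """Resolve a phrase against previously grounded entities (hypothetical design).

    query: embedding of the phrase to resolve, e.g. 'it' in context.
    pool:  list of (key_embedding, attention_map) pairs, one per entity
           grounded in earlier dialog rounds. Each attention_map is a flat
           list of weights over image regions.
    Returns the blended attention map and the per-entity weights.
    """
    if not pool:
        raise ValueError("reference pool is empty")
    weights = softmax([dot(query, key) for key, _ in pool])
    num_regions = len(pool[0][1])
    blended = [
        sum(w * att[i] for w, (_, att) in zip(weights, pool))
        for i in range(num_regions)
    ]
    return blended, weights


# Toy pool: two entities grounded earlier, over four image regions.
pool = [
    ([4.0, 0.0], [0.9, 0.1, 0.0, 0.0]),  # 'boat' (assumed embedding and grounding)
    ([0.0, 4.0], [0.0, 0.0, 0.2, 0.8]),  # 'dog' (assumed embedding and grounding)
]
it_query = [1.0, 0.0]  # assumed embedding of 'it', close to 'boat'

attention, weights = refer(it_query, pool)
```

In this toy setup the pronoun's attention map is dominated by the regions previously attended for 'boat', which is the behavior the abstract motivates: the pronoun inherits a grounding from the entity it co-refers with instead of being grounded from scratch.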

Comments:	ECCV 2018 + results on VisDial v1.0 dataset
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:1809.01816 [cs.CV]
	(or arXiv:1809.01816v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1809.01816

Submission history

From: Marcus Rohrbach
[v1] Thu, 6 Sep 2018 04:36:22 UTC (4,982 KB)
