Abstract

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple novel, game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.

Read more in the paper.
To appear in EMNLP 2016. Initial version presented at ICML 2016 Workshop on Visualization for Deep Learning (Best student paper).

Bibtex

@inproceedings{vqahat,
  title={{Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?}},
  author={Abhishek Das and Harsh Agrawal and C. Lawrence Zitnick and Devi Parikh and Dhruv Batra},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2016}
}

VQA-HAT Dataset

Human attention image files are named "qid_n.png", where qid is the question ID in the VQA dataset (v1.0) and n indexes the multiple attention maps collected per question, e.g., 1500070_1.png, 1500070_2.png, etc. n = 1 for the training set and n ∈ {1, 2, 3} for the validation set. Download links are given below, followed by a minimal loading sketch.

  • Training set (703M): 58,475 attention maps
  • Validation set (47M): 4,122 attention maps
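
As a quick, unofficial sketch of how the naming convention above can be used in practice, the snippet below loads one attention map and recovers its VQA v1.0 question ID. The directory name vqahat_val and the single-channel (grayscale) reading of the PNGs are illustrative assumptions, not part of the release.

# Minimal sketch (not official VQA-HAT code): parse "qid_n.png" and load the map.
# Assumes the maps can be read as single-channel intensity images in [0, 1].
import os
import numpy as np
from PIL import Image

def load_hat_map(path):
    """Return (question_id, annotation_index, attention array normalized to [0, 1])."""
    stem = os.path.splitext(os.path.basename(path))[0]   # e.g. "1500070_2"
    qid, n = (int(x) for x in stem.split("_"))
    att = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    return qid, n, att

# Hypothetical local path to an extracted validation map.
qid, n, att = load_hat_map("vqahat_val/1500070_1.png")
print(qid, n, att.shape)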

Human vs. Machine Attention

Machine-generated attention maps on COCO-val for Stacked Attention Networks (SAN; Yang et al., CVPR16), Hierarchical Co-Attention (HieCoAtt; Lu et al., NIPS16), and the saliency model of Judd et al., ICCV09 are available for download here. For SAN and HieCoAtt, image files are named "qid.png"; for Judd et al., "coco_image_id.jpg". Examples of human attention (column 2) vs. machine-generated attention (columns 3-5) with rank-correlation coefficients are shown below.
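
As a rough sketch (not the authors' exact evaluation protocol) of the rank-order correlation used for the quantitative comparison, the snippet below downsamples a human map and a machine map to a common low-resolution grid and computes Spearman's rank correlation over pixels. The 14x14 grid size, bilinear resizing, and file paths are assumptions for illustration.

# Rough sketch (not the authors' exact protocol): Spearman rank correlation
# between a human attention map and a machine-generated one on a common grid.
import numpy as np
from PIL import Image
from scipy.stats import spearmanr

def rank_correlation(human_png, machine_png, size=(14, 14)):
    """Downsample both maps to `size`, then correlate their pixel ranks."""
    flattened = []
    for path in (human_png, machine_png):
        img = Image.open(path).convert("L").resize(size, Image.BILINEAR)
        flattened.append(np.asarray(img, dtype=np.float32).ravel())
    rho, _ = spearmanr(flattened[0], flattened[1])
    return rho

# Hypothetical paths: a VQA-HAT validation map and a SAN map for the same question.
print(rank_correlation("vqahat_val/1500070_1.png", "san_val/1500070.png"))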


Acknowledgements

We thank Jiasen Lu and Ramakrishna Vedantam for helpful suggestions and discussions. This work was supported in part by the following: National Science Foundation CAREER awards to DB and DP, Army Research Office YIP awards to DB and DP, ICTAS Junior Faculty awards at Virginia Tech to DB and DP, Army Research Lab grant W911NF-15-2-0080 to DP and DB, Office of Naval Research YIP award to DP, Office of Naval Research grant N00014-14-1-0679 to DB, Alfred P. Sloan Fellowship to DP, Paul G. Allen Family Foundation Allen Distinguished Investigator award to DP, Google Faculty Research award to DP and DB, AWS in Education Research grant to DB, and NVIDIA GPU donation to DB.

Translations

This page has been independently translated into several languages. These translations have not been verified by us and may contain errors.

  • Urdu by Samuel Badree.