Learning Visual Commonsense for Robust Scene Graph Generation: Supplementary Material

Alireza Zareian⋆, Zhecan Wang⋆, Haoxuan You⋆, and Shih-Fu Chang
Columbia University, New York NY 10027, USA
{az2407,zw2627,hy2612,sc250}@columbia.edu

In the following, we provide additional details and analyses that were not included in the main paper due to limited space. We first provide more implementation details, followed by a novel evaluation protocol that reveals insightful statistics about the data and the state-of-the-art performance. Finally, we provide more qualitative examples to showcase what our commonsense model learns.

1 Implementation Details

We have three training stages: perception, commonsense, and joint fine-tuning. We train the perception model by closely following the implementation details of each method we adopt [2, 3, 1], except that for IMP [2] we use the implementation by Zellers et al. [3], because it performs much better. We separately train the commonsense model (GLAT) once, independently of the perception model, as described in the main paper. Then we stack GLAT on top of each perception model and perform fine-tuning, without the fusion module. Finally, we add the fusion module for inference.

Our GLAT implementation has 6 layers, each with 8 attention heads and 300-D representations. We train it with a 30% masking rate on Visual Genome (VG) training scene graphs, using the Adam optimizer with a learning rate of 0.0001, for 100 epochs.

To stack the trained GLAT model on top of a perception model, we take the scene graph output of the perception model for a given image, keep the 100 most confident triplets and remove the rest, and represent each remaining entity and predicate with a one-hot vector that specifies the top-1 predicted class. We intentionally discard the class distribution predicted by the perception model, to let the commonsense model reason independently in an abstract, symbolic space.
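The triplet filtering and one-hot encoding described above could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function name, argument layout, and shapes are all assumptions for exposition, and for simplicity all entities are retained rather than only those appearing in kept triplets.

```python
import numpy as np

def to_symbolic_graph(triplet_scores, entity_probs, predicate_probs,
                      n_entity_classes, n_predicate_classes, top_k=100):
    """Keep the top_k most confident triplets and one-hot encode each node
    at its argmax class, discarding the soft class distribution.

    triplet_scores:  (T,) confidence per predicted triplet
    entity_probs:    (E, n_entity_classes) class distribution per entity
    predicate_probs: (T, n_predicate_classes) distribution per predicate
    """
    # Keep only the top-k (here, 100) most confident triplets.
    keep = np.argsort(-triplet_scores)[:top_k]
    pred_probs = predicate_probs[keep]

    # Argmax, then one-hot: the class distribution itself is thrown away,
    # so the commonsense model reasons in a purely symbolic space.
    # (In practice only entities in the kept triplets would remain; all
    # entities are encoded here to keep the sketch short.)
    ent_onehot = np.eye(n_entity_classes)[entity_probs.argmax(axis=1)]
    prd_onehot = np.eye(n_predicate_classes)[pred_probs.argmax(axis=1)]
    return keep, ent_onehot, prd_onehot
```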
Perception confidence is later taken into account by our fusion module. The resulting one-hot graph is represented in the same format as the VG graphs on which GLAT was pretrained, and is fed into GLAT without masking any node. The GLAT decoder predicts new classes for each node, as well as new edges, but we ignore the new edges and keep the structure fixed. Hence, the output of GLAT looks like the output of the perception model, with the exception that

⋆ Equal contribution.
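For reference, the 30% node-masking used when pretraining GLAT on VG scene graphs (mentioned in the implementation details above) could be sketched as below. This is a minimal sketch under stated assumptions: the reserved MASK token id and the exact corruption procedure are assumptions, not the authors' released code.

```python
import numpy as np

MASK_ID = 0  # hypothetical reserved id for the [MASK] token

def mask_nodes(node_ids, mask_rate=0.3, rng=None):
    """Randomly replace mask_rate of the node labels with MASK_ID.

    Returns the corrupted label sequence plus a boolean array marking the
    corrupted positions, which selects the reconstruction targets during
    masked-graph pretraining.
    """
    rng = rng or np.random.default_rng()
    node_ids = np.asarray(node_ids)
    n_mask = max(1, int(round(mask_rate * node_ids.size)))
    idx = rng.choice(node_ids.size, size=n_mask, replace=False)
    corrupted = node_ids.copy()
    corrupted[idx] = MASK_ID
    is_masked = np.zeros(node_ids.size, dtype=bool)
    is_masked[idx] = True
    return corrupted, is_masked
```

At inference time, as described above, this corruption is skipped entirely: the one-hot graph is fed to GLAT with no node masked.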