| 
                
| A t-SNE plot of 1500 randomly selected images from each of the Humorous and Non-Humorous sets, each image being the last frame of a visual dialog turn. Visual models can sometimes cheat by detecting a pattern specific to Humorous or Non-Humorous visual dialogs, such as a particular camera angle. The plot above hints that no such pattern is present. To make the plot easier to read, each image is represented by a dot in the corresponding plot shown below. (The current plot is slightly scaled up for visibility.) | 
                         
                     | 
                
| A green dot represents a humorous sample and a red dot a non-humorous sample. The dots appear to be randomly distributed, hinting at the absence of any such bias. | 
                         
                
| Average time per turn in a dialog, across the dataset. | Average dialog time, across the dataset. | Contribution of each speaker to generating humor, across the dataset. | 
                         
                
| Text based Fusion Model (TFM) | Video based Fusion Model (VFM) | 
                         
                
| Text based Attention Model (TAM) | Video based Attention Model (VAM) |