REVIEW OF VISUAL DATA DESCRIPTION

Supriya Pradeep Kurlekar, PhD Student, Dept. of Electronics Engineering, Shivaji University, Kolhapur, India
Dr. Manasi R. Dixit, Professor, Department of Electronics and Telecommunication Engineering, KIT's College of Engineering, Kolhapur, Maharashtra, India

Abstract

Nowadays, due to the vast number of camera-equipped devices, large amounts of image and video data are being generated, carrying information that can address many real-world problems [16]. Deep-learning-based visual data description is one of the most popular fields of research. Image understanding involves identifying and locating objects, while captioning additionally requires detecting the interrelations among those objects. Automatically describing video in natural language is a very challenging task: it requires understanding many entities, such as the background scene, human interactions, and other sequential events. Video/image captioning has huge scope for development in human-robot interaction, virtual assistants, assistance for visually impaired people, surveillance, and many other areas. Captioning can also do a lot for those who cannot hear. Social media is one of the biggest platforms, used by more than half a billion people for watching video. One great advantage of captioning your content is that it enables a video or image to be found by search engines, including Google, through search engine optimization.

I. Introduction

Obtaining meaningful information from video is a crucial task. The availability of standardized datasets and deep neural network algorithms has contributed to significant improvement in video caption generation. Videos may include a number of activities, entities, and interrelated events. Dense video caption generation has become one of the most challenging tasks in recent years, as it requires many sentences for meaningful captioning. Consequently, the dense video captioning task [2] has been introduced and is gaining popularity.
This task is conceptually more complex than simple video captioning, since individual events must be detected during video processing. Moreover, due to the complexity of video sequences, most methods consider at most two events while generating captions [1, 10]. Also, type of