We’ve seen AI generate text, then images, and most recently even short videos, though the videos still need some improvement. The results are incredible when you consider that no human is involved in creating these pieces, and that a model only has to be trained once to then be used by thousands of people, as Stable Diffusion is. Still, do these models really understand what they are doing? Do they know what the picture or video they just produced really represents? What does such a model understand when it sees such a picture or, even more complex, a video?

Learn more in the video... (there is RTX GPU giveaway information in the video too!)

References

►Read the full article: https://www.louisbouchard.ai/general-video-recognition/
►Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S. and Ling, H., 2022. Expanding Language-Image Pretrained Models for General Video Recognition. arXiv preprint arXiv:2208.02816.
►Code: https://github.com/microsoft/VideoX/tree/master/X-CLIP
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:00 We've seen AI generate text, then generate images, and most recently even generate short videos, even though they still need work. The results are incredible, especially when you think that no one is actually involved in the creation process of these pieces and that a model only has to be trained once to then be used by thousands of people, like Stable Diffusion is. Still, do these models really understand what they are doing? Do they know what the picture or video they just produced really represents? What does such a model understand when it sees such a picture or, even more complex, a video? Let's focus on the more challenging of the two and dive into how an AI understands videos through a task called general video recognition, where the goal is for a model to take videos as inputs and use text to describe what's happening in the video.

0:51 But first, I think you will love this episode's sponsor and what they have to offer: an amazing free AI event. For this video, I'm partnering with Scale AI. Scale AI is the company behind one of the world's leading AI conferences, TransformX. This October 19th to 21st, TransformX will bring together over 20,000 AI and ML leaders, visionaries, practitioners, and researchers across industries to explore operationalizing AI and machine learning. TransformX is a free virtual event and will have 120 speakers from companies like Meta, OpenAI, DeepMind, Google, Etsy, and more. I'm personally excited to hear from Greg Brockman, OpenAI's co-founder and president, and Koray Kavukcuoglu, the VP of Research and Technology at DeepMind, two of the most important companies in our field. There will also be really interesting talks from fantastic contributors to the field, like François Chollet, the creator of Keras, that I'll definitely tune into. Don't miss your chance to attend this free educational event. It was a huge hit last year, and you don't want to miss out. Sign up with the first link below to attend the TransformX conference with me and support my work.

2:06 General video recognition is one of the most challenging tasks in understanding videos, yet it may be the best measure of a model's ability to get what's happening. It's also the basis behind many applications relying on a good understanding of videos, like sports analysis or autonomous driving. But what makes this task so complex? Well, there are two things we need to do. First, we need to understand what is shown, meaning each frame, or each image, of a particular video. Second, we need to be able to say what we understand in a way humans understand, which means using words.

2:41 Fortunately for us, the second challenge has been tackled numerous times by the language community, and we can build on their work. More precisely, we can take what people from the language-image field have done with models like CLIP or even Stable Diffusion, where you have a text encoder and an image encoder that learn to encode both types of inputs into the same kind of representation. This way, you can match a scene to a text prompt describing it by training the architecture with millions of image-caption example pairs. Having both text and images encoded in a similar space is powerful because it takes much less space to perform computations and it allows us to compare text to images easily. This means that the model still doesn't understand an image or even a simple sentence, but it can at least tell whether the two are similar or not. We are still far from intelligence, but that's pretty useful and good enough for most cases.

3:39 Now comes the biggest challenge here: videos. For that, we'll use the approach from Bolin Ni and colleagues in their recent paper, Expanding Language-Image Pretrained Models for General Video Recognition. Videos are much more complex than images due to the temporal information, meaning the multiple frames and the fact that each frame is linked to the next and the previous one with coherent movement and actions. The model needs to see what happened before, during, and after each frame to have a proper understanding of the scene. It's just like on YouTube: you cannot really skip 5 seconds forward in short videos, as you will miss valuable information.

4:20 In this case, they take each frame and send it into the same image encoder we just discussed, using a vision transformer-based architecture to process the frames into a condensed space using attention. If you are not familiar with vision transformers or the attention mechanism, I invite you to watch the video I made introducing them. Once you have your representation for each frame, you can use a similar attention-based process to have the frames communicate with each other, allowing your model to exchange information between frames and create a final representation for the video. This information exchange between frames using attention acts as a sort of memory for your model to understand the video as a whole rather than as a couple of random images put together. Finally, we use another attention module to merge the text encodings with the frame representations we had and our condensed video representation.

5:17 And voilà, this is one way an AI understands a video. Of course, this was just an overview of this great paper by Microsoft Research, serving as an introduction to video recognition. I invite you to read their paper for a better understanding of their approach.

5:32 I also have the pleasure of announcing another giveaway for the next NVIDIA GTC event, from September 19th to September 22nd. NVIDIA is giving me once again an RTX 3080 Ti to give away to this community for those of you attending the event. The only two things you have to do in order to have a chance to win are to subscribe to the channel and DM me a screenshot of one of the talks you decide to attend during the event. That's it!

6:02 Thank you for watching the video, and warm thanks to my friends at Scale AI for sponsoring the video. I hope to see you virtually at their free event shortly, and I will see you next week with another amazing paper.
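
For readers who prefer code, the pipeline described in the transcript (encode each frame, let the frames exchange information through attention, pool them into one video representation, then compare it with text encodings) can be sketched in a few lines of NumPy. This is only a toy illustration under loud assumptions: the frame and text embeddings are random stand-ins for what a CLIP-style image encoder and text encoder would produce, the attention weights are random rather than trained, and every name here (`cross_frame_attention`, `classify_video`, the labels, the temperature value) is hypothetical. It is not the X-CLIP implementation from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_frame_attention(frames, w_q, w_k, w_v):
    """Single-head self-attention over the time axis.

    frames: (T, D) array, one embedding per frame.
    Returns the updated frame embeddings and the (T, T) attention weights.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how much frame i looks at frame j
    weights = softmax(scores, axis=-1)
    return frames + weights @ v, weights      # residual connection

def classify_video(frame_embs, text_embs, params, temperature=0.07):
    # 1) Frames exchange information through attention.
    mixed, _ = cross_frame_attention(frame_embs, *params)
    # 2) Average the frames into a single video embedding.
    video_emb = l2_normalize(mixed.mean(axis=0))
    # 3) Compare the video with each text embedding in the shared space.
    sims = l2_normalize(text_embs) @ video_emb
    return softmax(sims / temperature)

# Random stand-ins for encoder outputs (assumption: real encoders go here).
rng = np.random.default_rng(0)
T, D = 8, 16
labels = ["playing basketball", "driving a car", "cooking"]
frame_embs = rng.normal(size=(T, D))
text_embs = rng.normal(size=(len(labels), D))
params = tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

probs = classify_video(frame_embs, text_embs, params)
_, attn = cross_frame_attention(frame_embs, *params)
print(dict(zip(labels, probs.round(3))))
```

With trained encoders and trained attention weights, the highest probability would point to the text that best describes the video. Here the numbers are meaningless, and the sketch only shows how the pieces connect.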

Watch more on YouTube: https://www.youtube.com/c/WhatsAI

What is General Video Recognition?
