
🎥 Microsoft & Google are using LLMs for video analysis

Plus more on Anthropic and Inflection’s new AI models, the return of Sam Altman, and Stable Diffusion for Video.

Loop logo

Hello,

Welcome to this edition of Loop! We aim to keep you informed about technology advances, without making you feel overwhelmed.

To kick off your week, we’ve rounded up the most important technology and AI updates that you should know about.

In this edition, we’ll explore:

- Microsoft’s efforts in developing smaller Language Models and reducing costs
- Sam Altman’s return to OpenAI
- How CarViz are using computer vision to detect car damage
- … and much more

Let's jump in!

Image of Loop character reading a newspaper

Top Stories

1. Google’s Bard AI chatbot can now answer questions about YouTube videos [Link]

Occasionally, you will see that the top result on Google Search includes a transcript of a YouTube video that aims to answer your query. Bard is likely making use of the same text transcript to answer questions about a video’s content, with the transcript sent to the LLM as context rather than relying on any true multi-modal processing of the video itself - since the cost of doing that is still quite high.
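
If you’re curious what that looks like in practice, here’s a rough sketch in Python of the transcript-as-context idea. It’s purely illustrative (not Google’s actual implementation) - the transcript and question are made up, and in a real system the resulting prompt would be sent to an LLM.

```python
def build_video_question_prompt(transcript: str, question: str) -> str:
    """Pack a video's transcript and a user question into a single text prompt.

    The transcript does all the work here: the LLM never sees the video itself,
    only the text, which keeps inference costs close to a normal chat query.
    """
    return (
        "Below is the transcript of a YouTube video.\n"
        "Answer the user's question using only this transcript.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}"
    )


# Example: the prompt a text-only model would receive (contents are invented).
prompt = build_video_question_prompt(
    transcript="Welcome back! Today we're baking a sponge cake. You'll need three eggs...",
    question="How many eggs does the recipe need?",
)
print(prompt)
```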

This is part of the gradual shift to allow users to ask questions about the content they’re watching. Microsoft’s MM-Vid, which is described later in this post, aims to do something very similar. Sometimes video creators can be slow to get to the point, so Google gave an example of a user asking how many eggs were needed for the recipe - allowing you to get organised as the creator gives their video introduction.

2. Inflection announce their 2nd generation model [Link]

Inflection AI, which is led by DeepMind co-founder Mustafa Suleyman, has announced a new version of their Large Language Model (LLM). They have released benchmark results showing their model is the second most capable LLM, behind OpenAI’s GPT-4.

Interestingly, Inflection 2 is able to outperform Google’s code-optimised version of PaLM-2 - despite the fact that Inflection did not focus on training it for “coding and mathematical reasoning”. It’s impressive work, especially considering that Inflection AI has just 50 employees, compared to the more than 750 working at OpenAI.

3. Microsoft Research unveil a small Language Model called Orca 2 [Link]

The research team at Microsoft has been tasked with making smaller language models, as the costs of running LLMs are incredibly high. It has been reported that the company is losing between $20 and $60 per user each month on GitHub Copilot subscriptions. As Microsoft integrates LLMs into Word, PowerPoint, Excel, the Windows OS and elsewhere, there’s a real need to reduce costs quickly.

Orca 2 is the latest part of that work, aiming to produce cheaper models that still retain most of the capability we have come to expect. The model comes in two sizes, 7 and 13 billion parameters, and Microsoft has shown it can outperform models that are 5-10 times larger. The model’s weights have been made available on Hugging Face.
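
If you want to try it yourself, a minimal sketch using the Hugging Face transformers library might look like this - I’m assuming the publicly listed “microsoft/Orca-2-7b” checkpoint here, and you’d need a machine with enough memory (plus the accelerate package) to load it.

```python
# A rough sketch of loading Orca 2 with Hugging Face transformers.
# Illustrative only - check the model card on Hugging Face for the
# recommended prompt format and tokenizer settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-7b"  # a 13B variant is published alongside it

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to keep memory usage down
    device_map="auto",          # requires accelerate; places layers on available devices
)

prompt = "Explain, in one paragraph, why smaller language models are cheaper to run."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```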

4. Sam Altman has returned to OpenAI [Link]

It’s been quite the week. OpenAI suddenly fired their CEO Sam Altman, which led to co-founder Greg Brockman and several others leaving the company. A showdown then ensued, as over 700 employees called for the board to reinstate Altman or they would resign.

But on Sunday night the board hired Twitch co-founder Emmett Shear as their interim CEO, with Microsoft swiftly announcing that Sam Altman would lead a new advanced AI research team there. Eventually, OpenAI’s board relented to external and internal pressure, agreeing to bring Altman back as CEO and to change how OpenAI is governed. Still with me?

It’s been an astonishing week that’s gripped the tech world. The constant twists and turns were entertainment for some, but companies who rely on OpenAI’s GPT models worried about what it meant for them. A huge amount of money and software relies on what OpenAI have created. While the drama has died down for now, OpenAI will have some work to do in rebuilding trust with their customers.

5. Anthropic release Claude 2.1 to developers [Link]

As mentioned last week, Anthropic was founded after a group of OpenAI researchers disagreed with Sam Altman on how to safely develop AI and then left to start their own company.

Anthropic aimed to capitalise on OpenAI’s internal struggles and released Claude 2.1, which has a 200k-token context window, “significantly” fewer hallucinations, and now supports tool use. To put that in perspective, a 200k-token context window can hold around 150,000 words (roughly 500 pages of text).
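
The back-of-the-envelope maths behind that figure, assuming roughly 0.75 words per token and around 300 words per printed page, looks like this:

```python
# Rough conversion from a 200k-token context window to words and pages.
# The ratios (0.75 words per token, ~300 words per page) are common
# rule-of-thumb estimates, not exact figures.
context_tokens = 200_000

words_per_token = 0.75
words_per_page = 300

approx_words = context_tokens * words_per_token   # 150,000 words
approx_pages = approx_words / words_per_page      # ~500 pages

print(f"~{approx_words:,.0f} words, ~{approx_pages:,.0f} pages")
```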

It is worth noting, though, that as you increase the context window, LLM responses often become less accurate. It’s hoped this can be minimised over time, but it’s great progress nonetheless. OpenAI might be leading the field with GPT-4, but Anthropic and Inflection aren’t far behind.

Closer Look

Microsoft have used GPT-4 Vision to analyse TV episodes, live sports, and games

The MM-Vid project used a mixture of GPT-4 Vision - along with other computer vision, audio, and speech tools - to better analyse longer-form videos, such as TV episodes. Similar to what Google has done with Bard and YouTube videos, Microsoft’s team were able to create a detailed script for the uploaded video. This script describes the characters’ movements, expressions, and dialogue throughout, and is then processed by an LLM and used to answer the user’s questions.
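
To make the idea a bit more concrete, here’s a loose, illustrative sketch of that “video to script to LLM” flow. It’s not Microsoft’s code - the frame descriptions and dialogue below are invented, and in practice they would come from a vision model and a speech-to-text model.

```python
# A loose approximation of the MM-Vid idea: convert a video into a
# time-coded text "script", then use that script as context for an LLM.

def build_video_script(frame_descriptions: list[tuple[float, str]],
                       dialogue: list[tuple[float, str]]) -> str:
    """Merge visual descriptions and dialogue into one chronological script."""
    events = [(t, f"[VISUAL] {text}") for t, text in frame_descriptions]
    events += [(t, f"[SPEECH] {text}") for t, text in dialogue]
    events.sort(key=lambda e: e[0])  # order everything by timestamp
    return "\n".join(f"{t:>7.1f}s  {text}" for t, text in events)


# Hypothetical outputs from the vision and speech models:
frames = [(0.0, "A kitchen. A presenter holds up a bowl."),
          (12.5, "Close-up of three eggs being cracked into the bowl.")]
speech = [(1.2, "Welcome back to the channel!"),
          (11.8, "You'll need three eggs for this recipe.")]

script = build_video_script(frames, speech)
print(script)
# The resulting script would then be sent to a text LLM, together with the
# user's question, in the same way as any other document or transcript.
```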

The team have provided lots of examples of how it could be used, such as answering questions about a character’s motivations in a TV show, having an AI system play a game of Super Mario, or showing the “most exciting moment” of an MLB baseball game.

It can even describe what is happening in a scene without any audio. This could be useful for creating audio descriptions for those with visual impairments, even if the content creator didn’t add them - making a huge catalogue of content more accessible.

If you want to see the full list of examples, you can view them on GitHub.

Announcement

Stable Diffusion for Video is now here

We’ve seen some substantial progress with text-to-video generators in the last few months, following announcements by Runway and Meta, and now Stability AI has joined the race. They’re well-known for their image model, but the video model looks to be just as impressive. The company claims it can be easily adapted for other tasks, such as creating 3-dimensional views from a single image.

However, the model isn’t being released to the public just yet - Stability AI are focused on further improving its quality and safety - but they “look forward to sharing the full release” at some point in the future. It will sit alongside the company’s other models, which span audio, image, 3D, and text generation.

If you want to read the full announcement, you can see it on their website.

Byte-Sized Extras

  • F1 is using computer vision to detect when cars are leaving the track [Link]

  • Generative AI startup AI21 Labs raises extra cash from investors [Link]

  • SpaceX plans to sell shares next month at $150B valuation [Link]

  • Cruise’s CEO Kyle Vogt resigns, following months of turmoil [Link]

  • Binance to pay $4.3B in fines as its CEO steps down from the crypto exchange and pleads guilty to anti-money laundering charges [Link]

  • Hyundai and Motional are to jointly manufacture an IONIQ 5 robotaxi in Singapore [Link]

Image of Loop character with a cardboard box

Startup Spotlight

CarViz

CarViz is a computer vision startup based in France that aims to analyse a car’s condition and give users a more accurate valuation. They can detect scratches, dents, tyre condition, and other types of damage. The company also uses data from government sources, which provides information about the vehicle’s specification, and compares this against the documents provided by the owner.

The final report is then sent to the user, which is often a large dealership - CarViz currently work with 6 of the major dealers in France, along with others in Spain, Germany, and the US.

If you want to read more about what they do, you can check out their website.

Image of Loop character standing at a podium

Analysis

We’re starting to see the big tech companies explore what insights can be gathered from using LLMs to analyse video content. You can imagine a situation where we will soon be able to ask Netflix questions about a TV show as we watch it. This would be useful if you’ve just missed something that was said and want it explained, without having to scroll back, or if you want a quick recap of the last few episodes.

As streaming companies - such as Netflix & Disney - are starting to see their subscriber growth slow and need to justify further price rises, this could prove invaluable. Imagine a situation where you could talk to an AI version of your favourite TV or movie character, just by using your remote - most of our TV remotes are already equipped with voice assistants. As competition heats up in the video streaming world, whether it’s YouTube or Netflix, you can expect more features to be developed around this.

I’m excited to see what this means for live sports games and other types of content, going forward. Creating highlights of sports games can be quite intensive and has to be done within a very short timeframe, before it’s either broadcast on television or is added to YouTube for fans to view.

Does this mean we could use AI models to create scripts of the event in real time, then have an LLM analyse it for what it suggests are the highlights? It could certainly speed up the process for staff, and it could also feed into the live text commentary pages that sports websites often have.
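
Purely as a thought experiment, a first pass might look something like the sketch below - a time-coded event script is packed into a prompt that asks an LLM to nominate highlight moments. All of the events and wording here are invented for illustration.

```python
# A speculative sketch: feed a rolling, time-coded event script to a text LLM
# and ask it to nominate highlight moments. In practice the events would come
# from live commentary feeds or an automated video-analysis pipeline.

def build_highlight_prompt(events: list[tuple[str, str]], max_highlights: int = 3) -> str:
    script = "\n".join(f"{clock}  {description}" for clock, description in events)
    return (
        "Below is a time-coded script of a football match so far.\n"
        f"List the {max_highlights} moments most worth including in a highlights reel, "
        "with their timestamps and a one-line reason for each.\n\n"
        f"{script}"
    )


events = [
    ("03:12", "Shot from outside the box, tipped over the bar."),
    ("27:45", "Goal! Header from a corner, 1-0."),
    ("61:03", "Red card after a second bookable offence."),
]

prompt = build_highlight_prompt(events)
print(prompt)
# The prompt would then be sent to an LLM; the same script could also feed
# a live text commentary page.
```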

There are endless possibilities for how this tech could be used - and it looks like we’ve only taken the first step in exploring what’s possible.

This Week’s Art

Prompt: An ultra-realistic modern art studio scene, featuring a sleek robotic arm painting on a canvas. The studio is vibrant and filled with bright, pop art-inspired colors. The painting depicts a generic tomato soup can, styled colorfully in a manner reminiscent of pop art, avoiding any trademarked designs. The atmosphere combines retro and futuristic aesthetics, illustrating the fusion of traditional art themes with advanced technology. The scene captures the essence of pop art without any specific public figures or copyrighted designs.
Platform: DALL-E 3

Image of Loop character with a note called End Note

End Note

While the week has been pretty dramatic with Sam Altman’s firing and return to OpenAI, there are much wider stories to take note of. OpenAI’s rivals have tried to capitalise on the dramatic twists, unveiling their new models and offering jobs to OpenAI’s dissatisfied employees.

But serious advances are being made with both GenAI video generation and video analysis. These could open the door to a huge number of new applications, which is very exciting.

This week we’ve looked at Inflection, Anthropic and Microsoft’s new models, Stable Diffusion for Video, the return of Sam Altman as CEO, Microsoft’s MM-Vid, using Bard to ask questions about YouTube videos, and how CarViz are using computer vision to spot vehicle damage.

Have a good week!

Liam

Image of Loop character waving goodbye

Share with Others

If you found something interesting, feel free to share this newsletter with your colleagues.

About the Author

Liam McCormick is a Senior Software Engineer and works within Kainos' Innovation team.
He identifies business value in emerging technologies, implements them, and then shares these insights with others.