Gemini Vision + OpenAI Speech: A Powerful AI Meeting Agent using VideoSDK
Source code: https://github.com/videosdk-community/videosdk-gemini-vision-agent
Explore how AI can see and understand screen content during real-time video calls. This demo showcases an AI agent built with VideoSDK, OpenAI's Realtime API for speech-to-speech communication, and Google Gemini's vision-language models for screen analysis.
Watch the AI identify famous paintings such as Vincent van Gogh's The Starry Night, as well as historical scenes, when they are shared on screen during a live meeting. We walk through the technical architecture, showing how the AI agent joins a meeting, processes the audio and screen share streams, and uses Gemini 1.5 Flash and OpenAI to provide real-time insights.
Learn about the "VideoSDK Gemini Vision Agent" repository used in this project, its client (ReactJS) and server (FastAPI) structure, and key components such as the AI agent class and the helper methods that handle function calls, audio listeners, and screen share analysis. This video is for developers, AI enthusiasts, and anyone interested in intelligent video communication and real-time AI applications.
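To make the function-call flow described above more concrete, here is a minimal Python sketch of one way an agent could connect a speech model's function call to screen share analysis. It is not the repository's code: the class `ScreenShareAgent`, the function name `analyze_screen`, and the stand-in `fake_vision_model` are all hypothetical, and a real agent would replace the stand-in with a call to a vision model such as Gemini.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ScreenShareAgent:
    """Hypothetical agent: keeps the newest screen frame and serves function calls."""

    # Stand-in for the vision model call: (image bytes, prompt) -> description.
    describe_frame: Callable[[bytes, str], str]
    latest_frame: Optional[bytes] = None
    handlers: Dict[str, Callable[[dict], str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Register the functions the speech model is allowed to call (assumed name).
        self.handlers["analyze_screen"] = self._analyze_screen

    def on_screen_frame(self, frame: bytes) -> None:
        # Screen share listener: only the most recent frame is kept.
        self.latest_frame = frame

    def handle_function_call(self, name: str, arguments_json: str) -> str:
        # Dispatch a function call by name and return a JSON string result.
        handler = self.handlers.get(name)
        if handler is None:
            return json.dumps({"error": f"unknown function: {name}"})
        args = json.loads(arguments_json or "{}")
        return json.dumps({"result": handler(args)})

    def _analyze_screen(self, args: dict) -> str:
        if self.latest_frame is None:
            return "No screen share is active."
        prompt = args.get("question", "Describe what is on the screen.")
        return self.describe_frame(self.latest_frame, prompt)


def fake_vision_model(frame: bytes, prompt: str) -> str:
    # Placeholder so the sketch runs without any API key or network access.
    return f"[{len(frame)} bytes] {prompt}"


if __name__ == "__main__":
    agent = ScreenShareAgent(describe_frame=fake_vision_model)
    agent.on_screen_frame(b"\x89PNG...")  # pretend this is an encoded frame
    print(agent.handle_function_call(
        "analyze_screen", '{"question": "What painting is this?"}'
    ))
```

Keeping only the latest frame is a deliberate simplification: the vision model is queried when the speech model asks for it, not on every frame of the stream.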
Timestamps:
0:00 Introduction & What AI Can Do
0:16 Real-time AI Screen Analysis Demo (Starry Night & Historical Scene)
1:02 Technology Stack & Repository Overview (Video SDK, OpenAI, Gemini)
1:32 Project Structure & AI Agent Details
2:38 Initializing LLM Models (Gemini & OpenAI)
3:01 Helper Methods & Event Handling
Keywords: AI, Artificial Intelligence, Vision AI, Gemini, Google Gemini, OpenAI, Video SDK, Real-time AI, Screen Analysis, Video Call, Meeting Assistant, LLM, Large Language Model, AI Demo, AI Tutorial, How to Build AI, AI Development, Computer Vision, Speech to Speech, Video Conference, AI Agent, Video Analysis, Real Time Video Analytics, Gemini API