YouTube Video Summarizer and Question Answering Bot using Gemini

Netra Prasad Neupane
11 min read · Mar 8, 2024

Large language models (LLMs) have been a very hot topic since the release of ChatGPT in late 2022. Now, much of the AI community is focused on solving problems with LLMs. A new LLM appears almost every month, and a large number of papers on quantization, evaluation and prompting are released each week. A few months ago, Google released its multimodal LLM called Gemini. At launch, Google claimed that Gemini (in its Ultra variant) is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most popular benchmarks for testing the knowledge and problem-solving abilities of AI models. According to Google's claims, Gemini outperforms GPT-4 on most benchmarks across text, image and video, but not in audio processing. Gemini comes in the following three sizes:

  • Gemini-nano: a small model for on-device tasks. It can handle tasks such as summarizing text, suggesting smart replies in context and checking grammar. It can also run on smartphones such as the Google Pixel 8.
  • Gemini-pro: according to Google, the best model for scaling across a wide range of tasks. It is more powerful than Gemini-nano and can process text, images, code, audio and video. Gemini-pro is available through an API and can also be accessed from Bard (which was recently renamed Gemini). In this blog, we will use the text-only gemini-pro model, since we only process video transcriptions.
  • Gemini-ultra: the largest and most capable model, meant for highly complex tasks. It is said to be available for enterprises and data centres.

fig: YouTube video summarizer

That’s enough about Gemini, right? Let’s dive into the problem we are going to solve today. You might have spent a lot of time on YouTube trying to figure out which video explains the exact details that fit your case. Nowadays, many YouTubers use misleading thumbnails to increase views, which makes it harder to find a video that matches our intent. On top of that, watching a long video just to find out whether it covers the topic we are looking for is tedious and time-consuming. To overcome this problem and get the key insights of a video within seconds, I came up with the idea of a YouTube video summarizer. At the same time, I had another idea: what if we build an application to query/prompt over a YouTube video? I think this use case is beneficial for all of you. It helped me a lot to quickly decide whether a video is worth watching and to get the information described in it within a few seconds.

fig: Question-Answering in YouTube videos

Here, we first fetch the transcription of the YouTube video using a transcription library. Once we have the transcription, we send it to Gemini with instructions to return a detailed summary of the information in the video. For the second use case, question answering, we send the transcription together with the user's prompt and instruct Gemini to answer the prompt using the transcription.

Let’s start the implementation. For this, you have to install the following libraries in your environment. I recommend creating a new virtual environment and installing the dependencies inside it instead of installing them globally.

  1. python-dotenv is used to read the values of sensitive variables (such as the API key) from the environment file.
  2. youtube-transcript-api is used to get the transcripts/subtitles for a given YouTube video.
  3. google-generativeai is used to access (prompt and get responses from) the Gemini model.
  4. streamlit is used to create the UI of the summarizer.

Once you have set up your environment, it’s time to get a Gemini API key. Currently, Gemini-pro has two models: gemini-pro for text-to-text processing and gemini-pro-vision for image-and-text-to-text processing. Here, we will use the gemini-pro model through the Gemini API. You can create an API key at https://makersuite.google.com/app/apikey. Currently, you can make 60 queries per minute to gemini-pro using an API key created there. See the following image for more details on pricing.

fig: Gemini API pricing (source: https://ai.google.dev/pricing)

First, we import the required dependencies (youtube-transcript-api, Streamlit, the Gemini SDK, dotenv, etc.) and then configure the Gemini API.

# import dependencies
import os
import streamlit as st
import google.generativeai as genai
from dotenv import load_dotenv
from youtube_transcript_api import YouTubeTranscriptApi

# configure API by loading key from .env file
# load environment variables
load_dotenv()
# configure key
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=GEMINI_API_KEY)
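
If the .env file is missing or the variable name is misspelled, os.getenv() quietly returns None, and the problem may only surface later when the first request is sent. A minimal sketch of a fail-fast check, assuming a hypothetical helper named require_env() that is not part of the original code:

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # hypothetical helper: stop early instead of passing None to the SDK
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell"
        )
    return value
```

With this helper, the key could be loaded as GEMINI_API_KEY = require_env("GEMINI_API_KEY") right after load_dotenv().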

Great, we have configured the Gemini API. It’s time to initialize the model. Let’s create functions to initialize the Gemini Pro model and get a response from it.

# initialize gemini pro model
def initialize_model(model_name="gemini-pro"):
    model = genai.GenerativeModel(model_name)
    return model

def get_response(model, prompt):
    response = model.generate_content(prompt)
    return response.text
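
One thing to be aware of: to my knowledge, the response.text accessor in the google-generativeai SDK raises a ValueError when the response contains no text part, for example when the output was blocked. A minimal sketch of a defensive variant, assuming a hypothetical function name get_response_safe() and a fallback message that are not part of the original code:

```python
def get_response_safe(model, prompt, fallback="Couldn't generate a response for the given video"):
    """Like get_response(), but return a fallback string instead of raising."""
    response = model.generate_content(prompt)
    try:
        return response.text
    except ValueError:
        # assumption: the SDK raises ValueError when the response has no text part
        return fallback
```

The function only relies on model.generate_content() and response.text, so it can be swapped in wherever get_response() is called.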

In the above code, initialize_model() initializes the Gemini-pro model and get_response() returns the model's response for the given prompt. Now, it’s time to create a function which fetches the transcription of a YouTube video.

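def get_video_transcripts(video_id):
    try:
        # each transcript entry is a dict; its "text" field holds one caption line
        transcription_list = YouTubeTranscriptApi.get_transcript(video_id)
        # join all caption lines into a single string
        transcription = " ".join([transcript["text"] for transcript in transcription_list])
        return transcription

    except Exception as e:
        # re-raise so that the caller sees why the transcript could not be fetched
        raise e


def get_video_id(url):
    # take everything after the first "=" (expects .../watch?v=<video_id>)
    video_id = url.split("=")[1]
    # drop any extra query parameters such as &t=10s
    if "&" in video_id:
        video_id = video_id.split("&")[0]

    return video_id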
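
The split-based get_video_id() expects links of the form https://www.youtube.com/watch?v=<video_id> with v as the first query parameter. Short links (youtu.be), Shorts, embed links, or URLs where another parameter comes before v would give a wrong ID or an IndexError. A minimal sketch of a more tolerant parser using only the standard library, assuming a hypothetical function name extract_video_id() that is not part of the original code:

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from common YouTube URL shapes, or None if not found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in ("youtube.com", "m.youtube.com"):
        # standard links: https://www.youtube.com/watch?v=<id>&t=10s
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # path-based links: /embed/<id>, /shorts/<id>, /live/<id>
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts", "live"):
            return parts[1]
    return None
```

Returning None for unrecognized URLs lets the UI show a friendly message instead of crashing. URLs without a scheme (for example youtube.com/watch?v=...) are not handled by this sketch.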

The get_video_transcripts() function fetches the transcription of a YouTube video. It takes only the video_id, so we first use the get_video_id() function to extract the video_id from the given URL. Now it’s time to create the UI portion. Let’s create the Streamlit UI with a text input for the video URL and a button to submit it.

st.title("YouTube video summarizer")
st.markdown("<br>", unsafe_allow_html=True)
youtube_url = st.text_input("Enter youtube video link:")
if youtube_url:
    video_id = get_video_id(youtube_url)
    st.image(f"http://img.youtube.com/vi/{video_id}/0.jpg", use_column_width=True)
submit = st.button("submit")

After running the above code to generate the Streamlit UI, you will end up with the following interface:

fig: Youtube Video Summarizer UI

Now, it’s time to define the behaviour of the model and run the main logic behind the summarizer. Let’s use the Gemini model initialization defined above and pass it the transcription of the YouTube video which you want to summarize.

model_behavior = """ You are expert in summarization of youtube video from transcription of video.
So, input is transcription and output will be the summary of the given video including all
the important information. Please break down the information in multiple paragraph if it becomes
more clear and concise.Please give relevant topic for the summary.
Please try to make the summary in below 1000 words. Please don't add extra information that doesn't
make sense but fix typos and return `Couldn't generate summary for the given video` if transcription is meaningless or empty.
This is the transcriptions for the video.
"""

# run only when a URL has been entered, otherwise video_id is not defined
if submit and youtube_url:
    transcriptions = get_video_transcripts(video_id)

    # initialize the gemini-pro model
    gemini_model = initialize_model(model_name="gemini-pro")
    final_prompt = model_behavior + "\n\n" + transcriptions
    summary = get_response(model=gemini_model, prompt=final_prompt)
    st.write(summary)
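
The code above sends the whole transcription in a single prompt, which may exceed the model's input limit for very long videos. One common workaround is to summarize the transcription in chunks and then summarize the partial summaries. A minimal sketch of that idea, assuming hypothetical helper names split_transcript() and summarize_long_transcript() that are not part of the original code, and an arbitrary chunk size of 3000 words that would need tuning against the model's actual token limit:

```python
def split_transcript(transcript, max_words=3000):
    """Split a transcript into consecutive chunks of at most max_words words."""
    words = transcript.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long_transcript(transcript, summarize, max_words=3000):
    """Summarize each chunk, then summarize the combined partial summaries.

    `summarize` is any callable that maps text to a summary string, for
    example a small wrapper around get_response().
    """
    chunks = split_transcript(transcript, max_words)
    if len(chunks) <= 1:
        # short transcripts go through in a single call
        return summarize(transcript)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partial_summaries))
```

Here summarize could be something like lambda text: get_response(gemini_model, model_behavior + "\n\n" + text). Splitting on word counts is only a rough proxy for tokens, and a chunk boundary can fall in the middle of a sentence.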

Finally, we have completed the implementation of the summarizer. It’s time to test how it works. For testing purposes, we are using the Introduction to Quantum Computing video by IBM.

Let’s enter the URL of the above video (https://www.youtube.com/watch?v=lt4OsgmUTGI) as the input and get the summary of the video:

fig: summary of given video

As you have seen in the above image, our application gives a detailed summary of the video. Here you can read the detailed summary more clearly:

Introduction to Quantum Computing

Quantum computing is a rapidly emerging field with the potential to revolutionize various industries. Unlike classical computers that rely on bits (0s and 1s), quantum computers utilize qubits that possess the unique ability of superposition, allowing them to exist in both a 0 and 1 state simultaneously. This property enables quantum computers to perform parallel computations, potentially solving complex problems that are intractable for classical computers.

Superposition and Gates

In the world of quantum mechanics, qubits can exist in a state of superposition, representing a combination of both 0 and 1 states. This unique characteristic enables quantum computers to explore multiple possibilities simultaneously. Quantum gates, analogous to classical computer gates, act on qubits to manipulate their states and create complex circuits.

Measurement and Collapse

When a qubit is measured, it collapses into either a 0 or 1 state, losing its superposition. This property introduces an element of uncertainty in quantum computing, as the outcome of a measurement cannot be predicted with certainty. However, the probabilistic nature of measurement allows quantum computers to represent a large amount of information with a small number of qubits.

Interference and Amplification

Interference is a crucial concept in quantum computing. By designing quantum circuits in a specific manner, it is possible to amplify the desired computational outcomes while simultaneously canceling out incorrect ones. This interference effect is essential for ensuring that the single answer obtained from a measurement is accurate.

Entanglement

Entanglement is a phenomenon that allows two or more qubits to become inextricably linked. Changes in the state of one entangled qubit instantly affect the state of the others, regardless of the distance between them. This property paves the way for exponential increases in computational power, enabling quantum computers to tackle problems that are currently beyond the reach of classical computers.

Applications and Challenges

The combination of superposition, interference, and entanglement unlocks the potential for quantum computing to solve complex problems in areas such as drug discovery, financial modeling, and artificial intelligence. However, the development of practical quantum computers faces significant challenges. Researchers continue to explore various hardware and software approaches to overcome these obstacles and harness the full power of quantum computing.

If you watch the above video and read the entire summary, you will notice that the summary covers all the concepts explained in the video. This application has many use cases: students can create lecture notes, researchers can quickly get the gist of a talk, and many more. Now, it’s your turn to test the application on other videos and evaluate the performance of the model. Please provide feedback in the comments!

With that, we have completed the summarizer. Now, it’s time to implement the second use case: question answering over the video. For question answering, we are going to reuse all the functions defined above: initialize_model(), get_response(), get_video_transcripts() and get_video_id(). Let’s create the Streamlit UI for question answering over the video.

st.title("Question Answering in YouTube video")
st.markdown("<br>", unsafe_allow_html=True)
youtube_url = st.text_input("Enter youtube video link:")
if youtube_url:
    video_id = get_video_id(youtube_url)
    st.image(f"http://img.youtube.com/vi/{video_id}/0.jpg", use_column_width=True)

user_prompt = st.text_area("Your Prompt on above video", key="user_prompt")
submit = st.button("submit")

If you run the above lines of code you will get the following streamlit UI.

fig: Question Answering in the video

Now, let’s create the prompt describing how the LLM should behave and respond to the user. In the following code, once the user enters a prompt or hits the submit button, we fetch the transcription of the video, initialize the model, prompt the model with the transcription and the user's question, and write the model's response to the Streamlit UI.

model_behavior = """ You are expert in summarization of youtube videos from transcription of videos.
So, input is transcriptions of videos along with prompt which have the user query. Please make sure that you have
understand all the information present in the video from transcription and respond user query.
Please don't add extra information that doesn't make sense but fix typos and return `Couldn't transcribe the video`
if transcription of video is empty otherwise respond accordingly!.
"""
# run only when a URL has been entered, otherwise video_id is not defined
if youtube_url and (user_prompt or submit):
    # transcribe the video
    video_transcriptions = get_video_transcripts(video_id)
    # initialize the gemini-pro model
    gemini_model = initialize_model(model_name="gemini-pro")
    # add transcription and prompt to main prompt
    final_prompt = model_behavior + f"\nvideo transcription: {video_transcriptions} \nprompt: {user_prompt}"

    response = get_response(model=gemini_model, prompt=final_prompt)
    st.write(response)
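
The code above builds the final prompt by string concatenation inside the if block. Pulling that step into a small pure function makes the prompt layout easier to read and to test without calling the model. A minimal sketch, assuming a hypothetical helper named build_qa_prompt() that is not part of the original code:

```python
def build_qa_prompt(instructions, transcript, question):
    """Combine fixed instructions, the transcript and the user question into one prompt."""
    return (
        f"{instructions.strip()}\n\n"
        f"video transcription: {transcript.strip()}\n\n"
        f"prompt: {question.strip()}"
    )
```

The result of build_qa_prompt(model_behavior, video_transcriptions, user_prompt) can be passed to get_response() as the prompt argument.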

Awesome! We have completed the implementation of question answering over a YouTube video. Let’s test its performance with the following prompts.

Q1: What is Quantum Computing?
Response: Quantum computing is a new form of computing that uses qubits instead of bits. Qubits can be in a superposition of states, which means they can be both 0 and 1 at the same time. This allows quantum computers to perform calculations that are impossible for classical computers. Quantum computers are still in their early stages of development, but they have the potential to revolutionize many fields, including medicine, finance, and artificial intelligence.

Q2: what is Entanglement?
Response: Entanglement is when qubits are strongly correlated, meaning that changing the state of just one qubit would change the state of another. For example, two qubits can be entangled so that their states have a 50% chance of measuring a 00 and 50% chance of measuring a 11, but never a 01 or a 10. In this case, if we just change the state of one, the other one would also change.

Q3: List out challanges of quantum computing
Response:

  • Quantum computers are not yet powerful enough to break current encryption standards.
  • Quantum computers are more difficult to program than classical computers.
  • Quantum computers are very expensive to build and operate.
  • Quantum computers are still in the early stages of development and there are many technical challenges that need to be overcome before they can be widely used.

In my testing, it works great for videos that have a transcription available. Now, it’s time to test the application yourself and evaluate the model on the task. What do you think about the performance of the model on the question-answering task? Does it meet your expectations? Please don’t forget to provide your valuable feedback in the comments. I am very excited to see your feedback.

In conclusion, gemini-pro works well when the video is clear, concise and has a usable transcription. This application has one limitation: the output depends on the quality of the transcription. If the transcription is poor, we can’t expect a proper response from the model. For the sample video above, it produced a well-organized and detailed summary and answered all the questions correctly using information from the video. You can find the entire code on my GitHub repository: blog_code_snippets.

Finally, if you have any queries, I am happy to answer them if possible. If you liked this post, please don’t forget to clap and share it with your friends. See you in the next blog!


Written by Netra Prasad Neupane

Machine Learning Engineer with expertise in Computer Vision, Deep Learning, NLP and Generative AI. https://www.linkedin.com/in/netraneupane/
