Using Gemini + Text-to-Speech + MoviePy to create a video, and what this says about what GenAI is rapidly becoming useful for
Like most everyone, I was flabbergasted by NotebookLM and its ability to generate a podcast from a set of documents. And then, I got to thinking: “how do they do that, and where can I get some of that magic?” How easy would it be to replicate?
Goal: Create a video talk from an article
I don’t want to create a podcast, but I’ve often wished I could generate slides and a video talk from my blog posts — some people prefer paging through slides, and others prefer to watch videos, and this would be a good way to meet them where they are. In this article, I’ll show you how to do this.
The full code for this article is on GitHub — in case you want to follow along with me. And the goal is to create this video from this article:
1. Initialize the LLM
I’m going to use Google Gemini Flash because (a) it is the least expensive frontier LLM today, (b) it is multimodal, in that it can read and understand images also, and (c) it supports controlled generation, meaning that we can make sure the output of the LLM matches a desired structure.
import pdfkit
import os
import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv("../genai_agents/keys.env")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
Note that I am using Google Generative AI and not Google Cloud Vertex AI. The two packages are different. The Google one supports Pydantic objects for controlled generation; the Vertex AI one supports only JSON for now.
2. Get a PDF of the article
I used Python to download the article as a PDF, and upload it to a temporary storage location that Gemini can read:
ARTICLE_URL = "https://lakshmanok.medium...."
pdfkit.from_url(ARTICLE_URL, "article.pdf")
pdf_file = genai.upload_file("article.pdf")
Unfortunately, something about Medium prevents pdfkit from getting the images in the article (perhaps because they are webm and not png …). So, my slides are going to be based on just the text of the article and not the images.
3. Create lecture notes in JSON
Here, the data format I want is a set of slides, each of which has a title, key points, and a set of lecture notes. The lecture as a whole has a title and an attribution also.
from typing import List
from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    key_points: List[str]
    lecture_notes: str

class Lecture(BaseModel):
    slides: List[Slide]
    lecture_title: str
    based_on_article_by: str
Let’s tell Gemini what we want it to do:
lecture_prompt = """
You are a university professor who needs to create a lecture for
a class of undergraduate students.

* Create a 10-slide lecture based on the following article.
* Each slide should contain the following information:
  - title: a single sentence that summarizes the main point
  - key_points: a list of between 2 and 5 bullet points. Use phrases, not full sentences.
  - lecture_notes: 3-10 sentences explaining the key points in easy-to-understand language. Expand on the points using other information from the article.
* Also, create a title for the lecture and attribute the original article's author.
"""
The prompt is pretty straightforward — ask Gemini to read the article, extract key points, and create lecture notes.
Now, invoke the model, passing in the PDF file and asking it to populate the desired structure:
model = genai.GenerativeModel(
    "gemini-1.5-flash-001",
    system_instruction=[lecture_prompt]
)
generation_config = {
    "temperature": 0.7,
    "response_mime_type": "application/json",
    "response_schema": Lecture
}
response = model.generate_content(
    [pdf_file],
    generation_config=generation_config,
    stream=False
)
A few things to note about the code above:
- We pass in the prompt as the system prompt, so that we don’t have to keep sending in the prompt with new inputs.
- We specify the desired response type as JSON, and the schema to be a Pydantic object.
- We send the PDF file to the model and tell it to generate a response. We’ll wait for it to complete (no need to stream).
The result is JSON, so extract it into a Python object:
lecture = json.loads(response.text)
For example, this is what the third slide looks like:
{'key_points': [
'Silver layer cleans, structures, and prepares data for self-service analytics.',
'Data is denormalized and organized for easier use.',
'Type 2 slowly changing dimensions are handled in this layer.',
'Governance responsibility lies with the source team.'
],
'lecture_notes': 'The silver layer takes data from the bronze layer and transforms it into a usable format for self-service analytics. This involves cleaning, structuring, and organizing the data. Type 2 slowly changing dimensions, which track changes over time, are also handled in this layer. The governance of the silver layer rests with the source team, which is typically the data engineering team responsible for the source system.',
'title': 'The Silver Layer: Data Transformation and Preparation'
}
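Because the response schema was declared with Pydantic, you can also validate the parsed JSON back into typed objects instead of working with raw dicts. A minimal sketch (assuming Pydantic v2; the sample string is a stand-in for `response.text`):

```python
from typing import List
from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    key_points: List[str]
    lecture_notes: str

class Lecture(BaseModel):
    slides: List[Slide]
    lecture_title: str
    based_on_article_by: str

# Stand-in for the JSON text returned by Gemini
raw = """
{"lecture_title": "The Medallion Architecture",
 "based_on_article_by": "Lak Lakshmanan",
 "slides": [{"title": "The Silver Layer",
             "key_points": ["Cleans and structures data"],
             "lecture_notes": "The silver layer prepares data for analytics."}]}
"""

lecture = Lecture.model_validate_json(raw)
print(lecture.slides[0].title)  # typed attribute access instead of dict lookups
```

With Pydantic v1 you would use `Lecture.parse_raw(raw)` instead, and validation errors surface immediately if the model's output drifts from the schema.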
4. Convert to PowerPoint
We can use the Python package pptx to create a Presentation with notes and bullet points. The code to create a slide looks like this:
for slidejson in lecture['slides']:
    slide = presentation.slides.add_slide(presentation.slide_layouts[1])
    title = slide.shapes.title
    title.text = slidejson['title']
    # bullets
    textframe = slide.placeholders[1].text_frame
    for key_point in slidejson['key_points']:
        p = textframe.add_paragraph()
        p.text = key_point
        p.level = 1
    # notes
    notes_frame = slide.notes_slide.notes_text_frame
    notes_frame.text = slidejson['lecture_notes']
The result is a PowerPoint presentation that looks like this:
Not very fancy, but definitely a good starting point for editing if you are going to give a talk.
5. Read the notes aloud and save the audio
Well, we were inspired by a podcast, so let’s see how to create just the audio of someone summarizing the article.
We already have the lecture notes, so let’s create audio files for each of the slides.
Here’s the code to take some text and have an AI voice read it out. We save the resulting audio into an mp3 file:
from google.cloud import texttospeech

def convert_text_audio(text, audio_mp3file):
    """Synthesizes speech from the input string of text."""
    tts_client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = tts_client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )
    # The response's audio_content is binary.
    with open(audio_mp3file, "wb") as out:
        out.write(response.audio_content)
        print(f"{audio_mp3file} written.")
What’s happening in the code above?
- We’re using Google Cloud’s Text-to-Speech API.
- We ask it to use a standard US-accent female voice. If you were doing a podcast, you’d pass in a “speaker map” here, one voice for each speaker.
- We then give it the input text and ask it to generate audio.
- We save the audio as an mp3 file. Note that this has to match the audio encoding.
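To illustrate the “speaker map” idea for a podcast, here is one way it could look. The voice names and the transcript format are my assumptions; each line would then be synthesized with its own `VoiceSelectionParams`, and the clips concatenated:

```python
# Hypothetical speaker map for a two-host podcast: each speaker gets a
# distinct Google Cloud TTS voice name.
SPEAKER_VOICES = {
    "host": "en-US-Standard-C",   # the female voice used in the article
    "guest": "en-US-Standard-D",  # a different voice for the second speaker
}

def voice_for(line: str) -> tuple:
    """Split a transcript line like 'host: Welcome!' into (voice_name, text)."""
    speaker, _, text = line.partition(":")
    return SPEAKER_VOICES[speaker.strip()], text.strip()

transcript = [
    "host: Welcome to the show!",
    "guest: Thanks for having me.",
]
for line in transcript:
    voice_name, text = voice_for(line)
    # Here you would build VoiceSelectionParams(name=voice_name, ...),
    # call synthesize_speech per line, and concatenate the resulting clips.
    print(voice_name, "->", text)
```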
Now, create the audio by iterating through the slides, passing in the lecture notes:
for slideno, slide in enumerate(lecture['slides']):
    text = f"On to {slide['title']}\n"
    text += slide['lecture_notes'] + "\n\n"
    filename = os.path.join(outdir, f"audio_{slideno+1:02}.mp3")
    convert_text_audio(text, filename)
    filenames.append(filename)
The result is a bunch of audio files. You can concatenate them if you wish using pydub:
combined = pydub.AudioSegment.empty()
for audio_file in audio_files:
    audio = pydub.AudioSegment.from_file(audio_file)
    combined += audio
    # pause for 4 seconds
    silence = pydub.AudioSegment.silent(duration=4000)
    combined += silence
combined.export("lecture.wav", format="wav")
But it turned out that I didn’t need to. The individual audio files, one for each slide, were what I needed to create a video. For a podcast, of course, you’d want a single mp3 or wav file.
6. Create images of the slides
Rather annoyingly, there’s no easy way to render PowerPoint slides as images using Python. You need a machine with Office software installed to do that — not the kind of thing that’s easily automatable. Maybe I should have used Google Slides … Anyway, a simple way to render images is to use the Python Imaging Library (PIL):
import textwrap
from PIL import Image, ImageDraw, ImageFont

def wrap(text, width):
    # helper (assumed): multiline_text() needs explicit newlines
    return "\n".join(textwrap.wrap(text, width))

def text_to_image(output_path, title, keypoints):
    image = Image.new("RGB", (1000, 750), "black")
    draw = ImageDraw.Draw(image)
    title_font = ImageFont.truetype("Coval-Black.ttf", size=42)
    draw.multiline_text((10, 25), wrap(title, 50), font=title_font)
    text_font = ImageFont.truetype("Coval-Light.ttf", size=36)
    for ptno, keypoint in enumerate(keypoints):
        draw.multiline_text((10, (ptno+2)*100), wrap(keypoint, 60), font=text_font)
    image.save(output_path)
The resulting image is not great, but it’s serviceable (you can tell nobody pays me to write production code anymore):
7. Create a Video
Now that we have a set of audio files and a set of image files, we can use the Python package moviepy to create a video clip:
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

clips = []
for slide, audio in zip(slide_files, audio_files):
    audio_clip = AudioFileClip(f"article_audio/{audio}")
    slide_clip = ImageClip(f"article_slides/{slide}").set_duration(audio_clip.duration)
    slide_clip = slide_clip.set_audio(audio_clip)
    clips.append(slide_clip)
full_video = concatenate_videoclips(clips)
And we can now write it out:
full_video.write_videofile("lecture.mp4", fps=24, codec="mpeg4",
temp_audiofile='temp-audio.mp4', remove_temp=True)
End result? We have four artifacts, all created automatically from the article.pdf:
lecture.json lecture.mp4 lecture.pptx lecture.wav
There’s:
- a JSON file with key points, lecture notes, etc.
- A PowerPoint file that you can edit. The slides have the key points, and the notes section of each slide has the “lecture notes”.
- An audio file consisting of an AI voice reading out the lecture notes.
- An mp4 movie (that I uploaded to YouTube) of the audio + images. This is the video talk that I set out to create.
Pretty cool, eh?
8. What this says about the future of software
We’re all, as a community, probing around to find out what this really cool technology (generative AI) can be used for. Obviously, you can use it to create content, but the content it creates is good for brainstorming, not for use as-is. Three years of improvements in the tech haven’t solved the problem that GenAI generates blah content and not-ready-to-use code.
That brings us to some of the ancillary capabilities that GenAI has opened up. And these turn out to be extremely useful. There are four capabilities of GenAI that this post illustrates.
(1) Translating unstructured data to structured data
The Attention paper was written to solve the translation problem, and it turns out transformer-based models are really good at translation. We keep discovering use cases for this. Not just Japanese to English, but also Java 11 to Java 17, text to SQL, text to speech, between database dialects, …, and now articles to audio scripts. This, it turns out, is the stepping stone to using GenAI to create podcasts, lectures, videos, etc.
All I had to do was prompt the LLM to construct a series of slide contents (key points, title, etc.) from the article, and it did. It even returned the data to me in a structured format, conducive to use from a computer program. In particular, GenAI is really good at translating unstructured data into structured data.
(2) Code search and coding assistance are now dramatically better
The other thing that GenAI turns out to be really good at is adapting code samples dynamically. I don’t write code to create presentations or do text-to-speech or use moviepy every day. Two years ago, I’d have been using Google search, getting Stack Overflow pages, and adapting the code by hand. Now, Google search gives me ready-to-incorporate code:
Of course, had I been using a Python IDE (rather than a Jupyter notebook), I could have avoided the search step entirely — I could have written a comment and gotten the code generated for me. This is hugely helpful, and it speeds up development using general-purpose APIs.
(3) GenAI web services are robust and easy to consume
Let’s not lose track of the fact that I used the Google Cloud Text-to-Speech service to turn my audio script into actual audio files. Text-to-speech is itself a generative AI model (and another example of the translation superpower). The Google TTS service, launched in 2018 (and presumably improved since then), was one of the first generative AI services in production made available through an API.
In this article, I used two generative AI models — TTS and Gemini — that are made available as web services. All I had to do was call their APIs.
(4) It’s easier than ever to offer end-user customizability
I didn’t do this, but you can squint a little and see where things are headed. If I had wrapped up the presentation creation, audio creation, and movie creation code in services, I could have had a prompt create the function calls to invoke these services as well. And put in a request-handling agent that would let you use text to change the look-and-feel of the slides or the voice reading the video.
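To make the idea concrete, here is a hypothetical sketch of the dispatch side: each step is exposed as a “tool”, and the model’s function-calling output picks which one to run. The tool names and the hard-coded call (standing in for a real Gemini function-calling response) are my inventions:

```python
# Two of the pipeline's steps wrapped as callable "tools"
def set_slide_theme(color: str) -> str:
    return f"slide theme set to {color}"

def set_narration_voice(voice_name: str) -> str:
    return f"narration voice set to {voice_name}"

TOOLS = {
    "set_slide_theme": set_slide_theme,
    "set_narration_voice": set_narration_voice,
}

# What the model might return for the request "make the slides dark blue";
# in a real build, these tools would be passed to Gemini's function-calling
# API and this dict would come back in the model's response.
model_call = {"name": "set_slide_theme", "args": {"color": "dark blue"}}

result = TOOLS[model_call["name"]](**model_call["args"])
print(result)  # -> slide theme set to dark blue
```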
It becomes extremely easy to add open-ended customizability to the software you build.
Summary
Inspired by the NotebookLM podcast feature, I set out to build an application that would convert my articles into video talks. The key steps are to prompt an LLM to produce slide contents from the article, use another GenAI model to convert the audio script into audio files, and use existing Python APIs to put them together into a video.
This article illustrates four capabilities that GenAI is unlocking: translation of all kinds, coding assistance, robust web services, and end-user customizability.
I loved being able to easily and quickly create video lectures from my articles. But I’m even more excited about the potential we keep discovering in this new tool we have in our hands.
Further Reading
- Full code for this article: https://github.com/lakshmanok/lakblogs/blob/main/genai_seminar/create_lecture.ipynb
- The source article that I converted to a video: https://lakshmanok.medium.com/what-goes-into-bronze-silver-and-gold-layers-of-a-medallion-data-architecture-4b6fdfb405fc
- The resulting video: https://youtu.be/jKzmj8-1Y9Q
- Turns out Sascha Heyer wrote up how to use GenAI to generate a podcast, which is the exact NotebookLM use case. His approach is somewhat similar to mine, except that there is no video, just audio. In a cool twist, he uses his own voice as one of the podcast speakers!
- Of course, here’s the video talk of this article, created using the technique shown in this article. Ideally, we’d be pulling out code snippets and images from the article too, but this is a start …