Interview Setup and Video Upload
To create a user-friendly interface for setting up interviews and providing video links, I used Google Colab's forms functionality. Forms allow the creation of text fields, sliders, dropdowns, and more, while the code stays hidden behind the form, making the notebook accessible to non-technical users.
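A minimal sketch of what such a form cell can look like (the field names here are illustrative, not the notebook's exact ones):

```python
# Colab renders "#@param" annotations as interactive form widgets,
# so non-technical users never have to touch the code behind them.
video_url = "https://www.youtube.com/watch?v=example"  #@param {type:"string"}
interview_name = "my-interview"  #@param {type:"string"}
language = "en"  #@param ["en", "pt", "es", "fr", "de"]

print(f"Interview '{interview_name}' set for {video_url} ({language})")
```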
Audio download and conversion
I used the yt-dlp library to download only the audio from a YouTube video and convert it to MP3 format. It is very easy to use, and you can check its documentation here.
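A minimal sketch of that step, assuming ffmpeg is available in the Colab instance (the option names follow yt-dlp's documented Python API):

```python
import yt_dlp

def download_audio(video_url: str, output_name: str) -> None:
    # Grab the best audio-only stream and convert it to MP3
    # with yt-dlp's FFmpeg post-processor.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": output_name,  # extension is added by the post-processor
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([video_url])

download_audio("https://www.youtube.com/watch?v=example", "interview")
```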
Audio transcription
To transcribe the meeting, I used Whisper from OpenAI. It is an open-source speech recognition model trained on more than 680K hours of multilingual data.
The model runs remarkably fast: a one-hour audio clip takes around 6 minutes to transcribe on a 16GB T4 GPU (provided for free on Google Colab), and it supports 99 different languages.
Since privacy is a requirement for the solution, the model weights are downloaded and all inference happens inside the Colab instance. I also added a Model Selection form to the notebook so the user can choose different models based on the precision they are looking for.
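A minimal sketch of the transcription step (the model size and file name are placeholders; the checkpoint is downloaded once and cached, so inference stays local):

```python
import whisper

# Download the selected checkpoint and run inference locally on the GPU.
model = whisper.load_model("large")  # or "tiny", "base", "small", "medium"

# word_timestamps=True asks Whisper for per-word timing metadata,
# which the alignment step described later refines.
result = model.transcribe("interview.mp3", language="en", word_timestamps=True)
print(result["text"][:200])
```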
Speaker Identification
Speaker identification is done through a technique called speaker diarization. The idea is to segment the audio into distinct speech segments, where each segment corresponds to a specific speaker. With that, we can identify who spoke and when.
Since videos uploaded from YouTube don't have metadata identifying who is speaking, the speakers are simply labeled Speaker 0, Speaker 1, etc. Later, the user can find and replace these names in Google Docs to add the speakers' identities.
For the diarization, we will use a model called the Multi-Scale Diarization Decoder (MSDD), developed by NVIDIA researchers. It is a sophisticated approach to speaker diarization that leverages multi-scale analysis and dynamic weighting to achieve high accuracy and flexibility.
The model is known for being quite good at identifying and properly categorizing moments when multiple speakers are talking, something that happens frequently during interviews.
The model can be used through the NVIDIA NeMo framework, which allowed me to fetch the MSDD checkpoints and run the diarization directly in the Colab notebook with just a few lines of code.
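In rough outline, the NeMo side looks like this (a sketch, not the notebook's exact code: it assumes one of NeMo's stock inference configs, such as diar_infer_telephonic.yaml, and a mono WAV version of the audio):

```python
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# Manifest describing the audio to diarize; the speaker count is unknown.
manifest = {
    "audio_filepath": "interview.wav",  # mono WAV version of the audio
    "offset": 0, "duration": None, "label": "infer", "text": "-",
    "num_speakers": None, "rttm_filepath": None, "uem_filepath": None,
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f)

# Inference config shipped with NeMo; it points at the MSDD checkpoint.
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "manifest.json"
cfg.diarizer.out_dir = "diarization_output"

# Downloads the checkpoints and writes per-speaker segments as an RTTM file.
NeuralDiarizer(cfg=cfg).diarize()
```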
Looking into the diarization results from MSDD, I noticed that the punctuation was quite poor, with very long sentences, and that interjections such as "hmm" and "yeah" were counted as speaker interruptions, making the text hard to read.
So I decided to add a punctuation model to the pipeline to improve the readability of the transcribed text and facilitate human review. I picked the punctuate-all model from Hugging Face, a precise and fast solution that supports the following languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian.
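The model plugs into the standard transformers pipeline; a minimal sketch (kredor/punctuate-all is one checkpoint published under that name, and the post-processing is simplified):

```python
from transformers import pipeline

# Token-classification model that predicts, for each token, the punctuation
# mark (if any) that should follow it; the label "0" means "no punctuation".
punctuate = pipeline("token-classification", model="kredor/punctuate-all")

text = "so yeah i think the main advice is to just keep building things"
for pred in punctuate(text):
    print(pred["word"], "->", pred["entity"])
# The notebook then rebuilds the transcript, appending each predicted mark
# to its word (subword tokens have to be merged back together first).
```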
Video Synchronization
From the commercial solutions I benchmarked, a strong requirement was that every word should be linked to the moment in the interview when it was spoken.
The Whisper transcriptions carry metadata with the timestamps at which the words were said; however, this metadata is not very precise.
Therefore, I used a model called Wav2Vec2 to do this matching more accurately. Essentially, the solution is a neural network designed to learn representations of audio and perform speech recognition alignment. The process involves finding the exact timestamps in the audio signal where each segment was spoken and aligning the text accordingly.
With the transcription <> timestamp matching properly done, a bit of simple Python code creates hyperlinks pointing to the moment in the video where each word starts to be said.
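That last part is plain string formatting; a sketch (the video id and word timings are illustrative, and t= is YouTube's standard timestamp offset in seconds):

```python
def youtube_link(video_id: str, start_seconds: float) -> str:
    # YouTube jumps straight to a timestamp via the "t" query parameter.
    return f"https://youtu.be/{video_id}?t={int(start_seconds)}"

# (word, start time in seconds) pairs produced by the alignment step.
aligned_words = [("welcome", 12.4), ("everyone", 13.1)]
for word, start in aligned_words:
    print(word, youtube_link("VIDEO_ID", start))
```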
The LLM Model
This step of the pipeline runs a large language model locally to analyze the text and provide insights about the interview. By default, I added a Gemma 1.1 model with a prompt to summarize the text. If the user opts in to the summarization, it will appear as a bullet list at the top of the document.
Also, by clicking on Show code, users can change the prompt and ask the model to perform a different task.
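A sketch of the summarization step with transformers (the exact checkpoint, assumed here to be google/gemma-1.1-2b-it, and the prompt wording are placeholders):

```python
from transformers import pipeline

# Load an instruction-tuned Gemma checkpoint locally. It is a gated model,
# so the HF_TOKEN secret and an acknowledged license are required.
generator = pipeline("text-generation", model="google/gemma-1.1-2b-it")

transcript = "Speaker 0: Welcome everyone..."  # full transcript goes here
prompt = f"Summarize the following interview as a bullet list:\n\n{transcript}"

summary = generator(prompt, max_new_tokens=256, return_full_text=False)
print(summary[0]["generated_text"])
```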
Document generation for tagging, highlights, and comments
The last task performed by the solution is to generate a Google Docs file with the transcription and hyperlinks to the interview. This was done through the Google API Python client library.
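In outline, the document creation looks like this (a sketch using google-api-python-client under Colab's built-in authentication; the real notebook also styles the text and attaches the timestamp links):

```python
from google.auth import default
from google.colab import auth
from googleapiclient.discovery import build

# In Colab, this pops up the Google account consent screen.
auth.authenticate_user()
creds, _ = default()
docs = build("docs", "v1", credentials=creds)

# Create an empty document, then insert the transcript at the start of the body.
doc = docs.documents().create(body={"title": "Interview transcription"}).execute()
requests = [{
    "insertText": {
        "location": {"index": 1},  # index 1 = beginning of the document body
        "text": "Speaker 0: Welcome everyone...\n",
    }
}]
docs.documents().batchUpdate(
    documentId=doc["documentId"], body={"requests": requests}
).execute()
print(f"https://docs.google.com/document/d/{doc['documentId']}/edit")
```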
Since the product has become extremely useful in my day-to-day work, I decided to give it a name for easier reference. I called it the Insights Gathering Open-source Tool, or iGot.
When using the solution for the first time, some initial setup is required. Let me guide you through a real-world example to help you get started.
Open the iGot notebook and install the required libraries
Click on this link to open the notebook and run the first cell to install the required libraries. It will take around 5 minutes.
If you get a prompt asking you to restart the notebook, just cancel it. There is no need.
If everything runs as expected, you will get the message "All libraries installed!".
Getting the Hugging Face user access token and model access
(This step is required only the first time you execute the notebook.)
To run the Gemma and punctuate-all models, we will download weights from Hugging Face. To do so, you need to request a user token and model access.
First, create a Hugging Face account and follow these steps to get a token with read permissions.
Once you have the token, copy it and go back to the Colab notebook. Go to the Secrets tab and click on "Add new secret."
Name your token HF_TOKEN and paste the key you got from Hugging Face.
Next, click on this link to open the Gemma model on Hugging Face. Then, click on "Acknowledge license" to get access to the model.
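Inside the notebook, the secret is read back roughly like this (a sketch using Colab's userdata API; Colab will ask you to grant the notebook access to the secret on first use):

```python
from google.colab import userdata
from huggingface_hub import login

# Read the HF_TOKEN secret stored in Colab's Secrets tab and authenticate
# against Hugging Face so the gated Gemma weights can be downloaded.
login(token=userdata.get("HF_TOKEN"))
```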
Sending the interview
To send an interview to iGot, you first need to upload it as an unlisted video on YouTube. For the purposes of this tutorial, I took a section of Andrej Karpathy's interview with Lex Fridman and uploaded it to my account. It is the part of the conversation where Andrej gives advice to machine learning beginners.
Then, get the video URL, paste it into the video_url field of the Interview Selection notebook cell, define a name for it, and indicate the language spoken in the video.
Once you run the cell, you will receive a message indicating that an audio file was generated.
Model selection and execution
In the next cell, you can select the size of the Whisper model you want to use for the transcription. The bigger the model, the higher the transcription precision.
By default, the largest model is selected. Make your choice and run the cell.
Then, run the model execution cell to run the pipeline of models shown in the previous section. If everything goes as expected, you should receive the message "Punctuation done!" at the end.
If you get prompted with a message asking for access to the Hugging Face token, grant access to it.
Configuring the transcript output
The final step is to save the transcription to a Google Docs file. To accomplish this, you need to specify the file path, provide the interview name, and indicate whether you want Gemma to summarize the meeting.
When executing the cell for the first time, you will get prompted with a message asking for access to your Google Drive. Click Allow.
Then, give Colab full access to your Google Drive workspace.
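Behind the scenes, that prompt comes from the standard Drive mount call (a sketch; /content/drive is Colab's conventional mount point):

```python
from google.colab import drive

# Mount Google Drive so the generated document can be saved
# to the folder path the user specified in the form.
drive.mount("/content/drive")
```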
If every little thing runs as anticipated, you’ll see a hyperlink to the google docs file on the finish. Simply click on on it, and you’ll have entry to your interview transcription.
Gathering insights from the generated document
The final document will contain the transcription, with each word linked to the corresponding moment in the video where it starts. Since YouTube doesn't provide speaker metadata, I recommend using Google Docs' find and replace tool to substitute "Speaker 0," "Speaker 1," and so on with the actual names of the speakers.
With that, you can work on highlights, notes, reactions, etc., as envisioned at the beginning:
The tool is only in its first version, and I plan to evolve it into a more user-friendly solution, perhaps by hosting a website so users don't need to interact directly with the notebook, or by creating a plugin for using it in Google Meet and Zoom.
My main goal with this project was to create a high-quality meeting transcription tool that can be useful to others while demonstrating how available open-source tools can match the capabilities of commercial solutions.
I hope you find it useful! Feel free to reach out to me on LinkedIn if you have any feedback or are interested in collaborating on the evolution of iGot 🙂