In today's digital landscape, content repurposing has become essential for maximizing reach and engagement. One effective strategy is transforming long-form content such as blog posts into engaging Twitter threads. However, manually creating these threads can be time-consuming and tedious. In this article, we will explore how to build an application that automates blog-to-Twitter-thread creation using Google's Gemini-2.0 LLM, ChromaDB, and Streamlit.
Learning Objectives
- Automate blog-to-Twitter-thread transformation using Google's Gemini-2.0, ChromaDB, and Streamlit for efficient content repurposing.
- Gain hands-on experience building an automated blog-to-Twitter-thread pipeline with embedding models and AI-driven prompt engineering.
- Understand the capabilities of Google's Gemini-2.0 LLM for automated content transformation.
- Explore the integration of ChromaDB for efficient semantic text retrieval.
- Build a Streamlit-based web application for seamless PDF-to-Twitter-thread conversion.
- Gain hands-on experience with embedding models and prompt engineering for content generation.
This article was published as a part of the Data Science Blogathon.
What is Gemini-2.0?
Gemini-2.0 is Google's latest multimodal Large Language Model (LLM), representing a significant advancement in AI capabilities. It is now available as the gemini-2.0-flash-exp API in Vertex AI Studio. It offers improved performance in areas such as:
- Multimodal understanding, coding, complex instruction following, and function calling in natural language.
- Context-aware content creation.
- Complex reasoning and analysis.
- Native image generation, image editing, and controllable text-to-speech generation.
- Low-latency responses with the Flash variant.
For our project, we are specifically using the gemini-2.0-flash-exp model API, which is optimized for quick responses while maintaining high-quality output.
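As a quick orientation, here is a minimal sketch of calling this model through LangChain (assuming the langchain-google-genai package is installed and GOOGLE_API_KEY is set in your environment, both of which we cover in the setup section below):
# Minimal sketch: calling gemini-2.0-flash-exp through LangChain.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp", temperature=0.7)
response = llm.invoke("Summarize the idea of content repurposing in one tweet.")
print(response.content)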
What is the ChromaDB Vector Database?
ChromaDB is an open-source embedding database that excels at storing and retrieving vector embeddings. It is a high-performance database designed for efficiently storing, searching, and managing embeddings generated by AI models. It enables similarity search by indexing vectors and comparing them based on their proximity to other vectors in multidimensional space. Its key strengths include:
- Efficient similarity search capabilities
- Easy integration with popular embedding models
- Local storage and persistence
- Flexible querying options
- Lightweight deployment
In our application, ChromaDB serves as the backbone for storing and retrieving relevant chunks of text based on semantic similarity, enabling more contextual and accurate thread generation.
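To make this concrete, here is a minimal sketch of storing a few chunks and running a similarity search, using the same LangChain Chroma wrapper and Google embeddings the project uses later (the chunk texts and the demo directory name are placeholder examples):
# Minimal sketch: embed a few text chunks in Chroma and query them by similarity.
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma.from_texts(
    texts=["Gemini-2.0 is a multimodal LLM.", "ChromaDB stores vector embeddings."],
    embedding=embeddings,
    persist_directory="./data/chroma_demo",  # placeholder demo directory
)
docs = vectordb.similarity_search("Which database stores embeddings?", k=1)
print(docs[0].page_content)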
What is Streamlit UI?
Streamlit is an open-source Python library designed for quickly building interactive, data-driven web applications for AI/ML projects. Its focus on simplicity enables developers to create visually appealing and functional apps with minimal effort.
Key Features:
- Ease of Use: Developers can turn Python scripts into web apps with just a few lines of code.
- Widgets: It offers a wide range of input widgets (sliders, dropdowns, text inputs) to make applications interactive.
- Data Visualization: It supports integration with popular Python libraries such as Matplotlib, Plotly, and Altair for dynamic visualizations.
- Real-time Updates: Apps automatically rerun when code or inputs change, providing a seamless user experience.
- No Web Development Required: There is no need to learn HTML, CSS, or JavaScript.
Applications of Streamlit
Streamlit is widely used for building dashboards, exploratory data analysis tools, and AI/ML application prototypes. Its simplicity and interactivity make it ideal for rapid prototyping and for sharing insights with non-technical stakeholders. We are using Streamlit to design the interface for our application. A minimal example follows.
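For readers new to Streamlit, a tiny app looks like this (a toy sketch, unrelated to the project code that follows; hello_app.py is just an example file name):
# hello_app.py -- a toy Streamlit app with one input and one button.
import streamlit as st

st.title("Hello, Streamlit")
name = st.text_input("Your name")
if st.button("Greet"):
    st.success(f"Hello, {name}!")
Run it with streamlit run hello_app.py and the app opens in your browser.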
Motivation for Tweet Generation Automation
The primary motivations behind automating tweet thread generation include:
- Time efficiency: Reducing the manual effort required to create engaging Twitter threads.
- Consistency: Maintaining a consistent voice and format across all threads.
- Scalability: Processing multiple articles quickly and efficiently.
- Enhanced engagement: Leveraging AI to create more compelling and shareable content.
- Content optimization: Using data-driven approaches to structure threads effectively.
Project Environment Setup Using Conda
To set up the project environment, follow these steps:
# create a new conda env
conda create -n tweet-gen python=3.11
conda activate tweet-gen
Install the required packages (pyperclip is needed later for the copy-to-clipboard buttons in the Streamlit app):
pip install langchain langchain-community langchain-google-genai
pip install chromadb streamlit python-dotenv pypdf pydantic pyperclip
Now create a project folder named BlogToTweet (or any name you prefer).
Also, create a .env file in your project root. Get your Google API key and put it in the .env file:
GOOGLE_API_KEY="<your API KEY>"
We are all set to dive into the main implementation.
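If you want to confirm the key is picked up correctly before going further, a quick check looks like this (a small sketch; check_env.py is just an example file name):
# check_env.py -- sanity check that the API key loads from .env.
import os
from dotenv import load_dotenv

load_dotenv()
print("GOOGLE_API_KEY loaded" if os.getenv("GOOGLE_API_KEY") else "GOOGLE_API_KEY missing - check your .env file")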
Project Implementation
In our project there are four important files, each with its own role, which keeps development clean:
- services.py: Contains all the important services.
- models.py: Contains all the important Pydantic data models.
- main.py: For testing the automation in the terminal.
- app.py: For the Streamlit UI implementation.
Implementing the Models
We will start by implementing the Pydantic data models in the models.py file. (Pydantic is a Python library for data validation built on type annotations.)
# models.py
from typing import Optional, List
from pydantic import BaseModel


class ArticleContent(BaseModel):
    title: str
    content: str
    author: Optional[str] = None
    url: str


class TwitterThread(BaseModel):
    tweets: List[str]
    hashtags: List[str]
These are simple yet important models that give the article content and the generated tweets a consistent structure.
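As a quick sanity check, here is how these models can be constructed by hand (the values below are placeholders, not taken from a real article):
# Quick check of the data models with placeholder values.
from models import ArticleContent, TwitterThread

article = ArticleContent(
    title="Building LLM-Powered Apps",
    content="Full article text goes here...",
    author="Jane Doe",  # author is optional
    url="data/build_llm_powered_app.pdf",
)
thread = TwitterThread(
    tweets=["1/2 Why repurpose blog posts?", "2/2 Read the full article!"],
    hashtags=["#AI", "#ContentCreation"],
)
print(article.title, len(thread.tweets))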
Implementing the Services
The ContentRepurposer class handles the core functionality of the application. Here is the skeletal structure of the class:
# services.py
import os
from dotenv import load_dotenv
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from models import ArticleContent, TwitterThread


class ContentRepurposer:
    def __init__(self):
        pass

    def process_pdf(self, pdf_path: str) -> ArticleContent:
        pass

    def get_relevant_chunks(self, query: str, k: int = 3) -> List[str]:
        pass

    def generate_twitter_thread(self, article: ArticleContent) -> TwitterThread:
        pass

    def process_article(self, pdf_path: str) -> TwitterThread:
        pass
In the __init__ method, we set up all the important attributes of the class:
def __init__(self):
    from pydantic import SecretStr

    google_api_key = os.getenv("GOOGLE_API_KEY")
    if google_api_key is None:
        raise ValueError("GOOGLE_API_KEY environment variable is not set")
    _google_api_key = SecretStr(google_api_key)

    # Initialize Gemini model and embeddings
    self.embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
    )
    self.llm = ChatGoogleGenerativeAI(
        model="gemini-2.0-flash-exp",
        temperature=0.7,
    )

    # Initialize text splitter
    self.text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""],
    )
Here, we use Pydantic's SecretStr for safe handling of the API key. To embed our articles we use GoogleGenerativeAIEmbeddings with the embedding-001 model, and to generate tweets from the article we use ChatGoogleGenerativeAI with the latest gemini-2.0-flash-exp model. RecursiveCharacterTextSplitter splits a large document into parts; here we split the document into chunks of 1,000 characters with a 200-character overlap, as illustrated below.
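To see what the splitter actually does, here is a small illustration with toy values (a much smaller chunk size than the project's 1000/200, so the overlapping chunks are easy to see on a short string):
# Illustration of RecursiveCharacterTextSplitter with toy chunk sizes.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text("LangChain splits long documents into overlapping chunks. " * 3)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk!r}")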
Processing the PDF
The system processes PDFs using PyPDFLoader from LangChain and applies text chunking.
def process_pdf(self, pdf_path: str) -> ArticleContent:
    """Process a local PDF and create embeddings"""
    # Load PDF
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    # Extract text
    text = " ".join(page.page_content for page in pages)

    # Split text into chunks
    chunks = self.text_splitter.split_text(text)

    # Create and store embeddings in Chroma
    self.vectordb = Chroma.from_texts(
        texts=chunks,
        embedding=self.embeddings,
        persist_directory="./data/chroma_db"
    )

    # Extract title and author
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    title = lines[0] if lines else "Untitled"
    author = lines[1] if len(lines) > 1 else None

    return ArticleContent(
        title=title,
        content=text,
        author=author,
        url=pdf_path
    )
The code above implements the PDF-processing functionality of the application.
- Load and Extract PDF Text: PyPDFLoader reads the PDF file and extracts the text content from all pages, concatenating it into a single string.
- Split Text into Chunks: The text is divided into smaller chunks using the text_splitter for better processing and embedding creation.
- Generate Embeddings: Chroma creates vector embeddings from the text chunks and stores them in a persistent database directory.
- Extract Title and Author: The first non-empty line is used as the title, and the second as the author.
- Return the Article Content: An ArticleContent object is assembled containing the title, full text, author, and file path.
Getting the Relevant Chunks
def get_relevant_chunks(self, query: str, k: int = 3) -> List[str]:
    """Retrieve relevant chunks from the vector database"""
    results = self.vectordb.similarity_search(query, k=k)
    return [doc.page_content for doc in results]
This method retrieves the top k (default 3) text chunks most relevant to the given query from the vector database, based on similarity.
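In practice it is called with a short natural-language query after process_pdf has populated the vector store, for example (illustrative values):
# Illustrative usage (assumes process_pdf() has already built self.vectordb).
repurposer = ContentRepurposer()
repurposer.process_pdf("data/build_llm_powered_app.pdf")
chunks = repurposer.get_relevant_chunks("conclusion and key takeaways", k=2)
for chunk in chunks:
    print(chunk[:100], "...")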
Generating the Tweet Thread from the Article
This method is the most important one: here we bring the generative model, the embeddings, and the prompt together to generate the thread from the user's PDF file.
def generate_twitter_thread(self, article: ArticleContent) -> TwitterThread:
    """Generate a Twitter thread using Gemini"""
    # First, get the most relevant chunks for different parts of the article
    intro_chunks = self.get_relevant_chunks("introduction and details")
    technical_chunks = self.get_relevant_chunks("technical details and implementation")
    conclusion_chunks = self.get_relevant_chunks("conclusion and key takeaways")

    thread_prompt = PromptTemplate(
        input_variables=["title", "intro", "technical", "conclusion"],
        template="""
        Write an engaging Twitter thread (8-10 tweets) summarizing this technical article in an approachable and human-like style.

        Title: {title}

        Introduction Context:
        {intro}

        Technical Details:
        {technical}

        Key Takeaways:
        {conclusion}

        Guidelines:
        1. Start with a hook that grabs attention (e.g., a surprising fact, bold statement, or thought-provoking question).
        2. Use a conversational tone and explain complex details simply, without jargon.
        3. Keep tweets concise and under 280 characters, following the 1/n numbering format.
        4. Break down the key insights logically, and make each tweet build curiosity for the next one.
        5. Include relevant examples, analogies, or comparisons to aid understanding.
        6. End the thread with a strong conclusion and a call to action (e.g., "Read the full article," "Follow for more insights").
        7. Make it relatable, educational, and engaging.

        Output format:
        - A numbered list of tweets, with each tweet on a new line.
        - After the tweets, suggest 3-5 hashtags that summarize the thread, starting with #.
        """
    )

    chain = LLMChain(llm=self.llm, prompt=thread_prompt)
    result = chain.run({
        "title": article.title,
        "intro": "\n".join(intro_chunks),
        "technical": "\n".join(technical_chunks),
        "conclusion": "\n".join(conclusion_chunks)
    })

    # Parse the result into tweets and hashtags
    lines = result.split("\n")
    tweets = [line.strip() for line in lines if line.strip() and not line.strip().startswith("#")]
    hashtags = [tag.strip() for tag in lines if tag.strip().startswith("#")]

    # Ensure we have at least one tweet and one hashtag
    if not tweets:
        tweets = ["Thread about " + article.title]
    if not hashtags:
        hashtags = ["#AI", "#TechNews"]

    return TwitterThread(tweets=tweets, hashtags=hashtags)
Let's walk through what happens in the code above, step by step:
- Retrieve Relevant Chunks: The method first extracts relevant chunks of text for the introduction, technical details, and conclusion using the get_relevant_chunks method.
- Prepare a Prompt: A PromptTemplate is created with instructions to write an engaging Twitter thread summarizing the article, including details on tone, structure, and formatting guidelines.
- Run the LLM Chain: LLMChain combines the prompt with the LLM to generate a thread based on the article's title and the extracted chunks.
- Parse the Results: The generated output is split into tweets and hashtags, ensuring proper formatting and extracting the required parts (see the small example after this list).
- Return the Twitter Thread: The method returns a TwitterThread object containing the formatted tweets and hashtags.
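To illustrate the parsing step, here is roughly how a raw model response would be split into tweets and hashtags (the response text below is made up and assumes the model follows the requested output format):
# Made-up model output to illustrate the parsing logic.
result = """1/3 Ever wished your blog posts could turn themselves into Twitter threads?
2/3 We combine Gemini-2.0 with ChromaDB retrieval to pull the most relevant chunks.
3/3 Read the full article for the complete walkthrough!
#AI #LLM #Automation"""

lines = result.split("\n")
tweets = [line.strip() for line in lines if line.strip() and not line.strip().startswith("#")]
hashtags = [tag.strip() for tag in lines if tag.strip().startswith("#")]
print(tweets)    # the three numbered tweets
print(hashtags)  # ['#AI #LLM #Automation']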
Processing the Article
This method processes a PDF file to extract its content, generates a Twitter thread summarizing it, and finally returns a TwitterThread.
def process_article(self, pdf_path: str) -> TwitterThread:
    """Main method to process an article and generate content"""
    try:
        article = self.process_pdf(pdf_path)
        thread = self.generate_twitter_thread(article)
        return thread
    except Exception as e:
        print(f"Error processing article: {str(e)}")
        raise
Up to this point we have implemented all the code the project needs. From here, there are two ways to proceed:
- Implementing the main file for testing, and
- Implementing the Streamlit application for the web interface.
If you do not want to test the application in terminal mode, you can skip the main file and go straight to the Streamlit application.
Implementing the Main File for Testing
Now we put all the modules together to test the application.
# main.py
import os
from dotenv import load_dotenv

from services import ContentRepurposer


def main():
    # Load environment variables
    load_dotenv()
    google_api_key = os.getenv("GOOGLE_API_KEY")
    if not google_api_key:
        raise ValueError("GOOGLE_API_KEY environment variable not found")

    # Initialize repurposer
    repurposer = ContentRepurposer()

    # Path to your local PDF
    # pdf_path = "data/guide_to_jax.pdf"
    pdf_path = "data/build_llm_powered_app.pdf"

    try:
        thread = repurposer.process_article(pdf_path)

        print("Generated Twitter Thread:")
        for i, tweet in enumerate(thread.tweets, 1):
            print(f"\nTweet {i}/{len(thread.tweets)}:")
            print(tweet)

        print("\nSuggested Hashtags:")
        print(" ".join(thread.hashtags))
    except Exception as e:
        print(f"Failed to process article: {str(e)}")


if __name__ == "__main__":
    main()
As you can see, this simply imports all the modules, checks that GOOGLE_API_KEY is available, instantiates the ContentRepurposer class, and then, inside the try block, creates a thread by calling the process_article() method on the repurposer object. Finally, it prints the tweets to the terminal and handles any exceptions.
To test the application, create a folder named data in your project root and put your downloaded PDF there. To download an article from Analytics Vidhya, open any article, click the download button, and save the PDF.
Now, in your terminal:
python main.py
Example Blog 1 Output
Example Blog 2 Output
I think you get an idea of how useful the application is! Now let's make it more aesthetically appealing.
Implementing the Streamlit App
Now we will do much the same as above, in a more UI-centric way.
Importing Libraries and Environment Configuration
# app.py
import os
import streamlit as st
from dotenv import load_dotenv
from services import ContentRepurposer
import pyperclip
from pathlib import Path

# Load environment variables
load_dotenv()

# Set page configuration
st.set_page_config(page_title="Content Repurposer", page_icon="🐦", layout="wide")
Custom CSS
# Custom CSS
st.markdown(
    """
    <style>
    .tweet-box {
        background-color: #181211;
        border: 1px solid #e1e8ed;
        border-radius: 10px;
        padding: 15px;
        margin: 10px 0;
    }
    .copy-button {
        background-color: #1DA1F2;
        color: white;
        border: none;
        border-radius: 5px;
        padding: 5px 10px;
        cursor: pointer;
    }
    .main-header {
        color: #1DA1F2;
        text-align: center;
    }
    .hashtag {
        color: #1DA1F2;
        background-color: #E8F5FE;
        padding: 5px 10px;
        border-radius: 15px;
        margin: 5px;
        display: inline-block;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
Here we add some CSS styling for the page elements (tweet boxes, copy buttons, hashtags). If CSS is unfamiliar to you, W3Schools is a good reference.
Some Important Functions
def create_temp_pdf(uploaded_file):
    """Create a temporary PDF file from the uploaded content"""
    temp_dir = Path("temp")
    temp_dir.mkdir(exist_ok=True)

    temp_path = temp_dir / "uploaded_pdf.pdf"
    with open(temp_path, "wb") as f:
        f.write(uploaded_file.getvalue())
    return str(temp_path)


def initialize_session_state():
    """Initialize session state variables"""
    if "tweets" not in st.session_state:
        st.session_state.tweets = None
    if "hashtags" not in st.session_state:
        st.session_state.hashtags = None


def copy_text_and_show_success(text, success_key):
    """Copy text to the clipboard and show a success message"""
    try:
        pyperclip.copy(text)
        st.success("Copied to clipboard!", icon="✅")
    except Exception as e:
        st.error(f"Failed to copy: {str(e)}")
Here, the create_temp_pdf() function creates a temp directory in the project folder and stores the uploaded PDF there for the duration of processing.
The initialize_session_state() function checks whether the tweets and hashtags already exist in the Streamlit session state and initializes them if not.
The copy_text_and_show_success() function uses the pyperclip library to copy the tweets or hashtags to the clipboard and show that the copy was successful.
Main Function
def main():
    initialize_session_state()

    # Header
    st.markdown(
        '<h1 class="main-header">📄 Content to Twitter Thread 🐦</h1>',
        unsafe_allow_html=True,
    )

    # Create two columns for layout
    col1, col2 = st.columns([1, 1])

    with col1:
        st.markdown("### Upload PDF")
        uploaded_file = st.file_uploader("Drop your PDF here", type=["pdf"])

        if uploaded_file:
            st.success("PDF uploaded successfully!")

            if st.button("Generate Twitter Thread", key="generate"):
                with st.spinner("Generating Twitter thread..."):
                    try:
                        # Get Google API key
                        google_api_key = os.getenv("GOOGLE_API_KEY")
                        if not google_api_key:
                            st.error(
                                "Google API key not found. Please check your .env file."
                            )
                            return

                        # Save uploaded file
                        pdf_path = create_temp_pdf(uploaded_file)

                        # Process PDF and generate thread
                        repurposer = ContentRepurposer()
                        thread = repurposer.process_article(pdf_path)

                        # Store results in session state
                        st.session_state.tweets = thread.tweets
                        st.session_state.hashtags = thread.hashtags

                        # Clean up temporary file
                        os.remove(pdf_path)
                    except Exception as e:
                        st.error(f"Error generating thread: {str(e)}")

    with col2:
        if st.session_state.tweets:
            st.markdown("### Generated Twitter Thread")

            # Copy entire thread section
            st.markdown("#### Copy Full Thread")
            all_tweets = "\n\n".join(st.session_state.tweets)
            if st.button("📋 Copy Entire Thread"):
                copy_text_and_show_success(all_tweets, "thread")

            # Display individual tweets
            st.markdown("#### Individual Tweets")
            for i, tweet in enumerate(st.session_state.tweets, 1):
                tweet_col1, tweet_col2 = st.columns([4, 1])
                with tweet_col1:
                    st.markdown(
                        f"""
                        <div class="tweet-box">
                            <p>{tweet}</p>
                        </div>
                        """,
                        unsafe_allow_html=True,
                    )
                with tweet_col2:
                    if st.button("📋", key=f"tweet_{i}"):
                        copy_text_and_show_success(tweet, f"tweet_{i}")

            # Display hashtags
            if st.session_state.hashtags:
                st.markdown("### Suggested Hashtags")

                # Display hashtags with a copy button
                hashtags_text = " ".join(st.session_state.hashtags)
                hashtags_col1, hashtags_col2 = st.columns([4, 1])
                with hashtags_col1:
                    hashtags_html = " ".join(
                        [
                            f'<span class="hashtag">{hashtag}</span>'
                            for hashtag in st.session_state.hashtags
                        ]
                    )
                    st.markdown(hashtags_html, unsafe_allow_html=True)
                with hashtags_col2:
                    if st.button("📋 Copy Tags"):
                        copy_text_and_show_success(hashtags_text, "hashtags")


if __name__ == "__main__":
    main()
If you read this code closely, you will see that Streamlit creates two columns: one for the PDF uploader and the other for displaying the generated tweets.
In the first column, we do much the same as in the earlier main.py, with some extra markdown and buttons for uploading the PDF and generating the thread through the Streamlit objects.
In the second column, Streamlit iterates over the generated thread, puts each tweet in a tweet box with its own copy button, and finally shows all the hashtags along with their copy button.
Now the fun part!
Open your terminal and type:
streamlit run app.py
If everything is set up correctly, it will open a Streamlit application in your default browser.
Now drag and drop your downloaded PDF onto the upload box; it will automatically upload the PDF to the system. Then click the Generate Twitter Thread button to generate the tweets.
You can copy the full thread or individual tweets using the respective copy buttons.
I hope hands-on projects like this help you learn many practical concepts in generative AI, Python libraries, and programming. Happy coding, and stay healthy.
All the code used in this article is available here.
Conclusion
This project demonstrates the power of combining modern AI technologies to automate content repurposing. By leveraging Gemini-2.0 and ChromaDB, we have created a system that not only saves time but also maintains high-quality output. The modular architecture ensures easy maintenance and extensibility, while the Streamlit interface makes the tool accessible to non-technical users.
Key Takeaways
- The project demonstrates a successful integration of cutting-edge AI tools for practical content automation.
- The architecture's modularity allows for easy maintenance and future enhancements, making it a sustainable solution for content repurposing.
- The Streamlit interface makes the tool accessible to content creators without technical expertise, bridging the gap between complex AI technology and practical usage.
- The implementation can handle various content types and volumes, making it suitable for both individual content creators and large organizations.
Frequently Asked Questions
Q. How does the application handle long articles?
A. The system uses RecursiveCharacterTextSplitter to break long articles into manageable chunks, which are then embedded and stored in ChromaDB. When generating threads, it retrieves the most relevant chunks using similarity search.
Q. Which temperature setting is used for generation, and can it be changed?
A. We used a temperature of 0.7, which provides a good balance between creativity and coherence. You can adjust this setting to your needs: higher values (>0.7) produce more creative output, while lower values (<0.7) produce more focused content.
Q. How does the system keep tweets within the 280-character limit?
A. The prompt template explicitly specifies the 280-character limit, and the LLM is instructed to respect this constraint. You can also add validation to enforce compliance programmatically; a small sketch follows.
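A minimal sketch of such a check (not part of the original implementation) could look like this:
# Optional post-generation length check (a sketch, not in the project code).
MAX_TWEET_LENGTH = 280

def enforce_tweet_length(tweets: list) -> list:
    """Truncate any tweet that exceeds the 280-character limit."""
    return [t if len(t) <= MAX_TWEET_LENGTH else t[:MAX_TWEET_LENGTH - 1] + "…" for t in tweets]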
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.