From Code to Paper: Using GPT Models and Python to Generate Scientific LaTeX Documents | by Peder Ward | Dec, 2024

Automating scientific code documentation: a GPT-powered POC for streamlined workflows.

Illustration image. Generated by ChatGPT.

Working on scientific papers often involves translating algorithms into scientific formulas, typically formatted in LaTeX. This process can be tedious and time-consuming, especially in large projects, as it requires constant back-and-forth between the code repository and the LaTeX document.

While working on a large repository of algorithms, I began exploring ways to streamline this workflow. My motivation arose from the inefficiency of manually converting complex algorithms into LaTeX-compatible formulas. A particular challenge was ensuring consistency across multiple documents, especially in projects where formulas required frequent updates. This led me to explore how automation could take over repetitive tasks while improving accuracy.

For the remainder of this document, I will use the terms “algorithm” and “scientific code” interchangeably. All images in this article, except for the cover image, were created by the author.

My goal was to go from scientific code to a comprehensive document that introduces the purpose of the code, defines variables, presents scientific formulas, includes a generated example plot, and demonstrates the calculations for a specific example. The document would follow a predefined framework, combining static and dynamic elements to ensure both consistency and adaptability.

The framework I designed has the following structure:

  1. Front Page
    A visually appealing cover with key details such as the title and author.
  2. Table of Contents
    Automatically generated to provide an overview of the document’s content.
  3. Brief Description of the Document
    An introduction outlining the purpose and scope of the document.
  4. Algorithms
    A section dedicated to documenting each algorithm in detail. For each algorithm, the following subsections are included:
    Introduction: A brief overview of the algorithm’s purpose and context.
    Variables: A clear definition of all variables used in the algorithm.
    Formulas: A presentation of the key formulas derived from the algorithm.
    Example: A worked example to illustrate the algorithm’s application, complete with a generated plot.
    Code: The corresponding code snippet to support reproducibility.

This structure was designed to adapt dynamically to the number of algorithms being documented, ensuring a consistent and professional presentation regardless of the document’s size or complexity.

To achieve this goal, a well-organized repository was essential for a scalable and efficient solution. The algorithm calculations were grouped into a dedicated folder, with files named using a consistent snake_case convention that matched the algorithm names.

To ensure clarity and support reuse, initial values for examples and the generated plots were stored in separate folders. These folders followed the same naming convention as the algorithms but with distinct suffixes to differentiate their purpose. This structure ensured that all components were easy to find and consistent with the overall framework of the project.

At the core of this project is the use of GPT models to automate the conversion of algorithms into LaTeX. GPT’s strength lies in its ability to interpret the structure of generic, variable-rich code and transform it into human-readable explanations and precisely formatted scientific formulas. This automation significantly reduces the manual effort required and helps keep documents accurate and consistent.

For this project, I use OpenAI’s GPT-4o model, known for its ability to understand and generate structured content. To interact with OpenAI’s API, you must have an OPENAI_KEY set in your environment. Below is a simple Python function I use to fetch responses from the GPT model:

import os
from openai import OpenAI
from dotenv import load_dotenv

def ask_chat_gpt(prompt):
    # Load variables from a local .env file into the environment
    load_dotenv()
    api_key = os.getenv("OPENAI_KEY") or exit("API key missing")
    client = OpenAI(api_key=api_key)
    # Send the prompt as a single user message
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    # Return only the text of the first completion
    return response.choices[0].message.content

Overview of What the Code Does
This code automates the generation of structured LaTeX documentation for Python algorithms, complete with examples, plots, and Python code listings. Here’s an overview:

Flowchart of how the LaTeX document is dynamically built. Made by author.

Prompt Creation for GPT

This section describes custom functions designed to generate detailed prompts for GPT, enabling the automated creation of LaTeX documentation:

  • make_algo_doc_gpt_prompt: A function that creates prompts instructing GPT to generate LaTeX sections, including introductions, variable descriptions, formulas, and example subsections.
  • make_algo_example_gpt_prompt: A function that generates prompts for creating LaTeX example sections, incorporating plots and example calculations.

Document Generation

These functions are responsible for processing the GPT-generated content and saving it as LaTeX files:

  • make_algo_doc: A function that uses GPT outputs to generate LaTeX documentation for each algorithm and saves it as a .tex file.
  • make_algo_example: A function that creates .tex files for example sections, including plots and example calculations.
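
The article does not show the body of make_algo_doc, so the following is only a minimal sketch of what such a function might look like. The names make_algo_doc_sketch and strip_code_fences, the injected ask and build_prompt callables, and the default folder names are assumptions made for illustration; the fence stripping is a defensive step added here, not something the source describes.

```python
from pathlib import Path
from typing import Callable


def strip_code_fences(text: str) -> str:
    """Drop a leading ```latex / ``` line and a trailing ``` line, if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines).strip() + "\n"


def make_algo_doc_sketch(
    algorithm_name: str,
    ask: Callable[[str], str],           # e.g. ask_chat_gpt
    build_prompt: Callable[[str], str],  # turns source code into a prompt
    code_dir: str = "python_code",       # assumed folder layout
    output_dir: str = "algo_docs",
) -> Path:
    """Read one algorithm file, query the model, and save the reply as .tex."""
    source = Path(code_dir, f"{algorithm_name}.py").read_text(encoding="utf-8")
    latex = strip_code_fences(ask(build_prompt(source)))
    out_path = Path(output_dir, f"{algorithm_name}_doc.tex")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(latex, encoding="utf-8")
    return out_path
```

Passing the model call in as a parameter keeps the function testable without network access: a stub that returns a fixed string can stand in for the API.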

LaTeX Assembly

  • Uses the pylatex library to programmatically create a full LaTeX document.
  • Adds a title page, metadata, and a table of contents.
  • Includes an introduction section with an overview of the algorithms and their purpose.
  • Creates a chapter for each algorithm with sections from make_algo_doc and make_algo_example, example plots, and Python code listings.

import os
import subprocess

from pylatex import Chapter, Document, Figure, NoEscape, Subsection

# make_algo_doc and make_algo_example are the project functions described above

# Create and structure the LaTeX document programmatically
doc = Document(documentclass="report")

# Include preamble and metadata
doc.preamble.append(NoEscape(r'\input{algo_docs/init.tex}'))  # Custom preamble
doc.append(NoEscape(r'\input{algo_docs/title_page.tex}'))  # Title page
doc.append(NoEscape(r'\tableofcontents'))  # Table of contents

# Add Introduction chapter
with doc.create(Chapter('Introduction')):
    doc.append(
        'This document provides an overview of various algorithms, exploring their design, analysis, and application in computational problem-solving. '
        'The aim is to facilitate understanding of their mechanisms and significance across different domains.'
    )

# Add Algorithms chapter
with doc.create(Chapter('Algorithms')):
    doc.append(
        'This chapter presents detailed analyses of various algorithms, highlighting their theoretical foundations, use cases, and practical insights. '
        'Each algorithm is accompanied by examples and visualizations to illustrate its functionality and potential limitations.'
    )

    # Process each Python file in the 'python_code' directory
    python_code_dir = "python_code/"
    output_folder = "algo_docs/"
    plot_folder = "plots/"

    for filename in os.listdir(python_code_dir):
        if filename.endswith(".py"):  # Process only Python files
            algorithm_name = filename.replace(".py", "")
            formatted_name = algorithm_name.replace("_", " ").title()

            # Define paths for documentation files and plots
            document_path = os.path.join(output_folder, f"{algorithm_name}_doc.tex")
            example_path = os.path.join(output_folder, f"{algorithm_name}_example.tex")
            plot_path = os.path.join(plot_folder, f"{algorithm_name}_plot.png")
            python_code_path = os.path.join(python_code_dir, filename)

            print(f"Processing: {filename}")

            # Start a new page for each algorithm
            doc.append(NoEscape(r'\newpage'))

            # Generate documentation and example files with GPT
            make_algo_doc(algorithm_name)
            make_algo_example(algorithm_name)

            # Insert generated LaTeX sections
            doc.append(NoEscape(rf'\input{{{document_path}}}'))
            doc.append(NoEscape(rf'\input{{{example_path}}}'))

            # Insert plot directly after example subsection
            if os.path.exists(plot_path):
                with doc.create(Figure(position='H')) as figure:
                    figure.add_image(plot_path, width=NoEscape(r'\textwidth'))
                    figure.add_caption(f'Example plot for {formatted_name}.')

            # Add a subsection for the Python code listing
            with doc.create(Subsection('Code Listing')):
                doc.append(NoEscape(rf'\lstinputlisting[language=Python]{{{python_code_path}}}'))

            # Add a page break for clarity
            doc.append(NoEscape(r'\clearpage'))

# Generate the LaTeX file
tex_file = "programmatic_report"
doc.generate_tex(tex_file)

# Compile the LaTeX file to a PDF
subprocess.run(["pdflatex", f"{tex_file}.tex"])

PDF Compilation

  • The assembled document is saved and compiled into a polished PDF using pdflatex.

Simple front page.
Table of Contents.
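
One practical detail of this step: pdflatex writes the table of contents data on the first pass and reads it on the next, so a document with a table of contents generally needs at least two passes. The following is a small sketch of a compile helper under that assumption. The function name compile_pdf and the injectable runner parameter are hypothetical and not part of the original project.

```python
import subprocess
from typing import Callable, Sequence


def compile_pdf(
    tex_file: str,
    passes: int = 2,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run pdflatex several times so the table of contents is resolved."""
    command: Sequence[str] = [
        "pdflatex",
        "-interaction=nonstopmode",  # do not stop and wait for input on errors
        "-halt-on-error",            # exit at the first error
        f"{tex_file}.tex",
    ]
    for _ in range(passes):
        # check=True raises CalledProcessError on a non-zero exit code
        runner(command, check=True)
```

Calling compile_pdf("programmatic_report") would then replace the single subprocess.run call. Failing loudly on the first LaTeX error also makes malformed GPT output easier to spot.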

One of the most challenging aspects of this project was designing and refining the prompts used to interact with GPT. The success of the entire process depended on the quality of the GPT-generated output, making the creation of effective prompts a critical task that required extensive time and experimentation.

The prompts needed to strike a delicate balance:

  • Clarity: Precisely guiding GPT to produce structured LaTeX content, including sections, subsections, and mathematical equations, while leaving no ambiguity about the desired format.
  • Adaptability: Ensuring the prompts could handle a wide variety of algorithms, ranging from simple calculations to complex implementations.
  • Consistency: Achieving reliable, well-formatted, and accurate output, even for edge cases or unconventional code structures.

To address these challenges, I implemented dynamic prompting. This approach involves programmatically generating prompts tailored to the contents of each file. By providing GPT with relevant context and specific instructions, dynamic prompting made the output more accurate and contextually appropriate for the given algorithm.
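
As a rough illustration of dynamic prompting, the sketch below combines a fixed block of instructions with the contents of one source file, and builds a second prompt that reuses previously generated variable definitions. The function names, the shortened guideline text, and the prompt wording are assumptions made for this example; the project's actual prompt builders may look quite different.

```python
from pathlib import Path

# Static instructions shared by every prompt (shortened here for illustration).
DOC_GUIDELINES = (
    "Generate LaTeX code from the provided Python code.\n"
    "- Start with \\section{} for the algorithm title.\n"
    "- Output only LaTeX, without a preamble or code fences.\n"
)


def build_doc_prompt(code_path: str) -> str:
    """Combine the static guidelines with the contents of one source file."""
    path = Path(code_path)
    # Derive a readable title from the snake_case file name
    title = path.stem.replace("_", " ").title()
    source = path.read_text(encoding="utf-8")
    return (
        f"{DOC_GUIDELINES}\n"
        f"Algorithm title: {title}\n\n"
        f"Python code:\n{source}"
    )


def build_example_prompt(source: str, initial_values: str, variables_tex: str) -> str:
    """Ask for a worked example that reuses previously generated notation."""
    return (
        "Write a LaTeX \\subsection{Example} that applies the code below "
        "to the given initial values.\n"
        "Reuse exactly the symbols from these variable definitions:\n"
        f"{variables_tex}\n\n"
        f"Initial values:\n{initial_values}\n\n"
        f"Python code:\n{source}"
    )
```

The static part stays identical across algorithms, while the title, source code, and example values change per file.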

Through numerous iterations, the prompts evolved to become precise and flexible, forming the foundation of the automation process.
Example of a prompt for generating LaTeX code from an algorithm:

Generate LaTeX code from the provided Python code. Follow these guidelines:

1. **Document Structure**:
- Start with `\section{}` for the algorithm title.
- Add a `\subsection{Introduction}` for a brief overview of the algorithm.
- Include a `\subsection{Variables}` section that lists all variables with descriptions, using subscript notation (e.g., `v_{\text{earth}}`).
- Add a `\subsection{Formulas}` section presenting the code's logic as LaTeX formulas. Use subscripted symbols for variable names instead of copying Python variable names directly.

2. **Formatting Rules**:
- Ensure that the output includes **only** the LaTeX content, without `\documentclass`, `\usepackage`, `\begin{document}`, `\end{document}`, or any unrelated text.
- Do **not** include the triple backticks (e.g., ```latex or ```).
- Properly close all LaTeX environments (e.g., `\begin{align*}...\end{align*}`).
- Ensure all brackets, parentheses, and braces are matched correctly.
- Maintain consistent subscript notation for all variables.

3. **Important Notes**:
- **Do not** include any text or explanations outside the LaTeX code.
- Only the relevant LaTeX content for the `\section`, `\subsection`, `\begin{align*}`, and `\end{align*}` parts should be generated.
- Ensure no additional or unrelated LaTeX sections are added.

The following demonstrates how the Hohmann Transfer Orbit Calculation algorithm is documented using GPT-generated LaTeX code. This algorithm calculates the velocity changes (delta-v) required to transfer a spacecraft from Earth’s orbit to Mars’s orbit. Below is the Python implementation of the algorithm:

import numpy as np

def calculate_hohmann_transfer(earth_orbit_radius, mars_orbit_radius):
    # Standard gravitational parameter of the Sun (m^3/s^2)
    mu_sun = 1.32712440018e20

    # Orbital velocities of Earth and Mars
    v_earth = np.sqrt(mu_sun / earth_orbit_radius)
    v_mars = np.sqrt(mu_sun / mars_orbit_radius)

    # Semi-major axis of the transfer orbit
    transfer_orbit_semi_major_axis = (earth_orbit_radius + mars_orbit_radius) / 2

    # Transfer orbit velocities at Earth and Mars
    v_transfer_at_earth = np.sqrt(2 * mu_sun / earth_orbit_radius - mu_sun / transfer_orbit_semi_major_axis)
    v_transfer_at_mars = np.sqrt(2 * mu_sun / mars_orbit_radius - mu_sun / transfer_orbit_semi_major_axis)

    # Delta-v at Earth and Mars
    delta_v_earth = v_transfer_at_earth - v_earth
    delta_v_mars = v_mars - v_transfer_at_mars

    # Total delta-v for the transfer
    total_delta_v = abs(delta_v_earth) + abs(delta_v_mars)

    return delta_v_earth, delta_v_mars, total_delta_v

Using the GPT prompt with this code, I generated LaTeX subsections for the documentation. Below are the components created:

Introduction to the Algorithm
GPT generated a LaTeX explanation of the algorithm’s purpose, detailing how it calculates velocity changes for an efficient interplanetary transfer.

Introduction to the algorithm. LaTeX code generated by GPT model.

Variable Definitions
GPT provided a clear explanation of all variables used in the algorithm.

Variable definitions for the algorithm. LaTeX code generated by GPT model.

Formulas
The key formulas used in the algorithm were formatted into LaTeX by GPT.

Formulas used in the algorithm. LaTeX code generated by GPT model.
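
The generated formulas appear only as an image in the original article. As a rough illustration of what the prompt asks for, the Python function above translates by hand into the equations below. This is a manual reconstruction from the code, not GPT's actual output, and the symbols (r for the orbit radii, a for the semi-major axis) are chosen here for illustration:

```latex
\begin{align*}
v_{\text{earth}} &= \sqrt{\frac{\mu_{\text{sun}}}{r_{\text{earth}}}} \\
v_{\text{mars}} &= \sqrt{\frac{\mu_{\text{sun}}}{r_{\text{mars}}}} \\
a_{\text{transfer}} &= \frac{r_{\text{earth}} + r_{\text{mars}}}{2} \\
v_{\text{transfer,earth}} &= \sqrt{\frac{2\mu_{\text{sun}}}{r_{\text{earth}}} - \frac{\mu_{\text{sun}}}{a_{\text{transfer}}}} \\
v_{\text{transfer,mars}} &= \sqrt{\frac{2\mu_{\text{sun}}}{r_{\text{mars}}} - \frac{\mu_{\text{sun}}}{a_{\text{transfer}}}} \\
\Delta v_{\text{earth}} &= v_{\text{transfer,earth}} - v_{\text{earth}} \\
\Delta v_{\text{mars}} &= v_{\text{mars}} - v_{\text{transfer,mars}} \\
\Delta v_{\text{total}} &= \left|\Delta v_{\text{earth}}\right| + \left|\Delta v_{\text{mars}}\right|
\end{align*}
```

Each line corresponds to one assignment in the function, which is the kind of one-to-one mapping the "Formulas" instruction in the prompt aims for.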

Example Section
Using example values, GPT generated LaTeX code for a worked example.

Snippet of the example, using example values and the algorithm as input. LaTeX code generated by GPT model.
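
To give a feel for the numbers such a worked example contains, the sketch below recomputes the transfer with the standard library only. The orbit radii are assumed round values for the mean Earth and Mars orbits, not the initial values used in the article, and hohmann_delta_v is a condensed, hypothetical restatement of the function above.

```python
import math

MU_SUN = 1.32712440018e20  # m^3/s^2, same value as in the article's function


def hohmann_delta_v(r_start: float, r_end: float) -> tuple[float, float, float]:
    """Return (delta-v at departure, delta-v at arrival, total) in m/s."""
    a_transfer = (r_start + r_end) / 2
    v_start = math.sqrt(MU_SUN / r_start)
    v_end = math.sqrt(MU_SUN / r_end)
    # Vis-viva equation evaluated at both ends of the transfer ellipse
    v_transfer_start = math.sqrt(2 * MU_SUN / r_start - MU_SUN / a_transfer)
    v_transfer_end = math.sqrt(2 * MU_SUN / r_end - MU_SUN / a_transfer)
    dv_start = v_transfer_start - v_start
    dv_end = v_end - v_transfer_end
    return dv_start, dv_end, abs(dv_start) + abs(dv_end)


# Assumed mean orbital radii in metres (illustrative, not from the article)
R_EARTH = 1.496e11
R_MARS = 2.279e11

dv_earth, dv_mars, dv_total = hohmann_delta_v(R_EARTH, R_MARS)
print(f"delta-v at Earth: {dv_earth / 1000:.2f} km/s")
print(f"delta-v at Mars:  {dv_mars / 1000:.2f} km/s")
print(f"total delta-v:    {dv_total / 1000:.2f} km/s")
```

With these assumed radii the total comes out at about 5.6 km/s, which gives a quick sanity check for the values in a generated example.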

Plot Generation
A plot of the transfer orbit was generated using the example values and included in the LaTeX document.

Plot generated by the code and example values. Inserted in the LaTeX document.

Code Listing
The algorithm’s source code was appended at the end of the chapter for completeness.

Code listings at the end of the chapter (partial view).

Initial experiments with this approach have been promising. Using Python and GPT-4o, I successfully automated the conversion of several algorithms into LaTeX documents. The results of this proof of concept (POC) can be explored in my GitHub repository, where all parts of the project are available for review.

The repository includes the complete Python codebase, showcasing the custom functions used to generate LaTeX documentation and create GPT prompts. It also contains the detailed prompts themselves, illustrating how the system guides GPT in producing structured and accurate LaTeX content. Additionally, the repository features the final outputs, including both the LaTeX source files and the compiled PDF documents.

While the initial results were promising, the process came with challenges and valuable insights along the way:

  • Formatting Challenges: Occasionally, GPT would produce incorrect LaTeX formatting, leading to errors during the PDF conversion process. Although this issue was rare, I experimented with a solution: resubmitting the LaTeX code to GPT and asking it to fix the formatting. While this approach was consistently successful, it was not implemented as part of the workflow.
  • Code Comments: Adding clear comments within the code helped GPT understand the context better and generate more accurate LaTeX outputs.
  • Inconsistent Results: GPT occasionally produced differing outputs for the same code and prompt, underlining its inherent variability and the importance of careful testing.
  • Crafting Effective Prompts: Writing effective prompts was difficult. Overloading the prompt with too much detail, such as examples, often caused GPT to miss smaller elements such as formatting or structure. I found that breaking down instructions step by step and using very small, focused examples helped GPT perform better. Keeping prompts concise and structured with bullet points ensured that each key instruction was clearly understood and executed.
  • Domain-Specific Terminology: Fine-tuning GPT for specialized terms is an area requiring further improvement to enhance accuracy.
  • Variable Definitions: Keeping LaTeX variable definitions consistent between the algorithm and example sections was challenging. Adding GPT-generated variable definitions to later prompts helped maintain uniformity.
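
The resubmission idea from the first bullet was not part of the workflow, so the following is only a sketch of how it could be wired in. The checks are deliberately simple heuristics (code fences, brace balance, matching environments) rather than a real LaTeX parser, and latex_problems, generate_with_repair, and the repair prompt wording are all hypothetical.

```python
import re
from typing import Callable, List


def latex_problems(tex: str) -> List[str]:
    """Return a list of simple structural problems found in a LaTeX fragment."""
    problems = []
    if "```" in tex:
        problems.append("contains Markdown code fences")

    # Count unescaped braces; \{ and \} are literal characters in LaTeX.
    unescaped = re.sub(r"\\[{}]", "", tex)
    if unescaped.count("{") != unescaped.count("}"):
        problems.append("unbalanced braces")

    # Every \begin{env} must be closed by a matching \end{env}, in order.
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^}]*)\}", tex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            problems.append(f"unexpected \\end{{{env}}}")
    problems.extend(f"unclosed \\begin{{{env}}}" for env in stack)
    return problems


def generate_with_repair(
    prompt: str, ask: Callable[[str], str], max_attempts: int = 3
) -> str:
    """Query the model, then ask it to fix its own output while checks fail."""
    tex = ask(prompt)
    for _ in range(max_attempts - 1):
        problems = latex_problems(tex)
        if not problems:
            return tex
        tex = ask(
            "Fix the following LaTeX so that it compiles. Problems found: "
            + "; ".join(problems)
            + "\nReturn only the corrected LaTeX.\n\n"
            + tex
        )
    return tex
```

Capping the number of attempts keeps API usage bounded. Output that still fails after the last attempt is returned as-is, so it would still need a manual look.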

Despite its imperfections, the workflow has drastically reduced the time spent on documentation by automating much of the process. While minor reviews and adjustments are still needed, they represent only a fraction of the effort previously required. This proof of concept demonstrates the potential to generate polished documents without writing LaTeX manually, though further refinement is needed to improve consistency, scalability, and adaptability. The results so far highlight the significant promise of this approach.

  • Develop Validation Mechanisms
    Implement cross-referencing of generated formulas against known standards or benchmarks to ensure accuracy and consistency.
  • Expand Use Cases
    Test the workflow on larger, more diverse datasets to improve scalability and adaptability for various scientific domains.
  • Enhance Visual Documentation
    Incorporate additional visual elements, such as flowcharts, by using GPT to generate XML documents or similar formats.
  • Generate Plots and Examples with GPT
    Extend GPT’s functionality to create example plots directly, reducing the reliance on external plotting tools.
  • Experiment with Different GPT Models
    So far, I have primarily used GPT-4o due to its accessibility, but further research is needed to identify the optimal model for this task. Exploring models tailored for technical content or incorporating a Retrieval-Augmented Generation (RAG) approach with a database of diverse scientific papers could improve accuracy and relevance.
  • Transition from Proof of Concept (POC) to Minimum Viable Product (MVP)
    Evolve the project from a proof of concept to a minimum viable product by adding robust error handling, scalability features, and user-focused refinements.

This project has demonstrated the potential of GPT models to automate the creation of structured LaTeX documentation, significantly reducing the manual effort involved. It successfully generated professional-quality outputs, including formulas, plots, and structured examples. However, challenges such as inconsistent results, formatting issues, and variability in GPT’s output highlighted the need for refinement. Strategies like dynamic prompting, better code commenting, and iterative validation have helped address these issues, but some manual oversight remains necessary.

Despite these challenges, the workflow has shown clear benefits, streamlining the documentation process and saving considerable time. While the solution is not yet perfect, it represents a significant step toward automating complex documentation tasks, paving the way for future improvements in accuracy.

LinkedIn Profile — Peder Ward

GitHub — Peder Ward