Context
Gitours is an end-to-end pipeline that clones a GitHub repository, performs static analysis to build a relationship map of variables, functions, and classes, then prompts OpenAI's API to produce a VS Code CodeTour JSON file. Users drag the generated .tour into their local clone and immediately step through a guided walkthrough of important files and design decisions. Built as a Georgia Tech CS4675/6675 course project to make onboarding into small and medium open-source projects less painful.
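For context, a .tour file is plain JSON that the VS Code CodeTour extension loads from the repository. A minimal sketch of the structure, written with the standard json module (the title, files, and step text here are illustrative, not actual Gitours output), looks like this:

```python
import json

# Illustrative CodeTour structure: a title plus ordered steps, each pointing
# at a file and line with a description shown in the walkthrough pane.
tour = {
    "title": "Repository walkthrough",
    "steps": [
        {"file": "main.py", "line": 1,
         "description": "Entry point: parses the repo URL and kicks off the pipeline."},
        {"file": "itemizer.py", "line": 10,
         "description": "Builds the cross-file symbol map used to order the tour."},
    ],
}

with open("example.tour", "w") as f:
    json.dump(tour, f, indent=2)
```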
What I built
- Static itemizer builds cross-file symbol maps: it tracks which files define functions/classes and where they're imported or called, giving the LLM architectural context beyond per-file summaries (see the sketch after this list).
- Prompt engineering abstraction in clone_summary.py isolates LLM interaction, making it easy to iterate on prompt templates without touching the analysis or output layers.
- Dual interface (CLI + web UI) supports both power users and non-technical stakeholders who want to explore repos visually.
- Designed for safe execution: the gitRepo class handles temp-directory cleanup even if errors occur mid-generation.
- Frontend evolution from v0.1 (skeleton with example JSON) to v0.2 (live backend integration) shows an incremental development methodology.
- Prompting experiments documented in PROMPTING_SUMMARY.txt provide reproducibility and a template iteration history.
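To illustrate the idea behind the itemizer, a minimal sketch using Python's standard ast module might record definitions and call sites like this (a simplification for illustration, not the actual itemizer.py, which also covers variables and imports):

```python
import ast
from collections import defaultdict
from pathlib import Path

def build_symbol_map(repo_root: str) -> dict:
    """Map each function/class name to where it is defined and where it is called.

    Minimal sketch: Python files only, name-based matching, no scope analysis.
    """
    symbols = defaultdict(lambda: {"defined_in": [], "called_in": []})
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                symbols[node.name]["defined_in"].append((str(path), node.lineno))
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                symbols[node.func.id]["called_in"].append((str(path), node.lineno))
    return dict(symbols)
```

A map like this is what lets the prompt describe real call relationships rather than a flat list of files.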
Results
- Generates structured .tour files in ~3 minutes for small-to-medium repos, eliminating the need to hand-write onboarding docs
- Static itemization engine tracks where symbols are defined and used, letting the LLM follow real execution flows instead of alphabetical file lists
- Dual interface: CLI (main.py) for quick generation and a React + Flask web UI for non-technical users (a backend sketch follows this list)
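As a sketch of how the Flask backend can expose the pipeline over HTTP (the /generate route and the generate_tour helper are assumptions for illustration, not the actual backend.py):

```python
from flask import Flask, jsonify, request

from pipeline import generate_tour  # hypothetical wrapper around the Gitours pipeline

app = Flask(__name__)

@app.route("/generate", methods=["POST"])  # assumed route name
def generate():
    # Expects a JSON body like {"repo_url": "https://github.com/user/repo"}.
    repo_url = (request.get_json(silent=True) or {}).get("repo_url")
    if not repo_url:
        return jsonify({"error": "repo_url is required"}), 400
    tour = generate_tour(repo_url)  # returns the .tour dict
    return jsonify(tour)

if __name__ == "__main__":
    app.run(port=5000)
```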
Problem
Fast-moving open-source projects lack up-to-date onboarding documentation. New contributors waste hours reverse-engineering architecture from file explorers and README files. Mentors can't scale manual walkthroughs across dozens of repos. Existing solutions either require hand-written tours or produce generic file summaries without execution-flow context.
Approach
Built a modular Python pipeline with five core components:
- repo_data.py: the gitRepo class clones the target repo into a temp directory and handles cleanup (sketch below)
- itemizer.py: traverses the codebase to map relationships between variables, method calls, definitions, and usage sites, producing a structured reference map
- clone_summary.py: centralizes prompt design and OpenAI API calls, packaging the repo map into prompts that ask the LLM to describe the architecture and suggest a tour step sequence
- codetours.py: converts LLM responses into valid CodeTour .tour JSON with file paths and line numbers
- helpers.py: URL validation and cleaning utilities
The frontend is a single-page React app (src/app/page.js) where users paste a repo URL and download the generated tour; a Flask backend (backend.py) exposes the pipeline as an HTTP API. Requires OPENAI_API_KEY in .env; optimized for small-to-medium repos (generation takes ~3 minutes depending on size).
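A minimal sketch of the clone-and-cleanup pattern described for gitRepo, using a context manager so the temp directory is removed even when the pipeline raises mid-generation (the class and prefix names are assumptions, not the actual repo_data.py):

```python
import shutil
import subprocess
import tempfile

class ClonedRepo:
    """Clone a repo into a temp directory and guarantee cleanup on exit.

    Sketch of the pattern described for gitRepo; not the actual class.
    """

    def __init__(self, url: str):
        self.url = url
        self.path = None

    def __enter__(self) -> str:
        self.path = tempfile.mkdtemp(prefix="gitours_")
        try:
            # Shallow clone keeps generation fast for small-to-medium repos.
            subprocess.run(["git", "clone", "--depth", "1", self.url, self.path], check=True)
        except Exception:
            shutil.rmtree(self.path, ignore_errors=True)
            raise
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Runs even if analysis or the OpenAI call failed partway through.
        shutil.rmtree(self.path, ignore_errors=True)
        return False  # propagate any exception

# Usage: with ClonedRepo("https://github.com/user/repo") as repo_path: ...
```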
What I learned
The pipeline successfully generates working CodeTour files for repositories across multiple languages. The itemization step's relationship tracking lets the LLM produce tours that follow logical code paths (e.g., entry point → core classes → helper utilities) rather than arbitrary orderings. The web UI (v0.2) now calls the live Flask backend and returns real .tour data instead of placeholder JSON. CLI mode (python main.py) writes temp_output_codetour.tour for immediate use. The project demonstrates the feasibility of LLM-assisted developer tooling when paired with robust static analysis.
Links