Commit 6eb0e7b

feat: implement S3 integration for storing and retrieving digest files
- Add utility functions for S3 configuration, URL generation, and file uploads.
- Enhance ingestion flow to optionally upload digests to S3 if enabled.
- Modify API endpoints to redirect downloads to S3 if files are stored there.
- Extend `IngestResponse` schema to include S3 URL when applicable.
- Introduce `get_current_commit_hash` utility to retrieve commit SHA in ingestion.
- add Docker Compose configuration for dev/prod environments with documented usage details
- integrate MinIO S3-compatible storage for local development, including bucket auto-setup and app credentials
- add S3 storage toggle, test service in Docker Compose, and boto3 dependency
- enforce UUID type for ingest_id, resolve comments
- Implement `JSONFormatter` and methods for structured logging.
- Integrate logging into S3 client creation, uploads, and URL lookups.
- Enhance logging with extra fields for better traceability.
- add optional S3 directory prefix support
- remove unused test service from Docker Compose configuration
- improve `get_s3_config` to handle optional environment variables more robustly
- add centralized JSON logging and integrate into S3 utilities

Co-authored-by: Filip Christiansen <22807962+filipchristiansen@users.noreply.github.com>
1 parent 998cea1 commit 6eb0e7b
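
The commit message mentions a `JSONFormatter` for structured logging, but its source is not among the hunks shown on this page. As a rough sketch only, a formatter of that kind could look like the following. The payload keys and the `EXTRA_FIELDS` whitelist are assumptions made for illustration, not the project's actual implementation.

```python
import json
import logging

# Hypothetical: names of `extra=` fields that get copied into the JSON payload.
EXTRA_FIELDS = ("bucket_name", "s3_key", "ingest_id")


class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` end up as attributes on the record.
        for name in EXTRA_FIELDS:
            if hasattr(record, name):
                payload[name] = str(getattr(record, name))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

Attaching it with `handler.setFormatter(JSONFormatter())` would make every record on that handler a one-line JSON document, which is the usual motivation for this pattern: log aggregators can index the extra fields directly.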

17 files changed: +795 -30 lines changed


.docker/minio/setup.sh

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+#!/bin/sh
+
+# Simple script to set up MinIO bucket and user
+# Based on example from MinIO issues
+
+# Format bucket name to ensure compatibility
+BUCKET_NAME=$(echo "${S3_BUCKET_NAME}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
+
+# Configure MinIO client
+mc alias set myminio http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}
+
+# Remove bucket if it exists (for clean setup)
+mc rm -r --force myminio/${BUCKET_NAME} || true
+
+# Create bucket
+mc mb myminio/${BUCKET_NAME}
+
+# Set bucket policy to allow downloads
+mc anonymous set download myminio/${BUCKET_NAME}
+
+# Create user with access and secret keys
+mc admin user add myminio ${S3_ACCESS_KEY} ${S3_SECRET_KEY} || echo "User already exists"
+
+# Create policy for the bucket
+echo '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:*"],"Resource":["arn:aws:s3:::'${BUCKET_NAME}'/*","arn:aws:s3:::'${BUCKET_NAME}'"]}]}' > /tmp/policy.json
+
+# Apply policy
+mc admin policy create myminio gitingest-policy /tmp/policy.json || echo "Policy already exists"
+mc admin policy attach myminio gitingest-policy --user ${S3_ACCESS_KEY}
+
+echo "MinIO setup completed successfully"
+echo "Bucket: ${BUCKET_NAME}"
+echo "Access via console: http://localhost:9001"
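
For readers less familiar with `tr` and inline JSON in shell, the two data transformations in the script above can be restated in Python. This is an illustrative sketch, not code from the commit: the function names are made up, and `str.lower()` is only equivalent to `tr '[:upper:]' '[:lower:]'` for ASCII bucket names.

```python
import json


def normalize_bucket_name(name: str) -> str:
    """Lowercase the name and turn underscores into hyphens, like the two `tr` calls."""
    return name.lower().replace("_", "-")


def bucket_policy(bucket: str) -> str:
    """Build the same policy document that the script writes to /tmp/policy.json."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:*"],
                # One ARN for the objects in the bucket, one for the bucket itself.
                "Resource": [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"],
            }
        ],
    }
    return json.dumps(policy)
```

Building the document with `json.dumps` avoids the quote splicing that the shell one-liner needs to interpolate `${BUCKET_NAME}`.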

.env.example

Lines changed: 23 additions & 0 deletions
@@ -33,3 +33,26 @@ GITINGEST_SENTRY_PROFILE_LIFECYCLE=trace
 GITINGEST_SENTRY_SEND_DEFAULT_PII=true
 # Environment name for Sentry (default: "")
 GITINGEST_SENTRY_ENVIRONMENT=development
+
+# MinIO Configuration (for development)
+# Root user credentials for MinIO admin access
+MINIO_ROOT_USER=minioadmin
+MINIO_ROOT_PASSWORD=minioadmin
+
+# S3 Configuration (for application)
+# Set to "true" to enable S3 storage for digests
+# S3_ENABLED=true
+# Endpoint URL for the S3 service (MinIO in development)
+S3_ENDPOINT=http://minio:9000
+# Access key for the S3 bucket (created automatically in development)
+S3_ACCESS_KEY=gitingest
+# Secret key for the S3 bucket (created automatically in development)
+S3_SECRET_KEY=gitingest123
+# Name of the S3 bucket (created automatically in development)
+S3_BUCKET_NAME=gitingest-bucket
+# Region for the S3 bucket (default for MinIO)
+S3_REGION=us-east-1
+# Public URL/CDN for accessing S3 resources
+S3_ALIAS_HOST=127.0.0.1:9000/gitingest-bucket
+# Optional prefix for S3 file paths (if set, prefixes all S3 paths with this value)
+# S3_DIRECTORY_PREFIX=my-prefix
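
The variables above could be consumed along the following lines. The commit message refers to a `get_s3_config` helper, but its body is not shown in these hunks, so the sketch below uses a different, hypothetical name (`load_s3_settings`) and an assumed return shape. Only the variable names and the `us-east-1` default come from the file above.

```python
import os
from typing import Dict, Mapping, Optional


def load_s3_settings(env: Optional[Mapping[str, str]] = None) -> Optional[Dict[str, str]]:
    """Return S3 settings from the environment, or None when S3 is not enabled."""
    env = os.environ if env is None else env
    # S3 storage is opt-in: anything other than "true" leaves it disabled.
    if env.get("S3_ENABLED", "false").lower() != "true":
        return None
    settings = {
        "endpoint": env.get("S3_ENDPOINT"),
        "access_key": env.get("S3_ACCESS_KEY"),
        "secret_key": env.get("S3_SECRET_KEY"),
        "bucket_name": env.get("S3_BUCKET_NAME"),
        "region": env.get("S3_REGION", "us-east-1"),
        "alias_host": env.get("S3_ALIAS_HOST"),
        "directory_prefix": env.get("S3_DIRECTORY_PREFIX"),
    }
    # Drop unset or empty values so optional variables do not appear as None.
    return {key: value for key, value in settings.items() if value}
```

Accepting the environment as a parameter keeps the function easy to exercise without touching `os.environ`.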

.github/workflows/codeql.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        language: ["javascript", "python"]
+        language: ["javascript", "python", "actions", "javascript-typescript"]
         # CodeQL supports [ $supported-codeql-languages ]
         # Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support

.pre-commit-config.yaml

Lines changed: 2 additions & 0 deletions
@@ -113,6 +113,7 @@ repos:
         files: ^src/
         additional_dependencies:
           [
+            boto3>=1.28.0,
             click>=8.0.0,
             'fastapi[standard]>=0.109.1',
             httpx,
@@ -138,6 +139,7 @@ repos:
           - --rcfile=tests/.pylintrc
         additional_dependencies:
           [
+            boto3>=1.28.0,
             click>=8.0.0,
             'fastapi[standard]>=0.109.1',
             httpx,

README.md

Lines changed: 85 additions & 0 deletions
@@ -204,6 +204,8 @@ This is because Jupyter notebooks are asynchronous by default.
 
 ## 🐳 Self-host
 
+### Using Docker
+
 1. Build the image:
 
    ``` bash
@@ -239,6 +241,89 @@ The application can be configured using the following environment variables:
 - **GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE**: Sampling rate for profile sessions (default: "1.0", range: 0.0-1.0)
 - **GITINGEST_SENTRY_PROFILE_LIFECYCLE**: Profile lifecycle mode (default: "trace")
 - **GITINGEST_SENTRY_SEND_DEFAULT_PII**: Send default personally identifiable information (default: "true")
+- **S3_ALIAS_HOST**: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")
+- **S3_DIRECTORY_PREFIX**: Optional prefix for S3 file paths (if set, prefixes all S3 paths with this value)
+
+### Using Docker Compose
+
+The project includes a `compose.yml` file that allows you to easily run the application in both development and production environments.
+
+#### Compose File Structure
+
+The `compose.yml` file uses YAML anchoring with `&app-base` and `<<: *app-base` to define common configuration that is shared between services:
+
+```yaml
+# Common base configuration for all services
+x-app-base: &app-base
+  build:
+    context: .
+    dockerfile: Dockerfile
+  ports:
+    - "${APP_WEB_BIND:-8000}:8000" # Main application port
+    - "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
+  # ... other common configurations
+```
+
+#### Services
+
+The file defines three services:
+
+1. **app**: Production service configuration
+   - Uses the `prod` profile
+   - Sets the Sentry environment to "production"
+   - Configured for stable operation with `restart: unless-stopped`
+
+2. **app-dev**: Development service configuration
+   - Uses the `dev` profile
+   - Enables debug mode
+   - Mounts the source code for live development
+   - Uses hot reloading for faster development
+
+3. **minio**: S3-compatible object storage for development
+   - Uses the `dev` profile (only available in development mode)
+   - Provides S3-compatible storage for local development
+   - Accessible via:
+     - API: Port 9000 ([localhost:9000](http://localhost:9000))
+     - Web Console: Port 9001 ([localhost:9001](http://localhost:9001))
+   - Default admin credentials:
+     - Username: `minioadmin`
+     - Password: `minioadmin`
+   - Configurable via environment variables:
+     - `MINIO_ROOT_USER`: Custom admin username (default: minioadmin)
+     - `MINIO_ROOT_PASSWORD`: Custom admin password (default: minioadmin)
+   - Includes persistent storage via Docker volume
+   - Auto-creates a bucket and application-specific credentials:
+     - Bucket name: `gitingest-bucket` (configurable via `S3_BUCKET_NAME`)
+     - Access key: `gitingest` (configurable via `S3_ACCESS_KEY`)
+     - Secret key: `gitingest123` (configurable via `S3_SECRET_KEY`)
+   - These credentials are automatically passed to the app-dev service via environment variables:
+     - `S3_ENDPOINT`: URL of the MinIO server
+     - `S3_ACCESS_KEY`: Access key for the S3 bucket
+     - `S3_SECRET_KEY`: Secret key for the S3 bucket
+     - `S3_BUCKET_NAME`: Name of the S3 bucket
+     - `S3_REGION`: Region for the S3 bucket (default: us-east-1)
+     - `S3_ALIAS_HOST`: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")
+
+#### Usage Examples
+
+To run the application in development mode:
+
+```bash
+docker compose --profile dev up
+```
+
+To run the application in production mode:
+
+```bash
+docker compose --profile prod up -d
+```
+
+To build and run the application:
+
+```bash
+docker compose --profile prod build
+docker compose --profile prod up -d
+```
 
 ## 🤝 Contributing
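
The README text describes `S3_ALIAS_HOST` as the public base for stored digests and `S3_DIRECTORY_PREFIX` as an optional path prefix, without showing how the two combine. One plausible combination is sketched below. The function name, the `<ingest_id>.txt` object key, and the fallback to `http://` when the alias has no scheme are all assumptions for illustration; the commit's actual URL generation code is not part of this page.

```python
from typing import Optional
from uuid import UUID


def public_digest_url(alias_host: str, ingest_id: UUID, prefix: Optional[str] = None) -> str:
    """Join alias host, optional prefix, and a hypothetical object key into one URL."""
    key = f"{ingest_id}.txt"  # assumed key layout, not confirmed by the diff
    if prefix:
        key = f"{prefix.strip('/')}/{key}"
    # The documented default alias carries no scheme, so add one when it is missing.
    base = alias_host if "://" in alias_host else f"http://{alias_host}"
    return f"{base.rstrip('/')}/{key}"
```

With the documented default alias and a prefix of `my-prefix`, the result would be a URL under `http://127.0.0.1:9000/gitingest-bucket/my-prefix/`.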

compose.yml

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Common base configuration for all services
+x-app-base: &app-base
+  ports:
+    - "${APP_WEB_BIND:-8000}:8000" # Main application port
+    - "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
+  environment:
+    # Python Configuration
+    - PYTHONUNBUFFERED=1
+    - PYTHONDONTWRITEBYTECODE=1
+    # Host Configuration
+    - ALLOWED_HOSTS=${ALLOWED_HOSTS:-gitingest.com,*.gitingest.com,localhost,127.0.0.1}
+    # Metrics Configuration
+    - GITINGEST_METRICS_ENABLED=${GITINGEST_METRICS_ENABLED:-true}
+    - GITINGEST_METRICS_HOST=${GITINGEST_METRICS_HOST:-127.0.0.1}
+    - GITINGEST_METRICS_PORT=${GITINGEST_METRICS_PORT:-9090}
+    # Sentry Configuration
+    - GITINGEST_SENTRY_ENABLED=${GITINGEST_SENTRY_ENABLED:-false}
+    - GITINGEST_SENTRY_DSN=${GITINGEST_SENTRY_DSN:-}
+    - GITINGEST_SENTRY_TRACES_SAMPLE_RATE=${GITINGEST_SENTRY_TRACES_SAMPLE_RATE:-1.0}
+    - GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE=${GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE:-1.0}
+    - GITINGEST_SENTRY_PROFILE_LIFECYCLE=${GITINGEST_SENTRY_PROFILE_LIFECYCLE:-trace}
+    - GITINGEST_SENTRY_SEND_DEFAULT_PII=${GITINGEST_SENTRY_SEND_DEFAULT_PII:-true}
+  user: "1000:1000"
+  command: ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000"]
+
+services:
+  # Production service configuration
+  app:
+    <<: *app-base
+    image: ghcr.io/coderamp-labs/gitingest:latest
+    profiles:
+      - prod
+    environment:
+      - GITINGEST_SENTRY_ENVIRONMENT=${GITINGEST_SENTRY_ENVIRONMENT:-production}
+    restart: unless-stopped
+
+  # Development service configuration
+  app-dev:
+    <<: *app-base
+    build:
+      context: .
+      dockerfile: Dockerfile
+    profiles:
+      - dev
+    environment:
+      - DEBUG=true
+      - GITINGEST_SENTRY_ENVIRONMENT=${GITINGEST_SENTRY_ENVIRONMENT:-development}
+      # S3 Configuration
+      - S3_ENABLED=true
+      - S3_ENDPOINT=http://minio:9000
+      - S3_ACCESS_KEY=${S3_ACCESS_KEY:-gitingest}
+      - S3_SECRET_KEY=${S3_SECRET_KEY:-gitingest123}
+      # Use lowercase bucket name to ensure compatibility with MinIO
+      - S3_BUCKET_NAME=${S3_BUCKET_NAME:-gitingest-bucket}
+      - S3_REGION=${S3_REGION:-us-east-1}
+      # Public URL for S3 resources
+      - S3_ALIAS_HOST=${S3_ALIAS_HOST:-http://127.0.0.1:9000/${S3_BUCKET_NAME:-gitingest-bucket}}
+    volumes:
+      # Mount source code for live development
+      - ./src:/app:ro
+    # Use --reload flag for hot reloading during development
+    command: ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
+    depends_on:
+      minio-setup:
+        condition: service_completed_successfully
+
+  # MinIO S3-compatible object storage for development
+  minio:
+    image: minio/minio:latest
+    profiles:
+      - dev
+    ports:
+      - "9000:9000" # API port
+      - "9001:9001" # Console port
+    environment:
+      - MINIO_ROOT_USER=${MINIO_ROOT_USER:-minioadmin}
+      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-minioadmin}
+    volumes:
+      - minio-data:/data
+    command: server /data --console-address ":9001"
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
+      interval: 30s
+      timeout: 30s
+      start_period: 30s
+      start_interval: 1s
+
+  # MinIO setup service to create bucket and user
+  minio-setup:
+    image: minio/mc
+    profiles:
+      - dev
+    depends_on:
+      minio:
+        condition: service_healthy
+    environment:
+      - MINIO_ROOT_USER=${MINIO_ROOT_USER:-minioadmin}
+      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-minioadmin}
+      - S3_ACCESS_KEY=${S3_ACCESS_KEY:-gitingest}
+      - S3_SECRET_KEY=${S3_SECRET_KEY:-gitingest123}
+      - S3_BUCKET_NAME=${S3_BUCKET_NAME:-gitingest-bucket}
+    volumes:
+      - ./.docker/minio/setup.sh:/setup.sh:ro
+    entrypoint: sh
+    command: -c /setup.sh
+
+volumes:
+  minio-data:
+    driver: local
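
Most values in this file use Compose's `${VAR:-default}` interpolation, which falls back to the default when the variable is unset or empty, and the `S3_ALIAS_HOST` entry nests one such expression inside another. The small Python sketch below mimics that resolution order so the nested default is easier to follow. The helper names are invented for this illustration and nothing here is part of the commit.

```python
from typing import Mapping


def resolve(env: Mapping[str, str], name: str, default: str = "") -> str:
    """Mimic `${name:-default}`: use the default when the variable is unset or empty."""
    value = env.get(name)
    return value if value else default


def alias_host(env: Mapping[str, str]) -> str:
    """Resolve the nested default used for S3_ALIAS_HOST in the app-dev service."""
    # The inner expression is resolved first and then spliced into the outer default.
    bucket = resolve(env, "S3_BUCKET_NAME", "gitingest-bucket")
    return resolve(env, "S3_ALIAS_HOST", f"http://127.0.0.1:9000/{bucket}")
```

A custom `S3_BUCKET_NAME` therefore changes the default alias as well, while an explicit `S3_ALIAS_HOST` is used as given and ignores the bucket name.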

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,6 @@ description="CLI tool to analyze and create text dumps of codebases for LLMs"
 readme = {file = "README.md", content-type = "text/markdown" }
 requires-python = ">= 3.8"
 dependencies = [
-    "click>=8.0.0",
     "httpx",
     "pathspec>=0.12.1",
     "pydantic",
@@ -44,6 +43,7 @@ dev = [
 ]
 
 server = [
+    "boto3>=1.28.0", # AWS SDK for S3 support
     "fastapi[standard]>=0.109.1", # Minimum safe release (https://osv.dev/vulnerability/PYSEC-2024-38)
     "prometheus-client",
     "sentry-sdk[fastapi]",

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+boto3>=1.28.0 # AWS SDK for S3 support
 click>=8.0.0
 fastapi[standard]>=0.109.1 # Vulnerable to https://osv.dev/vulnerability/PYSEC-2024-38
 httpx

src/gitingest/query_parser.py

Lines changed: 3 additions & 3 deletions
@@ -44,9 +44,9 @@ async def parse_remote_repo(source: str, token: str | None = None) -> IngestionQ
     host = parsed_url.netloc
     user, repo = _get_user_and_repo_from_path(parsed_url.path)
 
-    _id = str(uuid.uuid4())
+    _id = uuid.uuid4()
     slug = f"{user}-{repo}"
-    local_path = TMP_BASE_PATH / _id / slug
+    local_path = TMP_BASE_PATH / str(_id) / slug
     url = f"https://{host}/{user}/{repo}"
 
     query = IngestionQuery(
@@ -132,7 +132,7 @@ def parse_local_dir_path(path_str: str) -> IngestionQuery:
     """
     path_obj = Path(path_str).resolve()
     slug = path_obj.name if path_str == "." else path_str.strip("/")
-    return IngestionQuery(local_path=path_obj, slug=slug, id=str(uuid.uuid4()))
+    return IngestionQuery(local_path=path_obj, slug=slug, id=uuid.uuid4())
 
 
 async def _configure_branch_or_tag(
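
The `str(_id)` added in the first hunk is needed because `pathlib` only joins strings and path-like objects, and a `UUID` is neither. The sketch below shows both sides of that change. `TMP_BASE_PATH` is given a placeholder value here and the helper names are hypothetical.

```python
import uuid
from pathlib import Path

TMP_BASE_PATH = Path("/tmp/gitingest")  # placeholder value for this sketch


def build_local_path(ingest_id: uuid.UUID, slug: str) -> Path:
    """Join the base path, the UUID rendered as a string, and the slug."""
    return TMP_BASE_PATH / str(ingest_id) / slug


def can_join_directly(ingest_id: uuid.UUID) -> bool:
    """Report whether a UUID can be used with the `/` operator without str()."""
    try:
        TMP_BASE_PATH / ingest_id  # type: ignore[operator]
    except TypeError:
        return False
    return True
```

Keeping the identifier as a `UUID` and converting only at the point where a path segment is needed leaves the type information intact for the rest of the ingestion flow.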

src/gitingest/schemas/ingestion.py

Lines changed: 6 additions & 2 deletions
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 from pathlib import Path  # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)
+from uuid import UUID  # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)
 
 from pydantic import BaseModel, Field
 
@@ -27,7 +28,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
         The URL of the repository.
     slug : str
         The slug of the repository.
-    id : str
+    id : UUID
         The ID of the repository.
     subpath : str
         The subpath to the repository or file (default: ``"/"``).
@@ -47,6 +48,8 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
         The patterns to include.
     include_submodules : bool
         Whether to include all Git submodules within the repository. (default: ``False``)
+    s3_url : str | None
+        The S3 URL where the digest is stored if S3 is enabled.
 
     """
 
@@ -56,7 +59,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
     local_path: Path
     url: str | None = None
     slug: str
-    id: str
+    id: UUID
     subpath: str = Field(default="/")
     type: str | None = None
     branch: str | None = None
@@ -66,6 +69,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
     ignore_patterns: set[str] = Field(default_factory=set) # TODO: ssame type for ignore_* and include_* patterns
     include_patterns: set[str] | None = None
     include_submodules: bool = Field(default=False)
+    s3_url: str | None = None
 
     def extract_clone_config(self) -> CloneConfig:
         """Extract the relevant fields for the CloneConfig object.

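
Typing `id` as `UUID` rather than `str` means a malformed identifier is rejected when the model is built instead of surfacing later in a file path or object key. The standard-library sketch below shows the underlying check in isolation. `parse_ingest_id` is a hypothetical helper name, and the claim that this is the motivation for the change is an inference from the commit message ("enforce UUID type for ingest_id"), not something the diff states.

```python
from typing import Optional
from uuid import UUID


def parse_ingest_id(raw: str) -> Optional[UUID]:
    """Return a UUID for well-formed input and None for anything else."""
    try:
        return UUID(raw)
    except ValueError:
        # Raised for strings that are not 32 hex digits after normalisation.
        return None
```

Note that `uuid.UUID` also accepts a few alternative spellings, such as the 32 hex digits without hyphens, so callers that need one canonical form should compare or store `str(parsed)` rather than the raw input.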