Skip to content

feat: implement S3 integration for storing and retrieving digest files #427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 26, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions .docker/minio/setup.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
#!/bin/sh

# Simple script to set up MinIO bucket and user
# Based on example from MinIO issues

# Format bucket name to ensure compatibility
BUCKET_NAME=$(echo "${S3_BUCKET_NAME}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')

# Configure MinIO client
mc alias set myminio http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}

# Remove bucket if it exists (for clean setup)
mc rm -r --force myminio/${BUCKET_NAME} || true

# Create bucket
mc mb myminio/${BUCKET_NAME}

# Set bucket policy to allow downloads
mc anonymous set download myminio/${BUCKET_NAME}

# Create user with access and secret keys
mc admin user add myminio ${S3_ACCESS_KEY} ${S3_SECRET_KEY} || echo "User already exists"

# Create policy for the bucket
echo '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:*"],"Resource":["arn:aws:s3:::'${BUCKET_NAME}'/*","arn:aws:s3:::'${BUCKET_NAME}'"]}]}' > /tmp/policy.json

# Apply policy
mc admin policy create myminio gitingest-policy /tmp/policy.json || echo "Policy already exists"
mc admin policy attach myminio gitingest-policy --user ${S3_ACCESS_KEY}

echo "MinIO setup completed successfully"
echo "Bucket: ${BUCKET_NAME}"
echo "Access via console: http://localhost:9001"
23 changes: 23 additions & 0 deletions .env.example
Original file line number Diff line number Diff line change
Expand Up @@ -33,3 +33,26 @@ GITINGEST_SENTRY_PROFILE_LIFECYCLE=trace
GITINGEST_SENTRY_SEND_DEFAULT_PII=true
# Environment name for Sentry (default: "")
GITINGEST_SENTRY_ENVIRONMENT=development

# MinIO Configuration (for development)
# Root user credentials for MinIO admin access
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin

# S3 Configuration (for application)
# Set to "true" to enable S3 storage for digests
# S3_ENABLED=true
# Endpoint URL for the S3 service (MinIO in development)
S3_ENDPOINT=http://minio:9000
# Access key for the S3 bucket (created automatically in development)
S3_ACCESS_KEY=gitingest
# Secret key for the S3 bucket (created automatically in development)
S3_SECRET_KEY=gitingest123
# Name of the S3 bucket (created automatically in development)
S3_BUCKET_NAME=gitingest-bucket
# Region for the S3 bucket (default for MinIO)
S3_REGION=us-east-1
# Public URL/CDN for accessing S3 resources
S3_ALIAS_HOST=127.0.0.1:9000/gitingest-bucket
# Optional prefix for S3 file paths (if set, prefixes all S3 paths with this value)
# S3_DIRECTORY_PREFIX=my-prefix
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ repos:
files: ^src/
additional_dependencies:
[
boto3>=1.28.0,
click>=8.0.0,
'fastapi[standard]>=0.109.1',
httpx,
Expand All @@ -138,6 +139,7 @@ repos:
- --rcfile=tests/.pylintrc
additional_dependencies:
[
boto3>=1.28.0,
click>=8.0.0,
'fastapi[standard]>=0.109.1',
httpx,
Expand Down
85 changes: 85 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -204,6 +204,8 @@ This is because Jupyter notebooks are asynchronous by default.

## 🐳 Self-host

### Using Docker

1. Build the image:

``` bash
Expand Down Expand Up @@ -239,6 +241,89 @@ The application can be configured using the following environment variables:
- **GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE**: Sampling rate for profile sessions (default: "1.0", range: 0.0-1.0)
- **GITINGEST_SENTRY_PROFILE_LIFECYCLE**: Profile lifecycle mode (default: "trace")
- **GITINGEST_SENTRY_SEND_DEFAULT_PII**: Send default personally identifiable information (default: "true")
- **S3_ALIAS_HOST**: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")
- **S3_DIRECTORY_PREFIX**: Optional prefix for S3 file paths (if set, prefixes all S3 paths with this value)

### Using Docker Compose

The project includes a `compose.yml` file that allows you to easily run the application in both development and production environments.

#### Compose File Structure

The `compose.yml` file uses YAML anchoring with `&app-base` and `<<: *app-base` to define common configuration that is shared between services:

```yaml
# Common base configuration for all services
x-app-base: &app-base
build:
context: .
dockerfile: Dockerfile
ports:
- "${APP_WEB_BIND:-8000}:8000" # Main application port
- "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
# ... other common configurations
```

#### Services

The file defines three services:

1. **app**: Production service configuration
- Uses the `prod` profile
- Sets the Sentry environment to "production"
- Configured for stable operation with `restart: unless-stopped`

2. **app-dev**: Development service configuration
- Uses the `dev` profile
- Enables debug mode
- Mounts the source code for live development
- Uses hot reloading for faster development

3. **minio**: S3-compatible object storage for development
- Uses the `dev` profile (only available in development mode)
- Provides S3-compatible storage for local development
- Accessible via:
- API: Port 9000 ([localhost:9000](http://localhost:9000))
- Web Console: Port 9001 ([localhost:9001](http://localhost:9001))
- Default admin credentials:
- Username: `minioadmin`
- Password: `minioadmin`
- Configurable via environment variables:
- `MINIO_ROOT_USER`: Custom admin username (default: minioadmin)
- `MINIO_ROOT_PASSWORD`: Custom admin password (default: minioadmin)
- Includes persistent storage via Docker volume
- Auto-creates a bucket and application-specific credentials:
- Bucket name: `gitingest-bucket` (configurable via `S3_BUCKET_NAME`)
- Access key: `gitingest` (configurable via `S3_ACCESS_KEY`)
- Secret key: `gitingest123` (configurable via `S3_SECRET_KEY`)
- These credentials are automatically passed to the app-dev service via environment variables:
- `S3_ENDPOINT`: URL of the MinIO server
- `S3_ACCESS_KEY`: Access key for the S3 bucket
- `S3_SECRET_KEY`: Secret key for the S3 bucket
- `S3_BUCKET_NAME`: Name of the S3 bucket
- `S3_REGION`: Region for the S3 bucket (default: us-east-1)
- `S3_ALIAS_HOST`: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")

#### Usage Examples

To run the application in development mode:

```bash
docker compose --profile dev up
```

To run the application in production mode:

```bash
docker compose --profile prod up -d
```

To build and run the application:

```bash
docker compose --profile prod build
docker compose --profile prod up -d
```

## 🤝 Contributing

Expand Down
111 changes: 111 additions & 0 deletions compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# Common base configuration for all services
x-app-base: &app-base
ports:
- "${APP_WEB_BIND:-8000}:8000" # Main application port
- "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
environment:
# Python Configuration
- PYTHONUNBUFFERED=1
- PYTHONDONTWRITEBYTECODE=1
# Host Configuration
- ALLOWED_HOSTS=${ALLOWED_HOSTS:-gitingest.com,*.gitingest.com,localhost,127.0.0.1}
# Metrics Configuration
- GITINGEST_METRICS_ENABLED=${GITINGEST_METRICS_ENABLED:-true}
- GITINGEST_METRICS_HOST=${GITINGEST_METRICS_HOST:-127.0.0.1}
- GITINGEST_METRICS_PORT=${GITINGEST_METRICS_PORT:-9090}
# Sentry Configuration
- GITINGEST_SENTRY_ENABLED=${GITINGEST_SENTRY_ENABLED:-false}
- GITINGEST_SENTRY_DSN=${GITINGEST_SENTRY_DSN:-}
- GITINGEST_SENTRY_TRACES_SAMPLE_RATE=${GITINGEST_SENTRY_TRACES_SAMPLE_RATE:-1.0}
- GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE=${GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE:-1.0}
- GITINGEST_SENTRY_PROFILE_LIFECYCLE=${GITINGEST_SENTRY_PROFILE_LIFECYCLE:-trace}
- GITINGEST_SENTRY_SEND_DEFAULT_PII=${GITINGEST_SENTRY_SEND_DEFAULT_PII:-true}
user: "1000:1000"
command: ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000"]

services:
# Production service configuration
app:
<<: *app-base
image: ghcr.io/coderamp-labs/gitingest:latest
profiles:
- prod
environment:
- GITINGEST_SENTRY_ENVIRONMENT=${GITINGEST_SENTRY_ENVIRONMENT:-production}
restart: unless-stopped

# Development service configuration
app-dev:
<<: *app-base
build:
context: .
dockerfile: Dockerfile
profiles:
- dev
environment:
- DEBUG=true
- GITINGEST_SENTRY_ENVIRONMENT=${GITINGEST_SENTRY_ENVIRONMENT:-development}
# S3 Configuration
- S3_ENABLED=true
- S3_ENDPOINT=http://minio:9000
- S3_ACCESS_KEY=${S3_ACCESS_KEY:-gitingest}
- S3_SECRET_KEY=${S3_SECRET_KEY:-gitingest123}
# Use lowercase bucket name to ensure compatibility with MinIO
- S3_BUCKET_NAME=${S3_BUCKET_NAME:-gitingest-bucket}
- S3_REGION=${S3_REGION:-us-east-1}
- S3_DIRECTORY_PREFIX=${S3_DIRECTORY_PREFIX:-dev}
# Public URL for S3 resources
- S3_ALIAS_HOST=${S3_ALIAS_HOST:-http://127.0.0.1:9000/${S3_BUCKET_NAME:-gitingest-bucket}}
volumes:
# Mount source code for live development
- ./src:/app:ro
# Use --reload flag for hot reloading during development
command: ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
depends_on:
minio-setup:
condition: service_completed_successfully

# MinIO S3-compatible object storage for development
minio:
image: minio/minio:latest
profiles:
- dev
ports:
- "9000:9000" # API port
- "9001:9001" # Console port
environment:
- MINIO_ROOT_USER=${MINIO_ROOT_USER:-minioadmin}
- MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-minioadmin}
volumes:
- minio-data:/data
command: server /data --console-address ":9001"
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
interval: 30s
timeout: 30s
start_period: 30s
start_interval: 1s

# MinIO setup service to create bucket and user
minio-setup:
image: minio/mc
profiles:
- dev
depends_on:
minio:
condition: service_healthy
environment:
- MINIO_ROOT_USER=${MINIO_ROOT_USER:-minioadmin}
- MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-minioadmin}
- S3_ACCESS_KEY=${S3_ACCESS_KEY:-gitingest}
- S3_SECRET_KEY=${S3_SECRET_KEY:-gitingest123}
- S3_BUCKET_NAME=${S3_BUCKET_NAME:-gitingest-bucket}
volumes:
- ./.docker/minio/setup.sh:/setup.sh:ro
entrypoint: sh
command: -c /setup.sh

volumes:
minio-data:
driver: local
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ dev = [
]

server = [
"boto3>=1.28.0", # AWS SDK for S3 support
"fastapi[standard]>=0.109.1", # Minimum safe release (https://osv.dev/vulnerability/PYSEC-2024-38)
"prometheus-client",
"sentry-sdk[fastapi]",
Expand Down
1 change: 1 addition & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
boto3>=1.28.0 # AWS SDK for S3 support
click>=8.0.0
fastapi[standard]>=0.109.1 # Vulnerable to https://osv.dev/vulnerability/PYSEC-2024-38
httpx
Expand Down
6 changes: 3 additions & 3 deletions src/gitingest/query_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,9 @@ async def parse_remote_repo(source: str, token: str | None = None) -> IngestionQ
host = parsed_url.netloc
user, repo = _get_user_and_repo_from_path(parsed_url.path)

_id = str(uuid.uuid4())
_id = uuid.uuid4()
slug = f"{user}-{repo}"
local_path = TMP_BASE_PATH / _id / slug
local_path = TMP_BASE_PATH / str(_id) / slug
url = f"https://{host}/{user}/{repo}"

query = IngestionQuery(
Expand Down Expand Up @@ -132,7 +132,7 @@ def parse_local_dir_path(path_str: str) -> IngestionQuery:
"""
path_obj = Path(path_str).resolve()
slug = path_obj.name if path_str == "." else path_str.strip("/")
return IngestionQuery(local_path=path_obj, slug=slug, id=str(uuid.uuid4()))
return IngestionQuery(local_path=path_obj, slug=slug, id=uuid.uuid4())


async def _configure_branch_or_tag(
Expand Down
8 changes: 6 additions & 2 deletions src/gitingest/schemas/ingestion.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
from __future__ import annotations

from pathlib import Path # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)
from uuid import UUID # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)

from pydantic import BaseModel, Field

Expand All @@ -27,7 +28,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
The URL of the repository.
slug : str
The slug of the repository.
id : str
id : UUID
The ID of the repository.
subpath : str
The subpath to the repository or file (default: ``"/"``).
Expand All @@ -47,6 +48,8 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
The patterns to include.
include_submodules : bool
Whether to include all Git submodules within the repository. (default: ``False``)
s3_url : str | None
The S3 URL where the digest is stored if S3 is enabled.

"""

Expand All @@ -56,7 +59,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
local_path: Path
url: str | None = None
slug: str
id: str
id: UUID
subpath: str = Field(default="/")
type: str | None = None
branch: str | None = None
Expand All @@ -66,6 +69,7 @@ class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
ignore_patterns: set[str] = Field(default_factory=set) # TODO: ssame type for ignore_* and include_* patterns
include_patterns: set[str] | None = None
include_submodules: bool = Field(default=False)
s3_url: str | None = None

def extract_clone_config(self) -> CloneConfig:
"""Extract the relevant fields for the CloneConfig object.
Expand Down
6 changes: 3 additions & 3 deletions src/server/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,8 +71,8 @@ class IngestSuccessResponse(BaseModel):
Short form of repository URL (user/repo).
summary : str
Summary of the ingestion process including token estimates.
ingest_id : str
Ingestion id used to download full context.
digest_url : str
URL to download the full digest content (either S3 URL or local download endpoint).
tree : str
File tree structure of the repository.
content : str
Expand All @@ -89,7 +89,7 @@ class IngestSuccessResponse(BaseModel):
repo_url: str = Field(..., description="Original repository URL")
short_repo_url: str = Field(..., description="Short repository URL (user/repo)")
summary: str = Field(..., description="Ingestion summary with token estimates")
ingest_id: str = Field(..., description="Ingestion id used to download full context")
digest_url: str = Field(..., description="URL to download the full digest content")
tree: str = Field(..., description="File tree structure")
content: str = Field(..., description="Processed file content")
default_max_file_size: int = Field(..., description="File size slider position used")
Expand Down
Loading
Loading