Interactive visualization and exploration of scientific papers from the LAION open science dataset.
This project is a collaboration between Inference.net and LAION. LAION curated the original dataset of roughly 100 million scraped scientific and research articles, and Inference.net fine-tuned a custom model to extract structured summaries from them. This repo contains a visual explorer for a small subset of the extracted dataset.
View the live explorer at https://laion.inference.net.
A web application for exploring scientific papers with semantic embeddings, dimensionality reduction, and clustering visualizations.
- Frontend: React + TypeScript + Vite with D3.js for interactive visualizations
- Backend: Python FastAPI serving data from SQLite (D1 in production)
- Storage: SQLite locally, Cloudflare D1 + R2 in production
You'll need the following tools installed:
- Python 3.11+
- uv (Python dependency management)
- bun (JavaScript runtime)
- Task (task runner)
Install all dependencies:
task setup
This will install both backend and frontend dependencies.
Download the database from R2:
task db:setup
This will download the SQLite database to backend/data/db.sqlite.
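To confirm that the downloaded file is a readable SQLite database, you can list its tables with Python's built-in sqlite3 module. This is a minimal sketch, not part of the repo: the helper name list_tables is hypothetical, and it makes no assumption about which tables the database contains.

```python
import sqlite3
from pathlib import Path

# Path taken from the setup step; adjust if you run this from another directory.
DB_PATH = Path("backend/data/db.sqlite")


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted by name."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # sqlite3.connect() would silently create an empty file, so check first.
    if DB_PATH.exists():
        print(list_tables(DB_PATH))
    else:
        print(f"{DB_PATH} not found; run `task db:setup` first")
```

Run it from the repository root so the relative path resolves.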
Run the backend and frontend in separate terminals:
Backend (Terminal 1):
task backend:dev
Frontend (Terminal 2):
task frontend:dev
The application will be available at:
- Frontend: http://localhost:5173
- API: http://localhost:8787
- API Docs: http://localhost:8787/docs
The code for the data pipeline that we used to construct this dataset is not yet open source, mostly because it was set up for a one-time run and is not production-ready.
However, the general process was:
- Initial data extraction and filtering
  - Ran a pipeline to generate the summaries
  - Excluded specific non-scientific content and failed summaries
  - Compiled results for further processing
- Semantic Embedding
  - Generated 768-dimensional embeddings using SPECTER2 (allenai/specter2_base)
  - Processed papers in batches with GPU acceleration support
  - Stored embeddings as binary blobs for similarity search
- Visualization & Clustering
  - Reduced embeddings to 2D coordinates using UMAP with cosine distance
  - Applied K-Means clustering, choosing the number of clusters (between 20 and 60) by silhouette score
  - Generated initial cluster labels using TF-IDF analysis of titles and fields
- LLM-Curated Labels
  - Applied manually reviewed, domain-specific cluster labels
  - Improved interpretability over the automated TF-IDF labels
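The "binary blobs" step can be illustrated with the standard library alone. The sketch below packs a vector into bytes, stores it in SQLite, reads it back, and compares two vectors by cosine similarity. The little-endian float32 layout, the table and column names, and the helper names are all assumptions made for illustration; the actual storage format used by the pipeline is not documented here.

```python
import math
import sqlite3
import struct


def to_blob(vec):
    """Pack a sequence of floats as little-endian float32 bytes (assumed layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def from_blob(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical table; the real schema may differ.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE embeddings (paper_id INTEGER PRIMARY KEY, vector BLOB)")
    conn.execute("INSERT INTO embeddings VALUES (?, ?)", (1, to_blob([1.0, 0.0, 0.5])))
    conn.execute("INSERT INTO embeddings VALUES (?, ?)", (2, to_blob([0.5, 0.0, 1.0])))

    # Read the blobs back and compare the two stored vectors.
    rows = conn.execute("SELECT vector FROM embeddings ORDER BY paper_id").fetchall()
    first, second = (from_blob(blob) for (blob,) in rows)
    print(cosine_similarity(first, second))
    conn.close()
```

With this layout a 768-dimensional embedding occupies 3072 bytes per paper. For a large collection, a vectorized library would be a more practical choice than pure-Python loops.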
Deploy to Cloudflare:
task deploy
This will prompt you to deploy the backend API and/or frontend.
We welcome contributions to this project! Here's what you should know:
Bug Fixes & Minor Improvements
- Bug fixes are always welcome! Please submit a PR with a clear description of the issue and fix.
- Minor improvements to documentation, code quality, or performance are appreciated.
New Features
- This project is intentionally scoped as a one-time preview of this dataset.
- We are generally not planning to greatly expand the functionality beyond its current scope.
- If you want to add significant new features, we encourage you to fork the project and build on it!
Before Submitting a PR
- Ensure your code passes linting and formatting checks:
task check
- Keep changes focused and well-documented.
- Test your changes with sample data when applicable.
MIT License - see LICENSE file for details.