Commit 0a4e02f

Merge branch 'main' into add-workflow

2 parents bedc5fa + 7696253

13 files changed: +456 −155 lines

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]

**Additional context**
Add any other context about the problem here.

.github/ISSUE_TEMPLATE/custom.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
---
name: Custom issue template
about: Describe this issue template's purpose here.
title: ''
labels: ''
assignees: ''

---

.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Dependency Review Action
#
# This Action will scan dependency manifest files that change as part of a Pull Request,
# surfacing known-vulnerable versions of the packages declared or updated in the PR.
# Once installed, if the workflow run is marked as required, PRs introducing known-vulnerable
# packages will be blocked from merging.
#
# Source repository: https://github.com/actions/dependency-review-action
# Public documentation: https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/about-dependency-review#dependency-review-enforcement
name: 'Dependency review'
on:
  pull_request:
    branches: [ "main" ]

# If using a dependency submission action in this workflow this permission will need to be set to:
#
# permissions:
#   contents: write
#
# https://docs.github.com/en/enterprise-cloud@latest/code-security/supply-chain-security/understanding-your-software-supply-chain/using-the-dependency-submission-api
permissions:
  contents: read
  # Write permissions for pull-requests are required for using the `comment-summary-in-pr` option, comment out if you aren't using this option
  pull-requests: write

jobs:
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - name: 'Checkout repository'
        uses: actions/checkout@v4
      - name: 'Dependency Review'
        uses: actions/dependency-review-action@v4
        # Commonly enabled options, see https://github.com/actions/dependency-review-action#configuration-options for all available options.
        with:
          comment-summary-in-pr: always
          # fail-on-severity: moderate
          # deny-licenses: GPL-1.0-or-later, LGPL-2.0-or-later
          # retry-on-snapshot-warnings: true

.github/workflows/pylint.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
on:
  push:
    paths:
      - 'scrapegraphai/**'
      - '.github/workflows/pylint.yml'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install the latest version of rye
        uses: eifinger/setup-rye@v3
      - name: Install dependencies
        run: rye sync --no-lock
      - name: Analysing the code with pylint
        run: rye run pylint-ci
      - name: Check Pylint score
        run: |
          pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
          if (( $(echo "$pylint_score < 8" | bc -l) )); then
            echo "Pylint score is below 8. Blocking commit."
            exit 1
          else
            echo "Pylint score is acceptable."
          fi
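
As a side note, the same score gate can be reproduced locally with pylint's programmatic API instead of parsing console output with grep/awk. The sketch below is illustrative only and is not part of this commit; it assumes pylint >= 2.12 and the `scrapegraphai/` package path configured in the workflow above.

```python
# Illustrative sketch (not part of this commit): enforce the same "score >= 8"
# gate with pylint's Python API instead of parsing console output.
import sys

from pylint.lint import Run

THRESHOLD = 8.0

# exit=False keeps pylint from calling sys.exit() itself so the score can be read.
results = Run(["scrapegraphai"], exit=False)
score = results.linter.stats.global_note  # pylint >= 2.12 exposes stats as an object

if score < THRESHOLD:
    print(f"Pylint score {score:.2f} is below {THRESHOLD}. Blocking commit.")
    sys.exit(1)
print(f"Pylint score {score:.2f} is acceptable.")
```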

README.md

Lines changed: 22 additions & 5 deletions
@@ -7,7 +7,7 @@ The generate schemas can be used to infer from document to use for tables in a d
 
 - **Entity Extraction**: Automatically identifies and extracts entities from PDF files.
 - **Schema Generation**: Constructs a schema based and structure of the extracted entities.
-- **Visualization**: Leverages Graphviz to visualize the extracted schema.
+- **Visualization**: Dynamic schema visualization
 
 ## Quick Start
 
@@ -16,23 +16,40 @@ The generate schemas can be used to infer from document to use for tables in a d
 Before you begin, ensure you have the following installed on your system:
 
 - **Python**: Make sure Python 3.9+ is installed.
-- **Graphviz**: This tool is necessary for visualizing the extracted schema.
+- **Poppler**: This tool is necessary for converting PDF to images.
 
 #### MacOS Installation
 
-To install Graphviz on MacOS, use the following command:
+To install Poppler on MacOS, use the following command:
 
 ```bash
-brew install graphviz
+brew install poppler
+
 ```
 
 #### Linux Installation
 
 To install Graphviz on Linux, use the following command:
 
 ```bash
-sudo apt install graphviz
+sudo apt-get install poppler-utils
 ```
+
+#### Windows
+
+1. Download the latest Poppler release for Windows from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/).
+2. Extract the downloaded zip file to a location on your computer (e.g., `C:\Program Files\poppler`).
+3. Add the `bin` directory of the extracted folder to your system's PATH environment variable.
+
+To add to PATH:
+1. Search for "Environment Variables" in the Start menu and open it.
+2. Under "System variables", find and select "Path", then click "Edit".
+3. Click "New" and add the path to the Poppler `bin` directory (e.g., `C:\Program Files\poppler\bin`).
+4. Click "OK" to save the changes.
+
+After installation, restart your terminal or command prompt for the changes to take effect.
+If it doesn't work, try the magic restart button.
+
 #### Installation
 After installing the prerequisites and dependencies, you can start using ScrapeSchema to extract entities and their schema from PDFs.

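One way to confirm the Poppler prerequisite from the README before running the extractor is a quick PDF-to-image conversion from Python. This is an illustrative sketch, not part of this commit; it assumes the `pdf2image` package is available, and `sample.pdf` plus the Windows path are placeholders.

```python
# Illustrative sketch (not part of this commit): check that Poppler is reachable
# by converting the first page of a PDF with pdf2image.
from pdf2image import convert_from_path

pages = convert_from_path(
    "sample.pdf",        # placeholder input file
    dpi=200,
    first_page=1,
    last_page=1,
    # poppler_path=r"C:\Program Files\poppler\bin",  # only needed on Windows if Poppler is not on PATH
)
pages[0].save("sample_page1.png", "PNG")
print(f"Converted {len(pages)} page(s); Poppler is working.")
```
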
pyproject.toml

Lines changed: 6 additions & 0 deletions
@@ -1,4 +1,5 @@
 [project]
+
 name = "scrapeschema"
 version = "0.0.1"
 description = "library for creating ontologies from documents"
@@ -68,3 +69,8 @@ dev-dependencies = [
     "-e file:.[docs]",
     "pylint>=3.2.5",
 ]
+[tool.rye.scripts]
+pylint-local = "pylint scrapegraphai/**/*.py"
+pylint-ci = "pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py"
+update-requirements = "python 'manual deployment/autorequirements.py'"
+

requirements.txt

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 certifi==2024.7.4
 charset-normalizer==3.3.2
 idna==3.8
-pdf2image==1.17.0
 pillow==10.4.0
 python-dotenv==1.0.1
 requests==2.32.3

scrapeschema/extractor.py

Lines changed: 102 additions & 3 deletions
@@ -1,6 +1,10 @@
 from abc import ABC, abstractmethod
 from typing import List, Tuple, Dict, Any
 from .primitives import Entity, Relation
+from .parsers.base_parser import BaseParser
+from .parsers.prompts import DELETE_PROMPT, UPDATE_ENTITIES_PROMPT
+import requests
+import json
 
 class Extractor(ABC):
     @abstractmethod
@@ -15,16 +19,111 @@ def extract_relations(self) -> List[Relation]:
     def entities_json_schema(self) -> Dict[str, Any]:
         pass
 
+    @abstractmethod
+    def update_entities(self, new_entities: List[Entity]) -> List[Entity]:
+        pass
+
 class FileExtractor(Extractor):
-    def __init__(self, file_path: str, parser):
+    def __init__(self, file_path: str, parser: BaseParser):
         self.file_path = file_path
         self.parser = parser
 
     def extract_entities(self) -> List[Entity]:
-        return self.parser.extract_entities(self.file_path)
+        new_entities = self.parser.extract_entities(self.file_path)
+        return self.update_entities(new_entities)
 
     def extract_relations(self) -> List[Relation]:
         return self.parser.extract_relations(self.file_path)
 
     def entities_json_schema(self) -> Dict[str, Any]:
-        return self.parser.entities_json_schema(self.file_path)
+        return self.parser.entities_json_schema(self.file_path)
+
+    def delete_entity_or_relation(self, item_description: str) -> None:
+        """
+        Delete an entity or relation based on user description.
+
+        :param item_description: User's description of the entity or relation to delete
+        """
+        entities_ids = [e.id for e in self.parser.get_entities()]
+        relations_ids = [(r.source, r.target, r.name) for r in self.parser.get_relations()]
+        prompt = DELETE_PROMPT.format(
+            entities=entities_ids,
+            relations=relations_ids,
+            item_description=item_description
+        )
+
+        response = self._get_llm_response(prompt)[8:-3]
+        response_dict = json.loads(response)
+
+        for key, value in response_dict.items():
+            if key == 'Type':
+                if value == 'Entity':
+                    self._delete_entity(response_dict['ID'])
+                elif value == 'Relation':
+                    self._delete_relation(response_dict['ID'])
+
+
+    def _delete_entity(self, entity_id: str) -> None:
+        """Delete an entity and its related relations."""
+        entities = self.parser.get_entities()
+        relations = self.parser.get_relations()
+
+        entities = [e for e in entities if e.id != entity_id]
+        relations = [r for r in relations if r.source != entity_id and r.target != entity_id]
+
+        self.parser.set_entities(entities)
+        self.parser.set_relations(relations)
+        print(f"Entity '{entity_id}' and its related relations have been deleted.")
+
+    def _delete_relation(self, relation_id: str) -> None:
+        """Delete a relation."""
+        relations = self.parser.get_relations()
+
+        source, target, name = eval(relation_id)
+        relations = [r for r in relations if not (r.source == source and r.target == target and r.name == name)]
+
+        self.parser.set_relations(relations)
+        print(f"Relation '{name}' between '{source}' and '{target}' has been deleted.")
+
+    def _get_llm_response(self, prompt: str) -> str:
+        """Get a response from the language model."""
+        payload = {
+            "model": self.parser.get_model(),
+            "temperature": self.parser.get_temperature(),
+            "messages": [
+                {"role": "user", "content": prompt}
+            ],
+        }
+        response = requests.post(self.parser.get_inference_base_url(), headers=self.parser.get_headers(), json=payload)
+        return response.json()['choices'][0]['message']['content']
+
+    def update_entities(self, new_entities: List[Entity]) -> List[Entity]:
+        """
+        Update the existing entities with new entities, integrating and deduplicating as necessary.
+
+        :param new_entities: List of new entities to be integrated
+        :return: Updated list of entities
+        """
+        existing_entities = self.parser.get_entities()
+
+        # Prepare the prompt for the LLM
+        prompt = UPDATE_ENTITIES_PROMPT.format(
+            existing_entities=json.dumps([e.__dict__ for e in existing_entities], indent=2),
+            new_entities=json.dumps([e.__dict__ for e in new_entities], indent=2)
+        )

+        # Get the LLM response
+        response = self._get_llm_response(prompt)
+
+        try:
+            updated_entities_data = json.loads(response)
+            updated_entities = [Entity(**entity_data) for entity_data in updated_entities_data]
+
+            # Update the parser's entities
+            self.parser.set_entities(updated_entities)
+
+            print(f"Entities updated. New count: {len(updated_entities)}")
+            return updated_entities
+        except json.JSONDecodeError:
+            print("Error: Unable to parse the LLM response.")
+            return existing_entities

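For orientation, here is a minimal usage sketch of the new update/delete flow. It is not part of the diff: the `PDFParser` class, the import paths, and the `OPENAI_APIKEY` variable name are assumptions (the base class docstring only hints at a PDF parser).

```python
# Illustrative usage sketch (not part of this commit); imports, PDFParser and the
# environment-variable name are assumptions rather than confirmed project APIs.
import os

from scrapeschema import FileExtractor       # assumed public import path
from scrapeschema.parsers import PDFParser   # assumed concrete BaseParser subclass

api_key = os.environ["OPENAI_APIKEY"]        # assumed variable name
parser = PDFParser(api_key)                  # picks up the new gpt-4o-2024-08-06 default

extractor = FileExtractor("./documents/sample.pdf", parser)

# extract_entities() now merges freshly parsed entities into the parser's
# existing ones via update_entities() before returning them.
entities = extractor.extract_entities()
relations = extractor.extract_relations()

# Delete by natural-language description; the LLM resolves it to an entity ID
# or a (source, target, name) relation tuple.
extractor.delete_entity_or_relation("the duplicated invoice entity")
```

Note that the `[8:-3]` slice in `delete_entity_or_relation` strips a leading Markdown code fence and the closing backticks from the model output, so this flow assumes the LLM returns its JSON inside a fenced block.
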
scrapeschema/parsers/base_parser.py

Lines changed: 11 additions & 7 deletions
@@ -3,7 +3,7 @@
 from ..primitives import Entity, Relation
 
 class BaseParser(ABC):
-    def __init__(self, api_key: str, inference_base_url: str = "https://api.openai.com/v1/chat/completions", model: str = "gpt-4o", temperature: float = 0.0):
+    def __init__(self, api_key: str, inference_base_url: str = "https://api.openai.com/v1/chat/completions", model: str = "gpt-4o-2024-08-06", temperature: float = 0.0):
         """
         Initializes the PDFParser with an API key.
 
@@ -16,9 +16,9 @@ def __init__(self, api_key: str, inference_base_url: str = "https://api.openai.c
             "Authorization": f"Bearer {self._api_key}"
         }
 
-        self.inference_base_url = inference_base_url
-        self.model = model
-        self.temperature = temperature
+        self._inference_base_url = inference_base_url
+        self._model = model
+        self._temperature = temperature
         self._entities = []
         self._relations = []
 
@@ -34,15 +34,20 @@ def extract_relations(self, file_path: str) -> List[Relation]:
     def entities_json_schema(self, file_path: str) -> Dict[str, Any]:
         pass
 
-
     def get_api_key(self):
         return self._api_key
 
     def get_headers(self):
         return self._headers
 
+    def get_model(self):
+        return self._model
+
+    def get_temperature(self):
+        return self._temperature
+
     def get_inference_base_url(self):
-        return self.inference_base_url
+        return self._inference_base_url
 
     def set_api_key(self, api_key: str):
         self._api_key = api_key
@@ -68,4 +73,3 @@ def set_relations(self, relations: List[Relation]):
         if not isinstance(relations, list) or not all(isinstance(relation, Relation) for relation in relations):
             raise TypeError("relations must be a List of Relation objects")
         self._relations = relations
-

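Because the constructor now stores the model, temperature, and base URL in private attributes, subclasses are expected to go through the new getters. A minimal sketch of a concrete parser follows; it is not part of the diff, the import paths are assumptions, the extraction bodies are placeholders, and any abstract methods not visible in these hunks would also need implementing.

```python
# Minimal sketch of a concrete BaseParser subclass (not part of this commit).
# Import paths are assumed; only the getter usage reflects the diff above.
from typing import Any, Dict, List

from scrapeschema.parsers.base_parser import BaseParser  # assumed import path
from scrapeschema.primitives import Entity, Relation     # assumed import path


class EchoParser(BaseParser):
    """Toy parser that returns nothing; it only demonstrates the accessor API."""

    def extract_entities(self, file_path: str) -> List[Entity]:
        # A real parser would POST the file's content to get_inference_base_url()
        # using get_headers(), get_model() and get_temperature().
        print(f"Would parse {file_path} with model={self.get_model()} "
              f"at temperature={self.get_temperature()}")
        return []

    def extract_relations(self, file_path: str) -> List[Relation]:
        return []

    def entities_json_schema(self, file_path: str) -> Dict[str, Any]:
        return {}


parser = EchoParser(api_key="sk-placeholder")  # default model: gpt-4o-2024-08-06
print(parser.get_inference_base_url())         # https://api.openai.com/v1/chat/completions
```
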