Commits (31)

- 94aa2bb: edit prompts (Mar 31, 2024)
- 9df54fe: edit exception (Mar 31, 2024)
- e3e24ea: test push (linhkid, Mar 31, 2024)
- 284474f: Add other fields and fix JSON format errors (linhkid, Apr 2, 2024)
- cb7341f: add date time to file name (linhkid, Apr 7, 2024)
- 937bbef: Edit some comments (linhkid, Apr 7, 2024)
- 6ec246b: Update README.md (linhkid, Apr 9, 2024)
- d98d8da: Update README.md (linhkid, Apr 9, 2024)
- 126773e: Update README.md (linhkid, Apr 9, 2024)
- 5d885c4: test adding new attributes (linhkid, Apr 21, 2024)
- fc0e67e: Read html version of papers instead of just abstract (linhkid, Apr 26, 2024)
- fc807c3: Add subjects and add more tokens for the model to digest (linhkid, Apr 27, 2024)
- a3848f5: Modify Huggingface app.py (linhkid, Apr 27, 2024)
- ae371ad: Change README (linhkid, Apr 27, 2024)
- 723f383: Change README (linhkid, Apr 27, 2024)
- a332618: Fix crawler error lead to logic's fault in checking subjects (linhkid, May 9, 2024)
- 48da507: Change URL for main page landing, waiting for TODO on abstract (linhkid, May 25, 2024)
- 9b11eb5: Fix the abstract not found error, and also add ssl cert for windows (linhkid, May 26, 2024)
- 16cd86c: Major fix and upgrade for Arxiv digest (linhkid, Apr 6, 2025)
- 23c38b5: ok for now (linhkid, Apr 6, 2025)
- 89ffcf1: just to be safe, it's processing single file ok now (linhkid, Apr 6, 2025)
- 51389ee: 2 stage filtering (linhkid, Apr 6, 2025)
- e09d501: Merge branch 'main' into multiagent_multipurpose (linhkid, Apr 6, 2025)
- 2cc2ce2: Merge pull request #1 from linhkid/multiagent_multipurpose (linhkid, Apr 6, 2025)
- e8da783: refine and refactor (linhkid, Apr 7, 2025)
- 45dd62d: edit README (linhkid, Apr 7, 2025)
- a8eec4d: edit README (linhkid, Apr 7, 2025)
- 01ce725: Merge pull request #2 from linhkid/multiagent_multipurpose (linhkid, Apr 7, 2025)
- 427cf6a: Update README.md (linhkid, Apr 7, 2025)
- bddfee4: edit threshold bug (linhkid, Apr 7, 2025)
- cb8e751: add scrollable sidebar for HTML (linhkid, Apr 7, 2025)
4 changes: 3 additions & 1 deletion README.md
@@ -1,6 +1,8 @@
<p align="center"><img src="./readme_images/banner.png" width=500 /></p>

**ArXiv Digest and Personalized Recommendations using Large Language Models.**
**ArXiv Digest (extra version) and Personalized Recommendations using Large Language Models.**

*(Note: This is an adjusted repo to match my needs. For original repo please refer to **AutoLLM** that I forked from)*
Collaborator comment:

This is a pull request to the original repo 😄

Author reply:

Sorry Richard, pls ignore haha.

This repo aims to provide a better daily digest for newly published arXiv papers based on your own research interests and natural-language descriptions, using relevancy ratings from GPT.

9 changes: 5 additions & 4 deletions config.yaml
@@ -3,13 +3,13 @@ topic: "Computer Science"
# An empty list here will include all categories in a topic
# Use the natural language names of the topics, found here: https://arxiv.org
# Including more categories will result in more calls to the large language model
categories: ["Artificial Intelligence", "Computation and Language"]
categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning"]

# Relevance score threshold. abstracts that receive a score less than this from the large language model
# will have their papers filtered out.
#
# Must be within 1-10
threshold: 7
threshold: 6

# A natural language statement that the large language model will use to judge which papers are relevant
#
@@ -23,5 +23,6 @@ threshold: 7
interest: |
1. Large language model pretraining and finetunings
2. Multimodal machine learning
3. Do not care about specific application, for example, information extraction, summarization, etc.
4. Not interested in paper focus on specific languages, e.g., Arabic, Chinese, etc.
3. RAGs
4. Optimization of LLM and GenAI
5. Do not care about specific application, for example, information extraction, summarization, etc.
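The threshold change from 7 to 6 widens what survives filtering. A minimal sketch of the threshold check, using hypothetical paper records (only the "Relevancy score" key is taken from the diff; the titles are invented):

```python
# Hypothetical scored papers; "Relevancy score" matches the key used in src/action.py.
papers = [
    {"title": "LLM pretraining at scale", "Relevancy score": 8},
    {"title": "RAG pipelines for search", "Relevancy score": 6},
    {"title": "Domain-specific summarization", "Relevancy score": 4},
]

def filter_by_threshold(papers, threshold):
    """Keep only papers whose score meets the configured threshold (1-10)."""
    return [p for p in papers if p["Relevancy score"] >= threshold]

print(len(filter_by_threshold(papers, 7)))  # old threshold keeps only the score-8 paper
print(len(filter_by_threshold(papers, 6)))  # new threshold also admits the score-6 paper
```

Lowering the threshold by one point trades precision for recall: borderline papers now reach the digest instead of being dropped silently.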
19 changes: 14 additions & 5 deletions src/action.py
@@ -1,15 +1,15 @@
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail, Email, To, Content

from datetime import date

import argparse
import yaml
import os
from dotenv import load_dotenv
import openai
from relevancy import generate_relevance_score, process_subject_fields
from download_new_papers import get_papers
from datetime import date



# Hackathon quality code. Don't judge too harshly.
@@ -247,11 +247,15 @@ def generate_body(topic, categories, interest, threshold):
papers,
query={"interest": interest},
threshold_score=threshold,
num_paper_in_prompt=16,
num_paper_in_prompt=20,
)
body = "<br><br>".join(
[
f'Title: <a href="{paper["main_page"]}">{paper["title"]}</a><br>Authors: {paper["authors"]}<br>Score: {paper["Relevancy score"]}<br>Reason: {paper["Reasons for match"]}'
f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
f'<b>Score:</b> {paper["Relevancy score"]}<br><b>Reason:</b> {paper["Reasons for match"]}<br>'
f'<b>Goal:</b> {paper["Goal"]}<br><b>Data</b>: {paper["Data"]}<br><b>Methodology:</b> {paper["Methodology"]}<br>'
f'<b>Experiments & Results</b>: {paper["Experiments & Results"]}<br><b>Git</b>: {paper["Git"]}<br>'
f'<b>Discussion & Next steps</b>: {paper["Discussion & Next steps"]}'
for paper in relevancy
]
)
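The richer per-paper block above can be sketched in isolation. The record below is hypothetical; only the keys shown in the diff are assumed, and just a few of the template fields are reproduced:

```python
# Hypothetical paper record carrying the new fields the template expects.
paper = {
    "main_page": "https://arxiv.org/abs/2404.00001",
    "title": "An Example Paper",
    "Relevancy score": 8,
    "Goal": "Cut pretraining cost.",
    "Git": "n/a",
}

# Same f-string style as the diff: bold labels, <br> separators.
entry = (
    f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br>'
    f'<b>Score:</b> {paper["Relevancy score"]}<br>'
    f'<b>Goal:</b> {paper["Goal"]}<br><b>Git</b>: {paper["Git"]}'
)
```

Any paper missing one of the new keys (Goal, Data, Methodology, ...) would raise a `KeyError` here, so the model output must reliably contain all fields the template references.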
Expand All @@ -269,6 +273,10 @@ def generate_body(topic, categories, interest, threshold):
)
return body

def get_date():
today = date.today()
formatted_date = today.strftime("%d%m%Y")
return formatted_date

if __name__ == "__main__":
# Load the .env file.
Expand All @@ -292,7 +300,8 @@ def generate_body(topic, categories, interest, threshold):
threshold = config["threshold"]
interest = config["interest"]
body = generate_body(topic, categories, interest, threshold)
with open("digest.html", "w") as f:
today_date = get_date()
with open(f"digest_{today_date}.html", "w") as f:
f.write(body)
if os.environ.get("SENDGRID_API_KEY", None):
sg = SendGridAPIClient(api_key=os.environ.get("SENDGRID_API_KEY"))
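The date-stamped output filename can be sketched on its own; this mirrors the `get_date` helper added in the diff:

```python
from datetime import date

def get_date():
    """Return today's date as DDMMYYYY, used to stamp the digest filename."""
    return date.today().strftime("%d%m%Y")

filename = f"digest_{get_date()}.html"
```

One side effect worth noting: the day-first `%d%m%Y` order means filenames do not sort lexicographically by date (`%Y%m%d` would); the sketch keeps the format the diff uses.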
4 changes: 3 additions & 1 deletion src/download_new_papers.py
Expand Up @@ -7,7 +7,7 @@
import datetime
import pytz


#Linh - add new def crawl_html_version(html_link) here
def _download_new_papers(field_abbr):
NEW_SUB_URL = f'https://arxiv.org/list/{field_abbr}/new' # https://arxiv.org/list/cs/new
page = urllib.request.urlopen(NEW_SUB_URL)
Expand All @@ -21,6 +21,7 @@ def _download_new_papers(field_abbr):
dt_list = content.dl.find_all("dt")
dd_list = content.dl.find_all("dd")
arxiv_base = "https://arxiv.org/abs/"
arxiv_html = "https://arxiv.org/html/"

assert len(dt_list) == len(dd_list)
new_paper_list = []
Expand All @@ -29,6 +30,7 @@ def _download_new_papers(field_abbr):
paper_number = dt_list[i].text.strip().split(" ")[2].split(":")[-1]
paper['main_page'] = arxiv_base + paper_number
paper['pdf'] = arxiv_base.replace('abs', 'pdf') + paper_number
paper['html'] = arxiv_html + paper_number + "v1"

paper['title'] = dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title: ", "").strip()
paper['authors'] = dd_list[i].find("div", {"class": "list-authors"}).text \
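The three links the crawler now records per paper can be sketched as a pure function. The paper number below is a hypothetical example, and the hard-coded "v1" suffix mirrors the diff (it assumes the first HTML revision exists):

```python
def build_links(paper_number):
    """Build abstract, PDF, and HTML-version URLs for an arXiv paper number."""
    arxiv_base = "https://arxiv.org/abs/"
    arxiv_html = "https://arxiv.org/html/"
    return {
        "main_page": arxiv_base + paper_number,
        "pdf": arxiv_base.replace("abs", "pdf") + paper_number,
        # the crawl targets the first revision; later revisions would be v2, v3, ...
        "html": arxiv_html + paper_number + "v1",
    }

links = build_links("2404.00001")
```

Pinning "v1" is the simplest choice but can 404 for papers whose HTML rendering only exists for a later revision; the diff accepts that trade-off.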
23 changes: 18 additions & 5 deletions src/relevancy.py
Expand Up @@ -35,6 +35,14 @@ def encode_prompt(query, prompt_papers):
return prompt


def is_json(myjson):
try:
json.loads(myjson)
except ValueError as e:
return False
return True


def post_process_chat_gpt_response(paper_data, response, threshold_score=8):
selected_data = []
if response is None:
Expand All @@ -45,9 +53,14 @@ def post_process_chat_gpt_response(paper_data, response, threshold_score=8):
try:
score_items = [
json.loads(re.sub(pattern, "", line))
for line in json_items if "relevancy score" in line.lower()]
except Exception:
for line in json_items if (is_json(line) and "relevancy score" in line.lower())]
except Exception as e:
pprint.pprint([re.sub(pattern, "", line) for line in json_items if "relevancy score" in line.lower()])
try:
score_items = score_items[:-1]
except Exception:
score_items = []
print(e)
raise RuntimeError("failed")
pprint.pprint(score_items)
scores = []
@@ -91,8 +104,8 @@ def generate_relevance_score(
all_papers,
query,
model_name="gpt-3.5-turbo-16k",
threshold_score=8,
num_paper_in_prompt=4,
threshold_score=7,
num_paper_in_prompt=8,
temperature=0.4,
top_p=1.0,
sorting=True
@@ -136,7 +149,7 @@
return ans_data, hallucination

def run_all_day_paper(
query={"interest":"", "subjects":["Computation and Language", "Artificial Intelligence"]},
query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence"]},
date=None,
data_dir="../data",
model_name="gpt-3.5-turbo-16k",
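The new `is_json` guard protects the line-by-line parse from malformed model output. A standalone sketch of the guarded parse; the numbering pattern and the sample lines are assumptions (the actual regex is defined elsewhere in relevancy.py):

```python
import json
import re

def is_json(myjson):
    """Return True if the string parses as JSON, else False."""
    try:
        json.loads(myjson)
    except ValueError:
        return False
    return True

# Hypothetical model output: one numbered JSON object per line, plus one bad line.
lines = [
    '1. {"Relevancy score": "8", "Reasons for match": "LLM pretraining"}',
    '2. oops, the model rambled here instead of emitting JSON',
]
pattern = r"^\d+\. "  # assumed numbering prefix, e.g. "1. "
score_items = [
    json.loads(re.sub(pattern, "", line))
    for line in lines
    if is_json(re.sub(pattern, "", line)) and "relevancy score" in line.lower()
]
```

With the guard, a single rambling line is skipped instead of raising `json.JSONDecodeError` and losing the whole batch, which is the failure mode the diff's fallback (`score_items[:-1]`) also tries to contain.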
9 changes: 5 additions & 4 deletions src/relevancy_prompt.txt
@@ -1,7 +1,8 @@
You have been asked to read a list of a few arxiv papers, each with title, authors and abstract.
Based on my specific research interests, elevancy score out of 10 for each paper, based on my specific research interest, with a higher score indicating greater relevance. A relevance score more than 7 will need person's attention for details.
Additionally, please generate 1-2 sentence summary for each paper explaining why it's relevant to my research interests.
Based on my specific research interests, provide a relevancy score out of 10 for each paper, with a higher score indicating greater relevance. A relevance score of more than 7 will need a person's attention for details.
Additionally, please generate a summary for each paper explaining why it's relevant to my research interests.
Please keep the paper order the same as in the input list, with one json format per line. Example is:
1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings"}

My research interests are:
1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings", "Goal": "What kind of pain points the paper is trying to solve?", "Data": "Summary of the data source used in the paper", "Methodology": "Summary of methodologies used in the paper", "Git": "Link to the code repo (if available)", "Experiments & Results": "Summary of any experiments & its results", "Discussion & Next steps": "Further discussion and next steps of the research"}

My research interests are: NLP, RAGs, LLM, Optimization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
Collaborator comment (@rmfan, Apr 21, 2024):

The interests get appended on here:

    prompt += query['interest']

No need to add them manually to the relevancy prompt.

Author reply:

OK thanks