nlp-ml-analysis-of-psych-methods

Methodological Trends in Psychology Research: Analyzing Abstracts with Natural Language Processing and Machine Learning

Introduction

This repository accompanies the research paper "Methodological Trends in Psychology Research: Analyzing Abstracts with Natural Language Processing and Machine Learning". It is a supplementary resource for researchers, analysts, and practitioners interested in applying Natural Language Processing (NLP), Text Mining (TM), and Machine Learning (ML) to psychological research.

The repository contains the method glossary, the data-retrieval scripts, and the analysis scripts used in the study. These resources are intended to support reproducibility, further exploration, and methodological transparency. The materials aim to be accessible to newcomers to data analysis and informative for experienced statisticians.

Citation

@Article{,
  AUTHOR = {},
  TITLE = {},
  JOURNAL = {},
  YEAR = {},
  VOLUME = {},
  NUMBER = {},
  PAGES = {},
  DOI = {},
  URL = {},
  ISSN = {},
  ABSTRACT = {Scientific documents, such as research articles, are valuable resources for Information Retrieval and Natural Language Processing (NLP), offering opportunities to extract specialized knowledge and analyze key components of scholarly content, including research methods. This study investigates methodological trends in psychology research over the past 30 years (1995-2024) by applying a novel NLP and Machine Learning pipeline to a large corpus of 85,452 abstracts. A curated glossary of 365 method-related keywords served as a gold-standard reference for term identification, using direct and fuzzy string matching. Retrieved terms were encoded with SciBERT, averaging embeddings across contextual occurrences to produce unified vectors. These vectors were clustered using unsupervised and weighted semi-supervised approaches, yielding six and ten clusters, respectively. Cluster composition was analyzed using weighted statistical measures to assess term importance within and across groups. Our findings highlight an increasing presence of methodological terminology in psychology, reflecting a shift toward greater standardization and transparency in research reporting. This work contributes a reproducible methodological framework for the semantic analysis of research language, with implications for meta-research, domain-specific lexicon development, and automated scientific knowledge discovery.}
}

Data Source

The data and statistics presented in this repository were collected from three reputable databases in the fields of health, psychology, and behavioral sciences:

Scopus (Elsevier) – https://www.scopus.com
MEDLINE (PubMed) – https://pubmed.ncbi.nlm.nih.gov
PsycINFO (Ovid) – https://ovidsp.dc1.ovid.com/ovid-new-a/ovidweb.cgi

These data sources were used in accordance with their respective terms of use. Full credit for the original data collection is attributed to the corresponding platforms.

Disclaimer: This repository is not affiliated with, sponsored by, or endorsed by Elsevier, PubMed, or Ovid.

Repository Contents

This repository contains the following scripts and data files used in the research:

📁 Data and Glossary

glossary_365Terms.json – Glossary containing 365 method-related terms.
data_Elsevier.py – Script for retrieving data from Elsevier’s Scopus.
data_PubMed.py – Script for retrieving data from PubMed’s MEDLINE.

🛠️ Preprocessing

prepro_Hyphen.py – Handles term hyphenation.
prepro_NLTK.py – Preprocesses abstracts.
direct&fuzzy.py – Performs direct and fuzzy string matching.
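
As a rough illustration of what direct and fuzzy term matching can look like, the sketch below uses only Python's standard library. It is not the code in `direct&fuzzy.py`: the function name `find_terms`, the sliding-window strategy, the use of `difflib.SequenceMatcher`, and the 0.9 threshold are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def find_terms(abstract, glossary, threshold=0.9):
    """Return {term: "direct" | "fuzzy"} for glossary terms found in one abstract.

    Direct matching looks for the term as a whole phrase. Fuzzy matching slides
    a window with the same word count over the abstract and accepts the term if
    any window reaches `threshold` (an assumed value, not taken from the paper).
    """
    text = abstract.lower()
    words = re.findall(r"[a-z0-9\-]+", text)
    found = {}
    for term in glossary:
        needle = term.lower()
        # Direct match: exact phrase not embedded in a longer alphanumeric token.
        pattern = r"(?<![a-z0-9])" + re.escape(needle) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found[term] = "direct"
            continue
        # Fuzzy match: compare against each window of the same number of words.
        n = len(needle.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, needle, window).ratio() >= threshold:
                found[term] = "fuzzy"
                break
    return found


# Toy glossary and abstract; "regresion" is a deliberate misspelling.
glossary = ["regression analysis", "factor analysis", "meta-analysis"]
abstract = "We ran a regresion analysis and a meta-analysis of prior trials."
matches = find_terms(abstract, glossary)
```

Here the misspelled phrase is caught only by the fuzzy pass, while "meta-analysis" is found directly. A lower threshold would recover more spelling variants at the cost of more false positives.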

🧠 Embeddings & Representation

embeddings.py – Generates embeddings using SciBERT.
embeddings_Avg'ed.py – Calculates average embeddings per term.
updateAbstractsWith_Avg'edEmbeds.py – Updates abstracts with averaged term embeddings.
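
The averaging step can be sketched without loading a model. The example below assumes the contextual vectors have already been produced (for instance by SciBERT) and only shows how several occurrences of one term collapse into a single vector. The function name `average_term_embeddings` and the toy 3-dimensional vectors are hypothetical; real SciBERT outputs have 768 dimensions.

```python
from collections import defaultdict


def average_term_embeddings(occurrences):
    """Average contextual vectors per term.

    `occurrences` is an iterable of (term, vector) pairs, one pair for each
    place a term appears in the corpus. Returns {term: mean vector}.
    """
    sums = {}
    counts = defaultdict(int)
    for term, vector in occurrences:
        if term not in sums:
            sums[term] = [0.0] * len(vector)
        if len(vector) != len(sums[term]):
            raise ValueError(f"inconsistent vector length for {term!r}")
        # Accumulate element-wise so each dimension is averaged independently.
        sums[term] = [s + v for s, v in zip(sums[term], vector)]
        counts[term] += 1
    return {term: [s / counts[term] for s in total] for term, total in sums.items()}


# Toy vectors standing in for model outputs.
occurrences = [
    ("anova", [1.0, 0.0, 2.0]),
    ("anova", [3.0, 2.0, 0.0]),
    ("t-test", [0.5, 0.5, 0.5]),
]
unified = average_term_embeddings(occurrences)
```

Each term ends up with exactly one vector regardless of how often it occurs, which is what makes term-level clustering possible.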

🔍 Unsupervised Clustering

umapUnsupervised.py – Applies UMAP dimensionality reduction.
exploratoryUnsupervisedClustering.py – Exploratory k-means clustering.
selectedClusteringModel_Unsupervised.py – Final model configuration.
outliers_Unsupervised.py – Outlier detection.
clusterDescriptives_Unsupervised.py – Cluster-level descriptive analysis.
compactness_VS_separation_Unsupervised.py – Visualizes cluster compactness vs. separation.
tf-idf_Unsupervised.py – Computes TF-IDF scores per cluster.
jaccard_Unsupervised.py – Calculates Jaccard similarities and heatmap.
heatmap_CommonTerms_Unsupervised.py – Heatmap of common terms between clusters.
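
To make the per-cluster TF-IDF idea concrete, here is a minimal sketch that treats each cluster as one "document" made of term occurrences. The exact weighting used in `tf-idf_Unsupervised.py` is not described in this README, so the unsmoothed formula, the function name `tfidf_per_cluster`, and the toy clusters are assumptions. Libraries such as scikit-learn apply smoothed variants by default, so their values would differ.

```python
import math
from collections import Counter


def tfidf_per_cluster(cluster_terms):
    """Compute TF-IDF for each term in each cluster.

    `cluster_terms` maps a cluster label to the list of term occurrences in it.
    TF is the term's share of occurrences within the cluster; IDF is
    log(number of clusters / number of clusters containing the term).
    """
    n_clusters = len(cluster_terms)
    doc_freq = Counter()
    for terms in cluster_terms.values():
        doc_freq.update(set(terms))  # count each cluster once per term
    scores = {}
    for label, terms in cluster_terms.items():
        counts = Counter(terms)
        total = sum(counts.values())
        scores[label] = {
            term: (count / total) * math.log(n_clusters / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Toy clusters; labels and terms are illustrative only.
clusters = {
    0: ["anova", "anova", "t-test", "regression"],
    1: ["interview", "regression"],
}
scores = tfidf_per_cluster(clusters)
```

A term that appears in every cluster ("regression" here) scores zero, so the ranking highlights terms that distinguish one cluster from the others.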

🧪 Semi-Supervised Clustering

weightedTerms.py – Applies weighting scheme.
umapSemiSupervised.py – UMAP reduction for semi-supervised approach.
clusterDescriptives_Semisupervised.py – Descriptive analysis of clusters.
jaccardSemiSupervised.py – Jaccard similarity and heatmap.
facetedBarPlots_Semisupervised_10Clusters.py – Bar plots of top terms per cluster.
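
The Jaccard comparison between clusters reduces to set arithmetic: the size of the intersection of two term sets divided by the size of their union. The sketch below computes it for every pair of clusters; the helper names and the toy term sets are hypothetical, and the heatmap step is left out.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(cluster_terms):
    """Return {(i, j): similarity} for every unordered pair of cluster labels."""
    return {
        (i, j): jaccard(cluster_terms[i], cluster_terms[j])
        for i, j in combinations(sorted(cluster_terms), 2)
    }


# Toy term sets per cluster.
cluster_terms = {
    0: {"anova", "t-test", "regression"},
    1: {"regression", "interview"},
    2: {"interview", "focus group"},
}
similarities = pairwise_jaccard(cluster_terms)
```

The resulting dictionary can be reshaped into a symmetric matrix for plotting, with 1.0 on the diagonal.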

📊 Global Term Statistics

overall_DescriptiveTermStats.py – Descriptive stats for all terms.
overall_tf-idf_byYear.py – TF-IDF by year.
overall_tf-idf_5YearPlots.py – 5-year window TF-IDF visualizations.
abstractProportion_WithNoTerms.py – Proportion of abstracts without terms (dual-axis plot).
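
The share of abstracts without any matched term can be computed per year with a simple tally. This sketch assumes each record is a `(year, matched_terms)` pair, which is an illustrative data layout rather than the one used in `abstractProportion_WithNoTerms.py`; the records below are toy values, not study data.

```python
from collections import defaultdict


def proportion_without_terms(records):
    """Return {year: share of abstracts with no matched glossary term}.

    `records` is an iterable of (year, matched_terms) pairs, where
    `matched_terms` is any collection (empty when nothing matched).
    """
    totals = defaultdict(int)
    empty = defaultdict(int)
    for year, terms in records:
        totals[year] += 1
        if not terms:
            empty[year] += 1
    return {year: empty[year] / totals[year] for year in sorted(totals)}


# Toy records for two years.
records = [
    (1995, []),
    (1995, ["anova"]),
    (2024, ["anova", "t-test"]),
    (2024, ["regression"]),
]
proportions = proportion_without_terms(records)
```

A falling proportion over time would be consistent with the paper's finding that methodological terminology is becoming more common in abstracts.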

How to Use This Repository

The resources and scripts in this repository can be leveraged for various research and analytical purposes:

Natural Language Processing & Text Mining: Apply the preprocessing pipeline, term matching, and embedding techniques for information retrieval or content analysis tasks.
Machine Learning Applications: Reuse the clustering workflows (unsupervised and semi-supervised) for methodological experimentation or to analyze similar corpora.
Scientific Communication & Meta-Research: Create visualizations and conduct meta-analyses to explore how psychological research methods are represented and evolve across time in scientific literature.

Note: This repository does not include the original raw abstracts due to licensing restrictions. Users can recreate the dataset using the provided scripts and publicly accessible APIs (where permitted).

KosStath/nlp-ml-analysis-of-psych-methods
