This repository contains a comprehensive methodological blueprint for a data science project aimed at investigating the relationship between hypertension and psychopathological events in a geriatric population. The project is designed as a formal proposal, following the industry-standard CRISP-DM (Cross-Industry Standard Process for Data Mining) framework.
The primary goal is not to present a final model but to showcase the entire strategic planning process required for a real-world data science initiative. This includes business problem definition, data preparation strategy, modeling approach, and deployment considerations. The document serves as a case study in how to structure a data mining project from concept to completion.
The core of this project is a detailed, phase-by-phase plan that demonstrates a deep understanding of the data science lifecycle.
- Problem: A geriatric care company observes a potential link between hypertension and the worsening of mental disorders or psychotic crises in its clients.
- Objectives:
  - Statistical Validation: To confirm or refute the hypothesized link using historical data.
  - Prevention & Care Improvement: To develop a predictive model that enables early detection of risk, triggering preventive measures.
- Deliverable: A predictive model integrated into healthcare management software to provide early alerts.
- Plan: This phase outlines the process for identifying and exploring all relevant internal data sources, assessing data quality, and understanding the structure of the necessary information (formats, attributes, accessibility).
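
The first objective above (testing the hypothesized link against historical data) could start with a simple test of independence. The sketch below is in Python purely for illustration (the proposal names R for implementation) and assumes the historical records can be reduced to a 2x2 table of hypertension status against crisis occurrence. The function names and the choice of a Pearson chi-square test are assumptions, not part of the plan.

```python
# Assumed table layout (counts come from historical records, not shown here):
#                  crisis   no crisis
#   hypertensive      a         b
#   normotensive      c         d

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def link_is_significant(a, b, c, d):
    """True when independence is rejected at the 5% level."""
    return chi_square_2x2(a, b, c, d) > CRITICAL_VALUE_DF1_ALPHA_05
```

A significant result here would only indicate association, not causation; confounders such as age or medication would still need to be handled in the modeling phase.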
This section details a sophisticated data preparation and feature engineering strategy, showcasing a deep focus on data quality.
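- Dimensionality Reduction and Feature Selection: Using PCA to compress correlated variables into fewer components, and Recursive Feature Elimination (RFE) to select the most relevant features.
- Advanced Normalization: The plan specifies multiple normalization techniques to be tested on variables such as age and medication dosage: scaling by the maximum value, scaling by the difference between maximum and minimum (min-max), and standardizing by the standard deviation (z-score).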
- Feature Engineering: Proposes the creation of new, high-value features, such as a 'Hydration' variable derived from food and fluid intake records.
- Bias Identification & Mitigation: A critical and often-overlooked step. The plan explicitly includes tasks to identify and document potential ideological, gender, observer, and interpretation biases in the data, with the aim of a more ethical and robust final model.
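
The three normalization options named in the plan (by max value, by difference, by standard deviation) can be sketched as follows. This is a minimal illustration in Python rather than the proposed R, and it reads the three options as max scaling, min-max scaling, and z-score standardization; that reading and all function names are assumptions.

```python
from statistics import mean, pstdev


def scale_by_max(values):
    """Divide by the maximum; assumes non-negative data such as age or dosage."""
    peak = max(values)
    if peak == 0:
        raise ValueError("maximum is zero; cannot scale")
    return [v / peak for v in values]


def scale_by_range(values):
    """Min-max scaling: map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]
```

In practice the scaling parameters (maximum, range, mean, standard deviation) would be computed on the training data only and then reused on new records, so that evaluation data does not leak into preparation.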
- Proposed Algorithms: A comparative approach is planned, starting with interpretable models like Decision Trees and Logistic Regression.
- Evaluation Plan: The strategy uses k-fold cross-validation to assess model performance on held-out data and to estimate how well the model generalizes.
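
The k-fold scheme in the evaluation plan can be sketched as an index splitter: the records are shuffled once, cut into k folds, and each fold serves as the test set exactly once. This is an illustrative Python sketch, not the proposal's R implementation; the function name, the `seed` parameter, and the plain (non-stratified) splitting are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the splits reproducible
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx
```

If crises turn out to be rare in the historical data, a stratified variant that preserves the outcome ratio in every fold would likely be a better fit, and repeated records from the same patient should be kept within a single fold.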
The project scope extends beyond modeling to include a full deployment strategy.
- Business Evaluation: The plan specifies that model results must be evaluated against the initial business objectives.
- Deployment: Outlines the steps for implementing the final model into the patient management environment, including the creation of a real-time alert system for healthcare staff.
The repository includes a detailed document comparing the chosen CRISP-DM framework against other methodologies like SEMMA, KDD, and Agile, justifying why its iterative and project-oriented approach is the best fit for this type of research and development problem.
- Primary Methodology: CRISP-DM
- Proposed Language for Implementation: R
- Proposed Models: Logistic Regression, Decision Trees
This repository contains a complete project proposal and strategic plan. It is not intended to be a finished codebase but a demonstration of the crucial planning and foresight required to execute a successful data science project. It showcases senior-level skills in problem definition, methodological rigor, and strategic planning in a real-world healthcare context.
Antonio Barrera Mora
- LinkedIn: https://www.linkedin.com/in/anbamo/
- GitHub: @Kamaranis