diff --git a/docs/source/archive/gsoc-toc.rst b/docs/source/archive/gsoc-toc.rst
index 5cf915b..1b9a6bb 100755
--- a/docs/source/archive/gsoc-toc.rst
+++ b/docs/source/archive/gsoc-toc.rst
@@ -14,6 +14,7 @@ GSoC 2025
 .. toctree::
    :maxdepth: 2
 
+   gsoc/reports/2025/scancode_toolkit_alok
    gsoc/reports/2025/vulnerablecode_michael
 
 GSoC 2024
diff --git a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst
new file mode 100644
index 0000000..1694a3b
--- /dev/null
+++ b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst
@@ -0,0 +1,201 @@
+========================================================================
+Have variable license sections in license rules
+========================================================================
+
+**Organization:** `AboutCode `_
+
+**Projects:** `Scancode Toolkit `_
+
+**Mentee:** `Alok Kumar (alok1304) `_
+
+**Mentors:**
+
+- `Philippe Ombredanne `_
+- `Ayan Sinha Mahapatra `_
+
+Overview
+--------
+This project aims to enhance the `detection_log` by clearly indicating when `extra-words`
+are detected. These `extra-words` represent variable parts in the license rules, which
+previously caused the match score to fall below 100.
+
+To address this issue, the implementation now verifies whether the `extra-words`
+appear in the correct position within the license text. If they do, the score is
+adjusted and improved accordingly, resulting in more accurate license rule matching.
+
+--------------------------------------------------------------------------------
+
+Implementation
+--------------
+
+- **Enhanced the detection_log:**
+
+  - Display `extra-words` when they are detected.
+
+- **Added extra-phrase marker like [[n]] for the extra-words:**
+
+  - The `extra-phrase` is denoted by double opening square brackets ``[[``
+    and double closing square brackets ``]]``.
+  - Here, `n` represents the maximum number of allowable `extra-words`.
+  - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions
+    where `extra-words` may appear.
+  - The value of `n` specifies how many `extra-words` are permitted
+    at that location.
+
+- **Improve Score:**
+
+  - Check whether `extra-words` appear in the correct position as defined by
+    the `extra-phrase`, and ensure they do not exceed the maximum allowable limit.
+  - If the conditions are satisfied, increase the match score to ``100``.
+
+- **Show in detection_log:**
+
+  - If the score is increased, which means the `extra-words` are in the correct
+    position, show ``extra-words-permitted-in-rule`` in the `detection_log`.
+  - If the `extra-words` are in the wrong place or exceed the maximum allowable limit,
+    show ``extra-words`` in the `detection_log`.
+
+- **Testing:**
+
+  - Added tests for the `extra-phrase` functionality, such as
+    `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that
+    phrases are correctly identified and processed.
+  - Implemented multiple tests to verify that `extra-words` appear in the correct
+    position according to the rules and that the match score is updated correctly
+    when they are within the allowable limit.
+  - Covered various edge cases where `extra-words` might be misplaced or exceed
+    the maximum allowable count, ensuring the scoring and logging behave as expected.
+
+--------------------------------------------------------------------------------
+
+Linked Pull Requests
+--------------------
+
+.. list-table::
+   :widths: 10 60 30 10
+   :header-rows: 1
+
+   * - Sr. no
+     - Name
+     - Link
+     - Status
+   * - 1
+     - Display `extra-words` in `detection_log` if present
+     - `aboutcode-org/scancode-toolkit#4402
+       `_
+     - Merged
+   * - 2
+     - Improve score by supporting `extra_phrase` for `extra-words` in rules
+     - `aboutcode-org/scancode-toolkit#4432
+       `_
+     - Open
+   * - 3
+     - Add extra-phrase in rules
+     - `aboutcode-org/scancode-toolkit#4518
+       `_
+     - Open
+
+Related Issues
+--------------
+
+.. list-table::
+   :widths: 10 60 30
+   :header-rows: 1
+
+   * - Sr. no
+     - Name
+     - Link
+   * - 1
+     - `extra-words` does not show up in detection_log properly
+     - `#4400
+       `_
+   * - 2
+     - Improve score when `extra-words` are found in the correct position
+     - `#4420
+       `_
+
+Pre GSoC Work
+-------------
+
+Before GSoC, I contributed the following PRs:
+
+.. list-table::
+   :widths: 10 60 30
+   :header-rows: 1
+
+   * - Sr. no
+     - Name
+     - Link
+   * - 1
+     - Renaming the dependency attribute `is_resolved` to `is_pinned`
+     - `aboutcode-org/scancode-workbench#638
+       `_
+   * - 2
+     - Add test for all PyPI METADATA versions
+     - `aboutcode-org/scancode-toolkit#4180
+       `_
+   * - 3
+     - Add test for false positive GPL3 license
+     - `aboutcode-org/scancode-toolkit#4106
+       `_
+   * - 4
+     - Add new rules for EUPL license
+     - `aboutcode-org/scancode-toolkit#4204
+       `_
+   * - 5
+     - Add DUMB License and detection rule
+     - `aboutcode-org/scancode-toolkit#4400
+       `_
+   * - 6
+     - Fixing the dead link by cross-reference in the documentation
+     - `aboutcode-org/purldb#550
+       `_
+   * - 7
+     - Add test for equivalent word
+     - `aboutcode-org/scancode-toolkit#4305
+       `_
+   * - 8
+     - Enhance code visibility in dark mode
+     - `aboutcode-org/scancode-workbench#637
+       `_
+
+Post GSoC
+---------
+
+I plan to continue contributing by adding `extra-phrase` support across many
+license rules. This will strengthen license detection by making it more accurate
+and flexible in handling variations within the rules.
+
+For identifying named entities in rules, I created a new repository,
+`named-entity-utils `_, which I am
+currently working on. This utility is used to add `extra-phrase` markers in rules
+at positions where named entities are present.
+
+Links
+-----
+
+* `Project Idea
+  `_
+
+* `Official GSoC project page
+  `_
+
+* `GSoC Proposal
+  `_
+
+* `Project Board `_
+
+Acknowledgements
+----------------
+
+I would like to thank my mentors:
+
+- `Philippe Ombredanne`_
+- `Ayan Sinha Mahapatra`_
+
+A special thanks to my mentors who always supported me throughout this journey. Whenever
+I faced a problem, we discussed it in depth during our weekly status calls. Without
+their guidance and constant help, completing this project would not have been possible.
+
+I also plan to explore more projects in AboutCode and contribute whenever I get
+time, because I would love to remain a part of this wonderful organization.