Commit d4d05d1

add gsoc25 report
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
1 parent 023167e commit d4d05d1

2 files changed

+170
-0
lines changed

docs/source/archive/gsoc-toc.rst

Lines changed: 8 additions & 0 deletions
@@ -8,6 +8,14 @@ designed to encourage university student participation in open source
 software development. It was started by Google in 2005. More about GSoC -
 `<https://summerofcode.withgoogle.com/about/>`_
 
+GSoC 2025
+---------
+
+.. toctree::
+   :maxdepth: 2
+
+   gsoc/reports/2025/scancode_toolkit_alok
+
 GSoC 2024
 ---------
 

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
========================================================================
Have variable license sections in license rules
========================================================================

**Organization:** `AboutCode <https://aboutcode.org>`_

**Project:** `ScanCode Toolkit <https://github.com/aboutcode-org/scancode-toolkit>`_

**Mentee:** `Alok Kumar (alok1304) <https://github.com/alok1304>`_

**Mentors:**

- `Philippe Ombredanne <https://github.com/pombredanne>`_
- `Ayan Sinha Mahapatra <https://github.com/AyanSinhaMahapatra>`_

Overview
--------

This project enhances the `detection_log` so that it clearly indicates when
`extra-words` are detected. These `extra-words` correspond to variable parts of
the license rules, and they previously caused the match score to fall below 100.

To address this, the implementation now verifies whether the `extra-words`
appear at the expected position within the license text. If they do, the score
is raised accordingly, resulting in more accurate license rule matching.
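
To make the problem concrete, here is a toy illustration of why a few extra
words pull a match below 100. It assumes a simplified score, namely the share of
matched rule words among all words in the matched region. That formula, the
function name ``toy_score``, and the sample texts are assumptions made for
illustration only and are not ScanCode's actual scoring code.

```python
def toy_score(rule_text, query_text):
    """Return a toy match score in [0, 100].

    Assumption: score = matched rule words / all words in the query region.
    This is a simplified model, not ScanCode's real formula.
    """
    rule_words = rule_text.lower().split()
    query_words = query_text.lower().split()
    if not query_words:
        return 0.0

    # Walk the query and consume rule words in order; anything else is "extra".
    matched = 0
    for word in query_words:
        if matched < len(rule_words) and word == rule_words[matched]:
            matched += 1

    return round(100 * matched / len(query_words), 2)


# Hypothetical rule text and two hypothetical scanned texts.
rule = "Neither the name of nor the names of its contributors may be used"
exact = "Neither the name of nor the names of its contributors may be used"
with_holder = (
    "Neither the name of Example Corp nor the names of its contributors may be used"
)

print(toy_score(rule, exact))        # 100.0
print(toy_score(rule, with_holder))  # 86.67, lowered by the two extra words
```

The second text differs from the rule only by a holder name, which is exactly
the kind of variable section this project targets.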

--------------------------------------------------------------------------------

Implementation
--------------

- **Enhanced the detection_log:**

  - Display `extra-words` when they are detected.

- **Added an extra-phrase marker, written [[n]], for the extra-words:**

  - The `extra-phrase` is delimited by double opening square brackets ``[[``
    and double closing square brackets ``]]``.
  - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions
    where `extra-words` may appear.
  - Here, `n` is the maximum number of `extra-words` permitted
    at that location.

- **Improved the score:**

  - Check whether `extra-words` appear at the position defined by
    the `extra-phrase`, and that their count does not exceed the maximum allowed.
  - If both conditions are satisfied, increase the match score to ``100``.

- **Reported the outcome in the detection_log:**

  - If the score was increased, meaning the `extra-words` are in the correct
    position and within the limit, show ``extra-words-permitted-in-rule`` in
    the `detection_log`.
  - If the `extra-words` are in the wrong place or exceed the maximum allowed,
    show ``extra-words`` in the `detection_log`.

- **Testing:**

  - Added tests for the `extra-phrase` functionality, such as
    `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that
    phrases are correctly identified and processed.
  - Implemented multiple tests to verify that `extra-words` appear in the correct
    position according to the rules and that the match score is updated correctly
    when they are within the allowed limit.
  - Covered various edge cases where `extra-words` are misplaced or exceed
    the maximum allowed count, ensuring the scoring and logging behave as expected.
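
The marker and scoring steps above can be sketched as follows. This is a
minimal, self-contained sketch and not the code from the pull requests: the
helper names (``parse_rule``, ``classify_match``), the whitespace tokenization,
the fallback score for rejected matches, and the sample rule text are all
assumptions made for illustration. Only the ``[[n]]`` syntax, the score of
``100``, and the two `detection_log` entries come from the description above.

```python
import re

# Marker syntax from the report: [[n]], where n is a non-negative integer.
EXTRA_PHRASE = re.compile(r"\[\[(\d+)\]\]")


def parse_rule(rule_text):
    """Split a rule into its words and a map {position: max extra words}.

    A position is the number of rule words that precede the marker.
    """
    words, allowed = [], {}
    for token in rule_text.split():
        marker = EXTRA_PHRASE.fullmatch(token)
        if marker:
            allowed[len(words)] = int(marker.group(1))
        else:
            words.append(token.lower())
    return words, allowed


def classify_match(rule_text, query_text):
    """Return (score, log_entry) for a query containing all rule words in order."""
    words, allowed = parse_rule(rule_text)
    extras = {}  # position -> number of extra words seen at that position
    matched = 0
    for token in query_text.lower().split():
        if matched < len(words) and token == words[matched]:
            matched += 1
        else:
            extras[matched] = extras.get(matched, 0) + 1

    if matched < len(words):
        return None, None  # partial matches are out of scope for this sketch
    if not extras:
        return 100.0, None
    # Every group of extra words must sit at a marker and respect its limit.
    if all(count <= allowed.get(pos, 0) for pos, count in extras.items()):
        return 100.0, "extra-words-permitted-in-rule"
    # Assumed toy penalty: share of rule words among all words in the region.
    total = matched + sum(extras.values())
    return round(100 * matched / total, 2), "extra-words"


# Hypothetical rule allowing up to three extra words after "name of".
rule = "Neither the name of [[3]] nor the names of its contributors"
ok = "Neither the name of Example Corp nor the names of its contributors"
too_many = "Neither the name of A B C D nor the names of its contributors"
misplaced = "Neither the name of nor the names of Example its contributors"

print(classify_match(rule, ok))         # (100.0, 'extra-words-permitted-in-rule')
print(classify_match(rule, too_many))   # (71.43, 'extra-words')
print(classify_match(rule, misplaced))  # (90.91, 'extra-words')
```

Storing each marker by the count of rule words that precede it keeps the
position check to a single dictionary lookup per group of extra words.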

Linked Pull Requests
--------------------

.. list-table::
   :widths: 10 60 30 10
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
     - Status
   * - 1
     - Display `extra-words` in `detection_log` if present
     - `aboutcode-org/scancode-toolkit#4402
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4402>`_
     - Merged
   * - 2
     - Improve score by supporting `extra_phrase` for `extra-words` in rules
     - `aboutcode-org/scancode-toolkit#4432
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4432>`_
     - Open

Related Issues
--------------

.. list-table::
   :widths: 10 60 30
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
   * - 1
     - `extra-words` does not show up in detection_log properly
     - `#4400
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4400>`_
   * - 2
     - Improve score when `extra-words` are found in the correct position
     - `#4420
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4420>`_

Pre GSoC Work
-------------

Before GSoC, I contributed the following pull requests:

- `Renaming the dependency attribute is_resolved to is_pinned
  <https://github.com/aboutcode-org/scancode-workbench/pull/638>`_
- `Add test for all PyPI METADATA versions
  <https://github.com/aboutcode-org/scancode-toolkit/pull/4180>`_
- `Add test for false positive GPL3 license
  <https://github.com/aboutcode-org/scancode-toolkit/pull/4106>`_
- `Add new rules for EUPL license
  <https://github.com/aboutcode-org/scancode-toolkit/pull/4204>`_
- `Add DUMB License and detection rule
  <https://github.com/aboutcode-org/scancode-toolkit/pull/4143>`_
- `Fixing the dead link by cross-reference in the documentation
  <https://github.com/aboutcode-org/purldb/pull/550>`_
128+
Post GSoC
129+
---------
130+
131+
I plan to continue contributing by adding `extra-phrase` support across many
132+
license rules. This will strengthen license detection by making it more accurate
133+
and flexible in handling variations within the rules.

Links
-----

* `Project Idea
  <https://github.com/aboutcode-org/aboutcode/wiki/GSOC-2025-project-ideas#have-variable-license-sections-in-license-rules>`_

* `Official GSoC project page
  <https://summerofcode.withgoogle.com/programs/2025/projects/EvCogGhq>`_

* `GSoC Proposal
  <https://docs.google.com/document/d/1vNgiO8g1RiKVym4qK_jVFsiUH2z5ztaz8Q5lW6NkRK0/edit?tab=t.0>`_

* `Project Board <https://github.com/orgs/aboutcode-org/projects/28>`_

Acknowledgements
----------------

I would like to thank my mentors:

- `Philippe Ombredanne`_
- `Ayan Sinha Mahapatra`_

They supported me throughout this journey. Whenever I faced a problem, we
discussed it in depth during our weekly status calls. Without their guidance
and constant help, completing this project would not have been possible.

I also plan to explore more projects in AboutCode and contribute whenever I get
time, because I would love to remain a part of this wonderful organization.
