From f437798d48e9e058e4a9b1b7918e267f94894873 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Wed, 16 Jul 2025 13:58:07 +0200 Subject: [PATCH 1/2] Add new FAQ entry on --license-text Signed-off-by: Philippe Ombredanne --- docs/source/misc/faq.rst | 59 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst index 7c3c396619..32ac8bf56d 100644 --- a/docs/source/misc/faq.rst +++ b/docs/source/misc/faq.rst @@ -82,3 +82,62 @@ When scanning binaries, the line numbers are just a relative indication of where a detection was found: there is no such thing as lines in a binary. The numbers reported are based on the strings extracted from the binaries, typically broken as new lines with each NULL character. + + +How does ``--license-text`` for ScanCode works exactly? +------------------------------------------------------------- + +I have a question about how ``--license-text`` for ScanCode works exactly: +Is the matched text that gets included into the result exactly the lines of text +from the input file that are covered by the ``start_line`` and ``end_line`` +fields of the result? I.e., if I would post-process the input file and extract +``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text`` +contents? Or is there some more "magic" involved when populating the +``matched_text`` field? + +ScanCode is a bit smarter than just start and end line, as matching is based on +words, not lines of the actual scanned text. +And a whole line may not always be matched. + +For instance with this command:: + + $ echo "Foo is a wonder piece of code. Licensed under the GPL. For support contact foo@bar.com " > tst + $ scancode --license --license-text --license-text-diagnostics --yaml - tst + ... 
+ license_detections: + - license_expression: gpl-1.0-plus + license_expression_spdx: GPL-1.0-or-later + matches: + - license_expression: gpl-1.0-plus + license_expression_spdx: GPL-1.0-or-later + from_file: tst + start_line: 1 + end_line: 1 + matcher: 2-aho + score: '100.0' + matched_length: 4 + match_coverage: '100.0' + rule_relevance: 100 + rule_identifier: gpl_85.RULE + rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_85.RULE + matched_text: Foo is a wonder piece of code. Licensed under the GPL. + For support contact foo@bar.com + matched_text_diagnostics: Licensed under the GPL. + ... + +then: + +- ``matched_text`` is based on ``start_line`` and ``end_line`` +- ``matched_text_diagnostics`` is based on the exact matched words (and it includes "tagged" gaps or extra) + + + + + + + + + + + + From fe5179e8003134c22d118d95b05a0d7770ea45fe Mon Sep 17 00:00:00 2001 From: Ayan Sinha Mahapatra Date: Tue, 22 Jul 2025 20:16:36 +0530 Subject: [PATCH 2/2] Fix doc tests and update docs Signed-off-by: Ayan Sinha Mahapatra --- docs/source/cli-reference/basic-options.rst | 11 ++++++-- docs/source/misc/faq.rst | 31 ++++++++------------- 2 files changed, 19 insertions(+), 23 deletions(-) diff --git a/docs/source/cli-reference/basic-options.rst b/docs/source/cli-reference/basic-options.rst index 961dff7021..6ecc5e6a0a 100644 --- a/docs/source/cli-reference/basic-options.rst +++ b/docs/source/cli-reference/basic-options.rst @@ -623,8 +623,8 @@ The option ``--license-text-diagnostics`` is a sub-option of and requires the options ``--license`` and ``--license-text``. - In the matched license text, include diagnostic highlights surrounding with square brackets [] - words that are not matched. + This adds a new attribute like the matched license text, but includes diagnostic highlights + surrounding with square brackets ``[]`` for words that are not matched. 
In a normal scan, whole lines of text are included in the matched license text, including parts that are possibly unmatched. @@ -645,9 +645,14 @@ obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction - With Diagnostics on:: + With Diagnostics on (new attribute with the matched text diagnostics):: "matched_text": + "License Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle + (http://www.bouncycastle.org) Permission is hereby granted, free of charge, to any person + obtaining a copy of this software and associated documentation files (the \"Software\"), + to deal in the Software without restriction + "matched_text_diagnostics": "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] [Bouncy] [Castle] ([http]://[www].[bouncycastle].[org]) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst index 32ac8bf56d..d3184d4892 100644 --- a/docs/source/misc/faq.rst +++ b/docs/source/misc/faq.rst @@ -84,24 +84,23 @@ reported are based on the strings extracted from the binaries, typically broken as new lines with each NULL character. -How does ``--license-text`` for ScanCode works exactly? +How does ``--license-text`` for ScanCode works exactly? ------------------------------------------------------------- -I have a question about how ``--license-text`` for ScanCode works exactly: -Is the matched text that gets included into the result exactly the lines of text -from the input file that are covered by the ``start_line`` and ``end_line`` +Is the matched text that gets included into the result exactly the lines of text +from the input file that are covered by the ``start_line`` and ``end_line`` fields of the result? 
I.e., if I would post-process the input file and extract -``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text`` +``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text`` contents? Or is there some more "magic" involved when populating the -``matched_text`` field? +``matched_text`` field? ScanCode is a bit smarter than just start and end line, as matching is based on -words, not lines of the actual scanned text. -And a whole line may not always be matched. +words, not lines of the actual scanned text. And a whole line may not always be matched. For instance with this command:: - $ echo "Foo is a wonder piece of code. Licensed under the GPL. For support contact foo@bar.com " > tst + $ echo "Foo is a wonder piece of code. Licensed under the GPL. " \ + "For support contact foo@bar.com " > tst $ scancode --license --license-text --license-text-diagnostics --yaml - tst ... license_detections: @@ -128,16 +127,8 @@ For instance with this command:: then: - ``matched_text`` is based on ``start_line`` and ``end_line`` -- ``matched_text_diagnostics`` is based on the exact matched words (and it includes "tagged" gaps or extra) - - - - - - - - - - +- ``matched_text_diagnostics`` is based on the exact matched words +Note that ``matched_text_diagnostics`` also includes "tagged" gaps or extra +unmatched words highlighted between the matched words.
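
The relationship described in this FAQ entry can be checked by hand with a small post-processing sketch. It is not part of ScanCode: the helper names ``extract_line_range`` and ``normalize`` are hypothetical, the input values are copied from the example scan in the entry, and comparing after collapsing whitespace is an assumption made here, since the reported text may be wrapped or trimmed differently from the file.

```python
def extract_line_range(text, start_line, end_line):
    """Return lines start_line..end_line (1-indexed, inclusive) of text."""
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])


def normalize(text):
    """Collapse whitespace runs so wrapping does not affect the comparison."""
    return " ".join(text.split())


# Content of the scanned file "tst" from the FAQ example (hypothetical copy).
scanned = (
    "Foo is a wonder piece of code. Licensed under the GPL. "
    "For support contact foo@bar.com \n"
)

# Fields copied from the example match reported for gpl_85.RULE.
match = {
    "start_line": 1,
    "end_line": 1,
    "matched_text": (
        "Foo is a wonder piece of code. Licensed under the GPL. "
        "For support contact foo@bar.com"
    ),
    "matched_text_diagnostics": "Licensed under the GPL.",
}

line_range = normalize(
    extract_line_range(scanned, match["start_line"], match["end_line"])
)

# Whole lines: expected to agree with matched_text in this example.
same_as_matched_text = line_range == normalize(match["matched_text"])

# Exact matched words: only a part of the line in this example.
same_as_diagnostics = line_range == normalize(match["matched_text_diagnostics"])
diagnostics_within_range = normalize(match["matched_text_diagnostics"]) in line_range

print(same_as_matched_text, same_as_diagnostics, diagnostics_within_range)
```

For this example the line range agrees with ``matched_text`` but not with ``matched_text_diagnostics``, which covers only the matched words. In other scans the diagnostics text may also carry square brackets around unmatched words, so a plain substring check like the last one would not hold in general.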