From f437798d48e9e058e4a9b1b7918e267f94894873 Mon Sep 17 00:00:00 2001
From: Philippe Ombredanne <pombredanne@aboutcode.org>
Date: Wed, 16 Jul 2025 13:58:07 +0200
Subject: [PATCH 1/2] Add new FAQ entry on --license-text

Signed-off-by: Philippe Ombredanne <pombredanne@aboutcode.org>
---
 docs/source/misc/faq.rst | 59 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst
index 7c3c396619..32ac8bf56d 100644
--- a/docs/source/misc/faq.rst
+++ b/docs/source/misc/faq.rst
@@ -82,3 +82,62 @@ When scanning binaries, the line numbers are just a relative indication of where
 a detection was found: there is no such thing as lines in a binary. The numbers
 reported are based on the strings extracted from the binaries, typically broken
 as new lines with each NULL character.
+
+
+How does ``--license-text`` for ScanCode works exactly?
+-------------------------------------------------------------
+
+I have a question about how ``--license-text`` for ScanCode works exactly:
+Is the matched text that gets included into the result exactly the lines of text
+from the input file that are covered by the ``start_line`` and ``end_line``
+fields of the result? I.e., if I would post-process the input file and extract
+``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text``
+contents? Or is there some more "magic" involved when populating the
+``matched_text`` field?
+
+ScanCode is a bit smarter than just start and end line, as matching is based on
+words, not lines of the actual scanned text.
+And a whole line may not always be matched.
+
+For instance with this command::
+
+    $ echo "Foo is a wonder piece of code. Licensed under the GPL. For support contact foo@bar.com " > tst
+    $ scancode --license --license-text --license-text-diagnostics --yaml - tst
+    ...
+        license_detections:
+            -   license_expression: gpl-1.0-plus
+                license_expression_spdx: GPL-1.0-or-later
+                matches:
+                    -   license_expression: gpl-1.0-plus
+                        license_expression_spdx: GPL-1.0-or-later
+                        from_file: tst
+                        start_line: 1
+                        end_line: 1
+                        matcher: 2-aho
+                        score: '100.0'
+                        matched_length: 4
+                        match_coverage: '100.0'
+                        rule_relevance: 100
+                        rule_identifier: gpl_85.RULE
+                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_85.RULE
+                        matched_text: Foo is a wonder piece of code. Licensed under the GPL.
+                            For support contact foo@bar.com
+                        matched_text_diagnostics: Licensed under the GPL.
+    ...
+
+then:
+
+- ``matched_text`` is based on ``start_line`` and ``end_line``
+- ``matched_text_diagnostics`` is based on the exact matched words (and it includes "tagged" gaps or extra)
+
+
+
+
+
+
+
+
+
+
+
+

From fe5179e8003134c22d118d95b05a0d7770ea45fe Mon Sep 17 00:00:00 2001
From: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Date: Tue, 22 Jul 2025 20:16:36 +0530
Subject: [PATCH 2/2] Fix doc tests and update docs

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
---
 docs/source/cli-reference/basic-options.rst | 11 ++++++--
 docs/source/misc/faq.rst                    | 31 ++++++++-------------
 2 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/docs/source/cli-reference/basic-options.rst b/docs/source/cli-reference/basic-options.rst
index 961dff7021..6ecc5e6a0a 100644
--- a/docs/source/cli-reference/basic-options.rst
+++ b/docs/source/cli-reference/basic-options.rst
@@ -623,8 +623,8 @@
         The option ``--license-text-diagnostics`` is a sub-option of and requires the options
         ``--license`` and ``--license-text``.
 
-    In the matched license text, include diagnostic highlights surrounding with square brackets []
-    words that are not matched.
+    This adds a new ``matched_text_diagnostics`` attribute, similar to the matched license text,
+    in which words that are not matched are highlighted with square brackets ``[]``.
 
     In a normal scan, whole lines of text are included in the matched license text, including parts
     that are possibly unmatched.
@@ -645,9 +645,14 @@
         obtaining a copy of this software and associated documentation files (the \"Software\"),
         to deal in the Software without restriction
 
-    With Diagnostics on::
+    With Diagnostics on (a new ``matched_text_diagnostics`` attribute is added)::
 
         "matched_text":
+        "License Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle
+        (http://www.bouncycastle.org) Permission is hereby granted, free of charge, to any person
+        obtaining a copy of this software and associated documentation files (the \"Software\"),
+        to deal in the Software without restriction
+        "matched_text_diagnostics":
         "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] [Bouncy] [Castle]
         ([http]://[www].[bouncycastle].[org]) Permission is hereby granted, free of charge, to any person
         obtaining a copy of this software and associated documentation files (the \"Software\"),
diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst
index 32ac8bf56d..d3184d4892 100644
--- a/docs/source/misc/faq.rst
+++ b/docs/source/misc/faq.rst
@@ -84,24 +84,23 @@ reported are based on the strings extracted from the binaries, typically broken
 as new lines with each NULL character.
 
 
-How does ``--license-text`` for ScanCode works exactly?
+How does ``--license-text`` work exactly in ScanCode?
 -------------------------------------------------------------
 
-I have a question about how ``--license-text`` for ScanCode works exactly:
-Is the matched text that gets included into the result exactly the lines of text
-from the input file that are covered by the ``start_line`` and ``end_line``
+Is the matched text that gets included in the result exactly the lines of text
+from the input file that are covered by the ``start_line`` and ``end_line``
 fields of the result? I.e., if I would post-process the input file and extract
-``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text``
+``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text``
 contents? Or is there some more "magic" involved when populating the
-``matched_text`` field?
+``matched_text`` field?
 
 ScanCode is a bit smarter than just start and end line, as matching is based on
-words, not lines of the actual scanned text.
-And a whole line may not always be matched.
+words, not lines, of the scanned text, so a match may cover only part of a line.
 
 For instance with this command::
 
-    $ echo "Foo is a wonder piece of code. Licensed under the GPL. For support contact foo@bar.com " > tst
+    $ echo "Foo is a wonder piece of code. Licensed under the GPL. " \
+        "For support contact foo@bar.com " > tst
     $ scancode --license --license-text --license-text-diagnostics --yaml - tst
     ...
         license_detections:
@@ -128,16 +127,8 @@ For instance with this command::
 then:
 
 - ``matched_text`` is based on ``start_line`` and ``end_line``
-- ``matched_text_diagnostics`` is based on the exact matched words (and it includes "tagged" gaps or extra)
-
-
-
-
-
-
-
-
-
-
+- ``matched_text_diagnostics`` is based on the exact matched words
 
+Note that ``matched_text_diagnostics`` also includes any gaps or extra unmatched
+words found between the matched words, tagged with square brackets ``[]``.