diff --git a/docs/source/cli-reference/basic-options.rst b/docs/source/cli-reference/basic-options.rst
index 961dff7021..6ecc5e6a0a 100644
--- a/docs/source/cli-reference/basic-options.rst
+++ b/docs/source/cli-reference/basic-options.rst
@@ -623,8 +623,8 @@
     The option ``--license-text-diagnostics`` is a sub-option of and requires the options
     ``--license`` and ``--license-text``.
 
-    In the matched license text, include diagnostic highlights surrounding with square brackets []
-    words that are not matched.
+    This adds a new attribute similar to the matched license text, but with diagnostic highlights:
+    words that are not matched are surrounded with square brackets ``[]``.
 
     In a normal scan, whole lines of text are included in the matched license text, including
     parts that are possibly unmatched.
@@ -645,9 +645,14 @@
         obtaining a copy of this software and associated documentation files (the \"Software\"),
         to deal in the Software without restriction
 
-    With Diagnostics on::
+    With Diagnostics on (new attribute with the matched text diagnostics)::
 
         "matched_text":
+        "License Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle
+        (http://www.bouncycastle.org) Permission is hereby granted, free of charge, to any person
+        obtaining a copy of this software and associated documentation files (the \"Software\"),
+        to deal in the Software without restriction
+        "matched_text_diagnostics":
         "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] [Bouncy] [Castle]
         ([http]://[www].[bouncycastle].[org]) Permission is hereby granted, free of charge, to any person
         obtaining a copy of this software and associated documentation files (the \"Software\"),
diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst
index 7c3c396619..d3184d4892 100644
--- a/docs/source/misc/faq.rst
+++ b/docs/source/misc/faq.rst
@@ -82,3 +82,53 @@
 When scanning binaries, the line numbers are just a relative indication of where a
 detection was found: there is no such thing as lines in a
 binary. The numbers reported are based on the strings extracted from the binaries, typically broken as new lines with each NULL character.
+
+
+How does ``--license-text`` for ScanCode work exactly?
+-------------------------------------------------------------
+
+Is the matched text that gets included in the result exactly the lines of text
+from the input file that are covered by the ``start_line`` and ``end_line``
+fields of the result? I.e., if I were to post-process the input file and extract
+``start_line`` to ``end_line`` from it, would I get exactly the ``matched_text``
+contents? Or is there some more "magic" involved when populating the
+``matched_text`` field?
+
+ScanCode does more than use the start and end lines: matching is based on words,
+not on lines of the scanned text, so a whole line may not always be matched.
+
+For instance with this command::
+
+    $ echo "Foo is a wonder piece of code. Licensed under the GPL. " \
+           "For support contact foo@bar.com " > tst
+    $ scancode --license --license-text --license-text-diagnostics --yaml - tst
+    ...
+    license_detections:
+      - license_expression: gpl-1.0-plus
+        license_expression_spdx: GPL-1.0-or-later
+        matches:
+          - license_expression: gpl-1.0-plus
+            license_expression_spdx: GPL-1.0-or-later
+            from_file: tst
+            start_line: 1
+            end_line: 1
+            matcher: 2-aho
+            score: '100.0'
+            matched_length: 4
+            match_coverage: '100.0'
+            rule_relevance: 100
+            rule_identifier: gpl_85.RULE
+            rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_85.RULE
+            matched_text: Foo is a wonder piece of code. Licensed under the GPL.
+                For support contact foo@bar.com
+            matched_text_diagnostics: Licensed under the GPL.
+    ...
+
+then:
+
+- ``matched_text`` is based on ``start_line`` and ``end_line``
+- ``matched_text_diagnostics`` is based on the exact matched words
+
+Note that ``matched_text_diagnostics`` also includes any gaps or extra unmatched
+words found between the matched words, "tagged" with square brackets ``[]``.
+
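
The distinction drawn in the new FAQ entry can be illustrated with a small Python sketch. It is not part of ScanCode: `lines_between`, `unmatched_words` and `strip_diagnostics` are hypothetical helper names, and the sketch assumes that `start_line` and `end_line` are 1-based and inclusive (as the one-line example suggests) and that the scanned text itself contains no literal square brackets.

```python
import re

# Hypothetical helpers for post-processing scan results; not a ScanCode API.

# Assumption: a bracketed run with no nested brackets marks one unmatched word.
_BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def lines_between(text, start_line, end_line):
    """Return lines start_line..end_line of text (1-based, inclusive).

    This mirrors how ``matched_text`` is described: whole lines, whether or
    not every word on them was matched.
    """
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])


def unmatched_words(diagnostics):
    """List the words tagged with [] in a ``matched_text_diagnostics`` value."""
    return _BRACKETED.findall(diagnostics)


def strip_diagnostics(diagnostics):
    """Drop the [] tags, leaving the words themselves in place."""
    return _BRACKETED.sub(r"\1", diagnostics)


# Sample taken from the basic-options.rst example, shortened.
SAMPLE = (
    "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] "
    "[Bouncy] [Castle] ([http]://[www].[bouncycastle].[org]) "
    "Permission is hereby granted"
)

if __name__ == "__main__":
    print(unmatched_words(SAMPLE))
    print(strip_diagnostics(SAMPLE))
    print(lines_between("first\nsecond\nthird\n", 2, 3))
```

Extracting `start_line` to `end_line` from the input file only approximates `matched_text`; the word-level detail lives in `matched_text_diagnostics`, which is why the sketch treats the two separately.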