
Commit f6e7c39

Add support for multiple inputs in the run management command (#1916)
Signed-off-by: tdruez <tdruez@aboutcode.org>
1 parent 2e4062e commit f6e7c39

File tree

5 files changed: +236 additions, -32 deletions


docs/quickstart.rst

Lines changed: 116 additions & 4 deletions
@@ -3,8 +3,8 @@
 QuickStart
 ==========
 
-Run a Scan (no installation required!)
---------------------------------------
+Run a Local Directory Scan (no installation required!)
+------------------------------------------------------
 
 The **fastest way** to get started and **scan a codebase** —
 **no installation needed** — is by using the latest
@@ -52,8 +52,120 @@ See the :ref:`RUN command <cli_run>` section for more details on this command.
 .. note::
     Not sure which pipeline to use? Check out :ref:`faq_which_pipeline`.
 
-Next Step: Local Installation
------------------------------
+Run a Remote Package Scan
+-------------------------
+
+Let's look at another example — this time scanning a **remote package archive** by
+providing its **download URL**:
+
+.. code-block:: bash
+
+    docker run --rm \
+      ghcr.io/aboutcode-org/scancode.io:latest \
+      run scan_single_package https://github.com/aboutcode-org/python-inspector/archive/refs/tags/v0.14.4.zip \
+      > results.json
+
+Let's break down what's happening here:
+
+- ``docker run --rm``
+  Runs a temporary container that is automatically removed after the scan completes.
+
+- ``ghcr.io/aboutcode-org/scancode.io:latest``
+  Uses the latest ScanCode.io image from GitHub Container Registry.
+
+- ``run scan_single_package <URL>``
+  Executes the ``scan_single_package`` pipeline, automatically fetching and analyzing
+  the package archive from the provided URL.
+
+- ``> results.json``
+  Writes the scan results to a local ``results.json`` file.
+
+Notice that the ``-v "$(pwd)":/codedrop`` option is **not required** in this case
+because the input is downloaded directly from the provided URL, rather than coming
+from your local filesystem.
+
+The result? A **complete scan of a remote package archive — no setup, one command!**
+
+Use PostgreSQL for Better Performance
+-------------------------------------
+
+By default, ScanCode.io uses a **temporary SQLite database** for simplicity.
+While this works well for quick scans, it has a few limitations — such as
+**no multiprocessing** and slower performance on large codebases.
+
+For improved speed and scalability, you can run your pipelines using a
+**PostgreSQL database** instead.
+
+Start a PostgreSQL Database Service
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First, start a PostgreSQL container in the background:
+
+.. code-block:: bash
+
+    docker run -d \
+      --name scancodeio-run-db \
+      -e POSTGRES_DB=scancodeio \
+      -e POSTGRES_USER=scancodeio \
+      -e POSTGRES_PASSWORD=scancodeio \
+      -e POSTGRES_INITDB_ARGS="--encoding=UTF-8 --lc-collate=en_US.UTF-8 --lc-ctype=en_US.UTF-8" \
+      -v scancodeio_pgdata:/var/lib/postgresql/data \
+      -p 5432:5432 \
+      postgres:17
+
+This command starts a new PostgreSQL service named ``scancodeio-run-db`` and stores its
+data in a named Docker volume called ``scancodeio_pgdata``.
+
+.. note::
+    You can stop and remove the PostgreSQL service once you are done using:
+
+    .. code-block:: bash
+
+        docker rm -f scancodeio-run-db
+
+.. tip::
+    The named volume ``scancodeio_pgdata`` ensures that your database data
+    **persists across runs**.
+    You can remove it later with ``docker volume rm scancodeio_pgdata`` if needed.
+
+Run a Docker Image Analysis Using PostgreSQL
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Once PostgreSQL is running, you can start a ScanCode.io pipeline
+using the same Docker image, connecting it to the PostgreSQL database container:
+
+.. code-block:: bash
+
+    docker run --rm \
+      --network host \
+      -e SCANCODEIO_NO_AUTO_DB=1 \
+      ghcr.io/aboutcode-org/scancode.io:latest \
+      run analyze_docker_image docker://alpine:3.22.1 \
+      > results.json
+
+Here's what's happening:
+
+- ``--network host``
+  Ensures the container can connect to the PostgreSQL service running on your host.
+
+- ``-e SCANCODEIO_NO_AUTO_DB=1``
+  Tells ScanCode.io **not** to create a temporary SQLite database, and instead use
+  the configured PostgreSQL connection defined in its default settings.
+
+- ``ghcr.io/aboutcode-org/scancode.io:latest``
+  Uses the latest ScanCode.io image from GitHub Container Registry.
+
+- ``run analyze_docker_image docker://alpine:3.22.1``
+  Runs the ``analyze_docker_image`` pipeline, scanning the given Docker image.
+
+- ``> results.json``
+  Saves the scan results to a local ``results.json`` file.
+
+The result? A **faster, multiprocessing-enabled scan** backed by PostgreSQL — ideal
+for large or complex analyses.
+
+Next Step: Installation
+-----------------------
 
 Install ScanCode.io, to **unlock all features**:
 
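Both quickstart commands redirect the scan output to a plain JSON ``results.json`` file, which can be inspected with the standard library alone. A minimal sketch, assuming the top-level ``headers`` list with ``uuid`` and ``input_sources`` keys that ScanCode.io results expose (``summarize_results`` is a hypothetical helper name, not part of the project):

```python
import json


def summarize_results(path="results.json"):
    """Return the project uuid and input filenames from a results file."""
    with open(path) as f:
        data = json.load(f)
    # The first "headers" entry describes the project that produced the scan.
    header = data["headers"][0]
    return {
        "uuid": header["uuid"],
        "inputs": [src["filename"] for src in header.get("input_sources", [])],
    }
```

For example, ``summarize_results("results.json")["inputs"]`` lists the archives that were fetched as inputs.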

scancodeio/__init__.py

Lines changed: 9 additions & 4 deletions
@@ -106,6 +106,9 @@ def combined_run():
     configuration.
     It combines the creation, execution, and result retrieval of the project into a
     single process.
+
+    Set SCANCODEIO_NO_AUTO_DB=1 to use the database configuration from the settings
+    instead of SQLite.
     """
     from django.core.checks.security.base import SECRET_KEY_INSECURE_PREFIX
     from django.core.management import execute_from_command_line
@@ -114,10 +117,12 @@ def combined_run():
     os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scancodeio.settings")
     secret_key = SECRET_KEY_INSECURE_PREFIX + get_random_secret_key()
     os.environ.setdefault("SECRET_KEY", secret_key)
-    os.environ.setdefault("SCANCODEIO_DB_ENGINE", "django.db.backends.sqlite3")
-    os.environ.setdefault("SCANCODEIO_DB_NAME", "scancodeio.sqlite3")
-    # Disable multiprocessing
-    os.environ.setdefault("SCANCODEIO_PROCESSES", "0")
+
+    # Default to SQLite unless SCANCODEIO_NO_AUTO_DB is provided
+    if not os.getenv("SCANCODEIO_NO_AUTO_DB"):
+        os.environ.setdefault("SCANCODEIO_DB_ENGINE", "django.db.backends.sqlite3")
+        os.environ.setdefault("SCANCODEIO_DB_NAME", "scancodeio.sqlite3")
+        os.environ.setdefault("SCANCODEIO_PROCESSES", "0")  # Disable multiprocessing
 
     sys.argv.insert(1, "run")
     execute_from_command_line(sys.argv)
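The gating above is a small, testable pattern on its own: apply `setdefault` fallbacks only when an opt-out variable is absent, so user-exported values always win. A minimal standalone sketch, using the environment variable names from the commit (`configure_db_defaults` is a hypothetical name, not ScanCode.io API):

```python
import os


def configure_db_defaults(environ=os.environ):
    """Default to SQLite settings unless SCANCODEIO_NO_AUTO_DB opts out."""
    if not environ.get("SCANCODEIO_NO_AUTO_DB"):
        # setdefault keeps any value the caller already exported.
        environ.setdefault("SCANCODEIO_DB_ENGINE", "django.db.backends.sqlite3")
        environ.setdefault("SCANCODEIO_DB_NAME", "scancodeio.sqlite3")
        environ.setdefault("SCANCODEIO_PROCESSES", "0")  # disable multiprocessing
    return environ
```

Passing a plain dict instead of `os.environ` makes the behavior easy to verify: an empty dict gains the SQLite defaults, while a dict containing `SCANCODEIO_NO_AUTO_DB` is left untouched.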

scanpipe/management/commands/__init__.py

Lines changed: 13 additions & 10 deletions
@@ -284,20 +284,23 @@ def validate_pipelines(pipelines_data):
     return pipelines_data
 
 
-def extract_tag_from_input_files(input_files):
+def extract_tag_from_input_file(file_location):
     """
-    Add support for the ":tag" suffix in file location.
+    Parse a file location with optional tag suffix.
 
     For example: "/path/to/file.zip:tag"
     """
-    input_files_data = {}
-    for file in input_files:
-        if ":" in file:
-            key, value = file.split(":", maxsplit=1)
-            input_files_data.update({key: value})
-        else:
-            input_files_data.update({file: ""})
-    return input_files_data
+    if ":" in file_location:
+        cleaned_location, tag = file_location.split(":", maxsplit=1)
+        return cleaned_location, tag
+    return file_location, ""
+
+
+def extract_tag_from_input_files(input_files):
+    """Parse multiple file locations with optional tag suffixes."""
+    return dict(
+        extract_tag_from_input_file(file_location) for file_location in input_files
+    )
 
 
 def validate_input_files(input_files):
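The refactor splits the old dict-building loop into a single-location helper plus a thin wrapper, which is what lets `run.py` reuse the tag parsing per input. The two functions are small enough to exercise standalone; this sketch reproduces the committed logic minus the surrounding module:

```python
def extract_tag_from_input_file(file_location):
    """Split "/path/to/file.zip:tag" into ("/path/to/file.zip", "tag")."""
    if ":" in file_location:
        # maxsplit=1 means only the first colon separates location from tag,
        # so any further colons stay inside the tag value.
        cleaned_location, tag = file_location.split(":", maxsplit=1)
        return cleaned_location, tag
    return file_location, ""


def extract_tag_from_input_files(input_files):
    """Map each location to its tag, e.g. {"a.zip": "from", "b.zip": ""}."""
    return dict(extract_tag_from_input_file(location) for location in input_files)
```

Note the `maxsplit=1` choice: `"file.ext:tag1:tag2"` parses to `("file.ext", "tag1:tag2")`, which is exactly what the new unit test in `test_commands.py` asserts.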

scanpipe/management/commands/run.py

Lines changed: 38 additions & 14 deletions
@@ -20,13 +20,15 @@
 # ScanCode.io is a free software code scanning tool from nexB Inc. and others.
 # Visit https://github.com/aboutcode-org/scancode.io for support and download.
 
+from collections import defaultdict
 from pathlib import Path
 
 from django.core.management import call_command
 from django.core.management.base import BaseCommand
 from django.core.management.base import CommandError
 from django.utils.crypto import get_random_string
 
+from scanpipe.management.commands import extract_tag_from_input_file
 from scanpipe.pipes.fetch import SCHEME_TO_FETCHER_MAPPING
 
 
@@ -42,12 +44,16 @@ def add_arguments(self, parser):
             help=(
                 "One or more pipeline to run. "
                 "The pipelines executed based on their given order. "
-                'Groups can be provided using the "pipeline_name:option1,option2"'
-                " syntax."
+                'Groups can be provided using the "pipeline_name:option1,option2" '
+                "syntax."
             ),
         )
         parser.add_argument(
-            "input_location", help="Input location: file, directory, and URL supported."
+            "input_location",
+            help=(
+                "Input location: file, directory, and URL supported. "
+                'Multiple values can be provided using the "input1,input2" syntax.'
+            ),
         )
         parser.add_argument("--project", required=False, help="Project name.")
         parser.add_argument(
@@ -68,22 +74,40 @@ def handle(self, *args, **options):
             "pipeline": pipelines,
             "execute": True,
             "verbosity": 0,
+            **self.get_input_options(input_location),
         }
 
-        if input_location.startswith(tuple(SCHEME_TO_FETCHER_MAPPING.keys())):
-            create_project_options["input_urls"] = [input_location]
-        else:
-            input_path = Path(input_location)
-            if not input_path.exists():
-                raise CommandError(f"{input_location} not found.")
-            if input_path.is_file():
-                create_project_options["input_files"] = [input_location]
-            else:
-                create_project_options["copy_codebase"] = input_location
-
         # Run the database migrations in case the database is not created or outdated.
         call_command("migrate", verbosity=0, interactive=False)
         # Create a project with proper inputs and execute the pipeline(s)
         call_command("create-project", project_name, **create_project_options)
         # Print the results for the specified format on stdout
        call_command("output", project=project_name, format=[output_format], print=True)
+
+    @staticmethod
+    def get_input_options(input_location):
+        """
+        Parse a comma-separated list of input locations and convert them into options
+        for the `create-project` command.
+        """
+        input_options = defaultdict(list)
+
+        for location in input_location.split(","):
+            if location.startswith(tuple(SCHEME_TO_FETCHER_MAPPING.keys())):
+                input_options["input_urls"].append(location)
+
+            else:
+                cleaned_location, _ = extract_tag_from_input_file(location)
+                input_path = Path(cleaned_location)
+                if not input_path.exists():
+                    raise CommandError(f"{location} not found.")
+                if input_path.is_file():
+                    input_options["input_files"].append(location)
+                else:
+                    if input_options["copy_codebase"]:
+                        raise CommandError(
                            "Only one codebase directory can be provided as input."
+                        )
+                    input_options["copy_codebase"] = location
+
+        return input_options
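Stripped of the Django command plumbing, `get_input_options` reduces to a three-way classification per comma-separated input: fetchable URL, local file, or a single codebase directory. A runnable sketch, with `FETCHER_SCHEMES` as an assumed stand-in for the real `SCHEME_TO_FETCHER_MAPPING` keys and plain `ValueError` in place of Django's `CommandError`:

```python
from collections import defaultdict
from pathlib import Path

# Assumed subset of fetcher schemes; ScanCode.io derives the real
# tuple from SCHEME_TO_FETCHER_MAPPING.
FETCHER_SCHEMES = ("http://", "https://", "docker://")


def get_input_options(input_location):
    """Classify comma-separated inputs into create-project options."""
    input_options = defaultdict(list)
    for location in input_location.split(","):
        if location.startswith(FETCHER_SCHEMES):
            input_options["input_urls"].append(location)
            continue
        # Strip an optional ":tag" suffix before checking the filesystem.
        cleaned_location = location.split(":", maxsplit=1)[0]
        input_path = Path(cleaned_location)
        if not input_path.exists():
            raise ValueError(f"{location} not found.")
        if input_path.is_file():
            input_options["input_files"].append(location)
        elif input_options["copy_codebase"]:
            raise ValueError("Only one codebase directory can be provided as input.")
        else:
            input_options["copy_codebase"] = location
    return input_options
```

The commit keeps the untouched `location` (tag suffix included) in the returned options so that downstream input handling can still read the tag; only the existence check uses the cleaned path.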

scanpipe/tests/test_commands.py

Lines changed: 60 additions & 0 deletions
@@ -984,6 +984,53 @@ def test_scanpipe_management_command_run(self):
         self.assertEqual("do_nothing", runs[1]["pipeline_name"])
         self.assertEqual(["Group1", "Group2"], runs[1]["selected_groups"])
 
+    @mock.patch("requests.sessions.Session.get")
+    def test_scanpipe_management_command_run_multiple_inputs(self, mock_get):
+        source_download_url = "https://example.com/z-source.zip#from"
+        bin_download_url = "https://example.com/z-bin.zip#to"
+        mock_get.side_effect = [
+            make_mock_response(url=source_download_url),
+            make_mock_response(url=bin_download_url),
+        ]
+
+        out = StringIO()
+        inputs = [
+            # copy_codebase option
+            str(self.data / "codebase"),
+            # input_files option
+            str(self.data / "d2d" / "jars" / "from-flume-ng-node-1.9.0.zip"),
+            str(self.data / "d2d" / "jars" / "to-flume-ng-node-1.9.0.zip"),
+            # input_urls option
+            source_download_url,
+            bin_download_url,
+        ]
+        joined_locations = ",".join(inputs)
+        with redirect_stdout(out):
+            call_command("run", "download_inputs", joined_locations)
+
+        json_data = json.loads(out.getvalue())
+        headers = json_data["headers"]
+        project_uuid = headers[0]["uuid"]
+        project = Project.objects.get(uuid=project_uuid)
+
+        expected = [
+            "from-flume-ng-node-1.9.0.zip",
+            "to-flume-ng-node-1.9.0.zip",
+            "z-bin.zip",
+            "z-source.zip",
+        ]
+        self.assertEqual(expected, sorted(project.input_files))
+
+        input_sources = headers[0]["input_sources"]
+        self.assertEqual("z-bin.zip", input_sources[2]["filename"])
+        self.assertEqual("to", input_sources[2]["tag"])
+        self.assertEqual("z-source.zip", input_sources[3]["filename"])
+        self.assertEqual("from", input_sources[3]["tag"])
+
+        codebase_files = [path.name for path in project.codebase_path.glob("*")]
+        expected = ["a.txt", "b.txt", "c.txt"]
+        self.assertEqual(expected, sorted(codebase_files))
+
     @mock.patch("scanpipe.models.Project.get_latest_output")
     @mock.patch("requests.post")
     @mock.patch("requests.sessions.Session.get")
@@ -1414,6 +1461,19 @@ def test_scanpipe_management_command_verify_project(self):
             stdout=out,
         )
 
+    def test_scanpipe_management_command_extract_tag_from_input_file(self):
+        extract_tag = commands.extract_tag_from_input_file
+        expected = ("file.ext", "")
+        self.assertEqual(expected, extract_tag("file.ext"))
+        expected = ("file.ext", "")
+        self.assertEqual(expected, extract_tag("file.ext:"))
+        expected = ("file.ext", "tag")
+        self.assertEqual(expected, extract_tag("file.ext:tag"))
+        expected = ("file.ext", "tag1:tag2")
+        self.assertEqual(expected, extract_tag("file.ext:tag1:tag2"))
+        expected = ("file.ext", "tag1,tag2")
+        self.assertEqual(expected, extract_tag("file.ext:tag1,tag2"))
+
 
 class ScanPipeManagementCommandMixinTest(TestCase):
     class CreateProjectCommand(
