Problem statement
Introduction
The crawl module is responsible for querying GCP resources via REST API. As of the writing of this document, the current version of the crawl.py comprises 29 methods that retrieve information about different GCP resources. However, the list of supported resources is expected to expand significantly in the future. Therefore, a modularized crawler module will be essential to ensure the scalability and ease of maintenance of the scanning functionalities over time.
Moreover, there's quite a bit of repetitive code in the system. It would be great to minimize these repetitions and have a single point where we can easily manage and maintain the code. By doing so, we can enhance the overall robustness of the codebase.
Explanation with references
For instance, the signature of the methods in crawl.py either takes a discovery.Resource or Credentials. The credentials are then used to create a client of type discovery.Resource so that we can query the resources from GCP. For example, the get_managed_zones() function takes credentials and creates the following discovery.Resource to call the discovery API.
service = discovery.build(
    "dns", "v1", credentials=credentials, cache_discovery=False)
Reference code can be found here in the main gcp_scanner codebase. Additionally, if we look at another code block, we can see that this same code is repeated. Hence, this code needs to be refactored and modularized using an object-oriented paradigm.
Another example: building the DNS service: occurrence 1, occurrence 2
In the above examples and many other places, we need to build a client and then pass it into the crawl function. However, the current process of building the client is neither consistent nor well maintained. For example, here a gke_client is passed to the crawl.get_gke_clusters() function, whereas in this case the credentials are passed directly to the crawl function. Hence, it is necessary to maintain a consistent way of implementing clients so that it becomes straightforward for new contributors to add additional clients and GCP resources.
High-level acceptance criteria
This refactoring plan should achieve the following objectives.
- Implement a crawl package.
- Categorize the existing crawl methods and put them into their respective modules.
- Take advantage of the object-oriented capabilities of Python.
- Improve the DRYness of the code.
- Fix the unit tests broken as a result of the refactoring.
Proposed solution
Crawler
A new crawler package will be implemented. The location of the package is src/gcp_scanner/crawler.
Next, the existing crawl methods will be categorized by resource type and placed into their own modules. The Factory design pattern will be used so that the code scales as new resources are added. For example, the methods listed under Compute Resources below all fetch information about compute resources and would live in a single module; a sketch of the factory is given after the module list.
Expected modules
A category-wise list of the existing functions that need to be put into their own module is given below.
Compute Resources
get_compute_instances_names
get_compute_images_names
get_compute_disks_names
get_static_ips
get_compute_snapshots
get_subnets
get_firewall_rules
GCP App Resources
get_app_services
GCP storage resources
get_bucket_names
GCP DNS resources
get_managed_zones
list_dns_policies
GKE resources
get_gke_clusters
get_gke_images
SQL instances
get_sql_instances
Bigquery
get_bq
PubSub Subscriptions
get_pubsub_subscriptions
CloudFunctions
get_cloudfunctions
Bigtable
get_bigtable_instances
Spanner
get_spanner_instances
FileStore
get_filestore_instances
KMS
get_kms_keys
Endpoints
get_endpoints
Serviceusage
list_services
Sourcerepo
list_sourcerepo
Cloudresourcemanager
get_iam_policy
fetch_project_info
get_project_list
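To make the factory idea concrete, a minimal sketch is given below. The module layout, the class names (ICrawler, ComputeInstancesCrawler, CrawlFactory) and the crawl() signature are assumptions for this proposal, chosen to match the pseudocode later in this document rather than existing code; every other resource category from the list above would be registered in the same way.

# src/gcp_scanner/crawler/interface_crawler.py (hypothetical layout)
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ICrawler(ABC):
  """Common interface that every resource crawler implements."""

  @abstractmethod
  def crawl(self, project_id: str, service: Any) -> List[Dict[str, Any]]:
    """Query GCP for one resource type and return the raw results."""


# src/gcp_scanner/crawler/compute_instances_crawler.py (hypothetical)
class ComputeInstancesCrawler(ICrawler):
  """Fetches compute instances from an already-built discovery client."""

  def crawl(self, project_id, service):
    results = []
    request = service.instances().aggregatedList(project=project_id)
    while request is not None:
      response = request.execute()
      for _, scoped_list in response.get("items", {}).items():
        results.extend(scoped_list.get("instances", []))
      request = service.instances().aggregatedList_next(
          previous_request=request, previous_response=response)
    return results


# src/gcp_scanner/crawler/crawler_factory.py (hypothetical)
class CrawlFactory:
  """Maps a crawl name from the scan config to a crawler instance."""

  _crawlers = {
      "compute_instances": ComputeInstancesCrawler,
      # the remaining resource types are registered here in the same way
  }

  @classmethod
  def create_crawler(cls, name):
    return cls._crawlers[name]()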
Client
We also need to refactor and implement clients to reduce code repetition. Using the factory design pattern, we can move the creation of discovery.Resource clients into a single place.
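A minimal sketch of such a client factory is shown below, assuming the ClientFactory / get_client / get_service names used in the pseudocode in the next section; only the DNS and compute clients are spelled out, and the other services would follow the same pattern.

# src/gcp_scanner/client/client_factory.py (hypothetical layout)
from googleapiclient import discovery


class DNSClient:
  """Builds the discovery client for the Cloud DNS API."""

  def get_service(self, credentials):
    # The same discovery.build call that is currently repeated in crawl.py.
    return discovery.build(
        "dns", "v1", credentials=credentials, cache_discovery=False)


class ComputeClient:
  """Builds the discovery client for the Compute Engine API."""

  def get_service(self, credentials):
    return discovery.build(
        "compute", "v1", credentials=credentials, cache_discovery=False)


class ClientFactory:
  """Single place where discovery.Resource clients are created."""

  _clients = {
      "dns": DNSClient,
      "compute": ComputeClient,
      # appengine, storage, sqladmin, ... follow the same pattern
  }

  @classmethod
  def get_client(cls, name):
    return cls._clients[name]()

With this in place, every crawl method receives a ready-made discovery.Resource instead of raw credentials, which removes the inconsistency described in the problem statement.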
Idea for refactoring the repetitive if statements in scanner.py
# crawl_name: client_name map
# crawl_name is the string that the crawl factory dispatches on
# client_name is the string that the current client factory dispatches on
# crawl_name in crawl_client_map is chosen so that it matches the keys in the config file
crawl_client_map = {
    'compute_instances': 'compute',
    'compute_images': 'compute',
    'machine_images': 'compute',
    'compute_disks': 'compute',
    'static_ips': 'compute',
    'app_services': 'appengine',
    'storage_buckets': 'storage',
    'managed_zones': 'managed_zones',
    # and so on
}

# replacing the repetitive if statements
for crawler_name, client_name in crawl_client_map.items():
  if is_set(scan_config, crawler_name):
    CrawlFactory.create_crawler(crawler_name).crawl(
        project_id,
        ClientFactory.get_client(client_name).get_service(credentials),
    )
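Here is_set refers to the config check already used in scanner.py to decide whether a resource type is enabled; a minimal stand-in (the exact scan-config layout is an assumption here) could look like this:

def is_set(config, config_setting):
  # No config supplied means "scan everything"; otherwise honour the flag
  # stored for this resource type in the scan config.
  if config is None:
    return True
  return bool(config.get(config_setting))

With the map in place, supporting a new resource only requires a new map entry plus the corresponding crawler and client registrations, instead of another copy of the if block.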
Testing
Adapt the existing unit tests to the new modular structure.
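As an illustration, a unit test written against the hypothetical factories sketched above might look like the following; the fake discovery service exercises only the crawler's parsing logic, and all names follow the sketches in this document rather than the current test suite.

# test_compute_crawler.py (hypothetical adaptation of an existing test)
from unittest import mock

from gcp_scanner.crawler.crawler_factory import CrawlFactory  # assumed module path


def test_compute_instances_crawler():
  # Fake discovery service: one aggregatedList page, then no next page.
  fake_service = mock.MagicMock()
  fake_service.instances().aggregatedList().execute.return_value = {
      "items": {"zones/us-central1-a": {"instances": [{"name": "vm-1"}]}}
  }
  fake_service.instances().aggregatedList_next.return_value = None

  crawler = CrawlFactory.create_crawler("compute_instances")
  instances = crawler.crawl("test-project", fake_service)

  assert instances == [{"name": "vm-1"}]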
Task List
Client
- Subtasks: Implement client factory for compute client. #154
- Subtasks: Implement client factory for App Resources client. #157
- Subtasks: Implement client factory for storage client. #161
- Subtask: Implement client factory for DNS client. #155
- Subtasks: Implement client factory for SQL instances client. #163
- Subtasks: Implement client factory for Bigquery client. #165
- Subtasks: Implement client factory for PubSub client. #167
- Subtasks: Implement client factory for CloudFunctions client. #169
- Subtasks: Implement client factory for Bigtable client. #171
- Subtasks: Implement client factory for Spanner client. #172
- Subtasks: Implement client factory for FileStore client. #173
- Subtasks: Implement client factory for KMS client. #174
- Subtasks: Implement client factory for service management client. #175
- Subtasks: Implement client factory for Sourcerepo client. #176
- Subtasks: Implement client factory for Cloudresourcemanager client. #177
- Subtasks: Implement client factory for serviceusage client. #180
- Subtasks: Implement client factory for IAM client. #182
The following methods in crawl.py could not be refactored because their signatures are slightly different:
get_gke_clusters
get_gke_images
get_sas_for_impersonation
Crawl
- Subtasks: Implement crawl factory for compute resources. #188
- Subtasks: Implement crawl factory for App Resources. #192
- Subtasks: Implement crawl factory for DNS resource. #203
- Subtasks: Implement crawl factory for SQL instances resource. #205
- Subtasks: Implement crawl factory for Bigquery resource. #206
- Subtasks: Implement crawl factory for PubSub resource. #207
- Subtasks: Implement crawl factory for CloudFunctions resource. #208
- Subtasks: Implement crawl factory for Bigtable resource. #199
- Subtasks: Implement crawl factory for Spanner resource. #209
- Subtasks: Implement crawl factory for FileStore resource. #201
- Subtasks: Implement crawl factory for KMS resource. #210
- Subtasks: Implement crawl factory for service usage resource. #211
- Subtasks: Implement crawl factory for Sourcerepo resource. #212
- Subtasks: Implement crawl factory for Cloudresourcemanager resource. #213
- Subtask: Implement servicemanagement crawler. #229
- Subtask: Implement endpoints crawler. #230
- Subtask: Implement service_accounts crawler. #231