Problem statement
Introduction
The crawl module is responsible for querying GCP resources via REST API. As of the writing of this document, the current version of the crawl.py comprises 29 methods that retrieve information about different GCP resources. However, the list of supported resources is expected to expand significantly in the future. Therefore, a modularized crawler module will be essential to ensure the scalability and ease of maintenance of the scanning functionalities over time.
Moreover, there's quite a bit of repetitive code in the system. It would be great to minimize these repetitions and have a single point where we can easily manage and maintain the code. By doing so, we can enhance the overall robustness of the codebase.
Explanation with references
For instance, the signature of the methods in crawl.py either takes a discovery.Resource or Credentials. The credentials are then used to create a client of type discovery.Resource so that we can query the resources from GCP. For example, the get_managed_zones() function takes credentials and creates the following discovery.Resource to call the discovery API.
service = discovery.build(
    "dns", "v1", credentials=credentials, cache_discovery=False)
Reference code can be found here in the main gcp_scanner codebase. Additionally, if we look at another code block, we can see that this same code is repeated. Hence, this code needs to be refactored and modularized using an object-oriented paradigm.
Another example: building the DNS service: occurrence 1, occurrence 2
In the above examples and many other places, we need to build a client and then pass it into the crawl function. However, the current process of building the client is neither consistent nor well maintained. For example, here a gke_client is passed to the crawl.get_gke_clusters() function, whereas in this case the credentials are passed directly to the crawl function. Hence, it is necessary to maintain a consistent way of implementing clients so that it becomes straightforward for new contributors to add additional clients and GCP resources.
High-level acceptance criteria
This refactoring plan should achieve the following objectives.
- Implement a crawl package.
- Categorize the existing crawl methods and put them into their respective modules.
- Take advantage of the object-oriented capabilities of Python.
- Improve the DRYness of the code.
- Fix the unit tests broken as a result of the refactoring.
Proposed solution
Crawler
A new crawler package will be implemented. The location of the package is src/gcp_scanner/crawler.
Next, the existing crawl methods will be categorized by resource type and placed into their own modules. The Factory design pattern will be used so that the code scales as new resources are added. For example, the methods listed under Compute Resources below all fetch information about compute resources and would live in a single module; a sketch of the factory is given after the module list.
Expected modules
A category-wise list of the existing functions that need to be put into their own module is given below.
Compute Resources
get_compute_instances_names
get_compute_images_names
get_compute_disks_names
get_static_ips
get_compute_snapshots
get_subnets
get_firewall_rules
GCP App Resources
get_app_services
GCP storage resources
get_bucket_names
GCP DNS resources
get_managed_zones
list_dns_policies
GKE resources
get_gke_clusters
get_gke_images
SQL instances
get_sql_instances
Bigquery
get_bq
PubSub Subscriptions
get_pubsub_subscriptions
CloudFunctions
get_cloudfunctions
Bigtable
get_bigtable_instances
Spanner
get_spanner_instances
FileStore
get_filestore_instances
KMS
get_kms_keys
Endpoints
get_endpoints
Serviceusage
list_services
Sourcerepo
list_sourcerepo
Cloudresourcemanager
get_iam_policy
fetch_project_info
get_project_list
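To make the factory idea concrete, a minimal sketch is given below. The module layout, the class names (ICrawler, ComputeInstancesCrawler, CrawlFactory) and the crawl() signature are assumptions for this proposal, chosen to match the pseudocode later in this document rather than existing code; every other resource category from the list above would be registered in the same way.

# src/gcp_scanner/crawler/interface_crawler.py (hypothetical layout)
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ICrawler(ABC):
  """Common interface that every resource crawler implements."""

  @abstractmethod
  def crawl(self, project_id: str, service: Any) -> List[Dict[str, Any]]:
    """Query GCP for one resource type and return the raw results."""


# src/gcp_scanner/crawler/compute_instances_crawler.py (hypothetical)
class ComputeInstancesCrawler(ICrawler):
  """Fetches compute instances from an already-built discovery client."""

  def crawl(self, project_id, service):
    results = []
    request = service.instances().aggregatedList(project=project_id)
    while request is not None:
      response = request.execute()
      for _, scoped_list in response.get("items", {}).items():
        results.extend(scoped_list.get("instances", []))
      request = service.instances().aggregatedList_next(
          previous_request=request, previous_response=response)
    return results


# src/gcp_scanner/crawler/crawler_factory.py (hypothetical)
class CrawlFactory:
  """Maps a crawl name from the scan config to a crawler instance."""

  _crawlers = {
      "compute_instances": ComputeInstancesCrawler,
      # the remaining resource types are registered here in the same way
  }

  @classmethod
  def create_crawler(cls, name):
    return cls._crawlers[name]()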
Client
We also need to refactor and implement clients to reduce code repetition. Using the factory design pattern, we can move the creation of discovery.Resource clients into a single place.
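A minimal sketch of such a client factory is shown below, assuming the ClientFactory / get_client / get_service names used in the pseudocode in the next section; only the DNS and compute clients are spelled out, and the other services would follow the same pattern.

# src/gcp_scanner/client/client_factory.py (hypothetical layout)
from googleapiclient import discovery


class DNSClient:
  """Builds the discovery client for the Cloud DNS API."""

  def get_service(self, credentials):
    # The same discovery.build call that is currently repeated in crawl.py.
    return discovery.build(
        "dns", "v1", credentials=credentials, cache_discovery=False)


class ComputeClient:
  """Builds the discovery client for the Compute Engine API."""

  def get_service(self, credentials):
    return discovery.build(
        "compute", "v1", credentials=credentials, cache_discovery=False)


class ClientFactory:
  """Single place where discovery.Resource clients are created."""

  _clients = {
      "dns": DNSClient,
      "compute": ComputeClient,
      # appengine, storage, sqladmin, ... follow the same pattern
  }

  @classmethod
  def get_client(cls, name):
    return cls._clients[name]()

With this in place, every crawl method receives a ready-made discovery.Resource instead of raw credentials, which removes the inconsistency described in the problem statement.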
Idea for refactoring the repetitive if statements in scanner.py
# crawl_name: client_name map
# crawl_name is the string that the crawl factory dispatches on
# client_name is the string that the current client factory dispatches on
# crawl_name in crawl_client_map is chosen so that it matches the keys in the config file
crawl_client_map = {
    'compute_instances': 'compute',
    'compute_images': 'compute',
    'machine_images': 'compute',
    'compute_disks': 'compute',
    'static_ips': 'compute',
    'app_services': 'appengine',
    'storage_buckets': 'storage',
    'managed_zones': 'managed_zones',
    # and so on
}

# replacing the repetitive if statements
for crawler_name, client_name in crawl_client_map.items():
  if is_set(scan_config, crawler_name):
    CrawlFactory.create_crawler(crawler_name).crawl(
        project_id,
        ClientFactory.get_client(client_name).get_service(credentials),
    )
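Here is_set refers to the config check already used in scanner.py to decide whether a resource type is enabled; a minimal stand-in (the exact scan-config layout is an assumption here) could look like this:

def is_set(config, config_setting):
  # No config supplied means "scan everything"; otherwise honour the flag
  # stored for this resource type in the scan config.
  if config is None:
    return True
  return bool(config.get(config_setting))

With the map in place, supporting a new resource only requires a new map entry plus the corresponding crawler and client registrations, instead of another copy of the if block.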
Testing
Adapt the existing unit tests to the new modular structure.
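As an illustration, a unit test written against the hypothetical factories sketched above might look like the following; the fake discovery service exercises only the crawler's parsing logic, and all names follow the sketches in this document rather than the current test suite.

# test_compute_crawler.py (hypothetical adaptation of an existing test)
from unittest import mock

from gcp_scanner.crawler.crawler_factory import CrawlFactory  # assumed module path


def test_compute_instances_crawler():
  # Fake discovery service: one aggregatedList page, then no next page.
  fake_service = mock.MagicMock()
  fake_service.instances().aggregatedList().execute.return_value = {
      "items": {"zones/us-central1-a": {"instances": [{"name": "vm-1"}]}}
  }
  fake_service.instances().aggregatedList_next.return_value = None

  crawler = CrawlFactory.create_crawler("compute_instances")
  instances = crawler.crawl("test-project", fake_service)

  assert instances == [{"name": "vm-1"}]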
Task List
Client
- Subtasks: Implement client factory for compute client. #154
- Subtasks: Implement client factory for App Resources client. #157
- Subtasks: Implement client factory for storage client. #161
- Subtask: Implement client factory for DNS client. #155
- Subtasks: Implement client factory for SQL instances client. #163
- Subtasks: Implement client factory for Bigquery client. #165
- Subtasks: Implement client factory for PubSub client. #167
- Subtasks: Implement client factory for CloudFunctions client. #169
- Subtasks: Implement client factory for Bigtable client. #171
- Subtasks: Implement client factory for Spanner client. #172
- Subtasks: Implement client factory for FileStore client. #173
- Subtasks: Implement client factory for KMS client. #174
- Subtasks: Implement client factory for service management client. #175
- Subtasks: Implement client factory for Sourcerepo client. #176
- Subtasks: Implement client factory for Cloudresourcemanager client. #177
- Subtasks: Implement client factory for serviceusage client. #180
- Subtasks: Implement client factory for IAM client. #182
The following methods in crawl.py could not be refactored because their signatures are slightly different:
get_gke_clusters
get_gke_images
get_sas_for_impersonation
Crawl
- Subtasks: Implement crawl factory for compute resources. #188
- Subtasks: Implement crawl factory for App Resources. #192
- Subtasks: Implement crawl factory for DNS resource. #203
- Subtasks: Implement crawl factory for SQL instances resource. #205
- Subtasks: Implement crawl factory for Bigquery resource. #206
- Subtasks: Implement crawl factory for PubSub resource. #207
- Subtasks: Implement crawl factory for CloudFunctions resource. #208
- Subtasks: Implement crawl factory for Bigtable resource. #199
- Subtasks: Implement crawl factory for Spanner resource. #209
- Subtasks: Implement crawl factory for FileStore resource. #201
- Subtasks: Implement crawl factory for KMS resource. #210
- Subtasks: Implement crawl factory for service usage resource. #211
- Subtasks: Implement crawl factory for Sourcerepo resource. #212
- Subtasks: Implement crawl factory for Cloudresourcemanager resource. #213
- Subtask: Implement servicemanagement crawler. #229
- Subtask: Implement endpoints crawler. #230
- Subtask: Implement service_accounts crawler. #231