
Commit 4842c8e

Add issue template. See: #43 (#44)
* Add issue+PR template for gh and gitlab See: #43
1 parent f2f35ca commit 4842c8e

File tree

8 files changed: +381 additions, 0 deletions
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
name: Minimal issue template
about: Minimal Issue Template
title: '[bug]: '
labels: ''
assignees: ''
---

## When I (optional)

1. xx
1. xx
1. xx

## I expect

- [ ] yy
- [ ] zz

## Instead

-
-
-

## Notes

Attach sanitized logs, screenshots, outputs.
CC folks
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
## This PR

- [ ]
- [ ]
- [ ]

## It's done

- Rationale of the implementation

## Checks

- [ ] This PR conforms to the [CONTRIBUTING.md](CONTRIBUTING.md) guidelines

.gitlab/POSTMORTEM.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
# This is a template for postmortem reports, inspired by
# teamdigitale's one published on medium.com.
# For the original version, see the references section.
title: Fake Postmortem - Cloud connectivity incident
date: 2018-05-23
summary: >-
  Fake Postmortem inspired by the following: The Digital Team's websites were
  unreachable for 28 hours due to a cloud provider outage.
authors:
  - name: Mario Rossi
  - name: Franco Bianchi
references:
  - https://medium.com/team-per-la-trasformazione-digitale/document-postmortem-technology-italian-government-public-administration-99639a0a7877
  - https://abseil.io/resources/swe-book/html/ch02.html#blameless_postmortem_culture
glossary: {}
keywords: []
...
---
# Postmortem - Template for a postmortem report

## Summary

**Impact**:

The following services cannot be reached:

- Dashboard Team
- Three-Year ICT Plan
- Designers Italia
- Developers Italia
- Docs Italia
- Forum Italia

**Duration**:
28 hours

**Cause**:
OpenStack network outage - cloud provider _Cloud SPC Lotto 1_

## Context

The Digital Team's websites are based mainly on static HTML generated from the source content of the repositories on GitHub. The HTML is published by a web server (nginx) and exposed over HTTPS. Forum Italia (http://forum.italia.it) is the only exception to this deployment model and is managed separately via Docker containers. At any given time, one or more web servers can be deployed on the cloud provider's (Cloud SPC Lotto 1) OpenStack virtual machines, using the API provided by the platform.

Cloud resources (virtual machines and data volumes) are allocated to services according to the Agency for Digital Italy's Cloud SPC contract.
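
For illustration only (this sketch is not part of the incident report), a deployment following this model might provision a web server through the OpenStack API and publish the generated HTML with nginx. The image, flavor, network, and server names below are hypothetical placeholders:

```sh
# Hypothetical sketch: create a VM via the OpenStack CLI and publish the static site.
# All names (image, flavor, network, host) are placeholders, not the real tenant's values.
openstack server create \
  --image ubuntu \
  --flavor m1.small \
  --network tenant-private \
  docs-web-01

# On the VM: install nginx and copy the HTML generated from the GitHub sources.
sudo apt-get install -y nginx
sudo rsync -a ./site/ /var/www/html/
sudo systemctl reload nginx
```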

## Impact and damage assessment

On 19/05/2018, the following services became unreachable due to an internal connectivity issue of the Cloud Service Provider "Cloud SPC":

- Dashboard Team
- Three-Year ICT Plan
- Designers Italia
- Developers Italia
- Docs Italia
- Forum Italia

## Causes and Contributing Factors

According to a postmortem document released by the supplier on 2018-06-07, the interruption of connectivity experienced by the 31 users (tenants) of the SPC Cloud service was triggered by a planned update of the OpenStack platform carried out on the night of Thursday 2018-05-17.

### Detection

The problem was detected the following morning (2018-05-18), thanks to reports from users who were no longer able to access the services provided on the Cloud SPC platform.

### Causes

The document states that a restart of the control nodes of the OpenStack platform (the nodes that run OpenStack's management services: neutron, glance, cinder, etc.) caused “an anomaly” in the network infrastructure, blocking traffic on several compute nodes (the nodes where virtual instances run) and causing virtual machines belonging to 31 users to become unreachable.
The postmortem document also explains how a bug in the playbook (update script) would have blocked network activities by modifying the permissions of the file `/var/run/neutron/lock/neutron-iptables`, as indicated in the platform's official documentation.
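
As a purely illustrative aside (not taken from the supplier's report), a check of the kind that would surface this class of problem might look like the sketch below; the expected `neutron` service user is an assumption:

```sh
# Hypothetical sketch: verify that the neutron service user can still write its lock path
# on a compute node. The "neutron" user shown here is an assumption, not quoted from the report.
ls -ld /var/run/neutron/lock/
sudo -u neutron test -w /var/run/neutron/lock/ \
  && echo "neutron can write its lock files" \
  || echo "WARNING: neutron cannot write to its lock directory"
```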

Again, according to the supplier, restarting the nodes was necessary to apply the security updates for Meltdown and Spectre (CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754).

The unavailability of the Cloud SPC infrastructure was undoubtedly the root cause of the problem, but the lack of an application-level protection mechanism for the Digital Team's services prolonged their unavailability.
Because the possibility of the entire cloud provider becoming unreachable had not been taken into account during the design phase of the services, it was not possible to respond adequately to this event.
Despite the SPC Cloud provider's failover mechanisms, the web services were not protected from generalized outages capable of undermining the entire infrastructure of the only cloud provider at our disposal.

## Actions taken

WRITEME: A list of action items taken to mitigate/fix the problem

- Action 1
  - Owner
- Action 2
  - Owner
...

## Preventive actions

WRITEME: A list of action items to prevent this from happening again

## Lessons learned

### What went wrong

The Cloud SPC platform cannot currently distribute virtual machines across data centers or different regions (OpenStack regions).
It would have been useful to be able to distribute virtual resources across independent infrastructures, even ones provided by the same supplier.

### What should have been done

In hindsight, the Public Administration should have access to multiple cloud providers, so as to ensure the resilience of its services even when the main cloud provider suffers an outage.

### Where we got lucky

WRITEME: What things went right that could have gone wrong

### What we should do differently next time

The most important lesson we learned from this experience is the need to continue investing in the development of a cross-platform, multi-supplier cloud model.
This model would guarantee the reliability of Public Administration services even when the main cloud provider is affected by problems that make it unreachable for a long period of time.

## Timeline

A timeline of the event, from discovery through investigation to resolution.
All times are in CEST.

### 2018-05-17

22:30 CEST: The SPC MaaS alert service sends alerts via email indicating that several nodes can no longer be reached. <START of scheduled activities>

### 2018-05-19

06:50 CEST: The aforementioned services, available at the IP address 91.206.129.249, can no longer be reached <START of INTERRUPTION>

### 2018-05-19

08:00 CEST: The problem is detected and reported to the supplier

09:30 CEST: The machines are determined to be accessible through OpenStack's administration interface (API and GUI) and internal connectivity shows no issue. Virtual machines can communicate over the tenant's private network, but cannot connect to the Internet.

15:56 CEST: The Digital Team sends the supplier and CONSIP a help request via email

18:00 CEST: The supplier communicates that they have identified the problem, which turns out to be the same problem experienced by the DAF project, and commences work on a manual workaround

19:00 CEST: The supplier informs us that a fix has been produced and that it will be applied to the virtual machines belonging to the 31 public administrations (tenants) involved.

### 2018-05-20

11:10 CEST: The supplier restores connectivity to the VMs of the AgID tenant

11:30 CEST: The Digital Team restarts the web services and the sites are again reachable <END OF INTERRUPTION>
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
## When I (optional)

1. xx
1. xx
1. xx

## I expect

- [ ] yy
- [ ] zz

## Instead

-
-
-

## Notes

Attach sanitized logs, screenshots, outputs.
CC folks
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
## This PR

- [ ]
- [ ]
- [ ]

## It's done

- Rationale of the implementation

## Checks

- [ ] This PR conforms to the [CONTRIBUTING.md](CONTRIBUTING.md) guidelines

.gitlab/workflows/lint.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# Run the SuperLinter action with some custom setup.

name: Lint

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      - name: Super-Linter
        uses: github/super-linter@v3.15.5
        env:
          VALIDATE_MARKDOWN: false
          # Disabled for conflicts with the isort version used in pre-commit;
          # you can re-enable it if you align your local isort with
          # the one in the super-linter image.
          VALIDATE_PYTHON_ISORT: false
          VALIDATE_XML: false
          VALIDATE_NATURAL_LANGUAGE: false
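
To reproduce these checks before pushing, the linter image can be run locally; the sketch below assumes Docker is available and relies on super-linter's documented `RUN_LOCAL` mode, mirroring the variables disabled above:

```sh
# Sketch: run super-linter against the working copy (assumes Docker and local-run support).
docker run --rm \
  -e RUN_LOCAL=true \
  -e VALIDATE_MARKDOWN=false \
  -e VALIDATE_PYTHON_ISORT=false \
  -e VALIDATE_XML=false \
  -e VALIDATE_NATURAL_LANGUAGE=false \
  -v "$PWD":/tmp/lint \
  github/super-linter:v3.15.5
```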
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# This is a basic workflow to help you get started with Actions

name: "security-bandit"

# Controls when the action will run. Triggers the workflow on push or pull request
# events but only for the main branch
on:
  push:
    branches: ["main"]
    paths-ignore:
      - "ISSUE_TEMPLATE/**"
  pull_request:
    branches: ["main"]
    paths-ignore:
      - "ISSUE_TEMPLATE/**"

permissions: read-all

jobs:
  build:
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      # Runs a single command using the runner's shell
      - name: Python security check using Bandit
        uses: ioggstream/bandit-report-artifacts@v1.7.4
        with:
          project_path: .
          config_file: .bandit.yaml

  super-sast:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v3
      - name: Test
        run: |
          echo UID=$(id -u) >> .env
          docker run --rm --user=$(id -u) \
            -v $PWD:/code \
            -w /code \
            -e MAVEN_OPTS=" -ntp " \
            -e RUN_OWASP_DEPENDENCY_CHECK=false \
            -e RUN_SPOTBUGS_CHECK=false \
            -e RUN_SPOTLESS_CHECK=false \
            -e RUN_SPOTLESS_APPLY=true \
            -e HOME=/tmp \
            -e USER=nobody \
            -e BANDIT_CONFIG_FILE=/code/.bandit.yaml \
            ghcr.io/par-tec/super-sast:latest
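
For a quick local pass before CI, Bandit itself can be run directly against the tree; this is a sketch assuming a Python environment and the same `.bandit.yaml` referenced above:

```sh
# Sketch: run Bandit locally with the repository's configuration file.
pip3 install bandit
bandit -c .bandit.yaml -r .
```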

.gitlab/workflows/test.yml

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# This workflow template runs:
# - a tox container with tests
# - a service container (e.g. a database) to be used by tox tests.

name: Test

# Controls when the action will run.
on:
  # Triggers the workflow on push or pull request events but only for the main branch
  push:
    branches: [main]
  pull_request:
    branches: [main]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:

  test-tox-job:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    container: python:3.9-slim

    # This stanza deploys a service container with
    # the "rabbit" hostname. It is commented out
    # to save build time. Uncomment it if you need it!
    # services:
    #   rabbit:
    #     image: rabbitmq:3-management
    #     ports:
    #       - 5672:5672

    # ...then run the tox jobs referencing it.
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it.
      # IMPORTANT!! By default `actions/checkout` just checks out HEAD, so if you want
      # to check out tags and branches too (e.g. to auto-version your deployments)
      # you need to pass the `fetch-depth: 0` option, e.g.
      #
      #   uses: actions/checkout@v2
      #   with:
      #     fetch-depth: 0
      - uses: actions/checkout@v2

      - name: Run tests.
        run: |
          pip3 install tox
          tox

  test-pre-commit:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    container: python:3.9
    steps:
      - uses: actions/checkout@v2

      - name: Run commit hooks.
        run: |
          pip3 --no-cache-dir install pre-commit
          git --version
          pwd
          ls -la
          id
          git config --global --add safe.directory $PWD
          pre-commit install
          pre-commit run -a

      # Store (expiring) logs on failure.
      # Retrieve artifacts via `gh run download`.
      - uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: pre-commit.log
          path: /github/home/.cache/pre-commit/pre-commit.log
          retention-days: 5
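
The same checks can be reproduced locally before opening a pull request; this is a sketch assuming Python 3.9, a `tox.ini`, and a `.pre-commit-config.yaml` at the repository root:

```sh
# Sketch: run the CI test and lint steps on a developer machine.
pip3 install tox pre-commit
tox                  # run the environments defined in tox.ini
pre-commit install   # set up the git hooks
pre-commit run -a    # run all hooks against the whole tree
```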
