Total LLoC depends on analysis

While writing up my thesis I found that the total LLoC we report for some benchmarks differs between three papers:

| Benchmark | [traces paper](https://link.springer.com/chapter/10.1007/978-3-030-88806-0_18) | [more-traces paper](https://link.springer.com/chapter/10.1007/978-3-031-30044-8_2) | [unassume paper](https://doi.org/10.1007/978-3-031-50524-9_4) | 42848ea0ff5da0147aff8715d5d75ff7727b4881 |
|--------|--------|--------|--------|--------|
| pfscan | 600 | 550 | 559 | 559 |
| aget | 603 | 581 | 587 | 587 |
| ctrace | 665 | 651 | – | 657 |
| knot | 987 | 973 | 981 | 981 |
| smtprc | 3102 | 3013 | 3037 | 3037 |

I also ran the current version of Goblint on them and got even more different results.
And found that @michael-schwarz's PhD thesis has an even different one for pfscan: 562.

All five programs have been unchanged for 18 years, so that's not the cause for the differences.
b6e8132a1012cc3cf0ce3701484a8031dfb4d7b1 and bddbd0f2f45ea5768a4783e6cd74782798b564e6 (and possibly others) refactored this logic a long time ago (outside of any PR it seems?!), which doesn't explain the change since more-traces.

Our calculation of the total is quite suspicious:
https://github.com/goblint/analyzer/blob/42848ea0ff5da0147aff8715d5d75ff7727b4881/src/framework/control.ml#L202-L207
This is basically

$$
\text{total} = (\text{live from analysis result}) + (\text{dead from analysis result}) + (\text{all from functions uncalled according to the analysis result}).
$$

Only the last summand is computed purely syntactically with a CIL visitor (but still depending on which functions were uncalled by the analysis). The proper way would be to just compute the total fully syntactically to begin with. This would by construction avoid any possible dependence on the analysis because it's supposed to be an independent measure.

Of course the live and dead things from the analysis result should make up the total, but both of these are subtly affected by much more complicated logic:
* The analysis result is for a constraint system, which is based on CFGs, but our CFG construction can omit some nodes, e.g. when following multiple `goto`-s.
* The constraint system solution is perhaps not guaranteed to have computed something for all CFG nodes, if they're not depended on. In some obscure cases (e.g. #231), CFGs can contain weird vestigial parts that aren't (and don't need to be) properly connected.
  We check that the CFG is connected, but not that all paths from entry eventually end up at exit.

	let dead_total = !count + uncalled_fn_loc in
	let total = live_count + dead_total in (* We can only give total LoC if we counted dead code *)
	M.msg_group severity ~category:Deadcode "Logical lines of code (LLoC) summary" [
	(Pretty.dprintf "live: %d" live_count, None);
	(Pretty.dprintf "dead: %d%s" dead_total (if uncalled_fn_loc > 0 then Printf.sprintf " (%d in uncalled functions)" uncalled_fn_loc else ""), None);
	(Pretty.dprintf "total lines: %d" total, None);

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Total LLoC depends on analysis #1737

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Benchmark	traces paper	more-traces paper	unassume paper	`42848ea`
pfscan	600	550	559	559
aget	603	581	587	587
ctrace	665	651	–	657
knot	987	973	981	981
smtprc	3102	3013	3037	3037

Total LLoC depends on analysis #1737

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions