Comment parser #11252

leana8959 · 2025-10-10T00:52:34Z

This is the first part of the exact print parser. In this PR I changed the lexer so instead of dropping the comments it emits them to the parser which is further stored in GenericPackageDescription.

Please let me know your thoughts!

Checklist below:

This PR modifies behaviour or interface

Include the following checklist in your PR:

Patches conform to the coding conventions.
Any changes that could be relevant to users have been recorded in the changelog.
- Is the change significant? If so, remember to add significance: significant in the changelog file.
The documentation has been updated, if necessary.
Manual QA notes have been included.
Tests have been added. (Ask for help if you don’t know how to write them! Ask for an exemption if tests are too complex for too little coverage!)

andreabedini

Thank you for taking on this work. I think adding a Comment constructor to Field is not a good design. It modifies the meaning of the type (and indeed this forces you to make functions like elementInLayoutContext return [Fields]).

I think this has been already discussed before: we have the ann parameter which can be used to keep hold on the comments. E.g.

data Comment ann = Comment !ann !ByteString
type FieldWithComments ann = Field ([Comment ann], ann)

In this design each Field carries the comments preceding it, annotated with their position. An extra annotation marks the position of the field itself. Any comment at the end of the file would need to be captured separately.

This is already the practice of few packages developed by the community. Is there a reason to deviate from this?

andreabedini · 2025-10-15T06:38:11Z

Cabal-syntax/src/Distribution/Fields/Parser.hs

 -- elementInLayoutContext ::= ':'  fieldLayoutOrBraces
 --                          | arg* sectionLayoutOrBraces
-elementInLayoutContext :: IndentLevel -> Name Position -> Parser (Field Position)
+elementInLayoutContext :: IndentLevel -> Name Position -> Parser [Field Position]


The comment suggest this function parses one field but the signature is changed to return a list of fields.

andreabedini · 2025-10-15T06:38:58Z

Cabal-syntax/src/Distribution/Fields/Parser.hs

-fieldLayoutOrBraces :: IndentLevel -> Name Position -> Parser (Field Position)
+-- fieldLayoutOrBraces   ::= '\\n'? '{' comment* (content comment*)* '}'
+--                         | comment* line? comment* ('\\n' line comment*)*
+fieldLayoutOrBraces :: IndentLevel -> Name Position -> Parser [Field Position]


Same as above.

leana8959 · 2025-10-15T11:45:00Z

Here's a preliminary benchmark done with hyperfine running on my computer with as little other programs running as possible, done in the same condition (beside being run 6 hours apart).
It seems like we are within the standard deviation, so there's no noticeable degrade of performance.

Upstream:

~/r/haskell/cabal λ lts-18.28
$ hyperfine --runs 30 './validate.sh --partial-hackage-tests'
Benchmark 1: ./validate.sh --partial-hackage-tests
  Time (mean ± σ):     203.400 s ± 21.816 s    [User: 150.484 s, System: 39.487 s]
  Range (min … max):   183.805 s … 277.905 s    30 runs

This branch:

…/wt/haskell/cabal/exact-pp-leana λ lts-18.28
$ hyperfine --runs 30 './validate.sh --partial-hackage-tests'
Benchmark 1: ./validate.sh --partial-hackage-tests
  Time (mean ± σ):     199.168 s ± 10.540 s    [User: 156.373 s, System: 39.818 s]
  Range (min … max):   184.443 s … 242.563 s    30 runs

Thank you for your response Andrea, I'll write up a response and get back to you soon :)

leana8959 · 2025-10-16T02:09:47Z

Thank you for your comment @andreabedini :)

I think this has been already discussed before: we have the ann parameter which can be used to keep hold on the comments.

That looks very interesting, but how would I deal with files that are just comments? To the point of view of readFields they should be valid yet we would have no Field to attach them to. Whether we should attach the comments above or below is yet another question. For example, in the sequence "comment element comment element comment", which element should grabs the comment in between?

I do think your model is very interesting so if you have the time to, please show working PR against mine so we can simply merge it in 🙏

This is already the practice of few packages developed by the community. Is there a reason to deviate from this?

Could you elaborate which packages are these? I would love to have more insight on how people solve similar problems.

Are there other design issues that needs to be addressed ?

Bodigrim · 2025-10-16T23:51:48Z

That looks very interesting, but how would I deal with files that are just comments?

Are they valid Cabal files if there is nothing but comments? I don't think so.

Could you elaborate which packages are these? I would love to have more insight on how people solve similar problems.

For instance, cabal-add has a function

annotateFieldsWithSource :: ByteString -> [Field Position] -> [Field ByteString]

which annotates each field with its source including all adjacent comments. It can be modified to return [Field (Position, ByteString) if you need both.

Bodigrim · 2025-10-16T23:55:36Z

Cabal-syntax/src/Distribution/FieldGrammar.hs

    match _ = Nothing
+
+-- | Collect comments into a map. The second field of the output will have no comment
+extractComments :: Ord ann => [Field ann] -> (Map.Map ann ByteString, [Field ann])


Could you possibly outline how the output of this function is supposed to be used? Now that we detached comments from fields, how do we reconstruct the original document? How do we do it if [Field ann] is programmatically updated (say, adding or removing elements)?

Could you possibly outline how the output of this function is supposed to be used?

This function recursively extracts the comments from the fields (the updated design variant does the same thing, albeit being more concise).

It is used to extract the comments out right before running the parseFieldGrammar parser in src/Distribution/PackageDescription/Parsec.hs. This allows us to not change the grammar and be able to directly inject the comments into the GenericPackageDescription.

Now that we detached comments from fields, how do we reconstruct the original document?

We can't yet. Following Jappie's proposal, we need to store the exact position of each field as a map (called exactPositions) indexed by the path in the rose tree (typed [NameSpace], a list of path segments). And then we will have the right information to construct the exact printer.

How do we do it if [Field ann] is programmatically updated (say, adding or removing elements)?

To quote Jappie:

The issue for addition is that you now have to invent exact positions. for removal, if it involves a line, you've to fix up all following lines, (and it has to know something was removed).

I haven't thought about this thoroughly because it's out of scope of this PR, but it should be very much feasible.

geekosaur · 2025-10-16T23:57:11Z

Are they valid Cabal files if there is nothing but comments? I don't think so.

Pretty sure you need at minimum name and one target.

jappeace · 2025-10-17T07:38:47Z

Hi friends, thanks for all your responses. Leana needs some time to read up on the exact proposal to see how it all fits together before replying.
After chatting with her, I think she wants to go with Andrea's design for the comment field parser. As you can see, she's deeply in the weeds about many of the details of the parser; she even corrected some of the field grammar comments!
I don't know what you guys think, but I think it's good progress 🚀

Bodigrim · 2025-10-17T21:54:43Z

Cabal is one of the toughest code bases I ever worked on, so I'm quite amazed by Leana making progress so quickly!

leana8959 · 2025-10-21T03:11:43Z

Thank you Bodigrim and Jappie for your kind words! That means a lot to me, I'm glad to be on the right track.

I have started (and completed) to rewrite my PR using Andrea's approach.
Right now the behaviour is identical to my old approach, while comments are tracked in the annotation ann.

Are they valid Cabal files if there is nothing but comments? I don't think so.

Good to know. Currently the top level parser drops the comments consumed if there are no fields to attach them to.

Let me know what you think about the change :)

leana8959 · 2025-10-22T14:02:26Z

Here are the benchmark results. The baseline has been rerun because I did these ones on a VPS machine, and they are not comparable to the last ones I ran on my machine.

# baseline
leana@Ubuntu-2404-noble-amd64-base:~/cabal$ hyperfine './validate.sh --partial-hackage-tests'
Benchmark 1: ./validate.sh --partial-hackage-tests
  Time (mean ± σ):     253.353 s ±  9.520 s    [User: 196.642 s, System: 60.881 s]
  Range (min … max):   241.649 s … 271.207 s    10 runs
  
# this PR
leana@Ubuntu-2404-noble-amd64-base:~/cabal$ hyperfine --setup "./validate.sh --partial-hackage-tests" "./validate.sh --partial-hackage-tests"
Benchmark 1: ./validate.sh --partial-hackage-tests
  Time (mean ± σ):     253.163 s ±  7.451 s    [User: 196.432 s, System: 58.507 s]
  Range (min … max):   239.373 s … 266.656 s    10 runs

andreabedini · 2025-10-23T01:25:28Z

@leana8959

That looks very interesting, but how would I deal with files that are just comments?

In my prototype I have replaced [Field ann] with something like

data File ann = File [Field ann] ann

Where the extra annotation is for anything coming after the last field.

Could you elaborate which packages are these? I would love to have more insight on how people solve similar problems.

In addition to @Bodigrim's cabal-add, @phadej's cabal-fields rewrites parser entirely (but dropping support for braces) but at the AST level does the same thing. I am sure he also left comments in some of the "exact printing" mega-threads.

After chatting with her, I think she wants to go with Andrea's design for the comment field parser. As you can see, she's deeply in the weeds about many of the details of the parser; she even corrected some of the field grammar comments!

I am available to discuss and support her effort. @leana8959 I'll reach out privately.

Cabal is one of the toughest code bases I ever worked on, so I'm quite amazed by Leana making progress so quickly!

I warmly second this!

Mikolaj · 2025-11-06T18:37:52Z

@leana8959, @andreabedini: how is the private communication going? We are interested too! Could we help somehow?

jappeace · 2025-11-06T21:00:23Z

for clarity: These parser changes are ready for review as far as leana and me are concerned, meanwhile we've moved over to import stanza retention in GenericPackageDescription (they currently get merged into the stanza's). This is an independent change of the parser changes here.

After that we can start on exact print propper.

phadej · 2025-11-07T11:46:02Z

import stanza retention in GenericPackageDescription (

Please don't. A lot of code wants an elaborated (= stripped down of syntactic convenience) representation of package description, and GPD serves that now.

Your parsing changes leak down the pipeline where they shouldn't. E.g. things like solver works with GPD, and it really shouldn't care about whether import stanzas were used to declare a package or not.

jappeace · 2025-11-07T12:37:55Z

This PR isn't about that,
and we're intending to keep everything working. 🙂

phadej · 2025-11-10T12:49:38Z

This PR does add

  , exactComments :: ExactComments Position

field to GPD. I don't see a point of having comments in GPD. (As noted in tests, it "breaks" equality)

leana8959 · 2025-11-11T03:52:27Z

@phadej Indeed, I have added a newtype around GenericPackageDescription that doesn't have an Eq instance. This ensures GenericPackageDescription stays the same :)

ulysses4ever · 2025-11-20T18:13:03Z

I understand, the authors would like to get reviews for this PR. If this is so, please, squash the commit history. For the size of PR: a good part of the changes are test-suite changes, it seems. I don't think they have to be extracted into separate PR or even separate commits (because having commits that don't pass CI individually may be cumbersome in the future, for git-bisecting and alike).

ulysses4ever · 2025-11-20T18:51:58Z

@mpickering can you take a high-level look at the design in this PR and tell us your opinion? The gist of it is:

-- Cabal-syntax/src/Distribution/Fields/Field.hs

data Comment ann = Comment !ByteString !ann
  deriving (Show, Generic, Eq, Ord, Functor)

data WithComments ann = WithComments
  { justComments :: ![Comment ann]
  , unComments :: !ann
  }
  deriving (Show, Generic, Eq, Ord, Functor)

and then many parsing utilities that used to return ... Field Position now return Field (WithComments Position). With casual renamings like below:

-parseGenericPackageDescription'
+parseAnnotatedGenericPackageDescription'
  :: Maybe CabalSpecVersion
  -> [LexWarning]
  -> Maybe Int
-  -> [Field Position]
-  -> ParseResult src GenericPackageDescription
-parseGenericPackageDescription' scannedVer lexWarnings utf8WarnPos fs = do
+  -> [Field (WithComments Position)]
+  -> ParseResult src AnnotatedGenericPackageDescription
+parseAnnotatedGenericPackageDescription' scannedVer lexWarnings utf8WarnPos fs = do

ulysses4ever · 2025-11-20T19:26:45Z

One particular thing that bothers me is that clients will have to call unComment even if they don't care about comments (at least, it appears to be the case to me after looking at the diff in this patch). I feel like there should be a more graceful way to handle this. On the cabal meeting today @jappeace suggested that we could duplicate (I assume, some) parsing utilities so that some functions have two versions: one comment-bearing and another one, comment-less. I must say I quite like this idea. @geekosaur pointed out that this "just" moves the pressure from client's to our shoulders. But I feel like many clients aren't interested in comments anyway (they are surviving somehow today), and the cost of this sort of contained duplication (inside one module and very uniform) is low. So, overall, the cost/benefit ratio of this duplication-based idea is better than in the current proposal to me.

Any thoughts?

Meta-comment: I looked through the tech proposal again, and there doesn't appear to be a technical description of a solution for this particular comments issue. It is totally fine, because the proposal would turn into a foliant at that level of detail. Neveretheless, I wish that, before doing all the technical work in this PR, the authors had discussed the actual technical solution for the particular issue (like preserving comments; others are listed in the tech proposal) on the Cabal bug tracker (here).

leana8959 · 2025-11-21T07:02:15Z

If this is so, please, squash the commit history. For the size of PR: a good part of the changes are test-suite changes, it seems. I don't think they have to be extracted into separate PR or even separate commits.

@ulysses4ever Are there some rule of thumb to squash my commits? I already went through the history once this morning and split out all changes that touch the testsuite into one commit.
Do I need to do something else for the feature commits? If it's all good please let me know, I'll force push to my PR branch.
Maybe you have an example PR that does it right?

add lexer tokens and rules remove lexer "Whitespace" token This token is not needed, we will later use the position information to pad each token. implement "Comment" handling ... which is not handling at all for the time being temporary fix by dropping comments before parseGenericPackageDescription make metaFields a map of positions rearrange and simplify field make lexer emit comment wherever they would occur stop parser from emitting indentation warning for comments fix: restore checkIndentation behaviour for Field test: add dummy tests test: accept new golden expressions test: accept new golden expressions test: rename comment test group debug: trace tokens fix: split comments recursively fix: consume comments after colon in FieldLayoutOrBraces debug: remove tracing test: update expected test: improve comment tests test: correct comment tests test: assert interleaving comment parsing fix: correct interleaving comment parsing test: update expected debug: remove tracing test: assert parsing of fieldline flag test: update expected fix: correct parsing fieldLine starting with -- as comment test: update expected test: remove test case that doesn't pass on upstream minor fixes test: ignore comment in test comparison docs: improve comments on the grammar style: whitespace style: fourmolu ref: simplification docs: update grammar specification for comments ref: run hlint improve describeToken on comments ref: make diff smaller test: fix no-thunks test test: fix md5Check test fix compiler errors and warnings test: add expectation for failing hackage test We also reintroduced the flag "CABAL_PARSEC_DEBUG" to debug the lexer/parser. fix hackage test 001 fix hackage test test: disable comments in comparison in roundtrip hackage test refactor parser refactor test style: run fourmolu remove todos yay test: remove test dependencies move ToExpr to orphan module test: simplify restore accidently formatted cabal restore previous debug behaviour refactor: don't use liftA2 and liftA3 refactor annotation to ([Comment ann], ann) attempt test: update expects fix errors for Deprecated module fix compilation errors for integration tests fix grammar while incorrect output We need to look into how to wire the output for it to hold the comments in the right position. refactor parser style: run fourmolu fix comment attach post processing refactor fix: only discard element comments at top level test: update expected fix: derive Eq instance for Comment This fixes builds for old GHC use strict either for parser fix: doctest define proper WithComments data type remove exactComment field in GenericPackageDescription add Lens functions for AnnotatedGenericPackageDescription test AnnotatedGenericPackageDescription instead run fourmolu run hlint remove redundant imports tests: test hasktorch tests: update expected tests: fix integration tests tests: fix nothunks test run fourmolu fix doctests ref: keep backward compatibility of exported functions test: explicitly test *WithComment readFields variant test: fix doctest fix: build undo changes from experiments ref: clean up the parser code ref: reduce diff

leana8959 · 2025-11-27T02:46:38Z

I cleaned up everything and made sure the test passes, this is ready for review!

leana8959 mentioned this pull request Oct 10, 2025

Cabal exact printer in stages? #11227

Open

andreabedini reviewed Oct 15, 2025

View reviewed changes

leana8959 marked this pull request as ready for review October 16, 2025 10:21

Bodigrim reviewed Oct 16, 2025

View reviewed changes

Mikolaj added the exact-print label Oct 23, 2025

leana8959 force-pushed the exact-pp-leana branch from 26f34ae to 7751130 Compare November 27, 2025 02:39

jappeace added the attention: needs-review label Nov 27, 2025

Comment parser #11252

Are you sure you want to change the base?

Comment parser #11252

Conversation

leana8959 commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreabedini left a comment

Choose a reason for hiding this comment

Uh oh!

andreabedini Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

andreabedini Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

leana8959 commented Oct 15, 2025

Uh oh!

leana8959 commented Oct 16, 2025

Uh oh!

Bodigrim commented Oct 16, 2025

Uh oh!

Bodigrim Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

leana8959 Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

geekosaur commented Oct 16, 2025

Uh oh!

jappeace commented Oct 17, 2025

Uh oh!

Bodigrim commented Oct 17, 2025

Uh oh!

leana8959 commented Oct 21, 2025

Uh oh!

leana8959 commented Oct 22, 2025

Uh oh!

andreabedini commented Oct 23, 2025

Uh oh!

Mikolaj commented Nov 6, 2025

Uh oh!

jappeace commented Nov 6, 2025

Uh oh!

phadej commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jappeace commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phadej commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leana8959 commented Nov 11, 2025

Uh oh!

ulysses4ever commented Nov 20, 2025

Uh oh!

ulysses4ever commented Nov 20, 2025

Uh oh!

ulysses4ever commented Nov 20, 2025

Uh oh!

leana8959 commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leana8959 commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

leana8959 commented Oct 10, 2025 •

edited

Loading

phadej commented Nov 7, 2025 •

edited

Loading

jappeace commented Nov 7, 2025 •

edited

Loading

phadej commented Nov 10, 2025 •

edited

Loading

leana8959 commented Nov 21, 2025 •

edited

Loading