More granular parsing for some complex tables  #4

@dr333

I am not sure whether this is really an issue with the parser or rather an improvement request, unless the parser already offers a way to do this.

Consider a complex table such as the one whose source code I have uploaded at https://ufile.io/tee2c

As you will see, this table, even though encapsulated in a single <table> tag, contains more than one table. When I use the parser, the output I get is therefore a single consolidated output, as expected. Is it possible for the parser to recognize the valid tables inside such a complex table, so that we can get a separate output for each valid table section?

Here's the code I have so far, but I am not sure how to parse the table at the more granular level described above:

# Assumed import path (html-table-extractor package); adjust if your install differs.
from html_table_extractor.extractor import Extractor

extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()
new_list = []

for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # Skip rows whose cells are all identical, and rows containing a cell
    # longer than 200 characters (probably a paragraph, not a valid table item).
    if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
        continue
    new_list.append(stripped_list)
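
To illustrate the kind of per-section output being asked for, here is a minimal sketch of a post-processing step applied to the parsed rows. It rests on an assumption that the linked file does not confirm: each sub-table is introduced by a row whose cells are all identical and non-empty, for example a title cell spanning every column. The helper split_sections and the sample rows are hypothetical and are not part of the parser.

```python
def split_sections(rows, max_cell_len=200):
    """Group parsed rows into sub-tables.

    Assumption: a row whose cells are all identical and non-empty is a
    section title that starts a new sub-table.
    """
    sections = []
    current = None
    for row in rows:
        cells = [str(item).strip() for item in row]
        # Same heuristic as the loop above: very long cells are probably prose.
        if any(len(cell) > max_cell_len for cell in cells):
            continue
        if len(set(cells)) <= 1:
            # All cells identical: a title row if non-empty, else a blank row.
            if cells and cells[0]:
                current = {"title": cells[0], "rows": []}
                sections.append(current)
            continue
        if current is None:
            # Data rows that appear before any title row go into an untitled section.
            current = {"title": "", "rows": []}
            sections.append(current)
        current["rows"].append(cells)
    return sections


# Hypothetical rows, shaped like the output of return_list()
sample = [
    ["Revenue", "Revenue", "Revenue"],
    ["Year", "Q1", "Q2"],
    ["2017", "10", "12"],
    ["Costs", "Costs", "Costs"],
    ["Year", "Q1", "Q2"],
    ["2017", "7", "8"],
]

for section in split_sections(sample):
    print(section["title"], section["rows"])
```

This only works when section boundaries survive parsing as uniform rows. If the sub-tables are separated some other way in the HTML, the boundary test would need to change accordingly.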
