More granular parsing for some complex tables #4

I am not sure whether this is really an issue with the parser; it may be more of an improvement request, unless a solution is already available when using this parser.

Consider a complex table such as the one for which I am providing source code in a file at https://ufile.io/tee2c

As you will see, this table, even though it is encapsulated in a single <table> tag, contains more than one table. When I use the parser, the output I get is therefore a single consolidated output, as expected. Is it possible for the parser to recognize the valid tables inside such a complex table, so that we can get a separate output for each valid table section?

Here's the code I have, but I am not sure how to parse the table at the more granular level described above:

    # Assumption: Extractor comes from the html-table-extractor package.
    from html_table_extractor.extractor import Extractor

    extractor = Extractor(str(table_doc))  # table_doc is defined elsewhere
    extractor.parse()
    list_of_lists = extractor.return_list()
    new_list = []

    for row in list_of_lists:
        stripped_list = [item.strip() for item in row]
        # Skip rows whose cells are all identical, and rows containing a cell
        # longer than 200 characters (probably a paragraph, not a table item).
        if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
            continue
        new_list.append(stripped_list)
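
One possible workaround, as a rough sketch rather than anything the parser itself provides: post-process the consolidated list of rows and start a new sub-table whenever a "separator" row is encountered. The sketch assumes that the embedded tables are separated by rows that are empty, that repeat a single value across all cells (for example a spanning title), or that hold paragraph-length text. The names `is_separator` and `split_into_tables` are hypothetical helpers, not part of any library, and the 200-character threshold is carried over from the code above.

```python
def is_separator(row, max_cell_len=200):
    """Heuristic (an assumption): a row separates two sub-tables if it is
    empty, all of its cells are identical, or any cell is very long."""
    cells = [cell.strip() for cell in row]
    return len(set(cells)) <= 1 or any(len(c) > max_cell_len for c in cells)


def split_into_tables(rows, max_cell_len=200):
    """Group consecutive non-separator rows into separate tables."""
    tables, current = [], []
    for row in rows:
        if is_separator(row, max_cell_len):
            # Close the table collected so far, if there is one.
            if current:
                tables.append(current)
                current = []
            continue
        current.append([cell.strip() for cell in row])
    if current:
        tables.append(current)
    return tables


# Made-up sample rows standing in for the output of return_list()
sample_rows = [
    ["Section A", "Section A"],
    ["name", "value"],
    [" x ", "1"],
    ["", ""],
    ["key", "count"],
    ["y", "2"],
]
tables = split_into_tables(sample_rows)
```

With real data, `split_into_tables(list_of_lists)` would take the place of the filtering loop. The heuristic will not suit every layout, so the separator test may need adjusting for the table in question.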
