Open
Description
I am not sure whether this is really an issue with the parser or rather an improvement request, unless a solution is already available when using this parser.
Consider a complex table such as the one for which I am providing the source code in a file at https://ufile.io/tee2c
As you will see, this table, even though it is encapsulated in a single <table> tag, contains more than one table. When I use the parser, the output I get is therefore a single consolidated output, as expected. Is it possible for the parser to somehow recognize the valid tables inside a single table doc for such a complex table, so that we can get a separate output corresponding to each valid table section?
Here's the code I have, but I am not sure how to parse the table at a more granular level as described above:
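# Import path assumes the html-table-extractor package.
from html_table_extractor.extractor import Extractor

# table_doc holds the table markup obtained earlier (not shown here).
extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()

new_list = []
for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # Skip rows whose cells are all identical, and rows containing an item
    # with > 200 characters: that is probably a paragraph string and hence
    # not a valid table item.
    if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
        continue
    new_list.append(stripped_list)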
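
One possible way to get a separate output per table section is to treat the rows that the loop above skips as boundaries instead of just dropping them. The sketch below is only a heuristic and not a feature of the parser: it assumes that the sub-tables are separated by rows whose cells are all identical (blank rows, or a title spanning the full width that comes back repeated in every cell) or by paragraph-like rows. The function name `split_into_subtables`, the `max_cell_len` parameter and the sample rows are hypothetical and only serve as illustration.

```python
def split_into_subtables(rows, max_cell_len=200):
    """Group consecutive data rows; start a new group at each separator row.

    Heuristic (an assumption, not parser behaviour): a row is a separator
    if all of its cells are identical or if any cell looks like a paragraph.
    """
    tables = []
    current = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        is_separator = len(set(cells)) <= 1
        has_paragraph = any(len(cell) > max_cell_len for cell in cells)
        if is_separator or has_paragraph:
            # Close the group collected so far, if it contains any rows.
            if current:
                tables.append(current)
                current = []
            continue
        current.append(cells)
    # Keep the trailing group when the input does not end with a separator.
    if current:
        tables.append(current)
    return tables


# Made-up rows standing in for the list returned by the parser.
rows = [
    ["Section A", "Section A", "Section A"],
    ["Name", "Qty", "Price"],
    ["Widget", "2", "3.50"],
    ["", "", ""],
    ["Section B", "Section B", "Section B"],
    ["City", "Country", "Code"],
    ["Oslo", "Norway", "NO"],
]

for table in split_into_subtables(rows):
    print(table)
```

With the code above, this would be called as `split_into_subtables(list_of_lists)`. Whether the boundaries it finds match the real table sections depends on how the source table is laid out, so the separator test would likely need adjusting for the attached file.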