[feature request] Somehow exclude images/attachments when retrieving pages / chunking #29

@cadeon

Is your feature request related to a problem? Please describe.
I have to make my chunk sizes really large, or just use full context with big page sizes, because our pages have PNGs embedded in them. In the search results I can see strings like "(data:image/png;base64,iVBORw0KGgoAAAANS" and so on.

These strings eat up a ton of context and push useful results out. It would be nice if we could filter them somehow.
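
As a rough illustration of how much space an inline image takes, base64 encodes every 3 bytes as 4 characters. The sketch below estimates the character count for an embedded image and how many chunks it would span. The function names, the 100 KB image size, and the 1000-character chunk size are hypothetical values for illustration, not settings from this project.

```python
import math


def base64_chars(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of binary data."""
    # Every group of 3 input bytes (rounded up) becomes 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


def chunks_spanned(n_chars: int, chunk_size: int) -> int:
    """Number of non-overlapping chunks needed to hold n_chars characters."""
    return math.ceil(n_chars / chunk_size)


# Hypothetical example: a 100 KB screenshot and 1000-character chunks.
image_chars = base64_chars(100 * 1024)
print(image_chars)                        # 136536
print(chunks_spanned(image_chars, 1000))  # 137
```

Under these assumptions, a single modest screenshot produces more than a hundred chunks that contain no searchable text.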

Describe the solution you'd like
I'm not sure what the best approach would be.
Maybe there's a Confluence API that returns just the text? That would likely be ideal.
If not, maybe there's a way to strip embeds as the pages get pulled in?
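
One way the "strip embeds" idea could look is sketched below, assuming the page text arrives as Markdown or HTML with images inlined as `data:` URIs (as the search-result snippet suggests). The function name `strip_inline_images` and the `placeholder` parameter are hypothetical and not part of the existing tool. It uses only Python's standard `re` module.

```python
import re

# Markdown image whose target is an inline data URI, e.g.
# ![alt](data:image/png;base64,AAAA...)
MD_DATA_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*data:image/[^)]*\)")

# Any remaining bare base64 image data URI, e.g. inside an HTML src attribute.
BARE_DATA_URI = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def strip_inline_images(text: str, placeholder: str = "") -> str:
    """Replace inline base64 images with a short placeholder (hypothetical helper)."""
    # Remove whole Markdown image tags first so no empty "![]()" shells remain.
    text = MD_DATA_IMAGE.sub(placeholder, text)
    # Then remove any data URIs left over in other markup.
    return BARE_DATA_URI.sub(placeholder, text)


page = "Intro ![diagram](data:image/png;base64,iVBORw0KGgoAAAANS) outro"
print(strip_inline_images(page, "[image]"))  # Intro [image] outro
```

Running this before chunking would keep the surrounding text intact while dropping the encoded image bytes. Whether the retrieved pages really use this format is an assumption that would need checking against the actual Confluence responses.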

Describe alternatives you've considered
The workaround is full context with a big page size. RAG with really big chunk sizes and a fair amount of overlap sometimes works, too. Either way, the context window of the response-generation model also needs to be big.

Additional context
I'm happy to help / answer any questions as needed.

Labels: enhancement (New feature or request)
