[epic] Support for large queries from data API #19

@rufuspollock

Description

When querying the data API I want to be able to run queries that return 100k or 1m+ rows and download the results, so that I can extract the data I want even when it is large.

Acceptance

  • Design the solution
    • Consider authorization implications
  • Write-to-storage approach (a sketch follows this list)
    • Choose storage backend
    • Stream to storage from data API (or trigger background job)
    • Return download URL
  • Set up the switch from "streaming" to write-to-storage
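
A minimal sketch of the write-to-storage path, assuming a Postgres query backend (psycopg2) and S3 (boto3) as the storage backend; the bucket name, key scheme and helper name are illustrative and not decided by this issue:

```python
import hashlib
import tempfile

import boto3
import psycopg2

BUCKET = "query-exports"        # hypothetical bucket name
URL_TTL_SECONDS = 3600          # how long the presigned download link stays valid

def export_query(conn_dsn: str, sql: str) -> str:
    """Run the query, stream it to S3 as CSV, and return a presigned URL."""
    # Cache-friendly key: the same query text always maps to the same object.
    key = "exports/" + hashlib.sha256(sql.encode()).hexdigest() + ".csv"
    s3 = boto3.client("s3")

    with tempfile.TemporaryFile() as buf:
        with psycopg2.connect(conn_dsn) as conn, conn.cursor() as cur:
            # COPY streams rows out of Postgres; `sql` is assumed to have been
            # validated/authorized before it reaches this point.
            cur.copy_expert(f"COPY ({sql}) TO STDOUT WITH CSV HEADER", buf)
        buf.seek(0)
        # upload_fileobj handles multipart upload for large files
        s3.upload_fileobj(buf, BUCKET, key)

    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=URL_TTL_SECONDS,
    )
```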

Analysis

There are 2 approaches:

  1. Stream the whole result
  2. Extract the query results to storage and give the user a storage URL

One can also use a hybrid approach, e.g. do the former up to some number of results and then switch to option 2.
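
A rough sketch of how that hybrid switch could look in the API handler; the 100k-row threshold and all function names here are placeholders:

```python
from typing import Iterator

STREAM_LIMIT = 100_000  # rows; the real cut-over point would need benchmarking

def count_rows(sql: str) -> int:
    """Placeholder: e.g. SELECT count(*) FROM (<sql>) AS q against the backend."""
    raise NotImplementedError

def stream_rows(sql: str) -> Iterator[dict]:
    """Placeholder: yield rows straight into the HTTP response (approach 1)."""
    raise NotImplementedError

def export_to_storage(sql: str) -> str:
    """Placeholder: write the full result to storage and return a download URL
    (approach 2, see the export sketch above)."""
    raise NotImplementedError

def handle_query(sql: str) -> dict:
    n = count_rows(sql)
    if n <= STREAM_LIMIT:
        # Small result: stream it back directly in the response body.
        return {"format": "stream", "rows": stream_rows(sql)}
    # Large result: export to storage and return the URL instead.
    return {"format": "url", "url": export_to_storage(sql), "count": n}
```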

There are several advantages of option 2:

  • You get a natural cache structure in storage, so an identical query may not need to be recomputed (exported results can be expired after some time period); see the cache-lookup sketch after this list
  • If your download/stream is interrupted you can resume it from storage (rather than re-running the query)
  • You give the end user the option to share the file for a certain period of time with other users
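
To illustrate the cache point: if the export key is derived from the query text, a repeated query can be answered from storage without re-running it. A sketch, assuming S3 and a fixed 24-hour TTL (both assumptions):

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

import boto3
from botocore.exceptions import ClientError

BUCKET = "query-exports"          # hypothetical bucket
CACHE_TTL = timedelta(hours=24)   # expiry period for exported results (assumed)

def cached_export_key(sql: str) -> Optional[str]:
    """Return the storage key for this query if a fresh export already exists."""
    # Same key scheme as the export sketch above: key = hash of the query text.
    key = "exports/" + hashlib.sha256(sql.encode()).hexdigest() + ".csv"
    s3 = boto3.client("s3")
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        return None               # no previous export for this query
    age = datetime.now(timezone.utc) - head["LastModified"]
    return key if age < CACHE_TTL else None
```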

The disadvantages (at least for small data sizes):

  • It's more complex / more work on backend
  • Slower (greater latency)
  • More complex for the user: two steps where there was one (run the query, then extract the download URL and fetch the file); see the client-side sketch below
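
For comparison, the two-step flow from the user's side might look like this; the endpoint and response field names are hypothetical:

```python
import requests

API = "https://example.org/data-api/query"   # hypothetical endpoint

def download_large_result(sql: str, out_path: str) -> None:
    # Step 1: run the query; for large results the API is assumed to answer
    # with a download URL instead of inlining the rows.
    resp = requests.post(API, json={"sql": sql})
    resp.raise_for_status()
    url = resp.json()["download_url"]          # hypothetical response field

    # Step 2: fetch the exported file from storage, streaming it to disk.
    with requests.get(url, stream=True) as dl:
        dl.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in dl.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```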
