@betatim
Contributor

betatim commented Mar 22, 2018

Closes #51

Work-in-progress code to try out storing intermediate results as pickles instead of NPZ files.

I'm not sure how best to benchmark this. As one example, it shortens the time between running label-maker images and the "Downloading 10874 tiles to ..." message being printed. With this branch there is almost no delay between starting label-maker and seeing that message. With the npz setup and the zurich.json below, the delay is minutes or longer (I can measure it later).

(This branch needs some cleanup before merging, but I wanted to show the basic idea.)


{
    "country": "switzerland",
    "bounding_box": [8.488103,47.359111,8.582088,47.407637],
    "zoom": 19,
    "classes": [
      { "name": "Pools", "filter": ["==", "leisure", "swimming_pool"] },
      { "name": "Bridge", "filter": ["has", "bridge"], "buffer": 5 },
      { "name": "Roads", "filter": ["all",
        ["has", "highway"],
        ["in", "highway", "motorway", "primary", "secondary", "residential"]
      ], "buffer": 3
      },
      { "name": "Buildings", "filter": ["has", "building"], "buffer": 3 },
      { "name": "Water", "filter": ["==", "natural", "water"] },
      { "name": "Forest", "filter": ["==", "landuse", "forest"] }
    ],
    "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=your_token_here",
    "background_ratio": 1,
    "ml_type": "classification"
  }

@drewbo
Contributor

drewbo commented Apr 17, 2018

@betatim update here: I timed the two on a separate dataset (50k tiles) and pickle was considerably faster to load (~4 seconds vs. 90 seconds). I don't fully understand why, but this line is the culprit; my guess is that iterating over the file list of an npz object is much less efficient than calling .items() on a dict.
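For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. It is not the label-maker code: the tile names, the label shape, and the helper names (`make_labels`, `load_npz`, `load_pickle`, `compare`) are all made up for illustration. It writes the same dict of small arrays as both an npz and a pickle, then times reading each one back into a plain dict.

```python
import os
import pickle
import tempfile
import time

import numpy as np


def make_labels(n_tiles, n_classes=7):
    """Build a hypothetical {tile_name: label_vector} dict (illustrative only)."""
    rng = np.random.default_rng(0)
    return {
        "{}-{}-19".format(i, i + 1): rng.integers(0, 2, size=n_classes)
        for i in range(n_tiles)
    }


def load_npz(path):
    # Each lookup on the archive reads and decodes one stored array,
    # so this loop does one read per key.
    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}


def load_pickle(path):
    # The whole dict comes back from a single file in one call.
    with open(path, "rb") as f:
        return pickle.load(f)


def compare(n_tiles=2000):
    """Save the same labels both ways and time loading each back."""
    labels = make_labels(n_tiles)
    with tempfile.TemporaryDirectory() as tmp:
        npz_path = os.path.join(tmp, "labels.npz")
        pkl_path = os.path.join(tmp, "labels.pkl")
        np.savez(npz_path, **labels)
        with open(pkl_path, "wb") as f:
            pickle.dump(labels, f, protocol=pickle.HIGHEST_PROTOCOL)

        t0 = time.perf_counter()
        from_npz = load_npz(npz_path)
        t1 = time.perf_counter()
        from_pkl = load_pickle(pkl_path)
        t2 = time.perf_counter()
    return from_npz, from_pkl, t1 - t0, t2 - t1


if __name__ == "__main__":
    _, _, npz_seconds, pkl_seconds = compare()
    print("npz: {:.3f}s  pickle: {:.3f}s".format(npz_seconds, pkl_seconds))
```

The absolute numbers will depend on the machine and the number of tiles, so treat the output as a rough indication rather than a benchmark.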

Do you want to remove the older commented code and then I'll merge?

@betatim
Contributor Author

betatim commented Apr 18, 2018

I think the problem is in how the npz is read. If you dig a bit into the numpy docs, they suggest that (maybe) an npz is implemented as one file per key. So I would not be surprised if, internally, there is one open() call for each key. That would be much slower than open('foo.pkl').read().
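A quick way to check the "one file per key" idea is to open an npz with the standard `zipfile` module and list its members. This is only a small sketch, and the helper name `npz_member_names` is made up here; it writes an npz to an in-memory buffer with `np.savez` and returns the names stored in the archive.

```python
import io
import zipfile

import numpy as np


def npz_member_names(arrays):
    """Save `arrays` with np.savez and list the members of the resulting zip."""
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    # One .npy member per key that was saved.
    print(npz_member_names({"a": np.zeros(3), "b": np.ones(2)}))
```

If the archive holds one member per key, then loading every key means one zip read and one array decode per tile, which would fit the slowdown seen with tens of thousands of tiles.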

I'll remove the commented code and take a look at the failing tests.
