DataLoader uses way too much memory #210

@oluwandabira

I was trying to train on ImageNet and noticed that DataLoader kept running out of memory (OOM), so I had to write my own simple loader, which may not be as fast but doesn't crash.

I made this minimal working example (MWE) to demonstrate the issue:

using MLUtils
using Random

struct RandDataset
    len::Int
end

Base.length(dataset::RandDataset) = dataset.len

function Base.getindex(dataset::RandDataset, idx::Int)
    return rand(Float32, 224, 224, 3), rand(1:1000)
end

function Base.getindex(dataset::RandDataset, indices::AbstractVector{Int})
    return [dataset[idx] for idx in indices]
end

function Base.getindex(dataset::RandDataset, range::AbstractRange{Int})
    return [dataset[idx] for idx in range]
end

struct SimpleLoader{D}
    dataset::D
    batchsize::Int
    shuffle::Bool
    indices::Vector{Int}
end

function SimpleLoader(dataset, batchsize::Int; shuffle::Bool=false)
    indices = collect(1:numobs(dataset))
    return SimpleLoader(dataset, batchsize, shuffle, indices)
end

function load_batch(dataset, indices::Vector{Int})
    actual_batch_size = length(indices)
    batch_array = Array{Float32}(undef, 224, 224, 3, actual_batch_size)
    labels = Vector{Int}(undef, actual_batch_size)

    Threads.@threads for idx in eachindex(indices)
        i = indices[idx]
        img, lbl = dataset[i]
        batch_array[:, :, :, idx] .= img
        labels[idx] = lbl
    end

    return batch_array, labels
end

function Base.iterate(loader::SimpleLoader, state::Int = 1)
    if state > length(loader.indices)
        return nothing
    end

    if loader.shuffle && state == 1
        Random.shuffle!(loader.indices)
    end

    end_idx = min(state + loader.batchsize - 1, length(loader.indices))
    batch_inds = loader.indices[state:end_idx]

    batch = load_batch(loader.dataset, batch_inds)
    return (batch, end_idx + 1)
end

dataset = RandDataset(1_200_000)
#dl = DataLoader(dataset, batchsize=1024, collate=true, parallel=true, partial=true)
dl = SimpleLoader(dataset, 1024)
@time for _ in dl end
@time for _ in dl end

Results from running DataLoader with 16 threads:

 92.184423 seconds (8.12 M allocations: 1.315 TiB, 18.53% gc time, 319 lock conflicts, 2.68% compilation time)
 91.870471 seconds (6.62 M allocations: 1.314 TiB, 16.23% gc time, 210 lock conflicts)
	Command being timed: "julia --project=. --threads=16 benchmark_imagenet_loading.jl"
	User time (seconds): 286.29
	System time (seconds): 297.57
	Percent of CPU this job got: 312%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:07.08
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 40386076
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 61203304
	Voluntary context switches: 3746914
	Involuntary context switches: 1050
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Results from running SimpleLoader with 16 threads:

 74.446160 seconds (4.85 M allocations: 1.314 TiB, 48.53% gc time, 13.69% compilation time)
 74.961892 seconds (3.71 M allocations: 1.314 TiB, 49.67% gc time)
	Command being timed: "julia --project=. --threads=16 benchmark_imagenet_loading.jl"
	User time (seconds): 444.61
	System time (seconds): 632.58
	Percent of CPU this job got: 709%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:31.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2678888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 163159298
	Voluntary context switches: 2486217
	Involuntary context switches: 865
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I know I'm not using the buffer option (I couldn't figure out how to), but even so, the simple loader peaks at around 2.7 GB of resident memory while DataLoader peaks at around 40 GB, and DataLoader isn't even faster.
My simple loader knows the item shapes and preallocates each batch, so I don't expect DataLoader to be faster or to use less memory, but the amount it uses seems excessive. Is there a memory leak somewhere?
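
As a rough sanity check on the numbers above, here is a small sketch of the expected size of one sample, one batch, and one full pass over the dataset. It assumes only the 224×224×3 Float32 shape, the batch size of 1024, and the dataset length of 1_200_000 from the MWE; the variable names are made up for illustration.

```julia
# Back-of-the-envelope memory arithmetic for the MWE above.
# All names here are illustrative; only the shapes and counts come from the MWE.
sample_bytes = 224 * 224 * 3 * sizeof(Float32)   # one image
batch_bytes  = sample_bytes * 1024                # one collated batch
epoch_bytes  = sample_bytes * 1_200_000           # every image generated once

println("sample: ", sample_bytes, " bytes")
println("batch:  ", round(batch_bytes / 2^20, digits=1), " MiB")
println("epoch:  ", round(epoch_bytes / 2^40, digits=3), " TiB")
```

Under these assumptions one batch is 588 MiB, and generating every image once plus copying it into a batch comes to roughly 2 × 0.657 ≈ 1.314 TiB, which is in line with the total allocations that @time reports for both loaders. By the same arithmetic, the simple loader's peak corresponds to a handful of batches being alive at once, while DataLoader's peak corresponds to several dozen batches' worth of data.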