@hukumka

@hukumka hukumka commented Sep 30, 2020

Hello, and thanks for this awesome library.

This PR is a step toward #18 and implements generation of areas using OpenCL.

Lacking features

  • Layers past L_SHORE_16
  • Support for different Minecraft versions

Performance

When generating 64 seeds per routine, I observed a 30x speedup.
When generating 1 seed per routine, the speedup is only 5x.

Terribly sorry for dumping such a large chunk of code in a single PR, but I needed to see whether
my approach for avoiding recomputing the same layer multiple times works before I submitted this.

This implementation is a proof of concept; the missing features are listed above.
@Cubitect
Owner

Cubitect commented Oct 2, 2020

Thanks for the interest. I was always a little sceptical about performance with a GPU. Generating giant areas in one go might work reasonably well on a GPU, but the code is highly reliant on branching, which is like poison to a GPU and to SSE instructions. Also, I find myself needing small areas much more often than large ones, which makes this problem much worse. So I always leaned towards distributing the workload across CPU cores instead. That said, I'm quite interested to see what the performance would actually be using a GPU in different scenarios.

While checking out your branch I found a bug in the cubiomes library that caused allocCache to allocate too little memory when the entry point was one of the first few layers. That should be fixed now.

I found a couple of issues with the draft. I think at ocl_test.c:47 it should be bufferA[i + j*W] without the + s*W*H, and it does not seem to work for area sizes below 32x32.

out[xx + 1 + zz * w] = (cs >> 24) & 1 ? v10 : v00;
}
int v;
if (v10 == v01 && v01 == v11) v = v10;
Contributor

I did a few experiments one day trying to remove these branches, which come from select_mode_or_random. This is the alternative that worked best for me; it is about 25% faster than the "if cascade" on my CPU. I hope the difference is bigger on a GPU, but I can't try that myself:

https://github.com/Badel2/slime_seed_finder/blob/9334b161bd4b7b7b8d7251e48623d5803707921c/benches/select_mode_or_random.rs#L118

Suggested change
if (v10 == v01 && v01 == v11) v = v10;
int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
int cv10 = (v10 == v01) + (v10 == v11);
int cv01 = v01 == v11;
if (cv00 > cv10 && cv00 > cv01) {
    v = v00;
} else if (cv10 > cv00) {
    v = v10;
} else if (cv01 > cv00) {
    v = v01;
} else {
    // v = random
}
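
The counting variant above can be sanity-checked against a brute-force definition. The following is only a sketch, assuming the intended behaviour is "pick the value that occurs more often than any other, otherwise fall back to a random pick". The function names and the `RANDOM_PICK` sentinel are hypothetical and are not part of cubiomes or this PR; the sentinel stands in for the random branch, so inputs are assumed to be non-negative.

```c
#include <assert.h>

/* Hypothetical sentinel meaning "no unique mode, caller draws a random value". */
#define RANDOM_PICK (-1)

/* Counting variant from the suggestion, written as plain C. */
static int select_mode_counting(int v00, int v10, int v01, int v11)
{
    int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
    int cv10 = (v10 == v01) + (v10 == v11);
    int cv01 = (v01 == v11);
    if (cv00 > cv10 && cv00 > cv01) return v00;
    if (cv10 > cv00) return v10;
    if (cv01 > cv00) return v01;
    return RANDOM_PICK;
}

/* Brute-force reference: the value that occurs strictly more often than
 * every other value, or RANDOM_PICK when there is no such value. */
static int select_mode_reference(int v00, int v10, int v01, int v11)
{
    int v[4] = { v00, v10, v01, v11 };
    int best = RANDOM_PICK, best_count = 0, tie = 0;
    for (int i = 0; i < 4; i++) {
        int count = 0;
        for (int j = 0; j < 4; j++) count += (v[j] == v[i]);
        if (count > best_count) {
            best = v[i]; best_count = count; tie = 0;
        } else if (count == best_count && v[i] != best) {
            tie = 1; /* a different value is equally frequent */
        }
    }
    return (tie || best_count < 2) ? RANDOM_PICK : best;
}

/* Compare both versions for every input with values in 0..3, which
 * covers every possible equality pattern among four values. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int c = 0; c < 4; c++)
                for (int d = 0; d < 4; d++)
                    mismatches += select_mode_counting(a, b, c, d)
                               != select_mode_reference(a, b, c, d);
    return mismatches;
}

/* Aborts if the two versions ever disagree. */
static void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling `self_check()` from a test program's `main` exercises all 256 input combinations. A check like this only covers which value is selected; it says nothing about speed on a CPU or GPU.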

Owner

@Cubitect Cubitect Oct 5, 2020

This looks great! I see you did a lot of testing and the assembly does look significantly better, if only for the CPU. I did some rudimentary testing with CUDA C, and I was surprised that the improvement was only minor for a GPU. After some digging I found that the nvcc compiler manages to reduce the branching for this part of the device code quite well on its own (at least better than gcc).
