-
Notifications
You must be signed in to change notification settings - Fork 18
Added different predictors proc and new AVX2 SAD #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: mvtools-pfmod
Are you sure you want to change the base?
Conversation
Added different predictors proc and new AVX2 SAD for ExhaustiveSearch
Added new more AVX2 Esa search sp2 and sp4. One pass only without immediate array of SADs.
|
Does this second commit replace the previous one, so are they doing the same just the new one is quicker/uses less temporary storage? |
|
Greetings !
Todays commint it is my long awaiting first release of idea about full
Esa search inside SIMD register file (at least it only reloads ref
lines from cache at V-steps and not produce immediate write traffic -
only 1 final output write result). It possibly perspective for future
multicores chips because it creates much less write traffic (that
cause cache coherence protocol transactions between cores I think).
Though it uses a bit more SIMD AVX2 instructions.
Also it uses much less intermetiate memory for immediate SADs of all
search positions array and it is good step to the future of 'array of
blocks vs ref plane' search (need only 1 SAD and 1 x,y memory output
per block).
Also it skips simple but may be still slow 2 loops search inside 5x5
(or 9x8 for r=4) immediate array for minsad.
About final output - I use m256 write instruction because do not found
AVX2 for writing 64bits only. As I remember it may be penalty fo using
SSE instruction for writing half of SSE-128 register after AVX2-256 -
may I wrong.
I make it this morning and also debug with SDE - I go to a day work
this evening and will have a time at work to test many already
comittet tests tomorrow. It was an idea to write you a message about
new Esa search functions update after more tests with real 6core AVX2
CPU for both speed and output MDegrainN results and encoding results.
I will try to write more results tomorrow till end of day (in eupore).
About C-reference - as it is very simple I still not dot it as
promise. I hope tomorrow.
It all about:
Old processing
for(dy = -r; dy < +r; dy++)
for(dx = -r; dx < +r; dx++)
ArraySADs[dx, dy] = CalcSAD(dx, dy);
}
}
+make search for minsad and save minsad_x and minsad_y in C-loops
New method:
minsad=max_sad_value; xcurrent=0, ycurrent=0;
for(dy = -r; dy < +r; dy++)
for(dx = -r; dx < +r; dx++)
sad = CalcSAD(dx, dy);
if (sad < minsad)
{
minsad=sad;
x_minsad=xcurrent;
y_minsad=xcurrent;
}
xcurrent+=1;
}
ycurrent+=1;
}
save ymm_reg with x_minsad, y_minsad, minsad and parse.
…On 10/19/21, Ferenc Pintér ***@***.***> wrote:
Does this second commit replace the previous one, so are they doing the same
just the new one is quicker/uses less temporary storage?
Really needs human readable C implementation of these black boxes without
any tricks inside.
--
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
#45 (comment)
|
|
(Note: Until I upload an |
|
Integrated by cherry pick directly from your repo. Usually this is the way for codes which are not ready-to-merge. then cherry picked from IDE and continued with the integration. You'll need optsearchoption = 1 for MAnalyze and MRecalculate |
Added C-ref functions (uses SAD() as old), fixed bug with ends of ArraySADs for sp2, added prefetch attempt.
|
I uploaded new commint with C-ref functions and some fixes to _avx2. Do it correct or I need to take (fork) latest repository and add new files to it and create new pull request ? |
|
I delete and fork my repository again to have new copy with a places where to put C-ref functions and corrected _avx2. So need to open new pull-request with new commited changes or it will adds to this ? |
|
Oh - I see pinterf already merge my latest commit. The only comment (it looks it not works in web-interface of github ?) - line 803 of PlaneofBlocks_avx2.cpp - mvtools/Sources/PlaneOfBlocks_avx2.cpp Line 802 in 0361d4b
It looks I or you forgot to add prefetch line: _mm_prefetch(const_cast<const CHAR*>(reinterpret_cast<const CHAR*>(pucRef + nRefPitch[0] * (i + 8))), _MM_HINT_NTA); // prefetch next Ref row (same as now in line 114 in sp4 function). I have more ideas for optimize of Esa_avx2 search (more compacting storage of current and ref data for sad calculation and using of 2 _mm256_sad_epu8() instructions instead of 4 for 8x8 block and may be more attempts to use SIMD minsad instruction instead of 25-points C-loop search). Will try to make tests if it help to speed to the end of Saturday 23.10.2021. |
|
The merge was full manual, since the commits and changes involved different files and coding conventions. |
I load all new commits and this new file. But after I commit new edits to separate branch - DTL2020@f23f587 It strangely changes PlaneOfBlocks.cpp file with almost all old deletions and all new additions. Is it correct and as expected ? I use VS2017 at home. |
|
Line 206 of PlaneofBlocks.cpp - used and instead of && mvtools/Sources/PlaneOfBlocks.cpp Line 206 in bbc2ac7
It confuses Intel C++ 19.1 compiler. But looks like works with VS2019. Also found the main issue of the new _avx2 search functions - they work correctly only with nPel=1. Because Pel=2 uses different x,y of vector to memory pointer of start address of each ref block conversion (not contiguous even in Horizontal x-direction). In old functions ExpandingSearch (and in c_ref for _avx2) it is done by calling of GetRefBlock() for each ref block in the search. And current _avx2 functions load all columns of ref plane expecting it is contigous in memory. So need 1 more check for use. For nPel=2 it looks also possible to make separate set of functions but it need completely different 'gathering' of input data or re-design pel=2 plane storage to be contigous in memory (?). |
|
aaaaa, only noticed for fifth re-read. So yes, there is 'and' instead of && :) |
Dunno. Line endings??? CRLF vs LF? |
|
I make new commit to main branch - DTL2020@0f3e9d7 . Now the sp2 going to be faster and also for nPel=1 added sp1 using latest method of calculating sad and searching for minimum. Sp4 still old and sp3 need to be added to have all sp 1,2,3,4 filled for nPel=1 at least. It looks like this commit not listed here. Do I need to close this and open new pull-request ? Also it looks editor again change something in all lines of files and github list almost all lines as changed - do not know what happens. |
|
I'm gonna then cherry-pick again. I've still got your repo as remote so I can easily access your commits. |
As I see it mostly about line beginnings - almost every line is shifted (to the left or right). May be it re-format all document to |
|
I hope that was the last time when I needed to cherry-pick. Commited a lot, including (hopefully) all your changes, uniformizing spaces and line endings. It was easy when I commited alone but now there must be order. |
|
Yes - it looks most new is integrated. Some known inconsistency between C-refs and _avx2 Esa search functions: If there are more 1 equal minsad value the C-ref return last found x,y pair and _avx2 looks like returns first (as the instruction _mm_minpos_epu16() internal logic). So if more equal result required - may be make C-ref reverse order scan (from botom right to top left ) ? Though it is rare enough situation and may be not any danger for MDegrainN because if 2 or more blocks have equal sad they possibly can be blended without distortions. But the some 'intelectual logic' may treat as different motion vectors. Also the old Expanding search may be make different result because of different scan order (from center of area to the edges) and _avx2 uses scan from top-left corner. So it also creates different result if 'workarea' contain >1 blocks with completely equal sad value. Also the Expanding Search from center uses 'cost of new minsad' compare at each search step (CheckMV()) and new _avx2 only at the end of full scan of area. |
|
Once I had a week investigating why different runs sometimes yield different output. (Debugging nightmare: we get a slightly different vectors for once in ten thousand frames, happening once in 10 runs) |

Added different predictors proc and new AVX2 SAD for ExhaustiveSearch