assembly - Contrast reduction - Intel x86
I am supposed to do a project to pass a course. I would like to ask if there is any possibility to make my code more efficient or just better. I'm doing this because my coordinator is a very meticulous perfectionist and crazy about efficiency. It's a hybrid program that modifies a 24bpp bitmap. It does contrast reduction, and the algorithm looks like this (it's approved by my coordinator):
comp -= 128;
comp *= rfactor;
comp /= 128;
comp += 128;
'comp' means every component of a pixel, literally: every value of red, green, and blue in every pixel. The function does just this, and I read the file using functions in C. I forward to the assembly routine an array of components, the width of the BMP, the amount of pixels in each line, and 'rfactor', the value of contrast reduction. Then I just do this:
; void contrast(void *img, int width, int linewidth, int rfactor);
; stack: ebp+8  -> *img
;        ebp+12 -> width [px]
;        ebp+16 -> linewidth [b]
;        ebp+20 -> rfactor (values in range of 1-128)
section .text
global contrast

contrast:
    push ebp
    mov ebp, esp
    push ebx
    mov ebx, [ebp+12]     ; width
    mov eax, [ebp+16]     ; linewidth
    mul ebx               ; how many components to reduce
    mov ecx, eax          ; set counter
    mov edx, [ebp+8]      ; edx = pointer at img
    mov ebx, [ebp+20]     ; ebx = rfactor

loop:
    xor eax, eax
    dec ecx               ; decrement counter
    mov al, [edx]         ; current component to al
    add eax, -128
    imul bl               ; component*rfactor (8-bit signed, so bl = 128 is read as -128)
    sar eax, 7            ; component/128
    add eax, 128
    mov byte [edx], al    ; put the component back
    inc edx               ; next component
    test ecx, ecx         ; is counter 0?
    jnz loop

koniec:
    pop ebx
    mov esp, ebp
    pop ebp
    ret
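For checking the assembly output, a plain C version of the same formula can serve as a reference. This is only a sketch: the names `contrast_component` and `contrast_ref` are made up here and are not part of the original program, and the buffer is treated as a flat run of bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version of the formula: comp = (comp - 128) * rfactor / 128 + 128.
 * C's `/` truncates toward zero, whereas `sar eax, 7` rounds toward minus
 * infinity, so the two can differ by one for components below 128. */
static uint8_t contrast_component(uint8_t comp, int rfactor)
{
    assert(rfactor >= 1 && rfactor <= 128);
    int v = (int)comp - 128;   /* centre around zero: -128..127 */
    v *= rfactor;              /* scale by the reduction factor */
    v /= 128;                  /* truncating division */
    v += 128;                  /* shift back to 0..255 */
    return (uint8_t)v;
}

/* Apply the reduction to every byte of a buffer (hypothetical helper). */
static void contrast_ref(uint8_t *img, size_t nbytes, int rfactor)
{
    for (size_t i = 0; i < nbytes; ++i)
        img[i] = contrast_component(img[i], rfactor);
}
```

With `rfactor` = 64 the values 0, 128 and 255 map to 64, 128 and 191, so everything is pulled halfway toward mid-grey.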
Is there anything more to improve? Thank you for all suggestions, I have to impress my coordinator ;)
If you are still interested in a SIMD version, here is one.
It uses AVX2 instructions, so you need at least a 4th generation processor (Haswell micro-architecture).
bits 32

global _contrast

section .code

;rfactor
;linewidth
;width
;ptr to buffer
_contrast:
    push ebp
    mov ebp, esp
    and esp, 0fffffff0h

    push edi
    push esi
    push ebx
    push eax

    mov eax, dword [ebp+0ch]        ;width
    mul dword [ebp+10h]             ;total bytes
    mov ecx, eax                    ;number of bytes to process
    shr ecx, 04h                    ;process chunks of 16 bytes per cycle

    mov edi, dword [ebp+08h]        ;buffer

    ;--- prepare ymm registers ---
    vzeroall
    sub esp, 10h

    ;ymm1 contains the r factor (x16)
    movzx ebx, word [ebp+14h]
    mov dword [esp], ebx
    vpbroadcastw ymm1, word [esp]   ;ymm1 = r (x16)

    ;ymm0 contains the 128-r value (x16)
    neg word [esp]                  ;-r
    mov ax, 128                     ;load all of ax, ah still holds part of the byte count
    add word [esp], ax              ;128-r
    vpbroadcastw ymm0, word [esp]   ;ymm0 = 128-r (x16)

    add esp, 10h

.loop:
    ;compute channel values
    vpmovzxbw ymm2, [edi]           ;16 channels (128 bit) to 16 words
    vpmullw ymm2, ymm2, ymm1        ;ymm2 = in*r
    vpsrlw ymm2, ymm2, 7            ;ymm2 = in*r>>7
    vpaddw ymm2, ymm2, ymm0         ;ymm2 = in*r>>7 + 128-r
    vpackuswb ymm2, ymm2, ymm2      ;pack words to bytes (works per 128-bit lane)
    vpermq ymm2, ymm2, 08h          ;gather qwords 0 and 2: xmm2 = 16 computed values

    ;store to memory
    vmovdqu [edi], xmm2             ;unaligned store, so any buffer alignment works
    add edi, 10h
    loop .loop

    pop eax
    pop ebx
    pop esi
    pop edi

    mov esp, ebp
    pop ebp
    ret
I have tested it by comparing its output with the output of your code.
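The SIMD loop relies on rewriting (in-128)*r/128 + 128 as (in*r >> 7) + (128 - r), which keeps every intermediate value non-negative. A small C sketch can confirm the two forms agree for every input when the scalar division rounds down like `sar`. The function names below are hypothetical and only model one 16-bit lane, not the assembly itself.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 128 for possibly negative values, mirroring `sar x, 7`
 * without relying on implementation-defined signed shifts in C. */
static int floor_div128(int x)
{
    int q = x / 128;
    if (x % 128 < 0)
        --q;
    return q;
}

/* Scalar formula with floor rounding: ((comp - 128) * r >> 7) + 128. */
static uint8_t contrast_scalar(uint8_t comp, int r)
{
    assert(r >= 1 && r <= 128);
    return (uint8_t)(floor_div128(((int)comp - 128) * r) + 128);
}

/* Rearranged form used per 16-bit lane: (comp * r >> 7) + (128 - r).
 * comp * r is at most 255 * 128 = 32640, so it fits an unsigned word. */
static uint8_t contrast_rearranged(uint8_t comp, int r)
{
    assert(r >= 1 && r <= 128);
    uint16_t w = (uint16_t)((unsigned)comp * (unsigned)r);
    w = (uint16_t)(w >> 7);
    w = (uint16_t)(w + (128 - r));
    return (uint8_t)w;
}

/* Returns the number of (comp, r) pairs where the two forms disagree. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int r = 1; r <= 128; ++r)
        for (int c = 0; c <= 255; ++c)
            if (contrast_scalar((uint8_t)c, r) !=
                contrast_rearranged((uint8_t)c, r))
                ++bad;
    return bad;
}
```

The forms match because r is an integer, so subtracting it before or after rounding down gives the same result. The sum never exceeds 255, which means the saturating pack never has to clamp.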
The prototype in C is the old one (with linewidth):
void contrast(void* buffer, unsigned int width, unsigned int bpp, unsigned short rfactor);
I have done some profiling on my machine. I ran this version and the one in your answer on a 2048x20480 image (120 MiB buffer) 10 times. Your code takes 2.93 seconds, this one 1.09 seconds, though these timings may not be accurate.
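A timing loop of this kind can be sketched in C roughly as follows. This is not the harness used for the numbers above: `contrast_fn`, `time_runs` and the stand-in `contrast_plain` are hypothetical names, and `clock()` measures CPU time at coarse resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature of a routine under test; a hypothetical wrapper type, not the
 * prototype of either assembly routine. */
typedef void (*contrast_fn)(uint8_t *buf, size_t nbytes, int rfactor);

/* Stand-in C routine so the harness can be tried without the assembly. */
static void contrast_plain(uint8_t *buf, size_t nbytes, int rfactor)
{
    for (size_t i = 0; i < nbytes; ++i)
        buf[i] = (uint8_t)(((int)buf[i] - 128) * rfactor / 128 + 128);
}

/* Run fn `runs` times over the same buffer and return elapsed CPU seconds. */
static double time_runs(contrast_fn fn, uint8_t *buf, size_t nbytes,
                        int rfactor, int runs)
{
    assert(fn != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(buf, nbytes, rfactor);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Each run transforms the buffer again, so the pixel values drift toward 128. The amount of work per run stays the same, which is what matters for the timing.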
This version requires the buffer size to be a multiple of 16 (because it processes 16 bytes per cycle, that is 5 and one third pixels at a time), but you can pad it with zeros. If the buffer is aligned on a 16 byte boundary it will run faster.
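On the C side, the padding and alignment could be arranged along these lines. This is a sketch that assumes a C11 library providing `aligned_alloc`; `pad16` and `make_padded_copy` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round a byte count up to the next multiple of 16. */
static size_t pad16(size_t nbytes)
{
    return (nbytes + 15u) & ~(size_t)15u;
}

/* Copy pixel data into a fresh 16-byte-aligned buffer whose size is a
 * multiple of 16, zero-filling the tail. The caller releases it with free().
 * Returns NULL on allocation failure. */
static uint8_t *make_padded_copy(const uint8_t *src, size_t nbytes,
                                 size_t *padded_size)
{
    assert(src != NULL && padded_size != NULL);
    size_t total = pad16(nbytes);
    if (total == 0)
        total = 16;            /* avoid a zero-sized allocation request */
    uint8_t *dst = aligned_alloc(16, total);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, nbytes);                 /* real pixel data */
    memset(dst + nbytes, 0, total - nbytes);  /* zero padding */
    *padded_size = total;
    return dst;
}
```

The routine would then be run over the padded size, and only the first `nbytes` bytes copied back. The padding bytes get transformed too, but they are simply discarded.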
If you want a more detailed answer (with useful comments, for example :D) just ask in the comments.
EDIT: Updated the code with the great help of Peter Cordes, for future reference.