Assembly - Contrast reduction - Intel x86


I am supposed to write a project to pass a course. I would like to ask if there is any possibility to make my code more effective, or just better. I'm doing this because my coordinator is a very meticulous perfectionist and is crazy about efficiency. It's a hybrid program that modifies a 24bpp bitmap. It's a contrast reduction, and the algorithm looks like this (it's approved by my coordinator):

    comp -= 128;
    comp *= rfactor;
    comp /= 128;
    comp += 128;
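For reference, the four steps can be sketched in C for a single 8-bit component. This is only an illustration: the function name `reduce_component` is hypothetical, `rfactor` is assumed to lie in 1..128, and the division here is plain C integer division, which truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce the contrast of one 8-bit colour component.
 * rfactor is assumed to be in 1..128 (128 leaves the value unchanged). */
static uint8_t reduce_component(uint8_t comp, int rfactor)
{
    assert(rfactor >= 1 && rfactor <= 128);
    int v = (int)comp - 128;   /* centre around zero: -128..127 */
    v *= rfactor;              /* scale by the reduction factor */
    v /= 128;                  /* C division truncates toward zero */
    v += 128;                  /* move back into 0..255 */
    return (uint8_t)v;
}
```

With `rfactor = 64`, for example, the component values 0, 128 and 255 map to 64, 128 and 191, so the range is squeezed toward mid-grey.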

'comp' means every component of a pixel, literally: every value of red, green and blue in every pixel. The function does just this, and I read from the file using functions in C. I forward to assembly an array with the components, the width of the bmp, the amount of pixels in each line, and 'rfactor', the value of contrast reduction. Then I just do this:

    ;  void contrast(void *img, int width, int linewidth, int rfactor);
    ;   stack:  ebp+8  ->  *img
    ;           ebp+12 ->  width [px]
    ;           ebp+16 ->  linewidth [b]
    ;           ebp+20 ->  rfactor (values in range of 1-128)

    section .text
    global  contrast

    contrast:
        push    ebp
        mov     ebp, esp
        push    ebx

        mov     ebx, [ebp+12]   ; width
        mov     eax, [ebp+16]   ; linewidth
        mul     ebx             ; how many pixels to reduce
        mov     ecx, eax        ; set counter
        mov     edx, [ebp+8]    ; edx = pointer to img
        mov     ebx, [ebp+20]   ; ebx = rfactor

    loop:
        xor     eax, eax
        dec     ecx             ; decrement counter
        mov     al, [edx]       ; current pixel to al
        add     eax, -128
        imul    bl              ; pixel*rfactor
        sar     eax, 7          ; pixel/128
        add     eax, 128
        mov     byte [edx], al  ; put pixel back
        inc     edx             ; next pixel
        test    ecx, ecx        ; counter 0?
        jnz     loop

    koniec:
        pop     ebx
        mov     esp, ebp
        pop     ebp
        ret

Is there anything to improve? Thank you for all suggestions, I have to impress my coordinator ;)

If you are still interested in a SIMD version, here is one.
It uses AVX2 instructions, so you need at least a 4th generation Intel Core processor (Haswell micro-architecture).

    bits 32

    global _contrast

    section .code

    ;rfactor
    ;linewidth
    ;width
    ;ptr to buffer
    _contrast:
        push ebp
        mov ebp, esp
        and esp, 0fffffff0h

        push edi
        push esi
        push ebx
        push eax

        mov eax, dword [ebp+0ch]        ;width
        mul dword [ebp+10h]             ;total bytes
        mov ecx, eax                    ;number of bytes to process
        shr ecx, 04h                    ;process chunks of 16 bytes per cycle

        mov edi, dword [ebp+08h]        ;buffer

        ;--- prepare ymm registers ---

        vzeroall

        sub esp, 10h

        ;ymm1 contains the r factor (x16)
        movzx ebx, word [ebp+14h]
        mov dword [esp], ebx
        vpbroadcastw ymm1, word [esp]   ;ymm1 = r (x16)

        ;ymm0 contains the 128-r value (x16)
        neg word [esp]                  ;-r
        mov eax, 128                    ;load the full register so ah holds no stale bits
        add word [esp], ax              ;128-r
        vpbroadcastw ymm0, word [esp]   ;ymm0 = 128-r (x16)

        add esp, 10h

    .loop:
        ;compute channel values
        vpmovzxbw ymm2, [edi]           ;16 channels (128 bit) to 16 words
        vpmullw ymm2, ymm2, ymm1        ;ymm2 = in*r
        vpsrlw ymm2, ymm2, 7            ;ymm2 = in*r>>7
        vpaddw ymm2, ymm2, ymm0         ;ymm2 = in*r>>7 + 128-r

        ;the 256-bit vpackuswb packs within each 128-bit lane,
        ;so bring the upper lane down and pack the two halves
        vextracti128 xmm3, ymm2, 1      ;xmm3 = words 8..15
        vpackuswb xmm2, xmm2, xmm3      ;xmm2 = 16 computed values

        ;store to memory (unaligned store, works for any buffer address)
        vmovdqu [edi], xmm2

        add edi, 10h
        loop .loop

        pop eax
        pop ebx
        pop esi
        pop edi

        mov esp, ebp
        pop ebp
        ret
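The loop above computes `(in*r >> 7) + (128 - r)` instead of the original `((in - 128)*r >> 7) + 128`. The two agree as long as the shift rounds toward negative infinity, because subtracting the integer `r` commutes with the floor. Below is a small C sketch that checks this exhaustively; the function names are hypothetical, and floor division is written out by hand because `>>` on negative values is implementation-defined in C.

```c
#include <assert.h>

/* Floor division by 128; mirrors an arithmetic shift right by 7. */
static int floor_div128(int x)
{
    return (x >= 0) ? x / 128 : -((-x + 127) / 128);
}

/* Original form: ((in - 128) * r >> 7) + 128 */
static int scalar_form(int in, int r)
{
    assert(in >= 0 && in <= 255 && r >= 1 && r <= 128);
    return floor_div128((in - 128) * r) + 128;
}

/* Rearranged form, one 16-bit lane: (in * r >> 7) + (128 - r) */
static int simd_form(int in, int r)
{
    assert(in >= 0 && in <= 255 && r >= 1 && r <= 128);
    return (in * r) / 128 + (128 - r);  /* in * r >= 0, so / acts as a floor */
}

/* Returns the number of (in, r) pairs on which the two forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int r = 1; r <= 128; ++r)
        for (int in = 0; in <= 255; ++in)
            if (scalar_form(in, r) != simd_form(in, r))
                ++mismatches;
    return mismatches;
}
```

The rearranged form keeps every intermediate value non-negative and below 32768 (at most 255 * 128 = 32640), so it fits a 16-bit lane and a logical shift is enough.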

I have tested it by comparing its output with the output of your code.

The prototype in C is the old one (with linewidth):

    void contrast(void* buffer, unsigned int width, unsigned int bpp, unsigned short rfactor);
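A comparison test along those lines could look like the sketch below. It is not the harness actually used: `contrast_fn`, `contrast_ref` and `same_output` are hypothetical names, the reference is a plain C loop using the `(in*r)/128 + (128 - r)` form, and the byte count is assumed to be `width * bpp`, as in the assembly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the prototype above; the typedef name is hypothetical. */
typedef void (*contrast_fn)(void *buffer, unsigned int width,
                            unsigned int bpp, unsigned short rfactor);

/* Plain C reference: processes width * bpp bytes, one component at a time. */
static void contrast_ref(void *buffer, unsigned int width,
                         unsigned int bpp, unsigned short rfactor)
{
    unsigned char *p = buffer;
    size_t n = (size_t)width * bpp;
    assert(rfactor >= 1 && rfactor <= 128);
    for (size_t i = 0; i < n; ++i)
        p[i] = (unsigned char)((p[i] * rfactor) / 128 + (128 - rfactor));
}

/* Runs two implementations on private copies of the same input.
 * Returns 1 if the outputs match, 0 if they differ, -1 if malloc fails. */
static int same_output(contrast_fn a, contrast_fn b,
                       const unsigned char *input, unsigned int width,
                       unsigned int bpp, unsigned short rfactor)
{
    size_t n = (size_t)width * bpp;
    unsigned char *out_a = malloc(n);
    unsigned char *out_b = malloc(n);
    int result = -1;
    if (out_a != NULL && out_b != NULL) {
        memcpy(out_a, input, n);
        memcpy(out_b, input, n);
        a(out_a, width, bpp, rfactor);
        b(out_b, width, bpp, rfactor);
        result = (memcmp(out_a, out_b, n) == 0);
    }
    free(out_a);
    free(out_b);
    return result;
}
```

Linking the assembly routine and passing it as one of the two function pointers, with `contrast_ref` as the other, gives a byte-for-byte check over any test image.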

I have done some profiling on my machine. I have run this version and the one in your answer on a 2048x20480 image (120 MiB buffer) 10 times. Your code takes 2.93 seconds, this one 1.09 seconds. These timings may not be accurate, though.

This version requires the buffer size to be a multiple of 16 (because it processes 16 bytes per cycle, that is, five and one third pixels at a time), but you can pad it with zeros. If the buffer is aligned on a 16-byte boundary it will run faster.
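One way to satisfy both conditions on the C side is sketched below. It assumes a C11 runtime that provides `aligned_alloc`; the helper names `round_up_16`, `is_aligned_16` and `alloc_padded_image` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round a byte count up to the next multiple of 16. */
static size_t round_up_16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocates a zero-filled, 16-byte-aligned buffer that holds image_bytes
 * rounded up to a multiple of 16, and stores the padded size in
 * *padded_bytes. Returns NULL on failure; release with free(). */
static unsigned char *alloc_padded_image(size_t image_bytes,
                                         size_t *padded_bytes)
{
    assert(image_bytes > 0 && padded_bytes != NULL);
    size_t total = round_up_16(image_bytes);
    unsigned char *buf = aligned_alloc(16, total);  /* C11 */
    if (buf != NULL)
        memset(buf, 0, total);  /* the tail past image_bytes stays zero */
    *padded_bytes = total;
    return buf;
}
```

The image data would be copied to the start of the buffer; the zero tail is processed along with it and can simply be ignored afterwards.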

If you want a more detailed answer (with useful comments, for example :D), just ask in the comments.

Edit: updated the code with the great help of Peter Cordes, for future reference.

