本身也发过十分的多飞快高斯模糊算法.

在此以前,我也发过相当多飞跃高斯模糊算法.

   
 这里的高斯模糊使用的是舆论《Recursive implementation of the Gaussian
filter》里描述的递归算法。

作者一般感到,只要处理一千第六百货万像素彩色图片,在2.2GHz的CPU上单核单线程抢先1秒的算法,都以难过的.

           
  图片 1

事头阵的多少个算法,在吾2.2GHz的CPU上耗时都会当先1秒.

     
仔细观望和驾驭上述公式,在forward进度中,n是递增的,因而,借使在进展forward以前,把in数据先全体的赋值给w,然后式子(9a)就能够成为:

而显著,飞快高斯模糊有众多兑现格局:

      w[n] = B w[n] +
(b1 w[n-1] + b2 w[n-2] +
b3 w[n-3]) / b0;          ———>    
(1a)

1.FIR (Finite impulse response)

 
  在backward进度中,n是递减的,由此在backward前,把w的数量全体的拷贝到out中,则式子(9b)变为:

https://zh.wikipedia.org/wiki/%E9%AB%98%E6%96%AF%E6%A8%A1%E7%B3%8A

     out[n] = B out[n] +
(b1 
out[n+1] + b2 out[n+2] +
b3 
out[n+3]) / b0 ;     <———     (1b)

2.SII (Stacked integral images)

   
 从编制程序角度看来,backward中的拷贝是不须要的,因而 (1b)能够一贯写为:

http://dx.doi.org/10.1109/ROBOT.2010.5509400

           w[n] = B w[n] +
(b1 w[n+1] + b2 w[n+2] +
b3 w[n+3]) / b0 ;               <———    
(1c)

http://arxiv.org/abs/1107.4958

   
 
从速度上挂念,浮点除法不快,由此,我们再度定义b1 = b1 / b0, b2 = b2 /
b0, b3 = b3 / b0, 最终收获我们选取的递归公式:

3.Vliet-Young-Verbeek (Recursive filter)

           w[n] = B w[n] +
b1 w[n-1] + b2 w[n-2] +
b3 w[n-3];          ———>     (a)

http://dx.doi.org/10.1016/0165-1684(95)00020-E

        w[n] = B w[n] +
b1 w[n+1] + b2 w[n+2] +
b3 w[n+3] ;             <———      (b)

http://dx.doi.org/10.1109/ICPR.1998.711192

   
上述公式是一维的高斯模糊总结情势,针对二维的图像,精确的做法就是先对各种图像行进行模糊管理获得中间结果,再对中间结果的每列举行模糊操作,最后赢得二维的混淆结果,当然也得以应用先列后行那样的做法。

4.DCT (Discrete Cosine Transform)

     
其它注意点正是,边缘像素的拍卖,大家见到在公式(a)也许(b)中,w[n]的结果个别依附于前多少个大概后四个要素的值,而对于边缘地方的值,这几个都以不在合理限定内的,日常的做法是镜像数据可能重新边缘数据,推行申明,由于是递归的算法,发轫值的不一致会将结果平昔一而再下去,因而,差异的不二法门对边缘部分的结果照旧有确定的熏陶的,这里大家利用的大致的重复边缘像素值的方法。

http://dx.doi.org/10.1109/78.295213

     
由于地方公式中的周密均为浮点类型,因而,总计一般也是对浮点实行的,也等于说一般须要先把图像数据调换为浮点,然后开始展览高斯模糊,在将结果调换为字节类型的图像,因此,上述进度可以独家用一下多少个函数达成:

5.box (Box filter)

               
CalcGaussCof          
//  总括高斯模糊中运用到的周全
      ConvertBG哈弗8U2BGRAF      //  将字节数据转变为浮点数据 
      GaussBlurFromLeftToRight    //  水平方向的前向传播
      GaussBlurFromRightToLeft    //  水平方向的反向传播
      GaussBlurFromTopToBottom  
 //   垂直方向的前向传来
      GaussBlurFromBottomToTop  
 //   垂直方向的反向传播
      ConvertBGRAF2BGPRADO8U      
 //   将结果转变为字节数据

http://dx.doi.org/10.1109/TPAMI.1986.4767776

   大家逐一深入分析之。

6.AM(Alvarez, Mazorra)

     
 CalcGaussCof,那个不会细小略,就依照故事集中提议的总括公式依据用户输入的参数来总结,当然结合下方面大家提到的除法变乘法的优化,注意,为了持续的一对标号的主题素材,笔者讲上述公式(a)和(b)中的周到B写为b0了。

http://www.jstor.org/stable/2158018

void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)
{
    float Q, B;
    if (Radius >= 2.5)
        Q = (double)(0.98711 * Radius - 0.96330);                            //    对应论文公式11b
    else if ((Radius >= 0.5) && (Radius < 2.5))
        Q = (double)(3.97156 - 4.14554 * sqrt(1 - 0.26891 * Radius));
    else
        Q = (double)0.1147705018520355224609375;

    B = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q;        //    对应论文公式8c
    B1 = 2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q;
    B2 = -1.4281 * Q * Q - 1.26661 * Q * Q * Q;
    B3 = 0.422205 * Q * Q * Q;

    B0 = 1.0 - (B1 + B2 + B3) / B;
    B1 = B1 / B;
    B2 = B2 / B;
    B3 = B3 / B;
}

7.Deriche (Recursive filter)

  由上述代码可见,B0+B1+B2+B3=1,就是归一化的全面,那也是算法可以递归进行的要害之一。

http://hal.inria.fr/docs/00/07/47/78/PDF/RR-1893.pdf

   
 接着为了方便中间经过,大家先将字节数据调换为浮点数据,那有的代码也很简单:

8.ebox (Extended Box)

void ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int Height, int Stride)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePS = Src + Y * Stride;
        float *LinePD = Dest + Y * Width * 3;
        for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
        }
    }
}

http://dx.doi.org/10.1007/978-3-642-24785-9\_38

  大家能够尝试自身把在那之中的X循环再展开探访效果。

9.IIR (Infinite Impulse Response)

   
 水平方向的前向传来依据公式去写也是一点也不细略的,不过一向使用公式里面有成都百货上千走访数组的历程(不自然就慢),作者有一点退换下写成如下方式:

https://software.intel.com/zh-cn/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 3;
        float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0];          //  边缘处使用重复像素的方案
        float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
        float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
        for (int X = 0; X < Width; X++, LinePD += 3)
        {
            LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
            LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // 进行顺向迭代
            LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = LinePD[0];
            GS3 = GS2, GS2 = GS1, GS1 = LinePD[1];
            RS3 = RS2, RS2 = RS1, RS1 = LinePD[2];
        }
    }
}

10.FA (Fast Anisotropic)

  非常少描述,请我们把上述代码和公式(a)对应一下就知道了。

http://mathinfo.univ-reims.fr/IMG/pdf/Fast\_Anisotropic\_Gquss\_Filtering\_-\_GeusebroekECCV02.pdf

     
有了GaussBlurFromLeftToRight的参照他事他说加以考察代码,GaussBlurFromRightToLeft的代码就不会有何样大的许多不便了,为了防止某些懒人不看小说不思虑,这里笔者不贴这段的代码。

……

     
那么垂直方向上轻松的做只须要改动下循环的趋向,以及历次指针扩大量退换为Width
* 3即可。

金玉满堂高斯模糊的不二等秘书诀就算多数,可是作为算法来说,宗旨关键是轻易高效.

     
那么大家来思考下垂直方向能否不这样管理吧,指针每便增添Width *
3会导致深重的Cache
miss,非常是对此宽度稍微大点的图像,大家仔细观望垂直方向,会发掘还可以够根据Y
 — X那样的巡回形式也是足以的,代码如下:

日前作者经超过实际地衡量,II帕杰罗是专职效果以及质量的没有错的秘诀,也是半径非亲非故(即模糊分裂强度耗时基本不改变)的达成.

void GaussBlurFromTopToBottom(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD3 = Data + (Y + 0) * Width * 3;
        float *LinePD2 = Data + (Y + 1) * Width * 3;
        float *LinePD1 = Data + (Y + 2) * Width * 3;
        float *LinePD0 = Data + (Y + 3) * Width * 3;
        for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3)
        {
            LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;
            LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;
            LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;
        }
    }
}

英特尔官方完毕的那份:

  就是说大家不是一下子就把列方向的前向传来进行完,而是每一次只举行叁个像素的流传,当一行所有像素都进行完了列方向的前向传来后,在切换成下一行,这样管理就避免了Cache
miss出现的情状。

IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector
Extensions
 [PDF
513KB]
source: gaussian_blur.cpp [36KB]

   
 注意到列方向的边缘地点,为了有利于代码的管理,大家在中度方向上上下分别扩张了3个行的像素,当举行完全中学间的管用行的行方向前向和反向传播后,根据前述的双重边缘像素的章程填充上下那空出的三行数据。

动用了AMDComputer的流(SIMD)指令,算法管理速度极度惊人.

   
 同样的道理,GaussBlurFromBottomToTop的代码可由读者自个儿补充进去。

咱写算法追求干净整洁,高效简明,换言之正是不选拔任何硬件加快方案,达成轻巧快速,以适应分化硬件意况.

   
 最终的ConvertBGRAF2BGXC608U也很简短,正是几个for循环:

故基于速龙那份代码,笔者对其展开了改写以及优化.

void ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePS = Src + Y * Width * 3;
        unsigned char *LinePD = Dest + Y * Stride;
        for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
        }
    }
}

最后在本身2.20GHz的CPU上,单核单线程,不采取流(SIMD)指令,抵达了,管理1000六百万像素的彩色照片仅需700微秒左右.

   在使得的界定内,上述浮点总括的结果不会赶过byte所能表明的限制,由此也没有需求特意的进展Clamp操作。

服从惯例,依旧贴个效果图比较直观.

     
 最终正是局地内部存款和储蓄器分配和自由的代码了,如下所示:

图片 2

void GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
    float B0, B1, B2, B3;
    float *Buffer = (float *)malloc(Width * (Height + 6) * sizeof(float) * 3);

    CalcGaussCof(Radius, B0, B1, B2, B3);
    ConvertBGR8U2BGRAF(Src, Buffer + 3 * Width * 3, Width, Height, Stride);

    GaussBlurFromLeftToRight(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);
    GaussBlurFromRightToLeft(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);        //    如果启用多线程,建议把这个函数写到GaussBlurFromLeftToRight的for X循环里,因为这样就可以减少线程并发时的阻力

    memcpy(Buffer + 0 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + 1 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + 2 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));

    GaussBlurFromTopToBottom(Buffer, Width, Height, B0, B1, B2, B3);

    memcpy(Buffer + (Height + 3) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + (Height + 4) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + (Height + 5) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));

    GaussBlurFromBottomToTop(Buffer, Width, Height, B0, B1, B2, B3);

    ConvertBGRAF2BGR8U(Buffer + 3 * Width * 3, Dest, Width, Height, Stride);

    free(Buffer);
}

事先也可能有网上亲密的朋友问过这一个算法的落到实处难题.

  正如上所述,分配了Height +
6行的内存区域,重倘诺为着便利垂直方向的拍卖,在进行GaussBlurFromTopToBottom以前依照重复边缘的规则复制3行,然后在GaussBlurFromBottomToTop此前在复制底部边缘的3行像素。

想了想,还是将代码共享出来,供我们参照他事他说加以考查学习.

     
至此,二个轻松而又飞快的高斯模糊就基本产生了,代码数量也没多少,也未有吗难度,不掌握为何繁多个人搞不定。

 

     
依据本人的测试,上述方法代码在一台I5-6300HQ
2.30GHZ的记录簿上针对一副三千*三千的二十三位图像的拍卖时间大致须求370ms,假如在C++的编写翻译选项的代码生成选项里的启用加强指令集选拔–>流式管理SIMD扩充2(/arch:sse2),则编写翻译后的次第大概要求220ms的小时。

总体代码:

     
大家尝试利用种类的一些能源进一步升高速度,首先我们想到了SSE优化,关于那上边AMD有一篇官方的小说和代码:

void CalGaussianCoeff(float sigma, float * a0, float * a1, float * a2, float * a3, float * b1, float * b2, float * cprev, float * cnext) {
    float alpha, lamma, k;

    if (sigma < 0.5f)
        sigma = 0.5f;
    alpha = (float)exp((0.726) * (0.726)) / sigma;
    lamma = (float)exp(-alpha);
    *b2 = (float)exp(-2 * alpha);
    k = (1 - lamma) * (1 - lamma) / (1 + 2 * alpha * lamma - (*b2));
    *a0 = k; *a1 = k * (alpha - 1) * lamma;
    *a2 = k * (alpha + 1) * lamma;
    *a3 = -k * (*b2);
    *b1 = -2 * lamma;
    *cprev = (*a0 + *a1) / (1 + *b1 + *b2);
    *cnext = (*a2 + *a3) / (1 + *b1 + *b2);
}

void gaussianHorizontal(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char  * lpColumn, int width, int height, int Channels, int Nwidth, float a0a1, float a2a3, float b1b2, float  cprev, float cnext)
{
    int HeightStep = Channels*height;
    int WidthSubOne = width - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * (a2a3)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a2a3)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a2a3)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            prevOut[3] = ((lpRowInitial[3] * (a0a1)) - (prevOut[3] * (b1b2)));

            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;

        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpColumn[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = (lpRowInitial[0] * cprev);

        for (int x = 0; x < width; ++x) {
            prevOut = ((lpRowInitial[0] * (a0a1)) - (prevOut  * (b1b2)));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep*WidthSubOne;
        bufferPerLine -= Channels;

        prevOut = (lpRowInitial[0] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut  * b1b2));
            bufferPerLine[0] += prevOut;
            lpColumn[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
}

void gaussianVertical(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char * lpColInitial, int height, int width, int Channels, float a0a1, float a2a3, float b1b2, float  cprev, float  cnext) {

    int WidthStep = Channels*width;
    int HeightSubOne = height - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep * HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a0a1) - (prevOut[3] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpColInitial[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = 0;
        prevOut = (lpRowInitial[0] * cprev);
        for (int y = 0; y < height; y++) {
            prevOut = ((lpRowInitial[0] * a0a1) - (prevOut * b1b2));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut = (lpRowInitial[0] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut * b1b2));
            bufferPerLine[0] += prevOut;
            lpColInitial[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
}
//本人博客:http://tntmonks.cnblogs.com/ 转载请注明出处.
void  GaussianBlurFilter(unsigned char * input, unsigned char * output, int Width, int Height, int Stride, float GaussianSigma) {

    int Channels = Stride / Width;
    float a0, a1, a2, a3, b1, b2, cprev, cnext;

    CalGaussianCoeff(GaussianSigma, &a0, &a1, &a2, &a3, &b1, &b2, &cprev, &cnext);

    float a0a1 = (a0 + a1);
    float a2a3 = (a2 + a3);
    float b1b2 = (b1 + b2); 

    int bufferSizePerThread = (Width > Height ? Width : Height) * Channels;
    unsigned char * bufferPerLine = (unsigned char*)malloc(bufferSizePerThread);
    unsigned char * tempData = (unsigned char*)malloc(Height * Stride);
    if (bufferPerLine == NULL || tempData == NULL)
    {
        if (tempData)
        {
            free(tempData);
        }
        if (bufferPerLine)
        {
            free(bufferPerLine);
        }
        return;
    }
    for (int y = 0; y < Height; ++y) {
        unsigned char * lpRowInitial = input + Stride * y;
        unsigned char * lpColInitial = tempData + y * Channels;
        gaussianHorizontal(bufferPerLine, lpRowInitial, lpColInitial, Width, Height, Channels, Width, a0a1, a2a3, b1b2, cprev, cnext);
    }
    int HeightStep = Height*Channels;
    for (int x = 0; x < Width; ++x) {
        unsigned char * lpColInitial = output + x*Channels;
        unsigned char * lpRowInitial = tempData + HeightStep * x;
        gaussianVertical(bufferPerLine, lpRowInitial, lpColInitial, Height, Width, Channels, a0a1, a2a3, b1b2, cprev, cnext);
    }

    free(bufferPerLine);
    free(tempData);
}

        IIR Gaussian Blur Filter
Implementation using Intel® Advanced Vector
Extensions
 [PDF
513KB]

 

     source
code: gaussian_blur.cpp [36KB]

调用方法:

     
笔者只是简单的看了下那篇小说,认为她个中用到的总结公式和Deriche滤波器的很像,和本文参谋的Recursive
implementation
不太一样,并且其SSE代码对能管理的图还会有多数限量,对本身那边的参照他事他说加以考察意义相当的小。

  GaussianBlurFilter(输入图像数据,输出图像数据,宽度,高度,通道数,强度)

     
我们先看下大旨的估摸的SSE优化,注意到  GaussBlurFromLeftToRight
的代码中,
个中央的计量部分是几个乘法,不过她唯有3个乘法计算,即便可以凑成四行,那么就足以丰硕利用SSE的批量乘除成效了,也便是一旦能增添贰个通路,使得GaussBlurFromLeftToRight变为如下方式:

  注:扶助通道数分别为 1 ,3 ,4.

void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 4;
        float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0];                //  边缘处使用重复像素的方案
        float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
        float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
        float AS1 = LinePD[3], AS2 = LinePD[3], AS3 = LinePD[3];
        for (int X = 0; X < Width; X++, LinePD += 4)
        {
            LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
            LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // 进行顺向迭代
            LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            LinePD[3] = LinePD[3] * B0 + AS1 * B1 + AS2 * B2 + AS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = LinePD[0];
            GS3 = GS2, GS2 = GS1, GS1 = LinePD[1];
            RS3 = RS2, RS2 = RS1, RS1 = LinePD[2];
            AS3 = AS2, AS2 = AS1, AS1 = LinePD[3];
        }
    }
}

至于IILAND相关知识,参阅 百度词条 “IIEscort数字滤波器”

  则很轻巧就把上述代码转变到SSE能够标准管理的代码了。

http://baike.baidu.com/view/3088994.htm

  而对此Y方向的代码,你精心阅览会开掘,无论是及通道的图,天然的就能够选用SSE举办管理,详见下边的代码。

大地功夫,唯快不破。
本文只是投石问路一下,若有其余相关难题或然要求也得以邮件联系小编钻探。

  好,大家还是三个贰个的来解析:

邮箱地址是:
gaozhihan@vip.qq.com

   第四个函数
CalcGaussCof 不要求进行任何的优化。

 

     
第贰个函数 ConvertBG奥德赛8U2BGRAF遵照上述解析供给重新写,因为急需追加二个通路,新的大道的值填0只怕别的值都能够,但提议填0,这对某个SSE函数很有用,我把那几个函数的SSE完毕共享一下:

题外话:

void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, int Height, int Stride)
{
    const int BlockSize = 4;
    int Block = (Width - 2) / BlockSize;
    __m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);            //    Mask为-1的地方会自动设置数据为0
    __m128i Zero = _mm_setzero_si128();
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePS = Src + Y * Stride;
        float *LinePD = Dest + Y * Width * 4;
        int X = 0;
        for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4)
        {
            __m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);        //    提取了16个字节,但是因为24位的,我们要将其变为32位的,所以只能提取出其中的前12个字节
            __m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero);
            __m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero);
            _mm_store_ps(LinePD + 0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16L, Zero)));        //    分配内存时已经是16字节对齐了,然后每行又是4的倍数的浮点数,因此,每行都是16字节对齐的
            _mm_store_ps(LinePD + 4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16L, Zero)));        //    _mm_stream_ps是否快点?
            _mm_store_ps(LinePD + 8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16H, Zero)));
            _mm_store_ps(LinePD + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16H, Zero)));
        }
        for (; X < Width; X++, LinePS += 3, LinePD += 4)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;        //    Alpha最好设置为0,虽然在下面的CofB0等SSE常量中通过设置ALPHA对应的系数为0,可以使得计算结果也为0,但是不是最合理的
        }
    }
}

大多网络好朋友间接青睐使用opencv,opencv的确十三分强劲,然而一旦想要有更大的向上空间以及开创力.

  稍作解释,_mm_loadu_si128二回性加载十五个字节的数目到SSE寄存器中,对于贰10个人图像,十七个字节里富含了5

抑或要一步一个鞋印去落到实处部分最中央的算法,扎实的底蕴才是创设上层建筑的着力条件.

  • 1 /
    3个像素的音信,而作者辈的靶子是把这个数量转变为4通道的音讯,由此,我们只好三次性的提取处16/4=4个像素的值,那借助于_mm_shuffle_epi8函数和方便的Mask来达成,_mm_unpacklo_epi8/_mm_unpackhi_epi8分别领取了SrcV的高8位和低8位的8个字节数据并将它们转换为8个16进制数保存到Src16L和Src16H中,
    而_mm_unpacklo_epi16/_mm_unpackhi_epi16则更是把14人数据扩充到叁11人整形数据,最终通过_mm_cvtepi32_ps函数把整形数据调换为浮点型。

吾如今只是把opencv当资料库来看,并不以为opencv能够用来绝大许多的生意项目.

  大概有人注意到了 int Block = (Width – 2) /
BlockSize;
这一行代码,为啥要-2操作呢,那也是自家在数十回测试开采先后连接出现内部存款和储蓄器错误时才发掘到的,因为_mm_loadu_si128一回性加载了5

若本文帮到您,无耻之尤求微信扫码打个赏.

  • 1 / 3个像素的音信,当在拍卖最终一行像素时(别的行不会),即使Block
    取Width/BlockSize,
    则很有希望访问了超越像素范围内的内部存款和储蓄器,而-2不是-1就是因为极其额外的54%像素起的效劳。

图片 3

  接着看档案的次序方向的前向传来的SSE方案:

void GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
    const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
    const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
    const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 4;
        __m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
        __m128 V2 = V1, V3 = V1;
        for (int X = 0; X < Width; X++, LinePD += 4)            //    还有一种写法不需要这种V3/V2/V1递归赋值,一次性计算3个值,详见 D:\程序设计\正在研究\Core\ShockFilter\Convert里的高斯模糊,但速度上没啥区别
        {
            __m128 V0 = _mm_load_ps(LinePD);
            __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
            __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
            __m128 V = _mm_add_ps(V01, V23);
            V3 = V2;    V2 = V1;    V1 = V;
            _mm_store_ps(LinePD, V);
        }
    }
}

  和方面包车型地铁4坦途的GaussBlurFromLeftToRight_SSE比较,你会开掘大约语法上未曾任何的分别,实在是太轻便了,注意自己从未用_mm_storeu_ps,而是一向动用_mm_store_ps,那是因为,第一,分配Data内部存款和储蓄器时,我动用了_mm_malloc分配了16字节对齐的内存,而Data每行元素的个数又都以4的翻番,由此,每行起首地方处的内部存款和储蓄器也是16字节对齐的,所以直接_mm_store_ps完全部都以可以的。

   
 同理,在笔直方向的前向传来的SSE优化代码就越来越直接了:

void GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
    const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
    const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
    const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePS3 = Data + (Y + 0) * Width * 4;
        float *LinePS2 = Data + (Y + 1) * Width * 4;
        float *LinePS1 = Data + (Y + 2) * Width * 4;
        float *LinePS0 = Data + (Y + 3) * Width * 4;
        for (int X = 0; X < Width * 4; X += 4)
        {
            __m128 V3 = _mm_load_ps(LinePS3 + X);
            __m128 V2 = _mm_load_ps(LinePS2 + X);
            __m128 V1 = _mm_load_ps(LinePS1 + X);
            __m128 V0 = _mm_load_ps(LinePS0 + X);
            __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
            __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
            _mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));
        }
    }
}

  对上面包车型大巴代码不想做此外表达了。

  最有难度的相应算是ConvertBGRAF2BG凯雷德8U的SSE版本了,由于有些原因,我不太情愿分享这几个函数的代码,有意思味的意中人能够仿照效法opencv的有关落到实处。

   
 最终的SSE版本高斯模糊的第一代码如下:

void GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
    float B0, B1, B2, B3;
    float *Buffer = (float *)_mm_malloc(Width * (Height + 6) * sizeof(float) * 4, 16);

    CalcGaussCof(Radius, B0, B1, B2, B3);
    ConvertBGR8U2BGRAF_SSE(Src, Buffer + 3 * Width * 4, Width, Height, Stride);

    GaussBlurFromLeftToRight_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    在SSE版本中,这两个函数占用的时间比下面两个要多,不过C语言版本也是一样的
    GaussBlurFromRightToLeft_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    如果启用多线程,建议把这个函数写到GaussBlurFromLeftToRight的for X循环里,因为这样就可以减少线程并发时的阻力

    memcpy(Buffer + 0 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + 1 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + 2 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));

    GaussBlurFromTopToBottom_SSE(Buffer, Width, Height, B0, B1, B2, B3);

    memcpy(Buffer + (Height + 3) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + (Height + 4) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + (Height + 5) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));

    GaussBlurFromBottomToTop_SSE(Buffer, Width, Height, B0, B1, B2, B3);

    ConvertBGRAF2BGR8U_SSE(Buffer + 3 * Width * 4, Dest, Width, Height, Stride);

    _mm_free(Buffer);
}

  对于同一的3000*两千的彩色图像,SSE版本的代码耗费时间只有145ms的耗费时间,相对于平时的C代码有约2.5倍左右的提速,那也事出有因,因为我们在进行SSE的代码时时多管理了一个通路的总计量的,可是同编写翻译器自个儿的SSE优化220ms,唯有1.5倍的提速了,那表明编写翻译器的SSE优化技能照旧极度强的。

   
 进一步的优化正是本人上边的声明掉的opemp相关代码,在ConvertBG牧马人8U2BGRAF /
GaussBlurFromLeftToRight / GaussBlurFromRightToLeft / ConvertBGRAF2BG昂科拉8U
 4个函数中,直接助长轻便的#pragma omp parallel
for就足以了,至于GaussBlurFromTopToBottom_SSE、
GaussBlurFromBottomToTop_SSE则是因为上下行之间数据的相关性,是无能为力落实并行总结的,可是测试表示他们的耗费时间要比水平方向的少多数。

   
例如我们钦点openmp使用2个线程,在上述机器上(四核的),纯C版本能优化到252ms,而纯SSE的只可以提升到100ms左右,编写翻译器本身的SSE优化速度差相当的少是150ms,基本上依旧维持同一个级其余比例。

   对于灰度图像,很可惜,上述的水准方向上的优化措施就随意为力了,一种方法便是把4行灰度像素混洗成一行,可是那些操作不太有利用SSE达成,其余一种正是把水平方向的数码先转置,然后接纳垂直方向的SSE优化代码管理,完毕在转置回去,最终对转置的多少再次举行垂直方向SSE优化,当然转置的历程是足以注重SSE的代码达成的,可是须求额外的一份内部存款和储蓄器,速度上可能和常见的C比较就不会有那么多的提高了,这几个待一时光了再去测试。

   
 前光景后写那么些概略也花了半个月的年华,写完了百分百流程,在倒过来看,真的是充裕的粗略,不常便是这么。

   
 本文并不曾提供整机的能够一向实行的上上下下代码,需者自勤,提供一段C#的调用工程供风乐趣的对象测试和比对(未利用OPENMP版本)。

   
 http://files.cnblogs.com/files/Imageshop/GaussBLur_Sample.rar

图片 4

     后记:后来测试发掘,当半径参数非常大时,无论是C版本依旧SSE版本都会出现部分狼狈的缺点,感到疑似溢出了,后来发觉首假使当半径变大后,B0参数变得不大,以致于用float类型的数额来拍卖递归已经力不从心担保丰富的精度了,消除的艺术是行使double类型,那对于C语言版本的话是秒秒的政工,而对此SSE版本则是极大的天灾人祸,double时换来AVX恐怕退换量非常的小,但是AVX的广泛度以及…..,可是,一般景况下,半径不高于75时结果都照旧不错的,那对于繁多的施用来讲是十足的,半径75时,整个图像已经大半没有任何的细节了,再大,差距也非常的小了。

   
 消除上述难题三个得力的方案正是行使Deriche滤波器,小编用该滤波器的float版本对大半径是不会油不过生难点的,并且也可能有相关SSE参谋代码。

 图片 5

   后续小说:高斯模糊算法的巨细无遗优化进度分享(二)。

 

相关文章