Several algorithms I posted earlier

Previously, I also posted quite a few fast Gaussian blur algorithms.

   
The Gaussian blur here uses the recursive algorithm described in the paper "Recursive implementation of the Gaussian filter".

As a rule, I consider any algorithm slow if it takes more than 1 second, single-core and single-threaded on a 2.2GHz CPU, to process a 16-megapixel color image.

           
Figure 1

The algorithms I posted earlier all take more than 1 second on my 2.2GHz CPU.

     
Look carefully at the formulas above. In the forward pass n is increasing, so if all of the in data is copied into w before the forward pass starts, equation (9a) becomes:

Fast Gaussian blur obviously has many implementations:

      w[n] = B w[n] + (b1 w[n-1] + b2 w[n-2] + b3 w[n-3]) / b0          --->     (1a)

1.FIR (Finite impulse response)

 
In the backward pass n is decreasing, so if all of w is copied into out before the backward pass starts, equation (9b) becomes:

https://zh.wikipedia.org/wiki/%E9%AB%98%E6%96%AF%E6%A8%A1%E7%B3%8A

      out[n] = B out[n] + (b1 out[n+1] + b2 out[n+2] + b3 out[n+3]) / b0     <---     (1b)

2.SII (Stacked integral images)

   
From a programming point of view, the copy in the backward pass is entirely unnecessary, so (1b) can be written directly as:

http://dx.doi.org/10.1109/ROBOT.2010.5509400

      w[n] = B w[n] + (b1 w[n+1] + b2 w[n+2] + b3 w[n+3]) / b0          <---     (1c)

http://arxiv.org/abs/1107.4958

   
 
For speed, floating-point division is slow, so we redefine b1 = b1 / b0, b2 = b2 / b0, b3 = b3 / b0 and arrive at the recursive formulas we actually use:

3.Vliet-Young-Verbeek (Recursive filter)

      w[n] = B w[n] + b1 w[n-1] + b2 w[n-2] + b3 w[n-3]          --->     (a)

http://dx.doi.org/10.1016/0165-1684(95)00020-E

      w[n] = B w[n] + b1 w[n+1] + b2 w[n+2] + b3 w[n+3]          <---     (b)
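To make formulas (a) and (b) concrete, here is a minimal one-dimensional sketch that runs both passes in place on a single row of samples. The function name RecursiveGauss1D and the use of std::vector are my own assumptions for illustration, and the coefficients are assumed to be already divided by b0 as described above. Out-of-range neighbours are replaced by the edge sample.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place 1-D recursive pass following formulas (a) and (b).
// B, b1, b2, b3 are assumed to be normalized already (b1..b3 divided by b0).
void RecursiveGauss1D(std::vector<float> &w, float B, float b1, float b2, float b3)
{
    if (w.empty()) return;
    const std::size_t n = w.size();

    // Forward pass, formula (a): n increases. p1..p3 hold w[n-1], w[n-2], w[n-3].
    float p1 = w[0], p2 = w[0], p3 = w[0];          // repeat the left edge sample
    for (std::size_t i = 0; i < n; ++i)
    {
        w[i] = B * w[i] + b1 * p1 + b2 * p2 + b3 * p3;
        p3 = p2; p2 = p1; p1 = w[i];
    }

    // Backward pass, formula (b): n decreases. p1..p3 hold w[n+1], w[n+2], w[n+3].
    p1 = p2 = p3 = w[n - 1];                        // repeat the right edge sample
    for (std::size_t i = n; i-- > 0; )
    {
        w[i] = B * w[i] + b1 * p1 + b2 * p2 + b3 * p3;
        p3 = p2; p2 = p1; p1 = w[i];
    }
}
```

Keeping the three previous outputs in scalars rather than re-reading w[n-1], w[n-2], w[n-3] is the same idea the image code further down uses.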

http://dx.doi.org/10.1109/ICPR.1998.711192

   
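The formulas above compute a one-dimensional Gaussian blur. For a two-dimensional image, the correct approach is to blur every image row first to get an intermediate result, then blur every column of that intermediate result to get the final 2D blur. Doing columns first and rows second works just as well.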

4.DCT (Discrete Cosine Transform)

     
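Another point to watch is the handling of edge pixels. In formula (a) or (b), the result w[n] depends on the three previous or the three following elements, and at the edges those elements are out of range. The usual fix is to mirror the data or to repeat the edge data. In practice, because the algorithm is recursive, a different initial value carries through the results that follow, so the choice of method does affect the output near the edges. Here we use the simple approach of repeating the edge pixel value.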

http://dx.doi.org/10.1109/78.295213

     
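Because the coefficients in the formulas above are floating-point, the computation is normally done in floating point too. That means converting the image data to floating point first, running the Gaussian blur, and then converting the result back to a byte image. The whole process can be split into the following functions: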

5.box (Box filter)

               
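      CalcGaussCof                //  compute the coefficients used by the Gaussian blur
      ConvertBGR8U2BGRAF          //  convert the byte data to floating-point data
      GaussBlurFromLeftToRight    //  forward pass in the horizontal direction
      GaussBlurFromRightToLeft    //  backward pass in the horizontal direction
      GaussBlurFromTopToBottom    //  forward pass in the vertical direction
      GaussBlurFromBottomToTop    //  backward pass in the vertical direction
      ConvertBGRAF2BGR8U          //  convert the result back to byte data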

http://dx.doi.org/10.1109/TPAMI.1986.4767776

   Let us go through them one by one.

6.AM(Alvarez, Mazorra)

     
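CalcGaussCof is simple: it computes the coefficients from the user's input parameter using the formulas given in the paper, combined with the division-to-multiplication optimization mentioned above. Note that, to keep the later notation simple, I write the coefficient B of formulas (a) and (b) as b0 (B0 in the code).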

http://www.jstor.org/stable/2158018

void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)
{
    float Q, B;
    if (Radius >= 2.5)
        Q = (float)(0.98711 * Radius - 0.96330);                            //    corresponds to equation 11b in the paper
    else if ((Radius >= 0.5) && (Radius < 2.5))
        Q = (float)(3.97156 - 4.14554 * sqrt(1 - 0.26891 * Radius));
    else
        Q = (float)0.1147705018520355224609375;

    B = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q;        //    corresponds to equation 8c in the paper
    B1 = 2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q;
    B2 = -1.4281 * Q * Q - 1.26661 * Q * Q * Q;
    B3 = 0.422205 * Q * Q * Q;

    B0 = 1.0 - (B1 + B2 + B3) / B;
    B1 = B1 / B;
    B2 = B2 / B;
    B3 = B3 / B;
}

7.Deriche (Recursive filter)

  As the code shows, B0+B1+B2+B3=1, that is, the coefficients are normalized. This is one of the keys that lets the algorithm run recursively.
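The normalization is easy to check numerically. The sketch below repeats the coefficient computation from CalcGaussCof (with float literals) and adds a helper, CoefficientSum, which is a hypothetical name of mine and not part of the original code. Since B0 = 1 - (B1 + B2 + B3) / B before the division, the four final coefficients sum to 1 up to float rounding, so a constant input row stays constant after a pass.

```cpp
#include <cmath>

// Coefficient computation, following the article's CalcGaussCof.
void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)
{
    float Q;
    if (Radius >= 2.5f)
        Q = 0.98711f * Radius - 0.96330f;
    else if (Radius >= 0.5f)
        Q = 3.97156f - 4.14554f * std::sqrt(1.0f - 0.26891f * Radius);
    else
        Q = 0.1147705018520355224609375f;

    const float B = 1.57825f + 2.44413f * Q + 1.4281f * Q * Q + 0.422205f * Q * Q * Q;
    B1 = 2.44413f * Q + 2.85619f * Q * Q + 1.26661f * Q * Q * Q;
    B2 = -1.4281f * Q * Q - 1.26661f * Q * Q * Q;
    B3 = 0.422205f * Q * Q * Q;

    B0 = 1.0f - (B1 + B2 + B3) / B;
    B1 /= B;
    B2 /= B;
    B3 /= B;
}

// Hypothetical helper: returns B0 + B1 + B2 + B3 for a given radius.
float CoefficientSum(float Radius)
{
    float B0, B1, B2, B3;
    CalcGaussCof(Radius, B0, B1, B2, B3);
    return B0 + B1 + B2 + B3;
}
```

Calling CoefficientSum with a few radii from each of the three branches (below 0.5, between 0.5 and 2.5, above 2.5) should give values that differ from 1 only by rounding error.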

http://hal.inria.fr/docs/00/07/47/78/PDF/RR-1893.pdf

   
 Next, to simplify the intermediate steps, we first convert the byte data to floating-point data. This code is also simple:

8.ebox (Extended Box)

void ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int Height, int Stride)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePS = Src + Y * Stride;
        float *LinePD = Dest + Y * Width * 3;
        for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
        }
    }
}

http://dx.doi.org/10.1007/978-3-642-24785-9_38

  You can try unrolling the X loop yourself and see what effect it has.

9.IIR (Infinite Impulse Response)

   
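 The horizontal forward pass is also easy to write straight from the formula, but using the formula directly involves many array accesses (which is not necessarily slower). I reworked it slightly into the following form: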

https://software.intel.com/zh-cn/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 3;
        float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0];          //  repeat the edge pixel at the border
        float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
        float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
        for (int X = 0; X < Width; X++, LinePD += 3)
        {
            LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
            LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // forward iteration
            LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = LinePD[0];
            GS3 = GS2, GS2 = GS1, GS1 = LinePD[1];
            RS3 = RS2, RS2 = RS1, RS1 = LinePD[2];
        }
    }
}

10.FA (Fast Anisotropic)

  I won't elaborate; compare the code above with formula (a) and it will be clear.

http://mathinfo.univ-reims.fr/IMG/pdf/Fast_Anisotropic_Gquss_Filtering_-_GeusebroekECCV02.pdf

     
With GaussBlurFromLeftToRight as a reference, GaussBlurFromRightToLeft should not be much trouble. To keep lazy readers from skipping the article and the thinking, I am not posting that code here.
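For readers who want to check their own attempt, here is one possible sketch of the backward horizontal pass. It is my reconstruction by mirroring GaussBlurFromLeftToRight according to formula (b), not the author's code: it starts from the last pixel of each row, repeats that edge pixel as the initial state, and walks X downwards. Indexing by X rather than decrementing a pointer is my choice, so the pointer never steps in front of the buffer.

```cpp
// Sketch of the backward horizontal pass (formula (b)); an assumption, not the original code.
// Data holds Width x Height pixels, 3 floats (B, G, R) per pixel, rows stored contiguously.
void GaussBlurFromRightToLeft(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    if (Width <= 0 || Height <= 0) return;
    for (int Y = 0; Y < Height; Y++)
    {
        float *Row = Data + Y * Width * 3;
        float *Last = Row + (Width - 1) * 3;                        // last pixel of the row
        float BS1 = Last[0], BS2 = Last[0], BS3 = Last[0];          // repeat the right edge pixel
        float GS1 = Last[1], GS2 = Last[1], GS3 = Last[1];
        float RS1 = Last[2], RS2 = Last[2], RS3 = Last[2];
        for (int X = Width - 1; X >= 0; X--)
        {
            float *P = Row + X * 3;
            P[0] = P[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;      // BS1..BS3 are w[n+1..n+3]
            P[1] = P[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;
            P[2] = P[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = P[0];
            GS3 = GS2, GS2 = GS1, GS1 = P[1];
            RS3 = RS2, RS2 = RS1, RS1 = P[2];
        }
    }
}
```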

……

     
For the vertical direction, the simple approach only needs the loop direction changed and the pointer increment changed to Width * 3.
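As a point of comparison, the simple column-by-column version just described might look like the sketch below. The name GaussBlurFromTopToBottomNaive is hypothetical, and here Data is assumed to point at the first image row with no padding rows. Each inner step jumps Width * 3 floats ahead in memory.

```cpp
// Sketch of the straightforward vertical forward pass (formula (a)), one column at a time.
// Hypothetical name; Data points at the first image row, 3 floats per pixel.
void GaussBlurFromTopToBottomNaive(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    if (Width <= 0 || Height <= 0) return;
    const int RowStep = Width * 3;                                  // floats per image row
    for (int X = 0; X < Width; X++)
    {
        float *Col = Data + X * 3;                                  // top pixel of column X
        float BS1 = Col[0], BS2 = Col[0], BS3 = Col[0];             // repeat the top edge pixel
        float GS1 = Col[1], GS2 = Col[1], GS3 = Col[1];
        float RS1 = Col[2], RS2 = Col[2], RS3 = Col[2];
        for (int Y = 0; Y < Height; Y++)
        {
            float *P = Col + Y * RowStep;                           // strides a whole row per step
            P[0] = P[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
            P[1] = P[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;
            P[2] = P[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = P[0];
            GS3 = GS2, GS2 = GS1, GS1 = P[1];
            RS3 = RS2, RS2 = RS1, RS1 = P[2];
        }
    }
}
```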

There are many ways to implement Gaussian blur, but for an algorithm the key point is to be simple and efficient.

     
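Now consider whether the vertical direction really has to be handled this way. Advancing the pointer by Width * 3 at every step causes severe cache misses, especially on wider images. Looking closely at the vertical direction, we find that a loop with Y outside and X inside works as well. The code is as follows: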

In my own measurements so far, IIR is a good method that balances quality and performance, and it is radius-independent (that is, the running time stays essentially the same for different blur strengths).

void GaussBlurFromTopToBottom(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD3 = Data + (Y + 0) * Width * 3;
        float *LinePD2 = Data + (Y + 1) * Width * 3;
        float *LinePD1 = Data + (Y + 2) * Width * 3;
        float *LinePD0 = Data + (Y + 3) * Width * 3;
        for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3)
        {
            LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;
            LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;
            LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;
        }
    }
}

This is the official implementation from Intel:

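  In other words, we do not finish the column-direction forward pass of one column in a single go. Instead, each step advances every column by just one pixel: once all pixels in a row have completed their column-direction forward step, we move on to the next row. Handled this way, the cache misses are avoided.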

IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions [PDF 513KB]
source: gaussian_blur.cpp [36KB]

   
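 Note the edges in the column direction: to simplify the code, we extend the buffer by 3 rows of pixels at the top and 3 at the bottom. After the row-direction forward and backward passes over the valid rows in the middle are done, the three empty rows at the top and the bottom are filled using the repeat-edge-pixel method described earlier.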

It uses the streaming (SIMD) instructions of Intel processors, and its processing speed is remarkable.

   
 In the same way, the code for GaussBlurFromBottomToTop is left for the reader to fill in.
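One possible sketch of that missing function follows. It is an assumption on my part, obtained by mirroring GaussBlurFromTopToBottom. As in that function, Data is assumed to point at the start of the padded buffer (Height + 6 rows), so image row Y lives at buffer row Y + 3 and its three lower neighbours at rows Y + 4, Y + 5 and Y + 6. The three bottom padding rows must already hold copies of the last valid row before this is called.

```cpp
// Sketch of the backward vertical pass (formula (b)) on the padded buffer; not the original code.
// Data: start of a buffer with Height + 6 rows of Width * 3 floats each.
void GaussBlurFromBottomToTop(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    for (int Y = Height - 1; Y >= 0; Y--)
    {
        float *LinePD0 = Data + (Y + 3) * Width * 3;        // row being updated
        float *LinePD1 = Data + (Y + 4) * Width * 3;        // w[n+1]
        float *LinePD2 = Data + (Y + 5) * Width * 3;        // w[n+2]
        float *LinePD3 = Data + (Y + 6) * Width * 3;        // w[n+3]
        for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3)
        {
            LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;
            LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;
            LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;
        }
    }
}
```

The row loop has to run from the bottom up, because each row reads the already-filtered rows below it.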

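When I write algorithms I aim for clean, efficient and concise code. In other words, I avoid hardware-acceleration schemes and keep the implementation simple and fast, so that it fits different hardware environments.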

   
 The final ConvertBGRAF2BGR8U is also simple, just a for loop:

So, starting from this Intel code, I rewrote and optimized it.

void ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePS = Src + Y * Width * 3;
        unsigned char *LinePD = Dest + Y * Stride;
        for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
        }
    }
}

In the end, on my 2.20GHz CPU, single-core and single-threaded and without streaming (SIMD) instructions, processing a 16-megapixel color photo takes only about 700 milliseconds.

   Within the valid range, the floating-point results above do not exceed what a byte can represent, so no dedicated Clamp step is needed.
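If you would rather not depend on that assumption, for example while experimenting with other coefficients or edge handling, a defensive conversion is cheap. The helper below, FloatToByte, is a hypothetical name of mine and not part of the original code. It rounds to nearest instead of truncating and saturates at 0 and 255.

```cpp
// Hypothetical helper: convert one float sample to a byte with rounding and saturation.
inline unsigned char FloatToByte(float V)
{
    if (!(V > 0.0f)) return 0;                      // negative values, zero and NaN map to 0
    if (V >= 255.0f) return 255;                    // saturate at the top of the byte range
    return static_cast<unsigned char>(V + 0.5f);    // round to nearest
}
```

In ConvertBGRAF2BGR8U this would replace the plain assignments, for instance LinePD[0] = FloatToByte(LinePS[0]), at the cost of two comparisons per sample.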

As usual, here are some result images for a more direct comparison.

     
 Finally there is some memory allocation and release code, shown below:

Figure 2

Figure 3

void GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
    float B0, B1, B2, B3;
    float *Buffer = (float *)malloc(Width * (Height + 6) * sizeof(float) * 3);

    CalcGaussCof(Radius, B0, B1, B2, B3);
    ConvertBGR8U2BGRAF(Src, Buffer + 3 * Width * 3, Width, Height, Stride);

    GaussBlurFromLeftToRight(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);
    GaussBlurFromRightToLeft(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);        //    if multithreading is enabled, consider moving this work into GaussBlurFromLeftToRight next to its for X loop, which reduces contention between concurrent threads

    memcpy(Buffer + 0 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + 1 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + 2 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));

    GaussBlurFromTopToBottom(Buffer, Width, Height, B0, B1, B2, B3);

    memcpy(Buffer + (Height + 3) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + (Height + 4) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
    memcpy(Buffer + (Height + 5) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));

    GaussBlurFromBottomToTop(Buffer, Width, Height, B0, B1, B2, B3);

    ConvertBGRAF2BGR8U(Buffer + 3 * Width * 3, Dest, Width, Height, Stride);

    free(Buffer);
}

Readers have asked before about how this algorithm is implemented.

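  As described above, a memory area of Height + 6 rows is allocated, mainly to simplify the vertical processing. Before GaussBlurFromTopToBottom, 3 rows are copied following the repeat-edge rule, and before GaussBlurFromBottomToTop the 3 rows at the bottom edge are copied in the same way.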

After some thought, I decided to share the code for everyone to study and use as a reference.

     
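At this point, a concise and efficient Gaussian blur is essentially complete. There is not much code and none of it is difficult, so I don't know why so many people can't get it working.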

 

 

     
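In my tests, the code above takes about 370ms to process a 3000*2000 24-bit image on a laptop with an I5-6300HQ at 2.30GHz. If, under the code generation options of the C++ compiler, the enhanced instruction set option is set to Streaming SIMD Extensions 2 (/arch:sse2), the compiled program takes about 220ms.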

Complete code:

     
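Let's try to use more of the system's resources to raise the speed further. The first thing that comes to mind is SSE optimization, and Intel has an official article with code on this topic: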

void CalGaussianCoeff(float sigma, float * a0, float * a1, float * a2, float * a3, float * b1, float * b2, float * cprev, float * cnext) {
    float alpha, lamma, k;

    if (sigma < 0.5f)
        sigma = 0.5f;
    alpha = (float)exp((0.726) * (0.726)) / sigma;
    lamma = (float)exp(-alpha);
    *b2 = (float)exp(-2 * alpha);
    k = (1 - lamma) * (1 - lamma) / (1 + 2 * alpha * lamma - (*b2));
    *a0 = k; *a1 = k * (alpha - 1) * lamma;
    *a2 = k * (alpha + 1) * lamma;
    *a3 = -k * (*b2);
    *b1 = -2 * lamma;
    *cprev = (*a0 + *a1) / (1 + *b1 + *b2);
    *cnext = (*a2 + *a3) / (1 + *b1 + *b2);
}

void gaussianHorizontal(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char  * lpColumn, int width, int height, int Channels, int Nwidth, float a0a1, float a2a3, float b1b2, float  cprev, float cnext)
{
    int HeightStep = Channels*height;
    int WidthSubOne = width - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * (a2a3)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a2a3)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a2a3)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            prevOut[3] = ((lpRowInitial[3] * (a0a1)) - (prevOut[3] * (b1b2)));

            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;

        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpColumn[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = (lpRowInitial[0] * cprev);

        for (int x = 0; x < width; ++x) {
            prevOut = ((lpRowInitial[0] * (a0a1)) - (prevOut  * (b1b2)));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep*WidthSubOne;
        bufferPerLine -= Channels;

        prevOut = (lpRowInitial[0] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut  * b1b2));
            bufferPerLine[0] += prevOut;
            lpColumn[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
}

void gaussianVertical(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char * lpColInitial, int height, int width, int Channels, float a0a1, float a2a3, float b1b2, float  cprev, float  cnext) {

    int WidthStep = Channels*width;
    int HeightSubOne = height - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep * HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a0a1) - (prevOut[3] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpColInitial[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = 0;
        prevOut = (lpRowInitial[0] * cprev);
        for (int y = 0; y < height; y++) {
            prevOut = ((lpRowInitial[0] * a0a1) - (prevOut * b1b2));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut = (lpRowInitial[0] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut * b1b2));
            bufferPerLine[0] += prevOut;
            lpColInitial[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
}
// Author's blog: http://tntmonks.cnblogs.com/ Please credit the source when reposting.
void  GaussianBlurFilter(unsigned char * input, unsigned char * output, int Width, int Height, int Stride, float GaussianSigma) {

    int Channels = Stride / Width;
    float a0, a1, a2, a3, b1, b2, cprev, cnext;

    CalGaussianCoeff(GaussianSigma, &a0, &a1, &a2, &a3, &b1, &b2, &cprev, &cnext);

    float a0a1 = (a0 + a1);
    float a2a3 = (a2 + a3);
    float b1b2 = (b1 + b2); 

    int bufferSizePerThread = (Width > Height ? Width : Height) * Channels;
    unsigned char * bufferPerLine = (unsigned char*)malloc(bufferSizePerThread);
    unsigned char * tempData = (unsigned char*)malloc(Height * Stride);
    if (bufferPerLine == NULL || tempData == NULL)
    {
        if (tempData)
        {
            free(tempData);
        }
        if (bufferPerLine)
        {
            free(bufferPerLine);
        }
        return;
    }
    for (int y = 0; y < Height; ++y) {
        unsigned char * lpRowInitial = input + Stride * y;
        unsigned char * lpColInitial = tempData + y * Channels;
        gaussianHorizontal(bufferPerLine, lpRowInitial, lpColInitial, Width, Height, Channels, Width, a0a1, a2a3, b1b2, cprev, cnext);
    }
    int HeightStep = Height*Channels;
    for (int x = 0; x < Width; ++x) {
        unsigned char * lpColInitial = output + x*Channels;
        unsigned char * lpRowInitial = tempData + HeightStep * x;
        gaussianVertical(bufferPerLine, lpRowInitial, lpColInitial, Height, Width, Channels, a0a1, a2a3, b1b2, cprev, cnext);
    }

    free(bufferPerLine);
    free(tempData);
}
void CalGaussianCoeff(float sigma, float * a0, float * a1, float * a2, float * a3, float * b1, float * b2, float * cprev, float * cnext) {
    float alpha, lamma, k;

    if (sigma < 0.5f)
        sigma = 0.5f;
    alpha = (float)exp((0.726) * (0.726)) / sigma;
    lamma = (float)exp(-alpha);
    *b2 = (float)exp(-2 * alpha);
    k = (1 - lamma) * (1 - lamma) / (1 + 2 * alpha * lamma - (*b2));
    *a0 = k; *a1 = k * (alpha - 1) * lamma;
    *a2 = k * (alpha + 1) * lamma;
    *a3 = -k * (*b2);
    *b1 = -2 * lamma;
    *cprev = (*a0 + *a1) / (1 + *b1 + *b2);
    *cnext = (*a2 + *a3) / (1 + *b1 + *b2);
}

void gaussianHorizontal(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char  * lpColumn, int width, int height, int Channels, int Nwidth, float a0a1, float a2a3, float b1b2, float  cprev, float cnext)
{
    int HeightStep = Channels*height;
    int WidthSubOne = width - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * (a2a3)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a2a3)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a2a3)) - (prevOut[2] * (b1b2)));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);
        for (int x = 0; x < width; ++x) {
            prevOut[0] = ((lpRowInitial[0] * (a0a1)) - (prevOut[0] * (b1b2)));
            prevOut[1] = ((lpRowInitial[1] * (a0a1)) - (prevOut[1] * (b1b2)));
            prevOut[2] = ((lpRowInitial[2] * (a0a1)) - (prevOut[2] * (b1b2)));
            prevOut[3] = ((lpRowInitial[3] * (a0a1)) - (prevOut[3] * (b1b2)));

            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep * WidthSubOne;
        bufferPerLine -= Channels;

        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColumn[0] = bufferPerLine[0];
            lpColumn[1] = bufferPerLine[1];
            lpColumn[2] = bufferPerLine[2];
            lpColumn[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = (lpRowInitial[0] * cprev);

        for (int x = 0; x < width; ++x) {
            prevOut = ((lpRowInitial[0] * (a0a1)) - (prevOut  * (b1b2)));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        lpColumn += HeightStep*WidthSubOne;
        bufferPerLine -= Channels;

        prevOut = (lpRowInitial[0] * cnext);

        for (int x = WidthSubOne; x >= 0; --x) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut  * b1b2));
            bufferPerLine[0] += prevOut;
            lpColumn[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColumn -= HeightStep;
            bufferPerLine -= Channels;
        }
    }
}

void gaussianVertical(unsigned char * bufferPerLine, unsigned char * lpRowInitial, unsigned char * lpColInitial, int height, int width, int Channels, float a0a1, float a2a3, float b1b2, float  cprev, float  cnext) {

    int WidthStep = Channels*width;
    int HeightSubOne = height - 1;
    if (Channels == 3)
    {
        float prevOut[3];
        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep * HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 4)
    {
        float prevOut[4];

        prevOut[0] = (lpRowInitial[0] * cprev);
        prevOut[1] = (lpRowInitial[1] * cprev);
        prevOut[2] = (lpRowInitial[2] * cprev);
        prevOut[3] = (lpRowInitial[3] * cprev);

        for (int y = 0; y < height; y++) {
            prevOut[0] = ((lpRowInitial[0] * a0a1) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a0a1) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a0a1) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a0a1) - (prevOut[3] * b1b2));
            bufferPerLine[0] = prevOut[0];
            bufferPerLine[1] = prevOut[1];
            bufferPerLine[2] = prevOut[2];
            bufferPerLine[3] = prevOut[3];
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut[0] = (lpRowInitial[0] * cnext);
        prevOut[1] = (lpRowInitial[1] * cnext);
        prevOut[2] = (lpRowInitial[2] * cnext);
        prevOut[3] = (lpRowInitial[3] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut[0] = ((lpRowInitial[0] * a2a3) - (prevOut[0] * b1b2));
            prevOut[1] = ((lpRowInitial[1] * a2a3) - (prevOut[1] * b1b2));
            prevOut[2] = ((lpRowInitial[2] * a2a3) - (prevOut[2] * b1b2));
            prevOut[3] = ((lpRowInitial[3] * a2a3) - (prevOut[3] * b1b2));
            bufferPerLine[0] += prevOut[0];
            bufferPerLine[1] += prevOut[1];
            bufferPerLine[2] += prevOut[2];
            bufferPerLine[3] += prevOut[3];
            lpColInitial[0] = bufferPerLine[0];
            lpColInitial[1] = bufferPerLine[1];
            lpColInitial[2] = bufferPerLine[2];
            lpColInitial[3] = bufferPerLine[3];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
    else if (Channels == 1)
    {
        float prevOut = 0;
        prevOut = (lpRowInitial[0] * cprev);
        for (int y = 0; y < height; y++) {
            prevOut = ((lpRowInitial[0] * a0a1) - (prevOut * b1b2));
            bufferPerLine[0] = prevOut;
            bufferPerLine += Channels;
            lpRowInitial += Channels;
        }
        lpRowInitial -= Channels;
        bufferPerLine -= Channels;
        lpColInitial += WidthStep*HeightSubOne;
        prevOut = (lpRowInitial[0] * cnext);
        for (int y = HeightSubOne; y >= 0; y--) {
            prevOut = ((lpRowInitial[0] * a2a3) - (prevOut * b1b2));
            bufferPerLine[0] += prevOut;
            lpColInitial[0] = bufferPerLine[0];
            lpRowInitial -= Channels;
            lpColInitial -= WidthStep;
            bufferPerLine -= Channels;
        }
    }
}
// Author's blog: http://tntmonks.cnblogs.com/ Please credit the source when reposting.
void  GaussianBlurFilter(unsigned char * input, unsigned char * output, int Width, int Height, int Stride, float GaussianSigma) {

    int Channels = Stride / Width;
    float a0, a1, a2, a3, b1, b2, cprev, cnext;

    CalGaussianCoeff(GaussianSigma, &a0, &a1, &a2, &a3, &b1, &b2, &cprev, &cnext);

    float a0a1 = (a0 + a1);
    float a2a3 = (a2 + a3);
    float b1b2 = (b1 + b2); 

    int bufferSizePerThread = (Width > Height ? Width : Height) * Channels;
    unsigned char * bufferPerLine = (unsigned char*)malloc(bufferSizePerThread);
    unsigned char * tempData = (unsigned char*)malloc(Height * Stride);
    if (bufferPerLine == NULL || tempData == NULL)
    {
        if (tempData)
        {
            free(tempData);
        }
        if (bufferPerLine)
        {
            free(bufferPerLine);
        }
        return;
    }
    for (int y = 0; y < Height; ++y) {
        unsigned char * lpRowInitial = input + Stride * y;
        unsigned char * lpColInitial = tempData + y * Channels;
        gaussianHorizontal(bufferPerLine, lpRowInitial, lpColInitial, Width, Height, Channels, Width, a0a1, a2a3, b1b2, cprev, cnext);
    }
    int HeightStep = Height*Channels;
    for (int x = 0; x < Width; ++x) {
        unsigned char * lpColInitial = output + x*Channels;
        unsigned char * lpRowInitial = tempData + HeightStep * x;
        gaussianVertical(bufferPerLine, lpRowInitial, lpColInitial, Height, Width, Channels, a0a1, a2a3, b1b2, cprev, cnext);
    }

    free(bufferPerLine);
    free(tempData);
}

IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions [PDF 513KB]
source code: gaussian_blur.cpp [36KB]

Usage:

     
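I only skimmed this article. The formulas it uses look very similar to the Deriche filter and differ from the "Recursive implementation of the Gaussian filter" paper that this post follows. Its SSE code also places many restrictions on the images it can handle, so it was of limited use as a reference here.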

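  GaussianBlurFilter(input image data, output image data, width, height, stride, sigma)

  The fifth argument is the row stride in bytes; the function derives the channel count from it as Stride / Width.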

     
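Let's start with the SSE optimization of the core computation. In GaussBlurFromLeftToRight the core work is a handful of multiplications, but a 3-channel image only gives three rows of them. If that can be padded to four rows, SSE's 4-wide float arithmetic is fully used. In other words, if we add one channel so that GaussBlurFromLeftToRight takes the following form: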

  Note: the supported channel counts are 1, 3, and 4.

void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 4;
        float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0];                //  replicate the edge pixel at the boundary
        float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
        float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
        float AS1 = LinePD[3], AS2 = LinePD[3], AS3 = LinePD[3];
        for (int X = 0; X < Width; X++, LinePD += 4)
        {
            LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
            LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // forward iteration
            LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
            LinePD[3] = LinePD[3] * B0 + AS1 * B1 + AS2 * B2 + AS3 * B3;
            BS3 = BS2, BS2 = BS1, BS1 = LinePD[0];
            GS3 = GS2, GS2 = GS1, GS1 = LinePD[1];
            RS3 = RS2, RS2 = RS1, RS1 = LinePD[2];
            AS3 = AS2, AS2 = AS1, AS1 = LinePD[3];
        }
    }
}

For background on IIR filters, see the Baidu Baike entry "IIR数字滤波器" (IIR digital filter).

  then the code above is easy to convert into a form that SSE can process directly.
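The matching right-to-left pass is not listed here. A minimal sketch, assuming it simply mirrors the 4-channel GaussBlurFromLeftToRight above (the name GaussBlurFromRightToLeft_Sketch and the body are assumptions for illustration, not the author's code), could look like this:

```c
/* Hypothetical backward (right-to-left) pass for a 4-channel float image.
   Assumption: it mirrors GaussBlurFromLeftToRight, walking each row from the
   last pixel to the first and replicating the right edge pixel as the
   initial state of the recursion. */
void GaussBlurFromRightToLeft_Sketch(float *Data, int Width, int Height,
                                     float B0, float B1, float B2, float B3)
{
    if (Width <= 0) return;
    for (int Y = 0; Y < Height; Y++)
    {
        float *Row = Data + Y * Width * 4;
        float S1[4], S2[4], S3[4];
        for (int C = 0; C < 4; C++)                 /* replicate the edge pixel */
            S1[C] = S2[C] = S3[C] = Row[(Width - 1) * 4 + C];
        for (int X = Width - 1; X >= 0; X--)
        {
            float *P = Row + X * 4;
            for (int C = 0; C < 4; C++)
            {
                /* w[n] = B0*w[n] + B1*w[n+1] + B2*w[n+2] + B3*w[n+3] */
                float V = P[C] * B0 + S1[C] * B1 + S2[C] * B2 + S3[C] * B3;
                S3[C] = S2[C]; S2[C] = S1[C]; S1[C] = V;
                P[C] = V;
            }
        }
    }
}
```

Because the four channels of a pixel are updated with the same coefficients and are adjacent in memory, the inner loop over C is exactly the part that one 4-wide SSE multiply-add sequence would replace.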

http://baike.baidu.com/view/3088994.htm


  As for the Y direction, a close look shows that an image with any number of channels can be processed with SSE naturally; see the code further down.

In all the martial arts under heaven, only speed is unbeatable.
This post is only meant to get the discussion started. For related questions or requests, feel free to contact me by email.

  Now let's go through the functions one by one:

Email:
gaozhihan@vip.qq.com

   The first function, CalcGaussCof, needs no optimization.
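CalcGaussCof itself is not shown in this part of the post. As a rough sketch of what such a function computes, the following uses the coefficient formulas from "Recursive implementation of the Gaussian filter" as they are commonly quoted. The function name, the pointer-based signature, and the restriction to Sigma >= 2.5 are assumptions made here, and the constants should be checked against the paper before use:

```c
/* Hypothetical stand-in for CalcGaussCof. Returns 1 on success, 0 if Sigma
   is outside the range this sketch handles. For Sigma < 2.5 the paper uses a
   different expression for q, which is omitted here. */
int CalcGaussCof_Sketch(float Sigma, float *B0, float *B1, float *B2, float *B3)
{
    if (Sigma < 2.5f) return 0;

    double q  = 0.98711 * Sigma - 0.96330;
    double q2 = q * q;
    double q3 = q2 * q;

    double b0 = 1.57825 + 2.44413 * q + 1.4281 * q2 + 0.422205 * q3;
    double b1 = 2.44413 * q + 2.85619 * q2 + 1.26661 * q3;
    double b2 = -(1.4281 * q2 + 1.26661 * q3);
    double b3 = 0.422205 * q3;

    /* Pre-divide by b0 so the per-pixel recursion needs no division. */
    *B1 = (float)(b1 / b0);
    *B2 = (float)(b2 / b0);
    *B3 = (float)(b3 / b0);
    *B0 = (float)(1.0 - (b1 + b2 + b3) / b0);   /* keeps the DC gain at 1 */
    return 1;
}
```

The four outputs sum to 1, which is why a flat image region stays unchanged after each recursive pass. The function runs once per blur call, so there is nothing to gain from vectorizing it.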

 

 

     
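The second function, ConvertBGR8U2BGRAF, has to be rewritten following the analysis above because a channel must be added. The new channel can be filled with 0 or any other value, but 0 is recommended because it is convenient for some SSE operations. Here is the SSE implementation of this function: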

An aside:

void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, int Height, int Stride)
{
    const int BlockSize = 4;
    int Block = (Width - 2) / BlockSize;
    __m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);            //    a mask byte of -1 sets the corresponding output byte to 0
    __m128i Zero = _mm_setzero_si128();
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePS = Src + Y * Stride;
        float *LinePD = Dest + Y * Width * 4;
        int X = 0;
        for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4)
        {
            __m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);        //    16 bytes are loaded, but since 24-bit pixels are widened to 32-bit, only the first 12 bytes are used
            __m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero);
            __m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero);
            _mm_store_ps(LinePD + 0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16L, Zero)));        //    the buffer is allocated 16-byte aligned and each row holds a multiple of 4 floats, so every row is 16-byte aligned
            _mm_store_ps(LinePD + 4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16L, Zero)));        //    would _mm_stream_ps be faster?
            _mm_store_ps(LinePD + 8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16H, Zero)));
            _mm_store_ps(LinePD + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16H, Zero)));
        }
        for (; X < Width; X++, LinePS += 3, LinePD += 4)
        {
            LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;        //    better to set alpha to 0 here; the CofB0 etc. SSE constants below also zero the alpha coefficient, which forces the result to 0, but that is not the cleanest approach
        }
    }
}

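  A brief explanation: _mm_loadu_si128 loads 16 bytes into an SSE register at once. For a 24-bit image those 16 bytes hold 5 1/3 pixels, while our goal is 4-channel data, so each iteration can only extract 16/4 = 4 pixels. That is done with _mm_shuffle_epi8 (an SSSE3 instruction) and a suitable Mask. _mm_unpacklo_epi8/_mm_unpackhi_epi8 then take the low 8 bytes and the high 8 bytes of SrcV and widen each group to eight 16-bit values, stored in Src16L and Src16H. _mm_unpacklo_epi16/_mm_unpackhi_epi16 widen the 16-bit values further to 32-bit integers, and _mm_cvtepi32_ps finally converts the integers to floats.

Many readers keep recommending OpenCV. OpenCV is indeed very powerful, but if you want more room to grow and more creative ability, you still need to implement the most basic algorithms yourself, one step at a time. A solid foundation is the precondition for building anything on top of it.

At present I treat OpenCV only as a reference library and do not think it can be used in the vast majority of commercial projects.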

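  Some readers may have noticed the line int Block = (Width - 2) / BlockSize; and wondered about the -2. I only understood it after repeated tests kept producing memory errors. _mm_loadu_si128 loads 5 1/3 pixels at a time, so when processing the last row of pixels (other rows are not affected), taking Block as Width / BlockSize would very likely read memory beyond the image. It is -2 rather than -1 precisely because of that extra 1/3 pixel.

If this post helped you, I shamelessly ask for a tip via the WeChat QR code.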
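The bound can be checked with a little arithmetic. The sketch below assumes the last row is tightly packed (Stride equal to Width * 3, with nothing allocated after it); the helper names SafeBlockCount and LastLoadEnd are made up for this illustration:

```c
/* Number of 4-pixel blocks that can be converted with 16-byte loads from a
   row of Width packed 24-bit pixels without reading past the end of the row. */
int SafeBlockCount(int Width)
{
    const int BlockSize = 4;
    return (Width - 2) / BlockSize;
}

/* Byte offset one past the end of the last 16-byte load, given the number of
   blocks. Each block advances the source pointer by 4 pixels * 3 bytes. */
int LastLoadEnd(int Blocks)
{
    return (Blocks - 1) * 4 * 3 + 16;
}
```

With Width = 8, Width / BlockSize gives 2 blocks and the second load ends at byte 28, past the 24 bytes of the row. With Width = 5, (Width - 1) / BlockSize still gives 1 block whose load ends at byte 16, past the 15 bytes of the row. Only (Width - 2) / BlockSize keeps every load inside the row.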


  Next, the SSE version of the horizontal forward pass:

void GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
    const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
    const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
    const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePD = Data + Y * Width * 4;
        __m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
        __m128 V2 = V1, V3 = V1;
        for (int X = 0; X < Width; X++, LinePD += 4)            //    an alternative computes 3 values at once and avoids this V3/V2/V1 rotation (see the Gaussian blur under D:\程序设计\正在研究\Core\ShockFilter\Convert), but it is no faster
        {
            __m128 V0 = _mm_load_ps(LinePD);
            __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
            __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
            __m128 V = _mm_add_ps(V01, V23);
            V3 = V2;    V2 = V1;    V1 = V;
            _mm_store_ps(LinePD, V);
        }
    }
}

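  Compared with the 4-channel GaussBlurFromLeftToRight above, the SSE version is almost line-for-line the same; it really is that simple. Note that I use _mm_store_ps rather than _mm_storeu_ps. This works because Data is allocated with _mm_malloc on a 16-byte boundary and every row of Data holds a multiple of 4 floats, so the start of each row is also 16-byte aligned and an aligned store is safe.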

   
 Likewise, the SSE code for the vertical forward pass is even more direct:

void GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
    const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
    const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
    const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
    const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
    for (int Y = 0; Y < Height; Y++)
    {
        float *LinePS3 = Data + (Y + 0) * Width * 4;
        float *LinePS2 = Data + (Y + 1) * Width * 4;
        float *LinePS1 = Data + (Y + 2) * Width * 4;
        float *LinePS0 = Data + (Y + 3) * Width * 4;
        for (int X = 0; X < Width * 4; X += 4)
        {
            __m128 V3 = _mm_load_ps(LinePS3 + X);
            __m128 V2 = _mm_load_ps(LinePS2 + X);
            __m128 V1 = _mm_load_ps(LinePS1 + X);
            __m128 V0 = _mm_load_ps(LinePS0 + X);
            __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
            __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
            _mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));
        }
    }
}

  The code above needs no further explanation.

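  The hardest part is probably the SSE version of ConvertBGRAF2BGR8U. For various reasons I would rather not share the code of this function; interested readers can refer to the corresponding implementation in OpenCV.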
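For readers who only need a working baseline, here is a plain scalar sketch of what that conversion has to do. It is not the author's SSE routine; the name ConvertBGRAF2BGR8U_Sketch, the round-to-nearest step, and the clamping to [0, 255] are assumptions about the intended behavior:

```c
#include <stddef.h>

/* Scalar float BGRA -> 8-bit BGR conversion (hypothetical baseline).
   Src has Width * 4 floats per row; Dest has Stride bytes per row. */
void ConvertBGRAF2BGR8U_Sketch(const float *Src, unsigned char *Dest,
                               int Width, int Height, int Stride)
{
    for (int Y = 0; Y < Height; Y++)
    {
        const float *LinePS = Src + (size_t)Y * Width * 4;
        unsigned char *LinePD = Dest + (size_t)Y * Stride;
        for (int X = 0; X < Width; X++, LinePS += 4, LinePD += 3)
        {
            for (int C = 0; C < 3; C++)          /* the padding channel is dropped */
            {
                float V = LinePS[C] + 0.5f;      /* round to nearest */
                if (V < 0.0f) V = 0.0f;          /* clamp to the 8-bit range */
                if (V > 255.0f) V = 255.0f;
                LinePD[C] = (unsigned char)V;
            }
        }
    }
}
```

An SSE version would do the same steps four pixels at a time: convert to integers, pack with saturation, then shuffle away the fourth byte of each pixel.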

   
 The main code of the final SSE Gaussian blur is as follows:

void GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
    float B0, B1, B2, B3;
    float *Buffer = (float *)_mm_malloc(Width * (Height + 6) * sizeof(float) * 4, 16);
    if (Buffer == NULL) return;        //    allocation failed, nothing to do

    CalcGaussCof(Radius, B0, B1, B2, B3);
    ConvertBGR8U2BGRAF_SSE(Src, Buffer + 3 * Width * 4, Width, Height, Stride);

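    GaussBlurFromLeftToRight_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    in the SSE version these two passes take longer than the two vertical ones below; the same holds for the C version
    GaussBlurFromRightToLeft_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    with multithreading, consider merging this pass into GaussBlurFromLeftToRight, after its for-X loop, to reduce contention between threads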

    memcpy(Buffer + 0 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + 1 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + 2 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));

    GaussBlurFromTopToBottom_SSE(Buffer, Width, Height, B0, B1, B2, B3);

    memcpy(Buffer + (Height + 3) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + (Height + 4) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
    memcpy(Buffer + (Height + 5) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));

    GaussBlurFromBottomToTop_SSE(Buffer, Width, Height, B0, B1, B2, B3);

    ConvertBGRAF2BGR8U_SSE(Buffer + 3 * Width * 4, Dest, Width, Height, Stride);

    _mm_free(Buffer);
}

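  For the same 3000*2000 color image, the SSE version takes only 145 ms, roughly 2.5 times faster than the plain C code. That gain is understandable, since the SSE code processes one extra channel. Compared with the 220 ms achieved by the compiler's own SSE optimization, however, the speedup is only about 1.5 times, which shows that the compiler's SSE optimization is quite capable.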

   
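 A further optimization is the OpenMP code commented out above. In the four functions ConvertBGR8U2BGRAF / GaussBlurFromLeftToRight / GaussBlurFromRightToLeft / ConvertBGRAF2BGR8U, simply adding #pragma omp parallel for is enough. GaussBlurFromTopToBottom_SSE and GaussBlurFromBottomToTop_SSE cannot be parallelized over their row loop, because each row depends on the rows before it, but tests show they take much less time than the horizontal passes.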

   
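For example, with OpenMP set to 2 threads on the same machine (a quad-core), the plain C version drops to 252 ms, the hand-written SSE version improves only to about 100 ms, and the compiler's own SSE optimization reaches about 150 ms, so the ratios stay at roughly the same level.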

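   For grayscale images, unfortunately, the horizontal optimization above does not apply. One option is to shuffle four rows of gray pixels into one row, but that is awkward to do with SSE. Another is to transpose the data first, process it with the vertical SSE code, transpose it back, and then run the vertical SSE pass once more on the result. The transpose itself can be done with SSE, but it needs an extra copy of the image in memory, so the gain over plain C may not be as large. I will test this when I have time.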
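The transpose step mentioned above can be sketched in scalar C as follows. The function name TransposeF is an assumption, and a real implementation would likely transpose in 4x4 tiles with SSE shuffles instead:

```c
/* Transpose a row-major Height x Width float image into Dest, which must be
   a separate buffer of Width x Height floats (this is the extra memory the
   text refers to). */
void TransposeF(const float *Src, float *Dest, int Width, int Height)
{
    for (int Y = 0; Y < Height; Y++)
        for (int X = 0; X < Width; X++)
            Dest[X * Height + Y] = Src[Y * Width + X];
}
```

After the transpose, what used to be horizontal neighbors lie in consecutive rows, so a vertical pass can treat four adjacent floats as one SSE register, just as it does for the four channels of a color pixel.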

   
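 Writing all of this took about half a month in total. Looking back over the whole process once it was done, it really is very simple; sometimes that is just how it goes.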

   
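 This post does not provide complete, directly runnable code; readers who need it will have to fill in the rest themselves. A C# calling project is provided for anyone who wants to test and compare (it does not use the OpenMP version).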

   
 http://files.cnblogs.com/files/Imageshop/GaussBLur_Sample.rar


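     Postscript: later tests showed that with a large radius both the C and the SSE versions produce abnormal artifacts that look like overflow. The main cause turned out to be that B0 becomes very small as the radius grows, so float no longer gives enough precision for the recursion. The fix is to use double. For the C version that takes seconds, but for the SSE version it is a major problem. Switching to AVX for double might not require many changes, but how widely AVX is available and so on is another matter….. In general the results are still correct for radii up to 75, which is enough for most applications. At radius 75 the image has almost no detail left, and larger radii make little difference.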
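As an illustration of the double-precision fix, here is a minimal single-channel sketch that runs the forward and backward recursion in place on a double buffer. The function name and signature are assumptions made for this example; it is not the author's code:

```c
/* One-dimensional recursive Gaussian in double precision (hypothetical).
   Forward pass:  w[n] = B0*w[n] + B1*w[n-1] + B2*w[n-2] + B3*w[n-3]
   Backward pass: w[n] = B0*w[n] + B1*w[n+1] + B2*w[n+2] + B3*w[n+3]
   Edges are handled by replicating the first or last sample. */
void GaussBlur1D_Double(double *w, int n,
                        double B0, double B1, double B2, double B3)
{
    if (n <= 0) return;

    double s1 = w[0], s2 = w[0], s3 = w[0];
    for (int i = 0; i < n; i++)
    {
        double v = B0 * w[i] + B1 * s1 + B2 * s2 + B3 * s3;
        s3 = s2; s2 = s1; s1 = v;
        w[i] = v;
    }

    s1 = s2 = s3 = w[n - 1];
    for (int i = n - 1; i >= 0; i--)
    {
        double v = B0 * w[i] + B1 * s1 + B2 * s2 + B3 * s3;
        s3 = s2; s2 = s1; s1 = v;
        w[i] = v;
    }
}
```

The structure is identical to the float version; only the storage type changes. With SSE2 a register holds two doubles instead of four floats, which is why the 4-channel layout no longer maps onto a single register.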

   
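 A practical solution to this problem is the Deriche filter. In my tests its float version shows no problems at large radii, and SSE reference code for it is also available.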


   Follow-up post: Sharing the Complete Optimization Process of the Gaussian Blur Algorithm (Part 2).

 
