* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
@ 2010-09-08 12:19 ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2010-09-08 12:19 UTC (permalink / raw)
  To: miaox
  Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Theodore Ts'o, Linux Kernel, Linux Ext4, Linux Btrfs


> According to the data, the length of the most copies is >=128.

Thanks for the data. Large is easier to optimize than small, that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)

If the number that needs a backward copy is small, then the easiest fix
would be to simply call the normal memcpy in the forward case.

The backward case could also use a string instruction copy
of course; you just have to set the direction flag.

That would be a very small code change.
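
To make the suggestion concrete, here is a minimal sketch in C (our
illustration, not the actual patch): a memmove that reuses the normal
memcpy whenever a forward copy is safe, and only walks backwards when
dst overlaps the tail of src. On x86 that backward loop could instead be
a string instruction copy with the direction flag set (std; rep movsb; cld).

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* memmove_sketch: hypothetical name; forward copies are delegated to
 * the ordinary memcpy, and the byte loop handles only the overlapping
 * backward case. */
static void *memmove_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (d <= s || d >= s + n)
		return memcpy(dst, src, n);	/* forward copy cannot clobber src */

	while (n--)				/* overlap: copy highest bytes first */
		d[n] = s[n];
	return dst;
}
```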

-Andi



^ permalink raw reply	[flat|nested] 11+ messages in thread


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and    functions
  2010-09-08 12:19 ` Andi Kleen
@ 2010-09-08 12:57 ` Miao Xie
  2010-09-08 13:05   ` Andi Kleen
  -1 siblings, 1 reply; 11+ messages in thread
From: Miao Xie @ 2010-09-08 12:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
>
>> According to the data, the length of the most copies is>=128.
>
> Thanks for the data. Large is easier to optimize than small, that's good.
>
> Could you also measure how many memsets need the backwards copy?
> (should be easy to add)

I think memset doesn't need a backward copy.
The reason memmove needs the backward copy is that a forward copy may
overwrite data which will be read later when the dest region overlaps the
src region. But memset needn't worry about this problem.
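
A quick illustration of that point (our example, not from the patch):
with overlapping regions, a naive forward byte copy re-reads bytes it has
already overwritten, whereas memset has no source region at all, so there
is no copy direction to choose.

```c
#include <string.h>
#include <assert.h>

/* naive_forward_copy: deliberately wrong for overlapping regions, to
 * show why memmove needs a backward pass. */
static void naive_forward_copy(char *dst, const char *src, int n)
{
	int i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];	/* may read a byte this loop already wrote */
}
```

Copying buf[0..3] of "abcdef" onto buf+2 this way yields "ababab",
while memmove correctly yields "ababcd".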

Thanks!
Miao  

>
> If the number is small that needs backwards then the easiest fix
> would be to simply call the normal memcpy in the forward case.
>
> That is for backward could also use a string instruction copy
> of course, just have to set the direction flag.
>
> That would be a very small code change.
>
> -Andi
>
>
>
>


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and      functions
  2010-09-08 12:57 ` Miao Xie
@ 2010-09-08 13:05   ` Andi Kleen
  2010-09-08 13:32     ` Miao Xie
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2010-09-08 13:05 UTC (permalink / raw)
  To: miaox
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

On Wed, 08 Sep 2010 20:57:07 +0800
Miao Xie <miaox@cn.fujitsu.com> wrote:

> On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
> >
> >> According to the data, the length of the most copies is>=128.
> >
> > Thanks for the data. Large is easier to optimize than small, that's
> > good.
> >
> > Could you also measure how many memsets need the backwards copy?
> > (should be easy to add)
> 
> I think memset doesn't need the backwards copy.

I meant for memmove of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure because
the original version didn't have memmove support.

Your whole thread was about making memmove faster, right?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and    functions
  2010-09-08 13:05   ` Andi Kleen
@ 2010-09-08 13:32     ` Miao Xie
  0 siblings, 0 replies; 11+ messages in thread
From: Miao Xie @ 2010-09-08 13:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

On Wed, 8 Sep 2010 15:05:19 +0200, Andi Kleen wrote:
> On Wed, 08 Sep 2010 20:57:07 +0800
> Miao Xie<miaox@cn.fujitsu.com>  wrote:
>
>> On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
>>>
>>>> According to the data, the length of the most copies is>=128.
>>>
>>> Thanks for the data. Large is easier to optimize than small, that's
>>> good.
>>>
>>> Could you also measure how many memsets need the backwards copy?
                                     ^^^^^^^ :-)
>>> (should be easy to add)
>>
>> I think memset doesn't need the backwards copy.
>
> I meant for memmove of course. Obviously memset doesn't need a backwards
> copy. That was just the only thing the script didn't measure because
> the original version didn't have memmove support.

memmove
total 5224252
[snip]

forward copy
value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  980253


backward copy
value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  4243999

The backward copy is used much more than the forward copy.

Thanks
Miao




* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
  2010-09-02 10:40     ` Andi Kleen
@ 2010-09-08 11:05       ` Miao Xie
  0 siblings, 0 replies; 11+ messages in thread
From: Miao Xie @ 2010-09-08 11:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

Hi, Andi

On Thu, 2 Sep 2010 12:40:08 +0200, Andi Kleen wrote:
>> So I improve the generic version of memcpy and memmove, and x86_64's memmove
>> which are implemented by byte copy.
>
> One should also add that most memmove()s and memcpy()s are actually
> generated by gcc as inlines (especially if you don't use the
> "make my code slow" option aka -Os) and don't use the fallback.
> The fallback depends on the gcc version and if gcc thinks the
> data is aligned or not.
>
> Sometimes one can get better code in the caller by making sure
> gcc knows the correct alignment (e.g. with suitable
> types) and size. This might be worth looking at for btrfs
> if it's really that memmove heavy.

Right! But the src and dest addresses are not fixed, so it is hard to
tell gcc whether the addresses are aligned or not.

The problem is that memmove is in fact very inefficient, and it is used on some fast paths,
for example to copy filesystem metadata, so we must improve it.

>>> I have some systemtap scripts to measure size/alignment distributions of
>>> copies on a kernel, if you have a particular workload you're interested
>>> in those could be tried.
>>
>> Good! Could you give me these script?
>
> ftp://firstfloor.org/pub/ak/probes/csum.m4
>
> You need to run them through .m4 first.
> They don't measure memmove, but that should be easy to add.

I used your script to measure size/alignment distributions of copies on a kernel when
I ran the btrfs test, and got some data:

memmove
total 325903
length
value |-------------------------------------------------- count
     1 |                                                       0
     2 |                                                       0
     4 |                                                       3
     8 |                                                       0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       57062
    32 |@@                                                  5903
    64 |@@@                                                 7296
   128 |@@@@@@@@@                                          18868
   256 |@@@@@@@@@@@@@@@@@                                  33790
   512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   64886
  1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 98450
  2048 |@@@@@@@@@@@@@@@@@@@@                               39645
  4096 |                                                       0
  8192 |                                                       0

length upto 50
value |-------------------------------------------------- count
     2 |                                                        0
     3 |                                                        0
     4 |                                                        3
     5 |                                                        0
     6 |                                                        0
       ~
    21 |                                                        0
    22 |                                                        0
    23 |                                                       24
    24 |                                                        0
    25 |@@@@@@@@@@                                          57037
    26 |                                                        0
    27 |                                                        0
       ~
    29 |                                                        0
    30 |                                                        0
    31 |                                                        1
    32 |                                                        3
    33 |                                                       78
    34 |                                                      215
    35 |                                                     1865
    36 |                                                      432
    37 |                                                        0
    38 |                                                        0
    39 |                                                        0
    40 |                                                        0
    41 |                                                      130
    42 |                                                        0
    43 |                                                       80
    44 |                                                        0
    45 |                                                        0
    46 |                                                        0
    47 |                                                        0
    48 |                                                       80
    49 |                                                        0
    50 |                                                     1077
   >50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  264878

src unalignments
value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       23173
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17623
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      23760
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17372
     4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               19185
     5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  26264
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             20288
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18474
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             20160
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17754
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           21450
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 18127
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        23075
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18582
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          21879
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18737
    16 |                                                       0

dst unalignments
value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  28566
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17449
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               20980
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17239
     4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17171
     5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                        15691
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                20558
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   18590
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               20644
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              21459
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                20384
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 19460
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           23087
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 19656
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     26330
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   18639
    16 |                                                       0

same unalignments
value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  5695
     1 |@@@@@@@@@@@@@@@                                    1815
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@                          2850
     3 |@@@@@@@@@@@@@@@                                    1819
     4 |@@@@@@@@@@@@@@@@@@@@@@@@                           2791
     5 |@@@@@                                               573
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      3358
     7 |@@@@@@@@@@@@@@@@@@@@@                              2411
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      3340
     9 |@@@@@@@@@@@@@@@@@@@@@                              2404
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     3475
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@                         3019
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               4153
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                         3052
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               4212
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@                         3011
    16 |                                                      0

different unalignments
value |-------------------------------------------------- count
     0 |                                                       0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      41652
     2 |@@@@@@@@@@@@@@@@@@                                 25447
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  69946
     4 |@                                                   1800
     5 |@                                                   1701
     6 |@                                                   1573
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       62478
     8 |                                                     810
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                49180
    10 |                                                     684
    11 |                                                     148
    12 |                                                    1148
    13 |@@@@@@@@@@                                         15298
    14 |@@                                                  3459
    15 |@                                                   2601
    16 |                                                       0

According to the data, the length of most copies is >=128.

Besides that, I did some tests under various conditions on my x86_64 box.
The test does a 500-byte memory copy 50,000 times.

The test result is following:

dest < src || dest >= src + len:
Length	Src Unalign	Dest Unalign	Without Patch	Patch applied
------	-----------	------------	-------------	-------------
8	0		0		0s 223086us	0s 230612us
8	0		3		0s 133857us	0s 138364us
8	0		6		0s 133852us	0s 138364us
8	3		0		0s 133845us	0s 138365us
8	3		3		0s 133845us	0s 138364us
8	3		6		0s 133841us	0s 138363us
8	6		0		0s 133848us	0s 138364us
8	6		3		0s 133842us	0s 138363us
8	6		6		0s 133840us	0s 138360us
16	0		0		0s 133847us	0s 138364us
16	0		3		0s 133851us	0s 138362us
16	0		6		0s 133842us	0s 138368us
16	3		0		0s 133849us	0s 138360us
16	3		3		0s 133844us	0s 138362us
16	3		6		0s 133839us	0s 138365us
16	6		0		0s 133845us	0s 138359us
16	6		3		0s 133841us	0s 138368us
16	6		6		0s 133841us	0s 138363us
32	0		0		0s 160914us	0s 165435us
32	0		3		0s 160925us	0s 165427us
32	0		6		0s 160898us	0s 165443us
32	3		0		0s 160930us	0s 165432us
32	3		3		0s 160898us	0s 165434us
32	3		6		0s 160919us	0s 165433us
32	6		0		0s 160909us	0s 165436us
32	6		3		0s 160914us	0s 165425us
32	6		6		0s 160910us	0s 165439us
256	0		0		0s 294756us	0s 299280us
256	0		3		0s 500777us	0s 505367us
256	0		6		0s 655671us	0s 660232us
256	3		0		0s 497769us	0s 503386us
256	3		3		0s 497790us	0s 502358us
256	3		6		0s 500793us	0s 505253us
256	6		0		0s 497751us	0s 503097us
256	6		3		0s 497773us	0s 502242us
256	6		6		0s 497769us	0s 502192us
1024	0		0		0s 457170us	0s 461843us
1024	0		3		1s 655705us	1s 660707us
1024	0		6		2s 388031us	2s 391429us
1024	3		0		1s 652717us	1s 660362us
1024	3		3		2s 755214us	1s 657005us
1024	3		6		1s 655735us	1s 660939us
1024	6		0		1s 669425us	1s 662643us
1024	6		3		1s 653472us	1s 659986us
1024	6		6		1s 653559us	1s 662163us

dest > src && dest < src + len:
Length	Src Unalign	Dest Unalign	Without Patch	Patch applied
------	-----------	------------	-------------	-------------
8	0		3		0s 45029us	0s 45775us
8	0		6		0s 43634us	0s 48014us
8	3		0		0s 43625us	0s 43993us
8	3		6		0s 44324us	0s 45081us
8	6		0		0s 43613us	0s 44014us
8	6		3		0s 43620us	0s 44011us
16	0		0		0s 67680us	0s 49631us
16	0		3		0s 67677us	0s 77417us
16	0		6		0s 67671us	0s 81879us
16	3		0		0s 67676us	0s 52492us
16	3		3		0s 67675us	0s 75199us
16	3		6		0s 67677us	0s 81215us
16	6		0		0s 67673us	0s 52490us
16	6		3		0s 67676us	0s 77304us
16	6		6		0s 67676us	0s 79712us
32	0		0		0s 115800us	0s 49632us
32	0		3		0s 115812us	0s 81214us
32	0		6		0s 115792us	0s 85728us
32	3		0		0s 116488us	0s 60198us
32	3		3		0s 115803us	0s 79709us
32	3		6		0s 115798us	0s 85726us
32	6		0		0s 115844us	0s 60202us
32	6		3		0s 115805us	0s 81212us
32	6		6		0s 115802us	0s 84228us
256	0		0		0s 815079us	0s 91737us
256	0		3		0s 815075us	0s 194652us
256	0		6		0s 815079us	0s 198521us
256	3		0		0s 815074us	0s 171470us
256	3		3		0s 817181us	0s 114299us
256	3		6		0s 816478us	0s 198524us
256	6		0		0s 815822us	0s 173566us
256	6		3		0s 814936us	0s 196515us
256	6		6		0s 815152us	0s 118811us
1024	0		0		3s 125598us	0s 244644us
1024	0		3		3s 132574us	0s 590261us
1024	0		6		3s 125613us	0s 595143us
1024	3		0		3s 128452us	0s 568743us
1024	3		3		3s 124862us	0s 263189us
1024	3		6		3s 127440us	0s 595515us
1024	6		0		3s 122370us	0s 569479us
1024	6		3		3s 127429us	0s 590554us
1024	6		6		3s 126732us	0s 269415us

Though there is a little regression when the length is <=16, the performance
is quite good when the length is >=32.

Thanks!
Miao


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
  2010-09-02 10:11   ` Miao Xie
@ 2010-09-02 10:40     ` Andi Kleen
  2010-09-08 11:05       ` Miao Xie
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2010-09-02 10:40 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Theodore Ts'o, Linux Kernel, Linux Ext4, Linux Btrfs

> So I improve the generic version of memcpy and memmove, and x86_64's memmove
> which are implemented by byte copy.

One should also add that most memmove()s and memcpy()s are actually
generated by gcc as inlines (especially if you don't use the
"make my code slow" option aka -Os) and don't use the fallback.
The fallback depends on the gcc version and if gcc thinks the 
data is aligned or not.

Sometimes one can get better code in the caller by making sure
gcc knows the correct alignment (e.g. with suitable 
types) and size. This might be worth looking at for btrfs
if it's really that memmove heavy.
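
For example (a hypothetical struct, not btrfs's actual layout), a
fixed-size copy through properly typed pointers gives gcc both the size
and the alignment, so it can expand the copy inline:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* item_key is a made-up name for illustration. */
struct item_key {
	uint64_t objectid;
	uint64_t offset;
	uint8_t  type;
};

static inline void copy_item_key(struct item_key *dst,
				 const struct item_key *src)
{
	/* constant size + alignment known from the type: typically a
	 * few register moves, not a call to the out-of-line memcpy */
	memcpy(dst, src, sizeof(*dst));
}
```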

> >
> >I have some systemtap scripts to measure size/alignment distributions of
> >copies on a kernel, if you have a particular workload you're interested
> >in those could be tried.
> 
> Good! Could you give me these script?

ftp://firstfloor.org/pub/ak/probes/csum.m4

You need to run them through m4 first.
They don't measure memmove, but that should be easy to add.

-Andi


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
  2010-09-02  8:55 ` Andi Kleen
@ 2010-09-02 10:11   ` Miao Xie
  2010-09-02 10:40     ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Miao Xie @ 2010-09-02 10:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

[-- Attachment #1: Type: text/plain, Size: 2087 bytes --]

On Thu, 02 Sep 2010 10:55:58 +0200, Andi Kleen wrote:
> Miao Xie<miaox@cn.fujitsu.com>  writes:
>
>> Changes from V1 to V2:
>> - change the version of GPL from version 2.1 to version 2
>>
>> the kernel's memcpy and memmove is very inefficient. But the glibc version is
>> quite fast, in some cases it is 10 times faster than the kernel version. So I
>
>
> Can you elaborate on which CPUs and with what workloads you measured that?

I did this test on an x86_64 box with 4 cores. The workload is quite low;
I just do a 500-byte copy 5,000,000 times.

the attached file is my test program.

> The kernel memcpy is optimized for copies smaller than a page size
> for example (kernel very rarely does anything on larger than 4k),
> the glibc isn't. etc. There are various other differences.
>
> memcpy and memmove are very different. AFAIK noone has tried
> to optimize memmove() before because traditionally it wasn't
> used for anything performance critical in the kernel. Has that
> that changed? memcpy on the other hand while not perfect
> is actually quite optimized for typical workloads.

Yes, the performance of memcpy is good on most architectures.

But some memmove implementations do a byte-by-byte copy, which is quite inefficient.
Unfortunately those memmoves are used to modify the metadata of some filesystems,
such as btrfs. That is, memmove is important for the performance of those filesystems.

So I improve the generic versions of memcpy and memmove, and x86_64's memmove,
which are implemented by byte copy.

> One big difference between the kernel and glibc is that kernel
> is often cache cold, so you e.g. the cost of a very large code footprint
> memcpy/memset is harder to amortize.
>
> Microbenchmarks often leave out that crucial variable.
>
> I have some systemtap scripts to measure size/alignment distributions of
> copies on a kernel, if you have a particular workload you're interested
> in those could be tried.

Good! Could you give me these script?

> Just copying the glibc bloat uncritical is very likely
> the wrong move at least.

Agree!

Thanks!
Miao

[-- Attachment #2: perf_memcopy.c --]
[-- Type: text/x-csrc, Size: 1250 bytes --]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <linux/string.h>
#include <linux/slab.h>

void get_start_time(struct timeval *tv)
{
	do_gettimeofday(tv);
}

void account_time(struct timeval *stv, struct timeval *etv, int loops)
{
	do_gettimeofday(etv);

	if (loops) {
		while (etv->tv_usec < stv->tv_usec) {
			etv->tv_sec--;
			etv->tv_usec += 1000000;
		}

		etv->tv_usec -= stv->tv_usec;
		etv->tv_sec -= stv->tv_sec;

		while (etv->tv_usec >= 1000000) {
			etv->tv_usec -= 1000000;
			etv->tv_sec++;
		}

		printk("\tTotal loops: %d\n", loops);
		printk("\tTotal time: %lds%ldus\n", etv->tv_sec, etv->tv_usec);
	} else
		printk("Didn't do any loop!\n");

}

char *str;

int init_module(void)
{
	struct timeval stv, etv;
	int loops, i;

	str = kmalloc(1000, GFP_KERNEL);
	if (!str)
		return 0;
	loops = i = 5000000;

	printk("memcpy:\n");
	get_start_time(&stv);
	while (i--)
		memcpy(str + 400, str, 500);
	account_time(&stv, &etv, loops);

	i = loops;
	printk("\nmemmove:\n");
	get_start_time(&stv);
	while (i--)
		memmove(str + 400, str, 500);
	account_time(&stv, &etv, loops);

	return 0;
}

void cleanup_module(void)
{
	kfree(str);
}

MODULE_LICENSE("GPL");


* Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
  2010-09-02  5:46 ` Miao Xie
@ 2010-09-02  8:55 ` Andi Kleen
  2010-09-02 10:11   ` Miao Xie
  -1 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2010-09-02  8:55 UTC (permalink / raw)
  To: miaox
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Linux Kernel, Linux Ext4, Linux Btrfs

Miao Xie <miaox@cn.fujitsu.com> writes:

> Changes from V1 to V2:
> - change the version of GPL from version 2.1 to version 2
>
> the kernel's memcpy and memmove is very inefficient. But the glibc version is
> quite fast, in some cases it is 10 times faster than the kernel version. So I


Can you elaborate on which CPUs and with what workloads you measured that?

The kernel memcpy is optimized for copies smaller than a page size 
for example (kernel very rarely does anything on larger than 4k), 
the glibc isn't. etc. There are various other differences.

memcpy and memmove are very different. AFAIK no one has tried
to optimize memmove() before because traditionally it wasn't
used for anything performance-critical in the kernel. Has that
changed? memcpy, on the other hand, while not perfect,
is actually quite optimized for typical workloads.

One big difference between the kernel and glibc is that the kernel
is often cache cold, so e.g. the cost of a very large code footprint
memcpy/memset is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel, if you have a particular workload you're interested
in those could be tried.

Just copying the glibc bloat uncritically is very likely
the wrong move, at least.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* [PATCH V2 1/3] lib: introduce some memory copy macros and functions
@ 2010-09-02  5:46 ` Miao Xie
  0 siblings, 0 replies; 11+ messages in thread
From: Miao Xie @ 2010-09-02  5:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Andreas Dilger
  Cc: Linux Kernel, Linux Ext4, Linux Btrfs

Changes from V1 to V2:
- change the version of GPL from version 2.1 to version 2

the kernel's memcpy and memmove are very inefficient. But the glibc versions are
quite fast; in some cases they are 10 times faster than the kernel versions. So I
introduce some memory copy macros and functions from glibc to improve the
kernel versions' performance.

The strategy of the memory functions is:
1. Copy bytes until the destination pointer is aligned.
2. Copy words in unrolled loops.  If the source and destination are not aligned
   in the same way, use word memory operations, but shift and merge two read
   words before writing.
3. Copy the few remaining bytes.
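
Step 2 can be illustrated in plain C (little-endian case, mirroring the
patch's MERGE() macro; the function name is ours): when the source is
misaligned by `off` bytes relative to the aligned destination, each
output word is stitched from two adjacent aligned input words with
shifts of 8*off and 8*(8-off) bits.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* merge_le: little-endian equivalent of MERGE(w0, sh_1, w1, sh_2). */
static uint64_t merge_le(uint64_t w0, unsigned sh_1,
			 uint64_t w1, unsigned sh_2)
{
	return (w0 >> sh_1) | (w1 << sh_2);
}
```

With off = 3, merge_le(w0, 24, w1, 40) reproduces the unaligned 8-byte
load at offset 3 from the two aligned loads w0 and w1.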

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 include/linux/memcopy.h |  226 ++++++++++++++++++++++++++
 lib/Makefile            |    3 +-
 lib/memcopy.c           |  403 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 631 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/memcopy.h
 create mode 100644 lib/memcopy.c

diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h
new file mode 100644
index 0000000..9c65ac8
--- /dev/null
+++ b/include/linux/memcopy.h
@@ -0,0 +1,226 @@
+/*
+ * memcopy.h -- definitions for memory copy functions.  Generic C version.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ * 
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+#ifndef _LINUX_MEMCOPY_H_
+#define _LINUX_MEMCOPY_H_
+
+/*
+ * The strategy of the memory functions is:
+ *
+ *   1. Copy bytes until the destination pointer is aligned.
+ *
+ *   2. Copy words in unrolled loops.  If the source and destination
+ *      are not aligned in the same way, use word memory operations,
+ *      but shift and merge two read words before writing.
+ *
+ *   3. Copy the few remaining bytes.
+ *
+ * This is fast on processors that have at least 10 registers for
+ * allocation by GCC, and that can access memory at reg+const in one
+ * instruction.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/byteorder.h>
+
+/*
+ * The macros defined in this file are:
+ *
+ * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ * 
+ * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * MERGE(old_word, sh_1, new_word, sh_2)
+ * 
+ * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ */
+
+#define OP_T_THRESHOLD	16
+
+/*
+ * Type to use for aligned memory operations.
+ * This should normally be the biggest type supported by a single load
+ * and store.
+ */
+#define	op_t	unsigned long int
+#define OPSIZ	(sizeof(op_t))
+
+/* Type to use for unaligned operations.  */
+typedef unsigned char byte;
+
+#ifndef MERGE
+# ifdef __LITTLE_ENDIAN
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))
+# elif defined(__BIG_ENDIAN)
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+# else
+#  error "Macro MERGE() hasn't been defined!"
+# endif
+#endif
+
+/*
+ * Copy exactly NBYTES bytes from SRC_BP to DST_BP,
+ * without any assumptions about alignment of the pointers.
+ */
+#ifndef BYTE_COPY_FWD
+#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)				      \
+do {									      \
+	size_t __nbytes = (nbytes);					      \
+	while (__nbytes > 0) {						      \
+		byte __x = ((byte *) src_bp)[0];			      \
+		src_bp += 1;						      \
+		__nbytes -= 1;						      \
+		((byte *) dst_bp)[0] = __x;				      \
+		dst_bp += 1;						      \
+	}								      \
+} while (0)
+#endif
+
+/*
+ * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the bytes right before the pointers and continuing towards
+ * smaller addresses.  Don't assume anything about alignment of the
+ * pointers.
+ */
+#ifndef BYTE_COPY_BWD
+#define BYTE_COPY_BWD(dst_ep, src_ep, nbytes)				      \
+do {									      \
+	size_t __nbytes = (nbytes);					      \
+	while (__nbytes > 0) {						      \
+		byte __x;						      \
+		src_ep -= 1;						      \
+		__x = ((byte *) src_ep)[0];				      \
+		dst_ep -= 1;						      \
+		__nbytes -= 1;						      \
+		((byte *) dst_ep)[0] = __x;				      \
+	}								      \
+} while (0)
+#endif
+/*
+ * Copy *up to* NBYTES bytes from SRC_BP to DST_BP, with
+ * the assumption that DST_BP is aligned on an OPSIZ multiple.  If
+ * not all bytes could be easily copied, store remaining number of bytes
+ * in NBYTES_LEFT, otherwise store 0.
+ */
+extern void _wordcopy_fwd_aligned(long int, long int, size_t);
+extern void _wordcopy_fwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_FWD
+#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes)		      \
+do {									      \
+	if (src_bp % OPSIZ == 0)					      \
+		_wordcopy_fwd_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);     \
+	else								      \
+		_wordcopy_fwd_dest_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);\
+									      \
+	src_bp += (nbytes) & -OPSIZ;					      \
+	dst_bp += (nbytes) & -OPSIZ;					      \
+	(nbytes_left) = (nbytes) % OPSIZ;				      \
+} while (0)
+#endif
+
+/*
+ * Copy *up to* NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the words (of type op_t) right before the pointers and
+ * continuing towards smaller addresses.  May take advantage of that
+ * DST_END_PTR is aligned on an OPSIZ multiple.  If not all bytes could be
+ * easily copied, store remaining number of bytes in NBYTES_REMAINING,
+ * otherwise store 0.
+ */
+extern void _wordcopy_bwd_aligned(long int, long int, size_t);
+extern void _wordcopy_bwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_BWD
+#define WORD_COPY_BWD(dst_ep, src_ep, nbytes_left, nbytes)		      \
+do {									      \
+	if (src_ep % OPSIZ == 0)					      \
+		_wordcopy_bwd_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);     \
+	else								      \
+		_wordcopy_bwd_dest_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);\
+									      \
+	src_ep -= (nbytes) & -OPSIZ;					      \
+	dst_ep -= (nbytes) & -OPSIZ;					      \
+	(nbytes_left) = (nbytes) % OPSIZ;				      \
+} while (0)
+#endif
+
+/* Copy memory from the beginning to the end */
+#ifndef MEM_COPY_FWD
+static __always_inline void mem_copy_fwd(unsigned long dstp,
+					unsigned long srcp,
+					size_t count)
+{
+	/* If there not too few bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= (-dstp) % OPSIZ;
+		BYTE_COPY_FWD(dstp, srcp, (-dstp) % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp. Number if bytes remaining is put in
+		 * the third argument.
+		 */
+		WORD_COPY_FWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy. Use byte memory operations. */
+	BYTE_COPY_FWD(dstp, srcp, count);
+}
+#endif
+
+/* Copy memory from the end to the beginning.  */
+#ifndef MEM_COPY_BWD
+static __always_inline void mem_copy_bwd(unsigned long dstp,
+					unsigned long srcp,
+					size_t count)
+{
+	srcp += count;
+	dstp += count;
+
+	/* If there not too few bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= dstp % OPSIZ;
+		BYTE_COPY_BWD(dstp, srcp, dstp % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp. Number if bytes remaining is put in
+		 * the third argument.
+		 */
+		WORD_COPY_BWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy. Use byte memory operations. */
+	BYTE_COPY_BWD (dstp, srcp, count);
+}
+#endif
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e6a3763..10a319e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o prio_tree.o \
 	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o flex_array.o
+	 is_single_threaded.o plist.o decompress.o flex_array.o \
+	 memcopy.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/memcopy.c b/lib/memcopy.c
new file mode 100644
index 0000000..70fb6b2
--- /dev/null
+++ b/lib/memcopy.c
@@ -0,0 +1,403 @@
+/*
+ * memcopy.c -- subroutines for memory copy functions.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ * 
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+
+/* BE VERY CAREFUL IF YOU CHANGE THIS CODE...!  */
+
+#include <linux/memcopy.h>
+
+/*
+ * _wordcopy_fwd_aligned -- Copy block beginning at SRCP to block beginning
+ * at DSTP with LEN `op_t' words (not LEN bytes!).
+ * Both SRCP and DSTP should be aligned for memory operations on `op_t's.
+ */
+void _wordcopy_fwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 6 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		len += 6;
+		goto do1;
+	case 3:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 5 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		len += 5;
+		goto do2;
+	case 4:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 4 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		len += 4;
+		goto do3;
+	case 5:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 3 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len += 3;
+		goto do4;
+	case 6:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 2 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do5;
+	case 7:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 1 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 0 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		goto do7;
+	case 1:
+		a1 = ((op_t *) srcp)[0];
+		srcp -=-1 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;			/* No-op.  */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a1;
+do7:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a0;
+do6:
+		a0 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a1;
+do5:
+		a1 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a0;
+do4:
+		a0 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a1;
+do3:
+		a1 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a0;
+do2:
+		a0 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a1;
+do1:
+		a1 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a0;
+
+		srcp += 8 * OPSIZ;
+		dstp += 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = a1;
+}
+
+/*
+ * _wordcopy_fwd_dest_aligned -- Copy block beginning at SRCP to block
+ * beginning at DSTP with LEN `op_t' words (not LEN bytes!). DSTP should
+ * be aligned for memory operations on `op_t's, but SRCP must *not* be aligned.
+ */
+
+void _wordcopy_fwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make SRCP aligned by rounding it down to the beginning of the `op_t'
+	 * it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		a1 = ((op_t *) srcp)[0];
+		a2 = ((op_t *) srcp)[1];
+		srcp -= 1 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do1;
+	case 3:
+		a0 = ((op_t *) srcp)[0];
+		a1 = ((op_t *) srcp)[1];
+		srcp -= 0 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a3 = ((op_t *) srcp)[0];
+		a0 = ((op_t *) srcp)[1];
+		srcp -=-1 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		len += 0;
+		goto do3;
+	case 1:
+		a2 = ((op_t *) srcp)[0];
+		a3 = ((op_t *) srcp)[1];
+		srcp -=-2 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;			/* No-op. */
+	}
+
+	do {
+do4:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+do3:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a0, sh_1, a1, sh_2);
+do1:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp += 4 * OPSIZ;
+		dstp += 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+}
+
+/*
+ * _wordcopy_bwd_aligned -- Copy block finishing right before
+ * SRCP to block finishing right before DSTP with LEN `op_t' words (not LEN
+ * bytes!).  Both SRCP and DSTP should be aligned for memory operations
+ * on `op_t's.
+ */
+void _wordcopy_bwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		srcp -= 2 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a0 = ((op_t *) srcp)[1];
+		len += 6;
+		goto do1;
+	case 3:
+		srcp -= 3 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a1 = ((op_t *) srcp)[2];
+		len += 5;
+		goto do2;
+	case 4:
+		srcp -= 4 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[3];
+		len += 4;
+		goto do3;
+	case 5:
+		srcp -= 5 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[4];
+		len += 3;
+		goto do4;
+	case 6:
+		srcp -= 6 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		a0 = ((op_t *) srcp)[5];
+		len += 2;
+		goto do5;
+	case 7:
+		srcp -= 7 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		a1 = ((op_t *) srcp)[6];
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 8 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		a0 = ((op_t *) srcp)[7];
+		goto do7;
+	case 1:
+		srcp -= 9 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		a1 = ((op_t *) srcp)[8];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;			/* No-op.  */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a1;
+do7:
+		a1 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a0;
+do6:
+		a0 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a1;
+do5:
+		a1 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a0;
+do4:
+		a0 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a1;
+do3:
+		a1 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a0;
+do2:
+		a0 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a1;
+do1:
+		a1 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a0;
+
+		srcp -= 8 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[7] = a1;
+}
+
+/*
+ * _wordcopy_bwd_dest_aligned -- Copy block finishing right before SRCP to
+ * block finishing right before DSTP with LEN `op_t' words (not LEN bytes!).
+ * DSTP should be aligned for memory operations on `op_t', but SRCP must *not*
+ * be aligned.
+ */
+void _wordcopy_bwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make srcp aligned by rounding it down to the beginning of the op_t
+	 * it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+	srcp += OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		srcp -= 3 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a2 = ((op_t *) srcp)[2];
+		a1 = ((op_t *) srcp)[1];
+		len += 2;
+		goto do1;
+	case 3:
+		srcp -= 4 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a3 = ((op_t *) srcp)[3];
+		a2 = ((op_t *) srcp)[2];
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 5 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[4];
+		a3 = ((op_t *) srcp)[3];
+		goto do3;
+	case 1:
+		srcp -= 6 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[5];
+		a0 = ((op_t *) srcp)[4];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;			/* No-op.  */
+	}
+
+	do {
+do4:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+do3:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a2, sh_1, a3, sh_2);
+do1:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp -= 4 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+}
+
-- 
1.7.0.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH V2 1/3] lib: introduce some memory copy macros and functions
@ 2010-09-02  5:46 ` Miao Xie
  0 siblings, 0 replies; 11+ messages in thread
From: Miao Xie @ 2010-09-02  5:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Chris Mason
  Cc: Linux Kernel, Linux Ext4, Linux Btrfs

Changes from V1 to V2:
- change the version of GPL from version 2.1 to version 2

The kernel's memcpy and memmove are very inefficient, but the glibc versions
are quite fast; in some cases they are 10 times faster than the kernel
versions. So I introduce some memory copy macros and functions from glibc to
improve the performance of the kernel versions.

The strategy of the memory functions is:
1. Copy bytes until the destination pointer is aligned.
2. Copy words in unrolled loops.  If the source and destination are not aligned
   in the same way, use word memory operations, but shift and merge two read
   words before writing.
3. Copy the few remaining bytes.
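The three steps above can be sketched as follows. This is a simplified,
hypothetical illustration only: step 2 here takes the word-copy path only when
the source happens to end up aligned as well, whereas the real WORD_COPY_FWD
in the patch also handles mismatched alignment by shifting and merging source
words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned long op_t;	/* word type for aligned copies */
#define OPSIZ (sizeof(op_t))

/* Hypothetical sketch of the three-step forward-copy strategy. */
static void *copy_fwd_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Step 1: copy bytes until the destination is word aligned. */
	while (n > 0 && ((uintptr_t)d % OPSIZ) != 0) {
		*d++ = *s++;
		n--;
	}

	/*
	 * Step 2: copy whole words.  This sketch only does so when the
	 * source is now aligned too; the real code instead shifts and
	 * merges two source words (see MERGE) before each store.
	 */
	if (((uintptr_t)s % OPSIZ) == 0) {
		while (n >= OPSIZ) {
			*(op_t *)d = *(const op_t *)s;
			d += OPSIZ;
			s += OPSIZ;
			n -= OPSIZ;
		}
	}

	/* Step 3: copy the few remaining bytes. */
	while (n > 0) {
		*d++ = *s++;
		n--;
	}
	return dst;
}
```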

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 include/linux/memcopy.h |  226 ++++++++++++++++++++++++++
 lib/Makefile            |    3 +-
 lib/memcopy.c           |  403 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 631 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/memcopy.h
 create mode 100644 lib/memcopy.c

diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h
new file mode 100644
index 0000000..9c65ac8
--- /dev/null
+++ b/include/linux/memcopy.h
@@ -0,0 +1,226 @@
+/*
+ * memcopy.h -- definitions for memory copy functions.  Generic C version.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ * 
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+#ifndef _LINUX_MEMCOPY_H_
+#define _LINUX_MEMCOPY_H_
+
+/*
+ * The strategy of the memory functions is:
+ *
+ *   1. Copy bytes until the destination pointer is aligned.
+ *
+ *   2. Copy words in unrolled loops.  If the source and destination
+ *      are not aligned in the same way, use word memory operations,
+ *      but shift and merge two read words before writing.
+ *
+ *   3. Copy the few remaining bytes.
+ *
+ * This is fast on processors that have at least 10 registers for
+ * allocation by GCC, and that can access memory at reg+const in one
+ * instruction.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/byteorder.h>
+
+/*
+ * The macros defined in this file are:
+ *
+ * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ * 
+ * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * MERGE(old_word, sh_1, new_word, sh_2)
+ * 
+ * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ */
+
+#define OP_T_THRESHOLD	16
+
+/*
+ * Type to use for aligned memory operations.
+ * This should normally be the biggest type supported by a single load
+ * and store.
+ */
+#define	op_t	unsigned long int
+#define OPSIZ	(sizeof(op_t))
+
+/* Type to use for unaligned operations.  */
+typedef unsigned char byte;
+
+#ifndef MERGE
+# ifdef __LITTLE_ENDIAN
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))
+# elif defined(__BIG_ENDIAN)
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+# else
+#  error "Macro MERGE() hasn't been defined!"
+# endif
+#endif
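On a little-endian machine, MERGE() reconstructs the word that sits at an
unaligned source address from two aligned reads: the high part of the first
word is shifted down and combined with the low part of the second word shifted
up. A small, hypothetical self-check of that property (it simply skips on
big-endian hosts; MERGE_LE mirrors the little-endian definition above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OPSIZ_DEMO (sizeof(unsigned long))
#define MERGE_LE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))

/*
 * Returns 1 when MERGE_LE reproduces the word at an unaligned offset
 * (or when the host is big-endian and the check is skipped), 0 otherwise.
 */
static int merge_demo(void)
{
	unsigned char buf[2 * sizeof(unsigned long)];
	unsigned long w0, w1, merged, expect;
	unsigned off = 3;			/* source misalignment */
	unsigned sh_1 = 8 * off;
	unsigned sh_2 = 8 * OPSIZ_DEMO - sh_1;
	unsigned one = 1;
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (unsigned char)i;

	if (*(unsigned char *)&one != 1)
		return 1;			/* big-endian host: skip */

	memcpy(&w0, buf, sizeof(w0));		  /* aligned word 0 */
	memcpy(&w1, buf + sizeof(w1), sizeof(w1)); /* aligned word 1 */
	merged = MERGE_LE(w0, sh_1, w1, sh_2);
	memcpy(&expect, buf + off, sizeof(expect)); /* unaligned word */
	return merged == expect;
}
```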
+
+/*
+ * Copy exactly NBYTES bytes from SRC_BP to DST_BP,
+ * without any assumptions about alignment of the pointers.
+ */
+#ifndef BYTE_COPY_FWD
+#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)				      \
+do {									      \
+	size_t __nbytes = (nbytes);					      \
+	while (__nbytes > 0) {						      \
+		byte __x = ((byte *) src_bp)[0];			      \
+		src_bp += 1;						      \
+		__nbytes -= 1;						      \
+		((byte *) dst_bp)[0] = __x;				      \
+		dst_bp += 1;						      \
+	}								      \
+} while (0)
+#endif
+
+/*
+ * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the bytes right before the pointers and continuing towards
+ * smaller addresses.  Don't assume anything about alignment of the
+ * pointers.
+ */
+#ifndef BYTE_COPY_BWD
+#define BYTE_COPY_BWD(dst_ep, src_ep, nbytes)				      \
+do {									      \
+	size_t __nbytes = (nbytes);					      \
+	while (__nbytes > 0) {						      \
+		byte __x;						      \
+		src_ep -= 1;						      \
+		__x = ((byte *) src_ep)[0];				      \
+		dst_ep -= 1;						      \
+		__nbytes -= 1;						      \
+		((byte *) dst_ep)[0] = __x;				      \
+	}								      \
+} while (0)
+#endif
+/*
+ * Copy *up to* NBYTES bytes from SRC_BP to DST_BP, with
+ * the assumption that DST_BP is aligned on an OPSIZ multiple.  If
+ * not all bytes could be easily copied, store remaining number of bytes
+ * in NBYTES_LEFT, otherwise store 0.
+ */
+extern void _wordcopy_fwd_aligned(long int, long int, size_t);
+extern void _wordcopy_fwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_FWD
+#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes)		      \
+do {									      \
+	if (src_bp % OPSIZ == 0)					      \
+		_wordcopy_fwd_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);     \
+	else								      \
+		_wordcopy_fwd_dest_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);\
+									      \
+	src_bp += (nbytes) & -OPSIZ;					      \
+	dst_bp += (nbytes) & -OPSIZ;					      \
+	(nbytes_left) = (nbytes) % OPSIZ;				      \
+} while (0)
+#endif
+
+/*
+ * Copy *up to* NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the words (of type op_t) right before the pointers and
+ * continuing towards smaller addresses.  May take advantage of that
+ * DST_END_PTR is aligned on an OPSIZ multiple.  If not all bytes could be
+ * easily copied, store remaining number of bytes in NBYTES_REMAINING,
+ * otherwise store 0.
+ */
+extern void _wordcopy_bwd_aligned(long int, long int, size_t);
+extern void _wordcopy_bwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_BWD
+#define WORD_COPY_BWD(dst_ep, src_ep, nbytes_left, nbytes)		      \
+do {									      \
+	if (src_ep % OPSIZ == 0)					      \
+		_wordcopy_bwd_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);     \
+	else								      \
+		_wordcopy_bwd_dest_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);\
+									      \
+	src_ep -= (nbytes) & -OPSIZ;					      \
+	dst_ep -= (nbytes) & -OPSIZ;					      \
+	(nbytes_left) = (nbytes) % OPSIZ;				      \
+} while (0)
+#endif
+
+/* Copy memory from the beginning to the end */
+#ifndef MEM_COPY_FWD
+static __always_inline void mem_copy_fwd(unsigned long dstp,
+					unsigned long srcp,
+					size_t count)
+{
+	/* If there are enough bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= (-dstp) % OPSIZ;
+		BYTE_COPY_FWD(dstp, srcp, (-dstp) % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp. Number of bytes remaining is put in
+		 * the third argument.
+		 */
+		WORD_COPY_FWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy. Use byte memory operations. */
+	BYTE_COPY_FWD(dstp, srcp, count);
+}
+#endif
+
+/* Copy memory from the end to the beginning.  */
+#ifndef MEM_COPY_BWD
+static __always_inline void mem_copy_bwd(unsigned long dstp,
+					unsigned long srcp,
+					size_t count)
+{
+	srcp += count;
+	dstp += count;
+
+	/* If there are enough bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= dstp % OPSIZ;
+		BYTE_COPY_BWD(dstp, srcp, dstp % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp. Number of bytes remaining is put in
+		 * the third argument.
+		 */
+		WORD_COPY_BWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy. Use byte memory operations. */
+	BYTE_COPY_BWD (dstp, srcp, count);
+}
+#endif
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e6a3763..10a319e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o prio_tree.o \
 	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o flex_array.o
+	 is_single_threaded.o plist.o decompress.o flex_array.o \
+	 memcopy.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/memcopy.c b/lib/memcopy.c
new file mode 100644
index 0000000..70fb6b2
--- /dev/null
+++ b/lib/memcopy.c
@@ -0,0 +1,403 @@
+/*
+ * memcopy.c -- subroutines for memory copy functions.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ * 
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+
+/* BE VERY CAREFUL IF YOU CHANGE THIS CODE...!  */
+
+#include <linux/memcopy.h>
+
+/*
+ * _wordcopy_fwd_aligned -- Copy block beginning at SRCP to block beginning
+ * at DSTP with LEN `op_t' words (not LEN bytes!).
+ * Both SRCP and DSTP should be aligned for memory operations on `op_t's.
+ */
+void _wordcopy_fwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 6 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		len += 6;
+		goto do1;
+	case 3:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 5 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		len += 5;
+		goto do2;
+	case 4:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 4 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		len += 4;
+		goto do3;
+	case 5:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 3 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len += 3;
+		goto do4;
+	case 6:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 2 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do5;
+	case 7:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 1 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 0 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		goto do7;
+	case 1:
+		a1 = ((op_t *) srcp)[0];
+		srcp -=-1 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;			/* No-op.  */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a1;
+do7:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a0;
+do6:
+		a0 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a1;
+do5:
+		a1 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a0;
+do4:
+		a0 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a1;
+do3:
+		a1 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a0;
+do2:
+		a0 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a1;
+do1:
+		a1 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a0;
+
+		srcp += 8 * OPSIZ;
+		dstp += 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = a1;
+}
+
+/*
+ * _wordcopy_fwd_dest_aligned -- Copy block beginning at SRCP to block
+ * beginning at DSTP with LEN `op_t' words (not LEN bytes!). DSTP should
+ * be aligned for memory operations on `op_t's, but SRCP must *not* be aligned.
+ */
+
+void _wordcopy_fwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make SRCP aligned by rounding it down to the beginning of the `op_t'
+	 * it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		a1 = ((op_t *) srcp)[0];
+		a2 = ((op_t *) srcp)[1];
+		srcp -= 1 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do1;
+	case 3:
+		a0 = ((op_t *) srcp)[0];
+		a1 = ((op_t *) srcp)[1];
+		srcp -= 0 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a3 = ((op_t *) srcp)[0];
+		a0 = ((op_t *) srcp)[1];
+		srcp -=-1 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		len += 0;
+		goto do3;
+	case 1:
+		a2 = ((op_t *) srcp)[0];
+		a3 = ((op_t *) srcp)[1];
+		srcp -=-2 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;			/* No-op. */
+	}
+
+	do {
+do4:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+do3:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a0, sh_1, a1, sh_2);
+do1:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp += 4 * OPSIZ;
+		dstp += 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+}
+
+/*
+ * _wordcopy_bwd_aligned -- Copy block finishing right before
+ * SRCP to block finishing right before DSTP with LEN `op_t' words (not LEN
+ * bytes!).  Both SRCP and DSTP should be aligned for memory operations
+ * on `op_t's.
+ */
+void _wordcopy_bwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		srcp -= 2 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a0 = ((op_t *) srcp)[1];
+		len += 6;
+		goto do1;
+	case 3:
+		srcp -= 3 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a1 = ((op_t *) srcp)[2];
+		len += 5;
+		goto do2;
+	case 4:
+		srcp -= 4 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[3];
+		len += 4;
+		goto do3;
+	case 5:
+		srcp -= 5 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[4];
+		len += 3;
+		goto do4;
+	case 6:
+		srcp -= 6 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		a0 = ((op_t *) srcp)[5];
+		len += 2;
+		goto do5;
+	case 7:
+		srcp -= 7 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		a1 = ((op_t *) srcp)[6];
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 8 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		a0 = ((op_t *) srcp)[7];
+		goto do7;
+	case 1:
+		srcp -= 9 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		a1 = ((op_t *) srcp)[8];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;			/* No-op.  */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a1;
+do7:
+		a1 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a0;
+do6:
+		a0 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a1;
+do5:
+		a1 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a0;
+do4:
+		a0 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a1;
+do3:
+		a1 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a0;
+do2:
+		a0 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a1;
+do1:
+		a1 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a0;
+
+		srcp -= 8 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[7] = a1;
+}
+
+/*
+ * _wordcopy_bwd_dest_aligned -- Copy block finishing right before SRCP to
+ * block finishing right before DSTP with LEN `op_t' words (not LEN bytes!).
+ * DSTP should be aligned for memory operations on `op_t', but SRCP must *not*
+ * be aligned.
+ */
+void _wordcopy_bwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make srcp aligned by rounding it down to the beginning of the op_t
+	 * it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+	srcp += OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		srcp -= 3 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a2 = ((op_t *) srcp)[2];
+		a1 = ((op_t *) srcp)[1];
+		len += 2;
+		goto do1;
+	case 3:
+		srcp -= 4 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a3 = ((op_t *) srcp)[3];
+		a2 = ((op_t *) srcp)[2];
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 5 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[4];
+		a3 = ((op_t *) srcp)[3];
+		goto do3;
+	case 1:
+		srcp -= 6 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[5];
+		a0 = ((op_t *) srcp)[4];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;			/* No-op.  */
+	}
+
+	do {
+do4:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+do3:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a2, sh_1, a3, sh_2);
+do1:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp -= 4 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+}
+
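The forward/backward pairs above exist so that a memmove-style caller can pick
the direction that is safe for overlapping buffers: mem_copy_fwd() walks up
from the start, mem_copy_bwd() walks down from the end. A hypothetical sketch
of that direction choice, using plain byte loops in place of the word-copy
routines (note the pointer comparison assumes both pointers refer into the
same object, as memmove callers with overlap guarantee):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: pick the copy direction the way memmove must. */
static void move_sketch(unsigned char *dst, const unsigned char *src,
			size_t n)
{
	size_t i;

	if (dst <= src || dst >= src + n) {
		/* No harmful overlap: a forward copy is safe. */
		for (i = 0; i < n; i++)
			dst[i] = src[i];
	} else {
		/*
		 * Destination overlaps the source from above: copy
		 * backwards so bytes are read before they are clobbered.
		 */
		for (i = n; i-- > 0; )
			dst[i] = src[i];
	}
}
```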
-- 
1.7.0.1


end of thread, other threads:[~2010-09-08 13:32 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2010-09-08 12:19 [PATCH V2 1/3] lib: introduce some memory copy macros and functions Andi Kleen
2010-09-08 12:19 ` Andi Kleen
2010-09-08 12:57 ` Miao Xie
2010-09-08 13:05   ` Andi Kleen
2010-09-08 13:32     ` Miao Xie
  -- strict thread matches above, loose matches on Subject: below --
2010-09-02  5:46 Miao Xie
2010-09-02  5:46 ` Miao Xie
2010-09-02  8:55 ` Andi Kleen
2010-09-02 10:11   ` Miao Xie
2010-09-02 10:40     ` Andi Kleen
2010-09-08 11:05       ` Miao Xie
