* [PATCH v2 0/6] raid6: support read-modify-write
@ 2014-08-19 16:36 Markus Stockhausen
  2014-08-19 23:46 ` John Stoffel
  2014-08-21  4:58 ` NeilBrown
  0 siblings, 2 replies; 5+ messages in thread
From: Markus Stockhausen @ 2014-08-19 16:36 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]

v2: reordering and merging of patches as Neil requested. More
verification & benchmark numbers

Once again thanks to an older patch from Kumar Sundararajan and 
Dan Williams that helped me to understand RAID6 logic inside md 
better. Everything is based on ideas & discussions that started
with http://marc.info/?l=linux-raid&m=136624783417452&w=1

This is another attempt to implement RMW support for RAID6. This time the
syndrome calculation is improved too. A few things to note:

1) Patches are based on official 3.16 kernel git.

2) The required optimized syndrome functions were implemented if
possible. Generic & SSE2 are the ones that I could write & test
on my machine. If you want to test/benchmark this patch ensure
that you force select one of the two. Programmers with appropriate
hardware in their hands are encouraged to send the missing 
algorithms.

3) The raid6 test program was enhanced to verify algorithm correctness.
Additionally, this release was checked with a self-written, single-threaded
test tool I called wprd (write predictable random data). Checked features
include RAID expansion, rebuild of failed drives, different RAID6
geometries, ... The contents of /dev/md0 and the underlying block devices
were checked with sha256sum against the expected result from the unpatched
module. Knock on wood, no failures so far.

4) In the meantime I was able to grab 10 older disk drives of different
sizes and speeds and built a test rig. Simple RAID math gives
3 read + 3 write I/Os for RMW and 7 read + 3 write I/Os for RCW (6 versus
10 I/Os in total), and thus a 66% improvement for write I/Os with a size
smaller than or equal to a single chunk. As you can see, reality does not
care about math, but the effect is visible. Remember that larger arrays
will show bigger speedups.

300 seconds random write with 8 threads
3.2TB (10*400GB) RAID6 64K chunk without spare
group_thread_cnt=4

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
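
As a rough illustration of the simple RAID math above, the following Python
sketch counts the disk I/Os for a write that touches a single chunk. The
function name small_write_ios and its parameter are made up for this note and
are not part of the patches. It assumes two parity disks (P and Q) and that
exactly one data chunk of the stripe changes.

```python
def small_write_ios(total_disks: int) -> dict:
    """Count disk I/Os for updating one chunk of a RAID6 stripe.

    Assumption: total_disks includes the two parity disks P and Q.
    """
    data_disks = total_disks - 2   # P and Q occupy two disks
    writes = 3                     # new data chunk + new P + new Q
    rcw_reads = data_disks - 1     # RCW reads all other data chunks
    rmw_reads = 3                  # RMW reads old data chunk + old P + old Q
    return {"rcw": rcw_reads + writes, "rmw": rmw_reads + writes}


ios = small_write_ios(10)
ratio = ios["rcw"] / ios["rmw"]    # total I/Os, RCW relative to RMW
print(ios, round(ratio, 2))
```

For the 10 disk rig this gives 10 I/Os for RCW against 6 for RMW. The RMW
count does not depend on the array size, which is why larger arrays should
benefit more.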

Thanks Neil for your support.

Markus


[-- Attachment #2: InterScan_Disclaimer.txt --]
[-- Type: text/plain, Size: 1650 bytes --]

****************************************************************************
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

executive board:
Kadir Akin
Dr. Michael Höhnerbach

President of the supervisory board:
Hans Kristian Langva

Registry office: district court Cologne
Register number: HRB 52 497

****************************************************************************

* Re: [PATCH v2 0/6] raid6: support read-modify-write
  2014-08-19 16:36 [PATCH v2 0/6] raid6: support read-modify-write Markus Stockhausen
@ 2014-08-19 23:46 ` John Stoffel
  2014-08-20  6:30   ` AW: " Markus Stockhausen
  2014-08-21  4:58 ` NeilBrown
  1 sibling, 1 reply; 5+ messages in thread
From: John Stoffel @ 2014-08-19 23:46 UTC
  To: Markus Stockhausen; +Cc: linux-raid

>>>>> "Markus" == Markus Stockhausen <stockhausen@collogia.de> writes:

Markus> 300 seconds random write with 8 threads
Markus> 3.2TB (10*400GB) RAID6 64K chunk without spare
Markus> group_thread_cnt=4

Markus> bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
Markus>         skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
Markus>    4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
Markus>    8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
Markus>   16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
Markus>   32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
Markus>   64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
Markus>  128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
Markus>  256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
Markus>  512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s

Which is the current 3.16.0 implementation?  I can't keep it straight
in my head and you don't clearly specify which set is what we have
now, and which is your patch with its option(s) in place.

What type of system did you run this test on?  How much CPU/RAM, etc?
Can you show the configuration of the filesystem/MD volume you wrote
to as well?  Sorry to be picky here, I'm just trying to see what this
buys us.  Were the disks using SATA?  IDE?  What speed are the disks?

Also, how does the SSE2 optimization work?  Can it be turned on/off?
And how much speedup does it provide?

Otherwise, I don't see any huge improvements in the numbers, and
the only consistent win is the rmw_level=1, skip_copy=0 case.  But
even then, when the bsize is big enough, it's slower than the other
options.  So is it a win overall?

John

* AW: [PATCH v2 0/6] raid6: support read-modify-write
  2014-08-19 23:46 ` John Stoffel
@ 2014-08-20  6:30   ` Markus Stockhausen
  0 siblings, 0 replies; 5+ messages in thread
From: Markus Stockhausen @ 2014-08-20  6:30 UTC
  To: John Stoffel; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3880 bytes --]

> From: John Stoffel [john@stoffel.org]
> Sent: Wednesday, 20 August 2014 01:46
> To: Markus Stockhausen
> Cc: linux-raid@vger.kernel.org
> Subject: Re: [PATCH v2 0/6] raid6: support read-modify-write
> 
> >>>>> "Markus" == Markus Stockhausen <stockhausen@collogia.de> writes:
> 
> Markus> 300 seconds random write with 8 threads
> Markus> 3.2TB (10*400GB) RAID6 64K chunk without spare
> Markus> group_thread_cnt=4
> 
> Markus> bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
> Markus>         skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
> Markus>    4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
> Markus>    8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
> Markus>   16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
> Markus>   32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
> Markus>   64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
> Markus>  128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
> Markus>  256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
> Markus>  512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
> 
> Which is the current 3.16.0 implementation?  I can't keep it straight
> in my head and you don't clearly specify which set is what we have
> now, and which is your patch with its option(s) in place.

The 3.16 standard in the above numbers is rmw_level=0/skip_copy=0.
My patch corresponds to rmw_level=1/skip_copy=0. I only included skip_copy
because I found it on the mailing list and was interested in how rmw plays
with stable pages when no bio page copies are needed.
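
To make the comparison between these two columns easier to read, here is a
small Python sketch that computes the relative change per block size. The
throughput values are copied from the skip_copy=0 columns of the table above;
the helper name gain_percent is only an illustration.

```python
# Throughput in KB/s, copied from the table (skip_copy=0 columns).
baseline = {"4K": 140, "8K": 274, "16K": 534, "32K": 1045,
            "64K": 1962, "128K": 3898, "256K": 7638, "512K": 19688}
patched = {"4K": 165, "8K": 324, "16K": 640, "32K": 1234,
           "64K": 2282, "128K": 4113, "256K": 7557, "512K": 19652}


def gain_percent(new: float, old: float) -> float:
    """Relative change of new versus old in percent, one decimal place."""
    return round(100.0 * (new - old) / old, 1)


# rmw_level=1 relative to rmw_level=0, per block size
gains = {bs: gain_percent(patched[bs], baseline[bs]) for bs in baseline}
for bs, gain in gains.items():
    print(f"{bs:>5}: {gain:+.1f}%")
```

With these numbers the gain is roughly 16 to 20% for block sizes up to the
64K chunk size, about 5% at 128K, and slightly negative (around 1% or less)
at 256K and 512K.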
 
> What type of system did you run this test on?  How much CPU/RAM, etc?
> Can you show the configuration of the filesystem/MD volume you wrote
> to as well?  Sorry to be picky here, I'm just trying to see what this
> buys us.  Were the disks using SATA?  IDE?  What speed are the disks?

It is a single E5630 with 24GB RAM, although this should not matter as I
used direct I/O. My tests wrote directly to /dev/md0 with no filesystem in
between. The disks are a bunch of 500GB-1TB SATA I/II 7200rpm drives, a mix
of server and desktop models.
 
> Also, how does the SSE2 optimization work?  Can it be turned on/off?
> And how much speedup does it provide?

SSE2 optimization is not an option, it is a must. The RAID6 algorithms are
chosen at startup. In my case the system chooses the SSE2 implementation.
The old patch used the already available implementation of gen_syndrome().
This always overwrote the parity blocks and needed extra spare pages. The
result of the discussion was that md should use an in-place syndrome
calculation for the rmw case. Thus I was forced to copy the algorithms to a
new xor_syndrome() call.

The difference between the standard and the SSE-optimized gen_syndrome() is
in my case 1-2GB/sec versus 9-10GB/sec (iirc). So it would not make any sense
to offer rmw and fall back to an xor_syndrome() calculation with a default
implementation that is 5 times slower.

To avoid side effects the patch will disable rmw if the chosen existing
optimized gen_syndrome() function does not offer an xor_syndrome()
"brother".
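
The fallback rule can be pictured with a minimal Python sketch. The class
Raid6Algo and the function effective_rmw_level are hypothetical names chosen
for this illustration; they only model the rule "no xor_syndrome() means no
rmw" and do not mirror the actual kernel data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Raid6Algo:
    """Hypothetical description of one syndrome implementation."""
    name: str
    gen_syndrome: Callable[..., None]
    # None models an implementation without an in-place variant.
    xor_syndrome: Optional[Callable[..., None]] = None


def effective_rmw_level(requested: int, algo: Raid6Algo) -> int:
    """Return the rmw level that is usable with the chosen algorithm."""
    if algo.xor_syndrome is None:
        return 0          # no "brother" available: fall back to RCW
    return requested


def _placeholder(*args) -> None:
    """Stand-in for a real syndrome routine."""


with_xor = Raid6Algo("with-xor", _placeholder, _placeholder)
without_xor = Raid6Algo("without-xor", _placeholder)
print(effective_rmw_level(1, with_xor), effective_rmw_level(1, without_xor))
```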

> Otherwise, I don't see any huge improvements in the numbers, and
> the only consistent win is the rmw_level=1, skip_copy=0 case.  But
> even then, when the bsize is big enough, it's slower than the other
> options.  So is it a win overall?

This was the most hardware that I could find in the short time. The
original patch post gives better numbers because it used 12 disks. For me
reasonable raid6 configurations range from 10-16 disks. So for the upper
end the calculation shows even more potential (but this must be proven).

Simple math for 16 disks:
- Update one block or one chunk
- RCW case: 13 read I/Os + 3 write I/Os
- RMW case: 3 read I/Os + 3 write I/Os
 
> John

Markus

* Re: [PATCH v2 0/6] raid6: support read-modify-write
  2014-08-19 16:36 [PATCH v2 0/6] raid6: support read-modify-write Markus Stockhausen
  2014-08-19 23:46 ` John Stoffel
@ 2014-08-21  4:58 ` NeilBrown
  2014-08-21  7:08   ` AW: " Markus Stockhausen
  1 sibling, 1 reply; 5+ messages in thread
From: NeilBrown @ 2014-08-21  4:58 UTC
  To: Markus Stockhausen; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3755 bytes --]

On Tue, 19 Aug 2014 16:36:20 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> v2: reordering and merging of patches as Neil requested. More
> verification & benchmark numbers
> 
> Once again thanks to an older patch from Kumar Sundararajan and 
> Dan Williams that helped me to understand RAID6 logic inside md 
> better. Everything is based on ideas & discussions that started
> with http://marc.info/?l=linux-raid&m=136624783417452&w=1
> 
> This is another attempt to implement RMW support for RAID6. This time the
> syndrome calculation is improved too. A few things to note:
> 
> 1) Patches are based on official 3.16 kernel git.
> 
> 2) The required optimized syndrome functions were implemented if
> possible. Generic & SSE2 are the ones that I could write & test
> on my machine. If you want to test/benchmark this patch ensure
> that you force select one of the two. Programmers with appropriate
> hardware in their hands are encouraged to send the missing 
> algorithms.
> 
> 3) The raid6 test program was enhanced to verify algorithm correctness.
> Additionally, this release was checked with a self-written, single-threaded
> test tool I called wprd (write predictable random data). Checked features
> include RAID expansion, rebuild of failed drives, different RAID6
> geometries, ... The contents of /dev/md0 and the underlying block devices
> were checked with sha256sum against the expected result from the unpatched
> module. Knock on wood, no failures so far.
> 
> 4) In the meantime I was able to grab 10 older disk drives of different
> sizes and speeds and built a test rig. Simple RAID math gives
> 3 read + 3 write I/Os for RMW and 7 read + 3 write I/Os for RCW (6 versus
> 10 I/Os in total), and thus a 66% improvement for write I/Os with a size
> smaller than or equal to a single chunk. As you can see, reality does not
> care about math, but the effect is visible. Remember that larger arrays
> will show bigger speedups.
> 
> 300 seconds random write with 8 threads
> 3.2TB (10*400GB) RAID6 64K chunk without spare
> group_thread_cnt=4
> 
> bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
>         skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
>    4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
>    8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
>   16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
>   32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
>   64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
>  128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
>  256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
>  512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
> 
> Thanks Neil for your support.
> 
> Markus
> 

Thanks.  This looks a lot nicer.  If you resend them with the formatting and
s-o-b changes I mentioned I'll put them in my tree and try to do some testing
and look at bits more closely.
Two things:
1/ It would be good to have performance numbers in the patch description for
   the patch that makes it all work.  Put yours there now, we can add others
   later.

2/ When you do a partial syndrome you specify a start and end.
   Is that really what is always wanted, or what is easiest?
   I imagine that you might want to "add" or "subtract" an arbitrary subset
   of blocks.  I imagine that the "blocks" array of pointers that is
   passed could have NULLs for the blocks to ignore and would include
   all others in the computation.
   Is there a good reason for not doing that?

Apart from that the code looks quite nice and clean .... though I don't seem
to be concentrating at my best today so I reserve the right to revise that
assessment at a later date :-)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

* AW: [PATCH v2 0/6] raid6: support read-modify-write
  2014-08-21  4:58 ` NeilBrown
@ 2014-08-21  7:08   ` Markus Stockhausen
  0 siblings, 0 replies; 5+ messages in thread
From: Markus Stockhausen @ 2014-08-21  7:08 UTC
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3958 bytes --]

> From: NeilBrown [neilb@suse.de]
> Sent: Thursday, 21 August 2014 06:58
> To: Markus Stockhausen
> Cc: linux-raid@vger.kernel.org
> Subject: Re: [PATCH v2 0/6] raid6: support read-modify-write
> 
> 
> Thanks.  This looks a lot nicer.  If you resend them with the formatting and
> s-o-b changes I mentioned I'll put them in my tree and try to do some testing
> and look at bits more closely.
> Two things:
> 1/ It would be good to have performance numbers in the patch description for
>    the patch that makes it all work.  Put yours there now, we can add others
>    later.

I will send the patches in the next few days.

> 2/ When you do a partial syndrome you specify a start and end.
>    Is that really what is always wanted, or what is easiest?
>    I imagine that you might want to "add" or "subtract" an arbitrary subset
>    of blocks.  I imagine that the "blocks" array of pointers that is
>    passed could have NULLs for the blocks to ignore and would include
>    all others in the computation.
>    Is there a good reason for not doing that?

If you have a closer look, the upper layers of the patch do the NULL page
handling - see set_syndrome_sources(). In the syndrome functions this is
broken down into a "start + stop + kernel zero page" design. I opted for
that approach for the following reasons:

- The original algorithms are based on a "per-line" syndrome calculation. So
they fully calculate x bytes of the syndrome while loading x bytes from all
the source pages. If we did a full calculation for the D0 page, then the D1
page and so on, we would need to load/store the partially calculated P/Q
values multiple times. Additionally the calculation of GF(X) would be quite
hard.

- The stop page marker is the biggest win in our calculations because
everything to the right of it can be ignored.

- NULL pages between data pages are hard to handle. They would lead to
more complexity & additional branches in the assembler routines. The SSE2
implementation needs 34*<number of disks> instructions to calculate 64
bytes of the syndrome. If a disk is zero, one of these cycles can be reduced
to 20 instructions. In my estimation chances are high that only
adjacent pages will be written most of the time. So stay close to the
original design and keep things clean.

- The start page marker is the clear indication for the functions that from
now on they only need the GF(X) multiplication. So once again do not
switch over to table lookups but stay with the old design. I know that
this is overhead if you change D13 in a 16 disk raid6. But even with this
we have only a quite small CPU overhead (factor 2 for calling
xor_syndrome() twice per rmw):

- RCW: (14*34 instructions)/64 bytes + 13 read I/Os + 3 write I/Os
- RMW: (2*34+2*13*20 instructions)/64 bytes + 3 read I/Os + 3 write I/Os

On the other hand, changing D0 is an undisputed, clear win for the new
implementation:

- RCW: (14*34 instructions)/64 bytes + 13 read I/Os + 3 write I/Os
- RMW: (2*34)/64 bytes + 3 read I/Os + 3 write I/Os
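
The instruction estimates above can be reproduced with a few lines of Python.
This is only the arithmetic from the text (34 instructions per fully
processed page, 20 per page that only needs the multiplication, two
xor_syndrome() calls per rmw). The constant and function names are invented
for this note, and a single changed data page is assumed.

```python
FULL = 34   # instructions per source page and 64 bytes (from the text)
MUL = 20    # instructions for a page that only needs the multiplication


def rcw_instructions(data_disks: int) -> int:
    """RCW recomputes the syndrome over all data pages."""
    return data_disks * FULL


def rmw_instructions(changed_index: int) -> int:
    """RMW for one changed page D<changed_index>.

    The changed page is processed fully, the pages to its left only need
    the multiplication, and xor_syndrome() runs twice per rmw.
    """
    return 2 * (FULL + changed_index * MUL)


print(rcw_instructions(14))   # 16 disk raid6: 14 data pages
print(rmw_instructions(13))   # change D13
print(rmw_instructions(0))    # change D0
```

For the 16 disk example this yields 476 instructions per 64 bytes for RCW,
588 for an RMW of D13 and 68 for an RMW of D0.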

Conclusion of it all: I had the choice between simply copying the
functions and adding the XOR at the end, my optimizations, or a fully
optimized version. For the third option one needs a lot of real-life sample
data and well-defined performance comparisons. In my opinion all this
distracts from the original goal of saving disk I/Os on spinning media.

So the current design is a good balance between performance, simplicity,
code readability and avoiding spare pages. If you look at the original
implementation, it just called gen_syndrome() twice for rmw and even
with that the numbers were quite impressive.
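
For readers who want to see the start/stop idea in isolation, here is a
byte-wise Python sketch. It is not the SSE2 code from the patches: the
Python functions below merely borrow the names gen_syndrome and xor_syndrome,
their argument layout is an assumption, and gf_mul2 is a helper introduced
for this note. It uses the usual RAID6 field GF(2^8) with polynomial 0x11d
and generator 2.

```python
def gf_mul2(x: int) -> int:
    """Multiply a byte by the generator 2 in GF(2^8), polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x


def gen_syndrome(data):
    """Compute P and Q from scratch over all data pages (RCW style)."""
    size, top = len(data[0]), len(data) - 1
    p, q = bytearray(size), bytearray(size)
    for i in range(size):
        wp = wq = data[top][i]
        for z in range(top - 1, -1, -1):      # Horner scheme, high to low
            wp ^= data[z][i]
            wq = gf_mul2(wq) ^ data[z][i]
        p[i], q[i] = wp, wq
    return p, q


def xor_syndrome(pages, start, stop, p, q):
    """XOR the contribution of pages[start..stop] into p and q in place."""
    for i in range(len(p)):
        wp = wq = pages[stop][i]              # pages right of stop: ignored
        for z in range(stop - 1, start - 1, -1):
            wp ^= pages[z][i]
            wq = gf_mul2(wq) ^ pages[z][i]
        for _ in range(start):                # left of start: multiply only
            wq = gf_mul2(wq)
        p[i] ^= wp
        q[i] ^= wq


old = [bytearray([1, 2, 3, 4]), bytearray([5, 6, 7, 8]),
       bytearray([9, 10, 11, 12]), bytearray([200, 201, 202, 203])]
new = [bytearray(page) for page in old]
new[2] = bytearray([0xAA, 0xBB, 0xCC, 0xDD])  # rewrite data page D2 only

p, q = gen_syndrome(old)
xor_syndrome(old, 2, 2, p, q)                 # take the old D2 out
xor_syndrome(new, 2, 2, p, q)                 # put the new D2 in
print((p, q) == gen_syndrome(new))
```

The two xor_syndrome() calls correspond to the "factor 2" mentioned above:
one call removes the old data from P/Q, the other adds the new data, and no
spare pages are needed because P and Q are updated in place.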

> Apart from that the code looks quite nice and clean .... though I don't seem
> to be concentrating at my best today so I reserve the right to revise that
> assessment at a later date :-)
> 
> Thanks,
> NeilBrown

Markus
