* [dm-devel] Raid0 performance regression
@ 2022-01-21 16:38 Roger Willcocks
  2022-01-23 18:00   ` Lukas Straub
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Willcocks @ 2022-01-21 16:38 UTC (permalink / raw)
  To: dm-devel; +Cc: Roger Willcocks

Hi folks,

We noticed a thirty percent drop in performance on one of our raid
arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
striping to balance (by time) access to a pair of hardware raid-6
arrays. The underlying issue is also present in the native raid0
driver, so herewith the gory details; I'd appreciate your thoughts.

--

blkdev_direct_IO() calls submit_bio(), which calls an outermost
generic_make_request() (aka submit_bio_noacct()).

md_make_request() calls blk_queue_split(), which cuts an incoming
request into two parts, with the first no larger than get_max_io_size()
bytes (which, in the case of raid0, is the chunk size):

  R -> AB
  
blk_queue_split() gives the second part 'B' to generic_make_request()
to worry about later and returns the first part 'A'.
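
For context, the deferral works because of the recursion-avoidance
loop inside generic_make_request() / submit_bio_noacct(). A grossly
simplified sketch, for illustration only (the real code keeps two
per-task lists and re-sorts bios for lower-level devices ahead of
same-level ones; field names follow the 4.18-era code):

void generic_make_request(struct bio *bio)
{
	struct bio_list pending;

	if (current->bio_list) {
		/* called from inside a ->make_request_fn below us:
		 * don't recurse, just remember the bio for later */
		bio_list_add(current->bio_list, bio);
		return;
	}

	/* outermost caller: keep dispatching until nothing is left */
	bio_list_init(&pending);
	current->bio_list = &pending;
	do {
		struct request_queue *q = bio->bi_disk->queue;

		q->make_request_fn(q, bio);	/* may queue more bios */
	} while ((bio = bio_list_pop(&pending)) != NULL);
	current->bio_list = NULL;
}

Anything queued while a driver's request function is running is only
looked at again after that function returns, which is what allows the
reordering described below.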

md_make_request() then passes 'A' to a more specific request handler,
in this case raid0_make_request().

raid0_make_request() cuts its incoming request into two parts at the
next chunk boundary:

  A -> ab

It then fixes up the device (chooses a physical device) for 'a', and
gives both parts, separately, to generic_make_request().

This is where things go awry, because 'b' is still targeted at the
original device (same as 'B'), but 'B' was queued before 'b'. So we
end up with:

  R -> Bab

The outermost generic_make_request() then cuts 'B' at
get_max_io_size(), and the process repeats. ASCII art follows:


    /---------------------------------------------------/   incoming rq

    /--------/--------/--------/--------/--------/------/   max_io_size
      
|--------|--------|--------|--------|--------|--------|--------| chunks

|...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
      a    b  c     d  e     f  g     h  i     j  k     l

Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
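
To make the geometry concrete, here's a tiny stand-alone sketch (the
256K chunk size and the request offsets are made up for illustration,
nothing is taken from the driver) showing that every max_io_size piece
straddles a chunk boundary whenever the request doesn't start on one:

#include <stdio.h>

#define CHUNK (256 * 1024UL)	/* assumed chunk size = max_io_size */

int main(void)
{
	unsigned long start = 96 * 1024UL;	/* request start, mid-chunk */
	unsigned long len   = 6 * CHUNK;	/* request length */
	unsigned long off;

	for (off = 0; off < len; off += CHUNK) {
		unsigned long piece = (len - off < CHUNK) ? len - off : CHUNK;
		unsigned long lo = start + off, hi = lo + piece;
		/* blk_queue_split() cuts every CHUNK bytes from the start
		 * of the request; raid0_make_request() must then cut again
		 * at the next chunk boundary inside that piece */
		unsigned long boundary = (lo / CHUNK + 1) * CHUNK;

		if (boundary < hi)
			printf("[%7lu, %7lu) -> split again at %7lu\n",
			       lo, hi, boundary);
		else
			printf("[%7lu, %7lu) -> fits in one chunk\n", lo, hi);
	}
	return 0;
}

Each of the six max_io_size pieces reports a second split, which is the
a/b, c/d, e/f, g/h, i/j, k/l pairing in the diagram above.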

--

There are several potential fixes:

the simplest is to set raid0's blk_queue_max_hw_sectors() to UINT_MAX
instead of the chunk size, so that raid0_make_request() receives the
entire transfer length and cuts it up at chunk boundaries;

the neatest is for raid0_make_request() to recognise that 'b' doesn't
cross a chunk boundary, so it can be sent directly to the physical
device;

and the correct one is for blk_queue_split() to requeue 'A' before 'B'.

--

There's also a second issue: with a large raid0 chunk size (256K), the
segments submitted to the physical device are at least 128K and
trigger the early unplug code in blk_mq_make_request(), so the
requests are never merged. There are legitimate reasons for a large
chunk size, so this seems unhelpful.

--

As I said, I'd appreciate your thoughts.

--

Roger

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dm-devel] Raid0 performance regression
  2022-01-21 16:38 [dm-devel] Raid0 performance regression Roger Willcocks
@ 2022-01-23 18:00   ` Lukas Straub
  0 siblings, 0 replies; 9+ messages in thread
From: Lukas Straub @ 2022-01-23 18:00 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid

CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.

On Fri, 21 Jan 2022 16:38:03 +0000
Roger Willcocks <roger@filmlight.ltd.uk> wrote:

> Hi folks,
> 
> we noticed a thirty percent drop in performance on one of our raid
> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> striping to balance (by time) access to a pair of hardware raid-6
> arrays. The underlying issue is also present in the native raid0
> driver so herewith the gory details; I'd appreciate your thoughts.
> 
> --
> 
> blkdev_direct_IO() calls submit_bio() which calls an outermost
> generic_make_request() (aka submit_bio_noacct()).
> 
> md_make_request() calls blk_queue_split() which cuts an incoming
> request into two parts with the first no larger than get_max_io_size()
> bytes (which in the case of raid0, is the chunk size):
> 
>   R -> AB
>   
> blk_queue_split() gives the second part 'B' to generic_make_request()
> to worry about later and returns the first part 'A'.
> 
> md_make_request() then passes 'A' to a more specific request handler,
> In this case raid0_make_request().
> 
> raid0_make_request() cuts its incoming request into two parts at the
> next chunk boundary:
> 
> A -> ab
> 
> it then fixes up the device (chooses a physical device) for 'a', and
> gives both parts, separately, to generic make request()
> 
> This is where things go awry, because 'b' is still targetted to the
> original device (same as 'B'), but 'B' was queued before 'b'. So we
> end up with:
> 
>   R -> Bab
> 
> The outermost generic_make_request() then cuts 'B' at
> get_max_io_size(), and the process repeats. Ascii art follows:
> 
> 
>     /---------------------------------------------------/   incoming rq
> 
>     /--------/--------/--------/--------/--------/------/   max_io_size
>       
> |--------|--------|--------|--------|--------|--------|--------| chunks
> 
> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>       a    b  c     d  e     f  g     h  i     j  k     l
> 
> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
> 
> --
> 
> There are several potential fixes -
> 
> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> instead of chunk_size, so that raid0_make_request() receives the
> entire transfer length and cuts it up at chunk boundaries;
> 
> neatest is for raid0_make_request() to recognise that 'b' doesn't
> cross a chunk boundary so it can be sent directly to the physical
> device;
> 
> and correct is for blk_queue_split to requeue 'A' before 'B'.
> 
> --
> 
> There's also a second issue - with large raid0 chunk size (256K), the
> segments submitted to the physical device are at least 128K and
> trigger the early unplug code in blk_mq_make_request(), so the
> requests are never merged. There are legitimate reasons for a large
> chunk size so this seems unhelpful.
> 
> --
> 
> As I said, I'd appreciate your thoughts.
> 
> --
> 
> Roger
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dm-devel] Raid0 performance regression
  2022-01-23 18:00   ` Lukas Straub
@ 2022-01-23 21:34     ` Paul Menzel
  -1 siblings, 0 replies; 9+ messages in thread
From: Paul Menzel @ 2022-01-23 21:34 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid, Lukas Straub

Dear Roger,


Am 23.01.22 um 19:00 schrieb Lukas Straub:
> CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.
> 
> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:

>> we noticed a thirty percent drop in performance on one of our raid
>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like

For those outside the CentOS universe, what Linux kernel versions are those?

>> striping to balance (by time) access to a pair of hardware raid-6
>> arrays. The underlying issue is also present in the native raid0
>> driver so herewith the gory details; I'd appreciate your thoughts.
>>
>> --
>>
>> blkdev_direct_IO() calls submit_bio() which calls an outermost
>> generic_make_request() (aka submit_bio_noacct()).
>>
>> md_make_request() calls blk_queue_split() which cuts an incoming
>> request into two parts with the first no larger than get_max_io_size()
>> bytes (which in the case of raid0, is the chunk size):
>>
>>    R -> AB
>>    
>> blk_queue_split() gives the second part 'B' to generic_make_request()
>> to worry about later and returns the first part 'A'.
>>
>> md_make_request() then passes 'A' to a more specific request handler,
>> In this case raid0_make_request().
>>
>> raid0_make_request() cuts its incoming request into two parts at the
>> next chunk boundary:
>>
>> A -> ab
>>
>> it then fixes up the device (chooses a physical device) for 'a', and
>> gives both parts, separately, to generic make request()
>>
>> This is where things go awry, because 'b' is still targetted to the
>> original device (same as 'B'), but 'B' was queued before 'b'. So we
>> end up with:
>>
>>    R -> Bab
>>
>> The outermost generic_make_request() then cuts 'B' at
>> get_max_io_size(), and the process repeats. Ascii art follows:
>>
>>
>>      /---------------------------------------------------/   incoming rq
>>
>>      /--------/--------/--------/--------/--------/------/   max_io_size
>>        
>> |--------|--------|--------|--------|--------|--------|--------| chunks
>>
>> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>>        a    b  c     d  e     f  g     h  i     j  k     l
>>
>> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
>>
>> --
>>
>> There are several potential fixes -
>>
>> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
>> instead of chunk_size, so that raid0_make_request() receives the
>> entire transfer length and cuts it up at chunk boundaries;
>>
>> neatest is for raid0_make_request() to recognise that 'b' doesn't
>> cross a chunk boundary so it can be sent directly to the physical
>> device;
>>
>> and correct is for blk_queue_split to requeue 'A' before 'B'.
>>
>> --
>>
>> There's also a second issue - with large raid0 chunk size (256K), the
>> segments submitted to the physical device are at least 128K and
>> trigger the early unplug code in blk_mq_make_request(), so the
>> requests are never merged. There are legitimate reasons for a large
>> chunk size so this seems unhelpful.
>>
>> --
>>
>> As I said, I'd appreciate your thoughts.

Thank you for the report and the analysis.

Is the second issue also a regression? If not, I suggest splitting it
into a separate thread.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dm-devel] Raid0 performance regression
  2022-01-23 21:34     ` Paul Menzel
@ 2022-01-24 16:48       ` Roger Willcocks
  -1 siblings, 0 replies; 9+ messages in thread
From: Roger Willcocks @ 2022-01-24 16:48 UTC (permalink / raw)
  To: Paul Menzel; +Cc: linux-raid, Song Liu, dm-devel, Roger Willcocks, Lukas Straub



> On 23 Jan 2022, at 21:34, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> 
> Dear Roger,
> 
> 
> Am 23.01.22 um 19:00 schrieb Lukas Straub:
>> CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.
>> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:
> 
>>> we noticed a thirty percent drop in performance on one of our raid
>>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> 
> For those outside the CentOS universe, what Linux kernel versions are those?
> 

2.6.32 (and backported changes) and 4.18.0 (sim.)

>>> striping to balance (by time) access to a pair of hardware raid-6
>>> arrays. The underlying issue is also present in the native raid0
>>> driver so herewith the gory details; I'd appreciate your thoughts.
>>> 
>>> --
>>> 
>>> blkdev_direct_IO() calls submit_bio() which calls an outermost
>>> generic_make_request() (aka submit_bio_noacct()).
>>> 
>>> md_make_request() calls blk_queue_split() which cuts an incoming
>>> request into two parts with the first no larger than get_max_io_size()
>>> bytes (which in the case of raid0, is the chunk size):
>>> 
>>>   R -> AB
>>> 
>>> blk_queue_split() gives the second part 'B' to generic_make_request()
>>> to worry about later and returns the first part 'A'.
>>> 
>>> md_make_request() then passes 'A' to a more specific request handler,
>>> In this case raid0_make_request().
>>> 
>>> raid0_make_request() cuts its incoming request into two parts at the
>>> next chunk boundary:
>>> 
>>> A -> ab
>>> 
>>> it then fixes up the device (chooses a physical device) for 'a', and
>>> gives both parts, separately, to generic make request()
>>> 
>>> This is where things go awry, because 'b' is still targetted to the
>>> original device (same as 'B'), but 'B' was queued before 'b'. So we
>>> end up with:
>>> 
>>>   R -> Bab
>>> 
>>> The outermost generic_make_request() then cuts 'B' at
>>> get_max_io_size(), and the process repeats. Ascii art follows:
>>> 
>>> 
>>>     /---------------------------------------------------/   incoming rq
>>> 
>>>     /--------/--------/--------/--------/--------/------/   max_io_size
>>> 
>>> |--------|--------|--------|--------|--------|--------|--------| chunks
>>> 
>>> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>>>       a    b  c     d  e     f  g     h  i     j  k     l
>>> 
>>> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
>>> 
>>> --
>>> 
>>> There are several potential fixes -
>>> 
>>> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
>>> instead of chunk_size, so that raid0_make_request() receives the
>>> entire transfer length and cuts it up at chunk boundaries;
>>> 
>>> neatest is for raid0_make_request() to recognise that 'b' doesn't
>>> cross a chunk boundary so it can be sent directly to the physical
>>> device;
>>> 
>>> and correct is for blk_queue_split to requeue 'A' before 'B'.
>>> 
>>> --
>>> 
>>> There's also a second issue - with large raid0 chunk size (256K), the
>>> segments submitted to the physical device are at least 128K and
>>> trigger the early unplug code in blk_mq_make_request(), so the
>>> requests are never merged. There are legitimate reasons for a large
>>> chunk size so this seems unhelpful.
>>> 
>>> --
>>> 
>>> As I said, I'd appreciate your thoughts.
> 
> Thank you for the report and the analysis.
> 
> Is the second issue also a regression? If not, I suggest to split it into a separate thread.
> 

Yes, this is also a regression; both issues above have to be addressed to recover the
original performance.

Specifically, an md raid0 array with 256K chunk size interleaving two x 12-disk raid6
devices (Adaptec 3154 controller, 50MB files stored contiguously on disk, four threads)
can achieve a sequential read rate of 3800 MB/sec with the (very) old 6.5 kernel; this
falls to 2500 MB/sec with the relatively newer kernel.

This change to raid0.c:

-               blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+               blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

improves things somewhat: the sub-chunk requests are now submitted in order, but we
still only get 2800 MB/sec because no merging takes place; the controller struggles to
keep up with the large number of sub-chunk transfers. This additional change to
blk-mq.c:

-		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+		if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
 		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
 			blk_flush_plug_list(plug, false);

Brings performance back to 6.5 levels.


> 
> Kind regards,
> 
> Paul
> 


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dm-devel] Raid0 performance regression
  2022-01-24 16:48       ` Roger Willcocks
@ 2022-01-25  8:00         ` Song Liu
  -1 siblings, 0 replies; 9+ messages in thread
From: Song Liu @ 2022-01-25  8:00 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Paul Menzel, dm-devel, linux-raid, Lukas Straub

On Mon, Jan 24, 2022 at 8:49 AM Roger Willcocks <roger@filmlight.ltd.uk> wrote:
>
>
>
> > On 23 Jan 2022, at 21:34, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> >
> > Dear Roger,
> >
> >
> > Am 23.01.22 um 19:00 schrieb Lukas Straub:
> >> CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.
> >> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:
> >
> >>> we noticed a thirty percent drop in performance on one of our raid
> >>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> >
> > For those outside the CentOS universe, what Linux kernel versions are those?
> >
>
> 2.6.32 (and backported changes) and 4.18.0 (sim.)
>
> >>> striping to balance (by time) access to a pair of hardware raid-6
> >>> arrays. The underlying issue is also present in the native raid0
> >>> driver so herewith the gory details; I'd appreciate your thoughts.
> >>>
> >>> --
> >>>
> >>> blkdev_direct_IO() calls submit_bio() which calls an outermost
> >>> generic_make_request() (aka submit_bio_noacct()).
> >>>
> >>> md_make_request() calls blk_queue_split() which cuts an incoming
> >>> request into two parts with the first no larger than get_max_io_size()
> >>> bytes (which in the case of raid0, is the chunk size):
> >>>
> >>>   R -> AB
> >>> 
> >>> blk_queue_split() gives the second part 'B' to generic_make_request()
> >>> to worry about later and returns the first part 'A'.
> >>>
> >>> md_make_request() then passes 'A' to a more specific request handler,
> >>> In this case raid0_make_request().
> >>>
> >>> raid0_make_request() cuts its incoming request into two parts at the
> >>> next chunk boundary:
> >>>
> >>> A -> ab
> >>>
> >>> it then fixes up the device (chooses a physical device) for 'a', and
> >>> gives both parts, separately, to generic make request()
> >>>
> >>> This is where things go awry, because 'b' is still targetted to the
> >>> original device (same as 'B'), but 'B' was queued before 'b'. So we
> >>> end up with:
> >>>
> >>>   R -> Bab
> >>>
> >>> The outermost generic_make_request() then cuts 'B' at
> >>> get_max_io_size(), and the process repeats. Ascii art follows:
> >>>
> >>>
> >>>     /---------------------------------------------------/   incoming rq
> >>>
> >>>     /--------/--------/--------/--------/--------/------/   max_io_size
> >>> 
> >>> |--------|--------|--------|--------|--------|--------|--------| chunks
> >>>
> >>> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
> >>>       a    b  c     d  e     f  g     h  i     j  k     l
> >>>
> >>> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
> >>>
> >>> --
> >>>
> >>> There are several potential fixes -
> >>>
> >>> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> >>> instead of chunk_size, so that raid0_make_request() receives the
> >>> entire transfer length and cuts it up at chunk boundaries;
> >>>
> >>> neatest is for raid0_make_request() to recognise that 'b' doesn't
> >>> cross a chunk boundary so it can be sent directly to the physical
> >>> device;
> >>>
> >>> and correct is for blk_queue_split to requeue 'A' before 'B'.
> >>>
> >>> --
> >>>
> >>> There's also a second issue - with large raid0 chunk size (256K), the
> >>> segments submitted to the physical device are at least 128K and
> >>> trigger the early unplug code in blk_mq_make_request(), so the
> >>> requests are never merged. There are legitimate reasons for a large
> >>> chunk size so this seems unhelpful.
> >>>
> >>> --
> >>>
> >>> As I said, I'd appreciate your thoughts.
> >
> > Thank you for the report and the analysis.
> >
> > Is the second issue also a regression? If not, I suggest to split it into a separate thread.
> >
>
> Yes this is also a regression, both issues above have to be addressed to recover the
> original performance.
>
> Specifically, an md raid0 array with 256K chunk size interleaving two x 12-disk raid6
> devices (Adaptec 3154 controller, 50MB files stored contiguously on disk, four threads)
> can achieve a sequential read rate of 3800 MB/sec with the (very) old 6.5 kernel; this
> falls to 2500 MB/sec with the relatively newer kernel.
>
> This change to raid0.c:
>
> -               blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
> +              blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

I guess this is OK.

>
> improves things somewhat, the sub-chunk requests are now submitted in order but we
> still only get 2800 MB/sec because no merging takes place; the controller struggles to
> keep up with the large number of sub-chunk transfers. This additional change to
> blk-mq.c:
>
> -               if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> +               if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
>                     blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
>                         blk_flush_plug_list(plug, false);

We recently did some optimization for BLK_MAX_REQUEST_COUNT ([1] and some
follow up). We can probably do something similar for BLK_PLUG_FLUSH_SIZE.
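
For illustration, one speculative (and untested) shape for that,
assuming the plug->multiple_queues flag that newer kernels carry, would
be to skip the size-based flush whenever the plug already spans more
than one queue, as it does when md stripes over two devices:

		/* speculative, untested: only honour the size-based early
		 * flush when the plug holds requests for a single queue,
		 * so the sub-chunk requests a striped array emits
		 * back-to-back still get a chance to merge */
		if (request_count >= BLK_MAX_REQUEST_COUNT ||
		    (!plug->multiple_queues && last &&
		     blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

Whether that regresses the single-queue cases the flush was added for
would of course need measuring.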

Thanks,
Song

[1] https://lore.kernel.org/all/20210907230338.227903-1-songliubraving@fb.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread
