* raid0 vs. mkfs
@ 2016-11-27 15:24 Avi Kivity
  2016-11-27 17:09 ` Coly Li
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Avi Kivity @ 2016-11-27 15:24 UTC (permalink / raw)
  To: linux-raid

mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large 
disk that supports TRIM/DISCARD (erase whichever is inappropriate).  
That is because mkfs issues a TRIM/DISCARD (erase whichever is 
inappropriate) for the entire partition. As far as I can tell, md 
converts the large TRIM/DISCARD (erase whichever is inappropriate) into 
a large number of TRIM/DISCARD (erase whichever is inappropriate) 
requests, one per chunk-size worth of disk, and issues them to the RAID 
components individually.


It seems to me that md can convert the large TRIM/DISCARD (erase 
whichever is inappropriate) request it gets into one TRIM/DISCARD (erase 
whichever is inappropriate) per RAID component, converting an O(disk 
size / chunk size) operation into an O(number of RAID components) 
operation, which is much faster.


I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with 
the operation taking about a quarter of an hour, continuously pushing 
half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests 
to the disk. Linux 4.1.12.



* Re: raid0 vs. mkfs
  2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
@ 2016-11-27 17:09 ` Coly Li
  2016-11-27 17:25   ` Avi Kivity
  2016-11-28  4:11 ` Chris Murphy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Coly Li @ 2016-11-27 17:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-raid

On 2016/11/27 11:24 PM, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate). 
> That is because mkfs issues a TRIM/DISCARD (erase whichever is
> inappropriate) for the entire partition. As far as I can tell, md
> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
> a large number of TRIM/DISCARD (erase whichever is inappropriate)
> requests, one per chunk-size worth of disk, and issues them to the RAID
> components individually.
> 
> 
> It seems to me that md can convert the large TRIM/DISCARD (erase
> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
> whichever is inappropriate) per RAID component, converting an O(disk
> size / chunk size) operation into an O(number of RAID components)
> operation, which is much faster.
> 
> 
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
> to the disk. Linux 4.1.12.

It might be possible to improve DISCARD performance a bit with your
suggestion. The implementation might be tricky, but it is worth trying.

Indeed, it is not only for DISCARD; it might help read and write
performance as well. We can check the bio size: if
	bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
then on each underlying device we have more than SOMETHRESHOLD
contiguous chunks to issue, and they can be merged into a larger bio.
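
As a rough sketch (not actual md code), the check could look like the
following inside drivers/md/raid0.c; SOMETHRESHOLD and the helper name
are hypothetical, while bio_sectors() and conf->nr_strip_zones are the
existing names:

    /* Hypothetical helper sketching the check above -- not md code. */
    #define SOMETHRESHOLD 16    /* hypothetical tuning constant */

    static inline bool worth_rebuilding_per_device_bios(struct r0conf *conf,
                                                        struct bio *bio)
    {
        /* large enough that every member gets many contiguous chunks */
        return bio_sectors(bio) / conf->nr_strip_zones >= SOMETHRESHOLD;
    }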

IMHO it's interesting, good suggestion!

Coly



* Re: raid0 vs. mkfs
  2016-11-27 17:09 ` Coly Li
@ 2016-11-27 17:25   ` Avi Kivity
  2016-11-27 19:25     ` Doug Dumitru
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2016-11-27 17:25 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid

On 11/27/2016 07:09 PM, Coly Li wrote:
> On 2016/11/27 11:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the RAID
>> components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>> whichever is inappropriate) per RAID component, converting an O(disk
>> size / chunk size) operation into an O(number of RAID components)
>> operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>> the operation taking about a quarter of an hour, continuously pushing
>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>> to the disk. Linux 4.1.12.
> It might be possible to improve a bit for DISCARD performance, by your
> suggestion. The implementation might be tricky, but it is worthy to try.
>
> Indeed, it is not only for DISCARD, for read or write, it might be
> helpful for better performance as well. We can check the bio size, if,
> 	bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
> it means on each underlying device, we have more then SOMETHRESHOLD
> continuous chunks to issue, and they can be merged into a larger bio.

It's true that this does not strictly apply to TRIM/DISCARD (erase 
whichever is inappropriate), but to see any gain for READ/WRITE, you 
need a request that is larger than (chunk size) * (raid elements), which 
is unlikely for reasonable values of those parameters.  But a common 
implementation can of course work for multiple request types.

> IMHO it's interesting, good suggestion!

Looking forward to seeing an implementation!

>
> Coly
>



* Re: raid0 vs. mkfs
  2016-11-27 17:25   ` Avi Kivity
@ 2016-11-27 19:25     ` Doug Dumitru
  0 siblings, 0 replies; 31+ messages in thread
From: Doug Dumitru @ 2016-11-27 19:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Coly Li, linux-raid

I recently ran into this issue with a proprietary device mapper target
that supports discard.

mkfs.ext4 looks like it issues 2GB discard requests.  blkdiscard looks
like it issues 4GB-4K discard requests.  Both of these are "way
bigger" than the iolimits for transfers.

At least this is what I see at my device mapper layer.  raid0 might
get some additional filtering by the common raid code.

In the case of my mapper, I actually need to split the bio up and
re-issue the discards at iolimits sizes (this is how my device mapper
expects requests).  Fortunately, my mapper is really fast at discards
even at 1MB each (> 8GB/sec on a single thread), so the performance
issue is not that bad.

It would be an easy patch for raid0 to be "smarter" at splitting the
discard request.  It might not actually help that much.  You should
test your NVMe disk to see if the performance of discards is much
different between "chunk size" requests and "big requests".  Using
blkdiscard in a script, fill a drive with real data and test discard
speed first using 256K calls to blkdiscard, and then again using 512MB
calls to blkdiscard.  Do this to a single drive.  I suspect that the
times will not be that far off.
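
For what it's worth, here is a minimal C sketch of such a timing test,
issuing BLKDISCARD ioctls directly (the same ioctl the blkdiscard tool
uses).  It is only an illustration and it destroys data on the target
device; the device path and the chunk/total byte counts are whatever you
pass in (they must be multiples of 512):

    /* discard_bench.c -- WARNING: discards (destroys) data on <dev>. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>        /* BLKDISCARD */

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <dev> <chunk-bytes> <total-bytes>\n",
                    argv[0]);
            return 1;
        }
        uint64_t chunk = strtoull(argv[2], NULL, 0);
        uint64_t total = strtoull(argv[3], NULL, 0);
        int fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t off = 0; off + chunk <= total; off += chunk) {
            uint64_t range[2] = { off, chunk };  /* byte offset, byte length */
            if (ioctl(fd, BLKDISCARD, range) < 0) {
                perror("BLKDISCARD");
                return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%llu bytes in %llu-byte discards: %.3f s\n",
               (unsigned long long)total, (unsigned long long)chunk,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
    }

Run it once with a small chunk size (e.g. 262144) and once with a large
one (e.g. 536870912) against the same single drive and compare the times.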

Some drives take a real amount of time to process discards.  Even
though it seems like the operation does nothing, the FTL inside of the
SSD is still getting hammered pretty hard.

If your drives are a "lot" faster with bigger discard requests, then
maybe it would make sense to optimize raid0.  I suspect the win is not
that big.

In terms of enlarging IO, the iolimits and buffering start to come
into play.  With a discard, the bio only has a size and does not have
any actual buffers.  If you push IO really big, then the size of the
bio starts to grow.  1MB is 256 4K biovecs.  A bio_vec is a pointer
plus two ints, so it is 16 bytes long (on x86_64).  256 of these just
happen to fit into a single page.  This is a linear array, so making
this bigger is hard.  Remember that much of the kernel lives inside of
pages and pages (usually 4K) are somewhat of a deity over the entire
kernel.
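
For reference, the arithmetic being described, using a struct that mirrors
the layout of struct bio_vec on x86_64 (renamed here to make clear it is
only an illustration):

    struct bio_vec_layout {          /* mirrors struct bio_vec */
        void         *bv_page;       /* pointer to the page:    8 bytes */
        unsigned int  bv_len;        /* length within the page: 4 bytes */
        unsigned int  bv_offset;     /* offset within the page: 4 bytes */
    };                               /* 16 bytes total */
    /* 4096 / 16 = 256 biovecs fit in one page, and 256 biovecs of 4K
     * each describe exactly 1MB of data. */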

Then again, you have another option to format your array that will be
very fast and even more effective:

a) secure erase the drives
b) create your raid0 array
c) create your file system with -o nodiscard

Doug Dumitru

On Sun, Nov 27, 2016 at 9:25 AM, Avi Kivity <avi@scylladb.com> wrote:
> On 11/27/2016 07:09 PM, Coly Li wrote:
>>
>> On 2016/11/27 11:24 PM, Avi Kivity wrote:
>>>
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>> inappropriate) for the entire partition. As far as I can tell, md
>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>> components individually.
>>>
>>>
>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>> whichever is inappropriate) per RAID component, converting an O(disk
>>> size / chunk size) operation into an O(number of RAID components)
>>> operation, which is much faster.
>>>
>>>
>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>> the operation taking about a quarter of an hour, continuously pushing
>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>> to the disk. Linux 4.1.12.
>>
>> It might be possible to improve a bit for DISCARD performance, by your
>> suggestion. The implementation might be tricky, but it is worthy to try.
>>
>> Indeed, it is not only for DISCARD, for read or write, it might be
>> helpful for better performance as well. We can check the bio size, if,
>>         bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
>> it means on each underlying device, we have more then SOMETHRESHOLD
>> continuous chunks to issue, and they can be merged into a larger bio.
>
>
> It's true that this does not strictly apply to TRIM/DISCARD (erase whichever
> is inappropriate), but to see any gain for READ/WRITE, you need a request
> that is larger than (chunk size) * (raid elements), which is unlikely for
> reasonable values of those parameters.  But a common implementation can of
> course work for multiple request types.
>
>> IMHO it's interesting, good suggestion!
>
>
> Looking forward to seeing an implementation!
>
>
>>
>> Coly
>>
>



-- 
Doug Dumitru
EasyCo LLC


* Re: raid0 vs. mkfs
  2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
  2016-11-27 17:09 ` Coly Li
@ 2016-11-28  4:11 ` Chris Murphy
  2016-11-28  7:28   ` Avi Kivity
  2016-11-28  5:09 ` NeilBrown
  2017-01-22 18:01 ` Avi Kivity
  3 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-11-28  4:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Linux-RAID

On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
> that supports TRIM/DISCARD (erase whichever is inappropriate)

Trim is the appropriate term. The term discard refers to a specific mount
time implementation of the FITRIM ioctl, and fstrim refers to a user space
tool that does the same and can be scheduled or issued manually.



>   That is
> because mkfs issues a TRIM/DISCARD (erase whichever is inappropriate) for
> the entire partition. As far as I can tell, md converts the large
> TRIM/DISCARD (erase whichever is inappropriate) into a large number of
> TRIM/DISCARD (erase whichever is inappropriate) requests, one per chunk-size
> worth of disk, and issues them to the RAID components individually.

You could strace the mkfs command. Each filesystem is doing it a
little differently the last time I compared mkfs.xfs and mkfs.btrfs;
but I can't qualify the differences relative to how the device is
going to react to those commands.

It's also possible to enable block device tracing and see the actual
SCSI or ATA commands sent to a drive.

There's a metric f tonne of bugs in this area so before anything I'd
consider researching if there's a firmware update for your hardware
and applying that and retesting. And then also after testing your
ideal deployed version, use something much closer to upstream (Arch or
Fedora) and see if the problem is reproducible.



-- 
Chris Murphy


* Re: raid0 vs. mkfs
  2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
  2016-11-27 17:09 ` Coly Li
  2016-11-28  4:11 ` Chris Murphy
@ 2016-11-28  5:09 ` NeilBrown
  2016-11-28  6:08   ` Shaohua Li
  2016-11-28  7:38   ` Avi Kivity
  2017-01-22 18:01 ` Avi Kivity
  3 siblings, 2 replies; 31+ messages in thread
From: NeilBrown @ 2016-11-28  5:09 UTC (permalink / raw)
  To: Avi Kivity, linux-raid


On Mon, Nov 28 2016, Avi Kivity wrote:

> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large 
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).  
> That is because mkfs issues a TRIM/DISCARD (erase whichever is 
> inappropriate) for the entire partition. As far as I can tell, md 
> converts the large TRIM/DISCARD (erase whichever is inappropriate) into 
> a large number of TRIM/DISCARD (erase whichever is inappropriate) 
> requests, one per chunk-size worth of disk, and issues them to the RAID 
> components individually.
>
>
> It seems to me that md can convert the large TRIM/DISCARD (erase 
> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase 
> whichever is inappropriate) per RAID component, converting an O(disk 
> size / chunk size) operation into an O(number of RAID components) 
> operation, which is much faster.
>
>
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with 
> the operation taking about a quarter of an hour, continuously pushing 
> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests 
> to the disk. Linux 4.1.12.

Surely it is the task of the underlying driver, or the queuing
infrastructure, to merge small requests into large requests.
Why should this have anything do to with RAID0?

NeilBrown



* Re: raid0 vs. mkfs
  2016-11-28  5:09 ` NeilBrown
@ 2016-11-28  6:08   ` Shaohua Li
  2016-11-28  7:38   ` Avi Kivity
  1 sibling, 0 replies; 31+ messages in thread
From: Shaohua Li @ 2016-11-28  6:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: Avi Kivity, linux-raid

On Mon, Nov 28, 2016 at 04:09:46PM +1100, Neil Brown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
> 
> > mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large 
> > disk that supports TRIM/DISCARD (erase whichever is inappropriate).  
> > That is because mkfs issues a TRIM/DISCARD (erase whichever is 
> > inappropriate) for the entire partition. As far as I can tell, md 
> > converts the large TRIM/DISCARD (erase whichever is inappropriate) into 
> > a large number of TRIM/DISCARD (erase whichever is inappropriate) 
> > requests, one per chunk-size worth of disk, and issues them to the RAID 
> > components individually.
> >
> >
> > It seems to me that md can convert the large TRIM/DISCARD (erase 
> > whichever is inappropriate) request it gets into one TRIM/DISCARD (erase 
> > whichever is inappropriate) per RAID component, converting an O(disk 
> > size / chunk size) operation into an O(number of RAID components) 
> > operation, which is much faster.
> >
> >
> > I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with 
> > the operation taking about a quarter of an hour, continuously pushing 
> > half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests 
> > to the disk. Linux 4.1.12.
> 
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.
> Why should this have anything do to with RAID0?

That's what I'm wondering too. The block layer should merge the small requests
into larger ones, and it does, IIRC. There is one bug related to this (commit
9c573de3283af007e), but it only applies to 4.3+.

Thanks,
Shaohua


* Re: raid0 vs. mkfs
  2016-11-28  4:11 ` Chris Murphy
@ 2016-11-28  7:28   ` Avi Kivity
  2016-11-28  7:33     ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2016-11-28  7:28 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID

On 11/28/2016 06:11 AM, Chris Murphy wrote:
> On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
>> that supports TRIM/DISCARD (erase whichever is inappropriate)
> Trim is the appropriate term. Term discard refers to a specific mount
> time implementation of FITRIM ioctl, and fstrim refers to a user space
> tool that does the same and can be scheduled or issued manually.

That's good to know.

>
>
>    That is
>> because mkfs issues a TRIM/DISCARD (erase whichever is inappropriate) for
>> the entire partition. As far as I can tell, md converts the large
>> TRIM/DISCARD (erase whichever is inappropriate) into a large number of
>> TRIM/DISCARD (erase whichever is inappropriate) requests, one per chunk-size
>> worth of disk, and issues them to the RAID components individually.
> You could strace the mkfs command.

I did, and saw that it was running a single syscall for the entire run.  
I verified in the sources that mkfs.xfs issues a single BLKDISCARD (?!) 
ioctl spanning the entire device.

> Each filesystem is doing it a
> little differently the last time I compared mkfs.xfs and mkfs.btrfs;
> but I can't qualify the differences relative to how the device is
> going to react to those commands.
>
> It's also possible to enable block device tracing and see the actual
> SCSI or ATA commands sent to a drive.

I did, and saw a ton of half-megabyte TRIMs.  It's an NVMe device so not 
SCSI or SATA.


Here's a sample (I only blktraced one of the members):

259,1   10     1090     0.379688898  4801  Q   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1091     0.379689222  4801  G   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1092     0.379690304  4801  I   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1093     0.379703110  2307  D   D 3238067200 + 1024 
[kworker/10:1H]
259,1    1      589     0.379718918     0  C   D 3231849472 + 1024 [0]
259,1   10     1094     0.379735215  4801  Q   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1095     0.379735548  4801  G   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1096     0.379736598  4801  I   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1097     0.379753077  2307  D   D 3238068224 + 1024 
[kworker/10:1H]
259,1    1      590     0.379782139     0  C   D 3231850496 + 1024 [0]
259,1   10     1098     0.379785399  4801  Q   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1099     0.379785657  4801  G   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1100     0.379786562  4801  I   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1101     0.379800116  2307  D   D 3238069248 + 1024 
[kworker/10:1H]
259,1   10     1102     0.379829822  4801  Q   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1103     0.379830156  4801  G   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1104     0.379831015  4801  I   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1105     0.379844120  2307  D   D 3238070272 + 1024 
[kworker/10:1H]
259,1   10     1106     0.379877825  4801  Q   D 3238071296 + 1024 
[mkfs.xfs]
259,1   10     1107     0.379878173  4801  G   D 3238071296 + 1024 
[mkfs.xfs]
259,1   10     1108     0.379879028  4801  I   D 3238071296 + 1024 
[mkfs.xfs]
259,1    1      591     0.379886451     0  C   D 3231851520 + 1024 [0]
259,1   10     1109     0.379898178  2307  D   D 3238071296 + 1024 
[kworker/10:1H]
259,1   10     1110     0.379923982  4801  Q   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1111     0.379924229  4801  G   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1112     0.379925054  4801  I   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1113     0.379937716  2307  D   D 3238072320 + 1024 
[kworker/10:1H]
259,1    1      592     0.379954380     0  C   D 3231852544 + 1024 [0]
259,1   10     1114     0.379970091  4801  Q   D 3238073344 + 1024 
[mkfs.xfs]
259,1   10     1115     0.379970341  4801  G   D 3238073344 + 1024 
[mkfs.xfs]
259,1   10     1116     0.379971260  4801  I   D 3238073344 + 1024 
[mkfs.xfs]
259,1   10     1117     0.379984303  2307  D   D 3238073344 + 1024 
[kworker/10:1H]
259,1   10     1118     0.380014754  4801  Q   D 3238074368 + 1024 
[mkfs.xfs]
259,1   10     1119     0.380015075  4801  G   D 3238074368 + 1024 
[mkfs.xfs]
259,1   10     1120     0.380015903  4801  I   D 3238074368 + 1024 
[mkfs.xfs]
259,1   10     1121     0.380028655  2307  D   D 3238074368 + 1024 
[kworker/10:1H]
259,1    2      170     0.380054279     0  C   D 3218706432 + 1024 [0]
259,1   10     1122     0.380060773  4801  Q   D 3238075392 + 1024 
[mkfs.xfs]
259,1   10     1123     0.380061024  4801  G   D 3238075392 + 1024 
[mkfs.xfs]
259,1   10     1124     0.380062093  4801  I   D 3238075392 + 1024 
[mkfs.xfs]
259,1   10     1125     0.380072940  2307  D   D 3238075392 + 1024 
[kworker/10:1H]
259,1   10     1126     0.380107437  4801  Q   D 3238076416 + 1024 
[mkfs.xfs]
259,1   10     1127     0.380107882  4801  G   D 3238076416 + 1024 
[mkfs.xfs]
259,1   10     1128     0.380109258  4801  I   D 3238076416 + 1024 
[mkfs.xfs]
259,1   10     1129     0.380123914  2307  D   D 3238076416 + 1024 
[kworker/10:1H]
259,1    2      171     0.380130823     0  C   D 3218707456 + 1024 [0]
259,1   10     1130     0.380156971  4801  Q   D 3238077440 + 1024 
[mkfs.xfs]
259,1   10     1131     0.380157308  4801  G   D 3238077440 + 1024 
[mkfs.xfs]
259,1   10     1132     0.380158354  4801  I   D 3238077440 + 1024 
[mkfs.xfs]
259,1   10     1133     0.380168948  2307  D   D 3238077440 + 1024 
[kworker/10:1H]
259,1    2      172     0.380186647     0  C   D 3218708480 + 1024 [0]
259,1   10     1134     0.380197495  4801  Q   D 3238078464 + 1024 
[mkfs.xfs]
259,1   10     1135     0.380197848  4801  G   D 3238078464 + 1024 
[mkfs.xfs]
259,1   10     1136     0.380198724  4801  I   D 3238078464 + 1024 
[mkfs.xfs]
259,1   10     1137     0.380202964  2307  D   D 3238078464 + 1024 
[kworker/10:1H]
259,1   10     1138     0.380237133  4801  Q   D 3238079488 + 1024 
[mkfs.xfs]
259,1   10     1139     0.380237393  4801  G   D 3238079488 + 1024 
[mkfs.xfs]
259,1   10     1140     0.380238333  4801  I   D 3238079488 + 1024 
[mkfs.xfs]
259,1   10     1141     0.380252580  2307  D   D 3238079488 + 1024 
[kworker/10:1H]
259,1    2      173     0.380260605     0  C   D 3218709504 + 1024 [0]
259,1   10     1142     0.380283800  4801  Q   D 3238080512 + 1024 
[mkfs.xfs]
259,1   10     1143     0.380284158  4801  G   D 3238080512 + 1024 
[mkfs.xfs]
259,1   10     1144     0.380285150  4801  I   D 3238080512 + 1024 
[mkfs.xfs]
259,1   10     1145     0.380297127  2307  D   D 3238080512 + 1024 
[kworker/10:1H]
259,1   10     1146     0.380324340  4801  Q   D 3238081536 + 1024 
[mkfs.xfs]
259,1   10     1147     0.380324648  4801  G   D 3238081536 + 1024 
[mkfs.xfs]
259,1   10     1148     0.380325663  4801  I   D 3238081536 + 1024 
[mkfs.xfs]
259,1    2      174     0.380328083     0  C   D 3218710528 + 1024 [0]


So we see these half-megabyte requests (1024 sectors each); moreover, they are issued 
sequentially.


> There's a metric f tonne of bugs in this area so before anything I'd
> consider researching if there's a firmware update for your hardware
> and applying that and retesting.

I don't have access to that machine any more (I could get some with a 
bit of trouble).  But I think it's clear from the traces that the 
problem is in the RAID layer?

>   And then also after testing your
> ideal deployed version, use something much close to upstream (Arch or
> Fedora) and see if the problem is reproducible.

I'm hoping the RAID maintainers can confirm at a glance whether the 
problem exists or not; it doesn't look like a minor glitch but simply 
that this code path doesn't take the issue into account.



* Re: raid0 vs. mkfs
  2016-11-28  7:28   ` Avi Kivity
@ 2016-11-28  7:33     ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2016-11-28  7:33 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID



On 11/28/2016 09:28 AM, Avi Kivity wrote:
> 259,1   10     1140     0.380238333  4801  I   D 3238079488 + 1024 
> [mkfs.xfs]
> 259,1   10     1141     0.380252580  2307  D   D 3238079488 + 1024 
> [kworker/10:1H]
> 259,1    2      173     0.380260605     0  C   D 3218709504 + 1024 [0]
> 259,1   10     1142     0.380283800  4801  Q   D 3238080512 + 1024 
> [mkfs.xfs]
> 259,1   10     1143     0.380284158  4801  G   D 3238080512 + 1024 
> [mkfs.xfs]
> 259,1   10     1144     0.380285150  4801  I   D 3238080512 + 1024 
> [mkfs.xfs]
> 259,1   10     1145     0.380297127  2307  D   D 3238080512 + 1024 
> [kworker/10:1H]
> 259,1   10     1146     0.380324340  4801  Q   D 3238081536 + 1024 
> [mkfs.xfs]
> 259,1   10     1147     0.380324648  4801  G   D 3238081536 + 1024 
> [mkfs.xfs]
> 259,1   10     1148     0.380325663  4801  I   D 3238081536 + 1024 
> [mkfs.xfs]
> 259,1    2      174     0.380328083     0  C   D 3218710528 + 1024 [0]
>
>
> So we see these half-megabyte requests (1024 sectors each); moreover, they are issued 
> sequentially.
>

Correction, they are issued in parallel, but not merged.  The "C" 
entries correspond to requests issued far in the past.



* Re: raid0 vs. mkfs
  2016-11-28  5:09 ` NeilBrown
  2016-11-28  6:08   ` Shaohua Li
@ 2016-11-28  7:38   ` Avi Kivity
  2016-11-28  8:40     ` NeilBrown
  1 sibling, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2016-11-28  7:38 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 11/28/2016 07:09 AM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the RAID
>> components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>> whichever is inappropriate) per RAID component, converting an O(disk
>> size / chunk size) operation into an O(number of RAID components)
>> operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>> the operation taking about a quarter of an hour, continuously pushing
>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>> to the disk. Linux 4.1.12.
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.

Here's a blkparse of that run.  As can be seen, there is no concurrency, 
so nobody down the stack has any chance of merging anything.

259,1   10     1090     0.379688898  4801  Q   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1091     0.379689222  4801  G   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1092     0.379690304  4801  I   D 3238067200 + 1024 
[mkfs.xfs]
259,1   10     1093     0.379703110  2307  D   D 3238067200 + 1024 
[kworker/10:1H]
259,1    1      589     0.379718918     0  C   D 3231849472 + 1024 [0]
259,1   10     1094     0.379735215  4801  Q   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1095     0.379735548  4801  G   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1096     0.379736598  4801  I   D 3238068224 + 1024 
[mkfs.xfs]
259,1   10     1097     0.379753077  2307  D   D 3238068224 + 1024 
[kworker/10:1H]
259,1    1      590     0.379782139     0  C   D 3231850496 + 1024 [0]
259,1   10     1098     0.379785399  4801  Q   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1099     0.379785657  4801  G   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1100     0.379786562  4801  I   D 3238069248 + 1024 
[mkfs.xfs]
259,1   10     1101     0.379800116  2307  D   D 3238069248 + 1024 
[kworker/10:1H]
259,1   10     1102     0.379829822  4801  Q   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1103     0.379830156  4801  G   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1104     0.379831015  4801  I   D 3238070272 + 1024 
[mkfs.xfs]
259,1   10     1105     0.379844120  2307  D   D 3238070272 + 1024 
[kworker/10:1H]
259,1   10     1106     0.379877825  4801  Q   D 3238071296 + 1024 
[mkfs.xfs]
259,1   10     1107     0.379878173  4801  G   D 3238071296 + 1024 
[mkfs.xfs]
259,1   10     1108     0.379879028  4801  I   D 3238071296 + 1024 
[mkfs.xfs]
259,1    1      591     0.379886451     0  C   D 3231851520 + 1024 [0]
259,1   10     1109     0.379898178  2307  D   D 3238071296 + 1024 
[kworker/10:1H]
259,1   10     1110     0.379923982  4801  Q   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1111     0.379924229  4801  G   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1112     0.379925054  4801  I   D 3238072320 + 1024 
[mkfs.xfs]
259,1   10     1113     0.379937716  2307  D   D 3238072320 + 1024 
[kworker/10:1H]
259,1    1      592     0.379954380     0  C   D 3231852544 + 1024 [0]
259,1   10     1114     0.379970091  4801  Q   D 3238073344 + 1024 
[mkfs.xfs]
259,1   10     1115     0.379970341  4801  G   D 3238073344 + 1024 
[mkfs.xfs]


No merging was happening.  This is an NVMe drive, so running with the 
noop scheduler (which should still merge).   Does the queuing layer 
merge trims?

I don't think it's the queuing layer's job, though.  At the I/O 
scheduler you can merge to clean up sloppy patterns from the upper 
layer, but each layer should try to generate the best pattern it can.  
Large merges mean increased latency for the first request in the chain, 
forcing the I/O scheduler to make a decision which can harm the 
workload.  By generating merged requests in the first place, the upper 
layer removes the need to make that tradeoff (splitting the requests 
removes information: "we are interested only in when all of the range is 
trimmed, not any particular request").



* Re: raid0 vs. mkfs
  2016-11-28  7:38   ` Avi Kivity
@ 2016-11-28  8:40     ` NeilBrown
  2016-11-28  8:58       ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-11-28  8:40 UTC (permalink / raw)
  To: Avi Kivity, linux-raid


On Mon, Nov 28 2016, Avi Kivity wrote:

> On 11/28/2016 07:09 AM, NeilBrown wrote:
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>> inappropriate) for the entire partition. As far as I can tell, md
>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>> components individually.
>>>
>>>
>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>> whichever is inappropriate) per RAID component, converting an O(disk
>>> size / chunk size) operation into an O(number of RAID components)
>>> operation, which is much faster.
>>>
>>>
>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>> the operation taking about a quarter of an hour, continuously pushing
>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>> to the disk. Linux 4.1.12.
>> Surely it is the task of the underlying driver, or the queuing
>> infrastructure, to merge small requests into large requests.
>
> Here's a blkparse of that run.  As can be seen, there is no concurrency, 
> so nobody down the stack has any chance of merging anything.

That isn't a valid conclusion to draw.
raid0 effectively calls the make_request_fn function that is registered
by the underlying driver.
If that function handles DISCARD synchronously, then you won't see any
concurrency, and that is because the driver chose not to queue but to
handle directly.
I don't know if it actually does this though.  I don't know the insides
of the NVMe driver .... there seems to be a lightnvm thing and a scsi
thing and a pci thing and it all confuses me.

>
>
> No merging was happening.  This is an NVMe drive, so running with the 
> noop scheduler (which should still merge).   Does the queuing layer 
> merge trims?

I wish I knew.  I once thought I understood about half of the block
queuing code, but now with multi-queue, I'll need to learn it all
again. :-(

>
> I don't think it's the queuing layer's job, though.  At the I/O 
> scheduler you can merge to clean up sloppy patterns from the upper 
> layer, but each layer should try to generate the best pattern it can.  

Why?  How does it know what is best for the layer below?

> Large merges mean increased latency for the first request in the chain, 
> forcing the I/O scheduler to make a decision which can harm the 
> workload.  By generating merged requests in the first place, the upper 
> layer removes the need to make that tradeoff (splitting the requests 
> removes information: "we are interested only in when all of the range is 
> trimmed, not any particular request").

If it is easy for the upper layer to break a very large request into a
few very large requests, then I wouldn't necessarily object.
But unless it is very hard for the lower layer to merge requests, it
should be doing that too.
When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the
make_request_fn, it performs REQ_OP_DISCARD synchronously.  I would
suggest that is a very poor design.  I don't know if that is affecting
you (though a printk would find out).

NeilBrown



* Re: raid0 vs. mkfs
  2016-11-28  8:40     ` NeilBrown
@ 2016-11-28  8:58       ` Avi Kivity
  2016-11-28  9:00         ` Christoph Hellwig
  2016-11-29 21:14         ` NeilBrown
  0 siblings, 2 replies; 31+ messages in thread
From: Avi Kivity @ 2016-11-28  8:58 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 11/28/2016 10:40 AM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>
>> On 11/28/2016 07:09 AM, NeilBrown wrote:
>>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>>
>>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>>> inappropriate) for the entire partition. As far as I can tell, md
>>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>>> components individually.
>>>>
>>>>
>>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>>> whichever is inappropriate) per RAID component, converting an O(disk
>>>> size / chunk size) operation into an O(number of RAID components)
>>>> operation, which is much faster.
>>>>
>>>>
>>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>>> the operation taking about a quarter of an hour, continuously pushing
>>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>>> to the disk. Linux 4.1.12.
>>> Surely it is the task of the underlying driver, or the queuing
>>> infrastructure, to merge small requests into large requests.
>> Here's a blkparse of that run.  As can be seen, there is no concurrency,
>> so nobody down the stack has any chance of merging anything.
> That isn't a valid conclusion to draw.

It's actually wrong, as I noted later.  The raid layer issues concurrent 
requests, but they are not merged.

> raid0 effectively calls the make_request_fn function that is registered
> by the underlying driver.
> If that function handles DISCARD synchronously, then you won't see any
> concurrency, and that is because the driver chose not to queue but to
> handle directly.
> I don't know if it actually does this though.  I don't know the insides
> of the nmve driver .... there seems to be a lightnvm thing and a scsi
> thing and a pci thing and it all confuses me.

NVMe only has queued TRIMs.  Of course, the driver could be issuing them 
synchronously, but that's doubtful, since NVMe is such a simple protocol.

What I guess is happening is that since the NVMe queue depth is so high, 
any request the driver receives is sent immediately to the disk and 
there is nothing to merge it to.  That could indicate the absence of 
plugging, or just a reluctance to merge TRIMs.
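
(For reference, plugging here means the blk_start_plug()/blk_finish_plug()
mechanism; a rough sketch of how a submitter uses it, with the loop body
as a hypothetical placeholder.  While the plug is held, submitted bios sit
in a per-task list and the block layer gets a chance to merge them before
they are dispatched.)

    struct blk_plug plug;

    blk_start_plug(&plug);
    while (more_discard_ranges())           /* hypothetical */
        submit_bio(next_discard_bio());     /* hypothetical */
    blk_finish_plug(&plug);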

>
>>
>> No merging was happening.  This is an NVMe drive, so running with the
>> noop scheduler (which should still merge).   Does the queuing layer
>> merge trims?
> I wish I knew.  I once thought I understood about half of the block
> queuing code, but now with multi-queue, I'll need to learn it all
> again. :-(
>
>> I don't think it's the queuing layer's job, though.  At the I/O
>> scheduler you can merge to clean up sloppy patterns from the upper
>> layer, but each layer should try to generate the best pattern it can.
> Why?  How does it know what is best for the layer below?

A large request can be split at the lower layer without harming the 
upper layer (beyond the effort involved).  Merging many small requests 
at the lower layer can harm the upper layer, if it depends on the 
latency of individual requests.

In general in software engineering, when you destroy information, there 
is a potential for inefficiency.  In this case the information destroyed 
is that we are only interested in when all of the range is TRIMmed, not any 
particular subrange, and the potential for inefficiency is indeed realized.

>
>> Large merges mean increased latency for the first request in the chain,
>> forcing the I/O scheduler to make a decision which can harm the
>> workload.  By generating merged requests in the first place, the upper
>> layer removes the need to make that tradeoff (splitting the requests
>> removes information: "we are interested only in when all of the range is
>> trimmed, not any particular request").
> If it is easy for the upper layer to break a very large request into a
> few very large requests, then I wouldn't necessarily object.

I can't see why it would be hard.  It's simple arithmetic.

> But unless it is very hard for the lower layer to merge requests, it
> should be doing that too.

Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make 
the latency of request R1 the sum of the latencies of R1..Rn.  You may gain 
some efficiency in the process, but that's not going to make up for a 
factor of n.  The queuing layer has no way to tell whether the caller is 
interested in the latency of individual requests.  By sending large 
requests, the caller indicates it's not interested in the latency of 
individual subranges.  The queuing layer is still free to internally 
split the request to smaller ranges, to satisfy hardware constraints, or 
to reduce worst-case latencies for competing request streams.

So I disagree that all the work should be pushed to the merging layer.  
It has less information to work with, so the fewer decisions it has to 
make, the better.

> When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the

This does not seem to be an NVMe driver.

> make_request_fn, it performs REQ_OP_DISCARD synchronously.  I would
> suggest that is a very poor design.  I don't know if that is affecting
> you (though a printk would find out).
>
> NeilBrown



* Re: raid0 vs. mkfs
  2016-11-28  8:58       ` Avi Kivity
@ 2016-11-28  9:00         ` Christoph Hellwig
  2016-11-28  9:11           ` Avi Kivity
  2016-11-29 21:14         ` NeilBrown
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2016-11-28  9:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: NeilBrown, linux-raid

On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
> What I guess is happening is that since the NVMe queue depth is so high, and
> request the driver receives is sent immediately to the disk and there is
> nothing to merge it to.  That could indicate the absence of plugging, or
> just a reluctance to merge TRIMs.

That is exactly the case, and it's also an issue with online trim.
I have work in progress block layer trim patches that always plug trims
and have various TRIM merge improvements including support for ranged
TRIMs.  It needs a bit more work, but I hope I can post it later
this week.


* Re: raid0 vs. mkfs
  2016-11-28  9:00         ` Christoph Hellwig
@ 2016-11-28  9:11           ` Avi Kivity
  2016-11-28  9:15             ` Coly Li
  2016-11-28 17:47             ` Shaohua Li
  0 siblings, 2 replies; 31+ messages in thread
From: Avi Kivity @ 2016-11-28  9:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: NeilBrown, linux-raid

On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
> On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
>> What I guess is happening is that since the NVMe queue depth is so high, and
>> request the driver receives is sent immediately to the disk and there is
>> nothing to merge it to.  That could indicate the absence of plugging, or
>> just a reluctance to merge TRIMs.
> That is exactly the case, and it's also an issue with online trim.
> I have work in progress block layer trim patches that always plug trims
> and have various TRIM merge improvements including support for ranged
> TRIMs.  It needs a bit more work, but I hope I can post it later
> this week.

Great, good to know.

I still think it should also be fixed in the RAID layer.  There's no 
reason to break a single request into millions of smaller ones, then try 
to merge them back into one request again.  The queuing layer can merge 
when it's given bad patterns from uncontrolled sources, not as an excuse 
to generate bad patterns from within the kernel.


* Re: raid0 vs. mkfs
  2016-11-28  9:11           ` Avi Kivity
@ 2016-11-28  9:15             ` Coly Li
  2016-11-28 17:47             ` Shaohua Li
  1 sibling, 0 replies; 31+ messages in thread
From: Coly Li @ 2016-11-28  9:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid

On 2016/11/28 5:11 PM, Avi Kivity wrote:
> On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
>> On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
>>> What I guess is happening is that since the NVMe queue depth is so
>>> high, and
>>> request the driver receives is sent immediately to the disk and there is
>>> nothing to merge it to.  That could indicate the absence of plugging, or
>>> just a reluctance to merge TRIMs.
>> That is exactly the case, and it's also an issue with online trim.
>> I have work in progress block layer trim patches that always plug trims
>> and have various TRIM merge improvements including support for ranged
>> TRIMs.  It needs a bit more work, but I hope I can post it later
>> this week.
> 
> Great, good to know.
> 
> I still think it should also be fixed in the RAID layer.  There's no
> reason to break a single request in millions of smaller ones, then try
> to merge them into one request back again.  The queuing layer can merge
> when it's given bad patterns from uncontrolled sources, not as an excuse
> to generate bad patterns from within the kernel.

Personally I agree with your opinion. We can improve things both in the
block layer and in the md code.

Coly



* Re: raid0 vs. mkfs
  2016-11-28  9:11           ` Avi Kivity
  2016-11-28  9:15             ` Coly Li
@ 2016-11-28 17:47             ` Shaohua Li
  1 sibling, 0 replies; 31+ messages in thread
From: Shaohua Li @ 2016-11-28 17:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid

On Mon, Nov 28, 2016 at 11:11:18AM +0200, Avi Kivity wrote:
> On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
> > On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
> > > What I guess is happening is that since the NVMe queue depth is so high, and
> > > request the driver receives is sent immediately to the disk and there is
> > > nothing to merge it to.  That could indicate the absence of plugging, or
> > > just a reluctance to merge TRIMs.
> > That is exactly the case, and it's also an issue with online trim.
> > I have work in progress block layer trim patches that always plug trims
> > and have various TRIM merge improvements including support for ranged
> > TRIMs.  It needs a bit more work, but I hope I can post it later
> > this week.
> 
> Great, good to know.
> 
> I still think it should also be fixed in the RAID layer.  There's no reason
> to break a single request in millions of smaller ones, then try to merge
> them into one request back again.  The queuing layer can merge when it's
> given bad patterns from uncontrolled sources, not as an excuse to generate
> bad patterns from within the kernel.

I think the reason is that you are using a 4.1 kernel. The upstream kernel does
support request merging even for blk-mq. It didn't in some versions.

Thanks,
Shaohua


* Re: raid0 vs. mkfs
  2016-11-28  8:58       ` Avi Kivity
  2016-11-28  9:00         ` Christoph Hellwig
@ 2016-11-29 21:14         ` NeilBrown
  2016-11-29 22:45           ` Avi Kivity
  1 sibling, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-11-29 21:14 UTC (permalink / raw)
  To: Avi Kivity, linux-raid


On Mon, Nov 28 2016, Avi Kivity wrote:
>> If it is easy for the upper layer to break a very large request into a
>> few very large requests, then I wouldn't necessarily object.
>
> I can't see why it would be hard.  It's simple arithmetic.

That is easy to say before writing the code :-)
It probably is easy for RAID0.  Less so for RAID10.  Even less for
RAID6.


>
>> But unless it is very hard for the lower layer to merge requests, it
>> should be doing that too.
>
> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make 
> the latency request R1 sum of the latencies of R1..Rn.  You may gain 
> some efficiency in the process, but that's not going to make up for a 
> factor of n.  The queuing layer has no way to tell whether the caller is 
> interested in the latency of individual requests.  By sending large 
> requests, the caller indicates it's not interested in the latency of 
> individual subranges.  The queuing layer is still free to internally 
> split the request to smaller ranges, to satisfy hardware constraints, or 
> to reduce worst-case latencies for competing request streams.

I would have thought that using plug/unplug to group requests is a
fairly strong statement that they can be handled as a unit if that is
convenient.


>
> So I disagree that all the work should be pushed to the merging layer.  
> It has less information to work with, so the fewer decisions it has to 
> make, the better.

I think that the merging layer should be as efficient as it reasonably
can be, and particularly should take into account plugging.  This
benefits all callers.

If it can be demonstrated that changes to some of the upper layers bring
further improvements with acceptable costs, then certainly it is good to
have those too.

NeilBrown



* Re: raid0 vs. mkfs
  2016-11-29 21:14         ` NeilBrown
@ 2016-11-29 22:45           ` Avi Kivity
  2016-12-07  5:08             ` Mike Snitzer
  2016-12-07 11:50             ` Coly Li
  0 siblings, 2 replies; 31+ messages in thread
From: Avi Kivity @ 2016-11-29 22:45 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 11/29/2016 11:14 PM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>>> If it is easy for the upper layer to break a very large request into a
>>> few very large requests, then I wouldn't necessarily object.
>> I can't see why it would be hard.  It's simple arithmetic.
> That is easy to say before writing the code :-)
> It probably is easy for RAID0.  Less so for RAID10.  Even less for
> RAID6.
>

Pick the largest subrange within the input range whose bounds are 0 (mod 
stripe-size); TRIM it (for all members); apply the regular algorithm to 
the head and tail subranges.  Works for all RAID types.  If the RAID is 
undergoing reshaping, exclude the range undergoing reshaping, and treat 
the two halves separately.
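
As a rough sketch of that arithmetic for the raid0 case (not actual md
code; sector_t is the kernel type, while issue_per_member_discard() and
issue_chunked_discard() are hypothetical placeholders for "one big
discard per member" and "the existing per-chunk path"):

    static void split_large_discard(sector_t start, sector_t end,
                                    sector_t chunk_sectors, int nr_members)
    {
        sector_t stripe = chunk_sectors * nr_members;
        /* largest [mid_start, mid_end) aligned to a full stripe;
         * plain division for clarity, kernel code would use sector_div() */
        sector_t mid_start = ((start + stripe - 1) / stripe) * stripe;
        sector_t mid_end = (end / stripe) * stripe;

        if (mid_start >= mid_end) {
            /* less than one full stripe: existing per-chunk path */
            issue_chunked_discard(start, end);
            return;
        }
        if (start < mid_start)
            issue_chunked_discard(start, mid_start);      /* unaligned head */
        issue_per_member_discard(mid_start, mid_end);     /* whole stripes */
        if (mid_end < end)
            issue_chunked_discard(mid_end, end);          /* unaligned tail */
    }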

>>> But unless it is very hard for the lower layer to merge requests, it
>>> should be doing that too.
>> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make
>> the latency request R1 sum of the latencies of R1..Rn.  You may gain
>> some efficiency in the process, but that's not going to make up for a
>> factor of n.  The queuing layer has no way to tell whether the caller is
>> interested in the latency of individual requests.  By sending large
>> requests, the caller indicates it's not interested in the latency of
>> individual subranges.  The queuing layer is still free to internally
>> split the request to smaller ranges, to satisfy hardware constraints, or
>> to reduce worst-case latencies for competing request streams.
> I would have thought that using plug/unplug to group requests is a
> fairly strong statement that they can be handled as a unit if that is
> convenient.

It is not.  As an example, consider a read and a few associated 
read-ahead requests submitted in a batch.  The last thing you want is 
for them to be treated as a unit.

Plug/unplug means: I have a bunch of requests here.  Whether they should 
be merged or reordered is orthogonal to whether they are members of a 
batch or not.

>
>
>> So I disagree that all the work should be pushed to the merging layer.
>> It has less information to work with, so the fewer decisions it has to
>> make, the better.
> I think that the merging layer should be as efficient as it reasonably
> can be, and particularly should take into account plugging.  This
> benefits all callers.

Yes, but plugging does not mean "please merge anything you can until the 
unplug".

> If it can be demonstrated that changes to some of the upper layers bring
> further improvements with acceptable costs, then certainly it is good to
> have those too.

Generating millions of requests only to merge them again is 
inefficient.  It happens in an edge case (TRIM of the entirety of a very 
large RAID), but it already caused one user to believe the system 
failed.  I think the system should be more robust than that.

> NeilBrown




* Re: raid0 vs. mkfs
  2016-11-29 22:45           ` Avi Kivity
@ 2016-12-07  5:08             ` Mike Snitzer
  2016-12-07 11:50             ` Coly Li
  1 sibling, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2016-12-07  5:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: NeilBrown, linux-raid

On Tue, Nov 29, 2016 at 5:45 PM, Avi Kivity <avi@scylladb.com> wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
>>
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>>>
>>>> If it is easy for the upper layer to break a very large request into a
>>>> few very large requests, then I wouldn't necessarily object.
>>>
>>> I can't see why it would be hard.  It's simple arithmetic.
>>
>> That is easy to say before writing the code :-)
>> It probably is easy for RAID0.  Less so for RAID10.  Even less for
>> RAID6.
>>
>
> pick the largest subrange wihin the inpu range whose bounds are 0 (mod
> stripe-size); TRIM it (for all members); apply the regular algorithm to the
> head and tail subranges.  Works for all RAID types.  If the RAID is
> undergoing reshaping, exclude the range undergoing reshaping, and treat the
> two halves separately.
>
>>>> But unless it is very hard for the lower layer to merge requests, it
>>>> should be doing that too.
>>>
>>> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make
>>> the latency request R1 sum of the latencies of R1..Rn.  You may gain
>>> some efficiency in the process, but that's not going to make up for a
>>> factor of n.  The queuing layer has no way to tell whether the caller is
>>> interested in the latency of individual requests.  By sending large
>>> requests, the caller indicates it's not interested in the latency of
>>> individual subranges.  The queuing layer is still free to internally
>>> split the request to smaller ranges, to satisfy hardware constraints, or
>>> to reduce worst-case latencies for competing request streams.
>>
>> I would have thought that using plug/unplug to group requests is a
>> fairly strong statement that they can be handled as a unit if that is
>> convenient.
>
>
> It is not.  As an example, consider a read and a few associated read-ahead
> requests submitted in a batch.  The last thing you want is for them to be
> treated as a unit.
>
> Plug/unplug means: I have a bunch of requests here.  Whether they should be
> merged or reordered is orthogonal to whether they are members of a batch or
> not.
>
>>
>>
>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>>
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
>
>
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
>
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
>
>
> Generating millions of requests only to merge them again is inefficient.  It
> happens in an edge case (TRIM of the entirety of a very large RAID), but it
> already caused on user to believe the system failed.  I think the system
> should be more robust than that.

DM's stripe target (raid0) has had Avi's proposed implementation
(breaks large discard into N discards, one per component device) for a
long while.
See commit 7b76ec11fec4 ("dm stripe: support discards")

This improvement applies to other raid levels too.  And I completely
agree that it would be a mistake to continue to have the block
layer issue millions of chunk-sized bios only to _maybe_ have them be
merged by the IO scheduler and/or plugging.  Needless make-work that
just eats cpu and memory.


* Re: raid0 vs. mkfs
  2016-11-29 22:45           ` Avi Kivity
  2016-12-07  5:08             ` Mike Snitzer
@ 2016-12-07 11:50             ` Coly Li
  2016-12-07 12:03               ` Coly Li
  2016-12-07 16:59               ` Shaohua Li
  1 sibling, 2 replies; 31+ messages in thread
From: Coly Li @ 2016-12-07 11:50 UTC (permalink / raw)
  To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block


On 2016/11/30 6:45 AM, Avi Kivity wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
[snip]

>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
> 
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
> 
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
> 
> Generating millions of requests only to merge them again is
> inefficient.  It happens in an edge case (TRIM of the entirety of a very
> large RAID), but it already caused one user to believe the system
> failed.  I think the system should be more robust than that.

Neil,

As I understand it, if a large discard bio is received by
raid0_make_request(), for example one requesting to discard chunks 1 to 24
on a raid0 device built from 4 SSDs, this large discard bio will be split
and written to each SSD with the following layout,

SSD1: C1,C5,C9,C13,C17,C21
SSD2: C2,C6,C10,C14,C18,C22
SSD3: C3,C7,C11,C15,C19,C23
SSD4: C4,C8,C12,C16,C20,C24

The current raid0 code will call generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of each
split bio, so we can combine all the bios into four per-SSD large bios,
like this,

bio1 (on SSD1): C{1,5,9,13,17,21}
bio2 (on SSD2): C{2,6,10,14,18,22}
bio3 (on SSD3): C{3,7,11,15,19,23}
bio4 (on SSD4): C{4,8,12,16,20,24}

Now we only need to call generic_make_request() 4 times. Rebuilding the
per-device discard bios is more efficient in the raid0 code than in the
block layer. There are several reasons that I know of,
- there is a timeout for merging; the block layer cannot merge all the
split bios into one large bio before it expires.
- rebuilding the per-device bios in raid0 only takes a few calculations,
while the block layer merges on the queue with list operations, which is
slower.
- the raid0 code knows its on-disk layout, so rebuilding per-device bios
is possible here; the block layer has no idea of the raid0 layout, it can
only do request merging.

Avi,

I have composed a prototype patch; the code is not simple, indeed it is
quite complicated IMHO.

I did a little research: some NVMe SSDs support whole-device-size
DISCARD, and I observe that mkfs.xfs sends a raid0-device-size DISCARD
bio to the block layer. But raid0_make_request() only receives 512KB
DISCARD bios; block/blk-lib.c:__blkdev_issue_discard() splits the
original large bio into small 512KB bios, and the limitation comes from
q->limits.discard_granularity.

At this moment, I don't know why q->limits.discard_granularity is
512KB even though the underlying SSD supports whole-device-size discard.
We also need to fix q->limits.discard_granularity, otherwise
block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop
to split the original large discard bio into smaller ones and send them
to the raid0 code via next_bio().

I am also CCing this email to linux-block@vger.kernel.org to ask for
help. My question is: if an NVMe SSD supports whole-device-size DISCARD,
is q->limits.discard_granularity still necessary?

Here I also attach my prototype patch as a proof of concept; it is
runnable with Linux 4.9-rc7.

Coly



[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 9329 bytes --]

Subject: optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles TRIM/DISCARD bios
chunk by chunk, which means a large raid0 device built from SSDs will
call generic_make_request() millions of times for the split bios. This
patch tries to combine the small bios into a large one if they are on
the same real device and contiguous on that device, then sends the
combined large bio to the underlying device with a single call to
generic_make_request().

For example, when mkfs.xfs trims a raid0 device built with 4 x 3TB
NVMe SSDs, current upstream raid0_make_request() will call
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world yet, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones due to the restriction of discard_granularity.

If some day SSDs support a whole-device-sized discard_granularity, it
will be very interesting then...

The basic idea is: if a large discard bio is received
by raid0_make_request(), for example one requesting to discard chunks 1
to 24 on a raid0 device built from 4 SSDs, this large discard bio will
be split and written to each SSD with the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four per-SSD large
bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.

The code is not simple; I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..73c5fac 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -452,6 +452,225 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_discard_request(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	/* if the bio does not cross a chunk boundary,
+	 * simply handle it here.
+	 */
+	if (sectors >= bio_sectors(bio)) {
+		bio->bi_bdev = first_rdev->bdev;
+		bio->bi_iter.bi_sector = sector + first_zone->dev_start +
+					 first_rdev->data_offset;
+		if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(bio);
+		else
+			generic_make_request(bio);
+		return;
+	}
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	/* Restore due to sector_div */
+	sector = bio->bi_iter.bi_sector;
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+	BUG_ON(recs[first_rdev_idx].rdev != first_rdev);
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the rested sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, the split bio on first rdev
+	 * may not be chunk size aligned too. But the rested split bios
+	 * on rested rdevs must be chunk size aligned, and aligned to
+	 * round down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+		if (sectors <= chunk_sects) {
+			recs[rdev_idx].sectors = sectors;
+			goto issue;
+		}
+		recs[rdev_idx].sectors = chunk_sects;
+		sectors -= chunk_sects;
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into next zone, we won't
+		 * handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors = 1;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors = 1;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors)
+				grow_sectors += rested_sectors;
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (add_rested_sectors)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +682,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_request(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-07 11:50             ` Coly Li
@ 2016-12-07 12:03               ` Coly Li
  2016-12-07 16:59               ` Shaohua Li
  1 sibling, 0 replies; 31+ messages in thread
From: Coly Li @ 2016-12-07 12:03 UTC (permalink / raw)
  To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block

On 2016/12/7 下午7:50, Coly Li wrote:
> On 2016/11/30 上午6:45, Avi Kivity wrote:
>> On 11/29/2016 11:14 PM, NeilBrown wrote:
> [snip]
> 
>>>> So I disagree that all the work should be pushed to the merging layer.
>>>> It has less information to work with, so the fewer decisions it has to
>>>> make, the better.
>>> I think that the merging layer should be as efficient as it reasonably
>>> can be, and particularly should take into account plugging.  This
>>> benefits all callers.
>>
>> Yes, but plugging does not mean "please merge anything you can until the
>> unplug".
>>
>>> If it can be demonstrated that changes to some of the upper layers bring
>>> further improvements with acceptable costs, then certainly it is good to
>>> have those too.
>>
>> Generating millions of requests only to merge them again is
>> inefficient.  It happens in an edge case (TRIM of the entirety of a very
>> large RAID), but it already caused one user to believe the system
>> failed.  I think the system should be more robust than that.
> 
> Neil,
> 
> As my understand, if a large discard bio received by
> raid0_make_request(), for example it requests to discard chunk 1 to 24
> on a raid0 device built by 4 SSDs. This large discard bio will be split
> and written to each SSD as the following layout,
> 
> SSD1: C1,C5,C9,C13,C17,C21
> SSD2: C2,C6,C10,C14,C18,C22
> SSD3: C3,C7,C11,C15,C19,C23
> SSD4: C4,C8,C12,C16,C20,C24
> 
> Current raid0 code will call generic_make_request() for 24 times for
> each split bio. But it is possible to calculate the final layout of each
> split bio, so we can combine all the bios into four per-SSD large bio,
> like this,
> 
> bio1 (on SSD1): C{1,5,9,13,17,21}
> bio2 (on SSD2): C{2,6,10,14,18,22}
> bio3 (on SSD3): C{3,7,11,15,19,23}
> bio4 (on SSD4): C{4,8,12,16,20,24}
> 
> Now we only need to call generic_make_request() for 4 times. Rebuild the
> per-device discard bios is more efficient in raid0 code then in block
> layer. There are some reasons that I know,
> - there are splice timeout, block layer cannot merge all split bio into
> one large bio before time out.
> - rebuilt per-device bio in raid0 is just by a few calculation, block
> layer does merge on queue with list operations, it is slower.
> - raid0 code knows its on disk layout, so rebuild per-device bio is
> possible here. block layer has no idea on raid0 layout, it can only do
> request merge.
> 
> Avi,
> 
> I compose a prototype patch, the code is not simple, indeed it is quite
> complicated IMHO.
> 
> I do a little research, some NVMe SSDs support whole device size
> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> bio to block layer. But raid0_make_request() only receives 512KB size
> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> original large bio into 512KB small bios, the limitation is from
> q->limits.discard_granularity.
> 
> At this moment, I don't know why a q->limits.discard_granularity is
> 512KB even the underlying SSD supports whole device size discard. We
> also need to fix q->limits.discard_granularity, otherwise
> block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop
> to split the original large discard bio into smaller ones and sends them
> to raid0 code by next_bio().
> 
> I also CC this email to linux-block@vger.kernel.org to ask for help. My
> question is, if a NVMe SSD supports whole-device-size DISCARD, is
> q->limits.discard_granularity still necessary ?
> 
> Here I also attach my prototype patch as a proof of concept, it is
> runnable with Linux 4.9-rc7.

Aha! It is not limits.discard_granularity, it is
limits.max_discard_sectors, which is set in raid0.c:raid0_run() by
blk_queue_max_discard_sectors(). Here limits.max_discard_sectors is set
to the raid0 chunk size. Interesting ....

And, linux-block@vger.kernel.org, please ignore my noise.

Coly

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-07 11:50             ` Coly Li
  2016-12-07 12:03               ` Coly Li
@ 2016-12-07 16:59               ` Shaohua Li
  2016-12-08 16:44                 ` Coly Li
  1 sibling, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2016-12-07 16:59 UTC (permalink / raw)
  To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block

On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> On 2016/11/30 上午6:45, Avi Kivity wrote:
> > On 11/29/2016 11:14 PM, NeilBrown wrote:
> [snip]
> 
> >>> So I disagree that all the work should be pushed to the merging layer.
> >>> It has less information to work with, so the fewer decisions it has to
> >>> make, the better.
> >> I think that the merging layer should be as efficient as it reasonably
> >> can be, and particularly should take into account plugging.  This
> >> benefits all callers.
> > 
> > Yes, but plugging does not mean "please merge anything you can until the
> > unplug".
> > 
> >> If it can be demonstrated that changes to some of the upper layers bring
> >> further improvements with acceptable costs, then certainly it is good to
> >> have those too.
> > 
> > Generating millions of requests only to merge them again is
> > inefficient.  It happens in an edge case (TRIM of the entirety of a very
> > large RAID), but it already caused one user to believe the system
> > failed.  I think the system should be more robust than that.
> 
> Neil,
> 
> As my understand, if a large discard bio received by
> raid0_make_request(), for example it requests to discard chunk 1 to 24
> on a raid0 device built by 4 SSDs. This large discard bio will be split
> and written to each SSD as the following layout,
> 
> SSD1: C1,C5,C9,C13,C17,C21
> SSD2: C2,C6,C10,C14,C18,C22
> SSD3: C3,C7,C11,C15,C19,C23
> SSD4: C4,C8,C12,C16,C20,C24
> 
> Current raid0 code will call generic_make_request() for 24 times for
> each split bio. But it is possible to calculate the final layout of each
> split bio, so we can combine all the bios into four per-SSD large bio,
> like this,
> 
> bio1 (on SSD1): C{1,5,9,13,17,21}
> bio2 (on SSD2): C{2,6,10,14,18,22}
> bio3 (on SSD3): C{3,7,11,15,19,23}
> bio4 (on SSD4): C{4,8,12,16,20,24}
> 
> Now we only need to call generic_make_request() for 4 times. Rebuild the
> per-device discard bios is more efficient in raid0 code then in block
> layer. There are some reasons that I know,
> - there are splice timeout, block layer cannot merge all split bio into
> one large bio before time out.
> - rebuilt per-device bio in raid0 is just by a few calculation, block
> layer does merge on queue with list operations, it is slower.
> - raid0 code knows its on disk layout, so rebuild per-device bio is
> possible here. block layer has no idea on raid0 layout, it can only do
> request merge.

Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
zones make things a little complicated though. I just had a brief look at your
proposed patch, which looks really complicated. I'd suggest something like
this:
1. split the bio according to zone boundaries.
2. handle the split bios. Since each bio is within a zone's range, calculating
the start and end sector for each rdev should be easy.

This will create slightly more bios to each rdev (not too many, since there
aren't too many zones in practice) and the block layer should easily merge these
bios without much overhead. The benefit is a much simpler implementation.

> I compose a prototype patch, the code is not simple, indeed it is quite
> complicated IMHO.
> 
> I do a little research, some NVMe SSDs support whole device size
> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> bio to block layer. But raid0_make_request() only receives 512KB size
> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> original large bio into 512KB small bios, the limitation is from
> q->limits.discard_granularity.

please adjust the max discard sectors for the queue. The original setting is
chunk size.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-07 16:59               ` Shaohua Li
@ 2016-12-08 16:44                 ` Coly Li
  2016-12-08 19:19                   ` Shaohua Li
  2017-06-29 15:15                   ` Avi Kivity
  0 siblings, 2 replies; 31+ messages in thread
From: Coly Li @ 2016-12-08 16:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 2545 bytes --]

On 2016/12/8 上午12:59, Shaohua Li wrote:
> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
[snip]
> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> zones make things a little complicated though. I just had a brief look of your
> proposed patch, which looks really complicated. I'd suggest something like
> this:
> 1. split the bio according to zone boundary.
> 2. handle the splitted bio. since the bio is within zone range, calculating
> the start and end sector for each rdev should be easy.
> 

Hi Shaohua,

Thanks for your suggestion! I tried to modify the code following your
suggestion; it is even harder to make the code work that way ...

Because even if we split bios for each zone, all the corner cases still exist
and must be taken care of in every zone. The code will be more complicated.


> This will create slightly more bio to each rdev (not too many, since there
> aren't too many zones in practice) and block layer should easily merge these
> bios without much overhead. The benefit is a much simpler implementation.
> 
>> I compose a prototype patch, the code is not simple, indeed it is quite
>> complicated IMHO.
>>
>> I do a little research, some NVMe SSDs support whole device size
>> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
>> bio to block layer. But raid0_make_request() only receives 512KB size
>> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
>> original large bio into 512KB small bios, the limitation is from
>> q->limits.discard_granularity.
> 
> please adjust the max discard sectors for the queue. The original setting is
> chunk size.

This is a powerful suggestion. I changed max_discard_sectors to the raid0
size and fixed some bugs, and now the patch seems to work well. The
performance numbers are not bad.

On a 4x3TB NVMe raid0, formatting with mkfs.xfs takes 306 seconds on the
current upstream kernel and 15 seconds on the patched kernel. I see the
average request size increase from 1 chunk (1024 sectors) to 2048 chunks
(2097152 sectors).

I don't know why the bios are still being split before raid0_make_request()
receives them, even after I set q->limits.max_discard_sectors to
mddev->array_sectors. Can anybody give me a hint?

Here I attach the RFC v2 patch; if anybody wants to try it, please do
and report the results :-)

I will take time to write a very detailed commit log and code comments
to make this patch easier to understand. Ugly code, that's the price
I have to pay for better performance ....

Thanks in advance.

Coly





[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]

Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles TRIM/DISCARD bios
chunk by chunk, which means a large raid0 device built from SSDs will
call generic_make_request() millions of times for the split bios. This
patch tries to combine the small bios into a large one if they are on
the same real device and contiguous on that device, then sends the
combined large bio to the underlying device with a single call to
generic_make_request().

For example, when mkfs.xfs trims a raid0 device built with 4 x 3TB
NVMe SSDs, current upstream raid0_make_request() will call
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world yet, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones due to the restriction of discard_granularity.

If some day SSDs support a whole-device-sized discard_granularity, it
will be very interesting then...

The basic idea is: if a large discard bio is received
by raid0_make_request(), for example one requesting to discard chunks 1
to 24 on a raid0 device built from 4 SSDs, this large discard bio will
be split and written to each SSD with the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four per-SSD large
bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.

The code is not simple; I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.

Changelogs
v1, Initial prototype.
v2, Major changes include,
    - rename function names; now handle_discard_bio() takes care of
      chunk-size DISCARD bios and the single-disk situation, and large
      DISCARD bios are handled in handle_large_discard_bio().
    - Set max_discard_sectors to the raid0 device size.
    - Fix several bugs found in basic testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+		sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the rested sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, the split bio on first rdev
+	 * may not be chunk size aligned too. But the rested split bios
+	 * on rested rdevs must be chunk size aligned, and aligned to
+	 * round down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+					"capability, ignore extra " \
+					"TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into next zone, we won't
+		 * handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors ++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors ++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors ++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end -
+			  sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-08 16:44                 ` Coly Li
@ 2016-12-08 19:19                   ` Shaohua Li
  2016-12-09  7:34                     ` Coly Li
  2017-06-29 15:15                   ` Avi Kivity
  1 sibling, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2016-12-08 19:19 UTC (permalink / raw)
  To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block

On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
> On 2016/12/8 上午12:59, Shaohua Li wrote:
> > On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> [snip]
> > Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> > zones make things a little complicated though. I just had a brief look of your
> > proposed patch, which looks really complicated. I'd suggest something like
> > this:
> > 1. split the bio according to zone boundary.
> > 2. handle the splitted bio. since the bio is within zone range, calculating
> > the start and end sector for each rdev should be easy.
> > 
> 
> Hi Shaohua,
> 
> Thanks for your suggestion! I try to modify the code by your suggestion,
> it is even more hard to make the code that way ...
> 
> Because even split bios for each zone, all the corner cases still exist
> and should be taken care in every zoon. The code will be more complicated.

Not sure why it makes the code more complicated. Probably I'm wrong, but I just
want to make sure we are on the same page: split the bio at zone
boundaries, then handle the split bios separately. Calculating the start/end
point on each rdev for the new bio within a zone should be simple. We then clone a
bio for each rdev and dispatch. So for example:
Disk 0: D0 D2 D4 D6 D7
Disk 1: D1 D3 D5
zone 0 is from D0 - D5, zone 1 is from D6 - D7
If the bio is from D1 to D7, we split it into 2 bios, one is D1 - D5, the other D6 - D7.
For D1 - D5, we dispatch 2 bios: D1 - D5 for disk 1, D2 - D4 for disk 0.
For D6 - D7, we just dispatch to disk 0.
What kind of corner case makes this more complicated?
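
A rough userspace sketch of that two-step mapping, hard-coded to the
example layout above (the zone table below is an assumption made for
illustration, not the md data structures):

#include <stdio.h>

struct zone {
	int start;	/* first array chunk of the zone */
	int end;	/* one past the last array chunk of the zone */
	int nb_dev;	/* number of member disks striped in this zone */
	int dev_start;	/* device chunk where the zone begins on its members */
};

int main(void)
{
	/* zone 0: D0-D5 striped over disk 0 and disk 1;
	 * zone 1: D6-D7 on disk 0 only (its device chunks 3 and 4) */
	struct zone zones[] = {
		{ .start = 0, .end = 6, .nb_dev = 2, .dev_start = 0 },
		{ .start = 6, .end = 8, .nb_dev = 1, .dev_start = 3 },
	};
	int req_start = 1, req_end = 8;	/* the bio covers D1..D7 */

	for (int z = 0; z < 2; z++) {
		/* step 1: clip the request to this zone */
		int lo = req_start > zones[z].start ? req_start : zones[z].start;
		int hi = req_end < zones[z].end ? req_end : zones[z].end;

		if (lo >= hi)
			continue;	/* request does not touch this zone */

		/* step 2: one contiguous range per member disk of the zone */
		for (int d = 0; d < zones[z].nb_dev; d++) {
			int i = lo - zones[z].start;	/* chunk index inside the zone */

			while (zones[z].start + i < hi && i % zones[z].nb_dev != d)
				i++;
			if (zones[z].start + i >= hi)
				continue;
			int count = (hi - zones[z].start - i - 1) / zones[z].nb_dev + 1;
			printf("zone %d, member disk %d: %d chunk(s) at device chunk %d\n",
			       z, d, count, zones[z].dev_start + i / zones[z].nb_dev);
		}
	}
	return 0;
}

For the D1 - D7 request it prints 2 chunks at device chunk 1 on zone 0's
disk 0, 3 chunks at device chunk 0 on disk 1, and 2 chunks at device
chunk 3 on zone 1's only member (disk 0), i.e. three bios in total, which
is the dispatch described above.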
 
> > This will create slightly more bio to each rdev (not too many, since there
> > aren't too many zones in practice) and block layer should easily merge these
> > bios without much overhead. The benefit is a much simpler implementation.
> > 
> >> I compose a prototype patch, the code is not simple, indeed it is quite
> >> complicated IMHO.
> >>
> >> I do a little research, some NVMe SSDs support whole device size
> >> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> >> bio to block layer. But raid0_make_request() only receives 512KB size
> >> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> >> original large bio into 512KB small bios, the limitation is from
> >> q->limits.discard_granularity.
> > 
> > please adjust the max discard sectors for the queue. The original setting is
> > chunk size.
> 
> This is a powerful suggestion, I change the max_discard_sectors to raid0
> size, and fix some bugs, now the patch looks working well. The
> performance number is not bad.
> 
> On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel
> spends 306 seconds, the patched kernel spends 15 seconds. I see average
> request size increases from 1 chunk (1024 sectors) to 2048 chunks
> (2097152 sectors).
> 
> I don't know why the bios are still be split before raid0_make_request()
> receives them, after I set q->limits.max_discard_sectors to
> mddev->array_sectors. Can anybody give me a hint ?

That probably is from disk_stack_limits(). Try setting max_discard_sectors after it.

> Here I attach the RFC v2 patch, if anybody wants to try it, please do it
> and response the result :-)
> 
> I will take time to write a very detailed commit log and code comments
> to make this patch more easier to be understood. Ugly code, that's what
> I have to pay to gain better performance ....

Can't say I like it :). Hard to read and the memory allocation is ugly. Please
check if there is a simpler solution first before writing a detailed commit log.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-08 19:19                   ` Shaohua Li
@ 2016-12-09  7:34                     ` Coly Li
  2016-12-12  3:17                       ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Coly Li @ 2016-12-09  7:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block

On 2016/12/9 上午3:19, Shaohua Li wrote:
> On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>> [snip]
>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>>> zones make things a little complicated though. I just had a brief look of your
>>> proposed patch, which looks really complicated. I'd suggest something like
>>> this:
>>> 1. split the bio according to zone boundary.
>>> 2. handle the splitted bio. since the bio is within zone range, calculating
>>> the start and end sector for each rdev should be easy.
>>>
>>
>> Hi Shaohua,
>>
>> Thanks for your suggestion! I try to modify the code by your suggestion,
>> it is even more hard to make the code that way ...
>>
>> Because even split bios for each zone, all the corner cases still exist
>> and should be taken care in every zoon. The code will be more complicated.
> 
> Not sure why it makes the code more complicated. Probably I'm wrong, but Just
> want to make sure we are in the same page: split the bio according to zone
> boundary, then handle the splitted bio separately. Calculating end/start point
> of each rdev for the new bio within a zone should be simple. we then clone a
> bio for each rdev and dispatch. So for example:
> Disk 0: D0 D2 D4 D6 D7
> Disk 1: D1 D3 D5
> zone 0 is from D0 - D5, zone 1 is from D6 - D7
> If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7.
> For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0
> For D6 - D7, we just dispatch to disk 0.
> What kind of corner case makes this more complicated?
>  

Let me explain the corner cases.

When upper layer code issues a DISCARD bio, the bio->bi_iter.bi_sector
may not be chunk size aligned, and bio->bi_iter.bi_size may not be
aligned to (chunk_sects*nb_dev) sectors. In raid0, we can't simply round
them up/down to chunk-size-aligned numbers, otherwise data
loss/corruption will happen.

Therefore for each DISCARD bio that raid0_make_request() receives, the
beginning and ending parts of this bio have to be treated very carefully.
All the corner cases *come from here*; they are not about the number of
zones or rdevs, they are about whether bio->bi_iter.bi_sector and
bio->bi_iter.bi_size are chunk size aligned or not.

- beginning of the bio
  If bio->bi_iter.bi_sector is not chunk size aligned, the current raid0
code will split the beginning part into a split bio which only contains the
sectors from bio->bi_iter.bi_sector to the next chunk-size-aligned offset,
and issue this bio via generic_make_request(). But in
handle_large_discard_bio() we can't issue the split bio right away; we have
to record the length of this split bio in a per-device structure, and issue
the split bio after all the sectors of the DISCARD bio have been calculated.
So I use recs[first_rdev_idx].bi_sector to record the bi_iter.bi_sector of
the split bio, recs[first_rdev_idx].sectors to record the length of this
split bio, and recs[first_rdev_idx].rdev to record the address of the real
device where the split bio will be sent.

  After the first non-chunk-size-aligned part is handled, we need to look
at the next rdev for the next chunk of the DISCARD bio. Now the sector
offset is chunk size aligned, but there are still two conditions:
  1) If first_rdev is not the last real device in the current zone, then
on the next real device, its per-device bio will start at the rounded-down
chunk offset of recs[first_rdev_idx].bi_sector. If
recs[first_rdev_idx].bi_sector is chunk size aligned, the next real
device's per-device bio will start at the same chunk offset.
  2) If first_rdev is the last real device in the current zone, then the
next rdev to handle is real device number 0 among the
conf->strip_zone[zone_idx].nb_dev real devices. In this case, the
bi_iter.bi_sector of the per-device bio of this real device is the chunk
offset next to the chunk which recs[first_rdev_idx].bi_sector hits on
recs[first_rdev_idx].rdev.

- ending part of the bio
  If the length of the DISCARD bio is not aligned to (chunk_sects * nb_dev)
sectors, then after we have handled one or more (chunk_sects*nb_dev)-aligned
stretches of sectors, the remaining sectors are less than chunk_sects*nb_dev,
but these sectors may still fill some rested_chunks and rested_sectors. We
also need to handle them carefully.
  If rested_chunks is not 0, because chunks are allocated linearly on
each real device in the current zone, we need to add chunk_sects to
recs[rdev_idx].sectors, where rdev_idx runs from 0 to
strip_zone[zone_idx].nb_dev - 1. When we move to the next real device
(rdev_idx++), we need to subtract one chunk from rested_chunks
(rested_chunks--). When rested_chunks reaches 0, we start to handle
rested_sectors.
  rested_sectors will be added to the next real device; we simply add
rested_sectors to recs[rdev_idx].sectors. There is a corner case where,
after the calculation for the DISCARD bio, rested_chunks is 0 and
rested_sectors is not 0. In this case, we only need to add rested_sectors
to real device number 0 of the zone, which is recs[0].

A DISCARD bio will probably cover only one raid0 zone, but all of the
above corner cases still have to be taken care of. Therefore submitting
split bios for each zone does not make the code simpler.

Handling the details correctly is really tedious; if I could explain it to
you face to face, that would be much easier.


>>> This will create slightly more bio to each rdev (not too many, since there
>>> aren't too many zones in practice) and block layer should easily merge these
>>> bios without much overhead. The benefit is a much simpler implementation.
>>>
>>>> I compose a prototype patch, the code is not simple, indeed it is quite
>>>> complicated IMHO.
>>>>
>>>> I do a little research, some NVMe SSDs support whole device size
>>>> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
>>>> bio to block layer. But raid0_make_request() only receives 512KB size
>>>> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
>>>> original large bio into 512KB small bios, the limitation is from
>>>> q->limits.discard_granularity.
>>>
>>> please adjust the max discard sectors for the queue. The original setting is
>>> chunk size.
>>
>> This is a powerful suggestion, I change the max_discard_sectors to raid0
>> size, and fix some bugs, now the patch looks working well. The
>> performance number is not bad.
>>
>> On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel
>> spends 306 seconds, the patched kernel spends 15 seconds. I see average
>> request size increases from 1 chunk (1024 sectors) to 2048 chunks
>> (2097152 sectors).
>>
>> I don't know why the bios are still be split before raid0_make_request()
>> receives them, after I set q->limits.max_discard_sectors to
>> mddev->array_sectors. Can anybody give me a hint ?
> 
> That probably is from disk_stack_limits. try set the max_discard_sectors after it.
> 

I know why the bio is still split: (1<<32)/512 = 8388608. In
__blkdev_issue_discard(), req_sects is decided by:
	/* Make sure bi_size doesn't overflow */
	req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
I tried simply changing bi_size from unsigned int to unsigned long and
changing the above limit to ULONG_MAX>>9, and the kernel panics. It seems
changing bi_size from unsigned int to unsigned long is not simple.

If we don't change bi_iter.bi_size to unsigned long, this is the best we
can do for now.
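
Just to show the effect of the two limits, here is a small userspace
calculation of how many split bios a whole-array discard becomes when
each bio is capped at the chunk size versus at UINT_MAX >> 9 sectors
(the 1TiB array below is only an example, not the test box above):

#include <stdio.h>

int main(void)
{
	unsigned long long array_sectors = 2ULL * 1024 * 1024 * 1024;	/* 1TiB in 512B sectors */
	unsigned long long chunk_sects = 1024;				/* 512KB chunk */
	unsigned long long cap_sects = 0xffffffffULL >> 9;		/* UINT_MAX >> 9 */

	/* ceil(array_sectors / limit) = number of split DISCARD bios */
	printf("capped at chunk size : %llu bios\n",
	       (array_sectors + chunk_sects - 1) / chunk_sects);
	printf("capped at UINT_MAX>>9: %llu bios\n",
	       (array_sectors + cap_sects - 1) / cap_sects);
	return 0;
}

For the 1TiB example that is 2097152 chunk-sized bios versus 257 bios of
roughly 4GB each, so raising max_discard_sectors already removes most of
the splitting even before any raid0-side combining of per-device bios.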


>> Here I attach the RFC v2 patch, if anybody wants to try it, please do it
>> and response the result :-)
>>
>> I will take time to write a very detailed commit log and code comments
>> to make this patch more easier to be understood. Ugly code, that's what
>> I have to pay to gain better performance ....
> 
> Can't say I like it :). Hard to read and the memory allocation is ugly. Please
> check if there is simpler solution first before writting detailed commit log.

I don't like it either ....
I can imagine how hard it is to read, because even explaining it in text is
difficult... There are a lot of details to take care of, and the
calculations have to match exactly in the end.

I tried to avoid allocating memory for recs[], but without a data
structure to record the per-device bio attributes, I cannot find a
better way to write the code. The only thing that makes me 1%+ more
comfortable is that bio_split() also allocates memory ^_^

Coly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-09  7:34                     ` Coly Li
@ 2016-12-12  3:17                       ` NeilBrown
  0 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2016-12-12  3:17 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: Avi Kivity, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 3524 bytes --]

On Fri, Dec 09 2016, Coly Li wrote:

> On 2016/12/9 上午3:19, Shaohua Li wrote:
>> On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
>>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>>> [snip]
>>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>>>> zones make things a little complicated though. I just had a brief look of your
>>>> proposed patch, which looks really complicated. I'd suggest something like
>>>> this:
>>>> 1. split the bio according to zone boundary.
>>>> 2. handle the splitted bio. since the bio is within zone range, calculating
>>>> the start and end sector for each rdev should be easy.
>>>>
>>>
>>> Hi Shaohua,
>>>
>>> Thanks for your suggestion! I try to modify the code by your suggestion,
>>> it is even more hard to make the code that way ...
>>>
>>> Because even split bios for each zone, all the corner cases still exist
>>> and should be taken care in every zoon. The code will be more complicated.
>> 
>> Not sure why it makes the code more complicated. Probably I'm wrong, but Just
>> want to make sure we are in the same page: split the bio according to zone
>> boundary, then handle the splitted bio separately. Calculating end/start point
>> of each rdev for the new bio within a zone should be simple. we then clone a
>> bio for each rdev and dispatch. So for example:
>> Disk 0: D0 D2 D4 D6 D7
>> Disk 1: D1 D3 D5
>> zone 0 is from D0 - D5, zone 1 is from D6 - D7
>> If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7.
>> For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0
>> For D6 - D7, we just dispatch to disk 0.
>> What kind of corner case makes this more complicated?
>>  
>
> Let me explain the corner cases.
>
> When upper layer code issues a DISCARD bio, the bio->bi_iter.bi_sector
> may not be chunk size aligned, and bio->bi_iter.bi_size may not be
> (chunk_sects*nb_dev) sectors aligned. In raid0, we can't simply round
> up/down them into chunk size aligned number, otherwise data
> lost/corruption will happen.
>
> Therefore for each DISCARD bio that raid0_make_request() receive, the
> beginning and ending parts of this bio should be treat very carefully.
> All the corner cases *come from here*, they are not about number of
> zones or rdevs, it is about whether bio->bi_iter.bi_sector and
> bio->bi_iter.bi_size are chunk size aligned or not.
>
> - beginning of the bio
>   If bio->bi_iter.bi_sector is not chunk size aligned, current raid0
> code will split the beginning part into split bio which only contains
> sectors from bio->bi_iter.sector to next chunk size aligned offset, and
> issue this bio by generic_make_request(). But in
> discard_large_discard_bio() we can't issue the split bio now, we have to
> record lenth of this split bio into a per-device structure, and issue a

Why?

Why can't you just split off the start of the bio and chain it with the
rest of the bio?

If the bio doesn't start at the beginning of a stripe, just split off the
first (partial) chunk exactly as we currently do.

If it does start at the beginning of a stripe, then split off a whole
number of stripes and allocate one bio for each device.  Chain those to
the original bio together with any remainder (which isn't a whole
stripe).

I think that if you make use of bio_split() and bio_chain() properly,
the code will be much simpler.
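
For example, a minimal userspace sketch of that decomposition (made-up
geometry and request, single zone; it only prints what each split would
cover, the bio_split()/bio_chain() plumbing itself is not shown):

#include <stdio.h>

int main(void)
{
	unsigned long long chunk = 1024;	/* example chunk size in sectors */
	unsigned long long ndev = 4;		/* example raid0 with 4 devices */
	unsigned long long stripe = chunk * ndev;
	unsigned long long start = 1536, len = 20000;	/* example discard range */

	/* head: up to the next stripe boundary, left to the existing
	 * chunk-by-chunk path (at most ndev chunk-sized pieces) */
	if (start % stripe) {
		unsigned long long head = stripe - start % stripe;

		if (head > len)
			head = len;
		printf("head: %llu sectors at %llu, %llu chunk-sized piece(s)\n",
		       head, start, (start % chunk + head + chunk - 1) / chunk);
		start += head;
		len -= head;
	}

	/* body: a whole number of stripes, one split bio per device */
	unsigned long long stripes = len / stripe;
	if (stripes) {
		printf("body: %llu whole stripes at %llu -> %llu per-device bios of %llu sectors\n",
		       stripes, start, ndev, stripes * chunk);
		start += stripes * stripe;
		len -= stripes * stripe;
	}

	/* tail: the remainder (less than one stripe), chunk by chunk again */
	if (len)
		printf("tail: %llu sectors at %llu, %llu chunk-sized piece(s)\n",
		       len, start, (len + chunk - 1) / chunk);
	return 0;
}

With these numbers a 20000-sector discard becomes 3 + 4 + 2 = 9 bios
instead of 21 chunk-sized ones, and only the head and tail still depend
on the chunk size.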

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
                   ` (2 preceding siblings ...)
  2016-11-28  5:09 ` NeilBrown
@ 2017-01-22 18:01 ` Avi Kivity
  2017-01-23 12:26   ` Coly Li
  3 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2017-01-22 18:01 UTC (permalink / raw)
  To: linux-raid

Hello,


On 11/27/2016 05:24 PM, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large 
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).  
> That is because mkfs issues a TRIM/DISCARD (erase whichever is 
> inappropriate) for the entire partition. As far as I can tell, md 
> converts the large TRIM/DISCARD (erase whichever is inappropriate) 
> into a large number of TRIM/DISCARD (erase whichever is inappropriate) 
> requests, one per chunk-size worth of disk, and issues them to the 
> RAID components individually.
>
>
> It seems to me that md can convert the large TRIM/DISCARD (erase 
> whichever is inappropriate) request it gets into one TRIM/DISCARD 
> (erase whichever is inappropriate) per RAID component, converting an 
> O(disk size / chunk size) operation into an O(number of RAID 
> components) operation, which is much faster.
>
>
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, 
> with the operation taking about a quarter of an hour, continuously 
> pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate) 
> requests to the disk. Linux 4.1.12.
>

Did anyone pick this up by any chance?  The only thing I could find is 
more people complaining about the same issue.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2017-01-22 18:01 ` Avi Kivity
@ 2017-01-23 12:26   ` Coly Li
  0 siblings, 0 replies; 31+ messages in thread
From: Coly Li @ 2017-01-23 12:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2130 bytes --]

On 2017/1/23 上午2:01, Avi Kivity wrote:
> Hello,
> 
> 
> On 11/27/2016 05:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate). 
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate)
>> into a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the
>> RAID components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD
>> (erase whichever is inappropriate) per RAID component, converting an
>> O(disk size / chunk size) operation into an O(number of RAID
>> components) operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices,
>> with the operation taking about a quarter of an hour, continuously
>> pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate)
>> requests to the disk. Linux 4.1.12.
>>
> 
> Did anyone pick this up by any chance?  The only thing I could find is
> more people complaining about the same issue.

Hi Avi,

I proposed a POC patch; Shaohua and Neil provided review comments and
suggested that I simplify it.

If you look, there is a patch I sent out on Dec 9, 2016, in this email
thread. This patch works, but it is not the final version to be accepted
upstream. I quote the performance numbers here,

“On a 4x3TB NVMe raid0, formatting with mkfs.xfs takes 306 seconds on the
current upstream kernel and 15 seconds on the patched kernel. I see the
average request size increase from 1 chunk (1024 sectors) to 2048 chunks
(2097152 sectors).”

I call this patch the RFC v2 patch, and attach it again in this email.
Now I am working on a new version following the suggestions from Shaohua
and Neil, but any testing feedback on the RFC v2 patch is welcome. It
works, it is just too complex to review.

Thanks.

Coly

[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]

Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles TRIM/DISCARD bios
chunk by chunk, which means a large raid0 device built from SSDs will
call generic_make_request() millions of times for the split bios. This
patch tries to combine the small bios into a large one if they are on
the same real device and contiguous on that device, then sends the
combined large bio to the underlying device with a single call to
generic_make_request().

For example, when mkfs.xfs trims a raid0 device built from 4 x 3TB
NVMe SSDs, the current upstream raid0_make_request() calls
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world yet, because in
block/blk-lib.c:__blkdev_issue_discard() the original large bio is
split into smaller ones, restricted by discard_granularity.

If some day SSDs support a whole-device-sized discard_granularity, it
will be very interesting...

The basic idea is this: suppose a large discard bio is received
by raid0_make_request(), for example one requesting to discard chunks 1
to 24 on a raid0 device built from 4 SSDs. This large discard bio will
be split across the SSDs in the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
The current raid0 code calls generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine them into four large per-SSD
bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.
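
A rough userspace sketch of the chunk-to-device arithmetic described
above. The chunk size, device count and chunk range are just the
example values from this description, not values taken from the patch
itself:

	/* Simulate mapping chunks C1..C24 of a single-zone raid0 onto 4
	 * devices and collapsing each device's chunks into one contiguous
	 * (start, length) discard range. */
	#include <stdio.h>

	int main(void)
	{
		const unsigned int chunk_sects = 1024;	/* 512 KiB chunks */
		const int nr_devs = 4;
		const int first_chunk = 1, last_chunk = 24;
		int dev, c;

		for (dev = 0; dev < nr_devs; dev++) {
			int first = -1, last = -1;

			/* chunk c (1-based) lands on device (c - 1) % nr_devs,
			 * at per-device chunk offset (c - 1) / nr_devs */
			for (c = first_chunk; c <= last_chunk; c++) {
				if ((c - 1) % nr_devs != dev)
					continue;
				if (first < 0)
					first = (c - 1) / nr_devs;
				last = (c - 1) / nr_devs;
			}
			printf("dev %d: one DISCARD at sector %u, %u sectors\n",
			       dev, first * chunk_sects,
			       (last - first + 1) * chunk_sects);
		}
		return 0;
	}

For the example above it prints the same (start, length) pair for each
of the four devices, i.e. four combined bios instead of 24 chunk-sized
ones.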

The code is not simple; I need more time to write text explaining how
it works. For now, please treat it as a proof of concept.

Changelogs
v1, Initial prototype.
v2, Major changes include,
    - Rename functions: handle_discard_bio() now takes care of
      chunk-sized DISCARD bios and the single-disk situation; large
      DISCARD bios are handled in handle_large_discard_bio().
    - Set max_discard_sectors to the raid0 device size.
    - Fix several bugs found in basic testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+	sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors'; we need to
+	 * handle the remaining sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* The bio may not be chunk size aligned, so the split bio on the
+	 * first rdev may not be chunk size aligned either. But the split
+	 * bios on the remaining rdevs must be chunk size aligned, rounded
+	 * down to a chunk boundary.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+					"capability, ignore extra " \
+					"TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors belong to the next zone; we
+		 * won't handle them in this zone, so set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors ++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* If .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors via
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors ++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors ++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs now contains the re-ordered request layout; we can
+	 * chain split bios from it.
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end -
+			  sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2016-12-08 16:44                 ` Coly Li
  2016-12-08 19:19                   ` Shaohua Li
@ 2017-06-29 15:15                   ` Avi Kivity
  2017-06-29 15:31                     ` Coly Li
  1 sibling, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2017-06-29 15:15 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block



On 12/08/2016 06:44 PM, Coly Li wrote:
> On 2016/12/8 上午12:59, Shaohua Li wrote:
>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> [snip]
>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>> zones make things a little complicated though. I just had a brief look of your
>> proposed patch, which looks really complicated. I'd suggest something like
>> this:
>> 1. split the bio according to zone boundary.
>> 2. handle the splitted bio. since the bio is within zone range, calculating
>> the start and end sector for each rdev should be easy.
>>
> Hi Shaohua,
>
> Thanks for your suggestion! I tried to modify the code as you
> suggested; it is even harder to make the code work that way ...
>
> Because even if we split bios per zone, all the corner cases still exist
> and must be taken care of in every zone. The code will be more complicated.
>

Hi Coly,

Did you manage to complete this patch? We are seeing its effect, not 
only with mkfs, but also with fstrim(8).

Avi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2017-06-29 15:15                   ` Avi Kivity
@ 2017-06-29 15:31                     ` Coly Li
  2017-06-29 15:36                       ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Coly Li @ 2017-06-29 15:31 UTC (permalink / raw)
  To: Avi Kivity, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block

On 2017/6/29 下午11:15, Avi Kivity wrote:
> 
> 
> On 12/08/2016 06:44 PM, Coly Li wrote:
>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>> [snip]
>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The
>>> raid0
>>> zones make things a little complicated though. I just had a brief
>>> look of your
>>> proposed patch, which looks really complicated. I'd suggest something
>>> like
>>> this:
>>> 1. split the bio according to zone boundary.
>>> 2. handle the splitted bio. since the bio is within zone range,
>>> calculating
>>> the start and end sector for each rdev should be easy.
>>>
>> Hi Shaohua,
>>
>> Thanks for your suggestion! I tried to modify the code as you
>> suggested; it is even harder to make the code work that way ...
>>
>> Because even if we split bios per zone, all the corner cases still
>> exist and must be taken care of in every zone. The code will be more
>> complicated.
>>
> 
> Hi Coly,
> 
> Did you manage to complete this patch? We are seeing its effect, not
> only with mkfs, but also with fstrim(8).

Hi Avi,

Shaohua wrote another, much better patch, which was merged into the
mainline kernel in v4.12-rc2.

The commit is '29efc390b946 ("md/md0: optimize raid0 discard handling")'.

Hope this is informative.
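
For anyone curious, below is a minimal userspace sketch of the
zone-split idea Shaohua suggested earlier in the thread: clip the
DISCARD range to a single zone, then compute one start sector and
length per member device. The device count, chunk size and discard
range are made-up example values, and this only illustrates the idea;
it is not the code from commit 29efc390b946.

	#include <stdio.h>

	typedef unsigned long long sector_t;

	int main(void)
	{
		const sector_t chunk_sects = 1024;
		const int nr_devs = 4;
		/* discard range, already clipped to one zone and assumed to
		 * cover at least one full chunk on every device */
		sector_t start = 0, end = 24 * chunk_sects;
		int dev;

		for (dev = 0; dev < nr_devs; dev++) {
			sector_t first_chunk, last_chunk;

			/* first chunk >= start that maps to this device */
			first_chunk = start / chunk_sects;
			while (first_chunk % nr_devs != (sector_t)dev)
				first_chunk++;
			/* last chunk below end that maps to this device */
			last_chunk = (end - 1) / chunk_sects;
			while (last_chunk % nr_devs != (sector_t)dev)
				last_chunk--;

			printf("rdev %d: discard dev-sector %llu, %llu sectors\n",
			       dev,
			       (first_chunk / nr_devs) * chunk_sects,
			       ((last_chunk - first_chunk) / nr_devs + 1) *
			       chunk_sects);
		}
		return 0;
	}

In the kernel, each per-device range would then be issued as a single
discard bio per rdev, instead of one bio per chunk.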

Coly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: raid0 vs. mkfs
  2017-06-29 15:31                     ` Coly Li
@ 2017-06-29 15:36                       ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2017-06-29 15:36 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block



On 06/29/2017 06:31 PM, Coly Li wrote:
> On 2017/6/29 下午11:15, Avi Kivity wrote:
>>
>> On 12/08/2016 06:44 PM, Coly Li wrote:
>>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>>> [snip]
>>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The
>>>> raid0
>>>> zones make things a little complicated though. I just had a brief
>>>> look of your
>>>> proposed patch, which looks really complicated. I'd suggest something
>>>> like
>>>> this:
>>>> 1. split the bio according to zone boundary.
>>>> 2. handle the splitted bio. since the bio is within zone range,
>>>> calculating
>>>> the start and end sector for each rdev should be easy.
>>>>
>>> Hi Shaohua,
>>>
>>> Thanks for your suggestion! I tried to modify the code as you
>>> suggested; it is even harder to make the code work that way ...
>>>
>>> Because even if we split bios per zone, all the corner cases still
>>> exist and must be taken care of in every zone. The code will be more
>>> complicated.
>>>
>> Hi Coly,
>>
>> Did you manage to complete this patch? We are seeing its effect, not
>> only with mkfs, but also with fstrim(8).
> Hi Avi,
>
> Shaohua wrote another, much better patch, which was merged into the
> mainline kernel in v4.12-rc2.
>
> The commit is '29efc390b946 ("md/md0: optimize raid0 discard handling")'.
>
> Hope this is informative.
>

Thanks a lot, that's great news.



^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-06-29 15:36 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
2016-11-27 17:09 ` Coly Li
2016-11-27 17:25   ` Avi Kivity
2016-11-27 19:25     ` Doug Dumitru
2016-11-28  4:11 ` Chris Murphy
2016-11-28  7:28   ` Avi Kivity
2016-11-28  7:33     ` Avi Kivity
2016-11-28  5:09 ` NeilBrown
2016-11-28  6:08   ` Shaohua Li
2016-11-28  7:38   ` Avi Kivity
2016-11-28  8:40     ` NeilBrown
2016-11-28  8:58       ` Avi Kivity
2016-11-28  9:00         ` Christoph Hellwig
2016-11-28  9:11           ` Avi Kivity
2016-11-28  9:15             ` Coly Li
2016-11-28 17:47             ` Shaohua Li
2016-11-29 21:14         ` NeilBrown
2016-11-29 22:45           ` Avi Kivity
2016-12-07  5:08             ` Mike Snitzer
2016-12-07 11:50             ` Coly Li
2016-12-07 12:03               ` Coly Li
2016-12-07 16:59               ` Shaohua Li
2016-12-08 16:44                 ` Coly Li
2016-12-08 19:19                   ` Shaohua Li
2016-12-09  7:34                     ` Coly Li
2016-12-12  3:17                       ` NeilBrown
2017-06-29 15:15                   ` Avi Kivity
2017-06-29 15:31                     ` Coly Li
2017-06-29 15:36                       ` Avi Kivity
2017-01-22 18:01 ` Avi Kivity
2017-01-23 12:26   ` Coly Li
