* raid0 vs. mkfs
@ 2016-11-27 15:24 Avi Kivity
  2016-11-27 17:09 ` Coly Li
  ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread

From: Avi Kivity @ 2016-11-27 15:24 UTC (permalink / raw)
To: linux-raid

mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
disk that supports TRIM/DISCARD (erase whichever is inappropriate).
That is because mkfs issues a TRIM/DISCARD for the entire partition.
As far as I can tell, md converts the large TRIM/DISCARD into a large
number of TRIM/DISCARD requests, one per chunk-size worth of disk, and
issues them to the RAID components individually.

It seems to me that md could convert the large TRIM/DISCARD request it
gets into one TRIM/DISCARD per RAID component, converting an
O(disk size / chunk size) operation into an O(number of RAID components)
operation, which is much faster.

I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
the operation taking about a quarter of an hour, continuously pushing
half-megabyte TRIM/DISCARD requests to the disk. Linux 4.1.12.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Coly Li @ 2016-11-27 17:09 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-raid

On 2016/11/27 11:24 PM, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
> That is because mkfs issues a TRIM/DISCARD for the entire partition.
> As far as I can tell, md converts the large TRIM/DISCARD into a large
> number of TRIM/DISCARD requests, one per chunk-size worth of disk, and
> issues them to the RAID components individually.
>
> It seems to me that md could convert the large TRIM/DISCARD request it
> gets into one TRIM/DISCARD per RAID component, converting an
> O(disk size / chunk size) operation into an O(number of RAID
> components) operation, which is much faster.
>
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD requests to the disk. Linux 4.1.12.

It might be possible to improve DISCARD performance a bit along the
lines you suggest. The implementation might be tricky, but it is worth
trying.

Indeed, it is not only for DISCARD; it might help read and write
performance as well. We can check the bio size: if

	bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD

it means that on each underlying device we have more than SOMETHRESHOLD
contiguous chunks to issue, and they can be merged into a larger bio.

IMHO it's interesting, good suggestion!
Coly

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Avi Kivity @ 2016-11-27 17:25 UTC (permalink / raw)
To: Coly Li; +Cc: linux-raid

On 11/27/2016 07:09 PM, Coly Li wrote:
> On 2016/11/27 11:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> [...]
> It might be possible to improve a bit for DISCARD performance, by your
> suggestion. The implementation might be tricky, but it is worthy to try.
>
> Indeed, it is not only for DISCARD, for read or write, it might be
> helpful for better performance as well. We can check the bio size, if,
> 	bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
> it means on each underlying device, we have more than SOMETHRESHOLD
> continuous chunks to issue, and they can be merged into a larger bio.
It's true that this does not apply only to TRIM/DISCARD, but to see any
gain for READ/WRITE you need a request that is larger than
(chunk size) * (raid elements), which is unlikely for reasonable values
of those parameters. But a common implementation can of course work for
multiple request types.

> IMHO it's interesting, good suggestion!

Looking forward to seeing an implementation!

> Coly

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Doug Dumitru @ 2016-11-27 19:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: Coly Li, linux-raid

I recently ran into this issue with a proprietary device mapper target
that supports discard.

mkfs.ext4 looks like it issues 2GB discard requests. blkdiscard looks
like it issues 4GB-4K discard requests. Both of these are "way bigger"
than the iolimits for transfers. At least this is what I see at my
device mapper layer. raid0 might get some additional filtering by the
common raid code.

In the case of my mapper, I actually need to split the bio up and
re-issue the discards at iolimits sizes (this is how my device mapper
expects requests). Fortunately, my mapper is really fast at discards
even at 1MB each (> 8GB/sec on a single thread), so the performance
issue is not that bad.

It would be an easy patch for raid0 to be "smarter" about splitting the
discard request. It might not actually help that much. You should test
your NVMe disk to see whether the performance of discards is much
different between "chunk size" requests and "big" requests. Using
blkdiscard in a script, fill a drive with real data and test discard
speed, first using 256K calls to blkdiscard, and then again using 512MB
calls to blkdiscard. Do this to a single drive. I suspect that the
times will not be that far off.

Some drives take a real amount of time to process discards. Even though
it seems like the operation does nothing, the FTL inside of the SSD is
still getting hammered pretty hard. If your drives are a "lot" faster
with bigger discard requests, then maybe it would make sense to
optimize raid0. I suspect the win is not that big.

In terms of enlarging IO, the iolimits and buffering start to come into
play. With a discard, the bio only has a size and does not have any
actual buffers. If you push IO really big, then the size of the bio
starts to grow.
1MB is 256 4K biovecs. A bio_vec is a pointer plus two ints, so it is
16 bytes long (on x86_64). 256 of these just happen to fit into a
single page. This is a linear array, so making this bigger is hard.
Remember that much of the kernel lives inside of pages, and pages
(usually 4K) are somewhat of a deity over the entire kernel.

Then again, you have another option to format your array that will be
very fast and even more effective:

a) secure erase the drives
b) create your raid0 array
c) create your file system with -o nodiscard

Doug Dumitru

On Sun, Nov 27, 2016 at 9:25 AM, Avi Kivity <avi@scylladb.com> wrote:
> On 11/27/2016 07:09 PM, Coly Li wrote:
>> On 2016/11/27 11:24 PM, Avi Kivity wrote:
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> [...]
>> It might be possible to improve a bit for DISCARD performance, by your
>> suggestion. The implementation might be tricky, but it is worthy to try.
>> [...]
>
> It's true that this does not apply only to TRIM/DISCARD, but to see any
> gain for READ/WRITE, you need a request that is larger than
> (chunk size) * (raid elements), which is unlikely for reasonable values
> of those parameters. But a common implementation can of course work for
> multiple request types.
>
>> IMHO it's interesting, good suggestion!
>
> Looking forward to seeing an implementation!

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Chris Murphy @ 2016-11-28  4:11 UTC (permalink / raw)
To: Avi Kivity; +Cc: Linux-RAID

On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
> that supports TRIM/DISCARD (erase whichever is inappropriate)

Trim is the appropriate term. The term "discard" refers to a specific
mount-time implementation of the FITRIM ioctl, and fstrim refers to a
user-space tool that does the same and can be scheduled or issued
manually.

> That is because mkfs issues a TRIM for the entire partition. As far as
> I can tell, md converts the large TRIM into a large number of TRIM
> requests, one per chunk-size worth of disk, and issues them to the
> RAID components individually.

You could strace the mkfs command. Each filesystem was doing it a
little differently the last time I compared mkfs.xfs and mkfs.btrfs;
but I can't qualify the differences relative to how the device is
going to react to those commands.

It's also possible to enable block device tracing and see the actual
SCSI or ATA commands sent to a drive.

There's a metric f tonne of bugs in this area, so before anything else
I'd consider researching whether there's a firmware update for your
hardware, applying that, and retesting. And then, after testing your
ideal deployed version, also use something much closer to upstream
(Arch or Fedora) and see if the problem is reproducible.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Avi Kivity @ 2016-11-28  7:28 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux-RAID

On 11/28/2016 06:11 AM, Chris Murphy wrote:
> On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
>> that supports TRIM/DISCARD (erase whichever is inappropriate)
> Trim is the appropriate term. Term discard refers to a specific mount
> time implementation of FITRIM ioctl, and fstrim refers to a user space
> tool that does the same and can be scheduled or issued manually.

That's good to know.

>> That is because mkfs issues a TRIM for the entire partition. As far as
>> I can tell, md converts the large TRIM into a large number of TRIM
>> requests, one per chunk-size worth of disk, and issues them to the
>> RAID components individually.
> You could strace the mkfs command.

I did, and saw that it was running a single syscall for the entire run.
I verified in the sources that mkfs.xfs issues a single BLKDISCARD (?!)
ioctl spanning the entire device.

> Each filesystem is doing it a
> little differently the last time I compared mkfs.xfs and mkfs.btrfs;
> but I can't qualify the differences relative to how the device is
> going to react to those commands.
>
> It's also possible to enable block device tracing and see the actual
> SCSI or ATA commands sent to a drive.

I did, and saw a ton of half-megabyte TRIMs. It's an NVMe device, so
not SCSI or SATA.
Here's a sample (I only blktraced one of the members):

259,1   10     1090     0.379688898  4801  Q   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1091     0.379689222  4801  G   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1092     0.379690304  4801  I   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1093     0.379703110  2307  D   D 3238067200 + 1024 [kworker/10:1H]
259,1    1      589     0.379718918     0  C   D 3231849472 + 1024 [0]
259,1   10     1094     0.379735215  4801  Q   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1095     0.379735548  4801  G   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1096     0.379736598  4801  I   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1097     0.379753077  2307  D   D 3238068224 + 1024 [kworker/10:1H]
259,1    1      590     0.379782139     0  C   D 3231850496 + 1024 [0]
259,1   10     1098     0.379785399  4801  Q   D 3238069248 + 1024 [mkfs.xfs]
259,1   10     1099     0.379785657  4801  G   D 3238069248 + 1024 [mkfs.xfs]
259,1   10     1100     0.379786562  4801  I   D 3238069248 + 1024 [mkfs.xfs]
259,1   10     1101     0.379800116  2307  D   D 3238069248 + 1024 [kworker/10:1H]
259,1   10     1102     0.379829822  4801  Q   D 3238070272 + 1024 [mkfs.xfs]
259,1   10     1103     0.379830156  4801  G   D 3238070272 + 1024 [mkfs.xfs]
259,1   10     1104     0.379831015  4801  I   D 3238070272 + 1024 [mkfs.xfs]
259,1   10     1105     0.379844120  2307  D   D 3238070272 + 1024 [kworker/10:1H]
259,1   10     1106     0.379877825  4801  Q   D 3238071296 + 1024 [mkfs.xfs]
259,1   10     1107     0.379878173  4801  G   D 3238071296 + 1024 [mkfs.xfs]
259,1   10     1108     0.379879028  4801  I   D 3238071296 + 1024 [mkfs.xfs]
259,1    1      591     0.379886451     0  C   D 3231851520 + 1024 [0]
259,1   10     1109     0.379898178  2307  D   D 3238071296 + 1024 [kworker/10:1H]
259,1   10     1110     0.379923982  4801  Q   D 3238072320 + 1024 [mkfs.xfs]
259,1   10     1111     0.379924229  4801  G   D 3238072320 + 1024 [mkfs.xfs]
259,1   10     1112     0.379925054  4801  I   D 3238072320 + 1024 [mkfs.xfs]
259,1   10     1113     0.379937716  2307  D   D 3238072320 + 1024 [kworker/10:1H]
259,1    1      592     0.379954380     0  C   D 3231852544 + 1024 [0]
259,1   10     1114     0.379970091  4801  Q   D 3238073344 + 1024 [mkfs.xfs]
259,1   10     1115     0.379970341  4801  G   D 3238073344 + 1024 [mkfs.xfs]
259,1   10     1116     0.379971260  4801  I   D 3238073344 + 1024 [mkfs.xfs]
259,1   10     1117     0.379984303  2307  D   D 3238073344 + 1024 [kworker/10:1H]
259,1   10     1118     0.380014754  4801  Q   D 3238074368 + 1024 [mkfs.xfs]
259,1   10     1119     0.380015075  4801  G   D 3238074368 + 1024 [mkfs.xfs]
259,1   10     1120     0.380015903  4801  I   D 3238074368 + 1024 [mkfs.xfs]
259,1   10     1121     0.380028655  2307  D   D 3238074368 + 1024 [kworker/10:1H]
259,1    2      170     0.380054279     0  C   D 3218706432 + 1024 [0]
259,1   10     1122     0.380060773  4801  Q   D 3238075392 + 1024 [mkfs.xfs]
259,1   10     1123     0.380061024  4801  G   D 3238075392 + 1024 [mkfs.xfs]
259,1   10     1124     0.380062093  4801  I   D 3238075392 + 1024 [mkfs.xfs]
259,1   10     1125     0.380072940  2307  D   D 3238075392 + 1024 [kworker/10:1H]
259,1   10     1126     0.380107437  4801  Q   D 3238076416 + 1024 [mkfs.xfs]
259,1   10     1127     0.380107882  4801  G   D 3238076416 + 1024 [mkfs.xfs]
259,1   10     1128     0.380109258  4801  I   D 3238076416 + 1024 [mkfs.xfs]
259,1   10     1129     0.380123914  2307  D   D 3238076416 + 1024 [kworker/10:1H]
259,1    2      171     0.380130823     0  C   D 3218707456 + 1024 [0]
259,1   10     1130     0.380156971  4801  Q   D 3238077440 + 1024 [mkfs.xfs]
259,1   10     1131     0.380157308  4801  G   D 3238077440 + 1024 [mkfs.xfs]
259,1   10     1132     0.380158354  4801  I   D 3238077440 + 1024 [mkfs.xfs]
259,1   10     1133     0.380168948  2307  D   D 3238077440 + 1024 [kworker/10:1H]
259,1    2      172     0.380186647     0  C   D 3218708480 + 1024 [0]
259,1   10     1134     0.380197495  4801  Q   D 3238078464 + 1024 [mkfs.xfs]
259,1   10     1135     0.380197848  4801  G   D 3238078464 + 1024 [mkfs.xfs]
259,1   10     1136     0.380198724  4801  I   D 3238078464 + 1024 [mkfs.xfs]
259,1   10     1137     0.380202964  2307  D   D 3238078464 + 1024 [kworker/10:1H]
259,1   10     1138     0.380237133  4801  Q   D 3238079488 + 1024 [mkfs.xfs]
259,1   10     1139     0.380237393  4801  G   D 3238079488 + 1024 [mkfs.xfs]
259,1   10     1140     0.380238333  4801  I   D 3238079488 + 1024 [mkfs.xfs]
259,1   10     1141     0.380252580  2307  D   D 3238079488 + 1024 [kworker/10:1H]
259,1    2      173     0.380260605     0  C   D 3218709504 + 1024 [0]
259,1   10     1142     0.380283800  4801  Q   D 3238080512 + 1024 [mkfs.xfs]
259,1   10     1143     0.380284158  4801  G   D 3238080512 + 1024 [mkfs.xfs]
259,1   10     1144     0.380285150  4801  I   D 3238080512 + 1024 [mkfs.xfs]
259,1   10     1145     0.380297127  2307  D   D 3238080512 + 1024 [kworker/10:1H]
259,1   10     1146     0.380324340  4801  Q   D 3238081536 + 1024 [mkfs.xfs]
259,1   10     1147     0.380324648  4801  G   D 3238081536 + 1024 [mkfs.xfs]
259,1   10     1148     0.380325663  4801  I   D 3238081536 + 1024 [mkfs.xfs]
259,1    2      174     0.380328083     0  C   D 3218710528 + 1024 [0]

So we see these half-megabyte (1024-sector) requests; moreover, they
are issued sequentially.

> There's a metric f tonne of bugs in this area so before anything I'd
> consider researching if there's a firmware update for your hardware
> and applying that and retesting.

I don't have access to that machine any more (I could get some with a
bit of trouble). But I think it's clear from the traces that the
problem is in the RAID layer?

> And then also after testing your
> ideal deployed version, use something much close to upstream (Arch or
> Fedora) and see if the problem is reproducible.

I'm hoping the RAID maintainers can confirm at a glance whether the
problem exists or not; it doesn't look like a minor glitch, but simply
that this code path doesn't take the issue into account.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Avi Kivity @ 2016-11-28  7:33 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux-RAID

On 11/28/2016 09:28 AM, Avi Kivity wrote:
> 259,1   10     1140     0.380238333  4801  I   D 3238079488 + 1024 [mkfs.xfs]
> 259,1   10     1141     0.380252580  2307  D   D 3238079488 + 1024 [kworker/10:1H]
> [...]
> 259,1    2      174     0.380328083     0  C   D 3218710528 + 1024 [0]
>
> So we see these half-megabyte requests; moreover, they are issued
> sequentially.

Correction, they are issued in parallel, but not merged. The "C"
entries correspond to requests issued far in the past.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: NeilBrown @ 2016-11-28  5:09 UTC (permalink / raw)
To: Avi Kivity, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]

On Mon, Nov 28 2016, Avi Kivity wrote:

> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
> [...]
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD requests to the disk. Linux 4.1.12.

Surely it is the task of the underlying driver, or the queuing
infrastructure, to merge small requests into large requests.
Why should this have anything to do with RAID0?

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Shaohua Li @ 2016-11-28  6:08 UTC (permalink / raw)
To: NeilBrown; +Cc: Avi Kivity, linux-raid

On Mon, Nov 28, 2016 at 04:09:46PM +1100, Neil Brown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> [...]
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.
> Why should this have anything to do with RAID0?

That's what I'm wondering too. The block layer should merge the small
requests into larger ones, and it does, IIRC. There is one bug related
to this (commit 9c573de3283af007e), but it only applies to 4.3+.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: Avi Kivity @ 2016-11-28  7:38 UTC (permalink / raw)
To: NeilBrown, linux-raid

On 11/28/2016 07:09 AM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> [...]
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.

Here's a blkparse of that run. As can be seen, there is no concurrency,
so nobody down the stack has any chance of merging anything.

259,1   10     1090     0.379688898  4801  Q   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1091     0.379689222  4801  G   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1092     0.379690304  4801  I   D 3238067200 + 1024 [mkfs.xfs]
259,1   10     1093     0.379703110  2307  D   D 3238067200 + 1024 [kworker/10:1H]
259,1    1      589     0.379718918     0  C   D 3231849472 + 1024 [0]
259,1   10     1094     0.379735215  4801  Q   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1095     0.379735548  4801  G   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1096     0.379736598  4801  I   D 3238068224 + 1024 [mkfs.xfs]
259,1   10     1097     0.379753077  2307  D   D 3238068224 + 1024 [kworker/10:1H]
[... same pattern as the trace posted earlier in the thread ...]
259,1   10     1114     0.379970091  4801  Q   D 3238073344 + 1024 [mkfs.xfs]
259,1   10     1115     0.379970341  4801  G   D 3238073344 + 1024 [mkfs.xfs]

No merging was happening. This is an NVMe drive, so running with the
noop scheduler (which should still merge). Does the queuing layer
merge trims?

I don't think it's the queuing layer's job, though. At the I/O
scheduler you can merge to clean up sloppy patterns from the upper
layer, but each layer should try to generate the best pattern it can.
Large merges mean increased latency for the first request in the chain,
forcing the I/O scheduler to make a decision which can harm the
workload. By generating merged requests in the first place, the upper
layer removes the need to make that tradeoff (splitting the requests
removes information: "we are interested only in when all of the range
is trimmed, not any particular request").

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs

From: NeilBrown @ 2016-11-28  8:40 UTC (permalink / raw)
To: Avi Kivity, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3556 bytes --]

On Mon, Nov 28 2016, Avi Kivity wrote:
> On 11/28/2016 07:09 AM, NeilBrown wrote:
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> [...]
>> Surely it is the task of the underlying driver, or the queuing
>> infrastructure, to merge small requests into large requests.
>
> Here's a blkparse of that run. As can be seen, there is no concurrency,
> so nobody down the stack has any chance of merging anything.

That isn't a valid conclusion to draw.
raid0 effectively calls the make_request_fn function that is registered
by the underlying driver.
If that function handles DISCARD synchronously, then you won't see any
concurrency, and that is because the driver chose not to queue but to
handle directly.
I don't know if it actually does this though. I don't know the insides
of the nvme driver .... there seems to be a lightnvm thing and a scsi
thing and a pci thing and it all confuses me.

>
> No merging was happening. This is an NVMe drive, so running with the
> noop scheduler (which should still merge). Does the queuing layer
> merge trims?

I wish I knew. I once thought I understood about half of the block
queuing code, but now with multi-queue, I'll need to learn it all
again. :-(

>
> I don't think it's the queuing layer's job, though. At the I/O
> scheduler you can merge to clean up sloppy patterns from the upper
> layer, but each layer should try to generate the best pattern it can.

Why? How does it know what is best for the layer below?

> Large merges mean increased latency for the first request in the chain,
> forcing the I/O scheduler to make a decision which can harm the
> workload. By generating merged requests in the first place, the upper
> layer removes the need to make that tradeoff (splitting the requests
> removes information: "we are interested only in when all of the range is
> trimmed, not any particular request").

If it is easy for the upper layer to break a very large request into a
few very large requests, then I wouldn't necessarily object. But unless
it is very hard for the lower layer to merge requests, it should be
doing that too.

When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the
make_request_fn, it performs REQ_OP_DISCARD synchronously. I would
suggest that is a very poor design. I don't know if that is affecting
you (though a printk would find out).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 8:40 ` NeilBrown @ 2016-11-28 8:58 ` Avi Kivity 2016-11-28 9:00 ` Christoph Hellwig 2016-11-29 21:14 ` NeilBrown 0 siblings, 2 replies; 31+ messages in thread From: Avi Kivity @ 2016-11-28 8:58 UTC (permalink / raw) To: NeilBrown, linux-raid On 11/28/2016 10:40 AM, NeilBrown wrote: > On Mon, Nov 28 2016, Avi Kivity wrote: > >> On 11/28/2016 07:09 AM, NeilBrown wrote: >>> On Mon, Nov 28 2016, Avi Kivity wrote: >>> >>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large >>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate). >>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is >>>> inappropriate) for the entire partition. As far as I can tell, md >>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into >>>> a large number of TRIM/DISCARD (erase whichever is inappropriate) >>>> requests, one per chunk-size worth of disk, and issues them to the RAID >>>> components individually. >>>> >>>> >>>> It seems to me that md can convert the large TRIM/DISCARD (erase >>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase >>>> whichever is inappropriate) per RAID component, converting an O(disk >>>> size / chunk size) operation into an O(number of RAID components) >>>> operation, which is much faster. >>>> >>>> >>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with >>>> the operation taking about a quarter of an hour, continuously pushing >>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests >>>> to the disk. Linux 4.1.12. >>> Surely it is the task of the underlying driver, or the queuing >>> infrastructure, to merge small requests into large requests. >> Here's a blkparse of that run. As can be seen, there is no concurrency, >> so nobody down the stack has any chance of merging anything. > That isn't a valid conclusion to draw. It's actually wrong, as I noted later. 
The raid layer issues concurrent requests, but they are not merged. > raid0 effectively calls the make_request_fn function that is registered > by the underlying driver. > If that function handles DISCARD synchronously, then you won't see any > concurrency, and that is because the driver chose not to queue but to > handle directly. > I don't know if it actually does this though. I don't know the insides > of the nvme driver .... there seems to be a lightnvm thing and a scsi > thing and a pci thing and it all confuses me. NVMe only has queued TRIMs. Of course, the driver could be issuing them synchronously, but that's doubtful, since NVMe is such a simple protocol. What I guess is happening is that since the NVMe queue depth is so high, any request the driver receives is sent immediately to the disk and there is nothing to merge it to. That could indicate the absence of plugging, or just a reluctance to merge TRIMs. > >> >> No merging was happening. This is an NVMe drive, so running with the >> noop scheduler (which should still merge). Does the queuing layer >> merge trims? > I wish I knew. I once thought I understood about half of the block > queuing code, but now with multi-queue, I'll need to learn it all > again. :-( > >> I don't think it's the queuing layer's job, though. At the I/O >> scheduler you can merge to clean up sloppy patterns from the upper >> layer, but each layer should try to generate the best pattern it can. > Why? How does it know what is best for the layer below? A large request can be split at the lower layer without harming the upper layer (beyond the effort involved). Merging many small requests at the lower layer can harm the upper layer, if it depends on the latency of individual requests. In general in software engineering, when you destroy information, there is a potential for inefficiency.
In this case the information destroyed is that we are only interested in when all of the range is TRIMmed, not any particular subrange, and the potential for inefficiency is indeed realized. > >> Large merges mean increased latency for the first request in the chain, >> forcing the I/O scheduler to make a decision which can harm the >> workload. By generating merged requests in the first place, the upper >> layer removes the need to make that tradeoff (splitting the requests >> removes information: "we are interested only in when all of the range is >> trimmed, not any particular request"). > If it is easy for the upper layer to break a very large request into a > few very large requests, then I wouldn't necessarily object. I can't see why it would be hard. It's simple arithmetic. > But unless it is very hard for the lower layer to merge requests, it > should be doing that too. Merging has tradeoffs. When you merge requests R1, R2, ... Rn you make the latency of request R1 the sum of the latencies of R1..Rn. You may gain some efficiency in the process, but that's not going to make up for a factor of n. The queuing layer has no way to tell whether the caller is interested in the latency of individual requests. By sending large requests, the caller indicates it's not interested in the latency of individual subranges. The queuing layer is still free to internally split the request into smaller ranges, to satisfy hardware constraints, or to reduce worst-case latencies for competing request streams. So I disagree that all the work should be pushed to the merging layer. It has less information to work with, so the fewer decisions it has to make, the better. > When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the This does not seem to be an NVMe driver. > make_request_fn, it performs REQ_OP_DISCARD synchronously. I would > suggest that is a very poor design. I don't know if that is affecting > you (though a printk would find out).
> > NeilBrown ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 8:58 ` Avi Kivity @ 2016-11-28 9:00 ` Christoph Hellwig 2016-11-28 9:11 ` Avi Kivity 2016-11-29 21:14 ` NeilBrown 1 sibling, 1 reply; 31+ messages in thread From: Christoph Hellwig @ 2016-11-28 9:00 UTC (permalink / raw) To: Avi Kivity; +Cc: NeilBrown, linux-raid On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote: > What I guess is happening is that since the NVMe queue depth is so high, and > request the driver receives is sent immediately to the disk and there is > nothing to merge it to. That could indicate the absence of plugging, or > just a reluctance to merge TRIMs. That is exactly the case, and it's also an issue with online trim. I have work in progress block layer trim patches that always plug trims and have various TRIM merge improvements including support for ranged TRIMs. It needs a bit more work, but I hope I can post it later this week. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 9:00 ` Christoph Hellwig @ 2016-11-28 9:11 ` Avi Kivity 2016-11-28 9:15 ` Coly Li 2016-11-28 17:47 ` Shaohua Li 0 siblings, 2 replies; 31+ messages in thread From: Avi Kivity @ 2016-11-28 9:11 UTC (permalink / raw) To: Christoph Hellwig; +Cc: NeilBrown, linux-raid On 11/28/2016 11:00 AM, Christoph Hellwig wrote: > On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote: >> What I guess is happening is that since the NVMe queue depth is so high, and >> request the driver receives is sent immediately to the disk and there is >> nothing to merge it to. That could indicate the absence of plugging, or >> just a reluctance to merge TRIMs. > That is exactly the case, and it's also an issue with online trim. > I have work in progress block layer trim patches that always plug trims > and have various TRIM merge improvements including support for ranged > TRIMs. It needs a bit more work, but I hope I can post it later > this week. Great, good to know. I still think it should also be fixed in the RAID layer. There's no reason to break a single request in millions of smaller ones, then try to merge them into one request back again. The queuing layer can merge when it's given bad patterns from uncontrolled sources, not as an excuse to generate bad patterns from within the kernel. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 9:11 ` Avi Kivity @ 2016-11-28 9:15 ` Coly Li 2016-11-28 17:47 ` Shaohua Li 1 sibling, 0 replies; 31+ messages in thread From: Coly Li @ 2016-11-28 9:15 UTC (permalink / raw) To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid On 2016/11/28 下午5:11, Avi Kivity wrote: > On 11/28/2016 11:00 AM, Christoph Hellwig wrote: >> On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote: >>> What I guess is happening is that since the NVMe queue depth is so >>> high, and >>> request the driver receives is sent immediately to the disk and there is >>> nothing to merge it to. That could indicate the absence of plugging, or >>> just a reluctance to merge TRIMs. >> That is exactly the case, and it's also an issue with online trim. >> I have work in progress block layer trim patches that always plug trims >> and have various TRIM merge improvements including support for ranged >> TRIMs. It needs a bit more work, but I hope I can post it later >> this week. > > Great, good to know. > > I still think it should also be fixed in the RAID layer. There's no > reason to break a single request in millions of smaller ones, then try > to merge them into one request back again. The queuing layer can merge > when it's given bad patterns from uncontrolled sources, not as an excuse > to generate bad patterns from within the kernel. Personally I agree with your opinion. We can improve both in block layer and md code. Coly ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 9:11 ` Avi Kivity 2016-11-28 9:15 ` Coly Li @ 2016-11-28 17:47 ` Shaohua Li 1 sibling, 0 replies; 31+ messages in thread From: Shaohua Li @ 2016-11-28 17:47 UTC (permalink / raw) To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid On Mon, Nov 28, 2016 at 11:11:18AM +0200, Avi Kivity wrote: > On 11/28/2016 11:00 AM, Christoph Hellwig wrote: > > On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote: > > > What I guess is happening is that since the NVMe queue depth is so high, and > > > request the driver receives is sent immediately to the disk and there is > > > nothing to merge it to. That could indicate the absence of plugging, or > > > just a reluctance to merge TRIMs. > > That is exactly the case, and it's also an issue with online trim. > > I have work in progress block layer trim patches that always plug trims > > and have various TRIM merge improvements including support for ranged > > TRIMs. It needs a bit more work, but I hope I can post it later > > this week. > > Great, good to know. > > I still think it should also be fixed in the RAID layer. There's no reason > to break a single request in millions of smaller ones, then try to merge > them into one request back again. The queuing layer can merge when it's > given bad patterns from uncontrolled sources, not as an excuse to generate > bad patterns from within the kernel. I think the reason is you are using 4.1 kernel. upstream kernel does support request merge even for blk-mq. It didn't at some versions. Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-28 8:58 ` Avi Kivity 2016-11-28 9:00 ` Christoph Hellwig @ 2016-11-29 21:14 ` NeilBrown 2016-11-29 22:45 ` Avi Kivity 1 sibling, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-11-29 21:14 UTC (permalink / raw) To: Avi Kivity, linux-raid [-- Attachment #1: Type: text/plain, Size: 1822 bytes --] On Mon, Nov 28 2016, Avi Kivity wrote: >> If it is easy for the upper layer to break a very large request into a >> few very large requests, then I wouldn't necessarily object. > > I can't see why it would be hard. It's simple arithmetic. That is easy to say before writing the code :-) It probably is easy for RAID0. Less so for RAID10. Even less for RAID6. > >> But unless it is very hard for the lower layer to merge requests, it >> should be doing that too. > > Merging has tradeoffs. When you merge requests R1, R2, ... Rn you make > the latency request R1 sum of the latencies of R1..Rn. You may gain > some efficiency in the process, but that's not going to make up for a > factor of n. The queuing layer has no way to tell whether the caller is > interested in the latency of individual requests. By sending large > requests, the caller indicates it's not interested in the latency of > individual subranges. The queuing layer is still free to internally > split the request to smaller ranges, to satisfy hardware constraints, or > to reduce worst-case latencies for competing request streams. I would have thought that using plug/unplug to group requests is a fairly strong statement that they can be handled as a unit if that is convenient. > > So I disagree that all the work should be pushed to the merging layer. > It has less information to work with, so the fewer decisions it has to > make, the better. I think that the merging layer should be as efficient as it reasonably can be, and particularly should take into account plugging. This benefits all callers. 
If it can be demonstrated that changes to some of the upper layers bring further improvements with acceptable costs, then certainly it is good to have those too. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-29 21:14 ` NeilBrown @ 2016-11-29 22:45 ` Avi Kivity 2016-12-07 5:08 ` Mike Snitzer 2016-12-07 11:50 ` Coly Li 0 siblings, 2 replies; 31+ messages in thread From: Avi Kivity @ 2016-11-29 22:45 UTC (permalink / raw) To: NeilBrown, linux-raid On 11/29/2016 11:14 PM, NeilBrown wrote: > On Mon, Nov 28 2016, Avi Kivity wrote: >>> If it is easy for the upper layer to break a very large request into a >>> few very large requests, then I wouldn't necessarily object. >> I can't see why it would be hard. It's simple arithmetic. > That is easy to say before writing the code :-) > It probably is easy for RAID0. Less so for RAID10. Even less for > RAID6. > Pick the largest subrange within the input range whose bounds are 0 (mod stripe-size); TRIM it (for all members); apply the regular algorithm to the head and tail subranges. Works for all RAID types. If the RAID is undergoing reshaping, exclude the range undergoing reshaping, and treat the two halves separately. >>> But unless it is very hard for the lower layer to merge requests, it >>> should be doing that too. >> Merging has tradeoffs. When you merge requests R1, R2, ... Rn you make >> the latency request R1 sum of the latencies of R1..Rn. You may gain >> some efficiency in the process, but that's not going to make up for a >> factor of n. The queuing layer has no way to tell whether the caller is >> interested in the latency of individual requests. By sending large >> requests, the caller indicates it's not interested in the latency of >> individual subranges. The queuing layer is still free to internally >> split the request to smaller ranges, to satisfy hardware constraints, or >> to reduce worst-case latencies for competing request streams. > I would have thought that using plug/unplug to group requests is a > fairly strong statement that they can be handled as a unit if that is > convenient. It is not.
As an example, consider a read and a few associated read-ahead requests submitted in a batch. The last thing you want is for them to be treated as a unit. Plug/unplug means: I have a bunch of requests here. Whether they should be merged or reordered is orthogonal to whether they are members of a batch or not. > > >> So I disagree that all the work should be pushed to the merging layer. >> It has less information to work with, so the fewer decisions it has to >> make, the better. > I think that the merging layer should be as efficient as it reasonably > can be, and particularly should take into account plugging. This > benefits all callers. Yes, but plugging does not mean "please merge anything you can until the unplug". > If it can be demonstrated that changes to some of the upper layers bring > further improvements with acceptable costs, then certainly it is good to > have those too. Generating millions of requests only to merge them again is inefficient. It happens in an edge case (TRIM of the entirety of a very large RAID), but it already caused one user to believe the system failed. I think the system should be more robust than that. > NeilBrown ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-29 22:45 ` Avi Kivity @ 2016-12-07 5:08 ` Mike Snitzer 2016-12-07 11:50 ` Coly Li 1 sibling, 0 replies; 31+ messages in thread From: Mike Snitzer @ 2016-12-07 5:08 UTC (permalink / raw) To: Avi Kivity; +Cc: NeilBrown, linux-raid On Tue, Nov 29, 2016 at 5:45 PM, Avi Kivity <avi@scylladb.com> wrote: > On 11/29/2016 11:14 PM, NeilBrown wrote: >> >> On Mon, Nov 28 2016, Avi Kivity wrote: >>>> >>>> If it is easy for the upper layer to break a very large request into a >>>> few very large requests, then I wouldn't necessarily object. >>> >>> I can't see why it would be hard. It's simple arithmetic. >> >> That is easy to say before writing the code :-) >> It probably is easy for RAID0. Less so for RAID10. Even less for >> RAID6. >> > > pick the largest subrange wihin the inpu range whose bounds are 0 (mod > stripe-size); TRIM it (for all members); apply the regular algorithm to the > head and tail subranges. Works for all RAID types. If the RAID is > undergoing reshaping, exclude the range undergoing reshaping, and treat the > two halves separately. > >>>> But unless it is very hard for the lower layer to merge requests, it >>>> should be doing that too. >>> >>> Merging has tradeoffs. When you merge requests R1, R2, ... Rn you make >>> the latency request R1 sum of the latencies of R1..Rn. You may gain >>> some efficiency in the process, but that's not going to make up for a >>> factor of n. The queuing layer has no way to tell whether the caller is >>> interested in the latency of individual requests. By sending large >>> requests, the caller indicates it's not interested in the latency of >>> individual subranges. The queuing layer is still free to internally >>> split the request to smaller ranges, to satisfy hardware constraints, or >>> to reduce worst-case latencies for competing request streams. 
>> I would have thought that using plug/unplug to group requests is a >> fairly strong statement that they can be handled as a unit if that is >> convenient. > > > It is not. As an example, consider a read and a few associated read-ahead > requests submitted in a batch. The last thing you want is for them to be > treated as a unit. > > Plug/unplug means: I have a bunch of requests here. Whether they should be > merged or reordered is orthogonal to whether they are members of a batch or > not. > >> >> >>> So I disagree that all the work should be pushed to the merging layer. >>> It has less information to work with, so the fewer decisions it has to >>> make, the better. >> >> I think that the merging layer should be as efficient as it reasonably >> can be, and particularly should take into account plugging. This >> benefits all callers. > > > Yes, but plugging does not mean "please merge anything you can until the > unplug". > >> If it can be demonstrated that changes to some of the upper layers bring >> further improvements with acceptable costs, then certainly it is good to >> have those too. > > > Generating millions of requests only to merge them again is inefficient. It > happens in an edge case (TRIM of the entirety of a very large RAID), but it > already caused on user to believe the system failed. I think the system > should be more robust than that. DM's stripe target (raid0) has had Avi's proposed implementation (breaks large discard into N discards, one per component device) for a long while. See commit 7b76ec11fec4 ("dm stripe: support discards"). This improvement applies to other raid levels too. And I completely agree that it would be a mistake to continue to have the block layer issue millions of chunk-sized bios only to _maybe_ have them be merged by the IO scheduler and/or plugging. Needless make-work that just eats cpu and memory. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-29 22:45 ` Avi Kivity 2016-12-07 5:08 ` Mike Snitzer @ 2016-12-07 11:50 ` Coly Li 2016-12-07 12:03 ` Coly Li 2016-12-07 16:59 ` Shaohua Li 1 sibling, 2 replies; 31+ messages in thread From: Coly Li @ 2016-12-07 11:50 UTC (permalink / raw) To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block [-- Attachment #1: Type: text/plain, Size: 3401 bytes --] On 2016/11/30 上午6:45, Avi Kivity wrote: > On 11/29/2016 11:14 PM, NeilBrown wrote: [snip] >>> So I disagree that all the work should be pushed to the merging layer. >>> It has less information to work with, so the fewer decisions it has to >>> make, the better. >> I think that the merging layer should be as efficient as it reasonably >> can be, and particularly should take into account plugging. This >> benefits all callers. > > Yes, but plugging does not mean "please merge anything you can until the > unplug". > >> If it can be demonstrated that changes to some of the upper layers bring >> further improvements with acceptable costs, then certainly it is good to >> have those too. > > Generating millions of requests only to merge them again is > inefficient. It happens in an edge case (TRIM of the entirety of a very > large RAID), but it already caused on user to believe the system > failed. I think the system should be more robust than that. Neil, As I understand it, if a large discard bio is received by raid0_make_request(), for example one requesting to discard chunks 1 to 24 on a raid0 device built from 4 SSDs, this large discard bio will be split and written to each SSD in the following layout, SSD1: C1,C5,C9,C13,C17,C21 SSD2: C2,C6,C10,C14,C18,C22 SSD3: C3,C7,C11,C15,C19,C23 SSD4: C4,C8,C12,C16,C20,C24 Current raid0 code will call generic_make_request() 24 times, once for each split bio.
But it is possible to calculate the final layout of each split bio, so we can combine all the bios into four per-SSD large bios, like this, bio1 (on SSD1): C{1,5,9,13,17,21} bio2 (on SSD2): C{2,6,10,14,18,22} bio3 (on SSD3): C{3,7,11,15,19,23} bio4 (on SSD4): C{4,8,12,16,20,24} Now we only need to call generic_make_request() 4 times. Rebuilding the per-device discard bios is more efficient in raid0 code than in the block layer. There are some reasons that I know of, - there is a splice timeout; the block layer cannot merge all split bios into one large bio before it times out. - rebuilding the per-device bio in raid0 takes just a few calculations; the block layer does merging on a queue with list operations, which is slower. - raid0 code knows its on-disk layout, so rebuilding per-device bios is possible there; the block layer has no idea of the raid0 layout, it can only do request merging. Avi, I composed a prototype patch; the code is not simple, indeed it is quite complicated IMHO. I did a little research: some NVMe SSDs support whole-device-size DISCARD, and I observe mkfs.xfs sends out a raid0-device-size DISCARD bio to the block layer. But raid0_make_request() only receives 512KB-size DISCARD bios; block/blk-lib.c:__blkdev_issue_discard() splits the original large bio into 512KB small bios, the limitation coming from q->limits.discard_granularity. At this moment, I don't know why q->limits.discard_granularity is 512KB even when the underlying SSD supports whole-device-size discard. We also need to fix q->limits.discard_granularity, otherwise block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop to split the original large discard bio into smaller ones and send them to the raid0 code via next_bio(). I also CC this email to linux-block@vger.kernel.org to ask for help. My question is: if an NVMe SSD supports whole-device-size DISCARD, is q->limits.discard_granularity still necessary? Here I also attach my prototype patch as a proof of concept; it is runnable with Linux 4.9-rc7.
Coly [-- Attachment #2: raid0_handle_large_discard_bio.patch --] [-- Type: text/plain, Size: 9329 bytes --] Subject: optimization for large size DISCARD bio by per-device bios This is a very early prototype, still needs more block layer code modification to make it work. Current upstream raid0_make_request() only handles TRIM/DISCARD bios by chunk size, which means that a large raid0 device built from SSDs will call generic_make_request() millions of times for the split bios. This patch tries to combine small bios into a large one if they are on the same real device and contiguous on that device, then send the combined large bio to the underlying device by a single call to generic_make_request(). For example, using mkfs.xfs to trim a raid0 device built with 4 x 3TB NVMe SSDs, current upstream raid0_make_request() will call generic_make_request() 5.7 million times; with this patch only 4 calls to generic_make_request() are required. This patch won't work in the real world, because in block/blk-lib.c: __blkdev_issue_discard() the original large bio will be split into smaller ones by the restriction of discard_granularity. If some day SSDs support whole-device-sized discard_granularity, it will be very interesting then... The basic idea is, if a large discard bio is received by raid0_make_request(), for example one requesting to discard chunks 1 to 24 on a raid0 device built from 4 SSDs, this large discard bio will be split and written to each SSD in the following layout, SSD1: C1,C5,C9,C13,C17,C21 SSD2: C2,C6,C10,C14,C18,C22 SSD3: C3,C7,C11,C15,C19,C23 SSD4: C4,C8,C12,C16,C20,C24 Current raid0 code will call generic_make_request() 24 times, once for each split bio. But it is possible to calculate the final layout of each split bio, so we can combine all the bios into four per-SSD large bios, like this, bio1 (on SSD1): C{1,5,9,13,17,21} bio2 (on SSD2): C{2,6,10,14,18,22} bio3 (on SSD3): C{3,7,11,15,19,23} bio4 (on SSD4): C{4,8,12,16,20,24} Now we only need to call generic_make_request() 4 times.
The code is not simple, I need more time to write text to complain how it works. Currently you can treat it as a proof of concept. Signed-off-by: Coly Li <colyli@suse.de> --- drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 224 insertions(+) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 258986a..73c5fac 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -452,6 +452,225 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev, } } + +struct bio_record { + sector_t bi_sector; + unsigned long sectors; + struct md_rdev *rdev; +}; + +static void handle_discard_request(struct mddev *mddev, struct bio *bio) +{ + struct bio_record *recs = NULL; + struct bio *split; + struct r0conf *conf = mddev->private; + sector_t sectors, sector; + struct strip_zone *first_zone; + int zone_idx; + sector_t zone_start, zone_end; + int nr_strip_zones = conf->nr_strip_zones; + int disks; + int first_rdev_idx = -1, rdev_idx; + struct md_rdev *first_rdev; + unsigned int chunk_sects = mddev->chunk_sectors; + + sector = bio->bi_iter.bi_sector; + first_zone = find_zone(conf, §or); + first_rdev = map_sector(mddev, first_zone, sector, §or); + + sectors = chunk_sects - + (likely(is_power_of_2(chunk_sects)) + ? (sector & (chunk_sects - 1)) + : sector_div(sector, chunk_sects)); + + /* if bio size is not exceed a chunk boundary, + * simply handle it here. 
+ */ + if (sectors >= bio_sectors(bio)) { + bio->bi_bdev = first_rdev->bdev; + bio->bi_iter.bi_sector = sector + first_zone->dev_start + + first_rdev->data_offset; + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + return; + } + + /* bio is large enough to be split, allocate recs firstly */ + disks = mddev->raid_disks; + recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO); + if (recs == NULL) { + printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \ + "for bio_record", mdname(mddev)); + bio->bi_error = -ENOMEM; + bio_endio(bio); + return; + } + + zone_idx = first_zone - conf->strip_zone; + for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) { + struct md_rdev *rdev; + + rdev = conf->devlist[zone_idx * disks + rdev_idx]; + recs[rdev_idx].rdev = rdev; + if (rdev == first_rdev) + first_rdev_idx = rdev_idx; + } + + /* Restore due to sector_div */ + sector = bio->bi_iter.bi_sector; + recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start; + recs[first_rdev_idx].sectors = sectors; + BUG_ON(recs[first_rdev_idx].rdev != first_rdev); + + /* recs[first_rdev_idx] is initialized with 'sectors', we need to + * handle the rested sectors, which is sotred in 'sectors' too. + */ + sectors = bio_sectors(bio) - sectors; + + /* bio may not be chunk size aligned, the split bio on first rdev + * may not be chunk size aligned too. But the rested split bios + * on rested rdevs must be chunk size aligned, and aligned to + * round down chunk number. + */ + zone_end = first_zone->zone_end; + rdev_idx = first_rdev_idx + 1; + sector = likely(is_power_of_2(chunk_sects)) + ? 
sector & (~(chunk_sects - 1)) + : chunk_sects * (sector/chunk_sects); + + while (rdev_idx < first_zone->nb_dev) { + recs[rdev_idx].bi_sector = sector + first_zone->dev_start; + if (sectors <= chunk_sects) { + recs[rdev_idx].sectors = sectors; + goto issue; + } + recs[rdev_idx].sectors = chunk_sects; + sectors -= chunk_sects; + rdev_idx++; + } + + sector += chunk_sects; + zone_start = sector + first_zone->dev_start; + if (zone_start == zone_end) { + zone_idx++; + zone_start = conf->strip_zone[zone_idx].dev_start; + } + + while (zone_idx < nr_strip_zones) { + int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev; + int chunks_per_rdev, rested_chunks, rested_sectors; + sector_t zone_sectors, grow_sectors; + int add_rested_sectors = 0; + + zone_end = conf->strip_zone[zone_idx].zone_end; + zone_sectors = zone_end - zone_start; + chunks_per_rdev = sectors; + rested_sectors = + sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone); + rested_chunks = rested_sectors; + rested_sectors = sector_div(rested_chunks, chunk_sects); + + if ((chunks_per_rdev * chunk_sects) > zone_sectors) + chunks_per_rdev = zone_sectors/chunk_sects; + + /* rested_chunks and rested_sectors go into next zone, we won't + * handle them in this zone. Set them to 0. + */ + if ((chunks_per_rdev * chunk_sects) == zone_sectors && + (rested_chunks != 0 || rested_sectors != 0)) { + if (rested_chunks != 0) + rested_chunks = 0; + if (rested_sectors != 0) + rested_sectors = 0; + } + + if (rested_chunks == 0 && rested_sectors != 0) + add_rested_sectors = 1; + + for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) { + /* if .sectors is not initailized (== 0), it indicates + * .bi_sector is not initialized neither. We initiate + * .bi_sector firstly, then set .sectors by + * grow_sectors. 
+ */ + if (recs[rdev_idx].sectors == 0) + recs[rdev_idx].bi_sector = zone_start; + grow_sectors = chunks_per_rdev * chunk_sects; + if (rested_chunks) { + grow_sectors += chunk_sects; + rested_chunks--; + if (rested_chunks == 0 && + rested_sectors != 0) { + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + add_rested_sectors = 1; + continue; + } + } + + /* if add_rested_sectors != 0, it indicates + * rested_sectors != 0 + */ + if (add_rested_sectors) + grow_sectors += rested_sectors; + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + if (add_rested_sectors) + break; + } + + if (sectors == 0) + break; + zone_start = zone_end; + zone_idx++; + BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start); + } + + +issue: + /* recs contains the re-ordered requests layout, now we can + * chain split bios from recs + */ + for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) { + if (rdev_idx == first_rdev_idx || + recs[rdev_idx].sectors == 0) + continue; + split = bio_split(bio, + recs[rdev_idx].sectors, + GFP_NOIO, + fs_bio_set); + if (split == NULL) + break; + bio_chain(split, bio); + BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9); + split->bi_bdev = recs[rdev_idx].rdev->bdev; + split->bi_iter.bi_sector = recs[rdev_idx].bi_sector + + recs[rdev_idx].rdev->data_offset; + + if (unlikely(!blk_queue_discard( + bdev_get_queue(split->bi_bdev)))) + /* Just ignore it */ + bio_endio(split); + else + generic_make_request(split); + } + BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9); + bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector + + recs[first_rdev_idx].rdev->data_offset; + bio->bi_bdev = recs[first_rdev_idx].rdev->bdev; + + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + + kfree(recs); +} + static void raid0_make_request(struct mddev *mddev, struct bio *bio) { struct strip_zone *zone; @@ -463,6 +682,11 @@ static 
void raid0_make_request(struct mddev *mddev, struct bio *bio) return; } + if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + handle_discard_request(mddev, bio); + return; + } + do { sector_t sector = bio->bi_iter.bi_sector; unsigned chunk_sects = mddev->chunk_sectors; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-12-07 11:50 ` Coly Li @ 2016-12-07 12:03 ` Coly Li 2016-12-07 16:59 ` Shaohua Li 1 sibling, 0 replies; 31+ messages in thread From: Coly Li @ 2016-12-07 12:03 UTC (permalink / raw) To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block On 2016/12/7 下午7:50, Coly Li wrote: > On 2016/11/30 上午6:45, Avi Kivity wrote: >> On 11/29/2016 11:14 PM, NeilBrown wrote: > [snip] > >>>> So I disagree that all the work should be pushed to the merging layer. >>>> It has less information to work with, so the fewer decisions it has to >>>> make, the better. >>> I think that the merging layer should be as efficient as it reasonably >>> can be, and particularly should take into account plugging. This >>> benefits all callers. >> >> Yes, but plugging does not mean "please merge anything you can until the >> unplug". >> >>> If it can be demonstrated that changes to some of the upper layers bring >>> further improvements with acceptable costs, then certainly it is good to >>> have those too. >> >> Generating millions of requests only to merge them again is >> inefficient. It happens in an edge case (TRIM of the entirety of a very >> large RAID), but it already caused on user to believe the system >> failed. I think the system should be more robust than that. > > Neil, > > As my understand, if a large discard bio received by > raid0_make_request(), for example it requests to discard chunk 1 to 24 > on a raid0 device built by 4 SSDs. This large discard bio will be split > and written to each SSD as the following layout, > > SSD1: C1,C5,C9,C13,C17,C21 > SSD2: C2,C6,C10,C14,C18,C22 > SSD3: C3,C7,C11,C15,C19,C23 > SSD4: C4,C8,C12,C16,C20,C24 > > Current raid0 code will call generic_make_request() for 24 times for > each split bio. 
But it is possible to calculate the final layout of each > split bio, so we can combine all the bios into four per-SSD large bio, > like this, > > bio1 (on SSD1): C{1,5,9,13,17,21} > bio2 (on SSD2): C{2,6,10,14,18,22} > bio3 (on SSD3): C{3,7,11,15,19,23} > bio4 (on SSD4): C{4,8,12,16,20,24} > > Now we only need to call generic_make_request() for 4 times. Rebuild the > per-device discard bios is more efficient in raid0 code then in block > layer. There are some reasons that I know, > - there are splice timeout, block layer cannot merge all split bio into > one large bio before time out. > - rebuilt per-device bio in raid0 is just by a few calculation, block > layer does merge on queue with list operations, it is slower. > - raid0 code knows its on disk layout, so rebuild per-device bio is > possible here. block layer has no idea on raid0 layout, it can only do > request merge. > > Avi, > > I compose a prototype patch, the code is not simple, indeed it is quite > complicated IMHO. > > I do a little research, some NVMe SSDs support whole device size > DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD > bio to block layer. But raid0_make_request() only receives 512KB size > DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the > original large bio into 512KB small bios, the limitation is from > q->limits.discard_granularity. > > At this moment, I don't know why a q->limits.discard_granularity is > 512KB even the underlying SSD supports whole device size discard. We > also need to fix q->limits.discard_granularity, otherwise > block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop > to split the original large discard bio into smaller ones and sends them > to raid0 code by next_bio(). > > I also CC this email to linux-block@vger.kernel.org to ask for help. My > question is, if a NVMe SSD supports whole-device-size DISCARD, is > q->limits.discard_granularity still necessary ? 
> > Here I also attach my prototype patch as a proof of concept, it is > runnable with Linux 4.9-rc7. Aha! It is not limits.discard_granularity, it is limits.max_discard_sectors, which is set in raid0.c:raid0_run() by blk_queue_max_discard_sectors(). Here limits.max_discard_sectors is set to the raid0 chunk size. Interesting .... And, linux-block@vger.kernel.org, please ignore my noise. Coly ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-12-07 11:50 ` Coly Li 2016-12-07 12:03 ` Coly Li @ 2016-12-07 16:59 ` Shaohua Li 2016-12-08 16:44 ` Coly Li 1 sibling, 1 reply; 31+ messages in thread From: Shaohua Li @ 2016-12-07 16:59 UTC (permalink / raw) To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote: > On 2016/11/30 上午6:45, Avi Kivity wrote: > > On 11/29/2016 11:14 PM, NeilBrown wrote: > [snip] > > >>> So I disagree that all the work should be pushed to the merging layer. > >>> It has less information to work with, so the fewer decisions it has to > >>> make, the better. > >> I think that the merging layer should be as efficient as it reasonably > >> can be, and particularly should take into account plugging. This > >> benefits all callers. > > > > Yes, but plugging does not mean "please merge anything you can until the > > unplug". > > > >> If it can be demonstrated that changes to some of the upper layers bring > >> further improvements with acceptable costs, then certainly it is good to > >> have those too. > > > > Generating millions of requests only to merge them again is > > inefficient. It happens in an edge case (TRIM of the entirety of a very > > large RAID), but it already caused on user to believe the system > > failed. I think the system should be more robust than that. > > Neil, > > As my understand, if a large discard bio received by > raid0_make_request(), for example it requests to discard chunk 1 to 24 > on a raid0 device built by 4 SSDs. This large discard bio will be split > and written to each SSD as the following layout, > > SSD1: C1,C5,C9,C13,C17,C21 > SSD2: C2,C6,C10,C14,C18,C22 > SSD3: C3,C7,C11,C15,C19,C23 > SSD4: C4,C8,C12,C16,C20,C24 > > Current raid0 code will call generic_make_request() for 24 times for > each split bio. 
But it is possible to calculate the final layout of each > split bio, so we can combine all the bios into four per-SSD large bio, > like this, > > bio1 (on SSD1): C{1,5,9,13,17,21} > bio2 (on SSD2): C{2,6,10,14,18,22} > bio3 (on SSD3): C{3,7,11,15,19,23} > bio4 (on SSD4): C{4,8,12,16,20,24} > > Now we only need to call generic_make_request() for 4 times. Rebuild the > per-device discard bios is more efficient in raid0 code then in block > layer. There are some reasons that I know, > - there are splice timeout, block layer cannot merge all split bio into > one large bio before time out. > - rebuilt per-device bio in raid0 is just by a few calculation, block > layer does merge on queue with list operations, it is slower. > - raid0 code knows its on disk layout, so rebuild per-device bio is > possible here. block layer has no idea on raid0 layout, it can only do > request merge. Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0 zones make things a little complicated though. I just had a brief look at your proposed patch, and it looks really complicated. I'd suggest something like this: 1. split the bio according to zone boundary. 2. handle the split bio. Since the bio is within zone range, calculating the start and end sector for each rdev should be easy. This will create slightly more bios to each rdev (not too many, since there aren't too many zones in practice) and the block layer should easily merge these bios without much overhead. The benefit is a much simpler implementation. > I compose a prototype patch, the code is not simple, indeed it is quite > complicated IMHO. > > I do a little research, some NVMe SSDs support whole device size > DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD > bio to block layer.
But raid0_make_request() only receives 512KB size > DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the > original large bio into 512KB small bios, the limitation is from > q->limits.discard_granularity. please adjust the max discard sectors for the queue. The original setting is chunk size. Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-12-07 16:59 ` Shaohua Li @ 2016-12-08 16:44 ` Coly Li 2016-12-08 19:19 ` Shaohua Li 2017-06-29 15:15 ` Avi Kivity 0 siblings, 2 replies; 31+ messages in thread From: Coly Li @ 2016-12-08 16:44 UTC (permalink / raw) To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block [-- Attachment #1: Type: text/plain, Size: 2545 bytes --] On 2016/12/8 上午12:59, Shaohua Li wrote: > On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote: [snip] > Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0 > zones make things a little complicated though. I just had a brief look of your > proposed patch, which looks really complicated. I'd suggest something like > this: > 1. split the bio according to zone boundary. > 2. handle the splitted bio. since the bio is within zone range, calculating > the start and end sector for each rdev should be easy. > Hi Shaohua, Thanks for your suggestion! I tried to modify the code as you suggested, and it is even harder to make the code work that way ... Because even if we split bios per zone, all the corner cases still exist and must be handled in every zone. The code will be more complicated.
This is a powerful suggestion. I changed max_discard_sectors to the raid0 size and fixed some bugs, and now the patch looks to be working well. The performance number is not bad. On a 4x3TB NVMe raid0, formatting with mkfs.xfs takes 306 seconds with the current upstream kernel, and 15 seconds with the patched kernel. I see the average request size increase from 1 chunk (1024 sectors) to 2048 chunks (2097152 sectors). I don't know why the bios are still being split before raid0_make_request() receives them, after I set q->limits.max_discard_sectors to mddev->array_sectors. Can anybody give me a hint ? Here I attach the RFC v2 patch; if anybody wants to try it, please do so and report the result :-) I will take time to write a very detailed commit log and code comments to make this patch easier to understand. Ugly code, that's what I have to pay to gain better performance .... Thanks in advance. Coly [-- Attachment #2: raid0_handle_large_discard_bio.patch --] [-- Type: text/plain, Size: 11145 bytes --] Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios This is a very early prototype, and still needs more block layer code modification to make it work. The current upstream raid0_make_request() only handles TRIM/DISCARD bios chunk by chunk, which means that for a large raid0 device built from SSDs, generic_make_request() will be called millions of times for the split bios. This patch tries to combine small bios into a large one if they are on the same real device and contiguous on that device, then send the combined large bio to the underlying device with a single call to generic_make_request(). For example, using mkfs.xfs to trim a raid0 device built with 4 x 3TB NVMe SSDs, the current upstream raid0_make_request() will call generic_make_request() 5.7 million times; with this patch only 4 calls to generic_make_request() are required.
This patch won't work in the real world yet, because in block/blk-lib.c: __blkdev_issue_discard() the original large bio will be split into smaller ones by the discard_granularity restriction. If some day SSDs support whole-device-sized discard_granularity, it will be very interesting then... The basic idea is: if a large discard bio is received by raid0_make_request(), for example one requesting to discard chunks 1 to 24 on a raid0 device built from 4 SSDs, this large discard bio will be split and written to each SSD in the following layout, SSD1: C1,C5,C9,C13,C17,C21 SSD2: C2,C6,C10,C14,C18,C22 SSD3: C3,C7,C11,C15,C19,C23 SSD4: C4,C8,C12,C16,C20,C24 Current raid0 code will call generic_make_request() 24 times, once for each split bio. But it is possible to calculate the final layout of each split bio, so we can combine all the bios into four per-SSD large bios, like this, bio1 (on SSD1): C{1,5,9,13,17,21} bio2 (on SSD2): C{2,6,10,14,18,22} bio3 (on SSD3): C{3,7,11,15,19,23} bio4 (on SSD4): C{4,8,12,16,20,24} Now we only need to call generic_make_request() 4 times. The code is not simple; I need more time to write text to explain how it works. Currently you can treat it as a proof of concept. Changelogs v1, Initial prototype. v2, Major changes include, - rename function names; now handle_discard_bio() takes care of chunk-size DISCARD bios and the single disk situation, large DISCARD bios will be handled in handle_large_discard_bio(). - Set max_discard_sectors to the raid0 device size. - Fix several bugs which I found in basic testing.
Signed-off-by: Coly Li <colyli@suse.de> --- drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 266 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 258986a..c7afe0c 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev) blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); - blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); + blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0)); blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); blk_queue_io_opt(mddev->queue, @@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev, } } + +struct bio_record { + sector_t bi_sector; + unsigned long sectors; + struct md_rdev *rdev; +}; + +static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio) +{ + struct bio_record *recs = NULL; + struct bio *split; + struct r0conf *conf = mddev->private; + sector_t sectors, sector; + struct strip_zone *first_zone; + int zone_idx; + sector_t zone_start, zone_end; + int nr_strip_zones = conf->nr_strip_zones; + int disks; + int first_rdev_idx = -1, rdev_idx; + struct md_rdev *first_rdev; + unsigned int chunk_sects = mddev->chunk_sectors; + + sector = bio->bi_iter.bi_sector; + first_zone = find_zone(conf, §or); + first_rdev = map_sector(mddev, first_zone, sector, §or); + + /* bio is large enough to be split, allocate recs firstly */ + disks = mddev->raid_disks; + recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO); + if (recs == NULL) { + printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \ + "for bio_record", mdname(mddev)); + bio->bi_error = -ENOMEM; + bio_endio(bio); + return; + } + + zone_idx = first_zone - conf->strip_zone; + for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) { + struct md_rdev *rdev; + + rdev = 
conf->devlist[zone_idx * disks + rdev_idx]; + recs[rdev_idx].rdev = rdev; + if (rdev == first_rdev) + first_rdev_idx = rdev_idx; + } + + sectors = chunk_sects - + (likely(is_power_of_2(chunk_sects)) + ? (sector & (chunk_sects - 1)) + : sector_div(sector, chunk_sects)); + sector = bio->bi_iter.bi_sector; + + recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start; + recs[first_rdev_idx].sectors = sectors; + + /* recs[first_rdev_idx] is initialized with 'sectors', we need to + * handle the rested sectors, which is sotred in 'sectors' too. + */ + sectors = bio_sectors(bio) - sectors; + + /* bio may not be chunk size aligned, the split bio on first rdev + * may not be chunk size aligned too. But the rested split bios + * on rested rdevs must be chunk size aligned, and aligned to + * round down chunk number. + */ + zone_end = first_zone->zone_end; + rdev_idx = first_rdev_idx + 1; + sector = likely(is_power_of_2(chunk_sects)) + ? sector & (~(chunk_sects - 1)) + : chunk_sects * (sector/chunk_sects); + + while (rdev_idx < first_zone->nb_dev) { + if (recs[rdev_idx].sectors == 0) { + recs[rdev_idx].bi_sector = sector + first_zone->dev_start; + if (sectors <= chunk_sects) { + recs[rdev_idx].sectors = sectors; + goto issue; + } + recs[rdev_idx].sectors = chunk_sects; + sectors -= chunk_sects; + } + rdev_idx++; + } + + sector += chunk_sects; + zone_start = sector + first_zone->dev_start; + if (zone_start == zone_end) { + zone_idx++; + if (zone_idx == nr_strip_zones) { + if (sectors != 0) + printk(KERN_INFO "bio size exceeds raid0 " \ + "capability, ignore extra " \ + "TRIM/DISCARD range.\n"); + goto issue; + } + zone_start = conf->strip_zone[zone_idx].dev_start; + } + + while (zone_idx < nr_strip_zones) { + int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev; + int chunks_per_rdev, rested_chunks, rested_sectors; + sector_t zone_sectors, grow_sectors; + int add_rested_sectors = 0; + + zone_end = conf->strip_zone[zone_idx].zone_end; + zone_sectors = zone_end - 
zone_start; + chunks_per_rdev = sectors; + rested_sectors = + sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone); + rested_chunks = rested_sectors; + rested_sectors = sector_div(rested_chunks, chunk_sects); + + if ((chunks_per_rdev * chunk_sects) > zone_sectors) + chunks_per_rdev = zone_sectors/chunk_sects; + + /* rested_chunks and rested_sectors go into next zone, we won't + * handle them in this zone. Set them to 0. + */ + if ((chunks_per_rdev * chunk_sects) == zone_sectors && + (rested_chunks != 0 || rested_sectors != 0)) { + if (rested_chunks != 0) + rested_chunks = 0; + if (rested_sectors != 0) + rested_sectors = 0; + } + + if (rested_chunks == 0 && rested_sectors != 0) + add_rested_sectors ++; + + for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) { + /* if .sectors is not initailized (== 0), it indicates + * .bi_sector is not initialized neither. We initiate + * .bi_sector firstly, then set .sectors by + * grow_sectors. + */ + if (recs[rdev_idx].sectors == 0) + recs[rdev_idx].bi_sector = zone_start; + grow_sectors = chunks_per_rdev * chunk_sects; + if (rested_chunks) { + grow_sectors += chunk_sects; + rested_chunks--; + if (rested_chunks == 0 && + rested_sectors != 0) { + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + add_rested_sectors ++; + continue; + } + } + + /* if add_rested_sectors != 0, it indicates + * rested_sectors != 0 + */ + if (add_rested_sectors == 1) { + grow_sectors += rested_sectors; + add_rested_sectors ++; + } + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + if (sectors == 0) + break; + } + + if (sectors == 0) + break; + zone_start = zone_end; + zone_idx++; + if (zone_idx < nr_strip_zones) + BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start); + } + + +issue: + /* recs contains the re-ordered requests layout, now we can + * chain split bios from recs + */ + for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) { + if (rdev_idx == first_rdev_idx || + recs[rdev_idx].sectors == 0) + 
continue; + split = bio_split(bio, + recs[rdev_idx].sectors, + GFP_NOIO, + fs_bio_set); + if (split == NULL) + break; + bio_chain(split, bio); + BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9); + split->bi_bdev = recs[rdev_idx].rdev->bdev; + split->bi_iter.bi_sector = recs[rdev_idx].bi_sector + + recs[rdev_idx].rdev->data_offset; + + if (unlikely(!blk_queue_discard( + bdev_get_queue(split->bi_bdev)))) + /* Just ignore it */ + bio_endio(split); + else + generic_make_request(split); + } + BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9); + bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector + + recs[first_rdev_idx].rdev->data_offset; + bio->bi_bdev = recs[first_rdev_idx].rdev->bdev; + + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + + kfree(recs); +} + +static void handle_discard_bio(struct mddev *mddev, struct bio *bio) +{ + struct r0conf *conf = mddev->private; + unsigned int chunk_sects = mddev->chunk_sectors; + sector_t sector, sectors; + struct md_rdev *rdev; + struct strip_zone *zone; + + sector = bio->bi_iter.bi_sector; + zone = find_zone(conf, §or); + rdev = map_sector(mddev, zone, sector, §or); + bio->bi_bdev = rdev->bdev; + sectors = chunk_sects - + (likely(is_power_of_2(chunk_sects)) + ? 
(sector & (chunk_sects - 1)) + : sector_div(sector, chunk_sects)); + + if (unlikely(sectors >= bio_sectors(bio))) { + bio->bi_iter.bi_sector = sector + zone->dev_start + + rdev->data_offset; + goto single_bio; + } + + if (unlikely(zone->nb_dev == 1)) { + sectors = conf->strip_zone[0].zone_end - + sector; + if (bio_sectors(bio) > sectors) + bio->bi_iter.bi_size = sectors << 9; + bio->bi_iter.bi_sector = sector + rdev->data_offset; + goto single_bio; + } + + handle_large_discard_bio(mddev, bio); + return; + +single_bio: + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + + return; +} + + static void raid0_make_request(struct mddev *mddev, struct bio *bio) { struct strip_zone *zone; @@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio) return; } + if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + handle_discard_bio(mddev, bio); + return; + } + do { sector_t sector = bio->bi_iter.bi_sector; unsigned chunk_sects = mddev->chunk_sectors; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-12-08 16:44 ` Coly Li @ 2016-12-08 19:19 ` Shaohua Li 2016-12-09 7:34 ` Coly Li 2017-06-29 15:15 ` Avi Kivity 1 sibling, 1 reply; 31+ messages in thread From: Shaohua Li @ 2016-12-08 19:19 UTC (permalink / raw) To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote: > On 2016/12/8 上午12:59, Shaohua Li wrote: > > On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote: > [snip] > > Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0 > > zones make things a little complicated though. I just had a brief look of your > > proposed patch, which looks really complicated. I'd suggest something like > > this: > > 1. split the bio according to zone boundary. > > 2. handle the splitted bio. since the bio is within zone range, calculating > > the start and end sector for each rdev should be easy. > > > > Hi Shaohua, > > Thanks for your suggestion! I try to modify the code by your suggestion, > it is even more hard to make the code that way ... > > Because even split bios for each zone, all the corner cases still exist > and should be taken care in every zoon. The code will be more complicated. Not sure why it makes the code more complicated. Probably I'm wrong, but I just want to make sure we are on the same page: split the bio according to zone boundaries, then handle the split bios separately. Calculating the end/start point of each rdev for the new bio within a zone should be simple. We then clone a bio for each rdev and dispatch. So for example: Disk 0: D0 D2 D4 D6 D7 Disk 1: D1 D3 D5 zone 0 is from D0 - D5, zone 1 is from D6 - D7. If a bio is from D1 to D7, we split it into 2 bios, one D1 - D5, the other D6 - D7. For D1 - D5, we dispatch 2 bios: D1 - D5 for disk 1, D2 - D4 for disk 0. For D6 - D7, we just dispatch to disk 0. What kind of corner case makes this more complicated?
> > This will create slightly more bio to each rdev (not too many, since there > > aren't too many zones in practice) and block layer should easily merge these > > bios without much overhead. The benefit is a much simpler implementation. > > > >> I compose a prototype patch, the code is not simple, indeed it is quite > >> complicated IMHO. > >> > >> I do a little research, some NVMe SSDs support whole device size > >> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD > >> bio to block layer. But raid0_make_request() only receives 512KB size > >> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the > >> original large bio into 512KB small bios, the limitation is from > >> q->limits.discard_granularity. > > > > please adjust the max discard sectors for the queue. The original setting is > > chunk size. > > > > This is a powerful suggestion, I change the max_discard_sectors to raid0 > size, and fix some bugs, now the patch looks working well. The > performance number is not bad. > > On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel > spends 306 seconds, the patched kernel spends 15 seconds. I see average > request size increases from 1 chunk (1024 sectors) to 2048 chunks > (2097152 sectors). > > I don't know why the bios are still be split before raid0_make_request() > receives them, after I set q->limits.max_discard_sectors to > mddev->array_sectors. Can anybody give me a hint ? That probably comes from disk_stack_limits(); try setting max_discard_sectors after it. > Here I attach the RFC v2 patch, if anybody wants to try it, please do it > and response the result :-) > > I will take time to write a very detailed commit log and code comments > to make this patch more easier to be understood. Ugly code, that's what > I have to pay to gain better performance .... Can't say I like it :). Hard to read, and the memory allocation is ugly. Please check whether there is a simpler solution before writing a detailed commit log.
Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-12-08 19:19 ` Shaohua Li @ 2016-12-09 7:34 ` Coly Li 2016-12-12 3:17 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Coly Li @ 2016-12-09 7:34 UTC (permalink / raw) To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block On 2016/12/9 上午3:19, Shaohua Li wrote: > On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote: >> On 2016/12/8 上午12:59, Shaohua Li wrote: >>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote: >> [snip] >>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0 >>> zones make things a little complicated though. I just had a brief look of your >>> proposed patch, which looks really complicated. I'd suggest something like >>> this: >>> 1. split the bio according to zone boundary. >>> 2. handle the splitted bio. since the bio is within zone range, calculating >>> the start and end sector for each rdev should be easy. >>> >> >> Hi Shaohua, >> >> Thanks for your suggestion! I try to modify the code by your suggestion, >> it is even more hard to make the code that way ... >> >> Because even split bios for each zone, all the corner cases still exist >> and should be taken care in every zoon. The code will be more complicated. > > Not sure why it makes the code more complicated. Probably I'm wrong, but Just > want to make sure we are in the same page: split the bio according to zone > boundary, then handle the splitted bio separately. Calculating end/start point > of each rdev for the new bio within a zone should be simple. we then clone a > bio for each rdev and dispatch. So for example: > Disk 0: D0 D2 D4 D6 D7 > Disk 1: D1 D3 D5 > zone 0 is from D0 - D5, zone 1 is from D6 - D7 > If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7. > For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0 > For D6 - D7, we just dispatch to disk 0. > What kind of corner case makes this more complicated? > Let me explain the corner cases. 
When upper layer code issues a DISCARD bio, bio->bi_iter.bi_sector may not be chunk size aligned, and bio->bi_iter.bi_size may not be (chunk_sects*nb_dev) sectors aligned. In raid0, we can't simply round them up/down to chunk size aligned numbers, otherwise data loss/corruption will happen. Therefore for each DISCARD bio that raid0_make_request() receives, the beginning and ending parts of the bio must be treated very carefully. All the corner cases *come from here*; they are not about the number of zones or rdevs, they are about whether bio->bi_iter.bi_sector and bio->bi_iter.bi_size are chunk size aligned or not. - beginning of the bio If bio->bi_iter.bi_sector is not chunk size aligned, current raid0 code will split off a beginning part which only contains the sectors from bio->bi_iter.bi_sector to the next chunk size aligned offset, and issue this bio by generic_make_request(). But in handle_large_discard_bio() we can't issue the split bio immediately; we have to record the length of this split bio in a per-device structure, and issue a split bio only after all the sectors of the DISCARD bio are calculated. So I use recs[first_rdev_idx].bi_sector to record bi_iter.bi_sector of the split bio, recs[first_rdev_idx].sectors to record its length, and recs[first_rdev_idx].rdev to record the address of the real device where the split bio will be sent. After the first non-chunk-size-aligned part is handled, we need to look at the next rdev for the next chunk of the DISCARD bio. Now the sector offset is chunk size aligned, but there are still two conditions: 1) If first_rdev is not the last real device in the current zone, then on the next real device, its per-device bio will start at the round-down chunk offset of recs[first_rdev_idx].bi_sector. If recs[first_rdev_idx].bi_sector is chunk size aligned, the next real device's per-device bio will start at the same chunk offset.
2) If first_rdev is the last real device in the current zone, then the next rdev to handle is the number 0 real device among the conf->strip_zone[zone_idx].nb_dev real devices. In this case, the bi_iter.bi_sector of the per-device bio of this real device is the chunk offset next to the chunk which recs[first_rdev_idx].bi_sector hits on recs[first_rdev_idx].rdev. - ending part of the bio If the length of the DISCARD bio is not (chunk_sects * nb_dev) sectors aligned, after we have handled one or more (chunk_sects*nb_dev) aligned sectors, the rested sectors are less than chunk_sects*nb_dev, but these sectors may still fit in some rested_chunks and rested_sectors. We also need to handle them carefully. If rested_chunks is not 0, because chunks are linearly allocated on each real device in the current zone, we need to add chunk_sects to recs[rdev_idx].sectors, where rdev_idx runs from 0 to strip_zone[zone_idx].nb_dev - 1. When we move to the next real device (rdev_idx ++), we need to reduce one chunk from rested_chunks (rested_chunks --). When rested_chunks reaches 0, we start to handle rested_sectors. rested_sectors will be added to the next real device; we just simply add rested_sectors to recs[rdev_idx].sectors. There is a corner case where, after the calculation of the DISCARD bio, rested_chunks is 0 and rested_sectors is not 0. In this case, we only need to add rested_sectors to the number 0 real device of the zone, which is recs[0]. A DISCARD bio probably covers only one raid0 zone, but all the above corner cases still have to be taken care of. Therefore submitting split bios for each zone does not make the code simpler. Handling the details correctly is really tedious; if I could explain it to you face to face, that would be much easier. >>> This will create slightly more bio to each rdev (not too many, since there >>> aren't too many zones in practice) and block layer should easily merge these >>> bios without much overhead. The benefit is a much simpler implementation.
>>>
>>>> I composed a prototype patch; the code is not simple, indeed it is quite
>>>> complicated IMHO.
>>>>
>>>> I did a little research: some NVMe SSDs support whole-device-size
>>>> DISCARD, and I observe mkfs.xfs sending out a raid0-device-size DISCARD
>>>> bio to the block layer. But raid0_make_request() only receives 512KB
>>>> DISCARD bios; block/blk-lib.c:__blkdev_issue_discard() splits the
>>>> original large bio into 512KB small bios, the limitation coming from
>>>> q->limits.discard_granularity.
>>>
>>> Please adjust the max discard sectors for the queue. The original setting is
>>> the chunk size.
>>
>> This is a helpful suggestion. I changed max_discard_sectors to the raid0
>> size and fixed some bugs; now the patch seems to work well. The
>> performance numbers are not bad.
>>
>> On a 4x3TB NVMe raid0, formatted with mkfs.xfs, the current upstream
>> kernel spends 306 seconds, the patched kernel spends 15 seconds. I see
>> the average request size increase from 1 chunk (1024 sectors) to 2048
>> chunks (2097152 sectors).
>>
>> I don't know why the bios are still split before raid0_make_request()
>> receives them, after I set q->limits.max_discard_sectors to
>> mddev->array_sectors. Can anybody give me a hint?
>
> That probably comes from disk_stack_limits; try setting max_discard_sectors
> after it.
>

I know why the bio is still split: (1<<32)/512 = 8388608. In
__blkdev_issue_discard(), req_sects is decided by:

        /* Make sure bi_size doesn't overflow */
        req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);

I tried simply changing bi_size from unsigned int to unsigned long, and
changing the above limit to ULONG_MAX >> 9, but the kernel panics. It seems
changing bi_iter.bi_size from unsigned int to unsigned long is not simple.
Unless we change bi_iter.bi_size to unsigned long, this is the best effort
for now.
>> Here I attach the RFC v2 patch; if anybody wants to try it, please do,
>> and report back the result :-)
>>
>> I will take time to write a very detailed commit log and code comments
>> to make this patch easier to understand. Ugly code, that's the price
>> I have to pay for better performance ....
>
> Can't say I like it :). Hard to read and the memory allocation is ugly. Please
> check whether there is a simpler solution before writing the detailed
> commit log.

I don't like it either .... I can imagine how hard it is to read, because
even explaining it in text is difficult ... There are a lot of details to
take care of, and the calculations have to match exactly in the end. I
tried to avoid allocating memory for recs[], but without a data structure
to record the per-device bio attributes, I could not find a better way to
write the code. The one thing that makes me slightly more comfortable is
that bio_split() also allocates memory ^_^

Coly

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs
  2016-12-09  7:34                 ` Coly Li
@ 2016-12-12  3:17                   ` NeilBrown
  0 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2016-12-12 3:17 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: Avi Kivity, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 3524 bytes --]

On Fri, Dec 09 2016, Coly Li wrote:

> On 2016/12/9 3:19 AM, Shaohua Li wrote:
>> On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
>>> On 2016/12/8 12:59 AM, Shaohua Li wrote:
>>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>>> [snip]
>>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>>>> zones make things a little complicated though. I just had a brief look at your
>>>> proposed patch, which looks really complicated. I'd suggest something like
>>>> this:
>>>> 1. split the bio according to zone boundary.
>>>> 2. handle the split bio. since the bio is within zone range, calculating
>>>> the start and end sector for each rdev should be easy.
>>>>
>>> Hi Shaohua,
>>>
>>> Thanks for your suggestion! I tried to modify the code per your suggestion;
>>> it is even harder to make the code work that way ...
>>>
>>> Because even if we split bios per zone, all the corner cases still exist
>>> and must be taken care of in every zone. The code will be more complicated.
>>
>> Not sure why it makes the code more complicated. Probably I'm wrong, but just
>> want to make sure we are on the same page: split the bio according to zone
>> boundary, then handle each split bio separately. Calculating the start/end
>> point on each rdev for the new bio within a zone should be simple. We then
>> clone a bio for each rdev and dispatch. So for example:
>> Disk 0: D0 D2 D4 D6 D7
>> Disk 1: D1 D3 D5
>> zone 0 is from D0 - D5, zone 1 is from D6 - D7
>> If a bio covers D1 to D7, we split it into 2 bios, one for D1 - D5, the other for D6 - D7.
>> For D1 - D5, we dispatch 2 bios: D1 - D5 for disk 1, D2 - D4 for disk 0.
>> For D6 - D7, we just dispatch to disk 0.
>> What kind of corner case makes this more complicated?
>>
>
> Let me explain the corner cases.
>
> When upper-layer code issues a DISCARD bio, bio->bi_iter.bi_sector
> may not be chunk size aligned, and bio->bi_iter.bi_size may not be
> aligned to (chunk_sects * nb_dev) sectors. In raid0, we can't simply
> round them up or down to a chunk-size-aligned number, otherwise data
> loss or corruption will happen.
>
> Therefore for each DISCARD bio that raid0_make_request() receives, the
> beginning and ending parts of the bio must be treated very carefully.
> All the corner cases *come from here*; they are not about the number of
> zones or rdevs, they are about whether bio->bi_iter.bi_sector and
> bio->bi_iter.bi_size are chunk size aligned or not.
>
> - beginning of the bio
> If bio->bi_iter.bi_sector is not chunk size aligned, the current raid0
> code splits off the beginning part into a split bio which contains only
> the sectors from bio->bi_iter.bi_sector to the next chunk-size-aligned
> offset, and issues this bio via generic_make_request(). But in
> handle_large_discard_bio() we can't issue the split bio immediately; we
> have to record the length of this split bio in a per-device structure,
> and issue a

Why?  Why can't you just split off the start of the bio and chain it
with the rest of the bio?

If the bio doesn't start at the beginning of a stripe, just split off
the first (partial) chunk exactly as we currently do.
If it does start at the beginning of a stripe, then split off a whole
number of stripes and allocate one bio for each device.  Chain those to
the original bio together with any remainder (which isn't a whole
stripe).

I think that if you make use of bio_split() and bio_chain() properly,
the code will be much simpler.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs
  2016-12-08 16:44             ` Coly Li
  2016-12-08 19:19               ` Shaohua Li
@ 2017-06-29 15:15               ` Avi Kivity
  2017-06-29 15:31                 ` Coly Li
  1 sibling, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2017-06-29 15:15 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block

On 12/08/2016 06:44 PM, Coly Li wrote:
> On 2016/12/8 12:59 AM, Shaohua Li wrote:
>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> [snip]
>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>> zones make things a little complicated though. I just had a brief look at your
>> proposed patch, which looks really complicated. I'd suggest something like
>> this:
>> 1. split the bio according to zone boundary.
>> 2. handle the split bio. since the bio is within zone range, calculating
>> the start and end sector for each rdev should be easy.
>>
> Hi Shaohua,
>
> Thanks for your suggestion! I tried to modify the code per your suggestion;
> it is even harder to make the code work that way ...
>
> Because even if we split bios per zone, all the corner cases still exist
> and must be taken care of in every zone. The code will be more complicated.
>

Hi Coly,

Did you manage to complete this patch?  We are seeing the effects of this
issue not only with mkfs, but also with fstrim(8).

Avi

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs
  2017-06-29 15:15               ` Avi Kivity
@ 2017-06-29 15:31                 ` Coly Li
  2017-06-29 15:36                   ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Coly Li @ 2017-06-29 15:31 UTC (permalink / raw)
  To: Avi Kivity, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block

On 2017/6/29 11:15 PM, Avi Kivity wrote:
>
>
> On 12/08/2016 06:44 PM, Coly Li wrote:
>> On 2016/12/8 12:59 AM, Shaohua Li wrote:
>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>> [snip]
>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The
>>> raid0 zones make things a little complicated though. I just had a brief
>>> look at your proposed patch, which looks really complicated. I'd suggest
>>> something like this:
>>> 1. split the bio according to zone boundary.
>>> 2. handle the split bio. since the bio is within zone range,
>>> calculating the start and end sector for each rdev should be easy.
>>>
>> Hi Shaohua,
>>
>> Thanks for your suggestion! I tried to modify the code per your suggestion;
>> it is even harder to make the code work that way ...
>>
>> Because even if we split bios per zone, all the corner cases still exist
>> and must be taken care of in every zone. The code will be more
>> complicated.
>>
>
> Hi Coly,
>
> Did you manage to complete this patch? We are seeing the effects of this
> issue not only with mkfs, but also with fstrim(8).

Hi Avi,

Shaohua wrote another, much better patch, which was merged into the
mainline kernel in v4.12-rc2.

The commit is '29efc390b946 ("md/md0: optimize raid0 discard handling")'.

Hope this is informative.

Coly

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs
  2017-06-29 15:31                 ` Coly Li
@ 2017-06-29 15:36                   ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2017-06-29 15:36 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: NeilBrown, linux-raid, linux-block

On 06/29/2017 06:31 PM, Coly Li wrote:
> On 2017/6/29 11:15 PM, Avi Kivity wrote:
>>
>> On 12/08/2016 06:44 PM, Coly Li wrote:
>>> On 2016/12/8 12:59 AM, Shaohua Li wrote:
>>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>>> [snip]
>>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The
>>>> raid0 zones make things a little complicated though. I just had a brief
>>>> look at your proposed patch, which looks really complicated. I'd suggest
>>>> something like this:
>>>> 1. split the bio according to zone boundary.
>>>> 2. handle the split bio. since the bio is within zone range,
>>>> calculating the start and end sector for each rdev should be easy.
>>>>
>>> Hi Shaohua,
>>>
>>> Thanks for your suggestion! I tried to modify the code per your suggestion;
>>> it is even harder to make the code work that way ...
>>>
>>> Because even if we split bios per zone, all the corner cases still exist
>>> and must be taken care of in every zone. The code will be more
>>> complicated.
>>>
>> Hi Coly,
>>
>> Did you manage to complete this patch? We are seeing the effects of this
>> issue not only with mkfs, but also with fstrim(8).
> Hi Avi,
>
> Shaohua wrote another, much better patch, which was merged into the
> mainline kernel in v4.12-rc2.
>
> The commit is '29efc390b946 ("md/md0: optimize raid0 discard handling")'.
>
> Hope this is informative.
>

Thanks a lot, that's great news.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs 2016-11-27 15:24 raid0 vs. mkfs Avi Kivity ` (2 preceding siblings ...) 2016-11-28 5:09 ` NeilBrown @ 2017-01-22 18:01 ` Avi Kivity 2017-01-23 12:26 ` Coly Li 3 siblings, 1 reply; 31+ messages in thread From: Avi Kivity @ 2017-01-22 18:01 UTC (permalink / raw) To: linux-raid Hello, On 11/27/2016 05:24 PM, Avi Kivity wrote: > mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large > disk that supports TRIM/DISCARD (erase whichever is inappropriate). > That is because mkfs issues a TRIM/DISCARD (erase whichever is > inappropriate) for the entire partition. As far as I can tell, md > converts the large TRIM/DISCARD (erase whichever is inappropriate) > into a large number of TRIM/DISCARD (erase whichever is inappropriate) > requests, one per chunk-size worth of disk, and issues them to the > RAID components individually. > > > It seems to me that md can convert the large TRIM/DISCARD (erase > whichever is inappropriate) request it gets into one TRIM/DISCARD > (erase whichever is inappropriate) per RAID component, converting an > O(disk size / chunk size) operation into an O(number of RAID > components) operation, which is much faster. > > > I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, > with the operation taking about a quarter of an hour, continuously > pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate) > requests to the disk. Linux 4.1.12. > Did anyone pick this up by any chance? The only thing I could find is more people complaining about the same issue. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: raid0 vs. mkfs
  2017-01-22 18:01 ` Avi Kivity
@ 2017-01-23 12:26   ` Coly Li
  0 siblings, 0 replies; 31+ messages in thread
From: Coly Li @ 2017-01-23 12:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2130 bytes --]

On 2017/1/23 2:01 AM, Avi Kivity wrote:
> Hello,
>
>
> On 11/27/2016 05:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate)
>> into a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the
>> RAID components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD
>> (erase whichever is inappropriate) per RAID component, converting an
>> O(disk size / chunk size) operation into an O(number of RAID
>> components) operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices,
>> with the operation taking about a quarter of an hour, continuously
>> pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate)
>> requests to the disk. Linux 4.1.12.
>>
>
> Did anyone pick this up by any chance? The only thing I could find is
> more people complaining about the same issue.

Hi Avi,

I proposed a POC patch; Shaohua and Neil provided review comments and
suggested I simplify it. If you look back in this thread, there is a
patch I sent out on Dec 9, 2016. That patch works, but it is not the
final version to be accepted upstream. I quote the performance numbers
here, “On 4x3TB NVMe raid0, format it with mkfs.xfs.
The current upstream kernel spends 306 seconds, the patched kernel spends
15 seconds. I see the average request size increase from 1 chunk (1024
sectors) to 2048 chunks (2097152 sectors).”

I call this the RFC v2 patch and attach it again to this email. I am now
working on a new version following the suggestions from Shaohua and Neil,
but any testing feedback on the RFC v2 patch is welcome. It works, it is
just too complex to review.

Thanks.

Coly

[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]

Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios

This is a very early prototype; it still needs more block layer code
modification to make it work. The current upstream raid0_make_request()
handles TRIM/DISCARD bios chunk by chunk, which means a large raid0
device built from SSDs will call generic_make_request() millions of
times for the split bios. This patch tries to combine the small bios
into a large one when they are contiguous on the same real device, then
sends the combined large bio to the underlying device with a single call
to generic_make_request().

For example, when mkfs.xfs trims a raid0 device built with 4 x 3TB NVMe
SSDs, the current upstream raid0_make_request() will call
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world yet, because in
block/blk-lib.c:__blkdev_issue_discard() the original large bio is split
into smaller ones by the restriction of discard_granularity. If some day
SSDs support whole-device-sized discard_granularity, it will be very
interesting...

The basic idea is: suppose a large discard bio is received by
raid0_make_request(), say one requesting to discard chunks 1 to 24 on a
raid0 device built from 4 SSDs.
This large discard bio will be split and laid out on each SSD as follows,

SSD1: C1,C5,C9,C13,C17,C21
SSD2: C2,C6,C10,C14,C18,C22
SSD3: C3,C7,C11,C15,C19,C23
SSD4: C4,C8,C12,C16,C20,C24

The current raid0 code will call generic_make_request() 24 times, once
for each split bio. But it is possible to calculate the final layout of
every split bio up front, so we can combine all the bios into four
per-SSD large bios, like this,

bio1 (on SSD1): C{1,5,9,13,17,21}
bio2 (on SSD2): C{2,6,10,14,18,22}
bio3 (on SSD3): C{3,7,11,15,19,23}
bio4 (on SSD4): C{4,8,12,16,20,24}

Now we only need to call generic_make_request() 4 times.

The code is not simple; I need more time to write text explaining how it
works. For now, please treat it as a proof of concept.

Changelog
v1, Initial prototype.
v2, Major changes include,
    - Rename functions: handle_discard_bio() now takes care of
      chunk-size DISCARD bios and the single-disk situation; large
      DISCARD bios are handled in handle_large_discard_bio().
    - Set max_discard_sectors to the raid0 device size.
    - Fix several bugs found in basic testing.
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+			"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+	sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the rested sectors, which is stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, the split bio on first rdev
+	 * may not be chunk size aligned too. But the rested split bios
+	 * on rested rdevs must be chunk size aligned, and aligned to
+	 * round down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		? sector & (~(chunk_sects - 1))
+		: chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+					"capability, ignore extra " \
+					"TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into next zone, we won't
+		 * handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initiate
+			 * .bi_sector firstly, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				recs[rdev_idx].sectors,
+				GFP_NOIO,
+				fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+			recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+		recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+			rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end - sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

^ permalink raw reply related	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2017-06-29 15:36 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-11-27 15:24 raid0 vs. mkfs Avi Kivity 2016-11-27 17:09 ` Coly Li 2016-11-27 17:25 ` Avi Kivity 2016-11-27 19:25 ` Doug Dumitru 2016-11-28 4:11 ` Chris Murphy 2016-11-28 7:28 ` Avi Kivity 2016-11-28 7:33 ` Avi Kivity 2016-11-28 5:09 ` NeilBrown 2016-11-28 6:08 ` Shaohua Li 2016-11-28 7:38 ` Avi Kivity 2016-11-28 8:40 ` NeilBrown 2016-11-28 8:58 ` Avi Kivity 2016-11-28 9:00 ` Christoph Hellwig 2016-11-28 9:11 ` Avi Kivity 2016-11-28 9:15 ` Coly Li 2016-11-28 17:47 ` Shaohua Li 2016-11-29 21:14 ` NeilBrown 2016-11-29 22:45 ` Avi Kivity 2016-12-07 5:08 ` Mike Snitzer 2016-12-07 11:50 ` Coly Li 2016-12-07 12:03 ` Coly Li 2016-12-07 16:59 ` Shaohua Li 2016-12-08 16:44 ` Coly Li 2016-12-08 19:19 ` Shaohua Li 2016-12-09 7:34 ` Coly Li 2016-12-12 3:17 ` NeilBrown 2017-06-29 15:15 ` Avi Kivity 2017-06-29 15:31 ` Coly Li 2017-06-29 15:36 ` Avi Kivity 2017-01-22 18:01 ` Avi Kivity 2017-01-23 12:26 ` Coly Li