linux-scsi.vger.kernel.org archive mirror
* Re: [RFC PATCH v2 0/2] add simple copy support
       [not found] ` <20201204094659.12732-1-selvakuma.s1@samsung.com>
@ 2020-12-07 14:11   ` Christoph Hellwig
  2020-12-07 14:56     ` Hannes Reinecke
  2020-12-07 19:14     ` Javier González
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2020-12-07 14:11 UTC (permalink / raw)
  To: SelvaKumar S
  Cc: linux-nvme, kbusch, axboe, damien.lemoal, hch, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	javier.gonz, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

So, I'm really worried about:

 a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
    does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
    to common code would also be really nice.  I'm not 100% sure it should
    be a requirement, but it sure would be nice to have
    I don't think just adding an ioctl is enough of a use case for complex
    kernel infrastructure.
 b) We had a bunch of different attempts at SCSI XCOPY support from IIRC
    Martin, Bart and Mikulas.  I think we need to pull them into this
    discussion, and make sure whatever we do covers the SCSI needs.

On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The specification can be found at the following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> This is an RFC. Looking forward to any feedback or other alternate
> designs for plumbing simple copy into the IO stack.
> 
> The simple copy command is a copy offloading operation used to copy
> multiple contiguous ranges (source_ranges) of LBAs to a single destination
> LBA within the device, reducing traffic between host and device.
> 
> This implementation accepts a destination, the number of sources and an array
> of source ranges from the application, attaches them as a payload to the bio
> and submits it to the device.
> 
> The following limits are added to the queue limits and are exposed in sysfs
> to userspace:
> 	- *max_copy_sectors* limits the sum of all source_range lengths
> 	- *max_copy_nr_ranges* limits the number of source ranges
> 	- *max_copy_range_sectors* limits the maximum number of sectors
> 		that can constitute a single source range.
> 
> Changes from v1:
> 
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>    bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
> 
> SelvaKumar S (2):
>   block: add simple copy support
>   nvme: add simple copy support
> 
>  block/blk-core.c          |  94 ++++++++++++++++++++++++++---
>  block/blk-lib.c           | 123 ++++++++++++++++++++++++++++++++++++++
>  block/blk-merge.c         |   2 +
>  block/blk-settings.c      |  11 ++++
>  block/blk-sysfs.c         |  23 +++++++
>  block/blk-zoned.c         |   1 +
>  block/bounce.c            |   1 +
>  block/ioctl.c             |  43 +++++++++++++
>  drivers/nvme/host/core.c  |  87 +++++++++++++++++++++++++++
>  include/linux/bio.h       |   1 +
>  include/linux/blk_types.h |  15 +++++
>  include/linux/blkdev.h    |  15 +++++
>  include/linux/nvme.h      |  43 ++++++++++++-
>  include/uapi/linux/fs.h   |  13 ++++
>  14 files changed, 461 insertions(+), 11 deletions(-)
> 
> -- 
> 2.25.1
---end quoted text---
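
For reference, the three limits quoted above constrain a copy request roughly
as in the user-space sketch below; the struct and field names are illustrative
only and are not the proposed uapi.

/*
 * Illustration only: shows how the three proposed queue limits bound a
 * copy request.  None of these names are the actual uapi.
 */
#include <stdbool.h>
#include <stdint.h>

struct copy_range {                     /* one contiguous source range, in sectors */
    uint64_t src_sector;
    uint64_t nr_sectors;
};

struct copy_limits {                    /* as read from the proposed sysfs entries */
    uint64_t max_copy_sectors;          /* bounds the sum of all range lengths */
    uint64_t max_copy_range_sectors;    /* bounds a single range length */
    uint32_t max_copy_nr_ranges;        /* bounds the number of ranges */
};

static bool copy_fits_limits(const struct copy_limits *l,
                             const struct copy_range *r, uint32_t nr)
{
    uint64_t total = 0;
    uint32_t i;

    if (!nr || nr > l->max_copy_nr_ranges)
        return false;
    for (i = 0; i < nr; i++) {
        if (!r[i].nr_sectors || r[i].nr_sectors > l->max_copy_range_sectors)
            return false;
        total += r[i].nr_sectors;
    }
    return total <= l->max_copy_sectors;
}

int main(void)
{
    struct copy_limits lim = { 2048, 512, 8 };
    struct copy_range ranges[] = { { 0, 256 }, { 1024, 256 } };

    return copy_fits_limits(&lim, ranges, 2) ? 0 : 1;
}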

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 14:11   ` [RFC PATCH v2 0/2] add simple copy support Christoph Hellwig
@ 2020-12-07 14:56     ` Hannes Reinecke
  2020-12-07 19:24       ` Javier González
                         ` (2 more replies)
  2020-12-07 19:14     ` Javier González
  1 sibling, 3 replies; 15+ messages in thread
From: Hannes Reinecke @ 2020-12-07 14:56 UTC (permalink / raw)
  To: Christoph Hellwig, SelvaKumar S
  Cc: linux-nvme, kbusch, axboe, damien.lemoal, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	javier.gonz, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

On 12/7/20 3:11 PM, Christoph Hellwig wrote:
> So, I'm really worried about:
> 
>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>      does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
>      to common code would also be really nice.  I'm not 100% sure it should
>      be a requirement, but it sure would be nice to have
>      I don't think just adding an ioctl is enough of a use case for complex
>      kernel infrastructure.
>   b) We had a bunch of different attempts at SCSI XCOPY support from IIRC
>      Martin, Bart and Mikulas.  I think we need to pull them into this
>      discussion, and make sure whatever we do covers the SCSI needs.
> 
And we shouldn't forget that the main issue which killed all previous 
implementations was a missing QoS guarantee.
It's nice to have simple copy, but if the implementation is _slower_ 
than doing it by hand from the OS there is very little point in even 
attempting to do so.
I can't see any provisions for that in the TPAR, leading me to the 
assumption that NVMe simple copy will suffer from the same issue.

So if we can't address this I guess this attempt will fail, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 14:11   ` [RFC PATCH v2 0/2] add simple copy support Christoph Hellwig
  2020-12-07 14:56     ` Hannes Reinecke
@ 2020-12-07 19:14     ` Javier González
  1 sibling, 0 replies; 15+ messages in thread
From: Javier González @ 2020-12-07 19:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: SelvaKumar S, linux-nvme, kbusch, axboe, damien.lemoal, sagi,
	linux-block, linux-kernel, dm-devel, snitzer, selvajove,
	nj.shetty, joshi.k, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

On 07.12.2020 15:11, Christoph Hellwig wrote:
>So, I'm really worried about:
>
> a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>    does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
>    to common code would also be really nice.  I'm not 100% sure it should
>    be a requirement, but it sure would be nice to have
>    I don't think just adding an ioctl is enough of a use case for complex
>    kernel infrastructure.

We are looking at dm-kcopyd. I would have liked to start with a very
specific use case and build from there, but I see Damien's and Keith's
point of having a default path. I believe we can add this to the next
version.

> b) We had a bunch of different attempts at SCSI XCOPY support from IIRC
>    Martin, Bart and Mikulas.  I think we need to pull them into this
>    discussion, and make sure whatever we do covers the SCSI needs.

Agree. We discussed a lot about the scope and agreed that everything
outside of the specifics of Simple Copy requires input from those who have
worked on XCOPY support in the past.

>
>On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>> v2020.05.04 ("Ratified")
>>
>> The specification can be found at the following link.
>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>
>> This is an RFC. Looking forward to any feedback or other alternate
>> designs for plumbing simple copy into the IO stack.
>>
>> The simple copy command is a copy offloading operation used to copy
>> multiple contiguous ranges (source_ranges) of LBAs to a single destination
>> LBA within the device, reducing traffic between host and device.
>>
>> This implementation accepts a destination, the number of sources and an array
>> of source ranges from the application, attaches them as a payload to the bio
>> and submits it to the device.
>>
>> The following limits are added to the queue limits and are exposed in sysfs
>> to userspace:
>> 	- *max_copy_sectors* limits the sum of all source_range lengths
>> 	- *max_copy_nr_ranges* limits the number of source ranges
>> 	- *max_copy_range_sectors* limits the maximum number of sectors
>> 		that can constitute a single source range.
>>
>> Changes from v1:
>>
>> 1. Fix memory leak in __blkdev_issue_copy
>> 2. Unmark blk_check_copy inline
>> 3. Fix line break in blk_check_copy_eod
>> 4. Remove p checks and made code more readable
>> 5. Don't use bio_set_op_attrs and remove op and set
>>    bi_opf directly
>> 6. Use struct_size to calculate total_size
>> 7. Fix partition remap of copy destination
>> 8. Remove mcl,mssrl,msrc from nvme_ns
>> 9. Initialize copy queue limits to 0 in nvme_config_copy
>> 10. Remove return in QUEUE_FLAG_COPY check
>> 11. Remove unused OCFS
>>
>> SelvaKumar S (2):
>>   block: add simple copy support
>>   nvme: add simple copy support
>>
>>  block/blk-core.c          |  94 ++++++++++++++++++++++++++---
>>  block/blk-lib.c           | 123 ++++++++++++++++++++++++++++++++++++++
>>  block/blk-merge.c         |   2 +
>>  block/blk-settings.c      |  11 ++++
>>  block/blk-sysfs.c         |  23 +++++++
>>  block/blk-zoned.c         |   1 +
>>  block/bounce.c            |   1 +
>>  block/ioctl.c             |  43 +++++++++++++
>>  drivers/nvme/host/core.c  |  87 +++++++++++++++++++++++++++
>>  include/linux/bio.h       |   1 +
>>  include/linux/blk_types.h |  15 +++++
>>  include/linux/blkdev.h    |  15 +++++
>>  include/linux/nvme.h      |  43 ++++++++++++-
>>  include/uapi/linux/fs.h   |  13 ++++
>>  14 files changed, 461 insertions(+), 11 deletions(-)
>>
>> --
>> 2.25.1
>---end quoted text---

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 14:56     ` Hannes Reinecke
@ 2020-12-07 19:24       ` Javier González
  2020-12-08  8:40         ` Johannes Thumshirn
  2020-12-07 22:12       ` Douglas Gilbert
  2020-12-09  3:02       ` Martin K. Petersen
  2 siblings, 1 reply; 15+ messages in thread
From: Javier González @ 2020-12-07 19:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, SelvaKumar S, linux-nvme, kbusch, axboe,
	damien.lemoal, sagi, linux-block, linux-kernel, dm-devel,
	snitzer, selvajove, nj.shetty, joshi.k, Martin K. Petersen,
	Bart Van Assche, Mikulas Patocka, linux-scsi

On 07.12.2020 15:56, Hannes Reinecke wrote:
>On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>>So, I'm really worried about:
>>
>>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>     does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
>>     to common code would also be really nice.  I'm not 100% sure it should
>>     be a requirement, but it sure would be nice to have
>>     I don't think just adding an ioctl is enough of a use case for complex
>>     kernel infrastructure.
>>  b) We had a bunch of different attempts at SCSI XCOPY support from IIRC
>>     Martin, Bart and Mikulas.  I think we need to pull them into this
>>     discussion, and make sure whatever we do covers the SCSI needs.
>>
>And we shouldn't forget that the main issue which killed all previous 
>implementations was a missing QoS guarantee.
>It's nice to have simple copy, but if the implementation is _slower_ 
>than doing it by hand from the OS there is very little point in even 
>attempting to do so.
>I can't see any provisions for that in the TPAR, leading me to the 
>assumption that NVMe simple copy will suffer from the same issue.
>
>So if we can't address this I guess this attempt will fail, too.

Good point. We can share some performance data on how Simple Copy scales
in terms of bw / latency and the CPU usage. Do you have anything else in
mind?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 14:56     ` Hannes Reinecke
  2020-12-07 19:24       ` Javier González
@ 2020-12-07 22:12       ` Douglas Gilbert
  2020-12-08  6:44         ` Hannes Reinecke
  2020-12-09  3:02       ` Martin K. Petersen
  2 siblings, 1 reply; 15+ messages in thread
From: Douglas Gilbert @ 2020-12-07 22:12 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, SelvaKumar S
  Cc: linux-nvme, kbusch, axboe, damien.lemoal, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	javier.gonz, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>> So, I'm really worried about:
>>
>>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>      does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
>>      to common code would also be really nice.  I'm not 100% sure it should
>>      be a requirement, but it sure would be nice to have
>>      I don't think just adding an ioctl is enough of a use case for complex
>>      kernel infrastructure.
>>   b) We had a bunch of different attempts at SCSI XCOPY support from IIRC
>>      Martin, Bart and Mikulas.  I think we need to pull them into this
>>      discussion, and make sure whatever we do covers the SCSI needs.
>>
> And we shouldn't forget that the main issue which killed all previous 
> implementations was a missing QoS guarantee.
> It's nice to have simple copy, but if the implementation is _slower_ than doing 
> it by hand from the OS there is very little point in even attempting to do so.
> I can't see any provisions for that in the TPAR, leading me to the assumption 
> that NVMe simple copy will suffer from the same issue.
> 
> So if we can't address this I guess this attempt will fail, too.

I have been doing quite a lot of work and testing in my sg driver rewrite
in the copy and compare area. The baselines for performance are dd and
io_uring-cp (in liburing). There are lots of ways to improve on them. Here
are some:
    - the user data need never pass through the user space (could
      mmap it out during the READ if there is a good reason). Only the
      metadata (e.g. NVMe or SCSI commands) needs to come from the user
      space and errors, if any, reported back to the user space.
    - break a large copy (or compare) into segments, with each segment
      a "comfortable" size for the OS to handle, say 256 KB
    - there is one constraint: the READ in each segment must complete
      before its paired WRITE can commence
      - extra constraint for some zoned disks: WRITEs must be
        issued in order (assuming they are applied in that order, if
        not, need to wait until each WRITE completes)
    - arrange for READ WRITE pair in each segment to share the same bio
    - have multiple slots each holding a segment (i.e. a bio and
      metadata to process a READ-WRITE pair)
    - re-use each slot's bio for the following READ-WRITE pair
    - issue the READs in each slot asynchronously and do an interleaved
      (io)poll for completion. Then issue the paired WRITE
      asynchronously
    - the above "slot" algorithm runs in one thread, so there can be
      multiple threads doing the same algorithm. Segment manager needs
      to be locked (or use atomics) so that each segment (identified
      by its starting LBAs) is issued once and only once when the
      next thread wants a segment to copy

Running multiple threads gives diminishing or even worsening returns.
Runtime metrics on lock contention and storage bus capacity may help
choosing the number of threads. A simpler approach might be to add more
threads until the combined throughput increase is less than 10% say.
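
To make the segment/ordering constraint concrete, below is a stripped-down,
single-threaded user-space sketch with 256 KB segments and synchronous
pread/pwrite. The slot-based design described above additionally overlaps
segments with asynchronous READs and (io)polling, so treat this purely as an
illustration of the READ-before-paired-WRITE rule.

/* Single-threaded sketch: each segment's READ completes before its paired
 * WRITE is issued.  No slots, no async submission, no polling. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEGMENT_BYTES (256 * 1024)

static int copy_in_segments(int src_fd, int dst_fd, off_t off, off_t len)
{
    char *buf = malloc(SEGMENT_BYTES);

    if (!buf)
        return -1;
    while (len > 0) {
        size_t seg = len > SEGMENT_BYTES ? SEGMENT_BYTES : (size_t)len;
        ssize_t got = pread(src_fd, buf, seg, off);

        if (got <= 0)
            break;
        /* the paired WRITE may only start once the READ has completed */
        if (pwrite(dst_fd, buf, (size_t)got, off) != got)
            break;
        off += got;
        len -= got;
    }
    free(buf);
    return len == 0 ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s <src> <dst> <bytes>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY);

    if (src < 0 || dst < 0)
        return 1;
    return copy_in_segments(src, dst, 0, (off_t)atoll(argv[3])) ? 1 : 0;
}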


The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command
(or NVMe NVM Compare command). Using dd logic, a disk to disk compare can
be implemented with not much more work than changing the WRITE to a VERIFY
command. This is a different approach to the Linux cmp utility which
READs in both sides and does a memcmp() type operation. Using ramdisks
(from the scsi_debug driver) the compare operation (max ~ 10 GB/s) was
actually faster than the copy (max ~ 7 GB/s). I put this down to WRITE
operations taking a write lock over the store while the VERIFY only
needs a read lock so many VERIFY operations can co-exist on the same
store. Unfortunately on real SAS and NVMe SSDs that I tested the
performance of the VERIFY and NVM Compare commands is underwhelming.
For comparison, using scsi_debug ramdisks, dd copy throughput was
< 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen
3600 based.

Doug Gilbert

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 22:12       ` Douglas Gilbert
@ 2020-12-08  6:44         ` Hannes Reinecke
  2020-12-08 12:21           ` Javier González
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2020-12-08  6:44 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig, SelvaKumar S
  Cc: linux-nvme, kbusch, axboe, damien.lemoal, sagi, linux-block,
	linux-kernel, dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	javier.gonz, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

On 12/7/20 11:12 PM, Douglas Gilbert wrote:
> On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>>> So, I'm really worried about:
>>>
>>>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>>      does accelerating dm-kcopyd.  I agree with Damien that lifting 
>>> dm-kcopyd
>>>      to common code would also be really nice.  I'm not 100% sure it 
>>> should
>>>      be a requirement, but it sure would be nice to have
>>>      I don't think just adding an ioctl is enough of a use case for 
>>> complex
>>>      kernel infrastructure.
>>>   b) We had a bunch of different attempts at SCSI XCOPY support from 
>>> IIRC
>>>      Martin, Bart and Mikulas.  I think we need to pull them into this
>>>      discussion, and make sure whatever we do covers the SCSI needs.
>>>
>> And we shouldn't forget that the main issue which killed all previous 
>> implementations was a missing QoS guarantee.
>> It's nice to have simple copy, but if the implementation is _slower_ 
>> than doing it by hand from the OS there is very little point in even 
>> attempting to do so.
>> I can't see any provisions for that in the TPAR, leading me to the 
>> assumption that NVMe simple copy will suffer from the same issue.
>>
>> So if we can't address this I guess this attempt will fail, too.
> 
> I have been doing quite a lot of work and testing in my sg driver rewrite
> in the copy and compare area. The baselines for performance are dd and
> io_uring-cp (in liburing). There are lots of ways to improve on them. Here
> are some:
>     - the user data need never pass through the user space (could
>       mmap it out during the READ if there is a good reason). Only the
>       metadata (e.g. NVMe or SCSI commands) needs to come from the user
>       space and errors, if any, reported back to the user space.
>     - break a large copy (or compare) into segments, with each segment
>       a "comfortable" size for the OS to handle, say 256 KB
>     - there is one constraint: the READ in each segment must complete
>       before its paired WRITE can commence
>       - extra constraint for some zoned disks: WRITEs must be
>         issued in order (assuming they are applied in that order, if
>         not, need to wait until each WRITE completes)
>     - arrange for READ WRITE pair in each segment to share the same bio
>     - have multiple slots each holding a segment (i.e. a bio and
>       metadata to process a READ-WRITE pair)
>     - re-use each slot's bio for the following READ-WRITE pair
>     - issue the READs in each slot asynchronously and do an interleaved
>       (io)poll for completion. Then issue the paired WRITE
>       asynchronously
>     - the above "slot" algorithm runs in one thread, so there can be
>       multiple threads doing the same algorithm. Segment manager needs
>       to be locked (or use atomics) so that each segment (identified
>       by its starting LBAs) is issued once and only once when the
>       next thread wants a segment to copy
> 
> Running multiple threads gives diminishing or even worsening returns.
> Runtime metrics on lock contention and storage bus capacity may help
> choosing the number of threads. A simpler approach might be to add more
> threads until the combined throughput increase is less than 10% say.
> 
> 
> The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command
> (or NVMe NVM Compare command). Using dd logic, a disk to disk compare can
> be implemented with not much more work than changing the WRITE to a VERIFY
> command. This is a different approach to the Linux cmp utility which
> READs in both sides and does a memcmp() type operation. Using ramdisks
> (from the scsi_debug driver) the compare operation (max ~ 10 GB/s) was
> actually faster than the copy (max ~ 7 GB/s). I put this down to WRITE
> operations taking a write lock over the store while the VERIFY only
> needs a read lock so many VERIFY operations can co-exist on the same
> store. Unfortunately on real SAS and NVMe SSDs that I tested the
> performance of the VERIFY and NVM Compare commands is underwhelming.
> For comparison, using scsi_debug ramdisks, dd copy throughput was
> < 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen
> 3600 based.
> 
Which is precisely my concern.
Simple copy might be efficient for one particular implementation, but it 
might be completely off the board for others.
But both will be claiming to support it, leaving us with no idea whether 
choosing simple copy will speed things up or not.
Without having a programmatic way to figure out the speed of the 
implementation we have to detect the performance ourselves, like the 
benchmarking loop RAID5 does.
I was hoping to avoid that, and just ask the device/controller; but that 
turned out to be in vain.
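
For illustration, the kind of detection I mean would be in the spirit of the
raid6 boot-time loop: time both paths over the same sample region and keep the
faster one. copy_via_emulation() and copy_via_offload() below are placeholders
for the two backends, not existing interfaces.

/* "Measure both, pick the faster" selection sketch.  The two copy_via_*()
 * functions are stand-ins for the emulated and offloaded copy paths. */
#include <stdio.h>
#include <time.h>

typedef int (*copy_fn)(unsigned long long sectors);

static int copy_via_emulation(unsigned long long sectors) { (void)sectors; return 0; }
static int copy_via_offload(unsigned long long sectors) { (void)sectors; return 0; }

static double time_one(copy_fn fn, unsigned long long sectors)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    fn(sectors);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const unsigned long long sample = 1ULL << 20;   /* ~512 MB worth of 512B sectors */
    double emul = time_one(copy_via_emulation, sample);
    double offl = time_one(copy_via_offload, sample);

    printf("selecting %s copy path\n", offl < emul ? "offloaded" : "emulated");
    return 0;
}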

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 19:24       ` Javier González
@ 2020-12-08  8:40         ` Johannes Thumshirn
  2020-12-08 12:22           ` Javier González
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Thumshirn @ 2020-12-08  8:40 UTC (permalink / raw)
  To: Javier González, Hannes Reinecke
  Cc: Christoph Hellwig, SelvaKumar S, linux-nvme, kbusch, axboe,
	Damien Le Moal, sagi, linux-block, linux-kernel, dm-devel,
	snitzer, selvajove, nj.shetty, joshi.k, Martin K. Petersen,
	Bart Van Assche, Mikulas Patocka, linux-scsi

On 07/12/2020 20:27, Javier González wrote:
> Good point. We can share some performance data on how Simple Copy scales
> in terms of bw / latency and the CPU usage. Do you have anything else in
> mind?
> 

With an emulation in the kernel, we could make the used "backend" 
implementation configurable. So if the emulation is faster, users can select
the emulation; if the device is faster, the device.

Kind of what the crypto and raid code do as well.

I'm really interested in this work, as BTRFS relocation/balance will have 
potential benefits, but we need to get it right.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08  6:44         ` Hannes Reinecke
@ 2020-12-08 12:21           ` Javier González
  0 siblings, 0 replies; 15+ messages in thread
From: Javier González @ 2020-12-08 12:21 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: dgilbert, Christoph Hellwig, SelvaKumar S, linux-nvme, kbusch,
	axboe, damien.lemoal, sagi, linux-block, linux-kernel, dm-devel,
	snitzer, selvajove, nj.shetty, joshi.k, Martin K. Petersen,
	Bart Van Assche, Mikulas Patocka, linux-scsi

On 08.12.2020 07:44, Hannes Reinecke wrote:
>On 12/7/20 11:12 PM, Douglas Gilbert wrote:
>>On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>>>On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>>>>So, I'm really worried about:
>>>>
>>>>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>>>     does accelerating dm-kcopyd.  I agree with Damien that 
>>>>lifting dm-kcopyd
>>>>     to common code would also be really nice.  I'm not 100% 
>>>>sure it should
>>>>     be a requirement, but it sure would be nice to have
>>>>     I don't think just adding an ioctl is enough of a use case 
>>>>for complex
>>>>     kernel infrastructure.
>>>>  b) We had a bunch of different attempts at SCSI XCOPY support 
>>>>from IIRC
>>>>     Martin, Bart and Mikulas.  I think we need to pull them into this
>>>>     discussion, and make sure whatever we do covers the SCSI needs.
>>>>
>>>And we shouldn't forget that the main issue which killed all 
>>>previous implementations was a missing QoS guarantee.
>>>It's nice to have simple copy, but if the implementation is 
>>>_slower_ than doing it by hand from the OS there is very little 
>>>point in even attempting to do so.
>>>I can't see any provisions for that in the TPAR, leading me to the 
>>>assumption that NVMe simple copy will suffer from the same issue.
>>>
>>>So if we can't address this I guess this attempt will fail, too.
>>
>>I have been doing quite a lot of work and testing in my sg driver rewrite
>>in the copy and compare area. The baselines for performance are dd and
>>io_uring-cp (in liburing). There are lots of ways to improve on them. Here
>>are some:
>>    - the user data need never pass through the user space (could
>>      mmap it out during the READ if there is a good reason). Only the
>>      metadata (e.g. NVMe or SCSI commands) needs to come from the user
>>      space and errors, if any, reported back to the user space.
>>    - break a large copy (or compare) into segments, with each segment
>>      a "comfortable" size for the OS to handle, say 256 KB
>>    - there is one constraint: the READ in each segment must complete
>>      before its paired WRITE can commence
>>      - extra constraint for some zoned disks: WRITEs must be
>>        issued in order (assuming they are applied in that order, if
>>        not, need to wait until each WRITE completes)
>>    - arrange for READ WRITE pair in each segment to share the same bio
>>    - have multiple slots each holding a segment (i.e. a bio and
>>      metadata to process a READ-WRITE pair)
>>    - re-use each slot's bio for the following READ-WRITE pair
>>    - issue the READs in each slot asynchronously and do an interleaved
>>      (io)poll for completion. Then issue the paired WRITE
>>      asynchronously
>>    - the above "slot" algorithm runs in one thread, so there can be
>>      multiple threads doing the same algorithm. Segment manager needs
>>      to be locked (or use atomics) so that each segment (identified
>>      by its starting LBAs) is issued once and only once when the
>>      next thread wants a segment to copy
>>
>>Running multiple threads gives diminishing or even worsening returns.
>>Runtime metrics on lock contention and storage bus capacity may help
>>choosing the number of threads. A simpler approach might be to add more
>>threads until the combined throughput increase is less than 10% say.
>>
>>
>>The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command
>>(or NVMe NVM Compare command). Using dd logic, a disk to disk compare can
>>be implemented with not much more work than changing the WRITE to a VERIFY
>>command. This is a different approach to the Linux cmp utility which
>>READs in both sides and does a memcmp() type operation. Using ramdisks
>>(from the scsi_debug driver) the compare operation (max ~ 10 GB/s) was
>>actually faster than the copy (max ~ 7 GB/s). I put this down to WRITE
>>operations taking a write lock over the store while the VERIFY only
>>needs a read lock so many VERIFY operations can co-exist on the same
>>store. Unfortunately on real SAS and NVMe SSDs that I tested the
>>performance of the VERIFY and NVM Compare commands is underwhelming.
>>For comparison, using scsi_debug ramdisks, dd copy throughput was
>>< 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen
>>3600 based.
>>
>Which is precisely my concern.
>Simple copy might be efficient for one particular implementation, but 
>it might be completely off the board for others.
>But both will be claiming to support it, leaving us with no idea whether 
>choosing simple copy will speed things up or not.
>Without having a programmatic way to figure out the speed of the 
>implementation we have to detect the performance ourselves, like the 
>benchmarking loop RAID5 does.
>I was hoping to avoid that, and just ask the device/controller; but 
>that turned out to be in vain.

I believe it makes sense to do extensive characterization to understand
how the host and device implementations behave. However, I do not believe
we will get far if the requirement is that any acceleration has to
outperform the legacy path under all circumstances and implementations.

At this moment in time, this is a feature very much targeted at
eliminating the extra read/write traffic generated by ZNS host GC.

This said, we do see the value in aligning with existing efforts to
offload copy under other use cases, so if you have a set of tests we can
run to speak the same language, we would be happy to take them and adapt
them to the fio extensions we have posted for testing this too.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08  8:40         ` Johannes Thumshirn
@ 2020-12-08 12:22           ` Javier González
  2020-12-08 12:37             ` Johannes Thumshirn
  0 siblings, 1 reply; 15+ messages in thread
From: Javier González @ 2020-12-08 12:22 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hannes Reinecke, Christoph Hellwig, SelvaKumar S, linux-nvme,
	kbusch, axboe, Damien Le Moal, sagi, linux-block, linux-kernel,
	dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi

On 08.12.2020 08:40, Johannes Thumshirn wrote:
>On 07/12/2020 20:27, Javier González wrote:
>> Good point. We can share some performance data on how Simple Copy scales
>> in terms of bw / latency and the CPU usage. Do you have anything else in
>> mind?
>>
>
>With an emulation in the kernel, we could make the used "backend"
>implementation configurable. So if the emulation is faster, users can select
>the emulation; if the device is faster, the device.
>
>Kind of what the crypto and raid code do as well.

Good idea. Are you thinking of a sysfs entry to select the backend?
>
>I'm really interested in this work, as BTRFS relocation/balance will have
>potential benefits, but we need to get it right.

Agree. We will post a V3 with emulation, addressing the other comments.
We can take it from there. If you have comments on V2, please send them
our way and we will take them in.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08 12:22           ` Javier González
@ 2020-12-08 12:37             ` Johannes Thumshirn
  2020-12-08 13:13               ` Javier González
  2020-12-15 23:45               ` Pavel Machek
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2020-12-08 12:37 UTC (permalink / raw)
  To: Javier González
  Cc: Hannes Reinecke, Christoph Hellwig, SelvaKumar S, linux-nvme,
	kbusch, axboe, Damien Le Moal, sagi, linux-block, linux-kernel,
	dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi

On 08/12/2020 13:22, Javier González wrote:
> Good idea. Are you thinking of a sysfs entry to select the backend?

Not sure on this one, initially I thought of a sysfs file, but then
how would you do it. One "global" sysfs entry is probably a bad idea.
Having one per block device to select native vs emulation maybe? And
a good way to benchmark.

The other idea would be a benchmark loop on boot like the raid library
does.

Then on the other hand, there might be workloads that run faster with 
the emulation and some that run faster with the hardware acceleration.

I think these points are the reason the last attempts got stuck.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08 12:37             ` Johannes Thumshirn
@ 2020-12-08 13:13               ` Javier González
  2020-12-08 13:24                 ` Johannes Thumshirn
  2020-12-15 23:45               ` Pavel Machek
  1 sibling, 1 reply; 15+ messages in thread
From: Javier González @ 2020-12-08 13:13 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hannes Reinecke, Christoph Hellwig, SelvaKumar S, linux-nvme,
	kbusch, axboe, Damien Le Moal, sagi, linux-block, linux-kernel,
	dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi

On 08.12.2020 12:37, Johannes Thumshirn wrote:
>On 08/12/2020 13:22, Javier González wrote:
>> Good idea. Are you thinking of a sysfs entry to select the backend?
>
>Not sure on this one, initially I thought of a sysfs file, but then
>how would you do it. One "global" sysfs entry is probably a bad idea.
>Having one per block device to select native vs emulation maybe? And
>a good way to benchmark.

I was thinking of a per block device entry to target the use case where a
certain implementation / workload is better one way or the other.

>
>The other idea would be a benchmark loop on boot like the raid library
>does.
>
>Then on the other hand, there might be workloads that run faster with
>the emulation and some that run faster with the hardware acceleration.
>
>I think these points are the reason the last attempts got stuck.

Yes. I believe that any benchmark we run would be biased in a certain
way. If we can move forward with a sysfs entry and default to legacy
path, we would not alter current behavior and enable NVMe copy offload
(for now) for those that want to use it. We can then build on top of it.
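
Concretely, the knob I have in mind is no more than a per-device attribute
along these lines; the sysfs path and the 0/1 semantics are hypothetical until
we settle on the actual interface.

/* Hypothetical per-device selection of the copy backend.  The attribute
 * name/path and values are placeholders, not an existing interface. */
#include <stdio.h>

static int set_copy_offload(const char *disk, int enable)
{
    char path[256];
    FILE *f;

    /* e.g. /sys/block/nvme0n1/queue/copy_offload (hypothetical) */
    snprintf(path, sizeof(path), "/sys/block/%s/queue/copy_offload", disk);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", enable);     /* 0 = emulated/legacy (default), 1 = native */
    return fclose(f);
}

int main(void)
{
    /* opt a single device into the native simple copy path */
    return set_copy_offload("nvme0n1", 1) ? 1 : 0;
}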

Does this sound like a reasonable approach?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08 13:13               ` Javier González
@ 2020-12-08 13:24                 ` Johannes Thumshirn
  2020-12-09  9:17                   ` Javier González
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Thumshirn @ 2020-12-08 13:24 UTC (permalink / raw)
  To: Javier González
  Cc: Hannes Reinecke, Christoph Hellwig, SelvaKumar S, linux-nvme,
	kbusch, axboe, Damien Le Moal, sagi, linux-block, linux-kernel,
	dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi

On 08/12/2020 14:13, Javier González wrote:
> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>> On 08/12/2020 13:22, Javier González wrote:
>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>
>> Not sure on this one, initially I thought of a sysfs file, but then
>> how would you do it. One "global" sysfs entry is probably a bad idea.
>> Having one per block device to select native vs emulation maybe? And
>> a good way to benchmark.
> 
> I was thinking of a per block device entry to target the use case where a
> certain implementation / workload is better one way or the other.

Yes something along those lines.

>>
>> The other idea would be a benchmark loop on boot like the raid library
>> does.
>>
>> Then on the other hand, there might be workloads that run faster with
>> the emulation and some that run faster with the hardware acceleration.
>>
>> I think these points are the reason the last attempts got stuck.
> 
> Yes. I believe that any benchmark we run would be biased in a certain
> way. If we can move forward with a sysfs entry and default to legacy
> path, we would not alter current behavior and enable NVMe copy offload
> (for now) for those that want to use it. We can then build on top of it.
> 
> Does this sound like a reasonable approach?
> 

Yes this sounds like a reasonable approach to me.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-07 14:56     ` Hannes Reinecke
  2020-12-07 19:24       ` Javier González
  2020-12-07 22:12       ` Douglas Gilbert
@ 2020-12-09  3:02       ` Martin K. Petersen
  2 siblings, 0 replies; 15+ messages in thread
From: Martin K. Petersen @ 2020-12-09  3:02 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, SelvaKumar S, linux-nvme, kbusch, axboe,
	damien.lemoal, sagi, linux-block, linux-kernel, dm-devel,
	snitzer, selvajove, nj.shetty, joshi.k, javier.gonz,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi


Hannes,

[Sorry I'm late to the discussion here, had a few fires going in
addition to the end of the kernel cycle]

> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.

That and the fact that our arbitrary stacking situation was hard to
resolve.

The QoS guarantee was somewhat addressed by Fred in T10. But I agree we
need some sort of toggle.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08 13:24                 ` Johannes Thumshirn
@ 2020-12-09  9:17                   ` Javier González
  0 siblings, 0 replies; 15+ messages in thread
From: Javier González @ 2020-12-09  9:17 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hannes Reinecke, Christoph Hellwig, SelvaKumar S, linux-nvme,
	kbusch, axboe, Damien Le Moal, sagi, linux-block, linux-kernel,
	dm-devel, snitzer, selvajove, nj.shetty, joshi.k,
	Martin K. Petersen, Bart Van Assche, Mikulas Patocka, linux-scsi

On 08.12.2020 13:24, Johannes Thumshirn wrote:
>On 08/12/2020 14:13, Javier González wrote:
>> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>>> On 08/12/2020 13:22, Javier González wrote:
>>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>>
>>> Not sure on this one, initially I thought of a sysfs file, but then
>>> how would you do it. One "global" sysfs entry is probably a bad idea.
>>> Having one per block device to select native vs emulation maybe? And
>>> a good way to benchmark.
>>
>> I was thinking of a per block device entry to target the use case where a
>> certain implementation / workload is better one way or the other.
>
>Yes something along those lines.
>
>>>
>>> The other idea would be a benchmark loop on boot like the raid library
>>> does.
>>>
>>> Then on the other hand, there might be workloads that run faster with
>>> the emulation and some that run faster with the hardware acceleration.
>>>
>>> I think these points are the reason the last attempts got stuck.
>>
>> Yes. I believe that any benchmark we run would be biased in a certain
>> way. If we can move forward with a sysfs entry and default to legacy
>> path, we would not alter current behavior and enable NVMe copy offload
>> (for now) for those that want to use it. We can then build on top of it.
>>
>> Does this sound like a reasonable approach?
>>
>
>Yes this sounds like a reasonable approach to me.

Cool. We will add this to the V3 then.

Thanks Johannes!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/2] add simple copy support
  2020-12-08 12:37             ` Johannes Thumshirn
  2020-12-08 13:13               ` Javier González
@ 2020-12-15 23:45               ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Machek @ 2020-12-15 23:45 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Javier González, Hannes Reinecke, Christoph Hellwig,
	SelvaKumar S, linux-nvme, kbusch, axboe, Damien Le Moal, sagi,
	linux-block, linux-kernel, dm-devel, snitzer, selvajove,
	nj.shetty, joshi.k, Martin K. Petersen, Bart Van Assche,
	Mikulas Patocka, linux-scsi

Hi!

> > Good idea. Are you thinking of a sysfs entry to select the backend?
> 
> Not sure on this one, initially I thought of a sysfs file, but then
> how would you do it. One "global" sysfs entry is probably a bad idea.
> Having one per block device to select native vs emulation maybe? And
> a good way to benchmark.
> 
> The other idea would be a benchmark loop on boot like the raid library
> does.

Doing copy benchmarking would require writes on the media, right?

Kernel should not do such stuff without userspace requesting it...

Best regards,
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-12-16  0:08 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20201204094719epcas5p23b3c41223897de3840f92ae3c229cda5@epcas5p2.samsung.com>
     [not found] ` <20201204094659.12732-1-selvakuma.s1@samsung.com>
2020-12-07 14:11   ` [RFC PATCH v2 0/2] add simple copy support Christoph Hellwig
2020-12-07 14:56     ` Hannes Reinecke
2020-12-07 19:24       ` Javier González
2020-12-08  8:40         ` Johannes Thumshirn
2020-12-08 12:22           ` Javier González
2020-12-08 12:37             ` Johannes Thumshirn
2020-12-08 13:13               ` Javier González
2020-12-08 13:24                 ` Johannes Thumshirn
2020-12-09  9:17                   ` Javier González
2020-12-15 23:45               ` Pavel Machek
2020-12-07 22:12       ` Douglas Gilbert
2020-12-08  6:44         ` Hannes Reinecke
2020-12-08 12:21           ` Javier González
2020-12-09  3:02       ` Martin K. Petersen
2020-12-07 19:14     ` Javier González
