From: Bart Van Assche <bvanassche@acm.org>
To: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"dm-devel@redhat.com" <dm-devel@redhat.com>,
	"lsf-pc@lists.linux-foundation.org" 
	<lsf-pc@lists.linux-foundation.org>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
	"hare@suse.de" <hare@suse.de>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Stephen Bates <sbates@raithlin.com>,
	"msnitzer@redhat.com" <msnitzer@redhat.com>,
	"mpatocka@redhat.com" <mpatocka@redhat.com>,
	"zach.brown@ni.com" <zach.brown@ni.com>,
	"roland@purestorage.com" <roland@purestorage.com>,
	"rwheeler@redhat.com" <rwheeler@redhat.com>,
	"frederick.knight@netapp.com" <frederick.knight@netapp.com>,
	Matias Bjorling <Matias.Bjorling@wdc.com>
Subject: Re: [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
Date: Wed, 8 Jan 2020 19:18:54 -0800	[thread overview]
Message-ID: <fda88fd3-2d75-085e-ca15-a29f89c1e781@acm.org> (raw)
In-Reply-To: <BYAPR04MB5749820C322B40C7DBBBCA02863F0@BYAPR04MB5749.namprd04.prod.outlook.com>

On 2020-01-07 10:14, Chaitanya Kulkarni wrote:
> * Current state of the work :-
> -----------------------------------------------------------------------
> 
> With [3] it is hard to handle arbitrary DM/MD stacking without
> splitting the command in two, one for copying IN and one for copying
> OUT, which is why [4] demonstrates that [3] is not a suitable
> candidate. Also, with [4] there is an unresolved problem with the
> two-command approach: how to handle changes to the DM layout between
> the IN and OUT operations.

Was this last discussed during the 2018 edition of LSF/MM (see also
https://www.spinics.net/lists/linux-block/msg24986.html)? Has anyone
taken notes during that session? I haven't found a report of that
session in the official proceedings (https://lwn.net/Articles/752509/).

Thanks,

Bart.


These are my own two-year-old notes about copy offloading for the
Linux kernel:

Potential Users
* All dm-kcopyd users, e.g. dm-cache-target, dm-raid1, dm-snap, dm-thin,
  dm-writecache and dm-zoned (a dm-kcopyd sketch follows this list).
* Local filesystems like BTRFS, f2fs and bcachefs: garbage collection
  and RAID, at least if RAID is supported by the filesystem. Note: the
  BTRFS_IOC_CLONE_RANGE ioctl is no longer supported. Applications
  should use FICLONERANGE instead.
* Network filesystems, e.g. NFS. Copying at the server side can reduce
  network traffic significantly.
* Linux SCSI initiator systems connected to SAN systems such that
  copying can happen locally on the storage array. XCOPY is widely used
  for provisioning virtual machine images.
* Copy offloading in NVMe fabrics using PCIe peer-to-peer communication.
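
As referenced from the first item above, dm-kcopyd is the in-kernel copy
machinery that a block-layer copy offload could eventually accelerate. A
minimal sketch of how a target submits a copy through it is shown below;
the helper names, region sizes and callback are made up for illustration,
while dm_io_region, dm_kcopyd_copy() and dm_kcopyd_client_create() are
the real interfaces from <linux/dm-kcopyd.h>:

#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>

/* Completion callback: called once all destinations have been written. */
static void example_copy_done(int read_err, unsigned long write_err,
			      void *context)
{
	/* propagate errors to the caller identified by @context */
}

/* Copy 1 MiB (2048 sectors) from @src to @dst, asynchronously. */
static void example_copy(struct dm_kcopyd_client *kc,
			 struct block_device *src, struct block_device *dst,
			 void *context)
{
	struct dm_io_region from = { .bdev = src, .sector = 0, .count = 2048 };
	struct dm_io_region to   = { .bdev = dst, .sector = 0, .count = 2048 };

	dm_kcopyd_copy(kc, &from, 1, &to, 0, example_copy_done, context);
}

/*
 * The client itself is created once, e.g. in the target constructor,
 * with dm_kcopyd_client_create() and released with
 * dm_kcopyd_client_destroy().
 */

Today dm-kcopyd carries out such a copy by reading the data into kernel
pages and writing it back out, which is exactly the host data movement a
copy offload command would avoid.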

Requirements
* The block layer must gain support for XCOPY. The new XCOPY API must
  support asynchronous operation such that users of this API are not
  blocked while the XCOPY operation is in progress (a sketch of such an
  interface follows this list).
* Copying must be supported not only within a single storage device but
  also between storage devices.
* The SCSI sd driver must gain support for XCOPY.
* A user space API must be added and that API must support asynchronous
  (non-blocking) operation.
* The block layer XCOPY primitive must be supported by the device mapper.
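
To make the first requirement concrete, here is a purely hypothetical
sketch of what an asynchronous block-layer copy interface could look
like; none of these names (struct blk_copy_args, blkdev_issue_copy)
exist in the mainline kernel, they only illustrate the shape of the API
being asked for:

#include <linux/blk_types.h>
#include <linux/blkdev.h>

/* Hypothetical: describes one copy operation between two block devices. */
struct blk_copy_args {
	struct block_device	*src_bdev;
	sector_t		 src_sector;
	struct block_device	*dst_bdev;
	sector_t		 dst_sector;
	sector_t		 nr_sectors;
	/* Completion callback; may be invoked from interrupt context. */
	void			(*end_copy)(struct blk_copy_args *args,
					    blk_status_t status);
	void			*private;
};

/*
 * Hypothetical: submits the copy and returns immediately; the caller is
 * notified through args->end_copy(), so dm-kcopyd-style users are never
 * blocked while the copy is in flight. Copies between different block
 * devices are allowed.
 */
int blkdev_issue_copy(struct blk_copy_args *args, gfp_t gfp_mask);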

SCSI Extended Copy (ANSI T10 SPC)
The SCSI commands that support extended copy operations are:
* POPULATE TOKEN + WRITE USING TOKEN.
* EXTENDED COPY(LID1/4) + RECEIVE COPY STATUS(LID1/4). LID1 stands for a
  List Identifier length of 1 byte and LID4 stands for a List Identifier
  length of 4 bytes.
* SPC-3 and before define EXTENDED COPY(LID1) (83h/00h). SPC-4 added
  EXTENDED COPY(LID4) (83h/01h).
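
For reference, a hedged user-space sketch of how such a command can be
issued through the SG_IO ioctl. The CDB layout shown (opcode 83h,
parameter list length in bytes 10..13) follows SPC, but the target and
segment descriptor payload is deliberately omitted, so this is plumbing
only, not a working copy command:

#include <fcntl.h>
#include <string.h>
#include <scsi/sg.h>
#include <sys/ioctl.h>

/*
 * Sketch: issue an EXTENDED COPY(LID1) CDB via SG_IO. The parameter
 * list (target and segment descriptors) must be filled in by the
 * caller according to SPC before this does anything useful.
 */
static int issue_xcopy_lid1(int sg_fd, unsigned char *param,
			    unsigned int param_len)
{
	unsigned char cdb[16] = { 0x83 };	/* EXTENDED COPY(LID1) */
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	/* Parameter list length, big-endian, CDB bytes 10..13. */
	cdb[10] = param_len >> 24;
	cdb[11] = param_len >> 16;
	cdb[12] = param_len >> 8;
	cdb[13] = param_len;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_TO_DEV;
	hdr.dxferp = param;
	hdr.dxfer_len = param_len;
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 60000;			/* milliseconds */

	return ioctl(sg_fd, SG_IO, &hdr);
}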

Existing Users and Implementations of SCSI XCOPY
* VMware, which uses XCOPY (with a one-byte length ID, aka LID1).
* Microsoft, which uses ODX (aka LID4 because it has a four-byte length
  ID).
* Storage vendors all support XCOPY; ODX support is growing.

Block Layer Notes
The block layer supports the following types of block drivers:
* blk-mq request-based drivers.
* make_request drivers.

Notes:
A list of bios is associated with each request.
Since submit_bio() accepts a single bio rather than a bio list, all
make_request block drivers process one bio at a time.
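
For illustration, a minimal kernel-style sketch of submitting a single
bio with the API as it looked around the v5.x kernels of this thread;
the helper function itself is made up, while bio_alloc(), bio_set_dev()
and submit_bio() are the real interfaces:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical helper: read one page from @bdev starting at @sector.
 * submit_bio() takes exactly one bio, so a caller that wants to issue
 * many I/Os submits them one at a time.
 */
static void submit_one_read(struct block_device *bdev, sector_t sector,
			    struct page *page, bio_end_io_t *done)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ;
	bio->bi_end_io = done;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	submit_bio(bio);	/* asynchronous: completion arrives via ->bi_end_io */
}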

Device Mapper
The device mapper core supports both bio-based and request-based
(blk-mq) processing. The
function in the device mapper that creates a request queue is called
alloc_dev(). That function not only allocates a request queue but also
associates a struct gendisk with the request queue. The
DM_DEV_CREATE_CMD ioctl triggers a call of alloc_dev(). The
DM_TABLE_LOAD ioctl loads a table definition. Loading a table definition
causes the type of a dm device to be set to one of the following:
DM_TYPE_NONE;
DM_TYPE_BIO_BASED;
DM_TYPE_REQUEST_BASED;
DM_TYPE_MQ_REQUEST_BASED;
DM_TYPE_DAX_BIO_BASED;
DM_TYPE_NVME_BIO_BASED.

Device mapper drivers must implement target_type.map(),
target_type.clone_and_map_rq() or both. .map() maps a bio list.
.clone_and_map_rq() maps a single request. The multipath and error
device mapper drivers implement both methods. All other dm drivers only
implement the .map() method.
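
A minimal sketch of a bio-based .map() method, modeled loosely on
dm-linear; the target name and context structure are invented for
illustration, while target_type, dm_target and DM_MAPIO_REMAPPED are
the real device mapper interfaces:

#include <linux/bio.h>
#include <linux/device-mapper.h>
#include <linux/module.h>

/* Made-up per-target context for this sketch. */
struct passthrough_ctx {
	struct dm_dev *dev;
	sector_t start;
};

/* .map() receives one cloned bio at a time and redirects it. */
static int passthrough_map(struct dm_target *ti, struct bio *bio)
{
	struct passthrough_ctx *pc = ti->private;

	bio_set_dev(bio, pc->dev->bdev);
	if (bio_sectors(bio))
		bio->bi_iter.bi_sector = pc->start +
			dm_target_offset(ti, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;	/* dm core resubmits the remapped bio */
}

static struct target_type passthrough_target = {
	.name	 = "passthrough-example",	/* hypothetical target name */
	.version = {1, 0, 0},
	.module	 = THIS_MODULE,
	.map	 = passthrough_map,
	/* .ctr/.dtr omitted for brevity; a real target must provide them. */
};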

Device mapper bio processing
submit_bio()
-> generic_make_request()
  -> dm_make_request()
    -> __dm_make_request()
      -> __split_and_process_bio()
        -> __split_and_process_non_flush()
          -> __clone_and_map_data_bio()
          -> alloc_tio()
          -> clone_bio()
            -> bio_advance()
          -> __map_bio()

Existing Linux Copy Offload APIs
* The FICLONERANGE ioctl (a usage sketch follows this list). From
  <include/linux/fs.h>:
  #define FICLONERANGE _IOW(0x94, 13, struct file_clone_range)

struct file_clone_range {
	__s64 src_fd;
	__u64 src_offset;
	__u64 src_length;
	__u64 dest_offset;
};

* The sendfile() system call. sendfile() copies a given number of bytes
  from one file to another. The output offset is the offset of the
  output file descriptor. The input offset is either the input file
  descriptor offset or can be specified explicitly. The sendfile()
  prototype is as follows:
  ssize_t sendfile(int out_fd, int in_fd, off_t *ppos, size_t count);
  ssize_t sendfile64(int out_fd, int in_fd, loff_t *ppos, size_t count);
* The copy_file_range() system call. See also vfs_copy_file_range(). Its
  prototype is as follows:
  ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
     loff_t *off_out, size_t len, unsigned int flags);
* The splice() system call is not appropriate for adding extended copy
  functionality since it copies data from or to a pipe. Its prototype is
  as follows:
  ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
    loff_t *off_out, size_t len, unsigned int flags);
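
As referenced from the FICLONERANGE item above, a small user-space
sketch that first asks the filesystem to clone a range and falls back
to copy_file_range() when cloning is not possible; the file names and
copy length are placeholders:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int src = open("src.img", O_RDONLY);		/* placeholder paths */
	int dst = open("dst.img", O_WRONLY | O_CREAT, 0644);
	struct file_clone_range fcr = {
		.src_fd = src,
		.src_offset = 0,
		.src_length = 1 << 20,	/* clone the first 1 MiB */
		.dest_offset = 0,
	};

	if (src < 0 || dst < 0)
		return 1;

	/* Reflink the range; only works on filesystems that support it. */
	if (ioctl(dst, FICLONERANGE, &fcr) == 0)
		return 0;

	if (errno != EOPNOTSUPP && errno != EXDEV && errno != EINVAL) {
		perror("FICLONERANGE");
		return 1;
	}

	/* Fall back to an in-kernel copy; offsets advance automatically. */
	for (size_t left = 1 << 20; left > 0; ) {
		ssize_t n = copy_file_range(src, NULL, dst, NULL, left, 0);
		if (n < 0) {
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)		/* end of source file */
			break;
		left -= n;
	}
	return 0;
}

Without offload, the kernel carries out the fallback path by reading
and writing the data itself; a block-layer XCOPY primitive would let
that data movement stay inside the storage.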

Existing Linux Block Layer Copy Offload Implementations
* Martin Petersen's REQ_COPY bio, where the source and destination block
  devices are both specified in the same bio. Only works for block
  devices. Does not work for files. Adds a new blocking ioctl() for
  XCOPY from user space.
* Mikulas Patocka's approach: separate REQ_OP_COPY_WRITE and
  REQ_OP_COPY_READ operations. These are sent individually down stacked
  drivers and are paired by the driver at the bottom of the stack.
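
Since neither series is in mainline, the following is only a schematic
illustration of the difference between the two approaches; every name
in it is invented. In the single-bio model the payload carries both
endpoints, while in the two-operation model a READ-side and a
WRITE-side operation each carry a token that the bottom driver uses to
pair them:

#include <linux/types.h>
#include <linux/blk_types.h>

/* Single-bio model (REQ_COPY): one bio describes the whole copy. */
struct copy_payload_single {
	struct block_device	*src_bdev;
	sector_t		 src_sector;
	sector_t		 nr_sectors;
	/* The destination is the bio's own bi_bdev / bi_iter.bi_sector. */
};

/* Two-operation model: READ and WRITE halves paired at the bottom. */
struct copy_token {
	u64			 token;		/* matches the two halves   */
	struct block_device	*src_bdev;	/* filled in by the READ half */
	sector_t		 src_sector;
	sector_t		 nr_sectors;
};

The pairing step is where the problem quoted at the top of this mail
shows up: if the DM layout changes between the two halves, the token
may no longer identify the same physical extent.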


