All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kanchan Joshi <joshiiitr@gmail.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Kanchan Joshi <joshi.k@samsung.com>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"bcrl@kvack.org" <bcrl@kvack.org>,
	Matthew Wilcox <willy@infradead.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	SelvaKumar S <selvakuma.s1@samsung.com>,
	Nitesh Shetty <nj.shetty@samsung.com>,
	Javier Gonzalez <javier.gonz@samsung.com>
Subject: Re: [PATCH v4 6/6] io_uring: add support for zone-append
Date: Fri, 31 Jul 2020 12:38:27 +0530	[thread overview]
Message-ID: <CA+1E3rJ5j6MeG3O5Xa7unWLMRz6BacvLVN8xpeEz6AVyJWT55Q@mail.gmail.com> (raw)
In-Reply-To: <MWHPR04MB3758DC08EA17780E498E9EC0E74E0@MWHPR04MB3758.namprd04.prod.outlook.com>

On Fri, Jul 31, 2020 at 12:12 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2020/07/31 3:26, Kanchan Joshi wrote:
> > On Thu, Jul 30, 2020 at 11:24 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 7/30/20 11:51 AM, Kanchan Joshi wrote:
> >>> On Thu, Jul 30, 2020 at 11:10 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>>
> >>>> On 30/07/2020 20:16, Jens Axboe wrote:
> >>>>> On 7/30/20 10:26 AM, Pavel Begunkov wrote:
> >>>>>> On 30/07/2020 19:13, Jens Axboe wrote:
> >>>>>>> On 7/30/20 10:08 AM, Pavel Begunkov wrote:
> >>>>>>>> On 27/07/2020 23:34, Jens Axboe wrote:
> >>>>>>>>> On 7/27/20 1:16 PM, Kanchan Joshi wrote:
> >>>>>>>>>> On Fri, Jul 24, 2020 at 10:00 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 7/24/20 9:49 AM, Kanchan Joshi wrote:
> >>>>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >>>>>>>>>>>> index 7809ab2..6510cf5 100644
> >>>>>>>>>>>> --- a/fs/io_uring.c
> >>>>>>>>>>>> +++ b/fs/io_uring.c
> >>>>>>>>>>>> @@ -1284,8 +1301,15 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
> >>>>>>>>>>>>       cqe = io_get_cqring(ctx);
> >>>>>>>>>>>>       if (likely(cqe)) {
> >>>>>>>>>>>>               WRITE_ONCE(cqe->user_data, req->user_data);
> >>>>>>>>>>>> -             WRITE_ONCE(cqe->res, res);
> >>>>>>>>>>>> -             WRITE_ONCE(cqe->flags, cflags);
> >>>>>>>>>>>> +             if (unlikely(req->flags & REQ_F_ZONE_APPEND)) {
> >>>>>>>>>>>> +                     if (likely(res > 0))
> >>>>>>>>>>>> +                             WRITE_ONCE(cqe->res64, req->rw.append_offset);
> >>>>>>>>>>>> +                     else
> >>>>>>>>>>>> +                             WRITE_ONCE(cqe->res64, res);
> >>>>>>>>>>>> +             } else {
> >>>>>>>>>>>> +                     WRITE_ONCE(cqe->res, res);
> >>>>>>>>>>>> +                     WRITE_ONCE(cqe->flags, cflags);
> >>>>>>>>>>>> +             }
> >>>>>>>>>>>
> >>>>>>>>>>> This would be nice to keep out of the fast path, if possible.
> >>>>>>>>>>
> >>>>>>>>>> I was thinking of keeping a function-pointer (in io_kiocb) during
> >>>>>>>>>> submission. That would have avoided this check......but argument count
> >>>>>>>>>> differs, so it did not add up.
> >>>>>>>>>
> >>>>>>>>> But that'd grow the io_kiocb just for this use case, which is arguably
> >>>>>>>>> even worse. Unless you can keep it in the per-request private data,
> >>>>>>>>> but there's no more room there for the regular read/write side.
> >>>>>>>>>
> >>>>>>>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> >>>>>>>>>>>> index 92c2269..2580d93 100644
> >>>>>>>>>>>> --- a/include/uapi/linux/io_uring.h
> >>>>>>>>>>>> +++ b/include/uapi/linux/io_uring.h
> >>>>>>>>>>>> @@ -156,8 +156,13 @@ enum {
> >>>>>>>>>>>>   */
> >>>>>>>>>>>>  struct io_uring_cqe {
> >>>>>>>>>>>>       __u64   user_data;      /* sqe->data submission passed back */
> >>>>>>>>>>>> -     __s32   res;            /* result code for this event */
> >>>>>>>>>>>> -     __u32   flags;
> >>>>>>>>>>>> +     union {
> >>>>>>>>>>>> +             struct {
> >>>>>>>>>>>> +                     __s32   res;    /* result code for this event */
> >>>>>>>>>>>> +                     __u32   flags;
> >>>>>>>>>>>> +             };
> >>>>>>>>>>>> +             __s64   res64;  /* appending offset for zone append */
> >>>>>>>>>>>> +     };
> >>>>>>>>>>>>  };
> >>>>>>>>>>>
> >>>>>>>>>>> Is this a compatible change, both for now but also going forward? You
> >>>>>>>>>>> could randomly have IORING_CQE_F_BUFFER set, or any other future flags.
> >>>>>>>>>>
> >>>>>>>>>> Sorry, I didn't quite understand the concern. CQE_F_BUFFER is not
> >>>>>>>>>> used/set for write currently, so it looked compatible at this point.
> >>>>>>>>>
> >>>>>>>>> Not worried about that, since we won't ever use that for writes. But it
> >>>>>>>>> is a potential headache down the line for other flags, if they apply to
> >>>>>>>>> normal writes.
> >>>>>>>>>
> >>>>>>>>>> Yes, no room for future flags for this operation.
> >>>>>>>>>> Do you see any other way to enable this support in io-uring?
> >>>>>>>>>
> >>>>>>>>> Honestly I think the only viable option is as we discussed previously,
> >>>>>>>>> pass in a pointer to a 64-bit type where we can copy the additional
> >>>>>>>>> completion information to.
> >>>>>>>>
> >>>>>>>> TBH, I hate the idea of such overhead/latency at times when SSDs can
> >>>>>>>> serve writes in less than 10ms. Any chance you measured how long does it
> >>>>>>>
> >>>>>>> 10us? :-)
> >>>>>>
> >>>>>> Hah, 10us indeed :)
> >>>>>>
> >>>>>>>
> >>>>>>>> take to drag through task_work?
> >>>>>>>
> >>>>>>> A 64-bit value copy is really not a lot of overhead... But yes, we'd
> >>>>>>> need to push the completion through task_work at that point, as we can't
> >>>>>>> do it from the completion side. That's not a lot of overhead, and most
> >>>>>>> notably, it's overhead that only affects this particular type.
> >>>>>>>
> >>>>>>> That's not a bad starting point, and something that can always be
> >>>>>>> optimized later if need be. But I seriously doubt it'd be anything to
> >>>>>>> worry about.
> >>>>>>
> >>>>>> I probably need to look myself how it's really scheduled, but if you don't
> >>>>>> mind, here is a quick question: if we do work_add(task) when the task is
> >>>>>> running in the userspace, wouldn't the work execution wait until the next
> >>>>>> syscall/allotted time ends up?
> >>>>>
> >>>>> It'll get the task to enter the kernel, just like signal delivery. The only
> >>>>> tricky part is really if we have a dependency waiting in the kernel, like
> >>>>> the recent eventfd fix.
> >>>>
> >>>> I see, thanks for sorting this out!
> >>>
> >>> Few more doubts about this (please mark me wrong if that is the case):
> >>>
> >>> - Task-work makes me feel like N completions waiting to be served by
> >>> single task.
> >>> Currently completions keep arriving and CQEs would be updated with
> >>> result, but the user-space (submitter task) would not be poked.
> >>>
> >>> - Completion-code will set the task-work. But post that it cannot go
> >>> immediately to its regular business of picking cqe and updating
> >>> res/flags, as we cannot afford user-space to see the cqe before the
> >>> pointer update. So it seems completion-code needs to spawn another
> >>> work which will allocate/update cqe after waiting for pointer-update
> >>> from task-work?
> >>
> >> The task work would post the completion CQE for the request after
> >> writing the offset.
> >
> > Got it, thank you for making it simple.
> > Overall if I try to put the tradeoffs of moving to indirect-offset
> > (compared to current scheme)–
> >
> > Upside:
> > - cqe res/flags would be intact, avoids future-headaches as you mentioned
> > - short-write cases do not have to be failed in lower-layers (as
> > cqe->res is there to report bytes-copied)
>
> I personally think it is a super bad idea to allow short asynchronous append
> writes. The interface should allow the async zone append write to proceed only
> and only if it can be stuffed entirely into a single BIO which necessarilly will
> be a single request on the device side. Otherwise, the application would have no
> guarantees as to where a split may happen, and since this is zone append, the
> next async append will not leave any hole to complete a previous short write.
> This will wreak the structure of the application data.
>
> For the sync case, this is fine. The application can just issue a new append
> write with the remaining unwritten data from the previous append write. But in
> the async case, if one write == one data record (e.g. a key-value tuple for an
> SSTable in an LSM tree), then allowing a short write will destroy the record:
> the partial write will be garbage data that will need garbage collection...

There are cases when short-write is fine, isn't it? For example I can
serve only 8K write (either because of space, or because of those file
limits), but application sent 12K.....iov_iter_gets truncated to 8K
and the write is successful. At least that's what O_APPEND and
RWF_APPEND behaves currently.
But in the current scheme there is no way to report number-of-bytes
copied in io-uring, so I had to fail such short-write in lower-layer
(which does not know whether it is talking to io_uring or aio).
Failing such short-write is perhaps fine for zone-appened, but is it
fine for generic file-append?

> > Downside:
> > - We may not be able to use RWF_APPEND, and need exposing a new
> > type/flag (RWF_INDIRECT_OFFSET etc.) user-space. Not sure if this
> > sounds outrageous, but is it OK to have uring-only flag which can be
> > combined with RWF_APPEND?
>
> Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently meaningless for
> raw block device accesses. We could certainly define a meaning for these in the
> context of zoned block devices.
But application using O_APPEND/RWF_APPEND does not pass a pointer to
be updated by kernel.
While in kernel we would expect that, and may start writing something
which is not a pointer.

> I already commented on the need for first defining an interface (flags etc) and
> its semantic (e.g. do we allow short zone append or not ? What happens for
> regular files ? etc). Did you read my comment ? We really need to first agree on
> something to clarify what needs to be done.

I read and was planning to respond, sorry. But it seemed important to
get the clarity on the uring-interface, as this seems to decide how
this whole thing looks like (to application and to lower layers as
well).

> > -  Expensive compared to sending results in cqe itself. But I agree
> > that this may not be major, and only for one type of write.
> >
> >
>
>
> --
> Damien Le Moal
> Western Digital Research



-- 
Joshi

  parent reply	other threads:[~2020-07-31  7:08 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20200724155244epcas5p2902f57e36e490ee8772da19aa9408cdc@epcas5p2.samsung.com>
2020-07-24 15:49 ` [PATCH v4 0/6] zone-append support in io-uring and aio Kanchan Joshi
     [not found]   ` <CGME20200724155258epcas5p1a75b926950a18cd1e6c8e7a047e6c589@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 1/6] fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND Kanchan Joshi
2020-07-24 16:34       ` Jens Axboe
2020-07-26 15:18       ` Christoph Hellwig
2020-07-28  1:49         ` Matthew Wilcox
2020-07-28  7:26           ` Christoph Hellwig
     [not found]   ` <CGME20200724155324epcas5p18e1d3b4402d1e4a8eca87d0b56a3fa9b@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 2/6] fs: change ki_complete interface to support 64bit ret2 Kanchan Joshi
2020-07-26 15:18       ` Christoph Hellwig
     [not found]   ` <CGME20200724155329epcas5p345ba6bad0b8fe18056bb4bcd26c10019@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 3/6] uio: return status with iov truncation Kanchan Joshi
     [not found]   ` <CGME20200724155341epcas5p15bfc55927f2abb60f19784270fe8e377@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 4/6] block: add zone append handling for direct I/O path Kanchan Joshi
2020-07-26 15:19       ` Christoph Hellwig
     [not found]   ` <CGME20200724155346epcas5p2cfb383fe9904a45280c6145f4c13e1b4@epcas5p2.samsung.com>
2020-07-24 15:49     ` [PATCH v4 5/6] block: enable zone-append for iov_iter of bvec type Kanchan Joshi
2020-07-26 15:20       ` Christoph Hellwig
     [not found]   ` <CGME20200724155350epcas5p3b8f1d59eda7f8fbb38c828f692d42fd6@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 6/6] io_uring: add support for zone-append Kanchan Joshi
2020-07-24 16:29       ` Jens Axboe
2020-07-27 19:16         ` Kanchan Joshi
2020-07-27 20:34           ` Jens Axboe
2020-07-30 16:08             ` Pavel Begunkov
2020-07-30 16:13               ` Jens Axboe
2020-07-30 16:26                 ` Pavel Begunkov
2020-07-30 17:16                   ` Jens Axboe
2020-07-30 17:38                     ` Pavel Begunkov
2020-07-30 17:51                       ` Kanchan Joshi
2020-07-30 17:54                         ` Jens Axboe
2020-07-30 18:25                           ` Kanchan Joshi
2020-07-31  6:42                             ` Damien Le Moal
2020-07-31  6:45                               ` hch
2020-07-31  6:59                                 ` Damien Le Moal
2020-07-31  7:58                                   ` Kanchan Joshi
2020-07-31  8:14                                     ` Damien Le Moal
2020-07-31  9:14                                       ` hch
2020-07-31  9:34                                         ` Damien Le Moal
2020-07-31  9:41                                           ` hch
2020-07-31 10:16                                             ` Damien Le Moal
2020-07-31 12:51                                               ` hch
2020-07-31 13:08                                                 ` hch
2020-07-31 15:07                                                   ` Kanchan Joshi
2022-03-02 20:47                                                   ` Luis Chamberlain
2020-08-05  7:35                                                 ` Damien Le Moal
2020-08-14  8:14                                                   ` hch
2020-08-14  8:27                                                     ` Damien Le Moal
2020-08-14 12:04                                                       ` hch
2020-08-14 12:20                                                         ` Damien Le Moal
2020-09-07  7:01                                                     ` Kanchan Joshi
2020-09-08 15:18                                                       ` hch
2020-09-24 17:19                                                         ` Kanchan Joshi
2020-09-25  2:52                                                           ` Damien Le Moal
2020-09-28 18:58                                                             ` Kanchan Joshi
2020-09-29  1:24                                                               ` Damien Le Moal
2020-09-29 18:49                                                                 ` Kanchan Joshi
2022-03-02 20:43                                                         ` Luis Chamberlain
2020-07-31  9:38                                       ` Kanchan Joshi
2022-03-02 20:51                                 ` Luis Chamberlain
2020-07-31  7:08                               ` Kanchan Joshi [this message]
2020-07-30 15:57       ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+1E3rJ5j6MeG3O5Xa7unWLMRz6BacvLVN8xpeEz6AVyJWT55Q@mail.gmail.com \
    --to=joshiiitr@gmail.com \
    --cc=Damien.LeMoal@wdc.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=hch@infradead.org \
    --cc=io-uring@vger.kernel.org \
    --cc=javier.gonz@samsung.com \
    --cc=joshi.k@samsung.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nj.shetty@samsung.com \
    --cc=selvakuma.s1@samsung.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.