linux-fsdevel.vger.kernel.org archive mirror
* Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests
       [not found]         ` <87sgensmsk.fsf@vostro.rath.org>
@ 2020-06-22  6:37           ` Amir Goldstein
  2020-06-22  7:35             ` Nikolaus Rath
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-06-22  6:37 UTC (permalink / raw)
  To: fuse-devel
  Cc: linux-fsdevel, Miklos Szeredi, Nikolaus Rath, Matthew Wilcox,
	Dave Chinner

[+CC fsdevel folks]

On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@rath.org> wrote:
>
> On Jun 21 2020, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> I am not sure that is correct. At step 6, the write() request from
> >> userspace is still being processed. I don't think that it is reasonable
> >> to expect that the write() request is atomic, i.e. you can't expect to
> >> see none or all of the data that is *currently being written*.
> >
> > Apparently the standard is quite clear on this:
> >
> >   "All of the following functions shall be atomic with respect to each
> > other in the effects specified in POSIX.1-2017 when they operate on
> > regular files or symbolic links:
> >
> > [...]
> > pread()
> > read()
> > readv()
> > pwrite()
> > write()
> > writev()
> > [...]
> >
> > If two threads each call one of these functions, each call shall
> > either see all of the specified effects of the other call, or none of
> > them."[1]
> >
> > Thanks,
> > Miklos
> >
> > [1]
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
>
> Thanks for digging this up, I did not know about this.
>
> That leaves FUSE in a rather uncomfortable place though, doesn't it?
> What does the kernel do when userspace issues a write request that's
> bigger than the FUSE userspace pipe? It sounds like either the request
> must be split (so it becomes non-atomic), or you'd have to return a short
> write (which IIRC is not supposed to happen for local filesystems).
>

What makes you say that short writes are not supposed to happen?
And what is the definition of "local filesystem" in that claim?

FYI, a similar discussion is also happening about XFS "atomic rw" behavior [1].

Seems like the options for FUSE are:
- Take shared i_rwsem lock on read like XFS and regress performance of
  mixed rw workload
- Do the above only for non-direct and writeback_cache to minimize the
  damage potential
- Return short read/write for direct IO if request is bigger than FUSE
  buffer size
- Add a FUSE mode that implements direct IO internally as something like
  RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
  a stricter version of "cache write-through" in the sense that, during an
  ongoing large write operation, a read of the freshly written bytes is
  served only from the client cache copy and not from the server.
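
To make the first option concrete, here is a minimal userspace analogy in
Python, not kernel code. It assumes a write is split into several requests
and that readers take a shared lock while the writer holds an exclusive one
for the whole write. SharedExclusiveLock, ToyFile and count_torn_reads are
hypothetical names made up for this sketch; the real i_rwsem and the FUSE
request path are considerably more involved.

```python
import threading


class SharedExclusiveLock:
    """Minimal reader-writer lock: many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class ToyFile:
    """In-memory stand-in for a file whose writes are split into requests."""

    def __init__(self, size, max_request):
        self.data = bytearray(size)
        self.max_request = max_request
        self.lock = SharedExclusiveLock()

    def write(self, offset, buf):
        self.lock.acquire_exclusive()
        try:
            # Each loop iteration stands in for one request to the server.
            for start in range(0, len(buf), self.max_request):
                chunk = buf[start:start + self.max_request]
                pos = offset + start
                self.data[pos:pos + len(chunk)] = chunk
        finally:
            self.lock.release_exclusive()
        return len(buf)

    def read(self, offset, length):
        self.lock.acquire_shared()
        try:
            return bytes(self.data[offset:offset + length])
        finally:
            self.lock.release_shared()


def count_torn_reads(rounds=200, size=64, max_request=8):
    """Run one writer and three readers; count reads that mix old and new data."""
    f = ToyFile(size, max_request)
    torn = []

    def writer():
        for i in range(rounds):
            f.write(0, (b"X" if i % 2 else b"Y") * size)

    def reader():
        for _ in range(rounds):
            snapshot = f.read(0, size)
            if len(set(snapshot)) != 1:  # mixed byte values mean a torn read
                torn.append(snapshot)

    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=reader) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(torn)
```

The cost mentioned in the option shows up here as well: every read has to
wait for the whole multi-request write to finish, which is what would hurt
mixed rw workloads.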

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200622010234.GD2040@dread.disaster.area/
[2] https://lore.kernel.org/linux-fsdevel/20191217143948.26380-1-axboe@kernel.dk/

* Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests
  2020-06-22  6:37           ` [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests Amir Goldstein
@ 2020-06-22  7:35             ` Nikolaus Rath
  2020-06-22  7:57               ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Nikolaus Rath @ 2020-06-22  7:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: fuse-devel, linux-fsdevel, Miklos Szeredi, Matthew Wilcox, Dave Chinner

On Jun 22 2020, Amir Goldstein <amir73il@gmail.com> wrote:
> What makes you say that short writes are not supposed to happen?

I don't think it was an authoritative source, but I've repeatedly read
that "you do not have to worry about short reads/writes when accessing
the local disk". I expect this to be a common expectation baked
into programs, whether it is valid or not.
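
For reference, a program that does not rely on that expectation retries
until everything is written. The following is a minimal sketch in Python
using os.write; write_all is a hypothetical helper name, and the zero-byte
guard is an assumption added here to avoid looping forever.

```python
import os


def write_all(fd, data):
    """Keep calling os.write() until every byte of data has been written."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write() may accept fewer bytes than requested (a short write).
        n = os.write(fd, view[total:])
        if n == 0:
            raise OSError("write() made no progress")
        total += n
    return total
```

Code that calls write() once and ignores the return value is the kind of
program that would break if short writes started to appear.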

> Seems like the options for FUSE are:
> - Take shared i_rwsem lock on read like XFS and regress performance of
>   mixed rw workload
> - Do the above only for non-direct and writeback_cache to minimize the
>   damage potential
> - Return short read/write for direct IO if request is bigger than FUSE
>   buffer size
> - Add a FUSE mode that implements direct IO internally as something like
>   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
>   a stricter version of "cache write-through" in the sense that, during an
>   ongoing large write operation, a read of the freshly written bytes is
>   served only from the client cache copy and not from the server.

I didn't understand all of that, but it seems to me that there is a
fundamental problem with splitting up a single write into multiple FUSE
requests, because the second request may fail after the first one
succeeds. 
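
A minimal sketch of that failure mode in Python, assuming a large write is
sent as a sequence of smaller requests. write_in_requests, send_request and
make_flaky_server are hypothetical names for this illustration and are not
part of FUSE. The sketch follows the usual write() convention of reporting
the bytes already written when a later request fails, which turns the
failure into a short write.

```python
import errno


def write_in_requests(send_request, offset, buf, max_request):
    """Send buf as several requests of at most max_request bytes each."""
    written = 0
    while written < len(buf):
        chunk = buf[written:written + max_request]
        try:
            n = send_request(offset + written, chunk)
        except OSError:
            if written > 0:
                return written  # earlier requests landed: report a short write
            raise               # nothing was written: propagate the error
        written += n
        if n < len(chunk):
            break               # the server itself accepted only part of it
    return written


def make_flaky_server(fail_on_call):
    """Return a fake request handler that raises EIO on the given call number."""
    calls = {"count": 0}

    def send_request(offset, chunk):
        calls["count"] += 1
        if calls["count"] == fail_on_call:
            raise OSError(errno.EIO, "simulated server failure")
        return len(chunk)

    return send_request
```

With a server that fails on its second request, an 8-byte write split into
4-byte requests returns 4: the first half is already on the server and
cannot be undone.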

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

* Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests
  2020-06-22  7:35             ` Nikolaus Rath
@ 2020-06-22  7:57               ` Amir Goldstein
  2020-06-26  5:27                 ` Nikolaus Rath
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-06-22  7:57 UTC (permalink / raw)
  To: Amir Goldstein, fuse-devel, linux-fsdevel, Miklos Szeredi,
	Matthew Wilcox, Dave Chinner

On Mon, Jun 22, 2020 at 10:35 AM Nikolaus Rath <Nikolaus@rath.org> wrote:
>
> On Jun 22 2020, Amir Goldstein <amir73il@gmail.com> wrote:
> > What makes you say that short writes are not supposed to happen?
>
> I don't think it was an authoritative source, but I've repeatedly read
> that "you do not have to worry about short reads/writes when accessing
> the local disk". I expect this to be a common expectation baked
> into programs, whether it is valid or not.
>

Even if that statement were considered true, since when can we speak of
FUSE as a "local filesystem"?
IMO it has all the characteristics of a "network filesystem".

> > Seems like the options for FUSE are:
> > - Take shared i_rwsem lock on read like XFS and regress performance of
> >   mixed rw workload
> > - Do the above only for non-direct and writeback_cache to minimize the
> >   damage potential
> > - Return short read/write for direct IO if request is bigger than FUSE
> >   buffer size
> > - Add a FUSE mode that implements direct IO internally as something like
> >   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
> >   a stricter version of "cache write-through" in the sense that, during an
> >   ongoing large write operation, a read of the freshly written bytes is
> >   served only from the client cache copy and not from the server.
>
> I didn't understand all of that, but it seems to me that there is a
> fundamental problem with splitting up a single write into multiple FUSE
> requests, because the second request may fail after the first one
> succeeds.
>

I think you are confused by the use of the word "atomic" in the standard.
It does not mean what the O_ATOMIC proposal means, that is - write everything
or write nothing at all.
It means if thread A successfully wrote data X over data Y, then thread B can
either read X or Y, but not half X half Y.
If A got an error on write, the content that B will read is probably undefined
(excuse me for not reading what "the law" has to say about this).
If A got a short (half) write, then surely B can read either half X or half Y
from the first half range. Second half range I am not sure what to expect.
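
The "half X half Y" case can be shown with a small deterministic sketch in
Python. It assumes a write of X over Y is applied in two pieces with nothing
stopping a reader from looking in between; split_write and
observe_split_write are hypothetical names for this illustration.

```python
def split_write(store, offset, buf, max_request):
    """Apply buf in max_request-sized pieces, yielding a snapshot after each."""
    for start in range(0, len(buf), max_request):
        chunk = buf[start:start + max_request]
        pos = offset + start
        store[pos:pos + len(chunk)] = chunk
        yield bytes(store)  # what a concurrent reader would see at this point


def observe_split_write():
    store = bytearray(b"Y" * 8)  # old data Y
    # Thread A writes X over Y in two 4-byte pieces; thread B reads after each.
    return list(split_write(store, 0, b"X" * 8, max_request=4))


if __name__ == "__main__":
    for snapshot in observe_split_write():
        print(snapshot)
```

The first snapshot is b"XXXXYYYY", which is exactly the torn read the
standard rules out; the second is the fully written b"XXXXXXXX".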

So I do not see any fundamental problem with FUSE write requests.
On the contrary - FUSE write requests are just like any network protocol write
request or local disk IO request for that matter.

Unless I am missing something...

Thanks,
Amir.

* Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests
  2020-06-22  7:57               ` Amir Goldstein
@ 2020-06-26  5:27                 ` Nikolaus Rath
  2020-07-01  9:58                   ` Hselin Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Nikolaus Rath @ 2020-06-26  5:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: fuse-devel, linux-fsdevel, Miklos Szeredi, Matthew Wilcox, Dave Chinner

On Jun 22 2020, Amir Goldstein <amir73il@gmail.com> wrote:
> I think you are confused by the use of the word "atomic" in the standard.
> It does not mean what the O_ATOMIC proposal means, that is - write everything
> or write nothing at all.
> It means if thread A successfully wrote data X over data Y, then thread B can
> either read X or Y, but not half X half Y.
> If A got an error on write, the content that B will read is probably undefined
> (excuse me for not reading what "the law" has to say about this).
> If A got a short (half) write, then surely B can read either half X or half Y
> from the first half range. Second half range I am not sure what to expect.
>
> So I do not see any fundamental problem with FUSE write requests.
> On the contrary - FUSE write requests are just like any network protocol write
> request or local disk IO request for that matter.
>
> Unless I am missing something...

Well, you're missing the point I was trying to make, which was that FUSE
is in an unfortunate spot if we want to avoid short writes *and* comply
with the standard. You are asserting that it is perfectly fine for FUSE to
return short writes, and I agree that in that case there is no problem
with making writes atomic.

I do not dispute that FUSE is within its rights to return short
writes. What I am saying is that I'm sure that there are plenty of
userspace applications that don't expect short writes or reads when
accessing *any* regular file, because people assume this is only a concern
for fds that represent sockets or pipes. Yes, this is wrong of
them. But it works almost all the time, so it would be unfortunate if it
suddenly stopped working for FUSE in the situations where it previously
worked.


Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

* Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests
  2020-06-26  5:27                 ` Nikolaus Rath
@ 2020-07-01  9:58                   ` Hselin Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Hselin Chen @ 2020-07-01  9:58 UTC (permalink / raw)
  To: Amir Goldstein, fuse-devel, linux-fsdevel, Miklos Szeredi,
	Matthew Wilcox, Dave Chinner

Hi Nikolaus,

Sorry for being dense, apologies if I misunderstood something.
I think the issue is also more than just short writes?
The "reverse" write order (i.e. writing higher offsets before lower
offsets) can create temporary "holes" in the written file that short
writes wouldn't have caused.
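
A minimal sketch of such a hole in Python, assuming a large write is split
into chunks that reach the file in descending offset order.
write_chunks_in_reverse and demo_reverse_order are hypothetical names for
this illustration; os.pwrite and os.pread are the standard positional I/O
calls on Unix.

```python
import os
import tempfile


def write_chunks_in_reverse(fd, buf, max_request):
    """Write buf in chunks from the highest offset down, snapshotting the file."""
    snapshots = []
    starts = list(range(0, len(buf), max_request))
    for start in reversed(starts):
        os.pwrite(fd, buf[start:start + max_request], start)
        size = os.fstat(fd).st_size
        # Read back the whole file as an observer would see it right now.
        snapshots.append(os.pread(fd, size, 0))
    return snapshots


def demo_reverse_order():
    with tempfile.TemporaryFile() as f:
        return write_chunks_in_reverse(f.fileno(), b"X" * 8, max_request=4)


if __name__ == "__main__":
    for snapshot in demo_reverse_order():
        print(snapshot)
```

After the first chunk lands, the file is already 8 bytes long but its first
4 bytes read back as zeros. A short write that stopped after the first chunk
in ascending order would never expose that state.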

Thanks!
Albert

