* munmap, msync: synchronization
@ 2014-04-20 10:28 Heinrich Schuchardt
  2014-04-21 10:16 ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 18+ messages in thread
From: Heinrich Schuchardt @ 2014-04-20 10:28 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), linux-man-u79uwXL29TY76Z2rM5mHXA

Hello Michael,

when analyzing how the fanotify API interacts with mmap(2) I stumbled 
over the following issues in the manpages:


The manpage of msync(2) says:
"msync() flushes changes made to the in-core copy of a file that was 
mapped into memory using mmap(2) back to disk."

"back to disk" implies that the file system is forced to actually write 
to the hard disk, somewhat equivalent to invoking sync(1). Is that 
guaranteed for all file systems?

Not all file systems are necessarily disk based (e.g. davfs, tmpfs).

So shouldn't we write:
"... back to the file system."

http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
says
"... to permanent storage locations, if any,"


The manpage of munmap(2) leaves it unclear whether copying back to the 
filesystem is synchronous or asynchronous.
This bit of information is important, because, if munmap is 
asynchronous, applications might want to call msync(,,MS_SYNC), before 
calling munmap. If munmap is synchronous it might block until the file 
system responds (think of waiting for a tape to be loaded, or a webdav 
server to respond).


What happens to an unfinished prior asynchronous update by 
msync(,,MS_ASYNC) when munmap is called?


Will munmap "invalidate other mappings of the same file (so that they 
can be updated with the fresh values just written)" like 
msync(,,MS_INVALIDATE) does?


Best regards

Heinrich Schuchardt
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: munmap, msync: synchronization
  2014-04-20 10:28 munmap, msync: synchronization Heinrich Schuchardt
@ 2014-04-21 10:16 ` Michael Kerrisk (man-pages)
       [not found]   ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 10:16 UTC (permalink / raw)
  To: Heinrich Schuchardt, linux-man
  Cc: mtk.manpages, Christoph Hellwig, Dave Chinner, Theodore T'so,
	Linux-Fsdevel, Miklos Szeredi, jamie

[CCing a few people who may correct my errors; perhaps there are some
improvements that are needed for the mmap() and msync() man pages.]

Hello Heinrich,

On 04/20/2014 12:28 PM, Heinrich Schuchardt wrote:
> Hello Michael,
> 
> when analyzing how the fanotify API interacts with mmap(2) I stumbled 
> over the following issues in the manpages:
> 
> 
> The manpage of msync(2) says:
> "msync() flushes changes made to the in-core copy of a file that was 
> mapped into memory using mmap(2) back to disk."
> 
> "back to disk" implies that the file system is forced to actually write 
> to the hard disk, somewhat equivalent to invoking sync(1). Is that 
> guaranteed for all file systems?
> 
> Not all file systems are necessarily disk based (e.g. davfs, tmpfs).
> 
> So shouldn't we write:
> "... back to the file system."

Yes, that seems better to me. Done.

> http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
> says
> "... to permanent storage locations, if any,"
> 
> 
> The manpage of munmap(2) leaves it unclear, if copying back to the 
> filesystem is synchronous or asynchronous.

In fact, the page says nearly nothing about whether it synchs at all.
That is (I think) more or less deliberate. See below.

> This bit of information is important, because, if munmap is 
> asynchronous, applications might want to call msync(,,MS_SYNC), before 
> calling munmap. If munmap is synchronous it might block until the file 
> system responds (think of waiting for a tape to be loaded, or a webdav 
> server to respond).
> 
> 
> What happens to an unfinished prior asynchronous update by 
> msync(,,MS_ASYNC) when munmap is called?

I believe the answer is: On Linux, nothing special; the asynchronous
update will still be done. (I'm not sure that anything needs to be
said in the man page... But, if you have a good argument about why 
something should be said, I'm open to hearing it.)

> Will munmap "invalidate other mappings of the same file (so that they 
> can be updated with the fresh values just written)" like 
> msync(,,MS_INVALIDATE) does?

I don't believe there's any requirement that it does. (Again, I'm not
sure that anything needs to be said in the man page... But, if
you have a good argument...)

So, here's how things are as I understand them.

1. In the bad old days (even on Linux, AFAIK, but that was in days
   before I looked closely at what goes on), the page cache and
   the buffer cache were not unified. That meant that a page from 
   a file might both be in the buffer cache (because of file I/O
   syscalls) and in the page cache (because of mmap()).

2. In a non-unified cache system, pages can naturally get out of
   synch in the two locations. Before it had a unified cache, Linux 
   used to jump through some hoops to ensure that contents in the two 
   locations remained consistent.

3. Nowadays Linux--like most (all?) UNIX systems--has a 
   unified cache: file I/O, mmap(), and the paging system all 
   use the same cache. If a file is mmap()-ed and also subject
   to file I/O, there will be only one copy of each file page 
   in the cache. Ergo, the inconsistency problem goes away.

4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
   exist only because of the bad old non-unified cache days.
   MS_INVALIDATE was a way of saying: make sure that writes
   to the file by other processes are visible in this mapping.
   msync() without the MS_INVALIDATE flag was a way of saying:
   make sure that read()s from the file see the changes made
   via this mapping. Using either MS_SYNC or MS_ASYNC
   was the way of saying: "I either want to wait until the file
   updates have been completed", or "please start the updates
   now, but I don't want to wait until they're completed".

5. On systems with a unified cache, msync(MS_INVALIDATE)
   is a no-op. (That is so on Linux.)

6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified 
   cache system. Filesystem I/O always sees a consistent view,
   and MS_ASYNC never undertook to give a guarantee about *when*
   the update would occur. (The Linux buffer cache logic will 
   ensure that it is flushed out sometime in the near future.)

7. On Linux (and probably many other modern systems), the only
   call that has any real use is msync(MS_SYNC), meaning
   "flush the buffers *now*, and I want to wait for that to 
   complete, so that I can then continue safe in the knowledge
   that my data has landed on a device". That's useful if we
   want insurance for our data in the event of a system crash.

8. POSIX makes no mandate for a unified cache system. Thus,
   we have MS_ASYNC and MS_INVALIDATE in the standard, and
   the standard says nothing (AFAIK) about whether munmap() 
   will flush data. On Linux (and probably most modern systems),
   we're fine, but portable applications that care about 
   standards and nonunified caches need to use msync().

   My advice: To ensure that the contents of a shared file
   mapping are written to the underlying file--even on bad old
   implementations--a call to msync() should be made before 
   unmapping a mapping with munmap().

9. The mmap() man page says this:

       MAP_SHARED 
           Share this mapping.  Updates to the mapping are vis‐
           ible to other processes that map this file, and  are
           carried  through  to  the underlying file.  The file
           may not actually be updated until msync(2)  or  mun‐
           map() is called.

   I believe the piece "or munmap()" is misleading. It implies
   that munmap() must trigger a sync action. I don't think this
   is true. All that it is required to do is remove some range
   of pages from the process's virtual address space. I'm
   inclined to remove those words, but I'd like to see if any
   FS person has a correction to my understanding first.
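
The advice in point 8 can be sketched in C roughly as follows. This is
only a minimal illustration, assuming a POSIX system; the helper name
write_via_mapping() is made up for this sketch and is not taken from any
man page.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store msg at the start of the file through a
 * MAP_SHARED mapping, flush it with msync(MS_SYNC), then unmap.
 * Returns 0 on success, -1 on error. */
int write_via_mapping(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    assert(len > 0);                 /* mmap() rejects a zero length */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* The file must be at least as large as the range we store to. */
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, msg, len);

    /* Do not rely on munmap() to flush: wait for the write-back here. */
    int rc = msync(p, len, MS_SYNC);

    if (munmap(p, len) == -1)
        rc = -1;
    return rc;
}
```

A caller would use it as write_via_mapping("/tmp/example.bin", "hello");
the path is arbitrary. Checking the msync() result before unmapping is the
only point at which a write-back error can still be reported to the caller.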

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: munmap, msync: synchronization
       [not found]   ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-04-21 18:14     ` Christoph Hellwig
  2014-04-21 19:54       ` Michael Kerrisk (man-pages)
  2014-04-23 14:03       ` Matthew Wilcox
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-21 18:14 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Dave Chinner, Theodore T'so,
	Linux-Fsdevel, Miklos Szeredi, jamie-yetKDKU6eevNLxjTenLetw

On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>    before I looked closely at what goes on), the page cache and
>    the buffer cache were not unified. That meant that a page from 
>    a file might both be in the buffer cache (because of file I/O
>    syscalls) and in the page cache (because of mmap()).

Correct.

> 2. In a non-unified cache system, pages can naturally get out of
>    synch in the two locations. Before it had a unified cache, Linux 
>    used to jump some hoops to ensure that contents in the two 
>    locations remained consistent.

Yeah.

> 3. Nowadays Linux--like most (all?) UNIX systems--has a 
>    unified cache: file I/O, mmap(), and the paging system all 
>    use the same cache. If a file is mmap()-ed and also subject
>    to file I/O, there will be only one copy of each file page 
>    in the cache. Ergo, the inconsistency problem goes away.

Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
its own file cache that is not coherent with the VM cache at the
implementation level.  Not sure how much of this leaks to userspace,
though.

> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>    exist only because of the bad old non-unified cache days.
>    MS_INVALIDATE was a way of saying: make sure that writes
>    to the file by other processes are visible in this mapping.
>    msync() without the MS_INVALIDATE flags was a way of saying:
>    make sure that read()s from the file see the changes made
>    via this mapping. Using either MS_SYNC or MS_ASYNC
>    was the way of saying: "I either want to wait until the file
>    updates have been completed", or "please start the updates
>    now, but I don't want to wait until they're completed".

Right.

> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>    is a no-op. (That is so on Linux.)

Almost.  It returns EBUSY if it hits any mlock()ed region.  Don't ask me
why, though..
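
The EBUSY case can be probed with a small C sketch. It assumes Linux and
that the process is allowed to mlock() one page; the function name is
hypothetical, and the EBUSY result is what the remark above predicts, not
something POSIX promises.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical probe: map one page, mlock() it, then call
 * msync(MS_INVALIDATE) on it.
 * Returns 0 if msync() succeeded, the errno value if msync() failed,
 * or -1 if the setup (mmap/mlock) could not be done. */
int msync_invalidate_on_locked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (mlock(p, len) == -1) {       /* may fail under RLIMIT_MEMLOCK */
        munmap(p, len);
        return -1;
    }

    int result = 0;
    if (msync(p, len, MS_INVALIDATE) == -1)
        result = errno;              /* EBUSY is the case described above */

    munlock(p, len);
    munmap(p, len);
    return result;
}
```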

> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified 
>    cache system. Filesystem I/O always sees a consistent view,
>    and MS_ASYNC never undertook to give a guarantee about *when*
>    the update would occur. (The Linux buffer cache logic will 
>    ensure that it is flushed out sometime in the near future.)

Right.  It's a fairly inefficient noop, though - it actually loops
over all vmas to do nothing with them.

> 7. On Linux (and probably many other modern systems), the only
>    call that has any real use is msync(MS_SYNC), meaning
>    "flush the buffers *now*, and I want to wait for that to 
>    complete, so that I can then continue safe in the knowledge
>    that my data has landed on a device". That's useful if we
>    want insurance for our data in the event of a system crash.

Right.  It's basically another way to call fsync, which is used to
implement it underneath.  It actually should be a ranged-fdatasync
but right now it's implemented horribly inefficiently in that it
does a fsync call for each vma that it encounters in the range
specified.

> 8. POSIX make no mandate for a unified cache system. Thus,
>    we have MS_ASYNC and MS_INVALIDATE in the standard, and
>    the standard says nothing (AFAIK) about whether munmap() 
>    will flush data. On Linux (and probably most modern systems),
>    we're fine. but portable applications that care about 
>    standards and nonunified caches need to use msync().
> 
>    My advice: To ensure that the contents of a shared file
>    mapping are written to the underlying file--even on bad old
>    implementations--a call to msync() should be made before 
>    unmapping a mapping with munmap().

Agreed.

> 9. The mmap() man page says this:
> 
>        MAP_SHARED 
>            Share this mapping.  Updates to the mapping are vis‐
>            ible to other processes that map this file, and  are
>            carried  through  to  the underlying file.  The file
>            may not actually be updated until msync(2)  or  mun‐
>            map() is called.
> 
>    I believe the piece "or munmap()" is misleading. It implies
>    that munmap() must trigger a sync action. I don't think this
>    is true. All that it is required to do is remove some range
>    of pages from the process's virtual address space. I'm
>    inclined to remove those words, but I'd like to see if any
>    FS person has a correction to my understanding first.

I would expect non-coherent systems to update their caches on munmap,
but POSIX does not seem to require this, and I can't find any language
to that effect in the HP-UX man page; HP-UX is a system that I remember
as non-coherent until the end.

* Re: munmap, msync: synchronization
  2014-04-21 18:14     ` Christoph Hellwig
@ 2014-04-21 19:54       ` Michael Kerrisk (man-pages)
  2014-04-21 21:34         ` Jamie Lokier
  2014-04-23 14:03       ` Matthew Wilcox
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 19:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie

Christoph,

On 04/21/2014 08:14 PM, Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
>> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>>    before I looked closely at what goes on), the page cache and
>>    the buffer cache were not unified. That meant that a page from 
>>    a file might both be in the buffer cache (because of file I/O
>>    syscalls) and in the page cache (because of mmap()).
> 
> Correct.
> 
>> 2. In a non-unified cache system, pages can naturally get out of
>>    synch in the two locations. Before it had a unified cache, Linux 
>>    used to jump some hoops to ensure that contents in the two 
>>    locations remained consistent.
> 
> Yeah.
> 
>> 3. Nowadays Linux--like most (all?) UNIX systems--has a 
>>    unified cache: file I/O, mmap(), and the paging system all 
>>    use the same cache. If a file is mmap()-ed and also subject
>>    to file I/O, there will be only one copy of each file page 
>>    in the cache. Ergo, the inconsistency problem goes away.
> 
> Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
> its own file cache that is not coherent with the VM cache at the
> implementation level.  Not sure how much of this leaks to userspace,
> though.

Thanks for that detail.

>> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>>    exist only because of the bad old non-unified cache days.
>>    MS_INVALIDATE was a way of saying: make sure that writes
>>    to the file by other processes are visible in this mapping.
>>    msync() without the MS_INVALIDATE flags was a way of saying:
>>    make sure that read()s from the file see the changes made
>>    via this mapping. Using either MS_SYNC or MS_ASYNC
>>    was the way of saying: "I either want to wait until the file
>>    updates have been completed", or "please start the updates
>>    now, but I don't want to wait until they're completed".
> 
> Right.
> 
>> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>>    is a no-op. (That is so on Linux.)
> 
> Almost.  It returns EBUSY if it hits any mlock()ed region.  Don't ask me
> why, though..

Ahhh yes, I was aware of that detail, but overlooked it in the point 
above.

>> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified 
>>    cache system. Filesystem I/O always sees a consistent view,
>>    and MS_ASYNC never undertook to give a guarantee about *when*
>>    the update would occur. (The Linux buffer cache logic will 
>>    ensure that it is flushed out sometime in the near future.)
> 
> Right.  It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.
> 
>> 7. On Linux (and probably many other modern systems), the only
>>    call that has any real use is msync(MS_SYNC), meaning
>>    "flush the buffers *now*, and I want to wait for that to 
>>    complete, so that I can then continue safe in the knowledge
>>    that my data has landed on a device". That's useful if we
>>    want insurance for our data in the event of a system crash.
> 
> Right.  It's basically another way to call fsync, which is used to
> implement it underneath.  It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
> 
>> 8. POSIX make no mandate for a unified cache system. Thus,
>>    we have MS_ASYNC and MS_INVALIDATE in the standard, and
>>    the standard says nothing (AFAIK) about whether munmap() 
>>    will flush data. On Linux (and probably most modern systems),
>>    we're fine. but portable applications that care about 
>>    standards and nonunified caches need to use msync().
>>
>>    My advice: To ensure that the contents of a shared file
>>    mapping are written to the underlying file--even on bad old
>>    implementations--a call to msync() should be made before 
>>    unmapping a mapping with munmap().
> 
> Agreed.

Thanks for checking all of this over and thanks also
for confirming that I learned my lessons well in the
"Jamie Lokier school of tough technical reviewing" ;-).

>> 9. The mmap() man page says this:
>>
>>        MAP_SHARED 
>>            Share this mapping.  Updates to the mapping are vis‐
>>            ible to other processes that map this file, and  are
>>            carried  through  to  the underlying file.  The file
>>            may not actually be updated until msync(2)  or  mun‐
>>            map() is called.
>>
>>    I believe the piece "or munmap()" is misleading. It implies
>>    that munmap() must trigger a sync action. I don't think this
>>    is true. All that it is required to do is remove some range
>>    of pages from the process's virtual address space. I'm
>>    inclined to remove those words, but I'd like to see if any
>>    FS person has a correction to my understanding first.
> 
> I would expect non-coherent systems to update their caches on munmap,
> Posix does not seem to require this, and I can't find any language
> towards that in the HP-UX man page, which was a system that I remember
> as non-coherent until the end.

Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to the lowest common denominator.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: munmap, msync: synchronization
  2014-04-21 19:54       ` Michael Kerrisk (man-pages)
@ 2014-04-21 21:34         ` Jamie Lokier
       [not found]           ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-21 21:34 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christoph Hellwig, Heinrich Schuchardt, linux-man, Dave Chinner,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi

Michael Kerrisk (man-pages) wrote:
> >> 7. On Linux (and probably many other modern systems), the only
> >>    call that has any real use is msync(MS_SYNC), meaning
> >>    "flush the buffers *now*, and I want to wait for that to 
> >>    complete, so that I can then continue safe in the knowledge
> >>    that my data has landed on a device". That's useful if we
> >>    want insurance for our data in the event of a system crash.
> > 
> > Right.  It's basically another way to call fsync, which is used to
> > implement it underneath.  It actually should be a ranged-fdatasync
> > but right now it's implemented horribly inefficiently in that it
> > does a fsync call for each vma that it encounters in the range
> > specified.

A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice.  AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux?  sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".

> Thanks for checking all of this over and thanks also
> for confirming that I learned my lessons well in the
> "Jamie Lokier school of tough technical reviewing" ;-).

Hi! That was a long time ago :)

> >> 9. The mmap() man page says this:
> >>
> >>        MAP_SHARED 
> >>            Share this mapping.  Updates to the mapping are vis‐
> >>            ible to other processes that map this file, and  are
> >>            carried  through  to  the underlying file.  The file
> >>            may not actually be updated until msync(2)  or  mun‐
> >>            map() is called.
> >>
> >>    I believe the piece "or munmap()" is misleading. It implies
> >>    that munmap() must trigger a sync action. I don't think this
> >>    is true. All that it is required to do is remove some range
> >>    of pages from the process's virtual address space. I'm
> >>    inclined to remove those words, but I'd like to see if any
> >>    FS person has a correction to my understanding first.
> > 
> > I would expect non-coherent systems to update their caches on munmap,
> > Posix does not seem to require this, and I can't find any language
> > towards that in the HP-UX man page, which was a system that I remember
> > as non-coherent until the end.
> 
> Yes, that's how I read it too. POSIX seems to have no requirements here,
> so I assume it was catering to the lowest common denominator.

According to this:

    http://h30499.www3.hp.com/t5/System-Administration/2-second-delays-in-fsync-msync-munmap/td-p/3092785/page/2#.U1WBw8dSI1-

and the conclusion of the following page:

   - munmap() does _something_ on HP-UX, but it might be just a poorly
     implemented artifact rather than equivalent to msync.

   - While we're there, the lowest common denominator for HP-UX was
     that pwrite() followed by mmap() does not provide the data
     recently written, even with fsync() between.  The thread ended
     there, but I would guess either it's a bug _or_ perhaps
     write+mmap+msync(MS_INVALIDATE) are needed in that order despite
     the write being before the mmap, perhaps if the shared segment
     was maintained by another process.

   - To keep it exciting, if you look at the HP-UX man page, 32-bit
     and 64-bit processes have separate mmap caches - writing to
     shared memory in one of them won't be seen immediately by the other.

Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:

    - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ

I don't know if any of the above are _true_ though :)

Best,
-- Jamie


* Re: munmap, msync: synchronization
       [not found]           ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-22  6:03             ` Christoph Hellwig
  2014-04-22  7:04               ` Jamie Lokier
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-22  6:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Michael Kerrisk (man-pages),
	Christoph Hellwig, Heinrich Schuchardt,
	linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> A ranged-fdatasync, for databases with little logs inside the big data
> file, would be nice.  AIX, NetBSD and FreeBSD all have one :) Any
> likelihood of that ever appearing in Linux?  sync_file_range() comes
> with its Warning in the man page which basically means "don't trust me
> unless you know the filesystem exactly".

We have the infrastructure for range fsync and fdatasync in the kernel,
it's just not exposed.  Given that you've already done the research
how about you send a patch to wire it up?  Do the above implementations
at least agree on an API for it?

sync_file_range() unfortunately only writes out pagecache data and never
the needed metadata to actually find it.  While we could multiplex a
range fsync over it that seems to be very confusing (and would be more
complicated than just adding new syscalls)
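
Until a ranged fsync is exposed, one way to combine the two calls is
sketched below in C. It is only an illustration, assuming Linux with
glibc; the helper name is invented, and treating sync_file_range() as a
mere hint is a design choice of this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes sync_file_range() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: start write-back of one byte range early, then make
 * the file durable. sync_file_range() only pushes page-cache data, so its
 * result is treated as a hint; fdatasync() also commits the metadata needed
 * to find the data, and its result is the one that counts. */
int flush_range_then_sync(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);

#ifdef SYNC_FILE_RANGE_WRITE
    /* Best effort: errors (e.g. an unsupported file type) are ignored. */
    (void)sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#endif

    /* Whole-file data sync, since no ranged variant is available here. */
    return fdatasync(fd);
}
```

The fdatasync() call covers the whole file, so the range only influences
which pages get queued first; it does not narrow what is made durable.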

> Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
> 
>     - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ

That mail is utterly confused.  Yes, NFS has less coherency than normal
filesystems (google for close to open), but msync actually does its
proper job on NFS.


* Re: munmap, msync: synchronization
  2014-04-22  6:03             ` Christoph Hellwig
@ 2014-04-22  7:04               ` Jamie Lokier
  2014-04-22  9:28                 ` [PATCH] fsync_range, was: " Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-22  7:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
	Linux-Fsdevel, Miklos Szeredi

Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> > A ranged-fdatasync, for databases with little logs inside the big data
> > file, would be nice.  AIX, NetBSD and FreeBSD all have one :) Any
> > likelihood of that ever appearing in Linux?  sync_file_range() comes
> > with its Warning in the man page which basically means "don't trust me
> > unless you know the filesystem exactly".
> 
> We have the infrastructure for range fsync and fdatasync in the kernel,
> it's just not exposed.  Given that you've already done the research
> how about you send a patch to wire it up?  Do the above implementations
> at least agree on an API for it?

Hi Christoph,

Hardly research, I just did a quick Google and was surprised to find
some results.  AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).

As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.

> sync_file_range() unfortunately only writes out pagecache data and never
> the needed metadata to actually find it.  While we could multiplex a
> range fsync over it that seems to be very confusing (and would be more
> complicated than just adding new syscalls)

I agree. I never saw the point in sync_file_range() except to mislead,
whereas fsync_range() always seemed obvious!

In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?

For example, imagine two dirty pages 0 and 1, two disk blocks A and B,
and a non-overwriting filesystem (similar to btrfs) which knows about
the dirty flags and has formulated a plan to journal a single metadata
change containing two pointers, from [0->A,1->B] to [0->C,1->D] when
it flushes metadata _after_ pages 0 and 1 are written to new disk
blocks C and D.  And you do fsync_range just on page 1.  Now if only
page 1 gets written and page 0 does not, it's important that a
different metadata change is journalled: [0->A,1->D] (or just [1->D]).
Now hopefully, all filesystems are sane enough to just do that, by
calculating what to journal as a response to only data I/O that's in
flight and behind a barrier.  But I wouldn't like to _assume_ that no
filesystem's algorithms queue up the joint [0->C,1->D] metadata
change somehow, having seen the dirty flags, in a way that gets
confused by a forced metadata flush after a partial dirty data flush.
After all it might be a legitimate thing to do in the current scheme.

(Similar things apply to converting preallocated-but-unwritten regions
to written.)

So I have this weird idea that to do it carefully needs a little
checking what filesystems do with carefully ordered block-pointer
metadata writes.

> > Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
> > 
> >     - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
> 
> That mail is utterly confused.  Yes, NFS has less coherency than normal
> filesystems (google for close to open), but msync actually does its
> proper job on NFS.

Good to know :)

-- Jamie


* [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-22  7:04               ` Jamie Lokier
@ 2014-04-22  9:28                 ` Christoph Hellwig
  2014-04-23 14:33                   ` Michael Kerrisk (man-pages)
       [not found]                   ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-22  9:28 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
	Linux-Fsdevel, Miklos Szeredi

[-- Attachment #1: Type: text/plain, Size: 2198 bytes --]

On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> Hi Christoph,
> 
> Hardly research, I just did a quick Google and was surprised to find
> some results.  AIX API differs from the BSDs; the BSDs seem to agree
> with each other. fsync_range(), with a flag parameter saying what type
> of sync, and whether it flushes the storage device write cache as well
> (because they couldn't agree that was good - similar to the barriers
> debate).

There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.

The APIs are mostly the same, except that AIX reuses O_ flags as
argument and NetBSD has a separate namespace.  Following the latter
seems more sensible, and also allows developers to define the separate
names as the O_ flags for portability.

> As for me doing it, no, sorry, I haven't touched the kernel in a few
> years, life's been complicated for non-technical reasons, and I don't
> have time to get back into it now.

I've cooked up a patch, but I really need someone to test it and promote
it.  Find the patch attached.  There are two differences to the NetBSD
one:

 1) It doesn't fail for read-only FDs.  fsync doesn't, and while
    standards used to have fdatasync and aio_fsync fail for them,
    Linux never did and the standards are catching up:

	http://austingroupbugs.net/view.php?id=501
	http://austingroupbugs.net/view.php?id=671

 2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
    and we wouldn't even have the infrastructure for it.  It might make
    sense to provide it defined to 0 so that we have the identifier but
    make it a no-op.
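
To illustrate point 2, here is a minimal user-space sketch of what defining
FDISKSYNC to 0 would mean for callers. The FDATASYNC and FFILESYNC values are
the ones from the attached patch; the FDISKSYNC definition and the helper name
how_to_datasync() are assumptions for illustration only and are not part of
the patch. The decoding mirrors the switch in the proposed syscall:

```c
#include <errno.h>

#define FDATASYNC	0x0010	/* value taken from the attached patch */
#define FFILESYNC	0x0020	/* value taken from the attached patch */
#define FDISKSYNC	0	/* assumption: the no-op definition floated above */

/*
 * Hypothetical helper: decode the "how" argument the same way the
 * proposed syscall does.  Returns 1 for fdatasync semantics, 0 for
 * full fsync semantics, and -EINVAL for anything else.
 */
int how_to_datasync(int how)
{
	switch (how) {
	case FDATASYNC:
		return 1;
	case FFILESYNC:
		return 0;
	default:
		return -EINVAL;
	}
}
```

With FDISKSYNC equal to 0, a NetBSD-style call that passes
FDATASYNC | FDISKSYNC decodes exactly like plain FDATASYNC, so such code
would compile and run unchanged.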

> In the kernel, I was always under the impression the simple part of
> fsync_range - writing out data pages - was solved years ago, but being
> sure the filesystem's updated its metadata in the proper way, that
> begs for a little research into what filesystems do when asked,
> doesn't it?

The filesystems I care about handle it fine, and while I don't know
the details of others, they had better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the NFS server.


[-- Attachment #2: 0001-fs-implement-fsync_range.patch --]
[-- Type: text/plain, Size: 5898 bytes --]

From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Apr 2014 11:24:51 +0200
Subject: fs: implement fsync_range

Implement a fsync/fdatasync variant that takes a range to sync.  This follows the
NetBSD implementation:

	http://www.freebsd.org/cgi/man.cgi?query=fsync&apropos=0&sektion=0&manpath=NetBSD+5.0&format=html

and is fairly close to the AIX implementation that the NetBSD one is based on:

	http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.basetechref%2Fdoc%2Fbasetrf1%2Ffsync.htm

The implementation is very simple because the VFS already offers a ranged
fsync infrastructure, which is most prominently used to implement O_SYNC
and O_DSYNC writes.

Differences from NetBSD are:

 1) It doesn't fail for read-only FDs.  fsync doesn't, and while standards
    used to require fdatasync and aio_fsync to fail for read-only file
    descriptors, Linux never did and the standards are catching up:

	http://austingroupbugs.net/view.php?id=501
	http://austingroupbugs.net/view.php?id=671

 2) It doesn't implement the FDISKSYNC.  Requiring a flag to actually make
    data persistent is completely broken, and the Linux infrastructure
    doesn't support it anyway.  We could provide it as a no-op if we
    really need to.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/syscalls/syscall_32.tbl |    1 +
 arch/x86/syscalls/syscall_64.tbl |    1 +
 fs/sync.c                        |  101 ++++++++++++++++++++++++--------------
 include/uapi/linux/fs.h          |    6 ++-
 4 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..e239d46 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	fsync_range		sys_fsync_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..006d57f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fsync_range		sys_fsync_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd..58f9ca7 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,13 +197,13 @@ int vfs_fsync(struct file *file, int datasync)
 }
 EXPORT_SYMBOL(vfs_fsync);
 
-static int do_fsync(unsigned int fd, int datasync)
+static int do_fsync(unsigned int fd, loff_t start, loff_t end, int datasync)
 {
 	struct fd f = fdget(fd);
 	int ret = -EBADF;
 
 	if (f.file) {
-		ret = vfs_fsync(f.file, datasync);
+		ret = vfs_fsync_range(f.file, start, end, datasync);
 		fdput(f);
 	}
 	return ret;
@@ -211,12 +211,69 @@ static int do_fsync(unsigned int fd, int datasync)
 
 SYSCALL_DEFINE1(fsync, unsigned int, fd)
 {
-	return do_fsync(fd, 0);
+	return do_fsync(fd, 0, LLONG_MAX, 0);
 }
 
 SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 {
-	return do_fsync(fd, 1);
+	return do_fsync(fd, 0, LLONG_MAX, 1);
+}
+
+static loff_t end_offset(loff_t offset, loff_t nbytes)
+{
+	loff_t endbyte = offset + nbytes;
+
+	if ((s64)offset < 0)
+		return -EINVAL;
+	if ((s64)endbyte < 0)
+		return -EINVAL;
+	if (endbyte < offset)
+		return -EINVAL;
+
+	if (sizeof(pgoff_t) == 4) {
+		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * The range starts outside a 32 bit machine's
+			 * pagecache addressing capabilities.  Let it "succeed"
+			 */
+			return 0;
+		}
+		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * Out to EOF
+			 */
+			return LLONG_MAX;
+		}
+	}
+
+	if (nbytes == 0)
+		endbyte = LLONG_MAX;
+	else
+		endbyte--;		/* inclusive */
+
+	return endbyte;
+}
+
+SYSCALL_DEFINE4(fsync_range, unsigned int, fd, int, how,
+		loff_t, start, loff_t, length)
+{
+	int datasync = 0;
+	loff_t end;
+
+	switch (how) {
+	case FDATASYNC:
+		datasync = 1;
+		break;
+	case FFILESYNC:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	end = end_offset(start, length);
+	if (end <= 0)
+		return end;
+	return do_fsync(fd, start, end, datasync);
 }
 
 /*
@@ -275,40 +332,12 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
 	loff_t endbyte;			/* inclusive */
 	umode_t i_mode;
 
-	ret = -EINVAL;
 	if (flags & ~VALID_FLAGS)
-		goto out;
-
-	endbyte = offset + nbytes;
-
-	if ((s64)offset < 0)
-		goto out;
-	if ((s64)endbyte < 0)
-		goto out;
-	if (endbyte < offset)
-		goto out;
-
-	if (sizeof(pgoff_t) == 4) {
-		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
-			/*
-			 * The range starts outside a 32 bit machine's
-			 * pagecache addressing capabilities.  Let it "succeed"
-			 */
-			ret = 0;
-			goto out;
-		}
-		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
-			/*
-			 * Out to EOF
-			 */
-			nbytes = 0;
-		}
-	}
+		return -EINVAL;
 
-	if (nbytes == 0)
-		endbyte = LLONG_MAX;
-	else
-		endbyte--;		/* inclusive */
+	endbyte = end_offset(offset, nbytes);
+	if (endbyte <= 0)
+		return endbyte;
 
 	ret = -EBADF;
 	f = fdget(fd);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ca1a11b..491d9fe 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -199,9 +199,13 @@ struct inodes_stat_t {
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
 #define FS_FL_USER_MODIFIABLE		0x000380FF /* User modifiable flags */
 
-
+/* flags for sync_file_range */
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/* flags for fsync_range */
+#define FDATASYNC	0x0010
+#define FFILESYNC	0x0020
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
1.7.10.4



* Re: munmap, msync: synchronization
  2014-04-21 18:14     ` Christoph Hellwig
  2014-04-21 19:54       ` Michael Kerrisk (man-pages)
@ 2014-04-23 14:03       ` Matthew Wilcox
  1 sibling, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2014-04-23 14:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
	Linux-Fsdevel, Miklos Szeredi, jamie

On Mon, Apr 21, 2014 at 11:14:31AM -0700, Christoph Hellwig wrote:
> > 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified 
> >    cache system. Filesystem I/O always sees a consistent view,
> >    and MS_ASYNC never undertook to give a guarantee about *when*
> >    the update would occur. (The Linux buffer cache logic will 
> >    ensure that it is flushed out sometime in the near future.)
> 
> Right.  It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.

This will probably change for Persistent Memory.  The reason it
works today is that we have a page cache which tracks dirty bits and
periodically writes dirty pages to storage.  If we bypass the page cache,
we have to ensure that everything does still eventually get synced.

I don't quite know how this is going to work yet ... I have a number of
ideas in my head.  It probably won't be asynchronous though!

> > 7. On Linux (and probably many other modern systems), the only
> >    call that has any real use is msync(MS_SYNC), meaning
> >    "flush the buffers *now*, and I want to wait for that to 
> >    complete, so that I can then continue safe in the knowledge
> >    that my data has landed on a device". That's useful if we
> >    want insurance for our data in the event of a system crash.
> 
> Right.  It's basically another way to call fsync, which is used to
> implement it underneath.  It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
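
For reference, the msync(MS_SYNC) pattern from point 7 might look roughly
like this from user space. This is only a sketch: write_via_mmap_synced() is
a hypothetical helper that does not appear in this thread, and it assumes a
regular file and a non-zero length. It also shows the ordering raised at the
start of the thread, an explicit msync(MS_SYNC) before munmap():

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes of src into the start of the file
 * at path through a shared mapping, and flush them with msync(MS_SYNC)
 * before unmapping.  Returns 0 on success, -1 on any error.
 */
int write_via_mmap_synced(const char *path, const char *src, size_t len)
{
	int fd, ret = -1;
	char *map;

	if (len == 0)
		return -1;	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, src, len);

	/* Block until the dirty pages of the mapping are written back. */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;

	if (munmap(map, len) < 0)
		ret = -1;
out_close:
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

A caller would use it as write_via_mmap_synced("/some/file", "hello", 5) and
treat a non-zero return as "the data may not have reached the device".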

See also: 

From: Matthew Wilcox <matthew.r.wilcox@intel.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>, willy@linux.intel.com
Subject: [PATCH] Sync only the requested range in msync
Date: Thu, 27 Mar 2014 19:02:41 -0400
Message-Id: <1395961361-21307-1-git-send-email-matthew.r.wilcox@intel.com>



* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-22  9:28                 ` [PATCH] fsync_range, was: " Christoph Hellwig
@ 2014-04-23 14:33                   ` Michael Kerrisk (man-pages)
  2014-04-23 15:45                     ` Christoph Hellwig
       [not found]                   ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-23 14:33 UTC (permalink / raw)
  To: Christoph Hellwig, Jamie Lokier
  Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On 04/22/2014 11:28 AM, Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
>> Hi Christoph,
>>
>> Hardly research, I just did a quick Google and was surprised to find
>> some results.  AIX API differs from the BSDs; the BSDs seem to agree
>> with each other. fsync_range(), with a flag parameter saying what type
>> of sync, and whether it flushes the storage device write cache as well
>> (because they couldn't agree that was good - similar to the barriers
>> debate).
> 
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.
> 
> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace.  Following the latter
> seems more sensible, and also allows developer to define the separate
> name to the O_ flag for portability.
> 
>> As for me doing it, no, sorry, I haven't touched the kernel in a few
>> years, life's been complicated for non-technical reasons, and I don't
>> have time to get back into it now.
> 
> I've cooked up a patch, but I really need someone to test it and promote
> it.  Find the patch attached.  There are two differences to the NetBSD
> one:
> 
>  1) It doesn't fail for read-only FDs.  fsync doesn't, and while
>     standards used to have fdatasync and aio_fsync fail for them,
>     Linux never did and the standards are catching up:
> 
> 	http://austingroupbugs.net/view.php?id=501
> 	http://austingroupbugs.net/view.php?id=671
> 
>  2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
>     and we wouldn't even have the infrastructure for it.  It might make
>     sense to provide it defined to 0 so that we have the identifier but
>     make it a no-op.
> 
>> In the kernel, I was always under the impression the simple part of
>> fsync_range - writing out data pages - was solved years ago, but being
>> sure the filesystem's updated its metadata in the proper way, that
>> begs for a little research into what filesystems do when asked,
>> doesn't it?
> 
> The filesystems I care about handle it fine, and while I don't know
> the details of others they better handle it properly, given that we
> use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits
> from the nfs server.

The functionality sounds like it would be worthwhile. I've applied the
patch against 3.15-rc2, and employed the test program below, with test
files on a standard laptop HDD (ext4). The test program repeatedly
a) overwrites a specified region of a file
b) does an fsync_range() on a specified range of the file (need not be 
   the same region that was written).

The CLI is crude, but the arguments are:

1: pathname
2: number of loops
3: Starting point for writes each time round loop
4: Length of region to write
5: Either 'f' for FFILESYNC or 'd' for FDATASYNC
6: start offset for fsync_range()
7: length for fsync_range()

It seems that the patch does roughly what it says on the tin:

# Precreate a 1MB file

$ sync; time ./t_fsync_range /testfs/f 100 0 1000000 d 0 1000000^C
$ dd of=/testfs/f bs=1000 count=1000 if=/dev/full
1000+0 records in
1000+0 records out
1000000 bytes (1.0 MB) copied, 0.00575843 s, 174 MB/s

# Take journaling and atime out of the equation:

$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6
[sudo] password for mtk: 
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs

# Filesystem unmounted and remounted (with above options) before 
# each of the following tests

===

# 1000 loops, writing 1 MB, syncing entire 1MB range, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m10.677s
user	0m0.011s
sys	0m0.816s


# 1000 loops, writing 1MB, syncing entire 1MB range, with FDATASYNC:
# (Takes less time, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m8.685s
user	0m0.017s
sys	0m0.825s

===

# 1000 loops, writing 1 MB, syncing just 100kB, with FFILESYNC:
# (Takes less time than syncing entire 1MB range, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 100000
fsync_range(3, 0x20, 0, 100000)
Performed 16000 writes
Performed 1000 sync operations

real	0m1.501s
user	0m0.005s
sys	0m0.339s

# 1000 loops, writing 1 MB, syncing just 10kB, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 10000
fsync_range(3, 0x20, 0, 10000)
Performed 16000 writes
Performed 1000 sync operations

real	0m0.616s
user	0m0.004s
sys	0m0.240s

=======

But I have a question:

When I precreate a 10MB file, and repeat the tests (this time with 
100 loops), I no longer see any significant difference between 
FFILESYNC and FDATASYNC. What am I missing? Sample runs here, 
though I did the tests repeatedly with broadly similar results 
each time:

#FFILESYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 f 0 10000000
fsync_range(3, 0x20, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real	0m17.575s
user	0m0.001s
sys	0m0.656s

# FDATASYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 d 0 10000000
fsync_range(3, 0x10, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real	0m17.228s
user	0m0.005s
sys	0m0.624s

======

And another question: is there any piece of sync_file_range() 
functionality that could or should be incorporated in this API?

======

Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-23 14:33                   ` Michael Kerrisk (man-pages)
@ 2014-04-23 15:45                     ` Christoph Hellwig
       [not found]                       ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2014-04-24  9:34                       ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-23 15:45 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christoph Hellwig, Jamie Lokier, Heinrich Schuchardt, linux-man,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote:
> # Take journaling and atime out of the equation:
> 
> $ sudo umount /dev/sdb6
> $ sudo tune2fs -O ^has_journal /dev/sdb6$ 
> [sudo] password for mtk: 
> tune2fs 1.42.8 (20-Jun-2013)
> $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs

The second strictatime argument overrides the earlier norelatime,
so you put it into the picture.

> 
> But I have a question:
> 
> When I precreate a 10MB file, and repeat the tests (this time with 
> 100 loops), I no longer see any significant difference between 
> FFILESYNC and FDATASYNC. What am I missing? Sample runs here, 
> though I did the tests repeatedly with broadly similar results 
> each time:

Not sure.  Do you also see this on other filesystems?

> Add another question: is there any piece of sync_file_range() 
> functionality that could or should be incorporated in this API?

I don't think so.  sync_file_range is a complete mess and impossible
to use correctly for data integrity operations.  Especially the whole
notion that submitting I/O and waiting for it are separate operations
is incompatible with a data integrity call.



* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
       [not found]                   ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-23 22:15                     ` Jamie Lokier
       [not found]                       ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
  2014-04-24  1:34                     ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-23 22:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> > Hi Christoph,
> > 
> > Hardly research, I just did a quick Google and was surprised to find
> > some results.  AIX API differs from the BSDs; the BSDs seem to agree
> > with each other. fsync_range(), with a flag parameter saying what type
> > of sync, and whether it flushes the storage device write cache as well
> > (because they couldn't agree that was good - similar to the barriers
> > debate).
> 
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.

Yes, especially with the headings on the man pages saying FreeBSD :)
Just checked a FreeBSD 8.2 system, doesn't have it.

> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace.  Following the latter
> seems more sensible, and also allows developer to define the separate
> name to the O_ flag for portability.
...
> I've cooked up a patch, but I really need someone to test it and promote
> it.  Find the patch attached.  There are two differences to the NetBSD
> one:
> 
>  1) It doesn't fail for read-only FDs.  fsync doesn't, and while
>     standards used to have fdatasync and aio_fsync fail for them,
>     Linux never did and the standards are catching up:
> 
> 	http://austingroupbugs.net/view.php?id=501
> 	http://austingroupbugs.net/view.php?id=671

See also for maybe why:

        http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html

>  2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
>     and we wouldn't even have the infrastructure for it.  It might make
>     sense to provide it defined to 0 so that we have the identifier but
>     make it a no-op.

I presume Linux does the equivalent without needing FDISKSYNC, if and
only if the filesystem is mounted with barriers enabled, which is the
default nowadays?

> > In the kernel, I was always under the impression the simple part of
> > fsync_range - writing out data pages - was solved years ago, but being
> > sure the filesystem's updated its metadata in the proper way, that
> > begs for a little research into what filesystems do when asked,
> > doesn't it?
> 
> The filesystems I care about handle it fine, and while I don't know
> the details of others they better handle it properly, given that we
> use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits
> from the nfs server.

Excellent.  This really looks like it should have gone in as a system
call years ago, since vfs_fsync_range was there all along waiting to
be used!

> Differences from NetBSD are:
> 
>  1) It doesn't fail for read-only FDs.  fsync doesn't, and while standards
>     used require fdatasync and aio_fsync to fail for read-only file
>     descriptors Linux never did and the standards are catching up:
> 
> 	http://austingroupbugs.net/view.php?id=501
> 	http://austingroupbugs.net/view.php?id=671
> 
>  2) It doesn't implement the FDISKSYNC.  Requiring a flag to actuall make
>     data persistant is completely broken, and the Linux infrastructure
>     doesn't support it anyway.  We could provide it as a no-op if we
>     really need to.

Ah, more differences, which I think should be dropped actually.

   3) Does not implement NetBSD's documented behaviour when length == 0.
      NetBSD says "If the length parameter is zero, fsync_range() will
      synchronize all of the file data".  This patch syncs from offset onwards instead.

   4) Other weird range stuff inherited from sync_file_range() on 32
      bit machines only.  May not be correct with O_DIRECT or
      filesystems that don't use page cache.

See:

> +static loff_t end_offset(loff_t offset, loff_t nbytes)
> +{
> +	loff_t endbyte = offset + nbytes;
> +
> +	if ((s64)offset < 0)
> +		return -EINVAL;
> +	if ((s64)endbyte < 0)
> +		return -EINVAL;
> +	if (endbyte < offset)
> +		return -EINVAL;
> +
> +	if (sizeof(pgoff_t) == 4) {
> +		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> +			/*
> +			 * The range starts outside a 32 bit machine's
> +			 * pagecache addressing capabilities.  Let it "succeed"
> +			 */
> +			return 0;
> +		}
> +		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> +			/*
> +			 * Out to EOF
> +			 */
> +			return LLONG_MAX;
> +		}
> +	}
> +
> +	if (nbytes == 0)
> +		endbyte = LLONG_MAX;
> +	else
> +		endbyte--;		/* inclusive */
> +
> +	return endbyte;
> +}

That was in sync_file_range(), where I think it might have made more
sense as that's obviously tied to the page cache only.  So:

    a) Giving zero length results in sync from offset..LLONG_MAX.
       (NetBSD would have it be 0..LLONG_MAX, according to man page.)

    b) If the offset is "too large" for page cache on a 32-bit machine,
       it won't do anything -- including no metadata side-effects.

    c) If the length is "too large" for page cache on a 32-bit machine,
       it extends the length to LLONG_MAX.

The desired behaviour with zero length, that's obviously a judgement
call.  I guess that provided NetBSD applications the option to use
FDISKSYNC without a range :)

About b) and c) they both look dubious, because it's not a given that
a filesystem is using page cache, or only using page cache.  For
example FUSE using O_DIRECT.  (Not that I've checked if you can
actually write anything in those ranges though.)

b) looks worse because it means side effects are also quietly not
done, and a file might legitimately not use the page cache (consider a
FUSE-mounted file accessed with O_DIRECT).

So, would it not make sense to just check the offset, length and
offset+length fit into s64; and if length is zero change the range to
0..LLONG_MAX, and simply match NetBSD that way?  (Or, call me crazy,
just return if length is zero.)
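
Roughly what I have in mind, as a user-space model rather than kernel code.
The name fsync_range_bounds() and the use of int64_t in place of loff_t are
assumptions for illustration. It does only the s64 checks, and maps a zero
length to the whole file as the NetBSD man page describes:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the simplified range check.  On success it
 * returns 0 and stores the inclusive byte range in *first and *last;
 * on a bad range it returns -EINVAL and leaves them untouched.
 */
int fsync_range_bounds(int64_t offset, int64_t length,
		       int64_t *first, int64_t *last)
{
	if (offset < 0 || length < 0)
		return -EINVAL;
	if (length > INT64_MAX - offset)	/* offset + length overflows */
		return -EINVAL;

	if (length == 0) {
		/* NetBSD-style: zero length means all of the file. */
		*first = 0;
		*last = INT64_MAX;
	} else {
		*first = offset;
		*last = offset + length - 1;	/* inclusive */
	}
	return 0;
}
```

There is no page-cache-specific branch here, so behaviour would be the same
on 32-bit and 64-bit machines and would not silently skip anything.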

Best,
-- Jamie


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
       [not found]                       ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-23 22:20                         ` Jamie Lokier
       [not found]                           ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-23 22:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

> > Add another question: is there any piece of sync_file_range() 
> > functionality that could or should be incorporated in this API?
> 
> I don't think so.  sync_file_range is a complete mess and impossible
> to use correctly for data integrity operations.  Especially the whole
> notion that submitting I/O and waiting for it are separate operations
> is incompatible with a data integrity call.

I guess it's also to give the application a way to nudge a preferred
asynchronous writeback order, prior to a synchronous wait.  If the
application knows there's a lot of dirty data being generated over
time prior to needing a short fdatasync, it might see it as beneficial
to tell the kernel to start writing that data sooner, so the fdatasync
delay will be shorter.
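
A rough sketch of that usage pattern, assuming a regular file on Linux.
write_chunks_then_commit() is a hypothetical helper, not anything from the
patch. The sync_file_range() call is treated purely as a hint, so its return
value is ignored, and the final fdatasync() is the only data integrity point:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nchunks copies of buf (chunk bytes each)
 * back to back, nudging the kernel to start writeback of each chunk as
 * soon as it is written, then wait for everything with fdatasync().
 * Returns 0 on success, -1 on error.
 */
int write_chunks_then_commit(int fd, const char *buf, size_t chunk,
			     int nchunks)
{
	off_t off = 0;
	int i;

	for (i = 0; i < nchunks; i++) {
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
			return -1;

		/* Advisory only: start async writeback, do not wait. */
		(void)sync_file_range(fd, off, (off_t)chunk,
				      SYNC_FILE_RANGE_WRITE);
		off += (off_t)chunk;
	}

	/* The actual integrity call; ideally most data is already out. */
	return fdatasync(fd);
}
```

Whether the early hints really shorten the final wait depends on the
filesystem and the workload, so that would need measuring.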

-- Jamie


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
       [not found]                   ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2014-04-23 22:15                     ` Jamie Lokier
@ 2014-04-24  1:34                     ` Dave Chinner
  2014-04-25  6:06                       ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2014-04-24  1:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jamie Lokier, Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Tue, Apr 22, 2014 at 02:28:37AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> > Hi Christoph,
> > 
> > Hardly research, I just did a quick Google and was surprised to find
> > some results.  AIX API differs from the BSDs; the BSDs seem to agree
> > with each other. fsync_range(), with a flag parameter saying what type
> > of sync, and whether it flushes the storage device write cache as well
> > (because they couldn't agree that was good - similar to the barriers
> > debate).
> 
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.
> 
> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace.  Following the latter
> seems more sensible, and also allows developer to define the separate
> name to the O_ flag for portability.
> 
> > As for me doing it, no, sorry, I haven't touched the kernel in a few
> > years, life's been complicated for non-technical reasons, and I don't
> > have time to get back into it now.
> 
> I've cooked up a patch, but I really need someone to test it and promote
> it.  Find the patch attached.  There are two differences to the NetBSD
> one:
.....

> From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Date: Tue, 22 Apr 2014 11:24:51 +0200
> Subject: fs: implement fsync_range

Christoph, if this is going into the kernel, can you add support for
xfs_io and write a couple of xfstests to test it? I'm not
comfortable with adding new data integrity primitives to the kernel
without having robust validation infrastructure already in place for
it. It might also be worthwhile extending Josef's
fsync-tester.c to use ranged fsyncs, so as to test all the
various corner cases that we need to....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-23 15:45                     ` Christoph Hellwig
       [not found]                       ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-24  9:34                       ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-24  9:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mtk.manpages, Jamie Lokier, Heinrich Schuchardt, linux-man,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

(Oops -- I see that I forgot to attach the test program in my last
mail. Appended below, now.)

On 04/23/2014 05:45 PM, Christoph Hellwig wrote:
> On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote:
>> # Take journaling and atime out of the equation:
>>
>> $ sudo umount /dev/sdb6
>> $ sudo tune2fs -O ^has_journal /dev/sdb6$ 
>> [sudo] password for mtk: 
>> tune2fs 1.42.8 (20-Jun-2013)
>> $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
> 
> The second strictatime argument overrides the earlier norelatime,
> so you put it into the picture.

Oh -- have I misunderstood something? I was wanting classical behavior:
atime always updated (but only synced to disk by FFILESYNC). Is that not
what I should get with norelatime+strictatime?

>> But I have a question:
>>
>> When I precreate a 10MB file, and repeat the tests (this time with 
>> 100 loops), I no longer see any significant difference between 
>> FFILESYNC and FDATASYNC. What am I missing? Sample runs here, 
>> though I did the tests repeatedly with broadly similar results 
>> each time:
> 
> Not sure.  Do you also see this on other filesystems?

=======

So, here's some results from XFS:

# 1000 loops. 1MB file, 1MB fsync_range()
# As with ext4, FDATASYNC is faster than FFILESYNC (as expected)

$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m52.264s
user	0m0.018s
sys	0m0.926s
$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m33.689s
user	0m0.002s
sys	0m0.915s

# (Note that I did not disable XFS journalling--it's not possible to
# do so, right?)

====

# 100 loops, 100MB file, 100MB fsync_range()
# FDATASYNC and FFILESYNC times are again similar

$ time ./t_fsync_range /testfs/f 100 0 100000000 f 0 100000000
fsync_range(3, 0x20, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real	4m45.257s
user	0m0.004s
sys	0m5.607s

$ time ./t_fsync_range /testfs/f 100 0 100000000 d 0 100000000
fsync_range(3, 0x10, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real	4m43.925s
user	0m0.010s
sys	0m3.824s

# Again, the same pattern: no difference between FFILESYNC and FDATASYNC

=====
On JFS, I get

1000 loops, 1MB file, 1MB fsync_range, FFILESYNC:
* Quite a lot of variability (11.3 to 16.5 secs)
1000 loops, 1MB file, 1MB fsync_range, FDATASYNC:
* Quite a lot of variability (8.6 to 10.9 secs)
==> FDATASYNC is on average faster than FFILESYNC

100 loops, 100 MB file, 100MB fsync_range, FFILESYNC:
281 seconds (just a single test)
100 loops, 100 MB file, 100MB fsync_range, FDATASYNC:
280 seconds (just a single test)

So, again, it seems that for a large file sync there's no difference
between FFILESYNC and FDATASYNC.
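
For comparison on a kernel without the fsync_range() patch, a rough sketch of the same write-then-sync loop using the whole-file fsync(2) and fdatasync(2) calls might look as follows. The helper name time_sync_loop() and its parameters are illustrative assumptions, not part of the test program appended to this mail, and any timings would depend on the filesystem and mount options in use.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: rewrite `len` bytes at offset 0 of `path`
   `nloops` times, syncing after each pass with fdatasync() (if
   data_only is nonzero) or fsync(). Returns the elapsed wall-clock
   time in seconds, or -1.0 on error. */
static double
time_sync_loop(const char *path, int nloops, size_t len, int data_only)
{
    static char buf[65536];
    struct timespec t0, t1;
    int fd, j;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (j = 0; j < nloops; j++) {
        size_t left = len;

        memset(buf, j % 256, sizeof(buf));
        if (lseek(fd, 0, SEEK_SET) == -1)
            goto fail;

        /* Write the requested length in buffer-sized chunks */
        while (left > 0) {
            size_t n = (left > sizeof(buf)) ? sizeof(buf) : left;

            if (write(fd, buf, n) != (ssize_t) n)
                goto fail;
            left -= n;
        }

        /* Whole-file sync; there is no range argument here */
        if ((data_only ? fdatasync(fd) : fsync(fd)) == -1)
            goto fail;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

fail:
    close(fd);
    return -1.0;
}
```

Calling time_sync_loop("/testfs/f", 1000, 1000000, 0) and then the same with data_only set to 1 would roughly mirror the 1MB runs shown above, minus the range restriction.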

>> Add another question: is there any piece of sync_file_range() 
>> functionality that could or should be incorporated in this API?
> 
> I don't think so.  sync_file_range is a complete mess and impossible
> to use correctly for data integrity operations.  Especially the whole
> notion that submitting I/O and waiting for it are separate operations
> is incompatible with a data integrity call.

Okay -- I just thought it worth checking.

Cheers,

Michael

========
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/stat.h>	/* S_IRUSR, S_IWUSR */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define errExit(msg) 	do { perror(msg); exit(EXIT_FAILURE); \
			} while (0)

/* flags for fsync_range */
#define FDATASYNC	0x0010
#define FFILESYNC	0x0020

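/* Syscall number used on the test kernel with the fsync_range patch
   applied; it is not defined in the standard headers */
#define SYS_fsync_range 317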

static int
fsync_range(unsigned int fd, int how, loff_t start, loff_t length)
{
    return syscall(SYS_fsync_range, fd, how, start, length);
}

#define BUF_SIZE 65536
static char buf[BUF_SIZE];

int
main(int argc, char *argv[])
{
    int j, fd, nloops, how;
    size_t writeLen, syncLen, wlen;
    size_t bufSize;
    off_t writeOffset, syncOffset;
    int scnt, wcnt;

    if (argc != 8 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "%s pathname nloops write-offset write-length {f|d} "
                "sync-offset sync-len\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        errExit("open");

    /* atoll() so that offsets and lengths of 2GB or more are not truncated */
    nloops = atoi(argv[2]);
    writeOffset = atoll(argv[3]);
    writeLen = atoll(argv[4]);
    how = (argv[5][0] == 'd') ? FDATASYNC :
          (argv[5][0] == 'f') ? FFILESYNC : 0;
    syncOffset = atoll(argv[6]);
    syncLen = atoll(argv[7]);

    if (how != 0)
        fprintf(stderr, "fsync_range(%d, 0x%x, %lld, %zu)\n",
                fd, how, (long long) syncOffset, syncLen);

    scnt = 0;
    wcnt = 0;

    for (j = 0; j < nloops; j++) {
	memset(buf, j % 256, BUF_SIZE);
	if (lseek(fd, writeOffset, SEEK_SET) == -1)
	    errExit("lseek");

	wlen = writeLen;
        while (wlen > 0) {
            bufSize = (wlen > BUF_SIZE) ? BUF_SIZE : wlen;
	    wlen -= bufSize;
    
	    if (write(fd, buf, bufSize) != (ssize_t) bufSize) {
	        fprintf(stderr, "Write failed\n");
	        exit(EXIT_FAILURE);
	    }

	    wcnt++;
        }

	if (how != 0) {
	    scnt++;
	    if (fsync_range(fd, how, syncOffset, syncLen) == -1)
	        errExit("fsync_range");
	}
    }

    fprintf(stderr, "Performed %d writes\n", wcnt);
    fprintf(stderr, "Performed %d sync operations\n", scnt);
    exit(EXIT_SUCCESS);
}



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-24  1:34                     ` Dave Chinner
@ 2014-04-25  6:06                       ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-25  6:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Jamie Lokier, Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Thu, Apr 24, 2014 at 11:34:35AM +1000, Dave Chinner wrote:
> Christoph, if this is going into the kernel, can you add support for
> xfs_io and write a couple of xfstests to test it? I'm not
> comfortable with adding new data integrity primitives to the kernel
> without having robust validation infrastructure already in place for
> it. It might also be worthwhile looking to extend Josef's
> fsync-tester.c to be able to use ranged fsyncs so to test all the
> various corner cases that we need to....

If we actually want to add it, it will obviously need test coverage.
It seems like I can't really get people excited enough to make this
more than a PoC so far, though.


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
       [not found]                           ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-25  6:07                             ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-25  6:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Wed, Apr 23, 2014 at 11:20:11PM +0100, Jamie Lokier wrote:
> I guess it's also to give the application a way to nudge a preferred
> asynchronous writeback order, prior to a synchronous wait.  If the
> application knows there's a lot of dirty data being generated over
> time prior to needing a short fdatasync, it might see it as beneficial
> to tell the kernel to start writing that data sooner, so the fdatasync
> delay will be shorter.

If they want to do an async writeback pass first they can just use
sync_file_range for it, that's the only thing it's actually useful for.
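
A minimal sketch of that pattern, assuming a Linux system with glibc: sync_file_range(2) with only SYNC_FILE_RANGE_WRITE starts writeback without waiting, and a later fdatasync(2) provides the actual integrity guarantee. The helper names start_writeback() and write_then_sync() are hypothetical and are here only for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to start writing back dirty pages
   in [offset, offset + nbytes) without waiting for completion. This is
   only a hint about writeback order, not a data integrity operation. */
static int
start_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: write a buffer at offset 0, nudge writeback
   early, then rely on fdatasync() for the real guarantee. */
static int
write_then_sync(int fd, const void *buf, size_t len)
{
    if (pwrite(fd, buf, len, 0) != (ssize_t) len)
        return -1;

    if (start_writeback(fd, 0, (off_t) len) == -1)
        return -1;

    /* ... the application would generate more data or do other work
       here while writeback proceeds in the background ... */

    return fdatasync(fd);
}
```

The early call can only shorten the later fdatasync() wait; correctness never depends on it.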


* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
       [not found]                       ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-25  6:26                         ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-25  6:26 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Michael Kerrisk (man-pages),
	Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Wed, Apr 23, 2014 at 11:15:27PM +0100, Jamie Lokier wrote:
> >  1) It doesn't fail for read-only FDs.  fsync doesn't, and while
> >     standards used to have fdatasync and aio_fsync fail for them,
> >     Linux never did and the standards are catching up:
> > 
> > 	http://austingroupbugs.net/view.php?id=501
> > 	http://austingroupbugs.net/view.php?id=671
> 
> See also for maybe why:
> 
>         http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html

I don't really see a "why" there, just the observation that fsync and
fsync_range behave differently on NetBSD, which is odd but documented
behavior.
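
The read-only descriptor behavior from point 1 can be checked with the existing calls. The following is a small sketch, assuming Linux, where fsync(2) and fdatasync(2) are expected to succeed on a descriptor opened O_RDONLY; the helper name sync_readonly() is made up for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` read-only and call fsync() and
   fdatasync() on the descriptor. Returns 0 if both calls succeed,
   -1 if the open or either sync call fails. */
static int
sync_readonly(const char *path)
{
    int fd, ret;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    ret = (fsync(fd) == 0 && fdatasync(fd) == 0) ? 0 : -1;

    close(fd);
    return ret;
}
```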

> >  2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
> >     and we wouldn't even have the infrastructure for it.  It might make
> >     sense to provide it defined to 0 so that we have the identifier but
> >     make it a no-op.
> 
> I presume Linux does the equivalent without needing FDISKSYNC, if and
> only if the filesystem is mounted with barriers enabled, which is the
> default nowadays?

That's correct, at least for modern mainstream filesystems.  Either way
the filesystem would have to implement the cache flush, so those that
don't support it couldn't support FDISKSYNC either.

> Ah, more differences, which I think should be dropped actually.
> 
>    3) Does not implement NetBSD's documented behaviour when length == 0.
>       NetBSD says "If the length parameter is zero, fsync_range() will
>       synchronize all of the file data".  This patch does from offset.

Indeed.  AIX also documents the same behavior.

>    4) Other weird range stuff inherited from sync_file_range() on 32
>       bit machines only.  May not be correct with O_DIRECT or
>       filesystems that don't use page cache.

It's not really possible to implement a full Linux filesystem without
touching the pagecache, but I agree that this probably doesn't
belong in the VFS.  sync_file_range is one of these odd layering
violations that calls straight into the pagecache without going into
the filesystem first (readahead is the other one that comes to mind).

> The desired behaviour with zero length, that's obviously a judgement
> call.  I guess that provided NetBSD applications the option to use
> FDISKSYNC without a range :)

It seems to originate from the earlier AIX version, but I think it's
just their way to sync the whole range. I prefer our 0, LLONG_MAX
notation, but given the existing user interface we should stick to it.
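
To make the two conventions concrete, here is a sketch of how a (start, length) pair in the NetBSD/AIX style could be translated into the inclusive 0, LLONG_MAX style byte range. The function range_bounds() is a hypothetical illustration and is not taken from the posted patch.

```c
#include <limits.h>

/* Hypothetical helper: convert (start, length) into an inclusive byte
   range [*first, *last]. Following the documented NetBSD behavior, a
   length of zero selects all of the file data, i.e. 0..LLONG_MAX.
   Returns 0 on success, -1 if start or length is negative. */
static int
range_bounds(long long start, long long length,
             long long *first, long long *last)
{
    if (start < 0 || length < 0)
        return -1;

    if (length == 0) {
        *first = 0;                  /* whole file */
        *last = LLONG_MAX;
        return 0;
    }

    *first = start;
    if (start > LLONG_MAX - length)
        *last = LLONG_MAX;           /* clamp instead of overflowing */
    else
        *last = start + length - 1;  /* last byte covered by the range */
    return 0;
}
```

With this mapping, range_bounds(100, 50, ...) yields bytes 100 through 149, while range_bounds(100, 0, ...) yields the whole file regardless of the start offset.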


Thread overview: 18+ messages
2014-04-20 10:28 munmap, msync: synchronization Heinrich Schuchardt
2014-04-21 10:16 ` Michael Kerrisk (man-pages)
     [not found]   ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-04-21 18:14     ` Christoph Hellwig
2014-04-21 19:54       ` Michael Kerrisk (man-pages)
2014-04-21 21:34         ` Jamie Lokier
     [not found]           ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-22  6:03             ` Christoph Hellwig
2014-04-22  7:04               ` Jamie Lokier
2014-04-22  9:28                 ` [PATCH] fsync_range, was: " Christoph Hellwig
2014-04-23 14:33                   ` Michael Kerrisk (man-pages)
2014-04-23 15:45                     ` Christoph Hellwig
     [not found]                       ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:20                         ` Jamie Lokier
     [not found]                           ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25  6:07                             ` Christoph Hellwig
2014-04-24  9:34                       ` Michael Kerrisk (man-pages)
     [not found]                   ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:15                     ` Jamie Lokier
     [not found]                       ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25  6:26                         ` Christoph Hellwig
2014-04-24  1:34                     ` Dave Chinner
2014-04-25  6:06                       ` Christoph Hellwig
2014-04-23 14:03       ` Matthew Wilcox
