* munmap, msync: synchronization
@ 2014-04-20 10:28 Heinrich Schuchardt
2014-04-21 10:16 ` Michael Kerrisk (man-pages)
0 siblings, 1 reply; 18+ messages in thread
From: Heinrich Schuchardt @ 2014-04-20 10:28 UTC (permalink / raw)
To: Michael Kerrisk (man-pages), linux-man-u79uwXL29TY76Z2rM5mHXA
Hello Michael,
when analyzing how the fanotify API interacts with mmap(2) I stumbled
over the following issues in the manpages:
The manpage of msync(2) says:
"msync() flushes changes made to the in-core copy of a file that was
mapped into memory using mmap(2) back to disk."
"back to disk" implies that the file system is forced to actually write
to the hard disk, somewhat equivalent to invoking sync(1). Is that
guaranteed for all file systems?
Not all file systems are necessarily disk based (e.g. davfs, tmpfs).
So shouldn't we write:
"... back to the file system."
http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
says
"... to permanent storage locations, if any,"
The manpage of munmap(2) leaves it unclear, if copying back to the
filesystem is synchronous or asynchronous.
This bit of information is important, because, if munmap is
asynchronous, applications might want to call msync(,,MS_SYNC), before
calling munmap. If munmap is synchronous it might block until the file
system responds (think of waiting for a tape to be loaded, or a webdav
server to respond).
What happens to an unfinished prior asynchronous update by
msync(,,MS_ASYNC) when munmap is called?
Will munmap "invalidate other mappings of the same file (so that they
can be updated with the fresh values just written)" like
msync(,,MS_INVALIDATE) does?
Best regards
Heinrich Schuchardt
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: munmap, msync: synchronization
2014-04-20 10:28 munmap, msync: synchronization Heinrich Schuchardt
@ 2014-04-21 10:16 ` Michael Kerrisk (man-pages)
[not found] ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 10:16 UTC (permalink / raw)
To: Heinrich Schuchardt, linux-man
Cc: mtk.manpages, Christoph Hellwig, Dave Chinner, Theodore T'so,
Linux-Fsdevel, Miklos Szeredi, jamie
[CCing a few people who may correct my errors; perhaps there are some
improvements that are needed for the mmap() and msync() man pages
]
Hello Heinrich,
On 04/20/2014 12:28 PM, Heinrich Schuchardt wrote:
> Hello Michael,
>
> when analyzing how the fanotify API interacts with mmap(2) I stumbled
> over the following issues in the manpages:
>
>
> The manpage of msync(2) says:
> "msync() flushes changes made to the in-core copy of a file that was
> mapped into memory using mmap(2) back to disk."
>
> "back to disk" implies that the file system is forced to actually write
> to the hard disk, somewhat equivalent to invoking sync(1). Is that
> guaranteed for all file systems?
>
> Not all file systems are necessarily disk based (e.g. davfs, tmpfs).
>
> So shouldn't we write:
> "... back to the file system."
Yes, that seems better to me. Done.
> http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
> says
> "... to permanent storage locations, if any,"
>
>
> The manpage of munmap(2) leaves it unclear, if copying back to the
> filesystem is synchronous or asynchronous.
In fact, the page says nearly nothing about whether it synchs at all.
That is (I think) more or less deliberate. See below.
> This bit of information is important, because, if munmap is
> asynchronous, applications might want to call msync(,,MS_SYNC), before
> calling munmap. If munmap is synchronous it might block until the file
> system responds (think of waiting for a tape to be loaded, or a webdav
> server to respond).
>
>
> What happens to an unfinished prior asynchronous update by
> msync(,,MS_ASYNC) when munmap is called?
I believe the answer is: On Linux, nothing special; the asynchronous
update will still be done. (I'm not sure that anything needs to be
said in the man page... But, if you have a good argument about why
something should be said, I'm open to hearing it.)
> Will munmap "invalidate other mappings of the same file (so that they
> can be updated with the fresh values just written)" like
> msync(,,MS_INVALIDATE) does?
I don't believe there's any requirement that it does. (Again, I'm not
sure that anything needs to be said in the man page... But, if
you have a good argument...)
So, here's how things are as I understand them.
1. In the bad old days (even on Linux, AFAIK, but that was in days
before I looked closely at what goes on), the page cache and
the buffer cache were not unified. That meant that a page from
a file might both be in the buffer cache (because of file I/O
syscalls) and in the page cache (because of mmap()).
2. In a non-unified cache system, pages can naturally get out of
synch in the two locations. Before it had a unified cache, Linux
used to jump through some hoops to ensure that contents in the two
locations remained consistent.
3. Nowadays Linux--like most (all?) UNIX systems--has a
unified cache: file I/O, mmap(), and the paging system all
use the same cache. If a file is mmap()-ed and also subject
to file I/O, there will be only one copy of each file page
in the cache. Ergo, the inconsistency problem goes away.
4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
exist only because of the bad old non-unified cache days.
MS_INVALIDATE was a way of saying: make sure that writes
to the file by other processes are visible in this mapping.
msync() without the MS_INVALIDATE flags was a way of saying:
make sure that read()s from the file see the changes made
via this mapping. Using either MS_SYNC or MS_ASYNC
was the way of saying: "I either want to wait until the file
updates have been completed", or "please start the updates
now, but I don't want to wait until they're completed".
5. On systems with a unified cache, msync(MS_INVALIDATE)
is a no-op. (That is so on Linux.)
6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
cache system. Filesystem I/O always sees a consistent view,
and MS_ASYNC never undertook to give a guarantee about *when*
the update would occur. (The Linux buffer cache logic will
ensure that it is flushed out sometime in the near future.)
7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.
8. POSIX makes no mandate for a unified cache system. Thus,
we have MS_ASYNC and MS_INVALIDATE in the standard, and
the standard says nothing (AFAIK) about whether munmap()
will flush data. On Linux (and probably most modern systems),
we're fine, but portable applications that care about
standards and nonunified caches need to use msync().
My advice: To ensure that the contents of a shared file
mapping are written to the underlying file--even on bad old
implementations--a call to msync() should be made before
unmapping a mapping with munmap().
9. The mmap() man page says this:
MAP_SHARED
Share this mapping. Updates to the mapping are vis‐
ible to other processes that map this file, and are
carried through to the underlying file. The file
may not actually be updated until msync(2) or mun‐
map() is called.
I believe the piece "or munmap()" is misleading. It implies
that munmap() must trigger a sync action. I don't think this
is true. All that it is required to do is remove some range
of pages from the process's virtual address space. I'm
inclined to remove those words, but I'd like to see if any
FS person has a correction to my understanding first.
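The advice in point 8 might be sketched in C roughly as follows. This is
only an illustration: flush_and_unmap() and write_via_mapping() are
hypothetical helper names, not anything taken from the man pages, and the
error handling is kept minimal.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: flush a shared file mapping, then unmap it.
 * Returns 0 on success, -1 if msync() or munmap() failed. */
int flush_and_unmap(void *addr, size_t len)
{
    int rc = msync(addr, len, MS_SYNC);   /* waits for the write-back */

    if (munmap(addr, len) == -1)
        rc = -1;
    return rc;
}

/* Hypothetical helper: store msg into the file at path through a
 * MAP_SHARED mapping, calling msync(MS_SYNC) before munmap(). */
int write_via_mapping(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    char *map;
    int rc;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (len == 0 || ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, msg, len);                /* modify the file via the mapping */
    rc = flush_and_unmap(map, len);
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

On Linux the msync() call here mainly buys durability; on an implementation
without a unified cache it is, per point 4, also what makes the data
visible to read().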
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
* Re: munmap, msync: synchronization
[not found] ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-04-21 18:14 ` Christoph Hellwig
2014-04-21 19:54 ` Michael Kerrisk (man-pages)
2014-04-23 14:03 ` Matthew Wilcox
0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-21 18:14 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Christoph Hellwig, Dave Chinner, Theodore T'so,
Linux-Fsdevel, Miklos Szeredi, jamie-yetKDKU6eevNLxjTenLetw
On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
> 1. In the bad old days (even on Linux, AFAIK, but that was in days
> before I looked closely at what goes on), the page cache and
> the buffer cache were not unified. That meant that a page from
> a file might both be in the buffer cache (because of file I/O
> syscalls) and in the page cache (because of mmap()).
Correct.
> 2. In a non-unified cache system, pages can naturally get out of
> synch in the two locations. Before it had a unified cache, Linux
> used to jump through some hoops to ensure that contents in the two
> locations remained consistent.
Yeah.
> 3. Nowadays Linux--like most (all?) UNIX systems--has a
> unified cache: file I/O, mmap(), and the paging system all
> use the same cache. If a file is mmap()-ed and also subject
> to file I/O, there will be only one copy of each file page
> in the cache. Ergo, the inconsistency problem goes away.
Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
its own file cache that is not coherent with the VM cache at the
implementation level. Not sure how much of this leaks to userspace,
though.
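On a Linux filesystem with a unified cache, the coherence described in
point 3 can be observed without any msync() call. A minimal sketch,
assuming a local filesystem; the function name
mapping_and_file_io_agree() is made up for this example:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a store through the mapping is seen by pread() and a
 * pwrite() is seen through the mapping, 0 if not, -1 on setup error. */
int mapping_and_file_io_agree(const char *path)
{
    char buf[4] = {0};
    char *map;
    int seen_by_read, seen_by_map;
    int result = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (ftruncate(fd, 8) == -1)
        goto out_close;
    map = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    memcpy(map, "mmap", 4);               /* store via mapping, no msync() */
    if (pread(fd, buf, 4, 0) != 4)
        goto out_unmap;
    seen_by_read = memcmp(buf, "mmap", 4) == 0;

    if (pwrite(fd, "file", 4, 4) != 4)    /* store via file I/O */
        goto out_unmap;
    seen_by_map = memcmp(map + 4, "file", 4) == 0;

    result = seen_by_read && seen_by_map;
out_unmap:
    munmap(map, 8);
out_close:
    close(fd);
    return result;
}
```

This says nothing about ZFS on FreeBSD or Solaris; whether the
implementation-level incoherence shows up in a test like this is exactly
the open question.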
> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
> exist only because of the bad old non-unified cache days.
> MS_INVALIDATE was a way of saying: make sure that writes
> to the file by other processes are visible in this mapping.
> msync() without the MS_INVALIDATE flags was a way of saying:
> make sure that read()s from the file see the changes made
> via this mapping. Using either MS_SYNC or MS_ASYNC
> was the way of saying: "I either want to wait until the file
> updates have been completed", or "please start the updates
> now, but I don't want to wait until they're completed".
Right.
> 5. On systems with a unified cache, msync(MS_INVALIDATE)
> is a no-op. (That is so on Linux.)
Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
why, though..
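The EBUSY case can be checked with a small sketch like this one. The
helper name invalidate_locked_page() is hypothetical, and the outcome
depends on mlock() being permitted for the calling process (see
RLIMIT_MEMLOCK), so the function reports a setup failure separately:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps one page of the file at path, locks it, and calls
 * msync(MS_INVALIDATE) on it.  Returns the errno value of that msync()
 * call (0 if it succeeded), or -1 if the setup itself failed. */
int invalidate_locked_page(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    int err = 0;
    char *p;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (page <= 0 || ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                            /* mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    if (mlock(p, (size_t)page) == -1) {   /* may fail without privileges */
        munmap(p, (size_t)page);
        return -1;
    }
    if (msync(p, (size_t)page, MS_INVALIDATE) == -1)
        err = errno;
    munlock(p, (size_t)page);
    munmap(p, (size_t)page);
    return err;
}
```

If the behaviour is as described above, the return value on Linux should
be EBUSY whenever the mlock() call succeeds.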
> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
> cache system. Filesystem I/O always sees a consistent view,
> and MS_ASYNC never undertook to give a guarantee about *when*
> the update would occur. (The Linux buffer cache logic will
> ensure that it is flushed out sometime in the near future.)
Right. It's a fairly inefficient noop, though - it actually loops
over all vmas to do nothing with them.
> 7. On Linux (and probably many other modern systems), the only
> call that has any real use is msync(MS_SYNC), meaning
> "flush the buffers *now*, and I want to wait for that to
> complete, so that I can then continue safe in the knowledge
> that my data has landed on a device". That's useful if we
> want insurance for our data in the event of a system crash.
Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged-fdatasync
but right now it's implemented horribly inefficiently in that it
does a fsync call for each vma that it encounters in the range
specified.
> 8. POSIX makes no mandate for a unified cache system. Thus,
> we have MS_ASYNC and MS_INVALIDATE in the standard, and
> the standard says nothing (AFAIK) about whether munmap()
> will flush data. On Linux (and probably most modern systems),
> we're fine, but portable applications that care about
> standards and nonunified caches need to use msync().
>
> My advice: To ensure that the contents of a shared file
> mapping are written to the underlying file--even on bad old
> implementations--a call to msync() should be made before
> unmapping a mapping with munmap().
Agreed.
> 9. The mmap() man page says this:
>
> MAP_SHARED
> Share this mapping. Updates to the mapping are vis‐
> ible to other processes that map this file, and are
> carried through to the underlying file. The file
> may not actually be updated until msync(2) or mun‐
> map() is called.
>
> I believe the piece "or munmap()" is misleading. It implies
> that munmap() must trigger a sync action. I don't think this
> is true. All that it is required to do is remove some range
> of pages from the process's virtual address space. I'm
> inclined to remove those words, but I'd like to see if any
> FS person has a correction to my understanding first.
I would expect non-coherent systems to update their caches on munmap,
Posix does not seem to require this, and I can't find any language
towards that in the HP-UX man page, which was a system that I remember
as non-coherent until the end.
* Re: munmap, msync: synchronization
2014-04-21 18:14 ` Christoph Hellwig
@ 2014-04-21 19:54 ` Michael Kerrisk (man-pages)
2014-04-21 21:34 ` Jamie Lokier
2014-04-23 14:03 ` Matthew Wilcox
1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 19:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie
Christoph,
On 04/21/2014 08:14 PM, Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
>> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>> before I looked closely at what goes on), the page cache and
>> the buffer cache were not unified. That meant that a page from
>> a file might both be in the buffer cache (because of file I/O
>> syscalls) and in the page cache (because of mmap()).
>
> Correct.
>
>> 2. In a non-unified cache system, pages can naturally get out of
>> synch in the two locations. Before it had a unified cache, Linux
>> used to jump through some hoops to ensure that contents in the two
>> locations remained consistent.
>
> Yeah.
>
>> 3. Nowadays Linux--like most (all?) UNIX systems--has a
>> unified cache: file I/O, mmap(), and the paging system all
>> use the same cache. If a file is mmap()-ed and also subject
>> to file I/O, there will be only one copy of each file page
>> in the cache. Ergo, the inconsistency problem goes away.
>
> Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
> its own file cache that is not coherent with the VM cache at the
> implementation level. Not sure how much of this leaks to userspace,
> though.
Thanks for that detail.
>> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>> exist only because of the bad old non-unified cache days.
>> MS_INVALIDATE was a way of saying: make sure that writes
>> to the file by other processes are visible in this mapping.
>> msync() without the MS_INVALIDATE flags was a way of saying:
>> make sure that read()s from the file see the changes made
>> via this mapping. Using either MS_SYNC or MS_ASYNC
>> was the way of saying: "I either want to wait until the file
>> updates have been completed", or "please start the updates
>> now, but I don't want to wait until they're completed".
>
> Right.
>
>> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>> is a no-op. (That is so on Linux.)
>
> Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
> why, though..
Ahhh yes, I was aware of that detail, but overlooked it in the point
above.
>> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
>> cache system. Filesystem I/O always sees a consistent view,
>> and MS_ASYNC never undertook to give a guarantee about *when*
>> the update would occur. (The Linux buffer cache logic will
>> ensure that it is flushed out sometime in the near future.)
>
> Right. It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.
>
>> 7. On Linux (and probably many other modern systems), the only
>> call that has any real use is msync(MS_SYNC), meaning
>> "flush the buffers *now*, and I want to wait for that to
>> complete, so that I can then continue safe in the knowledge
>> that my data has landed on a device". That's useful if we
>> want insurance for our data in the event of a system crash.
>
> Right. It's basically another way to call fsync, which is used to
> implement it underneath. It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
>
>> 8. POSIX makes no mandate for a unified cache system. Thus,
>> we have MS_ASYNC and MS_INVALIDATE in the standard, and
>> the standard says nothing (AFAIK) about whether munmap()
>> will flush data. On Linux (and probably most modern systems),
>> we're fine, but portable applications that care about
>> standards and nonunified caches need to use msync().
>>
>> My advice: To ensure that the contents of a shared file
>> mapping are written to the underlying file--even on bad old
>> implementations--a call to msync() should be made before
>> unmapping a mapping with munmap().
>
> Agreed.
Thanks for checking all of this over and thanks also
for confirming that I learned my lessons well in the
"Jamie Lokier school of tough technical reviewing" ;-).
>> 9. The mmap() man page says this:
>>
>> MAP_SHARED
>> Share this mapping. Updates to the mapping are vis‐
>> ible to other processes that map this file, and are
>> carried through to the underlying file. The file
>> may not actually be updated until msync(2) or mun‐
>> map() is called.
>>
>> I believe the piece "or munmap()" is misleading. It implies
>> that munmap() must trigger a sync action. I don't think this
>> is true. All that it is required to do is remove some range
>> of pages from the process's virtual address space. I'm
>> inclined to remove those words, but I'd like to see if any
>> FS person has a correction to my understanding first.
>
> I would expect non-coherent systems to update their caches on munmap,
> Posix does not seem to require this, and I can't find any language
> towards that in the HP-UX man page, which was a system that I remember
> as non-coherent until the end.
Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to the lowest common denominator.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
* Re: munmap, msync: synchronization
2014-04-21 19:54 ` Michael Kerrisk (man-pages)
@ 2014-04-21 21:34 ` Jamie Lokier
[not found] ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-21 21:34 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, Heinrich Schuchardt, linux-man, Dave Chinner,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi
Michael Kerrisk (man-pages) wrote:
> >> 7. On Linux (and probably many other modern systems), the only
> >> call that has any real use is msync(MS_SYNC), meaning
> >> "flush the buffers *now*, and I want to wait for that to
> >> complete, so that I can then continue safe in the knowledge
> >> that my data has landed on a device". That's useful if we
> >> want insurance for our data in the event of a system crash.
> >
> > Right. It's basically another way to call fsync, which is used to
> > implement it underneath. It actually should be a ranged-fdatasync
> > but right now it's implemented horribly inefficiently in that it
> > does a fsync call for each vma that it encounters in the range
> > specified.
A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux? sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".
> Thanks for checking all of this over and thanks also
> for confirming that I learned my lessons well in the
> "Jamie Lokier school of tough technical reviewing" ;-).
Hi! That was a long time ago :)
> >> 9. The mmap() man page says this:
> >>
> >> MAP_SHARED
> >> Share this mapping. Updates to the mapping are vis‐
> >> ible to other processes that map this file, and are
> >> carried through to the underlying file. The file
> >> may not actually be updated until msync(2) or mun‐
> >> map() is called.
> >>
> >> I believe the piece "or munmap()" is misleading. It implies
> >> that munmap() must trigger a sync action. I don't think this
> >> is true. All that it is required to do is remove some range
> >> of pages from the process's virtual address space. I'm
> >> inclined to remove those words, but I'd like to see if any
> >> FS person has a correction to my understanding first.
> >
> > I would expect non-coherent systems to update their caches on munmap,
> > Posix does not seem to require this, and I can't find any language
> > towards that in the HP-UX man page, which was a system that I remember
> > as non-coherent until the end.
>
> Yes, that's how I read it too. POSIX seems to have no requirements here,
> so I assume it was catering to the lowest common denominator.
According to this:
http://h30499.www3.hp.com/t5/System-Administration/2-second-delays-in-fsync-msync-munmap/td-p/3092785/page/2#.U1WBw8dSI1-
and the conclusion of the following page:
- munmap() does _something_ on HP-UX, but it might be just a poorly
implemented artifact rather than equivalent to msync.
- While we're there, the lowest common denominator for HP-UX was
that pwrite() followed by mmap() does not provide the data
recently written, even with fsync() between. The thread ended
there, but I would guess either it's a bug _or_ perhaps
write+mmap+msync(MS_INVALIDATE) are needed in that order despite
the write being before the mmap, perhaps if the shared segment
was maintained by another process.
- To keep it exciting, if you look at the HP-UX man page, 32-bit
and 64-bit processes have separate mmap caches - writing to
shared memory in one of them won't be seen immediately by the other.
Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
- https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
I don't know if any of the above are _true_ though :)
Best,
-- Jamie
* Re: munmap, msync: synchronization
[not found] ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-22 6:03 ` Christoph Hellwig
2014-04-22 7:04 ` Jamie Lokier
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-22 6:03 UTC (permalink / raw)
To: Jamie Lokier
Cc: Michael Kerrisk (man-pages),
Christoph Hellwig, Heinrich Schuchardt,
linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> A ranged-fdatasync, for databases with little logs inside the big data
> file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
> likelihood of that ever appearing in Linux? sync_file_range() comes
> with its Warning in the man page which basically means "don't trust me
> unless you know the filesystem exactly".
We have the infrastructure for range fsync and fdatasync in the kernel,
it's just not exposed. Given that you've already done the research
how about you send a patch to wire it up? Do the above implementations
at least agree on an API for it?
sync_file_range() unfortunately only writes out pagecache data and never
the needed metadata to actually find it. While we could multiplex a
range fsync over it that seems to be very confusing (and would be more
complicated than just adding new syscalls)
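Until something like that is exposed, about the only safe thing
userspace can do is fall back to a whole-file sync. A minimal sketch of
such a stand-in follows; range_sync_fallback(), update_record() and the
RANGE_SYNC_* names are hypothetical and exist only for this example.
They are not a Linux, AIX or NetBSD API.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag names for this sketch only. */
enum range_sync_how { RANGE_SYNC_DATA = 1, RANGE_SYNC_FILE = 2 };

/* Hypothetical stand-in for a ranged sync: the range is validated but
 * then ignored, and the whole file is synced conservatively. */
int range_sync_fallback(int fd, int how, off_t start, off_t length)
{
    if (start < 0 || length < 0) {
        errno = EINVAL;
        return -1;
    }
    switch (how) {
    case RANGE_SYNC_DATA:
        return fdatasync(fd);             /* data plus metadata needed to read it */
    case RANGE_SYNC_FILE:
        return fsync(fd);                 /* data plus all metadata */
    default:
        errno = EINVAL;
        return -1;
    }
}

/* Example caller: overwrite a small record in place and make it durable,
 * as a database with a small log inside a big data file might do. */
int update_record(const char *path, const void *rec, size_t len, off_t off)
{
    int rc = -1;
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd == -1)
        return -1;
    if (pwrite(fd, rec, len, off) == (ssize_t)len)
        rc = range_sync_fallback(fd, RANGE_SYNC_DATA, off, (off_t)len);
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

The point of keeping the start and length arguments in the signature is
that the body could later be swapped for a real ranged call without
touching the callers.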
> Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
>
> - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
That mail is utterly confused. Yes, NFS has less coherency than normal
filesystems (google for close to open), but msync actually does its
proper job on NFS.
* Re: munmap, msync: synchronization
2014-04-22 6:03 ` Christoph Hellwig
@ 2014-04-22 7:04 ` Jamie Lokier
2014-04-22 9:28 ` [PATCH] fsync_range, was: " Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-22 7:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
Linux-Fsdevel, Miklos Szeredi
Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> > A ranged-fdatasync, for databases with little logs inside the big data
> > file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
> > likelihood of that ever appearing in Linux? sync_file_range() comes
> > with its Warning in the man page which basically means "don't trust me
> > unless you know the filesystem exactly".
>
> We have the infrastructure for range fsync and fdatasync in the kernel,
> it's just not exposed. Given that you've already done the research
> how about you send a patch to wire it up? Do the above implementations
> at least agree on an API for it?
Hi Christoph,
Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).
As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.
> sync_file_range() unfortunately only writes out pagecache data and never
> the needed metadata to actually find it. While we could multiplex a
> range fsync over it that seems to be very confusing (and would be more
> complicated than just adding new syscalls)
I agree. I never saw the point in sync_file_range() except to mislead,
whereas fsync_range() always seemed obvious!
In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?
For example, imagine two dirty pages 0 and 1, two disk blocks A and B,
and a non-overwriting filesystem (similar to btrfs) which knows about
the dirty flags and has formulated a plan to journal a single metadata
change containing two pointers, from [0->A,1->B] to [0->C,1->D] when
it flushes metadata _after_ pages 0 and 1 are written to new disk
blocks C and D. And you do fsync_range just on block 1. Now if only
page 1 gets written and page 0 does not, it's important that a
different metadata change is journalled: [0->A,1->D] (or just [1->D]).
Now hopefully, all filesystems are sane enough to just do that, by
calculating what to journal as a response to only data I/O that's in
flight and behind a barrier. But I wouldn't like to _assume_ that no
filesystem's algorithms queue up the joint [0->C,1->D] metadata
change somehow, having seen the dirty flags, in a way that gets
confused by a forced metadata flush after a partial dirty data flush.
After all it might be a legitimate thing to do in the current scheme.
(Similar things apply to converting preallocated-but-unwritten regions
to written.)
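The rule in the example above can be written down as a toy model. This is
purely illustrative and does not describe any real filesystem: struct
page_state and safe_metadata() are invented names, and a "block" is just
a letter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: each dirty page has its current block and a newly
 * allocated block.  Metadata may point at the new block only once the
 * page's data has actually been written there. */
struct page_state {
    char old_block;   /* e.g. 'A' */
    char new_block;   /* e.g. 'C' */
    bool written;     /* true once the data reached new_block */
};

/* Fill out[i] with the block that journalled metadata may reference
 * for page i, given which pages have been written so far. */
void safe_metadata(const struct page_state *pages, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = pages[i].written ? pages[i].new_block
                                  : pages[i].old_block;
}
```

With page 0 as {A, C, not written} and page 1 as {B, D, written}, the
result is [0->A,1->D], which is the metadata change that has to be
journalled after a ranged sync of page 1 alone.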
So I have this weird idea that to do it carefully needs a little
checking what filesystems do with carefully ordered block-pointer
metadata writes.
> > Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
> >
> > - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
>
> That mail is utterly confused. Yes, NFS has less coherency than normal
> filesystems (google for close to open), but msync actually does its
> proper job on NFS.
Good to know :)
-- Jamie
* [PATCH] fsync_range, was: Re: munmap, msync: synchronization
2014-04-22 7:04 ` Jamie Lokier
@ 2014-04-22 9:28 ` Christoph Hellwig
2014-04-23 14:33 ` Michael Kerrisk (man-pages)
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-22 9:28 UTC (permalink / raw)
To: Jamie Lokier
Cc: Christoph Hellwig, Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
Linux-Fsdevel, Miklos Szeredi
[-- Attachment #1: Type: text/plain, Size: 2198 bytes --]
On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> Hi Christoph,
>
> Hardly research, I just did a quick Google and was surprised to find
> some results. AIX API differs from the BSDs; the BSDs seem to agree
> with each other. fsync_range(), with a flag parameter saying what type
> of sync, and whether it flushes the storage device write cache as well
> (because they couldn't agree that was good - similar to the barriers
> debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.
The APIs are mostly the same, except that AIX reuses O_ flags as
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows developers to define the separate
name to the O_ flag for portability.
> As for me doing it, no, sorry, I haven't touched the kernel in a few
> years, life's been complicated for non-technical reasons, and I don't
> have time to get back into it now.
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
one:
1) It doesn't fail for read-only FDs. fsync doesn't, and while
standards used to have fdatasync and aio_fsync fail for them,
Linux never did and the standards are catching up:
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
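The existing behaviour that difference 1 follows can be seen with plain
fsync() and fdatasync(). A minimal sketch, assuming a regular file on
Linux; the helper name sync_readonly_fd() is made up for this example:

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens path read-only and calls fsync() and fdatasync() on it.
 * Returns 0 if both calls succeed, -1 otherwise. */
int sync_readonly_fd(const char *path)
{
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    rc = (fsync(fd) == 0 && fdatasync(fd) == 0) ? 0 : -1;
    close(fd);
    return rc;
}
```

On a system that still enforces the older wording of the standard, the
fdatasync() call would be the one expected to fail.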
> In the kernel, I was always under the impression the simple part of
> fsync_range - writing out data pages - was solved years ago, but being
> sure the filesystem's updated its metadata in the proper way, that
> begs for a little research into what filesystems do when asked,
> doesn't it?
The filesystems I care about handle it fine, and while I don't know
the details of others they better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the nfs server.
[-- Attachment #2: 0001-fs-implement-fsync_range.patch --]
[-- Type: text/plain, Size: 5898 bytes --]
From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Apr 2014 11:24:51 +0200
Subject: fs: implement fsync_range
Implement a fsync/fdatasync variant that takes a range to sync. This follows the
NetBSD implementation:
http://www.freebsd.org/cgi/man.cgi?query=fsync&apropos=0&sektion=0&manpath=NetBSD+5.0&format=html
and is fairly close to the AIX implementation that the NetBSD one is based on:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.basetechref%2Fdoc%2Fbasetrf1%2Ffsync.htm
The implementation is very simple because the VFS already offers a ranged
fsync infrastructure, which is most prominently used to implement O_SYNC
and O_DSYNC writes.
Differences from NetBSD are:
1) It doesn't fail for read-only FDs. fsync doesn't, and while standards
used to require fdatasync and aio_fsync to fail for read-only file
descriptors Linux never did and the standards are catching up:
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) It doesn't implement the FDISKSYNC. Requiring a flag to actually make
data persistent is completely broken, and the Linux infrastructure
doesn't support it anyway. We could provide it as a no-op if we
really need to.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/sync.c | 101 ++++++++++++++++++++++++--------------
include/uapi/linux/fs.h | 6 ++-
4 files changed, 72 insertions(+), 37 deletions(-)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..e239d46 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 fsync_range sys_fsync_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..006d57f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common fsync_range sys_fsync_range
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd..58f9ca7 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,13 +197,13 @@ int vfs_fsync(struct file *file, int datasync)
}
EXPORT_SYMBOL(vfs_fsync);
-static int do_fsync(unsigned int fd, int datasync)
+static int do_fsync(unsigned int fd, loff_t start, loff_t end, int datasync)
{
struct fd f = fdget(fd);
int ret = -EBADF;
if (f.file) {
- ret = vfs_fsync(f.file, datasync);
+ ret = vfs_fsync_range(f.file, start, end, datasync);
fdput(f);
}
return ret;
@@ -211,12 +211,69 @@ static int do_fsync(unsigned int fd, int datasync)
SYSCALL_DEFINE1(fsync, unsigned int, fd)
{
- return do_fsync(fd, 0);
+ return do_fsync(fd, 0, LLONG_MAX, 0);
}
SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
{
- return do_fsync(fd, 1);
+ return do_fsync(fd, 0, LLONG_MAX, 1);
+}
+
+static loff_t end_offset(loff_t offset, loff_t nbytes)
+{
+ loff_t endbyte = offset + nbytes;
+
+ if ((s64)offset < 0)
+ return -EINVAL;
+ if ((s64)endbyte < 0)
+ return -EINVAL;
+ if (endbyte < offset)
+ return -EINVAL;
+
+ if (sizeof(pgoff_t) == 4) {
+ if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+ /*
+ * The range starts outside a 32 bit machine's
+ * pagecache addressing capabilities. Let it "succeed"
+ */
+ return 0;
+ }
+ if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+ /*
+ * Out to EOF
+ */
+ return LLONG_MAX;
+ }
+ }
+
+ if (nbytes == 0)
+ endbyte = LLONG_MAX;
+ else
+ endbyte--; /* inclusive */
+
+ return endbyte;
+}
+
+SYSCALL_DEFINE4(fsync_range, unsigned int, fd, int, how,
+ loff_t, start, loff_t, length)
+{
+ int datasync = 0;
+ loff_t end;
+
+ switch (how) {
+ case FDATASYNC:
+ datasync = 1;
+ break;
+ case FFILESYNC:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ end = end_offset(start, length);
+ if (end <= 0)
+ return end;
+ return do_fsync(fd, start, end, datasync);
}
/*
@@ -275,40 +332,12 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
loff_t endbyte; /* inclusive */
umode_t i_mode;
- ret = -EINVAL;
if (flags & ~VALID_FLAGS)
- goto out;
-
- endbyte = offset + nbytes;
-
- if ((s64)offset < 0)
- goto out;
- if ((s64)endbyte < 0)
- goto out;
- if (endbyte < offset)
- goto out;
-
- if (sizeof(pgoff_t) == 4) {
- if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
- /*
- * The range starts outside a 32 bit machine's
- * pagecache addressing capabilities. Let it "succeed"
- */
- ret = 0;
- goto out;
- }
- if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
- /*
- * Out to EOF
- */
- nbytes = 0;
- }
- }
+ return -EINVAL;
- if (nbytes == 0)
- endbyte = LLONG_MAX;
- else
- endbyte--; /* inclusive */
+ endbyte = end_offset(offset, nbytes);
+ if (endbyte <= 0)
+ return endbyte;
ret = -EBADF;
f = fdget(fd);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ca1a11b..491d9fe 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -199,9 +199,13 @@ struct inodes_stat_t {
#define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
#define FS_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
-
+/* flags for sync_file_range */
#define SYNC_FILE_RANGE_WAIT_BEFORE 1
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+/* flags for fsync_range */
+#define FDATASYNC 0x0010
+#define FFILESYNC 0x0020
+
#endif /* _UAPI_LINUX_FS_H */
--
1.7.10.4
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: munmap, msync: synchronization
2014-04-21 18:14 ` Christoph Hellwig
2014-04-21 19:54 ` Michael Kerrisk (man-pages)
@ 2014-04-23 14:03 ` Matthew Wilcox
1 sibling, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2014-04-23 14:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so,
Linux-Fsdevel, Miklos Szeredi, jamie
On Mon, Apr 21, 2014 at 11:14:31AM -0700, Christoph Hellwig wrote:
> > 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
> > cache system. Filesystem I/O always sees a consistent view,
> > and MS_ASYNC never undertook to give a guarantee about *when*
> > the update would occur. (The Linux buffer cache logic will
> > ensure that it is flushed out sometime in the near future.)
>
> Right. It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.
This will probably change for Persistent Memory. The reason it
works today is that we have a page cache which tracks dirty bits and
periodically writes dirty pages to storage. If we bypass the page cache,
we have to ensure that everything does still eventually get synced.
I don't quite know how this is going to work yet ... I have a number of
ideas in my head. It probably won't be asynchronous though!
> > 7. On Linux (and probably many other modern systems), the only
> > call that has any real use is msync(MS_SYNC), meaning
> > "flush the buffers *now*, and I want to wait for that to
> > complete, so that I can then continue safe in the knowledge
> > that my data has landed on a device". That's useful if we
> > want insurance for our data in the event of a system crash.
>
> Right. It's basically another way to call fsync, which is used to
> implement it underneath. It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
See also:
From: Matthew Wilcox <matthew.r.wilcox@intel.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>, willy@linux.intel.com
Subject: [PATCH] Sync only the requested range in msync
Date: Thu, 27 Mar 2014 19:02:41 -0400
Message-Id: <1395961361-21307-1-git-send-email-matthew.r.wilcox@intel.com>
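For reference, the pattern from point 7 can be sketched in userspace as
follows. This is only an illustration and is not taken from the patch
referenced above: store_and_sync() is a hypothetical helper name, and it
assumes map is the page-aligned start of a MAP_SHARED file mapping that
covers the bytes being written. msync() wants a page-aligned address, so
the start of the synced region is rounded down to a page boundary.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any patch in this thread): copy len
 * bytes into a MAP_SHARED mapping at byte offset off, then block until
 * the pages covering that region have been written back.
 * Assumes map is the address returned by mmap(), i.e. page aligned.
 * Returns 0 on success, -1 with errno set on failure (as msync() does).
 */
static int store_and_sync(char *map, size_t off, const void *src, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(page - 1);   /* round down to a page boundary */

    memcpy(map + off, src, len);

    /* MS_SYNC: request write-back and wait for it to complete. */
    return msync(map + start, (off - start) + len, MS_SYNC);
}
```

A caller would mmap() the file with MAP_SHARED, call
store_and_sync(map, off, data, len) for each update that has to survive a
crash, and treat a return value of -1 as a failed write-back.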
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
2014-04-22 9:28 ` [PATCH] fsync_range, was: " Christoph Hellwig
@ 2014-04-23 14:33 ` Michael Kerrisk (man-pages)
2014-04-23 15:45 ` Christoph Hellwig
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-23 14:33 UTC (permalink / raw)
To: Christoph Hellwig, Jamie Lokier
Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On 04/22/2014 11:28 AM, Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
>> Hi Christoph,
>>
>> Hardly research, I just did a quick Google and was surprised to find
>> some results. AIX API differs from the BSDs; the BSDs seem to agree
>> with each other. fsync_range(), with a flag parameter saying what type
>> of sync, and whether it flushes the storage device write cache as well
>> (because they couldn't agree that was good - similar to the barriers
>> debate).
>
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.
>
> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace. Following the latter
> seems more sensible, and also allows developers to define the separate
> name to the O_ flag for portability.
>
>> As for me doing it, no, sorry, I haven't touched the kernel in a few
>> years, life's been complicated for non-technical reasons, and I don't
>> have time to get back into it now.
>
> I've cooked up a patch, but I really need someone to test it and promote
> it. Find the patch attached. There are two differences to the NetBSD
> one:
>
> 1) It doesn't fail for read-only FDs. fsync doesn't, and while
> standards used to have fdatasync and aio_fsync fail for them,
> Linux never did and the standards are catching up:
>
> http://austingroupbugs.net/view.php?id=501
> http://austingroupbugs.net/view.php?id=671
>
> 2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
> and we wouldn't even have the infrastructure for it. It might make
> sense to provide it defined to 0 so that we have the identifier but
> make it a no-op.
>
>> In the kernel, I was always under the impression the simple part of
>> fsync_range - writing out data pages - was solved years ago, but being
>> sure the filesystem's updated its metadata in the proper way, that
>> begs for a little research into what filesystems do when asked,
>> doesn't it?
>
> The filesystems I care about handle it fine, and while I don't know
> the details of others they better handle it properly, given that we
> use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
> from the nfs server.
The functionality sounds like it would be worthwhile. I've applied the
patch against 3.15-rc2, and employed the test program below, with test
files on standard laptop HDD (ext4). The test program repeatedly
a) overwrites a specified region of a file
b) does an fsync_range() on a specified range of the file (need not be
the same region that was written).
The CLI is crude, but the arguments are:
1: pathname
2: number of loops
3: Starting point for writes each time round loop
4: Length of region to write
5: Either 'f' for FFILESYNC or 'd' for FDATASYNC
6: start offset for fsync_range()
7: length for fsync_range()
It seems that the patch does roughly what it says on the tin:
# Precreate a 1MB file
$ dd of=/testfs/f bs=1000 count=1000 if=/dev/full
1000+0 records in
1000+0 records out
1000000 bytes (1.0 MB) copied, 0.00575843 s, 174 MB/s
# Take journaling and atime out of the equation:
$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6
[sudo] password for mtk:
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
# Filesystem unmounted and remounted (with above options) before
# each of the following tests
===
# 1000 loops, writing 1 MB, syncing entire 1MB range, with FFILESYNC:
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations
real 0m10.677s
user 0m0.011s
sys 0m0.816s
# 1000 loops, writing 1MB, syncing entire 1MB range, with FDATASYNC:
# (Takes less time, as expected)
$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations
real 0m8.685s
user 0m0.017s
sys 0m0.825s
===
# 1000 loops, writing 1 MB, syncing just 100kB, with FFILESYNC:
# (Takes less time than syncing entire 1MB range, as expected)
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 100000
fsync_range(3, 0x20, 0, 100000)
Performed 16000 writes
Performed 1000 sync operations
real 0m1.501s
user 0m0.005s
sys 0m0.339s
# 1000 loops, writing 1 MB, syncing just 10kB, with FFILESYNC:
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 10000
fsync_range(3, 0x20, 0, 10000)
Performed 16000 writes
Performed 1000 sync operations
real 0m0.616s
user 0m0.004s
sys 0m0.240s
=======
But I have a question:
When I precreate a 10MB file, and repeat the tests (this time with
100 loops), I no longer see any significant difference between
FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
though I did the tests repeatedly with broadly similar results
each time:
#FFILESYNC
$ time ./t_fsync_range /testfs/f 100 0 10000000 f 0 10000000
fsync_range(3, 0x20, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations
real 0m17.575s
user 0m0.001s
sys 0m0.656s
# FDATASYNC
$ time ./t_fsync_range /testfs/f 100 0 10000000 d 0 10000000
fsync_range(3, 0x10, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations
real 0m17.228s
user 0m0.005s
sys 0m0.624s
======
Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?
======
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
2014-04-23 14:33 ` Michael Kerrisk (man-pages)
@ 2014-04-23 15:45 ` Christoph Hellwig
[not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-24 9:34 ` Michael Kerrisk (man-pages)
0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-04-23 15:45 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, Jamie Lokier, Heinrich Schuchardt, linux-man,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote:
> # Take journaling and atime out of the equation:
>
> $ sudo umount /dev/sdb6
> $ sudo tune2fs -O ^has_journal /dev/sdb6
> [sudo] password for mtk:
> tune2fs 1.42.8 (20-Jun-2013)
> $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
The second strictatime argument overrides the earlier norelatime,
so you put it into the picture.
>
> But I have a question:
>
> When I precreate a 10MB file, and repeat the tests (this time with
> 100 loops), I no longer see any significant difference between
> FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
> though I did the tests repeatedly with broadly similar results
> each time:
Not sure. Do you also see this on other filesystems?
> Add another question: is there any piece of sync_file_range()
> functionality that could or should be incorporated in this API?
I don't think so. sync_file_range is a complete mess and impossible
to use correctly for data integrity operations. Especially the whole
notion that submitting I/O and waiting for it are separate operations
is incompatible with a data integrity call.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-23 22:15 ` Jamie Lokier
[not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-24 1:34 ` Dave Chinner
1 sibling, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-23 22:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> > Hi Christoph,
> >
> > Hardly research, I just did a quick Google and was surprised to find
> > some results. AIX API differs from the BSDs; the BSDs seem to agree
> > with each other. fsync_range(), with a flag parameter saying what type
> > of sync, and whether it flushes the storage device write cache as well
> > (because they couldn't agree that was good - similar to the barriers
> > debate).
>
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.
Yes, especially with the headings on the man pages saying FreeBSD :)
Just checked a FreeBSD 8.2 system, doesn't have it.
> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace. Following the latter
> seems more sensible, and also allows developers to define the separate
> name to the O_ flag for portability.
...
> I've cooked up a patch, but I really need someone to test it and promote
> it. Find the patch attached. There are two differences to the NetBSD
> one:
>
> 1) It doesn't fail for read-only FDs. fsync doesn't, and while
> standards used to have fdatasync and aio_fsync fail for them,
> Linux never did and the standards are catching up:
>
> http://austingroupbugs.net/view.php?id=501
> http://austingroupbugs.net/view.php?id=671
See also for maybe why:
http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html
> 2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
> and we wouldn't even have the infrastructure for it. It might make
> sense to provide it defined to 0 so that we have the identifier but
> make it a no-op.
I presume Linux does the equivalent without needing FDISKSYNC, if and
only if the filesystem is mounted with barriers enabled, which is the
default nowadays?
> > In the kernel, I was always under the impression the simple part of
> > fsync_range - writing out data pages - was solved years ago, but being
> > sure the filesystem's updated its metadata in the proper way, that
> > begs for a little research into what filesystems do when asked,
> > doesn't it?
>
> The filesystems I care about handle it fine, and while I don't know
> the details of others they better handle it properly, given that we
> use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
> from the nfs server.
Excellent. This really looks like it should have gone in as a system
call years ago, since vfs_fsync_range was there all along waiting to
be used!
> Differences from NetBSD are:
>
> 1) It doesn't fail for read-only FDs. fsync doesn't, and while standards
> used to require fdatasync and aio_fsync to fail for read-only file
> descriptors Linux never did and the standards are catching up:
>
> http://austingroupbugs.net/view.php?id=501
> http://austingroupbugs.net/view.php?id=671
>
> 2) It doesn't implement the FDISKSYNC. Requiring a flag to actually make
> data persistent is completely broken, and the Linux infrastructure
> doesn't support it anyway. We could provide it as a no-op if we
> really need to.
Ah, more differences, which I think should be dropped actually.
3) Does not implement NetBSD's documented behaviour when length == 0.
NetBSD says "If the length parameter is zero, fsync_range() will
synchronize all of the file data". This patch syncs from offset to EOF.
4) Other weird range stuff inherited from sync_file_range() on 32
bit machines only. May not be correct with O_DIRECT or
filesystems that don't use page cache.
See:
> +static loff_t end_offset(loff_t offset, loff_t nbytes)
> +{
> + loff_t endbyte = offset + nbytes;
> +
> + if ((s64)offset < 0)
> + return -EINVAL;
> + if ((s64)endbyte < 0)
> + return -EINVAL;
> + if (endbyte < offset)
> + return -EINVAL;
> +
> + if (sizeof(pgoff_t) == 4) {
> + if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> + /*
> + * The range starts outside a 32 bit machine's
> + * pagecache addressing capabilities. Let it "succeed"
> + */
> + return 0;
> + }
> + if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> + /*
> + * Out to EOF
> + */
> + return LLONG_MAX;
> + }
> + }
> +
> + if (nbytes == 0)
> + endbyte = LLONG_MAX;
> + else
> + endbyte--; /* inclusive */
> +
> + return endbyte;
> +}
That was in sync_file_range(), where I think it might have made more
sense as that's obviously tied to the page cache only. So:
a) Giving zero length results in sync from offset..LLONG_MAX.
(NetBSD would have it be 0..LLONG_MAX, according to man page.)
b) If the offset is "too large" for page cache on a 32-bit machine,
it won't do anything -- including no metadata side-effects.
c) If the length is "too large" for page cache on a 32-bit machine,
it extends the length to LLONG_MAX.
The desired behaviour with zero length, that's obviously a judgement
call. I guess that provided NetBSD applications the option to use
FDISKSYNC without a range :)
About b) and c) they both look dubious, because it's not a given that
a filesystem is using page cache, or only using page cache. For
example FUSE using O_DIRECT. (Not that I've checked if you can
actually write anything in those ranges though.)
b) looks worse because it means side effects are also quietly not
done, and a file might legitimately not use the page cache (consider a
FUSE-mounted file accessed with O_DIRECT).
So, would it not make sense to just check the offset, length and
offset+length fit into s64; and if length is zero change the range to
0..LLONG_MAX, and simply match NetBSD that way? (Or, call me crazy,
just return if length is zero.)
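To make the suggestion concrete, here is a rough userspace model of that
simpler check. It is a sketch, not a replacement for end_offset() in the
patch: simple_sync_range() and its out parameters are made-up names,
int64_t stands in for s64, and the zero-length case follows the NetBSD
man page wording quoted above. Reporting the status separately from the
range also avoids overloading the return value; as far as I can tell, in
the posted helper a one-byte range at offset 0 produces an inclusive end
of 0, which the caller's "end <= 0" test then treats as nothing to do.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the simplified range check; this is
 * not the patch's end_offset().  On success, stores the inclusive byte
 * range to sync in *first and *last and returns 0.  Returns -EINVAL if
 * offset, length or offset + length does not fit in a signed 64-bit value.
 */
static int simple_sync_range(int64_t offset, int64_t length,
                             int64_t *first, int64_t *last)
{
    if (offset < 0 || length < 0)
        return -EINVAL;
    if (length > INT64_MAX - offset)    /* offset + length would overflow */
        return -EINVAL;

    if (length == 0) {
        /* NetBSD-documented behaviour: synchronize the whole file. */
        *first = 0;
        *last = INT64_MAX;
    } else {
        *first = offset;
        *last = offset + length - 1;    /* inclusive end */
    }
    return 0;
}
```

With this shape there is no special casing for 32-bit page cache limits,
and a range of (0, 1) yields first == last == 0 with a success status
rather than an early return.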
Best,
-- Jamie
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
[not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-23 22:20 ` Jamie Lokier
[not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2014-04-23 22:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
> > Add another question: is there any piece of sync_file_range()
> > functionality that could or should be incorporated in this API?
>
> I don't think so. sync_file_range is a complete mess and impossible
> to use correctly for data integrity operations. Especially the whole
> notion that submitting I/O and waiting for it are separate operations
> is incompatible with a data integrity call.
I guess it's also to give the application a way to nudge a preferred
asynchronous writeback order, prior to a synchronous wait. If the
application knows there's a lot of dirty data being generated over
time prior to needing a short fdatasync, it might see it as beneficial
to tell the kernel to start writing that data sooner, so the fdatasync
delay will be shorter.
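A minimal sketch of that usage pattern, assuming the application tracks
which range it has just dirtied. start_writeback() and commit_data() are
hypothetical helper names; sync_file_range() and fdatasync() are the
existing Linux calls. The first call is only a hint and carries no
integrity guarantee, which is why the fdatasync() is still needed at the
commit point.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start asynchronous write-back
 * of a byte range that was just dirtied.  SYNC_FILE_RANGE_WRITE on its
 * own does not wait for the I/O and does not flush metadata, so it
 * gives no durability guarantee.  Returns 0, or -1 with errno set.
 */
static int start_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical helper: the actual data integrity point.  If most pages
 * were already pushed out by start_writeback(), this should have less
 * left to write.  Returns 0, or -1 with errno set.
 */
static int commit_data(int fd)
{
    return fdatasync(fd);
}
```

An application would call start_writeback() after each large write and
commit_data() once, when it needs the data to be stable.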
-- Jamie
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:15 ` Jamie Lokier
@ 2014-04-24 1:34 ` Dave Chinner
2014-04-25 6:06 ` Christoph Hellwig
1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2014-04-24 1:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jamie Lokier, Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Tue, Apr 22, 2014 at 02:28:37AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> > Hi Christoph,
> >
> > Hardly research, I just did a quick Google and was surprised to find
> > some results. AIX API differs from the BSDs; the BSDs seem to agree
> > with each other. fsync_range(), with a flag parameter saying what type
> > of sync, and whether it flushes the storage device write cache as well
> > (because they couldn't agree that was good - similar to the barriers
> > debate).
>
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.
>
> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace. Following the latter
> seems more sensible, and also allows developers to define the separate
> name to the O_ flag for portability.
>
> > As for me doing it, no, sorry, I haven't touched the kernel in a few
> > years, life's been complicated for non-technical reasons, and I don't
> > have time to get back into it now.
>
> I've cooked up a patch, but I really need someone to test it and promote
> it. Find the patch attached. There are two differences to the NetBSD
> one:
.....
> From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Date: Tue, 22 Apr 2014 11:24:51 +0200
> Subject: fs: implement fsync_range
Christoph, if this is going into the kernel, can you add support for
xfs_io and write a couple of xfstests to test it? I'm not
comfortable with adding new data integrity primitives to the kernel
without having robust validation infrastructure already in place for
it. It might also be worthwhile looking to extend Josef's
fsync-tester.c to be able to use ranged fsyncs so to test all the
various corner cases that we need to....
Cheers,
Dave.
--
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
2014-04-23 15:45 ` Christoph Hellwig
[not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-24 9:34 ` Michael Kerrisk (man-pages)
1 sibling, 0 replies; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-24 9:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: mtk.manpages, Jamie Lokier, Heinrich Schuchardt, linux-man,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
(Oops -- I see that I forgot to attach the test program in my last
mail. Appended below, now.)
On 04/23/2014 05:45 PM, Christoph Hellwig wrote:
> On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote:
>> # Take journaling and atime out of the equation:
>>
>> $ sudo umount /dev/sdb6
>> $ sudo tune2fs -O ^has_journal /dev/sdb6
>> [sudo] password for mtk:
>> tune2fs 1.42.8 (20-Jun-2013)
>> $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
>
> The second strictatime argument overrides the earlier norelatime,
> so you put it into the picture.
Oh -- have I misunderstood something? I was wanting classical behavior:
atime always updated (but only synced to disk by FFILESYNC). Is that not
what I should get with norelatime+strictatime?
>> But I have a question:
>>
>> When I precreate a 10MB file, and repeat the tests (this time with
>> 100 loops), I no longer see any significant difference between
>> FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
>> though I did the tests repeatedly with broadly similar results
>> each time:
>
> Not sure. Do you also see this on other filesystems?
=======
So, here's some results from XFS:
# 1000 loops. 1MB file, 1MB fsync_range()
# As with ext4, FDATASYNC is faster than FFILESYNC (as expected)
$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations
real 0m52.264s
user 0m0.018s
sys 0m0.926s
$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations
real 0m33.689s
user 0m0.002s
sys 0m0.915s
# (Note that I did not disable XFS journalling--it's not possible to
# do so, right?)
====
# 100 loops, 100MB file, 100MB fsync_range()
# FDATASYNC and FFILESYNC times are again similar
$ time ./t_fsync_range /testfs/f 100 0 100000000 f 0 100000000
fsync_range(3, 0x20, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations
real 4m45.257s
user 0m0.004s
sys 0m5.607s
$ time ./t_fsync_range /testfs/f 100 0 100000000 d 0 100000000
fsync_range(3, 0x10, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations
real 4m43.925s
user 0m0.010s
sys 0m3.824s
# Again, the same pattern: no difference between FFILESYNC and FDATASYNC
=====
On JFS, I get
1000 loops, 1MB file, 1MB fsync_range, FFILESYNC:
* Quite a lot of variability (11.3 to 16.5 secs)
1000 loops, 1MB file, 1MB fsync_range, FDATASYNC:
* Quite a lot of variability (8.6 to 10.9 secs)
==> FDATASYNC is on average faster than FFILESYNC
100 loops, 100 MB file, 100MB fsync_range, FFILESYNC:
281 seconds (just a single test)
100 loops, 100 MB file, 100MB fsync_range, FDATASYNC:
280 seconds (just a single test)
So, again, it seems like for a large file sync, there's no difference between
FFILESYNC and FDATASYNC
>> Add another question: is there any piece of sync_file_range()
>> functionality that could or should be incorporated in this API?
>
> I don't think so. sync_file_range is a complete mess and impossible
> to use correctly for data integrity operations. Especially the whole
> notion that submitting I/O and waiting for it are separate operations
> is incompatible with a data integrity call.
Okay -- I just thought it worth checking.
Cheers,
Michael
========
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
/* flags for fsync_range */
#define FDATASYNC 0x0010
#define FFILESYNC 0x0020
#define SYS_fsync_range 317 /* x86-64 syscall number assigned by the patch under test */
static int
fsync_range(unsigned int fd, int how, loff_t start, loff_t length)
{
return syscall(SYS_fsync_range, fd, how, start, length);
}
#define BUF_SIZE 65536
static char buf[BUF_SIZE];
int
main(int argc, char *argv[])
{
int j, fd, nloops, how;
size_t writeLen, syncLen, wlen;
size_t bufSize;
off_t writeOffset, syncOffset;
int scnt, wcnt;
if (argc != 8 || strcmp(argv[1], "--help") == 0) {
fprintf(stderr, "%s pathname nloops write-offset write-length {f|d} "
"sync-offset sync-len\n", argv[0]);
exit(EXIT_SUCCESS);
}
fd = open(argv[1], O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
if (fd == -1)
errExit("read");
nloops = atoi(argv[2]);
writeOffset = atoi(argv[3]);
writeLen = atoi(argv[4]);
how = (argv[5][0] == 'd') ? FDATASYNC :
(argv[5][0] == 'f') ? FFILESYNC : 0;
syncOffset = atoi(argv[6]);
syncLen = atoi(argv[7]);
if (how != 0)
fprintf(stderr, "fsync_range(%d, 0x%x, %lld, %zd)\n",
fd, how, (long long) syncOffset, syncLen);
scnt = 0;
wcnt = 0;
for (j = 0; j < nloops; j++) {
memset(buf, j % 256, BUF_SIZE);
if (lseek(fd, writeOffset, SEEK_SET) == -1)
errExit("lseek");
wlen = writeLen;
while (wlen > 0) {
bufSize = (wlen > BUF_SIZE) ? BUF_SIZE : wlen;
wlen -= bufSize;
if (write(fd, buf, bufSize) != bufSize) {
fprintf(stderr, "Write failed\n");
exit(EXIT_FAILURE);
}
wcnt++;
}
if (how != 0) {
scnt++;
if (fsync_range(fd, how, syncOffset, syncLen) == -1)
errExit("fsync_range");
}
}
fprintf(stderr, "Performed %d writes\n", wcnt);
fprintf(stderr, "Performed %d sync operations\n", scnt);
exit(EXIT_SUCCESS);
}
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
2014-04-24 1:34 ` Dave Chinner
@ 2014-04-25 6:06 ` Christoph Hellwig
From: Christoph Hellwig @ 2014-04-25 6:06 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Jamie Lokier, Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Thu, Apr 24, 2014 at 11:34:35AM +1000, Dave Chinner wrote:
> Christoph, if this is going into the kernel, can you add support for
> xfs_io and write a couple of xfstests to test it? I'm not
> comfortable with adding new data integrity primitives to the kernel
> without having robust validation infrastructure already in place for
> it. It might also be worthwhile looking to extend Josef's
> fsync-tester.c to be able to use ranged fsyncs so to test all the
> various corner cases that we need to....
If we actually want to add it, it will obviously need test coverage. It seems
like I can't really get people excited enough to make this more than a
PoC so far, though.
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
[not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-25 6:07 ` Christoph Hellwig
From: Christoph Hellwig @ 2014-04-25 6:07 UTC
To: Jamie Lokier
Cc: Christoph Hellwig, Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Wed, Apr 23, 2014 at 11:20:11PM +0100, Jamie Lokier wrote:
> I guess it's also to give the application a way to nudge a preferred
> asynchronous writeback order, prior to a synchronous wait. If the
> application knows there's a lot of dirty data being generated over
> time prior to needing a short fdatasync, it might see it as beneficial
> to tell the kernel to start writing that data sooner, so the fdatasync
> delay will be shorter.
If they want to do an async writeback pass first they can just use
sync_file_range for it, that's the only thing it's actually useful for.
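A minimal sketch of that split, not taken from the thread: sync_file_range(2) with SYNC_FILE_RANGE_WRITE only starts writeback, and a later fdatasync(2) is still what provides the integrity guarantee. The helper names start_writeback() and commit_data() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to begin writing back dirty pages
   in [offset, offset + nbytes) without waiting for completion.
   nbytes == 0 means "through the end of the file".
   This gives no data integrity guarantee on its own. */
static int
start_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(fd >= 0);
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: the actual integrity point. fdatasync() waits
   for the data (and the metadata needed to read it back) to be written
   out, however much of it was already started by start_writeback(). */
static int
commit_data(int fd)
{
    assert(fd >= 0);
    return fdatasync(fd);
}
```

An application would call start_writeback(fd, 0, 0) while it is still generating dirty data, and commit_data(fd) at the point where durability actually matters.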
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
[not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-25 6:26 ` Christoph Hellwig
From: Christoph Hellwig @ 2014-04-25 6:26 UTC
To: Jamie Lokier
Cc: Michael Kerrisk (man-pages),
Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA,
Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi
On Wed, Apr 23, 2014 at 11:15:27PM +0100, Jamie Lokier wrote:
> > 1) It doesn't fail for read-only FDs. fsync doesn't, and while
> > standards used to have fdatasync and aio_fsync fail for them,
> > Linux never did and the standards are catching up:
> >
> > http://austingroupbugs.net/view.php?id=501
> > http://austingroupbugs.net/view.php?id=671
>
> See also for maybe why:
>
> http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html
I don't really see a "why" there, just the observation that fsync and
fsync_range behave differently on NetBSD, which is odd but documented
behavior.
> > 2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
> > and we wouldn't even have the infrastructure for it. It might make
> > sense to provide it defined to 0 so that we have the identifier but
> > make it a no-op.
>
> I presume Linux does the equivalent without needing FDISKSYNC, if and
> only if the filesystem is mounted with barriers enabled, which is the
> default nowadays?
That's correct, at least for modern mainstream filesystems. Either way
the filesystem would have to implement the cache flush, so those that
don't support it couldn't support FDISKSYNC either.
> Ah, more differences, which I think should be dropped actually.
>
> 3) Does not implement NetBSD's documented behaviour when length == 0.
> NetBSD says "If the length parameter is zero, fsync_range() will
> synchronize all of the file data". This patch syncs from offset.
Indeed. AIX also documents the same behavior.
> 4) Other weird range stuff inherited from sync_file_range() on 32
> bit machines only. May not be correct with O_DIRECT or
> filesystems that don't use page cache.
It's not really possible to implement a full Linux filesystem without
touching the pagecache, but I agree that this probably doesn't
belong in the VFS. sync_file_range is one of those odd layering
violations that calls straight into the pagecache without going into
the filesystem first (readahead is the other one that comes to mind).
> The desired behaviour with zero length, that's obviously a judgement
> call. I guess that provided NetBSD applications the option to use
> FDISKSYNC without a range :)
It seems to originate from the earlier AIX version, but I think it's
just their way to sync the whole range. I prefer our 0, LLONG_MAX
notation, but given the existing user interface we should stick to it.
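For illustration only, the NetBSD rule quoted above can be mapped onto the first/last byte notation like this. fsync_range_bounds() and struct byte_range are hypothetical names, and the sketch mirrors the quoted NetBSD text rather than what the patch currently does.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical inclusive byte range: [first, last] */
struct byte_range {
    long long first;
    long long last;
};

/* Translate an (offset, length) pair into an inclusive byte range.
   Assumption, following the quoted NetBSD wording: length == 0 means
   "synchronize all of the file data", i.e. 0 .. LLONG_MAX. */
static struct byte_range
fsync_range_bounds(long long start, long long length)
{
    struct byte_range r;

    assert(start >= 0 && length >= 0);

    if (length == 0) {
        /* Whole file, regardless of the offset passed in */
        r.first = 0;
        r.last = LLONG_MAX;
    } else if (length > LLONG_MAX - start) {
        /* Clamp instead of overflowing start + length */
        r.first = start;
        r.last = LLONG_MAX;
    } else {
        r.first = start;
        r.last = start + length - 1;
    }
    return r;
}
```

With this mapping, fsync_range_bounds(4096, 0) covers the whole file, while fsync_range_bounds(4096, 100) covers only bytes 4096 through 4195.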
Thread overview: 18+ messages
2014-04-20 10:28 munmap, msync: synchronization Heinrich Schuchardt
2014-04-21 10:16 ` Michael Kerrisk (man-pages)
[not found] ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-04-21 18:14 ` Christoph Hellwig
2014-04-21 19:54 ` Michael Kerrisk (man-pages)
2014-04-21 21:34 ` Jamie Lokier
[not found] ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-22 6:03 ` Christoph Hellwig
2014-04-22 7:04 ` Jamie Lokier
2014-04-22 9:28 ` [PATCH] fsync_range, was: " Christoph Hellwig
2014-04-23 14:33 ` Michael Kerrisk (man-pages)
2014-04-23 15:45 ` Christoph Hellwig
[not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:20 ` Jamie Lokier
[not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25 6:07 ` Christoph Hellwig
2014-04-24 9:34 ` Michael Kerrisk (man-pages)
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:15 ` Jamie Lokier
[not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25 6:26 ` Christoph Hellwig
2014-04-24 1:34 ` Dave Chinner
2014-04-25 6:06 ` Christoph Hellwig
2014-04-23 14:03 ` Matthew Wilcox