On Fri, 2017-12-22 at 09:35 +1100, NeilBrown wrote: > On Thu, Dec 21 2017, Trond Myklebust wrote: > > > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote: > > > On Thu, Dec 21 2017, Trond Myklebust wrote: > > > > > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote: > > > > > Hi Neil- > > > > > > > > > > > > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown > > > > > > wrote: > > > > > > > > > > > > > > > > > > When an i_op->getattr() call is made on an NFS file > > > > > > (typically from a 'stat' family system call), NFS > > > > > > will first flush any dirty data to the server. > > > > > > > > > > > > This ensures that the mtime reported is correct and stable, > > > > > > but has a performance penalty. 'stat' is normally thought > > > > > > to be a quick operation, and imposing this cost can be > > > > > > surprising. > > > > > > > > > > To be clear, this behavior is a POSIX requirement. > > > > > > > > > > > > > > > > I have seen problems when one process is writing a large > > > > > > file and another process performs "ls -l" on the containing > > > > > > directory and is blocked for as long as it take to flush > > > > > > all the dirty data to the server, which can be minutes. > > > > > > > > > > Yes, a well-known annoyance that cannot be addressed > > > > > even with a write delegation. > > > > > > > > > > > > > > > > I have also seen a legacy application which frequently > > > > > > calls > > > > > > "fstat" on a file that it is writing to. On a local > > > > > > filesystem (and in the Solaris implementation of NFS) this > > > > > > fstat call is cheap. On Linux/NFS, the causes a noticeable > > > > > > decrease in throughput. > > > > > > > > > > If the preceding write is small, Linux could be using > > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE. > > > > > > > > > > > > > > > > The only circumstances where an application calling > > > > > > 'stat()' > > > > > > might get an mtime which is not stable are times when some > > > > > > other process is writing to the file and the two processes > > > > > > are not using locking to ensure consistency, or when the > > > > > > one > > > > > > process is both writing and stating. In neither of these > > > > > > cases is it reasonable to expect the mtime to be stable. > > > > > > > > > > I'm not convinced this is a strong enough rationale > > > > > for claiming it is safe to disable the existing > > > > > behavior. > > > > > > > > > > You've explained cases where the new behavior is > > > > > reasonable, but do you have any examples where the > > > > > new behavior would be a problem? There must be a > > > > > reason why POSIX explicitly requires an up-to-date > > > > > mtime. > > > > > > > > > > What guidance would nfs(5) give on when it is safe > > > > > to specify the new mount option? > > > > > > > > > > > > > > > > In the most common cases where mtime is important > > > > > > (e.g. make), no other process has the file open, so there > > > > > > will be no dirty data and the mtime will be stable. > > > > > > > > > > Isn't it also the case that make is a multi-process > > > > > workload where one process modifies a file, then > > > > > closes it (which triggers a flush), and then another > > > > > process stats the file? The new mount option does > > > > > not change the behavior of close(2), does it? > > > > > > > > > > > > > > > > Rather than unilaterally changing this behavior of 'stat', > > > > > > this patch adds a "nosyncflush" mount option to allow > > > > > > sysadmins to have applications which are hurt by the > > > > > > current > > > > > > behavior to disable it. > > > > > > > > > > IMO a mount option is at the wrong granularity. A > > > > > mount point will be shared between applications that > > > > > can tolerate the non-POSIX behavior and those that > > > > > cannot, for instance. > > > > > > > > Agreed. > > > > > > > > The other thing to note here is that we now have an embryonic > > > > statx() > > > > system call, which allows the application itself to decide > > > > whether > > > > or > > > > not it needs up to date values for the atime/ctime/mtime. While > > > > we > > > > haven't yet plumbed in the NFS side, the intention was always > > > > to > > > > use > > > > that information to turn off the writeback flushing when > > > > possible. > > > > > > Yes, if statx() were actually working, we could change the > > > application > > > to avoid the flush. But then if changing the application were an > > > option, I suspect that - for my current customer issue - we could > > > just > > > remove the fstat() calls. I doubt they are really necessary. > > > I think programmers often think of stat() (and particularly > > > fstat()) > > > as > > > fairly cheap and so they use it whenever convenient. Only NFS > > > violates > > > this expectation. > > > > > > Also statx() is only a real solution if/when it gets widely > > > used. Will > > > "ls -l" default to AT_STATX_DONT_SYNC ?? > > > > > > Apart from the Posix requirement (which only requires that the > > > timestamps be updated, not that the data be flushed), do you know > > > of > > > any > > > value gained from flushing data before stat()? > > > > > > > POSIX requires that timestamps change as part of the read() or > > write() > > system call. > > Does it require that they don't change at other times? Yes. > I see the (arguable) deviation from POSIX that I propose to be > completely inline with the close-to-open consistency semantics that > NFS > already uses, which are weaker than what POSIX might suggest. > > As you didn't exactly answer the question, would it be fair to say > that > you don't know of any reason to flush-before-stat except that POSIX > seems to require it? The reason is to emulate POSIX semantics. That is the only reason, and that is why when we added a statx() function, I asked that we allow the application to specify that it doesn't need these POSIX semantics. -- Trond Myklebust Linux NFS client maintainer, PrimaryData trond.myklebust@primarydata.com