* Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
@ 2009-04-30 20:12 Brian R Cowan
  2009-04-30 20:25 ` Christoph Hellwig
  2009-04-30 20:28 ` Chuck Lever
  0 siblings, 2 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-04-30 20:12 UTC (permalink / raw)
  To: linux-nfs

Hello all,

This is my first post, so please be gentle.... I have been working with
a customer who is attempting to build their product in ClearCase
dynamic views on Linux. When they went from Red Hat Enterprise Linux 4
(Update 5) to Red Hat Enterprise Linux 5 (Update 2), their build
performance degraded dramatically. While troubleshooting the issue, we
noticed that links on RHEL 5 caused an incredible number of "STABLE"
4 KB NFS writes even though the storage we were writing to was
EXPLICITLY mounted async. (This made RHEL 5 nearly 5x slower than
RHEL 4.5 in this area...)

On consultation with some internal resources, we found this change in the 
2.6 kernel:
        
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2

In here it looks like the NFS client is forcing sync writes any time a
write of less than the NFS write size occurs. We tested this hypothesis
by setting the write size to 2 KB. The "STABLE" writes went away and
link times came back down out of the stratosphere. We built a modified
kernel based on the RHEL 5.2 kernel (one that ONLY backed out this
change) and got a 33% improvement in overall build speeds. In my case,
I see almost identical build times between the two OSes when we use
this modified kernel on RHEL 5.
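The heuristic described above can be sketched as a toy model. This is
behavior inferred from the commit message, not the actual kernel code,
and the 32 KB wsize default is an assumption for illustration:

```python
WSIZE = 32 * 1024          # assumed negotiated NFS write size
PAGE = 4096

def flush_mode(dirty_bytes, wsize=WSIZE):
    """Model of the 2.6 heuristic: a flush that fits in a single
    WRITE RPC is sent stable (FILE_SYNC); larger flushes go out
    UNSTABLE and are followed by one COMMIT."""
    if dirty_bytes <= wsize:
        return "FILE_SYNC"
    return "UNSTABLE+COMMIT"

# A lone 4 KB page flushed by itself is forced stable...
assert flush_mode(PAGE) == "FILE_SYNC"
# ...while dropping wsize below the page size sidesteps the branch,
# which matches the 2 KB write-size experiment described above.
assert flush_mode(PAGE, wsize=2048) == "UNSTABLE+COMMIT"
```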

Now, why am I posting this to the list? I need to understand *why* that
change was made. On the face of it, simply backing out that patch would
be perfect. But I'm paranoid. I want to make sure that this is the ONLY
reason:
"/* For single writes, FLUSH_STABLE is more efficient */ "

It seems more accurate to say that they *aren't* more efficient, but 
rather are "safer, but slower."

I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4 
kernel, and SLES 9 is based on something in the same ballpark. And our 
customers see problems when they go to SLES 10/RHEL 5 from the prior major 
distro version.

=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
 
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
 

Please be sure to update your PMR using ESR at 
http://www-306.ibm.com/software/support/probsub.html or cc all 
correspondence to sw_support@us.ibm.com to be sure your PMR is updated in 
case I am not available.


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 20:12 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan
@ 2009-04-30 20:25 ` Christoph Hellwig
  2009-04-30 20:28 ` Chuck Lever
  1 sibling, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2009-04-30 20:25 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: linux-nfs

On Thu, Apr 30, 2009 at 04:12:19PM -0400, Brian R Cowan wrote:
> Hello all,
> 
> This is my first post, so please be gentle.... I have been working with a 
> customer who is attempting to build their product in ClearCase dynamic 
> views on Linux.

> I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4 
> kernel, and SLES 9 is based on something in the same ballpark. And our 
> customers see problems when they go to SLES 10/RHEL 5 from the prior major 
> distro version.

You should probably complain to the distro vendors if you use distro
kernels.  And even when the change might not be directly related,
please reproduce anything posted to upstream projects without
binary-only module junk like ClearCase.



* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 20:12 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan
  2009-04-30 20:25 ` Christoph Hellwig
@ 2009-04-30 20:28 ` Chuck Lever
  2009-04-30 20:41   ` Peter Staubach
  1 sibling, 1 reply; 94+ messages in thread
From: Chuck Lever @ 2009-04-30 20:28 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: linux-nfs


On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:

> Hello all,
>
> This is my first post, so please be gentle.... I have been working  
> with a
> customer who is attempting to build their product in ClearCase dynamic
> views on Linux. When they went from Red hat Enterprise Linux 4  
> (update 5)
> to Red Hat Enterprise Linux 5 (Update 2), their build performance  
> degraded
> dramatically. When troubleshooting the issue, we noticed that links on
> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even  
> though
> the storage we were writing to was EXPLICITLY mounted async. (This  
> made
> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>
> On consultation with some internal resources, we found this change  
> in the
> 2.6 kernel:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>
> In here it looks like the NFS client is forcing sync writes any time a
> write of less than the NFS write size occurs. We tested this  
> hypothesis by
> setting the write size to 2KB. The "STABLE" writes went away and link
> times came back down out of the stratosphere. We built a modified  
> kernel
> based on the RHEL 5.2 kernel (that ONLY backed out of this change)  
> and we
> got a 33% improvement in overall build speeds. In my case, I see  
> almost
> identical build times between the 2 OS's when we use this modified  
> kernel
> on RHEL 5.
>
> Now, why am I posing this to the list? I need to understand *why* that
> change was made. On the face of it, simply backing out that patch  
> would be
> perfect. I'm paranoid. I want to make sure that this is the ONLY  
> reason:
> "/* For single writes, FLUSH_STABLE is more efficient */ "
>
> It seems more accurate to say that they *aren't* more efficient, but
> rather are "safer, but slower."

They are more efficient from the point of view that only a single RPC
is needed for a complete write.  The WRITE and COMMIT are done in a
single request.

I don't think the issue here is whether the write is stable, but
whether the NFS client has to block the application for it.  A stable
write that is asynchronous to the application is faster than
WRITE+COMMIT.

So it's not "stable" that is holding you up, it's "synchronous."
Those are orthogonal concepts.
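The RPC-count point can be sketched as a toy model (a sketch with
assumed chunking, not actual client code):

```python
def rpc_count(nbytes, wsize, stable):
    """Count client RPCs needed to flush nbytes: one WRITE per
    wsize-sized chunk, plus a trailing COMMIT only when the WRITEs
    were sent unstable."""
    writes = -(-nbytes // wsize)        # ceiling division
    return writes + (0 if stable else 1)

# For a single small write, stable really is fewer RPCs (1 vs 2) --
# the "efficiency" the comment in the patch refers to.
assert rpc_count(4096, 32768, stable=True) == 1
assert rpc_count(4096, 32768, stable=False) == 2
```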

> I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4
> kernel,

Nope, RHEL 4 is 2.6.9.  RHEL 3 is 2.4.20-ish.

> and SLES 9 is based on something in the same ballpark. And our
> customers see problems when they go to SLES 10/RHEL 5 from the prior  
> major
> distro version.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 20:28 ` Chuck Lever
@ 2009-04-30 20:41   ` Peter Staubach
  2009-04-30 21:13     ` Chuck Lever
  2009-04-30 21:23     ` Trond Myklebust
  0 siblings, 2 replies; 94+ messages in thread
From: Peter Staubach @ 2009-04-30 20:41 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Brian R Cowan, linux-nfs

Chuck Lever wrote:
>
> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>
>> Hello all,
>>
>> This is my first post, so please be gentle.... I have been working
>> with a
>> customer who is attempting to build their product in ClearCase dynamic
>> views on Linux. When they went from Red hat Enterprise Linux 4
>> (update 5)
>> to Red Hat Enterprise Linux 5 (Update 2), their build performance
>> degraded
>> dramatically. When troubleshooting the issue, we noticed that links on
>> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even
>> though
>> the storage we were writing to was EXPLICITLY mounted async. (This made
>> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>>
>> On consultation with some internal resources, we found this change in
>> the
>> 2.6 kernel:
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>
>>
>> In here it looks like the NFS client is forcing sync writes any time a
>> write of less than the NFS write size occurs. We tested this
>> hypothesis by
>> setting the write size to 2KB. The "STABLE" writes went away and link
>> times came back down out of the stratosphere. We built a modified kernel
>> based on the RHEL 5.2 kernel (that ONLY backed out of this change)
>> and we
>> got a 33% improvement in overall build speeds. In my case, I see almost
>> identical build times between the 2 OS's when we use this modified
>> kernel
>> on RHEL 5.
>>
>> Now, why am I posing this to the list? I need to understand *why* that
>> change was made. On the face of it, simply backing out that patch
>> would be
>> perfect. I'm paranoid. I want to make sure that this is the ONLY reason:
>> "/* For single writes, FLUSH_STABLE is more efficient */ "
>>
>> It seems more accurate to say that they *aren't* more efficient, but
>> rather are "safer, but slower."
>
> They are more efficient from the point of view that only a single RPC
> is needed for a complete write.  The WRITE and COMMIT are done in a
> single request.
>
> I don't think the issue here is whether the write is stable, but it is
> whether the NFS client has to block the application for it.  A stable
> write that is asynchronous to the application is faster than
> WRITE+COMMIT.
>
> So it's not "stable" that is holding you up, it's "synchronous." 
> Those are orthogonal concepts.
>

Actually, the "stable" part can be a killer.  It depends upon
why and when nfs_flush_inode() is invoked.

I did quite a bit of work on this aspect of RHEL-5 and discovered
that this particular code was leading to some serious slowdowns.
The server would end up doing a very slow FILE_SYNC write when
all that was really required was an UNSTABLE write at the time.

Did anyone actually measure this optimization and if so, what
were the numbers?

    Thanx...

       ps


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 20:41   ` Peter Staubach
@ 2009-04-30 21:13     ` Chuck Lever
  2009-04-30 21:23     ` Trond Myklebust
  1 sibling, 0 replies; 94+ messages in thread
From: Chuck Lever @ 2009-04-30 21:13 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs


On Apr 30, 2009, at 4:41 PM, Peter Staubach wrote:

> Chuck Lever wrote:
>>
>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>>
>>> Hello all,
>>>
>>> This is my first post, so please be gentle.... I have been working
>>> with a
>>> customer who is attempting to build their product in ClearCase  
>>> dynamic
>>> views on Linux. When they went from Red hat Enterprise Linux 4
>>> (update 5)
>>> to Red Hat Enterprise Linux 5 (Update 2), their build performance
>>> degraded
>>> dramatically. When troubleshooting the issue, we noticed that  
>>> links on
>>> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even
>>> though
>>> the storage we were writing to was EXPLICITLY mounted async. (This  
>>> made
>>> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>>>
>>> On consultation with some internal resources, we found this change  
>>> in
>>> the
>>> 2.6 kernel:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>>
>>>
>>> In here it looks like the NFS client is forcing sync writes any  
>>> time a
>>> write of less than the NFS write size occurs. We tested this
>>> hypothesis by
>>> setting the write size to 2KB. The "STABLE" writes went away and  
>>> link
>>> times came back down out of the stratosphere. We built a modified  
>>> kernel
>>> based on the RHEL 5.2 kernel (that ONLY backed out of this change)
>>> and we
>>> got a 33% improvement in overall build speeds. In my case, I see  
>>> almost
>>> identical build times between the 2 OS's when we use this modified
>>> kernel
>>> on RHEL 5.
>>>
>>> Now, why am I posing this to the list? I need to understand *why*  
>>> that
>>> change was made. On the face of it, simply backing out that patch
>>> would be
>>> perfect. I'm paranoid. I want to make sure that this is the ONLY  
>>> reason:
>>> "/* For single writes, FLUSH_STABLE is more efficient */ "
>>>
>>> It seems more accurate to say that they *aren't* more efficient, but
>>> rather are "safer, but slower."
>>
>> They are more efficient from the point of view that only a single RPC
>> is needed for a complete write.  The WRITE and COMMIT are done in a
>> single request.
>>
>> I don't think the issue here is whether the write is stable, but it  
>> is
>> whether the NFS client has to block the application for it.  A stable
>> write that is asynchronous to the application is faster than
>> WRITE+COMMIT.
>>
>> So it's not "stable" that is holding you up, it's "synchronous."
>> Those are orthogonal concepts.
>>
>
> Actually, the "stable" part can be a killer.  It depends upon
> why and when nfs_flush_inode() is invoked.
>
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.

If the client is asking for FILE_SYNC when it doesn't need the COMMIT,  
then yes, that would hurt performance.

> Did anyone actually measure this optimization and if so, what
> were the numbers?
>
>    Thanx...
>
>       ps

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 20:41   ` Peter Staubach
  2009-04-30 21:13     ` Chuck Lever
@ 2009-04-30 21:23     ` Trond Myklebust
  2009-05-01 16:39       ` Brian R Cowan
       [not found]       ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  1 sibling, 2 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-04-30 21:23 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Chuck Lever, Brian R Cowan, linux-nfs

On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> Chuck Lever wrote:
> >
> > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> >>
> Actually, the "stable" part can be a killer.  It depends upon
> why and when nfs_flush_inode() is invoked.
> 
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.
> 
> Did anyone actually measure this optimization and if so, what
> were the numbers?

As usual, the optimisation is workload dependent. The main type of
workload we're targeting with this patch is the app that opens a file,
writes < 4k and then closes the file. For that case, it's a no-brainer
that you don't need to split a single stable write into an unstable +
a commit.

So if the application isn't doing the above type of short write followed
by close, then exactly what is causing a flush to disk in the first
place? Ordinarily, the client will try to cache writes until the cows
come home (or until the VM tells it to reclaim memory - whichever comes
first)...
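The open/write/close pattern Trond describes can be sketched with
ordinary file calls (local files stand in for an NFS mount here; on
NFS, the close is the flush point under close-to-open semantics):

```python
import os
import tempfile

def write_and_close(nbytes):
    """Open, write < 4 KiB, close -- the workload the patch targets.
    On NFS the close-to-open flush covers a single sub-wsize write,
    so one FILE_SYNC WRITE can replace a WRITE + COMMIT pair."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * nbytes)     # single small write
    finally:
        os.close(fd)                    # flush happens here on NFS
    size = os.path.getsize(path)
    os.unlink(path)
    return size

assert write_and_close(1000) == 1000
```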

Cheers
  Trond



* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-04-30 21:23     ` Trond Myklebust
@ 2009-05-01 16:39       ` Brian R Cowan
       [not found]       ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  1 sibling, 0 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-05-01 16:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

linux-nfs-owner@vger.kernel.org wrote on 04/30/2009 05:23:07 PM:

> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable + a
> commit.

The app impacted most is the gcc linker... I tested by building Samba,
then by linking smbd. We think the linker memory-maps the output file;
we don't really know for sure, since I don't know the gcc source any
more than I'm an expert in the Linux NFS implementation. In any event,
the linker is doing all kinds of lseeks and writes as it builds the
output executable from the various .o files being linked in. All of
those writes are slowed down by this change. If we were closing the
file afterwards, that would be one thing, but we're not...
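The access pattern being described might look roughly like this sketch
(the section names and offsets are invented for illustration; a local
temp file stands in for the NFS-resident output):

```python
import os
import tempfile

def link_like_output(sections):
    """Seek-and-write pattern like the linker's: scattered small
    writes into one file held open throughout, no close in between --
    so every page the VM flushes is a lone, sub-wsize (hence stable)
    write under the new heuristic."""
    fd, path = tempfile.mkstemp()
    for offset, payload in sorted(sections.items()):
        os.lseek(fd, offset, os.SEEK_SET)   # jump to section offset
        os.write(fd, payload)               # small partial-page write
    os.close(fd)
    size = os.path.getsize(path)
    os.unlink(path)
    return size

# Hypothetical layout: file size is the last offset plus its payload.
sections = {0: b"ELF header", 4096: b".text", 8192: b".data"}
assert link_like_output(sections) == 8192 + len(b".data")
```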

> 
> So if the application isn't doing the above type of short write followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever comes
> first)...

We suspect it's the latter (something telling the system to flush memory) 
but chasing that looks to be a challenge...

> 
> Cheers
>   Trond
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html







* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]       ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 15:55         ` Brian R Cowan
  2009-05-29 16:46           ` Trond Myklebust
  2009-05-29 17:01           ` Chuck Lever
  0 siblings, 2 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 15:55 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

Been working this issue with Red Hat, and didn't need to go to the
list... Well, now I do... You mention that "The main type of workload
we're targetting with this patch is the app that opens a file, writes
< 4k and then closes the file." Well, it appears that this issue also
impacts flushing pages from filesystem caches.

The reason this came up in my environment is that our product's build
auditing gives the filesystem cache an interesting workout. When
ClearCase audits a build, the build places data in a few places,
including:
1) a build audit file that usually resides in /tmp. This build audit is 
essentially a log of EVERY file open/read/write/delete/rename/etc. that 
the programs called in the build script make in the clearcase "view" 
you're building in. As a result, this file can get pretty large.
2) The build outputs themselves, which in this case are being written to a 
remote storage location on a Linux or Solaris server, and
3) a file called .cmake.state, which is a local cache that is written to 
after the build script completes containing what is essentially a "Bill of 
materials" for the files created during builds in this "view."

We believe that the build audit file access is causing build output to get 
flushed out of the filesystem cache. These flushes happen *in 4k chunks.* 
This trips over this change since the cache pages appear to get flushed on 
an individual basis.
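The cost of those one-page-at-a-time flushes can be sketched with a
toy latency model (all numbers are invented for illustration; real
latencies depend on the server and network):

```python
def flush_cost(npages, per_page_stable, sync_ms=8.0, async_ms=0.5):
    """Toy latency model: a stable WRITE waits on server disk
    (sync_ms); unstable WRITEs return from server memory (async_ms)
    and share a single COMMIT at the end."""
    if per_page_stable:
        return npages * sync_ms             # every page hits disk
    return npages * async_ms + sync_ms      # stream, then one COMMIT

# Flushing 100 cached pages one-by-one as FILE_SYNC is an order of
# magnitude slower than streaming them unstable and committing once.
assert flush_cost(100, True) > 10 * flush_cost(100, False)
```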

One note: if the build outputs were going to a ClearCase view stored
on an enterprise-level NAS device, there isn't as much of an issue,
because many of these devices return from the stable write request as
soon as the data goes into the battery-backed memory disk cache on the
NAS. However, it really impacts writes to general-purpose OSes that
follow Sun's lead in how they handle "stable" writes. The truly
annoying part about this rather subtle change is that the NFS client
is specifically ignoring the client mount options, since we cannot use
the "async" mount option to turn off this behavior.




From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Peter Staubach <staubach@redhat.com>
Cc:
Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, 
linux-nfs@vger.kernel.org
Date:
04/30/2009 05:23 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
linux-nfs-owner@vger.kernel.org



On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> Chuck Lever wrote:
> >
> > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>
> >> 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2

> >>
> Actually, the "stable" part can be a killer.  It depends upon
> why and when nfs_flush_inode() is invoked.
> 
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.
> 
> Did anyone actually measure this optimization and if so, what
> were the numbers?

As usual, the optimisation is workload dependent. The main type of
workload we're targetting with this patch is the app that opens a file,
writes < 4k and then closes the file. For that case, it's a no-brainer
that you don't need to split a single stable write into an unstable + a
commit.

So if the application isn't doing the above type of short write followed
by close, then exactly what is causing a flush to disk in the first
place? Ordinarily, the client will try to cache writes until the cows
come home (or until the VM tells it to reclaim memory - whichever comes
first)...

Cheers
  Trond





* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 15:55         ` Brian R Cowan
@ 2009-05-29 16:46           ` Trond Myklebust
       [not found]             ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-05-29 17:01           ` Chuck Lever
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 16:46 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

Look... This happens when you _flush_ the file to stable storage if
there is only a single write < wsize. It isn't the business of the NFS
layer to decide when you flush the file; that's an application
decision...

Trond



On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote:
> Been working this issue with Red hat, and didn't need to go to the list... 
> Well, now I do... You mention that "The main type of workload we're 
> targetting with this patch is the app that opens a file, writes < 4k and 
> then closes the file." Well, it appears that this issue also impacts 
> flushing pages from filesystem caches.
> 
> The reason this came up in my environment is that our product's build 
> auditing gives the filesystem cache an interesting workout. When 
> ClearCase audits a build, the build places data in a few places, 
> including:
> 1) a build audit file that usually resides in /tmp. This build audit is 
> essentially a log of EVERY file open/read/write/delete/rename/etc. that 
> the programs called in the build script make in the clearcase "view" 
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being written to a 
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is written to 
> after the build script completes containing what is essentially a "Bill of 
> materials" for the files created during builds in this "view."
> 
> We believe that the build audit file access is causing build output to get 
> flushed out of the filesystem cache. These flushes happen *in 4k chunks.* 
> This trips over this change since the cache pages appear to get flushed on 
> an individual basis.
> 
> One note is that if the build outputs were going to a clearcase view 
> stored on an enterprise-level NAS device, there isn't as much of an issue 
> because many of these return from the stable write request as soon as the 
> data goes into the battery-backed memory disk cache on the NAS. However, 
> it really impacts writes to general-purpose OS's that follow Sun's lead in 
> how they handle "stable" writes. The truly annoying part about this rather 
> subtle change is that the NFS client is specifically ignoring the client 
> mount options since we cannot force the "async" mount option to turn off 
> this behavior.
> 
> 
> 
> 
> From:
> Trond Myklebust <trond.myklebust@fys.uio.no>
> To:
> Peter Staubach <staubach@redhat.com>
> Cc:
> Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, 
> linux-nfs@vger.kernel.org
> Date:
> 04/30/2009 05:23 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
> Sent by:
> linux-nfs-owner@vger.kernel.org
> 
> 
> 
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> > Chuck Lever wrote:
> > >
> > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> > >>
> > >> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> 
> > >>
> > Actually, the "stable" part can be a killer.  It depends upon
> > why and when nfs_flush_inode() is invoked.
> > 
> > I did quite a bit of work on this aspect of RHEL-5 and discovered
> > that this particular code was leading to some serious slowdowns.
> > The server would end up doing a very slow FILE_SYNC write when
> > all that was really required was an UNSTABLE write at the time.
> > 
> > Did anyone actually measure this optimization and if so, what
> > were the numbers?
> 
> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable + a
> commit.
> 
> So if the application isn't doing the above type of short write followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever comes
> first)...
> 
> Cheers
>   Trond
> 
> 
> 




* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 15:55         ` Brian R Cowan
  2009-05-29 16:46           ` Trond Myklebust
@ 2009-05-29 17:01           ` Chuck Lever
  2009-05-29 17:38             ` Brian R Cowan
  1 sibling, 1 reply; 94+ messages in thread
From: Chuck Lever @ 2009-05-29 17:01 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Trond Myklebust, linux-nfs, linux-nfs-owner, Peter Staubach


On May 29, 2009, at 11:55 AM, Brian R Cowan wrote:

> Been working this issue with Red hat, and didn't need to go to the  
> list...
> Well, now I do... You mention that "The main type of workload we're
> targetting with this patch is the app that opens a file, writes < 4k  
> and
> then closes the file." Well, it appears that this issue also impacts
> flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit  
> is
> essentially a log of EVERY file open/read/write/delete/rename/etc.  
> that
> the programs called in the build script make in the clearcase "view"
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being  
> written to a
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is  
> written to
> after the build script completes containing what is essentially a  
> "Bill of
> materials" for the files created during builds in this "view."
>
> We believe that the build audit file access is causing build output  
> to get
> flushed out of the filesystem cache. These flushes happen *in 4k  
> chunks.*
> This trips over this change since the cache pages appear to get  
> flushed on
> an individual basis.

So, are you saying that the application is flushing after every 4KB
write(2), or that the application has written a bunch of pages, and
VM/VFS on the client is doing the synchronous page flushes?  If it's
the application doing this, then you really do not want to mitigate
this by defeating the STABLE writes -- the application must have some
requirement that the data is permanent.

Unless I have misunderstood something, the previous faster behavior
was due to cheating, and put your data at risk.  I can't see how
replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would
cause such a significant performance impact.

> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an  
> issue
> because many of these return from the stable write request as soon  
> as the
> data goes into the battery-backed memory disk cache on the NAS.  
> However,
> it really impacts writes to general-purpose OS's that follow Sun's  
> lead in
> how they handle "stable" writes. The truly annoying part about this  
> rather
> subtle change is that the NFS client is specifically ignoring the  
> client
> mount options since we cannot force the "async" mount option to turn  
> off
> this behavior.

You may have a misunderstanding about what exactly "async" does.  The  
"sync" / "async" mount options control only whether the application  
waits for the data to be flushed to permanent storage.  On every file  
system I know of, they have no effect on _how_ the data is moved from  
the page cache to permanent storage.
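A one-line toy restatement of that distinction (hypothetical helper, invented purely for illustration; not a real mount(8) interface):

```python
# The "sync"/"async" mount options decide whether write(2) blocks until the
# data is durable -- not how the client later pushes pages to the server.
# Hypothetical helper for illustration only.

def write2_blocks_until_durable(mount_opts):
    """True if write(2) waits for the flush; 'async' is the usual default."""
    return "sync" in mount_opts

assert write2_blocks_until_durable({"sync", "vers=3"}) is True
assert write2_blocks_until_durable({"async", "vers=3"}) is False
```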

> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to sw_support@us.ibm.com to be sure your PMR is  
> updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <trond.myklebust@fys.uio.no>
> To:
> Peter Staubach <staubach@redhat.com>
> Cc:
> Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ 
> IBM@IBMUS,
> linux-nfs@vger.kernel.org
> Date:
> 04/30/2009 05:23 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page  
> flushing
> Sent by:
> linux-nfs-owner@vger.kernel.org
>
>
>
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
>> Chuck Lever wrote:
>>>
>>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>>>>
>>>>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>
>>>>
>> Actually, the "stable" part can be a killer.  It depends upon
>> why and when nfs_flush_inode() is invoked.
>>
>> I did quite a bit of work on this aspect of RHEL-5 and discovered
>> that this particular code was leading to some serious slowdowns.
>> The server would end up doing a very slow FILE_SYNC write when
>> all that was really required was an UNSTABLE write at the time.
>>
>> Did anyone actually measure this optimization and if so, what
>> were the numbers?
>
> As usual, the optimisation is workload dependent. The main type of
> workload we're targeting with this patch is the app that opens a  
> file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable  
> + a
> commit.
>
> So if the application isn't doing the above type of short write  
> followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever  
> comes
> first)...
>
> Cheers
>  Trond
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs"  
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]             ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 17:25               ` Brian R Cowan
  2009-05-29 17:35                 ` Trond Myklebust
  2009-05-29 17:48               ` Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Peter Staubach
  1 sibling, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 17:25 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

Ah, but I submit that the application isn't making the decision... The OS 
is. My testcase is building Samba on Linux using gcc. The gcc linker sure 
isn't deciding to flush the file. It's happily seeking/reading and 
seeking/writing with no idea what is happening under the covers. When the 
build gets audited, the cache gets flushed... No audit, no flush. The only 
apparent difference is that we have an audit file getting written to on 
the local disk. The linker has no idea it's getting audited.

I'm interested in knowing what kind of performance benefit this 
optimization is providing in small-file writes. Unless it's incredibly 
dramatic, then I really don't see why we can't do one of the following:
1) get rid of it,
2) find some way to not do it when the OS flushes filesystem cache, or
3) make the "async" mount option turn it off, or
4) create a new mount option to force the optimization on/off.

I just don't see how a single RPC saved is saving all that much time. 
Since:
 - open
 - write (unstable) <write size
 - commit
 - close
depends on the commit call to finish writing to disk, and
 - open
 - write (stable) <write size
 - close
also depends on the time taken to write the data to disk, I can't see the 
one less RPC buying that much time, other than perhaps on NAS devices.

This may reduce the server load, but this is ignoring the mount options. 
We can't turn this behavior OFF, and that's the biggest issue. I don't 
mind the small-file-write optimization itself, as long as I and my 
customers are able to CHOOSE whether the optimization is active. It boils 
down to this: when I *categorically* say that the mount is async, the OS 
should pay attention. There are cases when the OS doesn't know best. If 
the OS always knew what would work best, there wouldn't be nearly as many 
mount options as there are now.




From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, 
linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com>
Date:
05/29/2009 12:47 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
linux-nfs-owner@vger.kernel.org



Look... This happens when you _flush_ the file to stable storage if
there is only a single write < wsize. It isn't the business of the NFS
layer to decide when you flush the file; that's an application
decision...

Trond






^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:25               ` Brian R Cowan
@ 2009-05-29 17:35                 ` Trond Myklebust
       [not found]                   ` <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 17:35 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> Ah, but I submit that the application isn't making the decision... The OS 
> is. My testcase is building Samba on Linux using gcc. The gcc linker sure 
> isn't deciding to flush the file. It's happily seeking/reading and 
> seeking/writing with no idea what is happening under the covers. When the 
> build gets audited, the cache gets flushed... No audit, no flush. The only 
> apparent difference is that we have an audit file getting written to on 
> the local disk. The linker has no idea it's getting audited.
> 
> I'm interested in knowing what kind of performance benefit this 
> optimization is providing in small-file writes. Unless it's incredibly 
> dramatic, then I really don't see why we can't do one of the following:
> 1) get rid of it,
> 2) find some way to not do it when the OS flushes filesystem cache, or
> 3) make the "async" mount option turn it off, or
> 4) create a new mount option to force the optimization on/off.
> 
> I just don't see how a single RPC saved is saving all that much time. 
> Since:
>  - open
>  - write (unstable) <write size
>  - commit
>  - close
> Depends on the commit call to finish writing to disk, and
>  - open
>  - write (stable) <write size
>  - close
> Also depends on the time taken to writ ethe data to disk, I can't see the 
> one less RPC buying that much time, other than perhaps on NAS devices.
> 
> This may reduce the server load, but this is ignoring the mount options. 
> We can't turn this behavior OFF, and that's the biggest issue. I don't 
> mind the small-file-write optimization itself, as long as I and my 
> customers are able to CHOOSE whether the optimization is active. It boils 
> down to this: when I *categorically* say that the mount is async, the OS 
> should pay attention. There are cases when the OS doesn't know best. If 
> the OS always knew what would work best, there wouldn't be nearly as many 
> mount options as there are now.

What are you smoking? There is _NO_DIFFERENCE_ between what the server
is supposed to do when sent a single stable write, and what it is
supposed to do when sent an unstable write plus a commit. BOTH cases are
supposed to result in the server writing the data to stable storage
before the stable write / commit is allowed to return a reply.

The extra RPC round trip (+ parsing overhead ++++) due to the commit
call is the _only_ difference.

No, you can't turn this behaviour off (unless you use the 'async' export
option on a Linux server), but there is no difference there between the
stable write and the unstable write + commit.

THEY BOTH RESULT IN THE SAME BEHAVIOUR.

Trond





^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:01           ` Chuck Lever
@ 2009-05-29 17:38             ` Brian R Cowan
  2009-05-29 17:42               ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 17:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, linux-nfs-owner, Peter Staubach, Trond Myklebust

> You may have a misunderstanding about what exactly "async" does.  The 
> "sync" / "async" mount options control only whether the application 
> waits for the data to be flushed to permanent storage.  On every file 
> system I know of, they have no effect on _how_ the data is moved from 
> the page cache to permanent storage.

The problem is that the client change seems to cause the application to 
stop until this stable write completes... What is interesting is that it's 
not always a write operation that the linker gets stuck on. Our best 
hypothesis -- from correlating times in strace and tcpdump traces -- is 
that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* 
system calls on the output file (that is opened for read/write). We THINK 
the read call triggers a FILE_SYNC write if the page is dirty...and that 
is why the read calls are taking so long. Seeing writes happening when the 
app is waiting for a read is odd to say the least... (In my test, there is 
nothing else running on the Virtual machines, so the only thing that could 
be triggering the filesystem activity is the build test...)




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:38             ` Brian R Cowan
@ 2009-05-29 17:42               ` Trond Myklebust
       [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 17:42 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach


Yes. If the page is dirty, but not up to date, then it needs to be
cleaned before you can overwrite the contents with the results of a
fresh read.
That means flushing the data to disk... Which again means doing either a
stable write or an unstable write+commit. The former is more efficient
than the latter, because it accomplishes the exact same work in a single
RPC call.

Trond
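
Trond's single-RPC point, and the per-page flushing Brian reports, can
both be seen in a toy model. This is only a sketch with assumed wsize
(32 KiB) and page (4 KiB) sizes, counting RPCs and server disk syncs;
it is not kernel code:

```python
# Toy model (not kernel code) of the two flush strategies discussed
# above.  Assumes wsize = 32 KiB and 4 KiB pages; numbers illustrative.

WSIZE = 32 * 1024
PAGE = 4 * 1024

def flush_stable(dirty_bytes):
    """One FILE_SYNC WRITE per wsize chunk; each RPC waits for disk."""
    rpcs = -(-dirty_bytes // WSIZE)            # ceiling division
    return {"rpcs": rpcs, "disk_syncs": rpcs}

def flush_unstable_commit(dirty_bytes):
    """UNSTABLE WRITEs plus one COMMIT; only the COMMIT hits disk."""
    writes = -(-dirty_bytes // WSIZE)
    return {"rpcs": writes + 1, "disk_syncs": 1}

# The case the patch targets: one short write, then close.
print(flush_stable(2 * 1024))           # {'rpcs': 1, 'disk_syncs': 1}
print(flush_unstable_commit(2 * 1024))  # {'rpcs': 2, 'disk_syncs': 1}

# The case in this thread: 1 MiB flushed one 4 KiB page at a time,
# so the single-write heuristic fires once per page.
pages = (1024 * 1024) // PAGE
print(sum(flush_stable(PAGE)["disk_syncs"] for _ in range(pages)))  # 256
```

For one short write the stable path really does save an RPC; it is only
when the client flushes many pages individually that the per-flush disk
sync dominates.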

> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>  
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>  
> 
> Please be sure to update your PMR using ESR at 
> http://www-306.ibm.com/software/support/probsub.html or cc all 
> correspondence to sw_support@us.ibm.com to be sure your PMR is updated in 
> case I am not available.
> 
> 
> 
> From:
> Chuck Lever <chuck.lever@oracle.com>
> To:
> Brian R Cowan/Cupertino/IBM@IBMUS
> Cc:
> Trond Myklebust <trond.myklebust@fys.uio.no>, linux-nfs@vger.kernel.org, 
> linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com>
> Date:
> 05/29/2009 01:02 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
> Sent by:
> linux-nfs-owner@vger.kernel.org
> 
> 
> 
> 
> On May 29, 2009, at 11:55 AM, Brian R Cowan wrote:
> 
> > Been working this issue with Red hat, and didn't need to go to the 
> > list...
> > Well, now I do... You mention that "The main type of workload we're
> > targeting with this patch is the app that opens a file, writes < 4k 
> > and
> > then closes the file." Well, it appears that this issue also impacts
> > flushing pages from filesystem caches.
> >
> > The reason this came up in my environment is that our product's build
> > auditing gives the filesystem cache an interesting workout. When
> > ClearCase audits a build, the build places data in a few places,
> > including:
> > 1) a build audit file that usually resides in /tmp. This build audit 
> > is
> > essentially a log of EVERY file open/read/write/delete/rename/etc. 
> > that
> > the programs called in the build script make in the clearcase "view"
> > you're building in. As a result, this file can get pretty large.
> > 2) The build outputs themselves, which in this case are being 
> > written to a
> > remote storage location on a Linux or Solaris server, and
> > 3) a file called .cmake.state, which is a local cache that is 
> > written to
> > after the build script completes containing what is essentially a 
> > "Bill of
> > materials" for the files created during builds in this "view."
> >
> > We believe that the build audit file access is causing build output 
> > to get
> > flushed out of the filesystem cache. These flushes happen *in 4k 
> > chunks.*
> > This trips over this change since the cache pages appear to get 
> > flushed on
> > an individual basis.
> 
> So, are you saying that the application is flushing after every 4KB 
> write(2), or that the application has written a bunch of pages, and VM/ 
> VFS on the client is doing the synchronous page flushes?  If it's the 
> application doing this, then you really do not want to mitigate this 
> by defeating the STABLE writes -- the application must have some 
> requirement that the data is permanent.
> 
> Unless I have misunderstood something, the previous faster behavior 
> was due to cheating, and put your data at risk.  I can't see how 
> replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would 
> cause such a significant performance impact.
> 
> > One note is that if the build outputs were going to a clearcase view
> > stored on an enterprise-level NAS device, there isn't as much of an 
> > issue
> > because many of these return from the stable write request as soon 
> > as the
> > data goes into the battery-backed memory disk cache on the NAS. 
> > However,
> > it really impacts writes to general-purpose OS's that follow Sun's 
> > lead in
> > how they handle "stable" writes. The truly annoying part about this 
> > rather
> > subtle change is that the NFS client is specifically ignoring the 
> > client
> > mount options since we cannot force the "async" mount option to turn 
> > off
> > this behavior.
> 
> You may have a misunderstanding about what exactly "async" does.  The 
> "sync" / "async" mount options control only whether the application 
> waits for the data to be flushed to permanent storage.  They have no 
> effect, on any file system I know of, on _how_ the data is 
> moved from the page cache to permanent storage.
> 
> >
> >
> >
> > From:
> > Trond Myklebust <trond.myklebust@fys.uio.no>
> > To:
> > Peter Staubach <staubach@redhat.com>
> > Cc:
> > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ 
> > IBM@IBMUS,
> > linux-nfs@vger.kernel.org
> > Date:
> > 04/30/2009 05:23 PM
> > Subject:
> > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page 
> > flushing
> > Sent by:
> > linux-nfs-owner@vger.kernel.org
> >
> >
> >
> > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> >> Chuck Lever wrote:
> >>>
> >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>>>
> >>>>
> > 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> 
> >
> >>>>
> >> Actually, the "stable" part can be a killer.  It depends upon
> >> why and when nfs_flush_inode() is invoked.
> >>
> >> I did quite a bit of work on this aspect of RHEL-5 and discovered
> >> that this particular code was leading to some serious slowdowns.
> >> The server would end up doing a very slow FILE_SYNC write when
> >> all that was really required was an UNSTABLE write at the time.
> >>
> >> Did anyone actually measure this optimization and if so, what
> >> were the numbers?
> >
> > As usual, the optimisation is workload dependent. The main type of
> > workload we're targeting with this patch is the app that opens a 
> > file,
> > writes < 4k and then closes the file. For that case, it's a no-brainer
> > that you don't need to split a single stable write into an unstable 
> > + a
> > commit.
> >
> > So if the application isn't doing the above type of short write 
> > followed
> > by close, then exactly what is causing a flush to disk in the first
> > place? Ordinarily, the client will try to cache writes until the cows
> > come home (or until the VM tells it to reclaim memory - whichever 
> > comes
> > first)...
> >
> > Cheers
> >  Trond
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" 
> > in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
> 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 17:47                   ` Chuck Lever
  2009-05-29 18:15                     ` Trond Myklebust
  2009-05-29 17:51                   ` Peter Staubach
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 94+ messages in thread
From: Chuck Lever @ 2009-05-29 17:47 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach


On May 29, 2009, at 1:42 PM, Trond Myklebust wrote:

> On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
>>> You may have a misunderstanding about what exactly "async" does.   
>>> The
>>> "sync" / "async" mount options control only whether the application
>>> waits for the data to be flushed to permanent storage.  They have no
>>> effect, on any file system I know of, on _how_ the data is
>>> moved from the page cache to permanent storage.
>>
>> The problem is that the client change seems to cause the  
>> application to
>> stop until this stable write completes... What is interesting is  
>> that it's
>> not always a write operation that the linker gets stuck on. Our best
>> hypothesis -- from correlating times in strace and tcpdump traces  
>> -- is
>> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by  
>> *read()*
>> system calls on the output file (that is opened for read/write). We  
>> THINK
>> the read call triggers a FILE_SYNC write if the page is dirty...and  
>> that
>> is why the read calls are taking so long. Seeing writes happening  
>> when the
>> app is waiting for a read is odd to say the least... (In my test,  
>> there is
>> nothing else running on the Virtual machines, so the only thing  
>> that could
>> be triggering the filesystem activity is the build test...)
>
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing  
> either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.

It might be prudent to flush the whole file when such a dirty page is  
discovered to get the benefit of write coalescing.

> Trond

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]             ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-05-29 17:25               ` Brian R Cowan
@ 2009-05-29 17:48               ` Peter Staubach
  2009-05-29 18:21                 ` Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: Peter Staubach @ 2009-05-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner

Trond Myklebust wrote:
> Look... This happens when you _flush_ the file to stable storage if
> there is only a single write < wsize. It isn't the business of the NFS
> layer to decide when you flush the file; that's an application
> decision...
>
>   

I think one easy way to show why this optimization is
not quite what we would all like (why there being only a
single write _now_ isn't a sufficient test) is to write a
block of a file and then read it back.  Compilers and
linkers do this during their random access to the file
being created, and I would guess that the audit mechanism
Brian has referred to does the same sort of thing.

       ps

ps. Why do we flush dirty pages before they can be read?
I am not even clear why we care about waiting for an
already existing flush to be completed before using the
page to satisfy a read system call.
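
One plausible answer to the postscript can be sketched as a tiny page
state machine.  The class and method names below are hypothetical; this
models the behavior described in the thread, not the actual VM code:

```python
# Hypothetical model of a single 4 KiB cached page: a partial write
# leaves the page dirty but not up to date, and a later read of any
# other part of that page must flush the dirty bytes before it can
# fill the whole page from the server (else the fill would clobber
# the unwritten data).

class CachedPage:
    def __init__(self):
        self.uptodate = False   # whole page contents known?
        self.dirty = None       # (offset, length) of unwritten data

    def write(self, off, length):
        self.dirty = (off, length)   # partial write: dirty, not uptodate

    def read(self, off, length):
        ops = []
        if not self.uptodate:
            if self.dirty:
                # This is the flush the read() call gets stuck behind.
                ops.append("WRITE(FILE_SYNC)")
                self.dirty = None
            ops.append("READ")       # now safe to fill the whole page
            self.uptodate = True
        return ops

p = CachedPage()
p.write(0, 100)           # linker appends 100 bytes to the page
print(p.read(1000, 200))  # ['WRITE(FILE_SYNC)', 'READ']
print(p.read(0, 100))     # [] -- page is now up to date, no RPCs
```

In this model a read that stays within already-cached, up-to-date pages
triggers nothing; only the dirty-but-not-uptodate page forces the write.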



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-05-29 17:47                   ` Chuck Lever
@ 2009-05-29 17:51                   ` Peter Staubach
  2009-05-29 18:25                     ` Brian R Cowan
  2009-05-29 18:43                     ` Trond Myklebust
  2009-05-29 17:55                   ` Brian R Cowan
  2009-05-29 17:57                   ` Trond Myklebust
  3 siblings, 2 replies; 94+ messages in thread
From: Peter Staubach @ 2009-05-29 17:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner

Trond Myklebust wrote:
>
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.

In the normal case, we aren't overwriting the contents with the
results of a fresh read.  We are simply going to return the
current contents of the page.  Given that, why is the normal
data cache consistency mechanism, based on the attribute cache,
not sufficient?

    Thanx...

       ps


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-05-29 17:47                   ` Chuck Lever
  2009-05-29 17:51                   ` Peter Staubach
@ 2009-05-29 17:55                   ` Brian R Cowan
  2009-05-29 18:07                     ` Trond Myklebust
  2009-05-29 17:57                   ` Trond Myklebust
  3 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 17:55 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.

I suspect that the COMMIT RPCs are issued somewhere other than in the 
flush itself. If the "write + commit" operation happened in exactly that 
manner, then the change in the git commit at the beginning of this thread 
*would not have impacted client performance*. I can demonstrate -- at 
will -- that it does impact performance. So there is something that keeps 
track of the number of writes and issues the commits without slowing down 
the application. This git change bypasses that and degrades the linker 
performance.
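
A back-of-envelope model of this observation, with made-up latency
numbers (the RTT and disk-sync times are assumptions): the benefit of
UNSTABLE writes is not fewer RPCs but that the application does not
block on each one, since the single deferred COMMIT can be issued off
the application's critical path:

```python
# Assumed, illustrative latencies; only the ratio matters.
RTT = 1          # ms per RPC round trip (assumption)
DISK_SYNC = 5    # ms for the server to reach stable storage (assumption)

def app_wait_stable(pages):
    """App blocks on a FILE_SYNC WRITE for every 4 KiB page flushed."""
    return pages * (RTT + DISK_SYNC)

def app_wait_unstable(pages):
    """UNSTABLE WRITEs ack before disk; one COMMIT pays the sync cost."""
    return pages * RTT + (RTT + DISK_SYNC)

print(app_wait_stable(256))    # 1536 ms of application wait
print(app_wait_unstable(256))  # 262 ms
print(app_wait_stable(1), app_wait_unstable(1))  # 6 7
```

Note that for a single flush the FILE_SYNC write really is cheaper in
this model (6 ms vs 7 ms), which is the case the patch targeted; the
regression appears only when many pages are flushed individually.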

> > However,
> > it really impacts writes to general-purpose OS's that follow Sun's 
> > lead in
> > how they handle "stable" writes. The truly annoying part about this 
> > rather
> > subtle change is that the NFS client is specifically ignoring the 
> > client
> > mount options since we cannot force the "async" mount option to turn 
> > off
> > this behavior.
> 
> You may have a misunderstanding about what exactly "async" does.  The 
> "sync" / "async" mount options control only whether the application 
> waits for the data to be flushed to permanent storage.  They have no 
> effect on any file system I know of _how_ specifically the data is 
> moved from the page cache to permanent storage.
> 
> >
> >
> >
> > From:
> > Trond Myklebust <trond.myklebust@fys.uio.no>
> > To:
> > Peter Staubach <staubach@redhat.com>
> > Cc:
> > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ 
> > IBM@IBMUS,
> > linux-nfs@vger.kernel.org
> > Date:
> > 04/30/2009 05:23 PM
> > Subject:
> > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page 
> > flushing
> > Sent by:
> > linux-nfs-owner@vger.kernel.org
> >
> >
> >
> > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> >> Chuck Lever wrote:
> >>>
> >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>>>
> >>>>
> > 
> 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2

> 
> >
> >>>>
> >> Actually, the "stable" part can be a killer.  It depends upon
> >> why and when nfs_flush_inode() is invoked.
> >>
> >> I did quite a bit of work on this aspect of RHEL-5 and discovered
> >> that this particular code was leading to some serious slowdowns.
> >> The server would end up doing a very slow FILE_SYNC write when
> >> all that was really required was an UNSTABLE write at the time.
> >>
> >> Did anyone actually measure this optimization and if so, what
> >> were the numbers?
> >
> > As usual, the optimisation is workload dependent. The main type of
> > workload we're targeting with this patch is the app that opens a 
> > file,
> > writes < 4k and then closes the file. For that case, it's a no-brainer
> > that you don't need to split a single stable write into an unstable 
> > + a
> > commit.
> >
> > So if the application isn't doing the above type of short write 
> > followed
> > by close, then exactly what is causing a flush to disk in the first
> > place? Ordinarily, the client will try to cache writes until the cows
> > come home (or until the VM tells it to reclaim memory - whichever 
> > comes
> > first)...
> >
> > Cheers
> >  Trond
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" 
> > in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
> 
> 
> 
> 
> 





^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
                                     ` (2 preceding siblings ...)
  2009-05-29 17:55                   ` Brian R Cowan
@ 2009-05-29 17:57                   ` Trond Myklebust
  3 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 17:57 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 13:42 -0400, Trond Myklebust wrote:
> On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> > > You may have a misunderstanding about what exactly "async" does.  The 
> > > "sync" / "async" mount options control only whether the application 
> > > waits for the data to be flushed to permanent storage.  They have no 
> > > effect on any file system I know of _how_ specifically the data is 
> > > moved from the page cache to permanent storage.
> > 
> > The problem is that the client change seems to cause the application to 
> > stop until this stable write completes... What is interesting is that it's 
> > not always a write operation that the linker gets stuck on. Our best 
> > hypothesis -- from correlating times in strace and tcpdump traces -- is 
> > that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* 
> > system calls on the output file (that is opened for read/write). We THINK 
> > the read call triggers a FILE_SYNC write if the page is dirty...and that 
> > is why the read calls are taking so long. Seeing writes happening when the 
> > app is waiting for a read is odd to say the least... (In my test, there is 
> > nothing else running on the Virtual machines, so the only thing that could 
> > be triggering the filesystem activity is the build test...)
> 
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
> 
> Trond

In fact, I suspect your real gripe is rather with the logic that marks a
page as being up to date (i.e. whether or not they require a READ call).
I suggest trying kernel 2.6.27 or newer, and seeing if the changes that
are in those kernels fix your problem.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:55                   ` Brian R Cowan
@ 2009-05-29 18:07                     ` Trond Myklebust
       [not found]                       ` <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 18:07 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote:
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
> 
> I suspect that the COMMIT RPC's are done somewhere other than in the flush 
> itself. If the "write + commit" operation was happening in that exact 
> manner, then the change in the git at the beginning of this thread *would 
> not have impacted client performance*. I can demonstrate -- at will -- 
> that it does impact performance. So, there is something that keeps track 
> of the number of writes and issues the commits without slowing down the 
> application. This git change bypasses that and degrades the linker 
> performance.

If the server gives slower performance for a single stable write, vs.
the same unstable write + commit, then you are demonstrating that the
server is seriously _broken_.

The only other explanation, is if the client prior to that patch being
applied was somehow failing to send out the COMMIT. If so, then the
client was broken, and the patch is a fix that results in correct
behaviour. That would mean that the rest of the client flush code is
probably still broken, but at least the nfs_wb_page() is now correct.

Those are the only 2 options.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:47                   ` Chuck Lever
@ 2009-05-29 18:15                     ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 18:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 13:47 -0400, Chuck Lever wrote:
> On May 29, 2009, at 1:42 PM, Trond Myklebust wrote:
> 
> > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> >>> You may have a misunderstanding about what exactly "async" does.   
> >>> The
> >>> "sync" / "async" mount options control only whether the application
> >>> waits for the data to be flushed to permanent storage.  They have no
> >>> effect on any file system I know of _how_ specifically the data is
> >>> moved from the page cache to permanent storage.
> >>
> >> The problem is that the client change seems to cause the  
> >> application to
> >> stop until this stable write completes... What is interesting is  
> >> that it's
> >> not always a write operation that the linker gets stuck on. Our best
> >> hypothesis -- from correlating times in strace and tcpdump traces  
> >> -- is
> >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by  
> >> *read()*
> >> system calls on the output file (that is opened for read/write). We  
> >> THINK
> >> the read call triggers a FILE_SYNC write if the page is dirty...and  
> >> that
> >> is why the read calls are taking so long. Seeing writes happening  
> >> when the
> >> app is waiting for a read is odd to say the least... (In my test,  
> >> there is
> >> nothing else running on the Virtual machines, so the only thing  
> >> that could
> >> be triggering the filesystem activity is the build test...)
> >
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing  
> > either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
> 
> It might be prudent to flush the whole file when such a dirty page is  
> discovered to get the benefit of write coalescing.

There are very few workloads where that will help. You basically have to
be modifying the end of a page that has not previously been read in (so
is not already marked up to date) and then writing into the beginning of
the next page, which must also be not up to date.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                       ` <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 18:18                         ` Brian R Cowan
  2009-05-29 18:29                           ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 18:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

There is a third option, that the COMMIT calls are not coming from the 
same thread of execution as the write call. The symptoms would seem 
to bear that out. As would the fact that the performance degradation 
occurs both when the server is Linux itself and when it is Solaris (any 
NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it 
would be unusual if they are both broken the same way. The Linux NFS FAQ 
says:

-----------------------
* NFS Version 3 introduces the concept of "safe asynchronous writes." A 
Version 3 client can specify that the server is allowed to reply before it 
has saved the requested data to disk, permitting the server to gather 
small NFS write operations into a single efficient disk write operation. A 
Version 3 client can also specify that the data must be written to disk 
before the server replies, just like a Version 2 write. The client 
specifies the type of write by setting the stable_how field in the 
arguments of each write operation to UNSTABLE to request a safe 
asynchronous write, and FILE_SYNC for an NFS Version 2 style write.

Servers indicate whether the requested data is permanently stored by 
setting a corresponding field in the response to each NFS write operation. 
A server can respond to an UNSTABLE write request with an UNSTABLE reply 
or a FILE_SYNC reply, depending on whether or not the requested data 
resides on permanent storage yet. An NFS protocol-compliant server must 
respond to a FILE_SYNC request only with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous write 
has been written onto permanent storage using a new operation available in 
Version 3 called a COMMIT. Servers do not send a response to a COMMIT 
operation until all data specified in the request has been written to 
permanent storage. NFS Version 3 clients must protect buffered data that 
has been written using a safe asynchronous write but not yet committed. If 
a server reboots before a client has sent an appropriate COMMIT, the 
server can reply to the eventual COMMIT request in a way that forces the 
client to resend the original write operation. Version 3 clients use 
COMMIT operations when flushing safe asynchronous writes to the server 
during a close(2) or fsync(2) system call, or when encountering memory 
pressure. 
-----------------------

Now, what happens in the client when the server comes back with the 
UNSTABLE reply?







^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:48               ` Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Peter Staubach
@ 2009-05-29 18:21                 ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 18:21 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner

On Fri, 2009-05-29 at 13:48 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
> > Look... This happens when you _flush_ the file to stable storage if
> > there is only a single write < wsize. It isn't the business of the NFS
> > layer to decide when you flush the file; that's an application
> > decision...
> >
> >   
> 
> I think that one easy way to show why this optimization is
> not quite what we would all like, why there only being a
> single write _now_ isn't quite sufficient, is to write a
> block of a file and then read it back.  Things like
> compilers and linkers might do this during their random
> access to the file being created.  I would guess that this
> audit thing that Brian has referred to does the same sort
> of thing.
> 
>        ps
> 
> ps. Why do we flush dirty pages before they can be read?
> I am not even clear why we care about waiting for an
> already existing flush to be completed before using the
> page to satisfy a read system call.

We only do this if the page cannot be marked as up to date. i.e. there
have to be parts of the page which contain valid data on the server, and
that our client hasn't read in yet, and that aren't being overwritten by
our write.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:51                   ` Peter Staubach
@ 2009-05-29 18:25                     ` Brian R Cowan
  2009-05-29 18:43                     ` Trond Myklebust
  1 sibling, 0 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 18:25 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Trond Myklebust

Peter, this is my point. The application/client-side end result is that 
we're making a read wait for a write. We already have the data we need in 
the cache, since the application is what put it in there to begin with. 

I think this is a classic "unintended consequence" that is being observed 
on SuSE 10, Red Hat 5, and I'm sure others. 

But since people using my product have only just started moving to Red Hat 
5, we're seeing more of these... There aren't too many people who build 
across NFS, not when local storage is relatively cheap, and much faster. 
But there are companies that do this so the build results are available 
even if the build host has been turned off, gone to standby/hibernate, or 
is even a virtual machine that no longer exists. The biggest problem here 
is that the unavoidable extra filesystem cache load that build auditing 
creates appears to trigger the flushing. For whatever reason, those 
flushes happen in such a way as to trigger STABLE writes instead of the 
faster UNSTABLE ones. 




From:
Peter Staubach <staubach@redhat.com>
To:
Trond Myklebust <trond.myklebust@fys.uio.no>
Cc:
Brian R Cowan/Cupertino/IBM@IBMUS, Chuck Lever <chuck.lever@oracle.com>, 
linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org
Date:
05/29/2009 01:51 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing



Trond Myklebust wrote:
> On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> 
>>> You may have a misunderstanding about what exactly "async" does.  The 
>>> "sync" / "async" mount options control only whether the application 
>>> waits for the data to be flushed to permanent storage.  They have no 
>>> effect on any file system I know of _how_ specifically the data is 
>>> moved from the page cache to permanent storage.
>>> 
>> The problem is that the client change seems to cause the application to 

>> stop until this stable write completes... What is interesting is that 
it's 
>> not always a write operation that the linker gets stuck on. Our best 
>> hypothesis -- from correlating times in strace and tcpdump traces -- is 

>> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* 
>> system calls on the output file (that is opened for read/write). We 
THINK 
>> the read call triggers a FILE_SYNC write if the page is dirty...and 
that 
>> is why the read calls are taking so long. Seeing writes happening when 
the 
>> app is waiting for a read is odd to say the least... (In my test, there 
is 
>> nothing else running on the Virtual machines, so the only thing that 
could 
>> be triggering the filesystem activity is the build test...)
>> 
>
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.

In the normal case, we aren't overwriting the contents with the
results of a fresh read.  We are going to simply return the
current contents of the page.  Given this, then why is the normal
data cache consistency mechanism, based on the attribute cache,
not sufficient?

    Thanx...

       ps




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 18:18                         ` Brian R Cowan
@ 2009-05-29 18:29                           ` Trond Myklebust
       [not found]                             ` <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 18:29 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote:
> There is a third option, that the COMMIT calls are not coming from the 
> same thread of execution that the write call is. The symptoms would seem 
> to bear that out. As would the fact that the performance degradation 
> occurs both when the server is Linux itself and when it is Solaris (any 
> NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it 
> would be unusual if they are both broken the same way. The linux nfs FAQ 
> says:
> 
> -----------------------
> * NFS Version 3 introduces the concept of "safe asynchronous writes." A 
> Version 3 client can specify that the server is allowed to reply before it 
> has saved the requested data to disk, permitting the server to gather 
> small NFS write operations into a single efficient disk write operation. A 
> Version 3 client can also specify that the data must be written to disk 
> before the server replies, just like a Version 2 write. The client 
> specifies the type of write by setting the stable_how field in the 
> arguments of each write operation to UNSTABLE to request a safe 
> asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
> 
> Servers indicate whether the requested data is permanently stored by 
> setting a corresponding field in the response to each NFS write operation. 
> A server can respond to an UNSTABLE write request with an UNSTABLE reply 
> or a FILE_SYNC reply, depending on whether or not the requested data 
> resides on permanent storage yet. An NFS protocol-compliant server must 
> respond to a FILE_SYNC request only with a FILE_SYNC reply.
> 
> Clients ensure that data that was written using a safe asynchronous write 
> has been written onto permanent storage using a new operation available in 
> Version 3 called a COMMIT. Servers do not send a response to a COMMIT 
> operation until all data specified in the request has been written to 
> permanent storage. NFS Version 3 clients must protect buffered data that 
> has been written using a safe asynchronous write but not yet committed. If 
> a server reboots before a client has sent an appropriate COMMIT, the 
> server can reply to the eventual COMMIT request in a way that forces the 
> client to resend the original write operation. Version 3 clients use 
> COMMIT operations when flushing safe asynchronous writes to the server 
> during a close(2) or fsync(2) system call, or when encountering memory 
> pressure. 
> -----------------------
> 
> Now, what happens in the client when the server comes back with the 
> UNSTABLE reply?

The server cannot reply with an UNSTABLE reply to a stable write
request. See above.

As for your assertion that the COMMIT comes from some other thread of
execution. I don't see how that can change anything. Some thread,
somewhere has to wait for that COMMIT to complete. If it isn't your
application, then the same burden falls on another application or the
pdflush thread. While that may feel more interactive to you, it still
means that you are making the server + some local process do more work
(extra RPC round trip) for no good reason.

Trond

> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 17:51                   ` Peter Staubach
  2009-05-29 18:25                     ` Brian R Cowan
@ 2009-05-29 18:43                     ` Trond Myklebust
  1 sibling, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 18:43 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner

On Fri, 2009-05-29 at 13:51 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
> > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> >   
> >>> You may have a misunderstanding about what exactly "async" does.  The 
> >>> "sync" / "async" mount options control only whether the application 
> >>> waits for the data to be flushed to permanent storage.  They have no 
> >>> effect on any file system I know of _how_ specifically the data is 
> >>> moved from the page cache to permanent storage.
> >>>       
> >> The problem is that the client change seems to cause the application to 
> >> stop until this stable write completes... What is interesting is that it's 
> >> not always a write operation that the linker gets stuck on. Our best 
> >> hypothesis -- from correlating times in strace and tcpdump traces -- is 
> >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* 
> >> system calls on the output file (that is opened for read/write). We THINK 
> >> the read call triggers a FILE_SYNC write if the page is dirty...and that 
> >> is why the read calls are taking so long. Seeing writes happening when the 
> >> app is waiting for a read is odd to say the least... (In my test, there is 
> >> nothing else running on the Virtual machines, so the only thing that could 
> >> be triggering the filesystem activity is the build test...)
> >>     
> >
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
> 
> In the normal case, we aren't overwriting the contents with the
> results of a fresh read.  We are going to simply return the
> current contents of the page.  Given this, then why is the normal
> data cache consistency mechanism, based on the attribute cache,
> not sufficient?

It is. You would need to look into why the page was not marked with the
PG_uptodate flag when it was being filled. We generally do try to do
that whenever possible.

Trond
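[Editorial aside: Trond's page-state argument above can be sketched as a toy model. All of the names here (page_state, sync_rpcs_before_read, the flush modes) are invented for illustration; this is not the actual Linux NFS client code.]

```c
#include <stdbool.h>

/* Simplified page state, per the discussion above. */
struct page_state {
    bool dirty;     /* cached data not yet on the server's disk */
    bool uptodate;  /* PG_uptodate: page contents are current   */
};

enum flush_mode {
    FLUSH_STABLE,           /* one stable WRITE rpc                 */
    FLUSH_UNSTABLE_COMMIT   /* unstable WRITE + synchronous COMMIT  */
};

/* Number of synchronous RPC round trips a read() must wait for.
 * An up-to-date page is served straight from cache.  A dirty page
 * that is NOT up to date must be flushed before the fresh READ can
 * overwrite it -- and the reader waits for every one of those RPCs. */
int sync_rpcs_before_read(struct page_state *pg, enum flush_mode mode)
{
    if (pg->uptodate)
        return 0;                       /* cache hit, no RPC at all */

    int rpcs = 0;
    if (pg->dirty) {
        rpcs += (mode == FLUSH_STABLE) ? 1 : 2;
        pg->dirty = false;              /* page is clean now */
    }
    return rpcs + 1;                    /* plus the READ itself */
}
```

In both modes the reader blocks until the flush completes; a stable write merely saves one round trip per flushed page, which is the efficiency argument being made.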


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                             ` <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 20:09                               ` Brian R Cowan
  2009-05-29 20:21                                 ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 20:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

I think you missed the context of my comment... Previous to this 
4-year-old update, the writes were not sent with STABLE, this update 
forced that behavior. So, before then we sent an UNSTABLE write request. 
This would either give us back the UNSTABLE or FILE_SYNC response. My 
question is this: When the server sends back UNSTABLE, as a response to 
UNSTABLE, exactly what happens? By some chance is there a separate worker 
thread that occasionally sends COMMITs back to the server?

The performance data we have would seem to bear that out. When we backed 
out the forcing of STABLE writes, the link times came back down and the 
reads stopped waiting on the cache flushes. If, as you say, this change 
had no impact on how the client actually performs these flushes, backing 
it out would not have made links 4x faster on Red Hat 5. All we did in our 
test was back out that change...

I'm willing to discuss this issue in a conference call. I can send the 
bridge information to those who are interested, as well as the other 
people here in IBM I've been working with... At least one of them is a 
regular contributor -- Frank Filz...

=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
 
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
 

Please be sure to update your PMR using ESR at 
http://www-306.ibm.com/software/support/probsub.html or cc all 
correspondence to sw_support@us.ibm.com to be sure your PMR is updated in 
case I am not available.



From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, 
linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com>
Date:
05/29/2009 02:31 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing



On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote:
> There is a third option, that the COMMIT calls are not coming from the
> same thread of execution that the write call is. The symptoms would seem
> to bear that out. As would the fact that the performance degradation
> occurs both when the server is Linux itself and when it is Solaris (any
> NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it
> would be unusual if they are both broken the same way. The linux nfs FAQ
> says:
> 
> -----------------------
> * NFS Version 3 introduces the concept of "safe asynchronous writes." A
> Version 3 client can specify that the server is allowed to reply before it
> has saved the requested data to disk, permitting the server to gather
> small NFS write operations into a single efficient disk write operation. A
> Version 3 client can also specify that the data must be written to disk
> before the server replies, just like a Version 2 write. The client
> specifies the type of write by setting the stable_how field in the
> arguments of each write operation to UNSTABLE to request a safe
> asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
> 
> Servers indicate whether the requested data is permanently stored by
> setting a corresponding field in the response to each NFS write operation.
> A server can respond to an UNSTABLE write request with an UNSTABLE reply
> or a FILE_SYNC reply, depending on whether or not the requested data
> resides on permanent storage yet. An NFS protocol-compliant server must
> respond to a FILE_SYNC request only with a FILE_SYNC reply.
> 
> Clients ensure that data that was written using a safe asynchronous write
> has been written onto permanent storage using a new operation available in
> Version 3 called a COMMIT. Servers do not send a response to a COMMIT
> operation until all data specified in the request has been written to
> permanent storage. NFS Version 3 clients must protect buffered data that
> has been written using a safe asynchronous write but not yet committed. If
> a server reboots before a client has sent an appropriate COMMIT, the
> server can reply to the eventual COMMIT request in a way that forces the
> client to resend the original write operation. Version 3 clients use
> COMMIT operations when flushing safe asynchronous writes to the server
> during a close(2) or fsync(2) system call, or when encountering memory
> pressure.
> -----------------------
> 
> Now, what happens in the client when the server comes back with the 
> UNSTABLE reply?

The server cannot reply with an UNSTABLE reply to a stable write
request. See above.

As for your assertion that the COMMIT comes from some other thread of
execution: I don't see how that can change anything. Some thread,
somewhere has to wait for that COMMIT to complete. If it isn't your
application, then the same burden falls on another application or the
pdflush thread. While that may feel more interactive to you, it still
means that you are making the server + some local process do more work
(extra RPC round trip) for no good reason.

Trond
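[Editorial aside: the rule Trond cites ("the server cannot reply with an UNSTABLE reply to a stable write request") follows from the stable_how ordering in RFC 1813: a server may commit data *more* strongly than the client asked, never less. A minimal validity check, for illustration only:]

```c
/* stable_how values from RFC 1813, in increasing order of commitment. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* A WRITE reply is protocol-valid only if the server's committed
 * level is at least as strong as the level the client requested.
 * So UNSTABLE may be answered with UNSTABLE, DATA_SYNC or FILE_SYNC,
 * but a FILE_SYNC request must be answered FILE_SYNC. */
int reply_is_valid(enum stable_how requested, enum stable_how committed)
{
    return committed >= requested;
}
```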


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 20:09                               ` Brian R Cowan
@ 2009-05-29 20:21                                 ` Trond Myklebust
       [not found]                                   ` <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
       [not found]                                   ` <OFBB9B2C07.CC3D028B-ON852575C5. <1243634634.7155.160.camel@heimdal.trondhjem.org>
  0 siblings, 2 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 20:21 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 16:09 -0400, Brian R Cowan wrote:
> I think you missed the context of my comment... Previous to this 
> 4-year-old update, the writes were not sent with STABLE, this update 
> forced that behavior. So, before then we sent an UNSTABLE write request. 
> This would either give us back the UNSTABLE or FILE_SYNC response. My 
> question is this: When the server sends back UNSTABLE, as a response to 
> UNSTABLE, exactly what happens? By some chance is there a separate worker 
> thread that occasionally sends COMMITs back to the server?

pdflush will do it occasionally, but otherwise the COMMITs are all sent
synchronously by the thread that is flushing out the data.

In this case, the flush is done by the call to nfs_wb_page() in
nfs_readpage(), and it waits synchronously for the unstable WRITE and
the subsequent COMMIT to finish.

Note that there is no way to bypass the wait: if some other thread jumps
in and sends the COMMIT (after the unstable write has returned), then
the caller of nfs_wb_page() still has to wait for that call to complete,
and for nfs_commit_release() to mark the page as clean.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                   ` <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 21:55                                     ` Brian R Cowan
  2009-05-29 22:03                                       ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 21:55 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

So, it is possible that the commits are being sent either by pdflush or by 
our own thread, or that they happen when the file closes, giving us one or 
a handful of commits instead of hundreds or thousands. That's a big 
difference. The write RPCs still happen in RHEL 4; they just don't block 
the linker, or at least nowhere near as often. Since there is only one 
application/thread (the gcc linker) writing this file, the odds of another 
task getting stalled here are minimal at best.

This optimization definitely helps server utilization for copies of large 
numbers of small files, and I personally don't care which is the default 
(though I have a coworker who is of the opinion that async means async, 
and if he wanted sync writes, he would either mount with nfsvers=2 or 
mount sync). But we need the option to turn it off for cases where it is 
thought to cause problems. 

You mention that one can set the async export option, but 1) it may not 
always be available; and 2) it essentially tells the server to "lie" about 
write status, something that can bite us seriously if the server crashes, 
hits a disk-full error, etc. And in any event, only a particular class of 
clients is impacted by this, and making a change for *all* clients so that 
*some* work in the expected manner feels about as graceful as dynamite 
fishing...
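[Editorial aside: the economy-of-scale question the two sides are arguing can be made concrete with simple round-trip accounting. This is a sketch, not measured data:]

```c
/* Round trips to flush n dirty pages with stable writes: one
 * synchronous WRITE per page, no COMMIT needed afterwards. */
int rpcs_stable(int n)
{
    return n;
}

/* Round trips with unstable writes: one WRITE per page plus one
 * COMMIT per batch of pages (ceil(n / batch) commits in total). */
int rpcs_unstable(int n, int batch)
{
    return n + (n + batch - 1) / batch;
}
```

With batch == 1, which is what a read-triggered per-page flush amounts to, the unstable path costs 2n round trips versus n for stable writes (Trond's point); it only wins when many writes can share one deferred COMMIT, e.g. at close(2) time (Brian's point).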

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 21:55                                     ` Brian R Cowan
@ 2009-05-29 22:03                                       ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 22:03 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 17:55 -0400, Brian R Cowan wrote:
> So, it is possible that either pdflush is sending the commits or us, or 
> that the commits are happening when the file closes, giving us one/tens of 
> commits instead of hundreds or thousands. That's a big difference. The 
> write RPCs still happen in RHEL 4, they just don't block the linker, or at 
> least nowhere near as often. Since there is only one application/thread 
> (the gcc linker) writing this file, the odds of another task getting 
> stalled here are minimal at best.

No, you're not listening! That COMMIT is _synchronous_ and happens
before you can proceed with the READ request. There is no economy of
scale as you seem to assume.

Trond



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                     ` <1243634634.7155.160.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 22:20                                       ` Brian R Cowan
  2009-05-29 22:36                                         ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 22:20 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

I am listening. 

Commit is sync. I get that.

The NFS client does async writes in RHEL 4. They *eventually* get 
committed. (It doesn't really matter who causes the commit, does it?)
Read system calls may trigger cache flushing, but since not all of the 
writes are sync writes, the reads don't *always* stall when cache flushes 
occur. Builds are fast. 

We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server 
to come back.
READ system calls stall when the read triggers a flush of one or more 
cache pages.
Builds are slow. Links are at least 4x slower.

I am perfectly willing to send you network traces showing the issue. I can 
even DEMONSTRATE it for you using the remote meeting software of your 
choice. I can even demonstrate the impact of removing that behavior.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 22:20                                       ` Brian R Cowan
@ 2009-05-29 22:36                                         ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 22:36 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 18:20 -0400, Brian R Cowan wrote:
> I am listening. 
> 
> Commit is sync. I get that.
> 
> The NFS client does Async writes in RHEL 4. They *eventually* get 
> committed. (Doesn't really matter who causes the commit, does it.)
> Read system calls may trigger cache flushing, but since not all of them 
> are sync writes, the reads don't *always* stall when cache flushes occur.
> Builds are fast. 

All reads that trigger writes will trigger _sync_ writes and _sync_
commits. That's true of RHEL-5, RHEL-4, RHEL-3, and all the way back to
the very first 2.4 kernels. There is no deferred commit in that case,
because the cached dirty data needs to be overwritten by a fresh read,
which means that we may lose the data if the server reboots between the
unstable write and the ensuing read.

> We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server 
> to come back.
> READ system calls stall when the read triggers a flush of one or more 
> cache pages.
> Builds are slow. Links are at least 4x slower.
> 
> I am perfectly willing to send you network traces showing the issue. I can 
> even DEMONSTRATE it for you using the remote meeting software of your 
> choice. I can even demonstrate the impact of removing that behavior.

Can you demonstrate it using a recent kernel? If it's a problem that is
limited to RHEL-5, then it is up to Peter & co to pull in the fixes from
mainline, but if the slowdown is still present in 2.6.30, then I'm all
ears. However I don't for a minute accept your explanation that this has
something to do with stable vs unstable+commit.

    Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                       ` <1243636593.7155.188.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-29 23:02                                         ` Brian R Cowan
  2009-05-29 23:13                                           ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-05-29 23:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

If you can explain how pulling that ONE change can cause the performance 
issue to essentially disappear, I'd be more than happy to *try* to get a 
2.6.30 test environment configured. Getting ClearCase to *install* on 
kernel.org kernels is a non-trivial operation, requiring modifications to 
install scripts, module makefiles, etc. Then there is the issue of 
verifying that nothing else is impacted, all before I can begin to do this 
test. We're talking days here. 

To be blunt, I'd need something I can take to a manager who will ask me 
why I'm spending so much time on an issue when we "already have the 
cause."


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-29 23:02                                         ` Brian R Cowan
@ 2009-05-29 23:13                                           ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-29 23:13 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 19:02 -0400, Brian R Cowan wrote:
> If you can explain how pulling that ONE change can cause the performance 
> issue to essentially disappear, I'd be more than happy to *try* to get a 
> 2.6.30 test environment configured. Getting ClearCase to *install* on 
> kernel.org kernels is a non-trivial operation, requiring modifications to 
> install scripts, module makefiles, etc. Then there is the issue of 
> verifying that nothing else is impacted, all before I can begin to do this 
> test. We're talking days here. 
> 
> To be blunt, I'd need something I can take to a manager who will ask me 
> why I'm spending so much time on an issue when we "already have the 
> cause."

It's simple: you are the one asking for a change to the established
kernel behaviour, so you get to justify that change. Saying "it breaks
ClearCase on RHEL-5" is not a justification, and I won't ack the change.

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                   ` <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-30  0:22                     ` Greg Banks
       [not found]                       ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Greg Banks @ 2009-05-30  0:22 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
<trond.myklebust@fys.uio.no> wrote:
> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>
>
> What are you smoking? There is _NO_DIFFERENCE_ between what the server
> is supposed to do when sent a single stable write, and what it is
> supposed to do when sent an unstable write plus a commit. BOTH cases are
> supposed to result in the server writing the data to stable storage
> before the stable write / commit is allowed to return a reply.

This probably makes no difference to the discussion, but for a Linux
server there is a subtle difference between what the server is
supposed to do and what it actually does.

For a stable WRITE rpc, the Linux server sets O_SYNC in the struct
file during the vfs_writev() call and expects the underlying
filesystem to obey that flag and flush the data to disk.  For a COMMIT
rpc, the Linux server uses the underlying filesystem's f_op->fsync
instead.  This results in some potential differences:

 * The underlying filesystem might be broken in one code path and not
the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently
failing in f_op->fsync).  These kinds of bugs tend to be subtle
because in the absence of a crash they affect only the timing of IO
and so they might not be noticed.

 * The underlying filesystem might be doing more or better things in
one or the other code paths e.g. optimising allocations.

 * The Linux NFS server ignores the byte range in the COMMIT rpc and
flushes the whole file (I suspect this is a historical accident rather
than deliberate policy).  If there is other dirty data on that file
server-side, that other data will be written too before the COMMIT
reply is sent.  This may have a performance impact, depending on the
workload.

> The extra RPC round trip (+ parsing overhead ++++) due to the commit
> call is the _only_ difference.

This is almost completely true.  If the server behaved ideally and
predictably, this would be completely true.

</pedant>

-- 
Greg.
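[Editorial aside: Greg's server-side distinction can be modeled as follows. The per-page bookkeeping and function names are invented for illustration; they stand in for nfsd's O_SYNC write path and its whole-file fsync COMMIT path:]

```c
#include <stdbool.h>

#define FILE_PAGES 8

/* Toy server-side file: which cached pages are dirty on the server. */
struct srv_file {
    bool dirty[FILE_PAGES];
};

/* Stable WRITE: the data goes to disk as part of the write itself
 * (O_SYNC-like), so exactly the written pages hit the disk. */
int stable_write(struct srv_file *f, int first, int npages)
{
    int flushed = 0;
    for (int i = first; i < first + npages; i++, flushed++)
        f->dirty[i] = false;
    return flushed;
}

/* Unstable WRITE: cached server-side, nothing on disk yet. */
int unstable_write(struct srv_file *f, int first, int npages)
{
    for (int i = first; i < first + npages; i++)
        f->dirty[i] = true;
    return 0;
}

/* COMMIT: as Greg notes, the Linux server ignores the byte range
 * and flushes the whole file, so unrelated dirty pages on the same
 * file are written out too before the reply goes back. */
int commit(struct srv_file *f, int first, int npages)
{
    (void)first;
    (void)npages;   /* range ignored, per the discussion above */
    int flushed = 0;
    for (int i = 0; i < FILE_PAGES; i++)
        if (f->dirty[i]) { f->dirty[i] = false; flushed++; }
    return flushed;
}
```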

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                       ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-05-30  7:57                         ` Christoph Hellwig
  2009-06-01 22:30                           ` J. Bruce Fields
  2009-05-30 12:26                         ` Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2009-05-30  7:57 UTC (permalink / raw)
  To: Greg Banks
  Cc: Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs,
	linux-nfs-owner, Peter Staubach

On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote:
>  * The underlying filesystem might be doing more or better things in
> one or the other code paths e.g. optimising allocations.

Which is the case with ext3, which is pretty common.  It does reasonably
well with O_SYNC as far as I can see, but has a catastrophic fsync
implementation.

>  * The Linux NFS server ignores the byte range in the COMMIT rpc and
> flushes the whole file (I suspect this is a historical accident rather
> than deliberate policy).  If there is other dirty data on that file
> server-side, that other data will be written too before the COMMIT
> reply is sent.  This may have a performance impact, depending on the
> workload.

Right now we can't actually implement that properly, because the fsync
file operation can't flush sub-ranges.  There have been some
other requests for this, but my ->fsync redesign is on hold until
NFSD stops calling ->fsync without a file struct.

I think the open file cache will help us with that, if we can extend
it to also cache open file structs for directories.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                       ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-05-30  7:57                         ` Christoph Hellwig
@ 2009-05-30 12:26                         ` Trond Myklebust
       [not found]                           ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-05-30 12:26 UTC (permalink / raw)
  To: Greg Banks
  Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> <trond.myklebust@fys.uio.no> wrote:
> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >>
> >
> > What are you smoking? There is _NO_DIFFERENCE_ between what the server
> > is supposed to do when sent a single stable write, and what it is
> > supposed to do when sent an unstable write plus a commit. BOTH cases are
> > supposed to result in the server writing the data to stable storage
> > before the stable write / commit is allowed to return a reply.
> 
> This probably makes no difference to the discussion, but for a Linux
> server there is a subtle difference between what the server is
> supposed to do and what it actually does.
> 
> For a stable WRITE rpc, the Linux server sets O_SYNC in the struct
> file during the vfs_writev() call and expects the underlying
> filesystem to obey that flag and flush the data to disk.  For a COMMIT
> rpc, the Linux server uses the underlying filesystem's f_op->fsync
> instead.  This results in some potential differences:
> 
>  * The underlying filesystem might be broken in one code path and not
> the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently
> failing in f_op->fsync).  These kinds of bugs tend to be subtle
> because in the absence of a crash they affect only the timing of IO
> and so they might not be noticed.
> 
>  * The underlying filesystem might be doing more or better things in
> one or the other code paths e.g. optimising allocations.
> 
>  * The Linux NFS server ignores the byte range in the COMMIT rpc and
> flushes the whole file (I suspect this is a historical accident rather
> than deliberate policy).  If there is other dirty data on that file
> server-side, that other data will be written too before the COMMIT
> reply is sent.  This may have a performance impact, depending on the
> workload.
> 
> > The extra RPC round trip (+ parsing overhead ++++) due to the commit
> > call is the _only_ difference.
> 
> This is almost completely true.  If the server behaved ideally and
> predictably, this would be completely true.
> 
> </pedant>
> 

Firstly, the server only uses O_SYNC if you turn off write gathering
(a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
server is to always try write gathering and hence no O_SYNC.

Secondly, even if it were the case, then this does not justify changing
the client behaviour. The NFS protocol does not mandate, or even
recommend that the server use O_SYNC. All it says is that a stable write
and an unstable write+commit should both have the same result: namely
that the data+metadata must have been flushed to stable storage. The
protocol spec leaves it as an exercise to the server implementer to do
this as efficiently as possible.


  Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page  flushing
       [not found]                           ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-05-30 12:43                             ` Trond Myklebust
  2009-05-30 13:02                             ` Greg Banks
  1 sibling, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-05-30 12:43 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Greg Banks, Brian R Cowan, Chuck Lever, linux-nfs,
	linux-nfs-owner, Peter Staubach

On May 30, 2009, at 8:26, Trond Myklebust <trond.myklebust@fys.uio.no>  
wrote:
>
> Firstly, the server only uses O_SYNC if you turn off write gathering
> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> server is to always try write gathering and hence no O_SYNC.
>
> Secondly, even if it were the case, then this does not justify  
> changing
> the client behaviour. The NFS protocol does not mandate, or even
> recommend that the server use O_SYNC. All it says is that a stable  
> write
> and an unstable write+commit should both have the same result: namely
> that the data+metadata must have been flushed to stable storage. The
> protocol spec leaves it as an exercise to the server implementer to do
> this as efficiently as possible.
>

Speaking of write gathering... Are we sure the heuristic that checks  
i_writecount isn't introducing spurious 10ms delays here? It seems odd  
for the server to do write gathering on NFSv3 writes: if the client  
wants to send more writes, it will set the unstable flag...

Trond

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                           ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-05-30 12:43                             ` Trond Myklebust
@ 2009-05-30 13:02                             ` Greg Banks
       [not found]                               ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 94+ messages in thread
From: Greg Banks @ 2009-05-30 13:02 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
<trond.myklebust@fys.uio.no> wrote:
> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>> <trond.myklebust@fys.uio.no> wrote:
>> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>> >>
>>
>
> Firstly, the server only uses O_SYNC if you turn off write gathering
> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> server is to always try write gathering and hence no O_SYNC.

Well, write gathering is a total crock that AFAICS only helps
single-file writes on NFSv2.  For today's workloads all it does is
provide a hotspot on the two global variables that track writes in an
attempt to gather them.  Back when I worked on a server product,
no_wdelay was one of the standard options for new exports.

> Secondly, even if it were the case, then this does not justify changing
> the client behaviour.

I totally agree, it was just an observation.

In any case, as Christoph points out, the ext3 performance difference
makes an unstable WRITE+COMMIT slower than a stable WRITE, and you
already assumed that.

-- 
Greg.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-05-30  7:57                         ` Christoph Hellwig
@ 2009-06-01 22:30                           ` J. Bruce Fields
  2009-06-05 14:54                             ` Christoph Hellwig
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-01 22:30 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: Greg Banks, Trond Myklebust, Brian R Cowan, Chuck Lever,
	linux-nfs, linux-nfs-owner, Peter Staubach, Christoph Hellwig

On Sat, May 30, 2009 at 03:57:56AM -0400, Christoph Hellwig wrote:
> On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote:
> >  * The underlying filesystem might be doing more or better things in
> > one or the other code paths e.g. optimising allocations.
> 
> Which is the case with ext3 which is pretty common.  It does reasonably
> well on O_SYNC as far as I can see, but has a catastrophic fsync
> implementation. 
> 
> >  * The Linux NFS server ignores the byte range in the COMMIT rpc and
> > flushes the whole file (I suspect this is a historical accident rather
> > than deliberate policy).  If there is other dirty data on that file
> > server-side, that other data will be written too before the COMMIT
> > reply is sent.  This may have a performance impact, depending on the
> > workload.
> 
> Right now we can't actually implement that properly, because the fsync
> file operation can't flush sub-ranges.  There have been some other
> requests for this, but my ->fsync redesign is on hold until NFSD
> stops calling ->fsync without a file struct.
> 
> I think the open file cache will help us with that, if we can extend
> it to also cache open file structs for directories.

Krishna Kumar--do you think that'd be a reasonable thing to do?

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                               ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-06-01 22:30                                 ` J. Bruce Fields
  2009-06-02 15:00                                 ` Chuck Lever
  1 sibling, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-01 22:30 UTC (permalink / raw)
  To: Greg Banks
  Cc: Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs,
	linux-nfs-owner, Peter Staubach

On Sat, May 30, 2009 at 11:02:47PM +1000, Greg Banks wrote:
> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> <trond.myklebust@fys.uio.no> wrote:
> > On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> >> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> >> <trond.myklebust@fys.uio.no> wrote:
> >> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >> >>
> >>
> >
> > Firstly, the server only uses O_SYNC if you turn off write gathering
> > (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> > server is to always try write gathering and hence no O_SYNC.
> 
> Well, write gathering is a total crock that AFAICS only helps
> single-file writes on NFSv2.  For today's workloads all it does is
> provide a hotspot on the two global variables that track writes in an
> attempt to gather them.  Back when I worked on a server product,
> no_wdelay was one of the standard options for new exports.

Should be a simple nfs-utils patch to change the default.

--b.

> 
> > Secondly, even if it were the case, then this does not justify changing
> > the client behaviour.
> 
> I totally agree, it was just an observation.
> 
> In any case, as Christoph points out, the ext3 performance difference
> makes an unstable WRITE+COMMIT slower than a stable WRITE, and you
> already assumed that.
> 
> -- 
> Greg.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                               ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-06-01 22:30                                 ` J. Bruce Fields
@ 2009-06-02 15:00                                 ` Chuck Lever
  2009-06-02 17:27                                   ` Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: Chuck Lever @ 2009-06-02 15:00 UTC (permalink / raw)
  To: Greg Banks
  Cc: Trond Myklebust, Brian R Cowan, linux-nfs, linux-nfs-owner,
	Peter Staubach

On May 30, 2009, at 9:02 AM, Greg Banks wrote:
> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> <trond.myklebust@fys.uio.no> wrote:
>> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>>> <trond.myklebust@fys.uio.no> wrote:
>>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>>>>
>>>
>>
>> Firstly, the server only uses O_SYNC if you turn off write gathering
>> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
>> server is to always try write gathering and hence no O_SYNC.
>
> Well, write gathering is a total crock that AFAICS only helps
> single-file writes on NFSv2.  For today's workloads all it does is
> provide a hotspot on the two global variables that track writes in an
> attempt to gather them.  Back when I worked on a server product,
> no_wdelay was one of the standard options for new exports.

Really?  Even for NFSv3/4 FILE_SYNC?  I can understand that it  
wouldn't have any real effect on UNSTABLE.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-02 15:00                                 ` Chuck Lever
@ 2009-06-02 17:27                                   ` Trond Myklebust
       [not found]                                     ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-02 17:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Greg Banks, Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach

On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote:
> On May 30, 2009, at 9:02 AM, Greg Banks wrote:
> > On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> > <trond.myklebust@fys.uio.no> wrote:
> >> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> >>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> >>> <trond.myklebust@fys.uio.no> wrote:
> >>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >>>>>
> >>>
> >>
> >> Firstly, the server only uses O_SYNC if you turn off write gathering
> >> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> >> server is to always try write gathering and hence no O_SYNC.
> >
> > Well, write gathering is a total crock that AFAICS only helps
> > single-file writes on NFSv2.  For today's workloads all it does is
> > provide a hotspot on the two global variables that track writes in an
> > attempt to gather them.  Back when I worked on a server product,
> > no_wdelay was one of the standard options for new exports.
> 
> Really?  Even for NFSv3/4 FILE_SYNC?  I can understand that it  
> wouldn't have any real effect on UNSTABLE.

The question is why would a sensible client ever want to send more than
1 NFSv3 write with FILE_SYNC? If you need to send multiple writes in
parallel to the same file, then it makes much more sense to use
UNSTABLE.

Write gathering relies on waiting an arbitrary length of time in order
to see if someone is going to send another write. The protocol offers no
guidance as to how long that wait should be, and so (at least on the
Linux server) we've coded in a hard wait of 10ms if and only if we see
that something else has the file open for writing.
One problem with the Linux implementation is that the "something else"
could be another nfs server thread that happens to be in nfsd_write(),
however it could also be another open NFSv4 stateid, or a NLM lock, or a
local process that has the file open for writing.
Another problem is that the nfs server keeps a record of the last file
that was accessed, and also waits if it sees you are writing again to
that same file. Of course it has no idea if this is truly a parallel
write, or if it just happens that you are writing again to the same file
using O_SYNC...

  Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                     ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-06-02 18:15                                       ` Chuck Lever
  2009-06-03 16:22                                       ` Carlos Carvalho
  1 sibling, 0 replies; 94+ messages in thread
From: Chuck Lever @ 2009-06-02 18:15 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Greg Banks, Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach

On Jun 2, 2009, at 1:27 PM, Trond Myklebust wrote:
> On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote:
>> On May 30, 2009, at 9:02 AM, Greg Banks wrote:
>>> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
>>> <trond.myklebust@fys.uio.no> wrote:
>>>> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>>>>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>>>>> <trond.myklebust@fys.uio.no> wrote:
>>>>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>>>>>>
>>>>>
>>>>
>>>> Firstly, the server only uses O_SYNC if you turn off write  
>>>> gathering
>>>> (a.k.a. the 'wdelay' option). The default behaviour for the Linux  
>>>> nfs
>>>> server is to always try write gathering and hence no O_SYNC.
>>>
>>> Well, write gathering is a total crock that AFAICS only helps
>>> single-file writes on NFSv2.  For today's workloads all it does is
>>> provide a hotspot on the two global variables that track writes in  
>>> an
>>> attempt to gather them.  Back when I worked on a server product,
>>> no_wdelay was one of the standard options for new exports.
>>
>> Really?  Even for NFSv3/4 FILE_SYNC?  I can understand that it
>> wouldn't have any real effect on UNSTABLE.
>
> The question is why would a sensible client ever want to send more  
> than
> 1 NFSv3 write with FILE_SYNC?

A client might behave this way if an application was performing random  
4KB synchronous writes to a large file, or the VM is aggressively  
flushing single pages to try to mitigate a low-memory situation.  IOW  
it may not be up to the client...

Penalizing FILE_SYNC writes, even a little, by waiting a bit could  
also reduce the server's workload by slowing clients that are pounding  
a server with synchronous writes.

Not an argument, really... but it seems like there are some scenarios  
where delaying synchronous writes could still be useful.  The real  
question is whether these scenarios occur frequently enough to warrant  
the overhead in the server.  It would be nice to see some I/O trace  
data.

> If you need to send multiple writes in
> parallel to the same file, then it makes much more sense to use
> UNSTABLE.

Yep, agreed.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                     ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-06-02 18:15                                       ` Chuck Lever
@ 2009-06-03 16:22                                       ` Carlos Carvalho
  2009-06-03 17:10                                         ` Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: Carlos Carvalho @ 2009-06-03 16:22 UTC (permalink / raw)
  To: linux-nfs

Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27:
 >Write gathering relies on waiting an arbitrary length of time in order
 >to see if someone is going to send another write. The protocol offers no
 >guidance as to how long that wait should be, and so (at least on the
 >Linux server) we've coded in a hard wait of 10ms if and only if we see
 >that something else has the file open for writing.
 >One problem with the Linux implementation is that the "something else"
 >could be another nfs server thread that happens to be in nfsd_write(),
 >however it could also be another open NFSv4 stateid, or a NLM lock, or a
 >local process that has the file open for writing.
 >Another problem is that the nfs server keeps a record of the last file
 >that was accessed, and also waits if it sees you are writing again to
 >that same file. Of course it has no idea if this is truly a parallel
 >write, or if it just happens that you are writing again to the same file
 >using O_SYNC...

I think the decision to write or wait doesn't belong to the nfs
server; it should just send the writes immediately. It's up to the
fs/block/device layers to do the gathering. I understand that the
client should try to do the gathering before sending the request to
the wire.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-03 16:22                                       ` Carlos Carvalho
@ 2009-06-03 17:10                                         ` Trond Myklebust
       [not found]                                           ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org>
                                                             ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-06-03 17:10 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-nfs

On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27:
>  >Write gathering relies on waiting an arbitrary length of time in order
>  >to see if someone is going to send another write. The protocol offers no
>  >guidance as to how long that wait should be, and so (at least on the
>  >Linux server) we've coded in a hard wait of 10ms if and only if we see
>  >that something else has the file open for writing.
>  >One problem with the Linux implementation is that the "something else"
>  >could be another nfs server thread that happens to be in nfsd_write(),
>  >however it could also be another open NFSv4 stateid, or a NLM lock, or a
>  >local process that has the file open for writing.
>  >Another problem is that the nfs server keeps a record of the last file
>  >that was accessed, and also waits if it sees you are writing again to
>  >that same file. Of course it has no idea if this is truly a parallel
>  >write, or if it just happens that you are writing again to the same file
>  >using O_SYNC...
> 
> I think the decision to write or wait doesn't belong to the nfs
> server; it should just send the writes immediately. It's up to the
> fs/block/device layers to do the gathering. I understand that the
> client should try to do the gathering before sending the request to
> the wire

This isn't something that we've just pulled out of a hat. It dates back
to pre-NFSv3 times, when every write had to be synchronously committed
to disk before the RPC call could return.

See, for instance,

http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What
+is+nfs+write
+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3

The point is that while it is a good idea for NFSv2, we have much better
methods of dealing with multiple writes in NFSv3 and v4...

Trond


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-03 17:10                                         ` Trond Myklebust
       [not found]                                           ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org>
@ 2009-06-03 21:28                                           ` Dean Hildebrand
  2009-06-04  2:16                                             ` Carlos Carvalho
  2009-06-04 17:42                                           ` Brian R Cowan
  2 siblings, 1 reply; 94+ messages in thread
From: Dean Hildebrand @ 2009-06-03 21:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs



Trond Myklebust wrote:
> On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
>   
>> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27:
>>  >Write gathering relies on waiting an arbitrary length of time in order
>>  >to see if someone is going to send another write. The protocol offers no
>>  >guidance as to how long that wait should be, and so (at least on the
>>  >Linux server) we've coded in a hard wait of 10ms if and only if we see
>>  >that something else has the file open for writing.
>>  >One problem with the Linux implementation is that the "something else"
>>  >could be another nfs server thread that happens to be in nfsd_write(),
>>  >however it could also be another open NFSv4 stateid, or a NLM lock, or a
>>  >local process that has the file open for writing.
>>  >Another problem is that the nfs server keeps a record of the last file
>>  >that was accessed, and also waits if it sees you are writing again to
>>  >that same file. Of course it has no idea if this is truly a parallel
>>  >write, or if it just happens that you are writing again to the same file
>>  >using O_SYNC...
>>
>> I think the decision to write or wait doesn't belong to the nfs
>> server; it should just send the writes immediately. It's up to the
>> fs/block/device layers to do the gathering. I understand that the
>> client should try to do the gathering before sending the request to
>> the wire
>>     
Just to be clear, the Linux NFS server does not gather the writes.  
Writes are passed immediately to the fs.  nfsd simply waits 10ms before 
sync'ing the writes to disk.  This allows the underlying file system 
time to do the gathering and sync data in larger chunks.  Of course, 
this only applies to stable writes, and only when wdelay is enabled for 
the export.

Dean
>
> This isn't something that we've just pulled out of a hat. It dates back
> to pre-NFSv3 times, when every write had to be synchronously committed
> to disk before the RPC call could return.
>
> See, for instance,
>
> http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What
> +is+nfs+write
> +gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3
>
> The point is that while it is a good idea for NFSv2, we have much better
> methods of dealing with multiple writes in NFSv3 and v4...
>
> Trond
>
>   

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-03 21:28                                           ` Dean Hildebrand
@ 2009-06-04  2:16                                             ` Carlos Carvalho
  0 siblings, 0 replies; 94+ messages in thread
From: Carlos Carvalho @ 2009-06-04  2:16 UTC (permalink / raw)
  To: linux-nfs

Dean Hildebrand (seattleplus@gmail.com) wrote on 3 June 2009 17:28:
 >Trond Myklebust wrote:
 >> On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
 >>   
 >>> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27:
 >>>  >Write gathering relies on waiting an arbitrary length of time in order
 >>>  >to see if someone is going to send another write. The protocol offers no
 >>>  >guidance as to how long that wait should be, and so (at least on the
 >>>  >Linux server) we've coded in a hard wait of 10ms if and only if we see
 >>>  >that something else has the file open for writing.
 >>>  >One problem with the Linux implementation is that the "something else"
 >>>  >could be another nfs server thread that happens to be in nfsd_write(),
 >>>  >however it could also be another open NFSv4 stateid, or a NLM lock, or a
 >>>  >local process that has the file open for writing.
 >>>  >Another problem is that the nfs server keeps a record of the last file
 >>>  >that was accessed, and also waits if it sees you are writing again to
 >>>  >that same file. Of course it has no idea if this is truly a parallel
 >>>  >write, or if it just happens that you are writing again to the same file
 >>>  >using O_SYNC...
 >>>
 >>> I think the decision to write or wait doesn't belong to the nfs
 >>> server; it should just send the writes immediately. It's up to the
 >>> fs/block/device layers to do the gathering. I understand that the
 >>> client should try to do the gathering before sending the request to
 >>> the wire
 >>>     
 >Just to be clear, the linux NFS server does not gather the writes.  
 >Writes are passed immediately to the fs.

Ah! That's much better.

 >nfsd simply waits 10ms before 
 >sync'ing the writes to disk.  This allows the underlying file system 
  ****
 >time to do the gathering and sync data in larger chunks.

OK, all is perfectly fine then.

Since syncs seem to be a requirement of the protocol, perhaps the 10ms
delay could be made tunable to allow admins more flexibility. For
example, if we change other timeouts we could adjust the nfs sync one
accordingly. Could be an option to nfsd or, better, a variable in /proc.

Thanks Dean and Trond for the explanations.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-03 17:10                                         ` Trond Myklebust
       [not found]                                           ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org>
  2009-06-03 21:28                                           ` Dean Hildebrand
@ 2009-06-04 17:42                                           ` Brian R Cowan
  2009-06-04 18:04                                             ` Trond Myklebust
  2 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-06-04 17:42 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

I've been looking in more detail in the network traces that started all 
this, and doing some additional testing with the 2.6.29 kernel in an 
NFS-only build...

In brief:
1) RHEL 5 generates >3x the network write traffic of RHEL 4 when linking 
Samba's smbd.
2) In RHEL 5, those unnecessary writes are slowed down by the "FILE_SYNC" 
optimization put in place for small writes.
3) That optimization seems to have been removed from the kernel somewhere 
between 2.6.18 and 2.6.29.
4) Unfortunately, the "unnecessary write before read" behavior is still 
present in 2.6.29.

In detail:
In RHEL 5, I see a lot of reads from offset {whatever} *immediately* 
preceded by a write to *the same offset*. This is obviously a bad thing; 
now the trick is finding out where it is coming from. The 
write-before-read behavior is happening on the smbd file itself (not 
surprising, since that's the only file we're writing in this test...). This 
happens with every 2.6.18 and later kernel I've tested to date.

In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take 
something on the order of 10ms to come back. When using a 2.6.29 kernel, 
the TOTAL time for the write+commit rpc set (write rpc, write reply, 
commit rpc, commit reply) to come back is something like 2ms. I guess the 
NFS servers aren't handling FILE_SYNC writes very well. In 2.6.29, ALL 
the write calls appear to be unstable writes; in RHEL 5, most are 
FILE_SYNC writes. (Network traces available upon request.)

Neither is quite as fast as RHEL 4, because the link under RHEL 4 only 
puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500 
when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a 
similar number of COMMITs, on the wire. 

The bottom line:
* If someone can help me find where 2.6 stopped setting small writes to 
FILE_SYNC, I'd appreciate it. It would save me time walking through >50 
commitdiffs in gitweb...
* Is this the correct place to start discussing the annoying 
write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 
continues? 

=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
 
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
 

Please be sure to update your PMR using ESR at 
http://www-306.ibm.com/software/support/probsub.html or cc all 
correspondence to sw_support@us.ibm.com to be sure your PMR is updated in 
case I am not available.



From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Carlos Carvalho <carlos@fisica.ufpr.br>
Cc:
linux-nfs@vger.kernel.org
Date:
06/03/2009 01:10 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
linux-nfs-owner@vger.kernel.org



On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27:
>  >Write gathering relies on waiting an arbitrary length of time in order
>  >to see if someone is going to send another write. The protocol offers no
>  >guidance as to how long that wait should be, and so (at least on the
>  >Linux server) we've coded in a hard wait of 10ms if and only if we see
>  >that something else has the file open for writing.
>  >One problem with the Linux implementation is that the "something else"
>  >could be another nfs server thread that happens to be in nfsd_write(),
>  >however it could also be another open NFSv4 stateid, or a NLM lock, or a
>  >local process that has the file open for writing.
>  >Another problem is that the nfs server keeps a record of the last file
>  >that was accessed, and also waits if it sees you are writing again to
>  >that same file. Of course it has no idea if this is truly a parallel
>  >write, or if it just happens that you are writing again to the same file
>  >using O_SYNC...
> 
> I think the decision to write or wait doesn't belong to the nfs
> server; it should just send the writes immediately. It's up to the
> fs/block/device layers to do the gathering. I understand that the
> client should try to do the gathering before sending the request to
> the wire

This isn't something that we've just pulled out of a hat. It dates back
to pre-NFSv3 times, when every write had to be synchronously committed
to disk before the RPC call could return.

See, for instance,

http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What+is+nfs+write+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3

The point is that while it is a good idea for NFSv2, we have much better
methods of dealing with multiple writes in NFSv3 and v4...
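As a local analogue of the two models (a sketch only: O_SYNC stands in
for an NFSv2/FILE_SYNC-style stable write, and fsync() for the COMMIT
that follows a batch of NFSv3 UNSTABLE writes):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# NFSv2 model: every write must be stable before the call returns
# (O_SYNC is the local stand-in for a FILE_SYNC write).
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"block 0")
os.close(fd)

# NFSv3 model: stream unstable writes, then commit once at the end
# (fsync() is the local stand-in for the COMMIT rpc).
fd = os.open(path, os.O_RDWR)
for i in range(1, 4):
    os.pwrite(fd, b"block %d" % i, i * 4096)
os.fsync(fd)
last = os.pread(fd, 7, 3 * 4096)
os.close(fd)
os.unlink(path)
```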

Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 17:42                                           ` Brian R Cowan
@ 2009-06-04 18:04                                             ` Trond Myklebust
  2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
  2009-06-24 19:54                                               ` [PATCH] read-modify-write page updating Peter Staubach
  0 siblings, 2 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-06-04 18:04 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

On Thu, 2009-06-04 at 13:42 -0400, Brian R Cowan wrote:
> I've been looking in more detail in the network traces that started all 
> this, and doing some additional testing with the 2.6.29 kernel in an 
> NFS-only build...
> 
> In brief:
> 1) RHEL 5 generates >3x the network write traffic than RHEL4 when linking 
> Samba's smbd.
> 2) In RHEL 5, Those unnecessary writes are slowed down by the "FILE_SYNC" 
> optimization put in place for small writes.
> 3) That optimization seems to be removed from the kernel somewhere between 
> 2.6.18 and 2.6.29.
> 4) Unfortunately the "unnecessary write before read" behavior is still 
> present in 2.6.29.
> 
> In detail:
> In RHEL 5, I see a lot of reads from offset {whatever} *immediately* 
> preceded by a write to *the same offset*. This is obviously a bad thing; 
> now the trick is finding out where it is coming from. The 
> write-before-read behavior is happening on the smbd file itself (not 
> surprising since that's the only file we're writing in this test...). This 
> happens with every 2.6.18 and later kernel I've tested to date.
> 
> In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take 
> something on the order of 10ms to come back. When using a 2.6.29 kernel, 
> the TOTAL time for the write+commit rpc set (write rpc, write reply, 
> commit rpc, commit reply) to come back is something like 2ms. I guess the 
> NFS servers aren't handling FILE_SYNC writes very well. In 2.6.29, ALL the 
> write calls appear to be unstable writes; in RHEL 5, most are FILE_SYNC 
> writes. (Network traces available upon request.)

Did you try turning off write gathering on the server (i.e. add the
'no_wdelay' export option)? As I said earlier, that forces a delay of
10ms per RPC call, which might explain the FILE_SYNC slowness.
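(For reference, on a Linux server that is an /etc/exports entry along
these lines; the host and path are made up, and wdelay only applies to
sync exports, so it is shown with sync:)

```
/export/build   client.example.com(rw,sync,no_wdelay,no_subtree_check)
```

followed by `exportfs -ra` to re-export.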

> Neither is quite as fast as RHEL 4, because the link under RHEL 4 only 
> puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500 
> when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a 
> similar number of COMMITs, on the wire. 
> 
> The bottom line:
> * If someone can help me find where 2.6 stopped setting small writes to 
> FILE_SYNC, I'd appreciate it. It would save me time walking through >50 
> commitdiffs in gitweb...

It still does set FILE_SYNC for single page writes.

> * Is this the correct place to start discussing the annoying 
> write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 
> continues? 

Yes, but you'll need to tell us a bit more about the write patterns. Are
these random writes, or are they sequential? Is there any file locking
involved?

As I've said earlier in this thread, all NFS clients will flush out the
dirty data if a page that is being attempted read also contains
uninitialised areas.
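That rule can be sketched roughly as follows (not the kernel code, just
the logic: a page holding dirty bytes that do not cover the whole page
has uninitialised areas, so a read of that page has to flush first):

```python
PAGE_SIZE = 4096

def read_needs_flush(dirty_ranges, read_offset):
    """Decide whether reading the page at read_offset must flush first.

    dirty_ranges: list of (start, end) byte ranges written but not yet
    flushed.  A page only partially covered by dirty data contains
    uninitialised areas, so the client writes it out before it can fill
    the whole page from the server.
    """
    page_start = (read_offset // PAGE_SIZE) * PAGE_SIZE
    page_end = page_start + PAGE_SIZE
    for start, end in dirty_ranges:
        overlaps = start < page_end and end > page_start
        covers_whole_page = start <= page_start and end >= page_end
        if overlaps and not covers_whole_page:
            return True   # partial dirty page: flush before reading
    return False
```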

Trond



* Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 18:04                                             ` Trond Myklebust
@ 2009-06-04 20:43                                               ` Brian R Cowan
  2009-06-04 20:57                                                 ` Trond Myklebust
                                                                   ` (2 more replies)
  2009-06-24 19:54                                               ` [PATCH] read-modify-write page updating Peter Staubach
  1 sibling, 3 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-06-04 20:43 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 
PM:

> Did you try turning off write gathering on the server (i.e. add the
> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> 10ms per RPC call, which might explain the FILE_SYNC slowness.

Just tried it, this seems to be a very useful workaround as well. The 
FILE_SYNC write calls come back in about the same amount of time as the 
write+commit pairs... Speeds up building regardless of the network 
filesystem (ClearCase MVFS or straight NFS).

> > The bottom line:
> > * If someone can help me find where 2.6 stopped setting small writes to
> > FILE_SYNC, I'd appreciate it. It would save me time walking through >50
> > commitdiffs in gitweb...
> 
> It still does set FILE_SYNC for single page writes.

Well, the network trace *seems* to say otherwise, but that could be 
because the 2.6.29 kernel is now reliably following a code path that 
doesn't set up to do FILE_SYNC writes for these flushes... Just like the 
RHEL 5 traces didn't have every "small" write to the link output file go 
out as a FILE_SYNC write.

> 
> > * Is this the correct place to start discussing the annoying 
> > write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 
> > continues? 
> 
> Yes, but you'll need to tell us a bit more about the write patterns. Are
> these random writes, or are they sequential? Is there any file locking
> involved?

Well, it's just a link, so it's random read/write traffic. (read object 
file/library, add stuff to output file, seek somewhere else and update a 
table, etc., etc.) All I did here was build Samba over nfs, remove 
bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces 
show that the file is opened "UNCHECKED" when doing the build in straight 
NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does 
not seem to impact the behavior. We never lock the output file. The 
write-before-read happens all over the place. And when we did straces and 
lined up the call times, it was a read operation triggering the write. 

> 
> As I've said earlier in this thread, all NFS clients will flush out the
> dirty data if a page that is being attempted read also contains
> uninitialised areas.

What I'm trying to understand is why RHEL 4 is not flushing anywhere near 
as often. Either RHEL4 erred on the side of not writing, and RHEL5 is 
erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've 
seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but 
it still flushes a lot more than RHEL 4 does.

In any event, that doesn't help us here since 1) ClearCase can't work with 
that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3) 
the amount of code review my customer would have to go through to get the 
whole kernel vetted for use in their environment is frightening.



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
@ 2009-06-04 20:57                                                 ` Trond Myklebust
  2009-06-04 21:30                                                   ` Brian R Cowan
  2009-06-04 21:07                                                 ` Peter Staubach
  2009-06-05 11:35                                                 ` Steve Dickson
  2 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-04 20:57 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote:
> What I'm trying to understand is why RHEL 4 is not flushing anywhere near 
> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is 
> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've 
> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but 
> it still flushes a lot more than RHEL 4 does.

Most of that increase is probably mainly due to the changes to the way
stat() works. More precisely, it would be due to this patch:

   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a

which went into Linux 2.6.16 in order to fix a posix compatibility
issue.

Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
  2009-06-04 20:57                                                 ` Trond Myklebust
@ 2009-06-04 21:07                                                 ` Peter Staubach
  2009-06-04 21:39                                                   ` Brian R Cowan
  2009-06-05 11:35                                                 ` Steve Dickson
  2 siblings, 1 reply; 94+ messages in thread
From: Peter Staubach @ 2009-06-04 21:07 UTC (permalink / raw)
  To: Brian R Cowan
  Cc: Trond Myklebust, Carlos Carvalho, linux-nfs, linux-nfs-owner

Brian R Cowan wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 
> PM:
>
>   
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>     
>
> Just tried it, this seems to be a very useful workaround as well. The 
> FILE_SYNC write calls come back in about the same amount of time as the 
> write+commit pairs... Speeds up building regardless of the network 
> filesystem (ClearCase MVFS or straight NFS).
>
>   
>>> The bottom line:
>>> * If someone can help me find where 2.6 stopped setting small writes to
>>> FILE_SYNC, I'd appreciate it. It would save me time walking through >50
>>> commitdiffs in gitweb...
>> It still does set FILE_SYNC for single page writes.
>>     
>
> Well, the network trace *seems* to say otherwise, but that could be 
> because the 2.6.29 kernel is now reliably following a code path that 
> doesn't set up to do FILE_SYNC writes for these flushes... Just like the 
> RHEL 5 traces didn't have every "small" write to the link output file go 
> out as a FILE_SYNC write.
>
>   
>>> * Is this the correct place to start discussing the annoying 
>>> write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 
>>> continues? 
>> Yes, but you'll need to tell us a bit more about the write patterns. Are
>> these random writes, or are they sequential? Is there any file locking
>> involved?
>>     
>
> Well, it's just a link, so it's random read/write traffic. (read object 
> file/library, add stuff to output file, seek somewhere else and update a 
> table, etc., etc.) All I did here was build Samba over nfs, remove 
> bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces 
> show that the file is opened "UNCHECKED" when doing the build in straight 
> NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does 
> not seem to impact the behavior. We never lock the output file. The 
> write-before-read happens all over the place. And when we did straces and 
> lined up the call times, it was a read operation triggering the write. 
>
>   
>> As I've said earlier in this thread, all NFS clients will flush out the
>> dirty data if a page that is being attempted read also contains
>> uninitialised areas.
>>     
>
> What I'm trying to understand is why RHEL 4 is not flushing anywhere near 
> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is 
> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've 
> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but 
> it still flushes a lot more than RHEL 4 does.
>
>   

I think that you are making a lot of assumptions here that
are not necessarily backed by the evidence.  The base cause
here seems more likely to me to be the setting of PG_uptodate
being different on the different releases, ie. RHEL-4, RHEL-5,
and 2.6.29.  All of these kernels contain the support to
write out pages which are not marked as PG_uptodate.

       ps

> In any event, that doesn't help us here since 1) ClearCase can't work with 
> that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3) 
> the amount of code review my customer would have to go through to get the 
> whole kernel vetted for use in their environment is frightening.
>



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 20:57                                                 ` Trond Myklebust
@ 2009-06-04 21:30                                                   ` Brian R Cowan
  2009-06-04 21:48                                                     ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Brian R Cowan @ 2009-06-04 21:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

I'll have to see if/how this impacts the flush behavior. I don't THINK we 
are doing getattrs in the middle of the link, but the trace information 
kind of went astray when the VMs got reverted to the base OS.

Also, your recommended workaround of setting no_wdelay only works if the 
NFS server is Linux; the option isn't available on Solaris or HP-UX. This 
limits its usefulness in heterogeneous environments. Solaris 10 doesn't 
support async NFS exports, and we've already discussed how the small-write 
optimization overrides write behavior on async mounts.




From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Carlos Carvalho <carlos@fisica.ufpr.br>, linux-nfs@vger.kernel.org, 
linux-nfs-owner@vger.kernel.org
Date:
06/04/2009 04:57 PM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS 
I/O performance degraded by FLUSH_STABLE page flushing



On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote:
> What I'm trying to understand is why RHEL 4 is not flushing anywhere near
> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
> it still flushes a lot more than RHEL 4 does.

Most of that increase is probably mainly due to the changes to the way
stat() works. More precisely, it would be due to this patch:

   
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a


which went into Linux 2.6.16 in order to fix a posix compatibility
issue.

Trond





* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 21:07                                                 ` Peter Staubach
@ 2009-06-04 21:39                                                   ` Brian R Cowan
  0 siblings, 0 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-06-04 21:39 UTC (permalink / raw)
  To: Peter Staubach
  Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner, Trond Myklebust

Peter Staubach <staubach@redhat.com> wrote on 06/04/2009 05:07:29 PM:

> > What I'm trying to understand is why RHEL 4 is not flushing anywhere near
> > as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
> > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
> > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
> > it still flushes a lot more than RHEL 4 does.
> >
> > 
> 
> I think that you are making a lot of assumptions here, that
> are not necessarily backed by the evidence.  The base cause
> here seems more likely to me to be the setting of PG_uptodate
> being different on the different releases, ie. RHEL-4, RHEL-5,
> and 2.6.29.  All of these kernels contain the support to
> write out pages which are not marked as PG_uptodate.
> 
>        ps
I'm trying to find out why the paging/flushing is happening. It's 
incredibly trivial to reproduce, just link something large over NFS. RHEL4 
writes to the smbd file about 150x, RHEL 5 writes to it > 500x, and 2.6.29 
writes about 340x. I have network traces showing that. I'm now trying to 
understand why... So we can determine if there is anything that can be done 
about it...

Trond's note about a getattr change that went into 2.6.16 may be important 
since we have also seen this slowdown on SuSE 10, which is based on 2.6.16 
kernels. I'm just a little unsure of why the gcc linker would be calling 
getattr... Time to collect more straces, I guess, and then to see what 
happens under the covers... (Be just my luck if the seek eventually causes 
nfs_getattr to be called, though it would certainly explain the behavior.)


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 21:30                                                   ` Brian R Cowan
@ 2009-06-04 21:48                                                     ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-06-04 21:48 UTC (permalink / raw)
  To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner

Well, that's a good reason to get rid of those Solaris servers. :-)

Seriously, though, we do _not_ fix server bugs by changing the client.  
If we had been doing something that was incorrect, or not recommended  
by the NFS spec, then matters would be different...

Trond

On Jun 4, 2009, at 17:30, Brian R Cowan <brcowan@us.ibm.com> wrote:

> I'll have to see if/how this impacts the flush behavior. I don't THINK we
> are doing getattrs in the middle of the link, but the trace information
> kind of went astray when the VMs got reverted to the base OS.
>
> Also, your recommended workaround of setting no_wdelay only works if the
> NFS server is Linux; the option isn't available on Solaris or HP-UX. This
> limits its usefulness in heterogeneous environments. Solaris 10 doesn't
> support async NFS exports, and we've already discussed how the small-write
> optimization overrides write behavior on async mounts.
>
>
>
>
> From:
> Trond Myklebust <trond.myklebust@fys.uio.no>
> To:
> Brian R Cowan/Cupertino/IBM@IBMUS
> Cc:
> Carlos Carvalho <carlos@fisica.ufpr.br>, linux-nfs@vger.kernel.org,
> linux-nfs-owner@vger.kernel.org
> Date:
> 06/04/2009 04:57 PM
> Subject:
> Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write  
> NFS
> I/O performance degraded by FLUSH_STABLE page flushing
>
>
>
> On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote:
>> What I'm trying to understand is why RHEL 4 is not flushing anywhere near
>> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
>> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
>> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
>> it still flushes a lot more than RHEL 4 does.
>
> Most of that increase is probably mainly due to the changes to the way
> stat() works. More precisely, it would be due to this patch:
>
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a
>
>
> which went into Linux 2.6.16 in order to fix a posix compatibility
> issue.
>
> Trond
>
>
>


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
  2009-06-04 20:57                                                 ` Trond Myklebust
  2009-06-04 21:07                                                 ` Peter Staubach
@ 2009-06-05 11:35                                                 ` Steve Dickson
  2009-06-05 12:46                                                   ` Trond Myklebust
                                                                     ` (3 more replies)
  2 siblings, 4 replies; 94+ messages in thread
From: Steve Dickson @ 2009-06-05 11:35 UTC (permalink / raw)
  To: Neil Brown, Greg Banks; +Cc: Brian R Cowan, linux-nfs

Brian R Cowan wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 
> PM:
> 
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> 
> Just tried it, this seems to be a very useful workaround as well. The 
> FILE_SYNC write calls come back in about the same amount of time as the 
> write+commit pairs... Speeds up building regardless of the network 
> filesystem (ClearCase MVFS or straight NFS).

Does anybody have the history as to why 'no_wdelay' is an 
export default? As Brian mentioned later in this thread, 
it only helps Linux servers, but that's a good thing, IMHO. ;-)

So I would have no problem changing the default export
options in nfs-utils, but it would be nice to know why 
it was there in the first place...

Neil, Greg??  

steved.


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 11:35                                                 ` Steve Dickson
@ 2009-06-05 12:46                                                   ` Trond Myklebust
  2009-06-05 13:03                                                     ` Brian R Cowan
  2009-06-05 13:05                                                   ` Tom Talpey
                                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 12:46 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Neil Brown, Greg Banks, Brian R Cowan, linux-nfs

On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote:
> Brian R Cowan wrote:
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 
> > PM:
> > 
> >> Did you try turning off write gathering on the server (i.e. add the
> >> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > 
> > Just tried it, this seems to be a very useful workaround as well. The 
> > FILE_SYNC write calls come back in about the same amount of time as the 
> > write+commit pairs... Speeds up building regardless of the network 
> > filesystem (ClearCase MVFS or straight NFS).
> 
> Does anybody have the history as to why 'no_wdelay' is an 
> export default? As Brian mentioned later in this thread,
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
> 
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why 
> it was there in the first place...

It dates back to the days when most Linux clients in use in the field
were NFSv2 only. After all, it has only been 15 years...

  Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 12:46                                                   ` Trond Myklebust
@ 2009-06-05 13:03                                                     ` Brian R Cowan
  0 siblings, 0 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-06-05 13:03 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Greg Banks, linux-nfs, Neil Brown, Steve Dickson

Personally, I would leave the default export options alone, simply because 
they more or less match the defaults for the other NFS servers. 

Also, there may be negative impacts of changing the default export option 
to no_wdelay on really busy servers. One possible result is that more CPU 
time gets spent waiting on writes to disk. 

I'm a bit paranoid when it comes to tuning *server* settings, since they 
impact all clients all at once, where client tuning generally only impacts 
the one client.




From:
Trond Myklebust <trond.myklebust@fys.uio.no>
To:
Steve Dickson <SteveD@redhat.com>
Cc:
Neil Brown <neilb@suse.de>, Greg Banks <gnb@fmeh.org>, Brian R 
Cowan/Cupertino/IBM@IBMUS, linux-nfs@vger.kernel.org
Date:
06/05/2009 08:48 AM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS 
I/O performance degraded by FLUSH_STABLE page flushing



On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote:
> Brian R Cowan wrote:
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 PM:
> > 
> >> Did you try turning off write gathering on the server (i.e. add the
> >> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > 
> > Just tried it, this seems to be a very useful workaround as well. The 
> > FILE_SYNC write calls come back in about the same amount of time as the 
> > write+commit pairs... Speeds up building regardless of the network 
> > filesystem (ClearCase MVFS or straight NFS).
> 
> Does anybody have the history as to why 'no_wdelay' is an 
> export default? As Brian mentioned later in this thread,
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
> 
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why 
> it was there in the first place...

It dates back to the days when most Linux clients in use in the field
were NFSv2 only. After all, it has only been 15 years...

  Trond





* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 11:35                                                 ` Steve Dickson
  2009-06-05 12:46                                                   ` Trond Myklebust
@ 2009-06-05 13:05                                                   ` Tom Talpey
       [not found]                                                   ` <4A29144A.6030405@gmail.com>
  2009-06-05 13:56                                                   ` Brian R Cowan
  3 siblings, 0 replies; 94+ messages in thread
From: Tom Talpey @ 2009-06-05 13:05 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Linux NFS Mailing List

On 6/5/2009 7:35 AM, Steve Dickson wrote:
> Brian R Cowan wrote:
>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009 02:04:58
>> PM:
>>
>>> Did you try turning off write gathering on the server (i.e. add the
>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>> Just tried it, this seems to be a very useful workaround as well. The
>> FILE_SYNC write calls come back in about the same amount of time as the
>> write+commit pairs... Speeds up building regardless of the network
>> filesystem (ClearCase MVFS or straight NFS).
>
> Does anybody have the history as to why 'no_wdelay' is an
> export default?

Because "wdelay" is a complete crock?

Adding 10ms to every write RPC only helps if there's a steady
single-file stream arriving at the server. In most other workloads
it only slows things down.

The better solution is to continue tuning the clients to issue
writes in a more sequential and less all-or-nothing fashion.
There are plenty of other less crock-ful things to do in the
server, too.

Tom.

> As Brian mentioned later in this thread
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
>
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why
> it was there in the first place...
>
> Neil, Greg??
>
> steved.



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                   ` <4A29144A.6030405@gmail.com>
@ 2009-06-05 13:30                                                     ` Steve Dickson
  2009-06-05 13:52                                                       ` Trond Myklebust
       [not found]                                                     ` <4A291D83.1000508@RedHat.com>
  1 sibling, 1 reply; 94+ messages in thread
From: Steve Dickson @ 2009-06-05 13:30 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Linux NFS Mailing list



Tom Talpey wrote:
> On 6/5/2009 7:35 AM, Steve Dickson wrote:
>> Brian R Cowan wrote:
>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
>>> 02:04:58
>>> PM:
>>>
>>>> Did you try turning off write gathering on the server (i.e. add the
>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>> Just tried it, this seems to be a very useful workaround as well. The
>>> FILE_SYNC write calls come back in about the same amount of time as the
>>> write+commit pairs... Speeds up building regardless of the network
>>> filesystem (ClearCase MVFS or straight NFS).
>>
>> Does anybody have the history as to why 'no_wdelay' is an
>> export default?
> 
> Because "wdelay" is a complete crock?
> 
> Adding 10ms to every write RPC only helps if there's a steady
> single-file stream arriving at the server. In most other workloads
> it only slows things down.
> 
> The better solution is to continue tuning the clients to issue
> writes in a more sequential and less all-or-nothing fashion.
> There are plenty of other less crock-ful things to do in the
> server, too.
Ok... So do you think removing it as a default would cause
any regressions?

steved.



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                     ` <4A291D83.1000508@RedHat.com>
@ 2009-06-05 13:50                                                       ` Tom Talpey
  2009-06-05 13:54                                                         ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: Tom Talpey @ 2009-06-05 13:50 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Linux NFS Mailing List

On 6/5/2009 9:28 AM, Steve Dickson wrote:
>
> Tom Talpey wrote:
>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
>>> Brian R Cowan wrote:
>>>> Trond Myklebust<trond.myklebust@fys.uio.no>   wrote on 06/04/2009
>>>> 02:04:58
>>>> PM:
>>>>
>>>>> Did you try turning off write gathering on the server (i.e. add the
>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>>> Just tried it, this seems to be a very useful workaround as well. The
>>>> FILE_SYNC write calls come back in about the same amount of time as the
>>>> write+commit pairs... Speeds up building regardless of the network
>>>> filesystem (ClearCase MVFS or straight NFS).
>>> Does anybody have the history as to why 'no_wdelay' is an
>>> export default?
>> Because "wdelay" is a complete crock?
>>
>> Adding 10ms to every write RPC only helps if there's a steady
>> single-file stream arriving at the server. In most other workloads
>> it only slows things down.
>>
>> The better solution is to continue tuning the clients to issue
>> writes in a more sequential and less all-or-nothing fashion.
>> There are plenty of other less crock-ful things to do in the
>> server, too.
> Ok... So do you think removing it as a default would cause
> any regressions?

I'm not 100% clear on what you mean by removing it. Since it's
a "no_" option, removing it means that "wdelay" becomes the
default? That would certainly cause a regression for many.

I think the big problem with tweaking the default in nfs-utils
is that there's little guarantee of the kernel behavior that
would result. Older kernels, NFSv2 mounts, etc will behave completely
differently from new ones, NFSv3, modified clients, etc. So
touching this option is quite risky, IMO, even though it's a
crock.

Tom.


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 13:30                                                     ` Steve Dickson
@ 2009-06-05 13:52                                                       ` Trond Myklebust
       [not found]                                                         ` <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 13:52 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Tom Talpey, Linux NFS Mailing list

On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> 
> Tom Talpey wrote:
> > On 6/5/2009 7:35 AM, Steve Dickson wrote:
> >> Brian R Cowan wrote:
> >>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> >>> 02:04:58
> >>> PM:
> >>>
> >>>> Did you try turning off write gathering on the server (i.e. add the
> >>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >>> Just tried it, this seems to be a very useful workaround as well. The
> >>> FILE_SYNC write calls come back in about the same amount of time as the
> >>> write+commit pairs... Speeds up building regardless of the network
> >>> filesystem (ClearCase MVFS or straight NFS).
> >>
> >> Does anybody have the history as to why 'no_wdelay' is an
> >> export default?
> > 
> > Because "wdelay" is a complete crock?
> > 
> > Adding 10ms to every write RPC only helps if there's a steady
> > single-file stream arriving at the server. In most other workloads
> > it only slows things down.
> > 
> > The better solution is to continue tuning the clients to issue
> > writes in a more sequential and less all-or-nothing fashion.
> > There are plenty of other less crock-ful things to do in the
> > server, too.
> Ok... So do you think removing it as a default would cause
> any regressions?

It might for NFSv2 clients, since they don't have the option of using
unstable writes. I'd therefore prefer a kernel solution that makes write
gathering an NFSv2 only feature.

Cheers
  Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 13:50                                                       ` Tom Talpey
@ 2009-06-05 13:54                                                         ` Trond Myklebust
  2009-06-05 13:58                                                           ` Tom Talpey
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 13:54 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Steve Dickson, Linux NFS Mailing List

On Fri, 2009-06-05 at 09:50 -0400, Tom Talpey wrote:
> I'm not 100% clear on what you mean by removing it. Since it's
> a "no_" option, removing it means that "wdelay" becomes the
> default? That would certainly cause a regression for many.

You've misunderstood. The current default is to _set_ 'wdelay' on all
exports that do not explicitly turn it off.

Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 11:35                                                 ` Steve Dickson
                                                                     ` (2 preceding siblings ...)
       [not found]                                                   ` <4A29144A.6030405@gmail.com>
@ 2009-06-05 13:56                                                   ` Brian R Cowan
  3 siblings, 0 replies; 94+ messages in thread
From: Brian R Cowan @ 2009-06-05 13:56 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Greg Banks, linux-nfs, Neil Brown

Actually wdelay is the export default, and I recall the man page saying 
something along the lines of doing this to allow the server to coalesce 
writes. Somewhere else (I think in another part of this thread) it's 
mentioned that the server will sit for up to 10ms waiting for other writes 
to this export. The reality is that wdelay+FILE_SYNC = up to a 10ms delay 
waiting for the write RPC to come back. That being said, I would rather 
leave this alone so that we don't accidentally impact something else. 
After all, the no_wdelay export option will work around it nicely in an 
all-Linux environment, and file pages don't flush with FILE_SYNC on 
2.6.29.

=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
 
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
 

Please be sure to update your PMR using ESR at 
http://www-306.ibm.com/software/support/probsub.html or cc all 
correspondence to sw_support@us.ibm.com to be sure your PMR is updated in 
case I am not available.



From:
Steve Dickson <SteveD@redhat.com>
To:
Neil Brown <neilb@suse.de>, Greg Banks <gnb@fmeh.org>
Cc:
Brian R Cowan/Cupertino/IBM@IBMUS, linux-nfs@vger.kernel.org
Date:
06/05/2009 07:38 AM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS 
I/O performance degraded by FLUSH_STABLE page flushing



Brian R Cowan wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58
> PM:
> 
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> 
> Just tried it, this seems to be a very useful workaround as well. The 
> FILE_SYNC write calls come back in about the same amount of time as the 
> write+commit pairs... Speeds up building regardless of the network 
> filesystem (ClearCase MVFS or straight NFS).

Does anybody have the history as to why 'no_wdelay' is an
export default? As Brian mentioned later in this thread
it only helps Linux servers, but that's a good thing, IMHO. ;-)

So I would have no problem changing the default export
options in nfs-utils, but it would be nice to know why
it was there in the first place...

Neil, Greg?? 

steved.




* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                         ` <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-06-05 13:57                                                           ` Steve Dickson
       [not found]                                                             ` <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Steve Dickson @ 2009-06-05 13:57 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Tom Talpey, Linux NFS Mailing list



Trond Myklebust wrote:
> On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
>> Tom Talpey wrote:
>>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
>>>> Brian R Cowan wrote:
>>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
>>>>> 02:04:58
>>>>> PM:
>>>>>
>>>>>> Did you try turning off write gathering on the server (i.e. add the
>>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>>>> Just tried it, this seems to be a very useful workaround as well. The
>>>>> FILE_SYNC write calls come back in about the same amount of time as the
>>>>> write+commit pairs... Speeds up building regardless of the network
>>>>> filesystem (ClearCase MVFS or straight NFS).
>>>> Does anybody have the history as to why 'no_wdelay' is an
>>>> export default?
>>> Because "wdelay" is a complete crock?
>>>
>>> Adding 10ms to every write RPC only helps if there's a steady
>>> single-file stream arriving at the server. In most other workloads
>>> it only slows things down.
>>>
>>> The better solution is to continue tuning the clients to issue
>>> writes in a more sequential and less all-or-nothing fashion.
>>> There are plenty of other less crock-ful things to do in the
>>> server, too.
>> Ok... So do you think removing it as a default would cause
>> any regressions?
> 
> It might for NFSv2 clients, since they don't have the option of using
> unstable writes. I'd therefore prefer a kernel solution that makes write
> gathering an NFSv2 only feature.
Sounds good to me! ;-)

steved.


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 13:54                                                         ` Trond Myklebust
@ 2009-06-05 13:58                                                           ` Tom Talpey
  0 siblings, 0 replies; 94+ messages in thread
From: Tom Talpey @ 2009-06-05 13:58 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, Linux NFS Mailing List

On 6/5/2009 9:54 AM, Trond Myklebust wrote:
> On Fri, 2009-06-05 at 09:50 -0400, Tom Talpey wrote:
>> I'm not 100% clear on what you mean by removing it. Since it's
>> a "no_" option, removing it means that "wdelay" becomes the
>> default? That would certainly cause a regression for many.
>
> You've misunderstood. The current default is to _set_ 'wdelay' on all
> exports that do not explicitly turn it off.

Ok, then turning it off will help some and hurt some. There's no
right setting for all. I do agree that fixing the server is the
best solution, not grabbing wildly at its crockful controls.

Tom.



* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-01 22:30                           ` J. Bruce Fields
@ 2009-06-05 14:54                             ` Christoph Hellwig
  2009-06-05 16:01                               ` J. Bruce Fields
  2009-06-05 16:12                               ` Trond Myklebust
  0 siblings, 2 replies; 94+ messages in thread
From: Christoph Hellwig @ 2009-06-05 14:54 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Krishna Kumar, Greg Banks, Trond Myklebust, Brian R Cowan,
	Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach,
	Christoph Hellwig

On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > NFSD stops calling ->fsync without a file struct.
> > 
> > I think the open file cache will help us with that, if we can extend
> > it to also cache open file structs for directories.
> 
> Krishna Kumar--do you think that'd be a reasonable thing to do?

Btw, do you have at least the basic open files cache queue for 2.6.31?



* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 14:54                             ` Christoph Hellwig
@ 2009-06-05 16:01                               ` J. Bruce Fields
  2009-06-05 16:12                               ` Trond Myklebust
  1 sibling, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-05 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Krishna Kumar, Greg Banks, Trond Myklebust, Brian R Cowan,
	Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, Jun 05, 2009 at 10:54:50AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > NFSD stops calling ->fsync without a file struct.
> > > 
> > > I think the open file cache will help us with that, if we can extend
> > > it to also cache open file structs for directories.
> > 
> > Krishna Kumar--do you think that'd be a reasonable thing to do?
> 
> Btw, do you have at least the basic open files cache queue for 2.6.31?

No.  I'll try to give it a look this afternoon.

--b.


* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                             ` <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org>
@ 2009-06-05 16:05                                                               ` J. Bruce Fields
  2009-06-05 16:35                                                                 ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-05 16:05 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Trond Myklebust, Tom Talpey, Linux NFS Mailing list

On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> 
> 
> Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> >> Tom Talpey wrote:
> >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> >>>> Brian R Cowan wrote:
> >>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> >>>>> 02:04:58
> >>>>> PM:
> >>>>>
> >>>>>> Did you try turning off write gathering on the server (i.e. add the
> >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >>>>> Just tried it, this seems to be a very useful workaround as well. The
> >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> >>>>> write+commit pairs... Speeds up building regardless of the network
> >>>>> filesystem (ClearCase MVFS or straight NFS).
> >>>> Does anybody have the history as to why 'no_wdelay' is an
> >>>> export default?
> >>> Because "wdelay" is a complete crock?
> >>>
> >>> Adding 10ms to every write RPC only helps if there's a steady
> >>> single-file stream arriving at the server. In most other workloads
> >>> it only slows things down.
> >>>
> >>> The better solution is to continue tuning the clients to issue
> >>> writes in a more sequential and less all-or-nothing fashion.
> >>> There are plenty of other less crock-ful things to do in the
> >>> server, too.
> >> Ok... So do you think removing it as a default would cause
> >> any regressions?
> > 
> > It might for NFSv2 clients, since they don't have the option of using
> > unstable writes. I'd therefore prefer a kernel solution that makes write
> > gathering an NFSv2 only feature.
> Sounds good to me! ;-)

Patch welcomed.--b.


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 14:54                             ` Christoph Hellwig
  2009-06-05 16:01                               ` J. Bruce Fields
@ 2009-06-05 16:12                               ` Trond Myklebust
       [not found]                                 ` <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 16:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: J. Bruce Fields, Krishna Kumar, Greg Banks, Brian R Cowan,
	Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > NFSD stops calling ->fsync without a file struct.
> > > 
> > > I think the open file cache will help us with that, if we can extend
> > > it to also cache open file structs for directories.
> > 
> > Krishna Kumar--do you think that'd be a reasonable thing to do?
> 
> Btw, do you have at least the basic open files cache queue for 2.6.31?
> 

Now that _will_ badly screw up the write gathering heuristic...

Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 16:05                                                               ` J. Bruce Fields
@ 2009-06-05 16:35                                                                 ` Trond Myklebust
       [not found]                                                                   ` <1244219715.5410.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 16:35 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > 
> > 
> > Trond Myklebust wrote:
> > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > >> Tom Talpey wrote:
> > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > >>>> Brian R Cowan wrote:
> > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> > >>>>> 02:04:58
> > >>>>> PM:
> > >>>>>
> > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > >>>>> write+commit pairs... Speeds up building regardless of the network
> > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > >>>> Does anybody have the history as to why 'no_wdelay' is an
> > >>>> export default?
> > >>> Because "wdelay" is a complete crock?
> > >>>
> > >>> Adding 10ms to every write RPC only helps if there's a steady
> > >>> single-file stream arriving at the server. In most other workloads
> > >>> it only slows things down.
> > >>>
> > >>> The better solution is to continue tuning the clients to issue
> > >>> writes in a more sequential and less all-or-nothing fashion.
> > >>> There are plenty of other less crock-ful things to do in the
> > >>> server, too.
> > >> Ok... So do you think removing it as a default would cause
> > >> any regressions?
> > > 
> > > It might for NFSv2 clients, since they don't have the option of using
> > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > gathering an NFSv2 only feature.
> > Sounds good to me! ;-)
> 
> Patch welcomed.--b.

Something like this ought to suffice...

-----------------------------------------------------------------------
From: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSD: Make sure that write gathering only applies to NFSv2

NFSv3 and above can use unstable writes whenever they are sending more
than one write, rather than relying on the flaky write gathering
heuristics. More often than not, write gathering is currently getting it
wrong when the NFSv3 clients are sending a single write with FILE_SYNC
for efficiency reasons.

This patch turns off write gathering for NFSv3/v4, and ensures that
it only applies to the one case that can actually benefit: namely NFSv2.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/nfsd/vfs.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)


diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b660435..f30cc4e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -975,6 +975,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	__be32			err = 0;
 	int			host_err;
 	int			stable = *stablep;
+	int			use_wgather;
 
 #ifdef MSNFS
 	err = nfserr_perm;
@@ -993,9 +994,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	 *  -	the sync export option has been set, or
 	 *  -	the client requested O_SYNC behavior (NFSv3 feature).
 	 *  -   The file system doesn't support fsync().
-	 * When gathered writes have been configured for this volume,
+	 * When NFSv2 gathered writes have been configured for this volume,
 	 * flushing the data to disk is handled separately below.
 	 */
+	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
 
 	if (!file->f_op->fsync) {/* COMMIT3 cannot work */
 	       stable = 2;
@@ -1004,7 +1006,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	if (!EX_ISSYNC(exp))
 		stable = 0;
-	if (stable && !EX_WGATHER(exp)) {
+	if (stable && !use_wgather) {
 		spin_lock(&file->f_lock);
 		file->f_flags |= O_SYNC;
 		spin_unlock(&file->f_lock);
@@ -1040,7 +1042,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 		 * nice and simple solution (IMHO), and it seems to
 		 * work:-)
 		 */
-		if (EX_WGATHER(exp)) {
+		if (use_wgather) {
 			if (atomic_read(&inode->i_writecount) > 1
 			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
 				dprintk("nfsd: write defer %d\n", task_pid_nr(current));




* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                 ` <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-06-05 19:54                                   ` J. Bruce Fields
  2009-06-05 21:21                                     ` Trond Myklebust
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-05 19:54 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Krishna Kumar, Greg Banks, Brian R Cowan,
	Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote:
> On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > > NFSD stops calling ->fsync without a file struct.
> > > > 
> > > > I think the open file cache will help us with that, if we can extend
> > > > it to also cache open file structs for directories.
> > > 
> > > Krishna Kumar--do you think that'd be a reasonable thing to do?
> > 
> > Btw, do you have at least the basic open files cache queue for 2.6.31?
> > 
> 
> Now that _will_ badly screw up the write gathering heuristic...

How?

--b.


* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-05 19:54                                   ` J. Bruce Fields
@ 2009-06-05 21:21                                     ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-06-05 21:21 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christoph Hellwig, Krishna Kumar, Greg Banks, Brian R Cowan,
	Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-06-05 at 15:54 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> > > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > > > NFSD stops calling ->fsync without a file struct.
> > > > > 
> > > > > I think the open file cache will help us with that, if we can extend
> > > > > it to also cache open file structs for directories.
> > > > 
> > > > Krishna Kumar--do you think that'd be a reasonable thing to do?
> > > 
> > > Btw, do you have at least the basic open files cache queue for 2.6.31?
> > > 
> > 
> > Now that _will_ badly screw up the write gathering heuristic...
> 
> How?
> 

The heuristic looks at inode->i_writecount in order to figure out how
many nfsd threads are currently trying to write to the file. The
reference to i_writecount is held by the struct file.
The problem is that if you start sharing struct file among several nfsd
threads by means of a cache, then the i_writecount will not change, and
so the heuristic fails.

While we won't miss it much in NFSv3 and v4, it may change the
performance of the few systems out there that still believe NFSv2 is the
best thing since sliced bread...

  Trond



* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                                   ` <1244219715.5410.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-06-15 23:08                                                                     ` J. Bruce Fields
  2009-06-16  0:21                                                                       ` NeilBrown
  2009-06-16  0:32                                                                       ` Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust
  0 siblings, 2 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-15 23:08 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > 
> > > 
> > > Trond Myklebust wrote:
> > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > >> Tom Talpey wrote:
> > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > >>>> Brian R Cowan wrote:
> > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> > > >>>>> 02:04:58
> > > >>>>> PM:
> > > >>>>>
> > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > >>>> Does anybody have the history as to why 'no_wdelay' is an
> > > >>>> export default?
> > > >>> Because "wdelay" is a complete crock?
> > > >>>
> > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > >>> single-file stream arriving at the server. In most other workloads
> > > >>> it only slows things down.
> > > >>>
> > > >>> The better solution is to continue tuning the clients to issue
> > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > >>> There are plenty of other less crock-ful things to do in the
> > > >>> server, too.
> > > >> Ok... So do you think removing it as a default would cause
> > > >> any regressions?
> > > > 
> > > > It might for NFSv2 clients, since they don't have the option of using
> > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > gathering an NFSv2 only feature.
> > > Sounds good to me! ;-)
> > 
> > Patch welcomed.--b.
> 
> Something like this ought to suffice...

Thanks, applied.

I'd also like to apply a cleanup something like the following--there's
probably some cleaner way, but it just bothers me to have this
write-gathering special case take up the bulk of nfsd_vfs_write....

--b.

commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
Author: J. Bruce Fields <bfields@citi.umich.edu>
Date:   Mon Jun 15 16:03:53 2009 -0700

    nfsd: Pull write-gathering code out of nfsd_vfs_write
    
    This is a relatively self-contained piece of code that handles a special
    case--move it to its own function.
    
    Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a8aac7f..de68557 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
 	mutex_unlock(&dentry->d_inode->i_mutex);
 }
 
+/*
+ * Gathered writes: If another process is currently writing to the file,
+ * there's a high chance this is another nfsd (triggered by a bulk write
+ * from a client's biod). Rather than syncing the file with each write
+ * request, we sleep for 10 msec.
+ *
+ * I don't know if this roughly approximates C. Juszak's idea of
+ * gathered writes, but it's a nice and simple solution (IMHO), and it
+ * seems to work:-)
+ *
+ * Note: we do this only in the NFSv2 case, since v3 and higher have a
+ * better tool (separate unstable writes and commits) for solving this
+ * problem.
+ */
+static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	static ino_t last_ino;
+	static dev_t last_dev;
+
+	if (!use_wgather)
+		goto out;
+	if (atomic_read(&inode->i_writecount) > 1
+	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+		msleep(10);
+		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+	}
+
+	if (inode->i_state & I_DIRTY) {
+		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+		*host_err = nfsd_sync(file);
+	}
+out:
+	last_ino = inode->i_ino;
+	last_dev = inode->i_sb->s_dev;
+}
+
 static __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 				loff_t offset, struct kvec *vec, int vlen,
@@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
 		kill_suid(dentry);
 
-	if (host_err >= 0 && stable) {
-		static ino_t	last_ino;
-		static dev_t	last_dev;
-
-		/*
-		 * Gathered writes: If another process is currently
-		 * writing to the file, there's a high chance
-		 * this is another nfsd (triggered by a bulk write
-		 * from a client's biod). Rather than syncing the
-		 * file with each write request, we sleep for 10 msec.
-		 *
-		 * I don't know if this roughly approximates
-		 * C. Juszak's idea of gathered writes, but it's a
-		 * nice and simple solution (IMHO), and it seems to
-		 * work:-)
-		 */
-		if (use_wgather) {
-			if (atomic_read(&inode->i_writecount) > 1
-			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
-				dprintk("nfsd: write defer %d\n", task_pid_nr(current));
-				msleep(10);
-				dprintk("nfsd: write resume %d\n", task_pid_nr(current));
-			}
-
-			if (inode->i_state & I_DIRTY) {
-				dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-				host_err=nfsd_sync(file);
-			}
-#if 0
-			wake_up(&inode->i_wait);
-#endif
-		}
-		last_ino = inode->i_ino;
-		last_dev = inode->i_sb->s_dev;
-	}
+	if (host_err >= 0 && stable)
+		wait_for_concurrent_writes(file, use_wgather, &host_err);
 
 	dprintk("nfsd: write complete host_err=%d\n", host_err);
 	if (host_err >= 0) {

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-15 23:08                                                                     ` J. Bruce Fields
@ 2009-06-16  0:21                                                                       ` NeilBrown
       [not found]                                                                         ` <99d4545537613ce76040d3655b78bdb7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  2009-06-16  0:32                                                                       ` Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: NeilBrown @ 2009-06-16  0:21 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:

> +	if (host_err >= 0 && stable)
> +		wait_for_concurrent_writes(file, use_wgather, &host_err);
>

Surely you want this to be:

   if (host_err >= 0 && stable && use_wgather)
         host_err = wait_for_concurrent_writes(file);
as
 - this is more readable
 - setting last_ino and last_dev is pointless when !use_wgather
 - we aren't interested in differentiating between non-negative values of
   host_err.

NeilBrown


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-15 23:08                                                                     ` J. Bruce Fields
  2009-06-16  0:21                                                                       ` NeilBrown
@ 2009-06-16  0:32                                                                       ` Trond Myklebust
       [not found]                                                                         ` <1245112324.7470.7.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-16  0:32 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > > 
> > > > 
> > > > Trond Myklebust wrote:
> > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > > >> Tom Talpey wrote:
> > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > > >>>> Brian R Cowan wrote:
> > > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> > > > >>>>> 02:04:58
> > > > >>>>> PM:
> > > > >>>>>
> > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > > >>>> Does anybody have the history as to why 'wdelay' is an
> > > > >>>> export default?
> > > > >>> Because "wdelay" is a complete crock?
> > > > >>>
> > > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > > >>> single-file stream arriving at the server. In most other workloads
> > > > >>> it only slows things down.
> > > > >>>
> > > > >>> The better solution is to continue tuning the clients to issue
> > > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > > >>> There are plenty of other less crock-ful things to do in the
> > > > >>> server, too.
> > > > >> Ok... So do you think removing it as a default would cause
> > > > >> any regressions?
> > > > > 
> > > > > It might for NFSv2 clients, since they don't have the option of using
> > > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > > gathering an NFSv2 only feature.
> > > > Sounds good to me! ;-)
> > > 
> > > Patch welcomed.--b.
> > 
> > Something like this ought to suffice...
> 
> Thanks, applied.
> 
> I'd also like to apply a cleanup something like the following--there's
> probably some cleaner way, but it just bothers me to have this
> write-gathering special case take up the bulk of nfsd_vfs_write....
> 
> --b.
> 
> commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
> Author: J. Bruce Fields <bfields@citi.umich.edu>
> Date:   Mon Jun 15 16:03:53 2009 -0700
> 
>     nfsd: Pull write-gathering code out of nfsd_vfs_write
>     
>     This is a relatively self-contained piece of code that handles a special
>     case--move it to its own function.
>     
>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index a8aac7f..de68557 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
>  	mutex_unlock(&dentry->d_inode->i_mutex);
>  }
>  
> +/*
> + * Gathered writes: If another process is currently writing to the file,
> + * there's a high chance this is another nfsd (triggered by a bulk write
> + * from a client's biod). Rather than syncing the file with each write
> + * request, we sleep for 10 msec.
> + *
> + * I don't know if this roughly approximates C. Juszak's idea of
> + * gathered writes, but it's a nice and simple solution (IMHO), and it
> + * seems to work:-)
> + *
> + * Note: we do this only in the NFSv2 case, since v3 and higher have a
> + * better tool (separate unstable writes and commits) for solving this
> + * problem.
> + */
> +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
> +{
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	static ino_t last_ino;
> +	static dev_t last_dev;
> +
> +	if (!use_wgather)
> +		goto out;
> +	if (atomic_read(&inode->i_writecount) > 1
> +	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> +		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> +		msleep(10);
> +		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> +	}
> +
> +	if (inode->i_state & I_DIRTY) {
> +		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> +		*host_err = nfsd_sync(file);
> +	}
> +out:
> +	last_ino = inode->i_ino;
> +	last_dev = inode->i_sb->s_dev;
> +}

Shouldn't you also timestamp the last_ino/last_dev? Currently you can
end up waiting even if the last time you referenced this file was 10
minutes ago...

> +
>  static __be32
>  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
>  				loff_t offset, struct kvec *vec, int vlen,
> @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
>  	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
>  		kill_suid(dentry);
>  
> -	if (host_err >= 0 && stable) {
> -		static ino_t	last_ino;
> -		static dev_t	last_dev;
> -
> -		/*
> -		 * Gathered writes: If another process is currently
> -		 * writing to the file, there's a high chance
> -		 * this is another nfsd (triggered by a bulk write
> -		 * from a client's biod). Rather than syncing the
> -		 * file with each write request, we sleep for 10 msec.
> -		 *
> -		 * I don't know if this roughly approximates
> -		 * C. Juszak's idea of gathered writes, but it's a
> -		 * nice and simple solution (IMHO), and it seems to
> -		 * work:-)
> -		 */
> -		if (use_wgather) {
> -			if (atomic_read(&inode->i_writecount) > 1
> -			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> -				dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> -				msleep(10);
> -				dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> -			}
> -
> -			if (inode->i_state & I_DIRTY) {
> -				dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> -				host_err=nfsd_sync(file);
> -			}
> -#if 0
> -			wake_up(&inode->i_wait);
> -#endif
> -		}
> -		last_ino = inode->i_ino;
> -		last_dev = inode->i_sb->s_dev;
> -	}
> +	if (host_err >= 0 && stable)
> +		wait_for_concurrent_writes(file, use_wgather, &host_err);
>  
>  	dprintk("nfsd: write complete host_err=%d\n", host_err);
>  	if (host_err >= 0) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                                         ` <99d4545537613ce76040d3655b78bdb7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-06-16  0:33                                                                           ` J. Bruce Fields
  2009-06-16  0:50                                                                             ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-16  0:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
> 
> > +	if (host_err >= 0 && stable)
> > +		wait_for_concurrent_writes(file, use_wgather, &host_err);
> >
> 
> Surely you want this to be:
> 
>    if (host_err >= 0 && stable && use_wgather)
>          host_err = wait_for_concurrent_writes(file);
> as
>  - this is more readable
>  - setting last_ino and last_dev is pointless when !use_wgather

Yep, thanks.

>  - we aren't interested in differentiating between non-negative values of
>    host_err.

Unfortunately, just below:

	if (host_err >= 0) {
		err = 0;
		*cnt = host_err;
	} else
		err = nfserrno(host_err);

We could save that count earlier, e.g.:

@@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
        int                     host_err;
        int                     stable = *stablep;
        int                     use_wgather;
+       int                     bytes;
 
 #ifdef MSNFS
        err = nfserr_perm;
@@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
        set_fs(oldfs);
        if (host_err >= 0) {
                nfsdstats.io_write += host_err;
+               bytes = host_err;
                fsnotify_modify(file->f_path.dentry);
        }
 
@@ -1063,13 +1064,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fh
        if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
                kill_suid(dentry);
 
-       if (host_err >= 0 && stable)
-               wait_for_concurrent_writes(file, use_wgather, &host_err);
+       if (host_err >= 0 && stable && use_wgather)
+               host_err = wait_for_concurrent_writes(file);
 
        dprintk("nfsd: write complete host_err=%d\n", host_err);
        if (host_err >= 0) {
                err = 0;
-               *cnt = host_err;
+               *cnt = bytes;
        } else
                err = nfserrno(host_err);
 out:

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-16  0:33                                                                           ` J. Bruce Fields
@ 2009-06-16  0:50                                                                             ` NeilBrown
       [not found]                                                                               ` <02ada87c636e1088e9365a3cbea301e7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2009-06-16  0:50 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote:
> On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
>> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
>>
>> > +	if (host_err >= 0 && stable)
>> > +		wait_for_concurrent_writes(file, use_wgather, &host_err);
>> >
>>
>> Surely you want this to be:
>>
>>    if (host_err >= 0 && stable && use_wgather)
>>          host_err = wait_for_concurrent_writes(file);
>> as
>>  - this is more readable
>>  - setting last_ino and last_dev is pointless when !use_wgather
>
> Yep, thanks.
>
>>  - we aren't interested in differentiating between non-negative
>>    values of host_err.
>
> Unfortunately, just below:
>
> 	if (host_err >= 0) {
> 		err = 0;
> 		*cnt = host_err;
> 	} else
> 		err = nfserrno(host_err);
>

Ahh.... that must be in code you haven't pushed out yet.
I don't see it in mainline or git.linux-nfs.org

> We could save that count earlier, e.g.:
>
> @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> *fhp,
>         int                     host_err;
>         int                     stable = *stablep;
>         int                     use_wgather;
> +       int                     bytes;
>
>  #ifdef MSNFS
>         err = nfserr_perm;
> @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> *fhp,
>         set_fs(oldfs);
>         if (host_err >= 0) {
>                 nfsdstats.io_write += host_err;
> +               bytes = host_err;
>                 fsnotify_modify(file->f_path.dentry);

Or even

   if (host_err >= 0) {
          bytes = host_err;
          nfsdstats.io_write += bytes;
           ...

And if you did that in whatever patch moves the assignment to
*cnt to the bottom of the function, it might be even more readable!

Thanks,
NeilBrown



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                                               ` <02ada87c636e1088e9365a3cbea301e7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-06-16  0:55                                                                                 ` J. Bruce Fields
  2009-06-17 16:54                                                                                   ` J. Bruce Fields
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-16  0:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Tue, Jun 16, 2009 at 10:50:57AM +1000, NeilBrown wrote:
> On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote:
> > On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
> >> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
> >>
> >> > +	if (host_err >= 0 && stable)
> >> > +		wait_for_concurrent_writes(file, use_wgather, &host_err);
> >> >
> >>
> >> Surely you want this to be:
> >>
> >>    if (host_err >= 0 && stable && use_wgather)
> >>          host_err = wait_for_concurrent_writes(file);
> >> as
> >>  - this is more readable
> >>  - setting last_ino and last_dev is pointless when !use_wgather
> >
> > Yep, thanks.
> >
> >>  - we aren't interested in differentiating between non-negative
> >>    values of host_err.
> >
> > Unfortunately, just below:
> >
> > 	if (host_err >= 0) {
> > 		err = 0;
> > 		*cnt = host_err;
> > 	} else
> > 		err = nfserrno(host_err);
> >
> 
> Ahh.... that must be in code you haven't pushed out yet.
> I don't see it in mainline or git.linux-nfs.org

Whoops--actually, it's the opposite problem: a bugfix patch that went
upstream removed this, and I didn't merge that back into my for-2.6.31
branch.  OK, time to do that, and then this is all much simpler....
Thanks for calling my attention to that!

--b.

> 
> > We could save that count earlier, e.g.:
> >
> > @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> > *fhp,
> >         int                     host_err;
> >         int                     stable = *stablep;
> >         int                     use_wgather;
> > +       int                     bytes;
> >
> >  #ifdef MSNFS
> >         err = nfserr_perm;
> > @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> > *fhp,
> >         set_fs(oldfs);
> >         if (host_err >= 0) {
> >                 nfsdstats.io_write += host_err;
> > +               bytes = host_err;
> >                 fsnotify_modify(file->f_path.dentry);
> 
> Or even
> 
>    if (host_err >= 0) {
>           bytes = host_err;
>           nfsdstats.io_write += bytes
>            ...
> 
> And if you did that in whatever patch move the assignment to
> *cnt to the bottom of the function, it might be even more readable!
> 
> Thanks,
> NeilBrown
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
       [not found]                                                                         ` <1245112324.7470.7.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-06-16  2:02                                                                           ` J. Bruce Fields
  0 siblings, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-16  2:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Mon, Jun 15, 2009 at 05:32:04PM -0700, Trond Myklebust wrote:
> On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> > > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > > > 
> > > > > 
> > > > > Trond Myklebust wrote:
> > > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > > > >> Tom Talpey wrote:
> > > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > > > >>>> Brian R Cowan wrote:
> > > > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no>  wrote on 06/04/2009
> > > > > >>>>> 02:04:58
> > > > > >>>>> PM:
> > > > > >>>>>
> > > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > > > >>>> Does anybody have the history as to why 'wdelay' is an
> > > > > >>>> export default?
> > > > > >>> Because "wdelay" is a complete crock?
> > > > > >>>
> > > > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > > > >>> single-file stream arriving at the server. In most other workloads
> > > > > >>> it only slows things down.
> > > > > >>>
> > > > > >>> The better solution is to continue tuning the clients to issue
> > > > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > > > >>> There are plenty of other less crock-ful things to do in the
> > > > > >>> server, too.
> > > > > >> Ok... So do you think removing it as a default would cause
> > > > > >> any regressions?
> > > > > > 
> > > > > > It might for NFSv2 clients, since they don't have the option of using
> > > > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > > > gathering an NFSv2 only feature.
> > > > > Sounds good to me! ;-)
> > > > 
> > > > Patch welcomed.--b.
> > > 
> > > Something like this ought to suffice...
> > 
> > Thanks, applied.
> > 
> > I'd also like to apply a cleanup something like the following--there's
> > probably some cleaner way, but it just bothers me to have this
> > write-gathering special case take up the bulk of nfsd_vfs_write....
> > 
> > --b.
> > 
> > commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
> > Author: J. Bruce Fields <bfields@citi.umich.edu>
> > Date:   Mon Jun 15 16:03:53 2009 -0700
> > 
> >     nfsd: Pull write-gathering code out of nfsd_vfs_write
> >     
> >     This is a relatively self-contained piece of code that handles a special
> >     case--move it to its own function.
> >     
> >     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index a8aac7f..de68557 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
> >  	mutex_unlock(&dentry->d_inode->i_mutex);
> >  }
> >  
> > +/*
> > + * Gathered writes: If another process is currently writing to the file,
> > + * there's a high chance this is another nfsd (triggered by a bulk write
> > + * from a client's biod). Rather than syncing the file with each write
> > + * request, we sleep for 10 msec.
> > + *
> > + * I don't know if this roughly approximates C. Juszak's idea of
> > + * gathered writes, but it's a nice and simple solution (IMHO), and it
> > + * seems to work:-)
> > + *
> > + * Note: we do this only in the NFSv2 case, since v3 and higher have a
> > + * better tool (separate unstable writes and commits) for solving this
> > + * problem.
> > + */
> > +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
> > +{
> > +	struct inode *inode = file->f_path.dentry->d_inode;
> > +	static ino_t last_ino;
> > +	static dev_t last_dev;
> > +
> > +	if (!use_wgather)
> > +		goto out;
> > +	if (atomic_read(&inode->i_writecount) > 1
> > +	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > +		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > +		msleep(10);
> > +		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > +	}
> > +
> > +	if (inode->i_state & I_DIRTY) {
> > +		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > +		*host_err = nfsd_sync(file);
> > +	}
> > +out:
> > +	last_ino = inode->i_ino;
> > +	last_dev = inode->i_sb->s_dev;
> > +}
> 
> Shouldn't you also timestamp the last_ino/last_dev? Currently you can
> end up waiting even if the last time you referenced this file was 10
> minutes ago...

Maybe, but I don't know that avoiding the delay in the case where
use_wgather writes arrive only rarely is particularly important.

(Note this is just a single static last_ino/last_dev, so the timestamp
would just tell us how long ago the last use_wgather write happened.)

I'm not as interested in making wdelay work better--someone who uses v2
and wants to benchmark it can do that--as I am interested in just
getting it out of the way so I don't have to look at it again....

--b.

> 
> > +
> >  static __be32
> >  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> >  				loff_t offset, struct kvec *vec, int vlen,
> > @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> >  	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
> >  		kill_suid(dentry);
> >  
> > -	if (host_err >= 0 && stable) {
> > -		static ino_t	last_ino;
> > -		static dev_t	last_dev;
> > -
> > -		/*
> > -		 * Gathered writes: If another process is currently
> > -		 * writing to the file, there's a high chance
> > -		 * this is another nfsd (triggered by a bulk write
> > -		 * from a client's biod). Rather than syncing the
> > -		 * file with each write request, we sleep for 10 msec.
> > -		 *
> > -		 * I don't know if this roughly approximates
> > -		 * C. Juszak's idea of gathered writes, but it's a
> > -		 * nice and simple solution (IMHO), and it seems to
> > -		 * work:-)
> > -		 */
> > -		if (use_wgather) {
> > -			if (atomic_read(&inode->i_writecount) > 1
> > -			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > -				dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > -				msleep(10);
> > -				dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > -			}
> > -
> > -			if (inode->i_state & I_DIRTY) {
> > -				dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > -				host_err=nfsd_sync(file);
> > -			}
> > -#if 0
> > -			wake_up(&inode->i_wait);
> > -#endif
> > -		}
> > -		last_ino = inode->i_ino;
> > -		last_dev = inode->i_sb->s_dev;
> > -	}
> > +	if (host_err >= 0 && stable)
> > +		wait_for_concurrent_writes(file, use_wgather, &host_err);
> >  
> >  	dprintk("nfsd: write complete host_err=%d\n", host_err);
> >  	if (host_err >= 0) {
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
  2009-06-16  0:55                                                                                 ` J. Bruce Fields
@ 2009-06-17 16:54                                                                                   ` J. Bruce Fields
  2009-06-17 16:59                                                                                     ` [PATCH 1/3] nfsd: track last inode only in use_wgather case J. Bruce Fields
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-17 16:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Mon, Jun 15, 2009 at 08:55:58PM -0400, bfields wrote:
> Whoops--actually, it's the opposite problem: a bugfix patch that went
> upstream removed this, and I didn't merge that back into my for-2.6.31
> branch.  OK, time to do that, and then this is all much simpler....
> Thanks for calling my attention to that!

Having fixed that... the following is what I'm applying (on top of
Trond's).

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 1/3] nfsd: track last inode only in use_wgather case
  2009-06-17 16:54                                                                                   ` J. Bruce Fields
@ 2009-06-17 16:59                                                                                     ` J. Bruce Fields
  2009-06-17 16:59                                                                                       ` [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write J. Bruce Fields
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey,
	Linux NFS Mailing list, J. Bruce Fields

From: J. Bruce Fields <bfields@citi.umich.edu>

Updating last_ino and last_dev probably isn't useful in the !use_wgather
case.

Also remove some pointless ifdef'd-out code.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
---
 fs/nfsd/vfs.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f30cc4e..ebf56c6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1026,7 +1026,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
 		kill_suid(dentry);
 
-	if (host_err >= 0 && stable) {
+	if (host_err >= 0 && stable && use_wgather) {
 		static ino_t	last_ino;
 		static dev_t	last_dev;
 
@@ -1042,21 +1042,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 		 * nice and simple solution (IMHO), and it seems to
 		 * work:-)
 		 */
-		if (use_wgather) {
-			if (atomic_read(&inode->i_writecount) > 1
-			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
-				dprintk("nfsd: write defer %d\n", task_pid_nr(current));
-				msleep(10);
-				dprintk("nfsd: write resume %d\n", task_pid_nr(current));
-			}
+		if (atomic_read(&inode->i_writecount) > 1
+		    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+			dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+			msleep(10);
+			dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+		}
 
-			if (inode->i_state & I_DIRTY) {
-				dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-				host_err=nfsd_sync(file);
-			}
-#if 0
-			wake_up(&inode->i_wait);
-#endif
+		if (inode->i_state & I_DIRTY) {
+			dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+			host_err=nfsd_sync(file);
 		}
 		last_ino = inode->i_ino;
 		last_dev = inode->i_sb->s_dev;
-- 
1.6.0.4



* [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write
  2009-06-17 16:59                                                                                     ` [PATCH 1/3] nfsd: track last inode only in use_wgather case J. Bruce Fields
@ 2009-06-17 16:59                                                                                       ` J. Bruce Fields
  2009-06-17 16:59                                                                                         ` [PATCH 3/3] nfsd: minor nfsd_vfs_write cleanup J. Bruce Fields
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey,
	Linux NFS Mailing list, J. Bruce Fields

From: J. Bruce Fields <bfields@citi.umich.edu>

This is a relatively self-contained piece of code that handles a special
case--move it to its own function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
---
 fs/nfsd/vfs.c |   69 ++++++++++++++++++++++++++++++++------------------------
 1 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ebf56c6..6ad76a4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -963,6 +963,43 @@ static void kill_suid(struct dentry *dentry)
 	mutex_unlock(&dentry->d_inode->i_mutex);
 }
 
+/*
+ * Gathered writes: If another process is currently writing to the file,
+ * there's a high chance this is another nfsd (triggered by a bulk write
+ * from a client's biod). Rather than syncing the file with each write
+ * request, we sleep for 10 msec.
+ *
+ * I don't know if this roughly approximates C. Juszak's idea of
+ * gathered writes, but it's a nice and simple solution (IMHO), and it
+ * seems to work:-)
+ *
+ * Note: we do this only in the NFSv2 case, since v3 and higher have a
+ * better tool (separate unstable writes and commits) for solving this
+ * problem.
+ */
+static int wait_for_concurrent_writes(struct file *file)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	static ino_t last_ino;
+	static dev_t last_dev;
+	int err = 0;
+
+	if (atomic_read(&inode->i_writecount) > 1
+	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+		msleep(10);
+		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+	}
+
+	if (inode->i_state & I_DIRTY) {
+		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+		err = nfsd_sync(file);
+	}
+	last_ino = inode->i_ino;
+	last_dev = inode->i_sb->s_dev;
+	return err;
+}
+
 static __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 				loff_t offset, struct kvec *vec, int vlen,
@@ -1026,36 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
 		kill_suid(dentry);
 
-	if (host_err >= 0 && stable && use_wgather) {
-		static ino_t	last_ino;
-		static dev_t	last_dev;
-
-		/*
-		 * Gathered writes: If another process is currently
-		 * writing to the file, there's a high chance
-		 * this is another nfsd (triggered by a bulk write
-		 * from a client's biod). Rather than syncing the
-		 * file with each write request, we sleep for 10 msec.
-		 *
-		 * I don't know if this roughly approximates
-		 * C. Juszak's idea of gathered writes, but it's a
-		 * nice and simple solution (IMHO), and it seems to
-		 * work:-)
-		 */
-		if (atomic_read(&inode->i_writecount) > 1
-		    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
-			dprintk("nfsd: write defer %d\n", task_pid_nr(current));
-			msleep(10);
-			dprintk("nfsd: write resume %d\n", task_pid_nr(current));
-		}
-
-		if (inode->i_state & I_DIRTY) {
-			dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-			host_err=nfsd_sync(file);
-		}
-		last_ino = inode->i_ino;
-		last_dev = inode->i_sb->s_dev;
-	}
+	if (host_err >= 0 && stable && use_wgather)
+		host_err = wait_for_concurrent_writes(file);
 
 	dprintk("nfsd: write complete host_err=%d\n", host_err);
 	if (host_err >= 0)
-- 
1.6.0.4



* [PATCH 3/3] nfsd: minor nfsd_vfs_write cleanup
  2009-06-17 16:59                                                                                       ` [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write J. Bruce Fields
@ 2009-06-17 16:59                                                                                         ` J. Bruce Fields
  0 siblings, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Steve Dickson, Tom Talpey,
	Linux NFS Mailing list, J. Bruce Fields

From: J. Bruce Fields <bfields@citi.umich.edu>

There's no need to check host_err >= 0 every time here when we could
check host_err < 0 once, following the usual kernel style.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
---
 fs/nfsd/vfs.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6ad76a4..1cf7061 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1053,19 +1053,20 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	oldfs = get_fs(); set_fs(KERNEL_DS);
 	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);
 	set_fs(oldfs);
-	if (host_err >= 0) {
-		*cnt = host_err;
-		nfsdstats.io_write += host_err;
-		fsnotify_modify(file->f_path.dentry);
-	}
+	if (host_err < 0)
+		goto out_nfserr;
+	*cnt = host_err;
+	nfsdstats.io_write += host_err;
+	fsnotify_modify(file->f_path.dentry);
 
 	/* clear setuid/setgid flag after write */
-	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
+	if (inode->i_mode & (S_ISUID | S_ISGID))
 		kill_suid(dentry);
 
-	if (host_err >= 0 && stable && use_wgather)
+	if (stable && use_wgather)
 		host_err = wait_for_concurrent_writes(file);
 
+out_nfserr:
 	dprintk("nfsd: write complete host_err=%d\n", host_err);
 	if (host_err >= 0)
 		err = 0;
-- 
1.6.0.4



* [PATCH] read-modify-write page updating
  2009-06-04 18:04                                             ` Trond Myklebust
  2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
@ 2009-06-24 19:54                                               ` Peter Staubach
  2009-06-25 17:13                                                 ` Trond Myklebust
  2009-07-09 14:12                                                 ` [PATCH v2] " Peter Staubach
  1 sibling, 2 replies; 94+ messages in thread
From: Peter Staubach @ 2009-06-24 19:54 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

Hi.

I have a proposal for possibly resolving this issue.

I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.

The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page.  The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.

When a read call comes in for a portion of the page, the
contents of the page must be read in from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in from the server.
And, since the writing and reading cannot be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes.  This means either a
FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.

This algorithm could be described as modify-write-read.  It
is most efficient when the application only updates pages
and does not read them.

My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path.  The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.

If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data.  If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.

This would optimize for applications which randomly access
and update portions of files.  The linkage editor for the
C compiler is an example of such a thing.

I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel.  The kernel without the
patch generated about 153,000 READ requests and 265,500
WRITE requests.  The modified kernel containing the patch
generated about 156,000 READ requests and 257,000 WRITE
requests.  Thus, about 3,000 more READ requests were
generated, but about 8,500 fewer WRITE requests were
generated.  I suspect that many of these additional
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.

    Thanx...

       ps

Signed-off-by: Peter Staubach <staubach@redhat.com>

[-- Attachment #2: read-modify-write.devel --]
[-- Type: text/plain, Size: 980 bytes --]

--- linux-2.6.30.i686/fs/nfs/file.c.org
+++ linux-2.6.30.i686/fs/nfs/file.c
@@ -337,15 +337,15 @@ static int nfs_write_begin(struct file *
 			struct page **pagep, void **fsdata)
 {
 	int ret;
-	pgoff_t index;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	struct page *page;
-	index = pos >> PAGE_CACHE_SHIFT;
 
 	dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
 		file->f_path.dentry->d_parent->d_name.name,
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
 
+start:
 	/*
 	 * Prevent starvation issues if someone is doing a consistency
 	 * sync-to-disk
@@ -364,6 +364,12 @@ static int nfs_write_begin(struct file *
 	if (ret) {
 		unlock_page(page);
 		page_cache_release(page);
+	} else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) &&
+		   ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) {
+		ret = nfs_readpage(file, page);
+		page_cache_release(page);
+		if (!ret)
+			goto start;
 	}
 	return ret;
 }


* Re: [PATCH] read-modify-write page updating
  2009-06-24 19:54                                               ` [PATCH] read-modify-write page updating Peter Staubach
@ 2009-06-25 17:13                                                 ` Trond Myklebust
       [not found]                                                   ` <1245950029.4913.17.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-07-09 14:12                                                 ` [PATCH v2] " Peter Staubach
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-06-25 17:13 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs

On Wed, 2009-06-24 at 15:54 -0400, Peter Staubach wrote:
> Hi.
> 
> I have a proposal for possibly resolving this issue.
> 
> I believe that this situation occurs due to the way that the
> Linux NFS client handles writes which modify partial pages.
> 
> The Linux NFS client handles partial page modifications by
> allocating a page from the page cache, copying the data from
> the user level into the page, and then keeping track of the
> offset and length of the modified portions of the page.  The
> page is not marked as up to date because there are portions
> of the page which do not contain valid file contents.
> 
> When a read call comes in for a portion of the page, the
> contents of the page must be read in the from the server.
> However, since the page may already contain some modified
> data, that modified data must be written to the server
> before the file contents can be read back in the from server.
> And, since the writing and reading can not be done atomically,
> the data must be written and committed to stable storage on
> the server for safety purposes.  This means either a
> FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
> This has been discussed at length previously.
> 
> This algorithm could be described as modify-write-read.  It
> is most efficient when the application only updates pages
> and does not read them.
> 
> My proposed solution is to add a heuristic to decide whether
> to do this modify-write-read algorithm or switch to a read-
> modify-write algorithm when initially allocating the page
> in the write system call path.  The heuristic uses the modes
> that the file was opened with, the offset in the page to
> read from, and the size of the region to read.
> 
> If the file was opened for reading in addition to writing
> and the page would not be filled completely with data from
> the user level, then read in the old contents of the page
> and mark it as Uptodate before copying in the new data.  If
> the page would be completely filled with data from the user
> level, then there would be no reason to read in the old
> contents because they would just be copied over.
> 
> This would optimize for applications which randomly access
> and update portions of files.  The linkage editor for the
> C compiler is an example of such a thing.
> 
> I tested the attached patch by using rpmbuild to build the
> current Fedora rawhide kernel.  The kernel without the
> patch generated about 153,000 READ requests and 265,500
> WRITE requests.  The modified kernel containing the patch
> generated about 156,000 READ requests and 257,000 WRITE
> requests.  Thus, about 3,000 more READ requests were
> generated, but about 8,500 fewer WRITE requests were
> generated.  I suspect that many of these additional
> WRITE requests were probably FILE_SYNC requests to WRITE
> a single page, but I didn't test this theory.
> 
>     Thanx...
> 
>        ps
> 
> Signed-off-by: Peter Staubach <staubach@redhat.com>
> plain text document attachment (read-modify-write.devel)
> --- linux-2.6.30.i686/fs/nfs/file.c.org
> +++ linux-2.6.30.i686/fs/nfs/file.c
> @@ -337,15 +337,15 @@ static int nfs_write_begin(struct file *
>  			struct page **pagep, void **fsdata)
>  {
>  	int ret;
> -	pgoff_t index;
> +	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>  	struct page *page;
> -	index = pos >> PAGE_CACHE_SHIFT;
>  
>  	dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>  		file->f_path.dentry->d_parent->d_name.name,
>  		file->f_path.dentry->d_name.name,
>  		mapping->host->i_ino, len, (long long) pos);
>  
> +start:
>  	/*
>  	 * Prevent starvation issues if someone is doing a consistency
>  	 * sync-to-disk
> @@ -364,6 +364,12 @@ static int nfs_write_begin(struct file *
>  	if (ret) {
>  		unlock_page(page);
>  		page_cache_release(page);
> +	} else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) &&
> +		   ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) {

It might also be nice to put the above test in a little inlined helper
function (called nfs_want_read_modify_write() ?).

So, a number of questions spring to mind:

     1. What if we're extending the file? We might not need to read the
        page at all in that case (see nfs_write_end()).
     2. What if the page is already dirty or is carrying an uncommitted
        unstable write?
     3. We might want to try to avoid looping more than once here. If
        the kernel is very low on memory, we might just want to write
        out the data rather than read the page and risk having the VM
        eject it before we can dirty it.
     4. Should we be starting an async readahead on the next page?
        Single page sized reads can be a nuisance too, if you are
        writing huge amounts of data.

> +		ret = nfs_readpage(file, page);
> +		page_cache_release(page);
> +		if (!ret)
> +			goto start;
>  	}
>  	return ret;
>  }

Cheers
  Trond



* Re: [PATCH] read-modify-write page updating
       [not found]                                                   ` <1245950029.4913.17.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-07-09 13:59                                                     ` Peter Staubach
  0 siblings, 0 replies; 94+ messages in thread
From: Peter Staubach @ 2009-07-09 13:59 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs

Trond Myklebust wrote:
>
> It might also be nice to put the above test in a little inlined helper
> function (called nfs_want_read_modify_write() ?).
>
>   

Good suggestion.

> So, a number of questions spring to mind:
>
>      1. What if we're extending the file? We might not need to read the
>         page at all in that case (see nfs_write_end()).
>   

Yup.

>      2. What if the page is already dirty or is carrying an uncommitted
>         unstable write?
>   

Yup.

>      3. We might want to try to avoid looping more than once here. If
>         the kernel is very low on memory, we might just want to write
>         out the data rather than read the page and risk having the VM
>         eject it before we can dirty it.
>   

Yup.

>      4. Should we be starting an async readahead on the next page?
>         Single page sized reads can be a nuisance too, if you are
>         writing huge amounts of data.

This one is tough.  It sounds good, but seems difficult to implement.

I think that this could be viewed as an optimization.

       ps



* [PATCH v2] read-modify-write page updating
  2009-06-24 19:54                                               ` [PATCH] read-modify-write page updating Peter Staubach
  2009-06-25 17:13                                                 ` Trond Myklebust
@ 2009-07-09 14:12                                                 ` Peter Staubach
  2009-07-09 15:39                                                   ` Trond Myklebust
  2009-08-04 17:52                                                   ` [PATCH v3] " Peter Staubach
  1 sibling, 2 replies; 94+ messages in thread
From: Peter Staubach @ 2009-07-09 14:12 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]

Hi.

I have a proposal for possibly resolving this issue.

I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.

The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page.  The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.

When a read call comes in for a portion of the page, the
contents of the page must be read in from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in from the server.
And, since the writing and reading cannot be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes.  This means either a
FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.

This algorithm could be described as modify-write-read.  It
is most efficient when the application only updates pages
and does not read them.

My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path.  The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.

If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data.  If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.

This would optimize for applications which randomly access
and update portions of files.  The linkage editor for the
C compiler is an example of such a thing.

I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel.  The kernel without the
patch generated about 269,500 WRITE requests.  The modified
kernel containing the patch generated about 261,000 WRITE
requests.  Thus, about 8,500 fewer WRITE requests were
generated.  I suspect that many of the eliminated
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.

The previous version of this patch caused the NFS client to
generate around 3,000 more READ requests.  This version
actually causes the NFS client to generate almost 500 fewer
READ requests.

    Thanx...

       ps

Signed-off-by: Peter Staubach <staubach@redhat.com>

[-- Attachment #2: read-modify-write.devel.2 --]
[-- Type: application/x-troff-man, Size: 2713 bytes --]


* Re: [PATCH v2] read-modify-write page updating
  2009-07-09 14:12                                                 ` [PATCH v2] " Peter Staubach
@ 2009-07-09 15:39                                                   ` Trond Myklebust
       [not found]                                                     ` <1247153972.5766.15.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-08-04 17:52                                                   ` [PATCH v3] " Peter Staubach
  1 sibling, 1 reply; 94+ messages in thread
From: Trond Myklebust @ 2009-07-09 15:39 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs

On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:

> Signed-off-by: Peter Staubach <staubach@redhat.com>

Please could you send such patches inline, rather than as
attachments? Attachments make it harder to comment on the patch contents...

> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
> +			loff_t pos, unsigned len)
> +{
> +	unsigned int pglen = nfs_page_length(page);
> +	unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
> +	unsigned int end = offset + len;
> +
> +	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
> +	    !PageUptodate(page) &&		/* Uptodate? */
> +	    !PageDirty(page) &&			/* Dirty already? */
> +	    !PagePrivate(page) &&		/* i/o request already? */

I don't think you need the PageDirty() test. These days we should be
guaranteed to always have PagePrivate() set whenever PageDirty() is
(although the converse is not true). Anything else would be a bug...

> +	    pglen &&				/* valid bytes of file? */
> +	    (end < pglen || offset))		/* replace all valid bytes? */
> +		return 1;
> +	return 0;
> +}
> +



* Re: [PATCH v2] read-modify-write page updating
       [not found]                                                     ` <1247153972.5766.15.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-07-10 15:57                                                       ` Peter Staubach
  2009-07-10 17:22                                                         ` J. Bruce Fields
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Staubach @ 2009-07-10 15:57 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs

Trond Myklebust wrote:
> On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:
>
>   
>> Signed-off-by: Peter Staubach <staubach@redhat.com>
>>     
>
> Please could you send such patches as inline, rather than as
> attachments. It makes it harder to comment on the patch contents...
>
>   

I will investigate how to do this.

>> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
>> +			loff_t pos, unsigned len)
>> +{
>> +	unsigned int pglen = nfs_page_length(page);
>> +	unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
>> +	unsigned int end = offset + len;
>> +
>> +	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
>> +	    !PageUptodate(page) &&		/* Uptodate? */
>> +	    !PageDirty(page) &&			/* Dirty already? */
>> +	    !PagePrivate(page) &&		/* i/o request already? */
>>     
>
> I don't think you need the PageDirty() test. These days we should be
> guaranteed to always have PagePrivate() set whenever PageDirty() is
> (although the converse is not true). Anything else would be a bug...
>
>   

Okie doke.  It seemed to me that this should be true, but it was
safer to leave both tests.

I will remove that PageDirty test, retest, and then send another
version of the patch.  I will be out next week, so it will take a
couple of weeks.

    Thanx...

       ps

>> +	    pglen &&				/* valid bytes of file? */
>> +	    (end < pglen || offset))		/* replace all valid bytes? */
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>     
>
>   



* Re: [PATCH v2] read-modify-write page updating
  2009-07-10 15:57                                                       ` Peter Staubach
@ 2009-07-10 17:22                                                         ` J. Bruce Fields
  0 siblings, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2009-07-10 17:22 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Trond Myklebust, Brian R Cowan, linux-nfs

On Fri, Jul 10, 2009 at 11:57:02AM -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>> On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:
>>
>>   
>>> Signed-off-by: Peter Staubach <staubach@redhat.com>
>>>     
>>
>> Please could you send such patches as inline, rather than as
>> attachments. It makes it harder to comment on the patch contents...
>>
>>   
>
> I will investigate how to do this.

See Documentation/email-clients.txt.  (It has an entry for Thunderbird,
for example.)

--b.

>
>>> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
>>> +			loff_t pos, unsigned len)
>>> +{
>>> +	unsigned int pglen = nfs_page_length(page);
>>> +	unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
>>> +	unsigned int end = offset + len;
>>> +
>>> +	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
>>> +	    !PageUptodate(page) &&		/* Uptodate? */
>>> +	    !PageDirty(page) &&			/* Dirty already? */
>>> +	    !PagePrivate(page) &&		/* i/o request already? */
>>>     
>>
>> I don't think you need the PageDirty() test. These days we should be
>> guaranteed to always have PagePrivate() set whenever PageDirty() is
>> (although the converse is not true). Anything else would be a bug...
>>
>>   
>
> Okie doke.  It seemed to me that this should be true, but it was
> safer to leave both tests.
>
> I will remove that PageDirty test, retest, and then send another
> version of the patch.  I will be out next week, so it will take a
> couple of weeks.
>
>    Thanx...
>
>       ps
>
>>> +	    pglen &&				/* valid bytes of file? */
>>> +	    (end < pglen || offset))		/* replace all valid bytes? */
>>> +		return 1;
>>> +	return 0;
>>> +}
>>> +
>>>     
>>
>>   
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH v3] read-modify-write page updating
  2009-07-09 14:12                                                 ` [PATCH v2] " Peter Staubach
  2009-07-09 15:39                                                   ` Trond Myklebust
@ 2009-08-04 17:52                                                   ` Peter Staubach
  2009-08-05  0:50                                                     ` Trond Myklebust
  1 sibling, 1 reply; 94+ messages in thread
From: Peter Staubach @ 2009-08-04 17:52 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs

Hi.

I have a proposal for possibly resolving this issue.

I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.

The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page.  The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.

When a read call comes in for a portion of the page, the
contents of the page must be read in from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in from the server.
And, since the writing and reading cannot be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes.  This means either a
FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.

This algorithm could be described as modify-write-read.  It
is most efficient when the application only updates pages
and does not read them.

My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path.  The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.

If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data.  If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.

This would optimize for applications which randomly access
and update portions of files.  The linkage editor for the
C compiler is an example of such an application.

I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel.  The kernel without the
patch generated about 269,500 WRITE requests.  The modified
kernel containing the patch generated about 261,000 WRITE
requests.  Thus, about 8,500 fewer WRITE requests were
generated.  I suspect that many of the eliminated WRITE
requests were FILE_SYNC requests to WRITE a single page,
but I didn't test this theory.

The difference between this patch and the previous one was
to remove the unneeded PageDirty() test.  I then retested to
ensure that the resulting system continued to behave as
desired.

	Thanx...

		ps

Signed-off-by: Peter Staubach <staubach@redhat.com>

--- linux-2.6.30.i686/fs/nfs/file.c.org
+++ linux-2.6.30.i686/fs/nfs/file.c
@@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct
 }
 
 /*
+ * Decide whether a read/modify/write cycle may be more efficient
+ * than a modify/write/read cycle when writing to a page in the
+ * page cache.
+ *
+ * The modify/write/read cycle may occur if a page is read before
+ * being completely filled by the writer.  In this situation, the
+ * page must be completely written to stable storage on the server
+ * before it can be refilled by reading in the page from the server.
+ * This can lead to expensive, small, FILE_SYNC mode writes being
+ * done.
+ *
+ * It may be more efficient to read the page first if the file is
+ * open for reading in addition to writing, the page is not marked
+ * as Uptodate, it is not dirty or waiting to be committed,
+ * indicating that it was previously allocated and then modified,
+ * that there were valid bytes of data in that range of the file,
+ * and that the new data won't completely replace the old data in
+ * that range of the file.
+ */
+static int nfs_want_read_modify_write(struct file *file, struct page *page,
+			loff_t pos, unsigned len)
+{
+	unsigned int pglen = nfs_page_length(page);
+	unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned int end = offset + len;
+
+	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
+	    !PageUptodate(page) &&		/* Uptodate? */
+	    !PagePrivate(page) &&		/* i/o request already? */
+	    pglen &&				/* valid bytes of file? */
+	    (end < pglen || offset))		/* replace all valid bytes? */
+		return 1;
+	return 0;
+}
+
+/*
  * This does the "real" work of the write. We must allocate and lock the
  * page to be sent back to the generic routine, which then copies the
  * data from user space.
@@ -340,15 +376,16 @@ static int nfs_write_begin(struct file *
 			struct page **pagep, void **fsdata)
 {
 	int ret;
-	pgoff_t index;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	struct page *page;
-	index = pos >> PAGE_CACHE_SHIFT;
+	int once_thru = 0;
 
 	dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
 		file->f_path.dentry->d_parent->d_name.name,
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
 
+start:
 	/*
 	 * Prevent starvation issues if someone is doing a consistency
 	 * sync-to-disk
@@ -367,6 +404,13 @@ static int nfs_write_begin(struct file *
 	if (ret) {
 		unlock_page(page);
 		page_cache_release(page);
+	} else if (!once_thru &&
+		   nfs_want_read_modify_write(file, page, pos, len)) {
+		once_thru = 1;
+		ret = nfs_readpage(file, page);
+		page_cache_release(page);
+		if (!ret)
+			goto start;
 	}
 	return ret;
 }



* Re: [PATCH v3] read-modify-write page updating
  2009-08-04 17:52                                                   ` [PATCH v3] " Peter Staubach
@ 2009-08-05  0:50                                                     ` Trond Myklebust
  0 siblings, 0 replies; 94+ messages in thread
From: Trond Myklebust @ 2009-08-05  0:50 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs

On Tue, 2009-08-04 at 13:52 -0400, Peter Staubach wrote:
> Signed-off-by: Peter Staubach <staubach@redhat.com>
> 
> --- linux-2.6.30.i686/fs/nfs/file.c.org
> +++ linux-2.6.30.i686/fs/nfs/file.c
> @@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct
>  }
>  
>  /*
> + * Decide whether a read/modify/write cycle may be more efficient
> + * than a modify/write/read cycle when writing to a page in the
> + * page cache.
> + *
> + * The modify/write/read cycle may occur if a page is read before
> + * being completely filled by the writer.  In this situation, the
> + * page must be completely written to stable storage on the server
> + * before it can be refilled by reading in the page from the server.
> + * This can lead to expensive, small, FILE_SYNC mode writes being
> + * done.
> + *
> + * It may be more efficient to read the page first if the file is
> + * open for reading in addition to writing, the page is not marked
> + * as Uptodate, it is not dirty or waiting to be committed,
> + * indicating that it was previously allocated and then modified,
> + * that there were valid bytes of data in that range of the file,
> + * and that the new data won't completely replace the old data in
> + * that range of the file.
> + */
> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
> +			loff_t pos, unsigned len)
> +{
> +	unsigned int pglen = nfs_page_length(page);
> +	unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
> +	unsigned int end = offset + len;
> +
> +	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
> +	    !PageUptodate(page) &&		/* Uptodate? */
> +	    !PagePrivate(page) &&		/* i/o request already? */
> +	    pglen &&				/* valid bytes of file? */
> +	    (end < pglen || offset))		/* replace all valid bytes? */
> +		return 1;
> +	return 0;
> +}
> +
> +/*
>   * This does the "real" work of the write. We must allocate and lock the
>   * page to be sent back to the generic routine, which then copies the
>   * data from user space.
> @@ -340,15 +376,16 @@ static int nfs_write_begin(struct file *
>  			struct page **pagep, void **fsdata)
>  {
>  	int ret;
> -	pgoff_t index;
> +	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>  	struct page *page;
> -	index = pos >> PAGE_CACHE_SHIFT;
> +	int once_thru = 0;
>  
>  	dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>  		file->f_path.dentry->d_parent->d_name.name,
>  		file->f_path.dentry->d_name.name,
>  		mapping->host->i_ino, len, (long long) pos);
>  
> +start:
>  	/*
>  	 * Prevent starvation issues if someone is doing a consistency
>  	 * sync-to-disk
> @@ -367,6 +404,13 @@ static int nfs_write_begin(struct file *
>  	if (ret) {
>  		unlock_page(page);
>  		page_cache_release(page);
> +	} else if (!once_thru &&
> +		   nfs_want_read_modify_write(file, page, pos, len)) {
> +		once_thru = 1;
> +		ret = nfs_readpage(file, page);
> +		page_cache_release(page);
> +		if (!ret)
> +			goto start;
>  	}
>  	return ret;
>  }
> 
Thanks! Applied...

Trond



end of thread, other threads:[~2009-08-05  0:50 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-30 20:12 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan
2009-04-30 20:25 ` Christoph Hellwig
2009-04-30 20:28 ` Chuck Lever
2009-04-30 20:41   ` Peter Staubach
2009-04-30 21:13     ` Chuck Lever
2009-04-30 21:23     ` Trond Myklebust
2009-05-01 16:39       ` Brian R Cowan
     [not found]       ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 15:55         ` Brian R Cowan
2009-05-29 16:46           ` Trond Myklebust
     [not found]             ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 17:25               ` Brian R Cowan
2009-05-29 17:35                 ` Trond Myklebust
     [not found]                   ` <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-30  0:22                     ` Greg Banks
     [not found]                       ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-05-30  7:57                         ` Christoph Hellwig
2009-06-01 22:30                           ` J. Bruce Fields
2009-06-05 14:54                             ` Christoph Hellwig
2009-06-05 16:01                               ` J. Bruce Fields
2009-06-05 16:12                               ` Trond Myklebust
     [not found]                                 ` <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-05 19:54                                   ` J. Bruce Fields
2009-06-05 21:21                                     ` Trond Myklebust
2009-05-30 12:26                         ` Trond Myklebust
     [not found]                           ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-30 12:43                             ` Trond Myklebust
2009-05-30 13:02                             ` Greg Banks
     [not found]                               ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-06-01 22:30                                 ` J. Bruce Fields
2009-06-02 15:00                                 ` Chuck Lever
2009-06-02 17:27                                   ` Trond Myklebust
     [not found]                                     ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-02 18:15                                       ` Chuck Lever
2009-06-03 16:22                                       ` Carlos Carvalho
2009-06-03 17:10                                         ` Trond Myklebust
     [not found]                                           ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org>
2009-06-03 21:28                                           ` Dean Hildebrand
2009-06-04  2:16                                             ` Carlos Carvalho
2009-06-04 17:42                                           ` Brian R Cowan
2009-06-04 18:04                                             ` Trond Myklebust
2009-06-04 20:43                                               ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan
2009-06-04 20:57                                                 ` Trond Myklebust
2009-06-04 21:30                                                   ` Brian R Cowan
2009-06-04 21:48                                                     ` Trond Myklebust
2009-06-04 21:07                                                 ` Peter Staubach
2009-06-04 21:39                                                   ` Brian R Cowan
2009-06-05 11:35                                                 ` Steve Dickson
2009-06-05 12:46                                                   ` Trond Myklebust
2009-06-05 13:03                                                     ` Brian R Cowan
2009-06-05 13:05                                                   ` Tom Talpey
     [not found]                                                   ` <4A29144A.6030405@gmail.com>
2009-06-05 13:30                                                     ` Steve Dickson
2009-06-05 13:52                                                       ` Trond Myklebust
     [not found]                                                         ` <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-05 13:57                                                           ` Steve Dickson
     [not found]                                                             ` <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org>
2009-06-05 16:05                                                               ` J. Bruce Fields
2009-06-05 16:35                                                                 ` Trond Myklebust
     [not found]                                                                   ` <1244219715.5410.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-15 23:08                                                                     ` J. Bruce Fields
2009-06-16  0:21                                                                       ` NeilBrown
     [not found]                                                                         ` <99d4545537613ce76040d3655b78bdb7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-06-16  0:33                                                                           ` J. Bruce Fields
2009-06-16  0:50                                                                             ` NeilBrown
     [not found]                                                                               ` <02ada87c636e1088e9365a3cbea301e7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-06-16  0:55                                                                                 ` J. Bruce Fields
2009-06-17 16:54                                                                                   ` J. Bruce Fields
2009-06-17 16:59                                                                                     ` [PATCH 1/3] nfsd: track last inode only in use_wgather case J. Bruce Fields
2009-06-17 16:59                                                                                       ` [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write J. Bruce Fields
2009-06-17 16:59                                                                                         ` [PATCH 3/3] nfsd: minor nfsd_vfs_write cleanup J. Bruce Fields
2009-06-16  0:32                                                                       ` Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust
     [not found]                                                                         ` <1245112324.7470.7.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-16  2:02                                                                           ` J. Bruce Fields
     [not found]                                                     ` <4A291D83.1000508@RedHat.com>
2009-06-05 13:50                                                       ` Tom Talpey
2009-06-05 13:54                                                         ` Trond Myklebust
2009-06-05 13:58                                                           ` Tom Talpey
2009-06-05 13:56                                                   ` Brian R Cowan
2009-06-24 19:54                                               ` [PATCH] read-modify-write page updating Peter Staubach
2009-06-25 17:13                                                 ` Trond Myklebust
     [not found]                                                   ` <1245950029.4913.17.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-07-09 13:59                                                     ` Peter Staubach
2009-07-09 14:12                                                 ` [PATCH v2] " Peter Staubach
2009-07-09 15:39                                                   ` Trond Myklebust
     [not found]                                                     ` <1247153972.5766.15.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-07-10 15:57                                                       ` Peter Staubach
2009-07-10 17:22                                                         ` J. Bruce Fields
2009-08-04 17:52                                                   ` [PATCH v3] " Peter Staubach
2009-08-05  0:50                                                     ` Trond Myklebust
2009-05-29 17:48               ` Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Peter Staubach
2009-05-29 18:21                 ` Trond Myklebust
2009-05-29 17:01           ` Chuck Lever
2009-05-29 17:38             ` Brian R Cowan
2009-05-29 17:42               ` Trond Myklebust
     [not found]                 ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 17:47                   ` Chuck Lever
2009-05-29 18:15                     ` Trond Myklebust
2009-05-29 17:51                   ` Peter Staubach
2009-05-29 18:25                     ` Brian R Cowan
2009-05-29 18:43                     ` Trond Myklebust
2009-05-29 17:55                   ` Brian R Cowan
2009-05-29 18:07                     ` Trond Myklebust
     [not found]                       ` <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 18:18                         ` Brian R Cowan
2009-05-29 18:29                           ` Trond Myklebust
     [not found]                             ` <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 20:09                               ` Brian R Cowan
2009-05-29 20:21                                 ` Trond Myklebust
     [not found]                                   ` <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 21:55                                     ` Brian R Cowan
2009-05-29 22:03                                       ` Trond Myklebust
     [not found]                                   ` <OFBB9B2C07.CC3D028B-ON852575C5. <1243634634.7155.160.camel@heimdal.trondhjem.org>
     [not found]                                     ` <1243634634.7155.160.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 22:20                                       ` Brian R Cowan
2009-05-29 22:36                                         ` Trond Myklebust
     [not found]                                     ` <OF061E0258.9581352B-ON852575C <1243636593.7155.188.camel@heimdal.trondhjem.org>
     [not found]                                       ` <1243636593.7155.188.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-29 23:02                                         ` Brian R Cowan
2009-05-29 23:13                                           ` Trond Myklebust
2009-05-29 17:57                   ` Trond Myklebust
