* Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing @ 2009-04-30 20:12 Brian R Cowan 2009-04-30 20:25 ` Christoph Hellwig 2009-04-30 20:28 ` Chuck Lever 0 siblings, 2 replies; 94+ messages in thread From: Brian R Cowan @ 2009-04-30 20:12 UTC (permalink / raw) To: linux-nfs Hello all, This is my first post, so please be gentle.... I have been working with a customer who is attempting to build their product in ClearCase dynamic views on Linux. When they went from Red Hat Enterprise Linux 4 (Update 5) to Red Hat Enterprise Linux 5 (Update 2), their build performance degraded dramatically. When troubleshooting the issue, we noticed that links on RHEL 5 caused an incredible number of "STABLE" 4kb NFS writes even though the storage we were writing to was EXPLICITLY mounted async. (This made RHEL 5 nearly 5x slower than RHEL 4.5 in this area...) On consultation with some internal resources, we found this change in the 2.6 kernel: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 In here it looks like the NFS client is forcing sync writes any time a write of less than the NFS write size occurs. We tested this hypothesis by setting the write size to 2KB. The "STABLE" writes went away and link times came back down out of the stratosphere. We built a modified kernel based on the RHEL 5.2 kernel (that ONLY backed out this change) and we got a 33% improvement in overall build speeds. In my case, I see almost identical build times between the two OS's when we use this modified kernel on RHEL 5. Now, why am I posting this to the list? I need to understand *why* that change was made. On the face of it, simply backing out that patch would be perfect. I'm paranoid. I want to make sure that this is the ONLY reason: "/* For single writes, FLUSH_STABLE is more efficient */" It seems more accurate to say that such writes *aren't* more efficient, but rather are "safer, but slower."
I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4 kernel, and SLES 9 is based on something in the same ballpark. And our customers see problems when they go to SLES 10/RHEL 5 from the prior major distro version. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 20:12 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan @ 2009-04-30 20:25 ` Christoph Hellwig 2009-04-30 20:28 ` Chuck Lever 1 sibling, 0 replies; 94+ messages in thread From: Christoph Hellwig @ 2009-04-30 20:25 UTC (permalink / raw) To: Brian R Cowan; +Cc: linux-nfs On Thu, Apr 30, 2009 at 04:12:19PM -0400, Brian R Cowan wrote: > Hello all, > > This is my first post, so please be gentle.... I have been working with a > customer who is attempting to build their product in ClearCase dynamic > views on Linux. > I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4 > kernel, and SLES 9 is based on something in the same ballpark. And our > customers see problems when they go to SLES 10/RHEL 5 from the prior major > distro version. You should probably complain to the distro vendors if you use distro kernels. And even when the change might not be directly related, please reproduce anything posted to upstream projects without binary-only module junk like clearcase. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 20:12 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan 2009-04-30 20:25 ` Christoph Hellwig @ 2009-04-30 20:28 ` Chuck Lever 2009-04-30 20:41 ` Peter Staubach 1 sibling, 1 reply; 94+ messages in thread From: Chuck Lever @ 2009-04-30 20:28 UTC (permalink / raw) To: Brian R Cowan; +Cc: linux-nfs On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > Hello all, > > This is my first post, so please be gentle.... I have been working > with a > customer who is attempting to build their product in ClearCase dynamic > views on Linux. When they went from Red hat Enterprise Linux 4 > (update 5) > to Red Hat Enterprise Linux 5 (Update 2), their build performance > degraded > dramatically. When troubleshooting the issue, we noticed that links on > RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even > though > the storage we were writing to was EXPLICITLY mounted async. (This > made > RHEL 5 nearly 5x slower than RHEL 4.5 in this area...) > > On consultation with some internal resources, we found this change > in the > 2.6 kernel: > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > In here it looks like the NFS client is forcing sync writes any time a > write of less than the NFS write size occurs. We tested this > hypothesis by > setting the write size to 2KB. The "STABLE" writes went away and link > times came back down out of the stratosphere. We built a modified > kernel > based on the RHEL 5.2 kernel (that ONLY backed out of this change) > and we > got a 33% improvement in overall build speeds. In my case, I see > almost > identical build times between the 2 OS's when we use this modified > kernel > on RHEL 5. > > Now, why am I posing this to the list? I need to understand *why* that > change was made. 
On the face of it, simply backing out that patch > would be > perfect. I'm paranoid. I want to make sure that this is the ONLY > reason: > "/* For single writes, FLUSH_STABLE is more efficient */ " > > It seems more accurate to say that they *aren't* more efficient, but > rather are "safer, but slower." They are more efficient from the point of view that only a single RPC is needed for a complete write. The WRITE and COMMIT are done in a single request. I don't think the issue here is whether the write is stable, but it is whether the NFS client has to block the application for it. A stable write that is asynchronous to the application is faster than WRITE+COMMIT. So it's not "stable" that is holding you up, it's "synchronous." Those are orthogonal concepts. > I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4 > kernel, Nope, RHEL 4 is 2.6.9. RHEL 3 is 2.4.20-ish. > and SLES 9 is based on something in the same ballpark. And our > customers see problems when they go to SLES 10/RHEL 5 from the prior > major > distro version. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
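Chuck's RPC-count argument can be put in rough numbers. A minimal cost model; the latency figures below are invented for illustration, not measurements from this thread:

```python
# Rough cost model for flushing n dirty pages over NFS.
# RTT and DISK_SYNC are hypothetical, illustrative numbers.
RTT = 0.0002        # assumed client<->server round trip, seconds
DISK_SYNC = 0.008   # assumed server commit to stable storage, seconds

def stable_writes(n_pages):
    """Each page sent as a FILE_SYNC WRITE: every RPC waits on the disk."""
    return n_pages * (RTT + DISK_SYNC)

def unstable_plus_commit(n_pages):
    """n UNSTABLE WRITEs the server may cache, then one COMMIT to sync."""
    return n_pages * RTT + (RTT + DISK_SYNC)
```

With these assumed numbers, a single flushed page favors the one-RPC stable write, while flushing many pages one at a time pays the server's disk-sync cost once per page, which is consistent with the slowdown described earlier in the thread.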
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 20:28 ` Chuck Lever @ 2009-04-30 20:41 ` Peter Staubach 2009-04-30 21:13 ` Chuck Lever 2009-04-30 21:23 ` Trond Myklebust 0 siblings, 2 replies; 94+ messages in thread From: Peter Staubach @ 2009-04-30 20:41 UTC (permalink / raw) To: Chuck Lever; +Cc: Brian R Cowan, linux-nfs Chuck Lever wrote: > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > >> Hello all, >> >> This is my first post, so please be gentle.... I have been working >> with a >> customer who is attempting to build their product in ClearCase dynamic >> views on Linux. When they went from Red hat Enterprise Linux 4 >> (update 5) >> to Red Hat Enterprise Linux 5 (Update 2), their build performance >> degraded >> dramatically. When troubleshooting the issue, we noticed that links on >> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even >> though >> the storage we were writing to was EXPLICITLY mounted async. (This made >> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...) >> >> On consultation with some internal resources, we found this change in >> the >> 2.6 kernel: >> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 >> >> >> In here it looks like the NFS client is forcing sync writes any time a >> write of less than the NFS write size occurs. We tested this >> hypothesis by >> setting the write size to 2KB. The "STABLE" writes went away and link >> times came back down out of the stratosphere. We built a modified kernel >> based on the RHEL 5.2 kernel (that ONLY backed out of this change) >> and we >> got a 33% improvement in overall build speeds. In my case, I see almost >> identical build times between the 2 OS's when we use this modified >> kernel >> on RHEL 5. >> >> Now, why am I posing this to the list? I need to understand *why* that >> change was made. 
On the face of it, simply backing out that patch >> would be >> perfect. I'm paranoid. I want to make sure that this is the ONLY reason: >> "/* For single writes, FLUSH_STABLE is more efficient */ " >> >> It seems more accurate to say that they *aren't* more efficient, but >> rather are "safer, but slower." > > They are more efficient from the point of view that only a single RPC > is needed for a complete write. The WRITE and COMMIT are done in a > single request. > > I don't think the issue here is whether the write is stable, but it is > whether the NFS client has to block the application for it. A stable > write that is asynchronous to the application is faster than > WRITE+COMMIT. > > So it's not "stable" that is holding you up, it's "synchronous." > Those are orthogonal concepts. > Actually, the "stable" part can be a killer. It depends upon why and when nfs_flush_inode() is invoked. I did quite a bit of work on this aspect of RHEL-5 and discovered that this particular code was leading to some serious slowdowns. The server would end up doing a very slow FILE_SYNC write when all that was really required was an UNSTABLE write at the time. Did anyone actually measure this optimization and if so, what were the numbers? Thanx... ps ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 20:41 ` Peter Staubach @ 2009-04-30 21:13 ` Chuck Lever 2009-04-30 21:23 ` Trond Myklebust 1 sibling, 0 replies; 94+ messages in thread From: Chuck Lever @ 2009-04-30 21:13 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs On Apr 30, 2009, at 4:41 PM, Peter Staubach wrote: > Chuck Lever wrote: >> >> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: >> >>> Hello all, >>> >>> This is my first post, so please be gentle.... I have been working >>> with a >>> customer who is attempting to build their product in ClearCase >>> dynamic >>> views on Linux. When they went from Red hat Enterprise Linux 4 >>> (update 5) >>> to Red Hat Enterprise Linux 5 (Update 2), their build performance >>> degraded >>> dramatically. When troubleshooting the issue, we noticed that >>> links on >>> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even >>> though >>> the storage we were writing to was EXPLICITLY mounted async. (This >>> made >>> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...) >>> >>> On consultation with some internal resources, we found this change >>> in >>> the >>> 2.6 kernel: >>> >>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 >>> >>> >>> In here it looks like the NFS client is forcing sync writes any >>> time a >>> write of less than the NFS write size occurs. We tested this >>> hypothesis by >>> setting the write size to 2KB. The "STABLE" writes went away and >>> link >>> times came back down out of the stratosphere. We built a modified >>> kernel >>> based on the RHEL 5.2 kernel (that ONLY backed out of this change) >>> and we >>> got a 33% improvement in overall build speeds. In my case, I see >>> almost >>> identical build times between the 2 OS's when we use this modified >>> kernel >>> on RHEL 5. >>> >>> Now, why am I posing this to the list? 
I need to understand *why* >>> that >>> change was made. On the face of it, simply backing out that patch >>> would be >>> perfect. I'm paranoid. I want to make sure that this is the ONLY >>> reason: >>> "/* For single writes, FLUSH_STABLE is more efficient */ " >>> >>> It seems more accurate to say that they *aren't* more efficient, but >>> rather are "safer, but slower." >> >> They are more efficient from the point of view that only a single RPC >> is needed for a complete write. The WRITE and COMMIT are done in a >> single request. >> >> I don't think the issue here is whether the write is stable, but it >> is >> whether the NFS client has to block the application for it. A stable >> write that is asynchronous to the application is faster than >> WRITE+COMMIT. >> >> So it's not "stable" that is holding you up, it's "synchronous." >> Those are orthogonal concepts. >> > > Actually, the "stable" part can be a killer. It depends upon > why and when nfs_flush_inode() is invoked. > > I did quite a bit of work on this aspect of RHEL-5 and discovered > that this particular code was leading to some serious slowdowns. > The server would end up doing a very slow FILE_SYNC write when > all that was really required was an UNSTABLE write at the time. If the client is asking for FILE_SYNC when it doesn't need the COMMIT, then yes, that would hurt performance. > Did anyone actually measure this optimization and if so, what > were the numbers? > > Thanx... > > ps -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 20:41 ` Peter Staubach 2009-04-30 21:13 ` Chuck Lever @ 2009-04-30 21:23 ` Trond Myklebust 2009-05-01 16:39 ` Brian R Cowan [not found] ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 2 replies; 94+ messages in thread From: Trond Myklebust @ 2009-04-30 21:23 UTC (permalink / raw) To: Peter Staubach; +Cc: Chuck Lever, Brian R Cowan, linux-nfs On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > Chuck Lever wrote: > > > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > >> > >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > >> > Actually, the "stable" part can be a killer. It depends upon > why and when nfs_flush_inode() is invoked. > > I did quite a bit of work on this aspect of RHEL-5 and discovered > that this particular code was leading to some serious slowdowns. > The server would end up doing a very slow FILE_SYNC write when > all that was really required was an UNSTABLE write at the time. > > Did anyone actually measure this optimization and if so, what > were the numbers? As usual, the optimisation is workload dependent. The main type of workload we're targeting with this patch is the app that opens a file, writes < 4k and then closes the file. For that case, it's a no-brainer that you don't need to split a single stable write into an unstable + a commit. So if the application isn't doing the above type of short write followed by close, then exactly what is causing a flush to disk in the first place? Ordinarily, the client will try to cache writes until the cows come home (or until the VM tells it to reclaim memory - whichever comes first)... Cheers Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
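The short-write-then-close workload Trond describes can be sketched as follows. The path and payload are hypothetical, and the NFS behavior noted in the comments is the patch's intent, not something this local demo can show:

```python
import os
import tempfile

def small_write(path, data):
    # Open, one write smaller than wsize, close: the pattern the
    # FLUSH_STABLE optimization targets. On an NFS mount, flushing
    # this as a single FILE_SYNC WRITE avoids a separate COMMIT RPC.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

# Demo on a local temp file (hypothetical name).
demo = os.path.join(tempfile.mkdtemp(), "one-page")
small_write(demo, b"x" * 1024)
with open(demo, "rb") as f:
    content = f.read()
```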
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-04-30 21:23 ` Trond Myklebust @ 2009-05-01 16:39 ` Brian R Cowan [not found] ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 0 replies; 94+ messages in thread From: Brian R Cowan @ 2009-05-01 16:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach linux-nfs-owner@vger.kernel.org wrote on 04/30/2009 05:23:07 PM: > As usual, the optimisation is workload dependent. The main type of > workload we're targetting with this patch is the app that opens a file, > writes < 4k and then closes the file. For that case, it's a no-brainer > that you don't need to split a single stable write into an unstable + a > commit. The app impacted most is the gcc linker... I tested by building Samba, then by linking smbd. We think the linker memory maps the output file. Don't really know for sure since I don't know the gcc source any more than I'm an expert in the Linux NFS implementation. In any event, the linker is doing all kinds of lseeks and writes as it builds the output executable based on the various .o files being linked in. All of those writes are slowed down by this write change. If we were closing the file afterwards, that would be one thing, but we're not... > > So if the application isn't doing the above type of short write followed > by close, then exactly what is causing a flush to disk in the first > place? Ordinarily, the client will try to cache writes until the cows > come home (or until the VM tells it to reclaim memory - whichever comes > first)... We suspect it's the latter (something telling the system to flush memory) but chasing that looks to be a challenge... 
> > Cheers > Trond > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
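For reference, the linker-style access pattern Brian describes (seeks and small writes scattered through one open output file) can be mimicked with a short sketch; the offsets, data, and file name here are invented:

```python
import os
import tempfile

def patch_output(path, patches):
    # Linker-like pattern: keep one output file open and apply many
    # small seek+write fixups to it. Under the FLUSH_STABLE change,
    # each page the VM flushes from such a file on NFS becomes an
    # individual stable 4k WRITE.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for offset, data in patches:
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, data)
    finally:
        os.close(fd)

# Demo with invented fixups on a local temp file.
out = os.path.join(tempfile.mkdtemp(), "output.bin")
patch_output(out, [(0, b"\x7fELF"), (64, b"\x01\x02")])
with open(out, "rb") as f:
    image = f.read()
```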
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1241126587.15476.62.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 15:55 ` Brian R Cowan 2009-05-29 16:46 ` Trond Myklebust 2009-05-29 17:01 ` Chuck Lever 0 siblings, 2 replies; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 15:55 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach Been working this issue with Red Hat, and didn't need to go to the list... Well, now I do... You mention that "The main type of workload we're targeting with this patch is the app that opens a file, writes < 4k and then closes the file." Well, it appears that this issue also impacts flushing pages from filesystem caches. The reason this came up in my environment is that our product's build auditing gives the filesystem cache an interesting workout. When ClearCase audits a build, the build places data in a few places, including: 1) a build audit file that usually resides in /tmp. This build audit is essentially a log of EVERY file open/read/write/delete/rename/etc. that the programs called in the build script make in the ClearCase "view" you're building in. As a result, this file can get pretty large. 2) The build outputs themselves, which in this case are being written to a remote storage location on a Linux or Solaris server, and 3) a file called .cmake.state, which is a local cache that is written to after the build script completes containing what is essentially a "Bill of materials" for the files created during builds in this "view." We believe that the build audit file access is causing build output to get flushed out of the filesystem cache. These flushes happen *in 4k chunks.* This trips over this change since the cache pages appear to get flushed on an individual basis. 
One note is that if the build outputs were going to a clearcase view stored on an enterprise-level NAS device, there isn't as much of an issue because many of these return from the stable write request as soon as the data goes into the battery-backed memory disk cache on the NAS. However, it really impacts writes to general-purpose OS's that follow Sun's lead in how they handle "stable" writes. The truly annoying part about this rather subtle change is that the NFS client is specifically ignoring the client mount options since we cannot force the "async" mount option to turn off this behavior. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Peter Staubach <staubach@redhat.com> Cc: Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, linux-nfs@vger.kernel.org Date: 04/30/2009 05:23 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Sent by: linux-nfs-owner@vger.kernel.org On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > Chuck Lever wrote: > > > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > >> > >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > >> > Actually, the "stable" part can be a killer. It depends upon > why and when nfs_flush_inode() is invoked. > > I did quite a bit of work on this aspect of RHEL-5 and discovered > that this particular code was leading to some serious slowdowns. 
> The server would end up doing a very slow FILE_SYNC write when > all that was really required was an UNSTABLE write at the time. > > Did anyone actually measure this optimization and if so, what > were the numbers? As usual, the optimisation is workload dependent. The main type of workload we're targetting with this patch is the app that opens a file, writes < 4k and then closes the file. For that case, it's a no-brainer that you don't need to split a single stable write into an unstable + a commit. So if the application isn't doing the above type of short write followed by close, then exactly what is causing a flush to disk in the first place? Ordinarily, the client will try to cache writes until the cows come home (or until the VM tells it to reclaim memory - whichever comes first)... Cheers Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 15:55 ` Brian R Cowan @ 2009-05-29 16:46 ` Trond Myklebust [not found] ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-05-29 17:01 ` Chuck Lever 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 16:46 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach Look... This happens when you _flush_ the file to stable storage if there is only a single write < wsize. It isn't the business of the NFS layer to decide when you flush the file; that's an application decision... Trond On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote: > Been working this issue with Red hat, and didn't need to go to the list... > Well, now I do... You mention that "The main type of workload we're > targetting with this patch is the app that opens a file, writes < 4k and > then closes the file." Well, it appears that this issue also impacts > flushing pages from filesystem caches. > > The reason this came up in my environment is that our product's build > auditing gives the the filesystem cache an interesting workout. When > ClearCase audits a build, the build places data in a few places, > including: > 1) a build audit file that usually resides in /tmp. This build audit is > essentially a log of EVERY file open/read/write/delete/rename/etc. that > the programs called in the build script make in the clearcase "view" > you're building in. As a result, this file can get pretty large. > 2) The build outputs themselves, which in this case are being written to a > remote storage location on a Linux or Solaris server, and > 3) a file called .cmake.state, which is a local cache that is written to > after the build script completes containing what is essentially a "Bill of > materials" for the files created during builds in this "view." 
> > We believe that the build audit file access is causing build output to get > flushed out of the filesystem cache. These flushes happen *in 4k chunks.* > This trips over this change since the cache pages appear to get flushed on > an individual basis. > > One note is that if the build outputs were going to a clearcase view > stored on an enterprise-level NAS device, there isn't as much of an issue > because many of these return from the stable write request as soon as the > data goes into the battery-backed memory disk cache on the NAS. However, > it really impacts writes to general-purpose OS's that follow Sun's lead in > how they handle "stable" writes. The truly annoying part about this rather > subtle change is that the NFS client is specifically ignoring the client > mount options since we cannot force the "async" mount option to turn off > this behavior. 
> > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Peter Staubach <staubach@redhat.com> > Cc: > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, > linux-nfs@vger.kernel.org > Date: > 04/30/2009 05:23 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > > Chuck Lever wrote: > > > > > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > > >> > > >> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > > >> > > Actually, the "stable" part can be a killer. It depends upon > > why and when nfs_flush_inode() is invoked. > > > > I did quite a bit of work on this aspect of RHEL-5 and discovered > > that this particular code was leading to some serious slowdowns. > > The server would end up doing a very slow FILE_SYNC write when > > all that was really required was an UNSTABLE write at the time. > > > > Did anyone actually measure this optimization and if so, what > > were the numbers? > > As usual, the optimisation is workload dependent. The main type of > workload we're targetting with this patch is the app that opens a file, > writes < 4k and then closes the file. For that case, it's a no-brainer > that you don't need to split a single stable write into an unstable + a > commit. > > So if the application isn't doing the above type of short write followed > by close, then exactly what is causing a flush to disk in the first > place? Ordinarily, the client will try to cache writes until the cows > come home (or until the VM tells it to reclaim memory - whichever comes > first)... 
> > Cheers > Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
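Trond's distinction — it is the flush, not the write, that forces stable storage — can be illustrated with a sketch. This runs against a local temp file (hypothetical path); the comments describe what the same calls would imply on an NFS mount:

```python
import os
import tempfile

def write_cached(fd, data):
    # Only dirties the page cache; on NFS, no WRITE RPC has to go out yet.
    os.write(fd, data)

def write_and_flush(fd, data):
    # The application deciding when data reaches stable storage:
    # fsync() is what forces WRITE (and COMMIT) traffic on an NFS mount.
    os.write(fd, data)
    os.fsync(fd)

path = os.path.join(tempfile.mkdtemp(), "log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
write_cached(fd, b"buffered ")
write_and_flush(fd, b"durable")
os.close(fd)
with open(path, "rb") as f:
    content = f.read()
```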
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 17:25 ` Brian R Cowan 2009-05-29 17:35 ` Trond Myklebust 2009-05-29 17:48 ` Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Peter Staubach 1 sibling, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 17:25 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach Ah, but I submit that the application isn't making the decision... The OS is. My testcase is building Samba on Linux using gcc. The gcc linker sure isn't deciding to flush the file. It's happily seeking/reading and seeking/writing with no idea what is happening under the covers. When the build gets audited, the cache gets flushed... No audit, no flush. The only apparent difference is that we have an audit file getting written to on the local disk. The linker has no idea it's getting audited. I'm interested in knowing what kind of performance benefit this optimization is providing in small-file writes. Unless it's incredibly dramatic, then I really don't see why we can't do one of the following: 1) get rid of it, 2) find some way to not do it when the OS flushes filesystem cache, or 3) make the "async" mount option turn it off, or 4) create a new mount option to force the optimization on/off. I just don't see how a single RPC saved is saving all that much time. Since: - open - write (unstable) < write size - commit - close Depends on the commit call to finish writing to disk, and - open - write (stable) < write size - close Also depends on the time taken to write the data to disk, I can't see the one less RPC buying that much time, other than perhaps on NAS devices. This may reduce the server load, but this is ignoring the mount options. We can't turn this behavior OFF, and that's the biggest issue. 
I don't mind the small-file-write optimization itself, as long as I and my customers are able to CHOOSE whether the optimization is active. It boils down to this: when I *categorically* say that the mount is async, the OS should pay attention. There are cases when the OS doesn't know best. If the OS always knew what would work best, there wouldn't be nearly as many mount options as there are now. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> Date: 05/29/2009 12:47 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Sent by: linux-nfs-owner@vger.kernel.org Look... This happens when you _flush_ the file to stable storage if there is only a single write < wsize. It isn't the business of the NFS layer to decide when you flush the file; that's an application decision... Trond On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote: > Been working this issue with Red hat, and didn't need to go to the list... > Well, now I do... You mention that "The main type of workload we're > targetting with this patch is the app that opens a file, writes < 4k and > then closes the file." Well, it appears that this issue also impacts > flushing pages from filesystem caches. 
> > The reason this came up in my environment is that our product's build > auditing gives the the filesystem cache an interesting workout. When > ClearCase audits a build, the build places data in a few places, > including: > 1) a build audit file that usually resides in /tmp. This build audit is > essentially a log of EVERY file open/read/write/delete/rename/etc. that > the programs called in the build script make in the clearcase "view" > you're building in. As a result, this file can get pretty large. > 2) The build outputs themselves, which in this case are being written to a > remote storage location on a Linux or Solaris server, and > 3) a file called .cmake.state, which is a local cache that is written to > after the build script completes containing what is essentially a "Bill of > materials" for the files created during builds in this "view." > > We believe that the build audit file access is causing build output to get > flushed out of the filesystem cache. These flushes happen *in 4k chunks.* > This trips over this change since the cache pages appear to get flushed on > an individual basis. > > One note is that if the build outputs were going to a clearcase view > stored on an enterprise-level NAS device, there isn't as much of an issue > because many of these return from the stable write request as soon as the > data goes into the battery-backed memory disk cache on the NAS. However, > it really impacts writes to general-purpose OS's that follow Sun's lead in > how they handle "stable" writes. The truly annoying part about this rather > subtle change is that the NFS client is specifically ignoring the client > mount options since we cannot force the "async" mount option to turn off > this behavior. 
> > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. > > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Peter Staubach <staubach@redhat.com> > Cc: > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, > linux-nfs@vger.kernel.org > Date: > 04/30/2009 05:23 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > > Chuck Lever wrote: > > > > > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > > >> > > >> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > > >> > > Actually, the "stable" part can be a killer. It depends upon > > why and when nfs_flush_inode() is invoked. > > > > I did quite a bit of work on this aspect of RHEL-5 and discovered > > that this particular code was leading to some serious slowdowns. > > The server would end up doing a very slow FILE_SYNC write when > > all that was really required was an UNSTABLE write at the time. > > > > Did anyone actually measure this optimization and if so, what > > were the numbers? > > As usual, the optimisation is workload dependent. The main type of > workload we're targetting with this patch is the app that opens a file, > writes < 4k and then closes the file. For that case, it's a no-brainer > that you don't need to split a single stable write into an unstable + a > commit. 
> > So if the application isn't doing the above type of short write followed > by close, then exactly what is causing a flush to disk in the first > place? Ordinarily, the client will try to cache writes until the cows > come home (or until the VM tells it to reclaim memory - whichever comes > first)... > > Cheers > Trond > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
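The two flush strategies the thread is weighing can be illustrated with a userspace analogy (a sketch only, not NFS client code: O_SYNC stands in for the stable write, write() plus fsync() for the unstable write and commit; the file paths used to exercise it are invented). Both end with the data on stable storage; the difference is the extra call.

```c
#include <fcntl.h>
#include <unistd.h>

/* Analogy for a stable write: durability is requested up front, so a
 * single operation both writes and flushes. */
static ssize_t stable_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);   /* data is durable when this returns */
    close(fd);
    return n;
}

/* Analogy for an unstable write followed by a commit: the write may sit
 * in cache until the separate flush call. */
static ssize_t unstable_write_then_commit(const char *path, const char *buf,
                                          size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);   /* cached; not yet durable */
    if (n >= 0 && fsync(fd) != 0)      /* the separate "commit" step */
        n = -1;
    close(fd);
    return n;
}
```

Either path leaves the data flushed; Trond's point above is that the stable-write form merely saves the second round trip.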
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:25 ` Brian R Cowan @ 2009-05-29 17:35 ` Trond Myklebust [not found] ` <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 17:35 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: > Ah, but I submit that the application isn't making the decision... The OS > is. My testcase is building Samba on Linux using gcc. The gcc linker sure > isn't deciding to flush the file. It's happily seeking/reading and > seeking/writing with no idea what is happening under the covers. When the > build gets audited, the cache gets flushed... No audit, no flush. The only > apparent difference is that we have an audit file getting written to on > the local disk. The linker has no idea it's getting audited. > > I'm interested in knowing what kind of performance benefit this > optimization is providing in small-file writes. Unless it's incredibly > dramatic, then I really don't see why we can't do one of the following: > 1) get rid of it, > 2) find some way to not do it when the OS flushes filesystem cache, or > 3) make the "async" mount option turn it off, or > 4) create a new mount option to force the optimization on/off. > > I just don't see how a single RPC saved is saving all that much time. > Since: > - open > - write (unstable) <write size > - commit > - close > Depends on the commit call to finish writing to disk, and > - open > - write (stable) <write size > - close > Also depends on the time taken to writ ethe data to disk, I can't see the > one less RPC buying that much time, other than perhaps on NAS devices. > > This may reduce the server load, but this is ignoring the mount options. > We can't turn this behavior OFF, and that's the biggest issue. 
I don't > mind the small-file-write optimization itself, as long as I and my > customers are able to CHOOSE whether the optimization is active. It boils > down to this: when I *categorically* say that the mount is async, the OS > should pay attention. There are cases when the OS doesn't know best. If > the OS always knew what would work best, there wouldn't be nearly as many > mount options as there are now. What are you smoking? There is _NO_DIFFERENCE_ between what the server is supposed to do when sent a single stable write, and what it is supposed to do when sent an unstable write plus a commit. BOTH cases are supposed to result in the server writing the data to stable storage before the stable write / commit is allowed to return a reply. The extra RPC round trip (+ parsing overhead ++++) due to the commit call is the _only_ difference. No, you can't turn this behaviour off (unless you use the 'async' export option on a Linux server), but there is no difference there between the stable write and the unstable write + commit. THEY BOTH RESULT IN THE SAME BEHAVIOUR. Trond > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. 
> > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> > Date: > 05/29/2009 12:47 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > Look... This happens when you _flush_ the file to stable storage if > there is only a single write < wsize. It isn't the business of the NFS > layer to decide when you flush the file; that's an application > decision... > > Trond > > > > On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote: > > Been working this issue with Red hat, and didn't need to go to the > list... > > Well, now I do... You mention that "The main type of workload we're > > targetting with this patch is the app that opens a file, writes < 4k and > > > then closes the file." Well, it appears that this issue also impacts > > flushing pages from filesystem caches. > > > > The reason this came up in my environment is that our product's build > > auditing gives the the filesystem cache an interesting workout. When > > ClearCase audits a build, the build places data in a few places, > > including: > > 1) a build audit file that usually resides in /tmp. This build audit is > > essentially a log of EVERY file open/read/write/delete/rename/etc. that > > the programs called in the build script make in the clearcase "view" > > you're building in. As a result, this file can get pretty large. > > 2) The build outputs themselves, which in this case are being written to > a > > remote storage location on a Linux or Solaris server, and > > 3) a file called .cmake.state, which is a local cache that is written to > > > after the build script completes containing what is essentially a "Bill > of > > materials" for the files created during builds in this "view." 
> > > > We believe that the build audit file access is causing build output to > get > > flushed out of the filesystem cache. These flushes happen *in 4k > chunks.* > > This trips over this change since the cache pages appear to get flushed > on > > an individual basis. > > > > One note is that if the build outputs were going to a clearcase view > > stored on an enterprise-level NAS device, there isn't as much of an > issue > > because many of these return from the stable write request as soon as > the > > data goes into the battery-backed memory disk cache on the NAS. However, > > > it really impacts writes to general-purpose OS's that follow Sun's lead > in > > how they handle "stable" writes. The truly annoying part about this > rather > > subtle change is that the NFS client is specifically ignoring the client > > > mount options since we cannot force the "async" mount option to turn off > > > this behavior. > > > > ================================================================= > > Brian Cowan > > Advisory Software Engineer > > ClearCase Customer Advocacy Group (CAG) > > Rational Software > > IBM Software Group > > 81 Hartwell Ave > > Lexington, MA > > > > Phone: 1.781.372.3580 > > Web: http://www.ibm.com/software/rational/support/ > > > > > > Please be sure to update your PMR using ESR at > > http://www-306.ibm.com/software/support/probsub.html or cc all > > correspondence to sw_support@us.ibm.com to be sure your PMR is updated > in > > case I am not available. 
> > > > > > > > From: > > Trond Myklebust <trond.myklebust@fys.uio.no> > > To: > > Peter Staubach <staubach@redhat.com> > > Cc: > > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, > > > linux-nfs@vger.kernel.org > > Date: > > 04/30/2009 05:23 PM > > Subject: > > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page > flushing > > Sent by: > > linux-nfs-owner@vger.kernel.org > > > > > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > > > Chuck Lever wrote: > > > > > > > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > > > >> > > > >> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > > > > > >> > > > Actually, the "stable" part can be a killer. It depends upon > > > why and when nfs_flush_inode() is invoked. > > > > > > I did quite a bit of work on this aspect of RHEL-5 and discovered > > > that this particular code was leading to some serious slowdowns. > > > The server would end up doing a very slow FILE_SYNC write when > > > all that was really required was an UNSTABLE write at the time. > > > > > > Did anyone actually measure this optimization and if so, what > > > were the numbers? > > > > As usual, the optimisation is workload dependent. The main type of > > workload we're targetting with this patch is the app that opens a file, > > writes < 4k and then closes the file. For that case, it's a no-brainer > > that you don't need to split a single stable write into an unstable + a > > commit. > > > > So if the application isn't doing the above type of short write followed > > by close, then exactly what is causing a flush to disk in the first > > place? Ordinarily, the client will try to cache writes until the cows > > come home (or until the VM tells it to reclaim memory - whichever comes > > first)... 
> > > > Cheers > > Trond > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243618500.7155.56.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-30 0:22 ` Greg Banks [not found] ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Greg Banks @ 2009-05-30 0:22 UTC (permalink / raw) To: Trond Myklebust Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: >> > > What are you smoking? There is _NO_DIFFERENCE_ between what the server > is supposed to do when sent a single stable write, and what it is > supposed to do when sent an unstable write plus a commit. BOTH cases are > supposed to result in the server writing the data to stable storage > before the stable write / commit is allowed to return a reply. This probably makes no difference to the discussion, but for a Linux server there is a subtle difference between what the server is supposed to do and what it actually does. For a stable WRITE rpc, the Linux server sets O_SYNC in the struct file during the vfs_writev() call and expects the underlying filesystem to obey that flag and flush the data to disk. For a COMMIT rpc, the Linux server uses the underlying filesystem's f_op->fsync instead. This results in some potential differences: * The underlying filesystem might be broken in one code path and not the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently failing in f_op->fsync). These kinds of bugs tend to be subtle because in the absence of a crash they affect only the timing of IO and so they might not be noticed. * The underlying filesystem might be doing more or better things in one or the other code paths e.g. optimising allocations. 
* The Linux NFS server ignores the byte range in the COMMIT rpc and flushes the whole file (I suspect this is a historical accident rather than deliberate policy). If there is other dirty data on that file server-side, that other data will be written too before the COMMIT reply is sent. This may have a performance impact, depending on the workload. > The extra RPC round trip (+ parsing overhead ++++) due to the commit > call is the _only_ difference. This is almost completely true. If the server behaved ideally and predictably, this would be completely true. </pedant> -- Greg. ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-05-30 7:57 ` Christoph Hellwig 2009-06-01 22:30 ` J. Bruce Fields 2009-05-30 12:26 ` Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: Christoph Hellwig @ 2009-05-30 7:57 UTC (permalink / raw) To: Greg Banks Cc: Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote: > * The underlying filesystem might be doing more or better things in > one or the other code paths e.g. optimising allocations. Which is the case with ext3, which is pretty common. It does reasonably well on O_SYNC as far as I can see, but has a catastrophic fsync implementation. > * The Linux NFS server ignores the byte range in the COMMIT rpc and > flushes the whole file (I suspect this is a historical accident rather > than deliberate policy). If there is other dirty data on that file > server-side, that other data will be written too before the COMMIT > reply is sent. This may have a performance impact, depending on the > workload. Right now we can't actually implement that properly because the fsync file operation can't flush sub-ranges. There have been some other requests for this, but my ->fsync redesign is on hold until NFSD stops calling ->fsync without a file struct. I think the open file cache will help us with that, if we can extend it to also cache open file structs for directories. ^ permalink raw reply [flat|nested] 94+ messages in thread
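The userspace-visible cousin of the ranged flush Christoph describes is Linux's sync_file_range(2). A hedged sketch of flushing only a byte range (note that sync_file_range() does not commit file metadata or the device write cache, so by itself it is weaker than the fsync semantics a COMMIT needs; this is an illustration of the range idea, not of how nfsd would do it):

```c
#define _GNU_SOURCE        /* sync_file_range() is Linux-specific */
#include <fcntl.h>
#include <unistd.h>

/* Flush only [offset, offset + nbytes) of an already-written file,
 * waiting for the writeback to complete before returning.  Unlike
 * fsync(), this does not flush metadata -- one reason a byte-ranged
 * COMMIT was not a drop-in change. */
static int flush_range(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```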
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-30 7:57 ` Christoph Hellwig @ 2009-06-01 22:30 ` J. Bruce Fields 2009-06-05 14:54 ` Christoph Hellwig 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-01 22:30 UTC (permalink / raw) To: Krishna Kumar Cc: Greg Banks, Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach, Christoph Hellwig On Sat, May 30, 2009 at 03:57:56AM -0400, Christoph Hellwig wrote: > On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote: > > * The underlying filesystem might be doing more or better things in > > one or the other code paths e.g. optimising allocations. > > Which is the case with ext3 which is pretty common. It does reasonably > well on O_SYNC as far as I can see, but has a catastrophic fsync > implementation. > > > * The Linux NFS server ignores the byte range in the COMMIT rpc and > > flushes the whole file (I suspect this is a historical accident rather > > than deliberate policy). If there is other dirty data on that file > > server-side, that other data will be written too before the COMMIT > > reply is sent. This may have a performance impact, depending on the > > workload. > > Right now we can't actually implement that proper because the fsync > file operation can't actually flush sub ranges. There have been some > other requests for this, but my ->fsync resdesign in on hold until > NFSD stops calling ->fsync without a file struct. > > I think the open file cache will help us with that, if we can extend > it to also cache open file structs for directories. Krishna Kumar--do you think that'd be a reasonable thing to do? --b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-01 22:30 ` J. Bruce Fields @ 2009-06-05 14:54 ` Christoph Hellwig 2009-06-05 16:01 ` J. Bruce Fields 2009-06-05 16:12 ` Trond Myklebust 0 siblings, 2 replies; 94+ messages in thread From: Christoph Hellwig @ 2009-06-05 14:54 UTC (permalink / raw) To: J. Bruce Fields Cc: Krishna Kumar, Greg Banks, Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach, Christoph Hellwig On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote: > > NFSD stops calling ->fsync without a file struct. > > > > I think the open file cache will help us with that, if we can extend > > it to also cache open file structs for directories. > > Krishna Kumar--do you think that'd be a reasonable thing to do? Btw, do you have at least the basic open files cache queued for 2.6.31? ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 14:54 ` Christoph Hellwig @ 2009-06-05 16:01 ` J. Bruce Fields 2009-06-05 16:12 ` Trond Myklebust 1 sibling, 0 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-06-05 16:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Krishna Kumar, Greg Banks, Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, Jun 05, 2009 at 10:54:50AM -0400, Christoph Hellwig wrote: > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote: > > > NFSD stops calling ->fsync without a file struct. > > > > > > I think the open file cache will help us with that, if we can extend > > > it to also cache open file structs for directories. > > > > Krishna Kumar--do you think that'd be a reasonable thing to do? > > Btw, do you have at least the basic open files cache queue for 2.6.31? No. I'll try to give it a look this afternoon. --b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 14:54 ` Christoph Hellwig 2009-06-05 16:01 ` J. Bruce Fields @ 2009-06-05 16:12 ` Trond Myklebust [not found] ` <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 16:12 UTC (permalink / raw) To: Christoph Hellwig Cc: J. Bruce Fields, Krishna Kumar, Greg Banks, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote: > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote: > > > NFSD stops calling ->fsync without a file struct. > > > > > > I think the open file cache will help us with that, if we can extend > > > it to also cache open file structs for directories. > > > > Krishna Kumar--do you think that'd be a reasonable thing to do? > > Btw, do you have at least the basic open files cache queue for 2.6.31? > Now that _will_ badly screw up the write gathering heuristic... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1244218328.5410.38.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-06-05 19:54 ` J. Bruce Fields 2009-06-05 21:21 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-05 19:54 UTC (permalink / raw) To: Trond Myklebust Cc: Christoph Hellwig, Krishna Kumar, Greg Banks, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote: > On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote: > > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote: > > > > NFSD stops calling ->fsync without a file struct. > > > > > > > > I think the open file cache will help us with that, if we can extend > > > > it to also cache open file structs for directories. > > > > > > Krishna Kumar--do you think that'd be a reasonable thing to do? > > > > Btw, do you have at least the basic open files cache queue for 2.6.31? > > > > Now that _will_ badly screw up the write gathering heuristic... How? --b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 19:54 ` J. Bruce Fields @ 2009-06-05 21:21 ` Trond Myklebust 0 siblings, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 21:21 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Krishna Kumar, Greg Banks, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-06-05 at 15:54 -0400, J. Bruce Fields wrote: > On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote: > > On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote: > > > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote: > > > > > NFSD stops calling ->fsync without a file struct. > > > > > > > > > > I think the open file cache will help us with that, if we can extend > > > > > it to also cache open file structs for directories. > > > > > > > > Krishna Kumar--do you think that'd be a reasonable thing to do? > > > > > > Btw, do you have at least the basic open files cache queue for 2.6.31? > > > > > > > Now that _will_ badly screw up the write gathering heuristic... > > How? > The heuristic looks at inode->i_writecount in order to figure out how many nfsd threads are currently trying to write to the file. The reference to i_writecount is held by the struct file. The problem is that if you start sharing struct file among several nfsd threads by means of a cache, then the i_writecount will not change, and so the heuristic fails. While we won't miss it much in NFSv3 and v4, it may change the performance of the few systems out there that still believe NFSv2 is the best thing since sliced bread... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
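The failure mode Trond describes can be sketched with a toy model (a standalone illustration; the struct and field names merely echo the kernel's, and this is not kernel code): each struct file opened for write bumps the inode's writer count, so a cache that shares one struct file among threads keeps the count at 1.

```c
/* Toy model of the write-gathering input: one i_writecount bump per
 * struct file opened for write, read back as an estimate of how many
 * threads are writing the file concurrently. */
struct toy_inode { int i_writecount; };
struct toy_file  { struct toy_inode *inode; };

static struct toy_file toy_open_for_write(struct toy_inode *inode)
{
    inode->i_writecount++;            /* one bump per struct file */
    return (struct toy_file){ .inode = inode };
}

static int estimated_writers(const struct toy_inode *inode)
{
    return inode->i_writecount;       /* what the heuristic reads */
}
```

With one struct file per nfsd thread the estimate tracks the real concurrency; with a shared, cached struct file it stays pinned at 1 no matter how many threads write.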
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <ac442c870905291722x1ec811b2sda997d464898fcda-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-05-30 7:57 ` Christoph Hellwig @ 2009-05-30 12:26 ` Trond Myklebust [not found] ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-30 12:26 UTC (permalink / raw) To: Greg Banks Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: > On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust > <trond.myklebust@fys.uio.no> wrote: > > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: > >> > > > > What are you smoking? There is _NO_DIFFERENCE_ between what the server > > is supposed to do when sent a single stable write, and what it is > > supposed to do when sent an unstable write plus a commit. BOTH cases are > > supposed to result in the server writing the data to stable storage > > before the stable write / commit is allowed to return a reply. > > This probably makes no difference to the discussion, but for a Linux > server there is a subtle difference between what the server is > supposed to do and what it actually does. > > For a stable WRITE rpc, the Linux server sets O_SYNC in the struct > file during the vfs_writev() call and expects the underlying > filesystem to obey that flag and flush the data to disk. For a COMMIT > rpc, the Linux server uses the underlying filesystem's f_op->fsync > instead. This results in some potential differences: > > * The underlying filesystem might be broken in one code path and not > the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently > failing in f_op->fsync). These kinds of bugs tend to be subtle > because in the absence of a crash they affect only the timing of IO > and so they might not be noticed. 
> > * The underlying filesystem might be doing more or better things in > one or the other code paths e.g. optimising allocations. > > * The Linux NFS server ignores the byte range in the COMMIT rpc and > flushes the whole file (I suspect this is a historical accident rather > than deliberate policy). If there is other dirty data on that file > server-side, that other data will be written too before the COMMIT > reply is sent. This may have a performance impact, depending on the > workload. > > > The extra RPC round trip (+ parsing overhead ++++) due to the commit > > call is the _only_ difference. > > This is almost completely true. If the server behaved ideally and > predictably, this would be completely true. > > </pedant> > Firstly, the server only uses O_SYNC if you turn off write gathering (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs server is to always try write gathering and hence no O_SYNC. Secondly, even if it were the case, then this does not justify changing the client behaviour. The NFS protocol does not mandate, or even recommend that the server use O_SYNC. All it says is that a stable write and an unstable write+commit should both have the same result: namely that the data+metadata must have been flushed to stable storage. The protocol spec leaves it as an exercise to the server implementer to do this as efficiently as possible. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-30 12:43 ` Trond Myklebust 2009-05-30 13:02 ` Greg Banks 1 sibling, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-05-30 12:43 UTC (permalink / raw) To: Trond Myklebust Cc: Greg Banks, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On May 30, 2009, at 8:26, Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > Firstly, the server only uses O_SYNC if you turn off write gathering > (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs > server is to always try write gathering and hence no O_SYNC. > > Secondly, even if it were the case, then this does not justify > changing > the client behaviour. The NFS protocol does not mandate, or even > recommend that the server use O_SYNC. All it says is that a stable > write > and an unstable write+commit should both have the same result: namely > that the data+metadata must have been flushed to stable storage. The > protocol spec leaves it as an exercise to the server implementer to do > this as efficiently as possible. > Speaking of write gathering... Are we sure that heuristic that checks i_writecount isn't introducing spurious 10ms delays here? It seems odd for the server to do write gathering on nfsv3 writes: if the client wants to send more writes, it will set the unstable flag... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243686363.5209.16.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-05-30 12:43 ` Trond Myklebust @ 2009-05-30 13:02 ` Greg Banks [not found] ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 94+ messages in thread From: Greg Banks @ 2009-05-30 13:02 UTC (permalink / raw) To: Trond Myklebust Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: >> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust >> <trond.myklebust@fys.uio.no> wrote: >> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: >> >> >> > > Firstly, the server only uses O_SYNC if you turn off write gathering > (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs > server is to always try write gathering and hence no O_SYNC. Well, write gathering is a total crock that AFAICS only helps single-file writes on NFSv2. For today's workloads all it does is provide a hotspot on the two global variables that track writes in an attempt to gather them. Back when I worked on a server product, no_wdelay was one of the standard options for new exports. > Secondly, even if it were the case, then this does not justify changing > the client behaviour. I totally agree, it was just an observation. In any case, as Christoph points out, the ext3 performance difference makes an unstable WRITE+COMMIT slower than a stable WRITE, and you already assumed that. -- Greg. ^ permalink raw reply [flat|nested] 94+ messages in thread
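The "standard option" Greg mentions amounts to a one-line change in the server's export table. An illustrative /etc/exports entry (the export path and client pattern here are hypothetical):

```
# disable the per-WRITE gathering delay for this export
/export/build  *(rw,sync,no_wdelay)
```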
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-06-01 22:30 ` J. Bruce Fields 2009-06-02 15:00 ` Chuck Lever 1 sibling, 0 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-06-01 22:30 UTC (permalink / raw) To: Greg Banks Cc: Trond Myklebust, Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Sat, May 30, 2009 at 11:02:47PM +1000, Greg Banks wrote: > On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust > <trond.myklebust@fys.uio.no> wrote: > > On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: > >> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust > >> <trond.myklebust@fys.uio.no> wrote: > >> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: > >> >> > >> > > > > Firstly, the server only uses O_SYNC if you turn off write gathering > > (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs > > server is to always try write gathering and hence no O_SYNC. > > Well, write gathering is a total crock that AFAICS only helps > single-file writes on NFSv2. For today's workloads all it does is > provide a hotspot on the two global variables that track writes in an > attempt to gather them. Back when I worked on a server product, > no_wdelay was one of the standard options for new exports. Should be a simple nfs-utils patch to change the default. --b. > > > Secondly, even if it were the case, then this does not justify changing > > the client behaviour. > > I totally agree, it was just an observation. > > In any case, as Christoph points out, the ext3 performance difference > makes an unstable WRITE+COMMIT slower than a stable WRITE, and you > already assumed that. > > -- > Greg. 
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <ac442c870905300602v6950ec42y5195d2d6ea7dd4c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-06-01 22:30 ` J. Bruce Fields @ 2009-06-02 15:00 ` Chuck Lever 2009-06-02 17:27 ` Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: Chuck Lever @ 2009-06-02 15:00 UTC (permalink / raw) To: Greg Banks Cc: Trond Myklebust, Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach On May 30, 2009, at 9:02 AM, Greg Banks wrote: > On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust > <trond.myklebust@fys.uio.no> wrote: >> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: >>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust >>> <trond.myklebust@fys.uio.no> wrote: >>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: >>>>> >>> >> >> Firstly, the server only uses O_SYNC if you turn off write gathering >> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs >> server is to always try write gathering and hence no O_SYNC. > > Well, write gathering is a total crock that AFAICS only helps > single-file writes on NFSv2. For today's workloads all it does is > provide a hotspot on the two global variables that track writes in an > attempt to gather them. Back when I worked on a server product, > no_wdelay was one of the standard options for new exports. Really? Even for NFSv3/4 FILE_SYNC? I can understand that it wouldn't have any real effect on UNSTABLE. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-02 15:00 ` Chuck Lever @ 2009-06-02 17:27 ` Trond Myklebust [not found] ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-02 17:27 UTC (permalink / raw) To: Chuck Lever Cc: Greg Banks, Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote: > On May 30, 2009, at 9:02 AM, Greg Banks wrote: > > On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust > > <trond.myklebust@fys.uio.no> wrote: > >> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: > >>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust > >>> <trond.myklebust@fys.uio.no> wrote: > >>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: > >>>>> > >>> > >> > >> Firstly, the server only uses O_SYNC if you turn off write gathering > >> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs > >> server is to always try write gathering and hence no O_SYNC. > > > > Well, write gathering is a total crock that AFAICS only helps > > single-file writes on NFSv2. For today's workloads all it does is > > provide a hotspot on the two global variables that track writes in an > > attempt to gather them. Back when I worked on a server product, > > no_wdelay was one of the standard options for new exports. > > Really? Even for NFSv3/4 FILE_SYNC? I can understand that it > wouldn't have any real effect on UNSTABLE. The question is why would a sensible client ever want to send more than 1 NFSv3 write with FILE_SYNC? If you need to send multiple writes in parallel to the same file, then it makes much more sense to use UNSTABLE. Write gathering relies on waiting an arbitrary length of time in order to see if someone is going to send another write. 
The protocol offers no guidance as to how long that wait should be, and so (at least on the Linux server) we've coded in a hard wait of 10ms if and only if we see that something else has the file open for writing. One problem with the Linux implementation is that the "something else" could be another nfs server thread that happens to be in nfsd_write(), however it could also be another open NFSv4 stateid, or a NLM lock, or a local process that has the file open for writing. Another problem is that the nfs server keeps a record of the last file that was accessed, and also waits if it sees you are writing again to that same file. Of course it has no idea if this is truly a parallel write, or if it just happens that you are writing again to the same file using O_SYNC... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
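The write-gathering heuristic Trond describes can be sketched roughly as follows. This is an illustrative model, not the actual nfsd code: names and types are invented, and the real server tracks this state in global variables, but the decision logic — delay the sync only when another writer has the file open or when the same file was just written — matches the behavior described above.

```c
#include <stdbool.h>

/* Illustrative state for the write-gathering decision (hypothetical
 * names; the real server keeps a global record of the last file
 * written and consults the inode's writer count). */
struct gather_state {
    unsigned long last_ino;     /* last file written to (global record) */
    int           writecount;   /* how many openers have the file for write */
};

/* Return true if the server should wait ~10ms before syncing a
 * stable write, hoping to gather another write to the same file. */
static bool should_delay_sync(struct gather_state *st, unsigned long ino)
{
    bool delay = false;

    if (st->writecount > 1)     /* something else has the file open for write */
        delay = true;
    if (st->last_ino == ino)    /* we just wrote this same file */
        delay = true;

    st->last_ino = ino;         /* remember the file for the next write */
    return delay;
}
```

As the message notes, both triggers can misfire: the "other writer" may be another nfsd thread, and a repeat write to the same file may simply be a sequential O_SYNC writer rather than a parallel one.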
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-06-02 18:15 ` Chuck Lever 2009-06-03 16:22 ` Carlos Carvalho 1 sibling, 0 replies; 94+ messages in thread From: Chuck Lever @ 2009-06-02 18:15 UTC (permalink / raw) To: Trond Myklebust Cc: Greg Banks, Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach On Jun 2, 2009, at 1:27 PM, Trond Myklebust wrote: > On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote: >> On May 30, 2009, at 9:02 AM, Greg Banks wrote: >>> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust >>> <trond.myklebust@fys.uio.no> wrote: >>>> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote: >>>>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust >>>>> <trond.myklebust@fys.uio.no> wrote: >>>>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote: >>>>>>> >>>>> >>>> >>>> Firstly, the server only uses O_SYNC if you turn off write >>>> gathering >>>> (a.k.a. the 'wdelay' option). The default behaviour for the Linux >>>> nfs >>>> server is to always try write gathering and hence no O_SYNC. >>> >>> Well, write gathering is a total crock that AFAICS only helps >>> single-file writes on NFSv2. For today's workloads all it does is >>> provide a hotspot on the two global variables that track writes in >>> an >>> attempt to gather them. Back when I worked on a server product, >>> no_wdelay was one of the standard options for new exports. >> >> Really? Even for NFSv3/4 FILE_SYNC? I can understand that it >> wouldn't have any real effect on UNSTABLE. > > The question is why would a sensible client ever want to send more > than > 1 NFSv3 write with FILE_SYNC? A client might behave this way if an application was performing random 4KB synchronous writes to a large file, or the VM is aggressively flushing single pages to try to mitigate a low-memory situation. IOW it may not be up to the client... 
Penalizing FILE_SYNC writes, even a little, by waiting a bit could also reduce the server's workload by slowing clients that are pounding a server with synchronous writes. Not an argument, really... but it seems like there are some scenarios where delaying synchronous writes could still be useful. The real question is whether these scenarios occur frequently enough to warrant the overhead in the server. It would be nice to see some I/O trace data. > If you need to send multiple writes in > parallel to the same file, then it makes much more sense to use > UNSTABLE. Yep, agreed. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243963631.4868.124.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-06-02 18:15 ` Chuck Lever @ 2009-06-03 16:22 ` Carlos Carvalho 2009-06-03 17:10 ` Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: Carlos Carvalho @ 2009-06-03 16:22 UTC (permalink / raw) To: linux-nfs Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27: >Write gathering relies on waiting an arbitrary length of time in order >to see if someone is going to send another write. The protocol offers no >guidance as to how long that wait should be, and so (at least on the >Linux server) we've coded in a hard wait of 10ms if and only if we see >that something else has the file open for writing. >One problem with the Linux implementation is that the "something else" >could be another nfs server thread that happens to be in nfsd_write(), >however it could also be another open NFSv4 stateid, or a NLM lock, or a >local process that has the file open for writing. >Another problem is that the nfs server keeps a record of the last file >that was accessed, and also waits if it sees you are writing again to >that same file. Of course it has no idea if this is truly a parallel >write, or if it just happens that you are writing again to the same file >using O_SYNC... I think the decision to write or wait doesn't belong to the nfs server; it should just send the writes immediately. It's up to the fs/block/device layers to do the gathering. I understand that the client should try to do the gathering before sending the request to the wire. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-03 16:22 ` Carlos Carvalho @ 2009-06-03 17:10 ` Trond Myklebust [not found] ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org> ` (2 more replies) 0 siblings, 3 replies; 94+ messages in thread From: Trond Myklebust @ 2009-06-03 17:10 UTC (permalink / raw) To: Carlos Carvalho; +Cc: linux-nfs On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote: > Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27: > >Write gathering relies on waiting an arbitrary length of time in order > >to see if someone is going to send another write. The protocol offers no > >guidance as to how long that wait should be, and so (at least on the > >Linux server) we've coded in a hard wait of 10ms if and only if we see > >that something else has the file open for writing. > >One problem with the Linux implementation is that the "something else" > >could be another nfs server thread that happens to be in nfsd_write(), > >however it could also be another open NFSv4 stateid, or a NLM lock, or a > >local process that has the file open for writing. > >Another problem is that the nfs server keeps a record of the last file > >that was accessed, and also waits if it sees you are writing again to > >that same file. Of course it has no idea if this is truly a parallel > >write, or if it just happens that you are writing again to the same file > >using O_SYNC... > > I think the decision to write or wait doesn't belong to the nfs > server; it should just send the writes immediately. It's up to the > fs/block/device layers to do the gathering. I understand that the > client should try to do the gathering before sending the request to > the wire This isn't something that we've just pulled out of a hat. It dates back to pre-NFSv3 times, when every write had to be synchronously committed to disk before the RPC call could return. 
See, for instance, http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What+is+nfs+write+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3 The point is that while it is a good idea for NFSv2, we have much better methods of dealing with multiple writes in NFSv3 and v4... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
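The cost difference behind this point can be put in back-of-the-envelope form. This is an assumed cost model, not kernel code: under NFSv2 every WRITE must reach stable storage before the reply, so gathering several writes under one sync pays off; under NFSv3/v4 a burst of UNSTABLE writes needs only one sync when the trailing COMMIT arrives.

```c
/* Assumed cost model: disk syncs forced per burst of n write RPCs. */

static unsigned int disk_syncs_nfsv2(unsigned int nwrites)
{
    return nwrites;             /* NFSv2: one forced sync per WRITE RPC */
}

static unsigned int disk_syncs_nfsv3_unstable(unsigned int nwrites)
{
    return nwrites ? 1 : 0;     /* NFSv3+: one sync at the trailing COMMIT */
}
```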
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-03 17:10 ` Trond Myklebust [not found] ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org> @ 2009-06-03 21:28 ` Dean Hildebrand 2009-06-04 2:16 ` Carlos Carvalho 2009-06-04 17:42 ` Brian R Cowan 2 siblings, 1 reply; 94+ messages in thread From: Dean Hildebrand @ 2009-06-03 21:28 UTC (permalink / raw) To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs Trond Myklebust wrote: > On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote: > >> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27: >> >Write gathering relies on waiting an arbitrary length of time in order >> >to see if someone is going to send another write. The protocol offers no >> >guidance as to how long that wait should be, and so (at least on the >> >Linux server) we've coded in a hard wait of 10ms if and only if we see >> >that something else has the file open for writing. >> >One problem with the Linux implementation is that the "something else" >> >could be another nfs server thread that happens to be in nfsd_write(), >> >however it could also be another open NFSv4 stateid, or a NLM lock, or a >> >local process that has the file open for writing. >> >Another problem is that the nfs server keeps a record of the last file >> >that was accessed, and also waits if it sees you are writing again to >> >that same file. Of course it has no idea if this is truly a parallel >> >write, or if it just happens that you are writing again to the same file >> >using O_SYNC... >> >> I think the decision to write or wait doesn't belong to the nfs >> server; it should just send the writes immediately. It's up to the >> fs/block/device layers to do the gathering. I understand that the >> client should try to do the gathering before sending the request to >> the wire >> Just to be clear, the linux NFS server does not gather the writes. Writes are passed immediately to the fs. 
nfsd simply waits 10ms before sync'ing the writes to disk. This allows the underlying file system time to do the gathering and sync data in larger chunks. Of course, this is only for stables writes and wdelay is enabled for the export. Dean > > This isn't something that we've just pulled out of a hat. It dates back > to pre-NFSv3 times, when every write had to be synchronously committed > to disk before the RPC call could return. > > See, for instance, > > http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What > +is+nfs+write > +gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3 > > The point is that while it is a good idea for NFSv2, we have much better > methods of dealing with multiple writes in NFSv3 and v4... > > Trond > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-03 21:28 ` Dean Hildebrand @ 2009-06-04 2:16 ` Carlos Carvalho 0 siblings, 0 replies; 94+ messages in thread From: Carlos Carvalho @ 2009-06-04 2:16 UTC (permalink / raw) To: linux-nfs Dean Hildebrand (seattleplus@gmail.com) wrote on 3 June 2009 17:28: >Trond Myklebust wrote: >> On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote: >> >>> Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27: >>> >Write gathering relies on waiting an arbitrary length of time in order >>> >to see if someone is going to send another write. The protocol offers no >>> >guidance as to how long that wait should be, and so (at least on the >>> >Linux server) we've coded in a hard wait of 10ms if and only if we see >>> >that something else has the file open for writing. >>> >One problem with the Linux implementation is that the "something else" >>> >could be another nfs server thread that happens to be in nfsd_write(), >>> >however it could also be another open NFSv4 stateid, or a NLM lock, or a >>> >local process that has the file open for writing. >>> >Another problem is that the nfs server keeps a record of the last file >>> >that was accessed, and also waits if it sees you are writing again to >>> >that same file. Of course it has no idea if this is truly a parallel >>> >write, or if it just happens that you are writing again to the same file >>> >using O_SYNC... >>> >>> I think the decision to write or wait doesn't belong to the nfs >>> server; it should just send the writes immediately. It's up to the >>> fs/block/device layers to do the gathering. I understand that the >>> client should try to do the gathering before sending the request to >>> the wire >>> >Just to be clear, the linux NFS server does not gather the writes. >Writes are passed immediately to the fs. Ah! That's much better. >nfsd simply waits 10ms before >sync'ing the writes to disk. 
>This allows the underlying file system >time to do the gathering and sync data in larger chunks. OK, all is perfectly fine then. Since syncs seem to be a requirement of the protocol, perhaps the 10ms delay could be made tunable to allow admins more flexibility. For example, if we change other timeouts we could adjust the nfs sync one accordingly. Could be an option to nfsd or, better, a variable in /proc. Thanks Dean and Trond for the explanations. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-03 17:10 ` Trond Myklebust [not found] ` <OFB53BFCCB.0CEC7A7E-ON852575C <1244138698.5203.59.camel@heimdal.trondhjem.org> 2009-06-03 21:28 ` Dean Hildebrand @ 2009-06-04 17:42 ` Brian R Cowan 2009-06-04 18:04 ` Trond Myklebust 2 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-06-04 17:42 UTC (permalink / raw) To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner I've been looking in more detail in the network traces that started all this, and doing some additional testing with the 2.6.29 kernel in an NFS-only build... In brief: 1) RHEL 5 generates >3x the network write traffic than RHEL4 when linking Samba's smbd. 2) In RHEL 5, Those unnecessary writes are slowed down by the "FILE_SYNC" optimization put in place for small writes. 3) That optimization seems to be removed from the kernel somewhere between 2.6.18 and 2.6.29. 4) Unfortunately the "unnecessary write before read" behavior is still present in 2.6.29. In detail: In RHEL 5, I see a lot of reads from offset {whatever} *immediately* preceded by a write to *the same offset*. This is obviously a bad thing, now the trick is finding out where it is coming from. The write-before-read behavior is happening on the smbd file itself (not surprising since that's the only file we're writing in this test...). This happens with every 2.6.18 and later kernel I've tested to date. In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take something on the order of 10ms to come back. When using a 2.6.29 kernel, the TOTAL time for the write+commit rpc set (write rpc, write reply, commit rpc, commit reply), to come back is something like 2ms. I guess the NFS servers aren't handling FILE_SYNC writes very well. in 2.6.29, ALL the write calls appear to be unstable writes, in RHEL5, most are FILE_SYNC writes. (Network traces available upon request.) 
Neither is quite as fast as RHEL 4, because the link under RHEL 4 only puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500 when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a similar number of COMMITs, on the wire. The bottom line: * If someone can help me find where 2.6 stopped setting small writes to FILE_SYNC, I'd appreciate it. It would save me time walking through >50 commitdiffs in gitweb... * Is this the correct place to start discussing the annoying write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 continues? ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Carlos Carvalho <carlos@fisica.ufpr.br> Cc: linux-nfs@vger.kernel.org Date: 06/03/2009 01:10 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Sent by: linux-nfs-owner@vger.kernel.org On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote: > Trond Myklebust (trond.myklebust@fys.uio.no) wrote on 2 June 2009 13:27: > >Write gathering relies on waiting an arbitrary length of time in order > >to see if someone is going to send another write. The protocol offers no > >guidance as to how long that wait should be, and so (at least on the > >Linux server) we've coded in a hard wait of 10ms if and only if we see > >that something else has the file open for writing. 
> >One problem with the Linux implementation is that the "something else" > >could be another nfs server thread that happens to be in nfsd_write(), > >however it could also be another open NFSv4 stateid, or a NLM lock, or a > >local process that has the file open for writing. > >Another problem is that the nfs server keeps a record of the last file > >that was accessed, and also waits if it sees you are writing again to > >that same file. Of course it has no idea if this is truly a parallel > >write, or if it just happens that you are writing again to the same file > >using O_SYNC... > > I think the decision to write or wait doesn't belong to the nfs > server; it should just send the writes immediately. It's up to the > fs/block/device layers to do the gathering. I understand that the > client should try to do the gathering before sending the request to > the wire This isn't something that we've just pulled out of a hat. It dates back to pre-NFSv3 times, when every write had to be synchronously committed to disk before the RPC call could return. See, for instance, http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What +is+nfs+write +gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3 The point is that while it is a good idea for NFSv2, we have much better methods of dealing with multiple writes in NFSv3 and v4... Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 17:42 ` Brian R Cowan @ 2009-06-04 18:04 ` Trond Myklebust 2009-06-04 20:43 ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan 2009-06-24 19:54 ` [PATCH] read-modify-write page updating Peter Staubach 0 siblings, 2 replies; 94+ messages in thread From: Trond Myklebust @ 2009-06-04 18:04 UTC (permalink / raw) To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner On Thu, 2009-06-04 at 13:42 -0400, Brian R Cowan wrote: > I've been looking in more detail in the network traces that started all > this, and doing some additional testing with the 2.6.29 kernel in an > NFS-only build... > > In brief: > 1) RHEL 5 generates >3x the network write traffic than RHEL4 when linking > Samba's smbd. > 2) In RHEL 5, Those unnecessary writes are slowed down by the "FILE_SYNC" > optimization put in place for small writes. > 3) That optimization seems to be removed from the kernel somewhere between > 2.6.18 and 2.6.29. > 4) Unfortunately the "unnecessary write before read" behavior is still > present in 2.6.29. > > In detail: > In RHEL 5, I see a lot of reads from offset {whatever} *immediately* > preceded by a write to *the same offset*. This is obviously a bad thing, > now the trick is finding out where it is coming from. The > write-before-read behavior is happening on the smbd file itself (not > surprising since that's the only file we're writing in this test...). This > happens with every 2.6.18 and later kernel I've tested to date. > > In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take > something on the order of 10ms to come back. When using a 2.6.29 kernel, > the TOTAL time for the write+commit rpc set (write rpc, write reply, > commit rpc, commit reply), to come back is something like 2ms. I guess the > NFS servers aren't handling FILE_SYNC writes very well. 
in 2.6.29, ALL the > write calls appear to be unstable writes, in RHEL5, most are FILE_SYNC > writes. (Network traces available upon request.) Did you try turning off write gathering on the server (i.e. add the 'no_wdelay' export option)? As I said earlier, that forces a delay of 10ms per RPC call, which might explain the FILE_SYNC slowness. > Neither is quite as fast as RHEL 4, because the link under RHEL 4 only > puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500 > when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a > similar number of COMMITs, on the wire. > > The bottom line: > * If someone can help me find where 2.6 stopped setting small writes to > FILE_SYNC, I'd appreciate it. It would save me time walking through >50 > commitdiffs in gitweb... It still does set FILE_SYNC for single page writes. > * Is this the correct place to start discussing the annoying > write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 > continues? Yes, but you'll need to tell us a bit more about the write patterns. Are these random writes, or are they sequential? Is there any file locking involved? As I've said earlier in this thread, all NFS clients will flush out the dirty data if a page that is being attempted read also contains uninitialised areas. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
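The client-side choice Trond refers to can be sketched as below. This is assumed logic, simplified from the commit quoted at the start of the thread, with an invented function name: when a flush consists of a single request, a stable (FILE_SYNC) write saves the separate COMMIT round trip; larger flushes go out UNSTABLE so the server can batch one sync at COMMIT time. The enum values match the NFSv3 wire constants for stable_how.

```c
/* NFSv3 stable_how wire values: UNSTABLE=0, DATA_SYNC=1, FILE_SYNC=2. */
enum stable_how { NFS_UNSTABLE, NFS_DATA_SYNC, NFS_FILE_SYNC };

/* Hypothetical sketch: pick the stability level for a flush of n
 * outstanding write requests on an inode. */
static enum stable_how choose_stable_how(unsigned int requests_to_flush)
{
    /* "For single writes, FLUSH_STABLE is more efficient": one RPC
     * instead of WRITE + COMMIT -- provided the server syncs cheaply. */
    return requests_to_flush == 1 ? NFS_FILE_SYNC : NFS_UNSTABLE;
}
```

As the thread shows, the trade-off hinges on the server: with write gathering adding ~10ms per stable write, the "efficient" single-RPC path can end up far slower than WRITE+COMMIT.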
* Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 18:04 ` Trond Myklebust @ 2009-06-04 20:43 ` Brian R Cowan 2009-06-04 20:57 ` Trond Myklebust ` (2 more replies) 2009-06-24 19:54 ` [PATCH] read-modify-write page updating Peter Staubach 1 sibling, 3 replies; 94+ messages in thread From: Brian R Cowan @ 2009-06-04 20:43 UTC (permalink / raw) To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 PM: > Did you try turning off write gathering on the server (i.e. add the > 'no_wdelay' export option)? As I said earlier, that forces a delay of > 10ms per RPC call, which might explain the FILE_SYNC slowness. Just tried it, this seems to be a very useful workaround as well. The FILE_SYNC write calls come back in about the same amount of time as the write+commit pairs... Speeds up building regardless of the network filesystem (ClearCase MVFS or straight NFS). > > The bottom line: > > * If someone can help me find where 2.6 stopped setting small writes to > > FILE_SYNC, I'd appreciate it. It would save me time walking through >50 > > commitdiffs in gitweb... > > It still does set FILE_SYNC for single page writes. Well, the network trace *seems* to say otherwise, but that could be because the 2.6.29 kernel is now reliably following a code path that doesn't set up to do FILE_SYNC writes for these flushes... Just like the RHEL 5 traces didn't have every "small" write to the link output file go out as a FILE_SYNC write. > > > * Is this the correct place to start discussing the annoying > > write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29 > > continues? > > Yes, but you'll need to tell us a bit more about the write patterns. Are > these random writes, or are they sequential? Is there any file locking > involved? Well, it's just a link, so it's random read/write traffic. 
(read object file/library, add stuff to output file, seek somewhere else and update a table, etc., etc.) All I did here was build Samba over nfs, remove bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces show that the file is opened "UNCHECKED" when doing the build in straight NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does not seem to impact the behavior. We never lock the output file. The write-before-read happens all over the place. And when we did straces and lined up the call times, is it a read operation triggering the write. > > As I've said earlier in this thread, all NFS clients will flush out the > dirty data if a page that is being attempted read also contains > uninitialised areas. What I'm trying to understand is why RHEL 4 is not flushing anywhere near as often. Either RHEL4 erred on the side of not writing, and RHEL5 is erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but it still flushes a lot more than RHEL 4 does. In any event, that doesn't help us here since 1) ClearCase can't work with that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3) the amount of code review my customer would have to go through to get the whole kernel vetted for use in their environment is frightening. ^ permalink raw reply [flat|nested] 94+ messages in thread
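The flush-before-read condition mentioned at the end of the message above can be sketched as follows. This is assumed logic with invented names, not the actual kernel routine: a cached page that is dirty only in part cannot satisfy a read, because the rest of its bytes must come from the server, so the dirty span is flushed first and the page re-read — producing exactly the write-before-read pattern seen in the traces.

```c
#include <stdbool.h>

/* Hypothetical sketch: must a read of this page flush dirty data first?
 * dirty_start/dirty_end bound the dirty byte span within the page. */
static bool read_needs_flush(bool page_dirty,
                             unsigned int dirty_start,
                             unsigned int dirty_end,
                             unsigned int page_size)
{
    if (!page_dirty)
        return false;   /* clean page: read can be served as usual */
    if (dirty_start == 0 && dirty_end == page_size)
        return false;   /* fully dirty page: serve the read locally */
    return true;        /* partially dirty page: flush, then re-read */
}
```

Random read/write traffic like a linker's (small table updates scattered through the output file) hits the partially-dirty case constantly, which would explain why the pattern shows up "all over the place" in this workload.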
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 20:43 ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan @ 2009-06-04 20:57 ` Trond Myklebust 2009-06-04 21:30 ` Brian R Cowan 2009-06-04 21:07 ` Peter Staubach 2009-06-05 11:35 ` Steve Dickson 2 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-04 20:57 UTC (permalink / raw) To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote: > What I'm trying to understand is why RHEL 4 is not flushing anywhere near > as often. Either RHEL4 erred on the side of not writing, and RHEL5 is > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but > it still flushes a lot more than RHEL 4 does. Most of that increase is probably mainly due to the changes to the way stat() works. More precisely, it would be due to this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a which went into Linux 2.6.16 in order to fix a posix compatibility issue. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 20:57 ` Trond Myklebust @ 2009-06-04 21:30 ` Brian R Cowan 2009-06-04 21:48 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-06-04 21:30 UTC (permalink / raw) To: Trond Myklebust; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner I'll have to see if/how this impacts the flush behavior. I don't THINK we are doing getattrs in the middle of the link, but the trace information kind of went astray when the VMs got reverted to base OS. Also, your recommended workaround of setting no_wdelay only works if the NFS server is Linux; the option isn't available on Solaris or HP-UX. This limits its usefulness in heterogeneous environments. Solaris 10 doesn't support async NFS exports, and we've already discussed how the small-write optimization overrides write behavior on async mounts. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Carlos Carvalho <carlos@fisica.ufpr.br>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org Date: 06/04/2009 04:57 PM Subject: Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote: > What I'm trying to understand is why RHEL 4 is not flushing anywhere near > as often. 
Either RHEL4 erred on the side of not writing, and RHEL5 is > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but > it still flushes a lot more than RHEL 4 does. Most of that increase is probably mainly due to the changes to the way stat() works. More precisely, it would be due to this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a which went into Linux 2.6.16 in order to fix a posix compatibility issue. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 21:30 ` Brian R Cowan @ 2009-06-04 21:48 ` Trond Myklebust 0 siblings, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-06-04 21:48 UTC (permalink / raw) To: Brian R Cowan; +Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner Well, that's a good reason to get rid of those Solaris servers. :-) Seriously, though, we do _not_ fix server bugs by changing the client. If we had been doing something that was incorrect, or not recommended by the NFS spec, then matters would be different... Trond On Jun 4, 2009, at 17:30, Brian R Cowan <brcowan@us.ibm.com> wrote: > I'll have to see if/how this impacts the flush behavior. I don't > THINK we > are doing getattrs in the middle of the link, but the trace > information > kind of went astray when the VMs got reverted to base OS. > > Also, your recommended workaround of setting no_wdelay only works if > the > NFS server is Linux; the option isn't available on Solaris or HP-UX. > This > limits its usefulness in heterogeneous environments. Solaris 10 > doesn't > support async NFS exports, and we've already discussed how the > small-write > optimization overrides write behavior on async mounts. > > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is > updated in > case I am not available. 
> > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Carlos Carvalho <carlos@fisica.ufpr.br>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org > Date: > 06/04/2009 04:57 PM > Subject: > Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write > NFS > I/O performance degraded by FLUSH_STABLE page flushing > > > > On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote: >> What I'm trying to understand is why RHEL 4 is not flushing anywhere > near >> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is >> erring on the opposite side, or RHEL5 is doing unnecessary flushes... > I've >> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived >> kernels, > but >> it still flushes a lot more than RHEL 4 does. > > Most of that increase is probably mainly due to the changes to the way > stat() works. More precisely, it would be due to this patch: > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a > > > which went into Linux 2.6.16 in order to fix a posix compatibility > issue. > > Trond > > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 20:43 ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan 2009-06-04 20:57 ` Trond Myklebust @ 2009-06-04 21:07 ` Peter Staubach 2009-06-04 21:39 ` Brian R Cowan 2009-06-05 11:35 ` Steve Dickson 2 siblings, 1 reply; 94+ messages in thread From: Peter Staubach @ 2009-06-04 21:07 UTC (permalink / raw) To: Brian R Cowan Cc: Trond Myklebust, Carlos Carvalho, linux-nfs, linux-nfs-owner Brian R Cowan wrote: > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 > PM: > > >> Did you try turning off write gathering on the server (i.e. add the >> 'no_wdelay' export option)? As I said earlier, that forces a delay of >> 10ms per RPC call, which might explain the FILE_SYNC slowness. >> > > Just tried it, this seems to be a very useful workaround as well. The > FILE_SYNC write calls come back in about the same amount of time as the > write+commit pairs... Speeds up building regardless of the network > filesystem (ClearCase MVFS or straight NFS). > > >>> The bottom line: >>> * If someone can help me find where 2.6 stopped setting small writes >>> > to > >>> FILE_SYNC, I'd appreciate it. It would save me time walking through >>> >> 50 >> >>> commitdiffs in gitweb... >>> >> It still does set FILE_SYNC for single page writes. >> > > Well, the network trace *seems* to say otherwise, but that could be > because the 2.6.29 kernel is now reliably following a code path that > doesn't set up to do FILE_SYNC writes for these flushes... Just like the > RHEL 5 traces didn't have every "small" write to the link output file go > out as a FILE_SYNC write. > > >>> * Is this the correct place to start discussing the annoying >>> write-before-almost-every-read behavior that 2.6.18 picked up and >>> > 2.6.29 > >>> continues? >>> >> Yes, but you'll need to tell us a bit more about the write patterns. 
Are >> these random writes, or are they sequential? Is there any file locking >> involved? >> > > Well, it's just a link, so it's random read/write traffic. (read object > file/library, add stuff to output file, seek somewhere else and update a > table, etc., etc.) All I did here was build Samba over nfs, remove > bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces > show that the file is opened "UNCHECKED" when doing the build in straight > NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does > not seem to impact the behavior. We never lock the output file. The > write-before-read happens all over the place. And when we did straces and > lined up the call times, is it a read operation triggering the write. > > >> As I've said earlier in this thread, all NFS clients will flush out the >> dirty data if a page that is being attempted read also contains >> uninitialised areas. >> > > What I'm trying to understand is why RHEL 4 is not flushing anywhere near > as often. Either RHEL4 erred on the side of not writing, and RHEL5 is > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but > it still flushes a lot more than RHEL 4 does. > > I think that you are making a lot of assumptions here, that are not necessarily backed by the evidence. The base cause here seems more likely to me to be the setting of PG_uptodate being different on the different releases, ie. RHEL-4, RHEL-5, and 2.6.29. All of these kernels contain the support to write out pages which are not marked as PG_uptodate. ps > In any event, that doesn't help us here since 1) ClearCase can't work with > that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3) > the amount of code review my customer would have to go through to get the > whole kernel vetted for use in their environment is frightening. 
> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 21:07 ` Peter Staubach @ 2009-06-04 21:39 ` Brian R Cowan 0 siblings, 0 replies; 94+ messages in thread From: Brian R Cowan @ 2009-06-04 21:39 UTC (permalink / raw) To: Peter Staubach Cc: Carlos Carvalho, linux-nfs, linux-nfs-owner, Trond Myklebust Peter Staubach <staubach@redhat.com> wrote on 06/04/2009 05:07:29 PM: > > What I'm trying to understand is why RHEL 4 is not flushing anywhere near > > as often. Either RHEL4 erred on the side of not writing, and RHEL5 is > > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've > > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but > > it still flushes a lot more than RHEL 4 does. > > > > > > I think that you are making a lot of assumptions here, that > are not necessarily backed by the evidence. The base cause > here seems more likely to me to be the setting of PG_uptodate > being different on the different releases, ie. RHEL-4, RHEL-5, > and 2.6.29. All of these kernels contain the support to > write out pages which are not marked as PG_uptodate. > > ps I'm trying to find out why the paging/flushing is happening. It's incredibly trivial to reproduce: just link something large over NFS. RHEL4 writes to the smbd file about 150x, RHEL 5 writes to it more than 500x, and 2.6.29 writes about 340x. I have network traces showing that. I'm now trying to understand why, so we can determine if there is anything that can be done about it... Trond's note about a getattr change that went into 2.6.16 may be important since we have also seen this slowdown on SuSE 10, which is based on 2.6.16 kernels. I'm just a little unsure of why the gcc linker would be calling getattr... Time to collect more straces, I guess, and then to see what happens under the covers... 
(Be just my luck if the seek eventually causes nfs_getattr to be called, though it would certainly explain the behavior.) ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-04 20:43 ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan 2009-06-04 20:57 ` Trond Myklebust 2009-06-04 21:07 ` Peter Staubach @ 2009-06-05 11:35 ` Steve Dickson 2009-06-05 12:46 ` Trond Myklebust ` (3 more replies) 2 siblings, 4 replies; 94+ messages in thread From: Steve Dickson @ 2009-06-05 11:35 UTC (permalink / raw) To: Neil Brown, Greg Banks; +Cc: Brian R Cowan, linux-nfs Brian R Cowan wrote: > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 > PM: > >> Did you try turning off write gathering on the server (i.e. add the >> 'no_wdelay' export option)? As I said earlier, that forces a delay of >> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > Just tried it, this seems to be a very useful workaround as well. The > FILE_SYNC write calls come back in about the same amount of time as the > write+commit pairs... Speeds up building regardless of the network > filesystem (ClearCase MVFS or straight NFS). Does anybody have the history as to why 'no_wdelay' is an export default? As Brian mentioned later in this thread it only helps Linux servers, but that's a good thing, IMHO. ;-) So I would have no problem changing the default export options in nfs-utils, but it would be nice to know why it was there in the first place... Neil, Greg?? steved. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 11:35 ` Steve Dickson @ 2009-06-05 12:46 ` Trond Myklebust 2009-06-05 13:03 ` Brian R Cowan 2009-06-05 13:05 ` Tom Talpey ` (2 subsequent siblings) 3 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 12:46 UTC (permalink / raw) To: Steve Dickson; +Cc: Neil Brown, Greg Banks, Brian R Cowan, linux-nfs On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote: > Brian R Cowan wrote: > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 > > PM: > > > >> Did you try turning off write gathering on the server (i.e. add the > >> 'no_wdelay' export option)? As I said earlier, that forces a delay of > >> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > > > Just tried it, this seems to be a very useful workaround as well. The > > FILE_SYNC write calls come back in about the same amount of time as the > > write+commit pairs... Speeds up building regardless of the network > > filesystem (ClearCase MVFS or straight NFS). > > Does anybody had the history as to why 'no_wdelay' is an > export default? As Brian mentioned later in this thread > it only helps Linux servers, but that's good thing, IMHO. ;-) > > So I would have no problem changing the default export > options in nfs-utils, but it would be nice to know why > it was there in the first place... It dates back to the days when most Linux clients in use in the field were NFSv2 only. After all, it has only been 15 years... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 12:46 ` Trond Myklebust @ 2009-06-05 13:03 ` Brian R Cowan 0 siblings, 0 replies; 94+ messages in thread From: Brian R Cowan @ 2009-06-05 13:03 UTC (permalink / raw) To: Trond Myklebust; +Cc: Greg Banks, linux-nfs, Neil Brown, Steve Dickson Personally, I would leave the default export options alone. Simply because they more or less match the defaults for the other NFS servers. Also, there may be negative impacts of changing the default export option to no_wdelay on really busy servers. One possible result is that more CPU time gets spent waiting on writes to disk. I'm a bit paranoid when it comes to tuning *server* settings, since they impact all clients all at once, where client tuning generally only impacts the one client. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Steve Dickson <SteveD@redhat.com> Cc: Neil Brown <neilb@suse.de>, Greg Banks <gnb@fmeh.org>, Brian R Cowan/Cupertino/IBM@IBMUS, linux-nfs@vger.kernel.org Date: 06/05/2009 08:48 AM Subject: Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote: > Brian R Cowan wrote: > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 > > PM: > > > >> Did you try turning off write gathering on the server (i.e. 
add the > >> 'no_wdelay' export option)? As I said earlier, that forces a delay of > >> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > > > Just tried it, this seems to be a very useful workaround as well. The > > FILE_SYNC write calls come back in about the same amount of time as the > > write+commit pairs... Speeds up building regardless of the network > > filesystem (ClearCase MVFS or straight NFS). > > Does anybody had the history as to why 'no_wdelay' is an > export default? As Brian mentioned later in this thread > it only helps Linux servers, but that's good thing, IMHO. ;-) > > So I would have no problem changing the default export > options in nfs-utils, but it would be nice to know why > it was there in the first place... It dates back to the days when most Linux clients in use in the field were NFSv2 only. After all, it has only been 15 years... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 11:35 ` Steve Dickson 2009-06-05 12:46 ` Trond Myklebust @ 2009-06-05 13:05 ` Tom Talpey [not found] ` <4A29144A.6030405@gmail.com> 2009-06-05 13:56 ` Brian R Cowan 3 siblings, 0 replies; 94+ messages in thread From: Tom Talpey @ 2009-06-05 13:05 UTC (permalink / raw) To: Steve Dickson; +Cc: Linux NFS Mailing List On 6/5/2009 7:35 AM, Steve Dickson wrote: > Brian R Cowan wrote: >> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 >> PM: >> >>> Did you try turning off write gathering on the server (i.e. add the >>> 'no_wdelay' export option)? As I said earlier, that forces a delay of >>> 10ms per RPC call, which might explain the FILE_SYNC slowness. >> Just tried it, this seems to be a very useful workaround as well. The >> FILE_SYNC write calls come back in about the same amount of time as the >> write+commit pairs... Speeds up building regardless of the network >> filesystem (ClearCase MVFS or straight NFS). > > Does anybody had the history as to why 'no_wdelay' is an > export default? Because "wdelay" is a complete crock? Adding 10ms to every write RPC only helps if there's a steady single-file stream arriving at the server. In most other workloads it only slows things down. The better solution is to continue tuning the clients to issue writes in a more sequential and less all-or-nothing fashion. There are plenty of other less crock-ful things to do in the server, too. Tom. > As Brian mentioned later in this thread > it only helps Linux servers, but that's good thing, IMHO. ;-) > > So I would have no problem changing the default export > options in nfs-utils, but it would be nice to know why > it was there in the first place... > > Neil, Greg?? > > steved. 
> -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <4A29144A.6030405@gmail.com>]
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <4A29144A.6030405@gmail.com> @ 2009-06-05 13:30 ` Steve Dickson 2009-06-05 13:52 ` Trond Myklebust [not found] ` <4A291D83.1000508@RedHat.com> 1 sibling, 1 reply; 94+ messages in thread From: Steve Dickson @ 2009-06-05 13:30 UTC (permalink / raw) To: Tom Talpey; +Cc: Linux NFS Mailing list Tom Talpey wrote: > On 6/5/2009 7:35 AM, Steve Dickson wrote: >> Brian R Cowan wrote: >>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 >>> 02:04:58 >>> PM: >>> >>>> Did you try turning off write gathering on the server (i.e. add the >>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of >>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. >>> Just tried it, this seems to be a very useful workaround as well. The >>> FILE_SYNC write calls come back in about the same amount of time as the >>> write+commit pairs... Speeds up building regardless of the network >>> filesystem (ClearCase MVFS or straight NFS). >> >> Does anybody had the history as to why 'no_wdelay' is an >> export default? > > Because "wdelay" is a complete crock? > > Adding 10ms to every write RPC only helps if there's a steady > single-file stream arriving at the server. In most other workloads > it only slows things down. > > The better solution is to continue tuning the clients to issue > writes in a more sequential and less all-or-nothing fashion. > There are plenty of other less crock-ful things to do in the > server, too. Ok... So do you think removing it as a default would cause any regressions? steved. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 13:30 ` Steve Dickson @ 2009-06-05 13:52 ` Trond Myklebust [not found] ` <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 13:52 UTC (permalink / raw) To: Steve Dickson; +Cc: Tom Talpey, Linux NFS Mailing list On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > > Tom Talpey wrote: > > On 6/5/2009 7:35 AM, Steve Dickson wrote: > >> Brian R Cowan wrote: > >>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > >>> 02:04:58 > >>> PM: > >>> > >>>> Did you try turning off write gathering on the server (i.e. add the > >>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > >>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > >>> Just tried it, this seems to be a very useful workaround as well. The > >>> FILE_SYNC write calls come back in about the same amount of time as the > >>> write+commit pairs... Speeds up building regardless of the network > >>> filesystem (ClearCase MVFS or straight NFS). > >> > >> Does anybody had the history as to why 'no_wdelay' is an > >> export default? > > > > Because "wdelay" is a complete crock? > > > > Adding 10ms to every write RPC only helps if there's a steady > > single-file stream arriving at the server. In most other workloads > > it only slows things down. > > > > The better solution is to continue tuning the clients to issue > > writes in a more sequential and less all-or-nothing fashion. > > There are plenty of other less crock-ful things to do in the > > server, too. > Ok... So do you think removing it as a default would cause > any regressions? It might for NFSv2 clients, since they don't have the option of using unstable writes. I'd therefore prefer a kernel solution that makes write gathering an NFSv2 only feature. 
Cheers Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1244209956.5410.33.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-06-05 13:57 ` Steve Dickson [not found] ` <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Steve Dickson @ 2009-06-05 13:57 UTC (permalink / raw) To: Trond Myklebust; +Cc: Tom Talpey, Linux NFS Mailing list Trond Myklebust wrote: > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: >> Tom Talpey wrote: >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: >>>> Brian R Cowan wrote: >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 >>>>> 02:04:58 >>>>> PM: >>>>> >>>>>> Did you try turning off write gathering on the server (i.e. add the >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. >>>>> Just tried it, this seems to be a very useful workaround as well. The >>>>> FILE_SYNC write calls come back in about the same amount of time as the >>>>> write+commit pairs... Speeds up building regardless of the network >>>>> filesystem (ClearCase MVFS or straight NFS). >>>> Does anybody had the history as to why 'no_wdelay' is an >>>> export default? >>> Because "wdelay" is a complete crock? >>> >>> Adding 10ms to every write RPC only helps if there's a steady >>> single-file stream arriving at the server. In most other workloads >>> it only slows things down. >>> >>> The better solution is to continue tuning the clients to issue >>> writes in a more sequential and less all-or-nothing fashion. >>> There are plenty of other less crock-ful things to do in the >>> server, too. >> Ok... So do you think removing it as a default would cause >> any regressions? > > It might for NFSv2 clients, since they don't have the option of using > unstable writes. 
I'd therefore prefer a kernel solution that makes write > gathering an NFSv2 only feature. Sounds good to me! ;-) steved. ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org>]
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <4A29243F.8080008-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org> @ 2009-06-05 16:05 ` J. Bruce Fields 2009-06-05 16:35 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-05 16:05 UTC (permalink / raw) To: Steve Dickson; +Cc: Trond Myklebust, Tom Talpey, Linux NFS Mailing list On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote: > > > Trond Myklebust wrote: > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > >> Tom Talpey wrote: > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: > >>>> Brian R Cowan wrote: > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > >>>>> 02:04:58 > >>>>> PM: > >>>>> > >>>>>> Did you try turning off write gathering on the server (i.e. add the > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > >>>>> Just tried it, this seems to be a very useful workaround as well. The > >>>>> FILE_SYNC write calls come back in about the same amount of time as the > >>>>> write+commit pairs... Speeds up building regardless of the network > >>>>> filesystem (ClearCase MVFS or straight NFS). > >>>> Does anybody had the history as to why 'no_wdelay' is an > >>>> export default? > >>> Because "wdelay" is a complete crock? > >>> > >>> Adding 10ms to every write RPC only helps if there's a steady > >>> single-file stream arriving at the server. In most other workloads > >>> it only slows things down. > >>> > >>> The better solution is to continue tuning the clients to issue > >>> writes in a more sequential and less all-or-nothing fashion. > >>> There are plenty of other less crock-ful things to do in the > >>> server, too. > >> Ok... So do you think removing it as a default would cause > >> any regressions? 
> > > > It might for NFSv2 clients, since they don't have the option of using > > unstable writes. I'd therefore prefer a kernel solution that makes write > > gathering an NFSv2 only feature. > Sounds good to me! ;-) Patch welcomed.--b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 16:05 ` J. Bruce Fields @ 2009-06-05 16:35 ` Trond Myklebust [not found] ` <1244219715.5410.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 16:35 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote: > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote: > > > > > > Trond Myklebust wrote: > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > > >> Tom Talpey wrote: > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: > > >>>> Brian R Cowan wrote: > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > > >>>>> 02:04:58 > > >>>>> PM: > > >>>>> > > >>>>>> Did you try turning off write gathering on the server (i.e. add the > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > >>>>> Just tried it, this seems to be a very useful workaround as well. The > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the > > >>>>> write+commit pairs... Speeds up building regardless of the network > > >>>>> filesystem (ClearCase MVFS or straight NFS). > > >>>> Does anybody had the history as to why 'no_wdelay' is an > > >>>> export default? > > >>> Because "wdelay" is a complete crock? > > >>> > > >>> Adding 10ms to every write RPC only helps if there's a steady > > >>> single-file stream arriving at the server. In most other workloads > > >>> it only slows things down. > > >>> > > >>> The better solution is to continue tuning the clients to issue > > >>> writes in a more sequential and less all-or-nothing fashion. 
> > >>> There are plenty of other less crock-ful things to do in the > > >>> server, too. > > >> Ok... So do you think removing it as a default would cause > > >> any regressions? > > > > > > It might for NFSv2 clients, since they don't have the option of using > > > unstable writes. I'd therefore prefer a kernel solution that makes write > > > gathering an NFSv2 only feature. > > Sounds good to me! ;-) > > Patch welcomed.--b. Something like this ought to suffice... ----------------------------------------------------------------------- From: Trond Myklebust <Trond.Myklebust@netapp.com> NFSD: Make sure that write gathering only applies to NFSv2 NFSv3 and above can use unstable writes whenever they are sending more than one write, rather than relying on the flaky write gathering heuristics. More often than not, write gathering is currently getting it wrong when the NFSv3 clients are sending a single write with FILE_SYNC for efficiency reasons. This patch turns off write gathering for NFSv3/v4, and ensure that it only applies to the one case that can actually benefit: namely NFSv2. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfsd/vfs.c | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index b660435..f30cc4e 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -975,6 +975,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, __be32 err = 0; int host_err; int stable = *stablep; + int use_wgather; #ifdef MSNFS err = nfserr_perm; @@ -993,9 +994,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, * - the sync export option has been set, or * - the client requested O_SYNC behavior (NFSv3 feature). * - The file system doesn't support fsync(). - * When gathered writes have been configured for this volume, + * When NFSv2 gathered writes have been configured for this volume, * flushing the data to disk is handled separately below. 
*/ + use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); if (!file->f_op->fsync) {/* COMMIT3 cannot work */ stable = 2; @@ -1004,7 +1006,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, if (!EX_ISSYNC(exp)) stable = 0; - if (stable && !EX_WGATHER(exp)) { + if (stable && !use_wgather) { spin_lock(&file->f_lock); file->f_flags |= O_SYNC; spin_unlock(&file->f_lock); @@ -1040,7 +1042,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, * nice and simple solution (IMHO), and it seems to * work:-) */ - if (EX_WGATHER(exp)) { + if (use_wgather) { if (atomic_read(&inode->i_writecount) > 1 || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { dprintk("nfsd: write defer %d\n", task_pid_nr(current)); ^ permalink raw reply related [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1244219715.5410.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-06-15 23:08 ` J. Bruce Fields 2009-06-16 0:21 ` NeilBrown 2009-06-16 0:32 ` Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust 0 siblings, 2 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-06-15 23:08 UTC (permalink / raw) To: Trond Myklebust; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote: > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote: > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote: > > > > > > > > > Trond Myklebust wrote: > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > > > >> Tom Talpey wrote: > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: > > > >>>> Brian R Cowan wrote: > > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > > > >>>>> 02:04:58 > > > >>>>> PM: > > > >>>>> > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the > > > >>>>> write+commit pairs... Speeds up building regardless of the network > > > >>>>> filesystem (ClearCase MVFS or straight NFS). > > > >>>> Does anybody had the history as to why 'no_wdelay' is an > > > >>>> export default? > > > >>> Because "wdelay" is a complete crock? > > > >>> > > > >>> Adding 10ms to every write RPC only helps if there's a steady > > > >>> single-file stream arriving at the server. 
In most other workloads > > > >>> it only slows things down. > > > >>> > > > >>> The better solution is to continue tuning the clients to issue > > > >>> writes in a more sequential and less all-or-nothing fashion. > > > >>> There are plenty of other less crock-ful things to do in the > > > >>> server, too. > > > >> Ok... So do you think removing it as a default would cause > > > >> any regressions? > > > > > > > > It might for NFSv2 clients, since they don't have the option of using > > > > unstable writes. I'd therefore prefer a kernel solution that makes write > > > > gathering an NFSv2 only feature. > > > Sounds good to me! ;-) > > > > Patch welcomed.--b. > > Something like this ought to suffice... Thanks, applied. I'd also like to apply cleanup something like the following--there's probably some cleaner way, but it just bothers me to have this write-gathering special case take up the bulk of nfsd_vfs_write.... --b. commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d Author: J. Bruce Fields <bfields@citi.umich.edu> Date: Mon Jun 15 16:03:53 2009 -0700 nfsd: Pull write-gathering code out of nfsd_vfs_write This is a relatively self-contained piece of code that handles a special case--move it to its own function. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index a8aac7f..de68557 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry) mutex_unlock(&dentry->d_inode->i_mutex); } +/* + * Gathered writes: If another process is currently writing to the file, + * there's a high chance this is another nfsd (triggered by a bulk write + * from a client's biod). Rather than syncing the file with each write + * request, we sleep for 10 msec. + * + * I don't know if this roughly approximates C. 
Juszak's idea of + * gathered writes, but it's a nice and simple solution (IMHO), and it + * seems to work:-) + * + * Note: we do this only in the NFSv2 case, since v3 and higher have a + * better tool (separate unstable writes and commits) for solving this + * problem. + */ +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err) +{ + struct inode *inode = file->f_path.dentry->d_inode; + static ino_t last_ino; + static dev_t last_dev; + + if (!use_wgather) + goto out; + if (atomic_read(&inode->i_writecount) > 1 + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { + dprintk("nfsd: write defer %d\n", task_pid_nr(current)); + msleep(10); + dprintk("nfsd: write resume %d\n", task_pid_nr(current)); + } + + if (inode->i_state & I_DIRTY) { + dprintk("nfsd: write sync %d\n", task_pid_nr(current)); + *host_err = nfsd_sync(file); + } +out: + last_ino = inode->i_ino; + last_dev = inode->i_sb->s_dev; +} + static __be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, loff_t offset, struct kvec *vec, int vlen, @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) kill_suid(dentry); - if (host_err >= 0 && stable) { - static ino_t last_ino; - static dev_t last_dev; - - /* - * Gathered writes: If another process is currently - * writing to the file, there's a high chance - * this is another nfsd (triggered by a bulk write - * from a client's biod). Rather than syncing the - * file with each write request, we sleep for 10 msec. - * - * I don't know if this roughly approximates - * C. 
Juszak's idea of gathered writes, but it's a - * nice and simple solution (IMHO), and it seems to - * work:-) - */ - if (use_wgather) { - if (atomic_read(&inode->i_writecount) > 1 - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { - dprintk("nfsd: write defer %d\n", task_pid_nr(current)); - msleep(10); - dprintk("nfsd: write resume %d\n", task_pid_nr(current)); - } - - if (inode->i_state & I_DIRTY) { - dprintk("nfsd: write sync %d\n", task_pid_nr(current)); - host_err=nfsd_sync(file); - } -#if 0 - wake_up(&inode->i_wait); -#endif - } - last_ino = inode->i_ino; - last_dev = inode->i_sb->s_dev; - } + if (host_err >= 0 && stable) + wait_for_concurrent_writes(file, use_wgather, &host_err); dprintk("nfsd: write complete host_err=%d\n", host_err); if (host_err >= 0) { ^ permalink raw reply related [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-15 23:08 ` J. Bruce Fields @ 2009-06-16 0:21 ` NeilBrown [not found] ` <99d4545537613ce76040d3655b78bdb7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org> 2009-06-16 0:32 ` Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: NeilBrown @ 2009-06-16 0:21 UTC (permalink / raw) To: J. Bruce Fields Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote: > + if (host_err >= 0 && stable) > + wait_for_concurrent_writes(file, use_wgather, &host_err); > Surely you want this to be: if (host_err >= 0 && stable && use_wgather) host_err = wait_for_concurrent_writes(file); as - this is more readable - setting last_ino and last_dev is pointless when !use_wgather - we aren't interested in differentiation between non-negative values of host_err. NeilBrown ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <99d4545537613ce76040d3655b78bdb7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org> @ 2009-06-16 0:33 ` J. Bruce Fields 2009-06-16 0:50 ` NeilBrown 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-16 0:33 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote: > On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote: > > > + if (host_err >= 0 && stable) > > + wait_for_concurrent_writes(file, use_wgather, &host_err); > > > > Surely you want this to be: > > if (host_err >= 0 && stable && use_wgather) > host_err = wait_for_concurrent_writes(file); > as > - this is more readable > - setting last_ino and last_dev is pointless when !use_wgather Yep, thanks. > - we aren't interested in differentiation between non-negative values of > host_err. 
Unfortunately, just below: if (host_err >= 0) { err = 0; *cnt = host_err; } else err = nfserrno(host_err); We could save that count earlier, e.g.: @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, int host_err; int stable = *stablep; int use_wgather; + int bytes; #ifdef MSNFS err = nfserr_perm; @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, set_fs(oldfs); if (host_err >= 0) { nfsdstats.io_write += host_err; + bytes = host_err; fsnotify_modify(file->f_path.dentry); } @@ -1063,13 +1064,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fh if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) kill_suid(dentry); - if (host_err >= 0 && stable) - wait_for_concurrent_writes(file, use_wgather, &host_err); + if (host_err >= 0 && stable && use_wgather) + host_err = wait_for_concurrent_writes(file); dprintk("nfsd: write complete host_err=%d\n", host_err); if (host_err >= 0) { err = 0; - *cnt = host_err; + *cnt = bytes; } else err = nfserrno(host_err); out: --b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-16 0:33 ` J. Bruce Fields @ 2009-06-16 0:50 ` NeilBrown [not found] ` <02ada87c636e1088e9365a3cbea301e7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: NeilBrown @ 2009-06-16 0:50 UTC (permalink / raw) To: J. Bruce Fields Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote: > On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote: >> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote: >> >> > + if (host_err >= 0 && stable) >> > + wait_for_concurrent_writes(file, use_wgather, &host_err); >> > >> >> Surely you want this to be: >> >> if (host_err >= 0 && stable && use_wgather) >> host_err = wait_for_concurrent_writes(file); >> as >> - this is more readable >> - setting last_ino and last_dev is pointless when !use_wgather > > Yep, thanks. > >> - we aren't interested in differentiation between non-negative values >> of >> host_err. > > Unfortunately, just below: > > if (host_err >= 0) { > err = 0; > *cnt = host_err; > } else > err = nfserrno(host_err); > Ahh.... that must be in code you haven't pushed out yet. I don't see it in mainline or git.linux-nfs.org > We could save that count earlier, e.g.: > > @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh > *fhp, > int host_err; > int stable = *stablep; > int use_wgather; > + int bytes; > > #ifdef MSNFS > err = nfserr_perm; > @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh > *fhp, > set_fs(oldfs); > if (host_err >= 0) { > nfsdstats.io_write += host_err; > + bytes = host_err; > fsnotify_modify(file->f_path.dentry); Or even if (host_err >= 0) { bytes = host_err; nfsdstats.io_write += bytes ... 
And if you did that in whatever patch move the assignment to *cnt to the bottom of the function, it might be even more readable! Thanks, NeilBrown ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <02ada87c636e1088e9365a3cbea301e7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org> @ 2009-06-16 0:55 ` J. Bruce Fields 2009-06-17 16:54 ` J. Bruce Fields 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-16 0:55 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list On Tue, Jun 16, 2009 at 10:50:57AM +1000, NeilBrown wrote: > On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote: > > On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote: > >> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote: > >> > >> > + if (host_err >= 0 && stable) > >> > + wait_for_concurrent_writes(file, use_wgather, &host_err); > >> > > >> > >> Surely you want this to be: > >> > >> if (host_err >= 0 && stable && use_wgather) > >> host_err = wait_for_concurrent_writes(file); > >> as > >> - this is more readable > >> - setting last_ino and last_dev is pointless when !use_wgather > > > > Yep, thanks. > > > >> - we aren't interested in differentiation between non-negative values > >> of > >> host_err. > > > > Unfortunately, just below: > > > > if (host_err >= 0) { > > err = 0; > > *cnt = host_err; > > } else > > err = nfserrno(host_err); > > > > Ahh.... that must be in code you haven't pushed out yet. > I don't see it in mainline or git.linux-nfs.org Whoops--actually, it's the opposite problem: a bugfix patch that went upstream removed this, and I didn't merge that back into my for-2.6.31 branch. OK, time to do that, and then this is all much simpler.... Thanks for calling my attention to that! --b. 
> > > We could save that count earlier, e.g.: > > > > @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh > > *fhp, > > int host_err; > > int stable = *stablep; > > int use_wgather; > > + int bytes; > > > > #ifdef MSNFS > > err = nfserr_perm; > > @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh > > *fhp, > > set_fs(oldfs); > > if (host_err >= 0) { > > nfsdstats.io_write += host_err; > > + bytes = host_err; > > fsnotify_modify(file->f_path.dentry); > > Or even > > if (host_err >= 0) { > bytes = host_err; > nfsdstats.io_write += bytes > ... > > And if you did that in whatever patch move the assignment to > *cnt to the bottom of the function, it might be even more readable! > > Thanks, > NeilBrown > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-16 0:55 ` J. Bruce Fields @ 2009-06-17 16:54 ` J. Bruce Fields 2009-06-17 16:59 ` [PATCH 1/3] nfsd: track last inode only in use_wgather case J. Bruce Fields 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-17 16:54 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list On Mon, Jun 15, 2009 at 08:55:58PM -0400, bfields wrote: > Whoops--actually, it's the opposite problem: a bugfix patch that went > upstream removed this, and I didn't merge that back into my for-2.6.31 > branch. OK, time to do that, and then this is all much simpler.... > Thanks for calling my attention to that! Having fixed that... the following is what I'm applying (on top of Trond's). --b. ^ permalink raw reply [flat|nested] 94+ messages in thread
* [PATCH 1/3] nfsd: track last inode only in use_wgather case 2009-06-17 16:54 ` J. Bruce Fields @ 2009-06-17 16:59 ` J. Bruce Fields 2009-06-17 16:59 ` [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write J. Bruce Fields 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list, J. Bruce Fields From: J. Bruce Fields <bfields@citi.umich.edu> Updating last_ino and last_dev probably isn't useful in the !use_wgather case. Also remove some pointless ifdef'd-out code. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> --- fs/nfsd/vfs.c | 25 ++++++++++--------------- 1 files changed, 10 insertions(+), 15 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index f30cc4e..ebf56c6 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1026,7 +1026,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) kill_suid(dentry); - if (host_err >= 0 && stable) { + if (host_err >= 0 && stable && use_wgather) { static ino_t last_ino; static dev_t last_dev; @@ -1042,21 +1042,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, * nice and simple solution (IMHO), and it seems to * work:-) */ - if (use_wgather) { - if (atomic_read(&inode->i_writecount) > 1 - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { - dprintk("nfsd: write defer %d\n", task_pid_nr(current)); - msleep(10); - dprintk("nfsd: write resume %d\n", task_pid_nr(current)); - } + if (atomic_read(&inode->i_writecount) > 1 + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { + dprintk("nfsd: write defer %d\n", task_pid_nr(current)); + msleep(10); + dprintk("nfsd: write resume %d\n", task_pid_nr(current)); + } - if (inode->i_state & I_DIRTY) { - dprintk("nfsd: write sync %d\n", task_pid_nr(current)); - host_err=nfsd_sync(file); - 
} -#if 0 - wake_up(&inode->i_wait); -#endif + if (inode->i_state & I_DIRTY) { + dprintk("nfsd: write sync %d\n", task_pid_nr(current)); + host_err=nfsd_sync(file); } last_ino = inode->i_ino; last_dev = inode->i_sb->s_dev; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 94+ messages in thread
* [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write 2009-06-17 16:59 ` [PATCH 1/3] nfsd: track last inode only in use_wgather case J. Bruce Fields @ 2009-06-17 16:59 ` J. Bruce Fields 2009-06-17 16:59 ` [PATCH 3/3] nfsd: minor nfsd_vfs_write cleanup J. Bruce Fields 0 siblings, 1 reply; 94+ messages in thread From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list, J. Bruce Fields From: J. Bruce Fields <bfields@citi.umich.edu> This is a relatively self-contained piece of code that handles a special case--move it to its own function. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> --- fs/nfsd/vfs.c | 69 ++++++++++++++++++++++++++++++++------------------------ 1 files changed, 39 insertions(+), 30 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index ebf56c6..6ad76a4 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -963,6 +963,43 @@ static void kill_suid(struct dentry *dentry) mutex_unlock(&dentry->d_inode->i_mutex); } +/* + * Gathered writes: If another process is currently writing to the file, + * there's a high chance this is another nfsd (triggered by a bulk write + * from a client's biod). Rather than syncing the file with each write + * request, we sleep for 10 msec. + * + * I don't know if this roughly approximates C. Juszak's idea of + * gathered writes, but it's a nice and simple solution (IMHO), and it + * seems to work:-) + * + * Note: we do this only in the NFSv2 case, since v3 and higher have a + * better tool (separate unstable writes and commits) for solving this + * problem. 
+ */ +static int wait_for_concurrent_writes(struct file *file) +{ + struct inode *inode = file->f_path.dentry->d_inode; + static ino_t last_ino; + static dev_t last_dev; + int err = 0; + + if (atomic_read(&inode->i_writecount) > 1 + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { + dprintk("nfsd: write defer %d\n", task_pid_nr(current)); + msleep(10); + dprintk("nfsd: write resume %d\n", task_pid_nr(current)); + } + + if (inode->i_state & I_DIRTY) { + dprintk("nfsd: write sync %d\n", task_pid_nr(current)); + err = nfsd_sync(file); + } + last_ino = inode->i_ino; + last_dev = inode->i_sb->s_dev; + return err; +} + static __be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, loff_t offset, struct kvec *vec, int vlen, @@ -1026,36 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) kill_suid(dentry); - if (host_err >= 0 && stable && use_wgather) { - static ino_t last_ino; - static dev_t last_dev; - - /* - * Gathered writes: If another process is currently - * writing to the file, there's a high chance - * this is another nfsd (triggered by a bulk write - * from a client's biod). Rather than syncing the - * file with each write request, we sleep for 10 msec. - * - * I don't know if this roughly approximates - * C. 
Juszak's idea of gathered writes, but it's a - * nice and simple solution (IMHO), and it seems to - * work:-) - */ - if (atomic_read(&inode->i_writecount) > 1 - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { - dprintk("nfsd: write defer %d\n", task_pid_nr(current)); - msleep(10); - dprintk("nfsd: write resume %d\n", task_pid_nr(current)); - } - - if (inode->i_state & I_DIRTY) { - dprintk("nfsd: write sync %d\n", task_pid_nr(current)); - host_err=nfsd_sync(file); - } - last_ino = inode->i_ino; - last_dev = inode->i_sb->s_dev; - } + if (host_err >= 0 && stable && use_wgather) + host_err = wait_for_concurrent_writes(file); dprintk("nfsd: write complete host_err=%d\n", host_err); if (host_err >= 0) -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 94+ messages in thread
* [PATCH 3/3] nfsd: minor nfsd_vfs_write cleanup 2009-06-17 16:59 ` [PATCH 2/3] nfsd: Pull write-gathering code out of nfsd_vfs_write J. Bruce Fields @ 2009-06-17 16:59 ` J. Bruce Fields 0 siblings, 0 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-06-17 16:59 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Steve Dickson, Tom Talpey, Linux NFS Mailing list, J. Bruce Fields From: J. Bruce Fields <bfields@citi.umich.edu> There's no need to check host_err >= 0 every time here when we could check host_err < 0 once, following the usual kernel style. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> --- fs/nfsd/vfs.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 6ad76a4..1cf7061 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1053,19 +1053,20 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, oldfs = get_fs(); set_fs(KERNEL_DS); host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset); set_fs(oldfs); - if (host_err >= 0) { - *cnt = host_err; - nfsdstats.io_write += host_err; - fsnotify_modify(file->f_path.dentry); - } + if (host_err < 0) + goto out_nfserr; + *cnt = host_err; + nfsdstats.io_write += host_err; + fsnotify_modify(file->f_path.dentry); /* clear setuid/setgid flag after write */ - if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) + if (inode->i_mode & (S_ISUID | S_ISGID)) kill_suid(dentry); - if (host_err >= 0 && stable && use_wgather) + if (stable && use_wgather) host_err = wait_for_concurrent_writes(file); +out_nfserr: dprintk("nfsd: write complete host_err=%d\n", host_err); if (host_err >= 0) err = 0; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-15 23:08 ` J. Bruce Fields 2009-06-16 0:21 ` NeilBrown @ 2009-06-16 0:32 ` Trond Myklebust [not found] ` <1245112324.7470.7.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-16 0:32 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote: > On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote: > > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote: > > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote: > > > > > > > > > > > > Trond Myklebust wrote: > > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > > > > >> Tom Talpey wrote: > > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: > > > > >>>> Brian R Cowan wrote: > > > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > > > > >>>>> 02:04:58 > > > > >>>>> PM: > > > > >>>>> > > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the > > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The > > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the > > > > >>>>> write+commit pairs... Speeds up building regardless of the network > > > > >>>>> filesystem (ClearCase MVFS or straight NFS). > > > > >>>> Does anybody had the history as to why 'no_wdelay' is an > > > > >>>> export default? > > > > >>> Because "wdelay" is a complete crock? > > > > >>> > > > > >>> Adding 10ms to every write RPC only helps if there's a steady > > > > >>> single-file stream arriving at the server. 
In most other workloads > > > > >>> it only slows things down. > > > > >>> > > > > >>> The better solution is to continue tuning the clients to issue > > > > >>> writes in a more sequential and less all-or-nothing fashion. > > > > >>> There are plenty of other less crock-ful things to do in the > > > > >>> server, too. > > > > >> Ok... So do you think removing it as a default would cause > > > > >> any regressions? > > > > > > > > > > It might for NFSv2 clients, since they don't have the option of using > > > > > unstable writes. I'd therefore prefer a kernel solution that makes write > > > > > gathering an NFSv2 only feature. > > > > Sounds good to me! ;-) > > > > > > Patch welcomed.--b. > > > > Something like this ought to suffice... > > Thanks, applied. > > I'd also like to apply cleanup something like the following--there's > probably some cleaner way, but it just bothers me to have this > write-gathering special case take up the bulk of nfsd_vfs_write.... > > --b. > > commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d > Author: J. Bruce Fields <bfields@citi.umich.edu> > Date: Mon Jun 15 16:03:53 2009 -0700 > > nfsd: Pull write-gathering code out of nfsd_vfs_write > > This is a relatively self-contained piece of code that handles a special > case--move it to its own function. > > Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index a8aac7f..de68557 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry) > mutex_unlock(&dentry->d_inode->i_mutex); > } > > +/* > + * Gathered writes: If another process is currently writing to the file, > + * there's a high chance this is another nfsd (triggered by a bulk write > + * from a client's biod). Rather than syncing the file with each write > + * request, we sleep for 10 msec. > + * > + * I don't know if this roughly approximates C. 
Juszak's idea of > + * gathered writes, but it's a nice and simple solution (IMHO), and it > + * seems to work:-) > + * > + * Note: we do this only in the NFSv2 case, since v3 and higher have a > + * better tool (separate unstable writes and commits) for solving this > + * problem. > + */ > +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err) > +{ > + struct inode *inode = file->f_path.dentry->d_inode; > + static ino_t last_ino; > + static dev_t last_dev; > + > + if (!use_wgather) > + goto out; > + if (atomic_read(&inode->i_writecount) > 1 > + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { > + dprintk("nfsd: write defer %d\n", task_pid_nr(current)); > + msleep(10); > + dprintk("nfsd: write resume %d\n", task_pid_nr(current)); > + } > + > + if (inode->i_state & I_DIRTY) { > + dprintk("nfsd: write sync %d\n", task_pid_nr(current)); > + *host_err = nfsd_sync(file); > + } > +out: > + last_ino = inode->i_ino; > + last_dev = inode->i_sb->s_dev; > +} Shouldn't you also timestamp the last_ino/last_dev? Currently you can end up waiting even if the last time you referenced this file was 10 minutes ago... > + > static __be32 > nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, > loff_t offset, struct kvec *vec, int vlen, > @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, > if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) > kill_suid(dentry); > > - if (host_err >= 0 && stable) { > - static ino_t last_ino; > - static dev_t last_dev; > - > - /* > - * Gathered writes: If another process is currently > - * writing to the file, there's a high chance > - * this is another nfsd (triggered by a bulk write > - * from a client's biod). Rather than syncing the > - * file with each write request, we sleep for 10 msec. > - * > - * I don't know if this roughly approximates > - * C. 
Juszak's idea of gathered writes, but it's a > - * nice and simple solution (IMHO), and it seems to > - * work:-) > - */ > - if (use_wgather) { > - if (atomic_read(&inode->i_writecount) > 1 > - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { > - dprintk("nfsd: write defer %d\n", task_pid_nr(current)); > - msleep(10); > - dprintk("nfsd: write resume %d\n", task_pid_nr(current)); > - } > - > - if (inode->i_state & I_DIRTY) { > - dprintk("nfsd: write sync %d\n", task_pid_nr(current)); > - host_err=nfsd_sync(file); > - } > -#if 0 > - wake_up(&inode->i_wait); > -#endif > - } > - last_ino = inode->i_ino; > - last_dev = inode->i_sb->s_dev; > - } > + if (host_err >= 0 && stable) > + wait_for_concurrent_writes(file, use_wgather, &host_err); > > dprintk("nfsd: write complete host_err=%d\n", host_err); > if (host_err >= 0) { > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1245112324.7470.7.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-06-16 2:02 ` J. Bruce Fields 0 siblings, 0 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-06-16 2:02 UTC (permalink / raw) To: Trond Myklebust; +Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list On Mon, Jun 15, 2009 at 05:32:04PM -0700, Trond Myklebust wrote: > On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote: > > On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote: > > > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote: > > > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote: > > > > > > > > > > > > > > > Trond Myklebust wrote: > > > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote: > > > > > >> Tom Talpey wrote: > > > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote: > > > > > >>>> Brian R Cowan wrote: > > > > > >>>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 > > > > > >>>>> 02:04:58 > > > > > >>>>> PM: > > > > > >>>>> > > > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the > > > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of > > > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The > > > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the > > > > > >>>>> write+commit pairs... Speeds up building regardless of the network > > > > > >>>>> filesystem (ClearCase MVFS or straight NFS). > > > > > >>>> Does anybody had the history as to why 'no_wdelay' is an > > > > > >>>> export default? > > > > > >>> Because "wdelay" is a complete crock? 
> > > > > >>> > > > > > >>> Adding 10ms to every write RPC only helps if there's a steady > > > > > >>> single-file stream arriving at the server. In most other workloads > > > > > >>> it only slows things down. > > > > > >>> > > > > > >>> The better solution is to continue tuning the clients to issue > > > > > >>> writes in a more sequential and less all-or-nothing fashion. > > > > > >>> There are plenty of other less crock-ful things to do in the > > > > > >>> server, too. > > > > > >> Ok... So do you think removing it as a default would cause > > > > > >> any regressions? > > > > > > > > > > > > It might for NFSv2 clients, since they don't have the option of using > > > > > > unstable writes. I'd therefore prefer a kernel solution that makes write > > > > > > gathering an NFSv2 only feature. > > > > > Sounds good to me! ;-) > > > > > > > > Patch welcomed.--b. > > > > > > Something like this ought to suffice... > > > > Thanks, applied. > > > > I'd also like to apply cleanup something like the following--there's > > probably some cleaner way, but it just bothers me to have this > > write-gathering special case take up the bulk of nfsd_vfs_write.... > > > > --b. > > > > commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d > > Author: J. Bruce Fields <bfields@citi.umich.edu> > > Date: Mon Jun 15 16:03:53 2009 -0700 > > > > nfsd: Pull write-gathering code out of nfsd_vfs_write > > > > This is a relatively self-contained piece of code that handles a special > > case--move it to its own function. > > > > Signed-off-by: J. 
Bruce Fields <bfields@citi.umich.edu> > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > > index a8aac7f..de68557 100644 > > --- a/fs/nfsd/vfs.c > > +++ b/fs/nfsd/vfs.c > > @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry) > > mutex_unlock(&dentry->d_inode->i_mutex); > > } > > > > +/* > > + * Gathered writes: If another process is currently writing to the file, > > + * there's a high chance this is another nfsd (triggered by a bulk write > > + * from a client's biod). Rather than syncing the file with each write > > + * request, we sleep for 10 msec. > > + * > > + * I don't know if this roughly approximates C. Juszak's idea of > > + * gathered writes, but it's a nice and simple solution (IMHO), and it > > + * seems to work:-) > > + * > > + * Note: we do this only in the NFSv2 case, since v3 and higher have a > > + * better tool (separate unstable writes and commits) for solving this > > + * problem. > > + */ > > +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err) > > +{ > > + struct inode *inode = file->f_path.dentry->d_inode; > > + static ino_t last_ino; > > + static dev_t last_dev; > > + > > + if (!use_wgather) > > + goto out; > > + if (atomic_read(&inode->i_writecount) > 1 > > + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { > > + dprintk("nfsd: write defer %d\n", task_pid_nr(current)); > > + msleep(10); > > + dprintk("nfsd: write resume %d\n", task_pid_nr(current)); > > + } > > + > > + if (inode->i_state & I_DIRTY) { > > + dprintk("nfsd: write sync %d\n", task_pid_nr(current)); > > + *host_err = nfsd_sync(file); > > + } > > +out: > > + last_ino = inode->i_ino; > > + last_dev = inode->i_sb->s_dev; > > +} > > Shouldn't you also timestamp the last_ino/last_dev? Currently you can > end up waiting even if the last time you referenced this file was 10 > minutes ago... 
Maybe, but I don't know that avoiding the delay in the case where use_wdelay writes are coming rarely is particularly important. (Note this is just a single static last_ino/last_dev, so the timestamp would just tell us how long ago there was last a use_wdelay write.) I'm not as interested in making wdelay work better--someone who uses v2 and wants to benchmark it can do that--as I am interested in just getting it out of the way so I don't have to look at it again.... --b. > > > + > > static __be32 > > nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, > > loff_t offset, struct kvec *vec, int vlen, > > @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, > > if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) > > kill_suid(dentry); > > > > - if (host_err >= 0 && stable) { > > - static ino_t last_ino; > > - static dev_t last_dev; > > - > > - /* > > - * Gathered writes: If another process is currently > > - * writing to the file, there's a high chance > > - * this is another nfsd (triggered by a bulk write > > - * from a client's biod). Rather than syncing the > > - * file with each write request, we sleep for 10 msec. > > - * > > - * I don't know if this roughly approximates > > - * C. 
Juszak's idea of gathered writes, but it's a > > - * nice and simple solution (IMHO), and it seems to > > - * work:-) > > - */ > > - if (use_wgather) { > > - if (atomic_read(&inode->i_writecount) > 1 > > - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) { > > - dprintk("nfsd: write defer %d\n", task_pid_nr(current)); > > - msleep(10); > > - dprintk("nfsd: write resume %d\n", task_pid_nr(current)); > > - } > > - > > - if (inode->i_state & I_DIRTY) { > > - dprintk("nfsd: write sync %d\n", task_pid_nr(current)); > > - host_err=nfsd_sync(file); > > - } > > -#if 0 > > - wake_up(&inode->i_wait); > > -#endif > > - } > > - last_ino = inode->i_ino; > > - last_dev = inode->i_sb->s_dev; > > - } > > + if (host_err >= 0 && stable) > > + wait_for_concurrent_writes(file, use_wgather, &host_err); > > > > dprintk("nfsd: write complete host_err=%d\n", host_err); > > if (host_err >= 0) { > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <4A291D83.1000508@RedHat.com> @ 2009-06-05 13:50 ` Tom Talpey 2009-06-05 13:54 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Tom Talpey @ 2009-06-05 13:50 UTC (permalink / raw) To: Steve Dickson; +Cc: Linux NFS Mailing List On 6/5/2009 9:28 AM, Steve Dickson wrote: > > Tom Talpey wrote: >> On 6/5/2009 7:35 AM, Steve Dickson wrote: >>> Brian R Cowan wrote: >>>> Trond Myklebust<trond.myklebust@fys.uio.no> wrote on 06/04/2009 >>>> 02:04:58 >>>> PM: >>>> >>>>> Did you try turning off write gathering on the server (i.e. add the >>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of >>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness. >>>> Just tried it, this seems to be a very useful workaround as well. The >>>> FILE_SYNC write calls come back in about the same amount of time as the >>>> write+commit pairs... Speeds up building regardless of the network >>>> filesystem (ClearCase MVFS or straight NFS). >>> Does anybody had the history as to why 'no_wdelay' is an >>> export default? >> Because "wdelay" is a complete crock? >> >> Adding 10ms to every write RPC only helps if there's a steady >> single-file stream arriving at the server. In most other workloads >> it only slows things down. >> >> The better solution is to continue tuning the clients to issue >> writes in a more sequential and less all-or-nothing fashion. >> There are plenty of other less crock-ful things to do in the >> server, too. > Ok... So do you think removing it as a default would cause > any regressions? I'm not 100% clear on what you mean by removing it. Since it's a "no_" option, removing it means that "wdelay" becomes the default? That would certainly cause a regression for many. 
I think the big problem with tweaking the default in nfs_utils is that there's little guarantee of the kernel behavior that would result. Older kernels, NFSv2 mounts, etc will behave completely differently from new ones, NFSv3, modified clients, etc. So touching this option is quite risky, IMO, even though it's a crock. Tom. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 13:50 ` Tom Talpey @ 2009-06-05 13:54 ` Trond Myklebust 2009-06-05 13:58 ` Tom Talpey 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-05 13:54 UTC (permalink / raw) To: Tom Talpey; +Cc: Steve Dickson, Linux NFS Mailing List On Fri, 2009-06-05 at 09:50 -0400, Tom Talpey wrote: > I'm not 100% clear on what you mean by removing it. Since it's > a "no_" option, removing it means that "wdelay" becomes the > default? That would certainly cause a regression for many. You've misunderstood. The current default is to _set_ 'wdelay' on all exports that do not explicitly turn it off. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 13:54 ` Trond Myklebust @ 2009-06-05 13:58 ` Tom Talpey 0 siblings, 0 replies; 94+ messages in thread From: Tom Talpey @ 2009-06-05 13:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Steve Dickson, Linux NFS Mailing List On 6/5/2009 9:54 AM, Trond Myklebust wrote: > On Fri, 2009-06-05 at 09:50 -0400, Tom Talpey wrote: >> I'm not 100% clear on what you mean by removing it. Since it's >> a "no_" option, removing it means that "wdelay" becomes the >> default? That would certainly cause a regression for many. > > You've misunderstood. The current default is to _set_ 'wdelay' on all > exports that do not explicitly turn it off. Ok, then turning it off will help some and hurt some. There's no right setting for all. I do agree that fixing the server is the best solution, not grabbing wildly at its crockful controls. Tom. ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-06-05 11:35 ` Steve Dickson ` (2 preceding siblings ...) [not found] ` <4A29144A.6030405@gmail.com> @ 2009-06-05 13:56 ` Brian R Cowan 3 siblings, 0 replies; 94+ messages in thread From: Brian R Cowan @ 2009-06-05 13:56 UTC (permalink / raw) To: Steve Dickson; +Cc: Greg Banks, linux-nfs, Neil Brown Actually wdelay is the export default, and I recall the man page saying something along the lines of doing this to allow the server to coalesce writes. Somewhere else (I think in another part of this thread) it's mentioned that the server will sit for up to 10ms waiting for other writes to this export. The reality is that wdelay+FILE_SYNC = up to a 10ms delay waiting for the write RPC to come back. That being said, I would rather leave this alone so that we don't accidentally impact something else. After all, the no_wdelay export option will work around it nicely in an all-Linux environment, and file pages don't flush with FILE_SYNC on 2.6.29. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Steve Dickson <SteveD@redhat.com> To: Neil Brown <neilb@suse.de>, Greg Banks <gnb@fmeh.org> Cc: Brian R Cowan/Cupertino/IBM@IBMUS, linux-nfs@vger.kernel.org Date: 06/05/2009 07:38 AM Subject: Re: Link performance over NFS degraded in RHEL5. 
-- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Brian R Cowan wrote: > Trond Myklebust <trond.myklebust@fys.uio.no> wrote on 06/04/2009 02:04:58 > PM: > >> Did you try turning off write gathering on the server (i.e. add the >> 'no_wdelay' export option)? As I said earlier, that forces a delay of >> 10ms per RPC call, which might explain the FILE_SYNC slowness. > > Just tried it, this seems to be a very useful workaround as well. The > FILE_SYNC write calls come back in about the same amount of time as the > write+commit pairs... Speeds up building regardless of the network > filesystem (ClearCase MVFS or straight NFS). Does anybody have the history as to why 'no_wdelay' is an export default? As Brian mentioned later in this thread it only helps Linux servers, but that's a good thing, IMHO. ;-) So I would have no problem changing the default export options in nfs-utils, but it would be nice to know why it was there in the first place... Neil, Greg?? steved. ^ permalink raw reply [flat|nested] 94+ messages in thread
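[Editorial note: the workaround under discussion is set per export in `/etc/exports` on the server. A hypothetical entry for illustration only -- the path and client range are invented:]

```
# Turn off write gathering (the 10ms wdelay) for this export:
/export/builds  192.168.0.0/24(rw,no_wdelay)
```

After editing, `exportfs -ra` makes the server re-read the export table.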
* [PATCH] read-modify-write page updating 2009-06-04 18:04 ` Trond Myklebust 2009-06-04 20:43 ` Link performance over NFS degraded in RHEL5. -- was : " Brian R Cowan @ 2009-06-24 19:54 ` Peter Staubach 2009-06-25 17:13 ` Trond Myklebust 2009-07-09 14:12 ` [PATCH v2] " Peter Staubach 1 sibling, 2 replies; 94+ messages in thread From: Peter Staubach @ 2009-06-24 19:54 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs [-- Attachment #1: Type: text/plain, Size: 2780 bytes --] Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in from the server. And, since the writing and reading cannot be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read-modify-write algorithm when initially allocating the page in the write system call path. 
The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 153,000 READ requests and 265,500 WRITE requests. The modified kernel containing the patch generated about 156,000 READ requests and 257,000 WRITE requests. Thus, about 3,000 more READ requests were generated, but about 8,500 fewer WRITE requests were generated. I suspect that many of these eliminated WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. Thanx... 
ps Signed-off-by: Peter Staubach <staubach@redhat.com> [-- Attachment #2: read-modify-write.devel --] [-- Type: text/plain, Size: 980 bytes --] --- linux-2.6.30.i686/fs/nfs/file.c.org +++ linux-2.6.30.i686/fs/nfs/file.c @@ -337,15 +337,15 @@ static int nfs_write_begin(struct file * struct page **pagep, void **fsdata) { int ret; - pgoff_t index; + pgoff_t index = pos >> PAGE_CACHE_SHIFT; struct page *page; - index = pos >> PAGE_CACHE_SHIFT; dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n", file->f_path.dentry->d_parent->d_name.name, file->f_path.dentry->d_name.name, mapping->host->i_ino, len, (long long) pos); +start: /* * Prevent starvation issues if someone is doing a consistency * sync-to-disk @@ -364,6 +364,12 @@ static int nfs_write_begin(struct file * if (ret) { unlock_page(page); page_cache_release(page); + } else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) && + ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) { + ret = nfs_readpage(file, page); + page_cache_release(page); + if (!ret) + goto start; } return ret; } ^ permalink raw reply [flat|nested] 94+ messages in thread
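[Editorial note: the core of the v1 decision above is page-offset arithmetic. A minimal userspace sketch of that decision -- the function name is ours, and the `PAGE_CACHE_SIZE` value and `FMODE_READ` constant are assumed typical values, not taken from the patched kernel:]

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_CACHE_SIZE 4096UL  /* assumed page size, as on x86 */
#define FMODE_READ 0x1          /* "file opened for reading" flag */

/*
 * Model of the v1 heuristic: pre-read the page from the server only
 * when the file is open for reading and the write would leave part
 * of the page without valid contents (nonzero offset within the
 * page, or a length shorter than the page).
 */
bool want_read_modify_write(unsigned int f_mode, long long pos,
                            unsigned int len)
{
    unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);

    return (f_mode & FMODE_READ) &&
           (offset != 0 || len != PAGE_CACHE_SIZE);
}
```

A full-page overwrite aligned to a page boundary skips the read entirely, which is why the patch costs nothing for large sequential writes.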
* Re: [PATCH] read-modify-write page updating 2009-06-24 19:54 ` [PATCH] read-modify-write page updating Peter Staubach @ 2009-06-25 17:13 ` Trond Myklebust [not found] ` <1245950029.4913.17.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-07-09 14:12 ` [PATCH v2] " Peter Staubach 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-06-25 17:13 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs On Wed, 2009-06-24 at 15:54 -0400, Peter Staubach wrote: > Hi. > > I have a proposal for possibly resolving this issue. > > I believe that this situation occurs due to the way that the > Linux NFS client handles writes which modify partial pages. > > The Linux NFS client handles partial page modifications by > allocating a page from the page cache, copying the data from > the user level into the page, and then keeping track of the > offset and length of the modified portions of the page. The > page is not marked as up to date because there are portions > of the page which do not contain valid file contents. > > When a read call comes in for a portion of the page, the > contents of the page must be read in the from the server. > However, since the page may already contain some modified > data, that modified data must be written to the server > before the file contents can be read back in the from server. > And, since the writing and reading can not be done atomically, > the data must be written and committed to stable storage on > the server for safety purposes. This means either a > FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. > This has been discussed at length previously. > > This algorithm could be described as modify-write-read. It > is most efficient when the application only updates pages > and does not read them. 
> > My proposed solution is to add a heuristic to decide whether > to do this modify-write-read algorithm or switch to a read- > modify-write algorithm when initially allocating the page > in the write system call path. The heuristic uses the modes > that the file was opened with, the offset in the page to > read from, and the size of the region to read. > > If the file was opened for reading in addition to writing > and the page would not be filled completely with data from > the user level, then read in the old contents of the page > and mark it as Uptodate before copying in the new data. If > the page would be completely filled with data from the user > level, then there would be no reason to read in the old > contents because they would just be copied over. > > This would optimize for applications which randomly access > and update portions of files. The linkage editor for the > C compiler is an example of such a thing. > > I tested the attached patch by using rpmbuild to build the > current Fedora rawhide kernel. The kernel without the > patch generated about 153,000 READ requests and 265,500 > WRITE requests. The modified kernel containing the patch > generated about 156,000 READ requests and 257,000 WRITE > requests. Thus, about 3,000 more READ requests were > generated, but about 8,500 fewer WRITE requests were > generated. I suspect that many of these additional > WRITE requests were probably FILE_SYNC requests to WRITE > a single page, but I didn't test this theory. > > Thanx... 
> > ps > > Signed-off-by: Peter Staubach <staubach@redhat.com> > plain text document attachment (read-modify-write.devel) > --- linux-2.6.30.i686/fs/nfs/file.c.org > +++ linux-2.6.30.i686/fs/nfs/file.c > @@ -337,15 +337,15 @@ static int nfs_write_begin(struct file * > struct page **pagep, void **fsdata) > { > int ret; > - pgoff_t index; > + pgoff_t index = pos >> PAGE_CACHE_SHIFT; > struct page *page; > - index = pos >> PAGE_CACHE_SHIFT; > > dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n", > file->f_path.dentry->d_parent->d_name.name, > file->f_path.dentry->d_name.name, > mapping->host->i_ino, len, (long long) pos); > > +start: > /* > * Prevent starvation issues if someone is doing a consistency > * sync-to-disk > @@ -364,6 +364,12 @@ static int nfs_write_begin(struct file * > if (ret) { > unlock_page(page); > page_cache_release(page); > + } else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) && > + ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) { It might also be nice to put the above test in a little inlined helper function (called nfs_want_read_modify_write() ?). So, a number of questions spring to mind: 1. What if we're extending the file? We might not need to read the page at all in that case (see nfs_write_end()). 2. What if the page is already dirty or is carrying an uncommitted unstable write? 3. We might want to try to avoid looping more than once here. If the kernel is very low on memory, we might just want to write out the data rather than read the page and risk having the VM eject it before we can dirty it. 4. Should we be starting an async readahead on the next page? Single page sized reads can be a nuisance too, if you are writing huge amounts of data. > + ret = nfs_readpage(file, page); > + page_cache_release(page); > + if (!ret) > + goto start; > } > return ret; > } Cheers Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: [PATCH] read-modify-write page updating [not found] ` <1245950029.4913.17.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-07-09 13:59 ` Peter Staubach 0 siblings, 0 replies; 94+ messages in thread From: Peter Staubach @ 2009-07-09 13:59 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs Trond Myklebust wrote: > > It might also be nice to put the above test in a little inlined helper > function (called nfs_want_read_modify_write() ?). > > Good suggestion. > So, a number of questions spring to mind: > > 1. What if we're extending the file? We might not need to read the > page at all in that case (see nfs_write_end()). > Yup. > 2. What if the page is already dirty or is carrying an uncommitted > unstable write? > Yup. > 3. We might want to try to avoid looping more than once here. If > the kernel is very low on memory, we might just want to write > out the data rather than read the page and risk having the VM > eject it before we can dirty it. > Yup. > 4. Should we be starting an async readahead on the next page? > Single page sized reads can be a nuisance too, if you are > writing huge amounts of data. This one is tough. It sounds good, but seems difficult to implement. I think that this could be viewed as an optimization. ps ^ permalink raw reply [flat|nested] 94+ messages in thread
* [PATCH v2] read-modify-write page updating 2009-06-24 19:54 ` [PATCH] read-modify-write page updating Peter Staubach 2009-06-25 17:13 ` Trond Myklebust @ 2009-07-09 14:12 ` Peter Staubach 2009-07-09 15:39 ` Trond Myklebust 2009-08-04 17:52 ` [PATCH v3] " Peter Staubach 1 sibling, 2 replies; 94+ messages in thread From: Peter Staubach @ 2009-07-09 14:12 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs [-- Attachment #1: Type: text/plain, Size: 2869 bytes --] Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in from the server. And, since the writing and reading cannot be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read-modify-write algorithm when initially allocating the page in the write system call path. 
The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these eliminated WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The previous version of this patch caused the NFS client to generate around 3,000 more READ requests. This version actually causes the NFS client to generate almost 500 fewer READ requests. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> [-- Attachment #2: read-modify-write.devel.2 --] [-- Type: application/x-troff-man, Size: 2713 bytes --] ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: [PATCH v2] read-modify-write page updating 2009-07-09 14:12 ` [PATCH v2] " Peter Staubach @ 2009-07-09 15:39 ` Trond Myklebust [not found] ` <1247153972.5766.15.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-08-04 17:52 ` [PATCH v3] " Peter Staubach 1 sibling, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-07-09 15:39 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote: > Signed-off-by: Peter Staubach <staubach@redhat.com> Please could you send such patches as inline, rather than as attachments. It makes it harder to comment on the patch contents... > +static int nfs_want_read_modify_write(struct file *file, struct page *page, > + loff_t pos, unsigned len) > +{ > + unsigned int pglen = nfs_page_length(page); > + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1); > + unsigned int end = offset + len; > + > + if ((file->f_mode & FMODE_READ) && /* open for read? */ > + !PageUptodate(page) && /* Uptodate? */ > + !PageDirty(page) && /* Dirty already? */ > + !PagePrivate(page) && /* i/o request already? */ I don't think you need the PageDirty() test. These days we should be guaranteed to always have PagePrivate() set whenever PageDirty() is (although the converse is not true). Anything else would be a bug... > + pglen && /* valid bytes of file? */ > + (end < pglen || offset)) /* replace all valid bytes? */ > + return 1; > + return 0; > +} > + ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: [PATCH v2] read-modify-write page updating [not found] ` <1247153972.5766.15.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-07-10 15:57 ` Peter Staubach 2009-07-10 17:22 ` J. Bruce Fields 0 siblings, 1 reply; 94+ messages in thread From: Peter Staubach @ 2009-07-10 15:57 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs Trond Myklebust wrote: > On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote: > > >> Signed-off-by: Peter Staubach <staubach@redhat.com> >> > > Please could you send such patches as inline, rather than as > attachments. It makes it harder to comment on the patch contents... > > I will investigate how to do this. >> +static int nfs_want_read_modify_write(struct file *file, struct page *page, >> + loff_t pos, unsigned len) >> +{ >> + unsigned int pglen = nfs_page_length(page); >> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1); >> + unsigned int end = offset + len; >> + >> + if ((file->f_mode & FMODE_READ) && /* open for read? */ >> + !PageUptodate(page) && /* Uptodate? */ >> + !PageDirty(page) && /* Dirty already? */ >> + !PagePrivate(page) && /* i/o request already? */ >> > > I don't think you need the PageDirty() test. These days we should be > guaranteed to always have PagePrivate() set whenever PageDirty() is > (although the converse is not true). Anything else would be a bug... > > Okie doke. It seemed to me that this should be true, but it was safer to leave both tests. I will remove that PageDirty test, retest, and then send another version of the patch. I will be out next week, so it will take a couple of weeks. Thanx... ps >> + pglen && /* valid bytes of file? */ >> + (end < pglen || offset)) /* replace all valid bytes? */ >> + return 1; >> + return 0; >> +} >> + >> > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: [PATCH v2] read-modify-write page updating 2009-07-10 15:57 ` Peter Staubach @ 2009-07-10 17:22 ` J. Bruce Fields 0 siblings, 0 replies; 94+ messages in thread From: J. Bruce Fields @ 2009-07-10 17:22 UTC (permalink / raw) To: Peter Staubach; +Cc: Trond Myklebust, Brian R Cowan, linux-nfs On Fri, Jul 10, 2009 at 11:57:02AM -0400, Peter Staubach wrote: > Trond Myklebust wrote: >> On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote: >> >> >>> Signed-off-by: Peter Staubach <staubach@redhat.com> >>> >> >> Please could you send such patches as inline, rather than as >> attachments. It makes it harder to comment on the patch contents... >> >> > > I will investigate how to do this. See Documentation/email-clients.txt. (It has an entry for Thunderbird, for example.) --b. > >>> +static int nfs_want_read_modify_write(struct file *file, struct page *page, >>> + loff_t pos, unsigned len) >>> +{ >>> + unsigned int pglen = nfs_page_length(page); >>> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1); >>> + unsigned int end = offset + len; >>> + >>> + if ((file->f_mode & FMODE_READ) && /* open for read? */ >>> + !PageUptodate(page) && /* Uptodate? */ >>> + !PageDirty(page) && /* Dirty already? */ >>> + !PagePrivate(page) && /* i/o request already? */ >>> >> >> I don't think you need the PageDirty() test. These days we should be >> guaranteed to always have PagePrivate() set whenever PageDirty() is >> (although the converse is not true). Anything else would be a bug... >> >> > > Okie doke. It seemed to me that this should be true, but it was > safer to leave both tests. > > I will remove that PageDirty test, retest, and then send another > version of the patch. I will be out next week, so it will take a > couple of weeks. > > Thanx... > > ps > >>> + pglen && /* valid bytes of file? */ >>> + (end < pglen || offset)) /* replace all valid bytes? 
*/ >>> + return 1; >>> + return 0; >>> +} >>> + >>> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
* [PATCH v3] read-modify-write page updating 2009-07-09 14:12 ` [PATCH v2] " Peter Staubach 2009-07-09 15:39 ` Trond Myklebust @ 2009-08-04 17:52 ` Peter Staubach 2009-08-05 0:50 ` Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: Peter Staubach @ 2009-08-04 17:52 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in from the server. And, since the writing and reading cannot be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read-modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. 
If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such an application. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of the WRITE requests eliminated by the patch were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> --- linux-2.6.30.i686/fs/nfs/file.c.org +++ linux-2.6.30.i686/fs/nfs/file.c @@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct } /* + * Decide whether a read/modify/write cycle may be more efficient + * than a modify/write/read cycle when writing to a page in the + * page cache. + * + * The modify/write/read cycle may occur if a page is read before + * being completely filled by the writer. In this situation, the + * page must be completely written to stable storage on the server + * before it can be refilled by reading in the page from the server. + * This can lead to expensive, small, FILE_SYNC mode writes being + * done. 
+ * + * It may be more efficient to read the page first if the file is + * open for reading in addition to writing, the page is not marked + * as Uptodate, it is not dirty or waiting to be committed, + * indicating that it was previously allocated and then modified, + * that there were valid bytes of data in that range of the file, + * and that the new data won't completely replace the old data in + * that range of the file. + */ +static int nfs_want_read_modify_write(struct file *file, struct page *page, + loff_t pos, unsigned len) +{ + unsigned int pglen = nfs_page_length(page); + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1); + unsigned int end = offset + len; + + if ((file->f_mode & FMODE_READ) && /* open for read? */ + !PageUptodate(page) && /* Uptodate? */ + !PagePrivate(page) && /* i/o request already? */ + pglen && /* valid bytes of file? */ + (end < pglen || offset)) /* replace all valid bytes? */ + return 1; + return 0; +} + +/* * This does the "real" work of the write. We must allocate and lock the * page to be sent back to the generic routine, which then copies the * data from user space. 
@@ -340,15 +376,16 @@ static int nfs_write_begin(struct file * struct page **pagep, void **fsdata) { int ret; - pgoff_t index; + pgoff_t index = pos >> PAGE_CACHE_SHIFT; struct page *page; - index = pos >> PAGE_CACHE_SHIFT; + int once_thru = 0; dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n", file->f_path.dentry->d_parent->d_name.name, file->f_path.dentry->d_name.name, mapping->host->i_ino, len, (long long) pos); +start: /* * Prevent starvation issues if someone is doing a consistency * sync-to-disk @@ -367,6 +404,13 @@ static int nfs_write_begin(struct file * if (ret) { unlock_page(page); page_cache_release(page); + } else if (!once_thru && + nfs_want_read_modify_write(file, page, pos, len)) { + once_thru = 1; + ret = nfs_readpage(file, page); + page_cache_release(page); + if (!ret) + goto start; } return ret; } ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: [PATCH v3] read-modify-write page updating 2009-08-04 17:52 ` [PATCH v3] " Peter Staubach @ 2009-08-05 0:50 ` Trond Myklebust 0 siblings, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-08-05 0:50 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, linux-nfs On Tue, 2009-08-04 at 13:52 -0400, Peter Staubach wrote: > Signed-off-by: Peter Staubach <staubach@redhat.com> > > --- linux-2.6.30.i686/fs/nfs/file.c.org > +++ linux-2.6.30.i686/fs/nfs/file.c > @@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct > } > > /* > + * Decide whether a read/modify/write cycle may be more efficient > + * then a modify/write/read cycle when writing to a page in the > + * page cache. > + * > + * The modify/write/read cycle may occur if a page is read before > + * being completely filled by the writer. In this situation, the > + * page must be completely written to stable storage on the server > + * before it can be refilled by reading in the page from the server. > + * This can lead to expensive, small, FILE_SYNC mode writes being > + * done. > + * > + * It may be more efficient to read the page first if the file is > + * open for reading in addition to writing, the page is not marked > + * as Uptodate, it is not dirty or waiting to be committed, > + * indicating that it was previously allocated and then modified, > + * that there were valid bytes of data in that range of the file, > + * and that the new data won't completely replace the old data in > + * that range of the file. > + */ > +static int nfs_want_read_modify_write(struct file *file, struct page *page, > + loff_t pos, unsigned len) > +{ > + unsigned int pglen = nfs_page_length(page); > + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1); > + unsigned int end = offset + len; > + > + if ((file->f_mode & FMODE_READ) && /* open for read? */ > + !PageUptodate(page) && /* Uptodate? */ > + !PagePrivate(page) && /* i/o request already? */ > + pglen && /* valid bytes of file? 
*/ > + (end < pglen || offset)) /* replace all valid bytes? */ > + return 1; > + return 0; > +} > + > +/* > * This does the "real" work of the write. We must allocate and lock the > * page to be sent back to the generic routine, which then copies the > * data from user space. > @@ -340,15 +376,16 @@ static int nfs_write_begin(struct file * > struct page **pagep, void **fsdata) > { > int ret; > - pgoff_t index; > + pgoff_t index = pos >> PAGE_CACHE_SHIFT; > struct page *page; > - index = pos >> PAGE_CACHE_SHIFT; > + int once_thru = 0; > > dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n", > file->f_path.dentry->d_parent->d_name.name, > file->f_path.dentry->d_name.name, > mapping->host->i_ino, len, (long long) pos); > > +start: > /* > * Prevent starvation issues if someone is doing a consistency > * sync-to-disk > @@ -367,6 +404,13 @@ static int nfs_write_begin(struct file * > if (ret) { > unlock_page(page); > page_cache_release(page); > + } else if (!once_thru && > + nfs_want_read_modify_write(file, page, pos, len)) { > + once_thru = 1; > + ret = nfs_readpage(file, page); > + page_cache_release(page); > + if (!ret) > + goto start; > } > return ret; > } > Thanks! Applied... Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243615595.7155.48.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-05-29 17:25 ` Brian R Cowan @ 2009-05-29 17:48 ` Peter Staubach 2009-05-29 18:21 ` Trond Myklebust 1 sibling, 1 reply; 94+ messages in thread From: Peter Staubach @ 2009-05-29 17:48 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner Trond Myklebust wrote: > Look... This happens when you _flush_ the file to stable storage if > there is only a single write < wsize. It isn't the business of the NFS > layer to decide when you flush the file; that's an application > decision... > > I think that one easy way to show why this optimization is not quite what we would all like, why there only being a single write _now_ isn't quite sufficient, is to write a block of a file and then read it back. Things like compilers and linkers might do this during their random access to the file being created. I would guess that this audit thing that Brian has referred to does the same sort of thing. ps ps. Why do we flush dirty pages before they can be read? I am not even clear why we care about waiting for an already existing flush to be completed before using the page to satisfy a read system call. > Trond > > > > On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote: > >> Been working this issue with Red hat, and didn't need to go to the list... >> Well, now I do... You mention that "The main type of workload we're >> targetting with this patch is the app that opens a file, writes < 4k and >> then closes the file." Well, it appears that this issue also impacts >> flushing pages from filesystem caches. >> >> The reason this came up in my environment is that our product's build >> auditing gives the filesystem cache an interesting workout. 
When >> ClearCase audits a build, the build places data in a few places, >> including: >> 1) a build audit file that usually resides in /tmp. This build audit is >> essentially a log of EVERY file open/read/write/delete/rename/etc. that >> the programs called in the build script make in the clearcase "view" >> you're building in. As a result, this file can get pretty large. >> 2) The build outputs themselves, which in this case are being written to a >> remote storage location on a Linux or Solaris server, and >> 3) a file called .cmake.state, which is a local cache that is written to >> after the build script completes containing what is essentially a "Bill of >> materials" for the files created during builds in this "view." >> >> We believe that the build audit file access is causing build output to get >> flushed out of the filesystem cache. These flushes happen *in 4k chunks.* >> This trips over this change since the cache pages appear to get flushed on >> an individual basis. >> >> One note is that if the build outputs were going to a clearcase view >> stored on an enterprise-level NAS device, there isn't as much of an issue >> because many of these return from the stable write request as soon as the >> data goes into the battery-backed memory disk cache on the NAS. However, >> it really impacts writes to general-purpose OS's that follow Sun's lead in >> how they handle "stable" writes. The truly annoying part about this rather >> subtle change is that the NFS client is specifically ignoring the client >> mount options since we cannot force the "async" mount option to turn off >> this behavior. 
>> >> ================================================================= >> Brian Cowan >> Advisory Software Engineer >> ClearCase Customer Advocacy Group (CAG) >> Rational Software >> IBM Software Group >> 81 Hartwell Ave >> Lexington, MA >> >> Phone: 1.781.372.3580 >> Web: http://www.ibm.com/software/rational/support/ >> >> >> Please be sure to update your PMR using ESR at >> http://www-306.ibm.com/software/support/probsub.html or cc all >> correspondence to sw_support@us.ibm.com to be sure your PMR is updated in >> case I am not available. >> >> >> >> From: >> Trond Myklebust <trond.myklebust@fys.uio.no> >> To: >> Peter Staubach <staubach@redhat.com> >> Cc: >> Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/IBM@IBMUS, >> linux-nfs@vger.kernel.org >> Date: >> 04/30/2009 05:23 PM >> Subject: >> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing >> Sent by: >> linux-nfs-owner@vger.kernel.org >> >> >> >> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: >> >>> Chuck Lever wrote: >>> >>>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: >>>> >>>>> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 >> >> >>> Actually, the "stable" part can be a killer. It depends upon >>> why and when nfs_flush_inode() is invoked. >>> >>> I did quite a bit of work on this aspect of RHEL-5 and discovered >>> that this particular code was leading to some serious slowdowns. >>> The server would end up doing a very slow FILE_SYNC write when >>> all that was really required was an UNSTABLE write at the time. >>> >>> Did anyone actually measure this optimization and if so, what >>> were the numbers? >>> >> As usual, the optimisation is workload dependent. The main type of >> workload we're targetting with this patch is the app that opens a file, >> writes < 4k and then closes the file. 
For that case, it's a no-brainer >> that you don't need to split a single stable write into an unstable + a >> commit. >> >> So if the application isn't doing the above type of short write followed >> by close, then exactly what is causing a flush to disk in the first >> place? Ordinarily, the client will try to cache writes until the cows >> come home (or until the VM tells it to reclaim memory - whichever comes >> first)... >> >> Cheers >> Trond >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:48 ` Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Peter Staubach @ 2009-05-29 18:21 ` Trond Myklebust 0 siblings, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 18:21 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner On Fri, 2009-05-29 at 13:48 -0400, Peter Staubach wrote: > Trond Myklebust wrote: > > Look... This happens when you _flush_ the file to stable storage if > > there is only a single write < wsize. It isn't the business of the NFS > > layer to decide when you flush the file; that's an application > > decision... > > > > > > I think that one easy way to show why this optimization is > not quite what we would all like, why there only being a > single write _now_ isn't quite sufficient, is to write a > block of a file and then read it back. Things like > compilers and linkers might do this during their random > access to the file being created. I would guess that this > audit thing that Brian has refered to does the same sort > of thing. > > ps > > ps. Why do we flush dirty pages before they can be read? > I am not even clear why we care about waiting for an > already existing flush to be completed before using the > page to satisfy a read system call. We only do this if the page cannot be marked as up to date. i.e. there have to be parts of the page which contain valid data on the server, and that our client hasn't read in yet, and that aren't being overwritten by our write. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 15:55 ` Brian R Cowan 2009-05-29 16:46 ` Trond Myklebust @ 2009-05-29 17:01 ` Chuck Lever 2009-05-29 17:38 ` Brian R Cowan 1 sibling, 1 reply; 94+ messages in thread From: Chuck Lever @ 2009-05-29 17:01 UTC (permalink / raw) To: Brian R Cowan; +Cc: Trond Myklebust, linux-nfs, linux-nfs-owner, Peter Staubach On May 29, 2009, at 11:55 AM, Brian R Cowan wrote: > Been working this issue with Red hat, and didn't need to go to the > list... > Well, now I do... You mention that "The main type of workload we're > targetting with this patch is the app that opens a file, writes < 4k > and > then closes the file." Well, it appears that this issue also impacts > flushing pages from filesystem caches. > > The reason this came up in my environment is that our product's build > auditing gives the the filesystem cache an interesting workout. When > ClearCase audits a build, the build places data in a few places, > including: > 1) a build audit file that usually resides in /tmp. This build audit > is > essentially a log of EVERY file open/read/write/delete/rename/etc. > that > the programs called in the build script make in the clearcase "view" > you're building in. As a result, this file can get pretty large. > 2) The build outputs themselves, which in this case are being > written to a > remote storage location on a Linux or Solaris server, and > 3) a file called .cmake.state, which is a local cache that is > written to > after the build script completes containing what is essentially a > "Bill of > materials" for the files created during builds in this "view." > > We believe that the build audit file access is causing build output > to get > flushed out of the filesystem cache. These flushes happen *in 4k > chunks.* > This trips over this change since the cache pages appear to get > flushed on > an individual basis. 
So, are you saying that the application is flushing after every 4KB write(2), or that the application has written a bunch of pages, and VM/VFS on the client is doing the synchronous page flushes? If it's the application doing this, then you really do not want to mitigate this by defeating the STABLE writes -- the application must have some requirement that the data is permanent. Unless I have misunderstood something, the previous faster behavior was due to cheating, and put your data at risk. I can't see how replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would cause such a significant performance impact. > One note is that if the build outputs were going to a clearcase view > stored on an enterprise-level NAS device, there isn't as much of an > issue > because many of these return from the stable write request as soon > as the > data goes into the battery-backed memory disk cache on the NAS. > However, > it really impacts writes to general-purpose OS's that follow Sun's > lead in > how they handle "stable" writes. The truly annoying part about this > rather > subtle change is that the NFS client is specifically ignoring the > client > mount options since we cannot force the "async" mount option to turn > off > this behavior. You may have a misunderstanding about what exactly "async" does. The "sync" / "async" mount options control only whether the application waits for the data to be flushed to permanent storage. They have no effect, on any file system I know of, on _how_ specifically the data is moved from the page cache to permanent storage. 
> ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is > updated in > case I am not available. > > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Peter Staubach <staubach@redhat.com> > Cc: > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ > IBM@IBMUS, > linux-nfs@vger.kernel.org > Date: > 04/30/2009 05:23 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page > flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: >> Chuck Lever wrote: >>> >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: >>>> >>>> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > >>>> >> Actually, the "stable" part can be a killer. It depends upon >> why and when nfs_flush_inode() is invoked. >> >> I did quite a bit of work on this aspect of RHEL-5 and discovered >> that this particular code was leading to some serious slowdowns. >> The server would end up doing a very slow FILE_SYNC write when >> all that was really required was an UNSTABLE write at the time. >> >> Did anyone actually measure this optimization and if so, what >> were the numbers? > > As usual, the optimisation is workload dependent. The main type of > workload we're targetting with this patch is the app that opens a > file, > writes < 4k and then closes the file. For that case, it's a no-brainer > that you don't need to split a single stable write into an unstable > + a > commit. 
> > So if the application isn't doing the above type of short write > followed > by close, then exactly what is causing a flush to disk in the first > place? Ordinarily, the client will try to cache writes until the cows > come home (or until the VM tells it to reclaim memory - whichever > comes > first)... > > Cheers > Trond > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:01 ` Chuck Lever @ 2009-05-29 17:38 ` Brian R Cowan 2009-05-29 17:42 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 17:38 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-nfs, linux-nfs-owner, Peter Staubach, Trond Myklebust > You may have a misunderstanding about what exactly "async" does. The > "sync" / "async" mount options control only whether the application > waits for the data to be flushed to permanent storage. They have no > effect on any file system I know of _how_ specifically the data is > moved from the page cache to permanent storage. The problem is that the client change seems to cause the application to stop until this stable write completes... What is interesting is that it's not always a write operation that the linker gets stuck on. Our best hypothesis -- from correlating times in strace and tcpdump traces -- is that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* system calls on the output file (that is opened for read/write). We THINK the read call triggers a FILE_SYNC write if the page is dirty...and that is why the read calls are taking so long. Seeing writes happening when the app is waiting for a read is odd to say the least... (In my test, there is nothing else running on the Virtual machines, so the only thing that could be triggering the filesystem activity is the build test...) ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. 
From: Chuck Lever <chuck.lever@oracle.com> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Trond Myklebust <trond.myklebust@fys.uio.no>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> Date: 05/29/2009 01:02 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Sent by: linux-nfs-owner@vger.kernel.org On May 29, 2009, at 11:55 AM, Brian R Cowan wrote: > Been working this issue with Red hat, and didn't need to go to the > list... > Well, now I do... You mention that "The main type of workload we're > targetting with this patch is the app that opens a file, writes < 4k > and > then closes the file." Well, it appears that this issue also impacts > flushing pages from filesystem caches. > > The reason this came up in my environment is that our product's build > auditing gives the the filesystem cache an interesting workout. When > ClearCase audits a build, the build places data in a few places, > including: > 1) a build audit file that usually resides in /tmp. This build audit > is > essentially a log of EVERY file open/read/write/delete/rename/etc. > that > the programs called in the build script make in the clearcase "view" > you're building in. As a result, this file can get pretty large. > 2) The build outputs themselves, which in this case are being > written to a > remote storage location on a Linux or Solaris server, and > 3) a file called .cmake.state, which is a local cache that is > written to > after the build script completes containing what is essentially a > "Bill of > materials" for the files created during builds in this "view." > > We believe that the build audit file access is causing build output > to get > flushed out of the filesystem cache. These flushes happen *in 4k > chunks.* > This trips over this change since the cache pages appear to get > flushed on > an individual basis. 
So, are you saying that the application is flushing after every 4KB write(2), or that the application has written a bunch of pages, and VM/ VFS on the client is doing the synchronous page flushes? If it's the application doing this, then you really do not want to mitigate this by defeating the STABLE writes -- the application must have some requirement that the data is permanent. Unless I have misunderstood something, the previous faster behavior was due to cheating, and put your data at risk. I can't see how replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would cause such a significant performance impact. > One note is that if the build outputs were going to a clearcase view > stored on an enterprise-level NAS device, there isn't as much of an > issue > because many of these return from the stable write request as soon > as the > data goes into the battery-backed memory disk cache on the NAS. > However, > it really impacts writes to general-purpose OS's that follow Sun's > lead in > how they handle "stable" writes. The truly annoying part about this > rather > subtle change is that the NFS client is specifically ignoring the > client > mount options since we cannot force the "async" mount option to turn > off > this behavior. You may have a misunderstanding about what exactly "async" does. The "sync" / "async" mount options control only whether the application waits for the data to be flushed to permanent storage. They have no effect on any file system I know of _how_ specifically the data is moved from the page cache to permanent storage. 
> ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is > updated in > case I am not available. > > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Peter Staubach <staubach@redhat.com> > Cc: > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ > IBM@IBMUS, > linux-nfs@vger.kernel.org > Date: > 04/30/2009 05:23 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page > flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: >> Chuck Lever wrote: >>> >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: >>>> >>>> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > >>>> >> Actually, the "stable" part can be a killer. It depends upon >> why and when nfs_flush_inode() is invoked. >> >> I did quite a bit of work on this aspect of RHEL-5 and discovered >> that this particular code was leading to some serious slowdowns. >> The server would end up doing a very slow FILE_SYNC write when >> all that was really required was an UNSTABLE write at the time. >> >> Did anyone actually measure this optimization and if so, what >> were the numbers? > > As usual, the optimisation is workload dependent. The main type of > workload we're targetting with this patch is the app that opens a > file, > writes < 4k and then closes the file. For that case, it's a no-brainer > that you don't need to split a single stable write into an unstable > + a > commit. 
> > So if the application isn't doing the above type of short write > followed > by close, then exactly what is causing a flush to disk in the first > place? Ordinarily, the client will try to cache writes until the cows > come home (or until the VM tells it to reclaim memory - whichever > comes > first)... > > Cheers > Trond > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:38 ` Brian R Cowan @ 2009-05-29 17:42 ` Trond Myklebust [not found] ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 17:42 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > > You may have a misunderstanding about what exactly "async" does. The > > "sync" / "async" mount options control only whether the application > > waits for the data to be flushed to permanent storage. They have no > > effect on any file system I know of _how_ specifically the data is > > moved from the page cache to permanent storage. > > The problem is that the client change seems to cause the application to > stop until this stable write completes... What is interesting is that it's > not always a write operation that the linker gets stuck on. Our best > hypothesis -- from correlating times in strace and tcpdump traces -- is > that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* > system calls on the output file (that is opened for read/write). We THINK > the read call triggers a FILE_SYNC write if the page is dirty...and that > is why the read calls are taking so long. Seeing writes happening when the > app is waiting for a read is odd to say the least... (In my test, there is > nothing else running on the Virtual machines, so the only thing that could > be triggering the filesystem activity is the build test...) Yes. If the page is dirty, but not up to date, then it needs to be cleaned before you can overwrite the contents with the results of a fresh read. That means flushing the data to disk... Which again means doing either a stable write or an unstable write+commit. 
The former is more efficient that the latter, 'cos it accomplishes the exact same work in a single RPC call. Trond > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. > > > > From: > Chuck Lever <chuck.lever@oracle.com> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Trond Myklebust <trond.myklebust@fys.uio.no>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> > Date: > 05/29/2009 01:02 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > > On May 29, 2009, at 11:55 AM, Brian R Cowan wrote: > > > Been working this issue with Red hat, and didn't need to go to the > > list... > > Well, now I do... You mention that "The main type of workload we're > > targetting with this patch is the app that opens a file, writes < 4k > > and > > then closes the file." Well, it appears that this issue also impacts > > flushing pages from filesystem caches. > > > > The reason this came up in my environment is that our product's build > > auditing gives the the filesystem cache an interesting workout. When > > ClearCase audits a build, the build places data in a few places, > > including: > > 1) a build audit file that usually resides in /tmp. This build audit > > is > > essentially a log of EVERY file open/read/write/delete/rename/etc. > > that > > the programs called in the build script make in the clearcase "view" > > you're building in. 
As a result, this file can get pretty large. > > 2) The build outputs themselves, which in this case are being > > written to a > > remote storage location on a Linux or Solaris server, and > > 3) a file called .cmake.state, which is a local cache that is > > written to > > after the build script completes containing what is essentially a > > "Bill of > > materials" for the files created during builds in this "view." > > > > We believe that the build audit file access is causing build output > > to get > > flushed out of the filesystem cache. These flushes happen *in 4k > > chunks.* > > This trips over this change since the cache pages appear to get > > flushed on > > an individual basis. > > So, are you saying that the application is flushing after every 4KB > write(2), or that the application has written a bunch of pages, and VM/ > VFS on the client is doing the synchronous page flushes? If it's the > application doing this, then you really do not want to mitigate this > by defeating the STABLE writes -- the application must have some > requirement that the data is permanent. > > Unless I have misunderstood something, the previous faster behavior > was due to cheating, and put your data at risk. I can't see how > replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would > cause such a significant performance impact. > > > One note is that if the build outputs were going to a clearcase view > > stored on an enterprise-level NAS device, there isn't as much of an > > issue > > because many of these return from the stable write request as soon > > as the > > data goes into the battery-backed memory disk cache on the NAS. > > However, > > it really impacts writes to general-purpose OS's that follow Sun's > > lead in > > how they handle "stable" writes. 
The truly annoying part about this > > rather > > subtle change is that the NFS client is specifically ignoring the > > client > > mount options since we cannot force the "async" mount option to turn > > off > > this behavior. > > You may have a misunderstanding about what exactly "async" does. The > "sync" / "async" mount options control only whether the application > waits for the data to be flushed to permanent storage. They have no > effect on any file system I know of _how_ specifically the data is > moved from the page cache to permanent storage. > > > ================================================================= > > Brian Cowan > > Advisory Software Engineer > > ClearCase Customer Advocacy Group (CAG) > > Rational Software > > IBM Software Group > > 81 Hartwell Ave > > Lexington, MA > > > > Phone: 1.781.372.3580 > > Web: http://www.ibm.com/software/rational/support/ > > > > > > Please be sure to update your PMR using ESR at > > http://www-306.ibm.com/software/support/probsub.html or cc all > > correspondence to sw_support@us.ibm.com to be sure your PMR is > > updated in > > case I am not available. > > > > > > > > From: > > Trond Myklebust <trond.myklebust@fys.uio.no> > > To: > > Peter Staubach <staubach@redhat.com> > > Cc: > > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ > > IBM@IBMUS, > > linux-nfs@vger.kernel.org > > Date: > > 04/30/2009 05:23 PM > > Subject: > > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page > > flushing > > Sent by: > > linux-nfs-owner@vger.kernel.org > > > > > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > >> Chuck Lever wrote: > >>> > >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > >>>> > >>>> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > > > >>>> > >> Actually, the "stable" part can be a killer. It depends upon > >> why and when nfs_flush_inode() is invoked. 
> >> > >> I did quite a bit of work on this aspect of RHEL-5 and discovered > >> that this particular code was leading to some serious slowdowns. > >> The server would end up doing a very slow FILE_SYNC write when > >> all that was really required was an UNSTABLE write at the time. > >> > >> Did anyone actually measure this optimization and if so, what > >> were the numbers? > > > > As usual, the optimisation is workload dependent. The main type of > > workload we're targetting with this patch is the app that opens a > > file, > > writes < 4k and then closes the file. For that case, it's a no-brainer > > that you don't need to split a single stable write into an unstable > > + a > > commit. > > > > So if the application isn't doing the above type of short write > > followed > > by close, then exactly what is causing a flush to disk in the first > > place? Ordinarily, the client will try to cache writes until the cows > > come home (or until the VM tells it to reclaim memory - whichever > > comes > > first)... > > > > Cheers > > Trond > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > > in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > Chuck Lever > chuck[dot]lever[at]oracle[dot]com > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 94+ messages in thread
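Trond's "no-brainer" case can be sketched as a simplified model of the heuristic in the commit under discussion: if the whole batch of dirty pages fits in a single WRITE RPC, send it stable and skip the separate COMMIT. The names and the exact condition below are illustrative, not the kernel's actual code:

```c
/* NFSv3 stable_how values (RFC 1813): UNSTABLE = 0, FILE_SYNC = 2. */
enum stable_how { NFS_UNSTABLE = 0, NFS_FILE_SYNC = 2 };

/* npages: dirty pages queued for this flush.
 * wpages: pages that fit in one wsize-sized WRITE (wsize / PAGE_SIZE).
 *
 * "For single writes, FLUSH_STABLE is more efficient": one FILE_SYNC
 * WRITE replaces an UNSTABLE WRITE plus a COMMIT.  Larger flushes
 * still go out unstable, followed by a single COMMIT at the end. */
enum stable_how flush_how(unsigned int npages, unsigned int wpages)
{
    if (npages <= wpages)
        return NFS_FILE_SYNC;
    return NFS_UNSTABLE;
}
```

In this simplified model, the wsize=2KB workaround reported at the top of the thread also falls out: with wsize smaller than a page, wpages is zero, so even a single dirty page no longer "fits" and the stable write disappears.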
[parent not found: <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 17:47 ` Chuck Lever 2009-05-29 18:15 ` Trond Myklebust 2009-05-29 17:51 ` Peter Staubach ` (2 subsequent siblings) 3 siblings, 1 reply; 94+ messages in thread From: Chuck Lever @ 2009-05-29 17:47 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach On May 29, 2009, at 1:42 PM, Trond Myklebust wrote: > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: >>> You may have a misunderstanding about what exactly "async" does. >>> The >>> "sync" / "async" mount options control only whether the application >>> waits for the data to be flushed to permanent storage. They have no >>> effect on any file system I know of _how_ specifically the data is >>> moved from the page cache to permanent storage. >> >> The problem is that the client change seems to cause the >> application to >> stop until this stable write completes... What is interesting is >> that it's >> not always a write operation that the linker gets stuck on. Our best >> hypothesis -- from correlating times in strace and tcpdump traces >> -- is >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by >> *read()* >> system calls on the output file (that is opened for read/write). We >> THINK >> the read call triggers a FILE_SYNC write if the page is dirty...and >> that >> is why the read calls are taking so long. Seeing writes happening >> when the >> app is waiting for a read is odd to say the least... (In my test, >> there is >> nothing else running on the Virtual machines, so the only thing >> that could >> be triggering the filesystem activity is the build test...) > > Yes. If the page is dirty, but not up to date, then it needs to be > cleaned before you can overwrite the contents with the results of a > fresh read. > That means flushing the data to disk... 
Which again means doing > either a > stable write or an unstable write+commit. The former is more efficient > that the latter, 'cos it accomplishes the exact same work in a single > RPC call. It might be prudent to flush the whole file when such a dirty page is discovered to get the benefit of write coalescing. > Trond > >> ================================================================= >> Brian Cowan >> Advisory Software Engineer >> ClearCase Customer Advocacy Group (CAG) >> Rational Software >> IBM Software Group >> 81 Hartwell Ave >> Lexington, MA >> >> Phone: 1.781.372.3580 >> Web: http://www.ibm.com/software/rational/support/ >> >> >> Please be sure to update your PMR using ESR at >> http://www-306.ibm.com/software/support/probsub.html or cc all >> correspondence to sw_support@us.ibm.com to be sure your PMR is >> updated in >> case I am not available. >> >> >> >> From: >> Chuck Lever <chuck.lever@oracle.com> >> To: >> Brian R Cowan/Cupertino/IBM@IBMUS >> Cc: >> Trond Myklebust <trond.myklebust@fys.uio.no>, linux-nfs@vger.kernel.org >> , >> linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> >> Date: >> 05/29/2009 01:02 PM >> Subject: >> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page >> flushing >> Sent by: >> linux-nfs-owner@vger.kernel.org >> >> >> >> >> On May 29, 2009, at 11:55 AM, Brian R Cowan wrote: >> >>> Been working this issue with Red hat, and didn't need to go to the >>> list... >>> Well, now I do... You mention that "The main type of workload we're >>> targetting with this patch is the app that opens a file, writes < 4k >>> and >>> then closes the file." Well, it appears that this issue also impacts >>> flushing pages from filesystem caches. >>> >>> The reason this came up in my environment is that our product's >>> build >>> auditing gives the the filesystem cache an interesting workout. 
When >>> ClearCase audits a build, the build places data in a few places, >>> including: >>> 1) a build audit file that usually resides in /tmp. This build audit >>> is >>> essentially a log of EVERY file open/read/write/delete/rename/etc. >>> that >>> the programs called in the build script make in the clearcase "view" >>> you're building in. As a result, this file can get pretty large. >>> 2) The build outputs themselves, which in this case are being >>> written to a >>> remote storage location on a Linux or Solaris server, and >>> 3) a file called .cmake.state, which is a local cache that is >>> written to >>> after the build script completes containing what is essentially a >>> "Bill of >>> materials" for the files created during builds in this "view." >>> >>> We believe that the build audit file access is causing build output >>> to get >>> flushed out of the filesystem cache. These flushes happen *in 4k >>> chunks.* >>> This trips over this change since the cache pages appear to get >>> flushed on >>> an individual basis. >> >> So, are you saying that the application is flushing after every 4KB >> write(2), or that the application has written a bunch of pages, and >> VM/ >> VFS on the client is doing the synchronous page flushes? If it's the >> application doing this, then you really do not want to mitigate this >> by defeating the STABLE writes -- the application must have some >> requirement that the data is permanent. >> >> Unless I have misunderstood something, the previous faster behavior >> was due to cheating, and put your data at risk. I can't see how >> replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would >> cause such a significant performance impact. 
>> >>> One note is that if the build outputs were going to a clearcase view >>> stored on an enterprise-level NAS device, there isn't as much of an >>> issue >>> because many of these return from the stable write request as soon >>> as the >>> data goes into the battery-backed memory disk cache on the NAS. >>> However, >>> it really impacts writes to general-purpose OS's that follow Sun's >>> lead in >>> how they handle "stable" writes. The truly annoying part about this >>> rather >>> subtle change is that the NFS client is specifically ignoring the >>> client >>> mount options since we cannot force the "async" mount option to turn >>> off >>> this behavior. >> >> You may have a misunderstanding about what exactly "async" does. The >> "sync" / "async" mount options control only whether the application >> waits for the data to be flushed to permanent storage. They have no >> effect on any file system I know of _how_ specifically the data is >> moved from the page cache to permanent storage. >> >>> ================================================================= >>> Brian Cowan >>> Advisory Software Engineer >>> ClearCase Customer Advocacy Group (CAG) >>> Rational Software >>> IBM Software Group >>> 81 Hartwell Ave >>> Lexington, MA >>> >>> Phone: 1.781.372.3580 >>> Web: http://www.ibm.com/software/rational/support/ >>> >>> >>> Please be sure to update your PMR using ESR at >>> http://www-306.ibm.com/software/support/probsub.html or cc all >>> correspondence to sw_support@us.ibm.com to be sure your PMR is >>> updated in >>> case I am not available. 
>>> >>> >>> >>> From: >>> Trond Myklebust <trond.myklebust@fys.uio.no> >>> To: >>> Peter Staubach <staubach@redhat.com> >>> Cc: >>> Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ >>> IBM@IBMUS, >>> linux-nfs@vger.kernel.org >>> Date: >>> 04/30/2009 05:23 PM >>> Subject: >>> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page >>> flushing >>> Sent by: >>> linux-nfs-owner@vger.kernel.org >>> >>> >>> >>> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: >>>> Chuck Lever wrote: >>>>> >>>>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: >>>>>> >>>>>> >>> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 >> >>> >>>>>> >>>> Actually, the "stable" part can be a killer. It depends upon >>>> why and when nfs_flush_inode() is invoked. >>>> >>>> I did quite a bit of work on this aspect of RHEL-5 and discovered >>>> that this particular code was leading to some serious slowdowns. >>>> The server would end up doing a very slow FILE_SYNC write when >>>> all that was really required was an UNSTABLE write at the time. >>>> >>>> Did anyone actually measure this optimization and if so, what >>>> were the numbers? >>> >>> As usual, the optimisation is workload dependent. The main type of >>> workload we're targetting with this patch is the app that opens a >>> file, >>> writes < 4k and then closes the file. For that case, it's a no- >>> brainer >>> that you don't need to split a single stable write into an unstable >>> + a >>> commit. >>> >>> So if the application isn't doing the above type of short write >>> followed >>> by close, then exactly what is causing a flush to disk in the first >>> place? Ordinarily, the client will try to cache writes until the >>> cows >>> come home (or until the VM tells it to reclaim memory - whichever >>> comes >>> first)... 
>>> >>> Cheers >>> Trond >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" >>> in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >> >> -- >> Chuck Lever >> chuck[dot]lever[at]oracle[dot]com >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux- >> nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 94+ messages in thread
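Chuck's suggestion is about write coalescing: flushing the whole file at once lets contiguous dirty pages be batched into fewer, larger WRITE RPCs than flushing one page at a time. A rough model of the saving (illustrative, not kernel code):

```c
/* Count the WRITE RPCs needed to flush the dirty pages listed in
 * 'pages' (sorted ascending), when contiguous runs are coalesced into
 * RPCs of at most 'wpages' pages each.  Page-at-a-time flushing would
 * cost 'n' RPCs instead. */
unsigned int coalesced_rpcs(const unsigned long *pages,
                            unsigned int n, unsigned int wpages)
{
    unsigned int rpcs = 0, run = 0, i;

    for (i = 0; i < n; i++) {
        run++;
        /* Close the run at a discontinuity or at the last page. */
        if (i + 1 == n || pages[i + 1] != pages[i] + 1) {
            rpcs += (run + wpages - 1) / wpages;  /* ceil(run / wpages) */
            run = 0;
        }
    }
    return rpcs;
}
```

For six dirty pages in runs of four and two, with eight pages per RPC, coalescing needs two WRITEs where page-at-a-time flushing would issue six.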
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:47 ` Chuck Lever @ 2009-05-29 18:15 ` Trond Myklebust 0 siblings, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 18:15 UTC (permalink / raw) To: Chuck Lever; +Cc: Brian R Cowan, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 13:47 -0400, Chuck Lever wrote: > On May 29, 2009, at 1:42 PM, Trond Myklebust wrote: > > > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > >>> You may have a misunderstanding about what exactly "async" does. > >>> The > >>> "sync" / "async" mount options control only whether the application > >>> waits for the data to be flushed to permanent storage. They have no > >>> effect on any file system I know of _how_ specifically the data is > >>> moved from the page cache to permanent storage. > >> > >> The problem is that the client change seems to cause the > >> application to > >> stop until this stable write completes... What is interesting is > >> that it's > >> not always a write operation that the linker gets stuck on. Our best > >> hypothesis -- from correlating times in strace and tcpdump traces > >> -- is > >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by > >> *read()* > >> system calls on the output file (that is opened for read/write). We > >> THINK > >> the read call triggers a FILE_SYNC write if the page is dirty...and > >> that > >> is why the read calls are taking so long. Seeing writes happening > >> when the > >> app is waiting for a read is odd to say the least... (In my test, > >> there is > >> nothing else running on the Virtual machines, so the only thing > >> that could > >> be triggering the filesystem activity is the build test...) > > > > Yes. If the page is dirty, but not up to date, then it needs to be > > cleaned before you can overwrite the contents with the results of a > > fresh read. > > That means flushing the data to disk... 
Which again means doing > > either a > > stable write or an unstable write+commit. The former is more efficient > > that the latter, 'cos it accomplishes the exact same work in a single > > RPC call. > > It might be prudent to flush the whole file when such a dirty page is > discovered to get the benefit of write coalescing. There are very few workloads where that will help. You basically have to be modifying the end of a page that has not previously been read in (so is not already marked up to date) and then writing into the beginning of the next page, which must also be not up to date. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
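The narrow workload Trond describes comes down to coverage arithmetic: the write has to leave partial coverage of two adjacent pages, neither previously read in, and a partially covered never-read page is exactly the dirty-but-not-up-to-date case. A minimal sketch (hypothetical helper; 4 KB pages assumed):

```c
#include <stdbool.h>

#define PAGE_SIZE 4096ul

/* Does a write of [pos, pos + len) fully cover page 'idx'?  A page
 * that is only partially covered, and was never read in, stays
 * !uptodate -- the precondition for the flush-before-read above. */
bool write_covers_page(unsigned long pos, unsigned long len,
                       unsigned long idx)
{
    unsigned long start = idx * PAGE_SIZE;
    unsigned long end = start + PAGE_SIZE;

    return pos <= start && pos + len >= end;
}
```

A 200-byte write at offset 4000 dirties pages 0 and 1 while fully covering neither, so both can end up dirty yet not up to date.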
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-05-29 17:47 ` Chuck Lever @ 2009-05-29 17:51 ` Peter Staubach 2009-05-29 18:25 ` Brian R Cowan 2009-05-29 18:43 ` Trond Myklebust 2009-05-29 17:55 ` Brian R Cowan 2009-05-29 17:57 ` Trond Myklebust 3 siblings, 2 replies; 94+ messages in thread From: Peter Staubach @ 2009-05-29 17:51 UTC (permalink / raw) To: Trond Myklebust; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner Trond Myklebust wrote: > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > >>> You may have a misunderstanding about what exactly "async" does. The >>> "sync" / "async" mount options control only whether the application >>> waits for the data to be flushed to permanent storage. They have no >>> effect on any file system I know of _how_ specifically the data is >>> moved from the page cache to permanent storage. >>> >> The problem is that the client change seems to cause the application to >> stop until this stable write completes... What is interesting is that it's >> not always a write operation that the linker gets stuck on. Our best >> hypothesis -- from correlating times in strace and tcpdump traces -- is >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* >> system calls on the output file (that is opened for read/write). We THINK >> the read call triggers a FILE_SYNC write if the page is dirty...and that >> is why the read calls are taking so long. Seeing writes happening when the >> app is waiting for a read is odd to say the least... (In my test, there is >> nothing else running on the Virtual machines, so the only thing that could >> be triggering the filesystem activity is the build test...) >> > > Yes. If the page is dirty, but not up to date, then it needs to be > cleaned before you can overwrite the contents with the results of a > fresh read. 
> That means flushing the data to disk... Which again means doing either a > stable write or an unstable write+commit. The former is more efficient > that the latter, 'cos it accomplishes the exact same work in a single > RPC call. In the normal case, we aren't overwriting the contents with the results of a fresh read. We are going to simply return the current contents of the page. Given this, then why is the normal data cache consistency mechanism, based on the attribute cache, not sufficient? Thanx... ps ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:51 ` Peter Staubach @ 2009-05-29 18:25 ` Brian R Cowan 2009-05-29 18:43 ` Trond Myklebust 1 sibling, 0 replies; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 18:25 UTC (permalink / raw) To: Peter Staubach; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Trond Myklebust Peter, this is my point. The application/client-side end result is that we're making a read wait for a write. We already have the data we need in the cache, since the application is what put it in there to begin with. I think this is a classic "unintended consequence" that is being observed on SuSE 10, Red Hat 5, and I'm sure others. But since people using my product have only just started moving to Red Hat 5, we're seeing more of these... There aren't too many people who build across NFS, not when local storage is relatively cheap and much faster. But there are companies that do this so the build results are available even if the build host has been turned off, gone to standby/hibernate, or is even a virtual machine that no longer exists. The biggest problem here is that the unavoidable extra filesystem cache load that build auditing creates appears to trigger the flushing. For whatever reason, those flushes happen in such a way as to trigger the STABLE writes instead of the faster UNSTABLE ones. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. 
From: Peter Staubach <staubach@redhat.com> To: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Brian R Cowan/Cupertino/IBM@IBMUS, Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org Date: 05/29/2009 01:51 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Trond Myklebust wrote: > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > >>> You may have a misunderstanding about what exactly "async" does. The >>> "sync" / "async" mount options control only whether the application >>> waits for the data to be flushed to permanent storage. They have no >>> effect on any file system I know of _how_ specifically the data is >>> moved from the page cache to permanent storage. >>> >> The problem is that the client change seems to cause the application to >> stop until this stable write completes... What is interesting is that it's >> not always a write operation that the linker gets stuck on. Our best >> hypothesis -- from correlating times in strace and tcpdump traces -- is >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* >> system calls on the output file (that is opened for read/write). We THINK >> the read call triggers a FILE_SYNC write if the page is dirty...and that >> is why the read calls are taking so long. Seeing writes happening when the >> app is waiting for a read is odd to say the least... (In my test, there is >> nothing else running on the Virtual machines, so the only thing that could >> be triggering the filesystem activity is the build test...) >> > > Yes. If the page is dirty, but not up to date, then it needs to be > cleaned before you can overwrite the contents with the results of a > fresh read. > That means flushing the data to disk... Which again means doing either a > stable write or an unstable write+commit. The former is more efficient > that the latter, 'cos it accomplishes the exact same work in a single > RPC call. 
In the normal case, we aren't overwriting the contents with the results of a fresh read. We are going to simply return the current contents of the page. Given this, then why is the normal data cache consistency mechanism, based on the attribute cache, not sufficient? Thanx... ps ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:51 ` Peter Staubach 2009-05-29 18:25 ` Brian R Cowan @ 2009-05-29 18:43 ` Trond Myklebust 1 sibling, 0 replies; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 18:43 UTC (permalink / raw) To: Peter Staubach; +Cc: Brian R Cowan, Chuck Lever, linux-nfs, linux-nfs-owner On Fri, 2009-05-29 at 13:51 -0400, Peter Staubach wrote: > Trond Myklebust wrote: > > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > > > >>> You may have a misunderstanding about what exactly "async" does. The > >>> "sync" / "async" mount options control only whether the application > >>> waits for the data to be flushed to permanent storage. They have no > >>> effect on any file system I know of _how_ specifically the data is > >>> moved from the page cache to permanent storage. > >>> > >> The problem is that the client change seems to cause the application to > >> stop until this stable write completes... What is interesting is that it's > >> not always a write operation that the linker gets stuck on. Our best > >> hypothesis -- from correlating times in strace and tcpdump traces -- is > >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* > >> system calls on the output file (that is opened for read/write). We THINK > >> the read call triggers a FILE_SYNC write if the page is dirty...and that > >> is why the read calls are taking so long. Seeing writes happening when the > >> app is waiting for a read is odd to say the least... (In my test, there is > >> nothing else running on the Virtual machines, so the only thing that could > >> be triggering the filesystem activity is the build test...) > >> > > > > Yes. If the page is dirty, but not up to date, then it needs to be > > cleaned before you can overwrite the contents with the results of a > > fresh read. > > That means flushing the data to disk... 
Which again means doing either a > > stable write or an unstable write+commit. The former is more efficient > > that the latter, 'cos it accomplishes the exact same work in a single > > RPC call. > > In the normal case, we aren't overwriting the contents with the > results of a fresh read. We are going to simply return the > current contents of the page. Given this, then why is the normal > data cache consistency mechanism, based on the attribute cache, > not sufficient? It is. You would need to look into why the page was not marked with the PG_uptodate flag when it was being filled. We generally do try to do that whenever possible. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
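Trond's answer condenses to a small state model: a read forces a flush only when the page is dirty but not up to date, and a page should be marked up to date whenever it is filled completely. A sketch with hypothetical names (the kernel tracks this with the PG_uptodate and PG_dirty page flags):

```c
#include <stdbool.h>

#define PAGE_SIZE 4096u

struct page_state {
    bool uptodate;   /* PG_uptodate: whole page contents are valid   */
    bool dirty;      /* PG_dirty: page holds unwritten modifications */
};

/* A write always dirties the page; it can mark the page up to date
 * only if it fills the whole page (or the page already was). */
void model_write(struct page_state *p, unsigned int off, unsigned int len)
{
    if (off == 0 && len >= PAGE_SIZE)
        p->uptodate = true;
    p->dirty = true;
}

/* A read is served from the cache unless the page is dirty but not up
 * to date -- then the dirty bytes must first be flushed (stable write,
 * or unstable write + COMMIT) before fresh data can be read in. */
bool read_requires_flush(const struct page_state *p)
{
    return p->dirty && !p->uptodate;
}
```

A 10-byte write into a fresh page leaves it dirty and not up to date, so a read must flush first; a full-page write marks it up to date, and a later read is served straight from the cache.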
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243618968.7155.60.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2009-05-29 17:47 ` Chuck Lever 2009-05-29 17:51 ` Peter Staubach @ 2009-05-29 17:55 ` Brian R Cowan 2009-05-29 18:07 ` Trond Myklebust 2009-05-29 17:57 ` Trond Myklebust 3 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 17:55 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach > Yes. If the page is dirty, but not up to date, then it needs to be > cleaned before you can overwrite the contents with the results of a > fresh read. > That means flushing the data to disk... Which again means doing either a > stable write or an unstable write+commit. The former is more efficient > that the latter, 'cos it accomplishes the exact same work in a single > RPC call. I suspect that the COMMIT RPCs are done somewhere other than in the flush itself. If the "write + commit" operation was happening in that exact manner, then the change in the git at the beginning of this thread *would not have impacted client performance*. I can demonstrate -- at will -- that it does impact performance. So, there is something that keeps track of the number of writes and issues the commits without slowing down the application. This git change bypasses that and degrades the linker performance. ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. 
From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> Date: 05/29/2009 01:43 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing Sent by: linux-nfs-owner@vger.kernel.org On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote: > > You may have a misunderstanding about what exactly "async" does. The > > "sync" / "async" mount options control only whether the application > > waits for the data to be flushed to permanent storage. They have no > > effect on any file system I know of _how_ specifically the data is > > moved from the page cache to permanent storage. > > The problem is that the client change seems to cause the application to > stop until this stable write completes... What is interesting is that it's > not always a write operation that the linker gets stuck on. Our best > hypothesis -- from correlating times in strace and tcpdump traces -- is > that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()* > system calls on the output file (that is opened for read/write). We THINK > the read call triggers a FILE_SYNC write if the page is dirty...and that > is why the read calls are taking so long. Seeing writes happening when the > app is waiting for a read is odd to say the least... (In my test, there is > nothing else running on the Virtual machines, so the only thing that could > be triggering the filesystem activity is the build test...) Yes. If the page is dirty, but not up to date, then it needs to be cleaned before you can overwrite the contents with the results of a fresh read. That means flushing the data to disk... Which again means doing either a stable write or an unstable write+commit. The former is more efficient that the latter, 'cos it accomplishes the exact same work in a single RPC call. 
Trond > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. > > > > From: > Chuck Lever <chuck.lever@oracle.com> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Trond Myklebust <trond.myklebust@fys.uio.no>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> > Date: > 05/29/2009 01:02 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > Sent by: > linux-nfs-owner@vger.kernel.org > > > > > On May 29, 2009, at 11:55 AM, Brian R Cowan wrote: > > > Been working this issue with Red hat, and didn't need to go to the > > list... > > Well, now I do... You mention that "The main type of workload we're > > targetting with this patch is the app that opens a file, writes < 4k > > and > > then closes the file." Well, it appears that this issue also impacts > > flushing pages from filesystem caches. > > > > The reason this came up in my environment is that our product's build > > auditing gives the the filesystem cache an interesting workout. When > > ClearCase audits a build, the build places data in a few places, > > including: > > 1) a build audit file that usually resides in /tmp. This build audit > > is > > essentially a log of EVERY file open/read/write/delete/rename/etc. > > that > > the programs called in the build script make in the clearcase "view" > > you're building in. As a result, this file can get pretty large. 
> > 2) The build outputs themselves, which in this case are being > > written to a > > remote storage location on a Linux or Solaris server, and > > 3) a file called .cmake.state, which is a local cache that is > > written to > > after the build script completes containing what is essentially a > > "Bill of > > materials" for the files created during builds in this "view." > > > > We believe that the build audit file access is causing build output > > to get > > flushed out of the filesystem cache. These flushes happen *in 4k > > chunks.* > > This trips over this change since the cache pages appear to get > > flushed on > > an individual basis. > > So, are you saying that the application is flushing after every 4KB > write(2), or that the application has written a bunch of pages, and VM/ > VFS on the client is doing the synchronous page flushes? If it's the > application doing this, then you really do not want to mitigate this > by defeating the STABLE writes -- the application must have some > requirement that the data is permanent. > > Unless I have misunderstood something, the previous faster behavior > was due to cheating, and put your data at risk. I can't see how > replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would > cause such a significant performance impact. > > > One note is that if the build outputs were going to a clearcase view > > stored on an enterprise-level NAS device, there isn't as much of an > > issue > > because many of these return from the stable write request as soon > > as the > > data goes into the battery-backed memory disk cache on the NAS. > > However, > > it really impacts writes to general-purpose OS's that follow Sun's > > lead in > > how they handle "stable" writes. The truly annoying part about this > > rather > > subtle change is that the NFS client is specifically ignoring the > > client > > mount options since we cannot force the "async" mount option to turn > > off > > this behavior. 
> > You may have a misunderstanding about what exactly "async" does. The > "sync" / "async" mount options control only whether the application > waits for the data to be flushed to permanent storage. They have no > effect on any file system I know of _how_ specifically the data is > moved from the page cache to permanent storage. > > > ================================================================= > > Brian Cowan > > Advisory Software Engineer > > ClearCase Customer Advocacy Group (CAG) > > Rational Software > > IBM Software Group > > 81 Hartwell Ave > > Lexington, MA > > > > Phone: 1.781.372.3580 > > Web: http://www.ibm.com/software/rational/support/ > > > > > > Please be sure to update your PMR using ESR at > > http://www-306.ibm.com/software/support/probsub.html or cc all > > correspondence to sw_support@us.ibm.com to be sure your PMR is > > updated in > > case I am not available. > > > > > > > > From: > > Trond Myklebust <trond.myklebust@fys.uio.no> > > To: > > Peter Staubach <staubach@redhat.com> > > Cc: > > Chuck Lever <chuck.lever@oracle.com>, Brian R Cowan/Cupertino/ > > IBM@IBMUS, > > linux-nfs@vger.kernel.org > > Date: > > 04/30/2009 05:23 PM > > Subject: > > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page > > flushing > > Sent by: > > linux-nfs-owner@vger.kernel.org > > > > > > > > On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote: > >> Chuck Lever wrote: > >>> > >>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote: > >>>> > >>>> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2 > > > > >>>> > >> Actually, the "stable" part can be a killer. It depends upon > >> why and when nfs_flush_inode() is invoked. > >> > >> I did quite a bit of work on this aspect of RHEL-5 and discovered > >> that this particular code was leading to some serious slowdowns. 
> >> The server would end up doing a very slow FILE_SYNC write when > >> all that was really required was an UNSTABLE write at the time. > >> > >> Did anyone actually measure this optimization and if so, what > >> were the numbers? > > > > As usual, the optimisation is workload dependent. The main type of > > workload we're targetting with this patch is the app that opens a > > file, > > writes < 4k and then closes the file. For that case, it's a no-brainer > > that you don't need to split a single stable write into an unstable > > + a > > commit. > > > > So if the application isn't doing the above type of short write > > followed > > by close, then exactly what is causing a flush to disk in the first > > place? Ordinarily, the client will try to cache writes until the cows > > come home (or until the VM tells it to reclaim memory - whichever > > comes > > first)... > > > > Cheers > > Trond > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > > in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > Chuck Lever > chuck[dot]lever[at]oracle[dot]com > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 94+ messages in thread
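The trade-off being argued over in the exchange above can be put in terms of a toy cost model (purely illustrative; these helper names are mine, not the kernel's): per page, a FILE_SYNC write is one RPC where UNSTABLE+COMMIT is two, but across a large flush the unstable writes can all share a single trailing COMMIT, and the server only has to sync to disk once:

```c
#include <assert.h>

/* Illustrative RPC and disk-sync counts for flushing n dirty pages,
 * one page per WRITE. A model of the argument, not kernel code. */

/* FILE_SYNC: one RPC per page, but the server must sync each write
 * to stable storage before replying. */
static int rpcs_stable(int n)         { return n; }
static int server_syncs_stable(int n) { return n; }

/* UNSTABLE: one RPC per page plus one shared COMMIT; the server can
 * gather everything into a single disk sync at COMMIT time. */
static int rpcs_unstable(int n)         { return n + 1; }
static int server_syncs_unstable(int n) { return n ? 1 : 0; }
```

For a single page the stable write wins (1 RPC vs. 2), which is Trond's point; for a build writing hundreds of pages, the amortized COMMIT and the single server-side sync win, which is consistent with the slowdown Brian measured.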
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 17:55 ` Brian R Cowan @ 2009-05-29 18:07 ` Trond Myklebust [not found] ` <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 18:07 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote: > > Yes. If the page is dirty, but not up to date, then it needs to be > > cleaned before you can overwrite the contents with the results of a > > fresh read. > > That means flushing the data to disk... Which again means doing either a > > stable write or an unstable write+commit. The former is more efficient > > that the latter, 'cos it accomplishes the exact same work in a single > > RPC call. > > I suspect that the COMMIT RPC's are done somewhere other than in the flush > itself. If the "write + commit" operation was happening in the that exact > matter, then the change in the git at the beginning of this thread *would > not have impacted client performance*. I can demonstrate -- at will -- > that it does impact performance. So, there is something that keeps track > of the number of writes and issues the commits without slowing down the > application. This git change bypasses that and degrades the linker > performance. If the server gives slower performance for a single stable write, vs. the same unstable write + commit, then you are demonstrating that the server is seriously _broken_. The only other explanation, is if the client prior to that patch being applied was somehow failing to send out the COMMIT. If so, then the client was broken, and the patch is a fix that results in correct behaviour. That would mean that the rest of the client flush code is probably still broken, but at least the nfs_wb_page() is now correct. Those are the only 2 options. 
Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
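The patch at issue boils down to a small heuristic, paraphrased here in user-space form (the function name is mine; see commit ab0a3dbe itself for the real code): when the whole dirty range fits in one WRITE RPC, send it stable and skip the COMMIT. This also explains the wsize=2KB experiment at the top of the thread: a 4KB page no longer fits in a single write, so the heuristic stops firing.

```c
#include <assert.h>

/* stable_how values from RFC 1813. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Paraphrase of the "single write" optimization: if the flush fits in
 * one WRITE RPC, a stable write replaces an unstable write + COMMIT. */
static enum stable_how pick_stability(unsigned int dirty_bytes,
                                      unsigned int wsize)
{
    return (dirty_bytes <= wsize) ? FILE_SYNC : UNSTABLE;
}
```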
[parent not found: <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243620455.7155.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 18:18 ` Brian R Cowan 2009-05-29 18:29 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 18:18 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach There is a third option, that the COMMIT calls are not coming from the same thread of execution that the write call is. The symptoms would seem to bear that out. As would the fact that the performance degradation occurs both when the server is Linux itself and when it is Solaris (any NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it would be unusual if they are both broken the same way. The linux nfs FAQ says: ----------------------- * NFS Version 3 introduces the concept of "safe asynchronous writes." A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write. Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply. 
Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure. ----------------------- Now, what happens in the client when the server comes back with the UNSTABLE reply? ================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> Date: 05/29/2009 02:07 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote: > > Yes. If the page is dirty, but not up to date, then it needs to be > > cleaned before you can overwrite the contents with the results of a > > fresh read. > > That means flushing the data to disk... 
Which again means doing either a > > stable write or an unstable write+commit. The former is more efficient > > that the latter, 'cos it accomplishes the exact same work in a single > > RPC call. > > I suspect that the COMMIT RPC's are done somewhere other than in the flush > itself. If the "write + commit" operation was happening in the that exact > matter, then the change in the git at the beginning of this thread *would > not have impacted client performance*. I can demonstrate -- at will -- > that it does impact performance. So, there is something that keeps track > of the number of writes and issues the commits without slowing down the > application. This git change bypasses that and degrades the linker > performance. If the server gives slower performance for a single stable write, vs. the same unstable write + commit, then you are demonstrating that the server is seriously _broken_. The only other explanation, is if the client prior to that patch being applied was somehow failing to send out the COMMIT. If so, then the client was broken, and the patch is a fix that results in correct behaviour. That would mean that the rest of the client flush code is probably still broken, but at least the nfs_wb_page() is now correct. Those are the only 2 options. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
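The FAQ text quoted in the message above compresses to one rule from RFC 1813: a server may strengthen the stability level in its reply, but never weaken it. A minimal sketch (the helper name is mine):

```c
#include <assert.h>

/* stable_how from RFC 1813, ordered by strength:
 * UNSTABLE < DATA_SYNC < FILE_SYNC. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* A compliant server may promote stability (an UNSTABLE request
 * answered FILE_SYNC) but never demote it (a FILE_SYNC request must
 * be answered FILE_SYNC). */
static int reply_is_compliant(enum stable_how req, enum stable_how rep)
{
    return rep >= req;
}
```

This is why, once the client starts sending FILE_SYNC requests, the server has no protocol-legal way to defer the disk sync.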
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 18:18 ` Brian R Cowan @ 2009-05-29 18:29 ` Trond Myklebust [not found] ` <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 18:29 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote: > There is a third option, that the COMMIT calls are not coming from the > same thread of execution that the write call is. The symptoms would seem > to bear that out. As would the fact that the performance degradation > occurs both when the server is Linux itself and when it is Solaris (any > NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it > would be unusual if they are both broken the same way. The linux nfs FAQ > says: > > ----------------------- > * NFS Version 3 introduces the concept of "safe asynchronous writes." A > Version 3 client can specify that the server is allowed to reply before it > has saved the requested data to disk, permitting the server to gather > small NFS write operations into a single efficient disk write operation. A > Version 3 client can also specify that the data must be written to disk > before the server replies, just like a Version 2 write. The client > specifies the type of write by setting the stable_how field in the > arguments of each write operation to UNSTABLE to request a safe > asynchronous write, and FILE_SYNC for an NFS Version 2 style write. > > Servers indicate whether the requested data is permanently stored by > setting a corresponding field in the response to each NFS write operation. > A server can respond to an UNSTABLE write request with an UNSTABLE reply > or a FILE_SYNC reply, depending on whether or not the requested data > resides on permanent storage yet. 
An NFS protocol-compliant server must > respond to a FILE_SYNC request only with a FILE_SYNC reply. > > Clients ensure that data that was written using a safe asynchronous write > has been written onto permanent storage using a new operation available in > Version 3 called a COMMIT. Servers do not send a response to a COMMIT > operation until all data specified in the request has been written to > permanent storage. NFS Version 3 clients must protect buffered data that > has been written using a safe asynchronous write but not yet committed. If > a server reboots before a client has sent an appropriate COMMIT, the > server can reply to the eventual COMMIT request in a way that forces the > client to resend the original write operation. Version 3 clients use > COMMIT operations when flushing safe asynchronous writes to the server > during a close(2) or fsync(2) system call, or when encountering memory > pressure. > ----------------------- > > Now, what happens in the client when the server cones back with the > UNSTABLE reply? The server cannot reply with an UNSTABLE reply to a stable write request. See above. As for your assertion that the COMMIT comes from some other thread of execution. I don't see how that can change anything. Some thread, somewhere has to wait for that COMMIT to complete. If it isn't your application, then the same burden falls on another application or the pdflush thread. While that may feel more interactive to you, it still means that you are making the server + some local process do more work (extra RPC round trip) for no good reason. 
Trond > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. > > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> > Date: > 05/29/2009 02:07 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > > > > On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote: > > > Yes. If the page is dirty, but not up to date, then it needs to be > > > cleaned before you can overwrite the contents with the results of a > > > fresh read. > > > That means flushing the data to disk... Which again means doing either > a > > > stable write or an unstable write+commit. The former is more efficient > > > that the latter, 'cos it accomplishes the exact same work in a single > > > RPC call. > > > > I suspect that the COMMIT RPC's are done somewhere other than in the > flush > > itself. If the "write + commit" operation was happening in the that > exact > > matter, then the change in the git at the beginning of this thread > *would > > not have impacted client performance*. I can demonstrate -- at will -- > > that it does impact performance. So, there is something that keeps track > > > of the number of writes and issues the commits without slowing down the > > application. This git change bypasses that and degrades the linker > > performance. 
> > If the server gives slower performance for a single stable write, vs. > the same unstable write + commit, then you are demonstrating that the > server is seriously _broken_. > > The only other explanation, is if the client prior to that patch being > applied was somehow failing to send out the COMMIT. If so, then the > client was broken, and the patch is a fix that results in correct > behaviour. That would mean that the rest of the client flush code is > probably still broken, but at least the nfs_wb_page() is now correct. > > Those are the only 2 options. > > Trond > > > ^ permalink raw reply [flat|nested] 94+ messages in thread
[parent not found: <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243621769.7155.97.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 20:09 ` Brian R Cowan 2009-05-29 20:21 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 20:09 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach I think you missed the context of my comment... Prior to this 4-year-old update, the writes were not sent with STABLE; this update forced that behavior. So, before then we sent an UNSTABLE write request. This would either give us back the UNSTABLE or FILE_SYNC response. My question is this: When the server sends back UNSTABLE, as a response to UNSTABLE, exactly what happens? By some chance is there a separate worker thread that occasionally sends COMMITs back to the server? The performance data we have would seem to bear that out. When we backed out the force of STABLE writes, the link times came back down and the reads stopped waiting on the cache flushes. If, as you say, this change had no impact on how the client actually performed these flushes, backing out the change would not have made links run 4x faster on Red Hat 5. All we did in our test was back out that change... I'm willing to discuss this issue in a conference call. I can send the bridge information to those who are interested, as well as the other people here at IBM I've been working with... At least one of them is a regular contributor -- Frank Filz... 
================================================================= Brian Cowan Advisory Software Engineer ClearCase Customer Advocacy Group (CAG) Rational Software IBM Software Group 81 Hartwell Ave Lexington, MA Phone: 1.781.372.3580 Web: http://www.ibm.com/software/rational/support/ Please be sure to update your PMR using ESR at http://www-306.ibm.com/software/support/probsub.html or cc all correspondence to sw_support@us.ibm.com to be sure your PMR is updated in case I am not available. From: Trond Myklebust <trond.myklebust@fys.uio.no> To: Brian R Cowan/Cupertino/IBM@IBMUS Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> Date: 05/29/2009 02:31 PM Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote: > There is a third option, that the COMMIT calls are not coming from the > same thread of execution that the write call is. The symptoms would seem > to bear that out. As would the fact that the performance degradation > occurs both when the server is Linux itself and when it is Solaris (any > NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it > would be unusual if they are both broken the same way. The linux nfs FAQ > says: > > ----------------------- > * NFS Version 3 introduces the concept of "safe asynchronous writes." A > Version 3 client can specify that the server is allowed to reply before it > has saved the requested data to disk, permitting the server to gather > small NFS write operations into a single efficient disk write operation. A > Version 3 client can also specify that the data must be written to disk > before the server replies, just like a Version 2 write. 
The client > specifies the type of write by setting the stable_how field in the > arguments of each write operation to UNSTABLE to request a safe > asynchronous write, and FILE_SYNC for an NFS Version 2 style write. > > Servers indicate whether the requested data is permanently stored by > setting a corresponding field in the response to each NFS write operation. > A server can respond to an UNSTABLE write request with an UNSTABLE reply > or a FILE_SYNC reply, depending on whether or not the requested data > resides on permanent storage yet. An NFS protocol-compliant server must > respond to a FILE_SYNC request only with a FILE_SYNC reply. > > Clients ensure that data that was written using a safe asynchronous write > has been written onto permanent storage using a new operation available in > Version 3 called a COMMIT. Servers do not send a response to a COMMIT > operation until all data specified in the request has been written to > permanent storage. NFS Version 3 clients must protect buffered data that > has been written using a safe asynchronous write but not yet committed. If > a server reboots before a client has sent an appropriate COMMIT, the > server can reply to the eventual COMMIT request in a way that forces the > client to resend the original write operation. Version 3 clients use > COMMIT operations when flushing safe asynchronous writes to the server > during a close(2) or fsync(2) system call, or when encountering memory > pressure. > ----------------------- > > Now, what happens in the client when the server cones back with the > UNSTABLE reply? The server cannot reply with an UNSTABLE reply to a stable write request. See above. As for your assertion that the COMMIT comes from some other thread of execution. I don't see how that can change anything. Some thread, somewhere has to wait for that COMMIT to complete. If it isn't your application, then the same burden falls on another application or the pdflush thread. 
While that may feel more interactive to you, it still means that you are making the server + some local process do more work (extra RPC round trip) for no good reason. Trond > ================================================================= > Brian Cowan > Advisory Software Engineer > ClearCase Customer Advocacy Group (CAG) > Rational Software > IBM Software Group > 81 Hartwell Ave > Lexington, MA > > Phone: 1.781.372.3580 > Web: http://www.ibm.com/software/rational/support/ > > > Please be sure to update your PMR using ESR at > http://www-306.ibm.com/software/support/probsub.html or cc all > correspondence to sw_support@us.ibm.com to be sure your PMR is updated in > case I am not available. > > > > From: > Trond Myklebust <trond.myklebust@fys.uio.no> > To: > Brian R Cowan/Cupertino/IBM@IBMUS > Cc: > Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org, > linux-nfs-owner@vger.kernel.org, Peter Staubach <staubach@redhat.com> > Date: > 05/29/2009 02:07 PM > Subject: > Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing > > > > On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote: > > > Yes. If the page is dirty, but not up to date, then it needs to be > > > cleaned before you can overwrite the contents with the results of a > > > fresh read. > > > That means flushing the data to disk... Which again means doing either > a > > > stable write or an unstable write+commit. The former is more efficient > > > that the latter, 'cos it accomplishes the exact same work in a single > > > RPC call. > > > > I suspect that the COMMIT RPC's are done somewhere other than in the > flush > > itself. If the "write + commit" operation was happening in the that > exact > > matter, then the change in the git at the beginning of this thread > *would > > not have impacted client performance*. I can demonstrate -- at will -- > > that it does impact performance. 
So, there is something that keeps track > > > of the number of writes and issues the commits without slowing down the > > application. This git change bypasses that and degrades the linker > > performance. > > If the server gives slower performance for a single stable write, vs. > the same unstable write + commit, then you are demonstrating that the > server is seriously _broken_. > > The only other explanation, is if the client prior to that patch being > applied was somehow failing to send out the COMMIT. If so, then the > client was broken, and the patch is a fix that results in correct > behaviour. That would mean that the rest of the client flush code is > probably still broken, but at least the nfs_wb_page() is now correct. > > Those are the only 2 options. > > Trond > > > ^ permalink raw reply [flat|nested] 94+ messages in thread
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing 2009-05-29 20:09 ` Brian R Cowan @ 2009-05-29 20:21 ` Trond Myklebust [not found] ` <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> [not found] ` <OFBB9B2C07.CC3D028B-ON852575C5. <1243634634.7155.160.camel@heimdal.trondhjem.org> 0 siblings, 2 replies; 94+ messages in thread From: Trond Myklebust @ 2009-05-29 20:21 UTC (permalink / raw) To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach On Fri, 2009-05-29 at 16:09 -0400, Brian R Cowan wrote: > I think you missed the context of my comment... Previous to this > 4-year-old update, the writes were not sent with STABLE, this update > forced that behavior. So, before then we sent an UNSTABLE write request. > This would either give us back the UNSTABLE or FILE_SYNC response. My > question is this: When the server sends back UNSTABLE, as a response to > UNSTABLE, exactly what happens? By some chance is there a separate worker > thread that occasionally sends COMMITs back to the server? pdflush will do it occasionally, but otherwise the COMMITs are all sent synchronously by the thread that is flushing out the data. In this case, the flush is done by the call to nfs_wb_page() in nfs_readpage(), and it waits synchronously for the unstable WRITE and the subsequent COMMIT to finish. Note that there is no way to bypass the wait: if some other thread jumps in and sends the COMMIT (after the unstable write has returned), then the caller of nfs_wb_page() still has to wait for that call to complete, and for nfs_commit_release() to mark the page as clean. Trond ^ permalink raw reply [flat|nested] 94+ messages in thread
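Trond's description of the read path in the message above can be sketched as a user-space model (heavily simplified; the real code is the nfs_wb_page() call in nfs_readpage(), in fs/nfs/): whichever stability level is used, the reader waits until the page is clean, so an unstable flush costs it an extra COMMIT round trip.

```c
#include <assert.h>

enum stable_how { UNSTABLE = 0, FILE_SYNC = 2 };

/* Round trips a reader must wait on to clean one dirty page before
 * the fresh READ can proceed, per the nfs_wb_page() path described
 * above. A model of the behavior, not kernel code. */
static int reader_wait_rpcs(int page_dirty, enum stable_how how)
{
    if (!page_dirty)
        return 0;               /* nothing to flush */
    if (how == FILE_SYNC)
        return 1;               /* one stable WRITE */
    return 2;                   /* UNSTABLE WRITE + synchronous COMMIT */
}
```

Even if another thread issues the COMMIT first, the reader still blocks until nfs_commit_release() marks the page clean, so the wait is the same.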
[parent not found: <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [not found] ` <1243628519.7155.150.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-05-29 21:55 ` Brian R Cowan 2009-05-29 22:03 ` Trond Myklebust 0 siblings, 1 reply; 94+ messages in thread From: Brian R Cowan @ 2009-05-29 21:55 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach So, it is possible that either pdflush or we are sending the commits, or that the commits are happening when the file closes, giving us one or tens of commits instead of hundreds or thousands. That's a big difference. The write RPCs still happen in RHEL 4; they just don't block the linker, or at least nowhere near as often. Since there is only one application/thread (the gcc linker) writing this file, the odds of another task getting stalled here are minimal at best. This optimization definitely helps server utilization for copies of large numbers of small files, and I personally don't care which is the default (though I have a coworker who is of the opinion that async means async, and if he wanted sync writes, he would either mount with nfsvers=2 or mount sync). But we need the option to turn it off for cases where it is thought to cause problems. You mention that one can set the async export option, but 1) it may not always be available; and 2) it essentially tells the server to "lie" about write status, something that can bite us seriously if the server crashes, hits a disk-full error, etc. And in any event, this is something that impacts only a particular class of clients, and making a change that affects *all* of them so that *some* work in the expected manner feels about as graceful as dynamite fishing... 
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Trond Myklebust @ 2009-05-29 22:03 UTC (permalink / raw)
To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 17:55 -0400, Brian R Cowan wrote:
> So, it is possible that either pdflush is sending the commits or us, or
> that the commits are happening when the file closes, giving us one/tens
> of commits instead of hundreds or thousands. That's a big difference. The
> write RPCs still happen in RHEL 4, they just don't block the linker, or
> at least nowhere near as often. Since there is only one application/thread
> (the gcc linker) writing this file, the odds of another task getting
> stalled here are minimal at best.

No, you're not listening! That COMMIT is _synchronous_ and happens before
you can proceed with the READ request. There is no economy of scale as you
seem to assume.

Trond
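Trond's "no economy of scale" objection amounts to counting the round trips
the reading thread must wait out before its READ. A minimal sketch, assuming
one network round trip per RPC (the constant and function are ours, purely
for illustration):

```python
# Toy latency model (assumptions, not measurements): when a read forces a
# flush, the reader waits for every RPC either way; the stable write merely
# folds WRITE + COMMIT into one round trip. Nothing is deferred or batched.

RTT = 1.0  # one network round trip, arbitrary units

def flush_latency(path):
    """Round trips the reading thread must wait out before its READ."""
    rpcs = {"stable": ["WRITE(FILE_SYNC)"],
            "unstable": ["WRITE(UNSTABLE)", "COMMIT"]}[path]
    return len(rpcs) * RTT

# Both paths block the reader; the stable write is simply one RPC cheaper.
assert flush_latency("stable") < flush_latency("unstable")
```

On this accounting the stable write is the cheaper of the two blocking
paths, which is consistent with the "more efficient in a single RPC call"
argument later in the thread, while still explaining why the linker stalls
in either case.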
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Brian R Cowan @ 2009-05-29 22:20 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

I am listening.

Commit is sync. I get that.

The NFS client does async writes in RHEL 4. They *eventually* get
committed. (It doesn't really matter who causes the commit, does it?)
Read system calls may trigger cache flushing, but since not all of the
writes are sync writes, the reads don't *always* stall when cache flushes
occur. Builds are fast.

We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server
to come back. READ system calls stall when the read triggers a flush of
one or more cache pages. Builds are slow. Links are at least 4x slower.

I am perfectly willing to send you network traces showing the issue. I can
even DEMONSTRATE it for you using the remote meeting software of your
choice. I can even demonstrate the impact of removing that behavior.
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Trond Myklebust @ 2009-05-29 22:36 UTC (permalink / raw)
To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 18:20 -0400, Brian R Cowan wrote:
> I am listening.
>
> Commit is sync. I get that.
>
> The NFS client does async writes in RHEL 4. They *eventually* get
> committed. (It doesn't really matter who causes the commit, does it?)
> Read system calls may trigger cache flushing, but since not all of them
> are sync writes, the reads don't *always* stall when cache flushes occur.
> Builds are fast.

All reads that trigger writes will trigger _sync_ writes and _sync_
commits. That's true of RHEL-5, RHEL-4, RHEL-3, and all the way back to
the very first 2.4 kernels. There is no deferred commit in that case,
because the cached dirty data needs to be overwritten by a fresh read,
which means that we may lose the data if the server reboots between the
unstable write and the ensuing read.

> We do sync writes in RHEL 5, so they MUST stop and wait for the NFS
> server to come back. READ system calls stall when the read triggers a
> flush of one or more cache pages. Builds are slow. Links are at least 4x
> slower.
>
> I am perfectly willing to send you network traces showing the issue. I
> can even DEMONSTRATE it for you using the remote meeting software of
> your choice. I can even demonstrate the impact of removing that behavior.

Can you demonstrate it using a recent kernel? If it's a problem that is
limited to RHEL-5, then it is up to Peter & co to pull in the fixes from
mainline, but if the slowdown is still present in 2.6.30, then I'm all
ears. However, I don't for a minute accept your explanation that this has
something to do with stable vs unstable+commit.

Trond
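The server-reboot rationale in Trond's reply can be modelled with a toy
server that holds uncommitted writes in volatile storage. This is an
illustration of the argument only, not NFS server code; the class and its
methods are invented for the sketch:

```python
# Toy model of why the client cannot defer the COMMIT before re-reading a
# dirty page: an UNSTABLE write sits in the server's volatile cache until
# a COMMIT moves it to stable storage. If the client discarded its dirty
# copy (to overwrite the page with a fresh read) before committing, a
# server reboot in between would lose the data for good.

class Server:
    def __init__(self):
        self.volatile, self.stable = {}, {}
    def write_unstable(self, page, data):
        self.volatile[page] = data        # acked, but not yet durable
    def commit(self):
        self.stable.update(self.volatile)  # now durable
        self.volatile.clear()
    def reboot(self):
        self.volatile.clear()              # only stable storage survives

# Without the synchronous COMMIT, a reboot loses the write:
srv = Server()
srv.write_unstable(0, b"linker output")
srv.reboot()
assert 0 not in srv.stable  # data gone; client's copy was the only one

# With the COMMIT waited on before the page is overwritten, it survives:
srv2 = Server()
srv2.write_unstable(0, b"linker output")
srv2.commit()
srv2.reboot()
assert srv2.stable[0] == b"linker output"
```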
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Brian R Cowan @ 2009-05-29 23:02 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

If you can explain how pulling that ONE change can cause the performance
issue to essentially disappear, I'd be more than happy to *try* to get a
2.6.30 test environment configured. Getting ClearCase to *install* on
kernel.org kernels is a non-trivial operation, requiring modifications to
install scripts, module makefiles, etc. Then there is the issue of
verifying that nothing else is impacted, all before I can even begin this
test. We're talking days here.

To be blunt, I'd need something I can take to a manager who will ask me
why I'm spending so much time on an issue when we "already have the cause."
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Trond Myklebust @ 2009-05-29 23:13 UTC (permalink / raw)
To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 19:02 -0400, Brian R Cowan wrote:
> If you can explain how pulling that ONE change can cause the performance
> issue to essentially disappear, I'd be more than happy to *try* to get a
> 2.6.30 test environment configured. Getting ClearCase to *install* on
> kernel.org kernels is a non-trivial operation, requiring modifications to
> install scripts, module makefiles, etc. Then there is the issue of
> verifying that nothing else is impacted, all before I can begin to do
> this test. We're talking days here.
>
> To be blunt, I'd need something I can take to a manager who will ask me
> why I'm spending so much time on an issue when we "already have the
> cause."

It's simple: you are the one asking for a change to the established kernel
behaviour, so you get to justify that change. Saying "it breaks ClearCase
on RHEL-5" is not a justification, and I won't ack the change.

Trond
* Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
From: Trond Myklebust @ 2009-05-29 17:57 UTC (permalink / raw)
To: Brian R Cowan; +Cc: Chuck Lever, linux-nfs, linux-nfs-owner, Peter Staubach

On Fri, 2009-05-29 at 13:42 -0400, Trond Myklebust wrote:
> On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> > > You may have a misunderstanding about what exactly "async" does. The
> > > "sync" / "async" mount options control only whether the application
> > > waits for the data to be flushed to permanent storage. They have no
> > > effect, on any file system I know of, on _how_ specifically the data
> > > is moved from the page cache to permanent storage.
> >
> > The problem is that the client change seems to cause the application
> > to stop until this stable write completes... What is interesting is
> > that it's not always a write operation that the linker gets stuck on.
> > Our best hypothesis -- from correlating times in strace and tcpdump
> > traces -- is that the FILE_SYNC'ed write NFS RPCs are in fact
> > triggered by *read()* system calls on the output file (which is opened
> > for read/write). We THINK the read call triggers a FILE_SYNC write if
> > the page is dirty... and that is why the read calls are taking so
> > long. Seeing writes happen when the app is waiting for a read is odd,
> > to say the least... (In my test, there is nothing else running on the
> > virtual machines, so the only thing that could be triggering the
> > filesystem activity is the build test...)
>
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read. That means flushing the data to disk... which again means
> doing either a stable write or an unstable write+commit. The former is
> more efficient than the latter, 'cos it accomplishes the exact same work
> in a single RPC call.

In fact, I suspect your real gripe is rather with the logic that marks a
page as being up to date (i.e. whether or not it requires a READ call). I
suggest trying kernel 2.6.27 or newer, and seeing if the changes that are
in those kernels fix your problem.

Trond
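Trond's closing pointer, that the real issue may be the logic deciding when
a page is marked up to date, can be sketched as follows. This is our reading
of the discussion, not the actual kernel logic; PAGE_SIZE and both helper
functions are hypothetical:

```python
# Sketch of the up-to-date bookkeeping under discussion: a write covering
# the whole page can mark it up to date with no READ, while a partial
# write of a page that is not up to date leaves it "dirty but not up to
# date". A later read of such a page must flush it first (stable write, or
# unstable write + commit) and then READ fresh contents from the server --
# the expensive path the linker keeps hitting.

PAGE_SIZE = 4096

def write_marks_uptodate(offset, length):
    """Whether a write leaves the page up to date without needing a READ."""
    return offset == 0 and length >= PAGE_SIZE

def ops_for_read(page_uptodate, page_dirty):
    """RPC work a read of this page forces on the reading thread."""
    ops = []
    if page_dirty and not page_uptodate:
        ops.append("FLUSH")   # dirty data must reach the server first
    if not page_uptodate:
        ops.append("READ")    # then fetch fresh page contents
    return ops                # up-to-date page: served from cache, no RPCs
```

In this model a full-page write never pays the flush-then-read penalty,
while every partial write followed by a read of the same page does, which
is why changing when pages are marked up to date (as in 2.6.27+) can matter
as much as the stable-vs-unstable question.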