* Broken nfsd in recent kernels
@ 2007-02-13  0:26 Norman Weathers
  2007-02-13  3:48 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Norman Weathers @ 2007-02-13  0:26 UTC (permalink / raw)
  To: nfs

Hello,

I have noticed, at least in our Fedora 6 test case, that with recent
kernels (2.6.18 and 2.6.19) there appears to be a "read hell" issue.
Has anyone else seen this?

For instance, using iozone, during a write pass (32 KB blocks) to a Sun
x4100 running Fedora Core 6 and the Fedora Core kernels, I get decent
throughput.  But as soon as the test goes from write to rewrite, I see
a large amount of read activity (via iostat) on the NFS server.  It
looks like 4 KB read blocks.
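
For reference, the write/rewrite pass can be driven with an iozone
invocation along these lines -- the mount point and file size here are
illustrative rather than the exact values from our runs:

  iozone -i 0 -r 32k -s 1g -f /mnt/nfs/iozone.tmp

where -i 0 selects the write/rewrite test, -r the record size, and -s
the file size.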

The host nodes involved have the following configuration:

uname -a
Linux hoepld15 2.6.18-1.2868.fc6 #1 SMP Fri Dec 15 17:29:48 EST 2006
x86_64 x86_64 x86_64 GNU/Linux

free:
             total       used       free     shared    buffers     cached
Mem:       8044904    7968452      76452          0      17592    7489888
-/+ buffers/cache:     460972    7583932
Swap:      4192956        156    4192800

cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4789.93
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4689.98
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4785.70
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4785.70
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp


I can run from any of our Fedora clients (3, 4, or 6) and completely
swamp the server with read requests when there shouldn't be any read
requests at all.

I find that if I try to open a file that isn't there with an
fopen(name,"w"), I am OK because I truncate the file.  If I instead do an
fopen(name,"r+"), then I get into trouble where it wants to read these
4 KB blocks.  It is not a trivial amount: on our system I am able to
pull off almost 2000 tps of 4 KB blocks, which kills our boxes.  I know
it is the NFS layer because if I run the disk-exercise programs, such as
iozone and another in-house program, locally on the NFS server, it is
fine, but the minute I run it remotely, and it tries to open a file that
already exists and has > 0 bytes, it goes nuts.  I haven't been able to
try a vanilla kernel yet because I am having trouble finding a free node
that I can test with.
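
To make the access pattern concrete, the troublesome case boils down to
something like the sketch below (the path and loop count are just
placeholders, not our actual test program): the file is opened with
"r+", so nothing gets truncated, and is then rewritten in 32 KB chunks.

/* rewrite.c - overwrite an existing file in 32 KB chunks via fopen("r+").
 * Illustrative only; the path and chunk count are placeholders. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *path = "/mnt/nfs/testfile";  /* must already exist */
        static char buf[32768];
        FILE *f;
        int i;

        memset(buf, 'x', sizeof(buf));

        f = fopen(path, "r+");       /* no truncation: this is the bad case */
        if (!f) {
                perror("fopen");
                return 1;
        }
        for (i = 0; i < 1024; i++) {    /* 1024 x 32 KB = 32 MB rewritten */
                if (fwrite(buf, 1, sizeof(buf), f) != sizeof(buf)) {
                        perror("fwrite");
                        fclose(f);
                        return 1;
                }
        }
        fclose(f);
        return 0;
}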

Also, I have ruled out 64-bit vs. 32-bit problems.  The NFS server I had
been using is a 64-bit box, but I just tested the same thing serving a
filesystem from my 32-bit laptop, and it also has the issue (it is FC6
as well).  I have also ruled out the filesystem: the 64-bit server was
using XFS, and my laptop is using ext3, and both systems have the same
issue.

If there is any other information I can get you, please let me know.

In the meantime, we are trying to set up some tests using the latest
(2.6.20) kernel.

Thanks for your time,

Norman Weathers





* Re: Broken nfsd in recent kernels
  2007-02-13  0:26 Broken nfsd in recent kernels Norman Weathers
@ 2007-02-13  3:48 ` Neil Brown
  2007-02-13  3:58   ` Nick Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2007-02-13  3:48 UTC (permalink / raw)
  To: Norman Weathers; +Cc: Nick Piggin, nfs

On Monday February 12, norman.r.weathers@conocophillips.com wrote:
> Hello,
> 
> I have noticed, at least in our Fedora 6 test case, that with recent
> kernels (2.6.18 and 2.6.19) there appears to be a "read hell" issue.
> Has anyone else seen this?
> 
> For instance, using iozone, during a write pass (32 KB blocks) to a Sun
> x4100 running Fedora Core 6 and the Fedora Core kernels, I get decent
> throughput.  But as soon as the test goes from write to rewrite, I see
> a large amount of read activity (via iostat) on the NFS server.  It
> looks like 4 KB read blocks.

Yes.......

When the NFS server writes a large block (e.g. 32K) to a file, it has
the data in a number of buffers as they came in off the network.  Due
to the alignment of data in an NFS request, they almost certainly will
not be page-aligned.

This 'iovec' is then written to the file.

Normally when writing to a file from user-space (normal write or
writev system call), the pages holding the data to be written could be
paged out, so they have to be brought into memory before the copy starts.

A change was made to generic_file_buffered_write (in mm/filemap.c)
probably around 2.6.18 so that when writing from an iovec, each entry
is sent to the file separately, because faulting in all the entries
at once is a bit awkward.

So the net result is that when NFSd writes to a file, the filesystem
sees a bunch of non-page-aligned writes rather than nicely aligned
writes (even when the NFS request holds a nicely aligned write).  This
causes it to pre-read all the pages.  Ugh.
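
If I'm reading generic_file_buffered_write right, a user-space writev
with a similarly fragmented iovec should show much the same pattern, so
something like the sketch below (the target path is just a placeholder)
ought to reproduce it against an existing file on an NFS mount:

/* writev-frag.c - submit one 32K write as ~1448-byte iovec entries,
 * none page-aligned, roughly what nfsd hands to vfs_writev.
 * Illustrative only; the path is a placeholder. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#define CHUNK 1448
#define TOTAL 32768

int main(void)
{
        static char data[TOTAL];
        struct iovec iov[TOTAL / CHUNK + 1];
        size_t off;
        int fd, n = 0;

        memset(data, 'x', sizeof(data));
        for (off = 0; off < TOTAL; off += CHUNK) {
                iov[n].iov_base = data + off;
                iov[n].iov_len = (TOTAL - off < CHUNK) ? TOTAL - off : CHUNK;
                n++;
        }

        fd = open("/mnt/nfs/existing-file", O_WRONLY);  /* no O_TRUNC */
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (writev(fd, iov, n) < 0) {
                perror("writev");
                return 1;
        }
        close(fd);
        return 0;
}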

Nick:  You have some pending patches in this area.  Might they
address this problem?

NeilBrown



* Re: Broken nfsd in recent kernels
  2007-02-13  3:48 ` Neil Brown
@ 2007-02-13  3:58   ` Nick Piggin
  2007-02-13  4:37     ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Piggin @ 2007-02-13  3:58 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs, Norman Weathers

Neil Brown wrote:
> On Monday February 12, norman.r.weathers@conocophillips.com wrote:
> 
>>Hello,
>>
>>I have noticed, at least in our Fedora 6 test case, that with recent
>>kernels (2.6.18 and 2.6.19) there appears to be a "read hell" issue.
>>Has anyone else seen this?
>>
>>For instance, using iozone, during a write pass (32 KB blocks) to a Sun
>>x4100 running Fedora Core 6 and the Fedora Core kernels, I get decent
>>throughput.  But as soon as the test goes from write to rewrite, I see
>>a large amount of read activity (via iostat) on the NFS server.  It
>>looks like 4 KB read blocks.
> 
> 
> Yes.......
> 
> When the NFS server writes a large block (e.g. 32K) to a file, it has
> the data in a number of buffers as they came in off the network.  Due
> to the alignment of data in an NFS request, they almost certainly will
> not be page-aligned.
> 
> This 'iovec' is then written to the file.
> 
> Normally when writing to a file from user-space (normal write or
> writev system call), the pages holding the data to be written could be
> paged out, so they have to be brought into memory before the copy starts.
> 
> A change was made to generic_file_buffered_write (in mm/filemap.c)
> probably around 2.6.18 so that when writing from an iovec, each entry
> is sent to the file separately, because faulting in all the entries
> at once is a bit awkward.
> 
> So the net result is that when NFSd writes to a file, the filesystem
> sees a bunch of non-page-aligned writes rather than nicely aligned
> writes (even when the NFS request holds a nicely aligned write).  This
> causes it to pre-read all the pages.  Ugh.
> 
> Nick:  You have some pending patches in this area.  Might they
> address this problem?

Hi Neil,

Yes, they do address the multiple-segment iovec problem, but it remains
to be seen when the patches will get in...

It is very awkward to fix the problem in the prepare_write/commit_write
path due to the nature of the API. Basically I'm reverting to performing
an extra data copy there, which reduces bandwidth quite a lot (although
it does reintroduce the multi-segment iovec copying, so it might be a
win in this case).

Then I'm looking at introducing a new aops API that filesystems can
implement to solve the problem in a well performing manner.

The problem is, this can't really happen until the important filesystems
implement the API.

It would be interesting to know whether Norman's test case actually is
using writev...

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.





* Re: Broken nfsd in recent kernels
  2007-02-13  3:58   ` Nick Piggin
@ 2007-02-13  4:37     ` Neil Brown
  2007-02-13  4:50       ` Nick Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2007-02-13  4:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: nfs, Norman Weathers

On Tuesday February 13, nickpiggin@yahoo.com.au wrote:
> Neil Brown wrote:
> > On Monday February 12, norman.r.weathers@conocophillips.com wrote:
> > 
> >>Hello,
> >>
> >>I have noticed, at least in our Fedora 6 test case, that with recent
> >>kernels (2.6.18 and 2.6.19) there appears to be a "read hell" issue.
> >>Has anyone else seen this?
> >>
> >>For instance, using iozone, during a write pass (32 KB blocks) to a Sun
> >>x4100 running Fedora Core 6 and the Fedora Core kernels, I get decent
> >>throughput.  But as soon as the test goes from write to rewrite, I see
> >>a large amount of read activity (via iostat) on the NFS server.  It
> >>looks like 4 KB read blocks.
> > 
> > 
> > Yes.......
> > 
> > When the NFS server writes a large block (e.g. 32K) to a file, it has
> > the data in a number of buffers as they came in off the network.  Due
> > to the alignment of data in an NFS request, they almost certainly will
> > not be page-aligned.
> > 
> > This 'iovec' is then written to the file.
> > 
> > Normally when writing to a file from user-space (normal write or
> > writev system call), the pages holding the data to be written could be
> > paged out, so they have to be brought into memory before the copy starts.
> > 
> > A change was made to generic_file_buffered_write (in mm/filemap.c)
> > probably around 2.6.18 so that when writing from an iovec, each entry
> > is sent to the file separately, because faulting in all the entries
> > at once is a bit awkward.
> > 
> > So the net result is that when NFSd writes to a file, the filesystem
> > sees a bunch of non-page-aligned writes rather than nicely aligned
> > writes (even when the NFS request holds a nicely aligned write).  This
> > causes it to pre-read all the pages.  Ugh.
> > 
> > Nick:  You have some pending patches in this area.  Might they
> > address this problem?
> 
> Hi Neil,
> 
> Yes, they do address the multiple-segment iovec problem, but it remains
> to be seen when the patches will get in...
> 
> It is very awkward to fix the problem in the prepare_write/commit_write
> path due to the nature of the API. Basically I'm reverting to performing
> an extra data copy there, which reduces bandwidth quite a lot (although
> it does reintroduce the multi-segment iovec copying, so it might be a
> win in this case).
> 
> Then I'm looking at introducing a new aops API that filesystems can
> implement to solve the problem in a well performing manner.
> 
> The problem is, this can't really happen until the important filesystems
> implement the API.
> 
> It would be interesting to know whether Norman's test case actually is
> using writev...

He is just using NFS.  NFSD does use writev.
A typical 32K write arrives as a bunch of IP packets most of which
hold 1448 bytes.  These are all presented to the filesystem in a
writev (vfs_writev actually).
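
To put numbers on it: 32768 / 1448 comes to roughly 23 iovec entries,
and 1448 is not a multiple of 4096, so almost every entry straddles a
page boundary.  With each entry copied separately, nearly every page
sees only a partial-page write, and the filesystem has to read the page
in first because it isn't told the rest is about to be overwritten.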

I presume the problem is that we cannot fault_in_pages_readable two
different buffers as the first might disappear while the second is
being paged in....
Would it be possible to count how much of the iovec is in
kernel-space, or maybe how much is *not* part of the file being
written to, and allow that much to be processed all at once?
Or is there something more subtle that I am missing?

The following patch is rather gross, but seems to work and should be
safe... what do you think?

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/mm/filemap.c ./mm/filemap.c
--- .prev/mm/filemap.c	2007-02-13 15:10:32.000000000 +1100
+++ ./mm/filemap.c	2007-02-13 15:19:20.000000000 +1100
@@ -2163,9 +2163,11 @@ generic_file_buffered_write(struct kiocb
 		/*
 		 * Limit the size of the copy to that of the current segment,
 		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
+		 * segments, but don't worry about such technicalities if nfsd
+		 * is writing, as prefault isn't needed then.
 		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (!segment_eq(get_fs(), KERNEL_DS))
+			bytes = min(bytes, cur_iov->iov_len - iov_base);
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2173,7 +2175,8 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		if (!segment_eq(get_fs(), KERNEL_DS))
+			fault_in_pages_readable(buf, bytes);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
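
(As I understand it, nfsd calls vfs_writev with set_fs(KERNEL_DS) in
effect, so when get_fs() is KERNEL_DS the iovec entries point at kernel
memory that cannot be faulted out -- which is why both the prefault and
the per-segment size clamp can be skipped safely here.)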



* Re: Broken nfsd in recent kernels
  2007-02-13  4:37     ` Neil Brown
@ 2007-02-13  4:50       ` Nick Piggin
  0 siblings, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2007-02-13  4:50 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs, Norman Weathers

Neil Brown wrote:
> On Tuesday February 13, nickpiggin@yahoo.com.au wrote:

>>It would be interesting to know whether Norman's test case actually is
>>using writev...
> 
> 
> He is just using NFS.  NFSD does use writev.

OK that makes a lot of sense.

> A typical 32K write arrives as a bunch of IP packets most of which
> hold 1448 bytes.  These are all presented to the filesystem in a
> writev (vfs_writev actually).
> 
> I presume the problem is that we cannot fault_in_pages_readable two
> different buffers as the first might disappear while the second is
> being paged in....

Yeah that, and also I don't think we ever actually did the
fault_in_pages_readable for subsequent segments past the first one,
so it could be quite trivial to trigger the problem deliberately.

> Would it be possible to count how much of the iovec is in
> kernel-space, or maybe how much is *not* part of the file being
> written to, and allow that much to be processed all at once?
> Or is there something more subtle that I am missing?
> 
> The following patch is rather gross, but seems to work and should be
> safe... what do you think?

I think it seems like a very good idea. There is no reason to worry
about faults if we're dealing with kernel-constructed buffers.

> 
> NeilBrown
> 
> Signed-off-by: Neil Brown <neilb@suse.de>
> 
> diff .prev/mm/filemap.c ./mm/filemap.c
> --- .prev/mm/filemap.c	2007-02-13 15:10:32.000000000 +1100
> +++ ./mm/filemap.c	2007-02-13 15:19:20.000000000 +1100
> @@ -2163,9 +2163,11 @@ generic_file_buffered_write(struct kiocb
>  		/*
>  		 * Limit the size of the copy to that of the current segment,
>  		 * because fault_in_pages_readable() doesn't know how to walk
> -		 * segments.
> +		 * segments, but don't worry about such technicalities if nfsd
> +		 * is writing, as prefault isn't needed then.
>  		 */
> -		bytes = min(bytes, cur_iov->iov_len - iov_base);
> +		if (!segment_eq(get_fs(), KERNEL_DS))
> +			bytes = min(bytes, cur_iov->iov_len - iov_base);
>  
>  		/*
>  		 * Bring in the user page that we will copy from _first_.
> @@ -2173,7 +2175,8 @@ generic_file_buffered_write(struct kiocb
>  		 * same page as we're writing to, without it being marked
>  		 * up-to-date.
>  		 */
> -		fault_in_pages_readable(buf, bytes);
> +		if (!segment_eq(get_fs(), KERNEL_DS))
> +			fault_in_pages_readable(buf, bytes);
>  
>  		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
>  		if (!page) {
> 


-- 
SUSE Labs, Novell Inc.





Thread overview: 5+ messages
2007-02-13  0:26 Broken nfsd in recent kernels Norman Weathers
2007-02-13  3:48 ` Neil Brown
2007-02-13  3:58   ` Nick Piggin
2007-02-13  4:37     ` Neil Brown
2007-02-13  4:50       ` Nick Piggin
