Linux-NFS Archive on lore.kernel.org
* rsize,wsize=1M causes severe lags in 10/100 Mbps, what sets those defaults?
@ 2019-09-19  7:29 Alkis Georgopoulos
  2019-09-19 15:08 ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-19  7:29 UTC (permalink / raw)
  To: linux-nfs

Hi, in any recent distribution that I tried, the default NFS wsize/rsize 
was 1 MB.

On 10/100 Mbps networks, this causes severe lags, timeouts, and dmesg 
fills with messages like:

 > [  316.404250] nfs: server 192.168.1.112 not responding, still trying
 > [  316.759512] nfs: server 192.168.1.112 OK

Forcing wsize/rsize to 32K makes all the problems disappear and NFS 
access more snappy, without sacrificing any speed, at least up to the 
gigabit networks I tested with.

I would like to request that the defaults be changed to 32K.
But I couldn't find where these defaults come from, or where to file the 
issue along with my test case / benchmarks to support it.

I initially reported this against the klibc nfsmount program that I was 
using, but nfsmount just uses the NFS defaults, which are what should be 
amended. The initial test case / benchmarks are there:
https://lists.zytor.com/archives/klibc/2019-September/004234.html

Please Cc me as I'm not in the list.

Thank you,
Alkis Georgopoulos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps, what sets those defaults?
  2019-09-19  7:29 rsize,wsize=1M causes severe lags in 10/100 Mbps, what sets those defaults? Alkis Georgopoulos
@ 2019-09-19 15:08 ` Trond Myklebust
  2019-09-19 15:58   ` rsize,wsize=1M causes severe lags in 10/100 Mbps Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 15:08 UTC (permalink / raw)
  To: Alkis Georgopoulos; +Cc: linux-nfs

On Thu, 19 Sep 2019 at 03:44, Alkis Georgopoulos <alkisg@gmail.com> wrote:
>
> Hi, in any recent distribution that I tried, the default NFS wsize/rsize
> was 1 MB.
>
> On 10/100 Mbps networks, this causes severe lags, timeouts, and dmesg
> fills with messages like:
>
>  > [  316.404250] nfs: server 192.168.1.112 not responding, still trying
>  > [  316.759512] nfs: server 192.168.1.112 OK
>
> Forcing wsize/rsize to 32K makes all the problems disappear and NFS
> access more snappy, without sacrificing any speed at least up to gigabit
> networks that I tested with.
>
> I would like to request that the defaults be changed to 32K.
> But I didn't find out where these defaults come from, where to file the
> issue and my test case / benchmarks to support it.
>
> I've initially reported it at the klibc nfsmount program that I was
> using, but this is just using the NFS defaults, which are the ones that
> should be amended. So initial test case / benchmarks there:
> https://lists.zytor.com/archives/klibc/2019-September/004234.html
>
> Please Cc me as I'm not in the list.
>

The default client behaviour is just to go with whatever recommended
value the server specifies. You can change that value yourself on the
knfsd server by editing the pseudo-file in
/proc/fs/nfsd/max_block_size.
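
A hedged sketch of doing that on the server (assuming a systemd distro
whose unit is named nfs-server; the pseudo-file is typically only
writable while no nfsd threads are running):

```shell
# Run as root on the NFS server. Assumption: the service is called
# nfs-server (nfs-kernel-server on Debian/Ubuntu).
systemctl stop nfs-server
cat /proc/fs/nfsd/max_block_size       # current recommended I/O size, bytes
echo 32768 > /proc/fs/nfsd/max_block_size
systemctl start nfs-server
```

Only mounts made after the change pick up the new value; existing mounts
keep their negotiated rsize/wsize.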

Cheers
   Trond


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 15:08 ` Trond Myklebust
@ 2019-09-19 15:58   ` Alkis Georgopoulos
  2019-09-19 16:11     ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-19 15:58 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust

On 9/19/19 6:08 PM, Trond Myklebust wrote:
> The default client behaviour is just to go with whatever recommended
> value the server specifies. You can change that value yourself on the
> knfsd server by editing the pseudo-file in
> /proc/fs/nfsd/max_block_size.


Thank you, and I guess I can automate this, by running
`systemctl edit nfs-kernel-server`, and adding:
[Service]
ExecStartPre=sh -c 'echo 32768 > /proc/fs/nfsd/max_block_size'

But isn't it a problem that the defaults cause errors in dmesg and 
severe lags in 10/100 Mbps, and even make 1000 Mbps a lot less snappy 
than with 32K?

In any case thank you again.
Alkis Georgopoulos


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 15:58   ` rsize,wsize=1M causes severe lags in 10/100 Mbps Alkis Georgopoulos
@ 2019-09-19 16:11     ` Trond Myklebust
  2019-09-19 19:21       ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 16:11 UTC (permalink / raw)
  To: alkisg, linux-nfs

On Thu, 2019-09-19 at 18:58 +0300, Alkis Georgopoulos wrote:
> On 9/19/19 6:08 PM, Trond Myklebust wrote:
> > The default client behaviour is just to go with whatever
> > recommended
> > value the server specifies. You can change that value yourself on
> > the
> > knfsd server by editing the pseudo-file in
> > /proc/fs/nfsd/max_block_size.
> 
> Thank you, and I guess I can automate this, by running
> `systemctl edit nfs-kernel-server`, and adding:
> [Service]
> ExecStartPre=sh -c 'echo 32768 > /proc/fs/nfsd/max_block_size'
> 
> But isn't it a problem that the defaults cause errors in dmesg and 
> severe lags in 10/100 Mbps, and even make 1000 Mbps a lot less
> snappy 
> than with 32K?
> 

No. It is not a problem, because nfs-utils defaults to using TCP
mounts. Fragmentation is only a problem with UDP, and we stopped
defaulting to that almost 2 decades ago.

However it may well be that klibc is still defaulting to using UDP, in
which case it should be fixed. There are major Linux distros out there
today that don't even compile in support for NFS over UDP any more.

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 16:11     ` Trond Myklebust
@ 2019-09-19 19:21       ` Alkis Georgopoulos
  2019-09-19 19:51         ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-19 19:21 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/19/19 7:11 PM, Trond Myklebust wrote:
> No. It is not a problem, because nfs-utils defaults to using TCP
> mounts. Fragmentation is only a problem with UDP, and we stopped
> defaulting to that almost 2 decades ago.
> 
> However it may well be that klibc is still defaulting to using UDP, in
> which case it should be fixed. There are major Linux distros out there
> today that don't even compile in support for NFS over UDP any more.


I haven't tested with UDP at all; the problem was with TCP.
I saw the problem in klibc nfsmount with TCP + NFS 3,
and in `mount -t nfs -o timeo=7 server:/share /mnt` with TCP + NFS 4.2.

Steps to reproduce:
1) Connect server <=> client at 10 or 100 Mbps.
Gigabit is also "less snappy" but it's less obvious there.
For reliable results, I made sure that server/client/network didn't have 
any other load at all.

2) Server:
echo '/srv *(ro,async,no_subtree_check)' >> /etc/exports
exportfs -ra
truncate -s 10G /srv/10G.file
The sparse file ensures that disk IO bandwidth isn't an issue.

3) Client:
mount -t nfs -o timeo=7 192.168.1.112:/srv /mnt
dd if=/mnt/10G.file of=/dev/null status=progress

4) Result:
dd starts at 11.2 MB/sec, which is fine/expected, then slowly drops to 
2 MB/sec after a while. It lags, skipping some seconds in its progress 
line, e.g. 507510784 bytes (508 MB, 484 MiB) copied, 186 s, 2,7 MB/s^C, 
at which point Ctrl+C needs 30+ seconds to stop dd because of I/O 
waiting etc.

In another terminal tab, `dmesg -w` is full of these:
[  316.404250] nfs: server 192.168.1.112 not responding, still trying
[  316.759512] nfs: server 192.168.1.112 OK

5) Remarks:
With timeo=600, there are no errors in dmesg.
The fact that timeo=7 (the nfsmount default) causes errors proves that 
some replies need more than 0.7 secs to arrive, which in turn explains 
why all applications open extremely slowly and feel sluggish with 
netroot over 100 Mbps NFS/TCP.
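
A back-of-the-envelope check supports this (a sketch assuming ideal wire 
speed and zero protocol overhead, so real times are strictly worse):

```shell
# Time to transfer one 1 MiB READ reply at each link speed,
# compared with the 0.7 s budget that timeo=7 allows.
for mbps in 10 100 1000; do
    # bits in a 1 MiB payload / link bits-per-second, in milliseconds
    ms=$(( 1048576 * 8 * 1000 / (mbps * 1000 * 1000) ))
    echo "${mbps} Mbps: ~${ms} ms per 1 MiB reply"
done
```

At 10 Mbps a single 1 MiB reply already needs ~838 ms, past the 700 ms 
timeout even on an idle link; at 100 Mbps (~83 ms each) a few queued 
concurrent reads can also push past 0.7 s.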

Lowering rsize,wsize from 1M to 32K solves all those issues without any 
negative side effects that I can see. Even on gigabit, 32K makes 
applications a lot more snappy so it's better even there.
On 10 Mbps, rsize=1M is completely unusable.

So I'm not sure when rsize=1M is a better default. Is it only for 10G+ 
connections?

Thank you very much,
Alkis Georgopoulos


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 19:21       ` Alkis Georgopoulos
@ 2019-09-19 19:51         ` Trond Myklebust
  2019-09-19 19:57           ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 19:51 UTC (permalink / raw)
  To: alkisg, linux-nfs

On Thu, 2019-09-19 at 22:21 +0300, Alkis Georgopoulos wrote:
> On 9/19/19 7:11 PM, Trond Myklebust wrote:
> > No. It is not a problem, because nfs-utils defaults to using TCP
> > mounts. Fragmentation is only a problem with UDP, and we stopped
> > defaulting to that almost 2 decades ago.
> > 
> > However it may well be that klibc is still defaulting to using UDP,
> > in
> > which case it should be fixed. There are major Linux distros out
> > there
> > today that don't even compile in support for NFS over UDP any more.
> 
> I haven't tested with UDP at all; the problem was with TCP.
> I saw the problem in klibc nfsmount with TCP + NFS 3,
> and in `mount -t nfs -o timeo=7 server:/share /mnt` with TCP + NFS
> 4.2.
> 
> Steps to reproduce:
> 1) Connect server <=> client at 10 or 100 Mbps.
> Gigabit is also "less snappy" but it's less obvious there.
> For reliable results, I made sure that server/client/network didn't
> have 
> any other load at all.
> 
> 2) Server:
> echo '/srv *(ro,async,no_subtree_check)' >> /etc/exports
> exportfs -ra
> truncate -s 10G /srv/10G.file
> The sparse file ensures that disk IO bandwidth isn't an issue.
> 
> 3) Client:
> mount -t nfs -o timeo=7 192.168.1.112:/srv /mnt
> dd if=/mnt/10G.file of=/dev/null status=progress
> 
> 4) Result:
> dd there starts with 11.2 MB/sec, which is fine/expected,
> and it slowly drops to 2 MB/sec after a while,
> it lags, omitting some seconds in its output line,
> e.g. 507510784 bytes (508 MB, 484 MiB) copied, 186 s, 2,7 MB/s^C,
> at which point "Ctrl+C" needs 30+ seconds to stop dd,
> because of IO waiting etc.
> 
> In another terminal tab, `dmesg -w` is full of these:
> [  316.404250] nfs: server 192.168.1.112 not responding, still trying
> [  316.759512] nfs: server 192.168.1.112 OK
> 
> 5) Remarks:
> With timeo=600, there are no errors in dmesg.
> The fact that timeo=7 (the nfsmount default) causes errors, proves
> that 
> some packets need more than 0.7 secs to arrive.
> Which in turn explains why all the applications open extremely
> slowly 
> and feel sluggish on netroot = 100 Mbps, NFS, TCP.
> 
> Lowering rsize,wsize from 1M to 32K solves all those issues without
> any 
> negative side effects that I can see. Even on gigabit, 32K makes 
> applications a lot more snappy so it's better even there.
> On 10 Mbps, rsize=1M is completely unusable.
> 
> So I'm not sure where rsize=1M is a better default. Is it only for
> 10G+ 
> connections?
> 

I don't understand why klibc would default to supplying a timeo=7
argument at all. It would be MUCH better if it just let the kernel set
the default, which in the case of TCP is timeo=600.

I agree with your argument that replaying requests every 0.7 seconds is
just going to cause congestion. TCP provides for reliable delivery of
RPC messages to the server, which is why the kernel default is a full
minute.

So please ask the klibc developers to change libmount to let the kernel
decide the default mount options. Their current setting is just plain
wrong.
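
A minimal sketch of relying on the kernel defaults and then verifying 
what was actually negotiated (server address and export are the 
hypothetical ones from this thread):

```shell
# No timeo/retrans/rsize/wsize given: the kernel picks them
# (timeo=600, i.e. 60 seconds, for TCP mounts).
mount -t nfs 192.168.1.112:/srv /mnt
# The effective options (timeo, rsize, wsize, proto) appear in
# /proc/mounts; `nfsstat -m` shows the same information.
grep ' /mnt ' /proc/mounts
```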

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 19:51         ` Trond Myklebust
@ 2019-09-19 19:57           ` Alkis Georgopoulos
  2019-09-19 20:05             ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-19 19:57 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/19/19 10:51 PM, Trond Myklebust wrote:
> I don't understand why klibc would default to supplying a timeo=7
> argument at all. It would be MUCH better if it just let the kernel set
> the default, which in the case of TCP is timeo=600.
> 
> I agree with your argument that replaying requests every 0.7 seconds is
> just going to cause congestion. TCP provides for reliable delivery of
> RPC messages to the server, which is why the kernel default is a full
> minute.
> 
> So please ask the klibc developers to change libmount to let the kernel
> decide the default mount options. Their current setting is just plain
> wrong.


This was what I asked in my first message to their mailing list,
https://lists.zytor.com/archives/klibc/2019-September/004234.html

Then I realized that timeo=600 just hides the real problem,
which is rsize=1M.

NFS defaults: timeo=600,rsize=1M => lag
nfsmount defaults: timeo=7,rsize=1M => lag AND dmesg errors

My proposal: timeo=whatever,rsize=32K => all fine

If more benchmarks are needed from me to document the
"NFS defaults: timeo=600,rsize=1M => lag"
I can surely provide them.

Thanks,
Alkis


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 19:57           ` Alkis Georgopoulos
@ 2019-09-19 20:05             ` Trond Myklebust
  2019-09-19 20:20               ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 20:05 UTC (permalink / raw)
  To: alkisg, linux-nfs

On Thu, 2019-09-19 at 22:57 +0300, Alkis Georgopoulos wrote:
> On 9/19/19 10:51 PM, Trond Myklebust wrote:
> > I don't understand why klibc would default to supplying a timeo=7
> > argument at all. It would be MUCH better if it just let the kernel
> > set
> > the default, which in the case of TCP is timeo=600.
> > 
> > I agree with your argument that replaying requests every 0.7
> > seconds is
> > just going to cause congestion. TCP provides for reliable delivery
> > of
> > RPC messages to the server, which is why the kernel default is a
> > full
> > minute.
> > 
> > So please ask the klibc developers to change libmount to let the
> > kernel
> > decide the default mount options. Their current setting is just
> > plain
> > wrong.
> 
> This was what I asked in my first message to their mailing list,
> https://lists.zytor.com/archives/klibc/2019-September/004234.html
> 
> Then I realized that timeo=600 just hides the real problem,
> which is rsize=1M.
> 
> NFS defaults: timeo=600,rsize=1M => lag
> nfsmount defaults: timeo=7,rsize=1M => lag AND dmesg errors
> 
> My proposal: timeo=whatever,rsize=32K => all fine
> 
> If more benchmarks are needed from me to document the
> "NFS defaults: timeo=600,rsize=1M => lag"
> I can surely provide them.

There are plenty of operations that can take longer than 700 ms to
complete. Synchronous writes to disk are one, but COMMIT (i.e. the NFS
equivalent of fsync()) can often take much longer even though it has no
payload.

So the problem is not the size of the WRITE payload. The real problem
is the timeout.

The bottom line is that if you want to keep timeo=7 as a mount option
for TCP, then you are on your own.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 20:05             ` Trond Myklebust
@ 2019-09-19 20:20               ` Alkis Georgopoulos
  2019-09-19 20:40                 ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-19 20:20 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/19/19 11:05 PM, Trond Myklebust wrote:
> There are plenty of operations that can take longer than 700 ms to
> complete. Synchronous writes to disk are one, but COMMIT (i.e. the NFS
> equivalent of fsync()) can often take much longer even though it has no
> payload.
> 
> So the problem is not the size of the WRITE payload. The real problem
> is the timeout.
> 
> The bottom line is that if you want to keep timeo=7 as a mount option
> for TCP, then you are on your own.
> 

The problem isn't timeo at all.
If I understand it correctly, when I try to launch firefox over nfsroot, 
NFS waits until it fills 1M before "replying" to the application.
Thus applications launch a lot slower, as they get "disk feedback" in 
larger chunks and feel less "snappy".

In numbers:
timeo=600,rsize=1M => firefox opens in 30 secs
timeo=600,rsize=32k => firefox opens in 20 secs

Anyway, thank you very much for your time and feedback.

Kind regards,
Alkis Georgopoulos


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 20:20               ` Alkis Georgopoulos
@ 2019-09-19 20:40                 ` Trond Myklebust
  2019-09-19 21:19                   ` Daniel Forrest
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 20:40 UTC (permalink / raw)
  To: alkisg, linux-nfs

On Thu, 2019-09-19 at 23:20 +0300, Alkis Georgopoulos wrote:
> On 9/19/19 11:05 PM, Trond Myklebust wrote:
> > There are plenty of operations that can take longer than 700 ms to
> > complete. Synchronous writes to disk are one, but COMMIT (i.e. the
> > NFS
> > equivalent of fsync()) can often take much longer even though it
> > has no
> > payload.
> > 
> > So the problem is not the size of the WRITE payload. The real
> > problem
> > is the timeout.
> > 
> > The bottom line is that if you want to keep timeo=7 as a mount
> > option
> > for TCP, then you are on your own.
> > 
> 
> The problem isn't timeo at all.
> If I understand it correctly, when I try to launch firefox over
> nfsroot, 
> NFS will wait until it fills 1M before "replying" to the application.
> Thus the applications will launch a lot slower, as they get "disk 
> feedback" in larger chunks and not "snappy".
> 
> In numbers:
> timeo=600,rsize=1M => firefox opens in 30 secs
> timeo=600,rsize=32k => firefox opens in 20 secs
> 

That's a different problem, and is most likely due to readahead causing
your client to read more data than it needs to. It is also true that
the maximum readahead size is proportional to the rsize and that maybe
it shouldn't be.
However the VM layer is supposed to ensure that the kernel doesn't try
to read ahead more than necessary. It is bounded by the maximum we set
in the NFS layer, but it isn't supposed to hit that maximum unless the
readahead heuristics show that the application may need it.
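
For anyone wanting to test the readahead theory without changing rsize, 
the client-side readahead window can be inspected and capped through the 
mount's backing-device info (a sketch; assumes /mnt is the NFS mount and 
requires root):

```shell
# mountpoint -d prints the mount's device number, e.g. "0:52",
# which names its entry under /sys/class/bdi/.
bdi=$(mountpoint -d /mnt)
cat /sys/class/bdi/"$bdi"/read_ahead_kb        # current max readahead, KiB
echo 128 > /sys/class/bdi/"$bdi"/read_ahead_kb # cap it, leaving rsize at 1M
```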

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 20:40                 ` Trond Myklebust
@ 2019-09-19 21:19                   ` Daniel Forrest
  2019-09-19 21:42                     ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Forrest @ 2019-09-19 21:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: alkisg, linux-nfs

On Thu, Sep 19, 2019 at 08:40:41PM +0000, Trond Myklebust wrote:
> On Thu, 2019-09-19 at 23:20 +0300, Alkis Georgopoulos wrote:
> > On 9/19/19 11:05 PM, Trond Myklebust wrote:
> > > There are plenty of operations that can take longer than 700 ms to
> > > complete. Synchronous writes to disk are one, but COMMIT (i.e. the
> > > NFS
> > > equivalent of fsync()) can often take much longer even though it
> > > has no
> > > payload.
> > > 
> > > So the problem is not the size of the WRITE payload. The real
> > > problem
> > > is the timeout.
> > > 
> > > The bottom line is that if you want to keep timeo=7 as a mount
> > > option
> > > for TCP, then you are on your own.
> > > 
> > 
> > The problem isn't timeo at all.
> > If I understand it correctly, when I try to launch firefox over
> > nfsroot, 
> > NFS will wait until it fills 1M before "replying" to the application.
> > Thus the applications will launch a lot slower, as they get "disk 
> > feedback" in larger chunks and not "snappy".
> > 
> > In numbers:
> > timeo=600,rsize=1M => firefox opens in 30 secs
> > timeo=600,rsize=32k => firefox opens in 20 secs
> > 
> 
> That's a different problem, and is most likely due to readahead causing
> your client to read more data than it needs to. It is also true that
> the maximum readahead size is proportional to the rsize and that maybe
> it shouldn't be.
> However the VM layer is supposed to ensure that the kernel doesn't try
> to read ahead more than necessary. It is bounded by the maximum we set
> in the NFS layer, but it isn't supposed to hit that maximum unless the
> readahead heuristics show that the application may need it.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

What may be happening here is something I have noticed with glibc.

- statfs reports the rsize/wsize as the block size of the filesystem.

- glibc uses the block size as the default buffer size for fread/fwrite.

If an application is using fread/fwrite on an NFS mounted file with an
rsize/wsize of 1M it will try to fill a 1MB buffer.

I have often changed mounts to use rsize/wsize=64K to alleviate this.
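
The f_bsize that glibc consults is easy to inspect, and the stdio buffer 
size can be overridden per-process without remounting (a sketch; GNU 
coreutils assumed, and /mnt/nfs/file is a hypothetical path):

```shell
# %s = "optimal transfer block size" (statfs f_bsize); per the
# observation above, on an rsize=1M mount this is where 1 MiB comes from.
stat -f -c 'f_bsize: %s' .
# stdbuf(1) overrides glibc's default stdio buffer sizes for one process:
stdbuf -i 32K -o 32K cat /mnt/nfs/file > /dev/null
```

(Applications can do the same themselves with setvbuf(3).)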

-- 
Dan Forrest
Space Science and Engineering Center, University of Wisconsin, Madison
dforrest@wisc.edu


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 21:19                   ` Daniel Forrest
@ 2019-09-19 21:42                     ` Trond Myklebust
  2019-09-19 22:16                       ` Daniel Forrest
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2019-09-19 21:42 UTC (permalink / raw)
  To: Daniel Forrest; +Cc: alkisg, linux-nfs

On Thu, 2019-09-19 at 16:19 -0500, Daniel Forrest wrote:
> On Thu, Sep 19, 2019 at 08:40:41PM +0000, Trond Myklebust wrote:
> > On Thu, 2019-09-19 at 23:20 +0300, Alkis Georgopoulos wrote:
> > > On 9/19/19 11:05 PM, Trond Myklebust wrote:
> > > > There are plenty of operations that can take longer than 700 ms
> > > > to
> > > > complete. Synchronous writes to disk are one, but COMMIT (i.e.
> > > > the
> > > > NFS
> > > > equivalent of fsync()) can often take much longer even though
> > > > it
> > > > has no
> > > > payload.
> > > > 
> > > > So the problem is not the size of the WRITE payload. The real
> > > > problem
> > > > is the timeout.
> > > > 
> > > > The bottom line is that if you want to keep timeo=7 as a mount
> > > > option
> > > > for TCP, then you are on your own.
> > > > 
> > > 
> > > The problem isn't timeo at all.
> > > If I understand it correctly, when I try to launch firefox over
> > > nfsroot, 
> > > NFS will wait until it fills 1M before "replying" to the
> > > application.
> > > Thus the applications will launch a lot slower, as they get
> > > "disk 
> > > feedback" in larger chunks and not "snappy".
> > > 
> > > In numbers:
> > > timeo=600,rsize=1M => firefox opens in 30 secs
> > > timeo=600,rsize=32k => firefox opens in 20 secs
> > > 
> > 
> > That's a different problem, and is most likely due to readahead
> > causing
> > your client to read more data than it needs to. It is also true
> > that
> > the maximum readahead size is proportional to the rsize and that
> > maybe
> > it shouldn't be.
> > However the VM layer is supposed to ensure that the kernel doesn't
> > try
> > to read ahead more than necessary. It is bounded by the maximum we
> > set
> > in the NFS layer, but it isn't supposed to hit that maximum unless
> > the
> > readahead heuristics show that the application may need it.
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
> 
> What may be happening here is something I have noticed with glibc.
> 
> - statfs reports the rsize/wsize as the block size of the filesystem.
> 
> - glibc uses the block size as the default buffer size for
> fread/fwrite.
> 
> If an application is using fread/fwrite on an NFS mounted file with
> an
> rsize/wsize of 1M it will try to fill a 1MB buffer.
> 
> I have often changed mounts to use rsize/wsize=64K to alleviate this.
> 

That sounds like an abuse of the filesystem block size. There is
nothing in the POSIX definition of either fread() or fwrite() that
requires glibc to do this: 
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fread.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fwrite.html


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com





* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 21:42                     ` Trond Myklebust
@ 2019-09-19 22:16                       ` Daniel Forrest
  2019-09-20  9:25                         ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Forrest @ 2019-09-19 22:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: alkisg, linux-nfs

On Thu, Sep 19, 2019 at 05:42:26PM -0400, Trond Myklebust wrote:
> On Thu, 2019-09-19 at 16:19 -0500, Daniel Forrest wrote:
> > On Thu, Sep 19, 2019 at 08:40:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2019-09-19 at 23:20 +0300, Alkis Georgopoulos wrote:
> > > > On 9/19/19 11:05 PM, Trond Myklebust wrote:
> > > > > There are plenty of operations that can take longer than 700 ms
> > > > > to
> > > > > complete. Synchronous writes to disk are one, but COMMIT (i.e.
> > > > > the
> > > > > NFS
> > > > > equivalent of fsync()) can often take much longer even though
> > > > > it
> > > > > has no
> > > > > payload.
> > > > > 
> > > > > So the problem is not the size of the WRITE payload. The real
> > > > > problem
> > > > > is the timeout.
> > > > > 
> > > > > The bottom line is that if you want to keep timeo=7 as a mount
> > > > > option
> > > > > for TCP, then you are on your own.
> > > > > 
> > > > 
> > > > The problem isn't timeo at all.
> > > > If I understand it correctly, when I try to launch firefox over
> > > > nfsroot, 
> > > > NFS will wait until it fills 1M before "replying" to the
> > > > application.
> > > > Thus the applications will launch a lot slower, as they get
> > > > "disk 
> > > > feedback" in larger chunks and not "snappy".
> > > > 
> > > > In numbers:
> > > > timeo=600,rsize=1M => firefox opens in 30 secs
> > > > timeo=600,rsize=32k => firefox opens in 20 secs
> > > > 
> > > 
> > > That's a different problem, and is most likely due to readahead
> > > causing
> > > your client to read more data than it needs to. It is also true
> > > that
> > > the maximum readahead size is proportional to the rsize and that
> > > maybe
> > > it shouldn't be.
> > > However the VM layer is supposed to ensure that the kernel doesn't
> > > try
> > > to read ahead more than necessary. It is bounded by the maximum we
> > > set
> > > in the NFS layer, but it isn't supposed to hit that maximum unless
> > > the
> > > readahead heuristics show that the application may need it.
> > > 
> > > -- 
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@hammerspace.com
> > 
> > What may be happening here is something I have noticed with glibc.
> > 
> > - statfs reports the rsize/wsize as the block size of the filesystem.
> > 
> > - glibc uses the block size as the default buffer size for
> > fread/fwrite.
> > 
> > If an application is using fread/fwrite on an NFS mounted file with
> > an rsize/wsize of 1M it will try to fill a 1MB buffer.
> > 
> > I have often changed mounts to use rsize/wsize=64K to alleviate this.
> > 
> 
> That sounds like an abuse of the filesystem block size. There is
> nothing in the POSIX definition of either fread() or fwrite() that
> requires glibc to do this: 
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fread.html
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fwrite.html
> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

It looks like this was fixed in glibc 2.25:

https://sourceware.org/bugzilla/show_bug.cgi?id=4099

But this version is not on the CentOS 6/7 systems I use.

-- 
Dan Forrest
Space Science and Engineering Center University of Wisconsin, Madison
dforrest@wisc.edu


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-19 22:16                       ` Daniel Forrest
@ 2019-09-20  9:25                         ` Alkis Georgopoulos
  2019-09-20  9:48                           ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-20  9:25 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/20/19 1:16 AM, Daniel Forrest wrote:
>>> What may be happening here is something I have noticed with glibc.
>>>
>>> - statfs reports the rsize/wsize as the block size of the filesystem.
>>>
>>> - glibc uses the block size as the default buffer size for
>>> fread/fwrite.
>>>
>>> If an application is using fread/fwrite on an NFS mounted file with
>>> an rsize/wsize of 1M it will try to fill a 1MB buffer.
>>>
>>> I have often changed mounts to use rsize/wsize=64K to alleviate this.
>>
>> That sounds like an abuse of the filesystem block size. There is
>> nothing in the POSIX definition of either fread() or fwrite() that
>> requires glibc to do this:
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fread.html
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fwrite.html
>>
> 
> It looks like this was fixed in glibc 2.25:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=4099


This is likely not the exact issue I'm experiencing, as I'm testing e.g.
with glibc 2.27-3ubuntu1 on Ubuntu 18.04 and kernel 5.0.

New benchmark, measuring the boot time of a netbooted client,
from right after the kernel is loaded to the display manager screen:

1) On 10 Mbps:
a) tcp,timeo=600,rsize=32K: 304 secs
b) tcp,timeo=600,rsize=1M: 618 secs

2) On 100 Mbps:
a) tcp,timeo=600,rsize=32K: 40 secs
b) tcp,timeo=600,rsize=1M: 84 secs

3) On 1000 Mbps:
a) tcp,timeo=600,rsize=32K: 20 secs
b) tcp,timeo=600,rsize=1M: 24 secs

32K is always faster, even on full gigabit.
Disk access on gigabit was *significantly* faster, enough to cut 4 
seconds off the boot time. In the 10/100 cases, rsize=1M is pretty much 
unusable.
There are no NFS writes involved; writes go to a local tmpfs/overlayfs.
Would it make sense for me to measure the *boot bandwidth* in each case, 
to see if more things (readahead) are downloaded with rsize=1M?
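
One way to measure that without extra tooling is the per-mount counters 
in /proc/self/mountstats, which include the bytes actually read from the 
server (a sketch; field positions assumed from the nfs mountstats 
format):

```shell
# The "bytes:" line lists: normal read/write, direct read/write,
# server read/write, read/write pages. Field 6 (after "bytes:") is
# bytes read from the server, which includes readahead.
awk '/bytes:/ {print "bytes read from server:", $6}' /proc/self/mountstats
```

Sampling this once right after boot, for 32K vs 1M, would show directly 
how much extra data readahead pulls in.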

I can do whatever benchmarks and test whatever parameters you tell me 
to, but I do not know the NFS/kernel internals to be able to explain why 
this happens.

The reason I investigated this is that I developed the new version of 
ltsp.org (GPL netbooting software), where we switched from 
squashfs-over-NBD to squashfs-over-NFS, and netbooting was extremely 
slow until I lowered rsize to 32K. So I thought I'd share my findings in 
case it makes a better default for everyone (or reveals a problem 
elsewhere).
With rsize=32K, squashfs-over-NFS is as speedy as squashfs-over-NBD, but 
a lot more stable.

Of course the same rsize findings apply for NFS /home too (without 
nfsmount), or for just transferring large or small files, not just for /.

Btw, 
https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt 
says the kernel nfsroot defaults are timeo=7,rsize=4096,wsize=4096. That 
is about the kernel's built-in netbooting support, not klibc nfsmount; 
I haven't tested it, as it would involve compiling a kernel with my NIC 
driver built in.

Thank you,
Alkis Georgopoulos
LTSP developer


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-20  9:25                         ` Alkis Georgopoulos
@ 2019-09-20  9:48                           ` Alkis Georgopoulos
  2019-09-20 10:04                             ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-20  9:48 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/20/19 12:25 PM, Alkis Georgopoulos wrote:
> This is likely not the exact issue I'm experiencing, as I'm testing e.g.
> with glibc 2.27-3ubuntu1 on Ubuntu 18.04 and kernel 5.0.
> 
> New benchmark, measuring the boot time of a netbooted client,
> from right after the kernel is loaded to the display manager screen:
> 
> 1) On 10 Mbps:
> a) tcp,timeo=600,rsize=32K: 304 secs
> b) tcp,timeo=600,rsize=1M: 618 secs
> 
> 2) On 100 Mbps:
> a) tcp,timeo=600,rsize=32K: 40 secs
> b) tcp,timeo=600,rsize=1M: 84 secs
> 
> 3) On 1000 Mbps:
> a) tcp,timeo=600,rsize=32K: 20 secs
> b) tcp,timeo=600,rsize=1M: 24 secs
> 
> 32K is always faster, even on full gigabit.
> Disk access on gigabit was *significantly* faster, enough to result in 
> a 4-second lower boot time. In the 10/100 cases, rsize=1M is pretty 
> much unusable.
> There are no writes involved; those go to a local tmpfs/overlayfs.
> Would it make sense for me to measure the *boot bandwidth* in each case, 
> to see if more things (readahead) are downloaded with rsize=1M?


I did test the boot bandwidth.
On ext4-over-NFS, with tmpfs-and-overlayfs to make root writable:

2) On 100 Mbps:
a) tcp,timeo=600,rsize=32K: 471MB
b) tcp,timeo=600,rsize=1M: 1250MB

So it is indeed slower because it's transferring more data than the 
client needs.
Maybe this is a different, or a new, aspect of the readahead issue you 
mentioned above.
Is it possible that NFS always sends 1 MB chunks even when the actual 
data inside them is smaller?
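
One way to check, assuming I'm reading the per-op line format of 
/proc/self/mountstats correctly (op, ops, transmissions, timeouts, bytes 
sent, bytes received, ...), would be to compare the READ op count with 
the bytes actually received over the wire:

```shell
#!/bin/sh
# Pull the READ op count and bytes received from a mountstats-format
# file; fields $2 and $6 per the per-op statistics layout noted above.
read_stats() {
    awk '$1 == "READ:" { printf "ops=%s bytes_recv=%s\n", $2, $6 }' "$1"
}
# Usage sketch:  read_stats /proc/self/mountstats
```

Dividing bytes_recv by ops would then show the average over-the-wire 
read size.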

If you want me to test more things, I can;
if you consider this a glibc (etc.) problem that shouldn't involve this 
mailing list, I can report it there instead...

Thank you,
Alkis Georgopoulos


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-20  9:48                           ` Alkis Georgopoulos
@ 2019-09-20 10:04                             ` Alkis Georgopoulos
  2019-09-21  7:52                               ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-20 10:04 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

On 9/20/19 12:48 PM, Alkis Georgopoulos wrote:
> I did test the boot bandwidth (I mean how many MB were transferred).
> On ext4-over-NFS, with tmpfs-and-overlayfs to make root writable:


I also tested with the kernel netbooting default of rsize=4K to compare.
All on 100 Mbps, tcp,timeo=600:

| rsize | MB to boot | sec to boot |
|-------|------------|-------------|
|   1M  |    1250    |     84      |
|  32K  |     471    |     40      |
|   4K  |     320    |     31      |
|   2K  |     355    |     34      |

It appears that matching rsize to the cluster size (4K) gives the best 
results.

Thank you,
Alkis Georgopoulos


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-20 10:04                             ` Alkis Georgopoulos
@ 2019-09-21  7:52                               ` Alkis Georgopoulos
  2019-09-21  7:59                                 ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-21  7:52 UTC (permalink / raw)
  To: linux-nfs

I think it's caused by the kernel readahead, not glibc readahead.
TL;DR: This solves the problem:
echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb

Question: how to configure NFS/kernel to automatically set that?

Long version:
Doing step (4) below results in tremendous speedup:

1) mount -t nfs -o tcp,timeo=600,rsize=1048576,wsize=1048576 
10.161.254.11:/srv/ltsp /mnt

2) cat /proc/fs/nfsfs/volumes
We see the DEV number from there, e.g. 0:58

3) cat /sys/devices/virtual/bdi/0:58/read_ahead_kb
15360
15360
I assume this means the kernel will try to read ahead up to 15 MB for 
each accessed file. *THIS IS THE PROBLEM*. For non-NFS devices, this 
value is 128 KB.

4) echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb

5) Test. Traffic now should be a *lot* less, and speed a *lot* more.
E.g. my NFS booting tests:
  - read_ahead_kb=15360 (the default) => 1160 MB traffic to boot
  - read_ahead_kb=128 => 324MB traffic
  - read_ahead_kb=4 => 223MB traffic

So the question that remains is how to properly configure either NFS or 
the kernel to use small readahead values for NFS.

I'm currently doing it with this workaround:

for f in $(awk '/^v[0-9]/ { print $4 }' < /proc/fs/nfsfs/volumes); do
    echo 4 > /sys/devices/virtual/bdi/$f/read_ahead_kb
done
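
Until there's a proper knob, one way to automate this might be a udev 
rule, since the bdi devices generate add events. A sketch (untested 
assumption: NFS bdis appear with kernel names like "0:58", i.e. 
anonymous major 0, so the KERNEL match keeps local disks at their 
default):

```
# /etc/udev/rules.d/99-nfs-readahead.rules (sketch)
SUBSYSTEM=="bdi", ACTION=="add", KERNEL=="0:*", ATTR{read_ahead_kb}="4"
```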

Thanks,
Alkis


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-21  7:52                               ` Alkis Georgopoulos
@ 2019-09-21  7:59                                 ` Alkis Georgopoulos
  2019-09-21 11:02                                   ` Alkis Georgopoulos
  0 siblings, 1 reply; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-21  7:59 UTC (permalink / raw)
  To: linux-nfs

On 9/21/19 10:52 AM, Alkis Georgopoulos wrote:
> I think it's caused by the kernel readahead, not glibc readahead.
> TL;DR: This solves the problem:
> echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb
> 
> Question: how to configure NFS/kernel to automatically set that?
> 
> Long version:
> Doing step (4) below results in tremendous speedup:
> 
> 1) mount -t nfs -o tcp,timeo=600,rsize=1048576,wsize=1048576 
> 10.161.254.11:/srv/ltsp /mnt
> 
> 2) cat /proc/fs/nfsfs/volumes
> We see the DEV number from there, e.g. 0:58
> 
> 3) cat /sys/devices/virtual/bdi/0:58/read_ahead_kb
> 15360
> I assume that this means the kernel will try to read ahead up to 15 MB 
> for each accessed file. *THIS IS THE PROBLEM*. For non-NFS devices, this 
> value is 128 (KB).
> 
> 4) echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb
> 
> 5) Test. Traffic now should be a *lot* less, and speed a *lot* more.
> E.g. my NFS booting tests:
>   - read_ahead_kb=15360 (the default) => 1160 MB traffic to boot
>   - read_ahead_kb=128 => 324MB traffic
>   - read_ahead_kb=4 => 223MB traffic
> 
> So the question that remains, is how to properly configure either NFS or 
> the kernel, to use small readahead values for NFS.
> 
> I'm currently doing it with this workaround:
> for f in $(awk '/^v[0-9]/ { print $4 }' < /proc/fs/nfsfs/volumes); do 
> echo 4 > /sys/devices/virtual/bdi/$f/read_ahead_kb; done
> 
> Thanks,
> Alkis



Quoting https://lkml.org/lkml/2010/2/26/48
 > nfs: use 2*rsize readahead size
 > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
 > readahead size 512k*15=7680k is too large than necessary for typical
 > clients.

I.e. the problem probably is that when NFS_MAX_READAHEAD=15 was 
implemented, rsize was 512k; now that rsize=1M, this results in 
readaheads of 15M, which cause all the traffic and lags.
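
The arithmetic checks out against the value observed in sysfs:

```shell
#!/bin/sh
# 15 * rsize, expressed in KB, reproduces the 15360 read_ahead_kb value
# seen in /sys/devices/virtual/bdi/<dev>/read_ahead_kb.
rsize=1048576                   # 1M, the current default
echo $(( rsize * 15 / 1024 ))   # prints 15360
```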


* Re: rsize,wsize=1M causes severe lags in 10/100 Mbps
  2019-09-21  7:59                                 ` Alkis Georgopoulos
@ 2019-09-21 11:02                                   ` Alkis Georgopoulos
  0 siblings, 0 replies; 19+ messages in thread
From: Alkis Georgopoulos @ 2019-09-21 11:02 UTC (permalink / raw)
  To: linux-nfs

On 9/21/19 10:59 AM, Alkis Georgopoulos wrote:
> I.e. the problem probably is that when NFS_MAX_READAHEAD=15 was 
> implemented, rsize was 512k; now that rsize=1M, this results in 
> readaheads of 15M, which cause all the traffic and lags.


I filed a bug report for this:
https://bugzilla.kernel.org/show_bug.cgi?id=204939

A quick work around is to run on the clients, after the NFS mounts:

for f in $(awk '/^v[0-9]/ { print $4 }' < /proc/fs/nfsfs/volumes); do
     echo 4 > /sys/devices/virtual/bdi/$f/read_ahead_kb
done

Btw, the mail subject is wrong: the workaround above causes the netboot
traffic to drop from e.g. 1160 MB to 221 MB at any network speed;
the problem was just more noticeable at lower speeds.

Thank you very much,
Alkis Georgopoulos


end of thread

Thread overview: 19+ messages
2019-09-19  7:29 rsize,wsize=1M causes severe lags in 10/100 Mbps, what sets those defaults? Alkis Georgopoulos
2019-09-19 15:08 ` Trond Myklebust
2019-09-19 15:58   ` rsize,wsize=1M causes severe lags in 10/100 Mbps Alkis Georgopoulos
2019-09-19 16:11     ` Trond Myklebust
2019-09-19 19:21       ` Alkis Georgopoulos
2019-09-19 19:51         ` Trond Myklebust
2019-09-19 19:57           ` Alkis Georgopoulos
2019-09-19 20:05             ` Trond Myklebust
2019-09-19 20:20               ` Alkis Georgopoulos
2019-09-19 20:40                 ` Trond Myklebust
2019-09-19 21:19                   ` Daniel Forrest
2019-09-19 21:42                     ` Trond Myklebust
2019-09-19 22:16                       ` Daniel Forrest
2019-09-20  9:25                         ` Alkis Georgopoulos
2019-09-20  9:48                           ` Alkis Georgopoulos
2019-09-20 10:04                             ` Alkis Georgopoulos
2019-09-21  7:52                               ` Alkis Georgopoulos
2019-09-21  7:59                                 ` Alkis Georgopoulos
2019-09-21 11:02                                   ` Alkis Georgopoulos
