From: Michael Wakabayashi <mwakabayashi@vmware.com>
To: Olga Kornievskaia <aglo@umich.edu>,
	Trond Myklebust <trondmy@hammerspace.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"SteveD@redhat.com" <SteveD@RedHat.com>
Subject: Re: NFSv4: Mounting NFS server which is down, blocks all other NFS mounts on same machine
Date: Tue, 8 Jun 2021 09:16:07 +0000	[thread overview]
Message-ID: <CO1PR05MB810120D887411CCDA786A849B7379@CO1PR05MB8101.namprd05.prod.outlook.com> (raw)
In-Reply-To: <CAN-5tyFUsdHOs05DZw4teb3hGRQ5P+5MqUuE5wEwiP4Ki07cfQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5364 bytes --]

Hi Olga,

> > You say that it's taken offline. If it's offline there shouldn't be
> > anything listening on port 2049. I was only able to reproduce the
> > problem when a client is able to send a SYN to the server and not
> > getting a reply back. If the server is offline, there will always be a
> > reply back (RST or something of the sorts). Client tries a bit but it
> > never gets stuck in the rpc_execute() state because it would get a
> > reply from the TCP layer. Your stack is where  there is no TCP reply
> > from the server.

> I'm actually claiming their infrastructure is broken. He says the
> server is down. If that's the case, the TCP layer will time out fast
> and it will not visibly block other mounts. However, if the server is
> unresponsive, that's what the provided stack shows, then the TCP
> timeout is much larger. I'm saying the server should truly be
> unreachable and not unresponsive.

I don't think the infrastructure is broken.
The behavior we're seeing seems to be correct (explained below).

> If the server is offline, there will always be a reply back
> (RST or something of the sorts).

> He says the server is down. If that's the case, the TCP layer will
> time out fast

I believe these two statements are incorrect. Let me explain.

If the server is offline (for example, physically unplugged from power),
then this unplugged server cannot reply; it's a brick.
CPU, memory, disk and NIC all have zero power.  There's no operating
system running, so the dead server never sees the TCP request packet
and cannot reply, which means the following cannot be true:
> If the server is offline, there will always be a reply back


Here's what's happening at the TCP layer:
I took a TCP packet capture (see the attached nfsv4.pcap file) and can see the
NFS client (10.162.132.231) attempting a 3-way TCP handshake with the
powered-off/offline server (2.2.2.2).  The client sends a TCP SYN to
the NFS server.  But the NFS server is powered off, so the NFS client
times out waiting for the TCP SYN-ACK reply.  On timeout, the NFS client
retransmits the TCP SYN packet and eventually times out again
waiting for the SYN-ACK reply.  This process repeats until the TCP
retransmits are exhausted.  Eventually the NFS client mount command
gives up (after 3 minutes) and exits.
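
For reference, here's roughly how this looks from the client side (a sketch;
it assumes the default Linux TCP sysctls, where the SYN is retransmitted
net.ipv4.tcp_syn_retries times, 6 by default, with exponential backoff
before the connect attempt finally fails):
    # number of SYN retransmissions before the connect attempt gives up;
    # with the default of 6 that is roughly two minutes of silence
    sysctl net.ipv4.tcp_syn_retries
    # watch the retransmitted SYNs going to the dead server
    tcpdump -n -i any 'host 2.2.2.2 and tcp port 2049 and tcp[tcpflags] & tcp-syn != 0'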

During this 3-minute period every other NFS mount command on the host where
the NFS client is running is blocked; this effectively causes a
denial-of-service attack, since no other user is able to NFS mount anything,
including perfectly valid NFS mounts.
To make matters worse, after the mount command exited, the workload would
retry the mount of the powered-off server, extending the duration of the
inadvertent denial of service.
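
As a workload-side stopgap, each attempt's retry window can at least be capped
with the retry= option from nfs(5) (a sketch only; it shortens a hung mount
attempt but does not address the blocking of other users' mounts):
    # retry=0: give up after the first failed attempt instead of retrying
    # for the default 2 minutes used for foreground mounts (see nfs(5))
    mount -o vers=4,retry=0 2.2.2.2:/fake_path /tmp/mnt.dead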

> He says the server is down. If that's the case, the TCP layer will
> time out fast

As described above, the 3-way TCP handshake timeout is relatively slow and
the mount command takes 3 minutes to exit.

I believe you're thinking of the case where the NFS server is powered on
but has no process listening on NFS port 2049.  In that case
the NFS server --will-- reply quickly (within milliseconds) with a
TCP RST/Reset packet, and the NFS client will quickly exit the mount process
with an error code.
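
A quick way to see the difference between the two cases from the client, using
the OpenBSD netcat that ships with Ubuntu (a sketch; 10.0.0.1 stands in for a
hypothetical host that is up but not running an NFS server):
    # host up but nothing listening on 2049: the RST comes back within
    # milliseconds and nc reports "Connection refused"
    nc -vz -w 5 10.0.0.1 2049
    # powered-off host: no reply at all, so nc only gives up when the
    # 5 second timeout expires
    nc -vz -w 5 2.2.2.2 2049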

> There are valid protocol reasons why the NFSv4 client has to check
> whether or not the new mount is really talking to the same server but
> over a different IP addresses.

Hi Trond,

I ran the following script:
    #!/bin/sh -x
    mount -o vers=4 -v -v -v 2.2.2.2:/fake_path /tmp/mnt.dead &
    echo PID_OF_DEAD_BG_MOUNT=$!
    sleep 5 # give time for the first mount to execute in the background
    mount -o vers=4 -v -v -v 10.188.76.67:/git /tmp/mnt.alive
on Ubuntu 21.04, MacOS 11.1 and FreeBSD 11.4.

The undesirable blocking behavior appeared only on Ubuntu.
MacOS and FreeBSD executed the script fine, meaning the 10.188.76.67:/git
NFS share mounted immediately and successfully without blocking.

On Ubuntu, a single user running "mount <unreachable-ip-address>:<fake-path>"
blocks mounts for every other user on the system, which is undesirable.
Our virtual machines effectively crashed because we had several hundred
to several thousand blocked mount processes preventing workloads from
making progress.

We'd prefer the mount behavior of the MacOS or FreeBSD implementations,
even if it's less secure, since at least it's not taking down
our servers with an unintentional DoS attack.

Is there any chance of looking at the FreeBSD mount implementation, verifying
that its behavior is correct, and if so, having the Linux mount command emulate it?

Thanks, Mike

p.s.
I've attached the following 4 files which were generated on Ubuntu 21.04:
  1. mount_hang.sh: script running mount test
  2. mount_hang.sh.log: output of mount_hang.sh
  3. trace_pipe.txt: output of: cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.txt
                     with events/sunrpc/enable and events/nfs4/enable set to 1
                     (exact commands sketched below)
  4. nfsv4.pcap: output of "tcpdump -v -v -v -v  -s 0 -w /tmp/nfsv4.pcap port 2049"
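
For reference, the tracing and packet capture were enabled roughly as follows
(a sketch of the commands; it assumes tracefs is mounted under
/sys/kernel/debug/tracing, as on a stock Ubuntu kernel):
    echo 1 > /sys/kernel/debug/tracing/events/sunrpc/enable
    echo 1 > /sys/kernel/debug/tracing/events/nfs4/enable
    cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.txt &
    tcpdump -v -v -v -v -s 0 -w /tmp/nfsv4.pcap port 2049 &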

The test procedure was:
- run mount_hang.sh on the Ubuntu 21.04 VM
- after seeing the second mount command execute
- wait about 5 seconds
- press control-C to exit the mount_hang.sh script.

[-- Attachment #2: mount_hang.sh --]
[-- Type: text/x-sh; name="mount_hang.sh", Size: 170 bytes --]

#!/bin/sh -x

mount -o vers=4 -v -v -v 2.2.2.2:/fake_path /tmp/mnt.dead &
echo PID_OF_DEAD_BG_MOUNT=$!
sleep 5

mount -o vers=4 -v -v -v 10.188.76.67:/git /tmp/mnt.alive

[-- Attachment #3: mount_hang.sh.log --]
[-- Type: application/octet-stream, Size: 634 bytes --]

root@mikes-ubuntu-21-04:/tmp/dbg# date; ./mount_hang.sh
Tue Jun  8 07:09:45 AM UTC 2021
+ echo PID_OF_DEAD_BG_MOUNT=714882
PID_OF_DEAD_BG_MOUNT=714882
+ sleep 5
+ mount -o vers=4 -v -v -v 2.2.2.2:/fake_path /tmp/mnt.dead
mount.nfs: timeout set for Tue Jun  8 07:11:45 2021
mount.nfs: trying text-based options 'vers=4,addr=2.2.2.2,clientaddr=10.162.132.231'
+ mount -o vers=4 -v -v -v 10.188.76.67:/git /tmp/mnt.alive
mount.nfs: timeout set for Tue Jun  8 07:11:50 2021
mount.nfs: trying text-based options 'vers=4,addr=10.188.76.67,clientaddr=10.162.132.231'
^C
root@mikes-ubuntu-21-04:/tmp/dbg# date
Tue Jun  8 07:09:56 AM UTC 2021

[-- Attachment #4: nfsv4.pcap --]
[-- Type: application/octet-stream, Size: 474 bytes --]

[-- Attachment #5: trace_pipe.txt --]
[-- Type: text/plain, Size: 3508 bytes --]

           <...>-714884  [001] .... 1639365.962466: xprt_create: peer=[2.2.2.2]:2049 state=BOUND
           <...>-714884  [001] .N.. 1639365.962524: rpc_clnt_new_err: program=nfs server=2.2.2.2 error=-22
           <...>-701148  [001] .... 1639365.962530: xprt_destroy: peer=[2.2.2.2]:2049 state=LOCKED|BOUND
           <...>-714884  [001] .... 1639365.962543: xprt_create: peer=[2.2.2.2]:2049 state=BOUND
           <...>-714884  [001] .... 1639365.962555: rpc_clnt_new: client=0 peer=[2.2.2.2]:2049 program=nfs server=2.2.2.2
           <...>-714884  [001] .... 1639365.962558: rpc_task_begin: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=ACTIVE status=0 action=0x0
           <...>-714884  [001] .... 1639365.962559: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_start [sunrpc]
           <...>-714884  [001] .... 1639365.962560: rpc_request: task:3239@0 nfsv4 NULL (sync)
           <...>-714884  [001] .... 1639365.962560: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserve [sunrpc]
           <...>-714884  [001] .... 1639365.962562: xprt_reserve: task:3239@0 xid=0xec976bbb
           <...>-714884  [001] .... 1639365.962562: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserveresult [sunrpc]
           <...>-714884  [001] .... 1639365.962562: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refresh [sunrpc]
           <...>-714884  [001] .... 1639365.962564: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refreshresult [sunrpc]
           <...>-714884  [001] .... 1639365.962564: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_allocate [sunrpc]
           <...>-714884  [001] .... 1639365.962566: rpc_buf_alloc: task:3239@0 callsize=56 recvsize=32 status=0
           <...>-714884  [001] .... 1639365.962566: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_encode [sunrpc]
           <...>-714884  [001] .... 1639365.962569: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect [sunrpc]
           <...>-714884  [001] .... 1639365.962569: xprt_reserve_xprt: task:3239@0 snd_task:3239
           <...>-714884  [001] .... 1639365.962570: xprt_connect: peer=[2.2.2.2]:2049 state=LOCKED|BOUND
           <...>-714884  [001] .... 1639365.962571: rpc_task_sleep: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 timeout=0 queue=xprt_pending
           <...>-714884  [001] .N.. 1639365.962576: rpc_task_sync_sleep: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect_status [sunrpc]
    kworker/u5:1-3258    [001] .... 1639365.962618: rpc_socket_connect: error=-115 socket:[2506590] dstaddr=2.2.2.2/2049 state=2 (CONNECTING) sk_state=2 (SYN_SENT)
    kworker/u5:1-3258    [001] .... 1639365.962619: xprt_release_xprt: task:4294967295@4294967295 snd_task:4294967295

