From: Michael Wakabayashi <mwakabayashi@vmware.com>
To: Olga Kornievskaia <aglo@umich.edu>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: NFSv4: Mounting NFS server which is down, blocks all other NFS mounts on same machine
Date: Thu, 20 May 2021 09:51:54 +0000
Message-ID: <CO1PR05MB810173C0D970DE22AA9535B8B72A9@CO1PR05MB8101.namprd05.prod.outlook.com>
In-Reply-To: <CAN-5tyGgWx6F2s=t+0UAJJZEAEfNnv+Sq8eeBbnQYocKOOK8Jg@mail.gmail.com>

Hi Olga,

Thank you for looking.

I spent a couple of hours trying to get various
SystemTap NFS scripts working but mostly got errors.

For example:
> root@mikes-ubuntu-21-04:~/src/systemtap-scripts/tracepoints# stap nfs4_fsinfo.stp
> semantic error: unable to find tracepoint variable '$status' (alternatives: $$parms, $$vars, $task, $$name): identifier '$status' at nfs4_fsinfo.stp:7:11
>         source: terror = $status
>                         ^
> Pass 2: analysis failed.  [man error::pass2]

If you have any stap scripts that work on Ubuntu
that you'd like me to run, or pointers on how
to set up my Ubuntu environment to run them
successfully, please let me know and I can try again.
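
As a possible alternative to SystemTap, the nfs4 and sunrpc tracepoints
can also be read straight from tracefs. The following is only a sketch,
not something that was run for this report. It assumes tracefs is mounted
at /sys/kernel/tracing (older setups use /sys/kernel/debug/tracing) and
that the nfsv4 and sunrpc modules are already loaded so their event
directories exist. The helper names (set_nfs_tracepoints, clear_trace,
save_trace) are made up for illustration.

```shell
#!/bin/sh
# Sketch: capture nfs4 and sunrpc trace events through tracefs.
# TRACEFS is an assumption; adjust it if tracefs is mounted elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Enable (1) or disable (0) every event in the nfs4 and sunrpc groups.
set_nfs_tracepoints() {
    state="$1"
    for group in nfs4 sunrpc; do
        echo "$state" > "$TRACEFS/events/$group/enable" || return 1
    done
}

# Empty the trace buffer so that only the reproduction is recorded.
clear_trace() {
    : > "$TRACEFS/trace"
}

# Copy the current contents of the trace buffer to the given file.
save_trace() {
    cat "$TRACEFS/trace" > "$1"
}
```

Typical use, as root: "clear_trace; set_nfs_tracepoints 1", run the two
mount.nfs commands, then "save_trace /tmp/nfs-trace.log" and
"set_nfs_tracepoints 0".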


Here's the call trace for the mount.nfs command
mounting the unreachable NFS server (10.1.1.1):

[Thu May 20 08:53:35 2021] task:mount.nfs       state:D stack:    0 pid:13903 ppid: 13900 flags:0x00004000
[Thu May 20 08:53:35 2021] Call Trace:
[Thu May 20 08:53:35 2021]  ? rpc_init_task+0x150/0x150 [sunrpc]
[Thu May 20 08:53:35 2021]  __schedule+0x23d/0x670
[Thu May 20 08:53:35 2021]  ? rpc_init_task+0x150/0x150 [sunrpc]
[Thu May 20 08:53:35 2021]  schedule+0x4f/0xc0
[Thu May 20 08:53:35 2021]  rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[Thu May 20 08:53:35 2021]  __wait_on_bit+0x33/0xa0
[Thu May 20 08:53:35 2021]  ? call_reserveresult+0xa0/0xa0 [sunrpc]
[Thu May 20 08:53:35 2021]  out_of_line_wait_on_bit+0x8d/0xb0
[Thu May 20 08:53:35 2021]  ? var_wake_function+0x30/0x30
[Thu May 20 08:53:35 2021]  __rpc_execute+0xd4/0x290 [sunrpc]
[Thu May 20 08:53:35 2021]  rpc_execute+0x5e/0x80 [sunrpc]
[Thu May 20 08:53:35 2021]  rpc_run_task+0x13d/0x180 [sunrpc]
[Thu May 20 08:53:35 2021]  rpc_call_sync+0x51/0xa0 [sunrpc]
[Thu May 20 08:53:35 2021]  rpc_create_xprt+0x177/0x1c0 [sunrpc]
[Thu May 20 08:53:35 2021]  rpc_create+0x11f/0x220 [sunrpc]
[Thu May 20 08:53:35 2021]  ? __memcg_kmem_charge+0x7d/0xf0
[Thu May 20 08:53:35 2021]  ? _cond_resched+0x1a/0x50
[Thu May 20 08:53:35 2021]  nfs_create_rpc_client+0x13a/0x180 [nfs]
[Thu May 20 08:53:35 2021]  nfs4_init_client+0x205/0x290 [nfsv4]
[Thu May 20 08:53:35 2021]  ? __fscache_acquire_cookie+0x10a/0x210 [fscache]
[Thu May 20 08:53:35 2021]  ? nfs_fscache_get_client_cookie+0xa9/0x120 [nfs]
[Thu May 20 08:53:35 2021]  ? nfs_match_client+0x37/0x2a0 [nfs]
[Thu May 20 08:53:35 2021]  nfs_get_client+0x14d/0x190 [nfs]
[Thu May 20 08:53:35 2021]  nfs4_set_client+0xd3/0x120 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_init_server+0xf8/0x270 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_create_server+0x58/0xa0 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_try_get_tree+0x3a/0xc0 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs_get_tree+0x38/0x50 [nfs]
[Thu May 20 08:53:35 2021]  vfs_get_tree+0x2a/0xc0
[Thu May 20 08:53:35 2021]  do_new_mount+0x14b/0x1a0
[Thu May 20 08:53:35 2021]  path_mount+0x1d4/0x4e0
[Thu May 20 08:53:35 2021]  __x64_sys_mount+0x108/0x140
[Thu May 20 08:53:35 2021]  do_syscall_64+0x38/0x90
[Thu May 20 08:53:35 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


Here's the call trace for the mount.nfs command
mounting an available NFS server (10.188.76.67), which was
blocked by the first mount.nfs command above:

[Thu May 20 08:53:35 2021] task:mount.nfs       state:D stack:    0 pid:13910 ppid: 13907 flags:0x00004000 
[Thu May 20 08:53:35 2021] Call Trace:
[Thu May 20 08:53:35 2021]  __schedule+0x23d/0x670
[Thu May 20 08:53:35 2021]  schedule+0x4f/0xc0
[Thu May 20 08:53:35 2021]  nfs_wait_client_init_complete+0x5a/0x90 [nfs]
[Thu May 20 08:53:35 2021]  ? wait_woken+0x80/0x80
[Thu May 20 08:53:35 2021]  nfs_match_client+0x1de/0x2a0 [nfs]
[Thu May 20 08:53:35 2021]  ? pcpu_block_update_hint_alloc+0xcc/0x2d0
[Thu May 20 08:53:35 2021]  nfs_get_client+0x62/0x190 [nfs]
[Thu May 20 08:53:35 2021]  nfs4_set_client+0xd3/0x120 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_init_server+0xf8/0x270 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_create_server+0x58/0xa0 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs4_try_get_tree+0x3a/0xc0 [nfsv4]
[Thu May 20 08:53:35 2021]  nfs_get_tree+0x38/0x50 [nfs]
[Thu May 20 08:53:35 2021]  vfs_get_tree+0x2a/0xc0
[Thu May 20 08:53:35 2021]  do_new_mount+0x14b/0x1a0
[Thu May 20 08:53:35 2021]  path_mount+0x1d4/0x4e0
[Thu May 20 08:53:35 2021]  __x64_sys_mount+0x108/0x140
[Thu May 20 08:53:35 2021]  do_syscall_64+0x38/0x90
[Thu May 20 08:53:35 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I've pasted the entire dmesg output here: https://pastebin.com/90QJyAL9
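
The two traces differ in the function that called schedule(): the mount
to 10.1.1.1 sleeps in rpc_wait_bit_killable, while the mount to
10.188.76.67 sleeps in nfs_wait_client_init_complete. A minimal sketch
for pulling that frame out of a saved "dmesg -T" dump follows. The
blocked_tasks name is hypothetical, and the script assumes the task and
Call Trace line layout shown above. It only skips __schedule and
schedule, so other scheduler helpers would be reported as-is.

```shell
#!/bin/sh
# Sketch: list tasks in uninterruptible sleep (state D) from a sysrq-t dump
# together with the function each one was in when it called schedule().
# blocked_tasks is a hypothetical helper name, not an existing tool.
blocked_tasks() {
    awk '
    { sub(/^\[[^]]*\] ?/, "") }            # strip the dmesg timestamp prefix
    $1 ~ /^task:/ {                        # header line of a new task
        want = ($0 ~ / state:D /)
        name = substr($1, 6)
        pid = "?"
        for (i = 2; i <= NF; i++)
            if ($i ~ /^pid:/) {
                pid = substr($i, 5)
                if (pid == "") pid = $(i + 1)   # "pid:   42" splits in two
                break
            }
        next
    }
    want && $1 != "?" && $1 ~ /\+0x/ {     # a frame not marked with "?"
        fn = $1
        sub(/\+0x.*/, "", fn)
        if (fn == "__schedule" || fn == "schedule") next
        print pid, name, fn
        want = 0                           # one line per task is enough
    }
    ' "$@"
}
```

Fed just the two traces above, "blocked_tasks dmesg.log" would print
"13903 mount.nfs rpc_wait_bit_killable" and
"13910 mount.nfs nfs_wait_client_init_complete".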


This is the command I ran to mount an unreachable NFS server:
date; time strace mount.nfs 10.1.1.1:/nopath /tmp/mnt.dead; date
The strace log: https://pastebin.com/5yVhm77u

This is the command I ran to mount the available NFS server:
date; time strace mount.nfs 10.188.76.67:/ /tmp/mnt.alive ; date
The strace log: https://pastebin.com/kTimQ6vH

The procedure:
- run "dmesg -C" to clear the dmesg logs
- run mount.nfs on 10.1.1.1 (this IP address is down and does not respond to ping), which hung
- run mount.nfs on 10.188.76.67, which also hung
- run "echo t > /proc/sysrq-trigger" to dump the call traces for the hung processes
- run "dmesg -T > dmesg.log" to save the dmesg logs
- press Ctrl-Z to suspend the mount.nfs command to 10.1.1.1
- run "kill -9 %1" in the same terminal to kill the mount.nfs to 10.1.1.1
- the mount.nfs to 10.188.76.67 mounts successfully immediately
  after the first mount is killed (the timestamps in the log files show this)
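
The steps above could be scripted roughly as follows. This is a sketch
under stated assumptions rather than the exact commands that were run:
it drops strace, uses a fixed WAIT (30 seconds, an arbitrary value) in
place of watching for the hang by hand, and kills the first mount by PID
instead of Ctrl-Z plus "kill -9 %1". The run, start and repro helper
names are hypothetical. DRY_RUN=1, the default here, only prints each
step; a real run needs root and existing mount points.

```shell
#!/bin/sh
# Sketch of the reproduction procedure; see the assumptions in the text above.
DEAD="${DEAD:-10.1.1.1:/nopath}"        # unreachable server from this report
ALIVE="${ALIVE:-10.188.76.67:/}"        # reachable server from this report
WAIT="${WAIT:-30}"                      # assumed pause before each next step
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print the steps, do not run them

# Run a command in the foreground, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Start a command in the background and leave its PID in last_pid.
start() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $* &"; last_pid="<pid>"
    else
        "$@" & last_pid=$!
    fi
}

repro() {
    run dmesg -C                                  # clear old kernel messages
    start mount.nfs "$DEAD" /tmp/mnt.dead         # expected to hang
    dead_pid=$last_pid
    run sleep "$WAIT"
    start mount.nfs "$ALIVE" /tmp/mnt.alive       # hangs too if the bug is hit
    alive_pid=$last_pid
    run sleep "$WAIT"
    run sh -c 'echo t > /proc/sysrq-trigger'      # dump task call traces
    run sh -c 'dmesg -T > dmesg.log'              # save them
    run kill -9 "$dead_pid"                       # kill the hung first mount
    [ "$DRY_RUN" = 1 ] || wait "$alive_pid"       # second mount should finish
}
```

Calling "repro" prints the plan; "DRY_RUN=0 repro" would execute it.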


Thanks, Mike



From: Olga Kornievskaia <aglo@umich.edu>
Sent: Wednesday, May 19, 2021 12:15 PM
To: Michael Wakabayashi <mwakabayashi@vmware.com>
Cc: linux-nfs@vger.kernel.org <linux-nfs@vger.kernel.org>
Subject: Re: NFSv4: Mounting NFS server which is down, blocks all other NFS mounts on same machine 
 
On Sun, May 16, 2021 at 11:18 PM Michael Wakabayashi
<mwakabayashi@vmware.com> wrote:
>
> Hi,
>
> We're seeing what looks like an NFSv4 issue.
>
> Mounting an NFS server that is down (ping to this NFS server's IP address does not respond) will block _all_ other NFS mount attempts even if the NFS servers are available and working properly (these subsequent mounts hang).
>
> If I kill the NFS mount process that's trying to mount the dead NFS server, the NFS mounts that were blocked will immediately unblock and mount successfully, which suggests the first mount command is blocking the other mount commands.
>
>
> I verified this behavior using a newly built mount.nfs command from the recent nfs-utils 2.5.3 package installed on a recent version of Ubuntu Cloud Image 21.04:
> * https://sourceforge.net/projects/nfs/files/nfs-utils/2.5.3/
> * https://cloud-images.ubuntu.com/releases/hirsute/release-20210513/ubuntu-21.04-server-cloudimg-amd64.ova
>
>
> This looks specific to NFSv4 because the following output shows "vers=4.2":
> > $ strace /sbin/mount.nfs <unreachable-IP-address>:/path /tmp/mnt
> > [ ... cut ... ]
> > mount("<unreachable-IP-address>:/path", "/tmp/mnt", "nfs", 0, "vers=4.2,addr=<unreachable-IP-address>,clien"...^C^Z
>
> Also, if I try the same mount.nfs commands but specifying NFSv3, the mount to the dead NFS server hangs, but the mounts to the operational NFS servers do not block and mount successfully; this bug doesn't happen when using NFSv3.
>
>
> We reported this issue under util-linux here:
> https://github.com/karelzak/util-linux/issues/1309
> [mounting nfs server which is down blocks all other nfs mounts on same machine #1309]
>
> I also found an older bug on this mailing list that had similar symptoms (but could not tell if it was the same problem or not):
> https://patchwork.kernel.org/project/linux-nfs/patch/87vaori26c.fsf@notabene.neil.brown.name/
> [[PATCH/RFC] NFSv4: don't let hanging mounts block other mounts]
>
> Thanks, Mike

Hi Mike,

This is not a very helpful reply: I was curious whether I could
reproduce your issue, but I was not successful. I'm able to initiate a
mount to an unreachable IP address, which hangs, and then do another
mount to an existing server without issues. Ubuntu 21.04 seems to be
based on 5.11, so I tried upstream 5.11 and I tried the latest upstream
nfs-utils (instead of the older version that my distro has).

To debug, perhaps get the output of the nfs4 and sunrpc tracepoints.
Alternatively, get the output from dmesg after doing
"echo t > /proc/sysrq-trigger" to see where the mounts are hanging.
