From: "J. Bruce Fields" <bfields@fieldses.org>
To: hedrick@rutgers.edu
Cc: Timothy Pearson <tpearson@raptorengineering.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
Date: Mon, 9 Aug 2021 14:30:29 -0400	[thread overview]
Message-ID: <20210809183029.GC8394@fieldses.org> (raw)
In-Reply-To: <2FEAFB26-C723-450D-A115-1D82841BBF73@rutgers.edu>

On Mon, Aug 09, 2021 at 01:15:33PM -0400, hedrick@rutgers.edu wrote:
> There seems to be a soft lockup message on the console, but that’s all
> I can find.
> 
> I’m currently considering whether it’s best to move to NFS 4.0, which
> seems not to cause the issue, or 4.2 with delegations disabled. This
> is the primary server for the department. If it fails, everything
> fails: VMs become read-only, user jobs fail, etc.
> 
> We ran for a year before this showed up, so I’m pretty sure going to
> 4.0 will fix it.

I thought you also upgraded the kernel at the same time?  (What were the
two kernels involved?)  So we don't know whether it's a new kernel bug,
or an NFSv4.2-specific bug, or something else.

> But I have use cases for ACLs that will only work
> with 4.2.

NFSv4 ACLs on the Linux server are the same in 4.0 and 4.2.
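
For what it's worth, the tooling is the same as well: nfs4_getfacl and
nfs4_setfacl from nfs4-acl-tools operate on the same ACL attribute over
either minor version.  A minimal sketch, with /mnt/export/somefile and
alice@example.com as stand-ins for your real paths and principals:

    nfs4_getfacl /mnt/export/somefile
    nfs4_setfacl -a "A::alice@example.com:RW" /mnt/export/somefile

That adds an Allow ACE granting alice generic read/write on the file.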

> Since the problem seems to be in the callback mechanism, and
> as far as I can tell that’s only used for delegations, I assume
> turning off delegations will fix it.

It could be.  Though I asked mainly as a way to help narrow down where
the problem is.
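
If you do end up trying that, the server-side knob for delegations is
the leases sysctl, which has to be in place before nfsd starts --
roughly (service names vary by distro):

    sysctl -w fs.leases-enable=0
    systemctl restart nfs-server

The important part is that nfsd comes up with leases already disabled.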

--b.

> We’ve also had a history of 4.2 problems on clients.
> That’s why we backed off to 4.0 initially. Clients were seeing hangs.
> 
> It’s discouraging to hear that even the most recent kernel has
> problems.
> 
> > On Aug 9, 2021, at 1:06 PM, Timothy Pearson
> > <tpearson@raptorengineering.com> wrote:
> > 
> > Did you see anything else at all on the terminal?  The inability to
> > log in is sadly familiar, our boxes are configured to dump a trace
> > over serial every 120 seconds or so if they lock up
> > (/proc/sys/kernel/hung_task_timeout_secs) and I'm not sure you'd see
> > anything past the callback messages without that active.
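> > 
> > (For reference, a rough sketch of that setup -- assuming a standard
> > sysctl interface and a serial console already configured on the
> > kernel command line, e.g. console=ttyS0,115200:
> > 
> >     sysctl -w kernel.hung_task_timeout_secs=120
> >     sysctl -w kernel.hung_task_warnings=-1   # -1 = never stop warning
> > 
> > The hung-task traces then land on the console whenever tasks block
> > for longer than the timeout.)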
> > 
> > FWIW we ended up (mostly) working around the problem by moving the
> > critical systems (which are all NFSv3) to a new server, but that's a
> > stopgap measure as we were looking to deploy NFSv4 on a broader
> > scale.  My gut feeling is the failure occurs under heavy load where
> > too many NFSv4 requests from a single client are pending due to a
> > combination of storage and network saturation, but it's proven very
> > difficult to debug -- even splitting the v3 hosts from the larger
> > NFS server (reducing traffic + storage load) seems to have
> > temporarily stabilized things.
> > 
> > ----- Original Message -----
> >> From: hedrick@rutgers.edu
> >> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> >> Cc: "J. Bruce Fields" <bfields@fieldses.org>, "Chuck Lever" <chuck.lever@oracle.com>, "linux-nfs" <linux-nfs@vger.kernel.org>
> >> Sent: Monday, August 9, 2021 11:31:25 AM
> >> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> > 
> >> Incidentally, we’re a computer science department. We have such a
> >> variety of students and researchers that it’s impossible to know
> >> what they are all doing.  Historically, if there’s a bug in
> >> anything, we’ll see it, and usually often enough for it to be fatal.
> >> 
> >> question: is backing off to 4.0 or disabling delegations likely to
> >> have more of an impact on performance?
> >> 
> >>> On Aug 9, 2021, at 12:17 PM, hedrick@rutgers.edu wrote:
> >>> 
> >>> I just found this because we’ve been dealing with hangs of our
> >>> primary NFS server. This is ubuntu 20.04, which is 5.10.
> >>> 
> >>> Right before the hang:
> >>> 
> >>> Aug  8 21:50:46 communis.lcsr.rutgers.edu kernel: [294852.644801] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid e3faa54e
> >>> Aug  8 21:51:54 communis.lcsr.rutgers.edu kernel: [294921.252531] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid f0faa54e
> >>> 
> >>> 
> >>> I looked at the code, and this seems to be an NFS4.1 callback. We
> >>> just started seeing the problem after upgrading most of our hosts
> >>> in a way that caused them to move from NFS 4.0 to 4.2. I assume
> >>> 4.2 is using the 4.1 callback. Rather than disabling delegations,
> >>> we’re moving back to NFS 4.0 on the clients (except ESXi).
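> >>> 
> >>> (Pinning the version on a client is just a mount option -- a
> >>> sketch, with server:/export standing in for the real export:
> >>> 
> >>>     mount -t nfs -o vers=4.0 server:/export /mnt
> >>> 
> >>> or the equivalent vers=4.0 entry in fstab; running nfsstat -m on
> >>> the client afterwards confirms which version was actually
> >>> negotiated.)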
> >>> 
> >>> We’re using ZFS, so this isn’t just btrfs.
> >>> 
> >>> I’m afraid I don’t have any backtrace. I was going to get more
> >>> information, but it happened late at night and we were unable to
> >>> get into the system to gather information. Just had to reboot.
> >>> 
> >>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson <tpearson@raptorengineering.com> wrote:
> >>>> 
> >>>> Forgot to add -- sometimes, right before the core stall and
> >>>> backtrace, we see messages similar to the following:
> >>>> 
> >>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a
> >>>> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2f0c9b7a
> >>>> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 300c9b7a
> >>>> 
> >>>> We're not sure if they are related or not.
> >>>> 
> >>>> ----- Original Message -----
> >>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com
> >>>>> <mailto:tpearson@raptorengineering.com>> To: "J. Bruce Fields"
> >>>>> <bfields@fieldses.org <mailto:bfields@fieldses.org>>, "Chuck
> >>>>> Lever" <chuck.lever@oracle.com <mailto:chuck.lever@oracle.com>>
> >>>>> Cc: "linux-nfs" <linux-nfs@vger.kernel.org
> >>>>> <mailto:linux-nfs@vger.kernel.org>> Sent: Monday, July 5, 2021
> >>>>> 4:44:29 AM Subject: CPU stall, eventual host hang with BTRFS +
> >>>>> NFS under heavy load
> >>>> 
> >>>>> We've been dealing with a fairly nasty NFS-related problem off
> >>>>> and on for the past couple of years.  The host is a large POWER
> >>>>> server with several external SAS arrays attached, using BTRFS
> >>>>> for cold storage of large amounts of data.  The main symptom is
> >>>>> that under heavy sustained NFS write traffic using certain file
> >>>>> types (see below) a core will suddenly lock up, continually
> >>>>> spewing a backtrace similar to the one I've pasted below.  While
> >>>>> this immediately halts all NFS traffic to the affected client
> >>>>> (which is never the same client as the machine doing the large
> >>>>> file transfer), the larger issue is that over the next few
> >>>>> minutes / hours the entire host will gradually degrade in
> >>>>> responsiveness until it grinds to a complete halt.  Once the
> >>>>> core stall occurs we have been unable to find any way to restore
> >>>>> the machine to full functionality or avoid the degradation and
> >>>>> eventual hang short of a hard power down and restart.
> >>>>> 
> >>>>> Tens of GB of compressed data in a single file seems to be
> >>>>> fairly good at triggering the problem, whereas raw disk images
> >>>>> or other regularly patterned data tend not to be.  The
> >>>>> underlying hardware is functioning perfectly with no problems
> >>>>> noted, and moving the files without NFS avoids the bug.
> >>>>> 
> >>>>> We've been using a workaround involving purposefully pausing
> >>>>> (SIGSTOP) the file transfer process on the client as soon as
> >>>>> other clients start to show a slowdown.  This hack avoids the
> >>>>> bug entirely provided the host is allowed to catch back up prior
> >>>>> to resuming (SIGCONT) the file transfer process.  From this, it
> >>>>> seems something is going very wrong within the NFS stack under
> >>>>> high storage I/O pressure and high storage write latency
> >>>>> (timeout?) -- it should simply pause transfers while the storage
> >>>>> subsystem catches up, not lock up a core and force a host
> >>>>> restart.  Interestingly, sometimes it does exactly what it is
> >>>>> supposed to and does pause and wait for the storage subsystem,
> >>>>> but around 20% of the time it just triggers this bug and stalls
> >>>>> a core.
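> >>>>> 
> >>>>> (The workaround itself is trivial shell -- something like the
> >>>>> following, where $XFER_PID is the client-side transfer process
> >>>>> and the sleep is a stand-in for however long the server needs:
> >>>>> 
> >>>>>     kill -STOP "$XFER_PID"   # pause the writer
> >>>>>     sleep 300                # let the server drain its backlog
> >>>>>     kill -CONT "$XFER_PID"   # resume the transfer
> >>>>> 
> >>>>> The hard part is deciding when to trigger it.)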
> >>>>> 
> >>>>> This bug has been present since at least 4.14 and is still
> >>>>> present in the latest 5.12.14 version.
> >>>>> 
> >>>>> As the machine is in production, it is difficult to gather
> >>>>> further information or test patches, however we would be able to
> >>>>> apply patches to the kernel that would potentially restore
> >>>>> stability with enough advance scheduling.
> >>>>> 
> >>>>> Sample backtrace below:
> >>>>> 
> >>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>> [16846.426202] rcu:     32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> >>>>> [16846.426241]  (t=5251 jiffies g=2720809 q=756724)
> >>>>> [16846.426273] NMI backtrace for cpu 32
> >>>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>>>> [16846.426406] Call Trace:
> >>>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> >>>>> [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> >>>>> [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> >>>>> [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> >>>>> [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> >>>>> [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> >>>>> [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> >>>>> [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> >>>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> >>>>> [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> >>>>> [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> >>>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> >>>>> [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> >>>>> [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> >>>>> [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> >>>>> [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> >>>>> [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> >>>>> [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> >>>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> >>>>> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> >>>>> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> >>>>> [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
> >>>>> [16846.427687] [c000200010823c60] [c000000000170bf0] process_one_work+0x290/0x580
> >>>>> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
> >>>>> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> >>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> >>>>> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s! [kworker/u130:25:10624]
> >>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod target_core_user uio target_core_pscsi target_core_file target_core_iblock target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> >>>>> [16873.869413]  linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy xfrm_algo mdio libphy aacraid igb raid_cl$
> >>>>> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>>>> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>>>> [16873.870023] NIP:  c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
> >>>>> [16873.870073] REGS: c0002000108237d0 TRAP: 0900   Not tainted  (5.12.14)
> >>>>> [16873.870109] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24004842  XER: 00000000
> >>>>> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
> >>>>> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
> >>>>> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
> >>>>> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
> >>>>> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
> >>>>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >>>>> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
> >>>>> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
> >>>>> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> >>>>> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
> >>>>> [16873.870507] LR [c0080000081a0708] rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

Thread overview: 42+ messages
2021-07-05  9:44 CPU stall, eventual host hang with BTRFS + NFS under heavy load Timothy Pearson
2021-07-05  9:47 ` Timothy Pearson
2021-07-23 21:01   ` J. Bruce Fields
     [not found]   ` <B4D8C4B7-EE8C-456C-A6C5-D25FF1F3608E@rutgers.edu>
     [not found]     ` <3A4DF3BB-955C-4301-BBED-4D5F02959F71@rutgers.edu>
2021-08-09 17:06       ` Timothy Pearson
2021-08-09 17:15         ` hedrick
2021-08-09 17:25           ` Timothy Pearson
2021-08-09 17:37           ` Chuck Lever III
     [not found]             ` <F5179A41-FB9A-4AB1-BE58-C2859DB7EC06@rutgers.edu>
2021-08-09 18:30               ` Timothy Pearson
2021-08-09 18:38                 ` hedrick
2021-08-09 18:44                   ` Timothy Pearson
2021-08-09 18:49                   ` J. Bruce Fields
     [not found]                     ` <15AD846A-4638-4ACF-B47C-8EF655AD6E85@rutgers.edu>
2021-08-09 18:56                       ` Timothy Pearson
2021-08-09 20:54                         ` Charles Hedrick
2021-08-09 21:49                           ` Timothy Pearson
2021-08-09 22:01                             ` Charles Hedrick
     [not found]             ` <1119B476-171F-4C5A-9DEF-184F211A6A98@rutgers.edu>
2021-08-10 16:22               ` Timothy Pearson
2021-08-16 14:43                 ` hedrick
2021-08-09 18:30           ` J. Bruce Fields [this message]
2021-08-09 18:34             ` hedrick
     [not found]             ` <413163A6-8484-4170-9877-C0C2D50B13C0@rutgers.edu>
2021-08-10 14:58               ` J. Bruce Fields
2021-07-23 21:00 ` J. Bruce Fields
2021-07-23 21:22   ` Timothy Pearson
2021-07-28 19:51     ` Timothy Pearson
2021-08-02 19:28       ` J. Bruce Fields
2021-08-10  0:43 ` NeilBrown
2021-08-10  0:54   ` J. Bruce Fields
2021-08-12 14:44   ` J. Bruce Fields
2021-08-12 21:36     ` NeilBrown
2021-10-08 20:27       ` Scott Mayhew
2021-10-08 20:53         ` Timothy Pearson
2021-10-08 21:11         ` J. Bruce Fields
2021-10-09 17:33         ` Chuck Lever III
2021-10-11 14:30           ` Bruce Fields
2021-10-11 16:36             ` Chuck Lever III
2021-10-11 21:57               ` NeilBrown
2021-10-14 22:36                 ` Trond Myklebust
2021-10-14 22:51                   ` NeilBrown
2021-10-15  8:03                     ` Trond Myklebust
2021-10-15  8:05                       ` Trond Myklebust
2021-12-01 18:36                         ` Scott Mayhew
2021-12-01 19:35                           ` Trond Myklebust
2021-12-01 20:13                             ` Scott Mayhew
