From: Trond Myklebust <trondmy@hammerspace.com>
To: "david@fromorbit.com" <david@fromorbit.com>
Cc: "bfoster@redhat.com" <bfoster@redhat.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"trondmy@kernel.org" <trondmy@kernel.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"djwong@kernel.org" <djwong@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"willy@infradead.org" <willy@infradead.org>
Subject: Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()
Date: Wed, 5 Jan 2022 20:45:05 +0000	[thread overview]
Message-ID: <c9d9b7850c6086b123b4add4de7b1992cb62f6ad.camel@hammerspace.com> (raw)
In-Reply-To: <0996c40657b5873dda5119344bf74556491e27b9.camel@hammerspace.com>

On Tue, 2022-01-04 at 21:09 -0500, Trond Myklebust wrote:
> On Tue, 2022-01-04 at 12:22 +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 12:04:23AM +0000, Trond Myklebust wrote:
> > > On Tue, 2022-01-04 at 09:03 +1100, Dave Chinner wrote:
> > > > On Sat, Jan 01, 2022 at 05:39:45PM +0000, Trond Myklebust wrote:
> > > > > On Sat, 2022-01-01 at 14:55 +1100, Dave Chinner wrote:
> > > > > > As it is, if you are getting soft lockups in this
> > > > > > location, that's an indication that the ioend chain that
> > > > > > is being built by XFS is way, way too long. IOWs, the
> > > > > > completion latency problem is caused by a lack of submit
> > > > > > side ioend chain length bounding in combination with
> > > > > > unbound completion side merging in xfs_end_bio - it's not
> > > > > > a problem with the generic iomap code....
> > > > > > 
> > > > > > Let's try to address this in the XFS code, rather than
> > > > > > hack unnecessary band-aids over the problem in the generic
> > > > > > code...
> > > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dave.
> > > > > 
> > > > > Fair enough. As long as someone is working on a solution,
> > > > > then I'm happy. Just a couple of things:
> > > > > 
> > > > > Firstly, we've verified that the cond_resched() in the bio
> > > > > loop does suffice to resolve the issue with XFS, which would
> > > > > tend to confirm what you're saying above about the underlying
> > > > > issue being the ioend chain length.
> > > > > 
> > > > > Secondly, note that we've tested this issue with a variety of
> > > > > older kernels, including 4.18.x, 5.1.x and 5.15.x, so please
> > > > > bear in mind that it would be useful for any fix to be
> > > > > backward portable through the stable mechanism.
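
For reference, the change being tested above is along these lines: a
cond_resched() in the per-bio walk of the ioend completion path, so that
finishing one very long bio chain cannot monopolize the CPU. The sketch
below is a simplified illustration of that idea, not the exact patch
posted at the top of this thread; the loop shape is abbreviated from the
5.15-era iomap_finish_ioend().

#include <linux/bio.h>
#include <linux/iomap.h>
#include <linux/sched.h>

/* Simplified sketch only: yield between bios while finishing one ioend. */
static void iomap_finish_ioend_sketch(struct iomap_ioend *ioend, int error)
{
	struct bio *bio, *next;

	for (bio = &ioend->io_inline_bio; bio; bio = next) {
		/* the final bio's bi_private points back at the ioend */
		next = (bio == ioend->io_bio) ? NULL : bio->bi_private;

		/* ... end page writeback for each segment of this bio ... */

		bio_put(bio);

		/*
		 * Yield so that a huge bio chain cannot trip the soft
		 * lockup watchdog.  Only safe when this runs in process
		 * context (e.g. from the xfs-conv workqueue).
		 */
		cond_resched();
	}
}
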
> > > > 
> > > > The infrastructure hasn't changed that much, so whatever the
> > > > result is it should be backportable.
> > > > 
> > > > As it is, is there a specific workload that triggers this
> > > > issue? Or a specific machine config (e.g. large memory, slow
> > > > storage). Are there large fragmented files in use (e.g.
> > > > randomly written VM image files)? There are a few factors that
> > > > can exacerbate the ioend chain lengths, so it would be handy
> > > > to have some idea of what is actually triggering this
> > > > behaviour...
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > 
> > > We have different reproducers. The common feature appears to be
> > > the need for a decently fast box with fairly large memory (128GB
> > > in one case, 400GB in the other). It has been reproduced with
> > > HDs, SSDs and NVME systems.
> > > 
> > > On the 128GB box, we had it set up with 10+ disks in a JBOD
> > > configuration and were running the AJA system tests.
> > > 
> > > On the 400GB box, we were just serially creating large (> 6GB)
> > > files using fio and that was occasionally triggering the issue.
> > > However doing an strace of that workload to disk reproduced the
> > > problem faster :-).
> > 
> > Ok, that matches up with the "lots of logically sequential dirty
> > data on a single inode in cache" vector that is required to create
> > really long bio chains on individual ioends.
> > 
> > Can you try the patch below and see if it addresses the issue?
> > 
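
Dave's patch itself is not quoted here, but the submit-side bounding
described above amounts to something like the sketch below: when deciding
whether the page being written back can join the ioend currently under
construction, refuse once that ioend has grown past a cap, forcing a new
ioend and hence a new, bounded bio chain. The cap value and the helper
name are made up for illustration; the real iomap_can_add_to_ioend()
carries additional block-mapping and sector-contiguity checks.

#include <linux/iomap.h>
#include <linux/types.h>

/* Illustrative only: cap how large a single ioend may grow at submit time. */
#define IOEND_SKETCH_MAX_BYTES	(16UL << 20)	/* arbitrary 16MB example cap */

static bool iomap_can_add_to_ioend_sketch(struct iomap_writepage_ctx *wpc,
		loff_t offset)
{
	struct iomap_ioend *ioend = wpc->ioend;

	if (!ioend)
		return false;
	/* must be logically contiguous with what has been built so far */
	if (offset != ioend->io_offset + ioend->io_size)
		return false;
	/* bound the chain: start a new ioend once this one is big enough */
	if (ioend->io_size >= IOEND_SKETCH_MAX_BYTES)
		return false;
	return true;
}
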
> 
> That patch does seem to fix the soft lockups.
> 

Oops... Strike that, apparently our tests just hit the following when
running on AWS with that patch.

[Wed Jan  5 20:34:46 2022] watchdog: BUG: soft lockup - CPU#4 stuck for
48s! [kworker/4:1:31315]
[Wed Jan  5 20:34:46 2022] Modules linked in: nfsv3 auth_name
bpf_preload xt_nat veth nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4
dns_resolver nfsidmap nfs fscache netfs dm_multipath nfsd auth_rpcgss
nfs_acl lockd grace sunrpc xt_MASQUERADE nf_conntrack_netlink
xt_addrtype br_netfilter bridge stp llc overlay xt_sctp
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp
nf_conntrack_ftp xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle
ip6table_security ip6table_raw iptable_nat nf_nat iptable_mangle
iptable_security iptable_raw nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bonding tls
ipmi_msghandler intel_rapl_msr intel_rapl_common isst_if_common nfit
libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4
rapl ip_tables xfs nvme crc32c_intel ena nvme_core
[Wed Jan  5 20:34:46 2022] CPU: 4 PID: 31315 Comm: kworker/4:1 Kdump:
loaded Tainted: G        W    L    5.15.12-200.pd.17718.el7.x86_64 #1
[Wed Jan  5 20:34:46 2022] Hardware name: Amazon EC2 r5b.2xlarge/, BIOS
1.0 10/16/2017
[Wed Jan  5 20:34:46 2022] Workqueue: xfs-conv/nvme1n1 xfs_end_io [xfs]
[Wed Jan  5 20:34:46 2022] RIP:
0010:_raw_spin_unlock_irqrestore+0x1c/0x20
[Wed Jan  5 20:34:46 2022] Code: 92 cc cc cc cc cc cc cc cc cc cc cc cc
cc 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 01 c3 fb 66
0f 1f 44 00 00 <c3> 0f 1f 00 0f 1f 44 00 00 8b 07 a9 ff 01 00 00 75 21
b8 00 02 00
[Wed Jan  5 20:34:46 2022] RSP: 0018:ffffb9738983fd10 EFLAGS: 00000206
[Wed Jan  5 20:34:46 2022] RAX: 0000000000000001 RBX: 0000000000000db0
RCX: fffffffffffff90f
[Wed Jan  5 20:34:46 2022] RDX: ffffffffa3808938 RSI: 0000000000000206
RDI: ffffffffa3808930
[Wed Jan  5 20:34:46 2022] RBP: 0000000000000206 R08: ffffb9738601fc80
R09: ffffb9738601fc80
[Wed Jan  5 20:34:46 2022] R10: 0000000000000000 R11: 0000000000000000
R12: ffffffffa3808930
[Wed Jan  5 20:34:46 2022] R13: ffffdda3db40dd40 R14: ffff930e1c62f538
R15: ffffdda3db40dd40
[Wed Jan  5 20:34:46 2022] FS:  0000000000000000(0000)
GS:ffff93164dd00000(0000) knlGS:0000000000000000
[Wed Jan  5 20:34:46 2022] CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[Wed Jan  5 20:34:46 2022] CR2: 00007ffe41f5c080 CR3: 00000005a5810001
CR4: 00000000007706e0
[Wed Jan  5 20:34:46 2022] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[Wed Jan  5 20:34:46 2022] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[Wed Jan  5 20:34:46 2022] PKRU: 55555554
[Wed Jan  5 20:34:46 2022] Call Trace:
[Wed Jan  5 20:34:46 2022]  <TASK>
[Wed Jan  5 20:34:46 2022]  wake_up_page_bit+0x79/0xe0
[Wed Jan  5 20:34:46 2022]  end_page_writeback+0xc4/0xf0
[Wed Jan  5 20:34:46 2022]  iomap_finish_ioend+0x130/0x260
[Wed Jan  5 20:34:46 2022]  iomap_finish_ioends+0x71/0x90
[Wed Jan  5 20:34:46 2022]  xfs_end_ioend+0x5a/0x120 [xfs]
[Wed Jan  5 20:34:46 2022]  xfs_end_io+0xa1/0xc0 [xfs]
[Wed Jan  5 20:34:46 2022]  process_one_work+0x1f1/0x390
[Wed Jan  5 20:34:46 2022]  worker_thread+0x53/0x3e0
[Wed Jan  5 20:34:46 2022]  ? process_one_work+0x390/0x390
[Wed Jan  5 20:34:46 2022]  kthread+0x127/0x150
[Wed Jan  5 20:34:46 2022]  ? set_kthread_struct+0x40/0x40
[Wed Jan  5 20:34:46 2022]  ret_from_fork+0x22/0x30
[Wed Jan  5 20:34:46 2022]  </TASK>


So it was harder to hit, but we still did eventually.
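
The trace shows the worker in xfs_end_io -> iomap_finish_ioends(), i.e.
it is completing an entire list of merged ioends, so even with per-ioend
chain lengths bounded, the aggregate work done in one pass can still be
large. One possible follow-up mitigation, sketched below purely for
illustration and not taken from any patch in this thread, is to also
yield between the merged ioends; this particular path runs in workqueue
process context, so cond_resched() is safe there.

#include <linux/iomap.h>
#include <linux/list.h>
#include <linux/sched.h>

/* Sketch only: complete a merged list of ioends, yielding between each one. */
static void iomap_finish_ioends_sketch(struct iomap_ioend *ioend, int error)
{
	struct list_head tmp;

	list_replace_init(&ioend->io_list, &tmp);
	iomap_finish_ioend(ioend, error);

	while (!list_empty(&tmp)) {
		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
		list_del_init(&ioend->io_list);
		iomap_finish_ioend(ioend, error);
		cond_resched();	/* don't hog the CPU across a long merged list */
	}
}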

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
www.hammer.space


Thread overview: 52+ messages
2021-12-30 19:35 [PATCH] iomap: Address soft lockup in iomap_finish_ioend() trondmy
2021-12-30 21:24 ` Jens Axboe
2021-12-30 22:25   ` Trond Myklebust
2021-12-30 22:27     ` Jens Axboe
2021-12-30 22:55       ` Trond Myklebust
2021-12-31  1:42 ` Matthew Wilcox
2021-12-31  6:16   ` Trond Myklebust
2022-01-01  3:55     ` Dave Chinner
2022-01-01 17:39       ` Trond Myklebust
2022-01-03 22:03         ` Dave Chinner
2022-01-04  0:04           ` Trond Myklebust
2022-01-04  1:22             ` Dave Chinner
2022-01-04  3:01               ` Trond Myklebust
2022-01-04  7:08               ` hch
2022-01-04 18:08                 ` Matthew Wilcox
2022-01-04 18:14                   ` hch
2022-01-04 19:22                     ` Darrick J. Wong
2022-01-04 21:52                       ` Dave Chinner
2022-01-04 23:12                         ` Darrick J. Wong
2022-01-05  2:10                           ` Dave Chinner
2022-01-05 13:56                             ` Brian Foster
2022-01-05 22:04                               ` Dave Chinner
2022-01-06 16:44                                 ` Brian Foster
2022-01-10  8:18                                   ` Dave Chinner
2022-01-10 17:45                                     ` Brian Foster
2022-01-10 18:11                                       ` hch
2022-01-11 14:33                                       ` Trond Myklebust
2022-01-05 13:42                           ` hch
2022-01-04 21:16                 ` Dave Chinner
2022-01-05 13:43                   ` hch
2022-01-05 22:34                     ` Dave Chinner
2022-01-05  2:09               ` Trond Myklebust
2022-01-05 20:45                 ` Trond Myklebust [this message]
2022-01-05 22:48                   ` Dave Chinner
2022-01-05 23:29                     ` Trond Myklebust
2022-01-06  0:01                     ` Darrick J. Wong
2022-01-09 23:09                       ` Dave Chinner
2022-01-06 18:36                     ` Trond Myklebust
2022-01-06 18:38                       ` Trond Myklebust
2022-01-06 20:07                       ` Brian Foster
2022-01-07  3:08                         ` Trond Myklebust
2022-01-07 15:15                           ` Brian Foster
2022-01-09 23:34                       ` Dave Chinner
2022-01-10 23:37                       ` Dave Chinner
2022-01-11  0:08                         ` Dave Chinner
2022-01-13 17:01                         ` Trond Myklebust
2022-01-17 17:24                           ` Trond Myklebust
2022-01-17 17:36                             ` Darrick J. Wong
2022-01-04 13:36         ` Brian Foster
2022-01-04 19:23           ` Darrick J. Wong
2022-01-05  2:31 ` [iomap] f5934dda54: BUG:sleeping_function_called_from_invalid_context_at_fs/iomap/buffered-io.c kernel test robot
2022-01-05  2:31   ` kernel test robot
