linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: "Theodore Y. Ts'o" <tytso@mit.edu>, Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Naohiro Aota <Naohiro.Aota@wdc.com>,
	Masato Suzuki <masato.suzuki@wdc.com>
Subject: Re: [PATCH] ext4: Fix deadlock on page reclaim
Date: Sat, 27 Jul 2019 02:59:59 +0000	[thread overview]
Message-ID: <BYAPR04MB58162929012135E47C68923AE7C30@BYAPR04MB5816.namprd04.prod.outlook.com> (raw)
In-Reply-To: 20190726225508.GA13729@mit.edu

On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>
>>> This looks like something that could hit every file systems, so
>>> shouldn't we fix this in common code?  We could also look into
>>> just using memalloc_nofs_save for the page cache allocation path
>>> instead of the per-mapping gfp_mask.
>>
>> I think it has to be the entire IO path - any allocation from the
>> underlying filesystem could recurse into the top level filesystem
>> and then deadlock if the memory reclaim submits IO or blocks on
>> IO completion from the upper filesystem. That's a bloody big hammer
>> for something that is only necessary when there are stacked
>> filesystems like this....
> 
> Yeah.... that's why using memalloc_nofs_save() probably makes the most
> sense, and dm_zoned should use that before it calls into ext4.

Unfortunately, with this particular setup, that will not solve the problem.
dm-zoned submit BIOs to its backend drive in response to XFS activity. The
requests for these BIOs are passed along to the kernel tcmu HBA and end up in
that HBA command ring. The commands themselves are read from the ring and
executed by the tcmu-runner user process which executes them doing
pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
context than the dm-zoned worker thread issuing the BIO,
memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

We tried a simpler setup using loopback mount (XFS used directly in an ext4
file) and running the same workload. We failed to recreate a similar deadlock in
this case, but I am strongly suspecting that it can happen too. It is simply
much harder to hit because the IO path from XFS to ext4 is all in-kernel and
asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs
which makes it relatively easy to get inter-dependent writes or read+write
queued back-to-back and create the deadlock.

So back to Dave's point, we may be needing the big-hammer solution in the case
of stacked file systems, while a non-stack setups do not necessarily need it
(that is for the FS to decide). But I do not see how to implement this big
hammer conditionally. How can a file system tell if it is at the top of the
stack (big hammer not needed) or lower than the top level (big hammer needed) ?

One simple hack would be an fcntl() or mount option to tell the FS to use
GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
applications or system setup is correct. So not so safe.

-- 
Damien Le Moal
Western Digital Research

  reply	other threads:[~2019-07-27  3:00 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-25  9:33 [PATCH] ext4: Fix deadlock on page reclaim Damien Le Moal
2019-07-25 11:54 ` Christoph Hellwig
2019-07-25 19:57   ` Andreas Dilger
2019-07-26  2:24   ` Damien Le Moal
2019-07-26 22:44   ` Dave Chinner
2019-07-26 22:55     ` Theodore Y. Ts'o
2019-07-27  2:59       ` Damien Le Moal [this message]
2019-07-28 23:41         ` Dave Chinner
2019-07-29  1:07           ` Damien Le Moal
2019-07-29 18:40         ` Andreas Dilger
2019-07-30  2:06           ` Damien Le Moal
2019-07-30 23:47             ` Dave Chinner
2019-07-31  4:37               ` Damien Le Moal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BYAPR04MB58162929012135E47C68923AE7C30@BYAPR04MB5816.namprd04.prod.outlook.com \
    --to=damien.lemoal@wdc.com \
    --cc=Naohiro.Aota@wdc.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=jthumshirn@suse.de \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=masato.suzuki@wdc.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).