linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, dm-devel@redhat.com,
	Jens Axboe <axboe@kernel.dk>, NeilBrown <neilb@suse.de>
Subject: Re: [PATCH 0/6] Overhaul memalloc_no*
Date: Tue, 30 Jun 2020 08:34:10 +1000	[thread overview]
Message-ID: <20200629223410.GK2005@dread.disaster.area> (raw)
In-Reply-To: <alpine.LRH.2.02.2006290918030.11293@file01.intranet.prod.int.rdu2.redhat.com>

On Mon, Jun 29, 2020 at 09:43:23AM -0400, Mikulas Patocka wrote:
> On Mon, 29 Jun 2020, Dave Chinner wrote:
> > On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> > > On Sat, 27 Jun 2020, Dave Chinner wrote:
> > > > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > > > Hi
> > > > > 
> > > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > > > > prevents both filesystem recursion and i/o recursion.
> > > > > 
> > > > > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > > > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > > > > and PF_MEMALLOC_NOIO is not set.
> > > > 
> > > > Correct me if I'm wrong, but I think that will prevent swapping from
> > > > GFP_NOFS memory reclaim contexts.
> > > 
> > > Yes.
> > > 
> > > > IOWs, this will substantially
> > > > change the behaviour of the memory reclaim system under sustained
> > > > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > > > quite common, so I really don't think we want to telling memory
> > > > reclaim "you can't do IO at all" when all we are trying to do is
> > > > prevent recursion back into the same filesystem.
> > > 
> > > So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.
> > 
> > Uh, why?
> > 
> > Exactly what problem are you trying to solve here?
> 
> This:
> 
> 1. The filesystem does a GFP_NOFS allocation.
> 2. The allocation calls directly a dm-bufio shrinker.
> 3. The dm-bufio shrinker sees that there is __GFP_IO set, so it assumes 
>    that it can do I/O. It selects some dirty buffers, writes them back and 
>    waits for the I/O to finish.

And so you are doing IO in a GFP_NOFS context because someone thought
the block layer can't recurse back into filesystems? That's a broken
assumption and has been since the loop device was introduced over a
couple of decades ago. I mean, the dm-bufio IO submission path uses
GFP_NOIO for obvious reasons, but once it's in the next device down
it loses all control of the submission context.

This is what I mean about "looking at reclaim contexts above the
current layer is a Big Red Flag"? The fundamental assumption of
dm-bufio that it can issue IO in GFP_NOFS context and not have a
lower layer recurse back into a filesystem has always been
incorrect. Just because the loop device now does GFP_NOIO
allocation, that doesn't mean what dm-bufio is doing in this
shrinker is correct or valid.

Because, as you point out:

> 4. The dirty buffers belong to a loop device.
> 5. The loop device thread calls the filesystem that did the GFP_NOFS 
>    allocation in step 1 (and that is still waiting for the allocation to 
>    succeed).
> Note that setting PF_MEMALLOC_NOIO on the loop thread won't help with this 
> deadlock.

Right, re-entering the filesystem might block on a lock, IO, memory
allocation, journal space reservation, etc. Indeed, it might not
even be able to issue transactions because the allocating context is
using GFP_NOFS because it is already running a transaction.

> Do you argue that this is a bug in dm-bufio? Or a bug in the kernel? Or 
> that it can't happen?

That's a bug in dm-bufio - dm is a layered block device and so has
_always_ been able to have filesystems both above and below
it in the storage stack. i.e. the assumption that there is no
filesystem context under the DM layers has always been wrong.

i.e. the memory reclaim context specfically directed dm-bufio that
whatever the shrinker does, it must not recurse into the filesystem
layer. It is the responsibility of the shrinker to obey the
constraints it was given by memory reclaim, and the dm-bufio
shrinker's assumption that there cannot be a filesystem below the DM
device violates this directive because, quite clearly, there can be
filesystems underneath DM devices.

IOWs, assuming that you can issue and block on IO from a -layered
block device- in GFP_NOFS shrinker context is flawed.  i.e. Anything
that presents as a block device that is layered on top of another
block device can recurse into a filesystem as they can sit on top of
a loop device. This has always been the case, and that means the
assumptions the dm-bufio shrinker is making about what it can do in
GFP_NOFS shrinker context has always been incorrect.

Remember that I explained "you should not block kswapd" in this
shrinker a year ago?

| What follows from that, and is pertinent for in this situation, is
| that if you don't block kswapd, then other reclaim contexts are not
| going to get stuck waiting for it regardless of the reclaim context
| they use.

https://lore.kernel.org/linux-fsdevel/20190809215733.GZ7777@dread.disaster.area/

If you did that when I suggested it, this problem would be solved.
i.e. The only way to fix this problem once adn for all is to stop
using the shrinker as a mechanism to issue and wait on IO. If you
need background writeback of dirty buffers, do it from a
WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim
path and so can issue writeback and block safely from a GFP_KERNEL
context. Kick the workqueue from the shrinker context, but get rid
of the IO submission and waiting from the shrinker and all the
GFP_NOFS memory reclaim recursion problems go away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


  reply	other threads:[~2020-06-29 22:34 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-25 11:31 Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
2020-06-25 12:22   ` Michal Hocko
2020-06-25 12:34     ` Matthew Wilcox
2020-06-25 12:42       ` Michal Hocko
2020-06-25 11:31 ` [PATCH 2/6] mm: Add become_kswapd and restore_kswapd Matthew Wilcox (Oracle)
2020-06-25 12:31   ` Michal Hocko
2020-06-25 11:31 ` [PATCH 3/6] xfs: Convert to memalloc_nofs_save Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs Matthew Wilcox (Oracle)
2020-06-25 13:35   ` Michal Hocko
2020-06-25 11:31 ` [PATCH 5/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
2020-06-25 12:40   ` Michal Hocko
2020-06-25 13:10     ` Matthew Wilcox
2020-06-25 13:34       ` Michal Hocko
2020-06-25 19:05   ` kernel test robot
2020-06-25 23:51   ` kernel test robot
2020-06-29  5:08   ` Mike Rapoport
2020-06-29 12:18     ` Matthew Wilcox
2020-06-29 12:52       ` Michal Hocko
2020-06-29 13:45         ` Mike Rapoport
2020-06-29 21:28           ` Matthew Wilcox
2020-06-30  6:34             ` Michal Hocko
2020-07-01  4:12               ` Matthew Wilcox
2020-07-01  5:53                 ` Michal Hocko
2020-07-01  7:04                   ` Mike Rapoport
2020-09-24  0:39   ` Mike Snitzer
2020-09-24  1:10     ` Matthew Wilcox
2020-10-23 14:49   ` Daniel Vetter
2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
2020-06-25 20:34   ` Matthew Wilcox
2020-06-25 20:36   ` Michal Hocko
2020-06-25 20:40     ` Matthew Wilcox
2020-06-26 15:02 ` Mikulas Patocka
2020-06-26 23:08   ` Dave Chinner
2020-06-27 13:09     ` Mikulas Patocka
2020-06-29  0:35       ` Dave Chinner
2020-06-29 13:43         ` Mikulas Patocka
2020-06-29 22:34           ` Dave Chinner [this message]
2020-07-03 14:26             ` [PATCH] dm-bufio: do cleanup from a workqueue Mikulas Patocka
2020-06-29  8:22     ` [PATCH 0/6] Overhaul memalloc_no* Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200629223410.GK2005@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=neilb@suse.de \
    --cc=willy@infradead.org \
    --subject='Re: [PATCH 0/6] Overhaul memalloc_no*' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).