From: NeilBrown <neilb@suse.de>
To: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.com>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Junxiao Bi <junxiao.bi@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Devel FS Linux <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod
Date: Thu, 11 Sep 2014 20:53:05 +1000	[thread overview]
Message-ID: <20140911205305.578bc017@notabene.brown> (raw)
In-Reply-To: <20140911085046.GC22042@dhcp22.suse.cz>


On Thu, 11 Sep 2014 10:50:47 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 11-09-14 09:57:43, Neil Brown wrote:
> > On Wed, 10 Sep 2014 15:48:43 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 09-09-14 12:33:46, Neil Brown wrote:
> > > > On Thu, 4 Sep 2014 15:54:27 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > [Sorry for jumping in so late - I've been busy last days]
> > > > > 
> > > > > On Wed 27-08-14 16:36:44, Mel Gorman wrote:
> > > > > > On Tue, Aug 26, 2014 at 08:00:20PM -0400, Trond Myklebust wrote:
> > > > > > > On Tue, Aug 26, 2014 at 7:51 PM, Trond Myklebust
> > > > > > > <trond.myklebust@primarydata.com> wrote:
> > > > > > > > On Tue, Aug 26, 2014 at 7:19 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > [...]
> > > > > > > >> wait_on_page_writeback() is a hammer, and we need to be better about
> > > > > > > >> this once we have per-memcg dirty writeback and throttling, but I
> > > > > > > >> think that really misses the point.  Even if memcg writeback waiting
> > > > > > > >> were smarter, any length of time spent waiting for yourself to make
> > > > > > > >> progress is absurd.  We just shouldn't be solving deadlock scenarios
> > > > > > > >> through arbitrary timeouts on one side.  If you can't wait for IO to
> > > > > > > >> finish, you shouldn't be passing __GFP_IO.
> > > > > 
> > > > > Exactly!
> > > > 
> > > > This is overly simplistic.
> > > > The code that cannot wait may be further up the call chain and not in a
> > > > position to avoid passing __GFP_IO.
> > > > In many cases it isn't that "you can't wait for IO" in general, but that you
> > > > cannot wait for one specific IO request.
> > > 
> > > Could you be more specific, please? Why would a particular IO make any
> > > difference to general IO from the same path? My understanding was that
> > > once the page is marked PG_writeback then it is about to be written to
> > > its destination and if there is any need for memory allocation it should
> > > better not allow IO from reclaim.
> > 
> > The more complex the filesystem, the harder it is to "not allow IO from
> > reclaim".
> > For NFS (which started this thread) there might be a need to open a new
> > connection - so allocating in the networking code would all need to be
> > careful.
> 
> memalloc_noio_{save,restore} might help in that regard.

It might.  It is a bit of a heavy stick though.
Especially as "nofs" is what is really wanted (I think).

> 
> > And it isn't impossible that a 'gss' credential needs to be re-negotiated,
> > and that might even need user-space interaction (not sure of details).
> 
> OK, so if I understand you correctly all those allocations might happen
> _after_ the page has been marked PG_writeback. This would be bad indeed
> if such a path could appear in the memcg limit reclaim. The outcome of
> the previous discussion was that this doesn't happen in practice for
> nfs code, though, because the real flushing doesn't happen from a user
> context. The issue was reported for an old kernel where the flushing
> happened from the user context. It would be a huge problem to have a
> flusher within a restricted environment not only because of this path.
> 
> > What you say certainly used to be the case, and very often still is.  But it
> > doesn't really scale with complexity of filesystems.
> > 
> > I don't think there is (yet) any need to optimise for allocations that don't
> > disallow IO happening in the writeout path.  But I do think waiting
> > indefinitely for a particular IO is unjustifiable.
> 
> Well, as Johannes already pointed out, the right way to fix memcg
> reclaim is to implement proper memcg-aware dirty page throttling and
> flushing. This is a song of the distant future, I am afraid. Until then we
> have to live with workarounds. I would be happy to make this one more
> robust but timeout based solutions just sound too fragile and triggering
> OOM is a big risk.
> 
> Maybe we can disable waiting if current->flags & PF_LESS_THROTTLE. I
> would even be tempted to WARN_ON(current->flags & PF_LESS_THROTTLE) in
> that path to catch a potential misconfiguration when the flusher is
> part of a restricted environment. The only real user of the flag is nfsd,
> though, and it runs from a kernel thread, so this wouldn't help much to
> catch potentially buggy code. So I am not really sure how much of an
> improvement this would be.
> 

I think it would be inappropriate to use PF_LESS_THROTTLE.  That is really
about throttling the dirtying of pages, not their writeback.

As has been said, there isn't really a bug that needs fixing at present, so
delving too deeply into designing a solution is probably pointless.

Using global flags is sometimes suitable, but it doesn't help when you are
waiting for memory allocation to happen in another process.
Using timeouts is sometimes suitable, but only if the backup plan isn't too
drastic.

My feeling is that the "ideal" would be to wait until:
  - this thread can make forward progress, or
  - no thread (in this memcg?) can make forward progress
In the first case we succeed.  In the second we take the most gentle backup
solution (e.g. use the last dregs of memory, or trigger OOM).
Detecting when no other thread can make forward progress is probably not
trivial, but it doesn't need to be cheap.

Hopefully when a real issue arises we'll be able to figure something out.

Thanks,
NeilBrown



  parent reply	other threads:[~2014-09-11 10:53 UTC|newest]

Thread overview: 48+ messages
2014-08-22  7:55 rpciod deadlock issue Junxiao Bi
2014-08-22 22:49   ` [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod Trond Myklebust
2014-08-22 22:49     ` [PATCH v2 2/2] NFS: Ensure that rpciod does not trigger reclaim writebacks Trond Myklebust
2014-08-25  5:34       ` [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod Junxiao Bi
2014-08-25  6:48       ` NeilBrown
2014-08-26  5:43           ` Junxiao Bi
2014-08-26  6:21             ` NeilBrown
2014-08-26  6:49               ` Junxiao Bi
2014-08-26  7:04                 ` NeilBrown
2014-08-26  7:23                     ` Junxiao Bi
2014-08-26 10:53           ` Mel Gorman
2014-08-26 12:58             ` Trond Myklebust
2014-08-26 13:26               ` Mel Gorman
2014-08-26 23:19                   ` Johannes Weiner
2014-08-26 23:51                       ` Trond Myklebust
2014-08-27  0:00                           ` Trond Myklebust
2014-08-27 15:36                             ` Mel Gorman
2014-08-27 16:15                                 ` Trond Myklebust
2014-08-28  8:30                                   ` Mel Gorman
2014-08-28  8:49                                       ` Junxiao Bi
2014-08-28  9:25                                         ` Mel Gorman
2014-09-04 13:54                                 ` Michal Hocko
2014-09-09  2:33                                   ` NeilBrown
2014-09-10 13:48                                     ` Michal Hocko
2014-09-10 23:57                                         ` NeilBrown
2014-09-11  8:50                                             ` Michal Hocko
2014-09-11 10:53                                                 ` NeilBrown [this message]
2014-08-27  1:43                       ` NeilBrown
2014-08-25  6:05   ` rpciod deadlock issue NeilBrown
2014-08-25  6:15       ` NeilBrown
