Re: [PATCH] mm, memcg: clear page protection when memcg oom group happens

From: Michal Hocko <mhocko@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH] mm, memcg: clear page protection when memcg oom group happens
Date: Tue, 26 Nov 2019 11:22:13 +0100	[thread overview]
Message-ID: <20191126102213.GD20912@dhcp22.suse.cz> (raw)
In-Reply-To: <CALOAHbBfCk-1qt4z3C5dCjWr__exKNvd15hXZ3_Wo3cLS7jdOw@mail.gmail.com>

On Tue 26-11-19 18:02:27, Yafang Shao wrote:
> On Tue, Nov 26, 2019 at 5:50 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 26-11-19 17:35:59, Yafang Shao wrote:
> > > On Tue, Nov 26, 2019 at 3:31 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 26-11-19 11:52:19, Yafang Shao wrote:
> > > > > On Mon, Nov 25, 2019 at 10:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > >
> > > > > > On Mon, Nov 25, 2019 at 03:21:50PM +0100, Michal Hocko wrote:
> > > > > > > On Mon 25-11-19 22:11:15, Yafang Shao wrote:
> > > > > > > > When there're no processes, we don't need to protect the pages. You
> > > > > > > > can consider it as 'fault tolerance' .
> > > > > > >
> > > > > > > I have already tried to explain why this is a bold statement that
> > > > > > > doesn't really hold universally and that the kernel doesn't really have
> > > > > > > enough information to make an educated guess.
> > > > > >
> > > > > > I agree, this is not obviously true. And the kernel shouldn't try to
> > > > > > guess whether the explicit userspace configuration is still desirable
> > > > > > to userspace or not. Should we also delete the cgroup when it becomes
> > > > > > empty for example?
> > > > > >
> > > > > > It's better to implement these kinds of policy decisions from
> > > > > > userspace.
> > > > > >
> > > > > > There is a cgroup.events file that can be polled, and its "populated"
> > > > > > field shows conveniently whether there are tasks in a subtree or
> > > > > > not. You can use that to clear protection settings.
> > > > >
> > > > > Why isn't force_empty supported in cgroup2 ?
> > > >
> > > > There wasn't any sound usecase AFAIR.
> > > >
> > > > > In this case we can free the protected file pages immdiately with force_empty.
> > > >
> > > > You can do the same thing by setting the hard limit to 0.
> > >
> > > I look though the code, and the difference between setting the hard
> > > limit to 0 and force empty is that setting the hard limit to 0 will
> > > generate some OOM reports, that should not happen in this case.
> > > I think we should make little improvement as bellow,
> >
> > Yes, if you are not able to reclaim all of the memory then the OOM
> > killer is triggered. And that was not the case with force_empty. I
> > didn't mean that the two are equivalent, sorry if I misled you.
> > I merely wanted to point out that you have means to cleanup the memcg
> > with the existing API.
> >
> > > @@ -6137,9 +6137,11 @@ static ssize_t memory_max_write(struct
> > > kernfs_open_file *of,
> > >                         continue;
> > >                 }
> > >
> > > -               memcg_memory_event(memcg, MEMCG_OOM);
> > > -               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > > -                       break;
> > > +               if (cgroup_is_populated(memcg->css.cgroup)) {
> > > +                       memcg_memory_event(memcg, MEMCG_OOM);
> > > +                       if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > > +                               break;
> > > +               }
> > >         }
> >
> > If there are no killable tasks then
> >         "Out of memory and no killable processes..."
> > is printed and that really reflects the situation and is the right thing
> > to do. Your above patch would suppress that information which might be
> > important.
> >
> 
> Not only this output.
> Pls. see dump_header(), many outputs and even worse is that the
> dump_stack() is also executed.

Yes, there will be the full oom report. I have outlined the "no
killable" part because this is the main distinguisher for the "no tasks"
case.

> > > Well,  if someone don't want to kill proesses but only want ot drop
> > > page caches, setting the hard limit to 0 won't work.
> >
> > Could you be more specific about a real world example when somebody
> > wants to drop per-memcg pagecache?
> 
> For example, if one memcg  has lots of negtive denties,  that causes
> the file page cache continuesly been reclaimed, so we want to drop all
> these negtive dentries. force_empty is a better workaround so far, and
> that can give us more chance to analyze why negtive dentries are
> generated.

force_empty sounds like a brute force to clean negative dentries TBH.
And it is not really way too much different from shrinking the hard
limit.

Why doesn't a normal reclaim work for those situation? Anyway, this is
getting really tangent to the original topic so I would suggest to start
a new email thread with a clear description of a problem you are facing
and we can go from there.
-- 
Michal Hocko
SUSE Labs