From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751886AbcFKOfy (ORCPT <rfc822;w@1wt.eu>);
	Sat, 11 Jun 2016 10:35:54 -0400
Received: from www262.sakura.ne.jp ([202.181.97.72]:53497 "EHLO
	www262.sakura.ne.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751076AbcFKOfx (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 11 Jun 2016 10:35:53 -0400
To: mhocko@kernel.org
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, mgorman@suse.de,
        vbabka@suse.cz, hannes@cmpxchg.org, riel@redhat.com,
        david@fromorbit.com, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/2] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_HARD with more useful semantic
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
References: <1465212736-14637-1-git-send-email-mhocko@kernel.org>
	<1465212736-14637-2-git-send-email-mhocko@kernel.org>
	<7fb7e035-7795-839b-d1b0-4a68fcf8e9c9@I-love.SAKURA.ne.jp>
	<20160607123149.GK12305@dhcp22.suse.cz>
In-Reply-To: <20160607123149.GK12305@dhcp22.suse.cz>
Message-Id: <201606112335.HBG09891.OLFJOFtVMOQHSF@I-love.SAKURA.ne.jp>
X-Mailer: Winbiff [Version 2.51 PL2]
X-Accept-Language: ja,en,zh
Date: Sat, 11 Jun 2016 23:35:49 +0900
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Michal Hocko wrote:
> On Tue 07-06-16 21:11:03, Tetsuo Handa wrote:
> > Remaining __GFP_REPEAT users are not always doing costly allocations.
> 
> Yes but...
> 
> > Sometimes they pass __GFP_REPEAT because the size is given from userspace.
> > Thus, unconditional s/__GFP_REPEAT/__GFP_RETRY_HARD/g is not good.
> 
> Would that be a regression though? Strictly speaking the __GFP_REPEAT
> documentation was explicit to not loop for ever. So nobody should have
> expected nofail semantic pretty much by definition. The fact that our
> previous implementation was not fully conforming to the documentation is
> just an implementation detail.  All the remaining users of __GFP_REPEAT
> _have_ to be prepared for the allocation failure. So what exactly is the
> problem with them?

A !costly allocation becomes weaker than now if __GFP_RETRY_HARD is passed.

> > >  	/* Reclaim has failed us, start killing things */
> > >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> > >  	if (page)
> > > @@ -3719,6 +3731,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  	/* Retry as long as the OOM killer is making progress */
> > >  	if (did_some_progress) {
> > >  		no_progress_loops = 0;
> > > +		passed_oom = true;
> > 
> > This is too premature. did_some_progress != 0 after returning from
> > __alloc_pages_may_oom() does not mean the OOM killer was invoked. It only means
> > that mutex_trylock(&oom_lock) was attempted.
> 
> which means that we have reached the OOM condition and _somebody_ is
> actaully handling the OOM on our behalf.

That _somebody_ might release oom_lock without invoking the OOM killer (e.g.
doing !__GFP_FS allocation), which means that we have reached the OOM condition
and nobody is actually handling the OOM on our behalf. __GFP_RETRY_HARD becomes
as weak as __GFP_NORETRY. I think this is a regression.


> > What I think more important is hearing from __GFP_REPEAT users how hard they
> > want to retry. It is possible that they want to retry unless SIGKILL is
> > delivered, but passing __GFP_NOFAIL is too hard, and therefore __GFP_REPEAT
> > is used instead. It is possible that they use __GFP_NOFAIL || __GFP_KILLABLE
> > if __GFP_KILLABLE were available. In my module (though I'm not using
> > __GFP_REPEAT), I want to retry unless SIGKILL is delivered.
> 
> To be honest killability for a particular allocation request sounds
> like a hack to me. Just consider the expected semantic. How do you
> handle when one path uses explicit __GFP_KILLABLE while other path (from
> the same syscall) is not... If anything this would have to be process
> context wise.

I didn't catch your question. But making code killable should be considered
good unless it complicates error handling paths.

Since we are not setting TIF_MEMDIE to all OOM-killed threads, OOM-killed
threads will have to loop until mutex_trylock(&oom_lock) succeeds in order
to get TIF_MEMDIE by calling out_of_memory(), which is a needless delay.

Many allocations from syscall context can give up upon SIGKILL. We don't
need to allow OOM-killed threads to use memory reserves if that allocation
is killable.

Converting down_write(&mm->mmap_sem) to down_write_killable(&mm->mmap_sem)
is considered good. But converting kmalloc(GFP_KERNEL) to
kmalloc(GFP_KERNEL | __GFP_KILLABLE) is considered hack. Why?