Re: [RFC] mm: Proactive compaction

From: Khalid Aziz <khalid.aziz@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Nitin Gupta <nigupta@nvidia.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"mgorman@techsingularity.net" <mgorman@techsingularity.net>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: "cai@lca.pw" <cai@lca.pw>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"aryabinin@virtuozzo.com" <aryabinin@virtuozzo.com>,
	"jannh@google.com" <jannh@google.com>,
	"guro@fb.com" <guro@fb.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"yuzhao@google.com" <yuzhao@google.com>,
	"arunks@codeaurora.org" <arunks@codeaurora.org>,
	"willy@infradead.org" <willy@infradead.org>,
	"janne.huttunen@nokia.com" <janne.huttunen@nokia.com>,
	"khlebnikov@yandex-team.ru" <khlebnikov@yandex-team.ru>
Subject: Re: [RFC] mm: Proactive compaction
Date: Tue, 24 Sep 2019 08:11:09 -0600
Message-ID: <d33fce4d-6018-0235-5391-debc8974eda5@oracle.com>
In-Reply-To: <71d7fba0-bd6f-3ac5-1fd8-9a8ff6fc6b8b@suse.cz>

On 9/24/19 7:39 AM, Vlastimil Babka wrote:
> On 9/20/19 1:37 AM, Nitin Gupta wrote:
>> On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
>>>
>>> That's a lot of control knobs - how is an admin supposed to tune them to
>>> their needs?
>>
>>
>> Yes, it's difficult for an admin to get so many tunables right unless
>> targeting a very specific workload.
>>
>> How about a simpler solution where we exposed just one tunable per-node:
>>    /sys/.../node-x/compaction_effort
>> which accepts [0, 100]
>>
>> This parallels /proc/sys/vm/swappiness but for compaction. With this
>> single number, we can estimate per-order [low, high] watermarks for external
>> fragmentation like this:
>>  - For now, map this range to [low, medium, high] which corresponds to specific
>> low, high thresholds for extfrag.
>>  - Apply more relaxed thresholds for higher-order than for lower orders.
>>
>> With this single tunable we remove the burden of setting per-order explicit
>> [low, high] thresholds and it should be easier to experiment with.
> 
> What about instead autotuning by the number of allocations hitting
> direct compaction recently? IIRC there were attempts in the past (myself
> included) and recently Khalid's, which was quite elaborate.
> 
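
For reference, the single-tunable proposal quoted above could be
sketched roughly as follows. This is only an illustration: the band
boundaries, the (low, high) percentages, the per-order relaxation step
and all function names are hypothetical and do not come from the RFC.
It also assumes that a higher effort value means lower extfrag
thresholds, i.e. compaction is triggered at a lower level of
fragmentation.

```python
# Sketch only: every number below is an assumption chosen for illustration.
MAX_ORDER = 11  # orders 0..10; order 0 is skipped (no external fragmentation)

# Hypothetical (low, high) extfrag thresholds, in percent, for each band.
BANDS = {
    "low": (60, 90),
    "medium": (40, 70),
    "high": (20, 50),
}


def effort_to_band(effort):
    """Map compaction_effort in [0, 100] to one of three coarse bands."""
    if not 0 <= effort <= 100:
        raise ValueError("compaction_effort must be in [0, 100]")
    if effort < 34:
        return "low"
    if effort < 67:
        return "medium"
    return "high"


def extfrag_thresholds(effort, relax_per_order=2):
    """Return {order: (low, high)} with more relaxed values at higher orders."""
    low, high = BANDS[effort_to_band(effort)]
    table = {}
    for order in range(1, MAX_ORDER):
        # Tolerate more fragmentation as the order grows, capped at 100%.
        relax = relax_per_order * (order - 1)
        table[order] = (min(low + relax, 100), min(high + relax, 100))
    return table
```

With these made-up numbers, extfrag_thresholds(100) gives (20, 50) for
order 1 and (38, 68) for order 10.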

I do think the right way forward with this longstanding problem is to
take the burden of managing free memory away from the end user and let
the kernel autotune itself to the demands of the workload. We can start
with a simple algorithm in the kernel that adapts to the workload and
refine it as we move forward. As long as the initial implementation
performs at least as well as the current free page management, we have a
workable path for improvements. I am moving the implementation I put
together in the kernel to a userspace daemon just to test it out on a
larger variety of workloads. A userspace daemon is more constrained
because it has only limited access to the statistics the algorithm needs
for trend analysis, so I would rather be doing this in the kernel.
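
To make the idea of userspace trend analysis concrete, here is a
minimal sketch of what one step of such a daemon might look like. It is
not the implementation referred to above. It assumes the only input is
the cumulative compact_stall counter from /proc/vmstat, sampled at a
fixed interval, and it fits a least-squares slope over a sliding
window. The class name, window size, step and "busy" threshold are all
hypothetical.

```python
from collections import deque


def read_compact_stall(path="/proc/vmstat"):
    """Return the cumulative number of direct-compaction stalls."""
    with open(path) as f:
        for line in f:
            name, value = line.split()
            if name == "compact_stall":
                return int(value)
    raise KeyError("compact_stall not found in " + path)


def slope(samples):
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


class EffortTuner:
    """Raise effort while stalls are trending up, lower it when they stop."""

    def __init__(self, window=8, step=10, busy=1.0, start=50):
        self.samples = deque(maxlen=window)
        self.step = step
        self.busy = busy      # stalls per interval considered "too many"
        self.effort = start   # hypothetical 0..100 effort value

    def update(self, compact_stall_total):
        self.samples.append(compact_stall_total)
        rate = slope(list(self.samples))
        if rate > self.busy:
            self.effort = min(100, self.effort + self.step)
        elif rate <= 0.0 and len(self.samples) == self.samples.maxlen:
            # Only back off once a full window shows no new stalls.
            self.effort = max(0, self.effort - self.step)
        return self.effort
```

A daemon would call update(read_compact_stall()) once per sampling
interval and act on the returned value. Polling a single counter like
this is exactly the kind of limited view of the system that an
in-kernel implementation would not be restricted to.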

--
Khalid