From: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>
To: Mel Gorman <mgorman@techsingularity.net>,
	Christoph Lameter <cl@linux.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, linux-rdma@vger.kernel.org,
	akpm@linux-foundation.org, andi@firstfloor.org,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@kernel.org>,
	Guy Shattah <sguy@mellanox.com>,
	Anshuman Khandual <khandual@linux.vnet.ibm.com>,
	Michal Nazarewicz <mina86@mina86.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Nellans <dnellans@nvidia.com>,
	Laura Abbott <labbott@redhat.com>, Pavel Machek <pavel@ucw.cz>,
	Dave Hansen <dave.hansen@intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [RFC 1/2] Protect larger order pages from breaking up
Date: Thu, 22 Feb 2018 22:19:32 +0100	[thread overview]
Message-ID: <68050f0f-14ca-d974-9cf4-19694a2244b9@schoebel-theuer.de> (raw)
In-Reply-To: <20180219101935.cb3gnkbjimn5hbud@techsingularity.net>

[-- Attachment #1: Type: text/plain, Size: 5015 bytes --]

On 02/19/18 11:19, Mel Gorman wrote:
>
>> Index: linux/mm/page_alloc.c
>> ===================================================================
>> --- linux.orig/mm/page_alloc.c
>> +++ linux/mm/page_alloc.c
>> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>>   		area = &(zone->free_area[current_order]);
>>   		page = list_first_entry_or_null(&area->free_list[migratetype],
>>   							struct page, lru);
>> -		if (!page)
>> +		/*
>> +		 * Continue if no page is found or if our freelist contains
>> +		 * less than the minimum pages of that order. In that case
>> +		 * we better look for a different order.
>> +		 */
>> +		if (!page || area->nr_free < area->min)
>>   			continue;
>>   		list_del(&page->lru);
>>   		rmv_page_order(page);
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there was sufficient reserved
> pages at higher orders for a reserve.

Hi Mel,

I agree with you that the above code does not really do what it should.

At least, the condition needs to be changed to:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688b6a0a..193dfd85a6b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1837,7 +1837,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                 area = &(zone->free_area[current_order]);
                 page = list_first_entry_or_null(&area->free_list[migratetype],
                                                         struct page, lru);
-               if (!page)
+               /*
+                * Continue if no page is found or if we are about to
+                * split a truly higher order than requested.
+                * There is no limit for just _using_ exactly the right
+                * order. The limit is only for _splitting_ some
+                * higher order.
+                */
+               if (!page ||
+                   (area->nr_free < area->min && current_order > order))
                         continue;
                 list_del(&page->lru);
                 rmv_page_order(page);


The "&& current_order > order" part is _crucial_. If it is left out, the 
check can even be counter-productive. I know this from the development of 
my original patch some years ago.

Please have a look at the attached patchset for kernel 3.16, which has 
been in _production_ at 1&1 Internet SE on about 20,000 servers for 
several years now, starting from kernel 3.2.x up to 3.16.x (or maybe the 
very first version was for 2.6.32, I don't remember exactly).

It has collected several millions of operation hours in total, and it is 
known to work miracles for some of our workloads.

Porting to later kernels should be relatively easy. Also notice that the 
switch labels in patch #2 may need some minor tweaking, e.g. also 
including ZONE_DMA32 or similar, and possibly some architecture-specific 
tweaking. All of the tweaking depends on the actual workload. I am using 
it only on datacenter servers (webhosting) and on x86_64.

Please notice that the user interface of my patchset is extremely simple 
and can be easily understood by junior sysadmins:

After running your box for several days or weeks or even months (or 
possibly, after you just got an OOM), just do
# cat /proc/sys/vm/perorder_statistics > /etc/defaults/my_perorder_reserve

Then add a trivial startup script, e.g. for systemd or sysv init, which 
just does the following early during the next reboot:
# cat /etc/defaults/my_perorder_reserve > /proc/sys/vm/perorder_reserve

That's it.
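For the systemd case, under the assumption that the patchset exposes 
/proc/sys/vm/perorder_reserve as described, such a startup step could be a 
minimal oneshot unit (the unit and file names here are hypothetical):

```ini
# /etc/systemd/system/perorder-reserve.service (hypothetical name)
[Unit]
Description=Restore per-order page reservations recorded from production
DefaultDependencies=no
After=systemd-sysctl.service
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'cat /etc/defaults/my_perorder_reserve > /proc/sys/vm/perorder_reserve'

[Install]
WantedBy=basic.target
```

Running it early (before basic.target) matters: the reservations should be 
in place before long-running services start fragmenting memory.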

No need for a deep understanding of the theory of the memory 
fragmentation problem.

Also, there is no need to add anything to the boot command line. 
Fragmentation will typically occur only after some days, weeks, or months 
of operation, at least in all of the practical cases I have personally 
seen at 1&1 datacenters and their workloads.

Please notice that fragmentation can be a very serious problem for 
operations if you are hit by it. It can seriously harm your business. 
And it is _extremely_ specific to the actual workload and to the 
hardware / chipset / etc. This is addressed by the above method of 
determining the right values from _actual_ operations (not from 
speculation) and then memoizing them.

The attached patchset tries to be very simple, but in my practical 
experience it is a very effective solution.

When requested, I can post the mathematical theory behind the patch, or 
I could give a presentation at one of the next conferences if I were 
invited (or better, give a practical explanation instead). But probably 
nobody on these lists wants to deal with any theories.

Just _play_ with the patchset practically, and then you will notice.

Cheers and greetings,

Yours sincerely, old-school hacker Thomas


P.S. I cannot attend to these lists full-time due to my workload at 1&1, 
which is unfortunately not designed for upstream hacking, so please be 
patient with me if an answer takes a few days.



[-- Attachment #2: 0001-mm-fix-fragmentation-by-pre-reserving-higher-order-p.patch --]
[-- Type: text/plain, Size: 0 bytes --]


