From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=R6+Q=ZD=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.7 required=3.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS
	autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B55F2C43331
	for <linux-mm@archiver.kernel.org>; Mon, 11 Nov 2019 22:00:25 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 5908821872
	for <linux-mm@archiver.kernel.org>; Mon, 11 Nov 2019 22:00:25 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="JqlwbjuW"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5908821872
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id D688B6B0005; Mon, 11 Nov 2019 17:00:24 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D19806B0006; Mon, 11 Nov 2019 17:00:24 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C2EDC6B0007; Mon, 11 Nov 2019 17:00:24 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0187.hostedemail.com [216.40.44.187])
	by kanga.kvack.org (Postfix) with ESMTP id AC5326B0005
	for <linux-mm@kvack.org>; Mon, 11 Nov 2019 17:00:24 -0500 (EST)
Received: from smtpin03.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with SMTP id 75F6A824999B
	for <linux-mm@kvack.org>; Mon, 11 Nov 2019 22:00:24 +0000 (UTC)
X-FDA: 76145365968.03.quiet92_5ac89161afa5a
X-HE-Tag: quiet92_5ac89161afa5a
X-Filterd-Recvd-Size: 9355
Received: from mail-io1-f65.google.com (mail-io1-f65.google.com [209.85.166.65])
	by imf17.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Mon, 11 Nov 2019 22:00:23 +0000 (UTC)
Received: by mail-io1-f65.google.com with SMTP id v17so15239908iol.12
        for <linux-mm@kvack.org>; Mon, 11 Nov 2019 14:00:23 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=yl3/EdQbk7NoybNy9noTa/zZ2Z+usQIoS1WR7+EnoX8=;
        b=JqlwbjuWee/HUTvmX4eSS6B+9msDPjXmwdI9pcNBZvDVlisWu3sNcjEBMMDyQ4x7fs
         Kd6aAzuc5EcMzRYEjV2o9T8iU69P31QWpHLGYFXx2X0eNo23pVMzn5DnGbICBp03dks5
         229WPOgaizJIVfLiRoNNAxcs/6HvW0VQqUEMz8ygHlxCOiH8kFFOUCBdwAXhjqYbf/kB
         SojB3fBICl9GSDycBi8k7qeOGP8ApeW3izEqtiUvVS8+74eJQnkDl5zoeon14rrk8BW7
         MNl1gpKc4L0tkmmhFyvgXwl9B0MHm8wFc+OS1MQC73gtSAZA1Rx626PNQl8esVxXeKIA
         AX/w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=yl3/EdQbk7NoybNy9noTa/zZ2Z+usQIoS1WR7+EnoX8=;
        b=NsgQP6IFs5CB2raMVqZAcgBX4kCSxhSzunKXmJJiCfZtZP2slvaQz9evpJwJJM/wmB
         HC4tQ8QlbQ0YltxCb9jQ7OdnK1KObmVw+G6xjgQP+1AEvkDE6mBd3RvGssTDvLKpGhlL
         LC2q6poWeQMZqDclBpTrbyUkjSdGAEcUEQ6s5sXNs/cENZ7vxxWFeTa+B7OAb6rmYP+/
         Pn1kJS8k4d/Yd//KvcEA2pEB1ErQxM6qFHuo2Xth7++iLKFrNO9rUiknOFgI6HE5UwYg
         YzEeTs7Fc4HGlSC6hWizfYPVJLm3dBeglxeyruxI1oNkFI8Jw0drkVwincAdi9T1bp7l
         7CYw==
X-Gm-Message-State: APjAAAXt5FVjNZb/E0N/WJEEjEFaKM766JCEC85QZO0x+eesVXyIGMDn
	sRN/1WvvukH3XFQ1lOzmihCMDDRmdl5gBfelvJw=
X-Google-Smtp-Source: APXvYqwrA1w/1J6Ox6OYWvf0Pzg3A28V4XZu32zfIwvB/BbrRgX03KVzy0ggIOe227RNx6eWUB7x43lmrLFOCxk90Ac=
X-Received: by 2002:a5d:8789:: with SMTP id f9mr27305171ion.237.1573509622679;
 Mon, 11 Nov 2019 14:00:22 -0800 (PST)
MIME-Version: 1.0
References: <20191106000547.juQRi83gi%akpm@linux-foundation.org>
 <20191106121605.GH8314@dhcp22.suse.cz> <d8a81439-10bf-a0ff-ded3-88c0dca964bb@redhat.com>
In-Reply-To: <d8a81439-10bf-a0ff-ded3-88c0dca964bb@redhat.com>
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Mon, 11 Nov 2019 14:00:11 -0800
Message-ID: <CAKgT0Ufo7iTG6Lp8oaavGsMjk+3EDpK_yLJpLCHZ=MsgJf9=rA@mail.gmail.com>
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
To: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, 
	Andrea Arcangeli <aarcange@redhat.com>, Alexander Duyck <alexander.h.duyck@linux.intel.com>, 
	Dan Williams <dan.j.williams@intel.com>, Dave Hansen <dave.hansen@intel.com>, 
	David Hildenbrand <david@redhat.com>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, lcapitulino@redhat.com, 
	Mel Gorman <mgorman@techsingularity.net>, mm-commits@vger.kernel.org, 
	"Michael S. Tsirkin" <mst@redhat.com>, Oscar Salvador <osalvador@suse.de>, Pankaj Gupta <pagupta@redhat.com>, 
	Paolo Bonzini <pbonzini@redhat.com>, Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>, 
	"Wang, Wei W" <wei.w.wang@intel.com>, Matthew Wilcox <willy@infradead.org>, 
	Yang Zhang <yang.zhang.wz@gmail.com>, linux-mm <linux-mm@kvack.org>
Content-Type: text/plain; charset="UTF-8"
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, Nov 11, 2019 at 10:52 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 11/6/19 7:16 AM, Michal Hocko wrote:
> > I didn't have time to read through newer versions of this patch series
> > but I remember there were concerns about this functionality being pulled
> > into the page allocator previously both by me and Mel [1][2]. Have those been
> > addressed? I do not see an ack from Mel or any other MM people. Is there
> > really a consensus that we want something like that living in the
> > allocator?
> >
> > There has also been a different approach discussed and from [3]
> > (referenced by the cover letter) I can only see
> >
> > : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> > : has been pointed out that this solution doesn't deal with sparse memory,
> > : hotplug, and various other issues.
> >
> > which looks more like something to be done than a fundamental
> > roadblock.
> >
> > [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> > [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> > [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
> >
> [...]
>
> Hi,
>
> I performed some experiments to find the root cause for the performance
> degradation Alexander reported with my v12 patch-set. [1]
>
> I will try to give a brief background of the previous discussion
> under v12: (Alexander can correct me if I am missing something).
> Alexander suggested two issues with my v12 posting: [2]
> (This is excluding the sparse zone and memory hotplug/hotremove support)
>
> - A crash which was caused because I was not using spin_lock_irqsave()
>   (Fix suggestion came from Alexander).
>
> - Performance degradation with Alexander's suggested setup, which uses a
>   modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
>   CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
>   PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
>   (around 20% in the number of threads launched on the 16th vCPU). However, on
>   switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
>   performance similar to what Alexander is reporting.
>
> PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
> bitmap and get reported to the hypervisor.
>
> For the discussion where we are comparing the two series, the performance
> aspect is more relevant and important.
> It turns out that with the current implementation the number of vmexits with
> PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) is significantly
> larger than with (MAX_ORDER - 1).
>
> One of the reasons could be that the lower order pages are not getting
> sufficient time to merge with each other; as a result, they end up being
> reported with 2 separate reporting requests, generating more vmexits.
> With (MAX_ORDER - 1) we don't have that kind of situation, as I never try
> to report any page which has order < (MAX_ORDER - 1).
>
> To fix this, I might have to further limit the reporting which could allow the
> lower order pages to further merge and hence reduce the VM exits. I will try to
> do some experiments to see if I can fix this. In any case, if anyone has a
> suggestion I would be more than happy to look in that direction.

That doesn't make any sense. My setup using MAX_ORDER - 2, aka
pageblock_order, as the limit doesn't experience the same performance
issues the bitmap solution does. That leads me to believe the issue
isn't that the pages have not had a chance to be merged.
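
For reference, a rough sketch of the sizes involved. It assumes 4KB base
pages and a MAX_ORDER of 11, as on a typical x86_64 config; those values
and the helper names are assumptions for illustration and are not taken
from either patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: 4KB base pages, MAX_ORDER of 11. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_MAX_ORDER  11

/* Size in bytes of one free block of the given buddy order. */
static size_t block_bytes(unsigned int order)
{
	assert(order < SKETCH_MAX_ORDER);
	return (size_t)1 << (SKETCH_PAGE_SHIFT + order);
}

/*
 * Number of blocks of min_order needed to cover free_bytes of memory if
 * none of them merged into a higher order before being reported.
 */
static size_t blocks_to_report(size_t free_bytes, unsigned int min_order)
{
	return free_bytes / block_bytes(min_order);
}
```

Under those assumptions an order 9 block is 2MB and an order 10 block is
4MB, so for 1GB of free memory blocks_to_report() gives 512 blocks at
order 9 versus 256 at order 10. Unmerged buddies alone would account for
at most a 2x difference in blocks reported, and that is a count of
blocks, not of vmexits, since a single request may carry several blocks.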

> Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
> affined to a single host-NUMA:
>
> On 16th vCPU:
> With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
> % Dip on the number of Processes = 1.3 %
> % Dip on the number of  Threads  = 5.7 %
>
> With PAGE_REPORTING_MIN_ORDER as (pageblock_order):
> % Dip on the number of Processes = 5 %
> % Dip on the number of  Threads  = 20 %

So I don't hold much faith in the thread numbers. I have seen
run-to-run variability as high as 14%.

> Michal's suggestion:
> I was able to get a prototype working that uses the page-isolation API:
> start_isolate_page_range()/undo_isolate_page_range().
> But the issue mentioned above was also evident with it.
>
> Hence, I think before moving to the decision whether I want to use
> __isolate_free_page() which isolates pages from the buddy or
> start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
> it is important for me to resolve the above-mentioned issue.

I'd be curious how you are avoiding causing memory starvation if you
are isolating ranges of memory that have been recently freed.

> Previous discussions:
> More about how we ended up with these two approaches could be found at [3] &
> [4] explained by Alexander & David.
>
> [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
> [2] https://lkml.org/lkml/2019/10/2/425
> [3] https://lkml.org/lkml/2019/10/23/1166
> [4] https://lkml.org/lkml/2019/9/12/48
>

So one thing you may want to consider would be how placement of the
buffers will impact your performance.

One thing I realized I was doing wrong with my approach was scanning
for pages starting at the tail and then working up. It greatly hurt
the efficiency of my search since in the standard case most of the
free memory will be placed at the head and only with shuffling enabled
do I really need to worry about things getting mixed up with the tail.
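
A minimal sketch of why the scan direction matters. It models a free
list as an array of flags and assumes freshly freed (unreported) pages
sit near the head while already reported pages sit toward the tail. The
array model and the name scan_cost() are hypothetical and only meant to
illustrate the point, not to mirror the actual list walking code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count the already reported entries that get visited before the first
 * unreported one is found, walking from the head (index 0) or from the
 * tail (index n - 1). Returns n if every entry is already reported.
 */
static size_t scan_cost(const bool *reported, size_t n, bool from_tail)
{
	size_t visited;

	assert(reported != NULL || n == 0);
	for (visited = 0; visited < n; visited++) {
		size_t idx = from_tail ? n - 1 - visited : visited;

		if (!reported[idx])
			return visited;
	}
	return n;
}
```

With two unreported pages at the head followed by four reported pages,
a scan from the head skips nothing, while a scan from the tail has to
step over all four reported entries first.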

I suspect you may be similarly making things more difficult for
yourself by placing the reported pages back on the head of the list
instead of placing them at the tail where they will not be reallocated
immediately.
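
To make the placement effect concrete, here is a small user-space
sketch. It assumes the allocator hands out pages from the head of the
free list; the struct free_list model and both helper names are
hypothetical and not taken from either series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LIST_MAX 64

struct free_list {
	bool reported[LIST_MAX];	/* index 0 is the head */
	size_t len;
};

/* Put one reported page back on the list, at the head or at the tail. */
static void put_back_reported(struct free_list *fl, bool at_tail)
{
	size_t i;

	assert(fl->len < LIST_MAX);
	if (at_tail) {
		fl->reported[fl->len++] = true;
		return;
	}
	/* Shift everything down one slot to make room at the head. */
	for (i = fl->len; i > 0; i--)
		fl->reported[i] = fl->reported[i - 1];
	fl->reported[0] = true;
	fl->len++;
}

/* Allocate n pages from the head; return how many were reported pages. */
static size_t alloc_from_head(struct free_list *fl, size_t n)
{
	size_t i, hits = 0;

	assert(n <= fl->len);
	for (i = 0; i < n; i++)
		hits += fl->reported[i];
	/* Drop the allocated entries by shifting the remainder up. */
	for (i = n; i < fl->len; i++)
		fl->reported[i - n] = fl->reported[i];
	fl->len -= n;
	return hits;
}
```

Starting from four unreported pages, putting two reported pages back at
the head means the next two allocations both land on reported pages,
while putting them at the tail means neither does. A reported page that
is reallocated right away typically has to be faulted back in by the
host on first touch, so head placement would be expected to add exits
on exactly the pages that were just handed back.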