Date: Wed, 17 Feb 2021 12:46:19 -0800
From: Minchan Kim
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, cgoldswo@codeaurora.org,
	linux-fsdevel@vger.kernel.org, willy@infradead.org, david@redhat.com,
	vbabka@suse.cz, viro@zeniv.linux.org.uk, joaodias@google.com
Subject: Re: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily
References: <20210216170348.1513483-1-minchan@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Wed, Feb 17, 2021 at 09:59:54AM +0100, Michal Hocko wrote:
> On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> > An LRU pagevec holds a refcount on its pages until the pagevec is
> > drained. That can prevent migration, since the refcount of the page
> > is then greater than what the migration logic expects. To mitigate
> > the issue, callers of migrate_pages drain the LRU pagevecs via
> > migrate_prep or lru_add_drain_all before calling migrate_pages.
> >
> > However, that is not enough: pages entering a pagevec after the
> > draining call can still sit in the pagevec and keep preventing page
> > migration. Since some callers of migrate_pages have retry logic with
> > LRU draining, the page would migrate on the next try, but this is
> > still fragile in that it doesn't close the fundamental race between
> > pages entering a pagevec and migration, so the migration failure
> > could cause contiguous memory allocation failure in the end.
>
> Please put some numbers on how often this happens here.
Actually, it's hard to get real numbers on CMA allocation failures since
we haven't introduced any CMA allocation statistics. I am working on
making the failures more visible:

https://lore.kernel.org/linux-mm/20210217170025.512704-1-minchan@kernel.org/

That said, CMA allocation failure is one of the most common bug reports
I get periodically from field testing. I couldn't say they all come from
the pagevec issue, since it's *really hard* to find out who was the
long-time holder of the additional refcount at the moment of failure.
However, I kept tracking down pinners in drivers and resolving them, yet
the problem is still being reported, and I saw that the pagevec is one
of the common places holding an additional refcount.

I think the data below is not exactly what you want to see, but it is
how I could measure how effectively this patch reduces the
vulnerability. Quoting from Matthew's question:

What I measured was how many times migrate_pages retried with force
mode, using the debug code below. The test was Android app launching
with CMA allocation in the background. The total CMA allocation count
was about 500 during the entire test, and I saw about 400 retries with
the debug code below. With this patchset (with a bug fix), the retry
count was reduced to under 30.

```
diff --git a/mm/migrate.c b/mm/migrate.c
index 04a98bb2f568..caa661be2d16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1459,6 +1459,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 					private, page, pass > 2, mode, reason);
+			if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
+				printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
+				dump_page(page, "fail to migrate");
+			}
+
```
(Note: the comma after KERN_ERR in the original snippet was a typo; the
log level is concatenated with the format string.)

IMO, rather than relying on retries, it would be better to close such a
race from the beginning, like the free memory isolation logic in
start_isolate_page_range. Furthermore, the migrate_pending function will
provide a way to detect migration-in-progress in other vulnerable places
in the future.
> > The other concern is that migration keeps retrying until the pages
> > in the pagevec are drained. During that time, migration repeatedly
> > allocates a target page, unmaps the source page from the page tables
> > of processes, then learns of the failure, restores the original page
> > into the page tables of processes, and frees the target page, which
> > is also not good.
>
> This is not good for performance you mean, right?

It's not good for allocation latency as well as performance, but that's
not the top reason for this work. The goal is mostly to close the race
in order to reduce the potential sites of migration failure for CMA.

> > To solve the issue, this patch tries to close the race rather than
> > relying on retries and luck. The idea is to introduce a
> > migration-in-progress tracking count, with an IPI barrier after
> > atomically updating the count, to minimize read-side overhead.
> >
> > migrate_prep increases migrate_pending_count under the lock, and the
> > IPI call guarantees every CPU sees the up-to-date value of
> > migrate_pending_count. Then it drains the pagevecs via
> > lru_add_drain_all. From then on, no LRU page can reach a pagevec,
> > since the LRU handling functions skip the batching if migration is
> > in progress, by checking migrate_pending() (IOW, the pagevecs should
> > be empty until migration is done). Every caller of migrate_prep
> > should call migrate_finish in pair to decrease the migration
> > tracking count.
>
> migrate_prep already does schedule draining on each cpu which has
> pages queued. Why isn't it enough to disable pcp lru caches right
> before draining in migrate_prep? More on the IPI side below.

Yub, with a simple tweak of lru_add_drain_all(bool disable), queueing
the drain work on every CPU, it would work.

> [...]
> > +static DEFINE_SPINLOCK(migrate_pending_lock);
> > +static unsigned long migrate_pending_count;
> > +static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
> > +
> > +static void read_migrate_pending(struct work_struct *work)
> > +{
> > +	/* TODO : not sure it's needed */
> > +	unsigned long dummy = __READ_ONCE(migrate_pending_count);
> > +	(void)dummy;
>
> What are you trying to achieve here? Are you just trying to enforce a
> read memory barrier here?

I thought the IPI via migrate_pending_work would guarantee the memory
ordering, so it wouldn't be needed, but I wanted to get attention from
reviewers. If it's not needed, I will drop it.

> > +}
> > +
> > +bool migrate_pending(void)
> > +{
> > +	return migrate_pending_count;
> > +}
> > +
> >  /*
> >   * migrate_prep() needs to be called before we start compiling a list of pages
> >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > @@ -64,11 +80,27 @@
> >   */
> >  void migrate_prep(void)
> >  {
> > +	unsigned int cpu;
> > +
> > +	spin_lock(&migrate_pending_lock);
> > +	migrate_pending_count++;
> > +	spin_unlock(&migrate_pending_lock);
>
> I suspect you do not want to add atomic_read inside hot paths, right?
> Is this really something that we have to microoptimize for?
> atomic_read is a simple READ_ONCE on many archs.

It's also spin_lock_irqsave on some archs. If the new synchronization
were heavily complicated, atomics would be the better simple start, but
I thought this locking scheme is simple enough that there is no need to
add an atomic operation on the read side.