Date: Wed, 17 Feb 2021 12:46:19 -0800
From: Minchan Kim
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, cgoldswo@codeaurora.org,
	linux-fsdevel@vger.kernel.org, willy@infradead.org, david@redhat.com,
	vbabka@suse.cz, viro@zeniv.linux.org.uk, joaodias@google.com
Subject: Re: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily
References: <20210216170348.1513483-1-minchan@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Wed, Feb 17, 2021 at 09:59:54AM +0100, Michal Hocko wrote:
> On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> > An LRU pagevec holds a refcount on its pages until the pagevec is
> > drained. That can prevent migration, since the refcount of the page
> > is then greater than what the migration logic expects. To mitigate
> > the issue, callers of migrate_pages drain the LRU pagevecs via
> > migrate_prep or lru_add_drain_all before calling migrate_pages.
> >
> > However, that is not enough: pages entering a pagevec after the
> > draining call can still sit in the pagevec and keep preventing page
> > migration. Since some callers of migrate_pages have retry logic with
> > LRU draining, the page would migrate on the next try, but this is
> > still fragile in that it doesn't close the fundamental race between
> > pages entering a pagevec and migration, so the migration failure
> > could cause contiguous memory allocation failure in the end.
>
> Please put some numbers on how often this happens here.
Actually, it's hard to get real numbers on CMA allocation failures since
we haven't introduced any CMA allocation statistics. I am working on
making the failures more visible:

https://lore.kernel.org/linux-mm/20210217170025.512704-1-minchan@kernel.org/

That said, CMA allocation failure is one of the most common bug reports
I get periodically from field testing. I couldn't say they all come from
the pagevec issue, since it's *really hard* to find out who was the
long-time holder of the additional refcount at the moment of failure.
However, I kept tracking down pinners in drivers and resolving them, yet
the problem is still being reported, and I saw that the pagevec is one
of the common places holding an additional refcount.

I think the data below is not exactly what you want to see, but it is
how I could measure how effectively this patch reduces the
vulnerability. Quoting from Matthew's question:

What I measured was how many times migrate_pages retried with force
mode, using the debug code below. The test was Android app launching
with CMA allocation in the background. The total CMA allocation count
was about 500 during the entire test, and I saw about 400 retries with
the debug code below. With this patchset (with a bug fix), the retry
count was reduced to under 30.

```
diff --git a/mm/migrate.c b/mm/migrate.c
index 04a98bb2f568..caa661be2d16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1459,6 +1459,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 					private, page, pass > 2, mode, reason);
+			if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
+				printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
+				dump_page(page, "fail to migrate");
+			}
+
```
(Note: the comma after KERN_ERR in the original snippet was a typo; the
log level is concatenated with the format string.)

IMO, rather than relying on retries, it would be better to close such a
race from the beginning, like the free memory isolation logic in
start_isolate_page_range. Furthermore, the migrate_pending function will
provide a way to detect migration-in-progress in other vulnerable places
in the future.
> > The other concern is that migration keeps retrying until the pages
> > in the pagevec are drained. During that time, migration repeatedly
> > allocates a target page, unmaps the source page from the page tables
> > of processes, then learns of the failure, restores the original page
> > into the page tables of processes, and frees the target page, which
> > is also not good.
>
> This is not good for performance you mean, right?

It's not good for allocation latency as well as performance, but that's
not the top reason for this work. The goal is mostly to close the race
in order to reduce the potential sites of migration failure for CMA.

> > To solve the issue, this patch tries to close the race rather than
> > relying on retries and luck. The idea is to introduce a
> > migration-in-progress tracking count, with an IPI barrier after
> > atomically updating the count, to minimize read-side overhead.
> >
> > migrate_prep increases migrate_pending_count under the lock, and the
> > IPI call guarantees every CPU sees the up-to-date value of
> > migrate_pending_count. Then it drains the pagevecs via
> > lru_add_drain_all. From then on, no LRU page can reach a pagevec,
> > since the LRU handling functions skip the batching if migration is
> > in progress, by checking migrate_pending() (IOW, the pagevecs should
> > be empty until migration is done). Every caller of migrate_prep
> > should call migrate_finish in pair to decrease the migration
> > tracking count.
>
> migrate_prep already does schedule draining on each cpu which has
> pages queued. Why isn't it enough to disable pcp lru caches right
> before draining in migrate_prep? More on the IPI side below.

Yub, with a simple tweak of lru_add_drain_all(bool disable), queueing
the drain work on every CPU, it would work.

> [...]
> > +static DEFINE_SPINLOCK(migrate_pending_lock);
> > +static unsigned long migrate_pending_count;
> > +static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
> > +
> > +static void read_migrate_pending(struct work_struct *work)
> > +{
> > +	/* TODO : not sure it's needed */
> > +	unsigned long dummy = __READ_ONCE(migrate_pending_count);
> > +	(void)dummy;
>
> What are you trying to achieve here? Are you just trying to enforce a
> read memory barrier here?

I thought the IPI via migrate_pending_work would guarantee the memory
ordering, so it wouldn't be needed, but I wanted to get attention from
reviewers. If it's not needed, I will drop it.

> > +}
> > +
> > +bool migrate_pending(void)
> > +{
> > +	return migrate_pending_count;
> > +}
> > +
> >  /*
> >   * migrate_prep() needs to be called before we start compiling a list of pages
> >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > @@ -64,11 +80,27 @@
> >   */
> >  void migrate_prep(void)
> >  {
> > +	unsigned int cpu;
> > +
> > +	spin_lock(&migrate_pending_lock);
> > +	migrate_pending_count++;
> > +	spin_unlock(&migrate_pending_lock);
>
> I suspect you do not want to add atomic_read inside hot paths, right?
> Is this really something that we have to microoptimize for?
> atomic_read is a simple READ_ONCE on many archs.

It's also spin_lock_irqsave on some archs. If the new synchronization
were heavily complicated, atomics would be the better simple start, but
I thought this locking scheme is simple enough that there is no need to
add an atomic operation on the read side.