From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=i6x1=N7=vger.kernel.org=linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.1 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A5D5BC43441
	for <linux-kernel@archiver.kernel.org>; Tue, 20 Nov 2018 05:45:06 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 49851208E4
	for <linux-kernel@archiver.kernel.org>; Tue, 20 Nov 2018 05:45:06 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="SCRTG0sD"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 49851208E4
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1731873AbeKTQMW (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Tue, 20 Nov 2018 11:12:22 -0500
Received: from mail-pg1-f195.google.com ([209.85.215.195]:33814 "EHLO
        mail-pg1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726883AbeKTQMV (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 20 Nov 2018 11:12:21 -0500
Received: by mail-pg1-f195.google.com with SMTP id 17so407075pgg.1
        for <linux-kernel@vger.kernel.org>; Mon, 19 Nov 2018 21:45:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=date:from:to:cc:subject:in-reply-to:message-id:references
         :user-agent:mime-version;
        bh=5su8qWxOqOD+Tjuw3ssqyEPe5GYRzq+MX4KmJD9vE8M=;
        b=SCRTG0sDQJ9XJIHTy2XeM/WZkeocXWSbZT5fqjvlJg8c75C7xJQlvAifP0M0373Hbm
         UwhZG91TAjow1KLogCFJEwxhXCHDMaXNTpAnN5rhW38okevf0HJV5CEX+FYdIVzqKcJN
         ent+UUvGNr3c+M4n+2beAude67Vxw/ZPxJA2C1vM108wxd7i2SRjL+BqheteCvl+AXOu
         f4k9DPMnWVPGT9aaG581Js5k6Xm7FVVqjc03AuH12zpbU/gfjppfV80Zf/evmRMjyf1T
         kA61BFI/AJGwe3P3eo2kgTevf3OD5WZdxE242xv9dEg9zrdVIjiqugYV6KlZM6bxa4aK
         J1kQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id
         :references:user-agent:mime-version;
        bh=5su8qWxOqOD+Tjuw3ssqyEPe5GYRzq+MX4KmJD9vE8M=;
        b=GEDGXBrZmt5t6n6z2zGcgNowVWxFfGr6szdcmcYyUtbdnu+h1ai+/oLVaU6jqztq+J
         eUcyf5E4SPRrA4A40OBpnm6TTEfiCpDGrGSgU/5f4V3SJYji/IujzWVX+jzkJFsKGsea
         hOsYND1MHWuWoPFMk4LbE5AzjE4Bf9fcCmzCC/6shH9/hhZukzMKOWOIrxQLTJB1nL/g
         6j4Bm67iGLRgMxhuyCEH1gaA35clpKOUf4LpKVMN2lE1Y11UaoFM5DVBAbI1dCiLB8Qf
         uXoaoLvss4n1Znd318cIzIXTkbocs8/9USdmLemhERxnopqF2tx0lJEoOyyq24lOGdTA
         8I0A==
X-Gm-Message-State: AA+aEWb3Zg28s/vBToReZ4eBqUb4Wk6SYtWXofK0Md9E8QwLhyWR7CjK
        eCOXhmQDxnd5AZamT6+s5YXiqg==
X-Google-Smtp-Source: AFSGD/UCJV7LJk/IBXeghmo6M33JsU3RgS16Ve7gaayjjyx6hB5dWh6tXkzXYzG+EkLTHTpoo0r+kQ==
X-Received: by 2002:a63:fa02:: with SMTP id y2mr679843pgh.177.1542692702826;
        Mon, 19 Nov 2018 21:45:02 -0800 (PST)
Received: from [100.112.89.103] ([104.133.8.103])
        by smtp.gmail.com with ESMTPSA id s37sm37292983pgm.19.2018.11.19.21.45.01
        (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
        Mon, 19 Nov 2018 21:45:01 -0800 (PST)
Date:   Mon, 19 Nov 2018 21:44:41 -0800 (PST)
From:   Hugh Dickins <hughd@google.com>
X-X-Sender: hugh@eggly.anvils
To:     Baoquan He <bhe@redhat.com>
cc:     Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@kernel.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>, linux-mm@kvack.org,
        pifang@redhat.com, linux-kernel@vger.kernel.org,
        akpm@linux-foundation.org, aarcange@redhat.com,
        Mel Gorman <mgorman@suse.de>
Subject: Re: Memory hotplug softlock issue
In-Reply-To: <20181120015644.GA5727@MiWiFi-R3L-srv>
Message-ID: <alpine.LSU.2.11.1811192127130.2848@eggly.anvils>
References: <20181115143204.GV23831@dhcp22.suse.cz> <20181116012433.GU2653@MiWiFi-R3L-srv> <20181116091409.GD14706@dhcp22.suse.cz> <20181119105202.GE18471@MiWiFi-R3L-srv> <20181119124033.GJ22247@dhcp22.suse.cz> <20181119125121.GK22247@dhcp22.suse.cz>
 <20181119141016.GO22247@dhcp22.suse.cz> <20181119173312.GV22247@dhcp22.suse.cz> <alpine.LSU.2.11.1811191215290.15640@eggly.anvils> <20181119205907.GW22247@dhcp22.suse.cz> <20181120015644.GA5727@MiWiFi-R3L-srv>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/19/18 at 09:59pm, Michal Hocko wrote:
> > On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > > I'm glad that I delayed, what I had then (migration_waitqueue instead
> > > of using page_waitqueue) was not wrong, but what I've been using the
> > > last couple of months is rather better (and can be put to use to solve
> > > similar problems in collapsing pages on huge tmpfs. but we don't need
> > > to get into that at this time): put_and_wait_on_page_locked().
> > > 
> > > What I have not yet done is verify it on latest kernel, and research
> > > the interested Cc list (Linus and Tim Chen come immediately to mind),
> > > and write the commit comment. I have some testing to do on the latest
> > > kernel today, so I'll throw put_and_wait_on_page_locked() in too,
> > > and post tomorrow I hope.
> > 
> > Cool, it seems that Baoquan has a reliable test case to trigger the
> > pathological case.
> 
> Yes. I will test Hugh's patch.

Thanks: I've completed some of the retesting now, so it would probably
help us all better if I post the patch in this thread, even without
completing its description and links and Cc list yet - there isn't
even a problem description below, I still have to paste that in from
the unposted patch that I made six months ago.  Here is today's...


[PATCH] mm: put_and_wait_on_page_locked() while page is migrated

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it.  That does
mean that this case cannot go back through the loop: but that's fine for
the page migration case, and even if used more widely, is limited by the
"Stop walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using negative
value of the lock arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow:
I have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state,
so arrange it accordingly - but that may be nothing but pedantic.

shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
survived a lot of testing before that showed up.  It does raise the
question: should is_page_cache_freeable() and __remove_mapping() now
treat a PG_waiters page as if an extra reference were held?  Perhaps,
but I don't think it matters much, since shrink_page_list() already
had to win its trylock_page(), so waiters are not very common there: I
noticed no difference when trying the bigger change, and it's surely not
needed while put_and_wait_on_page_locked() is only for page migration.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 53 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c        |  6 ++---
 mm/migrate.c            | 12 ++++------
 mm/vmscan.c             | 11 +++++----
 5 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
+extern void put_and_wait_on_page_locked(struct page *page);
+
 /* 
  * Wait for a page to complete writeback
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..ef82119032d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
-	/* Stop walking if it's locked */
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
 	if (test_bit(key->bit_nr, &key->page->flags))
 		return -1;
 
@@ -1050,13 +1057,14 @@ static void wake_up_page(struct page *page, int bit)
 }
 
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-		struct page *page, int bit_nr, int state, bool lock)
+		struct page *page, int bit_nr, int state, int lock)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool thrashing = false;
 	unsigned long pflags;
 	int ret = 0;
+	bool bit_is_set;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
@@ -1067,7 +1075,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	}
 
 	init_wait(wait);
-	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+	wait->flags = lock > 0 ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
@@ -1084,14 +1092,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 		spin_unlock_irq(&q->lock);
 
-		if (likely(test_bit(bit_nr, &page->flags))) {
+		bit_is_set = test_bit(bit_nr, &page->flags);
+		if (lock < 0)
+			put_page(page);
+
+		if (likely(bit_is_set))
 			io_schedule();
-		}
 
-		if (lock) {
+		if (lock > 0) {
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
-		} else {
+		} else if (lock == 0) {
 			if (!test_bit(bit_nr, &page->flags))
 				break;
 		}
@@ -1100,6 +1111,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			ret = -EINTR;
 			break;
 		}
+
+		if (lock < 0) {
+			/*
+			 * We can no longer safely access page->flags:
+			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+			 * there is a risk of waiting forever on a page reused
+			 * for something that keeps it locked indefinitely.
+			 * But best check for -EINTR above before breaking.
+			 */
+			break;
+		}
 	}
 
 	finish_wait(q, wait);
@@ -1124,17 +1146,26 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, 0);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, 0);
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
+void put_and_wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *q;
+
+	page = compound_head(page);
+	q = page_waitqueue(page);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, -1);
+}
+
 /**
  * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
  * @page: Page defining the wait queue of interest
@@ -1264,7 +1295,7 @@ void __lock_page(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, 1);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -1272,7 +1303,7 @@ int __lock_page_killable(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, 1);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622cced74fd9..832ab11badc2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1501,8 +1501,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
-		put_page(page);
+		put_and_wait_on_page_locked(page);
 		goto out;
 	}
 
@@ -1538,8 +1537,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
-		put_page(page);
+		put_and_wait_on_page_locked(page);
 		goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..acda06f99754 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 
 	/*
 	 * Once page cache replacement of page migration started, page_count
-	 * *must* be zero. And, we don't want to call wait_on_page_locked()
-	 * against a page without get_page().
-	 * So, we use get_page_unless_zero(), here. Even failed, page fault
-	 * will occur again.
+	 * is zero; but we must not call put_and_wait_on_page_locked() without
+	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
 	 */
 	if (!get_page_unless_zero(page))
 		goto out;
 	pte_unmap_unlock(ptep, ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 out:
 	pte_unmap_unlock(ptep, ptl);
@@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 	if (!get_page_unless_zero(page))
 		goto unlock;
 	spin_unlock(ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 unlock:
 	spin_unlock(ptl);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..85d9dde31153 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1459,11 +1459,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
+		 * from pagecache). Usually we can use non-atomic bitops now,
+		 * but beware: earlier calls to put_and_wait_on_page_locked()
+		 * might still be waiting.
 		 */
-		__ClearPageLocked(page);
+		if (unlikely(PageWaiters(page)))
+			unlock_page(page);
+		else
+			__ClearPageLocked(page);
 free_it:
 		nr_reclaimed++;
 
-- 
2.19.1.1215.g8438c0b245-goog