* [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2)
@ 2022-06-23 23:51 Naoya Horiguchi
  2022-06-23 23:51 ` [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages() Naoya Horiguchi
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

Here is v2 of the "enabling memory error handling on 1GB hugepage" patchset.

Major updates:

  - (patch 3/9) I made pud_huge() and follow_huge_pud() aware of non-present
    pud entries (based on Miaohe's comment).

  - (patch 4/9 and patch 5/9) I extended the mechanism to save raw error
    info to support multiple error subpages in a hugepage.  Additionally,
    I added an "unreliable" flag which prevents freeing a hwpoison hugetlb
    page if any raw error info is lost.

  - (patch 1/9 and 2/9) While testing some common cases for 1GB hugepages,
    I found a few issues in the existing code, so this series starts
    by fixing them.

The remaining patches should have only minor updates since v1.

Patch dependency:

- "mm/memory-failure: disable unpoison once hw error happens"
  (actually the conflict is not a logical one, but adding MF_SW_SIMULATED to
   mf_flags conflicts with patch 6/9.)

v1: https://lore.kernel.org/linux-mm/20220602050631.771414-1-naoya.horiguchi@linux.dev/T/#u

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (9):
      mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
      mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
      mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
      mm, hwpoison, hugetlb: support saving mechanism of raw error pages
      mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
      mm, hwpoison: set PG_hwpoison for busy hugetlb pages
      mm, hwpoison: make __page_handle_poison returns int
      mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
      mm, hwpoison: enable memory error handling on 1GB hugepage

 arch/x86/mm/hugetlbpage.c |   3 +-
 include/linux/hugetlb.h   |  13 ++++
 include/linux/mm.h        |   2 +-
 include/linux/swapops.h   |   9 +++
 include/ras/ras_event.h   |   1 -
 mm/hugetlb.c              |  78 ++++++++++++++--------
 mm/memory-failure.c       | 163 +++++++++++++++++++++++++++++++++++++---------
 7 files changed, 209 insertions(+), 60 deletions(-)



* [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-24  2:25   ` Miaohe Lin
  2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

I found a weird state of 1GB hugepage pool, caused by the following
procedure:

  - run a process reserving all free 1GB hugepages,
  - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
    /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
  - kill the reserving process.

, then all the hugepages are free *and* surplus at the same time.

  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
  0
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
  3

This state is resolved by reserving and allocating the pages then
freeing them again, so this seems not to result in a serious problem.
But it's a little surprising (shrinking the pool suddenly fails).
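
For reference, here is a minimal userspace sketch of the reserving process
(an illustrative model, not the exact reproducer; the 3GB size and the
fallback MAP_HUGE_1GB definition are assumptions):

  /* reserve.c: hold a reservation on free 1GB hugepages until killed */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MAP_HUGE_1GB
  #define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
  #endif

  int main(void)
  {
  	size_t len = 3UL << 30;	/* matches the 3-page pool shown above */
  	/* a private hugetlb mapping reserves the pages at mmap() time,
  	 * without faulting them in */
  	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
  		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
  		       -1, 0);

  	if (p == MAP_FAILED) {
  		perror("mmap");
  		return 1;
  	}
  	/* now write 0 to .../hugepages-1048576kB/nr_hugepages from another
  	 * shell, then kill this process */
  	pause();
  	return 0;
  }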

This behavior is caused by hstate_is_gigantic() check in
return_unused_surplus_pages(). This was introduced so long ago in 2008
by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
it seems to me that this check is no longer unnecessary. Let's remove it.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/hugetlb.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..c538278170a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
-	/* Cannot return gigantic pages currently */
-	if (hstate_is_gigantic(h))
-		goto out;
-
 	/*
 	 * Part (or even all) of the reservation could have been backed
 	 * by pre-allocated pages. Only free surplus pages.
-- 
2.25.1




* [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
  2022-06-23 23:51 ` [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages() Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-24  9:23   ` Miaohe Lin
                     ` (2 more replies)
  2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Originally copy_hugetlb_page_range() handles migration entries and hwpoisone
entries in similar manner.  But recently the related code path has more code
for migration entries, and when is_writable_migration_entry() was converted
to !is_readable_migration_entry(), hwpoison entries on source processes got
to be unexpectedly updated (which is legitimate for migration entries, but
not for hwpoison entries).  This results in unexpected serious issues like
kernel panic when foking processes with hwpoison entries in pmd.

Separate the if branch into one for hwpoison entries and one for migration
entries.

Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org> # 5.18
---
 mm/hugetlb.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c538278170a2..f59f43c06601 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4784,8 +4784,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * sharing with another vma.
 			 */
 			;
-		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
-				    is_hugetlb_entry_hwpoisoned(entry))) {
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
+			bool uffd_wp = huge_pte_uffd_wp(entry);
+
+			if (!userfaultfd_wp(dst_vma) && uffd_wp)
+				entry = huge_pte_clear_uffd_wp(entry);
+			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+		} else if (unlikely(is_hugetlb_entry_migration(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
 			bool uffd_wp = huge_pte_uffd_wp(entry);
 
-- 
2.25.1




* [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
  2022-06-23 23:51 ` [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages() Naoya Horiguchi
  2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-24  8:40   ` HORIGUCHI NAOYA(堀口 直也)
                     ` (2 more replies)
  2022-06-23 23:51 ` [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages Naoya Horiguchi
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

follow_pud_mask() does not support non-present pud entries now.  As far as
I tested on an x86_64 server, follow_pud_mask() still simply returns
no_page_table() for a non-present pud entry due to pud_bad(), so no severe
user-visible effect should happen.  But generally we should call
follow_huge_pud() for non-present pud entries of 1GB hugetlb pages.

Update pud_huge() and huge_pud() to handle non-present pud entries.  The
changes are similar to the previous work for pmd entries: commit e66f17ff7177
("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
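
As a quick sanity check of the new x86 pud_huge() condition, here is a
userspace model (a sketch, with the x86 bit values _PAGE_PRESENT == bit 0
and _PAGE_PSE == bit 7 assumed, and pud_none() simplified to a zero test):

  #include <stdio.h>

  #define _PAGE_PRESENT	(1UL << 0)
  #define _PAGE_PSE	(1UL << 7)

  /* true for a present leaf (PSE set) and for a non-present (swap-format)
   * entry such as a migration or hwpoison entry; false for a none entry
   * and for a present pointer to a page table */
  static int pud_huge_model(unsigned long pud)
  {
  	return pud != 0 &&
  		(pud & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
  }

  int main(void)
  {
  	printf("none:           %d\n", pud_huge_model(0));		/* 0 */
  	printf("1GB leaf:       %d\n",
  	       pud_huge_model(_PAGE_PRESENT | _PAGE_PSE));		/* 1 */
  	printf("hwpoison entry: %d\n", pud_huge_model(0x2UL << 9));	/* 1 */
  	printf("page table:     %d\n",
  	       pud_huge_model(_PAGE_PRESENT | 0x1000));			/* 0 */
  	return 0;
  }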

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 arch/x86/mm/hugetlbpage.c |  3 ++-
 mm/hugetlb.c              | 26 +++++++++++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index a0d023cb4292..5fb86fb49ba8 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -70,7 +70,8 @@ int pmd_huge(pmd_t pmd)
 
 int pud_huge(pud_t pud)
 {
-	return !!(pud_val(pud) & _PAGE_PSE);
+	return !pud_none(pud) &&
+		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
 }
 #endif
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f59f43c06601..b7ae5f73f3b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6946,10 +6946,34 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
+	struct page *page = NULL;
+	spinlock_t *ptl;
+	pte_t pte;
+
 	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
-	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+retry:
+	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
+	if (!pud_huge(*pud))
+		goto out;
+	pte = huge_ptep_get((pte_t *)pud);
+	if (pte_present(pte)) {
+		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
+	} else {
+		if (is_hugetlb_entry_migration(pte)) {
+			spin_unlock(ptl);
+			__migration_entry_wait(mm, (pte_t *)pud, ptl);
+			goto retry;
+		}
+	}
+out:
+	spin_unlock(ptl);
+	return page;
 }
 
 struct page * __weak
-- 
2.25.1




* [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-27  3:16   ` Miaohe Lin
  2022-06-27  9:26   ` Muchun Song
  2022-06-23 23:51 ` [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage Naoya Horiguchi
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When handling a memory error on a hugetlb page, the error handler tries to
dissolve it and turn it into 4kB pages.  If it's successfully dissolved, the
PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolving sometimes fails, and then the error page is left as a
hwpoisoned hugepage.  It would be useful if we could retry dissolving it to
save the healthy pages, but that's not possible now because the information
about where the raw error pages are is lost.

Use the private field of a few tail pages to keep that information.  The
code path that shrinks the hugepage pool uses this info to retry the
dissolve later.  In order to remember multiple errors in a hugepage, a
singly linked list originating from the SUBPAGE_INDEX_HWPOISON-th tail page
is constructed.  Only simple operations (adding an entry or clearing all)
are required and the list is assumed not to be very long, so this simple
data structure should be enough.

If we fail to save raw error info, the hwpoison hugepage has errors on
unknown subpages and this new saving mechanism no longer works, so disable
saving new raw error info and freeing hwpoison hugepages.
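
To illustrate the intended usage pattern, here is a userspace sketch of the
data structure (a model, not the kernel code; in the kernel the list head
lives in the ->private word of the SUBPAGE_INDEX_HWPOISON-th tail page and
uses llist):

  #include <stdio.h>
  #include <stdlib.h>

  struct raw_hwp_page {
  	struct raw_hwp_page *next;
  	unsigned long pfn;		/* stands in for struct page * */
  };

  static struct raw_hwp_page *head;	/* one word, like page_private() */
  static int unreliable;

  /* record one raw error page; mirrors hugetlb_set_page_hwpoison() */
  static int record_raw_hwp_page(unsigned long pfn)
  {
  	struct raw_hwp_page *p;

  	if (unreliable)
  		return -1;		/* raw info already lost, don't add more */
  	for (p = head; p; p = p->next)
  		if (p->pfn == pfn)
  			return -1;	/* this subpage is already recorded */
  	p = malloc(sizeof(*p));
  	if (!p) {
  		unreliable = 1;		/* never free/dissolve this hugepage */
  		return 0;
  	}
  	p->pfn = pfn;
  	p->next = head;
  	head = p;
  	return 0;
  }

  /* clear all entries; mirrors hugetlb_clear_page_hwpoison() */
  static void clear_raw_hwp_pages(void)
  {
  	while (head) {
  		struct raw_hwp_page *p = head;

  		head = p->next;
  		printf("move PG_hwpoison to pfn %#lx\n", p->pfn);
  		free(p);
  	}
  }

  int main(void)
  {
  	record_raw_hwp_page(0x1000);	/* two errors in one hugepage */
  	record_raw_hwp_page(0x1008);
  	clear_raw_hwp_pages();		/* e.g. on successful dissolve */
  	return 0;
  }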

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
v1 -> v2:
- support hwpoison hugepage with multiple errors,
- moved the new interface functions to mm/memory-failure.c,
- define additional subpage index SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
- stop freeing/dissolving hwpoison hugepages with unreliable raw error info,
- drop hugetlb_clear_page_hwpoison() in dissolve_free_huge_page() because
  that's done in update_and_free_page(),
- move setting/clearing of the PG_hwpoison flag to the new interfaces,
- check whether the page is already hwpoisoned on a subpage basis.

ChangeLog since previous post on 4/27:
- fixed typo in patch description (by Miaohe)
- fixed config value in #ifdef statement (by Miaohe)
- added sentences about "multiple hwpoison pages" scenario in patch
  description

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/hugetlb.h | 13 ++++++
 mm/hugetlb.c            | 39 +++++++++--------
 mm/memory-failure.c     | 95 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 125 insertions(+), 22 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..ac13c2022517 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -42,6 +42,10 @@ enum {
 	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
 	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
 	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+	SUBPAGE_INDEX_HWPOISON,
+	SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
 #endif
 	__NR_USED_SUBPAGE,
 };
@@ -798,6 +802,15 @@ extern int dissolve_free_huge_page(struct page *page);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
 
+#ifdef CONFIG_MEMORY_FAILURE
+extern int hugetlb_clear_page_hwpoison(struct page *hpage);
+#else
+static inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
 #ifndef arch_hugetlb_migration_supported
 static inline bool arch_hugetlb_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b7ae5f73f3b2..19ef427aa1e8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1541,17 +1541,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	if (hugetlb_vmemmap_alloc(h, page)) {
-		spin_lock_irq(&hugetlb_lock);
-		/*
-		 * If we cannot allocate vmemmap pages, just refuse to free the
-		 * page and put the page back on the hugetlb free list and treat
-		 * as a surplus page.
-		 */
-		add_hugetlb_page(h, page, true);
-		spin_unlock_irq(&hugetlb_lock);
-		return;
-	}
+	if (hugetlb_vmemmap_alloc(h, page))
+		goto fail;
+
+	/*
+	 * Move PageHWPoison flag from head page to the raw error pages,
+	 * which makes any healthy subpages reusable.
+	 */
+	if (unlikely(PageHWPoison(page) && hugetlb_clear_page_hwpoison(page)))
+		goto fail;
 
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
@@ -1572,6 +1570,16 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
+	return;
+fail:
+	spin_lock_irq(&hugetlb_lock);
+	/*
+	 * If we cannot allocate vmemmap pages or cannot identify raw hwpoison
+	 * subpages reliably, just refuse to free the page and put the page
+	 * back on the hugetlb free list and treat as a surplus page.
+	 */
+	add_hugetlb_page(h, page, true);
+	spin_unlock_irq(&hugetlb_lock);
 }
 
 /*
@@ -2115,15 +2123,6 @@ int dissolve_free_huge_page(struct page *page)
 		 */
 		rc = hugetlb_vmemmap_alloc(h, head);
 		if (!rc) {
-			/*
-			 * Move PageHWPoison flag from head page to the raw
-			 * error page, which makes any subpages rather than
-			 * the error page reusable.
-			 */
-			if (PageHWPoison(head) && page != head) {
-				SetPageHWPoison(page);
-				ClearPageHWPoison(head);
-			}
 			update_and_free_page(h, head, false);
 		} else {
 			spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fb3feb1f363e..af571fa6f2af 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about a "raw error page",
+ * constructing a singly linked list originating from the ->private field
+ * of the SUBPAGE_INDEX_HWPOISON-th tail page.
+ */
+struct raw_hwp_page {
+	struct llist_node node;
+	struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
+{
+	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
+}
+
+static inline int raw_hwp_unreliable(struct page *hpage)
+{
+	return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
+}
+
+static inline void set_raw_hwp_unreliable(struct page *hpage)
+{
+	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
+}
+
+/*
+ * about race consideration
+ */
+static inline int hugetlb_set_page_hwpoison(struct page *hpage,
+					struct page *page)
+{
+	struct llist_head *head;
+	struct raw_hwp_page *raw_hwp;
+	struct llist_node *t, *tnode;
+	int ret;
+
+	/*
+	 * Once the hwpoison hugepage has lost reliable raw error info,
+	 * there is little point in keeping additional error info precisely,
+	 * so skip adding further raw error info.
+	 */
+	if (raw_hwp_unreliable(hpage))
+		return -EHWPOISON;
+	head = raw_hwp_list_head(hpage);
+	llist_for_each_safe(tnode, t, head->first) {
+		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
+
+		if (p->page == page)
+			return -EHWPOISON;
+	}
+
+	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
+	/* the first error event will be counted in action_result(). */
+	if (ret)
+		num_poisoned_pages_inc();
+
+	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
+	if (raw_hwp) {
+		raw_hwp->page = page;
+		llist_add(&raw_hwp->node, head);
+	} else {
+		/*
+		 * Failed to save raw error info.  We no longer trace all
+		 * hwpoisoned subpages, and we need to refuse to free/dissolve
+		 * this hwpoisoned hugepage.
+		 */
+		set_raw_hwp_unreliable(hpage);
+		return ret;
+	}
+	return ret;
+}
+
+inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+{
+	struct llist_head *head;
+	struct llist_node *t, *tnode;
+
+	if (raw_hwp_unreliable(hpage))
+		return -EBUSY;
+	ClearPageHWPoison(hpage);
+	head = raw_hwp_list_head(hpage);
+	llist_for_each_safe(tnode, t, head->first) {
+		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
+
+		SetPageHWPoison(p->page);
+		kfree(p);
+	}
+	llist_del_all(head);
+	return 0;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
 		goto out;
 	}
 
-	if (TestSetPageHWPoison(head)) {
+	if (hugetlb_set_page_hwpoison(head, page)) {
 		ret = -EHWPOISON;
 		goto out;
 	}
@@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	lock_page(head);
 
 	if (hwpoison_filter(p)) {
-		ClearPageHWPoison(head);
+		hugetlb_clear_page_hwpoison(head);
 		res = -EOPNOTSUPP;
 		goto out;
 	}
-- 
2.25.1




* [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-27  8:32   ` Miaohe Lin
  2022-06-23 23:51 ` [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The raw error info list needs to be removed when a hwpoisoned hugetlb page
is unpoisoned.  And the unpoison handler needs to know how many errors there
were in the target hugepage.  So add both.
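
A small sketch of the accounting intent (hypothetical numbers): a hugepage
that accumulated several raw errors contributes more than one to
num_poisoned_pages, so unpoisoning it has to subtract the number of freed
raw entries rather than decrementing by one:

  #include <stdio.h>

  static long num_poisoned_pages = 3;	/* three errors hit one hugepage */

  /* stands in for free_raw_hwp_pages(): frees the list, returns its length */
  static long free_raw_hwp_pages_model(void)
  {
  	return 3;
  }

  int main(void)
  {
  	long count = free_raw_hwp_pages_model();

  	num_poisoned_pages -= count;		/* num_poisoned_pages_sub(count) */
  	printf("%ld\n", num_poisoned_pages);	/* 0, not 2 */
  	return 0;
  }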

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/swapops.h |  9 +++++++++
 mm/memory-failure.c     | 34 +++++++++++++++++++++++++++-------
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 9584c2e1c54a..ad62776ee99c 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -494,6 +494,11 @@ static inline void num_poisoned_pages_dec(void)
 	atomic_long_dec(&num_poisoned_pages);
 }
 
+static inline void num_poisoned_pages_sub(long i)
+{
+	atomic_long_sub(i, &num_poisoned_pages);
+}
+
 #else
 
 static inline swp_entry_t make_hwpoison_entry(struct page *page)
@@ -514,6 +519,10 @@ static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
 static inline void num_poisoned_pages_inc(void)
 {
 }
+
+static inline void num_poisoned_pages_sub(long i)
+{
+}
 #endif
 
 static inline int non_swap_entry(swp_entry_t entry)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index af571fa6f2af..6005e953d011 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1571,23 +1571,34 @@ static inline int hugetlb_set_page_hwpoison(struct page *hpage,
 	return ret;
 }
 
-inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+static inline long free_raw_hwp_pages(struct page *hpage, bool move_flag)
 {
 	struct llist_head *head;
 	struct llist_node *t, *tnode;
+	long count = 0;
 
-	if (raw_hwp_unreliable(hpage))
-		return -EBUSY;
-	ClearPageHWPoison(hpage);
 	head = raw_hwp_list_head(hpage);
 	llist_for_each_safe(tnode, t, head->first) {
 		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
 
-		SetPageHWPoison(p->page);
+		if (move_flag)
+			SetPageHWPoison(p->page);
 		kfree(p);
+		count++;
 	}
 	llist_del_all(head);
-	return 0;
+	return count;
+}
+
+inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+{
+	int ret = -EBUSY;
+
+	if (raw_hwp_unreliable(hpage))
+		return ret;
+	ret = !TestClearPageHWPoison(hpage);
+	free_raw_hwp_pages(hpage, true);
+	return ret;
 }
 
 /*
@@ -1729,6 +1740,10 @@ static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *
 {
 	return 0;
 }
+
+static inline void free_raw_hwp_pages(struct page *hpage, bool move_flag)
+{
+}
 #endif
 
 static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
@@ -2188,6 +2203,7 @@ int unpoison_memory(unsigned long pfn)
 	struct page *p;
 	int ret = -EBUSY;
 	int freeit = 0;
+	long count = 1;
 	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
 					DEFAULT_RATELIMIT_BURST);
 
@@ -2235,6 +2251,8 @@ int unpoison_memory(unsigned long pfn)
 
 	ret = get_hwpoison_page(p, MF_UNPOISON);
 	if (!ret) {
+		if (PageHuge(p))
+			count = free_raw_hwp_pages(page, false);
 		ret = TestClearPageHWPoison(page) ? 0 : -EBUSY;
 	} else if (ret < 0) {
 		if (ret == -EHWPOISON) {
@@ -2243,6 +2261,8 @@ int unpoison_memory(unsigned long pfn)
 			unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
 					 pfn, &unpoison_rs);
 	} else {
+		if (PageHuge(p))
+			count = free_raw_hwp_pages(page, false);
 		freeit = !!TestClearPageHWPoison(p);
 
 		put_page(page);
@@ -2255,7 +2275,7 @@ int unpoison_memory(unsigned long pfn)
 unlock_mutex:
 	mutex_unlock(&mf_mutex);
 	if (!ret || freeit) {
-		num_poisoned_pages_dec();
+		num_poisoned_pages_sub(count);
 		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
 				 page_to_pfn(p), &unpoison_rs);
 	}
-- 
2.25.1




* [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-27  8:39   ` Miaohe Lin
  2022-06-23 23:51 ` [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

If memory_failure() fails to grab the page refcount on a hugetlb page
because it's busy, it returns without setting PG_hwpoison on it.  This not
only loses a chance of error containment, but also breaks the rule that
action_result() should be called only when memory_failure() does some
handling work (even if that's just setting PG_hwpoison).  This
inconsistency could harm code maintainability.

So set PG_hwpoison by calling hugetlb_set_page_hwpoison() in such a case.
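
The retry state moves from a local variable into mf_flags so that the
retried call can be recognized down in __get_huge_page_for_hwpoison(); a
userspace sketch of the pattern (simplified, with a fake one-shot -EBUSY
source):

  #include <stdio.h>

  #define EBUSY		16
  #define MF_NO_RETRY	(1 << 6)

  static int busy_once = 1;	/* first grab attempt fails, second works */

  static int get_huge_page(int flags)
  {
  	if (busy_once) {
  		busy_once = 0;
  		return -EBUSY;
  	}
  	return 0;
  }

  static int try_memory_failure(int flags)
  {
  retry:
  	if (get_huge_page(flags) == -EBUSY) {
  		if (!(flags & MF_NO_RETRY)) {
  			flags |= MF_NO_RETRY;	/* callees can see it's a retry */
  			goto retry;
  		}
  		return -EBUSY;			/* second failure: give up */
  	}
  	return 0;
  }

  int main(void)
  {
  	printf("%d\n", try_memory_failure(0));	/* 0: succeeded on retry */
  	return 0;
  }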

Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h  | 1 +
 mm/memory-failure.c | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4346e51484ba..044dc5a2e361 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3233,6 +3233,7 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
+	MF_NO_RETRY = 1 << 6,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6005e953d011..ce045d0d6115 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1632,7 +1632,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
 			count_increased = true;
 	} else {
 		ret = -EBUSY;
-		goto out;
+		if (!(flags & MF_NO_RETRY))
+			goto out;
 	}
 
 	if (hugetlb_set_page_hwpoison(head, page)) {
@@ -1659,7 +1660,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	struct page *p = pfn_to_page(pfn);
 	struct page *head;
 	unsigned long page_flags;
-	bool retry = true;
 
 	*hugetlb = 1;
 retry:
@@ -1675,8 +1675,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 		}
 		return res;
 	} else if (res == -EBUSY) {
-		if (retry) {
-			retry = false;
+		if (!(flags & MF_NO_RETRY)) {
+			flags |= MF_NO_RETRY;
 			goto retry;
 		}
 		action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
-- 
2.25.1




* [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-27  9:02   ` Miaohe Lin
  2022-06-23 23:51 ` [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
  2022-06-23 23:51 ` [PATCH v2 9/9] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

__page_handle_poison() currently returns a bool that shows whether
take_page_off_buddy() has succeeded or not.  But we will want to
distinguish another case, "dissolve succeeded but taking off failed",
by its return value.  So change the type of the return value.
No functional change.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ce045d0d6115..db85f644a1e3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -71,7 +71,13 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
 
-static bool __page_handle_poison(struct page *page)
+/*
+ * Return values:
+ *   1:   the page is dissolved (if needed) and taken off from buddy,
+ *   0:   the page is dissolved (if needed) and not taken off from buddy,
+ *   < 0: failed to dissolve.
+ */
+static int __page_handle_poison(struct page *page)
 {
 	int ret;
 
@@ -81,7 +87,7 @@ static bool __page_handle_poison(struct page *page)
 		ret = take_page_off_buddy(page);
 	zone_pcp_enable(page_zone(page));
 
-	return ret > 0;
+	return ret;
 }
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
@@ -91,7 +97,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 		 * Doing this check for free pages is also fine since dissolve_free_huge_page
 		 * returns 0 for non-hugetlb pages as well.
 		 */
-		if (!__page_handle_poison(page))
+		if (__page_handle_poison(page) <= 0)
 			/*
 			 * We could fail to take off the target page from buddy
 			 * for example due to racy page allocation, but that's
@@ -1048,7 +1054,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * subpages.
 		 */
 		put_page(hpage);
-		if (__page_handle_poison(p)) {
+		if (__page_handle_poison(p) > 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
@@ -1698,8 +1704,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	 */
 	if (res == 0) {
 		unlock_page(head);
-		res = MF_FAILED;
-		if (__page_handle_poison(p)) {
+		if (__page_handle_poison(p) > 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
-- 
2.25.1




* [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-27 12:24   ` Miaohe Lin
  2022-06-23 23:51 ` [PATCH v2 9/9] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Currently, if memory_failure() (modified to remove the blocking code in a
subsequent patch) is called on a page in some 1GB hugepage, memory error
handling fails and the raw error page gets into a leaked state.  The impact
is small in production systems (just a single leaked 4kB page), but it
limits testability because unpoison doesn't work for it.
We can no longer create a 1GB hugepage on the 1GB physical address range
containing such leaked pages, which is inconvenient when testing on small
systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
broken down into raw error pages before coming to this point:

        if (unlikely(PageHWPoison(page)) && !order) {
                ...
                return false;
        }

Then, the page is not sent to the buddy allocator and its refcount is left
at 0.

Originally this check was supposed to trigger when the error page is freed
from page_handle_poison() (which is called from soft-offline), but now we
are opening another path that reaches it, so the callers of
__page_handle_poison() need to handle this case by considering a return
value of 0 as success.  Then the page refcount for hwpoison is properly
incremented, so unpoison works.
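
Putting patch 7/9 and this patch together, here is a sketch of the return
value contract and the two caller policies (a model, not the kernel code):

  #include <stdio.h>

  /* model of __page_handle_poison():
   *   1  - dissolved (if needed) and taken off buddy
   *   0  - dissolved (if needed) but not taken off buddy
   *  <0  - failed to dissolve
   */
  static int page_handle_poison_model(int dissolved, int taken_off)
  {
  	if (!dissolved)
  		return -16;	/* like -EBUSY */
  	return taken_off ? 1 : 0;
  }

  int main(void)
  {
  	int ret = page_handle_poison_model(1, 0);

  	/* before this patch: only ret > 0 counted as recovered */
  	printf("strict:  %s\n", ret > 0 ? "MF_RECOVERED" : "MF_FAILED");
  	/* after this patch: ret == 0 also counts, because the raw error
  	 * page was already intercepted in free_pages_prepare() */
  	printf("relaxed: %s\n", ret >= 0 ? "MF_RECOVERED" : "MF_FAILED");
  	return 0;
  }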

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index db85f644a1e3..fc7b83cb6468 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1046,7 +1046,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
 		unlock_page(hpage);
 	} else {
-		res = MF_FAILED;
 		unlock_page(hpage);
 		/*
 		 * migration entry prevents later access on error hugepage,
@@ -1054,9 +1053,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * subpages.
 		 */
 		put_page(hpage);
-		if (__page_handle_poison(p) > 0) {
+		if (__page_handle_poison(p) >= 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
+		} else {
+			res = MF_FAILED;
 		}
 	}
 
@@ -1704,9 +1705,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	 */
 	if (res == 0) {
 		unlock_page(head);
-		if (__page_handle_poison(p) > 0) {
+		if (__page_handle_poison(p) >= 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
+		} else {
+			res = MF_FAILED;
 		}
 		action_result(pfn, MF_MSG_FREE_HUGE, res);
 		return res == MF_RECOVERED ? 0 : -EBUSY;
-- 
2.25.1




* [PATCH v2 9/9] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2022-06-23 23:51 ` [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
@ 2022-06-23 23:51 ` Naoya Horiguchi
  2022-06-28  2:06   ` Miaohe Lin
  8 siblings, 1 reply; 45+ messages in thread
From: Naoya Horiguchi @ 2022-06-23 23:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now that the error handling code is prepared, remove the blocking code and
enable memory error handling on 1GB hugepages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h      |  1 -
 include/ras/ras_event.h |  1 -
 mm/memory-failure.c     | 16 ----------------
 3 files changed, 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 044dc5a2e361..9d7e9b5a4d1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3284,7 +3284,6 @@ enum mf_action_page_type {
 	MF_MSG_DIFFERENT_COMPOUND,
 	MF_MSG_HUGE,
 	MF_MSG_FREE_HUGE,
-	MF_MSG_NON_PMD_HUGE,
 	MF_MSG_UNMAP_FAILED,
 	MF_MSG_DIRTY_SWAPCACHE,
 	MF_MSG_CLEAN_SWAPCACHE,
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index d0337a41141c..cbd3ddd7c33d 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -360,7 +360,6 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
-	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
 	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
 	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
 	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc7b83cb6468..33521e059f7f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -728,7 +728,6 @@ static const char * const action_page_types[] = {
 	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
-	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
 	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
 	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
 	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
@@ -1717,21 +1716,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 
 	page_flags = head->flags;
 
-	/*
-	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
-	 * simply disable it. In order to make it work properly, we need
-	 * make sure that:
-	 *  - conversion of a pud that maps an error hugetlb into hwpoison
-	 *    entry properly works, and
-	 *  - other mm code walking over page table is aware of pud-aligned
-	 *    hwpoison entries.
-	 */
-	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
-		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
-		res = -EBUSY;
-		goto out;
-	}
-
 	if (!hwpoison_user_mappings(p, pfn, flags, head)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-- 
2.25.1




* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-23 23:51 ` [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages() Naoya Horiguchi
@ 2022-06-24  2:25   ` Miaohe Lin
  2022-06-24  8:03     ` Muchun Song
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-24  2:25 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> I found a weird state of 1GB hugepage pool, caused by the following
> procedure:
> 
>   - run a process reserving all free 1GB hugepages,
>   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
>     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
>   - kill the reserving process.
> 
> , then all the hugepages are free *and* surplus at the same time.
> 
>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>   3
>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
>   3
>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
>   0
>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
>   3
> 
> This state is resolved by reserving and allocating the pages then
> freeing them again, so this seems not to result in a serious problem.
> But it's a little surprising (shrinking the pool suddenly fails).
> 
> This behavior is caused by hstate_is_gigantic() check in
> return_unused_surplus_pages(). This was introduced so long ago in 2008
> by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
> it seems to me that this check is no longer unnecessary. Let's remove it.

s/unnecessary/necessary/

> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/hugetlb.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..c538278170a2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	/* Uncommit the reservation */
>  	h->resv_huge_pages -= unused_resv_pages;
>  
> -	/* Cannot return gigantic pages currently */
> -	if (hstate_is_gigantic(h))
> -		goto out;
> -

IIUC it might be better to do the below check:
	/*
	 * Cannot return gigantic pages currently if runtime gigantic page
	 * allocation is not supported.
	 */
	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
		goto out;

But I might be missing something.

Thanks.

>  	/*
>  	 * Part (or even all) of the reservation could have been backed
>  	 * by pre-allocated pages. Only free surplus pages.
> 




* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-24  2:25   ` Miaohe Lin
@ 2022-06-24  8:03     ` Muchun Song
  2022-06-24  8:15       ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Muchun Song @ 2022-06-24  8:03 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Naoya Horiguchi,
	linux-kernel, Linux-MM

On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > I found a weird state of 1GB hugepage pool, caused by the following
> > procedure:
> > 
> >   - run a process reserving all free 1GB hugepages,
> >   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
> >     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
> >   - kill the reserving process.
> > 
> > , then all the hugepages are free *and* surplus at the same time.
> > 
> >   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >   3
> >   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> >   3
> >   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
> >   0
> >   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
> >   3
> > 
> > This state is resolved by reserving and allocating the pages then
> > freeing them again, so this seems not to result in a serious problem.
> > But it's a little surprising (shrinking the pool suddenly fails).
> > 
> > This behavior is caused by hstate_is_gigantic() check in
> > return_unused_surplus_pages(). This was introduced so long ago in 2008
> > by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
> > it seems to me that this check is no longer unnecessary. Let's remove it.
> 
> s/unnecessary/necessary/
> 
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >  mm/hugetlb.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a57e1be41401..c538278170a2 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> >  	/* Uncommit the reservation */
> >  	h->resv_huge_pages -= unused_resv_pages;
> >  
> > -	/* Cannot return gigantic pages currently */
> > -	if (hstate_is_gigantic(h))
> > -		goto out;
> > -
> 
> IIUC it might be better to do the below check:
> 	/*
> 	 * Cannot return gigantic pages currently if runtime gigantic page
> 	 * allocation is not supported.
> 	 */
> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> 		goto out;
>

The change looks good to me. However, the comments above are unnecessary
since gigantic_page_runtime_supported() is straightforward.

Thanks.
 
> But I might be missing something.
> 
> Thanks.
> 
> >  	/*
> >  	 * Part (or even all) of the reservation could have been backed
> >  	 * by pre-allocated pages. Only free surplus pages.
> > 
> 
> 



* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-24  8:03     ` Muchun Song
@ 2022-06-24  8:15       ` Miaohe Lin
  2022-06-24  8:34         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-24  8:15 UTC (permalink / raw)
  To: Muchun Song
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/24 16:03, Muchun Song wrote:
> On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
>> On 2022/6/24 7:51, Naoya Horiguchi wrote:
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>
>>> I found a weird state of 1GB hugepage pool, caused by the following
>>> procedure:
>>>
>>>   - run a process reserving all free 1GB hugepages,
>>>   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
>>>     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
>>>   - kill the reserving process.
>>>
>>> , then all the hugepages are free *and* surplus at the same time.
>>>
>>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>   3
>>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
>>>   3
>>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
>>>   0
>>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
>>>   3
>>>
>>> This state is resolved by reserving and allocating the pages then
>>> freeing them again, so this seems not to result in a serious problem.
>>> But it's a little surprising (shrinking the pool suddenly fails).
>>>
>>> This behavior is caused by hstate_is_gigantic() check in
>>> return_unused_surplus_pages(). This was introduced so long ago in 2008
>>> by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
>>> it seems to me that this check is no longer unnecessary. Let's remove it.
>>
>> s/unnecessary/necessary/
>>
>>>
>>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>> ---
>>>  mm/hugetlb.c | 4 ----
>>>  1 file changed, 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index a57e1be41401..c538278170a2 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
>>>  	/* Uncommit the reservation */
>>>  	h->resv_huge_pages -= unused_resv_pages;
>>>  
>>> -	/* Cannot return gigantic pages currently */
>>> -	if (hstate_is_gigantic(h))
>>> -		goto out;
>>> -
>>
>> IIUC it might be better to do the below check:
>> 	/*
>> 	 * Cannot return gigantic pages currently if runtime gigantic page
>> 	 * allocation is not supported.
>> 	 */
>> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>> 		goto out;
>>
> 
> The change looks good to me. However, the comments above are unnecessary
> since gigantic_page_runtime_supported() is straightforward.

Agree. The comments can be removed.

> 
> Thanks.

Thanks for reviewing.

>  
>> But I might be missing something.
>>
>> Thanks.
>>
>>>  	/*
>>>  	 * Part (or even all) of the reservation could have been backed
>>>  	 * by pre-allocated pages. Only free surplus pages.
>>>
>>
>>
> .
> 




* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-24  8:15       ` Miaohe Lin
@ 2022-06-24  8:34         ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-24 19:11           ` Mike Kravetz
  0 siblings, 1 reply; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-24  8:34 UTC (permalink / raw)
  To: Miaohe Lin, Muchun Song
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, linux-kernel, Linux-MM

On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> On 2022/6/24 16:03, Muchun Song wrote:
> > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>>
> >>> I found a weird state of 1GB hugepage pool, caused by the following
> >>> procedure:
> >>>
> >>>   - run a process reserving all free 1GB hugepages,
> >>>   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
> >>>     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
> >>>   - kill the reserving process.
> >>>
> >>> , then all the hugepages are free *and* surplus at the same time.
> >>>
> >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>>   3
> >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> >>>   3
> >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
> >>>   0
> >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
> >>>   3
> >>>
> >>> This state is resolved by reserving and allocating the pages then
> >>> freeing them again, so this seems not to result in a serious problem.
> >>> But it's a little surprising (shrinking the pool suddenly fails).
> >>>
> >>> This behavior is caused by hstate_is_gigantic() check in
> >>> return_unused_surplus_pages(). This was introduced so long ago in 2008
> >>> by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
> >>> it seems to me that this check is no longer unnecessary. Let's remove it.
> >>
> >> s/unnecessary/necessary/

Thanks, I fixed it.

> >>
> >>>
> >>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>> ---
> >>>  mm/hugetlb.c | 4 ----
> >>>  1 file changed, 4 deletions(-)
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index a57e1be41401..c538278170a2 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> >>>  	/* Uncommit the reservation */
> >>>  	h->resv_huge_pages -= unused_resv_pages;
> >>>  
> >>> -	/* Cannot return gigantic pages currently */
> >>> -	if (hstate_is_gigantic(h))
> >>> -		goto out;
> >>> -
> >>
> >> IIUC it might be better to do the below check:
> >> 	/*
> >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> >> 	 * allocation is not supported.
> >> 	 */
> >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> >> 		goto out;
> >>
> > 
> > The change looks good to me. However, the comments above are unnecessary
> > since gigantic_page_runtime_supported() is straightforward.
> 
> Agree. The comments can be removed.

Thank you, both. Adding !gigantic_page_runtime_supported without comment
makes sense to me.

Thanks,
Naoya Horiguchi


* Re: [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
@ 2022-06-24  8:40   ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-25  0:02   ` Mike Kravetz
  2022-06-25  9:42   ` Miaohe Lin
  2 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-24  8:40 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

Sorry, I found that $SUBJECT mentions a wrong function name (I meant
follow_huge_pud(), not huge_pud()); this will be fixed in a later version.

On Fri, Jun 24, 2022 at 08:51:47AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> follow_pud_mask() does not support non-present pud entries now.  As far as
> I tested on an x86_64 server, follow_pud_mask() still simply returns
> no_page_table() for a non-present pud entry due to pud_bad(), so no severe
> user-visible effect should happen.  But generally we should call
> follow_huge_pud() for non-present pud entries of 1GB hugetlb pages.
> 
> Update pud_huge() and huge_pud() to handle non-present pud entries.  The

here is the same typo, too.

- Naoya Horiguchi

> changes are similar to the previous work for pmd entries: commit e66f17ff7177
> ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
> cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---


* Re: [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
@ 2022-06-24  9:23   ` Miaohe Lin
  2022-06-27  6:06     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-24 20:57   ` Mike Kravetz
  2022-06-27  7:57   ` Muchun Song
  2 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-24  9:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Originally copy_hugetlb_page_range() handles migration entries and hwpoisone

s/hwpoisone/hwpoisoned/

> entries in similar manner.  But recently the related code path has more code
> for migration entries, and when is_writable_migration_entry() was converted
> to !is_readable_migration_entry(), hwpoison entries on source processes got
> to be unexpectedly updated (which is legitimate for migration entries, but
> not for hwpoison entries).  This results in unexpected serious issues like
> kernel panic when foking processes with hwpoison entries in pmd.

s/foking/forking/

> 
> Separate the if branch into one for hwpoison entries and one for migration
> entries.
> 
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org> # 5.18

This makes sense to me. Thanks for fixing this.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> ---
>  mm/hugetlb.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c538278170a2..f59f43c06601 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4784,8 +4784,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			 * sharing with another vma.
>  			 */
>  			;
> -		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
> -				    is_hugetlb_entry_hwpoisoned(entry))) {
> +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
> +			bool uffd_wp = huge_pte_uffd_wp(entry);
> +
> +			if (!userfaultfd_wp(dst_vma) && uffd_wp)
> +				entry = huge_pte_clear_uffd_wp(entry);
> +			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> +		} else if (unlikely(is_hugetlb_entry_migration(entry))) {
>  			swp_entry_t swp_entry = pte_to_swp_entry(entry);
>  			bool uffd_wp = huge_pte_uffd_wp(entry);
>  
> 




* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-24  8:34         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-24 19:11           ` Mike Kravetz
  2022-06-27  6:02             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Kravetz @ 2022-06-24 19:11 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Miaohe Lin, Muchun Song, Naoya Horiguchi, Andrew Morton,
	David Hildenbrand, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel, Linux-MM

On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> > On 2022/6/24 16:03, Muchun Song wrote:
> > > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> > >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > >>>
> > >>> I found a weird state of 1GB hugepage pool, caused by the following
> > >>> procedure:
> > >>>
> > >>>   - run a process reserving all free 1GB hugepages,
> > >>>   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
> > >>>     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
> > >>>   - kill the reserving process.
> > >>>
> > >>> , then all the hugepages are free *and* surplus at the same time.
> > >>>
> > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > >>>   3
> > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> > >>>   3
> > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
> > >>>   0
> > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
> > >>>   3
> > >>>
> > >>> This state is resolved by reserving and allocating the pages then
> > >>> freeing them again, so this seems not to result in a serious problem.
> > >>> But it's a little surprising (shrinking pool suddenly fails).
> > >>>
> > >>> This behavior is caused by hstate_is_gigantic() check in
> > >>> return_unused_surplus_pages(). This was introduced so long ago in 2008
> > >>> by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
> > >>> it seems to me that this check is no longer necessary. Let's remove it.

Thank you for finding this!!!

> > >>> +++ b/mm/hugetlb.c
> > >>> @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> > >>>  	/* Uncommit the reservation */
> > >>>  	h->resv_huge_pages -= unused_resv_pages;
> > >>>  
> > >>> -	/* Cannot return gigantic pages currently */
> > >>> -	if (hstate_is_gigantic(h))
> > >>> -		goto out;
> > >>> -
> > >>
> > >> IIUC it might be better to do the below check:
> > >> 	/*
> > >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> > >> 	 * allocation is not supported.
> > >> 	 */
> > >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> > >> 		goto out;
> > >>
> > > 
> > > The change looks good to me. However, the comments above is unnecessary
> > > since gigantic_page_runtime_supported() is straightforward.
> > 
> > Agree. The comments can be removed.
> 
> Thank you, both. Adding !gigantic_page_runtime_supported without comment
> makes sense to me.

The change above makes sense to me.  However, ...

If we make the change above, will we have the same strange situation described
in the commit message when !gigantic_page_runtime_supported() is true?

IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
pages can not be allocated or freed at run time.  They can only be
allocated at boot time.  So, there should NEVER be surplus gigantic
pages if !gigantic_page_runtime_supported().  To avoid this situation,
perhaps we should change set_max_huge_pages as follows (not tested)?

-- 
Mike Kravetz


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5eabb8009d8a..c78d6c20e6b0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3416,7 +3416,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * the user tries to allocate gigantic pages but let the user free the
 	 * boottime allocated gigantic pages.
 	 */
-	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
+	if (hstate_is_gigantic(h) && (!IS_ENABLED(CONFIG_CONTIG_ALLOC) ||
+					!gigantic_page_runtime_supported())) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock_irq(&hugetlb_lock);
 			mutex_unlock(&h->resize_lock);
@@ -3464,6 +3465,19 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			goto out;
 	}
 
+	/*
+	 * We can not decrease gigantic pool size if runtime modification
+	 * is not supported.
+	 */
+	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) {
+		if (count < persistent_huge_pages(h)) {
+			spin_unlock_irq(&hugetlb_lock);
+			mutex_unlock(&h->resize_lock);
+			NODEMASK_FREE(node_alloc_noretry);
+			return -EINVAL;
+		}
+	}
+
 	/*
 	 * Decrease the pool size
 	 * First return free pages to the buddy allocator (being careful
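
As background for the gigantic_page_runtime_supported() calls in the diff
above: around this series' base the helper is just a thin compile-time
wrapper over the arch capability, roughly (paraphrased, not verbatim):

	static inline bool gigantic_page_runtime_supported(void)
	{
		return IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE);
	}

So on architectures without CONFIG_ARCH_HAS_GIGANTIC_PAGE the new -EINVAL
branch above would reject any attempt to shrink the gigantic pool below
the boot-time allocation.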


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
  2022-06-24  9:23   ` Miaohe Lin
@ 2022-06-24 20:57   ` Mike Kravetz
  2022-06-27  6:48     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-27  7:57   ` Muchun Song
  2 siblings, 1 reply; 45+ messages in thread
From: Mike Kravetz @ 2022-06-24 20:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

On 06/24/22 08:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Originally copy_hugetlb_page_range() handles migration entries and hwpoisone
> entries in similar manner.  But recently the related code path has more code
> for migration entries, and when is_writable_migration_entry() was converted
> to !is_readable_migration_entry(), hwpoison entries on source processes got
> to be unexpectedly updated (which is legitimate for migration entries, but
> not for hwpoison entries).  This results in unexpected serious issues like
> kernel panic when foking processes with hwpoison entries in pmd.
> 
> Separate the if branch into one for hwpoison entries and one for migration
> entries.
> 
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org> # 5.18
> ---
>  mm/hugetlb.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)

Thank you!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
(with typos pointed out by Miaohe Lin)

Just curious, are there any hwpoisoned tests in any test suite?  I run
libhugetlbfs tests and ltp on a regular basis which sometimes catch
regressions.  If there are no tests in any suite today, this might be
something we want to consider for future work.

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
  2022-06-24  8:40   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-25  0:02   ` Mike Kravetz
  2022-06-27  7:07     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-25  9:42   ` Miaohe Lin
  2 siblings, 1 reply; 45+ messages in thread
From: Mike Kravetz @ 2022-06-25  0:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

On 06/24/22 08:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> follow_pud_mask() does not support non-present pud entry now.  As long as
> I tested on x86_64 server, follow_pud_mask() still simply returns
> no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
> user-visible effect should happen.  But generally we should call
> follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
> 
> Update pud_huge() and huge_pud() to handle non-present pud entries.  The
> changes are similar to previous works for pud entries commit e66f17ff7177
> ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
> cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  arch/x86/mm/hugetlbpage.c |  3 ++-
>  mm/hugetlb.c              | 26 +++++++++++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)

Thanks.  Overall looks good with typos corrected that you already noticed.
One question below.
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index a0d023cb4292..5fb86fb49ba8 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -70,7 +70,8 @@ int pmd_huge(pmd_t pmd)
>  
>  int pud_huge(pud_t pud)
>  {
> -	return !!(pud_val(pud) & _PAGE_PSE);
> +	return !pud_none(pud) &&
> +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
>  }
>  #endif
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f59f43c06601..b7ae5f73f3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6946,10 +6946,34 @@ struct page * __weak
>  follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  		pud_t *pud, int flags)
>  {
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
> +	pte_t pte;
> +
>  	if (flags & (FOLL_GET | FOLL_PIN))
>  		return NULL;
>  
> -	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +retry:
> +	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
> +	if (!pud_huge(*pud))
> +		goto out;
> +	pte = huge_ptep_get((pte_t *)pud);
> +	if (pte_present(pte)) {
> +		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {

The call to try_grab_page() seems a bit strange since flags will not include
FOLL_GET | FOLL_PIN, as tested above.  Is try_grab_page() always going to be
a no-op here?

-- 
Mike Kravetz

> +			page = NULL;
> +			goto out;
> +		}
> +	} else {
> +		if (is_hugetlb_entry_migration(pte)) {
> +			spin_unlock(ptl);
> +			__migration_entry_wait(mm, (pte_t *)pud, ptl);
> +			goto retry;
> +		}
> +	}
> +out:
> +	spin_unlock(ptl);
> +	return page;
>  }
>  
>  struct page * __weak
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
  2022-06-24  8:40   ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-25  0:02   ` Mike Kravetz
@ 2022-06-25  9:42   ` Miaohe Lin
  2022-06-27  7:24     ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-25  9:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> follow_pud_mask() does not support non-present pud entry now.  As long as
> I tested on x86_64 server, follow_pud_mask() still simply returns
> no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
> user-visible effect should happen.  But generally we should call
> follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
> 
> Update pud_huge() and huge_pud() to handle non-present pud entries.  The
> changes are similar to previous works for pud entries commit e66f17ff7177
> ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
> cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  arch/x86/mm/hugetlbpage.c |  3 ++-
>  mm/hugetlb.c              | 26 +++++++++++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index a0d023cb4292..5fb86fb49ba8 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -70,7 +70,8 @@ int pmd_huge(pmd_t pmd)
>  

No strong opinion but a comment similar to pmd_huge might be better?

/*
 * pmd_huge() returns 1 if @pmd is hugetlb related entry, that is normal
 * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
 * Otherwise, returns 0.
 */

>  int pud_huge(pud_t pud)
>  {
> -	return !!(pud_val(pud) & _PAGE_PSE);
> +	return !pud_none(pud) &&
> +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
>  }
>  #endif
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f59f43c06601..b7ae5f73f3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6946,10 +6946,34 @@ struct page * __weak
>  follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  		pud_t *pud, int flags)
>  {
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
> +	pte_t pte;
> +
>  	if (flags & (FOLL_GET | FOLL_PIN))
>  		return NULL;

Should the above check be modified? It seems the below try_grab_page might not grab the page as
expected (as Mike pointed out). Or the extra page refcnt is unneeded?

>  
> -	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +retry:
> +	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
> +	if (!pud_huge(*pud))
> +		goto out;
> +	pte = huge_ptep_get((pte_t *)pud);
> +	if (pte_present(pte)) {
> +		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> +			page = NULL;
> +			goto out;
> +		}
> +	} else {
> +		if (is_hugetlb_entry_migration(pte)) {
> +			spin_unlock(ptl);
> +			__migration_entry_wait(mm, (pte_t *)pud, ptl);
> +			goto retry;
> +		}

Again. No strong opinion but a comment similar to follow_huge_pmd might be better?

/*
 * hwpoisoned entry is treated as no_page_table in
 * follow_page_mask().
 */

Thanks!

> +	}
> +out:
> +	spin_unlock(ptl);
> +	return page;
>  }
>  
>  struct page * __weak
> 
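
To spell out the truth table behind the proposed pud_huge() check (a
reviewer's aid only, treating pud_none() as a plain zero check; the bit
values are the usual x86 ones):

	#include <stdio.h>

	#define _PAGE_PRESENT	0x001UL		/* bit 0 */
	#define _PAGE_PSE	0x080UL		/* bit 7, the leaf bit */

	/* same logic as the patch, with pud_none() folded in */
	static int pud_huge_sketch(unsigned long v)
	{
		return v != 0 &&
			(v & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
	}

	int main(void)
	{
		printf("%d\n", pud_huge_sketch(0));			/* 0: pud_none */
		printf("%d\n", pud_huge_sketch(_PAGE_PRESENT));		/* 0: page-table pointer */
		printf("%d\n", pud_huge_sketch(_PAGE_PRESENT | _PAGE_PSE)); /* 1: present 1GB leaf */
		printf("%d\n", pud_huge_sketch(0x2UL));			/* 1: non-present swap-style
									   entry (migration/hwpoison) */
		return 0;
	}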



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-23 23:51 ` [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages Naoya Horiguchi
@ 2022-06-27  3:16   ` Miaohe Lin
  2022-06-27  7:56     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-27  9:26   ` Muchun Song
  1 sibling, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27  3:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When handling memory error on a hugetlb page, the error handler tries to
> dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> PageHWPoison flag is moved to the raw error page, so that's all right.
> However, dissolve sometimes fails, then the error page is left as
> hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
> healthy pages, but that's not possible now because the information about
> where the raw error pages are located is lost.
> 
> Use the private field of a few tail pages to keep that information.  The
> code path of shrinking hugepage pool uses this info to try delayed dissolve.
> In order to remember multiple errors in a hugepage, a singly-linked list
> originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> simple operations (adding an entry or clearing all) are required and the
> list is assumed not to be very long, so this simple data structure should
> be enough.
> 
> If we failed to save raw error info, the hwpoison hugepage has errors on
> unknown subpage, then this new saving mechanism does not work any more,
> so disable saving new raw error info and freeing hwpoison hugepages.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Many thanks for your patch. This patch looks good to me. Some nits below.

>
<snip>
>
> +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> +					struct page *page)
> +{
> +	struct llist_head *head;
> +	struct raw_hwp_page *raw_hwp;
> +	struct llist_node *t, *tnode;
> +	int ret;
> +
> +	/*
> +	 * Once the hwpoison hugepage has lost reliable raw error info,
> +	 * there is little mean to keep additional error info precisely,

It should be s/mean/meaning/ ?

> +	 * so skip to add additional raw error info.
> +	 */
> +	if (raw_hwp_unreliable(hpage))
> +		return -EHWPOISON;
> +	head = raw_hwp_list_head(hpage);
> +	llist_for_each_safe(tnode, t, head->first) {
> +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> +
> +		if (p->page == page)
> +			return -EHWPOISON;
> +	}
> +
> +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> +	/* the first error event will be counted in action_result(). */
> +	if (ret)
> +		num_poisoned_pages_inc();
> +
> +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
> +	if (raw_hwp) {
> +		raw_hwp->page = page;
> +		llist_add(&raw_hwp->node, head);
> +	} else {
> +		/*
> +		 * Failed to save raw error info.  We no longer trace all
> +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> +		 * this hwpoisoned hugepage.
> +		 */
> +		set_raw_hwp_unreliable(hpage);
> +		return ret;

This "return ret" can be combined into the below one?

> +	}
> +	return ret;
> +}
> +
> +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> +{
> +	struct llist_head *head;
> +	struct llist_node *t, *tnode;
> +
> +	if (raw_hwp_unreliable(hpage))
> +		return -EBUSY;

Can we try freeing the memory of raw_hwp_list to save possible memory? It seems
raw_hwp_list becomes unneeded when raw_hwp_unreliable.

Thanks.

> +	ClearPageHWPoison(hpage);
> +	head = raw_hwp_list_head(hpage);
> +	llist_for_each_safe(tnode, t, head->first) {
> +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> +
> +		SetPageHWPoison(p->page);
> +		kfree(p);
> +	}
> +	llist_del_all(head);
> +	return 0;
> +}
> +
>  /*
>   * Called from hugetlb code with hugetlb_lock held.
>   *
> @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  		goto out;
>  	}
>  
> -	if (TestSetPageHWPoison(head)) {
> +	if (hugetlb_set_page_hwpoison(head, page)) {
>  		ret = -EHWPOISON;
>  		goto out;
>  	}
> @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	lock_page(head);
>  
>  	if (hwpoison_filter(p)) {
> -		ClearPageHWPoison(head);
> +		hugetlb_clear_page_hwpoison(head);
>  		res = -EOPNOTSUPP;
>  		goto out;
>  	}
> 
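
As a side note on the list machinery reviewed here: the head of the list
lives in page_private() of the SUBPAGE_INDEX_HWPOISON-th tail page, and
the operations are the stock <linux/llist.h> lock-less primitives.  A
condensed sketch of the add/clear pattern (kernel context, names from the
patch; GFP_ATOMIC shown because the call can come from atomic context,
whereas the patch as posted uses GFP_KERNEL):

	/* add one raw error page (llist_add is safe against concurrent adders) */
	struct raw_hwp_page *e = kmalloc(sizeof(*e), GFP_ATOMIC);
	if (e) {
		e->page = page;
		llist_add(&e->node, raw_hwp_list_head(hpage));
	}

	/* clear: detach the whole list first, then walk and free it */
	struct llist_node *t, *tnode;
	llist_for_each_safe(tnode, t, llist_del_all(raw_hwp_list_head(hpage))) {
		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
		SetPageHWPoison(p->page);
		kfree(p);
	}

Detaching with llist_del_all() before walking is the canonical way to
avoid racing with a concurrent llist_add(), which relates to the race
Muchun raises later in the thread.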



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-24 19:11           ` Mike Kravetz
@ 2022-06-27  6:02             ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-27 17:25               ` Mike Kravetz
  0 siblings, 1 reply; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  6:02 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Miaohe Lin, Muchun Song, Naoya Horiguchi, Andrew Morton,
	David Hildenbrand, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel, Linux-MM

On Fri, Jun 24, 2022 at 12:11:07PM -0700, Mike Kravetz wrote:
> On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> > > On 2022/6/24 16:03, Muchun Song wrote:
> > > > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> > > >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > >>>
> > > >>> I found a weird state of 1GB hugepage pool, caused by the following
> > > >>> procedure:
> > > >>>
> > > >>>   - run a process reserving all free 1GB hugepages,
> > > >>>   - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
> > > >>>     /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
> > > >>>   - kill the reserving process.
> > > >>>
> > > >>> , then all the hugepages are free *and* surplus at the same time.
> > > >>>
> > > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > > >>>   3
> > > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> > > >>>   3
> > > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
> > > >>>   0
> > > >>>   $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
> > > >>>   3
> > > >>>
> > > >>> This state is resolved by reserving and allocating the pages then
> > > >>> freeing them again, so this seems not to result in a serious problem.
> > > >>> But it's a little surprising (shrinking pool suddenly fails).
> > > >>>
> > > >>> This behavior is caused by hstate_is_gigantic() check in
> > > >>> return_unused_surplus_pages(). This was introduced so long ago in 2008
> > > >>> by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and
> > > >>> it seems to me that this check is no longer necessary. Let's remove it.
> 
> Thank you for finding this!!!
> 
> > > >>> +++ b/mm/hugetlb.c
> > > >>> @@ -2432,10 +2432,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> > > >>>  	/* Uncommit the reservation */
> > > >>>  	h->resv_huge_pages -= unused_resv_pages;
> > > >>>  
> > > >>> -	/* Cannot return gigantic pages currently */
> > > >>> -	if (hstate_is_gigantic(h))
> > > >>> -		goto out;
> > > >>> -
> > > >>
> > > >> IIUC it might be better to do the below check:
> > > >> 	/*
> > > >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> > > >> 	 * allocation is not supported.
> > > >> 	 */
> > > >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> > > >> 		goto out;
> > > >>
> > > > 
> > > > The change looks good to me. However, the comments above is unnecessary
> > > > since gigantic_page_runtime_supported() is straightforward.
> > > 
> > > Agree. The comments can be removed.
> > 
> > Thank you, both. Adding !gigantic_page_runtime_supported without comment
> > makes sense to me.
> 
> The change above makes sense to me.  However, ...
> 
> If we make the change above, will we have the same strange situation described
> in the commit message when !gigantic_page_runtime_supported() is true?
> 
> IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
> pages can not be allocated or freed at run time.  They can only be
> allocated at boot time.  So, there should NEVER be surplus gigantic
> pages if !gigantic_page_runtime_supported().

I have the same understanding as the above.

>  To avoid this situation,
> perhaps we should change set_max_huge_pages as follows (not tested)?

The suggested diff looks clearer about what it does, so I'd like to take it
in the next version.  Then, what do we do about the "if (hstate_is_gigantic())"
check in return_unused_surplus_pages() in the original suggestion?  Should it
be kept as is, removed, or combined with !gigantic_page_runtime_supported()?
I guess that the new checks prevent return_unused_surplus_pages() from being
called during pool shrinking, so the check seems no longer necessary.

Thanks,
Naoya Horiguchi

> 
> -- 
> Mike Kravetz
> 
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5eabb8009d8a..c78d6c20e6b0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3416,7 +3416,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	 * the user tries to allocate gigantic pages but let the user free the
>  	 * boottime allocated gigantic pages.
>  	 */
> -	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
> +	if (hstate_is_gigantic(h) && (!IS_ENABLED(CONFIG_CONTIG_ALLOC) ||
> +					!gigantic_page_runtime_supported())) {
>  		if (count > persistent_huge_pages(h)) {
>  			spin_unlock_irq(&hugetlb_lock);
>  			mutex_unlock(&h->resize_lock);
> @@ -3464,6 +3465,19 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  			goto out;
>  	}
>  
> +	/*
> +	 * We can not decrease gigantic pool size if runtime modification
> +	 * is not supported.
> +	 */
> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) {
> +		if (count < persistent_huge_pages(h)) {
> +			spin_unlock_irq(&hugetlb_lock);
> +			mutex_unlock(&h->resize_lock);
> +			NODEMASK_FREE(node_alloc_noretry);
> +			return -EINVAL;
> +		}
> +	}
> +
>  	/*
>  	 * Decrease the pool size
>  	 * First return free pages to the buddy allocator (being careful

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-24  9:23   ` Miaohe Lin
@ 2022-06-27  6:06     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  6:06 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Fri, Jun 24, 2022 at 05:23:41PM +0800, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > Originally copy_hugetlb_page_range() handles migration entries and hwpoisone
> 
> s/hwpoisone/hwpoisoned/
> 
> > entries in similar manner.  But recently the related code path has more code
> > for migration entries, and when is_writable_migration_entry() was converted
> > to !is_readable_migration_entry(), hwpoison entries on source processes got
> > to be unexpectedly updated (which is legitimate for migration entries, but
> > not for hwpoison entries).  This results in unexpected serious issues like
> > kernel panic when foking processes with hwpoison entries in pmd.
> 
> s/foking/forking/

Sorry for the many typos.  I somehow forgot to run the spell checker.

> 
> > 
> > Separate the if branch into one for hwpoison entries and one for migration
> > entries.
> > 
> > Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: <stable@vger.kernel.org> # 5.18
> 
> This makes sense to me. Thanks for fixing this.
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thank you for reviewing!

- Naoya Horiguchi

> 
> > ---
> >  mm/hugetlb.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c538278170a2..f59f43c06601 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4784,8 +4784,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  			 * sharing with another vma.
> >  			 */
> >  			;
> > -		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
> > -				    is_hugetlb_entry_hwpoisoned(entry))) {
> > +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
> > +			bool uffd_wp = huge_pte_uffd_wp(entry);
> > +
> > +			if (!userfaultfd_wp(dst_vma) && uffd_wp)
> > +				entry = huge_pte_clear_uffd_wp(entry);
> > +			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > +		} else if (unlikely(is_hugetlb_entry_migration(entry))) {
> >  			swp_entry_t swp_entry = pte_to_swp_entry(entry);
> >  			bool uffd_wp = huge_pte_uffd_wp(entry);
> >  
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-24 20:57   ` Mike Kravetz
@ 2022-06-27  6:48     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  6:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

On Fri, Jun 24, 2022 at 01:57:57PM -0700, Mike Kravetz wrote:
> On 06/24/22 08:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > Originally copy_hugetlb_page_range() handles migration entries and hwpoisone
> > entries in similar manner.  But recently the related code path has more code
> > for migration entries, and when is_writable_migration_entry() was converted
> > to !is_readable_migration_entry(), hwpoison entries on source processes got
> > to be unexpectedly updated (which is legitimate for migration entries, but
> > not for hwpoison entries).  This results in unexpected serious issues like
> > kernel panic when foking processes with hwpoison entries in pmd.
> > 
> > Separate the if branch into one for hwpoison entries and one for migration
> > entries.
> > 
> > Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: <stable@vger.kernel.org> # 5.18
> > ---
> >  mm/hugetlb.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> Thank you!
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> (with typos pointed out by Miaohe Lin)
> 
> Just curious, are there any hwpoisoned tests in any test suite?  I run
> libhugetlbfs tests and ltp on a regular basis which sometimes catch
> regressions.  If there are no tests in any suite today, this might be
> something we want to consider for future work.

mce-tests (https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) has
some test cases for hwpoison (under cases/function/hwpoison/), but this
tool mostly focuses on MCE and does not have enough hwpoison testcases.
So I've long been maintaining my own test suite
(https://github.com/nhoriguchi/mm_regression) to help my own hwpoison
development/maintenance.  I'd like to make this tool handier so that
other developers or test suites can easily run the hwpoison tests,
although that still needs more work.

Thanks,
Naoya Horiguchi
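
For anyone wanting a minimal standalone reproducer outside those suites,
the usual injection entry point is madvise(MADV_HWPOISON) on one's own
mapping (requires CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN; a rough
sketch, assuming hugepages are already reserved in the pool):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one 2MB hugepage; 1UL << 30 for 1GB */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0, len);		/* fault the hugepage in */

		/* inject a memory error on one raw (4kB) subpage */
		if (madvise(p + 4096, 4096, MADV_HWPOISON))
			perror("madvise(MADV_HWPOISON)");

		/* further access to the poisoned range should raise SIGBUS
		 * (exact kill semantics depend on vm.memory_failure_early_kill) */
		return 0;
	}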

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-25  0:02   ` Mike Kravetz
@ 2022-06-27  7:07     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  7:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

On Fri, Jun 24, 2022 at 05:02:01PM -0700, Mike Kravetz wrote:
> On 06/24/22 08:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > follow_pud_mask() does not support non-present pud entry now.  As long as
> > I tested on x86_64 server, follow_pud_mask() still simply returns
> > no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
> > user-visible effect should happen.  But generally we should call
> > follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
> > 
> > Update pud_huge() and huge_pud() to handle non-present pud entries.  The
> > changes are similar to previous works for pud entries commit e66f17ff7177
> > ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
> > cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >  arch/x86/mm/hugetlbpage.c |  3 ++-
> >  mm/hugetlb.c              | 26 +++++++++++++++++++++++++-
> >  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> Thanks.  Overall looks good with typos corrected that you already noticed.
> One question below.
> > 
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index a0d023cb4292..5fb86fb49ba8 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -70,7 +70,8 @@ int pmd_huge(pmd_t pmd)
> >  
> >  int pud_huge(pud_t pud)
> >  {
> > -	return !!(pud_val(pud) & _PAGE_PSE);
> > +	return !pud_none(pud) &&
> > +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
> >  }
> >  #endif
> >  
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f59f43c06601..b7ae5f73f3b2 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6946,10 +6946,34 @@ struct page * __weak
> >  follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >  		pud_t *pud, int flags)
> >  {
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> > +	pte_t pte;
> > +
> >  	if (flags & (FOLL_GET | FOLL_PIN))
> >  		return NULL;
> >  
> > -	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +retry:
> > +	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
> > +	if (!pud_huge(*pud))
> > +		goto out;
> > +	pte = huge_ptep_get((pte_t *)pud);
> > +	if (pte_present(pte)) {
> > +		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> 
> The call to try_grab_page() seems a bit strange since flags will not include
> FOLL_GET | FOLL_PIN, as tested above.  Is try_grab_page() always going to be
> a no-op here?

Thanks, you're right.  The "flags & (FOLL_GET | FOLL_PIN)" check should be
updated, probably to align with what follow_huge_pmd() checks first.

Thanks,
Naoya Horiguchi

> 
> -- 
> Mike Kravetz
> 
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +	} else {
> > +		if (is_hugetlb_entry_migration(pte)) {
> > +			spin_unlock(ptl);
> > +			__migration_entry_wait(mm, (pte_t *)pud, ptl);
> > +			goto retry;
> > +		}
> > +	}
> > +out:
> > +	spin_unlock(ptl);
> > +	return page;
> >  }
> >  
> >  struct page * __weak
> > -- 
> > 2.25.1
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry
  2022-06-25  9:42   ` Miaohe Lin
@ 2022-06-27  7:24     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  7:24 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Sat, Jun 25, 2022 at 05:42:17PM +0800, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > follow_pud_mask() does not support non-present pud entry now.  As long as
> > I tested on x86_64 server, follow_pud_mask() still simply returns
> > no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
> > user-visible effect should happen.  But generally we should call
> > follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
> > 
> > Update pud_huge() and huge_pud() to handle non-present pud entries.  The
> > changes are similar to previous works for pud entries commit e66f17ff7177
> > ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit
> > cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >  arch/x86/mm/hugetlbpage.c |  3 ++-
> >  mm/hugetlb.c              | 26 +++++++++++++++++++++++++-
> >  2 files changed, 27 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index a0d023cb4292..5fb86fb49ba8 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -70,7 +70,8 @@ int pmd_huge(pmd_t pmd)
> >  
> 
> No strong opinion but a comment similar to pmd_huge might be better?
> 
> /*
>  * pmd_huge() returns 1 if @pmd is hugetlb related entry, that is normal
>  * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
>  * Otherwise, returns 0.
>  */

OK, I'll add some.

> 
> >  int pud_huge(pud_t pud)
> >  {
> > -	return !!(pud_val(pud) & _PAGE_PSE);
> > +	return !pud_none(pud) &&
> > +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
> >  }
> >  #endif
> >  
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f59f43c06601..b7ae5f73f3b2 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6946,10 +6946,34 @@ struct page * __weak
> >  follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >  		pud_t *pud, int flags)
> >  {
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> > +	pte_t pte;
> > +
> >  	if (flags & (FOLL_GET | FOLL_PIN))
> >  		return NULL;
> 
> Should the above check be modified? It seems the below try_grab_page might not grab the page as
> expected (as Mike pointed out). Or the extra page refcnt is unneeded?

Yes, this check should be updated.

> 
> >  
> > -	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +retry:
> > +	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
> > +	if (!pud_huge(*pud))
> > +		goto out;
> > +	pte = huge_ptep_get((pte_t *)pud);
> > +	if (pte_present(pte)) {
> > +		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +	} else {
> > +		if (is_hugetlb_entry_migration(pte)) {
> > +			spin_unlock(ptl);
> > +			__migration_entry_wait(mm, (pte_t *)pud, ptl);
> > +			goto retry;
> > +		}
> 
> Again. No strong opinion but a comment similar to follow_huge_pmd might be better?
> 
> /*
>  * hwpoisoned entry is treated as no_page_table in
>  * follow_page_mask().
>  */

Will add comment on this too. Thank you.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-27  3:16   ` Miaohe Lin
@ 2022-06-27  7:56     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-27  7:56 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Mon, Jun 27, 2022 at 11:16:06AM +0800, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When handling memory error on a hugetlb page, the error handler tries to
> > dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> > PageHWPoison flag is moved to the raw error page, so that's all right.
> > However, dissolve sometimes fails, then the error page is left as
> > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
> > healthy pages, but that's not possible now because the information about
> > where the raw error pages are located is lost.
> > 
> > Use the private field of a few tail pages to keep that information.  The
> > code path of shrinking hugepage pool uses this info to try delayed dissolve.
> > In order to remember multiple errors in a hugepage, a singly-linked list
> > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> > simple operations (adding an entry or clearing all) are required and the
> > list is assumed not to be very long, so this simple data structure should
> > be enough.
> > 
> > If we failed to save raw error info, the hwpoison hugepage has errors on
> > unknown subpage, then this new saving mechanism does not work any more,
> > so disable saving new raw error info and freeing hwpoison hugepages.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Many thanks for your patch. This patch looks good to me. Some nits below.
> 
> >
> <snip>
> >
> > +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> > +					struct page *page)
> > +{
> > +	struct llist_head *head;
> > +	struct raw_hwp_page *raw_hwp;
> > +	struct llist_node *t, *tnode;
> > +	int ret;
> > +
> > +	/*
> > +	 * Once the hwpoison hugepage has lost reliable raw error info,
> > +	 * there is little mean to keep additional error info precisely,
> 
> It should be s/mean/meaning/ ?

Right, fixed.

> 
> > +	 * so skip to add additional raw error info.
> > +	 */
> > +	if (raw_hwp_unreliable(hpage))
> > +		return -EHWPOISON;
> > +	head = raw_hwp_list_head(hpage);
> > +	llist_for_each_safe(tnode, t, head->first) {
> > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > +
> > +		if (p->page == page)
> > +			return -EHWPOISON;
> > +	}
> > +
> > +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> > +	/* the first error event will be counted in action_result(). */
> > +	if (ret)
> > +		num_poisoned_pages_inc();
> > +
> > +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
> > +	if (raw_hwp) {
> > +		raw_hwp->page = page;
> > +		llist_add(&raw_hwp->node, head);
> > +	} else {
> > +		/*
> > +		 * Failed to save raw error info.  We no longer trace all
> > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > +		 * this hwpoisoned hugepage.
> > +		 */
> > +		set_raw_hwp_unreliable(hpage);
> > +		return ret;
> 
> This "return ret" can be combined into the below one?
> 

Right, fixed.

> > +	}
> > +	return ret;
> > +}
> > +
> > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > +{
> > +	struct llist_head *head;
> > +	struct llist_node *t, *tnode;
> > +
> > +	if (raw_hwp_unreliable(hpage))
> > +		return -EBUSY;
> 
> Can we try freeing the memory of raw_hwp_list to save possible memory? It seems
> raw_hwp_list becomes unneeded when raw_hwp_unreliable.

Yes, we can. I'll try to change it like that.
Thanks for the suggestion.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
  2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
  2022-06-24  9:23   ` Miaohe Lin
  2022-06-24 20:57   ` Mike Kravetz
@ 2022-06-27  7:57   ` Muchun Song
  2 siblings, 0 replies; 45+ messages in thread
From: Muchun Song @ 2022-06-27  7:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	Naoya Horiguchi, linux-kernel

On Fri, Jun 24, 2022 at 08:51:46AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Originally copy_hugetlb_page_range() handles migration entries and hwpoisone
> entries in similar manner.  But recently the related code path has more code
> for migration entries, and when is_writable_migration_entry() was converted
> to !is_readable_migration_entry(), hwpoison entries on source processes got
> to be unexpectedly updated (which is legitimate for migration entries, but
> not for hwpoison entries).  This results in unexpected serious issues like
> kernel panic when foking processes with hwpoison entries in pmd.
> 
> Separate the if branch into one for hwpoison entries and one for migration
> entries.
> 
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Thanks for fixing it.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  2022-06-23 23:51 ` [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage Naoya Horiguchi
@ 2022-06-27  8:32   ` Miaohe Lin
  2022-06-27 11:48     ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27  8:32 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> The raw error info list needs to be removed when a hwpoisoned hugetlb
> page is unpoisoned.  And the unpoison handler needs to know how many
> errors there are in the target hugepage. So add them.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

snip

> @@ -2255,7 +2275,7 @@ int unpoison_memory(unsigned long pfn)
>  unlock_mutex:
>  	mutex_unlock(&mf_mutex);
>  	if (!ret || freeit) {
> -		num_poisoned_pages_dec();
> +		num_poisoned_pages_sub(count);

IIUC, num_poisoned_pages will only be incremented once for a hugetlb page. If many
subpages are hwpoisoned, they will reach the "else if (res == -EHWPOISON)" path
in try_memory_failure_hugetlb and thus num_poisoned_pages_inc is skipped. Maybe
that should be changed so that subpages can contribute to num_poisoned_pages,
or should we just do num_poisoned_pages_dec here? Or am I missing something?

Thanks!

>  		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
>  				 page_to_pfn(p), &unpoison_rs);
>  	}
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  2022-06-23 23:51 ` [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
@ 2022-06-27  8:39   ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27  8:39 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> If memory_failure() fails to grab page refcount on a hugetlb page
> because it's busy, it returns without setting PG_hwpoison on it.
> This not only loses a chance of error containment, but breaks the rule
> that action_result() should be called only when memory_failure() does
> any handling work (even if that's just setting PG_hwpoison).
> This inconsistency could harm code maintainability.
> 
> So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.
> 
> Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

> ---
>  include/linux/mm.h  | 1 +
>  mm/memory-failure.c | 8 ++++----
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4346e51484ba..044dc5a2e361 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3233,6 +3233,7 @@ enum mf_flags {
>  	MF_SOFT_OFFLINE = 1 << 3,
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
> +	MF_NO_RETRY = 1 << 6,
>  };
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 6005e953d011..ce045d0d6115 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1632,7 +1632,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  			count_increased = true;
>  	} else {
>  		ret = -EBUSY;
> -		goto out;
> +		if (!(flags & MF_NO_RETRY))
> +			goto out;
>  	}
>  
>  	if (hugetlb_set_page_hwpoison(head, page)) {
> @@ -1659,7 +1660,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	struct page *p = pfn_to_page(pfn);
>  	struct page *head;
>  	unsigned long page_flags;
> -	bool retry = true;
>  
>  	*hugetlb = 1;
>  retry:
> @@ -1675,8 +1675,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  		}
>  		return res;
>  	} else if (res == -EBUSY) {
> -		if (retry) {
> -			retry = false;
> +		if (!(flags & MF_NO_RETRY)) {
> +			flags |= MF_NO_RETRY;
>  			goto retry;
>  		}
>  		action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
> 
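
The retry protocol this patch establishes can be condensed as: the first
pass reports -EBUSY when the refcount grab fails, the caller retries
exactly once with MF_NO_RETRY set, and on that second pass the grab
failure is tolerated so PG_hwpoison still gets set.  A toy userspace
model of the control flow (the flag value mirrors the diff; the stub
function is hypothetical):

	#include <stdio.h>
	#include <errno.h>

	#define MF_NO_RETRY (1 << 6)

	/* stand-in for __get_huge_page_for_hwpoison(): the page is busy,
	 * so the refcount grab fails; MF_NO_RETRY makes that non-fatal */
	static int get_hwpoison_hugetlb(int flags, int *poisoned)
	{
		if (!(flags & MF_NO_RETRY))
			return -EBUSY;	/* first pass: let the caller retry */
		*poisoned = 1;		/* second pass: contain the error anyway */
		return 0;
	}

	int main(void)
	{
		int poisoned = 0, flags = 0;
		int res = get_hwpoison_hugetlb(flags, &poisoned);

		if (res == -EBUSY && !(flags & MF_NO_RETRY)) {
			flags |= MF_NO_RETRY;
			res = get_hwpoison_hugetlb(flags, &poisoned);
		}
		printf("res=%d poisoned=%d\n", res, poisoned);
		return 0;
	}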



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int
  2022-06-23 23:51 ` [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
@ 2022-06-27  9:02   ` Miaohe Lin
  2022-06-28  6:02     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27  9:02 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> __page_handle_poison() currently returns a bool that shows whether
> take_page_off_buddy() has succeeded.  But we will want to distinguish
> another case, "dissolve succeeded but taking off failed", by its return
> value, so change the type of the return value.  No functional change.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/memory-failure.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ce045d0d6115..db85f644a1e3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -71,7 +71,13 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
>  static bool hw_memory_failure __read_mostly = false;
>  
> -static bool __page_handle_poison(struct page *page)
> +/*
> + * Return values:
> + *   1:   the page is dissolved (if needed) and taken off from buddy,
> + *   0:   the page is dissolved (if needed) and not taken off from buddy,
> + *   < 0: failed to dissolve.
> + */
> +static int __page_handle_poison(struct page *page)
>  {
>  	int ret;
>  
> @@ -81,7 +87,7 @@ static bool __page_handle_poison(struct page *page)
>  		ret = take_page_off_buddy(page);
>  	zone_pcp_enable(page_zone(page));
>  
> -	return ret > 0;
> +	return ret;
>  }
>  
>  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> @@ -91,7 +97,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
>  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
>  		 * returns 0 for non-hugetlb pages as well.
>  		 */
> -		if (!__page_handle_poison(page))
> +		if (__page_handle_poison(page) <= 0)
>  			/*
>  			 * We could fail to take off the target page from buddy
>  			 * for example due to racy page allocation, but that's
> @@ -1048,7 +1054,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		 * subpages.
>  		 */
>  		put_page(hpage);
> -		if (__page_handle_poison(p)) {
> +		if (__page_handle_poison(p) > 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
>  		}
> @@ -1698,8 +1704,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	 */
>  	if (res == 0) {
>  		unlock_page(head);
> -		res = MF_FAILED;

It seems the previous discussion in [1] was missed. But that doesn't matter, as pointed out in [1]. :)

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.

[1]: https://lkml.org/lkml/2022/6/8/10

> -		if (__page_handle_poison(p)) {
> +		if (__page_handle_poison(p) > 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
>  		}
> 
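
The caller-side reading of the new convention, spelled out (a sketch; the
comments restate the patch's own documentation):

	ret = __page_handle_poison(p);
	if (ret > 0) {
		/* dissolved (if needed) and taken off buddy: fully handled */
	} else if (ret == 0) {
		/* dissolved, but take_page_off_buddy() did not take it,
		 * e.g. it lost a race with page allocation */
	} else {
		/* ret < 0: dissolve itself failed */
	}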



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-23 23:51 ` [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages Naoya Horiguchi
  2022-06-27  3:16   ` Miaohe Lin
@ 2022-06-27  9:26   ` Muchun Song
  2022-06-28  2:41     ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 45+ messages in thread
From: Muchun Song @ 2022-06-27  9:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	Naoya Horiguchi, linux-kernel

On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When handling memory error on a hugetlb page, the error handler tries to
> dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> PageHWPoison flag is moved to the raw error page, so that's all right.
> However, dissolve sometimes fails, then the error page is left as
> hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
> healthy pages, but that's not possible now because the information about
> where the raw error pages are located is lost.
> 
> Use the private field of a few tail pages to keep that information.  The
> code path of shrinking hugepage pool uses this info to try delayed dissolve.
> In order to remember multiple errors in a hugepage, a singly-linked list
> originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> simple operations (adding an entry or clearing all) are required and the
> list is assumed not to be very long, so this simple data structure should
> be enough.
> 
> If we failed to save raw error info, the hwpoison hugepage has errors on
> unknown subpage, then this new saving mechanism does not work any more,
> so disable saving new raw error info and freeing hwpoison hugepages.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Thanks for your work on this. I have several questions below.

> ---
> v1 -> v2:
> - support hwpoison hugepage with multiple errors,
> - moved the new interface functions to mm/memory-failure.c,
> - define additional subpage index SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
> - stop freeing/dissolving hwpoison hugepages with unreliable raw error info,
> - drop hugetlb_clear_page_hwpoison() in dissolve_free_huge_page() because
>   that's done in update_and_free_page(),
> - move setting/clearing PG_hwpoison flag to the new interfaces,
> - checking already hwpoisoned or not on a subpage basis.
> 
> ChangeLog since previous post on 4/27:
> - fixed typo in patch description (by Miaohe)
> - fixed config value in #ifdef statement (by Miaohe)
> - added sentences about "multiple hwpoison pages" scenario in patch
>   description
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/hugetlb.h | 13 ++++++
>  mm/hugetlb.c            | 39 +++++++++--------
>  mm/memory-failure.c     | 95 ++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 125 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..ac13c2022517 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -42,6 +42,10 @@ enum {
>  	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
>  	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
>  	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> +#endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	SUBPAGE_INDEX_HWPOISON,
> +	SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
>  #endif
>  	__NR_USED_SUBPAGE,
>  };
> @@ -798,6 +802,15 @@ extern int dissolve_free_huge_page(struct page *page);
>  extern int dissolve_free_huge_pages(unsigned long start_pfn,
>  				    unsigned long end_pfn);
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +extern int hugetlb_clear_page_hwpoison(struct page *hpage);
> +#else
> +static inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
>  #ifndef arch_hugetlb_migration_supported
>  static inline bool arch_hugetlb_migration_supported(struct hstate *h)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b7ae5f73f3b2..19ef427aa1e8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1541,17 +1541,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> -	if (hugetlb_vmemmap_alloc(h, page)) {
> -		spin_lock_irq(&hugetlb_lock);
> -		/*
> -		 * If we cannot allocate vmemmap pages, just refuse to free the
> -		 * page and put the page back on the hugetlb free list and treat
> -		 * as a surplus page.
> -		 */
> -		add_hugetlb_page(h, page, true);
> -		spin_unlock_irq(&hugetlb_lock);
> -		return;
> -	}
> +	if (hugetlb_vmemmap_alloc(h, page))
> +		goto fail;
> +
> +	/*
> +	 * Move PageHWPoison flag from head page to the raw error pages,
> +	 * which makes any healthy subpages reusable.
> +	 */
> +	if (unlikely(PageHWPoison(page) && hugetlb_clear_page_hwpoison(page)))
> +		goto fail;
>  
>  	for (i = 0; i < pages_per_huge_page(h);
>  	     i++, subpage = mem_map_next(subpage, page, i)) {
> @@ -1572,6 +1570,16 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
>  	} else {
>  		__free_pages(page, huge_page_order(h));
>  	}
> +	return;
> +fail:
> +	spin_lock_irq(&hugetlb_lock);
> +	/*
> +	 * If we cannot allocate vmemmap pages or cannot identify raw hwpoison
> +	 * subpages reliably, just refuse to free the page and put the page
> +	 * back on the hugetlb free list and treat as a surplus page.
> +	 */
> +	add_hugetlb_page(h, page, true);
> +	spin_unlock_irq(&hugetlb_lock);
>  }
>  
>  /*
> @@ -2115,15 +2123,6 @@ int dissolve_free_huge_page(struct page *page)
>  		 */
>  		rc = hugetlb_vmemmap_alloc(h, head);
>  		if (!rc) {
> -			/*
> -			 * Move PageHWPoison flag from head page to the raw
> -			 * error page, which makes any subpages rather than
> -			 * the error page reusable.
> -			 */
> -			if (PageHWPoison(head) && page != head) {
> -				SetPageHWPoison(page);
> -				ClearPageHWPoison(head);
> -			}
>  			update_and_free_page(h, head, false);
>  		} else {
>  			spin_lock_irq(&hugetlb_lock);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fb3feb1f363e..af571fa6f2af 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
>  }
>  
>  #ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * Struct raw_hwp_page represents information about "raw error page",
> + * constructing singly linked list originated from ->private field of
> + * SUBPAGE_INDEX_HWPOISON-th tail page.
> + */
> +struct raw_hwp_page {
> +	struct llist_node node;
> +	struct page *page;
> +};
> +
> +static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
> +{
> +	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> +}
> +
> +static inline int raw_hwp_unreliable(struct page *hpage)
> +{
> +	return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
> +}
> +
> +static inline void set_raw_hwp_unreliable(struct page *hpage)
> +{
> +	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
> +}

Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly?

> +
> +/*
> + * about race consideration
> + */
> +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> +					struct page *page)
> +{
> +	struct llist_head *head;
> +	struct raw_hwp_page *raw_hwp;
> +	struct llist_node *t, *tnode;
> +	int ret;
> +
> +	/*
> +	 * Once the hwpoison hugepage has lost reliable raw error info,
> +	 * there is little mean to keep additional error info precisely,
> +	 * so skip to add additional raw error info.
> +	 */
> +	if (raw_hwp_unreliable(hpage))
> +		return -EHWPOISON;
> +	head = raw_hwp_list_head(hpage);
> +	llist_for_each_safe(tnode, t, head->first) {
> +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> +
> +		if (p->page == page)
> +			return -EHWPOISON;
> +	}
> +
> +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> +	/* the first error event will be counted in action_result(). */
> +	if (ret)
> +		num_poisoned_pages_inc();
> +
> +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);

This function can be called in atomic context, so GFP_ATOMIC should be used
here.

> +	if (raw_hwp) {
> +		raw_hwp->page = page;
> +		llist_add(&raw_hwp->node, head);

The maximum number of items in the list is one, right?

> +	} else {
> +		/*
> +		 * Failed to save raw error info.  We no longer trace all
> +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> +		 * this hwpoisoned hugepage.
> +		 */
> +		set_raw_hwp_unreliable(hpage);
> +		return ret;
> +	}
> +	return ret;
> +}
> +
> +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> +{
> +	struct llist_head *head;
> +	struct llist_node *t, *tnode;
> +
> +	if (raw_hwp_unreliable(hpage))
> +		return -EBUSY;

IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?

> +	ClearPageHWPoison(hpage);
> +	head = raw_hwp_list_head(hpage);
> +	llist_for_each_safe(tnode, t, head->first) {

Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
not traverse it (since we have already cleared the page's PageHWPoison)? Then we
would ignore a real hwpoison page, right?

> +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> +
> +		SetPageHWPoison(p->page);
> +		kfree(p);
> +	}
> +	llist_del_all(head);

If the above issue exists, moving ClearPageHWPoison(hpage) to here could
fix it. We should clear PageHWPoison carefully since the head page can
also be poisoned.

Thanks.

> +	return 0;
> +}
> +
>  /*
>   * Called from hugetlb code with hugetlb_lock held.
>   *
> @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  		goto out;
>  	}
>  
> -	if (TestSetPageHWPoison(head)) {
> +	if (hugetlb_set_page_hwpoison(head, page)) {
>  		ret = -EHWPOISON;
>  		goto out;
>  	}
> @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	lock_page(head);
>  
>  	if (hwpoison_filter(p)) {
> -		ClearPageHWPoison(head);
> +		hugetlb_clear_page_hwpoison(head);
>  		res = -EOPNOTSUPP;
>  		goto out;
>  	}
> -- 
> 2.25.1
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  2022-06-27  8:32   ` Miaohe Lin
@ 2022-06-27 11:48     ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27 11:48 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/27 16:32, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>
>> Raw error info list needs to be removed when hwpoisoned hugetlb is
>> unpoisoned.  And the unpoison handler needs to know how many errors there
>> are in the target hugepage. So add them.
>>
>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> snip
> 
>> @@ -2255,7 +2275,7 @@ int unpoison_memory(unsigned long pfn)
>>  unlock_mutex:
>>  	mutex_unlock(&mf_mutex);
>>  	if (!ret || freeit) {
>> -		num_poisoned_pages_dec();
>> +		num_poisoned_pages_sub(count);
> 
> IIUC, num_poisoned_pages will only be incremented once for a hugetlb page. If many
> subpages are hwpoisoned, they will reach the "else if (res == -EHWPOISON)" path
> in try_memory_failure_hugetlb and thus num_poisoned_pages_inc is skipped. Maybe
> that should be changed so subpages can contribute to num_poisoned_pages,
> or should we just do num_poisoned_pages_dec here? Or am I missing something?

Sorry. I re-read patch 4/9 and found that hugetlb_set_page_hwpoison will increment
num_poisoned_pages for each subpage. Please ignore this question.

> 
> Thanks!
> 
>>  		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
>>  				 page_to_pfn(p), &unpoison_rs);
>>  	}
>>
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  2022-06-23 23:51 ` [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
@ 2022-06-27 12:24   ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2022-06-27 12:24 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Currently if memory_failure() (modified by a subsequent patch to remove the
> blocking code) is called on a page in some 1GB hugepage, memory error
> handling fails and the raw error page gets into a leaked state.  The impact
> is small in production systems (just a single leaked 4kB page), but it
> limits testability because unpoison doesn't work for such a page.
> We can no longer create a 1GB hugepage on the 1GB physical address range
> containing such leaked pages, which is a problem when testing on small systems.
> 
> When a hwpoison page in a 1GB hugepage is handled, it's caught by the
> PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
> broken down into raw error pages before coming to this point:
> 
>         if (unlikely(PageHWPoison(page)) && !order) {
>                 ...
>                 return false;
>         }
> 
> Then, the page is not sent back to the buddy allocator and the page refcount is left at 0.
> 
> Originally this check was supposed to trigger when the error page is freed from
> page_handle_poison() (which is called from soft-offline), but now we are
> opening another path that calls it, so the callers of __page_handle_poison()
> need to handle this case by treating a return value of 0 as success. Then
> the page refcount for hwpoison is properly incremented, so unpoison works.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

It seems I misunderstood the commit log in [1]. But I hope I've got the point this time. :)

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

[1] https://lore.kernel.org/linux-mm/19981830-a5e6-bdba-4a1c-1cdcea61b93b@huawei.com/

> ---
>  mm/memory-failure.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index db85f644a1e3..fc7b83cb6468 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1046,7 +1046,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
>  		unlock_page(hpage);
>  	} else {
> -		res = MF_FAILED;
>  		unlock_page(hpage);
>  		/*
>  		 * migration entry prevents later access on error hugepage,
> @@ -1054,9 +1053,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		 * subpages.
>  		 */
>  		put_page(hpage);
> -		if (__page_handle_poison(p) > 0) {
> +		if (__page_handle_poison(p) >= 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
> +		} else {
> +			res = MF_FAILED;
>  		}
>  	}
>  
> @@ -1704,9 +1705,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	 */
>  	if (res == 0) {
>  		unlock_page(head);
> -		if (__page_handle_poison(p) > 0) {
> +		if (__page_handle_poison(p) >= 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
> +		} else {
> +			res = MF_FAILED;
>  		}
>  		action_result(pfn, MF_MSG_FREE_HUGE, res);
>  		return res == MF_RECOVERED ? 0 : -EBUSY;
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-27  6:02             ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-27 17:25               ` Mike Kravetz
  2022-06-28  3:01                 ` Miaohe Lin
  2022-06-28  8:38                 ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 2 replies; 45+ messages in thread
From: Mike Kravetz @ 2022-06-27 17:25 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Miaohe Lin, Muchun Song, Naoya Horiguchi, Andrew Morton,
	David Hildenbrand, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel, Linux-MM

On 06/27/22 06:02, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, Jun 24, 2022 at 12:11:07PM -0700, Mike Kravetz wrote:
> > On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> > > > On 2022/6/24 16:03, Muchun Song wrote:
> > > > > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> > > > >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > > > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > > >>
> > > > >> IIUC it might be better to do the below check:
> > > > >> 	/*
> > > > >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> > > > >> 	 * allocation is not supported.
> > > > >> 	 */
> > > > >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> > > > >> 		goto out;
> > > > >>
> > > > > 
> > > > > The change looks good to me. However, the comments above is unnecessary
> > > > > since gigantic_page_runtime_supported() is straightforward.
> > > > 
> > > > Agree. The comments can be removed.
> > > 
> > > Thank you, both. Adding !gigantic_page_runtime_supported without comment
> > > makes sense to me.
> > 
> > The change above makes sense to me.  However, ...
> > 
> > If we make the change above, will we have the same strange situation described
> > in the commit message when !gigantic_page_runtime_supported() is true?
> > 
> > IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
> > pages can not be allocated or freed at run time.  They can only be
> > allocated at boot time.  So, there should NEVER be surplus gigantic
> > pages if !gigantic_page_runtime_supported().
> 
> I have the same understanding as the above.
> 
> >  To avoid this situation,
> > perhaps we should change set_max_huge_pages as follows (not tested)?
> 
> The suggested diff looks clearer about what it does, so I'd like to take it
> in the next version.  Then, what do we do about the "if (hstate_is_gigantic())"
> check in return_unused_surplus_pages in the original suggestion?  Should it
> be kept as is, or removed, or checked with !gigantic_page_runtime_supported()?
> I guess that the new checks prevent calling return_unused_surplus_pages()
> during pool shrinking, so the check seems not necessary any more.

My first thought was to keep the check in return_unused_surplus_pages() as it
is called in other code paths.  However, it SHOULD only try to free surplus
hugetlb pages.  With the modifications to set_max_huge_pages we will not
have any surplus gigantic pages if !gigantic_page_runtime_supported, so
the check can be removed.

Also note that we never try to dynamically allocate surplus gigantic pages.
This also is left over from the time when we could not easily allocate a
gigantic page at runtime.  It would not surprise me if someone found a use
case to ease this restriction in the future.  Especially so if 1G THP support
is ever added.  If this happens, the check would be necessary, and I would
guess that it would not be re-added.

Sorry for thinking out loud!!!  Although not necessary, it 'might' be a good
idea to leave the check because it would be overlooked if dynamic allocation
of gigantic surplus pages is ever added.  I do not have a strong opinion.

P.S. This also reminds me that a similar check should be added to the
demote hugetlb code path.  It would be bad if !gigantic_page_runtime_supported
and we demoted a gigantic page into numerous non-gigantic pages.  I will
send a patch.
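
For illustration, such a guard might look roughly like this (a hypothetical
sketch only, so the actual patch may differ; demote_store() and
kobj_to_hstate() are the existing sysfs pieces it would hook into):

static ssize_t demote_store(struct kobject *kobj,
			    struct kobj_attribute *attr,
			    const char *buf, size_t len)
{
	int nid;
	struct hstate *h = kobj_to_hstate(kobj, &nid);

	/* Refuse demotion when gigantic pages cannot be freed at runtime,
	 * since demoting effectively frees the gigantic page. */
	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
		return -EINVAL;

	/* ... existing demote handling continues here ... */
	return len;
}
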
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 9/9] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-23 23:51 ` [PATCH v2 9/9] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
@ 2022-06-28  2:06   ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2022-06-28  2:06 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel

On 2022/6/24 7:51, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Now error handling code is prepared, so remove the blocking code and
> enable memory error handling on 1GB hugepage.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

The patch itself looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

BTW: There might still be some places that need to be changed. Maybe we can
handle them all together after this series is done?

> ---
>  include/linux/mm.h      |  1 -
>  include/ras/ras_event.h |  1 -
>  mm/memory-failure.c     | 16 ----------------
>  3 files changed, 18 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 044dc5a2e361..9d7e9b5a4d1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3284,7 +3284,6 @@ enum mf_action_page_type {
>  	MF_MSG_DIFFERENT_COMPOUND,
>  	MF_MSG_HUGE,
>  	MF_MSG_FREE_HUGE,
> -	MF_MSG_NON_PMD_HUGE,
>  	MF_MSG_UNMAP_FAILED,
>  	MF_MSG_DIRTY_SWAPCACHE,
>  	MF_MSG_CLEAN_SWAPCACHE,
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index d0337a41141c..cbd3ddd7c33d 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -360,7 +360,6 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
>  	EM ( MF_MSG_HUGE, "huge page" )					\
>  	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
> -	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
>  	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
>  	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
>  	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc7b83cb6468..33521e059f7f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -728,7 +728,6 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> -	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> @@ -1717,21 +1716,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  
>  	page_flags = head->flags;
>  
> -	/*
> -	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
> -	 * simply disable it. In order to make it work properly, we need
> -	 * make sure that:
> -	 *  - conversion of a pud that maps an error hugetlb into hwpoison
> -	 *    entry properly works, and
> -	 *  - other mm code walking over page table is aware of pud-aligned
> -	 *    hwpoison entries.
> -	 */
> -	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
> -		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
> -		res = -EBUSY;
> -		goto out;
> -	}
> -
>  	if (!hwpoison_user_mappings(p, pfn, flags, head)) {
>  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
>  		res = -EBUSY;
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-27  9:26   ` Muchun Song
@ 2022-06-28  2:41     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-28  6:26       ` Muchun Song
  0 siblings, 1 reply; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-28  2:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel

On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote:
> On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When handling memory error on a hugetlb page, the error handler tries to
> > dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> > PageHWPoison flag is moved to the raw error page, so that's all right.
> > However, dissolve sometimes fails; then the error page is left as a
> > hwpoisoned hugepage. It's useful if we can retry dissolving it to save
> > healthy pages, but that's not possible now because the information about
> > where the raw error pages are has been lost.
> > 
> > Use the private field of a few tail pages to keep that information.  The
> > code path of shrinking hugepage pool uses this info to try delayed dissolve.
> > In order to remember multiple errors in a hugepage, a singly-linked list
> > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> > simple operations (adding an entry or clearing all) are required and the
> > list is assumed not to be very long, so this simple data structure should
> > be enough.
> > 
> > If we fail to save raw error info, the hwpoison hugepage has errors on an
> > unknown subpage and this new saving mechanism no longer works, so disable
> > both saving new raw error info and the freeing of hwpoison hugepages.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Thanks for your work on this. I have several questions below.
> 
...

> > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> >  }
> >  
> >  #ifdef CONFIG_HUGETLB_PAGE
> > +/*
> > + * Struct raw_hwp_page represents information about "raw error page",
> > + * constructing singly linked list originated from ->private field of
> > + * SUBPAGE_INDEX_HWPOISON-th tail page.
> > + */
> > +struct raw_hwp_page {
> > +	struct llist_node node;
> > +	struct page *page;
> > +};
> > +
> > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
> > +{
> > +	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> > +}
> > +
> > +static inline int raw_hwp_unreliable(struct page *hpage)
> > +{
> > +	return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
> > +}
> > +
> > +static inline void set_raw_hwp_unreliable(struct page *hpage)
> > +{
> > +	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
> > +}
> 
> Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly?
> 

OK, that sounds better, I'll do it.
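
For reference, that would amount to roughly the following (an illustrative
sketch; the exact flag name in v3 may differ):

enum hugetlb_page_flags {
	HPG_restore_reserve = 0,
	HPG_migratable,
	HPG_temporary,
	HPG_freed,
	HPG_vmemmap_optimized,
	HPG_hwpoison_unreliable,	/* new flag */
	__NR_HPAGEFLAGS,
};

HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable)

so that HPageHwpoisonUnreliable(), SetHPageHwpoisonUnreliable() and
ClearHPageHwpoisonUnreliable() replace the open-coded page_private()
helpers above.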

> > +
> > +/*
> > + * about race consideration
> > + */
> > +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> > +					struct page *page)
> > +{
> > +	struct llist_head *head;
> > +	struct raw_hwp_page *raw_hwp;
> > +	struct llist_node *t, *tnode;
> > +	int ret;
> > +
> > +	/*
> > +	 * Once the hwpoison hugepage has lost reliable raw error info,
> > +	 * there is little mean to keep additional error info precisely,
> > +	 * so skip to add additional raw error info.
> > +	 */
> > +	if (raw_hwp_unreliable(hpage))
> > +		return -EHWPOISON;
> > +	head = raw_hwp_list_head(hpage);
> > +	llist_for_each_safe(tnode, t, head->first) {
> > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > +
> > +		if (p->page == page)
> > +			return -EHWPOISON;
> > +	}
> > +
> > +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> > +	/* the first error event will be counted in action_result(). */
> > +	if (ret)
> > +		num_poisoned_pages_inc();
> > +
> > +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
> 
> This function can be called in atomic context, so GFP_ATOMIC should be used
> here.

OK, I'll use GFP_ATOMIC.

> 
> > +	if (raw_hwp) {
> > +		raw_hwp->page = page;
> > +		llist_add(&raw_hwp->node, head);
> 
> The maximum number of items in the list is one, right?

The maximum is the number of subpages in the hugepage (512 for 2MB hugepage,
262144 for 1GB hugepage).
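
(Assuming the usual 4kB base page size, that is 2MB / 4kB = 512 and
1GB / 4kB = 262144, i.e. potentially one list entry per base page.)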

> 
> > +	} else {
> > +		/*
> > +		 * Failed to save raw error info.  We no longer trace all
> > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > +		 * this hwpoisoned hugepage.
> > +		 */
> > +		set_raw_hwp_unreliable(hpage);
> > +		return ret;
> > +	}
> > +	return ret;
> > +}
> > +
> > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > +{
> > +	struct llist_head *head;
> > +	struct llist_node *t, *tnode;
> > +
> > +	if (raw_hwp_unreliable(hpage))
> > +		return -EBUSY;
> 
> IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
> and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?

Sorry if I'm missing your point, but raw_hwp_unreliable is set when
allocating a raw_hwp_page fails.  hugetlb_set_page_hwpoison() can be called
multiple times on a hugepage, and if one of the calls fails, the hwpoisoned
hugepage becomes unreliable.

BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(),
the kmalloc() never fails, so we no longer have to implement this unreliable
flag, and things get simpler.

> 
> > +	ClearPageHWPoison(hpage);
> > +	head = raw_hwp_list_head(hpage);
> > +	llist_for_each_safe(tnode, t, head->first) {
> 
> Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
> not traverse it (since we have already cleared the page's PageHWPoison)? Then we
> would ignore a real hwpoison page, right?

Maybe you are referring to a race like the one below. Yes, that's possible.

  CPU 0                            CPU 1

                                   free_huge_page
                                     lock hugetlb_lock
                                     ClearHPageMigratable
                                     unlock hugetlb_lock
  get_huge_page_for_hwpoison
    lock hugetlb_lock
    __get_huge_page_for_hwpoison
      hugetlb_set_page_hwpoison
        allocate raw_hwp_page
        TestSetPageHWPoison
                                     update_and_free_page
                                       __update_and_free_page
                                         if (PageHWPoison)
                                           hugetlb_clear_page_hwpoison
                                             TestClearPageHWPoison
                                             // remove all list items
        llist_add
    unlock hugetlb_lock


The end result seems not critical (leaking the raced raw_hwp_page?), but
we need a fix.

> 
> > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > +
> > +		SetPageHWPoison(p->page);
> > +		kfree(p);
> > +	}
> > +	llist_del_all(head);
> 
> If the above issue exists, moving ClearPageHWPoison(hpage) to here could
> fix it. We should clear PageHWPoison carefully since the head page can
> also be poisoned.

The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe()
was that the raw error page can be the head page.  But this can be solved
with some additional code to remember whether the raw_hwp_page list has an
item associated with the head page.

Or another approach I have in mind now is to call hugetlb_clear_page_hwpoison()
while holding mf_mutex.
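
For concreteness, combining the two ideas could look roughly like this
(an illustrative sketch only, not the posted code):

static int hugetlb_clear_page_hwpoison(struct page *hpage)
{
	struct llist_head *head;
	struct llist_node *t, *tnode;
	bool head_is_raw_error = false;

	if (raw_hwp_unreliable(hpage))
		return -EBUSY;
	head = raw_hwp_list_head(hpage);
	llist_for_each_safe(tnode, t, head->first) {
		struct raw_hwp_page *p =
			container_of(tnode, struct raw_hwp_page, node);

		/* Remember whether the head page itself is a raw error page. */
		if (p->page == hpage)
			head_is_raw_error = true;
		else
			SetPageHWPoison(p->page);
		kfree(p);
	}
	llist_del_all(head);
	/* Clear the head flag last, and only if the head page is healthy. */
	if (!head_is_raw_error)
		ClearPageHWPoison(hpage);
	return 0;
}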

> 
> Thanks.

Thank you for the valuable feedback.

- Naoya Horiguchi

> 
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Called from hugetlb code with hugetlb_lock held.
> >   *
> > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> >  		goto out;
> >  	}
> >  
> > -	if (TestSetPageHWPoison(head)) {
> > +	if (hugetlb_set_page_hwpoison(head, page)) {
> >  		ret = -EHWPOISON;
> >  		goto out;
> >  	}
> > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> >  	lock_page(head);
> >  
> >  	if (hwpoison_filter(p)) {
> > -		ClearPageHWPoison(head);
> > +		hugetlb_clear_page_hwpoison(head);
> >  		res = -EOPNOTSUPP;
> >  		goto out;
> >  	}
> > -- 
> > 2.25.1
> > 
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-27 17:25               ` Mike Kravetz
@ 2022-06-28  3:01                 ` Miaohe Lin
  2022-06-28  8:38                 ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2022-06-28  3:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Naoya Horiguchi, Andrew Morton, David Hildenbrand,
	Liu Shixin, Yang Shi, Oscar Salvador, linux-kernel, Linux-MM,
	HORIGUCHI NAOYA(堀口 直也)

On 2022/6/28 1:25, Mike Kravetz wrote:
> On 06/27/22 06:02, HORIGUCHI NAOYA(堀口 直也) wrote:
>> On Fri, Jun 24, 2022 at 12:11:07PM -0700, Mike Kravetz wrote:
>>> On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
>>>>> On 2022/6/24 16:03, Muchun Song wrote:
>>>>>> On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
>>>>>>> On 2022/6/24 7:51, Naoya Horiguchi wrote:
>>>>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>>>>
>>>>>>> IIUC it might be better to do the below check:
>>>>>>> 	/*
>>>>>>> 	 * Cannot return gigantic pages currently if runtime gigantic page
>>>>>>> 	 * allocation is not supported.
>>>>>>> 	 */
>>>>>>> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>>>>>>> 		goto out;
>>>>>>>
>>>>>>
>>>>>> The change looks good to me. However, the comments above is unnecessary
>>>>>> since gigantic_page_runtime_supported() is straightforward.
>>>>>
>>>>> Agree. The comments can be removed.
>>>>
>>>> Thank you, both. Adding !gigantic_page_runtime_supported without comment
>>>> makes sense to me.
>>>
>>> The change above makes sense to me.  However, ...
>>>
>>> If we make the change above, will we have the same strange situation described
>>> in the commit message when !gigantic_page_runtime_supported() is true?
>>>
>>> IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
>>> pages can not be allocated or freed at run time.  They can only be
>>> allocated at boot time.  So, there should NEVER be surplus gigantic
>>> pages if !gigantic_page_runtime_supported().
>>
>> I have the same understanding as the above.
>>
>>>  To avoid this situation,
>>> perhaps we should change set_max_huge_pages as follows (not tested)?
>>
>> The suggested diff looks clearer about what it does, so I'd like to take it
>> in the next version.  Then, what do we do about the "if (hstate_is_gigantic())"
>> check in return_unused_surplus_pages in the original suggestion?  Should it
>> be kept as is, or removed, or checked with !gigantic_page_runtime_supported()?
>> I guess that the new checks prevent calling return_unused_surplus_pages()
>> during pool shrinking, so the check seems not necessary any more.
> 
> My first thought was to keep the check in return_unused_surplus_pages() as it
> is called in other code paths.  However, it SHOULD only try to free surplus
> hugetlb pages.  With the modifications to set_max_huge_pages we will not
> have any surplus gigantic pages if !gigantic_page_runtime_supported, so
> the check can be removed.
> 
> Also note that we never try to dynamically allocate surplus gigantic pages.
> This also is left over from the time when we could not easily allocate a
> gigantic page at runtime.  It would not surprise me if someone found a use
> case to ease this restriction in the future.  Especially so if 1G THP support
> is ever added.  If this happens, the check would be necessary and I would
> guess that it would not be re-added.
> 
> Sorry for thinking out loud!!!  Although not necessary, it 'might' be a good
> idea to leave the check because it would be overlooked if dynamic allocation
> of gigantic surplus pages is ever added.  I do not have a strong opinion.
> 
> P.S. This also reminds me that a similar check should be added to the
> demote hugetlb code path.  It would be bad if !gigantic_page_runtime_supported
> and we demoted a gigantic page into numerous non-gigantic pages.  I will
> send a patch.

Off-topic:
There are some places that check "if (hstate_is_gigantic(h))" while others check
"if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())". Should
we make it explicit in some manner when gigantic_page_runtime_supported() is
needed, to make the code easier to follow?

Just a trivial suggestion. Thanks!

> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int
  2022-06-27  9:02   ` Miaohe Lin
@ 2022-06-28  6:02     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-28  6:02 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

On Mon, Jun 27, 2022 at 05:02:47PM +0800, Miaohe Lin wrote:
> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > __page_handle_poison() currently returns a bool that shows whether
> > take_page_off_buddy() has succeeded or not.  But we will want to
> > distinguish another case, "dissolve succeeded but taking off failed",
> > by its return value. So change the type of the return value.
> > No functional change.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
...
> > @@ -1698,8 +1704,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> >  	 */
> >  	if (res == 0) {
> >  		unlock_page(head);
> > -		res = MF_FAILED;
> 
> It seems the previous discussion in [1] was missed. But that doesn't matter, as pointed out in [1]. :)

Sorry for missing this. I have updated the series so that this line is removed in patch 8/9.

> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thank you.

- Naoya Horiguchi

> 
> Thanks.
> 
> [1]: https://lkml.org/lkml/2022/6/8/10

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-28  2:41     ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-28  6:26       ` Muchun Song
  2022-06-28  7:51         ` Muchun Song
  2022-06-28  8:17         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 2 replies; 45+ messages in thread
From: Muchun Song @ 2022-06-28  6:26 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel

On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote:
> > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > 
> > > When handling memory error on a hugetlb page, the error handler tries to
> > > dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> > > PageHWPoison flag is moved to the raw error page, so that's all right.
> > > However, dissolve sometimes fails; then the error page is left as a
> > > hwpoisoned hugepage. It's useful if we can retry dissolving it to save
> > > healthy pages, but that's not possible now because the information about
> > > where the raw error pages are has been lost.
> > > 
> > > Use the private field of a few tail pages to keep that information.  The
> > > code path of shrinking hugepage pool uses this info to try delayed dissolve.
> > > In order to remember multiple errors in a hugepage, a singly-linked list
> > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> > > simple operations (adding an entry or clearing all) are required and the
> > > list is assumed not to be very long, so this simple data structure should
> > > be enough.
> > > 
> > > If we fail to save raw error info, the hwpoison hugepage has errors on an
> > > unknown subpage and this new saving mechanism no longer works, so disable
> > > both saving new raw error info and the freeing of hwpoison hugepages.
> > > 
> > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > Thanks for your work on this. I have several questions below.
> > 
> ...
> 
> > > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> > >  }
> > >  
> > >  #ifdef CONFIG_HUGETLB_PAGE
> > > +/*
> > > + * Struct raw_hwp_page represents information about "raw error page",
> > > + * constructing singly linked list originated from ->private field of
> > > + * SUBPAGE_INDEX_HWPOISON-th tail page.
> > > + */
> > > +struct raw_hwp_page {
> > > +	struct llist_node node;
> > > +	struct page *page;
> > > +};
> > > +
> > > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
> > > +{
> > > +	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> > > +}
> > > +
> > > +static inline int raw_hwp_unreliable(struct page *hpage)
> > > +{
> > > +	return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
> > > +}
> > > +
> > > +static inline void set_raw_hwp_unreliable(struct page *hpage)
> > > +{
> > > +	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
> > > +}
> > 
> > Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly?
> > 
> 
> OK, that sounds better, I'll do it.
> 
> > > +
> > > +/*
> > > + * about race consideration
> > > + */
> > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> > > +					struct page *page)
> > > +{
> > > +	struct llist_head *head;
> > > +	struct raw_hwp_page *raw_hwp;
> > > +	struct llist_node *t, *tnode;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Once the hwpoison hugepage has lost reliable raw error info,
> > > +	 * there is little mean to keep additional error info precisely,
> > > +	 * so skip to add additional raw error info.
> > > +	 */
> > > +	if (raw_hwp_unreliable(hpage))
> > > +		return -EHWPOISON;
> > > +	head = raw_hwp_list_head(hpage);
> > > +	llist_for_each_safe(tnode, t, head->first) {
> > > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > > +
> > > +		if (p->page == page)
> > > +			return -EHWPOISON;
> > > +	}
> > > +
> > > +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> > > +	/* the first error event will be counted in action_result(). */
> > > +	if (ret)
> > > +		num_poisoned_pages_inc();
> > > +
> > > +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
> > 
> > This function can be called in atomic context, so GFP_ATOMIC should be used
> > here.
> 
> OK, I'll use GFP_ATOMIC.
> 
> > 
> > > +	if (raw_hwp) {
> > > +		raw_hwp->page = page;
> > > +		llist_add(&raw_hwp->node, head);
> > 
> > The maximum number of items in the list is one, right?
> 
> The maximum is the number of subpages in the hugepage (512 for 2MB hugepage,
> 262144 for 1GB hugepage).
>

You are right. I have missed something yesterday.
 
> > 
> > > +	} else {
> > > +		/*
> > > +		 * Failed to save raw error info.  We no longer trace all
> > > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > > +		 * this hwpoisoned hugepage.
> > > +		 */
> > > +		set_raw_hwp_unreliable(hpage);
> > > +		return ret;
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > > +{
> > > +	struct llist_head *head;
> > > +	struct llist_node *t, *tnode;
> > > +
> > > +	if (raw_hwp_unreliable(hpage))
> > > +		return -EBUSY;
> > 
> > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
> > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?
> 
> Sorry if I'm missing your point, but raw_hwp_unreliable is set when
> allocating a raw_hwp_page fails.  hugetlb_set_page_hwpoison() can be called

Sorry. I have missed this. Thanks for your clarification.

> multiple times on a hugepage, and if one of the calls fails, the hwpoisoned
> hugepage becomes unreliable.
> 
> BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(),
> the kmalloc() never fails, so we no longer have to implement this unreliable

No. kmalloc() with GFP_ATOMIC can fail unless I'm missing something important.

> flag, and things get simpler.
> 
> > 
> > > +	ClearPageHWPoison(hpage);
> > > +	head = raw_hwp_list_head(hpage);
> > > +	llist_for_each_safe(tnode, t, head->first) {
> > 
> > Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
> > not traverse it (since we have already cleared the page's PageHWPoison)? Then we
> > would ignore a real hwpoison page, right?
> 
> Maybe you are referring to a race like the one below. Yes, that's possible.
>

Sorry, ignore my previous comments, I was thinking about this wrongly.

>   CPU 0                            CPU 1
> 
>                                    free_huge_page
>                                      lock hugetlb_lock
>                                      ClearHPageMigratable
				       remove_hugetlb_page()
				       // the page is non-HugeTLB now
>                                      unlock hugetlb_lock
>   get_huge_page_for_hwpoison
>     lock hugetlb_lock
>     __get_huge_page_for_hwpoison

	// cannot reach here since it is not a HugeTLB page now.
	// So this race is impossible. Then we fall back to normal
	// page handling. It seems there is a new issue here.
	//
	// memory_failure()
	//	try_memory_failure_hugetlb()
	//	if (hugetlb)
	//		goto unlock_mutex;
	//	if (TestSetPageHWPoison(p)) {
	//	// This non-HugeTLB page's vmemmap is still optimized.
	
Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this
issue, but we will encounter this race as you mentioned below.

Thanks.
	
>       hugetlb_set_page_hwpoison
>         allocate raw_hwp_page
>         TestSetPageHWPoison
>                                      update_and_free_page
>                                        __update_and_free_page
>                                          if (PageHWPoison)
>                                            hugetlb_clear_page_hwpoison
>                                              TestClearPageHWPoison
>                                              // remove all list items
>         llist_add
>     unlock hugetlb_lock
> 
> 
> > The end result seems not critical (leaking the raced raw_hwp_page?), but
> > we need a fix.
> 
> > 
> > > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > > +
> > > +		SetPageHWPoison(p->page);
> > > +		kfree(p);
> > > +	}
> > > +	llist_del_all(head);
> > 
> > If the above issue exists, moving ClearPageHWPoison(hpage) to here could
> > fix it. We should clear PageHWPoison carefully since the head page can
> > also be poisoned.
> 
> > The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe()
> > was that the raw error page can be the head page.  But this can be solved
> > with some additional code to remember whether the raw_hwp_page list has an
> > item associated with the head page.
>
> > Or another approach I have in mind now is to call hugetlb_clear_page_hwpoison()
> > while holding mf_mutex.
> 
> > 
> > Thanks.
> 
> > Thank you for the valuable feedback.
> 
> - Naoya Horiguchi
> 
> > 
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * Called from hugetlb code with hugetlb_lock held.
> > >   *
> > > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> > >  		goto out;
> > >  	}
> > >  
> > > -	if (TestSetPageHWPoison(head)) {
> > > +	if (hugetlb_set_page_hwpoison(head, page)) {
> > >  		ret = -EHWPOISON;
> > >  		goto out;
> > >  	}
> > > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> > >  	lock_page(head);
> > >  
> > >  	if (hwpoison_filter(p)) {
> > > -		ClearPageHWPoison(head);
> > > +		hugetlb_clear_page_hwpoison(head);
> > >  		res = -EOPNOTSUPP;
> > >  		goto out;
> > >  	}
> > > -- 
> > > 2.25.1
> > > 
> > > 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-28  6:26       ` Muchun Song
@ 2022-06-28  7:51         ` Muchun Song
  2022-06-28  8:17         ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 0 replies; 45+ messages in thread
From: Muchun Song @ 2022-06-28  7:51 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel

On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote:
> On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote:
> > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > > 
> > > > When handling memory error on a hugetlb page, the error handler tries to
> > > > dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> > > > PageHWPoison flag is moved to the raw error page, so that's all right.
> > > > However, dissolve sometimes fails; then the error page is left as a
> > > > hwpoisoned hugepage. It's useful if we can retry dissolving it to save
> > > > healthy pages, but that's not possible now because the information about
> > > > where the raw error pages are has been lost.
> > > > 
> > > > Use the private field of a few tail pages to keep that information.  The
> > > > code path of shrinking hugepage pool uses this info to try delayed dissolve.
> > > > In order to remember multiple errors in a hugepage, a singly-linked list
> > > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed.  Only
> > > > simple operations (adding an entry or clearing all) are required and the
> > > > list is assumed not to be very long, so this simple data structure should
> > > > be enough.
> > > > 
> > > > If we fail to save raw error info, the hwpoison hugepage has errors on an
> > > > unknown subpage and this new saving mechanism no longer works, so disable
> > > > both saving new raw error info and the freeing of hwpoison hugepages.
> > > > 
> > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > 
> > > Thanks for your work on this. I have several questions below.
> > > 
> > ...
> > 
> > > > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> > > >  }
> > > >  
> > > >  #ifdef CONFIG_HUGETLB_PAGE
> > > > +/*
> > > > + * Struct raw_hwp_page represents information about "raw error page",
> > > > + * constructing singly linked list originated from ->private field of
> > > > + * SUBPAGE_INDEX_HWPOISON-th tail page.
> > > > + */
> > > > +struct raw_hwp_page {
> > > > +	struct llist_node node;
> > > > +	struct page *page;
> > > > +};
> > > > +
> > > > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
> > > > +{
> > > > +	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> > > > +}
> > > > +
> > > > +static inline int raw_hwp_unreliable(struct page *hpage)
> > > > +{
> > > > +	return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
> > > > +}
> > > > +
> > > > +static inline void set_raw_hwp_unreliable(struct page *hpage)
> > > > +{
> > > > +	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
> > > > +}
> > > 
> > > Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly?
> > > 
> > 
> > OK, that sounds better, I'll do it.
> > 
> > > > +
> > > > +/*
> > > > + * about race consideration
> > > > + */
> > > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> > > > +					struct page *page)
> > > > +{
> > > > +	struct llist_head *head;
> > > > +	struct raw_hwp_page *raw_hwp;
> > > > +	struct llist_node *t, *tnode;
> > > > +	int ret;
> > > > +
> > > > +	/*
> > > > +	 * Once the hwpoison hugepage has lost reliable raw error info,
> > > > +	 * there is little mean to keep additional error info precisely,
> > > > +	 * so skip to add additional raw error info.
> > > > +	 */
> > > > +	if (raw_hwp_unreliable(hpage))
> > > > +		return -EHWPOISON;
> > > > +	head = raw_hwp_list_head(hpage);
> > > > +	llist_for_each_safe(tnode, t, head->first) {
> > > > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > > > +
> > > > +		if (p->page == page)
> > > > +			return -EHWPOISON;
> > > > +	}
> > > > +
> > > > +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> > > > +	/* the first error event will be counted in action_result(). */
> > > > +	if (ret)
> > > > +		num_poisoned_pages_inc();
> > > > +
> > > > +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
> > > 
> > > This function can be called in atomic context, so GFP_ATOMIC should be used
> > > here.
> > 
> > OK, I'll use GFP_ATOMIC.
> > 
> > > 
> > > > +	if (raw_hwp) {
> > > > +		raw_hwp->page = page;
> > > > +		llist_add(&raw_hwp->node, head);
> > > 
> > > The maximum number of items in the list is one, right?
> > 
> > The maximum is the number of subpages in the hugepage (512 for 2MB hugepage,
> > 262144 for 1GB hugepage).
> >
> 
> You are right. I have missed something yesterday.
>  
> > > 
> > > > +	} else {
> > > > +		/*
> > > > +		 * Failed to save raw error info.  We no longer trace all
> > > > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > > > +		 * this hwpoisoned hugepage.
> > > > +		 */
> > > > +		set_raw_hwp_unreliable(hpage);
> > > > +		return ret;
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > > > +{
> > > > +	struct llist_head *head;
> > > > +	struct llist_node *t, *tnode;
> > > > +
> > > > +	if (raw_hwp_unreliable(hpage))
> > > > +		return -EBUSY;
> > > 
> > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
> > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?
> > 
> > Sorry if I'm missing your point, but raw_hwp_unreliable is set when
> > allocating a raw_hwp_page fails.  hugetlb_set_page_hwpoison() can be called
> 
> Sorry. I have missed this. Thanks for your clarification.
> 
> > multiple times on a hugepage, and if one of the calls fails, the hwpoisoned
> > hugepage becomes unreliable.
> > 
> > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(),
> > the kmalloc() never fails, so we no longer have to implement this unreliable
> 
> No. kmalloc() with GFP_ATOMIC can fail unless I'm missing something important.
> 
> > flag, and things get simpler.
> > 
> > > 
> > > > +	ClearPageHWPoison(hpage);
> > > > +	head = raw_hwp_list_head(hpage);
> > > > +	llist_for_each_safe(tnode, t, head->first) {
> > > 
> > > Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
> > > not traverse it (since we have already cleared the page's PageHWPoison)? Then we
> > > would ignore a real hwpoison page, right?
> > 
> > Maybe you are referring to a race like the one below. Yes, that's possible.
> >
> 
> Sorry, ignore my previous comments, I was thinking about this wrongly.
> 
> >   CPU 0                            CPU 1
> > 
> >                                    free_huge_page
> >                                      lock hugetlb_lock
> >                                      ClearHPageMigratable
> 				       remove_hugetlb_page()
> 				       // the page is non-HugeTLB now
> >                                      unlock hugetlb_lock
> >   get_huge_page_for_hwpoison
> >     lock hugetlb_lock
> >     __get_huge_page_for_hwpoison
> 
> 	// cannot reach here since it is not a HugeTLB page now.
> 	// So this race is impossible. Then we fall back to normal
> 	// page handling. It seems there is a new issue here.
> 	//
> 	// memory_failure()
> 	//	try_memory_failure_hugetlb()
> 	//	if (hugetlb)
> 	//		goto unlock_mutex;
> 	//	if (TestSetPageHWPoison(p)) {
> 	//	// This non-HugeTLB page's vmemmap is still optimized.
> 	
> Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this
                                               ^^^
                                I mean hugetlb_vmemmap_alloc().

> issue, but we will encounter this race as you mentioned below.
> 
> Thanks.
> 	
> >       hugetlb_set_page_hwpoison
> >         allocate raw_hwp_page
> >         TestSetPageHWPoison
> >                                      update_and_free_page
> >                                        __update_and_free_page
> >                                          if (PageHWPoison)
> >                                            hugetlb_clear_page_hwpoison
> >                                              TestClearPageHWPoison
> >                                              // remove all list items
> >         llist_add
> >     unlock hugetlb_lock
> > 
> > 
> > The end result seems not critical (leaking the raced raw_hwp_page?), but
> > we need a fix.
> > 
> > > 
> > > > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > > > +
> > > > +		SetPageHWPoison(p->page);
> > > > +		kfree(p);
> > > > +	}
> > > > +	llist_del_all(head);
> > > 
> > > If the above issue exists, moving ClearPageHWPoison(hpage) to here could
> > > fix it. We should clear PageHWPoison carefully since the head page can
> > > also be poisoned.
> > 
> > The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe()
> > was that the raw error page can be the head page.  But this can be solved
> > with some additional code to remember whether the raw_hwp_page list has an
> > item associated with the head page.
> >
> > Or another approach I have in mind now is to call hugetlb_clear_page_hwpoison()
> > while holding mf_mutex.
> > 
> > > 
> > > Thanks.
> > 
> > Thank you for the valuable feedback.
> > 
> > - Naoya Horiguchi
> > 
> > > 
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Called from hugetlb code with hugetlb_lock held.
> > > >   *
> > > > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	if (TestSetPageHWPoison(head)) {
> > > > +	if (hugetlb_set_page_hwpoison(head, page)) {
> > > >  		ret = -EHWPOISON;
> > > >  		goto out;
> > > >  	}
> > > > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> > > >  	lock_page(head);
> > > >  
> > > >  	if (hwpoison_filter(p)) {
> > > > -		ClearPageHWPoison(head);
> > > > +		hugetlb_clear_page_hwpoison(head);
> > > >  		res = -EOPNOTSUPP;
> > > >  		goto out;
> > > >  	}
> > > > -- 
> > > > 2.25.1
> > > > 
> > > > 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-28  6:26       ` Muchun Song
  2022-06-28  7:51         ` Muchun Song
@ 2022-06-28  8:17         ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-28 10:37           ` Muchun Song
  1 sibling, 1 reply; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-28  8:17 UTC (permalink / raw)
  To: Muchun Song
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel

On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote:
> On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote:
> > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
...
> > > > +	} else {
> > > > +		/*
> > > > +		 * Failed to save raw error info.  We no longer trace all
> > > > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > > > +		 * this hwpoisoned hugepage.
> > > > +		 */
> > > > +		set_raw_hwp_unreliable(hpage);
> > > > +		return ret;
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > > > +{
> > > > +	struct llist_head *head;
> > > > +	struct llist_node *t, *tnode;
> > > > +
> > > > +	if (raw_hwp_unreliable(hpage))
> > > > +		return -EBUSY;
> > > 
> > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
> > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?
> > 
> > Sorry if I'm missing your point, but raw_hwp_unreliable is set when
> > allocating a raw_hwp_page fails.  hugetlb_set_page_hwpoison() can be called
> 
> Sorry. I have missed this. Thanks for your clarification.
> 
> > multiple times on a hugepage, and if one of the calls fails, the hwpoisoned
> > hugepage becomes unreliable.
> > 
> > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(),
> > the kmalloc() never fails, so we no longer have to implement this unreliable
> 
> No. kmalloc() with GFP_ATOMIC can fail unless I'm missing something important.

OK, I had interpreted the comment about GFP_ATOMIC wrongly.

 * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
 * watermark is applied to allow access to "atomic reserves".
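
So the failure path has to stay. As a minimal illustrative sketch (not the
posted code), the allocation site in hugetlb_set_page_hwpoison() would keep
the fallback:

	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
	if (!raw_hwp) {
		/* GFP_ATOMIC dips into atomic reserves but can still fail. */
		set_raw_hwp_unreliable(hpage);
		return ret;
	}
	raw_hwp->page = page;
	llist_add(&raw_hwp->node, head);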
 

> > flag, and things get simpler.
> > 
> > > 
> > > > +	ClearPageHWPoison(hpage);
> > > > +	head = raw_hwp_list_head(hpage);
> > > > +	llist_for_each_safe(tnode, t, head->first) {
> > > 
> > > Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
> > > not traverse it (since we have already cleared the page's PageHWPoison)? Then we
> > > would ignore a real hwpoison page, right?
> > 
> > Maybe you are referring to a race like the one below. Yes, that's possible.
> >
> 
> Sorry, ignore my previous comments, I was thinking about this wrongly.
> 
> >   CPU 0                            CPU 1
> > 
> >                                    free_huge_page
> >                                      lock hugetlb_lock
> >                                      ClearHPageMigratable
> 				       remove_hugetlb_page()
> 				       // the page is non-HugeTLB now

Oh, I missed that.

> >                                      unlock hugetlb_lock
> >   get_huge_page_for_hwpoison
> >     lock hugetlb_lock
> >     __get_huge_page_for_hwpoison
> 
> 	// cannot reach here since it is not a HugeTLB page now.
> 	// So this race is impossible. Then we fall back to normal
> 	// page handling. It seems there is a new issue here.
> 	//
> 	// memory_failure()
> 	//	try_memory_failure_hugetlb()
> 	//	if (hugetlb)
> 	//		goto unlock_mutex;
> 	//	if (TestSetPageHWPoison(p)) {
> 	//	// This non-HugeTLB page's vmemmap is still optimized.
> 	
> Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this
> issue, but we will encounter this race as you mentioned below.

I don't have clear ideas about this now (I haven't tested the vmemmap-optimized
case yet), so I will think more about this case. Maybe memory_failure() needs to
detect it, because memory_failure() heavily depends on the status of the struct
page.

Thanks,
Naoya Horiguchi

> 
> Thanks.
> 	
> >       hugetlb_set_page_hwpoison
> >         allocate raw_hwp_page
> >         TestSetPageHWPoison
> >                                      update_and_free_page
> >                                        __update_and_free_page
> >                                          if (PageHWPoison)
> >                                            hugetlb_clear_page_hwpoison
> >                                              TestClearPageHWPoison
> >                                              // remove all list items
> >         llist_add
> >     unlock hugetlb_lock
> > 
> > 
> > The end result seems not critical (leaking the raced raw_hwp_page?), but
> > we need a fix.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-27 17:25               ` Mike Kravetz
  2022-06-28  3:01                 ` Miaohe Lin
@ 2022-06-28  8:38                 ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-30  2:27                   ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-28  8:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Miaohe Lin, Muchun Song, Naoya Horiguchi, Andrew Morton,
	David Hildenbrand, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel, Linux-MM

On Mon, Jun 27, 2022 at 10:25:13AM -0700, Mike Kravetz wrote:
> On 06/27/22 06:02, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Fri, Jun 24, 2022 at 12:11:07PM -0700, Mike Kravetz wrote:
> > > On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> > > > > On 2022/6/24 16:03, Muchun Song wrote:
> > > > > > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> > > > > >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > > > > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > > > >>
> > > > > >> IIUC it might be better to do the below check:
> > > > > >> 	/*
> > > > > >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> > > > > >> 	 * allocation is not supported.
> > > > > >> 	 */
> > > > > >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> > > > > >> 		goto out;
> > > > > >>
> > > > > > 
> > > > > > The change looks good to me. However, the comment above is unnecessary
> > > > > > since gigantic_page_runtime_supported() is straightforward.
> > > > > 
> > > > > Agree. The comments can be removed.
> > > > 
> > > > Thank you, both. Adding !gigantic_page_runtime_supported without comment
> > > > makes sense to me.
> > > 
> > > The change above makes sense to me.  However, ...
> > > 
> > > If we make the change above, will we have the same strange situation described
> > > in the commit message when !gigantic_page_runtime_supported() is true?
> > > 
> > > IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
> > > pages can not be allocated or freed at run time.  They can only be
> > > allocated at boot time.  So, there should NEVER be surplus gigantic
> > > pages if !gigantic_page_runtime_supported().
> > 
> > I have the same understanding as the above.
> > 
> > >  To avoid this situation,
> > > perhaps we should change set_max_huge_pages as follows (not tested)?
> > 
> > The suggested diff looks clearer about what it does, so I'd like to take it
> > in the next version.  Then, what do we do about the "if (hstate_is_gigantic())"
> > check in return_unused_surplus_pages() in the original suggestion?  Should it
> > be kept as is, removed, or combined with !gigantic_page_runtime_supported()?
> > I guess that the new checks prevent calling return_unused_surplus_pages()
> > during pool shrinking, so the check no longer seems necessary.
> 
> My first thought was to keep the check in return_unused_surplus_pages() as it
> is called in other code paths.  However, it SHOULD only try to free surplus
> hugetlb pages.  With the modifications to set_max_huge_pages we will not
> have any surplus gigantic pages if !gigantic_page_runtime_supported, so
> the check can be removed.
> 
> Also note that we never try to dynamically allocate surplus gigantic pages.
> This also is left over from the time when we could not easily allocate a
> gigantic page at runtime.  It would not surprise me if someone found a use
> case to ease this restriction in the future.  Especially so if 1G THP support
> is ever added.  If this happens, the check would become necessary, and I would
> guess that it would not be re-added.
> 
> Sorry for thinking out loud!!!  Although not necessary, it 'might' be a good
> idea to leave the check, because re-adding it would likely be overlooked if
> dynamic allocation of gigantic surplus pages is ever added.  I do not have a
> strong opinion.

OK, so let's keep the check.
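For reference, my understanding of the set_max_huge_pages() side is
roughly the following (just a sketch of the idea, not your exact diff;
unlocking of resize_lock etc. omitted):

	/*
	 * Gigantic pages cannot be allocated or freed at runtime here,
	 * so only allow shrinking the pool toward the boot-time size.
	 */
	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) {
		if (count > persistent_huge_pages(h)) {
			spin_unlock_irq(&hugetlb_lock);
			return -EINVAL;
		}
		/* Fall through to decrease the pool */
	}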

> 
> P.S. This also reminds me that a similar check should be added to the
> demote hugetlb code path.  It would be bad if !gigantic_page_runtime_supported
> and we demoted a gigantic page into numerous non-gigantic pages.  I will
> send a patch.

Sounds nice.
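I guess an early bail-out would be enough there, e.g. (the placement in
demote_store() is my assumption):

	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
		return -EINVAL;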

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  2022-06-28  8:17         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-28 10:37           ` Muchun Song
  0 siblings, 0 replies; 45+ messages in thread
From: Muchun Song @ 2022-06-28 10:37 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel

On Tue, Jun 28, 2022 at 08:17:55AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote:
> > On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote:
> > > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote:
> > > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ...
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * Failed to save raw error info.  We no longer trace all
> > > > > +		 * hwpoisoned subpages, and we need to refuse to free/dissolve
> > > > > +		 * this hwpoisoned hugepage.
> > > > > +		 */
> > > > > +		set_raw_hwp_unreliable(hpage);
> > > > > +		return ret;
> > > > > +	}
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > > > > +{
> > > > > +	struct llist_head *head;
> > > > > +	struct llist_node *t, *tnode;
> > > > > +
> > > > > +	if (raw_hwp_unreliable(hpage))
> > > > > +		return -EBUSY;
> > > > 
> > > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison()
> > > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here?
> > > 
> > > Sorry if I'm missing your point, but raw_hwp_unreliable is set when
> > > allocating a raw_hwp_page fails.  hugetlb_set_page_hwpoison() can be called
> > 
> > Sorry. I have missed this. Thanks for your clarification.
> > 
> > > multiple times on a hugepage and if one of the calls fails, the hwpoisoned
> > > hugepage becomes unreliable.
> > > 
> > > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(),
> > > the kmalloc() never fails, so we no longer have to implement this unreliable
> > 
> > No. kmalloc() with GFP_ATOMIC can fail unless I'm missing something important.
> 
> OK, I've interpreted the comment about GFP_ATOMIC wrongly.
> 
>  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>  * watermark is applied to allow access to "atomic reserves".
>  
> 
> > > flag, so things get simpler.
> > > 
> > > > 
> > > > > +	ClearPageHWPoison(hpage);
> > > > > +	head = raw_hwp_list_head(hpage);
> > > > > +	llist_for_each_safe(tnode, t, head->first) {
> > > > 
> > > > Is it possible that a new item is added by hugetlb_set_page_hwpoison() and we do
> > > > not traverse it (we have already cleared the page's PageHWPoison)? Then we would
> > > > ignore a real hwpoisoned page, right?
> > > 
> > > Maybe you mean a race like the one below. Yes, that's possible.
> > >
> > 
> > Sorry, ignore my previous comments; I was thinking about this wrongly.
> > 
> > >   CPU 0                            CPU 1
> > > 
> > >                                    free_huge_page
> > >                                      lock hugetlb_lock
> > >                                      ClearHPageMigratable
> > 				       remove_hugetlb_page()
> > 				       // the page is non-HugeTLB now
> 
> Oh, I missed that.
> 
> > >                                      unlock hugetlb_lock
> > >   get_huge_page_for_hwpoison
> > >     lock hugetlb_lock
> > >     __get_huge_page_for_hwpoison
> > 
> > 	// cannot reach here since it is not a HugeTLB page now.
> > 	// So this race is impossible. Then we fall back to normal
> > 	// page handling. Seems there is a new issue here.
> > 	//
> > 	// memory_failure()
> > 	//	try_memory_failure_hugetlb()
> > 	//	if (hugetlb)
> > 	//		goto unlock_mutex;
> > 	//	if (TestSetPageHWPoison(p)) {
> > 	//	// This non-HugeTLB page's vmemmap is still optimized.
> > 	
> > Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this
> > issue, but we will encounter this race as you mentioned below.
> 
> I don't have a clear idea about this yet (I haven't tested the vmemmap-optimized
> case), so I will think more about it. Maybe memory_failure() needs to detect it,
> because memory_failure() heavily depends on the status of the struct page.
>

Because HVO (HugeTLB Vmemmap Optimization) maps all tail vmemmap pages
read-only, we cannot write any data to those tail struct pages. It is
a new issue unrelated to this patch.
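To illustrate the constraint (a simplified sketch, not actual code
paths; the offset is illustrative):

	/*
	 * With HVO the vmemmap PTEs covering the tail struct pages all
	 * alias a few shared physical pages, mapped read-only:
	 *
	 *   struct page area:  [head RW][tail RO][tail RO][tail RO]...
	 */
	struct page *tail = head + 512;

	if (PageHWPoison(tail))		/* reading via the RO alias is fine */
		pr_info("tail already poisoned\n");

	SetPageHWPoison(tail);		/* writing to the RO mapping faults */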

Thanks.
 
> Thanks,
> Naoya Horiguchi
> 
> > 
> > Thanks.
> > 	
> > >       hugetlb_set_page_hwpoison
> > >         allocate raw_hwp_page
> > >         TestSetPageHWPoison
> > >                                      update_and_free_page
> > >                                        __update_and_free_page
> > >                                          if (PageHWPoison)
> > >                                            hugetlb_clear_page_hwpoison
> > >                                              TestClearPageHWPoison
> > >                                              // remove all list items
> > >         llist_add
> > >     unlock hugetlb_lock
> > > 
> > > 
> > > The end result does not seem critical (leaking a raced raw_hwp_page?), but
> > > we need a fix.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages()
  2022-06-28  8:38                 ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-30  2:27                   ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 45+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-30  2:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Miaohe Lin, Muchun Song, Naoya Horiguchi, Andrew Morton,
	David Hildenbrand, Liu Shixin, Yang Shi, Oscar Salvador,
	linux-kernel, Linux-MM

On Tue, Jun 28, 2022 at 08:38:08AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Jun 27, 2022 at 10:25:13AM -0700, Mike Kravetz wrote:
> > On 06/27/22 06:02, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Fri, Jun 24, 2022 at 12:11:07PM -0700, Mike Kravetz wrote:
> > > > On 06/24/22 08:34, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > > On Fri, Jun 24, 2022 at 04:15:19PM +0800, Miaohe Lin wrote:
> > > > > > On 2022/6/24 16:03, Muchun Song wrote:
> > > > > > > On Fri, Jun 24, 2022 at 10:25:48AM +0800, Miaohe Lin wrote:
> > > > > > >> On 2022/6/24 7:51, Naoya Horiguchi wrote:
> > > > > > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > > > > >>
> > > > > > >> IIUC it might be better to do the below check:
> > > > > > >> 	/*
> > > > > > >> 	 * Cannot return gigantic pages currently if runtime gigantic page
> > > > > > >> 	 * allocation is not supported.
> > > > > > >> 	 */
> > > > > > >> 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> > > > > > >> 		goto out;
> > > > > > >>
> > > > > > > 
> > > > > > > The change looks good to me. However, the comment above is unnecessary
> > > > > > > since gigantic_page_runtime_supported() is straightforward.
> > > > > > 
> > > > > > Agree. The comments can be removed.
> > > > > 
> > > > > Thank you, both. Adding !gigantic_page_runtime_supported without comment
> > > > > makes sense to me.
> > > > 
> > > > The change above makes sense to me.  However, ...
> > > > 
> > > > If we make the change above, will we have the same strange situation described
> > > > in the commit message when !gigantic_page_runtime_supported() is true?
> > > > 
> > > > IIUC, !gigantic_page_runtime_supported implies that gigantic hugetlb
> > > > pages can not be allocated or freed at run time.  They can only be
> > > > allocated at boot time.  So, there should NEVER be surplus gigantic
> > > > pages if !gigantic_page_runtime_supported().
> > > 
> > > I have the same understanding as the above.
> > > 
> > > >  To avoid this situation,
> > > > perhaps we should change set_max_huge_pages as follows (not tested)?
> > > 
> > > The suggested diff looks clearer about what it does, so I'd like to take it
> > > in the next version.  Then, what do we do about the "if (hstate_is_gigantic())"
> > > check in return_unused_surplus_pages() in the original suggestion?  Should it
> > > be kept as is, removed, or combined with !gigantic_page_runtime_supported()?
> > > I guess that the new checks prevent calling return_unused_surplus_pages()
> > > during pool shrinking, so the check no longer seems necessary.
> > 
> > My first thought was to keep the check in return_unused_surplus_pages() as it
> > is called in other code paths.  However, it SHOULD only try to free surplus
> > hugetlb pages.  With the modifications to set_max_huge_pages we will not
> > have any surplus gigantic pages if !gigantic_page_runtime_supported, so
> > the check can be removed.
> > 
> > Also note that we never try to dynamically allocate surplus gigantic pages.
> > This also is left over from the time when we could not easily allocate a
> > gigantic page at runtime.  It would not surprise me if someone found a use
> > case to ease this restriction in the future.  Especially so if 1G THP support
> > is ever added.  If this happens, the check would become necessary, and I would
> > guess that it would not be re-added.
> > 
> > Sorry for thinking out loud!!!  Although not necessary, it 'might' be a good
> > idea to leave the check, because re-adding it would likely be overlooked if
> > dynamic allocation of gigantic surplus pages is ever added.  I do not have a
> > strong opinion.
> 
> OK, so let's keep the check.

Sorry, I found that keeping the check doesn't fix the problem.
I'll update the check with !gigantic_page_runtime_supported().
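i.e. the check in return_unused_surplus_pages() becomes (a sketch of
what I plan, along the lines Miaohe suggested):

	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
		goto out;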

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-06-30  2:27 UTC | newest]

Thread overview: 45+ messages
2022-06-23 23:51 [PATCH v2 0/9] mm, hwpoison: enable 1GB hugepage support (v2) Naoya Horiguchi
2022-06-23 23:51 ` [PATCH v2 1/9] mm/hugetlb: remove checking hstate_is_gigantic() in return_unused_surplus_pages() Naoya Horiguchi
2022-06-24  2:25   ` Miaohe Lin
2022-06-24  8:03     ` Muchun Song
2022-06-24  8:15       ` Miaohe Lin
2022-06-24  8:34         ` HORIGUCHI NAOYA(堀口 直也)
2022-06-24 19:11           ` Mike Kravetz
2022-06-27  6:02             ` HORIGUCHI NAOYA(堀口 直也)
2022-06-27 17:25               ` Mike Kravetz
2022-06-28  3:01                 ` Miaohe Lin
2022-06-28  8:38                 ` HORIGUCHI NAOYA(堀口 直也)
2022-06-30  2:27                   ` HORIGUCHI NAOYA(堀口 直也)
2022-06-23 23:51 ` [PATCH v2 2/9] mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Naoya Horiguchi
2022-06-24  9:23   ` Miaohe Lin
2022-06-27  6:06     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-24 20:57   ` Mike Kravetz
2022-06-27  6:48     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-27  7:57   ` Muchun Song
2022-06-23 23:51 ` [PATCH v2 3/9] mm/hugetlb: make pud_huge() and huge_pud() aware of non-present pud entry Naoya Horiguchi
2022-06-24  8:40   ` HORIGUCHI NAOYA(堀口 直也)
2022-06-25  0:02   ` Mike Kravetz
2022-06-27  7:07     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-25  9:42   ` Miaohe Lin
2022-06-27  7:24     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-23 23:51 ` [PATCH v2 4/9] mm, hwpoison, hugetlb: support saving mechanism of raw error pages Naoya Horiguchi
2022-06-27  3:16   ` Miaohe Lin
2022-06-27  7:56     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-27  9:26   ` Muchun Song
2022-06-28  2:41     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-28  6:26       ` Muchun Song
2022-06-28  7:51         ` Muchun Song
2022-06-28  8:17         ` HORIGUCHI NAOYA(堀口 直也)
2022-06-28 10:37           ` Muchun Song
2022-06-23 23:51 ` [PATCH v2 5/9] mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage Naoya Horiguchi
2022-06-27  8:32   ` Miaohe Lin
2022-06-27 11:48     ` Miaohe Lin
2022-06-23 23:51 ` [PATCH v2 6/9] mm, hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
2022-06-27  8:39   ` Miaohe Lin
2022-06-23 23:51 ` [PATCH v2 7/9] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
2022-06-27  9:02   ` Miaohe Lin
2022-06-28  6:02     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-23 23:51 ` [PATCH v2 8/9] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
2022-06-27 12:24   ` Miaohe Lin
2022-06-23 23:51 ` [PATCH v2 9/9] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
2022-06-28  2:06   ` Miaohe Lin
