linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/6] Fix some bugs about HugeTLB code
@ 2021-01-10 12:40 Muchun Song
  2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song

This patch series aims to fix some bugs and add some improvements.

Changelog since v2 -> v3:
  - Update commit log.
  - Use head[3].private to indicate that the page is freed in patch #3.

Changelog since v1 -> v2:
  - Export set_page_huge_active() in patch #2.
  - Use head[3].mapping to indicate that the page is freed in patch #3.
  - Flush @free_hpage_work in patch #4.

Muchun Song (6):
  mm: migrate: do not migrate HugeTLB page whose refcount is one
  mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
  mm: hugetlb: fix a race between freeing and dissolving the page
  mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  mm: hugetlb: fix a race between isolating and freeing page
  mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active

 fs/hugetlbfs/inode.c    |  3 ++-
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 59 +++++++++++++++++++++++++++++++++++++++++--------
 mm/migrate.c            |  6 +++++
 4 files changed, 60 insertions(+), 10 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  2021-01-12  9:42   ` Michal Hocko
  2021-01-12 11:00   ` David Hildenbrand
  2021-01-10 12:40 ` [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page Muchun Song
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song, Yang Shi

If the refcount of a HugeTLB page is one when it is about to be
migrated, the page was freed from under us, so we are done and do not
need to migrate it.

This optimization is consistent with the handling of regular pages in
unmap_and_move().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4385f2fb5d18..a6631c4eb6a6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		return -ENOSYS;
 	}
 
+	if (page_count(hpage) == 1) {
+		/* page was freed from under us. So we are done. */
+		putback_active_hugepage(hpage);
+		return MIGRATEPAGE_SUCCESS;
+	}
+
 	new_hpage = get_new_page(hpage, private);
 	if (!new_hpage)
 		return -ENOMEM;
-- 
2.11.0



* [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
  2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  2021-01-11 23:04   ` Mike Kravetz
  2021-01-12  9:45   ` Michal Hocko
  2021-01-10 12:40 ` [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page Muchun Song
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song, stable

If a new HugeTLB page is allocated during fallocate() it will not be
marked as active (set_page_huge_active), which will result in a later
isolate_huge_page() failure when the page migration code would like to
move that page. Such a failure would be unexpected and wrong.

Only set_page_huge_active() is exported; clear_page_huge_active() is
left static because it has no external users.

Fixes: 70c3547e36f5 (hugetlbfs: add hugetlbfs_fallocate())
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org
---
 fs/hugetlbfs/inode.c    | 3 ++-
 include/linux/hugetlb.h | 2 ++
 mm/hugetlb.c            | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..21c20fd5f9ee 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,9 +735,10 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
+		set_page_huge_active(page);
 		/*
 		 * unlock_page because locked by add_to_page_cache()
-		 * page_put due to reference from alloc_huge_page()
+		 * put_page() due to reference from alloc_huge_page()
 		 */
 		unlock_page(page);
 		put_page(page);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..b5807f23caf8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -770,6 +770,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 }
 #endif
 
+void set_page_huge_active(struct page *page);
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f3bf1710b66..4741d60f8955 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1348,7 +1348,7 @@ bool page_huge_active(struct page *page)
 }
 
 /* never called for tail page */
-static void set_page_huge_active(struct page *page)
+void set_page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
 	SetPagePrivate(&page[1]);
-- 
2.11.0



* [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
  2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
  2021-01-10 12:40 ` [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  2021-01-12  1:06   ` Mike Kravetz
  2021-01-12 10:02   ` Michal Hocko
  2021-01-10 12:40 ` [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page Muchun Song
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song, stable

There is a race condition between __free_huge_page()
and dissolve_free_huge_page().

CPU0:                         CPU1:

// page_count(page) == 1
put_page(page)
  __free_huge_page(page)
                              dissolve_free_huge_page(page)
                                spin_lock(&hugetlb_lock)
                                // PageHuge(page) && !page_count(page)
                                update_and_free_page(page)
                                // page is freed to the buddy
                                spin_unlock(&hugetlb_lock)
    spin_lock(&hugetlb_lock)
    clear_page_huge_active(page)
    enqueue_huge_page(page)
    // It is wrong, the page is already freed
    spin_unlock(&hugetlb_lock)

The race window is between put_page() and dissolve_free_huge_page().

We should make sure that the page is already on the free list
when it is dissolved.
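
The core idea, that a page may only be dissolved once it is actually on
the free list, can be modelled outside the kernel. The sketch below is
standalone C, not kernel code: struct page and the helpers are
simplified stand-ins for the names used in the patch (the real patch
stores the marker in the page_private of a tail page).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the field this model needs. */
struct page {
	unsigned long private;
};

/*
 * The "on the free list" marker lives in a tail page.  The model uses
 * head[1]; the actual patch uses a different tail page.
 */
static bool PageHugeFreed(struct page *head)
{
	return head[1].private == -1UL;
}

static void SetPageHugeFreed(struct page *head)
{
	head[1].private = -1UL;
}

static void ClearPageHugeFreed(struct page *head)
{
	head[1].private = 0;
}

/* enqueue_huge_page(): the page reaches the free list, mark it. */
static void enqueue_huge_page(struct page *head)
{
	SetPageHugeFreed(head);
}

/*
 * dissolve_free_huge_page(): refuse to dissolve a page that has not
 * reached the free list yet.  This closes the window in which
 * __free_huge_page() has dropped the last reference but has not yet
 * enqueued the page.
 */
static int dissolve_free_huge_page(struct page *head)
{
	if (!PageHugeFreed(head))
		return -EBUSY;	/* racing with the free path */
	ClearPageHugeFreed(head);
	return 0;		/* dissolved */
}
```

In the model, dissolving before the enqueue fails, and succeeds only
after the page has been marked as freed, which is exactly the ordering
the patch enforces under hugetlb_lock.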

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4741d60f8955..4a9011e12175 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
+static inline bool PageHugeFreed(struct page *head)
+{
+	return page_private(head + 4) == -1UL;
+}
+
+static inline void SetPageHugeFreed(struct page *head)
+{
+	set_page_private(head + 4, -1UL);
+}
+
+static inline void ClearPageHugeFreed(struct page *head)
+{
+	set_page_private(head + 4, 0);
+}
+
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
@@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
+	SetPageHugeFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1044,6 +1060,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 
 		list_move(&page->lru, &h->hugepage_activelist);
 		set_page_refcounted(page);
+		ClearPageHugeFreed(page);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
 		return page;
@@ -1504,6 +1521,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	spin_lock(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
+	ClearPageHugeFreed(page);
 	spin_unlock(&hugetlb_lock);
 }
 
@@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
 		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
+
+		/*
+		 * We should make sure that the page is already on the free list
+		 * when it is dissolved.
+		 */
+		if (unlikely(!PageHugeFreed(head)))
+			goto out;
+
 		/*
 		 * Move PageHWPoison flag from head page to the raw error page,
 		 * which makes any subpages rather than the error page reusable.
-- 
2.11.0



* [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
                   ` (2 preceding siblings ...)
  2021-01-10 12:40 ` [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  2021-01-12  1:20   ` Mike Kravetz
  2021-01-10 12:40 ` [PATCH v3 5/6] mm: hugetlb: fix a race between isolating and freeing page Muchun Song
  2021-01-10 12:40 ` [PATCH v3 6/6] mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active Muchun Song
  5 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song

There is a race between dissolve_free_huge_page() and put_page(), and
the race window is quite small. Theoretically, we should return -EBUSY
when we encounter this race. In practice, because the window is so
small, we have a good chance of dissolving the page successfully if we
retry. Seizing this opportunity increases the success rate of
dissolving the page.

If we free a HugeTLB page from a non-task context, it is deferred
through a workqueue. In this case, we need to flush the work.

dissolve_free_huge_page() can be called from memory hotplug; the
caller aims to free the HugeTLB page to the buddy allocator so that
it can unplug the page successfully.
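
As a rough illustration of the retry contract this patch sets up, here
is a standalone C sketch (illustrative names, not the kernel
implementation): the caller retries on -EAGAIN after flushing the
pending deferred free, mirroring the loop in
dissolve_free_huge_pages() plus flush_work(&free_hpage_work).

```c
#include <assert.h>
#include <errno.h>

/*
 * Model: dissolve fails with -EAGAIN while a deferred free is still
 * pending on the workqueue; flushing the work completes the free,
 * after which the retry succeeds.
 */
static int pending_frees = 1;

static int dissolve_free_huge_page(void)
{
	if (pending_frees > 0)
		return -EAGAIN;	/* page not on the free list yet */
	return 0;
}

/* Stand-in for flush_work(&free_hpage_work). */
static void flush_free_hpage_work(void)
{
	if (pending_frees > 0)
		pending_frees--;
}

/* The caller-side retry loop added to dissolve_free_huge_pages(). */
static int dissolve_with_retry(void)
{
	int rc;

	do {
		rc = dissolve_free_huge_page();
		if (rc == -EAGAIN)
			flush_free_hpage_work();	/* let the free finish */
	} while (rc == -EAGAIN);

	return rc;
}
```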

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a9011e12175..a176ceed55f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1763,10 +1763,11 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
  * nothing for in-use hugepages and non-hugepages.
  * This function returns values like below:
  *
- *  -EBUSY: failed to dissolved free hugepages or the hugepage is in-use
- *          (allocated or reserved.)
- *       0: successfully dissolved free hugepages or the page is not a
- *          hugepage (considered as already dissolved)
+ *  -EAGAIN: race with __free_huge_page() and can do a retry
+ *  -EBUSY:  failed to dissolve free hugepages or the hugepage is in-use
+ *           (allocated or reserved.)
+ *       0:  successfully dissolved free hugepages or the page is not a
+ *           hugepage (considered as already dissolved)
  */
 int dissolve_free_huge_page(struct page *page)
 {
@@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
 		 * We should make sure that the page is already on the free list
 		 * when it is dissolved.
 		 */
-		if (unlikely(!PageHugeFreed(head)))
+		if (unlikely(!PageHugeFreed(head))) {
+			rc = -EAGAIN;
 			goto out;
+		}
 
 		/*
 		 * Move PageHWPoison flag from head page to the raw error page,
@@ -1813,6 +1816,14 @@ int dissolve_free_huge_page(struct page *page)
 	}
 out:
 	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * If the freeing of the HugeTLB page is put on a work queue, we should
+	 * flush the work before retrying.
+	 */
+	if (unlikely(rc == -EAGAIN))
+		flush_work(&free_hpage_work);
+
 	return rc;
 }
 
@@ -1835,7 +1846,12 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
 		page = pfn_to_page(pfn);
+retry:
 		rc = dissolve_free_huge_page(page);
+		if (rc == -EAGAIN) {
+			cpu_relax();
+			goto retry;
+		}
 		if (rc)
 			break;
 	}
-- 
2.11.0



* [PATCH v3 5/6] mm: hugetlb: fix a race between isolating and freeing page
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
                   ` (3 preceding siblings ...)
  2021-01-10 12:40 ` [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  2021-01-10 12:40 ` [PATCH v3 6/6] mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active Muchun Song
  5 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song,
	Michal Hocko, stable

There is a race between isolate_huge_page() and __free_huge_page().

CPU0:                                       CPU1:

if (PageHuge(page))
                                            put_page(page)
                                              __free_huge_page(page)
                                                  spin_lock(&hugetlb_lock)
                                                  update_and_free_page(page)
                                                    set_compound_page_dtor(page,
                                                      NULL_COMPOUND_DTOR)
                                                  spin_unlock(&hugetlb_lock)
  isolate_huge_page(page)
    // trigger BUG_ON
    VM_BUG_ON_PAGE(!PageHead(page), page)
    spin_lock(&hugetlb_lock)
    page_huge_active(page)
      // trigger BUG_ON
      VM_BUG_ON_PAGE(!PageHuge(page), page)
    spin_unlock(&hugetlb_lock)

When we isolate a HugeTLB page on CPU0 while it is concurrently freed
to the buddy allocator on CPU1, CPU0 can trigger a BUG_ON because the
page has already been freed to the buddy allocator.
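
The shape of the fix, doing every check under hugetlb_lock and failing
the isolation instead of asserting, can be sketched as standalone C
(hypothetical names and a single-threaded model of the locked section,
not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * hugetlb_lock()/hugetlb_unlock() stand in for the real spinlock so
 * the locking discipline is visible in the sketch.
 */
static bool lock_held;
static void hugetlb_lock(void)   { assert(!lock_held); lock_held = true;  }
static void hugetlb_unlock(void) { assert(lock_held);  lock_held = false; }

static bool page_is_huge = true;	/* cleared when freed to buddy */
static bool page_is_active = true;
static int refcount = 1;

/*
 * After the fix, the huge-page check happens under the lock instead of
 * as an unlocked BUG_ON: a page freed on another CPU just makes the
 * isolation fail.
 */
static bool isolate_huge_page(void)
{
	bool ret = true;

	hugetlb_lock();
	if (!page_is_huge || !page_is_active || refcount == 0)
		ret = false;		/* page gone: fail, don't BUG */
	else
		refcount++;		/* get_page_unless_zero() */
	hugetlb_unlock();
	return ret;
}

/* The competing free path on the other CPU. */
static void free_huge_page_to_buddy(void)
{
	hugetlb_lock();
	page_is_huge = false;
	refcount = 0;
	hugetlb_unlock();
}
```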

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a176ceed55f1..e7ed30afbb8f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5575,9 +5575,9 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 {
 	bool ret = true;
 
-	VM_BUG_ON_PAGE(!PageHead(page), page);
 	spin_lock(&hugetlb_lock);
-	if (!page_huge_active(page) || !get_page_unless_zero(page)) {
+	if (!PageHeadHuge(page) || !page_huge_active(page) ||
+	    !get_page_unless_zero(page)) {
 		ret = false;
 		goto unlock;
 	}
-- 
2.11.0



* [PATCH v3 6/6] mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active
  2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
                   ` (4 preceding siblings ...)
  2021-01-10 12:40 ` [PATCH v3 5/6] mm: hugetlb: fix a race between isolating and freeing page Muchun Song
@ 2021-01-10 12:40 ` Muchun Song
  5 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-10 12:40 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Muchun Song,
	Michal Hocko, stable

page_huge_active() can be called from scan_movable_pages(), which
does not hold a reference count on the HugeTLB page. So when we call
page_huge_active() from scan_movable_pages(), the HugeTLB page can be
freed in parallel, and we will trigger the BUG_ON in
page_huge_active() when CONFIG_DEBUG_VM is enabled. Just remove the
VM_BUG_ON_PAGE.

Fixes: 7e1f049efb86 ("mm: hugetlb: cleanup using paeg_huge_active()")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e7ed30afbb8f..5940bf0c49b9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1360,8 +1360,7 @@ struct hstate *size_to_hstate(unsigned long size)
  */
 bool page_huge_active(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHuge(page), page);
-	return PageHead(page) && PagePrivate(&page[1]);
+	return PageHeadHuge(page) && PagePrivate(&page[1]);
 }
 
 /* never called for tail page */
-- 
2.11.0



* Re: [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
  2021-01-10 12:40 ` [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page Muchun Song
@ 2021-01-11 23:04   ` Mike Kravetz
  2021-01-12  9:45   ` Michal Hocko
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2021-01-11 23:04 UTC (permalink / raw)
  To: Muchun Song, akpm; +Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, stable

On 1/10/21 4:40 AM, Muchun Song wrote:
> If a new hugetlb page is allocated during fallocate it will not be
> marked as active (set_page_huge_active) which will result in a later
> isolate_huge_page failure when the page migration code would like to
> move that page. Such a failure would be unexpected and wrong.
> 
> Only export set_page_huge_active, just leave clear_page_huge_active
> as static. Because there are no external users.
> 
> Fixes: 70c3547e36f5 (hugetlbfs: add hugetlbfs_fallocate())
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org
> ---
>  fs/hugetlbfs/inode.c    | 3 ++-
>  include/linux/hugetlb.h | 2 ++
>  mm/hugetlb.c            | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)

Thanks.

Although page_huge_active is declared in page-flags.h, I much prefer the
declaration of set_page_huge_active to be in hugetlb.h.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


* Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-10 12:40 ` [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page Muchun Song
@ 2021-01-12  1:06   ` Mike Kravetz
  2021-01-12 10:02   ` Michal Hocko
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2021-01-12  1:06 UTC (permalink / raw)
  To: Muchun Song, akpm; +Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, stable

On 1/10/21 4:40 AM, Muchun Song wrote:
> There is a race condition between __free_huge_page()
> and dissolve_free_huge_page().
> 
> CPU0:                         CPU1:
> 
> // page_count(page) == 1
> put_page(page)
>   __free_huge_page(page)
>                               dissolve_free_huge_page(page)
>                                 spin_lock(&hugetlb_lock)
>                                 // PageHuge(page) && !page_count(page)
>                                 update_and_free_page(page)
>                                 // page is freed to the buddy
>                                 spin_unlock(&hugetlb_lock)
>     spin_lock(&hugetlb_lock)
>     clear_page_huge_active(page)
>     enqueue_huge_page(page)
>     // It is wrong, the page is already freed
>     spin_unlock(&hugetlb_lock)
> 
> The race window is between put_page() and dissolve_free_huge_page().
> 
> We should make sure that the page is already on the free list
> when it is dissolved.
> 
> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org
> ---
>  mm/hugetlb.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)

Thanks,

It is unfortunate that we have to add more huge page state information
to fix this issue.  However, I believe we have explored all other options.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz


* Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-10 12:40 ` [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page Muchun Song
@ 2021-01-12  1:20   ` Mike Kravetz
  2021-01-12  8:33     ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Kravetz @ 2021-01-12  1:20 UTC (permalink / raw)
  To: Muchun Song, akpm, mhocko; +Cc: n-horiguchi, ak, linux-mm, linux-kernel

On 1/10/21 4:40 AM, Muchun Song wrote:
> There is a race between dissolve_free_huge_page() and put_page(),
> and the race window is quite small. Theoretically, we should return
> -EBUSY when we encounter this race. In fact, we have a chance to
> successfully dissolve the page if we do a retry. Because the race
> window is quite small. If we seize this opportunity, it is an
> optimization for increasing the success rate of dissolving page.
> 
> If we free a HugeTLB page from a non-task context, it is deferred
> through a workqueue. In this case, we need to flush the work.
> 
> The dissolve_free_huge_page() can be called from memory hotplug,
> the caller aims to free the HugeTLB page to the buddy allocator
> so that the caller can unplug the page successfully.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)

I am unsure about the need for this patch.  The code is OK, there are no
issues with the code.

As mentioned in the commit message, this is an optimization and could
potentially cause a memory offline operation to succeed instead of fail.
However, we are very unlikely to ever exercise this code.  Adding an
optimization that is unlikely to be exercised is certainly questionable.

Memory offline is the only code that could benefit from this optimization.
As someone with more memory offline user experience, what is your opinion
Michal?

-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a9011e12175..a176ceed55f1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1763,10 +1763,11 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>   * nothing for in-use hugepages and non-hugepages.
>   * This function returns values like below:
>   *
> - *  -EBUSY: failed to dissolved free hugepages or the hugepage is in-use
> - *          (allocated or reserved.)
> - *       0: successfully dissolved free hugepages or the page is not a
> - *          hugepage (considered as already dissolved)
> + *  -EAGAIN: race with __free_huge_page() and can do a retry
> + *  -EBUSY:  failed to dissolved free hugepages or the hugepage is in-use
> + *           (allocated or reserved.)
> + *       0:  successfully dissolved free hugepages or the page is not a
> + *           hugepage (considered as already dissolved)
>   */
>  int dissolve_free_huge_page(struct page *page)
>  {
> @@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
>  		 * We should make sure that the page is already on the free list
>  		 * when it is dissolved.
>  		 */
> -		if (unlikely(!PageHugeFreed(head)))
> +		if (unlikely(!PageHugeFreed(head))) {
> +			rc = -EAGAIN;
>  			goto out;
> +		}
>  
>  		/*
>  		 * Move PageHWPoison flag from head page to the raw error page,
> @@ -1813,6 +1816,14 @@ int dissolve_free_huge_page(struct page *page)
>  	}
>  out:
>  	spin_unlock(&hugetlb_lock);
> +
> +	/*
> +	 * If the freeing of the HugeTLB page is put on a work queue, we should
> +	 * flush the work before retrying.
> +	 */
> +	if (unlikely(rc == -EAGAIN))
> +		flush_work(&free_hpage_work);
> +
>  	return rc;
>  }
>  
> @@ -1835,7 +1846,12 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
>  		page = pfn_to_page(pfn);
> +retry:
>  		rc = dissolve_free_huge_page(page);
> +		if (rc == -EAGAIN) {
> +			cpu_relax();
> +			goto retry;
> +		}
>  		if (rc)
>  			break;
>  	}
> 


* Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12  1:20   ` Mike Kravetz
@ 2021-01-12  8:33     ` Michal Hocko
  2021-01-12  9:51       ` [External] " Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12  8:33 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Muchun Song, akpm, n-horiguchi, ak, linux-mm, linux-kernel

On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> On 1/10/21 4:40 AM, Muchun Song wrote:
> > There is a race between dissolve_free_huge_page() and put_page(),
> > and the race window is quite small. Theoretically, we should return
> > -EBUSY when we encounter this race. In fact, we have a chance to
> > successfully dissolve the page if we do a retry. Because the race
> > window is quite small. If we seize this opportunity, it is an
> > optimization for increasing the success rate of dissolving page.
> > 
> > If we free a HugeTLB page from a non-task context, it is deferred
> > through a workqueue. In this case, we need to flush the work.
> > 
> > The dissolve_free_huge_page() can be called from memory hotplug,
> > the caller aims to free the HugeTLB page to the buddy allocator
> > so that the caller can unplug the page successfully.
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> >  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> I am unsure about the need for this patch.  The code is OK, there are no
> issues with the code.
> 
> As mentioned in the commit message, this is an optimization and could
> potentially cause a memory offline operation to succeed instead of fail.
> However, we are very unlikely to ever exercise this code.  Adding an
> optimization that is unlikely to be exercised is certainly questionable.
> 
> Memory offline is the only code that could benefit from this optimization.
> As someone with more memory offline user experience, what is your opinion
> Michal?

I am not a great fan of optimizations without any data to back them up.
I do not see any sign that this code has actually been tested and the
condition triggered.

Besides that, I have requested an explanation of why blocking on the
WQ is safe, and that hasn't happened.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
@ 2021-01-12  9:42   ` Michal Hocko
  2021-01-12  9:43     ` [External] " Muchun Song
  2021-01-12 11:00   ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12  9:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, akpm, n-horiguchi, ak, linux-mm, linux-kernel, Yang Shi

On Sun 10-01-21 20:40:12, Muchun Song wrote:
> If the refcount is one when it is migrated, it means that the page
> was freed from under us. So we are done and do not need to migrate.

I would consider the following easier to understand. Feel free to reuse.
"
All pages isolated for the migration have an elevated reference count,
so seeing a reference count equal to 1 means that the last user of the
page has dropped the reference and the page has become unused, and it
doesn't make much sense to migrate it anymore. This has been done for
regular pages and this patch does the same for hugetlb pages. Although
the likelihood of the race is rather small for hugetlb pages, it makes
sense to keep the two code paths in sync.
"

> 
> This optimization is consistent with the regular pages, just like
> unmap_and_move() does.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Yang Shi <shy828301@gmail.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/migrate.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4385f2fb5d18..a6631c4eb6a6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		return -ENOSYS;
>  	}
>  
> +	if (page_count(hpage) == 1) {
> +		/* page was freed from under us. So we are done. */
> +		putback_active_hugepage(hpage);
> +		return MIGRATEPAGE_SUCCESS;
> +	}
> +
>  	new_hpage = get_new_page(hpage, private);
>  	if (!new_hpage)
>  		return -ENOMEM;
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12  9:42   ` Michal Hocko
@ 2021-01-12  9:43     ` Muchun Song
  0 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-12  9:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, Yang Shi

On Tue, Jan 12, 2021 at 5:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 10-01-21 20:40:12, Muchun Song wrote:
> > If the refcount is one when it is migrated, it means that the page
> > was freed from under us. So we are done and do not need to migrate.
>
> I would consider the following easier to understand. Feel free to reuse.
> "
> All pages isolated for the migration have an elevated reference count,
> so seeing a reference count equal to 1 means that the last user of the
> page has dropped the reference and the page has become unused, and it
> doesn't make much sense to migrate it anymore. This has been done for
> regular pages and this patch does the same for hugetlb pages. Although
> the likelihood of the race is rather small for hugetlb pages, it makes
> sense to keep the two code paths in sync.
> "

Thanks.

>
> >
> > This optimization is consistent with the regular pages, just like
> > unmap_and_move() does.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Acked-by: Yang Shi <shy828301@gmail.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

>
> > ---
> >  mm/migrate.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4385f2fb5d18..a6631c4eb6a6 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >               return -ENOSYS;
> >       }
> >
> > +     if (page_count(hpage) == 1) {
> > +             /* page was freed from under us. So we are done. */
> > +             putback_active_hugepage(hpage);
> > +             return MIGRATEPAGE_SUCCESS;
> > +     }
> > +
> >       new_hpage = get_new_page(hpage, private);
> >       if (!new_hpage)
> >               return -ENOMEM;
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
  2021-01-10 12:40 ` [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page Muchun Song
  2021-01-11 23:04   ` Mike Kravetz
@ 2021-01-12  9:45   ` Michal Hocko
  1 sibling, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2021-01-12  9:45 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, akpm, n-horiguchi, ak, linux-mm, linux-kernel, stable

On Sun 10-01-21 20:40:13, Muchun Song wrote:
> If a new hugetlb page is allocated during fallocate it will not be
> marked as active (set_page_huge_active) which will result in a later
> isolate_huge_page failure when the page migration code would like to
> move that page. Such a failure would be unexpected and wrong.
> 
> Only export set_page_huge_active(); leave clear_page_huge_active
> static, because it has no external users.
> 
> Fixes: 70c3547e36f5 (hugetlbfs: add hugetlbfs_fallocate())
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  fs/hugetlbfs/inode.c    | 3 ++-
>  include/linux/hugetlb.h | 2 ++
>  mm/hugetlb.c            | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index b5c109703daa..21c20fd5f9ee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -735,9 +735,10 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
> +		set_page_huge_active(page);
>  		/*
>  		 * unlock_page because locked by add_to_page_cache()
> -		 * page_put due to reference from alloc_huge_page()
> +		 * put_page() due to reference from alloc_huge_page()
>  		 */
>  		unlock_page(page);
>  		put_page(page);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..b5807f23caf8 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -770,6 +770,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +void set_page_huge_active(struct page *page);
> +
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f3bf1710b66..4741d60f8955 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1348,7 +1348,7 @@ bool page_huge_active(struct page *page)
>  }
>  
>  /* never called for tail page */
> -static void set_page_huge_active(struct page *page)
> +void set_page_huge_active(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
>  	SetPagePrivate(&page[1]);
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12  8:33     ` Michal Hocko
@ 2021-01-12  9:51       ` Muchun Song
  2021-01-12 10:06         ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12  9:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML

On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > There is a race between dissolve_free_huge_page() and put_page(),
> > > and the race window is quite small. Theoretically, we should return
> > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > successfully dissolve the page if we do a retry, because the race
> > > window is quite small. If we seize this opportunity, it is an
> > > optimization that increases the success rate of dissolving the page.
> > >
> > > If we free a HugeTLB page from a non-task context, it is deferred
> > > through a workqueue. In this case, we need to flush the work.
> > >
> > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > the caller aims to free the HugeTLB page to the buddy allocator
> > > so that the caller can unplug the page successfully.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > >  1 file changed, 21 insertions(+), 5 deletions(-)
> >
> > I am unsure about the need for this patch.  The code is OK, there are no
> > issues with the code.
> >
> > As mentioned in the commit message, this is an optimization and could
> > potentially cause a memory offline operation to succeed instead of fail.
> > However, we are very unlikely to ever exercise this code.  Adding an
> > optimization that is unlikely to be exercised is certainly questionable.
> >
> > Memory offline is the only code that could benefit from this optimization.
> > As someone with more memory offline user experience, what is your opinion
> > Michal?
>
> I am not a great fan of optimizations without any data to back them up.
> I do not see any sign that this code has actually been tested and the
> condition triggered.

The race window is quite small. I triggered this only once on my server,
and then the kernel panicked. So I sent this patch series to fix some
bugs.

>
> Besides that I have requested to have an explanation of why blocking on
> the WQ is safe and that hasn't happened.

I have looked at all the callers of dissolve_free_huge_page(); some callers
hold the page lock (via lock_page), and the others also run in a sleepable context.

So I think that blocking on the WQ is safe. Right?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-10 12:40 ` [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page Muchun Song
  2021-01-12  1:06   ` Mike Kravetz
@ 2021-01-12 10:02   ` Michal Hocko
  2021-01-12 10:13     ` [External] " Muchun Song
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 10:02 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, akpm, n-horiguchi, ak, linux-mm, linux-kernel, stable

On Sun 10-01-21 20:40:14, Muchun Song wrote:
[...]
> @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
>  		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
> +
> +		/*
> +		 * We should make sure that the page is already on the free list
> +		 * when it is dissolved.
> +		 */
> +		if (unlikely(!PageHugeFreed(head)))
> +			goto out;
> +

Do you really want to report EBUSY in this case? This doesn't make much
sense to me TBH. I believe you want to return 0, the same as when you race
and the page is no longer PageHuge.

>  		/*
>  		 * Move PageHWPoison flag from head page to the raw error page,
>  		 * which makes any subpages rather than the error page reusable.
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12  9:51       ` [External] " Muchun Song
@ 2021-01-12 10:06         ` Michal Hocko
  2021-01-12 10:49           ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 10:06 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML

On Tue 12-01-21 17:51:05, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > and the race window is quite small. Theoretically, we should return
> > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > successfully dissolve the page if we do a retry, because the race
> > > > window is quite small. If we seize this opportunity, it is an
> > > > optimization that increases the success rate of dissolving the page.
> > > >
> > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > through a workqueue. In this case, we need to flush the work.
> > > >
> > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > so that the caller can unplug the page successfully.
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > >
> > > I am unsure about the need for this patch.  The code is OK, there are no
> > > issues with the code.
> > >
> > > As mentioned in the commit message, this is an optimization and could
> > > potentially cause a memory offline operation to succeed instead of fail.
> > > However, we are very unlikely to ever exercise this code.  Adding an
> > > optimization that is unlikely to be exercised is certainly questionable.
> > >
> > > Memory offline is the only code that could benefit from this optimization.
> > > As someone with more memory offline user experience, what is your opinion
> > > Michal?
> >
> > I am not a great fan of optimizations without any data to back them up.
> > I do not see any sign that this code has actually been tested and the
> > condition triggered.
> 
> The race window is quite small. I triggered this only once on my server,
> and then the kernel panicked. So I sent this patch series to fix some
> bugs.

Memory hotplug shouldn't panic when this race happens. Are you sure you
have seen a race that is directly related to this patch?

> > Besides that I have requested to have an explanation of why blocking on
> > the WQ is safe and that hasn't happened.
> 
> I have looked at all the callers of dissolve_free_huge_page(); some callers
> hold the page lock (via lock_page), and the others also run in a sleepable context.
> 
> So I think that blocking on the WQ is safe. Right?

I have requested that you explicitly write down your thinking on why this
is safe so that we can double-check it. Dependency on work queue progress
is much more complex than any other lock because there is no guarantee
that the WQ will make forward progress (all workers might be stuck, new
workers might fail to be created, etc.).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-12 10:02   ` Michal Hocko
@ 2021-01-12 10:13     ` Muchun Song
  2021-01-12 11:17       ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12 10:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, linux-stable

On Tue, Jan 12, 2021 at 6:02 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 10-01-21 20:40:14, Muchun Song wrote:
> [...]
> > @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
> >               int nid = page_to_nid(head);
> >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> >                       goto out;
> > +
> > +             /*
> > +              * We should make sure that the page is already on the free list
> > +              * when it is dissolved.
> > +              */
> > +             if (unlikely(!PageHugeFreed(head)))
> > +                     goto out;
> > +
>
> Do you really want to report EBUSY in this case? This doesn't make much
> sense to me TBH. I believe you want to return 0, the same as when you race
> and the page is no longer PageHuge.

Returning 0 is wrong, because the page has not been freed to the buddy allocator.
IIUC, dissolve_free_huge_page() returns 0 when the page has already been freed
to the buddy allocator. Right?

>
> >               /*
> >                * Move PageHWPoison flag from head page to the raw error page,
> >                * which makes any subpages rather than the error page reusable.
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12 10:06         ` Michal Hocko
@ 2021-01-12 10:49           ` Muchun Song
  2021-01-12 11:11             ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12 10:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML

On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > and the race window is quite small. Theoretically, we should return
> > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > successfully dissolve the page if we do a retry, because the race
> > > > > window is quite small. If we seize this opportunity, it is an
> > > > > optimization that increases the success rate of dissolving the page.
> > > > >
> > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > through a workqueue. In this case, we need to flush the work.
> > > > >
> > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > so that the caller can unplug the page successfully.
> > > > >
> > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > ---
> > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > >
> > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > issues with the code.
> > > >
> > > > As mentioned in the commit message, this is an optimization and could
> > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > optimization that is unlikely to be exercised is certainly questionable.
> > > >
> > > > Memory offline is the only code that could benefit from this optimization.
> > > > As someone with more memory offline user experience, what is your opinion
> > > > Michal?
> > >
> > > I am not a great fan of optimizations without any data to back them up.
> > > I do not see any sign that this code has actually been tested and the
> > > condition triggered.
> >
> > The race window is quite small. I triggered this only once on my server,
> > and then the kernel panicked. So I sent this patch series to fix some
> > bugs.
>
> Memory hotplug shouldn't panic when this race happens. Are you sure you
> have seen a race that is directly related to this patch?

I mean the panic is fixed by:

  [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

>
> > > Besides that I have requested to have an explanation of why blocking on
> > > the WQ is safe and that hasn't happened.
> >
> > I have looked at all the callers of dissolve_free_huge_page(); some callers
> > hold the page lock (via lock_page), and the others also run in a sleepable context.
> >
> > So I think that blocking on the WQ is safe. Right?
>
> I have requested to explicitly write your thinking why this is safe so
> that we can double check it. Dependency on a work queue progress is much
> more complex than any other locks because there is no guarantee that WQ
> will make forward progress (all workers might be stuck, new workers not
> able to be created etc.).

OK, I understand your concern. How about setting the page as temporary
when hitting this race?

 int dissolve_free_huge_page(struct page *page)
 {
@@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
                 * We should make sure that the page is already on the free list
                 * when it is dissolved.
                 */
-               if (unlikely(!PageHugeFreed(head)))
+               if (unlikely(!PageHugeFreed(head))) {
+                       SetPageHugeTemporary(page);
                        goto out;
+               }

Set the page as temporary and just return -EBUSY (do not flush
the work); __free_huge_page() will then free it to the buddy allocator later.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
  2021-01-12  9:42   ` Michal Hocko
@ 2021-01-12 11:00   ` David Hildenbrand
  2021-01-12 11:11     ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 11:00 UTC (permalink / raw)
  To: Muchun Song, mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Yang Shi

On 10.01.21 13:40, Muchun Song wrote:
> If the refcount is one when it is migrated, it means that the page
> was freed from under us. So we are done and do not need to migrate.
> 
> This optimization is consistent with the regular pages, just like
> unmap_and_move() does.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/migrate.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4385f2fb5d18..a6631c4eb6a6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		return -ENOSYS;
>  	}
>  
> +	if (page_count(hpage) == 1) {
> +		/* page was freed from under us. So we are done. */
> +		putback_active_hugepage(hpage);
> +		return MIGRATEPAGE_SUCCESS;
> +	}
> +
>  	new_hpage = get_new_page(hpage, private);
>  	if (!new_hpage)
>  		return -ENOMEM;
> 

Question: What if this is called via alloc_contig_range(), where we even want
to "migrate" free pages, meaning, relocate them?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 11:00   ` David Hildenbrand
@ 2021-01-12 11:11     ` David Hildenbrand
  2021-01-12 11:27       ` Michal Hocko
  2021-01-12 13:40       ` [External] " Muchun Song
  0 siblings, 2 replies; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 11:11 UTC (permalink / raw)
  To: Muchun Song, mike.kravetz, akpm
  Cc: n-horiguchi, ak, mhocko, linux-mm, linux-kernel, Yang Shi

On 12.01.21 12:00, David Hildenbrand wrote:
> On 10.01.21 13:40, Muchun Song wrote:
>> If the refcount is one when it is migrated, it means that the page
>> was freed from under us. So we are done and do not need to migrate.
>>
>> This optimization is consistent with the regular pages, just like
>> unmap_and_move() does.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Acked-by: Yang Shi <shy828301@gmail.com>
>> ---
>>  mm/migrate.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 4385f2fb5d18..a6631c4eb6a6 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>>  		return -ENOSYS;
>>  	}
>>  
>> +	if (page_count(hpage) == 1) {
>> +		/* page was freed from under us. So we are done. */
>> +		putback_active_hugepage(hpage);
>> +		return MIGRATEPAGE_SUCCESS;
>> +	}
>> +
>>  	new_hpage = get_new_page(hpage, private);
>>  	if (!new_hpage)
>>  		return -ENOMEM;
>>
> 
> Question: What if called via alloc_contig_range() where we even want to
> "migrate" free pages, meaning, relocate it?
> 

To be more precise:

a) We don't have dissolve_free_huge_pages() calls on the
alloc_contig_range() path. So we *need* migration IIUC.

b) dissolve_free_huge_pages() will fail if going below the reservation.
In that case we really want to migrate free pages. This even applies to
memory offlining.

Either I am missing something important or this patch is more dangerous
than it looks.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12 10:49           ` Muchun Song
@ 2021-01-12 11:11             ` Michal Hocko
  2021-01-12 11:34               ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 11:11 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML

On Tue 12-01-21 18:49:17, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > > and the race window is quite small. Theoretically, we should return
> > > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > > successfully dissolve the page if we do a retry, because the race
> > > > > > window is quite small. If we seize this opportunity, it is an
> > > > > > optimization that increases the success rate of dissolving the page.
> > > > > >
> > > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > > through a workqueue. In this case, we need to flush the work.
> > > > > >
> > > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > > so that the caller can unplug the page successfully.
> > > > > >
> > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > > ---
> > > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > > >
> > > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > > issues with the code.
> > > > >
> > > > > As mentioned in the commit message, this is an optimization and could
> > > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > > optimization that is unlikely to be exercised is certainly questionable.
> > > > >
> > > > > Memory offline is the only code that could benefit from this optimization.
> > > > > As someone with more memory offline user experience, what is your opinion
> > > > > Michal?
> > > >
> > > > I am not a great fan of optimizations without any data to back them up.
> > > > I do not see any sign that this code has actually been tested and the
> > > > condition triggered.
> > >
> > > The race window is quite small. I triggered this only once on my server,
> > > and then the kernel panicked. So I sent this patch series to fix some
> > > bugs.
> >
> > Memory hotplug shouldn't panic when this race happens. Are you sure you
> > have seen a race that is directly related to this patch?
> 
> I mean the panic is fixed by:
> 
>   [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

OK, so the answer is that this is not really triggered by any real life
problem. Can you actually trigger it intentionally?
 
> > > > Besides that I have requested to have an explanation of why blocking on
> > > > the WQ is safe and that hasn't happened.
> > >
> > > I have looked at all the callers of dissolve_free_huge_page(); some callers
> > > hold the page lock (via lock_page), and the others also run in a sleepable context.
> > >
> > > So I think that blocking on the WQ is safe. Right?
> >
> > I have requested to explicitly write your thinking why this is safe so
> > that we can double check it. Dependency on a work queue progress is much
> > more complex than any other locks because there is no guarantee that WQ
> > will make forward progress (all workers might be stuck, new workers not
> > able to be created etc.).
> 
> OK. I know about your concern. How about setting the page as temporary
> when hitting this race?
> 
>  int dissolve_free_huge_page(struct page *page)
>  {
> @@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
>                  * We should make sure that the page is already on the free list
>                  * when it is dissolved.
>                  */
> -               if (unlikely(!PageHugeFreed(head)))
> +               if (unlikely(!PageHugeFreed(head))) {
> +                       SetPageHugeTemporary(page);
>                         goto out;
> +               }
> 
> Setting the page as temporary and just return -EBUSY (do not flush
> the work). __free_huge_page() will free it to the buddy allocator later.

Can we stop these subtle hacks, please? A temporary page is meant to
represent an unaccounted temporary page used for migration. This has
nothing to do with that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-12 10:13     ` [External] " Muchun Song
@ 2021-01-12 11:17       ` Michal Hocko
  2021-01-12 11:43         ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 11:17 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, linux-stable

On Tue 12-01-21 18:13:02, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 6:02 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 10-01-21 20:40:14, Muchun Song wrote:
> > [...]
> > > @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
> > >               int nid = page_to_nid(head);
> > >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> > >                       goto out;
> > > +
> > > +             /*
> > > +              * We should make sure that the page is already on the free list
> > > +              * when it is dissolved.
> > > +              */
> > > +             if (unlikely(!PageHugeFreed(head)))
> > > +                     goto out;
> > > +
> >
> > Do you really want to report EBUSY in this case? This doesn't make much
> > sense to me TBH. I believe you want to return 0, the same as when you race
> > and the page is no longer PageHuge.
> 
> Returning 0 is wrong, because the page has not been freed to the buddy allocator.
> IIUC, dissolve_free_huge_page() returns 0 when the page has already been freed
> to the buddy allocator. Right?

0 is returned when the page is either dissolved or doesn't need
dissolving. If there is a race with somebody else freeing the page, then
there is nothing to dissolve. Under which condition does it make sense to
report the failure and/or retry dissolving?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 11:11     ` David Hildenbrand
@ 2021-01-12 11:27       ` Michal Hocko
  2021-01-12 11:34         ` David Hildenbrand
  2021-01-12 13:40       ` [External] " Muchun Song
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 11:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On Tue 12-01-21 12:11:21, David Hildenbrand wrote:
> On 12.01.21 12:00, David Hildenbrand wrote:
> > On 10.01.21 13:40, Muchun Song wrote:
> >> If the refcount is one when it is migrated, it means that the page
> >> was freed from under us. So we are done and do not need to migrate.
> >>
> >> This optimization is consistent with the regular pages, just like
> >> unmap_and_move() does.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> Acked-by: Yang Shi <shy828301@gmail.com>
> >> ---
> >>  mm/migrate.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 4385f2fb5d18..a6631c4eb6a6 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >>  		return -ENOSYS;
> >>  	}
> >>  
> >> +	if (page_count(hpage) == 1) {
> >> +		/* page was freed from under us. So we are done. */
> >> +		putback_active_hugepage(hpage);
> >> +		return MIGRATEPAGE_SUCCESS;
> >> +	}
> >> +
> >>  	new_hpage = get_new_page(hpage, private);
> >>  	if (!new_hpage)
> >>  		return -ENOMEM;
> >>
> > 
> > Question: What if called via alloc_contig_range() where we even want to
> > "migrate" free pages, meaning, relocate it?
> > 
> 
> To be more precise:
> 
> a) We don't have dissolve_free_huge_pages() calls on the
> alloc_contig_range() path. So we *need* migration IIUC.
> 
> b) dissolve_free_huge_pages() will fail if going below the reservation.
> In that case we really want to migrate free pages. This even applies to
> memory offlining.
> 
> Either I am missing something important or this patch is more dangerous
> than it looks like.

This is an interesting point. But do we try to migrate hugetlb pages in
alloc_contig_range? isolate_migratepages_block() requires !PageLRU pages to be
marked as PageMovable AFAICS. This would be quite easy to implement but
a more fundamental question is whether we really want to mess with
existing pools for alloc_contig_range.

Anyway, you are quite right that this change has more side effects than
are easy to see, while it doesn't really bring any major advantage
other than code consistency.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 11:27       ` Michal Hocko
@ 2021-01-12 11:34         ` David Hildenbrand
  2021-01-12 12:16           ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 11:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On 12.01.21 12:27, Michal Hocko wrote:
> On Tue 12-01-21 12:11:21, David Hildenbrand wrote:
>> On 12.01.21 12:00, David Hildenbrand wrote:
>>> On 10.01.21 13:40, Muchun Song wrote:
>>>> If the refcount is one when it is migrated, it means that the page
>>>> was freed from under us. So we are done and do not need to migrate.
>>>>
>>>> This optimization is consistent with the regular pages, just like
>>>> unmap_and_move() does.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Acked-by: Yang Shi <shy828301@gmail.com>
>>>> ---
>>>>  mm/migrate.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>>>>  		return -ENOSYS;
>>>>  	}
>>>>  
>>>> +	if (page_count(hpage) == 1) {
>>>> +		/* page was freed from under us. So we are done. */
>>>> +		putback_active_hugepage(hpage);
>>>> +		return MIGRATEPAGE_SUCCESS;
>>>> +	}
>>>> +
>>>>  	new_hpage = get_new_page(hpage, private);
>>>>  	if (!new_hpage)
>>>>  		return -ENOMEM;
>>>>
>>>
>>> Question: What if called via alloc_contig_range() where we even want to
>>> "migrate" free pages, meaning, relocate it?
>>>
>>
>> To be more precise:
>>
>> a) We don't have dissolve_free_huge_pages() calls on the
>> alloc_contig_range() path. So we *need* migration IIUC.
>>
>> b) dissolve_free_huge_pages() will fail if going below the reservation.
>> In that case we really want to migrate free pages. This even applies to
>> memory offlining.
>>
>> Either I am missing something important or this patch is more dangerous
>> than it looks like.
> 
> This is an interesting point. But do we try to migrate hugetlb pages in
> alloc_contig_range? isolate_migratepages_block !PageLRU need to be

I didn't test it so far (especially in the context of virtio-mem or
CMA), but have a TODO item on my long list of things to look at in the
future.

> marked as PageMovable AFAICS. This would be quite easy to implement but
> a more fundamental question is whether we really want to mess with
> existing pools for alloc_contig_range.

Can these pages fall onto ZONE_MOVABLE or even MIGRATE_CMA? If yes, we
really want to. And I think both are the case for "ordinary" huge pages
allocated via the buddy.

> 
> Anyway you are quite right that this change has more side effects than
> it is easy to see while it doesn't really bring any major advantage
> other than the consistency.

Free hugetlbfs pages are special. E.g., they cannot simply be skipped
when offlining. So I don't think the consistency argument really applies.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page
  2021-01-12 11:11             ` Michal Hocko
@ 2021-01-12 11:34               ` Muchun Song
  0 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-12 11:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML

On Tue, Jan 12, 2021 at 7:12 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 18:49:17, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > > > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > > > and the race window is quite small. Theoretically, we should return
> > > > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > > > successfully dissolve the page if we do a retry. Because the race
> > > > > > > window is quite small. If we seize this opportunity, it is an
> > > > > > > optimization for increasing the success rate of dissolving page.
> > > > > > >
> > > > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > > > through a workqueue. In this case, we need to flush the work.
> > > > > > >
> > > > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > > > so that the caller can unplug the page successfully.
> > > > > > >
> > > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > > > ---
> > > > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > > > issues with the code.
> > > > > >
> > > > > > As mentioned in the commit message, this is an optimization and could
> > > > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > > > optimization that is unlikely to be exercised is certainly questionable.
> > > > > >
> > > > > > Memory offline is the only code that could benefit from this optimization.
> > > > > > As someone with more memory offline user experience, what is your opinion
> > > > > > Michal?
> > > > >
> > > > > I am not a great fun of optimizations without any data to back them up.
> > > > > I do not see any sign this code has been actually tested and the
> > > > > condition triggered.
> > > >
> > > > This race window is quite small. I triggered this only once on my server,
> > > > and then the kernel panicked. So I sent this patch series to fix some
> > > > bugs.
> > >
> > > Memory hotplug shouldn't panic when this race happens. Are you sure you
> > > have seen a race that is directly related to this patch?
> >
> > I mean the panic is fixed by:
> >
> >   [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
>
> OK, so the answer is that this is not really triggered by any real life
> problem. Can you actually trigger it intentionally?
>
> > > > > Besides that I have requested to have an explanation of why blocking on
> > > > > the WQ is safe and that hasn't happened.
> > > >
> > > > I have looked at all the callers of dissolve_free_huge_page; some callers
> > > > hold the page lock (via lock_page). The others are also in a sleepable context.
> > > >
> > > > So I think that blocking on the WQ is safe. Right?
> > >
> > > I have requested to explicitly write your thinking why this is safe so
> > > that we can double check it. Dependency on a work queue progress is much
> > > more complex than any other locks because there is no guarantee that WQ
> > > will make forward progress (all workers might be stuck, new workers not
> > > able to be created etc.).
> >
> > OK. I know about your concern. How about setting the page as temporary
> > when hitting this race?
> >
> >  int dissolve_free_huge_page(struct page *page)
> >  {
> > @@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
> >                  * We should make sure that the page is already on the free list
> >                  * when it is dissolved.
> >                  */
> > -               if (unlikely(!PageHugeFreed(head)))
> > +               if (unlikely(!PageHugeFreed(head))) {
> > +                       SetPageHugeTemporary(page);
> >                         goto out;
> > +               }
> >
> > Setting the page as temporary and just return -EBUSY (do not flush
> > the work). __free_huge_page() will free it to the buddy allocator later.
>
> Can we stop these subtle hacks please? Temporary page is meant to
> represent unaccounted temporary page for migration. This has nothing to
> do with it.

Sure. I can drop this patch.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-12 11:17       ` Michal Hocko
@ 2021-01-12 11:43         ` Muchun Song
  2021-01-12 12:37           ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12 11:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, linux-stable

On Tue, Jan 12, 2021 at 7:17 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 18:13:02, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 6:02 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 10-01-21 20:40:14, Muchun Song wrote:
> > > [...]
> > > > @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
> > > >               int nid = page_to_nid(head);
> > > >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> > > >                       goto out;
> > > > +
> > > > +             /*
> > > > +              * We should make sure that the page is already on the free list
> > > > +              * when it is dissolved.
> > > > +              */
> > > > +             if (unlikely(!PageHugeFreed(head)))
> > > > +                     goto out;
> > > > +
> > >
> > > Do you really want to report EBUSY in this case? This doesn't make much
> > > sense to me TBH. I believe you want to return 0 same as when you race
> > > and the page is no longer PageHuge.
> >
> > Returning 0 is wrong, because the page is not freed to the buddy allocator.
> > IIUC, dissolve_free_huge_page returns 0 when the page has already been freed
> > to the buddy allocator. Right?
>
> 0 is returned when the page is either dissolved or it doesn't need
> dissolving. If there is a race with somebody else freeing the page then
> there is nothing to dissolve. Under which condition does it make sense to
> report the failure and/or retry dissolving?

If there is a race with somebody else freeing the page, the page
can be freed to the hugepage pool, not to the buddy allocator. Would
you consider that page dissolved?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 11:34         ` David Hildenbrand
@ 2021-01-12 12:16           ` Michal Hocko
  2021-01-12 14:23             ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 12:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On Tue 12-01-21 12:34:01, David Hildenbrand wrote:
> On 12.01.21 12:27, Michal Hocko wrote:
> > On Tue 12-01-21 12:11:21, David Hildenbrand wrote:
> >> On 12.01.21 12:00, David Hildenbrand wrote:
> >>> On 10.01.21 13:40, Muchun Song wrote:
> >>>> If the refcount is one when it is migrated, it means that the page
> >>>> was freed from under us. So we are done and do not need to migrate.
> >>>>
> >>>> This optimization is consistent with the regular pages, just like
> >>>> unmap_and_move() does.
> >>>>
> >>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>> Acked-by: Yang Shi <shy828301@gmail.com>
> >>>> ---
> >>>>  mm/migrate.c | 6 ++++++
> >>>>  1 file changed, 6 insertions(+)
> >>>>
> >>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>> index 4385f2fb5d18..a6631c4eb6a6 100644
> >>>> --- a/mm/migrate.c
> >>>> +++ b/mm/migrate.c
> >>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >>>>  		return -ENOSYS;
> >>>>  	}
> >>>>  
> >>>> +	if (page_count(hpage) == 1) {
> >>>> +		/* page was freed from under us. So we are done. */
> >>>> +		putback_active_hugepage(hpage);
> >>>> +		return MIGRATEPAGE_SUCCESS;
> >>>> +	}
> >>>> +
> >>>>  	new_hpage = get_new_page(hpage, private);
> >>>>  	if (!new_hpage)
> >>>>  		return -ENOMEM;
> >>>>
> >>>
> >>> Question: What if called via alloc_contig_range() where we even want to
> >>> "migrate" free pages, meaning, relocate it?
> >>>
> >>
> >> To be more precise:
> >>
> >> a) We don't have dissolve_free_huge_pages() calls on the
> >> alloc_contig_range() path. So we *need* migration IIUC.
> >>
> >> b) dissolve_free_huge_pages() will fail if going below the reservation.
> >> In that case we really want to migrate free pages. This even applies to
> >> memory offlining.
> >>
> >> Either I am missing something important or this patch is more dangerous
> >> than it looks like.
> > 
> > This is an interesting point. But do we try to migrate hugetlb pages in
> > alloc_contig_range? isolate_migratepages_block !PageLRU need to be
> 
> I didn't test it so far (especially in the context of virtio-mem or
> CMA), but have a TODO item on my long list of things to look at in the
> future.

I have looked around and it seems this would be more work than just
allowing the migration in a-c-r. migrate_huge_page_move_mapping resp.
hugetlbfs_migrate_page would need to special-case pool pages. Likely
non-trivial work and potentially another land-mine territory.

> 
> > marked as PageMovable AFAICS. This would be quite easy to implement but
> > a more fundamental question is whether we really want to mess with
> > existing pools for alloc_contig_range.
> 
> Can these pages fall onto ZONE_MOVABLE or even MIGRATE_CMA? If yes, we
> really want to. And I think both is the case for "ordinary" huge pages
> allocated via the buddy.

Yes, movable hugetlb pages can sit in the movable zone and in CMA as well
(see htlb_modify_alloc_mask).
 
> > Anyway you are quite right that this change has more side effects than
> > it is easy to see while it doesn't really bring any major advantage
> > other than the consistency.
> 
> Free hugetlbfs pages are special. E.g., they cannot simply be skipped
> when offlining. So I don't think consistency actually really applies.

Well, currently pool pages are not migratable, but you are right that
this is likely something that we will need to look into in the future,
and this optimization would stand in the way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-12 11:43         ` Muchun Song
@ 2021-01-12 12:37           ` Michal Hocko
  2021-01-12 15:05             ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 12:37 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, linux-stable

On Tue 12-01-21 19:43:21, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 7:17 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 12-01-21 18:13:02, Muchun Song wrote:
> > > On Tue, Jan 12, 2021 at 6:02 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 10-01-21 20:40:14, Muchun Song wrote:
> > > > [...]
> > > > > @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
> > > > >               int nid = page_to_nid(head);
> > > > >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> > > > >                       goto out;
> > > > > +
> > > > > +             /*
> > > > > +              * We should make sure that the page is already on the free list
> > > > > +              * when it is dissolved.
> > > > > +              */
> > > > > +             if (unlikely(!PageHugeFreed(head)))
> > > > > +                     goto out;
> > > > > +
> > > >
> > > > Do you really want to report EBUSY in this case? This doesn't make much
> > > > sense to me TBH. I believe you want to return 0 same as when you race
> > > > and the page is no longer PageHuge.
> > >
> > > Return 0 is wrong. Because the page is not freed to the buddy allocator.
> > > IIUC, dissolve_free_huge_page returns 0 when the page is already freed
> > > to the buddy allocator. Right?
> >
> > 0 is return when the page is either dissolved or it doesn't need
> > dissolving. If there is a race with somebody else freeing the page then
> > there is nothing to dissolve. Under which condition it makes sense to
> > report the failure and/or retry dissolving?
> 
> If there is a race with somebody else freeing the page, the page
> can be freed to the hugepage pool not the buddy allocator. Do
> you think that this page is dissolved?

OK, I see what you mean. Effectively the page would be in limbo: not
yet in the pool nor in the buddy allocator, but it can find its way to
either of the two. But I still dislike returning a failure because that
would mean e.g. memory hotplug failing. Can you simply retry inside this
code path (drop the lock, cond_resched and retry)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 11:11     ` David Hildenbrand
  2021-01-12 11:27       ` Michal Hocko
@ 2021-01-12 13:40       ` Muchun Song
  2021-01-12 13:51         ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12 13:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen, mhocko,
	Linux Memory Management List, LKML, Yang Shi

On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.01.21 12:00, David Hildenbrand wrote:
> > On 10.01.21 13:40, Muchun Song wrote:
> >> If the refcount is one when it is migrated, it means that the page
> >> was freed from under us. So we are done and do not need to migrate.
> >>
> >> This optimization is consistent with the regular pages, just like
> >> unmap_and_move() does.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> Acked-by: Yang Shi <shy828301@gmail.com>
> >> ---
> >>  mm/migrate.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 4385f2fb5d18..a6631c4eb6a6 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >>              return -ENOSYS;
> >>      }
> >>
> >> +    if (page_count(hpage) == 1) {
> >> +            /* page was freed from under us. So we are done. */
> >> +            putback_active_hugepage(hpage);
> >> +            return MIGRATEPAGE_SUCCESS;
> >> +    }
> >> +
> >>      new_hpage = get_new_page(hpage, private);
> >>      if (!new_hpage)
> >>              return -ENOMEM;
> >>
> >
> > Question: What if called via alloc_contig_range() where we even want to
> > "migrate" free pages, meaning, relocate it?
> >
>
> To be more precise:
>
> a) We don't have dissolve_free_huge_pages() calls on the
> alloc_contig_range() path. So we *need* migration IIUC.

Without this patch, if you want to migrate a HugeTLB page,
the page is freed to the hugepage pool. With this patch,
the page is also freed to the hugepage pool.
I don't see any difference. Am I missing something?


>
> b) dissolve_free_huge_pages() will fail if going below the reservation.
> In that case we really want to migrate free pages. This even applies to
> memory offlining.
>
> Either I am missing something important or this patch is more dangerous
> than it looks like.
>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 13:40       ` [External] " Muchun Song
@ 2021-01-12 13:51         ` David Hildenbrand
  2021-01-12 14:17           ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 13:51 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen, mhocko,
	Linux Memory Management List, LKML, Yang Shi

On 12.01.21 14:40, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.01.21 12:00, David Hildenbrand wrote:
>>> On 10.01.21 13:40, Muchun Song wrote:
>>>> If the refcount is one when it is migrated, it means that the page
>>>> was freed from under us. So we are done and do not need to migrate.
>>>>
>>>> This optimization is consistent with the regular pages, just like
>>>> unmap_and_move() does.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Acked-by: Yang Shi <shy828301@gmail.com>
>>>> ---
>>>>  mm/migrate.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>>>>              return -ENOSYS;
>>>>      }
>>>>
>>>> +    if (page_count(hpage) == 1) {
>>>> +            /* page was freed from under us. So we are done. */
>>>> +            putback_active_hugepage(hpage);
>>>> +            return MIGRATEPAGE_SUCCESS;
>>>> +    }
>>>> +
>>>>      new_hpage = get_new_page(hpage, private);
>>>>      if (!new_hpage)
>>>>              return -ENOMEM;
>>>>
>>>
>>> Question: What if called via alloc_contig_range() where we even want to
>>> "migrate" free pages, meaning, relocate it?
>>>
>>
>> To be more precise:
>>
>> a) We don't have dissolve_free_huge_pages() calls on the
>> alloc_contig_range() path. So we *need* migration IIUC.
> 
> Without this patch, if you want to migrate a HUgeTLB page,
> the page is freed to the hugepage pool. With this patch,
> the page is also freed to the hugepage pool.
> I didn't see any different. I am missing something?

I am definitely not an expert on hugetlb pools, that's why I am asking.

Isn't it the case that, with your code, no new page is allocated - so
dissolve_free_huge_pages() might just refuse to dissolve due to
reservations, bailing out, no?

(as discussed, looks like alloc_contig_range() needs to be fixed to
handle this correctly)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 13:51         ` David Hildenbrand
@ 2021-01-12 14:17           ` Muchun Song
  2021-01-12 14:28             ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Muchun Song @ 2021-01-12 14:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen, mhocko,
	Linux Memory Management List, LKML, Yang Shi

On Tue, Jan 12, 2021 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.01.21 14:40, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 12.01.21 12:00, David Hildenbrand wrote:
> >>> On 10.01.21 13:40, Muchun Song wrote:
> >>>> If the refcount is one when it is migrated, it means that the page
> >>>> was freed from under us. So we are done and do not need to migrate.
> >>>>
> >>>> This optimization is consistent with the regular pages, just like
> >>>> unmap_and_move() does.
> >>>>
> >>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>> Acked-by: Yang Shi <shy828301@gmail.com>
> >>>> ---
> >>>>  mm/migrate.c | 6 ++++++
> >>>>  1 file changed, 6 insertions(+)
> >>>>
> >>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>> index 4385f2fb5d18..a6631c4eb6a6 100644
> >>>> --- a/mm/migrate.c
> >>>> +++ b/mm/migrate.c
> >>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >>>>              return -ENOSYS;
> >>>>      }
> >>>>
> >>>> +    if (page_count(hpage) == 1) {
> >>>> +            /* page was freed from under us. So we are done. */
> >>>> +            putback_active_hugepage(hpage);
> >>>> +            return MIGRATEPAGE_SUCCESS;
> >>>> +    }
> >>>> +
> >>>>      new_hpage = get_new_page(hpage, private);
> >>>>      if (!new_hpage)
> >>>>              return -ENOMEM;
> >>>>
> >>>
> >>> Question: What if called via alloc_contig_range() where we even want to
> >>> "migrate" free pages, meaning, relocate it?
> >>>
> >>
> >> To be more precise:
> >>
> >> a) We don't have dissolve_free_huge_pages() calls on the
> >> alloc_contig_range() path. So we *need* migration IIUC.
> >
> > Without this patch, if you want to migrate a HUgeTLB page,
> > the page is freed to the hugepage pool. With this patch,
> > the page is also freed to the hugepage pool.
> > I didn't see any different. I am missing something?
>
> I am definitely not an expert on hugetlb pools, that's why I am asking.
>
> Isn't it, that with your code, no new page is allocated - so
> dissolve_free_huge_pages() might just refuse to dissolve due to
> reservations, bailing out, no?

Without this patch, the new page can be allocated from the
hugepage pool, and dissolve_free_huge_pages() can also
refuse to dissolve due to reservations. Right?

>
> (as discussed, looks like alloc_contig_range() needs to be fixed to
> handle this correctly)
>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 12:16           ` Michal Hocko
@ 2021-01-12 14:23             ` Michal Hocko
  2021-01-12 14:41               ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 14:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On Tue 12-01-21 13:16:45, Michal Hocko wrote:
[...]
> Well, currently pool pages are not migrateable but you are right that
> this is likely something that we will need to look into in the future
> and this optimization would stand in the way.

After some more thinking I believe I was wrong in my last statement.
This optimization shouldn't have any effect on pages in the pool, as
those stay at reference count 0 and they cannot be isolated either
(clear_page_huge_active is called before a page is enqueued).

That being said, the migration code would still have to learn about
these pages, but that is out of scope of this discussion.

Sorry about the confusion from my side.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 14:17           ` Muchun Song
@ 2021-01-12 14:28             ` David Hildenbrand
  2021-01-12 14:59               ` Muchun Song
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 14:28 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen, mhocko,
	Linux Memory Management List, LKML, Yang Shi

On 12.01.21 15:17, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.01.21 14:40, Muchun Song wrote:
>>> On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 12.01.21 12:00, David Hildenbrand wrote:
>>>>> On 10.01.21 13:40, Muchun Song wrote:
>>>>>> If the refcount is one when it is migrated, it means that the page
>>>>>> was freed from under us. So we are done and do not need to migrate.
>>>>>>
>>>>>> This optimization is consistent with the regular pages, just like
>>>>>> unmap_and_move() does.
>>>>>>
>>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>> Acked-by: Yang Shi <shy828301@gmail.com>
>>>>>> ---
>>>>>>  mm/migrate.c | 6 ++++++
>>>>>>  1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>>>>>>              return -ENOSYS;
>>>>>>      }
>>>>>>
>>>>>> +    if (page_count(hpage) == 1) {
>>>>>> +            /* page was freed from under us. So we are done. */
>>>>>> +            putback_active_hugepage(hpage);
>>>>>> +            return MIGRATEPAGE_SUCCESS;
>>>>>> +    }
>>>>>> +
>>>>>>      new_hpage = get_new_page(hpage, private);
>>>>>>      if (!new_hpage)
>>>>>>              return -ENOMEM;
>>>>>>
>>>>>
>>>>> Question: What if called via alloc_contig_range() where we even want to
>>>>> "migrate" free pages, meaning, relocate it?
>>>>>
>>>>
>>>> To be more precise:
>>>>
>>>> a) We don't have dissolve_free_huge_pages() calls on the
>>>> alloc_contig_range() path. So we *need* migration IIUC.
>>>
>>> Without this patch, if you want to migrate a HUgeTLB page,
>>> the page is freed to the hugepage pool. With this patch,
>>> the page is also freed to the hugepage pool.
>>> I didn't see any different. I am missing something?
>>
>> I am definitely not an expert on hugetlb pools, that's why I am asking.
>>
>> Isn't it, that with your code, no new page is allocated - so
>> dissolve_free_huge_pages() might just refuse to dissolve due to
>> reservations, bailing out, no?
> 
> Without this patch, the new page can be allocated from the
> hugepage pool. The dissolve_free_huge_pages() also
> can refuse to dissolve due to reservations. Right?

Oh, you mean the migration target might be coming from the pool? I guess
yes, looking at alloc_migration_target()->alloc_huge_page_nodemask().

In that case, yes, I think we run into a similar issue already.

Instead of trying to allocate new huge pages in
dissolve_free_huge_pages() to "relocate free pages", we bail out.

This all feels kind of wrong. After we migrated a huge page we should
free it back to the buddy, so most of our machinery just keeps working
without caring about free huge pages.


I can see how your patch will not change the current (IMHO broken) behavior.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 14:23             ` Michal Hocko
@ 2021-01-12 14:41               ` David Hildenbrand
  2021-01-12 14:53                 ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-01-12 14:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On 12.01.21 15:23, Michal Hocko wrote:
> On Tue 12-01-21 13:16:45, Michal Hocko wrote:
> [...]
>> Well, currently pool pages are not migrateable but you are right that
>> this is likely something that we will need to look into in the future
>> and this optimization would stand in the way.
> 
> After some more thinking I believe I was wrong in my last statement.
> This optimization shouldn't have any effect on pages on the pool as
> those stay at reference count 0 and they cannot be isolated either
> (clear_page_huge_active before it is enqueued).
> 
> That being said, the migration code would still have to learn about
> these pages but that is out of scope of this discussion.
> 
> Sorry about the confusion from my side.
> 

At this point I am fairly confused about what's working and what's not :D

I think this will require more thought, on how to teach
alloc_contig_range() (and eventually in some cases offline_pages()?) to
do the right thing.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 14:41               ` David Hildenbrand
@ 2021-01-12 14:53                 ` Michal Hocko
  2021-01-12 20:12                   ` Mike Kravetz
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2021-01-12 14:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, mike.kravetz, akpm, n-horiguchi, ak, linux-mm,
	linux-kernel, Yang Shi

On Tue 12-01-21 15:41:02, David Hildenbrand wrote:
> On 12.01.21 15:23, Michal Hocko wrote:
> > On Tue 12-01-21 13:16:45, Michal Hocko wrote:
> > [...]
> >> Well, currently pool pages are not migrateable but you are right that
> >> this is likely something that we will need to look into in the future
> >> and this optimization would stand in the way.
> > 
> > After some more thinking I believe I was wrong in my last statement.
> > This optimization shouldn't have any effect on pages on the pool as
> > those stay at reference count 0 and they cannot be isolated either
> > (clear_page_huge_active before it is enqueued).
> > 
> > That being said, the migration code would still have to learn about
> > these pages but that is out of scope of this discussion.
> > 
> > Sorry about the confusion from my side.
> > 
> 
> At this point I am fairly confused about what's working and what's not :D

heh, tell me something about that. Hugetlb is a maze full of land mines.

> I think this will require more thought, on how to teach
> alloc_contig_range() (and eventually in some cases offline_pages()?) to
> do the right thing.

Well, offlining sort of works because it retries both migrates and
dissolves. It can fail with the latter due to reservations but that can
be expected. We can still try harder to relocate/rebalance per-NUMA
pools to keep the reservation but I strongly suspect nobody has noticed
this to be a problem so there we are.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 14:28             ` David Hildenbrand
@ 2021-01-12 14:59               ` Muchun Song
  0 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-12 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen, mhocko,
	Linux Memory Management List, LKML, Yang Shi

On Tue, Jan 12, 2021 at 10:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.01.21 15:17, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 12.01.21 14:40, Muchun Song wrote:
> >>> On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 12.01.21 12:00, David Hildenbrand wrote:
> >>>>> On 10.01.21 13:40, Muchun Song wrote:
> >>>>>> If the refcount is one when it is migrated, it means that the page
> >>>>>> was freed from under us. So we are done and do not need to migrate.
> >>>>>>
> >>>>>> This optimization is consistent with the regular pages, just like
> >>>>>> unmap_and_move() does.
> >>>>>>
> >>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >>>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>>> Acked-by: Yang Shi <shy828301@gmail.com>
> >>>>>> ---
> >>>>>>  mm/migrate.c | 6 ++++++
> >>>>>>  1 file changed, 6 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
> >>>>>> --- a/mm/migrate.c
> >>>>>> +++ b/mm/migrate.c
> >>>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >>>>>>              return -ENOSYS;
> >>>>>>      }
> >>>>>>
> >>>>>> +    if (page_count(hpage) == 1) {
> >>>>>> +            /* page was freed from under us. So we are done. */
> >>>>>> +            putback_active_hugepage(hpage);
> >>>>>> +            return MIGRATEPAGE_SUCCESS;
> >>>>>> +    }
> >>>>>> +
> >>>>>>      new_hpage = get_new_page(hpage, private);
> >>>>>>      if (!new_hpage)
> >>>>>>              return -ENOMEM;
> >>>>>>
> >>>>>
> >>>>> Question: What if called via alloc_contig_range() where we even want to
> >>>>> "migrate" free pages, meaning, relocate it?
> >>>>>
> >>>>
> >>>> To be more precise:
> >>>>
> >>>> a) We don't have dissolve_free_huge_pages() calls on the
> >>>> alloc_contig_range() path. So we *need* migration IIUC.
> >>>
> >>> Without this patch, if you want to migrate a HugeTLB page,
> >>> the page is freed to the hugepage pool. With this patch,
> >>> the page is also freed to the hugepage pool.
> >>> I didn't see any different. I am missing something?
> >>
> >> I am definitely not an expert on hugetlb pools, that's why I am asking.
> >>
> >> Isn't it, that with your code, no new page is allocated - so
> >> dissolve_free_huge_pages() might just refuse to dissolve due to
> >> reservations, bailing out, no?
> >
> > Without this patch, the new page can be allocated from the
> > hugepage pool. The dissolve_free_huge_pages() also
> > can refuse to dissolve due to reservations. Right?
>
> Oh, you mean the migration target might be coming from the pool? I guess
> yes, looking at alloc_migration_target()->alloc_huge_page_nodemask().

Yeah, you are right. If we want to free a HugeTLB page to
the buddy allocator, we should use dissolve_free_huge_page()
to do that. Migration cannot guarantee this, at least for now.

>
> In that case, yes, I think we run into a similar issue already.
>
> Instead of trying to allocate new huge pages in
> dissolve_free_huge_pages() to "relocate free pages", we bail out.
>
> This all feels kind of wrong. After we migrated a huge page we should
> free it back to the buddy, so most of our machinery just keeps working
> without caring about free huge pages.
>
>
> I can see how your patch will not change the current (IMHO broken) behavior.
>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
  2021-01-12 12:37           ` Michal Hocko
@ 2021-01-12 15:05             ` Muchun Song
  0 siblings, 0 replies; 39+ messages in thread
From: Muchun Song @ 2021-01-12 15:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Naoya Horiguchi, Andi Kleen,
	Linux Memory Management List, LKML, linux-stable

On Tue, Jan 12, 2021 at 8:37 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 19:43:21, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 7:17 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 12-01-21 18:13:02, Muchun Song wrote:
> > > > On Tue, Jan 12, 2021 at 6:02 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 10-01-21 20:40:14, Muchun Song wrote:
> > > > > [...]
> > > > > > @@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
> > > > > >               int nid = page_to_nid(head);
> > > > > >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> > > > > >                       goto out;
> > > > > > +
> > > > > > +             /*
> > > > > > +              * We should make sure that the page is already on the free list
> > > > > > +              * when it is dissolved.
> > > > > > +              */
> > > > > > +             if (unlikely(!PageHugeFreed(head)))
> > > > > > +                     goto out;
> > > > > > +
> > > > >
> > > > > Do you really want to report EBUSY in this case? This doesn't make much
> > > > > sense to me TBH. I believe you want to return 0 same as when you race
> > > > > and the page is no longer PageHuge.
> > > >
> > > > Return 0 is wrong. Because the page is not freed to the buddy allocator.
> > > > IIUC, dissolve_free_huge_page returns 0 when the page is already freed
> > > > to the buddy allocator. Right?
> > >
> > > 0 is return when the page is either dissolved or it doesn't need
> > > dissolving. If there is a race with somebody else freeing the page then
> > > there is nothing to dissolve. Under which condition does it make sense
> > > to report the failure and/or retry dissolving?
> >
> > If there is a race with somebody else freeing the page, the page
> > can be freed to the hugepage pool not the buddy allocator. Do
> > you think that this page is dissolved?
>
> OK, I see what you mean. Effectively the page would be in a limbo, not
> yet in the pool nor in the allocator but it can find its way to the
> either of the two. But I still dislike returning a failure because that
> would mean e.g. memory hotplug to fail. Can you simply retry inside this
> code path (drop the lock, cond_resched and retry)?

Yeah. This is what I want to do (making memory hotplug as
successful as possible). So I sent the patch:

  [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page

Adding a simple retry inside this function when hitting this race is
also fine to me. I can do that.




> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one
  2021-01-12 14:53                 ` Michal Hocko
@ 2021-01-12 20:12                   ` Mike Kravetz
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2021-01-12 20:12 UTC (permalink / raw)
  To: Michal Hocko, David Hildenbrand
  Cc: Muchun Song, akpm, n-horiguchi, ak, linux-mm, linux-kernel, Yang Shi

On 1/12/21 6:53 AM, Michal Hocko wrote:
> On Tue 12-01-21 15:41:02, David Hildenbrand wrote:
>> On 12.01.21 15:23, Michal Hocko wrote:
>>> On Tue 12-01-21 13:16:45, Michal Hocko wrote:
>>> [...]
>>>> Well, currently pool pages are not migratable but you are right that
>>>> this is likely something that we will need to look into in the future
>>>> and this optimization would stand in the way.
>>>
>>> After some more thinking I believe I was wrong in my last statement.
>>> This optimization shouldn't have any effect on pages on the pool as
>>> those stay at reference count 0 and they cannot be isolated either
>>> (clear_page_huge_active before it is enqueued).
>>>
>>> That being said, the migration code would still have to learn about
>>> these pages but that is out of scope of this discussion.
>>>
>>> Sorry about the confusion from my side.
>>>
>>
>> At this point I am fairly confused about what's working and what's not :D
> 
> heh, tell me something about that. Hugetlb is a maze full of land mines.
> 
>> I think this will require more thought, on how to teach
>> alloc_contig_range() (and eventually in some cases offline_pages()?) to
>> do the right thing.
> 
> Well, offlining sort of works because it retries both migrates and
> dissolves. It can fail with the latter due to reservations but that can
> be expected. We can still try harder to relocate/rebalance per-NUMA
> pools to keep the reservation but I strongly suspect nobody has noticed
> this to be a problem so there we are.

Due to my time zone, I get to read all the previous comments before
commenting myself. :)

To be clear, this patch is handling a very specific case where a hugetlb
page was isolated for migration and after being isolated the last reference
to the page was dropped.  Normally, dropping the last reference would send
the page to free_huge_page processing.  free_huge_page processing would
either dissolve the page to the buddy allocator or more likely place the
page on the free list of the pool.  However, since isolation also holds
a reference to the page, processing is continued down the migration path.

Today there is no code to migrate free huge pages in the pool.  Only
active in use pages are migrated.  Without this patch, 'free pages' in
the very specific state above would be migrated.  But to be clear, that
was never the intention of any hugetlb migration code.  In that respect,
I believe this patch helps the current code work as designed.

David brings up the valid point that alloc_contig_range needs to deal
with free hugetlb pool pages.  However, that is code which needs to be
written as it does not exist today.  I suspect nobody thought about free
hugetlb pages when alloc_contig_range was written.  This patch should
in no way hinder development of that code.  Free huge pages have a ref
count of 0, and this code is checking for ref count of 1.

That is a long way of saying that I still agree with this patch.  The
only modification/suggestion would be enhancing the commit message as
suggested by Michal.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-01-12 21:59 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-10 12:40 [PATCH v3 0/6] Fix some bugs about HugeTLB code Muchun Song
2021-01-10 12:40 ` [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one Muchun Song
2021-01-12  9:42   ` Michal Hocko
2021-01-12  9:43     ` [External] " Muchun Song
2021-01-12 11:00   ` David Hildenbrand
2021-01-12 11:11     ` David Hildenbrand
2021-01-12 11:27       ` Michal Hocko
2021-01-12 11:34         ` David Hildenbrand
2021-01-12 12:16           ` Michal Hocko
2021-01-12 14:23             ` Michal Hocko
2021-01-12 14:41               ` David Hildenbrand
2021-01-12 14:53                 ` Michal Hocko
2021-01-12 20:12                   ` Mike Kravetz
2021-01-12 13:40       ` [External] " Muchun Song
2021-01-12 13:51         ` David Hildenbrand
2021-01-12 14:17           ` Muchun Song
2021-01-12 14:28             ` David Hildenbrand
2021-01-12 14:59               ` Muchun Song
2021-01-10 12:40 ` [PATCH v3 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page Muchun Song
2021-01-11 23:04   ` Mike Kravetz
2021-01-12  9:45   ` Michal Hocko
2021-01-10 12:40 ` [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page Muchun Song
2021-01-12  1:06   ` Mike Kravetz
2021-01-12 10:02   ` Michal Hocko
2021-01-12 10:13     ` [External] " Muchun Song
2021-01-12 11:17       ` Michal Hocko
2021-01-12 11:43         ` Muchun Song
2021-01-12 12:37           ` Michal Hocko
2021-01-12 15:05             ` Muchun Song
2021-01-10 12:40 ` [PATCH v3 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page Muchun Song
2021-01-12  1:20   ` Mike Kravetz
2021-01-12  8:33     ` Michal Hocko
2021-01-12  9:51       ` [External] " Muchun Song
2021-01-12 10:06         ` Michal Hocko
2021-01-12 10:49           ` Muchun Song
2021-01-12 11:11             ` Michal Hocko
2021-01-12 11:34               ` Muchun Song
2021-01-10 12:40 ` [PATCH v3 5/6] mm: hugetlb: fix a race between isolating and freeing page Muchun Song
2021-01-10 12:40 ` [PATCH v3 6/6] mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active Muchun Song
