linux-mm.kvack.org archive mirror
* [PATCH 0/8] make hugetlb put_page safe for all calling contexts
@ 2021-03-25  0:28 Mike Kravetz
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

This effort is the result of a recent bug report [1].  In subsequent
discussions [2], it was deemed necessary to properly fix the hugetlb
put_page path (free_huge_page).  This series provides a possible way to
address the issue.  Comments are welcome/encouraged, as several attempts
at this have been made in the past.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 from Roman Gushchin provide cma_release_nowait()
- Patch 3 adds a per-hstate mutex to synchronize user pool adjustments
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear,
  the goal is to eliminate single lock holds of long duration.  Overall
  lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and the subpool lock IRQ safe.  It also
  reverts the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (6):
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

Roman Gushchin (2):
  mm: cma: introduce cma_release_nowait()
  mm: hugetlb: don't drop hugetlb_lock around cma_release() call

 include/linux/cma.h     |   2 +
 include/linux/hugetlb.h |   1 +
 mm/cma.c                |  93 +++++++++++
 mm/cma.h                |   5 +
 mm/hugetlb.c            | 354 +++++++++++++++++++++-------------------
 mm/hugetlb_cgroup.c     |   8 +-
 6 files changed, 294 insertions(+), 169 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25  9:39   ` Oscar Salvador
                     ` (2 more replies)
  2021-03-25  0:28 ` [PATCH 2/8] mm: hugetlb: don't drop hugetlb_lock around cma_release() call Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

From: Roman Gushchin <guro@fb.com>

cma_release() has to take the cma_lock mutex to clear the cma bitmap.
This makes it a blocking function, which complicates its use from
non-blocking contexts. For instance, the hugetlbfs code temporarily
drops the hugetlb_lock spinlock to call cma_release().

This patch introduces a non-blocking cma_release_nowait(), which
postpones the cma bitmap clearance. It's done later from a work
context. The first page in the cma allocation is used to store
the work struct. Because CMA allocations and de-allocations are
usually not that frequent, a single global workqueue is used.

To make sure that a subsequent cma_alloc() call will succeed,
cma_alloc() flushes the cma_release_wq workqueue. To avoid a
performance regression in the case when only cma_release() is used,
the flush is gated by a per-cma-area flag, which is set by the first
call to cma_release_nowait().

Signed-off-by: Roman Gushchin <guro@fb.com>
[mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/cma.h |  2 +
 mm/cma.c            | 93 +++++++++++++++++++++++++++++++++++++++++++++
 mm/cma.h            |  5 +++
 3 files changed, 100 insertions(+)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..497eca478c2f 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -47,6 +47,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      bool no_warn);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
+extern bool cma_release_nowait(struct cma *cma, const struct page *pages,
+			       unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index 90e27458ddb7..14cc8e901703 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -36,9 +36,18 @@
 
 #include "cma.h"
 
+struct cma_clear_bitmap_work {
+	struct work_struct work;
+	struct cma *cma;
+	unsigned long pfn;
+	unsigned int count;
+};
+
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned cma_area_count;
 
+struct workqueue_struct *cma_release_wq;
+
 phys_addr_t cma_get_base(const struct cma *cma)
 {
 	return PFN_PHYS(cma->base_pfn);
@@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
 	for (i = 0; i < cma_area_count; i++)
 		cma_activate_area(&cma_areas[i]);
 
+	cma_release_wq = create_workqueue("cma_release");
+	if (!cma_release_wq)
+		return -ENOMEM;
+
 	return 0;
 }
 core_initcall(cma_init_reserved_areas);
@@ -203,6 +216,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 
 	cma->base_pfn = PFN_DOWN(base);
 	cma->count = size >> PAGE_SHIFT;
+	cma->flags = 0;
 	cma->order_per_bit = order_per_bit;
 	*res_cma = cma;
 	cma_area_count++;
@@ -452,6 +466,14 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		goto out;
 
 	for (;;) {
+		/*
+		 * If the CMA bitmap is cleared asynchronously after
+		 * cma_release_nowait(), cma release workqueue has to be
+		 * flushed here in order to make the allocation succeed.
+		 */
+		if (test_bit(CMA_DELAYED_RELEASE, &cma->flags))
+			flush_workqueue(cma_release_wq);
+
 		mutex_lock(&cma->lock);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
 				bitmap_maxno, start, bitmap_count, mask,
@@ -552,6 +574,77 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 	return true;
 }
 
+static void cma_clear_bitmap_fn(struct work_struct *work)
+{
+	struct cma_clear_bitmap_work *w;
+
+	w = container_of(work, struct cma_clear_bitmap_work, work);
+
+	cma_clear_bitmap(w->cma, w->pfn, w->count);
+
+	__free_page(pfn_to_page(w->pfn));
+}
+
+/**
+ * cma_release_nowait() - release allocated pages without blocking
+ * @cma:   Contiguous memory region for which the allocation is performed.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * Similar to cma_release(), this function releases memory allocated
+ * by cma_alloc(), but unlike cma_release() is non-blocking and can be
+ * called from an atomic context.
+ * It returns false when the provided pages do not belong to the
+ * contiguous area, and true otherwise.
+ */
+bool cma_release_nowait(struct cma *cma, const struct page *pages,
+			unsigned int count)
+{
+	struct cma_clear_bitmap_work *work;
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return false;
+
+	pr_debug("%s(page %p)\n", __func__, (void *)pages);
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+		return false;
+
+	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
+
+	/*
+	 * Set CMA_DELAYED_RELEASE flag: subsequent cma_alloc()'s
+	 * will wait for the async part of cma_release_nowait() to
+	 * finish.
+	 */
+	if (unlikely(!test_bit(CMA_DELAYED_RELEASE, &cma->flags)))
+		set_bit(CMA_DELAYED_RELEASE, &cma->flags);
+
+	/*
+	 * To make cma_release_nowait() non-blocking, cma bitmap is cleared
+	 * from a work context (see cma_clear_bitmap_fn()). The first page
+	 * in the cma allocation is used to store the work structure,
+	 * so it's released after the cma bitmap clearance. Other pages
+	 * are released immediately as previously.
+	 */
+	if (count > 1)
+		free_contig_range(pfn + 1, count - 1);
+
+	work = (struct cma_clear_bitmap_work *)page_to_virt(pages);
+	INIT_WORK(&work->work, cma_clear_bitmap_fn);
+	work->cma = cma;
+	work->pfn = pfn;
+	work->count = count;
+	queue_work(cma_release_wq, &work->work);
+
+	trace_cma_release(pfn, pages, count);
+
+	return true;
+}
+
 int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 {
 	int i;
diff --git a/mm/cma.h b/mm/cma.h
index 95d1aa2d808a..2063fb5bc985 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -17,6 +17,7 @@ struct cma_stat {
 struct cma {
 	unsigned long   base_pfn;
 	unsigned long   count;
+	unsigned long   flags;
 	unsigned long   *bitmap;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
 	struct mutex    lock;
@@ -31,6 +32,10 @@ struct cma {
 #endif
 };
 
+enum cma_flags {
+	CMA_DELAYED_RELEASE, /* cma bitmap is cleared asynchronously */
+};
+
 extern struct cma cma_areas[MAX_CMA_AREAS];
 extern unsigned cma_area_count;
 
-- 
2.30.2




* [PATCH 2/8] mm: hugetlb: don't drop hugetlb_lock around cma_release() call
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

From: Roman Gushchin <guro@fb.com>

Replace blocking cma_release() with a non-blocking cma_release_nowait()
call, so there is no more need to temporarily drop hugetlb_lock.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 408dbc08298a..f9ba63fc1747 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1258,10 +1258,11 @@ static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
 	 * If the page isn't allocated using the cma allocator,
-	 * cma_release() returns false.
+	 * cma_release_nowait() returns false.
 	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
+	if (cma_release_nowait(hugetlb_cma[page_to_nid(page)], page,
+			       1 << order))
 		return;
 #endif
 
@@ -1348,14 +1349,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
-- 
2.30.2




* [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
  2021-03-25  0:28 ` [PATCH 2/8] mm: hugetlb: don't drop hugetlb_lock around cma_release() call Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 10:47   ` Michal Hocko
                     ` (2 more replies)
  2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page cannot
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could run in parallel, or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to
allow only one hugetlb page adjustment at a time.  This will
synchronize modifications to the next_nid_to_alloc variable.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c            | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a7f7d5f328dc..8817ec987d68 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+	struct mutex mutex;
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9ba63fc1747..404b0b1c5258 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2616,6 +2616,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	else
 		return -ENOMEM;
 
+	/* mutex prevents concurrent adjustments for the same hstate */
+	mutex_lock(&h->mutex);
 	spin_lock(&hugetlb_lock);
 
 	/*
@@ -2648,6 +2650,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			mutex_unlock(&h->mutex);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
@@ -2722,6 +2725,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	mutex_unlock(&h->mutex);
 
 	NODEMASK_FREE(node_alloc_noretry);
 
@@ -3209,6 +3213,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
 	h = &hstates[hugetlb_max_hstate++];
+	mutex_init(&h->mutex);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2




* [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 10:49   ` Michal Hocko
                     ` (2 more replies)
  2021-03-25  0:28 ` [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 404b0b1c5258..3938ec086b5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ * NOTE: hugetlb specific page flags stored in page->private are not
+ *	 automatically cleared.  These flags may be used in routines
+ *	 which operate on the resulting compound page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	int nid = page_to_nid(page);
+
+	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+		return;
+
+	list_del(&page->lru);
+
+	if (HPageFreed(page)) {
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ClearHPageFreed(page);
+	}
+	if (adjust_surplus) {
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
+	}
+
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+	ClearHPageTemporary(page);
+	set_page_refcounted(page);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1335,8 +1375,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	h->nr_huge_pages--;
-	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1344,10 +1382,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1415,15 +1449,12 @@ static void __free_huge_page(struct page *page)
 		h->resv_huge_pages++;
 
 	if (HPageTemporary(page)) {
-		list_del(&page->lru);
-		ClearHPageTemporary(page);
+		remove_hugetlb_page(h, page, false);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
-		list_del(&page->lru);
+		remove_hugetlb_page(h, page, true);
 		update_and_free_page(h, page);
-		h->surplus_huge_pages--;
-		h->surplus_huge_pages_node[nid]--;
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
@@ -1708,13 +1739,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 			struct page *page =
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[node]--;
-			if (acct_surplus) {
-				h->surplus_huge_pages--;
-				h->surplus_huge_pages_node[node]--;
-			}
+			remove_hugetlb_page(h, page, acct_surplus);
 			update_and_free_page(h, page);
 			ret = 1;
 			break;
@@ -1752,7 +1777,6 @@ int dissolve_free_huge_page(struct page *page)
 	if (!page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
-		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
 
@@ -1783,9 +1807,7 @@ int dissolve_free_huge_page(struct page *page)
 			SetPageHWPoison(page);
 			ClearPageHWPoison(head);
 		}
-		list_del(&head->lru);
-		h->free_huge_pages--;
-		h->free_huge_pages_node[nid]--;
+		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
 		rc = 0;
@@ -2553,10 +2575,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 				return;
 			if (PageHighMem(page))
 				continue;
-			list_del(&page->lru);
+			remove_hugetlb_page(h, page, false);
 			update_and_free_page(h, page);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
-- 
2.30.2




* [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (3 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 10:55   ` Michal Hocko
  2021-03-27  6:54   ` [External] " Muchun Song
  2021-03-25  0:28 ` [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3938ec086b5c..fae7f034d1eb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1450,16 +1450,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
+		spin_unlock(&hugetlb_lock);
 	}
-	spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1740,7 +1742,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
+			/*
+			 * unlock/lock around update_and_free_page is temporary
+			 * and will be removed with subsequent patch.
+			 */
+			spin_unlock(&hugetlb_lock);
 			update_and_free_page(h, page);
+			spin_lock(&hugetlb_lock);
 			ret = 1;
 			break;
 		}
@@ -1809,8 +1817,9 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, head);
-		rc = 0;
+		return 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
@@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
+	struct list_head page_list;
+	struct page *page, *next;
 
 	if (hstate_is_gigantic(h))
 		return;
 
+	/*
+	 * Collect pages to be freed on a list, and free after dropping lock
+	 */
+	INIT_LIST_HEAD(&page_list);
 	for_each_node_mask(i, *nodes_allowed) {
-		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
-				return;
+				goto out;
 			if (PageHighMem(page))
 				continue;
 			remove_hugetlb_page(h, page, false);
-			update_and_free_page(h, page);
+			INIT_LIST_HEAD(&page->lru);
+			list_add(&page->lru, &page_list);
 		}
 	}
+
+out:
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2




* [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (4 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 11:06   ` Michal Hocko
  2021-03-25  0:28 ` [PATCH 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fae7f034d1eb..a9785e73379f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1204,7 +1204,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1720,16 +1720,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+						nodemask_t *nodes_allowed,
+						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	int ret = 0;
+	struct page *page = NULL;
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
@@ -1738,23 +1740,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[node].next,
+			page = list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
-			/*
-			 * unlock/lock around update_and_free_page is temporary
-			 * and will be removed with subsequent patch.
-			 */
-			spin_unlock(&hugetlb_lock);
-			update_and_free_page(h, page);
-			spin_lock(&hugetlb_lock);
-			ret = 1;
 			break;
 		}
 	}
 
-	return ret;
+	return page;
 }
 
 /*
@@ -2074,17 +2067,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
  *    to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *    the reservation.  As many as unused_resv_pages may be freed.
- *
- * Called with hugetlb_lock held.  However, the lock could be dropped (and
- * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
- * we must make sure nobody else can claim pages we are in the process of
- * freeing.  Do this by ensuring resv_huge_page always is greater than the
- * number of huge pages we plan to free when dropping the lock.
  */
 static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	struct page *page, *t_page;
+	struct list_head page_list;
+
+	/* Uncommit the reservation */
+	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -2101,24 +2093,27 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_huge_page() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
-	 *
-	 * Note that we decrement resv_huge_pages as we free the pages.  If
-	 * we drop the lock, resv_huge_pages will still be sufficiently large
-	 * to cover subsequent pages we may free.
 	 */
+	INIT_LIST_HEAD(&page_list);
 	while (nr_pages--) {
-		h->resv_huge_pages--;
-		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+		if (!page)
 			goto out;
-		cond_resched_lock(&hugetlb_lock);
+
+		INIT_LIST_HEAD(&page->lru);
+		list_add(&page->lru, &page_list);
 	}
 
 out:
-	/* Fully uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, t_page, &page_list, lru) {
+		list_del(&page->lru);
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
 }
 
 
@@ -2648,6 +2643,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	struct page *page, *t_page;
+	struct list_head page_list;
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
 	/*
@@ -2757,11 +2754,28 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);
+
+	/*
+	 * Collect pages to be removed on list without dropping lock
+	 */
+	INIT_LIST_HEAD(&page_list);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, nodes_allowed, 0))
+		page = remove_pool_huge_page(h, nodes_allowed, 0);
+		if (!page)
 			break;
-		cond_resched_lock(&hugetlb_lock);
+
+		INIT_LIST_HEAD(&page->lru);
+		list_add(&page->lru, &page_list);
 	}
+	/* free the pages after dropping lock */
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, t_page, &page_list, lru) {
+		list_del(&page->lru);
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
+
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
-- 
2.30.2



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (5 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 11:21   ` Michal Hocko
  2021-03-27  7:06   ` [External] " Muchun Song
  2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  2021-03-26  1:42 ` [PATCH 0/8] make hugetlb put_page safe for all calling contexts Miaohe Lin
  8 siblings, 2 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, as seen in [1] this
does not cover all cases.  Instead, make the locks taken in the
free_huge_page path irq safe.

This patch does the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c        | 169 +++++++++++++++++---------------------------
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 67 insertions(+), 110 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9785e73379f..e4c441b878f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
 	return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+						unsigned long irq_flags)
 {
-	spin_unlock(&spool->lock);
+	spin_unlock_irqrestore(&spool->lock, irq_flags);
 
 	/* If no pages are used, and no other handles to the subpool
 	 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-	spin_lock(&spool->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&spool->lock, flags);
 	BUG_ON(!spool->count);
 	spool->count--;
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	if (!spool)
 		return ret;
 
-	spin_lock(&spool->lock);
+	spin_lock_irq(&spool->lock);
 
 	if (spool->max_hpages != -1) {		/* maximum size accounting */
 		if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	}
 
 unlock_ret:
-	spin_unlock(&spool->lock);
+	spin_unlock_irq(&spool->lock);
 	return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 				       long delta)
 {
 	long ret = delta;
+	unsigned long flags;
 
 	if (!spool)
 		return delta;
 
-	spin_lock(&spool->lock);
+	spin_lock_irqsave(&spool->lock, flags);
 
 	if (spool->max_hpages != -1)		/* maximum size accounting */
 		spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 	 * quota reference, free it now.
 	 */
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 
 	return ret;
 }
@@ -1401,7 +1405,7 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1411,6 +1415,7 @@ static void __free_huge_page(struct page *page)
 	int nid = page_to_nid(page);
 	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
 	bool restore_reserve;
+	unsigned long flags;
 
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1439,7 +1444,7 @@ static void __free_huge_page(struct page *page)
 			restore_reserve = true;
 	}
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	ClearHPageMigratable(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
@@ -1450,66 +1455,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
-		spin_unlock(&hugetlb_lock);
-	}
-}
-
-/*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
- *
- * free_hpage_workfn() locklessly retrieves the linked list of pages to
- * be freed and frees them one-by-one. As the page->mapping pointer is
- * going to be cleared in __free_huge_page() anyway, it is reused as the
- * llist_node structure of a lockless linked list of huge pages to be freed.
- */
-static LLIST_HEAD(hpage_freelist);
-
-static void free_hpage_workfn(struct work_struct *work)
-{
-	struct llist_node *node;
-	struct page *page;
-
-	node = llist_del_all(&hpage_freelist);
-
-	while (node) {
-		page = container_of((struct address_space **)node,
-				     struct page, mapping);
-		node = node->next;
-		__free_huge_page(page);
-	}
-}
-static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
-
-void free_huge_page(struct page *page)
-{
-	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
-	 */
-	if (!in_task()) {
-		/*
-		 * Only call schedule_work() if hpage_freelist is previously
-		 * empty. Otherwise, schedule_work() had been called but the
-		 * workfn hasn't retrieved the list yet.
-		 */
-		if (llist_add((struct llist_node *)&page->mapping,
-			      &hpage_freelist))
-			schedule_work(&free_hpage_work);
-		return;
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
-
-	__free_huge_page(page);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1519,11 +1476,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	ClearHPageFreed(page);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 }
 
 static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1769,7 +1726,7 @@ int dissolve_free_huge_page(struct page *page)
 	if (!PageHuge(page))
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(page)) {
 		rc = 0;
 		goto out;
@@ -1786,7 +1743,7 @@ int dissolve_free_huge_page(struct page *page)
 		 * when it is dissolved.
 		 */
 		if (unlikely(!HPageFreed(head))) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			cond_resched();
 
 			/*
@@ -1810,12 +1767,12 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		update_and_free_page(h, head);
 		return 0;
 	}
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return rc;
 }
 
@@ -1857,16 +1814,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
 		goto out_unlock;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
@@ -1876,7 +1833,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetHPageTemporary(page);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		put_page(page);
 		return NULL;
 	} else {
@@ -1885,7 +1842,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	}
 
 out_unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return page;
 }
@@ -1935,17 +1892,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask, gfp_t gfp_mask)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
 		struct page *page;
 
 		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
 		if (page) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			return page;
 		}
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
@@ -1993,7 +1950,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	ret = -ENOMEM;
 retry:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
 				NUMA_NO_NODE, NULL);
@@ -2010,7 +1967,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * After retaking hugetlb_lock, we need to recalculate 'needed'
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
 			(h->free_huge_pages + allocated);
 	if (needed > 0) {
@@ -2050,12 +2007,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		enqueue_huge_page(h, page);
 	}
 free:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
 		put_page(page);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
 }
@@ -2107,13 +2064,13 @@ static void return_unused_surplus_pages(struct hstate *h,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	list_for_each_entry_safe(page, t_page, &page_list, lru) {
 		list_del(&page->lru);
 		update_and_free_page(h, page);
 		cond_resched();
 	}
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 
 
@@ -2348,7 +2305,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * glb_chg is passed to indicate whether or not a page must be taken
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -2356,7 +2313,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
 		if (!page)
 			goto out_uncharge_cgroup;
@@ -2364,7 +2321,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			SetHPageRestoreReserve(page);
 			h->resv_huge_pages--;
 		}
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		list_add(&page->lru, &h->hugepage_activelist);
 		/* Fall through */
 	}
@@ -2377,7 +2334,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 						  h_cg, page);
 	}
 
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	hugetlb_set_page_subpool(page, spool);
 
@@ -2591,13 +2548,13 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	list_for_each_entry_safe(page, next, &page_list, lru) {
 		list_del(&page->lru);
 		update_and_free_page(h, page);
 		cond_resched();
 	}
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
@@ -2659,7 +2616,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 
 	/* mutex prevents concurrent adjustments for the same hstate */
 	mutex_lock(&h->mutex);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	/*
 	 * Check for a node specific request.
@@ -2690,7 +2647,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 */
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			mutex_unlock(&h->mutex);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
@@ -2720,14 +2677,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (!ret)
 			goto out;
 
@@ -2768,13 +2725,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		list_add(&page->lru, &page_list);
 	}
 	/* free the pages after dropping lock */
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	list_for_each_entry_safe(page, t_page, &page_list, lru) {
 		list_del(&page->lru);
 		update_and_free_page(h, page);
 		cond_resched();
 	}
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
@@ -2782,7 +2739,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	}
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	mutex_unlock(&h->mutex);
 
 	NODEMASK_FREE(node_alloc_noretry);
@@ -2938,9 +2895,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	if (err)
 		return err;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_overcommit_huge_pages = input;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return count;
 }
@@ -3527,9 +3484,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 		goto out;
 
 	if (write) {
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		h->nr_overcommit_huge_pages = tmp;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 out:
 	return ret;
@@ -3625,7 +3582,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	if (!delta)
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * When cpuset is configured, it breaks the strict hugetlb page
 	 * reservation as the accounting is done on a global variable. Such
@@ -3664,7 +3621,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
@@ -5727,7 +5684,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 {
 	bool ret = true;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHeadHuge(page) ||
 	    !HPageMigratable(page) ||
 	    !get_page_unless_zero(page)) {
@@ -5737,16 +5694,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	ClearHPageMigratable(page);
 	list_move_tail(&page->lru, list);
 unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
 void putback_active_hugepage(struct page *page)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	SetHPageMigratable(page);
 	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	put_page(page);
 }
 
@@ -5780,12 +5737,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		 */
 		if (new_nid == old_nid)
 			return;
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (h->surplus_huge_pages_node[old_nid]) {
 			h->surplus_huge_pages_node[old_nid]--;
 			h->surplus_huge_pages_node[new_nid]++;
 		}
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 726b85f4f303..5383023d0cca 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		idx = 0;
 		for_each_hstate(h) {
-			spin_lock(&hugetlb_lock);
+			spin_lock_irq(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);
 
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			idx++;
 		}
 		cond_resched();
@@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	if (hugetlb_cgroup_disabled())
 		return;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h_cg = hugetlb_cgroup_from_page(oldhpage);
 	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
 	set_hugetlb_cgroup(oldhpage, NULL);
@@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	set_hugetlb_cgroup(newhpage, h_cg);
 	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return;
 }
 
-- 
2.30.2




* [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (6 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-03-25  0:28 ` Mike Kravetz
  2021-03-25 11:22   ` Michal Hocko
                     ` (2 more replies)
  2021-03-26  1:42 ` [PATCH 0/8] make hugetlb put_page safe for all calling contexts Miaohe Lin
  8 siblings, 3 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25  0:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Mike Kravetz

After making hugetlb_lock irq safe and separating out some functionality
previously done under the lock, add lockdep_assert_held() calls to help
verify locking.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4c441b878f2..de5b3cf4a155 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
+
+	lockdep_assert_held(&hugetlb_lock);
 	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
@@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	struct page *page;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+	lockdep_assert_held(&hugetlb_lock);
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
 		if (pin && !is_pinnable_page(page))
 			continue;
@@ -1345,6 +1348,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 {
 	int nid = page_to_nid(page);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
@@ -1690,6 +1694,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 	int nr_nodes, node;
 	struct page *page = NULL;
 
+	lockdep_assert_held(&hugetlb_lock);
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
 		 * If we're returning unused surplus pages, only examine
@@ -1939,6 +1944,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	long needed, allocated;
 	bool alloc_ok = true;
 
+	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
@@ -2032,6 +2038,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	struct page *page, *t_page;
 	struct list_head page_list;
 
+	lockdep_assert_held(&hugetlb_lock);
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
@@ -2527,6 +2534,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	struct list_head page_list;
 	struct page *page, *next;
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h))
 		return;
 
@@ -2573,6 +2581,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 {
 	int nr_nodes, node;
 
+	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-- 
2.30.2




* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
@ 2021-03-25  9:39   ` Oscar Salvador
  2021-03-25  9:45   ` Michal Hocko
  2021-03-25  9:56   ` David Hildenbrand
  2 siblings, 0 replies; 53+ messages in thread
From: Oscar Salvador @ 2021-03-25  9:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed, Mar 24, 2021 at 05:28:28PM -0700, Mike Kravetz wrote:
> +struct cma_clear_bitmap_work {
> +	struct work_struct work;
> +	struct cma *cma;
> +	unsigned long pfn;
> +	unsigned int count;
> +};
> +
>  struct cma cma_areas[MAX_CMA_AREAS];
>  unsigned cma_area_count;
>  
> +struct workqueue_struct *cma_release_wq;

should this be static?

> +
>  phys_addr_t cma_get_base(const struct cma *cma)
>  {
>  	return PFN_PHYS(cma->base_pfn);
> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>  	for (i = 0; i < cma_area_count; i++)
>  		cma_activate_area(&cma_areas[i]);
>  
> +	cma_release_wq = create_workqueue("cma_release");
> +	if (!cma_release_wq)
> +		return -ENOMEM;
> +

I did not check the code that goes through the initcalls, and maybe we
cannot really have this scenario, but what happens if we return -ENOMEM?
I can see that later in cma_release_nowait() you dereference
cma_release_wq. Can cma_release_wq be NULL at that stage due to -ENOMEM,
or are we guaranteed to never reach that point?

Also, should the cma_release_wq creation go before cma_activate_area()?

> +bool cma_release_nowait(struct cma *cma, const struct page *pages,
> +			unsigned int count)
> +{
> +	struct cma_clear_bitmap_work *work;
> +	unsigned long pfn;
> +
> +	if (!cma || !pages)
> +		return false;
> +
> +	pr_debug("%s(page %p)\n", __func__, (void *)pages);

cma_release() seems to have:

pr_debug("%s(page %p, count %u)\n", __func__, (void *)pages, count);

any reason not to have the same here?


> +
> +	pfn = page_to_pfn(pages);
> +
> +	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
> +		return false;
> +
> +	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
> +
> +	/*
> +	 * Set CMA_DELAYED_RELEASE flag: subsequent cma_alloc()'s
> +	 * will wait for the async part of cma_release_nowait() to
> +	 * finish.
> +	 */
> +	if (unlikely(!test_bit(CMA_DELAYED_RELEASE, &cma->flags)))
> +		set_bit(CMA_DELAYED_RELEASE, &cma->flags);

It seems this cannot really happen? This is the only place we set the
bit, right?
Why not set the bit unconditionally? What is this guarding us against?

> +
> +	/*
> +	 * To make cma_release_nowait() non-blocking, cma bitmap is cleared
> +	 * from a work context (see cma_clear_bitmap_fn()). The first page
> +	 * in the cma allocation is used to store the work structure,
> +	 * so it's released after the cma bitmap clearance. Other pages
> +	 * are released immediately as previously.
> +	 */
> +	if (count > 1)
> +		free_contig_range(pfn + 1, count - 1);
> +
> +	work = (struct cma_clear_bitmap_work *)page_to_virt(pages);
> +	INIT_WORK(&work->work, cma_clear_bitmap_fn);
> +	work->cma = cma;
> +	work->pfn = pfn;
> +	work->count = count;
> +	queue_work(cma_release_wq, &work->work);

As I said above, can cma_release_wq be NULL if we had -ENOMEM before?


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
  2021-03-25  9:39   ` Oscar Salvador
@ 2021-03-25  9:45   ` Michal Hocko
  2021-03-25  9:54     ` Oscar Salvador
  2021-03-25 10:11     ` Michal Hocko
  2021-03-25  9:56   ` David Hildenbrand
  2 siblings, 2 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25  9:45 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
[...]
>  phys_addr_t cma_get_base(const struct cma *cma)
>  {
>  	return PFN_PHYS(cma->base_pfn);
> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>  	for (i = 0; i < cma_area_count; i++)
>  		cma_activate_area(&cma_areas[i]);
>  
> +	cma_release_wq = create_workqueue("cma_release");

Considering the workqueue is used to free up memory it should likely be
WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.

I cannot judge the CMA parts, but from a very quick glance this looks
reasonable.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  9:45   ` Michal Hocko
@ 2021-03-25  9:54     ` Oscar Salvador
  2021-03-25 10:10       ` Michal Hocko
  2021-03-25 10:11     ` Michal Hocko
  1 sibling, 1 reply; 53+ messages in thread
From: Oscar Salvador @ 2021-03-25  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 10:45:18AM +0100, Michal Hocko wrote:
> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> [...]
> >  phys_addr_t cma_get_base(const struct cma *cma)
> >  {
> >  	return PFN_PHYS(cma->base_pfn);
> > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> >  	for (i = 0; i < cma_area_count; i++)
> >  		cma_activate_area(&cma_areas[i]);
> >  
> > +	cma_release_wq = create_workqueue("cma_release");
> 
> Considering the workqueue is used to free up memory it should likely be
> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.

I might be missing something, but the workqueue->func() does not seem to
be doing anything for memory reclaim purposes.

The workqueue calls cma_clear_bitmap_fn(), and the only thing that does
is free one page and clear the bitmap.
I might be reading the CMA code wrongly, but I cannot see any freeing of
memory in cma_clear_bitmap()->__bitmap_clear() (or its siblings).

The actual freeing of memory seems to happen in cma_release_nowait()
with:

 if (count > 1)
       free_contig_range(pfn + 1, count - 1);

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
  2021-03-25  9:39   ` Oscar Salvador
  2021-03-25  9:45   ` Michal Hocko
@ 2021-03-25  9:56   ` David Hildenbrand
  2021-03-25 10:22     ` Michal Hocko
  2 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2021-03-25  9:56 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 25.03.21 01:28, Mike Kravetz wrote:
> From: Roman Gushchin <guro@fb.com>
> 
> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> It makes it a blocking function, which complicates its usage from
> non-blocking contexts. For instance, hugetlbfs code is temporarily
> dropping the hugetlb_lock spinlock to call cma_release().
> 
> This patch introduces a non-blocking cma_release_nowait(), which
> postpones the cma bitmap clearance. It's done later from a work
> context. The first page in the cma allocation is used to store
> the work struct. Because CMA allocations and de-allocations are
> usually not that frequent, a single global workqueue is used.
> 
> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> flushes the cma_release_wq workqueue. To avoid a performance
> regression in the case when only cma_release() is used, gate it
> by a per-cma area flag, which is set by the first call
> of cma_release_nowait().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---


1. Is there a real reason this is a mutex and not a spin lock? It seems 
to only protect the bitmap. Are bitmaps that huge that we spend a 
significant amount of time in there?

Because I also read "Because CMA allocations and de-allocations are
usually not that frequent".

With a spinlock, you would no longer be sleeping, but obviously you 
might end up waiting for the lock ;) Not sure if that would help.

2. IIUC, if we would do the clearing completely lockless and use atomic 
bitmap ops instead, only cma_debug_show_areas() would see slight 
inconsistencies. As long as the setting code (-> allocation code) holds 
the lock, I think this should be fine (-> no double allocations).

(sorry if that has already been discussed)

-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  9:54     ` Oscar Salvador
@ 2021-03-25 10:10       ` Michal Hocko
  0 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:10 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu 25-03-21 10:54:10, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 10:45:18AM +0100, Michal Hocko wrote:
> > On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> > [...]
> > >  phys_addr_t cma_get_base(const struct cma *cma)
> > >  {
> > >  	return PFN_PHYS(cma->base_pfn);
> > > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> > >  	for (i = 0; i < cma_area_count; i++)
> > >  		cma_activate_area(&cma_areas[i]);
> > >  
> > > +	cma_release_wq = create_workqueue("cma_release");
> > 
> > Considering the workqueue is used to free up memory it should likely be
> > WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
> 
> I might be missing something, but the workqueue->func() does not seem to do
> anything memory-reclaim related.
> The workqueue calls cma_clear_bitmap_fn(), and the only thing that does
> is free one page and clear the bitmap. 

No, it shouldn't release a single page. That is a bug which I have
overlooked as well. It should free up the whole count worth of pages.
But fundamentally the worker does all the heavy lifting and frees up the
memory.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  9:45   ` Michal Hocko
  2021-03-25  9:54     ` Oscar Salvador
@ 2021-03-25 10:11     ` Michal Hocko
  2021-03-25 10:13       ` David Hildenbrand
  2021-03-25 10:17       ` Oscar Salvador
  1 sibling, 2 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu 25-03-21 10:45:19, Michal Hocko wrote:
> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> [...]
> >  phys_addr_t cma_get_base(const struct cma *cma)
> >  {
> >  	return PFN_PHYS(cma->base_pfn);
> > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> >  	for (i = 0; i < cma_area_count; i++)
> >  		cma_activate_area(&cma_areas[i]);
> >  
> > +	cma_release_wq = create_workqueue("cma_release");
> 
> Considering the workqueue is used to free up memory it should likely be
> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
> 
> I cannot judge the CMA parts from a very quick glance this looks
> reasonably.

I have overlooked that
+static void cma_clear_bitmap_fn(struct work_struct *work)
+{
+       struct cma_clear_bitmap_work *w;
+
+       w = container_of(work, struct cma_clear_bitmap_work, work);
+
+       cma_clear_bitmap(w->cma, w->pfn, w->count);
+
+       __free_page(pfn_to_page(w->pfn));
+}

should be doing free_contig_range with w->count target.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 10:11     ` Michal Hocko
@ 2021-03-25 10:13       ` David Hildenbrand
  2021-03-25 10:17       ` Oscar Salvador
  1 sibling, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2021-03-25 10:13 UTC (permalink / raw)
  To: Michal Hocko, Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 25.03.21 11:11, Michal Hocko wrote:
> On Thu 25-03-21 10:45:19, Michal Hocko wrote:
>> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
>> [...]
>>>   phys_addr_t cma_get_base(const struct cma *cma)
>>>   {
>>>   	return PFN_PHYS(cma->base_pfn);
>>> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>>>   	for (i = 0; i < cma_area_count; i++)
>>>   		cma_activate_area(&cma_areas[i]);
>>>   
>>> +	cma_release_wq = create_workqueue("cma_release");
>>
>> Considering the workqueue is used to free up memory it should likely be
>> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
>>
>> I cannot judge the CMA parts from a very quick glance this looks
>> reasonably.
> 
> I have overlooked that
> +static void cma_clear_bitmap_fn(struct work_struct *work)
> +{
> +       struct cma_clear_bitmap_work *w;
> +
> +       w = container_of(work, struct cma_clear_bitmap_work, work);
> +
> +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> +
> +       __free_page(pfn_to_page(w->pfn));
> +}
> 
> should be doing free_contig_range with w->count target.
> 

Then the name "cma_clear_bitmap_fn" is misleading.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 10:11     ` Michal Hocko
  2021-03-25 10:13       ` David Hildenbrand
@ 2021-03-25 10:17       ` Oscar Salvador
  2021-03-25 10:24         ` Michal Hocko
  1 sibling, 1 reply; 53+ messages in thread
From: Oscar Salvador @ 2021-03-25 10:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 11:11:49AM +0100, Michal Hocko wrote:
> I have overlooked that
> +static void cma_clear_bitmap_fn(struct work_struct *work)
> +{
> +       struct cma_clear_bitmap_work *w;
> +
> +       w = container_of(work, struct cma_clear_bitmap_work, work);
> +
> +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> +
> +       __free_page(pfn_to_page(w->pfn));
> +}
> 
> should be doing free_contig_range with w->count target.

That is currently done in cma_release_nowait().
Did you mean we should move that work into the wq?

I thought the reason for creating a WQ in the first place was to have a
non-blocking cma_release() variant, as we might block down the chain
when clearing the bitmap. I do not think this was about freeing up
memory, or at least the changelog did not mention it.

Also, IIUC, we use the first page as storage for the work information
(in cma_release_nowait()); that is why we can free that page once we are
done with the workqueue work in cma_clear_bitmap_fn().

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25  9:56   ` David Hildenbrand
@ 2021-03-25 10:22     ` Michal Hocko
  2021-03-25 16:56       ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> On 25.03.21 01:28, Mike Kravetz wrote:
> > From: Roman Gushchin <guro@fb.com>
> > 
> > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > It makes it a blocking function, which complicates its usage from
> > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > dropping the hugetlb_lock spinlock to call cma_release().
> > 
> > This patch introduces a non-blocking cma_release_nowait(), which
> > postpones the cma bitmap clearance. It's done later from a work
> > context. The first page in the cma allocation is used to store
> > the work struct. Because CMA allocations and de-allocations are
> > usually not that frequent, a single global workqueue is used.
> > 
> > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > flushes the cma_release_wq workqueue. To avoid a performance
> > regression in the case when only cma_release() is used, gate it
> > by a per-cma area flag, which is set by the first call
> > of cma_release_nowait().
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> 
> 
> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> only protect the bitmap. Are bitmaps that huge that we spend a significant
> amount of time in there?

Good question. Looking at the code it doesn't seem that there is any
blockable operation or any heavy lifting done under the lock.
7ee793a62fa8 ("cma: Remove potential deadlock situation") introduced
the lock, and there was only a simple bitmap to protect back then. I
suspect the patch just followed the cma_mutex lead and used the same
type of lock. cma_mutex used to protect alloc_contig_range(), which is
sleepable.

This all suggests that there is no real reason to use a sleepable lock
at all and we do not need all this heavy lifting.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 10:17       ` Oscar Salvador
@ 2021-03-25 10:24         ` Michal Hocko
  0 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu 25-03-21 11:17:32, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 11:11:49AM +0100, Michal Hocko wrote:
> > I have overlooked that
> > +static void cma_clear_bitmap_fn(struct work_struct *work)
> > +{
> > +       struct cma_clear_bitmap_work *w;
> > +
> > +       w = container_of(work, struct cma_clear_bitmap_work, work);
> > +
> > +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> > +
> > +       __free_page(pfn_to_page(w->pfn));
> > +}
> > 
> > should be doing free_contig_range with w->count target.
> 
> That is currently done in cma_release_nowait().
> You meant we should move that work there in the wq?

I have missed that part. My bad. But it seems this whole thing is moot
because we can make the lock a spinlock as pointed out by David.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-03-25 10:47   ` Michal Hocko
  2021-03-25 12:29   ` Oscar Salvador
  2021-03-26  1:52   ` Miaohe Lin
  2 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:47 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:30, Mike Kravetz wrote:
> The helper routine hstate_next_node_to_alloc accesses and modifies the
> hstate variable next_nid_to_alloc.  The helper is used by the routines
> alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
> called with hugetlb_lock held.  However, alloc_pool_huge_page can not
> be called with the hugetlb lock held as it will call the page allocator.
> Two instances of alloc_pool_huge_page could be run in parallel or
> alloc_pool_huge_page could run in parallel with adjust_pool_surplus
> which may result in the variable next_nid_to_alloc becoming invalid
> for the caller and pages being allocated on the wrong node.
> 
> Both alloc_pool_huge_page and adjust_pool_surplus are only called from
> the routine set_max_huge_pages after boot.  set_max_huge_pages is only
> called as the result of a user writing to the proc/sysfs nr_hugepages,
> or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
> 
> It makes little sense to allow multiple adjustments to the number of
> hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
> allow one hugetlb page adjustment at a time.  This will synchronize
> modifications to the next_nid_to_alloc variable.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

I would just recommend s@mutex@resize_lock@ so that the intention is
clearer from the name.
> ---
>  include/linux/hugetlb.h | 1 +
>  mm/hugetlb.c            | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a7f7d5f328dc..8817ec987d68 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> +	struct mutex mutex;
>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f9ba63fc1747..404b0b1c5258 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2616,6 +2616,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	else
>  		return -ENOMEM;
>  
> +	/* mutex prevents concurrent adjustments for the same hstate */
> +	mutex_lock(&h->mutex);
>  	spin_lock(&hugetlb_lock);
>  
>  	/*
> @@ -2648,6 +2650,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>  		if (count > persistent_huge_pages(h)) {
>  			spin_unlock(&hugetlb_lock);
> +			mutex_unlock(&h->mutex);
>  			NODEMASK_FREE(node_alloc_noretry);
>  			return -EINVAL;
>  		}
> @@ -2722,6 +2725,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  out:
>  	h->max_huge_pages = persistent_huge_pages(h);
>  	spin_unlock(&hugetlb_lock);
> +	mutex_unlock(&h->mutex);
>  
>  	NODEMASK_FREE(node_alloc_noretry);
>  
> @@ -3209,6 +3213,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>  	BUG_ON(order == 0);
>  	h = &hstates[hugetlb_max_hstate++];
> +	mutex_init(&h->mutex);
>  	h->order = order;
>  	h->mask = ~(huge_page_size(h) - 1);
>  	for (i = 0; i < MAX_NUMNODES; ++i)
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-03-25 10:49   ` Michal Hocko
  2021-03-26  2:10   ` Miaohe Lin
  2021-03-27  6:36   ` [External] " Muchun Song
  2 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:31, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.

Call out that this is not intended to introduce any functional change.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 404b0b1c5258..3938ec086b5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>  						unsigned int order) { }
>  #endif
>  
> +/*
> + * Remove hugetlb page from lists, and update dtor so that page appears
> + * as just a compound page.  A reference is held on the page.
> + * NOTE: hugetlb specific page flags stored in page->private are not
> + *	 automatically cleared.  These flags may be used in routines
> + *	 which operate on the resulting compound page.
> + *
> + * Must be called with hugetlb lock held.
> + */
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> +							bool adjust_surplus)
> +{
> +	int nid = page_to_nid(page);
> +
> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +		return;
> +
> +	list_del(&page->lru);
> +
> +	if (HPageFreed(page)) {
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		ClearHPageFreed(page);
> +	}
> +	if (adjust_surplus) {
> +		h->surplus_huge_pages--;
> +		h->surplus_huge_pages_node[nid]--;
> +	}
> +
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> +
> +	ClearHPageTemporary(page);
> +	set_page_refcounted(page);
> +	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +
> +	h->nr_huge_pages--;
> +	h->nr_huge_pages_node[nid]--;
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> @@ -1335,8 +1375,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> -	h->nr_huge_pages--;
> -	h->nr_huge_pages_node[page_to_nid(page)]--;
>  	for (i = 0; i < pages_per_huge_page(h);
>  	     i++, subpage = mem_map_next(subpage, page, i)) {
>  		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1344,10 +1382,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  				1 << PG_active | 1 << PG_private |
>  				1 << PG_writeback);
>  	}
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> -	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> -	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> @@ -1415,15 +1449,12 @@ static void __free_huge_page(struct page *page)
>  		h->resv_huge_pages++;
>  
>  	if (HPageTemporary(page)) {
> -		list_del(&page->lru);
> -		ClearHPageTemporary(page);
> +		remove_hugetlb_page(h, page, false);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
> -		list_del(&page->lru);
> +		remove_hugetlb_page(h, page, true);
>  		update_and_free_page(h, page);
> -		h->surplus_huge_pages--;
> -		h->surplus_huge_pages_node[nid]--;
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> @@ -1708,13 +1739,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  			struct page *page =
>  				list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[node]--;
> -			if (acct_surplus) {
> -				h->surplus_huge_pages--;
> -				h->surplus_huge_pages_node[node]--;
> -			}
> +			remove_hugetlb_page(h, page, acct_surplus);
>  			update_and_free_page(h, page);
>  			ret = 1;
>  			break;
> @@ -1752,7 +1777,6 @@ int dissolve_free_huge_page(struct page *page)
>  	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
> -		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
>  
> @@ -1783,9 +1807,7 @@ int dissolve_free_huge_page(struct page *page)
>  			SetPageHWPoison(page);
>  			ClearPageHWPoison(head);
>  		}
> -		list_del(&head->lru);
> -		h->free_huge_pages--;
> -		h->free_huge_pages_node[nid]--;
> +		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
>  		update_and_free_page(h, head);
>  		rc = 0;
> @@ -2553,10 +2575,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  				return;
>  			if (PageHighMem(page))
>  				continue;
> -			list_del(&page->lru);
> +			remove_hugetlb_page(h, page, false);
>  			update_and_free_page(h, page);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[page_to_nid(page)]--;
>  		}
>  	}
>  }
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25  0:28 ` [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-25 10:55   ` Michal Hocko
  2021-03-25 17:12     ` Mike Kravetz
  2021-03-27  6:54   ` [External] " Muchun Song
  1 sibling, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 10:55 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
> 
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
> 
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

One minor thing below

[...]
> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  						nodemask_t *nodes_allowed)
>  {
>  	int i;
> +	struct list_head page_list;
> +	struct page *page, *next;
>  
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> +	/*
> +	 * Collect pages to be freed on a list, and free after dropping lock
> +	 */
> +	INIT_LIST_HEAD(&page_list);
>  	for_each_node_mask(i, *nodes_allowed) {
> -		struct page *page, *next;
>  		struct list_head *freel = &h->hugepage_freelists[i];
>  		list_for_each_entry_safe(page, next, freel, lru) {
>  			if (count >= h->nr_huge_pages)
> -				return;
> +				goto out;
>  			if (PageHighMem(page))
>  				continue;
>  			remove_hugetlb_page(h, page, false);
> -			update_and_free_page(h, page);
> +			INIT_LIST_HEAD(&page->lru);

What is the point of this INIT_LIST_HEAD? The page has been removed from
the list by remove_hugetlb_page() so it can be added to a new one without
any reinitialization.

> +			list_add(&page->lru, &page_list);
>  		}
>  	}
> +
> +out:
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		list_del(&page->lru);
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +	spin_lock(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-25  0:28 ` [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-03-25 11:06   ` Michal Hocko
  2021-03-25 17:29     ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 11:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:33, Mike Kravetz wrote:
[...]
> @@ -2074,17 +2067,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   *    to the associated reservation map.
>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>   *    the reservation.  As many as unused_resv_pages may be freed.
> - *
> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
> - * we must make sure nobody else can claim pages we are in the process of
> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
> - * number of huge pages we plan to free when dropping the lock.
>   */
>  static void return_unused_surplus_pages(struct hstate *h,
>  					unsigned long unused_resv_pages)
>  {
>  	unsigned long nr_pages;
> +	struct page *page, *t_page;
> +	struct list_head page_list;
> +
> +	/* Uncommit the reservation */
> +	h->resv_huge_pages -= unused_resv_pages;

Is this ok for cases where remove_pool_huge_page fails early? I have to
say I am kinda lost in the resv_huge_pages accounting here. The original
code was already quite suspicious to me, TBH.
>  
>  	/* Cannot return gigantic pages currently */
>  	if (hstate_is_gigantic(h))
> @@ -2101,24 +2093,27 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	 * evenly across all nodes with memory. Iterate across these nodes
>  	 * until we can no longer free unreserved surplus pages. This occurs
>  	 * when the nodes with surplus pages have no free pages.
> -	 * free_pool_huge_page() will balance the freed pages across the
> +	 * remove_pool_huge_page() will balance the freed pages across the
>  	 * on-line nodes with memory and will handle the hstate accounting.
> -	 *
> -	 * Note that we decrement resv_huge_pages as we free the pages.  If
> -	 * we drop the lock, resv_huge_pages will still be sufficiently large
> -	 * to cover subsequent pages we may free.
>  	 */
> +	INIT_LIST_HEAD(&page_list);
>  	while (nr_pages--) {
> -		h->resv_huge_pages--;
> -		unused_resv_pages--;
> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> +		if (!page)
>  			goto out;
> -		cond_resched_lock(&hugetlb_lock);
> +
> +		INIT_LIST_HEAD(&page->lru);

again unnecessary INIT_LIST_HEAD

> +		list_add(&page->lru, &page_list);
>  	}
>  
>  out:
> -	/* Fully uncommit the reservation */
> -	h->resv_huge_pages -= unused_resv_pages;
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, t_page, &page_list, lru) {
> +		list_del(&page->lru);
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}

You have the same construct in 3 different places; maybe it deserves a
little helper, update_and_free_page_batch().

> +	spin_lock(&hugetlb_lock);
>  }
>  
>  
> @@ -2648,6 +2643,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  			      nodemask_t *nodes_allowed)
>  {
>  	unsigned long min_count, ret;
> +	struct page *page, *t_page;
> +	struct list_head page_list;
>  	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
>  
>  	/*
> @@ -2757,11 +2754,28 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>  	min_count = max(count, min_count);
>  	try_to_free_low(h, min_count, nodes_allowed);
> +
> +	/*
> +	 * Collect pages to be removed on list without dropping lock
> +	 */
> +	INIT_LIST_HEAD(&page_list);
>  	while (min_count < persistent_huge_pages(h)) {
> -		if (!free_pool_huge_page(h, nodes_allowed, 0))
> +		page = remove_pool_huge_page(h, nodes_allowed, 0);
> +		if (!page)
>  			break;
> -		cond_resched_lock(&hugetlb_lock);
> +
> +		INIT_LIST_HEAD(&page->lru);

INIT_LIST_HEAD again.

> +		list_add(&page->lru, &page_list);
>  	}
> +	/* free the pages after dropping lock */
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, t_page, &page_list, lru) {
> +		list_del(&page->lru);
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +	spin_lock(&hugetlb_lock);
> +
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, 1))
>  			break;
> -- 
> 2.30.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-25  0:28 ` [PATCH 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-03-25 11:21   ` Michal Hocko
  2021-03-25 17:32     ` Mike Kravetz
  2021-03-27  7:06   ` [External] " Muchun Song
  1 sibling, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 11:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:34, Mike Kravetz wrote:
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context.  That commit hands off free_huge_page
> processing to a workqueue if !in_task.  However, as seen in [1] this
> does not cover all cases.  Instead, make the locks taken in the
> free_huge_page irq safe.

I would just call out the deadlock scenario here in the changelog
rather than torture people by forcing them to follow up on the 0day
report. Something like the below?
"
"
However this doesn't cover all the cases as pointed out by the 0day bot
lockdep report [1]
:  Possible interrupt unsafe locking scenario:
: 
:        CPU0                    CPU1
:        ----                    ----
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel later explained that this is very likely the TCP TX zerocopy
from hugetlb pages scenario, where the networking code drops the last
reference to a hugetlb page while having IRQs disabled. The hugetlb
freeing path doesn't disable IRQs while holding hugetlb_lock, so a lock
dependency chain can lead to a deadlock.
"
 

> This patch does the following:
> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>   changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
> 
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/hugetlb.c        | 169 +++++++++++++++++---------------------------
>  mm/hugetlb_cgroup.c |   8 +--
>  2 files changed, 67 insertions(+), 110 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a9785e73379f..e4c441b878f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  	return true;
>  }
>  
> -static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
> +						unsigned long irq_flags)
>  {
> -	spin_unlock(&spool->lock);
> +	spin_unlock_irqrestore(&spool->lock, irq_flags);
>  
>  	/* If no pages are used, and no other handles to the subpool
>  	 * remain, give up any reservations based on minimum size and
> @@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
>  
>  void hugepage_put_subpool(struct hugepage_subpool *spool)
>  {
> -	spin_lock(&spool->lock);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&spool->lock, flags);
>  	BUG_ON(!spool->count);
>  	spool->count--;
> -	unlock_or_release_subpool(spool);
> +	unlock_or_release_subpool(spool, flags);
>  }
>  
>  /*
> @@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>  	if (!spool)
>  		return ret;
>  
> -	spin_lock(&spool->lock);
> +	spin_lock_irq(&spool->lock);
>  
>  	if (spool->max_hpages != -1) {		/* maximum size accounting */
>  		if ((spool->used_hpages + delta) <= spool->max_hpages)
> @@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>  	}
>  
>  unlock_ret:
> -	spin_unlock(&spool->lock);
> +	spin_unlock_irq(&spool->lock);
>  	return ret;
>  }
>  
> @@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>  				       long delta)
>  {
>  	long ret = delta;
> +	unsigned long flags;
>  
>  	if (!spool)
>  		return delta;
>  
> -	spin_lock(&spool->lock);
> +	spin_lock_irqsave(&spool->lock, flags);
>  
>  	if (spool->max_hpages != -1)		/* maximum size accounting */
>  		spool->used_hpages -= delta;
> @@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>  	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
>  	 * quota reference, free it now.
>  	 */
> -	unlock_or_release_subpool(spool);
> +	unlock_or_release_subpool(spool, flags);
>  
>  	return ret;
>  }
> @@ -1401,7 +1405,7 @@ struct hstate *size_to_hstate(unsigned long size)
>  	return NULL;
>  }
>  
> -static void __free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1411,6 +1415,7 @@ static void __free_huge_page(struct page *page)
>  	int nid = page_to_nid(page);
>  	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
>  	bool restore_reserve;
> +	unsigned long flags;
>  
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(page_mapcount(page), page);
> @@ -1439,7 +1444,7 @@ static void __free_huge_page(struct page *page)
>  			restore_reserve = true;
>  	}
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irqsave(&hugetlb_lock, flags);
>  	ClearHPageMigratable(page);
>  	hugetlb_cgroup_uncharge_page(hstate_index(h),
>  				     pages_per_huge_page(h), page);
> @@ -1450,66 +1455,18 @@ static void __free_huge_page(struct page *page)
>  
>  	if (HPageTemporary(page)) {
>  		remove_hugetlb_page(h, page, false);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_page(h, page, true);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  		update_and_free_page(h, page);
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> -		spin_unlock(&hugetlb_lock);
> -	}
> -}
> -
> -/*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> - *
> - * free_hpage_workfn() locklessly retrieves the linked list of pages to
> - * be freed and frees them one-by-one. As the page->mapping pointer is
> - * going to be cleared in __free_huge_page() anyway, it is reused as the
> - * llist_node structure of a lockless linked list of huge pages to be freed.
> - */
> -static LLIST_HEAD(hpage_freelist);
> -
> -static void free_hpage_workfn(struct work_struct *work)
> -{
> -	struct llist_node *node;
> -	struct page *page;
> -
> -	node = llist_del_all(&hpage_freelist);
> -
> -	while (node) {
> -		page = container_of((struct address_space **)node,
> -				     struct page, mapping);
> -		node = node->next;
> -		__free_huge_page(page);
> -	}
> -}
> -static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> -
> -void free_huge_page(struct page *page)
> -{
> -	/*
> -	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> -	 */
> -	if (!in_task()) {
> -		/*
> -		 * Only call schedule_work() if hpage_freelist is previously
> -		 * empty. Otherwise, schedule_work() had been called but the
> -		 * workfn hasn't retrieved the list yet.
> -		 */
> -		if (llist_add((struct llist_node *)&page->mapping,
> -			      &hpage_freelist))
> -			schedule_work(&free_hpage_work);
> -		return;
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  	}
> -
> -	__free_huge_page(page);
>  }
>  
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1519,11 +1476,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	hugetlb_set_page_subpool(page, NULL);
>  	set_hugetlb_cgroup(page, NULL);
>  	set_hugetlb_cgroup_rsvd(page, NULL);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h->nr_huge_pages++;
>  	h->nr_huge_pages_node[nid]++;
>  	ClearHPageFreed(page);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  }
>  
>  static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> @@ -1769,7 +1726,7 @@ int dissolve_free_huge_page(struct page *page)
>  	if (!PageHuge(page))
>  		return 0;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (!PageHuge(page)) {
>  		rc = 0;
>  		goto out;
> @@ -1786,7 +1743,7 @@ int dissolve_free_huge_page(struct page *page)
>  		 * when it is dissolved.
>  		 */
>  		if (unlikely(!HPageFreed(head))) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			cond_resched();
>  
>  			/*
> @@ -1810,12 +1767,12 @@ int dissolve_free_huge_page(struct page *page)
>  		}
>  		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		update_and_free_page(h, head);
>  		return 0;
>  	}
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return rc;
>  }
>  
> @@ -1857,16 +1814,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
>  		goto out_unlock;
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>  	if (!page)
>  		return NULL;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * We could have raced with the pool size change.
>  	 * Double check that and simply deallocate the new page
> @@ -1876,7 +1833,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	 */
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>  		SetHPageTemporary(page);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		put_page(page);
>  		return NULL;
>  	} else {
> @@ -1885,7 +1842,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	}
>  
>  out_unlock:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return page;
>  }
> @@ -1935,17 +1892,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>  		nodemask_t *nmask, gfp_t gfp_mask)
>  {
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
>  		struct page *page;
>  
>  		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
>  		if (page) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			return page;
>  		}
>  	}
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
> @@ -1993,7 +1950,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  
>  	ret = -ENOMEM;
>  retry:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
>  		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
>  				NUMA_NO_NODE, NULL);
> @@ -2010,7 +1967,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	 * After retaking hugetlb_lock, we need to recalculate 'needed'
>  	 * because either resv_huge_pages or free_huge_pages may have changed.
>  	 */
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	needed = (h->resv_huge_pages + delta) -
>  			(h->free_huge_pages + allocated);
>  	if (needed > 0) {
> @@ -2050,12 +2007,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		enqueue_huge_page(h, page);
>  	}
>  free:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	/* Free unnecessary surplus pages to the buddy allocator */
>  	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
>  		put_page(page);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	return ret;
>  }
> @@ -2107,13 +2064,13 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	}
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	list_for_each_entry_safe(page, t_page, &page_list, lru) {
>  		list_del(&page->lru);
>  		update_and_free_page(h, page);
>  		cond_resched();
>  	}
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  }
>  
>  
> @@ -2348,7 +2305,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	if (ret)
>  		goto out_uncharge_cgroup_reservation;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * glb_chg is passed to indicate whether or not a page must be taken
>  	 * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -2356,7 +2313,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	 */
>  	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>  	if (!page) {
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
>  		if (!page)
>  			goto out_uncharge_cgroup;
> @@ -2364,7 +2321,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  			SetHPageRestoreReserve(page);
>  			h->resv_huge_pages--;
>  		}
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		list_add(&page->lru, &h->hugepage_activelist);
>  		/* Fall through */
>  	}
> @@ -2377,7 +2334,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  						  h_cg, page);
>  	}
>  
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	hugetlb_set_page_subpool(page, spool);
>  
> @@ -2591,13 +2548,13 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  	}
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	list_for_each_entry_safe(page, next, &page_list, lru) {
>  		list_del(&page->lru);
>  		update_and_free_page(h, page);
>  		cond_resched();
>  	}
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> @@ -2659,7 +2616,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  
>  	/* mutex prevents concurrent adjustments for the same hstate */
>  	mutex_lock(&h->mutex);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	/*
>  	 * Check for a node specific request.
> @@ -2690,7 +2647,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	 */
>  	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>  		if (count > persistent_huge_pages(h)) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			mutex_unlock(&h->mutex);
>  			NODEMASK_FREE(node_alloc_noretry);
>  			return -EINVAL;
> @@ -2720,14 +2677,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		 * page, free_huge_page will handle it by freeing the page
>  		 * and reducing the surplus.
>  		 */
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  
>  		/* yield cpu to avoid soft lockup */
>  		cond_resched();
>  
>  		ret = alloc_pool_huge_page(h, nodes_allowed,
>  						node_alloc_noretry);
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		if (!ret)
>  			goto out;
>  
> @@ -2768,13 +2725,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		list_add(&page->lru, &page_list);
>  	}
>  	/* free the pages after dropping lock */
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	list_for_each_entry_safe(page, t_page, &page_list, lru) {
>  		list_del(&page->lru);
>  		update_and_free_page(h, page);
>  		cond_resched();
>  	}
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, 1))
> @@ -2782,7 +2739,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	}
>  out:
>  	h->max_huge_pages = persistent_huge_pages(h);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	mutex_unlock(&h->mutex);
>  
>  	NODEMASK_FREE(node_alloc_noretry);
> @@ -2938,9 +2895,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  	if (err)
>  		return err;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h->nr_overcommit_huge_pages = input;
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return count;
>  }
> @@ -3527,9 +3484,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>  		goto out;
>  
>  	if (write) {
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		h->nr_overcommit_huge_pages = tmp;
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  	}
>  out:
>  	return ret;
> @@ -3625,7 +3582,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>  	if (!delta)
>  		return 0;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * When cpuset is configured, it breaks the strict hugetlb page
>  	 * reservation as the accounting is done on a global variable. Such
> @@ -3664,7 +3621,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>  		return_unused_surplus_pages(h, (unsigned long) -delta);
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return ret;
>  }
>  
> @@ -5727,7 +5684,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  {
>  	bool ret = true;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (!PageHeadHuge(page) ||
>  	    !HPageMigratable(page) ||
>  	    !get_page_unless_zero(page)) {
> @@ -5737,16 +5694,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  	ClearHPageMigratable(page);
>  	list_move_tail(&page->lru, list);
>  unlock:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return ret;
>  }
>  
>  void putback_active_hugepage(struct page *page)
>  {
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	SetHPageMigratable(page);
>  	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	put_page(page);
>  }
>  
> @@ -5780,12 +5737,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  		 */
>  		if (new_nid == old_nid)
>  			return;
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		if (h->surplus_huge_pages_node[old_nid]) {
>  			h->surplus_huge_pages_node[old_nid]--;
>  			h->surplus_huge_pages_node[new_nid]++;
>  		}
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  	}
>  }
>  
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 726b85f4f303..5383023d0cca 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	do {
>  		idx = 0;
>  		for_each_hstate(h) {
> -			spin_lock(&hugetlb_lock);
> +			spin_lock_irq(&hugetlb_lock);
>  			list_for_each_entry(page, &h->hugepage_activelist, lru)
>  				hugetlb_cgroup_move_parent(idx, h_cg, page);
>  
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			idx++;
>  		}
>  		cond_resched();
> @@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>  	if (hugetlb_cgroup_disabled())
>  		return;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h_cg = hugetlb_cgroup_from_page(oldhpage);
>  	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
>  	set_hugetlb_cgroup(oldhpage, NULL);
> @@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>  	set_hugetlb_cgroup(newhpage, h_cg);
>  	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
>  	list_move(&newhpage->lru, &h->hugepage_activelist);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return;
>  }
>  
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
@ 2021-03-25 11:22   ` Michal Hocko
  2021-03-26  2:12   ` Miaohe Lin
  2021-03-27  8:14   ` [External] " Muchun Song
  2 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 11:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed 24-03-21 17:28:35, Mike Kravetz wrote:
> After making hugetlb lock irq safe and separating some functionality
> done under the lock, add some lockdep_assert_held to help verify
> locking.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/hugetlb.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e4c441b878f2..de5b3cf4a155 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>  	int nid = page_to_nid(page);
> +
> +	lockdep_assert_held(&hugetlb_lock);
>  	list_move(&page->lru, &h->hugepage_freelists[nid]);
>  	h->free_huge_pages++;
>  	h->free_huge_pages_node[nid]++;
> @@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>  	struct page *page;
>  	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
>  		if (pin && !is_pinnable_page(page))
>  			continue;
> @@ -1345,6 +1348,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
>  {
>  	int nid = page_to_nid(page);
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> @@ -1690,6 +1694,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
>  	int nr_nodes, node;
>  	struct page *page = NULL;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>  		/*
>  		 * If we're returning unused surplus pages, only examine
> @@ -1939,6 +1944,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	long needed, allocated;
>  	bool alloc_ok = true;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
>  	if (needed <= 0) {
>  		h->resv_huge_pages += delta;
> @@ -2032,6 +2038,7 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	struct page *page, *t_page;
>  	struct list_head page_list;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	/* Uncommit the reservation */
>  	h->resv_huge_pages -= unused_resv_pages;
>  
> @@ -2527,6 +2534,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  	struct list_head page_list;
>  	struct page *page, *next;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> @@ -2573,6 +2581,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  {
>  	int nr_nodes, node;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	VM_BUG_ON(delta != -1 && delta != 1);
>  
>  	if (delta < 0) {
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
  2021-03-25 10:47   ` Michal Hocko
@ 2021-03-25 12:29   ` Oscar Salvador
  2021-03-26  1:52   ` Miaohe Lin
  2 siblings, 0 replies; 53+ messages in thread
From: Oscar Salvador @ 2021-03-25 12:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Wed, Mar 24, 2021 at 05:28:30PM -0700, Mike Kravetz wrote:
> The helper routine hstate_next_node_to_alloc accesses and modifies the
> hstate variable next_nid_to_alloc.  The helper is used by the routines
> alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
> called with hugetlb_lock held.  However, alloc_pool_huge_page can not
> be called with the hugetlb lock held as it will call the page allocator.
> Two instances of alloc_pool_huge_page could be run in parallel or
> alloc_pool_huge_page could run in parallel with adjust_pool_surplus
> which may result in the variable next_nid_to_alloc becoming invalid
> for the caller and pages being allocated on the wrong node.

Is this something you have seen happening? If so, is it easy to
trigger? I doubt it, as I have not seen any bug report, but I am just
wondering whether a Fixes tag is needed, or probably not worth it, right?

> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> +	struct mutex mutex;

I am also with Michal here, renaming the mutex to something closer to
its function might be better to understand it without diving too much in
the code.

>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f9ba63fc1747..404b0b1c5258 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2616,6 +2616,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	else
>  		return -ENOMEM;
>  
> +	/* mutex prevents concurrent adjustments for the same hstate */
> +	mutex_lock(&h->mutex);
>  	spin_lock(&hugetlb_lock);

I find the above comment a bit misleading.
AFAIK, hugetlb_lock also protects from concurrent adjustments for the
same hstate (hugepage_activelist, free_huge_pages, surplus_huge_pages,
etc...).
Would it be more appropriate to say that mutex_lock() only prevents
simultaneous sysfs/proc operations?

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 10:22     ` Michal Hocko
@ 2021-03-25 16:56       ` Mike Kravetz
  2021-03-25 17:15         ` David Hildenbrand
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 16:56 UTC (permalink / raw)
  To: Michal Hocko, David Hildenbrand
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Joonsoo Kim

On 3/25/21 3:22 AM, Michal Hocko wrote:
> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>> On 25.03.21 01:28, Mike Kravetz wrote:
>>> From: Roman Gushchin <guro@fb.com>
>>>
>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>> It makes it a blocking function, which complicates its usage from
>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>
>>> This patch introduces a non-blocking cma_release_nowait(), which
>>> postpones the cma bitmap clearance. It's done later from a work
>>> context. The first page in the cma allocation is used to store
>>> the work struct. Because CMA allocations and de-allocations are
>>> usually not that frequent, a single global workqueue is used.
>>>
>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>> flushes the cma_release_wq workqueue. To avoid a performance
>>> regression in the case when only cma_release() is used, gate it
>>> by a per-cma area flag, which is set by the first call
>>> of cma_release_nowait().
>>>
>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>
>>
>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>> amount of time in there?
> 
> Good question. Looking at the code it doesn't seem that there is any
> blockable operation or any heavy lifting done under the lock.
> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> the lock and there was a simple bitmat protection back then. I suspect
> the patch just followed the cma_mutex lead and used the same type of the
> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> 
> This all suggests that there is no real reason to use a sleepable lock
> at all and we do not need all this heavy lifting.
> 

When Roman first proposed these patches, I brought up the same issue:

https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/

Previously, Roman proposed replacing the mutex with a spinlock but
Joonsoo was opposed.

Adding Joonsoo on Cc:
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25 10:55   ` Michal Hocko
@ 2021-03-25 17:12     ` Mike Kravetz
  2021-03-25 19:39       ` Michal Hocko
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 17:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/25/21 3:55 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> One minor thing below
> 
> [...]
>> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>>  						nodemask_t *nodes_allowed)
>>  {
>>  	int i;
>> +	struct list_head page_list;
>> +	struct page *page, *next;
>>  
>>  	if (hstate_is_gigantic(h))
>>  		return;
>>  
>> +	/*
>> +	 * Collect pages to be freed on a list, and free after dropping lock
>> +	 */
>> +	INIT_LIST_HEAD(&page_list);
>>  	for_each_node_mask(i, *nodes_allowed) {
>> -		struct page *page, *next;
>>  		struct list_head *freel = &h->hugepage_freelists[i];
>>  		list_for_each_entry_safe(page, next, freel, lru) {
>>  			if (count >= h->nr_huge_pages)
>> -				return;
>> +				goto out;
>>  			if (PageHighMem(page))
>>  				continue;
>>  			remove_hugetlb_page(h, page, false);
>> -			update_and_free_page(h, page);
>> +			INIT_LIST_HEAD(&page->lru);
> 
> What is the point of this INIT_LIST_HEAD? Page has been removed from the
> list by remove_hugetlb_page so it can be added to a new one without any
> reinitialization.

remove_hugetlb_page just does a list_del.  list_del will poison the
pointers in page->lru.  The following list_add will then complain about
list corruption.

I could replace the list_del in remove_hugetlb_page with list_del_init.
However, not all callers of remove_hugetlb_page will be adding the page
to a list.  If we just call update_and_free_page, there is no need to
reinitialize the list pointers.

Might be better to just use list_del_init in remove_hugetlb_page to
avoid any questions like this.
-- 
Mike Kravetz

> 
>> +			list_add(&page->lru, &page_list);
>>  		}
>>  	}
>> +
>> +out:
>> +	spin_unlock(&hugetlb_lock);
>> +	list_for_each_entry_safe(page, next, &page_list, lru) {
>> +		list_del(&page->lru);
>> +		update_and_free_page(h, page);
>> +		cond_resched();
>> +	}
>> +	spin_lock(&hugetlb_lock);
>>  }
>>  #else
>>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
>> -- 
>> 2.30.2
>>
> 


* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 16:56       ` Mike Kravetz
@ 2021-03-25 17:15         ` David Hildenbrand
  2021-03-25 20:12           ` Minchan Kim
  0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2021-03-25 17:15 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Joonsoo Kim

On 25.03.21 17:56, Mike Kravetz wrote:
> On 3/25/21 3:22 AM, Michal Hocko wrote:
>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>> From: Roman Gushchin <guro@fb.com>
>>>>
>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>> It makes it a blocking function, which complicates its usage from
>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>
>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>> postpones the cma bitmap clearance. It's done later from a work
>>>> context. The first page in the cma allocation is used to store
>>>> the work struct. Because CMA allocations and de-allocations are
>>>> usually not that frequent, a single global workqueue is used.
>>>>
>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>> regression in the case when only cma_release() is used, gate it
>>>> by a per-cma area flag, which is set by the first call
>>>> of cma_release_nowait().
>>>>
>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>
>>>
>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>> amount of time in there?
>>
>> Good question. Looking at the code it doesn't seem that there is any
>> blockable operation or any heavy lifting done under the lock.
>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>> the lock and there was a simple bitmap protection back then. I suspect
>> the patch just followed the cma_mutex lead and used the same type of the
>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>
>> This all suggests that there is no real reason to use a sleepable lock
>> at all and we do not need all this heavy lifting.
>>
> 
> When Roman first proposed these patches, I brought up the same issue:
> 
> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> 
> Previously, Roman proposed replacing the mutex with a spinlock but
> Joonsoo was opposed.
> 
> Adding Joonsoo on Cc:
> 

There has to be a good reason not to. And if there is a good reason, 
lockless clearing might be one feasible alternative.

-- 
Thanks,

David / dhildenb



* Re: [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-25 11:06   ` Michal Hocko
@ 2021-03-25 17:29     ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/25/21 4:06 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:33, Mike Kravetz wrote:
> [...]
>> @@ -2074,17 +2067,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>>   *    to the associated reservation map.
>>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>>   *    the reservation.  As many as unused_resv_pages may be freed.
>> - *
>> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
>> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
>> - * we must make sure nobody else can claim pages we are in the process of
>> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
>> - * number of huge pages we plan to free when dropping the lock.
>>   */
>>  static void return_unused_surplus_pages(struct hstate *h,
>>  					unsigned long unused_resv_pages)
>>  {
>>  	unsigned long nr_pages;
>> +	struct page *page, *t_page;
>> +	struct list_head page_list;
>> +
>> +	/* Uncommit the reservation */
>> +	h->resv_huge_pages -= unused_resv_pages;
> 
> Is this ok for cases where remove_pool_huge_page fails early? I have to
> say I am kinda lost in the resv_huge_pages accounting here. The original
> code was already quite suspicious to me, TBH.

Yes, it is safe.  The existing code will do the same but perhaps in a
different way.

Some history is in the changelog for commit e5bbc8a6c992 ("mm/hugetlb.c:
fix reservation race when freeing surplus pages").  The race fixed by
that commit was introduced by the cond_resched_lock() which we are
removing in this patch.  Therefore, we can remove the tricky code that
was added to deal with dropping the lock.

I should add an explanation to the commit message.

Additionally, I suspect we may end up once again dropping the lock in
the below loop when adding vmemmap support.  Then, we would need to add
back the code in commit e5bbc8a6c992.  Sigh.

>>  
>>  	/* Cannot return gigantic pages currently */
>>  	if (hstate_is_gigantic(h))
>> @@ -2101,24 +2093,27 @@ static void return_unused_surplus_pages(struct hstate *h,
>>  	 * evenly across all nodes with memory. Iterate across these nodes
>>  	 * until we can no longer free unreserved surplus pages. This occurs
>>  	 * when the nodes with surplus pages have no free pages.
>> -	 * free_pool_huge_page() will balance the freed pages across the
>> +	 * remove_pool_huge_page() will balance the freed pages across the
>>  	 * on-line nodes with memory and will handle the hstate accounting.
>> -	 *
>> -	 * Note that we decrement resv_huge_pages as we free the pages.  If
>> -	 * we drop the lock, resv_huge_pages will still be sufficiently large
>> -	 * to cover subsequent pages we may free.
>>  	 */
>> +	INIT_LIST_HEAD(&page_list);
>>  	while (nr_pages--) {
>> -		h->resv_huge_pages--;
>> -		unused_resv_pages--;
>> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>> +		if (!page)
>>  			goto out;
>> -		cond_resched_lock(&hugetlb_lock);
>> +
>> +		INIT_LIST_HEAD(&page->lru);
> 
> again unnecessary INIT_LIST_HEAD
> 
>> +		list_add(&page->lru, &page_list);
>>  	}
>>  
>>  out:
>> -	/* Fully uncommit the reservation */
>> -	h->resv_huge_pages -= unused_resv_pages;
>> +	spin_unlock(&hugetlb_lock);
>> +	list_for_each_entry_safe(page, t_page, &page_list, lru) {
>> +		list_del(&page->lru);
>> +		update_and_free_page(h, page);
>> +		cond_resched();
>> +	}
> 
> You have the same construct at 3 different places; maybe it deserves a
> little helper update_and_free_page_batch.

Sure.  I will add it.

-- 
Mike Kravetz


* Re: [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-25 11:21   ` Michal Hocko
@ 2021-03-25 17:32     ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 17:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/25/21 4:21 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:34, Mike Kravetz wrote:
>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>> non-task context") was added to address the issue of free_huge_page
>> being called from irq context.  That commit hands off free_huge_page
>> processing to a workqueue if !in_task.  However, as seen in [1] this
>> does not cover all cases.  Instead, make the locks taken in the
>> free_huge_page irq safe.
> 
> I would just call out the deadlock scenario here in the changelog
> rather than torture people by forcing them to follow up on the 0day
> report. Something like the below?
> "
> However this doesn't cover all the cases as pointed out by 0day bot
> lockdep report [1]
> :  Possible interrupt unsafe locking scenario:
> : 
> :        CPU0                    CPU1
> :        ----                    ----
> :   lock(hugetlb_lock);
> :                                local_irq_disable();
> :                                lock(slock-AF_INET);
> :                                lock(hugetlb_lock);
> :   <Interrupt>
> :     lock(slock-AF_INET);
> 
> Shakeel has later explained that this is very likely TCP TX
> zerocopy from hugetlb pages scenario when the networking code drops a
> last reference to hugetlb page while having IRQ disabled. Hugetlb
> freeing path doesn't disable IRQ while holding hugetlb_lock so a lock
> dependency chain can lead to a deadlock.
>  

Thanks.  I will update changelog.

> 
>> This patch does the following:
>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>   changing spin_*lock calls to spin_*lock_irq* calls.
>> - Make subpool lock irq safe in a similar manner.
>> - Revert the !in_task check and workqueue handoff.
>>
>> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

And, thanks for looking at the series!
-- 
Mike Kravetz


* Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25 17:12     ` Mike Kravetz
@ 2021-03-25 19:39       ` Michal Hocko
  2021-03-25 20:33         ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-25 19:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu 25-03-21 10:12:05, Mike Kravetz wrote:
> On 3/25/21 3:55 AM, Michal Hocko wrote:
> > On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
> >> With the introduction of remove_hugetlb_page(), there is no need for
> >> update_and_free_page to hold the hugetlb lock.  Change all callers to
> >> drop the lock before calling.
> >>
> >> With additional code modifications, this will allow loops which decrease
> >> the huge page pool to drop the hugetlb_lock with each page to reduce
> >> long hold times.
> >>
> >> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> >> a subsequent patch which restructures free_pool_huge_page.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > One minor thing below
> > 
> > [...]
> >> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
> >>  						nodemask_t *nodes_allowed)
> >>  {
> >>  	int i;
> >> +	struct list_head page_list;
> >> +	struct page *page, *next;
> >>  
> >>  	if (hstate_is_gigantic(h))
> >>  		return;
> >>  
> >> +	/*
> >> +	 * Collect pages to be freed on a list, and free after dropping lock
> >> +	 */
> >> +	INIT_LIST_HEAD(&page_list);
> >>  	for_each_node_mask(i, *nodes_allowed) {
> >> -		struct page *page, *next;
> >>  		struct list_head *freel = &h->hugepage_freelists[i];
> >>  		list_for_each_entry_safe(page, next, freel, lru) {
> >>  			if (count >= h->nr_huge_pages)
> >> -				return;
> >> +				goto out;
> >>  			if (PageHighMem(page))
> >>  				continue;
> >>  			remove_hugetlb_page(h, page, false);
> >> -			update_and_free_page(h, page);
> >> +			INIT_LIST_HEAD(&page->lru);
> > 
> > What is the point of this INIT_LIST_HEAD? Page has been removed from the
> > list by remove_hugetlb_page so it can be added to a new one without any
> > reinitialization.
> 
> remove_hugetlb_page just does a list_del.  list_del will poison the
> pointers in page->lru.  The following list_add will then complain about
> list corruption.

Are you sure? list_del followed by list_add is a normal API usage
pattern AFAIK. INIT_LIST_HEAD is to do the first initialization before
first use.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 17:15         ` David Hildenbrand
@ 2021-03-25 20:12           ` Minchan Kim
  2021-03-25 23:19             ` Roman Gushchin
  0 siblings, 1 reply; 53+ messages in thread
From: Minchan Kim @ 2021-03-25 20:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Michal Hocko, linux-mm, linux-kernel,
	Roman Gushchin, Shakeel Butt, Oscar Salvador, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Andrew Morton, Joonsoo Kim

On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
> On 25.03.21 17:56, Mike Kravetz wrote:
> > On 3/25/21 3:22 AM, Michal Hocko wrote:
> > > On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> > > > On 25.03.21 01:28, Mike Kravetz wrote:
> > > > > From: Roman Gushchin <guro@fb.com>
> > > > > 
> > > > > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > > > > It makes it a blocking function, which complicates its usage from
> > > > > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > > > > dropping the hugetlb_lock spinlock to call cma_release().
> > > > > 
> > > > > This patch introduces a non-blocking cma_release_nowait(), which
> > > > > postpones the cma bitmap clearance. It's done later from a work
> > > > > context. The first page in the cma allocation is used to store
> > > > > the work struct. Because CMA allocations and de-allocations are
> > > > > usually not that frequent, a single global workqueue is used.
> > > > > 
> > > > > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > > > > flushes the cma_release_wq workqueue. To avoid a performance
> > > > > regression in the case when only cma_release() is used, gate it
> > > > > by a per-cma area flag, which is set by the first call
> > > > > of cma_release_nowait().
> > > > > 
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > > ---
> > > > 
> > > > 
> > > > 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> > > > only protect the bitmap. Are bitmaps that huge that we spend a significant
> > > > amount of time in there?
> > > 
> > > Good question. Looking at the code it doesn't seem that there is any
> > > blockable operation or any heavy lifting done under the lock.
> > > 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> > > the lock and there was a simple bitmap protection back then. I suspect
> > > the patch just followed the cma_mutex lead and used the same type of the
> > > lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> > > 
> > > This all suggests that there is no real reason to use a sleepable lock
> > > at all and we do not need all this heavy lifting.
> > > 
> > 
> > When Roman first proposed these patches, I brought up the same issue:
> > 
> > https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> > 
> > Previously, Roman proposed replacing the mutex with a spinlock but
> > Joonsoo was opposed.
> > 
> > Adding Joonsoo on Cc:
> > 
> 
> There has to be a good reason not to. And if there is a good reason,
> lockless clearing might be one feasible alternative.

I also don't think the nowait variant is a good idea. If the scanning
of the bitmap is *really* significant, it might be a signal that we
need a different technique or data structure than a bitmap, rather
than a new API variant.


* Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25 19:39       ` Michal Hocko
@ 2021-03-25 20:33         ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 20:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/25/21 12:39 PM, Michal Hocko wrote:
> On Thu 25-03-21 10:12:05, Mike Kravetz wrote:
>> On 3/25/21 3:55 AM, Michal Hocko wrote:
>>> On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
>>>> With the introduction of remove_hugetlb_page(), there is no need for
>>>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>>>> drop the lock before calling.
>>>>
>>>> With additional code modifications, this will allow loops which decrease
>>>> the huge page pool to drop the hugetlb_lock with each page to reduce
>>>> long hold times.
>>>>
>>>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>>>> a subsequent patch which restructures free_pool_huge_page.
>>>>
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>
>>> One minor thing below
>>>
>>> [...]
>>>> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>>>>  						nodemask_t *nodes_allowed)
>>>>  {
>>>>  	int i;
>>>> +	struct list_head page_list;
>>>> +	struct page *page, *next;
>>>>  
>>>>  	if (hstate_is_gigantic(h))
>>>>  		return;
>>>>  
>>>> +	/*
>>>> +	 * Collect pages to be freed on a list, and free after dropping lock
>>>> +	 */
>>>> +	INIT_LIST_HEAD(&page_list);
>>>>  	for_each_node_mask(i, *nodes_allowed) {
>>>> -		struct page *page, *next;
>>>>  		struct list_head *freel = &h->hugepage_freelists[i];
>>>>  		list_for_each_entry_safe(page, next, freel, lru) {
>>>>  			if (count >= h->nr_huge_pages)
>>>> -				return;
>>>> +				goto out;
>>>>  			if (PageHighMem(page))
>>>>  				continue;
>>>>  			remove_hugetlb_page(h, page, false);
>>>> -			update_and_free_page(h, page);
>>>> +			INIT_LIST_HEAD(&page->lru);
>>>
>>> What is the point of this INIT_LIST_HEAD? Page has been removed from the
>>> list by remove_hugetlb_page so it can be added to a new one without any
>>> reinitialization.
>>
>> remove_hugetlb_page just does a list_del.  list_del will poison the
>> pointers in page->lru.  The following list_add will then complain about
>> list corruption.
> 
> Are you sure? list_del followed by list_add is a normal API usage
> pattern AFAIK. INIT_LIST_HEAD is to do the first initialization before
> first use.

Sorry for the noise.  The INIT_LIST_HEAD is indeed unnecessary.

I must have got confused while looking at a corrupt list splat in
earlier code development.
-- 
Mike Kravetz


* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 20:12           ` Minchan Kim
@ 2021-03-25 23:19             ` Roman Gushchin
  2021-03-25 23:49               ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Roman Gushchin @ 2021-03-25 23:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Hildenbrand, Mike Kravetz, Michal Hocko, linux-mm,
	linux-kernel, Shakeel Butt, Oscar Salvador, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Andrew Morton, Joonsoo Kim

On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
> > On 25.03.21 17:56, Mike Kravetz wrote:
> > > On 3/25/21 3:22 AM, Michal Hocko wrote:
> > > > On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> > > > > On 25.03.21 01:28, Mike Kravetz wrote:
> > > > > > From: Roman Gushchin <guro@fb.com>
> > > > > > 
> > > > > > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > > > > > It makes it a blocking function, which complicates its usage from
> > > > > > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > > > > > dropping the hugetlb_lock spinlock to call cma_release().
> > > > > > 
> > > > > > This patch introduces a non-blocking cma_release_nowait(), which
> > > > > > postpones the cma bitmap clearance. It's done later from a work
> > > > > > context. The first page in the cma allocation is used to store
> > > > > > the work struct. Because CMA allocations and de-allocations are
> > > > > > usually not that frequent, a single global workqueue is used.
> > > > > > 
> > > > > > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > > > > > flushes the cma_release_wq workqueue. To avoid a performance
> > > > > > regression in the case when only cma_release() is used, gate it
> > > > > > by a per-cma area flag, which is set by the first call
> > > > > > of cma_release_nowait().
> > > > > > 
> > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > > > ---
> > > > > 
> > > > > 
> > > > > 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> > > > > only protect the bitmap. Are bitmaps that huge that we spend a significant
> > > > > amount of time in there?
> > > > 
> > > > Good question. Looking at the code it doesn't seem that there is any
> > > > blockable operation or any heavy lifting done under the lock.
> > > > 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> > > > the lock and there was a simple bitmap protection back then. I suspect
> > > > the patch just followed the cma_mutex lead and used the same type of the
> > > > lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> > > > 
> > > > This all suggests that there is no real reason to use a sleepable lock
> > > > at all and we do not need all this heavy lifting.
> > > > 
> > > 
> > > When Roman first proposed these patches, I brought up the same issue:
> > > 
> > > https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> > > 
> > > Previously, Roman proposed replacing the mutex with a spinlock but
> > > Joonsoo was opposed.
> > > 
> > > Adding Joonsoo on Cc:
> > > 
> > 
> > There has to be a good reason not to. And if there is a good reason,
> > lockless clearing might be one feasible alternative.
> 
> I also don't think the nowait variant is a good idea. If the scanning
> of the bitmap is *really* significant, it might be a signal that we
> need a different technique or data structure than a bitmap, rather
> than a new API variant.

I'd also prefer to just replace the mutex with a spinlock rather than fiddling
with a delayed release.

Thanks!


* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 23:19             ` Roman Gushchin
@ 2021-03-25 23:49               ` Mike Kravetz
  2021-03-26 21:32                 ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-25 23:49 UTC (permalink / raw)
  To: Roman Gushchin, Minchan Kim
  Cc: David Hildenbrand, Michal Hocko, linux-mm, linux-kernel,
	Shakeel Butt, Oscar Salvador, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Joonsoo Kim

On 3/25/21 4:19 PM, Roman Gushchin wrote:
> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>> From: Roman Gushchin <guro@fb.com>
>>>>>>>
>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>
>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>
>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>> of cma_release_nowait().
>>>>>>>
>>>>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>> ---
>>>>>>
>>>>>>
>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>>>>> amount of time in there?
>>>>>
>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>> blockable operation or any heavy lifting done under the lock.
>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>> the lock and there was a simple bitmap protection back then. I suspect
>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>
>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>> at all and we do not need all this heavy lifting.
>>>>>
>>>>
>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>
>>>> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
>>>>
>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>> Joonsoo was opposed.
>>>>
>>>> Adding Joonsoo on Cc:
>>>>
>>>
>>> There has to be a good reason not to. And if there is a good reason,
>>> lockless clearing might be one feasible alternative.
>>
>> I also don't think the nowait variant is a good idea. If the scanning
>> of the bitmap is *really* significant, it might be a signal that we
>> need a different technique or data structure than a bitmap, rather
>> than a new API variant.
> 
> I'd also prefer to just replace the mutex with a spinlock rather than fiddling
> with a delayed release.
> 

I hope Joonsoo or someone else brings up specific concerns.  I do not
know enough about all CMA use cases.  Certainly, in this specific use
case converting to a spinlock would not be an issue.  Do note that we
would want to convert to an irq safe spinlock and disable irqs if that
makes any difference in the discussion.
-- 
Mike Kravetz


* Re: [PATCH 0/8] make hugetlb put_page safe for all calling contexts
  2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (7 preceding siblings ...)
  2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
@ 2021-03-26  1:42 ` Miaohe Lin
  2021-03-26 20:00   ` Mike Kravetz
  8 siblings, 1 reply; 53+ messages in thread
From: Miaohe Lin @ 2021-03-26  1:42 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

Hi:
On 2021/3/25 8:28, Mike Kravetz wrote:
> This effort is the result a recent bug report [1].  In subsequent
> discussions [2], it was deemed necessary to properly fix the hugetlb

Many thanks for the effort. I have read the discussions, and they are pretty long.
Maybe it would be helpful if you gave a brief summary here?

> put_page path (free_huge_page).  This RFC provides a possible way to

trivial: this is not an RFC anymore.

> address the issue.  Comments are welcome/encouraged as several attempts
> at this have been made in the past.
> 
> This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
> level, the series provides:
> - Patches 1 & 2 from Roman Gushchin provide cma_release_nowait()

trivial: missing description of Patch 3?

> - Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
>   the goal is to eliminate single lock hold times of a long duration.
>   Overall lock hold time is not addressed.
> - Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
>   the code which defers calls to a workqueue if !in_task.
> - Patch 8 adds some lockdep_assert_held() calls
> 
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> [2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com
> 
> RFC -> v1
> - Add Roman's cma_release_nowait() patches.  This eliminated the need
>   to do a workqueue handoff in hugetlb code.
> - Use Michal's suggestion to batch pages for freeing.  This eliminated
>   the need to recalculate loop control variables when dropping the lock.
> - Added lockdep_assert_held() calls
> - Rebased to v5.12-rc3-mmotm-2021-03-17-22-24
> 
> Mike Kravetz (6):
>   hugetlb: add per-hstate mutex to synchronize user adjustments
>   hugetlb: create remove_hugetlb_page() to separate functionality
>   hugetlb: call update_and_free_page without hugetlb_lock
>   hugetlb: change free_pool_huge_page to remove_pool_huge_page
>   hugetlb: make free_huge_page irq safe
>   hugetlb: add lockdep_assert_held() calls for hugetlb_lock
> 
> Roman Gushchin (2):
>   mm: cma: introduce cma_release_nowait()
>   mm: hugetlb: don't drop hugetlb_lock around cma_release() call
> 
>  include/linux/cma.h     |   2 +
>  include/linux/hugetlb.h |   1 +
>  mm/cma.c                |  93 +++++++++++
>  mm/cma.h                |   5 +
>  mm/hugetlb.c            | 354 +++++++++++++++++++++-------------------
>  mm/hugetlb_cgroup.c     |   8 +-
>  6 files changed, 294 insertions(+), 169 deletions(-)
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
  2021-03-25 10:47   ` Michal Hocko
  2021-03-25 12:29   ` Oscar Salvador
@ 2021-03-26  1:52   ` Miaohe Lin
  2 siblings, 0 replies; 53+ messages in thread
From: Miaohe Lin @ 2021-03-26  1:52 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 2021/3/25 8:28, Mike Kravetz wrote:
> The helper routine hstate_next_node_to_alloc accesses and modifies the
> hstate variable next_nid_to_alloc.  The helper is used by the routines
> alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
> called with hugetlb_lock held.  However, alloc_pool_huge_page can not
> be called with the hugetlb lock held as it will call the page allocator.
> Two instances of alloc_pool_huge_page could be run in parallel or
> alloc_pool_huge_page could run in parallel with adjust_pool_surplus
> which may result in the variable next_nid_to_alloc becoming invalid
> for the caller and pages being allocated on the wrong node.
> > Both alloc_pool_huge_page and adjust_pool_surplus are only called from
> the routine set_max_huge_pages after boot.  set_max_huge_pages is only
> called as the result of a user writing to the proc/sysfs nr_hugepages,
> or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
> 
> It makes little sense to allow multiple adjustment to the number of
> hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
> allow one hugetlb page adjustment at a time.  This will synchronize
> modifications to the next_nid_to_alloc variable.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h | 1 +
>  mm/hugetlb.c            | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a7f7d5f328dc..8817ec987d68 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> +	struct mutex mutex;

I also agree with Michal and Oscar here about renaming the mutex to something
closer to its function.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f9ba63fc1747..404b0b1c5258 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2616,6 +2616,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	else
>  		return -ENOMEM;
>  
> +	/* mutex prevents concurrent adjustments for the same hstate */
> +	mutex_lock(&h->mutex);
>  	spin_lock(&hugetlb_lock);
>  
>  	/*
> @@ -2648,6 +2650,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>  		if (count > persistent_huge_pages(h)) {
>  			spin_unlock(&hugetlb_lock);
> +			mutex_unlock(&h->mutex);
>  			NODEMASK_FREE(node_alloc_noretry);
>  			return -EINVAL;
>  		}
> @@ -2722,6 +2725,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  out:
>  	h->max_huge_pages = persistent_huge_pages(h);
>  	spin_unlock(&hugetlb_lock);
> +	mutex_unlock(&h->mutex);
>  
>  	NODEMASK_FREE(node_alloc_noretry);
>  
> @@ -3209,6 +3213,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>  	BUG_ON(order == 0);
>  	h = &hstates[hugetlb_max_hstate++];
> +	mutex_init(&h->mutex);
>  	h->order = order;
>  	h->mask = ~(huge_page_size(h) - 1);
>  	for (i = 0; i < MAX_NUMNODES; ++i)
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
  2021-03-25 10:49   ` Michal Hocko
@ 2021-03-26  2:10   ` Miaohe Lin
  2021-03-26 19:57     ` Mike Kravetz
  2021-03-27  6:36   ` [External] " Muchun Song
  2 siblings, 1 reply; 53+ messages in thread
From: Miaohe Lin @ 2021-03-26  2:10 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 2021/3/25 8:28, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 404b0b1c5258..3938ec086b5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>  						unsigned int order) { }
>  #endif
>  
> +/*
> + * Remove hugetlb page from lists, and update dtor so that page appears
> + * as just a compound page.  A reference is held on the page.
> + * NOTE: hugetlb specific page flags stored in page->private are not
> + *	 automatically cleared.  These flags may be used in routines
> + *	 which operate on the resulting compound page.

It seems HPageFreed and HPageTemporary are cleared. Which hugetlb specific page flags
are preserved here, and why? Could you please give a simple example in the comment
to help clarify this NOTE?

The code looks good to me. Many thanks!
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> + *
> + * Must be called with hugetlb lock held.
> + */
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> +							bool adjust_surplus)
> +{
> +	int nid = page_to_nid(page);
> +
> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +		return;
> +
> +	list_del(&page->lru);
> +
> +	if (HPageFreed(page)) {
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		ClearHPageFreed(page);
> +	}
> +	if (adjust_surplus) {
> +		h->surplus_huge_pages--;
> +		h->surplus_huge_pages_node[nid]--;
> +	}
> +
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> +
> +	ClearHPageTemporary(page);
> +	set_page_refcounted(page);
> +	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +
> +	h->nr_huge_pages--;
> +	h->nr_huge_pages_node[nid]--;
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> @@ -1335,8 +1375,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> -	h->nr_huge_pages--;
> -	h->nr_huge_pages_node[page_to_nid(page)]--;
>  	for (i = 0; i < pages_per_huge_page(h);
>  	     i++, subpage = mem_map_next(subpage, page, i)) {
>  		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1344,10 +1382,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  				1 << PG_active | 1 << PG_private |
>  				1 << PG_writeback);
>  	}
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> -	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> -	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> @@ -1415,15 +1449,12 @@ static void __free_huge_page(struct page *page)
>  		h->resv_huge_pages++;
>  
>  	if (HPageTemporary(page)) {
> -		list_del(&page->lru);
> -		ClearHPageTemporary(page);
> +		remove_hugetlb_page(h, page, false);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
> -		list_del(&page->lru);
> +		remove_hugetlb_page(h, page, true);
>  		update_and_free_page(h, page);
> -		h->surplus_huge_pages--;
> -		h->surplus_huge_pages_node[nid]--;
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> @@ -1708,13 +1739,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  			struct page *page =
>  				list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[node]--;
> -			if (acct_surplus) {
> -				h->surplus_huge_pages--;
> -				h->surplus_huge_pages_node[node]--;
> -			}
> +			remove_hugetlb_page(h, page, acct_surplus);
>  			update_and_free_page(h, page);
>  			ret = 1;
>  			break;
> @@ -1752,7 +1777,6 @@ int dissolve_free_huge_page(struct page *page)
>  	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
> -		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
>  
> @@ -1783,9 +1807,7 @@ int dissolve_free_huge_page(struct page *page)
>  			SetPageHWPoison(page);
>  			ClearPageHWPoison(head);
>  		}
> -		list_del(&head->lru);
> -		h->free_huge_pages--;
> -		h->free_huge_pages_node[nid]--;
> +		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
>  		update_and_free_page(h, head);
>  		rc = 0;
> @@ -2553,10 +2575,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  				return;
>  			if (PageHighMem(page))
>  				continue;
> -			list_del(&page->lru);
> +			remove_hugetlb_page(h, page, false);
>  			update_and_free_page(h, page);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[page_to_nid(page)]--;
>  		}
>  	}
>  }
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  2021-03-25 11:22   ` Michal Hocko
@ 2021-03-26  2:12   ` Miaohe Lin
  2021-03-27  8:14   ` [External] " Muchun Song
  2 siblings, 0 replies; 53+ messages in thread
From: Miaohe Lin @ 2021-03-26  2:12 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 2021/3/25 8:28, Mike Kravetz wrote:
> After making hugetlb lock irq safe and separating some functionality
> done under the lock, add some lockdep_assert_held to help verify
> locking.
> 

Looks good to me. Thanks.
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e4c441b878f2..de5b3cf4a155 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>  	int nid = page_to_nid(page);
> +
> +	lockdep_assert_held(&hugetlb_lock);
>  	list_move(&page->lru, &h->hugepage_freelists[nid]);
>  	h->free_huge_pages++;
>  	h->free_huge_pages_node[nid]++;
> @@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>  	struct page *page;
>  	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
>  		if (pin && !is_pinnable_page(page))
>  			continue;
> @@ -1345,6 +1348,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
>  {
>  	int nid = page_to_nid(page);
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> @@ -1690,6 +1694,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
>  	int nr_nodes, node;
>  	struct page *page = NULL;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>  		/*
>  		 * If we're returning unused surplus pages, only examine
> @@ -1939,6 +1944,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	long needed, allocated;
>  	bool alloc_ok = true;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
>  	if (needed <= 0) {
>  		h->resv_huge_pages += delta;
> @@ -2032,6 +2038,7 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	struct page *page, *t_page;
>  	struct list_head page_list;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	/* Uncommit the reservation */
>  	h->resv_huge_pages -= unused_resv_pages;
>  
> @@ -2527,6 +2534,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  	struct list_head page_list;
>  	struct page *page, *next;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> @@ -2573,6 +2581,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  {
>  	int nr_nodes, node;
>  
> +	lockdep_assert_held(&hugetlb_lock);
>  	VM_BUG_ON(delta != -1 && delta != 1);
>  
>  	if (delta < 0) {
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-26  2:10   ` Miaohe Lin
@ 2021-03-26 19:57     ` Mike Kravetz
  2021-03-27  1:40       ` Miaohe Lin
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-26 19:57 UTC (permalink / raw)
  To: Miaohe Lin, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 3/25/21 7:10 PM, Miaohe Lin wrote:
> On 2021/3/25 8:28, Mike Kravetz wrote:
>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>> page from hugetlbfs processing.  It will remove the page from the active
>> or free list, update global counters and set the compound page
>> destructor to NULL so that PageHuge() will return false for the 'page'.
>> After this call, the 'page' can be treated as a normal compound page or
>> a collection of base size pages.
>>
>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>
>> Creating this routine and separating functionality is in preparation for
>> restructuring code to reduce lock hold times.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
>>  1 file changed, 45 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 404b0b1c5258..3938ec086b5c 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>>  						unsigned int order) { }
>>  #endif
>>  
>> +/*
>> + * Remove hugetlb page from lists, and update dtor so that page appears
>> + * as just a compound page.  A reference is held on the page.
>> + * NOTE: hugetlb specific page flags stored in page->private are not
>> + *	 automatically cleared.  These flags may be used in routines
>> + *	 which operate on the resulting compound page.
> 
> It seems HPageFreed and HPageTemporary are cleared. Which hugetlb specific page flags
> are preserved here, and why? Could you please give a simple example in the comment
> to help clarify this NOTE?
> 

I will remove that NOTE: in the comment to avoid any confusion.

The NOTE was added in the RFC, which contained a separate patch to add a flag
that tracked huge pages allocated from CMA.  That flag needed to remain set
for subsequent freeing of such pages.  This is no longer needed.

> The code looks good to me. Many thanks!
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 0/8] make hugetlb put_page safe for all calling contexts
  2021-03-26  1:42 ` [PATCH 0/8] make hugetlb put_page safe for all calling contexts Miaohe Lin
@ 2021-03-26 20:00   ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-26 20:00 UTC (permalink / raw)
  To: Miaohe Lin, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 3/25/21 6:42 PM, Miaohe Lin wrote:
> Hi:
> On 2021/3/25 8:28, Mike Kravetz wrote:
>> This effort is the result a recent bug report [1].  In subsequent
>> discussions [2], it was deemed necessary to properly fix the hugetlb
> 
> Many thanks for the effort. I have read the discussions and they are pretty long.
> Maybe it would be helpful if you gave a brief summary here?
> 
>> put_page path (free_huge_page).  This RFC provides a possible way to
> 
> trivial: This is no longer an RFC.
> 
>> address the issue.  Comments are welcome/encouraged as several attempts
>> at this have been made in the past.
>>> This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
>> level, the series provides:
>> - Patches 1 & 2 from Roman Gushchin provide cma_release_nowait()
> 
> trivial: missing description of Patch 3?
> 

Thanks,
I will clean this up with next version.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-25 23:49               ` Mike Kravetz
@ 2021-03-26 21:32                 ` Mike Kravetz
  2021-03-29  7:46                   ` Michal Hocko
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Kravetz @ 2021-03-26 21:32 UTC (permalink / raw)
  To: Roman Gushchin, Minchan Kim
  Cc: David Hildenbrand, Michal Hocko, linux-mm, linux-kernel,
	Shakeel Butt, Oscar Salvador, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton, Joonsoo Kim

On 3/25/21 4:49 PM, Mike Kravetz wrote:
> On 3/25/21 4:19 PM, Roman Gushchin wrote:
>> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>>> From: Roman Gushchin <guro@fb.com>
>>>>>>>>
>>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>>
>>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>>
>>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>>> of cma_release_nowait().
>>>>>>>>
>>>>>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>>>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>>>>>> amount of time in there?
>>>>>>
>>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>>> blockable operation or any heavy lifting done under the lock.
>>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>>> the lock and there was a simple bitmat protection back then. I suspect
>>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>>
>>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>>> at all and we do not need all this heavy lifting.
>>>>>>
>>>>>
>>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>>
>>>>> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
>>>>>
>>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>>> Joonsoo was opposed.
>>>>>
>>>>> Adding Joonsoo on Cc:
>>>>>
>>>>
>>>> There has to be a good reason not to. And if there is a good reason,
>>>> lockless clearing might be one feasible alternative.
>>>
>>> I also don't think nowait variant is good idea. If the scanning of
>>> bitmap is *really* significant, it might be signal that we need to
>>> introduce different technique or data structure not bitmap rather
>>> than a new API variant.
>>
>> I'd also prefer to just replace the mutex with a spinlock rather than fiddling
>> with a delayed release.
>>
> 
> I hope Joonsoo or someone else brings up specific concerns.  I do not
> know enough about all CMA use cases.  Certainly, in this specific use
> case converting to a spinlock would not be an issue.  Do note that we
> would want to convert to an irq safe spinlock and disable irqs if that
> makes any difference in the discussion.
> 

Suggestions on how to move forward would be appreciated.  I can think of
the following options.

- Use the cma_release_nowait() routine as defined in this patch.

- Just change the mutex to an irq safe spinlock.  AFAICT, the potential
  downsides could be:
  - Interrupts disabled during long bitmap scans
  - Wasted cpu cycles (spinning) if there is much contention on lock
  Both of these would be more of an issue on small/embedded systems. I
  took a quick look at the callers of cma_alloc/cma_release and nothing
  stood out that could lead to high degrees of contention.  However, I
  could have missed something.

- Another idea I had was to allow the user to specify the locking type
  when creating a cma area.  In this way, cma areas which may have
  release calls from atomic context would be set up with an irq safe
  spinlock.  Others, would use the mutex.  I admit this is a hackish
  way to address the issue, but perhaps not much worse than the separate
  cma_release_nowait interface?

- Change the CMA bitmap to some other data structure and algorithm.
  This would obviously take more work.

Thanks,
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-26 19:57     ` Mike Kravetz
@ 2021-03-27  1:40       ` Miaohe Lin
  0 siblings, 0 replies; 53+ messages in thread
From: Miaohe Lin @ 2021-03-27  1:40 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Andrew Morton

On 2021/3/27 3:57, Mike Kravetz wrote:
> On 3/25/21 7:10 PM, Miaohe Lin wrote:
>> On 2021/3/25 8:28, Mike Kravetz wrote:
>>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>>> page from hugetlbfs processing.  It will remove the page from the active
>>> or free list, update global counters and set the compound page
>>> destructor to NULL so that PageHuge() will return false for the 'page'.
>>> After this call, the 'page' can be treated as a normal compound page or
>>> a collection of base size pages.
>>>
>>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>>
>>> Creating this routine and separating functionality is in preparation for
>>> restructuring code to reduce lock hold times.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>  mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
>>>  1 file changed, 45 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 404b0b1c5258..3938ec086b5c 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>>>  						unsigned int order) { }
>>>  #endif
>>>  
>>> +/*
>>> + * Remove hugetlb page from lists, and update dtor so that page appears
>>> + * as just a compound page.  A reference is held on the page.
>>> + * NOTE: hugetlb specific page flags stored in page->private are not
>>> + *	 automatically cleared.  These flags may be used in routines
>>> + *	 which operate on the resulting compound page.
>>
>> It seems HPageFreed and HPageTemporary are cleared. Which hugetlb specific page flags
>> are preserved here, and why? Could you please give a simple example in the comment
>> to help clarify this NOTE?
>>
> 
> I will remove that NOTE: in the comment to avoid any confusion.
> 
> The NOTE was added in the RFC, which contained a separate patch to add a flag
> that tracked huge pages allocated from CMA.  That flag needed to remain set
> for subsequent freeing of such pages.  This is no longer needed.
> 

Many thanks for the explanation. I was confused by that NOTE. :)

>> The code looks good to me. Many thanks!
>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Thanks,
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [External] [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
  2021-03-25 10:49   ` Michal Hocko
  2021-03-26  2:10   ` Miaohe Lin
@ 2021-03-27  6:36   ` Muchun Song
  2 siblings, 0 replies; 53+ messages in thread
From: Muchun Song @ 2021-03-27  6:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
>
> remove_hugetlb_page is to be called with the hugetlb_lock held.
>
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks for your effort on this.

> ---
>  mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 25 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 404b0b1c5258..3938ec086b5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>                                                 unsigned int order) { }
>  #endif
>
> +/*
> + * Remove hugetlb page from lists, and update dtor so that page appears
> + * as just a compound page.  A reference is held on the page.
> + * NOTE: hugetlb specific page flags stored in page->private are not
> + *      automatically cleared.  These flags may be used in routines
> + *      which operate on the resulting compound page.
> + *
> + * Must be called with hugetlb lock held.
> + */
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> +                                                       bool adjust_surplus)
> +{
> +       int nid = page_to_nid(page);
> +
> +       if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +               return;
> +
> +       list_del(&page->lru);
> +
> +       if (HPageFreed(page)) {
> +               h->free_huge_pages--;
> +               h->free_huge_pages_node[nid]--;
> +               ClearHPageFreed(page);
> +       }
> +       if (adjust_surplus) {
> +               h->surplus_huge_pages--;
> +               h->surplus_huge_pages_node[nid]--;
> +       }
> +
> +       VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> +       VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> +
> +       ClearHPageTemporary(page);
> +       set_page_refcounted(page);
> +       set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +
> +       h->nr_huge_pages--;
> +       h->nr_huge_pages_node[nid]--;
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>         int i;
> @@ -1335,8 +1375,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>         if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>                 return;
>
> -       h->nr_huge_pages--;
> -       h->nr_huge_pages_node[page_to_nid(page)]--;
>         for (i = 0; i < pages_per_huge_page(h);
>              i++, subpage = mem_map_next(subpage, page, i)) {
>                 subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1344,10 +1382,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>                                 1 << PG_active | 1 << PG_private |
>                                 1 << PG_writeback);
>         }
> -       VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> -       VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> -       set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> -       set_page_refcounted(page);
>         if (hstate_is_gigantic(h)) {
>                 destroy_compound_gigantic_page(page, huge_page_order(h));
>                 free_gigantic_page(page, huge_page_order(h));
> @@ -1415,15 +1449,12 @@ static void __free_huge_page(struct page *page)
>                 h->resv_huge_pages++;
>
>         if (HPageTemporary(page)) {
> -               list_del(&page->lru);
> -               ClearHPageTemporary(page);
> +               remove_hugetlb_page(h, page, false);
>                 update_and_free_page(h, page);
>         } else if (h->surplus_huge_pages_node[nid]) {
>                 /* remove the page from active list */
> -               list_del(&page->lru);
> +               remove_hugetlb_page(h, page, true);
>                 update_and_free_page(h, page);
> -               h->surplus_huge_pages--;
> -               h->surplus_huge_pages_node[nid]--;
>         } else {
>                 arch_clear_hugepage_flags(page);
>                 enqueue_huge_page(h, page);
> @@ -1708,13 +1739,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>                         struct page *page =
>                                 list_entry(h->hugepage_freelists[node].next,
>                                           struct page, lru);
> -                       list_del(&page->lru);
> -                       h->free_huge_pages--;
> -                       h->free_huge_pages_node[node]--;
> -                       if (acct_surplus) {
> -                               h->surplus_huge_pages--;
> -                               h->surplus_huge_pages_node[node]--;
> -                       }
> +                       remove_hugetlb_page(h, page, acct_surplus);
>                         update_and_free_page(h, page);
>                         ret = 1;
>                         break;
> @@ -1752,7 +1777,6 @@ int dissolve_free_huge_page(struct page *page)
>         if (!page_count(page)) {
>                 struct page *head = compound_head(page);
>                 struct hstate *h = page_hstate(head);
> -               int nid = page_to_nid(head);
>                 if (h->free_huge_pages - h->resv_huge_pages == 0)
>                         goto out;
>
> @@ -1783,9 +1807,7 @@ int dissolve_free_huge_page(struct page *page)
>                         SetPageHWPoison(page);
>                         ClearPageHWPoison(head);
>                 }
> -               list_del(&head->lru);
> -               h->free_huge_pages--;
> -               h->free_huge_pages_node[nid]--;
> +               remove_hugetlb_page(h, page, false);
>                 h->max_huge_pages--;
>                 update_and_free_page(h, head);
>                 rc = 0;
> @@ -2553,10 +2575,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>                                 return;
>                         if (PageHighMem(page))
>                                 continue;
> -                       list_del(&page->lru);
> +                       remove_hugetlb_page(h, page, false);
>                         update_and_free_page(h, page);
> -                       h->free_huge_pages--;
> -                       h->free_huge_pages_node[page_to_nid(page)]--;
>                 }
>         }
>  }
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [External] [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-25  0:28 ` [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
  2021-03-25 10:55   ` Michal Hocko
@ 2021-03-27  6:54   ` Muchun Song
  2021-03-28 21:40     ` Mike Kravetz
  1 sibling, 1 reply; 53+ messages in thread
From: Muchun Song @ 2021-03-27  6:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
>
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
>
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Some nits below.

> ---
>  mm/hugetlb.c | 34 +++++++++++++++++++++++++++++-----
>  1 file changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3938ec086b5c..fae7f034d1eb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1450,16 +1450,18 @@ static void __free_huge_page(struct page *page)
>
>         if (HPageTemporary(page)) {
>                 remove_hugetlb_page(h, page, false);
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, page);
>         } else if (h->surplus_huge_pages_node[nid]) {
>                 /* remove the page from active list */
>                 remove_hugetlb_page(h, page, true);
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, page);
>         } else {
>                 arch_clear_hugepage_flags(page);
>                 enqueue_huge_page(h, page);
> +               spin_unlock(&hugetlb_lock);
>         }
> -       spin_unlock(&hugetlb_lock);
>  }
>
>  /*
> @@ -1740,7 +1742,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>                                 list_entry(h->hugepage_freelists[node].next,
>                                           struct page, lru);
>                         remove_hugetlb_page(h, page, acct_surplus);
> +                       /*
> +                        * unlock/lock around update_and_free_page is temporary
> +                        * and will be removed with subsequent patch.
> +                        */
> +                       spin_unlock(&hugetlb_lock);
>                         update_and_free_page(h, page);
> +                       spin_lock(&hugetlb_lock);
>                         ret = 1;
>                         break;
>                 }
> @@ -1809,8 +1817,9 @@ int dissolve_free_huge_page(struct page *page)
>                 }
>                 remove_hugetlb_page(h, page, false);
>                 h->max_huge_pages--;
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, head);
> -               rc = 0;
> +               return 0;
>         }
>  out:
>         spin_unlock(&hugetlb_lock);
> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>                                                 nodemask_t *nodes_allowed)
>  {
>         int i;
> +       struct list_head page_list;

I prefer to use LIST_HEAD(page_list) directly.

> +       struct page *page, *next;
>
>         if (hstate_is_gigantic(h))
>                 return;
>
> +       /*
> +        * Collect pages to be freed on a list, and free after dropping lock
> +        */
> +       INIT_LIST_HEAD(&page_list);
>         for_each_node_mask(i, *nodes_allowed) {
> -               struct page *page, *next;
>                 struct list_head *freel = &h->hugepage_freelists[i];
>                 list_for_each_entry_safe(page, next, freel, lru) {
>                         if (count >= h->nr_huge_pages)
> -                               return;
> +                               goto out;
>                         if (PageHighMem(page))
>                                 continue;
>                         remove_hugetlb_page(h, page, false);
> -                       update_and_free_page(h, page);
> +                       INIT_LIST_HEAD(&page->lru);

As Michal pointed out, this is superfluous.

> +                       list_add(&page->lru, &page_list);
>                 }
>         }
> +
> +out:
> +       spin_unlock(&hugetlb_lock);
> +       list_for_each_entry_safe(page, next, &page_list, lru) {
> +               list_del(&page->lru);

It looks like list_del() is also superfluous. Should we remove it?

Thanks.

> +               update_and_free_page(h, page);
> +               cond_resched();
> +       }
> +       spin_lock(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> --
> 2.30.2
>



* Re: [External] [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-25  0:28 ` [PATCH 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
  2021-03-25 11:21   ` Michal Hocko
@ 2021-03-27  7:06   ` Muchun Song
  2021-03-29  7:49     ` Michal Hocko
  1 sibling, 1 reply; 53+ messages in thread
From: Muchun Song @ 2021-03-27  7:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context.  That commit hands off free_huge_page
> processing to a workqueue if !in_task.  However, as seen in [1] this
> does not cover all cases.  Instead, make the locks taken in the
> free_huge_page irq safe.
>
> This patch does the following:
> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>   changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
>
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

The changes are straightforward.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Since this patchset fixes a real-world issue, should we add a Fixes
tag?

> ---
>  mm/hugetlb.c        | 169 +++++++++++++++++---------------------------
>  mm/hugetlb_cgroup.c |   8 +--
>  2 files changed, 67 insertions(+), 110 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a9785e73379f..e4c441b878f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
>         return true;
>  }
>
> -static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
> +                                               unsigned long irq_flags)
>  {
> -       spin_unlock(&spool->lock);
> +       spin_unlock_irqrestore(&spool->lock, irq_flags);
>
>         /* If no pages are used, and no other handles to the subpool
>          * remain, give up any reservations based on minimum size and
> @@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
>
>  void hugepage_put_subpool(struct hugepage_subpool *spool)
>  {
> -       spin_lock(&spool->lock);
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&spool->lock, flags);
>         BUG_ON(!spool->count);
>         spool->count--;
> -       unlock_or_release_subpool(spool);
> +       unlock_or_release_subpool(spool, flags);
>  }
>
>  /*
> @@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>         if (!spool)
>                 return ret;
>
> -       spin_lock(&spool->lock);
> +       spin_lock_irq(&spool->lock);
>
>         if (spool->max_hpages != -1) {          /* maximum size accounting */
>                 if ((spool->used_hpages + delta) <= spool->max_hpages)
> @@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>         }
>
>  unlock_ret:
> -       spin_unlock(&spool->lock);
> +       spin_unlock_irq(&spool->lock);
>         return ret;
>  }
>
> @@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>                                        long delta)
>  {
>         long ret = delta;
> +       unsigned long flags;
>
>         if (!spool)
>                 return delta;
>
> -       spin_lock(&spool->lock);
> +       spin_lock_irqsave(&spool->lock, flags);
>
>         if (spool->max_hpages != -1)            /* maximum size accounting */
>                 spool->used_hpages -= delta;
> @@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>          * If hugetlbfs_put_super couldn't free spool due to an outstanding
>          * quota reference, free it now.
>          */
> -       unlock_or_release_subpool(spool);
> +       unlock_or_release_subpool(spool, flags);
>
>         return ret;
>  }
> @@ -1401,7 +1405,7 @@ struct hstate *size_to_hstate(unsigned long size)
>         return NULL;
>  }
>
> -static void __free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>         /*
>          * Can't pass hstate in here because it is called from the
> @@ -1411,6 +1415,7 @@ static void __free_huge_page(struct page *page)
>         int nid = page_to_nid(page);
>         struct hugepage_subpool *spool = hugetlb_page_subpool(page);
>         bool restore_reserve;
> +       unsigned long flags;
>
>         VM_BUG_ON_PAGE(page_count(page), page);
>         VM_BUG_ON_PAGE(page_mapcount(page), page);
> @@ -1439,7 +1444,7 @@ static void __free_huge_page(struct page *page)
>                         restore_reserve = true;
>         }
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irqsave(&hugetlb_lock, flags);
>         ClearHPageMigratable(page);
>         hugetlb_cgroup_uncharge_page(hstate_index(h),
>                                      pages_per_huge_page(h), page);
> @@ -1450,66 +1455,18 @@ static void __free_huge_page(struct page *page)
>
>         if (HPageTemporary(page)) {
>                 remove_hugetlb_page(h, page, false);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>                 update_and_free_page(h, page);
>         } else if (h->surplus_huge_pages_node[nid]) {
>                 /* remove the page from active list */
>                 remove_hugetlb_page(h, page, true);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>                 update_and_free_page(h, page);
>         } else {
>                 arch_clear_hugepage_flags(page);
>                 enqueue_huge_page(h, page);
> -               spin_unlock(&hugetlb_lock);
> -       }
> -}
> -
> -/*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> - *
> - * free_hpage_workfn() locklessly retrieves the linked list of pages to
> - * be freed and frees them one-by-one. As the page->mapping pointer is
> - * going to be cleared in __free_huge_page() anyway, it is reused as the
> - * llist_node structure of a lockless linked list of huge pages to be freed.
> - */
> -static LLIST_HEAD(hpage_freelist);
> -
> -static void free_hpage_workfn(struct work_struct *work)
> -{
> -       struct llist_node *node;
> -       struct page *page;
> -
> -       node = llist_del_all(&hpage_freelist);
> -
> -       while (node) {
> -               page = container_of((struct address_space **)node,
> -                                    struct page, mapping);
> -               node = node->next;
> -               __free_huge_page(page);
> -       }
> -}
> -static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> -
> -void free_huge_page(struct page *page)
> -{
> -       /*
> -        * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> -        */
> -       if (!in_task()) {
> -               /*
> -                * Only call schedule_work() if hpage_freelist is previously
> -                * empty. Otherwise, schedule_work() had been called but the
> -                * workfn hasn't retrieved the list yet.
> -                */
> -               if (llist_add((struct llist_node *)&page->mapping,
> -                             &hpage_freelist))
> -                       schedule_work(&free_hpage_work);
> -               return;
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>         }
> -
> -       __free_huge_page(page);
>  }
>
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1519,11 +1476,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>         hugetlb_set_page_subpool(page, NULL);
>         set_hugetlb_cgroup(page, NULL);
>         set_hugetlb_cgroup_rsvd(page, NULL);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h->nr_huge_pages++;
>         h->nr_huge_pages_node[nid]++;
>         ClearHPageFreed(page);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>  }
>
>  static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> @@ -1769,7 +1726,7 @@ int dissolve_free_huge_page(struct page *page)
>         if (!PageHuge(page))
>                 return 0;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (!PageHuge(page)) {
>                 rc = 0;
>                 goto out;
> @@ -1786,7 +1743,7 @@ int dissolve_free_huge_page(struct page *page)
>                  * when it is dissolved.
>                  */
>                 if (unlikely(!HPageFreed(head))) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         cond_resched();
>
>                         /*
> @@ -1810,12 +1767,12 @@ int dissolve_free_huge_page(struct page *page)
>                 }
>                 remove_hugetlb_page(h, page, false);
>                 h->max_huge_pages--;
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 update_and_free_page(h, head);
>                 return 0;
>         }
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return rc;
>  }
>
> @@ -1857,16 +1814,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>         if (hstate_is_gigantic(h))
>                 return NULL;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
>                 goto out_unlock;
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>         if (!page)
>                 return NULL;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * We could have raced with the pool size change.
>          * Double check that and simply deallocate the new page
> @@ -1876,7 +1833,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>          */
>         if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>                 SetHPageTemporary(page);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 put_page(page);
>                 return NULL;
>         } else {
> @@ -1885,7 +1842,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>         }
>
>  out_unlock:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return page;
>  }
> @@ -1935,17 +1892,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>                 nodemask_t *nmask, gfp_t gfp_mask)
>  {
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (h->free_huge_pages - h->resv_huge_pages > 0) {
>                 struct page *page;
>
>                 page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
>                 if (page) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         return page;
>                 }
>         }
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
> @@ -1993,7 +1950,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>
>         ret = -ENOMEM;
>  retry:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         for (i = 0; i < needed; i++) {
>                 page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
>                                 NUMA_NO_NODE, NULL);
> @@ -2010,7 +1967,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>          * After retaking hugetlb_lock, we need to recalculate 'needed'
>          * because either resv_huge_pages or free_huge_pages may have changed.
>          */
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         needed = (h->resv_huge_pages + delta) -
>                         (h->free_huge_pages + allocated);
>         if (needed > 0) {
> @@ -2050,12 +2007,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>                 enqueue_huge_page(h, page);
>         }
>  free:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         /* Free unnecessary surplus pages to the buddy allocator */
>         list_for_each_entry_safe(page, tmp, &surplus_list, lru)
>                 put_page(page);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         return ret;
>  }
> @@ -2107,13 +2064,13 @@ static void return_unused_surplus_pages(struct hstate *h,
>         }
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         list_for_each_entry_safe(page, t_page, &page_list, lru) {
>                 list_del(&page->lru);
>                 update_and_free_page(h, page);
>                 cond_resched();
>         }
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>  }
>
>
> @@ -2348,7 +2305,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>         if (ret)
>                 goto out_uncharge_cgroup_reservation;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * glb_chg is passed to indicate whether or not a page must be taken
>          * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -2356,7 +2313,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>          */
>         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>         if (!page) {
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
>                 if (!page)
>                         goto out_uncharge_cgroup;
> @@ -2364,7 +2321,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                         SetHPageRestoreReserve(page);
>                         h->resv_huge_pages--;
>                 }
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 list_add(&page->lru, &h->hugepage_activelist);
>                 /* Fall through */
>         }
> @@ -2377,7 +2334,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                                                   h_cg, page);
>         }
>
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         hugetlb_set_page_subpool(page, spool);
>
> @@ -2591,13 +2548,13 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>         }
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         list_for_each_entry_safe(page, next, &page_list, lru) {
>                 list_del(&page->lru);
>                 update_and_free_page(h, page);
>                 cond_resched();
>         }
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> @@ -2659,7 +2616,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>
>         /* mutex prevents concurrent adjustments for the same hstate */
>         mutex_lock(&h->mutex);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         /*
>          * Check for a node specific request.
> @@ -2690,7 +2647,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>          */
>         if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>                 if (count > persistent_huge_pages(h)) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         mutex_unlock(&h->mutex);
>                         NODEMASK_FREE(node_alloc_noretry);
>                         return -EINVAL;
> @@ -2720,14 +2677,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                  * page, free_huge_page will handle it by freeing the page
>                  * and reducing the surplus.
>                  */
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>
>                 /* yield cpu to avoid soft lockup */
>                 cond_resched();
>
>                 ret = alloc_pool_huge_page(h, nodes_allowed,
>                                                 node_alloc_noretry);
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 if (!ret)
>                         goto out;
>
> @@ -2768,13 +2725,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                 list_add(&page->lru, &page_list);
>         }
>         /* free the pages after dropping lock */
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         list_for_each_entry_safe(page, t_page, &page_list, lru) {
>                 list_del(&page->lru);
>                 update_and_free_page(h, page);
>                 cond_resched();
>         }
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         while (count < persistent_huge_pages(h)) {
>                 if (!adjust_pool_surplus(h, nodes_allowed, 1))
> @@ -2782,7 +2739,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         }
>  out:
>         h->max_huge_pages = persistent_huge_pages(h);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         mutex_unlock(&h->mutex);
>
>         NODEMASK_FREE(node_alloc_noretry);
> @@ -2938,9 +2895,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>         if (err)
>                 return err;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h->nr_overcommit_huge_pages = input;
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return count;
>  }
> @@ -3527,9 +3484,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>                 goto out;
>
>         if (write) {
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 h->nr_overcommit_huge_pages = tmp;
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>         }
>  out:
>         return ret;
> @@ -3625,7 +3582,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>         if (!delta)
>                 return 0;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * When cpuset is configured, it breaks the strict hugetlb page
>          * reservation as the accounting is done on a global variable. Such
> @@ -3664,7 +3621,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>                 return_unused_surplus_pages(h, (unsigned long) -delta);
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return ret;
>  }
>
> @@ -5727,7 +5684,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  {
>         bool ret = true;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (!PageHeadHuge(page) ||
>             !HPageMigratable(page) ||
>             !get_page_unless_zero(page)) {
> @@ -5737,16 +5694,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>         ClearHPageMigratable(page);
>         list_move_tail(&page->lru, list);
>  unlock:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return ret;
>  }
>
>  void putback_active_hugepage(struct page *page)
>  {
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         SetHPageMigratable(page);
>         list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         put_page(page);
>  }
>
> @@ -5780,12 +5737,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>                  */
>                 if (new_nid == old_nid)
>                         return;
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 if (h->surplus_huge_pages_node[old_nid]) {
>                         h->surplus_huge_pages_node[old_nid]--;
>                         h->surplus_huge_pages_node[new_nid]++;
>                 }
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>         }
>  }
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 726b85f4f303..5383023d0cca 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>         do {
>                 idx = 0;
>                 for_each_hstate(h) {
> -                       spin_lock(&hugetlb_lock);
> +                       spin_lock_irq(&hugetlb_lock);
>                         list_for_each_entry(page, &h->hugepage_activelist, lru)
>                                 hugetlb_cgroup_move_parent(idx, h_cg, page);
>
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         idx++;
>                 }
>                 cond_resched();
> @@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>         if (hugetlb_cgroup_disabled())
>                 return;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h_cg = hugetlb_cgroup_from_page(oldhpage);
>         h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
>         set_hugetlb_cgroup(oldhpage, NULL);
> @@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>         set_hugetlb_cgroup(newhpage, h_cg);
>         set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
>         list_move(&newhpage->lru, &h->hugepage_activelist);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return;
>  }
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [External] [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  2021-03-25 11:22   ` Michal Hocko
  2021-03-26  2:12   ` Miaohe Lin
@ 2021-03-27  8:14   ` Muchun Song
  2 siblings, 0 replies; 53+ messages in thread
From: Muchun Song @ 2021-03-27  8:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> After making the hugetlb lock irq safe and separating some functionality
> done under the lock, add some lockdep_assert_held() calls to help verify
> locking.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

> ---
>  mm/hugetlb.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e4c441b878f2..de5b3cf4a155 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>         int nid = page_to_nid(page);
> +
> +       lockdep_assert_held(&hugetlb_lock);
>         list_move(&page->lru, &h->hugepage_freelists[nid]);
>         h->free_huge_pages++;
>         h->free_huge_pages_node[nid]++;
> @@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>         struct page *page;
>         bool pin = !!(current->flags & PF_MEMALLOC_PIN);
>
> +       lockdep_assert_held(&hugetlb_lock);
>         list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
>                 if (pin && !is_pinnable_page(page))
>                         continue;
> @@ -1345,6 +1348,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
>  {
>         int nid = page_to_nid(page);
>
> +       lockdep_assert_held(&hugetlb_lock);
>         if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>                 return;
>
> @@ -1690,6 +1694,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
>         int nr_nodes, node;
>         struct page *page = NULL;
>
> +       lockdep_assert_held(&hugetlb_lock);
>         for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>                 /*
>                  * If we're returning unused surplus pages, only examine
> @@ -1939,6 +1944,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>         long needed, allocated;
>         bool alloc_ok = true;
>
> +       lockdep_assert_held(&hugetlb_lock);
>         needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
>         if (needed <= 0) {
>                 h->resv_huge_pages += delta;
> @@ -2032,6 +2038,7 @@ static void return_unused_surplus_pages(struct hstate *h,
>         struct page *page, *t_page;
>         struct list_head page_list;
>
> +       lockdep_assert_held(&hugetlb_lock);
>         /* Uncommit the reservation */
>         h->resv_huge_pages -= unused_resv_pages;
>
> @@ -2527,6 +2534,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>         struct list_head page_list;
>         struct page *page, *next;
>
> +       lockdep_assert_held(&hugetlb_lock);
>         if (hstate_is_gigantic(h))
>                 return;
>
> @@ -2573,6 +2581,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  {
>         int nr_nodes, node;
>
> +       lockdep_assert_held(&hugetlb_lock);
>         VM_BUG_ON(delta != -1 && delta != 1);
>
>         if (delta < 0) {
> --
> 2.30.2
>



* Re: [External] [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-27  6:54   ` [External] " Muchun Song
@ 2021-03-28 21:40     ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-28 21:40 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/26/21 11:54 PM, Muchun Song wrote:
> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
> Some nits below.

Thanks Muchun,

I agree with all your suggestions below, and will include modifications
in the next version.
-- 
Mike Kravetz



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-26 21:32                 ` Mike Kravetz
@ 2021-03-29  7:46                   ` Michal Hocko
  2021-03-29 22:27                     ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-29  7:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Roman Gushchin, Minchan Kim, David Hildenbrand, linux-mm,
	linux-kernel, Shakeel Butt, Oscar Salvador, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Andrew Morton, Joonsoo Kim

On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
[...]
> - Just change the mutex to an irq safe spinlock.

Yes please.

>   AFAICT, the potential
>   downsides could be:
>   - Interrupts disabled during long bitmap scans

How large are those bitmaps in practice?

>   - Wasted cpu cycles (spinning) if there is much contention on lock
>   Both of these would be more of an issue on small/embedded systems. I
>   took a quick look at the callers of cma_alloc/cma_release and nothing
>   stood out that could lead to high degrees of contention.  However, I
>   could have missed something.

If this is really a practical concern then we can try a more complex
solution based on some data.

-- 
Michal Hocko
SUSE Labs



* Re: [External] [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-27  7:06   ` [External] " Muchun Song
@ 2021-03-29  7:49     ` Michal Hocko
  2021-03-29 22:44       ` Mike Kravetz
  0 siblings, 1 reply; 53+ messages in thread
From: Michal Hocko @ 2021-03-29  7:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Linux Memory Management List, LKML, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On Sat 27-03-21 15:06:36, Muchun Song wrote:
> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> > non-task context") was added to address the issue of free_huge_page
> > being called from irq context.  That commit hands off free_huge_page
> > processing to a workqueue if !in_task.  However, as seen in [1] this
> > does not cover all cases.  Instead, make the locks taken in the
> > free_huge_page irq safe.
> >
> > This patch does the following:
> > - Make hugetlb_lock irq safe.  This is mostly a simple process of
> >   changing spin_*lock calls to spin_*lock_irq* calls.
> > - Make subpool lock irq safe in a similar manner.
> > - Revert the !in_task check and workqueue handoff.
> >
> > [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> The changes are straightforward.
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
> Since this patchset aims to fix a real-world issue, should we add a Fixes
> tag?

Do we know since when it has been possible to use hugetlb in the networking
context? Maybe it has always been possible, but I am wondering why
lockdep started complaining only now. Maybe fuzzing just finally started
exercising this setup, which nobody does normally.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()
  2021-03-29  7:46                   ` Michal Hocko
@ 2021-03-29 22:27                     ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-29 22:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Minchan Kim, David Hildenbrand, linux-mm,
	linux-kernel, Shakeel Butt, Oscar Salvador, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Andrew Morton, Joonsoo Kim,
	Barry Song, Will Deacon

On 3/29/21 12:46 AM, Michal Hocko wrote:
> On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
> [...]
>> - Just change the mutex to an irq safe spinlock.
> 
> Yes please.
> 
>>   AFAICT, the potential
>>   downsides could be:
>>   - Interrupts disabled during long bitmap scans
> 
> How large those bitmaps are in practice?
> 
>>   - Wasted cpu cycles (spinning) if there is much contention on lock
>>   Both of these would be more of an issue on small/embedded systems. I
>>   took a quick look at the callers of cma_alloc/cma_release and nothing
>>   stood out that could lead to high degrees of contention.  However, I
>>   could have missed something.
> 
> If this is really a practical concern then we can try a more complex
> solution based on some data.
> 

Ok, I will send v2 with this approach.  Adding Barry and Will on Cc: as
they were involved in adding more cma use cases for dma on arm.

-- 
Mike Kravetz



* Re: [External] [PATCH 7/8] hugetlb: make free_huge_page irq safe
  2021-03-29  7:49     ` Michal Hocko
@ 2021-03-29 22:44       ` Mike Kravetz
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Kravetz @ 2021-03-29 22:44 UTC (permalink / raw)
  To: Michal Hocko, Muchun Song
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Andrew Morton

On 3/29/21 12:49 AM, Michal Hocko wrote:
> On Sat 27-03-21 15:06:36, Muchun Song wrote:
>> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>>> non-task context") was added to address the issue of free_huge_page
>>> being called from irq context.  That commit hands off free_huge_page
>>> processing to a workqueue if !in_task.  However, as seen in [1] this
>>> does not cover all cases.  Instead, make the locks taken in the
>>> free_huge_page irq safe.
>>>
>>> This patch does the following:
>>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>>   changing spin_*lock calls to spin_*lock_irq* calls.
>>> - Make subpool lock irq safe in a similar manner.
>>> - Revert the !in_task check and workqueue handoff.
>>>
>>> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> The changes are straightforward.
>>
>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>>
>> Since this patchset aims to fix a real-world issue, should we add a Fixes
>> tag?
> 
> Do we know since when it has been possible to use hugetlb in the networking
> context? Maybe it has always been possible, but I am wondering why
> lockdep started complaining only now. Maybe fuzzing just finally started
> exercising this setup, which nobody does normally.
> 

From my memory and an email search, this first came up with the powerpc iommu here:
https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/

Aneesh proposed a solution similar to this, but 'fixed' the issue by changing
the powerpc code.

AFAICT, the put_page/free_huge_page code path has only been 'safe' to
call from task context since it was originally written.  The real
question is when was it first possible for some code to do (the last)
put_page for a hugetlbfs page from irq context?  My 'guess' is that this
may have been possible for quite a while.  I can imagine a dma reference
to a hugetlb page held after the user space reference goes away.
-- 
Mike Kravetz



end of thread, other threads:[~2021-03-29 22:44 UTC | newest]

Thread overview: 53+ messages
2021-03-25  0:28 [PATCH 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
2021-03-25  0:28 ` [PATCH 1/8] mm: cma: introduce cma_release_nowait() Mike Kravetz
2021-03-25  9:39   ` Oscar Salvador
2021-03-25  9:45   ` Michal Hocko
2021-03-25  9:54     ` Oscar Salvador
2021-03-25 10:10       ` Michal Hocko
2021-03-25 10:11     ` Michal Hocko
2021-03-25 10:13       ` David Hildenbrand
2021-03-25 10:17       ` Oscar Salvador
2021-03-25 10:24         ` Michal Hocko
2021-03-25  9:56   ` David Hildenbrand
2021-03-25 10:22     ` Michal Hocko
2021-03-25 16:56       ` Mike Kravetz
2021-03-25 17:15         ` David Hildenbrand
2021-03-25 20:12           ` Minchan Kim
2021-03-25 23:19             ` Roman Gushchin
2021-03-25 23:49               ` Mike Kravetz
2021-03-26 21:32                 ` Mike Kravetz
2021-03-29  7:46                   ` Michal Hocko
2021-03-29 22:27                     ` Mike Kravetz
2021-03-25  0:28 ` [PATCH 2/8] mm: hugetlb: don't drop hugetlb_lock around cma_release() call Mike Kravetz
2021-03-25  0:28 ` [PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
2021-03-25 10:47   ` Michal Hocko
2021-03-25 12:29   ` Oscar Salvador
2021-03-26  1:52   ` Miaohe Lin
2021-03-25  0:28 ` [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
2021-03-25 10:49   ` Michal Hocko
2021-03-26  2:10   ` Miaohe Lin
2021-03-26 19:57     ` Mike Kravetz
2021-03-27  1:40       ` Miaohe Lin
2021-03-27  6:36   ` [External] " Muchun Song
2021-03-25  0:28 ` [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
2021-03-25 10:55   ` Michal Hocko
2021-03-25 17:12     ` Mike Kravetz
2021-03-25 19:39       ` Michal Hocko
2021-03-25 20:33         ` Mike Kravetz
2021-03-27  6:54   ` [External] " Muchun Song
2021-03-28 21:40     ` Mike Kravetz
2021-03-25  0:28 ` [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
2021-03-25 11:06   ` Michal Hocko
2021-03-25 17:29     ` Mike Kravetz
2021-03-25  0:28 ` [PATCH 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
2021-03-25 11:21   ` Michal Hocko
2021-03-25 17:32     ` Mike Kravetz
2021-03-27  7:06   ` [External] " Muchun Song
2021-03-29  7:49     ` Michal Hocko
2021-03-29 22:44       ` Mike Kravetz
2021-03-25  0:28 ` [PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
2021-03-25 11:22   ` Michal Hocko
2021-03-26  2:12   ` Miaohe Lin
2021-03-27  8:14   ` [External] " Muchun Song
2021-03-26  1:42 ` [PATCH 0/8] make hugetlb put_page safe for all calling contexts Miaohe Lin
2021-03-26 20:00   ` Mike Kravetz
