* [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts
@ 2021-03-29 23:23 Mike Kravetz
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page path:
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page path
(free_huge_page) should be properly fixed and made safe for all
calling contexts.
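
For readers unfamiliar with that lockdep splat, the underlying problem is
the classic one sketched below.  This is an illustrative fragment only,
not code from this series; the function and lock names are made up:

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);

	/* Process context path: takes the lock without disabling IRQs. */
	static void process_context_path(void)
	{
		spin_lock(&example_lock);
		/*
		 * If an interrupt/softirq fires on this CPU right now and
		 * its handler also tries to take example_lock, it spins
		 * forever: the handler cannot return, so the lock holder
		 * never runs again and the lock is never released.
		 */
		spin_unlock(&example_lock);
	}

	/* (Soft)irq context path, e.g. a final put_page() from networking. */
	static void irq_context_path(void)
	{
		spin_lock(&example_lock);	/* deadlocks against the above */
		spin_unlock(&example_lock);
	}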

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com

v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c                |  20 +--
 mm/cma.h                |   2 +-
 mm/cma_debug.c          |  10 +-
 mm/hugetlb.c            | 340 +++++++++++++++++++++-------------------
 mm/hugetlb_cgroup.c     |   8 +-
 6 files changed, 202 insertions(+), 179 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
@ 2021-03-29 23:23 ` Mike Kravetz
  2021-03-30  1:13   ` Roman Gushchin
                     ` (2 more replies)
  2021-03-29 23:23 ` [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

Ideally, cma_release could be called from any context.  However, that is
not possible because a mutex is used to protect the per-area bitmap.
Change the bitmap mutex to an irq safe spinlock.
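
Schematically, the conversion replaces a sleeping lock with an irq-safe
one around the bitmap updates.  The fragment below is abridged from the
diff that follows, shown only to summarize the before/after pattern:

	/* before: mutex_lock() may sleep, so this cannot run in atomic/IRQ context */
	mutex_lock(&cma->lock);
	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
	mutex_unlock(&cma->lock);

	/* after: callable from any context, including with IRQs disabled */
	unsigned long flags;

	spin_lock_irqsave(&cma->lock, flags);
	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
	spin_unlock_irqrestore(&cma->lock, flags);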

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/cma.c       | 20 +++++++++++---------
 mm/cma.h       |  2 +-
 mm/cma_debug.c | 10 ++++++----
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index b2393b892d3b..80875fd4487b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include <linux/memblock.h>
 #include <linux/err.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
 			     unsigned int count)
 {
 	unsigned long bitmap_no, bitmap_count;
+	unsigned long flags;
 
 	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
 	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 	     pfn += pageblock_nr_pages)
 		init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-	mutex_init(&cma->lock);
+	spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
@@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
 	unsigned long start = 0;
 	unsigned long nr_part, nr_total = 0;
 	unsigned long nbits = cma_bitmap_maxno(cma);
+	unsigned long flags;
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	pr_info("number of available pages: ");
 	for (;;) {
 		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
 		start = next_zero_bit + nr_zero;
 	}
 	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	unsigned long pfn = -1;
 	unsigned long start = 0;
 	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
+	unsigned long flags;
 	size_t i;
 	struct page *page = NULL;
 	int ret = -ENOMEM;
@@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		goto out;
 
 	for (;;) {
-		mutex_lock(&cma->lock);
+		spin_lock_irqsave(&cma->lock, flags);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
+			spin_unlock_irqrestore(&cma->lock, flags);
 			break;
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		 * our exclusive use. If the migration fails we will take the
 		 * lock again and unmark it.
 		 */
-		mutex_unlock(&cma->lock);
+		spin_unlock_irqrestore(&cma->lock, flags);
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
 	unsigned long   count;
 	unsigned long   *bitmap;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
-	struct mutex    lock;
+	spinlock_t	lock;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..6379cfbfd568 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
 {
 	struct cma *cma = data;
 	unsigned long used;
+	unsigned long flags;
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	/* pages counter is smaller than sizeof(int) */
 	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 	*val = (u64)used << cma->order_per_bit;
 
 	return 0;
@@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
 	unsigned long maxchunk = 0;
 	unsigned long start, end = 0;
 	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
+	unsigned long flags;
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	for (;;) {
 		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
 		if (start >= bitmap_maxno)
@@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
 		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
 		maxchunk = max(end - start, maxchunk);
 	}
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 	*val = (u64)maxchunk << cma->order_per_bit;
 
 	return 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-03-29 23:23 ` Mike Kravetz
  2021-03-30  1:13   ` Roman Gushchin
  2021-03-30  8:01   ` Michal Hocko
  2021-03-29 23:23 ` [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling it.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-03-29 23:23 ` [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
@ 2021-03-29 23:23 ` Mike Kravetz
  2021-03-30  2:23     ` Muchun Song
  2021-03-29 23:23 ` [PATCH v2 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page cannot
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could run in parallel, or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.
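
The resulting lock nesting in set_max_huge_pages() is roughly the
following (an abridged sketch of the hunks below; the sleeping mutex is
taken before the spinlock, never the other way around):

	mutex_lock(&h->resize_lock);	/* serialize pool adjustments; may sleep */
	spin_lock(&hugetlb_lock);

	/*
	 * ... adjust pool size, possibly dropping and retaking
	 * hugetlb_lock around calls into the page allocator ...
	 */

	spin_unlock(&hugetlb_lock);
	mutex_unlock(&h->resize_lock);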

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c            | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+	struct mutex resize_lock;
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	else
 		return -ENOMEM;
 
+	/*
+	 * resize_lock mutex prevents concurrent adjustments to number of
+	 * pages in hstate via the proc/sysfs interfaces.
+	 */
+	mutex_lock(&h->resize_lock);
 	spin_lock(&hugetlb_lock);
 
 	/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
 	h = &hstates[hugetlb_max_hstate++];
+	mutex_init(&h->resize_lock);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-03-29 23:23 ` [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-03-29 23:23 ` Mike Kravetz
  2021-03-29 23:23 ` [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.
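
Later patches in the series use the new helper roughly as follows, which
is what makes the shorter lock hold times possible (an abridged sketch,
not an exact quote of the final code):

	spin_lock(&hugetlb_lock);
	remove_hugetlb_page(h, page, false);	/* page no longer visible to hugetlb */
	spin_unlock(&hugetlb_lock);

	/* the potentially expensive part now runs without the lock held */
	update_and_free_page(h, page);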

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 67 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..16beabbbbe49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1331,6 +1331,43 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	int nid = page_to_nid(page);
+
+	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+		return;
+
+	list_del(&page->lru);
+
+	if (HPageFreed(page)) {
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ClearHPageFreed(page);
+	}
+	if (adjust_surplus) {
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
+	}
+
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+	ClearHPageTemporary(page);
+	set_page_refcounted(page);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1339,8 +1376,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	h->nr_huge_pages--;
-	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1383,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1450,12 @@ static void __free_huge_page(struct page *page)
 		h->resv_huge_pages++;
 
 	if (HPageTemporary(page)) {
-		list_del(&page->lru);
-		ClearHPageTemporary(page);
+		remove_hugetlb_page(h, page, false);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
-		list_del(&page->lru);
+		remove_hugetlb_page(h, page, true);
 		update_and_free_page(h, page);
-		h->surplus_huge_pages--;
-		h->surplus_huge_pages_node[nid]--;
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
@@ -1712,13 +1740,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 			struct page *page =
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[node]--;
-			if (acct_surplus) {
-				h->surplus_huge_pages--;
-				h->surplus_huge_pages_node[node]--;
-			}
+			remove_hugetlb_page(h, page, acct_surplus);
 			update_and_free_page(h, page);
 			ret = 1;
 			break;
@@ -1756,7 +1778,6 @@ int dissolve_free_huge_page(struct page *page)
 	if (!page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
-		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
 
@@ -1787,9 +1808,7 @@ int dissolve_free_huge_page(struct page *page)
 			SetPageHWPoison(page);
 			ClearPageHWPoison(head);
 		}
-		list_del(&head->lru);
-		h->free_huge_pages--;
-		h->free_huge_pages_node[nid]--;
+		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
 		rc = 0;
@@ -2667,10 +2686,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 				return;
 			if (PageHighMem(page))
 				continue;
-			list_del(&page->lru);
+			remove_hugetlb_page(h, page, false);
 			update_and_free_page(h, page);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (3 preceding siblings ...)
  2021-03-29 23:23 ` [PATCH v2 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-03-29 23:23 ` Mike Kravetz
  2021-03-30  2:10   ` Miaohe Lin
  2021-03-30  2:21     ` Muchun Song
  2021-03-29 23:24 ` [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling it.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock between pages, reducing
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16beabbbbe49..dec7bd0dc63d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
+		spin_unlock(&hugetlb_lock);
 	}
-	spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
+			/*
+			 * unlock/lock around update_and_free_page is temporary
+			 * and will be removed with subsequent patch.
+			 */
+			spin_unlock(&hugetlb_lock);
 			update_and_free_page(h, page);
+			spin_lock(&hugetlb_lock);
 			ret = 1;
 			break;
 		}
@@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, head);
-		rc = 0;
+		return 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
@@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
+	struct page *page, *next;
+	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
 		return;
 
+	/*
+	 * Collect pages to be freed on a list, and free after dropping lock
+	 */
+	INIT_LIST_HEAD(&page_list);
 	for_each_node_mask(i, *nodes_allowed) {
-		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
-				return;
+				goto out;
 			if (PageHighMem(page))
 				continue;
 			remove_hugetlb_page(h, page, false);
-			update_and_free_page(h, page);
+			list_add(&page->lru, &page_list);
 		}
 	}
+
+out:
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (4 preceding siblings ...)
  2021-03-29 23:23 ` [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-29 23:24 ` Mike Kravetz
  2021-03-30  2:30     ` Muchun Song
  2021-03-30  8:06   ` Michal Hocko
  2021-03-29 23:24 ` [PATCH v2 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
  2021-03-29 23:24 ` [PATCH v2 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  7 siblings, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.
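
Callers now follow this general shape (abridged from the
set_max_huge_pages() hunk below; error handling and surrounding code are
elided):

	LIST_HEAD(page_list);
	struct page *page;

	spin_lock(&hugetlb_lock);
	while (min_count < persistent_huge_pages(h)) {
		page = remove_pool_huge_page(h, nodes_allowed, 0);
		if (!page)
			break;
		list_add(&page->lru, &page_list);
	}
	spin_unlock(&hugetlb_lock);
	update_and_free_pages_bulk(h, &page_list);	/* free outside the lock */
	spin_lock(&hugetlb_lock);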

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting, effectively
reverting that commit.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 95 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dec7bd0dc63d..d3f3cb8766b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+{
+	struct page *page, *t_page;
+
+	list_for_each_entry_safe(page, t_page, list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
 	struct hstate *h;
@@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+						nodemask_t *nodes_allowed,
+						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	int ret = 0;
+	struct page *page = NULL;
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
@@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[node].next,
+			page = list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
-			/*
-			 * unlock/lock around update_and_free_page is temporary
-			 * and will be removed with subsequent patch.
-			 */
-			spin_unlock(&hugetlb_lock);
-			update_and_free_page(h, page);
-			spin_lock(&hugetlb_lock);
-			ret = 1;
 			break;
 		}
 	}
 
-	return ret;
+	return page;
 }
 
 /*
@@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
  *    to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *    the reservation.  As many as unused_resv_pages may be freed.
- *
- * Called with hugetlb_lock held.  However, the lock could be dropped (and
- * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
- * we must make sure nobody else can claim pages we are in the process of
- * freeing.  Do this by ensuring resv_huge_page always is greater than the
- * number of huge pages we plan to free when dropping the lock.
  */
 static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	struct page *page;
+	LIST_HEAD(page_list);
+
+	/* Uncommit the reservation */
+	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -2102,24 +2104,22 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_huge_page() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
-	 *
-	 * Note that we decrement resv_huge_pages as we free the pages.  If
-	 * we drop the lock, resv_huge_pages will still be sufficiently large
-	 * to cover subsequent pages we may free.
 	 */
+	INIT_LIST_HEAD(&page_list);
 	while (nr_pages--) {
-		h->resv_huge_pages--;
-		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+		if (!page)
 			goto out;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
 
 out:
-	/* Fully uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
 }
 
 
@@ -2683,7 +2683,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
-	struct page *page, *next;
 	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
@@ -2694,6 +2693,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	 */
 	INIT_LIST_HEAD(&page_list);
 	for_each_node_mask(i, *nodes_allowed) {
+		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
@@ -2707,10 +2707,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 
 out:
 	spin_unlock(&hugetlb_lock);
-	list_for_each_entry_safe(page, next, &page_list, lru) {
-		update_and_free_page(h, page);
-		cond_resched();
-	}
+	update_and_free_pages_bulk(h, &page_list);
 	spin_lock(&hugetlb_lock);
 }
 #else
@@ -2757,6 +2754,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	struct page *page;
+	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
 	/*
@@ -2869,11 +2868,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);
+
+	/*
+	 * Collect pages to be removed on list without dropping lock
+	 */
+	INIT_LIST_HEAD(&page_list);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, nodes_allowed, 0))
+		page = remove_pool_huge_page(h, nodes_allowed, 0);
+		if (!page)
 			break;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
+	/* free the pages after dropping lock */
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
+
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 7/8] hugetlb: make free_huge_page irq safe
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (5 preceding siblings ...)
  2021-03-29 23:24 ` [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-03-29 23:24 ` Mike Kravetz
  2021-03-29 23:24 ` [PATCH v2 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases as pointed out by the 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:        ----                    ----
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel later explained that this is very likely the TCP TX zerocopy
from hugetlb pages scenario, where the networking code drops a last
reference to a hugetlb page while having IRQs disabled. The hugetlb
freeing path doesn't disable IRQs while holding hugetlb_lock, so a lock
dependency chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
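
As a sketch of what the first two items in the list above amount to
(illustrative only; the surrounding code is elided):

	/*
	 * free_huge_page() can be reached from task, softirq or hardirq
	 * context, so it must save and restore the previous IRQ state:
	 */
	unsigned long flags;

	spin_lock_irqsave(&hugetlb_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&hugetlb_lock, flags);

	/*
	 * Paths known to run in process context with IRQs enabled can use
	 * the simpler unconditional form:
	 */
	spin_lock_irq(&hugetlb_lock);
	/* ... */
	spin_unlock_irq(&hugetlb_lock);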

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c        | 167 ++++++++++++++++----------------------------
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 66 insertions(+), 109 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d3f3cb8766b8..bf36abc2305a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
 	return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+						unsigned long irq_flags)
 {
-	spin_unlock(&spool->lock);
+	spin_unlock_irqrestore(&spool->lock, irq_flags);
 
 	/* If no pages are used, and no other handles to the subpool
 	 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-	spin_lock(&spool->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&spool->lock, flags);
 	BUG_ON(!spool->count);
 	spool->count--;
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	if (!spool)
 		return ret;
 
-	spin_lock(&spool->lock);
+	spin_lock_irq(&spool->lock);
 
 	if (spool->max_hpages != -1) {		/* maximum size accounting */
 		if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	}
 
 unlock_ret:
-	spin_unlock(&spool->lock);
+	spin_unlock_irq(&spool->lock);
 	return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 				       long delta)
 {
 	long ret = delta;
+	unsigned long flags;
 
 	if (!spool)
 		return delta;
 
-	spin_lock(&spool->lock);
+	spin_lock_irqsave(&spool->lock, flags);
 
 	if (spool->max_hpages != -1)		/* maximum size accounting */
 		spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 	 * quota reference, free it now.
 	 */
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 
 	return ret;
 }
@@ -1412,7 +1416,7 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1422,6 +1426,7 @@ static void __free_huge_page(struct page *page)
 	int nid = page_to_nid(page);
 	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
 	bool restore_reserve;
+	unsigned long flags;
 
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1450,7 +1455,7 @@ static void __free_huge_page(struct page *page)
 			restore_reserve = true;
 	}
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	ClearHPageMigratable(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
@@ -1461,67 +1466,19 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
-		spin_unlock(&hugetlb_lock);
-	}
-}
-
-/*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
- *
- * free_hpage_workfn() locklessly retrieves the linked list of pages to
- * be freed and frees them one-by-one. As the page->mapping pointer is
- * going to be cleared in __free_huge_page() anyway, it is reused as the
- * llist_node structure of a lockless linked list of huge pages to be freed.
- */
-static LLIST_HEAD(hpage_freelist);
-
-static void free_hpage_workfn(struct work_struct *work)
-{
-	struct llist_node *node;
-	struct page *page;
-
-	node = llist_del_all(&hpage_freelist);
-
-	while (node) {
-		page = container_of((struct address_space **)node,
-				     struct page, mapping);
-		node = node->next;
-		__free_huge_page(page);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
 }
-static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
-
-void free_huge_page(struct page *page)
-{
-	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
-	 */
-	if (!in_task()) {
-		/*
-		 * Only call schedule_work() if hpage_freelist is previously
-		 * empty. Otherwise, schedule_work() had been called but the
-		 * workfn hasn't retrieved the list yet.
-		 */
-		if (llist_add((struct llist_node *)&page->mapping,
-			      &hpage_freelist))
-			schedule_work(&free_hpage_work);
-		return;
-	}
-
-	__free_huge_page(page);
-}
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
@@ -1530,11 +1487,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	ClearHPageFreed(page);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 }
 
 static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1780,7 +1737,7 @@ int dissolve_free_huge_page(struct page *page)
 	if (!PageHuge(page))
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(page)) {
 		rc = 0;
 		goto out;
@@ -1797,7 +1754,7 @@ int dissolve_free_huge_page(struct page *page)
 		 * when it is dissolved.
 		 */
 		if (unlikely(!HPageFreed(head))) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			cond_resched();
 
 			/*
@@ -1821,12 +1778,12 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		update_and_free_page(h, head);
 		return 0;
 	}
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return rc;
 }
 
@@ -1868,16 +1825,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
 		goto out_unlock;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
@@ -1887,7 +1844,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetHPageTemporary(page);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		put_page(page);
 		return NULL;
 	} else {
@@ -1896,7 +1853,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	}
 
 out_unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return page;
 }
@@ -1946,17 +1903,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask, gfp_t gfp_mask)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
 		struct page *page;
 
 		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
 		if (page) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			return page;
 		}
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
@@ -2004,7 +1961,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	ret = -ENOMEM;
 retry:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
 				NUMA_NO_NODE, NULL);
@@ -2021,7 +1978,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * After retaking hugetlb_lock, we need to recalculate 'needed'
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
 			(h->free_huge_pages + allocated);
 	if (needed > 0) {
@@ -2061,12 +2018,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		enqueue_huge_page(h, page);
 	}
 free:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
 		put_page(page);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
 }
@@ -2117,9 +2074,9 @@ static void return_unused_surplus_pages(struct hstate *h,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 
 
@@ -2464,7 +2421,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * glb_chg is passed to indicate whether or not a page must be taken
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -2472,7 +2429,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
 		if (!page)
 			goto out_uncharge_cgroup;
@@ -2480,7 +2437,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			SetHPageRestoreReserve(page);
 			h->resv_huge_pages--;
 		}
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		list_add(&page->lru, &h->hugepage_activelist);
 		/* Fall through */
 	}
@@ -2493,7 +2450,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 						  h_cg, page);
 	}
 
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	hugetlb_set_page_subpool(page, spool);
 
@@ -2706,9 +2663,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
@@ -2804,7 +2761,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 */
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
@@ -2834,14 +2791,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (!ret)
 			goto out;
 
@@ -2881,9 +2838,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		list_add(&page->lru, &page_list);
 	}
 	/* free the pages after dropping lock */
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
@@ -2891,7 +2848,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	}
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
@@ -3047,9 +3004,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	if (err)
 		return err;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_overcommit_huge_pages = input;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return count;
 }
@@ -3636,9 +3593,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 		goto out;
 
 	if (write) {
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		h->nr_overcommit_huge_pages = tmp;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 out:
 	return ret;
@@ -3734,7 +3691,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	if (!delta)
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * When cpuset is configured, it breaks the strict hugetlb page
 	 * reservation as the accounting is done on a global variable. Such
@@ -3773,7 +3730,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
@@ -5837,7 +5794,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 {
 	bool ret = true;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHeadHuge(page) ||
 	    !HPageMigratable(page) ||
 	    !get_page_unless_zero(page)) {
@@ -5847,16 +5804,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	ClearHPageMigratable(page);
 	list_move_tail(&page->lru, list);
 unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
 void putback_active_hugepage(struct page *page)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	SetHPageMigratable(page);
 	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	put_page(page);
 }
 
@@ -5890,12 +5847,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		 */
 		if (new_nid == old_nid)
 			return;
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (h->surplus_huge_pages_node[old_nid]) {
 			h->surplus_huge_pages_node[old_nid]--;
 			h->surplus_huge_pages_node[new_nid]++;
 		}
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 726b85f4f303..5383023d0cca 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		idx = 0;
 		for_each_hstate(h) {
-			spin_lock(&hugetlb_lock);
+			spin_lock_irq(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);
 
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			idx++;
 		}
 		cond_resched();
@@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	if (hugetlb_cgroup_disabled())
 		return;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h_cg = hugetlb_cgroup_from_page(oldhpage);
 	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
 	set_hugetlb_cgroup(oldhpage, NULL);
@@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	set_hugetlb_cgroup(newhpage, h_cg);
 	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (6 preceding siblings ...)
  2021-03-29 23:24 ` [PATCH v2 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-03-29 23:24 ` Mike Kravetz
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-29 23:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

After making the hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held() calls to help verify
locking.
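
For reference, each assertion just documents and checks the locking
requirement at the top of a helper; for example, enqueue_huge_page()
(taken from the hunk below) becomes:

	static void enqueue_huge_page(struct hstate *h, struct page *page)
	{
		int nid = page_to_nid(page);

		lockdep_assert_held(&hugetlb_lock);
		__enqueue_huge_page(&h->hugepage_freelists[nid], page);
		h->free_huge_pages++;
		h->free_huge_pages_node[nid]++;
	}

lockdep_assert_held() compiles to a no-op when lockdep is not enabled,
so the checks add no overhead to production builds.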

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf36abc2305a..06282f340f40 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1068,6 +1068,8 @@ static void __enqueue_huge_page(struct list_head *list, struct page *page)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
+
+	lockdep_assert_held(&hugetlb_lock);
 	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
@@ -1078,6 +1080,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	struct page *page;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+	lockdep_assert_held(&hugetlb_lock);
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
 		if (pin && !is_pinnable_page(page))
 			continue;
@@ -1346,6 +1349,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 {
 	int nid = page_to_nid(page);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
@@ -1701,6 +1705,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 	int nr_nodes, node;
 	struct page *page = NULL;
 
+	lockdep_assert_held(&hugetlb_lock);
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
 		 * If we're returning unused surplus pages, only examine
@@ -1950,6 +1955,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	long needed, allocated;
 	bool alloc_ok = true;
 
+	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
@@ -2043,6 +2049,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	struct page *page;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
@@ -2642,6 +2649,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	int i;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h))
 		return;
 
@@ -2684,6 +2692,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 {
 	int nr_nodes, node;
 
+	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread
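
As a hedged sketch of the pattern the patch above adds (illustrative
only, not taken from the series): with CONFIG_LOCKDEP enabled,
lockdep_assert_held() warns when the caller does not hold the given
lock; without lockdep it compiles away.

static void pool_account_example(struct hstate *h, int nid)
{
        /* warn (under lockdep) if the caller forgot to take hugetlb_lock */
        lockdep_assert_held(&hugetlb_lock);
        h->free_huge_pages++;
        h->free_huge_pages_node[nid]++;
}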

* Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-03-30  1:13   ` Roman Gushchin
  2021-03-30  1:20   ` Song Bao Hua (Barry Song)
  2021-03-30  8:01   ` Michal Hocko
  2 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2021-03-30  1:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Mon, Mar 29, 2021 at 04:23:55PM -0700, Mike Kravetz wrote:
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the bitmap mutex to an irq safe spinlock.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

> ---
>  mm/cma.c       | 20 +++++++++++---------
>  mm/cma.h       |  2 +-
>  mm/cma_debug.c | 10 ++++++----
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..80875fd4487b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include <linux/memblock.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> -#include <linux/mutex.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>  			     unsigned int count)
>  {
>  	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>  
>  	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>  	     pfn += pageblock_nr_pages)
>  		init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long start = 0;
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	unsigned long pfn = -1;
>  	unsigned long start = 0;
>  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> +	unsigned long flags;
>  	size_t i;
>  	struct page *page = NULL;
>  	int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		goto out;
>  
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irqsave(&cma->lock, flags);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irqrestore(&cma->lock, flags);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irqrestore(&cma->lock, flags);
>  
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..6379cfbfd568 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
>  {
>  	struct cma *cma = data;
>  	unsigned long used;
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)used << cma->order_per_bit;
>  
>  	return 0;
> @@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long maxchunk = 0;
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)maxchunk << cma->order_per_bit;
>  
>  	return 0;
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread
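
For reference, a minimal sketch of the irq-safe locking pattern the
patch switches to (illustrative only; clear_range_example is not a real
kernel function):

static void clear_range_example(unsigned long *bitmap, unsigned int start,
                                unsigned int nbits, spinlock_t *lock)
{
        unsigned long flags;

        spin_lock_irqsave(lock, flags);         /* usable from any context */
        bitmap_clear(bitmap, start, nbits);     /* short, non-sleeping work */
        spin_unlock_irqrestore(lock, flags);
}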

* Re: [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-03-29 23:23 ` [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
@ 2021-03-30  1:13   ` Roman Gushchin
  2021-03-30  8:01   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2021-03-30  1:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Mon, Mar 29, 2021 at 04:23:56PM -0700, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>  	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
> -		/*
> -		 * Temporarily drop the hugetlb_lock, because
> -		 * we might block in free_gigantic_page().
> -		 */
> -		spin_unlock(&hugetlb_lock);
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> -		spin_lock(&hugetlb_lock);
>  	} else {
>  		__free_pages(page, huge_page_order(h));
>  	}
> -- 
> 2.30.2
> 

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-03-30  1:13   ` Roman Gushchin
@ 2021-03-30  1:20   ` Song Bao Hua (Barry Song)
  2021-03-30  2:18     ` Mike Kravetz
  2021-03-30  8:01   ` Michal Hocko
  2 siblings, 1 reply; 31+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-03-30  1:20 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, linmiaohe,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Will Deacon, Andrew Morton



> -----Original Message-----
> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
> Sent: Tuesday, March 30, 2021 12:24 PM
> To: linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Cc: Roman Gushchin <guro@fb.com>; Michal Hocko <mhocko@suse.com>; Shakeel Butt
> <shakeelb@google.com>; Oscar Salvador <osalvador@suse.de>; David Hildenbrand
> <david@redhat.com>; Muchun Song <songmuchun@bytedance.com>; David Rientjes
> <rientjes@google.com>; linmiaohe <linmiaohe@huawei.com>; Peter Zijlstra
> <peterz@infradead.org>; Matthew Wilcox <willy@infradead.org>; HORIGUCHI NAOYA
> <naoya.horiguchi@nec.com>; Aneesh Kumar K . V <aneesh.kumar@linux.ibm.com>;
> Waiman Long <longman@redhat.com>; Peter Xu <peterx@redhat.com>; Mina Almasry
> <almasrymina@google.com>; Hillf Danton <hdanton@sina.com>; Joonsoo Kim
> <iamjoonsoo.kim@lge.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Will Deacon <will@kernel.org>; Andrew Morton
> <akpm@linux-foundation.org>; Mike Kravetz <mike.kravetz@oracle.com>
> Subject: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
> 
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the bitmap mutex to an irq safe spinlock.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

It seems the mutex was only protecting bitmap operations, which are safe
to run in atomic context.

Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>

> ---
>  mm/cma.c       | 20 +++++++++++---------
>  mm/cma.h       |  2 +-
>  mm/cma_debug.c | 10 ++++++----
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..80875fd4487b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include <linux/memblock.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> -#include <linux/mutex.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long
> pfn,
>  			     unsigned int count)
>  {
>  	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
> 
>  	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
> 
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
> 
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>  	     pfn += pageblock_nr_pages)
>  		init_cma_reserved_pageblock(pfn_to_page(pfn));
> 
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
> 
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long start = 0;
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
> +	unsigned long flags;
> 
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>  	unsigned long pfn = -1;
>  	unsigned long start = 0;
>  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> +	unsigned long flags;
>  	size_t i;
>  	struct page *page = NULL;
>  	int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>  		goto out;
> 
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irqsave(&cma->lock, flags);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irqrestore(&cma->lock, flags);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irqrestore(&cma->lock, flags);
> 
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..6379cfbfd568 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
>  {
>  	struct cma *cma = data;
>  	unsigned long used;
> +	unsigned long flags;
> 
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)used << cma->order_per_bit;
> 
>  	return 0;
> @@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long maxchunk = 0;
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> +	unsigned long flags;
> 
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)maxchunk << cma->order_per_bit;
> 
>  	return 0;
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-29 23:23 ` [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-30  2:10   ` Miaohe Lin
  2021-03-30  2:21     ` Muchun Song
  1 sibling, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2021-03-30  2:10 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon, Andrew Morton

On 2021/3/30 7:23, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
> 
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
> 
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Looks good to me. Thanks!
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> ---
>  mm/hugetlb.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 16beabbbbe49..dec7bd0dc63d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
>  
>  	if (HPageTemporary(page)) {
>  		remove_hugetlb_page(h, page, false);
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_page(h, page, true);
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, page);
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> +		spin_unlock(&hugetlb_lock);
>  	}
> -	spin_unlock(&hugetlb_lock);
>  }
>  
>  /*
> @@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  				list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
>  			remove_hugetlb_page(h, page, acct_surplus);
> +			/*
> +			 * unlock/lock around update_and_free_page is temporary
> +			 * and will be removed with subsequent patch.
> +			 */
> +			spin_unlock(&hugetlb_lock);
>  			update_and_free_page(h, page);
> +			spin_lock(&hugetlb_lock);
>  			ret = 1;
>  			break;
>  		}
> @@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
>  		}
>  		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, head);
> -		rc = 0;
> +		return 0;
>  	}
>  out:
>  	spin_unlock(&hugetlb_lock);
> @@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  						nodemask_t *nodes_allowed)
>  {
>  	int i;
> +	struct page *page, *next;
> +	LIST_HEAD(page_list);
>  
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> +	/*
> +	 * Collect pages to be freed on a list, and free after dropping lock
> +	 */
> +	INIT_LIST_HEAD(&page_list);
>  	for_each_node_mask(i, *nodes_allowed) {
> -		struct page *page, *next;
>  		struct list_head *freel = &h->hugepage_freelists[i];
>  		list_for_each_entry_safe(page, next, freel, lru) {
>  			if (count >= h->nr_huge_pages)
> -				return;
> +				goto out;
>  			if (PageHighMem(page))
>  				continue;
>  			remove_hugetlb_page(h, page, false);
> -			update_and_free_page(h, page);
> +			list_add(&page->lru, &page_list);
>  		}
>  	}
> +
> +out:
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +	spin_lock(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread
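
The core idea of the patch, as a hedged sketch rather than an actual
hunk (free_one_example is an illustrative name):

/*
 * Once remove_hugetlb_page() has taken the page off the hstate lists and
 * adjusted the counters, no other path can find it, so the lock can be
 * dropped before the comparatively slow update_and_free_page().
 */
static void free_one_example(struct hstate *h, struct page *page)
{
        spin_lock(&hugetlb_lock);
        remove_hugetlb_page(h, page, false);
        spin_unlock(&hugetlb_lock);
        update_and_free_page(h, page);          /* may reach buddy or CMA */
}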

* Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  1:20   ` Song Bao Hua (Barry Song)
@ 2021-03-30  2:18     ` Mike Kravetz
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-30  2:18 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, linmiaohe,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Will Deacon, Andrew Morton

On 3/29/21 6:20 PM, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -----Original Message-----
>> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
>> Sent: Tuesday, March 30, 2021 12:24 PM
>> To: linux-mm@kvack.org; linux-kernel@vger.kernel.org
>> Cc: Roman Gushchin <guro@fb.com>; Michal Hocko <mhocko@suse.com>; Shakeel Butt
>> <shakeelb@google.com>; Oscar Salvador <osalvador@suse.de>; David Hildenbrand
>> <david@redhat.com>; Muchun Song <songmuchun@bytedance.com>; David Rientjes
>> <rientjes@google.com>; linmiaohe <linmiaohe@huawei.com>; Peter Zijlstra
>> <peterz@infradead.org>; Matthew Wilcox <willy@infradead.org>; HORIGUCHI NAOYA
>> <naoya.horiguchi@nec.com>; Aneesh Kumar K . V <aneesh.kumar@linux.ibm.com>;
>> Waiman Long <longman@redhat.com>; Peter Xu <peterx@redhat.com>; Mina Almasry
>> <almasrymina@google.com>; Hillf Danton <hdanton@sina.com>; Joonsoo Kim
>> <iamjoonsoo.kim@lge.com>; Song Bao Hua (Barry Song)
>> <song.bao.hua@hisilicon.com>; Will Deacon <will@kernel.org>; Andrew Morton
>> <akpm@linux-foundation.org>; Mike Kravetz <mike.kravetz@oracle.com>
>> Subject: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
>>
>> Ideally, cma_release could be called from any context.  However, that is
>> not possible because a mutex is used to protect the per-area bitmap.
>> Change the bitmap mutex to an irq safe spinlock.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> It seems the mutex was only protecting bitmap operations, which are safe
> to run in atomic context.
> 
> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>

Thanks Barry,

Not sure if you saw the questions from Michal on the previous series.
There was some concern from Joonsoo in the past about lock hold time due
to bitmap scans.  You may have some insight into the typical size of CMA
areas on arm64.  I believe the calls to set up the areas specify one bit
per page.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread
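
A rough sizing sketch for the bitmap-scan question above (assumptions:
4 KiB pages and order_per_bit == 0; the helper below is made up for
illustration):

static unsigned long cma_bitmap_bits_example(unsigned long area_bytes,
                                             unsigned long page_size,
                                             unsigned int order_per_bit)
{
        /* one bit covers (page_size << order_per_bit) bytes */
        return (area_bytes / page_size) >> order_per_bit;
}

With those assumptions, a 1 GiB CMA area works out to 262144 bits, i.e.
a 32 KiB bitmap that cma_alloc() may scan while holding cma->lock.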

* Re: [External] [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-29 23:23 ` [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-30  2:21     ` Muchun Song
  2021-03-30  2:21     ` Muchun Song
  1 sibling, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  2:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
>
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
>
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 16beabbbbe49..dec7bd0dc63d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
>
>         if (HPageTemporary(page)) {
>                 remove_hugetlb_page(h, page, false);
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, page);
>         } else if (h->surplus_huge_pages_node[nid]) {
>                 /* remove the page from active list */
>                 remove_hugetlb_page(h, page, true);
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, page);
>         } else {
>                 arch_clear_hugepage_flags(page);
>                 enqueue_huge_page(h, page);
> +               spin_unlock(&hugetlb_lock);
>         }
> -       spin_unlock(&hugetlb_lock);
>  }
>
>  /*
> @@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>                                 list_entry(h->hugepage_freelists[node].next,
>                                           struct page, lru);
>                         remove_hugetlb_page(h, page, acct_surplus);
> +                       /*
> +                        * unlock/lock around update_and_free_page is temporary
> +                        * and will be removed with subsequent patch.
> +                        */
> +                       spin_unlock(&hugetlb_lock);
>                         update_and_free_page(h, page);
> +                       spin_lock(&hugetlb_lock);
>                         ret = 1;
>                         break;
>                 }
> @@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
>                 }
>                 remove_hugetlb_page(h, page, false);
>                 h->max_huge_pages--;
> +               spin_unlock(&hugetlb_lock);
>                 update_and_free_page(h, head);
> -               rc = 0;
> +               return 0;
>         }
>  out:
>         spin_unlock(&hugetlb_lock);
> @@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>                                                 nodemask_t *nodes_allowed)
>  {
>         int i;
> +       struct page *page, *next;
> +       LIST_HEAD(page_list);
>
>         if (hstate_is_gigantic(h))
>                 return;
>
> +       /*
> +        * Collect pages to be freed on a list, and free after dropping lock
> +        */
> +       INIT_LIST_HEAD(&page_list);

INIT_LIST_HEAD is unnecessary because the LIST_HEAD macro
already initializes the list_head structure.

>         for_each_node_mask(i, *nodes_allowed) {
> -               struct page *page, *next;
>                 struct list_head *freel = &h->hugepage_freelists[i];
>                 list_for_each_entry_safe(page, next, freel, lru) {
>                         if (count >= h->nr_huge_pages)
> -                               return;
> +                               goto out;
>                         if (PageHighMem(page))
>                                 continue;
>                         remove_hugetlb_page(h, page, false);
> -                       update_and_free_page(h, page);
> +                       list_add(&page->lru, &page_list);
>                 }
>         }
> +
> +out:
> +       spin_unlock(&hugetlb_lock);
> +       list_for_each_entry_safe(page, next, &page_list, lru) {
> +               update_and_free_page(h, page);
> +               cond_resched();
> +       }
> +       spin_lock(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
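
To illustrate Muchun's point about LIST_HEAD (sketch only, not a hunk):

LIST_HEAD(page_list);   /* expands to: struct list_head page_list =
                         *              LIST_HEAD_INIT(page_list);
                         * the list is already initialized, so a following
                         * INIT_LIST_HEAD(&page_list) is redundant */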

* Re: [External] [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-29 23:23 ` [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-03-30  2:23     ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  2:23 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> The helper routine hstate_next_node_to_alloc accesses and modifies the
> hstate variable next_nid_to_alloc.  The helper is used by the routines
> alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
> called with hugetlb_lock held.  However, alloc_pool_huge_page can not
> be called with the hugetlb lock held as it will call the page allocator.
> Two instances of alloc_pool_huge_page could be run in parallel or
> alloc_pool_huge_page could run in parallel with adjust_pool_surplus
> which may result in the variable next_nid_to_alloc becoming invalid
> for the caller and pages being allocated on the wrong node.
>
> Both alloc_pool_huge_page and adjust_pool_surplus are only called from
> the routine set_max_huge_pages after boot.  set_max_huge_pages is only
> called as the result of a user writing to the proc/sysfs nr_hugepages,
> or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
>
> It makes little sense to allow multiple adjustments to the number of
> hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
> allow one hugetlb page adjustment at a time.  This will synchronize
> modifications to the next_nid_to_alloc variable.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks Mike.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  include/linux/hugetlb.h | 1 +
>  mm/hugetlb.c            | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d9b78e82652f..b92f25ccef58 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> +       struct mutex resize_lock;
>         int next_nid_to_alloc;
>         int next_nid_to_free;
>         unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1d62f0492e7b..8497a3598c86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         else
>                 return -ENOMEM;
>
> +       /*
> +        * resize_lock mutex prevents concurrent adjustments to number of
> +        * pages in hstate via the proc/sysfs interfaces.
> +        */
> +       mutex_lock(&h->resize_lock);
>         spin_lock(&hugetlb_lock);
>
>         /*
> @@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>                 if (count > persistent_huge_pages(h)) {
>                         spin_unlock(&hugetlb_lock);
> +                       mutex_unlock(&h->resize_lock);
>                         NODEMASK_FREE(node_alloc_noretry);
>                         return -EINVAL;
>                 }
> @@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  out:
>         h->max_huge_pages = persistent_huge_pages(h);
>         spin_unlock(&hugetlb_lock);
> +       mutex_unlock(&h->resize_lock);
>
>         NODEMASK_FREE(node_alloc_noretry);
>
> @@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>         BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>         BUG_ON(order == 0);
>         h = &hstates[hugetlb_max_hstate++];
> +       mutex_init(&h->resize_lock);
>         h->order = order;
>         h->mask = ~(huge_page_size(h) - 1);
>         for (i = 0; i < MAX_NUMNODES; ++i)
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
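
The lock ordering the patch establishes, as a hedged sketch
(resize_pool_example is an illustrative name; details omitted):

static void resize_pool_example(struct hstate *h)
{
        mutex_lock(&h->resize_lock);    /* one resizer at a time, may sleep */
        spin_lock(&hugetlb_lock);       /* protects counters and free lists */

        /*
         * ... adjust the pool; hugetlb_lock may be dropped around page
         * allocation, but next_nid_to_alloc stays consistent because any
         * other resizer is blocked on resize_lock ...
         */

        spin_unlock(&hugetlb_lock);
        mutex_unlock(&h->resize_lock);
}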

* Re: [External] [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-29 23:24 ` [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-03-30  2:30     ` Muchun Song
  2021-03-30  8:06   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  2:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> free_pool_huge_page was called with hugetlb_lock held.  It would remove
> a hugetlb page, and then free the corresponding pages to the lower level
> allocators such as buddy.  free_pool_huge_page was called in a loop to
> remove hugetlb pages and these loops could hold the hugetlb_lock for a
> considerable time.
>
> Create new routine remove_pool_huge_page to replace free_pool_huge_page.
> remove_pool_huge_page will remove the hugetlb page, and it must be
> called with the hugetlb_lock held.  It will return the removed page and
> it is the responsibility of the caller to free the page to the lower
> level allocators.  The hugetlb_lock is dropped before freeing to these
> allocators which results in shorter lock hold times.
>
> Add new helper routine to call update_and_free_page for a list of pages.
>
> Note: Some changes to the routine return_unused_surplus_pages are in
> need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
> race when freeing surplus pages") modified this routine to address a
> race which could occur when dropping the hugetlb_lock in the loop that
> removes pool pages.  Accounting changes introduced in that commit were
> subtle and took some thought to understand.  This commit removes the
> cond_resched_lock() and the potential race.  Therefore, remove the
> subtle code and restore the more straightforward accounting, effectively
> reverting the commit.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Some nits below.

> ---
>  mm/hugetlb.c | 95 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 53 insertions(+), 42 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dec7bd0dc63d..d3f3cb8766b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
>  }
>
>  /*
> - * helper for free_pool_huge_page() - return the previously saved
> + * helper for remove_pool_huge_page() - return the previously saved
>   * node ["this node"] from which to free a huge page.  Advance the
>   * next node id whether or not we find a free huge page to free so
>   * that the next attempt to free addresses the next node.
> @@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>         }
>  }
>
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +{
> +       struct page *page, *t_page;
> +
> +       list_for_each_entry_safe(page, t_page, list, lru) {
> +               update_and_free_page(h, page);
> +               cond_resched();
> +       }
> +}
> +
>  struct hstate *size_to_hstate(unsigned long size)
>  {
>         struct hstate *h;
> @@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  }
>
>  /*
> - * Free huge page from pool from next node to free.
> - * Attempt to keep persistent huge pages more or less
> - * balanced over allowed nodes.
> + * Remove huge page from pool from next node to free.  Attempt to keep
> + * persistent huge pages more or less balanced over allowed nodes.
> + * This routine only 'removes' the hugetlb page.  The caller must make
> + * an additional call to free the page to low level allocators.
>   * Called with hugetlb_lock locked.
>   */
> -static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> -                                                        bool acct_surplus)
> +static struct page *remove_pool_huge_page(struct hstate *h,
> +                                               nodemask_t *nodes_allowed,
> +                                                bool acct_surplus)
>  {
>         int nr_nodes, node;
> -       int ret = 0;
> +       struct page *page = NULL;
>
>         for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>                 /*
> @@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>                  */
>                 if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
>                     !list_empty(&h->hugepage_freelists[node])) {
> -                       struct page *page =
> -                               list_entry(h->hugepage_freelists[node].next,
> +                       page = list_entry(h->hugepage_freelists[node].next,
>                                           struct page, lru);
>                         remove_hugetlb_page(h, page, acct_surplus);
> -                       /*
> -                        * unlock/lock around update_and_free_page is temporary
> -                        * and will be removed with subsequent patch.
> -                        */
> -                       spin_unlock(&hugetlb_lock);
> -                       update_and_free_page(h, page);
> -                       spin_lock(&hugetlb_lock);
> -                       ret = 1;
>                         break;
>                 }
>         }
>
> -       return ret;
> +       return page;
>  }
>
>  /*
> @@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   *    to the associated reservation map.
>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>   *    the reservation.  As many as unused_resv_pages may be freed.
> - *
> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
> - * we must make sure nobody else can claim pages we are in the process of
> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
> - * number of huge pages we plan to free when dropping the lock.
>   */
>  static void return_unused_surplus_pages(struct hstate *h,
>                                         unsigned long unused_resv_pages)
>  {
>         unsigned long nr_pages;
> +       struct page *page;
> +       LIST_HEAD(page_list);
> +
> +       /* Uncommit the reservation */
> +       h->resv_huge_pages -= unused_resv_pages;
>
>         /* Cannot return gigantic pages currently */
>         if (hstate_is_gigantic(h))
> @@ -2102,24 +2104,22 @@ static void return_unused_surplus_pages(struct hstate *h,
>          * evenly across all nodes with memory. Iterate across these nodes
>          * until we can no longer free unreserved surplus pages. This occurs
>          * when the nodes with surplus pages have no free pages.
> -        * free_pool_huge_page() will balance the freed pages across the
> +        * remove_pool_huge_page() will balance the freed pages across the
>          * on-line nodes with memory and will handle the hstate accounting.
> -        *
> -        * Note that we decrement resv_huge_pages as we free the pages.  If
> -        * we drop the lock, resv_huge_pages will still be sufficiently large
> -        * to cover subsequent pages we may free.
>          */
> +       INIT_LIST_HEAD(&page_list);

INIT_LIST_HEAD is unnecessary. LIST_HEAD is enough.

>         while (nr_pages--) {
> -               h->resv_huge_pages--;
> -               unused_resv_pages--;
> -               if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> +               page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> +               if (!page)
>                         goto out;
> -               cond_resched_lock(&hugetlb_lock);
> +
> +               list_add(&page->lru, &page_list);
>         }
>
>  out:
> -       /* Fully uncommit the reservation */
> -       h->resv_huge_pages -= unused_resv_pages;
> +       spin_unlock(&hugetlb_lock);
> +       update_and_free_pages_bulk(h, &page_list);
> +       spin_lock(&hugetlb_lock);
>  }
>
>
> @@ -2683,7 +2683,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>                                                 nodemask_t *nodes_allowed)
>  {
>         int i;
> -       struct page *page, *next;
>         LIST_HEAD(page_list);
>
>         if (hstate_is_gigantic(h))
> @@ -2694,6 +2693,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>          */
>         INIT_LIST_HEAD(&page_list);
>         for_each_node_mask(i, *nodes_allowed) {
> +               struct page *page, *next;
>                 struct list_head *freel = &h->hugepage_freelists[i];
>                 list_for_each_entry_safe(page, next, freel, lru) {
>                         if (count >= h->nr_huge_pages)
> @@ -2707,10 +2707,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>
>  out:
>         spin_unlock(&hugetlb_lock);
> -       list_for_each_entry_safe(page, next, &page_list, lru) {
> -               update_and_free_page(h, page);
> -               cond_resched();
> -       }
> +       update_and_free_pages_bulk(h, &page_list);
>         spin_lock(&hugetlb_lock);
>  }
>  #else
> @@ -2757,6 +2754,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                               nodemask_t *nodes_allowed)
>  {
>         unsigned long min_count, ret;
> +       struct page *page;
> +       LIST_HEAD(page_list);
>         NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
>
>         /*
> @@ -2869,11 +2868,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>         min_count = max(count, min_count);
>         try_to_free_low(h, min_count, nodes_allowed);
> +
> +       /*
> +        * Collect pages to be removed on list without dropping lock
> +        */
> +       INIT_LIST_HEAD(&page_list);

Same here.

>         while (min_count < persistent_huge_pages(h)) {
> -               if (!free_pool_huge_page(h, nodes_allowed, 0))
> +               page = remove_pool_huge_page(h, nodes_allowed, 0);
> +               if (!page)
>                         break;
> -               cond_resched_lock(&hugetlb_lock);
> +
> +               list_add(&page->lru, &page_list);
>         }
> +       /* free the pages after dropping lock */
> +       spin_unlock(&hugetlb_lock);
> +       update_and_free_pages_bulk(h, &page_list);
> +       spin_lock(&hugetlb_lock);
> +
>         while (count < persistent_huge_pages(h)) {
>                 if (!adjust_pool_surplus(h, nodes_allowed, 1))
>                         break;
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
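
The resulting caller-side pattern, as a hedged sketch
(shrink_pool_example is illustrative, not a hunk from the patch):

static void shrink_pool_example(struct hstate *h, unsigned long nr)
{
        struct page *page;
        LIST_HEAD(page_list);

        spin_lock(&hugetlb_lock);
        while (nr--) {
                page = remove_pool_huge_page(h, &node_states[N_MEMORY], 0);
                if (!page)
                        break;
                list_add(&page->lru, &page_list); /* unlinked, not yet freed */
        }
        spin_unlock(&hugetlb_lock);

        /* free outside the lock; the bulk helper cond_resched()s per page */
        update_and_free_pages_bulk(h, &page_list);
}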

* Re: [External] [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
@ 2021-03-30  2:30     ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  2:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> free_pool_huge_page was called with hugetlb_lock held.  It would remove
> a hugetlb page, and then free the corresponding pages to the lower level
> allocators such as buddy.  free_pool_huge_page was called in a loop to
> remove hugetlb pages and these loops could hold the hugetlb_lock for a
> considerable time.
>
> Create new routine remove_pool_huge_page to replace free_pool_huge_page.
> remove_pool_huge_page will remove the hugetlb page, and it must be
> called with the hugetlb_lock held.  It will return the removed page and
> it is the responsibility of the caller to free the page to the lower
> level allocators.  The hugetlb_lock is dropped before freeing to these
> allocators which results in shorter lock hold times.
>
> Add new helper routine to call update_and_free_page for a list of pages.
>
> Note: Some changes to the routine return_unused_surplus_pages are in
> need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
> race when freeing surplus pages") modified this routine to address a
> race which could occur when dropping the hugetlb_lock in the loop that
> removes pool pages.  Accounting changes introduced in that commit were
> subtle and took some thought to understand.  This commit removes the
> cond_resched_lock() and the potential race.  Therefore, remove the
> subtle code and restore the more straightforward accounting, effectively
> reverting the commit.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Some nits below.

> ---
>  mm/hugetlb.c | 95 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 53 insertions(+), 42 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dec7bd0dc63d..d3f3cb8766b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
>  }
>
>  /*
> - * helper for free_pool_huge_page() - return the previously saved
> + * helper for remove_pool_huge_page() - return the previously saved
>   * node ["this node"] from which to free a huge page.  Advance the
>   * next node id whether or not we find a free huge page to free so
>   * that the next attempt to free addresses the next node.
> @@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>         }
>  }
>
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +{
> +       struct page *page, *t_page;
> +
> +       list_for_each_entry_safe(page, t_page, list, lru) {
> +               update_and_free_page(h, page);
> +               cond_resched();
> +       }
> +}
> +
>  struct hstate *size_to_hstate(unsigned long size)
>  {
>         struct hstate *h;
> @@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  }
>
>  /*
> - * Free huge page from pool from next node to free.
> - * Attempt to keep persistent huge pages more or less
> - * balanced over allowed nodes.
> + * Remove huge page from pool from next node to free.  Attempt to keep
> + * persistent huge pages more or less balanced over allowed nodes.
> + * This routine only 'removes' the hugetlb page.  The caller must make
> + * an additional call to free the page to low level allocators.
>   * Called with hugetlb_lock locked.
>   */
> -static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> -                                                        bool acct_surplus)
> +static struct page *remove_pool_huge_page(struct hstate *h,
> +                                               nodemask_t *nodes_allowed,
> +                                                bool acct_surplus)
>  {
>         int nr_nodes, node;
> -       int ret = 0;
> +       struct page *page = NULL;
>
>         for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>                 /*
> @@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>                  */
>                 if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
>                     !list_empty(&h->hugepage_freelists[node])) {
> -                       struct page *page =
> -                               list_entry(h->hugepage_freelists[node].next,
> +                       page = list_entry(h->hugepage_freelists[node].next,
>                                           struct page, lru);
>                         remove_hugetlb_page(h, page, acct_surplus);
> -                       /*
> -                        * unlock/lock around update_and_free_page is temporary
> -                        * and will be removed with subsequent patch.
> -                        */
> -                       spin_unlock(&hugetlb_lock);
> -                       update_and_free_page(h, page);
> -                       spin_lock(&hugetlb_lock);
> -                       ret = 1;
>                         break;
>                 }
>         }
>
> -       return ret;
> +       return page;
>  }
>
>  /*
> @@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   *    to the associated reservation map.
>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>   *    the reservation.  As many as unused_resv_pages may be freed.
> - *
> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
> - * we must make sure nobody else can claim pages we are in the process of
> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
> - * number of huge pages we plan to free when dropping the lock.
>   */
>  static void return_unused_surplus_pages(struct hstate *h,
>                                         unsigned long unused_resv_pages)
>  {
>         unsigned long nr_pages;
> +       struct page *page;
> +       LIST_HEAD(page_list);
> +
> +       /* Uncommit the reservation */
> +       h->resv_huge_pages -= unused_resv_pages;
>
>         /* Cannot return gigantic pages currently */
>         if (hstate_is_gigantic(h))
> @@ -2102,24 +2104,22 @@ static void return_unused_surplus_pages(struct hstate *h,
>          * evenly across all nodes with memory. Iterate across these nodes
>          * until we can no longer free unreserved surplus pages. This occurs
>          * when the nodes with surplus pages have no free pages.
> -        * free_pool_huge_page() will balance the freed pages across the
> +        * remove_pool_huge_page() will balance the freed pages across the
>          * on-line nodes with memory and will handle the hstate accounting.
> -        *
> -        * Note that we decrement resv_huge_pages as we free the pages.  If
> -        * we drop the lock, resv_huge_pages will still be sufficiently large
> -        * to cover subsequent pages we may free.
>          */
> +       INIT_LIST_HEAD(&page_list);

INIT_LIST_HEAD is unnecessary. LIST_HEAD is enough.
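
(For reference, LIST_HEAD() both declares and initializes the list head,
roughly:

	#define LIST_HEAD_INIT(name) { &(name), &(name) }
	#define LIST_HEAD(name) \
		struct list_head name = LIST_HEAD_INIT(name)

so the explicit INIT_LIST_HEAD() call that follows is redundant.)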

>         while (nr_pages--) {
> -               h->resv_huge_pages--;
> -               unused_resv_pages--;
> -               if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> +               page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> +               if (!page)
>                         goto out;
> -               cond_resched_lock(&hugetlb_lock);
> +
> +               list_add(&page->lru, &page_list);
>         }
>
>  out:
> -       /* Fully uncommit the reservation */
> -       h->resv_huge_pages -= unused_resv_pages;
> +       spin_unlock(&hugetlb_lock);
> +       update_and_free_pages_bulk(h, &page_list);
> +       spin_lock(&hugetlb_lock);
>  }
>
>
> @@ -2683,7 +2683,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>                                                 nodemask_t *nodes_allowed)
>  {
>         int i;
> -       struct page *page, *next;
>         LIST_HEAD(page_list);
>
>         if (hstate_is_gigantic(h))
> @@ -2694,6 +2693,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>          */
>         INIT_LIST_HEAD(&page_list);
>         for_each_node_mask(i, *nodes_allowed) {
> +               struct page *page, *next;
>                 struct list_head *freel = &h->hugepage_freelists[i];
>                 list_for_each_entry_safe(page, next, freel, lru) {
>                         if (count >= h->nr_huge_pages)
> @@ -2707,10 +2707,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>
>  out:
>         spin_unlock(&hugetlb_lock);
> -       list_for_each_entry_safe(page, next, &page_list, lru) {
> -               update_and_free_page(h, page);
> -               cond_resched();
> -       }
> +       update_and_free_pages_bulk(h, &page_list);
>         spin_lock(&hugetlb_lock);
>  }
>  #else
> @@ -2757,6 +2754,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                               nodemask_t *nodes_allowed)
>  {
>         unsigned long min_count, ret;
> +       struct page *page;
> +       LIST_HEAD(page_list);
>         NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
>
>         /*
> @@ -2869,11 +2868,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>         min_count = max(count, min_count);
>         try_to_free_low(h, min_count, nodes_allowed);
> +
> +       /*
> +        * Collect pages to be removed on list without dropping lock
> +        */
> +       INIT_LIST_HEAD(&page_list);

Same here.

>         while (min_count < persistent_huge_pages(h)) {
> -               if (!free_pool_huge_page(h, nodes_allowed, 0))
> +               page = remove_pool_huge_page(h, nodes_allowed, 0);
> +               if (!page)
>                         break;
> -               cond_resched_lock(&hugetlb_lock);
> +
> +               list_add(&page->lru, &page_list);
>         }
> +       /* free the pages after dropping lock */
> +       spin_unlock(&hugetlb_lock);
> +       update_and_free_pages_bulk(h, &page_list);
> +       spin_lock(&hugetlb_lock);
> +
>         while (count < persistent_huge_pages(h)) {
>                 if (!adjust_pool_surplus(h, nodes_allowed, 1))
>                         break;
> --
> 2.30.2
>



* Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-03-30  1:13   ` Roman Gushchin
  2021-03-30  1:20   ` Song Bao Hua (Barry Song)
@ 2021-03-30  8:01   ` Michal Hocko
  2021-03-30  8:08       ` Muchun Song
  2021-03-31  2:37     ` Mike Kravetz
  2 siblings, 2 replies; 31+ messages in thread
From: Michal Hocko @ 2021-03-30  8:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the bitmap to an irq safe spinlock.

I would phrase the changelog slightly differently:
"
cma_release is currently a sleepable operation because the bitmap
manipulation is protected by the cma->lock mutex. Hugetlb code, which relies
on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
irq safe.

The lock doesn't protect any sleepable operation so it can be changed to
an (irq aware) spin lock. The bitmap processing should be quite fast in
the typical case, but if cma sizes grow to TB then we will likely need to
replace the lock with a more optimized bitmap implementation.
"

It seems that you are overusing the irqsave variants even from contexts
which are never called from IRQ context, so they do not need to store flags.
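
To illustrate the distinction with a rough sketch (not part of the patch;
the function names below are made up): the _irqsave variant is only needed
on paths that may already run with interrupts disabled or in interrupt
context, because it saves and restores the previous irq state. A path that
is only ever entered from process context with interrupts enabled can use
the cheaper _irq variant.

	/* sketch only; cma_do_alloc()/cma_do_release() are not real functions */
	static void cma_do_alloc(struct cma *cma)
	{
		spin_lock_irq(&cma->lock);	/* irqs known to be enabled here */
		/* ... search and set bitmap bits ... */
		spin_unlock_irq(&cma->lock);	/* unconditionally re-enables irqs */
	}

	static void cma_do_release(struct cma *cma)
	{
		unsigned long flags;

		/* may be reached from irq context, so save/restore the irq state */
		spin_lock_irqsave(&cma->lock, flags);
		/* ... clear bitmap bits ... */
		spin_unlock_irqrestore(&cma->lock, flags);
	}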

[...]
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long start = 0;
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);

spin_lock_irq should be sufficient. This is only called from the
allocation context and that is never called from IRQ context.

>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	unsigned long pfn = -1;
>  	unsigned long start = 0;
>  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> +	unsigned long flags;
>  	size_t i;
>  	struct page *page = NULL;
>  	int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		goto out;
>  
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irqsave(&cma->lock, flags);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irqrestore(&cma->lock, flags);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);

same here.

> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irqrestore(&cma->lock, flags);
>  
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..6379cfbfd568 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
>  {
>  	struct cma *cma = data;
>  	unsigned long used;
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)used << cma->order_per_bit;

same here

>  
>  	return 0;
> @@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long maxchunk = 0;
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> +	unsigned long flags;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  	*val = (u64)maxchunk << cma->order_per_bit;
>  
>  	return 0;

and here.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-03-29 23:23 ` [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
  2021-03-30  1:13   ` Roman Gushchin
@ 2021-03-30  8:01   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2021-03-30  8:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Mon 29-03-21 16:23:56, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/hugetlb.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>  	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
> -		/*
> -		 * Temporarily drop the hugetlb_lock, because
> -		 * we might block in free_gigantic_page().
> -		 */
> -		spin_unlock(&hugetlb_lock);
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> -		spin_lock(&hugetlb_lock);
>  	} else {
>  		__free_pages(page, huge_page_order(h));
>  	}
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-29 23:24 ` [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
  2021-03-30  2:30     ` Muchun Song
@ 2021-03-30  8:06   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2021-03-30  8:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Mon 29-03-21 16:24:00, Mike Kravetz wrote:
> free_pool_huge_page was called with hugetlb_lock held.  It would remove
> a hugetlb page, and then free the corresponding pages to the lower level
> allocators such as buddy.  free_pool_huge_page was called in a loop to
> remove hugetlb pages and these loops could hold the hugetlb_lock for a
> considerable time.
> 
> Create new routine remove_pool_huge_page to replace free_pool_huge_page.
> remove_pool_huge_page will remove the hugetlb page, and it must be
> called with the hugetlb_lock held.  It will return the removed page and
> it is the responsibility of the caller to free the page to the lower
> level allocators.  The hugetlb_lock is dropped before freeing to these
> allocators which results in shorter lock hold times.
> 
> Add new helper routine to call update_and_free_page for a list of pages.
> 
> Note: Some changes to the routine return_unused_surplus_pages are in
> need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
> race when freeing surplus pages") modified this routine to address a
> race which could occur when dropping the hugetlb_lock in the loop that
> removes pool pages.  Accounting changes introduced in that commit were
> subtle and took some thought to understand.  This commit removes the
> cond_resched_lock() and the potential race.  Therefore, remove the
> subtle code and restore the more straight forward accounting effectively
> reverting the commit.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Please drop INIT_LIST_HEAD, which seems to be a leftover from rebasing
to use LIST_HEAD.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/hugetlb.c | 95 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 53 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dec7bd0dc63d..d3f3cb8766b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
>  }
>  
>  /*
> - * helper for free_pool_huge_page() - return the previously saved
> + * helper for remove_pool_huge_page() - return the previously saved
>   * node ["this node"] from which to free a huge page.  Advance the
>   * next node id whether or not we find a free huge page to free so
>   * that the next attempt to free addresses the next node.
> @@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	}
>  }
>  
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +{
> +	struct page *page, *t_page;
> +
> +	list_for_each_entry_safe(page, t_page, list, lru) {
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +}
> +
>  struct hstate *size_to_hstate(unsigned long size)
>  {
>  	struct hstate *h;
> @@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  }
>  
>  /*
> - * Free huge page from pool from next node to free.
> - * Attempt to keep persistent huge pages more or less
> - * balanced over allowed nodes.
> + * Remove huge page from pool from next node to free.  Attempt to keep
> + * persistent huge pages more or less balanced over allowed nodes.
> + * This routine only 'removes' the hugetlb page.  The caller must make
> + * an additional call to free the page to low level allocators.
>   * Called with hugetlb_lock locked.
>   */
> -static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> -							 bool acct_surplus)
> +static struct page *remove_pool_huge_page(struct hstate *h,
> +						nodemask_t *nodes_allowed,
> +						 bool acct_surplus)
>  {
>  	int nr_nodes, node;
> -	int ret = 0;
> +	struct page *page = NULL;
>  
>  	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>  		/*
> @@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  		 */
>  		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
>  		    !list_empty(&h->hugepage_freelists[node])) {
> -			struct page *page =
> -				list_entry(h->hugepage_freelists[node].next,
> +			page = list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
>  			remove_hugetlb_page(h, page, acct_surplus);
> -			/*
> -			 * unlock/lock around update_and_free_page is temporary
> -			 * and will be removed with subsequent patch.
> -			 */
> -			spin_unlock(&hugetlb_lock);
> -			update_and_free_page(h, page);
> -			spin_lock(&hugetlb_lock);
> -			ret = 1;
>  			break;
>  		}
>  	}
>  
> -	return ret;
> +	return page;
>  }
>  
>  /*
> @@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   *    to the associated reservation map.
>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>   *    the reservation.  As many as unused_resv_pages may be freed.
> - *
> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
> - * we must make sure nobody else can claim pages we are in the process of
> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
> - * number of huge pages we plan to free when dropping the lock.
>   */
>  static void return_unused_surplus_pages(struct hstate *h,
>  					unsigned long unused_resv_pages)
>  {
>  	unsigned long nr_pages;
> +	struct page *page;
> +	LIST_HEAD(page_list);
> +
> +	/* Uncommit the reservation */
> +	h->resv_huge_pages -= unused_resv_pages;
>  
>  	/* Cannot return gigantic pages currently */
>  	if (hstate_is_gigantic(h))
> @@ -2102,24 +2104,22 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	 * evenly across all nodes with memory. Iterate across these nodes
>  	 * until we can no longer free unreserved surplus pages. This occurs
>  	 * when the nodes with surplus pages have no free pages.
> -	 * free_pool_huge_page() will balance the freed pages across the
> +	 * remove_pool_huge_page() will balance the freed pages across the
>  	 * on-line nodes with memory and will handle the hstate accounting.
> -	 *
> -	 * Note that we decrement resv_huge_pages as we free the pages.  If
> -	 * we drop the lock, resv_huge_pages will still be sufficiently large
> -	 * to cover subsequent pages we may free.
>  	 */
> +	INIT_LIST_HEAD(&page_list);
>  	while (nr_pages--) {
> -		h->resv_huge_pages--;
> -		unused_resv_pages--;
> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> +		if (!page)
>  			goto out;
> -		cond_resched_lock(&hugetlb_lock);
> +
> +		list_add(&page->lru, &page_list);
>  	}
>  
>  out:
> -	/* Fully uncommit the reservation */
> -	h->resv_huge_pages -= unused_resv_pages;
> +	spin_unlock(&hugetlb_lock);
> +	update_and_free_pages_bulk(h, &page_list);
> +	spin_lock(&hugetlb_lock);
>  }
>  
>  
> @@ -2683,7 +2683,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  						nodemask_t *nodes_allowed)
>  {
>  	int i;
> -	struct page *page, *next;
>  	LIST_HEAD(page_list);
>  
>  	if (hstate_is_gigantic(h))
> @@ -2694,6 +2693,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  	 */
>  	INIT_LIST_HEAD(&page_list);
>  	for_each_node_mask(i, *nodes_allowed) {
> +		struct page *page, *next;
>  		struct list_head *freel = &h->hugepage_freelists[i];
>  		list_for_each_entry_safe(page, next, freel, lru) {
>  			if (count >= h->nr_huge_pages)
> @@ -2707,10 +2707,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  
>  out:
>  	spin_unlock(&hugetlb_lock);
> -	list_for_each_entry_safe(page, next, &page_list, lru) {
> -		update_and_free_page(h, page);
> -		cond_resched();
> -	}
> +	update_and_free_pages_bulk(h, &page_list);
>  	spin_lock(&hugetlb_lock);
>  }
>  #else
> @@ -2757,6 +2754,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  			      nodemask_t *nodes_allowed)
>  {
>  	unsigned long min_count, ret;
> +	struct page *page;
> +	LIST_HEAD(page_list);
>  	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
>  
>  	/*
> @@ -2869,11 +2868,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>  	min_count = max(count, min_count);
>  	try_to_free_low(h, min_count, nodes_allowed);
> +
> +	/*
> +	 * Collect pages to be removed on list without dropping lock
> +	 */
> +	INIT_LIST_HEAD(&page_list);
>  	while (min_count < persistent_huge_pages(h)) {
> -		if (!free_pool_huge_page(h, nodes_allowed, 0))
> +		page = remove_pool_huge_page(h, nodes_allowed, 0);
> +		if (!page)
>  			break;
> -		cond_resched_lock(&hugetlb_lock);
> +
> +		list_add(&page->lru, &page_list);
>  	}
> +	/* free the pages after dropping lock */
> +	spin_unlock(&hugetlb_lock);
> +	update_and_free_pages_bulk(h, &page_list);
> +	spin_lock(&hugetlb_lock);
> +
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, 1))
>  			break;
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  8:01   ` Michal Hocko
@ 2021-03-30  8:08       ` Muchun Song
  2021-03-31  2:37     ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  8:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Linux Memory Management List, LKML, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > Ideally, cma_release could be called from any context.  However, that is
> > not possible because a mutex is used to protect the per-area bitmap.
> > Change the bitmap to an irq safe spinlock.
>
> I would phrase the changelog slightly differerent
> "
> cma_release is currently a sleepable operatation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
>
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> "
>
> it seems that you are overusing irqsave variants even from context which
> are never called from the IRQ context so they do not need storing flags.
>
> [...]
> > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> >       unsigned long start = 0;
> >       unsigned long nr_part, nr_total = 0;
> >       unsigned long nbits = cma_bitmap_maxno(cma);
> > +     unsigned long flags;
> >
> > -     mutex_lock(&cma->lock);
> > +     spin_lock_irqsave(&cma->lock, flags);
>
> spin_lock_irq should be sufficient. This is only called from the
> allocation context and that is never called from IRQ context.

This makes me think more. I think that spin_lock should be
sufficient. Right?


>
> >       pr_info("number of available pages: ");
> >       for (;;) {
> >               next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> > @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
> >               start = next_zero_bit + nr_zero;
> >       }
> >       pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> > -     mutex_unlock(&cma->lock);
> > +     spin_unlock_irqrestore(&cma->lock, flags);
> >  }
> >  #else
> >  static inline void cma_debug_show_areas(struct cma *cma) { }
> > @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >       unsigned long pfn = -1;
> >       unsigned long start = 0;
> >       unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > +     unsigned long flags;
> >       size_t i;
> >       struct page *page = NULL;
> >       int ret = -ENOMEM;
> > @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >               goto out;
> >
> >       for (;;) {
> > -             mutex_lock(&cma->lock);
> > +             spin_lock_irqsave(&cma->lock, flags);
> >               bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> >                               bitmap_maxno, start, bitmap_count, mask,
> >                               offset);
> >               if (bitmap_no >= bitmap_maxno) {
> > -                     mutex_unlock(&cma->lock);
> > +                     spin_unlock_irqrestore(&cma->lock, flags);
> >                       break;
> >               }
> >               bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>
> same here.
>
> > @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >                * our exclusive use. If the migration fails we will take the
> >                * lock again and unmark it.
> >                */
> > -             mutex_unlock(&cma->lock);
> > +             spin_unlock_irqrestore(&cma->lock, flags);
> >
> >               pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> >               ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> > diff --git a/mm/cma.h b/mm/cma.h
> > index 68ffad4e430d..2c775877eae2 100644
> > --- a/mm/cma.h
> > +++ b/mm/cma.h
> > @@ -15,7 +15,7 @@ struct cma {
> >       unsigned long   count;
> >       unsigned long   *bitmap;
> >       unsigned int order_per_bit; /* Order of pages represented by one bit */
> > -     struct mutex    lock;
> > +     spinlock_t      lock;
> >  #ifdef CONFIG_CMA_DEBUGFS
> >       struct hlist_head mem_head;
> >       spinlock_t mem_head_lock;
> > diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> > index d5bf8aa34fdc..6379cfbfd568 100644
> > --- a/mm/cma_debug.c
> > +++ b/mm/cma_debug.c
> > @@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
> >  {
> >       struct cma *cma = data;
> >       unsigned long used;
> > +     unsigned long flags;
> >
> > -     mutex_lock(&cma->lock);
> > +     spin_lock_irqsave(&cma->lock, flags);
> >       /* pages counter is smaller than sizeof(int) */
> >       used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> > -     mutex_unlock(&cma->lock);
> > +     spin_unlock_irqrestore(&cma->lock, flags);
> >       *val = (u64)used << cma->order_per_bit;
>
> same here
>
> >
> >       return 0;
> > @@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
> >       unsigned long maxchunk = 0;
> >       unsigned long start, end = 0;
> >       unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> > +     unsigned long flags;
> >
> > -     mutex_lock(&cma->lock);
> > +     spin_lock_irqsave(&cma->lock, flags);
> >       for (;;) {
> >               start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
> >               if (start >= bitmap_maxno)
> > @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
> >               end = find_next_bit(cma->bitmap, bitmap_maxno, start);
> >               maxchunk = max(end - start, maxchunk);
> >       }
> > -     mutex_unlock(&cma->lock);
> > +     spin_unlock_irqrestore(&cma->lock, flags);
> >       *val = (u64)maxchunk << cma->order_per_bit;
> >
> >       return 0;
>
> and here.
> --
> Michal Hocko
> SUSE Labs


* RE: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  8:08       ` Muchun Song
  (?)
@ 2021-03-30  8:17       ` Song Bao Hua (Barry Song)
  -1 siblings, 0 replies; 31+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-03-30  8:17 UTC (permalink / raw)
  To: Muchun Song, Michal Hocko
  Cc: Mike Kravetz, Linux Memory Management List, LKML, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	linmiaohe, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Will Deacon, Andrew Morton



> -----Original Message-----
> From: Muchun Song [mailto:songmuchun@bytedance.com]
> Sent: Tuesday, March 30, 2021 9:09 PM
> To: Michal Hocko <mhocko@suse.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>; Linux Memory Management List
> <linux-mm@kvack.org>; LKML <linux-kernel@vger.kernel.org>; Roman Gushchin
> <guro@fb.com>; Shakeel Butt <shakeelb@google.com>; Oscar Salvador
> <osalvador@suse.de>; David Hildenbrand <david@redhat.com>; David Rientjes
> <rientjes@google.com>; linmiaohe <linmiaohe@huawei.com>; Peter Zijlstra
> <peterz@infradead.org>; Matthew Wilcox <willy@infradead.org>; HORIGUCHI NAOYA
> <naoya.horiguchi@nec.com>; Aneesh Kumar K . V <aneesh.kumar@linux.ibm.com>;
> Waiman Long <longman@redhat.com>; Peter Xu <peterx@redhat.com>; Mina Almasry
> <almasrymina@google.com>; Hillf Danton <hdanton@sina.com>; Joonsoo Kim
> <iamjoonsoo.kim@lge.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Will Deacon <will@kernel.org>; Andrew Morton
> <akpm@linux-foundation.org>
> Subject: Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe
> spinlock
> 
> On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > Ideally, cma_release could be called from any context.  However,
> > > that is not possible because a mutex is used to protect the per-area bitmap.
> > > Change the bitmap to an irq safe spinlock.
> >
> > I would phrase the changelog slightly differerent "
> > cma_release is currently a sleepable operatation because the bitmap
> > manipulation is protected by cma->lock mutex. Hugetlb code which
> > relies on cma_release for CMA backed (giga) hugetlb pages, however,
> > needs to be irq safe.
> >
> > The lock doesn't protect any sleepable operation so it can be changed
> > to a (irq aware) spin lock. The bitmap processing should be quite fast
> > in typical case but if cma sizes grow to TB then we will likely need
> > to replace the lock by a more optimized bitmap implementation.
> > "
> >
> > it seems that you are overusing irqsave variants even from context
> > which are never called from the IRQ context so they do not need storing flags.
> >
> > [...]
> > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > >       unsigned long start = 0;
> > >       unsigned long nr_part, nr_total = 0;
> > >       unsigned long nbits = cma_bitmap_maxno(cma);
> > > +     unsigned long flags;
> > >
> > > -     mutex_lock(&cma->lock);
> > > +     spin_lock_irqsave(&cma->lock, flags);
> >
> > spin_lock_irq should be sufficient. This is only called from the
> > allocation context and that is never called from IRQ context.
> 
> This makes me think more. I think that spin_lock should be sufficient. Right?
> 

It seems Mike's point is that cma_release might be called from both
irq context and process context.

If it is running in process context, we need to disable irqs so that an
interrupt arriving on the same CPU cannot jump in and call cma_release
while the lock is already held.

That said, we have not actually seen cma_release called from irq context
so far.

> 
> >
> > >       pr_info("number of available pages: ");
> > >       for (;;) {
> > >               next_zero_bit = find_next_zero_bit(cma->bitmap, nbits,
> > > start); @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma
> *cma)
> > >               start = next_zero_bit + nr_zero;
> > >       }
> > >       pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> > > -     mutex_unlock(&cma->lock);
> > > +     spin_unlock_irqrestore(&cma->lock, flags);
> > >  }
> > >  #else
> > >  static inline void cma_debug_show_areas(struct cma *cma) { } @@
> > > -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
> > >       unsigned long pfn = -1;
> > >       unsigned long start = 0;
> > >       unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > > +     unsigned long flags;
> > >       size_t i;
> > >       struct page *page = NULL;
> > >       int ret = -ENOMEM;
> > > @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
> > >               goto out;
> > >
> > >       for (;;) {
> > > -             mutex_lock(&cma->lock);
> > > +             spin_lock_irqsave(&cma->lock, flags);
> > >               bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > >                               bitmap_maxno, start, bitmap_count, mask,
> > >                               offset);
> > >               if (bitmap_no >= bitmap_maxno) {
> > > -                     mutex_unlock(&cma->lock);
> > > +                     spin_unlock_irqrestore(&cma->lock, flags);
> > >                       break;
> > >               }
> > >               bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> >
> > same here.
> >
> > > @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
> > >                * our exclusive use. If the migration fails we will take the
> > >                * lock again and unmark it.
> > >                */
> > > -             mutex_unlock(&cma->lock);
> > > +             spin_unlock_irqrestore(&cma->lock, flags);
> > >
> > >               pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> > >               ret = alloc_contig_range(pfn, pfn + count,
> > > MIGRATE_CMA, diff --git a/mm/cma.h b/mm/cma.h index
> > > 68ffad4e430d..2c775877eae2 100644
> > > --- a/mm/cma.h
> > > +++ b/mm/cma.h
> > > @@ -15,7 +15,7 @@ struct cma {
> > >       unsigned long   count;
> > >       unsigned long   *bitmap;
> > >       unsigned int order_per_bit; /* Order of pages represented by one bit
> */
> > > -     struct mutex    lock;
> > > +     spinlock_t      lock;
> > >  #ifdef CONFIG_CMA_DEBUGFS
> > >       struct hlist_head mem_head;
> > >       spinlock_t mem_head_lock;
> > > diff --git a/mm/cma_debug.c b/mm/cma_debug.c index
> > > d5bf8aa34fdc..6379cfbfd568 100644
> > > --- a/mm/cma_debug.c
> > > +++ b/mm/cma_debug.c
> > > @@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)  {
> > >       struct cma *cma = data;
> > >       unsigned long used;
> > > +     unsigned long flags;
> > >
> > > -     mutex_lock(&cma->lock);
> > > +     spin_lock_irqsave(&cma->lock, flags);
> > >       /* pages counter is smaller than sizeof(int) */
> > >       used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> > > -     mutex_unlock(&cma->lock);
> > > +     spin_unlock_irqrestore(&cma->lock, flags);
> > >       *val = (u64)used << cma->order_per_bit;
> >
> > same here
> >
> > >
> > >       return 0;
> > > @@ -52,8 +53,9 @@ static int cma_maxchunk_get(void *data, u64 *val)
> > >       unsigned long maxchunk = 0;
> > >       unsigned long start, end = 0;
> > >       unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> > > +     unsigned long flags;
> > >
> > > -     mutex_lock(&cma->lock);
> > > +     spin_lock_irqsave(&cma->lock, flags);
> > >       for (;;) {
> > >               start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
> > >               if (start >= bitmap_maxno)
> > > @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
> > >               end = find_next_bit(cma->bitmap, bitmap_maxno, start);
> > >               maxchunk = max(end - start, maxchunk);
> > >       }
> > > -     mutex_unlock(&cma->lock);
> > > +     spin_unlock_irqrestore(&cma->lock, flags);
> > >       *val = (u64)maxchunk << cma->order_per_bit;
> > >
> > >       return 0;
> >
> > and here.
> > --
> > Michal Hocko
> > SUSE Labs


* Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  8:08       ` Muchun Song
  (?)
  (?)
@ 2021-03-30  8:18       ` Michal Hocko
  2021-03-30  8:21           ` Muchun Song
  -1 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2021-03-30  8:18 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Linux Memory Management List, LKML, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue 30-03-21 16:08:36, Muchun Song wrote:
> On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > Ideally, cma_release could be called from any context.  However, that is
> > > not possible because a mutex is used to protect the per-area bitmap.
> > > Change the bitmap to an irq safe spinlock.
> >
> > I would phrase the changelog slightly differerent
> > "
> > cma_release is currently a sleepable operatation because the bitmap
> > manipulation is protected by cma->lock mutex. Hugetlb code which relies
> > on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> > irq safe.
> >
> > The lock doesn't protect any sleepable operation so it can be changed to
> > a (irq aware) spin lock. The bitmap processing should be quite fast in
> > typical case but if cma sizes grow to TB then we will likely need to
> > replace the lock by a more optimized bitmap implementation.
> > "
> >
> > it seems that you are overusing irqsave variants even from context which
> > are never called from the IRQ context so they do not need storing flags.
> >
> > [...]
> > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > >       unsigned long start = 0;
> > >       unsigned long nr_part, nr_total = 0;
> > >       unsigned long nbits = cma_bitmap_maxno(cma);
> > > +     unsigned long flags;
> > >
> > > -     mutex_lock(&cma->lock);
> > > +     spin_lock_irqsave(&cma->lock, flags);
> >
> > spin_lock_irq should be sufficient. This is only called from the
> > allocation context and that is never called from IRQ context.
> 
> This makes me think more. I think that spin_lock should be
> sufficient. Right?

Nope. Think of the following scenario
	spin_lock(cma->lock);
	<IRQ>
	put_page
	  __free_huge_page
	    cma_release
	      spin_lock_irqsave() DEADLOCK
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  8:18       ` Michal Hocko
@ 2021-03-30  8:21           ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2021-03-30  8:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Linux Memory Management List, LKML, Roman Gushchin,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue, Mar 30, 2021 at 4:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 30-03-21 16:08:36, Muchun Song wrote:
> > On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > > Ideally, cma_release could be called from any context.  However, that is
> > > > not possible because a mutex is used to protect the per-area bitmap.
> > > > Change the bitmap to an irq safe spinlock.
> > >
> > > I would phrase the changelog slightly differerent
> > > "
> > > cma_release is currently a sleepable operatation because the bitmap
> > > manipulation is protected by cma->lock mutex. Hugetlb code which relies
> > > on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> > > irq safe.
> > >
> > > The lock doesn't protect any sleepable operation so it can be changed to
> > > a (irq aware) spin lock. The bitmap processing should be quite fast in
> > > typical case but if cma sizes grow to TB then we will likely need to
> > > replace the lock by a more optimized bitmap implementation.
> > > "
> > >
> > > it seems that you are overusing irqsave variants even from context which
> > > are never called from the IRQ context so they do not need storing flags.
> > >
> > > [...]
> > > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > > >       unsigned long start = 0;
> > > >       unsigned long nr_part, nr_total = 0;
> > > >       unsigned long nbits = cma_bitmap_maxno(cma);
> > > > +     unsigned long flags;
> > > >
> > > > -     mutex_lock(&cma->lock);
> > > > +     spin_lock_irqsave(&cma->lock, flags);
> > >
> > > spin_lock_irq should be sufficient. This is only called from the
> > > allocation context and that is never called from IRQ context.
> >
> > This makes me think more. I think that spin_lock should be
> > sufficient. Right?
>
> Nope. Think of the following scenario
>         spin_lock(cma->lock);
>         <IRQ>
>         put_page
>           __free_huge_page
>             cma_release
>               spin_lock_irqsave() DEADLOCK

Got it. Thanks.

> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-30  8:01   ` Michal Hocko
  2021-03-30  8:08       ` Muchun Song
@ 2021-03-31  2:37     ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-31  2:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On 3/30/21 1:01 AM, Michal Hocko wrote:
> On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
>> Ideally, cma_release could be called from any context.  However, that is
>> not possible because a mutex is used to protect the per-area bitmap.
>> Change the bitmap to an irq safe spinlock.
> 
> I would phrase the changelog slightly differerent
> "
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by the cma->lock mutex. Hugetlb code, which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation, so it can be changed to
> an (irq aware) spin lock. The bitmap processing should be quite fast in
> the typical case, but if cma sizes grow to TB then we will likely need to
> replace the lock with a more optimized bitmap implementation.
> "

That is better.  Thank you.

> 
> It seems that you are overusing irqsave variants even from contexts which
> are never called from IRQ context, so they do not need to store flags.
> 
> [...]

Yes.

>> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>>  	unsigned long start = 0;
>>  	unsigned long nr_part, nr_total = 0;
>>  	unsigned long nbits = cma_bitmap_maxno(cma);
>> +	unsigned long flags;
>>  
>> -	mutex_lock(&cma->lock);
>> +	spin_lock_irqsave(&cma->lock, flags);
> 
> spin_lock_irq should be sufficient. This is only called from the
> allocation context and that is never called from IRQ context.
> 

I will change this and those below.

Thanks for your continued reviews and patience.
-- 
Mike Kravetz
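
[A rough sketch of where the agreed change lands in mm/cma.c: the debug and
allocation side can use the _irq variants, while the bitmap clearing reached
from cma_release() keeps irqsave/irqrestore.  Function names and signatures
are reproduced from memory of the v5.12-era source and the bodies are elided,
so treat this as an approximation of the eventual patch, not a quote from it.]

static void cma_debug_show_areas(struct cma *cma)
{
	/* Debug/allocation context only, never IRQ: no flags to save. */
	spin_lock_irq(&cma->lock);
	/* ... walk cma->bitmap and print the free ranges ... */
	spin_unlock_irq(&cma->lock);
}

static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
			     unsigned int count)
{
	unsigned long flags;

	/* Reached via cma_release(), possibly with IRQs already disabled. */
	spin_lock_irqsave(&cma->lock, flags);
	/* ... bitmap_clear() the bits covering [pfn, pfn + count) ... */
	spin_unlock_irqrestore(&cma->lock, flags);
}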


* Re: [External] [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-30  2:21     ` Muchun Song
  (?)
@ 2021-03-31  2:39     ` Mike Kravetz
  -1 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2021-03-31  2:39 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On 3/29/21 7:21 PM, Muchun Song wrote:
> On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>  mm/hugetlb.c | 32 +++++++++++++++++++++++++++-----
>>  1 file changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 16beabbbbe49..dec7bd0dc63d 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
>>
>>         if (HPageTemporary(page)) {
>>                 remove_hugetlb_page(h, page, false);
>> +               spin_unlock(&hugetlb_lock);
>>                 update_and_free_page(h, page);
>>         } else if (h->surplus_huge_pages_node[nid]) {
>>                 /* remove the page from active list */
>>                 remove_hugetlb_page(h, page, true);
>> +               spin_unlock(&hugetlb_lock);
>>                 update_and_free_page(h, page);
>>         } else {
>>                 arch_clear_hugepage_flags(page);
>>                 enqueue_huge_page(h, page);
>> +               spin_unlock(&hugetlb_lock);
>>         }
>> -       spin_unlock(&hugetlb_lock);
>>  }
>>
>>  /*
>> @@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>>                                 list_entry(h->hugepage_freelists[node].next,
>>                                           struct page, lru);
>>                         remove_hugetlb_page(h, page, acct_surplus);
>> +                       /*
>> +                        * unlock/lock around update_and_free_page is temporary
>> +                        * and will be removed with subsequent patch.
>> +                        */
>> +                       spin_unlock(&hugetlb_lock);
>>                         update_and_free_page(h, page);
>> +                       spin_lock(&hugetlb_lock);
>>                         ret = 1;
>>                         break;
>>                 }
>> @@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
>>                 }
>>                 remove_hugetlb_page(h, page, false);
>>                 h->max_huge_pages--;
>> +               spin_unlock(&hugetlb_lock);
>>                 update_and_free_page(h, head);
>> -               rc = 0;
>> +               return 0;
>>         }
>>  out:
>>         spin_unlock(&hugetlb_lock);
>> @@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>>                                                 nodemask_t *nodes_allowed)
>>  {
>>         int i;
>> +       struct page *page, *next;
>> +       LIST_HEAD(page_list);
>>
>>         if (hstate_is_gigantic(h))
>>                 return;
>>
>> +       /*
>> +        * Collect pages to be freed on a list, and free after dropping lock
>> +        */
>> +       INIT_LIST_HEAD(&page_list);
> 
> INIT_LIST_HEAD is unnecessary because the LIST_HEAD macro
> already initializes the list_head structure.
> 

Thanks.
I will fix this here and the same issue in patch 6.
-- 
Mike Kravetz
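
[As a reader's aid, roughly where the hunk above ends up once INIT_LIST_HEAD
is dropped: pages are collected on a local list while hugetlb_lock is held
and only freed after the lock is released.  This is a sketch reconstructed
from the discussion, not the committed code.]

static void try_to_free_low(struct hstate *h, unsigned long count,
			    nodemask_t *nodes_allowed)
{
	int i;
	struct page *page, *next;
	LIST_HEAD(page_list);	/* LIST_HEAD() already initializes the list */

	if (hstate_is_gigantic(h))
		return;

	/* Collect pages to be freed while still holding hugetlb_lock. */
	for_each_node_mask(i, *nodes_allowed) {
		struct list_head *freel = &h->hugepage_freelists[i];

		list_for_each_entry_safe(page, next, freel, lru) {
			if (count >= h->nr_huge_pages)
				goto out;
			if (PageHighMem(page))
				continue;
			remove_hugetlb_page(h, page, false);
			list_add(&page->lru, &page_list);
		}
	}

out:
	/* Drop the lock, then free the collected pages without holding it. */
	spin_unlock(&hugetlb_lock);
	list_for_each_entry_safe(page, next, &page_list, lru)
		update_and_free_page(h, page);
	spin_lock(&hugetlb_lock);
}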


end of thread

Thread overview: 31+ messages
2021-03-29 23:23 [PATCH v2 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
2021-03-29 23:23 ` [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
2021-03-30  1:13   ` Roman Gushchin
2021-03-30  1:20   ` Song Bao Hua (Barry Song)
2021-03-30  2:18     ` Mike Kravetz
2021-03-30  8:01   ` Michal Hocko
2021-03-30  8:08     ` [External] " Muchun Song
2021-03-30  8:08       ` Muchun Song
2021-03-30  8:17       ` Song Bao Hua (Barry Song)
2021-03-30  8:18       ` Michal Hocko
2021-03-30  8:21         ` Muchun Song
2021-03-30  8:21           ` Muchun Song
2021-03-31  2:37     ` Mike Kravetz
2021-03-29 23:23 ` [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
2021-03-30  1:13   ` Roman Gushchin
2021-03-30  8:01   ` Michal Hocko
2021-03-29 23:23 ` [PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
2021-03-30  2:23   ` [External] " Muchun Song
2021-03-30  2:23     ` Muchun Song
2021-03-29 23:23 ` [PATCH v2 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
2021-03-29 23:23 ` [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
2021-03-30  2:10   ` Miaohe Lin
2021-03-30  2:21   ` [External] " Muchun Song
2021-03-30  2:21     ` Muchun Song
2021-03-31  2:39     ` Mike Kravetz
2021-03-29 23:24 ` [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
2021-03-30  2:30   ` [External] " Muchun Song
2021-03-30  2:30     ` Muchun Song
2021-03-30  8:06   ` Michal Hocko
2021-03-29 23:24 ` [PATCH v2 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
2021-03-29 23:24 ` [PATCH v2 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
