* [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts
@ 2021-03-31  3:41 Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page path:
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page path already has code to 'hand off' page
free requests to a workqueue, a suggestion was made to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page
(free_huge_page) path should be properly fixed and made safe for all
calling contexts.
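
To illustrate the problem (hypothetical calling context, loosely modeled
on the TCP TX zerocopy case from the report; the function name is made
up):

	/* runs in softirq context, or with IRQs disabled */
	static void some_completion_handler(struct page *page)
	{
		/*
		 * If this drops the last reference to a hugetlb page,
		 * put_page() ends up in free_huge_page(), which takes
		 * hugetlb_lock.  Other hugetlb paths take hugetlb_lock
		 * without disabling IRQs/softirqs, hence the lockdep
		 * SOFTIRQ-safe -> SOFTIRQ-unsafe lock order report.
		 */
		put_page(page);
	}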

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear,
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com

v2 -> v3
- Update commit message in patch 1 as suggested by Michal
- Do not use spin_lock_irqsave/spin_unlock_irqrestore when we know we
  are in task context as suggested by Michal
- Remove unnecessary INIT_LIST_HEAD() as suggested by Muchun
  
v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c                |  18 +--
 mm/cma.h                |   2 +-
 mm/cma_debug.c          |   8 +-
 mm/hugetlb.c            | 337 +++++++++++++++++++++-------------------
 mm/hugetlb_cgroup.c     |   8 +-
 6 files changed, 195 insertions(+), 179 deletions(-)

-- 
2.30.2



* [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  7:25   ` Michal Hocko
  2021-03-31  7:52   ` David Hildenbrand
  2021-03-31  3:41 ` [PATCH v3 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

cma_release is currently a sleepable operation because the bitmap
manipulation is protected by the cma->lock mutex. Hugetlb code, which
relies on cma_release for CMA backed (giga) hugetlb pages, however,
needs to be irq safe.

The lock doesn't protect any sleepable operation so it can be changed
to an (irq aware) spinlock. The bitmap processing should be quite fast
in the typical case, but if cma sizes grow to TB then we will likely
need to replace the lock with a more optimized bitmap implementation.
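
The conversion below uses spin_lock_irqsave() only in cma_clear_bitmap(),
which can be reached from any context via cma_release(); paths known to
run in task context (cma_alloc() and the debugfs helpers) use
spin_lock_irq().  A minimal sketch of the two patterns (mirroring the
hunks below):

	unsigned long flags;

	/* callable from any context: save and restore the IRQ state */
	spin_lock_irqsave(&cma->lock, flags);
	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
	spin_unlock_irqrestore(&cma->lock, flags);

	/* known task context: unconditional IRQ disable/enable */
	spin_lock_irq(&cma->lock);
	bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
	spin_unlock_irq(&cma->lock);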

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/cma.c       | 18 +++++++++---------
 mm/cma.h       |  2 +-
 mm/cma_debug.c |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index b2393b892d3b..2380f2571eb5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include <linux/memblock.h>
 #include <linux/err.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
 			     unsigned int count)
 {
 	unsigned long bitmap_no, bitmap_count;
+	unsigned long flags;
 
 	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
 	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 	     pfn += pageblock_nr_pages)
 		init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-	mutex_init(&cma->lock);
+	spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
@@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
 	unsigned long nr_part, nr_total = 0;
 	unsigned long nbits = cma_bitmap_maxno(cma);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	pr_info("number of available pages: ");
 	for (;;) {
 		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
 		start = next_zero_bit + nr_zero;
 	}
 	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		goto out;
 
 	for (;;) {
-		mutex_lock(&cma->lock);
+		spin_lock_irq(&cma->lock);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
+			spin_unlock_irq(&cma->lock);
 			break;
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		 * our exclusive use. If the migration fails we will take the
 		 * lock again and unmark it.
 		 */
-		mutex_unlock(&cma->lock);
+		spin_unlock_irq(&cma->lock);
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
 	unsigned long   count;
 	unsigned long   *bitmap;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
-	struct mutex    lock;
+	spinlock_t	lock;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..2e7704955f4f 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
 	struct cma *cma = data;
 	unsigned long used;
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	/* pages counter is smaller than sizeof(int) */
 	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 	*val = (u64)used << cma->order_per_bit;
 
 	return 0;
@@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
 	unsigned long start, end = 0;
 	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	for (;;) {
 		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
 		if (start >= bitmap_maxno)
@@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
 		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
 		maxchunk = max(end - start, maxchunk);
 	}
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 	*val = (u64)maxchunk << cma->order_per_bit;
 
 	return 0;
-- 
2.30.2



* [PATCH v3 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling it.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
-- 
2.30.2



* [PATCH v3 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page cannot
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel, or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages
or nr_hugepages_mempolicy files to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.
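
The resulting locking pattern in set_max_huge_pages() is (sketch of the
hunks below):

	mutex_lock(&h->resize_lock);	/* serialize user-initiated resizes */
	spin_lock(&hugetlb_lock);
	...
	spin_unlock(&hugetlb_lock);
	mutex_unlock(&h->resize_lock);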

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c            | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+	struct mutex resize_lock;
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	else
 		return -ENOMEM;
 
+	/*
+	 * resize_lock mutex prevents concurrent adjustments to number of
+	 * pages in hstate via the proc/sysfs interfaces.
+	 */
+	mutex_lock(&h->resize_lock);
 	spin_lock(&hugetlb_lock);
 
 	/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
 	h = &hstates[hugetlb_max_hstate++];
+	mutex_init(&h->resize_lock);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



* [PATCH v3 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-03-31  3:41 ` [PATCH v3 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.
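
A typical caller then looks like this (sketch; at this point in the
series the hugetlb_lock is still held across both calls):

	/* with hugetlb_lock held */
	remove_hugetlb_page(h, page, false);
	update_and_free_page(h, page);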

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 67 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..16beabbbbe49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1331,6 +1331,43 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	int nid = page_to_nid(page);
+
+	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+		return;
+
+	list_del(&page->lru);
+
+	if (HPageFreed(page)) {
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ClearHPageFreed(page);
+	}
+	if (adjust_surplus) {
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
+	}
+
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+	ClearHPageTemporary(page);
+	set_page_refcounted(page);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1339,8 +1376,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	h->nr_huge_pages--;
-	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1383,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1450,12 @@ static void __free_huge_page(struct page *page)
 		h->resv_huge_pages++;
 
 	if (HPageTemporary(page)) {
-		list_del(&page->lru);
-		ClearHPageTemporary(page);
+		remove_hugetlb_page(h, page, false);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
-		list_del(&page->lru);
+		remove_hugetlb_page(h, page, true);
 		update_and_free_page(h, page);
-		h->surplus_huge_pages--;
-		h->surplus_huge_pages_node[nid]--;
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
@@ -1712,13 +1740,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 			struct page *page =
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[node]--;
-			if (acct_surplus) {
-				h->surplus_huge_pages--;
-				h->surplus_huge_pages_node[node]--;
-			}
+			remove_hugetlb_page(h, page, acct_surplus);
 			update_and_free_page(h, page);
 			ret = 1;
 			break;
@@ -1756,7 +1778,6 @@ int dissolve_free_huge_page(struct page *page)
 	if (!page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
-		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
 
@@ -1787,9 +1808,7 @@ int dissolve_free_huge_page(struct page *page)
 			SetPageHWPoison(page);
 			ClearPageHWPoison(head);
 		}
-		list_del(&head->lru);
-		h->free_huge_pages--;
-		h->free_huge_pages_node[nid]--;
+		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
 		rc = 0;
@@ -2667,10 +2686,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 				return;
 			if (PageHighMem(page))
 				continue;
-			list_del(&page->lru);
+			remove_hugetlb_page(h, page, false);
 			update_and_free_page(h, page);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
-- 
2.30.2



* [PATCH v3 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (3 preceding siblings ...)
  2021-03-31  3:41 ` [PATCH v3 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page, reducing
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.
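
The resulting pattern in the callers is (sketch of the hunks below):

	/* with hugetlb_lock held */
	remove_hugetlb_page(h, page, false);
	spin_unlock(&hugetlb_lock);
	/* can be slow for large pages; now done outside the lock */
	update_and_free_page(h, page);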

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16beabbbbe49..ac4be941a3e5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
+		spin_unlock(&hugetlb_lock);
 	}
-	spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
+			/*
+			 * unlock/lock around update_and_free_page is temporary
+			 * and will be removed with subsequent patch.
+			 */
+			spin_unlock(&hugetlb_lock);
 			update_and_free_page(h, page);
+			spin_lock(&hugetlb_lock);
 			ret = 1;
 			break;
 		}
@@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, head);
-		rc = 0;
+		return 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
@@ -2674,22 +2683,34 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
+	struct page *page, *next;
+	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
 		return;
 
+	/*
+	 * Collect pages to be freed on a list, and free after dropping lock
+	 */
 	for_each_node_mask(i, *nodes_allowed) {
-		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
-				return;
+				goto out;
 			if (PageHighMem(page))
 				continue;
 			remove_hugetlb_page(h, page, false);
-			update_and_free_page(h, page);
+			list_add(&page->lru, &page_list);
 		}
 	}
+
+out:
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2



* [PATCH v3 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (4 preceding siblings ...)
  2021-03-31  3:41 ` [PATCH v3 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
  2021-03-31  3:41 ` [PATCH v3 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting, effectively
reverting the commit.
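
The resulting pattern in set_max_huge_pages() is (sketch of the hunk
below):

	/* collect pages to be freed while holding hugetlb_lock */
	while (min_count < persistent_huge_pages(h)) {
		page = remove_pool_huge_page(h, nodes_allowed, 0);
		if (!page)
			break;
		list_add(&page->lru, &page_list);
	}
	/* free the pages after dropping the lock */
	spin_unlock(&hugetlb_lock);
	update_and_free_pages_bulk(h, &page_list);
	spin_lock(&hugetlb_lock);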

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 93 ++++++++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac4be941a3e5..5b2ca4663d7f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+{
+	struct page *page, *t_page;
+
+	list_for_each_entry_safe(page, t_page, list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
 	struct hstate *h;
@@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+						nodemask_t *nodes_allowed,
+						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	int ret = 0;
+	struct page *page = NULL;
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
@@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[node].next,
+			page = list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
-			/*
-			 * unlock/lock around update_and_free_page is temporary
-			 * and will be removed with subsequent patch.
-			 */
-			spin_unlock(&hugetlb_lock);
-			update_and_free_page(h, page);
-			spin_lock(&hugetlb_lock);
-			ret = 1;
 			break;
 		}
 	}
 
-	return ret;
+	return page;
 }
 
 /*
@@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
  *    to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *    the reservation.  As many as unused_resv_pages may be freed.
- *
- * Called with hugetlb_lock held.  However, the lock could be dropped (and
- * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
- * we must make sure nobody else can claim pages we are in the process of
- * freeing.  Do this by ensuring resv_huge_page always is greater than the
- * number of huge pages we plan to free when dropping the lock.
  */
 static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	struct page *page;
+	LIST_HEAD(page_list);
+
+	/* Uncommit the reservation */
+	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -2102,24 +2104,21 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_huge_page() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
-	 *
-	 * Note that we decrement resv_huge_pages as we free the pages.  If
-	 * we drop the lock, resv_huge_pages will still be sufficiently large
-	 * to cover subsequent pages we may free.
 	 */
 	while (nr_pages--) {
-		h->resv_huge_pages--;
-		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+		if (!page)
 			goto out;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
 
 out:
-	/* Fully uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
 }
 
 
@@ -2683,7 +2682,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
-	struct page *page, *next;
 	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
@@ -2693,6 +2691,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	 * Collect pages to be freed on a list, and free after dropping lock
 	 */
 	for_each_node_mask(i, *nodes_allowed) {
+		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
@@ -2706,10 +2705,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 
 out:
 	spin_unlock(&hugetlb_lock);
-	list_for_each_entry_safe(page, next, &page_list, lru) {
-		update_and_free_page(h, page);
-		cond_resched();
-	}
+	update_and_free_pages_bulk(h, &page_list);
 	spin_lock(&hugetlb_lock);
 }
 #else
@@ -2756,6 +2752,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	struct page *page;
+	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
 	/*
@@ -2868,11 +2866,22 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);
+
+	/*
+	 * Collect pages to be removed on list without dropping lock
+	 */
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, nodes_allowed, 0))
+		page = remove_pool_huge_page(h, nodes_allowed, 0);
+		if (!page)
 			break;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
+	/* free the pages after dropping lock */
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
+
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
-- 
2.30.2



* [PATCH v3 7/8] hugetlb: make free_huge_page irq safe
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (5 preceding siblings ...)
  2021-03-31  3:41 ` [PATCH v3 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  2021-04-02 12:47   ` [External] " Muchun Song
  2021-03-31  3:41 ` [PATCH v3 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  7 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases, as pointed out by the 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:        ----                    ----
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel later explained that this is very likely the TCP TX zerocopy
from hugetlb pages scenario, where the networking code drops the last
reference to a hugetlb page while having IRQs disabled.  The hugetlb
freeing path doesn't disable IRQs while holding hugetlb_lock, so such a
lock dependency chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.
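
The resulting lock usage is (sketch of the hunks below): free_huge_page()
can run with IRQs already disabled or from softirq context, so it must
save and restore the IRQ state, while paths that only run in task
context can use the cheaper unconditional variants.

	/* free_huge_page(): may run in any context */
	unsigned long flags;

	spin_lock_irqsave(&hugetlb_lock, flags);
	...
	spin_unlock_irqrestore(&hugetlb_lock, flags);

	/* proc/sysfs and other task-context paths */
	spin_lock_irq(&hugetlb_lock);
	...
	spin_unlock_irq(&hugetlb_lock);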

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c        | 167 ++++++++++++++++----------------------------
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 66 insertions(+), 109 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b2ca4663d7f..0bd4dc04df0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
 	return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+						unsigned long irq_flags)
 {
-	spin_unlock(&spool->lock);
+	spin_unlock_irqrestore(&spool->lock, irq_flags);
 
 	/* If no pages are used, and no other handles to the subpool
 	 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-	spin_lock(&spool->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&spool->lock, flags);
 	BUG_ON(!spool->count);
 	spool->count--;
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	if (!spool)
 		return ret;
 
-	spin_lock(&spool->lock);
+	spin_lock_irq(&spool->lock);
 
 	if (spool->max_hpages != -1) {		/* maximum size accounting */
 		if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	}
 
 unlock_ret:
-	spin_unlock(&spool->lock);
+	spin_unlock_irq(&spool->lock);
 	return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 				       long delta)
 {
 	long ret = delta;
+	unsigned long flags;
 
 	if (!spool)
 		return delta;
 
-	spin_lock(&spool->lock);
+	spin_lock_irqsave(&spool->lock, flags);
 
 	if (spool->max_hpages != -1)		/* maximum size accounting */
 		spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 	 * quota reference, free it now.
 	 */
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 
 	return ret;
 }
@@ -1412,7 +1416,7 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1422,6 +1426,7 @@ static void __free_huge_page(struct page *page)
 	int nid = page_to_nid(page);
 	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
 	bool restore_reserve;
+	unsigned long flags;
 
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1450,7 +1455,7 @@ static void __free_huge_page(struct page *page)
 			restore_reserve = true;
 	}
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	ClearHPageMigratable(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
@@ -1461,67 +1466,19 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
-		spin_unlock(&hugetlb_lock);
-	}
-}
-
-/*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
- *
- * free_hpage_workfn() locklessly retrieves the linked list of pages to
- * be freed and frees them one-by-one. As the page->mapping pointer is
- * going to be cleared in __free_huge_page() anyway, it is reused as the
- * llist_node structure of a lockless linked list of huge pages to be freed.
- */
-static LLIST_HEAD(hpage_freelist);
-
-static void free_hpage_workfn(struct work_struct *work)
-{
-	struct llist_node *node;
-	struct page *page;
-
-	node = llist_del_all(&hpage_freelist);
-
-	while (node) {
-		page = container_of((struct address_space **)node,
-				     struct page, mapping);
-		node = node->next;
-		__free_huge_page(page);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
 }
-static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
-
-void free_huge_page(struct page *page)
-{
-	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
-	 */
-	if (!in_task()) {
-		/*
-		 * Only call schedule_work() if hpage_freelist is previously
-		 * empty. Otherwise, schedule_work() had been called but the
-		 * workfn hasn't retrieved the list yet.
-		 */
-		if (llist_add((struct llist_node *)&page->mapping,
-			      &hpage_freelist))
-			schedule_work(&free_hpage_work);
-		return;
-	}
-
-	__free_huge_page(page);
-}
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
@@ -1530,11 +1487,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	ClearHPageFreed(page);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 }
 
 static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1780,7 +1737,7 @@ int dissolve_free_huge_page(struct page *page)
 	if (!PageHuge(page))
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(page)) {
 		rc = 0;
 		goto out;
@@ -1797,7 +1754,7 @@ int dissolve_free_huge_page(struct page *page)
 		 * when it is dissolved.
 		 */
 		if (unlikely(!HPageFreed(head))) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			cond_resched();
 
 			/*
@@ -1821,12 +1778,12 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		update_and_free_page(h, head);
 		return 0;
 	}
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return rc;
 }
 
@@ -1868,16 +1825,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
 		goto out_unlock;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
@@ -1887,7 +1844,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetHPageTemporary(page);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		put_page(page);
 		return NULL;
 	} else {
@@ -1896,7 +1853,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	}
 
 out_unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return page;
 }
@@ -1946,17 +1903,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask, gfp_t gfp_mask)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
 		struct page *page;
 
 		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
 		if (page) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			return page;
 		}
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
@@ -2004,7 +1961,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	ret = -ENOMEM;
 retry:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
 				NUMA_NO_NODE, NULL);
@@ -2021,7 +1978,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * After retaking hugetlb_lock, we need to recalculate 'needed'
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
 			(h->free_huge_pages + allocated);
 	if (needed > 0) {
@@ -2061,12 +2018,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		enqueue_huge_page(h, page);
 	}
 free:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
 		put_page(page);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
 }
@@ -2116,9 +2073,9 @@ static void return_unused_surplus_pages(struct hstate *h,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 
 
@@ -2463,7 +2420,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * glb_chg is passed to indicate whether or not a page must be taken
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -2471,7 +2428,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
 		if (!page)
 			goto out_uncharge_cgroup;
@@ -2479,7 +2436,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			SetHPageRestoreReserve(page);
 			h->resv_huge_pages--;
 		}
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		list_add(&page->lru, &h->hugepage_activelist);
 		/* Fall through */
 	}
@@ -2492,7 +2449,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 						  h_cg, page);
 	}
 
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	hugetlb_set_page_subpool(page, spool);
 
@@ -2704,9 +2661,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
@@ -2802,7 +2759,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 */
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
@@ -2832,14 +2789,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (!ret)
 			goto out;
 
@@ -2878,9 +2835,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		list_add(&page->lru, &page_list);
 	}
 	/* free the pages after dropping lock */
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
@@ -2888,7 +2845,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	}
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
@@ -3044,9 +3001,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	if (err)
 		return err;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_overcommit_huge_pages = input;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return count;
 }
@@ -3633,9 +3590,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 		goto out;
 
 	if (write) {
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		h->nr_overcommit_huge_pages = tmp;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 out:
 	return ret;
@@ -3731,7 +3688,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	if (!delta)
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * When cpuset is configured, it breaks the strict hugetlb page
 	 * reservation as the accounting is done on a global variable. Such
@@ -3770,7 +3727,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
@@ -5834,7 +5791,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 {
 	bool ret = true;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHeadHuge(page) ||
 	    !HPageMigratable(page) ||
 	    !get_page_unless_zero(page)) {
@@ -5844,16 +5801,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	ClearHPageMigratable(page);
 	list_move_tail(&page->lru, list);
 unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
 void putback_active_hugepage(struct page *page)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	SetHPageMigratable(page);
 	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	put_page(page);
 }
 
@@ -5887,12 +5844,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		 */
 		if (new_nid == old_nid)
 			return;
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (h->surplus_huge_pages_node[old_nid]) {
 			h->surplus_huge_pages_node[old_nid]--;
 			h->surplus_huge_pages_node[new_nid]++;
 		}
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 726b85f4f303..5383023d0cca 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		idx = 0;
 		for_each_hstate(h) {
-			spin_lock(&hugetlb_lock);
+			spin_lock_irq(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);
 
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			idx++;
 		}
 		cond_resched();
@@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	if (hugetlb_cgroup_disabled())
 		return;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h_cg = hugetlb_cgroup_from_page(oldhpage);
 	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
 	set_hugetlb_cgroup(oldhpage, NULL);
@@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	set_hugetlb_cgroup(newhpage, h_cg);
 	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return;
 }
 
-- 
2.30.2



* [PATCH v3 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (6 preceding siblings ...)
  2021-03-31  3:41 ` [PATCH v3 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-03-31  3:41 ` Mike Kravetz
  7 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-03-31  3:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton, Mike Kravetz

After making the hugetlb_lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held() calls to help verify
locking.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bd4dc04df0f..c22111f3da20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1068,6 +1068,8 @@ static void __enqueue_huge_page(struct list_head *list, struct page *page)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
+
+	lockdep_assert_held(&hugetlb_lock);
 	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
@@ -1078,6 +1080,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	struct page *page;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+	lockdep_assert_held(&hugetlb_lock);
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
 		if (pin && !is_pinnable_page(page))
 			continue;
@@ -1346,6 +1349,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 {
 	int nid = page_to_nid(page);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
@@ -1701,6 +1705,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 	int nr_nodes, node;
 	struct page *page = NULL;
 
+	lockdep_assert_held(&hugetlb_lock);
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
 		 * If we're returning unused surplus pages, only examine
@@ -1950,6 +1955,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	long needed, allocated;
 	bool alloc_ok = true;
 
+	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
@@ -2043,6 +2049,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	struct page *page;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
@@ -2641,6 +2648,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	int i;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h))
 		return;
 
@@ -2682,6 +2690,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 {
 	int nr_nodes, node;
 
+	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-03-31  7:25   ` Michal Hocko
  2021-03-31  7:52   ` David Hildenbrand
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2021-03-31  7:25 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Tue 30-03-21 20:41:41, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/cma.c       | 18 +++++++++---------
>  mm/cma.h       |  2 +-
>  mm/cma_debug.c |  8 ++++----
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..2380f2571eb5 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include <linux/memblock.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> -#include <linux/mutex.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>  			     unsigned int count)
>  {
>  	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>  
>  	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>  	     pfn += pageblock_nr_pages)
>  		init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		goto out;
>  
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irq(&cma->lock);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irq(&cma->lock);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irq(&cma->lock);
>  
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>  	struct cma *cma = data;
>  	unsigned long used;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)used << cma->order_per_bit;
>  
>  	return 0;
> @@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)maxchunk << cma->order_per_bit;
>  
>  	return 0;
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-03-31  7:25   ` Michal Hocko
@ 2021-03-31  7:52   ` David Hildenbrand
  1 sibling, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-03-31  7:52 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon, Andrew Morton

On 31.03.21 05:41, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> an (irq aware) spin lock. The bitmap processing should be quite fast in
> the typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/cma.c       | 18 +++++++++---------
>   mm/cma.h       |  2 +-
>   mm/cma_debug.c |  8 ++++----
>   3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..2380f2571eb5 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>   #include <linux/memblock.h>
>   #include <linux/err.h>
>   #include <linux/mm.h>
> -#include <linux/mutex.h>
>   #include <linux/sizes.h>
>   #include <linux/slab.h>
>   #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>   			     unsigned int count)
>   {
>   	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>   
>   	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>   	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>   }
>   
>   static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>   	     pfn += pageblock_nr_pages)
>   		init_cma_reserved_pageblock(pfn_to_page(pfn));
>   
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>   
>   #ifdef CONFIG_CMA_DEBUGFS
>   	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   	unsigned long nr_part, nr_total = 0;
>   	unsigned long nbits = cma_bitmap_maxno(cma);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	pr_info("number of available pages: ");
>   	for (;;) {
>   		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   		start = next_zero_bit + nr_zero;
>   	}
>   	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   }
>   #else
>   static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>   		goto out;
>   
>   	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irq(&cma->lock);
>   		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   				bitmap_maxno, start, bitmap_count, mask,
>   				offset);
>   		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irq(&cma->lock);
>   			break;
>   		}
>   		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>   		 * our exclusive use. If the migration fails we will take the
>   		 * lock again and unmark it.
>   		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irq(&cma->lock);
>   
>   		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   	unsigned long   count;
>   	unsigned long   *bitmap;
>   	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>   #ifdef CONFIG_CMA_DEBUGFS
>   	struct hlist_head mem_head;
>   	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>   	struct cma *cma = data;
>   	unsigned long used;
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	/* pages counter is smaller than sizeof(int) */
>   	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   	*val = (u64)used << cma->order_per_bit;
>   
>   	return 0;
> @@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>   	unsigned long start, end = 0;
>   	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	for (;;) {
>   		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>   		if (start >= bitmap_maxno)
> @@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>   		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>   		maxchunk = max(end - start, maxchunk);
>   	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   	*val = (u64)maxchunk << cma->order_per_bit;
>   
>   	return 0;
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe
  2021-03-31  3:41 ` [PATCH v3 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-04-02 12:47   ` Muchun Song
  2021-04-02 20:55     ` Mike Kravetz
  0 siblings, 1 reply; 15+ messages in thread
From: Muchun Song @ 2021-04-02 12:47 UTC (permalink / raw)
  To: Mike Kravetz, Oscar Salvador
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon,
	Andrew Morton

On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context.  That commit hands off free_huge_page
> processing to a workqueue if !in_task.  However, this doesn't cover
> all the cases as pointed out by 0day bot lockdep report [1].
>
> :  Possible interrupt unsafe locking scenario:
> :
> :        CPU0                    CPU1
> :        ----                    ----
> :   lock(hugetlb_lock);
> :                                local_irq_disable();
> :                                lock(slock-AF_INET);
> :                                lock(hugetlb_lock);
> :   <Interrupt>
> :     lock(slock-AF_INET);
>
> Shakeel has later explained that this is very likely TCP TX zerocopy
> from hugetlb pages scenario when the networking code drops a last
> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
> chain can lead to a deadlock.
>
> This commit addresses the issue by doing the following:
> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>   changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
>
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Hi Mike,

Today I pulled the newest code (next-20210401). I found that
alloc_and_dissolve_huge_page is not updated. In this function,
hugetlb_lock is still non-irq safe. Maybe you or Oscar need
to fix this.

Thanks.

> ---
>  mm/hugetlb.c        | 167 ++++++++++++++++----------------------------
>  mm/hugetlb_cgroup.c |   8 +--
>  2 files changed, 66 insertions(+), 109 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5b2ca4663d7f..0bd4dc04df0f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
>         return true;
>  }
>
> -static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
> +                                               unsigned long irq_flags)
>  {
> -       spin_unlock(&spool->lock);
> +       spin_unlock_irqrestore(&spool->lock, irq_flags);
>
>         /* If no pages are used, and no other handles to the subpool
>          * remain, give up any reservations based on minimum size and
> @@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
>
>  void hugepage_put_subpool(struct hugepage_subpool *spool)
>  {
> -       spin_lock(&spool->lock);
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&spool->lock, flags);
>         BUG_ON(!spool->count);
>         spool->count--;
> -       unlock_or_release_subpool(spool);
> +       unlock_or_release_subpool(spool, flags);
>  }
>
>  /*
> @@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>         if (!spool)
>                 return ret;
>
> -       spin_lock(&spool->lock);
> +       spin_lock_irq(&spool->lock);
>
>         if (spool->max_hpages != -1) {          /* maximum size accounting */
>                 if ((spool->used_hpages + delta) <= spool->max_hpages)
> @@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>         }
>
>  unlock_ret:
> -       spin_unlock(&spool->lock);
> +       spin_unlock_irq(&spool->lock);
>         return ret;
>  }
>
> @@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>                                        long delta)
>  {
>         long ret = delta;
> +       unsigned long flags;
>
>         if (!spool)
>                 return delta;
>
> -       spin_lock(&spool->lock);
> +       spin_lock_irqsave(&spool->lock, flags);
>
>         if (spool->max_hpages != -1)            /* maximum size accounting */
>                 spool->used_hpages -= delta;
> @@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>          * If hugetlbfs_put_super couldn't free spool due to an outstanding
>          * quota reference, free it now.
>          */
> -       unlock_or_release_subpool(spool);
> +       unlock_or_release_subpool(spool, flags);
>
>         return ret;
>  }
> @@ -1412,7 +1416,7 @@ struct hstate *size_to_hstate(unsigned long size)
>         return NULL;
>  }
>
> -static void __free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>         /*
>          * Can't pass hstate in here because it is called from the
> @@ -1422,6 +1426,7 @@ static void __free_huge_page(struct page *page)
>         int nid = page_to_nid(page);
>         struct hugepage_subpool *spool = hugetlb_page_subpool(page);
>         bool restore_reserve;
> +       unsigned long flags;
>
>         VM_BUG_ON_PAGE(page_count(page), page);
>         VM_BUG_ON_PAGE(page_mapcount(page), page);
> @@ -1450,7 +1455,7 @@ static void __free_huge_page(struct page *page)
>                         restore_reserve = true;
>         }
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irqsave(&hugetlb_lock, flags);
>         ClearHPageMigratable(page);
>         hugetlb_cgroup_uncharge_page(hstate_index(h),
>                                      pages_per_huge_page(h), page);
> @@ -1461,67 +1466,19 @@ static void __free_huge_page(struct page *page)
>
>         if (HPageTemporary(page)) {
>                 remove_hugetlb_page(h, page, false);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>                 update_and_free_page(h, page);
>         } else if (h->surplus_huge_pages_node[nid]) {
>                 /* remove the page from active list */
>                 remove_hugetlb_page(h, page, true);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>                 update_and_free_page(h, page);
>         } else {
>                 arch_clear_hugepage_flags(page);
>                 enqueue_huge_page(h, page);
> -               spin_unlock(&hugetlb_lock);
> -       }
> -}
> -
> -/*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> - *
> - * free_hpage_workfn() locklessly retrieves the linked list of pages to
> - * be freed and frees them one-by-one. As the page->mapping pointer is
> - * going to be cleared in __free_huge_page() anyway, it is reused as the
> - * llist_node structure of a lockless linked list of huge pages to be freed.
> - */
> -static LLIST_HEAD(hpage_freelist);
> -
> -static void free_hpage_workfn(struct work_struct *work)
> -{
> -       struct llist_node *node;
> -       struct page *page;
> -
> -       node = llist_del_all(&hpage_freelist);
> -
> -       while (node) {
> -               page = container_of((struct address_space **)node,
> -                                    struct page, mapping);
> -               node = node->next;
> -               __free_huge_page(page);
> +               spin_unlock_irqrestore(&hugetlb_lock, flags);
>         }
>  }
> -static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> -
> -void free_huge_page(struct page *page)
> -{
> -       /*
> -        * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> -        */
> -       if (!in_task()) {
> -               /*
> -                * Only call schedule_work() if hpage_freelist is previously
> -                * empty. Otherwise, schedule_work() had been called but the
> -                * workfn hasn't retrieved the list yet.
> -                */
> -               if (llist_add((struct llist_node *)&page->mapping,
> -                             &hpage_freelist))
> -                       schedule_work(&free_hpage_work);
> -               return;
> -       }
> -
> -       __free_huge_page(page);
> -}
>
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
> @@ -1530,11 +1487,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>         hugetlb_set_page_subpool(page, NULL);
>         set_hugetlb_cgroup(page, NULL);
>         set_hugetlb_cgroup_rsvd(page, NULL);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h->nr_huge_pages++;
>         h->nr_huge_pages_node[nid]++;
>         ClearHPageFreed(page);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>  }
>
>  static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> @@ -1780,7 +1737,7 @@ int dissolve_free_huge_page(struct page *page)
>         if (!PageHuge(page))
>                 return 0;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (!PageHuge(page)) {
>                 rc = 0;
>                 goto out;
> @@ -1797,7 +1754,7 @@ int dissolve_free_huge_page(struct page *page)
>                  * when it is dissolved.
>                  */
>                 if (unlikely(!HPageFreed(head))) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         cond_resched();
>
>                         /*
> @@ -1821,12 +1778,12 @@ int dissolve_free_huge_page(struct page *page)
>                 }
>                 remove_hugetlb_page(h, page, false);
>                 h->max_huge_pages--;
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 update_and_free_page(h, head);
>                 return 0;
>         }
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return rc;
>  }
>
> @@ -1868,16 +1825,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>         if (hstate_is_gigantic(h))
>                 return NULL;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
>                 goto out_unlock;
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>         if (!page)
>                 return NULL;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * We could have raced with the pool size change.
>          * Double check that and simply deallocate the new page
> @@ -1887,7 +1844,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>          */
>         if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>                 SetHPageTemporary(page);
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 put_page(page);
>                 return NULL;
>         } else {
> @@ -1896,7 +1853,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>         }
>
>  out_unlock:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return page;
>  }
> @@ -1946,17 +1903,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>                 nodemask_t *nmask, gfp_t gfp_mask)
>  {
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (h->free_huge_pages - h->resv_huge_pages > 0) {
>                 struct page *page;
>
>                 page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
>                 if (page) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         return page;
>                 }
>         }
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
> @@ -2004,7 +1961,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>
>         ret = -ENOMEM;
>  retry:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         for (i = 0; i < needed; i++) {
>                 page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
>                                 NUMA_NO_NODE, NULL);
> @@ -2021,7 +1978,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>          * After retaking hugetlb_lock, we need to recalculate 'needed'
>          * because either resv_huge_pages or free_huge_pages may have changed.
>          */
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         needed = (h->resv_huge_pages + delta) -
>                         (h->free_huge_pages + allocated);
>         if (needed > 0) {
> @@ -2061,12 +2018,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>                 enqueue_huge_page(h, page);
>         }
>  free:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         /* Free unnecessary surplus pages to the buddy allocator */
>         list_for_each_entry_safe(page, tmp, &surplus_list, lru)
>                 put_page(page);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         return ret;
>  }
> @@ -2116,9 +2073,9 @@ static void return_unused_surplus_pages(struct hstate *h,
>         }
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         update_and_free_pages_bulk(h, &page_list);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>  }
>
>
> @@ -2463,7 +2420,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>         if (ret)
>                 goto out_uncharge_cgroup_reservation;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * glb_chg is passed to indicate whether or not a page must be taken
>          * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -2471,7 +2428,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>          */
>         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>         if (!page) {
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
>                 if (!page)
>                         goto out_uncharge_cgroup;
> @@ -2479,7 +2436,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                         SetHPageRestoreReserve(page);
>                         h->resv_huge_pages--;
>                 }
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 list_add(&page->lru, &h->hugepage_activelist);
>                 /* Fall through */
>         }
> @@ -2492,7 +2449,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                                                   h_cg, page);
>         }
>
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         hugetlb_set_page_subpool(page, spool);
>
> @@ -2704,9 +2661,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>         }
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         update_and_free_pages_bulk(h, &page_list);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> @@ -2802,7 +2759,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>          */
>         if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>                 if (count > persistent_huge_pages(h)) {
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         mutex_unlock(&h->resize_lock);
>                         NODEMASK_FREE(node_alloc_noretry);
>                         return -EINVAL;
> @@ -2832,14 +2789,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                  * page, free_huge_page will handle it by freeing the page
>                  * and reducing the surplus.
>                  */
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>
>                 /* yield cpu to avoid soft lockup */
>                 cond_resched();
>
>                 ret = alloc_pool_huge_page(h, nodes_allowed,
>                                                 node_alloc_noretry);
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 if (!ret)
>                         goto out;
>
> @@ -2878,9 +2835,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                 list_add(&page->lru, &page_list);
>         }
>         /* free the pages after dropping lock */
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         update_and_free_pages_bulk(h, &page_list);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         while (count < persistent_huge_pages(h)) {
>                 if (!adjust_pool_surplus(h, nodes_allowed, 1))
> @@ -2888,7 +2845,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>         }
>  out:
>         h->max_huge_pages = persistent_huge_pages(h);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         mutex_unlock(&h->resize_lock);
>
>         NODEMASK_FREE(node_alloc_noretry);
> @@ -3044,9 +3001,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>         if (err)
>                 return err;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h->nr_overcommit_huge_pages = input;
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return count;
>  }
> @@ -3633,9 +3590,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>                 goto out;
>
>         if (write) {
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 h->nr_overcommit_huge_pages = tmp;
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>         }
>  out:
>         return ret;
> @@ -3731,7 +3688,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>         if (!delta)
>                 return 0;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         /*
>          * When cpuset is configured, it breaks the strict hugetlb page
>          * reservation as the accounting is done on a global variable. Such
> @@ -3770,7 +3727,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>                 return_unused_surplus_pages(h, (unsigned long) -delta);
>
>  out:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return ret;
>  }
>
> @@ -5834,7 +5791,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  {
>         bool ret = true;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (!PageHeadHuge(page) ||
>             !HPageMigratable(page) ||
>             !get_page_unless_zero(page)) {
> @@ -5844,16 +5801,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>         ClearHPageMigratable(page);
>         list_move_tail(&page->lru, list);
>  unlock:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return ret;
>  }
>
>  void putback_active_hugepage(struct page *page)
>  {
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         SetHPageMigratable(page);
>         list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         put_page(page);
>  }
>
> @@ -5887,12 +5844,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>                  */
>                 if (new_nid == old_nid)
>                         return;
> -               spin_lock(&hugetlb_lock);
> +               spin_lock_irq(&hugetlb_lock);
>                 if (h->surplus_huge_pages_node[old_nid]) {
>                         h->surplus_huge_pages_node[old_nid]--;
>                         h->surplus_huge_pages_node[new_nid]++;
>                 }
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>         }
>  }
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 726b85f4f303..5383023d0cca 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>         do {
>                 idx = 0;
>                 for_each_hstate(h) {
> -                       spin_lock(&hugetlb_lock);
> +                       spin_lock_irq(&hugetlb_lock);
>                         list_for_each_entry(page, &h->hugepage_activelist, lru)
>                                 hugetlb_cgroup_move_parent(idx, h_cg, page);
>
> -                       spin_unlock(&hugetlb_lock);
> +                       spin_unlock_irq(&hugetlb_lock);
>                         idx++;
>                 }
>                 cond_resched();
> @@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>         if (hugetlb_cgroup_disabled())
>                 return;
>
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         h_cg = hugetlb_cgroup_from_page(oldhpage);
>         h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
>         set_hugetlb_cgroup(oldhpage, NULL);
> @@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>         set_hugetlb_cgroup(newhpage, h_cg);
>         set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
>         list_move(&newhpage->lru, &h->hugepage_activelist);
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>         return;
>  }
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe
  2021-04-02 12:47   ` [External] " Muchun Song
@ 2021-04-02 20:55     ` Mike Kravetz
  2021-04-03  5:59       ` Muchun Song
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2021-04-02 20:55 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, Andrew Morton
  Cc: Linux Memory Management List, LKML, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon

On 4/2/21 5:47 AM, Muchun Song wrote:
> On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>> non-task context") was added to address the issue of free_huge_page
>> being called from irq context.  That commit hands off free_huge_page
>> processing to a workqueue if !in_task.  However, this doesn't cover
>> all the cases as pointed out by 0day bot lockdep report [1].
>>
>> :  Possible interrupt unsafe locking scenario:
>> :
>> :        CPU0                    CPU1
>> :        ----                    ----
>> :   lock(hugetlb_lock);
>> :                                local_irq_disable();
>> :                                lock(slock-AF_INET);
>> :                                lock(hugetlb_lock);
>> :   <Interrupt>
>> :     lock(slock-AF_INET);
>>
>> Shakeel has later explained that this is very likely TCP TX zerocopy
>> from hugetlb pages scenario when the networking code drops a last
>> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
>> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
>> chain can lead to a deadlock.
>>
>> This commit addresses the issue by doing the following:
>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>   changing spin_*lock calls to spin_*lock_irq* calls.
>> - Make subpool lock irq safe in a similar manner.
>> - Revert the !in_task check and workqueue handoff.
>>
>> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
> Hi Mike,
> 
> Today I pulled the newest code (next-20210401). I found that
> alloc_and_dissolve_huge_page is not updated. In this function,
> hugetlb_lock is still non-irq safe. Maybe you or Oscar need
> to fix this.
> 
> Thanks.

Thank you Muchun,

Oscar's changes were not in Andrew's tree when I started on this series
and I failed to notice their inclusion.  In addition,
isolate_or_dissolve_huge_page also needs updating as well as a change in
set_max_huge_pages that was omitted while rebasing.

Andrew, the following patch addresses those missing changes.  Ideally,
the changes should be combined/included in this patch.  If you want me
to send another version of this patch or another series, let me know.

From 450593eb3cea895f499ddc343c22424c552ea502 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 2 Apr 2021 13:18:13 -0700
Subject: [PATCH] hugetlb: fix irq locking omissions

The pach "hugetlb: make free_huge_page irq safe" changed spin_*lock
calls to spin_*lock_irq* calls.  However, it missed several places
in the file hugetlb.c.  Add the overlooked changes.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c22111f3da20..a6bfc6bcbc81 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2284,7 +2284,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	 */
 	page_ref_dec(new_page);
 retry:
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(old_page)) {
 		/*
 		 * Freed from under us. Drop new_page too.
@@ -2297,7 +2297,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * Fail with -EBUSY if not possible.
 		 */
 		update_and_free_page(h, new_page);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		if (!isolate_huge_page(old_page, list))
 			ret = -EBUSY;
 		return ret;
@@ -2307,7 +2307,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * freelist yet. Race window is small, so we can succed here if
 		 * we retry.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		cond_resched();
 		goto retry;
 	} else {
@@ -2323,7 +2323,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
 	}
 unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return ret;
 }
@@ -2339,15 +2339,15 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	 * to carefully check the state under the lock.
 	 * Return success when racing as if we dissolved the page ourselves.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (PageHuge(page)) {
 		head = compound_head(page);
 		h = page_hstate(head);
 	} else {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		return 0;
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/*
 	 * Fence off gigantic pages as there is a cyclic dependency between
@@ -2737,7 +2737,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * pages in hstate via the proc/sysfs interfaces.
 	 */
 	mutex_lock(&h->resize_lock);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	/*
 	 * Check for a node specific request.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe
  2021-04-02 20:55     ` Mike Kravetz
@ 2021-04-03  5:59       ` Muchun Song
  2021-04-04  0:00         ` Mike Kravetz
  0 siblings, 1 reply; 15+ messages in thread
From: Muchun Song @ 2021-04-03  5:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, Andrew Morton, Linux Memory Management List,
	LKML, Roman Gushchin, Michal Hocko, Shakeel Butt,
	David Hildenbrand, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon

On Sat, Apr 3, 2021 at 4:56 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 4/2/21 5:47 AM, Muchun Song wrote:
> > On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> >> non-task context") was added to address the issue of free_huge_page
> >> being called from irq context.  That commit hands off free_huge_page
> >> processing to a workqueue if !in_task.  However, this doesn't cover
> >> all the cases as pointed out by 0day bot lockdep report [1].
> >>
> >> :  Possible interrupt unsafe locking scenario:
> >> :
> >> :        CPU0                    CPU1
> >> :        ----                    ----
> >> :   lock(hugetlb_lock);
> >> :                                local_irq_disable();
> >> :                                lock(slock-AF_INET);
> >> :                                lock(hugetlb_lock);
> >> :   <Interrupt>
> >> :     lock(slock-AF_INET);
> >>
> >> Shakeel has later explained that this is very likely TCP TX zerocopy
> >> from hugetlb pages scenario when the networking code drops a last
> >> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
> >> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
> >> chain can lead to a deadlock.
> >>
> >> This commit addresses the issue by doing the following:
> >> - Make hugetlb_lock irq safe.  This is mostly a simple process of
> >>   changing spin_*lock calls to spin_*lock_irq* calls.
> >> - Make subpool lock irq safe in a similar manner.
> >> - Revert the !in_task check and workqueue handoff.
> >>
> >> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> Acked-by: Michal Hocko <mhocko@suse.com>
> >> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> >
> > Hi Mike,
> >
> > Today I pulled the newest code (next-20210401). I found that
> > alloc_and_dissolve_huge_page is not updated. In this function,
> > hugetlb_lock is still non-irq safe. Maybe you or Oscar need
> > to fix.
> >
> > Thanks.
>
> Thank you Muchun,
>
> Oscar's changes were not in Andrew's tree when I started on this series
> and I failed to notice their inclusion.  In addition,
> isolate_or_dissolve_huge_page also needs updating as well as a change in
> set_max_huge_pages that was omitted while rebasing.
>
> Andrew, the following patch addresses those missing changes.  Ideally,
> the changes should be combined/included in this patch.  If you want me
> to send another version of this patch or another series, let me know.
>
> From 450593eb3cea895f499ddc343c22424c552ea502 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 2 Apr 2021 13:18:13 -0700
> Subject: [PATCH] hugetlb: fix irq locking omissions
>
> The pach "hugetlb: make free_huge_page irq safe" changed spin_*lock
> calls to spin_*lock_irq* calls.  However, it missed several places
> in the file hugetlb.c.  Add the overlooked changes.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks Mike. But there are still some places that need
improvement. See below.

> ---
>  mm/hugetlb.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c22111f3da20..a6bfc6bcbc81 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2284,7 +2284,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>          */
>         page_ref_dec(new_page);
>  retry:
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (!PageHuge(old_page)) {
>                 /*
>                  * Freed from under us. Drop new_page too.
> @@ -2297,7 +2297,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>                  * Fail with -EBUSY if not possible.
>                  */
>                 update_and_free_page(h, new_page);

Now update_and_free_page can be called without holding
hugetlb_lock. We can move it out of hugetlb_lock. In this
function, there are 3 places which call update_and_free_page().
We can move all of them out of hugetlb_lock. Right?
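
Something like this is what I have in mind for the branch quoted above
(untested, just to show the reordering; the new update_and_free_page()
no longer touches anything protected by hugetlb_lock):

	} else if (page_count(old_page)) {
		/*
		 * Someone has grabbed the page, try to isolate it here.
		 * Fail with -EBUSY if not possible.
		 */
		spin_unlock_irq(&hugetlb_lock);
		/* new_page was never enqueued, free it without the lock */
		update_and_free_page(h, new_page);
		if (!isolate_huge_page(old_page, list))
			ret = -EBUSY;
		return ret;
	}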

This optimization can squash to the commit:

    "hugetlb: call update_and_free_page without hugetlb_lock"

Thanks.

> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 if (!isolate_huge_page(old_page, list))
>                         ret = -EBUSY;
>                 return ret;
> @@ -2307,7 +2307,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>                  * freelist yet. Race window is small, so we can succed here if
>                  * we retry.
>                  */
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 cond_resched();
>                 goto retry;
>         } else {
> @@ -2323,7 +2323,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>                 __enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
>         }
>  unlock:
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         return ret;
>  }
> @@ -2339,15 +2339,15 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>          * to carefully check the state under the lock.
>          * Return success when racing as if we dissolved the page ourselves.
>          */
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>         if (PageHuge(page)) {
>                 head = compound_head(page);
>                 h = page_hstate(head);
>         } else {
> -               spin_unlock(&hugetlb_lock);
> +               spin_unlock_irq(&hugetlb_lock);
>                 return 0;
>         }
> -       spin_unlock(&hugetlb_lock);
> +       spin_unlock_irq(&hugetlb_lock);
>
>         /*
>          * Fence off gigantic pages as there is a cyclic dependency between
> @@ -2737,7 +2737,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>          * pages in hstate via the proc/sysfs interfaces.
>          */
>         mutex_lock(&h->resize_lock);
> -       spin_lock(&hugetlb_lock);
> +       spin_lock_irq(&hugetlb_lock);
>
>         /*
>          * Check for a node specific request.
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe
  2021-04-03  5:59       ` Muchun Song
@ 2021-04-04  0:00         ` Mike Kravetz
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-04-04  0:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, Andrew Morton, Linux Memory Management List,
	LKML, Roman Gushchin, Michal Hocko, Shakeel Butt,
	David Hildenbrand, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon

On 4/2/21 10:59 PM, Muchun Song wrote:
> On Sat, Apr 3, 2021 at 4:56 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 4/2/21 5:47 AM, Muchun Song wrote:
>>> On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>>
>>>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>>>> non-task context") was added to address the issue of free_huge_page
>>>> being called from irq context.  That commit hands off free_huge_page
>>>> processing to a workqueue if !in_task.  However, this doesn't cover
>>>> all the cases as pointed out by 0day bot lockdep report [1].
>>>>
>>>> :  Possible interrupt unsafe locking scenario:
>>>> :
>>>> :        CPU0                    CPU1
>>>> :        ----                    ----
>>>> :   lock(hugetlb_lock);
>>>> :                                local_irq_disable();
>>>> :                                lock(slock-AF_INET);
>>>> :                                lock(hugetlb_lock);
>>>> :   <Interrupt>
>>>> :     lock(slock-AF_INET);
>>>>
>>>> Shakeel has later explained that this is very likely TCP TX zerocopy
>>>> from hugetlb pages scenario when the networking code drops a last
>>>> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
>>>> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
>>>> chain can lead to a deadlock.
>>>>
>>>> This commit addresses the issue by doing the following:
>>>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>>>   changing spin_*lock calls to spin_*lock_irq* calls.
>>>> - Make subpool lock irq safe in a similar manner.
>>>> - Revert the !in_task check and workqueue handoff.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
>>>>
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>>>
>>> Hi Mike,
>>>
>>> Today I pulled the newest code (next-20210401). I found that
>>> alloc_and_dissolve_huge_page is not updated. In this function,
>>> hugetlb_lock is still non-irq safe. Maybe you or Oscar need
>>> to fix this.
>>>
>>> Thanks.
>>
>> Thank you Muchun,
>>
>> Oscar's changes were not in Andrew's tree when I started on this series
>> and I failed to notice their inclusion.  In addition,
>> isolate_or_dissolve_huge_page also needs updating as well as a change in
>> set_max_huge_pages that was omitted while rebasing.
>>
>> Andrew, the following patch addresses those missing changes.  Ideally,
>> the changes should be combined/included in this patch.  If you want me
>> to send another version of this patch or another series, let me know.
>>
>> From 450593eb3cea895f499ddc343c22424c552ea502 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Fri, 2 Apr 2021 13:18:13 -0700
>> Subject: [PATCH] hugetlb: fix irq locking omissions
>>
>> The pach "hugetlb: make free_huge_page irq safe" changed spin_*lock
>> calls to spin_*lock_irq* calls.  However, it missed several places
>> in the file hugetlb.c.  Add the overlooked changes.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Thanks Mike. But there are still some places that need
> improvement. See below.
> 

Correct.  My apologies again for not fully taking into account the new
code from Oscar's series when working on this.

>> ---
>>  mm/hugetlb.c | 16 ++++++++--------
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index c22111f3da20..a6bfc6bcbc81 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2284,7 +2284,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>>          */
>>         page_ref_dec(new_page);
>>  retry:
>> -       spin_lock(&hugetlb_lock);
>> +       spin_lock_irq(&hugetlb_lock);
>>         if (!PageHuge(old_page)) {
>>                 /*
>>                  * Freed from under us. Drop new_page too.
>> @@ -2297,7 +2297,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>>                  * Fail with -EBUSY if not possible.
>>                  */
>>                 update_and_free_page(h, new_page);
> 
> Now update_and_free_page can be called without holding
> hugetlb_lock. We can move it out of hugetlb_lock. In this
> function, there are 3 places which call update_and_free_page().
> We can move all of them out of hugetlb_lock. Right?

We will need to do more than that.
The call to update_and_free_page in alloc_and_dissolve_huge_page
assumes the old functionality, i.e. it assumes h->nr_huge_pages will be
decremented in update_and_free_page.  This is no longer the case.

This will need to be fixed in patch 4 of my series which changes the
functionality of update_and_free_page.  I'm afraid a change there will
end up requiring changes in subsequent patches due to context.
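
To illustrate the kind of adjustment needed (untested, and the final fix
may look different), the branch quoted above would also have to drop the
accounting for new_page itself now that update_and_free_page() no longer
does it:

	} else if (page_count(old_page)) {
		/*
		 * Someone has grabbed the page, try to isolate it here.
		 * Fail with -EBUSY if not possible.
		 *
		 * new_page was counted in nr_huge_pages when it was
		 * allocated; update_and_free_page() no longer drops
		 * that count, so do it here before discarding it.
		 */
		h->nr_huge_pages--;
		h->nr_huge_pages_node[nid]--;
		spin_unlock_irq(&hugetlb_lock);
		update_and_free_page(h, new_page);
		if (!isolate_huge_page(old_page, list))
			ret = -EBUSY;
		return ret;
	}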

I will have an update on Monday.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-04-04  0:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-31  3:41 [PATCH v3 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
2021-03-31  7:25   ` Michal Hocko
2021-03-31  7:52   ` David Hildenbrand
2021-03-31  3:41 ` [PATCH v3 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
2021-04-02 12:47   ` [External] " Muchun Song
2021-04-02 20:55     ` Mike Kravetz
2021-04-03  5:59       ` Muchun Song
2021-04-04  0:00         ` Mike Kravetz
2021-03-31  3:41 ` [PATCH v3 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
