* [PATCH RFC net-next/mm V4 0/2] page_pool: new approach for leak detection and shutdown phase
@ 2023-05-23 14:52 Jesper Dangaard Brouer
  2023-05-23 14:52 ` [PATCH RFC net-next/mm V4 1/2] mm/page_pool: catch page_pool memory leaks Jesper Dangaard Brouer
  2023-05-23 14:52 ` [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme Jesper Dangaard Brouer
  0 siblings, 2 replies; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2023-05-23 14:52 UTC (permalink / raw)
  To: Ilias Apalodimas, netdev, Eric Dumazet, linux-mm, Mel Gorman
  Cc: Jesper Dangaard Brouer, lorenzo,
	Toke Høiland-Jørgensen, linyunsheng, bpf,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Andrew Morton,
	willy

Patchset change summary:
 - Remove the PP workqueue and inflight warnings; instead rely on
   inflight pages to trigger cleanup.
 - Move leak detection to the MM-layer page allocator, active when
   combined with CONFIG_DEBUG_VM.

The page_pool (PP) workqueue calling page_pool_release_retry()
generates too many false-positive reports. Furthermore, these reports
of a page_pool shutdown still having inflight pages are not very
helpful for tracking down the root cause.

In the past these reports have helped us catch driver bugs that
leaked pages by invoking put_page() directly, often in code paths
handling error cases. Back then PP pages had a short lifespan (within
driver and XDP code paths). Since PP pages gained a recycle return
path for SKBs, the lifespan of a PP page can be much longer. Thus, it
is time to revisit the periodic release-retry mechanism. The default
60 sec lifespan assumption is obviously wrong/obsolete, as things
like TCP sockets can keep SKBs around for much longer (e.g.
retransmits, timeouts, NAPI defer schemes, etc.).

An inflight report means one of two things: (1) an API user is still
holding on to pages, or (2) a page got leaked and will never be
returned to the PP. The PP needs to accept that it has no control
over (1), i.e. how long outstanding PP pages are kept by API users.
What we really want to catch is (2): pages that "leak", meaning they
never got properly returned via the PP APIs.

Leaked PP pages result in these issues: (A) we can never release the
page_pool memory structs, which (B) hold a refcnt on the struct
device used for DMA mapping, and (C) leak DMA mappings, which (D)
means a hardware device can potentially write into a page already
returned to the page allocator.
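
For background on the accounting involved: inflight is the signed
distance between two wrapping u32 counters (hold vs. release). A
minimal userspace sketch of the idea (not kernel code; the helper
name is made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Same trick as the kernel's _distance(a, b): casting the u32
 * difference to s32 makes counter wrap-around harmless.
 */
static int32_t pp_inflight(uint32_t hold_cnt, uint32_t release_cnt)
{
	return (int32_t)(hold_cnt - release_cnt);
}

int main(void)
{
	/* hold_cnt wrapped past zero: 5 pages still inflight */
	uint32_t hold = 3, release = UINT32_MAX - 1;

	printf("inflight=%d\n", pp_inflight(hold, release));
	return 0;
}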

V4: Use RCU sync method to resolve races

V3: Fix races found by Toke

V2: Fix race found by Yunsheng Lin <linyunsheng@huawei.com>

---

Jesper Dangaard Brouer (2):
      mm/page_pool: catch page_pool memory leaks
      page_pool: Remove workqueue in new shutdown scheme


 include/net/page_pool.h |  10 +--
 mm/page_alloc.c         |   7 ++
 net/core/page_pool.c    | 138 ++++++++++++++++++++++++++++------------
 3 files changed, 111 insertions(+), 44 deletions(-)

--



* [PATCH RFC net-next/mm V4 1/2] mm/page_pool: catch page_pool memory leaks
  2023-05-23 14:52 [PATCH RFC net-next/mm V4 0/2] page_pool: new approach for leak detection and shutdown phase Jesper Dangaard Brouer
@ 2023-05-23 14:52 ` Jesper Dangaard Brouer
  2023-05-23 14:52 ` [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme Jesper Dangaard Brouer
  1 sibling, 0 replies; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2023-05-23 14:52 UTC (permalink / raw)
  To: Ilias Apalodimas, netdev, Eric Dumazet, linux-mm, Mel Gorman
  Cc: Jesper Dangaard Brouer, lorenzo,
	Toke Høiland-Jørgensen, linyunsheng, bpf,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Andrew Morton,
	willy

Pages belonging to a page_pool (PP) instance must be freed through
the PP APIs in order to correctly release any DMA mappings and to
drop the refcnt on the DMA device when the PP instance is freed. When
the PP releases a page (page_pool_release_page()), the page->pp_magic
value is cleared.

This patch detects a leaked PP page in free_page_is_bad() via the
unexpected state of the page->pp_magic value still being
PP_SIGNATURE.

We choose to report and treat it as a bad page. It would be possible
to release the page by returning it to the PP instance, as the
page->pp pointer is likely still valid.

Notice this code is only activated when the kernel is compiled with
CONFIG_PAGE_POOL and either CONFIG_DEBUG_VM or the boot cmdline
debug_pagealloc=on.
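
For illustration, the detection boils down to a masked compare of
page->pp_magic. A standalone userspace sketch (the PP_SIGNATURE value
is hardcoded from the example output below; in the kernel it depends
on the architecture's poison pointer value, and this assumes 64-bit
unsigned long):

#include <stdbool.h>
#include <stdio.h>

#define PP_SIGNATURE	0xdead000000000040UL

/* Low bits are masked out, as they may carry other state */
static bool page_pool_page_leaked(unsigned long pp_magic)
{
	return (pp_magic & ~0x3UL) == PP_SIGNATURE;
}

int main(void)
{
	printf("%d\n", page_pool_page_leaked(0xdead000000000042UL)); /* 1: leak */
	printf("%d\n", page_pool_page_leaked(0x0UL));                /* 0: ok */
	return 0;
}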

Reduced example output of leak with PP_SIGNATURE = dead000000000040:

 BUG: Bad page state in process swapper/0  pfn:110bbf
 page:000000005bc8cfb8 refcount:0 mapcount:0 mapping:0000000000000000 index:0x110bbf000 pfn:0x110bbf
 flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
 raw: 002fffff80000000 dead000000000040 ffff888117255000 0000000000000000
 raw: 0000000110bbf000 000000000000003e 00000000ffffffff 0000000000000000
 page dumped because: page_pool leak
 [...]

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/page_alloc.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..e6b996da39d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1085,6 +1085,9 @@ static inline bool page_expected_state(struct page *page,
 			page_ref_count(page) |
 #ifdef CONFIG_MEMCG
 			page->memcg_data |
+#endif
+#ifdef CONFIG_PAGE_POOL
+			((page->pp_magic & ~0x3UL) == PP_SIGNATURE) |
 #endif
 			(page->flags & check_flags)))
 		return false;
@@ -1111,6 +1114,10 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
 #ifdef CONFIG_MEMCG
 	if (unlikely(page->memcg_data))
 		bad_reason = "page still charged to cgroup";
+#endif
+#ifdef CONFIG_PAGE_POOL
+	if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE))
+		bad_reason = "page_pool leak";
 #endif
 	return bad_reason;
 }




* [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme
  2023-05-23 14:52 [PATCH RFC net-next/mm V4 0/2] page_pool: new approach for leak detection and shutdown phase Jesper Dangaard Brouer
  2023-05-23 14:52 ` [PATCH RFC net-next/mm V4 1/2] mm/page_pool: catch page_pool memory leaks Jesper Dangaard Brouer
@ 2023-05-23 14:52 ` Jesper Dangaard Brouer
  2023-05-23 16:16   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2023-05-23 14:52 UTC (permalink / raw)
  To: Ilias Apalodimas, netdev, Eric Dumazet, linux-mm, Mel Gorman
  Cc: Jesper Dangaard Brouer, lorenzo,
	Toke Høiland-Jørgensen, linyunsheng, bpf,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Andrew Morton,
	willy

This removes the workqueue scheme that periodically tests whether
inflight has reached zero, such that page_pool memory can be freed.

This change adds code to the fast-path free to check for a shutdown
flag bit after returning PP pages.

Performance is very important for PP, as the fast path is used for
XDP_DROP use-cases, where NIC drivers recycle PP pages directly into
the PP alloc cache.

This patch (since V3) shows zero impact on this fast path.
Micro-benchmarked with [1] on an Intel CPU E5-1650 @3.60GHz. The
slight code reorganization of the likely() annotations is deliberate.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |   10 ++-
 net/core/page_pool.c    |  138 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 104 insertions(+), 44 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c8ec2f34722b..19396e05f12d 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -50,6 +50,9 @@
 				 PP_FLAG_DMA_SYNC_DEV |\
 				 PP_FLAG_PAGE_FRAG)
 
+/* Internal flag: PP in shutdown phase, waiting for inflight pages */
+#define PP_FLAG_SHUTDOWN	BIT(8)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -151,11 +154,6 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 struct page_pool {
 	struct page_pool_params p;
 
-	struct delayed_work release_dw;
-	void (*disconnect)(void *);
-	unsigned long defer_start;
-	unsigned long defer_warn;
-
 	u32 pages_state_hold_cnt;
 	unsigned int frag_offset;
 	struct page *frag_page;
@@ -165,6 +163,7 @@ struct page_pool {
 	/* these stats are incremented while in softirq context */
 	struct page_pool_alloc_stats alloc_stats;
 #endif
+	void (*disconnect)(void *);
 	u32 xdp_mem_id;
 
 	/*
@@ -208,6 +207,7 @@ struct page_pool {
 	refcount_t user_cnt;
 
 	u64 destroy_cnt;
+	struct rcu_head rcu;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e212e9d7edcb..213349148a48 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -23,9 +23,6 @@
 
 #include <trace/events/page_pool.h>
 
-#define DEFER_TIME (msecs_to_jiffies(1000))
-#define DEFER_WARN_INTERVAL (60 * HZ)
-
 #define BIAS_MAX	LONG_MAX
 
 #ifdef CONFIG_PAGE_POOL_STATS
@@ -380,6 +377,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	struct page *page;
 	int i, nr_pages;
 
+	/* API usage BUG: PP in shutdown phase, cannot alloc new pages */
+	if (WARN_ON(pool->p.flags & PP_FLAG_SHUTDOWN))
+		return NULL;
+
 	/* Don't support bulk alloc for high-order pages */
 	if (unlikely(pp_order))
 		return __page_pool_alloc_page_order(pool, gfp);
@@ -450,9 +451,8 @@ EXPORT_SYMBOL(page_pool_alloc_pages);
  */
 #define _distance(a, b)	(s32)((a) - (b))
 
-static s32 page_pool_inflight(struct page_pool *pool)
+static s32 __page_pool_inflight(struct page_pool *pool, u32 release_cnt)
 {
-	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
 	u32 hold_cnt = READ_ONCE(pool->pages_state_hold_cnt);
 	s32 inflight;
 
@@ -464,6 +464,14 @@ static s32 page_pool_inflight(struct page_pool *pool)
 	return inflight;
 }
 
+static s32 page_pool_inflight(struct page_pool *pool)
+{
+	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
+	return __page_pool_inflight(pool, release_cnt);
+}
+
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt);
+
 /* Disconnects a page (from a page_pool).  API users can have a need
  * to disconnect a page (from a page_pool), to allow it to be used as
  * a regular page (that will eventually be returned to the normal
@@ -472,7 +480,7 @@ static s32 page_pool_inflight(struct page_pool *pool)
 void page_pool_release_page(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
-	int count;
+	u32 release_cnt;
 
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		/* Always account for inflight pages, even if we didn't
@@ -490,11 +498,12 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 skip_dma_unmap:
 	page_pool_clear_pp_info(page);
 
-	/* This may be the last page returned, releasing the pool, so
-	 * it is not safe to reference pool afterwards.
-	 */
-	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
-	trace_page_pool_state_release(pool, page, count);
+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	trace_page_pool_state_release(pool, page, release_cnt);
+
+	/* In shutdown phase, last page will free pool instance */
+	if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
+		page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_release_page);
 
@@ -535,7 +544,7 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
 static bool page_pool_recycle_in_cache(struct page *page,
 				       struct page_pool *pool)
 {
-	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) {
+	if (pool->alloc.count == PP_ALLOC_CACHE_SIZE) {
 		recycle_stat_inc(pool, cache_full);
 		return false;
 	}
@@ -546,6 +555,8 @@ static bool page_pool_recycle_in_cache(struct page *page,
 	return true;
 }
 
+static void page_pool_empty_ring(struct page_pool *pool);
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -572,7 +583,8 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
-		if (allow_direct && in_softirq() &&
+		/* During PP shutdown, no direct recycle must occur */
+		if (likely(allow_direct && in_softirq()) &&
 		    page_pool_recycle_in_cache(page, pool))
 			return NULL;
 
@@ -609,6 +621,8 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, page);
 	}
+	if (page && pool->p.flags & PP_FLAG_SHUTDOWN)
+		page_pool_empty_ring(pool);
 }
 EXPORT_SYMBOL(page_pool_put_defragged_page);
 
@@ -646,6 +660,9 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 	recycle_stat_add(pool, ring, i);
 	page_pool_ring_unlock(pool);
 
+	if (pool->p.flags & PP_FLAG_SHUTDOWN)
+		page_pool_empty_ring(pool);
+
 	/* Hopefully all pages was return into ptr_ring */
 	if (likely(i == bulk_len))
 		return;
@@ -737,12 +754,18 @@ struct page *page_pool_alloc_frag(struct page_pool *pool,
 }
 EXPORT_SYMBOL(page_pool_alloc_frag);
 
+noinline
 static void page_pool_empty_ring(struct page_pool *pool)
 {
-	struct page *page;
+	struct page *page, *next;
+
+	next = ptr_ring_consume_bh(&pool->ring);
 
 	/* Empty recycle ring */
-	while ((page = ptr_ring_consume_bh(&pool->ring))) {
+	while (next) {
+		page = next;
+		next = ptr_ring_consume_bh(&pool->ring);
+
 		/* Verify the refcnt invariant of cached pages */
 		if (!(page_ref_count(page) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
@@ -768,6 +791,14 @@ static void page_pool_free(struct page_pool *pool)
 	kfree(pool);
 }
 
+static void page_pool_free_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+	page_pool_free(pool);
+}
+
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	struct page *page;
@@ -796,39 +827,30 @@ static void page_pool_scrub(struct page_pool *pool)
 	page_pool_empty_ring(pool);
 }
 
-static int page_pool_release(struct page_pool *pool)
+noinline
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt)
 {
 	int inflight;
 
-	page_pool_scrub(pool);
-	inflight = page_pool_inflight(pool);
+	rcu_read_lock();
+	inflight = __page_pool_inflight(pool, release_cnt);
+	rcu_read_unlock();
 	if (!inflight)
-		page_pool_free(pool);
+		call_rcu(&pool->rcu, page_pool_free_rcu);
 
 	return inflight;
 }
 
-static void page_pool_release_retry(struct work_struct *wq)
+static int page_pool_release(struct page_pool *pool)
 {
-	struct delayed_work *dwq = to_delayed_work(wq);
-	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
 	int inflight;
 
-	inflight = page_pool_release(pool);
+	page_pool_scrub(pool);
+	inflight = page_pool_inflight(pool);
 	if (!inflight)
-		return;
-
-	/* Periodic warning */
-	if (time_after_eq(jiffies, pool->defer_warn)) {
-		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
-
-		pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
-			__func__, inflight, sec);
-		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
-	}
+		page_pool_free(pool);
 
-	/* Still not ready to be disconnected, retry later */
-	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+	return inflight;
 }
 
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
@@ -856,6 +878,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
 
 void page_pool_destroy(struct page_pool *pool)
 {
+	unsigned int flags;
+	u32 release_cnt;
+	u32 hold_cnt;
+
 	if (!pool)
 		return;
 
@@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
-	pool->defer_start = jiffies;
-	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
+	/* PP have pages inflight, thus cannot immediately release memory.
+	 * Enter into shutdown phase, depending on remaining in-flight PP
+	 * pages to trigger shutdown process (on concurrent CPUs) and last
+	 * page will free pool instance.
+	 *
+	 * There exist two race conditions here, we need to take into
+	 * account in the following code.
+	 *
+	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
+	 *    process, which can then be stalled forever.
+	 *
+	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    page, which triggered shutdown process and freed pool
+	 *    instance. Thus, its not safe to dereference *pool afterwards.
+	 *
+	 * Handling races by holding a fake in-flight count, via artificially
+	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
+	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
+	 * transition safely into the shutdown phase.
+	 *
+	 * After safely transition into this state the races are resolved.  For
+	 * race(1) its safe to recheck and empty ptr_ring (it will not free
+	 * pool). Race(2) cannot happen, and we can release fake in-flight count
+	 * as last step.
+	 */
+	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
+	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
+	synchronize_rcu();
+
+	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
+	WRITE_ONCE(pool->p.flags, flags);
+	synchronize_rcu();
+
+	/* Concurrent CPUs could have returned last pages into ptr_ring */
+	page_pool_empty_ring(pool);
 
-	INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
-	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_destroy);
 




* Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme
  2023-05-23 14:52 ` [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme Jesper Dangaard Brouer
@ 2023-05-23 16:16   ` Toke Høiland-Jørgensen
  2023-05-24 12:00     ` Yunsheng Lin
  0 siblings, 1 reply; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-05-23 16:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Ilias Apalodimas, netdev, Eric Dumazet,
	linux-mm, Mel Gorman
  Cc: Jesper Dangaard Brouer, lorenzo, linyunsheng, bpf,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Andrew Morton,
	willy

>  void page_pool_destroy(struct page_pool *pool)
>  {
> +	unsigned int flags;
> +	u32 release_cnt;
> +	u32 hold_cnt;
> +
>  	if (!pool)
>  		return;
>  
> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>  
> -	pool->defer_start = jiffies;
> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> +	/* PP have pages inflight, thus cannot immediately release memory.
> +	 * Enter into shutdown phase, depending on remaining in-flight PP
> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
> +	 * page will free pool instance.
> +	 *
> +	 * There exist two race conditions here, we need to take into
> +	 * account in the following code.
> +	 *
> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
> +	 *    process, which can then be stalled forever.
> +	 *
> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    page, which triggered shutdown process and freed pool
> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
> +	 *
> +	 * Handling races by holding a fake in-flight count, via artificially
> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
> +	 * transition safely into the shutdown phase.
> +	 *
> +	 * After safely transition into this state the races are resolved.  For
> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
> +	 * as last step.
> +	 */
> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
> +	synchronize_rcu();
> +
> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
> +	WRITE_ONCE(pool->p.flags, flags);
> +	synchronize_rcu();

Hmm, synchronize_rcu() can be quite expensive; why do we need two of
them? It should be fine to just do one after those two writes, as
long as the order of those writes is correct (which WRITE_ONCE should
ensure)?
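
I.e., something like this (completely untested sketch):

	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
	WRITE_ONCE(pool->p.flags, flags);
	synchronize_rcu(); /* single grace period covers both writes */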

Also, if we're adding this (blocking) operation to the teardown path,
we risk adding latency to that path (network interface removal, the
BPF_PROG_RUN syscall, etc.), so I am not sure this actually ends up
being an improvement anymore, as opposed to just keeping the
workqueue but dropping the warning?

-Toke



* Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme
  2023-05-23 16:16   ` Toke Høiland-Jørgensen
@ 2023-05-24 12:00     ` Yunsheng Lin
  2023-05-24 16:42       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Yunsheng Lin @ 2023-05-24 12:00 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, netdev, Eric Dumazet, linux-mm, Mel Gorman
  Cc: lorenzo, bpf, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, willy

On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>  void page_pool_destroy(struct page_pool *pool)
>>  {
>> +	unsigned int flags;
>> +	u32 release_cnt;
>> +	u32 hold_cnt;
>> +
>>  	if (!pool)
>>  		return;
>>  
>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>  	if (!page_pool_release(pool))
>>  		return;
>>  
>> -	pool->defer_start = jiffies;
>> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> +	/* PP have pages inflight, thus cannot immediately release memory.
>> +	 * Enter into shutdown phase, depending on remaining in-flight PP
>> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
>> +	 * page will free pool instance.
>> +	 *
>> +	 * There exist two race conditions here, we need to take into
>> +	 * account in the following code.
>> +	 *
>> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>> +	 *    process, which can then be stalled forever.
>> +	 *
>> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    page, which triggered shutdown process and freed pool
>> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
>> +	 *
>> +	 * Handling races by holding a fake in-flight count, via artificially
>> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
>> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
>> +	 * transition safely into the shutdown phase.
>> +	 *
>> +	 * After safely transition into this state the races are resolved.  For
>> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
>> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
>> +	 * as last step.
>> +	 */
>> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>> +	synchronize_rcu();
>> +
>> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>> +	WRITE_ONCE(pool->p.flags, flags);
>> +	synchronize_rcu();
> 
> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
> them? Should be fine to just do one after those two writes, as long as
> the order of those writes is correct (which WRITE_ONCE should ensure)?

I am not sure rcu is the right scheme to fix the problem, as rcu is
usually for one writer doing the freeing/updating and many doing the
reading, while in the case we are trying to fix here, everyone is
doing the reading and trying to do the freeing.

And there might still be a data race here, as below:

     cpu0 calling page_pool_destroy()           cpu1 calling page_pool_release_page()

WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
WRITE_ONCE(pool->p.flags, flags);
synchronize_rcu();
                                                atomic_inc_return()

release_cnt = atomic_inc_return();
page_pool_free_attempt(pool, release_cnt);
  -> rcu call of page_pool_free_rcu()

                                                if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
                                                        page_pool_free_attempt()

As the rcu_read_[un]lock() calls are only in page_pool_free_attempt(),
cpu0 will see inflight being zero and trigger the rcu callback to
free the pp, while cpu1, seeing pool->p.flags with PP_FLAG_SHUTDOWN
set, will access pool->pages_state_hold_cnt in __page_pool_inflight(),
causing a use-after-free problem?


> 
> Also, if we're adding this (blocking) operation in the teardown path we
> risk adding latency to that path (network interface removal,
> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
> improvement anymore, as opposed to just keeping the workqueue but
> dropping the warning?

We might be able to remove the workqueue from the destroy path, but a
workqueue might still need to be triggered to call page_pool_free()
in non-atomic context, instead of calling page_pool_free() directly
in page_pool_release_page(): page_pool_release_page() might be called
in atomic context, while page_pool_free() requires a non-atomic
context for put_device() and for pool->disconnect, which takes
mutex_lock() in mem_allocator_disconnect().
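
Something like the below, perhaps (untested, for illustration only;
assumes a 'struct work_struct release_work' member added to struct
page_pool, with INIT_WORK() done at pool creation time):

#include <linux/workqueue.h>

static void page_pool_release_work_fn(struct work_struct *work)
{
	struct page_pool *pool;

	pool = container_of(work, struct page_pool, release_work);
	/* Process context here: safe for put_device() and the
	 * mutex_lock() in mem_allocator_disconnect().
	 */
	page_pool_free(pool);
}

/* Then, where the release path sees inflight drop to zero (possibly
 * in atomic context), defer instead of freeing directly:
 */
schedule_work(&pool->release_work);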


* Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme
  2023-05-24 12:00     ` Yunsheng Lin
@ 2023-05-24 16:42       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2023-05-24 16:42 UTC (permalink / raw)
  To: Yunsheng Lin, Toke Høiland-Jørgensen, Ilias Apalodimas,
	netdev, Eric Dumazet, linux-mm, Mel Gorman
  Cc: brouer, lorenzo, bpf, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, willy



On 24/05/2023 14.00, Yunsheng Lin wrote:
> On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>>   void page_pool_destroy(struct page_pool *pool)
>>>   {
>>> +	unsigned int flags;
>>> +	u32 release_cnt;
>>> +	u32 hold_cnt;
>>> +
>>>   	if (!pool)
>>>   		return;
>>>   
>>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>>   	if (!page_pool_release(pool))
>>>   		return;
>>>   
>>> -	pool->defer_start = jiffies;
>>> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> +	/* PP have pages inflight, thus cannot immediately release memory.
>>> +	 * Enter into shutdown phase, depending on remaining in-flight PP
>>> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +	 * page will free pool instance.
>>> +	 *
>>> +	 * There exist two race conditions here, we need to take into
>>> +	 * account in the following code.
>>> +	 *
>>> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>>> +	 *    process, which can then be stalled forever.
>>> +	 *
>>> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +	 *    page, which triggered shutdown process and freed pool
>>> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +	 *
>>> +	 * Handling races by holding a fake in-flight count, via artificially
>>> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
>>> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
>>> +	 * transition safely into the shutdown phase.
>>> +	 *
>>> +	 * After safely transition into this state the races are resolved.  For
>>> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
>>> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
>>> +	 * as last step.
>>> +	 */
>>> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>>> +	synchronize_rcu();
>>> +
>>> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>> +	WRITE_ONCE(pool->p.flags, flags);
>>> +	synchronize_rcu();
>>
>> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
>> them? Should be fine to just do one after those two writes, as long as
>> the order of those writes is correct (which WRITE_ONCE should ensure)?
> 
> I am not sure rcu is the right scheme to fix the problem, as rcu is usually
> for one doing freeing/updating and many doing reading, while the case we
> try to fix here is all doing the reading and trying to do the freeing.
> 
> And there might still be data race here as below:
>       cpu0 calling page_pool_destroy()                cpu1 caling page_pool_release_page()
> 
> WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>        WRITE_ONCE(pool->p.flags, flags);
>             synchronize_rcu();
>                                                               atomic_inc_return()
> 
>          release_cnt = atomic_inc_return();
>        page_pool_free_attempt(pool, release_cnt);
>          rcu call page_pool_free_rcu()
> 
> 				                     if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
>                                                                 page_pool_free_attempt()
> 
> As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0
> will see the inflight being zero and triger the rcu to free the pp,
> and cpu1 see the pool->p.flags with PP_FLAG_SHUTDOWN set, it will
> access pool->pages_state_hold_cnt in __page_pool_inflight(), causing
> a use-after-free problem?
> 
> 
>>
>> Also, if we're adding this (blocking) operation in the teardown path we
>> risk adding latency to that path (network interface removal,
>> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
>> improvement anymore, as opposed to just keeping the workqueue but
>> dropping the warning?
> 
> we might be able to remove the workqueue from the destroy path, a
> workqueue might be still needed to be trigered to call page_pool_free()
> in non-atomic context instead of calling page_pool_free() directly in
> page_pool_release_page(), as page_pool_release_page() might be called
> in atomic context and page_pool_free() requires a non-atomic context
> for put_device() and pool->disconnect using the mutex_lock() in
> mem_allocator_disconnect().
> 

I thought the call_rcu() callback provided the right context, but
skimming the call_rcu() code I think it doesn't. Argh, I think you
are right: we cannot avoid the workqueue, as we need the non-atomic
context.
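
A quick sketch of how that could look, combining the RCU grace period
with a deferral to process context (untested; assumes a
'struct work_struct release_work' member added to struct page_pool):

static void page_pool_free_rcu(struct rcu_head *rcu)
{
	struct page_pool *pool = container_of(rcu, struct page_pool, rcu);

	/* RCU callbacks run in softirq context, so defer the actual
	 * free (put_device() and the disconnect mutex) to a workqueue.
	 */
	schedule_work(&pool->release_work);
}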

Thanks for catching and pointing this out :-)

--Jesper


