From: Yunsheng Lin <linyunsheng@huawei.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	<netdev@vger.kernel.org>, Eric Dumazet <eric.dumazet@gmail.com>,
	<linux-mm@kvack.org>, Mel Gorman <mgorman@techsingularity.net>
Cc: lorenzo@kernel.org, "Toke Høiland-Jørgensen" <toke@redhat.com>,
	bpf@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	willy@infradead.org
Subject: Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme
Date: Thu, 4 May 2023 10:42:32 +0800	[thread overview]
Message-ID: <387f4653-1986-3ffe-65e7-448a59002ed0@huawei.com> (raw)
In-Reply-To: <168269857929.2191653.13267688321246766547.stgit@firesoul>

On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
>  void page_pool_release_page(struct page_pool *pool, struct page *page)
>  {
> +	unsigned int flags = READ_ONCE(pool->p.flags);
>  	dma_addr_t dma;
> -	int count;
> +	u32 release_cnt;
> +	u32 hold_cnt;
>  
>  	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
>  		/* Always account for inflight pages, even if we didn't
> @@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>  skip_dma_unmap:
>  	page_pool_clear_pp_info(page);
>  
> -	/* This may be the last page returned, releasing the pool, so
> -	 * it is not safe to reference pool afterwards.
> -	 */
> -	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
> -	trace_page_pool_state_release(pool, page, count);

There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
and the flags check below; if page_pool_destroy() is called concurrently during
that window, it seems we will leak the pp instance here?

It seems very hard to avoid this kind of corner case when using both
flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if the pp
instance can be freed.
Can we use something like the biased reference counting used by frag support
in page_pool? That way we only need to check a single variable, and we avoid
cache bouncing as much as possible.
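To make the biased-refcount suggestion concrete, here is a minimal user-space
sketch in C11 atomics. It only models the idea borrowed from page_pool's frag
support (a large bias taken by the pool up front, one shared counter); the
names pp_sketch, BIAS_MAX and the functions are invented for this illustration
and are not the kernel API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of biased reference counting, loosely modelled
 * on page_pool's frag support. Not kernel code.
 */
#define BIAS_MAX (1u << 30)    /* large bias owned by the pool up front */

struct pp_sketch {
	atomic_uint refcnt;    /* the single shared counter */
	bool freed;
};

static void pp_init(struct pp_sketch *pp)
{
	/* Pool takes the whole bias at creation; no separate hold_cnt. */
	atomic_init(&pp->refcnt, BIAS_MAX);
	pp->freed = false;
}

/* Each in-flight page drops one reference when it is released. */
static void pp_page_release(struct pp_sketch *pp)
{
	if (atomic_fetch_sub(&pp->refcnt, 1) == 1)
		pp->freed = true;  /* last reference frees the pool */
}

/* Destroy drops the unused part of the bias in one atomic op:
 * the bias minus the number of pages ever handed out.
 */
static void pp_destroy(struct pp_sketch *pp, unsigned int pages_out)
{
	unsigned int unused = BIAS_MAX - pages_out;

	if (atomic_fetch_sub(&pp->refcnt, unused) == unused)
		pp->freed = true;
}
```

In this scheme the freeing decision is a single atomic_fetch_sub on one
counter, so there is no window between a flags load and a separate
release_cnt/hold_cnt check.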

> +	if (flags & PP_FLAG_SHUTDOWN)
> +		hold_cnt = pp_read_hold_cnt(pool);
> +
> +	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
> +	trace_page_pool_state_release(pool, page, release_cnt);
> +
> +	/* In shutdown phase, last page will free pool instance */
> +	if (flags & PP_FLAG_SHUTDOWN)
> +		page_pool_free_attempt(pool, hold_cnt, release_cnt);
>  }
>  EXPORT_SYMBOL(page_pool_release_page);
> 

...

>  
>  void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
> @@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
>  
>  void page_pool_destroy(struct page_pool *pool)
>  {
> +	unsigned int flags;
> +	u32 release_cnt;
> +	u32 hold_cnt;
> +
>  	if (!pool)
>  		return;
>  
> @@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>  
> -	pool->defer_start = jiffies;
> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> +	/* PP have pages inflight, thus cannot immediately release memory.
> +	 * Enter into shutdown phase, depending on remaining in-flight PP
> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
> +	 * page will free pool instance.
> +	 *
> +	 * There exist two race conditions here, we need to take into
> +	 * account in the following code.
> +	 *
> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
> +	 *    process, which can then be stalled forever.
> +	 *
> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    page, which triggered shutdown process and freed pool
> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
> +	 *
> +	 * Handling races by holding a fake in-flight count, via
> +	 * artificially bumping pages_state_hold_cnt, which assures pool
> +	 * isn't freed under us.  For race(1) its safe to recheck ptr_ring
> +	 * (it will not free pool). Race(2) cannot happen, and we can
> +	 * release fake in-flight count as last step.
> +	 */
> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
> +	smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);

I assume the smp_store_release() is used to ensure the correct order
between the two operations above?
There is a data dependency between them (the store uses the value just
read), so do we really need the smp_store_release() here?

> +	barrier();
> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;

Do we need a stronger barrier, like smp_mb(), to prevent the CPU from
executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"?
A release store only orders the accesses *before* it; it does not stop
a later load from being reordered ahead of the store.
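The ordering question can be modelled with C11 atomics, where
smp_store_release() roughly corresponds to a memory_order_release store and
READ_ONCE() to a relaxed load. This is only a model of the patch's destroy
path, not kernel code, and the PP_FLAG_SHUTDOWN bit value is invented here:

```c
#include <stdatomic.h>

/* Model of the destroy-path stores/load from the patch, using C11
 * atomics as stand-ins for the kernel primitives. Illustrative only.
 */
static _Atomic unsigned int pages_state_hold_cnt;
static _Atomic unsigned int pool_flags;

#define PP_FLAG_SHUTDOWN (1u << 3)  /* illustrative bit value */

unsigned int destroy_path(void)
{
	unsigned int hold_cnt, flags;

	hold_cnt = atomic_load_explicit(&pages_state_hold_cnt,
					memory_order_relaxed) + 1;

	/* Release store: all accesses *before* this store are ordered
	 * before it. It says nothing about the load below, which the
	 * CPU (or the compiler, absent a barrier) may hoist above it.
	 */
	atomic_store_explicit(&pages_state_hold_cnt, hold_cnt,
			      memory_order_release);

	/* A relaxed load is not ordered against the preceding release
	 * store; only a full barrier between them (smp_mb() in kernel
	 * terms, memory_order_seq_cst here) forbids that reordering.
	 */
	flags = atomic_load_explicit(&pool_flags, memory_order_relaxed)
		| PP_FLAG_SHUTDOWN;

	atomic_store_explicit(&pool_flags, flags, memory_order_release);
	return flags;
}
```

In this model nothing forbids the relaxed load of pool_flags from completing
before the release store to pages_state_hold_cnt, which is exactly the
store-to-load reordering the question is about.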

> +	smp_store_release(&pool->p.flags, flags);
> +
> +	/* Concurrent CPUs could have returned last pages into ptr_ring */
> +	page_pool_empty_ring(pool);
>  
> -	INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
> -	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
> +	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
> +	page_pool_free_attempt(pool, hold_cnt, release_cnt);
>  }
>  EXPORT_SYMBOL(page_pool_destroy);
>  
> 
> 
> .
> 

Thread overview: 13+ messages
2023-04-28 16:16 [PATCH RFC net-next/mm V3 0/2] page_pool: new approach for leak detection and shutdown phase Jesper Dangaard Brouer
2023-04-28 16:16 ` [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme Jesper Dangaard Brouer
2023-04-28 21:38   ` Toke Høiland-Jørgensen
2023-05-03 15:21     ` Jesper Dangaard Brouer
2023-05-03  2:33   ` Jakub Kicinski
2023-05-03 11:18     ` Toke Høiland-Jørgensen
2023-05-03 15:49       ` Jesper Dangaard Brouer
2023-05-04  1:47         ` Jakub Kicinski
2023-05-04  2:42   ` Yunsheng Lin [this message]
2023-05-04 13:48     ` Jesper Dangaard Brouer
2023-05-05  0:54       ` Yunsheng Lin
2023-05-06 13:11         ` Yunsheng Lin
2023-04-28 16:16 ` [PATCH RFC net-next/mm V3 2/2] mm/page_pool: catch page_pool memory leaks Jesper Dangaard Brouer
