[BUG] Possible unsafe page_pool usage in octeontx2

* [BUG] Possible unsafe page_pool usage in octeontx2
@ 2023-08-23  9:47 Sebastian Andrzej Siewior
  2023-08-23 11:36 ` Ilias Apalodimas
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-23  9:47 UTC (permalink / raw)
  To: netdev, Ratheesh Kannoth
  Cc: David S. Miller, Eric Dumazet, Geetha sowjanya, Ilias Apalodimas,
	Jakub Kicinski, Jesper Dangaard Brouer, Paolo Abeni,
	Subbaraya Sundeep, Sunil Goutham, Thomas Gleixner, hariprasad

Hi,

I've been looking at the page_pool locking.

page_pool_alloc_frag() -> page_pool_alloc_pages() ->
__page_pool_get_cached():

There core of the allocation is:
|         /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
|         if (likely(pool->alloc.count)) {
|                 /* Fast-path */
|                 page = pool->alloc.cache[--pool->alloc.count];

The access to the `cache' array and the `count' variable is not locked.
This is fine as long as there only one consumer per pool. In my
understanding the intention is to have one page_pool per NAPI callback
to ensure this.

The pool can be filled in the same context (within allocation if the
pool is empty). There is also page_pool_recycle_in_cache() which fills
the pool from within skb free, for instance:
 napi_consume_skb() -> skb_release_all() -> skb_release_data() ->
 napi_frag_unref() -> page_pool_return_skb_page().

The last one has the following check here:
|         napi = READ_ONCE(pp->p.napi);
|         allow_direct = napi_safe && napi &&
|                 READ_ONCE(napi->list_owner) == smp_processor_id();

This eventually ends in page_pool_recycle_in_cache() where it adds the
page to the cache buffer if the check above is true (and BH is disabled). 

napi->list_owner is set once NAPI is scheduled until the poll callback
completed. It is safe to add items to list because only one of the two
can run on a single CPU and the completion of them ensured by having BH
disabled the whole time.

This breaks in octeontx2 where a worker is used to fill the buffer:
  otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
  otx2_alloc_pool_buf() -> page_pool_alloc_frag().

BH is disabled but the add of a page can still happen while NAPI
callback runs on a remote CPU and so corrupting the index/ array.

API wise I would suggest to

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ff80b80a6f9f..b50e219470a36 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
-		if (allow_direct && in_softirq() &&
+		if (allow_direct && in_serving_softirq() &&
 		    page_pool_recycle_in_cache(page, pool))
 			return NULL;
 
because the intention (as I understand it) is to be invoked from within
the NAPI callback (while softirq is served) and not if BH is just
disabled due to a lock or so.

It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to
page_pool_alloc_pages() to spot usage outside of softirq. But this will
trigger in every driver since the same function is used in the open
callback to initially setup the HW.

Sebastian

^ permalink raw reply related	[flat|nested] 22+ messages in thread