Re: [PATCH net-next 2/3] net: page_pool: add the possibility to sync DMA memory for non-coherent devices

From: Lorenzo Bianconi <lorenzo@kernel.org>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: netdev@vger.kernel.org, lorenzo.bianconi@redhat.com,
	davem@davemloft.net, thomas.petazzoni@bootlin.com,
	ilias.apalodimas@linaro.org, matteo.croce@redhat.com
Subject: Re: [PATCH net-next 2/3] net: page_pool: add the possibility to sync DMA memory for non-coherent devices
Date: Mon, 11 Nov 2019 21:11:50 +0200	[thread overview]
Message-ID: <20191111191150.GF4197@localhost.localdomain> (raw)
In-Reply-To: <20191111174835.7344731b@carbon>

[-- Attachment #1: Type: text/plain, Size: 8609 bytes --]

> On Sun, 10 Nov 2019 14:09:09 +0200
> Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> 
> > Introduce the following parameters in order to add the possibility to sync
> > DMA memory area before putting allocated buffers in the page_pool caches:
> 
> > - sync: set to 1 if device is non cache-coherent and needs to flush DMA area
> 

Hi Jesper,

thx for the review

> I don't agree that this is only for non cache-coherent devices.
> 
> This change is generally for all device drivers.  Via setting 'sync'
> (which I prefer to rename 'dma_sync') driver request that page_pool
> takes over doing DMA-sync-for-device. (Very important, DMA-sync-for-CPU
> is still drivers responsibility).  Drivers can benefit from removing
> their calls to dma_sync_single_for_device().
> 
> We need to define meaning/semantics of this setting (my definition):
> - This means that all pages that driver gets from page_pool, will be
>   DMA-synced-for-device.

ack, will fix it in v2

> 
> > - offset: DMA address offset where the DMA engine starts copying rx data
> 
> > - max_len: maximum DMA memory size page_pool is allowed to flush. This
> >   is currently used in __page_pool_alloc_pages_slow routine when pages
> >   are allocated from page allocator
> 
> Implementation wise (you did as I suggested offlist), and does the
> DMA-sync-for-device at return-time page_pool_put_page() time, because
> we (often) know the length that was/can touched by CPU.  This is key to
> the optimization, that we know this length.

right, refilling the cache we now the exact length that was/can touched by CPU.

> 
> I also think you/we need to explain why this optimization is correct,
> my attempt: 
> 
> This optimization reduce the length of the DMA-sync-for-device.  The
> optimization is valid, because page is initially DMA-synced-for-device,
> as defined via max_len.  At driver RX time, the driver will do a
> DMA-sync-for-CPU on the memory for the packet length.  What is
> important is the memory occupied by packet payload, because this is the
> memory CPU is allowed to read and modify.  If CPU have not written into
> a cache-line, then we know that CPU will not be flushing this, thus it
> doesn't need a DMA-sync-for-device.  As we don't track cache-lines
> written into, simply use the full packet length as dma_sync_size, at
> page_pool recycle time.  This also take into account any tail-extend.

ack, will update it in v2

Regards,
Lorenzo

> 
> 
> > These parameters are supposed to be set by device drivers
> 
> 
>  
> > Tested-by: Matteo Croce <mcroce@redhat.com>
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/net/page_pool.h | 11 +++++++----
> >  net/core/page_pool.c    | 39 +++++++++++++++++++++++++++++++++------
> >  2 files changed, 40 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 2cbcdbdec254..defbfd90ab46 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -65,6 +65,9 @@ struct page_pool_params {
> >  	int		nid;  /* Numa node id to allocate from pages from */
> >  	struct device	*dev; /* device, for DMA pre-mapping purposes */
> >  	enum dma_data_direction dma_dir; /* DMA mapping direction */
> > +	unsigned int	max_len; /* max DMA sync memory size */
> > +	unsigned int	offset;  /* DMA addr offset */
> > +	u8 sync;
> >  };
> >  
> >  struct page_pool {
> > @@ -150,8 +153,8 @@ static inline void page_pool_destroy(struct page_pool *pool)
> >  }
> >  
> >  /* Never call this directly, use helpers below */
> > -void __page_pool_put_page(struct page_pool *pool,
> > -			  struct page *page, bool allow_direct);
> > +void __page_pool_put_page(struct page_pool *pool, struct page *page,
> > +			  unsigned int dma_sync_size, bool allow_direct);
> >  
> >  static inline void page_pool_put_page(struct page_pool *pool,
> >  				      struct page *page, bool allow_direct)
> > @@ -160,14 +163,14 @@ static inline void page_pool_put_page(struct page_pool *pool,
> >  	 * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
> >  	 */
> >  #ifdef CONFIG_PAGE_POOL
> > -	__page_pool_put_page(pool, page, allow_direct);
> > +	__page_pool_put_page(pool, page, 0, allow_direct);
> >  #endif
> >  }
> >  /* Very limited use-cases allow recycle direct */
> >  static inline void page_pool_recycle_direct(struct page_pool *pool,
> >  					    struct page *page)
> >  {
> > -	__page_pool_put_page(pool, page, true);
> > +	__page_pool_put_page(pool, page, 0, true);
> >  }
> >  
> >  /* API user MUST have disconnected alloc-side (not allowed to call
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 5bc65587f1c4..af9514c2d15b 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -112,6 +112,17 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
> >  	return page;
> >  }
> >  
> > +/* Used for non-coherent devices */
> > +static void page_pool_dma_sync_for_device(struct page_pool *pool,
> > +					  struct page *page,
> > +					  unsigned int dma_sync_size)
> > +{
> > +	dma_sync_size = min(dma_sync_size, pool->p.max_len);
> > +	dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
> > +					 pool->p.offset, dma_sync_size,
> > +					 pool->p.dma_dir);
> > +}
> > +
> >  /* slow path */
> >  noinline
> >  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > @@ -156,6 +167,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >  	}
> >  	page->dma_addr = dma;
> >  
> > +	/* non-coherent devices - flush memory */
> > +	if (pool->p.sync)
> > +		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> > +
> >  skip_dma_map:
> >  	/* Track how many pages are held 'in-flight' */
> >  	pool->pages_state_hold_cnt++;
> > @@ -255,7 +270,8 @@ static void __page_pool_return_page(struct page_pool *pool, struct page *page)
> >  }
> >  
> >  static bool __page_pool_recycle_into_ring(struct page_pool *pool,
> > -				   struct page *page)
> > +					  struct page *page,
> > +					  unsigned int dma_sync_size)
> >  {
> >  	int ret;
> >  	/* BH protection not needed if current is serving softirq */
> > @@ -264,6 +280,10 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
> >  	else
> >  		ret = ptr_ring_produce_bh(&pool->ring, page);
> >  
> > +	/* non-coherent devices - flush memory */
> > +	if (ret == 0 && pool->p.sync)
> > +		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
> > +
> >  	return (ret == 0) ? true : false;
> >  }
> >  
> > @@ -273,18 +293,23 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
> >   * Caller must provide appropriate safe context.
> >   */
> >  static bool __page_pool_recycle_direct(struct page *page,
> > -				       struct page_pool *pool)
> > +				       struct page_pool *pool,
> > +				       unsigned int dma_sync_size)
> >  {
> >  	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
> >  		return false;
> >  
> >  	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
> >  	pool->alloc.cache[pool->alloc.count++] = page;
> > +
> > +	/* non-coherent devices - flush memory */
> > +	if (pool->p.sync)
> > +		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
> >  	return true;
> >  }
> >  
> > -void __page_pool_put_page(struct page_pool *pool,
> > -			  struct page *page, bool allow_direct)
> > +void __page_pool_put_page(struct page_pool *pool, struct page *page,
> > +			  unsigned int dma_sync_size, bool allow_direct)
> >  {
> >  	/* This allocator is optimized for the XDP mode that uses
> >  	 * one-frame-per-page, but have fallbacks that act like the
> > @@ -296,10 +321,12 @@ void __page_pool_put_page(struct page_pool *pool,
> >  		/* Read barrier done in page_ref_count / READ_ONCE */
> >  
> >  		if (allow_direct && in_serving_softirq())
> > -			if (__page_pool_recycle_direct(page, pool))
> > +			if (__page_pool_recycle_direct(page, pool,
> > +						       dma_sync_size))
> >  				return;
> >  
> > -		if (!__page_pool_recycle_into_ring(pool, page)) {
> > +		if (!__page_pool_recycle_into_ring(pool, page,
> > +						   dma_sync_size)) {
> >  			/* Cache full, fallback to free pages */
> >  			__page_pool_return_page(pool, page);
> >  		}
> 
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]