* [PATCH net-next v3 0/4] introduce page_pool_alloc() API
@ 2023-06-09 13:17 Yunsheng Lin
       [not found] ` <20230609131740.7496-4-linyunsheng@huawei.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Yunsheng Lin @ 2023-06-09 13:17 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, bpf,
	linux-arm-kernel, linux-mediatek

In [1] & [2], there are use cases for veth and virtio_net to
use the frag support in page pool to reduce memory usage, and
they may request different frag sizes depending on the head/tail
room needed for xdp_frame/shinfo and the mtu/packet size. When
the requested frag size is large enough that a single page can
not be split into more than one frag, using frag support brings
only a performance penalty, because of the extra frag count
handling it requires.

So this patchset provides a page pool API for the driver to
allocate memory with the least memory overhead and performance
penalty when it doesn't know the size of the memory it needs
beforehand.

1. https://patchwork.kernel.org/project/netdevbpf/patch/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
2. https://patchwork.kernel.org/project/netdevbpf/patch/20230526054621.18371-3-liangchen.linux@gmail.com/
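
As a rough illustration, the driver-side usage looks like the sketch
below. This is only a sketch, based on the page_pool_dev_alloc() call
used in the veth conversion later in this thread; the pool setup is
elided and the exact truesize rounding is an assumption:

	/* Ask for "size" bytes without knowing it at pool-creation time;
	 * the pool may hand back a frag of a larger page, reporting the
	 * frag's offset and actual truesize back to the caller. */
	unsigned int offset, truesize = size;
	struct page *page;
	void *va;

	page = page_pool_dev_alloc(pool, &offset, &truesize);
	if (!page)
		return -ENOMEM;

	va = page_address(page) + offset;
	/* ... use up to truesize bytes at va, then return the frag ... */
	page_pool_put_full_page(pool, page, false);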

V3: Incorporate changes from the discussion with Alexander,
    mostly the inline wrapper, splitting the
    PAGE_POOL_DMA_USE_PP_FRAG_COUNT change into a separate
    patch, and comment changes.
V2: Add a patch to remove the PP_FLAG_PAGE_FRAG flag and mention
    the virtio_net use case in the cover letter.
V1: Drop RFC tag and page_pool_frag patch.

Yunsheng Lin (4):
  page_pool: frag API support for 32-bit arch with 64-bit DMA
  page_pool: unify frag_count handling in page_pool_is_last_frag()
  page_pool: introduce page_pool_alloc() API
  page_pool: remove PP_FLAG_PAGE_FRAG flag

 .../net/ethernet/hisilicon/hns3/hns3_enet.c   |   3 +-
 .../marvell/octeontx2/nic/otx2_common.c       |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  11 +-
 drivers/net/wireless/mediatek/mt76/mac80211.c |   2 +-
 include/net/page_pool.h                       | 131 +++++++++++++++---
 net/core/page_pool.c                          |  26 ++--
 net/core/skbuff.c                             |   2 +-
 7 files changed, 139 insertions(+), 38 deletions(-)

-- 
2.33.0



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
       [not found]           ` <CAKgT0UccmDe+CE6=zDYQHi1=3vXf5MptzDo+BsPrKdmP5j9kgQ@mail.gmail.com>
@ 2023-06-15 16:19             ` Jesper Dangaard Brouer
  2023-06-16 11:57               ` Yunsheng Lin
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-15 16:19 UTC (permalink / raw)
  To: Alexander Duyck, Yunsheng Lin
  Cc: brouer, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf


On 15/06/2023 16.45, Alexander Duyck wrote:
[..]
> 
> What concerns me is that you seem to be taking the page pool API in a
> direction other than what it was really intended for. For any physical
> device you aren't going to necessarily know what size fragment you are
> working with until you have already allocated the page and DMA has
> been performed. That is why drivers such as the Mellanox driver are
> fragmenting in the driver instead of allocating pre-fragmented pages.
> 

+1

I share concerns with Alexander Duyck here. As the inventor and
maintainer, I can say this is taking the page_pool API in a direction I
didn't intend or plan for. As Alex also says, the intent was for
fixed-size memory chunks that are DMA ready.  The use-case was the physical
device RX "early demux problem", where the size is not known beforehand.

I need to be convinced this is a good direction to take the page_pool
design/architecture in... e.g. allocations with dynamic sizes.
Maybe it is a good idea, but as noted below, "consumers" of the API are
usually the way to show this is the case.

[...]
> 
> What I was getting at is that if you are going to add an API you have
> to have a consumer for the API. That is rule #1 for kernel API
> development. You don't add an API without a consumer for it. The changes
> you are making are to support some future implementation, and I see it
> breaking most of the existing implementation. That is my concern.
> 

You have mentioned veth as the use-case. I know I acked adding page_pool
use-case to veth, for when we need to convert an SKB into an
xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
In this case in veth, the size is known at the page allocation time.
Thus, using the page_pool API is wasting memory.  We did this for
performance reasons, but we are not using PP for what it was intended
for.  We mostly use page_pool because it is an existing recycle return
path, and we were too lazy to add another alloc-type (see enum
xdp_mem_type).

Maybe you/we can extend veth to use this dynamic size API, to show us
that this API is a better approach.  I will sign up for benchmarking
this (and coordinating with Maryam, CC'ed, as she came with the
use-case we improved on).

--Jesper



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-15 16:19             ` [PATCH net-next v3 3/4] page_pool: " Jesper Dangaard Brouer
@ 2023-06-16 11:57               ` Yunsheng Lin
  2023-06-16 16:31                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Yunsheng Lin @ 2023-06-16 11:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexander Duyck
  Cc: brouer, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf

On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

...

> You have mentioned veth as the use-case. I know I acked adding page_pool
> use-case to veth, for when we need to convert an SKB into an
> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
> In this case in veth, the size is known at the page allocation time.
> Thus, using the page_pool API is wasting memory.  We did this for
> performance reasons, but we are not using PP for what it was intended
> for.  We mostly use page_pool because it is an existing recycle return
> path, and we were too lazy to add another alloc-type (see enum
> xdp_mem_type).
> 
> Maybe you/we can extend veth to use this dynamic size API, to show us
> that this API is a better approach.  I will sign up for benchmarking
> this (and coordinating with Maryam, CC'ed, as she came with the
> use-case we improved on).

Thanks, let's find out if page pool is the right hammer for the
veth XDP case.

Below is the change for veth using the new API in this patch.
It is only compile tested, as I am not familiar enough with veth XDP
and the testing environment for it.
Please try it if it is helpful.

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..8850394f1d29 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
        if (skb_shared(skb) || skb_head_is_locked(skb) ||
            skb_shinfo(skb)->nr_frags ||
            skb_headroom(skb) < XDP_PACKET_HEADROOM) {
-               u32 size, len, max_head_size, off;
+               u32 size, len, max_head_size, off, truesize, page_offset;
                struct sk_buff *nskb;
                struct page *page;
                int i, head_off;
@@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
                        goto drop;

+               size = min_t(u32, skb->len, max_head_size);
+               truesize = size;
+
                /* Allocate skb head */
-               page = page_pool_dev_alloc_pages(rq->page_pool);
+               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
                if (!page)
                        goto drop;

-               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
+               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
                if (!nskb) {
                        page_pool_put_full_page(rq->page_pool, page, true);
                        goto drop;
@@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                skb_copy_header(nskb, skb);
                skb_mark_for_recycle(nskb);

-               size = min_t(u32, skb->len, max_head_size);
                if (skb_copy_bits(skb, 0, nskb->data, size)) {
                        consume_skb(nskb);
                        goto drop;
@@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                len = skb->len - off;

                for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
-                       page = page_pool_dev_alloc_pages(rq->page_pool);
+                       size = min_t(u32, len, PAGE_SIZE);
+                       truesize = size;
+
+                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
+                                                  &truesize);
                        if (!page) {
                                consume_skb(nskb);
                                goto drop;
                        }

-                       size = min_t(u32, len, PAGE_SIZE);
-                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
+                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
                        if (skb_copy_bits(skb, off, page_address(page),
                                          size)) {
                                consume_skb(nskb);


> 
> --Jesper
> 
> .
> 


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 11:57               ` Yunsheng Lin
@ 2023-06-16 16:31                 ` Jesper Dangaard Brouer
  2023-06-16 17:34                   ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-16 16:31 UTC (permalink / raw)
  To: Yunsheng Lin, Jesper Dangaard Brouer, Alexander Duyck
  Cc: brouer, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf



On 16/06/2023 13.57, Yunsheng Lin wrote:
> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>> You have mentioned veth as the use-case. I know I acked adding page_pool
>> use-case to veth, for when we need to convert an SKB into an
>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>> In this case in veth, the size is known at the page allocation time.
>> Thus, using the page_pool API is wasting memory.  We did this for
>> performance reasons, but we are not using PP for what it was intended
>> for.  We mostly use page_pool because it is an existing recycle return
>> path, and we were too lazy to add another alloc-type (see enum
>> xdp_mem_type).
>>
>> Maybe you/we can extend veth to use this dynamic size API, to show us
>> that this API is a better approach.  I will sign up for benchmarking
>> this (and coordinating with Maryam, CC'ed, as she came with the
>> use-case we improved on).
> 
> Thanks, let's find out if page pool is the right hammer for the
> veth XDP case.
> 
> Below is the change for veth using the new API in this patch.
> It is only compile tested, as I am not familiar enough with veth XDP
> and the testing environment for it.
> Please try it if it is helpful.
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 614f3e3efab0..8850394f1d29 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>          if (skb_shared(skb) || skb_head_is_locked(skb) ||
>              skb_shinfo(skb)->nr_frags ||
>              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> -               u32 size, len, max_head_size, off;
> +               u32 size, len, max_head_size, off, truesize, page_offset;
>                  struct sk_buff *nskb;
>                  struct page *page;
>                  int i, head_off;
> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>                          goto drop;
> 
> +               size = min_t(u32, skb->len, max_head_size);
> +               truesize = size;
> +
>                  /* Allocate skb head */
> -               page = page_pool_dev_alloc_pages(rq->page_pool);
> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);

Maybe rename API to:

  addr = netmem_alloc(rq->page_pool, &truesize);

>                  if (!page)
>                          goto drop;
> 
> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);

IMHO this illustrates that API is strange/funky.
(I think this is what Alex Duyck is also pointing out).

This is the memory (virtual) address "pointer":
  addr = page_address(page) + page_offset

This is what napi_build_skb() takes as input. (I looked at other users 
of napi_build_skb(), who all give a mem ptr "va" as the arg.)
So, why does your new API provide the "page" and not just the address?

As proposed above:
   addr = netmem_alloc(rq->page_pool, &truesize);

Maybe the API should be renamed, to indicate this isn't returning a "page"?
We have talked about the name "netmem" before.
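
A minimal sketch of what such a wrapper could look like on top of the
page_pool_dev_alloc() from this patch (the netmem_alloc name is just
the suggestion above, not an existing API):

	static inline void *netmem_alloc(struct page_pool *pool,
					 unsigned int *truesize)
	{
		unsigned int offset;
		struct page *page;

		/* frag-aware alloc; the caller only sees a virtual address */
		page = page_pool_dev_alloc(pool, &offset, truesize);
		if (!page)
			return NULL;

		return page_address(page) + offset;
	}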

>                  if (!nskb) {
>                          page_pool_put_full_page(rq->page_pool, page, true);
>                          goto drop;
> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  skb_copy_header(nskb, skb);
>                  skb_mark_for_recycle(nskb);
> 
> -               size = min_t(u32, skb->len, max_head_size);
>                  if (skb_copy_bits(skb, 0, nskb->data, size)) {
>                          consume_skb(nskb);
>                          goto drop;
> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  len = skb->len - off;
> 
>                  for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> +                       size = min_t(u32, len, PAGE_SIZE);
> +                       truesize = size;
> +
> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> +                                                  &truesize);
>                          if (!page) {
>                                  consume_skb(nskb);
>                                  goto drop;
>                          }
> 
> -                       size = min_t(u32, len, PAGE_SIZE);
> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);

Guess, this shows the opposite; that the "page" _is_ used by the 
existing API.

>                          if (skb_copy_bits(skb, off, page_address(page),
>                                            size)) {
>                                  consume_skb(nskb);

--Jesper



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 16:31                 ` Jesper Dangaard Brouer
@ 2023-06-16 17:34                   ` Alexander Duyck
  2023-06-16 18:41                     ` Jesper Dangaard Brouer
  2023-06-17 12:47                     ` Yunsheng Lin
  0 siblings, 2 replies; 14+ messages in thread
From: Alexander Duyck @ 2023-06-16 17:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Yunsheng Lin, brouer, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf

On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 16/06/2023 13.57, Yunsheng Lin wrote:
> > On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
> >
> > ...
> >
> >> You have mentioned veth as the use-case. I know I acked adding page_pool
> >> use-case to veth, for when we need to convert an SKB into an
> >> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
> >> In this case in veth, the size is known at the page allocation time.
> >> Thus, using the page_pool API is wasting memory.  We did this for
> >> performance reasons, but we are not using PP for what it was intended
> >> for.  We mostly use page_pool because it is an existing recycle return
> >> path, and we were too lazy to add another alloc-type (see enum
> >> xdp_mem_type).
> >>
> >> Maybe you/we can extend veth to use this dynamic size API, to show us
> >> that this API is a better approach.  I will sign up for benchmarking
> >> this (and coordinating with Maryam, CC'ed, as she came with the
> >> use-case we improved on).
> >
> > Thanks, let's find out if page pool is the right hammer for the
> > veth XDP case.
> >
> > Below is the change for veth using the new API in this patch.
> > It is only compile tested, as I am not familiar enough with veth XDP
> > and the testing environment for it.
> > Please try it if it is helpful.
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 614f3e3efab0..8850394f1d29 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >          if (skb_shared(skb) || skb_head_is_locked(skb) ||
> >              skb_shinfo(skb)->nr_frags ||
> >              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> > -               u32 size, len, max_head_size, off;
> > +               u32 size, len, max_head_size, off, truesize, page_offset;
> >                  struct sk_buff *nskb;
> >                  struct page *page;
> >                  int i, head_off;
> > @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
> >                          goto drop;
> >
> > +               size = min_t(u32, skb->len, max_head_size);
> > +               truesize = size;
> > +
> >                  /* Allocate skb head */
> > -               page = page_pool_dev_alloc_pages(rq->page_pool);
> > +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>
> Maybe rename API to:
>
>   addr = netmem_alloc(rq->page_pool, &truesize);
>
> >                  if (!page)
> >                          goto drop;
> >
> > -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> > +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>
> IMHO this illustrates that API is strange/funky.
> (I think this is what Alex Duyck is also pointing out).
>
> This is the memory (virtual) address "pointer":
>   addr = page_address(page) + page_offset
>
> This is what napi_build_skb() takes as input. (I looked at other users
> of napi_build_skb(), who all give a mem ptr "va" as the arg.)
> So, why does your new API provide the "page" and not just the address?
>
> As proposed above:
>    addr = netmem_alloc(rq->page_pool, &truesize);
>
> Maybe the API should be renamed, to indicate this isn't returning a "page"?
> We have talked about the name "netmem" before.

Yeah, this is more-or-less what I was getting at. Keep in mind this is
likely the most common case, since most frames passed back and forth
aren't usually much larger than 1500B.

> >                  if (!nskb) {
> >                          page_pool_put_full_page(rq->page_pool, page, true);
> >                          goto drop;
> > @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  skb_copy_header(nskb, skb);
> >                  skb_mark_for_recycle(nskb);
> >
> > -               size = min_t(u32, skb->len, max_head_size);
> >                  if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >                          consume_skb(nskb);
> >                          goto drop;
> > @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  len = skb->len - off;
> >
> >                  for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> > -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> > +                       size = min_t(u32, len, PAGE_SIZE);
> > +                       truesize = size;
> > +
> > +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> > +                                                  &truesize);
> >                          if (!page) {
> >                                  consume_skb(nskb);
> >                                  goto drop;
> >                          }
> >
> > -                       size = min_t(u32, len, PAGE_SIZE);
> > -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> > +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>
> Guess, this shows the opposite; that the "page" _is_ used by the
> existing API.

This is sort-of true. One thing that has come up as of late is that all
this stuff is being moved over to folios anyway and getting away from
pages. In addition I am not sure how often we are having to take this
path as I am not sure how many non-Tx frames end up having to have
fragments added to them. For something like veth it might be more
common though since Tx becomes Rx in this case.

One thought I had on this is that we could look at adding a new
function that abstracts this away and makes use of netmem instead.
Then the whole page/folio thing would be that much further removed.

One other question I have now that I look at this code as well. Why is
it using page_pool and not just a frag cache allocator, or pages
themselves? It doesn't seem like it has a DMA mapping to deal with
since this is essentially copy-break code. Seems problematic that
there is no DMA involved here at all. This could be more easily
handled with just a single page_frag_cache style allocator.
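
For reference, a minimal sketch of what that could look like in the
veth path, replacing the page_pool_dev_alloc_pages() call (the
xdp_frag_cache field is hypothetical; error handling and freeing are
elided):

	/* page_frag_cache internally sits on an order-3 (32KB) page and
	 * hands out fragments via refcount tricks, with no DMA mapping. */
	struct page_frag_cache *nc = &rq->xdp_frag_cache;
	unsigned int fragsz = XDP_PACKET_HEADROOM + size +
			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	void *va;

	va = page_frag_alloc(nc, fragsz, GFP_ATOMIC);
	if (!va)
		goto drop;
	nskb = napi_build_skb(va, fragsz);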


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 17:34                   ` Alexander Duyck
@ 2023-06-16 18:41                     ` Jesper Dangaard Brouer
  2023-06-16 18:47                       ` Jakub Kicinski
  2023-06-16 19:50                       ` Alexander Duyck
  2023-06-17 12:47                     ` Yunsheng Lin
  1 sibling, 2 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-16 18:41 UTC (permalink / raw)
  To: Alexander Duyck, Jesper Dangaard Brouer
  Cc: brouer, Yunsheng Lin, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf



On 16/06/2023 19.34, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 16/06/2023 13.57, Yunsheng Lin wrote:
>>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
>>>
>>> ...
>>>
>>>> You have mentioned veth as the use-case. I know I acked adding page_pool
>>>> use-case to veth, for when we need to convert an SKB into an
>>>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>>>> In this case in veth, the size is known at the page allocation time.
>>>> Thus, using the page_pool API is wasting memory.  We did this for
>>>> performance reasons, but we are not using PP for what it was intended
>>>> for.  We mostly use page_pool because it is an existing recycle return
>>>> path, and we were too lazy to add another alloc-type (see enum
>>>> xdp_mem_type).
>>>>
>>>> Maybe you/we can extend veth to use this dynamic size API, to show us
>>>> that this API is a better approach.  I will sign up for benchmarking
>>>> this (and coordinating with Maryam, CC'ed, as she came with the
>>>> use-case we improved on).
>>>
>>> Thanks, let's find out if page pool is the right hammer for the
>>> veth XDP case.
>>>
>>> Below is the change for veth using the new API in this patch.
>>> It is only compile tested, as I am not familiar enough with veth XDP
>>> and the testing environment for it.
>>> Please try it if it is helpful.
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>           if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>>               skb_shinfo(skb)->nr_frags ||
>>>               skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> -               u32 size, len, max_head_size, off;
>>> +               u32 size, len, max_head_size, off, truesize, page_offset;
>>>                   struct sk_buff *nskb;
>>>                   struct page *page;
>>>                   int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>>                           goto drop;
>>>
>>> +               size = min_t(u32, skb->len, max_head_size);
>>> +               truesize = size;
>>> +
>>>                   /* Allocate skb head */
>>> -               page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>>    addr = netmem_alloc(rq->page_pool, &truesize);
>>
>>>                   if (!page)
>>>                           goto drop;
>>>
>>> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>>    addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb(), who all give a mem ptr "va" as the arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>>     addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
> 
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case, since most frames passed back and forth
> aren't usually much larger than 1500B.
> 

Good to get it confirmed that this is "more-or-less" your suggestion/direction.


>>>                   if (!nskb) {
>>>                           page_pool_put_full_page(rq->page_pool, page, true);
>>>                           goto drop;
>>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   skb_copy_header(nskb, skb);
>>>                   skb_mark_for_recycle(nskb);
>>>
>>> -               size = min_t(u32, skb->len, max_head_size);
>>>                   if (skb_copy_bits(skb, 0, nskb->data, size)) {
>>>                           consume_skb(nskb);
>>>                           goto drop;
>>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   len = skb->len - off;
>>>
>>>                   for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
>>> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +                       size = min_t(u32, len, PAGE_SIZE);
>>> +                       truesize = size;
>>> +
>>> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
>>> +                                                  &truesize);
>>>                           if (!page) {
>>>                                   consume_skb(nskb);
>>>                                   goto drop;
>>>                           }
>>>
>>> -                       size = min_t(u32, len, PAGE_SIZE);
>>> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
>>> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>>
>> Guess, this shows the opposite; that the "page" _is_ used by the
>> existing API.
> 
> This is sort-of true. One thing that has come up as of late is that all
> this stuff is being moved over to folios anyway and getting away from
> pages. In addition I am not sure how often we are having to take this
> path as I am not sure how many non-Tx frames end up having to have
> fragments added to them. For something like veth it might be more
> common though since Tx becomes Rx in this case.

I'm thinking that it is very unlikely that XDP has modified the
fragments.  So, why are we allocating and copying the fragments?
Wouldn't it be possible for this veth code to bump the refcnt on these
fragments? (maybe I missed some detail).

> 
> One thought I had on this is that we could look at adding a new
> function that abstracts this away and makes use of netmem instead.
> Then the whole page/folio thing would be that much further removed.

I like this "thought" of yours :-)

> 
> One other question I have now that I look at this code as well. Why is
> it using page_pool and not just a frag cache allocator, or pages
> themselves? It doesn't seem like it has a DMA mapping to deal with
> since this is essentially copy-break code. Seems problematic that
> there is no DMA involved here at all. This could be more easily
> handled with just a single page_frag_cache style allocator.
> 

Yes, precisely.
I distinctly remember that I tried to poke you and Eric on this approach
earlier, but I cannot find a link to that email.

I would really appreciate it if you, Alex, could give the approach in
veth_convert_skb_to_xdp_buff() some review, as I believe there is a huge
potential for improvements that will lead to large performance
gains. (I'm sure Maryam will be eager to help re-test performance
for her use-cases).

--Jesper



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 18:41                     ` Jesper Dangaard Brouer
@ 2023-06-16 18:47                       ` Jakub Kicinski
  2023-06-16 19:50                       ` Alexander Duyck
  1 sibling, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2023-06-16 18:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, brouer, Yunsheng Lin, davem, pabeni, netdev,
	linux-kernel, Lorenzo Bianconi, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Maryam Tahhan, bpf

On Fri, 16 Jun 2023 20:41:45 +0200 Jesper Dangaard Brouer wrote:
> > This is sort-of true. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition I am not sure how often we are having to take this
> > path as I am not sure how many non-Tx frames end up having to have
> > fragments added to them. For something like veth it might be more
> > common though since Tx becomes Rx in this case.  
> 
> > I'm thinking that it is very unlikely that XDP has modified the
> fragments.  So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (maybe I missed some detail).

They may be page cache pages, AFAIU.


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 18:41                     ` Jesper Dangaard Brouer
  2023-06-16 18:47                       ` Jakub Kicinski
@ 2023-06-16 19:50                       ` Alexander Duyck
  2023-06-18 15:05                         ` Lorenzo Bianconi
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2023-06-16 19:50 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, Yunsheng Lin, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf

On Fri, Jun 16, 2023 at 11:41 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 16/06/2023 19.34, Alexander Duyck wrote:
> > On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 16/06/2023 13.57, Yunsheng Lin wrote:
> >>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

[...]

>
>
> >>>                   if (!nskb) {
> >>>                           page_pool_put_full_page(rq->page_pool, page, true);
> >>>                           goto drop;
> >>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>>                   skb_copy_header(nskb, skb);
> >>>                   skb_mark_for_recycle(nskb);
> >>>
> >>> -               size = min_t(u32, skb->len, max_head_size);
> >>>                   if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >>>                           consume_skb(nskb);
> >>>                           goto drop;
> >>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>>                   len = skb->len - off;
> >>>
> >>>                   for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> >>> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> >>> +                       size = min_t(u32, len, PAGE_SIZE);
> >>> +                       truesize = size;
> >>> +
> >>> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> >>> +                                                  &truesize);
> >>>                           if (!page) {
> >>>                                   consume_skb(nskb);
> >>>                                   goto drop;
> >>>                           }
> >>>
> >>> -                       size = min_t(u32, len, PAGE_SIZE);
> >>> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> >>> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
> >>
> >> Guess, this shows the opposite; that the "page" _is_ used by the
> >> existing API.
> >
> > This is sort-of true. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition I am not sure how often we are having to take this
> > path as I am not sure how many non-Tx frames end up having to have
> > fragments added to them. For something like veth it might be more
> > common though since Tx becomes Rx in this case.
>
> I'm thinking that it is very unlikely that XDP has modified the
> fragments.  So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (maybe I missed some detail).

From what I can tell this is an exception case with multiple caveats
for shared, locked, or nonlinear frames.

As such, I suspect there may be some deprecated cases in there too,
since XDP multi-buf support has been around for a while, so the code
shouldn't need to reallocate to linearize a nonlinear frame.

> >
> > One other question I have now that I look at this code as well. Why is
> > it using page_pool and not just a frag cache allocator, or pages
> > themselves? It doesn't seem like it has a DMA mapping to deal with
> > since this is essentially copy-break code. Seems problematic that
> > there is no DMA involved here at all. This could be more easily
> > handled with just a single page_frag_cache style allocator.
> >
>
> Yes, precisely.
> I distinctly remember that I tried to poke you and Eric on this approach
> earlier, but I cannot find a link to that email.
>
> I would really appreciate it if you, Alex, could give the approach in
> veth_convert_skb_to_xdp_buff() some review, as I believe there is a huge
> potential for improvements that will lead to large performance
> gains. (I'm sure Maryam will be eager to help re-test performance
> for her use-cases).

Well, just looking at it, the quick and dirty answer would be to look at
making use of something like page_frag_cache. I won't go into details
since it isn't too different from the frag allocator, but it is much
simpler since it is just doing reference count hacks instead of having
to do the extra overhead to keep the DMA mapping in place. The veth
would then just be sitting on at most an order 3 page while it is
waiting to fully consume it rather than waiting on a full pool of
pages.

Alternatively it could do something similar to page_frag_alloc_align
itself and just bypass doing a custom allocator. If it went that route
it could do something almost like a ring buffer and greatly improve
the throughput since it would be able to allocate a higher order page
and just copy the entire skb in so the entire thing would be linear
rather than having to allocate a bunch of single pages.
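
A rough sketch of that linearizing variant (names illustrative, error
paths and frag freeing elided):

	/* Copy the whole skb into one linear frag carved from a
	 * higher-order page, so nothing ends up in skb frags. */
	unsigned int truesize = XDP_PACKET_HEADROOM + skb->len +
				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	void *va = page_frag_alloc(&rq->xdp_frag_cache, truesize, GFP_ATOMIC);

	if (!va)
		goto drop;
	nskb = napi_build_skb(va, truesize);
	if (!nskb)
		goto drop;
	skb_reserve(nskb, XDP_PACKET_HEADROOM);
	if (skb_copy_bits(skb, 0, __skb_put(nskb, skb->len), skb->len))
		goto drop;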


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 17:34                   ` Alexander Duyck
  2023-06-16 18:41                     ` Jesper Dangaard Brouer
@ 2023-06-17 12:47                     ` Yunsheng Lin
  1 sibling, 0 replies; 14+ messages in thread
From: Yunsheng Lin @ 2023-06-17 12:47 UTC (permalink / raw)
  To: Alexander Duyck, Jesper Dangaard Brouer
  Cc: Yunsheng Lin, brouer, davem, kuba, pabeni, netdev, linux-kernel,
	Lorenzo Bianconi, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Maryam Tahhan, bpf

On 2023/6/17 1:34, Alexander Duyck wrote:
...

>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>          if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>>              skb_shinfo(skb)->nr_frags ||
>>>              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> -               u32 size, len, max_head_size, off;
>>> +               u32 size, len, max_head_size, off, truesize, page_offset;
>>>                  struct sk_buff *nskb;
>>>                  struct page *page;
>>>                  int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>>                          goto drop;
>>>
>>> +               size = min_t(u32, skb->len, max_head_size);
>>> +               truesize = size;
>>> +
>>>                  /* Allocate skb head */
>>> -               page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>>   addr = netmem_alloc(rq->page_pool, &truesize);

Unless we create a subsystem called netmem, I am not sure about
the 'netmem' name; it seems more confusing to use it here.

>>
>>>                  if (!page)
>>>                          goto drop;
>>>
>>> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>>   addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb(), who all give a mem ptr "va" as the arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>>    addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
> 
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case, since most frames passed back and forth
> aren't usually much larger than 1500B.

I do feel the pain here; that is why I used a per-CPU 'struct
page_pool_frag' to report the result back to the user, so that
we could report both 'va' and 'page' to the user in the RFC of
this patchset.

IMHO, compared to the above point, it is more important that
we have a unified implementation for both of them, instead of
a page frag implementation based on the page allocator.

Currently there are three implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   xdp frames, and its implementation seems to be a mix of
   the above two.

Actually I have a patchset to remove the third one, waiting
to be sent out after this one.

And I wonder if the first and second ones can be unified into
one, as it seems the only user-facing difference is that one
returns a va and the other returns a page. Other differences
seem to be implementation specific; for example, one does
offset incrementing and the other does offset decrementing.
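
To make the user-facing difference concrete, a sketch of the two
existing call styles (sizes illustrative):

	/* 1. rx-style (mm/page_alloc.c): returns a kernel virtual address */
	struct page_frag_cache nc = {};
	void *va = page_frag_alloc(&nc, 2048, GFP_ATOMIC);

	/* 2. tx-style (net/core/sock.c): fills in page + offset + size */
	struct page_frag pfrag = {};
	if (skb_page_frag_refill(2048, &pfrag, GFP_KERNEL)) {
		/* use pfrag.page at pfrag.offset, then advance the offset */
	}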


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-16 19:50                       ` Alexander Duyck
@ 2023-06-18 15:05                         ` Lorenzo Bianconi
  2023-06-20 16:19                           ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Bianconi @ 2023-06-18 15:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, brouer, Yunsheng Lin, davem, kuba,
	pabeni, netdev, linux-kernel, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Maryam Tahhan, bpf


[...]
> >
> > Yes, precisely.
> > I distinctly remember that I tried to poke you and Eric on this approach
> > earlier, but I cannot find a link to that email.
> >
> > I would really appreciate it if you, Alex, could give the approach in
> > veth_convert_skb_to_xdp_buff() some review, as I believe there is a huge
> > potential for improvements that will lead to large performance
> > gains. (I'm sure Maryam will be eager to help re-test performance
> > for her use-cases).
> 
> Well, just looking at it, the quick and dirty answer would be to look at
> making use of something like page_frag_cache. I won't go into details
> since it isn't too different from the frag allocator, but it is much
> simpler since it is just doing reference count hacks instead of having
> to do the extra overhead to keep the DMA mapping in place. The veth
> would then just be sitting on at most an order 3 page while it is
> waiting to fully consume it rather than waiting on a full pool of
> pages.

Hi,

I did some experiments using page_frag_cache/page_frag_alloc() instead of
page_pools in a simple environment I used to test XDP for the veth driver.
In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
the page_frag_cache in order to copy the full skb into the new one, actually
"linearizing" the packet (since we know the original skb length).
I ran an iperf TCP connection over a veth pair where the
remote device runs the xdp_rxq_info sample (available in the kernel source
tree, with action XDP_PASS):

TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server

net-next (page_pool):
- MTU 1500B: ~  7.5 Gbps
- MTU 8000B: ~ 15.3 Gbps

net-next + page_frag_alloc:
- MTU 1500B: ~  8.4 Gbps
- MTU 8000B: ~ 14.7 Gbps

It seems there is no clear "win" situation here (at least in this environment
and with this simple approach). Moreover:
- can the linearization introduce any issue whenever we perform XDP_REDIRECT
  into a destination device?
- can the page_frag_cache introduce more memory fragmentation (IIRC we were
  experiencing this issue in mt76 before switching to page_pools).

What do you think?

Regards,
Lorenzo

> 
> Alternatively it could do something similar to page_frag_alloc_align
> itself and just bypass doing a custom allocator. If it went that route
> it could do something almost like a ring buffer and greatly improve
> the throughput since it would be able to allocate a higher order page
> and just copy the entire skb in so the entire thing would be linear
> rather than having to allocate a bunch of single pages.



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-18 15:05                         ` Lorenzo Bianconi
@ 2023-06-20 16:19                           ` Alexander Duyck
  2023-06-20 21:16                             ` Lorenzo Bianconi
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2023-06-20 16:19 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Jesper Dangaard Brouer, brouer, Yunsheng Lin, davem, kuba,
	pabeni, netdev, linux-kernel, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Maryam Tahhan, bpf

On Sun, Jun 18, 2023 at 8:05 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> [...]
> > >
> > > Yes, precisely.
> > > I distinctly remember that I tried to poke you and Eric on this approach
> > > earlier, but I cannot find a link to that email.
> > >
> > > I would really appreciate it if you, Alex, could give the approach in
> > > veth_convert_skb_to_xdp_buff() some review, as I believe there is a huge
> > > potential for improvements that will lead to large performance
> > > gains. (I'm sure Maryam will be eager to help re-test performance
> > > for her use-cases).
> >
> > Well, just looking at it, the quick and dirty answer would be to look at
> > making use of something like page_frag_cache. I won't go into details
> > since it isn't too different from the frag allocator, but it is much
> > simpler since it is just doing reference count hacks instead of having
> > to do the extra overhead to keep the DMA mapping in place. The veth
> > would then just be sitting on at most an order 3 page while it is
> > waiting to fully consume it rather than waiting on a full pool of
> > pages.
>
> Hi,
>
> I did some experiments using page_frag_cache/page_frag_alloc() instead of
> page_pools in a simple environment I used to test XDP for the veth driver.
> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> the page_frag_cache in order to copy the full skb into the new one, actually
> "linearizing" the packet (since we know the original skb length).
> I ran an iperf TCP connection over a veth pair where the
> remote device runs the xdp_rxq_info sample (available in the kernel source
> tree, with action XDP_PASS):
>
> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>
> net-next (page_pool):
> - MTU 1500B: ~  7.5 Gbps
> - MTU 8000B: ~ 15.3 Gbps
>
> net-next + page_frag_alloc:
> - MTU 1500B: ~  8.4 Gbps
> - MTU 8000B: ~ 14.7 Gbps
>
> It seems there is no clear "win" situation here (at least in this environment
> and with this simple approach). Moreover:

For the 1500B packets it is a win, but for 8000B it looks like there
is a regression. Any idea what is causing it?

> - can the linearization introduce any issue whenever we perform XDP_REDIRECT
>   into a destination device?

It shouldn't. If it does it would probably point to an issue w/ the
destination driver rather than an issue with the code doing this.

> - can the page_frag_cache introduce more memory fragmentation (IIRC we were
>   experiencing this issue in mt76 before switching to page_pools).

I think it largely depends on where the packets are ending up. I know
this is the approach we are using for sockets, see
skb_page_frag_refill(). If nothing else, if you took a similar
approach to it you might be able to bypass the need for the
page_frag_cache itself, although you would likely still end up
allocating similar structures.


* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-20 16:19                           ` Alexander Duyck
@ 2023-06-20 21:16                             ` Lorenzo Bianconi
  2023-06-21 11:55                               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Bianconi @ 2023-06-20 21:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, brouer, Yunsheng Lin, davem, kuba,
	pabeni, netdev, linux-kernel, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Maryam Tahhan, bpf


[...]

> > I did some experiments using page_frag_cache/page_frag_alloc() instead of
> > page_pools in a simple environment I used to test XDP for the veth driver.
> > In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> > the page_frag_cache in order to copy the full skb into the new one, actually
> > "linearizing" the packet (since we know the original skb length).
> > I ran an iperf TCP connection over a veth pair where the
> > remote device runs the xdp_rxq_info sample (available in the kernel source
> > tree, with action XDP_PASS):
> >
> > TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
> >
> > net-next (page_pool):
> > - MTU 1500B: ~  7.5 Gbps
> > - MTU 8000B: ~ 15.3 Gbps
> >
> > net-next + page_frag_alloc:
> > - MTU 1500B: ~  8.4 Gbps
> > - MTU 8000B: ~ 14.7 Gbps
> >
> > It seems there is no clear "win" situation here (at least in this environment
> > and with this simple approach). Moreover:
> 
> For the 1500B packets it is a win, but for 8000B it looks like there
> is a regression. Any idea what is causing it?

nope, I have not looked into it yet.

> 
> > - can the linearization introduce any issue whenever we perform XDP_REDIRECT
> >   into a destination device?
> 
> It shouldn't. If it does it would probably point to an issue w/ the
> destination driver rather than an issue with the code doing this.

ack, fine.

> 
> > - can the page_frag_cache introduce more memory fragmentation (IIRC we were
> >   experiencing this issue in mt76 before switching to page_pools).
> 
> I think it largely depends on where the packets are ending up. I know
> this is the approach we are using for sockets, see
> skb_page_frag_refill(). If nothing else, if you took a similar
> approach to it you might be able to bypass the need for the
> page_frag_cache itself, although you would likely still end up
> allocating similar structures.

ack.

Regards,
Lorenzo



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-20 21:16                             ` Lorenzo Bianconi
@ 2023-06-21 11:55                               ` Jesper Dangaard Brouer
  2023-06-24 14:44                                 ` Yunsheng Lin
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-21 11:55 UTC (permalink / raw)
  To: Lorenzo Bianconi, Alexander Duyck
  Cc: brouer, Jesper Dangaard Brouer, Yunsheng Lin, davem, kuba,
	pabeni, netdev, linux-kernel, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Maryam Tahhan, bpf



On 20/06/2023 23.16, Lorenzo Bianconi wrote:
> [...]
> 
>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>> page_pools in a simple environment I used to test XDP for the veth driver.
>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>> the page_frag_cache in order to copy the full skb into the new one, actually
>>> "linearizing" the packet (since we know the original skb length).
>>> I ran an iperf TCP connection over a veth pair where the
>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>> tree, with action XDP_PASS):
>>>
>>> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>
>>> net-next (page_pool):
>>> - MTU 1500B: ~  7.5 Gbps
>>> - MTU 8000B: ~ 15.3 Gbps
>>>
>>> net-next + page_frag_alloc:
>>> - MTU 1500B: ~  8.4 Gbps
>>> - MTU 8000B: ~ 14.7 Gbps
>>>
>>> It seems there is no clear "win" situation here (at least in this environment
>>> and with this simple approach). Moreover:
>>
>> For the 1500B packets it is a win, but for 8000B it looks like there
>> is a regression. Any idea what is causing it?
> 
> nope, I have not looked into it yet.
> 

I think I can explain this using micro-benchmark numbers.
(Lorenzo and I have discussed this over IRC, so this is our summary)

*** MTU 1500***

* The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):

The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
ptr_ring cost 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).

The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
for packets.  The order-3 alloc + free cost 514 cycles (page_bench01:
alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
amortized cost per packet is only 34.3 cycles (514/15).

Thus, this explains why the page_frag_alloc API has an advantage here,
as the amortized cost per packet is lower (for page_frag_alloc).


*** MTU 8000 ***

* The MTU 8000 case, where PP is faster than page_frag_alloc.

The page_frag_alloc API cannot slice the same 32KB into as many packets.
The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This can
only store 3 full packets (32768/8590 = 3.81).
Thus, cost is 514/3 = 171 cycles.

The PP is actually challenged at MTU 8000, because it unfortunately
leads to allocating 3 full pages (12KiB), due to needed alloc size 8590
bytes. Thus cost is 3x 48 cycles = 144 cycles.
(There is also a chance of Jakub's "allow_direct" optimization in 
page_pool_return_skb_page to increase performance for PP).

Thus, this explains why PP is fastest in this case.


*** Surprising insights ***

My (maybe) surprising conclusion is that we should combine the two
approaches.  Which is basically what Lin's patchset is doing!
Thus, I've actually suddenly become a fan of this patchset...

The insight is that PP can also work with higher-order pages and the
cost of PP recycles via ptr_ring will be the same, regardless of page
order size.  Thus, we can reduce the order-3 cost of 514 cycles to
basically 48 cycles, and fit 15 packets (MTU 1500), resulting in an
amortized allocator cost of 48/15 = 3.2 cycles.

On the PP alloc-side this will be amazingly fast. On the PP frag
recycle side, see page_pool_defrag_page(), there is an atomic_sub
operation.
I've measured atomic_inc to cost 17 cycles (for the optimal
non-contended case); thus at 3+17 = 20 cycles it should still be a win.
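
To put the cycle numbers above side by side (the order-3 frag row is
the combined approach; no MTU 8000 number was measured for it):

                              MTU 1500            MTU 8000
  page_frag_alloc (order-3)   514/15 ~= 34.3      514/3 ~= 171
  page_pool (order-0 pages)   48                  3 x 48 = 144
  page_pool (order-3 frags)   48/15 = 3.2 (+17)   (not measured)

(cycles per packet; "+17" is the atomic_sub on the frag recycle side)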


--Jesper



* Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
  2023-06-21 11:55                               ` Jesper Dangaard Brouer
@ 2023-06-24 14:44                                 ` Yunsheng Lin
  0 siblings, 0 replies; 14+ messages in thread
From: Yunsheng Lin @ 2023-06-24 14:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Lorenzo Bianconi, Alexander Duyck
  Cc: brouer, Yunsheng Lin, davem, kuba, pabeni, netdev, linux-kernel,
	Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Maryam Tahhan, bpf

On 2023/6/21 19:55, Jesper Dangaard Brouer wrote:
> 
> 
> On 20/06/2023 23.16, Lorenzo Bianconi wrote:
>> [...]
>>
>>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>>> page_pools in a simple environment I used to test XDP for the veth driver.
>>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>>> the page_frag_cache in order to copy the full skb into the new one, actually
>>>> "linearizing" the packet (since we know the original skb length).
>>>> I ran an iperf TCP connection over a veth pair where the
>>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>>> tree, with action XDP_PASS):
>>>>
>>>> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>>
>>>> net-next (page_pool):
>>>> - MTU 1500B: ~  7.5 Gbps
>>>> - MTU 8000B: ~ 15.3 Gbps
>>>>
>>>> net-next + page_frag_alloc:
>>>> - MTU 1500B: ~  8.4 Gbps
>>>> - MTU 8000B: ~ 14.7 Gbps
>>>>
>>>> It seems there is no clear "win" situation here (at least in this environment
>>>> and with this simple approach). Moreover:
>>>
>>> For the 1500B packets it is a win, but for 8000B it looks like there
>>> is a regression. Any idea what is causing it?
>>
>> nope, I have not looked into it yet.
>>
> 
> I think I can explain this using micro-benchmark numbers.
> (Lorenzo and I have discussed this over IRC, so this is our summary)
> 
> *** MTU 1500***
> 
> * The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):
> 
> The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
> ptr_ring cost 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).
> 
> The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
> for packets.  The order-3 alloc + free cost 514 cycles (page_bench01:
> alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
> 1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
> amortized cost per packet is only 34.3 cycles (514/15).
> 
> Thus, this explains why the page_frag_alloc API has an advantage here,
> as the amortized cost per packet is lower (for page_frag_alloc).
> 
> 
> *** MTU 8000 ***
> 
> * The MTU 8000 case, where PP is faster than page_frag_alloc.
> 
> The page_frag_alloc API cannot slice the same 32KB into as many packets.
> The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This can
> only store 3 full packets (32768/8590 = 3.81).
> Thus, cost is 514/3 = 171 cycles.
> 
> The PP is actually challenged at MTU 8000, because it unfortunately
> leads to allocating 3 full pages (12KiB), due to needed alloc size 8590
> bytes. Thus cost is 3x 48 cycles = 144 cycles.
> (There is also a chance of Jakub's "allow_direct" optimization in page_pool_return_skb_page to increase performance for PP).
> 
> Thus, this explains why PP is fastest in this case.

Great analysis.
So the problem seems to be: can we optimize the page fragment cache
implementation so that it can at least match the performance of PP
for the above case? Alexander seems to be against using PP for the
veth case when no DMA mapping is involved.

> 
> 
> *** Surprising insights ***
> 
> My (maybe) surprising conclusion is that we should combine the two
> approaches.  Which is basically what Lin's patchset is doing!
> Thus, I've actually suddenly become a fan of this patchset...
> 
> The insight is that PP can also work with higher-order pages and the
> cost of PP recycles via ptr_ring will be the same, regardless of page
> order size.  Thus, we can reduce the order-3 cost of 514 cycles to
> basically 48 cycles, and fit 15 packets (MTU 1500), resulting in an
> amortized allocator cost of 48/15 = 3.2 cycles.
> 
> On the PP alloc-side this will be amazingly fast. On the PP frag
> recycle side, see page_pool_defrag_page(), there is an atomic_sub
> operation.
> I've measured atomic_inc to cost 17 cycles (for the optimal
> non-contended case); thus at 3+17 = 20 cycles it should still be a win.
> 
> 
> --Jesper
> 
> 

