All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Wang <jasowang@redhat.com>
To: wangyunjian <wangyunjian@huawei.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Paolo Abeni <pabeni@redhat.com>,
	 "willemdebruijn.kernel@gmail.com"
	<willemdebruijn.kernel@gmail.com>,
	"kuba@kernel.org" <kuba@kernel.org>,
	 "bjorn@kernel.org" <bjorn@kernel.org>,
	"magnus.karlsson@intel.com" <magnus.karlsson@intel.com>,
	 "maciej.fijalkowski@intel.com" <maciej.fijalkowski@intel.com>,
	 "jonathan.lemon@gmail.com" <jonathan.lemon@gmail.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	 "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	 "virtualization@lists.linux.dev"
	<virtualization@lists.linux.dev>, xudingke <xudingke@huawei.com>,
	 "liwei (DT)" <liwei395@huawei.com>
Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
Date: Tue, 12 Mar 2024 14:07:30 +0800	[thread overview]
Message-ID: <CACGkMEtG40J1zZG4nSvviw4MqX+RPOVuHG9PHR-PiYcZLj38CQ@mail.gmail.com> (raw)
In-Reply-To: <d44e32c757014b38bb64d92af683b128@huawei.com>

On Mon, Mar 11, 2024 at 9:28 PM wangyunjian <wangyunjian@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Monday, March 11, 2024 12:01 PM
> > To: wangyunjian <wangyunjian@huawei.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>; Paolo Abeni <pabeni@redhat.com>;
> > willemdebruijn.kernel@gmail.com; kuba@kernel.org; bjorn@kernel.org;
> > magnus.karlsson@intel.com; maciej.fijalkowski@intel.com;
> > jonathan.lemon@gmail.com; davem@davemloft.net; bpf@vger.kernel.org;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>; liwei (DT)
> > <liwei395@huawei.com>
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Mon, Mar 4, 2024 at 9:45 PM wangyunjian <wangyunjian@huawei.com>
> > wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > > Sent: Friday, March 1, 2024 7:53 PM
> > > > To: wangyunjian <wangyunjian@huawei.com>
> > > > Cc: Paolo Abeni <pabeni@redhat.com>;
> > > > willemdebruijn.kernel@gmail.com; jasowang@redhat.com;
> > > > kuba@kernel.org; bjorn@kernel.org; magnus.karlsson@intel.com;
> > > > maciej.fijalkowski@intel.com; jonathan.lemon@gmail.com;
> > > > davem@davemloft.net; bpf@vger.kernel.org; netdev@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > > > virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>;
> > > > liwei (DT) <liwei395@huawei.com>
> > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > support
> > > >
> > > > On Fri, Mar 01, 2024 at 11:45:52AM +0000, wangyunjian wrote:
> > > > > > -----Original Message-----
> > > > > > From: Paolo Abeni [mailto:pabeni@redhat.com]
> > > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > > To: wangyunjian <wangyunjian@huawei.com>; mst@redhat.com;
> > > > > > willemdebruijn.kernel@gmail.com; jasowang@redhat.com;
> > > > > > kuba@kernel.org; bjorn@kernel.org; magnus.karlsson@intel.com;
> > > > > > maciej.fijalkowski@intel.com; jonathan.lemon@gmail.com;
> > > > > > davem@davemloft.net
> > > > > > Cc: bpf@vger.kernel.org; netdev@vger.kernel.org;
> > > > > > linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > > > > > virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>;
> > > > > > liwei (DT) <liwei395@huawei.com>
> > > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > > > support
> > > > > >
> > > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > >         }
> > > > > > >  }
> > > > > > >
> > > > > > > +static void tun_peek_xsk(struct tun_file *tfile) {
> > > > > > > +       struct xsk_buff_pool *pool;
> > > > > > > +       u32 i, batch, budget;
> > > > > > > +       void *frame;
> > > > > > > +
> > > > > > > +       if (!ptr_ring_empty(&tfile->tx_ring))
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       spin_lock(&tfile->pool_lock);
> > > > > > > +       pool = tfile->xsk_pool;
> > > > > > > +       if (!pool) {
> > > > > > > +               spin_unlock(&tfile->pool_lock);
> > > > > > > +               return;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       if (tfile->nb_descs) {
> > > > > > > +               xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > > +               if (xsk_uses_need_wakeup(pool))
> > > > > > > +                       xsk_set_tx_need_wakeup(pool);
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       spin_lock(&tfile->tx_ring.producer_lock);
> > > > > > > +       budget = min_t(u32, tfile->tx_ring.size,
> > > > > > > + TUN_XDP_BATCH);
> > > > > > > +
> > > > > > > +       batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > > +       if (!batch) {
> > > > > >
> > > > > > This branch looks like an unneeded "optimization". The generic
> > > > > > loop below should have the same effect with no measurable perf
> > > > > > delta - and
> > > > smaller code.
> > > > > > Just remove this.
> > > > > >
> > > > > > > +               tfile->nb_descs = 0;
> > > > > > > +               spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +               spin_unlock(&tfile->pool_lock);
> > > > > > > +               return;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       tfile->nb_descs = batch;
> > > > > > > +       for (i = 0; i < batch; i++) {
> > > > > > > +               /* Encode the XDP DESC flag into lowest bit
> > > > > > > + for consumer to
> > > > differ
> > > > > > > +                * XDP desc from XDP buffer and sk_buff.
> > > > > > > +                */
> > > > > > > +               frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > > > > +               /* The budget must be less than or equal to
> > tx_ring.size,
> > > > > > > +                * so enqueuing will not fail.
> > > > > > > +                */
> > > > > > > +               __ptr_ring_produce(&tfile->tx_ring, frame);
> > > > > > > +       }
> > > > > > > +       spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +       spin_unlock(&tfile->pool_lock);
> > > > > >
> > > > > > More related to the general design: it looks wrong. What if
> > > > > > get_rx_bufs() will fail (ENOBUF) after successful peeking? With
> > > > > > no more incoming packets, later peek will return 0 and it looks
> > > > > > like that the half-processed packets will stay in the ring forever???
> > > > > >
> > > > > > I think the 'ring produce' part should be moved into tun_do_read().
> > > > >
> > > > > Currently, the vhost-net obtains a batch descriptors/sk_buffs from
> > > > > the ptr_ring and enqueue the batch descriptors/sk_buffs to the
> > > > > virtqueue'queue, and then consumes the descriptors/sk_buffs from
> > > > > the virtqueue'queue in sequence. As a result, TUN does not know
> > > > > whether the batch descriptors have been used up, and thus does not
> > > > > know when to
> > > > return the batch descriptors.
> > > > >
> > > > > So, I think it's reasonable that when vhost-net checks ptr_ring is
> > > > > empty, it calls peek_len to get new xsk's descs and return the descriptors.
> > > > >
> > > > > Thanks
> > > >
> > > > What you need to think about is that if you peek, another call in
> > > > parallel can get the same value at the same time.
> > >
> > > Thank you. I have identified a problem. The tx_descs array was created within
> > xsk's pool.
> > > When xsk is freed, the pool and tx_descs are also freed. Howerver,
> > > some descs may remain in the virtqueue'queue, which could lead to a
> > use-after-free scenario.
> >
> > This can probably solving by when xsk pool is disabled, signal the vhost_net to
> > drop those descriptors.
>
> I think TUN can notify vhost_net to drop these descriptors through netdev events.

Great, actually, the "issue" described above exist in this patch as
well. For example, you did:

                        spin_lock(&tfile->pool_lock);
                        if (tfile->pool) {
                              ret = tun_put_user_desc(tun, tfile,
&tfile->desc, to);

You did copy_to_user() under spinlock which is actually a bug.

> However, there is a potential concurrency problem. When handling netdev events
> and packets, vhost_net preempts the 'vq->mutex_lock', leading to unstable performance.

I think we don't need to care the perf in this case.

And we gain a lot:

1) no trick in peek
2) batching support
...

Thanks

>
> Thanks
> >
> > Thanks
> >
> > > Currently,
> > > I do not have an idea to solve this concurrency problem and believe
> > > this scenario may not be appropriate for reusing the ptr_ring.
> > >
> > > Thanks
> > >
> > > >
> > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Paolo
> > > > >
> > >
>


  reply	other threads:[~2024-03-12  6:07 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-28 11:05 [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support Yunjian Wang
2024-02-29 10:24 ` kernel test robot
2024-02-29 11:12 ` Paolo Abeni
2024-02-29 13:15   ` wangyunjian
2024-03-01 11:45   ` wangyunjian
2024-03-01 11:53     ` Michael S. Tsirkin
2024-03-04 13:45       ` wangyunjian
2024-03-11  4:00         ` Jason Wang
2024-03-11 13:27           ` wangyunjian
2024-03-12  6:07             ` Jason Wang [this message]
2024-02-29 15:44 ` kernel test robot
2024-03-01 14:11 ` Maciej Fijalkowski
2024-03-01 18:40   ` Willem de Bruijn
2024-03-06  5:32     ` Jason Wang
2024-03-04  6:55 ` Jason Wang
2024-03-04 11:23   ` wangyunjian
2024-03-06  2:11     ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACGkMEtG40J1zZG4nSvviw4MqX+RPOVuHG9PHR-PiYcZLj38CQ@mail.gmail.com \
    --to=jasowang@redhat.com \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=jonathan.lemon@gmail.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liwei395@huawei.com \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=virtualization@lists.linux.dev \
    --cc=wangyunjian@huawei.com \
    --cc=willemdebruijn.kernel@gmail.com \
    --cc=xudingke@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.