From: Martin KaFai Lau <kafai@fb.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: "Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Hangbin Liu" <liuhangbin@gmail.com>,
	bpf@vger.kernel.org, netdev@vger.kernel.org,
	"Jiri Benc" <jbenc@redhat.com>,
	"Eelco Chaudron" <echaudro@redhat.com>,
	ast@kernel.org, "Daniel Borkmann" <daniel@iogearbox.net>,
	"Lorenzo Bianconi" <lorenzo.bianconi@redhat.com>,
	"David Ahern" <dsahern@gmail.com>,
	"Andrii Nakryiko" <andrii.nakryiko@gmail.com>,
	"Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Maciej Fijalkowski" <maciej.fijalkowski@intel.com>,
	"Björn Töpel" <bjorn.topel@gmail.com>
Subject: Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
Date: Fri, 16 Apr 2021 11:20:01 -0700
Message-ID: <20210416182001.56dski6q6kmgr74f@kafai-mbp.dhcp.thefacebook.com>
In-Reply-To: <877dl2im0y.fsf@toke.dk>

On Fri, Apr 16, 2021 at 12:03:41PM +0200, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <kafai@fb.com> writes:
> 
> > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> 
> >> > On Thu, 15 Apr 2021 10:35:51 -0700
> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >
> >> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >   
> >> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> >> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> > >> >  {
> >> >> > >> >  	struct net_device *dev = bq->dev;
> >> >> > >> > -	int sent = 0, err = 0;
> >> >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> > >> > +	unsigned int cnt = bq->count;
> >> >> > >> > +	int to_send = cnt;
> >> >> > >> >  	int i;
> >> >> > >> >  
> >> >> > >> > -	if (unlikely(!bq->count))
> >> >> > >> > +	if (unlikely(!cnt))
> >> >> > >> >  		return;
> >> >> > >> >  
> >> >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> > >> >  
> >> >> > >> >  		prefetch(xdpf);
> >> >> > >> >  	}
> >> >> > >> >  
> >> >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> > >> > +	if (bq->xdp_prog) {  
> >> >> > >> bq->xdp_prog is used here
> >> >> > >>   
> >> >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> > >> > +		if (!to_send)
> >> >> > >> > +			goto out;
> >> >> > >> > +
> >> >> > >> > +		drops = cnt - to_send;
> >> >> > >> > +	}
> >> >> > >> > +  
> >> >> > >> 
> >> >> > >> [ ... ]
> >> >> > >>   
> >> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > >> > -		       struct net_device *dev_rx)
> >> >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> > >> >  {
> >> >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> > >> > +	 *
> >> >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> > >> > +	 * are only ever modified together.
> >> >> > >> >  	 */
> >> >> > >> > -	if (!bq->dev_rx)
> >> >> > >> > +	if (!bq->dev_rx) {
> >> >> > >> >  		bq->dev_rx = dev_rx;
> >> >> > >> > +		bq->xdp_prog = xdp_prog;  
> >> >> > >> bq->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> > >> 
> >> >> > >> e.g. what if the devmap elem gets deleted.  
> >> >> > >
> >> >> > > Jesper knows better than me. From my view, based on the description of
> >> >> > > __dev_flush():
> >> >> > >
> >> >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> > > program they may need to ensure any flush ops are also complete.  
> >> >>
> >> >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> 
> >> >> > 
> >> >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> > performance :)  
> >> >>
> >> >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> in i40e_run_xdp() and it is fine.
> >> >> 
> >> >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp(), after the
> >> >> rcu_read_unlock() has already been done.  It is now run in xdp_do_flush_map().
> >> >> Or did I miss the big rcu_read_lock() in i40e_napi_poll()?
> >> >>
> >> >> I do see the big rcu_read_lock() in mlx5e_napi_poll().
> >> >
> >> > I believed/assumed xdp_do_flush_map() was already protected under an
> >> > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> > are operating on.
> > What other RCU objects is it using during flush?
> 
> The bq_enqueue() function in cpumap.c puts the 'bq' pointer onto the
> flush_list, and 'bq' lives inside struct bpf_cpu_map_entry, so that's a
> reference to the map entry as well.
> 
> The devmap function used to work the same way, until we changed it in
> 75ccae62cb8d ("xdp: Move devmap bulk queue into struct net_device").
Got it. Thanks for the explanation of bq_enqueue() in cpumap.c.
I was under the impression that xdp_do_flush_map() should not
use any RCU object now, since I don't see an rcu_read_lock() there
and I use that as a hint when reading the code.
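
To make the lifetime concern concrete, here is a minimal userspace sketch
of a flush list that holds pointers to bulk queues which in turn refer
back to their owning map entry. All names (map_entry, bulk_queue,
enqueue_sketch, flush_sketch, the 'live' flag) are hypothetical and the
layout is simplified; this is not the cpumap.c code, only an illustration
of why the flush path ends up dereferencing the map entry.

```c
#include <assert.h>
#include <stddef.h>

struct map_entry;                  /* hypothetical stand-in for a map entry */

struct bulk_queue {
	struct map_entry *obj;     /* owner of this queue */
	struct bulk_queue *next;   /* link on the flush list */
	int count;                 /* frames queued since the last flush */
};

struct map_entry {
	int live;                  /* 1 while the entry may be dereferenced */
	struct bulk_queue bq;
};

/* Single flush list; the kernel keeps one per CPU. */
static struct bulk_queue *flush_list;

static void entry_init(struct map_entry *e)
{
	e->live = 1;
	e->bq.obj = e;
	e->bq.next = NULL;
	e->bq.count = 0;
}

/* Queue one frame; the first frame also puts the queue on the flush list. */
static void enqueue_sketch(struct map_entry *e)
{
	struct bulk_queue *bq = &e->bq;

	if (bq->count++ == 0) {
		bq->next = flush_list;
		flush_list = bq;
	}
}

/* Drain the flush list. Returns how many queues had an owner that was
 * no longer live, i.e. how often the flush touched a stale entry.
 */
static int flush_sketch(void)
{
	int stale = 0;

	while (flush_list) {
		struct bulk_queue *bq = flush_list;

		flush_list = bq->next;
		assert(bq->obj != NULL);
		if (!bq->obj->live)    /* the flush dereferences the entry */
			stale++;
		bq->count = 0;
		bq->next = NULL;
	}
	return stale;
}
```

If the entry is marked dead between enqueue_sketch() and flush_sketch(),
the flush still walks through it. That is the window the read-side
protection has to cover.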

> >> > Perhaps it is a bug in i40e?
> > A quick look into ixgbe falls into the same bucket.
> > didn't look at other drivers though.
> >
> >> >
> >> > We are running in softirq in NAPI context when xdp_do_flush_map() is
> >> > called, which I think means that this CPU will not go through an RCU grace
> >> > period before we exit softirq, so in practice it should be safe.
> >> 
> >> Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> full invocations of the softirq handler, which for networking is
> >> net_rx_action(), and so translates into full NAPI poll cycles.
> >
> > I don't know enough to comment on the rcu/softirq part, maybe someone
> > can chime in.  There is also a recent napi_threaded_poll().
> >
> > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> > At least, it sounds incorrect to only make an exception here while keeping
> > other rcu_read_lock() as-is.
> 
> I'd tend to agree that the correct thing to do is to fix any affected
> drivers so there's a wide rcu_read_lock() around the full xdp+flush. If
> nothing else, this serves as an annotation for the expected lifetime of
> the objects involved.
> 
> However, given that this is not a new issue, I don't think it should be
> holding up this patch series... We can start a new conversation on what
> the right way to fix this is - and maybe bring in Paul for advice on the
> RCU side? WDYT?
Yeah... it falls into the same issue as the current bq_enqueue() in cpumap.c.
I am fine with putting them together into the "solve later" bucket.  I will
delegate this decision to the maintainers.

I would wait a bit on Paul's reply though.

Also, doesn't patch 2 not necessarily depend on patch 1?  Another option is to post
patch 1 separately later, as an optimization, once the RCU discussion has concluded.
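
For reference, the difference between the per-packet read-side section
and the wide one suggested above can be sketched in userspace as follows.
The rcu_read_lock()/rcu_read_unlock() here are stand-ins that only count
nesting depth, and bulk_queue, poll_narrow() and poll_wide() are
hypothetical names, not driver code. The sketch only records whether the
flush ran with the borrowed program pointer outside any read-side section.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel primitives: they only track depth. */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { assert(rcu_depth > 0); rcu_depth--; }

/* Hypothetical bulk queue borrowing a pointer to an RCU-protected program. */
struct bulk_queue {
	const int *xdp_prog;   /* stands in for struct bpf_prog * */
	int count;
};

/* Number of flushes that used xdp_prog outside a read-side section. */
static int flushes_outside_rcu;

static void enqueue_sketch(struct bulk_queue *bq, const int *prog)
{
	if (!bq->xdp_prog)
		bq->xdp_prog = prog;
	bq->count++;
}

static void flush_sketch(struct bulk_queue *bq)
{
	if (bq->count && bq->xdp_prog && rcu_depth == 0)
		flushes_outside_rcu++;
	bq->count = 0;
	bq->xdp_prog = NULL;
}

/* Lock held only around each per-packet step; the flush runs unlocked. */
static void poll_narrow(struct bulk_queue *bq, const int *prog, int budget)
{
	for (int i = 0; i < budget; i++) {
		rcu_read_lock();
		enqueue_sketch(bq, prog);
		rcu_read_unlock();
	}
	flush_sketch(bq);
}

/* One wide read-side section around the whole poll, flush included. */
static void poll_wide(struct bulk_queue *bq, const int *prog, int budget)
{
	rcu_read_lock();
	for (int i = 0; i < budget; i++)
		enqueue_sketch(bq, prog);
	flush_sketch(bq);
	rcu_read_unlock();
}
```

In this toy model poll_narrow() increments flushes_outside_rcu and
poll_wide() does not. Whether the narrow form is actually unsafe in the
kernel depends on the softirq/RCU question raised earlier in the thread.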
