Re: [PATCH net-next 9/9] net: enetc: add support for XDP_REDIRECT

From: Vladimir Oltean <olteanv@gmail.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>, KP Singh <kpsingh@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Ioana Ciornei <ioana.ciornei@nxp.com>,
	Alex Marginean <alexandru.marginean@nxp.com>,
	Claudiu Manoil <claudiu.manoil@nxp.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Vladimir Oltean <vladimir.oltean@nxp.com>
Subject: Re: [PATCH net-next 9/9] net: enetc: add support for XDP_REDIRECT
Date: Thu, 1 Apr 2021 19:09:43 +0300	[thread overview]
Message-ID: <20210401160943.frw7l3rio7spr33n@skbuf> (raw)
In-Reply-To: <875z16nsiu.fsf@toke.dk>

On Thu, Apr 01, 2021 at 01:39:05PM +0200, Toke Høiland-Jørgensen wrote:
> Vladimir Oltean <olteanv@gmail.com> writes:
>
> > On Thu, Apr 01, 2021 at 01:26:02PM +0200, Toke Høiland-Jørgensen wrote:
> >> > +int enetc_xdp_xmit(struct net_device *ndev, int num_frames,
> >> > +		   struct xdp_frame **frames, u32 flags)
> >> > +{
> >> > +	struct enetc_tx_swbd xdp_redirect_arr[ENETC_MAX_SKB_FRAGS] = {0};
> >> > +	struct enetc_ndev_priv *priv = netdev_priv(ndev);
> >> > +	struct enetc_bdr *tx_ring;
> >> > +	int xdp_tx_bd_cnt, i, k;
> >> > +	int xdp_tx_frm_cnt = 0;
> >> > +
> >> > +	tx_ring = priv->tx_ring[smp_processor_id()];
> >>
> >> What mechanism guarantees that this won't overflow the array? :)
> >
> > Which array, the array of TX rings?
>
> Yes.
>

The problem isn't even accessing an out-of-bounds element in the TX ring array.

As it turns out, I had a relatively superficial understanding of how
things are organized, but let me try to explain.

The number of TX rings is a configurable resource (between PFs and VFs)
and we read the capability at probe time:

enetc_get_si_caps:
	val = enetc_rd(hw, ENETC_SICAPR0);
	si->num_rx_rings = (val >> 16) & 0xff;
	si->num_tx_rings = val & 0xff;

enetc_init_si_rings_params:
	priv->num_tx_rings = si->num_tx_rings;

In any case, the TX array is declared as:

struct enetc_ndev_priv {
	struct enetc_bdr *tx_ring[16];
	struct enetc_bdr *rx_ring[16];
};

because that's the maximum hardware capability.

The priv->tx_ring array is populated in:

enetc_alloc_msix:
	/* # of tx rings per int vector */
	v_tx_rings = priv->num_tx_rings / priv->bdr_int_num;

	for (i = 0; i < priv->bdr_int_num; i++) {
		for (j = 0; j < v_tx_rings; j++) {
			if (priv->bdr_int_num == ENETC_MAX_BDR_INT)
				idx = 2 * j + i; /* 2 CPUs */
			else
				idx = j + i * v_tx_rings; /* default */

			priv->tx_ring[idx] = bdr;
		}
	}

priv->bdr_int_num is set to "num_online_cpus()".
On LS1028A, it can be either 1 or 2 (and the ENETC_MAX_BDR_INT macro is
equal to 2).

Otherwise said, the convoluted logic above does the following:
- It affines an MSI interrupt vector per CPU
- It affines an RX ring per MSI vector, hence per CPU
- It balances the fixed number of TX rings (say 8) among the available
  MSI vectors, hence CPUs (say 2). It does this by iterating with i
  through the RX MSI interrupt vectors, and with j through the number of
  TX rings per MSI vector.

This logic maps:
- the even TX rings to CPU 0 and the odd TX rings to CPU 1, if 2 CPUs
  are used
- all TX rings to CPU 0, if 1 CPU is used

This is done because we have this logic in enetc_poll:

	for (i = 0; i < v->count_tx_rings; i++)
		if (!enetc_clean_tx_ring(&v->tx_ring[i], budget))
			complete = false;

for processing the TX completions of a given group of TX rings in the RX
MSI interrupt handler of a certain CPU.

Otherwise said, priv->tx_ring[i] is always BD ring i, and that mapping
never changes. All 8 TX rings are enabled and available for use.

What I knew about tc-taprio and tc-mqprio is that they only enqueue to
TX queues [0, num_tc-1] because of this, as it turns out:

enetc_xmit:
	tx_ring = priv->tx_ring[skb->queue_mapping];

where skb->queue_mapping is given by:
	err = netif_set_real_num_tx_queues(ndev, priv->num_tx_rings);
and by this, respectively, from the mqprio code path:
	netif_set_real_num_tx_queues(ndev, num_tc);

As for why XDP works, and priv->tx_ring[smp_processor_id()] is:
- TX ring 0 for CPU 0 and TX ring 1 for CPU 1, if 2 CPUs are used
- TX ring 0, if 1 CPU is used

The TX completions in the first case are handled by:
- CPU 0 for TX ring 0 (because it is even) and CPU 1 for TX ring 1
  (because it is odd), if 2 CPUs are used, due to the mapping I talked
  about earlier
- CPU 0 if only 1 CPU is used

> > You mean that it's possible to receive a TC_SETUP_QDISC_MQPRIO or
> > TC_SETUP_QDISC_TAPRIO with num_tc == 1, and we have 2 CPUs?
>
> Not just that, this ndo can be called on arbitrary CPUs after a
> redirect. The code just calls through from the XDP receive path so which
> CPU it ends up on depends on the RSS+IRQ config of the other device,
> which may not even be the same driver; i.e., you have no control over
> that... :)
>

What do you mean by "arbitrary" CPU? You can't plug CPUs in, it's a dual
core system... Why does the source ifindex matter at all? I'm using the
TX ring affined to the CPU that ndo_xdp_xmit is currently running on.

> > Well, yeah, I don't know what's the proper way to deal with that. Ideas?
>
> Well the obvious one is just:
>
> tx_ring = priv->tx_ring[smp_processor_id() % num_ring_ids];
>
> and then some kind of locking to deal with multiple CPUs accessing the
> same TX ring...

By multiple CPUs accessing the same TX ring, you mean locking between
ndo_xdp_xmit and ndo_start_xmit? Can that even happen if the hardware
architecture is to have at least as many TX rings as CPUs?

Because otherwise, I see that ndo_xdp_xmit is only called from
xdp_do_flush, which is in softirq context, which to my very rudimentary
knowledge run with bottom halves, thus preemption, disabled? So I don't
think it's possible for ndo_xdp_xmit and ndo_xmit, or even two
ndo_xdp_xmit instances, to access the same TX ring?

Sorry, I'm sure these are trivial questions, but I would like to really
understand what I need to change and why :D