From: Jakub Kicinski <kuba@kernel.org>
To: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: "davem@davemloft.net" <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [PATCH v2 net-next 0/7] dpaa2-eth: add support for Rx traffic classes
Date: Thu, 21 May 2020 12:07:52 -0700
Message-ID: <20200521120752.07fd83aa@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
In-Reply-To: <VI1PR0402MB38710DAD1F17B80F83403396E0B60@VI1PR0402MB3871.eurprd04.prod.outlook.com>

On Wed, 20 May 2020 20:24:43 +0000 Ioana Ciornei wrote:
> > Subject: Re: [PATCH v2 net-next 0/7] dpaa2-eth: add support for Rx traffic
> > classes
> > 
> > On Wed, 20 May 2020 15:10:42 +0000 Ioana Ciornei wrote:  
> > > DPAA2 has frame queues for each Rx traffic class, and the decision of
> > > which queue to pull frames from is made by the HW based on the queue
> > > priority within a channel (there is one channel per CPU).
> > 
> > IOW you're reading the descriptor from the device memory/iomem address and
> > the HW returns the next descriptor based on the configured priority?
> 
> That's the general idea, but the decision is not made on a frame-by-frame basis
> but rather per dequeue operation, which can return at most
> 16 frame descriptors at a time.

I see!
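
FWIW, here is roughly how I picture that dequeue model from the driver
side; pull_dequeue(), process_rx_fd() and the structs are made-up names
for illustration, not the real QBMan/DPIO interface. The point is that
one pull command returns a burst of up to 16 frame descriptors and the
HW has already chosen which queue to serve based on the configured
priorities:

#define DEQUEUE_BURST_MAX 16	/* HW returns at most 16 FDs per pull */

struct frame_desc;	/* opaque frame descriptor, illustrative */
struct channel;		/* one channel per CPU, illustrative */

/* Hypothetical helpers: issue one pull command / process one frame */
int pull_dequeue(struct channel *ch, struct frame_desc **fds, int max);
void process_rx_fd(struct channel *ch, struct frame_desc *fd);

static int poll_channel(struct channel *ch, int budget)
{
	struct frame_desc *fds[DEQUEUE_BURST_MAX];
	int cleaned = 0;

	while (cleaned < budget) {
		/* One command; the HW picks the queue by priority and
		 * returns up to 16 frame descriptors from it. */
		int n = pull_dequeue(ch, fds, DEQUEUE_BURST_MAX);

		if (!n)
			break;
		for (int i = 0; i < n; i++)
			process_rx_fd(ch, fds[i]);
		cleaned += n;	/* may overshoot budget by a partial burst */
	}
	return cleaned;
}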

> > Presumably strict priority?  
> 
> Only the two highest traffic classes are in strict priority, while the other 6 TCs
> form two priority tiers: medium (4 TCs) and low (the last 2 TCs).
> 
> > > If this should be modeled in software, then I assume there should be a
> > > NAPI instance for each traffic class and the stack should know in
> > > which order to call the poll() callbacks so that the priority is respected.  
> > 
> > Right, something like that. But IMHO not needed if HW can serve the right
> > descriptor upon poll.  
> 
> After thinking this through, I don't actually believe that multiple NAPI instances
> would solve this in any circumstance:
> 
> - If you have hardware prioritization with full scheduling on dequeue, then the job on
> the driver side is already done.
> - If you only have hardware assist for prioritization (i.e. hardware gives you multiple
> rings but doesn't tell you from which one to dequeue), then you can still use a single
> NAPI instance just fine and basically pick the highest priority non-empty ring on the fly.
> 
> What I am having trouble understanding is how the fully software implementation
> of this possible new Rx qdisc should work. Somehow the skb->priority should be taken
> into account as the skb passes through the stack (i.e. a higher priority skb should
> overtake a lower priority skb that was received earlier but whose priority queue
> is congested).

I'd think the SW implementation would come down to which ring to
service first. If there are multiple rings on the host, NAPI can try
to read from the highest priority ring first and then move on to the
next prio. Not sure whether there would be a use case for multiple
NAPIs for busy polling.
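
Something along these lines is what I have in mind for the
hardware-assist-only case; everything here (struct prio_channel,
clean_ring(), enable_ring_irqs(), the assumption that TC 0 is the
highest priority) is an illustrative sketch, not an existing driver:

#include <linux/netdevice.h>

#define NUM_TCS 8	/* 8 Rx traffic classes, TC 0 assumed highest prio */

struct prio_channel {
	struct napi_struct napi;
	/* per-TC ring state would live here; omitted in this sketch */
};

/* Hypothetical helpers */
int clean_ring(struct prio_channel *ch, int tc, int limit);
void enable_ring_irqs(struct prio_channel *ch);

static int prio_napi_poll(struct napi_struct *napi, int budget)
{
	struct prio_channel *ch = container_of(napi, struct prio_channel, napi);
	int cleaned = 0;
	int tc;

	/* Walked in priority order on every poll, so high prio rings always
	 * get looked at first even if a low prio ring still has a backlog. */
	for (tc = 0; tc < NUM_TCS && cleaned < budget; tc++)
		cleaned += clean_ring(ch, tc, budget - cleaned);

	if (cleaned < budget && napi_complete_done(napi, cleaned))
		enable_ring_irqs(ch);

	return cleaned;
}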

I was hoping we could solve this with the new ring config API (which is
coming any day now, ehh), in which I hope user space will be able to
assign rings to NAPI instances; all we would have needed on top of that
is control over the querying order. But that doesn't really work for you,
it seems, since the selection is offloaded to HW :S

> I don't have a very deep understanding of the stack, but I am thinking that the
> enqueue_to_backlog()/process_backlog() area could be a candidate place for sorting out
> bottlenecks. If we do that, I don't see why a qdisc would be necessary at all; everybody
> could benefit from prioritization based on skb->priority.

I think once the driver picks the frame up it should run with it to
completion (+/- GRO). We have natural batching with NAPI processing:
every NAPI budget, high priority rings get a chance to preempt lower
ones.
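
To illustrate, clean_ring() from the sketch above could then hand each
frame straight to GRO within the same poll; build_rx_skb() is again a
made-up helper that dequeues one frame from the TC's ring and builds an
skb, returning NULL when the ring is empty:

struct sk_buff *build_rx_skb(struct prio_channel *ch, int tc);	/* hypothetical */

int clean_ring(struct prio_channel *ch, int tc, int limit)
{
	int cleaned = 0;

	while (cleaned < limit) {
		struct sk_buff *skb = build_rx_skb(ch, tc);

		if (!skb)
			break;			/* ring is empty */
		skb->priority = tc;		/* let the TC travel with the skb */
		napi_gro_receive(&ch->napi, skb);	/* run to completion, +GRO */
		cleaned++;
	}
	return cleaned;
}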
