From: Thomas Gleixner
Subject: Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
Date: Sun, 25 Mar 2018 13:46:32 +0200 (CEST)
Message-ID:
References: <20180307011230.24001-1-jesus.sanchez-palencia@intel.com>
 <20180307011230.24001-14-jesus.sanchez-palencia@intel.com>
 <65da0648-b835-a171-3986-2d1ddcb8ea10@intel.com>
 <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
To: Jesus Sanchez-Palencia
Cc: netdev@vger.kernel.org, jhs@mojatatu.com, xiyou.wangcong@gmail.com,
 jiri@resnulli.us, vinicius.gomes@intel.com, richardcochran@gmail.com,
 anna-maria@linutronix.de, henrik@austad.us, John Stultz,
 levi.pearson@harman.com, edumazet@google.com, willemb@google.com,
 mlichvar@redhat.com
In-Reply-To: <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>

On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:

> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> > So what's the plan for this? Having TAS as a separate entity or TAS
> > feeding into the proposed 'basic' time transmission thing?
>
> The second one, I guess.

That's just wrong. It won't work. See below.

> Elaborating, the plan is at some point having TAS as a separate entity,
> but which can use tbs for one of its classes (and cbs for another, and
> strict priority for everything else, etc).
>
> Basically, the design would be something along the lines of 'taprio'. A
> root qdisc that is both time and priority aware, and capable of running
> a schedule for the port. That schedule can run inside the kernel with
> hrtimers, or just be offloaded into the controller if Qbv is supported
> on HW.
>
> Because it would expose the inner traffic classes in a mq / mqprio /
> prio style, it would allow for other per-queue qdiscs to be attached to
> it. On a system using the i210, for instance, we could then have tbs
> installed on traffic class 0 just dialing hw offload. The Qbv schedule
> would be running in SW on the TAS entity (i.e. 'taprio'), which would
> be setting the packets' txtime before dequeueing packets on a fast
> path -> tbs -> NIC.
>
> Similarly, other qdiscs, like cbs, could be installed if all that
> traffic class requires is traffic shaping once its 'gate' is allowed to
> execute the selected tx algorithm attached to it.
>
> > I've not yet seen a convincing argument why this low level stuff with
> > all of its weird flavours is superior over something which reflects
> > the basic operating principle of TSN.
>
> As you know, not all TSN systems are designed the same. Take AVB
> systems, for example. These are not always running on networks that are
> aware of any time schedule, or at least not quite like what is
> described by Qbv.
>
> On those systems there is usually a certain number of streams with
> different priorities that care mostly about having their bandwidth
> reserved along the network. The applications running on such systems
> are usually based on AVTP, thus they already have to calculate and set
> the "avtp presentation time" per packet themselves. A Qbv scheduler
> would probably provide very little benefit to this domain, IMHO. For
> "talkers" of these AVB systems, shaping traffic using txtime (i.e. tbs)
> can provide a low-jitter alternative to cbs, for instance.

You're looking at it from particular use cases and trying to accommodate
them in the simplest possible way. I don't think that cuts it.

Let's take a step back and look at it from a more general POV without
trying to make it fit any of the standards first. I'm deliberately NOT
using any of the standard defined terms.
At the (local) network level you always have an explicit plan. This plan
might range from no plan at all to a very elaborate plan which is strict
about when each node is allowed to TX a particular class of packets.

So let's assume we have the following picture:

   [NIC]
     |
   [ Time slice manager ]

Now in the simplest case, the time slice manager has no constraints and
exposes a single input which allows the application to say: "Send my
packet at time X". There is no restriction on 'time X' except if there
is a time collision with an already queued packet or the requested TX
time has already passed. That's close to what you implemented.

Is the TX timestamp which you defined in the user space ABI a fixed
scheduling point or is it a deadline? That's an important distinction,
and for this all to work across various use cases you need a way to
express that in the ABI. It might be an implicit property of the
socket/channel to which the application connects, but you still want to
express it from the application side to do proper sanity checking.

Just think about stuff like audio/video streaming. The point of
transmission does not have to be fixed if you have some intelligent
controller at the receiving end which can buffer stuff. The only
relevant information is the deadline, i.e. the latest point in time
where the packet needs to go out on the wire in order to keep the stream
steady at the consumer side. Having the notion of a deadline, and that
being the only thing the provider knows about, allows proper utilization
by using an appropriate scheduling algorithm like EDF.

Contrary to that, you want very explicit TX points for applications like
automation control. For this kind of use case there is no wiggle room;
it has to go out at a fixed time because that's the way control systems
work.

This is missing right now, and you want to get that right from the very
beginning. Duct taping it onto the interface later on is a bad idea.
Now let's go one step further and create two time slices for whatever
purpose, still on the single node (not network wide). You want to do
that because you want temporal separation of services. The reason might
be bandwidth guarantee, collision avoidance or whatever.

How does the application which was written for the simple manager which
had no restrictions learn about this? Does it learn it the hard way
because now the packets which fall into the reserved timeslice are
rejected? The way you created your interface, the answer is yes. That's
patently bad as it requires changing the application once it runs on a
partitioned node.

So you really want a way for the application to query the timing
constraints and perhaps other properties of the channel it connects to.
And you want that now, before the first application starts to use the
new ABI. If the application developer does not use it, you still have to
fix the application, but you have to fix it because the developer was a
lazy bastard and not because the design was bad. That's a major
difference.

Now that we have two time slices, I'm coming back to your idea of having
your proposed qdisc as the entity which sits right at the network
interface. Let's assume the following:

   [ Slice 1: Timed traffic ]  [ Slice 2: Other traffic ]

Let's assume further that 'Other traffic' has no idea about time slices
at all. It's just stuff like ssh, http, etc. So if you keep that design

   [ NIC ]
      |
   [ Time slice manager ]
      |               |
   [ Timed traffic ]  [ Other traffic ]

feeding into your proposed TBS thingy, then in case of underutilization
of the 'Timed traffic' slot you prevent utilization of the remaining
time: 'Other traffic' cannot be pulled into the empty slots because it
is restricted to Slice 2, and 'Timed traffic' does not know about 'Other
traffic' at all. And no, you cannot make TBS magically pull packets from
'Other traffic', simply because it's not designed for that. So your
design becomes strictly partitioned and forces underutilization.
That becomes even worse when you switch to the proposed full hardware
offloading scheme. In that case the only way to do admission control is
the TX time of the farthest out packet which is already queued. That
might work for a single application which controls all of the network
traffic, but it won't ever work for something more flexible.

The more I think about it, the less interesting full hardware offload
becomes. It's nice if you have a fully strict scheduling plan for
everything, but then your admission control is bogus once you have more
than one channel as input. So yes, it can be used when the card supports
it and you have other ways to enforce admission control w/o hurting
utilization, or if you don't care about utilization at all. It's also
useful for channels which are strictly isolated and have a defined TX
time. Such traffic can be directly fed into the hardware.

Coming back to the overall scheme. If you start upfront with a time
slice manager which is designed to:

  - Handle multiple channels

  - Expose the time constraints and properties per channel

then you can fit all kinds of use cases, whether designed by committee
or not. You can configure that thing per node or network wide; it does
not make a difference. The only difference is the resulting constraints.

We really want to accommodate everything between the 'no restrictions'
and the 'full network wide explicit plan' case. And it's not rocket
science once you realize that the 'no restrictions' case is just a
subset of the 'full network wide explicit plan', simply because it
exposes a single channel where: slice period = slice length.

It's that easy, but at the same time you teach the application from the
very beginning to ask for the time constraints, so if it runs on a more
sophisticated system/network, it will see a different slice period and a
different slice length and can accommodate or react in a useful way
instead of just dying on the 17th packet it tries to send because it is
rejected.
We really want to design for this, as we want to be able to run the
video stream on the same node and network which does robot control,
without changing the video application. That's not a theoretical
problem. These use cases exist today, but they are forced to use
different networks for the two. If you look at the utilization of both,
they fit very well into one, and industry certainly wants to go for
that.

That implies that you need constraint aware applications from the very
beginning, and that requires a proper ABI in the first place. The
proposed ad hoc mode does not qualify. Please be aware that you are
creating a user space ABI and not a random in-kernel interface which can
be changed at any given time.

So let's look once more at the picture in an abstract way:

   [ NIC ]
      |
   [ Time slice manager ]
      |            |
   [ Ch 0 ] ... [ Ch N ]

So you have a bunch of properties here:

 1) Number of channels, ranging from 1 to N

 2) Start point, slice period and slice length per channel

 3) Queueing modes assigned per channel. Again, that might be anything
    from 'feed through' over FIFO and PRIO to more complex things like
    EDF.

    The queueing mode can also influence properties like the meaning of
    the TX time, i.e. strict or deadline.

Please sit back and map your use cases, standards or whatever you care
about onto the above, and I would be very surprised if they don't fit.

Thanks,

	tglx