From: Thomas Gleixner
Subject: Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
Date: Sun, 25 Mar 2018 13:46:32 +0200 (CEST)
Message-ID:
References: <20180307011230.24001-1-jesus.sanchez-palencia@intel.com>
 <20180307011230.24001-14-jesus.sanchez-palencia@intel.com>
 <65da0648-b835-a171-3986-2d1ddcb8ea10@intel.com>
 <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
To: Jesus Sanchez-Palencia
Cc: netdev@vger.kernel.org, jhs@mojatatu.com, xiyou.wangcong@gmail.com,
 jiri@resnulli.us, vinicius.gomes@intel.com, richardcochran@gmail.com,
 anna-maria@linutronix.de, henrik@austad.us, John Stultz,
 levi.pearson@harman.com, edumazet@google.com, willemb@google.com,
 mlichvar@redhat.com
In-Reply-To: <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>

On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:

> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> > So what's the plan for this? Having TAS as a separate entity or TAS
> > feeding into the proposed 'basic' time transmission thing?
>
> The second one, I guess.

That's just wrong. It won't work. See below.

> Elaborating, the plan is at some point having TAS as a separate entity,
> but which can use tbs for one of its classes (and cbs for another, and
> strict priority for everything else, etc).
>
> Basically, the design would be something along the lines of 'taprio'. A
> root qdisc that is both time and priority aware, and capable of running
> a schedule for the port. That schedule can run inside the kernel with
> hrtimers, or just be offloaded into the controller if Qbv is supported
> on HW.
>
> Because it would expose the inner traffic classes in a mq / mqprio /
> prio style, it would allow for other per-queue qdiscs to be attached to
> it. On a system using the i210, for instance, we could then have tbs
> installed on traffic class 0 just dialing hw offload. The Qbv schedule
> would be running in SW on the TAS entity (i.e. 'taprio'), which would
> be setting the packets' txtime before dequeueing packets on a fast
> path -> tbs -> NIC.
>
> Similarly, other qdiscs, like cbs, could be installed if all that
> traffic class requires is traffic shaping once its 'gate' is allowed to
> execute the selected tx algorithm attached to it.
>
> > I've not yet seen a convincing argument why this low level stuff with
> > all of its weird flavours is superior over something which reflects
> > the basic operating principle of TSN.
>
> As you know, not all TSN systems are designed the same. Take AVB
> systems, for example. These are not always running on networks that are
> aware of any time schedule, or at least not quite like what is
> described by Qbv.
>
> On those systems there is usually a certain number of streams with
> different priorities that care mostly about having their bandwidth
> reserved along the network. The applications running on such systems
> are usually based on AVTP, thus they already have to calculate and set
> the "avtp presentation time" per packet themselves. A Qbv scheduler
> would probably provide very little benefit to this domain, IMHO. For
> "talkers" of these AVB systems, shaping traffic using txtime (i.e. tbs)
> can provide a low-jitter alternative to cbs, for instance.

You're looking at it from particular use cases and trying to accommodate
them in the simplest possible way. I don't think that cuts it.

Let's take a step back and look at it from a more general POV without
trying to make it fit any of the standards first. I'm deliberately NOT
using any of the standard defined terms.
At the (local) network level you always have an explicit plan. This plan
might range from no plan at all to a very elaborate plan which is strict
about when each node is allowed to TX a particular class of packets.

So let's assume we have the following picture:

   [NIC]
     |
   [ Time slice manager ]

Now in the simplest case, the time slice manager has no constraints and
exposes a single input which allows the application to say: "Send my
packet at time X". There is no restriction on 'time X' except if there
is a time collision with an already queued packet or the requested TX
time has already passed. That's close to what you implemented.

Is the TX timestamp which you defined in the user space ABI a fixed
scheduling point or is it a deadline? That's an important distinction,
and for this all to work across various use cases you need a way to
express that in the ABI. It might be an implicit property of the
socket/channel to which the application connects, but you still want to
express it from the application side to do proper sanity checking.

Just think about stuff like audio/video streaming. The point of
transmission does not have to be fixed if you have some intelligent
controller at the receiving end which can buffer stuff. The only
relevant information is the deadline, i.e. the latest point in time
where the packet needs to go out on the wire in order to keep the stream
steady at the consumer side. Having the notion of a deadline, and that
being the only thing the provider knows about, allows proper utilization
by using an appropriate scheduling algorithm like EDF.

Contrary to that, you want very explicit TX points for applications like
automation control. For this kind of use case there is no wiggle room;
it has to go out at a fixed time because that's the way control systems
work.

This is missing right now, and you want to get that right from the very
beginning. Duct taping it onto the interface later on is a bad idea.
Now let's go one step further and create two time slices for whatever
purpose, still on the single node (not network wide). You want to do
that because you want temporal separation of services. The reason might
be bandwidth guarantee, collision avoidance or whatever.

How does the application which was written for the simple manager which
had no restrictions learn about this? Does it learn it the hard way
because now the packets which fall into the reserved timeslice are
rejected? The way you created your interface, the answer is yes. That's
patently bad as it requires changing the application once it runs on a
partitioned node.

So you really want a way for the application to query the timing
constraints and perhaps other properties of the channel it connects to.
And you want that now, before the first application starts to use the
new ABI. If the application developer does not use it, you still have to
fix the application, but you have to fix it because the developer was a
lazy bastard and not because the design was bad. That's a major
difference.

Now that we have two time slices, I'm coming back to your idea of having
your proposed qdisc as the entity which sits right at the network
interface. Let's assume the following:

   [ Slice 1: Timed traffic ]  [ Slice 2: Other traffic ]

Let's assume further that 'Other traffic' has no idea about time slices
at all. It's just stuff like ssh, http, etc. So if you keep that design

   [ NIC ]
      |
   [ Time slice manager ]
      |               |
   [ Timed traffic ]  [ Other traffic ]

feeding into your proposed TBS thingy, then in case of underutilization
of the 'Timed traffic' slot you prevent utilization of the remaining
time: 'Other traffic' cannot be pulled into the empty slots because it
is restricted to Slice 2, and 'Timed traffic' does not know about 'Other
traffic' at all. And no, you cannot make TBS magically pull packets from
'Other traffic', simply because it's not designed for that. So your
design becomes strictly partitioned and forces underutilization.
That becomes even worse when you switch to the proposed full hardware
offloading scheme. In that case the only way to do admission control is
the TX time of the farthest out packet which is already queued. That
might work for a single application which controls all of the network
traffic, but it won't ever work for something more flexible.

The more I think about it, the less interesting full hardware offload
becomes. It's nice if you have a fully strict scheduling plan for
everything, but then your admission control is bogus once you have more
than one channel as input. So yes, it can be used when the card supports
it and you have other ways to enforce admission control w/o hurting
utilization, or if you don't care about utilization at all. It's also
useful for channels which are strictly isolated and have a defined TX
time. Such traffic can be directly fed into the hardware.

Coming back to the overall scheme. If you start upfront with a time
slice manager which is designed to:

  - Handle multiple channels

  - Expose the time constraints and properties per channel

then you can fit all kinds of use cases, whether designed by committee
or not. You can configure that thing per node or network wide; it does
not make a difference. The only difference is the resulting constraints.

We really want to accommodate everything between the 'no restrictions'
and the 'full network wide explicit plan' case. And it's not rocket
science once you realize that the 'no restrictions' case is just a
subset of the 'full network wide explicit plan', simply because it
exposes a single channel where: slice period = slice length.

It's that easy, but at the same time you teach the application from the
very beginning to ask for the time constraints, so if it runs on a more
sophisticated system/network, it will see a different slice period and a
different slice length and can accommodate or react in a useful way
instead of just dying on the 17th packet it tries to send because it is
rejected.
We really want to design for this, as we want to be able to run the
video stream on the same node and network which does robot control,
without changing the video application. That's not a theoretical
problem. These use cases exist today, but they are forced to use
different networks for the two. If you look at the utilization of both,
they fit very well into one, and industry certainly wants to go for
that.

That implies that you need constraint aware applications from the very
beginning, and that requires a proper ABI in the first place. The
proposed ad hoc mode does not qualify. Please be aware that you are
creating a user space ABI and not a random in-kernel interface which can
be changed at any given time.

So let's look once more at the picture in an abstract way:

   [ NIC ]
      |
   [ Time slice manager ]
      |            |
   [ Ch 0 ] ... [ Ch N ]

So you have a bunch of properties here:

 1) Number of channels, ranging from 1 to N

 2) Start point, slice period and slice length per channel

 3) Queueing modes assigned per channel. Again, that might be anything
    from 'feed through' over FIFO and PRIO to more complex things like
    EDF.

    The queueing mode can also influence properties like the meaning of
    the TX time, i.e. strict or deadline.

Please sit back and map your use cases, standards or whatever you care
about onto the above, and I would be very surprised if they don't fit.

Thanks,

	tglx