From: Henrik Austad
Subject: Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
Date: Wed, 28 Mar 2018 15:07:06 +0200
Message-ID: <20180328130706.GA382@sisyphus.home.austad.us>
References: <20180307011230.24001-1-jesus.sanchez-palencia@intel.com>
 <20180307011230.24001-14-jesus.sanchez-palencia@intel.com>
 <65da0648-b835-a171-3986-2d1ddcb8ea10@intel.com>
 <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>
 <60799930-56a0-3692-9482-e733d7277152@intel.com>
To: Thomas Gleixner
Cc: Jesus Sanchez-Palencia, netdev@vger.kernel.org, jhs@mojatatu.com,
 xiyou.wangcong@gmail.com, jiri@resnulli.us, vinicius.gomes@intel.com,
 richardcochran@gmail.com, anna-maria@linutronix.de, John Stultz,
 levi.pearson@harman.com, edumazet@google.com, willemb@google.com,
 mlichvar@redhat.com

On Wed, Mar 28, 2018 at 09:48:05AM +0200, Thomas Gleixner wrote:
> Jesus,

Thomas, Jesus,

> On Tue, 27 Mar 2018, Jesus Sanchez-Palencia wrote:
> > On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> > > This is missing right now and you want to get that right from the very
> > > beginning. Duct taping it on the interface later on is a bad idea.
> >
> > Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's
> > been covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a
> > different mechanism for expressing that?
>
> Uuurgh. No. DROP_IF_LATE is just crap to be honest.
>
> There are two modes:
>
>   1) Send at the given TX time (Explicit mode)
>
>   2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.

Yes, +1 to this. The whole point of bandwidth reservation is to not drop
frames; you should never, ever miss a deadline, and if you do, your
admission tests are inadequate.

> > > So you really want a way for the application to query the timing
> > > constraints and perhaps other properties of the channel it connects
> > > to. And you want that now before the first application starts to use the
> > > new ABI. If the application developer does not use it, you still have to
> > > fix the application, but you have to fix it because the developer was a
> > > lazy bastard and not because the design was bad. That's a major
> > > difference.
> >
> > Ok, this is something that we have considered in the past, but then the
> > feedback here drove us onto a different direction.
> > The overall input we got here was that applications would have to be
> > adjusted or that userspace would have to handle the coordination between
> > applications somehow (e.g.: a daemon could be developed separately to
> > accommodate the fully dynamic use-cases, etc).
>
> The only thing which will happen is that you get applications which require
> to control the full interface themself because they are so important and
> the only ones which get it right. Good luck with fixing them up.
>
> That extra daemon if it ever surfaces will be just a PITA. Think about
> 20khz control loops. Do you really want queueing, locking, several context
> switches and priority configuration nightmares in such a scenario?
> Definitely not! You want a fast channel directly to the root qdisc which
> takes care of getting it out at the right point, which might be immediate
> handover if the adapter supports hw scheduling.
>
> > This is a new requirement for the entire discussion.
> > If I'm not missing anything, however, underutilization of the time slots
> > is only a problem:
> >
> > 1) for the fully dynamic use-cases and;
> > 2) because now you are designing applications in terms of time slices, right?
>
> No. It's a general problem. I'm not designing applications in terms of time
> slices. Time slices are a fundamental property of TSN. Whether you use them
> for explicit scheduling or bandwidth reservation or make them flat does not
> matter.
>
> The application does not necessarily need to know about the time
> constraints at all. But if it wants to use timed scheduling then it better
> does know about them.

Yep, +1. In a lot of A/V cases here, the application will have to know about
presentation_time, and the delay through the network stack should be "low
and deterministic", but apart from that the application shouldn't have to
care about SO_TXTIME or what other applications may or may not do.

> > We have not thought of making any of the proposed qdiscs capable of
> > (optionally) adjusting the "time slices", but mainly because this is not a
> > problem we had here before. Our assumption was that per-port Tx schedules
> > would only be used for static systems. In other words, no, we didn't think
> > that re-balancing the slots was a requirement, not even for 'taprio'.
>
> Sigh. Utilization is not something entirely new in the network space. I'm
> not saying that this needs to be implemented right away, but designing it
> in a way which forces underutilization is just wrong.
>
> > > Coming back to the overall scheme. If you start upfront with a time slice
> > > manager which is designed to:
> > >
> > >  - Handle multiple channels
> > >
> > >  - Expose the time constraints, properties per channel
> > >
> > > then you can fit all kind of use cases, whether designed by committee or
> > > not. You can configure that thing per node or network wide. It does not
> > > make a difference. The only difference are the resulting constraints.
> >
> > Ok, and I believe the above was covered by what we had proposed before,
> > unless what you meant by time constraints is beyond the configured port
> > schedule.
> >
> > Are you suggesting that we'll need to have a kernel entity that is not only
> > aware of the current traffic classes' 'schedule', but also of the resources
> > that are still available for new streams to be accommodated into the
> > classes?
> > Putting it differently, is the TAS you envision just an entity that runs
> > a schedule, or is it a time-aware 'orchestrator'?
>
> In the first place its something which runs a defined schedule.
>
> The accomodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's not longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as its designed in a way that it can be
> added later on.
>
> > > So lets look once more at the picture in an abstract way:
> > >
> > >              [ NIC ]
> > >                 |
> > >     [ Time slice manager ]
> > >        |               |
> > >     [ Ch 0 ] ...... [ Ch N ]
> > >
> > > So you have a bunch of properties here:
> > >
> > >    1) Number of Channels ranging from 1 to N
> > >
> > >    2) Start point, slice period and slice length per channel
> >
> > Ok, so we agree that a TAS entity is needed. Assuming that channels are
> > traffic classes, do you have something else in mind other than a new root
> > qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule how simple or how complex it may be.
>
> > >    3) Queueing modes assigned per channel. Again that might be anything
> > >       from 'feed through' over FIFO, PRIO to more complex things like EDF.
> > >
> > >       The queueing mode can also influence properties like the meaning of
> > >       the TX time, i.e. strict or deadline.
> >
> >
> > Ok, but how are the queueing modes assigned / configured per channel?
> >
> > Just to make sure we re-visit some ideas from the past:
> >
> > * TAS:
> >
> >    The idea we are currently exploring is to add a "time-aware", priority
> >    based qdisc, that also exposes the Tx queues available and provides a
> >    mechanism for mapping priority <-> traffic class <-> Tx queues in a
> >    similar fashion as mqprio. We are calling this qdisc 'taprio', and its
> >    'tc' cmd line would be:
> >
> >    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
> >         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
> >         queues 0 1 2 3 \
> >         sched-file gates.sched [base-time <interval>] \
> >         [cycle-time <interval>] [extension-time <interval>]
> >
> >    <sched-file> is multi-line, with each line being of the following format:
> >    <cmd> <gate mask> <interval in nanoseconds>
> >
> >    Qbv only defines one <cmd>: "S" for 'SetGates'
> >
> >    For example:
> >
> >    S 0x01 300
> >    S 0x03 500
> >
> >    This means that there are two intervals, the first will have the gate
> >    for traffic class 0 open for 300 nanoseconds, the second will have
> >    both traffic classes open for 500 nanoseconds.
>
> To accomodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.
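Just to make the fast-forward concrete, something along these lines should
do; this is a rough userspace-style sketch, the names are mine and not
taken from the taprio proposal:

   #include <stdint.h>

   /* Given an absolute base-time (TAI, ns) and a cycle length (ns),
    * return the start of the first cycle that has not begun yet.
    * Illustration only, not from the patches. */
   static int64_t next_cycle_start(int64_t base_time, int64_t cycle_time,
                                   int64_t now_tai)
   {
           if (base_time >= now_tai)
                   return base_time;
           /* whole cycles already elapsed since base_time, plus one */
           return base_time +
                  ((now_tai - base_time) / cycle_time + 1) * cycle_time;
   }

A node that hotplugs later but agrees on the same base-time and cycle-time
then lands on the same slot boundaries without any extra coordination.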
Ok, so this is perhaps a bit of a sidetrack, but based on other discussions
in this patch-series, does it really make sense to discuss anything *but*
TAI?

If you have a TSN-stream (or any other time-sensitive way of prioritizing
frames based on time), then the network is going to be PTP-synched anyway,
and all the rest of the network is going to operate on PTP-time. Why even
bother adding CLOCK_REALTIME and CLOCK_MONOTONIC to the discussion? Sure,
use CLOCK_REALTIME locally and sync that to TAI, but the kernel should
worry about PTP-time _for_that_adapter_, and we should make it pretty
obvious to userspace that if you want to specify a tx-time, then there's
this thing called 'PTP' and it rules this domain. My $0.02 etc.

> Btw, it's not only control systems. Think about complex multi source A/V
> streams. They are reality in recording and life mixing and looking at the
> timing constraints of such scenarios, collision avoidance is key there. So
> you want to be able to do network wide traffic orchestration.

Yep, and if you are too bursty, the network is free to drop your frames,
which is not desired.

> > It would handle multiple channels and expose their constraints / properties.
> > Each channel also becomes a traffic class, so other qdiscs can be attached
> > to them separately.
>
> Right.

I don't think you need a separate qdisc for each channel. If you describe a
channel with

- period (what AVB calls observation interval)
- max data
- deadline

you should be able to keep a sorted rb-tree and handle that pretty
efficiently. Or perhaps I'm completely missing the mark here. If so, my
apologies.

> > So, in summary, because our entire design is based on qdisc interfaces,
> > what we had proposed was a root qdisc (the time slice manager, as you put)
> > that allows for other qdiscs to be attached to each channel. The inner
> > qdiscs define the queueing modes for each channel, and tbs is just one of
> > those modes. I understand now that you want to allow for fully dynamic
> > use-cases to be supported as well, which we hadn't covered with our TAS
> > proposal before because we hadn't envisioned it being used for these
> > systems' design.
>
> Yes, you have the root qdisc, which is in charge of the overall scheduling
> plan, how complex or not it is defined does not matter. It exposes traffic
> classes which have properties defined by the configuration.
>
> The qdiscs which are attached to those traffic classes can be anything
> including:
>
>  - Simple feed through (Applications are time contraints aware and set the
>    exact schedule). qdisc has admission control.
>
>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>    of time constraints and provide the packet deadline. qdisc has admission
>    control. This can be a simple first comes, first served scheduler or
>    something like EDF which allows optimized utilization. The qdisc sets
>    the TX time depending on the deadline and feeds into the root.

As a small nitpick, it would make more sense to do a laxity approach here,
both for explicit mode and deadline mode. We know the size of the frame to
send and we know the outgoing rate, so keep a ready-queue sorted by laxity:

   laxity = absolute_deadline - (size / outgoing_rate)

Also, given that we use a *single* tx-queue for time-triggered
transmission, this boils down to a uniprocessor equivalent and we have a
lot of real-time scheduling academia to draw from.
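Roughly what I have in mind, as a sketch only; the struct and names below
are mine, not from the tbs patches:

   #include <stdint.h>
   #include <stdbool.h>

   /* Illustrative per-frame entry for a laxity-sorted ready queue. */
   struct tt_frame {
           uint64_t deadline_ns;   /* absolute deadline, TAI nanoseconds */
           uint32_t len_bytes;     /* frame size incl. framing overhead */
   };

   /* Latest time transmission may start and still meet the deadline:
    * laxity = absolute_deadline - size / outgoing_rate */
   static uint64_t laxity_ns(const struct tt_frame *f, uint64_t rate_bytes_per_s)
   {
           return f->deadline_ns -
                  (f->len_bytes * 1000000000ULL) / rate_bytes_per_s;
   }

   /* Ordering for the rb-tree (or whatever sorted structure is used):
    * the frame with the smallest laxity is the most urgent. */
   static bool tt_frame_before(const struct tt_frame *a,
                               const struct tt_frame *b,
                               uint64_t rate_bytes_per_s)
   {
           return laxity_ns(a, rate_bytes_per_s) < laxity_ns(b, rate_bytes_per_s);
   }

Dequeue then just peeks at the leftmost node and hands the frame to the
root qdisc when its laxity, i.e. its latest start time, comes up.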
This could then probably handle both of the above (Direct + deadline), but
that's implementation specific, I guess.

>  - FIFO/PRIO/XXX for general traffic. Applications do not know anything
>    about timing constraints. These qdiscs obviously have neither admission
>    control nor do they set a TX time. The root qdisc just pulls from there
>    when the assigned time slot is due or if it (optionally) decides to use
>    underutilized time slots from other classes.
>
>  - .... Add your favourite scheduling mode(s).

Just give it sub-qdiscs and offload enqueue/dequeue to those, I suppose.

-- 
Henrik Austad