From: Henrik Austad
Subject: Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
Date: Wed, 28 Mar 2018 15:07:06 +0200
Message-ID: <20180328130706.GA382@sisyphus.home.austad.us>
References: <20180307011230.24001-1-jesus.sanchez-palencia@intel.com>
 <20180307011230.24001-14-jesus.sanchez-palencia@intel.com>
 <65da0648-b835-a171-3986-2d1ddcb8ea10@intel.com>
 <2897b562-06e0-0fcc-4fb1-e8c4469c0faa@intel.com>
 <60799930-56a0-3692-9482-e733d7277152@intel.com>
To: Thomas Gleixner
Cc: Jesus Sanchez-Palencia, netdev@vger.kernel.org, jhs@mojatatu.com,
 xiyou.wangcong@gmail.com, jiri@resnulli.us, vinicius.gomes@intel.com,
 richardcochran@gmail.com, anna-maria@linutronix.de, John Stultz,
 levi.pearson@harman.com, edumazet@google.com, willemb@google.com,
 mlichvar@redhat.com

On Wed, Mar 28, 2018 at 09:48:05AM +0200, Thomas Gleixner wrote:
> Jesus,

Thomas, Jesus,

> On Tue, 27 Mar 2018, Jesus Sanchez-Palencia wrote:
> > On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> > > This is missing right now and you want to get that right from the very
> > > beginning. Duct taping it on the interface later on is a bad idea.
> >
> > Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's
> > been covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a
> > different mechanism for expressing that?
>
> Uuurgh. No. DROP_IF_LATE is just crap to be honest.
>
> There are two modes:
>
>   1) Send at the given TX time (Explicit mode)
>
>   2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.

Yes, +1 to this. The whole point of bandwidth reservation is to not drop
frames; you should never, ever miss a deadline, and if you do, your
admission tests are inadequate.

> > > So you really want a way for the application to query the timing
> > > constraints and perhaps other properties of the channel it connects
> > > to. And you want that now before the first application starts to use the
> > > new ABI. If the application developer does not use it, you still have to
> > > fix the application, but you have to fix it because the developer was a
> > > lazy bastard and not because the design was bad. That's a major
> > > difference.
> >
> > Ok, this is something that we have considered in the past, but then the
> > feedback here drove us onto a different direction.
> > The overall input we got here was that applications would have to be
> > adjusted or that userspace would have to handle the coordination between
> > applications somehow (e.g.: a daemon could be developed separately to
> > accommodate the fully dynamic use-cases, etc).
>
> The only thing which will happen is that you get applications which require
> to control the full interface themself because they are so important and
> the only ones which get it right. Good luck with fixing them up.
>
> That extra daemon if it ever surfaces will be just a PITA. Think about
> 20khz control loops. Do you really want queueing, locking, several context
> switches and priority configuration nightmares in such a scenario?
> Definitely not! You want a fast channel directly to the root qdisc which
> takes care of getting it out at the right point, which might be immediate
> handover if the adapter supports hw scheduling.
>
> > This is a new requirement for the entire discussion.
> > If I'm not missing anything, however, underutilization of the time slots
> > is only a problem:
> >
> > 1) for the fully dynamic use-cases and;
> > 2) because now you are designing applications in terms of time slices, right?
>
> No. It's a general problem. I'm not designing applications in terms of time
> slices. Time slices are a fundamental property of TSN. Whether you use them
> for explicit scheduling or bandwidth reservation or make them flat does not
> matter.
>
> The application does not necessarily need to know about the time
> constraints at all. But if it wants to use timed scheduling then it better
> does know about them.

Yep, +1. In a lot of A/V cases here, the application will have to know about
presentation_time, and the delay through the network stack should be "low
and deterministic", but apart from that the application shouldn't have to
care about SO_TXTIME or what other applications may or may not do.

> > We have not thought of making any of the proposed qdiscs capable of
> > (optionally) adjusting the "time slices", but mainly because this is not a
> > problem we had here before. Our assumption was that per-port Tx schedules
> > would only be used for static systems. In other words, no, we didn't think
> > that re-balancing the slots was a requirement, not even for 'taprio'.
>
> Sigh. Utilization is not something entirely new in the network space. I'm
> not saying that this needs to be implemented right away, but designing it
> in a way which forces underutilization is just wrong.
>
> > > Coming back to the overall scheme. If you start upfront with a time slice
> > > manager which is designed to:
> > >
> > >  - Handle multiple channels
> > >
> > >  - Expose the time constraints, properties per channel
> > >
> > > then you can fit all kind of use cases, whether designed by committee or
> > > not. You can configure that thing per node or network wide. It does not
> > > make a difference. The only difference are the resulting constraints.
> >
> > Ok, and I believe the above was covered by what we had proposed before,
> > unless what you meant by time constraints is beyond the configured port
> > schedule.
> >
> > Are you suggesting that we'll need to have a kernel entity that is not only
> > aware of the current traffic classes' 'schedule', but also of the resources
> > that are still available for new streams to be accommodated into the
> > classes?
> > Putting it differently, is the TAS you envision just an entity that runs
> > a schedule, or is it a time-aware 'orchestrator'?
>
> In the first place its something which runs a defined schedule.
>
> The accomodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's not longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as its designed in a way that it can be
> added later on.
>
> > > So lets look once more at the picture in an abstract way:
> > >
> > >              [ NIC ]
> > >                 |
> > >     [ Time slice manager ]
> > >        |               |
> > >     [ Ch 0 ] ...... [ Ch N ]
> > >
> > > So you have a bunch of properties here:
> > >
> > >    1) Number of Channels ranging from 1 to N
> > >
> > >    2) Start point, slice period and slice length per channel
> >
> > Ok, so we agree that a TAS entity is needed. Assuming that channels are
> > traffic classes, do you have something else in mind other than a new root
> > qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule how simple or how complex it may be.
>
> > >    3) Queueing modes assigned per channel. Again that might be anything
> > >       from 'feed through' over FIFO, PRIO to more complex things like EDF.
> > >
> > >       The queueing mode can also influence properties like the meaning of
> > >       the TX time, i.e. strict or deadline.
> >
> >
> > Ok, but how are the queueing modes assigned / configured per channel?
> >
> > Just to make sure we re-visit some ideas from the past:
> >
> > * TAS:
> >
> >    The idea we are currently exploring is to add a "time-aware", priority
> >    based qdisc, that also exposes the Tx queues available and provides a
> >    mechanism for mapping priority <-> traffic class <-> Tx queues in a
> >    similar fashion as mqprio. We are calling this qdisc 'taprio', and its
> >    'tc' cmd line would be:
> >
> >    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
> >         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
> >         queues 0 1 2 3 \
> >         sched-file gates.sched [base-time <interval>] \
> >         [cycle-time <interval>] [extension-time <interval>]
> >
> >    <sched-file> is multi-line, with each line being of the following format:
> >    <cmd> <gate mask> <interval in nanoseconds>
> >
> >    Qbv only defines one <cmd>: "S" for 'SetGates'
> >
> >    For example:
> >
> >    S 0x01 300
> >    S 0x03 500
> >
> >    This means that there are two intervals, the first will have the gate
> >    for traffic class 0 open for 300 nanoseconds, the second will have
> >    both traffic classes open for 500 nanoseconds.
>
> To accomodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.
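Just to make the fast-forward concrete, something along these lines should
do; this is a rough userspace-style sketch, the names are mine and not
taken from the taprio proposal:

   #include <stdint.h>

   /* Given an absolute base-time (TAI, ns) and a cycle length (ns),
    * return the start of the first cycle that has not begun yet.
    * Illustration only, not from the patches. */
   static int64_t next_cycle_start(int64_t base_time, int64_t cycle_time,
                                   int64_t now_tai)
   {
           if (base_time >= now_tai)
                   return base_time;
           /* whole cycles already elapsed since base_time, plus one */
           return base_time +
                  ((now_tai - base_time) / cycle_time + 1) * cycle_time;
   }

A node that hotplugs later but agrees on the same base-time and cycle-time
then lands on the same slot boundaries without any extra coordination.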
Ok, so this is perhaps a bit of a sidetrack, but based on other discussions
in this patch-series, does it really make sense to discuss anything *but*
TAI?

If you have a TSN-stream (or any other time-sensitive way of prioritizing
frames based on time), then the network is going to be PTP-synched anyway,
and all the rest of the network is going to operate on PTP-time. Why even
bother adding CLOCK_REALTIME and CLOCK_MONOTONIC to the discussion? Sure,
use CLOCK_REALTIME locally and sync that to TAI, but the kernel should
worry about PTP-time _for_that_adapter_, and we should make it pretty
obvious to userspace that if you want to specify a tx-time, then there's
this thing called 'PTP' and it rules this domain. My $0.02 etc.

> Btw, it's not only control systems. Think about complex multi source A/V
> streams. They are reality in recording and life mixing and looking at the
> timing constraints of such scenarios, collision avoidance is key there. So
> you want to be able to do network wide traffic orchestration.

Yep, and if you are too bursty, the network is free to drop your frames,
which is not desired.

> > It would handle multiple channels and expose their constraints / properties.
> > Each channel also becomes a traffic class, so other qdiscs can be attached
> > to them separately.
>
> Right.

I don't think you need a separate qdisc for each channel. If you describe a
channel with

- period (what AVB calls observation interval)
- max data
- deadline

you should be able to keep a sorted rb-tree and handle that pretty
efficiently. Or perhaps I'm completely missing the mark here. If so, my
apologies.

> > So, in summary, because our entire design is based on qdisc interfaces,
> > what we had proposed was a root qdisc (the time slice manager, as you put)
> > that allows for other qdiscs to be attached to each channel. The inner
> > qdiscs define the queueing modes for each channel, and tbs is just one of
> > those modes. I understand now that you want to allow for fully dynamic
> > use-cases to be supported as well, which we hadn't covered with our TAS
> > proposal before because we hadn't envisioned it being used for these
> > systems' design.
>
> Yes, you have the root qdisc, which is in charge of the overall scheduling
> plan, how complex or not it is defined does not matter. It exposes traffic
> classes which have properties defined by the configuration.
>
> The qdiscs which are attached to those traffic classes can be anything
> including:
>
>  - Simple feed through (Applications are time contraints aware and set the
>    exact schedule). qdisc has admission control.
>
>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>    of time constraints and provide the packet deadline. qdisc has admission
>    control. This can be a simple first comes, first served scheduler or
>    something like EDF which allows optimized utilization. The qdisc sets
>    the TX time depending on the deadline and feeds into the root.

As a small nitpick, it would make more sense to do a laxity approach here,
both for explicit mode and deadline mode. We know the size of the frame to
send and we know the outgoing rate, so keep a ready-queue sorted by laxity:

   laxity = absolute_deadline - (size / outgoing_rate)

Also, given that we use a *single* tx-queue for time-triggered
transmission, this boils down to a uniprocessor equivalent and we have a
lot of real-time scheduling academia to draw from.
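Roughly what I have in mind, as a sketch only; the struct and names below
are mine, not from the tbs patches:

   #include <stdint.h>
   #include <stdbool.h>

   /* Illustrative per-frame entry for a laxity-sorted ready queue. */
   struct tt_frame {
           uint64_t deadline_ns;   /* absolute deadline, TAI nanoseconds */
           uint32_t len_bytes;     /* frame size incl. framing overhead */
   };

   /* Latest time transmission may start and still meet the deadline:
    * laxity = absolute_deadline - size / outgoing_rate */
   static uint64_t laxity_ns(const struct tt_frame *f, uint64_t rate_bytes_per_s)
   {
           return f->deadline_ns -
                  (f->len_bytes * 1000000000ULL) / rate_bytes_per_s;
   }

   /* Ordering for the rb-tree (or whatever sorted structure is used):
    * the frame with the smallest laxity is the most urgent. */
   static bool tt_frame_before(const struct tt_frame *a,
                               const struct tt_frame *b,
                               uint64_t rate_bytes_per_s)
   {
           return laxity_ns(a, rate_bytes_per_s) < laxity_ns(b, rate_bytes_per_s);
   }

Dequeue then just peeks at the leftmost node and hands the frame to the
root qdisc when its laxity, i.e. its latest start time, comes up.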
This could then probably handle both of the above (Direct + deadline), but
that's implementation specific, I guess.

>  - FIFO/PRIO/XXX for general traffic. Applications do not know anything
>    about timing constraints. These qdiscs obviously have neither admission
>    control nor do they set a TX time. The root qdisc just pulls from there
>    when the assigned time slot is due or if it (optionally) decides to use
>    underutilized time slots from other classes.
>
>  - .... Add your favourite scheduling mode(s).

Just give it sub-qdiscs and offload enqueue/dequeue to those, I suppose.

-- 
Henrik Austad