linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Geva, Erez" <erez.geva.ext@siemens.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Network Development <netdev@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	Arnd Bergmann <arnd@arndb.de>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@resnulli.us>,
	Alexei Starovoitov <ast@kernel.org>,
	Colin Ian King <colin.king@canonical.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <edumazet@google.com>,
	Eyal Birger <eyal.birger@gmail.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	John Ogness <john.ogness@linutronix.de>,
	Jon Rosen <jrosen@cisco.com>, Kees Cook <keescook@chromium.org>,
	Marc Kleine-Budde <mkl@pengutronix.de>,
	Martin KaFai Lau <kafai@fb.com>,
	Matthieu Baerts <matthieu.baerts@tessares.net>,
	Andrei Vagin <avagin@gmail.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Ingo Molnar <mingo@kernel.org>,
	John Stultz <john.stultz@linaro.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Michal Kubecek <mkubecek@suse.cz>,
	Or Cohen <orcohen@paloaltonetworks.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Cochran <richardcochran@gmail.com>,
	Stefan Schmidt <stefan@datenfreihafen.org>,
	Xie He <xie.he.0141@gmail.com>, Stephen Boyd <sboyd@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vladis Dronov <vdronov@redhat.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Frederic Weisbecker <frederic@kernel.org>,
	Vinicius Costa Gomes <vinicius.gomes@intel.com>,
	Vedang Patel <vedang.patel@intel.com>,
	"Sudler, Simon" <simon.sudler@siemens.com>,
	"Meisinger, Andreas" <andreas.meisinger@siemens.com>,
	"henning.schild@siemens.com" <henning.schild@siemens.com>,
	"jan.kiszka@siemens.com" <jan.kiszka@siemens.com>,
	"Zirkler, Andreas" <andreas.zirkler@siemens.com>
Subject: Re: [PATCH 1/3] Add TX sending hardware timestamp.
Date: Fri, 11 Dec 2020 14:22:05 +0000	[thread overview]
Message-ID: <VI1PR10MB24468B591888DC515017B5B5ABCA0@VI1PR10MB2446.EURPRD10.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <CAF=yD-Lf=JpkXvGs=AGtyhCEFcG_8_WgnNbg1cbGownohsHw8g@mail.gmail.com>


On 11/12/2020 00:30, Willem de Bruijn wrote:
>>> If I understand correctly, you are trying to achieve a single delivery time.
>>> The need for two separate timestamps passed along is only because the
>>> kernel is unable to do the time base conversion.
>>
>> Yes, a correct point.
>>
>>>
>>> Else, ETF could program the qdisc watchdog in system time and later,
>>> on dequeue, convert skb->tstamp to the h/w time base before
>>> passing it to the device.
>>
>> Or the skb->tstamp is HW time-stamp and the ETF convert it to system clock based.
>>
>>>
>>> It's still not entirely clear to me why the packet has to be held by
>>> ETF initially first, if it is held until delivery time by hardware
>>> later. But more on that below.
>>
>> Let plot a simple scenario.
>> App A send a packet with time-stamp 100.
>> After arrive a second packet from App B with time-stamp 90.
>> Without ETF, the second packet will have to wait till the interface hardware send the first packet on 100.
>> Making the second packet late by 10 + first packet send time.
>> Obviously other "normal" packets are send to the non-ETF queue, though they do not block ETF packets
>> The ETF delta is a barrier that the application have to send the packet before to ensure the packet do not tossed.
> 
> Got it. The assumption here is that devices are FIFO. That is not
> necessarily the case, but I do not know whether it is in practice,
> e.g., on the i210.
> 
>>
>>>
>>> So far, the use case sounds a bit narrow and the use of two timestamp
>>> fields for a single delivery event a bit of a hack.
>>
>> The definition of a hack is up to you
> 
> Fair enough :) That wasn't very constructive feedback on my part.
> 
>>> And one that does impose a cost in the hot path of many workloads
>>> by adding a field the ip cookie, cork and writing to (possibly cold)
>>> skb_shinfo for every packet.
>>
>> Most packets do not use skb->tstamp either, probably the cost of testing is higher then just copying.
>> But perhaps if we copy 2 time-stamp we can add a condition for both.
>> What do you think?
> 
> I'd need to take a closer look at the skb_hwtstamps, which unlike
> skb->tstamp lie in the skb_shared_data. If that is an otherwise cold
> cacheline, then access would be expensive.
Good point.
We should review it.
That can make a flag for using copying time-stamps into the SKB more feasible.

> 
> The ipcm and cork are admittedly cheap and not worth a branch. But
> still it is good to understand that this situation of unsynchronized
> clocks is a common operation condition for the foreseeable future, not
> an unfortunate constraint of a single piece of hardware.
> 
> An extreme option would be moving everything behind a static_branch as
> most hot paths will not have the feature enabled. But I'm not
> seriously suggesting that for a few assignments.
> 
>> The cookie and the cork are just intermediate from application to SKB, I do not think they cost much.
>> Both writes of time stamp to the cookie and the cork are conditioned.
>>
>>>
>>>>>>> Indeed, we want pacing offload to work for existing applications.
>>>>>>>
>>>>>> As the conversion of the PHC and the system clock is dynamic over time.
>>>>>> How do you propse to achive it?
>>>>>
>>>>> Can you elaborate on this concern?
>>>>
>>>> Using single time stamp have 3 possible solutions:
>>>>
>>>> 1. Current solution, synchronize the system clock and the PHC.
>>>>       Application uses the system clock.
>>>>       The ETF can use the system clock for ordering and pass the packet to the driver on time
>>>>       The network interface hardware compare the time-stamp to the PHC.
>>>>
>>>> 2. The application convert the PHC time-stamp to system clock based.
>>>>        The ETF works as solution 1
>>>>        The network driver convert the system clock time-stamp back to PHC time-stamp.
>>>>        This solution need a new Net-Link flag and modify the relevant network drivers.
>>>>        Yet this solution have 2 problems:
>>>>        * As applications today are not aware that system clock and PHC are not synchronized and
>>>>           therefore do not perform any conversion, most of them only use the system clock.
>>>>        * As the conversion in the network driver happens ~300 - 600 microseconds after
>>>>           the application send the packet.
>>>>           And as the PHC and system clock frequencies and offset can change during this period.
>>>>           The conversion will produce a different PHC time-stamp from the application original time-stamp.
>>>>           We require a precession of 1 nanoseconds of the PHC time-stamp.
>>>>
>>>> 3. The application uses PHC time-stamp for skb->tstamp
>>>>       The ETF convert the  PHC time-stamp to system clock time-stamp.
>>>>       This solution require implementations on supporting reading PHC clocks
>>>>       from IRQ/kernel thread context in kernel space.
>>>
>>> ETF has to release the packet well in advance of the hardware
>>> timestamp for the packet to arrive at the device on time. In practice
>>> I would expect this delta parameter to be at least at usec timescale.
>>> That gives some wiggle room with regard to s/w tstamp, at least.
>>
>> Yes, the author of the ETF uses a delta of 300 usec.
>> The interface I use for testing, Intel I210 need around 100 usec to 150 usec.
>> I believe it is related to PCIe speed of transferring the data on time and probably similar to other interfaces using PCIe.
>> If you overflow the interface hardware with high traffic the delta is much higher.
>> The clocks conversion error of the application is characteristic around ~1 usec to 5 usec for up to 10 ms sending a head.
>>
>>>
>>> If changes in clock distance are relatively infrequent, could this
>>> clock diff be a qdisc parameter, updated infrequently outside the
>>> packet path?
>>
>> As the clocks are updating of both frequency and offset dynamically make it very hard to perform.
>> The rate of the update of the PHC depends on PTP setting (usually around 1 second).
>> The rate of the update of the system clock depends how you synchronize it ( I assume it is similar or slower).
>> But user may require and use higher rates. It is only penalty by more traffic and CPU.
>> Bare in mind the 2 clocks synchronization are independent, the cross can be unpredictable.
>>
>> The ETF would have to "know" on which packets we use the previous update and which are the last update.
>> And hope we do not "miss" updates.
>>
>> And we would need a "service" to update these values with proper configuration, sound like overhead to me.
> 
> Ack. Thanks for the operating context. I didn't know these constraints
> well enough. Agreed that this is not a very feasible approach then.
> 
>> Another point.
>> The delta includes the PCIe DMA transfer speed, this is a hardware limitation.
>> The idea of TSN is that the application send the packet as closer as possible to the hardware send.
>> Increase the error of the clocks conversion defy the purpose of TSN.
>>
>> A more reasonable is to track the clocks inside the kernel.
>> As we mention on solution 3.
>>
>>>
>>> It would even be preferable if the qdisc and core stack could be
>>> ignorant of such hardware clocks and the time base is converted by the
>>> device driver when encoding skb->tstamp into the tx descriptor. Is the
>>> device hardware clock readable by the driver?
>>
>> All drivers that support PTP (IEEE 1558) have to read the PHC.
>> PTP is mandatory for TSN.
>> But some drivers might be limited on which context they can read the PHC.
>> This is a question to the vendors.
>> For example Intel I210 allow reading the PHC.
>>
>> However the kernel POSIX layer uses application context lockings.
>>
>>>
>>>   From the above, it sounds like this is not trivial.
>>
>> I am not sure if it so complicated.
>> But as the Linux maintainers want to keep the Linux kernel with a single system clock.
>> It might be more of a political question, or perhaps a better planning then I did.
> 
> This would seem the preferable option to me: use a kernel time base
> throughout the stack and limit knowledge of the hardware clock to the
> relevant hardware driver.
> 
> If that is infeasible, then I don't immediately see an alternative to
> the current dual timestamp field variant, either.
> 
>>>
>>> I don't know which exact device you're targeting. Is it an in-tree driver?
>>
>> ETF uses ethernet interfaces with IEEE 1558 and 802.1Qbv or 802.1Qbu.
>> Interfaces that support TSN, https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTime-Sensitive_Networking&amp;data=04%7C01%7Cerez.geva.ext%40siemens.com%7C51af403dfc0041ced22e08d89d63db25%7C38ae3bcd95794fd4addab42e1495d55a%7C1%7C0%7C637432399491933753%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=9xEP%2F2w3hrGItE0%2Btl0Bsy1xtUXwv4xUowuZ9QInXmI%3D&amp;reserved=0
>> I use Intel I210 at the moment.
>> As of 5.10-rc6, there are 4 drivers
>> kernel-etf/drivers/net/ethernet (etf-5.10-rc6)$ find -name '*.c' | xargs grep -r TC_SETUP_QDISC_ETF
>> ./freescale/enetc/enetc.c:              case TC_SETUP_QDISC_ETF:
>> ./stmicro/stmmac/stmmac_main.c: case TC_SETUP_QDISC_ETF:
>> ./intel/igc/igc_main.c:                 case TC_SETUP_QDISC_ETF:
>> ./intel/igb/igb_main.c:                 case TC_SETUP_QDISC_ETF:
>> There are more vendors like
>>    Renesas that have a driver for the RZ-G SOC.
>>    Broadcom have chips that support TSN, but they do not publish the code.
>> I believe that other vendors will add TSN support as it becomes more popular.
> 
> Very clear. Thanks.
> 
>>>
>>>> Just for clarification:
>>>> ETF as all Net-Link, only uses system clock (the TAI)
>>>> The network interface hardware only uses the PHC.
>>>> Nor Net-Link neither the driver perform any conversions.
>>>> The Kernel does not provide and clock conversion beside system clock.
>>>> Linux kernel is a single clock system.
>>>>
>>>>>
>>>>> The simplest solution for offloading pacing would be to interpret
>>>>> skb->tstamp either for software pacing, or skip software pacing if the
>>>>> device advertises a NETIF_F hardware pacing feature.
>>>>
>>>> That will defy the purpose of ETF.
>>>> ETF exist for ordering packets.
>>>> Why should the device driver defer it?
>>>> Simply do not use the QDISC for this interface.
>>>
>>> ETF queues packets until their delivery time is reached. It does not
>>> order for any other reason than to calculate the next qdisc watchdog
>>> event, really.
>>
>> No, ETF also order the packets on .enqueue = etf_enqueue_timesortedlist()
>> Our patch suggest to order them by hardware time stamp.
>> And leave the watchdog setting using skb->tstamp that hold system clock TAI time-stamp.
>>
>>>
>>> If h/w can do the same and the driver can convert skb->tstamp to the
>>> right timebase, indeed no qdisc is needed for pacing. But there may be
>>> a need for selective h/w offload if h/w has additional constraints,
>>> such as on the number of packets outstanding or time horizon.
>>
>> The driver do not order the packets, it send packets in the order of arrival.
>> We can add ETF component to each relevant driver, but do we want to add Net-Link features to drivers?
>> I think the purpose is to make the drivers as small as possible and leave common intelligence in the Net-Link layer.
> 
> I was thinking of devices that implement ETF in hardware for full
> pacing hardware offload. Not in the driver.
> 
>>>
>>>>>
>>>>> Clockbase is an issue. The device driver may have to convert to
>>>>> whatever format the device expects when copying skb->tstamp in the
>>>>> device tx descriptor.
>>>>
>>>> We do hope our definition is clear.
>>>> In the current kernel skb->tstamp uses system clock.
>>>> The hardware time-stamp is PHC based, as it is used today for PTP two steps.
>>>> We only propose to use the same hardware time-stamp.
>>>>
>>>> Passing the hardware time-stamp to the skb->tstamp might seems a bit tricky
>>>> The gaol is the leave the driver unaware to whether we
>>>> * Synchronizing the PHC and system clock
>>>> * The ETF pass the hardware time-stamp to skb->tstamp
>>>> Only the applications and the ETF are aware.
>>>> The application can detect by checking the ETF flag.
>>>> The ETF flags are part of the network administration.
>>>> That also configure the PTP and the system clock synchronization.
>>>>
>>>>>
>>>>>>
>>>>>>> It only requires that pacing qdiscs, both sch_etf and sch_fq,
>>>>>>> optionally skip queuing in their .enqueue callback and instead allow
>>>>>>> the skb to pass to the device driver as is, with skb->tstamp set. Only
>>>>>>> to devices that advertise support for h/w pacing offload.
>>>>>>>
>>>>>> I did not use "Fair Queue traffic policing".
>>>>>> As for ETF, it is all about ordering packets from different applications.
>>>>>> How can we achive it with skiping queuing?
>>>>>> Could you elaborate on this point?
>>>>>
>>>>> The qdisc can only defer pacing to hardware if hardware can ensure the
>>>>> same invariants on ordering, of course.
>>>>
>>>> Yes, this is why we suggest ETF order packets using the hardware time-stamp.
>>>> And pass the packet based on system time.
>>>> So ETF query the system clock only and not the PHC.
>>>
>>> On which note: with this patch set all applications have to agree to
>>> use h/w time base in etf_enqueue_timesortedlist. In practice that
>>> makes this h/w mode a qdisc used by a single process?
>>
>> A single process theoretically does not need ETF, just set the skb-> tstamp and use a pass through queue.
>> However the only way now to set TC_SETUP_QDISC_ETF in the driver is using ETF.
> 
> Yes, and I'd like to eventually get rid of this constraint.
> 
> 
>> The ETF QDISC is per network interface.
>> So all application that uses a single network interface have to comply to the QDISC configuration.
>> Sound like any other new feature in the NetLink.
>>
>> Theoretically a single network interface could participate in 2 TSN/PTP domains.
>> In that case you can create one QDISC without "use hardware time-stamp" for first TSN/PTP domain and synchronize the PHC to system clock.
>> And use the second one with a QDISC that use hardware time-stamp.
>> You will need a driver/hardware that support multiple PHCs.
>> The separation of the domains could be using VLANs.
>>
>> Note: A TSN domain is bound to a PTP domain.
>>
>>>
>>>>>
>>>>> Btw: this is quite a long list of CC:s
>>>>>
>>>> I need to update my company colleagues as well as the Linux group.
>>>
>>> Of course. But even ignoring that this is still quite a large list (> 40).
>>>
>>> My response yesterday was actually blocked as a result ;) Retrying now.
>>>
>>
>> I left 5 people from Siemens, I hope it improves.
>>
>>
>> Thanks for your comments and enlightenments, they are very useful
>>     Erez

  parent reply	other threads:[~2020-12-12  1:01 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-09 14:37 [PATCH 0/3] Add sending TX hardware timestamp for TC ETF Qdisc Erez Geva
2020-12-09 14:37 ` [PATCH 1/3] Add TX sending hardware timestamp Erez Geva
2020-12-09 14:48   ` Willem de Bruijn
2020-12-09 15:21     ` Geva, Erez
2020-12-09 17:37       ` Willem de Bruijn
2020-12-09 20:18         ` Geva, Erez
2020-12-10 19:11           ` Willem de Bruijn
2020-12-10 22:37             ` Geva, Erez
2020-12-10 23:30               ` Willem de Bruijn
2020-12-11  0:27                 ` Vinicius Costa Gomes
2020-12-11 14:44                   ` Geva, Erez
2020-12-11 15:15                   ` Willem de Bruijn
2020-12-11 14:22                 ` Geva, Erez [this message]
2020-12-10  3:11   ` kernel test robot
2020-12-10 12:41     ` Geva, Erez
2020-12-10 18:17       ` Geva, Erez
2020-12-12  8:47       ` [kbuild-all] " Philip Li
2020-12-16  2:01         ` Rong Chen
2020-12-09 14:37 ` [PATCH 2/3] Pass TX sending hardware timestamp to a socket's buffer Erez Geva
2020-12-09 14:37 ` [PATCH 3/3] The TC ETF Qdisc pass the hardware timestamp to the interface driver Erez Geva

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR10MB24468B591888DC515017B5B5ABCA0@VI1PR10MB2446.EURPRD10.PROD.OUTLOOK.COM \
    --to=erez.geva.ext@siemens.com \
    --cc=0x7f454c46@gmail.com \
    --cc=andreas.meisinger@siemens.com \
    --cc=andreas.zirkler@siemens.com \
    --cc=arnd@arndb.de \
    --cc=ast@kernel.org \
    --cc=avagin@gmail.com \
    --cc=bigeasy@linutronix.de \
    --cc=colin.king@canonical.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=ebiederm@xmission.com \
    --cc=edumazet@google.com \
    --cc=eyal.birger@gmail.com \
    --cc=frederic@kernel.org \
    --cc=gustavoars@kernel.org \
    --cc=henning.schild@siemens.com \
    --cc=jakub@cloudflare.com \
    --cc=jan.kiszka@siemens.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.ogness@linutronix.de \
    --cc=john.stultz@linaro.org \
    --cc=jrosen@cisco.com \
    --cc=kafai@fb.com \
    --cc=keescook@chromium.org \
    --cc=kuba@kernel.org \
    --cc=kuznet@ms2.inr.ac.ru \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matthieu.baerts@tessares.net \
    --cc=mingo@kernel.org \
    --cc=mkl@pengutronix.de \
    --cc=mkubecek@suse.cz \
    --cc=netdev@vger.kernel.org \
    --cc=oleg@redhat.com \
    --cc=orcohen@paloaltonetworks.com \
    --cc=peterz@infradead.org \
    --cc=richardcochran@gmail.com \
    --cc=sboyd@kernel.org \
    --cc=simon.sudler@siemens.com \
    --cc=stefan@datenfreihafen.org \
    --cc=tglx@linutronix.de \
    --cc=vdronov@redhat.com \
    --cc=vedang.patel@intel.com \
    --cc=vinicius.gomes@intel.com \
    --cc=willemdebruijn.kernel@gmail.com \
    --cc=xie.he.0141@gmail.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yoshfuji@linux-ipv6.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).