Netdev Archive on lore.kernel.org
From: Ido Schimmel <idosch@idosch.org>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: David Miller <davem@davemloft.net>,
	netdev@vger.kernel.org, jiri@mellanox.com, mlxsw@mellanox.com,
	dsahern@gmail.com, roopa@cumulusnetworks.com,
	nikolay@cumulusnetworks.com, andy@greyhouse.net,
	pablo@netfilter.org, jakub.kicinski@netronome.com,
	pieter.jansenvanvuuren@netronome.com, andrew@lunn.ch,
	f.fainelli@gmail.com, vivien.didelot@gmail.com,
	idosch@mellanox.com
Subject: Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
Date: Sun, 14 Jul 2019 15:43:57 +0300
Message-ID: <20190714124357.GA21070@splinter> (raw)
In-Reply-To: <20190714112904.GA5082@hmswarspite.think-freely.org>

On Sun, Jul 14, 2019 at 07:29:04AM -0400, Neil Horman wrote:
> On Fri, Jul 12, 2019 at 04:52:30PM +0300, Ido Schimmel wrote:
> > On Thu, Jul 11, 2019 at 07:53:54PM -0400, Neil Horman wrote:
> > > A few things here:
> > > IIRC we don't announce individual hardware drops; drivers record them in
> > > internal structures, and they are retrieved on demand via ethtool calls. So
> > > you will either need to include some polling (probably not a very performant
> > > idea) or some sort of flagging mechanism to indicate that, on the next
> > > message sent to user space, you should go retrieve HW stats from a given
> > > interface.  I certainly wouldn't mind seeing this happen, but it's more work
> > > than just adding a new netlink message.
> > 
> > Neil,
> > 
> > The idea of this series is to pass the dropped packets themselves to
> > user space along with metadata, such as the drop reason and the ingress
> > port. In the future more metadata could be added thanks to the
> > extensible nature of netlink.
> > 
> I had experimented with this idea previously.  Specifically I had investigated
> the possibility of creating a dummy net_device that received only dropped
> packets so that utilities like tcpdump could monitor the interface for said
> packets along with the metadata that described where they dropped.
> 
> The concern I had was, as Dave mentioned, that you would wind up with either a
> head-of-line blocking issue or simply lots of lost "dropped" packets due to
> queue overflow on receive, which kind of defeats the purpose of drop monitor.
> 
> That said, I like the idea, and if we can find a way around the fact that we
> could potentially receive way more dropped packets than we could bounce back to
> userspace, it would be a major improvement.

We don't necessarily need entire packets, so we could add an option to
truncate them and get only the headers. tc-sample does this (see man
tc-sample).
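For reference, the truncation knob in tc-sample looks roughly like this (the interface name, sampling rate, and truncation size below are illustrative; see man tc-sample for the authoritative syntax):

```shell
# Sample 1 of every 100 ingress packets on eth0 into psample group 5,
# copying only the first 128 bytes of each sampled packet to user
# space, so headers survive while payloads are truncated.
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall \
    action sample rate 100 group 5 trunc 128
```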

Also, packets that are dropped for the same reason and belong to the
same flow are not necessarily interesting. We could add an eBPF filter
on the netlink socket which will only allow "unique" packets to be
enqueued, where "unique" is defined as {drop reason, 5-tuple}. In the
case of SW drops, "drop reason" is the instruction pointer (call site of
kfree_skb()), and in the case of HW drops it's the drop reason provided
by the device. The duplicate copies will be counted in the eBPF map.
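The filtering idea can be illustrated with a plain-Python sketch of the map logic (the real thing would be an eBPF program attached to the netlink socket; the function and variable names here are made up):

```python
# Userspace model of the proposed eBPF filter: the first packet of each
# {drop reason, 5-tuple} key is allowed through to the socket queue,
# and later duplicates are only counted, mirroring what an eBPF map
# would do in-kernel.
from collections import Counter

def make_filter():
    counts = Counter()  # stands in for the eBPF map

    def allow(drop_reason, five_tuple):
        key = (drop_reason, five_tuple)
        counts[key] += 1
        return counts[key] == 1  # enqueue only the first copy

    return allow, counts

allow, counts = make_filter()
flow = ("10.0.0.1", "10.0.0.2", 6, 12345, 80)  # src, dst, proto, sport, dport
assert allow("mc_smac", flow) is True    # first copy is enqueued
assert allow("mc_smac", flow) is False   # duplicate is only counted
assert counts[("mc_smac", flow)] == 2
```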

> 
> > In v1 these packets were notified to user space as devlink events
> > and my plan for v2 is to send them as drop_monitor events, given it's an
> > existing generic netlink channel used to monitor SW drops. This will
> > allow users to listen on one netlink channel to diagnose potential
> > problems in either SW or HW (and hopefully XDP in the future).
> > 
> Yeah, I'm supportive of that.
> 
> > Please note that the packets I'm talking about are packets users
> > currently do not see. They are dropped - potentially silently - by the
> > underlying device, thereby making it hard to debug whatever issues you
> > might be experiencing in your network.
> > 
> Right, I get that. You want the ability to register a listener of sorts to
> monitor drops in hardware and report them back to user space as a drop event
> with a location that, instead of being a kernel address, is a 'special
> location' representing a hardware instance.  Makes sense.  Having that be a
> location + counter tuple would make sense, but I don't think we can pass the
> skb itself (as you mention above) without seeing significant loss.

Getting the skb itself will be an option users will need to enable in
drop_monitor. If it is not enabled, the default behavior is maintained
and you only get notifications that contain the location + counter tuple
you mentioned.

> 
> > The control path that determines if these packets are even sent to the
> > CPU from the HW needs to remain in devlink for the reasons I outlined in
> > my previous reply. However, the monitoring of these drops will be over
> > drop_monitor. This is similar to what we are currently doing with
> > tc-sample, where the control path that triggers the sampling and
> > determines the sampling rate and truncation is done over rtnetlink (tc),
> > but the sampled packets are notified over the generic netlink psample
> > channel.
> > 
> > To make it more real, you can check the example of the dissected devlink
> > message that notifies the drop of a packet due to a multicast source
> > MAC: https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > I will obviously have to create another Wireshark dissector for
> > drop_monitor in order to get the same information.
> > 
> Yes, of course.
> > > Thats an interesting idea, but dropwatch certainly isn't currently setup for
> > > that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> > > and really thinking out what you want to communicate.
> > 
> > I don't think we need a v2 of the netlink protocol. My current plan is
> > to extend the existing protocol with a new message type (e.g.,
> > NET_DM_CMD_HW_ALERT), a new multicast group, and a set of attributes to
> > encode the information that is currently encoded in the example message
> > I pasted above.
> > 
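As a rough sketch of what such a message could carry on the wire: netlink attributes are length/type-prefixed TLVs padded to a 4-byte boundary, so new attributes can be appended without breaking old parsers. Only NET_DM_CMD_HW_ALERT comes from the discussion above; the attribute names and the encoding below are illustrative (native-endian u16 fields on a little-endian host):

```python
import struct

NLA_HDRLEN = 4

def nla_align(n):
    """Round up to the 4-byte alignment netlink attributes use."""
    return (n + 3) & ~3

# Hypothetical attribute types for a NET_DM_CMD_HW_ALERT message.
NET_DM_ATTR_HW_TRAP_NAME = 1   # illustrative
NET_DM_ATTR_IN_PORT = 2        # illustrative

def nla_put(attr_type, payload):
    """Encode one attribute: u16 length (header + payload, excluding
    padding), u16 type, payload, then zero padding to 4 bytes."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("<HH", length, attr_type) + payload + b"\x00" * pad

msg = nla_put(NET_DM_ATTR_HW_TRAP_NAME, b"mc_smac\x00")
msg += nla_put(NET_DM_ATTR_IN_PORT, struct.pack("<I", 7))

assert len(msg) % 4 == 0                           # attributes stay aligned
assert struct.unpack_from("<HH", msg)[0] == 4 + 8  # nla_len = header + payload
```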
> Ok, that makes sense.  I think we already do some very rudimentary version of
> that (see trace_napi_poll_hit).  Here we check the device we receive frames on
> to see if its rx_dropped count has increased, and if it has, store that as a
> drop count in the NULL location.  That's obviously insufficient, but I wonder
> if it's worth looking at using dm_hw_stat_delta to encode and record those
> events for sending with your new message type.

Will check.
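For reference, the rx_dropped heuristic you describe can be modeled in a few lines (a userspace analogue of the per-device bookkeeping; the kernel keys this off the net_device, and the names here are invented):

```python
# Model of the trace_napi_poll_hit heuristic: remember the last seen
# rx_dropped counter per device and, when it has grown, attribute the
# delta to the NULL location (i.e. "dropped somewhere in hardware").
last_rx_dropped = {}   # device name -> last observed rx_dropped counter
drops_at_null = {}     # device name -> drops charged to the NULL location

def napi_poll_hit(dev, rx_dropped):
    prev = last_rx_dropped.get(dev, 0)
    if rx_dropped > prev:
        drops_at_null[dev] = drops_at_null.get(dev, 0) + (rx_dropped - prev)
    last_rx_dropped[dev] = rx_dropped

napi_poll_hit("eth0", 0)
napi_poll_hit("eth0", 3)   # counter grew by 3 -> 3 drops at NULL
napi_poll_hit("eth0", 3)   # no change -> nothing recorded
assert drops_at_null["eth0"] == 3
```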

Thanks, Neil.

Thread overview: 48+ messages
2019-07-07  7:58 Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 01/11] devlink: Create helper to fill port type information Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 02/11] devlink: Add packet trap infrastructure Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 03/11] devlink: Add generic packet traps and groups Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 04/11] Documentation: Add devlink-trap documentation Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 05/11] netdevsim: Add devlink-trap support Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 06/11] Documentation: Add description of netdevsim traps Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 07/11] mlxsw: core: Add API to set trap action Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 08/11] mlxsw: reg: Add new " Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 09/11] mlxsw: Add layer 2 discard trap IDs Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 10/11] mlxsw: Add trap group for layer 2 discards Ido Schimmel
2019-07-07  7:58 ` [PATCH net-next 11/11] mlxsw: spectrum: Add devlink-trap support Ido Schimmel
2019-07-07  8:01 ` [PATCH iproute2-next 0/7] " Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 1/7] devlink: Increase number of supported options Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 2/7] devlink: Add devlink trap set and show commands Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 3/7] devlink: Add devlink trap group " Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 4/7] devlink: Add devlink trap monitor support Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 5/7] devlink: Set NETLINK_NO_ENOBUFS when monitoring events Ido Schimmel
2019-07-07  8:01   ` [PATCH iproute2-next 6/7] devlink: Add fflush() to print functions Ido Schimmel
2019-07-07  8:02   ` [PATCH iproute2-next 7/7] devlink: Add man page for devlink-trap Ido Schimmel
2019-07-07  8:03 ` [RFC PATCH net-next 0/5] selftests: Add devlink-trap selftests Ido Schimmel
2019-07-07  8:03   ` [RFC PATCH net-next 1/5] selftests: devlink_trap: Add test cases for devlink-trap Ido Schimmel
2019-07-07  8:03   ` [RFC PATCH net-next 2/5] Documentation: Add a section for devlink-trap testing Ido Schimmel
2019-07-07  8:03   ` [RFC PATCH net-next 3/5] selftests: forwarding: devlink_lib: Add devlink-trap helpers Ido Schimmel
2019-07-07  8:03   ` [RFC PATCH net-next 4/5] selftests: mlxsw: Add test cases for devlink-trap L2 drops Ido Schimmel
2019-07-07  8:03   ` [RFC PATCH net-next 5/5] selftests: mlxsw: Add a test case for devlink-trap Ido Schimmel
2019-07-07  8:15 ` [PATCH net-next 00/11] Add drop monitor for offloaded data paths Ido Schimmel
2019-07-07 19:45 ` David Miller
2019-07-08 13:19   ` Ido Schimmel
2019-07-08 22:51     ` Jakub Kicinski
2019-07-09 12:38       ` Ido Schimmel
2019-07-09 22:34         ` Jakub Kicinski
2019-07-10 11:20           ` Ido Schimmel
2019-07-10 11:39             ` Toke Høiland-Jørgensen
2019-07-11 12:39   ` Ido Schimmel
2019-07-11 19:02     ` David Miller
2019-07-11 23:53     ` Neil Horman
2019-07-12  3:40       ` Florian Fainelli
2019-07-12 12:05         ` Neil Horman
2019-07-12  9:27       ` Toke Høiland-Jørgensen
2019-07-12 12:18         ` Neil Horman
2019-07-12 12:33           ` Toke Høiland-Jørgensen
2019-07-13  0:40             ` Neil Horman
2019-07-13  8:07               ` Toke Høiland-Jørgensen
2019-07-12 13:52       ` Ido Schimmel
2019-07-14 11:29         ` Neil Horman
2019-07-14 12:43           ` Ido Schimmel [this message]
2019-07-14  2:38     ` David Miller
