netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH next 0/3] blackhole device to invalidate dst
@ 2019-06-22  0:45 Mahesh Bandewar
  2019-06-22 15:35 ` David Ahern
  2019-06-25  4:00 ` Michael Chan
  0 siblings, 2 replies; 5+ messages in thread
From: Mahesh Bandewar @ 2019-06-22  0:45 UTC (permalink / raw)
  To: Netdev
  Cc: Eric Dumazet, David Miller, Michael Chan, Daniel Axtens,
	Mahesh Bandewar, Mahesh Bandewar

When we invalidate dst or mark it "dead", we assign 'lo' to
dst->dev. First of all this assignment is racy and more over,
it has MTU implications.

The standard dev MTU is 1500 while the Loopback MTU is 64k. TCP
code when dereferencing the dst don't check if the dst is valid
or not. TCP when dereferencing a dead-dst while negotiating a
new connection, may use dst device which is 'lo' instead of
using the correct device. Consider the following scenario:

A SYN arrives on an interface and tcp-layer while processing
SYNACK finds a dst and associates it with SYNACK skb. Now before
skb gets passed to L3 for processing, if that dst gets "dead"
(because of the virtual device getting disappeared & then reappeared),
the 'lo' gets assigned to that dst (lo MTU = 64k). Let's assume
the SYN has ADV_MSS set as 9k while the output device through
which this SYNACK is going to go out has standard MTU of 1500.
The MTU check during the route check passes since MIN(9K, 64K)
is 9k and TCP successfully negotiates 9k MSS. The subsequent
data packet; bigger in size gets passed to the device and it 
won't be marked as GSO since the assumed MTU of the device is
9k.

This either crashes the NIC and we have seen fixes that went
into drivers to handle this scenario. 8914a595110a ('bnx2x:
disable GSO where gso_size is too big for hardware') and
2b16f048729b ('net: create skb_gso_validate_mac_len()') and
with those fixes TCP eventually recovers but not before
few dropped segments.

Well, I'm not a TCP expert and though we have experienced
these corner cases in our environment, I could not reproduce 
this case reliably in my test setup to try this fix myself.
However, Michael Chan <michael.chan@broadcom.com> had a setup
where these fixes helped him mitigate the issue and not cause
the crash.

The idea here is to not alter the data-path with additional
locks or smb()/rmb() barriers to avoid racy assignments but
to create a new device that has really low MTU that has
.ndo_start_xmit essentially a kfree_skb(). Make use of this
device instead of 'lo' when marking the dst dead.

First patch implements the blackhole device and second
patch uses it in IPv4 and IPv6 stack while the third patch
is the self test that ensures the sanity of this device.

Mahesh Bandewar (3):
  loopback: create blackhole net device similar to loopack.
  blackhole_netdev: use blackhole_netdev to invalidate dst entries
  blackhole_dev: add a selftest

 drivers/net/loopback.c                        |  76 +++++++++++--
 include/linux/netdevice.h                     |   2 +
 lib/Kconfig.debug                             |   9 ++
 lib/Makefile                                  |   1 +
 lib/test_blackhole_dev.c                      | 100 ++++++++++++++++++
 net/core/dst.c                                |   2 +-
 net/ipv4/route.c                              |   3 +-
 net/ipv6/route.c                              |   2 +-
 tools/testing/selftests/net/Makefile          |   3 +-
 tools/testing/selftests/net/config            |   1 +
 .../selftests/net/test_blackhole_dev.sh       |  11 ++
 11 files changed, 196 insertions(+), 14 deletions(-)
 create mode 100644 lib/test_blackhole_dev.c
 create mode 100755 tools/testing/selftests/net/test_blackhole_dev.sh

-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH next 0/3] blackhole device to invalidate dst
  2019-06-22  0:45 [PATCH next 0/3] blackhole device to invalidate dst Mahesh Bandewar
@ 2019-06-22 15:35 ` David Ahern
  2019-06-23 23:01   ` Mahesh Bandewar (महेश बंडेवार)
  2019-06-25  4:00 ` Michael Chan
  1 sibling, 1 reply; 5+ messages in thread
From: David Ahern @ 2019-06-22 15:35 UTC (permalink / raw)
  To: Mahesh Bandewar, Netdev
  Cc: Eric Dumazet, David Miller, Michael Chan, Daniel Axtens, Mahesh Bandewar

On 6/21/19 6:45 PM, Mahesh Bandewar wrote:
> When we invalidate dst or mark it "dead", we assign 'lo' to
> dst->dev. First of all this assignment is racy and more over,
> it has MTU implications.
> 
> The standard dev MTU is 1500 while the Loopback MTU is 64k. TCP
> code when dereferencing the dst don't check if the dst is valid
> or not. TCP when dereferencing a dead-dst while negotiating a
> new connection, may use dst device which is 'lo' instead of
> using the correct device. Consider the following scenario:
> 

Why doesn't the TCP code (or any code) check if a cached dst is valid?
That's the whole point of marking it dead - to tell users not to rely on
it.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH next 0/3] blackhole device to invalidate dst
  2019-06-22 15:35 ` David Ahern
@ 2019-06-23 23:01   ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 0 replies; 5+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2019-06-23 23:01 UTC (permalink / raw)
  To: David Ahern
  Cc: Netdev, Eric Dumazet, David Miller, Michael Chan, Daniel Axtens,
	Mahesh Bandewar

On Sat, Jun 22, 2019 at 8:35 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 6/21/19 6:45 PM, Mahesh Bandewar wrote:
> > When we invalidate dst or mark it "dead", we assign 'lo' to
> > dst->dev. First of all this assignment is racy and more over,
> > it has MTU implications.
> >
> > The standard dev MTU is 1500 while the Loopback MTU is 64k. TCP
> > code when dereferencing the dst don't check if the dst is valid
> > or not. TCP when dereferencing a dead-dst while negotiating a
> > new connection, may use dst device which is 'lo' instead of
> > using the correct device. Consider the following scenario:
> >
>
> Why doesn't the TCP code (or any code) check if a cached dst is valid?
> That's the whole point of marking it dead - to tell users not to rely on
> it.

Well, I'm not an expert in TCP, but I could guess. Eric, please help
me being honest with my guess.
The code that marks it dead (control path) and the code that need to
check the validity (data-path) need to have some sort of
synchronization and that could be costly in the data-path. Also what
is the worst case scenario? ... that packet is going to get dropped
since the under-lying device or route has issues (minus this corner
case). May be that was the reason.

As I mentioned in the log, we could  remove the racy-ness  with
barriers or locks but then that would be an additional cost in the
data-path. having a dummy/backhole dev is the cheapest solution with
no changes to the data-path.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH next 0/3] blackhole device to invalidate dst
  2019-06-22  0:45 [PATCH next 0/3] blackhole device to invalidate dst Mahesh Bandewar
  2019-06-22 15:35 ` David Ahern
@ 2019-06-25  4:00 ` Michael Chan
  2019-06-25  4:17   ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 1 reply; 5+ messages in thread
From: Michael Chan @ 2019-06-25  4:00 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Netdev, Eric Dumazet, David Miller, Daniel Axtens, Mahesh Bandewar

On Fri, Jun 21, 2019 at 5:45 PM Mahesh Bandewar <maheshb@google.com> wrote:

> Well, I'm not a TCP expert and though we have experienced
> these corner cases in our environment, I could not reproduce
> this case reliably in my test setup to try this fix myself.
> However, Michael Chan <michael.chan@broadcom.com> had a setup
> where these fixes helped him mitigate the issue and not cause
> the crash.
>

I will ask the lab to test these patches tomorrow.  Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH next 0/3] blackhole device to invalidate dst
  2019-06-25  4:00 ` Michael Chan
@ 2019-06-25  4:17   ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 0 replies; 5+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2019-06-25  4:17 UTC (permalink / raw)
  To: Michael Chan
  Cc: Netdev, Eric Dumazet, David Miller, Daniel Axtens, Mahesh Bandewar

On Mon, Jun 24, 2019 at 9:00 PM Michael Chan <michael.chan@broadcom.com> wrote:
>
> On Fri, Jun 21, 2019 at 5:45 PM Mahesh Bandewar <maheshb@google.com> wrote:
>
> > Well, I'm not a TCP expert and though we have experienced
> > these corner cases in our environment, I could not reproduce
> > this case reliably in my test setup to try this fix myself.
> > However, Michael Chan <michael.chan@broadcom.com> had a setup
> > where these fixes helped him mitigate the issue and not cause
> > the crash.
> >
>
> I will ask the lab to test these patches tomorrow.  Thanks.
Thank you Michael.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-06-25  4:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-22  0:45 [PATCH next 0/3] blackhole device to invalidate dst Mahesh Bandewar
2019-06-22 15:35 ` David Ahern
2019-06-23 23:01   ` Mahesh Bandewar (महेश बंडेवार)
2019-06-25  4:00 ` Michael Chan
2019-06-25  4:17   ` Mahesh Bandewar (महेश बंडेवार)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).