[PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
@ 2021-04-28  2:29 Matt Corallo
  2021-04-28 12:20 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-28  2:29 UTC (permalink / raw)
  To: David S. Miller, netdev, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Eric Dumazet, Willy Tarreau, Keyu Man

The default IP reassembly timeout of 30 seconds predates git
history (and cursory web searches turn up nothing related to it).
The only relevant source cited in net/ipv4/ip_fragment.c is RFC
791 defining IPv4 in 1981. RFC 791 suggests allowing the timer to
increase on the receipt of each fragment (which Linux deliberately
does not do), with a default timeout for each fragment of 15
seconds. It suggests 15s to cap a 10Kb/s flow to a 150Kb buffer of
fragments.

When Linux receives a fragment, if the total memory used for the
fragment reassembly buffer (across all senders) exceeds
net.ipv4.ipfrag_high_thresh (or the equivalent for IPv6), it
silently drops all future fragments fragments until the timers on
the original expire.

All the way in 2021, these numbers feel almost comical. The default
buffer size for fragmentation reassembly is hard-coded at 4MiB as
`net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024;` capping a host at
1.06Mb/s of lost fragments before all fragments received on the
host are dropped (with independent limits for IPv6).

At 1.06Mb/s of lost fragments, we move from DoS attack threshold to
real-world scenarios - at moderate loss rates on consumer networks
today its fairly easy to hit this, causing end hosts with their MTU
(mis-)configured to fragment to have nearly 10-20 second blocks of
100% packet loss.

Reducing the default fragment timeout to 1sec gives us 32Mb/s of
fragments before we drop all fragments, which is certainly more in
line with today's network speeds than 1.06Mb/s, though an optimal
value may be still lower. Sadly, reducing it further requires a
change to the sysctl interface, as net.ipv4.ipfrag_time is only
specified in seconds.
---
  include/net/ip.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 2d6b985d11cc..f1473ac5a27c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -135,7 +135,7 @@ struct ip_ra_chain {
  #define IP_MF        0x2000        /* Flag: "More Fragments"    */
  #define IP_OFFSET    0x1FFF        /* "Fragment Offset" part    */

-#define IP_FRAG_TIME    (30 * HZ)        /* fragment lifetime    */
+#define IP_FRAG_TIME    (1 * HZ)        /* fragment lifetime    */

  struct msghdr;
  struct net_device;
-- 
2.30.2

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28  2:29 [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s Matt Corallo
@ 2021-04-28 12:20 ` Eric Dumazet
  2021-04-28 14:09   ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2021-04-28 12:20 UTC (permalink / raw)
  To: Matt Corallo
  Cc: David S. Miller, netdev, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Willy Tarreau, Keyu Man

On Wed, Apr 28, 2021 at 4:29 AM Matt Corallo
<netdev-list@mattcorallo.com> wrote:
>
> The default IP reassembly timeout of 30 seconds predates git
> history (and cursory web searches turn up nothing related to it).
> The only relevant source cited in net/ipv4/ip_fragment.c is RFC
> 791 defining IPv4 in 1981. RFC 791 suggests allowing the timer to
> increase on the receipt of each fragment (which Linux deliberately
> does not do), with a default timeout for each fragment of 15
> seconds. It suggests 15s to cap a 10Kb/s flow to a 150Kb buffer of
> fragments.
>
> When Linux receives a fragment, if the total memory used for the
> fragment reassembly buffer (across all senders) exceeds
> net.ipv4.ipfrag_high_thresh (or the equivalent for IPv6), it
> silently drops all future fragments fragments until the timers on
> the original expire.
>
> All the way in 2021, these numbers feel almost comical. The default
> buffer size for fragmentation reassembly is hard-coded at 4MiB as
> `net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024;` capping a host at
> 1.06Mb/s of lost fragments before all fragments received on the
> host are dropped (with independent limits for IPv6).
>
> At 1.06Mb/s of lost fragments, we move from DoS attack threshold to
> real-world scenarios - at moderate loss rates on consumer networks
> today its fairly easy to hit this, causing end hosts with their MTU
> (mis-)configured to fragment to have nearly 10-20 second blocks of
> 100% packet loss.
>
> Reducing the default fragment timeout to 1sec gives us 32Mb/s of
> fragments before we drop all fragments, which is certainly more in
> line with today's network speeds than 1.06Mb/s, though an optimal
> value may be still lower. Sadly, reducing it further requires a
> change to the sysctl interface, as net.ipv4.ipfrag_time is only
> specified in seconds.
> ---
>   include/net/ip.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 2d6b985d11cc..f1473ac5a27c 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -135,7 +135,7 @@ struct ip_ra_chain {
>   #define IP_MF        0x2000        /* Flag: "More Fragments"    */
>   #define IP_OFFSET    0x1FFF        /* "Fragment Offset" part    */
>
> -#define IP_FRAG_TIME    (30 * HZ)        /* fragment lifetime    */
> +#define IP_FRAG_TIME    (1 * HZ)        /* fragment lifetime    */
>
>   struct msghdr;
>   struct net_device;
> --
> 2.30.2


This is going to break many use cases.

I can certainly say that in many cases, we need more than 1 second to
complete reassembly.
Some Internet users share satellite links with 600 ms RTT, not
everybody has fiber links in 2021.

There is a sysctl, exactly for the cases where admins can decide to
make the value smaller.

You can laugh all you want, the sad thing with IP frags is that really
some applications still want to use them.

Also, admins willing to use 400 MB of memory instead of 4MB can just
change a sysctl.

Again, nothing will prevent reassembly units to be DDOS targets.

At Google, we use 100 MB for /proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv6/ip6frag_high_thresh,
no kernel patch is needed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28 12:20 ` Eric Dumazet
@ 2021-04-28 14:09   ` Matt Corallo
  2021-04-28 14:13     ` Willy Tarreau
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-28 14:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Willy Tarreau, Keyu Man

On 4/28/21 08:20, Eric Dumazet wrote:
> This is going to break many use cases.
> 
> I can certainly say that in many cases, we need more than 1 second to
> complete reassembly.
> Some Internet users share satellite links with 600 ms RTT, not
> everybody has fiber links in 2021.

I'm curious what RTT has to do with it? Frags aren't resent, so there's no RTT you need to wait for, the question is 
more your available bandwidth and how much packet reordering you see, which even for many sat links isn't zero anymore 
(better, in-flow packet reordering is becoming more and more rare!).

Even given some material reordering, 30 seconds on a 100Kb is a lot!

> There is a sysctl, exactly for the cases where admins can decide to
> make the value smaller.

Sadly this doesn't actually solve it in many cases. Because Linux reassembles fragments by default any time conntrack is 
loaded (disabling this is very nontrivial), anyone with a Linux box in between two hosts ends up breaking flows with any 
material loss of frags.

More broadly, just because there is a sysctl, doesn't mean the default needs to be sensible for most users. As you note, 
there's a sysctl, if someone is on a 1Kbps sat link with fragments sent out of order, they can change it :). This 
constant hasn't been touched since pre-git!

> You can laugh all you want, the sad thing with IP frags is that really
> some applications still want to use them.

Yes, including my application, which breaks any time the flow *transits* a Linux box (ie not just my end host(s), but 
any box in between that happens to have conntrack loaded).

> Also, admins willing to use 400 MB of memory instead of 4MB can just
> change a sysctl.
> 
> Again, nothing will prevent reassembly units to be DDOS targets.

Yep, not claiming any differently. As noted in a previous thread you really have to crank up the limits to prevent DDOS.

> At Google, we use 100 MB for /proc/sys/net/ipv4/ipfrag_high_thresh and
> /proc/sys/net/ipv6/ip6frag_high_thresh,
> no kernel patch is needed.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28 14:09   ` Matt Corallo
@ 2021-04-28 14:13     ` Willy Tarreau
  2021-04-28 14:28       ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Willy Tarreau @ 2021-04-28 14:13 UTC (permalink / raw)
  To: Matt Corallo
  Cc: Eric Dumazet, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
> 
> 
> On 4/28/21 08:20, Eric Dumazet wrote:
> > This is going to break many use cases.
> > 
> > I can certainly say that in many cases, we need more than 1 second to
> > complete reassembly.
> > Some Internet users share satellite links with 600 ms RTT, not
> > everybody has fiber links in 2021.
> 
> I'm curious what RTT has to do with it? Frags aren't resent, so there's no
> RTT you need to wait for, the question is more your available bandwidth and
> how much packet reordering you see, which even for many sat links isn't zero
> anymore (better, in-flow packet reordering is becoming more and more rare!).

Regardless of retransmits, large RTTs are often an indication of buffer bloat
on the path, and this can take some fragments apart, even worse when you mix
this with multi-path routing where some fragments may take a short path and
others can take a congested one. In this case you'll note that the excessive
buffer time can become a non-negligible part of the observed RTT, hence the
indirect relation between the two.

Willy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28 14:13     ` Willy Tarreau
@ 2021-04-28 14:28       ` Matt Corallo
  2021-04-28 15:38         ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-28 14:28 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On 4/28/21 10:13, Willy Tarreau wrote:
> On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
> Regardless of retransmits, large RTTs are often an indication of buffer bloat
> on the path, and this can take some fragments apart, even worse when you mix
> this with multi-path routing where some fragments may take a short path and
> others can take a congested one. In this case you'll note that the excessive
> buffer time can become a non-negligible part of the observed RTT, hence the
> indirect relation between the two.

Right, buffer bloat is definitely a concern. Would it make more sense to reduce the default to somewhere closer to 3s?

More generally, I find this a rather interesting case - obviously breaking *deployed* use-cases of Linux is Really Bad, 
but at the same time, the internet has changed around us and suddenly other reasonable use-cases of Linux (ie as a 
router processing real-world consumer flows - in my case a stupid DOCSIS modem dropping 1Mbps from its measly 20Mbps 
limit) have slowly broken instead.

Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28 14:28       ` Matt Corallo
@ 2021-04-28 15:38         ` Eric Dumazet
  2021-04-28 16:35           ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2021-04-28 15:38 UTC (permalink / raw)
  To: Matt Corallo
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On Wed, Apr 28, 2021 at 4:28 PM Matt Corallo
<netdev-list@mattcorallo.com> wrote:
>
>
>
> On 4/28/21 10:13, Willy Tarreau wrote:
> > On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
> > Regardless of retransmits, large RTTs are often an indication of buffer bloat
> > on the path, and this can take some fragments apart, even worse when you mix
> > this with multi-path routing where some fragments may take a short path and
> > others can take a congested one. In this case you'll note that the excessive
> > buffer time can become a non-negligible part of the observed RTT, hence the
> > indirect relation between the two.
>
> Right, buffer bloat is definitely a concern. Would it make more sense to reduce the default to somewhere closer to 3s?
>
> More generally, I find this a rather interesting case - obviously breaking *deployed* use-cases of Linux is Really Bad,
> but at the same time, the internet has changed around us and suddenly other reasonable use-cases of Linux (ie as a
> router processing real-world consumer flows - in my case a stupid DOCSIS modem dropping 1Mbps from its measly 20Mbps
> limit) have slowly broken instead.
>
> Matt

I have been working in wifi environments (linux conferences) where RTT
could reach 20 sec, and even 30 seconds, and this was in some very
rich cities in the USA.

Obviously, when a network is under provisioned by 50x factor, you
_need_ more time to complete fragments.

For some reason, the crazy IP reassembly stuff comes every couple of
years, and it is now a FAQ.

The Internet has changed for the  lucky ones, but some deployments are
using 4Mbps satellite connectivity, shared by hundreds of people.
I urge application designers to _not_ rely on doomed frags, even in
controlled networks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-28 15:38         ` Eric Dumazet
@ 2021-04-28 16:35           ` Matt Corallo
       [not found]             ` <1baf048d-18e8-3e0c-feee-a01b381b0168@bluematt.me>
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-28 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On 4/28/21 11:38, Eric Dumazet wrote:
> On Wed, Apr 28, 2021 at 4:28 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
> I have been working in wifi environments (linux conferences) where RTT
> could reach 20 sec, and even 30 seconds, and this was in some very
> rich cities in the USA.
> 
> Obviously, when a network is under provisioned by 50x factor, you
> _need_ more time to complete fragments.

Its also a trade-off - if you're in a hugely under-provisioned environment with bufferblot issues you may have some 
fragments that need more time for reassembly if they've gotten horribly reordered (though just having 20 second RTT 
doesn't imply that fragments are going to be re-ordered by 20 seconds, more likely you might see a small fraction of 
it), but you're also likely to have more *lost* fragments, which can trigger the black-holing behavior here.

If you have some loss in the flow, its very easy to hit 1Mbps of lost fragments and suddenly instead of just giving more 
time to reassemble, you're just black-holing instead. I'm not claiming I have the right trade-off here, I'd love more 
input, but at least in my experience trying to just occasionally send fragments on a pretty standard DOCSIS modem, 30s 
is way off.

> For some reason, the crazy IP reassembly stuff comes every couple of
> years, and it is now a FAQ.
> 
> The Internet has changed for the  lucky ones, but some deployments are
> using 4Mbps satellite connectivity, shared by hundreds of people.

I'd think this is a great example of a case where you precisely *dont* want such a low threshold for dropping all 
fragments. Note that in my specific deployment (standard DOCSIS), we're talking about the same speed and network as was 
available ten years ago, this isn't exactly an uncommon or particularly fancy deployment. The real issue is applications 
which happily send 8MB of fragments within a few seconds and suddenly find themselves completely black-holed for 30 
seconds, but this isn't going to just go away.

> I urge application designers to _not_ rely on doomed frags, even in
> controlled networks.

I'd love to, but we're talking about a default value for fragment reassembly. At least in my subjective experience here, 
the conservative 30s time takes things from "more time" to "completely blackhole", which feels like the wrong tradeoff. 
At the end of the day, you can't expect fragments to work super well, indeed, and you assume some amount of loss, the 
goal is to minimize the loss you see from them.

Even if you have some reordering, you're unlikely to see every fragment reordered (I guess you could imagine a horribly 
broken qdisc, does such a thing exist in practice?) such that you always need 30s to reassemble. Taking some loss to 
avoid making it so easy to completely blackhole fragments seems like a reasonable tradeoff.

Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
       [not found]             ` <1baf048d-18e8-3e0c-feee-a01b381b0168@bluematt.me>
@ 2021-04-30 17:09               ` Eric Dumazet
  2021-04-30 17:42                 ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2021-04-30 17:09 UTC (permalink / raw)
  To: Matt Corallo
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo
<netdev-list@mattcorallo.com> wrote:
>
> Following up - is there a way forward here?
>

Tune the sysctls to meet your goals ?

I did the needed work so that you can absolutely decide to use 256GB
of ram per host for frags if you want.
(Although I have not tested with crazy values like that, maybe some
kind of bottleneck will be hit)

> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
>
> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
> all fragments entirely in more standard internet access deployments.
>

Give me one implementation, I will give you a DDOS program to defeat it.
linux code is public, attackers will simply change their attacks.

There is no generic solution, they are all bad.

If you evict randomly, it will also fail. So why bother ?


> >
> >
> > On 4/28/21 11:38, Eric Dumazet wrote:
> >> On Wed, Apr 28, 2021 at 4:28 PM Matt Corallo
> >> <netdev-list@mattcorallo.com> wrote:
> >> I have been working in wifi environments (linux conferences) where RTT
> >> could reach 20 sec, and even 30 seconds, and this was in some very
> >> rich cities in the USA.
> >>
> >> Obviously, when a network is under provisioned by 50x factor, you
> >> _need_ more time to complete fragments.
> >
> > Its also a trade-off - if you're in a hugely under-provisioned environment with bufferblot issues you may have some
> > fragments that need more time for reassembly if they've gotten horribly reordered (though just having 20 second RTT
> > doesn't imply that fragments are going to be re-ordered by 20 seconds, more likely you might see a small fraction of
> > it), but you're also likely to have more *lost* fragments, which can trigger the black-holing behavior here.
> >
> > If you have some loss in the flow, its very easy to hit 1Mbps of lost fragments and suddenly instead of just giving more
> > time to reassemble, you're just black-holing instead. I'm not claiming I have the right trade-off here, I'd love more
> > input, but at least in my experience trying to just occasionally send fragments on a pretty standard DOCSIS modem, 30s
> > is way off.
> >
> >> For some reason, the crazy IP reassembly stuff comes every couple of
> >> years, and it is now a FAQ.
> >>
> >> The Internet has changed for the  lucky ones, but some deployments are
> >> using 4Mbps satellite connectivity, shared by hundreds of people.
> >
> > I'd think this is a great example of a case where you precisely *dont* want such a low threshold for dropping all
> > fragments. Note that in my specific deployment (standard DOCSIS), we're talking about the same speed and network as was
> > available ten years ago, this isn't exactly an uncommon or particularly fancy deployment. The real issue is applications
> > which happily send 8MB of fragments within a few seconds and suddenly find themselves completely black-holed for 30
> > seconds, but this isn't going to just go away.
> >
> >> I urge application designers to _not_ rely on doomed frags, even in
> >> controlled networks.
> >
> > I'd love to, but we're talking about a default value for fragment reassembly. At least in my subjective experience here,
> > the conservative 30s time takes things from "more time" to "completely blackhole", which feels like the wrong tradeoff.
> > At the end of the day, you can't expect fragments to work super well, indeed, and you assume some amount of loss, the
> > goal is to minimize the loss you see from them.
> >
> > Even if you have some reordering, you're unlikely to see every fragment reordered (I guess you could imagine a horribly
> > broken qdisc, does such a thing exist in practice?) such that you always need 30s to reassemble. Taking some loss to
> > avoid making it so easy to completely blackhole fragments seems like a reasonable tradeoff.
> >
> > Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-30 17:09               ` Eric Dumazet
@ 2021-04-30 17:42                 ` Matt Corallo
  2021-04-30 17:49                   ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-30 17:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man



On 4/30/21 13:09, Eric Dumazet wrote:
> On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
>>
>> Following up - is there a way forward here?
>>
> 
> Tune the sysctls to meet your goals ?
> 
> I did the needed work so that you can absolutely decide to use 256GB
> of ram per host for frags if you want.
> (Although I have not tested with crazy values like that, maybe some
> kind of bottleneck will be hit)

Again, this is not a solution universally because this issue appears when transiting a Linux router. This isn't only 
about end-hosts (or I wouldn't have even bothered with any of this). Sometimes packets flow over a Linux router that you 
don't have control over, which is true in my case.

>> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
>> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
>>
>> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
>> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
>> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
>> all fragments entirely in more standard internet access deployments.
>>
> 
> Give me one implementation, I will give you a DDOS program to defeat it.
> linux code is public, attackers will simply change their attacks.
> 
> There is no generic solution, they are all bad.
> 
> If you evict randomly, it will also fail. So why bother ?

This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments 
blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.

I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On 
the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete 
blackholing of fragments that it does today.

Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-30 17:42                 ` Matt Corallo
@ 2021-04-30 17:49                   ` Eric Dumazet
  2021-04-30 17:53                     ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2021-04-30 17:49 UTC (permalink / raw)
  To: Matt Corallo
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On Fri, Apr 30, 2021 at 7:42 PM Matt Corallo
<netdev-list@mattcorallo.com> wrote:
>
>
>
> On 4/30/21 13:09, Eric Dumazet wrote:
> > On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo
> > <netdev-list@mattcorallo.com> wrote:
> >>
> >> Following up - is there a way forward here?
> >>
> >
> > Tune the sysctls to meet your goals ?
> >
> > I did the needed work so that you can absolutely decide to use 256GB
> > of ram per host for frags if you want.
> > (Although I have not tested with crazy values like that, maybe some
> > kind of bottleneck will be hit)
>
> Again, this is not a solution universally because this issue appears when transiting a Linux router. This isn't only
> about end-hosts (or I wouldn't have even bothered with any of this). Sometimes packets flow over a Linux router that you
> don't have control over, which is true in my case.
>
> >> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
> >> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
> >>
> >> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
> >> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
> >> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
> >> all fragments entirely in more standard internet access deployments.
> >>
> >
> > Give me one implementation, I will give you a DDOS program to defeat it.
> > linux code is public, attackers will simply change their attacks.
> >
> > There is no generic solution, they are all bad.
> >
> > If you evict randomly, it will also fail. So why bother ?
>
> This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments
> blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.

Again, it will be trivial to have a use case where valid fragments are dropped.

Random can be considered as the worst strategy in some cases.

Queue management can tail drop, head drop, random drop, there is no
magical choice.

>
> I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On
> the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete
> blackholing of fragments that it does today.

Your expectations are unfortunately not something that linux can
satisfy _automatically_,
you have to tweak sysctls to tune _your_ workload.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-30 17:49                   ` Eric Dumazet
@ 2021-04-30 17:53                     ` Matt Corallo
  2021-04-30 18:04                       ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-30 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man



On 4/30/21 13:49, Eric Dumazet wrote:
> On Fri, Apr 30, 2021 at 7:42 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
>> This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments
>> blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.
> 
> Again, it will be trivial to have a use case where valid fragments are dropped.
> 
> Random can be considered as the worst strategy in some cases.
> 
> Queue management can tail drop, head drop, random drop, there is no
> magical choice.

Glad we're on the same page :).

>>
>> I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On
>> the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete
>> blackholing of fragments that it does today.
> 
> Your expectations are unfortunately not something that linux can
> satisfy _automatically_,
> you have to tweak sysctls to tune _your_ workload.

Yep, totally agree, its an optimization question. We just have to decide on what the most reasonable use-case is that 
can be supported at low cost.

I'm still a little dubious that a constant picked some twenty years ago is still the best selection for an optimization 
question that is a function of real-world networks.

Buffer bloat exists, but so do networks that will happily drop 1Mbps of packets. The first has always been true, the 
second only more recently has become more and more common (both due to network speed and application behavior).

Thanks again for your time and consideration,
Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-30 17:53                     ` Matt Corallo
@ 2021-04-30 18:04                       ` Matt Corallo
  2021-05-03 14:30                         ` Matt Corallo
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Corallo @ 2021-04-30 18:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

On 4/30/21 13:53, Matt Corallo wrote:
> 
> Buffer bloat exists, but so do networks that will happily drop 1Mbps of packets. The first has always been true, the 
> second only more recently has become more and more common (both due to network speed and application behavior).

It may be worth noting, to further highlight the tradeoffs made here - that, given a constant amount of memory allocated 
for fragment reassembly, *under* estimating the timeout will result in only loss of some % of packets which were 
reordered in excess of the timeout, whereas *over* estimating the timeout results in complete blackhole for up to the 
timeout in the face of material packet loss.

This asymmetry is why I suggested possibly random eviction could be useful as a different set of trade-offs, but I'm 
certainly not qualified to make that determination.

Thanks again for your time and consideration,
Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s
  2021-04-30 18:04                       ` Matt Corallo
@ 2021-05-03 14:30                         ` Matt Corallo
  0 siblings, 0 replies; 13+ messages in thread
From: Matt Corallo @ 2021-05-03 14:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David S. Miller, netdev, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Keyu Man

At the risk of being obnoxious here - that's a "no" to reconsidering the tradeoffs picked 20 years ago?

I don't want to waste time if the answer is a complete "no", but if it isn't I'm happy to try to figure out what exactly 
the right tradeoffs are here, and spend time implementing things.

Thanks,
Matt

On 4/30/21 14:04, Matt Corallo wrote:
> On 4/30/21 13:53, Matt Corallo wrote:
>>
>> Buffer bloat exists, but so do networks that will happily drop 1Mbps of packets. The first has always been true, the 
>> second only more recently has become more and more common (both due to network speed and application behavior).
> 
> It may be worth noting, to further highlight the tradeoffs made here - that, given a constant amount of memory allocated 
> for fragment reassembly, *under* estimating the timeout will result in only loss of some % of packets which were 
> reordered in excess of the timeout, whereas *over* estimating the timeout results in complete blackhole for up to the 
> timeout in the face of material packet loss.
> 
> This asymmetry is why I suggested possibly random eviction could be useful as a different set of trade-offs, but I'm 
> certainly not qualified to make that determination.
> 
> Thanks again for your time and consideration,
> Matt

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-05-03 14:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-28  2:29 [PATCH net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s Matt Corallo
2021-04-28 12:20 ` Eric Dumazet
2021-04-28 14:09   ` Matt Corallo
2021-04-28 14:13     ` Willy Tarreau
2021-04-28 14:28       ` Matt Corallo
2021-04-28 15:38         ` Eric Dumazet
2021-04-28 16:35           ` Matt Corallo
     [not found]             ` <1baf048d-18e8-3e0c-feee-a01b381b0168@bluematt.me>
2021-04-30 17:09               ` Eric Dumazet
2021-04-30 17:42                 ` Matt Corallo
2021-04-30 17:49                   ` Eric Dumazet
2021-04-30 17:53                     ` Matt Corallo
2021-04-30 18:04                       ` Matt Corallo
2021-05-03 14:30                         ` Matt Corallo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.