netdev.vger.kernel.org archive mirror
* BUG: sk_backlog.len can overestimate
       [not found] <CAGXJAmwQw1ohc48NfAvMyNDpDgHGkdVO89Jo8B0j0TuMr7wLpA@mail.gmail.com>
@ 2019-09-30 23:58 ` John Ousterhout
  2019-10-01  0:14   ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: John Ousterhout @ 2019-09-30 23:58 UTC (permalink / raw)
  To: netdev

As of 4.16.10, it appears to me that sk->sk_backlog.len does not
provide an accurate estimate of backlog length; this reduces the
usefulness of the "limit" argument to sk_add_backlog.

The problem is that, under heavy load, sk->sk_backlog_len can grow
arbitrarily large, even though the actual amount of data in the
backlog is small. This happens because __release_sock doesn't reset
the backlog length until it gets completely caught up. Under heavy
load, new packets can be arriving continuously into the backlog
(which increases sk_backlog.len) while other packets are being
serviced. This can go on forever, so sk_backlog.len never gets reset
and it can become arbitrarily large.

Because of this, the "limit" argument to sk_add_backlog may not be
useful, since it could result in packets being discarded even though
the backlog is not very large.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BUG: sk_backlog.len can overestimate
  2019-09-30 23:58 ` BUG: sk_backlog.len can overestimate John Ousterhout
@ 2019-10-01  0:14   ` Eric Dumazet
  2019-10-01 15:44     ` John Ousterhout
       [not found]     ` <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2019-10-01  0:14 UTC (permalink / raw)
  To: John Ousterhout, netdev



On 9/30/19 4:58 PM, John Ousterhout wrote:
> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
> provide an accurate estimate of backlog length; this reduces the
> usefulness of the "limit" argument to sk_add_backlog.
> 
> The problem is that, under heavy load, sk->sk_backlog_len can grow
> arbitrarily large, even though the actual amount of data in the
> backlog is small. This happens because __release_sock doesn't reset
> the backlog length until it gets completely caught up. Under heavy
> load, new packets can be arriving continuously  into the backlog
> (which increases sk_backlog.len) while other packets are being
> serviced. This can go on forever, so sk_backlog.len never gets reset
> and it can become arbitrarily large.

Certainly not.

It cannot grow arbitrarily large, unless a backport has gone wrong, maybe.

> 
> Because of this, the "limit" argument to sk_add_backlog may not be
> useful, since it could result in packets being discarded even though
> the backlog is not very large.
> 


You will have to study git log/history for the details; the limit _is_ useful,
and we reset the limit in __release_sock() only when _safe_.

Assuming you talk about TCP, then I suggest you use a more recent kernel.

linux-5.0 got coalescing in the backlog queue, which helped quite a bit.


* Re: BUG: sk_backlog.len can overestimate
  2019-10-01  0:14   ` Eric Dumazet
@ 2019-10-01 15:44     ` John Ousterhout
       [not found]     ` <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 15:44 UTC (permalink / raw)
  To: netdev

On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 9/30/19 4:58 PM, John Ousterhout wrote:
> > As of 4.16.10, it appears to me that sk->sk_backlog_len does not
> > provide an accurate estimate of backlog length; this reduces the
> > usefulness of the "limit" argument to sk_add_backlog.
> >
> > The problem is that, under heavy load, sk->sk_backlog_len can grow
> > arbitrarily large, even though the actual amount of data in the
> > backlog is small. This happens because __release_sock doesn't reset
> > the backlog length until it gets completely caught up. Under heavy
> > load, new packets can be arriving continuously  into the backlog
> > (which increases sk_backlog.len) while other packets are being
> > serviced. This can go on forever, so sk_backlog.len never gets reset
> > and it can become arbitrarily large.
>
> Certainly not.
>
> It can not grow arbitrarily large, unless a backport gone wrong maybe.

Can you help me understand what would limit the growth of this value?
Suppose that new packets are arriving as quickly as they are
processed. Every time __release_sock calls sk_backlog_rcv, a new
packet arrives during the call, which is added to the backlog,
incrementing sk_backlog.len. However, sk_backlog.len doesn't get
decreased when sk_backlog_rcv completes, since the backlog hasn't
emptied (as you said, it's not "safe"). As a result, sk_backlog.len
has increased, but the actual backlog length is unchanged (one packet
was added, one was removed). Why can't this process repeat
indefinitely, until eventually sk_backlog.len reaches whatever limit
the transport specifies when it invokes sk_add_backlog? At this point
packets will be dropped by the transport even though the backlog isn't
actually very large.

> > Because of this, the "limit" argument to sk_add_backlog may not be
> > useful, since it could result in packets being discarded even though
> > the backlog is not very large.
> >
>
>
> You will have to study git log/history for the details, the limit _is_ useful,
> and we reset the limit in __release_sock() only when _safe_.
>
> Assuming you talk about TCP, then I suggest you use a more recent kernel.
>
> linux-5.0 got coalescing in the backlog queue, which helped quite a bit.


* Fwd: BUG: sk_backlog.len can overestimate
       [not found]       ` <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>
@ 2019-10-01 15:46         ` John Ousterhout
  2019-10-01 15:48         ` John Ousterhout
  1 sibling, 0 replies; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 15:46 UTC (permalink / raw)
  To: netdev

(I accidentally dropped netdev on my earlier message... here is Eric's
response, which also didn't go to the group)

---------- Forwarded message ---------
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, Sep 30, 2019 at 6:53 PM
Subject: Re: BUG: sk_backlog.len can overestimate
To: John Ousterhout <ouster@cs.stanford.edu>

On 9/30/19 5:41 PM, John Ousterhout wrote:
> On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 9/30/19 4:58 PM, John Ousterhout wrote:
>>> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
>>> provide an accurate estimate of backlog length; this reduces the
>>> usefulness of the "limit" argument to sk_add_backlog.
>>>
>>> The problem is that, under heavy load, sk->sk_backlog_len can grow
>>> arbitrarily large, even though the actual amount of data in the
>>> backlog is small. This happens because __release_sock doesn't reset
>>> the backlog length until it gets completely caught up. Under heavy
>>> load, new packets can be arriving continuously  into the backlog
>>> (which increases sk_backlog.len) while other packets are being
>>> serviced. This can go on forever, so sk_backlog.len never gets reset
>>> and it can become arbitrarily large.
>>
>> Certainly not.
>>
>> It can not grow arbitrarily large, unless a backport gone wrong maybe.
>
> Can you help me understand what would limit the growth of this value?
> Suppose that new packets are arriving as quickly as they are
> processed. Every time __release_sock calls sk_backlog_rcv, a new
> packet arrives during the call, which is added to the backlog,
> incrementing sk_backlog.len. However, sk_backlog_len doesn't get
> decreased when sk_backlog_rcv completes, since the backlog hasn't
> emptied (as you said, it's not "safe"). As a result, sk_backlog.len
> has increased, but the actual backlog length is unchanged (one packet
> was added, one was removed). Why can't this process repeat
> indefinitely, until eventually sk_backlog.len reaches whatever limit
> the transport specifies when it invokes sk_add_backlog? At this point
> packets will be dropped by the transport even though the backlog isn't
> actually very large.

The process is bounded by socket sk_rcvbuf + sk_sndbuf

bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
{
        u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;

        ...
        if (unlikely(sk_add_backlog(sk, skb, limit))) {
            ...
            __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
        ...
}


Once the limit is reached, sk_backlog.len won't be touched, unless
__release_sock() has processed the whole queue.


>
>>>
>>> Because of this, the "limit" argument to sk_add_backlog may not be
>>> useful, since it could result in packets being discarded even though
>>> the backlog is not very large.
>>>
>>
>>
>> You will have to study git log/history for the details, the limit _is_ useful,
>> and we reset the limit in __release_sock() only when _safe_.
>>
>> Assuming you talk about TCP, then I suggest you use a more recent kernel.
>>
>> linux-5.0 got coalescing in the backlog queue, which helped quite a bit.


* Re: BUG: sk_backlog.len can overestimate
       [not found]       ` <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>
  2019-10-01 15:46         ` Fwd: " John Ousterhout
@ 2019-10-01 15:48         ` John Ousterhout
  2019-10-01 16:19           ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 15:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Mon, Sep 30, 2019 at 6:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 9/30/19 5:41 PM, John Ousterhout wrote:
> > On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >> On 9/30/19 4:58 PM, John Ousterhout wrote:
> >>> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
> >>> provide an accurate estimate of backlog length; this reduces the
> >>> usefulness of the "limit" argument to sk_add_backlog.
> >>>
> >>> The problem is that, under heavy load, sk->sk_backlog_len can grow
> >>> arbitrarily large, even though the actual amount of data in the
> >>> backlog is small. This happens because __release_sock doesn't reset
> >>> the backlog length until it gets completely caught up. Under heavy
> >>> load, new packets can be arriving continuously  into the backlog
> >>> (which increases sk_backlog.len) while other packets are being
> >>> serviced. This can go on forever, so sk_backlog.len never gets reset
> >>> and it can become arbitrarily large.
> >>
> >> Certainly not.
> >>
> >> It can not grow arbitrarily large, unless a backport gone wrong maybe.
> >
> > Can you help me understand what would limit the growth of this value?
> > Suppose that new packets are arriving as quickly as they are
> > processed. Every time __release_sock calls sk_backlog_rcv, a new
> > packet arrives during the call, which is added to the backlog,
> > incrementing sk_backlog.len. However, sk_backlog_len doesn't get
> > decreased when sk_backlog_rcv completes, since the backlog hasn't
> > emptied (as you said, it's not "safe"). As a result, sk_backlog.len
> > has increased, but the actual backlog length is unchanged (one packet
> > was added, one was removed). Why can't this process repeat
> > indefinitely, until eventually sk_backlog.len reaches whatever limit
> > the transport specifies when it invokes sk_add_backlog? At this point
> > packets will be dropped by the transport even though the backlog isn't
> > actually very large.
>
> The process is bounded by socket sk_rcvbuf + sk_sndbuf
>
> bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> {
>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
>
>         ...
>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>             ...
>             __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
>         ...
> }
>
>
> Once the limit is reached, sk_backlog.len wont be touched, unless __release_sock()
> has processed the whole queue.

Sorry if I'm missing something obvious here, but when you say
"sk_backlog.len won't be touched", doesn't that mean that incoming
packets will have to be dropped? And can't this occur even though the
true size of the backlog might be way less than sk_rcvbuf + sk_sndbuf,
as I described above? It seems to me that the basic problem is that
sk_backlog.len could exceed any given limit, even though there aren't
actually that many bytes still left in the backlog.

> >>> Because of this, the "limit" argument to sk_add_backlog may not be
> >>> useful, since it could result in packets being discarded even though
> >>> the backlog is not very large.
> >>>
> >>
> >>
> >> You will have to study git log/history for the details, the limit _is_ useful,
> >> and we reset the limit in __release_sock() only when _safe_.
> >>
> >> Assuming you talk about TCP, then I suggest you use a more recent kernel.
> >>
> >> linux-5.0 got coalescing in the backlog queue, which helped quite a bit.


* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 15:48         ` John Ousterhout
@ 2019-10-01 16:19           ` Eric Dumazet
  2019-10-01 17:25             ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2019-10-01 16:19 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev



On 10/1/19 8:48 AM, John Ousterhout wrote:
> On Mon, Sep 30, 2019 at 6:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> On 9/30/19 5:41 PM, John Ousterhout wrote:
>>> On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>
>>>> On 9/30/19 4:58 PM, John Ousterhout wrote:
>>>>> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
>>>>> provide an accurate estimate of backlog length; this reduces the
>>>>> usefulness of the "limit" argument to sk_add_backlog.
>>>>>
>>>>> The problem is that, under heavy load, sk->sk_backlog_len can grow
>>>>> arbitrarily large, even though the actual amount of data in the
>>>>> backlog is small. This happens because __release_sock doesn't reset
>>>>> the backlog length until it gets completely caught up. Under heavy
>>>>> load, new packets can be arriving continuously  into the backlog
>>>>> (which increases sk_backlog.len) while other packets are being
>>>>> serviced. This can go on forever, so sk_backlog.len never gets reset
>>>>> and it can become arbitrarily large.
>>>>
>>>> Certainly not.
>>>>
>>>> It can not grow arbitrarily large, unless a backport gone wrong maybe.
>>>
>>> Can you help me understand what would limit the growth of this value?
>>> Suppose that new packets are arriving as quickly as they are
>>> processed. Every time __release_sock calls sk_backlog_rcv, a new
>>> packet arrives during the call, which is added to the backlog,
>>> incrementing sk_backlog.len. However, sk_backlog_len doesn't get
>>> decreased when sk_backlog_rcv completes, since the backlog hasn't
>>> emptied (as you said, it's not "safe"). As a result, sk_backlog.len
>>> has increased, but the actual backlog length is unchanged (one packet
>>> was added, one was removed). Why can't this process repeat
>>> indefinitely, until eventually sk_backlog.len reaches whatever limit
>>> the transport specifies when it invokes sk_add_backlog? At this point
>>> packets will be dropped by the transport even though the backlog isn't
>>> actually very large.
>>
>> The process is bounded by socket sk_rcvbuf + sk_sndbuf
>>
>> bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>> {
>>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
>>
>>         ...
>>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>>             ...
>>             __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
>>         ...
>> }
>>
>>
>> Once the limit is reached, sk_backlog.len wont be touched, unless __release_sock()
>> has processed the whole queue.
> 
> Sorry if I'm missing something obvious here, but when you say
> "sk_backlog.len won't be touched", doesn't that mean that incoming
> packets will have to be dropped?

Yes, packets are dropped if the socket has exhausted its memory budget.

Presumably the sender is trying to fool us.

> And can't this occur even though the
> true size of the backlog might be way less than sk_rcvbuf + sk_sndbuf,
> as I described above? It seems to me that the basic problem is that
> sk_backlog.len could exceed any given limit, even though there aren't
> actually that many bytes still left in the backlog.
> 

Sorry, I have no idea what is the problem you see.



* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 16:19           ` Eric Dumazet
@ 2019-10-01 17:25             ` John Ousterhout
  2019-10-01 18:34               ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 17:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 9:19 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> ...
> Sorry, I have no idea what is the problem you see.

OK, let me try again from the start. Consider two values:
* sk->sk_backlog.len
* The actual number of bytes in buffers in the current backlog list

Now consider a series of propositions:

1. These two are not always the same. As packets get processed by
calling sk_backlog_rcv, they are removed from the backlog list, so the
actual amount of memory consumed by the backlog list drops. However,
sk->sk_backlog.len doesn't change until the entire backlog is cleared,
at which point it is reset to zero. So, there can be periods of time
where sk->sk_backlog.len overstates the actual memory consumption of
the backlog.

2. The gap between sk->sk_backlog.len and actual backlog size can grow
quite large. This happens if new packets arrive while sk_backlog_rcv
is working. The socket is locked, so these new packets will be added
to the backlog, which will increase sk->sk_backlog.len. Under high
load, this could continue indefinitely: packets keep arriving, so the
backlog never empties, so sk->sk_backlog.len never gets reset.
However, packets are actually being processed from the backlog, so
it's possible that the actual size of the backlog isn't changing, yet
sk->sk_backlog.len continues to grow.

3. Eventually, the growth in sk->sk_backlog.len will be limited by the
"limit" argument to sk_add_backlog. When this happens, packets will be
dropped.

4. Now suppose I pass a value of 1000000 as the limit to
sk_add_backlog. It's possible that sk_add_backlog will reject my
request even though the backlog only contains a total of 10000 bytes.
The other 990000 bytes were present on the backlog at one time (though
not necessarily all at the same time), but they have been processed
and removed; __release_sock hasn't gotten around to updating
sk->sk_backlog.len, because it hasn't been able to completely clear
the backlog.

5. Bottom line: under high load, a socket can be forced to drop
packets even though it never actually exceeded its memory budget. This
isn't a case of a sender trying to fool us; we fooled ourselves,
because of the delay in resetting sk->sk_backlog.len.

Does this make sense?

By the way, I have actually observed this phenomenon in an
implementation of the Homa transport protocol.


* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 17:25             ` John Ousterhout
@ 2019-10-01 18:34               ` Eric Dumazet
  2019-10-01 20:45                 ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2019-10-01 18:34 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev



On 10/1/19 10:25 AM, John Ousterhout wrote:
> On Tue, Oct 1, 2019 at 9:19 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> ...
>> Sorry, I have no idea what is the problem you see.
> 
> OK, let me try again from the start. Consider two values:
> * sk->sk_backlog.len
> * The actual number of bytes in buffers in the current backlog list
> 
> Now consider a series of propositions:
> 
> 1. These two are not always the same. As packets get processed by
> calling sk_backlog_rcv, they are removed from the backlog list, so the
> actual amount of memory consumed by the backlog list drops. However,
> sk->sk_backlog.len doesn't change until the entire backlog is cleared,
> at which point it is reset to zero. So, there can be periods of time
> where sk->sk_backlog.len overstates the actual memory consumption of
> the backlog.

Yes, this is done on purpose (and documented in __release_sock()).

Otherwise you could have a livelock situation, with the user thread being
trapped forever in a system call, never returning to user land.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871


> 
> 2. The gap between sk->sk_backlog.len and actual backlog size can grow
> quite large. This happens if new packets arrive while sk_backlog_rcv
> is working. The socket is locked, so these new packets will be added
> to the backlog, which will increase sk->sk_backlog_len. Under high
> load, this could continue indefinitely: packets keep arriving, so the
> backlog never empties, so sk->sk_backlog_len never gets reset.
> However, packets are actually being processed from the backlog, so
> it's possible that the actual size of the backlog isn't changing, yet
> sk->sk_backlog.len continues to grow.
> 
> 3. Eventually, the growth in sk->sk_backlog.len will be limited by the
> "limit" argument to sk_add_backlog. When this happens, packets will be
> dropped.

_Exactly_, WAI (working as intended).

> 
> 4. Now suppose I pass a value of 1000000 as the limit to
> sk_add_backlog. It's possible that sk_add_backlog will reject my
> request even though the backlog only contains a total of 10000 bytes.
> The other 990000 bytes were present on the backlog at one time (though
> not necessarily all at the same time), but they have been processed
> and removed; __release_sock hasn't gotten around to updating
> sk->sk_backlog.len, because it hasn't been able to completely clear
> the backlog.

WAI

> 
> 5. Bottom line: under high load, a socket can be forced to drop
> packets even though it never actually exceeded its memory budget. This
> isn't a case of a sender trying to fool us; we fooled ourselves,
> because of the delay in resetting sk->sk_backlog.len.
> 
> Does this make sense?

Yes, just increase your socket limits with setsockopt(... SO_RCVBUF ...),
and risk user threads having bigger socket syscall latencies, obviously.

> 
> By the way, I have actually observed this phenomenon in an
> implementation of the Homa transport protocol.
> 

Maybe this transport protocol should size its socket limits correctly :)


* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 18:34               ` Eric Dumazet
@ 2019-10-01 20:45                 ` John Ousterhout
  2019-10-01 20:53                   ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 20:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 11:34 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 10/1/19 10:25 AM, John Ousterhout wrote:
> > On Tue, Oct 1, 2019 at 9:19 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> ...
> >> Sorry, I have no idea what is the problem you see.
> >
> > OK, let me try again from the start. Consider two values:
> > * sk->sk_backlog.len
> > * The actual number of bytes in buffers in the current backlog list
> >
> > Now consider a series of propositions:
> >
> > 1. These two are not always the same. As packets get processed by
> > calling sk_backlog_rcv, they are removed from the backlog list, so the
> > actual amount of memory consumed by the backlog list drops. However,
> > sk->sk_backlog.len doesn't change until the entire backlog is cleared,
> > at which point it is reset to zero. So, there can be periods of time
> > where sk->sk_backlog.len overstates the actual memory consumption of
> > the backlog.
>
> Yes, this is done on purpose (and documented in __release_sock()
>
> Otherwise you could have a livelock situation, with user thread being
> trapped forever in system, and never return to user land.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871
>
>
> >
> > 2. The gap between sk->sk_backlog.len and actual backlog size can grow
> > quite large. This happens if new packets arrive while sk_backlog_rcv
> > is working. The socket is locked, so these new packets will be added
> > to the backlog, which will increase sk->sk_backlog_len. Under high
> > load, this could continue indefinitely: packets keep arriving, so the
> > backlog never empties, so sk->sk_backlog_len never gets reset.
> > However, packets are actually being processed from the backlog, so
> > it's possible that the actual size of the backlog isn't changing, yet
> > sk->sk_backlog.len continues to grow.
> >
> > 3. Eventually, the growth in sk->sk_backlog.len will be limited by the
> > "limit" argument to sk_add_backlog. When this happens, packets will be
> > dropped.
>
> _Exactly_ WAI
>
> >
> > 4. Now suppose I pass a value of 1000000 as the limit to
> > sk_add_backlog. It's possible that sk_add_backlog will reject my
> > request even though the backlog only contains a total of 10000 bytes.
> > The other 990000 bytes were present on the backlog at one time (though
> > not necessarily all at the same time), but they have been processed
> > and removed; __release_sock hasn't gotten around to updating
> > sk->sk_backlog.len, because it hasn't been able to completely clear
> > the backlog.
>
> WAI
>
> >
> > 5. Bottom line: under high load, a socket can be forced to drop
> > packets even though it never actually exceeded its memory budget. This
> > isn't a case of a sender trying to fool us; we fooled ourselves,
> > because of the delay in resetting sk->sk_backlog.len.
> >
> > Does this make sense?
>
> Yes, just increase your socket limits. setsockopt(...  SO_RCVBUF ...),
> and risk user threads having bigger socket syscall latencies, obviously.

OK, I think I understand where you are coming from now. However, I
have a comment and a suggestion.

Comment: the terminology here confused me. The problem is that
sk->sk_backlog.len isn't actually a "length" (e.g., # bytes currently
stored in the backlog). It's actually "the total number of bytes added
to the backlog since the last time it was completely cleared."
Unfortunately, this isn't documented. Same thing for the limit: the
goal isn't to limit the amount of data in the backlog; it's to
limit the amount of data processed in one call to __release_sock.
Although I saw the comment in __release_sock, it didn't have quite
enough info to fully apprise me of the problem being solved.

Suggestion: although the code does solve the problem you mentioned, it
has the unpleasant side effect that it can cause packet drops. This is
unfortunate because the drops are most likely to happen in periods of
overload, which is when you'd really like not to waste work that has
already been done. Did you consider the possibility of returning from
__release_sock without processing the entire backlog? For example,
once __release_sock has processed a bunch of packets, why not take the
others and feed them back through the NAPI mechanism again, so they'll
be passed to the net_protocol.handler again?

> > By the way, I have actually observed this phenomenon in an
> > implementation of the Homa transport protocol.
> >
>
> Maybe this transport protocol should size correctly its sockets limits :)

But this isn't really about socket resource limits (though that is
conflated in the implementation); it's about limiting the time spent
in a single call to __release_sock, no?


* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 20:45                 ` John Ousterhout
@ 2019-10-01 20:53                   ` Eric Dumazet
  2019-10-01 23:01                     ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2019-10-01 20:53 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev



On 10/1/19 1:45 PM, John Ousterhout wrote:

> 
> But this isn't really about socket resource limits (though that is
> conflated in the implementation); it's about limiting the time spent
> in a single call to __release_sock, no?

The proxy used is memory usage, not time usage.

cond_resched() or a preemptible kernel makes anything based on time flaky;
you probably do not want to play with a time limit...





* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 20:53                   ` Eric Dumazet
@ 2019-10-01 23:01                     ` John Ousterhout
  0 siblings, 0 replies; 11+ messages in thread
From: John Ousterhout @ 2019-10-01 23:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 1:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 10/1/19 1:45 PM, John Ousterhout wrote:
>
> >
> > But this isn't really about socket resource limits (though that is
> > conflated in the implementation); it's about limiting the time spent
> > in a single call to __release_sock, no?
>
> The proxy used is memory usage, not time usage.

I apologize for being pedantic, but the proxy isn't memory usage; it's
actually "number of bytes added to the backlog since the last time it
was emptied". At the time the limit is hit, actual memory usage is
probably a lot less than the limit. This was the source of my
confusion, since I assumed you really *wanted* memory usage to be the
limit.


> cond_resched() or a preemptible kernel makes anything based on time flaky,
> you probably do not want to play with a time limit...
>
>
>


end of thread, other threads:[~2019-10-01 23:02 UTC | newest]

Thread overview: 11+ messages
