* BUG: sk_backlog.len can overestimate
       [not found] <CAGXJAmwQw1ohc48NfAvMyNDpDgHGkdVO89Jo8B0j0TuMr7wLpA@mail.gmail.com>
@ 2019-09-30 23:58 ` John Ousterhout
  2019-10-01  0:14   ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread

From: John Ousterhout @ 2019-09-30 23:58 UTC (permalink / raw)
To: netdev

As of 4.16.10, it appears to me that sk->sk_backlog_len does not
provide an accurate estimate of backlog length; this reduces the
usefulness of the "limit" argument to sk_add_backlog.

The problem is that, under heavy load, sk->sk_backlog_len can grow
arbitrarily large, even though the actual amount of data in the
backlog is small. This happens because __release_sock doesn't reset
the backlog length until it gets completely caught up. Under heavy
load, new packets can be arriving continuously into the backlog
(which increases sk_backlog.len) while other packets are being
serviced. This can go on forever, so sk_backlog.len never gets reset
and it can become arbitrarily large.

Because of this, the "limit" argument to sk_add_backlog may not be
useful, since it could result in packets being discarded even though
the backlog is not very large.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: BUG: sk_backlog.len can overestimate
  2019-09-30 23:58 ` BUG: sk_backlog.len can overestimate John Ousterhout
@ 2019-10-01  0:14   ` Eric Dumazet
  2019-10-01 15:44     ` John Ousterhout
       [not found]     ` <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>
  0 siblings, 2 replies; 11+ messages in thread

From: Eric Dumazet @ 2019-10-01  0:14 UTC (permalink / raw)
To: John Ousterhout, netdev

On 9/30/19 4:58 PM, John Ousterhout wrote:
> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
> provide an accurate estimate of backlog length; this reduces the
> usefulness of the "limit" argument to sk_add_backlog.
>
> The problem is that, under heavy load, sk->sk_backlog_len can grow
> arbitrarily large, even though the actual amount of data in the
> backlog is small. This happens because __release_sock doesn't reset
> the backlog length until it gets completely caught up. Under heavy
> load, new packets can be arriving continuously into the backlog
> (which increases sk_backlog.len) while other packets are being
> serviced. This can go on forever, so sk_backlog.len never gets reset
> and it can become arbitrarily large.

Certainly not.

It can not grow arbitrarily large, unless a backport gone wrong maybe.

> Because of this, the "limit" argument to sk_add_backlog may not be
> useful, since it could result in packets being discarded even though
> the backlog is not very large.

You will have to study git log/history for the details, the limit _is_ useful,
and we reset the limit in __release_sock() only when _safe_.

Assuming you talk about TCP, then I suggest you use a more recent kernel.

linux-5.0 got coalescing in the backlog queue, which helped quite a bit.
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01  0:14 ` Eric Dumazet
@ 2019-10-01 15:44   ` John Ousterhout
       [not found]   ` <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 15:44 UTC (permalink / raw)
To: netdev

On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 9/30/19 4:58 PM, John Ousterhout wrote:
> > [...]
> > This can go on forever, so sk_backlog.len never gets reset
> > and it can become arbitrarily large.
>
> Certainly not.
>
> It can not grow arbitrarily large, unless a backport gone wrong maybe.

Can you help me understand what would limit the growth of this value?
Suppose that new packets are arriving as quickly as they are
processed. Every time __release_sock calls sk_backlog_rcv, a new
packet arrives during the call, which is added to the backlog,
incrementing sk_backlog.len. However, sk_backlog_len doesn't get
decreased when sk_backlog_rcv completes, since the backlog hasn't
emptied (as you said, it's not "safe"). As a result, sk_backlog.len
has increased, but the actual backlog length is unchanged (one packet
was added, one was removed). Why can't this process repeat
indefinitely, until eventually sk_backlog.len reaches whatever limit
the transport specifies when it invokes sk_add_backlog? At this point
packets will be dropped by the transport even though the backlog isn't
actually very large.

> You will have to study git log/history for the details, the limit _is_ useful,
> and we reset the limit in __release_sock() only when _safe_.
>
> Assuming you talk about TCP, then I suggest you use a more recent kernel.
>
> linux-5.0 got coalescing in the backlog queue, which helped quite a bit.
[parent not found: <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>]
[parent not found: <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>]
* Fwd: BUG: sk_backlog.len can overestimate
       [not found]   ` <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>
@ 2019-10-01 15:46     ` John Ousterhout
  2019-10-01 15:48     ` John Ousterhout
  1 sibling, 0 replies; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 15:46 UTC (permalink / raw)
To: netdev

(I accidentally dropped netdev on my earlier message... here is Eric's
response, which also didn't go to the group)

---------- Forwarded message ---------
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, Sep 30, 2019 at 6:53 PM
Subject: Re: BUG: sk_backlog.len can overestimate
To: John Ousterhout <ouster@cs.stanford.edu>

On 9/30/19 5:41 PM, John Ousterhout wrote:
> On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> On 9/30/19 4:58 PM, John Ousterhout wrote:
>>> [...]
>>
>> Certainly not.
>>
>> It can not grow arbitrarily large, unless a backport gone wrong maybe.
>
> Can you help me understand what would limit the growth of this value?
> [...] Why can't this process repeat
> indefinitely, until eventually sk_backlog.len reaches whatever limit
> the transport specifies when it invokes sk_add_backlog? At this point
> packets will be dropped by the transport even though the backlog isn't
> actually very large.

The process is bounded by socket sk_rcvbuf + sk_sndbuf

bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
{
        u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;

        ...
        if (unlikely(sk_add_backlog(sk, skb, limit))) {
                ...
                __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
                ...
        }

Once the limit is reached, sk_backlog.len wont be touched, unless __release_sock()
has processed the whole queue.
* Re: BUG: sk_backlog.len can overestimate
       [not found]   ` <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>
  2019-10-01 15:46     ` Fwd: " John Ousterhout
@ 2019-10-01 15:48     ` John Ousterhout
  2019-10-01 16:19       ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 15:48 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Mon, Sep 30, 2019 at 6:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 9/30/19 5:41 PM, John Ousterhout wrote:
> > [...]
> > Why can't this process repeat
> > indefinitely, until eventually sk_backlog.len reaches whatever limit
> > the transport specifies when it invokes sk_add_backlog? At this point
> > packets will be dropped by the transport even though the backlog isn't
> > actually very large.
>
> The process is bounded by socket sk_rcvbuf + sk_sndbuf
>
> bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> {
>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
>
>         ...
>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>                 ...
>                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
>                 ...
>         }
>
> Once the limit is reached, sk_backlog.len wont be touched, unless __release_sock()
> has processed the whole queue.

Sorry if I'm missing something obvious here, but when you say
"sk_backlog.len won't be touched", doesn't that mean that incoming
packets will have to be dropped? And can't this occur even though the
true size of the backlog might be way less than sk_rcvbuf + sk_sndbuf,
as I described above? It seems to me that the basic problem is that
sk_backlog.len could exceed any given limit, even though there aren't
actually that many bytes still left in the backlog.
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 15:48 ` John Ousterhout
@ 2019-10-01 16:19   ` Eric Dumazet
  2019-10-01 17:25     ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread

From: Eric Dumazet @ 2019-10-01 16:19 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev

On 10/1/19 8:48 AM, John Ousterhout wrote:
> [...]
> Sorry if I'm missing something obvious here, but when you say
> "sk_backlog.len won't be touched", doesn't that mean that incoming
> packets will have to be dropped?

Yes packets are dropped if the socket has exhausted its memory budget.

Presumably the sender is trying to fool us.

> And can't this occur even though the
> true size of the backlog might be way less than sk_rcvbuf + sk_sndbuf,
> as I described above? It seems to me that the basic problem is that
> sk_backlog.len could exceed any given limit, even though there aren't
> actually that many bytes still left in the backlog.

Sorry, I have no idea what is the problem you see.
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 16:19 ` Eric Dumazet
@ 2019-10-01 17:25   ` John Ousterhout
  2019-10-01 18:34     ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 17:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 9:19 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> ...
> Sorry, I have no idea what is the problem you see.

OK, let me try again from the start. Consider two values:
* sk->sk_backlog.len
* The actual number of bytes in buffers in the current backlog list

Now consider a series of propositions:

1. These two are not always the same. As packets get processed by
calling sk_backlog_rcv, they are removed from the backlog list, so the
actual amount of memory consumed by the backlog list drops. However,
sk->sk_backlog.len doesn't change until the entire backlog is cleared,
at which point it is reset to zero. So, there can be periods of time
where sk->sk_backlog.len overstates the actual memory consumption of
the backlog.

2. The gap between sk->sk_backlog.len and actual backlog size can grow
quite large. This happens if new packets arrive while sk_backlog_rcv
is working. The socket is locked, so these new packets will be added
to the backlog, which will increase sk->sk_backlog_len. Under high
load, this could continue indefinitely: packets keep arriving, so the
backlog never empties, so sk->sk_backlog_len never gets reset.
However, packets are actually being processed from the backlog, so
it's possible that the actual size of the backlog isn't changing, yet
sk->sk_backlog.len continues to grow.

3. Eventually, the growth in sk->sk_backlog.len will be limited by the
"limit" argument to sk_add_backlog. When this happens, packets will be
dropped.

4. Now suppose I pass a value of 1000000 as the limit to
sk_add_backlog. It's possible that sk_add_backlog will reject my
request even though the backlog only contains a total of 10000 bytes.
The other 990000 bytes were present on the backlog at one time (though
not necessarily all at the same time), but they have been processed
and removed; __release_sock hasn't gotten around to updating
sk->sk_backlog.len, because it hasn't been able to completely clear
the backlog.

5. Bottom line: under high load, a socket can be forced to drop
packets even though it never actually exceeded its memory budget. This
isn't a case of a sender trying to fool us; we fooled ourselves,
because of the delay in resetting sk->sk_backlog.len.

Does this make sense?

By the way, I have actually observed this phenomenon in an
implementation of the Homa transport protocol.
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 17:25 ` John Ousterhout
@ 2019-10-01 18:34   ` Eric Dumazet
  2019-10-01 20:45     ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread

From: Eric Dumazet @ 2019-10-01 18:34 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev

On 10/1/19 10:25 AM, John Ousterhout wrote:
> [...]
> 1. These two are not always the same. As packets get processed by
> calling sk_backlog_rcv, they are removed from the backlog list, so the
> actual amount of memory consumed by the backlog list drops. However,
> sk->sk_backlog.len doesn't change until the entire backlog is cleared,
> at which point it is reset to zero. So, there can be periods of time
> where sk->sk_backlog.len overstates the actual memory consumption of
> the backlog.

Yes, this is done on purpose (and documented in __release_sock()

Otherwise you could have a livelock situation, with user thread being
trapped forever in system, and never return to user land.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871

> 2. [...]
>
> 3. Eventually, the growth in sk->sk_backlog.len will be limited by the
> "limit" argument to sk_add_backlog. When this happens, packets will be
> dropped.

_Exactly_ WAI

> 4. Now suppose I pass a value of 1000000 as the limit to
> sk_add_backlog. It's possible that sk_add_backlog will reject my
> request even though the backlog only contains a total of 10000 bytes.
> [...]

WAI

> 5. Bottom line: under high load, a socket can be forced to drop
> packets even though it never actually exceeded its memory budget. This
> isn't a case of a sender trying to fool us; we fooled ourselves,
> because of the delay in resetting sk->sk_backlog.len.
>
> Does this make sense?

Yes, just increase your socket limits. setsockopt(... SO_RCVBUF ...),
and risk user threads having bigger socket syscall latencies, obviously.

> By the way, I have actually observed this phenomenon in an
> implementation of the Homa transport protocol.

Maybe this transport protocol should size correctly its sockets limits :)
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 18:34 ` Eric Dumazet
@ 2019-10-01 20:45   ` John Ousterhout
  2019-10-01 20:53     ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 20:45 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 11:34 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 10/1/19 10:25 AM, John Ousterhout wrote:
> > [...]
>
> Yes, this is done on purpose (and documented in __release_sock()
>
> Otherwise you could have a livelock situation, with user thread being
> trapped forever in system, and never return to user land.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871
>
> [...]
>
> Yes, just increase your socket limits. setsockopt(... SO_RCVBUF ...),
> and risk user threads having bigger socket syscall latencies, obviously.

OK, I think I understand where you are coming from now. However, I
have a comment and a suggestion.

Comment: the terminology here confused me. The problem is that
sk->sk_backlog.len isn't actually a "length" (e.g., # bytes currently
stored in the backlog). It's actually "the total number of bytes added
to the backlog since the last time it was completely cleared."
Unfortunately, this isn't documented. Same thing for the limit: the
goal isn't to limit the amount of data in the backlog, it's to limit
the amount of data processed in one call to __release_sock. Although I
saw the comment in __release_sock, it didn't have quite enough info to
fully apprise me of the problem being solved.

Suggestion: although the code does solve the problem you mentioned, it
has the unpleasant side effect that it can cause packet drops. This is
unfortunate because the drops are most likely to happen in periods of
overload, which is when you'd really like not to waste work that has
already been done. Did you consider the possibility of returning from
__release_sock without processing the entire backlog? For example,
once __release_sock has processed a bunch of packets, why not take the
others and feed them back through the NAPI mechanism again, so they'll
be passed to the net_protocol.handler again?

> > By the way, I have actually observed this phenomenon in an
> > implementation of the Homa transport protocol.
>
> Maybe this transport protocol should size correctly its sockets limits :)

But this isn't really about socket resource limits (though that is
conflated in the implementation); it's about limiting the time spent
in a single call to __release_sock, no?
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 20:45 ` John Ousterhout
@ 2019-10-01 20:53   ` Eric Dumazet
  2019-10-01 23:01     ` John Ousterhout
  0 siblings, 1 reply; 11+ messages in thread

From: Eric Dumazet @ 2019-10-01 20:53 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev

On 10/1/19 1:45 PM, John Ousterhout wrote:
>
> But this isn't really about socket resource limits (though that is
> conflated in the implementation); it's about limiting the time spent
> in a single call to __release_sock, no?

The proxy used is memory usage, not time usage.

cond_resched() or a preemptible kernel makes anything based on time flaky,
you probably do not want to play with a time limit...
* Re: BUG: sk_backlog.len can overestimate
  2019-10-01 20:53 ` Eric Dumazet
@ 2019-10-01 23:01   ` John Ousterhout
  0 siblings, 0 replies; 11+ messages in thread

From: John Ousterhout @ 2019-10-01 23:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Tue, Oct 1, 2019 at 1:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 10/1/19 1:45 PM, John Ousterhout wrote:
> > [...]
>
> The proxy used is memory usage, not time usage.

I apologize for being pedantic, but the proxy isn't memory usage; it's
actually "number of bytes added to the backlog since the last time it
was emptied". At the time the limit is hit, actual memory usage is
probably a lot less than the limit. This was the source of my
confusion, since I assumed you really *wanted* memory usage to be the
limit.

> cond_resched() or a preemptible kernel makes anything based on time flaky,
> you probably do not want to play with a time limit...
end of thread, other threads:[~2019-10-01 23:02 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
     [not found] <CAGXJAmwQw1ohc48NfAvMyNDpDgHGkdVO89Jo8B0j0TuMr7wLpA@mail.gmail.com>
2019-09-30 23:58 ` BUG: sk_backlog.len can overestimate John Ousterhout
2019-10-01  0:14   ` Eric Dumazet
2019-10-01 15:44     ` John Ousterhout
     [not found]     ` <CAGXJAmxmJ-Vm379N4nbjXeQCAgY9ur53wmr0HZy23dQ_t++r-Q@mail.gmail.com>
     [not found]       ` <f4520c32-3133-fb3b-034e-d492d40eb066@gmail.com>
2019-10-01 15:46         ` Fwd: " John Ousterhout
2019-10-01 15:48         ` John Ousterhout
2019-10-01 16:19           ` Eric Dumazet
2019-10-01 17:25             ` John Ousterhout
2019-10-01 18:34               ` Eric Dumazet
2019-10-01 20:45                 ` John Ousterhout
2019-10-01 20:53                   ` Eric Dumazet
2019-10-01 23:01                     ` John Ousterhout