netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RACE] net/unix: SIOCGOUTQ returns off-by-one data
@ 2020-12-16 13:10 David Rheinsberg
  0 siblings, 0 replies; only message in thread
From: David Rheinsberg @ 2020-12-16 13:10 UTC (permalink / raw)
  To: netdev

Hi

We currently use SIOCOUTQ on AF_UNIX+SOCK_STREAM sockets to figure out
whether data that we wrote to a socket is still pending or was
dequeued. Preferably, we would like to know whether a specific message
we queued was dequeued by the other side, but sadly SIOCOUTQ does not
report the actual payload-size, but `skb->truesize`, making it
impossible for us to predict how much data the other side has
dequeued. Hence, we simply use the value to wait for the queue to
empty.

Practically, this means under special circumstances we refrain from
writing data to a socket until SIOCOUTQ returns 0. If it does not
return 0, we wait to be woken-up by a following EPOLLOUT (which works
fine in edge-triggered mode).

This worked fine for us, until we spuriously got off-by-one errors,
where SIOCOUTQ started returning 1, but did not send a wakeup /
EPOLLOUT afterwards despite the counter at some point dropping to 0.
The culprit seems to be sock_wfree():

[slightly simplified]
void sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;
    unsigned int len = skb->truesize;

    if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
        [...]
        refcount_sub_and_test(len - 1, &sk->sk_wmem_alloc);
        sk->sk_write_space(sk);
        len = 1;
    }
    [...]
    if (refcount_sub_and_test(len, &sk->sk_wmem_alloc))
        __sk_free(sk);
}

The kernel seems to use the `sk_wmem_alloc` counter for
reference-counting. Therefore, if we check SIOCOUTQ exactly between
`sk->sk_write_space()` and the following `refcount` update, we will
see a counter of 1, but never get woken up afterwards. Some of our
users can hit this reliably on newer arm64 machines.

My question now is whether this is intended behavior? I couldn't come
up with a simple fix, as I assume this overload on sk_wmem_alloc was
done for performance reasons. We could acquire an sk_refcnt before
sk_write_space and drop it afterwards, but it would probably need
further adjustments in sk_free() and friends, and would negatively
affect performance.

As a workaround we assume a return value lower than 128 means
effectively 0, because even a single queued skb would cause the
counter to be way bigger than 128, due to the size of `struct
sk_buff`. This gives us room to deal with up to 128 temporary
references the kernel holds. However, this does feel slightly awkward,
so I wondered whether someone has a better idea to fix this? Or
whether SIOCOUTQ is just not meant to be used that way?

NOTE: Our SIOCOUTQ code is used to properly account for
inflight-file-descriptors. The kernel accounts them on the sender's
quota, so we use this to reliably track how long an inflight FD is
queued on a socket, and thus accounted on us. This way we prevent
malicious users from causing us to queue too many inflight
file-descriptors and thus exceeding our quota.

Thanks
David

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2020-12-16 13:11 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-16 13:10 [RACE] net/unix: SIOCGOUTQ returns off-by-one data David Rheinsberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).