From: Willem de Bruijn
Subject: Re: [PATCH net-next v2 1/2] udp: msg_zerocopy
Date: Wed, 28 Nov 2018 18:50:15 -0500
To: Paolo Abeni
Cc: Network Development, David Miller, Willem de Bruijn
References: <20181126152939.258443-1-willemdebruijn.kernel@gmail.com>
 <20181126152939.258443-2-willemdebruijn.kernel@gmail.com>
 <11350ff03fc6b03e34f8b4acd063371c887758d8.camel@redhat.com>

On Mon, Nov 26, 2018 at 2:49 PM Willem de Bruijn wrote:
>
> On Mon, Nov 26, 2018 at 1:19 PM Willem de Bruijn wrote:
> >
> > On Mon, Nov 26, 2018 at 1:04 PM Paolo Abeni wrote:
> > >
> > > On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> > > > The callers of this function do flush the queue of the other skbs on
> > > > error, but only after the call to sock_zerocopy_put_abort.
> > > >
> > > > sock_zerocopy_put_abort depends on total rollback to revert the
> > > > sk_zckey increment and suppress the completion notification (which
> > > > must not happen on return with error).
> > > >
> > > > I don't immediately have a fix. Need to think about this some more..
> > >
> > > [still out of sheer ignorance] How about tacking on a refcnt for the
> > > whole ip_append_data() scope, like in the TCP case? That will add an
> > > atomic op per loop (likely hitting the cache), but will remove some
> > > code hunks in sock_zerocopy_put_abort() and sock_zerocopy_alloc().
> >
> > The atomic op pair is indeed what I was trying to avoid. But I also need
> > to solve the problem that the final decrement will happen from the freeing
> > of the other skbs in __ip_flush_pending_frames, and will not suppress
> > the notification.
> >
> > Freeing the entire queue inside __ip_append_data, effectively making it
> > a true noop on error, is one approach. But that is invasive, also to
> > non-zerocopy code paths, so I would rather avoid that.
> >
> > Perhaps I need to handle the abort logic in udp_sendmsg directly,
> > after both __ip_append_data and __ip_flush_pending_frames.
>
> Actually,
>
> (1) the notification suppression is handled correctly, as .._abort
> decrements uarg->len. If now zero, this suppresses the notification
> in sock_zerocopy_callback, regardless of whether that callback is
> called right away or from a later kfree_skb.
>
> (2) if skb_zcopy_set is moved below getfrag, then no kfree_skb
> will be called on a zerocopy skb inside __ip_append_data. So on
> abort the refcount is exactly the number of zerocopy skbs on the
> queue that will call sock_zerocopy_put later. Abort then only needs
> to handle the special case of zero, and call sock_zerocopy_put right away.

An additional issue is that refcount_t cannot be initialized to zero and
then incremented for each skb, unlike the original atomic_t-based patch.

I did revert to the basic implementation using an extra ref for the
function call, similar to TCP, as you suggested. On top of that, as a
separate optimization patch, I have a variant that uses refcnt zero by
replacing refcount_inc with refcount_set(.., refcount_read(..) + 1).
Not very pretty.
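For illustration only, the zero-based variant boils down to roughly the
below (a sketch, not the actual patch; the helper name is made up, and
the non-atomic read + set is only sane when all increments are
serialized, e.g. under the socket lock):

#include <linux/refcount.h>
#include <linux/skbuff.h>

/* Sketch: take one reference per zerocopy skb while refcnt may be zero.
 * refcount_inc() warns and saturates when called on a zero count, so
 * open code the increment as a read + set. Only safe if the appends
 * for a given uarg are serialized, e.g. under the socket lock.
 */
static void uarg_hold_for_skb(struct ubuf_info *uarg)
{
	refcount_set(&uarg->refcnt, refcount_read(&uarg->refcnt) + 1);
}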
An alternative way to avoid the cost of sock_zerocopy_put in the fast
path is to add a static branch on SO_ZEROCOPY (see the sketch at the
end of this mail). That would also compile it out for TCP in the
common case.

Anyway, I do intend to send a v3, but it's taking a bit longer to find
a clean solution.

> Tentative fix on top of v2 (I'll squash into v3):
>
> ---
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2179ef84bb44..4b21a58329d1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1099,12 +1099,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
>
> -		if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
> +		if (sk->sk_type != SOCK_STREAM &&
> +		    !refcount_read(&uarg->refcnt)) {
>  			refcount_set(&uarg->refcnt, 1);
> -
> -		sock_zerocopy_put(uarg);
> +			sock_zerocopy_put(uarg);
> +		}
>  	}
>  }
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7504da2f33d6..a19396e21b35 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1014,13 +1014,6 @@ static int __ip_append_data(struct sock *sk,
>  			skb->csum = 0;
>  			skb_reserve(skb, hh_len);
>
> -			/* only the initial fragment is time stamped */
> -			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> -			cork->tx_flags = 0;
> -			skb_shinfo(skb)->tskey = tskey;
> -			tskey = 0;
> -			skb_zcopy_set(skb, uarg);
> -
>  			/*
>  			 * Find where to start putting bytes.
>  			 */
> @@ -1053,6 +1046,13 @@ static int __ip_append_data(struct sock *sk,
>  			exthdrlen = 0;
>  			csummode = CHECKSUM_NONE;
>
> +			/* only the initial fragment is time stamped */
> +			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> +			cork->tx_flags = 0;
> +			skb_shinfo(skb)->tskey = tskey;
> +			tskey = 0;
> +			skb_zcopy_set(skb, uarg);
> +
>  			if ((flags & MSG_CONFIRM) && !skb_prev)
>  				skb_set_dst_pending_confirm(skb, 1);
>
> ---
>
> This patch moves all the skb_shinfo operations to after the copy,
> to avoid touching it twice.
>
> Instead of the refcnt trick, I could also refactor sock_zerocopy_put
> and call __sock_zerocopy_put.
>
> ---
>
> -void sock_zerocopy_put(struct ubuf_info *uarg)
> +static void __sock_zerocopy_put(struct ubuf_info *uarg)
>  {
> -	if (uarg && refcount_dec_and_test(&uarg->refcnt)) {
> -		if (uarg->callback)
> -			uarg->callback(uarg, uarg->zerocopy);
> -		else
> -			consume_skb(skb_from_uarg(uarg));
> -	}
> +	if (uarg->callback)
> +		uarg->callback(uarg, uarg->zerocopy);
> +	else
> +		consume_skb(skb_from_uarg(uarg));
> +}
> +
> +void sock_zerocopy_put(struct ubuf_info *uarg)
> +{
> +	if (uarg && refcount_dec_and_test(&uarg->refcnt))
> +		__sock_zerocopy_put(uarg);
>  }
>  EXPORT_SYMBOL_GPL(sock_zerocopy_put);
>
> ---
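For completeness, the static branch idea mentioned above would be along
these lines (only a sketch; the key and helper names are made up and
none of this is part of the posted patches):

#include <linux/jump_label.h>
#include <linux/skbuff.h>

/* Hypothetical key, enabled the first time any socket sets SO_ZEROCOPY.
 * Until then the branch is patched out, so non-zerocopy senders (and
 * TCP in the common case) skip the uarg refcount put entirely.
 */
static DEFINE_STATIC_KEY_FALSE(msg_zerocopy_used);

/* would be called from the SO_ZEROCOPY setsockopt handler */
static void msg_zerocopy_mark_used(void)
{
	static_branch_inc(&msg_zerocopy_used);
}

static inline void sock_zerocopy_put_maybe(struct ubuf_info *uarg)
{
	if (static_branch_unlikely(&msg_zerocopy_used))
		sock_zerocopy_put(uarg);
}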