From: Willem de Bruijn
Subject: Re: [PATCH net-next v2 1/2] udp: msg_zerocopy
Date: Wed, 28 Nov 2018 18:50:15 -0500
To: Paolo Abeni
Cc: Network Development, David Miller, Willem de Bruijn
References: <20181126152939.258443-1-willemdebruijn.kernel@gmail.com>
 <20181126152939.258443-2-willemdebruijn.kernel@gmail.com>
 <11350ff03fc6b03e34f8b4acd063371c887758d8.camel@redhat.com>

On Mon, Nov 26, 2018 at 2:49 PM Willem de Bruijn wrote:
>
> On Mon, Nov 26, 2018 at 1:19 PM Willem de Bruijn wrote:
> >
> > On Mon, Nov 26, 2018 at 1:04 PM Paolo Abeni wrote:
> > >
> > > On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> > > > The callers of this function do flush the queue of the other skbs on
> > > > error, but only after the call to sock_zerocopy_put_abort.
> > > >
> > > > sock_zerocopy_put_abort depends on total rollback to revert the
> > > > sk_zckey increment and suppress the completion notification (which
> > > > must not happen on return with error).
> > > >
> > > > I don't immediately have a fix. Need to think about this some more..
> > >
> > > [still out of sheer ignorance] How about tacking on a refcnt for the
> > > whole ip_append_data() scope, like in the TCP case? That will add an
> > > atomic op per loop (likely hitting the cache), but will remove some
> > > code hunks in sock_zerocopy_put_abort() and sock_zerocopy_alloc().
> >
> > The atomic op pair is indeed what I was trying to avoid. But I also need
> > to solve the problem that the final decrement will happen from the freeing
> > of the other skbs in __ip_flush_pending_frames, and will not suppress
> > the notification.
> >
> > Freeing the entire queue inside __ip_append_data, effectively making it
> > a true noop on error, is one approach. But that is invasive, also to
> > non-zerocopy code paths, so I would rather avoid that.
> >
> > Perhaps I need to handle the abort logic in udp_sendmsg directly,
> > after both __ip_append_data and __ip_flush_pending_frames.
>
> Actually,
>
> (1) the notification suppression is handled correctly, as .._abort
> decrements uarg->len. If now zero, this suppresses the notification
> in sock_zerocopy_callback, regardless of whether that callback is
> called right away or from a later kfree_skb.
>
> (2) if skb_zcopy_set is moved below getfrag, then no kfree_skb
> will be called on a zerocopy skb inside __ip_append_data. So on
> abort the refcount is exactly the number of zerocopy skbs on the
> queue that will call sock_zerocopy_put later. Abort then only needs
> to handle the special case of zero, and call sock_zerocopy_put right away.

An additional issue is that refcount_t cannot be initialized to zero and
then incremented for each skb, unlike the original atomic_t-based patch.

I did revert to the basic implementation using an extra ref for the
function call, similar to TCP, as you suggested. On top of that, as a
separate optimization patch, I have a variant that uses refcnt zero by
replacing refcount_inc with refcount_set(.., refcount_read(..) + 1).
Not very pretty.
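For illustration only, the zero-based variant boils down to roughly the
below (a sketch, not the actual patch; the helper name is made up, and
the non-atomic read + set is only sane when all increments are
serialized, e.g. under the socket lock):

#include <linux/refcount.h>
#include <linux/skbuff.h>

/* Sketch: take one reference per zerocopy skb while refcnt may be zero.
 * refcount_inc() warns and saturates when called on a zero count, so
 * open code the increment as a read + set. Only safe if the appends
 * for a given uarg are serialized, e.g. under the socket lock.
 */
static void uarg_hold_for_skb(struct ubuf_info *uarg)
{
	refcount_set(&uarg->refcnt, refcount_read(&uarg->refcnt) + 1);
}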
An alternative way to avoid the cost of sock_zerocopy_put in the fast
path is to add a static branch on SO_ZEROCOPY (see the sketch at the
end of this mail). That would also compile it out for TCP in the
common case.

Anyway, I do intend to send a v3, but it's taking a bit longer to find
a clean solution.

> Tentative fix on top of v2 (I'll squash into v3):
>
> ---
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2179ef84bb44..4b21a58329d1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1099,12 +1099,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
>
> -		if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
> +		if (sk->sk_type != SOCK_STREAM &&
> +		    !refcount_read(&uarg->refcnt)) {
>  			refcount_set(&uarg->refcnt, 1);
> -
> -		sock_zerocopy_put(uarg);
> +			sock_zerocopy_put(uarg);
> +		}
>  	}
>  }
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7504da2f33d6..a19396e21b35 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1014,13 +1014,6 @@ static int __ip_append_data(struct sock *sk,
>  			skb->csum = 0;
>  			skb_reserve(skb, hh_len);
>
> -			/* only the initial fragment is time stamped */
> -			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> -			cork->tx_flags = 0;
> -			skb_shinfo(skb)->tskey = tskey;
> -			tskey = 0;
> -			skb_zcopy_set(skb, uarg);
> -
>  			/*
>  			 * Find where to start putting bytes.
>  			 */
> @@ -1053,6 +1046,13 @@ static int __ip_append_data(struct sock *sk,
>  			exthdrlen = 0;
>  			csummode = CHECKSUM_NONE;
>
> +			/* only the initial fragment is time stamped */
> +			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> +			cork->tx_flags = 0;
> +			skb_shinfo(skb)->tskey = tskey;
> +			tskey = 0;
> +			skb_zcopy_set(skb, uarg);
> +
>  			if ((flags & MSG_CONFIRM) && !skb_prev)
>  				skb_set_dst_pending_confirm(skb, 1);
>
> ---
>
> This patch moves all the skb_shinfo operations to after the copy,
> to avoid touching it twice.
>
> Instead of the refcnt trick, I could also refactor sock_zerocopy_put
> and call __sock_zerocopy_put.
>
> ---
>
> -void sock_zerocopy_put(struct ubuf_info *uarg)
> +static void __sock_zerocopy_put(struct ubuf_info *uarg)
>  {
> -	if (uarg && refcount_dec_and_test(&uarg->refcnt)) {
> -		if (uarg->callback)
> -			uarg->callback(uarg, uarg->zerocopy);
> -		else
> -			consume_skb(skb_from_uarg(uarg));
> -	}
> +	if (uarg->callback)
> +		uarg->callback(uarg, uarg->zerocopy);
> +	else
> +		consume_skb(skb_from_uarg(uarg));
> +}
> +
> +void sock_zerocopy_put(struct ubuf_info *uarg)
> +{
> +	if (uarg && refcount_dec_and_test(&uarg->refcnt))
> +		__sock_zerocopy_put(uarg);
>  }
>  EXPORT_SYMBOL_GPL(sock_zerocopy_put);
>
> ---
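For completeness, the static branch idea mentioned above would be along
these lines (only a sketch; the key and helper names are made up and
none of this is part of the posted patches):

#include <linux/jump_label.h>
#include <linux/skbuff.h>

/* Hypothetical key, enabled the first time any socket sets SO_ZEROCOPY.
 * Until then the branch is patched out, so non-zerocopy senders (and
 * TCP in the common case) skip the uarg refcount put entirely.
 */
static DEFINE_STATIC_KEY_FALSE(msg_zerocopy_used);

/* would be called from the SO_ZEROCOPY setsockopt handler */
static void msg_zerocopy_mark_used(void)
{
	static_branch_inc(&msg_zerocopy_used);
}

static inline void sock_zerocopy_put_maybe(struct ubuf_info *uarg)
{
	if (static_branch_unlikely(&msg_zerocopy_used))
		sock_zerocopy_put(uarg);
}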