From: Willem de Bruijn
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Tue, 28 Aug 2018 15:57:47 -0400
To: Jason Wang
Cc: "Jon Olson (Google Drive)", "Michael S. Tsirkin",
 caleb.raitto@gmail.com, David Miller, Network Development,
 Caleb Raitto

On Mon, Jul 30, 2018 at 2:06 AM Jason Wang wrote:
>
> On 2018-07-25 08:17, Jon Olson wrote:
> > On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin wrote:
> >> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin wrote:
> >>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>> From the above linked patch, I understand that there are yet
> >>>>> other special cases in production, such as a hard cap on #tx
> >>>>> queues to 32 regardless of the number of vcpus.
> >>>>
> >>>> I don't think upstream kernels have this limit - we can
> >>>> now use vmalloc for a higher number of queues.
> >>>
> >>> Yes. That patch* mentioned it as a Google Compute Engine imposed
> >>> limit. It is exactly such cloud provider imposed rules that I'm
> >>> concerned about working around in upstream drivers.
> >>>
> >>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >>
> >> Yea. Why does GCE do it btw?
> >
> > There are a few reasons for the limit, some historical, some current.
> >
> > Historically we did this because of a kernel limit on the number of
> > TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > the limit upstream at the time we did it was actually eight. We had
> > increased the limit from eight to 32 internally, and the limit has
> > since been raised upstream to 256. We no longer use TAP for
> > networking, so that constraint no longer applies for us, but when
> > looking at removing/raising the limit we discovered no workloads
> > that clearly benefited from lifting it, and it also placed more
> > pressure on our virtual networking stack, particularly on the Tx
> > side. We left it as-is.
> >
> > In terms of current reasons there are really two. One is memory
> > usage. As you know, virtio-net uses rx/tx pairs, so there's an
> > expectation that the guest will have an Rx queue for every Tx
> > queue. We run our individual virtqueues fairly deep (4096 entries)
> > to give guests a wide time window for re-posting Rx buffers and
> > avoiding starvation on packet delivery. Filling an Rx vring with
> > max-sized mergeable buffers (4096 bytes) is 16MB of GFP_ATOMIC
> > allocations. At 32 queues this can be up to 512MB of memory posted
> > for network buffers.
> > Scaling this to the largest VM GCE offers today (160 VCPUs --
> > n1-ultramem-160), keeping all of the Rx rings full would (in the
> > large average Rx packet size case) consume up to 2.5 GB(!) of
> > guest RAM. Now, those VMs have 3.8T of RAM available, but I don't
> > believe we've observed a situation where they would have benefited
> > from having 2.5 gigs of buffers posted for incoming network
> > traffic :)
>
> We can work to have async txq and rxq instead of pairs if there's a
> strong requirement.
>
> > The second reason is interrupt related -- as I mentioned above, we
> > have found no workloads that clearly benefit from so many queues,
> > but we have found workloads that degrade. In particular, workloads
> > that do a lot of small packet processing but which aren't extremely
> > latency sensitive can achieve higher PPS by taking fewer interrupts
> > across fewer VCPUs due to better batching (this also incurs higher
> > latency, but at the limit the "busy" cores end up suppressing most
> > interrupts and spending most of their cycles farming out work).
> > Memcache is a good example here, particularly if the latency
> > targets for request completion are in the ~milliseconds range
> > (rather than the microseconds we typically strive for with
> > TCP_RR-style workloads).
> >
> > All of that said, we haven't been forthcoming with data (and
> > unfortunately I don't have it handy in a useful form, otherwise I'd
> > simply post it here), so I understand the hesitation to simply run
> > with napi_tx across the board. As Willem said, this patch seemed
> > like the least disruptive way to allow us to continue down the road
> > of "universal" NAPI Tx and to hopefully get data across enough
> > workloads (with VMs small, large, and absurdly large :) to present
> > a compelling argument in one direction or another. As far as I know
> > there aren't currently any NAPI related ethtool commands (based on
> > a quick perusal of ethtool.h).
>
> As I suggested before, maybe we can (ab)use tx-frames-irq.

I forgot to respond to this originally, but I agree. How about
something like the snippet below? It would be simpler to reason about
if we only allowed switching while the device is down, but napi does
not strictly require that.
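
(As an aside, the buffer arithmetic quoted above checks out; a quick
back-of-the-envelope of my own, using Jon's stated ring depth and
buffer size:

    4096 entries x 4096 bytes =  16 MB posted per Rx ring
      16 MB x  32 rings       = 512 MB
      16 MB x 160 rings       = 2.5 GB

so the 2.5 GB figure for a 160-queue n1-ultramem-160 follows
directly.)
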
+static int virtnet_set_coalesce(struct net_device *dev,
+				struct ethtool_coalesce *ec)
+{
+	const u32 tx_coalesce_napi_mask = (1 << 16);
+	const struct ethtool_coalesce ec_default = {
+		.cmd = ETHTOOL_SCOALESCE,
+		.rx_max_coalesced_frames = 1,
+		.tx_max_coalesced_frames = 1,
+	};
+	struct virtnet_info *vi = netdev_priv(dev);
+	int napi_weight = 0;
+	bool running;
+	int i;
+
+	if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
+		ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
+		napi_weight = NAPI_POLL_WEIGHT;
+	}
+
+	/* disallow changes to fields not explicitly tested above */
+	if (memcmp(ec, &ec_default, sizeof(ec_default)))
+		return -EINVAL;
+
+	if (napi_weight ^ vi->sq[0].napi.weight) {
+		running = netif_running(vi->dev);
+
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			vi->sq[i].napi.weight = napi_weight;
+
+			if (!running)
+				continue;
+
+			if (napi_weight)
+				virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+						       &vi->sq[i].napi);
+			else
+				napi_disable(&vi->sq[i].napi);
+		}
+	}
+
+	return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+				struct ethtool_coalesce *ec)
+{
+	const u32 tx_coalesce_napi_mask = (1 << 16);
+	const struct ethtool_coalesce ec_default = {
+		.cmd = ETHTOOL_GCOALESCE,
+		.rx_max_coalesced_frames = 1,
+		.tx_max_coalesced_frames = 1,
+	};
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	memcpy(ec, &ec_default, sizeof(ec_default));
+
+	if (vi->sq[0].napi.weight)
+		ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
+
+	return 0;
+}
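
For completeness, the two functions would still need to be wired up
in the driver's ethtool ops; roughly the following, against the
existing virtnet_ethtool_ops in drivers/net/virtio_net.c (untested):

+	.get_coalesce = virtnet_get_coalesce,
+	.set_coalesce = virtnet_set_coalesce,

Userspace would then toggle tx napi with the standard coalesce
command: "ethtool -C eth0 tx-frames 65537" to enable it (bit 16 set
on top of the default frame count of 1) and "ethtool -C eth0
tx-frames 1" to switch back to the non-napi default.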