From: Willem de Bruijn
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Tue, 28 Aug 2018 15:57:47 -0400
To: Jason Wang
Cc: "Jon Olson (Google Drive)", "Michael S. Tsirkin",
 caleb.raitto@gmail.com, David Miller, Network Development,
 Caleb Raitto

On Mon, Jul 30, 2018 at 2:06 AM Jason Wang wrote:
>
> On 2018-07-25 08:17, Jon Olson wrote:
> > On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin wrote:
> >> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin wrote:
> >>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>> From the above linked patch, I understand that there are yet
> >>>>> other special cases in production, such as a hard cap on #tx
> >>>>> queues to 32 regardless of the number of vcpus.
> >>>>
> >>>> I don't think upstream kernels have this limit - we can
> >>>> now use vmalloc for a higher number of queues.
> >>>
> >>> Yes. That patch* mentioned it as a Google Compute Engine imposed
> >>> limit. It is exactly such cloud provider imposed rules that I'm
> >>> concerned about working around in upstream drivers.
> >>>
> >>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >>
> >> Yea. Why does GCE do it btw?
> >
> > There are a few reasons for the limit, some historical, some current.
> >
> > Historically we did this because of a kernel limit on the number of
> > TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > the limit upstream at the time we did it was actually eight. We had
> > increased the limit from eight to 32 internally, and the limit has
> > since been raised upstream to 256. We no longer use TAP for
> > networking, so that constraint no longer applies for us, but when
> > looking at removing/raising the limit we discovered no workloads
> > that clearly benefited from lifting it, and it also placed more
> > pressure on our virtual networking stack, particularly on the Tx
> > side. We left it as-is.
> >
> > In terms of current reasons there are really two. One is memory
> > usage. As you know, virtio-net uses rx/tx pairs, so there's an
> > expectation that the guest will have an Rx queue for every Tx
> > queue. We run our individual virtqueues fairly deep (4096 entries)
> > to give guests a wide time window for re-posting Rx buffers and
> > avoiding starvation on packet delivery. Filling an Rx vring with
> > max-sized mergeable buffers (4096 bytes) is 16MB of GFP_ATOMIC
> > allocations. At 32 queues this can be up to 512MB of memory posted
> > for network buffers.
> > Scaling this to the largest VM GCE offers today (160 VCPUs --
> > n1-ultramem-160), keeping all of the Rx rings full would (in the
> > large average Rx packet size case) consume up to 2.5 GB(!) of
> > guest RAM. Now, those VMs have 3.8T of RAM available, but I don't
> > believe we've observed a situation where they would have benefited
> > from having 2.5 gigs of buffers posted for incoming network
> > traffic :)
>
> We can work to have async txq and rxq instead of pairs if there's a
> strong requirement.
>
> > The second reason is interrupt related -- as I mentioned above, we
> > have found no workloads that clearly benefit from so many queues,
> > but we have found workloads that degrade. In particular, workloads
> > that do a lot of small packet processing but which aren't extremely
> > latency sensitive can achieve higher PPS by taking fewer interrupts
> > across fewer VCPUs due to better batching (this also incurs higher
> > latency, but at the limit the "busy" cores end up suppressing most
> > interrupts and spending most of their cycles farming out work).
> > Memcache is a good example here, particularly if the latency
> > targets for request completion are in the ~milliseconds range
> > (rather than the microseconds we typically strive for with
> > TCP_RR-style workloads).
> >
> > All of that said, we haven't been forthcoming with data (and
> > unfortunately I don't have it handy in a useful form, otherwise I'd
> > simply post it here), so I understand the hesitation to simply run
> > with napi_tx across the board. As Willem said, this patch seemed
> > like the least disruptive way to allow us to continue down the road
> > of "universal" NAPI Tx and to hopefully get data across enough
> > workloads (with VMs small, large, and absurdly large :) to present
> > a compelling argument in one direction or another. As far as I know
> > there aren't currently any NAPI related ethtool commands (based on
> > a quick perusal of ethtool.h).
>
> As I suggested before, maybe we can (ab)use tx-frames-irq.

I forgot to respond to this originally, but I agree. How about
something like the snippet below? It would be simpler to reason about
if we only allowed switching while the device is down, but napi does
not strictly require that.
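
(As an aside, the buffer arithmetic quoted above checks out; a quick
back-of-the-envelope of my own, using Jon's stated ring depth and
buffer size:

    4096 entries x 4096 bytes =  16 MB posted per Rx ring
      16 MB x  32 rings       = 512 MB
      16 MB x 160 rings       = 2.5 GB

so the 2.5 GB figure for a 160-queue n1-ultramem-160 follows
directly.)
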
+static int virtnet_set_coalesce(struct net_device *dev,
+				struct ethtool_coalesce *ec)
+{
+	const u32 tx_coalesce_napi_mask = (1 << 16);
+	const struct ethtool_coalesce ec_default = {
+		.cmd = ETHTOOL_SCOALESCE,
+		.rx_max_coalesced_frames = 1,
+		.tx_max_coalesced_frames = 1,
+	};
+	struct virtnet_info *vi = netdev_priv(dev);
+	int napi_weight = 0;
+	bool running;
+	int i;
+
+	if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
+		ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
+		napi_weight = NAPI_POLL_WEIGHT;
+	}
+
+	/* disallow changes to fields not explicitly tested above */
+	if (memcmp(ec, &ec_default, sizeof(ec_default)))
+		return -EINVAL;
+
+	if (napi_weight ^ vi->sq[0].napi.weight) {
+		running = netif_running(vi->dev);
+
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			vi->sq[i].napi.weight = napi_weight;
+
+			if (!running)
+				continue;
+
+			if (napi_weight)
+				virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+						       &vi->sq[i].napi);
+			else
+				napi_disable(&vi->sq[i].napi);
+		}
+	}
+
+	return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+				struct ethtool_coalesce *ec)
+{
+	const u32 tx_coalesce_napi_mask = (1 << 16);
+	const struct ethtool_coalesce ec_default = {
+		.cmd = ETHTOOL_GCOALESCE,
+		.rx_max_coalesced_frames = 1,
+		.tx_max_coalesced_frames = 1,
+	};
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	memcpy(ec, &ec_default, sizeof(ec_default));
+
+	if (vi->sq[0].napi.weight)
+		ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
+
+	return 0;
+}
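
For completeness, the two functions would still need to be wired up
in the driver's ethtool ops; roughly the following, against the
existing virtnet_ethtool_ops in drivers/net/virtio_net.c (untested):

+	.get_coalesce = virtnet_get_coalesce,
+	.set_coalesce = virtnet_set_coalesce,

Userspace would then toggle tx napi with the standard coalesce
command: "ethtool -C eth0 tx-frames 65537" to enable it (bit 16 set
on top of the default frame count of 1) and "ethtool -C eth0
tx-frames 1" to switch back to the non-napi default.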