From: Jon Olson <jonolson@google.com>
To: mst@redhat.com
Cc: willemdebruijn.kernel@gmail.com, caleb.raitto@gmail.com,
	jasowang@redhat.com, davem@davemloft.net, netdev@vger.kernel.org,
	Caleb Raitto <caraitto@google.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Tue, 24 Jul 2018 17:17:38 -0700	[thread overview]
Message-ID: <CAJoqh4UDJvFvfmFrzgHyk5pNXaQUqnLk7HtJzuxGoUzfZoy64Q@mail.gmail.com> (raw)
In-Reply-To: <20180725014410-mutt-send-email-mst@kernel.org>

On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > From the above-linked patch, I understand that there are yet
> > > > other special cases in production, such as a hard cap on #tx queues to
> > > > 32 regardless of number of vcpus.
> > >
> > > I don't think upstream kernels have this limit - we can
> > > now use vmalloc for higher number of queues.
> >
> > Yes. That patch* mentioned it as a Google Compute Engine imposed
> > limit. It is exactly such cloud provider imposed rules that I'm
> > concerned about working around in upstream drivers.
> >
> > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>
> Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the upstream limit at the time we imposed ours was actually eight. We
had raised the limit from eight to 32 internally, and upstream has
since increased it to 256. We no longer use TAP for networking, so
that constraint no longer applies to us, but when we looked at
removing/raising the limit we found no workloads that clearly
benefited from lifting it, and it also placed more pressure on our
virtual networking stack, particularly on the Tx side. So we left it
as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)
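For the curious, the memory figures above fall out of straightforward
multiplication. Here is a small sketch of that arithmetic, using only the
numbers stated in this message (4096-entry rings, 4096-byte mergeable
buffers); the function name is mine, not anything from the driver:

```python
# Sketch of the guest-memory math described above: memory posted for
# Rx buffers when every Rx virtqueue is kept full of max-sized
# mergeable buffers.

RING_ENTRIES = 4096   # virtqueue depth we run (per the message)
BUF_SIZE = 4096       # max-sized mergeable Rx buffer, in bytes

def rx_buffer_memory(num_queues, entries=RING_ENTRIES, buf=BUF_SIZE):
    """Bytes of guest memory posted if all Rx rings are full."""
    return num_queues * entries * buf

print(rx_buffer_memory(1) // 2**20, "MiB per full Rx ring")     # 16 MiB
print(rx_buffer_memory(32) // 2**20, "MiB at the 32-queue cap") # 512 MiB
print(rx_buffer_memory(160) / 2**30, "GiB at 160 queues")       # 2.5 GiB
```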

The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular, workloads that do
a lot of small-packet processing but aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupts across
fewer VCPUs thanks to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).
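To make the batching intuition concrete, here is a toy model of my own
(an illustration, not measurements from the thread): if each interrupt is
followed by a poll that drains whatever arrived during a fixed service
window, then concentrating the same packet rate on fewer queues grows
the per-interrupt batch and shrinks total interrupt load. The window and
budget constants are assumptions for the example:

```python
# Toy model: NAPI-style polling amortizes packets per interrupt, so
# busier (fewer) queues take fewer total interrupts at the same PPS.

SERVICE_WINDOW_S = 50e-6   # assumed time a poll covers (50 us)
MAX_BATCH = 64             # assumed poll budget, packets per interrupt

def total_interrupts_per_sec(pps, num_queues):
    per_queue_pps = pps / num_queues
    # Packets coalesced per interrupt: what arrives in one window,
    # clamped between 1 packet and the poll budget.
    batch = min(MAX_BATCH, max(1.0, per_queue_pps * SERVICE_WINDOW_S))
    return pps / batch

# 2M small packets/s: spreading over 32 queues yields ~3-packet
# batches; concentrating on 4 queues yields ~25-packet batches.
print(round(total_interrupts_per_sec(2e6, 32)))  # more interrupts
print(round(total_interrupts_per_sec(2e6, 4)))   # far fewer
```

The model obviously ignores real scheduling and coalescing details, but
it captures why "busy" cores end up suppressing most interrupts.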

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI-related ethtool commands (based on a quick perusal
of ethtool.h) -- it seems it would be fairly involved/heavyweight
to plumb one solely for this, unless NAPI Tx is something many users
will want to tune (and which other drivers would support tuning).

Thread overview: 33+ messages
2018-07-23 23:11 [PATCH net-next] virtio_net: force_napi_tx module param Caleb Raitto
2018-07-24  0:53 ` Stephen Hemminger
2018-07-24  1:23   ` Willem de Bruijn
2018-07-24 10:42 ` Michael S. Tsirkin
2018-07-24 14:01   ` Willem de Bruijn
2018-07-24 18:38     ` Michael S. Tsirkin
2018-07-24 20:52       ` Willem de Bruijn
2018-07-24 22:23         ` Michael S. Tsirkin
2018-07-24 22:31           ` Willem de Bruijn
2018-07-24 22:46             ` Michael S. Tsirkin
2018-07-25  0:02               ` Willem de Bruijn
2018-07-25  0:17               ` Jon Olson [this message]
2018-07-30  6:06                 ` Jason Wang
2018-08-01 22:27                   ` Michael S. Tsirkin
2018-08-28 19:57                   ` Willem de Bruijn
2018-08-29  7:56                     ` Jason Wang
2018-08-29 13:01                       ` Willem de Bruijn
2018-09-09 23:07                         ` Willem de Bruijn
2018-09-10  5:59                           ` Jason Wang
2018-07-29 16:00 ` David Miller
2018-07-29 20:33   ` Michael S. Tsirkin
2018-07-29 20:36     ` David Miller
2018-07-29 21:09     ` Willem de Bruijn
2018-07-29 21:32   ` Willem de Bruijn
2018-07-31 12:34     ` Michael S. Tsirkin
2018-08-01 15:46       ` Willem de Bruijn
2018-08-01 15:56         ` Willem de Bruijn
2018-08-01 22:25         ` Michael S. Tsirkin
2018-08-01 22:43           ` Willem de Bruijn
2018-08-01 22:48             ` Michael S. Tsirkin
2018-08-01 23:33               ` Willem de Bruijn
2018-07-30 19:51   ` Stephen Hemminger
2018-07-31  1:41     ` Jason Wang
