From: Jason Wang <jasowang@redhat.com>
To: Jon Olson <jonolson@google.com>, mst@redhat.com
Cc: willemdebruijn.kernel@gmail.com, caleb.raitto@gmail.com,
	davem@davemloft.net, netdev@vger.kernel.org,
	Caleb Raitto <caraitto@google.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Mon, 30 Jul 2018 14:06:50 +0800
Message-ID: <706da951-ed30-85e8-c0aa-cb9ae8b3deb7@redhat.com>
In-Reply-To: <CAJoqh4UDJvFvfmFrzgHyk5pNXaQUqnLk7HtJzuxGoUzfZoy64Q@mail.gmail.com>



On 2018-07-25 08:17, Jon Olson wrote:
> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>> From the above linked patch, I understand that there are yet
>>>>> other special cases in production, such as a hard cap of 32 on the
>>>>> number of tx queues, regardless of the number of vcpus.
>>>> I don't think upstream kernels have this limit - we can
>>>> now use vmalloc for higher number of queues.
>>> Yes, that patch* mentioned it as a Google Compute Engine-imposed
>>> limit. It is exactly such cloud-provider-imposed rules that I'm
>>> concerned about working around in upstream drivers.
>>>
>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>> Yea. Why does GCE do it btw?
> There are a few reasons for the limit, some historical, some current.
>
> Historically we did this because of a kernel limit on the number of
> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> the limit upstream at the time we did it was actually eight. We had
> increased the limit from eight to 32 internally, and it appears
> upstream has subsequently increased it to 256. We no longer
> use TAP for networking, so that constraint no longer applies for us,
> but when looking at removing/raising the limit we discovered no
> workloads that clearly benefited from lifting it, and it also placed
> more pressure on our virtual networking stack particularly on the Tx
> side. We left it as-is.
>
> In terms of current reasons there are really two. One is memory usage.
> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> that the guest will have an Rx queue for every Tx queue. We run our
> individual virtqueues fairly deep (4096 entries) to give guests a wide
> time window for re-posting Rx buffers and avoiding starvation on
> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> be up to 512MB of memory posted for network buffers. Scaling this to
> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> all of the Rx rings full would (in the large average Rx packet size
> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> of RAM available, but I don't believe we've observed a situation where
> they would have benefited from having 2.5 gigs of buffers posted for
> incoming network traffic :)

We can work to have async txq and rxq instead of pairs if there's a 
strong requirement.
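
For reference, the arithmetic quoted above works out as below. This is
a quick standalone sketch using only the figures you quoted (4096-entry
vrings, 4096-byte mergeable buffers); nothing here is taken from the
driver source:

#include <stdio.h>

int main(void)
{
	const unsigned long entries  = 4096;  /* vring depth quoted above */
	const unsigned long bufsize  = 4096;  /* max mergeable buffer, bytes */
	const unsigned long per_ring = entries * bufsize;

	printf("per Rx ring: %lu MiB\n", per_ring >> 20);          /* 16 MiB  */
	printf("32 queues:   %lu MiB\n", (per_ring * 32) >> 20);   /* 512 MiB */
	printf("160 queues:  %lu MiB\n", (per_ring * 160) >> 20);  /* 2560 MiB, ~2.5 GiB */
	return 0;
}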

>
> The second reason is interrupt related -- as I mentioned above, we
> have found no workloads that clearly benefit from so many queues, but
> we have found workloads that degrade. In particular, workloads that do
> a lot of small-packet processing but which aren't extremely latency
> sensitive can achieve higher PPS by taking fewer interrupts across
> fewer VCPUs due to better batching (this also incurs higher latency,
> but at the limit the "busy" cores end up suppressing most interrupts
> and spending most of their cycles farming out work). Memcache is a
> good example here, particularly if the latency targets for request
> completion are in the ~milliseconds range (rather than the
> microseconds we typically strive for with TCP_RR-style workloads).
>
> All of that said, we haven't been forthcoming with data (and
> unfortunately I don't have it handy in a useful form, otherwise I'd
> simply post it here), so I understand the hesitation to simply run
> with napi_tx across the board. As Willem said, this patch seemed like
> the least disruptive way to allow us to continue down the road of
> "universal" NAPI Tx and to hopefully get data across enough workloads
> (with VMs small, large, and absurdly large :) to present a compelling
> argument in one direction or another. As far as I know there aren't
> currently any NAPI-related ethtool commands (based on a quick perusal
> of ethtool.h)

As I suggested before, maybe we can (ab)use tx-frames-irq.
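
For completeness, toggling such a knob from userspace would go through
the standard coalesce ioctls (the command-line equivalent being
"ethtool -C eth0 tx-frames-irq 1"). A minimal sketch follows; the
semantics (non-zero tx-frames-irq meaning tx napi on) are purely
hypothetical, and the ioctls will be rejected until the driver actually
implements {get,set}_coalesce:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1); /* placeholder ifname */
	ifr.ifr_data = (char *)&ec;

	memset(&ec, 0, sizeof(ec));
	ec.cmd = ETHTOOL_GCOALESCE;            /* read current settings */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GCOALESCE");
		return 1;
	}

	ec.cmd = ETHTOOL_SCOALESCE;
	ec.tx_max_coalesced_frames_irq = 1;    /* hypothetical: "tx napi on" */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_SCOALESCE");
		return 1;
	}

	close(fd);
	return 0;
}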

Thanks

> -- it seems like it would be fairly involved/heavyweight
> to plumb one solely for this unless NAPI Tx is something many users
> will want to tune (and for which other drivers would support tuning).
>
> --
> Jon Olson
