From: Jason Wang <jasowang@redhat.com>
To: Jon Olson <jonolson@google.com>, mst@redhat.com
Cc: willemdebruijn.kernel@gmail.com, caleb.raitto@gmail.com,
	davem@davemloft.net, netdev@vger.kernel.org,
	Caleb Raitto <caraitto@google.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Mon, 30 Jul 2018 14:06:50 +0800
Message-ID: <706da951-ed30-85e8-c0aa-cb9ae8b3deb7@redhat.com>
In-Reply-To: <CAJoqh4UDJvFvfmFrzgHyk5pNXaQUqnLk7HtJzuxGoUzfZoy64Q@mail.gmail.com>



On 2018-07-25 08:17, Jon Olson wrote:
> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>> From the above linked patch, I understand that there are yet
>>>>> other special cases in production, such as a hard cap of 32 on the
>>>>> number of tx queues, regardless of the number of vcpus.
>>>> I don't think upstream kernels have this limit - we can
>>>> now use vmalloc for higher number of queues.
>>> Yes, that patch* mentioned it as a Google Compute Engine-imposed
>>> limit. It is exactly such cloud-provider-imposed rules that I'm
>>> concerned about working around in upstream drivers.
>>>
>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>> Yea. Why does GCE do it btw?
> There are a few reasons for the limit, some historical, some current.
>
> Historically we did this because of a kernel limit on the number of
> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> the limit upstream at the time we did it was actually eight. We had
> increased the limit from eight to 32 internally, and it appears
> upstream has subsequently increased it to 256. We no longer
> use TAP for networking, so that constraint no longer applies for us,
> but when looking at removing/raising the limit we discovered no
> workloads that clearly benefited from lifting it, and it also placed
> more pressure on our virtual networking stack particularly on the Tx
> side. We left it as-is.
>
> In terms of current reasons there are really two. One is memory usage.
> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> that the guest will have an Rx queue for every Tx queue. We run our
> individual virtqueues fairly deep (4096 entries) to give guests a wide
> time window for re-posting Rx buffers and avoiding starvation on
> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> be up to 512MB of memory posted for network buffers. Scaling this to
> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> all of the Rx rings full would (in the large average Rx packet size
> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> of RAM available, but I don't believe we've observed a situation where
> they would have benefited from having 2.5 gigs of buffers posted for
> incoming network traffic :)

We can work to have async txq and rxq instead of pairs if there's a 
strong requirement.
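
For reference, the arithmetic quoted above works out as below. This is
a quick standalone sketch using only the figures you quoted (4096-entry
vrings, 4096-byte mergeable buffers); nothing here is taken from the
driver source:

#include <stdio.h>

int main(void)
{
	const unsigned long entries  = 4096;  /* vring depth quoted above */
	const unsigned long bufsize  = 4096;  /* max mergeable buffer, bytes */
	const unsigned long per_ring = entries * bufsize;

	printf("per Rx ring: %lu MiB\n", per_ring >> 20);          /* 16 MiB  */
	printf("32 queues:   %lu MiB\n", (per_ring * 32) >> 20);   /* 512 MiB */
	printf("160 queues:  %lu MiB\n", (per_ring * 160) >> 20);  /* 2560 MiB, ~2.5 GiB */
	return 0;
}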

>
> The second reason is interrupt related -- as I mentioned above, we
> have found no workloads that clearly benefit from so many queues, but
> we have found workloads that degrade. In particular, workloads that do
> a lot of small-packet processing but which aren't extremely latency
> sensitive can achieve higher PPS by taking fewer interrupts across
> fewer VCPUs due to better batching (this also incurs higher latency,
> but at the limit the "busy" cores end up suppressing most interrupts
> and spending most of their cycles farming out work). Memcache is a
> good example here, particularly if the latency targets for request
> completion are in the ~milliseconds range (rather than the
> microseconds we typically strive for with TCP_RR-style workloads).
>
> All of that said, we haven't been forthcoming with data (and
> unfortunately I don't have it handy in a useful form, otherwise I'd
> simply post it here), so I understand the hesitation to simply run
> with napi_tx across the board. As Willem said, this patch seemed like
> the least disruptive way to allow us to continue down the road of
> "universal" NAPI Tx and to hopefully get data across enough workloads
> (with VMs small, large, and absurdly large :) to present a compelling
> argument in one direction or another. As far as I know there aren't
> currently any NAPI-related ethtool commands (based on a quick perusal
> of ethtool.h)

As I suggested before, maybe we can (ab)use tx-frames-irq.
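
For completeness, toggling such a knob from userspace would go through
the standard coalesce ioctls (the command-line equivalent being
"ethtool -C eth0 tx-frames-irq 1"). A minimal sketch follows; the
semantics (non-zero tx-frames-irq meaning tx napi on) are purely
hypothetical, and the ioctls will be rejected until the driver actually
implements {get,set}_coalesce:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1); /* placeholder ifname */
	ifr.ifr_data = (char *)&ec;

	memset(&ec, 0, sizeof(ec));
	ec.cmd = ETHTOOL_GCOALESCE;            /* read current settings */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GCOALESCE");
		return 1;
	}

	ec.cmd = ETHTOOL_SCOALESCE;
	ec.tx_max_coalesced_frames_irq = 1;    /* hypothetical: "tx napi on" */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_SCOALESCE");
		return 1;
	}

	close(fd);
	return 0;
}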

Thanks

> -- it seems like it would be fairly involved/heavyweight
> to plumb one solely for this unless NAPI Tx is something many users
> will want to tune (and for which other drivers would support tuning).
>
> --
> Jon Olson
