From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: linux-kernel@vger.kernel.org, maxime.coquelin@redhat.com,
	tiwei.bie@intel.com, wexu@redhat.com, jfreimann@redhat.com,
	"David S. Miller" <davem@davemloft.net>,
	virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org
Subject: Re: [PATCH RFC 1/2] virtio-net: bql support
Date: Mon, 7 Jan 2019 11:51:55 +0800	[thread overview]
Message-ID: <aea2fd16-ec5b-64b5-2095-9a37044223f6@redhat.com> (raw)
In-Reply-To: <20190106221506-mutt-send-email-mst@kernel.org>


On 2019/1/7 11:17 AM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 10:14:37AM +0800, Jason Wang wrote:
>> On 2019/1/2 9:59 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 02, 2019 at 11:28:43AM +0800, Jason Wang wrote:
>>>> On 2018/12/31 2:45 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 27, 2018 at 06:00:36PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/26 11:19 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/6 6:54 AM, Michael S. Tsirkin wrote:
>>>>>>>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>>>>>>>> similar to wifi.  It's worth considering whether something similar to
>>>>>>>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>>>>>>>> beneficial.
>>>>>>>> I played with a similar patch several days ago. The tricky part is the mode
>>>>>>>> switching between napi and no napi. We should make sure that when a packet
>>>>>>>> is sent and tracked by BQL, it is also consumed by BQL. I did that by
>>>>>>>> tracking it through skb->cb, and dealt with the freeze by resetting the BQL
>>>>>>>> status. Patch attached.
>>>>>>>>
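[The invariant described above, that only packets charged to BQL at transmit time may be credited back at completion, can be sketched with a toy model. This is illustrative Python, not the attached patch; the Skb and TxQueue names and the "bql" cb flag are invented for the example:]

```python
# Toy model of BQL accounting across a napi/no-napi mode switch:
# only packets that were charged to BQL when sent may be credited
# back when completed, tracked via a per-packet flag (like skb->cb).

class Skb:
    def __init__(self, length):
        self.len = length
        self.cb = {}            # per-packet scratch area, like skb->cb

class TxQueue:
    def __init__(self):
        self.in_flight = 0      # bytes BQL currently accounts for

    def xmit(self, skb, napi_mode):
        if napi_mode:                   # BQL is only active with napi
            skb.cb["bql"] = True        # mark: this skb was counted
            self.in_flight += skb.len
        # ... hand the descriptor to the device ...

    def complete(self, skb):
        if skb.cb.get("bql"):           # credit only tracked skbs
            self.in_flight -= skb.len

q = TxQueue()
a, b = Skb(1500), Skb(1500)
q.xmit(a, napi_mode=True)
q.xmit(b, napi_mode=False)   # mode switched: b was never charged
q.complete(a)
q.complete(b)
assert q.in_flight == 0      # counters stay balanced across the switch
```

[Without the per-packet flag, completing b after the switch would underflow the counter, which is exactly the kind of imbalance the reset-on-freeze handling guards against.]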
>>>>>>>> But when testing with vhost-net, I didn't see a stable performance;
>>>>>>> So how about increasing TSQ pacing shift then?
>>>>>> I can test this. But changing default TCP value is much more than a
>>>>>> virtio-net specific thing.
>>>>> Well same logic as wifi applies. Unpredictable latencies related
>>>>> to radio in one case, to host scheduler in the other.
>>>>>
>>>>>>>> it was
>>>>>>>> probably because we batch the used ring updates, so tx interrupts may come
>>>>>>>> randomly. We probably need to implement a time-bounded coalescing mechanism
>>>>>>>> which could be configured from userspace.
>>>>>>> I don't think it's reasonable to expect userspace to be that smart ...
>>>>>>> Why does it need to be time bounded? The used ring is always updated when
>>>>>>> the ring becomes empty.
>>>>>> We don't add used buffers immediately, which means BQL may not see the
>>>>>> consumed packets in time. And the delay varies with the workload, since we
>>>>>> count packets, not bytes or time, before doing the batched update.
>>>>>>
>>>>>> Thanks
>>>>> Sorry I still don't get it.
>>>>> When nothing is outstanding then we do update the used.
>>>>> So if BQL stops userspace from sending packets then
>>>>> we get an interrupt and packets start flowing again.
>>>> Yes, but what about the case of multiple flows? That's where I see unstable
>>>> results.
>>>>
>>>>
>>>>> It might be suboptimal, we might need to tune it but I doubt running
>>>>> timers is a solution, timer interrupts cause VM exits.
>>>> Probably not a timer, but a time counter (or even a byte counter) in vhost
>>>> to add used buffers and signal the guest once it exceeds a threshold,
>>>> instead of waiting for a number of packets.
>>>>
>>>>
>>>> Thanks
>>> Well we already have VHOST_NET_WEIGHT - is it too big then?
>>
>> I'm not sure, it might be too big.
>>
>>
>>> And maybe we should expose the "MORE" flag in the descriptor -
>>> do you think that will help?
>>>
>> I don't know. But how would a "more" flag help here?
>>
>> Thanks
> It sounds like we should be a bit more aggressive in updating the used ring.
> But if we just do it naively, we will surely harm performance, as that
> is how we are doing batching right now.


I agree, but the problem is balancing PPS and throughput. More 
batching helps PPS but may hurt TCP throughput.
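[The tradeoff can be illustrated with a toy calculation: the larger the completion batch, the more packets sit throttled behind BQL's in-flight byte budget, waiting for the deferred used-ring update. The numbers below are invented for illustration, not virtio-net's actual tunables:]

```python
# Toy model: the device reports used buffers only once per batch,
# while BQL caps the bytes that may be in flight unacknowledged.
# Packets beyond the cap stall until the deferred completion arrives.

PKT_BYTES = 1500           # bytes per packet (assumed)
BQL_LIMIT = 64 * 1024      # in-flight byte budget (assumed)

def delayed_packets(batch):
    """Packets per batch that stall behind the BQL limit, waiting
    for the batched used-ring update."""
    unthrottled = BQL_LIMIT // PKT_BYTES   # packets BQL lets through
    return max(0, batch - unthrottled)

print(delayed_packets(1))    # -> 0: per-packet completions, no stall
print(delayed_packets(64))   # -> 21: the tail of a big batch stalls
```

[Small batches keep BQL's feedback loop tight (good for TCP), while large batches amortize update cost (good for PPS), which is the balance described above.]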


>   Instead we could make guest
> control batching using the more flag - if that's not set we write out
> the used ring.


It's under the control of the guest, so I'm afraid we still need some 
additional guard (e.g. time/byte counters) on the host.
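[One possible shape for such a host-side guard, sketched in Python rather than vhost's C, with all thresholds invented for illustration: flush the batch when a packet-count, byte-count, or elapsed-time bound is crossed, whichever happens first.]

```python
import time

class UsedRingBatcher:
    """Illustrative host-side guard: flush the used ring once a packet
    count, byte count, OR elapsed-time bound is exceeded, whichever
    comes first. Thresholds are made-up tunables, not vhost's."""

    def __init__(self, max_pkts=64, max_bytes=64 * 1024, max_delay_s=0.001):
        self.max_pkts = max_pkts
        self.max_bytes = max_bytes
        self.max_delay_s = max_delay_s
        self._reset()

    def _reset(self):
        self.pkts = 0
        self.bytes = 0
        self.first_ts = None     # timestamp of the batch's first packet

    def add_used(self, pkt_len, now=None):
        """Record one consumed packet; return True if the batch should
        be flushed (used index updated and the guest signalled)."""
        now = time.monotonic() if now is None else now
        if self.first_ts is None:
            self.first_ts = now
        self.pkts += 1
        self.bytes += pkt_len
        if (self.pkts >= self.max_pkts
                or self.bytes >= self.max_bytes
                or now - self.first_ts >= self.max_delay_s):
            self._reset()
            return True
        return False
```

[A byte bound makes a burst of large TSO packets flush early, and the time bound caps how long BQL can be starved of completions under light load.]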

Thanks


>


