From: "Michael S. Tsirkin"
Subject: Re: thoughts stac/clac and get user for vhost
Date: Mon, 7 Jan 2019 00:42:43 -0500
Message-ID: <20190107003500-mutt-send-email-mst@kernel.org>
References: <20181225113301-mutt-send-email-mst@kernel.org>
 <20181226100341-mutt-send-email-mst@kernel.org>
 <042e0002-0dce-42e4-8694-4f3fa96c3975@redhat.com>
 <20181230133226-mutt-send-email-mst@kernel.org>
 <3a58f172-f36f-044d-f8ac-8e24b2dc61a5@redhat.com>
 <20190104162014-mutt-send-email-mst@kernel.org>
To: Jason Wang
Cc: netdev@vger.kernel.org

On Mon, Jan 07, 2019 at 12:26:51PM +0800, Jason Wang wrote:
>
> On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
> > On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
> > > On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> > > > > On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > > > > > Hi!
> > > > > > > > I was just wondering: packed ring batches things naturally.
> > > > > > > > E.g.
> > > > > > > >
> > > > > > > > user_access_begin
> > > > > > > > check descriptor valid
> > > > > > > > smp_rmb
> > > > > > > > copy descriptor
> > > > > > > > user_access_end
> > > > > > >
> > > > > > > But without speculation on the descriptor (which may only work for
> > > > > > > in-order, or may even be a violation of the spec), only the two
> > > > > > > accesses of a single descriptor could be batched. For split ring, we
> > > > > > > can batch more since we know how many descriptors are pending
> > > > > > > (avail_idx - last_avail_idx).
> > > > > > >
> > > > > > > Anything I miss?
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > just check more descriptors in a loop:
> > > > > >
> > > > > > user_access_begin
> > > > > > for (i = 0; i < 16; ++i) {
> > > > > >     if (!descriptor valid)
> > > > > >         break;
> > > > > >     smp_rmb
> > > > > >     copy descriptor
> > > > > > }
> > > > > > user_access_end
> > > > > >
> > > > > > you don't really need to know how many there are ahead of time
> > > > > > as you still copy them one by one.
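To make the shape of that concrete, here is a rough sketch (untested, and
not what vhost actually does today) of such a batched packed ring read.
desc_is_avail() is a made-up helper that checks the AVAIL/USED flag bits
against the ring's wrap counter, the batch size of 16 is arbitrary, and the
exact user_access_begin() signature differs across kernel versions;
endianness conversion and ring wrap-around are ignored for brevity:

/*
 * Hypothetical sketch only: copy up to 16 packed descriptors inside a
 * single user_access_begin()/user_access_end() window, i.e. one
 * stac/clac pair on x86 instead of one per descriptor.
 */
static int fetch_descs_batched(struct vring_packed_desc __user *udesc,
			       struct vring_packed_desc *desc,
			       bool wrap_counter)
{
	int i;

	if (!user_access_begin(udesc, 16 * sizeof(*udesc)))
		return -EFAULT;

	for (i = 0; i < 16; ++i) {
		/* Check the availability flags first ... */
		unsafe_get_user(desc[i].flags, &udesc[i].flags, efault);
		if (!desc_is_avail(desc[i].flags, wrap_counter))
			break;
		/* ... and only then read the rest of the descriptor. */
		smp_rmb();
		unsafe_get_user(desc[i].addr, &udesc[i].addr, efault);
		unsafe_get_user(desc[i].len, &udesc[i].len, efault);
		unsafe_get_user(desc[i].id, &udesc[i].id, efault);
	}
	user_access_end();
	return i;	/* number of descriptors copied */

efault:
	user_access_end();
	return -EFAULT;
}

The point being that the SMAP/speculation overhead is paid once per batch
rather than once per descriptor.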
> > > > > So let's see the case of split ring:
> > > > >
> > > > > user_access_begin
> > > > >
> > > > > n = avail_idx - last_avail_idx (1)
> > > > >
> > > > > n = MIN(n, 16)
> > > > >
> > > > > smp_rmb
> > > > >
> > > > > read n entries from avail_ring (2)
> > > > >
> > > > > for (i = 0; i < n; i++)
> > > > >     copy descriptor (3)
> > > > >
> > > > > user_access_end
> > > > >
> > > > > Consider the case of a heavy workload. So for packed ring, we have 32
> > > > > userspace accesses and 16 smp_rmb() calls.
> > > > >
> > > > > For split ring we have
> > > > >
> > > > > (1) 1 time
> > > > >
> > > > > (2) 2 times at most
> > > > >
> > > > > (3) 16 times
> > > > >
> > > > > That is 19 userspace accesses and 1 smp_rmb(). In fact (2) could be
> > > > > eliminated with in-order, and (3) could be batched completely with
> > > > > in-order and partially when out of order.
> > > > >
> > > > > I don't see how packed ring helps here, especially considering that
> > > > > lfence on x86 is more than a memory fence: it prevents speculation
> > > > > in fact.
> > > > >
> > > > > Thanks
> > > >
> > > > So on x86 at least RMB is free, this is why I never bothered optimizing
> > > > it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> > > > more than the extra indirection in the split ring?
> > >
> > > I don't know, but obviously RMB has a chance to hurt performance more
> > > or less. But even on an arch where the RMB is free, packed ring still
> > > does not show an obvious advantage.
> >
> > People do measure gains with a PMD on host+guest.
> > So it's a question of optimizing the packed ring implementation in Linux.
>
> Well, a 2%-3% difference is not quite a lot.

People reported a 10% gain with tiny packets, others reported more.
Again, packed ring is sometimes faster by a factor of 3x, but virtio is
just virtio: there's a lot going on besides just passing the buffer
addresses guest to host, and a different ring layout won't help with that.

> I think it's not hard to make split ring faster with some small
> optimizations to the code itself.
>
> Thanks

Speed up the split ring support in the virtio PMD in DPDK? There have been
several people working on that for a while now. It seems more likely that
we can speed up the newer packed ring code. E.g. things like prefetch have
a much better chance to work well with the packed layout; with the split
one it was a wash IIRC.

> > > >
> > > > But my point was really fundamental - if ring accesses are expensive
> > > > then we should batch them.
> > >
> > > I don't object to the batching; the reasons they are expensive could be:
> > >
> > > 1) unnecessary overhead caused by speculation barriers and checks like SMAP
> > >
> > > 2) cache contention
> > >
> > > So it does not conflict with the effort that I did to remove 1). My plan
> > > is: for metadata, try to eliminate 1) completely. For data, we can do
> > > batch copying to amortize its cost. For avail/descriptor batching, we
> > > can try it on top.
> > >
> > > > Right now we have an API that gets
> > > > an iovec directly. That limits the optimizations you can do.
> > > >
> > > > The translation works like this:
> > > >
> > > > ring -> valid descriptors -> iovecs
> > > >
> > > > We should have APIs for each step that work in batches.
> > >
> > > Yes.
> > >
> > > Thanks
> > >
> > > > > > > > So packed layout should show the gain with this approach.
> > > > > > > > That could be motivation enough to finally enable vhost packed
> > > > > > > > ring support.
> > > > > > > >
> > > > > > > > Thoughts?
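Re the batched APIs for each translation step above: something along these
lines is the kind of thing I mean. A rough sketch only - the names,
signatures and the batch structure are made up for illustration, nothing
like this exists in vhost today:

/* Hypothetical batch of descriptors fetched from the ring in one go. */
struct vhost_desc_batch {
	struct vring_packed_desc descs[16];
	unsigned int count;
};

/*
 * Step 1: ring -> valid descriptors.  Does all the userspace accesses and
 * validity checks inside one user_access_begin()/user_access_end() window
 * and fills the batch.
 */
int vhost_fetch_descs(struct vhost_virtqueue *vq,
		      struct vhost_desc_batch *batch, unsigned int num);

/*
 * Step 2: valid descriptors -> iovecs.  Pure translation, no userspace
 * access, so it can run outside the stac/clac window.
 */
int vhost_descs_to_iov(struct vhost_virtqueue *vq,
		       const struct vhost_desc_batch *batch,
		       struct iovec *iov, unsigned int iov_count);

That way the user access window and any barriers are paid once per batch of
descriptors rather than once per descriptor, for either ring layout.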