From: Jason Wang
To: "Michael S. Tsirkin"
Cc: linux-kernel@vger.kernel.org, brouer@redhat.com
Subject: Re: [PATCH 1/3] ptr_ring: batch ring zeroing
Date: Tue, 18 Apr 2017 10:16:57 +0800
Message-ID: <34324c4c-54f9-0323-def3-3c2f7c6ead74@redhat.com>
In-Reply-To: <20170414235732-mutt-send-email-mst@kernel.org>
References: <1491544049-19108-1-git-send-email-mst@redhat.com> <19f56d99-279a-5a8b-39a7-1017a3cb4bdd@redhat.com> <97915bd0-ef22-4e67-1b18-3522e2369fda@redhat.com> <20170414235732-mutt-send-email-mst@kernel.org>

On 2017-04-15 05:00, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
>>
>> On 2017-04-12 16:03, Jason Wang wrote:
>>>
>>> On 2017-04-07 13:49, Michael S. Tsirkin wrote:
>>>> A known weakness in the ptr_ring design is that it does not handle
>>>> the almost-full case well: as entries are consumed they are
>>>> immediately reused by the producer, so consumer and producer end up
>>>> writing to a shared cache line.
>>>>
>>>> To fix this, add batching to consume calls: as entries are consumed,
>>>> do not write NULL into the ring until we get a multiple (in the
>>>> current implementation, 2x) of cache lines away from the producer.
>>>> At that point, write them all out.
>>>>
>>>> We do the writeout in reverse order to keep the producer from
>>>> sharing a cache line with the consumer for as long as possible.
>>>>
>>>> Writeout also triggers when the ring wraps around - there is no
>>>> special reason to do this, but it helps keep the code a bit simpler.
>>>>
>>>> What should we do if getting away from the producer by 2 cache lines
>>>> would mean we are keeping the ring more than half empty? Maybe we
>>>> should shrink the batch in that case; the current patch simply
>>>> reduces the batching.
>>>>
>>>> Notes:
>>>> - It is no longer true that a call to consume guarantees that the
>>>>   following call to produce will succeed. No users seem to assume
>>>>   that.
>>>> - Batching can also, in theory, reduce the signalling rate: users
>>>>   that would previously send interrupts to the producer to wake it
>>>>   up after consuming each entry would now only need to do this once
>>>>   per batch. This would be easy to implement by returning a flag to
>>>>   the caller. No users seem to do signalling on consume yet, so this
>>>>   was not implemented.
>>>>
>>>> Signed-off-by: Michael S. Tsirkin
>>>> ---
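A rough C sketch of the batched zeroing described in the commit log
above (an illustration only, not the patch itself; the struct layout
and the names ring_sketch, consume_one(), consumer_head, consumer_tail
and batch are assumptions made for this sketch):

#include <stddef.h>

/*
 * Sketch of deferred, reverse-order NULL writeout by the consumer.
 * Names and layout are assumed for illustration.
 */
struct ring_sketch {
	int size;		/* number of slots in queue[] */
	int batch;		/* slots spanning ~2 cache lines */
	int consumer_head;	/* next slot to consume */
	int consumer_tail;	/* oldest consumed, not-yet-zeroed slot */
	void **queue;
};

/* Called after the entry at consumer_head has been read out. */
static void consume_one(struct ring_sketch *r)
{
	int head = r->consumer_head++;

	/*
	 * Defer writing NULL until a whole batch has been consumed
	 * (or the ring wraps), so the consumer stops dirtying the
	 * cache line the producer is about to reuse.
	 */
	if (r->consumer_head - r->consumer_tail >= r->batch ||
	    r->consumer_head >= r->size) {
		/*
		 * Zero in reverse order: the slot closest to the
		 * producer is written last, keeping producer and
		 * consumer out of each other's cache lines for as
		 * long as possible.
		 */
		while (head >= r->consumer_tail)
			r->queue[head--] = NULL;
		r->consumer_tail = r->consumer_head;
	}
	if (r->consumer_head >= r->size)
		r->consumer_head = r->consumer_tail = 0;
}

The flush on wrap mirrors the simplification noted in the commit log:
it avoids having a batch straddle the end of the array.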
>>>> Jason, I am curious whether the following gives you some of the
>>>> performance boost that you see with the vhost batching patches. Is
>>>> vhost batching on top still helpful?
>>>
>>> The patch looks good to me; I will run a test with the vhost batching
>>> patches.
>>>
>>> Thanks
>>
>> Still helpful:
>>
>> before this patch:    1.84 Mpps
>> with this patch:      2.00 Mpps
>> with batch dequeuing: 2.30 Mpps
>>
>> Acked-by: Jason Wang
>>
>> Thanks
>
> Fascinating. How do we explain the gain with batch dequeue?

I counted the drop rate (pktgen on tun, counting tun TX), which may
explain it, more or less:

Before this patch:
  TX xmit:    1.8 Mpps
  TX dropped: 0.23 Mpps (11%)
  TX total:   2.04 Mpps

After this patch:
  TX xmit:    1.95 Mpps
  TX dropped: 0.33 Mpps (14%)
  TX total:   2.28 Mpps

With batched dequeuing:
  TX xmit:    2.24 Mpps
  TX dropped: 0.01 Mpps (~0%)
  TX total:   2.26 Mpps

With this patch, we more or less remove the cache contention by
blocking the producer. With batched dequeuing, the ring is not full in
99% of cases, which probably means the producer is rarely blocked.

> Is it just the lock overhead?

I removed the spinlocks for peeking and dequeuing on top of this patch;
TX pps increased from ~2 Mpps to ~2.08 Mpps. So it is not only the lock
overhead.

> Can you pls try to replace
> the lock with a simple non-fair atomic and see what happens?
>

I'm not sure I get the idea: we hit the fast path of the spinlock for
sure, which is just a cmpxchg().

Thanks
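A minimal userspace sketch of that last point, assuming a C11
test-and-set lock as one possible reading of the "simple non-fair
atomic" suggestion (kernel spinlocks differ in detail, but the
uncontended acquire is likewise a single atomic read-modify-write):

#include <stdatomic.h>

static atomic_int lock_word;	/* 0 = unlocked, 1 = locked */

static void unfair_lock(void)
{
	int expected = 0;

	/* Fast path: one successful compare-and-swap takes the lock. */
	while (!atomic_compare_exchange_weak_explicit(&lock_word,
						      &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = 0;	/* contended: spin and retry */
}

static void unfair_unlock(void)
{
	atomic_store_explicit(&lock_word, 0, memory_order_release);
}

On the uncontended path this is one cmpxchg, which is also what a
typical spinlock acquire boils down to, so swapping one for the other
would not be expected to change the fast path much.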