Re: [PATCH v3 2/5] util: introduce threaded workqueue

From: Xiao Guangrong <guangrong.xiao@gmail.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Christophe de Dinechin <dinechin@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Xiao Guangrong <xiaoguangrong@tencent.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
	Wei Wang <wei.w.wang@intel.com>,
	"Emilio G. Cota" <cota@braap.org>,
	jiang.biao2@zte.com.cn
Subject: Re: [PATCH v3 2/5] util: introduce threaded workqueue
Date: Mon, 10 Dec 2018 11:23:34 +0800	[thread overview]
Message-ID: <bf963363-7aaf-2be6-74cb-ca964c46ad73@gmail.com> (raw)
In-Reply-To: <a228d699-f7fc-3e32-b750-19225f95d8db@redhat.com>

On 12/5/18 1:16 AM, Paolo Bonzini wrote:
> On 04/12/18 16:49, Christophe de Dinechin wrote:
>>>   Linux and QEMU's own qht work just fine with compile-time directives.
>>
>> Wouldn’t it work fine without any compile-time directive at all?
> 
> Yes, that's what I meant.  Though there are certainly cases in which the
> difference without proper cacheline alignment is an order of magnitude
> less throughput or something like that; it would certainly be noticeable.
> 
>>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>>> elements are both faster and easier to manage.
>>
>> I believe that this is only true if you use a linked list for both freelist
>> management and for thread notification (i.e. to replace the bitmaps).
>> However, if you use an atomic list only for the free list, and keep
>> bitmaps for signaling, then performance is at least equal, often better.
>> Plus you get the added benefit of having a thread-safe API, i.e.
>> something that is truly lock-free.
>>
>> I did a small experiment to test / prove this. Last commit on branch:
>> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
>> Take with a grain of salt, microbenchmarks are always suspect ;-)
>>
>> The code in “thread_test.c” includes Xiao’s code with two variations,
>> plus some testing code lifted from the flight recorder library.
>> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
>> 2. The BITMAP variation (bm_test) is the baseline
>> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
>>
>> To run it, you need to do “make opt-test”, then run “test_script”
>> which outputs a CSV file. The summary of my findings testing on
>> a ThinkPad, a Xeon machine and a MacBook is here:
>> https://imgur.com/a/4HmbB9K
>>
>> Overall, the proposed approach:
>>
>> - makes the API thread safe and lock free, addressing the one
>> drawback that Xiao was mentioning.
>>
>> - delivers up to 30% more requests on the Macbook, while being
>> “within noise” (sometimes marginally better) for the other two.
>> I suspect an optimization opportunity found by clang, because
>> the Macbook delivers really high numbers.
>>
>> - spends less time blocking when all threads are busy, which
>> accounts for the higher number of client loops.
>>
>> If you think that makes sense, then either Xiao can adapt the code
>> from the branch above, or I can send a follow-up patch.
> 
> Having a follow-up patch would be best I think.  Thanks for
> experimenting with this, it's always fun stuff. :)
> 

Yup, Christophe, please post the follow-up patches and add yourself
to the author list if you like. I am looking forward to it. :)

Thanks!