From: Xiao Guangrong <guangrong.xiao@gmail.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Christophe de Dinechin <dinechin@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Xiao Guangrong <xiaoguangrong@tencent.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
	Wei Wang <wei.w.wang@intel.com>,
	"Emilio G. Cota" <cota@braap.org>,
	jiang.biao2@zte.com.cn
Subject: Re: [PATCH v3 2/5] util: introduce threaded workqueue
Date: Mon, 10 Dec 2018 11:23:34 +0800
Message-ID: <bf963363-7aaf-2be6-74cb-ca964c46ad73@gmail.com>
In-Reply-To: <a228d699-f7fc-3e32-b750-19225f95d8db@redhat.com>



On 12/5/18 1:16 AM, Paolo Bonzini wrote:
> On 04/12/18 16:49, Christophe de Dinechin wrote:
>>>   Linux and QEMU's own qht work just fine with compile-time directives.
>>
>> Wouldn’t it work fine without any compile-time directive at all?
> 
> Yes, that's what I meant.  Though there are certainly cases in which the
> difference without proper cacheline alignment is an order of magnitude
> less throughput or something like that; it would certainly be noticeable.
> 
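For readers following the alignment point, here is a minimal sketch of
what such a compile-time directive can look like in C11. It is not taken
from the patch series or from qht; the names WorkerStat, worker_stat_bump
and worker_stat_total are hypothetical, and the 64-byte line size is an
assumption that does not hold on every CPU.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64   /* assumption: typical for x86-64, not universal */

/*
 * Hypothetical per-worker counter. The alignment specifier on the member
 * raises the alignment of the whole struct, so the compiler pads each
 * array element to a full cache line and neighbours never share one.
 */
typedef struct {
    _Alignas(CACHELINE_SIZE) _Atomic uint64_t done;
} WorkerStat;

/* Called by the worker that owns the counter. */
static void worker_stat_bump(WorkerStat *s)
{
    atomic_fetch_add_explicit(&s->done, 1, memory_order_relaxed);
}

/* Called by any thread that wants the total across all workers. */
static uint64_t worker_stat_total(WorkerStat *stats, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += atomic_load_explicit(&stats[i].done, memory_order_relaxed);
    }
    return sum;
}
```

Dropping the _Alignas specifier leaves the code correct; the counters
would simply be packed eight to a line and could suffer false sharing.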
>>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>>> elements are both faster and easier to manage.
>>
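To illustrate why a bitmap of at most 64 entries is easy to manage, here
is a minimal generic sketch: the whole state fits in one 64-bit word, so
claiming and releasing a slot is a single atomic operation on that word.
This is not the scheme used in the patch; SlotBitmap and the
slot_bitmap_* functions are hypothetical names, and __builtin_ctzll
assumes GCC or Clang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical type: bit i set means slot i is free. */
typedef struct {
    _Atomic uint64_t free_map;
} SlotBitmap;

/* Mark slots 0..nslots-1 as free. */
static void slot_bitmap_init(SlotBitmap *b, unsigned nslots)
{
    assert(nslots >= 1 && nslots <= 64);
    uint64_t all = (nslots == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << nslots) - 1);
    atomic_init(&b->free_map, all);
}

/* Claim the lowest free slot; returns its index, or -1 if none is free. */
static int slot_bitmap_get(SlotBitmap *b)
{
    uint64_t old = atomic_load(&b->free_map);
    while (old != 0) {
        int idx = __builtin_ctzll(old);      /* index of lowest set bit */
        uint64_t next = old & (old - 1);     /* same word, that bit cleared */
        if (atomic_compare_exchange_weak(&b->free_map, &old, next)) {
            return idx;
        }
        /* A failed exchange reloaded 'old'; retry with the fresh value. */
    }
    return -1;
}

/* Return a previously claimed slot to the pool. */
static void slot_bitmap_put(SlotBitmap *b, int idx)
{
    assert(idx >= 0 && idx < 64);
    atomic_fetch_or(&b->free_map, UINT64_C(1) << idx);
}
```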
>> I believe that this is only true if you use a linked list for both freelist
>> management and for thread notification (i.e. to replace the bitmaps).
>> However, if you use an atomic list only for the free list, and keep
>> bitmaps for signaling, then performance is at least equal, often better.
>> Plus you get the added benefit of having a thread-safe API, i.e.
>> something that is truly lock-free.
>>
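The atomic free list mentioned above can be sketched as a lock-free LIFO
of slot indices. The following is a generic illustration under stated
assumptions, not the code from the linked branch: FreeList, fl_init,
fl_pop and fl_push are hypothetical names, and the head carries a 32-bit
tag next to the index so that a compare-and-swap cannot be fooled when
the same slot is popped and pushed back in between (the ABA problem).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FL_MAX 64u          /* hypothetical capacity */
#define FL_NIL UINT32_MAX   /* marks the end of the list */

typedef struct {
    _Atomic uint64_t head;          /* (tag << 32) | index of first free slot */
    _Atomic uint32_t next[FL_MAX];  /* next[i]: free slot after i, or FL_NIL */
} FreeList;

static inline uint64_t fl_pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

/* Chain slots 0..n-1 together, with slot 0 at the head. */
static void fl_init(FreeList *fl, uint32_t n)
{
    assert(n >= 1 && n <= FL_MAX);
    for (uint32_t i = 0; i < n; i++) {
        atomic_init(&fl->next[i], i + 1 < n ? i + 1 : FL_NIL);
    }
    atomic_init(&fl->head, fl_pack(0, 0));
}

/* Take a free slot; returns its index, or -1 if the list is empty. */
static int fl_pop(FreeList *fl)
{
    uint64_t old = atomic_load(&fl->head);
    for (;;) {
        uint32_t idx = (uint32_t)old;
        if (idx == FL_NIL) {
            return -1;
        }
        uint32_t nxt = atomic_load(&fl->next[idx]);
        uint64_t new_head = fl_pack((uint32_t)(old >> 32) + 1, nxt);
        if (atomic_compare_exchange_weak(&fl->head, &old, new_head)) {
            return (int)idx;
        }
        /* 'old' now holds the current head; retry. */
    }
}

/* Give a slot back; safe to call from any thread. */
static void fl_push(FreeList *fl, uint32_t idx)
{
    uint64_t old = atomic_load(&fl->head);
    for (;;) {
        atomic_store(&fl->next[idx], (uint32_t)old);
        uint64_t new_head = fl_pack((uint32_t)(old >> 32) + 1, idx);
        if (atomic_compare_exchange_weak(&fl->head, &old, new_head)) {
            return;
        }
    }
}
```

Because both operations are compare-and-swap loops on a single word, any
number of threads can pop and push concurrently, which is the property
the bitmaps-only variant does not give to submitters.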
>> I did a small experiment to test / prove this. Last commit on branch:
>> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
>> Take with a grain of salt, microbenchmarks are always suspect ;-)
>>
>> The code in “thread_test.c” includes Xiao’s code with two variations,
>> plus some testing code lifted from the flight recorder library.
>> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
>> 2. The BITMAP variation (bm_test) is the baseline
>> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
>>
>> To run it, you need to do “make opt-test”, then run “test_script”
>> which outputs a CSV file. The summary of my findings testing on
>> a ThinkPad, a Xeon machine and a MacBook is here:
>> https://imgur.com/a/4HmbB9K
>>
>> Overall, the proposed approach:
>>
>> - makes the API thread-safe and lock-free, addressing the one
>> drawback that Xiao was mentioning.
>>
>> - delivers up to 30% more requests on the MacBook, while being
>> “within noise” (sometimes marginally better) for the other two.
>> I suspect an optimization opportunity found by clang, because
>> the MacBook delivers really high numbers.
>>
>> - spends less time blocking when all threads are busy, which
>> accounts for the higher number of client loops.
>>
>> If you think that makes sense, then either Xiao can adapt the code
>> from the branch above, or I can send a follow-up patch.
> 
> Having a follow-up patch would be best I think.  Thanks for
> experimenting with this, it's always fun stuff. :)
> 

Yup, Christophe, please post the follow-up patches and add yourself
to the author list if you like. I am looking forward to it. :)

Thanks!
