From: "Emilio G. Cota" <cota@braap.org>
To: Xiao Guangrong <guangrong.xiao@gmail.com>
Cc: kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
	Xiao Guangrong <xiaoguangrong@tencent.com>,
	dgilbert@redhat.com, peterx@redhat.com, qemu-devel@nongnu.org,
	quintela@redhat.com, wei.w.wang@intel.com,
	jiang.biao2@zte.com.cn, pbonzini@redhat.com
Subject: Re: [PATCH v2 2/5] util: introduce threaded workqueue
Date: Tue, 20 Nov 2018 11:33:15 -0500
Message-ID: <20181120163315.GA32388@flamenco>
In-Reply-To: <35f0eb1a-3ec9-344b-7dda-d766c1b79626@gmail.com>

On Tue, Nov 20, 2018 at 18:25:25 +0800, Xiao Guangrong wrote:
> On 11/14/18 2:38 AM, Emilio G. Cota wrote:
> > On Tue, Nov 06, 2018 at 20:20:22 +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
(snip)
> > Batching achieves higher performance at high core counts (>8),
> > since worker threads go through fewer sleep+wake cycles while
> > waiting for jobs. Performance however diminishes as more threads
> > are used (>=24) due to cache line contention.
> > 
> > Contention can be avoided by partitioning the request array, bitmaps
> > and completion event to be entirely local to worker threads
> > ("+per-thread-state"). This avoids the performance decrease at >=24
> > threads that we observe with batching alone. Note that the overall
> > speedup does not go beyond ~8; this is explained by the fact that we
> > use a single requester thread. Thus, as we add more worker threads,
> > they become increasingly less busy, yet throughput remains roughly
> > constant. I say roughly because there's quite a bit of noise--this
> > is a 4-socket machine and I'm letting the scheduler move threads
> > around, although I'm limiting the cores that can be used with
> > taskset to maximize locality (this means that for 15 threads we're
> > using 16 host cores that are all in the same socket; note that
> > the additional thread is the requester one).
> 
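(For concreteness, the "+per-thread-state" layout boils down to something
like the sketch below. The names, sizes and field types are illustrative
rather than the actual patch API, with plain ints standing in for the real
event primitives; the point is that every field a worker touches lives in
its own cache lines:)

    #include <stdint.h>
    #include <stdalign.h>

    #define CACHE_LINE 64
    #define REQS_PER_THREAD 4

    /* Illustrative per-worker state: nothing here is shared with other
     * workers, so requester/worker traffic never invalidates a different
     * worker's cache lines. */
    typedef struct PerThreadState {
        /* written only by the requester thread */
        alignas(CACHE_LINE) uint64_t request_fill_bitmap;
        /* written only by the owning worker */
        alignas(CACHE_LINE) uint64_t request_done_bitmap;
        /* wake-up and completion flags, each on their own cache line */
        alignas(CACHE_LINE) int request_valid_ev;
        alignas(CACHE_LINE) int request_free_ev;
        /* the request slots themselves */
        uint8_t requests[REQS_PER_THREAD][256];
    } PerThreadState;

    /* member alignment forces sizeof to be a multiple of CACHE_LINE */
    _Static_assert(sizeof(PerThreadState) % CACHE_LINE == 0,
                   "per-thread state must fill whole cache lines");
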
> Hmm... I have carefully worked through the changes step by step:
> 1. separate @requests from Threads, so each thread has its own
> 2. separate @completion from Threads
> 3. use batch mode
> 4. separate the bitmaps from Threads
> 5. revert batch mode on top of step 4
> 6. compare all of the above against directly using Emilio's patches.
> 
> The big difference between my modification and Emilio's patches is
> that I still keep the per-thread data attached at the end of
> @Threads.
>
> I got these performance data:
> 	https://ibb.co/kcLnLL
> 
> Indeed, I can almost reproduce your conclusion. The confusing part
> is batching - with the 16G of memory I used for the benchmark, batching
> improved nothing; I guess all the threads stay busy in that case anyway.

I wouldn't worry too much about it, then. The machine I ran my tests
on is rather old, so it's possible that cross-core/socket communication
is much slower there than on your machine, which makes batching
more attractive. At the end of the day, the measurements you take on
the machines you care about are what matter =)
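To make concrete what batching buys: the requester publishes a group of
filled slots and then kicks the worker once, so the worker pays one
sleep+wake cycle per batch rather than per request. A minimal
self-contained sketch with made-up names (C11 atomics; the worker side
is omitted):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct BatchQueue {
        _Atomic uint64_t fill_bitmap; /* which request slots hold work */
        _Atomic int kick;             /* flag the sleeping worker waits on */
    } BatchQueue;

    /* Publish a whole group of filled slots, then wake the worker once:
     * one sleep+wake cycle per batch instead of one per request. */
    static void submit_batch(BatchQueue *q, uint64_t filled_slots)
    {
        atomic_fetch_or_explicit(&q->fill_bitmap, filled_slots,
                                 memory_order_release);
        atomic_store_explicit(&q->kick, 1, memory_order_release);
    }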

> > I have pushed the above changes, along with some minor fixes (e.g.
> > removing the Threads.name field) here:
> > 
> >    https://github.com/cota/qemu/tree/xiao
> > 
> > Note that the 0-len variable goes away, and that Threads become
> > read-only. I also use padding to make sure the events are in
> > separate cache lines.
> > 
> > Feel free to incorporate whatever you see fit from that branch into
> > a subsequent iteration.
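
(To illustrate what "Threads become read-only" means in practice; this is
a sketch rather than the branch's exact code. With the zero-length array
gone, Threads carries only immutable configuration plus a pointer to the
separately allocated per-thread state, reusing the PerThreadState type
from the sketch further up:)

    /* Illustrative: set up once at init time and only read afterwards,
     * so every CPU can keep it cached without invalidation traffic. */
    typedef struct Threads {
        unsigned int thread_count;
        unsigned int requests_per_thread;
        PerThreadState *per_thread; /* thread_count entries, each padded */
    } Threads;
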
> 
> Nice, I will add you to the author list if you do not object. :)

That's not necessary. You deserve the full credit -- I just reviewed
the code ;-)

Thanks,

		Emilio

Thread overview: 24+ messages (duplicates from cross-posting to qemu-devel omitted)

2018-11-06 12:20 [PATCH v2 0/5] migration: improve multithreads guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 1/5] bitops: introduce change_bit_atomic guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 2/5] util: introduce threaded workqueue guangrong.xiao
2018-11-13 18:38   ` Emilio G. Cota
2018-11-20 10:25     ` Xiao Guangrong
2018-11-20 16:33       ` Emilio G. Cota [this message]
2018-11-06 12:20 ` [PATCH v2 3/5] migration: use threaded workqueue for compression guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 4/5] migration: use threaded workqueue for decompression guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 5/5] tests: add threaded-workqueue-bench guangrong.xiao
2018-11-07  2:52 ` [PATCH v2 0/5] migration: improve multithreads no-reply
2018-11-12  3:07 ` Xiao Guangrong
2018-11-20 18:27   ` Paolo Bonzini
