From: "Emilio G. Cota" <cota@braap.org> To: Xiao Guangrong <guangrong.xiao@gmail.com> Cc: kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com, Xiao Guangrong <xiaoguangrong@tencent.com>, dgilbert@redhat.com, peterx@redhat.com, qemu-devel@nongnu.org, quintela@redhat.com, wei.w.wang@intel.com, jiang.biao2@zte.com.cn, pbonzini@redhat.com Subject: Re: [PATCH v2 2/5] util: introduce threaded workqueue Date: Tue, 20 Nov 2018 11:33:15 -0500 [thread overview] Message-ID: <20181120163315.GA32388@flamenco> (raw) In-Reply-To: <35f0eb1a-3ec9-344b-7dda-d766c1b79626@gmail.com> On Tue, Nov 20, 2018 at 18:25:25 +0800, Xiao Guangrong wrote: > On 11/14/18 2:38 AM, Emilio G. Cota wrote: > > On Tue, Nov 06, 2018 at 20:20:22 +0800, guangrong.xiao@gmail.com wrote: > > > From: Xiao Guangrong <xiaoguangrong@tencent.com> (snip) > > Batching achieves higher performance at high core counts (>8), > > since worker threads go through fewer sleep+wake cycles while > > waiting for jobs. Performance however diminishes as more threads > > are used (>=24) due to cache line contention. > > > > Contention can be avoided by partitioning the request array, bitmaps > > and completion event to be entirely local to worker threads > > ("+per-thread-state"). This avoids the performance decrease at >=24 > > threads that we observe with batching alone. Note that the overall > > speedup does not go beyond ~8; this is explained by the fact that we > > use a single requester thread. Thus, as we add more worker threads, > > they become increasingly less busy, yet throughput remains roughly > > constant. I say roughly because there's quite a bit of noise--this > > is a 4-socket machine and I'm letting the scheduler move threads > > around, although I'm limiting the cores that can be used with > > taskset to maximize locality (this means that for 15 threads we're > > using 16 host cores that are all in the same socket; note that > > the additional thread is the requester one). > > Hmm... 
I have carefully written the stuff step by step: > 1. separate @requests from threads to each single thread > 2. separate @completion from threads > 3. use batch mode > 4. separate bitmaps from threads > 5. revert batch mode based on step 4 > 6. compare them with directly using Emilio's patches. > > The big different between my modification and Emilio's patches is > that i still make per-thread-data be attached in the end of > @Threads. > > I got these performance data: > https://ibb.co/kcLnLL > > Indeed, i can almost reproduce your conclusion. The confused part > is batch - I used 16G memory to do the benchmark, the batch improved > nothing, I guess all the threads keep busy under this case anyway. I wouldn't worry too much about it, then. The machine I ran my tests on is rather old, so it's possible that cross-core/socket communication is much slower there than on your machine, which makes batching more attractive. At the end of the day, the measurements you take on the machines you care about are what matter =) > > I have pushed the above changes, along with some minor fixes (e.g. > > removing the Threads.name field) here: > > > > https://github.com/cota/qemu/tree/xiao > > > > Note that the 0-len variable goes away, and that Threads become > > read-only. I also use padding to make sure the events are in > > separate cache lines. > > > > Feel free to incorporate whatever you see fit from that branch into > > a subsequent iteraiton. > > Nice, I will add you to the author list if you do not object. :) That's not necessary. You deserve the full credit -- I just reviewed the code ;-) Thanks, Emilio
Thread overview: 24+ messages [~2018-11-20 16:33 UTC]

2018-11-06 12:20 [PATCH v2 0/5] migration: improve multithreads guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 1/5] bitops: introduce change_bit_atomic guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 2/5] util: introduce threaded workqueue guangrong.xiao
2018-11-13 18:38   ` Emilio G. Cota
2018-11-20 10:25     ` Xiao Guangrong
2018-11-20 16:33       ` Emilio G. Cota [this message]
2018-11-06 12:20 ` [PATCH v2 3/5] migration: use threaded workqueue for compression guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 4/5] migration: use threaded workqueue for decompression guangrong.xiao
2018-11-06 12:20 ` [PATCH v2 5/5] tests: add threaded-workqueue-bench guangrong.xiao
2018-11-07  2:52 ` [PATCH v2 0/5] migration: improve multithreads no-reply
2018-11-12  3:07   ` Xiao Guangrong
2018-11-20 18:27     ` Paolo Bonzini