From: Wanpeng Li
Date: Tue, 23 Jan 2018 18:33:31 +0800
Subject: Re: unixbench context switch performance & cpu topology
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, kvm, Paolo Bonzini, Radim Krcmar, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
In-Reply-To: <20180122125337.GE2228@hirez.programming.kicks-ass.net>

2018-01-22 20:53 GMT+08:00 Peter Zijlstra:
> On Mon, Jan 22, 2018 at 07:47:45PM +0800, Wanpeng Li wrote:
>> Hi all,
>>
>> We can observe that unixbench context switch performance is heavily
>> influenced by the cpu topology exposed to the guest. The scores are
>> posted below, bigger is better. Both the guest and the host kernel are
>> 3.15-rc3 (we can also reproduce this against a centos 7.4 693
>> guest/host), the LLC is exposed to the guest, kvm adaptive
>> halt-polling is enabled by default, and the guest is started w/ 8
>> logical cpus.
>>
>> unixbench context switch
>> -smp 8, sockets=8, cores=1, threads=1                               382036
>> -smp 8, sockets=4, cores=2, threads=1                               132480
>> -smp 8, sockets=2, cores=4, threads=1                               128032
>> -smp 8, sockets=2, cores=2, threads=2                               131767
>> -smp 8, sockets=1, cores=4, threads=2                               132742
>> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)  331471
>>
>> I can observe a lot of reschedule IPIs sent from one vCPU to another
>> vCPU. The context switch workload switches between running and idle
>> frequently, which results in HLT instructions in the idle path. I use
>> idle=poll to avoid vmexits due to HLT and to avoid reschedule IPIs,
>> since the polling idle task checks the TIF_NEED_RESCHED flag in a
>> loop, and nohz=off stops programming the lapic timer and other nohz
>> work. Any idea why sockets=8 gets the best performance?
>
> I suspect because we load-balance less aggressively across nodes than we
> do within a cache domain.

That is true. After taking a closer look with kernelshark, for
sockets=1, cores=4, threads=2 the context1 task in the guest is
migrated to another logical cpu after several milliseconds, whereas
for sockets=8, cores=1, threads=1 it can stay on one logical cpu for
several seconds before migrating to another one.

> Fix your benchmark to pin itself to a single CPU, that's the only
> sensible way to obtain this number in any case.

Yeah, this setup gets good performance. Actually, contrary to Mike's
reply, kernelshark shows that the two context1 tasks do not stack up
on one logical cpu most of the time. In addition, I can observe that
the total number of RESCHED IPIs in the guest for sockets=1, cores=4,
threads=2 is 4.5 times that for sockets=8, cores=1, threads=1. Any
idea how this can happen? I suspect the TTWU path selects another idle
logical cpu, which makes a RESCHED IPI unavoidable. However, there is
still no performance benefit after I clear SD_BALANCE_WAKE for the
relevant sched_domains.
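For reference, the pinning can be done with something along these
lines (a minimal sketch, not the actual unixbench context1 source;
pin_to_cpu() and the choice of CPU 0 are only for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single logical CPU. A child created
 * later with fork() inherits the affinity mask, so both halves of the
 * pipe-based context switch loop stay on the same CPU. */
static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                exit(1);
        }
}

int main(void)
{
        pin_to_cpu(0);
        /* ... fork() and run the pipe ping-pong loop here ... */
        return 0;
}

With both tasks confined to one runqueue the wakeups stay local, so
they should not need a cross-CPU RESCHED IPI at all.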
Regards, Wanpeng Li