From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD3B0C433EF for ; Thu, 23 Jun 2022 21:54:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229789AbiFWVyQ (ORCPT ); Thu, 23 Jun 2022 17:54:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36436 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229545AbiFWVyP (ORCPT ); Thu, 23 Jun 2022 17:54:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 0F70162717 for ; Thu, 23 Jun 2022 14:54:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1656021253; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ZHTv9vjLjcZ5kd3TdhT6opIMcF3KqG0Q/ig2CFHaOQE=; b=eiPWT23Ci1C1YL5fZ6jzecqfo+ocbFrdKotUp6gmeNGBQMZVJ1bhfO+hhfVFjx01VLr/A2 YH/GwXfBe/muSVAB1ZY99sLOa/rhdVcQZBHdPyAM/71b3SqbPqFwiz3RrJaawedk77S64e HtiRdkKc3svOAcfXZ9aDvHUH0pvRJww= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-554-OHi2kHtjOr-Mf_JxbDyf6Q-1; Thu, 23 Jun 2022 17:54:09 -0400 X-MC-Unique: OHi2kHtjOr-Mf_JxbDyf6Q-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 3C99B3C1104B; Thu, 23 Jun 2022 21:54:09 +0000 (UTC) Received: from [10.22.9.91] (unknown [10.22.9.91]) by smtp.corp.redhat.com (Postfix) with ESMTP id 191D6492CA5; Thu, 23 Jun 2022 21:54:07 +0000 (UTC) Message-ID: Date: Thu, 23 Jun 2022 17:54:07 -0400 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Subject: Re: [PATCH] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation Content-Language: en-US To: Guo Hui , peterz@infradead.org Cc: jpoimboe@kernel.org, song@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, daniel@iogearbox.net, will@kernel.org, boqun.feng@gmail.com, wangxiaohua@uniontech.com, linux-kernel@vger.kernel.org References: <20220623155007.3059-1-guohui@uniontech.com> From: Waiman Long In-Reply-To: <20220623155007.3059-1-guohui@uniontech.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.85 on 10.11.54.9 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 6/23/22 11:50, Guo Hui wrote: > The instructions assigned to the parameters of > the vcpu_is_preempted function in the X86 architecture > physical machine are redundant instructions, which cause the > multi-core performance of Unixbench to drop by > about 300 to 500 points. The C function is as follows: > static bool vcpu_is_preempted(long vcpu); > > The parameter assignments in the function osq_lock > that call the function vcpu_is_preempted are as follows: > mov 0x14(%rax),%edi > sub $0x1,%edi > > The above instructions are unnecessary > in the X86 Native operating environment, > causing high cache-misses and degrading performance. > > This patch implements the replacement of the above instructions with > the nop instruction at system startup. > When the assignment C code changes, > the above assignment instructions may also change accordingly. > In order to flexibly replace the instructions generated by the C code, > this patch defines the macro ALTERNATIVE_C_CODE > from the ALTERNATIVE macro. Different from the first parameter > of the ALTERNATIVE macro, the ALTERNATIVE_C_CODE macro is > the first parameter is a C code statement and cannot contain > a return statement. To use the macro, for example, > replace the instructions generated by the > C code 'cpu = node->cpu - 1;' with the nop instruction: > > OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0); > > The patch effect is as follows two machines, > Unixbench runs with full core score: > > 1. Machine configuration: > Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz > CPU core: 40 > Memory: 256G > OS Kernel: 5.19-rc3 > > Before using the patch: > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 944608003.9 80943.3 > Double-Precision Whetstone 55.0 212045.8 38553.8 > Execl Throughput 43.0 42772.3 9947.1 > File Copy 1024 bufsize 2000 maxblocks 3960.0 462656.6 1168.3 > File Copy 256 bufsize 500 maxblocks 1655.0 120043.6 725.3 > File Copy 4096 bufsize 8000 maxblocks 5800.0 1544525.5 2663.0 > Pipe Throughput 12440.0 47277698.5 38004.6 > Pipe-based Context Switching 4000.0 1894556.8 4736.4 > Process Creation 126.0 86077.0 6831.5 > Shell Scripts (1 concurrent) 42.4 70236.3 16565.2 > Shell Scripts (8 concurrent) 6.0 8978.1 14963.4 > System Call Overhead 15000.0 4691260.0 3127.5 > ======== > System Benchmarks Index Score 7980.9 > > After using the patch: > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 2253984916.9 193143.5 > Double-Precision Whetstone 55.0 438940.3 79807.3 > Execl Throughput 43.0 10720.3 2493.1 > File Copy 1024 bufsize 2000 maxblocks 3960.0 312233.0 788.5 > File Copy 256 bufsize 500 maxblocks 1655.0 80050.9 483.7 > File Copy 4096 bufsize 8000 maxblocks 5800.0 1036101.7 1786.4 > Pipe Throughput 12440.0 117700315.3 94614.4 > Pipe-based Context Switching 4000.0 8421909.8 21054.8 > Process Creation 126.0 36742.0 2916.0 > Shell Scripts (1 concurrent) 42.4 52846.2 12463.7 > Shell Scripts (8 concurrent) 6.0 7058.1 11763.6 > System Call Overhead 15000.0 6791548.2 4527.7 > ======== > System Benchmarks Index Score 8260.6 > > 2. Machine configuration: > Hygon C86 7185 32-core Processor > CPU core: 128 > Memory: 256G > OS Kernel: 5.19-rc3 > > Before using the patch: > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 2256283941.6 193340.5 > Double-Precision Whetstone 55.0 439577.3 79923.2 > Execl Throughput 43.0 10013.6 2328.7 > File Copy 1024 bufsize 2000 maxblocks 3960.0 278121.5 702.3 > File Copy 256 bufsize 500 maxblocks 1655.0 71835.5 434.1 > File Copy 4096 bufsize 8000 maxblocks 5800.0 905654.2 1561.5 > Pipe Throughput 12440.0 117715166.2 94626.3 > Pipe-based Context Switching 4000.0 7731331.7 19328.3 > Process Creation 126.0 30157.8 2393.5 > Shell Scripts (1 concurrent) 42.4 48670.8 11479.0 > Shell Scripts (8 concurrent) 6.0 6595.6 10992.7 > System Call Overhead 15000.0 6766475.9 4511.0 > ======== > System Benchmarks Index Score 7688.7 > > After using the patch: > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 2253984916.9 193143.5 > Double-Precision Whetstone 55.0 438940.3 79807.3 > Execl Throughput 43.0 10720.3 2493.1 > File Copy 1024 bufsize 2000 maxblocks 3960.0 312233.0 788.5 > File Copy 256 bufsize 500 maxblocks 1655.0 80050.9 483.7 > File Copy 4096 bufsize 8000 maxblocks 5800.0 1036101.7 1786.4 > Pipe Throughput 12440.0 117700315.3 94614.4 > Pipe-based Context Switching 4000.0 8421909.8 21054.8 > Process Creation 126.0 36742.0 2916.0 > Shell Scripts (1 concurrent) 42.4 52846.2 12463.7 > Shell Scripts (8 concurrent) 6.0 7058.1 11763.6 > System Call Overhead 15000.0 6791548.2 4527.7 > ======== > System Benchmarks Index Score 8260.6 > > Signed-off-by: Guo Hui > --- > arch/x86/include/asm/alternative.h | 15 +++++++++++++++ > arch/x86/kernel/alternative.c | 11 +++++++++++ > include/linux/osq_lock.h | 7 +++++++ > kernel/locking/osq_lock.c | 6 +++++- > 4 files changed, 38 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h > index 9b10c8c76..5979ebe89 100644 > --- a/arch/x86/include/asm/alternative.h > +++ b/arch/x86/include/asm/alternative.h > @@ -167,6 +167,21 @@ static inline int alternatives_text_reserved(void *start, void *end) > ALTINSTR_REPLACEMENT(newinstr, 1) \ > ".popsection\n" > > +/* alternative c code primitive: */ > +#define ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) \ > +do { \ > + asm volatile("661:\n\t"); \ > + oldinstr; \ > + asm volatile("\n662:\n" \ > + alt_end_marker ":\n" \ > + ".pushsection .altinstructions,\"a\"\n" \ > + ALTINSTR_ENTRY(feature, 1) \ > + ".popsection\n" \ > + ".pushsection .altinstr_replacement, \"ax\"\n" \ > + ALTINSTR_REPLACEMENT(newinstr, 1) \ > + ".popsection\n"); \ > +} while (0) > + > #define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\ > OLDINSTR_2(oldinstr, 1, 2) \ > ".pushsection .altinstructions,\"a\"\n" \ > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > index e257f6c80..cf77be884 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -251,6 +251,8 @@ static void __init_or_module noinline optimize_nops(u8 *instr, size_t len) > } > } > > +extern bool pv_is_native_vcpu_is_preempted(void); > + > /* > * Replace instructions with better alternatives for this CPU type. This runs > * before SMP is initialized to avoid SMP problems with self modifying code. > @@ -285,6 +287,15 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start, > > instr = (u8 *)&a->instr_offset + a->instr_offset; > replacement = (u8 *)&a->repl_offset + a->repl_offset; > + > + if (*replacement == 0x90 && a->replacementlen == 1) { > +#if defined(CONFIG_PARAVIRT_SPINLOCKS) > + if (pv_is_native_vcpu_is_preempted()) > + add_nops(instr, a->instrlen); > +#endif > + continue; > + } > + This is hacky and it may incorrectly affect other alternatives that patches thing to nop. > BUG_ON(a->instrlen > sizeof(insn_buff)); > BUG_ON(feature >= (NCAPINTS + NBUGINTS) * 32); > > diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h > index 5581dbd3b..ee960e3aa 100644 > --- a/include/linux/osq_lock.h > +++ b/include/linux/osq_lock.h > @@ -38,4 +38,11 @@ static inline bool osq_is_locked(struct optimistic_spin_queue *lock) > return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL; > } > > +#ifdef ALTERNATIVE_C_CODE > +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) \ > + ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) > +#else > +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) oldinstr > +#endif > + > #endif > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c > index d5610ad52..bbe7d640c 100644 > --- a/kernel/locking/osq_lock.c > +++ b/kernel/locking/osq_lock.c > @@ -24,7 +24,11 @@ static inline int encode_cpu(int cpu_nr) > > static inline int node_cpu(struct optimistic_spin_node *node) > { > - return node->cpu - 1; > + int cpu = 0; > + > + OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0); > + > + return cpu; > } Why don't you use static_key to control the alternative and add a late_initcall() to set it. I think that will be simpler and you can throw away ALTERNATIVE_C_CODE(). Cheers, Longman