Subject: Re: [PATCH V5 0/7] x86/jump_label: Bound IPIs sent when updating a static key
From: Daniel Bristot de Oliveira
To: linux-kernel@vger.kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Greg Kroah-Hartman, Masami Hiramatsu, "Steven Rostedt (VMware)",
    Jiri Kosina, Josh Poimboeuf, "Peter Zijlstra (Intel)",
    Chris von Recklinghausen, Jason Baron, Scott Wood, Marcelo Tosatti,
    Clark Williams, x86@kernel.org
Message-ID: <302cc15a-6805-9549-2de9-ffbb4707ae64@redhat.com>
Date: Mon, 15 Apr 2019 11:52:46 +0200

On 4/1/19 10:58 AM, Daniel Bristot de Oliveira wrote:
> While tuning a system with CPUs isolated as much as possible, we've
> noticed that isolated CPUs were receiving bursts of 12 IPIs, periodically.
> Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
> process - generating the IPIs when changing a static key, enabling
> network timestamping on sockets.
>
> For instance, the trace pointed to:
>
> # trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
> # trace-cmd report...
> [...]
>          chronyd-858   [000]   433.286406: function:    smp_call_function_single
>          chronyd-858   [000]   433.286408: kernel_stack:
> => smp_call_function_many (ffffffffbc10ab22)
> => smp_call_function (ffffffffbc10abaa)
> => on_each_cpu (ffffffffbc10ac0b)
> => text_poke_bp (ffffffffbc02436a)
> => arch_jump_label_transform (ffffffffbc020ec8)
> => __jump_label_update (ffffffffbc1b300f)
> => jump_label_update (ffffffffbc1b30ed)
> => static_key_slow_inc (ffffffffbc1b334d)
> => net_enable_timestamp (ffffffffbc625c44)
> => sock_enable_timestamp (ffffffffbc6127d5)
> => sock_setsockopt (ffffffffbc612d2f)
> => SyS_setsockopt (ffffffffbc60d5e6)
> => tracesys (ffffffffbc7681c5)
>           <idle>-0     [001]   433.286416: call_function_single_entry: vector=251
>           <idle>-0     [001]   433.286419: call_function_single_exit: vector=251
>
> [... The IPI takes place 12 times]
>
> The static key in this case was netstamp_needed_key. A static key
> change from enabled->disabled (or disabled->enabled) causes the code to be
> patched, and this is done in three steps:
>
> -- Pseudo-code #1 - Current implementation ---
> For each key to be updated:
> 1) add an int3 trap to the address that will be patched
>     sync cores (send IPI to all other CPUs)
> 2) update all but the first byte of the patched range
>     sync cores (send IPI to all other CPUs)
> 3) replace the first byte (int3) with the first byte of the replacing opcode
>     sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
>
> As the static key netstamp_needed_key has four entries (it is used in four
> places in the code) in our kernel, 3 IPIs were generated for each
> entry, resulting in 12 IPIs. The number of IPIs is thus linear in the
> number 'n' of entries of a key: O(n*3), which is O(n).
>
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry.
> For instance, the sched_schedstats jump label has 56 entries
> in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU on
> which the thread that is enabling the key is _not_ running.
>
> Rather than doing each update at once, it is possible to queue all updates
> first, and then apply all updates at once, rewriting pseudo-code #1
> in this way:
>
> -- Pseudo-code #2 - proposal ---
> 1) for each key in the queue:
>     add an int3 trap to the address that will be patched
>    sync cores (send IPI to all other CPUs)
>
> 2) for each key in the queue:
>     update all but the first byte of the patched range
>    sync cores (send IPI to all other CPUs)
>
> 3) for each key in the queue:
>     replace the first byte (int3) with the first byte of the replacing opcode
>    sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - proposal ---
>
> Doing the update in this way, the number of IPIs becomes O(3) with regard
> to the number of keys, which is O(1).
>
> Currently, the jump label of a static key is transformed via the
> arch-specific function:
>
>     void arch_jump_label_transform(struct jump_entry *entry,
>                                    enum jump_label_type type)
>
> The new approach (batch mode) uses two arch functions; the first has the
> same arguments as arch_jump_label_transform(), and is the function:
>
>     int arch_jump_label_transform_queue(struct jump_entry *entry,
>                                         enum jump_label_type type)
>
> Rather than transforming the code, it adds the jump_entry to a queue of
> entries to be updated.
>
> After queuing all jump_entries, the function:
>
>     void arch_jump_label_transform_apply(void)
>
> applies the changes in the queue.
>
> One easy way to see the benefits of this patch is switching the
> schedstats on and off.
> For instance:
>
> -------------------------- %< ----------------------------
> #!/bin/bash
>
> while true; do
>     sysctl -w kernel.sched_schedstats=1
>     sleep 2
>     sysctl -w kernel.sched_schedstats=0
>     sleep 2
> done
> -------------------------- >% ----------------------------
>
> while watching the IPI count:
>
> -------------------------- %< ----------------------------
> # watch -n1 "cat /proc/interrupts | grep Function"
> -------------------------- >% ----------------------------
>
> With the current mode, it is possible to see around 168 IPIs every 2 seconds,
> while with the batch mode the number of IPIs goes down to 3 every 2 seconds.
>
> Regarding the performance impact of this patch set, I made two measurements:
>
>     The time to update a key (on the task that is causing the change)
>     The time to run the int3 handler (the side effect on a thread that
>     hits the code being changed)
>
> The sched_schedstats static key was chosen as the key to be switched on and
> off. The reason is that it is used in more than 56 places, in a hot path.
> The change in the schedstats static key will be done with the following
> command:
>
> while true; do
>     sysctl -w kernel.sched_schedstats=1
>     usleep 500000
>     sysctl -w kernel.sched_schedstats=0
>     usleep 500000
> done
>
> In this way, the key will be updated twice per second. To force hits of the
> int3 handler, the system will also run a kernel compilation with two jobs per
> CPU. The test machine is a two-node/24-CPU box with an Intel Xeon processor
> @2.27GHz.
>
> Regarding the update part, on average, the regular kernel takes 57 ms to update
> the static key, while the kernel with batch updates takes just 1.4 ms on
> average. Although it seems too good to be true, it makes sense: the
> sched_schedstats key is used in 56 places, so it was expected that the current
> implementation would take around 56 times longer to update the keys, as the
> IPIs are the most expensive part of the update.
>
> Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
> the batch version takes around 180 ns. At first glance, that seems like a high
> value. But it is not, considering that it is handling 56 updates rather than
> one! It takes only four times more. This gain is possible because the patch
> uses a binary search over the vector: log2(56) = 5.8, so an overhead within a
> factor of four was expected.
>
> (voice of TV propaganda) But that is not all! As the int3 handler stays armed
> for a shorter period (because the update part runs for a shorter time), the
> number of hits in the int3 handler decreased by 10%.
>
> The question then is: Is it worth paying the price of "135 ns" more in the int3
> handler?
>
> Considering that, in this test case, we are saving the handling of 53 IPIs,
> which takes more than these 135 ns, it seems a meager price to pay. Moreover,
> the test case was forcing hits of the int3 handler; in practice, that does not
> happen that often, while the IPI takes place on all CPUs, whether they hit the
> int3 handler or not!
>
> For instance, on an isolated CPU with a process running in user space
> (nohz_full use case), the chance of hitting the int3 handler is nearly zero,
> while there is no way to avoid the IPIs. By bounding the IPIs, we improve this
> scenario a lot.
>
> Changes from v4:
>  - Renamed jump_label_can_update_check() to jump_label_can_update()
>    (Borislav Petkov)
>  - Switched jump_label_can_update() return values to bool (Borislav Petkov)
>  - Accepted the fact that some lines will be > 80 characters (Borislav Petkov)
>  - Local variables are now in inverted Christmas tree order (Borislav Petkov)
>  - Removed superfluous helpers (Borislav Petkov)
>  - Renamed text_to_poke to text_patch_loc, and handler to detour
>    (Borislav Petkov & Steven Rostedt)
>  - Re-applied Steven's suggestion of not using BUG() (Masami Hiramatsu)
>  - arch_jump_label_transform_queue() now returns 0 on success and
>    -ENOSPC/-EINVAL on errors (Masami Hiramatsu)

This is a gentle ping...

Thanks!
-- Daniel