From: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, vschneid@redhat.com,
	srikar@linux.vnet.ibm.com, sshegde@linux.ibm.com,
	vishalc@linux.vnet.ibm.com
Subject: sched/debug: CPU hotplug operation suffers in a large cpu systems
Date: Mon, 17 Oct 2022 18:40:49 +0530	[thread overview]
Message-ID: <Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com> (raw)


The smt=off operation on a system with 1920 CPUs takes approximately 59 minutes
on v5.14 versus 29 minutes on v5.11, measured using:
# time ppc64_cpu --smt=off

A git bisect between kernel v5.11 and v5.14 pointed to commit
3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"), which
moves the sched_domain information originally exported via sysctl over to
debugfs.

Reverting that commit restores the expected good result.

Previously, sched domain information was exported via procfs (sysctl) at
/proc/sys/kernel/sched_domain/; it is now exported via debugfs at
/sys/kernel/debug/sched/domains/

We also observe the regression in kernel v6.0-rc4, and it vanishes after
reverting commit 3b87f136f8fc.

# Output of `time ppc64_cpu --smt=off` on different kernel versions
|-------------------------------------+------------+----------+----------|
| kernel version                      | real       | user     | sys      |
|-------------------------------------+------------+----------+----------|
| v5.11                               | 29m22.007s | 0m0.001s | 0m6.444s |
| v5.14                               | 58m15.719s | 0m0.037s | 0m7.482s |
| v6.0-rc4                            | 59m30.318s | 0m0.055s | 0m7.681s |
| v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
|-------------------------------------+------------+----------+----------|

A machine with 1920 CPUs was used for the above experiments; the output of
lscpu is included below.

# lscpu 
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 1920
On-line CPU(s) list: 0-1919
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 14
Socket(s): 17
Physical sockets: 15
Physical chips: 1
Physical cores/chip: 16

Through our experiments we have found that even when offlining a single CPU,
the functions responsible for exporting sched_domain information take more
time with debugfs than with sysctl.

Experiments using the trace-cmd function-graph plugin show that the execution
times of certain functions common to both scenarios (procfs and debugfs)
differ drastically.

The table below lists the execution times of some of these symbols for the
sysctl (procfs) and debugfs cases.

|--------------------------------+----------------+--------------|
| method                         | sysctl         | debugfs      |
|--------------------------------+----------------+--------------|
| unregister_sysctl_table        |   0.020050 s   | NA           |
| build_sched_domains            |   3.090563 s   | 3.119130 s   |
| register_sched_domain_sysctl   |   0.065487 s   | NA           |
| update_sched_domain_debugfs    |   NA           | 2.791232 s   |
| partition_sched_domains_locked |   3.195958 s   | 5.933254 s   |
|--------------------------------+----------------+--------------|

Note: partition_sched_domains_locked internally calls build_sched_domains,
      plus the functions specific to whichever interface (sysctl or debugfs)
      is currently used to export the information.

The above numbers are from offlining 1 CPU on a system with 1920 online CPUs.
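As a quick consistency check, the component timings in the table roughly sum
to the partition_sched_domains_locked totals, the small remainder being other
work inside that function:

```python
# Consistency check: per-path component times vs. the measured
# partition_sched_domains_locked totals from the table above.
sysctl_total = 3.090563 + 0.020050 + 0.065487  # build + unregister + register
debugfs_total = 3.119130 + 2.791232            # build + update_sched_domain_debugfs

print(f"sysctl:  {sysctl_total:.3f} s  (measured total: 3.196 s)")
print(f"debugfs: {debugfs_total:.3f} s  (measured total: 5.933 s)")
```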

From the above table, register_sched_domain_sysctl and
unregister_sysctl_table collectively took ~0.085 s, whereas
update_sched_domain_debugfs alone took ~2.79 s.

Root cause:

The observed regression stems from how these two pseudo-filesystems handle
creation and deletion of files and directories internally.

update_sched_domain_debugfs builds and exports the sched_domains information
to userspace. It begins by tearing down the per-cpu directories under
/sys/kernel/debug/sched/domains/ one by one for each possible cpu, and then
recreates the per-cpu, per-domain files and directories one by one for each
possible cpu.

Excerpt from the trace-cmd output for the debugfs case
...
             |  update_sched_domain_debugfs() {
+ 14.526 us  |    debugfs_lookup();
# 1092.64 us |    debugfs_remove();
+ 48.408 us  |    debugfs_create_dir();      -   creates per-cpu    dir
  9.038 us   |    debugfs_create_dir();      -   creates per-domain dir
  9.638 us   |    debugfs_create_ulong();   -+
  7.762 us   |    debugfs_create_ulong();    |
  7.776 us   |    debugfs_create_u64();      |
  7.502 us   |    debugfs_create_u32();      |__ creates per-domain files
  7.646 us   |    debugfs_create_u32();      |
  7.702 us   |    debugfs_create_u32();      |
  6.974 us   |    debugfs_create_str();      |
  7.628 us   |    debugfs_create_file();    -+
...                                          -   repeat other domains and cpus
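Scaling the per-call timings from this excerpt across all CPUs reproduces the
observed cost fairly well. The sketch below is a rough back-of-envelope model;
the number of sched domains per CPU (5) is an assumed value for illustration:

```python
# Rough cost model for update_sched_domain_debugfs on a 1920-CPU system,
# using per-call timings (in microseconds) from the trace excerpt above.
CPUS = 1920
DOMAINS_PER_CPU = 5        # assumption; the actual value depends on topology
REMOVE_US = 1092.64        # debugfs_remove() of the old per-cpu tree, per CPU
CPU_DIR_US = 48.408        # debugfs_create_dir() for the per-cpu directory
DOMAIN_US = 71.7           # per-domain dir + 8 files, summed from the trace

per_cpu_us = REMOVE_US + CPU_DIR_US + DOMAINS_PER_CPU * DOMAIN_US
total_s = CPUS * per_cpu_us / 1e6
print(f"estimated: {total_s:.2f} s")   # in the ballpark of the measured 2.79 s
```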

As a first step, we used debugfs_remove_recursive to remove the entries for
all cpus in one go instead of calling debugfs_remove per cpu, but we did not
see any improvement whatsoever.

We understand that debugfs does not concern itself with performance, and that
the smt=off operation is not invoked very often, statistically speaking.
However, as the number of CPUs in a system scales, debugfs becomes a
significant performance bottleneck that should not be ignored.

Even on a system with only 240 CPUs, update_sched_domain_debugfs is roughly
17 times slower than register_sched_domain_sysctl when building the
sched_domain directories for those 240 CPUs.

# For a 240-CPU system
|------------------------------+---------------|
| method                       | time taken    |
|------------------------------+---------------|
| update_sched_domain_debugfs  | 236550.996 us |
| register_sched_domain_sysctl | 13907.940 us  |
|------------------------------+---------------|
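The slowdown factor follows directly from the two measurements:

```python
# Ratio of debugfs to sysctl time for the 240-CPU case above.
debugfs_us = 236550.996
sysctl_us = 13907.940
print(f"debugfs is {debugfs_us / sysctl_us:.1f}x slower")   # ~17x
```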

Any ideas from the community on how to improve this are much appreciated.

Meanwhile, we will keep posting our progress updates.

-- vishal.c


Thread overview: 38+ messages
2022-10-17 13:10 Vishal Chourasia [this message]
2022-10-17 14:19 ` sched/debug: CPU hotplug operation suffers in a large cpu systems Peter Zijlstra
2022-10-17 14:54   ` Greg Kroah-Hartman
2022-10-18 10:37     ` Vishal Chourasia
2022-10-18 11:04       ` Greg Kroah-Hartman
2022-10-26  6:37         ` Vishal Chourasia
2022-10-26  7:02           ` Greg Kroah-Hartman
2022-10-26  9:10             ` Peter Zijlstra
2022-11-08 10:00               ` Vishal Chourasia
2022-11-08 12:24                 ` Greg Kroah-Hartman
2022-11-08 14:51                   ` Srikar Dronamraju
2022-11-08 15:38                     ` Greg Kroah-Hartman
2022-12-12 19:17                   ` Phil Auld
2022-12-13  2:17                     ` kernel test robot
2022-12-13  6:23                     ` Greg Kroah-Hartman
2022-12-13 13:22                       ` Phil Auld
2022-12-13 14:31                         ` Greg Kroah-Hartman
2022-12-13 14:45                           ` Phil Auld
2023-01-19 15:31                           ` Phil Auld
2022-12-13 23:41                         ` Michael Ellerman
2022-12-14  2:26                           ` Phil Auld