From: Steven Rostedt <rostedt@goodmis.org>
To: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, pmladek@suse.com,
sergey.senozhatsky@gmail.com, linux@rasmusvillemoes.dk
Subject: Re: [PATCH 03/15] print_integer: new and improved way of printing integers
Date: Tue, 21 Apr 2020 17:08:36 -0400 [thread overview]
Message-ID: <20200421170836.4aafcbc2@gandalf.local.home> (raw)
In-Reply-To: <20200421164924.GB8735@avx2>
[-- Attachment #1: Type: text/plain, Size: 10811 bytes --]
On Tue, 21 Apr 2020 19:49:24 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > Exactly. The code in _print_integer_u32() doesn't look as fast as the
> > code in vsprintf() that happens to use lookup tables and converts
> > without any loops.
> >
> > Hint, loops are bad, they cause the CPU to slow down.
>
> Oh, come on! Loops make code fit into icache and μop decode cache.
Depends on the architecture.
>
> > Anyway, this patch series would require a pretty good improvement, as
> > the code replacing the sprintf() usages is pretty ugly compared to a
> > simple sprintf() call.
>
> No! Fast code must look ugly. Or in other words if you try to optimise
> integer printing to death you'll probably end with something like
> _print_integer().
As I stated, it will require a "pretty good improvement". There's always a
trade off. If the improvement is noticeable for real life cases, then ugly
code is worth it. If we are making ugly code for a benefit of something
that never shows outside of noise, then the net cost (less maintainable
code), is not worth it.
>
> When the very first patch changed /proc/stat to seq_put_decimal_ull()
> the speed up was 66% (or 33%). That's how slow printing was back then.
> It can be made slightly faster even now.
I'd like to see the tests that your ran (to reproduce them myself).
The first patch was making a 64bit number into 32bit number, thus
shortening the work by half.
>
> > Randomly picking patch 6:
> >
> > static int loadavg_proc_show(struct seq_file *m, void *v)
> > {
> > unsigned long avnrun[3];
> >
> > get_avenrun(avnrun, FIXED_1/200, 0);
> >
> > seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
> > LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
> > LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> > LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> > nr_running(), nr_threads,
> > idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> > return 0;
> > }
> >
> > *vs*
> >
> > static int loadavg_proc_show(struct seq_file *m, void *v)
> > {
> > unsigned long avnrun[3];
> > char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
> > char *p = buf + sizeof(buf);
> > int i;
> >
> > *--p = '\n';
> > p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> > *--p = ' ';
> > p = _print_integer_u32(p, nr_threads);
> > *--p = '/';
> > p = _print_integer_u32(p, nr_running());
> >
> > get_avenrun(avnrun, FIXED_1/200, 0);
> > for (i = 2; i >= 0; i--) {
> > *--p = ' ';
> > --p; /* overwritten */
> > *--p = '0'; /* conditionally overwritten */
> > (void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
> > *--p = '.';
> > p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
> > }
> >
> > seq_write(m, p, buf + sizeof(buf) - p);
> > return 0;
> > }
> >
> >
> > I much rather keep the first version.
>
> I did the benchmarks (without stack protector though), everything except
> /proc/cpuinfo and /proc/meminfo became faster. This requires investigation
> and I can drop vsprintf() changes until then.
>
> Now given that /proc/uptime format cast in stone, code may look a bit ugly
> and unusual but it won't require maintainance
Please share what you did as your benchmarks. If this is as good of a
performance as you claim, then these changes would be worth looking into.
So I applied your entire series, added the following patch:
diff --git a/include/linux/spinlock_api_smp.h
b/include/linux/spinlock_api_smp.h index 19a9be9d97ee..17c582d77ab7 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -152,9 +152,16 @@ static inline void __raw_spin_unlock(raw_spinlock_t
*lock) preempt_enable();
}
+extern u64 sched_clock(void);
static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
unsigned long flags)
{
+ char buf[32];
+ u64 start, stop;
+ start = sched_clock();
+ sprintf(buf,"%lld", (unsigned long)lock);
+ stop = sched_clock();
+ trace_printk("time: %lld '%s'\n", stop - start, buf);
spin_release(&lock->dep_map, _RET_IP_);
do_raw_spin_unlock(lock);
local_irq_restore(flags);
Then after boot up, I did the following:
# trace-cmd stop
# trace-cmd extract
Which captured the traces:
<idle>-0 [003] 5.405208: bprint: _raw_spin_unlock_irqrestore: time: 799 '-110308271193088'
<idle>-0 [003] 5.405210: bprint: _raw_spin_unlock_irqrestore: time: 273 '-110308271193088'
<idle>-0 [003] 5.412235: bprint: _raw_spin_unlock_irqrestore: time: 1138 '-110308271193088'
<idle>-0 [003] 5.412236: bprint: _raw_spin_unlock_irqrestore: time: 213 '-110308271193088'
<idle>-0 [003] 5.414241: bprint: _raw_spin_unlock_irqrestore: time: 1094 '-110308271193088'
<idle>-0 [003] 5.414243: bprint: _raw_spin_unlock_irqrestore: time: 182 '-110308271193088'
<idle>-0 [003] 5.418241: bprint: _raw_spin_unlock_irqrestore: time: 1113 '-110308271193088'
Where "time: X", X is the delta in nanoseconds around the sprintf() call.
The I ran the attached perl program on the output, and got the following
results:
Before your patches:
# trace-cmd report | ./report.pl
full_total = 52844823
average: 255.902176229032
std: 439.269729814847
And with your patches:
# trace-cmd report | ./report.pl
full_total = 84725476
average: 407.873274762306
std: 555.755670463724
As the standard deviation is bigger than the average, it appears to be all
in the noise.
Then I decided to see if it affects "ps -eux"
Original:
# perf stat -r 100 ps -eux > /dev/null
Performance counter stats for 'ps -eux' (100 runs):
8.92 msec task-clock # 0.937 CPUs utilized ( +- 0.90% )
5 context-switches # 0.545 K/sec ( +- 1.24% )
0 cpu-migrations # 0.000 K/sec
259 page-faults # 0.029 M/sec ( +- 0.07% )
32,973,751 cycles # 3.698 GHz ( +- 0.09% )
17,254,307 stalled-cycles-frontend # 52.33% frontend cycles idle ( +- 0.17% )
38,707,960 instructions # 1.17 insn per cycle
# 0.45 stalled cycles per insn ( +- 0.01% )
8,153,197 branches # 914.274 M/sec ( +- 0.01% )
114,992 branch-misses # 1.41% of all branches ( +- 0.12% )
0.0095170 +- 0.0000829 seconds time elapsed ( +- 0.87% )
With your patches:
# perf stat -r 100 ps -eux > /dev/null
Performance counter stats for 'ps -eux' (100 runs):
8.86 msec task-clock # 0.918 CPUs utilized ( +- 1.06% )
5 context-switches # 0.527 K/sec ( +- 1.22% )
0 cpu-migrations # 0.001 K/sec ( +-100.00% )
259 page-faults # 0.029 M/sec ( +- 0.08% )
32,699,168 cycles # 3.692 GHz ( +- 0.12% )
16,995,861 stalled-cycles-frontend # 51.98% frontend cycles idle ( +- 0.21% )
38,114,396 instructions # 1.17 insn per cycle
# 0.45 stalled cycles per insn ( +- 0.03% )
7,985,526 branches # 901.625 M/sec ( +- 0.03% )
112,852 branch-misses # 1.41% of all branches ( +- 0.17% )
0.009652 +- 0.000276 seconds time elapsed ( +- 2.86% )
Not really much difference.
Then what about just catting /proc/$$/stat, and do a 1000 runs!
Original:
# perf stat -r 1000 cat /proc/$$/stat > /dev/null
Performance counter stats for 'cat /proc/1622/stat' (1000 runs):
0.34 msec task-clock # 0.680 CPUs utilized ( +- 0.21% )
0 context-switches # 0.071 K/sec ( +- 20.18% )
0 cpu-migrations # 0.000 K/sec
65 page-faults # 0.192 M/sec ( +- 0.07% )
993,486 cycles # 2.934 GHz ( +- 0.07% )
577,903 stalled-cycles-frontend # 58.17% frontend cycles idle ( +- 0.09% )
936,489 instructions # 0.94 insn per cycle
# 0.62 stalled cycles per insn ( +- 0.07% )
202,912 branches # 599.213 M/sec ( +- 0.07% )
6,976 branch-misses # 3.44% of all branches ( +- 0.11% )
0.00049797 +- 0.00000111 seconds time elapsed ( +- 0.22% )
With your patches:
# perf stat -r 1000 cat /proc/$$/stat > /dev/null
Performance counter stats for 'cat /proc/1624/stat' (1000 runs):
0.34 msec task-clock # 0.664 CPUs utilized ( +- 0.23% )
0 context-switches # 0.018 K/sec ( +- 40.72% )
0 cpu-migrations # 0.000 K/sec
65 page-faults # 0.190 M/sec ( +- 0.07% )
988,430 cycles # 2.892 GHz ( +- 0.07% )
574,841 stalled-cycles-frontend # 58.16% frontend cycles idle ( +- 0.09% )
933,852 instructions # 0.94 insn per cycle
# 0.62 stalled cycles per insn ( +- 0.07% )
202,096 branches # 591.297 M/sec ( +- 0.07% )
6,836 branch-misses # 3.38% of all branches ( +- 0.09% )
0.00051476 +- 0.00000557 seconds time elapsed ( +- 1.08% )
They are pretty much identical.
What's the purpose of all these changes again? There was no cover letter.
-- Steve
[-- Attachment #2: report.pl --]
[-- Type: application/x-perl, Size: 474 bytes --]
next prev parent reply other threads:[~2020-04-21 21:08 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-20 20:57 [PATCH 01/15] sched: make nr_running() return "unsigned int" Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 02/15] sched: make nr_iowait_cpu() " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 03/15] print_integer: new and improved way of printing integers Alexey Dobriyan
2020-04-20 21:19 ` Andy Shevchenko
2020-04-20 21:27 ` Andy Shevchenko
2020-04-21 1:54 ` Steven Rostedt
2020-04-21 16:49 ` Alexey Dobriyan
2020-04-21 21:08 ` Steven Rostedt [this message]
2020-04-23 3:42 ` Sergey Senozhatsky
2020-04-21 16:31 ` Alexey Dobriyan
2020-04-21 16:59 ` Matthew Wilcox
2020-04-20 20:57 ` [PATCH 04/15] print_integer, proc: rewrite /proc/self via print_integer() Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 05/15] print_integer, proc: rewrite /proc/thread-self " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 06/15] print_integer, proc: rewrite /proc/loadavg " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 07/15] print_integer, proc: rewrite /proc/stat " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 08/15] print_integer, proc: rewrite /proc/uptime " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 09/15] proc: s/p/tsk/ Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 10/15] print_integer, proc: rewrite /proc/*/fd via print_integer() Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 11/15] print_integer, proc: rewrite /proc/*/stat " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 12/15] print_integer, proc: rewrite /proc/*/statm " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 13/15] print_integer, printf: rewrite num_to_str() " Alexey Dobriyan
2020-04-20 20:57 ` [PATCH 14/15] print_integer, printf: rewrite the rest of lib/vsprintf.c " Alexey Dobriyan
2020-04-21 2:01 ` Steven Rostedt
2020-04-20 20:57 ` [PATCH 15/15] print_integer, proc: rewrite /proc/meminfo " Alexey Dobriyan
2020-04-20 21:05 ` [PATCH 01/15] sched: make nr_running() return "unsigned int" Matthew Wilcox
2020-04-21 17:06 ` Alexey Dobriyan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200421170836.4aafcbc2@gandalf.local.home \
--to=rostedt@goodmis.org \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=andriy.shevchenko@linux.intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@rasmusvillemoes.dk \
--cc=pmladek@suse.com \
--cc=sergey.senozhatsky@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).