* [PATCH 0/3] defer: misc updates
@ 2020-05-31  0:30 Akira Yokosawa
  2020-05-31  0:32 ` [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build Akira Yokosawa
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Akira Yokosawa @ 2020-05-31  0:30 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

Hi Paul,

These are misc updates in response to your recent updates.

Patch 1/3 treats QQZ annotations for "nq" build.
Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
your retouch for fluency.
Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
a few redundant runs of pdflatex when you have some typo in labels/refs.

Another suggestion, regarding Figures 9.25 and 9.29:
Wouldn't these graphs look better with a log-scale x-axis?

X range can be 0.001 -- 10.

You'll need to add a few data points in sub-microsecond critical-section
duration to show plausible shapes in those regions, though.

        Thanks, Akira
--
Akira Yokosawa (3):
  defer: Annotate consecutive QQZs as such for 'nq' build
  FAQ.txt: Advertise 'nq' build in #9
  runlatex.sh: Give up early on undefined refs

 FAQ.txt               |  8 ++++++--
 defer/rcuusage.tex    | 20 +++++++++++---------
 utilities/runlatex.sh | 42 +++++++++++++++++++++++++-----------------
 3 files changed, 42 insertions(+), 28 deletions(-)

-- 
2.17.1



* [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build
  2020-05-31  0:30 [PATCH 0/3] defer: misc updates Akira Yokosawa
@ 2020-05-31  0:32 ` Akira Yokosawa
  2020-05-31  0:33 ` [PATCH 2/3] FAQ.txt: Advertise 'nq' build in #9 Akira Yokosawa
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Akira Yokosawa @ 2020-05-31  0:32 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From ff1d191e40f23f5a0200ee3bbabe2073a6c03394 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Sun, 31 May 2020 08:12:02 +0900
Subject: [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 defer/rcuusage.tex | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
index d5329a39..72c1f331 100644
--- a/defer/rcuusage.tex
+++ b/defer/rcuusage.tex
@@ -241,12 +241,13 @@ reader-writer locking are shown in
 Figure~\ref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking},
 which was generated on a 448-CPU 2.10\,GHz Intel x86 system.
 
-\QuickQuiz{
+\QuickQuizSeries{%
+\QuickQuizB{
 	WTF?
 	How the heck do you expect me to believe that RCU can have less
 	than a 300-picosecond overhead when the clock period at 2.10\,GHz
 	is almost 500\,picoseconds?
-}\QuickQuizAnswer{
+}\QuickQuizAnswerB{
 	First, consider that the inner loop used to
 	take this measurement is as follows:
 
@@ -290,13 +291,13 @@ which was generated on a 448-CPU 2.10\,GHz Intel x86 system.
 
 	It certainly is not just every day that a timing measurement
 	of 267 picoseconds turns out to be an overestimate!
-}\QuickQuizEnd
+}\QuickQuizEndB
 
-\QuickQuiz{
+\QuickQuizM{
 	Didn't an earlier release of this book show RCU read-side
 	overhead way down in the sub-picosecond range?
 	What happened???
-}\QuickQuizAnswer{
+}\QuickQuizAnswerM{
 	Excellent memory!!!
 	The overhead in some early releases was in fact roughly
 	100~femtoseconds.
@@ -328,12 +329,12 @@ which was generated on a 448-CPU 2.10\,GHz Intel x86 system.
 	So which change had the most effect, Linus's commit or the change in
 	the system?
 	This question is left as an exercise to the reader.
-}\QuickQuizEnd
+}\QuickQuizEndM
 
-\QuickQuiz{
+\QuickQuizE{
 	Why is there such large variation for the \co{rcu} trace in
 	Figure~\ref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}?
-}\QuickQuizAnswer{
+}\QuickQuizAnswerE{
 	Keep in mind that this is a log-log plot, so those large-seeming
 	\co{rcu} variances in reality span only a few hundred picoseconds.
 	And that is such a short time that anything could cause it.
@@ -347,7 +348,8 @@ which was generated on a 448-CPU 2.10\,GHz Intel x86 system.
 	Attempting to reduce these variations by running the guest OSes
 	at real-time priority (as suggested by Joel Fernandes) is left
 	as an exercise for the reader.
-}\QuickQuizEnd
+}\QuickQuizEndE
+}                 % End of \QuickQuizSeries
 
 Note that reader-writer locking is more than an order of magnitude slower
 than RCU on a single CPU, and is more than \emph{four} orders of magnitude
-- 
2.17.1




* [PATCH 2/3] FAQ.txt: Advertise 'nq' build in #9
  2020-05-31  0:30 [PATCH 0/3] defer: misc updates Akira Yokosawa
  2020-05-31  0:32 ` [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build Akira Yokosawa
@ 2020-05-31  0:33 ` Akira Yokosawa
  2020-05-31  0:35 ` [PATCH 3/3] runlatex.sh: Give up early on undefined refs Akira Yokosawa
  2020-05-31 16:50 ` [PATCH 0/3] defer: misc updates Paul E. McKenney
  3 siblings, 0 replies; 15+ messages in thread
From: Akira Yokosawa @ 2020-05-31  0:33 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From 4c3ca479e8aeb16fee229d9d52eb30b42c7618cc Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Sun, 31 May 2020 08:16:33 +0900
Subject: [PATCH 2/3] FAQ.txt: Advertise 'nq' build in #9

The experimental target "nq" can be a partial solution for those
who don't like Quick Quizzes.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 FAQ.txt | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/FAQ.txt b/FAQ.txt
index e63117af..a284615f 100644
--- a/FAQ.txt
+++ b/FAQ.txt
@@ -158,8 +158,12 @@
 
 	A.	Quite a few people like them a lot, so they will be
 		staying.  However, you can easily produce a copy of the
-		book that omits the Quick Quizzes by editing the Makefile
-		and qqz.sty files in the top-level directory.
+		book that hides the quiz part of Quick Quizzes in the
+		text by "make nq".
+
+		If the resulting perfbook-nq.pdf does not satisfy you,
+		you can edit the Makefile and qqz.sty files in the
+		top-level directory as you like.
 
 		One approach is to make the "\QuickQuiz" command in
 		qqz.sty be a no-op and to add line to the Makefile that
-- 
2.17.1




* [PATCH 3/3] runlatex.sh: Give up early on undefined refs
  2020-05-31  0:30 [PATCH 0/3] defer: misc updates Akira Yokosawa
  2020-05-31  0:32 ` [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build Akira Yokosawa
  2020-05-31  0:33 ` [PATCH 2/3] FAQ.txt: Advertise 'nq' build in #9 Akira Yokosawa
@ 2020-05-31  0:35 ` Akira Yokosawa
  2020-05-31 16:50 ` [PATCH 0/3] defer: misc updates Paul E. McKenney
  3 siblings, 0 replies; 15+ messages in thread
From: Akira Yokosawa @ 2020-05-31  0:35 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From 073429a7f9e68b4b7f51c9b766bf8aa24da83081 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Sun, 31 May 2020 08:18:34 +0900
Subject: [PATCH 3/3] runlatex.sh: Give up early on undefined refs

Successive "undefined refs" warnings mean truly missing or misspelled labels/refs.

Add code to detect second "undefined refs" and to give up early.

Also add code to skip an unnecessary pdflatex run when .aux and
.bbl files are up-to-date.
This can happen when you run "make" after removing perfbook.pdf,
e.g., with a different LATEX_OPT setting.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 utilities/runlatex.sh | 42 +++++++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/utilities/runlatex.sh b/utilities/runlatex.sh
index 2aef9e2c..9687be95 100644
--- a/utilities/runlatex.sh
+++ b/utilities/runlatex.sh
@@ -51,6 +51,18 @@ identical_warnings () {
 	return 1 ;
 }
 
+exerpt_warnings () {
+	if grep -q "LaTeX Warning:" $basename.log
+	then
+		echo "----- Excerpt around remaining warning messages -----"
+		grep -B 8 -A 5 "LaTeX Warning:" $basename.log | tee $basename-warning.log
+		echo "----- You can see $basename-warning.log for the warnings above. -----"
+		echo "----- If you need to, see $basename.log for details. -----"
+		rm -f $basename-warning-prev.log
+		exit 1
+	fi
+}
+
 iterate_latex () {
 	pdflatex $LATEX_OPT $basename > /dev/null 2>&1 < /dev/null || :
 	if grep -q '! Emergency stop.' $basename.log
@@ -83,26 +95,30 @@ basename=`echo $1 | sed -e 's/\.tex$//'`
 
 if ! test -r $basename-first.log
 then
-	if ! sh utilities/mpostcheck.sh
-	then
-		exit 1
-	fi
+	echo "No need to update aux and bbl files."
 	echo "pdflatex 1 for $basename.pdf"
-	iterate_latex
+	iter=1
+else
+	rm -f $basename-first.log
+	echo "pdflatex 2 for $basename.pdf # for possible bib update"
+	iter=2
 fi
-rm -f $basename-first.log
-iter=2
-echo "pdflatex 2 for $basename.pdf # for possible bib update"
 iterate_latex
 min_iter=2
 while grep -q 'LaTeX Warning: There were undefined references' $basename.log
 do
+	if test $undefined_refs
+	then
+		echo "Undefined refs remain, giving up."
+		exerpt_warnings
+	fi
 	if identical_warnings
 	then
 		break
 	fi
 	iter=`expr $iter + 1`
 	echo "pdflatex $iter for $basename.pdf # remaining undefined refs"
+	undefined_refs=1
 	iterate_latex
 done
 min_iter=3
@@ -116,15 +132,7 @@ do
 	echo "pdflatex $iter for $basename.pdf # label(s) may have changed"
 	iterate_latex
 done
-if grep -q "LaTeX Warning:" $basename.log
-then
-	echo "----- Excerpt around remaining warning messages -----"
-	grep -B 8 -A 5 "LaTeX Warning:" $basename.log | tee $basename-warning.log
-	echo "----- You can see $basename-warning.log for the warnings above. -----"
-	echo "----- If you need to, see $basename.log for details. -----"
-	rm -f $basename-warning-prev.log
-	exit 1
-fi
+exerpt_warnings
 rm -f $basename-warning.log $basename-warning-prev.log
 echo "'$basename.pdf' is ready."
 # cleveref version check (Ubuntu 18.04 LTS has buggy one
-- 
2.17.1




* Re: [PATCH 0/3] defer: misc updates
  2020-05-31  0:30 [PATCH 0/3] defer: misc updates Akira Yokosawa
                   ` (2 preceding siblings ...)
  2020-05-31  0:35 ` [PATCH 3/3] runlatex.sh: Give up early on undefined refs Akira Yokosawa
@ 2020-05-31 16:50 ` Paul E. McKenney
  2020-05-31 23:11   ` Akira Yokosawa
  3 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-05-31 16:50 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> Hi Paul,
> 
> This is misc updates in response to your recent updates.
> 
> Patch 1/3 treats QQZ annotations for "nq" build.

Good reminder, thank you!

> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> your retouch for fluency.
> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> a few redundant runs of pdflatex when you have some typo in labels/refs.

Nice, queued and pushed, thank you!

> Another suggestion to Figures 9.25 and 9.29.
> Wouldn't these graphs look better with log scale x-axis?
> 
> X range can be 0.001 -- 10.
> 
> You'll need to add a few data points in sub-microsecond critical-section
> duration to show plausible shapes in those regions, though.

I took a quick look and didn't find any nanosecond delay primitives
in the Linux kernel, but yes, that would be nicer looking.

I don't expect to make further progress on this particular graph
in the immediate future, but if you know of such a delay primitive,
please don't keep it a secret!  ;-)

						Thanx, Paul

>         Thanks, Akira
> --
> Akira Yokosawa (3):
>   defer: Annotate consecutive QQZs as such for 'nq' build
>   FAQ.txt: Advertise 'nq' build in #9
>   runlatex.sh: Give up early on undefined refs
> 
>  FAQ.txt               |  8 ++++++--
>  defer/rcuusage.tex    | 20 +++++++++++---------
>  utilities/runlatex.sh | 42 +++++++++++++++++++++++++-----------------
>  3 files changed, 42 insertions(+), 28 deletions(-)
> 
> -- 
> 2.17.1
> 


* Re: [PATCH 0/3] defer: misc updates
  2020-05-31 16:50 ` [PATCH 0/3] defer: misc updates Paul E. McKenney
@ 2020-05-31 23:11   ` Akira Yokosawa
  2020-06-01  1:18     ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Akira Yokosawa @ 2020-05-31 23:11 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
>> Hi Paul,
>>
>> This is misc updates in response to your recent updates.
>>
>> Patch 1/3 treats QQZ annotations for "nq" build.
> 
> Good reminder, thank you!
> 
>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
>> your retouch for fluency.
>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> 
> Nice, queued and pushed, thank you!
> 
>> Another suggestion to Figures 9.25 and 9.29.
>> Wouldn't these graphs look better with log scale x-axis?
>>
>> X range can be 0.001 -- 10.
>>
>> You'll need to add a few data points in sub-microsecond critical-section
>> duration to show plausible shapes in those regions, though.
> 
> I took a quick look and didn't find any nanosecond delay primitives
> in the Linux kernel, but yes, that would be nicer looking.
> 
> I don't expect to make further progress on this particular graph
> in the immediate future, but if you know of such a delay primitive,
> please don't keep it a secret!  ;-)

I found ndelay() defined in include/asm-generic/delay.h.
I'm not sure if it works as you would expect, though.

        Thanks, Akira

> 
> 						Thanx, Paul
> 
>>         Thanks, Akira
>> --
>> Akira Yokosawa (3):
>>   defer: Annotate consecutive QQZs as such for 'nq' build
>>   FAQ.txt: Advertise 'nq' build in #9
>>   runlatex.sh: Give up early on undefined refs
>>
>>  FAQ.txt               |  8 ++++++--
>>  defer/rcuusage.tex    | 20 +++++++++++---------
>>  utilities/runlatex.sh | 42 +++++++++++++++++++++++++-----------------
>>  3 files changed, 42 insertions(+), 28 deletions(-)
>>
>> -- 
>> 2.17.1
>>


* Re: [PATCH 0/3] defer: misc updates
  2020-05-31 23:11   ` Akira Yokosawa
@ 2020-06-01  1:18     ` Paul E. McKenney
  2020-06-01 15:10       ` Akira Yokosawa
  0 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-06-01  1:18 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> > On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >> Hi Paul,
> >>
> >> This is misc updates in response to your recent updates.
> >>
> >> Patch 1/3 treats QQZ annotations for "nq" build.
> > 
> > Good reminder, thank you!
> > 
> >> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> >> your retouch for fluency.
> >> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> >> a few redundant runs of pdflatex when you have some typo in labels/refs.
> > 
> > Nice, queued and pushed, thank you!
> > 
> >> Another suggestion to Figures 9.25 and 9.29.
> >> Wouldn't these graphs look better with log scale x-axis?
> >>
> >> X range can be 0.001 -- 10.
> >>
> >> You'll need to add a few data points in sub-microsecond critical-section
> >> duration to show plausible shapes in those regions, though.
> > 
> > I took a quick look and didn't find any nanosecond delay primitives
> > in the Linux kernel, but yes, that would be nicer looking.
> > 
> > I don't expect to make further progress on this particular graph
> > in the immediate future, but if you know of such a delay primitive,
> > please don't keep it a secret!  ;-)
> 
> I find ndelay() defined in include/asm_generic/delay.h.
> I'm not sure if it works as you would expect, though.

I must be going blind, given that I missed that one!

I did try it out, and it suffers from about 10% timing errors.  In
contrast, udelay is usually less than 1%.  But how about as shown below?

							Thanx, Paul

------------------------------------------------------------------------

commit 7d9ab703b0a33ff5f8db330f0bac3dde9deead07
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Sun May 31 18:14:57 2020 -0700

    refperf: Change readdelay module parameter to nanoseconds
    
    The current units of microseconds are too coarse, so this commit
    changes the units to nanoseconds.  However, ndelay is used only for the
    nanoseconds with udelay being used for whole microseconds.  For example,
    setting refperf.readdelay=1500 results in an ndelay(500) followed by
    a udelay(1).
    
    Suggested-by: Akira Yokosawa <akiyks@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
index 3b72925..96f8ba0 100644
--- a/kernel/rcu/refperf.c
+++ b/kernel/rcu/refperf.c
@@ -66,8 +66,8 @@ torture_param(long, loops, 10000, "Number of loops per experiment.");
 torture_param(int, nreaders, -1, "Number of loops per experiment.");
 // Number of runs.
 torture_param(int, nruns, 30, "Number of experiments to run.");
-// Reader delay in microseconds, 0 for no delay.
-torture_param(int, readdelay, 0, "Read-side delay in microseconds.");
+// Reader delay in nanoseconds, 0 for no delay.
+torture_param(int, readdelay, 0, "Read-side delay in nanoseconds.");
 
 #ifdef MODULE
 # define REFPERF_SHUTDOWN 0
@@ -111,7 +111,8 @@ struct ref_perf_ops {
 	void (*init)(void);
 	void (*cleanup)(void);
 	void (*readsection)(const int nloops);
-	void (*delaysection)(const int nloops, const int ndelay);
+	void (*delaysection)(const int nloops,
+			     const int udelay, const int ndelay);
 	const char *name;
 };
 
@@ -127,13 +128,17 @@ static void ref_rcu_read_section(const int nloops)
 	}
 }
 
-static void ref_rcu_delay_section(const int nloops, const int ndelay)
+static void
+ref_rcu_delay_section(const int nloops, const int udelay, const int ndelay)
 {
 	int i;
 
 	for (i = nloops; i >= 0; i--) {
 		rcu_read_lock();
-		udelay(ndelay);
+		if (udelay)
+			udelay(udelay);
+		if (ndelay)
+			ndelay(ndelay);
 		rcu_read_unlock();
 	}
 }
@@ -165,14 +170,18 @@ static void srcu_ref_perf_read_section(const int nloops)
 	}
 }
 
-static void srcu_ref_perf_delay_section(const int nloops, const int ndelay)
+static void srcu_ref_perf_delay_section(const int nloops,
+					const int udelay, const int ndelay)
 {
 	int i;
 	int idx;
 
 	for (i = nloops; i >= 0; i--) {
 		idx = srcu_read_lock(srcu_ctlp);
-		udelay(ndelay);
+		if (udelay)
+			udelay(udelay);
+		if (ndelay)
+			ndelay(ndelay);
 		srcu_read_unlock(srcu_ctlp, idx);
 	}
 }
@@ -197,13 +206,17 @@ static void ref_refcnt_section(const int nloops)
 	}
 }
 
-static void ref_refcnt_delay_section(const int nloops, const int ndelay)
+static void
+ref_refcnt_delay_section(const int nloops, const int udelay, const int ndelay)
 {
 	int i;
 
 	for (i = nloops; i >= 0; i--) {
 		atomic_inc(&refcnt);
-		udelay(ndelay);
+		if (udelay)
+			udelay(udelay);
+		if (ndelay)
+			ndelay(ndelay);
 		atomic_dec(&refcnt);
 	}
 }
@@ -233,13 +246,17 @@ static void ref_rwlock_section(const int nloops)
 	}
 }
 
-static void ref_rwlock_delay_section(const int nloops, const int ndelay)
+static void
+ref_rwlock_delay_section(const int nloops, const int udelay, const int ndelay)
 {
 	int i;
 
 	for (i = nloops; i >= 0; i--) {
 		read_lock(&test_rwlock);
-		udelay(ndelay);
+		if (udelay)
+			udelay(udelay);
+		if (ndelay)
+			ndelay(ndelay);
 		read_unlock(&test_rwlock);
 	}
 }
@@ -269,13 +286,17 @@ static void ref_rwsem_section(const int nloops)
 	}
 }
 
-static void ref_rwsem_delay_section(const int nloops, const int ndelay)
+static void
+ref_rwsem_delay_section(const int nloops, const int udelay, const int ndelay)
 {
 	int i;
 
 	for (i = nloops; i >= 0; i--) {
 		down_read(&test_rwsem);
-		udelay(ndelay);
+		if (udelay)
+			udelay(udelay);
+		if (ndelay)
+			ndelay(ndelay);
 		up_read(&test_rwsem);
 	}
 }
@@ -292,7 +313,8 @@ static void rcu_perf_one_reader(void)
 	if (readdelay <= 0)
 		cur_ops->readsection(loops);
 	else
-		cur_ops->delaysection(loops, readdelay);
+		cur_ops->delaysection(loops,
+				      readdelay / 1000, readdelay % 1000);
 }
 
 // Reader kthread.  Repeatedly does empty RCU read-side


* Re: [PATCH 0/3] defer: misc updates
  2020-06-01  1:18     ` Paul E. McKenney
@ 2020-06-01 15:10       ` Akira Yokosawa
  2020-06-01 16:13         ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Akira Yokosawa @ 2020-06-01 15:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
>>>> Hi Paul,
>>>>
>>>> This is misc updates in response to your recent updates.
>>>>
>>>> Patch 1/3 treats QQZ annotations for "nq" build.
>>>
>>> Good reminder, thank you!
>>>
>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
>>>> your retouch for fluency.
>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
>>>
>>> Nice, queued and pushed, thank you!
>>>
>>>> Another suggestion to Figures 9.25 and 9.29.
>>>> Wouldn't these graphs look better with log scale x-axis?
>>>>
>>>> X range can be 0.001 -- 10.
>>>>
>>>> You'll need to add a few data points in sub-microsecond critical-section
>>>> duration to show plausible shapes in those regions, though.
>>>
>>> I took a quick look and didn't find any nanosecond delay primitives
>>> in the Linux kernel, but yes, that would be nicer looking.
>>>
>>> I don't expect to make further progress on this particular graph
>>> in the immediate future, but if you know of such a delay primitive,
>>> please don't keep it a secret!  ;-)
>>
>> I find ndelay() defined in include/asm_generic/delay.h.
>> I'm not sure if it works as you would expect, though.
> 
> I must be going blind, given that I missed that one!

:-) :-)

> 
> I did try it out, and it suffers from about 10% timing errors.  In
> contrast, udelay is usually less than 1%.

You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
error is about 100ns?

Looking at the definitions of __udelay() and __ndelay() in
arch/x86/lib/delay.c, the constant 0x10c7 has many more significant bits
than 0x0005. This is likely the cause of the difference in errors.

>                                           But how about as shown below?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 7d9ab703b0a33ff5f8db330f0bac3dde9deead07
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Sun May 31 18:14:57 2020 -0700
> 
>     refperf: Change readdelay module parameter to nanoseconds
>     
>     The current units of microseconds are too coarse, so this commit
>     changes the units to nanoseconds.  However, ndelay is used only for the
>     nanoseconds with udelay being used for whole microseconds.  For example,
>     setting refperf.readdelay=1500 results in an ndelay(500) followed by
>     a udelay(1).

Your code below does the opposite, udelay(1) followed by ndelay(500), doesn't it?

>     
>     Suggested-by: Akira Yokosawa <akiyks@gmail.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> index 3b72925..96f8ba0 100644
> --- a/kernel/rcu/refperf.c
> +++ b/kernel/rcu/refperf.c
> @@ -66,8 +66,8 @@ torture_param(long, loops, 10000, "Number of loops per experiment.");
>  torture_param(int, nreaders, -1, "Number of loops per experiment.");
>  // Number of runs.
>  torture_param(int, nruns, 30, "Number of experiments to run.");
> -// Reader delay in microseconds, 0 for no delay.
> -torture_param(int, readdelay, 0, "Read-side delay in microseconds.");
> +// Reader delay in nanoseconds, 0 for no delay.
> +torture_param(int, readdelay, 0, "Read-side delay in nanoseconds.");
>  
>  #ifdef MODULE
>  # define REFPERF_SHUTDOWN 0
> @@ -111,7 +111,8 @@ struct ref_perf_ops {
>  	void (*init)(void);
>  	void (*cleanup)(void);
>  	void (*readsection)(const int nloops);
> -	void (*delaysection)(const int nloops, const int ndelay);
> +	void (*delaysection)(const int nloops,
> +			     const int udelay, const int ndelay);
>  	const char *name;
>  };
>  
> @@ -127,13 +128,17 @@ static void ref_rcu_read_section(const int nloops)
>  	}
>  }
>  
> -static void ref_rcu_delay_section(const int nloops, const int ndelay)
> +static void
> +ref_rcu_delay_section(const int nloops, const int udelay, const int ndelay)
>  {
>  	int i;
>  
>  	for (i = nloops; i >= 0; i--) {
>  		rcu_read_lock();
> -		udelay(ndelay);
> +		if (udelay)
> +			udelay(udelay);
> +		if (ndelay)
> +			ndelay(ndelay);
>  		rcu_read_unlock();
>  	}
>  }
> @@ -165,14 +170,18 @@ static void srcu_ref_perf_read_section(const int nloops)
>  	}
>  }
>  
> -static void srcu_ref_perf_delay_section(const int nloops, const int ndelay)
> +static void srcu_ref_perf_delay_section(const int nloops,
> +					const int udelay, const int ndelay)
>  {
>  	int i;
>  	int idx;
>  
>  	for (i = nloops; i >= 0; i--) {
>  		idx = srcu_read_lock(srcu_ctlp);
> -		udelay(ndelay);
> +		if (udelay)
> +			udelay(udelay);
> +		if (ndelay)
> +			ndelay(ndelay);
>  		srcu_read_unlock(srcu_ctlp, idx);
>  	}
>  }
> @@ -197,13 +206,17 @@ static void ref_refcnt_section(const int nloops)
>  	}
>  }
>  
> -static void ref_refcnt_delay_section(const int nloops, const int ndelay)
> +static void
> +ref_refcnt_delay_section(const int nloops, const int udelay, const int ndelay)
>  {
>  	int i;
>  
>  	for (i = nloops; i >= 0; i--) {
>  		atomic_inc(&refcnt);
> -		udelay(ndelay);
> +		if (udelay)
> +			udelay(udelay);
> +		if (ndelay)
> +			ndelay(ndelay);
>  		atomic_dec(&refcnt);
>  	}
>  }
> @@ -233,13 +246,17 @@ static void ref_rwlock_section(const int nloops)
>  	}
>  }
>  
> -static void ref_rwlock_delay_section(const int nloops, const int ndelay)
> +static void
> +ref_rwlock_delay_section(const int nloops, const int udelay, const int ndelay)
>  {
>  	int i;
>  
>  	for (i = nloops; i >= 0; i--) {
>  		read_lock(&test_rwlock);
> -		udelay(ndelay);
> +		if (udelay)
> +			udelay(udelay);
> +		if (ndelay)
> +			ndelay(ndelay);
>  		read_unlock(&test_rwlock);
>  	}
>  }
> @@ -269,13 +286,17 @@ static void ref_rwsem_section(const int nloops)
>  	}
>  }
>  
> -static void ref_rwsem_delay_section(const int nloops, const int ndelay)
> +static void
> +ref_rwsem_delay_section(const int nloops, const int udelay, const int ndelay)
>  {
>  	int i;
>  
>  	for (i = nloops; i >= 0; i--) {
>  		down_read(&test_rwsem);
> -		udelay(ndelay);
> +		if (udelay)
> +			udelay(udelay);
> +		if (ndelay)
> +			ndelay(ndelay);

Maybe defining a helper function/macro for this common pattern in rcu.h
could reduce the maintenance cost. Say undelay(udl, ndl)?

        Thanks, Akira

>  		up_read(&test_rwsem);
>  	}
>  }
> @@ -292,7 +313,8 @@ static void rcu_perf_one_reader(void)
>  	if (readdelay <= 0)
>  		cur_ops->readsection(loops);
>  	else
> -		cur_ops->delaysection(loops, readdelay);
> +		cur_ops->delaysection(loops,
> +				      readdelay / 1000, readdelay % 1000);
>  }
>  
>  // Reader kthread.  Repeatedly does empty RCU read-side
> 


* Re: [PATCH 0/3] defer: misc updates
  2020-06-01 15:10       ` Akira Yokosawa
@ 2020-06-01 16:13         ` Paul E. McKenney
  2020-06-01 22:51           ` Akira Yokosawa
  0 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-06-01 16:13 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> > On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> >> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> >>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >>>> Hi Paul,
> >>>>
> >>>> This is misc updates in response to your recent updates.
> >>>>
> >>>> Patch 1/3 treats QQZ annotations for "nq" build.
> >>>
> >>> Good reminder, thank you!
> >>>
> >>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> >>>> your retouch for fluency.
> >>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> >>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> >>>
> >>> Nice, queued and pushed, thank you!
> >>>
> >>>> Another suggestion to Figures 9.25 and 9.29.
> >>>> Wouldn't these graphs look better with log scale x-axis?
> >>>>
> >>>> X range can be 0.001 -- 10.
> >>>>
> >>>> You'll need to add a few data points in sub-microsecond critical-section
> >>>> duration to show plausible shapes in those regions, though.
> >>>
> >>> I took a quick look and didn't find any nanosecond delay primitives
> >>> in the Linux kernel, but yes, that would be nicer looking.
> >>>
> >>> I don't expect to make further progress on this particular graph
> >>> in the immediate future, but if you know of such a delay primitive,
> >>> please don't keep it a secret!  ;-)
> >>
> >> I find ndelay() defined in include/asm_generic/delay.h.
> >> I'm not sure if it works as you would expect, though.
> > 
> > I must be going blind, given that I missed that one!
> 
> :-) :-)
> 
> > I did try it out, and it suffers from about 10% timing errors.  In
> > contrast, udelay is usually less than 1%.
> 
> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
> error is about 100ns?

Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
to be worse than that.  100ns gets me about 130ns, 200ns gets me about
270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
very short delays.

> Looking at the definition of __udelay() and __ndelay() in
> arch/x86/lib/delay.c, the constant 0x10c7 has much effective bits
> than 0x0005. This is likely the cause of difference in errors.

That makes a lot of sense, and thank you for checking!

> >                                           But how about as shown below?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit 7d9ab703b0a33ff5f8db330f0bac3dde9deead07
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Sun May 31 18:14:57 2020 -0700
> > 
> >     refperf: Change readdelay module parameter to nanoseconds
> >     
> >     The current units of microseconds are too coarse, so this commit
> >     changes the units to nanoseconds.  However, ndelay is used only for the
> >     nanoseconds with udelay being used for whole microseconds.  For example,
> >     setting refperf.readdelay=1500 results in an ndelay(500) followed by
> >     a udelay(1).
> 
> Your code below looks opposite, udelay(1) + ndelay(500), doesn't it?

Indeed it does!  I will fix this, thank you.

> >     Suggested-by: Akira Yokosawa <akiyks@gmail.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> > index 3b72925..96f8ba0 100644
> > --- a/kernel/rcu/refperf.c
> > +++ b/kernel/rcu/refperf.c
> > @@ -66,8 +66,8 @@ torture_param(long, loops, 10000, "Number of loops per experiment.");
> >  torture_param(int, nreaders, -1, "Number of loops per experiment.");
> >  // Number of runs.
> >  torture_param(int, nruns, 30, "Number of experiments to run.");
> > -// Reader delay in microseconds, 0 for no delay.
> > -torture_param(int, readdelay, 0, "Read-side delay in microseconds.");
> > +// Reader delay in nanoseconds, 0 for no delay.
> > +torture_param(int, readdelay, 0, "Read-side delay in nanoseconds.");
> >  
> >  #ifdef MODULE
> >  # define REFPERF_SHUTDOWN 0
> > @@ -111,7 +111,8 @@ struct ref_perf_ops {
> >  	void (*init)(void);
> >  	void (*cleanup)(void);
> >  	void (*readsection)(const int nloops);
> > -	void (*delaysection)(const int nloops, const int ndelay);
> > +	void (*delaysection)(const int nloops,
> > +			     const int udelay, const int ndelay);
> >  	const char *name;
> >  };
> >  
> > @@ -127,13 +128,17 @@ static void ref_rcu_read_section(const int nloops)
> >  	}
> >  }
> >  
> > -static void ref_rcu_delay_section(const int nloops, const int ndelay)
> > +static void
> > +ref_rcu_delay_section(const int nloops, const int udelay, const int ndelay)
> >  {
> >  	int i;
> >  
> >  	for (i = nloops; i >= 0; i--) {
> >  		rcu_read_lock();
> > -		udelay(ndelay);
> > +		if (udelay)
> > +			udelay(udelay);
> > +		if (ndelay)
> > +			ndelay(ndelay);
> >  		rcu_read_unlock();
> >  	}
> >  }
> > @@ -165,14 +170,18 @@ static void srcu_ref_perf_read_section(const int nloops)
> >  	}
> >  }
> >  
> > -static void srcu_ref_perf_delay_section(const int nloops, const int ndelay)
> > +static void srcu_ref_perf_delay_section(const int nloops,
> > +					const int udelay, const int ndelay)
> >  {
> >  	int i;
> >  	int idx;
> >  
> >  	for (i = nloops; i >= 0; i--) {
> >  		idx = srcu_read_lock(srcu_ctlp);
> > -		udelay(ndelay);
> > +		if (udelay)
> > +			udelay(udelay);
> > +		if (ndelay)
> > +			ndelay(ndelay);
> >  		srcu_read_unlock(srcu_ctlp, idx);
> >  	}
> >  }
> > @@ -197,13 +206,17 @@ static void ref_refcnt_section(const int nloops)
> >  	}
> >  }
> >  
> > -static void ref_refcnt_delay_section(const int nloops, const int ndelay)
> > +static void
> > +ref_refcnt_delay_section(const int nloops, const int udelay, const int ndelay)
> >  {
> >  	int i;
> >  
> >  	for (i = nloops; i >= 0; i--) {
> >  		atomic_inc(&refcnt);
> > -		udelay(ndelay);
> > +		if (udelay)
> > +			udelay(udelay);
> > +		if (ndelay)
> > +			ndelay(ndelay);
> >  		atomic_dec(&refcnt);
> >  	}
> >  }
> > @@ -233,13 +246,17 @@ static void ref_rwlock_section(const int nloops)
> >  	}
> >  }
> >  
> > -static void ref_rwlock_delay_section(const int nloops, const int ndelay)
> > +static void
> > +ref_rwlock_delay_section(const int nloops, const int udelay, const int ndelay)
> >  {
> >  	int i;
> >  
> >  	for (i = nloops; i >= 0; i--) {
> >  		read_lock(&test_rwlock);
> > -		udelay(ndelay);
> > +		if (udelay)
> > +			udelay(udelay);
> > +		if (ndelay)
> > +			ndelay(ndelay);
> >  		read_unlock(&test_rwlock);
> >  	}
> >  }
> > @@ -269,13 +286,17 @@ static void ref_rwsem_section(const int nloops)
> >  	}
> >  }
> >  
> > -static void ref_rwsem_delay_section(const int nloops, const int ndelay)
> > +static void
> > +ref_rwsem_delay_section(const int nloops, const int udelay, const int ndelay)
> >  {
> >  	int i;
> >  
> >  	for (i = nloops; i >= 0; i--) {
> >  		down_read(&test_rwsem);
> > -		udelay(ndelay);
> > +		if (udelay)
> > +			udelay(udelay);
> > +		if (ndelay)
> > +			ndelay(ndelay);
> 
> Maybe defining a helper function/macro for this common pattern in rcu.h 
> can ease maintenance cost. Say undelay(udl, ndl)?

Good point!  Updated and currently testing.
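
A minimal user-space sketch of such a helper follows, combined with the
readdelay / 1000 and readdelay % 1000 split used by the caller in the
patch.  The un_delay() shape mirrors the one that appears later in this
thread.  The udelay_stub() and ndelay_stub() functions and delay_ns() are
hypothetical stand-ins so that the split can be inspected outside the
kernel.

```c
/*
 * User-space stand-ins for the kernel's udelay()/ndelay().  They only
 * record their arguments; they do not actually delay.
 */
static unsigned long last_udelay, last_ndelay;

static void udelay_stub(unsigned long us) { last_udelay = us; }
static void ndelay_stub(unsigned long ns) { last_ndelay = ns; }

/* Whole microseconds go to udelay(), the remainder to ndelay(). */
static void un_delay(const int udl, const int ndl)
{
	if (udl)
		udelay_stub(udl);
	if (ndl)
		ndelay_stub(ndl);
}

/* Split a nanosecond-valued delay the way the patch's caller does. */
static void delay_ns(const int readdelay)
{
	un_delay(readdelay / 1000, readdelay % 1000);
}
```

With this sketch, delay_ns(1500) records a udelay of 1 followed by an
ndelay of 500.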

Thank you for your review and comments!

							Thanx, Paul

>         Thanks, Akira
> 
> >  		up_read(&test_rwsem);
> >  	}
> >  }
> > @@ -292,7 +313,8 @@ static void rcu_perf_one_reader(void)
> >  	if (readdelay <= 0)
> >  		cur_ops->readsection(loops);
> >  	else
> > -		cur_ops->delaysection(loops, readdelay);
> > +		cur_ops->delaysection(loops,
> > +				      readdelay / 1000, readdelay % 1000);
> >  }
> >  
> >  // Reader kthread.  Repeatedly does empty RCU read-side
> > 


* Re: [PATCH 0/3] defer: misc updates
  2020-06-01 16:13         ` Paul E. McKenney
@ 2020-06-01 22:51           ` Akira Yokosawa
  2020-06-01 23:45             ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Akira Yokosawa @ 2020-06-01 22:51 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> This is misc updates in response to your recent updates.
>>>>>>
>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
>>>>>
>>>>> Good reminder, thank you!
>>>>>
>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
>>>>>> your retouch for fluency.
>>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
>>>>>
>>>>> Nice, queued and pushed, thank you!
>>>>>
>>>>>> Another suggestion to Figures 9.25 and 9.29.
>>>>>> Wouldn't these graphs look better with log scale x-axis?
>>>>>>
>>>>>> X range can be 0.001 -- 10.
>>>>>>
>>>>>> You'll need to add a few data points in sub-microsecond critical-section
>>>>>> duration to show plausible shapes in those regions, though.
>>>>>
>>>>> I took a quick look and didn't find any nanosecond delay primitives
>>>>> in the Linux kernel, but yes, that would be nicer looking.
>>>>>
>>>>> I don't expect to make further progress on this particular graph
>>>>> in the immediate future, but if you know of such a delay primitive,
>>>>> please don't keep it a secret!  ;-)
>>>>
>>>> I find ndelay() defined in include/asm-generic/delay.h.
>>>> I'm not sure if it works as you would expect, though.
>>>
>>> I must be going blind, given that I missed that one!
>>
>> :-) :-)
>>
>>> I did try it out, and it suffers from about 10% timing errors.  In
>>> contrast, udelay is usually less than 1%.
>>
>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
>> error is about 100ns?
> 
> Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
> to be worse than that.  100ns gets me about 130ns, 200ns gets me about
> 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
> very short delays.

To compensate for the error, how about applying the appended change?
Yes, this is kind of ugly...

Another point you should be aware of: it looks like arch/powerpc
does not have __ndelay defined, which means ndelay() would cause a
build error.  Still, I might be missing something.

        Thanks, Akira

diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
index 5db165ecd465..0a3764ea220c 100644
--- a/kernel/rcu/refperf.c
+++ b/kernel/rcu/refperf.c
@@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
        if (udl)
                udelay(udl);
        if (ndl)
-               ndelay(ndl);
+               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
 }
 
 static void ref_rcu_read_section(const int nloops)





* Re: [PATCH 0/3] defer: misc updates
  2020-06-01 22:51           ` Akira Yokosawa
@ 2020-06-01 23:45             ` Paul E. McKenney
  2020-06-02 14:27               ` Akira Yokosawa
  0 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-06-01 23:45 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
> >> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> >>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> >>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> >>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> This is misc updates in response to your recent updates.
> >>>>>>
> >>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
> >>>>>
> >>>>> Good reminder, thank you!
> >>>>>
> >>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> >>>>>> your retouch for fluency.
> >>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> >>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> >>>>>
> >>>>> Nice, queued and pushed, thank you!
> >>>>>
> >>>>>> Another suggestion to Figures 9.25 and 9.29.
> >>>>>> Wouldn't these graphs look better with log scale x-axis?
> >>>>>>
> >>>>>> X range can be 0.001 -- 10.
> >>>>>>
> >>>>>> You'll need to add a few data points in sub-microsecond critical-section
> >>>>>> duration to show plausible shapes in those regions, though.
> >>>>>
> >>>>> I took a quick look and didn't find any nanosecond delay primitives
> >>>>> in the Linux kernel, but yes, that would be nicer looking.
> >>>>>
> >>>>> I don't expect to make further progress on this particular graph
> >>>>> in the immediate future, but if you know of such a delay primitive,
> >>>>> please don't keep it a secret!  ;-)
> >>>>
> >>>> I find ndelay() defined in include/asm-generic/delay.h.
> >>>> I'm not sure if it works as you would expect, though.
> >>>
> >>> I must be going blind, given that I missed that one!
> >>
> >> :-) :-)
> >>
> >>> I did try it out, and it suffers from about 10% timing errors.  In
> >>> contrast, udelay is usually less than 1%.
> >>
> >> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
> >> error is about 100ns?
> > 
> > Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
> > to be worse than that.  100ns gets me about 130ns, 200ns gets me about
> > 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
> > very short delays.
> 
> To compensate for the error, how about applying the appended change?
> Yes, this is kind of ugly...
> 
> Another point you should be aware of: it looks like arch/powerpc
> does not have __ndelay defined, which means ndelay() would cause a
> build error.  Still, I might be missing something.

That is quite clever!  It does turn ndelay(1) into ndelay(0), but it
probably costs more than a nanosecond to do the integer division, so
that shouldn't be a problem.
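
The truncation mentioned above can be seen in a small user-space sketch
of the proposed scaling.  The factor 859/1000 is taken from the appended
diff; reading it as an approximation of 4.295/5 is an inference from the
comment in that diff.  The function name compensate_ndl() is hypothetical.

```c
/*
 * Scale a requested nanosecond delay by 859/1000, as in the proposed
 * change to un_delay().  Integer division truncates toward zero, so
 * very small requests collapse to 0.
 */
static int compensate_ndl(const int ndl)
{
	return (ndl * 859) / 1000;
}
```

Here compensate_ndl(1) yields 0, compensate_ndl(100) yields 85, and
compensate_ndl(500) yields 429.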

However, I believe that any such compensatory schemes should be done
within ndelay() rather than by its users.  Plus, as you imply, different
architectures might need different adjustments.  My concern is that
different CPU generations within a given architecture might also need
different adjustments. :-(

							Thanx, Paul

>         Thanks, Akira
> 
> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> index 5db165ecd465..0a3764ea220c 100644
> --- a/kernel/rcu/refperf.c
> +++ b/kernel/rcu/refperf.c
> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
>         if (udl)
>                 udelay(udl);
>         if (ndl)
> -               ndelay(ndl);
> +               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
>  }
>  
>  static void ref_rcu_read_section(const int nloops)
> 
> 
> 


* Re: [PATCH 0/3] defer: misc updates
  2020-06-01 23:45             ` Paul E. McKenney
@ 2020-06-02 14:27               ` Akira Yokosawa
  2020-06-02 15:28                 ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Akira Yokosawa @ 2020-06-02 14:27 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Mon, 1 Jun 2020 16:45:45 -0700, Paul E. McKenney wrote:
> On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
>> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
>>> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
>>>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
>>>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
>>>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
>>>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> This is misc updates in response to your recent updates.
>>>>>>>>
>>>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
>>>>>>>
>>>>>>> Good reminder, thank you!
>>>>>>>
>>>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
>>>>>>>> your retouch for fluency.
>>>>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
>>>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
>>>>>>>
>>>>>>> Nice, queued and pushed, thank you!
>>>>>>>
>>>>>>>> Another suggestion to Figures 9.25 and 9.29.
>>>>>>>> Wouldn't these graphs look better with log scale x-axis?
>>>>>>>>
>>>>>>>> X range can be 0.001 -- 10.
>>>>>>>>
>>>>>>>> You'll need to add a few data points in sub-microsecond critical-section
>>>>>>>> duration to show plausible shapes in those regions, though.
>>>>>>>
>>>>>>> I took a quick look and didn't find any nanosecond delay primitives
>>>>>>> in the Linux kernel, but yes, that would be nicer looking.
>>>>>>>
>>>>>>> I don't expect to make further progress on this particular graph
>>>>>>> in the immediate future, but if you know of such a delay primitive,
>>>>>>> please don't keep it a secret!  ;-)
>>>>>>
>>>>>> I find ndelay() defined in include/asm-generic/delay.h.
>>>>>> I'm not sure if it works as you would expect, though.
>>>>>
>>>>> I must be going blind, given that I missed that one!
>>>>
>>>> :-) :-)
>>>>
>>>>> I did try it out, and it suffers from about 10% timing errors.  In
>>>>> contrast, udelay is usually less than 1%.
>>>>
>>>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
>>>> error is about 100ns?
>>>
>>> Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
>>> to be worse than that.  100ns gets me about 130ns, 200ns gets me about
>>> 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
>>> very short delays.
>>
>> To compensate for the error, how about applying the appended change?
>> Yes, this is kind of ugly...
>>
>> Another point you should be aware of: it looks like arch/powerpc
>> does not have __ndelay defined, which means ndelay() would cause a
>> build error.  Still, I might be missing something.
> 
> That is quite clever!  It does turn ndelay(1) into ndelay(0), but it
> probably costs more than a nanosecond to do the integer division, so
> that shouldn't be a problem.
> 
> However, I believe that any such compensatory schemes should be done
> within ndelay() rather than by its users.

I'm not brave enough to change the behavior of ndelay() seeing the
number of call sites in kernel code base, especially under drivers/.

Looking at the updated Figures 9.25 and 9.29, the timing error of
ndelay() results in the discrepancy of "rcu" plots from the ideal
orthogonal lines in sub-microseconds regions (0.1, 0.2, and 0.5us).
I don't think you like such misleading plots.

You could instead compensate the x-values you give to ndelay().

On x86, you know the resolution of xdelay() is 1.164153ns.
Which means if you want a time delay of 100ns, ndelay(86) will
be 100.117ns.
ndelay(172) will be 200.234ns and ndelay(429) will be 499.422ns.
ndelay(430) will be 500.586ns, which is the 2nd closest.
If you don't want to fall short of 500ns, ndelay(430) would be your choice.
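
The arithmetic above can be reproduced with a short user-space sketch.
It assumes the 1.164153ns-per-unit figure quoted above and models the
delay as strictly proportional to the argument, which ignores call
overhead.  The names NS_PER_NDELAY_UNIT, modeled_delay_ns(), and
closest_ndelay_arg() are hypothetical.

```c
/* Modeled nanoseconds per unit of ndelay() argument (figure from above). */
#define NS_PER_NDELAY_UNIT 1.164153

/* Delay, in ns, expected from ndelay(arg) under this simple model. */
static double modeled_delay_ns(int arg)
{
	return arg * NS_PER_NDELAY_UNIT;
}

/* Argument whose modeled delay is closest to target_ns. */
static int closest_ndelay_arg(int target_ns)
{
	int lo = (int)(target_ns / NS_PER_NDELAY_UNIT);	/* rounds down */
	double err_lo = target_ns - modeled_delay_ns(lo);
	double err_hi = modeled_delay_ns(lo + 1) - target_ns;

	return err_lo <= err_hi ? lo : lo + 1;
}
```

Under this model, the closest arguments for 100ns, 200ns, and 500ns are
86, 172, and 429 respectively, and 430 is the smallest argument that
does not fall below 500ns.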

I think this level of tweak is worthwhile, especially since it will
result in a better-looking plot of RCU scaling.

Thoughts?

        Thanks, Akira

PS: The bumps in Figures 9.25 and 9.29 in the sub-microsecond region
might be an effect of differences in the instruction stream.
As we have seen in Figure 9.22, slight changes in the code path,
e.g. jump target alignment, can cause a 10% -- 20% performance
difference.

Forcing un_delay() to be inlined might or might not help. Just guessing.


>                                           Plus, as you imply, different
> architectures might need different adjustments.  My concern is that
> different CPU generations within a given architecture might also need
> different adjustments. :-(
> 
> 							Thanx, Paul
> 
>>         Thanks, Akira
>>
>> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
>> index 5db165ecd465..0a3764ea220c 100644
>> --- a/kernel/rcu/refperf.c
>> +++ b/kernel/rcu/refperf.c
>> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
>>         if (udl)
>>                 udelay(udl);
>>         if (ndl)
>> -               ndelay(ndl);
>> +               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
>>  }
>>  
>>  static void ref_rcu_read_section(const int nloops)
>>
>>
>>


* Re: [PATCH 0/3] defer: misc updates
  2020-06-02 14:27               ` Akira Yokosawa
@ 2020-06-02 15:28                 ` Paul E. McKenney
  2020-06-02 23:05                   ` Akira Yokosawa
  0 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-06-02 15:28 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Tue, Jun 02, 2020 at 11:27:37PM +0900, Akira Yokosawa wrote:
> On Mon, 1 Jun 2020 16:45:45 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
> >> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
> >>> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
> >>>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> >>>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> >>>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> >>>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> This is misc updates in response to your recent updates.
> >>>>>>>>
> >>>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
> >>>>>>>
> >>>>>>> Good reminder, thank you!
> >>>>>>>
> >>>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> >>>>>>>> your retouch for fluency.
> >>>>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> >>>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> >>>>>>>
> >>>>>>> Nice, queued and pushed, thank you!
> >>>>>>>
> >>>>>>>> Another suggestion to Figures 9.25 and 9.29.
> >>>>>>>> Wouldn't these graphs look better with log scale x-axis?
> >>>>>>>>
> >>>>>>>> X range can be 0.001 -- 10.
> >>>>>>>>
> >>>>>>>> You'll need to add a few data points in sub-microsecond critical-section
> >>>>>>>> duration to show plausible shapes in those regions, though.
> >>>>>>>
> >>>>>>> I took a quick look and didn't find any nanosecond delay primitives
> >>>>>>> in the Linux kernel, but yes, that would be nicer looking.
> >>>>>>>
> >>>>>>> I don't expect to make further progress on this particular graph
> >>>>>>> in the immediate future, but if you know of such a delay primitive,
> >>>>>>> please don't keep it a secret!  ;-)
> >>>>>>
> >>>>>> I find ndelay() defined in include/asm-generic/delay.h.
> >>>>>> I'm not sure if it works as you would expect, though.
> >>>>>
> >>>>> I must be going blind, given that I missed that one!
> >>>>
> >>>> :-) :-)
> >>>>
> >>>>> I did try it out, and it suffers from about 10% timing errors.  In
> >>>>> contrast, udelay is usually less than 1%.
> >>>>
> >>>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
> >>>> error is about 100ns?
> >>>
> >>> Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
> >>> to be worse than that.  100ns gets me about 130ns, 200ns gets me about
> >>> 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
> >>> very short delays.
> >>
> >> To compensate for the error, how about applying the appended change?
> >> Yes, this is kind of ugly...
> >>
> >> Another point you should be aware of: it looks like arch/powerpc
> >> does not have __ndelay defined, which means ndelay() would cause a
> >> build error.  Still, I might be missing something.
> > 
> > That is quite clever!  It does turn ndelay(1) into ndelay(0), but it
> > probably costs more than a nanosecond to do the integer division, so
> > that shouldn't be a problem.
> > 
> > However, I believe that any such compensatory schemes should be done
> > within ndelay() rather than by its users.
> 
> I'm not brave enough to change the behavior of ndelay() seeing the
> number of call sites in kernel code base, especially under drivers/.
> 
> Looking at the updated Figures 9.25 and 9.29, the timing error of
> ndelay() results in the discrepancy of "rcu" plots from the ideal
> orthogonal lines in sub-microseconds regions (0.1, 0.2, and 0.5us).
> I don't think you like such misleading plots.
> 
> You could instead compensate the x-values you give to ndelay().
> 
> On x86, you know the resolution of xdelay() is 1.164153ns.
> Which means if you want a time delay of 100ns, ndelay(86) will
> be 100.117ns.
> ndelay(172) will be 200.234ns and ndelay(429) will be 499.422ns.
> ndelay(430) will be 500.586ns, which is the 2nd closest.
> If you don't want to fall short of 500ns, ndelay(430) would be your choice.
> 
> I think this level of tweak is worthwhile, especially since it will
> result in a better-looking plot of RCU scaling.
> 
> Thoughts?

Huh.

What we could do is to do a calibration pass where we sample a
fine-grained timesource, spin on a series of ndelay() calls that last for
a few microseconds, then resample the fine-grained timestamp.  We could
then do a binary search so as to compute a corrected ndelay argument.
We would then need to verify the corrected argument.

This procedure would be architecture independent, and might also account
for instruction-stream differences.
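
The binary-search step of such a calibration pass might look roughly
like the following user-space sketch.  Everything here is hypothetical:
measure_fn stands for "time a batch of ndelay(arg) calls against a
fine-grained timesource and return the average per-call duration", and
model_measure() is only a deterministic placeholder that uses the
1.164153ns-per-unit x86 figure from earlier in this thread.  The search
assumes that the measured delay never decreases as the argument grows.

```c
/* Hypothetical hook: observed duration, in ns, of one ndelay(arg). */
typedef double (*measure_fn)(int arg);

/* Deterministic placeholder for a real measurement (an assumption). */
static double model_measure(int arg)
{
	return arg * 1.164153;
}

/*
 * Binary search for the largest argument in [0, max_arg] whose measured
 * delay does not exceed target_ns.
 */
static int calibrate_ndelay(measure_fn measure, double target_ns, int max_arg)
{
	int lo = 0, hi = max_arg;

	while (lo < hi) {
		int mid = lo + (hi - lo + 1) / 2;	/* upper midpoint */

		if (measure(mid) <= target_ns)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}
```

With the placeholder model, calibrate_ndelay(model_measure, 500.0, 1000)
returns 429.  A real pass would then re-measure the returned argument to
verify it, as noted above.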

Is there a better way?  Seems like there should be.  ;-)

							Thanx, Paul

> PS: The bumps in Figures 9.25 and 9.29 in the sub-microsecond region 
> might be the effect of difference of instruction stream.
> As we have seen in Figure 9.22, slight changes in the code path,
> e.g. jump target alignment, can cause 10% -- 20% of performance
> difference.
> 
> Enforce inlining un_delay() might or might not help. Just guessing.
> 
> 
> >                                           Plus, as you imply, different
> > architectures might need different adjustments.  My concern is that
> > different CPU generations within a given architecture might also need
> > different adjustments. :-(
> > 
> > 							Thanx, Paul
> > 
> >>         Thanks, Akira
> >>
> >> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> >> index 5db165ecd465..0a3764ea220c 100644
> >> --- a/kernel/rcu/refperf.c
> >> +++ b/kernel/rcu/refperf.c
> >> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
> >>         if (udl)
> >>                 udelay(udl);
> >>         if (ndl)
> >> -               ndelay(ndl);
> >> +               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
> >>  }
> >>  
> >>  static void ref_rcu_read_section(const int nloops)
> >>
> >>
> >>


* Re: [PATCH 0/3] defer: misc updates
  2020-06-02 15:28                 ` Paul E. McKenney
@ 2020-06-02 23:05                   ` Akira Yokosawa
  2020-06-03  1:02                     ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Akira Yokosawa @ 2020-06-02 23:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Tue, 2 Jun 2020 08:28:09 -0700, Paul E. McKenney wrote:
> On Tue, Jun 02, 2020 at 11:27:37PM +0900, Akira Yokosawa wrote:
>> On Mon, 1 Jun 2020 16:45:45 -0700, Paul E. McKenney wrote:
>>> On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
>>>> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
>>>>> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
>>>>>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
>>>>>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
>>>>>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
>>>>>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> This is misc updates in response to your recent updates.
>>>>>>>>>>
>>>>>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
>>>>>>>>>
>>>>>>>>> Good reminder, thank you!
>>>>>>>>>
>>>>>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
>>>>>>>>>> your retouch for fluency.
>>>>>>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
>>>>>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
>>>>>>>>>
>>>>>>>>> Nice, queued and pushed, thank you!
>>>>>>>>>
>>>>>>>>>> Another suggestion to Figures 9.25 and 9.29.
>>>>>>>>>> Wouldn't these graphs look better with log scale x-axis?
>>>>>>>>>>
>>>>>>>>>> X range can be 0.001 -- 10.
>>>>>>>>>>
>>>>>>>>>> You'll need to add a few data points in sub-microsecond critical-section
>>>>>>>>>> duration to show plausible shapes in those regions, though.
>>>>>>>>>
>>>>>>>>> I took a quick look and didn't find any nanosecond delay primitives
>>>>>>>>> in the Linux kernel, but yes, that would be nicer looking.
>>>>>>>>>
>>>>>>>>> I don't expect to make further progress on this particular graph
>>>>>>>>> in the immediate future, but if you know of such a delay primitive,
>>>>>>>>> please don't keep it a secret!  ;-)
>>>>>>>>
>>>>>>>> I find ndelay() defined in include/asm-generic/delay.h.
>>>>>>>> I'm not sure if it works as you would expect, though.
>>>>>>>
>>>>>>> I must be going blind, given that I missed that one!
>>>>>>
>>>>>> :-) :-)
>>>>>>
>>>>>>> I did try it out, and it suffers from about 10% timing errors.  In
>>>>>>> contrast, udelay is usually less than 1%.
>>>>>>
>>>>>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
>>>>>> error is about 100ns?
>>>>>
>>>>> Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
>>>>> to be worse than that.  100ns gets me about 130ns, 200ns gets me about
>>>>> 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
>>>>> very short delays.
>>>>
>>>> To compensate for the error, how about applying the appended change?
>>>> Yes, this is kind of ugly...
>>>>
>>>> Another point you should be aware of: it looks like arch/powerpc
>>>> does not have __ndelay defined, which means ndelay() would cause a
>>>> build error.  Still, I might be missing something.
>>>
>>> That is quite clever!  It does turn ndelay(1) into ndelay(0), but it
>>> probably costs more than a nanosecond to do the integer division, so
>>> that shouldn't be a problem.
>>>
>>> However, I believe that any such compensatory schemes should be done
>>> within ndelay() rather than by its users.
>>
>> I'm not brave enough to change the behavior of ndelay() seeing the
>> number of call sites in kernel code base, especially under drivers/.
>>
>> Looking at the updated Figures 9.25 and 9.29, the timing error of
>> ndelay() results in the discrepancy of "rcu" plots from the ideal
>> orthogonal lines in sub-microseconds regions (0.1, 0.2, and 0.5us).
>> I don't think you like such misleading plots.
>>
>> You could instead compensate the x-values you give to ndelay().
>>
>> On x86, you know the resolution of xdelay() is 1.164153ns.
>> Which means if you want a time delay of 100ns, ndelay(86) will
>> be 100.117ns.
>> ndelay(172) will be 200.234ns and ndelay(429) will be 499.422ns.
>> ndelay(430) will be 500.586ns, which is the 2nd closest.
>> If you don't want to fall short of 500ns, ndelay(430) would be your choice.
>>
>> I think this level of tweak is worthwhile, especially since it will
>> result in a better-looking plot of RCU scaling.
>>
>> Thoughts?
> 
> Huh.
> 
> What we could do is to do a calibration pass where we sample a
> fine-grained timesource, spin on a series of ndelay() calls that last for
> a few microseconds, then resample the fine-grained timestamp.  We could
> then do a binary search so as to compute a corrected ndelay argument.
> We would then need to verify the corrected argument.
> 
> This procedure would be architecture independent, and might also account
> for instruction-stream differences.

This calibration part could be implemented and tested on a small system,
assuming you have a sub-microsecond ndelay() and a fine-grained timer.

For example, powerpc I mentioned earlier uses the fallback definition
in linux/delay.h:

	#ifndef ndelay
	static inline void ndelay(unsigned long x)
	{
		udelay(DIV_ROUND_UP(x, 1000));
	}
	#define ndelay(x) ndelay(x)
	#endif
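
To illustrate what this fallback implies for sub-microsecond requests,
here is a small user-space sketch.  The DIV_ROUND_UP() definition matches
the usual kernel macro.  The function fallback_ndelay_usecs() is
hypothetical and only reports the value that would be handed to udelay().

```c
/* Round-up integer division, as in the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Microseconds that the fallback ndelay(x) would pass to udelay(). */
static unsigned long fallback_ndelay_usecs(unsigned long x)
{
	return DIV_ROUND_UP(x, 1000);
}
```

With this fallback, requests of 100ns, 200ns, and 500ns all become
udelay(1), so such a system cannot produce distinct sub-microsecond
data points.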

> 
> Is there a better way?  Seems like there should be.  ;-)

Someone may well have already done something similar.

        Thanks, Akira

> 
> 							Thanx, Paul
> 
>> PS: The bumps in Figures 9.25 and 9.29 in the sub-microsecond region 
>> might be the effect of difference of instruction stream.
>> As we have seen in Figure 9.22, slight changes in the code path,
>> e.g. jump target alignment, can cause 10% -- 20% of performance
>> difference.
>>
>> Enforce inlining un_delay() might or might not help. Just guessing.
>>
>>
>>>                                           Plus, as you imply, different
>>> architectures might need different adjustments.  My concern is that
>>> different CPU generations within a given architecture might also need
>>> different adjustments. :-(
>>>
>>> 							Thanx, Paul
>>>
>>>>         Thanks, Akira
>>>>
>>>> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
>>>> index 5db165ecd465..0a3764ea220c 100644
>>>> --- a/kernel/rcu/refperf.c
>>>> +++ b/kernel/rcu/refperf.c
>>>> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
>>>>         if (udl)
>>>>                 udelay(udl);
>>>>         if (ndl)
>>>> -               ndelay(ndl);
>>>> +               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
>>>>  }
>>>>  
>>>>  static void ref_rcu_read_section(const int nloops)
>>>>
>>>>
>>>>


* Re: [PATCH 0/3] defer: misc updates
  2020-06-02 23:05                   ` Akira Yokosawa
@ 2020-06-03  1:02                     ` Paul E. McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2020-06-03  1:02 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Wed, Jun 03, 2020 at 08:05:48AM +0900, Akira Yokosawa wrote:
> On Tue, 2 Jun 2020 08:28:09 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 02, 2020 at 11:27:37PM +0900, Akira Yokosawa wrote:
> >> On Mon, 1 Jun 2020 16:45:45 -0700, Paul E. McKenney wrote:
> >>> On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
> >>>> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
> >>>>> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
> >>>>>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> >>>>>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> >>>>>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> >>>>>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >>>>>>>>>> Hi Paul,
> >>>>>>>>>>
> >>>>>>>>>> This is misc updates in response to your recent updates.
> >>>>>>>>>>
> >>>>>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
> >>>>>>>>>
> >>>>>>>>> Good reminder, thank you!
> >>>>>>>>>
> >>>>>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt.  The wording may need
> >>>>>>>>>> your retouch for fluency.
> >>>>>>>>>> Patch 3/3 is an independent improvement of runlatex.sh.  It will avoid
> >>>>>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> >>>>>>>>>
> >>>>>>>>> Nice, queued and pushed, thank you!
> >>>>>>>>>
> >>>>>>>>>> Another suggestion to Figures 9.25 and 9.29.
> >>>>>>>>>> Wouldn't these graphs look better with log scale x-axis?
> >>>>>>>>>>
> >>>>>>>>>> X range can be 0.001 -- 10.
> >>>>>>>>>>
> >>>>>>>>>> You'll need to add a few data points in sub-microsecond critical-section
> >>>>>>>>>> duration to show plausible shapes in those regions, though.
> >>>>>>>>>
> >>>>>>>>> I took a quick look and didn't find any nanosecond delay primitives
> >>>>>>>>> in the Linux kernel, but yes, that would be nicer looking.
> >>>>>>>>>
> >>>>>>>>> I don't expect to make further progress on this particular graph
> >>>>>>>>> in the immediate future, but if you know of such a delay primitive,
> >>>>>>>>> please don't keep it a secret!  ;-)
> >>>>>>>>
> >>>>>>>> I find ndelay() defined in include/asm-generic/delay.h.
> >>>>>>>> I'm not sure if it works as you would expect, though.
> >>>>>>>
> >>>>>>> I must be going blind, given that I missed that one!
> >>>>>>
> >>>>>> :-) :-)
> >>>>>>
> >>>>>>> I did try it out, and it suffers from about 10% timing errors.  In
> >>>>>>> contrast, udelay is usually less than 1%.
> >>>>>>
> >>>>>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
> >>>>>> error is about 100ns?
> >>>>>
> >>>>> Yuck.  The 10% was a preliminary eyeballing.  An overnight run showed it
> >>>>> to be worse than that.  100ns gets me about 130ns, 200ns gets me about
> >>>>> 270ns, and 500ns gets me about 600ns.  So ndelay() is useful only for
> >>>>> very short delays.
> >>>>
> >>>> To compensate for the error, how about doing the appended?
> >>>> Yes, this is kind of ugly...
> >>>>
> >>>> Another point you should be aware of.  It looks like arch/powerpc
> >>>> does not have __ndelay defined, which means ndelay() would cause
> >>>> a build error.  Still, I might be missing something.
> >>>
> >>> That is quite clever!  It does turn ndelay(1) into ndelay(0), but it
> >>> probably costs more than a nanosecond to do the integer division, so
> >>> that shouldn't be a problem.
> >>>
> >>> However, I believe that any such compensatory schemes should be done
> >>> within ndelay() rather than by its users.
> >>
> >> I'm not brave enough to change the behavior of ndelay() seeing the
> >> number of call sites in kernel code base, especially under drivers/.
> >>
> >> Looking at the updated Figures 9.25 and 9.29, the timing error of
> >> ndelay() results in the discrepancy of "rcu" plots from the ideal
> >> orthogonal lines in sub-microseconds regions (0.1, 0.2, and 0.5us).
> >> I don't think you like such misleading plots.
> >>
> >> You could instead compensate the x-values you give to ndelay().
> >>
> >> On x86, you know the resolution of xdelay() is 1.164153ns.
> >> Which means if you want a time delay of 100ns, ndelay(86) will
> >> be 100.117ns.
> >> ndelay(172) will be 200.234ns and ndelay(429) will be 499.422ns.
> >> ndelay(430) will be 500.586ns, which is the 2nd closest.
> >> If you don't want to exceed 500ns, ndelay(429) would be your choice.
> >>
> >> I think this level of tweak is worthwhile, especially since it will
> >> result in a better looking plot of RCU scaling.
> >>
> >> Thoughts?
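
The arithmetic quoted above can be sketched as follows.  This is only an
illustration, assuming the x86 scaling mentioned in the thread (ndelay(x)
lasting about x * 1.164153 ns); the macro and helper names are hypothetical
and are not part of refperf.c or of the kernel.

```c
#include <stdint.h>

/*
 * Assumption taken from the thread: on x86, ndelay(x) lasts about
 * x * 1.164153 ns.  The ratio is stored scaled by 10^6 so that the
 * helpers below can use integer arithmetic only.
 */
#define NDELAY_SCALE_NUM 1164153ULL
#define NDELAY_SCALE_DEN 1000000ULL

/* Hypothetical helper: largest argument whose real delay stays <= target_ns. */
static inline unsigned long ndelay_arg_floor(unsigned long target_ns)
{
	return (unsigned long)((uint64_t)target_ns * NDELAY_SCALE_DEN /
			       NDELAY_SCALE_NUM);
}

/* Hypothetical helper: argument whose real delay is nearest to target_ns. */
static inline unsigned long ndelay_arg_nearest(unsigned long target_ns)
{
	return (unsigned long)(((uint64_t)target_ns * NDELAY_SCALE_DEN +
				NDELAY_SCALE_NUM / 2) / NDELAY_SCALE_NUM);
}
```

With these assumptions, the nearest arguments for 100ns, 200ns, and 500ns
come out as 86, 172, and 429, matching the figures quoted above.  The
factor is specific to one architecture, so a hard-coded table like this
would not carry over to other systems.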
> > 
> > Huh.
> > 
> > What we could do is to do a calibration pass where we sample a
> > fine-grained timesource, spin on a series of ndelay() calls that last for
> > a few microseconds, then resample the fine-grained timestamp.  We could
> > then do a binary search so as to compute a corrected ndelay argument.
> > We would then need to verify the corrected argument.
> > 
> > This procedure would be architecture independent, and might also account
> > for instruction-stream differences.
> 
> This calibration part could be implemented and tested on a small system,
> assuming you have sub-microsecond ndelay() and fine-grained timer.

Just to be clear, my thought is to do a short calibration cycle on the
system running the actual test as part of refperf initialization.
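
A minimal sketch of the binary-search step of such a calibration, under
stated assumptions: the measurement is abstracted behind a hypothetical
callback (a real one would bracket a loop of ndelay() calls with reads of
a fine-grained timesource), the measured delay is assumed to be
non-decreasing in the argument, and the model function below merely
stands in for real hardware using the x86 factor quoted in this thread.
None of these names exist in refperf.c.

```c
#include <stdint.h>

/* Hypothetical callback: measured duration in ns of one delay call. */
typedef uint64_t (*measure_fn)(unsigned long arg);

/*
 * Binary-search the largest argument in [0, max_arg] whose measured
 * delay does not exceed target_ns.  Assumes measure() is monotonically
 * non-decreasing in its argument.  Returns 0 if no argument fits.
 */
static unsigned long calibrate_delay_arg(measure_fn measure,
					 unsigned long target_ns,
					 unsigned long max_arg)
{
	unsigned long lo = 0, hi = max_arg;

	while (lo < hi) {
		/* Round the midpoint up so that lo always advances. */
		unsigned long mid = lo + (hi - lo + 1) / 2;

		if (measure(mid) <= target_ns)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}

/* Stand-in for a real measurement: arg * 1.164153 ns, rounded up. */
static uint64_t model_x86_ndelay(unsigned long arg)
{
	return ((uint64_t)arg * 1164153ULL + 999999ULL) / 1000000ULL;
}
```

The result would still need the verification pass mentioned above, since
a real measurement is noisy and might not be strictly monotonic.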

> For example, powerpc I mentioned earlier uses the fallback definition
> in linux/delay.h:
> 
> 	#ifndef ndelay
> 	static inline void ndelay(unsigned long x)
> 	{
> 		udelay(DIV_ROUND_UP(x, 1000));
> 	}
> 	#define ndelay(x) ndelay(x)
> 	#endif

Indeed, any calibration would need to be careful of this!
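
To illustrate why, here is a small sketch of the rounding done by the
fallback quoted above.  The DIV_ROUND_UP() definition mirrors the usual
kernel macro; the helper name is hypothetical and exists only for this
example.

```c
/* Same rounding as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: the microsecond count that the fallback
 * ndelay() would pass to udelay() for a given nanosecond request.
 */
static inline unsigned long fallback_ndelay_usecs(unsigned long nsecs)
{
	return DIV_ROUND_UP(nsecs, 1000);
}
```

Any request from 1ns through 1000ns maps to udelay(1), so on such a
system the 100ns, 200ns, and 500ns data points would all measure roughly
one microsecond, and a search for a corrected argument would see a step
function rather than a smooth curve.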

> > Is there a better way?  Seems like there should be.  ;-)
> 
> Someone may already have done a similar thing.

Quite possibly.

							Thanx, Paul

>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >> PS: The bumps in Figures 9.25 and 9.29 in the sub-microsecond region
> >> might be the effect of differences in the instruction stream.
> >> As we have seen in Figure 9.22, slight changes in the code path,
> >> e.g. jump target alignment, can cause a 10% -- 20% performance
> >> difference.
> >>
> >> Enforcing inlining of un_delay() might or might not help. Just guessing.
> >>
> >>
> >>>                                           Plus, as you imply, different
> >>> architectures might need different adjustments.  My concern is that
> >>> different CPU generations within a given architecture might also need
> >>> different adjustments. :-(
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>>         Thanks, Akira
> >>>>
> >>>> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> >>>> index 5db165ecd465..0a3764ea220c 100644
> >>>> --- a/kernel/rcu/refperf.c
> >>>> +++ b/kernel/rcu/refperf.c
> >>>> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
> >>>>         if (udl)
> >>>>                 udelay(udl);
> >>>>         if (ndl)
> >>>> -               ndelay(ndl);
> >>>> +               ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
> >>>>  }
> >>>>  
> >>>>  static void ref_rcu_read_section(const int nloops)
> >>>>
> >>>>
> >>>>



Thread overview: 15+ messages
2020-05-31  0:30 [PATCH 0/3] defer: misc updates Akira Yokosawa
2020-05-31  0:32 ` [PATCH 1/3] defer: Annotate consecutive QQZs as such for 'nq' build Akira Yokosawa
2020-05-31  0:33 ` [PATCH 2/3] FAQ.txt: Advertise 'nq' build in #9 Akira Yokosawa
2020-05-31  0:35 ` [PATCH 3/3] runlatex.sh: Give up early on undefined refs Akira Yokosawa
2020-05-31 16:50 ` [PATCH 0/3] defer: misc updates Paul E. McKenney
2020-05-31 23:11   ` Akira Yokosawa
2020-06-01  1:18     ` Paul E. McKenney
2020-06-01 15:10       ` Akira Yokosawa
2020-06-01 16:13         ` Paul E. McKenney
2020-06-01 22:51           ` Akira Yokosawa
2020-06-01 23:45             ` Paul E. McKenney
2020-06-02 14:27               ` Akira Yokosawa
2020-06-02 15:28                 ` Paul E. McKenney
2020-06-02 23:05                   ` Akira Yokosawa
2020-06-03  1:02                     ` Paul E. McKenney
