* [PATCH 0/2] Minor updates
@ 2019-12-07  4:05 Akira Yokosawa
  2019-12-07  4:06 ` [PATCH 1/2] toyrcu: Use mathcal O for 'orders of' Akira Yokosawa
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-07  4:05 UTC (permalink / raw)
  To: Paul E. McKenney, perfbook, Akira Yokosawa

Hi Paul,

This patch set fixes minor issues I noticed while reading your
recent updates.

Apart from the changes, I'd like you to mention in the answer to
Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
instructions directly, but decode them into uOPs (via MOP) and
keep them in a uOP cache [1].
So the execution cycle count does not necessarily correspond to the
instruction count, but depends heavily on the behavior of the
microarchitecture, which is not predictable without actually running
the code.

[1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

        Thanks, Akira
--
Akira Yokosawa (2):
  toyrcu: Use mathcal O for 'orders of'
  defer/rcuusage: Fix typo (that -> than)

 appendix/toyrcu/toyrcu.tex | 2 +-
 defer/rcuusage.tex         | 2 +-
 perfbook.tex               | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] toyrcu: Use mathcal O for 'orders of'
  2019-12-07  4:05 [PATCH 0/2] Minor updates Akira Yokosawa
@ 2019-12-07  4:06 ` Akira Yokosawa
  2019-12-07  4:07 ` [PATCH 2/2] defer/rcuusage: Fix typo (that -> than) Akira Yokosawa
  2019-12-07 16:43 ` [PATCH 0/2] Minor updates Paul E. McKenney
  2 siblings, 0 replies; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-07  4:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From b981496b8b93f2dafd1becfc7cc65df121c73a55 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Fri, 6 Dec 2019 07:59:40 +0900
Subject: [PATCH 1/2] toyrcu: Use mathcal O for 'orders of'

Also update the macro "\O{}" introduced in commit b4ad25eae241
("future/QC: Use upright glyph for math constant and descriptive
suffix") so that it uses "\left(" and "\right)".

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 appendix/toyrcu/toyrcu.tex | 2 +-
 perfbook.tex               | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/appendix/toyrcu/toyrcu.tex b/appendix/toyrcu/toyrcu.tex
index 9dff9f8a..92bd5ae7 100644
--- a/appendix/toyrcu/toyrcu.tex
+++ b/appendix/toyrcu/toyrcu.tex
@@ -1590,7 +1590,7 @@ create a new RCU implementation.
 	In particular, expensive operations such as cache misses,
 	atomic instructions, memory barriers, and branches should
 	be avoided.
-\item	RCU read-side primitives should have $O\left(1\right)$ computational
+\item	RCU read-side primitives should have $\O{1}$ computational
 	complexity to enable real-time use.
 	(This implies that readers run concurrently with updaters.)
 \item	RCU read-side primitives should be usable in all contexts
diff --git a/perfbook.tex b/perfbook.tex
index 37160751..14f138d4 100644
--- a/perfbook.tex
+++ b/perfbook.tex
@@ -240,7 +240,7 @@
 \newcommand{\qop}[1]{{\sffamily #1}} % QC operator such as H, T, S, etc.
 
 \DeclareRobustCommand{\euler}{\ensuremath{\mathrm{e}}}
-\DeclareRobustCommand{\O}[1]{\ensuremath{\mathcal{O}(#1)}}
+\DeclareRobustCommand{\O}[1]{\ensuremath{\mathcal{O}\left(#1\right)}}
 \newcommand{\Power}[1]{POWER#1}
 \newcommand{\GNUC}{GNU~C}
 \newcommand{\GCC}{GCC}
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread
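[Editorial aside: the effect of the macro change in the patch above can be seen in a minimal standalone LaTeX sketch. This hypothetical document is not taken from perfbook's sources; with the \left(/\right) pair, the parentheses of \O{} grow to match tall arguments such as fractions, which plain ( and ) would not.]

```latex
\documentclass{article}

% Macro as updated by the patch above: \left( and \right) give
% auto-sizing parentheses around the macro's argument.
\DeclareRobustCommand{\O}[1]{\ensuremath{\mathcal{O}\left(#1\right)}}

\begin{document}
% Typical uses: calligraphic O with parentheses sized to fit.
RCU read-side primitives should have $\O{1}$ computational complexity,
a balanced-tree lookup is $\O{\log N}$, and with a tall argument the
parentheses stretch to match: $\O{\frac{N}{2}}$.
\end{document}
```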

* [PATCH 2/2] defer/rcuusage: Fix typo (that -> than)
  2019-12-07  4:05 [PATCH 0/2] Minor updates Akira Yokosawa
  2019-12-07  4:06 ` [PATCH 1/2] toyrcu: Use mathcal O for 'orders of' Akira Yokosawa
@ 2019-12-07  4:07 ` Akira Yokosawa
  2019-12-07 16:43 ` [PATCH 0/2] Minor updates Paul E. McKenney
  2 siblings, 0 replies; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-07  4:07 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From e991e75d28adcc5c07546c980a0ec0b3e5ba345e Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Sat, 7 Dec 2019 11:13:45 +0900
Subject: [PATCH 2/2] defer/rcuusage: Fix typo (that -> than)

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 defer/rcuusage.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
index 58688b48..004f6665 100644
--- a/defer/rcuusage.tex
+++ b/defer/rcuusage.tex
@@ -126,7 +126,7 @@ that of the ideal synchronization-free workload.
 	back to ideal.
 
 	The RCU variant of the \co{route_lookup()} search loop actually
-	has one more x86 instructions that does the sequential version,
+	has one more x86 instructions than does the sequential version,
 	namely the \co{lea} in the sequence
 	\co{cmp}, \co{je}, \co{mov}, \co{cmp}, \co{lea}, and \co{jne}.
 	This extra instruction is due to the \co{rcu_head} structure
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-07  4:05 [PATCH 0/2] Minor updates Akira Yokosawa
  2019-12-07  4:06 ` [PATCH 1/2] toyrcu: Use mathcal O for 'orders of' Akira Yokosawa
  2019-12-07  4:07 ` [PATCH 2/2] defer/rcuusage: Fix typo (that -> than) Akira Yokosawa
@ 2019-12-07 16:43 ` Paul E. McKenney
  2019-12-07 23:41   ` Akira Yokosawa
  2 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-12-07 16:43 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> Hi Paul,
> 
> This patch set fixes minor issues I noticed while reading your
> recent updates.

Queued and pushed, along with a fix to another of my typos, thank
you very much!

> Apart from the changes, I'd like you to mention in the answer to
> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> instructions directly, but decode them into uOPs (via MOP) and
> keep them in a uOP cache [1].
> So the execution cycle count does not necessarily correspond to the
> instruction count, but depends heavily on the behavior of the
> microarchitecture, which is not predictable without actually running
> the code.
> 
> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

My thought is that I should review the "Hardware and its Habits" chapter,
add this information if it is not already present, and then make the
answer to this Quick Quiz refer back to that.  Does that seem reasonable?

Also, I am thinking in terms of a release (not yet an edition) in
the near term.  Anything else that absolutely must be fixed first?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-07 16:43 ` [PATCH 0/2] Minor updates Paul E. McKenney
@ 2019-12-07 23:41   ` Akira Yokosawa
  2019-12-08  1:15     ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-07 23:41 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>> Hi Paul,
>>
>> This patch set fixes minor issues I noticed while reading your
>> recent updates.
> 
> Queued and pushed, along with a fix to another of my typos, thank
> you very much!
> 
>> Apart from the changes, I'd like you to mention in the answer to
>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>> instructions directly, but decode them into uOPs (via MOP) and
>> keep them in a uOP cache [1].
>> So the execution cycle count does not necessarily correspond to the
>> instruction count, but depends heavily on the behavior of the
>> microarchitecture, which is not predictable without actually running
>> the code.
>>
>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> 
> My thought is that I should review the "Hardware and its Habits" chapter,
> add this information if it is not already present, and then make the
> answer to this Quick Quiz refer back to that.  Does that seem reasonable?

Yes, it sounds quite reasonable!

(Skimming through the chapter...)

So Section 3.1.1 lightly touches on pipelining. Section 3.2 mostly
discusses memory sub-systems.

Modern Intel architectures can be thought of as superscalar RISC
processors which emulate the x86 ISA. The transformation of x86
instructions into uOPs can be thought of as another layer of
optimization (sometimes "de-optimization" from a compiler writer's
POV) ;-).

But deep-diving into this topic would cost you another chapter/appendix.
I'm not sure if it's worthwhile for perfbook.
Maybe it would suffice to touch lightly on the difficulty of
predicting the execution cycles of particular instruction streams
on modern microprocessors (not limited to Intel's), and add
a few citations to textbooks/reference manuals.

> 
> Also, I am thinking in terms of a release (not yet an edition) in
> the near term.  Anything else that absolutely must be fixed first?

There remain a couple of ACCESS_ONCE()s. I'm submitting a patch
to get rid of them.  I don't have any other pending urgent fixes
at the moment.

       Thanks, Akira



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-07 23:41   ` Akira Yokosawa
@ 2019-12-08  1:15     ` Paul E. McKenney
  2019-12-08 15:54       ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-12-08  1:15 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> > On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >> Hi Paul,
> >>
> >> This patch set fixes minor issues I noticed while reading your
> >> recent updates.
> > 
> > Queued and pushed, along with a fix to another of my typos, thank
> > you very much!
> > 
> >> Apart from the changes, I'd like you to mention in the answer to
> >> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >> instructions directly, but decode them into uOPs (via MOP) and
> >> keep them in a uOP cache [1].
> >> So the execution cycle count does not necessarily correspond to the
> >> instruction count, but depends heavily on the behavior of the
> >> microarchitecture, which is not predictable without actually running
> >> the code.
> >>
> >> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> > 
> > My thought is that I should review the "Hardware and its Habits" chapter,
> > add this information if it is not already present, and then make the
> > answer to this Quick Quiz refer back to that.  Does that seem reasonable?
> 
> Yes, it sounds quite reasonable!
> 
> (Skimming through the chapter...)
> 
> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
> memory sub-systems.
> 
> Modern Intel architectures can be thought of as superscalar RISC
> processors which emulate x86 ISA. The transformation of x86 instructions
> into uOPs can be thought of as another layer of optimization
> (sometimes "de-optimization" from compiler writer's POV) ;-).
> 
> But deep-diving this topic would cost you another chapter/appendix.
> I'm not sure if it's worthwhile for perfbook.
> Maybe it would suffice to lightly touch the difficulty of
> predicting execution cycles of particular instruction streams
> on modern microprocessors (not limited to Intel's), and put
> a few citations of textbooks/reference manuals.

What I did was to add a rough diagram and a paragraph or two of
explanation to Section 3.1.1, then add a reference to that section
in the Quick Quiz.

> > Also, I am thinking in terms of a release (not yet an edition) in
> > the near term.  Anything else that absolutely must be fixed first?
> 
> There remain a couple of ACCESS_ONCE()s. I'm submitting a patch
> to get rid of them.  I don't have any other pending urgent fixes
> at the moment.

Got it, thank you!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-08  1:15     ` Paul E. McKenney
@ 2019-12-08 15:54       ` Akira Yokosawa
  2019-12-08 18:11         ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-08 15:54 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>> Hi Paul,
>>>>
>>>> This patch set fixes minor issues I noticed while reading your
>>>> recent updates.
>>>
>>> Queued and pushed, along with a fix to another of my typos, thank
>>> you very much!
>>>
>>>> Apart from the changes, I'd like you to mention in the answer to
>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>> keep them in a uOP cache [1].
>>>> So the execution cycle count does not necessarily correspond to the
>>>> instruction count, but depends heavily on the behavior of the
>>>> microarchitecture, which is not predictable without actually running
>>>> the code.
>>>>
>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>
>>> My thought is that I should review the "Hardware and its Habits" chapter,
>>> add this information if it is not already present, and then make the
>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
>>
>> Yes, it sounds quite reasonable!
>>
>> (Skimming through the chapter...)
>>
>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
>> memory sub-systems.
>>
>> Modern Intel architectures can be thought of as superscalar RISC
>> processors which emulate x86 ISA. The transformation of x86 instructions
>> into uOPs can be thought of as another layer of optimization
>> (sometimes "de-optimization" from compiler writer's POV) ;-).
>>
>> But deep-diving this topic would cost you another chapter/appendix.
>> I'm not sure if it's worthwhile for perfbook.
>> Maybe it would suffice to lightly touch the difficulty of
>> predicting execution cycles of particular instruction streams
>> on modern microprocessors (not limited to Intel's), and put
>> a few citations of textbooks/reference manuals.
> 
> What I did was to add a rough diagram and a paragraph or two of
> explanation to Section 3.1.1, then add a reference to that section
> in the Quick Quiz.

I'd like to see a couple more keywords mentioned here other
than "pipeline".  "Super-scalar" is present in the Glossary, but
"Superscalar" looks much more common these days. Appended below is
a tentative patch I made to show you my idea. Please feel free
to edit as you'd like before applying it.

Another point I'd like to suggest:
Figure 9.23 and the following figures still show results on
a 16-CPU system.  It looks like it is difficult to make corresponding
plots for the 448-thread system.  Can you add info on the HW system
where those 16-CPU results were obtained at the beginning of
Section 9.5.4.2?

        Thanks, Akira

-------------8<-------------------
From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Mon, 9 Dec 2019 00:23:59 +0900
Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'

Also remove "-" from "Super-scalar" in the Glossary.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 cpu/overview.tex   | 22 ++++++++++++++--------
 defer/rcuusage.tex |  2 +-
 glossary.tex       |  4 ++--
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/cpu/overview.tex b/cpu/overview.tex
index b80f47c1..191c1c68 100644
--- a/cpu/overview.tex
+++ b/cpu/overview.tex
@@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
 decoded it, and executed it, typically taking \emph{at least} three
 clock cycles to complete one instruction before proceeding to the next.
 In contrast, the CPU of the late 1990s and of the 2000s execute
-many instructions simultaneously, using a deep \emph{pipeline} to control
+many instructions simultaneously, using a combination of approaches
+including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
+and \emph{speculative} execution, to control
 the flow of instructions internally to the CPU.
 Some cores have more than one hardware thread, which is variously called
 \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
-(HT)~\cite{JFennel1973SMT}.
+(HT)~\cite{JFennel1973SMT},
 each of which appears as
 an independent CPU to software, at least from a functional viewpoint.
 These modern hardware features can greatly improve performance, as
@@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
 \end{figure}
 
 This gets even worse in the increasingly common case of hyperthreading
-(or SMT, if you prefer).
+(or SMT, if you prefer).\footnote{
+	Superscalar is involved in most cases, too.
+}
 In this case, all the hardware threads sharing a core also share that
 core's resources, including registers, cache, execution units, and so on.
-The instruction streams are decoded into micro-operations, and use of the
-shared execution units and the hundreds of hardware registers is coordinated
+The instruction streams might be decoded into micro-operations,
+and use of the shared execution units and the hundreds of hardware
+registers can be coordinated
 by a micro-operation scheduler.
-A rough diagram of a two-threaded core is shown in
-\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
+A rough diagram of such a two-threaded core is shown in
+\cref{fig:cpu:Rough View of Modern Micro-Architecture},
 and more accurate (and thus more complex) diagrams are available in
 textbooks and scholarly papers.\footnote{
 	Here is one example for a late-2010s Intel CPU:
@@ -123,7 +128,8 @@ of clairvoyance.
 In particular, adding an instruction to a tight loop can sometimes
 actually speed up execution, counterintuitive though that might be.
 
-Unfortunately, pipeline flushes are not the only hazards in the obstacle
+Unfortunately, pipeline flushes and shared-resource contentions
+are not the only hazards in the obstacle
 course that modern CPUs must run.
 The next section covers the hazards of referencing memory.
 
diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
index 7fe633c3..fa04ddb6 100644
--- a/defer/rcuusage.tex
+++ b/defer/rcuusage.tex
@@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
 	are long gone.
 
 	But those of you who read
-	\Cref{sec:cpu:Pipelined CPUs}
+	\cref{sec:cpu:Pipelined CPUs}
 	carefully already knew all of this!
 
 	These counter-intuitive results of course means that any
diff --git a/glossary.tex b/glossary.tex
index c10ffe4e..4a3aa796 100644
--- a/glossary.tex
+++ b/glossary.tex
@@ -382,11 +382,11 @@
 	as well as its cache so as to ensure that the software sees
 	the memory operations performed by this CPU as if they
 	were carried out in program order.
-\item[Super-Scalar CPU:]
+\item[Superscalar CPU:]
 	A scalar (non-vector) CPU capable of executing multiple instructions
 	concurrently.
 	This is a step up from a pipelined CPU that executes multiple
-	instructions in an assembly-line fashion---in a super-scalar
+	instructions in an assembly-line fashion---in a superscalar
 	CPU, each stage of the pipeline would be capable of handling
 	more than one instruction.
 	For example, if the conditions were exactly right,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-08 15:54       ` Akira Yokosawa
@ 2019-12-08 18:11         ` Paul E. McKenney
  2019-12-09 12:50           ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-12-08 18:11 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> > On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> >> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> >>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >>>> Hi Paul,
> >>>>
> >>>> This patch set fixes minor issues I noticed while reading your
> >>>> recent updates.
> >>>
> >>> Queued and pushed, along with a fix to another of my typos, thank
> >>> you very much!
> >>>
> >>>> Apart from the changes, I'd like you to mention in the answer to
> >>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >>>> instructions directly, but decode them into uOPs (via MOP) and
> >>>> keep them in a uOP cache [1].
> >>>> So the execution cycle count does not necessarily correspond to the
> >>>> instruction count, but depends heavily on the behavior of the
> >>>> microarchitecture, which is not predictable without actually running
> >>>> the code.
> >>>>
> >>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> >>>
> >>> My thought is that I should review the "Hardware and its Habits" chapter,
> >>> add this information if it is not already present, and then make the
> >>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
> >>
> >> Yes, it sounds quite reasonable!
> >>
> >> (Skimming through the chapter...)
> >>
> >> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
> >> memory sub-systems.
> >>
> >> Modern Intel architectures can be thought of as superscalar RISC
> >> processors which emulate x86 ISA. The transformation of x86 instructions
> >> into uOPs can be thought of as another layer of optimization
> >> (sometimes "de-optimization" from compiler writer's POV) ;-).
> >>
> >> But deep-diving this topic would cost you another chapter/appendix.
> >> I'm not sure if it's worthwhile for perfbook.
> >> Maybe it would suffice to lightly touch the difficulty of
> >> predicting execution cycles of particular instruction streams
> >> on modern microprocessors (not limited to Intel's), and put
> >> a few citations of textbooks/reference manuals.
> > 
> > What I did was to add a rough diagram and a paragraph or two of
> > explanation to Section 3.1.1, then add a reference to that section
> > in the Quick Quiz.
> 
> I'd like to see a couple more keywords mentioned here other
> than "pipeline".  "Super-scalar" is present in the Glossary, but
> "Superscalar" looks much more common these days. Appended below is
> a tentative patch I made to show you my idea. Please feel free
> to edit as you'd like before applying it.
> 
> Another point I'd like to suggest:
> Figure 9.23 and the following figures still show results on
> a 16-CPU system.  It looks like it is difficult to make corresponding
> plots for the 448-thread system.  Can you add info on the HW system
> where those 16-CPU results were obtained at the beginning of
> Section 9.5.4.2?
> 
>         Thanks, Akira
> 
> -------------8<-------------------
> From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa <akiyks@gmail.com>
> Date: Mon, 9 Dec 2019 00:23:59 +0900
> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
> 
> Also remove "-" from "Super-scalar" in the Glossary.
> 
> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>

Good points, thank you!

Applied with a few inevitable edits.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-08 18:11         ` Paul E. McKenney
@ 2019-12-09 12:50           ` Akira Yokosawa
  2019-12-09 18:06             ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-09 12:50 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> This patch set fixes minor issues I noticed while reading your
>>>>>> recent updates.
>>>>>
>>>>> Queued and pushed, along with a fix to another of my typos, thank
>>>>> you very much!
>>>>>
>>>>>> Apart from the changes, I'd like you to mention in the answer to
>>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>>>> keep them in a uOP cache [1].
>>>>>> So the execution cycle count does not necessarily correspond to the
>>>>>> instruction count, but depends heavily on the behavior of the
>>>>>> microarchitecture, which is not predictable without actually running
>>>>>> the code.
>>>>>>
>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>>>
>>>>> My thought is that I should review the "Hardware and its Habits" chapter,
>>>>> add this information if it is not already present, and then make the
>>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
>>>>
>>>> Yes, it sounds quite reasonable!
>>>>
>>>> (Skimming through the chapter...)
>>>>
>>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
>>>> memory sub-systems.
>>>>
>>>> Modern Intel architectures can be thought of as superscalar RISC
>>>> processors which emulate x86 ISA. The transformation of x86 instructions
>>>> into uOPs can be thought of as another layer of optimization
>>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
>>>>
>>>> But deep-diving this topic would cost you another chapter/appendix.
>>>> I'm not sure if it's worthwhile for perfbook.
>>>> Maybe it would suffice to lightly touch the difficulty of
>>>> predicting execution cycles of particular instruction streams
>>>> on modern microprocessors (not limited to Intel's), and put
>>>> a few citations of textbooks/reference manuals.
>>>
>>> What I did was to add a rough diagram and a paragraph or two of
>>> explanation to Section 3.1.1, then add a reference to that section
>>> in the Quick Quiz.
>>
>> I'd like to see a couple more keywords mentioned here other
>> than "pipeline".  "Super-scalar" is present in the Glossary, but
>> "Superscalar" looks much more common these days. Appended below is
>> a tentative patch I made to show you my idea. Please feel free
>> to edit as you'd like before applying it.
>>
>> Another point I'd like to suggest.
>> Figure 9.23 and the following figures still show the result on
>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>> system corresponding to them. Can you add info on the HW system
>> where those 16 CPU results were obtained in the beginning of
>> Section 9.5.4.2?
>>
>>         Thanks, Akira
>>
>> -------------8<-------------------
>> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
>> From: Akira Yokosawa <akiyks@gmail.com>
>> Date: Mon, 9 Dec 2019 00:23:59 +0900
>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
>>
>> Also remove "-" from "Super-scaler" in Glossary.
>>
>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> 
> Good points, thank you!
> 
> Applied with a few inevitable edits.  ;-)

Quite a few edits!  Thank you.

Let me reiterate my earlier suggestion:

>> Another point I'd like to suggest.
>> Figure 9.23 and the following figures still show the result on
>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>> system corresponding to them. Can you add info on the HW system
>> where those 16 CPU results were obtained in the beginning of
>> Section 9.5.4.2?

Can you look into this as well?

        Thanks, Akira

> 
> 							Thanx, Paul
> 
>> ---
>>  cpu/overview.tex   | 22 ++++++++++++++--------
>>  defer/rcuusage.tex |  2 +-
>>  glossary.tex       |  4 ++--
>>  3 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/cpu/overview.tex b/cpu/overview.tex
>> index b80f47c1..191c1c68 100644
>> --- a/cpu/overview.tex
>> +++ b/cpu/overview.tex
>> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
>>  decoded it, and executed it, typically taking \emph{at least} three
>>  clock cycles to complete one instruction before proceeding to the next.
>>  In contrast, the CPU of the late 1990s and of the 2000s execute
>> -many instructions simultaneously, using a deep \emph{pipeline} to control
>> +many instructions simultaneously, using a combination of approaches
>> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
>> +and \emph{speculative} execution, to control
>>  the flow of instructions internally to the CPU.
>>  Some cores have more than one hardware thread, which is variously called
>>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
>> -(HT)~\cite{JFennel1973SMT}.
>> +(HT)~\cite{JFennel1973SMT},
>>  each of which appears as
>>  an independent CPU to software, at least from a functional viewpoint.
>>  These modern hardware features can greatly improve performance, as
>> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
>>  \end{figure}
>>  
>>  This gets even worse in the increasingly common case of hyperthreading
>> -(or SMT, if you prefer).
>> +(or SMT, if you prefer).\footnote{
>> +	Superscalar is involved in most cases, too.
>> +}
>>  In this case, all the hardware threads sharing a core also share that
>>  core's resources, including registers, cache, execution units, and so on.
>> -The instruction streams are decoded into micro-operations, and use of the
>> -shared execution units and the hundreds of hardware registers is coordinated
>> +The instruction streams might be decoded into micro-operations,
>> +and use of the shared execution units and the hundreds of hardware
>> +registers can be coordinated
>>  by a micro-operation scheduler.
>> -A rough diagram of a two-threaded core is shown in
>> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
>> +A rough diagram of such a two-threaded core is shown in
>> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
>>  and more accurate (and thus more complex) diagrams are available in
>>  textbooks and scholarly papers.\footnote{
>>  	Here is one example for a late-2010s Intel CPU:
>> @@ -123,7 +128,8 @@ of clairvoyance.
>>  In particular, adding an instruction to a tight loop can sometimes
>>  actually speed up execution, counterintuitive though that might be.
>>  
>> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
>> +Unfortunately, pipeline flushes and shared-resource contentions
>> +are not the only hazards in the obstacle
>>  course that modern CPUs must run.
>>  The next section covers the hazards of referencing memory.
>>  
>> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
>> index 7fe633c3..fa04ddb6 100644
>> --- a/defer/rcuusage.tex
>> +++ b/defer/rcuusage.tex
>> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
>>  	are long gone.
>>  
>>  	But those of you who read
>> -	\Cref{sec:cpu:Pipelined CPUs}
>> +	\cref{sec:cpu:Pipelined CPUs}
>>  	carefully already knew all of this!
>>  
>>  	These counter-intuitive results of course means that any
>> diff --git a/glossary.tex b/glossary.tex
>> index c10ffe4e..4a3aa796 100644
>> --- a/glossary.tex
>> +++ b/glossary.tex
>> @@ -382,11 +382,11 @@
>>  	as well as its cache so as to ensure that the software sees
>>  	the memory operations performed by this CPU as if they
>>  	were carried out in program order.
>> -\item[Super-Scalar CPU:]
>> +\item[Superscalar CPU:]
>>  	A scalar (non-vector) CPU capable of executing multiple instructions
>>  	concurrently.
>>  	This is a step up from a pipelined CPU that executes multiple
>> -	instructions in an assembly-line fashion---in a super-scalar
>> +	instructions in an assembly-line fashion---in a superscalar
>>  	CPU, each stage of the pipeline would be capable of handling
>>  	more than one instruction.
>>  	For example, if the conditions were exactly right,
>> -- 
>> 2.17.1
>>



* Re: [PATCH 0/2] Minor updates
  2019-12-09 12:50           ` Akira Yokosawa
@ 2019-12-09 18:06             ` Paul E. McKenney
  2019-12-09 22:11               ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-12-09 18:06 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
> > On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
> >> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> >>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> >>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> >>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> This patch set fixes minor issues I noticed while reading your
> >>>>>> recent updates.
> >>>>>
> >>>>> Queued and pushed, along with a fix to another of my typos, thank
> >>>>> you very much!
> >>>>>
> >>>>>> Apart from the changes, I'd like you to mention in the answer to
> >>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >>>>>> instructions directly, but decode them into uOPs (via MOP) and
> >>>>>> keep them in a uOP cache [1].
> >>>>>> So the execution cycle is not necessarily corresponds to instruction
> >>>>>> count, but heavily depends on the behavior of the microarch, which
> >>>>>> is not predictable without actually running the code. 
> >>>>>>
> >>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> >>>>>
> >>>>> My thought is that I should review the "Hardware and it Habits" chapter,
> >>>>> add this information if it is not already present, and then make the
> >>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
> >>>>
> >>>> Yes, it sounds quite reasonable!
> >>>>
> >>>> (Skimming through the chapter...)
> >>>>
> >>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
> >>>> memory sub-systems.
> >>>>
> >>>> Modern Intel architectures can be thought of as superscalar RISC
> >>>> processors which emulate x86 ISA. The transformation of x86 instructions
> >>>> into uOPs can be thought of as another layer of optimization
> >>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
> >>>>
> >>>> But deep-diving this topic would cost you another chapter/appendix.
> >>>> I'm not sure if it's worthwhile for perfbook.
> >>>> Maybe it would suffice to lightly touch the difficulty of
> >>>> predicting execution cycles of particular instruction streams
> >>>> on modern microprocessors (not limited to Intel's), and put
> >>>> a few citations of textbooks/reference manuals.
> >>>
> >>> What I did was to add a rough diagram and a paragraph or two of
> >>> explanation to Section 3.1.1, then add a reference to that section
> >>> in the Quick Quiz.
> >>
> >> I'd like to see a couple of more keywords to be mentioned here other
> >> than "pipeline".  "Super-scalar" is present in Glossary, but
> >> "Superscalar" looks much common these days. Appended below is
> >> a tentative patch I made to show you my idea. Please feel free
> >> to edit as you'd like before applying it.
> >>
> >> Another point I'd like to suggest.
> >> Figure 9.23 and the following figures still show the result on
> >> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >> system corresponding to them. Can you add info on the HW system
> >> where those 16 CPU results were obtained in the beginning of
> >> Section 9.5.4.2?
> >>
> >>         Thanks, Akira
> >>
> >> -------------8<-------------------
> >> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
> >> From: Akira Yokosawa <akiyks@gmail.com>
> >> Date: Mon, 9 Dec 2019 00:23:59 +0900
> >> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
> >>
> >> Also remove "-" from "Super-scaler" in Glossary.
> >>
> >> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> > 
> > Good points, thank you!
> > 
> > Applied with a few inevitable edits.  ;-)
> 
> Quite a few edits!  Thank you.
> 
> Let me reiterate my earlier suggestion:
> 
> >> Another point I'd like to suggest.
> >> Figure 9.23 and the following figures still show the result on
> >> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >> system corresponding to them. Can you add info on the HW system
> >> where those 16 CPU results were obtained in the beginning of
> >> Section 9.5.4.2?
> 
> Can you look into this as well?

There are a few build issues, but the main problem has been that I have
needed to use that system to verify Linux-kernel fixes.  The intent
is to regenerate most and maybe all of the results on the large system
over time.

But I added the system's info in the meantime.  ;-)

							Thanx, Paul

>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >> ---
> >>  cpu/overview.tex   | 22 ++++++++++++++--------
> >>  defer/rcuusage.tex |  2 +-
> >>  glossary.tex       |  4 ++--
> >>  3 files changed, 17 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/cpu/overview.tex b/cpu/overview.tex
> >> index b80f47c1..191c1c68 100644
> >> --- a/cpu/overview.tex
> >> +++ b/cpu/overview.tex
> >> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
> >>  decoded it, and executed it, typically taking \emph{at least} three
> >>  clock cycles to complete one instruction before proceeding to the next.
> >>  In contrast, the CPU of the late 1990s and of the 2000s execute
> >> -many instructions simultaneously, using a deep \emph{pipeline} to control
> >> +many instructions simultaneously, using a combination of approaches
> >> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
> >> +and \emph{speculative} execution, to control
> >>  the flow of instructions internally to the CPU.
> >>  Some cores have more than one hardware thread, which is variously called
> >>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
> >> -(HT)~\cite{JFennel1973SMT}.
> >> +(HT)~\cite{JFennel1973SMT},
> >>  each of which appears as
> >>  an independent CPU to software, at least from a functional viewpoint.
> >>  These modern hardware features can greatly improve performance, as
> >> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
> >>  \end{figure}
> >>  
> >>  This gets even worse in the increasingly common case of hyperthreading
> >> -(or SMT, if you prefer).
> >> +(or SMT, if you prefer).\footnote{
> >> +	Superscalar is involved in most cases, too.
> >> +}
> >>  In this case, all the hardware threads sharing a core also share that
> >>  core's resources, including registers, cache, execution units, and so on.
> >> -The instruction streams are decoded into micro-operations, and use of the
> >> -shared execution units and the hundreds of hardware registers is coordinated
> >> +The instruction streams might be decoded into micro-operations,
> >> +and use of the shared execution units and the hundreds of hardware
> >> +registers can be coordinated
> >>  by a micro-operation scheduler.
> >> -A rough diagram of a two-threaded core is shown in
> >> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >> +A rough diagram of such a two-threaded core is shown in
> >> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>  and more accurate (and thus more complex) diagrams are available in
> >>  textbooks and scholarly papers.\footnote{
> >>  	Here is one example for a late-2010s Intel CPU:
> >> @@ -123,7 +128,8 @@ of clairvoyance.
> >>  In particular, adding an instruction to a tight loop can sometimes
> >>  actually speed up execution, counterintuitive though that might be.
> >>  
> >> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
> >> +Unfortunately, pipeline flushes and shared-resource contentions
> >> +are not the only hazards in the obstacle
> >>  course that modern CPUs must run.
> >>  The next section covers the hazards of referencing memory.
> >>  
> >> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
> >> index 7fe633c3..fa04ddb6 100644
> >> --- a/defer/rcuusage.tex
> >> +++ b/defer/rcuusage.tex
> >> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
> >>  	are long gone.
> >>  
> >>  	But those of you who read
> >> -	\Cref{sec:cpu:Pipelined CPUs}
> >> +	\cref{sec:cpu:Pipelined CPUs}
> >>  	carefully already knew all of this!
> >>  
> >>  	These counter-intuitive results of course means that any
> >> diff --git a/glossary.tex b/glossary.tex
> >> index c10ffe4e..4a3aa796 100644
> >> --- a/glossary.tex
> >> +++ b/glossary.tex
> >> @@ -382,11 +382,11 @@
> >>  	as well as its cache so as to ensure that the software sees
> >>  	the memory operations performed by this CPU as if they
> >>  	were carried out in program order.
> >> -\item[Super-Scalar CPU:]
> >> +\item[Superscalar CPU:]
> >>  	A scalar (non-vector) CPU capable of executing multiple instructions
> >>  	concurrently.
> >>  	This is a step up from a pipelined CPU that executes multiple
> >> -	instructions in an assembly-line fashion---in a super-scalar
> >> +	instructions in an assembly-line fashion---in a superscalar
> >>  	CPU, each stage of the pipeline would be capable of handling
> >>  	more than one instruction.
> >>  	For example, if the conditions were exactly right,
> >> -- 
> >> 2.17.1
> >>
> 


* Re: [PATCH 0/2] Minor updates
  2019-12-09 18:06             ` Paul E. McKenney
@ 2019-12-09 22:11               ` Akira Yokosawa
  2019-12-10  0:08                 ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-09 22:11 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On 2019/12/10 3:06, Paul E. McKenney wrote:
> On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
>> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
>>> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
>>>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
>>>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>>>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> This patch set fixes minor issues I noticed while reading your
>>>>>>>> recent updates.
>>>>>>>
>>>>>>> Queued and pushed, along with a fix to another of my typos, thank
>>>>>>> you very much!
>>>>>>>
>>>>>>>> Apart from the changes, I'd like you to mention in the answer to
>>>>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>>>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>>>>>> keep them in a uOP cache [1].
>>>>>>>> So the execution cycle is not necessarily corresponds to instruction
>>>>>>>> count, but heavily depends on the behavior of the microarch, which
>>>>>>>> is not predictable without actually running the code. 
>>>>>>>>
>>>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>>>>>
>>>>>>> My thought is that I should review the "Hardware and it Habits" chapter,
>>>>>>> add this information if it is not already present, and then make the
>>>>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
>>>>>>
>>>>>> Yes, it sounds quite reasonable!
>>>>>>
>>>>>> (Skimming through the chapter...)
>>>>>>
>>>>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
>>>>>> memory sub-systems.
>>>>>>
>>>>>> Modern Intel architectures can be thought of as superscalar RISC
>>>>>> processors which emulate x86 ISA. The transformation of x86 instructions
>>>>>> into uOPs can be thought of as another layer of optimization
>>>>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
>>>>>>
>>>>>> But deep-diving this topic would cost you another chapter/appendix.
>>>>>> I'm not sure if it's worthwhile for perfbook.
>>>>>> Maybe it would suffice to lightly touch the difficulty of
>>>>>> predicting execution cycles of particular instruction streams
>>>>>> on modern microprocessors (not limited to Intel's), and put
>>>>>> a few citations of textbooks/reference manuals.
>>>>>
>>>>> What I did was to add a rough diagram and a paragraph or two of
>>>>> explanation to Section 3.1.1, then add a reference to that section
>>>>> in the Quick Quiz.
>>>>
>>>> I'd like to see a couple of more keywords to be mentioned here other
>>>> than "pipeline".  "Super-scalar" is present in Glossary, but
>>>> "Superscalar" looks much common these days. Appended below is
>>>> a tentative patch I made to show you my idea. Please feel free
>>>> to edit as you'd like before applying it.
>>>>
>>>> Another point I'd like to suggest.
>>>> Figure 9.23 and the following figures still show the result on
>>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>>>> system corresponding to them. Can you add info on the HW system
>>>> where those 16 CPU results were obtained in the beginning of
>>>> Section 9.5.4.2?
>>>>
>>>>         Thanks, Akira
>>>>
>>>> -------------8<-------------------
>>>> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>> Date: Mon, 9 Dec 2019 00:23:59 +0900
>>>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
>>>>
>>>> Also remove "-" from "Super-scaler" in Glossary.
>>>>
>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>
>>> Good points, thank you!
>>>
>>> Applied with a few inevitable edits.  ;-)
>>
>> Quite a few edits!  Thank you.
>>
>> Let me reiterate my earlier suggestion:
>>
>>>> Another point I'd like to suggest.
>>>> Figure 9.23 and the following figures still show the result on
>>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>>>> system corresponding to them. Can you add info on the HW system
>>>> where those 16 CPU results were obtained in the beginning of
>>>> Section 9.5.4.2?
>>
>> Can you look into this as well?
> 
> There are a few build issues, but the main problem has been that I have
> needed to use that system to verify Linux-kernel fixes.  The intent
> is to regenerate most and maybe all of the results on the large system
> over time.

I said "difficult" because of the counterintuitive variation in
cycle counts you encountered with the additional "lea" instruction.
You will need to eliminate such variations to evaluate the cost
of RCU, I suppose.
It looks like Intel processors are sensitive to the alignment of
branch targets.
(I think you know the matter better than I do, but I could not help
mentioning it.)
For example: https://stackoverflow.com/questions/18113995/

> 
> But I added the system's info in the meantime.  ;-)

Which generation of Intel x86 system was it?

        Thanks, Akira

> 
> 							Thanx, Paul
> 
>>         Thanks, Akira
>>
>>>
>>> 							Thanx, Paul
>>>
>>>> ---
>>>>  cpu/overview.tex   | 22 ++++++++++++++--------
>>>>  defer/rcuusage.tex |  2 +-
>>>>  glossary.tex       |  4 ++--
>>>>  3 files changed, 17 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/cpu/overview.tex b/cpu/overview.tex
>>>> index b80f47c1..191c1c68 100644
>>>> --- a/cpu/overview.tex
>>>> +++ b/cpu/overview.tex
>>>> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
>>>>  decoded it, and executed it, typically taking \emph{at least} three
>>>>  clock cycles to complete one instruction before proceeding to the next.
>>>>  In contrast, the CPU of the late 1990s and of the 2000s execute
>>>> -many instructions simultaneously, using a deep \emph{pipeline} to control
>>>> +many instructions simultaneously, using a combination of approaches
>>>> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
>>>> +and \emph{speculative} execution, to control
>>>>  the flow of instructions internally to the CPU.
>>>>  Some cores have more than one hardware thread, which is variously called
>>>>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
>>>> -(HT)~\cite{JFennel1973SMT}.
>>>> +(HT)~\cite{JFennel1973SMT},
>>>>  each of which appears as
>>>>  an independent CPU to software, at least from a functional viewpoint.
>>>>  These modern hardware features can greatly improve performance, as
>>>> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
>>>>  \end{figure}
>>>>  
>>>>  This gets even worse in the increasingly common case of hyperthreading
>>>> -(or SMT, if you prefer).
>>>> +(or SMT, if you prefer).\footnote{
>>>> +	Superscalar is involved in most cases, too.
>>>> +}
>>>>  In this case, all the hardware threads sharing a core also share that
>>>>  core's resources, including registers, cache, execution units, and so on.
>>>> -The instruction streams are decoded into micro-operations, and use of the
>>>> -shared execution units and the hundreds of hardware registers is coordinated
>>>> +The instruction streams might be decoded into micro-operations,
>>>> +and use of the shared execution units and the hundreds of hardware
>>>> +registers can be coordinated
>>>>  by a micro-operation scheduler.
>>>> -A rough diagram of a two-threaded core is shown in
>>>> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
>>>> +A rough diagram of such a two-threaded core is shown in
>>>> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
>>>>  and more accurate (and thus more complex) diagrams are available in
>>>>  textbooks and scholarly papers.\footnote{
>>>>  	Here is one example for a late-2010s Intel CPU:
>>>> @@ -123,7 +128,8 @@ of clairvoyance.
>>>>  In particular, adding an instruction to a tight loop can sometimes
>>>>  actually speed up execution, counterintuitive though that might be.
>>>>  
>>>> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
>>>> +Unfortunately, pipeline flushes and shared-resource contentions
>>>> +are not the only hazards in the obstacle
>>>>  course that modern CPUs must run.
>>>>  The next section covers the hazards of referencing memory.
>>>>  
>>>> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
>>>> index 7fe633c3..fa04ddb6 100644
>>>> --- a/defer/rcuusage.tex
>>>> +++ b/defer/rcuusage.tex
>>>> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
>>>>  	are long gone.
>>>>  
>>>>  	But those of you who read
>>>> -	\Cref{sec:cpu:Pipelined CPUs}
>>>> +	\cref{sec:cpu:Pipelined CPUs}
>>>>  	carefully already knew all of this!
>>>>  
>>>>  	These counter-intuitive results of course means that any
>>>> diff --git a/glossary.tex b/glossary.tex
>>>> index c10ffe4e..4a3aa796 100644
>>>> --- a/glossary.tex
>>>> +++ b/glossary.tex
>>>> @@ -382,11 +382,11 @@
>>>>  	as well as its cache so as to ensure that the software sees
>>>>  	the memory operations performed by this CPU as if they
>>>>  	were carried out in program order.
>>>> -\item[Super-Scalar CPU:]
>>>> +\item[Superscalar CPU:]
>>>>  	A scalar (non-vector) CPU capable of executing multiple instructions
>>>>  	concurrently.
>>>>  	This is a step up from a pipelined CPU that executes multiple
>>>> -	instructions in an assembly-line fashion---in a super-scalar
>>>> +	instructions in an assembly-line fashion---in a superscalar
>>>>  	CPU, each stage of the pipeline would be capable of handling
>>>>  	more than one instruction.
>>>>  	For example, if the conditions were exactly right,
>>>> -- 
>>>> 2.17.1
>>>>
>>



* Re: [PATCH 0/2] Minor updates
  2019-12-09 22:11               ` Akira Yokosawa
@ 2019-12-10  0:08                 ` Paul E. McKenney
  2019-12-10 12:32                   ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-12-10  0:08 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Tue, Dec 10, 2019 at 07:11:10AM +0900, Akira Yokosawa wrote:
> On 2019/12/10 3:06, Paul E. McKenney wrote:
> > On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
> >> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
> >>> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
> >>>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> >>>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> >>>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> >>>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> This patch set fixes minor issues I noticed while reading your
> >>>>>>>> recent updates.
> >>>>>>>
> >>>>>>> Queued and pushed, along with a fix to another of my typos, thank
> >>>>>>> you very much!
> >>>>>>>
> >>>>>>>> Apart from the changes, I'd like you to mention in the answer to
> >>>>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >>>>>>>> instructions directly, but decode them into uOPs (via MOP) and
> >>>>>>>> keep them in a uOP cache [1].
> >>>>>>>> So the execution cycle is not necessarily corresponds to instruction
> >>>>>>>> count, but heavily depends on the behavior of the microarch, which
> >>>>>>>> is not predictable without actually running the code. 
> >>>>>>>>
> >>>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> >>>>>>>
> >>>>>>> My thought is that I should review the "Hardware and it Habits" chapter,
> >>>>>>> add this information if it is not already present, and then make the
> >>>>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
> >>>>>>
> >>>>>> Yes, it sounds quite reasonable!
> >>>>>>
> >>>>>> (Skimming through the chapter...)
> >>>>>>
> >>>>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
> >>>>>> memory sub-systems.
> >>>>>>
> >>>>>> Modern Intel architectures can be thought of as superscalar RISC
> >>>>>> processors which emulate x86 ISA. The transformation of x86 instructions
> >>>>>> into uOPs can be thought of as another layer of optimization
> >>>>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
> >>>>>>
> >>>>>> But deep-diving this topic would cost you another chapter/appendix.
> >>>>>> I'm not sure if it's worthwhile for perfbook.
> >>>>>> Maybe it would suffice to lightly touch the difficulty of
> >>>>>> predicting execution cycles of particular instruction streams
> >>>>>> on modern microprocessors (not limited to Intel's), and put
> >>>>>> a few citations of textbooks/reference manuals.
> >>>>>
> >>>>> What I did was to add a rough diagram and a paragraph or two of
> >>>>> explanation to Section 3.1.1, then add a reference to that section
> >>>>> in the Quick Quiz.
> >>>>
> >>>> I'd like to see a couple of more keywords to be mentioned here other
> >>>> than "pipeline".  "Super-scalar" is present in Glossary, but
> >>>> "Superscalar" looks much common these days. Appended below is
> >>>> a tentative patch I made to show you my idea. Please feel free
> >>>> to edit as you'd like before applying it.
> >>>>
> >>>> Another point I'd like to suggest.
> >>>> Figure 9.23 and the following figures still show the result on
> >>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >>>> system corresponding to them. Can you add info on the HW system
> >>>> where those 16 CPU results were obtained in the beginning of
> >>>> Section 9.5.4.2?
> >>>>
> >>>>         Thanks, Akira
> >>>>
> >>>> -------------8<-------------------
> >>>> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
> >>>> From: Akira Yokosawa <akiyks@gmail.com>
> >>>> Date: Mon, 9 Dec 2019 00:23:59 +0900
> >>>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
> >>>>
> >>>> Also remove "-" from "Super-scaler" in Glossary.
> >>>>
> >>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>>
> >>> Good points, thank you!
> >>>
> >>> Applied with a few inevitable edits.  ;-)
> >>
> >> Quite a few edits!  Thank you.
> >>
> >> Let me reiterate my earlier suggestion:
> >>
> >>>> Another point I'd like to suggest.
> >>>> Figure 9.23 and the following figures still show the result on
> >>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >>>> system corresponding to them. Can you add info on the HW system
> >>>> where those 16 CPU results were obtained in the beginning of
> >>>> Section 9.5.4.2?
> >>
> >> Can you look into this as well?
> > 
> > There are a few build issues, but the main problem has been that I have
> > needed to use that system to verify Linux-kernel fixes.  The intent
> > is to regenerate most and maybe all of the results on the large system
> > over time.
> 
> I said "difficult" because of the counterintuitive variation of
> cycles you've encountered by the additional "lea" instruction.
> You will need to eliminate such variations to evaluate the cost
> of RCU, I suppose.
> Looks like Intel processors are sensitive to alignment of branch targets.
> (I think you know the matter better than me, but I could not help.)
> For example: https://stackoverflow.com/questions/18113995/

It does indeed get complicated.  ;-)

Another experiment on the to-do list is to move the rcu_head structure to
the end, which should eliminate that extra lea instruction.  I am planning
to add that discussion to the answer to the more-than-ideal quick quiz.

> > But I added the system's info in the meantime.  ;-)
> 
> Which generation of Intel x86 system was it?

I don't know, as that was before I got smart and started capturing
/proc/cpuinfo.  It was quite old, probably produced in 2010 or so.
Maybe even earlier.

Which is another good reason to rerun those results, but I don't see
this as blocking the release.
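For whoever reruns those results, a couple of lines in the test harness
suffice to record the CPU identification next to the output (the
results directory name is illustrative, and the fallback covers hosts
without /proc/cpuinfo):

```shell
# Record CPU identification alongside benchmark output so the
# microarchitecture can be determined later.
mkdir -p results
if [ -r /proc/cpuinfo ]; then
	cp /proc/cpuinfo results/cpuinfo.txt
	grep -m1 'model name' /proc/cpuinfo > results/cpu-summary.txt \
		|| uname -m >> results/cpu-summary.txt
else
	echo "cpuinfo unavailable on this host" > results/cpu-summary.txt
fi
cat results/cpu-summary.txt
```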

							Thanx, Paul

>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >>         Thanks, Akira
> >>
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> ---
> >>>>  cpu/overview.tex   | 22 ++++++++++++++--------
> >>>>  defer/rcuusage.tex |  2 +-
> >>>>  glossary.tex       |  4 ++--
> >>>>  3 files changed, 17 insertions(+), 11 deletions(-)
> >>>>
> >>>> diff --git a/cpu/overview.tex b/cpu/overview.tex
> >>>> index b80f47c1..191c1c68 100644
> >>>> --- a/cpu/overview.tex
> >>>> +++ b/cpu/overview.tex
> >>>> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
> >>>>  decoded it, and executed it, typically taking \emph{at least} three
> >>>>  clock cycles to complete one instruction before proceeding to the next.
> >>>>  In contrast, the CPU of the late 1990s and of the 2000s execute
> >>>> -many instructions simultaneously, using a deep \emph{pipeline} to control
> >>>> +many instructions simultaneously, using a combination of approaches
> >>>> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
> >>>> +and \emph{speculative} execution, to control
> >>>>  the flow of instructions internally to the CPU.
> >>>>  Some cores have more than one hardware thread, which is variously called
> >>>>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
> >>>> -(HT)~\cite{JFennel1973SMT}.
> >>>> +(HT)~\cite{JFennel1973SMT},
> >>>>  each of which appears as
> >>>>  an independent CPU to software, at least from a functional viewpoint.
> >>>>  These modern hardware features can greatly improve performance, as
> >>>> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
> >>>>  \end{figure}
> >>>>  
> >>>>  This gets even worse in the increasingly common case of hyperthreading
> >>>> -(or SMT, if you prefer).
> >>>> +(or SMT, if you prefer).\footnote{
> >>>> +	Superscalar is involved in most cases, too.
> >>>> +}
> >>>>  In this case, all the hardware threads sharing a core also share that
> >>>>  core's resources, including registers, cache, execution units, and so on.
> >>>> -The instruction streams are decoded into micro-operations, and use of the
> >>>> -shared execution units and the hundreds of hardware registers is coordinated
> >>>> +The instruction streams might be decoded into micro-operations,
> >>>> +and use of the shared execution units and the hundreds of hardware
> >>>> +registers can be coordinated
> >>>>  by a micro-operation scheduler.
> >>>> -A rough diagram of a two-threaded core is shown in
> >>>> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>>> +A rough diagram of such a two-threaded core is shown in
> >>>> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>>>  and more accurate (and thus more complex) diagrams are available in
> >>>>  textbooks and scholarly papers.\footnote{
> >>>>  	Here is one example for a late-2010s Intel CPU:
> >>>> @@ -123,7 +128,8 @@ of clairvoyance.
> >>>>  In particular, adding an instruction to a tight loop can sometimes
> >>>>  actually speed up execution, counterintuitive though that might be.
> >>>>  
> >>>> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
> >>>> +Unfortunately, pipeline flushes and shared-resource contentions
> >>>> +are not the only hazards in the obstacle
> >>>>  course that modern CPUs must run.
> >>>>  The next section covers the hazards of referencing memory.
> >>>>  
> >>>> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
> >>>> index 7fe633c3..fa04ddb6 100644
> >>>> --- a/defer/rcuusage.tex
> >>>> +++ b/defer/rcuusage.tex
> >>>> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
> >>>>  	are long gone.
> >>>>  
> >>>>  	But those of you who read
> >>>> -	\Cref{sec:cpu:Pipelined CPUs}
> >>>> +	\cref{sec:cpu:Pipelined CPUs}
> >>>>  	carefully already knew all of this!
> >>>>  
> >>>>  	These counter-intuitive results of course means that any
> >>>> diff --git a/glossary.tex b/glossary.tex
> >>>> index c10ffe4e..4a3aa796 100644
> >>>> --- a/glossary.tex
> >>>> +++ b/glossary.tex
> >>>> @@ -382,11 +382,11 @@
> >>>>  	as well as its cache so as to ensure that the software sees
> >>>>  	the memory operations performed by this CPU as if they
> >>>>  	were carried out in program order.
> >>>> -\item[Super-Scalar CPU:]
> >>>> +\item[Superscalar CPU:]
> >>>>  	A scalar (non-vector) CPU capable of executing multiple instructions
> >>>>  	concurrently.
> >>>>  	This is a step up from a pipelined CPU that executes multiple
> >>>> -	instructions in an assembly-line fashion---in a super-scalar
> >>>> +	instructions in an assembly-line fashion---in a superscalar
> >>>>  	CPU, each stage of the pipeline would be capable of handling
> >>>>  	more than one instruction.
> >>>>  	For example, if the conditions were exactly right,
> >>>> -- 
> >>>> 2.17.1
> >>>>
> >>
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Minor updates
  2019-12-10  0:08                 ` Paul E. McKenney
@ 2019-12-10 12:32                   ` Akira Yokosawa
  0 siblings, 0 replies; 13+ messages in thread
From: Akira Yokosawa @ 2019-12-10 12:32 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On 2019/12/10 9:08, Paul E. McKenney wrote:
> On Tue, Dec 10, 2019 at 07:11:10AM +0900, Akira Yokosawa wrote:
>> On 2019/12/10 3:06, Paul E. McKenney wrote:
>>> On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
>>>> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
>>>>> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
>>>>>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
>>>>>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>>>>>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>>>>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> This patch set fixes minor issues I noticed while reading your
>>>>>>>>>> recent updates.
>>>>>>>>>
>>>>>>>>> Queued and pushed, along with a fix to another of my typos, thank
>>>>>>>>> you very much!
>>>>>>>>>
>>>>>>>>>> Apart from the changes, I'd like you to mention in the answer to
>>>>>>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>>>>>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>>>>>>>> keep them in a uOP cache [1].
>>>>>>>>>> So the execution cycle is not necessarily corresponds to instruction
>>>>>>>>>> count, but heavily depends on the behavior of the microarch, which
>>>>>>>>>> is not predictable without actually running the code. 
>>>>>>>>>>
>>>>>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>>>>>>>
>>>>>>>>> My thought is that I should review the "Hardware and it Habits" chapter,
>>>>>>>>> add this information if it is not already present, and then make the
>>>>>>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
>>>>>>>>
>>>>>>>> Yes, it sounds quite reasonable!
>>>>>>>>
>>>>>>>> (Skimming through the chapter...)
>>>>>>>>
>>>>>>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
>>>>>>>> memory sub-systems.
>>>>>>>>
>>>>>>>> Modern Intel architectures can be thought of as superscalar RISC
>>>>>>>> processors which emulate x86 ISA. The transformation of x86 instructions
>>>>>>>> into uOPs can be thought of as another layer of optimization
>>>>>>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
>>>>>>>>
>>>>>>>> But deep-diving this topic would cost you another chapter/appendix.
>>>>>>>> I'm not sure if it's worthwhile for perfbook.
>>>>>>>> Maybe it would suffice to lightly touch the difficulty of
>>>>>>>> predicting execution cycles of particular instruction streams
>>>>>>>> on modern microprocessors (not limited to Intel's), and put
>>>>>>>> a few citations of textbooks/reference manuals.
>>>>>>>
>>>>>>> What I did was to add a rough diagram and a paragraph or two of
>>>>>>> explanation to Section 3.1.1, then add a reference to that section
>>>>>>> in the Quick Quiz.
>>>>>>
>>>>>> I'd like to see a couple of more keywords to be mentioned here other
>>>>>> than "pipeline".  "Super-scalar" is present in Glossary, but
>>>>>> "Superscalar" looks much common these days. Appended below is
>>>>>> a tentative patch I made to show you my idea. Please feel free
>>>>>> to edit as you'd like before applying it.
>>>>>>
>>>>>> Another point I'd like to suggest.
>>>>>> Figure 9.23 and the following figures still show the result on
>>>>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>>>>>> system corresponding to them. Can you add info on the HW system
>>>>>> where those 16 CPU results were obtained in the beginning of
>>>>>> Section 9.5.4.2?
>>>>>>
>>>>>>         Thanks, Akira
>>>>>>
>>>>>> -------------8<-------------------
>>>>>> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
>>>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>>>> Date: Mon, 9 Dec 2019 00:23:59 +0900
>>>>>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
>>>>>>
>>>>>> Also remove "-" from "Super-scaler" in Glossary.
>>>>>>
>>>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>>>
>>>>> Good points, thank you!
>>>>>
>>>>> Applied with a few inevitable edits.  ;-)
>>>>
>>>> Quite a few edits!  Thank you.
>>>>
>>>> Let me reiterate my earlier suggestion:
>>>>
>>>>>> Another point I'd like to suggest.
>>>>>> Figure 9.23 and the following figures still show the result on
>>>>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
>>>>>> system corresponding to them. Can you add info on the HW system
>>>>>> where those 16 CPU results were obtained in the beginning of
>>>>>> Section 9.5.4.2?
>>>>
>>>> Can you look into this as well?
>>>
>>> There are a few build issues, but the main problem has been that I have
>>> needed to use that system to verify Linux-kernel fixes.  The intent
>>> is to regenerate most and maybe all of the results on the large system
>>> over time.
>>
>> I said "difficult" because of the counterintuitive variation in
>> cycle counts caused by the additional "lea" instruction.
>> You will need to eliminate such variations to evaluate the cost
>> of RCU, I suppose.
>> Looks like Intel processors are sensitive to alignment of branch targets.
>> (I think you know the matter better than me, but I could not help.)
>> For example: https://stackoverflow.com/questions/18113995/
> 
> It does indeed get complicated.  ;-)
> 
> Another experiment on the todo list is to move the rcu_head structure to
> the end, which should eliminate that extra lea instruction.  I am planning
> to introduce that to the answer to the more-than-ideal quick quiz.

That sounds quite reasonable.

> 
>>> But I added the system's info in the meantime.  ;-)
>>
>> Which generation of Intel x86 system was it?
> 
> I don't know, as that was before I got smart and started capturing
> /proc/cpuinfo.  It was quite old, probably produced in 2010 or so.
> Maybe even earlier.

(Digging up the git history...)
Yes, this plot has existed ever since the first commit of perfbook.
And I won't blame you if you don't remember exactly what type of
machine you ran the performance tests on. x86 in 2008 means it was
pre-Nehalem, wasn't it?

There remains a table of data obtained on Nehalem in 2009,
which was added in commit 38fd945ff401 ("Fill out CPU chapter,
including adding Nehalem data.").

> 
> Which is another good reason to rerun those results, but I don't see
> this as blocking the release.

Agreed.

        Thanks, Akira

> 
> 							Thanx, Paul
> 
[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-12-10 12:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-07  4:05 [PATCH 0/2] Minor updates Akira Yokosawa
2019-12-07  4:06 ` [PATCH 1/2] toyrcu: Use mathcal O for 'orders of' Akira Yokosawa
2019-12-07  4:07 ` [PATCH 2/2] defer/rcuusage: Fix typo (that -> than) Akira Yokosawa
2019-12-07 16:43 ` [PATCH 0/2] Minor updates Paul E. McKenney
2019-12-07 23:41   ` Akira Yokosawa
2019-12-08  1:15     ` Paul E. McKenney
2019-12-08 15:54       ` Akira Yokosawa
2019-12-08 18:11         ` Paul E. McKenney
2019-12-09 12:50           ` Akira Yokosawa
2019-12-09 18:06             ` Paul E. McKenney
2019-12-09 22:11               ` Akira Yokosawa
2019-12-10  0:08                 ` Paul E. McKenney
2019-12-10 12:32                   ` Akira Yokosawa
