* The answer of Quiz C.8 is not quite reasonable
From: Hao Lee @ 2022-04-14 17:42 UTC
  To: paulmck; +Cc: perfbook

Hi,

At the beginning of C.3.3, we assume that the cache line containing "a"
resides _only_ in _CPU1's_ cache. I think this is why _CPU0_ has to send
a "_read_ invalidate" message to _retrieve_ the cache line and invalidate
CPU1's copy of it.

However, the answer says the reason is that the cache line in question
contains more than just the variable a. I can't see the logical
connection between this answer and the question. Am I missing
something here? Thanks.

Regards,
Hao Lee


* Re: The answer of Quiz C.8 is not quite reasonable
From: Paul E. McKenney @ 2022-04-17 17:44 UTC
  To: Hao Lee; +Cc: perfbook

On Thu, Apr 14, 2022 at 05:42:25PM +0000, Hao Lee wrote:
> Hi,
> 
> At the beginning of C.3.3, we assume that the cache line containing "a"
> resides _only_ in _CPU1's_ cache. I think this is why _CPU0_ has to send
> a "_read_ invalidate" message to _retrieve_ the cache line and invalidate
> CPU1's copy of it.
> 
> However, the answer says the reason is that the cache line in question
> contains more than just the variable a. I can't see the logical
> connection between this answer and the question. Am I missing
> something here? Thanks.

I added the commit shown below.  Does that help?

							Thanx, Paul

------------------------------------------------------------------------

commit 36fe14d5ebe406e331a5d89533fe3187d2019898
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Sun Apr 17 10:41:33 2022 -0700

    appendix/whymb: Clarify QQ C.8
    
    More clearly note the presence of data other than the variable a.
    
    Reported-by: Hao Lee <haolee.swjtu@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
index 8f607e35..43f1307b 100644
--- a/appendix/whymb/whymemorybarriers.tex
+++ b/appendix/whymb/whymemorybarriers.tex
@@ -821,9 +821,14 @@ Then the sequence of operations might be as follows:
 	In \cref{seq:app:whymb:Store Buffers and Memory Barriers} above,
 	why does CPU~0 need to issue a ``read invalidate''
 	rather than a simple ``invalidate''?
+	After all, \co{foo()} will overwrite \co{a} in any case, so why
+	should it care about the old value of \co{a}?
 }\QuickQuizAnswer{
-	Because the cache line in question contains more than just the
+	Because the cache line in question contains more data than just the
 	variable \co{a}.
+	Issuing ``invalidate'' instead of the needed ``read invalidate''
+	would cause that other data to be lost, which would constitute
+	a serious bug in the hardware.
 }\QuickQuizEnd
 
 The hardware designers cannot help directly here, since the CPUs have


* Re: The answer of Quiz C.8 is not quite reasonable
From: Hao Lee @ 2022-04-18  8:01 UTC
  To: Paul E. McKenney; +Cc: perfbook

On Sun, Apr 17, 2022 at 10:44:54AM -0700, Paul E. McKenney wrote:
> I added the commit shown below.  Does that help?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 36fe14d5ebe406e331a5d89533fe3187d2019898
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Sun Apr 17 10:41:33 2022 -0700
> 
>     appendix/whymb: Clarify QQ C.8
>     
>     More clearly note the presence of data other than the variable a.
>     
>     Reported-by: Hao Lee <haolee.swjtu@gmail.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> index 8f607e35..43f1307b 100644
> --- a/appendix/whymb/whymemorybarriers.tex
> +++ b/appendix/whymb/whymemorybarriers.tex
> @@ -821,9 +821,14 @@ Then the sequence of operations might be as follows:
>  	In \cref{seq:app:whymb:Store Buffers and Memory Barriers} above,
>  	why does CPU~0 need to issue a ``read invalidate''
>  	rather than a simple ``invalidate''?
> +	After all, \co{foo()} will overwrite \co{a} in any case, so why
> +	should it care about the old value of \co{a}?

Totally clear!

And we may also need to add some details to C.3.1:

	With the addition of these store buffers, CPU 0 can simply
	record its write in its store buffer and continue executing.
	When the cache line does finally make its way from CPU 1 to CPU
	0, the data will be moved from the store buffer to the cache
	line.

This passage explains why we need a store buffer, but I think the data
in the store buffer won't be moved directly to the cache line.
Instead, the store-buffer data must first be merged with the cache line
that CPU1 sends in response, and only then can the result be placed in
CPU0's cache.

Thanks,
Hao Lee



* Re: The answer of Quiz C.8 is not quite reasonable
From: Paul E. McKenney @ 2022-04-19 17:28 UTC
  To: Hao Lee; +Cc: perfbook

On Mon, Apr 18, 2022 at 08:01:17AM +0000, Hao Lee wrote:
> Totally clear!
> 
> And we may also need to add some details to C.3.1:
> 
> 	With the addition of these store buffers, CPU 0 can simply
> 	record its write in its store buffer and continue executing.
> 	When the cache line does finally make its way from CPU 1 to CPU
> 	0, the data will be moved from the store buffer to the cache
> 	line.
> 
> This passage explains why we need a store buffer, but I think the data
> in the store buffer won't be moved directly to the cache line.
> Instead, the store-buffer data must first be merged with the cache line
> that CPU1 sends in response, and only then can the result be placed in
> CPU0's cache.

You lost me here.

Ah, maybe the missing point is that store buffers do not necessarily
maintain full cache lines, but only the data that was actually stored.
Or, if the store buffer does contain full cache lines, it also contains
a mask to indicate what portions of the cache line need to be updated.

Does that help, or am I missing your point?

							Thanx, Paul



* Re: The answer of Quiz C.8 is not quite reasonable
From: Hao Lee @ 2022-04-20  6:45 UTC
  To: Paul E. McKenney; +Cc: perfbook

On Tue, Apr 19, 2022 at 10:28:50AM -0700, Paul E. McKenney wrote:
> You lost me here.
> 
> Ah, maybe the missing point is that store buffers do not necessarily
> maintain full cache lines, but only the data that was actually stored.

Yes! This is exactly what I wanted to say. I couldn't find any hardware
datasheet that illustrates the details, but I think the following process
is plausible:

The data at addresses 0x0-0xf exists only in CPU1's cache line,
and now CPU0 wants to write a byte at address 0x0. CPU0 writes the _byte_
into its store buffer and sends a "read invalidate" message to CPU1. When
CPU0 receives the whole cache line sent in response by CPU1, it needs to
overwrite the first byte of that cache line with the byte in its
store buffer, leaving the other 15 bytes untouched. Only then can the
"merged" cache line be placed in CPU0's cache.

> Or, if the store buffer does contain full cache lines, it also contains
> a mask to indicate what portions of the cache line need to be updated.

This scenario seems impossible to me, because CPU0 doesn't have the
content of the target cache line, so it can only record the changed bytes
in its store buffer.
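
For illustration, the byte-granular merge described earlier in this
message might be sketched in C as follows. This is only a toy model
under stated assumptions: the 16-byte line size comes from the 0x0-0xf
example, while the sb_entry structure and the merge_into_line()
function are hypothetical names invented for this sketch, not taken
from perfbook or from any real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 16 /* bytes per cache line, as in the 0x0-0xf example */

/* Hypothetical store-buffer entry: one pending single-byte store. */
struct sb_entry {
	uintptr_t addr;  /* full address of the stored byte */
	uint8_t value;   /* the byte the CPU wants to write */
};

/*
 * Merge pending stores into a cache line that has just arrived in
 * response to a "read invalidate".  Only the bytes named by
 * store-buffer entries are overwritten; every other byte keeps the
 * value supplied by the previous owner of the line.
 * Returns the number of entries that applied to this line.
 */
size_t merge_into_line(uint8_t line[LINE_SIZE], uintptr_t line_base,
		       const struct sb_entry *sb, size_t n)
{
	size_t applied = 0;

	assert(line_base % LINE_SIZE == 0);
	for (size_t i = 0; i < n; i++) {
		/* Unsigned wraparound makes this false for addr < line_base. */
		uintptr_t off = sb[i].addr - line_base;

		if (off < LINE_SIZE) {
			line[off] = sb[i].value; /* later entries win */
			applied++;
		}
	}
	return applied;
}
```

Given a line that arrives filled with 0xAA and a single pending store
of 0x5a to address 0x0, merge_into_line() changes only byte 0 and
leaves the other 15 bytes exactly as CPU1 supplied them.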


Thanks,
Hao Lee



* Re: The answer of Quiz C.8 is not quite reasonable
From: Paul E. McKenney @ 2022-04-20 18:15 UTC
  To: Hao Lee; +Cc: perfbook

On Wed, Apr 20, 2022 at 06:45:23AM +0000, Hao Lee wrote:
> Yes! This is exactly what I wanted to say. I couldn't find any hardware
> datasheet that illustrates the details, but I think the following process
> is plausible:
> 
> The data at addresses 0x0-0xf exists only in CPU1's cache line,
> and now CPU0 wants to write a byte at address 0x0. CPU0 writes the _byte_
> into its store buffer and sends a "read invalidate" message to CPU1. When
> CPU0 receives the whole cache line sent in response by CPU1, it needs to
> overwrite the first byte of that cache line with the byte in its
> store buffer, leaving the other 15 bytes untouched. Only then can the
> "merged" cache line be placed in CPU0's cache.

How about as in the commit shown below?

> > Or, if the store buffer does contain full cache lines, it also contains
> > a mask to indicate what portions of the cache line need to be updated.
> 
> This scenario seems impossible to me, because CPU0 doesn't have the
> content of the target cache line, so it can only record the changed bytes
> in its store buffer.

Well, there are many ways to record changed bytes.  One way would be
to have each store-buffer entry have double the bits of a cache line,
so that if each cache line is 64 bits, each store-buffer entry has
128 bits.  64 of those bits record the recently stored values, with
don't-care bits for any portions of that cache line that have not been
recently stored to by this CPU.  The other 64 bits are set to the value
1 if the corresponding bit has recently been stored to, and set to the
value zero otherwise.

The obvious disadvantage of this approach is the larger size of each
store-buffer entry.  The corresponding advantage is that the common
case of consecutive stores can usually be merged into a single
store-buffer entry.
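
As a rough sketch of this value-plus-mask scheme, assuming the 64-bit
cache line from the example above and treating byte offset 0 as the
low-order byte (both simplifications), something like the following C
could model one entry. The names sb_mask_entry, sb_store(), and
sb_merge() are hypothetical, and real hardware is of course far more
elaborate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry for a 64-bit cache line: 128 bits in total. */
struct sb_mask_entry {
	uint64_t value; /* recently stored bits; the rest are don't-care */
	uint64_t mask;  /* 1 = corresponding bit was recently stored to */
};

/*
 * Record a store of nbytes bytes at byte offset off within the line.
 * Consecutive stores to the same line accumulate in one entry.
 */
void sb_store(struct sb_mask_entry *e, unsigned off, unsigned nbytes,
	      uint64_t data)
{
	uint64_t m;

	assert(nbytes >= 1 && off + nbytes <= 8);
	/* Avoid an undefined 64-bit shift when the store covers the line. */
	m = (nbytes == 8) ? UINT64_MAX : (((uint64_t)1 << (8 * nbytes)) - 1);
	m <<= 8 * off;
	e->value = (e->value & ~m) | ((data << (8 * off)) & m);
	e->mask |= m;
}

/*
 * Merge the entry into a line that has just arrived: stored bits come
 * from the entry, all other bits are left as they arrived.
 */
uint64_t sb_merge(const struct sb_mask_entry *e, uint64_t line)
{
	return (line & ~e->mask) | (e->value & e->mask);
}
```

Two single-byte stores to offsets 0 and 1 end up in the same entry,
with mask 0xffff, which is the merging advantage mentioned above. The
cost is that every entry carries a full line of value bits plus a full
line of mask bits, however small the store.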

Again, how about the commit shown below?

							Thanx, Paul

------------------------------------------------------------------------

commit 475cc7fa460f60b0e518808c68890c8d63658d1c
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Apr 20 10:50:59 2022 -0700

    appendix/whymb: Store buffers and partial cache lines
    
    Reported-by: Hao Lee <haolee.swjtu@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
index aeaa4291..347635a4 100644
--- a/appendix/whymb/whymemorybarriers.tex
+++ b/appendix/whymb/whymemorybarriers.tex
@@ -12,7 +12,10 @@ So what possessed CPU designers to cause them to inflict \IXBpl{memory barrier}
 on poor unsuspecting SMP software designers?
 
 In short, because reordering memory references allows much better performance,
-and so memory barriers are needed to force ordering in things like
+courtesy of the finite speed of light and the non-zero size of atoms
+noted in \cref{sec:cpu:Overheads}, and particularly in the
+hardware-performance question posed by \QuickQuizRef{\QspeedOfLightAtoms}.
+Therefore, memory barriers are needed to force ordering in things like
 synchronization primitives whose correct operation depends on ordered
 memory references.
 
@@ -658,16 +661,55 @@ When the cache line does finally make its way from CPU~1 to CPU~0,
 the data will be moved from the store buffer to the cache line.
 
 \QuickQuiz{
-	But if the main purpose of store buffers is to hide acknowledgment
-	latencies in multiprocessor cache-coherence protocols, why
-	do uniprocessors also have store buffers?
+	But then why do uniprocessors also have store buffers?
 }\QuickQuizAnswer{
 	Because the purpose of store buffers is not just to hide
 	acknowledgement latencies in multiprocessor cache-coherence protocols,
 	but to hide memory latencies in general.
 	Because memory is much slower than is cache on uniprocessors,
 	store buffers on uniprocessors can help to hide write-miss
-	latencies.
+	memory latencies.
+}\QuickQuizEnd
+
+Please note that the store buffer does not necessarily operate on
+full cache lines.
+The reason for this is that a given store-buffer entry need only contain
+the value stored, not the other data contained in the corresponding
+cache line.
+Which is a good thing, because the CPU doing the store has no idea
+what that other data might be!
+But once the corresponding cache line arrives, any values from the
+store buffer that update that cache line can be merged into it,
+and the corresponding entries can then be removed from the store buffer.
+Any other data in that cache line is of course left intact.
+
+\QuickQuiz{
+	So store-buffer entries are variable length?
+	Isn't that difficult to implement in hardware?
+}\QuickQuizAnswer{
+	Here are two ways for hardware to easily handle variable-length
+	stores.
+
+	First, each store-buffer entry could be a single byte wide.
+	Then an 64-bit store would consume eight store-buffer entries.
+	This approach is simple and flexible, but one disadvantage is
+	that each entry would need to replicate much of the address that
+	was stored to.
+
+	Second, each store-buffer entry could be double the size of a
+	cache line, with half of the bits containing the values stored,
+	and the other half indicating which bits had been stored to.
+	So, assuming a 32-bit cache line, a single-byte store of 0x5a
+	to the low-order byte of a given cache line would result in
+	\co{0xXXXXXX5a} for the first half and \co{0x000000ff} for the
+	second half, where the values labeled \co{X} are arbitrary
+	because they would be ignored.
+	This approach allows multiple consecutive stores corresponding to
+	a given cache line to be merged into a single store-buffer entry,
+	but is space-inefficient for random stores of single bytes.
+
+	Much more complex and efficient schemes are of course used
+	by actual hardware designers.
 }\QuickQuizEnd
 
 \begin{figure}
diff --git a/cpu/overheads.tex b/cpu/overheads.tex
index b8a65faa..c9f5f1f7 100644
--- a/cpu/overheads.tex
+++ b/cpu/overheads.tex
@@ -425,6 +425,8 @@ thousand clock cycles.
 	able to do to ease the plight of parallel programmers.
 }\QuickQuizEnd
 
+\QuickQuizLabel{\QspeedOfLightAtoms}
+
 \begin{table}
 \rowcolors{1}{}{lightgray}
 \renewcommand*{\arraystretch}{1.1}


* Re: The answer of Quiz C.8 is not quite reasonable
From: Hao Lee @ 2022-04-21 13:34 UTC

On Wed, Apr 20, 2022 at 11:15:53AM -0700, Paul E. McKenney wrote:
> Well, there are many ways to record changed bytes.  One way would be
> to have each store-buffer entry have double the bits of a cache line,
> so that if each cache line is 64 bits, each store-buffer entry has
> 128 bits.  64 of those bits record the recently stored values, with
> don't-care bits for any portions of that cache line that have not been
> recently stored to by this CPU.  The other 64 bits are set to the value
> 1 if the corresponding bit has recently been stored to, and set to the
> value zero otherwise.
> 
> The obvious disadvantage of this approach is the larger size of each
> store-buffer entry.  The corresponding advantage is that the common
> case of consecutive stores can usually be merged into a single
> store-buffer entry.

Thanks for elaborating on these details! Pretty clear!

> 
> Again, how about the commit shown below?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 475cc7fa460f60b0e518808c68890c8d63658d1c
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Wed Apr 20 10:50:59 2022 -0700
> 
>     appendix/whymb: Store buffers and partial cache lines
>     
>     Reported-by: Hao Lee <haolee.swjtu@gmail.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> index aeaa4291..347635a4 100644
> --- a/appendix/whymb/whymemorybarriers.tex
> +++ b/appendix/whymb/whymemorybarriers.tex
> @@ -12,7 +12,10 @@ So what possessed CPU designers to cause them to inflict \IXBpl{memory barrier}
>  on poor unsuspecting SMP software designers?
>  
>  In short, because reordering memory references allows much better performance,
> -and so memory barriers are needed to force ordering in things like
> +courtesy of the finite speed of light and the non-zero size of atoms
> +noted in \cref{sec:cpu:Overheads}, and particularly in the
> +hardware-performance question posed by \QuickQuizRef{\QspeedOfLightAtoms}.
> +Therefore, memory barriers are needed to force ordering in things like
>  synchronization primitives whose correct operation depends on ordered
>  memory references.
>  
> @@ -658,16 +661,55 @@ When the cache line does finally make its way from CPU~1 to CPU~0,
>  the data will be moved from the store buffer to the cache line.
>  
>  \QuickQuiz{
> -	But if the main purpose of store buffers is to hide acknowledgment
> -	latencies in multiprocessor cache-coherence protocols, why
> -	do uniprocessors also have store buffers?
> +	But then why do uniprocessors also have store buffers?
>  }\QuickQuizAnswer{
>  	Because the purpose of store buffers is not just to hide
>  	acknowledgement latencies in multiprocessor cache-coherence protocols,
>  	but to hide memory latencies in general.
>  	Because memory is much slower than is cache on uniprocessors,
>  	store buffers on uniprocessors can help to hide write-miss
> -	latencies.
> +	memory latencies.
> +}\QuickQuizEnd
> +
> +Please note that the store buffer does not necessarily operate on
> +full cache lines.
> +The reason for this is that a given store-buffer entry need only contain
> +the value stored, not the other data contained in the corresponding
> +cache line.
> +Which is a good thing, because the CPU doing the store has no idea
> +what that other data might be!
> +But once the corresponding cache line arrives, any values from the
> +store buffer that update that cache line can be merged into it,
> +and the corresponding entries can then be removed from the store buffer.
> +Any other data in that cache line is of course left intact.
> +
> +\QuickQuiz{
> +	So store-buffer entries are variable length?
> +	Isn't that difficult to implement in hardware?
> +}\QuickQuizAnswer{
> +	Here are two ways for hardware to easily handle variable-length
> +	stores.
> +
> +	First, each store-buffer entry could be a single byte wide.
> +	Then an 64-bit store would consume eight store-buffer entries.
> +	This approach is simple and flexible, but one disadvantage is
> +	that each entry would need to replicate much of the address that
> +	was stored to.
> +
> +	Second, each store-buffer entry could be double the size of a
> +	cache line, with half of the bits containing the values stored,
> +	and the other half indicating which bits had been stored to.
> +	So, assuming a 32-bit cache line, a single-byte store of 0x5a
> +	to the low-order byte of a given cache line would result in
> +	\co{0xXXXXXX5a} for the first half and \co{0x000000ff} for the
> +	second half, where the values labeled \co{X} are arbitrary
> +	because they would be ignored.
> +	This approach allows multiple consecutive stores corresponding to
> +	a given cache line to be merged into a single store-buffer entry,
> +	but is space-inefficient for random stores of single bytes.

This commit and these passages have clarified everything!
Thank you for your hard work!


Regards,
Hao Lee



