* The weird re-ordering issue of the Alpha arch'
@ 2017-04-29 14:26 Yubin Ruan
  2017-05-01 15:58 ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Yubin Ruan @ 2017-04-29 14:26 UTC (permalink / raw)
  To: perfbook; +Cc: Paul E. McKenney

Hi,
Remember a few weeks ago we discussed the weird reordering issue of the
Alpha architecture, which is mentioned in Appendix B of perfbook? I was really
confused at the time. Paul gave me a reference to an SGI webpage (actually an
email discussion), but it wasn't easy to understand. Today I found a passage
by Kourosh Gharachorloo[1], which I found very instructive:

    For Alpha processors, the anomalous behavior is currently only possible on a
    21264-based system. And obviously you have to be using one of our
    multiprocessor servers. Finally, the chances that you actually see it are very
    low, yet it is possible.
    
    Here is what has to happen for this behavior to show up. Assume T1 runs on P1
    and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which
    causes an "invalidate y" to be sent to P2. This invalidate goes into the
    incoming "probe queue" of P2; as you will see, the problem arises because
    this invalidate could theoretically sit in the probe queue without doing an
    MB on P2. The invalidate is acknowledged right away at this point (i.e., you
    don't wait for it to actually invalidate the copy in P2's cache before
    sending the acknowledgment). Therefore, P1 can go through its MB. And it
    proceeds to do the write to p. Now P2 proceeds to read p. The reply for read
    p is allowed to bypass the probe queue on P2 on its incoming path (this
    allows replies/data to get back to the 21264 quickly without needing to wait
    for previous incoming probes to be serviced). Now, P2 can dereference p to
    read the old value of y that is sitting in its cache (the inval y in P2's
    probe queue is still sitting there).
    
    How does an MB on P2 fix this? The 21264 flushes its incoming probe queue
    (i.e., services any pending messages in there) at every MB. Hence, after the
    read of p, you do an MB which pulls in the inval to y for sure. And you can
    no longer see the old cached value for y.
    
    Even though the above scenario is theoretically possible, the chances of
    observing a problem due to it are extremely minute. The reason is that even
    if you setup the caching properly, P2 will likely have ample opportunity to
    service the messages (i.e., inval) in its probe queue before it receives the
    data reply for "read p". Nonetheless, if you get into a situation where you
    have placed many things in P2's probe queue ahead of the inval to y, then it
    is possible that the reply to p comes back and bypasses this inval. It would
    be difficult for you to set up the scenario though and actually observe the
    anomaly.
    
    The above addresses how current Alphas may violate what you have shown.
    Future Alphas can violate it due to other optimizations. One interesting
    optimization is value prediction.

What I want to say is that the next time you update perfbook, you could take a
few words from this. I mean, you could adopt the same scheme, like "Assume T1
runs on P1 and T2 on P2. P2 has to be caching location y with value 0....".
That would make perfbook more understandable :)

Regards,
Yubin

[1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The weird re-ordering issue of the Alpha arch'
  2017-04-29 14:26 The weird re-ordering issue of the Alpha arch' Yubin Ruan
@ 2017-05-01 15:58 ` Paul E. McKenney
  2017-05-08 13:25   ` Yubin Ruan
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2017-05-01 15:58 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook

On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> Hi, 
> Remember a few weeks ago we discussed about the weird re-ordering issue of the
> Alpha arch', which is mentioned in Appendix.B in the perfbook? I got really
> confused at that moment. Paul gave me a reference to a SGI webpage(an email
> discussion actually), but that wasn't so understandable. Today I found a few
> words from Kourosh Gharachorloo[1], which are very instructional for me:
> 
>     For Alpha processors, the anomalous behavior is currently only possible on a
>     21264-based system. And obviously you have to be using one of our
>     multiprocessor servers. Finally, the chances that you actually see it are very
>     low, yet it is possible.
>     
>     Here is what has to happen for this behavior to show up. Assume T1 runs on P1
>     and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which
>     causes an "invalidate y" to be sent to P2. This invalidate goes into the
>     incoming "probe queue" of P2; as you will see, the problem arises because
>     this invalidate could theoretically sit in the probe queue without doing an
>     MB on P2. The invalidate is acknowledged right away at this point (i.e., you
>     don't wait for it to actually invalidate the copy in P2's cache before
>     sending the acknowledgment). Therefore, P1 can go through its MB. And it
>     proceeds to do the write to p. Now P2 proceeds to read p. The reply for read
>     p is allowed to bypass the probe queue on P2 on its incoming path (this allow
>     s replies/data to get back to the 21264 quickly without needing to wait for
>     previous incoming probes to be serviced). Now, P2 can derefence P to read the
>     old value of y that is sitting in its cache (the inval y in P2's probe queue
>     is still sitting there).
>     
>     How does an MB on P2 fix this? The 21264 flushes its incoming probe queue
>     (i.e., services any pending messages in there) at every MB. Hence, after the
>     read of P, you do an MB which pulls in the inval to y for sure. And you can
>     no longer see the old cached value for y.
>     
>     Even though the above scenario is theoretically possible, the chances of
>     observing a problem due to it are extremely minute. The reason is that even
>     if you setup the caching properly, P2 will likely have ample opportunity to
>     service the messages (i.e., inval) in its probe queue before it receives the
>     data reply for "read p". Nonetheless, if you get into a situation where you
>     have placed many things in P2's probe queue ahead of the inval to y, then it
>     is possible that the reply to p comes back and bypasses this inval. It would
>     be difficult for you to set up the scenario though and actually observe the
>     anomaly.
>     
>     The above addresses how current Alpha's may violate what you have shown.
>     Future Alpha's can violate it due to other optimizations. One interesting
>     optimization is value prediction.
> 
> What I want to say is that next time you update the perfbook, you can take a few
> words from it. I mean, you can adopt the same schema like "Assume T1 runs on P1
> and T2 on P2. P2 has to be caching location y with value 0....". That would make
> the perfbook more understandable :)

Thank you -very- much, Yubin!  I was not aware of this, and you are quite
correct, it is -much- better than the current citation.

							Thanx, Paul

> Regards,
> Yubin
> 
> [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> 



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-01 15:58 ` Paul E. McKenney
@ 2017-05-08 13:25   ` Yubin Ruan
  2017-05-08 15:50     ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Yubin Ruan @ 2017-05-08 13:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook

On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> > Hi, 
> > Remember a few weeks ago we discussed about the weird re-ordering issue of the
> > Alpha arch', which is mentioned in Appendix.B in the perfbook? I got really
> > confused at that moment. Paul gave me a reference to a SGI webpage(an email
> > discussion actually), but that wasn't so understandable. Today I found a few
> > words from Kourosh Gharachorloo[1], which are very instructional for me:
> > 
> >     For Alpha processors, the anomalous behavior is currently only possible on a
> >     21264-based system. And obviously you have to be using one of our
> >     multiprocessor servers. Finally, the chances that you actually see it are very
> >     low, yet it is possible.
> >     
> >     Here is what has to happen for this behavior to show up. Assume T1 runs on P1
> >     and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which
> >     causes an "invalidate y" to be sent to P2. This invalidate goes into the
> >     incoming "probe queue" of P2; as you will see, the problem arises because
> >     this invalidate could theoretically sit in the probe queue without doing an
> >     MB on P2. The invalidate is acknowledged right away at this point (i.e., you
> >     don't wait for it to actually invalidate the copy in P2's cache before
> >     sending the acknowledgment). Therefore, P1 can go through its MB. And it
> >     proceeds to do the write to p. Now P2 proceeds to read p. The reply for read
> >     p is allowed to bypass the probe queue on P2 on its incoming path (this allow
> >     s replies/data to get back to the 21264 quickly without needing to wait for
> >     previous incoming probes to be serviced). Now, P2 can derefence P to read the
> >     old value of y that is sitting in its cache (the inval y in P2's probe queue
> >     is still sitting there).
> >     
> >     How does an MB on P2 fix this? The 21264 flushes its incoming probe queue
> >     (i.e., services any pending messages in there) at every MB. Hence, after the
> >     read of P, you do an MB which pulls in the inval to y for sure. And you can
> >     no longer see the old cached value for y.
> >     
> >     Even though the above scenario is theoretically possible, the chances of
> >     observing a problem due to it are extremely minute. The reason is that even
> >     if you setup the caching properly, P2 will likely have ample opportunity to
> >     service the messages (i.e., inval) in its probe queue before it receives the
> >     data reply for "read p". Nonetheless, if you get into a situation where you
> >     have placed many things in P2's probe queue ahead of the inval to y, then it
> >     is possible that the reply to p comes back and bypasses this inval. It would
> >     be difficult for you to set up the scenario though and actually observe the
> >     anomaly.
> >     
> >     The above addresses how current Alpha's may violate what you have shown.
> >     Future Alpha's can violate it due to other optimizations. One interesting
> >     optimization is value prediction.
> > 
> > What I want to say is that next time you update the perfbook, you can take a few
> > words from it. I mean, you can adopt the same schema like "Assume T1 runs on P1
> > and T2 on P2. P2 has to be caching location y with value 0....". That would make
> > the perfbook more understandable :)
> 
> Thank you -very- much, Yubin!  I was not aware of this, and you are quite
> correct, it is -much- better than the current citation.

Hmm... that reminds me of some words in perfbook. In the answer to Quick
Quiz 4.17, you state that:

    Memory barriers only enforce ordering among multiple memory references: They
    do absolutely nothing to expedite the propagation of data from one part of
    the system to another. This leads to a quick rule of thumb: You do not need
    memory barriers unless you are using more than one variable to communicate
    between multiple threads.

Is that only true for the Alpha processor? I mean, on platforms other than
Alpha (e.g. x86), memory barriers *do* expedite the propagation of data from one
processor/core to another, even though that is not officially documented.

---
Yubin

> > 
> > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> > 
> 



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-08 13:25   ` Yubin Ruan
@ 2017-05-08 15:50     ` Paul E. McKenney
  2017-05-09 11:08       ` Yubin Ruan
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2017-05-08 15:50 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook

On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:

[ . . . ]

> Hmm...that reminds me of some words in the perfbook. In the answer of quick quiz 4.17,
> you state that:
> 
>     Memory barrier only enforce ordering among multiple memory references: They do
>     absolutely nothing to expedite the propogation of data from one part of the system
>     to another. This leads to a quick rule of thumb:  You do not need memory barriers
>     unless you are using more than one variable to communicate between multiple threads.
> 
> Is that only true for the Alpha processor? I mean, on platforms other than
> Alpha (e.g x86), memory barrier *do* expedite the propogation of data from one
> processor/core to other processor/core, even though that is not officially documented.

Can you point me at any unofficial documentation of this, for example,
any performance measurements indicating that (for example) the mfence
instruction speeds up the propagation of previous writes to other CPUs?

In the absence of such documentation, all I can really do is change
"They do absolutely nothing to expedite..." to something like "They are
not guaranteed to do anything to expedite..."

							Thanx, Paul

> ---
> Yubin
> 
> > > 
> > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> > > 
> > 
> 



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-09 11:08       ` Yubin Ruan
@ 2017-05-09  4:21         ` Paul E. McKenney
  2017-05-09 15:58           ` Yubin Ruan
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2017-05-09  4:21 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook

On Tue, May 09, 2017 at 07:08:01PM +0800, Yubin Ruan wrote:
> On Mon, May 08, 2017 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> > > On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > > > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> > 
> > [ . . . ]
> > 
> > > Hmm...that reminds me of some words in the perfbook. In the answer of quick quiz 4.17,
> > > you state that:
> > > 
> > >     Memory barrier only enforce ordering among multiple memory references: They do
> > >     absolutely nothing to expedite the propogation of data from one part of the system
> > >     to another. This leads to a quick rule of thumb:  You do not need memory barriers
> > >     unless you are using more than one variable to communicate between multiple threads.
> > > 
> > > Is that only true for the Alpha processor? I mean, on platforms other than
> > > Alpha (e.g x86), memory barrier *do* expedite the propogation of data from one
> > > processor/core to other processor/core, even though that is not officially documented.
> > 
> > Can you point me at any unofficial documentation of this, for example,
> > any performance measurements indicating that (for example) the mfence
> > instruction speeds up the propagation of previous writes to other CPUs?
> 
> Hmm...I might had had too much drug at that moment. What I mean is that, on platform
> like x86, memory barrier instructions(e.g sfence) enforce that the order of some memory
> references are preserved as the same as in the origin processor by another processors.
> However, any speedup is not guaranteed.

Hey, I was hoping!  ;-)

							Thanx, Paul

> Regards,
> Yubin
> 
> > In the absence of such documentation, all I can really do is change
> > "They do absolutely nothing to expedite..." to something like "They are
> > not guaranteed to do anything to expedite..."
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > > Yubin
> > > 
> > > > > 
> > > > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> > > > > 
> > > > 
> > > 
> > 
> 



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-09 15:58           ` Yubin Ruan
@ 2017-05-09  9:03             ` Junchang Wang
  2017-05-09 14:45               ` Yubin Ruan
  0 siblings, 1 reply; 9+ messages in thread
From: Junchang Wang @ 2017-05-09  9:03 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: Paul E. McKenney, perfbook


Hi Yubin,

I have never heard that memory barrier instructions can speed up other
instructions (either reads or writes). My understanding is that, under the
covers, a barrier instruction may need to, for example, flush the pipeline to
maintain program order and wait for data to propagate to other CPU cores,
which disturbs the normal execution of other instructions rather than
accelerating them.


--Junchang


On Tue, May 9, 2017 at 11:58 PM, Yubin Ruan <ablacktshirt@gmail.com> wrote:

> On Mon, May 08, 2017 at 09:21:32PM -0700, Paul E. McKenney wrote:
> > On Tue, May 09, 2017 at 07:08:01PM +0800, Yubin Ruan wrote:
> > > On Mon, May 08, 2017 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > > > On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> > > > > On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > > > > > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> > > >
> > > > [ . . . ]
> > > >
> > > > > Hmm...that reminds me of some words in the perfbook. In the answer
> of quick quiz 4.17,
> > > > > you state that:
> > > > >
> > > > >     Memory barrier only enforce ordering among multiple memory
> references: They do
> > > > >     absolutely nothing to expedite the propogation of data from
> one part of the system
> > > > >     to another. This leads to a quick rule of thumb:  You do not
> need memory barriers
> > > > >     unless you are using more than one variable to communicate
> between multiple threads.
> > > > >
> > > > > Is that only true for the Alpha processor? I mean, on platforms
> other than
> > > > > Alpha (e.g x86), memory barrier *do* expedite the propogation of
> data from one
> > > > > processor/core to other processor/core, even though that is not
> officially documented.
> > > >
> > > > Can you point me at any unofficial documentation of this, for
> example,
> > > > any performance measurements indicating that (for example) the mfence
> > > > instruction speeds up the propagation of previous writes to other
> CPUs?
> > >
> > > Hmm...I might had had too much drug at that moment. What I mean is
> that, on platform
> > > like x86, memory barrier instructions(e.g sfence) enforce that the
> order of some memory
> > > references are preserved as the same as in the origin processor by
> another processors.
> > > However, any speedup is not guaranteed.
> >
> > Hey, I was hoping!  ;-)
>
> Ah...sorry to dispoint you. :)
> Maybe some memory barrier instructions do have some speedup side effect to
> "expedite" the propogation of previous writes. Maybe try to consult some
> Intel
> engineer about this...?
> It would be good to have some experiments to measure this. But currently I
> don't
> know how to carry out the experiment myself. Do you have any plan or idea?
>
> Thanks,
> Yubin
>
> >
> >                                                       Thanx, Paul
> >
> > > Regards,
> > > Yubin
> > >
> > > > In the absence of such documentation, all I can really do is change
> > > > "They do absolutely nothing to expedite..." to something like "They
> are
> > > > not guaranteed to do anything to expedite..."
> > > >
> > > >                                                   Thanx, Paul
> > > >
> > > > > ---
> > > > > Yubin
> > > > >
> > > > > > >
> > > > > > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/
> AlphaReordering.html
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe perfbook" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-08 15:50     ` Paul E. McKenney
@ 2017-05-09 11:08       ` Yubin Ruan
  2017-05-09  4:21         ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Yubin Ruan @ 2017-05-09 11:08 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook

On Mon, May 08, 2017 at 08:50:52AM -0700, Paul E. McKenney wrote:
> On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> > On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> 
> [ . . . ]
> 
> > Hmm...that reminds me of some words in the perfbook. In the answer of quick quiz 4.17,
> > you state that:
> > 
> >     Memory barrier only enforce ordering among multiple memory references: They do
> >     absolutely nothing to expedite the propogation of data from one part of the system
> >     to another. This leads to a quick rule of thumb:  You do not need memory barriers
> >     unless you are using more than one variable to communicate between multiple threads.
> > 
> > Is that only true for the Alpha processor? I mean, on platforms other than
> > Alpha (e.g x86), memory barrier *do* expedite the propogation of data from one
> > processor/core to other processor/core, even though that is not officially documented.
> 
> Can you point me at any unofficial documentation of this, for example,
> any performance measurements indicating that (for example) the mfence
> instruction speeds up the propagation of previous writes to other CPUs?

Hmm... I might have been on too many drugs at that moment. What I meant is
that, on platforms like x86, memory barrier instructions (e.g. sfence) enforce
that the order of some memory references, as observed by other processors, is
preserved as it was on the originating processor. However, any speedup is not
guaranteed.

Regards,
Yubin

> In the absence of such documentation, all I can really do is change
> "They do absolutely nothing to expedite..." to something like "They are
> not guaranteed to do anything to expedite..."
> 
> 							Thanx, Paul
> 
> > ---
> > Yubin
> > 
> > > > 
> > > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> > > > 
> > > 
> > 
> 



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-09  9:03             ` Junchang Wang
@ 2017-05-09 14:45               ` Yubin Ruan
  0 siblings, 0 replies; 9+ messages in thread
From: Yubin Ruan @ 2017-05-09 14:45 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul E. McKenney, perfbook

On Tue, May 09, 2017 at 05:03:36PM +0800, Junchang Wang wrote:
> Hi Yubin,
> 
> I never heard that memory barrier instructions can speedup other
> instructions (either read or write). My understanding is that under the
> covers a barrier instruction may need to, for example, flush pipeline to
> maintain program order, and to wait for the propagation of data to other
> CPU cores, which disturbs the normal execution process of other
> instructions instead of accelerating.

Yes, you are right. I just mixed it up because, for example, on x86 a write
memory barrier guarantees that previous writes will become visible to other
processors in their program order (which is not always the case on Alpha). It
is as if something were pushing them along... so I made that mistake...

Still, I would love to hear your idea for testing whether this is true.

Thanks,
Yubin

> 
> On Tue, May 9, 2017 at 11:58 PM, Yubin Ruan <ablacktshirt@gmail.com> wrote:
> 
> > On Mon, May 08, 2017 at 09:21:32PM -0700, Paul E. McKenney wrote:
> > > On Tue, May 09, 2017 at 07:08:01PM +0800, Yubin Ruan wrote:
> > > > On Mon, May 08, 2017 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> > > > > > On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > > > > > > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> > > > >
> > > > > [ . . . ]
> > > > >
> > > > > > Hmm...that reminds me of some words in the perfbook. In the answer
> > of quick quiz 4.17,
> > > > > > you state that:
> > > > > >
> > > > > >     Memory barrier only enforce ordering among multiple memory
> > references: They do
> > > > > >     absolutely nothing to expedite the propogation of data from
> > one part of the system
> > > > > >     to another. This leads to a quick rule of thumb:  You do not
> > need memory barriers
> > > > > >     unless you are using more than one variable to communicate
> > between multiple threads.
> > > > > >
> > > > > > Is that only true for the Alpha processor? I mean, on platforms
> > other than
> > > > > > Alpha (e.g x86), memory barrier *do* expedite the propogation of
> > data from one
> > > > > > processor/core to other processor/core, even though that is not
> > officially documented.
> > > > >
> > > > > Can you point me at any unofficial documentation of this, for
> > example,
> > > > > any performance measurements indicating that (for example) the mfence
> > > > > instruction speeds up the propagation of previous writes to other
> > CPUs?
> > > >
> > > > Hmm...I might had had too much drug at that moment. What I mean is
> > that, on platform
> > > > like x86, memory barrier instructions(e.g sfence) enforce that the
> > order of some memory
> > > > references are preserved as the same as in the origin processor by
> > another processors.
> > > > However, any speedup is not guaranteed.
> > >
> > > Hey, I was hoping!  ;-)
> >
> > Ah...sorry to dispoint you. :)
> > Maybe some memory barrier instructions do have some speedup side effect to
> > "expedite" the propogation of previous writes. Maybe try to consult some
> > Intel
> > engineer about this...?
> > It would be good to have some experiments to measure this. But currently I
> > don't
> > know how to carry out the experiment myself. Do you have any plan or idea?
> >
> > Thanks,
> > Yubin
> >
> > >
> > >                                                       Thanx, Paul
> > >
> > > > Regards,
> > > > Yubin
> > > >
> > > > > In the absence of such documentation, all I can really do is change
> > > > > "They do absolutely nothing to expedite..." to something like "They
> > are
> > > > > not guaranteed to do anything to expedite..."
> > > > >
> > > > >                                                   Thanx, Paul
> > > > >
> > > > > > ---
> > > > > > Yubin
> > > > > >
> > > > > > > >
> > > > > > > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/
> > AlphaReordering.html
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe perfbook" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >



* Re: The weird re-ordering issue of the Alpha arch'
  2017-05-09  4:21         ` Paul E. McKenney
@ 2017-05-09 15:58           ` Yubin Ruan
  2017-05-09  9:03             ` Junchang Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Yubin Ruan @ 2017-05-09 15:58 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook

On Mon, May 08, 2017 at 09:21:32PM -0700, Paul E. McKenney wrote:
> On Tue, May 09, 2017 at 07:08:01PM +0800, Yubin Ruan wrote:
> > On Mon, May 08, 2017 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > > On Mon, May 08, 2017 at 09:25:28PM +0800, Yubin Ruan wrote:
> > > > On Mon, May 01, 2017 at 08:58:16AM -0700, Paul E. McKenney wrote:
> > > > > On Sat, Apr 29, 2017 at 10:26:05PM +0800, Yubin Ruan wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > Hmm...that reminds me of some words in the perfbook. In the answer of quick quiz 4.17,
> > > > you state that:
> > > > 
> > > >     Memory barrier only enforce ordering among multiple memory references: They do
> > > >     absolutely nothing to expedite the propogation of data from one part of the system
> > > >     to another. This leads to a quick rule of thumb:  You do not need memory barriers
> > > >     unless you are using more than one variable to communicate between multiple threads.
> > > > 
> > > > Is that only true for the Alpha processor? I mean, on platforms other than
> > > > Alpha (e.g x86), memory barrier *do* expedite the propogation of data from one
> > > > processor/core to other processor/core, even though that is not officially documented.
> > > 
> > > Can you point me at any unofficial documentation of this, for example,
> > > any performance measurements indicating that (for example) the mfence
> > > instruction speeds up the propagation of previous writes to other CPUs?
> > 
> > Hmm...I might had had too much drug at that moment. What I mean is that, on platform
> > like x86, memory barrier instructions(e.g sfence) enforce that the order of some memory
> > references are preserved as the same as in the origin processor by another processors.
> > However, any speedup is not guaranteed.
> 
> Hey, I was hoping!  ;-)

Ah... sorry to disappoint you. :)
Maybe some memory barrier instructions do have a speedup side effect that
"expedites" the propagation of previous writes. Maybe we should consult some
Intel engineers about this...?
It would be good to have some experiments to measure this, but currently I
don't know how to carry out such an experiment myself. Do you have any plan or
idea?

Thanks,
Yubin

> 
> 							Thanx, Paul
> 
> > Regards,
> > Yubin
> > 
> > > In the absence of such documentation, all I can really do is change
> > > "They do absolutely nothing to expedite..." to something like "They are
> > > not guaranteed to do anything to expedite..."
> > > 
> > > 							Thanx, Paul
> > > 
> > > > ---
> > > > Yubin
> > > > 
> > > > > > 
> > > > > > [1]: https://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 



end of thread, other threads:[~2017-05-09 15:58 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2017-04-29 14:26 The weird re-ordering issue of the Alpha arch' Yubin Ruan
2017-05-01 15:58 ` Paul E. McKenney
2017-05-08 13:25   ` Yubin Ruan
2017-05-08 15:50     ` Paul E. McKenney
2017-05-09 11:08       ` Yubin Ruan
2017-05-09  4:21         ` Paul E. McKenney
2017-05-09 15:58           ` Yubin Ruan
2017-05-09  9:03             ` Junchang Wang
2017-05-09 14:45               ` Yubin Ruan
