All of lore.kernel.org
 help / color / mirror / Atom feed
* [Question] Quick Quiz 5.17 and cache coherency
@ 2017-02-22 13:22 Yubin Ruan
  2017-02-23 13:44 ` Akira Yokosawa
  0 siblings, 1 reply; 5+ messages in thread
From: Yubin Ruan @ 2017-02-22 13:22 UTC (permalink / raw)
  To: perfbook

Hi,
I am reading chapter 5 and got confused by the answer to Quick Quiz
5.17. So this is an Array-Based Per-thread Eventually Consistent Counter
scheme:

1  DEFINE_PER_THREAD(unsigned long, counter);
2  unsigned long global_count;
3  int stopflag;
4
5  void inc_count(void)
6  {
7      ACCESS_ONCE(__get_thread_var(counter))++;
8  }
9
10 unsigned long read_count(void)
11 {
12     return ACCESS_ONCE(global_count);
13 }
14
15 void *eventual(void *arg)
16 {
17     int t;
18     int sum;
19     while (stopflag < 3) {
20         sum = 0;
21         for_each_thread(t)
22             sum += ACCESS_ONCE(per_thread(counter, t));
23         ACCESS_ONCE(global_count) = sum;
24         poll(NULL, 0, 1);
25         if (stopflag) {
26             smp_mb();
27             stopflag++;
28         }
29     }
30     return NULL;
31 }
32
33 void count_init(void)
34 {
35     thread_id_t tid;
36     if (pthread_create(&tid, NULL, eventual, NULL)) {
37         perror("count_init:pthread_create");
38         exit(-1);
39     }
40 }
41
42 void count_cleanup(void)
43 {
44     stopflag = 1;
45     while (stopflag < 3)
46         poll(NULL, 0, 1);
47     smp_mb();
48 }


I understand the code. In Quick Quiz 5.17, the question is:

    Why _doesn't_ the `inc_count()' in the code above need to use atomic
    instructions? After all, we now have multiple threads accessing the
    per-thread counters!

I think I know the answer to this question: now that you use per-thread
variable, you don't need atomic instructions. The scenarios where you
need atomic instructions are some places like this:

1 long counter = 0;
2 void inc_count(void)
3 {
4    counter++;  //need atomic instruction
5 }
6
7 long read_count(void)
8 {
9    return counter;
10}


But, the answer provided in the book is:

<------------------- Answer Begin ---------------------->
    Because one of the two threads only reads, and because
the variable is aligned and machine-sized, non-atomic instructions
suffice. That said, the ACCESS_ONCE() macro is used to prevent
compiler optimizations that might otherwise prevent the counter
updates from becoming visible to eventual() [Cor12].

    An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing
out that they are in fact unnecessary. That said, atomic
                                      ~~~~~~~~~~~~~~~~~
instructions would be needed in cases where the per-thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
counter variables were smaller than the global global_count.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
However, note that on a 32-bit system, the per-thread counter
variables might need to be limited to 32 bits in order to sum
them accurately, but with a 64-bit global_count variable to avoid
overflow. In this case, it is necessary to zero the per-thread
counter variables periodically in order to avoid overflow. It is
extremely important to note that this zeroing cannot be delayed too long
or overflow of the smaller per-thread variables will result. This
approach therefore imposes real-time requirements on the underlying
system, and in turn must be used with extreme care.

    In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum will
be modulo the word size.
<------------------------ End -------------------------->

Although more complicated than I think, I totally fine with this answer,
except the sentence with ~~~ under it. Why is that? Why do we need
atomic instructions when counter variables were smaller than the global
`global_count' ? Also, the second sentence of the question seems to hint
about cache coherency, but I cannot see the point :(

regards,
Yubin Ruan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] Quick Quiz 5.17 and cache coherency
  2017-02-22 13:22 [Question] Quick Quiz 5.17 and cache coherency Yubin Ruan
@ 2017-02-23 13:44 ` Akira Yokosawa
  2017-02-23 14:33   ` Akira Yokosawa
  0 siblings, 1 reply; 5+ messages in thread
From: Akira Yokosawa @ 2017-02-23 13:44 UTC (permalink / raw)
  To: Yubin Ruan, perfbook

On 2017/02/22 21:22:48 +0800, Yubin Ruan wrote:
> Hi,
> I am reading chapter 5 and got confused by the answer to Quick Quiz
> 5.17. So this is an Array-Based Per-thread Eventually Consistent Counter
> scheme:
> 
> 1  DEFINE_PER_THREAD(unsigned long, counter);
> 2  unsigned long global_count;
> 3  int stopflag;
> 4
> 5  void inc_count(void)
> 6  {
> 7      ACCESS_ONCE(__get_thread_var(counter))++;
> 8  }
> 9
> 10 unsigned long read_count(void)
> 11 {
> 12     return ACCESS_ONCE(global_count);
> 13 }
> 14
> 15 void *eventual(void *arg)
> 16 {
> 17     int t;
> 18     int sum;
> 19     while (stopflag < 3) {
> 20         sum = 0;
> 21         for_each_thread(t)
> 22             sum += ACCESS_ONCE(per_thread(counter, t));
> 23         ACCESS_ONCE(global_count) = sum;
> 24         poll(NULL, 0, 1);
> 25         if (stopflag) {
> 26             smp_mb();
> 27             stopflag++;
> 28         }
> 29     }
> 30     return NULL;
> 31 }
> 32
> 33 void count_init(void)
> 34 {
> 35     thread_id_t tid;
> 36     if (pthread_create(&tid, NULL, eventual, NULL)) {
> 37         perror("count_init:pthread_create");
> 38         exit(-1);
> 39     }
> 40 }
> 41
> 42 void count_cleanup(void)
> 43 {
> 44     stopflag = 1;
> 45     while (stopflag < 3)
> 46         poll(NULL, 0, 1);
> 47     smp_mb();
> 48 }
> 
> 
> I understand the code. In Quick Quiz 5.17, the question is:
> 
>     Why _doesn't_ the `inc_count()' in the code above need to use atomic
>     instructions? After all, we now have multiple threads accessing the
>     per-thread counters!
> 
> I think I know the answer to this question: now that you use per-thread
> variable, you don't need atomic instructions. The scenarios where you
> need atomic instructions are some places like this:
> 
> 1 long counter = 0;
> 2 void inc_count(void)
> 3 {
> 4    counter++;  //need atomic instruction
> 5 }
> 6
> 7 long read_count(void)
> 8 {
> 9    return counter;
> 10}
> 
> 
> But, the answer provided in the book is:
> 
> <------------------- Answer Begin ---------------------->
>     Because one of the two threads only reads, and because
> the variable is aligned and machine-sized, non-atomic instructions
> suffice. 

Well, I think this sentence explains everything necessary. 
Since the per-thread counters are also read-accessed from eventual() thread,
the the increment of the count should look "atomic" from eventual().
If the variable is *not* machine-sized or unaligned, the increment
access might need multiple read / write accesses and it would be not
"atomic" from other thread's point of view.

As as side note, you can see the changes made in this Quick Quiz and Answer
in the git history.  Actually, the original version (dated 2011-01-01,
git tag:v2011.01.02a) of the Quiz said:

	Why does inc_count() in
	Figure 5.? (whatever figure number at that time)
	need to use atomic instructions?

This was the opposite of the current version.

>          That said, the ACCESS_ONCE() macro is used to prevent
> compiler optimizations that might otherwise prevent the counter
> updates from becoming visible to eventual() [Cor12].
> 
>     An older version of this algorithm did in fact use atomic
> instructions, kudos to Ersoy Bayramoglu for pointing
> out that they are in fact unnecessary. That said, atomic
>                                       ~~~~~~~~~~~~~~~~~
> instructions would be needed in cases where the per-thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> counter variables were smaller than the global global_count.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I suppose Paul added this sentence to explain the case where
per-thread counter is shorter than machine-size.

I'm sure Paul will give some comment once he recalls the circumstances ;-)

                                    Thanks, Akira

> However, note that on a 32-bit system, the per-thread counter
> variables might need to be limited to 32 bits in order to sum
> them accurately, but with a 64-bit global_count variable to avoid
> overflow. In this case, it is necessary to zero the per-thread
> counter variables periodically in order to avoid overflow. It is
> extremely important to note that this zeroing cannot be delayed too long
> or overflow of the smaller per-thread variables will result. This
> approach therefore imposes real-time requirements on the underlying
> system, and in turn must be used with extreme care.
> 
>     In contrast, if all variables are the same size, overflow
> of any variable is harmless because the eventual sum will
> be modulo the word size.
> <------------------------ End -------------------------->
> 
> Although more complicated than I think, I totally fine with this answer,
> except the sentence with ~~~ under it. Why is that? Why do we need
> atomic instructions when counter variables were smaller than the global
> `global_count' ? Also, the second sentence of the question seems to hint
> about cache coherency, but I cannot see the point :(
> 
> regards,
> Yubin Ruan
> --
> To unsubscribe from this list: send the line "unsubscribe perfbook" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] Quick Quiz 5.17 and cache coherency
  2017-02-23 13:44 ` Akira Yokosawa
@ 2017-02-23 14:33   ` Akira Yokosawa
  2017-02-23 20:19     ` Paul E. McKenney
       [not found]     ` <fda72075-2bca-b0e7-8a6c-3db2d15eeadb@gmail.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Akira Yokosawa @ 2017-02-23 14:33 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook, Paul E. McKenney

On 2017/02/23 22:44:41 +0900, Akira Yokosawa wrote:
> On 2017/02/22 21:22:48 +0800, Yubin Ruan wrote:
>> Hi,
>> I am reading chapter 5 and got confused by the answer to Quick Quiz
>> 5.17. So this is an Array-Based Per-thread Eventually Consistent Counter
>> scheme:
>>
>> 1  DEFINE_PER_THREAD(unsigned long, counter);
>> 2  unsigned long global_count;
>> 3  int stopflag;
>> 4
>> 5  void inc_count(void)
>> 6  {
>> 7      ACCESS_ONCE(__get_thread_var(counter))++;
>> 8  }
>> 9
>> 10 unsigned long read_count(void)
>> 11 {
>> 12     return ACCESS_ONCE(global_count);
>> 13 }
>> 14
>> 15 void *eventual(void *arg)
>> 16 {
>> 17     int t;
>> 18     int sum;
>> 19     while (stopflag < 3) {
>> 20         sum = 0;
>> 21         for_each_thread(t)
>> 22             sum += ACCESS_ONCE(per_thread(counter, t));
>> 23         ACCESS_ONCE(global_count) = sum;
>> 24         poll(NULL, 0, 1);
>> 25         if (stopflag) {
>> 26             smp_mb();
>> 27             stopflag++;
>> 28         }
>> 29     }
>> 30     return NULL;
>> 31 }
>> 32
>> 33 void count_init(void)
>> 34 {
>> 35     thread_id_t tid;
>> 36     if (pthread_create(&tid, NULL, eventual, NULL)) {
>> 37         perror("count_init:pthread_create");
>> 38         exit(-1);
>> 39     }
>> 40 }
>> 41
>> 42 void count_cleanup(void)
>> 43 {
>> 44     stopflag = 1;
>> 45     while (stopflag < 3)
>> 46         poll(NULL, 0, 1);
>> 47     smp_mb();
>> 48 }
>>
>>
>> I understand the code. In Quick Quiz 5.17, the question is:
>>
>>     Why _doesn't_ the `inc_count()' in the code above need to use atomic
>>     instructions? After all, we now have multiple threads accessing the
>>     per-thread counters!
>>
>> I think I know the answer to this question: now that you use per-thread
>> variable, you don't need atomic instructions. The scenarios where you
>> need atomic instructions are some places like this:
>>
>> 1 long counter = 0;
>> 2 void inc_count(void)
>> 3 {
>> 4    counter++;  //need atomic instruction
>> 5 }
>> 6
>> 7 long read_count(void)
>> 8 {
>> 9    return counter;
>> 10}
>>
>>
>> But, the answer provided in the book is:
>>
>> <------------------- Answer Begin ---------------------->
>>     Because one of the two threads only reads, and because
>> the variable is aligned and machine-sized, non-atomic instructions
>> suffice. 
> 
> Well, I think this sentence explains everything necessary. 
> Since the per-thread counters are also read-accessed from eventual() thread,
> the the increment of the count should look "atomic" from eventual().
> If the variable is *not* machine-sized or unaligned, the increment
> access might need multiple read / write accesses and it would be not
> "atomic" from other thread's point of view.

This was somewhat confusing. Let me try again.

Since the per-thread counters are also read-accessed from eventual() thread,
the write access of the increment of the count should look "atomic" from eventual().
If the variable is *not* machine-sized nor aligned, the write of the modified value
might need multiple read / write accesses and they would not be 'atomic" from
other threads' points of view.

Incrementing a counter is a Read-Modify-Write access. In per-thread counter,
it is not necessary to make the whole RMW access atomic. Only the write of the
incremented value needs to look "atomic".

By the way, an "unsigned long" variable is machine-sized and aligned on
most compiler-platform combination, but there might be exceptions I suppose.

                                          Thanks, Akira
> 
> As as side note, you can see the changes made in this Quick Quiz and Answer
> in the git history.  Actually, the original version (dated 2011-01-01,
> git tag:v2011.01.02a) of the Quiz said:
> 
> 	Why does inc_count() in
> 	Figure 5.? (whatever figure number at that time)
> 	need to use atomic instructions?
> 
> This was the opposite of the current version.
> 
>>          That said, the ACCESS_ONCE() macro is used to prevent
>> compiler optimizations that might otherwise prevent the counter
>> updates from becoming visible to eventual() [Cor12].
>>
>>     An older version of this algorithm did in fact use atomic
>> instructions, kudos to Ersoy Bayramoglu for pointing
>> out that they are in fact unnecessary. That said, atomic
>>                                       ~~~~~~~~~~~~~~~~~
>> instructions would be needed in cases where the per-thread
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> counter variables were smaller than the global global_count.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I suppose Paul added this sentence to explain the case where
> per-thread counter is shorter than machine-size.
> 
> I'm sure Paul will give some comment once he recalls the circumstances ;-)

Now added explicit CC: Paul.

> 
>                                     Thanks, Akira
> 
>> However, note that on a 32-bit system, the per-thread counter
>> variables might need to be limited to 32 bits in order to sum
>> them accurately, but with a 64-bit global_count variable to avoid
>> overflow. In this case, it is necessary to zero the per-thread
>> counter variables periodically in order to avoid overflow. It is
>> extremely important to note that this zeroing cannot be delayed too long
>> or overflow of the smaller per-thread variables will result. This
>> approach therefore imposes real-time requirements on the underlying
>> system, and in turn must be used with extreme care.
>>
>>     In contrast, if all variables are the same size, overflow
>> of any variable is harmless because the eventual sum will
>> be modulo the word size.
>> <------------------------ End -------------------------->
>>
>> Although more complicated than I think, I totally fine with this answer,
>> except the sentence with ~~~ under it. Why is that? Why do we need
>> atomic instructions when counter variables were smaller than the global
>> `global_count' ? Also, the second sentence of the question seems to hint
>> about cache coherency, but I cannot see the point :(
>>
>> regards,
>> Yubin Ruan
>> --
>> To unsubscribe from this list: send the line "unsubscribe perfbook" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] Quick Quiz 5.17 and cache coherency
  2017-02-23 14:33   ` Akira Yokosawa
@ 2017-02-23 20:19     ` Paul E. McKenney
       [not found]     ` <fda72075-2bca-b0e7-8a6c-3db2d15eeadb@gmail.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Paul E. McKenney @ 2017-02-23 20:19 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Yubin Ruan, perfbook

On Thu, Feb 23, 2017 at 11:33:55PM +0900, Akira Yokosawa wrote:
> On 2017/02/23 22:44:41 +0900, Akira Yokosawa wrote:
> > On 2017/02/22 21:22:48 +0800, Yubin Ruan wrote:
> >> Hi,
> >> I am reading chapter 5 and got confused by the answer to Quick Quiz
> >> 5.17. So this is an Array-Based Per-thread Eventually Consistent Counter
> >> scheme:
> >>
> >> 1  DEFINE_PER_THREAD(unsigned long, counter);
> >> 2  unsigned long global_count;
> >> 3  int stopflag;
> >> 4
> >> 5  void inc_count(void)
> >> 6  {
> >> 7      ACCESS_ONCE(__get_thread_var(counter))++;
> >> 8  }
> >> 9
> >> 10 unsigned long read_count(void)
> >> 11 {
> >> 12     return ACCESS_ONCE(global_count);
> >> 13 }
> >> 14
> >> 15 void *eventual(void *arg)
> >> 16 {
> >> 17     int t;
> >> 18     int sum;
> >> 19     while (stopflag < 3) {
> >> 20         sum = 0;
> >> 21         for_each_thread(t)
> >> 22             sum += ACCESS_ONCE(per_thread(counter, t));
> >> 23         ACCESS_ONCE(global_count) = sum;
> >> 24         poll(NULL, 0, 1);
> >> 25         if (stopflag) {
> >> 26             smp_mb();
> >> 27             stopflag++;
> >> 28         }
> >> 29     }
> >> 30     return NULL;
> >> 31 }
> >> 32
> >> 33 void count_init(void)
> >> 34 {
> >> 35     thread_id_t tid;
> >> 36     if (pthread_create(&tid, NULL, eventual, NULL)) {
> >> 37         perror("count_init:pthread_create");
> >> 38         exit(-1);
> >> 39     }
> >> 40 }
> >> 41
> >> 42 void count_cleanup(void)
> >> 43 {
> >> 44     stopflag = 1;
> >> 45     while (stopflag < 3)
> >> 46         poll(NULL, 0, 1);
> >> 47     smp_mb();
> >> 48 }
> >>
> >>
> >> I understand the code. In Quick Quiz 5.17, the question is:
> >>
> >>     Why _doesn't_ the `inc_count()' in the code above need to use atomic
> >>     instructions? After all, we now have multiple threads accessing the
> >>     per-thread counters!
> >>
> >> I think I know the answer to this question: now that you use per-thread
> >> variable, you don't need atomic instructions. The scenarios where you
> >> need atomic instructions are some places like this:
> >>
> >> 1 long counter = 0;
> >> 2 void inc_count(void)
> >> 3 {
> >> 4    counter++;  //need atomic instruction
> >> 5 }
> >> 6
> >> 7 long read_count(void)
> >> 8 {
> >> 9    return counter;
> >> 10}
> >>
> >>
> >> But, the answer provided in the book is:
> >>
> >> <------------------- Answer Begin ---------------------->
> >>     Because one of the two threads only reads, and because
> >> the variable is aligned and machine-sized, non-atomic instructions
> >> suffice. 
> > 
> > Well, I think this sentence explains everything necessary. 
> > Since the per-thread counters are also read-accessed from eventual() thread,
> > the the increment of the count should look "atomic" from eventual().
> > If the variable is *not* machine-sized or unaligned, the increment
> > access might need multiple read / write accesses and it would be not
> > "atomic" from other thread's point of view.
> 
> This was somewhat confusing. Let me try again.
> 
> Since the per-thread counters are also read-accessed from eventual() thread,
> the write access of the increment of the count should look "atomic" from eventual().
> If the variable is *not* machine-sized nor aligned, the write of the modified value
> might need multiple read / write accesses and they would not be 'atomic" from
> other threads' points of view.
> 
> Incrementing a counter is a Read-Modify-Write access. In per-thread counter,
> it is not necessary to make the whole RMW access atomic. Only the write of the
> incremented value needs to look "atomic".
> 
> By the way, an "unsigned long" variable is machine-sized and aligned on
> most compiler-platform combination, but there might be exceptions I suppose.

Good explanation!

>                                           Thanks, Akira
> > 
> > As as side note, you can see the changes made in this Quick Quiz and Answer
> > in the git history.  Actually, the original version (dated 2011-01-01,
> > git tag:v2011.01.02a) of the Quiz said:
> > 
> > 	Why does inc_count() in
> > 	Figure 5.? (whatever figure number at that time)
> > 	need to use atomic instructions?
> > 
> > This was the opposite of the current version.
> > 
> >>          That said, the ACCESS_ONCE() macro is used to prevent
> >> compiler optimizations that might otherwise prevent the counter
> >> updates from becoming visible to eventual() [Cor12].
> >>
> >>     An older version of this algorithm did in fact use atomic
> >> instructions, kudos to Ersoy Bayramoglu for pointing
> >> out that they are in fact unnecessary. That said, atomic
> >>                                       ~~~~~~~~~~~~~~~~~
> >> instructions would be needed in cases where the per-thread
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> counter variables were smaller than the global global_count.
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > I suppose Paul added this sentence to explain the case where
> > per-thread counter is shorter than machine-size.
> > 
> > I'm sure Paul will give some comment once he recalls the circumstances ;-)
> 
> Now added explicit CC: Paul.

Sorry for the slow reponse, I am on travel, and am likely to be completely
off the grid for the next 24-48 hours.  (Hiking!)

But yes, checking the git history can be quite interesting, both in
terms of evolution of the algorithm and spotting my past mistakes.  ;-)

							Thanx, Paul

> >                                     Thanks, Akira
> > 
> >> However, note that on a 32-bit system, the per-thread counter
> >> variables might need to be limited to 32 bits in order to sum
> >> them accurately, but with a 64-bit global_count variable to avoid
> >> overflow. In this case, it is necessary to zero the per-thread
> >> counter variables periodically in order to avoid overflow. It is
> >> extremely important to note that this zeroing cannot be delayed too long
> >> or overflow of the smaller per-thread variables will result. This
> >> approach therefore imposes real-time requirements on the underlying
> >> system, and in turn must be used with extreme care.
> >>
> >>     In contrast, if all variables are the same size, overflow
> >> of any variable is harmless because the eventual sum will
> >> be modulo the word size.
> >> <------------------------ End -------------------------->
> >>
> >> Although more complicated than I think, I totally fine with this answer,
> >> except the sentence with ~~~ under it. Why is that? Why do we need
> >> atomic instructions when counter variables were smaller than the global
> >> `global_count' ? Also, the second sentence of the question seems to hint
> >> about cache coherency, but I cannot see the point :(
> >>
> >> regards,
> >> Yubin Ruan
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe perfbook" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] Quick Quiz 5.17 and cache coherency
       [not found]       ` <ac6c52c4-38d1-a2dd-76a8-67398f921793@gmail.com>
@ 2017-02-24  1:01         ` Yubin Ruan
  0 siblings, 0 replies; 5+ messages in thread
From: Yubin Ruan @ 2017-02-24  1:01 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On 02/24/2017 06:16 AM, Akira Yokosawa wrote:
> On 2017/02/24 01:02:21 +0800, Yubin Ruan wrote:
>> On 02/23/2017 10:33 PM, Akira Yokosawa wrote:
>>> On 2017/02/23 22:44:41 +0900, Akira Yokosawa wrote:
>>>> On 2017/02/22 21:22:48 +0800, Yubin Ruan wrote:
>>>>> Hi,
>>>>> I am reading chapter 5 and got confused by the answer to Quick Quiz
>>>>> 5.17. So this is an Array-Based Per-thread Eventually Consistent Counter
>>>>> scheme:
>>>>>
>>>>> 1  DEFINE_PER_THREAD(unsigned long, counter);
>>>>> 2  unsigned long global_count;
>>>>> 3  int stopflag;
>>>>> 4
>>>>> 5  void inc_count(void)
>>>>> 6  {
>>>>> 7      ACCESS_ONCE(__get_thread_var(counter))++;
>>>>> 8  }
>>>>> 9
>>>>> 10 unsigned long read_count(void)
>>>>> 11 {
>>>>> 12     return ACCESS_ONCE(global_count);
>>>>> 13 }
>>>>> 14
>>>>> 15 void *eventual(void *arg)
>>>>> 16 {
>>>>> 17     int t;
>>>>> 18     int sum;
>>>>> 19     while (stopflag < 3) {
>>>>> 20         sum = 0;
>>>>> 21         for_each_thread(t)
>>>>> 22             sum += ACCESS_ONCE(per_thread(counter, t));
>>>>> 23         ACCESS_ONCE(global_count) = sum;
>>>>> 24         poll(NULL, 0, 1);
>>>>> 25         if (stopflag) {
>>>>> 26             smp_mb();
>>>>> 27             stopflag++;
>>>>> 28         }
>>>>> 29     }
>>>>> 30     return NULL;
>>>>> 31 }
>>>>> 32
>>>>> 33 void count_init(void)
>>>>> 34 {
>>>>> 35     thread_id_t tid;
>>>>> 36     if (pthread_create(&tid, NULL, eventual, NULL)) {
>>>>> 37         perror("count_init:pthread_create");
>>>>> 38         exit(-1);
>>>>> 39     }
>>>>> 40 }
>>>>> 41
>>>>> 42 void count_cleanup(void)
>>>>> 43 {
>>>>> 44     stopflag = 1;
>>>>> 45     while (stopflag < 3)
>>>>> 46         poll(NULL, 0, 1);
>>>>> 47     smp_mb();
>>>>> 48 }
>>>>>
>>>>>
>>>>> I understand the code. In Quick Quiz 5.17, the question is:
>>>>>
>>>>>     Why _doesn't_ the `inc_count()' in the code above need to use atomic
>>>>>     instructions? After all, we now have multiple threads accessing the
>>>>>     per-thread counters!
>>>>>
>>>>> I think I know the answer to this question: now that you use per-thread
>>>>> variable, you don't need atomic instructions. The scenarios where you
>>>>> need atomic instructions are some places like this:
>>>>>
>>>>> 1 long counter = 0;
>>>>> 2 void inc_count(void)
>>>>> 3 {
>>>>> 4    counter++;  //need atomic instruction
>>>>> 5 }
>>>>> 6
>>>>> 7 long read_count(void)
>>>>> 8 {
>>>>> 9    return counter;
>>>>> 10}
>>>>>
>>>>>
>>>>> But, the answer provided in the book is:
>>>>>
>>>>> <------------------- Answer Begin ---------------------->
>>>>>     Because one of the two threads only reads, and because
>>>>> the variable is aligned and machine-sized, non-atomic instructions
>>>>> suffice. 
>>>>
>>>> Well, I think this sentence explains everything necessary. 
>>>> Since the per-thread counters are also read-accessed from eventual() thread,
>>>> the the increment of the count should look "atomic" from eventual().
>>>> If the variable is *not* machine-sized or unaligned, the increment
>>>> access might need multiple read / write accesses and it would be not
>>>> "atomic" from other thread's point of view.
>>>
>>> This was somewhat confusing. Let me try again.
>>>
>>> Since the per-thread counters are also read-accessed from eventual() thread,
>>> the write access of the increment of the count should look "atomic" from eventual().
>>> If the variable is *not* machine-sized nor aligned, the write of the modified value
>>> might need multiple read / write accesses and they would not be 'atomic" from
>>> other threads' points of view.
>>>
>>> Incrementing a counter is a Read-Modify-Write access. In per-thread counter,
>>> it is not necessary to make the whole RMW access atomic. Only the write of the
>>> incremented value needs to look "atomic".
>>>
>>> By the way, an "unsigned long" variable is machine-sized and aligned on
>>> most compiler-platform combination, but there might be exceptions I suppose.
>>>
>>
>> Thanks for your information!
>>
>>>>
>>>> As as side note, you can see the changes made in this Quick Quiz and Answer
>>>> in the git history.  Actually, the original version (dated 2011-01-01,
>>>> git tag:v2011.01.02a) of the Quiz said:
>>>>
>>>> 	Why does inc_count() in
>>>> 	Figure 5.? (whatever figure number at that time)
>>>> 	need to use atomic instructions?
>>>>
>>>> This was the opposite of the current version.
>>>>
>>>>>          That said, the ACCESS_ONCE() macro is used to prevent
>>>>> compiler optimizations that might otherwise prevent the counter
>>>>> updates from becoming visible to eventual() [Cor12].
>>>>>
>>>>>     An older version of this algorithm did in fact use atomic
>>>>> instructions, kudos to Ersoy Bayramoglu for pointing
>>>>> out that they are in fact unnecessary. That said, atomic
>>>>>                                       ~~~~~~~~~~~~~~~~~
>>>>> instructions would be needed in cases where the per-thread
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> counter variables were smaller than the global global_count.
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>> I suppose Paul added this sentence to explain the case where
>>>> per-thread counter is shorter than machine-size.
>>
>> So, you mean, what he intended to say is something like:
>>
>>     would be needed in cases where the per-thread counter
>>     variables where smaller than machine size
>> ?
> 
> Let's see what Paul want to say. (By the way, this message seems private and
> I kept this reply private.)

sorry I just forgot to `Cc'.

> 
>> But, after all, there is only one thread doing the writing, which mean
>> that we don't need any atomic instruction even the variable `counter' is
>> not properly aligned. right?
> 
> The point is that "counter" is per-thread and is in an array.
> "counter" for another thread is placed next to it in an unaligned way.
> 
> Suppose an architecture which can only do machine-sized, aligned memory accesses.
> A write to an unaligned variable will need the following 4 memory accesses:
> 
>   1.  Read memory including lower part of the variable
>   2.  Read memory including higher part of the variable
>       (Replace the updated part on registers without affecting outside of
>        the variable)
>   3.  Write memory including lower part of the variable
>   4.  Write memory including higher part of the variable
> 
> Can you imagine what happens if another thread also does a write to a neighboring
> counter in a racy timing?
> 
>                               Thanks, Akira
> 

well I didn't expect that. thank you.

regards,
Yubin Ruan

>>
>> regards,
>> Yubin Ruan
>>
>>>>
>>>> I'm sure Paul will give some comment once he recalls the circumstances ;-)
>>>
>>> Now added explicit CC: Paul.
>>>
>>>>
>>>>                                     Thanks, Akira
>>>>
>>>>> However, note that on a 32-bit system, the per-thread counter
>>>>> variables might need to be limited to 32 bits in order to sum
>>>>> them accurately, but with a 64-bit global_count variable to avoid
>>>>> overflow. In this case, it is necessary to zero the per-thread
>>>>> counter variables periodically in order to avoid overflow. It is
>>>>> extremely important to note that this zeroing cannot be delayed too long
>>>>> or overflow of the smaller per-thread variables will result. This
>>>>> approach therefore imposes real-time requirements on the underlying
>>>>> system, and in turn must be used with extreme care.
>>>>>
>>>>>     In contrast, if all variables are the same size, overflow
>>>>> of any variable is harmless because the eventual sum will
>>>>> be modulo the word size.
>>>>> <------------------------ End -------------------------->
>>>>>
>>>>> Although more complicated than I think, I totally fine with this answer,
>>>>> except the sentence with ~~~ under it. Why is that? Why do we need
>>>>> atomic instructions when counter variables were smaller than the global
>>>>> `global_count' ? Also, the second sentence of the question seems to hint
>>>>> about cache coherency, but I cannot see the point :(
>>>>>
>>>>> regards,
>>>>> Yubin Ruan
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe perfbook" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>
>> .
>>


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-02-24  1:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-22 13:22 [Question] Quick Quiz 5.17 and cache coherency Yubin Ruan
2017-02-23 13:44 ` Akira Yokosawa
2017-02-23 14:33   ` Akira Yokosawa
2017-02-23 20:19     ` Paul E. McKenney
     [not found]     ` <fda72075-2bca-b0e7-8a6c-3db2d15eeadb@gmail.com>
     [not found]       ` <ac6c52c4-38d1-a2dd-76a8-67398f921793@gmail.com>
2017-02-24  1:01         ` Yubin Ruan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.