On Fri, 2007-12-14 at 16:26 +0800, zhejiang wrote:
> On Thu, 2007-12-13 at 22:54 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-12-13 at 11:30 +0800, zhejiang wrote:
> > > __percpu_counter_add() caches the result in a percpu variable until
> > > it exceeds the batch.
> > > prop_norm_percpu() uses percpu_counter_read(&pl->events) to read the
> > > counter, and percpu_counter_add(&pl->events, -half) to halve it.
> > >
> > > There are potential problems:
> > > 1. The counter may go negative.
> > > 2. After some calculation, it may be converted to a big positive
> > >    number.
> > >
> > > 1.
> > > For example, the batch is 32. When the bdi adds 32 to pl->events,
> > > pl->events->count will be 32. Suppose one of the percpu counters is 1.
> > >
> > > In prop_norm_percpu(), half will be 16. Because it is under the
> > > batch, pl->events->count won't be modified, and one of the percpu
> > > counters may become -15. If prop_norm_percpu() is called again, half
> > > will still be 16, though it should be 8, and the percpu counter may
> > > become -31. At this point pl->events->count is still 32.
> > > On the third calculation the percpu counter will be -47, which gets
> > > merged into pl->events->count. Then pl->events->count will be
> > > negative.
> > >
> > > 2. When pl->events->count is negative,
> > >	unsigned long val = percpu_counter_read(&pl->events);
> > > may return a negative number, so val would be a big number. Because
> > > of the overflow, pl->events->count will be converted into a big
> > > positive number after some calculation.
> > >
> > > Because of this overflow, I caught some very big numerators when
> > > calling prop_fraction_percpu().
> > >
> > > I think it should use percpu_counter_sum() instead of
> > > percpu_counter_read() to be more robust.
> > >
> > > diff -Nur a/proportions.c b/proportions.c
> > > --- a/proportions.c	2007-12-12 11:05:59.000000000 +0800
> > > +++ b/proportions.c	2007-12-13 11:05:40.000000000 +0800
> > > @@ -241,7 +241,7 @@
> > >  	 * can never result in a negative number.
> > >  	 */
> > >  	while (pl->period != global_period) {
> > > -		unsigned long val = percpu_counter_read(&pl->events);
> > > +		unsigned long val = percpu_counter_sum(&pl->events);
> > >  		unsigned long half = (val + 1) >> 1;
> > >
> > >  		/*
> >
> > _sum is unacceptably slow here. I'm thinking something like
> > bdi_stat_error() as used in balance_dirty_pages() would also solve
> > this issue, no?
>
> _sum is slow, but it can avoid over-subtraction when the counter is
> less than 2 * BDI_STAT_BATCH.
>
> How about doing the calculation like this patch?
>
> Change val and half to type long; if we find that val is negative, it
> means that we subtracted too much, so we actually add some value back.
>
> diff -Nur a/proportions.c b/proportions.c
> --- a/proportions.c	2007-12-12 11:05:59.000000000 +0800
> +++ b/proportions.c	2007-12-14 16:21:08.000000000 +0800
> @@ -241,8 +241,18 @@
>  	 * can never result in a negative number.
>  	 */
>  	while (pl->period != global_period) {
> -		unsigned long val = percpu_counter_read(&pl->events);
> -		unsigned long half = (val + 1) >> 1;
> +		long val = percpu_counter_read(&pl->events);
> +		long half;
> +
> +		if (val >= 0) {
> +			half = (val + 1) >> 1;
> +		} else {
> +			/*
> +			 * If val is negative, the counter has been
> +			 * subtracted too much.
> +			 * Do some compensation here.
> +			 */
> +			half = (val - 1) / 2;
> +		}
>
>  		/*
>  		 * Half of zero won't be much less, break out.
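For reference, the over-subtraction described above can be reproduced with a
small user-space model of percpu_counter batching. This is only a sketch: the
two-CPU layout, the batch of 32 and the fold-on-overflow rule are illustrative
assumptions, not the kernel implementation.

#include <stdio.h>

/*
 * Minimal user-space model of percpu_counter batching: a global count plus
 * per-CPU deltas that are only folded into the global count once a delta
 * reaches the batch.  NR_CPUS and BATCH are illustrative assumptions.
 */
#define NR_CPUS	2
#define BATCH	32

struct pcpu_counter {
	long count;		/* approximate global value (fbc->count) */
	long cpu[NR_CPUS];	/* unfolded per-CPU deltas */
};

/* like percpu_counter_read(): return only the possibly stale global count */
static long pcpu_read(const struct pcpu_counter *c)
{
	return c->count;
}

/* like __percpu_counter_add(): accumulate locally, fold only past the batch */
static void pcpu_add(struct pcpu_counter *c, int cpu, long amount)
{
	long delta = c->cpu[cpu] + amount;

	if (delta >= BATCH || delta <= -BATCH) {
		c->count += delta;
		c->cpu[cpu] = 0;
	} else {
		c->cpu[cpu] = delta;
	}
}

int main(void)
{
	/* the scenario from the report: count folded to 32, one CPU holds 1 */
	struct pcpu_counter c = { .count = 32, .cpu = { 1, 0 } };
	int i;

	/* the halving step of prop_norm_percpu(): read approximately,
	 * then subtract half of that reading on the local CPU */
	for (i = 0; i < 3; i++) {
		long val = pcpu_read(&c);
		long half = (val + 1) >> 1;

		pcpu_add(&c, 0, -half);
		printf("pass %d: count=%ld cpu0=%ld\n", i, c.count, c.cpu[0]);
	}
	return 0;
}

With those numbers the third pass folds a delta of -47 into the global count,
leaving it at -15, which is exactly the negative value that a later
percpu_counter_read() hands back as a huge unsigned long.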
How about something like this:

diff --git a/lib/proportions.c b/lib/proportions.c
index 332d8c5..4afb330 100644
--- a/lib/proportions.c
+++ b/lib/proportions.c
@@ -190,6 +190,8 @@ prop_adjust_shift(int *pl_shift, unsigned long *pl_period, int new_shift)
  * PERCPU
  */
 
+#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
 int prop_local_init_percpu(struct prop_local_percpu *pl)
 {
 	spin_lock_init(&pl->lock);
@@ -203,6 +205,16 @@ void prop_local_destroy_percpu(struct prop_local_percpu *pl)
 	percpu_counter_destroy(&pl->events);
 }
 
+static unsigned long prop_percpu_read(struct prop_local_percpu *pl)
+{
+	unsigned long val = percpu_counter_read(&pl->events);
+
+	if (val < (nr_cpu_ids * PROP_BATCH))
+		val = percpu_counter_sum(&pl->events);
+
+	return val;
+}
+
 /*
  * Catch up with missed period expirations.
  *
@@ -241,7 +253,7 @@ void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 	 * can never result in a negative number.
 	 */
 	while (pl->period != global_period) {
-		unsigned long val = percpu_counter_read(&pl->events);
+		unsigned long val = prop_percpu_read(pl);
 		unsigned long half = (val + 1) >> 1;
 
 		/*
@@ -252,7 +264,7 @@ void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 		if (!val)
 			break;
 
-		percpu_counter_add(&pl->events, -half);
+		__percpu_counter_add(&pl->events, -half, PROP_BATCH);
 		pl->period += period;
 	}
 	pl->period = global_period;
@@ -267,7 +279,7 @@ void __prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl)
 	struct prop_global *pg = prop_get_global(pd);
 
 	prop_norm_percpu(pg, pl);
-	percpu_counter_add(&pl->events, 1);
+	__percpu_counter_add(&pl->events, 1, PROP_BATCH);
 	percpu_counter_add(&pg->events, 1);
 	prop_put_global(pd, pg);
 }
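To see why the guard helps, the same toy model as above can be extended with
an analogue of prop_percpu_read(). Again this is a user-space sketch under
assumed NR_CPUS/BATCH values, not the kernel code; the smaller write-side
PROP_BATCH from the patch is not modelled here.

#include <stdio.h>

/*
 * Toy percpu_counter model plus a guarded read: fall back to an exact sum
 * whenever the approximate value is small enough that unfolded per-CPU
 * deltas could hide a much smaller (or negative) true value.
 */
#define NR_CPUS	2
#define BATCH	32

struct pcpu_counter {
	long count;
	long cpu[NR_CPUS];
};

static void pcpu_add(struct pcpu_counter *c, int cpu, long amount)
{
	long delta = c->cpu[cpu] + amount;

	if (delta >= BATCH || delta <= -BATCH) {
		c->count += delta;
		c->cpu[cpu] = 0;
	} else {
		c->cpu[cpu] = delta;
	}
}

/* like percpu_counter_sum(): fold everything in for an exact answer */
static long pcpu_sum(const struct pcpu_counter *c)
{
	long sum = c->count;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += c->cpu[i];
	return sum;
}

/* analogue of prop_percpu_read(): only small readings take the slow path */
static long guarded_read(const struct pcpu_counter *c)
{
	long val = c->count;

	if (val < NR_CPUS * BATCH)
		val = pcpu_sum(c);
	return val;
}

int main(void)
{
	struct pcpu_counter c = { .count = 32, .cpu = { 1, 0 } };
	int i;

	/* the same halving loop as before, now reading through the guard */
	for (i = 0; i < 5; i++) {
		long val = guarded_read(&c);
		long half = (val + 1) >> 1;

		if (!val)
			break;

		pcpu_add(&c, 0, -half);
		printf("pass %d: read=%ld count=%ld cpu0=%ld\n",
		       i, val, c.count, c.cpu[0]);
	}
	return 0;
}

Each pass now sees the true remaining value before halving, so it never
subtracts more than is actually there, and the expensive exact sum is only
taken when the approximate reading drops below nr_cpu_ids * PROP_BATCH,
i.e. within the worst-case batching error.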