ZVC: Increase threshold for larger processor configurations
diff mbox series

Message ID Pine.LNX.4.64.0606281038530.22262@schroedinger.engr.sgi.com
State New, archived
Series
  • ZVC: Increase threshold for larger processor configurations

Commit Message

Christoph Lameter June 28, 2006, 5:41 p.m. UTC
We detected a slight degree of contention with the new zoned VM counters 
if over 128 processors simultaneously fault anonymous pages. Raising the 
threshold to 64 fixed the contention issue.

So we need to increase the threshold depending on the number of processors
in the system. And it may be best to overcompensate a bit.

We keep the existing threshold of 32 for configurations with less than or
equal to 64 processors. In the range of 65 to 128 processors we go to a
threshold of 64. Beyond that we go to 125 (we have to be able to increment
one beyond the threshold and then better avoid 127 just in case).
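
As background, here is a minimal userspace sketch of the fold mechanism these
thresholds tune (an illustration only, not the kernel code itself): each CPU
accumulates deltas in a small per-CPU differential and writes to the shared,
contended counter only once the differential exceeds the threshold, so a
larger threshold directly means fewer writes to the shared cacheline.

#include <stdio.h>

#define STAT_THRESHOLD 32		/* per-CPU slack before folding */

static long global_count;		/* shared counter (the contended cacheline) */
static signed char cpu_diff;		/* per-CPU differential, one byte */

/*
 * Simplified analogue of __inc_zone_state(): touch the shared counter
 * only when the local differential exceeds the threshold.
 */
static void inc_state(void)
{
	cpu_diff++;
	if (cpu_diff > STAT_THRESHOLD) {
		global_count += cpu_diff;	/* one shared-cacheline write */
		cpu_diff = 0;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		inc_state();
	/*
	 * With a threshold of 32 the shared counter is written only once
	 * every 33 increments; raising the threshold lowers that rate.
	 */
	printf("global=%ld local=%d\n", global_count, cpu_diff);
	return 0;
}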

(There are more scalability improvements possible by utilizing the 
cacheline when it has been acquired to update all pending counters but I 
want to first test with a few hundred processors to see if we need those 
and then we need to figure out if there are bad effects for smaller 
configurations.)

Signed-off-by: Christoph Lameter <clameter@sgi.com>


Comments

Andrew Morton June 28, 2006, 10:49 p.m. UTC | #1
Christoph Lameter <clameter@sgi.com> wrote:
>
> We detected a slight degree of contention with the new zoned VM counters 
> if over 128 processors simultaneously fault anonymous pages. Raising the 
> threshold to 64 fixed the contention issue.
> 
> So we need to increase the threshold depending on the number of processors
> in the system. And it may be best to overcompensate a bit.
> 
> We keep the existing threshold of 32 for configurations with less than or
> equal to 64 processors. In the range of 65 to 128 processors we go to a
> threshold of 64. Beyond that we go to 125 (we have to be able to increment
> one beyond the threshold and then better avoid 127 just in case).
> 
> (There are more scalability improvements possible by utilizing the 
> cacheline when it has been acquired to update all pending counters but I 
> want to first test with a few hundred processors to see if we need those 
> and then we need to figure out if there are bad effects for smaller 
> configurations.)
> 
> ...
>  
> +/*
> + * With higher processor counts we need higher thresholds to avoid contention.
> + */
> +#if NR_CPUS <= 64
>  #define STAT_THRESHOLD 32
> +#elif NR_CPUS <= 128
> +#define STAT_THRESHOLD 64
> +#else
> +/*
> + * Use the maximum usable threshold.
> + * We need to increment one beyond the threshold. To be safe
> + * also avoid 127.
> + */
> +#define STAT_THRESHOLD 125
> +#endif

As we become less and less dependent upon NR_CPUS, it becomes more and more
viable for people to ship kernels which are compiled with a very high
NR_CPUS value.  And code such as the above becomes less and less effective.

An alternative would be to calculate stat_threshold at runtime, based on
the cpu_possible count (I guess).  Or even:

static inline int stat_threshold(void)
{
#if NR_CPUS <= 32
	return 32;
#else
	return dynamically_calculated_stat_threshold;
#endif
}
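
dynamically_calculated_stat_threshold is left open above. One hedged sketch,
with the init function name and the exact tiers chosen for illustration
rather than taken from any posted patch, would fill it in once at boot from
the possible-CPU count, mirroring the compile-time values:

static int dynamically_calculated_stat_threshold;

static void __init stat_threshold_init(void)
{
	int cpus = num_possible_cpus();

	if (cpus <= 64)
		dynamically_calculated_stat_threshold = 32;
	else if (cpus <= 128)
		dynamically_calculated_stat_threshold = 64;
	else
		dynamically_calculated_stat_threshold = 125;	/* keep threshold + 1 in an s8 */
}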

Did you consider my earlier suggestion about these counters?  That, over the
short-term, they tend to count in only one direction?  So we can do

	if (x > STAT_THRESHOLD) {
		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
		x = -STAT_THRESHOLD/2;
	} else if (x < -STAT_THRESHOLD) {
		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
		x = STAT_THRESHOLD/2;
	}

that'll give a decrease in contention while not consuming any extra
storage and while (I think) increasing accuracy.
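
As a rough standalone illustration of that claim, assuming a purely
incrementing counter and a threshold of 32 (plain userspace C rather than
the kernel code), the skewed reset stretches the interval between
shared-counter updates from THRESHOLD+1 to roughly 1.5*THRESHOLD increments
while the running total stays exact:

#include <stdio.h>

#define T 32	/* stands in for STAT_THRESHOLD */

int main(void)
{
	long global = 0, folds = 0;
	int x = 0;	/* per-CPU differential */
	int i;

	for (i = 0; i < 100000; i++) {
		x++;
		if (x > T) {
			/* skewed fold: hand over an extra T/2 now, restart at -T/2 */
			global += x + T / 2;
			x = -T / 2;
			folds++;
		}
	}
	global += x;	/* drain the remainder so the totals can be compared */
	/*
	 * total comes out to exactly 100000.  Without the skew a fold fires
	 * every 33 increments; with it, only every 49, i.e. roughly a third
	 * fewer writes to the shared counter.
	 */
	printf("total=%ld folds=%ld\n", global, folds);
	return 0;
}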

Whatever way we go, let's think harder about this one, please.  Anything
which uses NR_CPUS is a red flag.
Christoph Lameter June 29, 2006, 6:22 p.m. UTC | #2
On Wed, 28 Jun 2006, Andrew Morton wrote:

> An alternative would be to calculate stat_threshold at runtime, based on
> the cpu_possible count (I guess).  Or even:
> 
> static inline int stat_threshold(void)
> {
> #if NR_CPUS <= 32
> 	return 32;
> #else
> 	return dynamically_calculated_stat_threshold;
> #endif
> }

That's one more memory reference. Hmmm... We could place the threshold in 
the same cacheline. That would also open up the possibility of dynamically 
calculating the threshold.

> Did you consider my earlier suggestion about these counters?  That, over the
> short-term, they tend to count in only one direction?  So we can do
> 
> 	if (x > STAT_THRESHOLD) {
> 		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
> 		x = -STAT_THRESHOLD/2;
> 	} else if (x < -STAT_THRESHOLD) {
> 		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
> 		x = STAT_THRESHOLD/2;
> 	}
> 
> that'll give a decrease in contention while not consuming any extra
> storage and while (I think) increasing accuracy.

Uhh... We are overcompensating right? Pretty funky idea that is new to me 
and that would require some thought.

This would basically increase the stepping by 50% if we are only going in 
one direction.

If we are doing a mixture of allocations and deallocations (or pages being 
marked dirty / undirty, mapped / unmapped) then this may potentially
increase the number of updates and therefore the cacheline contention.
Andrew Morton June 29, 2006, 6:57 p.m. UTC | #3
On Thu, 29 Jun 2006 11:22:45 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 28 Jun 2006, Andrew Morton wrote:
> 
> > An alternative would be to calculate stat_threshold at runtime, based on
> > the cpu_possible count (I guess).  Or even:
> > 
> > static inline int stat_threshold(void)
> > {
> > #if NR_CPUS <= 32
> > 	return 32;
> > #else
> > 	return dynamically_calculated_stat_threshold;
> > #endif
> > }
> 
> That's one more memory reference. Hmmm... We could place the threshold in 
> the same cacheline. That would also open up the possibility of dynamically 
> calculating the threshold.

yup.

> > Did you consider my earlier suggestion about these counters?  That, over the
> > short-term, they tend to count in only one direction?  So we can do
> > 
> > 	if (x > STAT_THRESHOLD) {
> > 		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
> > 		x = -STAT_THRESHOLD/2;
> > 	} else if (x < -STAT_THRESHOLD) {
> > 		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
> > 		x = STAT_THRESHOLD/2;
> > 	}
> > 
> > that'll give a decrease in contention while not consuming any extra
> > storage and while (I think) increasing accuracy.
> 
> Uhh... We are overcompensating right? Pretty funky idea that is new to me 
> and that would require some thought.

See inbox ;)

> This would basically increase the stepping by 50% if we are only going in 
> one direction.

yes.

> If we are doing a mixture of allocations and deallocations (or pages being 
> marked dirty / undirty, mapped / unmapped) then this may potentially
> increase the number of updates and therefore the cacheline contention.

Yes.  I'd handwavingly contend that this is a rare situation.

A lot of the counters only ever count in one direction!  So we could even
skew them by an entire STAT_THRESHOLD.
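
For a counter known to only ever increment, a sketch of that full skew (not
from any posted patch) would be:

	if (x > STAT_THRESHOLD) {
		/* push the whole skew now; the next fold is ~2*STAT_THRESHOLD away */
		zone_page_state_add(x + STAT_THRESHOLD, zone, item);
		x = -STAT_THRESHOLD;
	}

The extra STAT_THRESHOLD handed to the global counter is exactly offset by
the negative restart value, so nothing is lost; only the fold frequency drops.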
Christoph Lameter June 30, 2006, 6:14 a.m. UTC | #4
On Thu, 29 Jun 2006, Christoph Lameter wrote:

> That's one more memory reference. Hmmm... We could place the threshold in 
> the same cacheline. That would also open up the possibility of dynamically 
> calculating the threshold.
> 

like this?

ZVC: Dynamic counter threshold configuration

We add a threshold field before the first counter and calculate the
threshold based on the size of the zone, on the assumption that larger
machines have larger zones and can therefore tolerate larger thresholds.
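
For orientation, here is a small userspace rendering of the formula that
vm_stat_setup() in the patch below implements, assuming 4 KiB pages (the
standalone wrapper is only illustrative): one threshold step per 64 MiB of
zone memory, clamped to the 4..125 range.

#include <stdio.h>

#define PAGE_SIZE	4096UL	/* assumed page size for this illustration */

static int calc_threshold(unsigned long present_pages)
{
	int threshold = present_pages / (1024 * 1024 * 64 / PAGE_SIZE);

	if (threshold < 4)
		threshold = 4;
	if (threshold > 125)
		threshold = 125;
	return threshold;
}

int main(void)
{
	unsigned long mb;

	/* 128 MB -> 4 (clamped), 512 MB -> 8, 2048 MB -> 32, 8192 MB -> 125 (capped) */
	for (mb = 128; mb <= 8192; mb *= 4)
		printf("%5lu MB zone -> threshold %d\n", mb,
		       calc_threshold(mb * (1024 * 1024 / PAGE_SIZE)));
	return 0;
}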

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-mm4/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/vmstat.c	2006-06-29 13:07:13.342483251 -0700
+++ linux-2.6.17-mm4/mm/vmstat.c	2006-06-29 13:17:52.978741713 -0700
@@ -112,37 +112,22 @@ atomic_long_t vm_stat[NR_VM_ZONE_STAT_IT
 EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
-
-#define STAT_THRESHOLD 32
-
-/*
- * Determine pointer to currently valid differential byte given a zone and
- * the item number.
- *
- * Preemption must be off
- */
-static inline s8 *diff_pointer(struct zone *zone, enum zone_stat_item item)
-{
-	return &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];
-}
-
 /*
  * For use when we know that interrupts are disabled.
  */
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	s8 *p;
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
-	p = diff_pointer(zone, item);
 	x = delta + *p;
 
-	if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
+	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
 		zone_page_state_add(x, zone, item);
 		x = 0;
 	}
-
 	*p = x;
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
@@ -184,11 +169,12 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 static void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	s8 *p = diff_pointer(zone, item);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
 
-	if (unlikely(*p > STAT_THRESHOLD)) {
+	if (unlikely(*p > pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -203,11 +189,12 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	struct zone *zone = page_zone(page);
-	s8 *p = diff_pointer(zone, item);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
 
-	if (unlikely(*p < -STAT_THRESHOLD)) {
+	if (unlikely(*p < -pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -238,16 +225,12 @@ EXPORT_SYMBOL(inc_zone_page_state);
 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
-	struct zone *zone;
-	s8 *p;
+	struct zone *zone = page_zone(page);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
-	zone = page_zone(page);
 	local_irq_save(flags);
-	p = diff_pointer(zone, item);
-
-	(*p)--;
-
-	if (unlikely(*p < -STAT_THRESHOLD)) {
+	if (unlikely(*p < -pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -502,6 +485,7 @@ static int zoneinfo_show(struct seq_file
 		seq_printf(m,
 			   ")"
 			   "\n  pagesets");
+		seq_printf(m, "\n    stat_threshold: %i", zone_pcp(zone, 0)->stat_threshold);
 		for_each_online_cpu(i) {
 			struct per_cpu_pageset *pageset;
 			int j;
Index: linux-2.6.17-mm4/include/linux/mmzone.h
===================================================================
--- linux-2.6.17-mm4.orig/include/linux/mmzone.h	2006-06-29 13:07:12.486090884 -0700
+++ linux-2.6.17-mm4/include/linux/mmzone.h	2006-06-29 13:07:26.486201908 -0700
@@ -77,6 +77,7 @@ struct per_cpu_pages {
 struct per_cpu_pageset {
 	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
 #ifdef CONFIG_SMP
+	s8 stat_threshold;	/* maximum diff before update */
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
 } ____cacheline_aligned_in_smp;
Index: linux-2.6.17-mm4/mm/page_alloc.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/page_alloc.c	2006-06-29 13:07:13.313188187 -0700
+++ linux-2.6.17-mm4/mm/page_alloc.c	2006-06-29 13:14:57.831351395 -0700
@@ -1918,6 +1918,8 @@ static int __cpuinit process_zones(int c
 		if (percpu_pagelist_fraction)
 			setup_pagelist_highmark(zone_pcp(zone, cpu),
 			 	(zone->present_pages / percpu_pagelist_fraction));
+
+		vm_stat_setup(zone, cpu);
 	}
 
 	return 0;
@@ -2039,6 +2041,7 @@ static __meminit void zone_pcp_init(stru
 #else
 		setup_pageset(zone_pcp(zone,cpu), batch);
 #endif
+		vm_stat_setup(zone, cpu);
 	}
 	if (zone->present_pages)
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
Index: linux-2.6.17-mm4/include/linux/vmstat.h
===================================================================
--- linux-2.6.17-mm4.orig/include/linux/vmstat.h	2006-06-29 13:07:12.916728323 -0700
+++ linux-2.6.17-mm4/include/linux/vmstat.h	2006-06-29 13:18:20.594224260 -0700
@@ -174,6 +175,27 @@ extern void inc_zone_state(struct zone *
 void refresh_cpu_vm_stats(int);
 void refresh_vm_stats(void);
 
+static inline void vm_stat_setup(struct zone *zone, int cpu)
+{
+	int threshold;
+
+	/*
+	 * Increase the threshold by one for every 64 Megabytes of zone memory.
+	 * The bigger the zone, the more slack we can allow ourselves.
+	 *
+	 * The minimum threshold of 4 is equal to 256 MB of memory.
+	 * The maximum of 125 reflects 8 Gigabytes.
+	 */
+	threshold = zone->present_pages / (1024*1024*64 / PAGE_SIZE);
+
+	/*
+	 * Threshold must be between 4 and 125
+	 */
+	threshold = max(4, min(125, threshold));
+
+	zone_pcp(zone, cpu)->stat_threshold = threshold;
+}
+
 #else /* CONFIG_SMP */
 
 /*
@@ -210,6 +232,7 @@ static inline void __dec_zone_page_state
 
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_vm_stats(void) { }
+static inline void vm_stat_setup(struct zone *z, int cpu) { }
 #endif
 
 #endif /* _LINUX_VMSTAT_H */
Christoph Lameter June 30, 2006, 6:15 a.m. UTC | #5
On Thu, 29 Jun 2006, Christoph Lameter wrote:

> > Did you consider my earlier suggestion about these counters?  That, over the
> > short-term, they tend to count in only one direction?  So we can do
> Uhh... We are overcompensating right? Pretty funky idea that is new to me 
> and that would require some thought.
> 
> This would basically increase the stepping by 50% if we are only going in 
> one direction.

A patch that does this:


ZVC: overcompensate while incrementing ZVC counters

Overcompensate by a balance factor when incrementing or decrementing
ZVC counters, anticipating continued movement in the same direction.

Note that I have not been able to see any effect of this approach on
an 8p system where I tested this.
I probably will have a chance to test it on larger systems (160p) tomorrow.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-mm4/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/vmstat.c	2006-06-29 13:35:16.959161608 -0700
+++ linux-2.6.17-mm4/mm/vmstat.c	2006-06-29 13:54:46.361438715 -0700
@@ -167,6 +167,9 @@ EXPORT_SYMBOL(mod_zone_page_state);
  * in between and therefore the atomicity vs. interrupt cannot be exploited
  * in a useful way here.
  */
+
+#define balance_factor(x) (x->stat_threshold / 2)
+
 static void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
@@ -175,8 +178,8 @@ static void __inc_zone_state(struct zone
 	(*p)++;
 
 	if (unlikely(*p > pcp->stat_threshold)) {
-		zone_page_state_add(*p, zone, item);
-		*p = 0;
+		zone_page_state_add(*p + balance_factor(pcp), zone, item);
+		*p = -balance_factor(pcp);
 	}
 }
 
@@ -195,8 +198,8 @@ void __dec_zone_page_state(struct page *
 	(*p)--;
 
 	if (unlikely(*p < -pcp->stat_threshold)) {
-		zone_page_state_add(*p, zone, item);
-		*p = 0;
+		zone_page_state_add(*p - balance_factor(pcp), zone, item);
+		*p = balance_factor(pcp);
 	}
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
Andrew Morton June 30, 2006, 6:31 a.m. UTC | #6
On Thu, 29 Jun 2006 23:15:55 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 Jun 2006, Christoph Lameter wrote:
> 
> > > Did you consider my earlier suggestion about these counters?  That, over the
> > > short-term, they tend to count in only one direction?  So we can do
> > Uhh... We are overcompensating right? Pretty funky idea that is new to me 
> > and that would require some thought.
> > 
> > This would basically increase the stepping by 50% if we are only going in 
> > one direction.
> 
> A patch that does this:
> 
> 
> ZVC: overcompensate while incrementing ZVC counters
> 
> Overcompensate by a balance factor when incrementing or decrementing
> ZVC counters, anticipating continued movement in the same direction.

Looks sensible.

Please check that none of this is racy wrt memory hotplug
(process_zones->vm_stat_setup).

> Note that I have not been able to see any effect of this approach on
> an 8p system where I tested this.
> I probably will have a chance to test it on larger systems (160p) tomorrow.

OK.  But let's not rush - it's only fine-tuning.  I'm thinking we should
get what we have in -mm4 into -rc1 - I think it's stable enough, and we
don't want to be carrying all those changes splattered across the VM for the
next two months.  Let's aim to get the well-measured fine-tuning in place
for -rc2, OK?

Are you aware of any to-do items remaining in the -mm4 patches?  The NFS
changes need a review from Trond - hopefully he'll be able to find 5-10
minutes to do that sometime?
Christoph Lameter June 30, 2006, 7:17 a.m. UTC | #7
On Thu, 29 Jun 2006, Andrew Morton wrote:

> Are you aware of any to-do items remaining in the -mm4 patches?  The NFS
> changes need a review from Trond - hopefully he'll be able to find 5-10
> minutes to do that sometime?

I am reasonably sure that these are okay now. I spent some quality hours 
with NFS. sigh. But another eye would be good.

Christoph Lameter June 30, 2006, 5:15 p.m. UTC | #8
On Thu, 29 Jun 2006, Andrew Morton wrote:

> A lot of the counters only ever count in one direction!  So we could even
> skew them by an entire STAT_THRESHOLD.

I just tested both approaches separately. If I leave the STAT_THRESHOLD at 
32 and just add the STAT_THRESHOLD / 2 trick we can scale cleanly up to 
160p in the synthetic test.

I have a bad feeling though about 512p, 1024p and 4096p with the static 
approach with STAT_THRESHOLD/2, but it will take some time to get tests 
done on those.

I think the best right now is to just take the dynamic threshold patch. I 
expect that to do just fine at the higher numbers. If we have more trouble 
at 1024p then maybe try STAT_THRESHOLD/2 before going to 
aggregate counter consolidation.


Patch
diff mbox series

Index: linux-2.6.17-mm3/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm3.orig/mm/vmstat.c	2006-06-27 20:24:37.455840645 -0700
+++ linux-2.6.17-mm3/mm/vmstat.c	2006-06-28 10:32:14.441818620 -0700
@@ -112,7 +112,21 @@  atomic_long_t vm_stat[NR_VM_ZONE_STAT_IT
 
 #ifdef CONFIG_SMP
 
+/*
+ * With higher processor counts we need higher thresholds to avoid contention.
+ */
+#if NR_CPUS <= 64
 #define STAT_THRESHOLD 32
+#elif NR_CPUS <= 128
+#define STAT_THRESHOLD 64
+#else
+/*
+ * Use the maximum usable threshold.
+ * We need to increment one beyond the threshold. To be safe
+ * also avoid 127.
+ */
+#define STAT_THRESHOLD 125
+#endif
 
 /*
  * Determine pointer to currently valid differential byte given a zone and