linux-kernel.vger.kernel.org archive mirror
* [RFC] Event counters [1/3]: Basic counter functionality
@ 2005-12-20 23:57 Christoph Lameter
  2005-12-20 23:57 ` [RFC] Event counters [2/3]: Convert inc_page_state -> count_event Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Lameter @ 2005-12-20 23:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nick Piggin, linux-mm, Marcelo Tosatti, Christoph Lameter, Andi Kleen

Lightweight counter functions

The counters remaining in page_state after the zoned VM counter patch has
been applied are all just for show in /proc/vmstat. They have no essential
function for the VM, so we can make these counters lightweight by ignoring
races, and we can also provide an off switch for embedded systems that
allows building Linux kernels without these counters.

These counters are implemented as inline code that typically results in a
simple increment of a global memory location.
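
For illustration, a call site then looks like this (just a sketch; the
actual conversions follow in the next patch, and "order"/"zone" stand for
whatever the caller has at hand):

	count_event(PGFAULT);			/* count a single event */
	count_events(PGFREE, 1 << order);	/* count several events at once */
	count_zone_event(PGALLOC, zone);	/* count a zone specific event */

On most platforms each of these compiles down to an increment (or add) on
the current cpu's slot in the event_states array.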

Also
- Rename page_state to event_state
- Make event state an array indexed by the event item.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc5-mm3/init/Kconfig
===================================================================
--- linux-2.6.15-rc5-mm3.orig/init/Kconfig	2005-12-16 11:44:09.000000000 -0800
+++ linux-2.6.15-rc5-mm3/init/Kconfig	2005-12-20 14:15:23.000000000 -0800
@@ -411,6 +411,15 @@ config SLAB
 	  SLOB is more space efficient but does not scale well and is
 	  more susceptible to fragmentation.
 
+config EVENT_COUNTERS
+	default y
+	bool "Enable event counters for /proc/vmstat" if EMBEDDED
+	help
+	  Event counters are only needed to display statistics. They
+	  have no function for the kernel itself. This option allows
+	  the disabling of the event counters. /proc/vmstat will only
+	  contain essential counters.
+
 config SERIAL_PCI
 	depends PCI && SERIAL_8250
 	default y
Index: linux-2.6.15-rc5-mm3/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2005-12-20 13:15:45.000000000 -0800
+++ linux-2.6.15-rc5-mm3/include/linux/page-flags.h	2005-12-20 14:55:00.000000000 -0800
@@ -77,120 +77,70 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#ifdef CONFIG_EVENT_COUNTERS
 /*
- * Global page accounting.  One instance per CPU.  Only unsigned longs are
- * allowed.
+ * Light weight per cpu counter implementation.
  *
- * - Fields can be modified with xxx_page_state and xxx_page_state_zone at
- * any time safely (which protects the instance from modification by
- * interrupt.
- * - The __xxx_page_state variants can be used safely when interrupts are
- * disabled.
- * - The __xxx_page_state variants can be used if the field is only
- * modified from process context, or only modified from interrupt context.
- * In this case, the field should be commented here.
+ * Note that these can race. We do not bother to disable preemption
+ * or care about interrupt races. All we care about is to have some
+ * approximate count of events.
+ *
+ * Counters should only be incremented and no critical kernel component
+ * should rely on the counter values.
+ *
+ * Counters are handled completely inline. On many platforms the code
+ * generated will simply be the increment of a global address.
  */
-struct page_state {
-	/*
-	 * The below are zeroed by get_page_state().  Use get_full_page_state()
-	 * to add up all these.
-	 */
-	unsigned long pgpgin;		/* Disk reads */
-	unsigned long pgpgout;		/* Disk writes */
-	unsigned long pswpin;		/* swap reads */
-	unsigned long pswpout;		/* swap writes */
-
-	unsigned long pgalloc_high;	/* page allocations */
-	unsigned long pgalloc_normal;
-	unsigned long pgalloc_dma32;
-	unsigned long pgalloc_dma;
-
-	unsigned long pgfree;		/* page freeings */
-	unsigned long pgactivate;	/* pages moved inactive->active */
-	unsigned long pgdeactivate;	/* pages moved active->inactive */
-
-	unsigned long pgfault;		/* faults (major+minor) */
-	unsigned long pgmajfault;	/* faults (major only) */
-
-	unsigned long pgrefill_high;	/* inspected in refill_inactive_zone */
-	unsigned long pgrefill_normal;
-	unsigned long pgrefill_dma32;
-	unsigned long pgrefill_dma;
-
-	unsigned long pgsteal_high;	/* total highmem pages reclaimed */
-	unsigned long pgsteal_normal;
-	unsigned long pgsteal_dma32;
-	unsigned long pgsteal_dma;
-
-	unsigned long pgscan_kswapd_high;/* total highmem pages scanned */
-	unsigned long pgscan_kswapd_normal;
-	unsigned long pgscan_kswapd_dma32;
-	unsigned long pgscan_kswapd_dma;
-
-	unsigned long pgscan_direct_high;/* total highmem pages scanned */
-	unsigned long pgscan_direct_normal;
-	unsigned long pgscan_direct_dma32;
-	unsigned long pgscan_direct_dma;
-
-	unsigned long pginodesteal;	/* pages reclaimed via inode freeing */
-	unsigned long slabs_scanned;	/* slab objects scanned */
-	unsigned long kswapd_steal;	/* pages reclaimed by kswapd */
-	unsigned long kswapd_inodesteal;/* reclaimed via kswapd inode freeing */
-	unsigned long pageoutrun;	/* kswapd's calls to page reclaim */
-	unsigned long allocstall;	/* direct reclaim calls */
+#define FOR_ALL_ZONES(x) x##_DMA, x##_DMA32, x##_NORMAL, x##_HIGH
 
-	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
+enum event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
+		FOR_ALL_ZONES(PGALLOC),
+		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFAULT, PGMAJFAULT,
+ 		FOR_ALL_ZONES(PGREFILL),
+ 		FOR_ALL_ZONES(PGSTEAL),
+		FOR_ALL_ZONES(PGSCAN_KSWAPD),
+		FOR_ALL_ZONES(PGSCAN_DIRECT),
+		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
+		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		NR_EVENT_ITEMS
 };
 
-extern void get_full_page_state(struct page_state *ret);
-extern unsigned long read_page_state_offset(unsigned long offset);
-extern void mod_page_state_offset(unsigned long offset, unsigned long delta);
-extern void __mod_page_state_offset(unsigned long offset, unsigned long delta);
-
-#define read_page_state(member) \
-	read_page_state_offset(offsetof(struct page_state, member))
-
-#define mod_page_state(member, delta)	\
-	mod_page_state_offset(offsetof(struct page_state, member), (delta))
-
-#define __mod_page_state(member, delta)	\
-	__mod_page_state_offset(offsetof(struct page_state, member), (delta))
-
-#define inc_page_state(member)		mod_page_state(member, 1UL)
-#define dec_page_state(member)		mod_page_state(member, 0UL - 1)
-#define add_page_state(member,delta)	mod_page_state(member, (delta))
-#define sub_page_state(member,delta)	mod_page_state(member, 0UL - (delta))
-
-#define __inc_page_state(member)	__mod_page_state(member, 1UL)
-#define __dec_page_state(member)	__mod_page_state(member, 0UL - 1)
-#define __add_page_state(member,delta)	__mod_page_state(member, (delta))
-#define __sub_page_state(member,delta)	__mod_page_state(member, 0UL - (delta))
-
-#define page_state(member) (*__page_state(offsetof(struct page_state, member)))
-
-#define state_zone_offset(zone, member)					\
-({									\
-	unsigned offset;						\
-	if (is_highmem(zone))						\
-		offset = offsetof(struct page_state, member##_high);	\
-	else if (is_normal(zone))					\
-		offset = offsetof(struct page_state, member##_normal);	\
-	else if (is_dma32(zone))					\
-		offset = offsetof(struct page_state, member##_dma32);	\
-	else								\
-		offset = offsetof(struct page_state, member##_dma);	\
-	offset;								\
-})
-
-#define __mod_page_state_zone(zone, member, delta)			\
- do {									\
-	__mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
- } while (0)
-
-#define mod_page_state_zone(zone, member, delta)			\
- do {									\
-	mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
- } while (0)
+struct event_state {
+	unsigned long event[NR_EVENT_ITEMS];
+};
+
+DECLARE_PER_CPU(struct event_state, event_states);
+
+extern unsigned long get_global_events(enum event_item e);
+extern void sum_events(unsigned long *r, cpumask_t *cpumask);
+extern void all_events(unsigned long *r);
+
+static inline unsigned long get_cpu_events(enum event_item item)
+{
+	return __get_cpu_var(event_states).event[item];
+}
+
+static inline void count_event(enum event_item item)
+{
+	__get_cpu_var(event_states).event[item]++;
+}
+
+static inline void count_events(enum event_item item, long delta)
+{
+	__get_cpu_var(event_states).event[item] += delta;
+}
+
+#else
+/* Disable counters */
+#define get_cpu_events(e)	0L
+#define get_global_events(e)	0L
+#define count_event(e)		do { } while (0)
+#define count_events(e,d)	do { } while (0)
+#endif
+
+#define count_zone_event(item, zone) count_event(item##_DMA + zone_idx(zone))
+#define count_zone_events(item, zone, delta) count_events(item##_DMA + zone_idx(zone), delta)
 
 /*
  * Zone based accounting with per cpu differentials.
Index: linux-2.6.15-rc5-mm3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/page_alloc.c	2005-12-20 14:14:33.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/page_alloc.c	2005-12-20 15:46:46.000000000 -0800
@@ -1573,77 +1573,50 @@ static void show_node(struct zone *zone)
 #define show_node(zone)	do { } while (0)
 #endif
 
-/*
- * Accumulate the page_state information across all CPUs.
- * The result is unavoidably approximate - it can change
- * during and after execution of this function.
- */
-static DEFINE_PER_CPU(struct page_state, page_states) = {0};
+#ifdef CONFIG_EVENT_COUNTERS
+DEFINE_PER_CPU(struct event_state, event_states) = {{0}};
 
-static void __get_page_state(struct page_state *ret, int nr, cpumask_t *cpumask)
+void sum_events(unsigned long *ret, cpumask_t *cpumask)
 {
 	int cpu = 0;
+	int i;
 
-	memset(ret, 0, sizeof(*ret));
+	memset(ret, 0, NR_EVENT_ITEMS * sizeof(unsigned long));
 
 	cpu = first_cpu(*cpumask);
 	while (cpu < NR_CPUS) {
-		unsigned long *in, *out, off;
-
-		in = (unsigned long *)&per_cpu(page_states, cpu);
+		struct event_state *this = &per_cpu(event_states, cpu);
 
 		cpu = next_cpu(cpu, *cpumask);
 
 		if (cpu < NR_CPUS)
-			prefetch(&per_cpu(page_states, cpu));
+			prefetch(&per_cpu(event_states, cpu));
+
 
-		out = (unsigned long *)ret;
-		for (off = 0; off < nr; off++)
-			*out++ += *in++;
+		for (i=0; i< NR_EVENT_ITEMS; i++)
+			ret[i] += this->event[i];
 	}
 }
+EXPORT_SYMBOL(sum_events);
 
-void get_full_page_state(struct page_state *ret)
+void all_events(unsigned long *ret)
 {
-	cpumask_t mask = CPU_MASK_ALL;
-
-	__get_page_state(ret, sizeof(*ret) / sizeof(unsigned long), &mask);
+	sum_events(ret, &cpu_online_map);
 }
+EXPORT_SYMBOL(all_events);
 
-unsigned long read_page_state_offset(unsigned long offset)
+unsigned long get_global_events(enum event_item e)
 {
 	unsigned long ret = 0;
 	int cpu;
 
-	for_each_cpu(cpu) {
-		unsigned long in;
+	for_each_cpu(cpu)
+		ret += per_cpu(event_states, cpu).event[e];
 
-		in = (unsigned long)&per_cpu(page_states, cpu) + offset;
-		ret += *((unsigned long *)in);
-	}
 	return ret;
 }
-
-void __mod_page_state_offset(unsigned long offset, unsigned long delta)
-{
-	void *ptr;
-
-	ptr = &__get_cpu_var(page_states);
-	*(unsigned long *)(ptr + offset) += delta;
-}
-EXPORT_SYMBOL(__mod_page_state_offset);
-
-void mod_page_state_offset(unsigned long offset, unsigned long delta)
-{
-	unsigned long flags;
-	void *ptr;
-
-	ptr = &__get_cpu_var(page_states);
-	local_irq_save(flags);
-	*(unsigned long *)(ptr + offset) += delta;
-	local_irq_restore(flags);
-}
-EXPORT_SYMBOL(mod_page_state_offset);
+EXPORT_SYMBOL(get_global_events);
+#endif
 
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free, struct pglist_data *pgdat)
@@ -2663,7 +2636,7 @@ static char *vmstat_text[] = {
 	"nr_unstable",
 	"nr_bounce",
 
-	/* Page state */
+#ifdef CONFIG_EVENT_COUNTERS
 	"pgpgin",
 	"pgpgout",
 	"pswpin",
@@ -2709,28 +2682,36 @@ static char *vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated"
+#endif
 };
 
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
 {
 	unsigned long *v;
-	struct page_state *ps;
+#ifdef CONFIG_EVENT_COUNTERS
+	unsigned long *e;
+#endif
 	int i;
 
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
-
+#ifdef CONFIG_EVENT_COUNTERS
 	v = kmalloc(NR_STAT_ITEMS * sizeof(unsigned long)
-			+ sizeof(struct page_state), GFP_KERNEL);
+			+ sizeof(struct event_state), GFP_KERNEL);
+#else
+	v = kmalloc(NR_STAT_ITEMS * sizeof(unsigned long), GFP_KERNEL);
+#endif
 	m->private = v;
 	if (!v)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < NR_STAT_ITEMS; i++)
 		v[i] = global_page_state(i);
-	ps = (struct page_state *)(v + NR_STAT_ITEMS);
-	get_full_page_state(ps);
-	ps->pgpgin /= 2;		/* sectors -> kbytes */
-	ps->pgpgout /= 2;
+#ifdef CONFIG_EVENT_COUNTERS
+	e = v + NR_STAT_ITEMS;
+	all_events(e);
+	e[PGPGIN] /= 2;		/* sectors -> kbytes */
+	e[PGPGOUT] /= 2;
+#endif
 	return v + *pos;
 }
 



* [RFC] Event counters [2/3]: Convert inc_page_state -> count_event
  2005-12-20 23:57 [RFC] Event counters [1/3]: Basic counter functionality Christoph Lameter
@ 2005-12-20 23:57 ` Christoph Lameter
  2005-12-20 23:57 ` [RFC] Event counters [3/3]: Convert NUMA counters to event counters Christoph Lameter
  2005-12-31  6:46 ` [RFC] Event counters [1/3]: Basic counter functionality Marcelo Tosatti
  2 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2005-12-20 23:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nick Piggin, linux-mm, Andi Kleen, Marcelo Tosatti, Christoph Lameter

Convert inc/add page_state to count_event(s)

Convert the page_state operations to count_event() and count_zone_event()

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc5-mm3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/page_alloc.c	2005-12-20 14:30:49.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/page_alloc.c	2005-12-20 14:36:36.000000000 -0800
@@ -443,7 +443,7 @@ static void __free_pages_ok(struct page 
 
 	kernel_map_pages(page, 1 << order, 0);
 	local_irq_save(flags);
-	__mod_page_state(pgfree, 1 << order);
+	count_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order);
 	local_irq_restore(flags);
 }
@@ -473,7 +473,7 @@ void fastcall __init __free_pages_bootme
 
 		arch_free_page(page, order);
 
-		mod_page_state(pgfree, 1 << order);
+		count_events(PGFREE, 1 << order);
 
 		list_add(&page->lru, &list);
 		kernel_map_pages(page, 1 << order, 0);
@@ -1030,7 +1030,7 @@ static void fastcall free_hot_cold_page(
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 	local_irq_save(flags);
-	__inc_page_state(pgfree);
+	count_event(PGFREE);
 	list_add(&page->lru, &pcp->list);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
@@ -1097,7 +1097,7 @@ again:
 			goto failed;
 	}
 
-	__mod_page_state_zone(zone, pgalloc, 1 << order);
+	count_zone_events(PGALLOC, zone, 1 << order);
 	zone_statistics(zonelist, zone, cpu);
 	local_irq_restore(flags);
 	put_cpu();
Index: linux-2.6.15-rc5-mm3/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/vmscan.c	2005-12-20 12:57:42.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/vmscan.c	2005-12-20 14:36:36.000000000 -0800
@@ -227,7 +227,7 @@ int shrink_slab(unsigned long scanned, g
 					(nr_before - shrink_ret));
 			}
 			shrinker_stat_add(shrinker, nr_req, this_scan);
-			mod_page_state(slabs_scanned, this_scan);
+			count_events(SLABS_SCANNED, this_scan);
 			total_scan -= this_scan;
 
 			cond_resched();
@@ -567,7 +567,7 @@ keep:
 	list_splice(&ret_pages, page_list);
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
-	mod_page_state(pgactivate, pgactivate);
+	count_events(PGACTIVATE, pgactivate);
 	sc->nr_reclaimed += reclaimed;
 	return reclaimed;
 }
@@ -888,11 +888,11 @@ static void shrink_cache(struct zone *zo
 
 		local_irq_disable();
 		if (current_is_kswapd()) {
-			__mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
-			__mod_page_state(kswapd_steal, nr_freed);
+			count_zone_events(PGSCAN_KSWAPD, zone, nr_scan);
+			count_events(KSWAPD_STEAL, nr_freed);
 		} else
-			__mod_page_state_zone(zone, pgscan_direct, nr_scan);
-		__mod_page_state_zone(zone, pgsteal, nr_freed);
+			count_zone_events(PGSCAN_DIRECT, zone, nr_scan);
+		count_zone_events(PGSTEAL, zone, nr_freed);
 
 		spin_lock(&zone->lru_lock);
 		/*
@@ -1056,11 +1056,10 @@ refill_inactive_zone(struct zone *zone, 
 		}
 	}
 	zone->nr_active += pgmoved;
-	spin_unlock(&zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
-	__mod_page_state_zone(zone, pgrefill, pgscanned);
-	__mod_page_state(pgdeactivate, pgdeactivate);
-	local_irq_enable();
+	count_zone_events(PGREFILL, zone, pgscanned);
+	count_events(PGDEACTIVATE, pgdeactivate);
 
 	pagevec_release(&pvec);
 }
@@ -1182,7 +1181,7 @@ int try_to_free_pages(struct zone **zone
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
 
-	inc_page_state(allocstall);
+	count_event(ALLOCSTALL);
 
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -1285,7 +1284,7 @@ loop_again:
 	sc.may_writepage = 0;
 	sc.nr_mapped = global_page_state(NR_MAPPED);
 
-	inc_page_state(pageoutrun);
+	count_event(PAGEOUTRUN);
 
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
Index: linux-2.6.15-rc5-mm3/block/ll_rw_blk.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/block/ll_rw_blk.c	2005-12-16 11:44:06.000000000 -0800
+++ linux-2.6.15-rc5-mm3/block/ll_rw_blk.c	2005-12-20 14:36:36.000000000 -0800
@@ -3007,9 +3007,9 @@ void submit_bio(int rw, struct bio *bio)
 	BIO_BUG_ON(!bio->bi_io_vec);
 	bio->bi_rw |= rw;
 	if (rw & WRITE)
-		mod_page_state(pgpgout, count);
+		count_events(PGPGOUT, count);
 	else
-		mod_page_state(pgpgin, count);
+		count_events(PGPGIN, count);
 
 	if (unlikely(block_dump)) {
 		char b[BDEVNAME_SIZE];
Index: linux-2.6.15-rc5-mm3/mm/page_io.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/page_io.c	2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/page_io.c	2005-12-20 14:36:36.000000000 -0800
@@ -101,7 +101,7 @@ int swap_writepage(struct page *page, st
 	}
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
-	inc_page_state(pswpout);
+	count_event(PSWPOUT);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);
@@ -123,7 +123,7 @@ int swap_readpage(struct file *file, str
 		ret = -ENOMEM;
 		goto out;
 	}
-	inc_page_state(pswpin);
+	count_event(PSWPIN);
 	submit_bio(READ, bio);
 out:
 	return ret;
Index: linux-2.6.15-rc5-mm3/mm/memory.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/memory.c	2005-12-20 12:58:31.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/memory.c	2005-12-20 14:36:36.000000000 -0800
@@ -1887,7 +1887,7 @@ static int do_swap_page(struct mm_struct
 
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
-		inc_page_state(pgmajfault);
+		count_event(PGMAJFAULT);
 		grab_swap_token();
 	}
 
@@ -2247,7 +2247,7 @@ int __handle_mm_fault(struct mm_struct *
 
 	__set_current_state(TASK_RUNNING);
 
-	inc_page_state(pgfault);
+	count_event(PGFAULT);
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, write_access);
Index: linux-2.6.15-rc5-mm3/fs/inode.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/fs/inode.c	2005-12-16 11:44:08.000000000 -0800
+++ linux-2.6.15-rc5-mm3/fs/inode.c	2005-12-20 14:36:36.000000000 -0800
@@ -462,9 +462,9 @@ static void prune_icache(int nr_to_scan)
 	up(&iprune_sem);
 
 	if (current_is_kswapd())
-		mod_page_state(kswapd_inodesteal, reap);
+		count_events(KSWAPD_INODESTEAL, reap);
 	else
-		mod_page_state(pginodesteal, reap);
+		count_events(PGINODESTEAL, reap);
 }
 
 /*
Index: linux-2.6.15-rc5-mm3/mm/shmem.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/shmem.c	2005-12-16 11:44:09.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/shmem.c	2005-12-20 14:36:36.000000000 -0800
@@ -996,7 +996,7 @@ repeat:
 			spin_unlock(&info->lock);
 			/* here we actually do the io */
 			if (type && *type == VM_FAULT_MINOR) {
-				inc_page_state(pgmajfault);
+				count_event(PGMAJFAULT);
 				*type = VM_FAULT_MAJOR;
 			}
 			swappage = shmem_swapin(info, swap, idx);
Index: linux-2.6.15-rc5-mm3/mm/swap.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/swap.c	2005-12-16 11:44:09.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/swap.c	2005-12-20 14:36:36.000000000 -0800
@@ -85,7 +85,7 @@ int rotate_reclaimable_page(struct page 
 	if (PageLRU(page) && !PageActive(page)) {
 		list_del(&page->lru);
 		list_add_tail(&page->lru, &zone->inactive_list);
-		inc_page_state(pgrotated);
+		count_event(PGROTATED);
 	}
 	if (!test_clear_page_writeback(page))
 		BUG();
@@ -105,7 +105,7 @@ void fastcall activate_page(struct page 
 		del_page_from_inactive_list(zone, page);
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
-		inc_page_state(pgactivate);
+		count_event(PGACTIVATE);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
Index: linux-2.6.15-rc5-mm3/mm/filemap.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/filemap.c	2005-12-20 12:57:55.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/filemap.c	2005-12-20 14:36:36.000000000 -0800
@@ -1283,7 +1283,7 @@ retry_find:
 		 */
 		if (!did_readaround) {
 			majmin = VM_FAULT_MAJOR;
-			inc_page_state(pgmajfault);
+			count_event(PGMAJFAULT);
 		}
 		did_readaround = 1;
 		ra_pages = max_sane_readahead(file->f_ra.ra_pages);
@@ -1354,7 +1354,7 @@ no_cached_page:
 page_not_uptodate:
 	if (!did_readaround) {
 		majmin = VM_FAULT_MAJOR;
-		inc_page_state(pgmajfault);
+		count_event(PGMAJFAULT);
 	}
 	lock_page(page);
 
Index: linux-2.6.15-rc5-mm3/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/fs/ncpfs/mmap.c	2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5-mm3/fs/ncpfs/mmap.c	2005-12-20 14:47:50.000000000 -0800
@@ -93,7 +93,7 @@ static struct page* ncp_file_mmap_nopage
 	 */
 	if (type)
 		*type = VM_FAULT_MAJOR;
-	inc_page_state(pgmajfault);
+	count_event(PGMAJFAULT);
 	return page;
 }
 
Index: linux-2.6.15-rc5-mm3/drivers/parisc/led.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/drivers/parisc/led.c	2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5-mm3/drivers/parisc/led.c	2005-12-20 14:47:50.000000000 -0800
@@ -410,14 +410,12 @@ static __inline__ int led_get_net_activi
 static __inline__ int led_get_diskio_activity(void)
 {	
 	static unsigned long last_pgpgin, last_pgpgout;
-	struct page_state pgstat;
 	int changed;
 
-	get_full_page_state(&pgstat); /* get no of sectors in & out */
-
 	/* Just use a very simple calculation here. Do not care about overflow,
 	   since we only want to know if there was activity or not. */
-	changed = (pgstat.pgpgin != last_pgpgin) || (pgstat.pgpgout != last_pgpgout);
+	changed = (get_global_events(PGPGIN) != last_pgpgin) ||
+		  (get_global_events(PGPGOUT) != last_pgpgout);
-	last_pgpgin  = pgstat.pgpgin;
-	last_pgpgout = pgstat.pgpgout;
+	last_pgpgin  = get_global_events(PGPGIN);
+	last_pgpgout = get_global_events(PGPGOUT);
 


* [RFC] Event counters [3/3]: Convert NUMA counters to event counters
  2005-12-20 23:57 [RFC] Event counters [1/3]: Basic counter functionality Christoph Lameter
  2005-12-20 23:57 ` [RFC] Event counters [2/3]: Convert inc_page_state -> count_event Christoph Lameter
@ 2005-12-20 23:57 ` Christoph Lameter
  2005-12-21 22:24   ` Christoph Lameter
  2005-12-31  6:46 ` [RFC] Event counters [1/3]: Basic counter functionality Marcelo Tosatti
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2005-12-20 23:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nick Piggin, linux-mm, Andi Kleen, Christoph Lameter, Marcelo Tosatti

Use event counters instead of numa statistics

I am not sure if this is such a bright idea, but one could remove the
NUMA statistics and use event counters instead. This patch reduces the
number of NUMA counters to two: one counting allocations that were not
able to follow memory policy (NUMA_MISS) and one counting off-node
allocations in general (NUMA_OFF_NODE).

This would greatly simplify NUMA counter handling. node/numastat would no
longer be available (maybe one could improvise a replacement from the
stats of the processors on the node?).

Big question: Do we really need these detailed NUMA statistics that are only
reported via numa stats and never used in the VM?

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc5-mm3/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5-mm3.orig/include/linux/mmzone.h	2005-12-20 13:15:44.000000000 -0800
+++ linux-2.6.15-rc5-mm3/include/linux/mmzone.h	2005-12-20 15:46:53.000000000 -0800
@@ -79,14 +79,6 @@ struct per_cpu_pageset {
 	s8 vm_stat_diff[NR_STAT_ITEMS];
 #endif
 
-#ifdef CONFIG_NUMA
-	unsigned long numa_hit;		/* allocated in intended node */
-	unsigned long numa_miss;	/* allocated in non intended node */
-	unsigned long numa_foreign;	/* was intended here, hit elsewhere */
-	unsigned long interleave_hit; 	/* interleaver prefered this zone */
-	unsigned long local_node;	/* allocation from local node */
-	unsigned long other_node;	/* allocation from other node */
-#endif
 } ____cacheline_aligned_in_smp;
 
 #ifdef CONFIG_NUMA
Index: linux-2.6.15-rc5-mm3/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2005-12-20 14:55:00.000000000 -0800
+++ linux-2.6.15-rc5-mm3/include/linux/page-flags.h	2005-12-20 15:46:53.000000000 -0800
@@ -103,6 +103,9 @@ enum event_item { PGPGIN, PGPGOUT, PSWPI
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_NUMA
+		NUMA_MISS, NUMA_OFF_NODE,
+#endif
 		NR_EVENT_ITEMS
 };
 
Index: linux-2.6.15-rc5-mm3/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/mempolicy.c	2005-12-16 11:44:09.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/mempolicy.c	2005-12-20 15:46:53.000000000 -0800
@@ -1072,15 +1072,9 @@ static struct page *alloc_page_interleav
 					unsigned nid)
 {
 	struct zonelist *zl;
-	struct page *page;
 
 	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zl->zones[0]) {
-		zone_pcp(zl->zones[0],get_cpu())->interleave_hit++;
-		put_cpu();
-	}
-	return page;
+	return __alloc_pages(gfp, order, zl);
 }
 
 /**
Index: linux-2.6.15-rc5-mm3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/page_alloc.c	2005-12-20 15:46:51.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/page_alloc.c	2005-12-20 15:46:53.000000000 -0800
@@ -989,24 +989,16 @@ void mark_free_pages(struct zone *zone)
 
 #endif /* CONFIG_PM */
 
-static void zone_statistics(struct zonelist *zonelist, struct zone *z, int cpu)
+static void zone_statistics(struct zonelist *zonelist, struct zone *z, int pages)
 {
 #ifdef CONFIG_NUMA
 	pg_data_t *pg = z->zone_pgdat;
-	pg_data_t *orig = zonelist->zones[0]->zone_pgdat;
-	struct per_cpu_pageset *p;
 
-	p = zone_pcp(z, cpu);
-	if (pg == orig) {
-		p->numa_hit++;
-	} else {
-		p->numa_miss++;
-		zone_pcp(zonelist->zones[0], cpu)->numa_foreign++;
-	}
-	if (pg == NODE_DATA(numa_node_id()))
-		p->local_node++;
-	else
-		p->other_node++;
+	if (pg != zonelist->zones[0]->zone_pgdat)
+		count_events(NUMA_MISS, pages);
+
+	if (pg != NODE_DATA(numa_node_id()))
+		count_events(NUMA_OFF_NODE, pages);
 #endif
 }
 
@@ -1098,7 +1090,7 @@ again:
 	}
 
 	count_zone_events(PGALLOC, zone, 1 << order);
-	zone_statistics(zonelist, zone, cpu);
+	zone_statistics(zonelist, zone, 1 << order);
 	local_irq_restore(flags);
 	put_cpu();
 
@@ -2586,21 +2578,6 @@ static int zoneinfo_show(struct seq_file
 					   pageset->pcp[j].high,
 					   pageset->pcp[j].batch);
 			}
-#ifdef CONFIG_NUMA
-			seq_printf(m,
-				   "\n            numa_hit:       %lu"
-				   "\n            numa_miss:      %lu"
-				   "\n            numa_foreign:   %lu"
-				   "\n            interleave_hit: %lu"
-				   "\n            local_node:     %lu"
-				   "\n            other_node:     %lu",
-				   pageset->numa_hit,
-				   pageset->numa_miss,
-				   pageset->numa_foreign,
-				   pageset->interleave_hit,
-				   pageset->local_node,
-				   pageset->other_node);
-#endif
 		}
 		seq_printf(m,
 			   "\n  all_unreclaimable: %u"
@@ -2681,7 +2658,11 @@ static char *vmstat_text[] = {
 	"pageoutrun",
 	"allocstall",
 
-	"pgrotated"
+	"pgrotated",
+#ifdef CONFIG_NUMA
+	"numa_miss",
+	"off_node"
+#endif
 #endif
 };
 
Index: linux-2.6.15-rc5-mm3/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/drivers/base/node.c	2005-12-20 13:15:45.000000000 -0800
+++ linux-2.6.15-rc5-mm3/drivers/base/node.c	2005-12-20 15:46:53.000000000 -0800
@@ -91,46 +91,6 @@ static ssize_t node_read_meminfo(struct 
 #undef K
 static SYSDEV_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
 
-static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
-{
-	unsigned long numa_hit, numa_miss, interleave_hit, numa_foreign;
-	unsigned long local_node, other_node;
-	int i, cpu;
-	pg_data_t *pg = NODE_DATA(dev->id);
-	numa_hit = 0;
-	numa_miss = 0;
-	interleave_hit = 0;
-	numa_foreign = 0;
-	local_node = 0;
-	other_node = 0;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zone *z = &pg->node_zones[i];
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			struct per_cpu_pageset *ps = zone_pcp(z,cpu);
-			numa_hit += ps->numa_hit;
-			numa_miss += ps->numa_miss;
-			numa_foreign += ps->numa_foreign;
-			interleave_hit += ps->interleave_hit;
-			local_node += ps->local_node;
-			other_node += ps->other_node;
-		}
-	}
-	return sprintf(buf,
-		       "numa_hit %lu\n"
-		       "numa_miss %lu\n"
-		       "numa_foreign %lu\n"
-		       "interleave_hit %lu\n"
-		       "local_node %lu\n"
-		       "other_node %lu\n",
-		       numa_hit,
-		       numa_miss,
-		       numa_foreign,
-		       interleave_hit,
-		       local_node,
-		       other_node);
-}
-static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
-
 static ssize_t node_read_distance(struct sys_device * dev, char * buf)
 {
 	int nid = dev->id;
@@ -166,7 +126,6 @@ int register_node(struct node *node, int
 	if (!error){
 		sysdev_create_file(&node->sysdev, &attr_cpumap);
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
-		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
 	}
 	return error;
@@ -183,7 +142,6 @@ void unregister_node(struct node *node)
 {
 	sysdev_remove_file(&node->sysdev, &attr_cpumap);
 	sysdev_remove_file(&node->sysdev, &attr_meminfo);
-	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	sysdev_unregister(&node->sysdev);


* Re: [RFC] Event counters [3/3]: Convert NUMA counters to event counters
  2005-12-20 23:57 ` [RFC] Event counters [3/3]: Convert NUMA counters to event counters Christoph Lameter
@ 2005-12-21 22:24   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2005-12-21 22:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Nick Piggin, linux-mm, Andi Kleen, Marcelo Tosatti

On Tue, 20 Dec 2005, Christoph Lameter wrote:

> no longer be available (maybe one could improvise one with the stats of
> the processors on the node?)

Here is such an alternate patch. It creates a per-cpu vmstat file in
/sys/devices/system/cpu/cpu<nr>/vmstat (omitting the zoned counters, which
are per zone and therefore node specific). This allows one to observe
VM-related events occurring on a specific processor.

It also preserves /sys/devices/system/node/node<nr>/numastat, which will
contain the summed-up event counters for all the processors on the node.

This preserves the existing numa_miss and other_node fields and adds lots
of other information that may be useful.

Global numa_miss and other_node fields are available in /proc/vmstat with
both of these patches. These fields may be helpful for quick checks that
allocation requests are being satisfied properly.

Index: linux-2.6.15-rc5-mm3/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5-mm3.orig/include/linux/mmzone.h	2005-12-20 13:15:44.000000000 -0800
+++ linux-2.6.15-rc5-mm3/include/linux/mmzone.h	2005-12-21 13:28:13.000000000 -0800
@@ -79,14 +79,6 @@ struct per_cpu_pageset {
 	s8 vm_stat_diff[NR_STAT_ITEMS];
 #endif
 
-#ifdef CONFIG_NUMA
-	unsigned long numa_hit;		/* allocated in intended node */
-	unsigned long numa_miss;	/* allocated in non intended node */
-	unsigned long numa_foreign;	/* was intended here, hit elsewhere */
-	unsigned long interleave_hit; 	/* interleaver prefered this zone */
-	unsigned long local_node;	/* allocation from local node */
-	unsigned long other_node;	/* allocation from other node */
-#endif
 } ____cacheline_aligned_in_smp;
 
 #ifdef CONFIG_NUMA
Index: linux-2.6.15-rc5-mm3/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2005-12-20 14:55:00.000000000 -0800
+++ linux-2.6.15-rc5-mm3/include/linux/page-flags.h	2005-12-21 14:03:13.000000000 -0800
@@ -103,6 +103,9 @@ enum event_item { PGPGIN, PGPGOUT, PSWPI
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_NUMA
+		NUMA_MISS, NUMA_OFF_NODE,
+#endif
 		NR_EVENT_ITEMS
 };
 
@@ -131,6 +134,8 @@ static inline void count_events(enum eve
 	__get_cpu_var(event_states).event[item] += delta;
 }
 
+extern char *vmstat_text[];
+
 #else
 /* Disable counters */
 #define get_cpu_events(e)	0L
Index: linux-2.6.15-rc5-mm3/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/mempolicy.c	2005-12-16 11:44:09.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/mempolicy.c	2005-12-21 13:28:13.000000000 -0800
@@ -1072,15 +1072,9 @@ static struct page *alloc_page_interleav
 					unsigned nid)
 {
 	struct zonelist *zl;
-	struct page *page;
 
 	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zl->zones[0]) {
-		zone_pcp(zl->zones[0],get_cpu())->interleave_hit++;
-		put_cpu();
-	}
-	return page;
+	return __alloc_pages(gfp, order, zl);
 }
 
 /**
Index: linux-2.6.15-rc5-mm3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/mm/page_alloc.c	2005-12-20 15:46:51.000000000 -0800
+++ linux-2.6.15-rc5-mm3/mm/page_alloc.c	2005-12-21 13:40:34.000000000 -0800
@@ -989,24 +989,16 @@ void mark_free_pages(struct zone *zone)
 
 #endif /* CONFIG_PM */
 
-static void zone_statistics(struct zonelist *zonelist, struct zone *z, int cpu)
+static void zone_statistics(struct zonelist *zonelist, struct zone *z, int pages)
 {
 #ifdef CONFIG_NUMA
 	pg_data_t *pg = z->zone_pgdat;
-	pg_data_t *orig = zonelist->zones[0]->zone_pgdat;
-	struct per_cpu_pageset *p;
 
-	p = zone_pcp(z, cpu);
-	if (pg == orig) {
-		p->numa_hit++;
-	} else {
-		p->numa_miss++;
-		zone_pcp(zonelist->zones[0], cpu)->numa_foreign++;
-	}
-	if (pg == NODE_DATA(numa_node_id()))
-		p->local_node++;
-	else
-		p->other_node++;
+	if (pg != zonelist->zones[0]->zone_pgdat)
+		count_events(NUMA_MISS, pages);
+
+	if (pg != NODE_DATA(numa_node_id()))
+		count_events(NUMA_OFF_NODE, pages);
 #endif
 }
 
@@ -1098,7 +1090,7 @@ again:
 	}
 
 	count_zone_events(PGALLOC, zone, 1 << order);
-	zone_statistics(zonelist, zone, cpu);
+	zone_statistics(zonelist, zone, 1 << order);
 	local_irq_restore(flags);
 	put_cpu();
 
@@ -2586,21 +2578,6 @@ static int zoneinfo_show(struct seq_file
 					   pageset->pcp[j].high,
 					   pageset->pcp[j].batch);
 			}
-#ifdef CONFIG_NUMA
-			seq_printf(m,
-				   "\n            numa_hit:       %lu"
-				   "\n            numa_miss:      %lu"
-				   "\n            numa_foreign:   %lu"
-				   "\n            interleave_hit: %lu"
-				   "\n            local_node:     %lu"
-				   "\n            other_node:     %lu",
-				   pageset->numa_hit,
-				   pageset->numa_miss,
-				   pageset->numa_foreign,
-				   pageset->interleave_hit,
-				   pageset->local_node,
-				   pageset->other_node);
-#endif
 		}
 		seq_printf(m,
 			   "\n  all_unreclaimable: %u"
@@ -2625,7 +2602,7 @@ struct seq_operations zoneinfo_op = {
 	.show	= zoneinfo_show,
 };
 
-static char *vmstat_text[] = {
+char *vmstat_text[] = {
 	/* Zoned VM counters */
 	"nr_mapped",
 	"nr_pagecache",
@@ -2681,7 +2658,11 @@ static char *vmstat_text[] = {
 	"pageoutrun",
 	"allocstall",
 
-	"pgrotated"
+	"pgrotated",
+#ifdef CONFIG_NUMA
+	"numa_miss",
+	"other_node"
+#endif
 #endif
 };
 
Index: linux-2.6.15-rc5-mm3/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/drivers/base/node.c	2005-12-20 13:15:45.000000000 -0800
+++ linux-2.6.15-rc5-mm3/drivers/base/node.c	2005-12-21 14:01:38.000000000 -0800
@@ -93,41 +93,16 @@ static SYSDEV_ATTR(meminfo, S_IRUGO, nod
 
 static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
 {
-	unsigned long numa_hit, numa_miss, interleave_hit, numa_foreign;
-	unsigned long local_node, other_node;
-	int i, cpu;
-	pg_data_t *pg = NODE_DATA(dev->id);
-	numa_hit = 0;
-	numa_miss = 0;
-	interleave_hit = 0;
-	numa_foreign = 0;
-	local_node = 0;
-	other_node = 0;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zone *z = &pg->node_zones[i];
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			struct per_cpu_pageset *ps = zone_pcp(z,cpu);
-			numa_hit += ps->numa_hit;
-			numa_miss += ps->numa_miss;
-			numa_foreign += ps->numa_foreign;
-			interleave_hit += ps->interleave_hit;
-			local_node += ps->local_node;
-			other_node += ps->other_node;
-		}
-	}
-	return sprintf(buf,
-		       "numa_hit %lu\n"
-		       "numa_miss %lu\n"
-		       "numa_foreign %lu\n"
-		       "interleave_hit %lu\n"
-		       "local_node %lu\n"
-		       "other_node %lu\n",
-		       numa_hit,
-		       numa_miss,
-		       numa_foreign,
-		       interleave_hit,
-		       local_node,
-		       other_node);
+	unsigned long v[NR_EVENT_ITEMS];
+	int i;
+	char *p = buf;
+
+	sum_events(v, &node_to_cpumask(dev->id));
+
+	for (i = 0; i < NR_EVENT_ITEMS; i++)
+		p += sprintf(p, "%s %ld\n", vmstat_text[NR_STAT_ITEMS + i],
+				v[i]);
+	return p - buf;
 }
 static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
Index: linux-2.6.15-rc5-mm3/drivers/base/cpu.c
===================================================================
--- linux-2.6.15-rc5-mm3.orig/drivers/base/cpu.c	2005-12-16 11:44:06.000000000 -0800
+++ linux-2.6.15-rc5-mm3/drivers/base/cpu.c	2005-12-21 13:59:35.000000000 -0800
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/topology.h>
 #include <linux/device.h>
+#include <linux/page-flags.h>
 
 #include "base.h"
 
@@ -83,6 +84,26 @@ static inline void register_cpu_control(
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+static ssize_t cpu_read_vmstat(struct sys_device * dev, char * buf)
+{
+	int i;
+	char *p = buf;
+	cpumask_t mask;
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	unsigned long v[NR_EVENT_ITEMS];
+
+	cpus_clear(mask);
+	cpu_set(cpu->sysdev.id, mask);
+	sum_events(v, &mask);
+
+	for (i = 0; i < NR_EVENT_ITEMS; i++)
+		p += sprintf(p, "%s %ld\n",
+			vmstat_text[NR_STAT_ITEMS + i], v[i]);
+	return p - buf;
+}
+static SYSDEV_ATTR(vmstat, S_IRUGO, cpu_read_vmstat, NULL);
+
+
 #ifdef CONFIG_KEXEC
 #include <linux/kexec.h>
 
@@ -138,6 +159,7 @@ int __devinit register_cpu(struct cpu *c
 	if (!error)
 		error = sysdev_create_file(&cpu->sysdev, &attr_crash_notes);
 #endif
+	sysdev_create_file(&cpu->sysdev, &attr_vmstat);
 	return error;
 }
 


* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-20 23:57 [RFC] Event counters [1/3]: Basic counter functionality Christoph Lameter
  2005-12-20 23:57 ` [RFC] Event counters [2/3]: Convert inc_page_state -> count_event Christoph Lameter
  2005-12-20 23:57 ` [RFC] Event counters [3/3]: Convert NUMA counters to event counters Christoph Lameter
@ 2005-12-31  6:46 ` Marcelo Tosatti
  2005-12-31  7:54   ` Nick Piggin
  2 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2005-12-31  6:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Nick Piggin, linux-mm, Andi Kleen

Hi Christoph,

On Tue, Dec 20, 2005 at 03:57:33PM -0800, Christoph Lameter wrote:
> Light weight counter functions
> 
> The remaining counters in page_state after the zoned VM counter patch has been
> applied are all just for show in /proc/vmstat. They have no essential function
> for the VM and therefore we can make these counters lightweight by ignoring
> races and also allow an off switch for embedded systems that allows a building
> of a linux kernels without these counters.
> 
> The implementation of these counters is through inline code that typically
> results in a simple increment of a global memory locations.
> 
> Also
> - Rename page_state to event_state
> - Make event state an array indexed by the event item.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.15-rc5-mm3/init/Kconfig
> ===================================================================
> --- linux-2.6.15-rc5-mm3.orig/init/Kconfig	2005-12-16 11:44:09.000000000 -0800
> +++ linux-2.6.15-rc5-mm3/init/Kconfig	2005-12-20 14:15:23.000000000 -0800
> @@ -411,6 +411,15 @@ config SLAB
>  	  SLOB is more space efficient but does not scale well and is
>  	  more susceptible to fragmentation.
>  
> +config EVENT_COUNTERS

Please rename to VM_EVENT_COUNTERS or a better name.

> +	default y
> +	bool "Enable event counters for /proc/vmstat" if EMBEDDED
> +	help
> +	  Event counters are only needed to display statistics. They
> +	  have no function for the kernel itself. This option allows
> +	  the disabling of the event counters. /proc/vmstat will only
> +	  contain essential counters.
> +
>  config SERIAL_PCI
>  	depends PCI && SERIAL_8250
>  	default y
> Index: linux-2.6.15-rc5-mm3/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2005-12-20 13:15:45.000000000 -0800
> +++ linux-2.6.15-rc5-mm3/include/linux/page-flags.h	2005-12-20 14:55:00.000000000 -0800
> @@ -77,120 +77,70 @@
>  #define PG_nosave_free		18	/* Free, should not be written */
>  #define PG_uncached		19	/* Page has been mapped as uncached */
>  
> +#ifdef CONFIG_EVENT_COUNTERS
>  /*
> - * Global page accounting.  One instance per CPU.  Only unsigned longs are
> - * allowed.
> + * Light weight per cpu counter implementation.
>   *
> - * - Fields can be modified with xxx_page_state and xxx_page_state_zone at
> - * any time safely (which protects the instance from modification by
> - * interrupt.
> - * - The __xxx_page_state variants can be used safely when interrupts are
> - * disabled.
> - * - The __xxx_page_state variants can be used if the field is only
> - * modified from process context, or only modified from interrupt context.
> - * In this case, the field should be commented here.
> + * Note that these can race. We do not bother to enable preemption
> + * or care about interrupt races. All we care about is to have some
> + * approximate count of events.

What about this addition to the documentation above, to make it a little more 
verbose:

	The possible race scenario is restricted to kernel preemption,
	and could happen as follows:

	thread A				thread B
a)	movl    xyz(%ebp), %eax			movl    xyz(%ebp), %eax
b)	incl    %eax				incl    %eax
c)	movl    %eax, xyz(%ebp)			movl    %eax, xyz(%ebp)

Thread A can be preempted in b), and thread B successfully increments the
counter, writing it back to memory. Now thread A resumes execution with
its stale copy of the counter and overwrites the current value.

The result is lost increments.

However, that should be a relatively rare condition.

> + *
> + * Counters should only be incremented and no critical kernel component
> + * should rely on the counter values.
> + *
> + * Counters are handled completely inline. On many platforms the code
> + * generated will simply be the increment of a global address.
>   */
> -struct page_state {
> -	/*
> -	 * The below are zeroed by get_page_state().  Use get_full_page_state()
> -	 * to add up all these.
> -	 */
> -	unsigned long pgpgin;		/* Disk reads */
> -	unsigned long pgpgout;		/* Disk writes */
> -	unsigned long pswpin;		/* swap reads */
> -	unsigned long pswpout;		/* swap writes */
> -
> -	unsigned long pgalloc_high;	/* page allocations */
> -	unsigned long pgalloc_normal;
> -	unsigned long pgalloc_dma32;
> -	unsigned long pgalloc_dma;
> -
> -	unsigned long pgfree;		/* page freeings */
> -	unsigned long pgactivate;	/* pages moved inactive->active */
> -	unsigned long pgdeactivate;	/* pages moved active->inactive */
> -
> -	unsigned long pgfault;		/* faults (major+minor) */
> -	unsigned long pgmajfault;	/* faults (major only) */
> -
> -	unsigned long pgrefill_high;	/* inspected in refill_inactive_zone */
> -	unsigned long pgrefill_normal;
> -	unsigned long pgrefill_dma32;
> -	unsigned long pgrefill_dma;
> -
> -	unsigned long pgsteal_high;	/* total highmem pages reclaimed */
> -	unsigned long pgsteal_normal;
> -	unsigned long pgsteal_dma32;
> -	unsigned long pgsteal_dma;
> -
> -	unsigned long pgscan_kswapd_high;/* total highmem pages scanned */
> -	unsigned long pgscan_kswapd_normal;
> -	unsigned long pgscan_kswapd_dma32;
> -	unsigned long pgscan_kswapd_dma;
> -
> -	unsigned long pgscan_direct_high;/* total highmem pages scanned */
> -	unsigned long pgscan_direct_normal;
> -	unsigned long pgscan_direct_dma32;
> -	unsigned long pgscan_direct_dma;
> -
> -	unsigned long pginodesteal;	/* pages reclaimed via inode freeing */
> -	unsigned long slabs_scanned;	/* slab objects scanned */
> -	unsigned long kswapd_steal;	/* pages reclaimed by kswapd */
> -	unsigned long kswapd_inodesteal;/* reclaimed via kswapd inode freeing */
> -	unsigned long pageoutrun;	/* kswapd's calls to page reclaim */
> -	unsigned long allocstall;	/* direct reclaim calls */
> +#define FOR_ALL_ZONES(x) x##_DMA, x##_DMA32, x##_NORMAL, x##_HIGH
>  
> -	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
> +enum event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> +		FOR_ALL_ZONES(PGALLOC),
> +		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFAULT, PGMAJFAULT,
> + 		FOR_ALL_ZONES(PGREFILL),
> + 		FOR_ALL_ZONES(PGSTEAL),
> +		FOR_ALL_ZONES(PGSCAN_KSWAPD),
> +		FOR_ALL_ZONES(PGSCAN_DIRECT),
> +		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		NR_EVENT_ITEMS
>  };
>  
> -extern void get_full_page_state(struct page_state *ret);
> -extern unsigned long read_page_state_offset(unsigned long offset);
> -extern void mod_page_state_offset(unsigned long offset, unsigned long delta);
> -extern void __mod_page_state_offset(unsigned long offset, unsigned long delta);
> -
> -#define read_page_state(member) \
> -	read_page_state_offset(offsetof(struct page_state, member))
> -
> -#define mod_page_state(member, delta)	\
> -	mod_page_state_offset(offsetof(struct page_state, member), (delta))
> -
> -#define __mod_page_state(member, delta)	\
> -	__mod_page_state_offset(offsetof(struct page_state, member), (delta))
> -
> -#define inc_page_state(member)		mod_page_state(member, 1UL)
> -#define dec_page_state(member)		mod_page_state(member, 0UL - 1)
> -#define add_page_state(member,delta)	mod_page_state(member, (delta))
> -#define sub_page_state(member,delta)	mod_page_state(member, 0UL - (delta))
> -
> -#define __inc_page_state(member)	__mod_page_state(member, 1UL)
> -#define __dec_page_state(member)	__mod_page_state(member, 0UL - 1)
> -#define __add_page_state(member,delta)	__mod_page_state(member, (delta))
> -#define __sub_page_state(member,delta)	__mod_page_state(member, 0UL - (delta))
> -
> -#define page_state(member) (*__page_state(offsetof(struct page_state, member)))
> -
> -#define state_zone_offset(zone, member)					\
> -({									\
> -	unsigned offset;						\
> -	if (is_highmem(zone))						\
> -		offset = offsetof(struct page_state, member##_high);	\
> -	else if (is_normal(zone))					\
> -		offset = offsetof(struct page_state, member##_normal);	\
> -	else if (is_dma32(zone))					\
> -		offset = offsetof(struct page_state, member##_dma32);	\
> -	else								\
> -		offset = offsetof(struct page_state, member##_dma);	\
> -	offset;								\
> -})
> -
> -#define __mod_page_state_zone(zone, member, delta)			\
> - do {									\
> -	__mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
> - } while (0)
> -
> -#define mod_page_state_zone(zone, member, delta)			\
> - do {									\
> -	mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
> - } while (0)
> +struct event_state {
> +	unsigned long event[NR_EVENT_ITEMS];
> +};
> +
> +DECLARE_PER_CPU(struct event_state, event_states);

It might be interesting to mark this structure as __cacheline_aligned_in_smp
to avoid unrelated data sitting in the same cache line?
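
I.e. something like the following (just a sketch of the idea; I used the
four-underscore, alignment-only form here since per-cpu data already gets
its own section):

struct event_state {
	unsigned long event[NR_EVENT_ITEMS];
} ____cacheline_aligned_in_smp;

DECLARE_PER_CPU(struct event_state, event_states);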



* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-31  6:46 ` [RFC] Event counters [1/3]: Basic counter functionality Marcelo Tosatti
@ 2005-12-31  7:54   ` Nick Piggin
  2005-12-31 20:26     ` Marcelo Tosatti
  2006-01-03 19:08     ` Christoph Lameter
  0 siblings, 2 replies; 14+ messages in thread
From: Nick Piggin @ 2005-12-31  7:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Christoph Lameter, linux-kernel, linux-mm, Andi Kleen

Marcelo Tosatti wrote:

> 
> What about this addition to the documentation above, to make it a little more 
> verbose:
> 
> 	The possible race scenario is restricted to kernel preemption,
> 	and could happen as follows:
> 
> 	thread A				thread B
> a)	movl    xyz(%ebp), %eax			movl    xyz(%ebp), %eax
> b)	incl    %eax				incl    %eax
> c)	movl    %eax, xyz(%ebp)			movl    %eax, xyz(%ebp)
> 
> Thread A can be preempted in b), and thread B succesfully increments the
> counter, writing it back to memory. Now thread A resumes execution, with
> its stale copy of the counter, and overwrites the current counter.
> 
> Resulting in increments lost.
> 
> However that should be relatively rare condition.
> 

Hi Guys,

I've been waiting for some mm/ patches to clear from -mm before commenting
too much... however I see that this patch is actually against -mm itself,
with my __mod_page_state stuff in it... that makes the page state accounting
much lighter weight AND is not racy.

So I'm not exactly sure why such a patch as this is wanted now? Are there
any more xxx_page_state hotspots? (I admit to only looking at page faults,
page allocator, and page reclaim).

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-31  7:54   ` Nick Piggin
@ 2005-12-31 20:26     ` Marcelo Tosatti
  2006-01-02 21:40       ` Marcelo Tosatti
  2006-01-03  5:01       ` Nick Piggin
  2006-01-03 19:08     ` Christoph Lameter
  1 sibling, 2 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2005-12-31 20:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, linux-kernel, linux-mm, Andi Kleen

Hi Nick!

On Sat, Dec 31, 2005 at 06:54:25PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> 
> >
> >What about this addition to the documentation above, to make it a little 
> >more verbose:
> >
> >	The possible race scenario is restricted to kernel preemption,
> >	and could happen as follows:
> >
> >	thread A				thread B
> >a)	movl    xyz(%ebp), %eax			movl    xyz(%ebp), %eax
> >b)	incl    %eax				incl    %eax
> >c)	movl    %eax, xyz(%ebp)			movl    %eax, xyz(%ebp)
> >
> >Thread A can be preempted in b), and thread B succesfully increments the
> >counter, writing it back to memory. Now thread A resumes execution, with
> >its stale copy of the counter, and overwrites the current counter.
> >
> >Resulting in increments lost.
> >
> >However that should be relatively rare condition.
> >
> 
> Hi Guys,
> 
> I've been waiting for some mm/ patches to clear from -mm before commenting
> too much... however I see that this patch is actually against -mm itself,
> with my __mod_page_state stuff in it... that makes the page state accounting
> much lighter weight AND is not racy.

It is racy with respect to preemption (please refer to the race condition
described above):

diff -puN mm/rmap.c~mm-page_state-opt mm/rmap.c
--- devel/mm/rmap.c~mm-page_state-opt   2005-12-13 22:25:01.000000000 -0800
+++ devel-akpm/mm/rmap.c        2005-12-13 22:25:01.000000000 -0800
@@ -451,7 +451,11 @@ static void __page_set_anon_rmap(struct 

        page->index = linear_page_index(vma, address);
 
-       inc_page_state(nr_mapped);
+       /*
+        * nr_mapped state can be updated without turning off
+        * interrupts because it is not modified via interrupt.
+        */
+       __inc_page_state(nr_mapped);
 }

And since "nr_mapped" is not a counter used only for debugging purposes, you
can't be lazy about its consistency.

I would argue that you need a preempt-safe version for these important
counters, surrounded by preempt_disable/preempt_enable (which vanish
if one selects !CONFIG_PREEMPT).

As Christoph notes, consistency of the purely-for-debugging counters can be
lazy, not even requiring correct preempt locking (hmm, this is debatable and
needs careful verification).
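
For Christoph's proposed interface that would amount to something like the
following (just a sketch of the suggestion, not code from either patch):

static inline void count_event(enum event_item item)
{
	preempt_disable();
	__get_cpu_var(event_states).event[item]++;
	preempt_enable();
}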
 
> So I'm not exactly sure why such a patch as this is wanted now? Are there
> any more xxx_page_state hotspots? (I admit to only looking at page faults,
> page allocator, and page reclaim).

A consolidation of the good parts of both would be interesting.

I don't see much point in Christoph's naming change to "event_counter";
why are you doing that?

And here follows an addition to your mm-page_state-opt-docs.patch. I still
need to verify "nr_dirty" and "nr_unstable".

Happy new year!

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 343083f..f173e0f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,10 +83,14 @@
 struct page_state {
 	unsigned long nr_dirty;		/* Dirty writeable pages */
 	unsigned long nr_writeback;	/* Pages under writeback */
+					/* also modified from IRQ context */
 	unsigned long nr_unstable;	/* NFS unstable pages */
 	unsigned long nr_page_table_pages;/* Pages used for pagetables */
+					/* only modified from process context */
 	unsigned long nr_mapped;	/* mapped into pagetables */
+					/* only modified from process context */
 	unsigned long nr_slab;		/* In slab */
+					/* also modified from IRQ context */
 #define GET_PAGE_STATE_LAST nr_slab
 
 	/*


* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-31 20:26     ` Marcelo Tosatti
@ 2006-01-02 21:40       ` Marcelo Tosatti
  2006-01-03  5:11         ` Nick Piggin
  2006-01-03  5:01       ` Nick Piggin
  1 sibling, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2006-01-02 21:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, linux-kernel, linux-mm, Andi Kleen

On Sat, Dec 31, 2005 at 06:26:02PM -0200, Marcelo Tosatti wrote:
> Hi Nick!
> 
> On Sat, Dec 31, 2005 at 06:54:25PM +1100, Nick Piggin wrote:
> > Marcelo Tosatti wrote:
> > 
> > >
> > >What about this addition to the documentation above, to make it a little 
> > >more verbose:
> > >
> > >	The possible race scenario is restricted to kernel preemption,
> > >	and could happen as follows:
> > >
> > >	thread A				thread B
> > >a)	movl    xyz(%ebp), %eax			movl    xyz(%ebp), %eax
> > >b)	incl    %eax				incl    %eax
> > >c)	movl    %eax, xyz(%ebp)			movl    %eax, xyz(%ebp)
> > >
> > >Thread A can be preempted in b), and thread B succesfully increments the
> > >counter, writing it back to memory. Now thread A resumes execution, with
> > >its stale copy of the counter, and overwrites the current counter.
> > >
> > >This results in lost increments.
> > >
> > >However, that should be a relatively rare condition.
> > >
> > 
> > Hi Guys,
> > 
> > I've been waiting for some mm/ patches to clear from -mm before commenting
> > too much... however I see that this patch is actually against -mm itself,
> > with my __mod_page_state stuff in it... that makes the page state accounting
> > much lighter weight AND is not racy.
> 
> It is racy with reference to preempt (please refer to the race condition
> described above):
> 
> diff -puN mm/rmap.c~mm-page_state-opt mm/rmap.c
> --- devel/mm/rmap.c~mm-page_state-opt   2005-12-13 22:25:01.000000000 -0800
> +++ devel-akpm/mm/rmap.c        2005-12-13 22:25:01.000000000 -0800
> @@ -451,7 +451,11 @@ static void __page_set_anon_rmap(struct 
> 
>         page->index = linear_page_index(vma, address);
>  
> -       inc_page_state(nr_mapped);
> +       /*
> +        * nr_mapped state can be updated without turning off
> +        * interrupts because it is not modified via interrupt.
> +        */
> +       __inc_page_state(nr_mapped);
>  }
> 
> And since "nr_mapped" is not a counter for debugging purposes only, you 
> can't be lazy with reference to its consistency.
> 
> I would argue that you need a preempt-safe version for these important
> counters, surrounded by preempt_disable/preempt_enable (which vanish
> if one selects !CONFIG_PREEMPT).

Nick, 

The following patch:

- Moves the lightweight "inc/dec" versions of mod_page_state variants
to three underscores, making those the default for locations where enough
locks are held.

- Make the two-underscore version disable and enable preemption, which 
is required to avoid preempt-related races which can result in missed
updates.

- Extends the lightweight version usage in page reclaim, 
pte allocation, and a few other codepaths.

How does it look? 


 fs/buffer.c                |    2 +-
 fs/inode.c                 |    4 ++--
 include/linux/page-flags.h |   17 +++++++++++++++++
 mm/memory.c                |    7 +++++--
 mm/page-writeback.c        |    2 +-
 mm/page_alloc.c            |   14 +++++++++++---
 mm/rmap.c                  |    3 ++-
 mm/swap.c                  |    4 ++--
 mm/vmscan.c                |    8 ++++----
 9 files changed, 45 insertions(+), 16 deletions(-)

diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/fs/buffer.c linux-2.6.15-rc5/fs/buffer.c
--- /tmp/2.6.15-rc5-mm3.orig/fs/buffer.c	2006-01-02 18:25:51.000000000 -0200
+++ linux-2.6.15-rc5/fs/buffer.c	2006-01-02 18:55:01.000000000 -0200
@@ -857,7 +857,7 @@ int __set_page_dirty_buffers(struct page
 		write_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (mapping_cap_account_dirty(mapping))
-				inc_page_state(nr_dirty);
+				___inc_page_state(nr_dirty);
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/fs/inode.c linux-2.6.15-rc5/fs/inode.c
--- /tmp/2.6.15-rc5-mm3.orig/fs/inode.c	2006-01-02 18:26:26.000000000 -0200
+++ linux-2.6.15-rc5/fs/inode.c	2006-01-02 18:54:36.000000000 -0200
@@ -462,9 +462,9 @@ static void prune_icache(int nr_to_scan)
 	up(&iprune_sem);
 
 	if (current_is_kswapd())
-		mod_page_state(kswapd_inodesteal, reap);
+		__mod_page_state(kswapd_inodesteal, reap);
 	else
-		mod_page_state(pginodesteal, reap);
+		__mod_page_state(pginodesteal, reap);
 }
 
 /*
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/include/linux/page-flags.h linux-2.6.15-rc5/include/linux/page-flags.h
--- /tmp/2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2006-01-02 18:26:10.000000000 -0200
+++ linux-2.6.15-rc5/include/linux/page-flags.h	2006-01-02 18:45:40.000000000 -0200
@@ -157,6 +157,7 @@ extern void get_page_state_node(struct p
 extern void get_full_page_state(struct page_state *ret);
 extern unsigned long read_page_state_offset(unsigned long offset);
 extern void mod_page_state_offset(unsigned long offset, unsigned long delta);
+extern void ___mod_page_state_offset(unsigned long offset, unsigned long delta);
 extern void __mod_page_state_offset(unsigned long offset, unsigned long delta);
 
 #define read_page_state(member) \
@@ -168,16 +169,27 @@ extern void __mod_page_state_offset(unsi
 #define __mod_page_state(member, delta)	\
 	__mod_page_state_offset(offsetof(struct page_state, member), (delta))
 
+#define ___mod_page_state(member, delta)	\
+	___mod_page_state_offset(offsetof(struct page_state, member), (delta))
+
+/* preempt/IRQ safe */
 #define inc_page_state(member)		mod_page_state(member, 1UL)
 #define dec_page_state(member)		mod_page_state(member, 0UL - 1)
 #define add_page_state(member,delta)	mod_page_state(member, (delta))
 #define sub_page_state(member,delta)	mod_page_state(member, 0UL - (delta))
 
+/* only preempt safe */
 #define __inc_page_state(member)	__mod_page_state(member, 1UL)
 #define __dec_page_state(member)	__mod_page_state(member, 0UL - 1)
 #define __add_page_state(member,delta)	__mod_page_state(member, (delta))
 #define __sub_page_state(member,delta)	__mod_page_state(member, 0UL - (delta))
 
+/* preempt/IRQ unsafe (plain inc/dec operations) */
+#define ___inc_page_state(member)	___mod_page_state(member, 1UL)
+#define ___dec_page_state(member)	___mod_page_state(member, 0UL - 1)
+#define ___add_page_state(member,delta)	___mod_page_state(member, (delta))
+#define ___sub_page_state(member,delta)	___mod_page_state(member, 0UL - (delta))
+
 #define page_state(member) (*__page_state(offsetof(struct page_state, member)))
 
 #define state_zone_offset(zone, member)					\
@@ -194,6 +206,11 @@ extern void __mod_page_state_offset(unsi
 	offset;								\
 })
 
+#define ___mod_page_state_zone(zone, member, delta)			\
+ do {									\
+	___mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
+ } while (0)
+
 #define __mod_page_state_zone(zone, member, delta)			\
  do {									\
 	__mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/memory.c linux-2.6.15-rc5/mm/memory.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/memory.c	2006-01-02 18:25:34.000000000 -0200
+++ linux-2.6.15-rc5/mm/memory.c	2006-01-02 18:39:49.000000000 -0200
@@ -116,7 +116,7 @@ static void free_pte_range(struct mmu_ga
 	pmd_clear(pmd);
 	pte_lock_deinit(page);
 	pte_free_tlb(tlb, page);
-	dec_page_state(nr_page_table_pages);
+	__dec_page_state(nr_page_table_pages);
 	tlb->mm->nr_ptes--;
 }
 
@@ -302,7 +302,10 @@ int __pte_alloc(struct mm_struct *mm, pm
 		pte_free(new);
 	} else {
 		mm->nr_ptes++;
-		inc_page_state(nr_page_table_pages);
+		/* nr_page_table_pages is never touched by interrupt
+		 * context, so it's not necessary to disable interrupts.
+		 */
+		__inc_page_state(nr_page_table_pages);
 		pmd_populate(mm, pmd, new);
 	}
 	spin_unlock(&mm->page_table_lock);
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/page_alloc.c linux-2.6.15-rc5/mm/page_alloc.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/page_alloc.c	2006-01-02 18:26:26.000000000 -0200
+++ linux-2.6.15-rc5/mm/page_alloc.c	2006-01-02 18:26:54.000000000 -0200
@@ -443,7 +443,7 @@ static void __free_pages_ok(struct page 
 
 	kernel_map_pages(page, 1 << order, 0);
 	local_irq_save(flags);
-	__mod_page_state(pgfree, 1 << order);
+	___mod_page_state(pgfree, 1 << order);
 	free_one_page(page_zone(page), page, order);
 	local_irq_restore(flags);
 }
@@ -753,7 +753,7 @@ static void fastcall free_hot_cold_page(
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 	local_irq_save(flags);
-	__inc_page_state(pgfree);
+	___inc_page_state(pgfree);
 	list_add(&page->lru, &pcp->list);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
@@ -820,7 +820,7 @@ again:
 			goto failed;
 	}
 
-	__mod_page_state_zone(zone, pgalloc, 1 << order);
+	___mod_page_state_zone(zone, pgalloc, 1 << order);
 	zone_statistics(zonelist, zone, cpu);
 	local_irq_restore(flags);
 	put_cpu();
@@ -1375,8 +1375,16 @@ unsigned long read_page_state_offset(uns
 	return ret;
 }
 
+
 void __mod_page_state_offset(unsigned long offset, unsigned long delta)
 {
+	preempt_disable();
+	___mod_page_state_offset(offset, delta);
+	preempt_enable();
+}
+
+void ___mod_page_state_offset(unsigned long offset, unsigned long delta)
+{
 	void *ptr;
 
 	ptr = &__get_cpu_var(page_states);
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/page-writeback.c linux-2.6.15-rc5/mm/page-writeback.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/page-writeback.c	2006-01-02 18:25:12.000000000 -0200
+++ linux-2.6.15-rc5/mm/page-writeback.c	2006-01-02 19:27:52.000000000 -0200
@@ -632,7 +632,7 @@ int __set_page_dirty_nobuffers(struct pa
 			if (mapping2) { /* Race with truncate? */
 				BUG_ON(mapping2 != mapping);
 				if (mapping_cap_account_dirty(mapping))
-					inc_page_state(nr_dirty);
+					___inc_page_state(nr_dirty);
 				radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 			}
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/rmap.c linux-2.6.15-rc5/mm/rmap.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/rmap.c	2006-01-02 18:25:35.000000000 -0200
+++ linux-2.6.15-rc5/mm/rmap.c	2006-01-02 18:29:48.000000000 -0200
@@ -453,7 +453,8 @@ static void __page_set_anon_rmap(struct 
 
 	/*
 	 * nr_mapped state can be updated without turning off
-	 * interrupts because it is not modified via interrupt.
+	 * interrupts because it is not modified via interrupt. 
+	 * But it must be guarded against preemption.
 	 */
 	__inc_page_state(nr_mapped);
 }
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/swap.c linux-2.6.15-rc5/mm/swap.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/swap.c	2006-01-02 18:25:35.000000000 -0200
+++ linux-2.6.15-rc5/mm/swap.c	2006-01-02 18:33:50.000000000 -0200
@@ -85,7 +85,7 @@ int rotate_reclaimable_page(struct page 
 	if (PageLRU(page) && !PageActive(page)) {
 		list_del(&page->lru);
 		list_add_tail(&page->lru, &zone->inactive_list);
-		inc_page_state(pgrotated);
+		___inc_page_state(pgrotated);
 	}
 	if (!test_clear_page_writeback(page))
 		BUG();
@@ -105,7 +105,7 @@ void fastcall activate_page(struct page 
 		del_page_from_inactive_list(zone, page);
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
-		inc_page_state(pgactivate);
+		___inc_page_state(pgactivate);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
diff -Nur -p --exclude-from=linux-2.6.15-rc5/Documentation/dontdiff /tmp/2.6.15-rc5-mm3.orig/mm/vmscan.c linux-2.6.15-rc5/mm/vmscan.c
--- /tmp/2.6.15-rc5-mm3.orig/mm/vmscan.c	2006-01-02 18:26:26.000000000 -0200
+++ linux-2.6.15-rc5/mm/vmscan.c	2006-01-02 18:53:56.000000000 -0200
@@ -1058,8 +1058,8 @@ refill_inactive_zone(struct zone *zone, 
 	zone->nr_active += pgmoved;
 	spin_unlock(&zone->lru_lock);
 
-	__mod_page_state_zone(zone, pgrefill, pgscanned);
-	__mod_page_state(pgdeactivate, pgdeactivate);
+	___mod_page_state_zone(zone, pgrefill, pgscanned);
+	___mod_page_state(pgdeactivate, pgdeactivate);
 	local_irq_enable();
 
 	pagevec_release(&pvec);
@@ -1182,7 +1182,7 @@ int try_to_free_pages(struct zone **zone
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
 
-	inc_page_state(allocstall);
+	__inc_page_state(allocstall);
 
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -1285,7 +1285,7 @@ loop_again:
 	sc.may_writepage = 0;
 	sc.nr_mapped = read_page_state(nr_mapped);
 
-	inc_page_state(pageoutrun);
+	__inc_page_state(pageoutrun);
 
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-31 20:26     ` Marcelo Tosatti
  2006-01-02 21:40       ` Marcelo Tosatti
@ 2006-01-03  5:01       ` Nick Piggin
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-01-03  5:01 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Christoph Lameter, lkml, linux-mm, Andi Kleen

On Sat, 2005-12-31 at 18:26 -0200, Marcelo Tosatti wrote:
> Hi Nick!
> 

Hey Marcelo!

> On Sat, Dec 31, 2005 at 06:54:25PM +1100, Nick Piggin wrote:
> > 
> > Hi Guys,
> > 
> > I've been waiting for some mm/ patches to clear from -mm before commenting
> > too much... however I see that this patch is actually against -mm itself,
> > with my __mod_page_state stuff in it... that makes the page state accounting
> > much lighter weight AND is not racy.
> 
> It is racy with reference to preempt (please refer to the race condition
> described above):
> 
> diff -puN mm/rmap.c~mm-page_state-opt mm/rmap.c
> --- devel/mm/rmap.c~mm-page_state-opt   2005-12-13 22:25:01.000000000 -0800
> +++ devel-akpm/mm/rmap.c        2005-12-13 22:25:01.000000000 -0800
> @@ -451,7 +451,11 @@ static void __page_set_anon_rmap(struct 
> 
>         page->index = linear_page_index(vma, address);
>  
> -       inc_page_state(nr_mapped);
> +       /*
> +        * nr_mapped state can be updated without turning off
> +        * interrupts because it is not modified via interrupt.
> +        */
> +       __inc_page_state(nr_mapped);
>  }
> 
> And since "nr_mapped" is not a counter for debugging purposes only, you 
> can't be lazy with reference to its consistency.
> 
> I would argue that you need a preempt-safe version for these important
> counters, surrounded by preempt_disable/preempt_enable (which vanish
> if one selects !CONFIG_PREEMPT).
> 

I think it should not be racy because the function should always be
called with the page table lock held, which disables preempt. I guess
the comment should be explicit about that as well.
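
To make that explicit, the calling context is roughly this (a sketch
only; the function name is made up for illustration):

/*
 * __page_set_anon_rmap() runs with the pte lock held.  spin_lock()
 * implies preempt_disable() on CONFIG_PREEMPT kernels (and without
 * CONFIG_PREEMPT there is no preemption at all), so the unlocked
 * per-cpu read-modify-write cannot be interleaved with another task
 * on this CPU.
 */
static void example_anon_rmap_context(spinlock_t *ptl)
{
	spin_lock(ptl);			/* preemption disabled from here on */
	__inc_page_state(nr_mapped);	/* plain per-cpu increment, no IRQ users */
	spin_unlock(ptl);		/* preemption enabled again */
}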

There were some runtime warnings that come up when this patch first
went into -mm because of a silly typo, however that should now be
resolved too.

> As Christoph notes, debugging counter consistency can be lazy, not even
> requiring correct preempt locking (hum, this is debatable, needs careful
> verification).
>  
> > So I'm not exactly sure why such a patch as this is wanted now? Are there
> > any more xxx_page_state hotspots? (I admit to only looking at page faults,
> > page allocator, and page reclaim).
> 
> A consolidation of the good parts of both would be interesting.
> 
> I don't see much point in Christoph's naming change to "event_counter", 
> why are you doing that?
> 
> And here follows an addition to your mm-page_state-opt-docs.patch. I still
> need to verify "nr_dirty" and "nr_unstable".
> 
> Happy new year!
> 

Thanks, happy new year to you too!

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2006-01-02 21:40       ` Marcelo Tosatti
@ 2006-01-03  5:11         ` Nick Piggin
  2006-01-03 10:11           ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-01-03  5:11 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Christoph Lameter, lkml, linux-mm, Andi Kleen

On Mon, 2006-01-02 at 19:40 -0200, Marcelo Tosatti wrote:

> Nick, 
> 
> The following patch:
> 
> - Moves the lightweight "inc/dec" versions of mod_page_state variants
> to three underscores, making those the default for locations where enough
> locks are held.
> 

I guess I was hoping to try to keep it simple, and just have two
variants, the __ version would require the caller to do the locking.
In cases like eg. allocstall, they should happen infrequently enough
that the extra complexity is probably not worth worrying about.

I don't think I commented about the preempt race though (and requirement
to have preempt off from process context), which obviously can be a
problem as you say (though I think things are currently safe?).

> - Make the two-underscore version disable and enable preemption, which 
> is required to avoid preempt-related races which can result in missed
> updates.
> 
> - Extends the lightweight version usage in page reclaim, 
> pte allocation, and a few other codepaths.
> 

I guess nr_dirty looks OK in the places it can be put under tree_lock.

nr_page_table_pages is OK because ptl should be held to prevent preempt.

pgrotated and pgactivate should be good because of lru_lock.
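
Spelling out the lru_lock case (a rough sketch of the activate_page()
path from your patch, with the locking consequence annotated):

	spin_lock_irq(&zone->lru_lock);		/* IRQs off, hence no preemption either */
	if (PageLRU(page) && !PageActive(page)) {
		del_page_from_inactive_list(zone, page);
		SetPageActive(page);
		add_page_to_active_list(zone, page);
		___inc_page_state(pgactivate);	/* raw increment cannot be interleaved */
	}
	spin_unlock_irq(&zone->lru_lock);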

Thanks for going through these!

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2006-01-03  5:11         ` Nick Piggin
@ 2006-01-03 10:11           ` Marcelo Tosatti
  2006-01-03 13:21             ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2006-01-03 10:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, lkml, linux-mm, Andi Kleen

On Tue, Jan 03, 2006 at 04:11:46PM +1100, Nick Piggin wrote:
> On Mon, 2006-01-02 at 19:40 -0200, Marcelo Tosatti wrote:
> 
> > Nick, 
> > 
> > The following patch:
> > 
> > - Moves the lightweight "inc/dec" versions of mod_page_state variants
> > to three underscores, making those the default for locations where enough
> > locks are held.
> > 
> 
> I guess I was hoping to try to keep it simple, and just have two
> variants, the __ version would require the caller to do the locking.

I see - one point is that the two/three underscore versions make
it clear that preempt is required, but it might be a bit
over-complicated as you say.

Well, it's up to you - please rearrange the patch as you wish and merge
up?

> In cases like eg. allocstall, they should happen infrequently enough
> that the extra complexity is probably not worth worrying about.

True, but it reduces kernel code, which is always good.

> I don't think I commented about the preempt race though (and requirement
> to have preempt off from process context), which obviously can be a
> problem as you say (though I think things are currently safe?).

"I think it should not be racy because the function should always be
called with the page table lock held, which disables preempt. I guess
the comment should be explicit about that as well."

Yes, you're right! My bad.

> > - Make the two-underscore version disable and enable preemption, which 
> > is required to avoid preempt-related races which can result in missed
> > updates.
> > 
> > - Extends the lightweight version usage in page reclaim, 
> > pte allocation, and a few other codepaths.
> > 
> 
> I guess nr_dirty looks OK in the places it can be put under tree_lock.
> 
> nr_page_table_pages is OK because ptl should be held to prevent preempt.
> 
> pgrotated and pgactivate should be good because of lru_lock.
> 
> Thanks for going through these!

There are probably still a few more counters, but the ones covered so far
should be the most significant ones performance-wise.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2006-01-03 13:21             ` Nick Piggin
@ 2006-01-03 12:06               ` Marcelo Tosatti
  0 siblings, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2006-01-03 12:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, lkml, linux-mm, Andi Kleen

On Wed, Jan 04, 2006 at 12:21:55AM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Tue, Jan 03, 2006 at 04:11:46PM +1100, Nick Piggin wrote:
> >
> 
> >>I guess I was hoping to try to keep it simple, and just have two
> >>variants, the __ version would require the caller to do the locking.
> >
> >
> >I see - one point is that the two/three underscore versions make
> >it clear that preempt is required, but it might be a bit
> >over-complicated as you say.
> >
> >Well, it's up to you - please rearrange the patch as you wish and merge
> >up?
> >
> 
> OK I will push it upstream - thanks!
> 
> We can revisit details again when some smoke clears from the
> coming 2.6.16 merge cycle?

Sure - we can also go further and optimize the operations on the
remaining counters.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2006-01-03 10:11           ` Marcelo Tosatti
@ 2006-01-03 13:21             ` Nick Piggin
  2006-01-03 12:06               ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2006-01-03 13:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Christoph Lameter, lkml, linux-mm, Andi Kleen

Marcelo Tosatti wrote:
> On Tue, Jan 03, 2006 at 04:11:46PM +1100, Nick Piggin wrote:
> 

>>I guess I was hoping to try to keep it simple, and just have two
>>variants, the __ version would require the caller to do the locking.
> 
> 
> I see - one point is that the two/three underscore versions make
> it clear that preempt is required, but it might be a bit
> over-complicated as you say.
> 
> Well, it's up to you - please rearrange the patch as you wish and merge
> up?
> 

OK I will push it upstream - thanks!

We can revisit details again when some smoke clears from the
coming 2.6.16 merge cycle?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Event counters [1/3]: Basic counter functionality
  2005-12-31  7:54   ` Nick Piggin
  2005-12-31 20:26     ` Marcelo Tosatti
@ 2006-01-03 19:08     ` Christoph Lameter
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-01-03 19:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Marcelo Tosatti, linux-kernel, linux-mm, Andi Kleen

On Sat, 31 Dec 2005, Nick Piggin wrote:

> So I'm not exactly sure why such a patch as this is wanted now? Are there
> any more xxx_page_state hotspots? (I admit to only looking at page faults,
> page allocator, and page reclaim).

The proposed patchset is based on the zoned counter patchset. This means
that the critical counters have already been converted to use different
macros. The discussion between Marcelo and Nick on nr_mapped etc. above is
therefore not relevant to this patch, since nr_mapped etc. are not event
counters but are handled by the zoned counters.

The event counters are the leftover vanity counters that are referenced
only for display in /proc, and the proposed approach is to allow only
increments and to tolerate racy updates.

These lightweight counters are then also used to optimize away the
NUMA-specific counters in the per-cpu structures.
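
Roughly, the idea looks like this (a sketch only - the names approximate
the count_event() interface of patch 2/3 rather than quoting it exactly,
and the example items are placeholders):

/* Example items only - the real list is the leftover /proc/vmstat fields. */
enum event_item { PGALLOC, PGFREE, PGFAULT, NR_EVENT_ITEMS };

#ifdef CONFIG_EVENT_COUNTERS
/*
 * One per-cpu array of unsigned longs, indexed by the event item.
 * No locking and no preempt/IRQ protection - a lost increment only
 * makes the /proc/vmstat numbers slightly approximate.
 */
struct event_state {
	unsigned long event[NR_EVENT_ITEMS];
};

DECLARE_PER_CPU(struct event_state, event_states);

static inline void count_event(enum event_item item)
{
	__get_cpu_var(event_states).event[item]++;
}
#else
/* With the option off, the counters compile away entirely. */
static inline void count_event(enum event_item item)
{
}
#endif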

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2006-01-03 19:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-20 23:57 [RFC] Event counters [1/3]: Basic counter functionality Christoph Lameter
2005-12-20 23:57 ` [RFC] Event counters [2/3]: Convert inc_page_state -> count_event Christoph Lameter
2005-12-20 23:57 ` [RFC] Event counters [3/3]: Convert NUMA counters to event counters Christoph Lameter
2005-12-21 22:24   ` Christoph Lameter
2005-12-31  6:46 ` [RFC] Event counters [1/3]: Basic counter functionality Marcelo Tosatti
2005-12-31  7:54   ` Nick Piggin
2005-12-31 20:26     ` Marcelo Tosatti
2006-01-02 21:40       ` Marcelo Tosatti
2006-01-03  5:11         ` Nick Piggin
2006-01-03 10:11           ` Marcelo Tosatti
2006-01-03 13:21             ` Nick Piggin
2006-01-03 12:06               ` Marcelo Tosatti
2006-01-03  5:01       ` Nick Piggin
2006-01-03 19:08     ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).