* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 20:43 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing Andrea Arcangeli
@ 2001-09-18 19:29 ` Marcelo Tosatti
  2001-09-18 19:39   ` Marcelo Tosatti
  2001-09-19  6:21 ` Andrew Morton
  2001-09-20  5:58 ` Andrew Morton
From: Marcelo Tosatti @ 2001-09-18 19:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, akpm



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

(testing now)

> Linus, can you merge this patch in the next pre-patch? Marcelo and
> Andrew, can you test if this fixes your problems properly or if we need
> further work on it for the oom problem?
> 
> Only in 2.4.10pre11aa1: 00_vm-aa-2
> 
> 	description of the patch:
> 
> 	-       fixed a race condition in rw_swap_page path: if we need
> 	        to wait synchronously on the page we must hold a reference
> 	        on the page or it could be freed by the VM under us
> 	        after it's been unlocked at I/O completion time (see
> 	        the page_io.c changes)
> 	-       don't hide anything (see the new parameter "this_max_scan" to
> 	        shrink_cache)
> 	-       don't skip work on the ptes but just don't stop until we
> 	        unmapped the "interesting" pages from the right classzone

I really think we should reintroduce the idea of "enough" inactive/free
pages, Andrea.

This way we avoid starving the page reclaiming work on only one zone when
we have a small zone under extreme pressure, while at the same time we
avoid breaking writeout clustering and the "fairness" of pte scanning.

Do you remember the "zone_inactive_plenty()/zone_free_plenty()" stuff
which we had back in pre10?
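
From memory the shape was something like this (only a sketch: the
thresholds and the per-zone inactive counter are my approximations, not
the exact pre10 code):

	static inline int zone_free_plenty(zone_t * zone)
	{
		/* well above the high watermark: no shortage in this zone */
		return zone->free_pages > zone->pages_high * 2;
	}

	static inline int zone_inactive_plenty(zone_t * zone)
	{
		if (!zone->size)
			return 0;
		/* a healthy share of the zone is already inactive */
		return zone->inactive_pages > zone->size / 3;
	}

With checks like these the scanning code can skip pages from zones that
already have plenty of free/inactive pages, instead of punishing every
zone on behalf of one small zone under pressure.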



* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 19:29 ` Marcelo Tosatti
@ 2001-09-18 19:39   ` Marcelo Tosatti
  2001-09-18 21:11     ` Andrea Arcangeli
From: Marcelo Tosatti @ 2001-09-18 19:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, akpm



On Tue, 18 Sep 2001, Marcelo Tosatti wrote:

> 
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> (testing now)

Well, unfortunately:

Sep 18 17:59:01 matrix PAM_pwdb[842]: (sshd) session opened for user
marcelo by (uid=0)
Sep 18 17:59:27 matrix kernel: __alloc_pages: 0-order allocation failed
(gfp=0x20/0) from c012c5fe
Sep 18 17:59:28 matrix sshd[860]: Accepted password for marcelo from
10.0.17.22 port 1020
Sep 18 17:59:29 matrix PAM_pwdb[860]: (sshd) session opened for user
marcelo by (uid=0)
Sep 18 17:59:42 matrix sshd[873]: Accepted password for marcelo from
10.0.17.22 port 1021
Sep 18 17:59:43 matrix PAM_pwdb[873]: (sshd) session opened for user
marcelo by (uid=0)
Sep 18 17:59:48 matrix PAM_pwdb[875]: (su) session opened for user root by
marcelo(uid=719)
Sep 18 17:59:55 matrix kernel: __alloc_pages: 0-order allocation failed
(gfp=0x20/0) from c012c5fe


(2GB RAM, 4way, 4 setiathome's, 2 fillmem 1GB each)

I really think we need the "fail: loop again: try_to_free_pages()" logic,
Andrea: If there is not enough memory we HAVE to block in the page
reclaiming work.
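
Something of this shape in __alloc_pages() is what I mean (only a
sketch, not a tested patch):

	for (;;) {
		page = rmqueue(z, order);
		if (page)
			return page;
		if (!(gfp_mask & __GFP_WAIT))
			break;		/* GFP_ATOMIC and friends must not block */
		if (!try_to_free_pages(classzone, gfp_mask, order))
			break;		/* nothing freeable left: truly oom */
		/* we made progress: loop and retry the freelists */
	}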

I'll try to work on some logic like that in 1h or so...



* 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
@ 2001-09-18 20:43 Andrea Arcangeli
  2001-09-18 19:29 ` Marcelo Tosatti
                   ` (2 more replies)
From: Andrea Arcangeli @ 2001-09-18 20:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, marcelo, akpm


Linus, can you merge this patch in the next pre-patch? Marcelo and
Andrew, can you test if this fixes your problems properly or if we need
further work on it for the oom problem?

Only in 2.4.10pre11aa1: 00_vm-aa-2

	description of the patch:

	-       fixed a race condition in rw_swap_page path: if we need
	        to wait synchronously on the page we must hold a reference
	        on the page or it could be freed by the VM under us
	        after it's been unlocked at I/O completion time (see
	        the page_io.c changes)
	-       don't hide anything (see the new parameter "this_max_scan" to
	        shrink_cache)
	-       don't skip work on the ptes but just don't stop until we
	        unmapped the "interesting" pages from the right classzone
	-       set the task to runnable to avoid lockups with copy_users
	        run within a wait_event interface with the task not runnable
	-       make sure not to pollute the active cache with referenced
	        swapcache
	-       allow deep allocations in case we succeed the balancing but we
	        haven't queued pages in the local freelist and kswapd+GFP_ATOMIC
	        put the number of free pages in the classzone below the atomic
	        watermark for legitimate reasons (for the record: the atomic
	        watermark is min/2)
	-       block kswapd during oom for 5 sec to allow more graceful
	        task killing
	-       don't fail memory_balancing unless we also failed in making
	        pages freeable (if we made pages freeable and we couldn't
	        find them it means kswapd freed them for us)
	-       refill the inactive list regularly to not leave things hanging
		forever in the active list
	-       don't max_scan all over the active list, just stay at
	        MAX_PRIORITY to better preserve the working set during
	        heavy vm load
	-       make deactivate_page unset the referenced bit so the
	        page can really be freed, and the other way around for
	        activate_page so the page stays there for a longer time
	-       initialize need_balance to zero at boot
	-	skip over physically referenced pages queued in the lru lists
	 	in swap_out (wait for the physical aging to finish before
	 	kicking in freeing those pages)
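
	The first item is the stability fix and is worth spelling out
	as a pattern (just a sketch here, the real change is the
	page_io.c hunk below): a synchronous waiter must pin the page
	across the wait, because I/O completion unlocks the page and
	the VM is free to drop an unlocked page it finds on the lru:

		page_cache_get(page);		/* pin across the wait */
		brw_page(rw, page, dev, zones, block_size);
		wait_on_page(page);		/* unlocked at I/O completion */
		/* only the extra reference makes the page still safe here */
		page_cache_release(page);	/* now the VM may free it */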

	results of the patch:

	1)	better stability thanks to the race fix ;)
	2)      tasks should not be killed by mistake unless truly oom (oom
	        handling seems sane too)
	3)      swapout behaviour should be much better
	4)      the regular refill of the inactive list should provide better
		performance in I/O loads. In fact even dbench runs
		faster (but the patch wasn't really developed to improve dbench).

andrea@laser:/mnt > for i in  1 2 3 4 ; do dbench 40; done
40 clients started
..............................................................................................................................................................................................................................................................................................................................+..............+........................................+.................+................+....+.....+................................+......+.............................................................................................................................................+.................................................................................................................................................................................................................+...............................+.................................................................+...+.....................................+.+.+.....................................+.....+.........................+..+....+...............................+................................................+.........................................................+..........+..............................................................+.......+.............................+......+...............................+....+..............................................................................................................+...............................................+.........................++++++****************************************
Throughput 24.2406 MB/sec (NB=30.3008 MB/sec  242.406 MBit/sec)
40 clients started
..........................................................................................................................................................+......................................................................+....+.......................+.....................................................................................................................................................................................................................................+.+...................................+........+.............................+..+.......+........................................................................................................+................+...................+.......................................................................................................+.................................................+..............................+...............................+....+....................................................................+...............................................................+................................................+.............................................+.....................................................................................+...................+......+................+......+..........................................................................................................+.........................+.........+++..+........................................+....+...+.................+++****************************************
Throughput 24.2276 MB/sec (NB=30.2845 MB/sec  242.276 MBit/sec)
40 clients started
...........................................................................................................................+...............................+....................................................+...................................................................................................................................................................................................................................+.............+....+.....+.......+......................+................................+...+........................................................................+...........................................................................................................................................................................+.................+................+..................................................++...........................................................................................................................................................+...............................................................................+..............................++..+..............+..........+.........................+.+...........+.............+...+...........................................................+....+.....................................................................................................+.....+...............................................................................+...............................+............+.++++****************************************
Throughput 24.2455 MB/sec (NB=30.3068 MB/sec  242.455 MBit/sec)
[..]

Without this incremental patch I was constantly getting exactly
20mbyte/sec; now I get constantly 24mbyte/sec in all the runs with very
very very small deviation ;) (but of course dbench isn't the only
workload that we must get right, the below changes weren't really made
with dbench in mind). Btw, 2.4.10pre10 with the old vm was returning
9mbyte/sec.

So here is the vm patch (aka 00_vm-aa-2) ready for merging into
mainline:

diff -urN vm-ref/mm/page_alloc.c vm/mm/page_alloc.c
--- vm-ref/mm/page_alloc.c	Tue Sep 18 21:46:07 2001
+++ vm/mm/page_alloc.c	Tue Sep 18 15:39:50 2001
@@ -255,7 +255,7 @@
 
 		local_pages = &current->local_pages;
 
-		if (__freed) {
+		if (__builtin_expect(__freed, 1)) {
 			/* pick from the last inserted so we're lifo */
 			entry = local_pages->next;
 			do {
@@ -380,13 +380,15 @@
 			if (!z)
 				break;
 
-			if (zone_free_pages(z, order) > (gfp_mask & __GFP_HIGH ? z->pages_min / 2 : z->pages_min)) {
-				page = rmqueue(z, order);
-				if (page)
-					return page;
-			}
+			page = rmqueue(z, order);
+			if (page)
+				return page;
 		}
 	} else {
+		/*
+		 * Check that no other task has been killed meanwhile;
+		 * in such a case we can succeed the allocation.
+		 */
 		for (;;) {
 			zone_t *z = *(zone++);
 			if (!z)
@@ -733,6 +735,7 @@
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->need_balance = 0;
 		if (!size)
 			continue;
 
diff -urN vm-ref/mm/page_io.c vm/mm/page_io.c
--- vm-ref/mm/page_io.c	Tue Sep 18 21:46:07 2001
+++ vm/mm/page_io.c	Tue Sep 18 18:04:21 2001
@@ -78,7 +78,15 @@
  	if (!wait) {
  		SetPageDecrAfter(page);
  		atomic_inc(&nr_async_pages);
- 	}
+ 	} else
+		/*
+		 * Must hold a reference until after wait_on_page()
+		 * has returned, or the page could be freed by the VM after
+		 * I/O is completed and the page has been unlocked.
+		 * The asynchronous path is fine since it never
+		 * references the page after brw_page().
+		 */
+		page_cache_get(page);
 
 	/* block_size == PAGE_SIZE/zones_used */
  	brw_page(rw, page, dev, zones, block_size);
@@ -94,6 +102,7 @@
 	/* This shouldn't happen, but check to be sure. */
 	if (page_count(page) == 0)
 		printk(KERN_ERR "rw_swap_page: page unused while waiting!\n");
+	page_cache_release(page);
 
 	return 1;
 }
diff -urN vm-ref/mm/swap.c vm/mm/swap.c
--- vm-ref/mm/swap.c	Tue Sep 18 21:46:07 2001
+++ vm/mm/swap.c	Tue Sep 18 21:23:36 2001
@@ -54,6 +54,7 @@
 		del_page_from_active_list(page);
 		add_page_to_inactive_list(page);
 	}
+	ClearPageReferenced(page);
 }	
 
 void deactivate_page(struct page * page)
@@ -72,6 +73,7 @@
 		del_page_from_inactive_list(page);
 		add_page_to_active_list(page);
 	}
+	SetPageReferenced(page);
 }
 
 void activate_page(struct page * page)
diff -urN vm-ref/mm/vmscan.c vm/mm/vmscan.c
--- vm-ref/mm/vmscan.c	Tue Sep 18 21:46:07 2001
+++ vm/mm/vmscan.c	Tue Sep 18 21:23:49 2001
@@ -47,20 +47,24 @@
 {
 	pte_t pte;
 	swp_entry_t entry;
+	int right_classzone;
 
 	/* Don't look at this pte if it's been accessed recently. */
 	if (ptep_test_and_clear_young(page_table)) {
 		flush_tlb_page(vma, address);
-		SetPageReferenced(page);
+		activate_page(page);
 		return 0;
 	}
-
-	if (!memclass(page->zone, classzone))
+	if ((PageInactive(page) || PageActive(page)) && PageReferenced(page))
 		return 0;
 
 	if (TryLockPage(page))
 		return 0;
 
+	right_classzone = 1;
+	if (!memclass(page->zone, classzone))
+		right_classzone = 0;
+
 	/* From this point on, the odds are that we're going to
 	 * nuke this pte, so read and clear the pte.  This hook
 	 * is needed on CPUs which update the accessed and dirty
@@ -90,7 +94,7 @@
 			if (freeable)
 				deactivate_page(page);
 			page_cache_release(page);
-			return freeable;
+			return freeable & right_classzone;
 		}
 	}
 
@@ -293,8 +297,10 @@
 	/* Then, look at the other mm's */
 	counter = mmlist_nr / priority;
 	do {
-		if (current->need_resched)
+		if (__builtin_expect(current->need_resched, 0)) {
+			__set_current_state(TASK_RUNNING);
 			schedule();
+		}
 
 		spin_lock(&mmlist_lock);
 		mm = swap_mm;
@@ -324,20 +330,19 @@
 	return 0;
 }
 
-static int FASTCALL(shrink_cache(struct list_head * lru, int * max_scan, int nr_pages, zone_t * classzone, unsigned int gfp_mask));
-static int shrink_cache(struct list_head * lru, int * max_scan, int nr_pages, zone_t * classzone, unsigned int gfp_mask)
+static int FASTCALL(shrink_cache(struct list_head * lru, int * max_scan, int this_max_scan, int nr_pages, zone_t * classzone, unsigned int gfp_mask));
+static int shrink_cache(struct list_head * lru, int * max_scan, int this_max_scan, int nr_pages, zone_t * classzone, unsigned int gfp_mask)
 {
-	LIST_HEAD(active_local_lru);
-	LIST_HEAD(inactive_local_lru);
 	struct list_head * entry;
 	int __max_scan = *max_scan;
 
 	spin_lock(&pagemap_lru_lock);
-	while (__max_scan && (entry = lru->prev) != lru) {
+	while (__max_scan && this_max_scan && (entry = lru->prev) != lru) {
 		struct page * page;
 
 		if (__builtin_expect(current->need_resched, 0)) {
 			spin_unlock(&pagemap_lru_lock);
+			__set_current_state(TASK_RUNNING);
 			schedule();
 			spin_lock(&pagemap_lru_lock);
 			continue;
@@ -348,21 +353,33 @@
 		if (__builtin_expect(!PageInactive(page) && !PageActive(page), 0))
 			BUG();
 
+		this_max_scan--;
+
 		if (PageTestandClearReferenced(page)) {
-			if (PageInactive(page)) {
-				del_page_from_inactive_list(page);
-				add_page_to_active_list(page);
-			} else if (PageActive(page)) {
+			if (!PageSwapCache(page)) {
+				if (PageInactive(page)) {
+					del_page_from_inactive_list(page);
+					add_page_to_active_list(page);
+				} else if (PageActive(page)) {
+					list_del(entry);
+					list_add(entry, &active_list);
+				} else
+					BUG();
+			} else {
 				list_del(entry);
-				list_add(entry, &active_list);
-			} else
-				BUG();
+				list_add(entry, lru);
+			}
 			continue;
 		}
 
-		deactivate_page_nolock(page);
-		list_del(entry);
-		list_add_tail(entry, &inactive_local_lru);
+		if (PageInactive(page)) {
+			/* just roll it over, no need to update any stat */
+			list_del(entry);
+			list_add(entry, &inactive_list);
+		} else {
+			del_page_from_active_list(page);
+			add_page_to_inactive_list(page);
+		}
 
 		if (__builtin_expect(!memclass(page->zone, classzone), 0))
 			continue;
@@ -372,8 +389,6 @@
 		/* Racy check to avoid trylocking when not worthwhile */
 		if (!page->buffers && page_count(page) != 1) {
 			activate_page_nolock(page);
-			list_del(entry);
-			list_add_tail(entry, &active_local_lru);
 			continue;
 		}
 
@@ -497,29 +512,48 @@
 			continue;
 		break;
 	}
-
-	list_splice(&inactive_local_lru, &inactive_list);
-	list_splice(&active_local_lru, &active_list);
 	spin_unlock(&pagemap_lru_lock);
 
 	*max_scan = __max_scan;
 	return nr_pages;
 }
 
+static void refill_inactive(int nr_pages)
+{
+	struct list_head * entry;
+
+	spin_lock(&pagemap_lru_lock);
+	entry = active_list.prev;
+	while (nr_pages-- && entry != &active_list) {
+		struct page * page;
+
+		page = list_entry(entry, struct page, lru);
+		entry = entry->prev;
+
+		if (!page->buffers && page_count(page) != 1)
+			continue;
+
+		del_page_from_active_list(page);
+		add_page_to_inactive_list(page);
+	}
+	spin_unlock(&pagemap_lru_lock);
+}
+
 static int FASTCALL(shrink_caches(int priority, zone_t * classzone, unsigned int gfp_mask, int nr_pages));
 static int shrink_caches(int priority, zone_t * classzone, unsigned int gfp_mask, int nr_pages)
 {
-	int max_scan = (nr_inactive_pages + nr_active_pages / priority) / priority;
+	int max_scan = (nr_inactive_pages + nr_active_pages / DEF_PRIORITY) / priority;
 
 	nr_pages -= kmem_cache_reap(gfp_mask);
 	if (nr_pages <= 0)
 		return 0;
 
-	nr_pages = shrink_cache(&inactive_list, &max_scan, nr_pages, classzone, gfp_mask);
+	refill_inactive(nr_pages / 2);
+	nr_pages = shrink_cache(&inactive_list, &max_scan, nr_inactive_pages, nr_pages, classzone, gfp_mask);
 	if (nr_pages <= 0)
 		return 0;
 
-	nr_pages = shrink_cache(&active_list, &max_scan, nr_pages, classzone, gfp_mask);
+	nr_pages = shrink_cache(&active_list, &max_scan, nr_active_pages / DEF_PRIORITY, nr_pages, classzone, gfp_mask);
 	if (nr_pages <= 0)
 		return 0;
 
@@ -532,6 +566,7 @@
 int try_to_free_pages(zone_t * classzone, unsigned int gfp_mask, unsigned int order)
 {
 	int priority = DEF_PRIORITY;
+	int ret = 0;
 
 	do {
 		int nr_pages = SWAP_CLUSTER_MAX;
@@ -539,10 +574,10 @@
 		if (nr_pages <= 0)
 			return 1;
 
-		swap_out(priority, classzone, gfp_mask, SWAP_CLUSTER_MAX);
+		ret |= swap_out(priority, classzone, gfp_mask, SWAP_CLUSTER_MAX << 2);
 	} while (--priority);
 
-	return 0;
+	return ret;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
@@ -567,12 +602,14 @@
 
 	for (i = pgdat->nr_zones-1; i >= 0; i--) {
 		zone = pgdat->node_zones + i;
-		if (current->need_resched)
+		if (__builtin_expect(current->need_resched, 0))
 			schedule();
 		if (!zone->need_balance)
 			continue;
 		if (!try_to_free_pages(zone, GFP_KSWAPD, 0)) {
 			zone->need_balance = 0;
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(HZ*5);
 			continue;
 		}
 		if (check_classzone_need_balance(zone))


as usual any feedback is welcome, thanks! (if there is disagreement on
some part of the patch, please at least merge urgently the page_io.c
part that is a stability bugfix)

Andrea

PS. as usual I'm also releasing a 2.4.10pre11aa1 with the few remaining
    bits plus the above patch included.


* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 19:39   ` Marcelo Tosatti
@ 2001-09-18 21:11     ` Andrea Arcangeli
  2001-09-18 21:26       ` Andrea Arcangeli
From: Andrea Arcangeli @ 2001-09-18 21:11 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-kernel, akpm

On Tue, Sep 18, 2001 at 04:39:34PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Tue, 18 Sep 2001, Marcelo Tosatti wrote:
> 
> > 
> > 
> > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > 
> > (testing now)
> 
> Well, unfortunately:

I'm not so pessimistic: I was pretty sure it would work out.

> 
> Sep 18 17:59:01 matrix PAM_pwdb[842]: (sshd) session opened for user
> marcelo by (uid=0)
> Sep 18 17:59:27 matrix kernel: __alloc_pages: 0-order allocation failed
> (gfp=0x20/0) from c012c5fe
> Sep 18 17:59:28 matrix sshd[860]: Accepted password for marcelo from
> 10.0.17.22 port 1020
> Sep 18 17:59:29 matrix PAM_pwdb[860]: (sshd) session opened for user
> marcelo by (uid=0)
> Sep 18 17:59:42 matrix sshd[873]: Accepted password for marcelo from
> 10.0.17.22 port 1021
> Sep 18 17:59:43 matrix PAM_pwdb[873]: (sshd) session opened for user
> marcelo by (uid=0)
> Sep 18 17:59:48 matrix PAM_pwdb[875]: (su) session opened for user root by
> marcelo(uid=719)
> Sep 18 17:59:55 matrix kernel: __alloc_pages: 0-order allocation failed
> (gfp=0x20/0) from c012c5fe

That's just a GREAT log: those are GFP_ATOMIC allocations, all is _more_
than _perfect_ in the above log.

> I really think we need the "fail: loop again: try_to_free_pages()" logic,

What do you want to do with GFP_ATOMIC? We cannot do anything at the
moment unless you want to make an atomic list that we can free from
atomic context. I thought about something like that, but I guess we can
do it after 2.4.11 (possibly in 2.4.x, it doesn't sound a showstopper).
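
Just to give the idea, something like this (pure sketch, every name
below is invented):

	static LIST_HEAD(atomic_free_list);
	static spinlock_t atomic_free_lock = SPIN_LOCK_UNLOCKED;

	/* strip a parked freeable page without ever sleeping */
	static struct page * grab_atomic_page(void)
	{
		struct page * page = NULL;
		unsigned long flags;

		spin_lock_irqsave(&atomic_free_lock, flags);
		if (!list_empty(&atomic_free_list)) {
			page = list_entry(atomic_free_list.next,
					  struct page, list);
			list_del(&page->list);
		}
		spin_unlock_irqrestore(&atomic_free_lock, flags);
		return page;
	}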

> Andrea: If there is not enough memory we HAVE to block in the page
> reclaiming work.

we obviously cannot block with GFP_ATOMIC.

Andrea


* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 21:11     ` Andrea Arcangeli
@ 2001-09-18 21:26       ` Andrea Arcangeli
From: Andrea Arcangeli @ 2001-09-18 21:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-kernel, akpm

On Tue, Sep 18, 2001 at 11:11:50PM +0200, Andrea Arcangeli wrote:
> do it after 2.4.11 (possibly in 2.4.x, it doesn't sound a showstopper).
				  ^^^ typo, I meant 2.5 :), sorry

Andrea


* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 20:43 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing Andrea Arcangeli
  2001-09-18 19:29 ` Marcelo Tosatti
@ 2001-09-19  6:21 ` Andrew Morton
  2001-09-20  5:58 ` Andrew Morton
From: Andrew Morton @ 2001-09-19  6:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, marcelo

Andrea Arcangeli wrote:
> 
> Linus, can you merge this patch in the next pre-patch? Marcelo and
> Andrew, can you test if this fixes your problems properly or if we need
> further work on it for the oom problem?
> 

With the same workload as yesterday (combination of anon load and
shared mappings):

- throughput is more than twice that of 2.4.9-ac10.  I checked
  this several times.  A huge difference.

- there were no zero-order allocation failures, and no oom-killings.

- A few minutes into testing I hit a BUG() in shrink_cache().  Unfortunately
  the BUG() macro doesn't report file-n-line in recent kernels.  I set up
  kgdb and of course it ran happily for an hour.  Typical.

- Ah-hah.  I wound up the VM load a bit and hit the box with 32,000
  pings/sec.  There weren't any allocation failures at all, although
  the box wedged totally once.  I suspect a networking problem in this
  case.

  With this workload we again hit the BUG() in shrink_cache(), this time
  under gdb:

Program received signal SIGTRAP, Trace/breakpoint trap.
shrink_cache (lru=0xc0353de4, max_scan=0xdfff1f74, this_max_scan=0x301, nr_pages=0x1a, 
    classzone=0xc02e2588, gfp_mask=0x1c0) at vmscan.c:475
475                             BUG();
(gdb) l 
470                                     continue;
471                             }
472                     }
473     
474                     if (__builtin_expect(!page->mapping, 0))
475                             BUG();
476     
477                     if (__builtin_expect(!spin_trylock(&pagecache_lock), 0)) {
478                             /* we hold the page lock so the page cannot go away from under us */
479                             spin_unlock(&pagemap_lru_lock);
(gdb) bt
#0  shrink_cache (lru=0xc0353de4, max_scan=0xdfff1f74, this_max_scan=0x301, nr_pages=0x1a, 
    classzone=0xc02e2588, gfp_mask=0x1c0) at vmscan.c:475
#1  0xc012edcf in shrink_caches (priority=0x6, classzone=0xc02e2588, gfp_mask=0x1c0, nr_pages=0x20)
    at vmscan.c:551
#2  0xc012ee4f in try_to_free_pages (classzone=0xc02e2588, gfp_mask=0x1c0, order=0xc132fadc)
    at vmscan.c:572
#3  0xc012ef23 in kswapd_balance_pgdat (pgdat=0xc02e24e0) at vmscan.c:608
#4  0xc012ef9e in kswapd_balance () at vmscan.c:632
#5  0xc012f0cd in kswapd (unused=0x0) at vmscan.c:721
#6  0xc01055ab in kernel_thread (fn=0xd6c3b38c, arg=0xd6c3b390, flags=0xd6c3b394) at process.c:444
#7  0xd6c3b388 in ?? ()

(gdb) p *page
$1 = {list = {next = 0x0, prev = 0x0}, mapping = 0x0, index = 0x8210, next_hash = 0x0, count = {
    counter = 0x1}, flags = 0x89, lru = {next = 0xc11eef5c, prev = 0xc0353de4}, wait = {lock = {
      lock = 0x1}, task_list = {next = 0xc132fae8, prev = 0xc132fae8}}, pprev_hash = 0x0, 
  buffers = 0x0, virtual = 0xccbeb000, zone = 0xc02e2588}

So the page is inactive, locked, uptodate, and has no mapping.

(gdb) p *page->zone
$2 = {lock = {lock = 0x1}, free_pages = 0x1d7, pages_min = 0xff, pages_low = 0x1fe, 
  pages_high = 0x2fd, need_balance = 0x1, free_area = {{free_list = {next = 0xc1719ac0, 
        prev = 0xc119e200}, map = 0xc18002c0}, {free_list = {next = 0xc119e400, 
        prev = 0xc115db80}, map = 0xc18021c0}, {free_list = {next = 0xc11f1e00, 
        prev = 0xc119df00}, map = 0xc1803140}, {free_list = {next = 0xc115dc00, 
        prev = 0xc119e000}, map = 0xc1803900}, {free_list = {next = 0xc119d800, 
        prev = 0xc115b800}, map = 0xc1803ce0}, {free_list = {next = 0xc115b000, 
        prev = 0xc115b000}, map = 0xc1803ee0}, {free_list = {next = 0xc115a000, 
        prev = 0xc115a000}, map = 0xc1803fe0}, {free_list = {next = 0xc1158000, 
        prev = 0xc1158000}, map = 0xc1804060}, {free_list = {next = 0xc02e2600, 
        prev = 0xc02e2600}, map = 0xc18040a0}, {free_list = {next = 0xc02e260c, 
        prev = 0xc02e260c}, map = 0x0}}, zone_pgdat = 0xc02e24e0, zone_mem_map = 0xc1040000, 
  zone_start_paddr = 0x1000000, zone_start_mapnr = 0x1000, name = 0xc029c508 "Normal", 
  size = 0x1f000}

(gdb) p page->zone
$5 = (struct zone_struct *) 0xc02e2588
(gdb) p classzone
$6 = (zone_t *) 0xc02e2588

Poking around a bit, there's another thread on another CPU spinning
on pagemap_lru_lock:

(gdb) thread 51
[Switching to thread 51 (thread 986)]#0  0xc028cfd4 in stext_lock () at af_packet.c:1870
1870    }
(gdb) bt
#0  0xc028cfd4 in stext_lock () at af_packet.c:1870
#1  0x00000020 in netlink_proto_exit () at clgenfb.c:2479
#2  0xc012edcf in shrink_caches (priority=Cannot access memory at address 0x20
) at vmscan.c:551
#3  0xc012ee4f in try_to_free_pages (classzone=0xc02e2588, gfp_mask=0x1d2, order=0xc132fac0)
    at vmscan.c:572
#4  0xc012f83a in balance_classzone (classzone=0xc02e2588, gfp_mask=0x1d2, order=0x0, 
    freed=0xd89f7e6c) at page_alloc.c:245
#5  0xc012fad1 in __alloc_pages (gfp_mask=0x1d2, order=0x0, zonelist=0xc02e26f8)
    at page_alloc.c:370
#6  0xc012f7d7 in _alloc_pages (gfp_mask=0xc02e2704, order=0x0) at page_alloc.c:226
#7  0xc0125522 in do_anonymous_page (mm=0xc73c3f20, vma=0xdfa2e5c0, page_table=0xd7983428, 
    write_access=0x1, addr=0x4b10a000) at /usr/src/linux-akpm/include/linux/mm.h:389
#8  0xc01255cb in do_no_page (mm=0xc73c3f20, vma=0xdfa2e5c0, address=0x4b10a000, write_access=0x1, 
    page_table=0xd7983428) at memory.c:1238
#9  0xc01256e4 in handle_mm_fault (mm=0xc73c3f20, vma=0xdfa2e5c0, address=0x4b10a000, 
    write_access=0x1) at memory.c:1318
#10 0xc0113f42 in do_page_fault (regs=0xd89f7fc4, error_code=0x6) at fault.c:267
#11 0xc0106f0c in error_code () at af_packet.c:1879

gdb screwed up the call trace - this may be due to FASTCALL...

But we're spinning at line 445 here:

435                             if (try_to_free_buffers(page, gfp_mask)) {
436                                     if (!page->mapping) {
437                                             UnlockPage(page);
438     
439                                             /*
440                                              * Account we successfully freed a page
441                                              * of buffer cache.
(gdb) 
442                                              */
443                                             atomic_dec(&buffermem_pages);
444     
445                                             spin_lock(&pagemap_lru_lock);
446                                             __lru_cache_del(page);
447     
448                                             /* effectively free the page here */

Isn't this the race?  We've just stripped the buffers from an
anon page (presumably after swapin), but it's still on the inactive
list.
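
My reading of the interleaving, sketched from the backtraces above:

	CPU0 (balance_classzone path)     CPU1 (kswapd, shrink_cache)

	try_to_free_buffers() succeeds
	page->mapping is NULL (anon page)
	UnlockPage(page)
	                                  takes pagemap_lru_lock, finds the
	                                  page still on the inactive list
	                                  TryLockPage() now succeeds
	                                  sees !page->mapping -> BUG()
	spin_lock(&pagemap_lru_lock)
	__lru_cache_del(page)  <- too late, CPU1 already saw the page
	                          unlocked and mappingless on the lru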


* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-18 20:43 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing Andrea Arcangeli
  2001-09-18 19:29 ` Marcelo Tosatti
  2001-09-19  6:21 ` Andrew Morton
@ 2001-09-20  5:58 ` Andrew Morton
  2001-09-20  6:11   ` Andrea Arcangeli
From: Andrew Morton @ 2001-09-20  5:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, marcelo


With this patch against -pre12 the BUG()s in shrink_cache()
go away.

--- linux-2.4.10-pre12/mm/vmscan.c	Wed Sep 19 20:47:21 2001
+++ linux-akpm/mm/vmscan.c	Wed Sep 19 22:49:48 2001
@@ -435,15 +435,20 @@ static int shrink_cache(struct list_head
 
 			if (try_to_free_buffers(page, gfp_mask)) {
 				if (!page->mapping) {
-					UnlockPage(page);
-
 					/*
 					 * Account we successfully freed a page
 					 * of buffer cache.
 					 */
 					atomic_dec(&buffermem_pages);
 
+					/*
+					 * We must not allow an anon page
+					 * with no buffers to be visible on
+					 * the LRU, so we unlock the page after
+					 * taking the lru lock
+					 */
 					spin_lock(&pagemap_lru_lock);
+					UnlockPage(page);
 					__lru_cache_del(page);
 
 					/* effectively free the page here */


With this patch applied I've had three total system lockups with
the usual workload.  No diagnostics, no interrupts, NMI watchdog
doesn't catch it.  Nice.   This is not related to networking; I
wasn't able to do much network stress testing because the
darn APIC bug kept biting me.  Grumble.

I'll try -pre9.


* Re: 2.4.10pre11 vm rewrite fixes for mainline inclusion and testing
  2001-09-20  5:58 ` Andrew Morton
@ 2001-09-20  6:11   ` Andrea Arcangeli
From: Andrea Arcangeli @ 2001-09-20  6:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, linux-kernel, marcelo

On Wed, Sep 19, 2001 at 10:58:15PM -0700, Andrew Morton wrote:
> 
> With this patch against -pre12 the BUG()s in shrink_cache()
> go away.
> 
> --- linux-2.4.10-pre12/mm/vmscan.c	Wed Sep 19 20:47:21 2001
> +++ linux-akpm/mm/vmscan.c	Wed Sep 19 22:49:48 2001
> @@ -435,15 +435,20 @@ static int shrink_cache(struct list_head
>  
>  			if (try_to_free_buffers(page, gfp_mask)) {
>  				if (!page->mapping) {
> -					UnlockPage(page);
> -
>  					/*
>  					 * Account we successfully freed a page
>  					 * of buffer cache.
>  					 */
>  					atomic_dec(&buffermem_pages);
>  
> +					/*
> +					 * We must not allow an anon page
> +					 * with no buffers to be visible on
> +					 * the LRU, so we unlock the page after
> +					 * taking the lru lock
> +					 */
>  					spin_lock(&pagemap_lru_lock);
> +					UnlockPage(page);
>  					__lru_cache_del(page);
>  
>  					/* effectively free the page here */

Correct, thanks! Linus, please apply.

Andrea

