linux-kernel.vger.kernel.org archive mirror
* page_launder() on 2.4.9/10 issue
@ 2001-08-28  3:36 Marcelo Tosatti
  2001-08-28 18:07 ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Marcelo Tosatti @ 2001-08-28  3:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: lkml


Linus,

I just noticed that the new page_launder() logic has a big bad problem.

The window for page_launder() to find and free previously written-out
pages is the number of writable pages on the inactive dirty list.

We'll keep writing out dirty pages (as long as they are available) even if
we have a ton of cleaned pages: it's just that we don't see them because we
scan only a small piece of the inactive dirty list each time.

That obviously did not happen with the full scan behaviour.

With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
clean list from the IO completion handler; please don't consider that for
2.4 :)) this would not happen either.
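
A minimal sketch of that asynchronous movement, for illustration only
(the handler name is made up; the list helpers are the 2.4-style ones,
and, as the smiley hints, pagemap_lru_lock would first have to become
IRQ-safe before anything like this could run from IO completion):

static void page_written_back(struct page *page)
{
	unsigned long flags;

	/* assumes an IRQ-safe LRU lock, which 2.4 does not have */
	spin_lock_irqsave(&pagemap_lru_lock, flags);
	if (PageInactiveDirty(page) && !PageDirty(page)) {
		del_page_from_inactive_dirty_list(page);
		add_page_to_inactive_clean_list(page);
	}
	spin_unlock_irqrestore(&pagemap_lru_lock, flags);
}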




* Re: page_launder() on 2.4.9/10 issue
  2001-08-28  3:36 page_launder() on 2.4.9/10 issue Marcelo Tosatti
@ 2001-08-28 18:07 ` Daniel Phillips
  2001-08-28 18:17   ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-08-28 18:07 UTC (permalink / raw)
  To: Marcelo Tosatti, Linus Torvalds; +Cc: lkml

On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> Linus,
> 
> I just noticed that the new page_launder() logic has a big bad problem.
> 
> The window for page_launder() to find and free previously written-out
> pages is the number of writable pages on the inactive dirty list.
> 
> We'll keep writing out dirty pages (as long as they are available) even if
> we have a ton of cleaned pages: it's just that we don't see them because we
> scan only a small piece of the inactive dirty list each time.
> 
> That obviously did not happen with the full scan behaviour.
> 
> With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> clean list from the IO completion handler; please don't consider that for
> 2.4 :)) this would not happen either.

Or we could have parallel lists for dirty and clean.

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 18:07 ` Daniel Phillips
@ 2001-08-28 18:17   ` Linus Torvalds
  2001-08-30  1:36     ` Daniel Phillips
  2001-09-03 14:57     ` Marcelo Tosatti
  0 siblings, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2001-08-28 18:17 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Marcelo Tosatti, lkml


On Tue, 28 Aug 2001, Daniel Phillips wrote:
> On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > Linus,
> >
> > I just noticed that the new page_launder() logic has a big bad problem.
> >
> > The window for page_launder() to find and free previously written-out
> > pages is the number of writable pages on the inactive dirty list.

No.

There is no "window". The page_launder() logic is very clear - it will
write out any dirty pages that it finds that are "old".

> > We'll keep writing out dirty pages (as long as they are available) even if
> > we have a ton of cleaned pages: it's just that we don't see them because we
> > scan only a small piece of the inactive dirty list each time.

So? We need to write them out at some point anyway. Isn't it much better
to be graceful about it, and allow the writeout to happen in the
background? The way things _used_ to work, we'd delay the write-out until
we REALLY had to, which is great for dbench, but is really horrible for
any normal load.

Think about it - do you really want the system to actively try to reach
the point where it has no "regular" pages left, and has to start writing
stuff out (and wait for them synchronously) in order to free up memory? I
strongly feel that the old code was really really wrong - it may have been
wonderful for throughput, but it had non-repeatable behaviour, and easily
caused the inactive_dirty list to fill up with dirty pages because it
unfairly penalized clean pages.

You do need to realize that dbench is a really bad benchmark, and should
not be used as a way to tweak the algorithms.

> > That obviously did not happen with the full scan behaviour.

The new code has no difference between "full scan" and "partial scan". It
will do the same thing regardless of whether you scan the whole list, as
it doesn't have any state.

This did NOT happen with the old "launder_loop" state thing, but I think
you agreed that that code was unreliable and flaky, and caused basically
random non-LRU behaviour that depended on subtle effects in (a) who called
it and (b) what the layout of the inactive_dirty list was.


> > With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> > clean list from the IO completion handler; please don't consider that for
> > 2.4 :)) this would not happen either.
>
> Or we could have parallel lists for dirty and clean.

Well, more importantly, do you actually have good reason to believe that
it is wrong to try to write things out asynchronously?

		Linus



* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 18:17   ` Linus Torvalds
@ 2001-08-30  1:36     ` Daniel Phillips
  2001-09-03 14:57     ` Marcelo Tosatti
  1 sibling, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-08-30  1:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, lkml

On August 28, 2001 08:17 pm, Linus Torvalds wrote:
> On Tue, 28 Aug 2001, Daniel Phillips wrote:
> > On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > > Linus,
> > >
> > > I just noticed that the new page_launder() logic has a big bad problem.
> > >
> > > The window for page_launder() to find and free previously written-out
> > > pages is the number of writable pages on the inactive dirty
> > > list.
> 
> No.
> 
> There is no "window". The page_launder() logic is very clear - it will
> write out any dirty pages that it finds that are "old".
> 
> > > We'll keep writing out dirty pages (as long as they are available) even
> > > if we have a ton of cleaned pages: it's just that we don't see them because
> > > we scan only a small piece of the inactive dirty list each time.
> 
> So? We need to write them out at some point anyway. Isn't it much better
> to be graceful about it, and allow the writeout to happen in the
> background? The way things _used_ to work, we'd delay the write-out until
> we REALLY had to, which is great for dbench, but is really horrible for
> any normal load.

I thought about it a lot and I had a really hard time coming up with examples 
where starting writeout early is not the right thing to do.  Even write 
merging takes care of itself because if the system is heavily loaded the 
queue will naturally back up and create all the write merging opportunities 
we need.  Temporary file deletion is hurt by early writeout, yes, but that is 
really something we should be handling at the filesystem level, not the vfs. 
(According to this theory, XFS with its delayed allocation should be a star 
performer on dbench.)

The only case I can see where early writeout is not necessarily the best 
policy is when we have lots of input going on at the same time.  The classic 
example is program startup.  If there are lots of inactive/clean pages we 
want to hold off writeout until the swap-in activity due to program start 
winds down or eats all the inactive/clean pages.

> Think about it - do you really want the system to actively try to reach
> the point where it has no "regular" pages left, and has to start writing
> stuff out (and wait for them synchronously) in order to free up memory? I
> strongly feel that the old code was really really wrong - it may have been
> wonderful for throughput, but it had non-repeatable behaviour, and easily
> caused the inactive_dirty list to fill up with dirty pages because it
> unfairly penalized clean pages.

It was just plain wrong.  We got sucked into the trap of optimizing for
dbench.

> [...]
> > > With asynchronous i_dirty->i_clean movement (moving a cleaned page to
> > > the clean list from the IO completion handler; please don't consider that
> > > for 2.4 :)) this would not happen either.
> >
> > Or we could have parallel lists for dirty and clean.
> 
> Well, more importantly, do you actually have good reason to believe that
> it is wrong to try to write things out asynchronously?

Asynchronous is good, but we don't want to blindly submit every dirty page as 
soon as it arrives on the inactive_dirty list.  This will throw away 
information about the short-term activity of pages, without which we have no 
means of distinguishing between LFU and LRU pages.  It doesn't matter under 
light disk load because... the load is light (duh) but under heavy load it 
does matter.

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 18:17   ` Linus Torvalds
  2001-08-30  1:36     ` Daniel Phillips
@ 2001-09-03 14:57     ` Marcelo Tosatti
  2001-09-04 15:26       ` Jan Harkes
  1 sibling, 1 reply; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-03 14:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, lkml



On Tue, 28 Aug 2001, Linus Torvalds wrote:

> 
> On Tue, 28 Aug 2001, Daniel Phillips wrote:
> > On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > > Linus,
> > >
> > > I just noticed that the new page_launder() logic has a big bad problem.
> > >
> > > The window for page_launder() to find and free previously written-out
> > > pages is the number of writable pages on the inactive dirty list.
> 
> No.
> 
> There is no "window". The page_launder() logic is very clear - it will
> write out any dirty pages that it finds that are "old".

Yes, this is clear. Look above.

> 
> > > We'll keep writing out dirty pages (as long as they are available) even if
> > > we have a ton of cleaned pages: it's just that we don't see them because we
> > > scan only a small piece of the inactive dirty list each time.
> 
> So? We need to write them out at some point anyway. Isn't it much better
> to be graceful about it, and allow the writeout to happen in the
> background? The way things _used_ to work, we'd delay the write-out until
> we REALLY had to, which is great for dbench, but is really horrible for
> any normal load.
> 
> Think about it - do you really want the system to actively try to reach
> the point where it has no "regular" pages left, and has to start writing
> stuff out (and wait for them synchronously) in order to free up memory? 

No, of course not.  You're missing my point.

> I strongly feel that the old code was really really wrong - it may
> have been wonderful for throughput, but it had non-repeatable
> behaviour, and easily caused the inactive_dirty list to fill up with
> dirty pages because it unfairly penalized clean pages.

Agreed. I'm not talking about this specific issue, however.

> You do need to realize that dbench is a really bad benchmark, and should
> not be used as a way to tweak the algorithms.
> 
> > > That obviously did not happen with the full scan behaviour.
> 
> The new code has no difference between "full scan" and "partial scan". It
> will do the same thing regardless of whether you scan the whole list, as
> it doesn't have any state.
> 
> This did NOT happen with the old "launder_loop" state thing, but I think
> you agreed that that code was unreliable and flaky, and caused basically
> random non-LRU behaviour that depended on subtle effects in (a) who called
> it and (b) what the layout of the inactive_dirty list was.

Right. Please read the explanation above and you will understand that I'm
talking about something else. 

> > > With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> > > clean list from the IO completion handler; please don't consider that for
> > > 2.4 :)) this would not happen either.
> >
> > Or we could have parallel lists for dirty and clean.
> 
> Well, more importantly, do you actually have good reason to believe that
> it is wrong to try to write things out asynchronously?

No. It's not wrong to write things out, Linus. That's not my point, however.

What I'm trying to tell you is that cleaned (written) memory should be
freed as soon as it gets cleaned.

Look:

1M shortage
page_launder() writes out 10M of data
Those 10M get written out (cleaned)
page_launder() writes out 10M of data
Those 10M get written out (cleaned)
...

We are going to find the written-out data (which should be freed ASAP,
since it already had enough time to be touched) _too_ late (only when we
loop through the whole inactive dirty list).

Do you see my point?

I already have some code which adds a laundry list -- pages being written
out (by page_launder()) go to the laundry list, and each page_launder()
call will first check for unlocked pages on the laundry list before
doing the usual page_launder() stuff.

As far as I've seen, this has improved things _a lot_ exactly due to the
problem I explained. I'll post the code as soon as I have some time to
clean it.
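
In outline, such a laundry list could look like the following rough
sketch (hypothetical names, not Marcelo's actual code; it reuses the
2.4-style inactive_clean helpers and the page->lru linkage):

static LIST_HEAD(laundry_list);

/* Called at the start of page_launder(): reap pages whose writeout has
 * completed before queueing any new IO.  For simplicity this stops at
 * the first page still under IO; writeout completes roughly in order. */
static int reclaim_laundry(void)
{
	struct page *page;
	int freed = 0;

	spin_lock(&pagemap_lru_lock);
	while (!list_empty(&laundry_list)) {
		page = list_entry(laundry_list.next, struct page, lru);
		if (PageLocked(page))		/* IO still in flight */
			break;
		/* cleaned and unlocked: make it freeable right away */
		list_del(&page->lru);
		add_page_to_inactive_clean_list(page);
		freed++;
	}
	spin_unlock(&pagemap_lru_lock);
	return freed;
}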



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 15:26       ` Jan Harkes
@ 2001-09-04 15:24         ` Marcelo Tosatti
  2001-09-04 17:14           ` Jan Harkes
  2001-09-04 16:27         ` Rik van Riel
  1 sibling, 1 reply; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-04 15:24 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Linus Torvalds, Daniel Phillips, lkml, riel



On Tue, 4 Sep 2001, Jan Harkes wrote:

> On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> > I already have some code which adds a laundry list -- pages being written
> > out (by page_launder()) go to the laundry list, and each page_launder()
> > call will first check for unlocked pages on the laundry list before
> > doing the usual page_launder() stuff.
> 
> NO, please don't add another list to fix the symptoms of bad page aging.

Please, read my message again.

The laundry list is not an attempt to fix aging. It's just one way to find
previously cleaned data faster.

You should have created a new thread with subject "Aging is broken". :)



* Re: page_launder() on 2.4.9/10 issue
  2001-09-03 14:57     ` Marcelo Tosatti
@ 2001-09-04 15:26       ` Jan Harkes
  2001-09-04 15:24         ` Marcelo Tosatti
  2001-09-04 16:27         ` Rik van Riel
  0 siblings, 2 replies; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 15:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Daniel Phillips, lkml, riel

On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> I already have some code which adds a laundry list -- pages being written
> out (by page_launder()) go to the laundry list, and each page_launder()
> call will first check for unlocked pages on the laundry list before
> doing the usual page_launder() stuff.

NO, please don't add another list to fix the symptoms of bad page aging.

One of the graduate students here at CMU has been looking at the 2.4 VM,
trying to predict the size of the app that can possibly be loaded
without causing the system to start thrashing.

To do this he was looking at the current working set, using the ages of
pages in the page cache as an indicator; that is, he exports the number
of pages of a given age on the active list through a /proc device. The
results were unpredictable (almost every page was at age 0, except for a
few that were at MAX_PAGE_AGE), and walking through the source showed
why.

Aging is broken. Horribly. As a result, the inactive list is filled with
pages that are not necessarily inactive.

refill_inactive_scan does aging based on the PG_Referenced bit, which is
only set for buffer pages. So on every call to refill_inactive, pretty
much all active pages are being aged down aggressively.

The hardware referenced bit is checked in swap_out, which ages up:
swap_out walks part of the VM of all processes and ages up all
referenced pages. However, these pages will immediately get aged down
again by the following refill_inactive. The recent moving around of
refill_inactive in the 2.4.10-pre4 patch has actually made down aging
twice as aggressive.

Down aging is /2, up aging is += 3, so only pages that are referenced
more frequently than once a second on an unloaded system can slowly
crawl up; everything else sits at age 0.
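
That arithmetic is easy to check with a toy model (plain userspace C,
not kernel code), using the +3 / /2 constants quoted above:

#include <stdio.h>

int main(void)
{
	int age = 0, i;

	/* referenced once per scan interval: +3 up, then /2 down */
	for (i = 0; i < 10; i++)
		age = (age + 3) / 2;
	printf("one reference per interval: age = %d\n", age); /* 2 */

	/* two idle intervals wipe even that out */
	age /= 2;
	age /= 2;
	printf("after two idle intervals: age = %d\n", age); /* 0 */
	return 0;
}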

I've attached a patch against 2.4.10-pre4 that tries to do two things:
split the up/down aging out of refill_inactive etc., and crawl _all_
process VMs to copy the hardware referenced bits to the software bit.
On a system without shortage, pages are only aged up; this is not really
a problem, because as soon as there is some shortage the aggressive down
aging pulls pages at MAX_PAGE_AGE down to age 0 within 5 calls.

This is just an experimental patch; it probably doesn't work right on
all the various kinds of CPUs, but at least it gets the aging somewhat
better. Oh, and it seems to me that the discussion about read-ahead
pages is pretty much moot after this patch: they shouldn't push active
stuff out of memory.

Jan


diff -ur linux-2.4.10-pre4/mm/vmscan.c linux/mm/vmscan.c
--- linux-2.4.10-pre4/mm/vmscan.c	Tue Sep  4 10:55:29 2001
+++ linux/mm/vmscan.c	Tue Sep  4 11:04:48 2001
@@ -45,6 +45,165 @@
 	page->age /= 2;
 }
 
+/* mm->page_table_lock is held. mmap_sem is not held */
+static void vm_crawl_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end)
+{
+	pte_t * pte;
+	unsigned long pmd_end;
+
+	if (pmd_none(*dir))
+		return;
+	if (pmd_bad(*dir)) {
+		pmd_ERROR(*dir);
+		pmd_clear(dir);
+		return;
+	}
+	
+	pte = pte_offset(dir, address);
+	
+	pmd_end = (address + PMD_SIZE) & PMD_MASK;
+	if (end > pmd_end)
+		end = pmd_end;
+
+	do {
+		if (pte_present(*pte)) {
+			struct page *page = pte_page(*pte);
+
+			if (VALID_PAGE(page) && !PageReserved(page) &&
+			    ptep_test_and_clear_young(pte))
+			{
+				SetPageReferenced(page);
+			}
+		}
+		address += PAGE_SIZE;
+		pte++;
+	} while (address && (address < end));
+}
+
+/* mm->page_table_lock is held. mmap_sem is not held */
+static inline void vm_crawl_pgd(struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end)
+{
+	pmd_t * pmd;
+	unsigned long pgd_end;
+
+	if (pgd_none(*dir))
+		return;
+	if (pgd_bad(*dir)) {
+		pgd_ERROR(*dir);
+		pgd_clear(dir);
+		return;
+	}
+
+	pmd = pmd_offset(dir, address);
+
+	pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK;	
+	if (pgd_end && (end > pgd_end))
+		end = pgd_end;
+	
+	do {
+		vm_crawl_pmd(mm, vma, pmd, address, end);
+		address = (address + PMD_SIZE) & PMD_MASK;
+		pmd++;
+	} while (address && (address < end));
+}
+
+/* mm->page_table_lock is held. mmap_sem is not held */
+static void vm_crawl_vma(struct mm_struct * mm, struct vm_area_struct * vma)
+{
+	pgd_t *pgdir;
+	unsigned long end, address;
+
+	/* Skip areas which are locked down */
+	if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
+		return;
+
+	address = vma->vm_start;
+	pgdir = pgd_offset(mm, address);
+
+	end = vma->vm_end;
+	if (address >= end)
+		BUG();
+	do {
+		vm_crawl_pgd(mm, vma, pgdir, address, end);
+		address = (address + PGDIR_SIZE) & PGDIR_MASK;
+		pgdir++;
+	} while (address && (address < end));
+}
+
+static void vm_crawl_mm(struct mm_struct * mm)
+{
+	struct vm_area_struct* vma;
+
+	/*
+	 * Go through process' page directory.
+	 */
+
+	/*
+	 * Find the proper vm-area after freezing the vma chain 
+	 * and ptes.
+	 */
+	spin_lock(&mm->page_table_lock);
+
+	for (vma = find_vma(mm, 0); vma; vma = vma->vm_next)
+		vm_crawl_vma(mm, vma);
+
+	spin_unlock(&mm->page_table_lock);
+}
+
+/* set the software PG_Referenced bit on pages that have been accessed since
+ * the last scan. */
+static void vm_angel(void)
+{
+	struct list_head *p;
+	struct mm_struct *mm;
+
+	/* Walk all mm's */
+	spin_lock(&mmlist_lock);
+
+	p = init_mm.mmlist.next;
+	while (p != &init_mm.mmlist)
+	{
+		mm = list_entry(p, struct mm_struct, mmlist);
+
+		/* Make sure the mm doesn't disappear when we drop the lock.. */
+		atomic_inc(&mm->mm_users);
+		spin_unlock(&mmlist_lock);
+
+		vm_crawl_mm(mm);
+
+		/* Grab the lock again */
+		spin_lock(&mmlist_lock);
+
+		p = p->next;
+		mmput(mm);
+	}
+
+	spin_unlock(&mmlist_lock);
+}
+
+/* Age all pages on the active list that have their referenced bit set.
+ * Down aging is only done when do_try_to_free_pages fails the first time
+ * through. kswapd is running too often to get any fair aging behavior
+ * otherwise and apps that are running when there is no memory pressure should
+ * in my opinion get a little advantage against the new 'memory hogs' that
+ * push us into a shortage. */
+void vm_devil(int general_shortage)
+{
+	struct list_head * p;
+	struct page * page;
+
+	/* Take the lock while messing with the list... */
+	spin_lock(&pagemap_lru_lock);
+	list_for_each(p, &active_list) {
+		page = list_entry(p, struct page, lru);
+		if (PageTestandClearReferenced(page))
+		    age_page_up(page);
+		else if (general_shortage)
+		    age_page_down(page);
+	}
+	spin_unlock(&pagemap_lru_lock);
+}
+
 /*
  * The swap-out function returns 1 if it successfully
  * scanned all the pages it was asked to (`count').
@@ -87,6 +246,23 @@
 	pte_t pte;
 	swp_entry_t entry;
 
+	/* Don't look at this page if it's been accessed recently. */
+	if (page->mapping && page->age)
+		return;
+
+#if 0 /* The problem is that this test makes the system extremely unwilling to
+       * swap anything out, maybe we're not looking at a large enough part of
+       * the process VM so basically everything is typically referenced by the
+       * time we consider swapping out? */
+
+	/* Pages that have no swap allocated will not be on the active list and
+	 * will not be aged. However their Referenced bit should be set. */
+	if (PageTestandClearReferenced(page)) {
+	    page->age = 0;
+	    return;
+	}
+#endif
+
 	/* 
 	 * If we are doing a zone-specific scan, do not
 	 * touch pages from zones which don't have a 
@@ -95,12 +271,6 @@
 	if (zone_inactive_plenty(page->zone))
 		return;
 
-	/* Don't look at this pte if it's been accessed recently. */
-	if (ptep_test_and_clear_young(page_table)) {
-		age_page_up(page);
-		return;
-	}
-
 	if (TryLockPage(page))
 		return;
 
@@ -153,9 +323,12 @@
 			set_page_dirty(page);
 		goto drop_pte;
 	}
+
 	/*
-	 * Check PageDirty as well as pte_dirty: page may
-	 * have been brought back from swap by swapoff.
+	 * Ok, it's really dirty. That means that
+	 * we should either create a new swap cache
+	 * entry for it, or we should write it back
+	 * to its own backing store.
 	 */
 	if (!pte_dirty(pte) && !PageDirty(page))
 		goto drop_pte;
@@ -669,7 +842,6 @@
 	struct list_head * page_lru;
 	struct page * page;
 	int maxscan = nr_active_pages >> priority;
-	int page_active = 0;
 	int nr_deactivated = 0;
 
 	/* Take the lock while messing with the list... */
@@ -690,41 +862,34 @@
 		 * have plenty inactive pages.
 		 */
 
-		if (zone_inactive_plenty(page->zone)) {
-			page_active = 1;
+		if (zone_inactive_plenty(page->zone))
 			goto skip_page;
-		}
 
-		/* Do aging on the pages. */
-		if (PageTestandClearReferenced(page)) {
-			age_page_up(page);
-			page_active = 1;
-		} else {
-			age_page_down(page);
-			/*
-			 * Since we don't hold a reference on the page
-			 * ourselves, we have to do our test a bit more
-			 * strict then deactivate_page(). This is needed
-			 * since otherwise the system could hang shuffling
-			 * unfreeable pages from the active list to the
-			 * inactive_dirty list and back again...
-			 *
-			 * SUBTLE: we can have buffer pages with count 1.
-			 */
-			if (page->age == 0 && page_count(page) <=
-						(page->buffers ? 2 : 1)) {
-				deactivate_page_nolock(page);
-				page_active = 0;
-			} else {
-				page_active = 1;
-			}
+		/* not much use to inactivate ramdisk pages when page_launder
+		 * simply bounces them back to the active list */
+		if (page_ramdisk(page))
+		    	goto skip_page;
+
+		/*
+		 * Since we don't hold a reference on the page
+		 * ourselves, we have to do our test a bit more
+		 * strict then deactivate_page(). This is needed
+		 * since otherwise the system could hang shuffling
+		 * unfreeable pages from the active list to the
+		 * inactive_dirty list and back again...
+		 *
+		 * SUBTLE: we can have buffer pages with count 1.
+		 */
+		if (page->age == 0 && page_count(page) <= (page->buffers ? 2 : 1)) {
+			deactivate_page_nolock(page);
 		}
+
 		/*
 		 * If the page is still on the active list, move it
 		 * to the other end of the list. Otherwise we exit if
 		 * we have done enough work.
 		 */
-		if (page_active || PageActive(page)) {
+		if (PageActive(page)) {
 skip_page:
 			list_del(page_lru);
 			list_add(page_lru, &active_list);
@@ -820,14 +985,21 @@
 #define GENERAL_SHORTAGE 4
 static int do_try_to_free_pages(unsigned int gfp_mask, int user)
 {
+	/* Always walk at least the active queue when called */
 	int shortage = 0;
 	int maxtry;
 
+	/* make sure to update referenced bits */
+	vm_angel();
+
 	/* Always walk at least the active queue when called */
 	refill_inactive_scan(DEF_PRIORITY);
 
 	maxtry = 1 << DEF_PRIORITY;
 	do {
+	    	/* perform aging of the active list */
+	    	vm_devil(shortage & GENERAL_SHORTAGE);
+
 		/*
 		 * If needed, we move pages from the active list
 		 * to the inactive list.


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:14           ` Jan Harkes
@ 2001-09-04 15:53             ` Marcelo Tosatti
  2001-09-04 19:33             ` Daniel Phillips
  2001-09-06 11:52             ` Rik van Riel
  2 siblings, 0 replies; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-04 15:53 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-kernel



On Tue, 4 Sep 2001, Jan Harkes wrote:

> On Tue, Sep 04, 2001 at 12:24:36PM -0300, Marcelo Tosatti wrote:
> > On Tue, 4 Sep 2001, Jan Harkes wrote:
> > > On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> > > > I already have some code which adds a laundry list -- pages being written
> > > > out (by page_launder()) go to the laundry list, and each page_launder()
> > > > call will first check for unlocked pages on the laundry list before
> > > > doing the usual page_launder() stuff.
> > > 
> > > NO, please don't add another list to fix the symptoms of bad page aging.
> > 
> > Please, read my message again.
> 
> Sorry, it was a reaction to all the VM nonsense that has been flying
> around lately. A lot of the complaints and discussions wouldn't even
> have started if we actually moved _inactive_ pages to the inactive list
> instead of random pages.

> To get back on the thread I jumped into, I totally agree with Linus that
> writeout should happen as soon as possible, probably even as soon as an
> inactive dirty page hits the inactive dirty list, which would
> effectively turn the inactive dirty list into your laundry list.

Wrong. The laundry list is where in-flight pages stay, so that users
can free memory from there as soon as the IO is finished.

Do you see what I mean?





* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:13           ` Jan Harkes
@ 2001-09-04 15:56             ` Marcelo Tosatti
  2001-09-04 17:54               ` Jan Harkes
  2001-09-04 17:35             ` Daniel Phillips
  1 sibling, 1 reply; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-04 15:56 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Rik van Riel, Linus Torvalds, Daniel Phillips, lkml



On Tue, 4 Sep 2001, Jan Harkes wrote:

> On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote:
> > I've been working on a CPU and memory efficient reverse
> > mapping patch for Linux, one which will allow us to do
> > a bunch of optimisations for later on (infrastructure)
> > and has as its short-term benefit the potential for
> > better page aging.
> 
> Yes, I can see that using reverse mappings would be a way of correcting
> the aging if you call page_age_up from try_to_swap_out, in which case
> there probably needs to be a page_age_down on virtual mappings as well
> to correctly balance things.
> 
> > It seems the balancing FreeBSD does (up aging +3, down
> > aging -1, inactive list in LRU order as extra stage) is
> 
> One other observation: we should add anonymously allocated pages to the
> active list as soon as they are allocated in do_nopage. At the moment a
> large part of memory is not aged at all until we start swapping things
> out.

With reverse mappings we can completely remove the "swap_out()" loop logic
and age ptes at refill_inactive_scan().

All that with anon memory added to the active list as soon as it is
allocated, of course.

Jan, I suggest you take a look at the reverse mapping code.



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 15:26       ` Jan Harkes
  2001-09-04 15:24         ` Marcelo Tosatti
@ 2001-09-04 16:27         ` Rik van Riel
  2001-09-04 17:13           ` Jan Harkes
  2001-09-04 20:43           ` Jan Harkes
  1 sibling, 2 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-04 16:27 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Marcelo Tosatti, Linus Torvalds, Daniel Phillips, lkml

On Tue, 4 Sep 2001, Jan Harkes wrote:

> NO, please don't add another list to fix the symptoms of bad page aging.
>
> One of the graduate students here at CMU has been looking at the 2.4 VM,
> trying to predict the size of the app that can possibly be loaded
> without causing the system to start thrashing.

	[snip results]

> Aging is broken. Horribly. As a result, the inactive list is filled with
> pages that are not necessarily inactive.

I've been working on a CPU and memory efficient reverse
mapping patch for Linux, one which will allow us to do
a bunch of optimisations for later on (infrastructure)
and has as its short-term benefit the potential for
better page aging.

It seems the balancing FreeBSD does (up aging +3, down
aging -1, inactive list in LRU order as extra stage) is
working nicely on my laptop now, but I don't think I'll
be releasing that as part of the patch ...

	http://www.surriel.com/patches/2.4/2.4.8-ac12-pmap3
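
For readers who haven't seen it, the core of the reverse-mapping idea
can be sketched as below (the struct layout, the page->pte_chain field
and the function are illustrative assumptions, not necessarily what the
patch above does):

/* every physical page keeps a chain of the ptes that map it */
struct pte_chain {
	struct pte_chain *next;
	pte_t *ptep;
};

/* aging can then test-and-clear the referenced bit of every mapping in
 * one walk from the page, instead of crawling all process page tables */
static int page_referenced_via_rmap(struct page *page)
{
	struct pte_chain *pc;
	int referenced = 0;

	for (pc = page->pte_chain; pc; pc = pc->next)
		if (ptep_test_and_clear_young(pc->ptep))
			referenced++;
	return referenced;
}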

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:54               ` Jan Harkes
@ 2001-09-04 16:37                 ` Marcelo Tosatti
  2001-09-04 18:49                 ` Alan Cox
  2001-09-04 19:54                 ` Andrea Arcangeli
  2 siblings, 0 replies; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-04 16:37 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Rik van Riel, linux-kernel



On Tue, 4 Sep 2001, Jan Harkes wrote:

> On Tue, Sep 04, 2001 at 12:56:32PM -0300, Marcelo Tosatti wrote:
> > On Tue, 4 Sep 2001, Jan Harkes wrote:
> > > One other observation: we should add anonymously allocated pages to the
> > > active list as soon as they are allocated in do_nopage. At the moment a
> > > large part of memory is not aged at all until we start swapping things
> > > out.
> > 
> > With reverse mappings we can completely remove the "swap_out()" loop logic
> > and age ptes at refill_inactive_scan().
> > 
> > All that with anon memory added to the active list as soon as it is
> > allocated, of course.
> > 
> > Jan, I suggest you take a look at the reverse mapping code.
> 
> I'm getting pretty sick and tired of these endless discussions. People
> have been reporting problems and they are pretty much always met with the
> answer, "it works here, if you can do better send a patch".
> 
> Now for the past _9_ stable kernel releases, page aging hasn't worked
> at all!! Nobody seems to even have bothered to check. I send in a patch
> and you basically answer with "Ohh, but we know about that one. Just
> apply patch wizzbangfoo#105 which basically does everything differently".

Jan, 

Calm down. I haven't told you that the reverse mapping code is the fix to
all aging problems, have I?

I will take a careful look at your code later. However, I (and everybody
else) don't have enough time to fix the whole VM in one day.

> Yeah I'll have a look at that code, and I'll check what the page ages
> look like when I actually run it (if it doesn't crash the system first).

I haven't said reverse mapping will fix the aging problem. I just made a
comment on top of your comment.

Please read my mails more carefully and slowly before sending me to
hell. :) 



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 16:27         ` Rik van Riel
@ 2001-09-04 17:13           ` Jan Harkes
  2001-09-04 15:56             ` Marcelo Tosatti
  2001-09-04 17:35             ` Daniel Phillips
  2001-09-04 20:43           ` Jan Harkes
  1 sibling, 2 replies; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 17:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Marcelo Tosatti, Linus Torvalds, Daniel Phillips, lkml

On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote:
> I've been working on a CPU and memory efficient reverse
> mapping patch for Linux, one which will allow us to do
> a bunch of optimisations for later on (infrastructure)
> and has as its short-term benefit the potential for
> better page aging.

Yes, I can see that using reverse mappings would be a way of correcting
the aging if you call page_age_up from try_to_swap_out, in which case
there probably needs to be a page_age_down on virtual mappings as well
to correctly balance things.

> It seems the balancing FreeBSD does (up aging +3, down
> aging -1, inactive list in LRU order as extra stage) is

One other observation: we should add anonymously allocated pages to the
active list as soon as they are allocated in do_nopage. At the moment a
large part of memory is not aged at all until we start swapping things
out.
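
Schematically, the suggestion amounts to something like the fragment
below in the anonymous-fault path (simplified and hypothetical; it
borrows the 2.4-style add_page_to_active_list helper and omits the rest
of the fault handling):

	page = alloc_page(GFP_HIGHUSER);
	if (!page)
		return -1;	/* OOM */
	clear_user_highpage(page, address);

	/* enter the new anonymous page into the active list right away,
	 * so it accumulates aging information from birth */
	spin_lock(&pagemap_lru_lock);
	add_page_to_active_list(page);
	spin_unlock(&pagemap_lru_lock);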

Jan



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 15:24         ` Marcelo Tosatti
@ 2001-09-04 17:14           ` Jan Harkes
  2001-09-04 15:53             ` Marcelo Tosatti
                               ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 17:14 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Tue, Sep 04, 2001 at 12:24:36PM -0300, Marcelo Tosatti wrote:
> On Tue, 4 Sep 2001, Jan Harkes wrote:
> > On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> > > I already have some code which adds a laundry list -- pages being written
> > > out (by page_launder()) go to the laundry list, and each page_launder()
> > > call will first check for unlocked pages on the laundry list before
> > > doing the usual page_launder() stuff.
> > 
> > NO, please don't add another list to fix the symptoms of bad page aging.
> 
> Please, read my message again.

Sorry, it was a reaction to all the VM nonsense that has been flying
around lately. A lot of the complaints and discussions wouldn't even
have started if we actually moved _inactive_ pages to the inactive list
instead of random pages.

To get back on the thread I jumped into, I totally agree with Linus that
writeout should happen as soon as possible, probably even as soon as an
inactive dirty page hits the inactive dirty list, which would
effectively turn the inactive dirty list into your laundry list.

Jan


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:13           ` Jan Harkes
  2001-09-04 15:56             ` Marcelo Tosatti
@ 2001-09-04 17:35             ` Daniel Phillips
  1 sibling, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-04 17:35 UTC (permalink / raw)
  To: Jan Harkes, Rik van Riel; +Cc: Marcelo Tosatti, Linus Torvalds, lkml

On September 4, 2001 07:13 pm, Jan Harkes wrote:
> On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote:
> > I've been working on a CPU and memory efficient reverse
> > mapping patch for Linux, one which will allow us to do
> > a bunch of optimisations for later on (infrastructure)
> > and has as its short-term benefit the potential for
> > better page aging.
> 
> Yes, I can see that using reverse mappings would be a way of correcting
> the aging if you call page_age_up from try_to_swap_out, in which case
> there probably needs to be a page_age_down on virtual mappings as well
> to correctly balance things.

There is: 1) an unreferenced process-space page gets unmapped and goes
onto the LRU lists; 2) the page is aged down to zero until it gets
deactivated; 3) the page is deactivated and evicted soon after.  If the
page is referenced during (2) or (3) it will be mapped back in with no
IO, because it's still in the swap cache (a minor fault).

But this is lopsided and hard to balance.  Also, unmapping/remapping is an 
expensive way to check for short-term page activity.

> > It seems the balancing FreeBSD does (up aging +3, down
> > aging -1, inactive list in LRU order as extra stage) is
> 
> One other observation: we should add anonymously allocated pages to the
> active list as soon as they are allocated in do_nopage. At the moment a
> large part of memory is not aged at all until we start swapping things
> out.

This is useless without rmap because the page will just be aged down, not up. 
With rmap, yes, that's what needs to be done.

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 15:56             ` Marcelo Tosatti
@ 2001-09-04 17:54               ` Jan Harkes
  2001-09-04 16:37                 ` Marcelo Tosatti
                                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 17:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Rik van Riel, linux-kernel

On Tue, Sep 04, 2001 at 12:56:32PM -0300, Marcelo Tosatti wrote:
> On Tue, 4 Sep 2001, Jan Harkes wrote:
> > One other observation: we should add anonymously allocated pages to the
> > active list as soon as they are allocated in do_nopage. At the moment a
> > large part of memory is not aged at all until we start swapping things
> > out.
> 
> With reverse mappings we can completely remove the "swap_out()" loop logic
> and age ptes at refill_inactive_scan().
> 
> All that with anon memory added to the active list as soon as it is
> allocated, of course.
> 
> Jan, I suggest you take a look at the reverse mapping code.

I'm getting pretty sick and tired of these endless discussions. People
have been reporting problems and they are pretty much always met with the
answer, "it works here, if you can do better send a patch".

Now for the past _9_ stable kernel releases, page aging hasn't worked
at all!! Nobody seems to even have bothered to check. I send in a patch
and you basically answer with "Ohh, but we know about that one. Just
apply patch wizzbangfoo#105 which basically does everything differently".

Yeah I'll have a look at that code, and I'll check what the page ages
look like when I actually run it (if it doesn't crash the system first).

Jan



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 19:54                 ` Andrea Arcangeli
@ 2001-09-04 18:36                   ` Marcelo Tosatti
  2001-09-04 20:10                   ` Daniel Phillips
  2001-09-06 11:18                   ` Rik van Riel
  2 siblings, 0 replies; 79+ messages in thread
From: Marcelo Tosatti @ 2001-09-04 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Jan Harkes, Rik van Riel, linux-kernel



On Tue, 4 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote:
> > Now for the past _9_ stable kernel releases, page aging hasn't worked
> > at all!! Nobody seems to even have bothered to check. I send in a patch
> 
> All I can say is that I hope you will get your problem fixed with one of
> the next -aa releases; I incidentally started working on it yesterday. So
> far it's a one-thousand-line diff very far from compiling, so it will grow
> further, but it shouldn't take too long to finish the rewrite. Once it's
> finished, the benchmarks and the reproducible 2.4 deadlocks will tell me
> if I'm right.

Andrea, 

Could you please describe how you're trying to fix the "anon pages not
being added to the active list at do_no_page()" problem Jan described?

Thanks!



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:54               ` Jan Harkes
  2001-09-04 16:37                 ` Marcelo Tosatti
@ 2001-09-04 18:49                 ` Alan Cox
  2001-09-04 19:39                   ` Jan Harkes
  2001-09-04 19:54                 ` Andrea Arcangeli
  2 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2001-09-04 18:49 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel

> Now for the past _9_ stable kernel releases, page aging hasn't worked
> at all!! Nobody seems to even have bothered to check. I send in a patch
> and you basically answer with "Ohh, but we know about that one. Just
> apply patch wizzbangfoo#105 which basically does everything differently".

Maybe you should take issue with the people applying random patches, missing
important ones and mixing and matching incompatible ideas in the main tree?

The VM tuning in the -ac tree is a lot more reliable for most loads (it's
certainly not perfect) and that is because the changes have been done and
tested one at a time as they are merged. Real engineering process is the
only way to get this sort of thing working well.

Alan


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:14           ` Jan Harkes
  2001-09-04 15:53             ` Marcelo Tosatti
@ 2001-09-04 19:33             ` Daniel Phillips
  2001-09-06 11:52             ` Rik van Riel
  2 siblings, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-04 19:33 UTC (permalink / raw)
  To: Jan Harkes, Marcelo Tosatti; +Cc: linux-kernel

On September 4, 2001 07:14 pm, Jan Harkes wrote:
> To get back on the thread I jumped into, I totally agree with Linus that
> writeout should happen as soon as possible, probably even as soon as an
> inactive dirty page hits the inactive dirty list, which would
> effectively turn the inactive dirty list into your laundry list.

No, we don't want that; we need the inactive list as a test of short-term
inactivity.  It doesn't make sense to begin the writeout until the page
has made it to the other end of the inactive list.  Otherwise you just
revert to "one-hand-clock".
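
Schematically (illustrative helper names, not actual 2.4 code), the
two-stage behaviour Daniel is describing looks like this, with IO queued
only at the old end of the list:

	while (free_shortage() && !list_empty(&inactive_dirty_list)) {
		struct page *page = list_entry(inactive_dirty_list.prev,
					       struct page, lru);

		if (page_was_referenced(page)) {
			activate_page(page);	/* failed the inactivity test */
		} else if (PageDirty(page)) {
			/* only now, after surviving a full trip down the
			 * list unreferenced, does writeout start */
			start_async_writeout(page);
		} else {
			reclaim_page_now(page);	/* old and clean: free it */
		}
	}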

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 18:49                 ` Alan Cox
@ 2001-09-04 19:39                   ` Jan Harkes
  2001-09-04 20:25                     ` Alan Cox
  0 siblings, 1 reply; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 19:39 UTC (permalink / raw)
  To: Alan Cox; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel

On Tue, Sep 04, 2001 at 07:49:47PM +0100, Alan Cox wrote:
> The VM tuning in the -ac tree is a lot more reliable for most loads (it's
> certainly not perfect) and that is because the changes have been done and
> tested one at a time as they are merged. Real engineering process is the
> only way to get this sort of thing working well.

I grabbed the 2.4.9-ac7 patch and looked at some of the files.

Pages allocated with do_anonymous_page are not added to the active list.
As a result there is no aging information for a page until it is
unmapped, so we might be unmapping and allocating swap for shared pages
that another process is using heavily. In that case the page should
always have a high age in the active list and won't actually get
swapped out, so we get both unnecessary minor faults and swap space
that will never be reclaimed, because we never swap the page back in.

Also, up aging of mapped process pages is still done in try_to_swap_out,
and all of these pages are still aged down indiscriminately in
refill_inactive_scan. I don't see how it could age that much
differently, so I'm assuming all pages in the active list are basically
at age 0 no matter what aging strategy is picked.

Especially because only down aging is performed periodically by kswapd,
while the only code that ages process pages up is called once the
system hits a free or inactive shortage.

There are some places where tests have been added that should never make
a difference anyway. In reclaim_page and page_launder a page on the
inactive list is checked for page->age. Because the page is not mapped
in any VM, it is not possible for the age to be non-zero. If the page was
referenced it would have triggered a minor fault and reactivated the
page.

I guess it is just more carefully papering over the existing problems.

Jan



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:54               ` Jan Harkes
  2001-09-04 16:37                 ` Marcelo Tosatti
  2001-09-04 18:49                 ` Alan Cox
@ 2001-09-04 19:54                 ` Andrea Arcangeli
  2001-09-04 18:36                   ` Marcelo Tosatti
                                     ` (2 more replies)
  2 siblings, 3 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2001-09-04 19:54 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel

On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote:
> Now for the past _9_ stable kernel releases, page aging hasn't worked
> at all!! Nobody seems to even have bothered to check. I send in a patch

All I can say is that I hope you will get your problem fixed with one of
the next -aa releases; I incidentally started working on it yesterday. So
far it's a one-thousand-line diff very far from compiling, so it will grow
further, but it shouldn't take too long to finish the rewrite. Once it's
finished, the benchmarks and the reproducible 2.4 deadlocks will tell me
if I'm right.

Andrea


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 19:54                 ` Andrea Arcangeli
  2001-09-04 18:36                   ` Marcelo Tosatti
@ 2001-09-04 20:10                   ` Daniel Phillips
  2001-09-04 22:04                     ` Andrea Arcangeli
  2001-09-06 11:18                   ` Rik van Riel
  2 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-04 20:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel

On September 4, 2001 09:54 pm, Andrea Arcangeli wrote:
> On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote:
> > Now for the past _9_ stable kernel releases, page aging hasn't worked
> > at all!! Nobody seems to even have bothered to check. I send in a patch
> 
> All I can say is that I hope you will get your problem fixed with one of
> the next -aa releases; I incidentally started working on it yesterday. So
> far it's a one-thousand-line diff very far from compiling, so it will grow
> further, but it shouldn't take too long to finish the rewrite. Once it's
> finished, the benchmarks and the reproducible 2.4 deadlocks will tell me
> if I'm right.

Which reproducible deadlocks did you have in mind, and how do I reproduce
them?

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 19:39                   ` Jan Harkes
@ 2001-09-04 20:25                     ` Alan Cox
  2001-09-06 11:23                       ` Rik van Riel
  0 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2001-09-04 20:25 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Alan Cox, Marcelo Tosatti, Rik van Riel, linux-kernel

> Pages allocated with do_anonymous_page are not added to the active list.
> As a result there is no aging information for a page until it is
> unmapped, so we might be unmapping and allocating swap for shared pages

Right ok. 

> I guess it is just more carefully papering over the existing problems.

If you are correct, then I suspect the better behaviour is primarily
coming from the balancing algorithms and the choices made rather than
from the quality of the aging data, as you suggest.

When Rik gets back off a plane this sounds like something that should be
tested - one item at a time.

Alan



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 16:27         ` Rik van Riel
  2001-09-04 17:13           ` Jan Harkes
@ 2001-09-04 20:43           ` Jan Harkes
  2001-09-06 11:21             ` Rik van Riel
  1 sibling, 1 reply; 79+ messages in thread
From: Jan Harkes @ 2001-09-04 20:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote:
> I've been working on a CPU and memory efficient reverse
> mapping patch for Linux, one which will allow us to do
> a bunch of optimisations for later on (infrastructure)
> and has as its short-term benefit the potential for
> better page aging.
> 
> It seems the balancing FreeBSD does (up aging +3, down
> aging -1, inactive list in LRU order as extra stage) is
> working nicely on my laptop now, but I don't think I'll
> be releasing that as part of the patch ...
> 
> 	http://www.surriel.com/patches/2.4/2.4.8-ac12-pmap3

I like the fact that it completely removes the VM-crawling swap_out
path. It also does aging more sanely because it can now take everything
into account. It also works around the problems of anonymous pages that
aren't aged until they are added to the swap cache.

It should also minimize unnecessary minor page faults because the
unmapping is done for all ptes once the page->age hits zero, and
frequently used pages should not grab and lock down swap space that
they won't be able to give up (until the process exits).

The pte_chain allocation stuff looks a bit scary; where did you want to
reclaim them from when memory runs out, by unmapping existing ptes?

One thing that might be nice, and that showed a lot of promise here, is
to either age down by subtracting instead of dividing, to make it less
aggressive. It is already hard enough for pages to get referenced often
enough to move up the scale.

Or use a similar approach to the one I have in my patch: age up
periodically, but only age down when there is a memory shortage. This
gives a slight advantage to processes that were running when there was
not much VM pressure. When something starts hogging memory, it is
penalized a bit for disturbing the peace, but the aggressive down aging
will quickly rebalance, typically within about 3 calls to
do_try_to_free_pages.
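
A sketch of that linear decay (the constant and the helper name are made
up; compare the age_page_down at the top of the patch earlier in this
thread):

#define PAGE_AGE_DECL	1	/* illustrative decrement */

/* subtract instead of divide: a page at MAX_PAGE_AGE now survives
 * MAX_PAGE_AGE / PAGE_AGE_DECL idle scans rather than collapsing
 * within log2(MAX_PAGE_AGE) of them */
static inline void age_page_down_linear(struct page *page)
{
	if (page->age > PAGE_AGE_DECL)
		page->age -= PAGE_AGE_DECL;
	else
		page->age = 0;
}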

I might port your patch over to Linus's 2.4.10-pre tree to play with it.
It could very well be a significant improvement because it does address
many of the issues that I ran into.

Jan



* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 20:10                   ` Daniel Phillips
@ 2001-09-04 22:04                     ` Andrea Arcangeli
  2001-09-05  2:41                       ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2001-09-04 22:04 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, Rik van Riel, linux-kernel

On Tue, Sep 04, 2001 at 10:10:42PM +0200, Daniel Phillips wrote:
> Which reproducible deadlocks did you have in mind, and how do I reproduce
> them?

I meant the various known oom deadlocks. I've got one showstopper report
with the blkdev-in-pagecache patch, with a small pagecache-backed
ramdisk also in use; the pagecache-backed ramdisk works like ramfs etc.
and marks the page dirty again in writepage. Somebody must have broken
page_launder or something else in the memory management, because exactly
the same code was working fine in 2.4.7. Now it probably loops or breaks
totally when somebody marks the page dirty again, but the vm problems
are much, much wider, starting from the kswapd loop on gfp dma or gfp
normal, the overkill swapping when there's tons of ram in freeable cache
and you are taking advantage of the cache, lack of defragmentation, lack
of knowledge of the classzone in the memory balancing (this in turn is
why kswapd goes mad), very imprecise estimation of the freeable ram,
overkill code in the allocator (the limit stuff is senseless), tons of
magic numbers that don't make any sensible difference, tons of cpu
wasted, performance that decreases at every run of the benchmarks,
etc.

If you believe I'm dreaming, just forget about this email; this is my
last email about this until I've finished.

Andrea


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 22:04                     ` Andrea Arcangeli
@ 2001-09-05  2:41                       ` Daniel Phillips
  0 siblings, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-05  2:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Jan Harkes, Marcelo Tosatti, Rik van Riel, linux-kernel

On September 5, 2001 12:04 am, Andrea Arcangeli wrote:
> On Tue, Sep 04, 2001 at 10:10:42PM +0200, Daniel Phillips wrote:
> > Which reproducible deadlocks did you have in mind, and how do I reproduce
> > them?
> 
> I meant the various known oom deadlocks. I've got one showstopper report
> with the blkdev-in-pagecache patch, with a small pagecache-backed
> ramdisk also in use; the pagecache-backed ramdisk works like ramfs etc.
> and marks the page dirty again in writepage. Somebody must have broken
> page_launder or something else in the memory management, because exactly
> the same code was working fine in 2.4.7. Now it probably loops or breaks
> totally when somebody marks the page dirty again, but the vm problems
> are much, much wider, starting from the kswapd loop on gfp dma or gfp
> normal, the overkill swapping when there's tons of ram in freeable cache
> and you are taking advantage of the cache, lack of defragmentation, lack
> of knowledge of the classzone in the memory balancing (this in turn is
> why kswapd goes mad), very imprecise estimation of the freeable ram,
> overkill code in the allocator (the limit stuff is senseless), tons of
> magic numbers that don't make any sensible difference, tons of cpu
> wasted, performance that decreases at every run of the benchmarks,
> etc.
> 
> If you believe I'm dreaming, just forget about this email; this is my
> last email about this until I've finished.

Sure.  You mentioned one deadlock - oom - and a bunch of suckages.  The oom 
problem is related to imprecise knowledge of freeable memory; you could group 
those two together.  Active defragmentation isn't going to be that hard, I 
think.  We'll see...

Don't forget all the stuff that works pretty well now.  Most of the problem 
reports we're getting now are concerned with the fact that we're loading up 
logs with allocation failure messages.  We probably wouldn't get those 
reports if we just turned off the messages now.  Bounce buffer allocation was 
the stopper there and Marcelo's patch has put that one away.  I think I found 
a practical solution to the 0-order atomic failures, subject to more 
confirmation.  Balancing and aging, while not perfect, are at least 
serviceable.  Hugh Dickins rooted out a bunch of genuine bugs in swap.  Rik 
seems to have defanged the swap space allocation problem.  Other bugs were 
rooted out and killed by Ben and Linus.  All in all, things are much improved.

The biggest issue we need to tackle before calling it a serviceable VM system 
is the freeable memory accounting.

--
Daniel


* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 19:54                 ` Andrea Arcangeli
  2001-09-04 18:36                   ` Marcelo Tosatti
  2001-09-04 20:10                   ` Daniel Phillips
@ 2001-09-06 11:18                   ` Rik van Riel
  2 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 11:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On Tue, 4 Sep 2001, Andrea Arcangeli wrote:
> On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote:
> > Now for the past _9_ stable kernel releases, page aging hasn't worked
> > at all!! Nobody seems to even have bothered to check. I send in a patch
>
> All I can say is that I hope you will get your problem fixed with one
> of the next -aa releases; I incidentally started working on it yesterday.

You too? ;)

> So far it's a one-thousand-line diff, very far from compiling, so it
> will grow further, but it shouldn't take too long to finish the
> rewrite. Once finished, the benchmarks and the reproducible 2.4
> deadlocks will tell me if I'm right.

Of course, we could try to work together on this one, since
we both seem to be starved for time ...

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 20:43           ` Jan Harkes
@ 2001-09-06 11:21             ` Rik van Riel
  0 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 11:21 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-kernel

On Tue, 4 Sep 2001, Jan Harkes wrote:

> The pte_chain allocation stuff looks a bit scary: where did you want
> to reclaim them from when memory runs out, by unmapping existing pte's?

Exactly. This is the strategy also used by BSD and it seems to
work really well.

> One thing that might be nice, and which showed a lot of promise here,
> is to age down by subtracting instead of dividing, to make aging down
> less aggressive. It is already hard enough for pages to get referenced
> often enough to move up the scale.

Oh definitely, I've tried it with linear page aging and it works
a lot better. I'm just not including that in my patch right now
because I don't want to mix policy and mechanism; I want to
really get the mechanism right before moving on to other
stuff.
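
To make the difference concrete, here is a rough userspace sketch of the
two down-aging flavours (the constants and function names are invented
for illustration, they are not the actual kernel macros):

#include <stdio.h>

#define PAGE_AGE_ADV    3   /* added when a page is found referenced */
#define PAGE_AGE_DECL   1   /* linear decline per idle aging pass */
#define PAGE_AGE_MAX   64

/* exponential down-aging: halving means a page must be referenced on
 * almost every pass just to hold its ground */
static int age_down_exponential(int age)
{
        return age / 2;
}

/* linear down-aging: a previously busy page survives several idle
 * passes before its age reaches zero */
static int age_down_linear(int age)
{
        return age > PAGE_AGE_DECL ? age - PAGE_AGE_DECL : 0;
}

static int age_up(int age)
{
        age += PAGE_AGE_ADV;
        return age > PAGE_AGE_MAX ? PAGE_AGE_MAX : age;
}

int main(void)
{
        int age = age_up(5);    /* a referenced page at age 5 becomes 8 */

        printf("one idle pass: halving -> %d, subtracting -> %d\n",
               age_down_exponential(age), age_down_linear(age));
        return 0;
}

Starting from age 8, a single unreferenced pass leaves 4 under halving
but 7 under subtraction, which is why halving is so much harder on pages
that aren't touched on every single pass.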

> Or use a similar approach to the one I have in my patch: age up
> periodically, but only age down when there is memory shortage,

Where can I get your patch ?

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 20:25                     ` Alan Cox
@ 2001-09-06 11:23                       ` Rik van Riel
  0 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 11:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On Tue, 4 Sep 2001, Alan Cox wrote:

> Pages allocated with do_anonymous_page are not added to the active list.
> As a result there is no aging information for a page until it is
> unmapped. So we might be unmapping and allocating swap for shared pages
>
> Right ok.

One problem, though: we cannot 'see' the referenced bits in the
page tables, and nothing else is accessing this page, so there's
no information we can learn from having this page on the active
list.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-04 17:14           ` Jan Harkes
  2001-09-04 15:53             ` Marcelo Tosatti
  2001-09-04 19:33             ` Daniel Phillips
@ 2001-09-06 11:52             ` Rik van Riel
  2001-09-06 12:31               ` Daniel Phillips
  2001-09-06 13:10               ` Stephan von Krawczynski
  2 siblings, 2 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 11:52 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Marcelo Tosatti, linux-kernel

On Tue, 4 Sep 2001, Jan Harkes wrote:

> To get back on the thread I jumped into, I totally agree with Linus
> that writeout should be as soon as possible.

Nice way to destroy read performance.  As DaveM noted so
nicely in his reverse mapping patch (at the end of the
2.3 series), dirty pages get moved to the laundry list
and the washing machine will deal with them when we have
a full load.

Let's face it, spinning the washing machine is expensive
and running less than a full load makes things inefficient ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 11:52             ` Rik van Riel
@ 2001-09-06 12:31               ` Daniel Phillips
  2001-09-06 12:32                 ` Rik van Riel
  2001-09-06 13:10               ` Stephan von Krawczynski
  1 sibling, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 12:31 UTC (permalink / raw)
  To: Rik van Riel, Jan Harkes; +Cc: Marcelo Tosatti, linux-kernel

On September 6, 2001 01:52 pm, Rik van Riel wrote:
> On Tue, 4 Sep 2001, Jan Harkes wrote:
> 
> > To get back on the thread I jumped into, I totally agree with Linus
> > that writeout should be as soon as possible.
> 
> Nice way to destroy read performance.

Blindly delaying all the writes in the name of better read performance isn't 
the right idea either.  Perhaps we should have a good think about some 
sensible mechanism for balancing reads against writes.

> As DaveM noted so
> nicely in his reverse mapping patch (at the end of the
> 2.3 series), dirty pages get moved to the laundry list
> and the washing machine will deal with them when we have
> a full load.
> 
> Let's face it, spinning the washing machine is expensive
> and running less than a full load makes things inefficient ;)

That makes a good sound bite but doesn't stand up to scrutiny.

It's not a washing machine ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 12:31               ` Daniel Phillips
@ 2001-09-06 12:32                 ` Rik van Riel
  2001-09-06 12:53                   ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 12:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:
> On September 6, 2001 01:52 pm, Rik van Riel wrote:
> > On Tue, 4 Sep 2001, Jan Harkes wrote:
> >
> > > To get back on the thread I jumped into, I totally agree with Linus
> > > that writeout should be as soon as possible.
> >
> > Nice way to destroy read performance.
>
> Blindly delaying all the writes in the name of better read performance
> isn't the right idea either.  Perhaps we should have a good think
> about some sensible mechanism for balancing reads against writes.

Absolutely, delaying writes for too long is just as bad,
we need something in-between.

> > Let's face it, spinning the washing machine is expensive
> > and running less than a full load makes things inefficient ;)
>
> That makes a good sound bite but doesn't stand up to scrutiny.
> It's not a washing machine ;-)

Two words:  "IO clustering".

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 12:32                 ` Rik van Riel
@ 2001-09-06 12:53                   ` Daniel Phillips
  2001-09-06 13:03                     ` Rik van Riel
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 12:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On September 6, 2001 02:32 pm, Rik van Riel wrote:
> > > Let's face it, spinning the washing machine is expensive
> > > and running less than a full load makes things inefficient ;)
> >
> > That makes a good sound bite but doesn't stand up to scrutiny.
> > It's not a washing machine ;-)
> 
> Two words:  "IO clustering".

Yes, *after* the IO queue is fully loaded that makes sense.  Leaving it 
partly or fully idle while waiting for it to load up makes no sense at all.

IO clustering will happen naturally after the queue loads up.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 12:53                   ` Daniel Phillips
@ 2001-09-06 13:03                     ` Rik van Riel
  2001-09-06 13:18                       ` Kurt Garloff
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 13:03 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:
> On September 6, 2001 02:32 pm, Rik van Riel wrote:

> > Two words:  "IO clustering".
>
> Yes, *after* the IO queue is fully loaded that makes sense.  Leaving it
> partly or fully idle while waiting for it to load up makes no sense at all.
>
> IO clustering will happen naturally after the queue loads up.

Exactly, so we need to give the queue some time to load
up, right ?

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 11:52             ` Rik van Riel
  2001-09-06 12:31               ` Daniel Phillips
@ 2001-09-06 13:10               ` Stephan von Krawczynski
  2001-09-06 13:23                 ` Alex Bligh - linux-kernel
                                   ` (3 more replies)
  1 sibling, 4 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2001-09-06 13:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: riel, jaharkes, marcelo, linux-kernel

On Thu, 6 Sep 2001 14:31:32 +0200 Daniel Phillips <phillips@bonn-fries.net>
wrote:

> On September 6, 2001 01:52 pm, Rik van Riel wrote:
> > On Tue, 4 Sep 2001, Jan Harkes wrote:
> > 
> > > To get back on the thread I jumped into, I totally agree with Linus
> > > that writeout should be as soon as possible.
> > 
> > Nice way to destroy read performance.
> 
> Blindly delaying all the writes in the name of better read performance isn't 
> the right idea either.  Perhaps we should have a good think about some 
> sensible mechanism for balancing reads against writes.

I guess I have the real-world proof for that:
Yesterday I mastered a CD (around 700 MB) and burned it, then left the
equipment to get some food and sleep (sometimes needed :-). During this time
the machine acted as nfs-server and got about 3 GB of data written to it.
Coming back today I recognised that deleting the CD image made yesterday
freed up about 500 MB of physical mem (free mem was very low before). It was
obviously held 24 hours for no reason, and _not_ (as one would expect)
exchanged against the nfs-data. This means the caches were full of _old_
data, which explains why nfs performance has remarkably dropped since 2.2:
there is too little mem around to get good performance (no matter whether
read or write). Obviously aging did not work at all; there was not a single
hit on these (CD image) pages during 24 hours, compared to lots on the
nfs-data. Even if the nfs-data had only a single hit, the old CD image
should have been evicted, because it is inactive and _older_.

> > As DaveM noted so
> > nicely in his reverse mapping patch (at the end of the
> > 2.3 series), dirty pages get moved to the laundry list
> > and the washing machine will deal with them when we have
> > a full load.
> > 
> > Let's face it, spinning the washing machine is expensive
> > and running less than a full load makes things inefficient ;)

I guess this is what people writing w*ndows screen blankers thought, too ;-)

Sorry for this comment, couldn't resist :-)

Stephan



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:03                     ` Rik van Riel
@ 2001-09-06 13:18                       ` Kurt Garloff
  2001-09-06 13:23                         ` Rik van Riel
                                           ` (3 more replies)
  0 siblings, 4 replies; 79+ messages in thread
From: Kurt Garloff @ 2001-09-06 13:18 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote:
> On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > On September 6, 2001 02:32 pm, Rik van Riel wrote:
> > > Two words:  "IO clustering".
> >
> > Yes, *after* the IO queue is fully loaded that makes sense.  Leaving it
> > partly or fully idle while waiting for it to load up makes no sense at all.
> >
> > IO clustering will happen naturally after the queue loads up.
> 
> Exactly, so we need to give the queue some time to load
> up, right ?

Just use two limits:
* Time: after some time (say two seconds), we can always afford to write it
  out.
* Queue filling: after the queue reaches a certain size, it's worth doing a
  write.
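
A minimal sketch of what such a double-limit check could look like (the
threshold values, names and structure here are made up for illustration;
real code would hook into bdflush/kupdated):

#include <stdbool.h>

#define HZ              100           /* clock ticks per second (i386) */
#define FLUSH_MAX_AGE   (2 * HZ)      /* time limit: roughly two seconds */
#define FLUSH_MIN_PAGES 32            /* size limit: enough to cluster well */

struct flush_queue {
        unsigned long oldest_jiffies;  /* when the oldest dirty page arrived */
        unsigned long nr_dirty;        /* dirty pages currently queued */
};

/* Flush when either limit trips: the queue is large enough to make the
 * write worthwhile, or the oldest page has waited long enough. */
static bool should_flush(const struct flush_queue *q, unsigned long now)
{
        if (q->nr_dirty == 0)
                return false;
        if (q->nr_dirty >= FLUSH_MIN_PAGES)
                return true;
        return now - q->oldest_jiffies >= FLUSH_MAX_AGE;
}

int main(void)
{
        struct flush_queue q = { .oldest_jiffies = 0, .nr_dirty = 5 };

        /* five pages, half a second old: too few and too young to flush */
        return should_flush(&q, HZ / 2);
}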

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:18                       ` Kurt Garloff
@ 2001-09-06 13:23                         ` Rik van Riel
  2001-09-06 13:28                         ` Alan Cox
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 13:23 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Kurt Garloff wrote:

> > Exactly, so we need to give the queue some time to load
> > up, right ?
>
> Just use two limits:
> * Time: after some time (say two seconds), we can always afford to write it
>   out.
> * Queue filling: after the queue reaches a certain size, it's worth doing a
>   write.

Sounds good to me.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:10               ` Stephan von Krawczynski
@ 2001-09-06 13:23                 ` Alex Bligh - linux-kernel
  2001-09-06 13:54                   ` M. Edward Borasky
  2001-09-06 13:42                 ` Stephan von Krawczynski
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 13:23 UTC (permalink / raw)
  To: Stephan von Krawczynski, Daniel Phillips
  Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel



--On Thursday, September 06, 2001 3:10 PM +0200 Stephan von Krawczynski 
<skraw@ithnet.com> wrote:

> Obviously aging did not work at all;
> there was not a single hit on these (CD image) pages during 24 hours,
> compared to lots on the nfs-data.

If there's no memory pressure, data stays in InactiveDirty, caches,
etc., forever. What makes you think more memory would have helped
the NFS performance? It's possible these were all served out of
caches too.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:18                       ` Kurt Garloff
  2001-09-06 13:23                         ` Rik van Riel
@ 2001-09-06 13:28                         ` Alan Cox
  2001-09-06 13:29                           ` Rik van Riel
  2001-09-06 16:45                         ` Daniel Phillips
  2001-09-06 17:35                         ` Mike Fedyk
  3 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2001-09-06 13:28 UTC (permalink / raw)
  To: Kurt Garloff
  Cc: Rik van Riel, Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel

> Just use two limits:
> * Time: after some time (say two seconds), we can always afford to write it
>   out.
> * Queue filling: after the queue reaches a certain size, it's worth doing a
>   write.

Both are debatable, and I can find counter-cases for both - think about a
shared memory database with multiple game clients using it (eg the older
AberMUD codebase). Writing that to disk is counterproductive.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:28                         ` Alan Cox
@ 2001-09-06 13:29                           ` Rik van Riel
  0 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 13:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kurt Garloff, Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Alan Cox wrote:

> > Just use two limits:
> > * Time: after some time (say two seconds), we can always afford to write it
> >   out.
> > * Queue filling: after the queue reaches a certain size, it's worth doing a
> >   write.
>
> Both are debatable, and I can find counter-cases for both - think about a
> shared memory database with multiple game clients using it (eg the
> older AberMUD codebase). Writing that to disk is counterproductive.

This is only for pages on the inactive_dirty list, though;
ie pages we want to evict from memory with the minimal amount
of work possible ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:10               ` Stephan von Krawczynski
  2001-09-06 13:23                 ` Alex Bligh - linux-kernel
@ 2001-09-06 13:42                 ` Stephan von Krawczynski
  2001-09-06 14:01                   ` Alex Bligh - linux-kernel
  2001-09-06 14:39                   ` Stephan von Krawczynski
  2001-09-06 17:51                 ` Daniel Phillips
  2001-09-07 12:30                 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski
  3 siblings, 2 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2001-09-06 13:42 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel

On Thu, 06 Sep 2001 14:23:58 +0100 Alex Bligh - linux-kernel
<linux-kernel@alex.org.uk> wrote:

> 
> 
> --On Thursday, September 06, 2001 3:10 PM +0200 Stephan von Krawczynski 
> <skraw@ithnet.com> wrote:
> 
> > Obviously aging did not work at all;
> > there was not a single hit on these (CD image) pages during 24 hours,
> > compared to lots on the nfs-data.
> 
> If there's no memory pressure, data stays in InactiveDirty, caches,
> etc., forever. What makes you think more memory would have helped
> the NFS performance? It's possible these all were served out of caches
> too.

Negative. Switching off the export option "no_subtree_check" (which basically
leads to more small allocs during nfs action) immediately shows memory
allocation failures and truncated files on the server and stale nfs handles on
the client. So the system _is_ under pressure. This is exactly what made me
start (my branch of) the discussion.
Besides, I would really like to know what usable _data_ is in these pages, as I
cannot see which application should hold it (the CD stuff was quit "long ago").
FS should have sync'ed several times, too.

Stephan


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: page_launder() on 2.4.9/10 issue
  2001-09-06 13:23                 ` Alex Bligh - linux-kernel
@ 2001-09-06 13:54                   ` M. Edward Borasky
  2001-09-06 14:39                     ` Alan Cox
  2001-09-06 17:33                     ` Daniel Phillips
  0 siblings, 2 replies; 79+ messages in thread
From: M. Edward Borasky @ 2001-09-06 13:54 UTC (permalink / raw)
  To: linux-kernel

I'm relatively new to the Linux kernel world and even newer to the list, so
forgive me if I'm asking a silly question or making a silly comment. It
seems to me, from what I've seen of this discussion so far, that the only
way one "tunes" Linux kernels at the moment is by changing code and
rebuilding the kernel. That is, there are few "tunables" that one can set,
based on one's circumstances, to optimize kernel performance for a specific
application or environment.

Every other operating system that I've done performance tuning on, starting
with Xerox CP-V in 1974, had such tunables and tools to set them. And quite
often, some of the tuning parameters can be set "on the fly", simply by
knowing the correct memory location to set and poking a new value into it.
No one "memory management scheme", for example, can be all things to all
tasks, and it seems to me that giving users tools to measure and control the
behavior of memory management, *preferably without having to recompile and
reboot*, should be a major priority if Linux is to succeed in a wide variety
of applications.

OK, I'll get off my soapbox now, and ask a related question. Is there a
mathematical model of the Linux kernel somewhere that I could get my hands
on?
--
M. Edward (Ed) Borasky, Chief Scientist, Borasky Research
http://www.borasky-research.net  http://www.aracnet.com/~znmeb
mailto:znmeb@borasky-research.com  mailto:znmeb@aracnet.com

Stand-Up Comedy: Because Man Does Not Live By Dread Alone


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:42                 ` Stephan von Krawczynski
@ 2001-09-06 14:01                   ` Alex Bligh - linux-kernel
  2001-09-06 14:39                   ` Stephan von Krawczynski
  1 sibling, 0 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 14:01 UTC (permalink / raw)
  To: Stephan von Krawczynski, Alex Bligh - linux-kernel
  Cc: phillips, riel, jaharkes, marcelo, linux-kernel,
	Alex Bligh - linux-kernel

>> If there's no memory pressure, data stays in InactiveDirty, caches,
>> etc., forever. What makes you think more memory would have helped
>> the NFS performance? It's possible these all were served out of caches
>> too.
>
> Negative. Switching off the export option "no_subtree_check" (which
> basically leads to more small allocs during nfs action) immediately
> shows memory allocation failures and truncated files on the server and
> stale nfs handles on the client. So the system _is_ under pressure.
> This is exactly what made me start (my branch of) the discussion.
> Besides, I would really like to know what usable _data_ is in these
> pages, as I cannot see which application should hold it (the CD stuff
> was quit "long ago"). FS should have sync'ed several times, too.

Yes, but this is because VM system's targets & pressure calcs do not
take into account fragmentation of the underlying physical memory.
IE, in theory you could have half your memory free, but
not be able to allocate a single 8k block. Nothing would cause
cache, or InactiveDirty stuff to be written.

You yourself proved this by switching rsize,wsize to 1k and saying
it all worked fine! (unless I misread your email).

The other potential problem is that if the memory requirement
is all extremely bursty and without __GFP_WAIT (i.e. allocated
GFP_ATOMIC) then it is conceivable you need a whole pile of
memory allocated before the system has time to retrieve it
from things which require locks, I/O, etc. However, I suspect
this isn't the problem.

Put my instrumentation patch on, and if I'm right you'll see something
like the following, but worse. Look at 32kB allocations (order 3,
which is what I think you said was failing), and look at the %fragmentation.
This is the % of free memory which cannot be allocated as (in this
case) contiguous 32kB chunks (as it's all in smaller blocks). As
this approaches 100, the VM system is going to think 'no memory
pressure' and not free up pages, but you are going to be unable
to allocate.

The second of these examples was after a single bonnie run,
a sync, and 5 minutes of idle activity. Note that in this
example, any more than a few order-4 allocations which required
DMA would fail, though the VM system would see plenty of memory.
And they will continue failing.

I think what you want isn't more memory, it's less
fragmented memory. Or an underlying system which can
cope with fragmentation.

--
Alex Bligh


Before
$ cat /proc/memareas
   Zone     4kB     8kB    16kB    32kB    64kB   128kB   256kB   512kB  1024kB  2048kB Tot Pages/kb
    DMA       2       2       4       3       3       3       1       1       0       6 =     3454
  @frag      0%      0%      0%      1%      1%      3%      6%      7%     11%     11%      13816kB
 Normal       0       0       6      29      18       8       4       0       1      23 =    13088
  @frag      0%      0%      0%      0%      2%      4%      6%      8%      8%     10%      52352kB
HighMem = 0kB - zero size zone

After
$ cat /proc/memareas
   Zone     4kB     8kB    16kB    32kB    64kB   128kB   256kB   512kB  1024kB  2048kB Tot Pages/kb
    DMA     522     382     210      53       8       2       1       0       0       0 =     2806
  @frag      0%     19%     46%     76%     91%     95%     98%    100%    100%    100%      11224kB
 Normal       0    1155    1656     756     163      29       0       1       0       0 =    18646
  @frag      0%      0%     12%     48%     80%     94%     99%     99%    100%    100%      74584kB
                                    ^^^
					Order 3
HighMem = 0kB - zero size zone
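
For reference, the @frag percentages above are simply the share of free
memory sitting in blocks smaller than the order you want. A tiny
userspace sketch of the calculation (mine, for illustration; the
instrumentation patch does the equivalent in-kernel):

#include <stdio.h>

/* counts[o] = number of free order-o blocks, o = 0 (4kB) .. 9 (2048kB),
 * i.e. one row of the /proc/memareas output above */
static double frag_percent(const unsigned long counts[10], unsigned int order)
{
        unsigned long below = 0, total = 0;
        unsigned int o;

        for (o = 0; o < 10; o++) {
                unsigned long pages = counts[o] << o;
                total += pages;
                if (o < order)
                        below += pages;  /* free, but in blocks too small */
        }
        return total ? 100.0 * below / total : 0.0;
}

int main(void)
{
        /* the "After" DMA row from the output above */
        unsigned long dma[10] = { 522, 382, 210, 53, 8, 2, 1, 0, 0, 0 };

        printf("32kB (order 3) frag: %.0f%%\n", frag_percent(dma, 3));
        return 0;    /* prints 76%, matching the table */
}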



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:42                 ` Stephan von Krawczynski
  2001-09-06 14:01                   ` Alex Bligh - linux-kernel
@ 2001-09-06 14:39                   ` Stephan von Krawczynski
  2001-09-06 15:02                     ` Alex Bligh - linux-kernel
  2001-09-06 15:10                     ` Stephan von Krawczynski
  1 sibling, 2 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2001-09-06 14:39 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel

On Thu, 06 Sep 2001 15:01:49 +0100 Alex Bligh - linux-kernel
<linux-kernel@alex.org.uk> wrote:

> Yes, but this is because VM system's targets & pressure calcs do not
> take into account fragmentation of the underlying physical memory.
> IE, in theory you could have half your memory free, but
> not be able to allocate a single 8k block. Nothing would cause
> cache, or InactiveDirty stuff to be written.

Which is obviously not the right way to go. I guess we agree on that.

> You yourself proved this, by switching rsize,wsize to 1k and said
> it all worked fine! (unless I misread your email).

Sorry, misunderstanding: I did not touch rsize/wsize. What I do is lower fs
action by not letting knfsd walk through the subtrees of a mounted fs. This
leads to fewer allocs/frees by the fs layer, which tend to fail and make knfsd
fail afterwards.

> [...]
> I think what you want isn't more memory, its less
> fragmented memory.

This is one important part for sure.

> Or an underlying system which can
> cope with fragmentation.

Well, I'd rather have the cure than the dope :-)

Regards, Stephan


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:54                   ` M. Edward Borasky
@ 2001-09-06 14:39                     ` Alan Cox
  2001-09-06 16:20                       ` Victor Yodaiken
  2001-09-06 17:33                     ` Daniel Phillips
  1 sibling, 1 reply; 79+ messages in thread
From: Alan Cox @ 2001-09-06 14:39 UTC (permalink / raw)
  To: M. Edward Borasky; +Cc: linux-kernel

> forgive me if I'm asking a silly question or making a silly comment. It
> seems to me, from what I've seen of this discussion so far, that the only
> way one "tunes" Linux kernels at the moment is by changing code and
> rebuilding the kernel. That is, there are few "tunables" that one can set,
> based on one's circumstances, to optimize kernel performance for a specific
> application or environment.

There are a lot of tunables in /proc/sys. An excellent tool for playing with
them is "powertweak". 

> No one "memory management scheme", for example, can be all things to all
> tasks, and it seems to me that giving users tools to measure and control the
> behavior of memory management, *preferably without having to recompile and
> reboot*, should be a major priority if Linux is to succeed in a wide variety
> of applications.

The VM is tunable in the -ac tree. I still believe the VM can and should be
self tuning but we are not there yet.

> OK, I'll get off my soapbox now, and ask a related question. Is there a
> mathematical model of the Linux kernel somewhere that I could get my hands
> on?

Not that I am aware of. 

Alan

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 14:39                   ` Stephan von Krawczynski
@ 2001-09-06 15:02                     ` Alex Bligh - linux-kernel
  2001-09-06 15:07                       ` Rik van Riel
  2001-09-06 15:10                     ` Stephan von Krawczynski
  1 sibling, 1 reply; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 15:02 UTC (permalink / raw)
  To: Stephan von Krawczynski, Alex Bligh - linux-kernel
  Cc: phillips, riel, jaharkes, marcelo, linux-kernel,
	Alex Bligh - linux-kernel

Stephan,

>> Yes, but this is because VM system's targets & pressure calcs do not
>> take into account fragmentation of the underlying physical memory.
>> IE, in theory you could have half your memory free, but
>> not be able to allocate a single 8k block. Nothing would cause
>> cache, or InactiveDirty stuff to be written.
>
> Which is obviously not the right way to go. I guess we agree in that.

Well, I agree that this is not desirable. I am not sure whether
the right course is
 (a) to avoid getting here,
 (b) to do traditional page_launder() stuff, i.e. write stuff out,
     and hope that fixes it,
 (c) to actively go defragment (Daniel P's preferred approach), or
 (d) some combination of the above.

>> You yourself proved this by switching rsize,wsize to 1k and saying
>> it all worked fine! (unless I misread your email).
>
> Sorry, misunderstanding: I did not touch rsize/wsize. What I do is lower
> fs action by not letting knfsd walk through the subtrees of a mounted fs.
> This leads to fewer allocs/frees by the fs layer, which tend to fail and
> make knfsd fail afterwards.

OK, I'm getting confused.

I'm looking at stuff you sent like:
Aug 29 13:43:34 admin kernel: pid=1207; __alloc_pages(gfp=0x20, order=3, ...)
Aug 29 13:43:34 admin kernel: Call Trace: [_alloc_pages+22/24] [__get_free_pages+10/24] [<fdcec826>] [<fdcec8f5>] [<fdceb7d7>]
Aug 29 13:43:34 admin kernel:    [<fdcec0f5>] [<fdcea589>] [ip_local_deliver_finish+0/368] [nf_hook_slow+272/404] [ip_rcv_finish+0/480] [ip_local_deliver+436/444]
Aug 29 13:43:34 admin kernel:    [ip_local_deliver_finish+0/368] [ip_rcv_finish+0/480] [ip_rcv_finish+413/480] [ip_rcv_finish+0/480] [nf_hook_slow+272/404] [ip_rcv+870/944]
Aug 29 13:43:34 admin kernel:    [ip_rcv_finish+0/480] [net_rx_action+362/628] [do_softirq+111/204] [do_IRQ+219/236] [ret_from_intr+0/7] [sys_ioctl+443/532]
Aug 29 13:43:34 admin kernel:    [system_call+51/56]
Aug 29 13:43:34 admin kernel: __alloc_pages: 3-order allocation failed (gfp=0x20/0).

If you use rsize=1024,wsize=1024 (note you may have to force
this at the client end), you should not see, at least from NFS,
allocations greater than order 0. So if the problem is /just/
fragmentation (rather than too little memory), it will magically
go away (i.e. be hidden). If it's not just fragmentation, you
will still see errors. This is not intended as a solution, but
as a diagnostic tool. [I mistakenly thought/dreamed you had
already done this].

Note there may still be other things trying to do >0 order
allocs, for instance bounce buffers, but I believe you have
applied useful patches for them already.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 15:02                     ` Alex Bligh - linux-kernel
@ 2001-09-06 15:07                       ` Rik van Riel
       [not found]                         ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.con ectiva>
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 15:07 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel
  Cc: Stephan von Krawczynski, phillips, jaharkes, marcelo, linux-kernel

On Thu, 6 Sep 2001, Alex Bligh - linux-kernel wrote:

> >> IE, in theory you could have half your memory free, but
> >> not be able to allocate a single 8k block. Nothing would cause
> >> cache, or InactiveDirty stuff to be written.
> >
> > Which is obviously not the right way to go. I guess we agree in that.
>
> Well, I agree that this is not desirable. I am not sure whether
> the right course is
>  (a) to avoid getting here,
>  (b) to do traditional page_launder() stuff, i.e. write stuff out,
>      and hope that fixes it,
>  (c) to actively go defragment (Daniel P's preferred approach), or
>  (d) some combination of the above.

On many systems, higher-order allocations are a really, really
small fraction of the allocations, so ideally we'd have them
take the burden of memory fragmentation and not punish the
normal allocations.

That pretty much rules out very strong forms of (a); things
like (b) and (c) are very possible to do and maybe even easy.

They also won't cause any overhead for normal allocations
since we'd only call them when needed.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 14:39                   ` Stephan von Krawczynski
  2001-09-06 15:02                     ` Alex Bligh - linux-kernel
@ 2001-09-06 15:10                     ` Stephan von Krawczynski
  2001-09-06 15:18                       ` Alex Bligh - linux-kernel
  1 sibling, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2001-09-06 15:10 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel

On Thu, 06 Sep 2001 16:02:04 +0100 Alex Bligh - linux-kernel
<linux-kernel@alex.org.uk> wrote:

> Stephan,
> >> You yourself proved this by switching rsize,wsize to 1k and saying
> >> it all worked fine! (unless I misread your email).
> >
> > Sorry, misunderstanding: I did not touch rsize/wsize. What I do is
> > lower fs action by not letting knfsd walk through the subtrees of a
> > mounted fs. This leads to fewer allocs/frees by the fs layer, which
> > tend to fail and make knfsd fail afterwards.
> 
> OK, I'm getting confused.

To end that:

What I meant was, I did not touch the values most everybody uses on NFS, which
is:
rsize=8192,wsize=8192
Using smaller values (or the default = 1024) gives such ridiculously bad
performance that I would even prefer samba.

Regards,
Stephan



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
       [not found]                         ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.con ectiva>
@ 2001-09-06 15:16                           ` Alex Bligh - linux-kernel
  0 siblings, 0 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 15:16 UTC (permalink / raw)
  To: Rik van Riel, Alex Bligh - linux-kernel
  Cc: Stephan von Krawczynski, phillips, jaharkes, marcelo,
	linux-kernel, Alex Bligh - linux-kernel



--On Thursday, September 06, 2001 12:07 PM -0300 Rik van Riel 
<riel@conectiva.com.br> wrote:

> On many systems, higher-order allocations are a really really
> small fraction of the allocations, so ideally we'd have them
> take the burden of memory fragmentation and won't punish the
> normal allocations.

The only nit being that in every instance Stephan has reported so far,
and in most other reports I've seen, the allocation
has been GFP_ATOMIC (i.e. with a mask without __GFP_WAIT).
For non-atomic >0 order allocations we already have some
good logic that does (b) via page_launder(), and
where necessary reclaim_page() and __free_page().

So waiting until we are inside the high order allocation
is too late, as we don't have room to move.

I think we need to defragment / avoid fragmentation
BEFORE the GFP_ATOMIC high order allocation comes along.
I have some ideas I'd like to test tonight.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 15:10                     ` Stephan von Krawczynski
@ 2001-09-06 15:18                       ` Alex Bligh - linux-kernel
  2001-09-06 17:34                         ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 15:18 UTC (permalink / raw)
  To: Stephan von Krawczynski, Alex Bligh - linux-kernel
  Cc: phillips, riel, jaharkes, marcelo, linux-kernel,
	Alex Bligh - linux-kernel



--On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski 
<skraw@ithnet.com> wrote:

> (or the default = 1024) gives such ridiculously bad
> performance

I know. I am trying to ensure we have the problem definitively
identified, either from /proc/memareas, or by showing it
goes away if you change rsize/wsize. I am NOT proposing
it as a fix.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 14:39                     ` Alan Cox
@ 2001-09-06 16:20                       ` Victor Yodaiken
  0 siblings, 0 replies; 79+ messages in thread
From: Victor Yodaiken @ 2001-09-06 16:20 UTC (permalink / raw)
  To: Alan Cox; +Cc: M. Edward Borasky, linux-kernel

On Thu, Sep 06, 2001 at 03:39:17PM +0100, Alan Cox wrote:
> > OK, I'll get off my soapbox now, and ask a related question. Is there a
> > mathematical model of the Linux kernel somewhere that I could get my hands
> > on?
> 
> Not that I am aware of. 

A mathematical model of the Linux kernel would be a major scientific advance.

> 
> Alan

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:18                       ` Kurt Garloff
  2001-09-06 13:23                         ` Rik van Riel
  2001-09-06 13:28                         ` Alan Cox
@ 2001-09-06 16:45                         ` Daniel Phillips
  2001-09-06 16:57                           ` Rik van Riel
  2001-09-06 17:35                         ` Mike Fedyk
  3 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 16:45 UTC (permalink / raw)
  To: Kurt Garloff, Rik van Riel; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On September 6, 2001 03:18 pm, Kurt Garloff wrote:
> On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote:
> > On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > > On September 6, 2001 02:32 pm, Rik van Riel wrote:
> > > > Two words:  "IO clustering".
> > >
> > > Yes, *after* the IO queue is fully loaded that makes sense.  Leaving it
> > > partly or fully idle while waiting for it to load up makes no sense
> > > at all.
> > >
> > > IO clustering will happen naturally after the queue loads up.
> > 
> > Exactly, so we need to give the queue some time to load
> > up, right ?
> 
> Just use two limits:
> * Time: after some time (say two seconds), we can always afford to write it
>   out.
> * Queue filling: after the queue reaches a certain size, it's worth doing a
>   write.

Err, not quite the whole story.  It is *never* right to leave the disk 
sitting idle while there are dirty, writable IO buffers.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 16:45                         ` Daniel Phillips
@ 2001-09-06 16:57                           ` Rik van Riel
  2001-09-06 17:22                             ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 16:57 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:

> Err, not quite the whole story.  It is *never* right to leave the disk
> sitting idle while there are dirty, writable IO buffers.

Define "idle" ?

Is idle the time it takes between two readahead requests
to be issued, delaying the second request because you
just moved the disk arm away ?

Is idle when we haven't had a request for, say, 3 disk
seek time periods ?

Is idle when we won't be getting any request soon for the
area where the disk arm is hanging out ?  (and how do we
know the future?)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 16:57                           ` Rik van Riel
@ 2001-09-06 17:22                             ` Daniel Phillips
  2001-09-06 19:25                               ` Rik van Riel
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 17:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel

On September 6, 2001 06:57 pm, Rik van Riel wrote:
> On Thu, 6 Sep 2001, Daniel Phillips wrote:
> 
> > Err, not quite the whole story.  It is *never* right to leave the disk
> > sitting idle while there are dirty, writable IO buffers.
> 
> Define "idle" ?

Idle = not doing anything.  IO queue is empty.

> Is idle the time it takes between two readahead requests
> to be issued, delaying the second request because you
> just moved the disk arm away ?

Which two readahead requests?  It's idle.

> Is idle when we haven't had a request for, say, 3 disk
> seek time periods ?

See above definition of idle.

> Is idle when we won't be getting any request soon for the
> area where the disk arm is hanging out ?  (and how do we
> know the future?)

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 17:34                         ` Daniel Phillips
@ 2001-09-06 17:32                           ` Alex Bligh - linux-kernel
  0 siblings, 0 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 17:32 UTC (permalink / raw)
  To: Daniel Phillips, Alex Bligh - linux-kernel, Stephan von Krawczynski
  Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel



--On Thursday, September 06, 2001 7:34 PM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

> On September 6, 2001 05:18 pm, Alex Bligh - linux-kernel wrote:
>> --On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski
>> <skraw@ithnet.com> wrote:
>>
>> > (or the default = 1024) gives such ridiculously bad
>> > performance
>>
>> I know. I am trying to ensure we have the problem definitively
>> identified, either from /proc/memareas, or by showing it
>> goes away if you change rsize/wsize. I am NOT proposing
>> it as a fix.
>
> Are rsize/wsize expressed in bytes?  In which case you'd want them to be
> 4096  for this test.

Bytes per request. There is some header wastage, so 4096 is too high,
as the packets will be slightly larger than a page. I suggested 1024
rather than 2048 because 1024 is the original standard & thus everything
supports it.
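
Rough numbers, to illustrate the wastage (header sizes approximate):

  rsize=4096: 4096 bytes of data + NFS/RPC reply header (~100 bytes)
              + UDP/IP headers (28 bytes) comes to just over 4kB, so a
              buffer covering the whole datagram needs an order-1 (8kB)
              allocation.
  rsize=1024: 1024 bytes + headers stays well under 4kB, so a single
              page (order 0) is enough.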


--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:54                   ` M. Edward Borasky
  2001-09-06 14:39                     ` Alan Cox
@ 2001-09-06 17:33                     ` Daniel Phillips
  1 sibling, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 17:33 UTC (permalink / raw)
  To: M. Edward Borasky, linux-kernel

On September 6, 2001 03:54 pm, M. Edward Borasky wrote:
> I'm relatively new to the Linux kernel world and even newer to the list, so
> forgive me if I'm asking a silly question or making a silly comment. It
> seems to me, from what I've seen of this discussion so far, that the only
> way one "tunes" Linux kernels at the moment is by changing code and
> rebuilding the kernel. That is, there are few "tunables" that one can set,
> based on one's circumstances, to optimize kernel performance for a specific
> application or environment.
> 
> Every other operating system that I've done performance tuning on, starting
> with Xerox CP-V in 1974, had such tunables and tools to set them. And quite
> often, some of the tuning parameters can be set "on the fly", simply by
> knowing the correct memory location to set and poking a new value into it.

We typically use proc for this, sometimes combined with an ioctl.  Some of 
these settings are standard in the kernel (bdflush, others) but more often 
you will have to apply a patch.

> No one "memory management scheme", for example, can be all things to all
> tasks, and it seems to me that giving users tools to measure and control the
> behavior of memory management, *preferably without having to recompile and
> reboot*, should be a major priority if Linux is to succeed in a wide variety
> of applications.

Linus doesn't seem to like having tuning knobs appear where a better 
algorithm should be used instead.  Leaving the knobs out makes people work 
harder to come up with solutions that don't need them.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 15:18                       ` Alex Bligh - linux-kernel
@ 2001-09-06 17:34                         ` Daniel Phillips
  2001-09-06 17:32                           ` Alex Bligh - linux-kernel
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 17:34 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Stephan von Krawczynski
  Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel

On September 6, 2001 05:18 pm, Alex Bligh - linux-kernel wrote:
> --On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski 
> <skraw@ithnet.com> wrote:
> 
> > (or the default = 1024) gives such ridiculously bad
> > performance
> 
> I know. I am trying to ensure we have the problem definitively
> identified, either from /proc/memareas, or by showing it
> goes away if you change rsize/wsize. I am NOT proposing
> it as a fix.

Are rsize/wsize expressed in bytes?  In which case you'd want them to be 4096 
for this test.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:18                       ` Kurt Garloff
                                           ` (2 preceding siblings ...)
  2001-09-06 16:45                         ` Daniel Phillips
@ 2001-09-06 17:35                         ` Mike Fedyk
  3 siblings, 0 replies; 79+ messages in thread
From: Mike Fedyk @ 2001-09-06 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kurt Garloff, Rik van Riel, Daniel Phillips, Jan Harkes, Marcelo Tosatti

On Thu, Sep 06, 2001 at 03:18:27PM +0200, Kurt Garloff wrote:
> On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote:
> > On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > > On September 6, 2001 02:32 pm, Rik van Riel wrote:
> > > > Two words:  "IO clustering".
> > >
> > > Yes, *after* the IO queue is fully loaded that makes sense.  Leaving it
> > > partly or fully idle while waiting for it to load up makes no sense at all.
> > >
> > > IO clustering will happen naturally after the queue loads up.
> > 
> > Exactly, so we need to give the queue some time to load
> > up, right ?
> 
> Just use two limits:
> * Time: after some time (say two seconds), we can always afford to write it
>   out.
> * Queue filling: after the queue reaches a certain size, it's worth doing a
>   write.
> 

Correct me if I'm wrong, but aren't these two settings tunable in bdflush?
If not, then how exactly does bdflush interact with this?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:10               ` Stephan von Krawczynski
  2001-09-06 13:23                 ` Alex Bligh - linux-kernel
  2001-09-06 13:42                 ` Stephan von Krawczynski
@ 2001-09-06 17:51                 ` Daniel Phillips
  2001-09-06 21:01                   ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel
  2001-09-07 12:30                 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski
  3 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 17:51 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: riel, jaharkes, marcelo, linux-kernel

On September 6, 2001 03:10 pm, Stephan von Krawczynski wrote:
> > Blindly delaying all the writes in the name of better read performance isn't 
> > the right idea either.  Perhaps we should have a good think about some 
> > sensible mechanism for balancing reads against writes.
> 
> I guess I have the real-world proof for that:
> Yesterday I mastered a CD (around 700 MB) and burned it, then left the
> equipment to get some food and sleep (sometimes needed :-). During this
> time the machine acted as nfs-server and got about 3 GB of data written to
> it. Coming back today I recognised that deleting the CD image made
> yesterday freed up about 500 MB of physical mem (free mem was very low
> before). It was obviously held 24 hours for no reason, and _not_ (as one
> would expect) exchanged against the nfs-data. This means the caches were
> full of _old_ data, which explains why nfs performance has remarkably
> dropped since 2.2: there is too little mem around to get good performance
> (no matter whether read or write). Obviously aging did not work at all;
> there was not a single hit on these (CD image) pages during 24 hours,
> compared to lots on the nfs-data. Even if the nfs-data had only a single
> hit, the old CD image should have been evicted, because it is inactive
> and _older_.

OK, this is not related to what we were discussing (IO latency).  It's not too
hard to fix: we just need to do a little aging whenever there are allocations,
whether or not there is memory_pressure.  I don't think it's a real problem,
though; we have at least two problems we really do need to fix (oom and
high order failures).

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 17:22                             ` Daniel Phillips
@ 2001-09-06 19:25                               ` Rik van Riel
  2001-09-06 19:45                                 ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 19:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:
> On September 6, 2001 06:57 pm, Rik van Riel wrote:
> > On Thu, 6 Sep 2001, Daniel Phillips wrote:
> >
> > > Err, not quite the whole story.  It is *never* right to leave the disk
> > > sitting idle while there are dirty, writable IO buffers.
> >
> > Define "idle" ?
>
> Idle = not doing anything.  IO queue is empty.
>
> > Is idle the time it takes between two readahead requests
> > to be issued, delaying the second request because you
> > just moved the disk arm away ?
>
> Which two readahead requests?  It's idle.

OK, in this case I disagree with you ;)

Disk seek time takes ages, as much as 10 milliseconds.

I really don't think it's good to move the disk arm away
from the data we are reading just to write out this one
disk block.

Going 20 milliseconds out of our way to write out a single
block really can't be worth it in any scenario I can imagine.

OTOH, flushing out 64 or 128 kB at once (or some fraction of
the inactive list, say 5%?) almost certainly is worth it in
many cases.
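
Some rough numbers to illustrate (assuming ~10 ms per seek and ~10 MB/s
sequential transfer, plausible figures for current disks):

  one 4 kB block:   ~20 ms of seeks + ~0.4 ms transfer  =>  ~200 kB/s
  one 128 kB batch: ~20 ms of seeks + ~13 ms transfer   =>  ~4 MB/s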

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 19:25                               ` Rik van Riel
@ 2001-09-06 19:45                                 ` Daniel Phillips
  2001-09-06 19:52                                   ` Rik van Riel
  2001-09-06 19:53                                   ` Mike Fedyk
  0 siblings, 2 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-06 19:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel

On September 6, 2001 09:25 pm, Rik van Riel wrote:
> On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > On September 6, 2001 06:57 pm, Rik van Riel wrote:
> > > On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > >
> > > > Err, not quite the whole story.  It is *never* right to leave the disk
> > > > sitting idle while there are dirty, writable IO buffers.
> > >
> > > Define "idle" ?
> >
> > Idle = not doing anything.  IO queue is empty.
> >
> > > Is idle the time it takes between two readahead requests
> > > to be issued, delaying the second request because you
> > > just moved the disk arm away ?
> >
> > Which two readahead requests?  It's idle.
> 
> OK, in this case I disagree with you ;)
> 
> Disk seek time takes ages, as much as 10 milliseconds.
> 
> I really don't think it's good to move the disk arm away
> from the data we are reading just to write out this one
> disk block.
> 
> Going 20 milliseconds out of our way to write out a single
> block really can't be worth it in any scenario I can imagine.
> 
> OTOH, flushing out 64 or 128 kB at once (or some fraction of
> the inactive list, say 5%?) almost certainly is worth it in
> many cases.

Again, I have to ask, which reads are you interfering with?  Ones that 
haven't happened yet?  Remember, the disk is idle.  So *at worst* you are 
going to get one extra seek before getting hit with the tidal wave of reads 
you seem to be worried about.  This simply isn't significant.

I've tested this; I know early writeout under light load is a win.

What we should be worrying about is how to balance reads against writes under 
heavy load.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 19:45                                 ` Daniel Phillips
@ 2001-09-06 19:52                                   ` Rik van Riel
  2001-09-07  0:32                                     ` Kurt Garloff
  2001-09-06 19:53                                   ` Mike Fedyk
  1 sibling, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-09-06 19:52 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:

> Again, I have to ask, which reads are you interfering with?  Ones that
> haven't happened yet?  Remember, the disk is idle.  So *at worst* you are
> going to get one extra seek before getting hit with the tidal wave of reads
> you seem to be worried about.  This simply isn't significant.
>
> I've tested this; I know early writeout under light load is a win.

Other people have tested this too, and light writeout of
small blocks destroys the performance of a heavy read
load.

> What we should be worrying about is how to balance reads against
> writes under heavy load.

Exactly. We need to make sure we're efficient when the
system is under heavy read load and light write load.
This kind of load is very common in servers, especially
web, ftp or news servers.

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 19:45                                 ` Daniel Phillips
  2001-09-06 19:52                                   ` Rik van Riel
@ 2001-09-06 19:53                                   ` Mike Fedyk
  1 sibling, 0 replies; 79+ messages in thread
From: Mike Fedyk @ 2001-09-06 19:53 UTC (permalink / raw)
  To: linux-kernel

On Thu, Sep 06, 2001 at 09:45:35PM +0200, Daniel Phillips wrote:
> What we should be worrying about is how to balance reads against writes under 
> heavy load.
> 

Yes, I agree.  You can have a process at niceness level 19 that does very
little processing, but whose heavy disk access brings your system down to a
crawl.

Improvement in this area would be nice.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-06 17:51                 ` Daniel Phillips
@ 2001-09-06 21:01                   ` Alex Bligh - linux-kernel
  2001-09-07  6:35                     ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-06 21:01 UTC (permalink / raw)
  To: Daniel Phillips, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel

I thought I'd try coding this, then I thought better of it and so am asking
people's opinions first. The following describes a mechanism to change the
zone/buddy allocation system to minimize fragmentation before it happens,
and then defragment post-facto.

Background, & Statement of problem
==================================

High order [1] memory allocations tend to fail when memory is fragmented.
Memory becomes fragmented through normal system usage, without memory
pressure. When memory is fragmented, it stays fragmented.

While non-atomic [2] high order allocations can wait until progress is made
freeing pages, the algorithm 'free pages without reference to their location
until sufficient adjacent pages have by chance been freed for coalescence'
is inefficient compared to a defragmentation routine, or an attempt to free
specific adjacent pages which may coalesce.

The problem is worse for atomic [2] requests, which can neither defragment
memory (due to I/O and locking restrictions), nor can they make progress
via (for instance) page_launder().

Therefore, in a fragmented memory environment, it has been observed that
high order requests, particularly atomic ones [3], fail frequently.

Common sources of atomic high order requests include allocations from the
network layer where packets exceed 4k in size (for instance NFS packets
with rsize,wsize>2048, fragmentation and reassembly), and the SCSI layer.
Whilst it is undeniable that some drivers would benefit from using
technologies like scatter lists to avoid the necessity of contiguous
physical memory allocation, large swathes of current code assume the
opposite, and some of it is hard to change. [4]

As many of these allocations occur in bottom-half or interrupt routines,
it is more difficult to handle a failure gracefully than in other code.
This tends to lead to performance problems [5], or worse (hard errors),
which should be minimized.


Causes of fragmentation
=======================

Linux adopts a largely requestor-anonymous form of page allocation. Memory
is divided into 3 zones, and page requesters can specify a list of suitable
zones from which pages may be allocated, but beyond that, pages are
allocated in a manner which does not distinguish between users of given
pages.

Thus pages allocated for packets in flight are likely to be intermingled
with buffer pages, cache pages, code pages and data pages. Each of these
different types of allocation has a different persistence over time. Some
(for instance pages on the InactiveDirty list in an idle system) will
persist indefinitely.

The buddy allocator will attempt (by looking at lowest order lists first)
to allocate pages from fragmented areas first. Assuming pages are freed at
random, this would act as a defragmentation process. However, if a system
is taken to high utilization and back again to idle, the dispersion of
persistent pages (for instance InactiveDirty pages) becomes great, and the
buddy allocator performs poorly at coalescing blocks.

The situation is worsened by the understandable desire for simplicity in
the VM system, which measures solely the number of pages free in different
zones, as opposed to their respective locations. It is possible (and has
been observed) to have a system in a state with hardly any high order
buddies on free area lists (thus where it would be impossible to make many
atomic high order allocations), but copious easily freeable RAM. This is in
essence because no attempt is made to balance the different order
free-lists, and a shortage of entries on high-order free lists does not in
itself cause memory pressure.

It is probably undesirable for the normal VM system to react to
fragmentation in the same way it does to normal memory pressure. This would
result in an unselective paging out / discarding of data, whereas an
approach which selected the pages to free that would be most likely to cause
coalescence would be more useful. Further, it would be possible, by moving
the data in physical pages, to relocate many types of page with no loss of
in-memory data at all.


Approaches to solution
======================

It has been suggested that post-facto defragmentation is a useful
technique. This is undoubtedly true, but the defragmentation needs to run
before it is 'needed' - i.e. we need to ensure that memory is never
sufficiently fragmented that a reasonable size burst of high order atomic
allocations can fail. This could be achieved by running some background
defragmentation task against some measurable fragmentation target. Here
fragmentation pressure would be an orthogonal measure to memory pressure.
Non-atomic high order allocations which are failing should allow the
defragmenter to run, rather than call page_launder().

Defragmentation routines appear to be simple at first. Simply run through
the free lists of particular zones, examining whether the pages adjoining
the free areas (their would-be buddies) can be freed or moved. However,
using this approach alone has some drawbacks. Firstly, it is not immediately
obvious that by moving pages you are making the situation any better,
because it is not clear that the (new) destination page will be allocated
somewhere less awkward. Secondly, whilst many types of page can be allocated
and moved with minimal effort (for instance pages on the Active or Inactive
lists), it is less obvious how to move buffer and cache pages transparently
(given only a pointer to the page struct to start with, it is hard to
determine where they are used and referred to), and it is far from
obvious how to move arbitrary pages allocated by the kernel for disparate
purposes (including pages allocated by the slab allocator).

However, this is not the only possibility to minimize fragmentation.

Part of the problem is the fact that pages are allocated by location
without reference to the caller. If (for instance) buffer pages tended to
be allocated next to each other, cache pages tended to be allocated next to
each other, and pages allocated by the network stack tended to be allocated
next to each other, then a number of benefits would accrue:

Firstly, defragmentation would be more successful. Defragmentation would
tend to focus on pages allocated away from their natural brethren, and the
newly allocated pages into which their data would be moved would tend to be
next to those brethren. This would help ensure that the new page was indeed
a better location than the old page. Also, as pages of similar ease or
difficulty of moving would be clumped together, the effect of a large number
of difficult to move pages would be reduced by their mutual proximity.

Secondly, defragmentation would be less necessary. Pages allocated by
different functions have different natural persistence. For instance, pages
allocated within the networking stack typically have short persistence, due
to the transitory nature of the packets they represent. Therefore, in areas
of memory preferred by low persistence users, the natural defragmentation
effect of the buddy allocator would be greater.

Therefore it is suggested that different allocators have affinities for
different areas of memory. One mechanism of achieving this effect would be
an extension to the zone system.

Currently, there are three zones (DMA, Normal and High memory). Imagine
instead, there were many more zones, and the above three labels became
'zone types'. There would thus be many DMA zones, many normal zones, and
many high memory zones. These zones would be at least the highest order
allocation in size - currently 2Mb on i386, but this could be reduced
slightly with minimal disruption. In this manner, the efficiency of the
buddy allocator is not reduced, as the buddy allocator has no visibility of
coalescence etc. above this level anyway.

Balancing would occur across the aggregate of zone types (i.e. across all
DMA zones in aggregate, across all High memory zones in aggregate, etc.)
as opposed to by individual zones.

Each zone type would have an associated hash table, the entries being zones
of that type. A routine requesting an allocation would pass information to
__alloc_pages which identified it - it may well be that the GFP flags, the
order, and perhaps some ID for the subsystem is sufficient. This would act
as the key to the hash table concerned. When allocating a page, all zones
in the hash table with the appropriate key (i.e. a matching allocator) are
first tried, in order. If no page is found, then an empty zone (special
key) is found, which is then labelled, and used as, a zone of the type
required. If no empty zone of that zone type is available, then other
suitable zone types are tried (using the list of appropriate zone types).
If no page is found, then, starting with the first zone type again, the
first page in ANY zone within that zone type's hash table is utilized, and
so on through the other suitable zone types.
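
To fix ideas, here is a rough sketch of that lookup path. None of these
names exist in the kernel: the key is assumed to be derived from the GFP
flags, the order and a subsystem id; key 0 marks an unkeyed (empty) zone;
rmqueue() is the buddy dequeue already in mm/page_alloc.c. Locking and the
hash-chain surgery for relabelling a zone are elided:

/* Hypothetical sketch only - not a patch. */
struct keyed_zone {
	struct keyed_zone *next;	/* zones sharing this hash bucket */
	unsigned long key;		/* 0 == unkeyed (empty) zone */
	zone_t *zone;			/* underlying buddy zone */
};

#define KZ_HASH_SIZE 64
static struct keyed_zone *kz_hash[KZ_HASH_SIZE];

static struct page *alloc_from_keyed_zones(unsigned long key,
					   unsigned long order)
{
	struct keyed_zone *kz;
	struct page *page;
	int i;

	/* 1: zones already labelled with a matching key */
	for (kz = kz_hash[key % KZ_HASH_SIZE]; kz; kz = kz->next)
		if (kz->key == key && (page = rmqueue(kz->zone, order)))
			return page;

	/* 2: claim an unkeyed zone and label it with this key */
	for (kz = kz_hash[0]; kz; kz = kz->next)
		if (!kz->key) {
			kz->key = key;	/* move to key's chain, elided */
			return rmqueue(kz->zone, order);
		}

	/* 3: last resort - take the first page found in ANY zone */
	for (i = 0; i < KZ_HASH_SIZE; i++)
		for (kz = kz_hash[i]; kz; kz = kz->next)
			if ((page = rmqueue(kz->zone, order)))
				return page;
	return NULL;
}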

In this manner, pages are likely to be clustered in zones by allocator. The
role of the defragmenter becomes firstly to target pages which have an
inappropriate key for the zone concerned, and secondly to target pages in
sparsely allocated zones, so the zone becomes unkeyed, and free for
rekeying later. As statistics could easily be kept per zone on the number
of appropriately and inappropriately keyed pages which had been allocated
within that zone, scanning (and hence finding suitable targets) would
become considerably easier. Equally, maintenance of these statistics can
determine when the defragmenter should be run as a background process.

Some further changes will be necessary; for instance direct_reclaim should
not occur when the page to be reclaimed would be inappropriately keyed for
the zone; in practice this means using direct reclaim only to reclaim pages
for purposes where the allocated page might itself reach the InactiveDirty
list AND where the page reclaimed is correctly keyed.

Furthermore, the number of unkeyed (i.e. empty) zones will need to have a
particular low water mark target, below which memory pressure must
somehow be induced, in order to force buffer flushing or paging.

This effectively relegates the buddy system to allocating pages for
particular purposes within small chunks of memory - there is a parallel
purpose here with a sort of extended slab system. The zone system would
then become a low overhead manager of larger areas - a sort of 'super slab'.

Thoughts?

Notes
=====

[1] Higher order meaning greater than order 0

[2] By atomic I mean without __GFP_WAIT set, which
    are in the main GFP_ATOMIC allocations.

[3] The lack of any detail at all on non-atomic requests
    suggests that this is either a non-problem, or they
    are little used in the kernel - possibly wrongly so.

[4] For instance, the network code assumes that packets
    (pre-fragmentation, or post-reassembly), are contiguous
    in memory.

[5] For instance, packet drops, which whilst recoverable,
    impede performance.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 19:52                                   ` Rik van Riel
@ 2001-09-07  0:32                                     ` Kurt Garloff
  0 siblings, 0 replies; 79+ messages in thread
From: Kurt Garloff @ 2001-09-07  0:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, Sep 06, 2001 at 04:52:05PM -0300, Rik van Riel wrote:
> On Thu, 6 Sep 2001, Daniel Phillips wrote:
> > Again, I have to ask, which reads are you interfering with?  Ones that
> > haven't happened yet?  Remember, the disk is idle.  So *at worst* you are
> > going to get one extra seek before getting hit with the tidal wave of reads
> > you seem to be worried about.  This simply isn't significant.
> >
> > I've tested this, I know early writeout under light load is a win.
> 
> Other people have tested this too, and light writeout of
> small blocks destroys the performance of a heavy read
> load.

Then just don't take two hard limits, but make a simple mathematical function
of time and blocks to write (monotonic and with positive slope in both) and
start to write all blocks once we exceed a certain limit.
So if you produce very few dirty inactive pages, it'll only happen every
thirty seconds or so; at moderate loads it may happen every 4 seconds, and
at higher loads it may even happen a couple of times per second.
Think of a function like t + t*b + b, with appropriate scaling, so we reach
the threshold either after a long time alone, because of many dirty inactive
pages alone, or because of a combination of both. Tuning should be such that
under normal workloads the t*b (time times pages) term is the most
significant one.

(The chance that you run into memory pressure because of too many dirty
pages this way is lower than before, but if it happens, you can adjust your
function or the threshold to flush more pages.)

If you are very concerned about read performance suffering from this, you
may even monitor reads and adjust the threshold according to read load.
(Or just make your function include this variable with a negative slope.)
I believe it won't be necessary though.
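
For concreteness, a sketch of what such a trigger could look like - all
names and constants here are invented for illustration, with t measured in
seconds since the last flush and b the count of dirty inactive pages:

/* Hypothetical flush trigger along the lines sketched above.  The
 * time weight of 100 is made up so that an otherwise idle system
 * still flushes roughly every 30 seconds.
 */
#define FLUSH_THRESHOLD	3000

static unsigned long last_flush;	/* jiffies at the last writeout */

static int should_flush(unsigned long nr_dirty_inactive)
{
	unsigned long t = (jiffies - last_flush) / HZ;	/* seconds */
	unsigned long b = nr_dirty_inactive;

	/* monotonic, positive slope in both t and b; under normal
	 * load the t*b term dominates */
	return 100 * t + t * b + b >= FLUSH_THRESHOLD;
}

With these made-up weights, an idle system flushes about every 30 seconds,
~700 dirty pages trip the threshold after ~3 seconds, and a few thousand
dirty pages trip it almost immediately.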

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-06 21:01                   ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel
@ 2001-09-07  6:35                     ` Daniel Phillips
  2001-09-07  8:58                       ` Alex Bligh - linux-kernel
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel Phillips @ 2001-09-07  6:35 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel

On September 6, 2001 11:01 pm, Alex Bligh - linux-kernel wrote:
> I thought I'd try coding this, then I thought better of it and so am asking
> people's opinions first. The following describes a mechanism to change the
> zone/buddy allocation system to minimize fragmentation before it happens,
> and then defragment post-facto.

Nice exposition and analysis, but see my wet-blanket comments below...

> [...]
>
> Causes of fragmentation
> =======================
> 
> Linux adopts a largely requestor-anonymous form of page allocation. Memory
> is divided into 3 zones, and page requesters can specify a list of suitable
> zones from which pages may be allocated, but beyond that, pages are
> allocated in a manner which does not distinguish between users of given
> pages.

It's a conscious goal to try to unify all sources of memory.  The three
zones that are there now are only there because they absolutely have to be.

> Thus pages allocated for packets in flight are likely to be intermingled
> with buffer pages, cache pages, code pages and data pages. Each of these
> different types of allocation has a different persistence over time. Some
> (for instance pages on the InactiveDirty list in an idle system) will
> persist indefinitely.
> 
> The buddy allocator will attempt (by looking at lowest order lists first)
> to allocate pages from fragmented areas first. Assuming pages are freed at
> random, this would act as a defragmentation process. However, if a system
> is taken to high utilization and back again to idle, the dispersion of
> persistent pages (for instance InactiveDirty pages) becomes great, and the
> buddy allocator performs poorly at coalescing blocks.

It becomes effectively useless.  The probability of all 8 pages of a given
8 page unit being free when only 1% of memory is free is (1/100)**8 =
1/(10**16).

> The situation is worsened by the understandable desire for simplicity in
> the VM system, which measures solely the number of pages free in different
> zones, as opposed to their respective locations. It is possible (and has
> been observed) to have a system in a state with hardly any high order
> buddies on free area lists (thus where it would be impossible to make many
> atomic high order allocations), but copious easily freeable RAM. This is
> in essence because no attempt is made to balance the different order
> free-lists, and a shortage of entries on high-order free lists does not
> in itself cause
> memory pressure.
> 
> It is probably undesirable for the normal VM system to react to
> fragmentation in the same way it does to normal memory pressure. This would
> result in an unselective paging out / discarding of data, whereas an
> approach which selected the pages to free that would be most likely to
> cause coalescence would be more useful. Further, it would be possible, by
> moving the data in physical pages, to relocate many types of page with no
> loss of in-memory data at all.

Moving pages sounds scary.  We already know how to evict pages, but moving
pages is a whole new mechanism.  We probably would not care about the "good"
data lost through eviction, as opposed to moving, since the fraction of
pages we'd have to evict to do the required defragmentation is tiny.

> Approaches to solution
> ======================

I'm going to confess that I don't understand your solution in detail yet,
however, I can see this complaint coming: the changes are too intrusive on
the existing kernel, and if that's what we had to do it would probably be
easier to just eliminate all high order allocations from the kernel.  I
already have heard some sentiment that the 0 order allocation failure
problems do not have to be solved, that they are really the fault of those
coders that used the feature in the first place.  I don't know about that,
I'd like to hear from the maintainers.  But I'm pretty sure that whatever
solution we come up with, it has to be very simple in implementation, and
have roughly zero impact on the rest of the kernel.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-07  6:35                     ` Daniel Phillips
@ 2001-09-07  8:58                       ` Alex Bligh - linux-kernel
  2001-09-07  9:15                         ` Alex Bligh - linux-kernel
  2001-09-07 21:56                         ` Daniel Phillips
  0 siblings, 2 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-07  8:58 UTC (permalink / raw)
  To: Daniel Phillips, Alex Bligh - linux-kernel, riel, linux-kernel
  Cc: Alex Bligh - linux-kernel

Daniel,

Some comments in line - if you are modelling this, it's vital that you
understand the first one!

>> The buddy allocator will attempt (by looking at lowest order lists first)
>> to allocate pages from fragmented areas first. Assuming pages are freed
>> at random, this would act as a defragmentation process. However, if a
>> system is taken to high utilization and back again to idle, the
>> dispersion of persistent pages (for instance InactiveDirty pages)
>> becomes great, and the buddy allocator performs poorly at coalescing
>> blocks.
>
> It becomes effectively useless.  The probability of all 8 pages of a given
> 8 page unit being free when only 1% of memory is free is (1/100)**8 =
> 1/(10**16).

I thought that, then I tested & measured, and it simply isn't true.
Your mathematical model is wrong.

The reason is that pages are freed at random, but they are not
allocated at random. The buddy allocator allocates pages whose
buddy is allocated (lower order) preferentially to splitting a high
order block. Sorry to sound like a broken record, but apply the
/proc/memareas patch and you can see this happening. After extensive
activity, you see practically none of the free pages in order 0
blocks. You might see only a small number (20 or 30 on a 64k
machine) of (say) order 3 blocks, but if you run your stats
you would have an expected value of well less than one, and the
chance of having 20 or 30 would be vanishingly small. Local
aggregation is actually quite effective, provided that the
density of persistent pages is not too great. However, it
gets considerably less effective as the order increases.

> Moving pages sounds scary.  We already know how to evict pages, but moving
> pages is a whole new mechanism.  We probably would not care about the
> "good" data lost through eviction, as opposed to moving, since the
> fraction of pages we'd have to evict to do the required defragmentation
> is tiny.

The sort of moving I was talking about was a diskless page-out / page-in,
i.e. one which doesn't require a swap file, or I/O, and is thus much quicker.
Whilst the page would be physically moved, its virtual address would
stay the same. Though this sounds like a completely new system, I think
there's a high probability of it just being a special case of
the page out routine.
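
For what it's worth, a conceptual sketch of such a move - every helper here
except alloc_page()/copy_page()/__free_page() is hand-waved, and the real
thing would need the pte walks, page locking and TLB flushing that the
page-out path already does:

/* Conceptual only: relocate the data in 'old' into a page allocated
 * somewhere less awkward, keeping virtual addresses intact.
 * unmap_page()/remap_page() are hypothetical stand-ins for the
 * swap-out style pte manipulation.
 */
static int move_page(struct page *old, int gfp_mask)
{
	struct page *new = alloc_page(gfp_mask);

	if (!new)
		return -ENOMEM;
	unmap_page(old);		/* clear the ptes, as for page-out */
	copy_page(page_address(new), page_address(old));
	remap_page(old, new);		/* point the ptes at the new page */
	__free_page(old);
	return 0;
}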

> I'm going to confess that I don't understand your solution in detail yet,
> however, I can see this complaint coming: the changes are too intrusive on
> the existing kernel,

A valid criticism. But difficult to see how defragmentation that actually
takes account of the contents of memory (rather than 'blind' freeing)
could be less intrusive - though I'm open to ideas.

> and if that's what we had to do it would probably be
> easier to just eliminate all high order allocations from the kernel.  I
> already have heard some sentiment that the >0 order allocation failure
> problems do not have to be solved, that they are really the fault of those
> coders that used the feature in the first place.

I'd be especially interested to know how we'd solve this for the
network stuff, which currently relies on physically contiguous packets
in memory. This is a *HUGE* change I think (larger than any we'd
make to the VM system).

> But I'm pretty sure that whatever
> solution we come up with, it has to be very simple in implementation, and
> have roughly zero impact on the rest of the kernel.

This would of course be ideal.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-07  8:58                       ` Alex Bligh - linux-kernel
@ 2001-09-07  9:15                         ` Alex Bligh - linux-kernel
  2001-09-07  9:28                           ` Alex Bligh - linux-kernel
  2001-09-07 21:38                           ` Daniel Phillips
  2001-09-07 21:56                         ` Daniel Phillips
  1 sibling, 2 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-07  9:15 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Daniel Phillips, riel, linux-kernel
  Cc: Alex Bligh - linux-kernel


>> It becomes effectively useless.  The probability of all 8 pages of a given
>> 8 page unit being free when only 1% of memory is free is (1/100)**8 =
>> 1/(10**16).

> Sorry to sound like a broken record, but apply the
> /proc/memareas patch and you can see this happening. After extensive
> activity, you see practically none of the free pages in order 0
> blocks. You might see only a small number (20 or 30 on a 64k
> machine) of (say) order 3 blocks, but if you run your stats
> you would have an expected value of well less than one, and the
> chance of having 20 or 30 would be vanishingly small.

Ooops, what I wrote was factually correct, but misleading.
What I meant is that it looks like this:

   Zone     4kB     8kB    16kB    32kB    64kB   128kB   256kB   512kB  1024kB  2048kB Tot Pages/kb
    DMA     495     348     196      72      10       1       1       0       0       0 =     2807)
  @frag      0%     18%     42%     70%     91%     97%     98%    100%    100%    100% =    11228kB
 Normal       0    1579    1670     667     140      12       3       1       0       0 =    18118)
  @frag      0%      0%     17%     54%     84%     96%     98%     99%    100%    100% =    72472kB

If your model was correct, you would see free pages
per order run like
  N = a (K ^ (2^-o)); (for a>0, K>1, o=order)

This doesn't happen. Instead you get GOOD coalescence
at order 0 (in the Normal zone they've ALL been coalesced),
and not bad at order 1 (see how many order 2's we have).

An 8 page unit is order 3 (32k). This system has 20% of
memory free at the point where I took the snapshot.
The probability would be (1/5)^8 = 2^8 / 10^8, or
roughly p = 2.5 x 10^-6. In a system with 32000 pages
(128MB), if you were right, I'd expect to see about
0.08 free blocks at order 3. But here I see about 750.

The chance of seeing more than 500 events of probability
p = 2.5 x 10^-6 across 32000 samples is vanishingly
small. Yet it looks this way all the time.

Hence I conclude your model is wrong :-)

--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-07  9:15                         ` Alex Bligh - linux-kernel
@ 2001-09-07  9:28                           ` Alex Bligh - linux-kernel
  2001-09-07 21:38                           ` Daniel Phillips
  1 sibling, 0 replies; 79+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-07  9:28 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Daniel Phillips, riel, linux-kernel
  Cc: Alex Bligh - linux-kernel

Blush

>   N = a (K ^ (2^-o)); (for a>0, K>1, o=order)

    N = a (K ^ -(2^o)); (for a>0, K>1, o=order)


--
Alex Bligh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-09-06 13:10               ` Stephan von Krawczynski
                                   ` (2 preceding siblings ...)
  2001-09-06 17:51                 ` Daniel Phillips
@ 2001-09-07 12:30                 ` Stephan von Krawczynski
  3 siblings, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2001-09-07 12:30 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: riel, jaharkes, marcelo, linux-kernel

On Thu, 6 Sep 2001 19:51:26 +0200 Daniel Phillips <phillips@bonn-fries.net> wrote:

> On September 6, 2001 03:10 pm, Stephan von Krawczynski wrote:
> > [...]
> > to lots on the nfs-data. Even if the nfs-data would only have one single hit,
> > the old CD image should have been removed, because it is inactive and _older_.
> 
> OK, this is not related to what we were discussing (IO latency).  It's not too
> hard to fix, we just need to do a little aging whenever there are allocations,
> whether or not there is memory_pressure.  I don't think it's a real problem
> though, we have at least two problems we really do need to fix (oom and
> high order failures).

Hm, I am not quite sure about that. Can you _show_ me how to fix this?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-07  9:15                         ` Alex Bligh - linux-kernel
  2001-09-07  9:28                           ` Alex Bligh - linux-kernel
@ 2001-09-07 21:38                           ` Daniel Phillips
  1 sibling, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-07 21:38 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel

On September 7, 2001 11:15 am, Alex Bligh - linux-kernel wrote:
> >> It becomes effectively useless.  The probability of all 8 pages of a given
> The chance of seeing more than 500 events of probability
> p = 2.5 ^ (10^-6) across 32000 samples, is vanishingly
> small. Yet it looks this way all the time.
> 
> Hence I conclude your model is wrong :-)

True.  OK, need to make a better model, time to crack my Knuth.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG]
  2001-09-07  8:58                       ` Alex Bligh - linux-kernel
  2001-09-07  9:15                         ` Alex Bligh - linux-kernel
@ 2001-09-07 21:56                         ` Daniel Phillips
  1 sibling, 0 replies; 79+ messages in thread
From: Daniel Phillips @ 2001-09-07 21:56 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel

On September 7, 2001 10:58 am, Alex Bligh - linux-kernel wrote:
> Some comments in line - if you are modelling this, it's vital that you
> understand the first one!
> 
> >> The buddy allocator will attempt (by looking at lowest order lists first)
> >> to allocate pages from fragmented areas first. Assuming pages are freed
> >> at random, this would act as a defragmentation process. However, if a
> >> system is taken to high utilization and back again to idle, the
> >> dispersion of persistent pages (for instance InactiveDirty pages)
> >> becomes great, and the buddy allocator performs poorly at coalescing
> >> blocks.
> >
> > It becomes effectively useless.  The probability of all 8 pages of a given
> > 8 page unit being free when only 1% of memory is free is (1/100)**8 =
> > 1/(10**16).
> 
> I thought that, then I tested & measured, and it simply isn't true.
> Your mathematical model is wrong.

Yes, a simple thought experiment shows this.  Suppose we start with an
initial state of every second 0 order page allocated.  Now, the next 0 order
free must coalesce into a 1 order unit, but the next allocation will come
from a half-allocated unit.  If we continue randomly in this way, allocating
one page and freeing one, we will eventually arrive at a state where the
free half of the pages sits in 1 order units and the other half of memory
is fully allocated.
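
To make that concrete, here is a toy user space model of the effect (my own
sketch, not kernel code): a bitmap of N pages, random frees, and order 0
allocations that prefer a page whose buddy is already taken, which is
effectively what splitting the lowest order lists first does:

/* Toy model: N pages, half allocated.  Frees pick a random used
 * page; allocations prefer a free page whose buddy is used,
 * falling back to breaking up a free pair - mimicking the buddy
 * allocator's lowest-order-first behaviour.
 */
#include <stdio.h>
#include <stdlib.h>

#define N 32768

static char used[N];

static int alloc_one(void)
{
	int i, split = -1;

	for (i = 0; i < N; i++) {
		if (used[i])
			continue;
		if (used[i ^ 1])
			return i;	/* buddy taken: no pair broken */
		if (split < 0)
			split = i;	/* would break up a free pair */
	}
	return split;
}

int main(void)
{
	int i, pairs = 0;

	for (i = 0; i < N; i += 2)	/* every second page allocated */
		used[i] = 1;

	for (i = 0; i < 10 * N; i++) {	/* free one, allocate one */
		int p;

		do
			p = rand() % N;
		while (!used[p]);
		used[p] = 0;
		used[alloc_one()] = 1;
	}

	for (i = 0; i < N; i += 2)
		if (!used[i] && !used[i + 1])
			pairs++;
	printf("%d of %d free pages sit in free 1 order units\n",
	       2 * pairs, N / 2);
	return 0;
}

Running this, nearly all of the free half ends up paired into 1 order
units - the clumping described above, not the uniform randomness my first
model assumed.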

So, the fragmentation is far from uniformly random.  This is going to require 
deeper analysis.  IMO, it's worth putting in the effort to get a handle on 
this.

--
Daniel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* page_launder() on 2.4.9/10 issue
@ 2001-09-27 23:14 Samium Gromoff
  0 siblings, 0 replies; 79+ messages in thread
From: Samium Gromoff @ 2001-09-27 23:14 UTC (permalink / raw)
  To: lkml; +Cc: Linus

  Linus wrote:
> Think about it - do you really want the system to actively try to reach
> the point where it has no "regular" pages left, and has to start writing
> stuff out (and wait for them synchronously) in order to free up memory? I
   I'm 100% in agreement with you here: I have been hit by this issue 
 a lot of times... It is absolutely reproducible in the streaming I/O case.
   I think the lower the number of processes simultaneously accessing data, the
 harder this hits us... (I can't explain it, but that is how it feels.)
> strongly feel that the old code was really really wrong - it may have been

Sorry if I'm just noise here...

cheers,
 Sam


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-08-29 13:49       ` Linus Torvalds
@ 2001-08-29 14:38         ` Rik van Riel
  0 siblings, 0 replies; 79+ messages in thread
From: Rik van Riel @ 2001-08-29 14:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

On Wed, 29 Aug 2001, Linus Torvalds wrote:

> Rik, look again: kswapd _does_ wait on IO these days.

Indeed, I missed the magic in sync_page_buffers().

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-08-29 13:48     ` Rik van Riel
@ 2001-08-29 13:49       ` Linus Torvalds
  2001-08-29 14:38         ` Rik van Riel
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2001-08-29 13:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andi Kleen, linux-kernel


On Wed, 29 Aug 2001, Rik van Riel wrote:
> On 28 Aug 2001, Andi Kleen wrote:
>
> > Regarding kswapd in 2.4.9:
> >
> > At least something seems to be broken in it. I did run some 900MB processes
> > on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU
> > time.
>
> Well yes, if you never wait on IO synchronously kswapd turns
> into one big busy-loop. But we knew that, it was even written
> down in the comments in vmscan.c ;)

Rik, look again: kswapd _does_ wait on IO these days.

Not ever waiting for IO is just a sure way to overload the IO subsystem
and cause horrible interactive behaviour.

		Linus


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 19:14   ` Andi Kleen
@ 2001-08-29 13:48     ` Rik van Riel
  2001-08-29 13:49       ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2001-08-29 13:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linus Torvalds, linux-kernel

On 28 Aug 2001, Andi Kleen wrote:

> Regarding kswapd in 2.4.9:
>
> At least something seems to be broken in it. I did run some 900MB processes
> on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU
> time.

Well yes, if you never wait on IO synchronously kswapd turns
into one big busy-loop. But we knew that, it was even written
down in the comments in vmscan.c ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 20:01   ` David S. Miller
  2001-08-28 20:49     ` Linus Torvalds
@ 2001-08-28 20:56     ` David S. Miller
  1 sibling, 0 replies; 79+ messages in thread
From: David S. Miller @ 2001-08-28 20:56 UTC (permalink / raw)
  To: torvalds; +Cc: ak, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Tue, 28 Aug 2001 13:49:40 -0700 (PDT)
   
   There might be an argument for making kswapd less eager, and more of a
   background thing.
   
   Regardless of where it actually spends the CPU time.

Right, but this is not an argument against fixing __get_swap_page's
algorithms to be more reasonable :-)

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
  2001-08-28 20:01   ` David S. Miller
@ 2001-08-28 20:49     ` Linus Torvalds
  2001-08-28 20:56     ` David S. Miller
  1 sibling, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2001-08-28 20:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, linux-kernel


On Tue, 28 Aug 2001, David S. Miller wrote:
>
>    At least something seems to be broken in it. I did run some 900MB processes
>    on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU
>    time.
>
> That's all swapmap lookup stupidity, you'll see __get_swap_page()
> near the top of your profiles.  The algorithm is just sucky.

Well, in all fairness the kswapd changes _do_ make kswapd more eager to
keep running too (ie kswapd tends to keep running until there is no
shortage any more - which it traditionally hasn't really done).

There might be an argument for making kswapd less eager, and more of a
background thing.

Regardless of where it actually spends the CPU time.

		Linus


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
       [not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel>
  2001-08-28 19:14   ` Andi Kleen
@ 2001-08-28 20:01   ` David S. Miller
  2001-08-28 20:49     ` Linus Torvalds
  2001-08-28 20:56     ` David S. Miller
  1 sibling, 2 replies; 79+ messages in thread
From: David S. Miller @ 2001-08-28 20:01 UTC (permalink / raw)
  To: ak; +Cc: torvalds, linux-kernel

   From: Andi Kleen <ak@suse.de>
   Date: 28 Aug 2001 21:14:15 +0200
   
   At least something seems to be broken in it. I did run some 900MB processes
   on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU
   time.

That's all swapmap lookup stupidity, you'll see __get_swap_page()
near the top of your profiles.  The algorithm is just sucky.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: page_launder() on 2.4.9/10 issue
       [not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel>
@ 2001-08-28 19:14   ` Andi Kleen
  2001-08-29 13:48     ` Rik van Riel
  2001-08-28 20:01   ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2001-08-28 19:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

Regarding kswapd in 2.4.9:

At least something seems to be broken in it. I did run some 900MB processes
on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU
time.

-Andi


^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2001-09-07 21:49 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-28  3:36 page_launder() on 2.4.9/10 issue Marcelo Tosatti
2001-08-28 18:07 ` Daniel Phillips
2001-08-28 18:17   ` Linus Torvalds
2001-08-30  1:36     ` Daniel Phillips
2001-09-03 14:57     ` Marcelo Tosatti
2001-09-04 15:26       ` Jan Harkes
2001-09-04 15:24         ` Marcelo Tosatti
2001-09-04 17:14           ` Jan Harkes
2001-09-04 15:53             ` Marcelo Tosatti
2001-09-04 19:33             ` Daniel Phillips
2001-09-06 11:52             ` Rik van Riel
2001-09-06 12:31               ` Daniel Phillips
2001-09-06 12:32                 ` Rik van Riel
2001-09-06 12:53                   ` Daniel Phillips
2001-09-06 13:03                     ` Rik van Riel
2001-09-06 13:18                       ` Kurt Garloff
2001-09-06 13:23                         ` Rik van Riel
2001-09-06 13:28                         ` Alan Cox
2001-09-06 13:29                           ` Rik van Riel
2001-09-06 16:45                         ` Daniel Phillips
2001-09-06 16:57                           ` Rik van Riel
2001-09-06 17:22                             ` Daniel Phillips
2001-09-06 19:25                               ` Rik van Riel
2001-09-06 19:45                                 ` Daniel Phillips
2001-09-06 19:52                                   ` Rik van Riel
2001-09-07  0:32                                     ` Kurt Garloff
2001-09-06 19:53                                   ` Mike Fedyk
2001-09-06 17:35                         ` Mike Fedyk
2001-09-06 13:10               ` Stephan von Krawczynski
2001-09-06 13:23                 ` Alex Bligh - linux-kernel
2001-09-06 13:54                   ` M. Edward Borasky
2001-09-06 14:39                     ` Alan Cox
2001-09-06 16:20                       ` Victor Yodaiken
2001-09-06 17:33                     ` Daniel Phillips
2001-09-06 13:42                 ` Stephan von Krawczynski
2001-09-06 14:01                   ` Alex Bligh - linux-kernel
2001-09-06 14:39                   ` Stephan von Krawczynski
2001-09-06 15:02                     ` Alex Bligh - linux-kernel
2001-09-06 15:07                       ` Rik van Riel
     [not found]                         ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.conectiva>
2001-09-06 15:16                           ` Alex Bligh - linux-kernel
2001-09-06 15:10                     ` Stephan von Krawczynski
2001-09-06 15:18                       ` Alex Bligh - linux-kernel
2001-09-06 17:34                         ` Daniel Phillips
2001-09-06 17:32                           ` Alex Bligh - linux-kernel
2001-09-06 17:51                 ` Daniel Phillips
2001-09-06 21:01                   ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel
2001-09-07  6:35                     ` Daniel Phillips
2001-09-07  8:58                       ` Alex Bligh - linux-kernel
2001-09-07  9:15                         ` Alex Bligh - linux-kernel
2001-09-07  9:28                           ` Alex Bligh - linux-kernel
2001-09-07 21:38                           ` Daniel Phillips
2001-09-07 21:56                         ` Daniel Phillips
2001-09-07 12:30                 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski
2001-09-04 16:27         ` Rik van Riel
2001-09-04 17:13           ` Jan Harkes
2001-09-04 15:56             ` Marcelo Tosatti
2001-09-04 17:54               ` Jan Harkes
2001-09-04 16:37                 ` Marcelo Tosatti
2001-09-04 18:49                 ` Alan Cox
2001-09-04 19:39                   ` Jan Harkes
2001-09-04 20:25                     ` Alan Cox
2001-09-06 11:23                       ` Rik van Riel
2001-09-04 19:54                 ` Andrea Arcangeli
2001-09-04 18:36                   ` Marcelo Tosatti
2001-09-04 20:10                   ` Daniel Phillips
2001-09-04 22:04                     ` Andrea Arcangeli
2001-09-05  2:41                       ` Daniel Phillips
2001-09-06 11:18                   ` Rik van Riel
2001-09-04 17:35             ` Daniel Phillips
2001-09-04 20:43           ` Jan Harkes
2001-09-06 11:21             ` Rik van Riel
     [not found] <20010828180108Z16193-32383+2058@humbolt.nl.linux.org.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel>
2001-08-28 19:14   ` Andi Kleen
2001-08-29 13:48     ` Rik van Riel
2001-08-29 13:49       ` Linus Torvalds
2001-08-29 14:38         ` Rik van Riel
2001-08-28 20:01   ` David S. Miller
2001-08-28 20:49     ` Linus Torvalds
2001-08-28 20:56     ` David S. Miller
2001-09-27 23:14 Samium Gromoff
