linux-kernel.vger.kernel.org archive mirror
* 2.5.35-mm1
@ 2002-09-16  7:15 Andrew Morton
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2002-09-16  7:15 UTC (permalink / raw)
  To: lkml, linux-mm, lse-tech


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/

Significant rework of the new sleep/wakeup code - make it look totally
different from the current APIs to avoid confusion, and to make it
simpler to use.

Also increase the number of places where this API is used in networking;
Alexey says that some of these may be negative improvements, but
performance testing will nevertheless be interesting.  The relevant
patches are:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/prepare_to_wait.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/tcp-wakeups.patch
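
For reference, the calling convention which prepare_to_wait.patch introduces
looks roughly like this (a sketch only - the wait queue and the condition are
placeholders, and the DEFINE_WAIT() declaration is an assumption, since the
hunks quoted further down don't show it):

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&my_waitqueue, &wait, TASK_INTERRUPTIBLE);
		if (should_stop_waiting())	/* placeholder condition */
			break;
		schedule();
	}
	finish_wait(&my_waitqueue, &wait);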

A 4x performance regression in heavy dbench testing has been fixed. The
VM was accidentally being fair to the dbench instances in page reclaim.
It's better to be unfair so just a few instances can get ahead and submit
more contiguous IO.  It's a silly thing, but it's what I meant to do anyway.

Since 2.5.34-mm4:

-readv-writev.patch
-aio-sync-iocb.patch
-llzpr.patch
-buffermem.patch
-lpp.patch
-lpp-update.patch
-reversemaps-leak.patch
-sharedmem.patch
-ext3-sb.patch
-pagevec_lru_add.patch
-oom-fix.patch
-tlb-cleanup.patch
-dump-stack.patch
-wli-cleanup.patch

 Merged

+release_pages-speedup.patch

 Avoid a couple of lock-takings.

-wake-speedup.patch
+prepare_to_wait.patch

 Renamed, reworked

+swapoff-deadlock.patch
+dirty-and-uptodate.patch
+shmem_rename.patch
+dirent-size.patch
+tmpfs-trivia.patch

 Various fixes and cleanups from Hugh Dickins


linus.patch
  cset-1.552-to-1.564.txt.gz

scsi_hack.patch
  Fix block-highmem for scsi

ext3-htree.patch
  Indexed directories for ext3

spin-lock-check.patch
  spinlock/rwlock checking infrastructure

rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

madvise-move.patch
  Move the madvise implementation into mm/madvise.c

split-vma.patch
  VMA splitting patch

mmap-fixes.patch
  mmap.c cleanup and lock ranking fixes

buffer-ops-move.patch
  Move submit_bh() and ll_rw_block() into fs/buffer.c

slab-stats.patch
  Display total slab memory in /proc/meminfo

writeback-control.patch
  Cleanup and extension of the writeback paths

free_area_init-cleanup.patch
  free_area_init() code cleanup

alloc_pages-cleanup.patch
  alloc_pages cleanup and optimisation

statm_pgd_range-sucks.patch
  Remove the pagetable walk from /proc/<pid>/statm

remove-sync_thresh.patch
  Remove /proc/sys/vm/dirty_sync_thresh

taka-writev.patch
  Speed up writev

pf_nowarn.patch
  Fix up the handling of PF_NOWARN

jeremy.patch
  Spel Jermy's naim wright

release_pages-speedup.patch
  Reduced locking in release_pages()

queue-congestion.patch
  Infrastructure for communicating request queue congestion to the VM

nonblocking-ext2-preread.patch
  avoid ext2 inode prereads if the queue is congested

nonblocking-pdflush.patch
  non-blocking writeback infrastructure, use it for pdflush

nonblocking-vm.patch
  Non-blocking page reclaim

prepare_to_wait.patch
  New sleep/wakeup API

vm-wakeups.patch
  Use the faster wakeups in the VM and block layers

sync-helper.patch
  Speed up sys_sync() against multiple spindles

slabasap.patch
  Early and smarter shrinking of slabs

write-deadlock.patch
  Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
  Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
  Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
  Per-node kswapd instance

topology-api.patch
  NUMA topology API

radix_tree_gang_lookup.patch
  radix tree gang lookup

truncate_inode_pages.patch
  truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
  Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
  Add kswapd_steal to /proc/vmstat

iowait.patch
  I/O wait statistics

tcp-wakeups.patch
  Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
  Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
  page state cleanup

shmem_rename.patch
  shmem_rename() directory link count fix

dirent-size.patch
  tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
  tmpfs: small fixlets

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.5.35-mm1
  2002-09-16  7:15 2.5.35-mm1 Andrew Morton
@ 2002-09-17 16:07 ` Pavel Machek
  2002-09-18 21:31   ` 2.5.35-mm1 Andrew Morton
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
  1 sibling, 1 reply; 6+ messages in thread
From: Pavel Machek @ 2002-09-17 16:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm, lse-tech

Hi!

> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> 
> Significant rework of the new sleep/wakeup code - make it look totally
> different from the current APIs to avoid confusion, and to make it
> simpler to use.

Did you add any hooks to allow me to free memory for swsusp?
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.5.35-mm1
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
@ 2002-09-18 21:31   ` Andrew Morton
  2002-09-18 21:54     ` 2.5.35-mm1 Pavel Machek
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2002-09-18 21:31 UTC (permalink / raw)
  To: Pavel Machek; +Cc: lkml, linux-mm, lse-tech

Pavel Machek wrote:
> 
> Hi!
> 
> > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> >
> > Significant rework of the new sleep/wakeup code - make it look totally
> > different from the current APIs to avoid confusion, and to make it
> > simpler to use.
> 
> Did you add any hooks to allow me to free memory for swsusp?

I just did then.  You'll need to call

	freed = shrink_all_memory(99);

to free up 99 pages.  It returns the number which it actually
freed.  If that's not 99 then it's time to give up.  There is
no oom-killer in this code path.
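
Something like this, on the swsusp side (an illustrative caller only - the
99-page target and the -ENOMEM bail-out are made up, not part of the patch):

	int freed;

	freed = shrink_all_memory(99);
	if (freed < 99) {
		/* Couldn't reclaim enough memory - abandon the suspend */
		return -ENOMEM;
	}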

I haven't tested it yet.  And it's quite a long way back in the
queue I'm afraid - it has a dependency chain, and I prefer to
send stuff to Linus which has been tested for a couple of weeks, and
hasn't changed for one week.

Can you use the allocate-lots-then-free-it trick in the meanwhile?
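
(That trick is roughly the following - a hypothetical in-kernel sketch; the
page count and the flat array are illustrative only:

	#define SWSUSP_GRAB_PAGES 1024		/* illustrative */
	static struct page *grabbed[SWSUSP_GRAB_PAGES];
	int i, n;

	/* Allocating forces the VM to reclaim pages on our behalf... */
	for (n = 0; n < SWSUSP_GRAB_PAGES; n++) {
		grabbed[n] = alloc_page(GFP_KERNEL);
		if (!grabbed[n])
			break;
	}
	/* ...then hand them all straight back. */
	for (i = 0; i < n; i++)
		__free_page(grabbed[i]);
)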



 include/linux/swap.h |    1 +
 mm/vmscan.c          |   46 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions, 6 deletions

--- 2.5.36/mm/vmscan.c~swsusp-feature	Wed Sep 18 13:55:20 2002
+++ 2.5.36-akpm/mm/vmscan.c	Wed Sep 18 14:29:13 2002
@@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
 }
 
 /*
- * kswapd will work across all this node's zones until they are all at
- * pages_high.
+ * For kswapd, balance_pgdat() will work across all this node's zones until
+ * they are all at pages_high.
+ *
+ * If `nr_pages' is non-zero then it is the number of pages which are to be
+ * reclaimed, regardless of the zone occupancies.  This is a software suspend
+ * special.
+ *
+ * Returns the number of pages which were actually freed.
  */
-static void kswapd_balance_pgdat(pg_data_t *pgdat)
+static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
 {
-	int priority = DEF_PRIORITY;
+	int to_free = nr_pages;
+	int priority;
 	int i;
 
 	for (priority = DEF_PRIORITY; priority; priority--) {
@@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
 			int to_reclaim;
 
 			to_reclaim = zone->pages_high - zone->free_pages;
+			if (nr_pages && to_free > 0)
+				to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
 			if (to_reclaim <= 0)
 				continue;
 			success = 0;
 			max_scan = zone->nr_inactive >> priority;
 			if (max_scan < to_reclaim * 2)
 				max_scan = to_reclaim * 2;
-			shrink_zone(zone, max_scan, GFP_KSWAPD,
+			to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
 					to_reclaim, &nr_mapped);
 			shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
 		}
@@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
 			break;	/* All zones are at pages_high */
 		blk_congestion_wait(WRITE, HZ/4);
 	}
+	return nr_pages - to_free;
 }
 
 /*
@@ -772,10 +782,34 @@ int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		schedule();
 		finish_wait(&pgdat->kswapd_wait, &wait);
-		kswapd_balance_pgdat(pgdat);
+		balance_pgdat(pgdat, 0);
 		blk_run_queues();
 	}
 }
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/*
+ * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
+ * pages.
+ */
+int shrink_all_memory(int nr_pages)
+{
+	pg_data_t *pgdat;
+	int nr_to_free = nr_pages;
+	int ret = 0;
+
+	for_each_pgdat(pgdat) {
+		int freed;
+
+		freed = balance_pgdat(pgdat, nr_to_free);
+		ret += freed;
+		nr_to_free -= freed;
+		if (nr_to_free <= 0)
+			break;
+	}
+	return ret;
+}
+#endif
 
 static int __init kswapd_init(void)
 {
--- 2.5.36/include/linux/swap.h~swsusp-feature	Wed Sep 18 14:03:01 2002
+++ 2.5.36-akpm/include/linux/swap.h	Wed Sep 18 14:16:29 2002
@@ -163,6 +163,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+int shrink_all_memory(int nr_pages);
 
 /* linux/mm/page_io.c */
 int swap_readpage(struct file *file, struct page *page);

.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.5.35-mm1
  2002-09-18 21:31   ` 2.5.35-mm1 Andrew Morton
@ 2002-09-18 21:54     ` Pavel Machek
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Machek @ 2002-09-18 21:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm, lse-tech

Hi!

> > > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> > >
> > > Significant rework of the new sleep/wakeup code - make it look totally
> > > different from the current APIs to avoid confusion, and to make it
> > > simpler to use.
> > 
> > Did you add any hooks to allow me to free memory for swsusp?
> 
> I just did then.  You'll need to call
> 
> 	freed = shrink_all_memory(99);

Thanx a lot.

> to free up 99 pages.  It returns the number which it actually
> freed.  If that's not 99 then it's time to give up.  There is
> no oom-killer in this code path.

So... I'll do something like shrink_all_memory(1000000) and it will
free as much as possible, right?

> I haven't tested it yet.  And it's quite a long way back in the
> queue I'm afraid - it has a dependency chain, and I prefer to

So if I apply this to my tree it will not work (that's what
"dependency chain" means, right?). Okay, thanx anyway.

> send stuff to Linus which has been tested for a couple of weeks, and
> hasn't changed for one week.
> 
> Can you use the allocate-lots-then-free-it trick in the meanwhile?

In the meanwhile, swsusp only working when there's a lot of RAM is
probably okay. As the IDE patch is not in, swsusp is dangerous anyway.

									Pavel 

> --- 2.5.36/mm/vmscan.c~swsusp-feature	Wed Sep 18 13:55:20 2002
> +++ 2.5.36-akpm/mm/vmscan.c	Wed Sep 18 14:29:13 2002
> @@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
>  }
>  
>  /*
> - * kswapd will work across all this node's zones until they are all at
> - * pages_high.
> + * For kswapd, balance_pgdat() will work across all this node's zones until
> + * they are all at pages_high.
> + *
> + * If `nr_pages' is non-zero then it is the number of pages which are to be
> + * reclaimed, regardless of the zone occupancies.  This is a software suspend
> + * special.
> + *
> + * Returns the number of pages which were actually freed.
>   */
> -static void kswapd_balance_pgdat(pg_data_t *pgdat)
> +static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
>  {
> -	int priority = DEF_PRIORITY;
> +	int to_free = nr_pages;
> +	int priority;
>  	int i;
>  
>  	for (priority = DEF_PRIORITY; priority; priority--) {
> @@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
>  			int to_reclaim;
>  
>  			to_reclaim = zone->pages_high - zone->free_pages;
> +			if (nr_pages && to_free > 0)
> +				to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
>  			if (to_reclaim <= 0)
>  				continue;
>  			success = 0;
>  			max_scan = zone->nr_inactive >> priority;
>  			if (max_scan < to_reclaim * 2)
>  				max_scan = to_reclaim * 2;
> -			shrink_zone(zone, max_scan, GFP_KSWAPD,
> +			to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
>  					to_reclaim, &nr_mapped);
>  			shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
>  		}
> @@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
>  			break;	/* All zones are at pages_high */
>  		blk_congestion_wait(WRITE, HZ/4);
>  	}
> +	return nr_pages - to_free;
>  }
>  
>  /*
> @@ -772,10 +782,34 @@ int kswapd(void *p)
>  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  		schedule();
>  		finish_wait(&pgdat->kswapd_wait, &wait);
> -		kswapd_balance_pgdat(pgdat);
> +		balance_pgdat(pgdat, 0);
>  		blk_run_queues();
>  	}
>  }
> +
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +/*
> + * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
> + * pages.
> + */
> +int shrink_all_memory(int nr_pages)
> +{
> +	pg_data_t *pgdat;
> +	int nr_to_free = nr_pages;
> +	int ret = 0;
> +
> +	for_each_pgdat(pgdat) {
> +		int freed;
> +
> +		freed = balance_pgdat(pgdat, nr_to_free);
> +		ret += freed;
> +		nr_to_free -= freed;
> +		if (nr_to_free <= 0)
> +			break;
> +	}
> +	return ret;
> +}
> +#endif
>  
>  static int __init kswapd_init(void)
>  {
> --- 2.5.36/include/linux/swap.h~swsusp-feature	Wed Sep 18 14:03:01 2002
> +++ 2.5.36-akpm/include/linux/swap.h	Wed Sep 18 14:16:29 2002
> @@ -163,6 +163,7 @@ extern void swap_setup(void);
>  
>  /* linux/mm/vmscan.c */
>  extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
> +int shrink_all_memory(int nr_pages);
>  
>  /* linux/mm/page_io.c */
>  int swap_readpage(struct file *file, struct page *page);
> 
> .

-- 
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.5.35-mm1
  2002-09-16  7:15 2.5.35-mm1 Andrew Morton
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
@ 2002-09-19  7:51 ` Daniel Phillips
  2002-09-19  8:19   ` 2.5.35-mm1 Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Daniel Phillips @ 2002-09-19  7:51 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm, lse-tech

On Monday 16 September 2002 09:15, Andrew Morton wrote:
> A 4x performance regression in heavy dbench testing has been fixed. The
> VM was accidentally being fair to the dbench instances in page reclaim.
> It's better to be unfair so just a few instances can get ahead and submit
> more contiguous IO.  It's a silly thing, but it's what I meant to do anyway.

Curious... did the performance hit show anywhere other than dbench?

-- 
Daniel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.5.35-mm1
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
@ 2002-09-19  8:19   ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2002-09-19  8:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml, linux-mm, lse-tech

Daniel Phillips wrote:
> 
> On Monday 16 September 2002 09:15, Andrew Morton wrote:
> > A 4x performance regression in heavy dbench testing has been fixed. The
> > VM was accidentally being fair to the dbench instances in page reclaim.
> > It's better to be unfair so just a few instances can get ahead and submit
> > more contiguous IO.  It's a silly thing, but it's what I meant to do anyway.
> 
> Curious... did the performance hit show anywhere other than dbench?

Other benchmarky tests would have suffered, but I did not check.

I have logic in there which is designed to throttle heavy writers
within the page allocator, as well as within balance_dirty_pages().
Basically:

	generic_file_write()
	{
		current->backing_dev_info = mapping->backing_dev_info;
		alloc_page()
		current->backing_dev_info = 0;
	}

	shrink_list()
	{
		if (PageDirty(page)) {
			if (page->mapping->backing_dev_info == current->backing_dev_info)
				blocking_write(page->mapping);
			else
				nonblocking_write(page->mapping);
		}
	}


What this says is "if this task is prepared to block against this
page's queue, then write the dirty data, even if that would block".

This means that all the dbench instances will write each other's
dirty data as it comes off the tail of the LRU.  Which provides
some additional throttling, and means that we don't just refile
the page.

But the logic was not correctly implemented.  The dbench instances
were performing non-blocking writes.  This meant that all 64 instances
were cheerfully running all the time, submitting IO all over the disk.
The /proc/meminfo:Writeback figure never even hit a megabyte.  That
number tells us how much memory is currently in the request queue.
Clearly, it was very fragmented.

By forcing the dbench instance to block on the queue, particular instances
were able to submit decent amounts of IO.  The `Writeback' figure went
back to around 4 megabytes, because the individual requests were
larger - more merging.

^ permalink raw reply	[flat|nested] 6+ messages in thread


Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-16  7:15 2.5.35-mm1 Andrew Morton
2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
2002-09-18 21:31   ` 2.5.35-mm1 Andrew Morton
2002-09-18 21:54     ` 2.5.35-mm1 Pavel Machek
2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
2002-09-19  8:19   ` 2.5.35-mm1 Andrew Morton
