* 2.5.35-mm1
@ 2002-09-16  7:15 Andrew Morton
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
  0 siblings, 2 replies; 6+ messages in thread

From: Andrew Morton @ 2002-09-16 7:15 UTC (permalink / raw)
To: lkml, linux-mm, lse-tech

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/

Significant rework of the new sleep/wakeup code - make it look totally
different from the current APIs to avoid confusion, and to make it
simpler to use.  Also increase the number of places where this API is
used in networking; Alexey says that some of these may be negative
improvements, but performance testing will nevertheless be interesting.

The relevant patches are:

  http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/prepare_to_wait.patch
  http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/tcp-wakeups.patch

A 4x performance regression in heavy dbench testing has been fixed.  The
VM was accidentally being fair to the dbench instances in page reclaim.
It's better to be unfair so just a few instances can get ahead and submit
more contiguous IO.  It's a silly thing, but it's what I meant to do
anyway.

Since 2.5.34-mm4:

-readv-writev.patch
-aio-sync-iocb.patch
-llzpr.patch
-buffermem.patch
-lpp.patch
-lpp-update.patch
-reversemaps-leak.patch
-sharedmem.patch
-ext3-sb.patch
-pagevec_lru_add.patch
-oom-fix.patch
-tlb-cleanup.patch
-dump-stack.patch
-wli-cleanup.patch

  Merged

+release_pages-speedup.patch

  Avoid a couple of lock-takings.

-wake-speedup.patch
+prepare_to_wait.patch

  Renamed, reworked

+swapoff-deadlock.patch
+dirty-and-uptodate.patch
+shmem_rename.patch
+dirent-size.patch
+tmpfs-trivia.patch

  Various fixes and cleanups from Hugh Dickins

linus.patch
  cset-1.552-to-1.564.txt.gz

scsi_hack.patch
  Fix block-highmem for scsi

ext3-htree.patch
  Indexed directories for ext3

spin-lock-check.patch
  spinlock/rwlock checking infrastructure

rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

madvise-move.patch
  move madvise implementation into mm/madvise.c

split-vma.patch
  VMA splitting patch

mmap-fixes.patch
  mmap.c cleanup and lock ranking fixes

buffer-ops-move.patch
  Move submit_bh() and ll_rw_block() into fs/buffer.c

slab-stats.patch
  Display total slab memory in /proc/meminfo

writeback-control.patch
  Cleanup and extension of the writeback paths

free_area_init-cleanup.patch
  free_area_init() code cleanup

alloc_pages-cleanup.patch
  alloc_pages cleanup and optimisation

statm_pgd_range-sucks.patch
  Remove the pagetable walk from /proc/stat

remove-sync_thresh.patch
  Remove /proc/sys/vm/dirty_sync_thresh

taka-writev.patch
  Speed up writev

pf_nowarn.patch
  Fix up the handling of PF_NOWARN

jeremy.patch
  Spel Jermy's naim wright

release_pages-speedup.patch
  Reduced locking in release_pages()

queue-congestion.patch
  Infrastructure for communicating request queue congestion to the VM

nonblocking-ext2-preread.patch
  avoid ext2 inode prereads if the queue is congested

nonblocking-pdflush.patch
  non-blocking writeback infrastructure, use it for pdflush

nonblocking-vm.patch
  Non-blocking page reclaim

prepare_to_wait.patch
  New sleep/wakeup API

vm-wakeups.patch
  Use the faster wakeups in the VM and block layers

sync-helper.patch
  Speed up sys_sync() against multiple spindles

slabasap.patch
  Early and smarter shrinking of slabs

write-deadlock.patch
  Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
  Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
  Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
  Per-node kswapd instance

topology-api.patch
  NUMA topology API

radix_tree_gang_lookup.patch
  radix tree gang lookup

truncate_inode_pages.patch
  truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
  Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
  Add kswapd_steal to /proc/vmstat

iowait.patch
  I/O wait statistics

tcp-wakeups.patch
  Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
  Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
  page state cleanup

shmem_rename.patch
  shmem_rename() directory link count fix

dirent-size.patch
  tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
  tmpfs: small fixlets

^ permalink raw reply	[flat|nested] 6+ messages in thread
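The headline item, prepare_to_wait.patch, replaces the old sleep/wakeup idiom with a queue-first pattern: a task adds itself to the wait queue and sets its state *before* re-checking the condition, so a wakeup that fires in the window before it actually sleeps is recorded rather than lost. The following is a single-threaded userspace C model of that ordering guarantee only; the struct layout and function bodies are illustrative stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One waiter on a wait queue (model only). */
struct wait_entry {
	bool queued;	/* on the queue, eligible to receive wakeups */
	bool woken;	/* a wakeup arrived since prepare_to_wait() */
};

/* A wait queue with a single waiter, for simplicity. */
struct wait_queue {
	struct wait_entry *waiter;
};

/* Queue ourselves BEFORE the caller re-checks its condition, so any
 * wakeup arriving after this point is recorded in `woken'. */
static void prepare_to_wait_model(struct wait_queue *q, struct wait_entry *w)
{
	w->queued = true;
	w->woken = false;
	q->waiter = w;
}

/* Wake whoever is queued; a no-op if nobody has prepared to wait. */
static void wake_up_model(struct wait_queue *q)
{
	if (q->waiter && q->waiter->queued)
		q->waiter->woken = true;
}

/* Stand-in for schedule(): returns true if the task would really sleep,
 * false if a wakeup already arrived (so the sleep is skipped, not lost). */
static bool schedule_model(const struct wait_entry *w)
{
	return !w->woken;
}

/* Remove ourselves from the queue after waking. */
static void finish_wait_model(struct wait_queue *q, struct wait_entry *w)
{
	w->queued = false;
	if (q->waiter == w)
		q->waiter = NULL;
}
```

The point of the ordering is visible in the model: a wakeup delivered between prepare_to_wait_model() and schedule_model() causes the sleep to be skipped rather than missed.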
* Re: 2.5.35-mm1
  2002-09-16  7:15 2.5.35-mm1 Andrew Morton
@ 2002-09-17 16:07 ` Pavel Machek
  2002-09-18 21:31   ` 2.5.35-mm1 Andrew Morton
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
  1 sibling, 1 reply; 6+ messages in thread

From: Pavel Machek @ 2002-09-17 16:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm, lse-tech

Hi!

> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
>
> Significant rework of the new sleep/wakeup code - make it look totally
> different from the current APIs to avoid confusion, and to make it
> simpler to use.

Did you add any hooks to allow me to free memory for swsusp?

								Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: 2.5.35-mm1
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
@ 2002-09-18 21:31   ` Andrew Morton
  2002-09-18 21:54     ` 2.5.35-mm1 Pavel Machek
  0 siblings, 1 reply; 6+ messages in thread

From: Andrew Morton @ 2002-09-18 21:31 UTC (permalink / raw)
To: Pavel Machek; +Cc: lkml, linux-mm, lse-tech

Pavel Machek wrote:
>
> Hi!
>
> > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> >
> > Significant rework of the new sleep/wakeup code - make it look totally
> > different from the current APIs to avoid confusion, and to make it
> > simpler to use.
>
> Did you add any hooks to allow me to free memory for swsusp?

I just did then.  You'll need to call

	freed = shrink_all_memory(99);

to free up 99 pages.  It returns the number which it actually freed.
If that's not 99 then it's time to give up.  There is no oom-killer
in this code path.

I haven't tested it yet.  And it's quite a long way back in the queue
I'm afraid - it has a dependency chain, and I prefer to send stuff to
Linus which has been tested for a couple of weeks, and hasn't changed
for one week.

Can you use the allocate-lots-then-free-it trick in the meanwhile?

 include/linux/swap.h |    1 +
 mm/vmscan.c          |   46 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 6 deletions(-)

--- 2.5.36/mm/vmscan.c~swsusp-feature	Wed Sep 18 13:55:20 2002
+++ 2.5.36-akpm/mm/vmscan.c	Wed Sep 18 14:29:13 2002
@@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
 }
 
 /*
- * kswapd will work across all this node's zones until they are all at
- * pages_high.
+ * For kswapd, balance_pgdat() will work across all this node's zones until
+ * they are all at pages_high.
+ *
+ * If `nr_pages' is non-zero then it is the number of pages which are to be
+ * reclaimed, regardless of the zone occupancies.  This is a software suspend
+ * special.
+ *
+ * Returns the number of pages which were actually freed.
  */
-static void kswapd_balance_pgdat(pg_data_t *pgdat)
+static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
 {
-	int priority = DEF_PRIORITY;
+	int to_free = nr_pages;
+	int priority;
 	int i;
 
 	for (priority = DEF_PRIORITY; priority; priority--) {
@@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
 			int to_reclaim;
 
 			to_reclaim = zone->pages_high - zone->free_pages;
+			if (nr_pages && to_free > 0)
+				to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
 			if (to_reclaim <= 0)
 				continue;
 			success = 0;
 			max_scan = zone->nr_inactive >> priority;
 			if (max_scan < to_reclaim * 2)
 				max_scan = to_reclaim * 2;
-			shrink_zone(zone, max_scan, GFP_KSWAPD,
+			to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
 					to_reclaim, &nr_mapped);
 			shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
 		}
@@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
 			break;	/* All zones are at pages_high */
 		blk_congestion_wait(WRITE, HZ/4);
 	}
+	return nr_pages - to_free;
 }
 
 /*
@@ -772,10 +782,34 @@ int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		schedule();
 		finish_wait(&pgdat->kswapd_wait, &wait);
-		kswapd_balance_pgdat(pgdat);
+		balance_pgdat(pgdat, 0);
 		blk_run_queues();
 	}
 }
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/*
+ * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
+ * pages.
+ */
+int shrink_all_memory(int nr_pages)
+{
+	pg_data_t *pgdat;
+	int nr_to_free = nr_pages;
+	int ret = 0;
+
+	for_each_pgdat(pgdat) {
+		int freed;
+
+		freed = balance_pgdat(pgdat, nr_to_free);
+		ret += freed;
+		nr_to_free -= freed;
+		if (nr_to_free <= 0)
+			break;
+	}
+	return ret;
+}
+#endif
 
 static int __init kswapd_init(void)
 {
--- 2.5.36/include/linux/swap.h~swsusp-feature	Wed Sep 18 14:03:01 2002
+++ 2.5.36-akpm/include/linux/swap.h	Wed Sep 18 14:16:29 2002
@@ -163,6 +163,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+int shrink_all_memory(int nr_pages);
 
 /* linux/mm/page_io.c */
 int swap_readpage(struct file *file, struct page *page);

^ permalink raw reply	[flat|nested] 6+ messages in thread
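The nr_to_free bookkeeping in shrink_all_memory() above - walk the nodes, subtract what each balance_pgdat() call freed, stop once the request is met, and report the (possibly short) total back to the caller - can be checked with a small userspace C model. The per-node page counts and the balance_pgdat_model() stand-in below are made up for illustration; only the accounting loop mirrors the patch.

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical reclaimable-page counts, one per NUMA node (pgdat). */
static int node_reclaimable[MAX_NODES] = { 40, 10, 0, 200 };

/* Stand-in for balance_pgdat(pgdat, nr_pages): free up to nr_pages from
 * one node and return how many pages were actually freed. */
static int balance_pgdat_model(int node, int nr_pages)
{
	int freed = node_reclaimable[node] < nr_pages
			? node_reclaimable[node] : nr_pages;

	node_reclaimable[node] -= freed;
	return freed;
}

/* Model of shrink_all_memory(): walk all nodes, stop as soon as the
 * request is satisfied, and return the total actually freed - which may
 * fall short when memory runs out (the caller's "give up" signal). */
static int shrink_all_memory_model(int nr_pages)
{
	int nr_to_free = nr_pages;
	int ret = 0;

	for (int node = 0; node < MAX_NODES; node++) {
		int freed = balance_pgdat_model(node, nr_to_free);

		ret += freed;
		nr_to_free -= freed;
		if (nr_to_free <= 0)
			break;
	}
	return ret;
}
```

A short return value is the only failure indication; there is no oom-killer on this path, so a suspend caller that asks for N and gets back less than N simply gives up.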
* Re: 2.5.35-mm1
  2002-09-18 21:31 ` 2.5.35-mm1 Andrew Morton
@ 2002-09-18 21:54   ` Pavel Machek
  0 siblings, 0 replies; 6+ messages in thread

From: Pavel Machek @ 2002-09-18 21:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm, lse-tech

Hi!

> > > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> > >
> > > Significant rework of the new sleep/wakeup code - make it look totally
> > > different from the current APIs to avoid confusion, and to make it
> > > simpler to use.
> >
> > Did you add any hooks to allow me to free memory for swsusp?
>
> I just did then.  You'll need to call
>
>	freed = shrink_all_memory(99);

Thanx a lot.

> to free up 99 pages.  It returns the number which it actually
> freed.  If that's not 99 then it's time to give up.  There is
> no oom-killer in this code path.

So... I'll do something like shrink_all_memory(1000000) and it will
free as much as possible, right?

> I haven't tested it yet.  And it's quite a long way back in the
> queue I'm afraid - it has a dependency chain, and I prefer to

So if I apply this to my tree it will not work (that's what "dependency
chain" means, right?).  Okay, thanx anyway.

> send stuff to Linus which has been tested for a couple of weeks, and
> hasn't changed for one week.
>
> Can you use the allocate-lots-then-free-it trick in the meanwhile?

In the meanwhile, swsusp only working when there's lot of ram is
probably okay.  As IDE patch is not in, swsusp is dangerous, anyway.

								Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: 2.5.35-mm1
  2002-09-16  7:15 2.5.35-mm1 Andrew Morton
  2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
@ 2002-09-19  7:51 ` Daniel Phillips
  2002-09-19  8:19   ` 2.5.35-mm1 Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread

From: Daniel Phillips @ 2002-09-19  7:51 UTC (permalink / raw)
To: Andrew Morton, lkml, linux-mm, lse-tech

On Monday 16 September 2002 09:15, Andrew Morton wrote:
> A 4x performance regression in heavy dbench testing has been fixed.  The
> VM was accidentally being fair to the dbench instances in page reclaim.
> It's better to be unfair so just a few instances can get ahead and submit
> more contiguous IO.  It's a silly thing, but it's what I meant to do anyway.

Curious... did the performance hit show anywhere other than dbench?

--
Daniel

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: 2.5.35-mm1
  2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
@ 2002-09-19  8:19   ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread

From: Andrew Morton @ 2002-09-19  8:19 UTC (permalink / raw)
To: Daniel Phillips; +Cc: lkml, linux-mm, lse-tech

Daniel Phillips wrote:
>
> On Monday 16 September 2002 09:15, Andrew Morton wrote:
> > A 4x performance regression in heavy dbench testing has been fixed.  The
> > VM was accidentally being fair to the dbench instances in page reclaim.
> > It's better to be unfair so just a few instances can get ahead and submit
> > more contiguous IO.  It's a silly thing, but it's what I meant to do anyway.
>
> Curious... did the performance hit show anywhere other than dbench?

Other benchmarky tests would have suffered, but I did not check.

I have logic in there which is designed to throttle heavy writers within
the page allocator, as well as within balance_dirty_pages.  Basically:

	generic_file_write()
	{
		current->backing_dev_info = mapping->backing_dev_info;
		alloc_page();
		current->backing_dev_info = 0;
	}

	shrink_list()
	{
		if (PageDirty(page)) {
			if (page->mapping->backing_dev_info ==
					current->backing_dev_info)
				blocking_write(page->mapping);
			else
				nonblocking_write(page->mapping);
		}
	}

What this says is "if this task is prepared to block against this page's
queue, then write the dirty data, even if that would block".  This means
that all the dbench instances will write each other's dirty data as it
comes off the tail of the LRU.  Which provides some additional throttling,
and means that we don't just refile the page.

But the logic was not correctly implemented.  The dbench instances were
performing non-blocking writes, which meant that all 64 instances were
cheerfully running all the time, submitting IO all over the disk.  The
/proc/meminfo:Writeback figure never even hit a megabyte.  That number
tells us how much memory is currently in the request queue.  Clearly, it
was very fragmented.

By forcing the dbench instances to block on the queue, particular
instances were able to submit decent amounts of IO.  The `Writeback'
figure went back to around 4 megabytes, because the individual requests
were larger - more merging.

^ permalink raw reply	[flat|nested] 6+ messages in thread
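The throttling rule Andrew describes reduces to a single pointer comparison: a task blocks on a queue only if it announced, via current->backing_dev_info, that it is currently writing to that queue. A userspace C sketch of just that decision follows; the struct names are hypothetical stand-ins for the kernel's task_struct and backing_dev_info.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's backing_dev_info. */
struct backing_dev {
	int id;
};

/* Hypothetical stand-in for task_struct: NULL means this task has not
 * announced itself as a writer to any queue. */
struct task {
	struct backing_dev *backing_dev_info;
};

/* The rule from the email: write synchronously (and block) only when the
 * task marked itself as writing to this page's queue; everyone else gets
 * a non-blocking write so page reclaim stays non-blocking. */
static bool should_block_on_write(const struct task *tsk,
				  struct backing_dev *page_bdi)
{
	return tsk->backing_dev_info == page_bdi;
}
```

Under this rule a dbench instance inside generic_file_write() blocks on its own queue (and so submits large, mergeable requests), while reclaim on behalf of anyone else stays non-blocking.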
end of thread, other threads: [~2002-09-19 8:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-16  7:15 2.5.35-mm1 Andrew Morton
2002-09-17 16:07 ` 2.5.35-mm1 Pavel Machek
2002-09-18 21:31   ` 2.5.35-mm1 Andrew Morton
2002-09-18 21:54     ` 2.5.35-mm1 Pavel Machek
2002-09-19  7:51 ` 2.5.35-mm1 Daniel Phillips
2002-09-19  8:19   ` 2.5.35-mm1 Andrew Morton