* page_launder() on 2.4.9/10 issue
@ 2001-08-28  3:36 Marcelo Tosatti
  2001-08-28 18:07 ` Daniel Phillips
  0 siblings, 1 reply; 79+ messages in thread

From: Marcelo Tosatti @ 2001-08-28 3:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: lkml

Linus,

I just noticed that the new page_launder() logic has a big bad problem.

The window to find and free previously written out pages by page_launder()
is the amount of writeable pages on the inactive dirty list.

We'll keep writing out dirty pages (as long as they are available) even if
we have a ton of cleaned pages: it's just that we don't see them, because
we only scan a small piece of the inactive dirty list each time.

That obviously did not happen with the full-scan behaviour.

With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
clean list in the IO completion handler. Please don't consider that for
2.4 :)) this would not happen either.

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue

From: Daniel Phillips @ 2001-08-28 18:07 UTC (permalink / raw)
To: Marcelo Tosatti, Linus Torvalds; +Cc: lkml

On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> Linus,
>
> I just noticed that the new page_launder() logic has a big bad problem.
>
> The window to find and free previously written out pages by page_launder()
> is the amount of writeable pages on the inactive dirty list.
>
> We'll keep writing out dirty pages (as long as they are available) even if
> we have a ton of cleaned pages: it's just that we don't see them because we
> scan a small piece of the inactive dirty list each time.
>
> That obviously did not happen with the full scan behaviour.
>
> With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> clean list at the IO completion handler. Please don't consider that for
> 2.4 :)) this would not happen either.

Or we could have parallel lists for dirty and clean.

--
Daniel
* Re: page_launder() on 2.4.9/10 issue

From: Linus Torvalds @ 2001-08-28 18:17 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Marcelo Tosatti, lkml

On Tue, 28 Aug 2001, Daniel Phillips wrote:
> On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > Linus,
> >
> > I just noticed that the new page_launder() logic has a big bad problem.
> >
> > The window to find and free previously written out pages by page_launder()
> > is the amount of writeable pages on the inactive dirty list.

No.

There is no "window". The page_launder() logic is very clear - it will
write out any dirty pages that it finds that are "old".

> > We'll keep writing out dirty pages (as long as they are available) even if
> > we have a ton of cleaned pages: it's just that we don't see them because we
> > scan a small piece of the inactive dirty list each time.

So? We need to write them out at some point anyway. Isn't it much better
to be graceful about it, and allow the writeout to happen in the
background? The way things _used_ to work, we'd delay the write-out until
we REALLY had to, which is great for dbench, but is really horrible for
any normal load.

Think about it - do you really want the system to actively try to reach
the point where it has no "regular" pages left, and has to start writing
stuff out (and wait for them synchronously) in order to free up memory? I
strongly feel that the old code was really really wrong - it may have been
wonderful for throughput, but it had non-repeatable behaviour, and easily
caused the inactive_dirty list to fill up with dirty pages because it
unfairly penalized clean pages.

You do need to realize that dbench is a really bad benchmark, and should
not be used as a way to tweak the algorithms.

> > That obviously did not happen with the full scan behaviour.

The new code has no difference between "full scan" and "partial scan". It
will do the same thing regardless of whether you scan the whole list, as
it doesn't have any state.

This did NOT happen with the old "launder_loop" state thing, but I think
you agreed that that code was unreliable and flaky, and caused basically
random non-LRU behaviour that depended on subtle effects in (a) who called
it and (b) what the layout of the inactive_dirty list was.

> > With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> > clean list at the IO completion handler. Please don't consider that for
> > 2.4 :)) this would not happen either.
>
> Or we could have parallel lists for dirty and clean.

Well, more importantly, do you actually have good reason to believe that
it is wrong to try to write things out asynchronously?

		Linus
* Re: page_launder() on 2.4.9/10 issue

From: Daniel Phillips @ 2001-08-30 1:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Marcelo Tosatti, lkml

On August 28, 2001 08:17 pm, Linus Torvalds wrote:
> On Tue, 28 Aug 2001, Daniel Phillips wrote:
> > On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > > Linus,
> > >
> > > I just noticed that the new page_launder() logic has a big bad problem.
> > >
> > > The window to find and free previously written out pages by
> > > page_launder() is the amount of writeable pages on the inactive dirty
> > > list.
>
> No.
>
> There is no "window". The page_launder() logic is very clear - it will
> write out any dirty pages that it finds that are "old".
>
> > > We'll keep writing out dirty pages (as long as they are available) even
> > > if we have a ton of cleaned pages: it's just that we don't see them
> > > because we scan a small piece of the inactive dirty list each time.
>
> So? We need to write them out at some point anyway. Isn't it much better
> to be graceful about it, and allow the writeout to happen in the
> background? The way things _used_ to work, we'd delay the write-out until
> we REALLY had to, which is great for dbench, but is really horrible for
> any normal load.

I thought about it a lot and I had a really hard time coming up with
examples where starting writeout early is not the right thing to do. Even
write merging takes care of itself, because if the system is heavily loaded
the queue will naturally back up and create all the write merging
opportunities we need. Temporary file deletion is hurt by early writeout,
yes, but that is really something we should be handling at the filesystem
level, not the vfs. (According to this theory, XFS with its delayed
allocation should be a star performer on dbench.)

The only case I can see where early writeout is not necessarily the best
policy is when we have lots of input going on at the same time. The classic
example is program startup. If there are lots of inactive/clean pages we
want to hold off writeout until the swap-in activity due to program start
winds down or eats all the inactive/clean pages.

> Think about it - do you really want the system to actively try to reach
> the point where it has no "regular" pages left, and has to start writing
> stuff out (and wait for them synchronously) in order to free up memory? I
> strongly feel that the old code was really really wrong - it may have been
> wonderful for throughput, but it had non-repeatable behaviour, and easily
> caused the inactive_dirty list to fill up with dirty pages because it
> unfairly penalized clean pages.

It was just plain wrong. We got sucked into the trap of optimizing for
dbench.

> [...]
> > > With asynchronous i_dirty->i_clean movement (moving a cleaned page to
> > > the clean list at the IO completion handler. Please don't consider that
> > > for 2.4 :)) this would not happen either.
> >
> > Or we could have parallel lists for dirty and clean.
>
> Well, more importantly, do you actually have good reason to believe that
> it is wrong to try to write things out asynchronously?

Asynchronous is good, but we don't want to blindly submit every dirty page
as soon as it arrives on the inactive_dirty list. This will throw away
information about the short-term activity of pages, without which we have
no means of distinguishing between LFU and LRU pages. It doesn't matter
under light disk load because... the load is light (duh), but under heavy
load it does matter.

--
Daniel
* Re: page_launder() on 2.4.9/10 issue

From: Marcelo Tosatti @ 2001-09-03 14:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Phillips, lkml

On Tue, 28 Aug 2001, Linus Torvalds wrote:
>
> On Tue, 28 Aug 2001, Daniel Phillips wrote:
> > On August 28, 2001 05:36 am, Marcelo Tosatti wrote:
> > > Linus,
> > >
> > > I just noticed that the new page_launder() logic has a big bad problem.
> > >
> > > The window to find and free previously written out pages by page_launder()
> > > is the amount of writeable pages on the inactive dirty list.
>
> No.
>
> There is no "window". The page_launder() logic is very clear - it will
> write out any dirty pages that it finds that are "old".

Yes, this is clear. Look above.

> > > We'll keep writing out dirty pages (as long as they are available) even if
> > > we have a ton of cleaned pages: it's just that we don't see them because we
> > > scan a small piece of the inactive dirty list each time.
>
> So? We need to write them out at some point anyway. Isn't it much better
> to be graceful about it, and allow the writeout to happen in the
> background? The way things _used_ to work, we'd delay the write-out until
> we REALLY had to, which is great for dbench, but is really horrible for
> any normal load.
>
> Think about it - do you really want the system to actively try to reach
> the point where it has no "regular" pages left, and has to start writing
> stuff out (and wait for them synchronously) in order to free up memory?

No, of course not. You're missing my point.

> I strongly feel that the old code was really really wrong - it may
> have been wonderful for throughput, but it had non-repeatable
> behaviour, and easily caused the inactive_dirty list to fill up with
> dirty pages because it unfairly penalized clean pages.

Agreed.

I'm not talking about this specific issue, however.

> You do need to realize that dbench is a really bad benchmark, and should
> not be used as a way to tweak the algorithms.
>
> > > That obviously did not happen with the full scan behaviour.
>
> The new code has no difference between "full scan" and "partial scan". It
> will do the same thing regardless of whether you scan the whole list, as
> it doesn't have any state.
>
> This did NOT happen with the old "launder_loop" state thing, but I think
> you agreed that that code was unreliable and flaky, and caused basically
> random non-LRU behaviour that depended on subtle effects in (a) who called
> it and (b) what the layout of the inactive_dirty list was.

Right.

Please read the explanation above and you will understand that I'm talking
about something else.

> > > With asynchronous i_dirty->i_clean movement (moving a cleaned page to the
> > > clean list at the IO completion handler. Please don't consider that for
> > > 2.4 :)) this would not happen either.
> >
> > Or we could have parallel lists for dirty and clean.
>
> Well, more importantly, do you actually have good reason to believe that
> it is wrong to try to write things out asynchronously?

No. It's not wrong to write things out, Linus. That's not my point,
however.

What I'm trying to tell you is that cleaned (written) memory should be
freed as soon as it gets cleaned. Look:

1M shortage
page_launder() writes out 10M of data
Those 10M get written out (cleaned)
page_launder() writes out 10M of data
Those 10M get written out (cleaned)
...

We are going to find the written out data (which should be freed ASAP,
since it already had enough time to be touched) _too_ late (only when we
loop through the whole inactive dirty list).

Do you see my point?

I already have some code which adds a laundry list -- pages being written
out (by page_launder()) go to the laundry list, and each page_launder()
call will first check for unlocked pages on the laundry list before doing
the usual page_launder() stuff.

As far as I've seen, this has improved things _a lot_, exactly due to the
problem I explained.

I'll post the code as soon as I have some time to clean it up.
* Re: page_launder() on 2.4.9/10 issue

From: Jan Harkes @ 2001-09-04 15:26 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Linus Torvalds, Daniel Phillips, lkml, riel

On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> I already have some code which adds a laundry list -- pages being written
> out (by page_launder()) go to the laundry list, and each page_launder()
> call will first check for unlocked pages on the laundry list before doing
> the usual page_launder() stuff.

NO, please don't add another list to fix the symptoms of bad page aging.

One of the graduate students here at CMU has been looking at the 2.4 VM,
trying to predict the size of the app that can possibly be loaded without
causing the system to start thrashing. To do this he was looking at the
current working set and was using the ages of pages in the page cache as
an indicator, i.e. he is exporting the number of pages of a given age on
the active list through a /proc device.

The results were unpredictable (almost every page was age 0, except for a
few that were MAX_PAGE_AGE) and walking through the source showed why.
Aging is broken. Horribly. As a result, the inactive list is filled with
pages that are not necessarily inactive.

refill_inactive_scan does aging based on the PG_Referenced bit, which is
only set for buffer pages. So on every call to refill_inactive pretty much
all active pages are being aged down aggressively.

The hardware referenced bit is checked in swap_out and ages up. swap_out
walks part of the vm of all processes, and ages up all referenced pages.
However, these pages will immediately get aged down as well by the
following refill_inactive. The recent moving around of refill_inactive in
the 2.4.10-pre4 patch has actually made down aging twice as aggressive.
Down aging is /2, up aging is += 3, so only pages that are referenced more frequently than once a second on a not-loaded system could slowly crawl up. Anything else is at age 0. I've attached a patch against 2.4.10-pre4 that tries to do 2 things, split the up/down aging out of refill_inactive etc. And it crawls _all_ process VM's to copy all hardware referenced bits to the software bit. On a system withoug shortage, pages are only aged up, this is not realy a problem, because as soon as there is some shortage the aggressive down aging pulls pages at MAX_PAGE_AGE down to age 0 within 5 calls. This is just an experimental patch, it probably doesn't work right on all various kinds of CPU's. But at least it gets the aging somewhat better. Oh and it seems to me that the discussion about read-ahead pages is pretty much moot after this patch, they shouldn't push active stuff out of memory. Jan diff -ur linux-2.4.10-pre4/mm/vmscan.c linux/mm/vmscan.c --- linux-2.4.10-pre4/mm/vmscan.c Tue Sep 4 10:55:29 2001 +++ linux/mm/vmscan.c Tue Sep 4 11:04:48 2001 @@ -45,6 +45,165 @@ page->age /= 2; } +/* mm->page_table_lock is held. mmap_sem is not held */ +static void vm_crawl_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end) +{ + pte_t * pte; + unsigned long pmd_end; + + if (pmd_none(*dir)) + return; + if (pmd_bad(*dir)) { + pmd_ERROR(*dir); + pmd_clear(dir); + return; + } + + pte = pte_offset(dir, address); + + pmd_end = (address + PMD_SIZE) & PMD_MASK; + if (end > pmd_end) + end = pmd_end; + + do { + if (pte_present(*pte)) { + struct page *page = pte_page(*pte); + + if (VALID_PAGE(page) && !PageReserved(page) && + ptep_test_and_clear_young(pte)) + { + SetPageReferenced(page); + } + } + address += PAGE_SIZE; + pte++; + } while (address && (address < end)); +} + +/* mm->page_table_lock is held. 
mmap_sem is not held */ +static inline void vm_crawl_pgd(struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end) +{ + pmd_t * pmd; + unsigned long pgd_end; + + if (pgd_none(*dir)) + return; + if (pgd_bad(*dir)) { + pgd_ERROR(*dir); + pgd_clear(dir); + return; + } + + pmd = pmd_offset(dir, address); + + pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK; + if (pgd_end && (end > pgd_end)) + end = pgd_end; + + do { + vm_crawl_pmd(mm, vma, pmd, address, end); + address = (address + PMD_SIZE) & PMD_MASK; + pmd++; + } while (address && (address < end)); +} + +/* mm->page_table_lock is held. mmap_sem is not held */ +static void vm_crawl_vma(struct mm_struct * mm, struct vm_area_struct * vma) +{ + pgd_t *pgdir; + unsigned long end, address; + + /* Skip areas which are locked down */ + if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) + return; + + address = vma->vm_start; + pgdir = pgd_offset(mm, address); + + end = vma->vm_end; + if (address >= end) + BUG(); + do { + vm_crawl_pgd(mm, vma, pgdir, address, end); + address = (address + PGDIR_SIZE) & PGDIR_MASK; + pgdir++; + } while (address && (address < end)); +} + +static void vm_crawl_mm(struct mm_struct * mm) +{ + struct vm_area_struct* vma; + + /* + * Go through process' page directory. + */ + + /* + * Find the proper vm-area after freezing the vma chain + * and ptes. + */ + spin_lock(&mm->page_table_lock); + + for (vma = find_vma(mm, 0); vma; vma = vma->vm_next) + vm_crawl_vma(mm, vma); + + spin_unlock(&mm->page_table_lock); +} + +/* set the software PG_Referenced bit on pages that have been accessed since + * the last scan. */ +static void vm_angel(void) +{ + struct list_head *p; + struct mm_struct *mm; + + /* Walk all mm's */ + spin_lock(&mmlist_lock); + + p = init_mm.mmlist.next; + while (p != &init_mm.mmlist) + { + mm = list_entry(p, struct mm_struct, mmlist); + + /* Make sure the mm doesn't disappear when we drop the lock.. 
*/ + atomic_inc(&mm->mm_users); + spin_unlock(&mmlist_lock); + + vm_crawl_mm(mm); + + /* Grab the lock again */ + spin_lock(&mmlist_lock); + + p = p->next; + mmput(mm); + } + + spin_unlock(&mmlist_lock); +} + +/* Age all pages that on the active list that have their referenced bit set. + * Down aging is only done when do_try_to_free pages fails the first time + * through. kswapd is running too often to get any fair aging behavior + * otherwise and apps that are running when there is no memory pressure should + * in my opinion get a little advantage against the new 'memory hogs' that + * push us into a shortage. */ +void vm_devil(int general_shortage) +{ + struct list_head * p; + struct page * page; + + /* Take the lock while messing with the list... */ + spin_lock(&pagemap_lru_lock); + list_for_each(p, &active_list) { + page = list_entry(p, struct page, lru); + if (PageTestandClearReferenced(page)) + age_page_up(page); + else if (general_shortage) + age_page_down(page); + } + spin_unlock(&pagemap_lru_lock); +} + /* * The swap-out function returns 1 if it successfully * scanned all the pages it was asked to (`count'). @@ -87,6 +246,23 @@ pte_t pte; swp_entry_t entry; + /* Don't look at this page if it's been accessed recently. */ + if (page->mapping && page->age) + return; + +#if 0 /* The problem is that this test makes the system extremely unwilling to + * swap anything out, maybe we're not looking at a large enough part of + * the process VM so basically everything is typically referenced by the + * time we consider swapping out? */ + + /* Pages that have no swap allocated will not be on the active list and + * will not be aged. However their Referenced bit should be set. 
*/ + if (PageTestandClearReferenced(page)) { + page->age = 0; + return; + } +#endif + /* * If we are doing a zone-specific scan, do not * touch pages from zones which don't have a @@ -95,12 +271,6 @@ if (zone_inactive_plenty(page->zone)) return; - /* Don't look at this pte if it's been accessed recently. */ - if (ptep_test_and_clear_young(page_table)) { - age_page_up(page); - return; - } - if (TryLockPage(page)) return; @@ -153,9 +323,12 @@ set_page_dirty(page); goto drop_pte; } + /* - * Check PageDirty as well as pte_dirty: page may - * have been brought back from swap by swapoff. + * Ok, it's really dirty. That means that + * we should either create a new swap cache + * entry for it, or we should write it back + * to its own backing store. */ if (!pte_dirty(pte) && !PageDirty(page)) goto drop_pte; @@ -669,7 +842,6 @@ struct list_head * page_lru; struct page * page; int maxscan = nr_active_pages >> priority; - int page_active = 0; int nr_deactivated = 0; /* Take the lock while messing with the list... */ @@ -690,41 +862,34 @@ * have plenty inactive pages. */ - if (zone_inactive_plenty(page->zone)) { - page_active = 1; + if (zone_inactive_plenty(page->zone)) goto skip_page; - } - /* Do aging on the pages. */ - if (PageTestandClearReferenced(page)) { - age_page_up(page); - page_active = 1; - } else { - age_page_down(page); - /* - * Since we don't hold a reference on the page - * ourselves, we have to do our test a bit more - * strict then deactivate_page(). This is needed - * since otherwise the system could hang shuffling - * unfreeable pages from the active list to the - * inactive_dirty list and back again... - * - * SUBTLE: we can have buffer pages with count 1. - */ - if (page->age == 0 && page_count(page) <= - (page->buffers ? 
2 : 1)) { - deactivate_page_nolock(page); - page_active = 0; - } else { - page_active = 1; - } + /* not much use to inactivate ramdisk pages when page_launder + * simply bounces them back to the active list */ + if (page_ramdisk(page)) + goto skip_page; + + /* + * Since we don't hold a reference on the page + * ourselves, we have to do our test a bit more + * strict then deactivate_page(). This is needed + * since otherwise the system could hang shuffling + * unfreeable pages from the active list to the + * inactive_dirty list and back again... + * + * SUBTLE: we can have buffer pages with count 1. + */ + if (page->age == 0 && page_count(page) <= (page->buffers ? 2 : 1)) { + deactivate_page_nolock(page); } + /* * If the page is still on the active list, move it * to the other end of the list. Otherwise we exit if * we have done enough work. */ - if (page_active || PageActive(page)) { + if (PageActive(page)) { skip_page: list_del(page_lru); list_add(page_lru, &active_list); @@ -820,14 +985,21 @@ #define GENERAL_SHORTAGE 4 static int do_try_to_free_pages(unsigned int gfp_mask, int user) { + /* Always walk at least the active queue when called */ int shortage = 0; int maxtry; + /* make sure to update referenced bits */ + vm_angel(); + /* Always walk at least the active queue when called */ refill_inactive_scan(DEF_PRIORITY); maxtry = 1 << DEF_PRIORITY; do { + /* perform aging of the active list */ + vm_devil(shortage & GENERAL_SHORTAGE); + /* * If needed, we move pages from the active list * to the inactive list. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue

From: Marcelo Tosatti @ 2001-09-04 15:24 UTC (permalink / raw)
To: Jan Harkes; +Cc: Linus Torvalds, Daniel Phillips, lkml, riel

On Tue, 4 Sep 2001, Jan Harkes wrote:
> On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> > I already have some code which adds a laundry list -- pages being written
> > out (by page_launder()) go to the laundry list, and each page_launder()
> > call will first check for unlocked pages on the laundry list before doing
> > the usual page_launder() stuff.
>
> NO, please don't add another list to fix the symptoms of bad page aging.

Please, read my message again. The laundry list is not an attempt to fix
aging. It's just one way to find previously cleaned data faster.

You should have created a new thread with subject "Aging is broken". :)
* Re: page_launder() on 2.4.9/10 issue

From: Jan Harkes @ 2001-09-04 17:14 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel

On Tue, Sep 04, 2001 at 12:24:36PM -0300, Marcelo Tosatti wrote:
> On Tue, 4 Sep 2001, Jan Harkes wrote:
> > On Mon, Sep 03, 2001 at 11:57:09AM -0300, Marcelo Tosatti wrote:
> > > I already have some code which adds a laundry list -- pages being written
> > > out (by page_launder()) go to the laundry list, and each page_launder()
> > > call will first check for unlocked pages on the laundry list before doing
> > > the usual page_launder() stuff.
> >
> > NO, please don't add another list to fix the symptoms of bad page aging.
>
> Please, read my message again.

Sorry, it was a reaction to all the VM nonsense that has been flying
around lately. A lot of the complaints and discussions wouldn't even have
started if we actually moved _inactive_ pages to the inactive list instead
of random pages.

To get back on the thread I jumped into, I totally agree with Linus that
writeout should happen as soon as possible. Probably even as soon as an
inactive dirty page hits the inactive dirty list, which would effectively
turn the inactive dirty list into your laundry list.

Jan
* Re: page_launder() on 2.4.9/10 issue

From: Marcelo Tosatti @ 2001-09-04 15:53 UTC (permalink / raw)
To: Jan Harkes; +Cc: linux-kernel

On Tue, 4 Sep 2001, Jan Harkes wrote:
> Sorry, it was a reaction to all the VM nonsense that has been flying
> around lately. A lot of the complaints and discussions wouldn't even have
> started if we actually moved _inactive_ pages to the inactive list
> instead of random pages.
>
> To get back on the thread I jumped into, I totally agree with Linus that
> writeout should happen as soon as possible. Probably even as soon as an
> inactive dirty page hits the inactive dirty list, which would
> effectively turn the inactive dirty list into your laundry list.

Wrong.

The laundry list is something where in-flight pages stay, so users can
free memory from there as soon as the IO is finished.

Do you see what I mean?
* Re: page_launder() on 2.4.9/10 issue

From: Daniel Phillips @ 2001-09-04 19:33 UTC (permalink / raw)
To: Jan Harkes, Marcelo Tosatti; +Cc: linux-kernel

On September 4, 2001 07:14 pm, Jan Harkes wrote:
> To get back on the thread I jumped into, I totally agree with Linus that
> writeout should happen as soon as possible. Probably even as soon as an
> inactive dirty page hits the inactive dirty list, which would
> effectively turn the inactive dirty list into your laundry list.

No, we don't want that, we need the inactive list as a test of short-term
inactivity. It doesn't make sense to begin the writeout until the page has
made it to the other end of the inactive list. Otherwise you just revert
to "one-hand clock".

--
Daniel
* Re: page_launder() on 2.4.9/10 issue

From: Rik van Riel @ 2001-09-06 11:52 UTC (permalink / raw)
To: Jan Harkes; +Cc: Marcelo Tosatti, linux-kernel

On Tue, 4 Sep 2001, Jan Harkes wrote:
> To get back on the thread I jumped into, I totally agree with Linus
> that writeout should happen as soon as possible.

Nice way to destroy read performance.

As DaveM noted so nicely in his reverse mapping patch (at the end of the
2.3 series), dirty pages get moved to the laundry list and the washing
machine will deal with them when we have a full load.

Let's face it, spinning the washing machine is expensive and running less
than a full load makes things inefficient ;)

cheers,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: page_launder() on 2.4.9/10 issue

From: Daniel Phillips @ 2001-09-06 12:31 UTC (permalink / raw)
To: Rik van Riel, Jan Harkes; +Cc: Marcelo Tosatti, linux-kernel

On September 6, 2001 01:52 pm, Rik van Riel wrote:
> On Tue, 4 Sep 2001, Jan Harkes wrote:
> > To get back on the thread I jumped into, I totally agree with Linus
> > that writeout should be as soon as possible.
>
> Nice way to destroy read performance.

Blindly delaying all the writes in the name of better read performance
isn't the right idea either. Perhaps we should have a good think about
some sensible mechanism for balancing reads against writes.

> As DaveM noted so nicely in his reverse mapping patch (at the end of the
> 2.3 series), dirty pages get moved to the laundry list and the washing
> machine will deal with them when we have a full load.
>
> Lets face it, spinning the washing machine is expensive
> and running less than a full load makes things inefficient ;)

That makes a good sound bite but doesn't stand up to scrutiny. It's not a
washing machine ;-)

--
Daniel
* Re: page_launder() on 2.4.9/10 issue

From: Rik van Riel @ 2001-09-06 12:32 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel

On Thu, 6 Sep 2001, Daniel Phillips wrote:
> On September 6, 2001 01:52 pm, Rik van Riel wrote:
> > On Tue, 4 Sep 2001, Jan Harkes wrote:
> > > To get back on the thread I jumped into, I totally agree with Linus
> > > that writeout should be as soon as possible.
> >
> > Nice way to destroy read performance.
>
> Blindly delaying all the writes in the name of better read performance
> isn't the right idea either. Perhaps we should have a good think
> about some sensible mechanism for balancing reads against writes.

Absolutely, delaying writes for too long is just as bad, we need
something in-between.

> > Lets face it, spinning the washing machine is expensive
> > and running less than a full load makes things inefficient ;)
>
> That makes a good sound bite but doesn't stand up to scrutiny.
> It's not a washing machine ;-)

Two words: "IO clustering".

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 12:32 ` Rik van Riel @ 2001-09-06 12:53 ` Daniel Phillips 2001-09-06 13:03 ` Rik van Riel 0 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 12:53 UTC (permalink / raw) To: Rik van Riel; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel On September 6, 2001 02:32 pm, Rik van Riel wrote: > > > Lets face it, spinning the washing machine is expensive > > > and running less than a full load makes things inefficient ;) > > > > That makes a good sound bite but doesn't stand up to scrutiny. > > It's not a washing machine ;-) > > Two words: "IO clustering". Yes, *after* the IO queue is fully loaded that makes sense. Leaving it partly or fully idle while waiting for it to load up makes no sense at all. IO clustering will happen naturally after the queue loads up. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 12:53 ` Daniel Phillips @ 2001-09-06 13:03 ` Rik van Riel 2001-09-06 13:18 ` Kurt Garloff 0 siblings, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-09-06 13:03 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Daniel Phillips wrote: > On September 6, 2001 02:32 pm, Rik van Riel wrote: > > Two words: "IO clustering". > > Yes, *after* the IO queue is fully loaded that makes sense. Leaving it > partly or fully idle while waiting for it to load up makes no sense at all. > > IO clustering will happen naturally after the queue loads up. Exactly, so we need to give the queue some time to load up, right ? Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:03 ` Rik van Riel @ 2001-09-06 13:18 ` Kurt Garloff 2001-09-06 13:23 ` Rik van Riel ` (3 more replies) 0 siblings, 4 replies; 79+ messages in thread From: Kurt Garloff @ 2001-09-06 13:18 UTC (permalink / raw) To: Rik van Riel; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel [-- Attachment #1: Type: text/plain, Size: 931 bytes --] On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote: > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > On September 6, 2001 02:32 pm, Rik van Riel wrote: > > > Two words: "IO clustering". > > > > Yes, *after* the IO queue is fully loaded that makes sense. Leaving it > > partly or fully idle while waiting for it to load up makes no sense at all. > > > > IO clustering will happen naturally after the queue loads up. > > Exactly, so we need to give the queue some time to load > up, right ? Just use two limits: * Time: After some time (say two seconds), we can always afford to write it out * Queue filling: After it has a certain size, it's worth doing a writing. Regards, -- Kurt Garloff <garloff@suse.de> Eindhoven, NL GPG key: See mail header, key servers Linux kernel development SuSE GmbH, Nuernberg, DE SCSI, Security [-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
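Kurt's two-limit rule can be sketched as a tiny decision helper. Everything here (the names, the age in seconds, and both constants) is an illustrative assumption, not code from the 2.4 tree:

```c
#include <assert.h>

/* Hypothetical sketch of the two-limit flush policy: start writeout
 * once either limit trips.  Both constants are invented. */
#define FLUSH_AGE_LIMIT   2UL  /* seconds: after this we can always afford the write */
#define FLUSH_SIZE_LIMIT 32UL  /* pages: a load big enough to be worth a seek */

static int should_flush(unsigned long oldest_age, unsigned long queued_pages)
{
    if (oldest_age >= FLUSH_AGE_LIMIT)
        return 1;  /* time limit: the data has waited long enough */
    if (queued_pages >= FLUSH_SIZE_LIMIT)
        return 1;  /* size limit: the queue has loaded up enough */
    return 0;
}
```

Alan's reply below illustrates why any fixed pair of constants like this is debatable.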
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:18 ` Kurt Garloff @ 2001-09-06 13:23 ` Rik van Riel 2001-09-06 13:28 ` Alan Cox ` (2 subsequent siblings) 3 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-06 13:23 UTC (permalink / raw) To: Kurt Garloff; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Kurt Garloff wrote: > > Exactly, so we need to give the queue some time to load > > up, right ? > > Just use two limits: > * Time: After some time (say two seconds), we can always afford to write it > out > * Queue filling: After it has a certain size, it's worth doing a writing. Sounds good to me. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:18 ` Kurt Garloff 2001-09-06 13:23 ` Rik van Riel @ 2001-09-06 13:28 ` Alan Cox 2001-09-06 13:29 ` Rik van Riel 2001-09-06 16:45 ` Daniel Phillips 2001-09-06 17:35 ` Mike Fedyk 3 siblings, 1 reply; 79+ messages in thread From: Alan Cox @ 2001-09-06 13:28 UTC (permalink / raw) To: Kurt Garloff Cc: Rik van Riel, Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel > Just use two limits: > * Time: After some time (say two seconds), we can always afford to write it > out > * Queue filling: After it has a certain size, it's worth doing a writing. Both debatable and both I can find counter cases for - think about a shared memory database with multiple game clients using it (eg the older AberMUD codebase). Writing that to disk is counterproductive. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:28 ` Alan Cox @ 2001-09-06 13:29 ` Rik van Riel 0 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-06 13:29 UTC (permalink / raw) To: Alan Cox Cc: Kurt Garloff, Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Alan Cox wrote: > > Just use two limits: > > * Time: After some time (say two seconds), we can always afford to write it > > out > > * Queue filling: After it has a certain size, it's worth doing a writing. > > Both debatable and both I can find counter cases for - think about a > shared memory database with multiple game clients using it (eg the > older AberMUD codebase). Writing that to disk is counterproductive This is only for pages on the inactive_dirty list, though; ie pages we want to evict from memory with the minimal amount of work possible ;) regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:18 ` Kurt Garloff 2001-09-06 13:23 ` Rik van Riel 2001-09-06 13:28 ` Alan Cox @ 2001-09-06 16:45 ` Daniel Phillips 2001-09-06 16:57 ` Rik van Riel 2001-09-06 17:35 ` Mike Fedyk 3 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 16:45 UTC (permalink / raw) To: Kurt Garloff, Rik van Riel; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel On September 6, 2001 03:18 pm, Kurt Garloff wrote: > On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote: > > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > > On September 6, 2001 02:32 pm, Rik van Riel wrote: > > > > Two words: "IO clustering". > > > > > > Yes, *after* the IO queue is fully loaded that makes sense. Leaving it > > > partly or fully idle while waiting for it to load up makes no sense at all. > > > > > > IO clustering will happen naturally after the queue loads up. > > > > Exactly, so we need to give the queue some time to load > > up, right ? > > Just use two limits: > * Time: After some time (say two seconds), we can always afford to write it > out > * Queue filling: After it has a certain size, it's worth doing a writing. Err, not quite the whole story. It is *never* right to leave the disk sitting idle while there are dirty, writable IO buffers. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 16:45 ` Daniel Phillips @ 2001-09-06 16:57 ` Rik van Riel 2001-09-06 17:22 ` Daniel Phillips 0 siblings, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-09-06 16:57 UTC (permalink / raw) To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Daniel Phillips wrote: > Err, not quite the whole story. It is *never* right to leave the disk > sitting idle while there are dirty, writable IO buffers. Define "idle" ? Is idle the time it takes between two readahead requests to be issued, delaying the second request because you just moved the disk arm away ? Is idle when we haven't had a request for, say, 3 disk seek time periods ? Is idle when we won't be getting any request soon for the area where the disk arm is hanging out ? (and how do we know the future?) regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 16:57 ` Rik van Riel @ 2001-09-06 17:22 ` Daniel Phillips 2001-09-06 19:25 ` Rik van Riel 0 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 17:22 UTC (permalink / raw) To: Rik van Riel; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel On September 6, 2001 06:57 pm, Rik van Riel wrote: > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > > Err, not quite the whole story. It is *never* right to leave the disk > > sitting idle while there are dirty, writable IO buffers. > > Define "idle" ? Idle = not doing anything. IO queue is empty. > Is idle the time it takes between two readahead requests > to be issued, delaying the second request because you > just moved the disk arm away ? Which two readahead requests? It's idle. > Is idle when we haven't had a request for, say, 3 disk > seek time periods ? See above definition of idle. > Is idle when we won't be getting any request soon for the > area where the disk arm is hanging out ? (and how do we > know the future?) -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 17:22 ` Daniel Phillips @ 2001-09-06 19:25 ` Rik van Riel 2001-09-06 19:45 ` Daniel Phillips 0 siblings, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-09-06 19:25 UTC (permalink / raw) To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Daniel Phillips wrote: > On September 6, 2001 06:57 pm, Rik van Riel wrote: > > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > > > > Err, not quite the whole story. It is *never* right to leave the disk > > > sitting idle while there are dirty, writable IO buffers. > > > > Define "idle" ? > > Idle = not doing anything. IO queue is empty. > > > Is idle the time it takes between two readahead requests > > to be issued, delaying the second request because you > > just moved the disk arm away ? > > Which two readahead requests? It's idle. OK, in this case I disagree with you ;) Disk seek time takes ages, as much as 10 milliseconds. I really don't think it's good to move the disk arm away from the data we are reading just to write out this one disk block. Going 20 milliseconds out of our way to write out a single block really can't be worth it in any scenario I can imagine. OTOH, flushing out 64 or 128 kB at once (or some fraction of the inactive list, say 5%?) almost certainly is worth it in many cases. regards, Rik -- IA64: a worthy successor to the i860. http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 79+ messages in thread
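The batch sizing Rik suggests can be written down directly; the 5% fraction, the 4kB page size and the function name are assumptions taken from the discussion, not kernel code:

```c
#include <assert.h>

/* Sketch of the suggested write batch: at least 64kB worth of pages,
 * or ~5% of the inactive dirty list, whichever is larger. */
#define PAGE_KB 4UL

static unsigned long flush_batch_pages(unsigned long inactive_dirty)
{
    unsigned long floor = 64 / PAGE_KB;        /* the 64kB minimum */
    unsigned long frac  = inactive_dirty / 20; /* ~5% of the list */
    return frac > floor ? frac : floor;
}
```

With a batch of this size queued contiguously, the cost of the seek is amortized over many blocks instead of being paid per block.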
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 19:25 ` Rik van Riel @ 2001-09-06 19:45 ` Daniel Phillips 2001-09-06 19:52 ` Rik van Riel 2001-09-06 19:53 ` Mike Fedyk 0 siblings, 2 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 19:45 UTC (permalink / raw) To: Rik van Riel; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel On September 6, 2001 09:25 pm, Rik van Riel wrote: > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > On September 6, 2001 06:57 pm, Rik van Riel wrote: > > > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > > > > > > Err, not quite the whole story. It is *never* right to leave the disk > > > > sitting idle while there are dirty, writable IO buffers. > > > > > > Define "idle" ? > > > > Idle = not doing anything. IO queue is empty. > > > > > Is idle the time it takes between two readahead requests > > > to be issued, delaying the second request because you > > > just moved the disk arm away ? > > > > Which two readahead requests? It's idle. > > OK, in this case I disagree with you ;) > > Disk seek time takes ages, as much as 10 milliseconds. > > I really don't think it's good to move the disk arm away > from the data we are reading just to write out this one > disk block. > > Going 20 milliseconds out of our way to write out a single > block really can't be worth it in any scenario I can imagine. > > OTOH, flushing out 64 or 128 kB at once (or some fraction of > the inactive list, say 5%?) almost certainly is worth it in > many cases. Again, I have to ask, which reads are you interfering with? Ones that haven't happened yet? Remember, the disk is idle. So *at worst* you are going to get one extra seek before getting hit with the tidal wave of reads you seem to be worried about. This simply isn't significant. I've tested this, I know early writeout under light load is a win. What we should be worrying about is how to balance reads against writes under heavy load. 
-- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 19:45 ` Daniel Phillips @ 2001-09-06 19:52 ` Rik van Riel 2001-09-07 0:32 ` Kurt Garloff 2001-09-06 19:53 ` Mike Fedyk 1 sibling, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-09-06 19:52 UTC (permalink / raw) To: Daniel Phillips; +Cc: Kurt Garloff, Jan Harkes, Marcelo Tosatti, linux-kernel On Thu, 6 Sep 2001, Daniel Phillips wrote: > Again, I have to ask, which reads are you interfering with? Ones that > haven't happened yet? Remember, the disk is idle. So *at worst* you are > going to get one extra seek before getting hit with the tidal wave of reads > you seem to be worried about. This simply isn't significant. > > I've tested this, I know early writeout under light load is a win. Other people have tested this too, and light writeout of small blocks destroys the performance of a heavy read load. > What we should be worrying about is how to balance reads against > writes under heavy load. Exactly. We need to make sure we're efficient when the system is under heavy read load and light write load. This kind of load is very common in servers, especially web, ftp or news servers. regards, Rik -- IA64: a worthy successor to the i860. http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 19:52 ` Rik van Riel @ 2001-09-07 0:32 ` Kurt Garloff 0 siblings, 0 replies; 79+ messages in thread From: Kurt Garloff @ 2001-09-07 0:32 UTC (permalink / raw) To: Rik van Riel; +Cc: Daniel Phillips, Jan Harkes, Marcelo Tosatti, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2082 bytes --] On Thu, Sep 06, 2001 at 04:52:05PM -0300, Rik van Riel wrote: > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > Again, I have to ask, which reads are you interfering with? Ones that > > haven't happened yet? Remember, the disk is idle. So *at worst* you are > > going to get one extra seek before getting hit with the tidal wave of reads > > you seem to be worried about. This simply isn't significant. > > > > I've tested this, I know early writeout under light load is a win. > > Other people have tested this too, and light writeout of > small blocks destroys the performance of a heavy read > load. Then just don't take two hard limits, but make an easy mathematical function of time and blocks to write (monotonic and with positive slope in both) and start to write all blocks once we exceed a certain limit. So, if you produce very few dirty inactive pages, it'll only happen every thirty seconds, e.g., at moderate loads, it may happen every 4 seconds and at higher loads it may even happen a couple of times per second. Think of a function like t + t*b + b, with appropriate scaling, so we reach the threshold either after a long time alone, because of many dirty inactive pages alone or because a combination of both. Tuning should be such that under normal workloads, the combination of time times pages should be the most significant term. (The chance that you run into memory pressure because of too many dirty pages this way is lower than before, but if it happens, you can adjust your function or the threshold to flush more pages.) 
If you are very concerned about read performance suffering from this, you may even monitor reads and adjust the threshold according to read load. (Or just make your function include this variable with a negative slope.) I believe it won't be necessary though. Regards, -- Kurt Garloff <garloff@suse.de> Eindhoven, NL GPG key: See mail header, key servers Linux kernel development SuSE GmbH, Nuernberg, DE SCSI, Security [-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --] ^ permalink raw reply [flat|nested] 79+ messages in thread
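Kurt's combined criterion f(t, b) = t + t*b + b can be sketched as follows; the threshold and the scaling of t and b are invented for illustration:

```c
#include <assert.h>

/* Sketch of the monotonic pressure function from the mail above:
 * t is the age of the oldest dirty inactive page, b the number of
 * such pages.  The threshold is an invented scaling constant. */
#define FLUSH_THRESHOLD 1000UL

static unsigned long flush_pressure(unsigned long t, unsigned long b)
{
    return t + t * b + b;
}

static int start_writeout(unsigned long t, unsigned long b)
{
    return flush_pressure(t, b) >= FLUSH_THRESHOLD;
}
```

The mixed term t*b dominates under normal load, matching the tuning goal stated above; a read-load monitor could be folded in as an extra term with negative slope, as also suggested above.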
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 19:45 ` Daniel Phillips 2001-09-06 19:52 ` Rik van Riel @ 2001-09-06 19:53 ` Mike Fedyk 1 sibling, 0 replies; 79+ messages in thread From: Mike Fedyk @ 2001-09-06 19:53 UTC (permalink / raw) To: linux-kernel On Thu, Sep 06, 2001 at 09:45:35PM +0200, Daniel Phillips wrote: > What we should be worrying about is how to balance reads against writes under > heavy load. > Yes, I agree. You can have a process that is at a 19 niceness level that doesn't do much processing, but a lot of disk access bring your system down to a crawl. Improvement in this area would be nice. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:18 ` Kurt Garloff ` (2 preceding siblings ...) 2001-09-06 16:45 ` Daniel Phillips @ 2001-09-06 17:35 ` Mike Fedyk 3 siblings, 0 replies; 79+ messages in thread From: Mike Fedyk @ 2001-09-06 17:35 UTC (permalink / raw) To: linux-kernel Cc: Kurt Garloff, Rik van Riel, Daniel Phillips, Jan Harkes, Marcelo Tosatti On Thu, Sep 06, 2001 at 03:18:27PM +0200, Kurt Garloff wrote: > On Thu, Sep 06, 2001 at 10:03:03AM -0300, Rik van Riel wrote: > > On Thu, 6 Sep 2001, Daniel Phillips wrote: > > > On September 6, 2001 02:32 pm, Rik van Riel wrote: > > > > Two words: "IO clustering". > > > > > > Yes, *after* the IO queue is fully loaded that makes sense. Leaving it > > > partly or fully idle while waiting for it to load up makes no sense at all. > > > > > > IO clustering will happen naturally after the queue loads up. > > > > Exactly, so we need to give the queue some time to load > > up, right ? > > Just use two limits: > * Time: After some time (say two seconds), we can always afford to write it > out > * Queue filling: After it has a certain size, it's worth doing a writing. > Correct me if I'm wrong, but aren't these two settings tunable in bdflush? If not, then how exactly does bdflush interact with this? ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 11:52 ` Rik van Riel 2001-09-06 12:31 ` Daniel Phillips @ 2001-09-06 13:10 ` Stephan von Krawczynski 2001-09-06 13:23 ` Alex Bligh - linux-kernel ` (3 more replies) 1 sibling, 4 replies; 79+ messages in thread From: Stephan von Krawczynski @ 2001-09-06 13:10 UTC (permalink / raw) To: Daniel Phillips; +Cc: riel, jaharkes, marcelo, linux-kernel On Thu, 6 Sep 2001 14:31:32 +0200 Daniel Phillips <phillips@bonn-fries.net> wrote: > On September 6, 2001 01:52 pm, Rik van Riel wrote: > > On Tue, 4 Sep 2001, Jan Harkes wrote: > > > > > To get back on the thread I jumped into, I totally agree with Linus > > > that writeout should be as soon as possible. > > > > Nice way to destroy read performance. > > Blindly delaying all the writes in the name of better read performance isn't > the right idea either. Perhaps we should have a good think about some > sensible mechanism for balancing reads against writes. I guess I have the real-world proof for that: Yesterday I mastered a CD (around 700 MB) and burned it, I left the equipment to get some food and sleep (sometimes needed :-). During this time the machine acts as nfs-server and gets about 3 GB of data written to it. Coming back today I recognise that deleting the CD image made yesterday frees up about 500 MB of physical mem (free mem was very low before). It was obviously held 24 hours for no reason, and _not_ (as one would expect) exchanged against the nfs-data. This means the caches were full with _old_ data and explains why nfs performance has remarkably dropped since 2.2. There is too few mem around to get good performance (no matter if read or write). Obviously aging did not work at all, there was not a single hit on these (CD image) pages during 24 hours, compared to lots on the nfs-data. Even if the nfs-data would only have one single hit, the old CD image should have been removed, because it is inactive and _older_. 
> > As DaveM noted so > > nicely in his reverse mapping patch (at the end of the > > 2.3 series), dirty pages get moved to the laundry list > > and the washing machine will deal with them when we have > > a full load. > > > > Lets face it, spinning the washing machine is expensive > > and running less than a full load makes things inefficient ;) I guess this is what people writing w*ndows screen blankers thought, too ;-) Sorry for this comment, couldn't resist :-) Stephan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:10 ` Stephan von Krawczynski @ 2001-09-06 13:23 ` Alex Bligh - linux-kernel 2001-09-06 13:54 ` M. Edward Borasky 2001-09-06 13:42 ` Stephan von Krawczynski ` (2 subsequent siblings) 3 siblings, 1 reply; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 13:23 UTC (permalink / raw) To: Stephan von Krawczynski, Daniel Phillips Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel --On Thursday, September 06, 2001 3:10 PM +0200 Stephan von Krawczynski <skraw@ithnet.com> wrote: > Obviously aging did not work at all, > there was not a single hit on these (CD image) pages during 24 hours, > compared to lots on the nfs-data. If there's no memory pressure, data stays in InactiveDirty, caches, etc., forever. What makes you think more memory would have helped the NFS performance? It's possible these all were served out of caches too. -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: page_launder() on 2.4.9/10 issue 2001-09-06 13:23 ` Alex Bligh - linux-kernel @ 2001-09-06 13:54 ` M. Edward Borasky 2001-09-06 14:39 ` Alan Cox 2001-09-06 17:33 ` Daniel Phillips 0 siblings, 2 replies; 79+ messages in thread From: M. Edward Borasky @ 2001-09-06 13:54 UTC (permalink / raw) To: linux-kernel I'm relatively new to the Linux kernel world and even newer to the list, so forgive me if I'm asking a silly question or making a silly comment. It seems to me, from what I've seen of this discussion so far, that the only way one "tunes" Linux kernels at the moment is by changing code and rebuilding the kernel. That is, there are few "tunables" that one can set, based on one's circumstances, to optimize kernel performance for a specific application or environment. Every other operating system that I've done performance tuning on, starting with Xerox CP-V in 1974, had such tunables and tools to set them. And quite often, some of the tuning parameters can be set "on the fly", simply by knowing the correct memory location to set and poking a new value into it. No one "memory management scheme", for example, can be all things to all tasks, and it seems to me that giving users tools to measure and control the behavior of memory management, *preferably without having to recompile and reboot*, should be a major priority if Linux is to succeed in a wide variety of applications. OK, I'll get off my soapbox now, and ask a related question. Is there a mathematical model of the Linux kernel somewhere that I could get my hands on? -- M. Edward (Ed) Borasky, Chief Scientist, Borasky Research http://www.borasky-research.net http://www.aracnet.com/~znmeb mailto:znmeb@borasky-research.com mailto:znmeb@aracnet.com Stand-Up Comedy: Because Man Does Not Live By Dread Alone ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:54 ` M. Edward Borasky @ 2001-09-06 14:39 ` Alan Cox 2001-09-06 16:20 ` Victor Yodaiken 2001-09-06 17:33 ` Daniel Phillips 1 sibling, 1 reply; 79+ messages in thread From: Alan Cox @ 2001-09-06 14:39 UTC (permalink / raw) To: M. Edward Borasky; +Cc: linux-kernel > forgive me if I'm asking a silly question or making a silly comment. It > seems to me, from what I've seen of this discussion so far, that the only > way one "tunes" Linux kernels at the moment is by changing code and > rebuilding the kernel. That is, there are few "tunables" that one can set, > based on one's circumstances, to optimize kernel performance for a specific > application or environment. There are a lot of tunables in /proc/sys. An excellent tool for playing with them is "powertweak". > No one "memory management scheme", for example, can be all things to all > tasks, and it seems to me that giving users tools to measure and control the > behavior of memory management, *preferably without having to recompile and > reboot*, should be a major priority if Linux is to succeed in a wide variety > of applications. The VM is tunable in the -ac tree. I still believe the VM can and should be self tuning but we are not there yet. > OK, I'll get off my soapbox now, and ask a related question. Is there a > mathematical model of the Linux kernel somewhere that I could get my hands > on? Not that I am aware of. Alan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 14:39 ` Alan Cox @ 2001-09-06 16:20 ` Victor Yodaiken 0 siblings, 0 replies; 79+ messages in thread From: Victor Yodaiken @ 2001-09-06 16:20 UTC (permalink / raw) To: Alan Cox; +Cc: M. Edward Borasky, linux-kernel On Thu, Sep 06, 2001 at 03:39:17PM +0100, Alan Cox wrote: > > OK, I'll get off my soapbox now, and ask a related question. Is there a > > mathematical model of the Linux kernel somewhere that I could get my hands > > on? > > Not that I am aware of. A mathematical model of the Linux kernel would be a major scientific advance. > > Alan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:54 ` M. Edward Borasky 2001-09-06 14:39 ` Alan Cox @ 2001-09-06 17:33 ` Daniel Phillips 1 sibling, 0 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 17:33 UTC (permalink / raw) To: M. Edward Borasky, linux-kernel On September 6, 2001 03:54 pm, M. Edward Borasky wrote: > I'm relatively new to the Linux kernel world and even newer to the list, so > forgive me if I'm asking a silly question or making a silly comment. It > seems to me, from what I've seen of this discussion so far, that the only > way one "tunes" Linux kernels at the moment is by changing code and > rebuilding the kernel. That is, there are few "tunables" that one can set, > based on one's circumstances, to optimize kernel performance for a specific > application or environment. > > Every other operating system that I've done performance tuning on, starting > with Xerox CP-V in 1974, had such tunables and tools to set them. And quite > often, some of the tuning parameters can be set "on the fly", simply by > knowing the correct memory location to set and poking a new value into it. We typically use proc for this, sometimes combined with an ioctl. Some of these settings are standard in the kernel (bdflush, others) but more often you will have to apply a patch. > No one "memory management scheme", for example, can be all things to all > tasks, and it seems to me that giving users tools to measure and control the > behavior of memory management, *preferably without having to recompile and > reboot*, should be a major priority if Linux is to succeed in a wide variety > of applications. Linus doesn't seem to like like having tuning knobs appear where a better algorithm should be used instead. Leaving the knobs out makes people work harder to come up with solutions that don't need them. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:10 ` Stephan von Krawczynski 2001-09-06 13:23 ` Alex Bligh - linux-kernel @ 2001-09-06 13:42 ` Stephan von Krawczynski 2001-09-06 14:01 ` Alex Bligh - linux-kernel 2001-09-06 14:39 ` Stephan von Krawczynski 2001-09-06 17:51 ` Daniel Phillips 2001-09-07 12:30 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski 3 siblings, 2 replies; 79+ messages in thread From: Stephan von Krawczynski @ 2001-09-06 13:42 UTC (permalink / raw) To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel On Thu, 06 Sep 2001 14:23:58 +0100 Alex Bligh - linux-kernel <linux-kernel@alex.org.uk> wrote: > > > --On Thursday, September 06, 2001 3:10 PM +0200 Stephan von Krawczynski > <skraw@ithnet.com> wrote: > > > Obviously aging did not work at all, > > there was not a single hit on these (CD image) pages during 24 hours, > > compared to lots on the nfs-data. > > If there's no memory pressure, data stays in InactiveDirty, caches, > etc., forever. What makes you think more memory would have helped > the NFS performance? It's possible these all were served out of caches > too. Negative. Switching off export-option "no_subtree_check" (which basically leads to more small allocs during nfs action) shows immediately mem failures and truncated files on the server and stale nfs handles on the client. So the system _is_ under pressure. This exactly made me start (my branch of) the discussion. Besides I would really like to know what usable _data_ is in these pages, as I cannot see which application should hold it (the CD stuff was quite "long ago"). FS should have sync'ed several times, too. Stephan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:42 ` Stephan von Krawczynski @ 2001-09-06 14:01 ` Alex Bligh - linux-kernel 2001-09-06 14:39 ` Stephan von Krawczynski 1 sibling, 0 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 14:01 UTC (permalink / raw) To: Stephan von Krawczynski, Alex Bligh - linux-kernel Cc: phillips, riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel >> If there's no memory pressure, data stays in InactiveDirty, caches, >> etc., forever. What makes you think more memory would have helped >> the NFS performance? It's possible these all were served out of caches >> too. > > Negative. Switching off export-option "no_subtree_check" (which basically > leads to more small allocs during nfs action) shows immediately mem > failures and truncated files on the server and stale nfs handles on the > client. So the system _is_ under pressure. This exactly made me start (my > branch of) the discussion. > Besides I would really like to know what useable _data_ is in these > pages, as I cannot see which application should hold it (the CD stuff was > quit "long ago"). FS should have sync'ed several times, too. Yes, but this is because VM system's targets & pressure calcs do not take into account fragmentation of the underlying physical memory. IE, in theory you could have half your memory free, but not be able to allocate a single 8k block. Nothing would cause cache, or InactiveDirty stuff to be written. You yourself proved this, by switching rsize,wsize to 1k and said it all worked fine! (unless I misread your email). The other potential problem is that if the memory requirement is all extremely bursty and without __GFP_WAIT (i.e. allocated GFP_ATOMIC) then it is conceivable you need a whole pile of memory allocated before the system has time to retrieve it from things which require locks, I/O, etc. However, I suspect this isn't the problem. 
Put my instrumentation patch on, and if I'm right you'll see something like the following, but worse. Look at 32kB allocations (order 3, which is what I think you said was failing), and look at the % fragmentation. This is the % of free memory which cannot be allocated as (in this case) contiguous 32kB chunks (as it's all in smaller blocks). As this approaches 100, the VM system is going to think 'no memory pressure' and not free up pages, but you are going to be unable to allocate. The second of these examples was after a single bonnie run, a sync, and 5 minutes of idle activity. Note that in this example, a few order 4 allocations which required DMA would fail, though the VM system would see plenty of memory. And they will continue failing. I think what you want isn't more memory, it's less fragmented memory. Or an underlying system which can cope with fragmentation. -- Alex Bligh

Before

$ cat /proc/memareas
Zone       4kB   8kB  16kB  32kB  64kB 128kB 256kB 512kB 1024kB 2048kB   Tot Pages/kb
DMA          2     2     4     3     3     3     1     1      0      6 =  3454
 @frag      0%    0%    0%    1%    1%    3%    6%    7%    11%    11%   13816kB
Normal       0     0     6    29    18     8     4     0      1     23 = 13088
 @frag      0%    0%    0%    0%    2%    4%    6%    8%     8%    10%   52352kB
HighMem = 0kB - zero size zone

After

$ cat /proc/memareas
Zone       4kB   8kB  16kB  32kB  64kB 128kB 256kB 512kB 1024kB 2048kB   Tot Pages/kb
DMA        522   382   210    53     8     2     1     0      0      0 =  2806
 @frag      0%   19%   46%   76%   91%   95%   98%  100%   100%   100%   11224kB
Normal       0  1155  1656   756   163    29     0     1      0      0 = 18646
 @frag      0%    0%   12%   48%   80%   94%   99%   99%   100%   100%   74584kB
                              ^^^ Order 3
HighMem = 0kB - zero size zone

^ permalink raw reply [flat|nested] 79+ messages in thread
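For reference, the @frag percentages in this kind of output can be recomputed from the free-block counts alone. The following is a hypothetical sketch (not Alex's actual patch code; the helper name is invented): for each block size it reports the share of free memory held in smaller blocks, i.e. free memory that cannot satisfy an allocation of that size.

```python
# Sketch: derive the "@frag" row from per-order free-block counts.
SIZES_KB = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

def frag_percent(free_counts):
    """free_counts[i] = number of free blocks of size SIZES_KB[i]."""
    total_kb = sum(n * s for n, s in zip(free_counts, SIZES_KB))
    percents = []
    below_kb = 0  # free memory locked up in blocks smaller than current size
    for n, s in zip(free_counts, SIZES_KB):
        percents.append(round(100 * below_kb / total_kb) if total_kb else 0)
        below_kb += n * s
    return percents

# Normal zone, "After" snapshot from the output above:
after_normal = [0, 1155, 1656, 756, 163, 29, 0, 1, 0, 0]
print(frag_percent(after_normal))
# -> [0, 0, 12, 48, 80, 94, 99, 99, 100, 100], matching the @frag row:
# at order 3 (32kB), 48% of the free memory is useless for the allocation.
```

The `round()` here matches the rounding the quoted output appears to use; the totals (74584kB free, 18646 pages) also check out against the listed counts.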
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:42 ` Stephan von Krawczynski 2001-09-06 14:01 ` Alex Bligh - linux-kernel @ 2001-09-06 14:39 ` Stephan von Krawczynski 2001-09-06 15:02 ` Alex Bligh - linux-kernel 2001-09-06 15:10 ` Stephan von Krawczynski 1 sibling, 2 replies; 79+ messages in thread From: Stephan von Krawczynski @ 2001-09-06 14:39 UTC (permalink / raw) To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel On Thu, 06 Sep 2001 15:01:49 +0100 Alex Bligh - linux-kernel <linux-kernel@alex.org.uk> wrote: > Yes, but this is because the VM system's targets & pressure calcs do not > take into account fragmentation of the underlying physical memory. > IE, in theory you could have half your memory free, but > not be able to allocate a single 8k block. Nothing would cause > cache, or InactiveDirty stuff to be written. Which is obviously not the right way to go. I guess we agree on that. > You yourself proved this, by switching rsize,wsize to 1k and said > it all worked fine! (unless I misread your email). Sorry, misunderstanding: I did not touch rsize/wsize. What I do is to lower fs action by not letting knfsd walk through the subtrees of a mounted fs. This leads to less allocs/frees by the fs layer which tend to fail and let knfsd fail afterwards. > [...] > I think what you want isn't more memory, it's less > fragmented memory. This is one important part for sure. > Or an underlying system which can > cope with fragmentation. Well, I'd rather prefer the cure than the dope :-) Regards, Stephan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 14:39 ` Stephan von Krawczynski @ 2001-09-06 15:02 ` Alex Bligh - linux-kernel 2001-09-06 15:07 ` Rik van Riel 2001-09-06 15:10 ` Stephan von Krawczynski 1 sibling, 1 reply; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 15:02 UTC (permalink / raw) To: Stephan von Krawczynski, Alex Bligh - linux-kernel Cc: phillips, riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel Stephan, >> Yes, but this is because the VM system's targets & pressure calcs do not >> take into account fragmentation of the underlying physical memory. >> IE, in theory you could have half your memory free, but >> not be able to allocate a single 8k block. Nothing would cause >> cache, or InactiveDirty stuff to be written. > > Which is obviously not the right way to go. I guess we agree on that. Well, I agree that this is not desirable. I am not sure whether the right course is (a) to avoid getting here, (b) to do traditional page_launder() stuff, i.e. write stuff out, and hope that fixes it, (c) to actively go defragment (Daniel P's preferred approach), or (d) some combination of the above. >> You yourself proved this, by switching rsize,wsize to 1k and said >> it all worked fine! (unless I misread your email). > > Sorry, misunderstanding: I did not touch rsize/wsize. What I do is to lower fs > action by not letting knfsd walk through the subtrees of a mounted fs. This > leads to less allocs/frees by the fs layer which tend to fail and let knfsd fail > afterwards. OK, I'm getting confused. I'm looking at stuff you sent like:

Aug 29 13:43:34 admin kernel: pid=1207; __alloc_pages(gfp=0x20, order=3, ...)
Aug 29 13:43:34 admin kernel: Call Trace: [_alloc_pages+22/24] [__get_free_pages+10/24] [<fdcec826>] [<fdcec8f5>] [<fdceb7d7>]
Aug 29 13:43:34 admin kernel: [<fdcec0f5>] [<fdcea589>] [ip_local_deliver_finish+0/368] [nf_hook_slow+272/404] [ip_rcv_finish+0/480] [ip_local_deliver+436/444]
Aug 29 13:43:34 admin kernel: [ip_local_deliver_finish+0/368] [ip_rcv_finish+0/480] [ip_rcv_finish+413/480] [ip_rcv_finish+0/480] [nf_hook_slow+272/404] [ip_rcv+870/944]
Aug 29 13:43:34 admin kernel: [ip_rcv_finish+0/480] [net_rx_action+362/628] [do_softirq+111/204] [do_IRQ+219/236] [ret_from_intr+0/7] [sys_ioctl+443/532]
Aug 29 13:43:34 admin kernel: [system_call+51/56]
Aug 29 13:43:34 admin kernel: __alloc_pages: 3-order allocation failed (gfp=0x20/0).

If you use rsize=1024,wsize=1024, (note you may have to force this at the client end), you should not see, at least from NFS, allocations at greater than order 0. So if the problem is /just/ fragmentation (rather than too little memory), it will magically go away (i.e. be hidden). If it's not just fragmentation, you will still see errors. This is not intended as a solution, but as a diagnostic tool. [I mistakenly thought/dreamed you had already done this]. Note there may still be other things trying to do >0 order allocs, for instance bounce buffers, but I believe you have applied useful patches for them already. -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 15:02 ` Alex Bligh - linux-kernel @ 2001-09-06 15:07 ` Rik van Riel [not found] ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.conectiva> 0 siblings, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-09-06 15:07 UTC (permalink / raw) To: Alex Bligh - linux-kernel Cc: Stephan von Krawczynski, phillips, jaharkes, marcelo, linux-kernel On Thu, 6 Sep 2001, Alex Bligh - linux-kernel wrote: > >> IE, in theory you could have half your memory free, but > >> not be able to allocate a single 8k block. Nothing would cause > >> cache, or InactiveDirty stuff to be written. > > > > Which is obviously not the right way to go. I guess we agree on that. > > Well, I agree that this is not desirable. I am not sure whether > the right course is > (a) to avoid getting here, > (b) to do traditional page_launder() stuff, i.e. write stuff out, > and hope that fixes it > (c) to actively go defragment (Daniel P's preferred approach) > (d) some combination of the above. On many systems, higher-order allocations are a really really small fraction of the allocations, so ideally we'd have them take the burden of memory fragmentation and not punish the normal allocations. That pretty much rules out very strong forms of (a); things like (b) and (c) are very possible to do and maybe even easy. They also won't cause any overhead for normal allocations since we'd only call them when needed. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
[parent not found: <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.conectiva>]
* Re: page_launder() on 2.4.9/10 issue [not found] ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.conectiva> @ 2001-09-06 15:16 ` Alex Bligh - linux-kernel 0 siblings, 0 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 15:16 UTC (permalink / raw) To: Rik van Riel, Alex Bligh - linux-kernel Cc: Stephan von Krawczynski, phillips, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel --On Thursday, September 06, 2001 12:07 PM -0300 Rik van Riel <riel@conectiva.com.br> wrote: > On many systems, higher-order allocations are a really really > small fraction of the allocations, so ideally we'd have them > take the burden of memory fragmentation and not punish the > normal allocations. The only nit being that in every instance Stephan's reported so far, and in most other reports I've seen, the allocation has been GFP_ATOMIC (i.e. with a mask without __GFP_WAIT). For non-atomic >0 order allocations we already have some good logic that does (b) via page_launder(), and where necessary reclaim_page(), __free_page(). So waiting until we are in the high order allocation is too late, as we don't have room to move. I think we need to defragment / avoid fragmentation BEFORE the GFP_ATOMIC high order allocation comes along. I have some ideas I'd like to test tonight. -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 14:39 ` Stephan von Krawczynski 2001-09-06 15:02 ` Alex Bligh - linux-kernel @ 2001-09-06 15:10 ` Stephan von Krawczynski 2001-09-06 15:18 ` Alex Bligh - linux-kernel 1 sibling, 1 reply; 79+ messages in thread From: Stephan von Krawczynski @ 2001-09-06 15:10 UTC (permalink / raw) To: Alex Bligh - linux-kernel; +Cc: phillips, riel, jaharkes, marcelo, linux-kernel On Thu, 06 Sep 2001 16:02:04 +0100 Alex Bligh - linux-kernel <linux-kernel@alex.org.uk> wrote: > Stephan, > >> You yourself proved this, by switching rsize,wsize to 1k and said > >> it all worked fine! (unless I misread your email). > > > > Sorry, misunderstanding: I did not touch rsize/wsize. What I do is to lower fs > > action by not letting knfsd walk through the subtrees of a mounted fs. This > > leads to less allocs/frees by the fs layer which tend to fail and let knfsd fail > > afterwards. > > OK, I'm getting confused. To end that: What I meant was, I did not touch the values most everybody uses on NFS, which is: rsize=8192,wsize=8192 Using smaller values (or default = 1024) gives such a ridiculously bad performance that I would even prefer samba. Regards, Stephan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 15:10 ` Stephan von Krawczynski @ 2001-09-06 15:18 ` Alex Bligh - linux-kernel 2001-09-06 17:34 ` Daniel Phillips 0 siblings, 1 reply; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 15:18 UTC (permalink / raw) To: Stephan von Krawczynski, Alex Bligh - linux-kernel Cc: phillips, riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel --On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski <skraw@ithnet.com> wrote: > (or default = 1024) gives such a ridiculously bad > performance I know. I am trying to ensure we have the problem definitively identified, either from /proc/memareas, or by showing it goes away if you change rsize/wsize. I am NOT proposing it as a fix. -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 15:18 ` Alex Bligh - linux-kernel @ 2001-09-06 17:34 ` Daniel Phillips 2001-09-06 17:32 ` Alex Bligh - linux-kernel 0 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 17:34 UTC (permalink / raw) To: Alex Bligh - linux-kernel, Stephan von Krawczynski Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel On September 6, 2001 05:18 pm, Alex Bligh - linux-kernel wrote: > --On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski > <skraw@ithnet.com> wrote: > > > (or default = 1024) gives such a ridiculously bad > > performance > > I know. I am trying to ensure we have the problem definitively > identified, either from /proc/memareas, or by showing it > goes away if you change rsize/wsize. I am NOT proposing > it as a fix. Are rsize/wsize expressed in bytes? In which case you'd want them to be 4096 for this test. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 17:34 ` Daniel Phillips @ 2001-09-06 17:32 ` Alex Bligh - linux-kernel 0 siblings, 0 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 17:32 UTC (permalink / raw) To: Daniel Phillips, Alex Bligh - linux-kernel, Stephan von Krawczynski Cc: riel, jaharkes, marcelo, linux-kernel, Alex Bligh - linux-kernel --On Thursday, September 06, 2001 7:34 PM +0200 Daniel Phillips <phillips@bonn-fries.net> wrote: > On September 6, 2001 05:18 pm, Alex Bligh - linux-kernel wrote: >> --On Thursday, September 06, 2001 5:10 PM +0200 Stephan von Krawczynski >> <skraw@ithnet.com> wrote: >> >> > (or default = 1024) gives such a ridiculously bad >> > performance >> >> I know. I am trying to ensure we have the problem definitively >> identified, either from /proc/memareas, or by showing it >> goes away if you change rsize/wsize. I am NOT proposing >> it as a fix. > > Are rsize/wsize expressed in bytes? In which case you'd want them to be > 4096 for this test. Bytes per request. There is some header wastage, so 4096 is too high as the packets will be slightly larger than a page. I suggested 1024 rather than 2048 as 1024 is the original standard & thus everything supports it. -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:10 ` Stephan von Krawczynski 2001-09-06 13:23 ` Alex Bligh - linux-kernel 2001-09-06 13:42 ` Stephan von Krawczynski @ 2001-09-06 17:51 ` Daniel Phillips 2001-09-06 21:01 ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel 2001-09-07 12:30 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski 3 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-06 17:51 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: riel, jaharkes, marcelo, linux-kernel On September 6, 2001 03:10 pm, Stephan von Krawczynski wrote: > > Blindly delaying all the writes in the name of better read performance isn't > > the right idea either. Perhaps we should have a good think about some > > sensible mechanism for balancing reads against writes. > > I guess I have the real-world proof for that: > Yesterday I mastered a CD (around 700 MB) and burned it, I left the equipment > to get some food and sleep (sometimes needed :-). During this time the machine > acts as nfs-server and gets about 3 GB of data written to it. Coming back today > I recognise that deleting the CD image made yesterday frees up about 500 MB of > physical mem (free mem was very low before). It was obviously held 24 hours for > no reason, and _not_ (as one would expect) exchanged against the nfs-data. This > means the caches were full with _old_ data and explains why nfs performance has > remarkably dropped since 2.2. There is too little mem around to get good > performance (no matter if read or write). Obviously aging did not work at all, > there was not a single hit on these (CD image) pages during 24 hours, compared > to lots on the nfs-data. Even if the nfs-data would only have one single hit, > the old CD image should have been removed, because it is inactive and _older_. OK, this is not related to what we were discussing (IO latency).
It's not too hard to fix: we just need to do a little aging whenever there are allocations, whether or not there is memory_pressure. I don't think it's a real problem though; we have at least two problems we really do need to fix (oom and high order failures). -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-06 17:51 ` Daniel Phillips @ 2001-09-06 21:01 ` Alex Bligh - linux-kernel 2001-09-07 6:35 ` Daniel Phillips 0 siblings, 1 reply; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-06 21:01 UTC (permalink / raw) To: Daniel Phillips, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel

I thought I'd try coding this, then I thought better of it and so am asking people's opinions first. The following describes a mechanism to change the zone/buddy allocation system to minimize fragmentation before it happens, and then defragment post-facto.

Background & Statement of problem
=================================

High order [1] memory allocations tend to fail when memory is fragmented. Memory becomes fragmented through normal system usage, without memory pressure. When memory is fragmented, it stays fragmented.

While non-atomic [2] high order allocations can wait until progress is made freeing pages, the algorithm 'free pages without reference to their location until sufficient adjacent pages have by chance been freed for a coalescence' is inefficient compared to a defragmentation routine, or an attempt to free specific adjacent pages which may coalesce. The problem is worse for atomic [2] requests, which can neither defragment memory (due to I/O and locking restrictions), nor can they make progress via (for instance) page_launder(). Therefore, in a fragmented memory environment, it has been observed that high order requests, particularly atomic ones [3], fail frequently.

Common sources of atomic high order requests include allocations from the network layer where packets exceed 4k in size (for instance NFS packets with rsize,wsize>2048, fragmentation and reassembly), and the SCSI layer.
Whilst it is undeniable that some drivers would benefit from using technologies like scatter lists to avoid the necessity of contiguous physical memory allocation, large swathes of current code assume the opposite, and some of it is hard to change. [4] As many of these allocations occur in bottom half or interrupt routines, it is more difficult to handle a failure gracefully than in other code. This tends to lead to performance problems [5], or worse (hard errors), which should be minimized.

Causes of fragmentation
=======================

Linux adopts a largely requestor-anonymous form of page allocation. Memory is divided into 3 zones, and page requesters can specify a list of suitable zones from which pages may be allocated, but beyond that, pages are allocated in a manner which does not distinguish between users of given pages. Thus pages allocated for packets in flight are likely to be intermingled with buffer pages, cache pages, code pages and data pages. Each of these different types of allocation has a different persistence over time. Some (for instance pages on the InactiveDirty list in an idle system) will persist indefinitely.

The buddy allocator will attempt (by looking at lowest order lists first) to allocate pages from fragmented areas first. Assuming pages are freed at random, this would act as a defragmentation process. However, if a system is taken to high utilization and back again to idle, the dispersion of persistent pages (for instance InactiveDirty pages) becomes great, and the buddy allocator performs poorly at coalescing blocks. The situation is worsened by the understandable desire for simplicity in the VM system, which measures solely the number of pages free in different zones, as opposed to their respective locations.
It is possible (and has been observed) to have a system in a state with hardly any high order buddies on free area lists (thus where it would be impossible to make many atomic high order allocations), but copious easily freeable RAM. This is in essence because no attempt is made to balance the different order free-lists, and a shortage of entries on high-order free lists does not in itself cause memory pressure.

It is probably undesirable for the normal VM system to react to fragmentation in the same way it does to normal memory pressure. This would result in an unselective paging out / discarding of data, whereas an approach which selected the pages to free most likely to cause coalescence would be more useful. Further, it would be possible, by moving the data in physical pages, to move many types of page without loss of in-memory data at all.

Approaches to solution
======================

It has been suggested that post-facto defragmentation is a useful technique. This is undoubtedly true, but the defragmentation needs to run before it is 'needed' - i.e. we need to ensure that memory is never sufficiently fragmented that a reasonable size burst of high order atomic allocations can fail. This could be achieved by running some background defragmentation task against some measurable fragmentation target. Here fragmentation pressure would be an orthogonal measure to memory pressure. Non-atomic high order allocations which are failing should allow the defragmenter to run, rather than call page_launder().

Defragmentation routines appear to be simple at first: simply run through the free lists of particular zones, examining whether the constituent pages of buddies of free areas can be freed or moved. However, using this approach alone has some drawbacks. Firstly, it is not immediately obvious that by moving pages you are making the situation any better, because it is not clear that the (new) destination page will be allocated somewhere less awkward.
Secondly, whilst many types of page can be allocated and moved with minimal effort (for instance pages on the Active or Inactive lists), it is less obvious how to move buffer and cache pages transparently (given only a pointer to the page struct to start with, it is hard to determine where they are used and referred to, for a start) and it is far from obvious how to move arbitrary pages allocated by the kernel for disparate purposes (including pages allocated by the slab allocator).

However, this is not the only possibility to minimize fragmentation. Part of the problem is the fact that pages are allocated by location without reference to the caller. If (for instance) buffer pages tended to be allocated next to each other, cache pages tended to be allocated next to each other, and pages allocated by the network stack tended to be allocated next to each other, then a number of benefits would accrue:

Firstly, defragmentation would be more successful. Defragmentation would tend to focus on pages allocated away from their natural brethren, and their newly allocated pages, into which their data would be moved, would tend to be next to these. This would help ensure that the new page was indeed a better location than the old page. Also, as pages of similar ease or difficulty to move would be clumped, the effect of a large number of difficult to move pages would be reduced by their mutual proximity.

Secondly, defragmentation would be less necessary. Pages allocated by different functions have different natural persistence. For instance, pages allocated within the networking stack typically have short persistence, due to the transitory nature of the packets they represent. Therefore, in areas of memory preferred by low persistence users, the natural defragmentation effect of the buddy allocator would be greater. Therefore it is suggested that different allocators have affinities for different areas of memory.
One mechanism of achieving this effect would be an extension to the zone system. Currently, there are three zones (DMA, Normal and High memory). Imagine instead there were many more zones, and the above three labels became 'zone types'. There would thus be many DMA zones, many normal zones, and many high memory zones. These zones would be at least the highest order allocation in size - currently 2MB on i386, but this could be reduced slightly with minimal disruption. In this manner, the efficiency of the buddy allocator is not reduced, as the buddy allocator has no visibility of coalescence etc. above this level anyway. Balancing would occur across the aggregate of zone types (i.e. across all DMA zones in aggregate, across all High memory zones in aggregate, etc.) as opposed to by individual zones.

Each zone type would have an associated hash table, the entries being zones of that type. A routine requesting an allocation would pass information to __alloc_pages which identified it - it may well be that the GFP flags, the order, and perhaps some ID for the subsystem is sufficient. This would act as the key to the hash table concerned. When allocating a page, all zones in the hash table with the appropriate key (i.e. a matching allocator) are first tried, in order. If no page is found, then an empty zone (special key) is found, which is then labelled, and used as, a zone of the type required. If no empty zone is available of that zone type, then other zone types are tried (using the list of appropriate zone types). If no page is found, then starting with the first zone type again, the first page in ANY zone within that zone hash table is utilized, and so on through other suitable zone types. In this manner, pages are likely to be clustered in zones by allocator.
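The allocation path just described can be sketched as a toy model. Everything here is invented for illustration (the `Zone` and `KeyedAllocator` names, the string keys, the linear scans standing in for the proposed hash tables); in the proposal the key would be derived from GFP flags, order and a subsystem ID, and each zone would be a maximum-order-sized arena managed by the buddy system.

```python
# Toy model of the keyed-zone proposal: keyed zones first, then an empty
# zone, then any zone as a last resort.
class Zone:
    def __init__(self, pages):
        self.pages = pages
        self.used = 0
        self.key = None  # None = unkeyed (empty) zone

    def try_alloc(self):
        if self.used < self.pages:
            self.used += 1
            return True
        return False

class KeyedAllocator:
    def __init__(self, nzones, zone_pages):
        self.zones = [Zone(zone_pages) for _ in range(nzones)]

    def alloc(self, key):
        # 1. Prefer zones already keyed for this allocator.
        for z in self.zones:
            if z.key == key and z.try_alloc():
                return z
        # 2. Otherwise claim an unkeyed (empty) zone and label it.
        for z in self.zones:
            if z.key is None and z.try_alloc():
                z.key = key
                return z
        # 3. Last resort: any zone with room. This creates exactly the
        #    "inappropriately keyed" pages the defragmenter would target.
        for z in self.zones:
            if z.try_alloc():
                return z
        return None  # out of memory

a = KeyedAllocator(nzones=4, zone_pages=2)
hits = [a.alloc("net"), a.alloc("net"), a.alloc("net"), a.alloc("buf")]
print([z.key for z in hits])  # -> ['net', 'net', 'net', 'buf']
```

The demo shows the clustering effect: the third "net" allocation claims a fresh zone for the network stack rather than landing next to the "buf" pages.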
The role of the defragmenter becomes firstly to target pages which have an inappropriate key for the zone concerned, and secondly to target pages in sparsely allocated zones, so the zone becomes unkeyed, and free for rekeying later. As statistics could easily be kept per zone on the number of appropriately and inappropriately keyed pages which had been allocated within that zone, scanning (and hence finding suitable targets) would become considerably easier. Equally, maintenance of these statistics can determine when the defragmenter should be run as a background process.

Some further changes will be necessary; for instance direct_reclaim should not occur when the page to be reclaimed would be inappropriately keyed for the zone; in practice this means using direct reclaim only to reclaim pages for purposes where the allocated page might itself reach the InactiveDirty list AND where the page reclaimed is correctly keyed. Furthermore, the number of unkeyed (i.e. empty) zones will need to have a particular low water mark target, below which memory pressure must somehow be caused, in order to force buffer flushing or paging.

This effectively relegates the buddy system to allocating pages for particular purposes within small chunks of memory - there is a parallel purpose here with a sort of extended slab system. The zone system would then become a low overhead manager of larger areas - a sort of 'super slab'. Thoughts?

Notes
=====

[1] Higher order meaning greater than order 0.
[2] By atomic I mean without __GFP_WAIT set, which are in the main GFP_ATOMIC allocations.
[3] The lack of any detail at all on non-atomic requests suggests that this is either a non-problem, or they are little used in the kernel - possibly wrongly so.
[4] For instance, the network code assumes that packets (pre-fragmentation, or post-reassembly) are contiguous in memory.
[5] For instance, packet drops, which whilst recoverable, impede performance.
-- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-06 21:01 ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel @ 2001-09-07 6:35 ` Daniel Phillips 2001-09-07 8:58 ` Alex Bligh - linux-kernel 0 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-07 6:35 UTC (permalink / raw) To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel On September 6, 2001 11:01 pm, Alex Bligh - linux-kernel wrote: > I thought I'd try coding this, then I thought better of it and so am asking > people's opinions first. The following describes a mechanism to change the > zone/buddy allocation system to minimize fragmentation before it happens, > and then defragment post-facto. Nice exposition and analysis, but see my wet-blanket comments below... > [...] > > Causes of fragmentation > ======================= > > Linux adopts a largely requestor-anonymous form of page allocation. Memory > is divided into 3 zones, and page requesters can specify a list of suitable > zones from which pages may be allocated, but beyond that, pages are > allocated in a manner which does not distinguish between users of given > pages. It's a conscious goal to try to unify all sources of memory. The three zones that are there now are only there because they absolutely have to be. > Thus pages allocated for packets in flight are likely to be intermingled > with buffer pages, cache pages, code pages and data pages. Each of these > different types of allocation has a different persistence over time. Some > (for instance pages on the InactiveDirty list in an idle system) will > persist indefinitely. > > The buddy allocator will attempt (by looking at lowest order lists first) > to allocate pages from fragmented areas first. Assuming pages are freed at > random, this would act as a defragmentation process. 
However, if a system > is taken to high utilization and back again to idle, the dispersion of > persistent pages (for instance InactiveDirty pages) becomes great, and the > buddy allocator performs poorly at coalescing blocks. It becomes effectively useless. The probability of all 8 pages of a given 8 page unit being free when only 1% of memory is free is (1/100)**8 = 1/(10**16). > The situation is worsened by the understandable desire for simplicity in > the VM system, which measures solely the number of pages free in different > zones, as opposed to their respective locations. It is possible (and has been > observed) to have a system in a state with hardly any high order buddies on > free area lists (thus where it would be impossible to make many atomic high > order allocations), but copious easily freeable RAM. This is in essence > because no attempt is made to balance for different order free-lists, and > shortage of entries on high-order free lists does not in itself cause > memory pressure. > > It is probably undesirable for the normal VM system to react to > fragmentation in the same way it does to normal memory pressure. This would > result in an unselective paging out / discarding of data, whereas an > approach which selected pages to free which would be most likely to cause > coalescence would be more useful. Further, it would be possible, by moving > the data in physical pages, to move many types of page, without loss of > in-memory data at all. Moving pages sounds scary. We already know how to evict pages, but moving pages is a whole new mechanism. We probably would not care about the "good" data lost through eviction, as the fraction of pages we'd have to evict to do the required defragmentation is tiny.
> Approaches to solution > ====================== I'm going to confess that I don't understand your solution in detail yet; however, I can see this complaint coming: the changes are too intrusive on the existing kernel, and if that's what we had to do it would probably be easier to just eliminate all high order allocations from the kernel. I already have heard some sentiment that the >0 order allocation failure problems do not have to be solved, that they are really the fault of those coders that used the feature in the first place. I don't know about that, I'd like to hear from the maintainers. But I'm pretty sure that whatever solution we come up with, it has to be very simple in implementation, and have roughly zero impact on the rest of the kernel. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
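Daniel's 1/(10**16) figure comes from treating frees as independent random events: with a fraction p of memory free, an aligned order-k block of 2**k pages is entirely free with probability p**(2**k). A quick sketch of that estimate (the helper is hypothetical; the 64MB machine with 4kB pages is an assumed configuration matching the earlier /proc/memareas output):

```python
# Expected number of fully-free order-k blocks under the random-free model.
def expected_free_blocks(total_pages, p, order):
    pages_per_block = 1 << order       # 2**order pages per aligned block
    n_blocks = total_pages // pages_per_block
    return n_blocks * p ** pages_per_block

# 64MB of 4kB pages = 16384 pages; 1% free; order-3 (8-page, 32kB) blocks:
print(expected_free_blocks(16384, 0.01, 3))
# -> about 2e-13: under this model, essentially no order-3 block is ever
#    free by chance.
```

This is the model Alex disputes in his reply: real frees may be random, but allocations are not, which changes the distribution substantially.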
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-07 6:35 ` Daniel Phillips @ 2001-09-07 8:58 ` Alex Bligh - linux-kernel 2001-09-07 9:15 ` Alex Bligh - linux-kernel 2001-09-07 21:56 ` Daniel Phillips 0 siblings, 2 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-07 8:58 UTC (permalink / raw) To: Daniel Phillips, Alex Bligh - linux-kernel, riel, linux-kernel Cc: Alex Bligh - linux-kernel Daniel, Some comments in line - if you are modelling this, it's vital you understand the first! >> The buddy allocator will attempt (by looking at lowest order lists first) >> to allocate pages from fragmented areas first. Assuming pages are freed >> at random, this would act as a defragmentation process. However, if a >> system is taken to high utilization and back again to idle, the >> dispersion of persistent pages (for instance InactiveDirty pages) >> becomes great, and the buddy allocator performs poorly at coalescing >> blocks. > > It becomes effectively useless. The probability of all 8 pages of a given > 8 page unit being free when only 1% of memory is free is (1/100)**8 = > 1/(10**16). I thought that, then I tested & measured, and it simply isn't true. Your mathematical model is wrong. The reason is that pages are freed at random, but they are not allocated at random. The buddy allocator allocates pages whose buddy is allocated (lower order) preferentially to splitting a high order block. Sorry to sound like a broken record, but apply the /proc/memareas patch and you can see this happening. After extensive activity, you see practically none of the free pages in order 0 blocks. You might see only a small number (20 or 30 on a 64MB machine) of (say) order 3 blocks, but if you run your stats you would have an expected value of well less than one, and the chance of having 20 or 30 would be vanishingly small.
Local aggregation is actually quite effective, provided that the density of persistent pages is not too great. However, it gets considerably less effective as the order increases. > Moving pages sounds scary. We already know how to evict pages, but moving > pages is a whole new mechanism. We probably would not care about the > "good" data lost through eviction as opposed to moving; the fraction of pages > we'd have to evict to do the required defragmentation is tiny. The sort of moving I was talking about was a diskless page-out / page-in, i.e. one which didn't require a swap file, or I/O, and was thus much quicker. Whilst the page would be physically moved, its virtual address would stay the same. Though this sounds like a completely new system, I think there's a high probability of this just being a special case of the page out routine. > I'm going to confess that I don't understand your solution in detail yet, > however, I can see this complaint coming: the changes are too intrusive on > the existing kernel, A valid criticism. But it is difficult to see how defragmentation that actually takes account of the contents of memory (rather than 'blind' freeing) could be less intrusive - though I'm open to ideas. > and if that's what we had to do it would probably be > easier to just eliminate all high order allocations from the kernel. I > already have heard some sentiment that the 0 order allocation failure > problems do not have to be solved, that they are really the fault of those > coders that used the feature in the first place. I'd be especially interested to know how we'd solve this for the network stuff, which currently relies on packets being physically contiguous in memory. This is a *HUGE* change I think (larger than any we'd make to the VM system). > But I'm pretty sure that whatever > solution we come up with, it has to be very simple in implementation, and > have roughly zero impact on the rest of the kernel. This would of course be ideal. 
-- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-07 8:58 ` Alex Bligh - linux-kernel @ 2001-09-07 9:15 ` Alex Bligh - linux-kernel 2001-09-07 9:28 ` Alex Bligh - linux-kernel 2001-09-07 21:38 ` Daniel Phillips 2001-09-07 21:56 ` Daniel Phillips 1 sibling, 2 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-07 9:15 UTC (permalink / raw) To: Alex Bligh - linux-kernel, Daniel Phillips, riel, linux-kernel Cc: Alex Bligh - linux-kernel >> It becomes effectively useless. The probability of all 8 pages of a given >> 8 page unit being free when only 1% of memory is free is (1/100)**8 = >> 1/(10**16). > Sorry to sound like a broken record, but apply the > /proc/memareas patch and you can see this happening. After extensive > activity, you see practically none of the free pages in order 0 > blocks. You might see only a small number (20 or 30 on a 64k > machine) of (say) order 3 blocks, but if you run your stats > you would have an expected value of well less than one, and the > chance of having 20 or 30 would be vanishingly small. Ooops, what I wrote was factually correct, but misleading. What I meant was it looks like this:

Zone      4kB   8kB  16kB  32kB  64kB 128kB 256kB 512kB 1024kB 2048kB   Tot Pages/kb
DMA       495   348   196    72    10     1     1     0      0      0   =  2807
@frag      0%   18%   42%   70%   91%   97%   98%  100%   100%   100%   = 11228kB
Normal      0  1579  1670   667   140    12     3     1      0      0   = 18118
@frag      0%    0%   17%   54%   84%   96%   98%   99%   100%   100%   = 72472kB

If your model was correct, you would see free pages per order run like N = a (K ^ (2^-o)); (for a>0, K>1, o=order) This doesn't happen. Instead you get GOOD coalescence at order 0 (in the Normal zone they've ALL been coalesced), and not bad at order 1 (see how many order 2's we have). An 8 page unit is order 3 (32k). This system has 20% of memory free at the point where I took the snapshot. Probability would be (1/5)^8 = 2^8 / 10^8, roughly p = 2.5 x 10^-6. 
In a system with 32000 pages (128MB), if you were right, I'd expect to see about 0.08 free pages at order 3. But here I see 750. The chance of seeing more than 500 events of probability p = 2.5 x 10^-6 across 32000 samples is vanishingly small. Yet it looks this way all the time. Hence I conclude your model is wrong :-) -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-07 9:15 ` Alex Bligh - linux-kernel @ 2001-09-07 9:28 ` Alex Bligh - linux-kernel 2001-09-07 21:38 ` Daniel Phillips 1 sibling, 0 replies; 79+ messages in thread From: Alex Bligh - linux-kernel @ 2001-09-07 9:28 UTC (permalink / raw) To: Alex Bligh - linux-kernel, Daniel Phillips, riel, linux-kernel Cc: Alex Bligh - linux-kernel Blush > N = a (K ^ (2^-o)); (for a>0, K>1, o=order) N = a (K ^ -(2^o)); (for a>0, K>1, o=order) -- Alex Bligh ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-07 9:15 ` Alex Bligh - linux-kernel 2001-09-07 9:28 ` Alex Bligh - linux-kernel @ 2001-09-07 21:38 ` Daniel Phillips 1 sibling, 0 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-07 21:38 UTC (permalink / raw) To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel On September 7, 2001 11:15 am, Alex Bligh - linux-kernel wrote: > >> It becomes effectively useless. The probability of all 8 pages of a given > The chance of seeing more than 500 events of probability > p = 2.5 x 10^-6 across 32000 samples is vanishingly > small. Yet it looks this way all the time. > > Hence I conclude your model is wrong :-) True. OK, need to make a better model, time to crack my Knuth. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] 2001-09-07 8:58 ` Alex Bligh - linux-kernel 2001-09-07 9:15 ` Alex Bligh - linux-kernel @ 2001-09-07 21:56 ` Daniel Phillips 1 sibling, 0 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-07 21:56 UTC (permalink / raw) To: Alex Bligh - linux-kernel, riel, linux-kernel; +Cc: Alex Bligh - linux-kernel On September 7, 2001 10:58 am, Alex Bligh - linux-kernel wrote: > Some comments inline - if you are modelling this, it is vital you > understand the first one! > > >> The buddy allocator will attempt (by looking at lowest order lists first) > >> to allocate pages from fragmented areas first. Assuming pages are freed > >> at random, this would act as a defragmentation process. However, if a > >> system is taken to high utilization and back again to idle, the > >> dispersion of persistent pages (for instance InactiveDirty pages) > >> becomes great, and the buddy allocator performs poorly at coalescing > >> blocks. > > > > It becomes effectively useless. The probability of all 8 pages of a given > > 8 page unit being free when only 1% of memory is free is (1/100)**8 = > > 1/(10**16). > > I thought that, then I tested & measured, and it simply isn't true. > Your mathematical model is wrong. Yes, a simple thought experiment shows this. Suppose we start with an initial state in which every second 0 order page is allocated. Now, the next 0 order free must coalesce into a 1 order unit, but the next allocation will come from a half-allocated unit. If we continue randomly in this way, allocating one page and freeing one, we will eventually arrive at a state where half the pages are in 1 order units and the other half are fully allocated. So, the fragmentation is far from uniformly random. This is going to require deeper analysis. IMO, it's worth putting in the effort to get a handle on this. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-06 13:10 ` Stephan von Krawczynski ` (2 preceding siblings ...) 2001-09-06 17:51 ` Daniel Phillips @ 2001-09-07 12:30 ` Stephan von Krawczynski 3 siblings, 0 replies; 79+ messages in thread From: Stephan von Krawczynski @ 2001-09-07 12:30 UTC (permalink / raw) To: Daniel Phillips; +Cc: riel, jaharkes, marcelo, linux-kernel On Thu, 6 Sep 2001 19:51:26 +0200 Daniel Phillips <phillips@bonn-fries.net> wrote: > On September 6, 2001 03:10 pm, Stephan von Krawczynski wrote: > > [...] > > to lots on the nfs-data. Even if the nfs-data would only have one single hit, > > the old CD image should have been removed, because it is inactive and _older_. > > OK, this is not related to what we were discussing (IO latency). It's not too > hard to fix, we just need to do a little aging whenever there are allocations, > whether or not there is memory_pressure. I don't think it's a real problem > though, we have at least two problems we really do need to fix (oom and > high order failures). Hm, I am not quite sure about that. Can you _show_ me how to fix this? Regards, Stephan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 15:26 ` Jan Harkes 2001-09-04 15:24 ` Marcelo Tosatti @ 2001-09-04 16:27 ` Rik van Riel 2001-09-04 17:13 ` Jan Harkes 2001-09-04 20:43 ` Jan Harkes 1 sibling, 2 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-04 16:27 UTC (permalink / raw) To: Jan Harkes; +Cc: Marcelo Tosatti, Linus Torvalds, Daniel Phillips, lkml On Tue, 4 Sep 2001, Jan Harkes wrote: > NO, please don't add another list to fix the symptoms of bad page aging. > > One of the graduate students here at CMU has been looking at the 2.4 VM, > trying to predict the size of the app that can possibly be loaded > without causing the system to start thrashing. [snip results] > Aging is broken. Horribly. As a result, the inactive list is filled with > pages that are not necessarily inactive. I've been working on a CPU and memory efficient reverse mapping patch for Linux, one which will allow us to do a bunch of optimisations for later on (infrastructure) and has as its short-term benefit the potential for better page aging. It seems the balancing FreeBSD does (up aging +3, down aging -1, inactive list in LRU order as extra stage) is working nicely on my laptop now, but I don't think I'll be releasing that as part of the patch ... http://www.surriel.com/patches/2.4/2.4.8-ac12-pmap3 regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 16:27 ` Rik van Riel @ 2001-09-04 17:13 ` Jan Harkes 2001-09-04 15:56 ` Marcelo Tosatti 2001-09-04 17:35 ` Daniel Phillips 2001-09-04 20:43 ` Jan Harkes 1 sibling, 2 replies; 79+ messages in thread From: Jan Harkes @ 2001-09-04 17:13 UTC (permalink / raw) To: Rik van Riel; +Cc: Marcelo Tosatti, Linus Torvalds, Daniel Phillips, lkml On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote: > I've been working on a CPU and memory efficient reverse > mapping patch for Linux, one which will allow us to do > a bunch of optimisations for later on (infrastructure) > and has as its short-term benefit the potential for > better page aging. Yes, I can see that using reverse mappings would be a way of correcting the aging if you call page_age_up from try_to_swap_out, in which case there probably needs to be a page_age_down on virtual mappings as well to correctly balance things. > It seems the balancing FreeBSD does (up aging +3, down > aging -1, inactive list in LRU order as extra stage) is One other observation, we should add anonymously allocated memory to the active-list as soon as they are allocated in do_nopage. At the moment a large part of memory is not aged at all until we start swapping things out. Jan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 17:13 ` Jan Harkes @ 2001-09-04 15:56 ` Marcelo Tosatti 2001-09-04 17:54 ` Jan Harkes 2001-09-04 17:35 ` Daniel Phillips 1 sibling, 1 reply; 79+ messages in thread From: Marcelo Tosatti @ 2001-09-04 15:56 UTC (permalink / raw) To: Jan Harkes; +Cc: Rik van Riel, Linus Torvalds, Daniel Phillips, lkml On Tue, 4 Sep 2001, Jan Harkes wrote: > On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote: > > I've been working on a CPU and memory efficient reverse > > mapping patch for Linux, one which will allow us to do > > a bunch of optimisations for later on (infrastructure) > > and has as its short-term benefit the potential for > > better page aging. > > Yes, I can see that using reverse mappings would be a way of correcting > the aging if you call page_age_up from try_to_swap_out, in which case > there probably needs to be a page_age_down on virtual mappings as well > to correctly balance things. > > > It seems the balancing FreeBSD does (up aging +3, down > > aging -1, inactive list in LRU order as extra stage) is > > One other observation, we should add anonymously allocated memory to the > active-list as soon as they are allocated in do_nopage. At the moment a > large part of memory is not aged at all until we start swapping things > out. With reverse mappings we can completely remove the "swap_out()" loop logic and age pte's at refill_inactive_scan(). All that with anon memory added to the active-list as soon as allocated, of course. Jan, I suggest you take a look at the reverse mapping code. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 15:56 ` Marcelo Tosatti @ 2001-09-04 17:54 ` Jan Harkes 2001-09-04 16:37 ` Marcelo Tosatti ` (2 more replies) 0 siblings, 3 replies; 79+ messages in thread From: Jan Harkes @ 2001-09-04 17:54 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Rik van Riel, linux-kernel On Tue, Sep 04, 2001 at 12:56:32PM -0300, Marcelo Tosatti wrote: > On Tue, 4 Sep 2001, Jan Harkes wrote: > > One other observation, we should add anonymously allocated memory to the > > active-list as soon as they are allocated in do_nopage. At the moment a > > large part of memory is not aged at all until we start swapping things > > out. > > With reverse mappings we can completely remove the "swap_out()" loop logic > and age pte's at refill_inactive_scan(). > > All that with anon memory added to the active-list as soon as allocated, > of course. > > Jan, I suggest you to take a look at the reverse mapping code. I'm getting pretty sick and tired of these endless discussions. People have been reporting problems and they are pretty much always met with the answer, "it works here, if you can do better send a patch". Now for the past _9_ stable kernel releases, page aging hasn't worked at all!! Nobody seems to even have bothered to check. I send in a patch and you basically answer with "Ohh, but we know about that one. Just apply patch wizzbangfoo#105 which basically does everything differently". Yeah I'll have a look at that code, and I'll check what the page ages look like when I actually run it (if it doesn't crash the system first). Jan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 17:54 ` Jan Harkes @ 2001-09-04 16:37 ` Marcelo Tosatti 2001-09-04 18:49 ` Alan Cox 2001-09-04 19:54 ` Andrea Arcangeli 2 siblings, 0 replies; 79+ messages in thread From: Marcelo Tosatti @ 2001-09-04 16:37 UTC (permalink / raw) To: Jan Harkes; +Cc: Rik van Riel, linux-kernel On Tue, 4 Sep 2001, Jan Harkes wrote: > On Tue, Sep 04, 2001 at 12:56:32PM -0300, Marcelo Tosatti wrote: > > On Tue, 4 Sep 2001, Jan Harkes wrote: > > > One other observation, we should add anonymously allocated memory to the > > > active-list as soon as they are allocated in do_nopage. At the moment a > > > large part of memory is not aged at all until we start swapping things > > > out. > > > > With reverse mappings we can completely remove the "swap_out()" loop logic > > and age pte's at refill_inactive_scan(). > > > > All that with anon memory added to the active-list as soon as allocated, > > of course. > > > > Jan, I suggest you to take a look at the reverse mapping code. > > I'm getting pretty sick and tired of these endless discussions. People > have been reporting problems and they are pretty much always met with the > answer, "it works here, if you can do better send a patch". > > Now for the past _9_ stable kernel releases, page aging hasn't worked > at all!! Nobody seems to even have bothered to check. I send in a patch > and you basically answer with "Ohh, but we know about that one. Just > apply patch wizzbangfoo#105 which basically does everything differently". Jan, Calm down. I haven't told you that the reverse mapping code is the fix to all aging problems, have I? I will take a careful look at your code later. However, I (and everybody else) do not have enough time to fix the whole VM in one day. > Yeah I'll have a look at that code, and I'll check what the page ages > look like when I actually run it (if it doesn't crash the system first). I haven't said reverse mapping will fix the aging problem. 
I just made a comment on top of your comment. Please read my mails more carefully and slowly before sending me to hell. :) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 17:54 ` Jan Harkes 2001-09-04 16:37 ` Marcelo Tosatti @ 2001-09-04 18:49 ` Alan Cox 2001-09-04 19:39 ` Jan Harkes 2001-09-04 19:54 ` Andrea Arcangeli 2 siblings, 1 reply; 79+ messages in thread From: Alan Cox @ 2001-09-04 18:49 UTC (permalink / raw) To: Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel > Now for the past _9_ stable kernel releases, page aging hasn't worked > at all!! Nobody seems to even have bothered to check. I send in a patch > and you basically answer with "Ohh, but we know about that one. Just > apply patch wizzbangfoo#105 which basically does everything differently". Maybe you should take issue with the people applying random patches, missing important ones and mixing and matching incompatible ideas in the main tree? The VM tuning in the -ac tree is a lot more reliable for most loads (it's certainly not perfect) and that is because the changes have been done and tested one at a time as they are merged. A real engineering process is the only way to get this sort of thing working well. Alan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 18:49 ` Alan Cox @ 2001-09-04 19:39 ` Jan Harkes 2001-09-04 20:25 ` Alan Cox 0 siblings, 1 reply; 79+ messages in thread From: Jan Harkes @ 2001-09-04 19:39 UTC (permalink / raw) To: Alan Cox; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel On Tue, Sep 04, 2001 at 07:49:47PM +0100, Alan Cox wrote: > The VM tuning in the -ac tree is a lot more reliable for most loads (it's > certainly not perfect) and that is because the changes have been done and > tested one at a time as they are merged. Real engineering process is the > only way to get this sort of thing working well. I grabbed the 2.4.9-ac7 patch and looked at some of the files. Pages allocated with do_anonymous_page are not added to the active list. As a result there is no aging information for a page until it is unmapped. So we might be unmapping and allocating swap for shared pages that another process is using heavily, in which case this page should always have a high age in the active list and won't actually get swapped out. So we get both unnecessary minor faults, and the swap space will never be reclaimed because we never swap it back in. Also, up aging of mapped process pages is still done in try_to_swap_out, and all of these pages are still aged down indiscriminately in refill_inactive_scan. I don't see how it could age that much differently, so I'm assuming all pages in the active list are basically at age 0 no matter what aging strategy is picked. Especially because only down aging is performed periodically by kswapd, while the code that ages process pages up is only called once the system hits a free or inactive shortage. There are some places where tests have been added that should never make a difference anyway. In reclaim_page and page_launder a page on the inactive list is checked for page->age. Because the page is not mapped in any VM it is not possible for the age to be non-zero. 
If the page had been referenced it would have triggered a minor fault and reactivated the page. I guess it is just more carefully papering over the existing problems. Jan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 19:39 ` Jan Harkes @ 2001-09-04 20:25 ` Alan Cox 2001-09-06 11:23 ` Rik van Riel 0 siblings, 1 reply; 79+ messages in thread From: Alan Cox @ 2001-09-04 20:25 UTC (permalink / raw) To: Jan Harkes; +Cc: Alan Cox, Marcelo Tosatti, Rik van Riel, linux-kernel > Pages allocated with do_anonymous_page are not added to the active list. > as a result there is no aging information for a page until it is > unmapped. So we might be unmapping and allocating swap for shared pages Right ok. > I guess it is just more carefully papering over the existing problems. If you are correct then I suspect the better behaviour is primarily coming from the balancing algorithms and the choices made rather than the quality of data suggested. When Rik gets back off a plane this sounds like something that should be tested - one item at a time. Alan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 20:25 ` Alan Cox @ 2001-09-06 11:23 ` Rik van Riel 0 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-06 11:23 UTC (permalink / raw) To: Alan Cox; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel On Tue, 4 Sep 2001, Alan Cox wrote: > > Pages allocated with do_anonymous_page are not added to the active list. > > as a result there is no aging information for a page until it is > > unmapped. So we might be unmapping and allocating swap for shared pages > > Right ok. One problem though, we cannot 'see' the referenced bits in the page tables and nothing else is accessing this page, so there's no information we can learn from having this page on the active list. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 17:54 ` Jan Harkes 2001-09-04 16:37 ` Marcelo Tosatti 2001-09-04 18:49 ` Alan Cox @ 2001-09-04 19:54 ` Andrea Arcangeli 2001-09-04 18:36 ` Marcelo Tosatti ` (2 more replies) 2 siblings, 3 replies; 79+ messages in thread From: Andrea Arcangeli @ 2001-09-04 19:54 UTC (permalink / raw) To: Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote: > Now for the past _9_ stable kernel releases, page aging hasn't worked > at all!! Nobody seems to even have bothered to check. I send in a patch All I can say is that I hope you will get your problem fixed with one of the next -aa, I incidentally started working on it yesterday. So far it's a one thousand diff very far from compiling, so it will grow further, but it shouldn't take too long to finish the rewrite. Once finished the benchmarks and the reproducible 2.4 deadlocks will tell me if I'm right. Andrea ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 19:54 ` Andrea Arcangeli @ 2001-09-04 18:36 ` Marcelo Tosatti 2001-09-04 20:10 ` Daniel Phillips 2001-09-06 11:18 ` Rik van Riel 2 siblings, 0 replies; 79+ messages in thread From: Marcelo Tosatti @ 2001-09-04 18:36 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Jan Harkes, Rik van Riel, linux-kernel On Tue, 4 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote: > > Now for the past _9_ stable kernel releases, page aging hasn't worked > > at all!! Nobody seems to even have bothered to check. I send in a patch > > All I can say is that I hope you will get your problem fixed with one of > the next -aa, I incidentally started working on it yesterday. So far > it's a one thousand diff very far from compiling, so it will grow > further, but it shouldn't take too long to finish the rewrite. Once > finished the benchmarks and the reproducible 2.4 deadlocks will tell me > if I'm right. Andrea, Could you please describe how you're trying to fix the "anon pages not being added to the active list at do_no_page()" problem Jan described ? Thanks! ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 19:54 ` Andrea Arcangeli 2001-09-04 18:36 ` Marcelo Tosatti @ 2001-09-04 20:10 ` Daniel Phillips 2001-09-04 22:04 ` Andrea Arcangeli 2001-09-06 11:18 ` Rik van Riel 2 siblings, 1 reply; 79+ messages in thread From: Daniel Phillips @ 2001-09-04 20:10 UTC (permalink / raw) To: Andrea Arcangeli, Jan Harkes; +Cc: Marcelo Tosatti, Rik van Riel, linux-kernel On September 4, 2001 09:54 pm, Andrea Arcangeli wrote: > On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote: > > Now for the past _9_ stable kernel releases, page aging hasn't worked > > at all!! Nobody seems to even have bothered to check. I send in a patch > > All I can say is that I hope you will get your problem fixed with one of > the next -aa, I incidentally started working on it yesterday. So far > it's a one thousand diff very far from compiling, so it will grow > further, but it shouldn't take too long to finish the rewrite. Once > finished the benchmarks and the reproducible 2.4 deadlocks will tell me > if I'm right. Which reproducible deadlocks did you have in mind, and how do I reproduce them? -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 20:10 ` Daniel Phillips @ 2001-09-04 22:04 ` Andrea Arcangeli 2001-09-05 2:41 ` Daniel Phillips 0 siblings, 1 reply; 79+ messages in thread From: Andrea Arcangeli @ 2001-09-04 22:04 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jan Harkes, Marcelo Tosatti, Rik van Riel, linux-kernel On Tue, Sep 04, 2001 at 10:10:42PM +0200, Daniel Phillips wrote: > Which reproducible deadlocks did you have in mind, and how do I reproduce > them? I meant the various known oom deadlocks. I've one showstopper report with the blkdev in pagecache patch with in use also a small ramdisk pagecache backed, the pagecache backed works like ramfs etc.. marks the page dirty again in writepage, somebody must have broken page_launder or something else in the memory management because exactly the same code was working fine in 2.4.7. Now it probably loops or breaks totally when somebody marks the page dirty again, but the vm problems are much much wider, starting from the kswapd loop on gfp dma or gfp normal, the overkill swapping when there's tons of ram in freeable cache and you are taking advantage of the cache, lack of defragmentation, lack of knowledge of the classzone to balance in the memory balancing (this in turn is why kswapd goes mad), very imprecise estimation of the freeable ram, overkill code in the allocator (the limit stuff is senseless), tons of magic numbers that don't make any sensible difference, tons of cpu wasted, performance that decreases at every run of the benchmarks, etc... If you believe I'm dreaming just forget about this email, this is my last email about this until I've finished. Andrea ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 22:04 ` Andrea Arcangeli @ 2001-09-05 2:41 ` Daniel Phillips 0 siblings, 0 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-05 2:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Jan Harkes, Marcelo Tosatti, Rik van Riel, linux-kernel On September 5, 2001 12:04 am, Andrea Arcangeli wrote: > On Tue, Sep 04, 2001 at 10:10:42PM +0200, Daniel Phillips wrote: > > Which reproducible deadlocks did you have in mind, and how do I reproduce > > them? > > I meant the various known oom deadlocks. I've one showstopper report > with the blkdev in pagecache patch with in use also a small ramdisk > pagecache backed, the pagecache backed works like ramfs etc.. marks the > page dirty again in writepage, somebody must have broken page_launder or > something else in the memory managment because exactly the same code was > working fine in 2.4.7. Now it probably loops or breaks totally when > somebody marks the page dirty again, but the vm problems are much much > wider, starting from the kswapd loop on gfp dma or gfp normal, the > overkill swapping when there's tons of ram in freeable cache and you are > taking advantage of the cache, lack of defragmentation, lack of > knowledge of the classzone to balance in the memory balancing (this in > turn is why kswapd goes mad), very imprecise estimation of the freeable > ram, overkill code in the allocator (the limit stuff is senseless), tons > magic numbers that doesn't make any sensible difference, tons of cpu > wasted, performance that decreases at every run of the benchmarks, > etc... > > If you believe I'm dreaming just forget about this email, this is my > last email about this until I've finished. Sure. You mentioned one deadlock - oom - and a bunch of suckages. The oom problem is related to imprecise knowledge of freeable memory, you could group those two together. Active defragmentation isn't going to be that hard, I think. We'll see... 
Don't forget all the stuff that works pretty well now. Most of the problem reports we're getting now are concerned with the fact that we're loading up logs with allocation failure messages. We probably wouldn't get those reports if we just turned off the messages now. Bounce buffer allocation was the stopper there and Marcelo's patch has put that one away. I think I found a practical solution to the 0 order atomic failures, subject to more confirmation. Balancing and aging, while not perfect, are at least serviceable. Hugh Dickins rooted out a bunch of genuine bugs in swap. Rik seems to have defanged the swap space allocation problem. Other bugs were rooted out and killed by Ben and Linus. All in all, things are much improved. The biggest issue we need to tackle before calling it a serviceable vm system is the freeable memory accounting. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 19:54 ` Andrea Arcangeli 2001-09-04 18:36 ` Marcelo Tosatti 2001-09-04 20:10 ` Daniel Phillips @ 2001-09-06 11:18 ` Rik van Riel 2 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-06 11:18 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Jan Harkes, Marcelo Tosatti, linux-kernel On Tue, 4 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 04, 2001 at 01:54:27PM -0400, Jan Harkes wrote: > > Now for the past _9_ stable kernel releases, page aging hasn't worked > > at all!! Nobody seems to even have bothered to check. I send in a patch > > All I can say is that I hope you will get your problem fixed with one > of the next -aa, I incidentally started working on it yesterday. You too? ;) > So far it's a one thousand diff very far from compiling, so it will > grow further, but it shouldn't take too long to finish the rewrite. > Once finished the benchmarks and the reproducible 2.4 deadlocks will > tell me if I'm right. Of course, we could try to work together on this one, since we both seem to be starved for time ... cheers, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 17:13 ` Jan Harkes 2001-09-04 15:56 ` Marcelo Tosatti @ 2001-09-04 17:35 ` Daniel Phillips 1 sibling, 0 replies; 79+ messages in thread From: Daniel Phillips @ 2001-09-04 17:35 UTC (permalink / raw) To: Jan Harkes, Rik van Riel; +Cc: Marcelo Tosatti, Linus Torvalds, lkml On September 4, 2001 07:13 pm, Jan Harkes wrote: > On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote: > > I've been working on a CPU and memory efficient reverse > > mapping patch for Linux, one which will allow us to do > > a bunch of optimisations for later on (infrastructure) > > and has as its short-term benefit the potential for > > better page aging. > > Yes, I can see that using reverse mappings would be a way of correcting > the aging if you call page_age_up from try_to_swap_out, in which case > there probably needs to be a page_age_down on virtual mappings as well > to correctly balance things. There is:

1) Unreferenced process space page gets unmapped, goes on to LRU lists
2) Page aged down to zero until it gets deactivated
3) Page deactivated and evicted soon after.

If the page is referenced during (2) or (3) it will be mapped back in, no IO because it's still in the swap cache (minor fault). But this is lopsided and hard to balance. Also, unmapping/remapping is an expensive way to check for short-term page activity. > > It seems the balancing FreeBSD does (up aging +3, down > > aging -1, inactive list in LRU order as extra stage) is > > One other observation, we should add anonymously allocated memory to the > active-list as soon as they are allocated in do_nopage. At the moment a > large part of memory is not aged at all until we start swapping things > out. This is useless without rmap because the page will just be aged down, not up. With rmap, yes, that's what needs to be done. -- Daniel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 16:27 ` Rik van Riel 2001-09-04 17:13 ` Jan Harkes @ 2001-09-04 20:43 ` Jan Harkes 2001-09-06 11:21 ` Rik van Riel 1 sibling, 1 reply; 79+ messages in thread From: Jan Harkes @ 2001-09-04 20:43 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel On Tue, Sep 04, 2001 at 01:27:50PM -0300, Rik van Riel wrote: > I've been working on a CPU and memory efficient reverse > mapping patch for Linux, one which will allow us to do > a bunch of optimisations for later on (infrastructure) > and has as its short-term benefit the potential for > better page aging. > > It seems the balancing FreeBSD does (up aging +3, down > aging -1, inactive list in LRU order as extra stage) is > working nicely on my laptop now, but I don't think I'll > be releasing that as part of the patch ... > > http://www.surriel.com/patches/2.4/2.4.8-ac12-pmap3 I like the fact that it completely removes the vm crawling swap_out path. It also does aging more sanely because it now can take everything into account. It also works around the problems of anonymous pages that aren't aged until they are added to the swap cache. It should also minimize unnecessary minor page faults because the unmapping is done for all pte's once the page->age hits zero, and frequently used pages should not grab and lock down swapspace that they won't be able to give up (until the process exits). The pte_chain allocation stuff looks a bit scary, where did you want to reclaim them from when memory runs out, unmap existing pte's? One thing that might be nice, and showed a lot of promise here, is to age down by subtracting instead of dividing, to make it less aggressive. It is already hard enough for pages to get referenced enough to move up the scale. Or use a similar approach as I have in my patch: age up periodically, but only age down when there is memory shortage. This gives a slight advantage to processes that were running when there was not much VM pressure. 
When something starts hogging memory, it is penalized a bit for disturbing the peace, but the aggressive down aging will quickly rebalance, typically within about 3 calls to do_try_to_free_pages. I might port your patch over to Linus's 2.4.10-pre tree to play with it. It could very well be a significant improvement because it does address many of the issues that I ran into. Jan ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-09-04 20:43 ` Jan Harkes @ 2001-09-06 11:21 ` Rik van Riel 0 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-09-06 11:21 UTC (permalink / raw) To: Jan Harkes; +Cc: linux-kernel On Tue, 4 Sep 2001, Jan Harkes wrote: > The pte_chain allocation stuff looks a bit scary, where did you want > to reclaim them from when memory runs out, unmap existing pte's? Exactly. This is the strategy also used by BSD and it seems to work really well. > One thing that might be nice, and showed a lot of promise here is to > either age down by subtracting instead of dividing to make it less > aggressive. It is already hard enough for pages to get referenced > enough to move up the scale. Oh definitely, I've tried it with linear page aging and it works a lot better. I'm just not including that in my patch right now because I don't want to mix policy and mechanism, and I want to really get the mechanism right before moving on to other stuff. > Or use a similar approach as I have in my patch, age up periodically, > but only age down when there is memory shortage, Where can I get your patch? regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
[parent not found: <20010828180108Z16193-32383+2058@humbolt.nl.linux.org.suse.lists.linux.kernel>]
[parent not found: <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel>]
* Re: page_launder() on 2.4.9/10 issue [not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel> @ 2001-08-28 19:14 ` Andi Kleen 2001-08-29 13:48 ` Rik van Riel 2001-08-28 20:01 ` David S. Miller 1 sibling, 1 reply; 79+ messages in thread From: Andi Kleen @ 2001-08-28 19:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds <torvalds@transmeta.com> writes: Regarding kswapd in 2.4.9: At least something seems to be broken in it. I did run some 900MB processes on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU time. -Andi ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-08-28 19:14 ` Andi Kleen @ 2001-08-29 13:48 ` Rik van Riel 2001-08-29 13:49 ` Linus Torvalds 0 siblings, 1 reply; 79+ messages in thread From: Rik van Riel @ 2001-08-29 13:48 UTC (permalink / raw) To: Andi Kleen; +Cc: Linus Torvalds, linux-kernel On 28 Aug 2001, Andi Kleen wrote: > Regarding kswapd in 2.4.9: > > At least something seems to be broken in it. I did run some 900MB processes > on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU > time. Well yes, if you never wait on IO synchronously kswapd turns into one big busy-loop. But we knew that, it was even written down in the comments in vmscan.c ;) regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-08-29 13:48 ` Rik van Riel @ 2001-08-29 13:49 ` Linus Torvalds 2001-08-29 14:38 ` Rik van Riel 0 siblings, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2001-08-29 13:49 UTC (permalink / raw) To: Rik van Riel; +Cc: Andi Kleen, linux-kernel On Wed, 29 Aug 2001, Rik van Riel wrote: > On 28 Aug 2001, Andi Kleen wrote: > > > Regarding kswapd in 2.4.9: > > > > At least something seems to be broken in it. I did run some 900MB processes > > on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU > > time. > > Well yes, if you never wait on IO synchronously kswapd turns > into one big busy-loop. But we knew that, it was even written > down in the comments in vmscan.c ;) Rik, look again: kswapd _does_ wait on IO these days. Not ever waiting for IO is just a sure way to overload the IO subsystem and cause horrible interactive behaviour. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-08-29 13:49 ` Linus Torvalds @ 2001-08-29 14:38 ` Rik van Riel 0 siblings, 0 replies; 79+ messages in thread From: Rik van Riel @ 2001-08-29 14:38 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel On Wed, 29 Aug 2001, Linus Torvalds wrote: > Rik, look again: kswapd _does_ wait on IO these days. Indeed, I missed the magic in sync_page_buffers(). regards, Rik -- IA64: a worthy successor to the i860. http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue [not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel> 2001-08-28 19:14 ` Andi Kleen @ 2001-08-28 20:01 ` David S. Miller 2001-08-28 20:49 ` Linus Torvalds 2001-08-28 20:56 ` David S. Miller 1 sibling, 2 replies; 79+ messages in thread From: David S. Miller @ 2001-08-28 20:01 UTC (permalink / raw) To: ak; +Cc: torvalds, linux-kernel From: Andi Kleen <ak@suse.de> Date: 28 Aug 2001 21:14:15 +0200 At least something seems to be broken in it. I did run some 900MB processes on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU time. That's all swapmap lookup stupidity, you'll see __get_swap_page() near the top of your profiles. The algorithm is just sucky. Later, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-08-28 20:01 ` David S. Miller @ 2001-08-28 20:49 ` Linus Torvalds 2001-08-28 20:56 ` David S. Miller 1 sibling, 0 replies; 79+ messages in thread From: Linus Torvalds @ 2001-08-28 20:49 UTC (permalink / raw) To: David S. Miller; +Cc: ak, linux-kernel On Tue, 28 Aug 2001, David S. Miller wrote: > > At least something seems to be broken in it. I did run some 900MB processes > on a 512MB machine with 2.4.9 and kswapd took between 70 and 90% of the CPU > time. > > That's all swapmap lookup stupidity, you'll see __get_swap_page() > near the top of your profiles. The algorithm is just sucky. Well, in all fairness the kswapd changes _do_ make kswapd more eager to keep running too (ie kswapd tends to keep running until there is no shortage any more - which it traditionally hasn't really done). There might be an argument for making kswapd less eager, and more of a background thing. Regardless of where it actually spends the CPU time. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: page_launder() on 2.4.9/10 issue 2001-08-28 20:01 ` David S. Miller 2001-08-28 20:49 ` Linus Torvalds @ 2001-08-28 20:56 ` David S. Miller 1 sibling, 0 replies; 79+ messages in thread From: David S. Miller @ 2001-08-28 20:56 UTC (permalink / raw) To: torvalds; +Cc: ak, linux-kernel From: Linus Torvalds <torvalds@transmeta.com> Date: Tue, 28 Aug 2001 13:49:40 -0700 (PDT) There might be an argment for making kswapd less eager, and more of a background thing. Regardless of where it actually spends the CPU time. Right, but this is not an argument against fixing __get_swap_page's algorithms to be more reasonable :-) Later, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* page_launder() on 2.4.9/10 issue @ 2001-09-27 23:14 Samium Gromoff 0 siblings, 0 replies; 79+ messages in thread From: Samium Gromoff @ 2001-09-27 23:14 UTC (permalink / raw) To: lkml; +Cc: Linus Linus wrote: > Think about it - do you really want the system to actively try to reach > the point where it has no "regular" pages left, and has to start writing > stuff out (and wait for them synchronously) in order to free up memory? I I'm in 100% agreement with you here: I have been hit by this issue a lot of times... This is absolutely reproducible in the streaming I/O case. I think the lower the number of processes simultaneously accessing data, the harder this beats us... (I can't explain it, but that is how it feels) > strongly feel that the old code was really really wrong - it may have been sorry if I'm just noise here... cheers, Sam ^ permalink raw reply [flat|nested] 79+ messages in thread
end of thread, other threads: [~2001-09-07 21:49 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-28  3:36 page_launder() on 2.4.9/10 issue Marcelo Tosatti
2001-08-28 18:07 ` Daniel Phillips
2001-08-28 18:17 ` Linus Torvalds
2001-08-30  1:36 ` Daniel Phillips
2001-09-03 14:57 ` Marcelo Tosatti
2001-09-04 15:26 ` Jan Harkes
2001-09-04 15:24 ` Marcelo Tosatti
2001-09-04 17:14 ` Jan Harkes
2001-09-04 15:53 ` Marcelo Tosatti
2001-09-04 19:33 ` Daniel Phillips
2001-09-06 11:52 ` Rik van Riel
2001-09-06 12:31 ` Daniel Phillips
2001-09-06 12:32 ` Rik van Riel
2001-09-06 12:53 ` Daniel Phillips
2001-09-06 13:03 ` Rik van Riel
2001-09-06 13:18 ` Kurt Garloff
2001-09-06 13:23 ` Rik van Riel
2001-09-06 13:28 ` Alan Cox
2001-09-06 13:29 ` Rik van Riel
2001-09-06 16:45 ` Daniel Phillips
2001-09-06 16:57 ` Rik van Riel
2001-09-06 17:22 ` Daniel Phillips
2001-09-06 19:25 ` Rik van Riel
2001-09-06 19:45 ` Daniel Phillips
2001-09-06 19:52 ` Rik van Riel
2001-09-07  0:32 ` Kurt Garloff
2001-09-06 19:53 ` Mike Fedyk
2001-09-06 17:35 ` Mike Fedyk
2001-09-06 13:10 ` Stephan von Krawczynski
2001-09-06 13:23 ` Alex Bligh - linux-kernel
2001-09-06 13:54 ` M. Edward Borasky
2001-09-06 14:39 ` Alan Cox
2001-09-06 16:20 ` Victor Yodaiken
2001-09-06 17:33 ` Daniel Phillips
2001-09-06 13:42 ` Stephan von Krawczynski
2001-09-06 14:01 ` Alex Bligh - linux-kernel
2001-09-06 14:39 ` Stephan von Krawczynski
2001-09-06 15:02 ` Alex Bligh - linux-kernel
2001-09-06 15:07 ` Rik van Riel
[not found] ` <Pine.LNX.4.33L.0109061206020.31200-100000@imladris.rielhome.conectiva>
2001-09-06 15:16 ` Alex Bligh - linux-kernel
2001-09-06 15:10 ` Stephan von Krawczynski
2001-09-06 15:18 ` Alex Bligh - linux-kernel
2001-09-06 17:34 ` Daniel Phillips
2001-09-06 17:32 ` Alex Bligh - linux-kernel
2001-09-06 17:51 ` Daniel Phillips
2001-09-06 21:01 ` [RFC] Defragmentation proposal: preventative maintenance and cleanup [LONG] Alex Bligh - linux-kernel
2001-09-07  6:35 ` Daniel Phillips
2001-09-07  8:58 ` Alex Bligh - linux-kernel
2001-09-07  9:15 ` Alex Bligh - linux-kernel
2001-09-07  9:28 ` Alex Bligh - linux-kernel
2001-09-07 21:38 ` Daniel Phillips
2001-09-07 21:56 ` Daniel Phillips
2001-09-07 12:30 ` page_launder() on 2.4.9/10 issue Stephan von Krawczynski
2001-09-04 16:27 ` Rik van Riel
2001-09-04 17:13 ` Jan Harkes
2001-09-04 15:56 ` Marcelo Tosatti
2001-09-04 17:54 ` Jan Harkes
2001-09-04 16:37 ` Marcelo Tosatti
2001-09-04 18:49 ` Alan Cox
2001-09-04 19:39 ` Jan Harkes
2001-09-04 20:25 ` Alan Cox
2001-09-06 11:23 ` Rik van Riel
2001-09-04 19:54 ` Andrea Arcangeli
2001-09-04 18:36 ` Marcelo Tosatti
2001-09-04 20:10 ` Daniel Phillips
2001-09-04 22:04 ` Andrea Arcangeli
2001-09-05  2:41 ` Daniel Phillips
2001-09-06 11:18 ` Rik van Riel
2001-09-04 17:35 ` Daniel Phillips
2001-09-04 20:43 ` Jan Harkes
2001-09-06 11:21 ` Rik van Riel
[not found] <20010828180108Z16193-32383+2058@humbolt.nl.linux.org.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.33.0108281110540.8754-100000@penguin.transmeta.com.suse.lists.linux.kernel>
2001-08-28 19:14 ` Andi Kleen
2001-08-29 13:48 ` Rik van Riel
2001-08-29 13:49 ` Linus Torvalds
2001-08-29 14:38 ` Rik van Riel
2001-08-28 20:01 ` David S. Miller
2001-08-28 20:49 ` Linus Torvalds
2001-08-28 20:56 ` David S. Miller
2001-09-27 23:14 page_launder() on 2.4.9/10 issue Samium Gromoff
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).