* Bug with shared memory.
@ 2002-05-14 15:13 Martin Schwidefsky
  2002-05-14 19:33 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread

From: Martin Schwidefsky @ 2002-05-14 15:13 UTC (permalink / raw)
To: linux-kernel

Hi,
we managed to hang the kernel with a db/2 stress test on s/390. The test
was done on 2.4.7, but the problem is present in all recent 2.4.x and
2.5.x kernels (all architectures). In short, schedule() is called while
holding the shm_lock of a shared memory segment. The system call that
caused this was sys_ipc with IPC_RMID, and from there the call chain is
as follows: sys_shmctl, shm_destroy, fput, dput, iput,
truncate_inode_pages, truncate_list_pages, schedule. The scheduler
picked a process that called sys_shmat. It tries to get the lock and
hangs.

One way to fix this is to remove the schedule call from
truncate_list_pages:

--- linux-2.5/mm/filemap.c~	Tue May 14 17:04:14 2002
+++ linux-2.5/mm/filemap.c	Tue May 14 17:04:33 2002
@@ -237,11 +237,6 @@
 
 			page_cache_release(page);
 
-			if (need_resched()) {
-				__set_current_state(TASK_RUNNING);
-				schedule();
-			}
-
 			write_lock(&mapping->page_lock);
 			goto restart;
 		}

Another way is to free the lock before calling fput in shm_destroy, but
the comment says that this function has to be called with shp and
shm_ids.sem locked. Comments?

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com

^ permalink raw reply	[flat|nested] 29+ messages in thread
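[The failure mode Martin describes, schedule() with a spinlock held
handing the CPU to a task that then spins on that same lock, can be
sketched in a few lines of userspace C. Everything here is illustrative:
a plain flag stands in for shm_lock and a trylock stands in for the
sys_shmat() caller; this is the shape of the bug, not the kernel code.]

```c
#include <assert.h>

static int shm_lock_held;               /* stands in for shp's spinlock */

static int spin_trylock(void)           /* 1 = acquired, 0 = would spin */
{
    if (shm_lock_held)
        return 0;
    shm_lock_held = 1;
    return 1;
}

static void spin_unlock(void)
{
    shm_lock_held = 0;
}

/* Model of the IPC_RMID path: truncate_list_pages() calls schedule()
 * while shm_lock is still held.  Any sys_shmat() caller scheduled in
 * at that point spins on the lock forever.  Returns 1 if the attacher
 * would hang. */
static int shmat_would_hang_during_destroy(void)
{
    int attacher_got_lock;

    assert(spin_trylock());     /* sys_shmctl(IPC_RMID) takes shm_lock */
    /* ... fput -> dput -> iput -> truncate_list_pages -> schedule()
     * happens here, i.e. another task runs while we still own the
     * lock ... */
    attacher_got_lock = spin_trylock();   /* sys_shmat()'s shm_lock() */
    spin_unlock();
    return !attacher_got_lock;
}
```

Because the lock is still owned when the "other task" runs, the trylock
fails, which is exactly why the real attacher never makes progress.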
* Re: Bug with shared memory.
  2002-05-14 15:13 Bug with shared memory Martin Schwidefsky
@ 2002-05-14 19:33 ` Andrew Morton
  2002-05-15 22:42   ` Mike Kravetz
  2002-05-20  4:30   ` Andrea Arcangeli
  0 siblings, 2 replies; 29+ messages in thread

From: Andrew Morton @ 2002-05-14 19:33 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: linux-kernel

Martin Schwidefsky wrote:
>
> Hi,
> we managed to hang the kernel with a db/2 stress test on s/390. The test
> was done on 2.4.7, but the problem is present in all recent 2.4.x and
> 2.5.x kernels (all architectures). In short, schedule() is called while
> holding the shm_lock of a shared memory segment. The system call that
> caused this was sys_ipc with IPC_RMID, and from there the call chain is
> as follows: sys_shmctl, shm_destroy, fput, dput, iput,
> truncate_inode_pages, truncate_list_pages, schedule. The scheduler
> picked a process that called sys_shmat. It tries to get the lock and
> hangs.

There's no way the kernel can successfully hold a spinlock
across that call chain.

> One way to fix this is to remove the schedule call from
> truncate_list_pages:
>
> --- linux-2.5/mm/filemap.c~	Tue May 14 17:04:14 2002
> +++ linux-2.5/mm/filemap.c	Tue May 14 17:04:33 2002
> @@ -237,11 +237,6 @@
>
>  			page_cache_release(page);
>
> -			if (need_resched()) {
> -				__set_current_state(TASK_RUNNING);
> -				schedule();
> -			}
> -
>  			write_lock(&mapping->page_lock);
>  			goto restart;
>  		}
>
> Another way is to free the lock before calling fput in shm_destroy, but
> the comment says that this function has to be called with shp and
> shm_ids.sem locked. Comments?

Maybe ipc_ids.ary should become a semaphore?

-
* Re: Bug with shared memory.
  2002-05-14 19:33 ` Andrew Morton
@ 2002-05-15 22:42   ` Mike Kravetz
  2002-05-15 23:07     ` Andrew Morton
  2002-05-17 17:53     ` Bill Davidsen
  1 sibling, 2 replies; 29+ messages in thread

From: Mike Kravetz @ 2002-05-15 22:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin Schwidefsky, linux-kernel

On Tue, May 14, 2002 at 12:33:23PM -0700, Andrew Morton wrote:
> Martin Schwidefsky wrote:
> > The system call that caused this was sys_ipc with IPC_RMID, and from
> > there the call chain is as follows: sys_shmctl, shm_destroy, fput,
> > dput, iput, truncate_inode_pages, truncate_list_pages, schedule. The
> > scheduler picked a process that called sys_shmat. It tries to get
> > the lock and hangs.
>
> There's no way the kernel can successfully hold a spinlock
> across that call chain.

Is adding a check for this type of situation (under CONFIG_DEBUG_SPINLOCK
of course) worth the effort? One would simply add a 'locks_held' count
for each task and check for zero at certain places, such as return to
user mode and during context switching.

It appears that this was done for 'sparc64', but no other architectures.
I would consider doing this for i386, if anyone would actually use it.

One would think these types of things are easily found, but this example
suggests otherwise. Has anyone run the kernel through an extensive
(stress) test suite with any of the kernel debug options enabled?

-- 
Mike
* Re: Bug with shared memory.
  2002-05-15 22:42 ` Mike Kravetz
@ 2002-05-15 23:07   ` Andrew Morton
  2002-05-17 17:53   ` Bill Davidsen
  1 sibling, 0 replies; 29+ messages in thread

From: Andrew Morton @ 2002-05-15 23:07 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Martin Schwidefsky, linux-kernel

Mike Kravetz wrote:
>
> Is adding a check for this type of situation (under
> CONFIG_DEBUG_SPINLOCK of course) worth the effort? One would simply
> add a 'locks_held' count for each task and check for zero at certain
> places, such as return to user mode and during context switching.

I think it would be worth the effort. One approach would be to create a
`can_sleep()' macro and add that to functions which may schedule. It's
useful for documentation purposes as well as for runtime checks.

The Stanford checker caught a lot of these, but it seems that the (high)
amount of source-level obfuscation in the ipc code defeated it.

> One would think these types of things are easily found, but this
> example suggests otherwise. Has anyone run the kernel through an
> extensive (stress) test suite with any of the kernel debug options
> enabled?

There are at present no tools in the tree to trap this problem.

-
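[A userspace sketch of what the two proposals above could look like
together: Mike's per-task 'locks_held' counter plus Andrew's
can_sleep() annotation macro. All names here are hypothetical, not the
real kernel API; mainline later grew a very similar check under the
name might_sleep().]

```c
#include <assert.h>
#include <stdio.h>

static int locks_held;                  /* would live in task_struct */

#define dbg_spin_lock()   (locks_held++)
#define dbg_spin_unlock() (locks_held--)

/* can_sleep(): evaluates to 0 if sleeping is safe here, nonzero if a
 * spinlock is held.  The kernel version would printk a warning and a
 * stack trace instead of returning a value. */
#define can_sleep() \
    (locks_held ? (fprintf(stderr, \
        "BUG: sleeping function called with %d lock(s) held at %s:%d\n", \
        locks_held, __FILE__, __LINE__), locks_held) : 0)

/* A function that may block annotates itself at its entry point.
 * A stub standing in for truncate_list_pages()/schedule(): */
static int might_block_stub(void)
{
    return can_sleep();   /* would have caught the shm_destroy chain */
}
```

The check costs one integer compare in the debug build and doubles as
documentation that the annotated function may schedule.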
* Re: Bug with shared memory.
  2002-05-15 22:42 ` Mike Kravetz
  2002-05-15 23:07   ` Andrew Morton
@ 2002-05-17 17:53   ` Bill Davidsen
  2002-05-17 20:07     ` Mike Kravetz
  1 sibling, 1 reply; 29+ messages in thread

From: Bill Davidsen @ 2002-05-17 17:53 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Andrew Morton, Martin Schwidefsky, linux-kernel

On Wed, 15 May 2002, Mike Kravetz wrote:

> It appears that this was done for 'sparc64', but no other architectures.
> I would consider doing this for i386, if anyone would actually use it.
>
> One would think these types of things are easily found, but this
> example suggests otherwise. Has anyone run the kernel through an
> extensive (stress) test suite with any of the kernel debug options
> enabled?

Does this imply that the option:
	CONFIG_DEBUG_SPINLOCK=y
doesn't work on x86? Or works poorly?

I'm not sure what changes you're proposing, but if they will make this
more robust I'll certainly use them! SMP lockups are the bane of my
existence, although 19-pre8-ac4+preempt+iowait has yet to take that
route.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Bug with shared memory.
  2002-05-17 17:53 ` Bill Davidsen
@ 2002-05-17 20:07   ` Mike Kravetz
  2002-05-17 20:29     ` Anton Blanchard
  0 siblings, 1 reply; 29+ messages in thread

From: Mike Kravetz @ 2002-05-17 20:07 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel

On Fri, May 17, 2002 at 01:53:10PM -0400, Bill Davidsen wrote:
> On Wed, 15 May 2002, Mike Kravetz wrote:
>
> > It appears that this was done for 'sparc64', but no other
> > architectures. I would consider doing this for i386, if anyone would
> > actually use it.
> >
> > One would think these types of things are easily found, but this
> > example suggests otherwise. Has anyone run the kernel through an
> > extensive (stress) test suite with any of the kernel debug options
> > enabled?
>
> Does this imply that the option:
> 	CONFIG_DEBUG_SPINLOCK=y
> doesn't work on x86? Or works poorly?

No, I did not intend to imply this. AFAIK 'CONFIG_DEBUG_SPINLOCK=y'
works fine on x86 (although I'm not a user myself). My intention was
only to add additional features to x86 that appear to only exist for
sparc64.

-- 
Mike
* Re: Bug with shared memory.
  2002-05-17 20:07 ` Mike Kravetz
@ 2002-05-17 20:29   ` Anton Blanchard
  0 siblings, 0 replies; 29+ messages in thread

From: Anton Blanchard @ 2002-05-17 20:29 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Bill Davidsen, linux-kernel

> No, I did not intend to imply this. AFAIK 'CONFIG_DEBUG_SPINLOCK=y'
> works fine on x86 (although I'm not a user myself). My intention was
> only to add additional features to x86 that appear to only exist for
> sparc64.

It would be nice to see some of these things end up in a generic place.
Now that we can wrap our spinlocks easily (preempt changed all the
spinlocks to _raw_*) we could create a generic debug option that
checked:

- uninitialised lock
- timeout on deadlock
- double lock by same cpu
- unlock by another cpu
- unlock when not locked
- scheduling with lock held
- read lock when same cpu has write lock

And I'm sure there are some more useful tests.

Anton
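[A minimal sketch of the kind of generic wrapper Anton is proposing,
implementing three of the checks from his list: double lock by the same
cpu, unlock when not locked, and unlock by another cpu. The struct and
names are invented for illustration; a real implementation would wrap
the _raw_* primitives and printk on error rather than return a code.]

```c
#include <assert.h>

#define NO_OWNER (-1)

struct dbg_spinlock {
    int locked;
    int owner;          /* cpu id of the current holder */
};

enum lock_err {
    LOCK_OK,
    ERR_DOUBLE_LOCK,    /* same cpu locking twice: would deadlock */
    ERR_NOT_LOCKED,     /* unlock of an unlocked lock */
    ERR_WRONG_CPU       /* unlock by a cpu that doesn't hold it */
};

static enum lock_err dbg_lock(struct dbg_spinlock *l, int cpu)
{
    if (l->locked && l->owner == cpu)
        return ERR_DOUBLE_LOCK;
    /* a real lock would spin here; single-threaded sketch */
    l->locked = 1;
    l->owner = cpu;
    return LOCK_OK;
}

static enum lock_err dbg_unlock(struct dbg_spinlock *l, int cpu)
{
    if (!l->locked)
        return ERR_NOT_LOCKED;
    if (l->owner != cpu)
        return ERR_WRONG_CPU;
    l->locked = 0;
    l->owner = NO_OWNER;
    return LOCK_OK;
}
```

The "scheduling with lock held" check from the list would live in the
scheduler rather than here: a per-task counter bumped in dbg_lock() and
tested in schedule().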
* Re: Bug with shared memory.
  2002-05-14 19:33 ` Andrew Morton
  2002-05-15 22:42   ` Mike Kravetz
@ 2002-05-20  4:30   ` Andrea Arcangeli
  2002-05-20  5:21     ` Andrew Morton
                       ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread

From: Andrea Arcangeli @ 2002-05-20 4:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel

On Tue, May 14, 2002 at 12:33:23PM -0700, Andrew Morton wrote:
> Martin Schwidefsky wrote:
> >
> > Hi,
> > we managed to hang the kernel with a db/2 stress test on s/390. The
> > test was done on 2.4.7, but the problem is present in all recent
> > 2.4.x and 2.5.x kernels (all architectures). In short, schedule() is
> > called while holding the shm_lock of a shared memory segment. The
> > system call that caused this was sys_ipc with IPC_RMID, and from
> > there the call chain is as follows: sys_shmctl, shm_destroy, fput,
> > dput, iput, truncate_inode_pages, truncate_list_pages, schedule.
> > The scheduler picked a process that called sys_shmat. It tries to
> > get the lock and hangs.
>
> There's no way the kernel can successfully hold a spinlock
> across that call chain.

Yep, that's a blocking operation (even truncate_inode_pages can block
for reasons not related to the expired timeslice).

> > One way to fix this is to remove the schedule call from
> > truncate_list_pages:
> >
> > [patch against mm/filemap.c snipped]
> >
> > Another way is to free the lock before calling fput in shm_destroy,
> > but the comment says that this function has to be called with shp
> > and shm_ids.sem locked. Comments?
>
> Maybe ipc_ids.ary should become a semaphore?

The shmid spinlock is not needed across the fput, so we don't need to
switch to a semaphore. BTW, there seems to be a bit of superfluous
locking in that file (usually harmless, but not in this case). NOTE:
apparently only shm_rmid needs the shmid lock in that function (please
double-check). This should fix it:

--- 2.4.19pre8aa3/ipc/shm.c.~1~	Wed May 15 22:37:20 2002
+++ 2.4.19pre8aa3/ipc/shm.c	Mon May 20 06:18:11 2002
@@ -117,12 +117,15 @@ static void shm_open (struct vm_area_str
  *
  * @shp: struct to free
  *
- * It has to be called with shp and shm_ids.sem locked
+ * It has to be called with the id locked and the shm_ids.sem
+ * held. Before returning it unlocks the id internally. fput()
+ * is a blocking operation.
  */
-static void shm_destroy (struct shmid_kernel *shp)
+static void shm_destroy (struct shmid_kernel *shp, int id)
 {
-	shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	shm_rmid (shp->id);
+	shm_unlock(id);
+	shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	shmem_lock(shp->shm_file, 0);
 	fput (shp->shm_file);
 	kfree (shp);
@@ -149,9 +152,9 @@ static void shm_close (struct vm_area_st
 	shp->shm_nattch--;
 	if(shp->shm_nattch == 0 &&
 	   shp->shm_flags & SHM_DEST)
-		shm_destroy (shp);
-
-	shm_unlock(id);
+		shm_destroy (shp, id);
+	else
+		shm_unlock(id);
 	up (&shm_ids.sem);
 }
@@ -512,11 +515,11 @@ asmlinkage long sys_shmctl (int shmid, i
 			shp->shm_flags |= SHM_DEST;
 			/* Do not find it any more */
 			shp->shm_perm.key = IPC_PRIVATE;
+			shm_unlock(shmid);
 		} else
-			shm_destroy (shp);
+			shm_destroy (shp, shmid);
 		/* Unlock */
-		shm_unlock(shmid);
 		up(&shm_ids.sem);
 		return err;
 	}
@@ -653,8 +656,9 @@ invalid:
 	shp->shm_nattch--;
 	if(shp->shm_nattch == 0 &&
 	   shp->shm_flags & SHM_DEST)
-		shm_destroy (shp);
-	shm_unlock(shmid);
+		shm_destroy (shp, shmid);
+	else
+		shm_unlock(shmid);
 	up (&shm_ids.sem);
 
 	*raddr = (unsigned long) user_addr;

Next I'll go ahead with the inode/highmem imbalance reported by Alexey
over the weekend. Then the only pending thing before the next -aa is
the integration of the 2.5 scheduler like in 2.4-ac. I will also soon
automate the porting of all the not-yet-merged stuff into 2.5, so we
don't risk forgetting anything.

I mostly care about mainline, but I would also suggest that Alan merge
into -ac the very same VM as -aa (so we also increase the testing before
it gets merged into mainline, as is happening bit by bit over time).
Alan, I'm looking at it a bit, and what you're shipping is a hybrid
between the old 2.4 VM and the current one, plus the rmap design for
unmapping pagetables. The issue is not the rmap design; the issue is
that the hybrid gratuitously reintroduces various bugs like the kswapd
infinite loop, and it misses all the recent fixes (problems like the
kswapd infinite loop are reproducible only after the machine has been
up for months; other things, like the unfreed bh with a shortage in the
normal zone fixed in recent -aa, are reproducible only with big irons,
numa/discontigmem stuff, the definitive fix for google, etc. etc.).

The rmap design as well cannot make nearly any positive difference
during paging (note: the rmap design, not the rmap patch; the rmap
patch in -ac does a whole lot more than just switching shrink_cache to
use a reverse lookup chain instead of triggering swap_out, it basically
returns to the old 2.4 VM lru logic). The system CPU usage during a
DBMS workload swapping 1.2G and with 1G of shm in RAM is not
significant, swap_out's linear complexity is not showing up heavily in
the profiling yet (on smaller machines with less total virtual memory
it's a no-brainer), and it instead hurts the scalability of most fast
paths like page faults and fork. The aging changes don't look
significant either; in fact, aging the accessed bits in round robin is
fairer than aging them in vm-lru order. All the numbers I've seen so
far agree, but I'll be very interested to know if you have reproduced
any different practical results in your lab doing -ac vs -aa VM
comparisons.

About the rmap design, I would very much appreciate it if Rik could
make a version of his patch that implements rmap on top of current -aa
(it wouldn't be a rewrite, just a port of the strict rmap feature), so
we can compare apples to apples and benchmark the effect of the rmap
patch, not the rmap plus the hybrid. Most of the slowdown during paging
is most probably due to the hybrid, not the rmap design; if anything,
the rmap design should make things a bit faster during swapout, by
being a bit slower in the more important fast paths. It is definitely
possible to put a strict rmap on top of -aa without the huge "hybrid"
thing attached to the rmap code, so without impacting the rest of the
VM at all. It's just a matter of adding the try_to_unmap in
shrink_cache and deleting the swap_out call (it's almost as easy as
shipping a version of Windows without a web browser installed by
default).

Andrea
* Re: Bug with shared memory.
  2002-05-20  4:30 ` Andrea Arcangeli
@ 2002-05-20  5:21   ` Andrew Morton
  2002-05-20 11:34     ` Andrey Savochkin
                       ` (2 more replies)
  2002-05-20 16:13   ` Martin J. Bligh
  2002-05-24  7:33   ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli
  2 siblings, 3 replies; 29+ messages in thread

From: Andrew Morton @ 2002-05-20 5:21 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel

Andrea Arcangeli wrote:
>
> ...
> Next I'll go ahead with the inode/highmem imbalance reported by Alexey
> over the weekend.

(Some background: the problem is that ZONE_NORMAL gets clogged with
inodes which are unfreeable because they have attached pagecache pages.
There is no VM pressure against ZONE_HIGHMEM and there is nothing which
causes the pagecache pages to be freed. The machine dies due to
ZONE_NORMAL exhaustion. The workload is (presumably) the reading or
creation of a very large number of small files on a machine which has
much highmem.

I expect Andrea is looking at a solution which releases pages against
unused inodes when there are many unused inodes, but prune_icache() is
failing to free sufficient inodes.)

I'll be interested to see your solution ;) We need to be careful not to
over-shrink the pagecache. Also we need to be doubly careful to ensure
that LRU order is maintained on unused_list.

I expect this is a compound bug: sure, ZONE_NORMAL is clogged with
inodes. But I bet it's also clogged with buffer_heads. If the
buffer_head problem was fixed then the inode problem would occur much
later - more highmem, more/smaller files.

> Then the only pending thing before the next -aa is
> the integration of the 2.5 scheduler like in 2.4-ac. I will also soon
> automate the porting of all the not-yet-merged stuff into 2.5, so we
> don't risk forgetting anything.

That's good.

> I mostly care about mainline, but I would also suggest that Alan merge
> into -ac the very same VM as -aa (so we also increase the testing
> before it gets merged into mainline, as is happening bit by bit over
> time). Alan, I'm looking at it a bit, and what you're shipping is a
> hybrid between the old 2.4 VM and the current one, plus the rmap
> design for unmapping pagetables. The issue is not the rmap design; the
> issue is that the hybrid gratuitously reintroduces various bugs like
> the kswapd infinite loop, and it misses all the recent fixes (problems
> like the kswapd infinite loop are reproducible only after the machine
> has been up for months; other things, like the unfreed bh with a
> shortage in the normal zone fixed in recent -aa, are reproducible only
> with big irons, numa/discontigmem stuff, the definitive fix for
> google, etc. etc.).

I've seen a vague report from the IBM team which indicates that the -aa
VM does not solve the ZONE_NORMAL-full-of-buffer_heads lockup. The
workload was specweb on a 16 gig machine, I believe.

I did a quickie patch early this month which simply strips buffers away
from pages at each and every opportunity - that fixed it of course. At
the end of the run, ZONE_NORMAL had 250 MB free, as opposed to zero
with 2.4.18-stock.

I think this is a better approach than memclass_related_bhs(), because
it proactively makes memory available in ZONE_NORMAL for blockdev
pages. If we minimise the amount of memory which is consumed by
buffer_heads then we maximise the amount of memory which is available
to inodes and indirects. Sure, the buffers cache the disk mapping. But
the inodes and indirects cache the same information with *much* better
packing density. Sure, there's some extra CPU cost in reestablishing
the mapping, but in the no-buffer_heads testing I'm doing on 2.5 it is
unmeasurable.

Currently, we're effectively allowing buffer_heads to evict useful
pagecache data. We have to perform additional I/O to get that back.

In other words: once we've finished I/O, just get rid of the damn
things. A bit radical for 2.4 maybe.

> About the rmap design, I would very much appreciate it if Rik could
> make a version of his patch that implements rmap on top of current -aa

That would be good to see. I believe Rik is offline for a week or
three, so he may not see your words.

-
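[The imbalance described above can be reduced to a toy model: an
unused-inode list where a prune pass may only free inodes that have no
attached pagecache pages. When every "unused" inode still pins pages,
the pass frees nothing, which is the ZONE_NORMAL exhaustion in
miniature. All names are invented; this is not the real prune_icache().]

```c
#include <assert.h>

struct toy_inode {
    int nr_pages;       /* attached pagecache pages (in highmem) */
    int freed;
};

/* One prune pass over the unused list.  An inode with pagecache pages
 * attached is pinned and must be skipped, exactly the case Andrew
 * describes.  Returns how many inodes the pass could actually free. */
static int toy_prune_icache(struct toy_inode *in, int n)
{
    int freed = 0;

    for (int i = 0; i < n; i++) {
        if (in[i].freed || in[i].nr_pages)
            continue;   /* pinned by pagecache: unfreeable */
        in[i].freed = 1;
        freed++;
    }
    return freed;
}
```

With no pressure against ZONE_HIGHMEM the nr_pages counts never drop,
so repeated passes keep returning zero and ZONE_NORMAL never recovers.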
* Re: Bug with shared memory.
  2002-05-20  5:21 ` Andrew Morton
@ 2002-05-20 11:34   ` Andrey Savochkin
  2002-05-20 14:15   ` Andrea Arcangeli
  2002-05-20 16:22   ` Martin J. Bligh
  2 siblings, 0 replies; 29+ messages in thread

From: Andrey Savochkin @ 2002-05-20 11:34 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli
Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel

Hi,

On Sun, May 19, 2002 at 10:21:18PM -0700, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> > Next I'll go ahead with the inode/highmem imbalance reported by
> > Alexey over the weekend.
>
> (Some background: the problem is that ZONE_NORMAL gets clogged with
> inodes which are unfreeable because they have attached pagecache
> pages. There is no VM pressure against ZONE_HIGHMEM and there is
> nothing which causes the pagecache pages to be freed. The machine dies
> due to ZONE_NORMAL exhaustion. The workload is (presumably) the
> reading or creation of a very large number of small files on a machine
> which has much highmem.
>
> I expect Andrea is looking at a solution which releases pages against
> unused inodes when there are many unused inodes, but prune_icache()
> is failing to free sufficient inodes.)
>
> I'll be interested to see your solution ;) We need to be careful not
> to over-shrink the pagecache. Also we need to be doubly careful to
> ensure that LRU order is maintained on unused_list.
>
> I expect this is a compound bug: sure, ZONE_NORMAL is clogged with
> inodes. But I bet it's also clogged with buffer_heads. If the
> buffer_head problem was fixed then the inode problem would occur much
> later - more highmem, more/smaller files.

In fact, running the test on different hardware, I saw both cases: a
huge inode cache alone, and inodes+buffer_heads.

If ZONE_NORMAL is clogged with inodes+buffer_heads, the system works
very slowly (`tar x' on a test archive finishes in 20+ minutes instead
of 20 seconds), but it's still alive. If the test is aborted, the
system returns to the normal state.

If ZONE_NORMAL is clogged with inodes only, the system looks completely
frozen, and only Magic-SysRq shows the attempts to shrink the cache.

Andrey
* Re: Bug with shared memory.
  2002-05-20  5:21 ` Andrew Morton
  2002-05-20 11:34   ` Andrey Savochkin
@ 2002-05-20 14:15   ` Andrea Arcangeli
  2002-05-20 19:24     ` Rik van Riel
  2002-05-20 16:22   ` Martin J. Bligh
  2 siblings, 1 reply; 29+ messages in thread

From: Andrea Arcangeli @ 2002-05-20 14:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel

On Sun, May 19, 2002 at 10:21:18PM -0700, Andrew Morton wrote:
> I've seen a vague report from the IBM team which indicates that the
> -aa VM does not solve the ZONE_NORMAL-full-of-buffer_heads lockup.
> The workload was specweb on a 16 gig machine, I believe.

I didn't see any bug report; could you pass it over? I was confident
such bugs had been gone for a few weeks, but of course if there are
still problems I'd like to know. Note also that if the excessive number
of bh is due to rawio or O_DIRECT, that's not a VM issue.

> I did a quickie patch early this month which simply strips buffers
> away from pages at each and every opportunity - that fixed it of
> course. At the end of the run, ZONE_NORMAL had 250 MB free, as
> opposed to zero with 2.4.18-stock.
>
> I think this is a better approach than memclass_related_bhs(), because
> it proactively makes memory available in ZONE_NORMAL for blockdev
> pages. If we minimise the amount of memory which is consumed by
> buffer_heads then we maximise the amount of memory which is available
> to inodes and indirects. Sure,

With memclass_related_bhs() we automatically maximized the amount of
RAM available for inodes/indirects and everything else in ZONE_NORMAL;
after that, you can consider all the unlocked bh as free memory, just
like any other cache. The whole point of memclass_related_bhs() is that
it renders the bh an atomically shrinkable cache, so we can leave them
there until we run low on lowmem, and then we will shrink them just
fine. Actually, I see it would be better to start freeing the bh rather
than the ZONE_NORMAL memory (the cache in ZONE_NORMAL is more
important), so you may argue about page replacement decisions, but the
VM should just work correctly, as if the bh were free.

> the buffers cache the disk mapping. But the inodes and indirects
> cache the same information with *much* better packing density.
> Sure, there's some extra CPU cost in reestablishing the mapping, but
> in the no-buffer_heads testing I'm doing on 2.5 it is unmeasurable.
>
> Currently, we're effectively allowing buffer_heads to evict useful
> pagecache data. We have to perform additional I/O to get that back.
>
> In other words: once we've finished I/O, just get rid of the
> damn things. A bit radical for 2.4 maybe.

Well, I'm doing a superset of that at the moment: I drop them, and I
also know when it's time to do it. So implementing "drop any bh lazily
ASAP" would be a matter of a few lines; I can do that. I'm only a bit
worried that the thing becomes measurable in some overwrite-intensive
workload; we would introduce additional non-scalable get_blocks in the
overwrite paths. But overall I agree with you that it's probably worse
to shrink the normal pagecache; my solution, though, guarantees no
runtime difference on, for example, a machine with only ZONE_NORMAL.
Also, in 2.5 you shrunk the bh quite a bit, so it's a lighter entity to
refill. Anyway, the point is that you shouldn't run out of normal zone
anymore, and non-highmem won't see any change in behaviour with the
highmem/bh balance fix. If this is not the case, that is the real
problem; everything else is a performance optimization.

One bad thing, actually, is that if somebody rereads the page endlessly
and constantly keeps it in the active list, such a page may keep the bh
allocated for no good reason. But again, if too many bh are pinned that
way, eventually we'll get more aggressive against the active list, so
it would (should) again be mostly a performance issue.

> > About the rmap design, I would very much appreciate it if Rik could
> > make a version of his patch that implements rmap on top of current
> > -aa
>
> That would be good to see. I believe Rik is offline for a week

Indeed. A strict rmap patch would also make integration into 2.5
easier. I think rmap, like -preempt, the O(1) scheduler, your bio-based
async flusher rewrite (following the inodes, then the dirty pages in
the inode), and all the other non-showstopper issues (unlike
pte-highmem, VM fixes etc.) are ideal items to research during 2.5; I
don't think it makes much sense to keep them on top of 2.4.

> or three, so he may not see your words.

Ok. We CC'ed Rik, so I assume it won't get lost in the mail flood.

Andrea
* Re: Bug with shared memory.
  2002-05-20 14:15 ` Andrea Arcangeli
@ 2002-05-20 19:24   ` Rik van Riel
  2002-05-20 23:46     ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread

From: Rik van Riel @ 2002-05-20 19:24 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox

On Mon, 20 May 2002, Andrea Arcangeli wrote:

> With memclass_related_bhs() we automatically maximized the amount of
> RAM available for inodes/indirects and everything else in ZONE_NORMAL

OK, a question and a remark:

1) does memclass_related_bhs() work if the bufferheads are
   on another node? (NUMA-Q)

2) memclass_related_bhs() will definitely not work if the
   data structure is pinned indirectly, say struct address_space,
   which is pinned by the mere existence of the page cache page
   and cannot easily be freed

> > or three, so he may not see your words.
>
> Ok. We CC'ed Rik, so I assume it won't get lost in the mail flood.

I'm on holidays, don't expect patches soon :)

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/
* Re: Bug with shared memory.
  2002-05-20 19:24 ` Rik van Riel
@ 2002-05-20 23:46   ` Andrea Arcangeli
  2002-05-21  0:14     ` Martin J. Bligh
  0 siblings, 1 reply; 29+ messages in thread

From: Andrea Arcangeli @ 2002-05-20 23:46 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox

On Mon, May 20, 2002 at 04:24:37PM -0300, Rik van Riel wrote:
> On Mon, 20 May 2002, Andrea Arcangeli wrote:
>
> > With memclass_related_bhs() we automatically maximized the amount of
> > RAM available for inodes/indirects and everything else in ZONE_NORMAL
>
> OK, a question and a remark:
>
> 1) does memclass_related_bhs() work if the bufferheads are
>    on another node? (NUMA-Q)

Yes, of course. With all the NUMA fixes, the classzone logic is
independent of the pgdat layer, not only during the
memclass_related_bhs() check. The fastest way to understand how the
logic works is to read the zone_idx/memclass defines in mmzone.h.

> 2) memclass_related_bhs() will definitely not work if the
>    data structure is pinned indirectly, say struct address_space,
>    which is pinned by the mere existence of the page cache page
>    and cannot easily be freed

memclass_related_bhs() is just for the bhs; it has nothing to do with
the address space. It's on the page->buffers side of the page, not
page->mapping, so I don't understand very well. The inode problem is a
separate thing, nothing to do with memclass_related_bhs(); the fix will
be local to prune_icache.

For the memclass_related_bhs() fix in -aa, that's in the testing TODO
list of Martin (on the multi-giga machines); he also nicely proposed to
compare it to the other throw-away-all-bh-regardless patch from Andrew
(which I actually haven't seen floating around yet, but it's clear how
it works: it's a subset of memclass_related_bhs). However, the right
way to test memclass_related_bhs vs throw-away-all-bh is to run a
rewrite test that fits in cache, so write,fsync,write,fsync,write,fsync.
specweb or any other read-only test will obviously perform exactly the
same both ways (actually, theoretically a bit cpu-faster with
throw-away-all-bh because it doesn't check the bh list).

> > > or three, so he may not see your words.
> >
> > Ok. We CC'ed Rik, so I assume it won't get lost in the mail flood.
>
> I'm on holidays, don't expect patches soon :)

You shouldn't read emails either, then; you know it is strictly
forbidden to read emails during vacations :). I'm kidding of course,
thanks for the fast reply!

Andrea
* Re: Bug with shared memory.
  2002-05-20 23:46 ` Andrea Arcangeli
@ 2002-05-21  0:14   ` Martin J. Bligh
  2002-05-21  1:40     ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread

From: Martin J. Bligh @ 2002-05-21 0:14 UTC (permalink / raw)
To: Andrea Arcangeli, Rik van Riel
Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox

> For the memclass_related_bhs() fix in -aa, that's in the testing TODO
> list of Martin (on the multi-giga machines); he also nicely proposed
> to compare it to the other throw-away-all-bh-regardless patch from
> Andrew (which I actually haven't seen floating around yet, but it's
> clear how it works: it's a subset of memclass_related_bhs). However,
> the right way to test memclass_related_bhs vs throw-away-all-bh is to
> run a rewrite test that fits in cache, so
> write,fsync,write,fsync,write,fsync. specweb or any other read-only
> test will obviously perform exactly the same both ways (actually,
> theoretically a bit cpu-faster with throw-away-all-bh because it
> doesn't check the bh list).

The only thing that worries me in theory about your approach for this,
Andrea, is fragmentation - if we try to shrink only when we're low on
memory, isn't there a danger that one buffer_head per page of slab
cache will be in use, and thus no pages are freeable? (Obviously this
is extreme, but I can certainly see a situation with lots of partially
used pages.)

With Andrew's approach, keeping things freed as we go, we should
reuse the partially allocated slab pages, which would seem (to me)
to result in less fragmentation?

Thanks,

M.
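[Martin's fragmentation worry is easy to quantify with a toy model of
slab pages: a slab page can only be handed back to the page allocator
when every object in it is free, so one live buffer_head per page keeps
the whole page pinned. The objects-per-page count below is an
assumption chosen for illustration, not the real slab geometry.]

```c
#include <assert.h>

#define BH_PER_PAGE 8   /* assumed buffer_heads per slab page */

/* live[i] = number of still-allocated bhs in slab page i.
 * A page is reclaimable only when none of its objects are live. */
static int reclaimable_pages(const int *live, int npages)
{
    int freeable = 0;

    for (int i = 0; i < npages; i++)
        if (live[i] == 0)
            freeable++;
    return freeable;
}

/* Total free objects across all slab pages, for comparison. */
static int free_objects(const int *live, int npages)
{
    int free = 0;

    for (int i = 0; i < npages; i++)
        free += BH_PER_PAGE - live[i];
    return free;
}
```

In the worst case one live object per page means 87.5% of the objects
are free while zero pages can be returned, which is the scenario Martin
sketches; Andrea's answer in the follow-up is that this only matters if
the layer above cannot reclaim those last objects on demand.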
* Re: Bug with shared memory. 2002-05-21 0:14 ` Martin J. Bligh @ 2002-05-21 1:40 ` Andrea Arcangeli 0 siblings, 0 replies; 29+ messages in thread From: Andrea Arcangeli @ 2002-05-21 1:40 UTC (permalink / raw) To: Martin J. Bligh Cc: Rik van Riel, Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox On Mon, May 20, 2002 at 05:14:24PM -0700, Martin J. Bligh wrote: > > For the memclass_related_bhs() fix in -aa, that's in the testing TODO > > list of Martin (on the multi giga machines), he also nicely proposed to > > compare it to the other throw-away-all-bh-regardless patch from Andrew > > (that I actually didn't seen floating around yet but it's clear how it > > works, it's a subset of memclass_related_bhs). However the right way to > > test the memclass_related_bhs vs throw-away-all-bh, is to run a rewrite > > test that fits in cache, so write,fsync,write,fsync,write,fsync. specweb > > or any other read-only test will obviously perform exactly the same both > > ways (actually theoretically a bit cpu-faster in throw-away-all-bh > > because it doesn't check the bh list). > > The only thing that worries me in theory about your approach for this > Andrea is fragmentation - if we try to shrink only when we're low on > memory, isn't there a danger that one buffer_head per page of slab > cache will be in use, and thus no pages are freeable (obviously this > is extreme, but I can certainly see a situation with lots of partially used > pages)? well, then you should be worried first for the whole /proc/slabinfo, not just the bh heahders :) if it's a problem for the bh, it's a problem for everything else. 
The reason fragmentation is never a problem is that as long as the persistent slab objects can be reclaimed dynamically by the vm we will always be able to free all the slab pages. The only downside we run into, vs being aware of the slab fragmentation, is that we risk reclaiming more objects than necessary at the layer above the slab cache (so at the bh layer), but dropping more bh than necessary will never be a problem (Andrew wants to drop them all indeed). > With Andrew's approach, keeping things freed as we go, we should > reuse the partially allocated slab pages, which would seem (to me) > to result in less fragmentation? less fragmentation because of zero bh allocated from slab cache :). Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 5:21 ` Andrew Morton 2002-05-20 11:34 ` Andrey Savochkin 2002-05-20 14:15 ` Andrea Arcangeli @ 2002-05-20 16:22 ` Martin J. Bligh 2002-05-20 19:38 ` Rik van Riel 2 siblings, 1 reply; 29+ messages in thread From: Martin J. Bligh @ 2002-05-20 16:22 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel > I expect this is a compound bug: sure, ZONE_NORMAL is clogged with > inodes. But I bet it's also clogged with buffer_heads. If the > buffer_head problem was fixed then the inode problem would occur > much later. - more highmem, more/smaller files. This should be easy to see from slabinfo. Note that the problem is worse on a P4 - alignment takes us from 96 to 128 bytes. > I've seen a vague report from the IBM team which indicates that > -aa VM does not solve the ZONE_NORMAL-full-of-buffer_heads lockup. > The workload was specweb on a 16 gig machine, I believe. I haven't been very clear on this, mea culpa. We have two different machines, each with 4 procs & 16Gb (at least). Both are exhausting memory through buffer_heads. 1) Running Oracle apps, doing raw IO. We are running an -aa kernel on this machine, and it doesn't help. I believe that's just because it's raw IO, though. We have some patches we did for TPC-H to fix the raw IO problems with buffer_heads, and we'll be trying those out this week. 2) Running specweb, doing non-raw IO. This is the machine we tried Andrew's patch on and it worked a treat. We haven't tried the -aa fix for this yet, I'll see if I can get that done this week. So, we haven't really given Andrea's patch a fair test yet. If you guys can agree which the better approach is by just discussing it, great. If not, we'll benchmark the hell out of it, and decide things that way ;-) M ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 16:22 ` Martin J. Bligh @ 2002-05-20 19:38 ` Rik van Riel 2002-05-20 20:06 ` William Lee Irwin III 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 2002-05-20 19:38 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, Andrea Arcangeli, Martin Schwidefsky, linux-kernel, Alan Cox On Mon, 20 May 2002, Martin J. Bligh wrote: > 1) Running Oracle apps, doing raw IO. We are running an -aa kernel > on this machine, and it doesn't help. > So, we haven't really given Andrea's patch a fair test yet. If you > guys can agree which the better approach is by just discussing it, One thing that seems to be missing in Linux are proper VM statistics. There is no way handwaving and discussions are going to be better than a measurement of what's going on inside the VM. Treating the different VM patch sets as black boxes with benchmarks will show us which VM works best for which benchmark, but it won't show us why or how to combine the good features of the different VMs... I think good statistics to start would be the traditional page fault rate (pf), page free rate (fr), page scan rate (sr), reclaims (re), pageout (po), pagein (pi) and some variation of iowait stats. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 19:38 ` Rik van Riel @ 2002-05-20 20:06 ` William Lee Irwin III 0 siblings, 0 replies; 29+ messages in thread From: William Lee Irwin III @ 2002-05-20 20:06 UTC (permalink / raw) To: Rik van Riel Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, Martin Schwidefsky, linux-kernel, Alan Cox On Mon, May 20, 2002 at 04:38:15PM -0300, Rik van Riel wrote: > I think good statistics to start would be the traditional > page fault rate (pf), page free rate (fr), page scan rate > (sr), reclaims (re), pageout (po), pagein (pi) and some > variation of iowait stats. This has been on my TODO list for months. Sorry I haven't gotten around to it sooner. I'll follow up with that in a bit. Cheers, Bill ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 4:30 ` Andrea Arcangeli 2002-05-20 5:21 ` Andrew Morton @ 2002-05-20 16:13 ` Martin J. Bligh 2002-05-20 16:37 ` Andrea Arcangeli 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli 2 siblings, 1 reply; 29+ messages in thread From: Martin J. Bligh @ 2002-05-20 16:13 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel > About rmap design I would very much appreciate if Rik could make a > version of his patch that implements rmap on top of current -aa (it > wouldn't be a rewrite, just a porting of the strict rmap feature), > so we can compare apples to apples and benchmark the effect of the > rmap patch, not the rmap + the hybrid, most of the slowdown during > paging is most probably due the hybrid, not because of the rmap design, > the rmap design if something should make things a bit faster during > swapout infact, by being a bit slower in the more important fast paths. > It is definitely possible to put a strict rmap on top of -aa without > the huge "hybrid" thing attached to the rmap code, so without impacting > at all the rest of the vm. It's just a matter of adding the try_to_unmap > in shrink_cache and deleting the swap_out call (it's almost as easy as > shipping a version of Windows without a web browser installed by default). Is it really the rmap patch, or is this Alan's VM as a whole? Could you take a look at http://www.surriel.com/patches/ and see if the rmap 13 patch there is still objectionable to you? I've been benchmarking rmap 13 against mainline (2.4.19-pre7) and with the latest lock breakup changes performance now seems to be about equivalent to mainline (for kernel compile on NUMA-Q). Those changes reduced system time from 650s to 160s. The only reason I haven't published results "officially" yet is that I was sorting out some timer problems with the machine. M. 
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 16:13 ` Martin J. Bligh @ 2002-05-20 16:37 ` Andrea Arcangeli 2002-05-20 17:23 ` Martin J. Bligh 0 siblings, 1 reply; 29+ messages in thread From: Andrea Arcangeli @ 2002-05-20 16:37 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel On Mon, May 20, 2002 at 09:13:53AM -0700, Martin J. Bligh wrote: > Is it really the rmap patch, or is this Alan's VM as a whole? > Could you take a look at http://www.surriel.com/patches/ and > see if the rmap 13 patch there is still objectionable to you? I think it's almost the same code. > I've been benchmarking rmap 13 against mainline (2.4.19-pre7) > and with the latest lock breakup changes performance now seems > to be about equivalent to mainline (for kernel compile on NUMA-Q). > Those changes reduced system time from 650s to 160s. The only How much are you swapping in your workload? (as said, the fast paths are hurt a little, so it's expected that it's almost as fast as mainline with a kernel compile, similar to the fact we also add anon pages to the lru list). I think you're only exercising the fast paths in your workload, not the memory balancing that is the whole point of the change. > reason I haven't published results "officially" yet is that I > was sorting out some timer problems with the machine. > > M. Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 16:37 ` Andrea Arcangeli @ 2002-05-20 17:23 ` Martin J. Bligh 2002-05-20 17:32 ` William Lee Irwin III 0 siblings, 1 reply; 29+ messages in thread From: Martin J. Bligh @ 2002-05-20 17:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel >> I've been benchmarking rmap 13 against mainline (2.4.19-pre7) >> and with the latest lock breakup changes performance now seems >> to be about equivalent to mainline (for kernel compile on NUMA-Q). >> Those changes reduced system time from 650s to 160s. The only > > How much are you swapping in your workload? (as said the fast paths are > hurted a little so it's expected that it's almost as fast as mainline > with a kernel compile, similar to the fact we also add anon pages to the > lru list). I think you're only exercising the fast paths in your > workload, not the memory balancing that is the whole point of the change. No swapping. We fixed the horrendous locking problem we were seeing, but this was only one test - obviously others are needed. But I think we're in agreement that it's time to give it a beating and see what happens ;-) M. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Bug with shared memory. 2002-05-20 17:23 ` Martin J. Bligh @ 2002-05-20 17:32 ` William Lee Irwin III 0 siblings, 0 replies; 29+ messages in thread From: William Lee Irwin III @ 2002-05-20 17:32 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrea Arcangeli, Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel At some point in the past, Andrea Arcangeli wrote: >> How much are you swapping in your workload? (as said the fast paths are >> hurted a little so it's expected that it's almost as fast as mainline >> with a kernel compile, similar to the fact we also add anon pages to the >> lru list). I think you're only exercising the fast paths in your >> workload, not the memory balancing that is the whole point of the change. On Mon, May 20, 2002 at 10:23:05AM -0700, Martin J. Bligh wrote: > No swapping. We fixed the horrendous locking problem we were seeing, > but this was only one test - obviously others are needed. But I think we're > in agreement that it's time to give it a beating and see what happens ;-) There's no mystery or secrecy to the locking work, really just overzealous (which is good wrt. locking changes) QA and a conservative release schedule. Cheers, Bill ^ permalink raw reply [flat|nested] 29+ messages in thread
* inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-20 4:30 ` Andrea Arcangeli 2002-05-20 5:21 ` Andrew Morton 2002-05-20 16:13 ` Martin J. Bligh @ 2002-05-24 7:33 ` Andrea Arcangeli 2002-05-24 7:51 ` William Lee Irwin III ` (3 more replies) 2 siblings, 4 replies; 29+ messages in thread From: Andrea Arcangeli @ 2002-05-24 7:33 UTC (permalink / raw) To: Andrew Morton Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel, kuznet, Andrey Savochkin On Mon, May 20, 2002 at 06:30:40AM +0200, Andrea Arcangeli wrote: > As next thing I'll go ahead on the inode/highmem imbalance repored by > Alexey in the weekend. Then the only pending thing before next -aa is Here it is, you should apply it together with vm-35 that you need too for the bh/highmem balance (or on top of 2.4.19pre8aa3). I tested it slightly on uml and it hasn't broken so far, but be careful because it's not very well tested yet. On the lines of what Alexey suggested originally, if goal isn't reached, in a second pass we shrink the cache too, but only if the cache is the only reason for the "pinning" behaviour of the inode. If for example there are dirty blocks of metadata or of data belonging to the inode we wakeup_bdflush instead and we never shrink the cache in that case. If the inode itself is dirty as well we let the two passes fail so we will schedule the work for keventd. This logic should ensure we never fall into shrinking the cache for no good reason and that we free the cache only for the inodes that we actually go ahead and free. (basically only dirty pages set with SetPageDirty aren't trapped by the logic before calling the invalidate, like ramfs, but that's expected of course, those pages cannot be damaged by the non destructive invalidate anyways) Comments? 
--- 2.4.19pre8aa4/fs/inode.c.~1~ Fri May 24 03:17:10 2002 +++ 2.4.19pre8aa4/fs/inode.c Fri May 24 05:03:54 2002 @@ -672,35 +672,87 @@ void prune_icache(int goal) { LIST_HEAD(list); struct list_head *entry, *freeable = &list; - int count; + int count, pass; struct inode * inode; - spin_lock(&inode_lock); + count = pass = 0; - count = 0; - entry = inode_unused.prev; - while (entry != &inode_unused) - { - struct list_head *tmp = entry; + spin_lock(&inode_lock); + while (goal && pass++ < 2) { + entry = inode_unused.prev; + while (entry != &inode_unused) + { + struct list_head *tmp = entry; - entry = entry->prev; - inode = INODE(tmp); - if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK)) - continue; - if (!CAN_UNUSE(inode)) - continue; - if (atomic_read(&inode->i_count)) - continue; - list_del(tmp); - list_del(&inode->i_hash); - INIT_LIST_HEAD(&inode->i_hash); - list_add(tmp, freeable); - inode->i_state |= I_FREEING; - count++; - if (!--goal) - break; + entry = entry->prev; + inode = INODE(tmp); + if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK)) + continue; + if (atomic_read(&inode->i_count)) + continue; + if (pass == 2 && !inode->i_state && !CAN_UNUSE(inode)) { + if (inode_has_buffers(inode)) + /* + * If the inode has dirty buffers + * pending, start flushing out bdflush.ndirty + * worth of data even if there's no dirty-memory + * pressure. Do nothing else in this + * case, until all dirty buffers are gone + * we can do nothing about the inode other than + * to keep flushing dirty stuff. We could also + * flush only the dirty buffers in the inode + * but there's no API to do it asynchronously + * and this simpler approch to deal with the + * dirty payload shouldn't make much difference + * in practice. Also keep in mind if somebody + * keeps overwriting data in a flood we'd + * never manage to drop the inode anyways, + * and we really shouldn't do that because + * it's an heavily used one. 
+ */ + wakeup_bdflush(); + else if (inode->i_data.nrpages) + /* + * If we're here it means the only reason + * we cannot drop the inode is that its + * due its pagecache so go ahead and trim it + * hard. If it doesn't go away it means + * they're dirty or dirty/pinned pages ala + * ramfs. + * + * invalidate_inode_pages() is a non + * blocking operation but we introduce + * a dependency order between the + * inode_lock and the pagemap_lru_lock, + * the inode_lock must always be taken + * first from now on. + */ + invalidate_inode_pages(inode); + } + if (!CAN_UNUSE(inode)) + continue; + list_del(tmp); + list_del(&inode->i_hash); + INIT_LIST_HEAD(&inode->i_hash); + list_add(tmp, freeable); + inode->i_state |= I_FREEING; + count++; + if (!--goal) + break; + } } inodes_stat.nr_unused -= count; + + /* + * the unused list is hardly an LRU so it makes + * more sense to rotate it so we don't bang + * always on the same inodes in case they're + * unfreeable for whatever reason. + */ + if (entry != &inode_unused) { + list_del(&inode_unused); + list_add(&inode_unused, entry); + } spin_unlock(&inode_lock); dispose_list(freeable); Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli @ 2002-05-24 7:51 ` William Lee Irwin III 2002-05-24 8:04 ` Andrew Morton ` (2 subsequent siblings) 3 siblings, 0 replies; 29+ messages in thread From: William Lee Irwin III @ 2002-05-24 7:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel, kuznet, Andrey Savochkin On Fri, May 24, 2002 at 09:33:41AM +0200, Andrea Arcangeli wrote: > Here it is, you should apply it together with vm-35 that you need too > for the bh/highmem balance (or on top of 2.4.19pre8aa3). I tested it > slightly on uml and it didn't broke so far, so be careful because it's not > very well tested yet. On the lines of what Alexey suggested originally, > if goal isn't reached, in a second pass we shrink the cache too, but > only if the cache is the only reason for the "pinning" beahiour of the > inode. If for example there are dirty blocks of metadata or of data > belonging to the inode we wakeup_bdflush instead and we never shrink the > cache in such case. If the inode itself is dirty as well we let the two > passes fail so we will schedule the work for keventd. This logic should > ensure we never fall into shrinking the cache for no good reason and > that we free the cache only for the inodes that we actually go ahead and > free. (basically only dirty pages set with SetPageDirty aren't trapped > by the logic before calling the invalidate, like ramfs, but that's > expected of course, those pages cannot be damaged by the non destructive > invalidate anyways) > Comments? I haven't had the chance to give this a test run yet, but it looks very promising. 
I have a slight concern about the hold time of the inode_lock because prune_icache() already generates some amount of contention, but what you've presented appears to be necessary to prevent lethal cache bloat, and so that concern is secondary at most. I'll give it a test run tomorrow if no one else on-site gets to it first, though with the proviso that I've not been involved in workloads triggering this specific KVA exhaustion issue, so what testing I can do is limited. Thanks, Bill ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli 2002-05-24 7:51 ` William Lee Irwin III @ 2002-05-24 8:04 ` Andrew Morton 2002-05-24 15:20 ` Andrea Arcangeli 2002-05-24 11:47 ` Ed Tomlinson 2002-05-30 11:25 ` Denis Lunev 3 siblings, 1 reply; 29+ messages in thread From: Andrew Morton @ 2002-05-24 8:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel, kuznet, Andrey Savochkin Andrea Arcangeli wrote: > > On Mon, May 20, 2002 at 06:30:40AM +0200, Andrea Arcangeli wrote: > > As next thing I'll go ahead on the inode/highmem imbalance repored by > > Alexey in the weekend. Then the only pending thing before next -aa is > > Here it is, you should apply it together with vm-35 that you need too > for the bh/highmem balance (or on top of 2.4.19pre8aa3). Looks OK to me. But I wonder if it should be inside some config option - I don't think machines with saner memory architectures would want this? > ... > + * in practice. Also keep in mind if somebody > + * keeps overwriting data in a flood we'd > + * never manage to drop the inode anyways, > + * and we really shouldn't do that because > + * it's an heavily used one. > + */ Can anyone actually write to an inode which is on the unused list? > + wakeup_bdflush(); > + else if (inode->i_data.nrpages) > + /* > + * If we're here it means the only reason > + * we cannot drop the inode is that its > + * due its pagecache so go ahead and trim it > + * hard. If it doesn't go away it means > + * they're dirty or dirty/pinned pages ala > + * ramfs. > + * > + * invalidate_inode_pages() is a non > + * blocking operation but we introduce > + * a dependency order between the > + * inode_lock and the pagemap_lru_lock, > + * the inode_lock must always be taken > + * first from now on. 
> + */ > + invalidate_inode_pages(inode); It seems that a call to try_to_free_buffers() has snuck into invalidate_inode_pages(). That means that clean ext3 pages which are on the checkpoint list won't be released. Could you please change that to try_to_release_page()? - ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-24 8:04 ` Andrew Morton @ 2002-05-24 15:20 ` Andrea Arcangeli 0 siblings, 0 replies; 29+ messages in thread From: Andrea Arcangeli @ 2002-05-24 15:20 UTC (permalink / raw) To: Andrew Morton Cc: Martin Schwidefsky, linux-kernel, Alan Cox, Rik van Riel, kuznet, Andrey Savochkin On Fri, May 24, 2002 at 01:04:35AM -0700, Andrew Morton wrote: > Andrea Arcangeli wrote: > > > > On Mon, May 20, 2002 at 06:30:40AM +0200, Andrea Arcangeli wrote: > > > As next thing I'll go ahead on the inode/highmem imbalance repored by > > > Alexey in the weekend. Then the only pending thing before next -aa is > > > > Here it is, you should apply it together with vm-35 that you need too > > for the bh/highmem balance (or on top of 2.4.19pre8aa3). > > Looks OK to me. But I wonder if it should be inside some config > option - I don't think machines with saner memory architectures > would want this? This assumes that it hurts lowmem machines, and hopefully that's not the case. The vm bangs on the icache only when there's some remote sign of pagecache shortage and the whole icache is totally pinned by clean pagecache; in that case it is probably ok to do an invalidate to release it. We were doing it in 2.2 too of course (in a different, simpler manner with truncate_inode_pages, because in 2.2 there was no problem in being destructive against the pagecache: the dirty cache was always on the buffercache side in 2.2). > > > ... > > + * in practice. Also keep in mind if somebody > > + * keeps overwriting data in a flood we'd > > + * never manage to drop the inode anyways, > > + * and we really shouldn't do that because > > + * it's an heavily used one. > > + */ > > Can anyone actually write to an inode which is on the unused > list? 
no you can't (if it's in the unused list, not even a single dentry is keeping it pinned, which means nobody can have it open; it's not even cached in unused dentries), but the inode can have async dirty buffers or dirty pages (if also the inode is marked dirty, not only the pages, unless it's ramfs), so it may be constantly unfreeable if bdflush cannot keep up with the writer. It doesn't look like a practical problem; in that case the inode is very hot and it isn't worth shrinking. > > > + wakeup_bdflush(); > > + else if (inode->i_data.nrpages) > > + /* > > + * If we're here it means the only reason > > + * we cannot drop the inode is that its > > + * due its pagecache so go ahead and trim it > > + * hard. If it doesn't go away it means > > + * they're dirty or dirty/pinned pages ala > > + * ramfs. > > + * > > + * invalidate_inode_pages() is a non > > + * blocking operation but we introduce > > + * a dependency order between the > > + * inode_lock and the pagemap_lru_lock, > > + * the inode_lock must always be taken > > + * first from now on. > > + */ > > + invalidate_inode_pages(inode); > > It seems that a call to try_to_free_buffers() has snuck into > invalidate_inode_pages(). That means that clean ext3 pages > which are on the checkpoint list won't be released. Could you > please change that to try_to_release_page()? Indeed. thanks! Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli 2002-05-24 7:51 ` William Lee Irwin III 2002-05-24 8:04 ` Andrew Morton @ 2002-05-24 11:47 ` Ed Tomlinson 2002-05-30 11:25 ` Denis Lunev 3 siblings, 0 replies; 29+ messages in thread From: Ed Tomlinson @ 2002-05-24 11:47 UTC (permalink / raw) To: linux-kernel I Wonder if the changes I posted putting slab pages in the lru would help? For now the patch is against rmap (pre8-ac5). I would be very interested in hearing about results testing it. See subject: Re: [RFC][PATCH] using page aging to shrink caches (pre8-ac5) Comments? Ed Tomlinson Andrea Arcangeli wrote: > On Mon, May 20, 2002 at 06:30:40AM +0200, Andrea Arcangeli wrote: >> As next thing I'll go ahead on the inode/highmem imbalance repored by >> Alexey in the weekend. Then the only pending thing before next -aa is > > Here it is, you should apply it together with vm-35 that you need too > for the bh/highmem balance (or on top of 2.4.19pre8aa3). I tested it > slightly on uml and it didn't broke so far, so be careful because it's not > very well tested yet. On the lines of what Alexey suggested originally, > if goal isn't reached, in a second pass we shrink the cache too, but > only if the cache is the only reason for the "pinning" beahiour of the > inode. If for example there are dirty blocks of metadata or of data > belonging to the inode we wakeup_bdflush instead and we never shrink the > cache in such case. If the inode itself is dirty as well we let the two > passes fail so we will schedule the work for keventd. This logic should > ensure we never fall into shrinking the cache for no good reason and > that we free the cache only for the inodes that we actually go ahead and > free. 
(basically only dirty pages set with SetPageDirty aren't trapped > by the logic before calling the invalidate, like ramfs, but that's > expected of course, those pages cannot be damaged by the non destructive > invalidate anyways) > > Comments? > > --- 2.4.19pre8aa4/fs/inode.c.~1~ Fri May 24 03:17:10 2002 > +++ 2.4.19pre8aa4/fs/inode.c Fri May 24 05:03:54 2002 > @@ -672,35 +672,87 @@ void prune_icache(int goal) > { > LIST_HEAD(list); > struct list_head *entry, *freeable = &list; > - int count; > + int count, pass; > struct inode * inode; > > - spin_lock(&inode_lock); > + count = pass = 0; > > - count = 0; > - entry = inode_unused.prev; > - while (entry != &inode_unused) > - { > - struct list_head *tmp = entry; > + spin_lock(&inode_lock); > + while (goal && pass++ < 2) { > + entry = inode_unused.prev; > + while (entry != &inode_unused) > + { > + struct list_head *tmp = entry; > > - entry = entry->prev; > - inode = INODE(tmp); > - if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK)) > - continue; > - if (!CAN_UNUSE(inode)) > - continue; > - if (atomic_read(&inode->i_count)) > - continue; > - list_del(tmp); > - list_del(&inode->i_hash); > - INIT_LIST_HEAD(&inode->i_hash); > - list_add(tmp, freeable); > - inode->i_state |= I_FREEING; > - count++; > - if (!--goal) > - break; > + entry = entry->prev; > + inode = INODE(tmp); > + if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK)) > + continue; > + if (atomic_read(&inode->i_count)) > + continue; > + if (pass == 2 && !inode->i_state && !CAN_UNUSE(inode)) { > + if (inode_has_buffers(inode)) > + /* > + * If the inode has dirty buffers > + * pending, start flushing out bdflush.ndirty > + * worth of data even if there's no dirty-memory > + * pressure. Do nothing else in this > + * case, until all dirty buffers are gone > + * we can do nothing about the inode other than > + * to keep flushing dirty stuff. 
We could also > + * flush only the dirty buffers in the inode > + * but there's no API to do it asynchronously > + * and this simpler approch to deal with the > + * dirty payload shouldn't make much difference > + * in practice. Also keep in mind if somebody > + * keeps overwriting data in a flood we'd > + * never manage to drop the inode anyways, > + * and we really shouldn't do that because > + * it's an heavily used one. > + */ > + wakeup_bdflush(); > + else if (inode->i_data.nrpages) > + /* > + * If we're here it means the only reason > + * we cannot drop the inode is that its > + * due its pagecache so go ahead and trim it > + * hard. If it doesn't go away it means > + * they're dirty or dirty/pinned pages ala > + * ramfs. > + * > + * invalidate_inode_pages() is a non > + * blocking operation but we introduce > + * a dependency order between the > + * inode_lock and the pagemap_lru_lock, > + * the inode_lock must always be taken > + * first from now on. > + */ > + invalidate_inode_pages(inode); > + } > + if (!CAN_UNUSE(inode)) > + continue; > + list_del(tmp); > + list_del(&inode->i_hash); > + INIT_LIST_HEAD(&inode->i_hash); > + list_add(tmp, freeable); > + inode->i_state |= I_FREEING; > + count++; > + if (!--goal) > + break; > + } > } > inodes_stat.nr_unused -= count; > + > + /* > + * the unused list is hardly an LRU so it makes > + * more sense to rotate it so we don't bang > + * always on the same inodes in case they're > + * unfreeable for whatever reason. > + */ > + if (entry != &inode_unused) { > + list_del(&inode_unused); > + list_add(&inode_unused, entry); > + } > spin_unlock(&inode_lock); > > dispose_list(freeable); > > Andrea > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 29+ messages in thread
* inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli ` (2 preceding siblings ...) 2002-05-24 11:47 ` Ed Tomlinson @ 2002-05-30 11:25 ` Denis Lunev 2002-05-30 17:59 ` Andrea Arcangeli 3 siblings, 1 reply; 29+ messages in thread From: Denis Lunev @ 2002-05-30 11:25 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel [-- Attachment #1: message body text --] [-- Type: text/plain, Size: 146 bytes --] Hello! The patch itself cures my problems, but only after a small fix for an uninitialized variable that resulted in an OOPS. Regards, Denis V. Lunev [-- Attachment #2: diff-andrea-inodes2 --] [-- Type: text, Size: 246 bytes --] --- linux/fs/inode.c.old Wed May 29 20:16:17 2002 +++ linux/fs/inode.c Wed May 29 20:17:08 2002 @@ -669,6 +669,7 @@ struct inode * inode; count = pass = 0; + entry = &inode_unused; spin_lock(&inode_lock); while (goal && pass++ < 2) { [-- Attachment #3: message body text --] [-- Type: text/plain, Size: 1400 bytes --] Andrea Arcangeli writes: > On Mon, May 20, 2002 at 06:30:40AM +0200, Andrea Arcangeli wrote: > > As next thing I'll go ahead on the inode/highmem imbalance repored by > > Alexey in the weekend. Then the only pending thing before next -aa is > > Here it is, you should apply it together with vm-35 that you need too > for the bh/highmem balance (or on top of 2.4.19pre8aa3). I tested it > slightly on uml and it didn't broke so far, so be careful because it's not > very well tested yet. On the lines of what Alexey suggested originally, > if goal isn't reached, in a second pass we shrink the cache too, but > only if the cache is the only reason for the "pinning" beahiour of the > inode. If for example there are dirty blocks of metadata or of data > belonging to the inode we wakeup_bdflush instead and we never shrink the > cache in such case. 
If the inode itself is dirty as well we let the two > passes fail so we will schedule the work for keventd. This logic should > ensure we never fall into shrinking the cache for no good reason and > that we free the cache only for the inodes that we actually go ahead and > free. (Basically only dirty pages set with SetPageDirty aren't trapped > by the logic before calling the invalidate, like ramfs, but that's > expected of course; those pages cannot be damaged by the non-destructive > invalidate anyway.) > > Comments? ^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode highmem imbalance fix [Re: Bug with shared memory.] 2002-05-30 11:25 ` Denis Lunev @ 2002-05-30 17:59 ` Andrea Arcangeli 0 siblings, 0 replies; 29+ messages in thread From: Andrea Arcangeli @ 2002-05-30 17:59 UTC (permalink / raw) To: Denis Lunev; +Cc: linux-kernel, Andrey Nekrasov, rwhron, Yann Dupont On Thu, May 30, 2002 at 03:25:04PM +0400, Denis Lunev wrote: Content-Description: message body text > Hello! > > The patch itself cures my problems, but only after a small fix for an > uninitialized variable that resulted in an OOPS. > > Regards, > Denis V. Lunev > Content-Description: diff-andrea-inodes2

> --- linux/fs/inode.c.old	Wed May 29 20:16:17 2002
> +++ linux/fs/inode.c	Wed May 29 20:17:08 2002
> @@ -669,6 +669,7 @@
>  	struct inode * inode;
> 
>  	count = pass = 0;
> +	entry = &inode_unused;
> 
>  	spin_lock(&inode_lock);
>  	while (goal && pass++ < 2) {

Great spotting! This fix is certainly correct; without it the unused list will be corrupted if prune_icache gets a goal == 0 as parameter. OTOH it fixes that case only (it triggers only when the number of unused inodes is <= vm_vfs_scan_ratio, not an extremely common case), so I wonder if that's enough to cure all the oopses I received today; probably it is, since the number of unused inodes differs from the number of inodes allocated. At first glance I don't see other issues (my error was to assume goal would always be something significant). BTW, it's great that for the first showstopper bug in a long time I got such high-quality feedback within a few hours, thank you very much! :) Could you test if the one-liner below from Denis (I attached it below too without quotes) fixes all your problems with 2.4.19pre9aa1 or with the single inode highmem imbalance fix? 
thanks,

--- linux/fs/inode.c.old	Wed May 29 20:16:17 2002
+++ linux/fs/inode.c	Wed May 29 20:17:08 2002
@@ -669,6 +669,7 @@
 	struct inode * inode;
 
 	count = pass = 0;
+	entry = &inode_unused;
 
 	spin_lock(&inode_lock);
 	while (goal && pass++ < 2) {

Also it seems the O(1) scheduler is doing well so far. In next -aa I will also include the patch from Mike Kravetz that I finished auditing and that is really straightforward: it serializes the execution of the reader of the pipe with the writer if the writer exceeds the length of the pipe buffer. That will maximize pipe bandwidth, similar to the pre-O(1) levels, and the tasks still run in parallel on two CPUs if no blocking from the writer is necessary; I think it's the best heuristic. Adding the sync behaviour on the reader side as well seems inferior: I can imagine a writer running full time on one CPU and sometimes posting a few bytes to the pipe, while the reader always blocks. This way the reader will keep running on its own CPU, and it won't interfere with the "CPU-intensive" writer. Andrea ^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2002-05-30 18:01 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-05-14 15:13 Bug with shared memory Martin Schwidefsky 2002-05-14 19:33 ` Andrew Morton 2002-05-15 22:42 ` Mike Kravetz 2002-05-15 23:07 ` Andrew Morton 2002-05-17 17:53 ` Bill Davidsen 2002-05-17 20:07 ` Mike Kravetz 2002-05-17 20:29 ` Anton Blanchard 2002-05-20 4:30 ` Andrea Arcangeli 2002-05-20 5:21 ` Andrew Morton 2002-05-20 11:34 ` Andrey Savochkin 2002-05-20 14:15 ` Andrea Arcangeli 2002-05-20 19:24 ` Rik van Riel 2002-05-20 23:46 ` Andrea Arcangeli 2002-05-21 0:14 ` Martin J. Bligh 2002-05-21 1:40 ` Andrea Arcangeli 2002-05-20 16:22 ` Martin J. Bligh 2002-05-20 19:38 ` Rik van Riel 2002-05-20 20:06 ` William Lee Irwin III 2002-05-20 16:13 ` Martin J. Bligh 2002-05-20 16:37 ` Andrea Arcangeli 2002-05-20 17:23 ` Martin J. Bligh 2002-05-20 17:32 ` William Lee Irwin III 2002-05-24 7:33 ` inode highmem imbalance fix [Re: Bug with shared memory.] Andrea Arcangeli 2002-05-24 7:51 ` William Lee Irwin III 2002-05-24 8:04 ` Andrew Morton 2002-05-24 15:20 ` Andrea Arcangeli 2002-05-24 11:47 ` Ed Tomlinson 2002-05-30 11:25 ` Denis Lunev 2002-05-30 17:59 ` Andrea Arcangeli