* invalidate_inode_pages in 2.5.32/3 @ 2002-09-05 14:25 Chuck Lever 2002-09-05 18:27 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Chuck Lever @ 2002-09-05 14:25 UTC (permalink / raw) To: Linux Kernel Mailing List hi all- it appears that changes in or around invalidate_inode_pages that went into 2.5.32/3 have broken certain cache behaviors that the NFS client depends on. when the NFS client calls invalidate_inode_pages to purge directory data from the page cache, sometimes it works as before, and sometimes it doesn't, leaving stale pages in the page cache. i don't know much about the MM, but i can reliably reproduce the problem. what more information can i provide? please copy me, as i'm not a member of the linux-kernel mailing list. - Chuck Lever -- corporate: <cel at netapp dot com> personal: <chucklever at bigfoot dot com> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 14:25 invalidate_inode_pages in 2.5.32/3 Chuck Lever @ 2002-09-05 18:27 ` Andrew Morton 2002-09-05 18:53 ` Chuck Lever 2002-09-07 8:01 ` Daniel Phillips 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-05 18:27 UTC (permalink / raw) To: Chuck Lever; +Cc: Linux Kernel Mailing List Chuck Lever wrote: > > hi all- > > it appears that changes in or around invalidate_inode_pages that went into > 2.5.32/3 have broken certain cache behaviors that the NFS client depends > on. when the NFS client calls invalidate_inode_pages to purge directory > data from the page cache, sometimes it works as before, and sometimes it > doesn't, leaving stale pages in the page cache. > > i don't know much about the MM, but i can reliably reproduce the problem. > what more information can i provide? please copy me, as i'm not a member > of the linux-kernel mailing list. The locking became finer-grained. Up to 2.5.31 or thereabouts, invalidate_inode_pages was grabbing the global pagemap_lru_lock and walking all the inodes pages inside that lock. This basically shuts down the entire VM for the whole operation. In 2.5.33, that lock is per-zone, and invalidate takes it on a much more fine-grained basis. That all assumes SMP/preempt. If you're seeing these problems on uniproc/non-preempt then something fishy may be happening. But be aware that invalidate_inode_pages has always been best-effort. If someone is reading, or writing one of those pages then it certainly will not be removed. If you need assurances that the pagecache has been taken down then we'll need something stronger in there. ^ permalink raw reply [flat|nested] 77+ messages in thread
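The best-effort semantics Andrew describes can be sketched as a small user-space model. This is purely illustrative — the struct and function names are simplified stand-ins, not the kernel's actual code — but it shows why a locked page or an elevated reference count leaves a page behind in the cache:

```c
#include <assert.h>

/* Toy model (not kernel code): invalidate walks an inode's pages and
 * drops only those that nobody else holds locked or referenced. */

struct page {
    int count;      /* reference count; 1 == only the pagecache's ref */
    int locked;     /* nonzero while the page is under I/O */
    int present;    /* nonzero while attached to the inode's mapping */
};

/* Returns the number of pages actually removed.  Pages that are locked
 * or have an elevated refcount are skipped: the "best-effort" behaviour
 * described above. */
int invalidate_inode_pages_model(struct page *pages, int n)
{
    int removed = 0;
    for (int i = 0; i < n; i++) {
        struct page *p = &pages[i];
        if (!p->present)
            continue;
        if (p->locked || p->count != 1)
            continue;   /* page is in use: best-effort means skip it */
        p->present = 0; /* detach from the pagecache */
        removed++;
    }
    return removed;
}
```

A page with count == 2, like the ones Chuck later reports, falls through the skip branch and stays cached — exactly the stale-data symptom.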
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 18:27 ` Andrew Morton @ 2002-09-05 18:53 ` Chuck Lever 2002-09-05 19:17 ` Andrew Morton 2002-09-06 9:35 ` Helge Hafting 2002-09-07 8:01 ` Daniel Phillips 1 sibling, 2 replies; 77+ messages in thread From: Chuck Lever @ 2002-09-05 18:53 UTC (permalink / raw) To: Andrew Morton; +Cc: Linux Kernel Mailing List On Thu, 5 Sep 2002, Andrew Morton wrote: > That all assumes SMP/preempt. If you're seeing these problems on > uniproc/non-preempt then something fishy may be happening. sorry, forgot to mention: the system is UP, non-preemptible, high mem. invalidate_inode_pages isn't freeing these pages because the page count is two. perhaps the page count semantics of one of the page cache helper functions has changed slightly. i'm still diagnosing. fortunately the problem is deterministically reproducible. basic test6, the readdir test, of the 2002 connectathon test suite, fails -- either a duplicate file entry or a missing file entry appears after some standard file creation and removal processing in that directory. the incorrect entries occur because the NFS client zaps the directory's page cache to force the next reader to re-read the directory from the server. but invalidate_inode_pages decides to leave the pages in the cache, so the next reader gets stale cached data instead. > But be aware that invalidate_inode_pages has always been best-effort. > If someone is reading, or writing one of those pages then it > certainly will not be removed. If you need assurances that the > pagecache has been taken down then we'll need something stronger > in there. right, i've always wondered why the NFS client doesn't use truncate_inode_pages, or something like it, instead. that can wait for another day, though. :-) - Chuck Lever -- corporate: <cel at netapp dot com> personal: <chucklever at bigfoot dot com> ^ permalink raw reply [flat|nested] 77+ messages in thread

* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 18:53 ` Chuck Lever @ 2002-09-05 19:17 ` Andrew Morton 2002-09-05 20:00 ` Trond Myklebust 2002-09-06 9:35 ` Helge Hafting 1 sibling, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-05 19:17 UTC (permalink / raw) To: Chuck Lever; +Cc: Linux Kernel Mailing List Chuck Lever wrote: > > On Thu, 5 Sep 2002, Andrew Morton wrote: > > > That all assumes SMP/preempt. If you're seeing these problems on > > uniproc/non-preempt then something fishy may be happening. > > sorry, forgot to mention: the system is UP, non-preemptible, high mem. > > invalidate_inode_pages isn't freeing these pages because the page count is > two. perhaps the page count semantics of one of the page cache helper > functions has changed slightly. i'm still diagnosing. OK, thanks. I can't immediately think of anything which would have altered the refcounting in there except for page reclaim. What you could do is to check whether the `page_count(page) != 1' pages are on the LRU. If they have !PageLRU(page) then the extra ref was taken by shrink_cache(). But that would be pretty rare, especially on UP. You may have more success using the stronger invalidate_inode_pages2(). ^ permalink raw reply [flat|nested] 77+ messages in thread
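The diagnostic Andrew suggests — using the LRU state to tell a reclaim-held page from one pinned elsewhere — might look like this toy model. All names here are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>

/* Toy model: when invalidate finds page_count != 1, check whether the
 * page is still on the LRU.  Off the LRU with an elevated count means
 * the extra reference was plausibly taken by shrink_cache(); still on
 * the LRU means something else is pinning it. */

struct page {
    int count;    /* reference count; 1 == only the pagecache's ref */
    int on_lru;   /* models PageLRU(page) */
};

enum ref_source { REF_NONE, REF_RECLAIM, REF_OTHER };

enum ref_source classify_extra_ref(const struct page *p)
{
    if (p->count == 1)
        return REF_NONE;      /* only the pagecache ref: freeable */
    if (!p->on_lru)
        return REF_RECLAIM;   /* reclaim took it off the LRU */
    return REF_OTHER;         /* pinned by something else entirely */
}
```

On Chuck's UP box the REF_RECLAIM case should be rare, which is why the REF_OTHER outcome would point at a different culprit.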
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 19:17 ` Andrew Morton @ 2002-09-05 20:00 ` Trond Myklebust 2002-09-05 20:15 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-05 20:00 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > You may have more success using the stronger > invalidate_inode_pages2(). Shouldn't make any difference. Chuck is seeing this on readdir() pages which, of course, don't suffer from problems of dirtiness etc on NFS. I've noticed that the code that used to clear page->flags when we added a page to the page_cache has disappeared. Is it possible that pages are being re-added with screwy values for page->flags? Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 20:00 ` Trond Myklebust @ 2002-09-05 20:15 ` Andrew Morton 2002-09-05 20:27 ` Trond Myklebust 2002-09-05 21:41 ` Chuck Lever 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-05 20:15 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > > > You may have more success using the stronger > > invalidate_inode_pages2(). > > Shouldn't make any difference. Chuck is seeing this on readdir() pages > which, of course, don't suffer from problems of dirtiness etc on NFS. Well the VM will take a ref on the page during reclaim even if it is clean. With what sort of frequency does this happen? If it's easily reproducible then dunno. If it's once-an-hour then it may be page reclaim, conceivably. The PageLRU debug test in there will tell us. > I've noticed that the code that used to clear page->flags when we > added a page to the page_cache has disappeared. Is it possible that > pages are being re-added with screwy values for page->flags? It's possible - those flags were getting set all over the place, and for add_to_swap(), the nonatomic rmw was an outright bug. The intent now is to set the initial page state in prep_new_page() and to then modify it atomically from there on in ways which make sense, rather than "because that's what the code used to do". Something may have got screwed up in there. Suggest you print out page->flags from in there and we can take a look. It's a bit worrisome if NFS is dependent upon successful pagecache takedown in invalidate_inode_pages. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 20:15 ` Andrew Morton @ 2002-09-05 20:27 ` Trond Myklebust 2002-09-05 20:37 ` Andrew Morton 2002-09-05 21:41 ` Chuck Lever 1 sibling, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-05 20:27 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > It's a bit worrisome if NFS is dependent upon successful > pagecache takedown in invalidate_inode_pages. Why? We don't use all that buffer crap, so in principle invalidate_inode_page() is only supposed to fail for us if - page is locked (i.e. new read in progress or something like that) - page is refcounted (by something like mmap()). neither of which should be the case here. Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 20:27 ` Trond Myklebust @ 2002-09-05 20:37 ` Andrew Morton 2002-09-05 20:51 ` Trond Myklebust 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-05 20:37 UTC (permalink / raw) To: trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > > > It's a bit worrisome if NFS is dependent upon successful > > pagecache takedown in invalidate_inode_pages. > > Why? We don't use all that buffer crap, so in principle > invalidate_inode_page() is only supposed to fail for us if > > - page is locked (i.e. new read in progress or something like that) > - page is refcounted (by something like mmap()). - someone took a temporary ref on the page. Possibly it is the deferred LRU addition code. Try running lru_add_drain() before invalidate_inode_pages(). > neither of which should be the case here. Probably, it worked OK with the global locking because nobody was taking a temp ref against those pages. Please tell me exactly what semantics NFS needs in there. Does truncate_inode_pages() do the wrong thing? ^ permalink raw reply [flat|nested] 77+ messages in thread
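The lru_add_drain() idea can be modelled in user space: deferred LRU addition keeps recently-added pages in a per-CPU pagevec that holds a temporary reference, so a best-effort invalidate sees an elevated count and skips them until the pagevec is drained. The types below are simplified stand-ins, not the kernel's:

```c
#include <assert.h>

struct page { int count; int present; };

/* Models the deferred-LRU pagevec: each queued page carries an extra
 * reference until the pagevec is drained onto the LRU proper. */
struct pagevec {
    struct page *pages[4];
    int nr;
};

void pagevec_add(struct pagevec *pv, struct page *p)
{
    p->count++;                  /* the pagevec pins the page */
    pv->pages[pv->nr++] = p;
}

void lru_add_drain_model(struct pagevec *pv)
{
    for (int i = 0; i < pv->nr; i++)
        pv->pages[i]->count--;   /* release the temporary pin */
    pv->nr = 0;
}

/* Best-effort invalidate of a single page, as before. */
int try_invalidate(struct page *p)
{
    if (p->count != 1)
        return 0;                /* pinned: leave it in the cache */
    p->present = 0;
    return 1;
}
```

Draining first converts a spurious failure into a successful takedown, which is the point of running lru_add_drain() before invalidate_inode_pages().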
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 20:37 ` Andrew Morton @ 2002-09-05 20:51 ` Trond Myklebust 2002-09-05 21:12 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-05 20:51 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > Probably, it worked OK with the global locking because nobody > was taking a temp ref against those pages. But we still have global locking in that readdir by means of the BKL. There should be no nasty races... > Please tell me exactly what semantics NFS needs in there. Does > truncate_inode_pages() do the wrong thing? Definitely. In most cases we *cannot* wait on I/O completion because nfs_zap_caches() can be called directly from the 'rpciod' process by means of a callback. Since 'rpciod' is responsible for completing all asynchronous I/O, then truncate_inode_pages() would deadlock. As I said: I don't believe the problem here has anything to do with invalidate_inode_pages vs. truncate_inode_pages: - Pages should only be locked if they are actually being read from the server. - They should only be refcounted and/or cleared while the BKL is held... There is no reason why code which worked fine under 2.2.x and 2.4.x shouldn't work under 2.5.x. Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 20:51 ` Trond Myklebust @ 2002-09-05 21:12 ` Andrew Morton 2002-09-05 21:31 ` Trond Myklebust 2002-09-05 22:01 ` Chuck Lever 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-05 21:12 UTC (permalink / raw) To: trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > > > Probably, it worked OK with the global locking because nobody > > was taking a temp ref against those pages. > > But we still have global locking in that readdir by means of the > BKL. There should be no nasty races... > > > Please tell me exactly what semantics NFS needs in there. Does > > truncate_inode_pages() do the wrong thing? > > Definitely. In most cases we *cannot* wait on I/O completion because > nfs_zap_caches() can be called directly from the 'rpciod' process by > means of a callback. Since 'rpciod' is responsible for completing all > asynchronous I/O, then truncate_inode_pages() would deadlock. But if there are such pages, invalidate_inode_pages() would not have removed them anyway? > As I said: I don't believe the problem here has anything to do with > invalidate_inode_pages vs. truncate_inode_pages: > - Pages should only be locked if they are actually being read from > the server. > - They should only be refcounted and/or cleared while the BKL is > held... > There is no reason why code which worked fine under 2.2.x and 2.4.x > shouldn't work under 2.5.x. Trond, there are very good reasons why it broke. Those pages are visible to the whole world via global data structures - both the page LRUs and via the superblocks->inodes walk. Those things exist for legitimate purposes, and it is legitimate for async threads of control to take a reference on those pages while playing with them. It just "happened to work" in earlier kernels. I suspect we can just remove the page_count() test from invalidate and that will fix everything up. 
That will give stronger invalidate and anything which doesn't like that is probably buggy wrt truncate anyway. Could you test that? ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 21:12 ` Andrew Morton @ 2002-09-05 21:31 ` Trond Myklebust 2002-09-05 22:19 ` Andrew Morton 2002-09-05 22:01 ` Chuck Lever 1 sibling, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-05 21:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: >> 'rpciod' process by means of a callback. Since 'rpciod' is >> responsible for completing all asynchronous I/O, then >> truncate_inode_pages() would deadlock. > But if there are such pages, invalidate_inode_pages() would not > have removed them anyway? It should not have to. If the page is locked because it is being paged in, then we're guaranteed that the data is fresh from the server. If it is locked because we are writing, then ditto. > Trond, there are very good reasons why it broke. Those pages > are visible to the whole world via global data structures - > both the page LRUs and via the superblocks->inodes walk. Those > things exist for legitimate purposes, and it is legitimate for > async threads of control to take a reference on those pages > while playing with them. I don't buy that explanation w.r.t. readdir: those pages should always be clean. The global data structures might walk them in order to throw them out, but that should be a bonus here. In any case it seems a bit far fetched that they should be pinning *all* the readdir pages... > I suspect we can just remove the page_count() test from > invalidate and that will fix everything up. That will give > stronger invalidate and anything which doesn't like that is > probably buggy wrt truncate anyway. I never liked the page_count() test, but Linus overruled me because of the concern for consistency w.r.t. shared mmap(). Is the latter case resolved now? I notice for instance that you've added mapping->releasepage(). What should we be doing for that? 
Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 21:31 ` Trond Myklebust @ 2002-09-05 22:19 ` Andrew Morton 2002-09-06 0:48 ` Trond Myklebust 2002-09-07 8:24 ` Daniel Phillips 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-05 22:19 UTC (permalink / raw) To: trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List Trond Myklebust wrote: > > ... > > Trond, there are very good reasons why it broke. Those pages > > are visible to the whole world via global data structures - > > both the page LRUs and via the superblocks->inodes walk. Those > > things exist for legitimate purposes, and it is legitimate for > > async threads of control to take a reference on those pages > > while playing with them. > > I don't buy that explanation w.r.t. readdir: those pages should always > be clean. The global data structures might walk them in order to throw > them out, but that should be a bonus here. It's not related to cleanliness - the VM evicts clean pages and to do that it needs to manipulate them. To manipulate them the VM needs to pin them down: via spinlocks, via PageLocked or via page_count(). And a 2.4 kernel right now will set PG_locked while looking at a clean NFS directory page at the tail of the LRU. A concurrent invalidate_inode_page will not invalidate that page, so there is exposure in 2.4 SMP. Except everything is serialised under the LRU lock in 2.4... > In any case it seems a bit far fetched that they should be pinning > *all* the readdir pages... > > > I suspect we can just remove the page_count() test from > > invalidate and that will fix everything up. That will give > > stronger invalidate and anything which doesn't like that is > > probably buggy wrt truncate anyway. > > I never liked the page_count() test, but Linus overruled me because of > the concern for consistency w.r.t. shared mmap(). Is the latter case > resolved now? Oh, I see. I'm not sure what semantics we really want for this. 
If we were to "invalidate" a mapped page then it would become anonymous, which makes some sense. But if we want to keep the current "don't detach mapped pages from pagecache" semantics then we should test page->pte.direct rather than page_count(page). Making that 100% reliable would be tricky because of the need for locking wrt concurrent page faults. This only really matters (I think) for invalidate_inode_pages2(), where we take down pagecache after O_DIRECT IO. In that case I suspect we _should_ anonymise mapped cache, because we've gone and changed the data on-disk. > I notice for instance that you've added mapping->releasepage(). > What should we be doing for that? page->private is "anonymous data which only the address_space knows about". If the mapping has some data at page->private it must set PG_private inside lock_page and increment page->count. If the VM wants to reclaim a page, and it has PG_private set then the vm will run mapping->releasepage() against the page. The mapping's releasepage must try to clear away whatever is held at ->private. If that was successful then releasepage() must clear PG_private, decrement page->count and return non-zero. If the info at ->private is not freeable, releasepage returns zero. ->releasepage() may not sleep in 2.5. So. NFS can put anything it likes at page->private. If you're not doing that then you don't need a releasepage. If you are doing that then you must have a releasepage(). ^ permalink raw reply [flat|nested] 77+ messages in thread
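The ->releasepage() contract spelled out above can be sketched as a toy implementation. This is a user-space model under the assumptions stated in the mail (PG_private is set while ->private holds data, with a matching page reference); the `fs_private` type and its `busy` flag are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

#define PG_PRIVATE 0x1u

/* Hypothetical data a filesystem might hang off page->private. */
struct fs_private { int busy; };

struct page {
    unsigned flags;
    int count;       /* holding data at ->private adds one reference */
    void *private;   /* opaque to the VM; only the mapping knows */
};

/* Follows the contract described above: on success, free ->private,
 * clear PG_private, drop the reference it held, and return nonzero.
 * Return zero if the private data is not freeable.  Nothing here
 * blocks, matching the "may not sleep" rule. */
int example_releasepage(struct page *page)
{
    struct fs_private *priv = page->private;

    if (!(page->flags & PG_PRIVATE))
        return 1;            /* nothing attached: trivially freeable */
    if (priv->busy)
        return 0;            /* info at ->private is not freeable */
    free(priv);
    page->private = NULL;
    page->flags &= ~PG_PRIVATE;
    page->count--;
    return 1;
}
```

The VM only interprets the return value; everything about what lives at ->private stays inside the filesystem, which is why NFS needs a releasepage() only if it actually stores something there.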
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 22:19 ` Andrew Morton @ 2002-09-06 0:48 ` Trond Myklebust 2002-09-06 1:08 ` Andrew Morton 2002-09-07 8:24 ` Daniel Phillips 1 sibling, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-06 0:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > I'm not sure what semantics we really want for this. If we > were to "invalidate" a mapped page then it would become > anonymous, which makes some sense. > But if we want to keep the current "don't detach mapped pages > from pagecache" semantics then we should test page->pte.direct > rather than page_count(page). Making that 100% reliable would > be tricky because of the need for locking wrt concurrent page > faults. I believe that Linus is very much in favour of this latter approach. We had the 'anonymize page' semantics under 2.2.x, but they were changed in the 2.4.0-pre series. The problem is that NFS can clear your page cache at any moment. Things like out-of-order RPC calls etc. can trigger it. If your knee-jerk response is to anonymize all those pages that are referenced, you might end up with page aliasing (well - in practice not, since we do protect against that but Linus wasn't convinced). > The mapping's releasepage must try to clear away whatever is > held at ->private. If that was successful then releasepage() > must clear PG_private, decrement > page-> count and return non-zero. If the info at ->private is not > freeable, releasepage returns zero. ->releasepage() may not > sleep in > 2.5. Interesting. Our 'private data' in this case would be whether or not there is some pending I/O operation on the page. We don't keep it in page->private, but I assume that's less of a problem ;-) It's unrelated to the topic we're discussing, but having looked at it I was thinking that we might want to use it in order to replace the NFS 'flushd' daemon. 
Currently the latter is needed mainly in order to ensure that readaheads actually do complete in a timely fashion (i.e. before we run out of memory). Since try_to_release_page() is called in shrink_cache(), I was thinking that we might pass that buck on to releasepage() (btw: there's currently a bug w.r.t. that. If I understand you correctly, the releasepage() thing is unrelated to page->buffers, but the call in shrink_cache() is masked by an 'if (page->buffers)'.) Extending that idea, we might also be able to get rid of nfs_try_to_free_pages(), if we also make releasepage() call the necessary routines to flush dirty pages too... Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-06 0:48 ` Trond Myklebust @ 2002-09-06 1:08 ` Andrew Morton 2002-09-06 6:49 ` Trond Myklebust 2002-09-07 8:37 ` Daniel Phillips 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-06 1:08 UTC (permalink / raw) To: trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > > > I'm not sure what semantics we really want for this. If we > > were to "invalidate" a mapped page then it would become > > anonymous, which makes some sense. > > > But if we want to keep the current "don't detach mapped pages > > from pagecache" semantics then we should test page->pte.direct > > rather than page_count(page). Making that 100% reliable would > > be tricky because of the need for locking wrt concurrent page > > faults. > > I believe that Linus is very much in favour of this latter > approach. We had the 'anonymize page' semantics under 2.2.x, but they > were changed in the 2.4.0-pre series. hm. > The problem is that NFS can clear your page cache at any > moment. Things like out-of-order RPC calls etc. can trigger it. If > your knee-jerk response is to anonymize all those pages that are > referenced, you might end up with page aliasing (well - in practice > not, since we do protect against that but Linus wasn't convinced). Oh. I thought this was a "purely theoretical" discussion because it only pertains to directory data. You seem to be saying that NFS is doing this for S_ISREG pagecache also? Again: what do you _want_ to do? Having potentially incorrect pagecache mapped into process memory space is probably not the answer to that ;) Should we be forcibly unmapping the pages from pagetables? That would result in them being faulted in again, and re-read. > > The mapping's releasepage must try to clear away whatever is > > held at ->private. 
If that was successful then releasepage() > > must clear PG_private, decrement > > page-> count and return non-zero. If the info at ->private is not > > freeable, releasepage returns zero. ->releasepage() may not > > sleep in > > 2.5. > > Interesting. Our 'private data' in this case would be whether or not > there is some pending I/O operation on the page. We don't keep it in > page->private, but I assume that's less of a problem ;-) > It's unrelated to the topic we're discussing, but having looked at it > I was thinking that we might want to use it in order to replace the > NFS 'flushd' daemon. Currently the latter is needed mainly in order > to ensure that readaheads actually do complete in a timely fashion > (i.e. before we run out of memory). Since try_to_release_page() is > called in shrink_cache(), I was thinking that we might pass that buck > on to releasepage() That might work. It's a bit of a hassle that ->releasepage() must be nonblocking, but generally it just wants to grab locks, decrement refcounts and free things. > (btw: there's currently a bug w.r.t. that'. If I understand you > correctly, the releasepage() thing is unrelated to page->buffers, but > the call in shrink_cache() is masked by an 'if (page->buffers)) That would be in a 2.4 kernel? In 2.4, page->buffers can only contain buffers. If it contains anything else the kernel will die. > Extending that idea, we might also be able to get rid of > nfs_try_to_free_pages(), if we also make releasepage() call the > necessary routines to flush dirty pages too... ->releasepage() should never succeed against a dirty page. In fact the VM shouldn't even bother calling it - there's a minor efficiency bug there. If your mapping has old, dirty pages then the VM will call your ->vm_writeback to write some of them back. Or it will repeatedly call ->writepage if you don't define ->vm_writeback. That's the place to clean the pages. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-06 1:08 ` Andrew Morton @ 2002-09-06 6:49 ` Trond Myklebust 2002-09-07 8:37 ` Daniel Phillips 1 sibling, 0 replies; 77+ messages in thread From: Trond Myklebust @ 2002-09-06 6:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Chuck Lever, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@zip.com.au> writes: > Oh. I thought this was a "purely theoretical" discussion > because it only pertains to directory data. You seem to be > saying that NFS is doing this for S_ISREG pagecache also? Yes. Chuck's particular example pertains to directory data, but if we're to expand the discussion to what *should* we be doing, then we need to consider the S_ISREG case too. > Again: what do you _want_ to do? Having potentially incorrect > pagecache mapped into process memory space is probably not the > answer to that ;) > Should we be forcibly unmapping the pages from pagetables? > That would result in them being faulted in again, and re-read. If we could do that for unlocked pages, and at the same time flush out any pending writes, then that would be ideal. As long as the page_count() thing is there, we *do* unfortunately have a potential for cache corruption. >> (btw: there's currently a bug w.r.t. that'. If I understand you >> correctly, the releasepage() thing is unrelated to >> page->buffers, but the call in shrink_cache() is masked by an >> 'if (page->buffers)) > That would be in a 2.4 kernel? In 2.4, page->buffers can only > contain buffers. If it contains anything else the kernel will > die. 2.5.33 > If your mapping has old, dirty pages then the VM will call your -> vm_writeback to write some of them back. Or it will repeatedly > call ->writepage if you don't define ->vm_writeback. That's > the place to clean the pages. OK. I'll think about that. I'm off to France on a fortnight's vacation now. 
Should leave me with a bit of free time, so once I finish the VFS credential code, I'll try to find some time to catch up on your recent changes... Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-06 1:08 ` Andrew Morton 2002-09-06 6:49 ` Trond Myklebust @ 2002-09-07 8:37 ` Daniel Phillips 2002-09-07 16:09 ` Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-07 8:37 UTC (permalink / raw) To: Andrew Morton, trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List On Friday 06 September 2002 03:08, Andrew Morton wrote: > Should we be forcibly unmapping the pages from pagetables? That would > result in them being faulted in again, and re-read. There's no conceptual difficulty if it's a truncate. For NFS, maybe the thing to do is hold ->sem long enough to do whatever dark magic NFS needs? > That might work. It's a bit of a hassle that ->releasepage() must > be nonblocking, but generally it just wants to grab locks, decrement > refcounts and free things. Err. I really really wonder why the vfs is not supposed to know about these goings on, to the extent of being able to take care of them itself. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 8:37 ` Daniel Phillips @ 2002-09-07 16:09 ` Andrew Morton 2002-09-07 17:02 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-07 16:09 UTC (permalink / raw) To: Daniel Phillips; +Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List Daniel Phillips wrote: > > > That might work. It's a bit of a hassle that ->releasepage() must > > be nonblocking, but generally it just wants to grab locks, decrement > > refcounts and free things. > > Err. I really really wonder why the vfs is not supposed to know > about these goings on, to the extent of being able to take care of > them itself. ->releasepage() was originally added for ext3, which has additional ordering constraints, and has additional references to the page's blocks. reiserfs has a releasepage in 2.5. XFS has a ->releasepage. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 16:09 ` Andrew Morton @ 2002-09-07 17:02 ` Andrew Morton 0 siblings, 0 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-07 17:02 UTC (permalink / raw) To: Daniel Phillips, trond.myklebust, Chuck Lever, Linux Kernel Mailing List Andrew Morton wrote: > > XFS has a ->releasepage. And it looks like it wants to be able to sleep. I lied about ->releasepage being nonblocking. try_to_free_buffers() is nonblocking, and for a while, ->releasepage had to be nonblocking. But we can relax that now. I'll put the gfp_flags back into page reclaim's call to try_to_release_page(). That might help XFS to play nice with the VM. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-05 22:19 ` Andrew Morton 2002-09-06 0:48 ` Trond Myklebust @ 2002-09-07 8:24 ` Daniel Phillips 2002-09-07 16:06 ` Andrew Morton 2002-09-07 18:47 ` Rik van Riel 1 sibling, 2 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-07 8:24 UTC (permalink / raw) To: Andrew Morton, trond.myklebust; +Cc: Chuck Lever, Linux Kernel Mailing List On Friday 06 September 2002 00:19, Andrew Morton wrote: > I'm not sure what semantics we really want for this. If we were to > "invalidate" a mapped page then it would become anonymous, which > makes some sense. There's no need to leave the page mapped, you can easily walk the rmap list and remove the references. > If the VM wants to reclaim a page, and it has PG_private set then > the vm will run mapping->releasepage() against the page. The mapping's > releasepage must try to clear away whatever is held at ->private. If > that was successful then releasepage() must clear PG_private, decrement > page->count and return non-zero. If the info at ->private is not > freeable, releasepage returns zero. ->releasepage() may not sleep in > 2.5. > > So. NFS can put anything it likes at page->private. If you're not > doing that then you don't need a releasepage. If you are doing that > then you must have a releasepage(). Right now, there are no filesystems actually doing anything filesystem specific here, are there? I really wonder if making this field, formerly known as buffers, opaque to the vfs is the right idea. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 8:24 ` Daniel Phillips @ 2002-09-07 16:06 ` Andrew Morton 2002-09-09 21:08 ` Daniel Phillips 2002-09-07 18:47 ` Rik van Riel 1 sibling, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-07 16:06 UTC (permalink / raw) To: Daniel Phillips; +Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List Daniel Phillips wrote: > > On Friday 06 September 2002 00:19, Andrew Morton wrote: > > I'm not sure what semantics we really want for this. If we were to > > "invalidate" a mapped page then it would become anonymous, which > > makes some sense. > > There's no need to leave the page mapped, you can easily walk the rmap list > and remove the references. Yes, unmapping and forcing a subsequent major fault would make sense. It would require additional locking in the nopage path to make this 100% accurate. I doubt if it's worth doing that, so the unmap-and-refault-for-invalidate feature would probably be best-effort. But more accurate than what we have now. > > If the VM wants to reclaim a page, and it has PG_private set then > > the vm will run mapping->releasepage() against the page. The mapping's > > releasepage must try to clear away whatever is held at ->private. If > > that was successful then releasepage() must clear PG_private, decrement > > page->count and return non-zero. If the info at ->private is not > > freeable, releasepage returns zero. ->releasepage() may not sleep in > > 2.5. > > > > So. NFS can put anything it likes at page->private. If you're not > > doing that then you don't need a releasepage. If you are doing that > > then you must have a releasepage(). > > Right now, there are no filesystems actually doing anything filesystem > specific here, are there? I really wonder if making this field, formerly > known as buffers, opaque to the vfs is the right idea. That's right - it is only used for buffers at present.
I was using page->private in the delayed-allocate code for directly holding the disk mapping information. There was some talk of using it for <mumble> in XFS. Also it may be used in the NFS server for storing credential information. Also it could be used for MAP_SHARED pages for credential information - to fix the problem wherein kswapd (ie: root) is the one who instantiates the page's blocks, thus allowing non-root programs to consume the root-only reserved ext2/ext3 blocks. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 16:06 ` Andrew Morton @ 2002-09-09 21:08 ` Daniel Phillips 2002-09-09 21:36 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-09 21:08 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List On Saturday 07 September 2002 18:06, Andrew Morton wrote: > Daniel Phillips wrote: > > > If the VM wants to reclaim a page, and it has PG_private set then > > > the vm will run mapping->releasepage() against the page. The mapping's > > > releasepage must try to clear away whatever is held at ->private. If > > > that was successful then releasepage() must clear PG_private, decrement > > > page->count and return non-zero. If the info at ->private is not > > > freeable, releasepage returns zero. ->releasepage() may not sleep in > > > 2.5. > > > > > > So. NFS can put anything it likes at page->private. If you're not > > > doing that then you don't need a releasepage. If you are doing that > > > then you must have a releasepage(). > > > > Right now, there are no filesystems actually doing anything filesystem > > specific here, are there? I really wonder if making this field, formerly > > known as buffers, opaque to the vfs is the right idea. > > That's right - it is only used for buffers at present. I was using > page->private in the delayed-allocate code for directly holding the > disk mapping information. It's worth taking a deep breath at this point and considering whether that part of delalloc can be written generically, supposing that page->buffers were restored to its former non-opaque status. > There was some talk of using it for <mumble> in XFS. Christoph would be familiar with that issue, if there is one. > Also it may be used in the NFS server for storing credential > information. 
The NFS server is still a deep, black hole in the kernel from my point of view and I'd like that situation to end as soon as possible, so it might as well start ending now. Can you provide me a pointer to go start digging at that specific question? (And strongly agreed about the invalidate_inode_pages(2) issue: at some point it would behoove VM and NFS developers to reach a mutual understanding of what that interface is supposed to do, because it is growing new warts and tentacles at an alarming rate, and still seems to be, at best, a heuristic. I currently have the impression that the intent is to make files sort-of coherent between clients, for some slippery definition of sort-of.) > Also it could be used for MAP_SHARED pages for credential > information - to fix the problem wherein kswapd (ie: root) is the > one who instantiates the page's blocks, thus allowing non-root programs > to consume the root-only reserved ext2/ext3 blocks. OK, well I have four action items: - Listen to Christoph wax poetic about how XFS handles buffers and whether he needs them to be opaque to the vfs. - Think about delalloc with the goal of providing the mechanism at the vfs level, subject to somebody proving it actually has some benefit (maybe XFS already proves this? Come to think of it, if XFS already supports delayed allocation and doesn't rely on page->private...) - Find out what NFS really thinks invalidate_inode_pages(2) is supposed to do. - Start to sniff at the credential issue, with a view to ascertaining whether that subsystem ultimately relies on the buffer field being opaque. Intuitively I think 'no', the credential mechanism is supposed to be a generic vfs mechanism, and so for NFS or anyone else to bury credential state in the field formerly known as buffers is fundamentally wrong. In general, I think we'd be better off if page->buffers was not opaque, and that it should remain non-opaque until we are definitely ready to get rid of them. 
Doing otherwise will just allow various applications to start growing tendrils into the field, making it that much harder to get rid of when the time comes. So the question is, does anyone *really* need (void *) page->private instead of page->buffers? -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 21:08 ` Daniel Phillips @ 2002-09-09 21:36 ` Andrew Morton 2002-09-09 22:12 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-09 21:36 UTC (permalink / raw) To: Daniel Phillips; +Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List Daniel Phillips wrote: > > On Saturday 07 September 2002 18:06, Andrew Morton wrote: > > Daniel Phillips wrote: > > > > If the VM wants to reclaim a page, and it has PG_private set then > > > > the vm will run mapping->releasepage() against the page. The mapping's > > > > releasepage must try to clear away whatever is held at ->private. If > > > > that was successful then releasepage() must clear PG_private, decrement > > > > page->count and return non-zero. If the info at ->private is not > > > > freeable, releasepage returns zero. ->releasepage() may not sleep in > > > > 2.5. > > > > > > > > So. NFS can put anything it likes at page->private. If you're not > > > > doing that then you don't need a releasepage. If you are doing that > > > > then you must have a releasepage(). > > > > > > Right now, there are no filesystems actually doing anything filesystem > > > specific here, are there? I really wonder if making this field, formerly > > > known as buffers, opaque to the vfs is the right idea. > > > > That's right - it is only used for buffers at present. I was using > > page->private in the delayed-allocate code for directly holding the > > disk mapping information. > > It's worth taking a deep breath at this point and considering whether that > part of delalloc can be written generically, supposing that page->buffers > were restored to its former non-opaque status. The main motivation for that code was to fix the computational cost of the buffer layer by doing a complete end-around. That problem was later solved by fixing the buffer layer. Yes, there are still reasons for delayed allocation, the space reservation API, etc. 
But they are not compelling. They are certainly not compelling when writeback continues to use the (randomly-ordered) mapping->dirty_pages walk. With radix-tree enhancements to permit a pgoff_t-order walk of the dirty pages then yeah, order-of-magnitude gains in the tiobench random write pass. > > Also it may be used in the NFS server for storing credential > > information. > > The NFS server is still a deep, black hole in the kernel from my point of > view and I'd like that situation to end as soon as possible, so it might > as well start ending now. Can you provide me a pointer to go start > digging at that specific question? NFS client. Me too. > (And strongly agreed about the invalidate_inode_pages(2) issue: at some > point it would behoove VM and NFS developers to reach a mutual > understanding of what that interface is supposed to do, because it is > growing new warts and tentacles at an alarming rate, and still seems to > be, at best, a heuristic. I currently have the impression that the > intent is to make files sort-of coherent between clients, for some > slippery definition of sort-of.) I've been discussing that with Chuck. I'd prefer that the NFS client use a flavour of vmtruncate(), with its strong guarantees. But we won't know how horrid that is from a locking perspective until Trond returns. > ... > In general, I think we'd be better off if page->buffers was not opaque, Disagree. There is zero computational cost to the current setup, and it's a little cleaner, and it permits things such as the removal of ->private from the pageframes, and hashing for it. And there is plenty of precedent for putting fs-private hooks into core VFS data structures. > and that it should remain non-opaque until we are definitely ready to > get rid of them. There is nothing wrong with buffers, except the name. They no longer buffer anything. They _used_ to be the buffering entity, and an IO container, and the abstraction of a disk block. 
They are now just the abstraction of a disk block. s/buffer_head/block/g should make things clearer. And there is no way in which the kernel can get along without some structure which represents a disk block. It does one thing, and it does it well. The page is the buffering entity. The buffer_head is a disk block. The BIO is the IO container. Sometimes, for efficiency, we bypass the "block" part and go direct page-into-BIO. That's a conceptually-wrong performance hack. Yes, one could try to graft the "block" abstraction up into struct page, or down into struct BIO. But one would be mistaken, I expect. > Doing otherwise will just allow various applications > to start growing tendrils into the field, making it that much harder > to get rid of when the time comes. > > So the question is, does anyone *really* need (void *) page->private > instead of page->buffers? Don't know. But leaving it as-is tells the world that this is per-fs metadata which the VM/VFS supports. This has no cost. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 21:36 ` Andrew Morton @ 2002-09-09 22:12 ` Daniel Phillips 0 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-09 22:12 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List On Monday 09 September 2002 23:36, Andrew Morton wrote: > Daniel Phillips wrote: > Yes, there are still reasons for delayed allocation, the space reservation > API, etc. But they are not compelling. They are certainly not compelling > when writeback continues to use the (randomly-ordered) mapping->dirty_pages > walk. > > With radix-tree enhancements to permit a pgoff_t-order walk of the > dirty pages then yeah, order-of-magnitude gains in the tiobench > random write pass. Right. Coincidentally, our emails, both drawing the same conclusion, just passed each other on the way through vger. > > ... > > In general, I think we'd be better off if page->buffers was not opaque, > > Disagree. There is zero computational cost to the current setup, It's not a computational cost, at least not directly, it's an organizational cost. > and it's a little cleaner, and it permits things such as the removal > of ->private from the pageframes, and hashing for it. That's not really an argument, you can do that with ->buffers too, and there is zero cost for doing it with both, separately. > And there is plenty of precedent for putting fs-private hooks into > core VFS data structures. But you have to have a clear argument why. I don't see one yet, and I do see why the vfs wants to know about them. None of your reasons for wanting ->private have anything to do with buffers, they are all about other fancy features that somebody might want to add. > > and that it should remain non-opaque until we are definitely ready to > > get rid of them. > > There is nothing wrong with buffers, except the name. They no longer > buffer anything. Understood.
However, the spelling patch to correct that would touch the kernel in a few hundred places and break an indeterminate-but-large number of pending patches. > They _used_ to be the buffering entity, and an IO container, and > the abstraction of a disk block. > > They are now just the abstraction of a disk block. s/buffer_head/block/g > should make things clearer. Oooh, yes, that would be nice, but see above spelling patch objection. > And there is no way in which the kernel can get along without > some structure which represents a disk block. It does one thing, > and it does it well. > > The page is the buffering entity. > > The buffer_head is a disk block. > > The BIO is the IO container. > > Sometimes, for efficiency, we bypass the "block" part and go direct > page-into-BIO. That's a conceptually-wrong performance hack. Right, and what I *want* to do is introduce sub-page struct pages, along with generalizing struct page via mapping->page_size_shift, so that we can use struct page as a handle for a block, solving a number of nasty structural problems. The patch to do this will be less intrusive than you'd think at first blush, especially after Christoph has been in there for a while longer, dunging out filemap.c. > Yes, one could try to graft the "block" abstraction up into struct > page, or down into struct BIO. Err, right ;-) > But one would be mistaken, I expect. I expect not, but it falls to me to prove that, doesn't it? I was definitely not contemplating putting this forward for halloween, but instead to undertake the grunt work when the new stable series opens (and while *you* are busy chasing all the new bugs you made, *cackle* *cackle*). > > Doing otherwise will just allow various applications > > to start growing tendrils into the field, making it that much harder > > to get rid of when the time comes. > > > > So the question is, does anyone *really* need (void *) page->private > > instead of page->buffers? > > Don't know. 
But leaving it as-is tells the world that this is > per-fs metadata which the VM/VFS supports. This has no cost. It has a creeping-rot cost, because filesystems will inevitably come up with large numbers of mostly-bogus reasons for using it. I think the solution to this is to introduce a hashed ->private_frob_me structure and restore ->buffers to visibility by the vfs, which needs all the help it can get to survive the current misfit between page and block size. On second thought, your name change sounds really attractive, how about: - struct buffer_head *buffers; + struct block *blocks; Unfortunately, I can't think of any nice macro way to ease the pain of the patch. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 8:24 ` Daniel Phillips 2002-09-07 16:06 ` Andrew Morton @ 2002-09-07 18:47 ` Rik van Riel 2002-09-07 23:09 ` Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-09-07 18:47 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, trond.myklebust, Chuck Lever, Linux Kernel Mailing List On Sat, 7 Sep 2002, Daniel Phillips wrote: > On Friday 06 September 2002 00:19, Andrew Morton wrote: > > I'm not sure what semantics we really want for this. If we were to > > "invalidate" a mapped page then it would become anonymous, which > > makes some sense. > > There's no need to leave the page mapped, you can easily walk the rmap list > and remove the references. A pagefaulting task can have claimed a reference to the page and only be waiting on the lock we're currently holding. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 18:47 ` Rik van Riel @ 2002-09-07 23:09 ` Andrew Morton 2002-09-09 21:44 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-07 23:09 UTC (permalink / raw) To: Rik van Riel Cc: Daniel Phillips, trond.myklebust, Chuck Lever, Linux Kernel Mailing List Rik van Riel wrote: > > On Sat, 7 Sep 2002, Daniel Phillips wrote: > > On Friday 06 September 2002 00:19, Andrew Morton wrote: > > > I'm not sure what semantics we really want for this. If we were to > > > "invalidate" a mapped page then it would become anonymous, which > > > makes some sense. > > > > There's no need to leave the page mapped, you can easily walk the rmap list > > and remove the references. > > A pagefaulting task can have claimed a reference to the page > and only be waiting on the lock we're currently holding. Yup. In which case it comes away with an anonymous page. I'm very unkeen about using the inaccurate invalidate_inode_pages for anything which matters, really. And the consistency of pagecache data matters. NFS should be using something stronger. And that's basically vmtruncate() without the i_size manipulation. Hold i_sem, vmtruncate_list() for assured pagetable takedown, proper page locking to take the pages out of pagecache, etc. Sure, we could replace the page_count() heuristic with a page->pte.direct heuristic. Which would work just as well. Or better. Or worse. Who knows? Guys, can we sort out the NFS locking so that it is possible to take the correct locks to get the 100% behaviour? ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-07 23:09 ` Andrew Morton @ 2002-09-09 21:44 ` Daniel Phillips 2002-09-09 22:03 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-09 21:44 UTC (permalink / raw) To: Andrew Morton, Rik van Riel Cc: trond.myklebust, Chuck Lever, Linux Kernel Mailing List > I'm very unkeen about using the inaccurate invalidate_inode_pages > for anything which matters, really. And the consistency of pagecache > data matters. > > NFS should be using something stronger. And that's basically > vmtruncate() without the i_size manipulation. Yes, that looks good. Semantics are basically "and don't come back until every damn page is gone" which is enforced by the requirement that we hold the mapping->page_lock through one entire scan of the truncated region. (Yes, I remember sweating this one out a year or two ago so it doesn't eat 100% CPU on regular occasions.) So, specifically, we want: void invalidate_inode_pages(struct inode *inode) { truncate_inode_pages(inode->i_mapping, 0); } Is it any harder than that? By the way, now that we're all happy with the radix tree, we might as well just go traverse that instead of all the mapping->*_pages. (Not that I'm seriously suggesting rocking the boat that way just now, but it might yield some interesting de-crufting possibilities.) > Hold i_sem, > vmtruncate_list() for assured pagetable takedown, proper page > locking to take the pages out of pagecache, etc. > > Sure, we could replace the page_count() heuristic with a > page->pte.direct heuristic. Which would work just as well. Or > better. Or worse. Who knows? > > Guys, can we sort out the NFS locking so that it is possible to > take the correct locks to get the 100% behaviour? Trond, will the above work? Now, what is this invalidate_inode_pages2 seepage about? Called from one place. Sheesh. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 21:44 ` Daniel Phillips @ 2002-09-09 22:03 ` Andrew Morton 2002-09-09 22:19 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-09 22:03 UTC (permalink / raw) To: Daniel Phillips Cc: Rik van Riel, trond.myklebust, Chuck Lever, Linux Kernel Mailing List Daniel Phillips wrote: > > > I'm very unkeen about using the inaccurate invalidate_inode_pages > > for anything which matters, really. And the consistency of pagecache > > data matters. > > > > NFS should be using something stronger. And that's basically > > vmtruncate() without the i_size manipulation. > > Yes, that looks good. Semantics are basically "and don't come back > until every damm page is gone" which is enforced by the requirement > that we hold the mapping->page_lock though one entire scan of the > truncated region. (Yes, I remember sweating this one out a year > or two ago so it doesn't eat 100% CPU on regular occasions.) > > So, specifically, we want: > > void invalidate_inode_pages(struct inode *inode) > { > truncate_inode_pages(mapping, 0); > } > > Is it any harder than that? Pretty much - need to leave i_size where it was. But there are apparently reasons why NFS cannot sleepingly lock pages in this particular context. > By the way, now that we're all happy with the radix tree, we might > as well just go traverse that instead of all the mapping->*_pages. > (Not that I'm seriously suggesting rocking the boat that way just > now, but it might yield some interesting de-crufting possibilities.) Oh absolutely. unsigned long radix_tree_gang_lookup(void **pointers, unsigned long starting_from_here, unsigned long this_many); could be used nicely in readahead, drop_behind, truncate, invalidate and invalidate2. But to use it in writeback (desirable) we would need additional metadata in radix_tree_node. One bit per page, which means "this page is dirty" or "this subtree has dirty pages".
I keep saying this in the hope that someone will take pity and write it. > ... > Now, what is this invalidate_inode_pages2 seepage about? Called from > one place. Sheesh. heh. We still do have some O_DIRECT/pagecache coherency problems. ^ permalink raw reply [flat|nested] 77+ messages in thread
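[Editor's note: the gang-lookup primitive Andrew proposes can be illustrated with a simplified model over a flat index space — an array-backed stand-in for the radix tree. All names here are hypothetical; the sketch only demonstrates the "collect up to N non-empty slots starting at a given index" semantics.]

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLOTS 64

/* Flat table of slots standing in for the radix tree's leaves. */
static void *model_slots[MODEL_SLOTS];

/* Simplified model of the proposed radix_tree_gang_lookup():
 * gather up to 'this_many' non-NULL entries at indices
 * >= 'starting_from_here', returning how many were found.
 * The real thing would walk tree nodes instead of a flat array. */
static unsigned long
model_gang_lookup(void **pointers, unsigned long starting_from_here,
                  unsigned long this_many)
{
        unsigned long i, found = 0;

        for (i = starting_from_here; i < MODEL_SLOTS && found < this_many; i++)
                if (model_slots[i])
                        pointers[found++] = model_slots[i];
        return found;
}
```

A caller would loop, advancing `starting_from_here` past the last index seen, until the function returns 0 — which is what makes it useful for truncate/invalidate-style sweeps.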
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 22:03 ` Andrew Morton @ 2002-09-09 22:19 ` Daniel Phillips 2002-09-09 22:32 ` Andrew Morton 2002-09-09 23:51 ` Chuck Lever 0 siblings, 2 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-09 22:19 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, trond.myklebust, Chuck Lever, Linux Kernel Mailing List On Tuesday 10 September 2002 00:03, Andrew Morton wrote: > Daniel Phillips wrote: > > > > > I'm very unkeen about using the inaccurate invalidate_inode_pages > > > for anything which matters, really. And the consistency of pagecache > > > data matters. > > > > > > NFS should be using something stronger. And that's basically > > > vmtruncate() without the i_size manipulation. > > > > Yes, that looks good. Semantics are basically "and don't come back > > until every damm page is gone" which is enforced by the requirement > > that we hold the mapping->page_lock though one entire scan of the > > truncated region. (Yes, I remember sweating this one out a year > > or two ago so it doesn't eat 100% CPU on regular occasions.) > > > > So, specifically, we want: > > > > void invalidate_inode_pages(struct inode *inode) > > { > > truncate_inode_pages(mapping, 0); > > } > > > > Is it any harder than that? > > Pretty much - need to leave i_size where it was. This doesn't touch i_size. > But there are > apparently reasons why NFS cannot sleepingly lock pages in this particular > context. If only we knew what those were. It's hard to keep the word 'bogosity' from popping into my head. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 22:19 ` Daniel Phillips @ 2002-09-09 22:32 ` Andrew Morton 2002-09-10 16:57 ` Daniel Phillips 2002-09-09 23:51 ` Chuck Lever 1 sibling, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-09 22:32 UTC (permalink / raw) To: Daniel Phillips Cc: Rik van Riel, trond.myklebust, Chuck Lever, Linux Kernel Mailing List Daniel Phillips wrote: > > > > void invalidate_inode_pages(struct inode *inode) > > > { > > > truncate_inode_pages(mapping, 0); > > > } > > > > > > Is it any harder than that? > > > > Pretty much - need to leave i_size where it was. > > This doesn't touch i_size. Sorry - I was thinking vmtruncate(). truncate_inode_pages() would result in all the mmapped pages becoming out-of-date anonymous memory. NFS needs to take down the pagetables so that processes which are mmapping the file which changed on the server will take a major fault and read a fresh copy. I believe. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 22:32 ` Andrew Morton @ 2002-09-10 16:57 ` Daniel Phillips 0 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-10 16:57 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, trond.myklebust, Chuck Lever, Linux Kernel Mailing List On Tuesday 10 September 2002 00:32, Andrew Morton wrote: > Daniel Phillips wrote: > > > > > > void invalidate_inode_pages(struct inode *inode) > > > > { > > > > truncate_inode_pages(mapping, 0); > > > > } > > > > > > > > Is it any harder than that? > > > > > > Pretty much - need to leave i_size where it was. > > > > This doesn't touch i_size. > > Sorry - I was thinking vmtruncate(). truncate_inode_pages() would > result in all the mmapped pages becoming out-of-date anonymous > memory. NFS needs to take down the pagetables so that processes > which are mmapping the file which changed on the server will take > a major fault and read a fresh copy. I believe. Oh, um. Yes, we need the additional pte zapping behaviour of vmtruncate_list. It doesn't look particularly hard to produce a variant of vmtruncation that does (doesn't do) what you suggest. Let's see how the discussion goes with the NFS crowd. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 22:19 ` Daniel Phillips 2002-09-09 22:32 ` Andrew Morton @ 2002-09-09 23:51 ` Chuck Lever 2002-09-10 1:07 ` Daniel Phillips 2002-09-12 13:04 ` Trond Myklebust 1 sibling, 2 replies; 77+ messages in thread From: Chuck Lever @ 2002-09-09 23:51 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tue, 10 Sep 2002, Daniel Phillips wrote: > On Tuesday 10 September 2002 00:03, Andrew Morton wrote: > > Daniel Phillips wrote: > > > > > > > I'm very unkeen about using the inaccurate invalidate_inode_pages > > > > for anything which matters, really. And the consistency of pagecache > > > > data matters. > > > > > > > > NFS should be using something stronger. And that's basically > > > > vmtruncate() without the i_size manipulation. > > > > > > Yes, that looks good. Semantics are basically "and don't come back > > > until every damm page is gone" which is enforced by the requirement > > > that we hold the mapping->page_lock though one entire scan of the > > > truncated region. (Yes, I remember sweating this one out a year > > > or two ago so it doesn't eat 100% CPU on regular occasions.) > > > > > > So, specifically, we want: > > > > > > void invalidate_inode_pages(struct inode *inode) > > > { > > > truncate_inode_pages(mapping, 0); > > > } > > > > > > Is it any harder than that? > > > > Pretty much - need to leave i_size where it was. > > This doesn't touch i_size. > > > But there are > > apparently reasons why NFS cannot sleepingly lock pages in this particular > > context. > > If only we knew what those were. It's hard to keep the word 'bogosity' > from popping into my head. rpciod must never call a function that sleeps. if this happens, the whole NFS client stops working until the function wakes up again. this is not really bogus -- it is similar to restrictions placed on socket callbacks. 
async RPC tasks (ie, the rpciod process) invoke invalidate_inode_pages during normal, everyday processing, so it must not sleep. that's why it today ignores locked pages. thus: 1. whatever function purges a file's cached data must not sleep when invoked from an async RPC task. likewise, such a function must not sleep if the caller holds the file's i_sem. 2. access to the file must be serialized somehow with in-flight I/O (locked pages). we don't want to start new reads before all dirty pages have been flushed back to the server. dirty pages that have not yet been scheduled must be dealt with correctly. 3. mmap'd pages must behave reasonably when a file's cache is purged. clean pages should be faulted back in. what to do with dirty mmap'd pages? - Chuck Lever -- corporate: <cel at netapp dot com> personal: <chucklever at bigfoot dot com> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 23:51 ` Chuck Lever @ 2002-09-10 1:07 ` Daniel Phillips 2002-09-10 15:09 ` Chuck Lever 2002-09-12 13:04 ` Trond Myklebust 1 sibling, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-10 1:07 UTC (permalink / raw) To: Chuck Lever Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tuesday 10 September 2002 01:51, Chuck Lever wrote: > On Tue, 10 Sep 2002, Daniel Phillips wrote: > > On Tuesday 10 September 2002 00:03, Andrew Morton wrote: > > > But there are > > > apparently reasons why NFS cannot sleepingly lock pages in this > > > particular context. > > > > If only we knew what those were. It's hard to keep the word 'bogosity' > > from popping into my head. > > rpciod must never call a function that sleeps. if this happens, the whole > NFS client stops working until the function wakes up again. this is not > really bogus -- it is similar to restrictions placed on socket callbacks. Ah, a warm body with answers :-) It *sounds* bogus: why should we be satisfied with a function that doesn't do its job reliably (invalidate_inode_pages) in order to avoid coming up with a way of keeping the client daemon from blocking? How about having invalidate_inode_pages come back with "sorry boss, I couldn't complete the job so I started as much IO as I could and I'm back now, try again later"? > async RPC tasks (ie, the rpciod process) invokes invalidate_inode_pages > during normal, everyday processing, so it must not sleep. that's why it > today ignores locked pages. OK, that's half the job, the other half is to know that something's been ignored, and get back to it later. Either there is a valid reason for getting rid of these pages or there isn't, and if there is a valid reason, then getting rid of only some of them must surely leave the door wide open to strange misbehaviour. > thus: > > 1. 
whatever function purges a file's cached data must not sleep when > invoked from an async RPC task. [otherwise other tasks using the client will stall needlessly] > ...likewise, such a function must not > sleep if the caller holds the file's i_sem. > > 2. access to the file must be serialized somehow with in-flight I/O > (locked pages). we don't want to start new reads before all dirty > pages have been flushed back to the server. dirty pages that have > not yet been scheduled must be dealt with correctly. State machine! > 3. mmap'd pages must behave reasonably when a file's cache is purged. > clean pages should be faulted back in. what to do with dirty mmap'd > pages? I don't know, sorry. What? You've probably been through this before, but could you please explain the ground rules behind the cache purging strategy? -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
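[Editor's note: Daniel's "sorry boss, try again later" idea — a nonblocking pass that skips busy pages but reports how many it left behind — can be sketched as a toy model. The structures and names are made up for illustration; the real pagecache obviously looks nothing like a fixed array.]

```c
#include <assert.h>

#define MODEL_NPAGES 8

/* Toy pagecache for one file. */
struct model_cache {
        int present[MODEL_NPAGES];      /* page exists in the cache */
        int locked[MODEL_NPAGES];       /* page is busy (I/O in flight) */
};

/* Nonblocking best-effort purge: drop every unlocked page, skip the
 * rest, and return how many were left behind, so the caller knows the
 * cache is not clean and can retry later instead of assuming success. */
static int model_invalidate(struct model_cache *c)
{
        int i, skipped = 0;

        for (i = 0; i < MODEL_NPAGES; i++) {
                if (!c->present[i])
                        continue;
                if (c->locked[i]) {
                        skipped++;      /* must not sleep on the lock here */
                        continue;
                }
                c->present[i] = 0;      /* dropped from the cache */
        }
        return skipped;
}
```

The point of the return value is exactly requirement 1 above: the purge stays nonblocking for rpciod, but callers no longer mistake a partial purge for a complete one.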
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 1:07 ` Daniel Phillips @ 2002-09-10 15:09 ` Chuck Lever 2002-09-10 16:13 ` Daniel Phillips 2002-09-12 22:05 ` Urban Widmark 0 siblings, 2 replies; 77+ messages in thread From: Chuck Lever @ 2002-09-10 15:09 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tue, 10 Sep 2002, Daniel Phillips wrote: > On Tuesday 10 September 2002 01:51, Chuck Lever wrote: > > rpciod must never call a function that sleeps. if this happens, the whole > > NFS client stops working until the function wakes up again. this is not > > really bogus -- it is similar to restrictions placed on socket callbacks. > > Ah, a warm body with answers :-) > > It *sounds* bogus: why should we be satisfied with a function that doesn't > do its job reliably (invalidate_inode_pages) in order to avoid coming up > with a way of keeping the client daemon from blocking? How about having > invalidate_inode_pages come back with "sorry boss, I couldn't complete the > job so I started as much IO as I could and I'm back now, try again later"? i'm not suggesting that invalidate_inode_pages behaves properly, i'm simply trying to document why it works the way it does. i agree that it leaves a window open for broken behavior today. what is *not* bogus is the requirement for functions invoked by async RPC tasks not to sleep; that's reasonable, i feel. > > thus: > > > > 1. whatever function purges a file's cached data must not sleep when > > invoked from an async RPC task. > > [otherwise other tasks using the client will stall needlessly] correct. > > 3. mmap'd pages must behave reasonably when a file's cache is purged. > > clean pages should be faulted back in. what to do with dirty mmap'd > > pages? > > I don't know, sorry. What? 'twas a rhetorical question. i'm trying to understand this myself. the case of what to do with dirty mmap'd pages is somewhat sticky. 
> You've probably been through this before, but could you please explain > the ground rules behind the cache purging strategy? i can answer the question "when does the NFS client purge a file's cached data?" there are four major categories: a. directory changes require any cached readdir results be purged. this forces the readdir results to be re-read from the server the next time the client needs them. this is what broke with the recent changes in 2.5.32/3 that triggered this thread. b. when the client discovers a file on the server was changed by some other client, all pages in the page cache for that file are purged (except the ones that can't be because they are locked, etc). this is the case that is hit most often and in async RPC tasks, and is on many critical performance paths. c. when a file is locked or unlocked via lockf/fcntl, all pending writes are pushed back to the server and any cached data in the page cache is purged before the lock/unlock call returns. applications sometimes depend on this behavior to checkpoint the client's cache. d. some error occurred while reading a directory, or the object on the server has changed type (like, a file becomes a directory but the file handle is still the same -- a protocol error, but the client checks for this just in case). so let's talk about b. before and after many operations, the NFS client attempts to revalidate an inode. this means it does a GETATTR operation, or uses the attr results returned in many NFS requests, to compare the file's size and mtime on the server with the same values it has cached locally. this revalidation can occur during XDR processing while the RPC layer marshals and unmarshals the results of an NFS request. i don't want to speculate too much without Trond around to keep me honest. however, i think what we want here is behavior that is closer to category c., with as few negative performance implications as possible. 
i think one way to accomplish this is to create two separate revalidation functions -- one that can be used by synchronous code in the NFS client that uses the 100% bug killer, and one that can be used from async RPC tasks that simply marks that a purge is necessary, and next time through the sync one, the purge actually occurs. the only outstanding issue then is how to handle pages that are dirtied via mmap'd files, since they are touched without going through the NFS client. also, i'd really like to hear from maintainers of other network file systems about how they manage cache coherency. - Chuck Lever -- corporate: <cel at netapp dot com> personal: <chucklever at bigfoot dot com> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 15:09 ` Chuck Lever @ 2002-09-10 16:13 ` Daniel Phillips 2002-09-10 19:04 ` Chuck Lever 2002-09-12 22:05 ` Urban Widmark 1 sibling, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-10 16:13 UTC (permalink / raw) To: Chuck Lever Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tuesday 10 September 2002 17:09, Chuck Lever wrote: > On Tue, 10 Sep 2002, Daniel Phillips wrote: > > On Tuesday 10 September 2002 01:51, Chuck Lever wrote: > > > rpciod must never call a function that sleeps. if this happens, the whole > > > NFS client stops working until the function wakes up again. this is not > > > really bogus -- it is similar to restrictions placed on socket callbacks. > > > > Ah, a warm body with answers :-) > > > > It *sounds* bogus: why should we be satisfied with a function that doesn't > > do its job reliably (invalidate_inode_pages) in order to avoid coming up > > with a way of keeping the client daemon from blocking? How about having > > invalidate_inode_pages come back with "sorry boss, I couldn't complete the > > job so I started as much IO as I could and I'm back now, try again later"? > > i'm not suggesting that invalidate_inode_pages behaves properly, i'm > simply trying to document why it works the way it does. And nicely too, thanks. > > > 3. mmap'd pages must behave reasonably when a file's cache is purged. > > > clean pages should be faulted back in. what to do with dirty mmap'd > > > pages? > > > > I don't know, sorry. What? > > 'twas a rhetorical question. A rhetorical answer as well ;-) > i'm trying to understand this myself. the > case of what to do with dirty mmap'd pages is somewhat sticky. What I meant was, could you please explain the problem with dirty mmaped pages. I see you explained it below: you mean that writes to mmaps bypass the client, but the client really needs to know about them (and is largely ignorant of them at present). 
> > You've probably been through this before, but could you please explain > > the ground rules behind the cache purging strategy? > > i can answer the question "when does the NFS client purge a file's cached > data?" > > there are four major categories: > > a. directory changes require any cached readdir results be purged. That is, the client changes the directory itself? I suppose an NFS server is incapable of reporting directory changes caused by other clients, because of being stateless. > ...this > forces the readdir results to be re-read from the server the next time > the client needs them. this is what broke with the recent changes in > 2.5.32/3 that triggered this thread. > > b. when the client discovers a file on the server was changed by some > other client, all pages in the page cache for that file are purged > (except the ones that can't be because they are locked, etc). this > is the case that is hit most often and in async RPC tasks, and is > on many critical performance paths. > > c. when a file is locked or unlocked via lockf/fcntl, all pending writes > are pushed back to the server and any cached data in the page cache is > purged before the lock/unlock call returns. Do you mean, the client locks/unlocks the file, or some other client? I'm trying to fit this into my model of how the server must work. It must be that the locked/unlocked state is recorded at the server, in the underlying file, and that the server reports the locked/unlocked state of the file to every client via attr results. So now, why purge at *both* lock and unlock time? > ...applications sometimes > depend on this behavior to checkpoint the client's cache. > > d. some error occurred while reading a directory, or the object on the > server has changed type (like, a file becomes a directory but the > file handle is still the same -- a protocol error, but the client > checks for this just in case). > > so let's talk about b. 
before and after many operations, the NFS client > attempts to revalidate an inode. this means it does a GETATTR operation, > or uses the attr results returned in many NFS requests, to compare the > file's size and mtime on the server with the same values it has cached > locally. this revalidation can occur during XDR processing while the RPC > layer marshals and unmarshals the results of an NFS request. OK, so if this revalidation fails the client does the purge, as you described in b. > i don't want to speculate too much without Trond around to keep me honest. > however, i think what we want here is behavior that is closer to category > c., with as few negative performance implications as possible. Actually, this is really, really useful and gives me lots pointers I can follow for more details. > i think one way to accomplish this is to create two separate revalidation > functions -- one that can be used by synchronous code in the NFS client > that uses the 100% bug killer, and one that can be used from async RPC > tasks that simply marks that a purge is necessary, and next time through > the sync one, the purge actually occurs. That would certainly be easy from the VM side, then we could simply use a derivative of vmtruncate that leaves the file size alone, as Andrew suggested. If this method isn't satisfactory, then with considerably more effort we (you guys) could build a state machine in the client that relies on an (as yet unwritten but pretty straightforward) atomic purger with the ability to report the fact that it was unable to complete the purge atomically. Your suggestion is oh-so-much-easier. I hope it works out. > the only outstanding issue then is how to handle pages that are dirtied > via mmap'd files, since they are touched without going through the NFS > client. Hmm. And what do you want? Would a function that walks through the radix tree and returns the OR of page_dirty for every page in it be useful? 
That would be easy, but efficiency is another question. I suppose that even if you had such a function, the need to poll mmaped files constantly would be a stopper. Would it be satisfactory to know within a second or two that the mmap was dirtied? If so, the dirty scan could possibly be rolled into the regular refill_inactive/shrink_cache scan, though at some cost to structural cleanliness. Could the client mprotect the mmap, and receive a signal the first time somebody writes it? Jeff Dike does things like that with UML and they seem to work pretty well. Of these approaches, this is the one that sounds most satisfactory from the performance and correctness point of view, and it is a proven technique, however scary it may sound. You want to know about the dirty pages only so you can send them to the server, correct? Not because the client needs to purge anything. > also, i'd really like to hear from maintainers of other network > file systems about how they manage cache coherency. Yes, unfortunately, if we break Samba, Tridge knows where I live ;-) -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 16:13 ` Daniel Phillips @ 2002-09-10 19:04 ` Chuck Lever 2002-09-10 20:52 ` Daniel Phillips 2002-09-12 19:06 ` Daniel Phillips 0 siblings, 2 replies; 77+ messages in thread From: Chuck Lever @ 2002-09-10 19:04 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tue, 10 Sep 2002, Daniel Phillips wrote: > On Tuesday 10 September 2002 17:09, Chuck Lever wrote: > > i can answer the question "when does the NFS client purge a file's cached > > data?" > > > > there are four major categories: > > > > a. directory changes require any cached readdir results be purged. > > That is, the client changes the directory itself? I suppose an NFS > server is incapable of reporting directory changes caused by other > clients, because of being stateless. when a client itself changes a directory, it must purge its own readdir cache. this is because the server is really in control of the contents, which are treated as more or less opaque by the client. the next time any application on the client wants to read the directory, the client will go back to the server to get the updated contents. the NFS server doesn't report changes, the client must infer them from changes in a file's size and mtime. such changes are detected this way for directories as well as files. Trond may have added some code recently that also "flushes" the dentry cache for a directory when such a change is detected for a directory. > > c. when a file is locked or unlocked via lockf/fcntl, all pending writes > > are pushed back to the server and any cached data in the page cache is > > purged before the lock/unlock call returns. > > Do you mean, the client locks/unlocks the file, or some other client? when a client locks or unlocks a file, it flushes pending writes and purges its read cache. > I'm trying to fit this into my model of how the server must work.
It > must be that the locked/unlocked state is recorded at the server, in > the underlying file, and that the server reports the locked/unlocked > state of the file to every client via attr results. no, lock management is handled by an entirely separate protocol. the server maintains lock state separately from the actual files, as i understand it. > So now, why purge at *both* lock and unlock time? this is a little long. clients use what's called "close-to-open" cache consistency to try to maintain a single-system image of the file. this means when a file is opened, the client checks with the server to get the latest version of the file data and metadata (or simply to verify that the client can continue using whatever it has cached). when a file is closed, the client always flushes any pending writes to the file back to the server. in this way, when client A closes a file and subsequently client B opens the same file, client B will see all changes made by client A before it closed the file. note that while A still has the file open, B may not see all changes made by A, unless an application on A explicitly flushes the file. this is a compromise between tight cache coherency and good performance. when locking or unlocking a file, the idea is to make sure that other clients can see all changes to the file that were made while it was locked. locking and unlocking provide tighter cache coherency than simple everyday close-to-open because that's why applications go to the trouble of locking a file -- they expect to share the contents of the file with other applications on other clients. when a file is locked, the client wants to be certain it has the latest version of the file for an application to play with. the cache is purged to cause the client to read any recent changes from the server. when a file is unlocked, the client wants to share its changes with other clients so it flushes all pending writes before allowing the unlocking application to proceed. 
> > i don't want to speculate too much without Trond around to keep me honest. > > however, i think what we want here is behavior that is closer to category > > c., with as few negative performance implications as possible. > > Actually, this is really, really useful and gives me lots pointers I > can follow for more details. i'm very happy to be able to include more brains in the dialog! > > i think one way to accomplish this is to create two separate revalidation > > functions -- one that can be used by synchronous code in the NFS client > > that uses the 100% bug killer, and one that can be used from async RPC > > tasks that simply marks that a purge is necessary, and next time through > > the sync one, the purge actually occurs. > > That would certainly be easy from the VM side, then we could simply > use a derivative of vmtruncate that leaves the file size alone, as > Andrew suggested. the idea is that only the NFS client would have to worry about getting this right, and would invoke the proper VM hammer when it is safe to do so. that way, NFS-specific weirdness can be kept in fs/nfs/*.c, and not ooze into the VM layer. > > the only outstanding issue then is how to handle pages that are dirtied > > via mmap'd files, since they are touched without going through the NFS > > client. [ ... various ideas snipped ... ] > You want to know about the dirty pages only so you can send them > to the server, correct? Not because the client needs to purge > anything. i'm thinking only at the level of what behavior we want, not how to implement it, simply because i'm not familiar enough with how it works today. i've researched how this behaves (more or less) on a reference implementation of an NFS client by asking someone who worked on the Solaris client. 
the thinking is that if applications running on two separate clients have a file mmap'd, the application designers already know well enough that dirtying the same regions of the file on separate clients will have nondeterministic results. thus the only good reason that two or more clients would mmap the same file and dirty some pages would be to modify different regions of the file, or there is some application-level serialization scheme in place to keep writes to the file in a deterministic order. thus, when a client "revalidates" an mmap'd file and discovers that the file was changed on the server by another client, the reference implementation says "go ahead and flush all the writes you know about, then purge the read cache." so the only problem is finding the dirty pages so that the client can schedule the writes. i think this needs to happen after a server change is detected but before the client schedules any new I/O against the file. today, dirty mmap'd pages are passed to the NFS client via the writepage address space operation. what more needs to be done here? is there a mechanism today to tell the VM layer to "call writepage for all dirty mmap'd pages?" - Chuck Lever -- corporate: <cel at netapp dot com> personal: <chucklever at bigfoot dot com> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 19:04 ` Chuck Lever @ 2002-09-10 20:52 ` Daniel Phillips 2002-09-11 0:07 ` Andrew Morton 2002-09-12 19:06 ` Daniel Phillips 1 sibling, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-10 20:52 UTC (permalink / raw) To: Chuck Lever Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tuesday 10 September 2002 21:04, Chuck Lever wrote: > today, dirty mmap'd pages are passed to the NFS client via the writepage > address space operation. what more needs to be done here? is there a > mechanism today to tell the VM layer to "call writepage for all dirty > mmap'd pages?" Well, Andrew has cooked up something that seems to be headed in that direction, mpage_writepages and various associated superstructure, but it does not apparently walk all the ptes to check the dirty bits before it flushes the dirty pages, which is what you want, I think. It would not be hard to teach it that trick, it's another thing made easy by rmap. Then, after that's done, what kind of semantics have we got? Perhaps it's worth being able to guarantee that when a program says 'get all the dirty memory here out onto disk' it actually happens, even though there is no built-in way to be sure that some unsynchronized task won't come along and dirty the mmap again immediately. You could look at the need for synchronization there as an application issue. On the other hand, if the NFS client is taking the liberty of flushing the dirty memory on behalf of the mmap users, what guarantee is provided? IMHO, none, so is it really worth it for the NFS client to do this? It does make sense that fsync should really get all the dirty pages onto disk (err, or onto the server) and not come back until it's done, and that 'dirty pages' should include dirty ptes, not just pages that happen to have been scanned and had their dirty bits moved from the pte to the struct page.
Andrew, did I miss something, or does the current code really ignore the pte dirty bits? -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 20:52 ` Daniel Phillips @ 2002-09-11 0:07 ` Andrew Morton 2002-09-11 0:27 ` Daniel Phillips 2002-09-11 16:18 ` Rik van Riel 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-11 0:07 UTC (permalink / raw) To: Daniel Phillips Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List Daniel Phillips wrote: > > ... > Andrew, did I miss something, or does the current code really ignore > the pte dirty bits? Sure. pte_dirty -> PageDirty propagation happens in page reclaim, and in msync. We _could_ walk the pte chain in writeback. But that would involve visiting every page in the mapping, basically. That could hurt. But if a page is already dirty, and we're going to write it anyway, it makes tons of sense to run around and clean all the ptes which point at it. It especially makes sense for filemap_sync() to do that. (quickly edits the todo file). I'm not sure that MAP_SHARED is a good way of performing file writing, really. And not many things seem to use it for that. It's more there as a way for unrelated processes to find a chunk of common memory via the usual namespace. Not sure about that, but I am sure that it's a damn pest. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 0:07 ` Andrew Morton @ 2002-09-11 0:27 ` Daniel Phillips 2002-09-11 0:38 ` Andrew Morton 2002-09-11 16:18 ` Rik van Riel 1 sibling, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-11 0:27 UTC (permalink / raw) To: Andrew Morton Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Wednesday 11 September 2002 02:07, Andrew Morton wrote: > Daniel Phillips wrote: > > > > ... > > Andrew, did I miss something, or does the current code really ignore > > the pte dirty bits? > > Sure. pte_dirty -> PageDirty propagation happens in page reclaim, > and in msync. > > We _could_ walk the pte chain in writeback. But that would involve > visiting every page in the mapping, basically. That could hurt. I wasn't suggesting that, I was confirming you're not going looking at the ptes somewhere I didn't notice. > But if a page is already dirty, and we're going to write it anyway, > it makes tons of sense to run around and clean all the ptes which > point at it. Really. Since that is what the PG_dirty bit is supposed to be telling us. > It especially makes sense for fielmap_sync() to do that. (quickly > edits the todo file). > > I'm not sure that MAP_SHARED is a good way of performing file writing, > really. It sure isn't just now, and it's likely to remain that way for quite some time. > And not many things seem to use it for that. It's more there > as a way for unrelated processes to find a chunk of common memory via > the usual namespace. Not sure about that, but I am sure that it's a > damn pest. One day, in a perfect world, you will dirty a mmap, fsync it, and the data will appear in the blink of an eye, somewhere else. Till then it would be nice just to get mmaps of NFS files doing something reasonable. We do get around to walking the ptes at file close I believe. Is that not driven by zap_page_range, which moves any orphaned pte dirty bits down into the struct page? 
-- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 0:27 ` Daniel Phillips @ 2002-09-11 0:38 ` Andrew Morton 2002-09-11 0:53 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-11 0:38 UTC (permalink / raw) To: Daniel Phillips Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List Daniel Phillips wrote: > > > ... > We do get > around to walking the ptes at file close I believe. Is that not driven by > zap_page_range, which moves any orphaned pte dirty bits down into the struct > page? Nope, close will just leave all the pages pte-dirty or PageDirty in memory. truncate will nuke all the ptes and then the pagecache. But the normal way in which pte-dirty pages find their way to the backing file is: - page reclaim runs try_to_unmap or - user runs msync(). (Which will only clean that mm's ptes!) These will run set_page_dirty(), making the page visible to one of the many things which run writeback. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 0:38 ` Andrew Morton @ 2002-09-11 0:53 ` Daniel Phillips 2002-09-11 1:49 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-11 0:53 UTC (permalink / raw) To: Andrew Morton Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Wednesday 11 September 2002 02:38, Andrew Morton wrote: > Daniel Phillips wrote: > > > > > ... > > We do get > > around to walking the ptes at file close I believe. Is that not driven by > > zap_page_range, which moves any orphaned pte dirty bits down into the struct > > page? > > Nope, close will just leave all the pages pte-dirty or PageDirty in > memory. truncate will nuke all the ptes and then the pagecache. > > But the normal way in which pte-dirty pages find their way to the > backing file is: > > - page reclaim runs try_to_unmap or > > - user runs msync(). (Which will only clean that mm's ptes!) > > These will run set_page_dirty(), making the page visible to > one of the many things which run writeback. So we just quietly drop any dirty memory mapped to a file if the user doesn't run msync? Is that correct behaviour? It sure sounds wrong. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 0:53 ` Daniel Phillips @ 2002-09-11 1:49 ` Andrew Morton 2002-09-11 2:14 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-11 1:49 UTC (permalink / raw) To: Daniel Phillips Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List Daniel Phillips wrote: > > On Wednesday 11 September 2002 02:38, Andrew Morton wrote: > > Daniel Phillips wrote: > > > > > > > ... > > > We do get > > > around to walking the ptes at file close I believe. Is that not driven by > > > zap_page_range, which moves any orphaned pte dirty bits down into the struct > > > page? > > > > Nope, close will just leave all the pages pte-dirty or PageDirty in > > memory. truncate will nuke all the ptes and then the pagecache. > > > > But the normal way in which pte-dirty pages find their way to the > > backing file is: > > > > - page reclaim runs try_to_unmap or > > > > - user runs msync(). (Which will only clean that mm's ptes!) > > > > These will run set_page_dirty(), making the page visible to > > one of the many things which run writeback. > > So we just quietly drop any dirty memory mapped to a file if the user doesn't > run msync? Is that correct behaviour? It sure sounds wrong. > If the page is truncated then yes. (indirectly: the pte dirty state gets flushed into PG_dirty which is then invalidated). Otherwise, no. The pte dirtiness is propagated into PG_dirty when the pte is detached from the page. See try_to_unmap_one() - it is quite straightforward. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 1:49 ` Andrew Morton @ 2002-09-11 2:14 ` Daniel Phillips 0 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-11 2:14 UTC (permalink / raw) To: Andrew Morton Cc: Chuck Lever, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Wednesday 11 September 2002 03:49, Andrew Morton wrote: > Daniel Phillips wrote: > > So we just quietly drop any dirty memory mapped to a file if the user doesn't > > run msync? Is that correct behaviour? It sure sounds wrong. > > If the page is truncated then yes. (indirectly: the pte dirty > state gets flushed into PG_dirty which is then invalidated). > > Otherwise, no. The pte dirtiness is propagated into PG_dirty when > the pte is detached from the page. See try_to_unmap_one() - it is > quite straightforward. Yep, that's what I meant the first time round. Had me worried for a moment there... The specific path I was referring to is:

    sys_munmap->
      unmap_region->
        unmap_page_range->
          zap_pmd_range->
            zap_pte_range->
              if (pte_dirty(pte))
                  set_page_dirty(page);

Which is the definitive way of getting the data onto disk. It's weird that msync has to be different from fsync, or that fsync does not imply msync, but that's the posix folks for ya, I'm sure they threw a dart blindfolded at a list of reasons and came up with one. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 0:07 ` Andrew Morton 2002-09-11 0:27 ` Daniel Phillips @ 2002-09-11 16:18 ` Rik van Riel 2002-09-11 17:14 ` Daniel Phillips 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-09-11 16:18 UTC (permalink / raw) To: Andrew Morton Cc: Daniel Phillips, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Tue, 10 Sep 2002, Andrew Morton wrote: > We _could_ walk the pte chain in writeback. But that would involve > visiting every page in the mapping, basically. That could hurt. > > But if a page is already dirty, and we're going to write it anyway, > it makes tons of sense to run around and clean all the ptes which > point at it. Walking ptes probably doesn't hurt as much as doing extra disk IO, so I guess you're right ;) Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-11 16:18 ` Rik van Riel @ 2002-09-11 17:14 ` Daniel Phillips 0 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-11 17:14 UTC (permalink / raw) To: Rik van Riel, Andrew Morton Cc: Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Wednesday 11 September 2002 18:18, Rik van Riel wrote: > On Tue, 10 Sep 2002, Andrew Morton wrote: > > > We _could_ walk the pte chain in writeback. But that would involve > > visiting every page in the mapping, basically. That could hurt. > > > > But if a page is already dirty, and we're going to write it anyway, > > it makes tons of sense to run around and clean all the ptes which > > point at it. > > Walking ptes probably doesn't hurt as much as doing extra > disk IO, so I guess you're right ;) Well, but the thing I'm worried about is that we have failed to implement sys_msync at this point, please reassure me on that, if possible. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 19:04 ` Chuck Lever 2002-09-10 20:52 ` Daniel Phillips @ 2002-09-12 19:06 ` Daniel Phillips 1 sibling, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-12 19:06 UTC (permalink / raw) To: Chuck Lever Cc: Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tuesday 10 September 2002 21:04, Chuck Lever wrote: > when locking or unlocking a file, the idea is to make sure that other > clients can see all changes to the file that were made while it was > locked. locking and unlocking provide tighter cache coherency than simple > everyday close-to-open because that's why applications go to the trouble > of locking a file -- they expect to share the contents of the file with > other applications on other clients. > > when a file is locked, the client wants to be certain it has the latest > version of the file for an application to play with. the cache is purged > to cause the client to read any recent changes from the server. when a > file is unlocked, the client wants to share its changes with other clients > so it flushes all pending writes before allowing the unlocking application > to proceed. This is clear and easy to understand, thanks. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-10 15:09 ` Chuck Lever 2002-09-10 16:13 ` Daniel Phillips @ 2002-09-12 22:05 ` Urban Widmark 2002-09-12 22:21 ` Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Urban Widmark @ 2002-09-12 22:05 UTC (permalink / raw) To: Chuck Lever Cc: Daniel Phillips, Andrew Morton, Rik van Riel, trond.myklebust, Linux Kernel Mailing List On Tue, 10 Sep 2002, Chuck Lever wrote: > client. also, i'd really like to hear from maintainers of other network > file systems about how they manage cache coherency. I have a feeling that smbfs was built around an assumption that if it has a file open, no one else can change it. I don't think that is valid even with older smb protocols. From reading (selected parts of) this thread I believe the requirements of smbfs are much like the NFS ones. An incomplete summary of what is currently done: 1. If the attributes of a file change (size, mtime) the cached data is purged. 2. Directory changes do not modify the mtime of the dir on windows servers so a timeout is used. 3. Cached file data is currently not dropped when a file is closed. But any dirty pages are written. I guess the idea is that (1) would still catch any changes even if the file isn't open. smbfs in 2.5 will support oplocks, where the server can tell the client that someone else wants to open the same file, called an "oplock break". The client should then purge any cached file data and respond when it is done. This needs to be a nonblocking purge for similar reasons as mentioned for rpciod. The suggested best-effort-try-again-later invalidate sounds ok for that. There exists smb documentation stating that caching is not allowed without a held oplock. But how to deal with mmap wasn't really in that author's mind. If I understood Andrew right, with the inode flag nothing will be purged until the next userspace access. I don't think that is what smbfs wants since the response to the oplock break should happen fairly soon ... 
/Urban ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:05 ` Urban Widmark @ 2002-09-12 22:21 ` Andrew Morton 2002-09-12 22:30 ` Rik van Riel 2002-09-14 9:58 ` Urban Widmark 0 siblings, 2 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-12 22:21 UTC (permalink / raw) To: Urban Widmark Cc: Chuck Lever, Daniel Phillips, Rik van Riel, trond.myklebust, Linux Kernel Mailing List Urban Widmark wrote: > > ... > If I understood Andrew right with the inode flag that won't purge anything > until the next userspace access. I don't think that is what smbfs wants > since the response to the oplock break should happen fairly soon ... Well the lazy invalidation would be OK - defer that to the next userspace access, or just let the pages die a natural VM death, maybe poke `ksmbinvalidated'. But if you want to perform writeback within the local IO daemon then hm. That's a problem which NFS seems to not have? I guess you'd have to leave i_sem held and poke `ksmbwritebackd'. Or teach pdflush about queued work items (a tq_struct would do it). But it's the same story: the requirements of a) non blocking local IO daemon and b) assured pagecache takedown are conflicting. You need at least one more thread, and locking against userspace activity. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:21 ` Andrew Morton @ 2002-09-12 22:30 ` Rik van Riel 2002-09-12 22:43 ` Daniel Phillips ` (3 more replies) 2002-09-14 9:58 ` Urban Widmark 1 sibling, 4 replies; 77+ messages in thread From: Rik van Riel @ 2002-09-12 22:30 UTC (permalink / raw) To: Andrew Morton Cc: Urban Widmark, Chuck Lever, Daniel Phillips, trond.myklebust, Linux Kernel Mailing List On Thu, 12 Sep 2002, Andrew Morton wrote: > Well the lazy invalidation would be OK - defer that to the next > userspace access, I think I have an idea on how to do that, here's some pseudocode: invalidate_page(struct page * page) { SetPageInvalidated(page); rmap_lock(page); for_each_pte(pte, page) { make pte PROT_NONE; flush TLBs for this virtual address; } rmap_unlock(page); } And in the page fault path: if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page_pte)) { clear_pte(ptep); page_cache_release(page); mm->rss--; } What do you think, is this simple enough that it would work ? ;) regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:30 ` Rik van Riel @ 2002-09-12 22:43 ` Daniel Phillips 2002-09-12 22:51 ` Andrew Morton ` (2 subsequent siblings) 3 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-12 22:43 UTC (permalink / raw) To: Rik van Riel, Andrew Morton Cc: Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Friday 13 September 2002 00:30, Rik van Riel wrote: > On Thu, 12 Sep 2002, Andrew Morton wrote: > > > Well the lazy invalidation would be OK - defer that to the next > > userspace access, > > I think I have an idea on how to do that, here's some pseudocode: > > invalidate_page(struct page * page) { > SetPageInvalidated(page); > rmap_lock(page); > for_each_pte(pte, page) { > make pte PROT_NONE; > flush TLBs for this virtual address; > } > rmap_unlock(page); > } > > And in the page fault path: > > if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page_pte)) { > clear_pte(ptep); > page_cache_release(page); > mm->rss--; > } > > What do you think, is this simple enough that it would work ? ;) This is very promising. Actually, we'd only need to do that for pages that are currently skipped in invalidate_inode_pages. Have to think about possible interactions now... -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:30 ` Rik van Riel 2002-09-12 22:43 ` Daniel Phillips @ 2002-09-12 22:51 ` Andrew Morton 2002-09-12 23:05 ` Randy.Dunlap ` (2 more replies) 2002-09-13 4:19 ` Daniel Phillips 2002-09-13 4:52 ` Daniel Phillips 3 siblings, 3 replies; 77+ messages in thread From: Andrew Morton @ 2002-09-12 22:51 UTC (permalink / raw) To: Rik van Riel Cc: Urban Widmark, Chuck Lever, Daniel Phillips, trond.myklebust, Linux Kernel Mailing List Rik van Riel wrote: > > On Thu, 12 Sep 2002, Andrew Morton wrote: > > > Well the lazy invalidation would be OK - defer that to the next > > userspace access, > > I think I have an idea on how to do that, here's some pseudocode: > > invalidate_page(struct page * page) { > SetPageInvalidated(page); > rmap_lock(page); > for_each_pte(pte, page) { > make pte PROT_NONE; > flush TLBs for this virtual address; > } > rmap_unlock(page); > } > > And in the page fault path: > > if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page_pte)) { > clear_pte(ptep); > page_cache_release(page); > mm->rss--; > } > That's the bottom-up approach. The top-down (vmtruncate) approach would also work, if the locking is suitable. Look, idunnoigiveup. Like scsi and USB, NFS is a black hole where akpms fear to tread. I think I'll sulk until someone explains why this work has to be performed in the context of a process which cannot do it. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:51 ` Andrew Morton @ 2002-09-12 23:05 ` Randy.Dunlap 2002-09-12 23:23 ` Rik van Riel 2002-09-23 16:38 ` Trond Myklebust 2 siblings, 0 replies; 77+ messages in thread From: Randy.Dunlap @ 2002-09-12 23:05 UTC (permalink / raw) To: Andrew Morton; +Cc: Linux Kernel Mailing List On Thu, 12 Sep 2002, Andrew Morton wrote: | Look, idunnoigiveup. Like scsi and USB, NFS is a black hole | where akpms fear to tread. hehe. s/NFS/VM/ s/akpms/me et al/ Fortunately we have a large developer community with different (complementary) points of view and interests. -- ~Randy "Linux is not a research project. Never was, never will be." -- Linus, 2002-09-02 ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:51 ` Andrew Morton 2002-09-12 23:05 ` Randy.Dunlap @ 2002-09-12 23:23 ` Rik van Riel 2002-09-12 23:53 ` Daniel Phillips 2002-09-23 16:38 ` Trond Myklebust 2 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-09-12 23:23 UTC (permalink / raw) To: Andrew Morton Cc: Urban Widmark, Chuck Lever, Daniel Phillips, trond.myklebust, Linux Kernel Mailing List On Thu, 12 Sep 2002, Andrew Morton wrote: > Rik van Riel wrote: > > invalidate_page(struct page * page) { > That's the bottom-up approach. The top-down (vmtruncate) approach > would also work, if the locking is suitable. The top-down approach will almost certainly be most efficient when invalidating a large chunk of a file (truncate, large file locks) while the bottom-up approach is probably more efficient when the system invalidates very few pages (small file lock, cluster file system mmap() support). regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 23:23 ` Rik van Riel @ 2002-09-12 23:53 ` Daniel Phillips 0 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-12 23:53 UTC (permalink / raw) To: Rik van Riel, Andrew Morton Cc: Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Friday 13 September 2002 01:23, Rik van Riel wrote: > On Thu, 12 Sep 2002, Andrew Morton wrote: > > Rik van Riel wrote: > > > > invalidate_page(struct page * page) { > > > That's the bottom-up approach. The top-down (vmtruncate) approach > > would also work, if the locking is suitable. > > The top-down approach will almost certainly be most efficient when > invalidating a large chunk of a file (truncate, large file locks) > while the bottom-up approach is probably more efficient when the > system invalidates very few pages (small file lock, cluster file > system mmap() support). The bottom-up approach is the one we want to use when we'd otherwise skip a page in invalidate_inode_pages. This is the rare case. On the face of it, this works out very well. Just have to think about interactions now - I don't see any, but I haven't really gone hunting yet. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:51 ` Andrew Morton 2002-09-12 23:05 ` Randy.Dunlap 2002-09-12 23:23 ` Rik van Riel @ 2002-09-23 16:38 ` Trond Myklebust 2002-09-23 17:16 ` Daniel Phillips ` (2 more replies) 2 siblings, 3 replies; 77+ messages in thread From: Trond Myklebust @ 2002-09-23 16:38 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Urban Widmark, Chuck Lever, Daniel Phillips, trond.myklebust, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole > where akpms fear to tread. I think I'll sulk until someone > explains why this work has to be performed in the context of a > process which cannot do it. I'd be happy to move that work out of the RPC callbacks if you could point out which other processes actually can do it. The main problem is that the VFS/MM has no way of relabelling pages as being invalid or no longer up to date: I once proposed simply clearing PG_uptodate on those pages which cannot be cleared by invalidate_inode_pages(), but this was not to Linus' taste. Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 16:38 ` Trond Myklebust @ 2002-09-23 17:16 ` Daniel Phillips 2002-09-23 18:57 ` Andrew Morton 2002-09-23 19:13 ` Daniel Phillips 2 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-23 17:16 UTC (permalink / raw) To: trond.myklebust, Andrew Morton Cc: Rik van Riel, Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Monday 23 September 2002 18:38, Trond Myklebust wrote: > >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > > > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole > > where akpms fear to tread. I think I'll sulk until someone > > explains why this work has to be performed in the context of a > > process which cannot do it. > > I'd be happy to move that work out of the RPC callbacks if you could > point out which other processes actually can do it. > > The main problem is that the VFS/MM has no way of relabelling pages as > being invalid or no longer up to date: I once proposed simply clearing > PG_uptodate on those pages which cannot be cleared by > invalidate_inode_pages(), but this was not to Linus' taste. Could you please provide a subject line for that original discussion? -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 16:38 ` Trond Myklebust 2002-09-23 17:16 ` Daniel Phillips @ 2002-09-23 18:57 ` Andrew Morton 2002-09-23 20:41 ` Trond Myklebust 2002-09-23 19:13 ` Daniel Phillips 2 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-23 18:57 UTC (permalink / raw) To: trond.myklebust Cc: Rik van Riel, Urban Widmark, Chuck Lever, Daniel Phillips, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > > > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole > > where akpms fear to tread. I think I'll sulk until someone > > explains why this work has to be performed in the context of a > > process which cannot do it. > > I'd be happy to move that work out of the RPC callbacks if you could > point out which other processes actually can do it. Well it has to be a new thread, or user processes. Would it be possible to mark the inode as "needs invalidation", and make user processes check that flag once they have i_sem? > The main problem is that the VFS/MM has no way of relabelling pages as > being invalid or no longer up to date: I once proposed simply clearing > PG_uptodate on those pages which cannot be cleared by > invalidate_inode_pages(), but this was not to Linus' taste. Yes, clearing PageUptodate without holding the page lock is pretty scary. Do we really need to invalidate individual pages, or is it real-life acceptable to invalidate the whole mapping? Doing an NFS-special invalidate function is not a problem, btw - my current invalidate_inode_pages() is just 25 lines. It's merely a matter of working out what it should do ;) ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 18:57 ` Andrew Morton @ 2002-09-23 20:41 ` Trond Myklebust 2002-09-23 20:49 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-23 20:41 UTC (permalink / raw) To: Andrew Morton Cc: trond.myklebust, Rik van Riel, Urban Widmark, Chuck Lever, Daniel Phillips, Linux Kernel Mailing List >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > Well it has to be a new thread, or user processes. > Would it be possible to mark the inode as "needs invalidation", > and make user processes check that flag once they have i_sem? Not good enough unless you add those checks at the VFS/MM level. Think for instance about mmap() where the filesystem is usually not involved once a pagein has occurred. > Do we really need to invalidate individual pages, or is it > real-life acceptable to invalidate the whole mapping? Invalidating the mapping is certainly a good alternative if it can be done cleanly. Note that in doing so, we do not want to invalidate any reads or writes that may have been already scheduled. The existing mapping still would need to hang around long enough to permit them to complete. > Doing an NFS-special invalidate function is not a problem, btw > - my current invalidate_inode_pages() is just 25 lines. It's > merely a matter of working out what it should do ;) invalidate_inode_pages() *was* NFS-specific until you guys hijacked it for other purposes in 2.5.x ;-) Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 20:41 ` Trond Myklebust @ 2002-09-23 20:49 ` Daniel Phillips 2002-09-23 22:43 ` Trond Myklebust 0 siblings, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-23 20:49 UTC (permalink / raw) To: trond.myklebust, Andrew Morton Cc: trond.myklebust, Rik van Riel, Urban Widmark, Chuck Lever, Linux Kernel Mailing List On Monday 23 September 2002 22:41, Trond Myklebust wrote: > >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > > > Would it be possible to mark the inode as "needs invalidation", > > and make user processes check that flag once they have i_sem? > > Not good enough unless you add those checks at the VFS/MM level. Please see Rik's suggestion and my followups where we are talking about handling this in the MM. > Think > for instance about mmap() where the filesystem is usually not involved > once a pagein has occurred. > > > Do we really need to invalidate individual pages, or is it > > real-life acceptable to invalidate the whole mapping? > > Invalidating the mapping is certainly a good alternative if it can be > done cleanly. But invalidate_inode_pages is (usually) just trying to do exactly that. It doesn't work. Anyway, what if you have a 2 gig file with 1 meg mmaped/locked/whatever by a database? > Note that in doing so, we do not want to invalidate any reads or > writes that may have been already scheduled. The existing mapping > still would need to hang around long enough to permit them to > complete. With the mechanism I described above, that would just work. The fault path would do lock_page, thus waiting for the IO to complete. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 20:49 ` Daniel Phillips @ 2002-09-23 22:43 ` Trond Myklebust 2002-09-24 5:09 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-23 22:43 UTC (permalink / raw) To: Daniel Phillips Cc: trond.myklebust, Andrew Morton, Rik van Riel, Urban Widmark, Chuck Lever, Linux Kernel Mailing List >>>>> " " == Daniel Phillips <phillips@arcor.de> writes: >> Note that in doing so, we do not want to invalidate any reads >> or writes that may have been already scheduled. The existing >> mapping still would need to hang around long enough to permit >> them to complete. > With the mechanism I described above, that would just work. > The fault path would do lock_page, thus waiting for the IO to > complete. NFS writes do not hold the page lock until completion. How would you expect to be able to coalesce writes to the same page if they did? Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 22:43 ` Trond Myklebust @ 2002-09-24 5:09 ` Daniel Phillips 2002-09-24 16:40 ` Trond Myklebust 0 siblings, 1 reply; 77+ messages in thread From: Daniel Phillips @ 2002-09-24 5:09 UTC (permalink / raw) To: Trond Myklebust Cc: trond.myklebust, Andrew Morton, Rik van Riel, Urban Widmark, Chuck Lever, Linux Kernel Mailing List On Tuesday 24 September 2002 00:43, Trond Myklebust wrote: > >>>>> " " == Daniel Phillips <phillips@arcor.de> writes: > > >> Note that in doing so, we do not want to invalidate any reads > >> or writes that may have been already scheduled. The existing > >> mapping still would need to hang around long enough to permit > >> them to complete. > > > With the mechanism I described above, that would just work. > > The fault path would do lock_page, thus waiting for the IO to > > complete. > > NFS writes do not hold the page lock until completion. How would you > expect to be able to coalesce writes to the same page if they did? Coalesce before initiating writeout? I don't see why NFS should be special in this regard, or why it should not leave a page locked until IO has completed, like other filesystems. Could you please explain? -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-24 5:09 ` Daniel Phillips @ 2002-09-24 16:40 ` Trond Myklebust 0 siblings, 0 replies; 77+ messages in thread From: Trond Myklebust @ 2002-09-24 16:40 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Rik van Riel, Urban Widmark, Chuck Lever, Linux Kernel Mailing List On Tuesday 24 September 2002 07:09, Daniel Phillips wrote: > Coalesce before initiating writeout? I don't see why NFS should be special > in this regard, or why it should not leave a page locked until IO has > completed, like other filesystems. Could you please explain? It does that for reads. For writes, however, I see no point in keeping the page lock beyond what is required in order to safely copy data from userland. To do so would give disastrous performance if somebody was trying to write < page-sized records. I believe this is true for all filesystems. For some reason or another, the buffer stuff needs to re-take the page lock when it actually performs the physical commit to disk. NFS doesn't need to do this in order to perform an RPC call, so we don't... Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-23 16:38 ` Trond Myklebust 2002-09-23 17:16 ` Daniel Phillips 2002-09-23 18:57 ` Andrew Morton @ 2002-09-23 19:13 ` Daniel Phillips 2 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-23 19:13 UTC (permalink / raw) To: trond.myklebust, Andrew Morton, Linus Torvalds Cc: Rik van Riel, Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Monday 23 September 2002 18:38, Trond Myklebust wrote: > >>>>> " " == Andrew Morton <akpm@digeo.com> writes: > > > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole > > where akpms fear to tread. I think I'll sulk until someone > > explains why this work has to be performed in the context of a > > process which cannot do it. > > I'd be happy to move that work out of the RPC callbacks if you could > point out which other processes actually can do it. > > The main problem is that the VFS/MM has no way of relabelling pages as > being invalid or no longer up to date: I once proposed simply clearing > PG_uptodate on those pages which cannot be cleared by > invalidate_inode_pages(), but this was not to Linus' taste. I'll take a run at analyzing this. First, it's clear why we can't just set the page !uptodate: if we fail to lock the page we can't change the state of the uptodate bit because we would violate the locking rules, iow, we would race with the vfs (see block_read/write_full_page). Note that even if we succeed in the TryLock and set !uptodate, we still have to walk the rmap list and unmap the page or it won't get refaulted and the uptodate bit will be ignored. For any page we can't lock without blocking, the cases are: 1) Dirty: we don't need to invalidate it because it's going to get written back to the server anyway 2) Locked, clean: the page could be locked for any number of reasons. Probably, it's locked for reading though. We *obviously* need to kill this page at some point or we have a nasty heisenbug. 
E.g., somebody, somewhere, will get a file handed back to them from some other client that rewrote the whole thing, complete and correct except for a stale page or two. For pages that we can lock, we have: 3) Elevated count, clean: we could arguably ignore the use count and just yank the page out of the inode list, as Andrew's patch does. Getting it out of the mapping is harder, perhaps much harder. 4) Clean, has buffers, can't get rid of the buffers: we can't know why. HTree puts pages in this state for directory access, Ext3 probably does it for a variety of reasons. Same situation as above. Given the obviously broken case (2) above and the two probably broken cases (3) and (4), I don't see any way to ignore this problem and still implement the NFS semantics Chuck described earlier. I see Rik's suggestion of marking the problem pages invalid and walking the ptes to protect them as the cleanest fix. Unlike invalidate_inode_pages, the fault path can block perfectly happily while the problem conditions sort themselves out. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:30 ` Rik van Riel 2002-09-12 22:43 ` Daniel Phillips 2002-09-12 22:51 ` Andrew Morton @ 2002-09-13 4:19 ` Daniel Phillips 2002-09-13 4:52 ` Daniel Phillips 3 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-13 4:19 UTC (permalink / raw) To: Rik van Riel, Andrew Morton Cc: Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Friday 13 September 2002 00:30, Rik van Riel wrote: > On Thu, 12 Sep 2002, Andrew Morton wrote: > > > Well the lazy invalidation would be OK - defer that to the next > > userspace access, > > I think I have an idea on how to do that, here's some pseudocode: > > invalidate_page(struct page * page) { > SetPageInvalidated(page); > rmap_lock(page); > for_each_pte(pte, page) { > make pte PROT_NONE; > flush TLBs for this virtual address; > } > rmap_unlock(page); > } > > And in the page fault path: > > if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page_pte)) { > clear_pte(ptep); > page_cache_release(page); > mm->rss--; > } > > What do you think, is this simple enough that it would work ? ;) Too simple to work, unfortunately. We have to at least 1) lock the page and 2) remove it from the page cache. Can we remove a page from the page cache while it still has pte references? I suppose we can, it turns into an anonymous page. But isn't that the reason we didn't do it in invalidate_inode_pages in the first place? However, since we now, in addition, mark the page invalidated, it doesn't matter any more what kind of page it is, so I suppose that's ok. And we need a locked_page_cache_release->free_locked_page. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:30 ` Rik van Riel ` (2 preceding siblings ...) 2002-09-13 4:19 ` Daniel Phillips @ 2002-09-13 4:52 ` Daniel Phillips 3 siblings, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-09-13 4:52 UTC (permalink / raw) To: Rik van Riel, Andrew Morton Cc: Urban Widmark, Chuck Lever, trond.myklebust, Linux Kernel Mailing List On Friday 13 September 2002 06:19, I wrote: > And we need a locked_page_cache_release->free_locked_page. Hmm, maybe not. We should be able to unlock the page just after removing it from the page cache. By the way, the wait_on_page in lock_page is where we finally get around to waiting on those locked pages we couldn't get rid of in invalidate_inode_pages; what we have done here is shifted the wait from rpciod to normal users, which is exactly what we want. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 22:21 ` Andrew Morton 2002-09-12 22:30 ` Rik van Riel @ 2002-09-14 9:58 ` Urban Widmark 1 sibling, 0 replies; 77+ messages in thread From: Urban Widmark @ 2002-09-14 9:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Linux Kernel Mailing List On Thu, 12 Sep 2002, Andrew Morton wrote: > But it's the same story: the requirements of > > a) non blocking local IO daemon and > > b) assured pagecache takedown > > are conflicting. You need at least one more thread, and locking > against userspace activity. I see no problem with adding another thread to handle the breaks. Only the cost of an extra thread and the fact that smbiod was originally created to handle the break (with a thought to eventually make it do the IO as it does now) makes me want to put it in smbiod. /Urban ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-09 23:51 ` Chuck Lever 2002-09-10 1:07 ` Daniel Phillips @ 2002-09-12 13:04 ` Trond Myklebust 2002-09-12 18:21 ` Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Trond Myklebust @ 2002-09-12 13:04 UTC (permalink / raw) To: Chuck Lever Cc: Daniel Phillips, Andrew Morton, Rik van Riel, Linux Kernel Mailing List >>>>> " " == Chuck Lever <cel@citi.umich.edu> writes: > rpciod must never call a function that sleeps. if this > happens, the whole NFS client stops working until the function > wakes up again. this is not really bogus -- it is similar to > restrictions placed on socket callbacks. I'm in France at the moment, and am therefore not really able to follow up on this thread for the moment. I'll try to clarify the above though: 2 reasons why rpciod cannot block: 1) Doing so will slow down I/O for *all* NFS users. 2) There's a minefield of possible deadlock situations: waiting on a locked page is the main no-no since rpciod itself is the process that needs to complete the read I/O and unlock the page. Cheers, Trond ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3 2002-09-12 13:04 ` Trond Myklebust @ 2002-09-12 18:21 ` Andrew Morton 2002-09-12 21:15 ` Daniel Phillips 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-09-12 18:21 UTC (permalink / raw) To: trond.myklebust Cc: Chuck Lever, Daniel Phillips, Rik van Riel, Linux Kernel Mailing List Trond Myklebust wrote: > > >>>>> " " == Chuck Lever <cel@citi.umich.edu> writes: > > > rpciod must never call a function that sleeps. if this > > happens, the whole NFS client stops working until the function > > wakes up again. this is not really bogus -- it is similar to > > restrictions placed on socket callbacks. > > I'm in France at the moment, and am therefore not really able to > follow up on this thread for the moment. I'll try to clarify the above > though: > > 2 reasons why rpciod cannot block: > > 1) Doing so will slow down I/O for *all* NFS users. > 2) There's a minefield of possible deadlock situations: waiting on a > locked page is the main no-no since rpciod itself is the process > that needs to complete the read I/O and unlock the page. > Yes. Both of these would indicate that rpciod is the wrong process to be performing the invalidation. Is it not possible to co-opt a user process to perform the invalidation? Just inode->is_kaput = 1; in rpciod? ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: invalidate_inode_pages in 2.5.32/3
From: Daniel Phillips @ 2002-09-12 21:15 UTC
To: Andrew Morton, trond.myklebust
Cc: Chuck Lever, Rik van Riel, Linux Kernel Mailing List

On Thursday 12 September 2002 20:21, Andrew Morton wrote:
> Trond Myklebust wrote:
> >
> > >>>>> " " == Chuck Lever <cel@citi.umich.edu> writes:
> >
> > > rpciod must never call a function that sleeps. if this
> > > happens, the whole NFS client stops working until the function
> > > wakes up again. this is not really bogus -- it is similar to
> > > restrictions placed on socket callbacks.
> >
> > I'm in France at the moment, and am therefore not really able to
> > follow up on this thread for the moment. I'll try to clarify the above
> > though:
> >
> > 2 reasons why rpciod cannot block:
> >
> > 1) Doing so will slow down I/O for *all* NFS users.
> > 2) There's a minefield of possible deadlock situations: waiting on a
> >    locked page is the main no-no, since rpciod itself is the process
> >    that needs to complete the read I/O and unlock the page.
>
> Yes. Both of these would indicate that rpciod is the wrong process
> to be performing the invalidation.
>
> Is it not possible to co-opt a user process to perform the
> invalidation? Just
>
> 	inode->is_kaput = 1;
>
> in rpciod?

There must be a way. The key thing the VM needs to provide, and doesn't
now, is a function callable by rpciod that will report to the caller
whether it was able to complete the invalidation without blocking. (I
think I'm just rephrasing someone's earlier suggestion here.)

I'm now thinking in general terms about how to concoct a mechanism that
lets rpciod retry the invalidation later, for all those that turn out
to be blocking. For example, rpciod could just keep a list of all
pending invalidates and retry each inode on the list every time it has
nothing to do. This is crude and n-squarish, but it would work. Maybe
it's efficient enough for the time being. At least it's correct, which
would be a step forward.

Did you have some specific mechanism in mind?

--
Daniel
* Re: invalidate_inode_pages in 2.5.32/3
From: Andrew Morton @ 2002-09-12 21:38 UTC
To: Daniel Phillips
Cc: trond.myklebust, Chuck Lever, Rik van Riel, Linux Kernel Mailing List

Daniel Phillips wrote:
>
> ...
> > Is it not possible to co-opt a user process to perform the
> > invalidation? Just
> >
> > 	inode->is_kaput = 1;
> >
> > in rpciod?
>
> There must be a way. The key thing the VM needs to provide, and doesn't
> now, is a function callable by rpciod that will report to the caller
> whether it was able to complete the invalidation without blocking. (I
> think I'm just rephrasing someone's earlier suggestion here.)
>
> I'm now thinking in general terms about how to concoct a mechanism
> that lets rpciod retry the invalidation later, for all those that turn
> out to be blocking. For example, rpciod could just keep a list of
> all pending invalidates and retry each inode on the list every time
> it has nothing to do. This is crude and n-squarish, but it would
> work. Maybe it's efficient enough for the time being. At least it's
> correct, which would be a step forward.

rpciod is the wrong process to be performing this operation. I'd suggest
the userspace process which wants to read the directory be used for this.

> Did you have some specific mechanism in mind?

Testing mapping->nrpages will tell you if the invalidation was successful.
* Re: invalidate_inode_pages in 2.5.32/3
From: Daniel Phillips @ 2002-09-12 21:52 UTC
To: Andrew Morton
Cc: trond.myklebust, Chuck Lever, Rik van Riel, Linux Kernel Mailing List

On Thursday 12 September 2002 23:38, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > > Is it not possible to co-opt a user process to perform the
> > > invalidation? Just
> > >
> > > 	inode->is_kaput = 1;
> > >
> > > in rpciod?
> >
> > There must be a way. The key thing the VM needs to provide, and doesn't
> > now, is a function callable by rpciod that will report to the caller
> > whether it was able to complete the invalidation without blocking. (I
> > think I'm just rephrasing someone's earlier suggestion here.)
> >
> > I'm now thinking in general terms about how to concoct a mechanism
> > that lets rpciod retry the invalidation later, for all those that turn
> > out to be blocking. For example, rpciod could just keep a list of
> > all pending invalidates and retry each inode on the list every time
> > it has nothing to do. This is crude and n-squarish, but it would
> > work. Maybe it's efficient enough for the time being. At least it's
> > correct, which would be a step forward.
>
> rpciod is the wrong process to be performing this operation. I'd suggest
> the userspace process which wants to read the directory be used for this.

It's not just directories as I understand it, it's also any file that's
locked or unlocked. If we make userspace do it, we need an interface. Is
there one?

> > Did you have some specific mechanism in mind?
>
> Testing mapping->nrpages will tell you if the invalidation was successful.

That's nice, so maybe there's no need to write a new flavor of
invalidate_inode_pages, but what I was really talking about is what
should be done after discovering it failed.

Another thing: we only care about pages that were in the mapping at the
time of the original invalidate attempt. Coming back and invalidating a
bunch of pages that were already properly re-read from the server, just
to get those few we missed on the first attempt, would be, in a word,
horrible.

This one is harder than it first seems. It's worth putting in the effort
to do the job correctly though. The current arrangement is, err,
unreliable.

--
Daniel
* Re: invalidate_inode_pages in 2.5.32/3
From: Chuck Lever @ 2002-09-05 22:01 UTC
To: Andrew Morton
Cc: trond.myklebust, Linux Kernel Mailing List

Andrew Morton wrote:
> Trond, there are very good reasons why it broke. Those pages are
> visible to the whole world via global data structures - both the
> page LRUs and via the superblocks->inodes walk. Those things exist
> for legitimate purposes, and it is legitimate for async threads
> of control to take a reference on those pages while playing with them.
>
> It just "happened to work" in earlier kernels.
>
> I suspect we can just remove the page_count() test from invalidate
> and that will fix everything up. That will give stronger invalidate
> and anything which doesn't like that is probably buggy wrt truncate anyway.
>
> Could you test that?

removing that test from invalidate_inode_pages allows test6 to run to
completion. however, i don't see why the reference counts should be
higher in 2.5.32 than they were in 2.5.31. is there a good way to test
that these pages do not become orphaned?

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
* Re: invalidate_inode_pages in 2.5.32/3
From: Andrew Morton @ 2002-09-05 22:23 UTC
To: Chuck Lever
Cc: trond.myklebust, Linux Kernel Mailing List

Chuck Lever wrote:
>
> Andrew Morton wrote:
>
> > Trond, there are very good reasons why it broke. Those pages are
> > visible to the whole world via global data structures - both the
> > page LRUs and via the superblocks->inodes walk. Those things exist
> > for legitimate purposes, and it is legitimate for async threads
> > of control to take a reference on those pages while playing with them.
> >
> > It just "happened to work" in earlier kernels.
> >
> > I suspect we can just remove the page_count() test from invalidate
> > and that will fix everything up. That will give stronger invalidate
> > and anything which doesn't like that is probably buggy wrt truncate anyway.
> >
> > Could you test that?
>
> removing that test from invalidate_inode_pages allows test6 to run to
> completion.

OK, thanks. I bet adding an lru_cache_drain() fixes it too.

> however, i don't see why the reference counts should be higher in
> 2.5.32 than they were in 2.5.31.

Those pages are sitting in the cpu-local "to be added to the LRU soon"
queue, with their refcount elevated. That queue really only makes sense
for SMP, but it's enabled on UP so we pick up any weird effects. Voila.

> is there a good way to test that
> these pages do not become orphaned?

Not that I know of - just run the test for ages and see if the box runs
out of memory.
* Re: invalidate_inode_pages in 2.5.32/3
From: Chuck Lever @ 2002-09-05 21:41 UTC
To: Andrew Morton
Cc: Trond Myklebust, Linux Kernel Mailing List

Andrew Morton wrote:
> Trond Myklebust wrote:
> > >>>>> " " == Andrew Morton <akpm@zip.com.au> writes:
> >
> > > You may have more success using the stronger
> > > invalidate_inode_pages2().
> >
> > Shouldn't make any difference. Chuck is seeing this on readdir() pages
> > which, of course, don't suffer from problems of dirtiness etc on NFS.
>
> Well the VM will take a ref on the page during reclaim even if it
> is clean.
>
> With what sort of frequency does this happen? If it's easily reproducible
> then dunno. If it's once-an-hour then it may be page reclaim, conceivably.
> The PageLRU debug test in there will tell us.

this happens every time i run test6 on 2.5.32 or 2.5.33. the pages are
not on the LRU, and are not locked, when the NFS client calls
invalidate_inode_pages.

how do these pages get into the page cache? answer: nfs_readdir_filler
is invoked by read_cache_page to fill a page. when nfs_readdir_filler
is invoked in 2.5.31, the page count on the page to be filled is always
2. when nfs_readdir_filler is invoked in 2.5.32+, the page count is
very often 3, but occasionally it is 2.

- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
* Re: invalidate_inode_pages in 2.5.32/3
From: Helge Hafting @ 2002-09-06 9:35 UTC
To: Chuck Lever, trond.myklebust
Cc: linux-kernel

Chuck Lever wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> > That all assumes SMP/preempt. If you're seeing these problems on
> > uniproc/non-preempt then something fishy may be happening.
>
> sorry, forgot to mention: the system is UP, non-preemptible, high mem.
>
> invalidate_inode_pages isn't freeing these pages because the page count is
> two. perhaps the page count semantics of one of the page cache helper
> functions has changed slightly. i'm still diagnosing.
>
> fortunately the problem is deterministically reproducible. basic test6,
> the readdir test, of the 2002 Connectathon test suite, fails -- either a
> duplicate file entry or a missing file entry appears after some standard
> file creation and removal processing in that directory. the incorrect
> entries occur because the NFS client zaps the directory's page cache to
> force the next reader to re-read the directory from the server. but
> invalidate_inode_pages decides to leave the pages in the cache, so the
> next reader gets stale cached data instead.

Perhaps this explains my nfs problem. (2.5.32/33 UP, preempt, no highmem)
Sometimes, when editing a file on nfs, the file disappears from the
server. The client believes it is there until an umount+mount sequence.
It doesn't happen for all files, but it is 100% reproducible for those
affected.

Editing changes the directory when the editor makes a backup copy. Is
the old directory kept around wrongly, so that the save into the
existing file silently fails because wrong directory information from
the cache is used?

Helge Hafting
* Re: invalidate_inode_pages in 2.5.32/3
From: Chuck Lever @ 2002-09-06 16:16 UTC
To: Helge Hafting
Cc: linux-kernel

On Fri, 6 Sep 2002, Helge Hafting wrote:
> Perhaps this explains my nfs problem. (2.5.32/33 UP, preempt, no highmem)
> Sometimes, when editing a file on nfs, the file disappears from the
> server. The client believes it is there until an umount+mount sequence.
> It doesn't happen for all files, but it is 100% reproducible for those
> affected.
>
> Editing changes the directory when the editor makes a backup copy. Is
> the old directory kept around wrongly, so that the save into the
> existing file silently fails because wrong directory information from
> the cache is used?

that sounds very similar, it is probably the same bug.

- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
* Re: invalidate_inode_pages in 2.5.32/3
From: Daniel Phillips @ 2002-09-07 8:01 UTC
To: Andrew Morton, Chuck Lever
Cc: Linux Kernel Mailing List

On Thursday 05 September 2002 20:27, Andrew Morton wrote:
> But be aware that invalidate_inode_pages has always been best-effort.
> If someone is reading, or writing one of those pages then it
> certainly will not be removed. If you need assurances that the
> pagecache has been taken down then we'll need something stronger
> in there.

But what is stopping us now from removing a page from the page cache
even while IO is in progress? (Practical issue: the page lock, but
that's a self-fulfilling prophecy.)

--
Daniel
* Re: invalidate_inode_pages in 2.5.32/3
From: Daniel Phillips @ 2002-09-07 10:01 UTC
To: Andrew Morton, Chuck Lever
Cc: Linux Kernel Mailing List

On Saturday 07 September 2002 10:01, Daniel Phillips wrote:
> On Thursday 05 September 2002 20:27, Andrew Morton wrote:
> > But be aware that invalidate_inode_pages has always been best-effort.
> > If someone is reading, or writing one of those pages then it
> > certainly will not be removed. If you need assurances that the
> > pagecache has been taken down then we'll need something stronger
> > in there.
>
> But what is stopping us now from removing a page from the page cache
> even while IO is in progress? (Practical issue: the page lock, but
> that's a self-fulfilling prophecy.)

Never mind, I can see that the main function of the page lock here is
to allow the filesystem to know there's no IO in progress on a block,
so that the block can be recovered and used for something else. Leaving
the page in the page cache during IO is a simple means of keeping track
of this.

All the same, I have deep misgivings about the logic of the vfs truncate
path in general, and given all the trouble it has caused historically,
there's good reason to. I don't know, it may be perfect the way it is,
but history would suggest otherwise.

--
Daniel