* [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Dave Kleikamp @ 2004-10-21 21:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel, linux-mm

zap_pte_range should not mark non-uptodate pages dirty

Doing O_DIRECT writes to an mmapped file caused pages in the page cache to
be marked dirty but not uptodate.  This led to a bug in mpage_writepage.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrea Arcangeli <andrea@novell.com>

diff -urp linux-2.6.9/mm/memory.c linux/mm/memory.c
--- linux-2.6.9/mm/memory.c	2004-10-21 10:49:26.598031488 -0500
+++ linux/mm/memory.c	2004-10-21 16:01:44.902376232 -0500
@@ -414,7 +414,15 @@ static void zap_pte_range(struct mmu_gat
 				    && linear_page_index(details->nonlinear_vma,
 					address+offset) != page->index)
 					set_pte(ptep, pgoff_to_pte(page->index));
-			if (pte_dirty(pte))
+			/*
+			 * PG_uptodate can be cleared by
+			 * invalidate_inode_pages2, so we must not try to write
+			 * not uptodate pages.  Otherwise we risk invalidating
+			 * underlying O_DIRECT writes, and secondly because
+			 * pdflush would BUG().  Coherency of mmaps against
+			 * O_DIRECT still cannot be guaranteed though.
+			 */
+			if (pte_dirty(pte) && PageUptodate(page))
 				set_page_dirty(page);
 			if (pte_young(pte) && !PageAnon(page))
 				mark_page_accessed(page);
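The behaviour the hunk enforces is small enough to model in plain userspace C. Everything below (struct page_model, zap_propagate_dirty) is a hypothetical stand-in for the real kernel structures, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's page state; the real flags
 * live in struct page->flags and in the pte itself. */
struct page_model {
	bool uptodate;	/* PG_uptodate */
	bool dirty;	/* PG_dirty */
};

/* Models the hunk above: propagate the pte dirty bit to the struct page
 * only while the page is still uptodate.  A page whose PG_uptodate was
 * cleared by invalidate_inode_pages2() must not be handed to pdflush. */
static void zap_propagate_dirty(bool pte_dirty, struct page_model *page)
{
	if (pte_dirty && page->uptodate)
		page->dirty = true;	/* set_page_dirty(page) */
	/* else: drop the dirty bit; the on-disk copy written by
	 * O_DIRECT is the authoritative one. */
}
```

The second case is exactly the one that used to trip the BUG(): a page marked dirty while no longer uptodate.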
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrew Morton @ 2004-10-21 21:45 UTC (permalink / raw)
To: Dave Kleikamp; +Cc: andrea, linux-kernel, linux-mm

Dave Kleikamp <shaggy@austin.ibm.com> wrote:
>
> Doing O_DIRECT writes to an mmapped file caused pages in the page cache to
> be marked dirty but not uptodate.  This led to a bug in mpage_writepage.

Methinks this merely reduces the probability of the BUG.

> diff -urp linux-2.6.9/mm/memory.c linux/mm/memory.c
> --- linux-2.6.9/mm/memory.c	2004-10-21 10:49:26.598031488 -0500
> +++ linux/mm/memory.c	2004-10-21 16:01:44.902376232 -0500
> @@ -414,7 +414,15 @@ static void zap_pte_range(struct mmu_gat
>  				    && linear_page_index(details->nonlinear_vma,
>  					address+offset) != page->index)
>  					set_pte(ptep, pgoff_to_pte(page->index));
> -			if (pte_dirty(pte))
> +			/*
> +			 * PG_uptodate can be cleared by
> +			 * invalidate_inode_pages2, so we must not try to write
> +			 * not uptodate pages.  Otherwise we risk invalidating
> +			 * underlying O_DIRECT writes, and secondly because
> +			 * pdflush would BUG().  Coherency of mmaps against
> +			 * O_DIRECT still cannot be guaranteed though.
> +			 */
> +			if (pte_dirty(pte) && PageUptodate(page))

invalidate_inode_pages2 runs ClearPageUptodate() here

>  				set_page_dirty(page);
>  			if (pte_young(pte) && !PageAnon(page))
>  				mark_page_accessed(page);

Maybe we should revisit invalidate_inode_pages2().  It used to be an
invariant that "pages which are mapped into process address space are
always uptodate".  We broke that (good) invariant and we're now seeing
some fallout.  There may be more.
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrea Arcangeli @ 2004-10-21 22:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Kleikamp, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> Maybe we should revisit invalidate_inode_pages2().  It used to be an
> invariant that "pages which are mapped into process address space are
> always uptodate".  We broke that (good) invariant and we're now seeing
> some fallout.  There may be more.

Such an invariant hasn't existed since 2.4.10.  There's no way to make
mmaps reload data from disk without breaking it.  It's not even about
the write side: it's buffered read against O_DIRECT write that requires
breaking the invariant.

Either you fix it the above way, or you remove the BUG() in pdflush and
simply clear the dirty bit without doing anything; both are fine, and
perhaps we should do both.  But the above is good to have anyway, since
it's more efficient not to even show the not-uptodate pages to pdflush.
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrew Morton @ 2004-10-21 23:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > Maybe we should revisit invalidate_inode_pages2().  It used to be an
> > invariant that "pages which are mapped into process address space are
> > always uptodate".  We broke that (good) invariant and we're now seeing
> > some fallout.  There may be more.
>
> Such an invariant hasn't existed since 2.4.10.  There's no way to make
> mmaps reload data from disk without breaking it.

There are at least two ways:

a) Set a new page flag in invalidate, test+clear that at fault time

b) shoot down all pte's mapping the locked page at invalidate time, mark
   the page not uptodate.

The latter is complex but has the advantage of fixing the current
half-assed situation wherein existing mmaps are seeing invalidated data.

> It's not even about
> the write side: it's buffered read against O_DIRECT write that requires
> breaking the invariant.
>
> Either you fix it the above way, or you remove the BUG() in pdflush and
> simply clear the dirty bit without doing anything; both are fine, and
> perhaps we should do both.  But the above is good to have anyway, since
> it's more efficient not to even show the not-uptodate pages to pdflush.

We could just remove the BUG in mpage_writepage() (which I assume is the
one which was being hit), but we might still have a not-uptodate page
with uptodate buffers, and I suspect that the kernel will either go BUG
there instead or will bring the page uptodate again without performing
any I/O.  But I haven't checked that.
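Option (a) can be sketched in userspace. The flag name and helpers below are invented for illustration; no such flag exists in the 2.6.9 tree:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of option (a): a hypothetical "needs reread" page flag that
 * invalidate sets and the fault path tests-and-clears. */
struct page_model {
	bool uptodate;
	bool needs_reread;	/* hypothetical new page flag */
};

static void invalidate_page(struct page_model *page)
{
	page->needs_reread = true;	/* would be set in invalidate_inode_pages2() */
}

/* Returns true when the fault path must go back to disk. */
static bool fault_must_do_io(struct page_model *page)
{
	if (page->needs_reread) {
		page->needs_reread = false;	/* test+clear at fault time */
		return true;
	}
	return !page->uptodate;
}
```

As the follow-up notes, a real implementation would also need a matching buffer_head flag, which is part of why this option was not pursued.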
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrea Arcangeli @ 2004-10-21 23:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@novell.com> wrote:
> >
> > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > Maybe we should revisit invalidate_inode_pages2().  It used to be an
> > > invariant that "pages which are mapped into process address space are
> > > always uptodate".  We broke that (good) invariant and we're now seeing
> > > some fallout.  There may be more.
> >
> > Such an invariant hasn't existed since 2.4.10.  There's no way to make
> > mmaps reload data from disk without breaking it.
>
> There are at least two ways:
>
> a) Set a new page flag in invalidate, test+clear that at fault time

What's the point of adding a new page flag when the invariant
!PageUptodate && page_mapcount(page) already provides the information?
I turned a condition that previously was impossible into another useful
invariant, instead of a BUG_ON invariant.  The BUG itself guarantees us
that nobody was using it for other purposes; in fact,
invalidate_inode_pages2 is what triggered this in the first place.

> b) shoot down all pte's mapping the locked page at invalidate time, mark
>    the page not uptodate.

invalidate should run fast; I didn't enforce coherency or it'd hurt the
O_DIRECT write too much if something is mapped.  We only allow buffered
read against O_DIRECT write to work coherently; mmap coherency has never
been provided, to avoid having to search for vmas in the prio_tree for
every single write to an inode.

> The latter is complex but has the advantage of fixing the current
> half-assed situation wherein existing mmaps are seeing invalidated data.

That's a feature, not a bug, since 2.4.10.  Nobody ever asked for such
coherency; all we provide is read against write or read against read
(write against write only with both writes O_DIRECT or both writes
buffered).  mmaps are ignored by O_DIRECT; mmaps don't crash the kernel
(well, modulo the PageReserved check added in 2.6), but that's all.

> We could just remove the BUG in mpage_writepage() (which I assume is the
> one which was being hit) but we might still have a not-uptodate page with
> uptodate buffers and I suspect that the kernel will either go BUG there
> instead or will bring the page uptodate again without performing any I/O.
> But I haven't checked that.

How can this be related to mmapped pages?  Isn't this only an issue with
invalidate_inode_pages2?  I agree we miss an invalidate of the bh there.
mpage_readpage not using bhs, coupled with the page-size alignment
enforced by the 2.4 O_DIRECT API (unlike 2.6, which uses hard-blocksize
alignment), probably helps a lot in hiding this, I guess.
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrew Morton @ 2004-10-21 23:42 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@novell.com> wrote:
> > >
> > > Such an invariant hasn't existed since 2.4.10.  There's no way to make
> > > mmaps reload data from disk without breaking it.
> >
> > There are at least two ways:
> >
> > a) Set a new page flag in invalidate, test+clear that at fault time
>
> What's the point of adding a new page flag when the invariant
> !PageUptodate && page_mapcount(page) already provides the information?

Step back and think about this.  What earthly sense is there in
permitting userspace access to non-uptodate pages?  None.  It's
completely wrong, and the invariant was a good one.  We broke it by
introducing some kluge to force new I/O when someone does a new fault
against the page.

(A new PG_needs_rereading flag isn't sufficient, btw - we'd also need
BH_Needs_Rereading and associated code.  ug.)

> > b) shoot down all pte's mapping the locked page at invalidate time, mark
> >    the page not uptodate.
>
> invalidate should run fast; I didn't enforce coherency or it'd hurt the
> O_DIRECT write too much if something is mapped.  We only allow buffered
> read against O_DIRECT write to work coherently; mmap coherency has never
> been provided, to avoid having to search for vmas in the prio_tree for
> every single write to an inode.

I don't get it.  invalidate has the pageframe.  All it needs to do is to
lock the page, examine mapcount and, if it's non-zero, do the shootdown.

The only way in which we would be performing the shootdown a significant
number of times would be if someone was repeatedly faulting the thing
back in anyway, and in that case the physical I/O cost would dominate.
Where's the performance overhead??

Plus it makes the currently incorrect code correct for existing mmaps.
Plus it avoids the idiotic situation of having non-uptodate pages
accessible to user processes.
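The cost argument can be modelled in userspace: the shootdown is only paid for when the page is actually mapped, so an unmapped invalidate stays cheap. lock/unlock are reduced to a flag and unmap_page_ptes is an invented stand-in for unmap_mapping_range():

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of option (b): at invalidate time, take the page
 * lock, and only if the page is mapped pay for the pte shootdown. */
struct page_model {
	int mapcount;
	bool uptodate;
	bool locked;
};

static int shootdowns;	/* counts how often the unmap was paid for */

static void unmap_page_ptes(struct page_model *page)
{
	shootdowns++;		/* stands in for unmap_mapping_range() */
	page->mapcount = 0;
}

static void invalidate_one_page(struct page_model *page)
{
	page->locked = true;		/* lock_page() */
	if (page->mapcount > 0)
		unmap_page_ptes(page);	/* shootdown only when mapped */
	page->uptodate = false;		/* ClearPageUptodate() */
	page->locked = false;		/* unlock_page() */
}
```

The expensive branch only runs for mapped pages, which is the point of the "where's the performance overhead??" question above.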
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrew Morton @ 2004-10-22 0:15 UTC (permalink / raw)
To: andrea, shaggy, linux-kernel, linux-mm

Andrew Morton <akpm@osdl.org> wrote:
>
> I don't get it.  invalidate has the pageframe.  All it needs to do is to
> lock the page, examine mapcount and, if it's non-zero, do the shootdown.

unmap_mapping_range() will do that - can call it one page at a time, or
batch up runs of pages.  It's not fast, but presumably not frequent
either.

The bigger problem is shooting down the buffer_heads.  It's certainly
the case that mpage_readpage() will call block_read_full_page(), which
will then bring the page uptodate without performing any I/O.

And invalidating the buffer_heads in invalidate_inode_pages2() is
tricky - we need to enter the filesystem, and I'm not sure that either
->invalidatepage() or ->releasepage() is quite suitable.  For a start,
they're best-effort and may fail.  If we just go and mark the buffers
not uptodate we'll probably give ext3 a heart attack, so careful work
would be needed there.

Let's go back to why we needed all of this.  Was it just for the NFS
something-changed-on-the-server code?  If so, would it be sufficient to
add a new invalidate_inode_pages3() just for NFS, which clears the
uptodate bit?  Or something along those lines?
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Andrea Arcangeli @ 2004-10-22 0:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 05:15:58PM -0700, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > I don't get it.  invalidate has the pageframe.  All it needs to do is to
> > lock the page, examine mapcount and, if it's non-zero, do the shootdown.
>
> unmap_mapping_range() will do that - can call it one page at a time, or
> batch up runs of pages.  It's not fast, but presumably not frequent
> either.

That would shoot down the ptes to add complete coherency to the mmaps,
right.  Still, we could shoot down the ptes after clearing the uptodate
bitflag, allowing the mapped page to be not uptodate for a short while,
since that makes sense and is harmless.  The pte shootdown from my point
of view is just an additional coherency feature, but it cannot provide
full coherency anyway, since the invalidate arrives after the I/O hit
the disk, so the page will be out of sync with the disk if it's dirty,
and no coherency can be provided anyway, because no locking happens, to
get max scalability.

> The bigger problem is shooting down the buffer_heads.  It's certainly
> the case that mpage_readpage() will call block_read_full_page(), which
> will then bring the page uptodate without performing any I/O.

Yes, this is actually the only bug I can see in this whole affair
(besides the BUG that goes away with the patch already posted; that
patch still makes perfect sense to me, since we could use it even for a
more relaxed pte shootdown as described above, and in any case it isn't
worth marking not-uptodate pages as dirty - that is really what makes no
sense and needs fixing).

> And invalidating the buffer_heads in invalidate_inode_pages2() is
> tricky - we need to enter the filesystem, and I'm not sure that either
> ->invalidatepage() or ->releasepage() is quite suitable.  For a start,
> they're best-effort and may fail.  If we just go and mark the buffers
> not uptodate we'll probably give ext3 a heart attack, so careful work
> would be needed there.
>
> Let's go back to why we needed all of this.  Was it just for the NFS
> something-changed-on-the-server code?  If so, would it be sufficient to
> add a new invalidate_inode_pages3() just for NFS, which clears the
> uptodate bit?  Or something along those lines?

NFS is a case here too.  But this is mostly needed for an O_DIRECT write
happening on a file that is mmapped and read in buffered mode at the
same time.  The API totally ignores the mapping, but we must guarantee
that a buffered read sees the written data on disk, and in turn the
uptodate bitflag must be cleared because the page is not uptodate
anymore.  This just describes what has happened on disk, and it tells
_future_ page faults (or buffered read syscalls) that they have to
re-read from disk.  The only issue seems to be the bhs.  Perhaps the bhs
require a new bitflag if the fs risks a heart attack, but the VM can do
the natural thing of clearing the uptodate bitflag, reflecting the fact
that the cache is out of date, since the VM can deal with that just
fine.
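The contract described here - invalidate merely records staleness, and the next buffered read or fault pays for the re-read - can be modelled like this ("disk" is reduced to a single int; all names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Invalidate only clears uptodate; it is the *next* buffered read or
 * page fault that actually goes back to disk. */
struct page_model {
	bool uptodate;
	int data;	/* cached copy of the on-disk contents */
};

static int reads_from_disk;	/* counts real I/O */

static int buffered_read(struct page_model *page, int disk)
{
	if (!page->uptodate) {
		page->data = disk;	/* re-read from disk */
		page->uptodate = true;
		reads_from_disk++;
	}
	return page->data;
}
```

Only the first read after the invalidate performs I/O; subsequent reads are served from the (now uptodate again) cached page.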
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty @ 2004-10-22 0:41 ` Andrea Arcangeli 0 siblings, 0 replies; 44+ messages in thread From: Andrea Arcangeli @ 2004-10-22 0:41 UTC (permalink / raw) To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm On Thu, Oct 21, 2004 at 05:15:58PM -0700, Andrew Morton wrote: > Andrew Morton <akpm@osdl.org> wrote: > > > > I don't get it. invalidate has the pageframe. All it need to do is to > > lock the page, examine mapcount and if it's non-zero, do the shootdown. > > unmap_mapping_range() will do that - can call it one page at a time, or > batch up runs of pages. It's not fast, but presumably not frequent either. That would shootdown the ptes to add completely coherency to the mmaps, right. Still we could shootdown the ptes after clearing the uptodate bitflag, allowing the mapped page to be not uptodate for a short while, since it makes sense and it's harmless. The pte shootdown from my point of view is just an additional coherency feature, but it cannot provide full coherency anyways, since the invalidate arrives after the I/O hit the disk, so the page will be out of sync with the disk if it's dirty, and no coherency can be provided anyways, because no locking happens to get max scalability. > The bigger problem is shooting down the buffer_heads. It's certainly the > case that mpage_readpage() will call block_read_full_page() which will then > bring the page uptodate without performing any I/O. yes, this is actually the only bug I can see in this whole affair (besdies the BUG that goes away with the patch already posted, and that patch still makes perfect sense to me since we could use it even for a more relaxed pte shootdown as described above, plus it doesn't worth to mark not-uptodate pages as dirty, that is really what makes no sense and needs fixing). 
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22  0:41 ` Andrea Arcangeli
@ 2004-10-22  2:51 ` Rik van Riel
  -1 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2004-10-22  2:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, shaggy, linux-kernel, linux-mm

On Fri, 22 Oct 2004, Andrea Arcangeli wrote:
> The pte shootdown from my point of view is just an additional coherency
> feature, but it cannot provide full coherency anyways, since the
> invalidate arrives after the I/O hit the disk, so the page will be out
> of sync with the disk if it's dirty, and no coherency can be provided
> anyways, because no locking happens to get max scalability.

That depends on the filesystem.  I hope the clustered filesystems
will be able to provide full coherency by doing the invalidates in
the right order.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22  2:51 ` Rik van Riel
@ 2004-10-22 16:19 ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2004-10-22 16:19 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, shaggy, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 10:51:34PM -0400, Rik van Riel wrote:
> That depends on the filesystem.  I hope the clustered filesystems

I agree that if you do an "is_underlying_fs_GFS?" check you can make
more assumptions. But if you don't do that, the Linux API has always
left the mmapped contents undefined after O_DIRECT writes to the
mmapped data.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-21 23:42 ` Andrew Morton
@ 2004-10-22  0:30 ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2004-10-22  0:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 04:42:45PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@novell.com> wrote:
> >
> > On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@novell.com> wrote:
> > > >
> > > > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > > > Maybe we should revisit invalidate_inode_pages2().  It used to be an
> > > > > invariant that "pages which are mapped into process address space are
> > > > > always uptodate".  We broke that (good) invariant and we're now seeing
> > > > > some fallout.  There may be more.
> > > >
> > > > such invariant doesn't exists since 2.4.10. There's no way to get mmaps
> > > > reload data from disk without breaking such an invariant.
> > >
> > > There are at least two ways:
> > >
> > > a) Set a new page flag in invalidate, test+clear that at fault time
> >
> > What's the point of adding a new page flag when the invariant
> > !PageUptodate && page_mapcount(page) already provides the information?
>
> Step back and think about this.  What earthly sense is there in permitting
> userspace access to non uptodate pages?

This is exactly why new page faults re-read from disk. Instantiating
not-uptodate pages is definitely a mistake. But after the pte is
instantiated the uptodate information becomes pointless. It's up to the
page fault to make sure the page is uptodate before mapping it into
userspace.

> None.  It's completely wrong and the invariant was a good one.  We broke it
> by introducing some kluge to force new I/O when someone does a new fault
> against the page.

The invariant that is important is that the page fault must never map
not-uptodate pages into userspace. That invariant is still obeyed; it's
just that after the page is mapped we start making good use of the
uptodate bit again.

> (A new PG_needs_rereading flag isn't sufficient btw - we'd also need
> BH_Needs_Rereading and associated code.  ug.)

clearing the uptodate bits for the bhs too would fix it. Anyway, I doubt
in practice the lack of bh clearing could ever trigger, thanks to
mpage_readpages and the page-aligned API. Perhaps the more relaxed 2.6
API could expose the problem, and that's why we need to address it in
2.6.

> I don't get it.  invalidate has the pageframe.  All it need to do is to
> lock the page, examine mapcount and if it's non-zero, do the shootdown.
> The only way in which we would be performing the shootdown a significant
> number of times would be if someone was repeatedly faulting the thing back
> in anyway, and in that case the physical I/O cost would dominate.  Where's
> the performance overhead??
>
> Plus it makes the currently incorrect code correct for existing mmaps.
>
> Plus it avoids the idiotic situation of having non uptodate pages
> accessible to user processes.

Perhaps we can implement it for 2.6, but for 2.4 it was not doable, and
clearing PG_uptodate in 2.4 is what made O_DIRECT possible at all, so I
wouldn't call it an idiotic situation. It was a big new feature and it
still is, since your new bitflag is not needed: what the VM asks for is
to clear the uptodate bit because the page in the pagecache is not
uptodate anymore. What happens is that the disk has changed under you,
so it's idiotic to set another bitflag and leave the uptodate bit set
when the page is obviously not uptodate anymore.

If you want to shootdown ptes before clearing the bitflag, that's fine
with me, but you will still have to clear the uptodate bitflag on the
page after that, and mmap coherency has never been provided by O_DIRECT
in linux - this would be a new feature, certainly not a bugfix. The mmap
case is a controlled race by design that can cause no harm.

btw, nfs is doing it too in some 2.4 tree; we've got complaints that
mmaps weren't refreshed. If you change invalidate_inode_pages2 to
shootdown ptes then they'll get much better refreshing for nfs too, but
again, that's not doable in the 2.4 backport: in 2.4 all I could do was
clear uptodate on mapped pages, since after the page is mapped the
uptodate bit becomes useless and we can re-use it. Adding a new bitflag
wouldn't change a thing, except to make it more complicated than it
already is.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22  0:30 ` Andrea Arcangeli
@ 2004-10-22  1:22 ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2004-10-22  1:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> If you want to shootdown ptes before clearing the bitflag, that's fine

small correction: s/before/after/. Doing the pte shootdown before
clearing the uptodate bitflag would still not guarantee reading uptodate
data after the invalidate (a minor page fault could still happen between
the shootdown and the clear_bit; while after clearing the uptodate bit,
a major fault hitting the disk and refreshing the pagecache contents is
guaranteed - modulo bhs; well, at least nfs is surely ok in that
respect ;).

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22  1:22 ` Andrea Arcangeli
@ 2004-10-22  2:03 ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2004-10-22  2:03 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> > If you want to shootdown ptes before clearing the bitflag, that's fine
>
> small correction s/before/after/. doing the pte shootdown before
> clearing the uptodate bitflag, would still not guarantee to read
> uptodate data after the invalidate (a minor page fault could still
> happen between the shootdown and the clear_bit; while after clearing the
> uptodate bit a major fault hitting the disk and refreshing the pagecache
> contents will be guaranteed - modulo bhs, well at least nfs is sure ok
> in that respect ;).

Yeah.  I think the only 100% reliable way to do this is:

	lock_page()
	unmap_mapping_range(page)
	ClearPageUptodate(page);
	invalidate(page);	/* Try to free the thing */
	unlock_page(page);

(Can do it for a whole range of pages if we always agree to lock pages in
file-index-ascending order).

but hrm, we don't even have the locking there to prevent do_no_page() from
reinstantiating the pte immediately after the unmap_mapping_range().

So what to do?

- The patch you originally sent has a race window which can be made
  nonfatal by removing the BUGcheck in mpage_writepage().

- NFS should probably be taught to use unmap_mapping_range() regardless
  of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
  up.

- invalidate_inode_pages2() isn't successfully invalidating the page if
  it has buffers.  This is the biggest problem, because the pages can
  trivially have buffers - just write()ing to them will attach buffers,
  if they're ext2- or ext3-backed.

  It means that a direct-IO write to a section of a file which is mmapped
  causes pagecache and disk to get out of sync.  Question is: do we care,
  or do we say "don't do that"?  Nobody seems to have noticed thus far and
  it's a pretty dopey thing to be doing.

If we _do_ want to fix it, it seems the simplest approach would be to
nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
raced - try again" loop in there to handle the "do_no_page()
reinstantiated a pte" race.  Fun.

Something like this?  Slow as a dog, but possibly correct ;)

void invalidate_inode_pages2(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t next = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			lock_page(page);
			if (page->mapping == mapping) {	/* truncate race? */
				wait_on_page_writeback(page);
				next = page->index + 1;
				while (page_mapped(page)) {
					unmap_mapping_range(mapping,
						page->index << PAGE_CACHE_SHIFT,
						page->index << PAGE_CACHE_SHIFT+1,
						0);
					clear_page_dirty(page);
				}
				invalidate_complete_page(mapping, page);
			}
			unlock_page(page);
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}

The unmapping from pagetables means that invalidate_complete_page() will
generally successfully remove the page from pagecache.  That mostly fixes
the direct-write-of-mmapped-data coherency problem: the pages simply aren't
in pagecache any more so we'll surely redo physical I/O.

But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
it really sucks that we're offering behaviour which only works most of the
time :(

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22  2:03 ` Andrew Morton
@ 2004-10-22 16:17 ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2004-10-22 16:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> Yeah.  I think the only 100% reliable way to do this is:
>
> 	lock_page()
> 	unmap_mapping_range(page)
> 	ClearPageUptodate(page);
> 	invalidate(page);	/* Try to free the thing */
> 	unlock_page(page);
>
> (Can do it for a whole range of pages if we always agree to lock pages in
> file-index-ascending order).
>
> but hrm, we don't even have the locking there to prevent do_no_page() from
> reinstantiating the pte immediately after the unmap_mapping_range().
>
> So what to do?
>
> - The patch you originally sent has a race window which can be made
>   nonfatal by removing the BUGcheck in mpage_writepage().
>
> - NFS should probably be taught to use unmap_mapping_range() regardless
>   of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
>   up.
>
> - invalidate_inode_pages2() isn't successfully invalidating the page if
>   it has buffers.

I did it right in 2.4; somebody obviously broke the thing in 2.6. Now
w.r.t. ext2/ext3, I never heard of a problem, and we nuke buffers
regardless of whether the page is mapped or not (so this isn't a corner
case).

	#define block_invalidate_page(page)	discard_bh_page(page, 0, 0)

		} else {
			if (page->buffers) {
				/* Restart after this page */
				list_del(head);
				list_add_tail(head, curr);
				page_cache_get(page);
				spin_unlock(&pagecache_lock);
				block_invalidate_page(page);
			} else
				unlocked = 0;
			ClearPageDirty(page);

The guarantee is total in 2.4 w.r.t. buffers; at worst it can give ext3
a heart attack, but it won't risk losing coherency. Only 2.6 broke the
O_DIRECT vs buffered I/O coherency protocol. The semantics of
not-uptodate pages mapped in the address space are likewise well
understood and working fine in 2.4.

> This is the biggest problem, because the pages can trivially have
> buffers - just write()ing to them will attach buffers, if they're ext2-
> or ext3-backed.
>
> It means that a direct-IO write to a section of a file which is mmapped
> causes pagecache and disk to get out of sync.  Question is: do we care,
> or do we say "don't do that"?  Nobody seems to have noticed thus far and
> it's a pretty dopey thing to be doing.

this is a supported feature. People are writing with O_DIRECT and
reading with buffered reads, and we must support that (think of a
database writing and a tar.gz reading). The coherency is a must after
the O_DIRECT write has completed. If people run the operations at the
same time the result is undefined. But when they serialize the I/O at
the userspace level (like stop_db(); tar.gz) we must provide guarantees
to them.

> If we _do_ want to fix it, it seems the simplest approach would be to
> nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
> raced - try again" loop in there to handle the "do_no_page()
> reinstantiated a pte" race.

do_no_page doesn't need to be coherent. The pte is the last thing we can
care about. The API only guarantees read/write syscalls to be coherent,
not mmap. The only bug is the bh one, and it has to be fixed as easily
as I did it in 2.4.

> Something like this?  Slow as a dog, but possibly correct ;)
>
> void invalidate_inode_pages2(struct address_space *mapping)
> {
> 	struct pagevec pvec;
> 	pgoff_t next = 0;
> 	int i;
>
> 	pagevec_init(&pvec, 0);
> 	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> 		for (i = 0; i < pagevec_count(&pvec); i++) {
> 			struct page *page = pvec.pages[i];
>
> 			lock_page(page);
> 			if (page->mapping == mapping) {	/* truncate race? */
> 				wait_on_page_writeback(page);
> 				next = page->index + 1;
> 				while (page_mapped(page)) {
> 					unmap_mapping_range(mapping,
> 						page->index << PAGE_CACHE_SHIFT,
> 						page->index << PAGE_CACHE_SHIFT+1,
> 						0);
> 					clear_page_dirty(page);
> 				}
> 				invalidate_complete_page(mapping, page);
> 			}
> 			unlock_page(page);
> 		}
> 		pagevec_release(&pvec);
> 		cond_resched();
> 	}
> }

You know, the page_mapped path isn't really very important as long as we
execute both clear_page_dirty and ClearPageUptodate. We must provide the
guarantee to the syscall. Leaving not-uptodate pages in the ptes is
fine; the VM can deal with that. The semantics of that condition are
that the page changed on disk due to O_DIRECT from under us, so the page
isn't uptodate anymore. If anything, we can argue about ext3 not being
capable of dealing with PG_uptodate going away from under it, but the VM
is capable of that; there is only one spot that must be aware of it (the
zap fix we posted to start the thread).

BTW, we have these fixes in our tree to make it work:

--- sles/mm/truncate.c.~1~	2004-05-18 19:24:40.000000000 +0200
+++ sles/mm/truncate.c	2004-05-19 02:09:28.311781864 +0200
@@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
 		if (page->mapping == mapping) {	/* truncate race? */
 			wait_on_page_writeback(page);
 			next = page->index + 1;
-			if (page_mapped(page))
+			if (page_mapped(page)) {
+				ClearPageUptodate(page);
 				clear_page_dirty(page);
-			else
+			} else
 				invalidate_complete_page(mapping, page);
 		}
 		unlock_page(page);

and then this fix for BLKFLSBUF:

diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
--- linux-2.6.8/mm/truncate.c	2004-09-21 15:39:37.850379202 +0200
+++ linux-2.6.8.SUSE/mm/truncate.c	2004-09-21 15:39:54.733668309 +0200
@@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
 		write_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
+
+	BUG_ON(PagePrivate(page));
+	if (page_count(page) != 2) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
@@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
 				clear_page_dirty(page);
 				ClearPageUptodate(page);
 			} else {
-				invalidate_complete_page(mapping, page);
+				if (!invalidate_complete_page(mapping,
+							      page)) {
+					clear_page_dirty(page);
+					ClearPageUptodate(page);
+				}
 			}
 		}
 		unlock_page(page);

apparently those weren't submitted to mainline yet, which could have
created some confusion about why we submitted the third patch that
avoids the BUG in pdflush and avoids trying to write to disk a page that
is not uptodate anymore. Do you see why it's wrong to pass down to
pdflush a page that isn't uptodate? That page must be re-read; it
certainly must not be written out. The semantics of mapped pages that
are not uptodate are very well defined and they pose no risk to the VM
(modulo the bug in pdflush and this clear improvement in zap).

> The unmapping from pagetables means that invalidate_complete_page() will
> generally successfully remove the page from pagecache.  That mostly fixes
> the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> in pagecache any more so we'll surely redo physical I/O.

the direct-write-of-mmapped-data coherency is the last thing we can care
about. Even before thinking about direct-write-of-mmapped-data we must
fix direct-write-of-cached-data (assuming the cache is not mmapped).
That is a showstopper bug; the direct-write-of-mmapped-data case is
"undefined" by the Linux API. Garbage can end up in the pagecache as far
as the kernel is concerned (all we guarantee is that no random page
contents are delivered to userspace).

> But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> it really sucks that we're offering behaviour which only works most of the
> time :(

That doesn't need to ever work; I don't think we even need to add a pte
shootdown - that behaviour has never been provided, by design. All we
need to do is fix the bh layer and fix up the spots in zap to stop
bugging out on dirty not-uptodate pages, plus the above patches in this
email; then it'll work as well as 2.4 and it will avoid breaking apps.

Note that this is a very serious matter: what happens is that oracle
writes with O_DIRECT, and when you do a tar.gz of the database the last
block may be corrupt if it's not 512-byte aligned - if it was a partial
page and it had partial bhs. mysql uses O_DIRECT too.

I'm not particularly concerned about ext3; that can be fixed too, or we
can add a new bitflag as you suggested, to avoid fixing the filesystems
to deal with that. But for the VM I'm comfortable that clearing the
uptodate bit is much simpler and poses no risk to the VM. For the fs you
may very well be right that the BH_reread_from_disk flag might be needed
instead.

So let's forget the mmapped case and let's fix the bh screwup in 2.6.
Then we can think about the mmapped case. And to fix the mmapped case
IMHO all we have to do is:

	clear_page_dirty
	ClearPageUptodate
	unmap_page_range

Not the other way around. The uptodate bit must be cleared before
zapping the pte to force a reload.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22 16:17 ` Andrea Arcangeli
@ 2004-10-22 17:04 ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2004-10-22 17:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm

On Fri, Oct 22, 2004 at 06:17:44PM +0200, Andrea Arcangeli wrote:
> So let's forget the mmapped case and let's fix the bh screwup in 2.6.

I propose a solution to fix the coherency problem that afflicts 2.6 when
invalidate_complete_page fails.  If the page is still PagePrivate even
though we called try_to_release_page (which means clearing the uptodate
bitflag wouldn't help), we should simply return an error to the O_DIRECT
writes.  That way the error would not be overlooked by the database doing
the writing.  This is a minimum guarantee we have to provide: if we fail
to invalidate, make the O_DIRECT write fail.

Of course we should return write failures in the mmapped case as well, but
let's ignore the mmapped case for now.  The real showstopper bug is
try_to_release_page failing and preventing the coherency protocol from
working even without mmaps.  This could never happen in 2.4; in 2.4, at
worst, ext3 would get a heart attack, which is much better than silent
corruption in the backup.

^ permalink raw reply	[flat|nested] 44+ messages in thread
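[Editor's note: a userspace sketch of the proposal in the message above, with toy_* helpers standing in for the real kernel functions. It only models the error propagation: if buffer_heads cannot be released, invalidation fails and the failure reaches the O_DIRECT writer as -EIO instead of being silently dropped.]

```c
#include <assert.h>
#include <stdbool.h>

#define EIO 5	/* same numeric value as the kernel's -EIO */

/* still_private models PagePrivate() remaining set even after
 * try_to_release_page() has run (e.g. ext3 refusing to drop bhs). */
struct toy_page {
	bool still_private;
};

/* Model of invalidate_complete_page(): non-zero on success. */
static int toy_invalidate(struct toy_page *p)
{
	if (p->still_private)
		return 0;	/* bhs could not be released: failure */
	return 1;
}

/* Model of the O_DIRECT write completion path: an invalidation
 * failure overrides the byte count, so the database sees the error
 * rather than a silently incoherent pagecache. */
static long toy_odirect_write(struct toy_page *pages, int n, long written)
{
	for (int i = 0; i < n; i++)
		if (!toy_invalidate(&pages[i]))
			return -EIO;
	return written;
}
```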
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
  2004-10-22 16:17 ` Andrea Arcangeli
@ 2004-10-22 23:24 ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2004-10-22 23:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> > Yeah.  I think the only 100% reliable way to do this is:
> >
> > 	lock_page()
> > 	unmap_mapping_range(page)
> > 	ClearPageUptodate(page);
> > 	invalidate(page);	/* Try to free the thing */
> > 	unlock_page(page);
> >
> > (Can do it for a whole range of pages if we always agree to lock pages in
> > file-index-ascending order).
> >
> > but hrm, we don't even have the locking there to prevent do_no_page() from
> > reinstantiating the pte immediately after the unmap_mapping_range().
> >
> > So what to do?
> >
> > - The patch you originally sent has a race window which can be made
> >   nonfatal by removing the BUGcheck in mpage_writepage().
> >
> > - NFS should probably be taught to use unmap_mapping_range() regardless
> >   of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
> >   up.
> >
> > - invalidate_inode_pages2() isn't successfully invalidating the page if
> >   it has buffers.
>
> I did it right in 2.4; somebody obviously broke the thing in 2.6.  Now
> w.r.t. ext2/ext3 I never heard of a problem, and we nuke buffers
> regardless of whether the page is mapped or not (so this isn't a corner
> case).
>
> 	#define block_invalidate_page(page) discard_bh_page(page, 0, 0)
>
> 		} else {
> 			if (page->buffers) {
> 				/* Restart after this page */
> 				list_del(head);
> 				list_add_tail(head, curr);
>
> 				page_cache_get(page);
> 				spin_unlock(&pagecache_lock);
> 				block_invalidate_page(page);
> 			} else
> 				unlocked = 0;
>
> 			ClearPageDirty(page);

This is incorrect for ext3 - we have to go through the a_ops to invalidate
the buffer_heads.
That's bad news for my (unmerged) 2.4 direct-io-for-ext3 patch.

> The guarantee is total in 2.4 w.r.t. buffers; at worst it can give ext3 a
> heart attack, but it won't risk losing coherency.  Only 2.6 broke the
> O_DIRECT vs buffered I/O coherency protocol.

Only for mmapped pages.  I believe that normal pagecache-vs-directIO is OK
in 2.6.

> The semantics of not uptodate pages mapped in the address space are as
> well understood and working fine in 2.4.

I'd prefer to continue to disallow the mapping of nonuptodate pages into
process address space.  It's indistinguishable from a security hole and I
think we can indeed preserve this invariant while fixing the mmap coherency
problem.

> > This is the biggest problem, because the pages can trivially have
> > buffers - just write()ing to them will attach buffers, if they're ext2-
> > or ext3-backed.
> >
> > It means that a direct-IO write to a section of a file which is mmapped
> > causes pagecache and disk to get out of sync.  Question is: do we care,
> > or do we say "don't do that"?  Nobody seems to have noticed thus far and
> > it's a pretty dopey thing to be doing.
>
> This is a supported feature.  People are writing with O_DIRECT and reading
> with buffered reads and we must support that (think of a database writing
> and a tar.gz reading).  The coherency is a must after the O_DIRECT has
> completed.  If people run the operations at the same time the result is
> undefined.  But by serializing the I/O at the userspace level (like
> stop_db(); tar.gz) we must provide guarantees to them.

If the page is not mapped into pagetables then we'll invalidate it OK even
if it's backed by ext3.

The only hole I can see is if someone comes in and dirties the page via
write() _after_ generic_file_direct_IO() has done its
filemap_write_and_wait().  And I agree that in that case we should
propagate the invalidate_complete_page() failure back to the direct-io
write() caller.  I wonder how - via a short write, or via -EFOO?
Probably we should sync all the ptes prior to a direct-io read as well.

> > If we _do_ want to fix it, it seems the simplest approach would be to
> > nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
> > raced - try again" loop in there to handle the "do_no_page()
> > reinstantiated a pte" race.
>
> do_no_page doesn't need to be coherent.  The pte is the last thing we can
> care about.  The API only guarantees read/write syscalls to be coherent,
> mmap not.

I want to fix the mmap case too.  We currently have a half-fix for NFS
which doesn't work at all for (say) ext3.

> the only bug is the bh and it has to be fixed as easily as I did in 2.4.

No, we need to call into the filesystem to kill the buffer_heads.

> > Something like this?  Slow as a dog, but possibly correct ;)
> >
> > void invalidate_inode_pages2(struct address_space *mapping)
> > {
> > 	struct pagevec pvec;
> > 	pgoff_t next = 0;
> > 	int i;
> >
> > 	pagevec_init(&pvec, 0);
> > 	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > 		for (i = 0; i < pagevec_count(&pvec); i++) {
> > 			struct page *page = pvec.pages[i];
> >
> > 			lock_page(page);
> > 			if (page->mapping == mapping) {	/* truncate race? */
> > 				wait_on_page_writeback(page);
> > 				next = page->index + 1;
> > 				while (page_mapped(page)) {
> > 					unmap_mapping_range(mapping,
> > 					  page->index << PAGE_CACHE_SHIFT,
> > 					  page->index << PAGE_CACHE_SHIFT+1,
> > 					  0);
> > 					clear_page_dirty(page);
> > 				}
> > 				invalidate_complete_page(mapping, page);
> > 			}
> > 			unlock_page(page);
> > 		}
> > 		pagevec_release(&pvec);
> > 		cond_resched();
> > 	}
> > }
>
> You know the page_mapped path isn't really very important as far as we
> execute both clear_page_dirty and ClearPageUptodate.

It's important for NFS - it will allow mmaps to see server-side updates.

> We must provide the guarantee to the syscall.  Leaving not uptodate pages
> in the ptes is fine, the VM can deal with that.
> The semantics of that condition are that the page changed on disk due to
> O_DIRECT from under us, so the page isn't uptodate anymore.  If anything,
> we can argue about ext3 not being capable of dealing with PG_uptodate
> going away from under it, but the VM is capable of that; there is only one
> spot that must be aware of it (the zap fix we posted to start the thread).
>
> BTW, we have these fixes in our tree to make it work:
>
> --- sles/mm/truncate.c.~1~	2004-05-18 19:24:40.000000000 +0200
> +++ sles/mm/truncate.c	2004-05-19 02:09:28.311781864 +0200
> @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
>  		if (page->mapping == mapping) {	/* truncate race? */
>  			wait_on_page_writeback(page);
>  			next = page->index + 1;
> -			if (page_mapped(page))
> +			if (page_mapped(page)) {
> +				ClearPageUptodate(page);
>  				clear_page_dirty(page);
> -			else
> +			} else
>  				invalidate_complete_page(mapping, page);
>  		}
>  		unlock_page(page);

That's in 2.6.9.

> and then this fix for BLKFLSBUF:
>
> diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
> --- linux-2.6.8/mm/truncate.c	2004-09-21 15:39:37.850379202 +0200
> +++ linux-2.6.8.SUSE/mm/truncate.c	2004-09-21 15:39:54.733668309 +0200
> @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
>  		write_unlock_irq(&mapping->tree_lock);
>  		return 0;
>  	}
> +
> +	BUG_ON(PagePrivate(page));
> +	if (page_count(page) != 2) {
> +		write_unlock_irq(&mapping->tree_lock);
> +		return 0;
> +	}
>  	__remove_from_page_cache(page);
>  	write_unlock_irq(&mapping->tree_lock);
>  	ClearPageUptodate(page);
> @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
>  			clear_page_dirty(page);
>  			ClearPageUptodate(page);
>  		} else {
> -			invalidate_complete_page(mapping, page);
> +			if (!invalidate_complete_page(mapping,
> +						      page)) {
> +				clear_page_dirty(page);
> +				ClearPageUptodate(page);
> +			}
>  		}
>  	}
>  	unlock_page(page);

That's in 2.6.9 as well.
I need to think about this a bit more - the first hunk is for read() versus
BLKFLSBUF (if someone else is playing with the page, leave it alone)
whereas the second hunk is kinda buggy, because it will reintroduce the
EIO-in-read() error which the first hunk tried to fix.

(You're using an rwlock for tree_lock.  It's still unclear to me whether we
should go that way - it helps big SMP and hurts small SMP.  Did you perform
an evaluation of this?)

> Apparently those weren't submitted to mainline yet; that could have
> created some confusion about why we submitted the third patch, which
> avoids the BUG in pdflush and avoids trying to write to disk a page that
> is no longer uptodate.  Do you see why it's wrong to pass down to pdflush
> a page that isn't uptodate?

Sure.  That's why there's a BUG() there ;)

> That page must be re-read; it certainly must not be written out.  The
> semantics of mapped not-uptodate pages are very well defined and they
> pose no risk to the VM (modulo the bug in pdflush and this clear
> improvement in zap).
>
> > The unmapping from pagetables means that invalidate_complete_page() will
> > generally successfully remove the page from pagecache.  That mostly fixes
> > the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> > in pagecache any more so we'll surely redo physical I/O.
>
> The direct-write-of-mmapped-data coherency is the last thing we need to
> care about.

But we can fix it.

> Even before thinking about direct-write-of-mmapped-data we must fix
> direct-write-of-cached-data (assuming the cache is not mmapped).  That is
> a showstopper bug,

I don't think we have a bug here.  After the sync(), ext3_releasepage()
will be able to release the buffers.  But we should check for failures.

The problem we're seeing is with mmapped pages.  In that case, clearing
PG_uptodate is insufficient and directly invalidating the buffer_heads
will probably kill ext3 and we're not allowed to assume that page->private
contains bh's anyway...
Probably it is sufficient to run try_to_release_page() in the page_mapped
leg of invalidate_inode_pages2(), and to check the return value.

> direct-write-of-mmapped-data is "undefined" by the Linux API.  Garbage can
> end up in the pagecache as far as the kernel is concerned (all we
> guarantee is that no random page contents are delivered to userspace).
>
> > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> > it really sucks that we're offering behaviour which only works most of the
> > time :(
>
> That doesn't need to ever work; I don't think we even need to add a pte
> shootdown, the behaviour has never been provided by design.  All we need
> to do is fix the bh layer, and fix up the spots in zap to stop bugging
> out on dirty not-uptodate pages, plus the above patches in this email;
> then it'll work as well as 2.4 and it will avoid breaking apps.  Note
> that this is a very serious matter: what happens is that Oracle writes
> with O_DIRECT, and when you do a tar.gz of the database the last block
> may be corrupt if it's not 512-byte aligned, if it was a partial page and
> it had partial bh.  MySQL uses O_DIRECT too.

The current buffered-vs-direct handling looks OK, as long as the pages
aren't mmapped, and as long as nobody dirties a page after
generic_file_direct_IO() has done filemap_write_and_wait().  For the former
I'd like to unmap the pages.  For the latter we should check the
invalidate_complete_page() return value.

> I'm not particularly concerned about ext3; that can be fixed too, or we
> can add a new bitflag as you suggested to avoid fixing the filesystems to
> deal with that.  But for the VM I'm comfortable that clearing the
> uptodate bit is much simpler and it poses no risk to the VM.  For the fs
> you may very well be right that the BH_reread_from_disk might be needed
> instead.
>
> So let's forget the mmapped case and let's fix the bh screwup in 2.6.
> Then we can think about the mmapped case.
The "bh screwup" _is_ the mmapped case.  Maybe I said something different
yesterday - it's been a while since I looked in there.

> And to fix the mmapped case IMHO all we have to do is:
>
> 	clear_page_dirty
> 	ClearPageUptodate
> 	unmap_page_range
>
> Not the other way around.  The uptodate bit must be cleared before zapping
> the pte to force a reload.

Something like this...

- When invalidating pages, take care to shoot down any ptes which map them
  as well.  This ensures that the next mmap access to the page will
  generate a major fault, so NFS's server-side modifications are picked up.

  This also allows us to call invalidate_complete_page() on all pages, so
  filesystems such as ext3 get a chance to invalidate the buffer_heads.

- Don't mark in-pagetable pages as non-uptodate any more.  That broke a
  previous guarantee that mapped-into-user-process pages are always
  uptodate.

- Check the return value of invalidate_complete_page().  It can fail if
  someone redirties a page after generic_file_direct_IO() wrote it back.

But we still have a problem.  If invalidate_inode_pages2() calls
unmap_mapping_range(), that can cause zap_pte_range() to dirty the
pagecache pages.  That will redirty the page's buffers and will cause
invalidate_complete_page() to fail.

So, in generic_file_direct_IO() we do a complete pte shootdown on the file
up-front, prior to writing back dirty pagecache.  This is only done for
O_DIRECT writes.  It _could_ be done for O_DIRECT reads too, providing full
mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes,
but permitting the pte shootdown on O_DIRECT reads trivially allows people
to nuke other people's mapped pagecache.

NFS also uses invalidate_inode_pages2() for handling server-side
modification notifications.  But in the NFS case the clear_page_dirty() in
invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
about the "dirty buffers against a clean page" problem.
(I think)

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/fs.h |    2 -
 25-akpm/mm/filemap.c       |   19 ++++++++++---
 25-akpm/mm/truncate.c      |   63 +++++++++++++++++++++++++++------------------
 3 files changed, 55 insertions(+), 29 deletions(-)

diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c
--- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:40:26 2004
+++ 25-akpm/mm/truncate.c	Fri Oct 22 16:05:26 2004
@@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp
  * be marked dirty at any time too.  So we re-check the dirtiness inside
  * ->tree_lock.  That provides exclusion against the __set_page_dirty
  * functions.
+ *
+ * Returns non-zero if the page was successfully invalidated.
  */
 static int
 invalidate_complete_page(struct address_space *mapping, struct page *page)
@@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str
 EXPORT_SYMBOL(invalidate_inode_pages);
 
 /**
- * invalidate_inode_pages2 - remove all unmapped pages from an address_space
+ * invalidate_inode_pages2 - remove all pages from an address_space
  * @mapping - the address_space
  *
- * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case
- * where the page is seen to be mapped into process pagetables.  In that case,
- * the page is marked clean but is left attached to its address_space.
- *
- * The page is also marked not uptodate so that a subsequent pagefault will
- * perform I/O to bring the page's contents back into sync with its backing
- * store.
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
  *
- * FIXME: invalidate_inode_pages2() is probably trivially livelockable.
+ * Returns -EIO if any pages could not be invalidated.
  */
-void invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2(struct address_space *mapping)
 {
 	struct pagevec pvec;
 	pgoff_t next = 0;
 	int i;
+	int ret = 0;
+	int did_full_unmap = 0;
 
 	pagevec_init(&pvec, 0);
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
-		for (i = 0; i < pagevec_count(&pvec); i++) {
+	while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
-			if (page->mapping == mapping) {	/* truncate race? */
-				wait_on_page_writeback(page);
-				next = page->index + 1;
-				if (page_mapped(page)) {
-					clear_page_dirty(page);
-					ClearPageUptodate(page);
+			if (page->mapping != mapping) {	/* truncate race? */
+				unlock_page(page);
+				continue;
+			}
+			wait_on_page_writeback(page);
+			next = page->index + 1;
+			while (page_mapped(page)) {
+				if (!did_full_unmap) {
+					/*
+					 * Zap the rest of the file in one hit.
+					 * FIXME: invalidate_inode_pages2()
+					 * should take start/end offsets.
+					 */
+					unmap_mapping_range(mapping,
+						page->index << PAGE_CACHE_SHIFT,
+						-1, 0);
+					did_full_unmap = 1;
 				} else {
-					if (!invalidate_complete_page(mapping,
-								page)) {
-						clear_page_dirty(page);
-						ClearPageUptodate(page);
-					}
+					/*
+					 * Just zap this page
+					 */
+					unmap_mapping_range(mapping,
+					  page->index << PAGE_CACHE_SHIFT,
+					  (page->index << PAGE_CACHE_SHIFT)+1,
+					  0);
 				}
 			}
+			clear_page_dirty(page);
+			if (!invalidate_complete_page(mapping, page))
+				ret = -EIO;
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+	return ret;
 }
-
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h
--- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:54:50 2004
+++ 25-akpm/include/linux/fs.h	Fri Oct 22 15:54:58 2004
@@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino
 	    S_ISLNK(inode->i_mode))
 		invalidate_inode_pages(inode->i_mapping);
 }
-extern void invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2(struct address_space *mapping);
 extern void write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);

diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c
--- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:55:06 2004
+++ 25-akpm/mm/filemap.c	Fri Oct 22 16:17:19 2004
@@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file 
 EXPORT_SYMBOL(generic_file_writev);
 
 /*
- * Called under i_sem for writes to S_ISREG files
+ * Called under i_sem for writes to S_ISREG files.  Returns -EIO if something
+ * went wrong during pagecache shootdown.
  */
 ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
@@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki
 	struct address_space *mapping = file->f_mapping;
 	ssize_t retval;
 
+	/*
+	 * If it's a write, unmap all mmappings of the file up-front.  This
+	 * will cause any pte dirty bits to be propagated into the pageframes
+	 * for the subsequent filemap_write_and_wait().
+	 */
+	if (rw == WRITE && mapping_mapped(mapping))
+		unmap_mapping_range(mapping, 0, -1, 0);
+
 	retval = filemap_write_and_wait(mapping);
 	if (retval == 0) {
 		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
 						offset, nr_segs);
-		if (rw == WRITE && mapping->nrpages)
-			invalidate_inode_pages2(mapping);
+		if (rw == WRITE && mapping->nrpages) {
+			int err = invalidate_inode_pages2(mapping);
+			if (err)
+				retval = err;
+		}
 	}
 	return retval;
 }
-
 EXPORT_SYMBOL_GPL(generic_file_direct_IO);
_

^ permalink raw reply	[flat|nested] 44+ messages in thread
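[Editor's note: the control flow of the generic_file_direct_IO() change in the patch above, condensed into a userspace model. The toy_* names are invented stand-ins; the model shows the two behaviours the patch adds: writes shoot down ptes up-front so dirty bits reach the pageframes before writeback, and an invalidation failure overrides the return value.]

```c
#include <assert.h>
#include <stdbool.h>

#define EIO 5
enum { READ, WRITE };

struct toy_mapping {
	bool mapped;		/* mapping_mapped() */
	int nrpages;
	bool invalidate_fails;	/* models invalidate_complete_page() failing */
	int unmapped;		/* counts up-front shootdowns, for checking */
};

/* Stand-in for unmap_mapping_range(mapping, 0, -1, 0) */
static void toy_unmap_all(struct toy_mapping *m)
{
	m->mapped = false;
	m->unmapped++;
}

/* Stand-in for invalidate_inode_pages2(): 0 on success, -EIO on failure */
static int toy_invalidate_all(struct toy_mapping *m)
{
	return m->invalidate_fails ? -EIO : 0;
}

/* Condensed flow of the patched generic_file_direct_IO() */
static long toy_direct_IO(int rw, struct toy_mapping *m, long bytes)
{
	long retval = bytes;

	if (rw == WRITE && m->mapped)
		toy_unmap_all(m);	/* pte dirty bits propagate first */
	/* filemap_write_and_wait() and ->direct_IO() would run here */
	if (rw == WRITE && m->nrpages) {
		int err = toy_invalidate_all(m);
		if (err)
			retval = err;	/* failure overrides the byte count */
	}
	return retval;
}
```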
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty @ 2004-10-22 23:24 ` Andrew Morton 0 siblings, 0 replies; 44+ messages in thread From: Andrew Morton @ 2004-10-22 23:24 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm Andrea Arcangeli <andrea@novell.com> wrote: > > On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote: > > Yeah. I think the only 100% reliable way to do this is: > > > > lock_page() > > unmap_mapping_range(page) > > ClearPageUptodate(page); > > invalidate(page); /* Try to free the thing */ > > unlock_page(page); > > > > (Can do it for a whole range of pages if we always agree to lock pages in > > file-index-ascending order). > > > > but hrm, we don't even have the locking there to prevent do_no_page() from > > reinstantiating the pte immediately after the unmap_mapping_range(). > > > > So what to do? > > > > - The patch you originally sent has a race window which can be made > > nonfatal by removing the BUGcheck in mpage_writepage(). > > > > - NFS should probably be taught to use unmap_mapping_range() regardless > > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed > > up. > > > > - invalidate_inode_pages2() isn't successfully invalidating the page if > > it has buffers. > > I did it right in 2.4, somebody obviously broken the thing in 2.6. Now > w.r.t. to ext2/ext3 I never heard a problem, and we nuke buffers > regardless if the page is mapped or not (so this isn't a corner case). > > define block_invalidate_page(page) discard_bh_page(page, 0, 0) > > } else { > if (page->buffers) { > /* Restart after this page */ > list_del(head); > list_add_tail(head, curr); > > page_cache_get(page); > spin_unlock(&pagecache_lock); > block_invalidate_page(page); > } else > unlocked = 0; > > ClearPageDirty(page); This is incorrect for ext3 - we have to go through the a_ops to invalidate the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3 patch. 
> The guarantee is total in 2.4 w.r.t. buffers, as worse it can generate > an hearth attack to ext3 but it won't risk to lose coherency. Only 2.6 > broke the O_DIRECT vs buffered I/O coherency protocol. Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK in 2.6. > The semantics of not uptodate pages mapped in the address space are as > well understood and working fine in 2.4. I'd prefer to continue to disallow the mapping of nonuptodate pages into process address space. It's indistinguishable from a security hole and I think we can indeed preserve this invariant while fixing the mmap coherency problem. > > This is the biggest problem, because the pages can trivially have > > buffers - just write()ing to them will attach buffers, if they're ext2- > > or ext3-backed. > > > > It means that a direct-IO write to a section of a file which is mmapped > > causes pagecache and disk to get out of sync. Question is: do we care, > > or do we say "don't do that"? Nobody seems to have noticed thus far and > > it's a pretty dopey thing to be doing. > > this is a supported feature. People is writing with O_DIRECT and reading > with buffered read and we must support that (think a database writing > and a tar.gz reading). The coherency is a must after the O_DIRECT has > completed. If people runs the operations at the same time the result is > undefined. But by serializing the I/O at the userspace level (like > stop_db(); tar.gz) we must provide guarantees to them. If the page is not mapped into pagetables then we'll invalidate it OK even if it's backed by ext3. The only hole I can see is if someone comes in and dirties the page via write() _after_ generic_file_direct_IO() has done its filemap_write_and_wait(). And I agree that in that case we should propagate the invalidate_complete_page() failure back to the direct-io write() caller. I wonder how - via a short write, or via -EFOO? Probably we should sync all the ptes prior to a direct-io read as well. 
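The window described here - a buffered write() redirtying a page between filemap_write_and_wait() and the post-direct-IO invalidation - and the -EFOO option can be modeled in a few lines of plain C. The sketch below is purely hypothetical userspace code: the struct and function names only mirror the kernel for readability, nothing here is the real implementation.

```c
#include <assert.h>
#include <errno.h>

/* One "page" is just a dirty bit.  The invalidation step refuses to
 * drop a dirty page; the direct-IO path turns that refusal into -EIO
 * rather than a short write, since the data already reached disk and
 * we cannot tell which blocks failed to invalidate. */
struct page_model { int dirty; };

static int invalidate_ok(struct page_model *p)
{
    return !p->dirty;           /* success only when nobody redirtied it */
}

static long direct_write_model(struct page_model *p, long nbytes,
                               int racing_buffered_write)
{
    p->dirty = 0;               /* filemap_write_and_wait() flushed it */
    if (racing_buffered_write)
        p->dirty = 1;           /* a write() slips in after the flush */
    /* ... the direct transfer of nbytes happens here ... */
    if (!invalidate_ok(p))
        return -EIO;            /* propagate the failure to the caller */
    return nbytes;
}
```

The model makes the choice concrete: the byte count and the errno compete for the same return slot, which is exactly the short-write-vs-EFOO question raised above.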
> > If we _do_ want to fix it, it seems the simplest approach would be to > > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we > > raced - try again" loop in there to handle the "do_no_page() > > reinstantiated a pte" race. > > the do_no_page don't need to be coherent. The pte is the last thing we > can care about. The API only guarantees read/write syscalls to be > coherent, mmap not. I want to fix the mmap case too. We currently have a half-fix for NFS which doesn't work at all for (say) ext3. > the only bug is the bh and it has to be fixed as easily as I did in 2.4. No, we need to call into the filesystem to kill the buffer_heads. > > Something like this? Slow as a dog, but possibly correct ;) > > > > > > void invalidate_inode_pages2(struct address_space *mapping) > > { > > struct pagevec pvec; > > pgoff_t next = 0; > > int i; > > > > pagevec_init(&pvec, 0); > > while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > > for (i = 0; i < pagevec_count(&pvec); i++) { > > struct page *page = pvec.pages[i]; > > > > lock_page(page); > > if (page->mapping == mapping) { /* truncate race? */ > > wait_on_page_writeback(page); > > next = page->index + 1; > > while (page_mapped(page)) { > > unmap_mapping_range(mapping, > > page->index << PAGE_CACHE_SHIFT, > > page->index << PAGE_CACHE_SHIFT+1, > > 0); > > clear_page_dirty(page); > > } > > invalidate_complete_page(mapping, page); > > } > > unlock_page(page); > > } > > pagevec_release(&pvec); > > cond_resched(); > > } > > } > > You know the page_mapped path isn't really very important as far as we > execute both clear_page_dirty and ClearPageUptodate. It's important for NFS - it will allow mmaps to see server-side updates. > We must provide the > guarantee to the syscall. Leaving not updoate pages in the ptes is > fine, the VM can deal with that. the semantics of that condition is that > the page changed on disk due O_DIRECT from under us, so the page isn't > uptodate anymore. 
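One C subtlety worth flagging in the sketch quoted above: the end-offset argument is written as `page->index << PAGE_CACHE_SHIFT+1`, but `+` binds tighter than `<<`, so this shifts by PAGE_CACHE_SHIFT+1 rather than adding one byte past the page's start; the patch merged later in the thread parenthesizes it as `(page->index << PAGE_CACHE_SHIFT)+1`. A minimal demonstration, assuming the common PAGE_CACHE_SHIFT of 12 (4 KB pages):

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12   /* assumed: 4 KB pages */

/* As written in the sketch: the +1 is absorbed into the shift count. */
static unsigned long end_without_parens(unsigned long index)
{
    return index << PAGE_CACHE_SHIFT + 1;      /* == index << 13 */
}

/* As in the merged patch: one byte past the page's start offset. */
static unsigned long end_with_parens(unsigned long index)
{
    return (index << PAGE_CACHE_SHIFT) + 1;
}
```

For page index 3 the two expressions give 24576 and 12289 respectively - only the second describes the intended one-byte range starting at the page.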
If something we can argue about ext3 not being capable > of dealing with a PG_update going away from under it, but the VM is > capable of that, there is only one spot that must be aware about it > (zap fix we posted to start the thread). > > BTW, we have these fixes in our tree to make it work: > > --- sles/mm/truncate.c.~1~ 2004-05-18 19:24:40.000000000 +0200 > +++ sles/mm/truncate.c 2004-05-19 02:09:28.311781864 +0200 > @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr > if (page->mapping == mapping) { /* truncate race? */ > wait_on_page_writeback(page); > next = page->index + 1; > - if (page_mapped(page)) > + if (page_mapped(page)) { > + ClearPageUptodate(page); > clear_page_dirty(page); > - else > + } else > invalidate_complete_page(mapping, page); > } > unlock_page(page); That's in 2.6.9. > and then this fix for BLKFLSBUF: > > diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c > --- linux-2.6.8/mm/truncate.c 2004-09-21 15:39:37.850379202 +0200 > +++ linux-2.6.8.SUSE/mm/truncate.c 2004-09-21 15:39:54.733668309 +0200 > @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_ > write_unlock_irq(&mapping->tree_lock); > return 0; > } > + > + BUG_ON(PagePrivate(page)); > + if (page_count(page) != 2) { > + write_unlock_irq(&mapping->tree_lock); > + return 0; > + } > __remove_from_page_cache(page); > write_unlock_irq(&mapping->tree_lock); > ClearPageUptodate(page); > @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr > clear_page_dirty(page); > ClearPageUptodate(page); > } else { > - invalidate_complete_page(mapping, page); > + if (!invalidate_complete_page(mapping, > + page)) { > + clear_page_dirty(page); > + ClearPageUptodate(page); > + } > } > } > unlock_page(page); > That's in 2.6.9 as well. 
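The hunks above deliberately leave pages mapped but with PG_uptodate cleared, which is exactly why the zap fix that opened this thread refuses to redirty such pages. A toy userspace model of that rule (flag names are borrowed from the kernel purely for readability; this is not kernel code):

```c
#include <assert.h>

enum { PG_DIRTY = 1 << 0, PG_UPTODATE = 1 << 1 };

/* Model of the fixed zap_pte_range() decision: a dirty pte propagates
 * into the page's dirty flag only while the page is still uptodate.
 * A page invalidated under us (PG_uptodate cleared) stays clean, so
 * it gets re-read from disk instead of being handed to pdflush. */
static unsigned zap_pte_model(unsigned page_flags, int pte_dirty)
{
    if (pte_dirty && (page_flags & PG_UPTODATE))
        page_flags |= PG_DIRTY;
    return page_flags;
}
```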
I need to think about this a bit more - the first hunk is for read() versus BLKFLSBUF (if someone else is playing with the page, leave it alone) whereas the second hunk is kinda buggy, because it will reintroduce the EIO-in-read() error which the first hunk tried to fix. (You're using an rwlock for tree_lock. It's still unclear to me whether we should go that way - it helps big SMP and hurts small SMP. Did you perform an evaluation of this?) > > apparently those weren't submitted to mainline yet, that could have > created some confusion on why we submitted the third patch that avoids > the BUG in pdflush and it avoids to try writing to disk a page that is > not uptodate anymore. Do you see why it's wrong to pass down to pdflush > a page that isn't utpodate? Sure. That's why there's a BUG() there ;) > That page must be re-read, it sure must not > be written out. The semantics of mapped pages not uptodate are very well > defined and they're pose no risk to the VM (modulo the bug in pdflush > and this clear improvement in zap). > > > The unmapping from pagetables means that invalidate_complete_page() will > > generally successfully remove the page from pagecache. That mostly fixes > > the direct-write-of-mmapped-data coherency problem: the pages simply aren't > > in pagecache any more so we'll surely redo physical I/O. > > the direct-write-of-mmapped-data coherency is the last thing we can care > about. But we can fix it. > Even before thinking about direct-write-of-mmapped-data we must > fix direct-write-of-cached-data (assuming it's not mmapped cache). That > is a showstopper bug, I don't think we have a bug here. After the sync(), ext3_releasepage() will be able to release the buffers. But we should check for failures. The problem we're seeing is with mmapped pages. In that case, clearing PG_uptodate is insufficient and directly invalidating the buffer_heads will probably kill ext3 and we're not allowed to assume that page->private contains bh's anyway... 
Probably it is sufficient to run try_to_release_page() in the page_mapped leg of invalidate_inode_pages2(), and to check the return value. > the direct-write-of-mmapped-data is "undefined" by > the linux API. Garbage can end up in the pagecache as far as the kernel > is concerned (all we guarantee is that no random page contents are > delivered to userspace). > > > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and > > it really sucks that we're offering behaviour which only works most of the > > time :( > > That don't need to ever work, I don't think we even need to add a pte > shootdown, the behviour has never been provided by design. All we need > to do is to fix the bh layer, and fixup the spots in zap to stop bugging > out on dirty not uptodate pages, plus the above patches in this email, > then it'll work as well as 2.4 and it will avoid to break apps. Note > that this is a very serious matter, what happens is that oracle writes > with O_DIRECT and when you do a tar.gz of the database the last block > may be corrupt if it's not 512byte aligned if it was a partial page and > it had partial bh. mysql uses O_DIRECT too. The current buffered-vs-direct handling looks OK, as long as the pages aren't mmapped, and as long as nobody dirties a page after generic_file_direct_IO() has done filemap_write_and_wait(). For the former I'd like to unmap the pages. For the latter we should check the invalidate_complete_page() return value. > I'm not particularly > concerned about ext3, that can be fixed too, or we can add a new bitflag > as you suggested to avoid fixing the filesystems to deal with that. But > for the VM I'm confortable clearing the uptodate page is much simpler > and it poses no risk to the VM. For the fs you may be very well right > that the BH_reread_from_disk might be needed instead. > > So let's forget the mmapped case and let's fix the bh screwup in 2.6. > Then we can think at the mmapped case. 
The "bh screwup" _is_ the mmapped case. Maybe I said something different yesterday - it's been a while since I looked in there. > And to fix the mmapped case IMHO all we have to do is: > > clear_page_dirty > ClearPageUptodate > unmap_page_range > > Not the other way around. The uptodate bit must be clared before zapping > the pte to force a reload. Something like this... - When invalidating pages, take care to shoot down any ptes which map them as well. This ensures that the next mmap access to the page will generate a major fault, so NFS's server-side modifications are picked up. This also allows us to call invalidate_complete_page() on all pages, so filesystems such as ext3 get a chance to invalidate the buffer_heads. - Don't mark in-pagetable pages as non-uptodate any more. That broke a previous guarantee that mapped-into-user-process pages are always uptodate. - Check the return value of invalidate_complete_page(). It can fail if someone redirties a page after generic_file_direct_IO() wrote it back. But we still have a problem. If invalidate_inode_pages2() calls unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache pages. That will redirty the page's buffers and will cause invalidate_complete_page() to fail. So, in generic_file_direct_IO() we do a complete pte shootdown on the file up-front, prior to writing back dirty pagecache. This is only done for O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke other people's mapped pagecache. NFS also uses invalidate_inode_pages2() for handling server-side modification notifications. But in the NFS case the clear_page_dirty() in invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry about the "dirty buffers against a clean page" problem. 
(I think) Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/include/linux/fs.h | 2 - 25-akpm/mm/filemap.c | 19 ++++++++++--- 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------ 3 files changed, 55 insertions(+), 29 deletions(-) diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004 +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004 @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp * be marked dirty at any time too. So we re-check the dirtiness inside * ->tree_lock. That provides exclusion against the __set_page_dirty * functions. + * + * Returns non-zero if the page was successfully invalidated. */ static int invalidate_complete_page(struct address_space *mapping, struct page *page) @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str EXPORT_SYMBOL(invalidate_inode_pages); /** - * invalidate_inode_pages2 - remove all unmapped pages from an address_space + * invalidate_inode_pages2 - remove all pages from an address_space * @mapping - the address_space * - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case - * where the page is seen to be mapped into process pagetables. In that case, - * the page is marked clean but is left attached to its address_space. - * - * The page is also marked not uptodate so that a subsequent pagefault will - * perform I/O to bringthe page's contents back into sync with its backing - * store. + * Any pages which are found to be mapped into pagetables are unmapped prior to + * invalidation. * - * FIXME: invalidate_inode_pages2() is probably trivially livelockable. + * Returns -EIO if any pages could not be invalidated. 
*/ -void invalidate_inode_pages2(struct address_space *mapping) +int invalidate_inode_pages2(struct address_space *mapping) { struct pagevec pvec; pgoff_t next = 0; int i; + int ret = 0; + int did_full_unmap = 0; pagevec_init(&pvec, 0); - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { - for (i = 0; i < pagevec_count(&pvec); i++) { + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; lock_page(page); - if (page->mapping == mapping) { /* truncate race? */ - wait_on_page_writeback(page); - next = page->index + 1; - if (page_mapped(page)) { - clear_page_dirty(page); - ClearPageUptodate(page); + if (page->mapping != mapping) { /* truncate race? */ + unlock_page(page); + continue; + } + wait_on_page_writeback(page); + next = page->index + 1; + while (page_mapped(page)) { + if (!did_full_unmap) { + /* + * Zap the rest of the file in one hit. + * FIXME: invalidate_inode_pages2() + * should take start/end offsets. 
+ */ + unmap_mapping_range(mapping, + page->index << PAGE_CACHE_SHIFT, + -1, 0); + did_full_unmap = 1; } else { - if (!invalidate_complete_page(mapping, - page)) { - clear_page_dirty(page); - ClearPageUptodate(page); - } + /* + * Just zap this page + */ + unmap_mapping_range(mapping, + page->index << PAGE_CACHE_SHIFT, + (page->index << PAGE_CACHE_SHIFT)+1, + 0); } } + clear_page_dirty(page); + if (!invalidate_complete_page(mapping, page)) + ret = -EIO; unlock_page(page); } pagevec_release(&pvec); cond_resched(); } + return ret; } - EXPORT_SYMBOL_GPL(invalidate_inode_pages2); diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004 +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004 @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino S_ISLNK(inode->i_mode)) invalidate_inode_pages(inode->i_mapping); } -extern void invalidate_inode_pages2(struct address_space *mapping); +extern int invalidate_inode_pages2(struct address_space *mapping); extern void write_inode_now(struct inode *, int); extern int filemap_fdatawrite(struct address_space *); extern int filemap_flush(struct address_space *); diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004 +++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004 @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file EXPORT_SYMBOL(generic_file_writev); /* - * Called under i_sem for writes to S_ISREG files + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something + * went wrong during pagecache shootdown. 
*/ ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki struct address_space *mapping = file->f_mapping; ssize_t retval; + /* + * If it's a write, unmap all mmappings of the file up-front. This + * will cause any pte dirty bits to be propagated into the pageframes + * for the subsequent filemap_write_and_wait(). + */ + if (rw == WRITE && mapping_mapped(mapping)) + unmap_mapping_range(mapping, 0, -1, 0); + retval = filemap_write_and_wait(mapping); if (retval == 0) { retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs); - if (rw == WRITE && mapping->nrpages) - invalidate_inode_pages2(mapping); + if (rw == WRITE && mapping->nrpages) { + int err = invalidate_inode_pages2(mapping); + if (err) + retval = err; + } } return retval; } - EXPORT_SYMBOL_GPL(generic_file_direct_IO); _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 44+ messages in thread
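The ordering the patch above establishes in generic_file_direct_IO() - unmap first so pte dirty bits reach the page frames, then flush, then the direct transfer, then invalidate - can be made visible with a stub trace. This is a hypothetical userspace sketch; every call is a stand-in that merely records its name:

```c
#include <assert.h>
#include <string.h>

#define READ  0
#define WRITE 1

static const char *trace[4];
static int ntrace;

static void record(const char *step) { trace[ntrace++] = step; }

/* Mirrors the control flow of the patched function above. */
static int direct_io_flow(int rw)
{
    ntrace = 0;
    if (rw == WRITE)
        record("unmap_mapping_range");      /* propagate pte dirty bits */
    record("filemap_write_and_wait");       /* flush dirty pagecache */
    record("a_ops->direct_IO");             /* the actual transfer */
    if (rw == WRITE)
        record("invalidate_inode_pages2");  /* failure becomes -EIO */
    return ntrace;
}
```

Running the write path records four steps with the unmap first; the read path, as in the patch, skips both the up-front shootdown and the invalidation.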
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-10-22 23:24 ` Andrew Morton @ 2004-10-25 13:58 ` Dave Kleikamp -1 siblings, 0 replies; 44+ messages in thread From: Dave Kleikamp @ 2004-10-25 13:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel, linux-mm Andrew, This patch does fix the oops I was seeing. Shaggy On Fri, 2004-10-22 at 18:24, Andrew Morton wrote: > Something like this... > > > > > - When invalidating pages, take care to shoot down any ptes which map them > as well. > > This ensures that the next mmap access to the page will generate a major > fault, so NFS's server-side modifications are picked up. > > This also allows us to call invalidate_complete_page() on all pages, so > filesytems such as ext3 get a chance to invalidate the buffer_heads. > > - Don't mark in-pagetable pages as non-uptodate any more. That broke a > previous guarantee that mapped-into-user-process pages are always uptodate. > > - Check the return value of invalidate_complete_page(). It can fail if > someone redirties a page after generic_file_direct_IO() write it back. > > But we still have a problem. If invalidate_inode_pages2() calls > unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache > pages. That will redirty the page's buffers and will cause > invalidate_complete_page() to fail. > > So, in generic_file_direct_IO() we do a complete pte shootdown on the file > up-front, prior to writing back dirty pagecache. This is only done for > O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full > mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but > permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke > other people's mapped pagecache. > > NFS also uses invalidate_inode_pages2() for handling server-side modification > notifications. 
But in the NFS case the clear_page_dirty() in > invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry > about the "dirty buffers against a clean page" problem. (I think) > > Signed-off-by: Andrew Morton <akpm@osdl.org> > --- > > 25-akpm/include/linux/fs.h | 2 - > 25-akpm/mm/filemap.c | 19 ++++++++++--- > 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------ > 3 files changed, 55 insertions(+), 29 deletions(-) > > diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c > --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004 > +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004 > @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp > * be marked dirty at any time too. So we re-check the dirtiness inside > * ->tree_lock. That provides exclusion against the __set_page_dirty > * functions. > + * > + * Returns non-zero if the page was successfully invalidated. > */ > static int > invalidate_complete_page(struct address_space *mapping, struct page *page) > @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str > EXPORT_SYMBOL(invalidate_inode_pages); > > /** > - * invalidate_inode_pages2 - remove all unmapped pages from an address_space > + * invalidate_inode_pages2 - remove all pages from an address_space > * @mapping - the address_space > * > - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case > - * where the page is seen to be mapped into process pagetables. In that case, > - * the page is marked clean but is left attached to its address_space. > - * > - * The page is also marked not uptodate so that a subsequent pagefault will > - * perform I/O to bringthe page's contents back into sync with its backing > - * store. > + * Any pages which are found to be mapped into pagetables are unmapped prior to > + * invalidation. > * > - * FIXME: invalidate_inode_pages2() is probably trivially livelockable. 
> + * Returns -EIO if any pages could not be invalidated. > */ > -void invalidate_inode_pages2(struct address_space *mapping) > +int invalidate_inode_pages2(struct address_space *mapping) > { > struct pagevec pvec; > pgoff_t next = 0; > int i; > + int ret = 0; > + int did_full_unmap = 0; > > pagevec_init(&pvec, 0); > - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > - for (i = 0; i < pagevec_count(&pvec); i++) { > + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > + for (i = 0; !ret && i < pagevec_count(&pvec); i++) { > struct page *page = pvec.pages[i]; > > lock_page(page); > - if (page->mapping == mapping) { /* truncate race? */ > - wait_on_page_writeback(page); > - next = page->index + 1; > - if (page_mapped(page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > + if (page->mapping != mapping) { /* truncate race? */ > + unlock_page(page); > + continue; > + } > + wait_on_page_writeback(page); > + next = page->index + 1; > + while (page_mapped(page)) { > + if (!did_full_unmap) { > + /* > + * Zap the rest of the file in one hit. > + * FIXME: invalidate_inode_pages2() > + * should take start/end offsets. 
> + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + -1, 0); > + did_full_unmap = 1; > } else { > - if (!invalidate_complete_page(mapping, > - page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > - } > + /* > + * Just zap this page > + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + (page->index << PAGE_CACHE_SHIFT)+1, > + 0); > } > } > + clear_page_dirty(page); > + if (!invalidate_complete_page(mapping, page)) > + ret = -EIO; > unlock_page(page); > } > pagevec_release(&pvec); > cond_resched(); > } > + return ret; > } > - > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h > --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004 > +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004 > @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino > S_ISLNK(inode->i_mode)) > invalidate_inode_pages(inode->i_mapping); > } > -extern void invalidate_inode_pages2(struct address_space *mapping); > +extern int invalidate_inode_pages2(struct address_space *mapping); > extern void write_inode_now(struct inode *, int); > extern int filemap_fdatawrite(struct address_space *); > extern int filemap_flush(struct address_space *); > diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c > --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004 > +++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004 > @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file > EXPORT_SYMBOL(generic_file_writev); > > /* > - * Called under i_sem for writes to S_ISREG files > + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something > + * went wrong during pagecache shootdown. 
> */ > ssize_t > generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki > struct address_space *mapping = file->f_mapping; > ssize_t retval; > > + /* > + * If it's a write, unmap all mmappings of the file up-front. This > + * will cause any pte dirty bits to be propagated into the pageframes > + * for the subsequent filemap_write_and_wait(). > + */ > + if (rw == WRITE && mapping_mapped(mapping)) > + unmap_mapping_range(mapping, 0, -1, 0); > + > retval = filemap_write_and_wait(mapping); > if (retval == 0) { > retval = mapping->a_ops->direct_IO(rw, iocb, iov, > offset, nr_segs); > - if (rw == WRITE && mapping->nrpages) > - invalidate_inode_pages2(mapping); > + if (rw == WRITE && mapping->nrpages) { > + int err = invalidate_inode_pages2(mapping); > + if (err) > + retval = err; > + } > } > return retval; > } > - > EXPORT_SYMBOL_GPL(generic_file_direct_IO); > _ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty @ 2004-10-25 13:58 ` Dave Kleikamp 0 siblings, 0 replies; 44+ messages in thread From: Dave Kleikamp @ 2004-10-25 13:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel, linux-mm Andrew, This patch does fix the oops I was seeing. Shaggy On Fri, 2004-10-22 at 18:24, Andrew Morton wrote: > Something like this... > > > > > - When invalidating pages, take care to shoot down any ptes which map them > as well. > > This ensures that the next mmap access to the page will generate a major > fault, so NFS's server-side modifications are picked up. > > This also allows us to call invalidate_complete_page() on all pages, so > filesytems such as ext3 get a chance to invalidate the buffer_heads. > > - Don't mark in-pagetable pages as non-uptodate any more. That broke a > previous guarantee that mapped-into-user-process pages are always uptodate. > > - Check the return value of invalidate_complete_page(). It can fail if > someone redirties a page after generic_file_direct_IO() write it back. > > But we still have a problem. If invalidate_inode_pages2() calls > unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache > pages. That will redirty the page's buffers and will cause > invalidate_complete_page() to fail. > > So, in generic_file_direct_IO() we do a complete pte shootdown on the file > up-front, prior to writing back dirty pagecache. This is only done for > O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full > mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but > permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke > other people's mapped pagecache. > > NFS also uses invalidate_inode_pages2() for handling server-side modification > notifications. 
But in the NFS case the clear_page_dirty() in > invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry > about the "dirty buffers against a clean page" problem. (I think) > > Signed-off-by: Andrew Morton <akpm@osdl.org> > --- > > 25-akpm/include/linux/fs.h | 2 - > 25-akpm/mm/filemap.c | 19 ++++++++++--- > 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------ > 3 files changed, 55 insertions(+), 29 deletions(-) > > diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c > --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004 > +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004 > @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp > * be marked dirty at any time too. So we re-check the dirtiness inside > * ->tree_lock. That provides exclusion against the __set_page_dirty > * functions. > + * > + * Returns non-zero if the page was successfully invalidated. > */ > static int > invalidate_complete_page(struct address_space *mapping, struct page *page) > @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str > EXPORT_SYMBOL(invalidate_inode_pages); > > /** > - * invalidate_inode_pages2 - remove all unmapped pages from an address_space > + * invalidate_inode_pages2 - remove all pages from an address_space > * @mapping - the address_space > * > - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case > - * where the page is seen to be mapped into process pagetables. In that case, > - * the page is marked clean but is left attached to its address_space. > - * > - * The page is also marked not uptodate so that a subsequent pagefault will > - * perform I/O to bringthe page's contents back into sync with its backing > - * store. > + * Any pages which are found to be mapped into pagetables are unmapped prior to > + * invalidation. > * > - * FIXME: invalidate_inode_pages2() is probably trivially livelockable. 
> + * Returns -EIO if any pages could not be invalidated. > */ > -void invalidate_inode_pages2(struct address_space *mapping) > +int invalidate_inode_pages2(struct address_space *mapping) > { > struct pagevec pvec; > pgoff_t next = 0; > int i; > + int ret = 0; > + int did_full_unmap = 0; > > pagevec_init(&pvec, 0); > - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > - for (i = 0; i < pagevec_count(&pvec); i++) { > + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > + for (i = 0; !ret && i < pagevec_count(&pvec); i++) { > struct page *page = pvec.pages[i]; > > lock_page(page); > - if (page->mapping == mapping) { /* truncate race? */ > - wait_on_page_writeback(page); > - next = page->index + 1; > - if (page_mapped(page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > + if (page->mapping != mapping) { /* truncate race? */ > + unlock_page(page); > + continue; > + } > + wait_on_page_writeback(page); > + next = page->index + 1; > + while (page_mapped(page)) { > + if (!did_full_unmap) { > + /* > + * Zap the rest of the file in one hit. > + * FIXME: invalidate_inode_pages2() > + * should take start/end offsets. 
> + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + -1, 0); > + did_full_unmap = 1; > } else { > - if (!invalidate_complete_page(mapping, > - page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > - } > + /* > + * Just zap this page > + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + (page->index << PAGE_CACHE_SHIFT)+1, > + 0); > } > } > + clear_page_dirty(page); > + if (!invalidate_complete_page(mapping, page)) > + ret = -EIO; > unlock_page(page); > } > pagevec_release(&pvec); > cond_resched(); > } > + return ret; > } > - > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h > --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004 > +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004 > @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino > S_ISLNK(inode->i_mode)) > invalidate_inode_pages(inode->i_mapping); > } > -extern void invalidate_inode_pages2(struct address_space *mapping); > +extern int invalidate_inode_pages2(struct address_space *mapping); > extern void write_inode_now(struct inode *, int); > extern int filemap_fdatawrite(struct address_space *); > extern int filemap_flush(struct address_space *); > diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c > --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004 > +++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004 > @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file > EXPORT_SYMBOL(generic_file_writev); > > /* > - * Called under i_sem for writes to S_ISREG files > + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something > + * went wrong during pagecache shootdown. 
> */ > ssize_t > generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki > struct address_space *mapping = file->f_mapping; > ssize_t retval; > > + /* > + * If it's a write, unmap all mmappings of the file up-front. This > + * will cause any pte dirty bits to be propagated into the pageframes > + * for the subsequent filemap_write_and_wait(). > + */ > + if (rw == WRITE && mapping_mapped(mapping)) > + unmap_mapping_range(mapping, 0, -1, 0); > + > retval = filemap_write_and_wait(mapping); > if (retval == 0) { > retval = mapping->a_ops->direct_IO(rw, iocb, iov, > offset, nr_segs); > - if (rw == WRITE && mapping->nrpages) > - invalidate_inode_pages2(mapping); > + if (rw == WRITE && mapping->nrpages) { > + int err = invalidate_inode_pages2(mapping); > + if (err) > + retval = err; > + } > } > return retval; > } > - > EXPORT_SYMBOL_GPL(generic_file_direct_IO); > _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: aart@kvack.org ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-10-22 23:24 ` Andrew Morton @ 2004-10-26 0:35 ` Andrea Arcangeli -1 siblings, 0 replies; 44+ messages in thread From: Andrea Arcangeli @ 2004-10-26 0:35 UTC (permalink / raw) To: Andrew Morton; +Cc: shaggy, linux-kernel, linux-mm On Fri, Oct 22, 2004 at 04:24:33PM -0700, Andrew Morton wrote: > This is incorrect for ext3 - we have to go through the a_ops to invalidate > the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3 > patch. BTW, truncate_complete_page also calls into do_flushpage without asking the fs. truncate_inode_pages does it too (that's probably where it comes from; however, truncate happens when the fs has already given up on the inode, which is probably why it's safe there). > Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK > in 2.6. yes, as long as try_to_release_page cannot fail on non-mapped pages, otherwise it would lose coherency. When exactly can try_to_release_page return failure from the fs? > And I agree that in that case we should propagate the > invalidate_complete_page() failure back to the direct-io write() caller. I wonder > how - via a short write, or via -EFOO? can we do a short write at all? The failure would come out of try_to_release_buffers; we don't know which blocks failed to invalidate there, and in turn which pages would be visible, right? I suspect all we can do is -EFOO (which sounds safer anyway, since this is a fundamental userspace race condition detection path; a -ERACE would possibly be appropriate if it existed ;). > Probably we should sync all the ptes prior to a direct-io read as well. yes, we should msync all the mappings if you want to truly provide mmap coherency to O_DIRECT. > I want to fix the mmap case too. We currently have a half-fix for NFS > which doesn't work at all for (say) ext3. for nfs it worked well enough because all they care about is that if you munmap and mmap again the new data is shown.
all they care about is that if they restart the program it sees the new data on the server. Perfect mmap coherency is impossible with nfs due to the stateless protocol, but for ext3 we could provide it (in theory). But keep in mind 99% of the O_DIRECT userbase seems not to need it: if you use O_DIRECT by design it means you have no mapping on that data and you do the caching in userspace (otherwise mmap would be the api to use). the two things are quite mutually exclusive. This is why it always looked like an unnecessary slowdown to me, but it looks like we might implement it with little overhead for 2.6. > It's important for NFS - it will allow mmaps to see server-side updates. the folks I talked to only care about munmap/mmap coherency; a persistent mmap isn't required to see the new data. > That's in 2.6.9. ok fine. > That's in 2.6.9 as well. I need to think about this a bit more - the first fine too ;) > (You're using an rwlock for tree_lock. It's still unclear to me whether we > should go that way - it helps big SMP and hurts small SMP. Did you perform > an evaluation of this?) I didn't change anything in that area; I've no clue who changed that. Since it's a spinlock, it will most probably only be an overhead to have an rwlock. BTW, I still have to backport the semaphore-to-spinlock conversion to 2.6.5 (I'm in the process of backporting the lowmem_reserve patch; 2.6.5 didn't even have the new, better protection behaviour of 2.6.8+, so I'm in the usual reject procedures.. ;). Then there's Hugh's rss to backport to 2.6.5, and finally this one patch you posted here for mmap coherency. > Sure. That's why there's a BUG() there ;) eheh ;) > I don't think we have a bug here. After the sync(), ext3_releasepage() > will be able to release the buffers. But we should check for failures. this partly explains the above question (i.e. when exactly can releasepage fail?). > The problem we're seeing is with mmapped pages.
In that case, clearing > PG_uptodate is insufficient and directly invalidating the buffer_heads will > probably kill ext3 and we're not allowed to assume that page->private > contains bh's anyway... > > Probably it is sufficient to run try_to_release_page() in the page_mapped > leg of invalidate_inode_pages2(), and to check the return value. I guess so. I mean, ext3 cannot know the page is mapped. Returning failure sounds more than good enough. > mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but > permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke > other people's mapped pagecache. for O_DIRECT reads you don't need to shoot down the ptes but only to msync them. That would avoid a performance impact imposed on other users, and it would leave the cost in the task issuing the msync and walking the ptes. Since you provided either coherency or -EIO to the O_DIRECT operation, it's probably worth adding the msync of the ptes as well before the O_DIRECT read hits the disk, no? BTW, this is still not coherent at the block level, so if I write a 512-byte block with O_DIRECT on one side and at the same time I write a 512-byte block with mmap (or buffered write), the end result is still undefined. But if somebody orders the writes, then it should work even with mmap now (while previously it couldn't work even with userspace-ordered I/O, since the dirty bit in the pte would simply be ignored by O_DIRECT). Correct? > NFS also uses invalidate_inode_pages2() for handling server-side modification > notifications. But in the NFS case the clear_page_dirty() in > invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry > about the "dirty buffers against a clean page" problem. (I think) indeed, nfs was the only safe one w.r.t. mappings. The patch looks good, but perhaps we can add an msync (not an unmap) for O_DIRECT reads too, to reduce the magic? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-10-22 23:24 ` Andrew Morton @ 2004-11-09 14:15 ` Dave Kleikamp -1 siblings, 0 replies; 44+ messages in thread From: Dave Kleikamp @ 2004-11-09 14:15 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli; +Cc: linux-kernel, linux-mm Andrew & Andrea, What is the status of this patch? It would be nice to have it in the -mm4 kernel. Thanks, Shaggy On Fri, 2004-10-22 at 18:24, Andrew Morton wrote: > Andrea Arcangeli <andrea@novell.com> wrote: > > > > On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote: > > > Yeah. I think the only 100% reliable way to do this is: > > > > > > lock_page() > > > unmap_mapping_range(page) > > > ClearPageUptodate(page); > > > invalidate(page); /* Try to free the thing */ > > > unlock_page(page); > > > > > > (Can do it for a whole range of pages if we always agree to lock pages in > > > file-index-ascending order). > > > > > > but hrm, we don't even have the locking there to prevent do_no_page() from > > > reinstantiating the pte immediately after the unmap_mapping_range(). > > > > > > So what to do? > > > > > > - The patch you originally sent has a race window which can be made > > > nonfatal by removing the BUG check in mpage_writepage(). > > > > > > - NFS should probably be taught to use unmap_mapping_range() regardless > > > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed > > > up. > > > > > > - invalidate_inode_pages2() isn't successfully invalidating the page if > > > it has buffers. > > > > I did it right in 2.4; somebody obviously broke the thing in 2.6. Now, > > w.r.t. ext2/ext3, I never heard of a problem, and we nuke buffers > > regardless of whether the page is mapped or not (so this isn't a corner case).
> > define block_invalidate_page(page) discard_bh_page(page, 0, 0) > > > > } else { > > if (page->buffers) { > > /* Restart after this page */ > > list_del(head); > > list_add_tail(head, curr); > > > > page_cache_get(page); > > spin_unlock(&pagecache_lock); > > block_invalidate_page(page); > > } else > > unlocked = 0; > > > > ClearPageDirty(page); > This is incorrect for ext3 - we have to go through the a_ops to invalidate > the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3 > patch. > > The guarantee is total in 2.4 w.r.t. buffers; at worst it can give > > ext3 a heart attack, but it won't risk losing coherency. Only 2.6 > > broke the O_DIRECT vs buffered I/O coherency protocol. > Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK > in 2.6. > > The semantics of not uptodate pages mapped in the address space are as > > well understood and working fine in 2.4. > I'd prefer to continue to disallow the mapping of nonuptodate pages into > process address space. It's indistinguishable from a security hole and I > think we can indeed preserve this invariant while fixing the mmap coherency > problem. > > > This is the biggest problem, because the pages can trivially have > > > buffers - just write()ing to them will attach buffers, if they're ext2- > > > or ext3-backed. > > > > > > It means that a direct-IO write to a section of a file which is mmapped > > > causes pagecache and disk to get out of sync. Question is: do we care, > > > or do we say "don't do that"? Nobody seems to have noticed thus far and > > > it's a pretty dopey thing to be doing. > > > > this is a supported feature. People are writing with O_DIRECT and reading > > with buffered reads, and we must support that (think a database writing > > and a tar.gz reading). The coherency is a must after the O_DIRECT has > > completed. If people run the operations at the same time, the result is > > undefined.
But by serializing the I/O at the userspace level (like > > stop_db(); tar.gz) we must provide guarantees to them. > If the page is not mapped into pagetables then we'll invalidate it OK even > if it's backed by ext3. The only hole I can see is if someone comes in and > dirties the page via write() _after_ generic_file_direct_IO() has done its > filemap_write_and_wait(). > > And I agree that in that case we should propagate the > invalidate_complete_page() failure back to the direct-io write() caller. I wonder > how - via a short write, or via -EFOO? > > Probably we should sync all the ptes prior to a direct-io read as well. > > > > If we _do_ want to fix it, it seems the simplest approach would be to > > > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we > > > raced - try again" loop in there to handle the "do_no_page() > > > reinstantiated a pte" race. > > > > the do_no_page path doesn't need to be coherent. The pte is the last thing we > > can care about. The API only guarantees read/write syscalls to be > > coherent, mmap not. > I want to fix the mmap case too. We currently have a half-fix for NFS > which doesn't work at all for (say) ext3. > > the only bug is the bh and it has to be fixed as easily as I did in 2.4. > No, we need to call into the filesystem to kill the buffer_heads. > > > Something like this? Slow as a dog, but possibly correct ;) > > > > > > > > > void invalidate_inode_pages2(struct address_space *mapping) > > > { > > > struct pagevec pvec; > > > pgoff_t next = 0; > > > int i; > > > > > > pagevec_init(&pvec, 0); > > > while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > > > for (i = 0; i < pagevec_count(&pvec); i++) { > > > struct page *page = pvec.pages[i]; > > > > > > lock_page(page); > > > if (page->mapping == mapping) { /* truncate race?
*/ > > > wait_on_page_writeback(page); > > > next = page->index + 1; > > > while (page_mapped(page)) { > > > unmap_mapping_range(mapping, > > > page->index << PAGE_CACHE_SHIFT, > > > page->index << PAGE_CACHE_SHIFT+1, > > > 0); > > > clear_page_dirty(page); > > > } > > > invalidate_complete_page(mapping, page); > > > } > > > unlock_page(page); > > > } > > > pagevec_release(&pvec); > > > cond_resched(); > > > } > > > } > > > > You know the page_mapped path isn't really very important as far as we > > execute both clear_page_dirty and ClearPageUptodate. > > It's important for NFS - it will allow mmaps to see server-side updates. > > > We must provide the > > guarantee to the syscall. Leaving not updoate pages in the ptes is > > fine, the VM can deal with that. the semantics of that condition is that > > the page changed on disk due O_DIRECT from under us, so the page isn't > > uptodate anymore. If something we can argue about ext3 not being capable > > of dealing with a PG_update going away from under it, but the VM is > > capable of that, there is only one spot that must be aware about it > > (zap fix we posted to start the thread). > > > > BTW, we have these fixes in our tree to make it work: > > > > --- sles/mm/truncate.c.~1~ 2004-05-18 19:24:40.000000000 +0200 > > +++ sles/mm/truncate.c 2004-05-19 02:09:28.311781864 +0200 > > @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr > > if (page->mapping == mapping) { /* truncate race? */ > > wait_on_page_writeback(page); > > next = page->index + 1; > > - if (page_mapped(page)) > > + if (page_mapped(page)) { > > + ClearPageUptodate(page); > > clear_page_dirty(page); > > - else > > + } else > > invalidate_complete_page(mapping, page); > > } > > unlock_page(page); > > That's in 2.6.9. 
> > > and then this fix for BLKFLSBUF: > > > > diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c > > --- linux-2.6.8/mm/truncate.c 2004-09-21 15:39:37.850379202 +0200 > > +++ linux-2.6.8.SUSE/mm/truncate.c 2004-09-21 15:39:54.733668309 +0200 > > @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_ > > write_unlock_irq(&mapping->tree_lock); > > return 0; > > } > > + > > + BUG_ON(PagePrivate(page)); > > + if (page_count(page) != 2) { > > + write_unlock_irq(&mapping->tree_lock); > > + return 0; > > + } > > __remove_from_page_cache(page); > > write_unlock_irq(&mapping->tree_lock); > > ClearPageUptodate(page); > > @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr > > clear_page_dirty(page); > > ClearPageUptodate(page); > > } else { > > - invalidate_complete_page(mapping, page); > > + if (!invalidate_complete_page(mapping, > > + page)) { > > + clear_page_dirty(page); > > + ClearPageUptodate(page); > > + } > > } > > } > > unlock_page(page); > > > > That's in 2.6.9 as well. I need to think about this a bit more - the first > hunk is for read() versus BLKFLSBUF (if someone else is playing with the > page, leave it alone) whereas the second hunk is kinda buggy, because it > will reintroduce the EIO-in-read() error which the first hunk tried to fix. > > (You're using an rwlock for tree_lock. It's still unclear to me whether we > should go that way - it helps big SMP and hurts small SMP. Did you perform > an evaluation of this?) > > > > > apparently those weren't submitted to mainline yet, that could have > > created some confusion on why we submitted the third patch that avoids > > the BUG in pdflush and it avoids to try writing to disk a page that is > > not uptodate anymore. Do you see why it's wrong to pass down to pdflush > > a page that isn't utpodate? > > Sure. That's why there's a BUG() there ;) > > > That page must be re-read, it sure must not > > be written out. 
The semantics of mapped pages not uptodate are very well > > defined, and they pose no risk to the VM (modulo the bug in pdflush > > and this clear improvement in zap). > > > > > The unmapping from pagetables means that invalidate_complete_page() will > > > generally successfully remove the page from pagecache. That mostly fixes > > > the direct-write-of-mmapped-data coherency problem: the pages simply aren't > > > in pagecache any more so we'll surely redo physical I/O. > > > > the direct-write-of-mmapped-data coherency is the last thing we can care > > about. > But we can fix it. > > Even before thinking about direct-write-of-mmapped-data we must > > fix direct-write-of-cached-data (assuming it's not mmapped cache). That > > is a showstopper bug, > I don't think we have a bug here. After the sync(), ext3_releasepage() > will be able to release the buffers. But we should check for failures. > > The problem we're seeing is with mmapped pages. In that case, clearing > PG_uptodate is insufficient and directly invalidating the buffer_heads will > probably kill ext3 and we're not allowed to assume that page->private > contains bh's anyway... > > Probably it is sufficient to run try_to_release_page() in the page_mapped > leg of invalidate_inode_pages2(), and to check the return value. > > > the direct-write-of-mmapped-data is "undefined" by > > the linux API. Garbage can end up in the pagecache as far as the kernel > > is concerned (all we guarantee is that no random page contents are > > delivered to userspace). > > > > > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and > > > it really sucks that we're offering behaviour which only works most of the > > > time :( > > > > That doesn't ever need to work; I don't think we even need to add a pte > > shootdown, the behaviour has never been provided by design.
All we need > > to do is to fix the bh layer, and fix up the spots in zap to stop bugging > > out on dirty, not-uptodate pages, plus the above patches in this email; > > then it'll work as well as 2.4 and it will avoid breaking apps. Note > > that this is a very serious matter: what happens is that oracle writes > > with O_DIRECT, and when you do a tar.gz of the database the last block > > may be corrupt if it's not 512-byte aligned, if it was a partial page and > > it had partial bh. mysql uses O_DIRECT too. > The current buffered-vs-direct handling looks OK, as long as the pages > aren't mmapped, and as long as nobody dirties a page after > generic_file_direct_IO() has done filemap_write_and_wait(). > > For the former I'd like to unmap the pages. For the latter we should check > the invalidate_complete_page() return value. > > > I'm not particularly > > concerned about ext3, that can be fixed too, or we can add a new bitflag > > as you suggested to avoid fixing the filesystems to deal with that. But > > for the VM I'm comfortable that clearing the uptodate flag is much simpler, > > and it poses no risk to the VM. For the fs you may very well be right > > that the BH_reread_from_disk might be needed instead. > > > > So let's forget the mmapped case and let's fix the bh screwup in 2.6. > > Then we can think about the mmapped case. > The "bh screwup" _is_ the mmapped case. Maybe I said something different > yesterday - it's been a while since I looked in there. > > And to fix the mmapped case IMHO all we have to do is: > > > > clear_page_dirty > > ClearPageUptodate > > unmap_page_range > > > > Not the other way around. The uptodate bit must be cleared before zapping > > the pte to force a reload. > Something like this... > > > > > - When invalidating pages, take care to shoot down any ptes which map them > as well. > > This ensures that the next mmap access to the page will generate a major > fault, so NFS's server-side modifications are picked up.
> > This also allows us to call invalidate_complete_page() on all pages, so > filesytems such as ext3 get a chance to invalidate the buffer_heads. > > - Don't mark in-pagetable pages as non-uptodate any more. That broke a > previous guarantee that mapped-into-user-process pages are always uptodate. > > - Check the return value of invalidate_complete_page(). It can fail if > someone redirties a page after generic_file_direct_IO() write it back. > > But we still have a problem. If invalidate_inode_pages2() calls > unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache > pages. That will redirty the page's buffers and will cause > invalidate_complete_page() to fail. > > So, in generic_file_direct_IO() we do a complete pte shootdown on the file > up-front, prior to writing back dirty pagecache. This is only done for > O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full > mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but > permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke > other people's mapped pagecache. > > NFS also uses invalidate_inode_pages2() for handling server-side modification > notifications. But in the NFS case the clear_page_dirty() in > invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry > about the "dirty buffers against a clean page" problem. (I think) > > Signed-off-by: Andrew Morton <akpm@osdl.org> > --- > > 25-akpm/include/linux/fs.h | 2 - > 25-akpm/mm/filemap.c | 19 ++++++++++--- > 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------ > 3 files changed, 55 insertions(+), 29 deletions(-) > > diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c > --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004 > +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004 > @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp > * be marked dirty at any time too. 
So we re-check the dirtiness inside > * ->tree_lock. That provides exclusion against the __set_page_dirty > * functions. > + * > + * Returns non-zero if the page was successfully invalidated. > */ > static int > invalidate_complete_page(struct address_space *mapping, struct page *page) > @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str > EXPORT_SYMBOL(invalidate_inode_pages); > > /** > - * invalidate_inode_pages2 - remove all unmapped pages from an address_space > + * invalidate_inode_pages2 - remove all pages from an address_space > * @mapping - the address_space > * > - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case > - * where the page is seen to be mapped into process pagetables. In that case, > - * the page is marked clean but is left attached to its address_space. > - * > - * The page is also marked not uptodate so that a subsequent pagefault will > - * perform I/O to bringthe page's contents back into sync with its backing > - * store. > + * Any pages which are found to be mapped into pagetables are unmapped prior to > + * invalidation. > * > - * FIXME: invalidate_inode_pages2() is probably trivially livelockable. > + * Returns -EIO if any pages could not be invalidated. > */ > -void invalidate_inode_pages2(struct address_space *mapping) > +int invalidate_inode_pages2(struct address_space *mapping) > { > struct pagevec pvec; > pgoff_t next = 0; > int i; > + int ret = 0; > + int did_full_unmap = 0; > > pagevec_init(&pvec, 0); > - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > - for (i = 0; i < pagevec_count(&pvec); i++) { > + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > + for (i = 0; !ret && i < pagevec_count(&pvec); i++) { > struct page *page = pvec.pages[i]; > > lock_page(page); > - if (page->mapping == mapping) { /* truncate race? 
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty @ 2004-11-09 14:15 ` Dave Kleikamp 0 siblings, 0 replies; 44+ messages in thread From: Dave Kleikamp @ 2004-11-09 14:15 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli; +Cc: linux-kernel, linux-mm Andrew & Andrea, What is the status of this patch? It would be nice to have it in the -mm4 kernel. Thanks, Shaggy On Fri, 2004-10-22 at 18:24, Andrew Morton wrote: > Andrea Arcangeli <andrea@novell.com> wrote: > > > > On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote: > > > Yeah. I think the only 100% reliable way to do this is: > > > > > > lock_page() > > > unmap_mapping_range(page) > > > ClearPageUptodate(page); > > > invalidate(page); /* Try to free the thing */ > > > unlock_page(page); > > > > > > (Can do it for a whole range of pages if we always agree to lock pages in > > > file-index-ascending order). > > > > > > but hrm, we don't even have the locking there to prevent do_no_page() from > > > reinstantiating the pte immediately after the unmap_mapping_range(). > > > > > > So what to do? > > > > > > - The patch you originally sent has a race window which can be made > > > nonfatal by removing the BUG check in mpage_writepage(). > > > > > > - NFS should probably be taught to use unmap_mapping_range() regardless > > > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed > > > up. > > > > > > - invalidate_inode_pages2() isn't successfully invalidating the page if > > > it has buffers. > > > > I did it right in 2.4; somebody obviously broke the thing in 2.6. Now > > w.r.t. ext2/ext3 I never heard of a problem, and we nuke buffers > > regardless of whether the page is mapped or not (so this isn't a corner case).
> > > > define block_invalidate_page(page) discard_bh_page(page, 0, 0) > > > > } else { > > if (page->buffers) { > > /* Restart after this page */ > > list_del(head); > > list_add_tail(head, curr); > > > > page_cache_get(page); > > spin_unlock(&pagecache_lock); > > block_invalidate_page(page); > > } else > > unlocked = 0; > > > > ClearPageDirty(page); > This is incorrect for ext3 - we have to go through the a_ops to invalidate > the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3 > patch. > > > The guarantee is total in 2.4 w.r.t. buffers; at worst it can give > > ext3 a heart attack, but it won't risk losing coherency. Only 2.6 > > broke the O_DIRECT vs buffered I/O coherency protocol. > > Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK > in 2.6. > > > The semantics of not uptodate pages mapped in the address space are > > well understood and working fine in 2.4. > > I'd prefer to continue to disallow the mapping of non-uptodate pages into > process address space. It's indistinguishable from a security hole and I > think we can indeed preserve this invariant while fixing the mmap coherency > problem. > > > > This is the biggest problem, because the pages can trivially have > > > buffers - just write()ing to them will attach buffers, if they're ext2- > > > or ext3-backed. > > > > > > It means that a direct-IO write to a section of a file which is mmapped > > > causes pagecache and disk to get out of sync. Question is: do we care, > > > or do we say "don't do that"? Nobody seems to have noticed thus far and > > > it's a pretty dopey thing to be doing. > > > > This is a supported feature. People are writing with O_DIRECT and reading > > with buffered reads, and we must support that (think of a database writing > > and a tar.gz reading). The coherency is a must after the O_DIRECT has > > completed. If people run the operations at the same time, the result is > > undefined.
But by serializing the I/O at the userspace level (like > > stop_db(); tar.gz) we must provide guarantees to them. > > If the page is not mapped into pagetables then we'll invalidate it OK even > if it's backed by ext3. The only hole I can see is if someone comes in and > dirties the page via write() _after_ generic_file_direct_IO() has done its > filemap_write_and_wait(). > > And I agree that in that case we should propagate the > invalidate_complete_page() failure back to the direct-io write() caller. I wonder > how - via a short write, or via -EFOO? > > Probably we should sync all the ptes prior to a direct-io read as well. > > > > If we _do_ want to fix it, it seems the simplest approach would be to > > > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we > > > raced - try again" loop in there to handle the "do_no_page() > > > reinstantiated a pte" race. > > > > do_no_page doesn't need to be coherent. The pte is the last thing we > > can care about. The API only guarantees read/write syscalls to be > > coherent, not mmap. > > I want to fix the mmap case too. We currently have a half-fix for NFS > which doesn't work at all for (say) ext3. > > > The only bug is the bh one, and it can be fixed as easily as I did in 2.4. > > No, we need to call into the filesystem to kill the buffer_heads. > > > > Something like this? Slow as a dog, but possibly correct ;) > > > > > > > > > void invalidate_inode_pages2(struct address_space *mapping) > > > { > > > struct pagevec pvec; > > > pgoff_t next = 0; > > > int i; > > > > > > pagevec_init(&pvec, 0); > > > while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > > > for (i = 0; i < pagevec_count(&pvec); i++) { > > > struct page *page = pvec.pages[i]; > > > > > > lock_page(page); > > > if (page->mapping == mapping) { /* truncate race?
*/ > > > wait_on_page_writeback(page); > > > next = page->index + 1; > > > while (page_mapped(page)) { > > > unmap_mapping_range(mapping, > > > page->index << PAGE_CACHE_SHIFT, > > > page->index << PAGE_CACHE_SHIFT+1, > > > 0); > > > clear_page_dirty(page); > > > } > > > invalidate_complete_page(mapping, page); > > > } > > > unlock_page(page); > > > } > > > pagevec_release(&pvec); > > > cond_resched(); > > > } > > > } > > > > You know the page_mapped path isn't really very important as long as we > > execute both clear_page_dirty and ClearPageUptodate. > > It's important for NFS - it will allow mmaps to see server-side updates. > > > We must provide the > > guarantee to the syscall. Leaving not uptodate pages in the ptes is > > fine; the VM can deal with that. The semantics of that condition are that > > the page changed on disk due to O_DIRECT from under us, so the page isn't > > uptodate anymore. If anything, we can argue about ext3 not being capable > > of dealing with PG_uptodate going away from under it, but the VM is > > capable of that; there is only one spot that must be aware of it > > (the zap fix we posted to start the thread). > > > > BTW, we have these fixes in our tree to make it work: > > > > --- sles/mm/truncate.c.~1~ 2004-05-18 19:24:40.000000000 +0200 > > +++ sles/mm/truncate.c 2004-05-19 02:09:28.311781864 +0200 > > @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr > > if (page->mapping == mapping) { /* truncate race? */ > > wait_on_page_writeback(page); > > next = page->index + 1; > > - if (page_mapped(page)) > > + if (page_mapped(page)) { > > + ClearPageUptodate(page); > > clear_page_dirty(page); > > - else > > + } else > > invalidate_complete_page(mapping, page); > > } > > unlock_page(page); > > That's in 2.6.9.
> > and then this fix for BLKFLSBUF: > > > > diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c > > --- linux-2.6.8/mm/truncate.c 2004-09-21 15:39:37.850379202 +0200 > > +++ linux-2.6.8.SUSE/mm/truncate.c 2004-09-21 15:39:54.733668309 +0200 > > @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_ > > write_unlock_irq(&mapping->tree_lock); > > return 0; > > } > > + > > + BUG_ON(PagePrivate(page)); > > + if (page_count(page) != 2) { > > + write_unlock_irq(&mapping->tree_lock); > > + return 0; > > + } > > __remove_from_page_cache(page); > > write_unlock_irq(&mapping->tree_lock); > > ClearPageUptodate(page); > > @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr > > clear_page_dirty(page); > > ClearPageUptodate(page); > > } else { > > - invalidate_complete_page(mapping, page); > > + if (!invalidate_complete_page(mapping, > > + page)) { > > + clear_page_dirty(page); > > + ClearPageUptodate(page); > > + } > > } > > } > > unlock_page(page); > > > > That's in 2.6.9 as well. I need to think about this a bit more - the first > hunk is for read() versus BLKFLSBUF (if someone else is playing with the > page, leave it alone) whereas the second hunk is kinda buggy, because it > will reintroduce the EIO-in-read() error which the first hunk tried to fix. > > (You're using an rwlock for tree_lock. It's still unclear to me whether we > should go that way - it helps big SMP and hurts small SMP. Did you perform > an evaluation of this?) > > > > > Apparently those weren't submitted to mainline yet; that could have > > created some confusion about why we submitted the third patch, which avoids > > the BUG in pdflush and avoids trying to write to disk a page that is > > not uptodate anymore. Do you see why it's wrong to pass down to pdflush > > a page that isn't uptodate? > > Sure. That's why there's a BUG() there ;) > > > That page must be re-read; it sure must not > > be written out.
The semantics of mapped pages not uptodate are very well > > defined and they pose no risk to the VM (modulo the bug in pdflush > > and this clear improvement in zap). > > > > > The unmapping from pagetables means that invalidate_complete_page() will > > > generally successfully remove the page from pagecache. That mostly fixes > > > the direct-write-of-mmapped-data coherency problem: the pages simply aren't > > > in pagecache any more so we'll surely redo physical I/O. > > > > The direct-write-of-mmapped-data coherency is the last thing we can care > > about. > > But we can fix it. > > > Even before thinking about direct-write-of-mmapped-data we must > > fix direct-write-of-cached-data (assuming it's not mmapped cache). That > > is a showstopper bug. > > I don't think we have a bug here. After the sync(), ext3_releasepage() > will be able to release the buffers. But we should check for failures. > > The problem we're seeing is with mmapped pages. In that case, clearing > PG_uptodate is insufficient and directly invalidating the buffer_heads will > probably kill ext3, and we're not allowed to assume that page->private > contains bh's anyway... > > Probably it is sufficient to run try_to_release_page() in the page_mapped > leg of invalidate_inode_pages2(), and to check the return value. > > > The direct-write-of-mmapped-data case is "undefined" by > > the Linux API. Garbage can end up in the pagecache as far as the kernel > > is concerned (all we guarantee is that no random page contents are > > delivered to userspace). > > > > > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and > > > it really sucks that we're offering behaviour which only works most of the > > > time :( > > > > That doesn't ever need to work; I don't think we even need to add a pte > > shootdown. The behaviour has never been provided by design.
All we need > > to do is fix the bh layer, and fix up the spots in zap to stop bugging > > out on dirty not-uptodate pages, plus the above patches in this email; > > then it'll work as well as 2.4 and it will avoid breaking apps. Note > > that this is a very serious matter: what happens is that oracle writes > > with O_DIRECT, and when you do a tar.gz of the database the last block > > may be corrupt if it's not 512-byte aligned - if it was a partial page and > > it had partial bh. mysql uses O_DIRECT too. > > The current buffered-vs-direct handling looks OK, as long as the pages > aren't mmapped, and as long as nobody dirties a page after > generic_file_direct_IO() has done filemap_write_and_wait(). > > For the former I'd like to unmap the pages. For the latter we should check > the invalidate_complete_page() return value. > > > I'm not particularly > > concerned about ext3; that can be fixed too, or we can add a new bitflag > > as you suggested to avoid fixing the filesystems to deal with that. But > > for the VM I'm comfortable that clearing the uptodate bit is much simpler, > > and it poses no risk to the VM. For the fs you may very well be right > > that the BH_reread_from_disk might be needed instead. > > > > So let's forget the mmapped case and let's fix the bh screwup in 2.6. > > Then we can think about the mmapped case. > > The "bh screwup" _is_ the mmapped case. Maybe I said something different > yesterday - it's been a while since I looked in there. > > > And to fix the mmapped case IMHO all we have to do is: > > > > clear_page_dirty > > ClearPageUptodate > > unmap_page_range > > > > Not the other way around. The uptodate bit must be cleared before zapping > > the pte to force a reload. > > Something like this... > > > > > - When invalidating pages, take care to shoot down any ptes which map them > as well. > > This ensures that the next mmap access to the page will generate a major > fault, so NFS's server-side modifications are picked up.
This also allows us to call invalidate_complete_page() on all pages, so > filesystems such as ext3 get a chance to invalidate the buffer_heads. > > - Don't mark in-pagetable pages as non-uptodate any more. That broke a > previous guarantee that mapped-into-user-process pages are always uptodate. > > - Check the return value of invalidate_complete_page(). It can fail if > someone redirties a page after generic_file_direct_IO() writes it back. > > But we still have a problem. If invalidate_inode_pages2() calls > unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache > pages. That will redirty the page's buffers and will cause > invalidate_complete_page() to fail. > > So, in generic_file_direct_IO() we do a complete pte shootdown on the file > up-front, prior to writing back dirty pagecache. This is only done for > O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full > mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but > permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke > other people's mapped pagecache. > > NFS also uses invalidate_inode_pages2() for handling server-side modification > notifications. But in the NFS case the clear_page_dirty() in > invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry > about the "dirty buffers against a clean page" problem. (I think) > > Signed-off-by: Andrew Morton <akpm@osdl.org> > --- > > 25-akpm/include/linux/fs.h | 2 - > 25-akpm/mm/filemap.c | 19 ++++++++++--- > 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------ > 3 files changed, 55 insertions(+), 29 deletions(-) > > diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c > --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004 > +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004 > @@ -65,6 +65,8 @@ truncate_complete_page > * be marked dirty at any time too.
So we re-check the dirtiness inside > * ->tree_lock. That provides exclusion against the __set_page_dirty > * functions. > + * > + * Returns non-zero if the page was successfully invalidated. > */ > static int > invalidate_complete_page(struct address_space *mapping, struct page *page) > @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str > EXPORT_SYMBOL(invalidate_inode_pages); > > /** > - * invalidate_inode_pages2 - remove all unmapped pages from an address_space > + * invalidate_inode_pages2 - remove all pages from an address_space > * @mapping - the address_space > * > - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case > - * where the page is seen to be mapped into process pagetables. In that case, > - * the page is marked clean but is left attached to its address_space. > - * > - * The page is also marked not uptodate so that a subsequent pagefault will > - * perform I/O to bringthe page's contents back into sync with its backing > - * store. > + * Any pages which are found to be mapped into pagetables are unmapped prior to > + * invalidation. > * > - * FIXME: invalidate_inode_pages2() is probably trivially livelockable. > + * Returns -EIO if any pages could not be invalidated. > */ > -void invalidate_inode_pages2(struct address_space *mapping) > +int invalidate_inode_pages2(struct address_space *mapping) > { > struct pagevec pvec; > pgoff_t next = 0; > int i; > + int ret = 0; > + int did_full_unmap = 0; > > pagevec_init(&pvec, 0); > - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > - for (i = 0; i < pagevec_count(&pvec); i++) { > + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { > + for (i = 0; !ret && i < pagevec_count(&pvec); i++) { > struct page *page = pvec.pages[i]; > > lock_page(page); > - if (page->mapping == mapping) { /* truncate race? 
*/ > - wait_on_page_writeback(page); > - next = page->index + 1; > - if (page_mapped(page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > + if (page->mapping != mapping) { /* truncate race? */ > + unlock_page(page); > + continue; > + } > + wait_on_page_writeback(page); > + next = page->index + 1; > + while (page_mapped(page)) { > + if (!did_full_unmap) { > + /* > + * Zap the rest of the file in one hit. > + * FIXME: invalidate_inode_pages2() > + * should take start/end offsets. > + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + -1, 0); > + did_full_unmap = 1; > } else { > - if (!invalidate_complete_page(mapping, > - page)) { > - clear_page_dirty(page); > - ClearPageUptodate(page); > - } > + /* > + * Just zap this page > + */ > + unmap_mapping_range(mapping, > + page->index << PAGE_CACHE_SHIFT, > + (page->index << PAGE_CACHE_SHIFT)+1, > + 0); > } > } > + clear_page_dirty(page); > + if (!invalidate_complete_page(mapping, page)) > + ret = -EIO; > unlock_page(page); > } > pagevec_release(&pvec); > cond_resched(); > } > + return ret; > } > - > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h > --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004 > +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004 > @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino > S_ISLNK(inode->i_mode)) > invalidate_inode_pages(inode->i_mapping); > } > -extern void invalidate_inode_pages2(struct address_space *mapping); > +extern int invalidate_inode_pages2(struct address_space *mapping); > extern void write_inode_now(struct inode *, int); > extern int filemap_fdatawrite(struct address_space *); > extern int filemap_flush(struct address_space *); > diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c > --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004 
+++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004 > @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file > EXPORT_SYMBOL(generic_file_writev); > > /* > - * Called under i_sem for writes to S_ISREG files > + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something > + * went wrong during pagecache shootdown. > */ > ssize_t > generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki > struct address_space *mapping = file->f_mapping; > ssize_t retval; > > + /* > + * If it's a write, unmap all mmappings of the file up-front. This > + * will cause any pte dirty bits to be propagated into the pageframes > + * for the subsequent filemap_write_and_wait(). > + */ > + if (rw == WRITE && mapping_mapped(mapping)) > + unmap_mapping_range(mapping, 0, -1, 0); > + > retval = filemap_write_and_wait(mapping); > if (retval == 0) { > retval = mapping->a_ops->direct_IO(rw, iocb, iov, > offset, nr_segs); > - if (rw == WRITE && mapping->nrpages) > - invalidate_inode_pages2(mapping); > + if (rw == WRITE && mapping->nrpages) { > + int err = invalidate_inode_pages2(mapping); > + if (err) > + retval = err; > + } > } > return retval; > } > - > EXPORT_SYMBOL_GPL(generic_file_direct_IO); > _ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-11-09 14:15 ` Dave Kleikamp @ 2004-11-09 14:46 ` Andrea Arcangeli -1 siblings, 0 replies; 44+ messages in thread From: Andrea Arcangeli @ 2004-11-09 14:46 UTC (permalink / raw) To: Dave Kleikamp; +Cc: Andrew Morton, linux-kernel, linux-mm On Tue, Nov 09, 2004 at 08:15:30AM -0600, Dave Kleikamp wrote: > Andrew & Andrea, > What is the status of this patch? It would be nice to have it in the > -mm4 kernel. I think we should add an msync in front of O_DIRECT reads too (msync won't hurt other users, and it'll provide full coherency), everything else is ok (the msync can be added as an incremental patch). We applied it to SUSE (without the msync) to fix your crash in pdflush. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-11-09 14:46 ` Andrea Arcangeli @ 2004-11-09 19:51 ` Andrew Morton -1 siblings, 0 replies; 44+ messages in thread From: Andrew Morton @ 2004-11-09 19:51 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: shaggy, linux-kernel, linux-mm Andrea Arcangeli <andrea@novell.com> wrote: > > On Tue, Nov 09, 2004 at 08:15:30AM -0600, Dave Kleikamp wrote: > > Andrew & Andrea, > > What is the status of this patch? It would be nice to have it in the > > -mm4 kernel. > > I think we should add an msync in front of O_DIRECT reads too (msync > won't hurt other users, and it'll provide full coherency), everything > else is ok (the msync can be added as an incremental patch). I don't think we have a simple way of syncing all ptes which map the pages without actually shooting those pte's down, via zap_page_range(). A filemap_sync() will only sync the caller's mm's ptes. I guess it would be pretty simple to add a sync_but_dont_unmap field to struct zap_details, and propagate that down. So we can reuse all the unmap_vmas() code for an all-mms pte sync. It could all get very expensive if someone has a bit of the file mapped though. Testing is needed there. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty 2004-11-09 14:15 ` Dave Kleikamp @ 2004-11-09 19:46 ` Andrew Morton -1 siblings, 0 replies; 44+ messages in thread From: Andrew Morton @ 2004-11-09 19:46 UTC (permalink / raw) To: Dave Kleikamp; +Cc: andrea, linux-kernel, linux-mm Dave Kleikamp <shaggy@austin.ibm.com> wrote: > > Andrew & Andrea, > What is the status of this patch? It would be nice to have it in the > -mm4 kernel. > I found that the patch was causing I/O errors to be reported with one of the LTP tests: testcases/bin/growfiles -W gf15 -b -e 1 -u -r 1-49600 -I r -u -i 0 -L 120 Lgfile1 So I dropped it out pending a bit more work. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2004-11-09 19:51 UTC | newest] Thread overview: 44+ messages -- links below jump to the message on this page -- 2004-10-21 21:15 [PATCH] zap_pte_range should not mark non-uptodate pages dirty Dave Kleikamp 2004-10-21 21:45 ` Andrew Morton 2004-10-21 22:36 ` Andrea Arcangeli 2004-10-21 23:02 ` Andrew Morton 2004-10-21 23:20 ` Andrea Arcangeli 2004-10-21 23:42 ` Andrew Morton 2004-10-22 0:15 ` Andrew Morton 2004-10-22 0:41 ` Andrea Arcangeli 2004-10-22 2:51 ` Rik van Riel 2004-10-22 16:19 ` Andrea Arcangeli 2004-10-22 0:30 ` Andrea Arcangeli 2004-10-22 1:22 ` Andrea Arcangeli 2004-10-22 2:03 ` Andrew Morton 2004-10-22 16:17 ` Andrea Arcangeli 2004-10-22 17:04 ` Andrea Arcangeli 2004-10-22 23:24 ` Andrew Morton 2004-10-25 13:58 ` Dave Kleikamp 2004-10-26 0:35 ` Andrea Arcangeli 2004-11-09 14:15 ` Dave Kleikamp 2004-11-09 14:46 ` Andrea Arcangeli 2004-11-09 19:51 ` Andrew Morton 2004-11-09 19:46 ` Andrew Morton