* Why doesn't zap_pte_range() call page_mkwrite()
@ 2009-04-23 18:17 ` Trond Myklebust
  0 siblings, 0 replies; 45+ messages in thread
From: Trond Myklebust @ 2009-04-23 18:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-nfs, Linux Filesystem Development

Hi Nick,

I'm still working on the bug in
http://bugzilla.kernel.org/show_bug.cgi?id=12913 . One other source of
grief appears to be munmap(), which is calling set_page_dirty() on a
number of pages without locking them or first calling page_mkwrite(). 

Currently, this means that we either ignore that dirty bit (since
nfs_page_async_flush() won't find a corresponding write request) or it
too can end up triggering the PG_CLEAN BUG() in fs/nfs/write.c:252 if
the timing is right.

So what is the reason why zap_pte_range() calls set_page_dirty()
directly?

Cheers
  Trond


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
@ 2009-04-23 19:52     ` Miklos Szeredi
  0 siblings, 0 replies; 45+ messages in thread
From: Miklos Szeredi @ 2009-04-23 19:52 UTC (permalink / raw)
  To: trond.myklebust; +Cc: npiggin, linux-nfs, linux-fsdevel

On Thu, 23 Apr 2009, Trond Myklebust wrote:
> I'm still working on the bug in
> http://bugzilla.kernel.org/show_bug.cgi?id=12913 . One other source of
> grief appears to be munmap(), which is calling set_page_dirty() on a
> number of pages without locking them or first calling page_mkwrite(). 
> 
> Currently, this means that we either ignore that dirty bit (since
> nfs_page_async_flush() won't find a corresponding write request) or it
> too can end up triggering the PG_CLEAN BUG() in fs/nfs/write.c:252 if
> the timing is right.
> 
> So what is the reason why zap_pte_range() calls set_page_dirty()
> directly?

In the old times this was one of the main ways of transferring the pte
dirtiness to the PG_dirty page flag.

Now this is mostly done at page fault time, and the pte's are always
being re-protected whenever the PG_dirty flag is cleared (see
page_mkclean()).

But in some cases (shmfs being the example I know) pages are not write
protected and so zap_pte_range(), and other functions, still need to
transfer the pte dirtiness to the page flag.

Not sure how this matters to NFS though.  If the above is correct,
then the set_page_dirty() call in zap_pte_range() should always result
in a no-op, since the PG_dirty flag would already have been set by the
page fault...

Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread
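
The normal cycle described in the message above (dirty at fault time,
re-protect when PG_dirty is cleared) can be sketched as a small userspace
toy model. This is only an illustration, not kernel code: the `Pte` and
`Page` classes and the `store`, `start_writeback` and `zap` helpers are
hypothetical stand-ins, and `mkwrite_calls` is an assumed counter standing
in for the filesystem's ->page_mkwrite() notification.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    writable: bool = False  # write-protected until the first write fault
    dirty: bool = False


@dataclass
class Page:
    pg_dirty: bool = False
    mkwrite_calls: int = 0  # how often the filesystem was notified


def store(page: Page, pte: Pte) -> None:
    """One userspace store through a shared file mapping."""
    if not pte.writable:
        # Write fault: the filesystem is notified (->page_mkwrite() stand-in)
        # and the page is dirtied at fault time.
        page.mkwrite_calls += 1
        pte.writable = True
        page.pg_dirty = True
    pte.dirty = True  # set on every store, with or without a fault


def start_writeback(page: Page, pte: Pte) -> None:
    """Clear PG_dirty and re-protect the pte (page_mkclean() stand-in)."""
    pte.writable = False
    pte.dirty = False
    page.pg_dirty = False


def zap(page: Page, pte: Pte) -> bool:
    """Tear down the pte; return True if this changed PG_dirty."""
    changed = pte.dirty and not page.pg_dirty
    if pte.dirty:
        page.pg_dirty = True
    pte.writable = pte.dirty = False
    return changed
```

In this model a store after `start_writeback()` always faults again, so
the dirty transfer in `zap()` never changes PG_dirty. That matches the
"should always result in a no-op" expectation, under the assumption that
re-protection never fails.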

* Re: Why doesn't zap_pte_range() call page_mkwrite()
@ 2009-04-23 20:42         ` Trond Myklebust
  0 siblings, 0 replies; 45+ messages in thread
From: Trond Myklebust @ 2009-04-23 20:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: npiggin, linux-nfs, linux-fsdevel

On Thu, 2009-04-23 at 21:52 +0200, Miklos Szeredi wrote:
> On Thu, 23 Apr 2009, Trond Myklebust wrote:
> > I'm still working on the bug in
> > http://bugzilla.kernel.org/show_bug.cgi?id=12913 . One other source of
> > grief appears to be munmap(), which is calling set_page_dirty() on a
> > number of pages without locking them or first calling page_mkwrite(). 
> > 
> > Currently, this means that we either ignore that dirty bit (since
> > nfs_page_async_flush() won't find a corresponding write request) or it
> > too can end up triggering the PG_CLEAN BUG() in fs/nfs/write.c:252 if
> > the timing is right.
> > 
> > So what is the reason why zap_pte_range() calls set_page_dirty()
> > directly?
> 
> In the old times this was one of the main ways of transferring the pte
> dirtiness to the PG_dirty page flag.
> 
> Now this is mostly done at page fault time, and the pte's are always
> being re-protected whenever the PG_dirty flag is cleared (see
> page_mkclean()).
> 
> But in some cases (shmfs being the example I know) pages are not write
> protected and so zap_pte_range(), and other functions, still need to
> transfer the pte dirtiness to the page flag.

My main worry is that this is all happening at munmap() time. There
shouldn't be any more page faults after that completes (am I right?), so
what other mechanism would transfer the pte dirtiness?

> Not sure how this matters to NFS though.  If the above is correct,
> then the set_page_dirty() call in zap_pte_range() should always result
> in a no-op, since the PG_dirty flag would already have been set by the
> page fault...

If I can ignore the dirty flag on these occasions, then that would be
great. That would enable me to get rid of that BUG_ON(PG_CLEAN) in
write.c, and close the bug...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-23 20:42         ` Trond Myklebust
@ 2009-04-24  7:15           ` Miklos Szeredi
  -1 siblings, 0 replies; 45+ messages in thread
From: Miklos Szeredi @ 2009-04-24  7:15 UTC (permalink / raw)
  To: trond.myklebust; +Cc: miklos, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Thu, 23 Apr 2009, Trond Myklebust wrote:
> On Thu, 2009-04-23 at 21:52 +0200, Miklos Szeredi wrote:
> > Now this is mostly done at page fault time, and the pte's are always
> > being re-protected whenever the PG_dirty flag is cleared (see
> > page_mkclean()).
> > 
> > But in some cases (shmfs being the example I know) pages are not write
> > protected and so zap_pte_range(), and other functions, still need to
> > transfer the pte dirtiness to the page flag.
> 
> My main worry is that this is all happening at munmap() time. There
> shouldn't be any more page faults after that completes (am I right?), so
> what other mechanism would transfer the pte dirtiness?

After munmap() a page fault will result in SIGSEGV.  A write access
during munmap(), when the vma has been removed but the page table is
still intact, is more interesting.  But in that case the write fault
should also result in a SEGV, because it won't be able to find the
matching VMA.

Now let's see what happens if writeback is started against the page
during this limbo period.  page_mkclean() is called, which doesn't
find the vma, so it doesn't re-protect the pte.  But PG_dirty will
be cleared regardless.  So AFAICS it can happen that the pte remains
dirty but the page is clean.

And in that case that set_page_dirty() in zap_pte_range() is
important, since the page could have been dirtied through the mapping
after the writeback finished.

> > Not sure how this matters to NFS though.  If the above is correct,
> > then the set_page_dirty() call in zap_pte_range() should always result
> > in a no-op, since the PG_dirty flag would already have been set by the
> > page fault...
> 
> If I can ignore the dirty flag on these occasions, then that would be
> great. That would enable me to get rid of that BUG_ON(PG_CLEAN) in
> write.c, and close the bug...

I don't think you can ignore the dirty flag...  

Hmm, I guess this is a bit nasty: the VM promises filesystems that
->page_mkwrite() will be called when the page is dirtied through a
mapping, _almost_ all of the time.  Except when munmap happens to race
with clear_page_dirty_for_io().

I don't have any ideas how this could be fixed, CC-ing linux-mm...

Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread
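
The limbo-period race described in the message above can be replayed in a
small userspace toy model. This is a sketch of the scenario as stated
there, not kernel code, and a later reply in the thread questions whether
page_mkclean() can really miss the vma. The `State` class, the `vma_found`
parameter and the helper names are hypothetical; they borrow kernel names
only for readability.

```python
from dataclasses import dataclass


@dataclass
class State:
    pte_writable: bool = False
    pte_dirty: bool = False
    pg_dirty: bool = False
    mkwrite_calls: int = 0  # ->page_mkwrite() notifications seen by the fs


def write_fault(s: State) -> None:
    """First store through the mapping: fs is notified, page is dirtied."""
    s.mkwrite_calls += 1
    s.pte_writable = True
    s.pte_dirty = True
    s.pg_dirty = True


def clear_dirty_for_io(s: State, vma_found: bool) -> None:
    """Writeback start: re-protect the pte only if the vma can be found."""
    if vma_found:
        s.pte_writable = False
        s.pte_dirty = False
    s.pg_dirty = False  # cleared regardless


def zap(s: State) -> None:
    """munmap() teardown: transfer pte dirtiness without notifying the fs."""
    if s.pte_dirty:
        s.pg_dirty = True
    s.pte_writable = False
    s.pte_dirty = False


def limbo_race() -> State:
    """Replay: fault, writeback during the limbo period, then zap."""
    s = State()
    write_fault(s)
    clear_dirty_for_io(s, vma_found=False)
    zap(s)
    return s
```

Under these assumptions `limbo_race()` ends with PG_dirty set again while
the filesystem has been notified only once, which is the state that the
set_page_dirty() call in zap_pte_range() would hand to the filesystem.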

* Re: Why doesn't zap_pte_range() call page_mkwrite()
@ 2009-04-24  7:33               ` Miklos Szeredi
  0 siblings, 0 replies; 45+ messages in thread
From: Miklos Szeredi @ 2009-04-24  7:33 UTC (permalink / raw)
  To: trond.myklebust; +Cc: miklos, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, 24 Apr 2009, Miklos Szeredi wrote:
> Hmm, I guess this is a bit nasty: the VM promises filesystems that
> ->page_mkwrite() will be called when the page is dirtied through a
> mapping, _almost_ all of the time.  Except when munmap happens to race
> with clear_page_dirty_for_io().
> 
> I don't have any ideas how this could be fixed, CC-ing linux-mm...

On second thought, we could possibly just ignore the dirty bit in that
case.  Trying to write to a mapping _during_ munmap() will have pretty
undefined results; I don't think any sane application out there should
rely on the results of this.

But who knows, the world is a weird place...

Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
@ 2009-04-24 10:41               ` Robin Holt
  0 siblings, 0 replies; 45+ messages in thread
From: Robin Holt @ 2009-04-24 10:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, Apr 24, 2009 at 09:15:22AM +0200, Miklos Szeredi wrote:
> On Thu, 23 Apr 2009, Trond Myklebust wrote:
> > On Thu, 2009-04-23 at 21:52 +0200, Miklos Szeredi wrote:
> > > Now this is mostly done at page fault time, and the pte's are always
> > > being re-protected whenever the PG_dirty flag is cleared (see
> > > page_mkclean()).
> > > 
> > > But in some cases (shmfs being the example I know) pages are not write
> > > protected and so zap_pte_range(), and other functions, still need to
> > > transfer the pte dirtiness to the page flag.
> > 
> > My main worry is that this is all happening at munmap() time. There
> > shouldn't be any more page faults after that completes (am I right?), so
> > what other mechanism would transfer the pte dirtiness?
> 
> After munmap() a page fault will result in SIGSEGV.  A write access
> during munmap(), when the vma has been removed but the page table is
> still intact is more interesting.  But in that case the write fault
> should also result in a SEGV, because it won't be able to find the
> matching VMA.
> 
> Now let's see what happens if writeback is started against the page
> during this limbo period.  page_mkclean() is called, which doesn't
> find the vma, so it doesn't re-protect the pte.  But the PG_dirty will

I am not sure how you came to this conclusion.  The address_space has
the vma's chained together and protected by the i_mmap_lock.  That is
acquired prior to the cleaning operation.  Additionally, the cleaning
operation walks the process's page tables and will remove/write-protect
the page before releasing the i_mmap_lock.

Maybe I misunderstand.  I hope I have not added confusion.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-24  7:33               ` Miklos Szeredi
@ 2009-04-24 12:59                 ` Chris Mason
  -1 siblings, 0 replies; 45+ messages in thread
From: Chris Mason @ 2009-04-24 12:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, 2009-04-24 at 09:33 +0200, Miklos Szeredi wrote:
> On Fri, 24 Apr 2009, Miklos Szeredi wrote:
> > Hmm, I guess this is a bit nasty: the VM promises filesystems that
> > ->page_mkwrite() will be called when the page is dirtied through a
> > mapping, _almost_ all of the time.  Except when munmap happens to race
> > with clear_page_dirty_for_io().
> > 
> > I don't have any ideas how this could be fixed, CC-ing linux-mm...
> 
> On second thought, we could possibly just ignore the dirty bit in that
> case.  Trying to write to a mapping _during_ munmap() will have pretty
> undefined results; I don't think any sane application out there should
> rely on the results of this.
> 
> But who knows, the world is a weird place...

It does happen in practice: btrfs has fallback code that triggers
page_mkwrite when it finds a dirty page that wasn't dirtied with help
from the FS.

I'd love to get rid of the fallback ;)

-chris



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-24 12:59                 ` Chris Mason
@ 2009-04-24 13:31                   ` Trond Myklebust
  -1 siblings, 0 replies; 45+ messages in thread
From: Trond Myklebust @ 2009-04-24 13:31 UTC (permalink / raw)
  To: Chris Mason; +Cc: Miklos Szeredi, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, 2009-04-24 at 08:59 -0400, Chris Mason wrote:
> On Fri, 2009-04-24 at 09:33 +0200, Miklos Szeredi wrote:
> > On Fri, 24 Apr 2009, Miklos Szeredi wrote:
> > > Hmm, I guess this is a bit nasty: the VM promises filesystems that
> > > ->page_mkwrite() will be called when the page is dirtied through a
> > > mapping, _almost_ all of the time.  Except when munmap happens to race
> > > with clear_page_dirty_for_io().
> > > 
> > > I don't have any ideas how this could be fixed, CC-ing linux-mm...
> > 
> > On second thought, we could possibly just ignore the dirty bit in that
> > case.  Trying to write to a mapping _during_ munmap() will have pretty
> > undefined results; I don't think any sane application out there should
> > rely on the results of this.
> > 
> > But who knows, the world is a weird place...
> 
> It does happen in practice, btrfs has fallback code that triggers the
> page_mkwrite when it finds a dirty page that wasn't dirtied with help
> from the FS.
> 
> I'd love to get rid of the fallback ;)

So is there any reason why we shouldn't put calls to page_mkwrite in
zap_pte_range?

The only alternative I can think of would be to unmap the page when the
filesystem starts to write it out in order to force another page fault
if the user application writes more data into that page.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-24 13:31                   ` Trond Myklebust
@ 2009-04-24 14:06                     ` Trond Myklebust
  -1 siblings, 0 replies; 45+ messages in thread
From: Trond Myklebust @ 2009-04-24 14:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: Miklos Szeredi, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, 2009-04-24 at 09:31 -0400, Trond Myklebust wrote:
> The only alternative I can think of would be to unmap the page when the
> filesystem starts to write it out in order to force another page fault
> if the user application writes more data into that page.

Actually, this might be fairly trivial to implement in NFS. We'd tag the
nfs_page request as having been created by page_mkwrite(), then unmap
any such tagged page in the ->writepage() callback (assuming that
calling unmap_mapping_range() from ->writepage() is allowed?).

AFAICS that should get rid of those residual dirty ptes in sys_munmap().

Cheers
  Trond

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-24 10:41               ` Robin Holt
  (?)
  (?)
@ 2009-04-24 14:52               ` Miklos Szeredi
       [not found]                 ` <E1LxMlO-0000sU-1J-8f8m9JG5TPIdUIPVzhDTVZP2KDSNp7ea@public.gmane.org>
  -1 siblings, 1 reply; 45+ messages in thread
From: Miklos Szeredi @ 2009-04-24 14:52 UTC (permalink / raw)
  To: holt; +Cc: miklos, trond.myklebust, npiggin, linux-nfs, linux-fsdevel, linux-mm

On Fri, 24 Apr 2009, Robin Holt wrote:
> I am not sure how you came to this conclusion.  The address_space has
> the vma's chained together and protected by the i_mmap_lock.  That is
> acquired prior to the cleaning operation.  Additionally, the cleaning
> operation walks the process's page tables and will remove/write-protect
> the page before releasing the i_mmap_lock.
> 
> Maybe I misunderstand.  I hope I have not added confusion.

Looking more closely, I think you're right.

I thought that detach_vmas_to_be_unmapped() also removed them from
mapping->i_mmap, but that is not the case, it only removes them from
the process's mm_struct.  The vma is only removed from ->i_mmap in
unmap_region() _after_ zapping the pte's.

This means that while the pte zapping is going on, any page faults
will fail but page_mkclean() (and all of rmap) will continue to work.

But then I don't see how we get a dirty pte without also first getting
a page fault.  Weird...
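
The point about write-protection can be poked at with a toy user-space model in C. This is only a sketch and not kernel code: `model_page` and the `model_*` helpers are made-up names, there is no locking, and no real race is reproduced. It only counts how often the filesystem hook would run, assuming the cleaning step either does or does not write-protect the pte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for one page mapped by one pte (hypothetical model, not kernel code). */
struct model_page {
    bool page_dirty;    /* stands in for the page's dirty flag */
    bool pte_present;
    bool pte_write;     /* pte allows stores without faulting */
    bool pte_dirty;     /* dirty bit set by "hardware" on a store */
    int  mkwrite_calls; /* how often the filesystem hook was invoked */
};

/* A store through the mapping: faults first if the pte is not writable. */
static void model_user_store(struct model_page *p)
{
    assert(p != NULL);
    if (!p->pte_present || !p->pte_write) {
        p->mkwrite_calls++;      /* write fault: notify the filesystem */
        p->pte_present = true;
        p->pte_write = true;
    }
    p->pte_dirty = true;
}

/* Start of writeout: clean the page, optionally write-protecting the pte. */
static void model_start_writeout(struct model_page *p, bool write_protect)
{
    assert(p != NULL);
    p->page_dirty = false;
    if (write_protect && p->pte_present) {
        p->pte_write = false;    /* the next store must fault again */
        p->pte_dirty = false;
    }
}

/* Zapping the pte: a dirty pte dirties the page without any notification. */
static void model_zap(struct model_page *p)
{
    assert(p != NULL);
    if (p->pte_present && p->pte_dirty)
        p->page_dirty = true;
    p->pte_present = false;
    p->pte_write = false;
    p->pte_dirty = false;
}
```

Running store, writeout, store, zap through this model gives two notifications when the cleaning step write-protects the pte, and only one (with the page still ending up dirty) when it does not.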

Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-24  7:33               ` Miklos Szeredi
                                 ` (2 preceding siblings ...)
  (?)
@ 2009-04-24 16:18               ` Jamie Lokier
  -1 siblings, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2009-04-24 16:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, npiggin, linux-nfs, linux-fsdevel, linux-mm

Miklos Szeredi wrote:
> On Fri, 24 Apr 2009, Miklos Szeredi wrote:
> > Hmm, I guess this is a bit nasty: the VM promises filesystems that
> > ->page_mkwrite() will be called when the page is dirtied through a
> > mapping, _almost_ all of the time.  Except when munmap happens to race
> > with clear_page_dirty_for_io().
> > 
> > I don't have any ideas how this could be fixed, CC-ing linux-mm...
> 
> On second thought, we could possibly just ignore the dirty bit in that
> case.  Trying to write to a mapping _during_ munmap() will have pretty
> undefined results, I don't think any sane application out there should
> rely on the results of this.
> 
> But who knows, the world is a weird place...

I think it's a sane but unusual thing to do.

App has a thread writing to random places in a mapped file, and
another calling munmap() or mprotect() to trap writes to some parts of
the file in order to track what parts the first thread is dirtying.
Second thread's SIGSEGV handler reinstates those mappings.  First
thread doesn't know about any of this, it just writes and the only
side effect is timing.  Or should be.

Think garbage collection, change tracking, tracing, and debugging.
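
A minimal single-threaded sketch of that kind of write tracking, assuming Linux and an anonymous shared mapping in place of a real mapped file. All `track_*` names are invented for illustration. The writer code just stores into the buffer; the SIGSEGV handler notes the page and reinstates write access.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACK_MAX_PAGES 64

static volatile char *track_base;   /* start of the tracked mapping */
static size_t track_npages;
static size_t track_pagesz;
static volatile sig_atomic_t track_dirty[TRACK_MAX_PAGES];

/* SIGSEGV handler: record which page was written, then re-enable writes. */
static void track_on_segv(int sig, siginfo_t *si, void *ctx)
{
    char *addr = (char *)si->si_addr;
    char *base = (char *)track_base;
    (void)sig;
    (void)ctx;

    if (addr < base || addr >= base + track_npages * track_pagesz)
        _exit(1);                   /* a genuine fault outside the range */

    size_t idx = (size_t)(addr - base) / track_pagesz;
    track_dirty[idx] = 1;
    /* mprotect() is not on the POSIX async-signal-safe list; calling it
     * here is an assumption that holds in practice on Linux. */
    if (mprotect(base + idx * track_pagesz, track_pagesz,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Forget recorded writes and write-protect the whole range again. */
static int track_rearm(void)
{
    for (size_t i = 0; i < track_npages; i++)
        track_dirty[i] = 0;
    return mprotect((void *)track_base, track_npages * track_pagesz, PROT_READ);
}

static int track_is_dirty(size_t idx)
{
    assert(idx < track_npages);
    return track_dirty[idx] != 0;
}

/* Map npages of shared memory, install the handler and arm the trap. */
static volatile char *track_start(size_t npages)
{
    struct sigaction sa;

    if (npages == 0 || npages > TRACK_MAX_PAGES)
        return NULL;
    track_pagesz = (size_t)sysconf(_SC_PAGESIZE);
    track_npages = npages;

    void *p = mmap(NULL, npages * track_pagesz, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    track_base = p;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = track_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;

    return track_rearm() == 0 ? track_base : NULL;
}
```

After `track_start(n)`, the first store into each page takes one fault and later stores to that page run at full speed until `track_rearm()` is called, so the writer sees nothing but a timing difference.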

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
       [not found]                 ` <E1LxMlO-0000sU-1J-8f8m9JG5TPIdUIPVzhDTVZP2KDSNp7ea@public.gmane.org>
  2009-04-24 17:00                     ` Trond Myklebust
@ 2009-04-24 17:00                     ` Trond Myklebust
  0 siblings, 0 replies; 45+ messages in thread
From: Trond Myklebust @ 2009-04-24 17:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: holt-sJ/iWh9BUns, npiggin-l3A5Bk7waGM,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri, 2009-04-24 at 16:52 +0200, Miklos Szeredi wrote:
> On Fri, 24 Apr 2009, Robin Holt wrote:
> > I am not sure how you came to this conclusion.  The address_space has
> > the vma's chained together and protected by the i_mmap_lock.  That is
> > acquired prior to the cleaning operation.  Additionally, the cleaning
> > operation walks the process's page tables and will remove/write-protect
> > the page before releasing the i_mmap_lock.
> > 
> > Maybe I misunderstand.  I hope I have not added confusion.
> 
> Looking more closely, I think you're right.
> 
> I thought that detach_vmas_to_be_unmapped() also removed them from
> mapping->i_mmap, but that is not the case, it only removes them from
> the process's mm_struct.  The vma is only removed from ->i_mmap in
> unmap_region() _after_ zapping the pte's.
> 
> This means that while the pte zapping is going on, any page faults
> will fail but page_mkclean() (and all of rmap) will continue to work.
> 
> But then I don't see how we get a dirty pte without also first getting
> a page fault.  Weird...

You don't, but unless you unmap the page when you write it out, you will
not get any further page faults. The VM will just redirty the page
without calling page_mkwrite().

As I said, I think I can fix the NFS problem by simply unmapping the
page inside ->writepage() whenever we know the write request was
originally set up by a page fault.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
       [not found]                     ` <1240592448.4946.35.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-04-25  5:10                         ` Nick Piggin
@ 2009-04-25  5:10                         ` Nick Piggin
  0 siblings, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2009-04-25  5:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Miklos Szeredi, holt-sJ/iWh9BUns,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri, Apr 24, 2009 at 01:00:48PM -0400, Trond Myklebust wrote:
> On Fri, 2009-04-24 at 16:52 +0200, Miklos Szeredi wrote:
> > On Fri, 24 Apr 2009, Robin Holt wrote:
> > > I am not sure how you came to this conclusion.  The address_space has
> > > the vma's chained together and protected by the i_mmap_lock.  That is
> > > acquired prior to the cleaning operation.  Additionally, the cleaning
> > > operation walks the process's page tables and will remove/write-protect
> > > the page before releasing the i_mmap_lock.
> > > 
> > > Maybe I misunderstand.  I hope I have not added confusion.
> > 
> > Looking more closely, I think you're right.
> > 
> > I thought that detach_vmas_to_be_unmapped() also removed them from
> > mapping->i_mmap, but that is not the case, it only removes them from
> > the process's mm_struct.  The vma is only removed from ->i_mmap in
> > unmap_region() _after_ zapping the pte's.
> > 
> > This means that while the pte zapping is going on, any page faults
> > will fail but page_mkclean() (and all of rmap) will continue to work.
> > 
> > But then I don't see how we get a dirty pte without also first getting
> > a page fault.  Weird...
> 
> You don't, but unless you unmap the page when you write it out, you will
> not get any further page faults. The VM will just redirty the page
> without calling page_mkwrite().

Why? It should call page_mkwrite...

 
> As I said, I think I can fix the NFS problem by simply unmapping the
> page inside ->writepage() whenever we know the write request was
> originally set up by a page fault.

The biggest outstanding problem we have remaining is get_user_pages.
Callers are only required to hold a ref on the page and then they
can call set_page_dirty at any point after that.

I have a half-done patch somewhere to add a put_user_pages, and then
we could probably go from there to pinning the fs metadata (whether
by using the page lock or something else, I don't quite know).
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-04-25  5:10                         ` Nick Piggin
  (?)
  (?)
@ 2009-09-08 15:30                         ` Chris Mason
  2009-09-08 15:41                             ` Nick Piggin
                                             ` (5 more replies)
  -1 siblings, 6 replies; 45+ messages in thread
From: Chris Mason @ 2009-09-08 15:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Trond Myklebust, Miklos Szeredi, holt, linux-nfs, linux-fsdevel,
	linux-mm

On Sat, Apr 25, 2009 at 07:10:28AM +0200, Nick Piggin wrote:
> On Fri, Apr 24, 2009 at 01:00:48PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-04-24 at 16:52 +0200, Miklos Szeredi wrote:
> > > On Fri, 24 Apr 2009, Robin Holt wrote:
> > > > I am not sure how you came to this conclusion.  The address_space has
> > > > the vma's chained together and protected by the i_mmap_lock.  That is
> > > > acquired prior to the cleaning operation.  Additionally, the cleaning
> > > > operation walks the process's page tables and will remove/write-protect
> > > > the page before releasing the i_mmap_lock.
> > > > 
> > > > Maybe I misunderstand.  I hope I have not added confusion.
> > > 
> > > Looking more closely, I think you're right.
> > > 
> > > I thought that detach_vmas_to_be_unmapped() also removed them from
> > > mapping->i_mmap, but that is not the case, it only removes them from
> > > the process's mm_struct.  The vma is only removed from ->i_mmap in
> > > unmap_region() _after_ zapping the pte's.
> > > 
> > > This means that while the pte zapping is going on, any page faults
> > > will fail but page_mkclean() (and all of rmap) will continue to work.
> > > 
> > > But then I don't see how we get a dirty pte without also first getting
> > > a page fault.  Weird...
> > 
> > You don't, but unless you unmap the page when you write it out, you will
> > not get any further page faults. The VM will just redirty the page
> > without calling page_mkwrite().
> 
> Why? It should call page_mkwrite...
> 
>  
> > As I said, I think I can fix the NFS problem by simply unmapping the
> > page inside ->writepage() whenever we know the write request was
> > originally set up by a page fault.
> 
> The biggest outstanding problem we have remaining is get_user_pages.
> Callers are only required to hold a ref on the page and then they
> can call set_page_dirty at any point after that.
> 
> I have a half-done patch somewhere to add a put_user_pages, and then
> we could probably go from there to pinning the fs metadata (whether
> by using the page lock or something else, I don't quite know).

Hi everyone,

Sorry for digging up an old thread, but is there any reason we can't
just use page_mkwrite here?  I'd love to get rid of the btrfs code to
detect places that use set_page_dirty without a page_mkwrite.

-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-09-08 15:30                         ` Chris Mason
  2009-09-08 15:41                             ` Nick Piggin
  2009-09-08 15:41                           ` Nick Piggin
@ 2009-09-08 15:41                           ` Nick Piggin
  2009-09-09  2:21                           ` Christoph Hellwig
                                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2009-09-08 15:41 UTC (permalink / raw)
  To: Chris Mason, Trond Myklebust, Miklos Szeredi, holt, linux-nfs,
	linux-fsdevel

On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > > As I said, I think I can fix the NFS problem by simply unmapping the
> > > page inside ->writepage() whenever we know the write request was
> > > originally set up by a page fault.
> > 
> > The biggest outstanding problem we have remaining is get_user_pages.
> > Callers are only required to hold a ref on the page and then they
> > can call set_page_dirty at any point after that.
> > 
> > I have a half-done patch somewhere to add a put_user_pages, and then
> > we could probably go from there to pinning the fs metadata (whether
> > by using the page lock or something else, I don't quite know).
> 
> Hi everyone,
> 
> Sorry for digging up an old thread, but is there any reason we can't
> just use page_mkwrite here?  I'd love to get rid of the btrfs code to
> detect places that use set_page_dirty without a page_mkwrite.

It is because page_mkwrite must be called before the page is dirtied
(it may fail, it theoretically may do something crazy with the previous
clean page data). And in several places I think it gets called from a
nasty context.

It hasn't fallen completely off my radar. fsblock has the same issue
(although I've just been ignoring gup writes into fsblock fs for the
time being).

I have a basic idea of what to do... It would be nice to change calling
convention of get_user_pages and take the page lock. Database people might
scream, in which case we could only take the page lock for filesystems that
define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
ordering might get a bit interesting, but if we can have callers ensure they
always submit and release partially fulfilled requests, then we can always
trylock them.
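
A rough user-space sketch of the trylock idea, assuming a C11 `atomic_flag` as a stand-in for the page lock bit. `toy_page` and the `toy_*` helpers are hypothetical names and nothing here is the actual get_user_pages interface. The caller locks as long a prefix of the request as it can without blocking, works on that part, releases it, and comes back for the rest.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page with a lock bit. */
struct toy_page {
    atomic_flag locked;
};

/* Try to take the page lock without blocking; true on success. */
static bool toy_trylock_page(struct toy_page *p)
{
    return !atomic_flag_test_and_set_explicit(&p->locked, memory_order_acquire);
}

static void toy_unlock_page(struct toy_page *p)
{
    atomic_flag_clear_explicit(&p->locked, memory_order_release);
}

/*
 * Lock pages[0..n) in order and stop at the first contended page.
 * Returns how many pages are now locked. The caller is expected to
 * submit and release that partial request before retrying the rest,
 * so it never sleeps on one page lock while holding another.
 */
static size_t toy_lock_prefix(struct toy_page **pages, size_t n)
{
    size_t i;

    assert(pages != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (!toy_trylock_page(pages[i]))
            break;
    return i;
}

/* Release the first n pages of a request locked by toy_lock_prefix(). */
static void toy_unlock_prefix(struct toy_page **pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_unlock_page(pages[i]);
}
```

If the prefix comes back empty the caller holds no locks at all, so in this sketch it would be safe to wait on the first page before trying again.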


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-09-08 15:41                             ` Nick Piggin
  (?)
@ 2009-09-08 16:31                             ` Chris Mason
  2009-09-08 17:00                               ` Nick Piggin
                                                 ` (2 more replies)
  -1 siblings, 3 replies; 45+ messages in thread
From: Chris Mason @ 2009-09-08 16:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Trond Myklebust, Miklos Szeredi, holt, linux-nfs, linux-fsdevel,
	linux-mm

On Tue, Sep 08, 2009 at 05:41:32PM +0200, Nick Piggin wrote:
> On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > > > As I said, I think I can fix the NFS problem by simply unmapping the
> > > > page inside ->writepage() whenever we know the write request was
> > > > originally set up by a page fault.
> > > 
> > > The biggest outstanding problem we have remaining is get_user_pages.
> > > Callers are only required to hold a ref on the page and then they
> > > can call set_page_dirty at any point after that.
> > > 
> > > I have a half-done patch somewhere to add a put_user_pages, and then
> > > we could probably go from there to pinning the fs metadata (whether
> > > by using the page lock or something else, I don't quite know).
> > 
> > Hi everyone,
> > 
> > Sorry for digging up an old thread, but is there any reason we can't
> > just use page_mkwrite here?  I'd love to get rid of the btrfs code to
> > detect places that use set_page_dirty without a page_mkwrite.
> 
> It is because page_mkwrite must be called before the page is dirtied
> (it may fail, it theoretically may do something crazy with the previous
> clean page data). And in several places I think it gets called from a
> nasty context.
> 
> It hasn't fallen completely off my radar. fsblock has the same issue
> (although I've just been ignoring gup writes into fsblock fs for the
> time being).

Ok, I'll change my detection code a bit then.

> 
> I have a basic idea of what to do... It would be nice to change calling
> convention of get_user_pages and take the page lock. Database people might
> scream, in which case we could only take the page lock for filesystems that
> define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
> ordering might get a bit interesting, but if we can have callers ensure they
> always submit and release partially fulfilled requests, then we can always
> trylock them.

I think everyone will have page_mkwrite eventually, at least everyone
who the databases will care about ;)

-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-09-08 16:31                             ` Chris Mason
@ 2009-09-08 17:00                               ` Nick Piggin
  2009-09-08 17:00                               ` Nick Piggin
  2009-09-08 17:00                                 ` Nick Piggin
  2 siblings, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2009-09-08 17:00 UTC (permalink / raw)
  To: Chris Mason, Trond Myklebust, Miklos Szeredi, holt, linux-nfs,
	linux-fsdevel

On Tue, Sep 08, 2009 at 12:31:49PM -0400, Chris Mason wrote:
> On Tue, Sep 08, 2009 at 05:41:32PM +0200, Nick Piggin wrote:
> > It hasn't fallen completely off my radar. fsblock has the same issue
> > (although I've just been ignoring gup writes into fsblock fs for the
> > time being).
> 
> Ok, I'll change my detection code a bit then.

OK.


> > I have a basic idea of what to do... It would be nice to change calling
> > convention of get_user_pages and take the page lock. Database people might
> > scream, in which case we could only take the page lock for filesystems that
> > define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
> > ordering might get a bit interesting, but if we can have callers ensure they
> > always submit and release partially fulfilled requests, then we can always
> > trylock them.
> 
> I think everyone will have page_mkwrite eventually, at least everyone
> who the databases will care about ;)

Ah, the problem is not where the DIO write goes, it's where the read
goes :) (ie. the read writes into get_user_pages pages).

So for databases this should typically be shared memory segments I'd
say (tmpfs), or maybe anonymous memory.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-09-08 15:30                         ` Chris Mason
                                             ` (2 preceding siblings ...)
  2009-09-08 15:41                           ` Nick Piggin
@ 2009-09-09  2:21                           ` Christoph Hellwig
  2009-09-09  2:21                           ` Christoph Hellwig
  2009-09-09  2:21                             ` Christoph Hellwig
  5 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2009-09-09  2:21 UTC (permalink / raw)
  To: Chris Mason, Nick Piggin, Trond Myklebust, Miklos Szeredi, holt,
	linux-nfs

On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> Sorry for digging up an old thread, but is there any reason we can't
> just use page_mkwrite here?  I'd love to get rid of the btrfs code to
> detect places that use set_page_dirty without a page_mkwrite.

It's not just btrfs, it's also a complete pain in the a** for XFS and
probably every filesystem using ->page_mkwrite for dirty page tracking.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Why doesn't zap_pte_range() call page_mkwrite()
  2009-09-09  2:21                             ` Christoph Hellwig
  (?)
@ 2009-09-09  5:39                                 ` Nick Piggin
  -1 siblings, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2009-09-09  5:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Trond Myklebust, Miklos Szeredi, holt, linux-nfs,
	linux-fsdevel, linux-mm

On Tue, Sep 08, 2009 at 10:21:02PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > Sorry for digging up an old thread, but is there any reason we can't
> > just use page_mkwrite here?  I'd love to get rid of the btrfs code to
> > detect places that use set_page_dirty without a page_mkwrite.
> 
> It's not just btrfs, it's also a complete pain in the a** for XFS and
> probably every filesystem using ->page_mkwrite for dirty page tracking.

Well I guess I should really get out my put_user_pages patches and
propose doing page locking or something. One problem is just going
through and converting all callers... another problem is that
nobody seemed to care much last time but hopefully there is more
interest now.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2009-09-09  5:39 UTC | newest]

Thread overview: 45+ messages
2009-04-23 18:17 Why doesn't zap_pte_range() call page_mkwrite() Trond Myklebust
2009-04-23 18:17 ` Trond Myklebust
     [not found] ` <1240510668.11148.40.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-04-23 19:52   ` Miklos Szeredi
2009-04-23 19:52     ` Miklos Szeredi
     [not found]     ` <E1Lx4yU-0007A8-Gl-8f8m9JG5TPIdUIPVzhDTVZP2KDSNp7ea@public.gmane.org>
2009-04-23 20:42       ` Trond Myklebust
2009-04-23 20:42         ` Trond Myklebust
2009-04-24  7:15         ` Miklos Szeredi
2009-04-24  7:15           ` Miklos Szeredi
     [not found]           ` <E1LxFd4-0008Ih-Rd-8f8m9JG5TPIdUIPVzhDTVZP2KDSNp7ea@public.gmane.org>
2009-04-24  7:33             ` Miklos Szeredi
2009-04-24  7:33               ` Miklos Szeredi
2009-04-24  7:33               ` Miklos Szeredi
2009-04-24 12:59               ` Chris Mason
2009-04-24 12:59                 ` Chris Mason
2009-04-24 13:31                 ` Trond Myklebust
2009-04-24 13:31                   ` Trond Myklebust
2009-04-24 14:06                   ` Trond Myklebust
2009-04-24 14:06                     ` Trond Myklebust
2009-04-24 16:18               ` Jamie Lokier
2009-04-24 10:41             ` Robin Holt
2009-04-24 10:41               ` Robin Holt
2009-04-24 10:41               ` Robin Holt
2009-04-24 14:52               ` Miklos Szeredi
     [not found]                 ` <E1LxMlO-0000sU-1J-8f8m9JG5TPIdUIPVzhDTVZP2KDSNp7ea@public.gmane.org>
2009-04-24 17:00                   ` Trond Myklebust
2009-04-24 17:00                     ` Trond Myklebust
2009-04-24 17:00                     ` Trond Myklebust
     [not found]                     ` <1240592448.4946.35.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-04-25  5:10                       ` Nick Piggin
2009-04-25  5:10                         ` Nick Piggin
2009-04-25  5:10                         ` Nick Piggin
2009-09-08 15:30                         ` Chris Mason
2009-09-08 15:41                           ` Nick Piggin
2009-09-08 15:41                             ` Nick Piggin
2009-09-08 16:31                             ` Chris Mason
2009-09-08 17:00                               ` Nick Piggin
2009-09-08 17:00                               ` Nick Piggin
2009-09-08 17:00                               ` Nick Piggin
2009-09-08 17:00                                 ` Nick Piggin
2009-09-08 15:41                           ` Nick Piggin
2009-09-08 15:41                           ` Nick Piggin
2009-09-09  2:21                           ` Christoph Hellwig
2009-09-09  2:21                           ` Christoph Hellwig
2009-09-09  2:21                           ` Christoph Hellwig
2009-09-09  2:21                             ` Christoph Hellwig
     [not found]                             ` <20090909022102.GA28318-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2009-09-09  5:39                               ` Nick Piggin
2009-09-09  5:39                                 ` Nick Piggin
2009-09-09  5:39                                 ` Nick Piggin
