* RFC: reviving mlock isolation dead code
From: Michel Lespinasse @ 2010-10-30 10:16 UTC
  To: linux-mm

Hi,

The following code at the bottom of try_to_unmap_one appears to be dead:

 out_mlock:
        pte_unmap_unlock(pte, ptl);

        /*
         * We need mmap_sem locking, Otherwise VM_LOCKED check makes
         * unstable result and race. Plus, We can't wait here because
         * we now hold anon_vma->lock or mapping->i_mmap_lock.
         * if trylock failed, the page remain in evictable lru and later
         * vmscan could retry to move the page to unevictable lru if the
         * page is actually mlocked.
         */
        if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
                if (vma->vm_flags & VM_LOCKED) {
                        mlock_vma_page(page);
                        ret = SWAP_MLOCK;
                }
                up_read(&vma->vm_mm->mmap_sem);
        }
        return ret;

The mmap_sem read acquire always fails here, because mmap_sem is held
exclusively across __mlock_vma_pages_range(). By the time
__mlock_vma_pages_range() terminates (so that its caller can release
mmap_sem), all mlocked pages have already been isolated, so the LRU
eviction algorithms should not encounter them (and if they do, the
pages should at least already be marked as mlocked).
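
To spell out the lock interaction (a simplified sketch, not the actual
kernel code):

        /* mlock() side: mmap_sem is write-held across the whole populate */
        down_write(&mm->mmap_sem);
        __mlock_vma_pages_range(vma, start, end);  /* may block on disk I/O */
        up_write(&mm->mmap_sem);

        /* vmscan side, in try_to_unmap_one(): */
        if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
                /* never reached while the mlock() above is in flight */
        }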


I would like to resurrect this, as I am seeing problems during a large
mlock (many GB). The mlock takes a long time to complete
(__mlock_vma_pages_range() is loading pages from disk), there is
memory pressure as some pages have to be evicted to make room for the
large mlock, and the LRU algorithm performs badly with the large number
of pages still on the LRU lists - PageMlocked has not been set yet -
while their VMA is already VM_LOCKED.

One approach I am considering would be to modify
__mlock_vma_pages_range() and its call sites so that mmap_sem is only
read-owned while __mlock_vma_pages_range() runs. The mlock handling
code in try_to_unmap_one() would then be able to acquire mmap_sem
and help, as it is designed to do.
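
In rough pseudocode, the shape I have in mind would be something like
this (a sketch only; the exact split between the write-held and
read-held sections still needs to be worked out):

sys_mlock
        down_write(mmap_sem)
        turn on VM_LOCKED, merge/split vmas
        downgrade_write(mmap_sem)
        __mlock_vma_pages_range()       /* mmap_sem now only read-owned, so
                                           try_to_unmap_one() can also take
                                           it for read and help */
        up_read(mmap_sem)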

Please comment if you have any concerns about this.

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: RFC: reviving mlock isolation dead code
From: Michel Lespinasse @ 2010-10-30 12:48 UTC
  To: linux-mm

On Sat, Oct 30, 2010 at 3:16 AM, Michel Lespinasse <walken@google.com> wrote:
> The following code at the bottom of try_to_unmap_one appears to be dead:
>
>  out_mlock:
>        pte_unmap_unlock(pte, ptl);
>
>        /*
>         * We need mmap_sem locking, Otherwise VM_LOCKED check makes
>         * unstable result and race. Plus, We can't wait here because
>         * we now hold anon_vma->lock or mapping->i_mmap_lock.
>         * if trylock failed, the page remain in evictable lru and later
>         * vmscan could retry to move the page to unevictable lru if the
>         * page is actually mlocked.
>         */
>        if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
>                if (vma->vm_flags & VM_LOCKED) {
>                        mlock_vma_page(page);
>                        ret = SWAP_MLOCK;
>                }
>                up_read(&vma->vm_mm->mmap_sem);
>        }
>        return ret;

All right, not entirely dead - Documentation/vm/unevictable-lru.txt
actually explains this very well.

But still, the problem remains that lazy mlocking doesn't work while
mmap_sem is exclusively held by a long-running mlock().

> One approach I am considering would be to modify
> __mlock_vma_pages_range() and its call sites so that mmap_sem is only
> read-owned while __mlock_vma_pages_range() runs. The mlock handling
> code in try_to_unmap_one() would then be able to acquire mmap_sem
> and help, as it is designed to do.

I'm still looking for any comments people might have about this :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: RFC: reviving mlock isolation dead code
From: KOSAKI Motohiro @ 2010-11-01  7:05 UTC
  To: Michel Lespinasse; +Cc: kosaki.motohiro, linux-mm

Hello,

> I would like to resurrect this, as I am seeing problems during a large
> mlock (many GB). The mlock takes a long time to complete
> (__mlock_vma_pages_range() is loading pages from disk), there is
> memory pressure as some pages have to be evicted to make room for the
> large mlock, and the LRU algorithm performs badly with the large number
> of pages still on the LRU lists - PageMlocked has not been set yet -
> while their VMA is already VM_LOCKED.
> 
> One approach I am considering would be to modify
> __mlock_vma_pages_range() and its call sites so that mmap_sem is only
> read-owned while __mlock_vma_pages_range() runs. The mlock handling
> code in try_to_unmap_one() would then be able to acquire mmap_sem
> and help, as it is designed to do.

Let me tell a bit of the historical story. Originally, Lee designed it as you proposed,
but Linus refused it: he thought the ro-rwsem approach was a band-aid fix. That is one
of the reasons why some developers have been seeking a proper way to divide mmap_sem.




* Re: RFC: reviving mlock isolation dead code
From: KOSAKI Motohiro @ 2010-11-09  4:34 UTC
  To: Michel Lespinasse; +Cc: kosaki.motohiro, linux-mm

Hi Michel,

> Hello,
> 
> > I would like to resurrect this, as I am seeing problems during a large
> > mlock (many GB). The mlock takes a long time to complete
> > (__mlock_vma_pages_range() is loading pages from disk), there is
> > memory pressure as some pages have to be evicted to make room for the
> > large mlock, and the LRU algorithm performs badly with the large number
> > of pages still on the LRU lists - PageMlocked has not been set yet -
> > while their VMA is already VM_LOCKED.
> > 
> > One approach I am considering would be to modify
> > __mlock_vma_pages_range() and its call sites so that mmap_sem is only
> > read-owned while __mlock_vma_pages_range() runs. The mlock handling
> > code in try_to_unmap_one() would then be able to acquire mmap_sem
> > and help, as it is designed to do.
> 
> Let me tell a bit of the historical story. Originally, Lee designed it as you proposed,
> but Linus refused it: he thought the ro-rwsem approach was a band-aid fix. That is one
> of the reasons why some developers have been seeking a proper way to divide mmap_sem.

While on the airplane back from KS and LPC, I was thinking about this issue, and now
I think we can solve it. Can I run my idea past you?

Now, mlock has the following call flow:

sys_mlock
	down_write(mmap_sem)
	do_mlock()
		for-each-vma
			mlock_fixup()
				__mlock_vma_pages_range()
					__get_user_pages()
	up_write(mmap_sem)


And someone tried the following change, but Linus refused it because releasing mmap_sem
in the middle of the mlock() syscall can create nasty races. He strongly requested that
we not release mmap_sem while processing mlock().


sys_mlock
	down_write(mmap_sem)
	do_mlock()
		for-each-vma
			downgrade_write(mmap_sem)
			mlock_fixup()
				__mlock_vma_pages_range()
					__get_user_pages()
			up_read(mmap_sem)
			// race here
			down_write(mmap_sem)
	up_write(mmap_sem)


Then, I'd propose a two-phase mlock. That is:

sys_mlock
	down_write(mmap_sem)
	do_mlock()
		for-each-vma
			turn on VM_LOCKED and merge/split vma
	downgrade_write(mmap_sem)
		for-each-vma
			mlock_fixup()
				__mlock_vma_pages_range()
	up_read(mmap_sem)


Usually, kernel developers strongly dislike two-phase approaches because they are slow, but at
least _I_ think it's ok in this case: mlock is a really, really slow syscall - it often takes a
few *minutes* - so being a few microseconds slower is not a big deal.
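
A rough C sketch of the second phase (hypothetical code inside
do_mlock(), reusing the existing helpers just to show the shape;
not a tested patch):

	/* phase 2: populate the pages with mmap_sem only read-held */
	downgrade_write(&mm->mmap_sem);
	for (vma = find_vma(mm, start); vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		if (vma->vm_flags & VM_LOCKED)
			__mlock_vma_pages_range(vma,
					max(start, vma->vm_start),
					min(end, vma->vm_end));
	}
	up_read(&mm->mmap_sem);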

What do you think?



* Re: RFC: reviving mlock isolation dead code
From: Michel Lespinasse @ 2010-11-10 12:21 UTC
  To: KOSAKI Motohiro; +Cc: linux-mm

On Mon, Nov 8, 2010 at 8:34 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> While on the airplane back from KS and LPC, I was thinking about this issue, and now
> I think we can solve it. Can I run my idea past you?

I have been having similar thoughts over the past week. I'll try to
send a related patch set soon.

> Now, mlock has the following call flow:
>
> sys_mlock
>        down_write(mmap_sem)
>        do_mlock()
>                for-each-vma
>                        mlock_fixup()
>                                __mlock_vma_pages_range()
>                                        __get_user_pages()
>        up_write(mmap_sem)
>
> Then, I'd propose a two-phase mlock. That is:
>
> sys_mlock
>        down_write(mmap_sem)
>        do_mlock()
>                for-each-vma
>                        turn on VM_LOCKED and merge/split vma
>        downgrade_write(mmap_sem)
>                for-each-vma
>                        mlock_fixup()
>                                __mlock_vma_pages_range()
>        up_read(mmap_sem)
>
> Usually, kernel developers strongly dislike two-phase approaches because they are slow, but at
> least _I_ think it's ok in this case: mlock is a really, really slow syscall - it often takes a
> few *minutes* - so being a few microseconds slower is not a big deal.
>
> What do you think?

downgrade_write() would help, but only partially. If another thread
tries to acquire the mmap_sem for write, it will get queued for a long
time until mlock() completes - this may in itself be acceptable, but
the issue here is that additional readers like try_to_unmap_one()
won't be able to acquire the mmap_sem anymore. This is because the
rwsem code prevents new readers from entering once there is a queued
writer, in order to avoid starvation.
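
For example (hypothetical threads A, B and C, to illustrate the
queueing behaviour):

        A: mlock()             holds mmap_sem for read after downgrade_write()
        B: mmap()              down_write(mmap_sem) -> queued behind A
        C: try_to_unmap_one()  down_read_trylock(mmap_sem) -> fails,
                               because writer B is already queued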

My proposal would be as follows:

sys_mlock
       down_write(mmap_sem)
       do_mlock()
               for-each-vma
                       turn on VM_LOCKED and merge/split vma
       up_write(mmap_sem)
       for (addr = start of mlock range; addr < end of mlock range;
            addr = next_addr)
               down_read(mmap_sem)
               find vma for addr
               next_addr = end of the vma
               if vma still has VM_LOCKED flag:
                       next_addr = min(next_addr, addr + few pages)
                       mlock a small batch of pages from that vma
                               (from addr to next_addr)
               up_read(mmap_sem)

Since a large mlock() can take a long time and we don't want to hold
mmap_sem for that long, we have to allow other threads to grab
mmap_sem and deal with the concurrency issues.
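
In C, the loop might look roughly like this (a sketch; the batch size
is a placeholder and mm/start/end are as in the surrounding code):

        unsigned long addr, next_addr;

        for (addr = start; addr < end; addr = next_addr) {
                struct vm_area_struct *vma;

                down_read(&mm->mmap_sem);
                vma = find_vma(mm, addr);
                if (!vma || vma->vm_start >= end) {
                        up_read(&mm->mmap_sem);
                        break;
                }
                addr = max(addr, vma->vm_start);
                next_addr = min(vma->vm_end, end);
                if (vma->vm_flags & VM_LOCKED) {
                        /* small batches keep mmap_sem hold times short */
                        next_addr = min(next_addr, addr + 16 * PAGE_SIZE);
                        __mlock_vma_pages_range(vma, addr, next_addr);
                }
                up_read(&mm->mmap_sem);
        }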

The races aren't actually too bad:

* If some other thread creates new VM_LOCKED vmas within the mlock
range while sys_mlock() is working: both threads will be trying to
mlock_fixup the same page range at once. This is no big deal as
__mlock_vma_pages_range already only needs mmap_sem held for read: the
get_user_pages() part can safely proceed in parallel and the
mlock_vma_page() part is protected by the page lock and won't do
anything if the PageMlocked flag is already set.

* If some other thread creates new non-VM_LOCKED vmas, or munlocks the
same address ranges that mlock() is currently working on: the mlock()
code needs to be careful here to not mlock the pages when the vmas
don't have the VM_LOCKED flag anymore. From the user process's point of
view, things will look as if the mlock had completed first, followed
by the munlock.

The other mlock related issue I have is that it marks pages as dirty
(if they are in a writable VMA), and causes writeback to work on them,
even though the pages have not actually been modified. This looks like
it would be solvable with a new get_user_pages flag for mlock use
(breaking cow etc, but not writing to the pages just yet).
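
For illustration (FOLL_MLOCK below is a made-up flag name for this
sketch; FOLL_TOUCH/FOLL_GET/FOLL_WRITE are the existing flags used by
__mlock_vma_pages_range today):

        int gup_flags = FOLL_TOUCH | FOLL_GET | FOLL_MLOCK;

        if (vma->vm_flags & VM_WRITE)
                gup_flags |= FOLL_WRITE;        /* break COW as today... */

        /*
         * ...but with the hypothetical FOLL_MLOCK set, the fault path
         * would skip set_page_dirty()/pte_mkdirty(), so pages that were
         * never actually written do not get queued for writeback.
         */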

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: RFC: reviving mlock isolation dead code
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC
  To: Michel Lespinasse; +Cc: kosaki.motohiro, linux-mm

Hi

> My proposal would be as follows:
> 
> sys_mlock
>        down_write(mmap_sem)
>        do_mlock()
>                for-each-vma
>                        turn on VM_LOCKED and merge/split vma
>        up_write(mmap_sem)
>        for (addr = start of mlock range; addr < end of mlock range;
>             addr = next_addr)
>                down_read(mmap_sem)
>                find vma for addr
>                next_addr = end of the vma
>                if vma still has VM_LOCKED flag:
>                        next_addr = min(next_addr, addr + few pages)
>                        mlock a small batch of pages from that vma
>                                (from addr to next_addr)
>                up_read(mmap_sem)
> 
> Since a large mlock() can take a long time and we don't want to hold
> mmap_sem for that long, we have to allow other threads to grab
> mmap_sem and deal with the concurrency issues.

Sounds good.
Can you please consider posting an actual patch?


> The races aren't actually too bad:
> 
> * If some other thread creates new VM_LOCKED vmas within the mlock
> range while sys_mlock() is working: both threads will be trying to
> mlock_fixup the same page range at once. This is no big deal as
> __mlock_vma_pages_range already only needs mmap_sem held for read: the
> get_user_pages() part can safely proceed in parallel and the
> mlock_vma_page() part is protected by the page lock and won't do
> anything if the PageMlocked flag is already set.
> 
> * If some other thread creates new non-VM_LOCKED vmas, or munlocks the
> same address ranges that mlock() is currently working on: the mlock()
> code needs to be careful here to not mlock the pages when the vmas
> don't have the VM_LOCKED flag anymore. From the user process's point of
> view, things will look as if the mlock had completed first, followed
> by the munlock.

Yes, this is really the key point. If the user can't notice the race, then practically it doesn't exist.


> The other mlock related issue I have is that it marks pages as dirty
> (if they are in a writable VMA), and causes writeback to work on them,
> even though the pages have not actually been modified. This looks like
> it would be solvable with a new get_user_pages flag for mlock use
> (breaking cow etc, but not writing to the pages just yet).

To be honest, I haven't understood why the current code does so. I dislike it too, but
I'm not sure whether such a change is safe or not. I hope another developer will comment ;-)





* Re: RFC: reviving mlock isolation dead code
From: Hugh Dickins @ 2010-11-16  1:44 UTC
  To: KOSAKI Motohiro; +Cc: Michel Lespinasse, linux-mm

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> Michel Lespinasse <walken@google.com> wrote:
> > ...
> > The other mlock related issue I have is that it marks pages as dirty
> > (if they are in a writable VMA), and causes writeback to work on them,
> > even though the pages have not actually been modified. This looks like
> > it would be solvable with a new get_user_pages flag for mlock use
> > (breaking cow etc, but not writing to the pages just yet).
> 
> To be honest, I haven't understood why the current code does so. I dislike it too, but
> I'm not sure whether such a change is safe or not. I hope another developer will comment ;-)

It's been that way for years, and the primary purpose is to do the COWs
in advance, so we won't need to allocate new pages later to the locked
area: the pages that may be needed are already locked down.

That justifies it for the private mapping case, but what of shared maps?
There the justification is that the underlying file might be sparse, and
we want to allocate blocks upfront for the locked area.

Do we?  I dislike it also, as you both do.  It seems crazy to mark a
vast number of pages as dirty when they're not.

It makes sense to mark pte_dirty when we have a real write fault to a
page, to save the mmu from making that pagetable transaction immediately
after; but it does not make sense when the write (if any) may come
minutes later - we'll just do a pointless write and clear dirty meanwhile.

A new __get_user_pages flag (for use by make_pages_present) might make a
good saving there, but I've not thought it through.  Tell page_mkwrite
that we're doing a write (to do allocation in those FSes that care),
but avoid marking the pte as dirty?  I'm not sure, and you might need
to be careful with the dirty balancing too.

If it does work out, I think you'd need to be passing the flag down to
follow_page too: I have a patch or patches to merge the FOLL_flags with
the FAULT_FLAGs - Linus wanted that a year ago, and I recently met a
need for it with shmem - I'd better accelerate sending those in.

Here's a link to the last(?) time mlock dirtying was discussed,
http://lkml.org/lkml/2007/7/26/457
which is worth reading; we could Cc the guys from that thread, though I haven't.

Hugh


* Re: RFC: reviving mlock isolation dead code
From: Michel Lespinasse @ 2010-11-16  6:50 UTC
  To: Hugh Dickins; +Cc: KOSAKI Motohiro, linux-mm

On Mon, Nov 15, 2010 at 5:44 PM, Hugh Dickins <hughd@google.com> wrote:
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
>> Michel Lespinasse <walken@google.com> wrote:
>> > ...
>> > The other mlock related issue I have is that it marks pages as dirty
>> > (if they are in a writable VMA), and causes writeback to work on them,
>> > even though the pages have not actually been modified. This looks like
>> > it would be solvable with a new get_user_pages flag for mlock use
>> > (breaking cow etc, but not writing to the pages just yet).
>>
>> To be honest, I haven't understood why the current code does so. I dislike it too, but
>> I'm not sure whether such a change is safe or not. I hope another developer will comment ;-)
>
> It's been that way for years, and the primary purpose is to do the COWs
> in advance, so we won't need to allocate new pages later to the locked
> area: the pages that may be needed are already locked down.

Thanks Hugh for posting your comments. I was aware of Suleiman's
proposal to always do a READ mode get_user_pages years ago, and I
could see that we'd need a new flag instead so we can break COW
without dirtying pages, but I hadn't thought about other issues.

> That justifies it for the private mapping case, but what of shared maps?
> There the justification is that the underlying file might be sparse, and
> we want to allocate blocks upfront for the locked area.
>
> Do we?  I dislike it also, as you both do.  It seems crazy to mark a
> vast number of pages as dirty when they're not.
>
> It makes sense to mark pte_dirty when we have a real write fault to a
> page, to save the mmu from making that pagetable transaction immediately
> after; but it does not make sense when the write (if any) may come
> minutes later - we'll just do a pointless write and clear dirty meanwhile


* Re: RFC: reviving mlock isolation dead code
From: Hugh Dickins @ 2010-11-16 23:28 UTC
  To: Michel Lespinasse
  Cc: KOSAKI Motohiro, Peter Zijlstra, Nick Piggin, Arjan van de Ven, linux-mm


On Mon, 15 Nov 2010, Michel Lespinasse wrote:
> On Mon, Nov 15, 2010 at 5:44 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> >> Michel Lespinasse <walken@google.com> wrote:
> >> > ...
> >> > The other mlock related issue I have is that it marks pages as dirty
> >> > (if they are in a writable VMA), and causes writeback to work on them,
> >> > even though the pages have not actually been modified. This looks like
> >> > it would be solvable with a new get_user_pages flag for mlock use
> >> > (breaking cow etc, but not writing to the pages just yet).
> >>
> >> To be honest, I haven't understood why the current code does so. I dislike it too, but
> >> I'm not sure whether such a change is safe or not. I hope another developer will comment ;-)
> >
> > It's been that way for years, and the primary purpose is to do the COWs
> > in advance, so we won't need to allocate new pages later to the locked
> > area: the pages that may be needed are already locked down.
> 
> Thanks Hugh for posting your comments. I was aware of Suleiman's
> proposal to always do a READ mode get_user_pages years ago, and I
> could see that we'd need a new flag instead so we can break COW
> without dirtying pages, but I hadn't thought about other issues.
> 
> > That justifies it for the private mapping case, but what of shared maps?
> > There the justification is that the underlying file might be sparse, and
> > we want to allocate blocks upfront for the locked area.
> >
> > Do we?  I dislike it also, as you both do.  It seems crazy to mark a
> > vast number of pages as dirty when they're not.
> >
> > It makes sense to mark pte_dirty when we have a real write fault to a
> > page, to save the mmu from making that pagetable transaction immediately
> > after; but it does not make sense when the write (if any) may come
> > minutes later - we'll just do a pointless write and clear dirty meanwhile.
> 
> If we just mlocked the page but did not make it writable (or mark it
> dirty) yet, would we be allowed to skip the page_mkwrite method call?

Yes, indeed you should skip it in that case.

> 
> I believe this would be legal:

Yes, I agree that it would be legal.

> 
> - If/when an actual write comes later on, we'll run through
> do_wp_page() again, and reuse the old page, making it writable and
> dirty from then on. Since this is a shared mapping, we won't have to
> allocate a new page at that time, so this preserves the mlock
> semantic of having all necessary pages preallocated.
> 
> - If we skip page_mkwrite(), we can't guarantee that the filesystem
> will have a free block to allocate, but is this actually part of the
> mlock() semantics? I think not, given that only a few filesystems
> implement page_mkwrite() in the first place. ext4 does, but ext2/3
> do not, for example. So while skipping page_mkwrite() would prevent
> data blocks from being pre-allocated, I don't really see it as
> breaking mlock()?

Yes, allocating the blocks is not actually part of mlock() semantics.

And a few years ago, there was no ->page_mkwrite(), and the ->nopage()
interface didn't tell the filesystem whether it was a read or a write fault
(and mlocking a writable vma certainly didn't do synchronous writes back
to disk before the mlock returned success or failure).

It's all a matter of QoS: is it acceptable to make a change whereby a
write fault to an mlocked area of a sparse file might now generate
SIGBUS on a few filesystems which have recently been guaranteeing not to?

Personally, I believe that's more acceptable than doing a huge rush of
(almost always) pointless writes at the time of mlock().  But I can
see that others may disagree.

> 
> > If it does work out, I think you'd need to be passing the flag down to
> > follow_page too: I have a patch or patches to merge the FOLL_flags with
> > the FAULT_FLAGs - Linus wanted that a year ago, and I recently met a
> > need for it with shmem - I'd better accelerate sending those in.
> 
> The follow_page change is simpler, it might even be sufficient to not
> pass in the FOLL_TOUCH flag I think.

Yes, in fact, is anything required beyond Peter's original simple patch?

There are some tweaks that could be added.  A FAULT_FLAG to let the
filesystem know that we're mlocking a writable area, so it could be
careful about it?  Only useful if some filesystem uses it!  A check on
vma_wants_writenotify() or something like it, so that mlock does set
pte_write where that's okay, e.g. on tmpfs?  Second-order things; they
probably don't matter.

Added Ccs of those most likely to agree or disagree with us.

Hugh


* Re: RFC: reviving mlock isolation dead code
From: Michel Lespinasse @ 2010-11-18 11:16 UTC
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, Peter Zijlstra, Nick Piggin, Arjan van de Ven,
	linux-mm, Andrea Arcangeli

On Tue, Nov 16, 2010 at 3:28 PM, Hugh Dickins <hughd@google.com> wrote:
> Yes, in fact, is anything required beyond Peter's original simple patch?

I initially thought there would be a problem with breaking COW on anon
pages (think of fork() + mlock()), but then I realized these won't
show up in VM_SHARED vmas, so Peter's patch seems fine.

> Added Ccs of those most likely to agree or disagree with us.

I forgot to add Arjan and Andrea in my proposal ('Avoid dirtying pages
during mlock'). Let's move the discussion there.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

