linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hugh@veritas.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Andrew Morton <akpm@osdl.org>,
	nickpiggin@yahoo.com.au, linux-kernel@vger.kernel.org
Subject: Re: smp race fix between invalidate_inode_pages* and do_no_page
Date: Wed, 11 Jan 2006 20:49:44 +0000 (GMT)	[thread overview]
Message-ID: <Pine.LNX.4.61.0601111949070.6448@goblin.wat.veritas.com> (raw)
In-Reply-To: <20060111091327.GZ15897@opteron.random>

On Wed, 11 Jan 2006, Andrea Arcangeli wrote:
> On Wed, Jan 11, 2006 at 01:06:38AM -0800, Andrew Morton wrote:
> > 
> > Confused.  do_no_page() doesn't have a page to lock until it has called
> > ->nopage.
> 
> yes, I mean doing lock_page after ->nopage returned it here:
> 
> 	lock_page(page);
> 	if (mapping && !page->mapping)
> 		goto bail_out;
> 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> [..]
> 			page_add_file_rmap()
> 			unlock_page()

I've rather chosen this mail at random from the thread, and can't
come up with better than a few random remarks on it all, sorry.

Though using lock_page may solve more than one problem, without seeing
a full implementation I'm less sure than the rest of you that it will
really do the job.

And less confident than you and Nick that adding a sleeping lock here
in nopage won't present a scalability issue.  Though it gives me a
special thrill to see Nick arguing in favour of extra locking ;)

Keep in mind that in the truncate case, it's the original shared file
page that needs to be locked and tested, when deciding whether it's
okay to insert a COWed anonymous copy (even COWed areas must SIGBUS).

Don't forget to update mm/fremap.c's install_page, which we forgot
originally when doing the truncate_count business.

But when fixing that (not me who noticed the problem), I found we
didn't really need to rely on truncate_count there, just mapping and
i_size: because in install_page we could be sure it was (or had been)
a page cache page with mapping set.  The need for truncate_count
arises from the uncertainty when do_no_page returns from ->nopage,
whether it was a pagecache kind of nopage (which would set mappping),
or a weird driver kind of nopage (which would not).

If coming here again, I think I'd have a VM_ flag for that: almost
corresponds to VM_RESERVED (as I think you were keying off in your SuSE
tree) but better to make it specific to pagecache, or truncatability -
to say mapping ought to be non-NULL, and if it's found NULL after
grabbing pte_lock, then it must have been truncated/invalidated off.

Didn't examine in detail, but I didn't care for your seqschedlock
implementation.  Partly the overdesign of a new locking scheme for
just one usage (seqschedlock.h missing from your recent send, by
the way, but I assume same as last month's) - fine when more uses
come along later, but seemed premature, and better to concentrate
on the immediate need for now.

And partly that you seemed to be leaving the truncate case handled
by one technique (truncate_count) and the invalidate case handled by
another (seqschedlock): much more satisfying to handle them both in
the same-ish way.

But yes, they do differ significantly: treatment of COW pages,
applicability of i_size, wait rather than error out when invalidate.
Perhaps I underestimate the effect of those differences.

I don't think you'll be able to eliminate truncate_count so easily,
since I put it to further use, with a matching vm_truncate_count,
to handle the latency while unmapping vmas from the prio_tree.

I've never got to grips with invalidate_inode_pages, or its more
exciting sequel "Invalidate Inode Pages II".  Last time I looked
(2.6.10-ish) both looked racy, neither very serious.  Alarmed now
to see that while I was attacking latency below unmap_mapping_range,
Andrew was adding unmap_mapping_range calls into iipages2.

I'd been working on the assumption that i_sem is held across it,
but that seems not to be the case, in NFS uses at least (I've
given up further researches to write this mail).  But miraculously
(or by Andrew's careful checking) it looks to me like it is okay:
if there are concurrent unmappings, i_mmap_lock protects truncate_
count and vm_truncate_count modifications, and unmap_mapping_range_tree
won't do worse than fail to skip vmas it has already dealt with.

But I do not know what guarantees invalidate_inode_pages2 is supposed
to give.  As soon as you emerge from iipages2, its work could be undone:
I suppose it provides a serialization point, ensures that you go down to
filesystem once, rather than being satisfied by the pagecache.

Hugh

  reply	other threads:[~2006-01-11 20:49 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-12-13 19:37 smp race fix between invalidate_inode_pages* and do_no_page Andrea Arcangeli
2005-12-13 21:02 ` Andrew Morton
2005-12-13 21:14   ` Andrea Arcangeli
2005-12-16 13:51     ` Andrea Arcangeli
2006-01-10  6:24       ` Andrea Arcangeli
2006-01-10  6:48         ` Andrea Arcangeli
2006-01-11  4:08         ` Nick Piggin
2006-01-11  8:23           ` Andrea Arcangeli
2006-01-11  8:51             ` Andrew Morton
2006-01-11  9:02               ` Andrea Arcangeli
2006-01-11  9:06                 ` Andrew Morton
2006-01-11  9:13                   ` Andrea Arcangeli
2006-01-11 20:49                     ` Hugh Dickins [this message]
2006-01-11 21:05                       ` Andrew Morton
2006-01-13  7:35                       ` Nick Piggin
2006-01-13  7:47                         ` Andrew Morton
2006-01-13 10:37                           ` Nick Piggin
2006-03-31 12:36                             ` Andrea Arcangeli
2006-04-02  5:17                               ` Nick Piggin
2006-04-02  5:21                               ` Andrew Morton
2006-04-07 19:18                                 ` Hugh Dickins
2006-01-11  9:39                 ` Nick Piggin
2006-01-11  9:34             ` Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.61.0601111949070.6448@goblin.wat.veritas.com \
    --to=hugh@veritas.com \
    --cc=akpm@osdl.org \
    --cc=andrea@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nickpiggin@yahoo.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).