All of lore.kernel.org
 help / color / mirror / Atom feed
From: Minchan Kim <minchan.kim@gmail.com>
To: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <jweiner@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Shaohua Li <shaohua.li@intel.com>
Subject: Re: [RFC PATCH 0/3] page count lock for simpler put_page
Date: Wed, 10 Aug 2011 07:22:08 +0900	[thread overview]
Message-ID: <CAEwNFnA3o_9SJZEeSg5azO45-E3HhEcSnww0GarN5m6n-qL_Vw@mail.gmail.com> (raw)
In-Reply-To: <CANN689Edai1k4nmyTHZ_2EwWuTXdfmah-JiyibEBvSudcWhv+g@mail.gmail.com>

On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse <walken@google.com> wrote:
> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
>>> On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
>>> > Patch 3 demonstrates my motivation for this patch series: in my pre-THP
>>> > implementation of idle page tracking, I was able to use get_page_unless_zero
>>> > in a way that __split_huge_page_refcount made unsafe. Building on top of
>>> > patch 2, I can make the required operation safe again. If patch 2 was to
>>> > be rejected, I would like to get suggestions about alternative approaches
>>> > to implement the get_first_page_unless_zero() operation described here.
>>>
>>> I should add that I am quite worried about the places that use
>>> get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
>>> My worrisome scenario would be as follows:
>>>
>>> - thread T finds a pointer to a page P (possibly from a radix tree in
>>> find_get_page() )
>>> - page P gets freed by another thread
>>> - page P gets re-allocated as the tail of a THP page by another thread
>>> - another thread gets a reference on page P
>>> - thread T proceeds doing page_cache_get_speculative(P), intending to
>>> then check that P is really the page it wanted
>>> - another thread splits up P's compound page;
>>> __split_huge_page_refcount subtracts T's refcount on P from head(P)'s
>>> refcount
>>> - thread T figures out that it didn't get the page it expected, calls
>>> page_cache_release(P). But it's too late - the refcount for what used
>>> to be head(P) has already been corrupted (incorrectly decremented).
>>>
>>> Does anything prevent the above ?
>>
>> I think it's possbile and you find a BUG.
>> Andrea?
>
> At this point I believe this is indeed a bug, though a very unlikely
> one to hit. The interval between thread T finding a pointer to page P
> and thread T getting a reference on P is typically just a few
> instructions, while the time scale necessary for other threads to free
> page P, reallocate it as a compound page, and decide to split it into
> single pages is just much larger. So, there is no line of code I can
> point to that would prevent the race, but it also looks like it would
> be very hard to trigger it.

Fair enough.

>
>>> I can see that the page_cache_get_speculative comment in
>>> include/linux/pagemap.h maps out one way to prevent the issue. If
>>> thread T continually held an rcu read lock from the time it finds the
>>> pointer to P until the time it calls get_page_unless_zero on that
>>> page, AND there was a synchronize_rcu() call somewhere between the
>>> time a THP page gets allocated and the time __split_huge_page_refcount
>>> might first get called on that page, then things would be safe.
>>> However, that does not seem to be true today: I could not find a
>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there
>>> are also places (such as deactivate_page() for example) that call
>>> get_page_unless_zero without being within an rcu read locked section
>>> (or holding the zone lru lock to provide exclusion against
>>> __split_huge_page_refcount).
>
> Going forward, I can see several possible solutions:
> - Use my proposed page count lock in order to avoid the race. One
> would have to convert all get_page_unless_zero() sites to use it. I
> expect the cost would be low but still measurable.

It's not necessary to apply it on *all* get_page_unless_zero sites.
Because deactivate_page does it on file pages while THP handles only anon pages.
So the race should not a problem.

> - Protect all get_page_unless_zero call sites with rcu read lock or
> lru lock (page_cache_get_speculative already has it, but there are
> others to consider), and add a synchronize_rcu() before splitting huge
> pages.

I think it can't be a solution.
If we don't have any lock for protect write-side, page_count could be
unstable again while we peek page->count in
__split_huge_page_refcount after calling synchronize_rcu.
Do I miss something?

> - It'd be sweet if one could somehow record the time a THP page was
> created, and wait for at least one RCU grace period *starting from the
> recorded THP creation time* before splitting huge pages. In practice,
> we would be very unlikely to have to wait since the grace period would
> be already expired. However, I don't think RCU currently provides such
> a mechanism - Paul, is this something that would seem easy to
> implement or not ?
> - Do nothing and hope one doesn't hit the race. This is not my
> favourite "solution", but OTOH the race seems so hard to hit that it

Others might hate it. :)

> may be hard to justify expensive solutions to work around it.

We should fix it with nice idea. :)
At the moment, your proposed approach would be nice to me.
But I expect Andrea didn't like it because he always want to minimize
change of other code by THP.

>
>> When I make deactivate_page, I didn't consider that honestly.
>> IMHO, It shouldn't be a problem as deactive_page hold a reference
>> of page by pagevec_lookup so the page shouldn't be gone under us.
>
> Agree - it seems like you are guaranteed to already hold a reference
> (but then a straight get_page should be sufficient, right ?)

I think the point is that deactivate_page handles file pages while THP
handles only anon pages.
Thanks.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-08-09 22:22 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
2011-08-04 21:07 ` [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions Michel Lespinasse
2011-08-04 21:07 ` [RFC PATCH 2/3] mm: page count lock Michel Lespinasse
2011-08-07 14:00   ` Minchan Kim
2011-08-04 21:07 ` [RFC PATCH 3/3] mm: get_first_page_unless_zero() Michel Lespinasse
2011-08-07 14:13   ` Minchan Kim
2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
2011-08-07 14:25   ` Minchan Kim
2011-08-09 11:04     ` Michel Lespinasse
2011-08-09 22:22       ` Minchan Kim [this message]
2011-08-12 22:35         ` Michel Lespinasse
2011-08-13  4:07           ` Minchan Kim
2011-08-12 15:36       ` Andrea Arcangeli
2011-08-12 16:08         ` SPAM: " Paul E. McKenney
2011-08-12 16:43           ` Andrea Arcangeli
2011-08-12 17:27             ` Paul E. McKenney
2011-08-12 23:45               ` Michel Lespinasse
2011-08-13  1:57                 ` Paul E. McKenney
2011-08-13 23:56                   ` Andrea Arcangeli
2011-08-13  4:18             ` Minchan Kim
2011-08-12 16:57           ` Johannes Weiner
2011-08-12 17:08             ` Andrea Arcangeli
2011-08-12 17:52               ` Johannes Weiner
2011-08-12 18:13                 ` Paul E. McKenney
2011-08-12 19:05                   ` Johannes Weiner
2011-08-12 22:14                     ` Paul E. McKenney
2011-08-12 22:22                 ` Andrea Arcangeli
2011-08-12 18:03               ` Paul E. McKenney
2011-08-12 17:41             ` Paul E. McKenney
2011-08-12 17:56               ` Johannes Weiner
2011-08-12 23:02           ` Michel Lespinasse
2011-08-12 22:50         ` Michel Lespinasse
2011-08-13  4:11         ` Minchan Kim
2011-08-12 16:58   ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEwNFnA3o_9SJZEeSg5azO45-E3HhEcSnww0GarN5m6n-qL_Vw@mail.gmail.com \
    --to=minchan.kim@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=jweiner@redhat.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=riel@redhat.com \
    --cc=shaohua.li@intel.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.