Re: mmotm hangs on compaction lock_page

From: Mel Gorman <mel@csn.ul.ie>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org
Subject: Re: mmotm hangs on compaction lock_page
Date: Fri, 7 Jan 2011 17:57:05 +0000	[thread overview]
Message-ID: <20110107175705.GL29257@csn.ul.ie> (raw)
In-Reply-To: <20110107145259.GK29257@csn.ul.ie>

On Fri, Jan 07, 2011 at 02:52:59PM +0000, Mel Gorman wrote:
> On Thu, Jan 06, 2011 at 05:20:25PM -0800, Hugh Dickins wrote:
> > Hi Mel,
> > 
> > Here are the traces of two concurrent "cp -a kerneltree elsewhere"s
> > which have hung on mmotm: in limited RAM, on a PowerPC
> 
> How limited in RAM and how many CPUs?
> 
> > - I don't have
> > an explanation for why I can reproduce it in minutes on that box but
> > never yet on the x86s.
> > 
> 
> Strongest bet is simply that compaction is not triggering for you on the
> x86 boxes. Monitor "grep compact /proc/vmstat" on the two machines and
> see if the counters are growing on powerpc and not on x86. I'm trying to
> reproduce the problem locally but no luck yet.
> 
> > Perhaps we can get it to happen with just one cp: the second cp here
> > seemed to be deadlocking itself, unmap_and_move()'s force lock_page
> > waiting on a page which its page cache readahead already holds locked.
> > 
> > cp              D 000000000fea3110     0 18874  18873 0x00008010
> > Call Trace:
> >  .__switch_to+0xcc/0x110
> >  .schedule+0x670/0x7b0
> >  .io_schedule+0x50/0x8c
> >  .sync_page+0x84/0xa0
> >  .sync_page_killable+0x10/0x48
> >  .__wait_on_bit_lock+0x9c/0x140
> >  .__lock_page_killable+0x74/0x98
> >  .do_generic_file_read+0x2b0/0x504
> >  .generic_file_aio_read+0x214/0x29c
> >  .do_sync_read+0xac/0x10c
> >  .vfs_read+0xd0/0x1a0
> >  .SyS_read+0x58/0xa0
> >  syscall_exit+0x0/0x40
> >  syscall_exit+0x0/0x40
> > 
> > cp              D 000000000fea3110     0 18876  18875 0x00008010
> > Call Trace:
> >  0xc000000001343b68 (unreliable)
> >  .__switch_to+0xcc/0x110
> >  .schedule+0x670/0x7b0
> >  .io_schedule+0x50/0x8c
> >  .sync_page+0x84/0xa0
> >  .__wait_on_bit_lock+0x9c/0x140
> >  .__lock_page+0x74/0x98
> >  .unmap_and_move+0xfc/0x380
> >  .migrate_pages+0xbc/0x18c
> >  .compact_zone+0xbc/0x400
> >  .compact_zone_order+0xc8/0xf4
> >  .try_to_compact_pages+0x104/0x1b8
> >  .__alloc_pages_direct_compact+0xa8/0x228
> >  .__alloc_pages_nodemask+0x42c/0x730
> >  .allocate_slab+0x84/0x168
> >  .new_slab+0x58/0x198
> >  .__slab_alloc+0x1ec/0x430
> >  .kmem_cache_alloc+0x7c/0xe0
> >  .radix_tree_preload+0x94/0x140
> >  .add_to_page_cache_locked+0x70/0x1f0
> >  .add_to_page_cache_lru+0x50/0xac
> >  .mpage_readpages+0xcc/0x198
> >  .ext3_readpages+0x28/0x40
> >  .__do_page_cache_readahead+0x1ac/0x2ac
> >  .ra_submit+0x28/0x38
> 
> Something is odd right here. I would have expected entries in the
> calli stack containing
> 
>  .ondemand_readahead
>  .page_cache_sync_readahead
> 
> I am going to have to assume these functions were really called
> otherwise the ra_submit is a mystery :(
> 
> >  .do_generic_file_read+0xe8/0x504
> >  .generic_file_aio_read+0x214/0x29c
> >  .do_sync_read+0xac/0x10c
> >  .vfs_read+0xd0/0x1a0
> >  .SyS_read+0x58/0xa0
> >  syscall_exit+0x0/0x40
> > 
> > I haven't made a patch for it, just hacked unmap_and_move() to say
> > "if (!0)" instead of "if (!force)" to get on with my testing.  I expect
> > you'll want to pass another arg down to migrate_pages() to prevent
> > setting force, or give it some flags, or do something with PF_MEMALLOC.
> > 
> 
> I tend to agree but I'm failing to see how it might be happening right now.
> The callchain looks something like
> 
> do_generic_file_read
>    # page is not found
>    page_cache_sync_readahead
>       ondemand_readahead
>          ra_submit
>             __do_page_cache_readahead
>                # Allocates a bunch of pages
>                # Sets PageReadahead. Otherwise the pages  initialised
>                # and they are not on the LRU yet
>                read_pages
>                   # Calls mapping->readpages which calls mpage_readpages
>                   mpage_readpages
>                      # For the list of pages (index initialised), add
>                      # each of them to the LRU. Adding to the LRU
>                      # locks the page and should return the page
>                      # locked.
>                      add_to_page_cache_lru
>                      # sets PageSwapBacked
>                         add_to_page_cache
>                            # locks page
>                            add_to_page_cache_locked
>                               # preloads radix tree
>                                  radix_tree_preload
>                                  # DEADLOCK HERE. This is what does not
>                                  # make sense. Compaction could not be
>                                  # be finding the page on the LRU as
>                                  # lru_cache_add_file() has not been
>                                  # called yet for this page
> 
> So I don't think we are locking on the same page.
> 
> Here is a possibility. mpage_readpages() is reading ahead so there are obviously
> pages that are not Uptodate. It queues these for asynchronous read with
> block_read_full_page(), returns and adds the page to the LRU (so compaction
> is now able to find it). IO starts at some time in the future with the page
> still locked and gets unlocked at the end of IO by end_buffer_async_read().
> 
> Between when IO is queued and it completes, a new page is being added to
> the LRU, the radix tree is loaded and compaction kicks off trying to
> lock the same page that is not up to date yet. Something is preventing
> the IO completing and the page being unlocked but I'm missing what that
> might be.
> 
> Does this sound plausible? I'll keep looking but I wanted to see if
> someone spotted quickly a major flaw in the reasoning or have a quick
> guess as to why the page might not be getting unlocked at the end of IO
> properly.
> 

I've still not figured out why the pages are not getting unlocked again
but can you try the following bodge patch to confirm it's something to
do with !Uptodate pages please?

==== CUT HERE ====

diff --git a/mm/migrate.c b/mm/migrate.c
index e12f717..66a2aac 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -612,6 +612,23 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 /*
+ * HACK
+ * Do not allow direct compaction to migrate !Uptodate pages. It is possible
+ * the caller has locked the page for IO to complete and for some reason the
+ * IO is not completing and unlocking the page leading to deadlock
+ */
+static inline bool safe_to_lock_page(struct page *page)
+{
+	if (!(current->flags & PF_MEMALLOC))
+		return true;
+
+	if (!PageUptodate(page))
+		return false;
+
+	return true;
+}
+
+/*
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
  */
@@ -642,7 +659,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
-		if (!force)
+		if (!force || !safe_to_lock_page(page))
 			goto move_newpage;
 		lock_page(page);
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>