All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm/compaction: avoid VM_BUG_ON(PageSlab()) in page_mapcount()
Date: Sun, 24 May 2020 14:21:13 +0300	[thread overview]
Message-ID: <63fe94c7-78d1-ae03-00da-ba0e6d207a70@yandex-team.ru> (raw)
In-Reply-To: <alpine.LSU.2.11.2005231650070.1171@eggly.anvils>

On 24/05/2020 04.01, Hugh Dickins wrote:
> On Wed, 13 May 2020, Konstantin Khlebnikov wrote:
> 
>> Function isolate_migratepages_block() runs some checks out of lru_lock
>> when choose pages for migration. After checking PageLRU() it checks extra
>> page references by comparing page_count() and page_mapcount(). Between
>> these two checks page could be removed from lru, freed and taken by slab.
>>
>> As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
>> Race window is tiny. For certain workload this happens around once a year.
> 
> Around once a year, that was my guess too. I have no record of us ever
> hitting this, but yes it could happen when you have CONFIG_DEBUG_VM=y
> (which I too like to run with, but would not recommend for users).

Yep, but for large cluster and pinpointed workload this happens surprisingly
frequently =) I've believed into this race only after seeing statistics for
count of compactions and how it correlates with incidents.

Probably the key component is a slab allocation from network irq/bh context
which interrupts compaction exactly at this spot.

> 
>>
>>
>>   page:ffffea0105ca9380 count:1 mapcount:0 mapping:ffff88ff7712c180 index:0x0 compound_mapcount: 0
>>   flags: 0x500000000008100(slab|head)
>>   raw: 0500000000008100 dead000000000100 dead000000000200 ffff88ff7712c180
>>   raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
>>   page dumped because: VM_BUG_ON_PAGE(PageSlab(page))
>>   ------------[ cut here ]------------
>>   kernel BUG at ./include/linux/mm.h:628!
>>   invalid opcode: 0000 [#1] SMP NOPTI
>>   CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G        W         4.19.109-27 #1
>>   Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019
>>   RIP: 0010:isolate_migratepages_block+0x986/0x9b0
>>
>>
>> To fix just opencode page_mapcount() in racy check for 0-order case and
>> recheck carefully under lru_lock when page cannot escape from lru.
>>
>> Also add checking extra references for file pages and swap cache.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Fixes: 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")
> 
> Not really, that commit was correct at the time it went in.
> 
>> Fixes: 1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()")
> 
> Exactly, that commit was well-intentioned, but did not allow for this
> (admittedly very exceptional) usage.  How many developers actually
> make the mistake of applying page_mapcount() to their slab pages?
> None, I expect.  That VM_BUG_ON_PAGE() is there for documentation,
> and could just be replaced by a comment - and Linus would be happy
> with that.

Ok, I'll redo the fix in this way.

> 
>> ---
>>   mm/compaction.c |   17 +++++++++++++----
>>   1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 46f0fcc93081..91bb87fd9420 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   		}
>>   
>>   		/*
>> -		 * Migration will fail if an anonymous page is pinned in memory,
>> +		 * Migration will fail if an page is pinned in memory,
>>   		 * so avoid taking lru_lock and isolating it unnecessarily in an
>> -		 * admittedly racy check.
>> +		 * admittedly racy check simplest case for 0-order pages.
>> +		 *
>> +		 * Open code page_mapcount() to avoid VM_BUG_ON(PageSlab(page)).
> 
> But open coding page_mapcount() is not all that you did.  You have
> (understandably) chosen to avoid calling page_mapping(page), but...
> 
>> +		 * Page could have extra reference from mapping or swap cache.
>>   		 */
>> -		if (!page_mapping(page) &&
>> -		    page_count(page) > page_mapcount(page))
>> +		if (!PageCompound(page) &&
>> +		    page_count(page) > atomic_read(&page->_mapcount) + 1 +
>> +				(!PageAnon(page) || PageSwapCache(page)))
>>   			goto isolate_fail;
> 
> Isn't that test going to send all the file cache pages with buffer heads
> in page->private, off to isolate_fail when they're actually great
> candidates for migration?

Yes. What a shame. Adding page_has_private() could fix that?

Kind of

page_count(page) > page_mapcount(page) +
(PageAnon(page) ? PageSwapCache(page) : (1 + page_has_private(page)))

or probably something like this:

page_count(page) > page_mapcount(page) +
(PageAnon(page) ? PageSwapCache(page) : GUP_PIN_COUNTING_BIAS)

I.e. skip only file pages pinned by dma or something slower.
I see some movements in this direction in recent changes.

of course that's independent matter.

> 
> Given that the actual bug spotted was with the VM_BUG_ON_PAGE(PageSlab),
> and nobody has reported any crash from the use of page_mapping() there
> (and we only need the test to be right most of the time: all of this
> knowingly racy, as you explain in other mail): I'd go for just replacing
> the VM_BUG_ON_PAGE in page_mapcount() by a comment about this case.
> 
> But if you think developers are really in danger of coding page_mapcount()
> on their slab pages, then you could add a _page_mapcount() to linux/mm.h,
> which omits the VM_BUG_ON_PAGE, for use here only.
> 
> Then we wouldn't have to think so hard about the counting above!
> 
>>   
>>   		/*
>> @@ -975,6 +979,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   				low_pfn += compound_nr(page) - 1;
>>   				goto isolate_fail;
>>   			}
>> +
>> +			/* Recheck page extra references under lock */
>> +			if (page_count(page) > page_mapcount(page) +
>> +				    (!PageAnon(page) || PageSwapCache(page)))
>> +				goto isolate_fail;
> 
> Well, that lru_lock (and the intervening PageLRU check after getting it)
> may restrict PageAnon and PageSwapCache transitions to some extent, but
> it certainly has no effect on page_count and page_mapcount: so I think
> such an additional check here is rather superfluous, and we should just
> rely on the final checks in migrate_page_move_mapping(), as before.
> 
>>   		}
>>   
>>   		lruvec = mem_cgroup_page_lruvec(page, pgdat);

  reply	other threads:[~2020-05-24 11:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-13 14:05 [PATCH] mm/compaction: avoid VM_BUG_ON(PageSlab()) in page_mapcount() Konstantin Khlebnikov
2020-05-13 18:32 ` Andrew Morton
2020-05-13 19:28   ` Konstantin Khlebnikov
2020-05-13 21:35     ` Vlastimil Babka
2020-05-23  1:34 ` Andrew Morton
2020-05-23 13:55   ` Konstantin Khlebnikov
2020-05-23 13:55     ` Konstantin Khlebnikov
2020-05-24  1:01 ` Hugh Dickins
2020-05-24  1:01   ` Hugh Dickins
2020-05-24 11:21   ` Konstantin Khlebnikov [this message]
2020-05-24 19:12     ` Hugh Dickins
2020-05-24 19:12       ` Hugh Dickins
2020-06-02  4:05   ` Hugh Dickins
2020-06-02  4:05     ` Hugh Dickins
2020-06-02  4:13     ` Andrew Morton
2020-06-02 11:28     ` Alex Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=63fe94c7-78d1-ae03-00da-ba0e6d207a70@yandex-team.ru \
    --to=khlebnikov@yandex-team.ru \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.