From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751842AbcGRGC0 (ORCPT ); Mon, 18 Jul 2016 02:02:26 -0400 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:16600 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751802AbcGRGCV (ORCPT ); Mon, 18 Jul 2016 02:02:21 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2ASDwA+cIxXEHvHLHlcgz+BUoZvnRIBAQEBAQEGjFWKEYYUAgIBAQKBK00BAQEBAQEHAQEBAQEBAj5AhFwBAQUnExwjEAgDFQMJJQ8FJQMHGhOIL78YAQEBAQEFAQEBAQEBIR6FRIUVhAgUAYNPgi8FhgpMkk6FKoZXglSBdYgIhUSGX4k/gl8cgV4qMoV2AQ4XgR4BAQE Date: Mon, 18 Jul 2016 16:02:15 +1000 From: Dave Chinner To: Calvin Owens Cc: linux-block@vger.kernel.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, xfs@oss.sgi.com Subject: Re: [BUG] Slab corruption during XFS writeback under memory pressure Message-ID: <20160718060215.GB16044@dastard> References: <28f77d74-5ab4-d913-2921-df90da53f393@fb.com> <20160717000003.GW1922@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160717000003.GW1922@dastard> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote: > On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote: > > Hello all, > > > > I've found a nasty source of slab corruption. Based on seeing similar symptoms > > on boxes at Facebook, I suspect it's been around since at least 3.10. > > > > It only reproduces under memory pressure so far as I can tell: the issue seems > > to be that XFS reclaims pages from buffers that are still in use by > > scsi/block. I'm not sure which side the bug lies on, but I've only observed it > > with XFS. [....] > But this indicates that the page is under writeback at this point, > so that tends to indicate that the above freeing was incorrect. > > Hmmm - it's clear we've got direct reclaim involved here, and the > suspicion of a dirty page that has had it's bufferheads cleared. > Are there any other warnings in the log from XFS prior to kasan > throwing the error? Can you try the patch below? -Dave. -- Dave Chinner david@fromorbit.com xfs: bufferhead chains are invalid after end_page_writeback From: Dave Chinner In xfs_finish_page_writeback(), we have a loop that looks like this: do { if (off < bvec->bv_offset) goto next_bh; if (off > end) break; bh->b_end_io(bh, !error); next_bh: off += bh->b_size; } while ((bh = bh->b_this_page) != head); The b_end_io function is end_buffer_async_write(), which will call end_page_writeback() once all the buffers have marked as no longer under IO. This issue here is that the only thing currently protecting both the bufferhead chain and the page from being reclaimed is the PageWriteback state held on the page. While we attempt to limit the loop to just the buffers covered by the IO, we still read from the buffer size and follow the next pointer in the bufferhead chain. There is no guarantee that either of these are valid after the PageWriteback flag has been cleared. Hence, loops like this are completely unsafe, and result in use-after-free issues. One such problem was caught by Calvin Owens with KASAN: ..... INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1 free_buffer_head+0x41/0x90 __slab_free+0x1ed/0x340 kmem_cache_free+0x270/0x300 free_buffer_head+0x41/0x90 try_to_free_buffers+0x171/0x240 xfs_vm_releasepage+0xcb/0x3b0 try_to_release_page+0x106/0x190 shrink_page_list+0x118e/0x1a10 shrink_inactive_list+0x42c/0xdf0 shrink_zone_memcg+0xa09/0xfa0 shrink_zone+0x2c3/0xbc0 ..... Call Trace: [] dump_stack+0x68/0x94 [] print_trailer+0x115/0x1a0 [] object_err+0x34/0x40 [] kasan_report_error+0x217/0x530 [] __asan_report_load8_noabort+0x43/0x50 [] xfs_destroy_ioend+0x3bf/0x4c0 [] xfs_end_bio+0x154/0x220 [] bio_endio+0x158/0x1b0 [] blk_update_request+0x18b/0xb80 [] scsi_end_request+0x97/0x5a0 [] scsi_io_completion+0x438/0x1690 [] scsi_finish_command+0x375/0x4e0 [] scsi_softirq_done+0x280/0x340 Where the access is occuring during IO completion after the buffer had been freed from direct memory reclaim. Prevent use-after-free accidents in this end_io processing loop by pre-calculating the loop conditionals before calling bh->b_end_io(). The loop is already limited to just the bufferheads covered by the IO in progress, so the offset checks are sufficient to prevent accessing buffers in the chain after end_page_writeback() has been called by the the bh->b_end_io() callout. Yet another example of why Bufferheads Must Die. Signed-off-by: Dave Chinner --- fs/xfs/xfs_aops.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 80714eb..0cfb944 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -87,6 +87,12 @@ xfs_find_bdev_for_inode( * We're now finished for good with this page. Update the page state via the * associated buffer_heads, paying attention to the start and end offsets that * we need to process on the page. + * + * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last + * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or + * the page at all, as we may be racing with memory reclaim and it can free both + * the bufferhead chain and the page as it will see the page as clean and + * unused. */ static void xfs_finish_page_writeback( @@ -95,8 +101,9 @@ xfs_finish_page_writeback( int error) { unsigned int end = bvec->bv_offset + bvec->bv_len - 1; - struct buffer_head *head, *bh; + struct buffer_head *head, *bh, *next; unsigned int off = 0; + unsigned int bsize; ASSERT(bvec->bv_offset < PAGE_SIZE); ASSERT((bvec->bv_offset & ((1 << inode->i_blkbits) - 1)) == 0); @@ -105,15 +112,17 @@ xfs_finish_page_writeback( bh = head = page_buffers(bvec->bv_page); + bsize = bh->b_size; do { + next = bh->b_this_page; if (off < bvec->bv_offset) goto next_bh; if (off > end) break; bh->b_end_io(bh, !error); next_bh: - off += bh->b_size; - } while ((bh = bh->b_this_page) != head); + off += bsize; + } while ((bh = next) != head); } /*