linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>, Jan Kara <jack@suse.cz>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: [PATCH] iomap: Submit the BIO at the end of each extent
Date: Mon, 4 May 2020 17:37:10 -0700	[thread overview]
Message-ID: <20200505003710.GO5703@magnolia> (raw)
In-Reply-To: <20200320214654.GC6812@magnolia>

On Fri, Mar 20, 2020 at 02:46:54PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > By definition, an extent covers a range of consecutive blocks, so
> > it would be quite rare to be able to just add pages to the BIO from
> > a previous range.  The only case we can think of is a mapped extent
> > followed by a hole extent, followed by another mapped extent which has
> > been allocated immediately after the first extent.  We believe this to
> > be an unlikely layout for a filesystem to choose and, since the queue
> > is plugged, those two BIOs would be merged by the block layer.
> > 
> > The reason we care is that ext2/ext4 choose to lay out blocks 0-11
> > consecutively, followed by the indirect block, and we want to merge those
> > two BIOs.  If we don't submit the data BIO before asking the filesystem
> > for the next extent, then the indirect BIO will be submitted first,
> > and waited for, leading to inefficient I/O patterns.  Buffer heads solve
> > this with the BH_boundary flag, but iomap doesn't need that as long as
> > we submit the bio here.
> 
> Hmm, I just received the following stack trace while running generic/418
> on a v5 filesystem with 1k blocks:

I hit this crash again, though I've figured out how to trigger it
reliably.  Boot a x64 VM with mem=512M to make memory constrained, and
then kick off fstests with 1k block size:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 alder-mtr00 5.7.0-rc4-djw #rc4 SMP PREEMPT Mon May 4 09:24:50 PDT 2020
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt

and then you get the following:

xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
run fstests generic/418 at 2020-05-04 17:27:51
rm (3338) used greatest stack depth: 11728 bytes left
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:iomap_set_range_uptodate+0x5d/0x170
Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87
RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000
RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440
RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00
R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20
R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610
FS:  00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0
Call Trace:
 ? __raw_spin_lock_init+0x39/0x60
 iomap_readpage_actor+0x113/0x3f0
 iomap_readpages_actor+0x1dc/0x240
 iomap_apply+0x12d/0x4e9
 ? iomap_readpage_actor+0x3f0/0x3f0
 ? mark_held_locks+0x45/0x70
 iomap_readpages+0xc2/0x290
 ? iomap_readpage_actor+0x3f0/0x3f0
 ? xa_clear_mark+0x30/0x30
 read_pages+0x75/0x1b0
 __do_page_cache_readahead+0x1bb/0x1d0
 ondemand_readahead+0x21a/0x540
 ? pagecache_get_page+0x26/0x320
 generic_file_read_iter+0x91a/0xd10
 ? xfs_file_buffered_aio_read+0x88/0x170 [xfs]
 xfs_file_buffered_aio_read+0x65/0x170 [xfs]
 xfs_file_read_iter+0xe9/0x2a0 [xfs]
 new_sync_read+0x12d/0x1d0
 vfs_read+0xc7/0x180
 ksys_pread64+0x64/0xa0
 do_syscall_64+0x50/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7f209179cbca

Digging into this with gcc, the RIP value is:

0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147).
142
143     static void
144     iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
145     {
146             struct iomap_page *iop = to_iomap_page(page);
147             struct inode *inode = page->mapping->host;
148             unsigned first = off >> inode->i_blkbits;
149             unsigned last = (off + len - 1) >> inode->i_blkbits;
150             bool uptodate = true;
151             unsigned long flags;

So now this makes me wonder, is it possible to be performing readahead
into a page that doesn't have page->mapping set yet?  I reran this a few
times, got crashes in different places, but the common factor is that
page->mapping is NULL, and we're doing readhead.

I also tried this with the patch *not* applied and had the same
problems, so it's not actually this patch.  But there's something going
wrong in the iomap code...

--D

> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd
> MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt
> 
> (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and
> empty MOUNT_OPTIONS.)
> 
> [   32.931667] XFS (sdd): Mounting V5 Filesystem
> [   32.940193] XFS (sdd): Ending clean mount
> [   32.941350] XFS (sdd): Quotacheck needed: Please wait.
> [   32.970724] XFS (sdd): Quotacheck: Done.
> [   32.972550] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff)
> [   32.985191] XFS (sdd): Unmounting Filesystem
> [   33.162426] XFS (sde): EXPERIMENTAL online scrub feature in use. Use at your own risk!
> [   33.273185] XFS (sde): Unmounting Filesystem
> [   33.517957] XFS (sde): Mounting V5 Filesystem
> [   33.526176] XFS (sde): Ending clean mount
> [   33.527439] XFS (sde): Quotacheck needed: Please wait.
> [   33.566117] XFS (sde): Quotacheck: Done.
> [   33.569427] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
> [   33.656942] run fstests generic/418 at 2020-03-20 14:42:29
> [   36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [   36.334254] #PF: supervisor read access in kernel mode
> [   36.334849] #PF: error_code(0x0000) - not-present page
> [   36.335461] PGD 0 P4D 0 
> [   36.335779] Oops: 0000 [#1] PREEMPT SMP
> [   36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4
> [   36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
> [   36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0
> [   36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12
> [   36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246
> [   36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a
> [   36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000
> [   36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c
> [   36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50
> [   36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00
> [   36.345246] FS:  00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
> [   36.346087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0
> [   36.347445] Call Trace:
> [   36.347734]  iomap_readpages_actor+0x1e3/0x250
> [   36.348699]  iomap_apply+0x12c/0x4e3
> [   36.349097]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.349593]  ? prep_new_page+0x3f/0x100
> [   36.350022]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.350519]  iomap_readpages+0xc7/0x2b0
> [   36.350938]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.351438]  read_pages+0x6e/0x1a0
> [   36.351824]  __do_page_cache_readahead+0x1c3/0x1e0
> [   36.352343]  ondemand_readahead+0x210/0x4b0
> [   36.352797]  generic_file_read_iter+0x871/0xcd0
> [   36.353365]  ? xfs_file_buffered_aio_read+0x54/0x170 [xfs]
> [   36.353982]  xfs_file_buffered_aio_read+0x5f/0x170 [xfs]
> [   36.354591]  xfs_file_read_iter+0xea/0x2a0 [xfs]
> [   36.355139]  ? xfs_file_write_iter+0xf2/0x1d0 [xfs]
> [   36.355668]  new_sync_read+0x12d/0x1d0
> [   36.356085]  vfs_read+0xa6/0x180
> [   36.356454]  ksys_pread64+0x64/0xa0
> [   36.356841]  do_syscall_64+0x50/0x1a0
> [   36.357252]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   36.357792] RIP: 0033:0x7f8928524f64
> [   36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49
> [   36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [   36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64
> [   36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003
> [   36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
> [   36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000
> [   36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400
> [   36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet
> [   36.366966] Dumping ftrace buffer:
> [   36.367351]    (ftrace buffer empty)
> [   36.367742] CR2: 0000000000000060
> [   36.369050] ---[ end trace d599586d1259866c ]---
> [   36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0
> [   36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12
> [   36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246
> [   36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a
> [   36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000
> [   36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c
> [   36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50
> [   36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00
> [   36.377322] FS:  00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
> [   36.378190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0
> 
> I'll email back if I find anything else but since this is the second
> Friday in a row of getting STOP SHIP bugs dropped in my lap at 2pm, I
> doubt I'm going to manage much.
> 
> --D
> 
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  fs/iomap/buffered-io.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index f080f542911b..417115bfaf6b 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> >  				ctx, iomap, srcmap);
> >  	}
> >  
> > +	/*
> > +	 * Submitting the bio here leads to better I/O patterns for
> > +	 * filesystems which need to do metadata reads to find the
> > +	 * next extent.
> > +	 */
> > +	if (ctx->bio) {
> > +		submit_bio(ctx->bio);
> > +		ctx->bio = NULL;
> > +	}
> > +
> >  	return done;
> >  }
> >  
> > @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> >  	}
> >  	ret = 0;
> >  done:
> > -	if (ctx.bio)
> > -		submit_bio(ctx.bio);
> >  	if (ctx.cur_page) {
> >  		if (!ctx.cur_page_in_bio)
> >  			unlock_page(ctx.cur_page);
> > -- 
> > 2.25.1
> > 

  parent reply	other threads:[~2020-05-05  0:39 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-20 14:40 [PATCH] iomap: Submit the BIO at the end of each extent Matthew Wilcox
2020-03-20 14:44 ` Christoph Hellwig
2020-03-20 15:50 ` Darrick J. Wong
2020-03-20 21:46 ` Darrick J. Wong
2020-03-21 14:20   ` Matthew Wilcox
2020-03-23 12:59     ` Christoph Hellwig
2020-05-05  0:37   ` Darrick J. Wong [this message]
2020-05-05  2:24     ` Matthew Wilcox
2020-05-05  2:59       ` Darrick J. Wong
2020-05-05 19:08         ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200505003710.GO5703@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=riteshh@linux.ibm.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).