From: Leon Romanovsky <leon@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	Dan Williams <dan.j.williams@intel.com>,
	RDMA mailing list <linux-rdma@vger.kernel.org>,
	Majd Dibbiny <majd@mellanox.com>
Subject: Re: Filesystem crashes due to pages without buffers
Date: Thu, 4 Jan 2018 08:10:53 +0200
Message-ID: <20180104061053.GA10145@mtr-leonro.local>
In-Reply-To: <20180103100430.GE4911@quack2.suse.cz>


On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> Hello,
>
> Over the years I have seen so far unexplained crashes in filesystems'
> (ext4, xfs) writeback paths due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing, as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> the ->page_mkwrite() and ->write_begin() callbacks.
>
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
>
> CPU1					CPU2
>
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
> 	  get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					kswapd reclaims pages:
> 					...
> 					shrink_page_list()
> 					  trylock_page(page) - this is the
> 					    page CPU1 has just faulted in
> 					  try_to_unmap(page)
> 					  pageout(page);
> 					    clear_page_dirty_for_io(page);
> 					    ->writepage()
> 					  - let's assume page got written
> 					    out fast enough, alternatively
> 					    we could get to the same path as
> 					    soon as the page IO completes
> 					  if (page_has_private(page)) {
> 					    try_to_release_page(page)
> 					      - reclaims buffers from the
> 					        page
> 					   __remove_mapping(page)
> 					     - fails as DIO code still
> 					       holds page reference
> ...
>
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
>
> Am I missing something?
>
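To make the pattern above concrete, here is a minimal userspace sketch of
the setup in the race diagram. The file names and length are made up, and
alignment checks and error handling are omitted; it is only an illustration
of the API usage, not a guaranteed reproducer, since the race also needs
reclaim to strip the buffers while the DIO is in flight.

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;			/* hypothetical length */

	/* file1 lives on ext4/xfs; its pages get buffers in ->page_mkwrite() */
	int fd1 = open("file1", O_RDWR);
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd1, 0);

	int fd2 = open("file2", O_RDONLY | O_DIRECT);

	/*
	 * get_user_pages_fast() faults file1's pages in writeably,
	 * attaching buffers and dirtying them; under memory pressure
	 * kswapd may write them out and strip the buffers while the DIO
	 * still holds references, and dio_bio_complete() then redirties
	 * them without buffers.
	 */
	read(fd2, addr, len);

	munmap(addr, len);
	close(fd2);
	close(fd1);
	return 0;
}
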
> The problem here is that filesystems fundamentally assume that a page can
> be written to only between ->write_begin and ->write_end (in this interval
> the page is locked), or between ->page_mkwrite and ->writepage. The above is
> an example where this does not hold: once a page reference has been
> acquired through get_user_pages(), the page can be written to and dirtied
> by the holder of the reference even after it has been unmapped from page
> tables and ->writepage has been called. This is not only a cosmetic issue
> leading to assertion failures; it can also lead to data loss, data
> corruption, or other unpleasant surprises, as filesystems assume page
> contents cannot be modified until either ->write_begin() or ->page_mkwrite
> gets called, and those calls are serialized by proper locking against
> problematic operations such as hole punching.
>
> I'm not sure how to fix this problem. We could 'simulate' a writeable page
> fault in set_page_dirty_lock(). It is a bit ugly since we don't have the
> virtual address of the fault, don't hold mmap_sem, etc., and it is possibly
> expensive, but it would make filesystems happy. Data stored by the GUP user
> (e.g. read by DIO in the above case) could still get lost if someone e.g.
> punched a hole under the buffer or otherwise messed with the underlying
> storage of the page while DIO was running, but arguably users could expect
> such an outcome.
>
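As a rough illustration of what 'simulating' the fault could mean, below is
a purely hypothetical sketch (not existing kernel code) of
set_page_dirty_lock() giving the filesystem a chance to re-attach buffers
before the page is dirtied. The ->mkwrite_page() address_space operation is
invented for this sketch, and the missing virtual address, mmap_sem and
cost problems mentioned above are glossed over.

/*
 * Hypothetical sketch only, NOT existing kernel code.  The invented
 * ->mkwrite_page() hook would have to re-attach buffers and reserve or
 * allocate blocks, much like ->page_mkwrite() does, but without a vma
 * or a faulting virtual address.
 */
int set_page_dirty_lock(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	int ret;

	lock_page(page);
	if (mapping && mapping->a_ops->mkwrite_page)	/* invented hook */
		mapping->a_ops->mkwrite_page(page);
	ret = set_page_dirty(page);
	unlock_page(page);
	return ret;
}
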
> Another possible solution would be to make sure the page is writeably
> mapped until the GUP user drops its reference. That would arguably be
> cleaner, but it would probably mean we have to track the number of
> writeable GUP page references separately (the lack of spare space in
> struct page is a problem here) and block page_mkclean() until they are
> dropped. Also, for long-term GUP users like InfiniBand or V4L we'd have to
> come up with some solution, as we should not block page_mkclean() for that
> long.
>
> As a side note, DAX needs some solution for GUP users as well. The
> problems there are similar in nature, just much easier to hit. So at least
> the solution for long-term GUP users can (and I strongly believe should)
> be shared between the standard and DAX paths.
>
> Does anybody have other ideas for how to fix the problem, opinions on
> which solution would be better, or complications I have missed?
>

+RDMA

Hi Jan,

I don't have an actual proposal for how to fix this, but I wanted to
mention that we have a customer who experiences these failures in his
setup. In his case, it is reproducible 100% of the time within
approximately two minutes of runtime.

His application creates two memory regions with ib_umem_get(); one is
backed by ext4 and the other is anonymous. After approximately two minutes
of data traffic, he stops the system and releases those memory regions via
ib_umem_release() -> __ib_umem_release() -> set_page_dirty_lock().
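
For illustration, this is roughly what that pattern looks like from
userspace (device selection, sizes, the file path and error handling are
invented or omitted). Registering such buffers with ibv_reg_mr() is what
reaches ib_umem_get() in the kernel, and deregistering them is what ends up
in ib_umem_release():

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

#define LEN (16 << 20)			/* hypothetical region size */

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	int fd = open("/mnt/ext4/data", O_RDWR);	/* hypothetical path */
	void *file_buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			      MAP_SHARED, fd, 0);	/* ext4-backed region */
	void *anon_buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct ibv_mr *mr_file = ibv_reg_mr(pd, file_buf, LEN,
					    IBV_ACCESS_LOCAL_WRITE |
					    IBV_ACCESS_REMOTE_WRITE);
	struct ibv_mr *mr_anon = ibv_reg_mr(pd, anon_buf, LEN,
					    IBV_ACCESS_LOCAL_WRITE |
					    IBV_ACCESS_REMOTE_WRITE);

	/* ... roughly two minutes of RDMA traffic into both regions ... */

	/*
	 * Kernel side: ib_umem_release() -> __ib_umem_release() ->
	 * set_page_dirty_lock() on the pinned pages; if reclaim stripped
	 * their buffers, a later writeback (here via msync) hits the
	 * ext4 BUG_ON shown below.
	 */
	ibv_dereg_mr(mr_file);
	ibv_dereg_mr(mr_anon);
	return 0;
}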

A couple of seconds later, he hits the following BUG_ON.

[ 1411.545311] ------------[ cut here ]------------
[ 1411.545340] kernel BUG at fs/ext4/inode.c:2297!
[ 1411.545360] invalid opcode: 0000 [#1] SMP
[ 1411.545381] Modules linked in: xt_nat veth ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack
br_netfilter bridge stp llc overlay(T) rdma_ucm(OE) ib_ucm(OE)
rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE)
mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE)
mlx4_core(OE) devlink mlx_compat(OE) intel_powerclamp coretemp
intel_rapl iosf_mbi kvm_intel vfat fat kvm irqbypass crc32_pclmul
ghash_clmulni_intel aesni_intel lrw gf128mul ext4 glue_helper
ablk_helper cryptd mbcache jbd2 iTCO_wdt iTCO_vendor_support mxm_wmi
pcspkr sb_edac edac_core i2c_i801 sg mei_me mei shpchp lpc_ich
ipmi_devintf ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter
knem(OE) nfsd auth_rpcgss
[ 1411.545744] nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic ast drm_kms_helper crct10dif_pclmul
crct10dif_common syscopyarea sysfillrect crc32c_intel mpt3sas sysimgblt
fb_sys_fops ttm ahci raid_class libahci scsi_transport_sas igb drm
libata dca i2c_algo_bit ptp nvme i2c_core pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: devlink]
[ 1411.545926] CPU: 6 PID: 13195 Comm: node_runner_w8 Tainted: G W OE
------------ T 3.10.0-514.21.1.el7.debug_bz1368895.x86_64 #1
[ 1411.545975] Hardware name: Quanta Computer Inc D51BP-1U (dual 1G LoM)/S2BP-MB (dual 1G LoM), BIOS S2BP3B04 03/03/2016
[ 1411.546017] task: ffff881e4b323ec0 ti: ffff881e49bbc000 task.ti: ffff881e49bbc000
[ 1411.546047] RIP: 0010:[<ffffffffa07083e5>] [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.546103] RSP: 0018:ffff881e49bbfc10 EFLAGS: 00010246
[ 1411.546125] RAX: 001fffff0000003d RBX: ffff881e49bbfc68 RCX: 0000000000000170
[ 1411.546154] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88207ff8dde8
[ 1411.546183] RBP: ffff881e49bbfce8 R08: 0000000000000000 R09: 0000000000000001
[ 1411.546212] R10: 57fe04c2df4f6680 R11: 0000000000000008 R12: 7ffffffffffffe9e
[ 1411.546240] R13: 000000000003ffff R14: ffffea0001449a00 R15: ffff881e49bbfd90
[ 1411.546270] FS: 00007f5dd5de7d40(0000) GS:ffff881fffb80000(0000) knlGS:0000000000000000
[ 1411.546302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1411.546325] CR2: 00007f5b0c5969c0 CR3: 0000001e5e6ae000 CR4: 00000000003407e0
[ 1411.546354] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1411.546383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1411.546412] Stack:
[ 1411.546421] ffff881e49bbfc50 0000000000000002 ffff881f07b00628 ffff881e49bbfcb8
[ 1411.546456] 000000000000017b 000000000000000e 0000000000000000 ffffea00794f6600
[ 1411.546490] ffffea00794f6640 ffffea00794f6680 ffffea0001449a00 ffffea00014499c0
[ 1411.546524] Call Trace:
[ 1411.546545] [<ffffffffa070c9ab>] ext4_writepages+0x45b/0xd60 [ext4]
[ 1411.546576] [<ffffffff8118d93e>] do_writepages+0x1e/0x40
[ 1411.546601] [<ffffffff811824f5>] __filemap_fdatawrite_range+0x65/0x80
[ 1411.546629] [<ffffffff81182641>] filemap_write_and_wait_range+0x41/0x90
[ 1411.546664] [<ffffffffa0703bba>] ext4_sync_file+0xba/0x320 [ext4]
[ 1411.546692] [<ffffffff8123028d>] vfs_fsync_range+0x1d/0x30
[ 1411.546717] [<ffffffff811ba89e>] SyS_msync+0x1fe/0x250
[ 1411.546741] [<ffffffff816974c9>] system_call_fastpath+0x16/0x1b
[ 1411.546765] Code: ff ff ff e8 2e 7e a8 e0 8b 85 40 ff ff ff eb c2 48
8d bd 50 ff ff ff e8 1a 7e a8 e0 eb 8c 4c 89 f7 e8 a0 81 a7 e0 e9 d5 fe
ff ff <0f> 0b 0f 0b e8 62 d6 97 e0 66 90 0f 1f 44 00 00 55 48 89 e5 41
[ 1411.548075] RIP [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.549292] RSP <ffff881e49bbfc10>
---here vmcore-dmesg got cut off----

Thanks

> 								Honza
>
> [1] https://www.spinics.net/lists/linux-xfs/msg10090.html
> [2] https://www.spinics.net/lists/linux-ext4/msg54377.html
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>