All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	Boris Burkov <boris@bur.io>, David Sterba <dsterba@suse.com>
Subject: [PATCH 4.19 53/58] btrfs: fix fatal extent_buffer readahead vs releasepage race
Date: Tue, 14 Jul 2020 20:44:26 +0200	[thread overview]
Message-ID: <20200714184058.793401263@linuxfoundation.org> (raw)
In-Reply-To: <20200714184056.149119318@linuxfoundation.org>

From: Boris Burkov <boris@bur.io>

commit 6bf9cd2eed9aee6d742bb9296c994a91f5316949 upstream.

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1                                           CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        readahead_tree_block
    find_create_tree_block                                      find_create_tree_block
      alloc_extent_buffer                                         alloc_extent_buffer
                                                                  find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate pages to eb
                                                                  insert eb into radix tree
                                                                  set TREE_REF, refs == 2
                                                                  unlock pages
                                                              read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate (brand new eb)
                                                                                                            lock_page
                                                                if !trylock_page
                                                                  goto unlock_exit // not an error
                                                              free_extent_buffer
                                                                release_extent_buffer
                                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                                            try_release_extent_buffer
                                                                                                              take refs_lock
                                                                                                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                                              clear TREE_REF
                                                                                                              release_extent_buffer
                                                                                                                atomic_dec_and_test refs to 1
                                                                                                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    free_extent_buffer
      release_extent_buffer
        dec refs to 0
        delete from radix tree
        btrfs_release_extent_buffer_pages
          BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.

Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103]  release_extent_buffer+0x39/0x90
[1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645]  btrfs_search_slot+0x260/0x9b0
[1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427]  btrfs_get_extent+0x15f/0x830
[1417839.787665]  ? submit_extent_page+0xc4/0x1c0
[1417839.797474]  ? __do_readpage+0x299/0x7a0
[1417839.806515]  __do_readpage+0x33b/0x7a0
[1417839.815171]  ? btrfs_releasepage+0x70/0x70
[1417839.824597]  extent_readpages+0x28f/0x400
[1417839.833836]  read_pages+0x6a/0x1c0
[1417839.841729]  ? startup_64+0x2/0x30
[1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590]  filemap_fault+0x6c7/0x990
[1417839.869252]  ? xas_load+0x8/0x80
[1417839.876756]  ? xas_find+0x150/0x190
[1417839.884839]  ? filemap_map_pages+0x295/0x3b0
[1417839.894652]  __do_fault+0x32/0x110
[1417839.902540]  __handle_mm_fault+0xacd/0x1000
[1417839.912156]  handle_mm_fault+0xaa/0x1c0
[1417839.921004]  __do_page_fault+0x242/0x4b0
[1417839.930044]  ? page_fault+0x8/0x30
[1417839.937933]  page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/btrfs/extent_io.c |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4801,25 +4801,28 @@ struct extent_buffer *alloc_dummy_extent
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
-	/* the ref bit is tricky.  We have to make sure it is set
-	 * if we have the buffer dirty.   Otherwise the
-	 * code to free a buffer can end up dropping a dirty
-	 * page
+	/*
+	 * The TREE_REF bit is first set when the extent_buffer is added
+	 * to the radix tree. It is also reset, if unset, when a new reference
+	 * is created by find_extent_buffer.
 	 *
-	 * Once the ref bit is set, it won't go away while the
-	 * buffer is dirty or in writeback, and it also won't
-	 * go away while we have the reference count on the
-	 * eb bumped.
+	 * It is only cleared in two cases: freeing the last non-tree
+	 * reference to the extent_buffer when its STALE bit is set or
+	 * calling releasepage when the tree reference is the only reference.
 	 *
-	 * We can't just set the ref bit without bumping the
-	 * ref on the eb because free_extent_buffer might
-	 * see the ref bit and try to clear it.  If this happens
-	 * free_extent_buffer might end up dropping our original
-	 * ref by mistake and freeing the page before we are able
-	 * to add one more ref.
+	 * In both cases, care is taken to ensure that the extent_buffer's
+	 * pages are not under io. However, releasepage can be concurrently
+	 * called with creating new references, which is prone to race
+	 * conditions between the calls to check_buffer_tree_ref in those
+	 * codepaths and clearing TREE_REF in try_release_extent_buffer.
 	 *
-	 * So bump the ref count first, then set the bit.  If someone
-	 * beat us to it, drop the ref we added.
+	 * The actual lifetime of the extent_buffer in the radix tree is
+	 * adequately protected by the refcount, but the TREE_REF bit and
+	 * its corresponding reference are not. To protect against this
+	 * class of races, we call check_buffer_tree_ref from the codepaths
+	 * which trigger io after they set eb->io_pages. Note that once io is
+	 * initiated, TREE_REF can no longer be cleared, so that is the
+	 * moment at which any such race is best fixed.
 	 */
 	refs = atomic_read(&eb->refs);
 	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5273,6 +5276,11 @@ int read_extent_buffer_pages(struct exte
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 



  parent reply	other threads:[~2020-07-14 18:47 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-14 18:43 [PATCH 4.19 00/58] 4.19.133-rc1 review Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 01/58] KVM: s390: reduce number of IO pins to 1 Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 02/58] spi: spi-fsl-dspi: Adding shutdown hook Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 03/58] spi: spi-fsl-dspi: Fix lockup if device is removed during SPI transfer Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 04/58] spi: spi-fsl-dspi: use IRQF_SHARED mode to request IRQ Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 05/58] spi: spi-fsl-dspi: Fix external abort on interrupt in resume or exit paths Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 06/58] regmap: fix alignment issue Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 07/58] ARM: dts: omap4-droid4: Fix spi configuration and increase rate Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 08/58] drm/tegra: hub: Do not enable orphaned window group Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 09/58] gpu: host1x: Detach driver on unregister Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 10/58] spi: spidev: fix a race between spidev_release and spidev_remove Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 11/58] spi: spidev: fix a potential use-after-free in spidev_release() Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 12/58] ixgbe: protect ring accesses with READ- and WRITE_ONCE Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 13/58] i40e: " Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 14/58] drm: panel-orientation-quirks: Add quirk for Asus T101HA panel Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 15/58] drm: panel-orientation-quirks: Use generic orientation-data for Acer S1003 Greg Kroah-Hartman
2020-07-15 14:45   ` Pavel Machek
2020-07-14 18:43 ` [PATCH 4.19 16/58] s390/kasan: fix early pgm check handler execution Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 17/58] drm/sun4i: mixer: Call of_dma_configure if theres an IOMMU Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 18/58] cifs: update ctime and mtime during truncate Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 19/58] ARM: imx6: add missing put_device() call in imx6q_suspend_init() Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 20/58] scsi: mptscsih: Fix read sense data size Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 21/58] usb: dwc3: pci: Fix reference count leak in dwc3_pci_resume_work Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 22/58] block: release bip in a right way in error path Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 23/58] nvme-rdma: assign completion vector correctly Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 24/58] x86/entry: Increase entry_stack size to a full page Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 25/58] net: qrtr: Fix an out of bounds read qrtr_endpoint_post() Greg Kroah-Hartman
2020-07-14 18:43 ` [PATCH 4.19 26/58] drm/mediatek: Check plane visibility in atomic_update Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 27/58] net: cxgb4: fix return error value in t4_prep_fw Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 28/58] smsc95xx: check return value of smsc95xx_reset Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 29/58] smsc95xx: avoid memory leak in smsc95xx_bind Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 30/58] net: hns3: fix use-after-free when doing self test Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 31/58] ALSA: compress: fix partial_drain completion state Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 32/58] arm64: kgdb: Fix single-step exception handling oops Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 33/58] nbd: Fix memory leak in nbd_add_socket Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 34/58] cxgb4: fix all-mask IP address comparison Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 35/58] bnxt_en: fix NULL dereference in case SR-IOV configuration fails Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 36/58] net: macb: mark device wake capable when "magic-packet" property present Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 37/58] mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 38/58] ALSA: opl3: fix infoleak in opl3 Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 39/58] ALSA: hda - let hs_mic be picked ahead of hp_mic Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 40/58] ALSA: usb-audio: add quirk for MacroSilicon MS2109 Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 41/58] KVM: arm64: Fix definition of PAGE_HYP_DEVICE Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 42/58] KVM: arm64: Stop clobbering x0 for HVC_SOFT_RESTART Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 43/58] KVM: x86: bit 8 of non-leaf PDPEs is not reserved Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 44/58] KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit mode Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 45/58] KVM: x86: Mark CR4.TSD as being possibly owned by the guest Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 46/58] kallsyms: Refactor kallsyms_show_value() to take cred Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 47/58] kernel: module: Use struct_size() helper Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 48/58] module: Refactor section attr into bin attribute Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 49/58] module: Do not expose section addresses to non-CAP_SYSLOG Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 50/58] kprobes: Do not expose probe " Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 51/58] bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 52/58] Revert "ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb" Greg Kroah-Hartman
2020-07-14 18:44 ` Greg Kroah-Hartman [this message]
2020-07-14 18:44 ` [PATCH 4.19 54/58] drm/radeon: fix double free Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 55/58] dm: use noio when sending kobject event Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 56/58] ARC: entry: fix potential EFA clobber when TIF_SYSCALL_TRACE Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 57/58] ARC: elf: use right ELF_ARCH Greg Kroah-Hartman
2020-07-14 18:44 ` [PATCH 4.19 58/58] s390/mm: fix huge pte soft dirty copying Greg Kroah-Hartman
2020-07-15  9:41 ` [PATCH 4.19 00/58] 4.19.133-rc1 review Naresh Kamboju
2020-07-15 12:39   ` Greg Kroah-Hartman
2020-07-15 10:49 ` Jon Hunter
2020-07-15 10:49   ` Jon Hunter
2020-07-15 15:21 ` Shuah Khan
2020-07-15 16:42 ` Guenter Roeck
2020-07-16  7:45 ` Pavel Machek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200714184058.793401263@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=boris@bur.io \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.