From: Chris Mason
To: "gregkh@linuxfoundation.org"
CC: "dsterba@suse.com", "stable@vger.kernel.org"
Subject: Re: FAILED: patch "[PATCH] Btrfs: don't clean dirty pages during buffered writes" failed to apply to 4.14-stable tree
Date: Mon, 12 Nov 2018 21:46:23 +0000
Message-ID: <42748700-6ABF-4308-98F9-86FC4DC642CF@fb.com>
In-Reply-To: <1541967508228252@kroah.com>
References: <1541967508228252@kroah.com>

On 11 Nov 2018, at 12:18, gregkh@linuxfoundation.org wrote:

> The patch below does not apply to the 4.14-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git commit
> id to <stable@vger.kernel.org>.

This is a tiny race window that's been broken since day one in btrfs.  I'm
fine with leaving the fix out of older kernels, but Dave, if you want this
backported, just let me know.

-chris

>
> thanks,
>
> greg k-h
>
> ------------------ original commit in Linus's tree ------------------
>
> From 7703bdd8d23e6ef057af3253958a793ec6066b28 Mon Sep 17 00:00:00 2001
> From: Chris Mason
> Date: Wed, 20 Jun 2018 07:56:11 -0700
> Subject: [PATCH] Btrfs: don't clean dirty pages during buffered writes
>
> During buffered writes, we follow this basic series of steps:
>
> again:
>     lock all the pages
>     wait for writeback on all the pages
>     take the extent range lock
>     wait for ordered extents on the whole range
>     clean all the pages
>
>     if (copy_from_user_in_atomic() hits a fault) {
>         drop our locks
>         goto again;
>     }
>
>     dirty all the pages
>     release all the locks
>
> The extra waiting, cleaning and locking are there to make sure we don't
> modify pages in flight to the drive, after they've been crc'd.
>
> If some of the pages in the range were already dirty when the write
> began, and we need to goto again, we create a window where a dirty page
> has been cleaned and unlocked.  It may be reclaimed before we're able to
> lock it again, which means we'll read the old contents off the drive and
> lose any modifications that had been pending writeback.
>
> We don't actually need to clean the pages.  All of the other locking in
> place makes sure we don't start IO on the pages, so we can just leave
> them dirty for the duration of the write.
>
> Fixes: 73d59314e6ed (the original btrfs merge)
> CC: stable@vger.kernel.org # v4.4+
> Signed-off-by: Chris Mason
> Reviewed-by: David Sterba
> Signed-off-by: David Sterba
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index d254cf94545f..15b925142793 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -531,6 +531,14 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
>
>  	end_of_last_block = start_pos + num_bytes - 1;
>
> +	/*
> +	 * The pages may have already been dirty, clear out old accounting so
> +	 * we can set things up properly
> +	 */
> +	clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, end_of_last_block,
> +			 EXTENT_DIRTY | EXTENT_DELALLOC |
> +			 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0, cached);
> +
>  	if (!btrfs_is_free_space_inode(BTRFS_I(inode))) {
>  		if (start_pos >= isize &&
>  		    !(BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)) {
> @@ -1500,18 +1508,27 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
>  		}
>  		if (ordered)
>  			btrfs_put_ordered_extent(ordered);
> -		clear_extent_bit(&inode->io_tree, start_pos, last_pos,
> -				 EXTENT_DIRTY | EXTENT_DELALLOC |
> -				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
> -				 0, 0, cached_state);
> +
>  		*lockstart = start_pos;
>  		*lockend = last_pos;
>  		ret = 1;
>  	}
>
> +	/*
> +	 * It's possible the pages are dirty right now, but we don't want
> +	 * to clean them yet because copy_from_user may catch a page fault
> +	 * and we might have to fall back to one page at a time.  If that
> +	 * happens, we'll unlock these pages and we'd have a window where
> +	 * reclaim could sneak in and drop the once-dirty page on the floor
> +	 * without writing it.
> +	 *
> +	 * We have the pages locked and the extent range locked, so there's
> +	 * no way someone can start IO on any dirty pages in this range.
> +	 *
> +	 * We'll call btrfs_dirty_pages() later on, and that will flip around
> +	 * delalloc bits and dirty the pages as required.
> +	 */
>  	for (i = 0; i < num_pages; i++) {
> -		if (clear_page_dirty_for_io(pages[i]))
> -			account_page_redirty(pages[i]);
>  		set_page_extent_mapped(pages[i]);
>  		WARN_ON(!PageLocked(pages[i]));
>  	}
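
For anyone who wants to see the race in isolation, below is a minimal
standalone C sketch of the scenario the commit message describes.  It is an
illustration only, not btrfs code: struct fake_page, maybe_reclaim() and the
two buffered_write_*() helpers are invented names, and "reclaim" here simply
stands in for the VM dropping a clean, unlocked page.

/*
 * Simplified userspace model of the race -- not kernel code.
 * A "page" is just a dirty flag plus contents; maybe_reclaim() models the
 * VM dropping a clean page so the next read of it comes back from disk.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct fake_page {
	bool dirty;            /* PageDirty() analogue */
	char contents[16];     /* modification still pending writeback */
};

static void maybe_reclaim(struct fake_page *page, const char *on_disk)
{
	if (!page->dirty) {
		/* clean pages can be dropped; rereading returns disk data */
		memset(page->contents, 0, sizeof(page->contents));
		strncpy(page->contents, on_disk, sizeof(page->contents) - 1);
	}
}

/* Old flow: clean the page up front, then fault in copy_from_user and retry. */
static void buffered_write_old(struct fake_page *page, const char *on_disk)
{
	page->dirty = false;           /* the clear_page_dirty_for_io() step */
	/* fault -> unlock pages and goto again; page is now clean + unlocked */
	maybe_reclaim(page, on_disk);  /* the window: pending data is lost */
}

/* New flow: leave the page dirty; the locks already keep IO from starting. */
static void buffered_write_new(struct fake_page *page, const char *on_disk)
{
	maybe_reclaim(page, on_disk);  /* no-op: the page is still dirty */
}

int main(void)
{
	const char *on_disk = "old data";
	struct fake_page p = { .dirty = true, .contents = "pending write" };
	struct fake_page q = { .dirty = true, .contents = "pending write" };

	buffered_write_old(&p, on_disk);
	buffered_write_new(&q, on_disk);
	printf("old flow after fault+retry: \"%s\"\n", p.contents);
	printf("new flow after fault+retry: \"%s\"\n", q.contents);
	return 0;
}

Built with any C compiler, the old flow ends up with the stale on-disk
contents while the new flow still has the pending modification, which is
exactly the difference the patch is after.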