All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Maxim Levitsky <mlevitsk@redhat.com>,
	qemu-devel@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
	Peter Lieven <pl@kamp.de>, Paolo Bonzini <pbonzini@redhat.com>,
	Max Reitz <mreitz@redhat.com>,
	qemu-block@nongnu.org, linux-block@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
Date: Thu, 12 Nov 2020 14:08:38 -0800	[thread overview]
Message-ID: <20201112220838.GD9699@magnolia> (raw)
In-Reply-To: <20201112120056.GC27697@quack2.suse.cz>

On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> 
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files.  This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Christoph Hellwig <hch@infradead.org>,
	qemu-block@nongnu.org, Peter Lieven <pl@kamp.de>,
	qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>,
	linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Maxim Levitsky <mlevitsk@redhat.com>
Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
Date: Thu, 12 Nov 2020 14:08:38 -0800	[thread overview]
Message-ID: <20201112220838.GD9699@magnolia> (raw)
In-Reply-To: <20201112120056.GC27697@quack2.suse.cz>

On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> 
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files.  This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


  reply	other threads:[~2020-11-12 22:09 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
2020-11-16 14:48   ` Kevin Wolf
2021-01-07 12:44     ` Maxim Levitsky
2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
2020-11-12 12:40   ` Peter Lieven
2020-11-12 13:45     ` Eric Blake
2020-11-12 15:04       ` Maxim Levitsky
2021-03-18  9:57         ` Maxim Levitsky
2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 11:19   ` Jan Kara
2020-11-12 11:19     ` Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 12:00       ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong [this message]
2020-11-12 22:08         ` Darrick J. Wong
2020-12-07 17:23         ` Maxim Levitsky
2020-12-07 17:23           ` Maxim Levitsky
2021-01-07 12:40           ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky
2021-01-07 12:42             ` Maxim Levitsky
2021-01-07 15:37             ` Jan Kara
2020-11-12 15:38     ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 15:38       ` Maxim Levitsky
2020-11-13 10:07       ` Jan Kara
2020-11-13 10:07         ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201112220838.GD9699@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=kwolf@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.