* [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration @ 2018-04-19 7:52 Stefan Hajnoczi 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-19 7:52 UTC (permalink / raw) To: qemu-devel Cc: Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Stefan Hajnoczi file-posix.c only supports shared storage live migration with -drive cache.direct=off due to cache consistency issues. There are two main shared storage configurations: files on NFS and host block devices on SAN LUNs. The problem is that QEMU starts on the destination host before the source host has written everything out to the disk. The page cache on the destination host may contain stale data read when QEMU opened the image file (before migration handover). Using O_DIRECT avoids this problem but prevents users from taking advantage of the host page cache. Although cache=none is the recommended setting for virtualization use cases, there are scenarios where cache=writeback makes sense. If the guest has much less RAM than the host or many guests share the same backing file, then the host page cache can significantly improve disk I/O performance. This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c on Linux so that shared storage live migration works. I have sent it as an RFC because cache consistency is not binary, there are corner cases which I've described in the actual patch, and this may require more discussion. Regarding NFS, QEMU relies on O_DIRECT rather than the close-to-open consistency model (see nfs(5)), which is the basic guarantee provided by NFS. After this patch cache consistency is no longer provided by O_DIRECT. This patch series relies on fdatasync(2) (source) + posix_fadvise(POSIX_FADV_DONTNEED) (destination) instead. I believe it is safe for both NFS and SAN LUNs. Maybe we should use fsync(2) instead of fdatasync(2) so that NFS has up-to-date inode metadata? Stefan Hajnoczi (2): block/file-posix: implement bdrv_co_invalidate_cache() on Linux block/file-posix: verify page cache is not used block/file-posix.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 110 insertions(+) -- 2.14.3 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-19 7:52 [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Stefan Hajnoczi @ 2018-04-19 7:52 ` Stefan Hajnoczi 2018-04-19 8:13 ` Fam Zheng 2018-04-19 9:18 ` Dr. David Alan Gilbert 2018-04-19 7:52 ` [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used Stefan Hajnoczi 2018-04-19 16:09 ` [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Eric Blake 2 siblings, 2 replies; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-19 7:52 UTC (permalink / raw) To: qemu-devel Cc: Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Stefan Hajnoczi On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use this to drop page cache on the destination host during shared storage migration. This way the destination host will read the latest copy of the data and will not use stale data from the page cache. The flow is as follows: 1. Source host writes out all dirty pages and inactivates drives. 2. QEMU_VM_EOF is sent on migration stream. 3. Destination host invalidates caches before accessing drives. This patch enables live migration even with -drive cache.direct=off. * Terms and conditions may apply, please see patch for details. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/block/file-posix.c b/block/file-posix.c index 3794c0007a..df4f52919f 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, return ret | BDRV_BLOCK_OFFSET_VALID; } +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, + Error **errp) +{ + BDRVRawState *s = bs->opaque; + int ret; + + ret = fd_open(bs); + if (ret < 0) { + error_setg_errno(errp, -ret, "The file descriptor is not open"); + return; + } + + if (s->open_flags & O_DIRECT) { + return; /* No host kernel page cache */ + } + +#if defined(__linux__) + /* This sets the scene for the next syscall... */ + ret = bdrv_co_flush(bs); + if (ret < 0) { + error_setg_errno(errp, -ret, "flush failed"); + return; + } + + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a + * process. These limitations are okay because we just fsynced the file, + * we don't use mmap, and the file should not be in use by other processes. + */ + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); + if (ret != 0) { /* the return value is a positive errno */ + error_setg_errno(errp, ret, "fadvise failed"); + return; + } +#endif /* __linux__ */ +} + static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, BlockCompletionFunc *cb, void *opaque) @@ -2328,6 +2364,7 @@ BlockDriver bdrv_file = { .bdrv_co_create_opts = raw_co_create_opts, .bdrv_has_zero_init = bdrv_has_zero_init_1, .bdrv_co_block_status = raw_co_block_status, + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes, .bdrv_co_preadv = raw_co_preadv, @@ -2805,6 +2842,7 @@ static BlockDriver bdrv_host_device = { .bdrv_reopen_abort = raw_reopen_abort, .bdrv_co_create_opts = hdev_co_create_opts, .create_opts = &raw_create_opts, + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes, .bdrv_co_preadv = raw_co_preadv, @@ -2927,6 +2965,7 @@ static BlockDriver bdrv_host_cdrom = { .bdrv_reopen_abort = raw_reopen_abort, .bdrv_co_create_opts = hdev_co_create_opts, .create_opts = &raw_create_opts, + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, .bdrv_co_preadv = raw_co_preadv, -- 2.14.3 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi @ 2018-04-19 8:13 ` Fam Zheng 2018-04-20 3:15 ` Stefan Hajnoczi 2018-04-19 9:18 ` Dr. David Alan Gilbert 1 sibling, 1 reply; 20+ messages in thread From: Fam Zheng @ 2018-04-19 8:13 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz On Thu, 04/19 15:52, Stefan Hajnoczi wrote: > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > this to drop page cache on the destination host during shared storage > migration. This way the destination host will read the latest copy of > the data and will not use stale data from the page cache. > > The flow is as follows: > > 1. Source host writes out all dirty pages and inactivates drives. > 2. QEMU_VM_EOF is sent on migration stream. > 3. Destination host invalidates caches before accessing drives. > > This patch enables live migration even with -drive cache.direct=off. > > * Terms and conditions may apply, please see patch for details. > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 3794c0007a..df4f52919f 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > return ret | BDRV_BLOCK_OFFSET_VALID; > } > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > + Error **errp) > +{ > + BDRVRawState *s = bs->opaque; > + int ret; > + > + ret = fd_open(bs); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > + return; > + } > + > + if (s->open_flags & O_DIRECT) { > + return; /* No host kernel page cache */ > + } > + > +#if defined(__linux__) > + /* This sets the scene for the next syscall... */ > + ret = bdrv_co_flush(bs); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "flush failed"); > + return; > + } > + > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > + * process. These limitations are okay because we just fsynced the file, > + * we don't use mmap, and the file should not be in use by other processes. > + */ > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > + if (ret != 0) { /* the return value is a positive errno */ > + error_setg_errno(errp, ret, "fadvise failed"); > + return; > + } > +#endif /* __linux__ */ What about the #else branch? It doesn't automatically work, I guess? Fam > +} > + > static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs, > int64_t offset, int bytes, > BlockCompletionFunc *cb, void *opaque) > @@ -2328,6 +2364,7 @@ BlockDriver bdrv_file = { > .bdrv_co_create_opts = raw_co_create_opts, > .bdrv_has_zero_init = bdrv_has_zero_init_1, > .bdrv_co_block_status = raw_co_block_status, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes, > > .bdrv_co_preadv = raw_co_preadv, > @@ -2805,6 +2842,7 @@ static BlockDriver bdrv_host_device = { > .bdrv_reopen_abort = raw_reopen_abort, > .bdrv_co_create_opts = hdev_co_create_opts, > .create_opts = &raw_create_opts, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes, > > .bdrv_co_preadv = raw_co_preadv, > @@ -2927,6 +2965,7 @@ static BlockDriver bdrv_host_cdrom = { > .bdrv_reopen_abort = raw_reopen_abort, > .bdrv_co_create_opts = hdev_co_create_opts, > .create_opts = &raw_create_opts, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > > > .bdrv_co_preadv = raw_co_preadv, > -- > 2.14.3 > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-19 8:13 ` Fam Zheng @ 2018-04-20 3:15 ` Stefan Hajnoczi 2018-04-20 3:36 ` Fam Zheng 2018-04-20 6:13 ` Kevin Wolf 0 siblings, 2 replies; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-20 3:15 UTC (permalink / raw) To: Fam Zheng Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz [-- Attachment #1: Type: text/plain, Size: 3065 bytes --] On Thu, Apr 19, 2018 at 04:13:44PM +0800, Fam Zheng wrote: > On Thu, 04/19 15:52, Stefan Hajnoczi wrote: > > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > > this to drop page cache on the destination host during shared storage > > migration. This way the destination host will read the latest copy of > > the data and will not use stale data from the page cache. > > > > The flow is as follows: > > > > 1. Source host writes out all dirty pages and inactivates drives. > > 2. QEMU_VM_EOF is sent on migration stream. > > 3. Destination host invalidates caches before accessing drives. > > > > This patch enables live migration even with -drive cache.direct=off. > > > > * Terms and conditions may apply, please see patch for details. > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 39 insertions(+) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 3794c0007a..df4f52919f 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > return ret | BDRV_BLOCK_OFFSET_VALID; > > } > > > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > > + Error **errp) > > +{ > > + BDRVRawState *s = bs->opaque; > > + int ret; > > + > > + ret = fd_open(bs); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > > + return; > > + } > > + > > + if (s->open_flags & O_DIRECT) { > > + return; /* No host kernel page cache */ > > + } > > + > > +#if defined(__linux__) > > + /* This sets the scene for the next syscall... */ > > + ret = bdrv_co_flush(bs); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "flush failed"); > > + return; > > + } > > + > > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > > + * process. These limitations are okay because we just fsynced the file, > > + * we don't use mmap, and the file should not be in use by other processes. > > + */ > > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > > + if (ret != 0) { /* the return value is a positive errno */ > > + error_setg_errno(errp, ret, "fadvise failed"); > > + return; > > + } > > +#endif /* __linux__ */ > > What about the #else branch? It doesn't automatically work, I guess? Right, no error is reported. This is existing QEMU behavior. If we want to change behavior then it must be done consistently (i.e. by auditing the other block drivers) and we need to be prepared for bug reports (just like file locking, it may expose interesting use cases that we cannot easily dismiss as wrong). I didn't want to go there. If there is consensus then I will change the behavior. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-20 3:15 ` Stefan Hajnoczi @ 2018-04-20 3:36 ` Fam Zheng 2018-04-20 6:13 ` Kevin Wolf 1 sibling, 0 replies; 20+ messages in thread From: Fam Zheng @ 2018-04-20 3:36 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz On Fri, 04/20 11:15, Stefan Hajnoczi wrote: > On Thu, Apr 19, 2018 at 04:13:44PM +0800, Fam Zheng wrote: > > On Thu, 04/19 15:52, Stefan Hajnoczi wrote: > > > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > > > this to drop page cache on the destination host during shared storage > > > migration. This way the destination host will read the latest copy of > > > the data and will not use stale data from the page cache. > > > > > > The flow is as follows: > > > > > > 1. Source host writes out all dirty pages and inactivates drives. > > > 2. QEMU_VM_EOF is sent on migration stream. > > > 3. Destination host invalidates caches before accessing drives. > > > > > > This patch enables live migration even with -drive cache.direct=off. > > > > > > * Terms and conditions may apply, please see patch for details. > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 39 insertions(+) > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > index 3794c0007a..df4f52919f 100644 > > > --- a/block/file-posix.c > > > +++ b/block/file-posix.c > > > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > > return ret | BDRV_BLOCK_OFFSET_VALID; > > > } > > > > > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > > > + Error **errp) > > > +{ > > > + BDRVRawState *s = bs->opaque; > > > + int ret; > > > + > > > + ret = fd_open(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > > > + return; > > > + } > > > + > > > + if (s->open_flags & O_DIRECT) { > > > + return; /* No host kernel page cache */ > > > + } > > > + > > > +#if defined(__linux__) > > > + /* This sets the scene for the next syscall... */ > > > + ret = bdrv_co_flush(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "flush failed"); > > > + return; > > > + } > > > + > > > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > > > + * process. These limitations are okay because we just fsynced the file, > > > + * we don't use mmap, and the file should not be in use by other processes. > > > + */ > > > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > > > + if (ret != 0) { /* the return value is a positive errno */ > > > + error_setg_errno(errp, ret, "fadvise failed"); > > > + return; > > > + } > > > +#endif /* __linux__ */ > > > > What about the #else branch? It doesn't automatically work, I guess? > > Right, no error is reported. This is existing QEMU behavior. > > If we want to change behavior then it must be done consistently (i.e. by > auditing the other block drivers) and we need to be prepared for bug > reports (just like file locking, it may expose interesting use cases > that we cannot easily dismiss as wrong). I didn't want to go there. > > If there is consensus then I will change the behavior. No need to change behavior (reporting error), at least not in this patch. But a #else /* TODO: ... */ #endif to remember adding similar code to invalidate system cache on other *nix systems cannot hurt. E.g BSDes have posix_fadvise() too, though I have no idea if POSIX_FADV_DONTNEED works the same. (I'm sure there are some tricks on Windows to but do we care? :) Fam ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-20 3:15 ` Stefan Hajnoczi 2018-04-20 3:36 ` Fam Zheng @ 2018-04-20 6:13 ` Kevin Wolf 1 sibling, 0 replies; 20+ messages in thread From: Kevin Wolf @ 2018-04-20 6:13 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Fam Zheng, qemu-devel, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz [-- Attachment #1: Type: text/plain, Size: 3894 bytes --] Am 20.04.2018 um 05:15 hat Stefan Hajnoczi geschrieben: > On Thu, Apr 19, 2018 at 04:13:44PM +0800, Fam Zheng wrote: > > On Thu, 04/19 15:52, Stefan Hajnoczi wrote: > > > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > > > this to drop page cache on the destination host during shared storage > > > migration. This way the destination host will read the latest copy of > > > the data and will not use stale data from the page cache. > > > > > > The flow is as follows: > > > > > > 1. Source host writes out all dirty pages and inactivates drives. > > > 2. QEMU_VM_EOF is sent on migration stream. > > > 3. Destination host invalidates caches before accessing drives. > > > > > > This patch enables live migration even with -drive cache.direct=off. > > > > > > * Terms and conditions may apply, please see patch for details. > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 39 insertions(+) > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > index 3794c0007a..df4f52919f 100644 > > > --- a/block/file-posix.c > > > +++ b/block/file-posix.c > > > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > > return ret | BDRV_BLOCK_OFFSET_VALID; > > > } > > > > > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > > > + Error **errp) > > > +{ > > > + BDRVRawState *s = bs->opaque; > > > + int ret; > > > + > > > + ret = fd_open(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > > > + return; > > > + } > > > + > > > + if (s->open_flags & O_DIRECT) { > > > + return; /* No host kernel page cache */ > > > + } > > > + > > > +#if defined(__linux__) > > > + /* This sets the scene for the next syscall... */ > > > + ret = bdrv_co_flush(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "flush failed"); > > > + return; > > > + } > > > + > > > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > > > + * process. These limitations are okay because we just fsynced the file, > > > + * we don't use mmap, and the file should not be in use by other processes. > > > + */ > > > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > > > + if (ret != 0) { /* the return value is a positive errno */ > > > + error_setg_errno(errp, ret, "fadvise failed"); > > > + return; > > > + } > > > +#endif /* __linux__ */ > > > > What about the #else branch? It doesn't automatically work, I guess? > > Right, no error is reported. This is existing QEMU behavior. > > If we want to change behavior then it must be done consistently (i.e. by > auditing the other block drivers) and we need to be prepared for bug > reports (just like file locking, it may expose interesting use cases > that we cannot easily dismiss as wrong). I didn't want to go there. > > If there is consensus then I will change the behavior. I think either way that would be for a separate patch. I'm also not sure how useful that change would actually be because it might give you a false sense of safety: Even with this patch, you still need to be exactly aware of the conditions that make live migration with shared storage work correctly. If we error out on some unsafe cases, but not on others, this might be confusing. On the other hand, the problematic image format drivers have been setting migration blockers for a long time, so you could also argue that file-posix is inconsistent with them because it completely ignores unsafe scenarios. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi 2018-04-19 8:13 ` Fam Zheng @ 2018-04-19 9:18 ` Dr. David Alan Gilbert 2018-04-20 3:21 ` Stefan Hajnoczi 1 sibling, 1 reply; 20+ messages in thread From: Dr. David Alan Gilbert @ 2018-04-19 9:18 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block * Stefan Hajnoczi (stefanha@redhat.com) wrote: > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > this to drop page cache on the destination host during shared storage > migration. This way the destination host will read the latest copy of > the data and will not use stale data from the page cache. > > The flow is as follows: > > 1. Source host writes out all dirty pages and inactivates drives. > 2. QEMU_VM_EOF is sent on migration stream. > 3. Destination host invalidates caches before accessing drives. > > This patch enables live migration even with -drive cache.direct=off. > > * Terms and conditions may apply, please see patch for details. > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 3794c0007a..df4f52919f 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > return ret | BDRV_BLOCK_OFFSET_VALID; > } > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > + Error **errp) > +{ > + BDRVRawState *s = bs->opaque; > + int ret; > + > + ret = fd_open(bs); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > + return; > + } > + > + if (s->open_flags & O_DIRECT) { > + return; /* No host kernel page cache */ > + } > + > +#if defined(__linux__) > + /* This sets the scene for the next syscall... */ > + ret = bdrv_co_flush(bs); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "flush failed"); > + return; > + } > + > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > + * process. These limitations are okay because we just fsynced the file, > + * we don't use mmap, and the file should not be in use by other processes. > + */ > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); What happens if I try a migrate between two qemu's on the same host? (Which I, and avocado, both use for testing; I think think users occasionally do for QEMU updates). Dave > + if (ret != 0) { /* the return value is a positive errno */ > + error_setg_errno(errp, ret, "fadvise failed"); > + return; > + } > +#endif /* __linux__ */ > +} > + > static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs, > int64_t offset, int bytes, > BlockCompletionFunc *cb, void *opaque) > @@ -2328,6 +2364,7 @@ BlockDriver bdrv_file = { > .bdrv_co_create_opts = raw_co_create_opts, > .bdrv_has_zero_init = bdrv_has_zero_init_1, > .bdrv_co_block_status = raw_co_block_status, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes, > > .bdrv_co_preadv = raw_co_preadv, > @@ -2805,6 +2842,7 @@ static BlockDriver bdrv_host_device = { > .bdrv_reopen_abort = raw_reopen_abort, > .bdrv_co_create_opts = hdev_co_create_opts, > .create_opts = &raw_create_opts, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes, > > .bdrv_co_preadv = raw_co_preadv, > @@ -2927,6 +2965,7 @@ static BlockDriver bdrv_host_cdrom = { > .bdrv_reopen_abort = raw_reopen_abort, > .bdrv_co_create_opts = hdev_co_create_opts, > .create_opts = &raw_create_opts, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > > > .bdrv_co_preadv = raw_co_preadv, > -- > 2.14.3 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-19 9:18 ` Dr. David Alan Gilbert @ 2018-04-20 3:21 ` Stefan Hajnoczi 2018-04-20 6:27 ` Kevin Wolf 0 siblings, 1 reply; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-20 3:21 UTC (permalink / raw) To: Dr. David Alan Gilbert Cc: qemu-devel, Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 3231 bytes --] On Thu, Apr 19, 2018 at 10:18:33AM +0100, Dr. David Alan Gilbert wrote: > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > > this to drop page cache on the destination host during shared storage > > migration. This way the destination host will read the latest copy of > > the data and will not use stale data from the page cache. > > > > The flow is as follows: > > > > 1. Source host writes out all dirty pages and inactivates drives. > > 2. QEMU_VM_EOF is sent on migration stream. > > 3. Destination host invalidates caches before accessing drives. > > > > This patch enables live migration even with -drive cache.direct=off. > > > > * Terms and conditions may apply, please see patch for details. > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 39 insertions(+) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 3794c0007a..df4f52919f 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > return ret | BDRV_BLOCK_OFFSET_VALID; > > } > > > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > > + Error **errp) > > +{ > > + BDRVRawState *s = bs->opaque; > > + int ret; > > + > > + ret = fd_open(bs); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > > + return; > > + } > > + > > + if (s->open_flags & O_DIRECT) { > > + return; /* No host kernel page cache */ > > + } > > + > > +#if defined(__linux__) > > + /* This sets the scene for the next syscall... */ > > + ret = bdrv_co_flush(bs); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "flush failed"); > > + return; > > + } > > + > > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > > + * process. These limitations are okay because we just fsynced the file, > > + * we don't use mmap, and the file should not be in use by other processes. > > + */ > > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > > What happens if I try a migrate between two qemu's on the same host? > (Which I, and avocado, both use for testing; I think think users > occasionally do for QEMU updates). The steps quoted from the commit description: 1. Source host writes out all dirty pages and inactivates drives. 2. QEMU_VM_EOF is sent on migration stream. 3. Destination host invalidates caches before accessing drives. When we reach Step 3 the source QEMU is not doing I/O (no pages are locked). The destination QEMU does bdrv_co_flush() so even if pages are still dirty (that shouldn't happen since the source already drained and flushed) they will be written out and pages will be clean. Therefore fadvise really invalidates all resident pages. FWIW when writing this patch I tested with both QEMUs on the same host. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux 2018-04-20 3:21 ` Stefan Hajnoczi @ 2018-04-20 6:27 ` Kevin Wolf 0 siblings, 0 replies; 20+ messages in thread From: Kevin Wolf @ 2018-04-20 6:27 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Dr. David Alan Gilbert, qemu-devel, Max Reitz, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 3613 bytes --] Am 20.04.2018 um 05:21 hat Stefan Hajnoczi geschrieben: > On Thu, Apr 19, 2018 at 10:18:33AM +0100, Dr. David Alan Gilbert wrote: > > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > > On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use > > > this to drop page cache on the destination host during shared storage > > > migration. This way the destination host will read the latest copy of > > > the data and will not use stale data from the page cache. > > > > > > The flow is as follows: > > > > > > 1. Source host writes out all dirty pages and inactivates drives. > > > 2. QEMU_VM_EOF is sent on migration stream. > > > 3. Destination host invalidates caches before accessing drives. > > > > > > This patch enables live migration even with -drive cache.direct=off. > > > > > > * Terms and conditions may apply, please see patch for details. > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > block/file-posix.c | 39 +++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 39 insertions(+) > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > index 3794c0007a..df4f52919f 100644 > > > --- a/block/file-posix.c > > > +++ b/block/file-posix.c > > > @@ -2236,6 +2236,42 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > > return ret | BDRV_BLOCK_OFFSET_VALID; > > > } > > > > > > +static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > > > + Error **errp) > > > +{ > > > + BDRVRawState *s = bs->opaque; > > > + int ret; > > > + > > > + ret = fd_open(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "The file descriptor is not open"); > > > + return; > > > + } > > > + > > > + if (s->open_flags & O_DIRECT) { > > > + return; /* No host kernel page cache */ > > > + } > > > + > > > +#if defined(__linux__) > > > + /* This sets the scene for the next syscall... */ > > > + ret = bdrv_co_flush(bs); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "flush failed"); > > > + return; > > > + } > > > + > > > + /* Linux does not invalidate pages that are dirty, locked, or mmapped by a > > > + * process. These limitations are okay because we just fsynced the file, > > > + * we don't use mmap, and the file should not be in use by other processes. > > > + */ > > > + ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED); > > > > What happens if I try a migrate between two qemu's on the same host? > > (Which I, and avocado, both use for testing; I think think users > > occasionally do for QEMU updates). > > The steps quoted from the commit description: > > 1. Source host writes out all dirty pages and inactivates drives. > 2. QEMU_VM_EOF is sent on migration stream. > 3. Destination host invalidates caches before accessing drives. > > When we reach Step 3 the source QEMU is not doing I/O (no pages are > locked). The destination QEMU does bdrv_co_flush() so even if pages are > still dirty (that shouldn't happen since the source already drained and > flushed) they will be written out and pages will be clean. Therefore > fadvise really invalidates all resident pages. > > FWIW when writing this patch I tested with both QEMUs on the same host. Which is actually unnecessary overhead on localhost because the local kernel page cache can't be incoherent with itself. But I don't think it's a real problem either. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-19 7:52 [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Stefan Hajnoczi 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi @ 2018-04-19 7:52 ` Stefan Hajnoczi 2018-04-19 9:05 ` Dr. David Alan Gilbert 2018-04-19 16:09 ` [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Eric Blake 2 siblings, 1 reply; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-19 7:52 UTC (permalink / raw) To: qemu-devel Cc: Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Stefan Hajnoczi This commit is for debugging only. Do not merge it. mincore(2) checks whether pages are resident. Use it to verify that page cache has been dropped. You can trigger a verification failure by mmapping the image file from another process and loading a byte from a page so that it becomes resident. bdrv_co_invalidate_cache() will fail while the process is alive. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- block/file-posix.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) diff --git a/block/file-posix.c b/block/file-posix.c index df4f52919f..d3105269c6 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2236,6 +2236,75 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, return ret | BDRV_BLOCK_OFFSET_VALID; } +static bool is_mincore(void *addr, size_t length) +{ + size_t vec_len = DIV_ROUND_UP(length, sysconf(_SC_PAGESIZE)); + unsigned char *vec; + size_t i; + int ret; + bool incore = false; + + vec = g_malloc(vec_len); + ret = mincore(addr, length, vec); + if (ret < 0) { + incore = true; + goto out; + } + + for (i = 0; i < vec_len; i++) { + if (vec[i] & 0x1) { + incore = true; + break; + } + } + +out: + g_free(vec); + return incore; +} + +static void check_not_in_page_cache(BlockDriverState *bs, Error **errp) +{ + const size_t WINDOW_SIZE = 128 * 1024 * 1024; + BDRVRawState *s = bs->opaque; + void *window = NULL; + size_t length = 0; + off_t end; + off_t offset; + + end = raw_getlength(bs); + + for (offset = 0; offset < end; offset += WINDOW_SIZE) { + void *new_window; + size_t new_length = MIN(end - offset, WINDOW_SIZE); + + if (new_length != length) { + munmap(window, length); + window = NULL; + length = 0; + } + + new_window = mmap(window, new_length, PROT_NONE, MAP_PRIVATE, + s->fd, offset); + if (new_window == MAP_FAILED) { + error_setg_errno(errp, errno, "mmap failed"); + break; + } + + window = new_window; + length = new_length; + + if (is_mincore(window, length)) { + error_setg(errp, "page cache still in use!"); + break; + } + } + + if (window) { + munmap(window, length); + } +} + static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, Error **errp) { @@ -2270,6 +2339,8 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, return; } #endif /* __linux__ */ + + check_not_in_page_cache(bs, errp); } static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs, -- 2.14.3 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-19 7:52 ` [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used Stefan Hajnoczi @ 2018-04-19 9:05 ` Dr. David Alan Gilbert 2018-04-20 3:02 ` Stefan Hajnoczi 0 siblings, 1 reply; 20+ messages in thread From: Dr. David Alan Gilbert @ 2018-04-19 9:05 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block * Stefan Hajnoczi (stefanha@redhat.com) wrote: > This commit is for debugging only. Do not merge it. > > mincore(2) checks whether pages are resident. Use it to verify that > page cache has been dropped. > > You can trigger a verification failure by mmapping the image file from > another process and loading a byte from a page so that it becomes > resident. bdrv_co_invalidate_cache() will fail while the process is > alive. It doesn't seem a bad diagnostic to keep in (with a switch to activate) for when we're faced with some weird corruption on some weird storage system. Dave > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > block/file-posix.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 71 insertions(+) > > diff --git a/block/file-posix.c b/block/file-posix.c > index df4f52919f..d3105269c6 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2236,6 +2236,75 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > return ret | BDRV_BLOCK_OFFSET_VALID; > } > > +static bool is_mincore(void *addr, size_t length) > +{ > + size_t vec_len = DIV_ROUND_UP(length, sysconf(_SC_PAGESIZE)); > + unsigned char *vec; > + size_t i; > + int ret; > + bool incore = false; > + > + vec = g_malloc(vec_len); > + ret = mincore(addr, length, vec); > + if (ret < 0) { > + incore = true; > + goto out; > + } > + > + for (i = 0; i < vec_len; i++) { > + if (vec[i] & 0x1) { > + incore = true; > + break; > + } > + } > + > +out: > + g_free(vec); > + return incore; > +} > + > +static void check_not_in_page_cache(BlockDriverState *bs, Error **errp) > +{ > + const size_t WINDOW_SIZE = 128 * 1024 * 1024; > + BDRVRawState *s = bs->opaque; > + void *window = NULL; > + size_t length = 0; > + off_t end; > + off_t offset; > + > + end = raw_getlength(bs); > + > + for (offset = 0; offset < end; offset += WINDOW_SIZE) { > + void *new_window; > + size_t new_length = MIN(end - offset, WINDOW_SIZE); > + > + if (new_length != length) { > + munmap(window, length); > + window = NULL; > + length = 0; > + } > + > + new_window = mmap(window, new_length, PROT_NONE, MAP_PRIVATE, > + s->fd, offset); > + if (new_window == MAP_FAILED) { > + error_setg_errno(errp, errno, "mmap failed"); > + break; > + } > + > + window = new_window; > + length = new_length; > + > + if (is_mincore(window, length)) { > + error_setg(errp, "page cache still in use!"); > + break; > + } > + } > + > + if (window) { > + munmap(window, length); > + } > +} > + > static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > Error **errp) > { > @@ -2270,6 +2339,8 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs, > return; > } > #endif /* __linux__ */ > + > + check_not_in_page_cache(bs, errp); > } > > static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs, > -- > 2.14.3 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-19 9:05 ` Dr. David Alan Gilbert @ 2018-04-20 3:02 ` Stefan Hajnoczi 2018-04-20 6:25 ` Kevin Wolf 0 siblings, 1 reply; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-20 3:02 UTC (permalink / raw) To: Dr. David Alan Gilbert Cc: qemu-devel, Max Reitz, Kevin Wolf, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 817 bytes --] On Thu, Apr 19, 2018 at 10:05:47AM +0100, Dr. David Alan Gilbert wrote: > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > This commit is for debugging only. Do not merge it. > > > > mincore(2) checks whether pages are resident. Use it to verify that > > page cache has been dropped. > > > > You can trigger a verification failure by mmapping the image file from > > another process and loading a byte from a page so that it becomes > > resident. bdrv_co_invalidate_cache() will fail while the process is > > alive. > > It doesn't seem a bad diagnostic to keep in (with a switch to activate) > for when we're faced with some weird corruption on some weird storage > system. Okay. It's very slow to mmap an entire image file and query mincore(2) so it needs to be off by default. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-20 3:02 ` Stefan Hajnoczi @ 2018-04-20 6:25 ` Kevin Wolf 2018-04-24 14:04 ` Stefan Hajnoczi 0 siblings, 1 reply; 20+ messages in thread From: Kevin Wolf @ 2018-04-20 6:25 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Dr. David Alan Gilbert, qemu-devel, Max Reitz, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 1474 bytes --] Am 20.04.2018 um 05:02 hat Stefan Hajnoczi geschrieben: > On Thu, Apr 19, 2018 at 10:05:47AM +0100, Dr. David Alan Gilbert wrote: > > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > > This commit is for debugging only. Do not merge it. > > > > > > mincore(2) checks whether pages are resident. Use it to verify that > > > page cache has been dropped. > > > > > > You can trigger a verification failure by mmapping the image file from > > > another process and loading a byte from a page so that it becomes > > > resident. bdrv_co_invalidate_cache() will fail while the process is > > > alive. > > > > It doesn't seem a bad diagnostic to keep in (with a switch to activate) > > for when we're faced with some weird corruption on some weird storage > > system. > > Okay. It's very slow to mmap an entire image file and query mincore(2) > so it needs to be off by default. Also, having it enabled breaks localhost migration at least on tmpfs (which was what I tried out first). I wonder if the kernel would add some way to query whether the "advice" was actually acted upon if we asked. Either with a new function that returns an error if not everything is dropped (basically .bdrv_invalidate_cache on the kernel level), or a function that just queries if any page is allocated (or maybe the address of the first allocated page in a given range) without having to use mincore() and iterating over all the pages in userspace. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-20 6:25 ` Kevin Wolf @ 2018-04-24 14:04 ` Stefan Hajnoczi 2018-04-24 14:29 ` Kevin Wolf 0 siblings, 1 reply; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-24 14:04 UTC (permalink / raw) To: Kevin Wolf Cc: Dr. David Alan Gilbert, qemu-devel, Max Reitz, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 2555 bytes --] On Fri, Apr 20, 2018 at 08:25:13AM +0200, Kevin Wolf wrote: > Am 20.04.2018 um 05:02 hat Stefan Hajnoczi geschrieben: > > On Thu, Apr 19, 2018 at 10:05:47AM +0100, Dr. David Alan Gilbert wrote: > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > > > This commit is for debugging only. Do not merge it. > > > > > > > > mincore(2) checks whether pages are resident. Use it to verify that > > > > page cache has been dropped. > > > > > > > > You can trigger a verification failure by mmapping the image file from > > > > another process and loading a byte from a page so that it becomes > > > > resident. bdrv_co_invalidate_cache() will fail while the process is > > > > alive. > > > > > > It doesn't seem a bad diagnostic to keep in (with a switch to activate) > > > for when we're faced with some weird corruption on some weird storage > > > system. > > > > Okay. It's very slow to mmap an entire image file and query mincore(2) > > so it needs to be off by default. > > Also, having it enabled breaks localhost migration at least on tmpfs > (which was what I tried out first). > > I wonder if the kernel would add some way to query whether the "advice" > was actually acted upon if we asked. Either with a new function that > returns an error if not everything is dropped (basically > .bdrv_invalidate_cache on the kernel level), or a function that just > queries if any page is allocated (or maybe the address of the first > allocated page in a given range) without having to use mincore() and > iterating over all the pages in userspace. I'm trying to figure out how to expose the optional mincore check on the command-line/QMP: 1. Add a check_consistency bool argument to bdrv_invalidate_cache*(). Add command-line/QMP option to -incoming and migrate_incoming. This is messy and won't be easy to access for libvirt users. 2. Add a BlockdevOptionsFile *check-cache-consistency bool field. This is specified at .bdrv_open() time. It can be changed at runtime with .bdrv_reopen*(). 3. Add a 'blockdev-check-cache-consistency' QMP command that calls a new .bdrv_check_cache_consistency() callback that is implemented by file-posix.c. The problem is users might issue this command after I/O has resumed and pages have become resident again. It only makes sense if the guest is still paused. Probably a bad interface... Have I missed a good way to expose this optional check functionality? Which approach do you prefer? I'm leaning towards #2. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-24 14:04 ` Stefan Hajnoczi @ 2018-04-24 14:29 ` Kevin Wolf 2018-04-27 10:06 ` Stefan Hajnoczi 0 siblings, 1 reply; 20+ messages in thread From: Kevin Wolf @ 2018-04-24 14:29 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Dr. David Alan Gilbert, qemu-devel, Max Reitz, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 2925 bytes --] Am 24.04.2018 um 16:04 hat Stefan Hajnoczi geschrieben: > On Fri, Apr 20, 2018 at 08:25:13AM +0200, Kevin Wolf wrote: > > Am 20.04.2018 um 05:02 hat Stefan Hajnoczi geschrieben: > > > On Thu, Apr 19, 2018 at 10:05:47AM +0100, Dr. David Alan Gilbert wrote: > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > > > > This commit is for debugging only. Do not merge it. > > > > > > > > > > mincore(2) checks whether pages are resident. Use it to verify that > > > > > page cache has been dropped. > > > > > > > > > > You can trigger a verification failure by mmapping the image file from > > > > > another process and loading a byte from a page so that it becomes > > > > > resident. bdrv_co_invalidate_cache() will fail while the process is > > > > > alive. > > > > > > > > It doesn't seem a bad diagnostic to keep in (with a switch to activate) > > > > for when we're faced with some weird corruption on some weird storage > > > > system. > > > > > > Okay. It's very slow to mmap an entire image file and query mincore(2) > > > so it needs to be off by default. > > > > Also, having it enabled breaks localhost migration at least on tmpfs > > (which was what I tried out first). > > > > I wonder if the kernel would add some way to query whether the "advice" > > was actually acted upon if we asked. Either with a new function that > > returns an error if not everything is dropped (basically > > .bdrv_invalidate_cache on the kernel level), or a function that just > > queries if any page is allocated (or maybe the address of the first > > allocated page in a given range) without having to use mincore() and > > iterating over all the pages in userspace. > > I'm trying to figure out how to expose the optional mincore check on the > command-line/QMP: > > 1. Add a check_consistency bool argument to bdrv_invalidate_cache*(). > Add command-line/QMP option to -incoming and migrate_incoming. This > is messy and won't be easy to access for libvirt users. > > 2. Add a BlockdevOptionsFile *check-cache-consistency bool field. This > is specified at .bdrv_open() time. It can be changed at runtime with > .bdrv_reopen*(). > > 3. Add a 'blockdev-check-cache-consistency' QMP command that calls a new > .bdrv_check_cache_consistency() callback that is implemented by > file-posix.c. The problem is users might issue this command after > I/O has resumed and pages have become resident again. It only makes > sense if the guest is still paused. Probably a bad interface... > > Have I missed a good way to expose this optional check functionality? > > Which approach do you prefer? I'm leaning towards #2. Yes, I think that makes the most sense. In its current form, this can probably only be a debugging feature, though, so maybe x-check-cache-consistency? I also don't think libvirt should mess with this. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used 2018-04-24 14:29 ` Kevin Wolf @ 2018-04-27 10:06 ` Stefan Hajnoczi 0 siblings, 0 replies; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-27 10:06 UTC (permalink / raw) To: Kevin Wolf Cc: Dr. David Alan Gilbert, qemu-devel, Max Reitz, Sergio Lopez, qemu-block [-- Attachment #1: Type: text/plain, Size: 3137 bytes --] On Tue, Apr 24, 2018 at 04:29:15PM +0200, Kevin Wolf wrote: > Am 24.04.2018 um 16:04 hat Stefan Hajnoczi geschrieben: > > On Fri, Apr 20, 2018 at 08:25:13AM +0200, Kevin Wolf wrote: > > > Am 20.04.2018 um 05:02 hat Stefan Hajnoczi geschrieben: > > > > On Thu, Apr 19, 2018 at 10:05:47AM +0100, Dr. David Alan Gilbert wrote: > > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote: > > > > > > This commit is for debugging only. Do not merge it. > > > > > > > > > > > > mincore(2) checks whether pages are resident. Use it to verify that > > > > > > page cache has been dropped. > > > > > > > > > > > > You can trigger a verification failure by mmapping the image file from > > > > > > another process and loading a byte from a page so that it becomes > > > > > > resident. bdrv_co_invalidate_cache() will fail while the process is > > > > > > alive. > > > > > > > > > > It doesn't seem a bad diagnostic to keep in (with a switch to activate) > > > > > for when we're faced with some weird corruption on some weird storage > > > > > system. > > > > > > > > Okay. It's very slow to mmap an entire image file and query mincore(2) > > > > so it needs to be off by default. > > > > > > Also, having it enabled breaks localhost migration at least on tmpfs > > > (which was what I tried out first). > > > > > > I wonder if the kernel would add some way to query whether the "advice" > > > was actually acted upon if we asked. Either with a new function that > > > returns an error if not everything is dropped (basically > > > .bdrv_invalidate_cache on the kernel level), or a function that just > > > queries if any page is allocated (or maybe the address of the first > > > allocated page in a given range) without having to use mincore() and > > > iterating over all the pages in userspace. > > > > I'm trying to figure out how to expose the optional mincore check on the > > command-line/QMP: > > > > 1. Add a check_consistency bool argument to bdrv_invalidate_cache*(). > > Add command-line/QMP option to -incoming and migrate_incoming. This > > is messy and won't be easy to access for libvirt users. > > > > 2. Add a BlockdevOptionsFile *check-cache-consistency bool field. This > > is specified at .bdrv_open() time. It can be changed at runtime with > > .bdrv_reopen*(). > > > > 3. Add a 'blockdev-check-cache-consistency' QMP command that calls a new > > .bdrv_check_cache_consistency() callback that is implemented by > > file-posix.c. The problem is users might issue this command after > > I/O has resumed and pages have become resident again. It only makes > > sense if the guest is still paused. Probably a bad interface... > > > > Have I missed a good way to expose this optional check functionality? > > > > Which approach do you prefer? I'm leaning towards #2. > > Yes, I think that makes the most sense. > > In its current form, this can probably only be a debugging feature, > though, so maybe x-check-cache-consistency? I also don't think libvirt > should mess with this. Cool, I will implement this. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration 2018-04-19 7:52 [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Stefan Hajnoczi 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi 2018-04-19 7:52 ` [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used Stefan Hajnoczi @ 2018-04-19 16:09 ` Eric Blake 2018-04-20 3:05 ` Stefan Hajnoczi 2 siblings, 1 reply; 20+ messages in thread From: Eric Blake @ 2018-04-19 16:09 UTC (permalink / raw) To: Stefan Hajnoczi, qemu-devel Cc: Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz, Vladimir Sementsov-Ogievskiy [-- Attachment #1: Type: text/plain, Size: 2038 bytes --] On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote: > file-posix.c only supports shared storage live migration with -drive > cache.direct=off due to cache consistency issues. There are two main shared > storage configurations: files on NFS and host block devices on SAN LUNs. > > The problem is that QEMU starts on the destination host before the source host > has written everything out to the disk. The page cache on the destination host > may contain stale data read when QEMU opened the image file (before migration > handover). Using O_DIRECT avoids this problem but prevents users from taking > advantage of the host page cache. > > Although cache=none is the recommended setting for virtualization use cases, > there are scenarios where cache=writeback makes sense. If the guest has much > less RAM than the host or many guests share the same backing file, then the > host page cache can significantly improve disk I/O performance. > > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > on Linux so that shared storage live migration works. I have sent it as an RFC > because cache consistency is not binary, there are corner cases which I've > described in the actual patch, and this may require more discussion. Interesting, in that the NBD list is also discussing the possible standardization of a NBD_CMD_CACHE command (based on existing practice in the xNBD implementation), and covering whether that MIGHT be worth doing as a thin wrapper that corresponds to posix_fadvise() semantics. Thus, if NBD_CMD_CACHE learns flags, we could support .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition to the POSIX file driver. Obviously, your usage invalidates the cache of the entire file; but does it also make sense to expose a start/length subset invalidation, for better exposure to posix_fadvise() semantics? -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 619 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration 2018-04-19 16:09 ` [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Eric Blake @ 2018-04-20 3:05 ` Stefan Hajnoczi 2018-04-20 13:53 ` Eric Blake 0 siblings, 1 reply; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-20 3:05 UTC (permalink / raw) To: Eric Blake Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz, Vladimir Sementsov-Ogievskiy [-- Attachment #1: Type: text/plain, Size: 2344 bytes --] On Thu, Apr 19, 2018 at 11:09:53AM -0500, Eric Blake wrote: > On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote: > > file-posix.c only supports shared storage live migration with -drive > > cache.direct=off due to cache consistency issues. There are two main shared > > storage configurations: files on NFS and host block devices on SAN LUNs. > > > > The problem is that QEMU starts on the destination host before the source host > > has written everything out to the disk. The page cache on the destination host > > may contain stale data read when QEMU opened the image file (before migration > > handover). Using O_DIRECT avoids this problem but prevents users from taking > > advantage of the host page cache. > > > > Although cache=none is the recommended setting for virtualization use cases, > > there are scenarios where cache=writeback makes sense. If the guest has much > > less RAM than the host or many guests share the same backing file, then the > > host page cache can significantly improve disk I/O performance. > > > > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > > on Linux so that shared storage live migration works. I have sent it as an RFC > > because cache consistency is not binary, there are corner cases which I've > > described in the actual patch, and this may require more discussion. > > Interesting, in that the NBD list is also discussing the possible > standardization of a NBD_CMD_CACHE command (based on existing practice > in the xNBD implementation), and covering whether that MIGHT be worth > doing as a thin wrapper that corresponds to posix_fadvise() semantics. > Thus, if NBD_CMD_CACHE learns flags, we could support > .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition > to the POSIX file driver. Obviously, your usage invalidates the cache > of the entire file; but does it also make sense to expose a start/length > subset invalidation, for better exposure to posix_fadvise() semantics? bdrv_co_invalidate_cache() is currently only used by migration before using a file that may have been touched by the other host. I don't think start/length will be needed for that use case. Can you describe how will NBD use cache invalidation? Maybe this will help me understand other use cases. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration 2018-04-20 3:05 ` Stefan Hajnoczi @ 2018-04-20 13:53 ` Eric Blake 2018-04-24 13:43 ` Stefan Hajnoczi 0 siblings, 1 reply; 20+ messages in thread From: Eric Blake @ 2018-04-20 13:53 UTC (permalink / raw) To: Stefan Hajnoczi Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz, Vladimir Sementsov-Ogievskiy [-- Attachment #1: Type: text/plain, Size: 2247 bytes --] On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote: >>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c >>> on Linux so that shared storage live migration works. I have sent it as an RFC >>> because cache consistency is not binary, there are corner cases which I've >>> described in the actual patch, and this may require more discussion. >> >> Interesting, in that the NBD list is also discussing the possible >> standardization of a NBD_CMD_CACHE command (based on existing practice >> in the xNBD implementation), and covering whether that MIGHT be worth >> doing as a thin wrapper that corresponds to posix_fadvise() semantics. >> Thus, if NBD_CMD_CACHE learns flags, we could support >> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition >> to the POSIX file driver. Obviously, your usage invalidates the cache >> of the entire file; but does it also make sense to expose a start/length >> subset invalidation, for better exposure to posix_fadvise() semantics? > > bdrv_co_invalidate_cache() is currently only used by migration before > using a file that may have been touched by the other host. I don't > think start/length will be needed for that use case. > > Can you describe how will NBD use cache invalidation? Maybe this will > help me understand other use cases. That's where things are still under discussion - no one has yet provided a case that would benefit from a POSIX_FADV_DONTNEED over just a range of the file [1]; on the other hand, it might make sense that if you know an implementation has a limited cache, then having control over the various posix_fadvise() flags over various ranges of the files may lead to more optimum behavior. And posix_fadvise() does have the ability to work over the entire file (offset 0 length 0) or a subrange (any offset and nonzero length). So I'm also fine if .bdrv_co_invalidate_cache() doesn't expose offset/length parameters, particularly if NBD can't come up with an actual use case that would benefit. [1] https://lists.debian.org/nbd/2018/04/msg00020.html -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 619 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration 2018-04-20 13:53 ` Eric Blake @ 2018-04-24 13:43 ` Stefan Hajnoczi 0 siblings, 0 replies; 20+ messages in thread From: Stefan Hajnoczi @ 2018-04-24 13:43 UTC (permalink / raw) To: Eric Blake Cc: qemu-devel, Kevin Wolf, Sergio Lopez, qemu-block, Dr. David Alan Gilbert, Max Reitz, Vladimir Sementsov-Ogievskiy [-- Attachment #1: Type: text/plain, Size: 2437 bytes --] On Fri, Apr 20, 2018 at 08:53:33AM -0500, Eric Blake wrote: > On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote: > > >>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > >>> on Linux so that shared storage live migration works. I have sent it as an RFC > >>> because cache consistency is not binary, there are corner cases which I've > >>> described in the actual patch, and this may require more discussion. > >> > >> Interesting, in that the NBD list is also discussing the possible > >> standardization of a NBD_CMD_CACHE command (based on existing practice > >> in the xNBD implementation), and covering whether that MIGHT be worth > >> doing as a thin wrapper that corresponds to posix_fadvise() semantics. > >> Thus, if NBD_CMD_CACHE learns flags, we could support > >> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition > >> to the POSIX file driver. Obviously, your usage invalidates the cache > >> of the entire file; but does it also make sense to expose a start/length > >> subset invalidation, for better exposure to posix_fadvise() semantics? > > > > bdrv_co_invalidate_cache() is currently only used by migration before > > using a file that may have been touched by the other host. I don't > > think start/length will be needed for that use case. > > > > Can you describe how will NBD use cache invalidation? Maybe this will > > help me understand other use cases. > > That's where things are still under discussion - no one has yet provided > a case that would benefit from a POSIX_FADV_DONTNEED over just a range > of the file [1]; on the other hand, it might make sense that if you know > an implementation has a limited cache, then having control over the > various posix_fadvise() flags over various ranges of the files may lead > to more optimum behavior. And posix_fadvise() does have the ability to > work over the entire file (offset 0 length 0) or a subrange (any offset > and nonzero length). So I'm also fine if .bdrv_co_invalidate_cache() > doesn't expose offset/length parameters, particularly if NBD can't come > up with an actual use case that would benefit. > > [1] https://lists.debian.org/nbd/2018/04/msg00020.html Okay. .bdrv_co_invalidate_cache() is an internal interface. We can change it later to support NBD functionality, if necessary. Let's keep it without start/length for now. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2018-04-27 12:22 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-19 7:52 [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Stefan Hajnoczi 2018-04-19 7:52 ` [Qemu-devel] [RFC 1/2] block/file-posix: implement bdrv_co_invalidate_cache() on Linux Stefan Hajnoczi 2018-04-19 8:13 ` Fam Zheng 2018-04-20 3:15 ` Stefan Hajnoczi 2018-04-20 3:36 ` Fam Zheng 2018-04-20 6:13 ` Kevin Wolf 2018-04-19 9:18 ` Dr. David Alan Gilbert 2018-04-20 3:21 ` Stefan Hajnoczi 2018-04-20 6:27 ` Kevin Wolf 2018-04-19 7:52 ` [Qemu-devel] [RFC 2/2] block/file-posix: verify page cache is not used Stefan Hajnoczi 2018-04-19 9:05 ` Dr. David Alan Gilbert 2018-04-20 3:02 ` Stefan Hajnoczi 2018-04-20 6:25 ` Kevin Wolf 2018-04-24 14:04 ` Stefan Hajnoczi 2018-04-24 14:29 ` Kevin Wolf 2018-04-27 10:06 ` Stefan Hajnoczi 2018-04-19 16:09 ` [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration Eric Blake 2018-04-20 3:05 ` Stefan Hajnoczi 2018-04-20 13:53 ` Eric Blake 2018-04-24 13:43 ` Stefan Hajnoczi
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.