qemu-devel.nongnu.org archive mirror
* qemu-img cache modes with Linux cgroup v1
@ 2023-07-31 15:40 Stefan Hajnoczi
  2023-07-31 16:06 ` Richard W.M. Jones
  2023-07-31 17:19 ` Daniel P. Berrangé
  0 siblings, 2 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2023-07-31 15:40 UTC (permalink / raw)
  To: Alex Kalenyuk, Adam Litke; +Cc: qemu-devel, kwolf, Richard W.M. Jones

Hi,
qemu-img -t writeback -T writeback is not designed to run with the Linux
cgroup v1 memory controller because dirtying too much page cache leads
to process termination instead of the usual throttling behavior seen
without cgroups or with cgroup v2:
https://bugzilla.redhat.com/show_bug.cgi?id=2196072

I wanted to share my thoughts on this issue.

cache=none bypasses the host page cache and will not hit the cgroup
memory limit. It's an easy solution to avoid exceeding the cgroup v1
memory limit.

However, not all Linux file systems support O_DIRECT and qemu-img's I/O
pattern may perform worse under cache=none than cache=writeback.

1. Which file systems support O_DIRECT in Linux 6.5?

I searched the Linux source code for file systems that implement
.direct_IO or set FMODE_CAN_ODIRECT. This is not exhaustive and may not
be 100% accurate.

The big name file systems (ext4, XFS, btrfs, nfs, smb, ceph) support
O_DIRECT. The most obvious omission is tmpfs.

If your users are running file systems that support O_DIRECT, then
qemu-img -t none -T none is an easy solution to the cgroup v1 memory
limit issue.

Supported:
9p
affs
btrfs
ceph
erofs
exfat
ext2
ext4
f2fs
fat
fuse
gfs2
hfs
hfsplus
jfs
minix
nfs
nilfs2
ntfs3
ocfs2
orangefs
overlayfs
reiserfs
smb
udf
xfs
zonefs

Unsupported:
adfs
befs
bfs
cramfs
ecryptfs
efs
freevxfs
hpfs
hugetlbfs
isofs
jffs2
ntfs
omfs
qnx4
qnx6
ramfs
romfs
squashfs
sysv
tmpfs
ubifs
ufs
vboxsf

2. Is qemu-img performance with O_DIRECT acceptable?

The I/O pattern matters more with O_DIRECT because every I/O request is
sent to the storage device. This means buffer sizes matter more (many
small I/Os have higher overhead than fewer large I/Os). Concurrency can
also help saturate the storage device.
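
For illustration, here is roughly what a single O_DIRECT write looks like
at the syscall level (an untested sketch; the file name, block size, and
alignment value are arbitrary). The buffer address, length, and file
offset all have to be aligned, and larger requests amortize the
per-request overhead:

  /* One aligned 1 MiB write with O_DIRECT. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = 1024 * 1024;   /* large request: fewer, bigger I/Os */
      void *buf;
      int fd;

      fd = open("test.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      if (fd < 0) {
          perror("open");   /* EINVAL typically means no O_DIRECT support */
          return 1;
      }

      /* O_DIRECT requires an aligned buffer (4096 covers most devices) */
      if (posix_memalign(&buf, 4096, len) != 0) {
          return 1;
      }
      memset(buf, 0, len);

      if (pwrite(fd, buf, len, 0) != (ssize_t)len) {
          perror("pwrite");
          return 1;
      }

      free(buf);
      close(fd);
      return 0;
  }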

If you switch to O_DIRECT and encounter performance problems then
qemu-img can be optimized to send I/O patterns with less overhead. This
requires performance analysis.

3. Using buffered I/O because O_DIRECT is not universally supported?

If you can't use O_DIRECT, then qemu-img could be extended to manage its
dirty page cache set carefully. This consists of picking a budget and
writing back to disk when the budget is exhausted. Richard Jones has
shared links covering posix_fadvise(2) and sync_file_range(2):
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
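
Roughly, the idea looks like this (an untested sketch, not what qemu-img
does today; the 16 MiB budget is arbitrary, writes are assumed to be
sequential, and error handling is omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define DIRTY_BUDGET (16 * 1024 * 1024)

  static off_t window_start;   /* start of the current dirty window */
  static off_t dirty_bytes;    /* bytes dirtied since the last flush */

  /* Call after each successful pwrite() of 'len' bytes at 'offset',
   * assuming writes are roughly sequential. */
  static void account_dirty(int fd, off_t offset, size_t len)
  {
      dirty_bytes += len;
      if (dirty_bytes < DIRTY_BUDGET) {
          return;
      }

      /* Push the dirty window out to disk... */
      sync_file_range(fd, window_start, dirty_bytes,
                      SYNC_FILE_RANGE_WAIT_BEFORE |
                      SYNC_FILE_RANGE_WRITE |
                      SYNC_FILE_RANGE_WAIT_AFTER);

      /* ...then tell the kernel those pages won't be needed again so
       * they stop counting against the memory limit. */
      posix_fadvise(fd, window_start, dirty_bytes, POSIX_FADV_DONTNEED);

      window_start = offset + len;
      dirty_bytes = 0;
  }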

We can discuss qemu-img code changes and performance analysis more if
you decide to take that direction.

Hope this helps!

Stefan


* Re: qemu-img cache modes with Linux cgroup v1
  2023-07-31 15:40 qemu-img cache modes with Linux cgroup v1 Stefan Hajnoczi
@ 2023-07-31 16:06 ` Richard W.M. Jones
  2023-07-31 17:19 ` Daniel P. Berrangé
  1 sibling, 0 replies; 6+ messages in thread
From: Richard W.M. Jones @ 2023-07-31 16:06 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Alex Kalenyuk, Adam Litke, qemu-devel, kwolf, nsoffer

On Mon, Jul 31, 2023 at 11:40:36AM -0400, Stefan Hajnoczi wrote:
> 3. Using buffered I/O because O_DIRECT is not universally supported?
> 
> If you can't use O_DIRECT, then qemu-img could be extended to manage its
> dirty page cache set carefully. This consists of picking a budget and
> writing back to disk when the budget is exhausted. Richard Jones has
> shared links covering posix_fadvise(2) and sync_file_range(2):
> https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
> https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
> 
> We can discuss qemu-img code changes and performance analysis more if
> you decide to take that direction.

There's a bit more detail in these two commits:

  https://gitlab.com/nbdkit/libnbd/-/commit/64d50d994dd7062d5cce21f26f0e8eba0e88c87e
  https://gitlab.com/nbdkit/nbdkit/-/commit/a956e2e75d6c88eeefecd967505667c9f176e3af

In my experience this method is much better than using O_DIRECT;
it has far fewer sharp edges.

By the way, this is a super-useful tool for measuring how much of the
page cache is being used to cache a file:

  https://github.com/Feh/nocache

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




* Re: qemu-img cache modes with Linux cgroup v1
  2023-07-31 15:40 qemu-img cache modes with Linux cgroup v1 Stefan Hajnoczi
  2023-07-31 16:06 ` Richard W.M. Jones
@ 2023-07-31 17:19 ` Daniel P. Berrangé
  2023-07-31 19:15   ` Stefan Hajnoczi
  1 sibling, 1 reply; 6+ messages in thread
From: Daniel P. Berrangé @ 2023-07-31 17:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Alex Kalenyuk, Adam Litke, qemu-devel, kwolf, Richard W.M. Jones

On Mon, Jul 31, 2023 at 11:40:36AM -0400, Stefan Hajnoczi wrote:
> Hi,
> qemu-img -t writeback -T writeback is not designed to run with the Linux
> cgroup v1 memory controller because dirtying too much page cache leads
> to process termination instead of the usual throttling behavior seen
> without cgroups or with cgroup v2:
> https://bugzilla.redhat.com/show_bug.cgi?id=2196072

Ewww, a horrible behavioural change v1 is imposing on apps :-(

QEMU happens to hit it because we do lots of I/O, but plenty of
other apps do major I/O and can fall into the same trap :-( I
can imagine that simply running a big "tar zxvf" would have much
the same effect in terms of masses of I/O in a short time.

> I wanted to share my thoughts on this issue.
> 
> cache=none bypasses the host page cache and will not hit the cgroup
> memory limit. It's an easy solution to avoid exceeding the cgroup v1
> memory limit.

I'd go further and say that is a good recommendation even without
this bug in cgroups v1.

writeback caching helps if you have lots of free memory, but on
virtualization hosts memory is usually the biggest VM density
constraint, so apps shouldn't generally expect there to be lots
of free host memory to burn as I/O cache.

If you're using qemu-img in preparation for running qemu-system-XXX
and the latter will use cache=none anyway, then it is even less
desirable for qemu-img to fill the host cache with pages that won't
be accessed again when the VM starts in qemu-system-XXXX.

> However, not all Linux file systems support O_DIRECT and qemu-img's I/O
> pattern may perform worse under cache=none than cache=writeback.
> 
> 1. Which file systems support O_DIRECT in Linux 6.5?
> 
> I searched the Linux source code for file systems that implement
> .direct_IO or set FMODE_CAN_ODIRECT. This is not exhaustive and may not
> be 100% accurate.
> 
> The big name file systems (ext4, XFS, btrfs, nfs, smb, ceph) support
> O_DIRECT. The most obvious omission is tmpfs.

Rather than trying to figure out a list of FS types, in OpenStack a
bit of code was added to simply attempt to open a test file with
O_DIRECT on the target filesystem. If that works, then run qemu-img /
qemu-system-XXX with cache=none, otherwise use cache=writeback.
IOW, a "best effort" to avoid host cache where supported.

Could there be justification for QEMU to support a "best effort"
host cache bypass mode natively, to avoid every app needing to
re-implement this logic to check for support of O_DIRECT ?

eg a QEMU 'cache=trynone' option instead of 'cache=none' ?


> 2. Is qemu-img performance with O_DIRECT acceptable?
> 
> The I/O pattern matters more with O_DIRECT because every I/O request is
> sent to the storage device. This means buffer sizes matter more (many
> small I/Os have higher overhead than fewer large I/Os). Concurrency can
> also help saturate the storage device.

"qemu-img convert" supports the '--parallel' flag to use many
coroutines for I/O

> If you switch to O_DIRECT and encounter performance problems then
> qemu-img can be optimized to send I/O patterns with less overhead. This
> requires performance analysis.

Since we're in pretty direct control of the I/O pattern qemu-img imposes,
it feels very sensible to optimize it such that cache=none achieves
ideal performance.


> 3. Using buffered I/O because O_DIRECT is not universally supported?
> 
> If you can't use O_DIRECT, then qemu-img could be extended to manage its
> dirty page cache set carefully. This consists of picking a budget and
> writing back to disk when the budget is exhausted.

IOW, re-implementing what the kernel should already be doing for us :-(

This feels like the least desirable thing for QEMU to take on, especially
since cgroups v1 is an evolutionary dead-end, with v2 increasingly taking
over the world.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: qemu-img cache modes with Linux cgroup v1
  2023-07-31 17:19 ` Daniel P. Berrangé
@ 2023-07-31 19:15   ` Stefan Hajnoczi
  2024-05-06 17:10     ` Alex Kalenyuk
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2023-07-31 19:15 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Kalenyuk, Adam Litke, qemu-devel, kwolf, Richard W.M. Jones

Hi Daniel,
I agree with your points.

Stefan


* Re: qemu-img cache modes with Linux cgroup v1
  2023-07-31 19:15   ` Stefan Hajnoczi
@ 2024-05-06 17:10     ` Alex Kalenyuk
  2024-05-06 18:24       ` Stefan Hajnoczi
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Kalenyuk @ 2024-05-06 17:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrangé,
	Adam Litke, qemu-devel, kwolf, Richard W.M. Jones

Hey, just FYI about tmpfs: during some development on Fedora 39 I noticed
O_DIRECT is now supported on tmpfs (as opposed to our CI, which runs CentOS
9 Stream).
`qemu-img convert -t none -O raw tests/images/cirros-qcow2.img /tmp/cirros.raw`
where /tmp is indeed a tmpfs.

I might be missing something, so feel free to call that out.

On Tue, Aug 1, 2023 at 6:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> Hi Daniel,
> I agree with your points.
>
> Stefan
>


* Re: qemu-img cache modes with Linux cgroup v1
  2024-05-06 17:10     ` Alex Kalenyuk
@ 2024-05-06 18:24       ` Stefan Hajnoczi
  0 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2024-05-06 18:24 UTC (permalink / raw)
  To: Alex Kalenyuk
  Cc: Daniel P. Berrangé,
	Adam Litke, qemu-devel, kwolf, Richard W.M. Jones

On Mon, May 06, 2024 at 08:10:25PM +0300, Alex Kalenyuk wrote:
> Hey, just FYI about tmpfs: during some development on Fedora 39 I noticed
> O_DIRECT is now supported on tmpfs (as opposed to our CI, which runs CentOS
> 9 Stream).
> `qemu-img convert -t none -O raw tests/images/cirros-qcow2.img /tmp/cirros.raw`
> where /tmp is indeed a tmpfs.
> 
> I might be missing something, so feel free to call that out.

Yes, it was added by:

commit e88e0d366f9cfbb810b0c8509dc5d130d5a53e02
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Aug 10 23:27:07 2023 -0700

    tmpfs: trivial support for direct IO

It's fairly new but great to have.

Stefan
