From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Du Rui <durui@linux.alibaba.com>
Cc: agk@redhat.com, alexl@redhat.com, dm-devel@redhat.com,
	gscrivan@redhat.com, linux-kernel@vger.kernel.org,
	snitzer@kernel.org, Gao Xiang <xiang@kernel.org>
Subject: Re: dm overlaybd: targets mapping OverlayBD image
Date: Sat, 27 May 2023 12:12:24 +0800
Message-ID: <11c1e59f-d05e-5479-fa6b-36d9a793c16e@linux.alibaba.com>
In-Reply-To: <20230527031319.92200-1-durui@linux.alibaba.com>



On 2023/5/27 11:13, Du Rui wrote:
>> Block drivers have nothing to do with filesystem page cache, and
>> currently your approach has nothing to do with pmem either. (If you
>> must mention "DAX" to propose your "page cache sharing", please write
>> down your detailed design _here_ first and explain how it could work
>> with ours, if you really want to do that.)
> 
> We have already done experiments (with virtio-pmem) to create a virtual
> PMEM device in QEMU, so that guest VMs share only one memory mapping on
> the host, using a filesystem that supports DAX. In the guest VM, the fs
> keeps no page cache; maybe "sharing pagecache" is not an accurate
> description, but sharing memory pages on the host does prevent
> duplicated page cache pages across VMs.

First, does virtio-pmem have any relationship with this in-kernel
"dm / lvm" proposal of yours?  Does your virtio-pmem work on bare
metal, cloud servers or runC (I mean, without some host-side
adaptation)?

Secondly, does your virtio-pmem have any relationship with this
kernel approach? If not, why not directly use your userspace work
for your specific use case? How does this kernel DM approach help
your "sharing pagecache" at all?

Do you know how kernel FSDAX works and what type of memory pmem
is?  Could you give me your detailed kernel design for doing an
in-kernel DM + pmem DAX mapping?
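
For reference, the userspace-visible effect of FSDAX is roughly the
following (a minimal sketch with a hypothetical path; it assumes an
ext4/XFS filesystem mounted with "-o dax" on a pmem device, e.g. the
virtio-pmem device mentioned above):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical path on a filesystem mounted with "-o dax". */
	int fd = open("/mnt/pmem/layer.img", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/*
	 * With FSDAX, this mapping is served directly from the pmem
	 * device pages (with virtio-pmem, ultimately host memory), so
	 * the guest kernel keeps no page-cache copy of the file data.
	 */
	unsigned char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return EXIT_FAILURE;
	}
	printf("first byte: 0x%02x\n", p[0]);

	munmap(p, 4096);
	close(fd);
	return EXIT_SUCCESS;
}

Whether such a guest-side mapping has anything to do with the proposed
in-kernel DM target is exactly the question being asked above.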

> 
> Please make sure that you have already understood that dm-overlaybd is
> for GENERIC purposes. It is NOT a special design for containers, and
> has nothing to do with filesystem implementations.

The previous dm-qcow2 proposal was more generic: the qcow2 on-disk
format is friendly to read-write use, and its two-level L1/L2 indexes
take a smaller persistent runtime memory footprint than your on-disk
format, which has to load and parse the hardly-seekable on-disk
LSMT+zfile layer indexes into some new in-memory representation for
random access before any real I/O can happen, and these in-memory
indexes _cannot_ be _partially reclaimed_ from memory. qcow2 also has
a much wider ecosystem than your approach, but could you see the
community tendency on this?

ublk-qcow2: ublk-qcow2 is available:
https://lore.kernel.org/r/Yza1u1KfKa7ycQm0@T590
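
To make the index point above concrete, here is a minimal sketch
(purely hypothetical code, neither qcow2 nor your LSMT/zfile
implementation) of why a seekable two-level L1/L2 layout only needs a
tiny pinned table in memory: each L2 table is read from disk on first
use and can be dropped again under memory pressure, whereas a
hardly-seekable format has to be parsed into one big in-memory
representation up front, which cannot be partially dropped and re-read
later.

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define CLUSTER_BITS	16		/* 64 KiB clusters (qcow2 default) */
#define L2_ENTRIES	(1u << 13)	/* 8192 entries per 64 KiB L2 table */
#define L2_SIZE		(L2_ENTRIES * sizeof(uint64_t))

struct two_level_index {
	int	   fd;			/* backing image file or device */
	uint64_t  *l1;			/* pinned and tiny: one entry per L2 table */
	uint64_t **l2_cache;		/* L2 tables, loaded on demand */
	size_t	   l1_entries;
};

/* Map a guest offset to a host offset, touching at most one L2 table. */
static uint64_t map_guest_offset(struct two_level_index *idx, uint64_t offset)
{
	uint64_t cluster = offset >> CLUSTER_BITS;
	size_t l1_idx = cluster / L2_ENTRIES;
	size_t l2_idx = cluster % L2_ENTRIES;

	if (!idx->l2_cache[l1_idx]) {	/* demand-load just this L2 table */
		uint64_t *l2 = malloc(L2_SIZE);

		if (!l2 || pread(idx->fd, l2, L2_SIZE, idx->l1[l1_idx]) != L2_SIZE)
			return 0;	/* error handling elided in this sketch */
		idx->l2_cache[l1_idx] = l2;
	}
	return idx->l2_cache[l1_idx][l2_idx];
}

/* Under memory pressure any cached L2 table can simply be freed ... */
static void reclaim_l2(struct two_level_index *idx, size_t l1_idx)
{
	free(idx->l2_cache[l1_idx]);
	idx->l2_cache[l1_idx] = NULL;	/* ... and re-read later on demand */
}

QEMU's qcow2 driver bounds its L2 table cache in essentially this way.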

Second, I have to mention here your previous attempt, which read (and
maybe later would have written) your DADI files via the VFS directly
from your in-kernel block driver; I think that was really dangerous:

see vfsfile.c of your previous codebase
https://github.com/data-accelerator/dadi-kernel-mod/commit/ff12687f2c567ddf51a28df88b25dd2d0e3737a2

static struct file *file_open(const char *path, int flags, int rights)
{
..
	fp = filp_open(path, O_RDONLY, 0);
..
}

static ssize_t file_read(struct file *file, void *buf, size_t count, loff_t pos)
{
..
	vfs_fadvise(file, pos, count, POSIX_FADV_SEQUENTIAL);
..
		ret = kernel_read(file, buf, count, &lpos);
..
}

In your currently proposed patch, you still call it "struct vfile" but
use raw block devices instead.

But such raw block device use cases are limited (almost useless) for
containers since, as Alex said, almost all container users have
switched to filesystem-based approaches (I don't want to repeat why).
And your kernel approach is almost useless for virtual machine use
cases (see how qcow2 works for VMs).

In the end, if you *end up* later upstreaming reads of backing
filesystem files directly under the block layer (for example, as your
second step), that is really a no-go.

Anyway, all the above is said on my own behalf.

Thanks,
Gao Xiang
