linux-kernel.vger.kernel.org archive mirror
From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Alexander Larsson <alexl@redhat.com>,
	Mike Snitzer <snitzer@kernel.org>,
	Du Rui <durui@linux.alibaba.com>
Cc: dm-devel@redhat.com, linux-kernel@vger.kernel.org,
	Alasdair Kergon <agk@redhat.com>,
	Giuseppe Scrivano <gscrivan@redhat.com>
Subject: Re: dm overlaybd: targets mapping OverlayBD image
Date: Wed, 24 May 2023 15:13:49 +0800
Message-ID: <fd4d0429-4da3-8217-6c13-14fd8a198920@linux.alibaba.com>
In-Reply-To: <CAL7ro1FPEqXyOuX_WPMYdsT6rW-bD5EU=v=oWKsd6XscykLF6Q@mail.gmail.com>



On 2023/5/24 14:43, Alexander Larsson wrote:
> On Tue, May 23, 2023 at 7:29 PM Mike Snitzer <snitzer@kernel.org> wrote:
>>
>> On Fri, May 19 2023 at  6:27P -0400,
>> Du Rui <durui@linux.alibaba.com> wrote:
>>
>>> OverlayBD is a novel layering block-level image format, which is design
>>> for container, secure container and applicable to virtual machine,
>>> published in USENIX ATC '20
>>> https://www.usenix.org/system/files/atc20-li-huiba.pdf
>>>
>>> OverlayBD already has a ContainerD non-core sub-project implementation
>>> in userspace, as an accelerated container image service
>>> https://github.com/containerd/accelerated-container-image
>>>
>>> It could be much more efficient when do decompressing and mapping works
>>> in the kernel with the framework of device-mapper, in many circumstances,
>>> such as secure container runtime, mobile-devices, etc.
>>>
>>> This patch contains a module, dm-overlaybd, provides two kinds of targets
>>> dm-zfile and dm-lsmt, to expose a group of block-devices contains
>>> OverlayBD image as a overlaid read-only block-device.
>>>
>>> Signed-off-by: Du Rui <durui@linux.alibaba.com>
>>
>> <snip, original patch here: [1] >
> 
> A long long time ago I wrote a docker container image based on
> dm-snapshot that is vaguely similar to this one. It is still
> available, but nobody really uses it. It has several weaknesses. First
> of all the container image is an actual filesystem, so you need to
> pre-allocate a fixed max size for images at construction time.
> Secondly, all the lvm volume changes and mounts during runtime caused
> weird behaviour (especially at scale) that was painful to manage (just
> search the docker issue tracker for devmapper backend). In the end
> everyone moved to a filesystem based implementation (overlayfs based).

Yeah, and I think reproducibility is another problem: it's quite hard to
pick an arbitrary fs and get the best result without some changes.  I do
see these guys working on e2fsprogs again and again.

I've already told them internally again and again, but... they only focus
on minor points such as how to do I/O and CPU prefetch to get (somewhat)
better performance and beat EROFS.  I don't know; I don't even have enough
time to look into whether this new kernel stuff is fine, because of one
very simple idea:

  stacked storage generally doubles the runtime/memory footprint:
    filesystem + block drivers

> 
>> I appreciate that this work is being done with an eye toward
>> containerd "community" and standardization but based on my limited
>> research it appears that this format of OCI image storage/use is only
>> used by Alibaba? (but I could be wrong...)
>>
>> But you'd do well to explain why the userspace solution isn't
>> acceptable. Are there security issues that moving the implementation
>> to kernel addresses?
>>
>> I also have doubts that this solution is _actually_ more performant
>> than a proper filesystem based solution that allows page cache sharing
>> of container image data across multiple containers.
> 
> This solution doesn't even allow page cache sharing between shared
> layers (like current containers do), much less between independent
> layers.
> 
>> There is an active discussion about, and active development effort
>> for, using overlayfs + erofs for container images.  I'm reluctant to
>> merge this DM based container image approach without wider consensus
>> from other container stakeholders.
>>
>> But short of reaching wider consensus on the need for these DM
>> targets: there is nothing preventing you from carrying these changes
>> in your alibaba kernel.
> 
> Erofs already has some block-level support for container images (with
> nydus), and composefs works with current in-kernel EROFS+overlayfs.
> And this new approach doesn't help for the IMHO current weak spot we
> have, which is unprivileged container images.
> 
> Also, while OCI artifacts can be used to store any kind of image
> formats (or any other kind of file) I think for an actual standardized
> new image format it would be better to work with the OCI org to come
> up with a OCI v2 standard image format.

Agreed.  I hope you guys can actually sit down and evaluate a proper
solution for the next OCI v2.  Currently I know of:

  - Composefs
  - (e)stargz   https://github.com/containerd/stargz-snapshotter
  - Nydus       https://github.com/containerd/nydus-snapshotter
  - OverlayBD   https://github.com/containerd/accelerated-container-image
  - SOCI        https://github.com/awslabs/soci-snapshotter
  - Tarfs
  - (maybe even more..)

Honestly, I do think OSTree/Composefs is the best approach for now for
deduplication and page cache sharing (given the kernel's limitations on
page cache sharing and the overlayfs copy-up limitation).  I'm honestly
too tired of container image stuff.  Too much manpower is being wasted
unnecessarily.

Thanks,
Gao Xiang

> 
> But, I don't really speak for the block layer developers, so take my
> opinions with a pinch of salt.
> 
