Hi Amir and all,

On 2/6/23 3:06 AM, Amir Goldstein wrote:
>>>>> Apart from that, I still fail to get some thoughts (apart from
>>>>> unprivileged mounts) on how the EROFS + overlayfs combination fails
>>>>> on automotive real workloads aside from "ls -lR" (readdir + stat).
>>>>>
>>>>> And eventually we still need overlayfs for most use cases to do
>>>>> writable stuff anyway, so it needs some words to describe why such a
>>>>> < 1s difference is very, very important to the real workload as you
>>>>> already mentioned before.
>>>>>
>>>>> And with overlayfs lazy lookup, I think it can be close to ~100ms or
>>>>> better.
>>>>>
>>>>
>>>> If we had an overlay.fs-verity xattr, then I think there are no
>>>> individual features lacking for it to work for the automotive usecase
>>>> I'm working on, nor for the OCI container usecase. However, the
>>>> possibility of doing something doesn't mean it is the better technical
>>>> solution.
>>>>
>>>> The container usecase is very important in real-world Linux use today,
>>>> and as such it makes sense to have a technically excellent solution for
>>>> it, not just a workable one. Obviously we all have different viewpoints
>>>> of what that is, but these are the reasons why I think a composefs
>>>> solution is better:
>>>>
>>>> * It is faster than all other approaches for the one thing it actually
>>>> needs to do (lookup and readdir performance). Other kinds of
>>>> performance (file I/O speed, etc.) are up to the backing filesystem
>>>> anyway.
>>>>
>>>> Even if there are possible approaches to make overlayfs perform better
>>>> here (the "lazy lookup" idea), it will not reach the performance of
>>>> composefs, while further complicating the overlayfs codebase. (btw, did
>>>> someone ask Miklos what he thinks of that idea?)
>>>>
>>>
>>> Well, Miklos was CCed (now in TO:)
>>> I did ask him specifically about relaxing -ouserxattr,metacopy,redirect:
>>> https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96
>>> but no response on that yet.
>>>
>>> TBH, in the end, Miklos really is the one who is going to have the most
>>> weight on the outcome.
>>>
>>> If Miklos is interested in adding this functionality to overlayfs, you
>>> are going to have a VERY hard sell trying to merge composefs as an
>>> independent expert filesystem. The community simply does not approve of
>>> this sort of fragmentation unless there is a very good reason to do so.
>>>
>>>> For the automotive usecase we have strict cold-boot time requirements
>>>> that make cold-cache performance very important to us. Of course,
>>>> there is no simple time requirement for the specific case of listing
>>>> files in an image, but any improvement in cold-cache performance for
>>>> both the ostree rootfs and the containers started during boot will be
>>>> worth its weight in gold when trying to reach these hard KPIs.
>>>>
>>>> * It uses less memory, as we don't need the extra inodes that come
>>>> with the overlayfs mount. (See profiling data in Giuseppe's mail [1].)
>>>
>>> Understood, but we will need profiling data with the optimized ovl
>>> (or with the single blob hack) to compare the relevant alternatives.
>>
>> My little request again: could you help benchmark your real workload
>> rather than the "ls -lR" stuff?
>> If your hard KPI is really what you said, why not just benchmark the
>> real workload now and write a detailed analysis to everyone to explain
>> why it's a _must_ to upstream a new stacked fs for this?
>>
>
> I agree that benchmarking the actual KPI (boot time) will have
> a much stronger impact and help to build a much stronger case
> for composefs if you can prove that the boot time difference really matters.
>
> In order to test boot time on fair grounds, I prepared for you a POC
> branch with overlayfs lazy lookup:
> https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata
>
> It is very lightly tested, but should be sufficient for the benchmark.
> Note that:
> 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on
> 2. The lazyfollow POC only works with a read-only overlay that
>    has two lower dirs (one metadata layer and one data blobs layer)
> 3. The data layer must be a local blockdev fs (i.e. not a network fs)
> 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...")
>
> These limitations could be easily lifted with a bit more work.
> If any of those limitations stand in your way for running the benchmark,
> let me know and I'll see what I can do.
>

Thanks for the lazyfollow POC. I updated the perf test with overlayfs
lazyfollow enabled.

                                                          | uncached(ms) | cached(ms)
----------------------------------------------------------+--------------+-----------
composefs                                                 |     404      |    181
composefs (readahead of manifest disabled)                |     523      |    178
erofs (loop BUFFER)                                       |     300      |    188
erofs (loop DIRECT)                                       |     486      |    188
erofs (loop DIRECT + ra manifest)                         |     292      |    190
erofs (loop BUFFER) + overlayfs(lazyfollow)               |     502      |    286
erofs (loop DIRECT) + overlayfs(lazyfollow)               |     686      |    285
erofs (loop DIRECT + ra manifest) + overlayfs(lazyfollow) |     484      |    300

I find that composefs behaves better than pure erofs (loop DIRECT), e.g.
404ms vs 486ms in the uncached case, partly because composefs reads part
of its metadata with buffered kernel_read() and thus the built-in
readahead is performed on the manifest file. With readahead of the
manifest disabled, the performance gets much worse.

erofs can use a similar optimization of reading ahead the manifest file
when accessing the metadata, if really needed. An example POC
implementation is inlined at the bottom of this mail. Considering that
the "ls -lR" workload reads basically the full content of the manifest
file, and that the manifest file is only ~10MB, the POC implementation
just performs async readahead on the manifest with a fixed step of
128KB. With this opt-in, erofs performs somewhat better in the uncached
case. I have to admit that this depends a lot on the range and the step
size of the readahead, but at least it indicates that erofs can get
comparable performance with a similar optimization.

Besides, as mentioned in [1], in composefs the on-disk inodes under one
directory are laid out closer together than in erofs, which means the IO
submitted when doing "ls -l" in erofs is more random than in composefs,
somewhat affecting the performance. This can possibly be fixed by
improving mkfs.erofs if the gap (~80ms) really matters. The inode id
arrangement under the root directory of the tested rootfs is shown as an
attachment, and the tested erofs image for erofs+overlayfs is
constructed with the script (mkhack.sh) attached in [2], offered by
Alexander.
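For reference, the "ls -lR" workload measured above is essentially a
recursive readdir + stat walk. Below is a minimal C sketch of that kind
of walk, purely as an illustration of what the numbers measure; it is
not the actual test script, and the mount point argument is just a
placeholder for whichever image (composefs / erofs / overlay) is being
tested. The "uncached" case additionally requires dropping caches
(echo 3 > /proc/sys/vm/drop_caches) before the run.

#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <time.h>

static long long nentries;

/* nftw() already performs the lstat() for every entry it visits. */
static int visit(const char *path, const struct stat *st, int type,
		 struct FTW *ftwbuf)
{
	(void)path; (void)st; (void)type; (void)ftwbuf;
	nentries++;
	return 0;
}

int main(int argc, char *argv[])
{
	struct timespec t0, t1;
	long long ms;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (nftw(argv[1], visit, 64, FTW_PHYS) < 0) {
		perror("nftw");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	ms = (t1.tv_sec - t0.tv_sec) * 1000LL +
	     (t1.tv_nsec - t0.tv_nsec) / 1000000LL;
	printf("%lld entries walked in %lld ms\n", nentries, ms);
	return 0;
}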
To summarize:

composefs and erofs are quite similar, and their performance is also
comparable (with the same optimization). But when comparing composefs
with erofs+overlayfs(lazyfollow), at least for the "ls -lR" workload,
the combination of erofs and overlayfs costs ~100ms more in both the
cached and the uncached case.

If such a ~100ms difference really matters, erofs could resolve the
"redirect" xattr itself, in which case overlayfs is not involved and the
performance should be comparable with composefs, but I'm not sure it's
worthwhile given that the results are already close. Besides, the rootfs
is read-only in this case, and if we need a writable layer anyway,
overlayfs still has to be introduced.

[1] https://lore.kernel.org/lkml/1d65be2f-6d3a-13c6-4982-66bbb0f9b530@linux.alibaba.com/
[2] https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/


diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index d3b8736fa124..e74e24e00b49 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -21,6 +21,10 @@ static void *erofs_read_inode(struct erofs_buf *buf,
 	struct erofs_inode_compact *dic;
 	struct erofs_inode_extended *die, *copied = NULL;
 	unsigned int ifmt;
+	struct folio *folio;
+	struct inode *binode = sb->s_bdev->bd_inode;
+	struct address_space *mapping = binode->i_mapping;
+	pgoff_t index = inode_loc >> PAGE_SHIFT;
 	int err;
 
 	blkaddr = erofs_blknr(inode_loc);
@@ -29,6 +33,16 @@ static void *erofs_read_inode(struct erofs_buf *buf,
 	erofs_dbg("%s, reading inode nid %llu at %u of blkaddr %u",
 		  __func__, vi->nid, *ofs, blkaddr);
 
+	folio = filemap_get_folio(mapping, index);
+	if (!folio || !folio_test_uptodate(folio)) {
+		loff_t isize = i_size_read(binode);
+		pgoff_t end_index = (isize - 1) >> PAGE_SHIFT;
+		unsigned long nr_to_read = min_t(unsigned long, end_index - index, 32);
+		DEFINE_READAHEAD(ractl, NULL, NULL, mapping, index);
+
+		page_cache_ra_unbounded(&ractl, nr_to_read, 0);
+	}
+
 	kaddr = erofs_read_metabuf(buf, sb, blkaddr, EROFS_KMAP);
 	if (IS_ERR(kaddr)) {
 		erofs_err(sb, "failed to get inode (nid: %llu) page, err %ld",

--
Thanks,
Jingbo