From: Amir Goldstein <amir73il@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>,
"zhengbin (A)" <zhengbin13@huawei.com>,
"J. R. Okajima" <hooanon05g@gmail.com>,
Linux MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Al Viro <viro@zeniv.linux.org.uk>,
Matthew Wilcox <willy@infradead.org>,
Jeff Layton <jlayton@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
kernel-team@fb.com
Subject: Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support
Date: Tue, 7 Jan 2020 10:35:22 +0200 [thread overview]
Message-ID: <CAOQ4uxj_rLeQY4VXzRbM78T8O=b36-Jrh4C-jdQx_6Aiy=D1BA@mail.gmail.com> (raw)
In-Reply-To: <alpine.LSU.2.11.2001062304500.1496@eggly.anvils>
> Chris, Amir, thank you both for all the work you have been putting
> into this over the holiday. I'm only now catching up, sorry.
>
> It appears that what you are ending up with above is very close to
> the patch Google has been using for several years. I'm glad Chris
> has explored some interesting options, disappointed that you had no
> more success than I had in trying to solve it efficiently with 32-bit
> inos, agree with the way in which Amir cut it back. That we've come to
> the same conclusions independently is good confirmation of this approach.
>
Indeed :)
> Only yesterday did I get to see that Amir had asked to see my patch on
> the 27th: rediffed against 5.5-rc5, appended now below. I'm showing it,
> though it's known deficient in three ways (not to mention lack of config
> or mount options, but I now see Dave Chinner has an interesting take on
> those) - better make it visible to you now, than me delay you further.
>
> It does indicate a couple of improvements to Chris's current patch:
> reducing use of stat_lock, as Amir suggested (in both nr_inodes limited
> and unlimited cases); and need to fix shmem_encode_fh(), which neither
> of us did yet. Where we should go from here, fix Chris's or fix mine,
> I've not decided.
>
I vote in favor or best of both patches to result in a simpler outcome:
1. use inop approach from Hugh's patch
2. use get_next_ino() instead of per-sb ino_batch for kern_mount
Hugh, do you have any evidence or suspect that shmem_file_setup()
could be contributing to depleting the global get_next_ino pool?
Do you have an objection to the combination above?
> From: Hugh Dickins <hughd@google.com>
> Date: Fri, 7 Aug 2015 20:08:53 -0700
> Subject: [PATCH] tmpfs: provide 64-bit inode numbers
>
> Some programs (including ld.so and clang) try to deduplicate opened
> files by comparing st_dev and st_ino, which breaks down when two files
> on the same tmpfs have the same inode number: we are now hitting this
> problem too often. J. R. Okajima has reported the same problem with
> backup tools.
>
> tmpfs is currently sharing the same 32-bit get_next_ino() pool as
> sockets, pipes, ramfs, hugetlbfs etc. It delivers 32-bit inos even
> on a 64-bit kernel for one reason: because if a 32-bit executable
> compiled without _FILE_OFFSET_BITS=64 tries to stat a file with an
> ino which won't fit in 32 bits, glibc fails that with EOVERFLOW.
> glibc is being correct, but unhelpful: so 2.6.22 commit 866b04fccbf1
> ("inode numbering: make static counters in new_inode and iunique be
> 32 bits") forced get_next_ino() to unsigned int.
>
> But whatever the upstream need to avoid surprising a mis-built
> 32-bit executable, Google production can be sure of the 64-bit
> environment, and any remaining 32-bit executables built with
> _FILE_OFFSET_BITS=64 (our files are sometimes bigger than 2G).
>
> So, leave get_next_ino() as it is, but convert tmpfs to supply
> unsigned long inos, and let each superblock keep its own account:
> it was weird to be sharing a pool with such disparate uses before.
>
> shmem_reserve_inode() already provides a good place to do this;
> and normally it has to take stat_lock to update free_inodes, so
> no overhead to increment its next_ino under the same lock. But
> if it's the internal shm_mnt, or mounted with "nr_inodes=0" to
> increase scalability by avoiding that stat_lock, then use the
> same percpu batching technique as get_next_ino().
>
> Take on board 4.2 commit 2adc376c5519 ("vfs: avoid creation of
> inode number 0 in get_next_ino"): for safety, skip any ino whose
> low 32 bits is 0.
>
> Upstream? That's tougher: maybe try to upstream this as is, and
> see what falls out; maybe add ino32 or ino64 mount options before
> trying; or maybe upstream has to stick with the 32-bit ino, and a
> scheme more complicated than this be implemented for tmpfs.
>
> Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com>
>
> 1) Probably needs minor corrections for the 32-bit kernel: I wasn't
> worrying about that at the time, and expect some "unsigned long"s
> I didn't need to change for the 64-bit kernel actually need to be
> "u64"s or "ino_t"s now.
> 2) The "return 1" from shmem_encode_fh() would nowadays be written as
> "return FILEID_INO32_GEN" (and overlayfs takes particular interest
> in that fileid); yet it already put the high 32 bits of the ino into
> the fh: probably needs a different fileid once high 32 bits are set.
Nice spotting, but this really isn't a problem for overlayfs.
I agree that it would be nice to follow the same practice as xfs with
XFS_FILEID_TYPE_64FLAG, but as one can see from the private
flag, this is by no means a standard of any sort.
As the name_to_handle_at() man page says:
"Other than the use of the handle_bytes field, the caller should treat the
file_handle structure as an opaque data type: the handle_type and f_handle
fields are needed only by a subsequent call to open_by_handle_at()."
It is true that one of my initial versions was checking value returned from
encode_fh, but Miklos has pointed out to me that this value is arbitrary
and we shouldn't rely on it.
In current overlayfs, the value FILEID_INO32_GEN is only used internally
to indicate the case where filesystem has no encode_fh op (see comment
above ovl_can_decode_fh()). IOW, only the special case of encoding
by the default export_encode_fh() is considered a proof for 32bit inodes.
So tmpfs has never been considered as a 32bit inodes filesystem by
overlayfs.
Thanks,
Amir.
next prev parent reply other threads:[~2020-01-07 8:35 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-05 12:05 [PATCH v5 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
2020-01-05 12:06 ` [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support Chris Down
2020-01-06 2:03 ` zhengbin (A)
2020-01-06 6:41 ` Amir Goldstein
2020-01-07 8:01 ` Hugh Dickins
2020-01-07 8:35 ` Amir Goldstein [this message]
2020-01-08 10:58 ` Hugh Dickins
2020-01-08 12:51 ` Amir Goldstein
2020-01-06 13:17 ` Chris Down
2020-01-05 12:06 ` [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
2020-01-07 0:10 ` Dave Chinner
2020-01-07 0:16 ` Chris Down
2020-01-07 0:39 ` Dave Chinner
2020-01-07 6:54 ` Amir Goldstein
2020-01-07 8:36 ` Hugh Dickins
2020-01-07 10:12 ` Amir Goldstein
2020-01-07 21:07 ` Dave Chinner
2020-01-07 21:37 ` Chris Mason
2020-01-08 11:24 ` Hugh Dickins
2020-01-09 0:43 ` Jeff Layton
2020-01-10 16:45 ` Chris Down
2020-01-13 7:36 ` Hugh Dickins
2020-01-20 15:11 ` Chris Down
2020-02-25 23:14 ` Hugh Dickins
2020-01-07 20:59 ` Dave Chinner
2020-01-08 14:37 ` Mikael Magnusson
2020-01-13 6:58 ` Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAOQ4uxj_rLeQY4VXzRbM78T8O=b36-Jrh4C-jdQx_6Aiy=D1BA@mail.gmail.com' \
--to=amir73il@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=chris@chrisdown.name \
--cc=hannes@cmpxchg.org \
--cc=hooanon05g@gmail.com \
--cc=hughd@google.com \
--cc=jlayton@kernel.org \
--cc=kernel-team@fb.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=tj@kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=zhengbin13@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).