From: Yang Shi <shy828301@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
    Shakeel Butt <shakeelb@google.com>,
    "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
    Miaohe Lin <linmiaohe@huawei.com>,
    Mike Kravetz <mike.kravetz@oracle.com>,
    Michal Hocko <mhocko@suse.com>,
    Rik van Riel <riel@surriel.com>,
    Christoph Hellwig <hch@infradead.org>,
    Matthew Wilcox <willy@infradead.org>,
    "Eric W. Biederman" <ebiederm@xmission.com>,
    Alexey Gladkov <legion@kernel.org>,
    Chris Wilson <chris@chris-wilson.co.uk>,
    Matthew Auld <matthew.auld@intel.com>,
    Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>,
    Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
    linux-api@vger.kernel.org,
    Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
Date: Wed, 4 Aug 2021 12:01:17 -0700
Message-ID: <CAHbLzkrvOCCbN3EcDeKwfqWrtU6kH0+7fuSv7aahyjpKtsHn3g@mail.gmail.com>
In-Reply-To: <8baad8b2-8f7a-2589-ce21-4135a59c5dc6@google.com>

On Wed, Aug 4, 2021 at 1:28 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 2 Aug 2021, Yang Shi wrote:
> > On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> > > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > > > >
> > > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > > that a consistent set of checks can be applied, even when the inode is
> > > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > > (the index argument is seldom of interest, but required by mount option
> > > > > "huge=within_size"). Clean up and rearrange the checks a little.
> > > > >
> > > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > > worth optimizing against (we are there *because* there was at least one
> > > > > small page in the way), and handled by a later PageTransCompound check.
> > > >
> > > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > > check may be not good enough since the page allocated by
> > > > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > > > won't be replaced by a newly allocated huge page so the wrong charge
> > > > can't be undone.
> > >
> > > Good point on the memcg charge: I hadn't thought of that. Of course
> > > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > > admit that a huge mischarge is hugely worse than a small mischarge.
> >
> > The small page could be collapsed to a huge page sooner or later, so
> > the mischarge may be transient. But huge page can't be replaced.
>
> You're right, if all goes well, the mischarged small page could be
> collapsed to a correctly charged huge page sooner or later (but all
> may not go well), whereas the mischarged huge page is stuck there.
>
> >
> > >
> > > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > > to the vma (hence its mm, hence its memcg) here, couldn't we? Easily
> > > done, but I don't really want to make shmem_getpage_gfp() public just
> > > for this, for two reasons.
> > >
> > > One is that the huge race it just so unlikely; and a mischarge to root
> > > is not the end of the world, so long as it's not reproducible. It can
> > > only happen on the very first page of the huge extent, and the prior
> >
> > OK, if so the mischarge is not as bad as what I thought in the first place.
> >
> > > "Stop if extent has been truncated" check makes sure there was one
> > > entry in the extent at that point: so the race with hole-punch can only
> > > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > > looks up the page in the tree (and I say hole-punch not truncate,
> > > because shmem_getpage()'s i_size check will reject when truncated).
> > > I don't doubt that it could happen, but stand by not optimizing against.
> >
> > I agree the race is so unlikely and it may be not worth optimizing
> > against it right now, but a note or a comment may be worth.
>
> Thanks, but despite us agreeing that the race is too unlikely to be worth
> optimizing against, it does still nag at me ever since you questioned it:
> silly, but I can't quite be convinced by my own dismissals.
>
> I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
> huge allocation decisions remains the intention; but now think to add
> SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
> to rule out that possibility of mischarge after racing hole-punch,
> no matter whether it's huge or small. If any such race occurs,
> collapse_file() should just give up.
>
> This being the "Stupid me" SGP_READ idea, except that of course would
> not work: because half the point of that block in collapse_file() is
> to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.
>
> There is, of course, the danger that in fixing this unlikely mischarge,
> I've got the code wrong and am introducing a bug: here's what a 17/16
> would look like, though it will be better inserted early. I got sick
> of all the "if (page "s, and was glad of the opportunity to fix that
> outdated "bring it back from swap" comment - swap got done above.
>
> What do you think? Should I add this in or leave it out?

Thanks for keeping investigating this. The patch looks good to me. I
think we could go this way. Just a nit below.

>
> Thanks,
> Hugh
>
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  /* Flag allocation requirements to shmem_getpage */
>  enum sgp_type {
>         SGP_READ,       /* don't exceed i_size, don't allocate page */
> +       SGP_NOALLOC,    /* like SGP_READ, but do use fallocated page */

The comment looks misleading, it seems SGP_NOALLOC does clear the
Uptodate flag but SGP_READ doesn't. Or it is fine not to distinguish
this difference?

>         SGP_CACHE,      /* don't exceed i_size, may allocate page */
>         SGP_WRITE,      /* may exceed i_size, may allocate !Uptodate page */
>         SGP_FALLOC,     /* like SGP_WRITE, but make existing page Uptodate */
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
>                                 xas_unlock_irq(&xas);
>                                 /* swap in or instantiate fallocated page */
>                                 if (shmem_getpage(mapping->host, index, &page,
> -                                                 SGP_CACHE)) {
> +                                                 SGP_NOALLOC)) {
>                                         result = SCAN_FAIL;
>                                         goto xa_unlocked;
>                                 }
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1903,26 +1903,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>                 return error;
>         }
>
> -       if (page)
> +       if (page) {
>                 hindex = page->index;
> -       if (page && sgp == SGP_WRITE)
> -               mark_page_accessed(page);
> -
> -       /* fallocated page? */
> -       if (page && !PageUptodate(page)) {
> +               if (sgp == SGP_WRITE)
> +                       mark_page_accessed(page);
> +               if (PageUptodate(page))
> +                       goto out;
> +               /* fallocated page */
>                 if (sgp != SGP_READ)
>                         goto clear;
>                 unlock_page(page);
>                 put_page(page);
> -               page = NULL;
> -               hindex = index;
>         }
> -       if (page || sgp == SGP_READ)
> -               goto out;
> +
> +       *pagep = NULL;
> +       if (sgp == SGP_READ)
> +               return 0;
> +       if (sgp == SGP_NOALLOC)
> +               return -ENOENT;
>
>         /*
> -        * Fast cache lookup did not find it:
> -        * bring it back from swap or allocate.
> +        * Fast cache lookup and swap lookup did not find it: allocate.
>          */
>
>         if (vma && userfaultfd_missing(vma)) {
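
For reference, the lookup decisions in the quoted mm/shmem.c hunk can be modeled in plain userspace C. This is only an illustrative sketch, not kernel code: `page_state`, `next_step`, `lookup_result` and `model_lookup()` are hypothetical names invented here, and the body of the `clear` label is outside the quoted hunk, so it is represented only as a step name.

```c
#include <errno.h>

/* Modes named in the quoted shmem_fs.h hunk; SGP_NOALLOC is the proposed one. */
enum sgp_type { SGP_READ, SGP_NOALLOC, SGP_CACHE, SGP_WRITE, SGP_FALLOC };

/* Hypothetical stand-in for what the cache and swap lookup found. */
enum page_state {
    PAGE_ABSENT,      /* nothing in the tree (e.g. after a racing hole-punch) */
    PAGE_FALLOCATED,  /* page present but !Uptodate */
    PAGE_UPTODATE     /* page present and Uptodate */
};

/* Hypothetical labels for where the patched code goes next. */
enum next_step {
    STEP_OUT,         /* "goto out": hand back the Uptodate page */
    STEP_CLEAR,       /* "goto clear": label not shown in the quoted hunk */
    STEP_RETURN,      /* early return with *pagep == NULL */
    STEP_ALLOCATE     /* fall through to the allocation path */
};

struct lookup_result {
    enum next_step step;
    int error;        /* 0 or a negative errno, as in the hunk */
};

/* Mirrors the branch order of the "+" lines in the shmem_getpage_gfp() hunk. */
static struct lookup_result model_lookup(enum page_state state, enum sgp_type sgp)
{
    struct lookup_result r = { STEP_ALLOCATE, 0 };

    if (state != PAGE_ABSENT) {
        if (state == PAGE_UPTODATE) {
            r.step = STEP_OUT;
            return r;
        }
        /* fallocated page */
        if (sgp != SGP_READ) {
            r.step = STEP_CLEAR;
            return r;
        }
        /* SGP_READ drops the page and continues as if none was found. */
    }

    if (sgp == SGP_READ) {
        r.step = STEP_RETURN;
        return r;
    }
    if (sgp == SGP_NOALLOC) {
        r.step = STEP_RETURN;
        r.error = -ENOENT;
        return r;
    }
    return r; /* remaining modes go on to allocate */
}
```

In this model, `model_lookup(PAGE_ABSENT, SGP_NOALLOC)` yields `-ENOENT`, which in the quoted khugepaged.c hunk would take the `SCAN_FAIL` branch, so collapse_file() gives up instead of allocating a page. A fallocated page takes the `clear` path under SGP_NOALLOC but not under SGP_READ, which is the difference the nit about the enum comment refers to.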