From: David Hildenbrand <david@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Andreas Gruenbacher <agruenba@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Heiko Carstens <hca@linux.ibm.com>
Subject: Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
Date: Tue, 8 Mar 2022 13:24:19 +0100	[thread overview]
Message-ID: <1bdb0184-696c-0f1a-3054-d88391c32e64@redhat.com> (raw)
In-Reply-To: <81f2f76d-24ef-c23b-449e-0b8fdec506e1@redhat.com>

On 08.03.22 13:11, David Hildenbrand wrote:
> On 08.03.22 09:37, David Hildenbrand wrote:
>> On 08.03.22 09:21, David Hildenbrand wrote:
>>> On 08.03.22 00:18, Linus Torvalds wrote:
>>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>>>>
>>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>>> not making any progress.
>>>>
>>>> Since this is s390-specific, I get the very strong feeling that the
>>>>
>>>>   fault_in_iov_iter_writeable ->
>>>>     fault_in_safe_writeable ->
>>>>       __get_user_pages_locked ->
>>>>         __get_user_pages
>>>>
>>>> path somehow successfully finds the page, despite it not being
>>>> properly accessible in the page tables.
>>>
>>> As raised offline already, I suspect
>>>
>>> shrink_active_list()
>>> ->page_referenced()
>>>  ->page_referenced_one()
>>>   ->ptep_clear_flush_young_notify()
>>>    ->ptep_clear_flush_young()
>>>
>>> which results on s390x in:
>>>
>>> static inline pte_t pte_mkold(pte_t pte)
>>> {
>>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>>> 	pte_val(pte) |= _PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> 					    unsigned long addr, pte_t *ptep)
>>> {
>>> 	pte_t pte = *ptep;
>>>
>>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>>> 	return pte_young(pte);
>>> }
>>>
>>>
>>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
>>> pure SW bit. AFAIU, pte_present() still holds:
>>>
>>> static inline int pte_present(pte_t pte)
>>> {
>>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>>> }
>>>
>>>
>>> pte_mkyoung() will revert that action:
>>>
>>> static inline pte_t pte_mkyoung(pte_t pte)
>>> {
>>> 	pte_val(pte) |= _PAGE_YOUNG;
>>> 	if (pte_val(pte) & _PAGE_READ)
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>>
>>> and pte_modify() will adjust it properly again:
>>>
>>> /*
>>>  * The following pte modification functions only work if
>>>  * pte_present() is true. Undefined behaviour if not..
>>>  */
>>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>>> {
>>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>>> 	pte_val(pte) |= pgprot_val(newprot);
>>> 	/*
>>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>>> 	 * has the invalid bit set, clear it again for readable, young pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	/*
>>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>>> 	 * protection bit set, clear it again for writable, dirty pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>>> 	return pte;
>>> }
>>>
>>>
>>>
>>> Which leaves me wondering if there is a way in GUP whereby
>>> we would look up that page and not clear _PAGE_INVALID,
>>> resulting in GUP succeeding but faults via the MMU still
>>> faulting on _PAGE_INVALID.
>>
>>
>> follow_page_pte() has this piece of code:
>>
>> 	if (flags & FOLL_TOUCH) {
>> 		if ((flags & FOLL_WRITE) &&
>> 		    !pte_dirty(pte) && !PageDirty(page))
>> 			set_page_dirty(page);
>> 		/*
>> 		 * pte_mkyoung() would be more correct here, but atomic care
>> 		 * is needed to avoid losing the dirty bit: it is easier to use
>> 		 * mark_page_accessed().
>> 		 */
>> 		mark_page_accessed(page);
>> 	}
>>
>> Which at least to me suggests that, although the page is marked accessed and
>> GUP succeeds, the PTE might still have _PAGE_INVALID set after GUP succeeded.
>>
>>
>> On s390x, there is no HW dirty bit, so we might just be able to do a proper
>> pte_mkyoung() here instead of the mark_page_accessed().
>>
> 
> Something hacky like this should be able to show if what I suspect is the case.
> It compiles, but I didn't actually test it.
That would be the alternative that also takes the mkdirty into account:


From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 8 Mar 2022 12:51:26 +0100
Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled

On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
going via the page and leaving the PTE unmodified. E.g., if we only
mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
we'll fail to clear the HW invalid bit in the PTE, and subsequent
accesses via the MMU would still require a pagefault.

Otherwise, buffered I/O will loop forever because it will keep stumbling
over the set HW invalid bit, requiring a page fault.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..de3311feb377 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		}
 	}
 	if (flags & FOLL_TOUCH) {
-		if ((flags & FOLL_WRITE) &&
-		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
 		/*
-		 * pte_mkyoung() would be more correct here, but atomic care
-		 * is needed to avoid losing the dirty bit: it is easier to use
-		 * mark_page_accessed().
+		 * We have to be careful with updating the PTE on architectures
+		 * that have a HW dirty bit: while updating the PTE we might
+		 * lose that bit again and we'd need an atomic update: it is
+		 * easier to leave the PTE untouched for these architectures.
+		 *
+		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
+		 * the hw invalid bit in pte_mkold(), to catch further
+		 * references. We have to update the PTE here to e.g., clear the
+		 * invalid bit; otherwise, callers that rely on not requiring
+		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
+		 * because the page won't actually be accessible via the MMU.
 		 */
-		mark_page_accessed(page);
+		if (IS_ENABLED(CONFIG_S390)) {
+			pte = pte_mkyoung(pte);
+			if (flags & FOLL_WRITE)
+				pte = pte_mkdirty(pte);
+			if (!pte_same(pte, *ptep)) {
+				set_pte_at(vma->vm_mm, address, ptep, pte);
+				update_mmu_cache(vma, address, ptep);
+			}
+		} else {
+			if ((flags & FOLL_WRITE) &&
+			    !pte_dirty(pte) && !PageDirty(page))
+				set_page_dirty(page);
+			mark_page_accessed(page);
+		}
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/* Do not mlock pte-mapped THP */
-- 
2.35.1


-- 
Thanks,

David / dhildenb

