All of lore.kernel.org
* Buffered I/O broken on s390x with page faults disabled (gfs2)
@ 2022-03-07 22:52 Andreas Gruenbacher
  2022-03-07 23:18 ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-07 22:52 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro
  Cc: linux-s390, Linux-MM, linux-fsdevel, linux-btrfs, David Hildenbrand

Hello,

since the page fault disabling changes from v5.15, gfs2 is in trouble
and I'm out of ideas; maybe one of you can help.

The test case in question writes large chunks of data (between 400 MiB
and 797 MiB) into mmapped memory and then reads that data back in a
single read() system call. With those large reads, we end up in a
tight loop in gfs2_file_read_iter() on s390x:

After generic_file_read_iter() returns a short or empty read, we fault
in some pages with fault_in_iov_iter_writeable(). This succeeds, but
the next call to generic_file_read_iter() returns -EFAULT and we're
not making any progress.

The problem starts when the system runs low on memory and the
read_fault and read_fault_race numbers suddenly shoot up (1 second
sampling rate):

# ./runhang /mnt/scratch/foo
swpd free buff cache read_fault(*) read_fault_race(*) read_lock_stolen(*) min_flt maj_flt
0 2610408 6352 470064 0 0 0 654 76064
0 2412172 6352 667352 0 0 0 0 49276
0 1887404 6224 1080936 0 0 0 956 103586
0 1692324 6224 1275004 0 0 0 0 48400
0 1350184 6224 1460968 0 0 0 962 46400
0 1134788 6224 1675756 0 0 0 0 53600
0 967088 6224 1842556 0 0 0 0 41600
0 803168 6224 2006044 0 0 0 0 40000
0 714448 6096 2157080 0 0 0 862 38704
0 502388 6096 2367356 0 0 0 0 52476
0 287008 6096 2582648 0 0 0 0 53680
0 532888 6096 2748832 0 0 0 278 41471
0 32344 5424 2948440 0 0 0 440 63721
0 44812 4848 2946208 0 0 0 1 45344
0 36080 4332 2954440 0 0 0 0 37114
0 44316 1060 2779684 0 0 0 42375 35251
0 64276 1060 2759572 0 0 0 0 32952
0 57564 1060 2762912 0 0 0 0 44036
0 33424 1060 2785804 0 0 0 171 44036
0 69752 828 2747364 0 0 0 176 38201
0 69652 828 2747456 2454899 2454898 0 0 0
0 69932 828 2747456 1977938 1977938 0 0 0
0 69972 828 2747456 2010242 2010242 0 0 0
0 70004 828 2747456 1977223 1977223 0 0 0
0 69784 828 2747456 2004963 2004963 0 0 0
0 69560 828 2747456 2020472 2020472 0 0 0
0 70144 828 2747456 2019310 2019310 0 0 0
0 69440 828 2748256 1965665 1965665 0 0 0
0 69200 828 2748388 2008175 2008175 0 0 0
0 69000 828 2748388 2010383 2010383 0 0 0
0 69052 828 2748388 2011591 2011591 0 0 0
0 68588 828 2748388 2017224 2017224 0 0 0
0 68888 828 2748388 2015931 2015931 0 0 0
0 69156 828 2748388 1995175 1995175 0 0 0
0 69196 828 2748388 2009583 2009583 0 0 0
0 68484 828 2748388 2014036 2014036 0 0 0
0 69032 828 2748388 2004889 2004889 0 0 0

Here, read_fault is the number of fault_in_iov_iter_writeable calls in
gfs2_file_read_iter,
read_fault_race is the number of unsuccessful generic_file_read_iter
calls after successful fault_in_iov_iter_writeable calls, and
read_lock_stolen is the number of times the glock got stolen.

The hang goes away with a chunk size between 4 MiB and 8 MiB. This
currently reproduces on s390x, but not on x86_64 or ppc64le. In all
cases, there are about 4G of memory, no swap, and 4 CPUs. It's not
clear if btrfs is affected by this as well.

There's a bug in the window size logic in gfs2's
should_fault_in_pages(), so the window currently always ends up being
one page. We have a fix for that queued for the next merge window
(https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?h=for-next&id=cf8da18f6c4d),
but the hang reproduces independent of that fix.

Any ideas?

Thanks a lot,
Andreas


hang.c
======
#define _XOPEN_SOURCE 600  /* posix_memalign() */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <err.h>
#include <sys/mman.h>

#define MiB (1ULL << 20)

off_t filesize = 21061 * MiB;
size_t mintrans = 400 * MiB;
size_t maxtrans = 797 * MiB;

const char *filename;
unsigned int seed;

static void usage(char *progname)
{
    fprintf(stderr,
        "Usage: %s [-S seed] [-z filesize] [-t mintrans] [-T maxtrans] filename\n",
        progname);
}

static void do_mmwrite(int fd, void *addr, off_t offset, size_t xferlen)
{
    void *buf;
    size_t count = 0;
    ssize_t ret;

    printf("%lld %zu\n", (long long)offset, xferlen);

    errno = posix_memalign(&buf, 512, xferlen);
    if (errno)
        err(1, "posix_memalign");

    memset(buf, (char)offset, xferlen);
    memcpy(addr + offset, buf, xferlen);

    if (lseek(fd, offset, SEEK_SET) == -1)
        err(1, "%s", filename);

    do {
        ret = read(fd, buf + count, xferlen - count);
        if (ret <= 0) {
            if (ret == -1 && errno == EINTR) {
                warn("read interrupted by signal");
                continue;
            }
            break;
        }
        if ((size_t)ret != xferlen - count)
            warnx("short read");
        count += ret;
    } while (count != xferlen);

    if (ret < 0)
        err(1, "%s", filename);
    if (ret == 0 && offset + xferlen != filesize)
        warnx("read returned 0 before EOF");

    free(buf);
}

int main(int argc, char *argv[])
{
    int c;
    int fd;
    void *addr;

    seed = getpid();

    while ((c = getopt(argc, argv, "S:z:t:T:h")) != -1) {
        switch(c) {
        case 'S':
            seed = atoi(optarg);
            break;
        case 'z':
            filesize = atoi(optarg) * MiB;
            break;
        case 't':
            mintrans = atoi(optarg) * MiB;
            break;
        case 'T':
            maxtrans = atoi(optarg) * MiB;
            break;
        case '?':
            usage(argv[0]);
            exit(2);
        case 'h':
            usage(argv[0]);
            exit(0);
        }
    }

    if (optind + 1 != argc) {
        usage(argv[0]);
        exit(2);
    }
    filename = argv[optind];

    srandom(seed);

    fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0666);
    if (fd == -1)
        err(1, "%s", filename);

    if (lseek(fd, filesize - 1, SEEK_SET) == -1)
        err(1, "%s", filename);
    if (write(fd, "X", 1) != 1)
        err(1, "%s", filename);

    addr = mmap(NULL, filesize, PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        err(1, "mmap");

    for(;;) {
        off_t offset = 0;

        for (;;) {
            size_t xferlen = mintrans;

            if (mintrans != maxtrans)
                xferlen = mintrans +
                      (random() % (maxtrans - mintrans));

            if (offset + xferlen > filesize)
                xferlen = filesize - offset;

            do_mmwrite(fd, addr, offset, xferlen);

            offset += xferlen;
            if (offset >= filesize)
                offset = 0;
        }
    }

    munmap(addr, filesize);
    close(fd);
    unlink(filename);

    return 0;
}


runhang
=======
#! /bin/bash

${0%/*}/hang "$@" > /dev/null &
HANGPID=$!
# "$?" after "&" is always 0; check that the child is actually running instead
kill -0 "$HANGPID" 2>/dev/null || exit 1

trap "kill $HANGPID" EXIT

sample() {
    local -a stats

    set -- $(grep ^read_iter /sys/kernel/debug/gfs2/*/fault_stats)
    shift
    stats=("$@")
    set -- $(cat /proc/$HANGPID/stat)
    stats=("${stats[@]}" "${10}" "${12}")
    echo "${stats[@]}"
}

meminfo() {
    awk '
    /^'"$1"'/ { print $2 }
    ' /proc/meminfo
}

echo swpd free buff cache read_fault read_fault_race read_lock_stolen min_flt maj_flt
old=( $(sample) )
while :; do
    sleep 1
    new=( $(sample) )
    set --
    for ((n=0; n<${#old[@]}; n++)); do
    set -- "$@" $((${new[n]} - ${old[n]}))
    done
    set -- $(meminfo SwapCached) $(meminfo MemFree) $(meminfo Buffers) $(meminfo Cached) "$@"
    echo "$@"
    old=( "${new[@]}" )
done


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-07 22:52 Buffered I/O broken on s390x with page faults disabled (gfs2) Andreas Gruenbacher
@ 2022-03-07 23:18 ` Linus Torvalds
  2022-03-08  8:21   ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-07 23:18 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs,
	David Hildenbrand

On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> After generic_file_read_iter() returns a short or empty read, we fault
> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
> the next call to generic_file_read_iter() returns -EFAULT and we're
> not making any progress.

Since this is s390-specific, I get the very strong feeling that the

  fault_in_iov_iter_writeable ->
    fault_in_safe_writeable ->
      __get_user_pages_locked ->
        __get_user_pages

path somehow successfully finds the page, despite it not being
properly accessible in the page tables.

And it's presumably something specific in the s390x page table
functionality that makes that happen.

The places I'd look at in particular is to make sure that
follow_page_mask() actually has the same rules as a real page table
lookup on s390x.

IOW, if follow_page_mask() finds the page and thinks it's writable,
then it will return a 'page' successfully, and the __get_user_pages()
code will be happy and say "it's there".

But if then accessing the page by trying to write to it using the
virtual address fails despite that, then you'll get the behavior you
describe.

I'd take a look at that can_follow_write_pte() case in particular,
since this is a FOLL_WRITE thing, but it could be any of the pte
checking details.

Have you tried tracing through that path?

                      Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-07 23:18 ` Linus Torvalds
@ 2022-03-08  8:21   ` David Hildenbrand
  2022-03-08  8:37     ` David Hildenbrand
  2022-03-08 17:26     ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08  8:21 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs

On 08.03.22 00:18, Linus Torvalds wrote:
> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>
>> After generic_file_read_iter() returns a short or empty read, we fault
>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>> the next call to generic_file_read_iter() returns -EFAULT and we're
>> not making any progress.
> 
> Since this is s390-specific, I get the very strong feeling that the
> 
>   fault_in_iov_iter_writeable ->
>     fault_in_safe_writeable ->
>       __get_user_pages_locked ->
>         __get_user_pages
> 
> path somehow successfully finds the page, despite it not being
> properly accessible in the page tables.

As raised offline already, I suspect

shrink_active_list()
->page_referenced()
 ->page_referenced_one()
  ->ptep_clear_flush_young_notify()
   ->ptep_clear_flush_young()

which results on s390x in:

static inline pte_t pte_mkold(pte_t pte)
{
	pte_val(pte) &= ~_PAGE_YOUNG;
	pte_val(pte) |= _PAGE_INVALID;
	return pte;
}

static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long addr, pte_t *ptep)
{
	pte_t pte = *ptep;

	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
	return pte_young(pte);
}


_PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
pure SW bit. AFAIU, pte_present() still holds:

static inline int pte_present(pte_t pte)
{
	/* Bit pattern: (pte & 0x001) == 0x001 */
	return (pte_val(pte) & _PAGE_PRESENT) != 0;
}


pte_mkyoung() will revert that action:

static inline pte_t pte_mkyoung(pte_t pte)
{
	pte_val(pte) |= _PAGE_YOUNG;
	if (pte_val(pte) & _PAGE_READ)
		pte_val(pte) &= ~_PAGE_INVALID;
	return pte;
}


and pte_modify() will adjust it properly again:

/*
 * The following pte modification functions only work if
 * pte_present() is true. Undefined behaviour if not..
 */
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
	pte_val(pte) &= _PAGE_CHG_MASK;
	pte_val(pte) |= pgprot_val(newprot);
	/*
	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
	 * has the invalid bit set, clear it again for readable, young pages
	 */
	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
		pte_val(pte) &= ~_PAGE_INVALID;
	/*
	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
	 * protection bit set, clear it again for writable, dirty pages
	 */
	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
		pte_val(pte) &= ~_PAGE_PROTECT;
	return pte;
}



Which leaves me wondering if there is a way in GUP whereby
we would lookup that page and not clear _PAGE_INVALID,
resulting in GUP succeeding but faults via the MMU still
faulting on _PAGE_INVALID.


-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08  8:21   ` David Hildenbrand
@ 2022-03-08  8:37     ` David Hildenbrand
  2022-03-08 12:11       ` David Hildenbrand
  2022-03-08 17:26     ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08  8:37 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs

On 08.03.22 09:21, David Hildenbrand wrote:
> On 08.03.22 00:18, Linus Torvalds wrote:
>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>>
>>> After generic_file_read_iter() returns a short or empty read, we fault
>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>> not making any progress.
>>
>> Since this is s390-specific, I get the very strong feeling that the
>>
>>   fault_in_iov_iter_writeable ->
>>     fault_in_safe_writeable ->
>>       __get_user_pages_locked ->
>>         __get_user_pages
>>
>> path somehow successfully finds the page, despite it not being
>> properly accessible in the page tables.
> 
> As raised offline already, I suspect
> 
> shrink_active_list()
> ->page_referenced()
>  ->page_referenced_one()
>   ->ptep_clear_flush_young_notify()
>    ->ptep_clear_flush_young()
> 
> which results on s390x in:
> 
> static inline pte_t pte_mkold(pte_t pte)
> {
> 	pte_val(pte) &= ~_PAGE_YOUNG;
> 	pte_val(pte) |= _PAGE_INVALID;
> 	return pte;
> }
> 
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> 					    unsigned long addr, pte_t *ptep)
> {
> 	pte_t pte = *ptep;
> 
> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
> 	return pte_young(pte);
> }
> 
> 
> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
> pure SW bit. AFAIU, pte_present() still holds:
> 
> static inline int pte_present(pte_t pte)
> {
> 	/* Bit pattern: (pte & 0x001) == 0x001 */
> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
> }
> 
> 
> pte_mkyoung() will revert that action:
> 
> static inline pte_t pte_mkyoung(pte_t pte)
> {
> 	pte_val(pte) |= _PAGE_YOUNG;
> 	if (pte_val(pte) & _PAGE_READ)
> 		pte_val(pte) &= ~_PAGE_INVALID;
> 	return pte;
> }
> 
> 
> and pte_modify() will adjust it properly again:
> 
> /*
>  * The following pte modification functions only work if
>  * pte_present() is true. Undefined behaviour if not..
>  */
> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> {
> 	pte_val(pte) &= _PAGE_CHG_MASK;
> 	pte_val(pte) |= pgprot_val(newprot);
> 	/*
> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
> 	 * has the invalid bit set, clear it again for readable, young pages
> 	 */
> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
> 		pte_val(pte) &= ~_PAGE_INVALID;
> 	/*
> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
> 	 * protection bit set, clear it again for writable, dirty pages
> 	 */
> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
> 		pte_val(pte) &= ~_PAGE_PROTECT;
> 	return pte;
> }
> 
> 
> 
> Which leaves me wondering if there is a way in GUP whereby
> we would lookup that page and not clear _PAGE_INVALID,
> resulting in GUP succeeding but faults via the MMU still
> faulting on _PAGE_INVALID.


follow_page_pte() has this piece of code:

	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);
		/*
		 * pte_mkyoung() would be more correct here, but atomic care
		 * is needed to avoid losing the dirty bit: it is easier to use
		 * mark_page_accessed().
		 */
		mark_page_accessed(page);
	}

Which at least to me suggests that, although the page is marked accessed and GUP
succeeds, the PTE might still have _PAGE_INVALID set after GUP has succeeded.


On s390x, there is no HW dirty bit, so we might just be able to do a proper
pte_mkyoung() here instead of the mark_page_accessed().

-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08  8:37     ` David Hildenbrand
@ 2022-03-08 12:11       ` David Hildenbrand
  2022-03-08 12:24         ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08 12:11 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs,
	Heiko Carstens, Christian Borntraeger, Jason Gunthorpe,
	John Hubbard, akpm

On 08.03.22 09:37, David Hildenbrand wrote:
> On 08.03.22 09:21, David Hildenbrand wrote:
>> On 08.03.22 00:18, Linus Torvalds wrote:
>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>>>
>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>> not making any progress.
>>>
>>> Since this is s390-specific, I get the very strong feeling that the
>>>
>>>   fault_in_iov_iter_writeable ->
>>>     fault_in_safe_writeable ->
>>>       __get_user_pages_locked ->
>>>         __get_user_pages
>>>
>>> path somehow successfully finds the page, despite it not being
>>> properly accessible in the page tables.
>>
>> As raised offline already, I suspect
>>
>> shrink_active_list()
>> ->page_referenced()
>>  ->page_referenced_one()
>>   ->ptep_clear_flush_young_notify()
>>    ->ptep_clear_flush_young()
>>
>> which results on s390x in:
>>
>> static inline pte_t pte_mkold(pte_t pte)
>> {
>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>> 	pte_val(pte) |= _PAGE_INVALID;
>> 	return pte;
>> }
>>
>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> 					    unsigned long addr, pte_t *ptep)
>> {
>> 	pte_t pte = *ptep;
>>
>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>> 	return pte_young(pte);
>> }
>>
>>
>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
>> pure SW bit. AFAIU, pte_present() still holds:
>>
>> static inline int pte_present(pte_t pte)
>> {
>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>> }
>>
>>
>> pte_mkyoung() will revert that action:
>>
>> static inline pte_t pte_mkyoung(pte_t pte)
>> {
>> 	pte_val(pte) |= _PAGE_YOUNG;
>> 	if (pte_val(pte) & _PAGE_READ)
>> 		pte_val(pte) &= ~_PAGE_INVALID;
>> 	return pte;
>> }
>>
>>
>> and pte_modify() will adjust it properly again:
>>
>> /*
>>  * The following pte modification functions only work if
>>  * pte_present() is true. Undefined behaviour if not..
>>  */
>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>> {
>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>> 	pte_val(pte) |= pgprot_val(newprot);
>> 	/*
>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>> 	 * has the invalid bit set, clear it again for readable, young pages
>> 	 */
>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>> 		pte_val(pte) &= ~_PAGE_INVALID;
>> 	/*
>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>> 	 * protection bit set, clear it again for writable, dirty pages
>> 	 */
>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>> 	return pte;
>> }
>>
>>
>>
>> Which leaves me wondering if there is a way in GUP whereby
>> we would lookup that page and not clear _PAGE_INVALID,
>> resulting in GUP succeeding but faults via the MMU still
>> faulting on _PAGE_INVALID.
> 
> 
> follow_page_pte() has this piece of code:
> 
> 	if (flags & FOLL_TOUCH) {
> 		if ((flags & FOLL_WRITE) &&
> 		    !pte_dirty(pte) && !PageDirty(page))
> 			set_page_dirty(page);
> 		/*
> 		 * pte_mkyoung() would be more correct here, but atomic care
> 		 * is needed to avoid losing the dirty bit: it is easier to use
> 		 * mark_page_accessed().
> 		 */
> 		mark_page_accessed(page);
> 	}
> 
> Which at least to me suggests that, although the page is marked accessed and GUP
> succeeds, that the PTE might still have _PAGE_INVALID set after we succeeded GUP.
> 
> 
> On s390x, there is no HW dirty bit, so we might just be able to do a proper
> pte_mkyoung() here instead of the mark_page_accessed().
> 

Something hacky like this should be able to show if what I suspect is the case.
It compiles, but I didn't actually test it.


From fee26d7bd90e219688c29bbe174e7a23d5e2dfd3 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 8 Mar 2022 12:51:26 +0100
Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled

On s390x, we actually need a pte_mkyoung() instead of a
mark_page_accessed() when doing a FOLL_TOUCH to clear the HW invalid bit
in the pte and allow subsequent accesses via the MMU to succeed without
triggering a pagefault.

Otherwise, buffered I/O will loop forever because it will keep stumbling
over the set HW invalid bit, requiring a page fault, which is disabled.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..d6c65474ed72 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -592,10 +592,27 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			set_page_dirty(page);
 		/*
 		 * pte_mkyoung() would be more correct here, but atomic care
-		 * is needed to avoid losing the dirty bit: it is easier to use
-		 * mark_page_accessed().
+		 * is needed for architectures that have a hw dirty bit, to
+		 * avoid losing the dirty bit: it is easier to use
+		 * mark_page_accessed() for these architectures.
+		 *
+		 * s390x doesn't have a hw reference/dirty bit and sets the
+		 * hw invalid bit in pte_mkold(), to catch further references.
+		 * We have to update the pte via pte_mkyoung() here to clear the
+		 * invalid bit and mark the page young; otherwise, callers that
+		 * rely on not requiring a MMU fault once GUP(FOLL_TOUCH)
+		 * succeeded will loop forever because the page won't be
+		 * actually accessible via the MMU.
 		 */
-		mark_page_accessed(page);
+		if (IS_ENABLED(CONFIG_S390)) {
+			pte = pte_mkyoung(pte);
+			if (!pte_same(pte, *ptep)) {
+				set_pte_at(vma->vm_mm, address, ptep, pte);
+				update_mmu_cache(vma, address, ptep);
+			}
+		} else {
+			mark_page_accessed(page);
+		}
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/* Do not mlock pte-mapped THP */
-- 
2.35.1



We should probably generalize this, using an ARCH config that says that we
don't have HW dirty bits and can do a pte_mkyoung() here without losing
any concurrent updates to the pte via the hw.

Further, I wonder if we might have to do a pte_mkdirty() in case of FOLL_WRITE
for these architectures as well, instead of going via the set_page_dirty().
Could be that that might be required as well here, haven't looked into the
details.

The follow_trans_huge_pmd()->touch_pmd() case should be fine I guess,
and it does both, the pmd_mkyoung and the pmd_mkdirty.

-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 12:11       ` David Hildenbrand
@ 2022-03-08 12:24         ` David Hildenbrand
  2022-03-08 13:20           ` Gerald Schaefer
  2022-03-08 17:23           ` David Hildenbrand
  0 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08 12:24 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs,
	Christian Borntraeger, Jason Gunthorpe, John Hubbard, akpm,
	Heiko Carstens

On 08.03.22 13:11, David Hildenbrand wrote:
> On 08.03.22 09:37, David Hildenbrand wrote:
>> On 08.03.22 09:21, David Hildenbrand wrote:
>>> On 08.03.22 00:18, Linus Torvalds wrote:
>>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>>>>
>>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>>> not making any progress.
>>>>
>>>> Since this is s390-specific, I get the very strong feeling that the
>>>>
>>>>   fault_in_iov_iter_writeable ->
>>>>     fault_in_safe_writeable ->
>>>>       __get_user_pages_locked ->
>>>>         __get_user_pages
>>>>
>>>> path somehow successfully finds the page, despite it not being
>>>> properly accessible in the page tables.
>>>
>>> As raised offline already, I suspect
>>>
>>> shrink_active_list()
>>> ->page_referenced()
>>>  ->page_referenced_one()
>>>   ->ptep_clear_flush_young_notify()
>>>    ->ptep_clear_flush_young()
>>>
>>> which results on s390x in:
>>>
>>> static inline pte_t pte_mkold(pte_t pte)
>>> {
>>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>>> 	pte_val(pte) |= _PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> 					    unsigned long addr, pte_t *ptep)
>>> {
>>> 	pte_t pte = *ptep;
>>>
>>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>>> 	return pte_young(pte);
>>> }
>>>
>>>
>>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
>>> pure SW bit. AFAIU, pte_present() still holds:
>>>
>>> static inline int pte_present(pte_t pte)
>>> {
>>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>>> }
>>>
>>>
>>> pte_mkyoung() will revert that action:
>>>
>>> static inline pte_t pte_mkyoung(pte_t pte)
>>> {
>>> 	pte_val(pte) |= _PAGE_YOUNG;
>>> 	if (pte_val(pte) & _PAGE_READ)
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>>
>>> and pte_modify() will adjust it properly again:
>>>
>>> /*
>>>  * The following pte modification functions only work if
>>>  * pte_present() is true. Undefined behaviour if not..
>>>  */
>>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>>> {
>>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>>> 	pte_val(pte) |= pgprot_val(newprot);
>>> 	/*
>>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>>> 	 * has the invalid bit set, clear it again for readable, young pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	/*
>>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>>> 	 * protection bit set, clear it again for writable, dirty pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>>> 	return pte;
>>> }
>>>
>>>
>>>
>>> Which leaves me wondering if there is a way in GUP whereby
>>> we would lookup that page and not clear _PAGE_INVALID,
>>> resulting in GUP succeeding but faults via the MMU still
>>> faulting on _PAGE_INVALID.
>>
>>
>> follow_page_pte() has this piece of code:
>>
>> 	if (flags & FOLL_TOUCH) {
>> 		if ((flags & FOLL_WRITE) &&
>> 		    !pte_dirty(pte) && !PageDirty(page))
>> 			set_page_dirty(page);
>> 		/*
>> 		 * pte_mkyoung() would be more correct here, but atomic care
>> 		 * is needed to avoid losing the dirty bit: it is easier to use
>> 		 * mark_page_accessed().
>> 		 */
>> 		mark_page_accessed(page);
>> 	}
>>
>> Which at least to me suggests that, although the page is marked accessed and GUP
>> succeeds, that the PTE might still have _PAGE_INVALID set after we succeeded GUP.
>>
>>
>> On s390x, there is no HW dirty bit, so we might just be able to do a proper
>> pte_mkyoung() here instead of the mark_page_accessed().
>>
> 
> Something hacky like this should be able to show if what I suspect is the case.
> It compiles, but I didn't actually test it.
That would be the alternative that also takes the mkdirty into account:


From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 8 Mar 2022 12:51:26 +0100
Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled

On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
going via the page and leaving the PTE unmodified. E.g., if we only
mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
we'll fail to clear the HW invalid bit in the pte and subsequent accesses
via the MMU would still require a pagefault.

Otherwise, buffered I/O will loop forever because it will keep stumbling
over the set HW invalid bit, requiring a page fault.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..de3311feb377 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		}
 	}
 	if (flags & FOLL_TOUCH) {
-		if ((flags & FOLL_WRITE) &&
-		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
 		/*
-		 * pte_mkyoung() would be more correct here, but atomic care
-		 * is needed to avoid losing the dirty bit: it is easier to use
-		 * mark_page_accessed().
+		 * We have to be careful with updating the PTE on architectures
+		 * that have a HW dirty bit: while updating the PTE we might
+		 * lose that bit again and we'd need an atomic update: it is
+		 * easier to leave the PTE untouched for these architectures.
+		 *
+		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
+		 * the hw invalid bit in pte_mkold(), to catch further
+		 * references. We have to update the PTE here to e.g., clear the
+		 * invalid bit; otherwise, callers that rely on not requiring
+		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
+		 * because the page won't actually be accessible via the MMU.
 		 */
-		mark_page_accessed(page);
+		if (IS_ENABLED(CONFIG_S390)) {
+			pte = pte_mkyoung(pte);
+			if (flags & FOLL_WRITE)
+				pte = pte_mkdirty(pte);
+			if (!pte_same(pte, *ptep)) {
+				set_pte_at(vma->vm_mm, address, ptep, pte);
+				update_mmu_cache(vma, address, ptep);
+			}
+		} else {
+			if ((flags & FOLL_WRITE) &&
+			    !pte_dirty(pte) && !PageDirty(page))
+				set_page_dirty(page);
+			mark_page_accessed(page);
+		}
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/* Do not mlock pte-mapped THP */
-- 
2.35.1


-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 12:24         ` David Hildenbrand
@ 2022-03-08 13:20           ` Gerald Schaefer
  2022-03-08 13:32             ` David Hildenbrand
  2022-03-08 17:23           ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Gerald Schaefer @ 2022-03-08 13:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Andreas Gruenbacher, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs, Christian Borntraeger,
	Jason Gunthorpe, John Hubbard, akpm, Heiko Carstens,
	Alexander Gordeev, Gerald Schaefer

On Tue, 8 Mar 2022 13:24:19 +0100
David Hildenbrand <david@redhat.com> wrote:

[...]
> 
> From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 8 Mar 2022 12:51:26 +0100
> Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled
> 
> On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
> going via the page and leaving the PTE unmodified. E.g., if we only
> mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
> we'll fail to clear the HW invalid bit in the pte and subsequent accesses
> via the MMU would still require a pagefault.
> 
> Otherwise, buffered I/O will loop forever because it will keep stumbling
> over the set HW invalid bit, requiring a page fault.
> 
> Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a9d4d724aef7..de3311feb377 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		}
>  	}
>  	if (flags & FOLL_TOUCH) {
> -		if ((flags & FOLL_WRITE) &&
> -		    !pte_dirty(pte) && !PageDirty(page))
> -			set_page_dirty(page);
>  		/*
> -		 * pte_mkyoung() would be more correct here, but atomic care
> -		 * is needed to avoid losing the dirty bit: it is easier to use
> -		 * mark_page_accessed().
> +		 * We have to be careful with updating the PTE on architectures
> +		 * that have a HW dirty bit: while updating the PTE we might
> +		 * lose that bit again and we'd need an atomic update: it is
> +		 * easier to leave the PTE untouched for these architectures.
> +		 *
> +		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
> +		 * the hw invalid bit in pte_mkold(), to catch further
> +		 * references. We have to update the PTE here to e.g., clear the
> +		 * invalid bit; otherwise, callers that rely on not requiring
> +		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
> +		 * because the page won't actually be accessible via the MMU.
>  		 */
> -		mark_page_accessed(page);
> +		if (IS_ENABLED(CONFIG_S390)) {
> +			pte = pte_mkyoung(pte);
> +			if (flags & FOLL_WRITE)
> +				pte = pte_mkdirty(pte);
> +			if (!pte_same(pte, *ptep)) {
> +				set_pte_at(vma->vm_mm, address, ptep, pte);
> +				update_mmu_cache(vma, address, ptep);
> +			}
> +		} else {
> +			if ((flags & FOLL_WRITE) &&
> +			    !pte_dirty(pte) && !PageDirty(page))
> +				set_page_dirty(page);
> +			mark_page_accessed(page);
> +		}
>  	}
>  	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>  		/* Do not mlock pte-mapped THP */

Thanks David, your analysis looks valid; at least it seems that you found
a scenario where we would have the HW invalid bit set due to pte_mkold() in
ptep_clear_flush_young(), and GUP would still find and return that page, IIUC.

I think pte handling should be similar to pmd handling in follow_trans_huge_pmd()
-> touch_pmd(), or cow_user_page() (see comment on software "accessed" bits),
which is more or less what your patch does.

Some possible concerns:
- set_page_dirty() would not be done any more for s390, is that intended and ok?
- using set_pte_at() here seems a bit dangerous, as I'm not sure if this will
  always only operate on invalid PTEs. Using it on active valid PTEs could
  result in TLB issues because of missing flush. Also not sure about kvm impact.
  Using ptep_set_access_flags() seems safer, again similar to touch_pmd() and
  also cow_user_page().

Looking at cow_user_page(), I also wonder if the arch_faults_on_old_pte()
logic could be used here. I must admit that I did not really understand the
"losing the dirty bit" part of the comment, but it seems that we might need
to not only check for arch_faults_on_old_pte(), but also for something like
"arch_faults_for_dirty_pte".

Last but not least, IIUC, this issue should affect all archs that return
true on arch_faults_on_old_pte(). After all, the basic problem seems to be
that a pagefault is required for PTEs marked as old, in combination with
GUP still returning a valid page. So maybe this should not be restricted
to IS_ENABLED(CONFIG_S390).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 13:20           ` Gerald Schaefer
@ 2022-03-08 13:32             ` David Hildenbrand
  2022-03-08 14:14               ` Gerald Schaefer
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08 13:32 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Linus Torvalds, Andreas Gruenbacher, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs, Christian Borntraeger,
	Jason Gunthorpe, John Hubbard, akpm, Heiko Carstens,
	Alexander Gordeev

On 08.03.22 14:20, Gerald Schaefer wrote:
> On Tue, 8 Mar 2022 13:24:19 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
> [...]
>>
>> From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Tue, 8 Mar 2022 12:51:26 +0100
>> Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled
>>
>> On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
>> going via the page and leaving the PTE unmodified. E.g., if we only
>> mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
>> we'll fail to clear the HW invalid bit in the pte and subsequent accesses
>> via the MMU would still require a pagefault.
>>
>> Otherwise, buffered I/O will loop forever because it will keep stumbling
>> over the set HW invalid bit, requiring a page fault.
>>
>> Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/gup.c | 32 +++++++++++++++++++++++++-------
>>  1 file changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a9d4d724aef7..de3311feb377 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>  		}
>>  	}
>>  	if (flags & FOLL_TOUCH) {
>> -		if ((flags & FOLL_WRITE) &&
>> -		    !pte_dirty(pte) && !PageDirty(page))
>> -			set_page_dirty(page);
>>  		/*
>> -		 * pte_mkyoung() would be more correct here, but atomic care
>> -		 * is needed to avoid losing the dirty bit: it is easier to use
>> -		 * mark_page_accessed().
>> +		 * We have to be careful with updating the PTE on architectures
>> +		 * that have a HW dirty bit: while updating the PTE we might
>> +		 * lose that bit again and we'd need an atomic update: it is
>> +		 * easier to leave the PTE untouched for these architectures.
>> +		 *
>> +		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
>> +		 * the hw invalid bit in pte_mkold(), to catch further
>> +		 * references. We have to update the PTE here to e.g., clear the
>> +		 * invalid bit; otherwise, callers that rely on not requiring
>> +		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
>> +		 * because the page won't actually be accessible via the MMU.
>>  		 */
>> -		mark_page_accessed(page);
>> +		if (IS_ENABLED(CONFIG_S390)) {
>> +			pte = pte_mkyoung(pte);
>> +			if (flags & FOLL_WRITE)
>> +				pte = pte_mkdirty(pte);
>> +			if (!pte_same(pte, *ptep)) {
>> +				set_pte_at(vma->vm_mm, address, ptep, pte);
>> +				update_mmu_cache(vma, address, ptep);
>> +			}
>> +		} else {
>> +			if ((flags & FOLL_WRITE) &&
>> +			    !pte_dirty(pte) && !PageDirty(page))
>> +				set_page_dirty(page);
>> +			mark_page_accessed(page);
>> +		}
>>  	}
>>  	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>>  		/* Do not mlock pte-mapped THP */
> 
> Thanks David, your analysis looks valid; at least it seems that you found
> a scenario where we would have the HW invalid bit set due to pte_mkold() in
> ptep_clear_flush_young(), and GUP would still find and return that page, IIUC.
> 
> I think pte handling should be similar to pmd handling in follow_trans_huge_pmd()
> -> touch_pmd(), or cow_user_page() (see comment on software "accessed" bits),
> which is more or less what your patch does.
> 
> Some possible concerns:
> - set_page_dirty() would not be done any more for s390, is that intended and ok?

I strongly assume so, because the page is mapped via a PTE, which is
writable and dirty. This is similar to THP logic.

> - using set_pte_at() here seems a bit dangerous, as I'm not sure if this will
>   always only operate on invalid PTEs. Using it on active valid PTEs could
>   result in TLB issues because of missing flush. Also not sure about kvm impact.
>   Using ptep_set_access_flags() seems safer, again similar to touch_pmd() and
>   also cow_user_page().

Yeah, I stuck to what follow_pfn_pte() does for simplicity for now.
But I agree that following what touch_pmd() does looks saner --
ptep_set_access_flags().

> 
> Looking at cow_user_page(), I also wonder if the arch_faults_on_old_pte()
> logic could be used here. I must admit that I did not really understand the
> "losing the dirty bit" part of the comment, but it seems that we might need
> to not only check for arch_faults_on_old_pte(), but also for something like
> "arch_faults_for_dirty_pte".
> 
> Last but not least, IIUC, this issue should affect all archs that return
> true on arch_faults_on_old_pte(). After all, the basic problem seems to be
> that a pagefault is required for PTEs marked as old, in combination with
> GUP still returning a valid page. So maybe this should not be restricted
> to IS_ENABLED(CONFIG_S390).

Yeah, as raised, the IS_ENABLED(CONFIG_S390) part is just a quick hack
to see if this would fix the issue.

arch_faults_on_dirty_pte / arch_faults_on_old_pte might be a
replacement. We would just have to be careful with architectures that
e.g., have arch_faults_on_old_pte=true and
arch_faults_on_dirty_pte=false (i.e., hw dirty bit but no hw accessed
bit). Would have to think about how to handle that properly ...

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 13:32             ` David Hildenbrand
@ 2022-03-08 14:14               ` Gerald Schaefer
  0 siblings, 0 replies; 36+ messages in thread
From: Gerald Schaefer @ 2022-03-08 14:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Andreas Gruenbacher, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs, Christian Borntraeger,
	Jason Gunthorpe, John Hubbard, akpm, Heiko Carstens,
	Alexander Gordeev

On Tue, 8 Mar 2022 14:32:25 +0100
David Hildenbrand <david@redhat.com> wrote:

[...]
> 
> > - using set_pte_at() here seems a bit dangerous, as I'm not sure if this will
> >   always only operate on invalid PTEs. Using it on active valid PTEs could
> >   result in TLB issues because of missing flush. Also not sure about kvm impact.
> >   Using ptep_set_access_flags() seems safer, again similar to touch_pmd() and
> >   also cow_user_page().
> 
> Yeah, I stuck to what follow_pfn_pte() does for simplicity for now.

Uh oh, that set_pte_at() in follow_pfn_pte() also looks dangerous, at least
I do not spontaneously see that it would only be used for invalid / pte_none()
PTEs. But that is a totally different story, and maybe (hopefully) not
affecting s390 until we have proper DAX support...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 12:24         ` David Hildenbrand
  2022-03-08 13:20           ` Gerald Schaefer
@ 2022-03-08 17:23           ` David Hildenbrand
  1 sibling, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08 17:23 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs,
	Christian Borntraeger, Jason Gunthorpe, John Hubbard, akpm,
	Heiko Carstens

On 08.03.22 13:24, David Hildenbrand wrote:
> On 08.03.22 13:11, David Hildenbrand wrote:
>> On 08.03.22 09:37, David Hildenbrand wrote:
>>> On 08.03.22 09:21, David Hildenbrand wrote:
>>>> On 08.03.22 00:18, Linus Torvalds wrote:
>>>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>>>>>
>>>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>>>> not making any progress.
>>>>>
>>>>> Since this is s390-specific, I get the very strong feeling that the
>>>>>
>>>>>   fault_in_iov_iter_writeable ->
>>>>>     fault_in_safe_writeable ->
>>>>>       __get_user_pages_locked ->
>>>>>         __get_user_pages
>>>>>
>>>>> path somehow successfully finds the page, despite it not being
>>>>> properly accessible in the page tables.
>>>>
>>>> As raised offline already, I suspect
>>>>
>>>> shrink_active_list()
>>>> ->page_referenced()
>>>>  ->page_referenced_one()
>>>>   ->ptep_clear_flush_young_notify()
>>>>    ->ptep_clear_flush_young()
>>>>
>>>> which results on s390x in:
>>>>
>>>> static inline pte_t pte_mkold(pte_t pte)
>>>> {
>>>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>>>> 	pte_val(pte) |= _PAGE_INVALID;
>>>> 	return pte;
>>>> }
>>>>
>>>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> 					    unsigned long addr, pte_t *ptep)
>>>> {
>>>> 	pte_t pte = *ptep;
>>>>
>>>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>>>> 	return pte_young(pte);
>>>> }
>>>>
>>>>
>>>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
>>>> pure SW bit. AFAIU, pte_present() still holds:
>>>>
>>>> static inline int pte_present(pte_t pte)
>>>> {
>>>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>>>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>>>> }
>>>>
>>>>
>>>> pte_mkyoung() will revert that action:
>>>>
>>>> static inline pte_t pte_mkyoung(pte_t pte)
>>>> {
>>>> 	pte_val(pte) |= _PAGE_YOUNG;
>>>> 	if (pte_val(pte) & _PAGE_READ)
>>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>>> 	return pte;
>>>> }
>>>>
>>>>
>>>> and pte_modify() will adjust it properly again:
>>>>
>>>> /*
>>>>  * The following pte modification functions only work if
>>>>  * pte_present() is true. Undefined behaviour if not..
>>>>  */
>>>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>>>> {
>>>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>>>> 	pte_val(pte) |= pgprot_val(newprot);
>>>> 	/*
>>>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>>>> 	 * has the invalid bit set, clear it again for readable, young pages
>>>> 	 */
>>>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>>> 	/*
>>>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>>>> 	 * protection bit set, clear it again for writable, dirty pages
>>>> 	 */
>>>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>>>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>>>> 	return pte;
>>>> }
>>>>
>>>>
>>>>
>>>> Which leaves me wondering if there is a way in GUP whereby
>>>> we would lookup that page and not clear _PAGE_INVALID,
>>>> resulting in GUP succeeding but faults via the MMU still
>>>> faulting on _PAGE_INVALID.
>>>
>>>
>>> follow_page_pte() has this piece of code:
>>>
>>> 	if (flags & FOLL_TOUCH) {
>>> 		if ((flags & FOLL_WRITE) &&
>>> 		    !pte_dirty(pte) && !PageDirty(page))
>>> 			set_page_dirty(page);
>>> 		/*
>>> 		 * pte_mkyoung() would be more correct here, but atomic care
>>> 		 * is needed to avoid losing the dirty bit: it is easier to use
>>> 		 * mark_page_accessed().
>>> 		 */
>>> 		mark_page_accessed(page);
>>> 	}
>>>
>>> Which at least to me suggests that, although the page is marked accessed and GUP
>>> succeeds, that the PTE might still have _PAGE_INVALID set after we succeeded GUP.
>>>
>>>
>>> On s390x, there is no HW dirty bit, so we might just be able to do a proper
>>> pte_mkyoung() here instead of the mark_page_accessed().
>>>
>>
>> Something hacky like this should be able to show if what I suspect is the case.
>> It compiles, but I didn't actually test it.
> That would be the alternative that also takes the mkdirty into account:
> 
> 
> From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 8 Mar 2022 12:51:26 +0100
> Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled
> 
> On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
> going via the page and leaving the PTE unmodified. E.g., if we only
> mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
> we'll fail to clear the HW invalid bit in the pte and subsequent accesses
> via the MMU would still require a pagefault.
> 
> Otherwise, buffered I/O will loop forever because it will keep stumbling
> over the set HW invalid bit, requiring a page fault.
> 
> Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a9d4d724aef7..de3311feb377 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		}
>  	}
>  	if (flags & FOLL_TOUCH) {
> -		if ((flags & FOLL_WRITE) &&
> -		    !pte_dirty(pte) && !PageDirty(page))
> -			set_page_dirty(page);
>  		/*
> -		 * pte_mkyoung() would be more correct here, but atomic care
> -		 * is needed to avoid losing the dirty bit: it is easier to use
> -		 * mark_page_accessed().
> +		 * We have to be careful with updating the PTE on architectures
> +		 * that have a HW dirty bit: while updating the PTE we might
> +		 * lose that bit again and we'd need an atomic update: it is
> +		 * easier to leave the PTE untouched for these architectures.
> +		 *
> +		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
> +		 * the hw invalid bit in pte_mkold(), to catch further
> +		 * references. We have to update the PTE here to e.g., clear the
> +		 * invalid bit; otherwise, callers that rely on not requiring
> +		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
> +		 * because the page won't actually be accessible via the MMU.
>  		 */
> -		mark_page_accessed(page);
> +		if (IS_ENABLED(CONFIG_S390)) {
> +			pte = pte_mkyoung(pte);
> +			if (flags & FOLL_WRITE)
> +				pte = pte_mkdirty(pte);
> +			if (!pte_same(pte, *ptep)) {
> +				set_pte_at(vma->vm_mm, address, ptep, pte);
> +				update_mmu_cache(vma, address, ptep);

I just reproduced without this fix and then tried to reproduce with this
fix (however, replacing the pte_same() + set_pte_at() combination with
ptep_set_access_flags()).

Seems to do the trick for me. I'll figure out the best way to handle
IS_ENABLED(CONFIG_S390) before posting an official fix.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08  8:21   ` David Hildenbrand
  2022-03-08  8:37     ` David Hildenbrand
@ 2022-03-08 17:26     ` Linus Torvalds
  2022-03-08 17:40       ` Linus Torvalds
  2022-03-08 17:47       ` David Hildenbrand
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2022-03-08 17:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Tue, Mar 8, 2022 at 12:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> As raised offline already, I suspect
>
> shrink_active_list()
> ->page_referenced()
>  ->page_referenced_one()
>   ->ptep_clear_flush_young_notify()
>    ->ptep_clear_flush_young()
>
> which results on s390x in:
>
> static inline pte_t pte_mkold(pte_t pte)
> {
>         pte_val(pte) &= ~_PAGE_YOUNG;
>         pte_val(pte) |= _PAGE_INVALID;
>         return pte;
> }

Yeah, that looks likely.

It looks to me like GUP just doesn't care about _PAGE_INVALID on s390,
and happily looks up that page despite it not being "present" as far
as hardware is concerned.

Your actual patch looks pretty nasty, though. We avoid marking it
accessed on purpose (to avoid atomicity issues wrt hw-dirty bits etc),
but still, that patch makes me go "there has to be a better way".

                              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 17:26     ` Linus Torvalds
@ 2022-03-08 17:40       ` Linus Torvalds
  2022-03-08 19:27         ` Linus Torvalds
  2022-03-08 17:47       ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-08 17:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Tue, Mar 8, 2022 at 9:26 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Your actual patch looks pretty nasty, though. We avoid marking it
> accessed on purpose (to avoid atomicity issues wrt hw-dirty bits etc),
> but still, that patch makes me go "there has to be a better way".

That better way might be to make fault_in_safe_writeable() not use GUP at all.

It doesn't want the page - it just wants to make sure the fault is
handled. For the fault_in_readable() case we already just do the user
access.

The only reason we don't do that for fault_in_safe_writeable() is that
we don't have a good non-destructive user write.

Hmm. The futex code actually uses "fixup_user_fault()" for this case.
Maybe fault_in_safe_writeable() should do that?

                            Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 17:26     ` Linus Torvalds
  2022-03-08 17:40       ` Linus Torvalds
@ 2022-03-08 17:47       ` David Hildenbrand
  1 sibling, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-08 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On 08.03.22 18:26, Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 12:21 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> As raised offline already, I suspect
>>
>> shrink_active_list()
>> ->page_referenced()
>>  ->page_referenced_one()
>>   ->ptep_clear_flush_young_notify()
>>    ->ptep_clear_flush_young()
>>
>> which results on s390x in:
>>
>> static inline pte_t pte_mkold(pte_t pte)
>> {
>>         pte_val(pte) &= ~_PAGE_YOUNG;
>>         pte_val(pte) |= _PAGE_INVALID;
>>         return pte;
>> }
> 
> Yeah, that looks likely.
> 
> It looks to me like GUP just doesn't care about _PAGE_INVALID on s390,
> and happily looks up that page despite it not being "present" as far
> as hardware is concerned.
> 
> Your actual patch looks pretty nasty, though. We avoid marking it
> accessed on purpose (to avoid atomicity issues wrt hw-dirty bits etc),
> but still, that patch makes me go "there has to be a better way".

It certainly only works if we don't have hw dirty bits that might get
set concurrently -- for example, on s390x there is no such requirement.

As raised by Gerald, arch_faults_for_dirty_pte (and existing
arch_faults_on_old_pte) might be one option to get rid of the s390x
special-casing, and detect any arch that might update the dirty bit
concurrently.

Interestingly, mm/huge_memory.c:touch_pmd() doesn't seem to care about
concurrent dirty-bit updates by the hardware. Hmm.

But, of course, I'm open for alternatives, maybe we could adjust
fault_in_safe_writeable() to not use GUP as raised by you in the other
reply.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 17:40       ` Linus Torvalds
@ 2022-03-08 19:27         ` Linus Torvalds
  2022-03-08 20:03           ` Linus Torvalds
  2022-03-10 17:13           ` David Hildenbrand
  0 siblings, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2022-03-08 19:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1079 bytes --]

On Tue, Mar 8, 2022 at 9:40 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. The futex code actually uses "fixup_user_fault()" for this case.
> Maybe fault_in_safe_writeable() should do that?

.. paging all the bits back in, I'm reminded that one of the worries
was "what about large areas".

But I really think that the solution should be that we limit the size
of fault_in_safe_writeable() to just a couple of pages.

Even if you were to fault in gigabytes, page-out can undo it anyway,
so there is no semantic reason why that function should ever do more
than a few pages to make sure. There's already even a comment about
how there's no guarantee that the pages will stay.

Side note: the current GUP-based fault_in_safe_writeable() is buggy in
another way anyway: it doesn't work right for stack extending
accesses.

So I think the fix for this all might be something like the attached
(TOTALLY UNTESTED)!

Comments? Andreas, mind (carefully - maybe it is totally broken and
does unspeakable acts to your pets) testing this?

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1972 bytes --]

 mm/gup.c | 40 ++++++++++++----------------------------
 1 file changed, 12 insertions(+), 28 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..9e085e7b9c28 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1745,44 +1745,28 @@ EXPORT_SYMBOL(fault_in_writeable);
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
 	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	unsigned long end, nstart;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
+	const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE;
+	const size_t max_size = 4 * PAGE_SIZE;
 
 	nstart = start & PAGE_MASK;
-	end = PAGE_ALIGN(start + size);
+	end = PAGE_ALIGN(start + min(size, max_size));
 	if (end < nstart)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
+	mmap_read_lock(mm);
+	for (; nstart != end; nstart += PAGE_SIZE) {
+		if (fixup_user_fault(mm, nstart, fault_flags, NULL))
 			break;
-		nend = nstart + ret * PAGE_SIZE;
 	}
-	if (locked)
-		mmap_read_unlock(mm);
+	mmap_read_unlock(mm);
+
+	/* If we got all of our (truncated) fault-in, we claim we got it all */
 	if (nstart == end)
 		return 0;
+
+	/* .. otherwise we'll use the original untruncated size */
 	return size - min_t(size_t, nstart - start, size);
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 19:27         ` Linus Torvalds
@ 2022-03-08 20:03           ` Linus Torvalds
  2022-03-08 23:24             ` Andreas Gruenbacher
  2022-03-10 17:13           ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-08 20:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

On Tue, Mar 8, 2022 at 11:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I think the fix for this all might be something like the attached
> (TOTALLY UNTESTED)!

Still entirely untested, but I wrote a commit message for it in the
hopes that this actually works and Andreas can verify that it fixes
the issue.

Same exact patch, it's just now in my local experimental tree as a commit.

                  Linus

[-- Attachment #2: 0001-mm-gup-make-fault_in_safe_writeable-use-fixup_user_f.patch --]
[-- Type: text/x-patch, Size: 3143 bytes --]

From d8c2e0a81274d67edfff3769c4c37e364ba8d6f8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 8 Mar 2022 11:55:48 -0800
Subject: [PATCH] mm: gup: make fault_in_safe_writeable() use
 fixup_user_fault()

Instead of using GUP, make fault_in_safe_writeable() actually force a
'handle_mm_fault()' using the same fixup_user_fault() machinery that
futexes already use.

Using the GUP machinery meant that fault_in_safe_writeable() did not do
everything that a real fault would do, ranging from not auto-expanding
the stack segment, to not updating accessed or dirty flags in the page
tables (GUP sets those flags on the pages themselves).

The latter causes problems on architectures (like s390) that do accessed
bit handling in software, which meant that fault_in_safe_writeable()
didn't actually do all the fault handling it needed to.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/gup.c | 40 ++++++++++++----------------------------
 1 file changed, 12 insertions(+), 28 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..9e085e7b9c28 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1745,44 +1745,28 @@ EXPORT_SYMBOL(fault_in_writeable);
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
 	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	unsigned long end, nstart;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
+	const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE;
+	const size_t max_size = 4 * PAGE_SIZE;
 
 	nstart = start & PAGE_MASK;
-	end = PAGE_ALIGN(start + size);
+	end = PAGE_ALIGN(start + min(size, max_size));
 	if (end < nstart)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
+	mmap_read_lock(mm);
+	for (; nstart != end; nstart += PAGE_SIZE) {
+		if (fixup_user_fault(mm, nstart, fault_flags, NULL))
 			break;
-		nend = nstart + ret * PAGE_SIZE;
 	}
-	if (locked)
-		mmap_read_unlock(mm);
+	mmap_read_unlock(mm);
+
+	/* If we got all of our (truncated) fault-in, we claim we got it all */
 	if (nstart == end)
 		return 0;
+
+	/* .. otherwise we'll use the original untruncated size */
 	return size - min_t(size_t, nstart - start, size);
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);
-- 
2.35.1.356.ge6630f57cf.dirty


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 20:03           ` Linus Torvalds
@ 2022-03-08 23:24             ` Andreas Gruenbacher
  2022-03-09  0:22               ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-08 23:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Tue, Mar 8, 2022 at 9:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Mar 8, 2022 at 11:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > So I think the fix for this all might be something like the attached
> > (TOTALLY UNTESTED)!
>
> Still entirely untested, but I wrote a commit message for it in the
> hopes that this actually works and Andreas can verify that it fixes
> the issue.
>
> Same exact patch, it's just now in my local experimental tree as a commit.

Seems to be working on s390x for this test case at least; the kind of
trace I'm getting is:

swpd free buff cache read_fault read_fault_race read_lock_stolen min_flt maj_flt
0 2249156 6352 783924 0 0 0 1142 157566
0 2076308 5896 961680 0 0 0 1 52702
0 1938824 5896 1186184 0 0 0 506 56708
0 1719640 5896 1404268 0 0 0 0 54604
0 1454772 5896 1590780 0 0 0 741 45044
0 1213924 5896 1828048 0 0 0 0 60716
0 995300 5896 2046172 0 0 0 0 54408
0 843140 5896 2264756 0 0 0 622 54412
0 630328 5896 2478980 0 0 0 0 53578
0 446096 5896 2675508 0 0 0 569 48286
0 315156 5896 2806336 0 0 0 0 33396
0 169068 5896 2954052 0 0 0 0 36151
0 33980 5064 3095736 0 0 0 681 49776
0 136596 1068 3012612 0 0 0 0 48800
0 51128 1068 2887180 0 0 0 48605 49288
0 42276 1068 2895960 0 0 0 1 49776
0 45496 944 2890752 0 0 0 35 49288
0 38592 944 2896364 3 0 0 124 44817
0 37100 944 2976368 46 0 0 25412 27301
0 37204 928 2981564 0 0 0 0 62316
0 52384 928 2965820 0 0 0 0 30020
0 32296 928 2830228 0 0 0 61824 49281
0 36208 928 2831920 0 0 0 0 49599
0 37424 928 2830176 0 0 0 0 48800
0 35676 928 2830256 0 0 0 104 46282
0 140932 928 2987628 0 0 0 628 47871
0 32128 928 3095628 0 0 0 0 49086
0 40900 920 2752240 0 0 0 79824 131220
0 67380 920 2723188 0 0 0 39 61666
0 42560 884 2747124 0 0 0 4 27266
0 43296 880 3036340 16 0 0 6557 79270
0 160616 832 3043056 0 0 0 379 44515
0 36892 688 2864892 0 0 0 49524 59299
0 38004 688 2866088 0 0 0 0 57308
0 39732 568 2860468 0 0 0 49 51105
0 37364 568 2861224 0 0 0 0 0
0 35432 568 2863220 0 0 0 0 0
0 38996 568 2859308 0 0 0 0 0
0 40808 568 2857044 0 0 0 0 0
0 35360 568 2865076 1 0 0 4 0
0 31592 564 2861216 123 0 0 50577 29984
0 32816 564 2857012 0 0 0 0 130047
0 39588 564 2785968 0 0 0 66022 50499
0 37752 564 2788028 0 0 0 0 53397
0 35576 460 2789696 0 0 0 0 52803
0 35944 460 2790020 0 0 0 13 37240
0 38280 460 2787336 0 0 0 0 0
0 38588 460 2786444 0 0 0 0 0
0 36940 460 2785644 0 0 0 0 0
0 39560 460 2784648 0 0 0 0 0
0 115032 452 2777568 4 0 0 16 0
0 44772 452 2808116 0 0 0 71280 139232
0 35828 452 2812368 0 0 0 0 57886
0 33420 452 2815904 0 0 0 0 56762
0 222932 452 2901480 0 0 0 1761 51125
0 34988 452 3085360 0 0 0 0 55638
0 32716 336 3049448 0 0 0 6398 54202
0 38572 336 3050040 0 0 0 0 54172
0 250724 336 3052404 0 0 0 422 37779
0 39204 264 2754388 0 0 0 78287 55056
0 39564 264 2752932 0 0 0 0 52712
0 38108 232 2752660 0 0 0 38 53350
0 32448 196 2753920 0 0 0 2 29024
0 36536 196 2754172 0 0 0 0 0
0 36116 196 2749096 0 0 0 0 0
0 35968 196 2754416 0 0 0 0 0
0 33744 196 2756908 24 0 0 96 0
0 33428 196 2870904 16 0 0 48271 123984
0 124932 196 2916896 0 0 0 14925 47427
0 44764 196 2993408 0 0 0 0 57057
0 46396 196 2990544 0 0 0 0 49864
0 39224 196 2996332 0 0 0 0 3124
0 34524 196 2844232 0 0 0 49321 6308
0 34088 92 2794136 0 0 0 14846 62955
0 38680 92 2792752 0 0 0 0 48056
0 36756 92 2793916 0 0 0 36 48994
0 37336 92 2791996 0 0 0 4 18127
0 38844 92 2760304 26 0 0 61323 48345
0 42584 92 2755980 0 0 0 0 53772
0 39436 92 2758252 0 0 0 804 51220
0 33408 88 2764016 0 0 0 339 32307
0 38228 88 2759052 0 0 0 0 0
0 36816 88 2760776 0 0 0 0 0
0 38624 88 2759628 0 0 0 0 0
0 38912 88 2752788 234 0 0 936 0
0 35640 88 3070384 394 0 0 2193 108219
0 38816 88 2716016 0 0 0 83715 55072
0 37084 88 2716076 0 0 0 0 50703
0 49256 88 2703460 0 0 0 0 51268
0 38256 72 2714892 0 0 0 0 38361
0 43468 72 2709640 0 0 0 0 0
0 38448 72 2715140 0 0 0 0 0
0 40536 72 2713412 0 0 0 0 0
0 36060 72 2711732 4 0 0 16 0
0 35804 72 2697600 65 0 0 88429 76050
0 42100 72 2687776 0 0 0 0 78018
0 44532 72 2682776 0 0 0 37 47051
0 57652 72 2911680 0 0 0 27636 60021
0 38196 72 2930916 0 0 0 0 39256
0 304060 72 2926132 0 0 0 6 41287
0 42524 72 3067232 0 0 0 446 58607
0 36244 72 3071536 0 0 0 0 45878
0 41052 72 2815772 0 0 0 59290 53910
0 38264 72 2816620 0 0 0 0 20666
0 39772 72 2815696 0 0 0 0 23975
0 37016 72 2817520 0 0 0 0 46818
0 39172 72 2889840 0 0 0 41578 47448
0 36440 72 2890160 0 0 0 1 48756
0 35028 72 2888492 0 0 0 0 49428
0 54364 72 2975816 0 0 0 14981 48084
0 36580 72 2992292 0 0 0 0 37284
0 35852 72 2992000 0 0 0 0 48168
0 36280 72 2833252 9 0 0 53762 48677
0 38916 72 2827892 0 0 0 0 48467
0 36996 72 2829432 0 0 0 0 48347
0 42524 68 2822216 0 0 0 0 32250
0 37060 68 2827612 0 0 0 0 0
0 38080 68 2827556 0 0 0 0 0
0 41660 68 2822936 0 0 0 0 0
0 32384 68 2824372 0 0 0 0 0
0 37468 68 2754364 58 0 0 70218 117793
0 36568 68 2756192 0 0 0 56 53470
0 125052 68 2906412 0 0 0 8070 49300
0 38468 68 2990660 0 0 0 0 49776
0 41180 68 3012140 0 0 0 684 49776
0 38044 68 3012908 0 0 0 1 50752
0 38924 68 2774392 0 0 0 56420 44877
0 36824 68 2777720 0 0 0 0 800
0 39172 68 2775236 0 0 0 0 1568
0 39748 68 2774740 0 0 0 0 960
0 35768 68 2778264 0 0 0 0 1920
0 40568 68 2773900 0 0 0 0 1696
0 41616 68 2773096 0 0 0 0 3712
0 42688 68 2772320 0 0 0 0 5408
0 39736 68 2774648 0 0 0 0 5152

This shows bursts of successful fault-ins in gfs2_file_read_iter
(read_fault). The pauses in between might be caused by the storage;
I'm not sure.

I'd still let the caller of fault_in_safe_writeable() decide how much
memory to fault in; the tight cap in fault_in_safe_writeable() doesn't
seem useful.

Also, you want to put in an extra L here:
> Signed-off-by: linus Torvalds <torvalds@linux-foundation.org>

Thanks,
Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 23:24             ` Andreas Gruenbacher
@ 2022-03-09  0:22               ` Linus Torvalds
  2022-03-09 18:42                 ` Andreas Gruenbacher
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09  0:22 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Seems to be working on s390x for this test case at least; the kind of
> trace I'm getting is:

Good.

> This shows bursts of successful fault-ins in gfs2_file_read_iter
> (read_fault). The pauses in between might be caused by the storage;
> I'm not sure.

Don't know about the pauses, but the burst size might be impacted by that

+       const size_t max_size = 4 * PAGE_SIZE;

thing that limits how many calls to fixup_user_fault() we do per
fault_in_safe_writeable().

So it might be worth checking if that value seems to make any difference.

> I'd still let the caller of fault_in_safe_writeable() decide how much
> memory to fault in; the tight cap in fault_in_safe_writeable() doesn't
> seem useful.

Well, there are real latency concerns there - fixup_user_fault() is
not necessarily all that low-cost.

And it's actually going to be worse when we have the sub-page coloring
issues happening, and will need to probe at a 128-byte granularity
(not on s390, but on arm64).

At that point, we almost certainly will need to have a "probe user
space non-destructibly for writability" instruction (possibly
extending on our current arch_futex_atomic_op_inuser()
infrastructure).

So honestly, if you do IO on areas that will get page faults on them,
to some degree it's a "you are doing something stupid, you get what
you deserve". This code should _work_, it should not have to worry
about users having bad patterns (where "read data into huge cold
mappings under enough memory pressure that it causes access bit faults
in the middle of the read" very much counts as such a bad pattern).

> Also, you want to put in an extra L here:
> > Signed-off-by: linus Torvalds <torvalds@linux-foundation.org>

Heh. Fixed locally.

                 Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09  0:22               ` Linus Torvalds
@ 2022-03-09 18:42                 ` Andreas Gruenbacher
  2022-03-09 19:08                   ` Linus Torvalds
  2022-03-09 19:21                   ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-09 18:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Catalin Marinas, David Hildenbrand,
	Alexander Viro, linux-s390, Linux-MM, linux-fsdevel, linux-btrfs


On Wed, Mar 9, 2022 at 1:22 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Seems to be working on s390x for this test case at least; the kind of
> > trace I'm getting is:
>
> Good.
>
> > This shows bursts of successful fault-ins in gfs2_file_read_iter
> > (read_fault). The pauses in between might be caused by the storage;
> > I'm not sure.
>
> Don't know about the pauses, but the burst size might be impacted by that
>
> +       const size_t max_size = 4 * PAGE_SIZE;
>
> thing that limits how many calls to fixup_user_fault() we do per
> fault_in_safe_writeable().
>
> So it might be worth checking if that value seems to make any difference.
>
> > I'd still let the caller of fault_in_safe_writeable() decide how much
> > memory to fault in; the tight cap in fault_in_safe_writeable() doesn't
> > seem useful.
>
> Well, there are real latency concerns there - fixup_user_fault() is
> not necessarily all that low-cost.

I just don't know if making the GUP based approach work instead of
switching to fixup_user_fault(), or introducing something else, would
make enough of a performance difference to be worth it.

> And it's actually going to be worse when we have the sub-page coloring
> issues happening, and will need to probe at a 128-byte granularity
> (not on s390, but on arm64).
>
> At that point, we almost certainly will need to have a "probe user
> space non-destructibly for writability" instruction (possibly
> extending on our current arch_futex_atomic_op_inuser()
> infrastructure).

Let me add Catalin Marinas to the CC.

From what I took from the previous discussion, probing at a sub-page
granularity won't be necessary for bytewise copying: when the address
we're trying to access is poisoned, fault_in_*() will fail; when we get
a short result, that will take us to the poisoned address in the next
iteration.
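[Editorial note: the bytewise argument above can be modeled in plain userspace C. In this sketch, mock_copy() and mock_fault_in() are hypothetical stand-ins for copy_to_user() and fault_in_writeable() -- they are not kernel code. The copy makes byte-granular progress and stops at the first inaccessible byte, so the next probe lands exactly on the poisoned address and fails cleanly instead of looping:]

```c
#include <stddef.h>
#include <string.h>

enum { POISON = 40 };	/* model: destination bytes at offset >= 40 are inaccessible */

/* Stand-in for copy_to_user(): copies what it can, returns bytes NOT copied. */
static size_t mock_copy(char *dst, const char *src, size_t off, size_t len)
{
	size_t end = off + len < POISON ? off + len : POISON;
	size_t ok = end > off ? end - off : 0;

	memcpy(dst + off, src + off, ok);
	return len - ok;
}

/* Stand-in for fault_in_*(): fails exactly at the poisoned address. */
static int mock_fault_in(size_t off)
{
	return off < POISON ? 0 : -1;
}

/* The retry loop: a short copy leaves 'done' at the fault address, so the
 * next probe either succeeds (forward progress) or fails (clean stop). */
static size_t read_loop(char *dst, const char *src, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t left = mock_copy(dst, src, done, len - done);

		done += (len - done) - left;
		if (!left)
			break;
		if (mock_fault_in(done))
			break;	/* poisoned: fail with an exact byte count */
	}
	return done;
}
```

With POISON at byte 40, a 64-byte copy terminates after a single probe with exactly 40 bytes done, rather than spinning.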

The problematic case was copying objects that cross fault domains, when
we're getting an all-or-nothing result back from the copying and the
address we're trying to fault_in_*() isn't the address of the actual
fault.  The fix for those cases is to pass back the address of the
actual fault in one way or another.

> So honestly, if you do IO on areas that will get page faults on them,
> to some degree it's a "you are doing something stupid, you get what
> you deserve". This code should _work_, it should not have to worry
> about users having bad patterns (where "read data into huge cold
> mappings under enough memory pressure that it causes access bit faults
> in the middle of the read" very much counts as such a bad pattern).

With a large enough buffer, a simple malloc() will return unmapped
pages, and reading into such a buffer will result in fault-in.  So page
faults during read() are actually pretty normal, and it's not the user's
fault.
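[Editorial note: this is easy to verify from userspace -- a single big read() into a freshly malloc()ed buffer shows up as minor faults charged to the process, the same min_flt counter as in the traces above. A sketch; the 64 MiB size and the /dev/zero source are arbitrary illustration choices, not from the thread:]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Counts minor faults taken while one big read() fills a fresh buffer. */
static long faults_during_read(size_t sz)
{
	struct rusage ru;
	long before, after;
	char *buf = malloc(sz);		/* fresh, not yet faulted in */
	int fd = open("/dev/zero", O_RDONLY);

	if (!buf || fd < 0)
		return -1;

	getrusage(RUSAGE_SELF, &ru);
	before = ru.ru_minflt;
	if (read(fd, buf, sz) != (ssize_t)sz)	/* one big read, as in the test case */
		return -1;
	getrusage(RUSAGE_SELF, &ru);
	after = ru.ru_minflt;

	close(fd);
	free(buf);
	return after - before;
}
```

The kernel faults the destination pages in on the process's behalf during the copy-out, so the fault count jumps even though userspace never touched the buffer itself.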

In my test case, the buffer was pre-initialized with memset() to avoid
those kinds of page faults, which meant that the page faults in
gfs2_file_read_iter() only started to happen when we were out of memory.
But that's not the common case.

> > Also, you want to put in an extra L here:
> > > Signed-off-by: linus Torvalds <torvalds@linux-foundation.org>
>
> Heh. Fixed locally.

Attached is a revised patch; only lightly tested so far.  Changes:

* Fix the function description.

* No need for untagged_addr() as that is handled in fixup_user_fault().

* Get rid of max_size: it really makes no sense to second-guess what the
  caller needs.  In cases where fault_in_safe_writeable() is used for
  buffers larger than a handful of pages, it is entirely the caller's
  responsibility to scale back the fault-in size in anticipation of or
  in reaction to page-out.

* Use the same control flow as in fault_in_readable(); there is no
  need for anything more complicated anymore.

* The same patch description still applies.

Thanks,
Andreas

diff --git a/mm/gup.c b/mm/gup.c
index f1bf3a1f6d109..5e777049bdf41 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1841,9 +1841,9 @@ EXPORT_SYMBOL(fault_in_writeable);
  * @uaddr: start of address range
  * @size: length of address range
  *
- * Faults in an address range using get_user_pages, i.e., without triggering
- * hardware page faults.  This is primarily useful when we already know that
- * some or all of the pages in the address range aren't in memory.
+ * Faults in an address range for writing.  This is primarily useful when we
+ * already know that some or all of the pages in the address range aren't in
+ * memory.
  *
  * Other than fault_in_writeable(), this function is non-destructive.
  *
@@ -1856,46 +1856,33 @@ EXPORT_SYMBOL(fault_in_writeable);
  */
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE;
+	unsigned long start = (unsigned long)uaddr, end;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
 
-	nstart = start & PAGE_MASK;
+	if (unlikely(size == 0))
+		return 0;
 	end = PAGE_ALIGN(start + size);
-	if (end < nstart)
+	if (end < start)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
-			break;
-		nend = nstart + ret * PAGE_SIZE;
+	mmap_read_lock(mm);
+	if (!PAGE_ALIGNED(start)) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			return size;
+		start = PAGE_ALIGN(start);
 	}
-	if (locked)
-		mmap_read_unlock(mm);
-	if (nstart == end)
-		return 0;
-	return size - min_t(size_t, nstart - start, size);
+	while (start != end) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			goto out;
+		start += PAGE_SIZE;
+	}
+	mmap_read_unlock(mm);
+
+out:
+	if (size > (unsigned long)uaddr - start)
+		return size - ((unsigned long)uaddr - start);
+	return 0;
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);
 



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 18:42                 ` Andreas Gruenbacher
@ 2022-03-09 19:08                   ` Linus Torvalds
  2022-03-09 20:57                     ` Andreas Gruenbacher
  2022-03-09 21:08                     ` Andreas Gruenbacher
  2022-03-09 19:21                   ` Linus Torvalds
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09 19:08 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> From what I took from the previous discussion, probing at a sub-page
> granularity won't be necessary for bytewise copying: when the address
> we're trying to access is poisoned, fault_in_*() will fail; when we get
> a short result, that will take us to the poisoned address in the next
> iteration.

Sadly, that isn't actually the case.

It's not the case for GUP (that page aligns things), and it's not the
case for fault_in_writeable() itself (that also page aligns things).

But more importantly, it's not actually the case for the *users*
either. Not all of the users are byte-stream oriented, and I think it
was btrfs that had a case of "copy a struct at the beginning of the
stream". And if that copy failed, it wouldn't advance by as many bytes
as it got - it would require that struct to be all fetched, and start
from the beginning.

So we do need to probe at least a minimum set of bytes. Probably a
fairly small minimum, but still...
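[Editorial note: the all-or-nothing case can be modeled the same way. In this sketch (a hypothetical model, not the btrfs code), copy_object() succeeds only if the whole object is accessible and always restarts from the beginning, so a retry loop whose probe covers less than the object never makes forward progress:]

```c
#include <stddef.h>

/* Stand-in for an all-or-nothing structure copy: succeeds only if every
 * byte of the object is accessible, and always restarts from the start. */
static int copy_object(size_t off, size_t len, size_t mapped)
{
	return off + len <= mapped ? 0 : -1;
}

/* Retry loop that faults in 'probe' bytes from the start of the object
 * after each failed copy, and detects lack of forward progress. */
static int retry_copy(size_t off, size_t len, size_t probe)
{
	size_t mapped = off;	/* nothing of the object mapped yet */

	while (copy_object(off, len, mapped)) {
		size_t faulted = off + probe;

		if (faulted <= mapped)
			return -1;	/* no forward progress: would livelock */
		mapped = faulted;
	}
	return 0;
}
```

Probing a single byte (or anything shorter than the object) stalls; probing at least the object's size lets the copy complete -- the "minimum set of bytes" point above.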


> With a large enough buffer, a simple malloc() will return unmapped
> pages, and reading into such a buffer will result in fault-in.  So page
> faults during read() are actually pretty normal, and it's not the user's
> fault.

Agreed. But that wasn't the case here:

> In my test case, the buffer was pre-initialized with memset() to avoid
> those kinds of page faults, which meant that the page faults in
> gfs2_file_read_iter() only started to happen when we were out of memory.
> But that's not the common case.

Exactly. I do not think this is a case that we should - or need to -
optimize for.

And doing too much pre-faulting is actually counter-productive.

> * Get rid of max_size: it really makes no sense to second-guess what the
>   caller needs.

It's not about "what caller needs". It's literally about latency
issues. If you can force a busy loop in kernel space by having one
unmapped page and then do a 2GB read(), that's a *PROBLEM*.

Now, we can try this thing, because I think we end up having other
size limitations in the IO subsystem that means that the filesystem
won't actually do that, but the moment I hear somebody talk about
latencies, that max_size goes back.

                Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 18:42                 ` Andreas Gruenbacher
  2022-03-09 19:08                   ` Linus Torvalds
@ 2022-03-09 19:21                   ` Linus Torvalds
  2022-03-09 19:35                     ` Andreas Gruenbacher
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09 19:21 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> +       while (start != end) {
> +               if (fixup_user_fault(mm, start, fault_flags, NULL))
> +                       goto out;
> +               start += PAGE_SIZE;
> +       }
> +       mmap_read_unlock(mm);
> +
> +out:
> +       if (size > (unsigned long)uaddr - start)
> +               return size - ((unsigned long)uaddr - start);
> +       return 0;
>  }

What?

That "goto out" is completely broken. It explicitly avoids the
"mmap_read_unlock()" for some reason I can't for the life of me
understand.

You must have done that on purpose, since a simple "break" would have
been the sane and simple thing to do, but it looks *entirely* wrong to
me.

I think the whole loop should just be

        mmap_read_lock(mm);
        do {
                if (fixup_user_fault(mm, start, fault_flags, NULL))
                        break;
                start = (start + PAGE_SIZE) & PAGE_MASK;
        } while (start != end);
        mmap_read_unlock(mm);

which also doesn't need that first unlooped iteration (not that I
think that passing in the non-masked starting address for the first
case actually helps, but that's a different thing).

                 Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 19:21                   ` Linus Torvalds
@ 2022-03-09 19:35                     ` Andreas Gruenbacher
  2022-03-09 20:18                       ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-09 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 8:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > +       while (start != end) {
> > +               if (fixup_user_fault(mm, start, fault_flags, NULL))
> > +                       goto out;
> > +               start += PAGE_SIZE;
> > +       }
> > +       mmap_read_unlock(mm);
> > +
> > +out:
> > +       if (size > (unsigned long)uaddr - start)
> > +               return size - ((unsigned long)uaddr - start);
> > +       return 0;
> >  }
>
> What?
>
> That "goto out" is completely broken. It explicitly avoids the
> "mmap_read_unlock()" for some reason I can't for the life of me
> understand.
>
> You must have done that on purpose, since a simple "break" would have
> been the sane and simple thing to do, but it looks *entirely* wrong to
> me.

Ouch, that was stupid. Same for the "return size".

> I think the whole loop should just be
>
>         mmap_read_lock(mm);
>         do {
>                 if (fixup_user_fault(mm, start, fault_flags, NULL))
>                         break;
>                 start = (start + PAGE_SIZE) & PAGE_MASK;
>
>         } while (start != end);
>         mmap_read_unlock(mm);
>
> which also doesn't need that first unlooped iteration (not that I
> think that passing in the non-masked starting address for the first
> case actually helps, but that's a different thing).

That's better, thanks.

Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 19:35                     ` Andreas Gruenbacher
@ 2022-03-09 20:18                       ` Linus Torvalds
  2022-03-09 20:36                         ` Andreas Gruenbacher
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09 20:18 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs


On Wed, Mar 9, 2022 at 11:35 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> That's better, thanks.

Ok, can you give this one more test?

It has that simplified loop, but it also replaced the
FAULT_FLAG_KILLABLE with just passing in 'unlocked'.

I thought I didn't need to do that, but the "retry" loop inside
fixup_user_fault() will actually set that 'unlocked' thing even if the
caller doesn't care whether the mmap_sem was unlocked during the call,
so we have to pass in that pointer just to get that to work.

And we don't care if mmap_sem was dropped, because this loop doesn't
cache any vma information or anything like that, but we don't want to
get a NULL pointer oops just because fixup_user_fault() tries to
inform us about something we don't care about ;)

That incidentally gets us FAULT_FLAG_ALLOW_RETRY too, which is
probably a good thing anyway - it means that the mmap_sem will be
dropped if we wait for IO. Not likely a huge deal, but it's the
RightThing(tm) to do.

So this has some other changes there too, but on the whole the
function is now really quite simple. But it would be good to have one
final round of testing considering how many small details changed..

                      Linus

[-- Attachment #2: 0001-mm-gup-make-fault_in_safe_writeable-use-fixup_user_f.patch --]
[-- Type: text/x-patch, Size: 3954 bytes --]

From 2a437a05c4c06e2089fa1a789e96100ea2135f6a Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 8 Mar 2022 11:55:48 -0800
Subject: [PATCH] mm: gup: make fault_in_safe_writeable() use
 fixup_user_fault()

Instead of using GUP, make fault_in_safe_writeable() actually force a
'handle_mm_fault()' using the same fixup_user_fault() machinery that
futexes already use.

Using the GUP machinery meant that fault_in_safe_writeable() did not do
everything that a real fault would do, ranging from not auto-expanding
the stack segment, to not updating accessed or dirty flags in the page
tables (GUP sets those flags on the pages themselves).

The latter causes problems on architectures (like s390) that do accessed
bit handling in software, which meant that fault_in_safe_writeable()
didn't actually do all the fault handling it needed to.

Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/gup.c | 57 +++++++++++++++++++-------------------------------------
 1 file changed, 19 insertions(+), 38 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..7bc1ba9ce440 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1729,11 +1729,11 @@ EXPORT_SYMBOL(fault_in_writeable);
  * @uaddr: start of address range
  * @size: length of address range
  *
- * Faults in an address range using get_user_pages, i.e., without triggering
- * hardware page faults.  This is primarily useful when we already know that
- * some or all of the pages in the address range aren't in memory.
+ * Faults in an address range for writing.  This is primarily useful when we
+ * already know that some or all of the pages in the address range aren't in
+ * memory.
  *
- * Other than fault_in_writeable(), this function is non-destructive.
+ * Unlike fault_in_writeable(), this function is non-destructive.
  *
  * Note that we don't pin or otherwise hold the pages referenced that we fault
  * in.  There's no guarantee that they'll stay in memory for any duration of
@@ -1744,46 +1744,27 @@ EXPORT_SYMBOL(fault_in_writeable);
  */
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	unsigned long start = (unsigned long)uaddr, end;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
+	bool unlocked = false;
 
-	nstart = start & PAGE_MASK;
+	if (unlikely(size == 0))
+		return 0;
 	end = PAGE_ALIGN(start + size);
-	if (end < nstart)
+	if (end < start)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
+	mmap_read_lock(mm);
+	do {
+		if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked))
 			break;
-		nend = nstart + ret * PAGE_SIZE;
-	}
-	if (locked)
-		mmap_read_unlock(mm);
-	if (nstart == end)
-		return 0;
-	return size - min_t(size_t, nstart - start, size);
+		start = (start + PAGE_SIZE) & PAGE_MASK;
+	} while (start != end);
+	mmap_read_unlock(mm);
+
+	if (size > (unsigned long)uaddr - start)
+		return size - ((unsigned long)uaddr - start);
+	return 0;
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);
 
-- 
2.35.1.356.ge6630f57cf.dirty



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 20:18                       ` Linus Torvalds
@ 2022-03-09 20:36                         ` Andreas Gruenbacher
  2022-03-09 20:48                           ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-09 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 9:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Mar 9, 2022 at 11:35 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > That's better, thanks.
>
> Ok, can you give this one more test?
>
> It has that simplified loop, but it also replaced the
> FAULT_FLAG_KILLABLE with just passing in 'unlocked'.
>
> I thought I didn't need to do that, but the "retry" loop inside
> fixup_user_fault() will actually set that 'unlocked' thing even if the
> caller doesn't care whether the mmap_sem was unlocked during the call,
> so we have to pass in that pointer just to get that to work.
>
> And we don't care if mmap_sem was dropped, because this loop doesn't
> cache any vma information or anything like that, but we don't want to
> get a NULL pointer oops just because fixup_user_fault() tries to
> inform us about something we don't care about ;)

It's a moot point now, but I don't think handle_mm_fault would have
returned VM_FAULT_RETRY without FAULT_FLAG_ALLOW_RETRY, so there
wouldn't have been any NULL pointer accesses.

> That incidentally gets us FAULT_FLAG_ALLOW_RETRY too, which is
> probably a good thing anyway - it means that the mmap_sem will be
> dropped if we wait for IO. Not likely a huge deal, but it's the
> RightThing(tm) to do.

Looking good.

> So this has some other changes there too, but on the whole the
> function is now really quite simple. But it would be good to have one
> final round of testing considering how many small details changed..

Sure, we'll put it through all our tests.

Thanks,
Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 20:36                         ` Andreas Gruenbacher
@ 2022-03-09 20:48                           ` Linus Torvalds
  2022-03-09 20:54                             ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09 20:48 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 12:37 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> It's a moot point now, but I don't think handle_mm_fault would have
> returned VM_FAULT_RETRY without FAULT_FLAG_ALLOW_RETRY, so there
> wouldn't have been any NULL pointer accesses.

No, it really does - FAULT_FLAG_KILLABLE will trigger the code in
page_lock_or_retry() (->__folio_lock_or_retry) even without
FAULT_FLAG_ALLOW_RETRY.

So lock_page_or_retry() will drop the mmap_sem and return false, and
then you have

        locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);

        if (!locked) {
                ret |= VM_FAULT_RETRY;
                goto out_release;
        }

for the swapin case.

And mm/filemap.c has essentially the same logic in
lock_folio_maybe_drop_mmap(), although the syntax is quite different.

Basically FAULT_FLAG_KILLABLE implies a kind of "half-way ALLOW_RETRY"
- allow aborting, but only for the fatal signal case.
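
For reference, the logic in question looks roughly like this (a
paraphrase of __folio_lock_or_retry() in mm/filemap.c, from memory,
not the exact source):

```c
/* Paraphrase of __folio_lock_or_retry(), mm/filemap.c; not exact. */
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
			   unsigned int flags)
{
	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		/* full retry: drop mmap_sem, wait, tell caller to retry */
		mmap_read_unlock(mm);
		if (flags & FAULT_FLAG_KILLABLE)
			folio_wait_locked_killable(folio);
		else
			folio_wait_locked(folio);
		return false;
	}
	if (flags & FAULT_FLAG_KILLABLE) {
		/*
		 * The "half-way" case: no retry on ordinary contention,
		 * but a fatal signal still drops mmap_sem and makes the
		 * caller bail out with VM_FAULT_RETRY.
		 */
		if (__folio_lock_killable(folio)) {
			mmap_read_unlock(mm);
			return false;
		}
		return true;
	}
	__folio_lock(folio);
	return true;
}
```

The middle branch is the one discussed above: even without
FAULT_FLAG_ALLOW_RETRY, a fatal signal makes this return false with
mmap_sem dropped.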

                  Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 20:48                           ` Linus Torvalds
@ 2022-03-09 20:54                             ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2022-03-09 20:54 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 12:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Basically FAULT_FLAG_KILLABLE implies a kind of "half-way ALLOW_RETRY"
> - allow aborting, but only for the fatal signal case.

Side note: I'm not saying this is a *good* thing - it is very
confusing - but it's due to historical reasons where architectures
ended up getting these features incrementally.

So FAULT_FLAG_KILLABLE was one such half-way step, where you could
abort things without actually retrying them, because a fatal signal
could be handled without then repeating things.

These days, I think all architectures have converted their actual
fault handling to the full "retry/abort/error/success" spectrum, but
then you still have other uses of handle_mm_fault() that may not be
willing to have mmap_sem be dropped in the middle of operations
because they keep a 'vma' list around or something like that.

And then FAULT_FLAG_KILLABLE can still be a good way to say "ok, I'm
willing to _abort_ things entirely, but you can't drop the mmap sem
and ask me to recover if it's not fatal".

And this is all entirely due to historical behavior, since
_originally_ handle_mm_fault() wouldn't ever drop mmap_sem at all.
These days it would probably be simpler to *not* have all these
complicated special cases, and just say that FAULT_FLAG_ALLOW_RETRY is
always true.

But then somebody would have to walk through every user to make sure it's ok...

           Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 19:08                   ` Linus Torvalds
@ 2022-03-09 20:57                     ` Andreas Gruenbacher
  2022-03-09 21:08                     ` Andreas Gruenbacher
  1 sibling, 0 replies; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-09 20:57 UTC (permalink / raw)
  To: Linus Torvalds, Filipe Manana
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 8:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > From what I took from the previous discussion, probing at a sub-page
> > granularity won't be necessary for bytewise copying: when the address
> > we're trying to access is poisoned, fault_in_*() will fail; when we get
> > a short result, that will take us to the poisoned address in the next
> > iteration.
>
> Sadly, that isn't actually the case.
>
> It's not the case for GUP (that page aligns things), and it's not the
> case for fault_in_writeable() itself (that also page aligns things).

As long as the fault_in_*() functions probe the exact start address,
they will detect when that address is poisoned. They can advance
page-wise after that and it doesn't matter if they skip poisoned
addresses. When the memory range is then accessed, that may fail at a
poisoned address. The next call to fault_in_*() will be with that
poisoned address as the start address.

So it's the unaligned probing at the start in the fault_in_*()
functions that's essential, and fault_in_readable(),
fault_in_writeable(), and fault_in_safe_writeable() now all do that
probing.

> But more importantly, it's not actually the case for the *users*
> either. Not all of the users are byte-stream oriented, and I think it
> was btrfs that had a case of "copy a struct at the beginning of the
> stream". And if that copy failed, it wouldn't advance by as many bytes
> as it got - it would require that struct to be all fetched, and start
> from the beginning.
>
> So we do need to probe at least a minimum set of bytes. Probably a
> fairly small minimum, but still...

There are only a few users like that, and they can be changed to pass
back the actual address that fails so that that address can be probed
instead of probing every 128 bytes of the entire buffer on arm64.

Thanks,
Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 19:08                   ` Linus Torvalds
  2022-03-09 20:57                     ` Andreas Gruenbacher
@ 2022-03-09 21:08                     ` Andreas Gruenbacher
  2022-03-10 12:13                       ` Filipe Manana
  1 sibling, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-09 21:08 UTC (permalink / raw)
  To: Linus Torvalds, Filipe Manana
  Cc: Catalin Marinas, David Hildenbrand, Alexander Viro, linux-s390,
	Linux-MM, linux-fsdevel, linux-btrfs

On Wed, Mar 9, 2022 at 8:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > With a large enough buffer, a simple malloc() will return unmapped
> > pages, and reading into such a buffer will result in fault-in.  So page
> > faults during read() are actually pretty normal, and it's not the user's
> > fault.
>
> Agreed. But that wasn't the case here:
>
> > In my test case, the buffer was pre-initialized with memset() to avoid
> > those kinds of page faults, which meant that the page faults in
> > gfs2_file_read_iter() only started to happen when we were out of memory.
> > But that's not the common case.
>
> Exactly. I do not think this is a case that we should - or need to -
> optimize for.
>
> And doing too much pre-faulting is actually counter-productive.
>
> > * Get rid of max_size: it really makes no sense to second-guess what the
> >   caller needs.
>
> It's not about "what caller needs". It's literally about latency
> issues. If you can force a busy loop in kernel space by having one
> unmapped page and then do a 2GB read(), that's a *PROBLEM*.
>
> Now, we can try this thing, because I think we end up having other
> size limitations in the IO subsystem that means that the filesystem
> won't actually do that, but the moment I hear somebody talk about
> latencies, that max_size goes back.

Thanks, this puts fault_in_safe_writeable() in line with
fault_in_readable() and fault_in_writeable().

There currently are two users of
fault_in_safe_writeable()/fault_in_iov_iter_writeable(): gfs2 and
btrfs. In gfs2, we cap the size at BIO_MAX_VECS pages (256). I don't
see an explicit cap in btrfs; adding Filipe.

Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-09 21:08                     ` Andreas Gruenbacher
@ 2022-03-10 12:13                       ` Filipe Manana
  0 siblings, 0 replies; 36+ messages in thread
From: Filipe Manana @ 2022-03-10 12:13 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Filipe Manana, Catalin Marinas,
	David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Wed, Mar 09, 2022 at 10:08:32PM +0100, Andreas Gruenbacher wrote:
> On Wed, Mar 9, 2022 at 8:08 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > With a large enough buffer, a simple malloc() will return unmapped
> > > pages, and reading into such a buffer will result in fault-in.  So page
> > > faults during read() are actually pretty normal, and it's not the user's
> > > fault.
> >
> > Agreed. But that wasn't the case here:
> >
> > > In my test case, the buffer was pre-initialized with memset() to avoid
> > > those kinds of page faults, which meant that the page faults in
> > > gfs2_file_read_iter() only started to happen when we were out of memory.
> > > But that's not the common case.
> >
> > Exactly. I do not think this is a case that we should - or need to -
> > optimize for.
> >
> > And doing too much pre-faulting is actually counter-productive.
> >
> > > * Get rid of max_size: it really makes no sense to second-guess what the
> > >   caller needs.
> >
> > It's not about "what caller needs". It's literally about latency
> > issues. If you can force a busy loop in kernel space by having one
> > unmapped page and then do a 2GB read(), that's a *PROBLEM*.
> >
> > Now, we can try this thing, because I think we end up having other
> > size limitations in the IO subsystem that means that the filesystem
> > won't actually do that, but the moment I hear somebody talk about
> > latencies, that max_size goes back.
> 
> Thanks, this puts fault_in_safe_writeable() in line with
> fault_in_readable() and fault_in_writeable().
> 
> There currently are two users of
> fault_in_safe_writeable()/fault_in_iov_iter_writeable(): gfs2 and
> btrfs.
> In gfs2, we cap the size at BIO_MAX_VECS pages (256). I don't see an
> explicit cap in btrfs; adding Filipe.

On btrfs, for buffered writes, we have some cap (done at btrfs_buffered_write()),
for buffered reads we don't have any control on that as we use filemap_read().

For direct IO we don't have any cap; we try to fault in everything that's
left. However, we keep track of whether we are making any progress, and if
we aren't, we just fall back to the buffered IO path. That prevents
infinite or long loops.

There's really no good reason not to cap how much we try to fault in
on the direct IO paths. We should do it, as the lack of a cap probably
has a negative performance impact for very large direct IO
reads/writes.
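
The retry pattern described above - loop while fault-in makes progress,
otherwise fall back to buffered IO - is roughly the following. This is a
schematic only; do_direct_io() and do_buffered_io() are hypothetical
stand-ins, not actual btrfs functions:

```c
/* Schematic of the direct-IO fallback pattern; not actual btrfs code. */
size_t left, prev_left = 0;
ssize_t ret = 0;

while ((left = iov_iter_count(iter)) > 0) {
	ret = do_direct_io(iocb, iter);
	if (ret != -EFAULT)
		break;			/* done, or a real error */
	/*
	 * Page fault: if we made no progress at all since the last
	 * attempt, stop looping and let the buffered IO path finish
	 * the job; otherwise fault the remainder in and retry.
	 */
	if (left == prev_left) {
		ret = do_buffered_io(iocb, iter);
		break;
	}
	fault_in_iov_iter_readable(iter, left);
	prev_left = left;
}
```

The progress check is what bounds the loop even without a cap on the
fault-in size.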

Thanks.

> 
> Andreas
> 


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-08 19:27         ` Linus Torvalds
  2022-03-08 20:03           ` Linus Torvalds
@ 2022-03-10 17:13           ` David Hildenbrand
  2022-03-10 18:00             ` Andreas Gruenbacher
  2022-03-10 18:35             ` Linus Torvalds
  1 sibling, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-10 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On 08.03.22 20:27, Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 9:40 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm. The futex code actually uses "fixup_user_fault()" for this case.
>> Maybe fault_in_safe_writeable() should do that?
> 
> .. paging all the bits back in, I'm reminded that one of the worries
> was "what about large areas".
> 
> But I really think that the solution should be that we limit the size
> of fault_in_safe_writeable() to just a couple of pages.
> 
> Even if you were to fault in gigabytes, page-out can undo it anyway,
> so there is no semantic reason why that function should ever do more
> than a few pages to make sure. There's already even a comment about
> how there's no guarantee that the pages will stay.
> 
> Side note: the current GUP-based fault_in_safe_writeable() is buggy in
> another way anyway: it doesn't work right for stack extending
> accesses.
> 
> So I think the fix for this all might be something like the attached
> (TOTALLY UNTESTED)!
> 
> Comments? Andreas, mind (carefully - maybe it is totally broken and
> does unspeakable acts to your pets) testing this?

I'm late to the party, but I agree with the "stack extending accesses"
issue and that using fixup_user_fault() looks "cleaner" than FOLL_TOUCH.


I'm just going to point out that fixup_user_fault() on a per-page basis
is sub-optimal, especially when dealing with something that's PMD- or
PUD-mapped (THP, hugetlb, ...). In contrast, GUP is optimized for that.
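
For reference, the per-page fixup_user_fault() loop under discussion
looks roughly like this (a sketch of the proposed
fault_in_safe_writeable(), paraphrased from memory rather than the
exact patch):

```c
/*
 * Sketch of the proposed fault_in_safe_writeable(); paraphrased, not
 * the exact patch.  Returns the number of bytes not faulted in.
 */
size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
{
	unsigned long start = (unsigned long)uaddr, faulted;
	unsigned long end = PAGE_ALIGN(start + size);
	struct mm_struct *mm = current->mm;
	bool unlocked = false;

	if (unlikely(size == 0))
		return 0;

	mmap_read_lock(mm);
	do {
		/*
		 * One minor fault per page; handles stack-extending
		 * accesses, unlike the GUP-based version, but takes no
		 * shortcut for PMD/PUD mappings.
		 */
		if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked))
			break;
		start = (start + PAGE_SIZE) & PAGE_MASK;
	} while (start < end);
	mmap_read_unlock(mm);

	faulted = start - (unsigned long)uaddr;
	return faulted < size ? size - faulted : 0;
}
```

The loop calls into handle_mm_fault() once per page, which is the
per-page cost being pointed out here.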

So that might be something interesting to look into optimizing in the
future, if relevant in practice. Not sure how we could return that
information the best way to the caller ("the currently faulted
in/present page ends at address X").

For the time being, the idea LGTM.

-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 17:13           ` David Hildenbrand
@ 2022-03-10 18:00             ` Andreas Gruenbacher
  2022-03-10 18:35             ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-10 18:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Thu, Mar 10, 2022 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.03.22 20:27, Linus Torvalds wrote:
> > On Tue, Mar 8, 2022 at 9:40 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Hmm. The futex code actually uses "fixup_user_fault()" for this case.
> >> Maybe fault_in_safe_writeable() should do that?
> >
> > .. paging all the bits back in, I'm reminded that one of the worries
> > was "what about large areas".
> >
> > But I really think that the solution should be that we limit the size
> > of fault_in_safe_writeable() to just a couple of pages.
> >
> > Even if you were to fault in gigabytes, page-out can undo it anyway,
> > so there is no semantic reason why that function should ever do more
> > than a few pages to make sure. There's already even a comment about
> > how there's no guarantee that the pages will stay.
> >
> > Side note: the current GUP-based fault_in_safe_writeable() is buggy in
> > another way anyway: it doesn't work right for stack extending
> > accesses.
> >
> > So I think the fix for this all might be something like the attached
> > (TOTALLY UNTESTED)!
> >
> > Comments? Andreas, mind (carefully - maybe it is totally broken and
> > does unspeakable acts to your pets) testing this?
>
> I'm late to the party, I agree with the "stack extending accesses" issue
> and that using fixup_user_fault() looks "cleaner" than FOLL_TOUCH.
>
>
> I'm just going to point out that fixup_user_fault() on a per-page basis
> is sub-optimal, especially when dealing with something that's PMD- or
> PUD-mapped (THP, hugetlb, ...). In contrast, GUP is optimized for that.
>
> So that might be something interesting to look into optimizing in the
> future, if relevant in practice. Not sure how we could return that
> information the best way to the caller ("the currently faulted
> in/present page ends at address X").

Yes, this applies to fault_in_iov_iter_readable() as well, as it is
based on fault_in_readable(). It's probably not a super urgent
optimization as the buffers faulted in are immediately accessed.

Thanks,
Andreas

> For the time being, the idea LGTM.
>
> --
> Thanks,
>
> David / dhildenb
>



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 17:13           ` David Hildenbrand
  2022-03-10 18:00             ` Andreas Gruenbacher
@ 2022-03-10 18:35             ` Linus Torvalds
  2022-03-10 18:38               ` David Hildenbrand
  2022-03-10 18:47               ` Andreas Gruenbacher
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2022-03-10 18:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Thu, Mar 10, 2022 at 9:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> For the time being, the idea LGTM.

I'll take that as an acked-by, and I think I'll just commit it to my
real tree rather than delay this fix for the next merge window (only
to have it then be marked as stable and applied that way).

I do agree that we should look at future changes in this area, ranging
from limiting the number of pages to the (I think already pending)
work for arm64 to use an instruction to probe every 128 bytes instead
of on a page basis.

It might even be reasonable to have some hybrid approach that walks
the page tables and faults things in - not quite GUP, not quite
'handle_mm_fault()'.

That said, this is hopefully always going to be the rare case. Yes,
people do IO on "cold" virtual memory, but people doing any
performance work hopefully know that locality (temporal and spatial)
matters not just for the regular CPU data caches, but for pretty much
everything.

                    Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 18:35             ` Linus Torvalds
@ 2022-03-10 18:38               ` David Hildenbrand
  2022-03-10 18:47               ` Andreas Gruenbacher
  1 sibling, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2022-03-10 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On 10.03.22 19:35, Linus Torvalds wrote:
> On Thu, Mar 10, 2022 at 9:13 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> For the time being, the idea LGTM.
> 
> I'll take that as an acked-by, and I think I'll just commit it to my
> real tree rather than delay this fix for the next merge window (only
> to have it then be marked as stable and applied that way).

Yes, ACK and makes sense.


-- 
Thanks,

David / dhildenb



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 18:35             ` Linus Torvalds
  2022-03-10 18:38               ` David Hildenbrand
@ 2022-03-10 18:47               ` Andreas Gruenbacher
  2022-03-10 19:22                 ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-10 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Thu, Mar 10, 2022 at 7:36 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Mar 10, 2022 at 9:13 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > For the time being, the idea LGTM.
>
> I'll take that as an acked-by, and I think I'll just commit it to my
> real tree rather than delay this fix for the next merge window (only
> to have it then be marked as stable and applied that way).

Works for me. This should probably still be tagged as:

Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable")
Cc: stable@vger.kernel.org # v5.16+

Thanks,
Andreas



* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 18:47               ` Andreas Gruenbacher
@ 2022-03-10 19:22                 ` Linus Torvalds
  2022-03-10 19:56                   ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-10 19:22 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Thu, Mar 10, 2022 at 10:48 AM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> Works for me. This should probably still be tagged as:
>
> Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable")
> Cc: stable@vger.kernel.org # v5.16+

Yeah, I had already added the "Fixes" line there (and it makes the
stable one unnecessary).

            Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 19:22                 ` Linus Torvalds
@ 2022-03-10 19:56                   ` Linus Torvalds
  2022-03-10 20:23                     ` Andreas Gruenbacher
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2022-03-10 19:56 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

It's out as commit fe673d3f5bf1 ("mm: gup: make
fault_in_safe_writeable() use fixup_user_fault()") now.

                 Linus


* Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
  2022-03-10 19:56                   ` Linus Torvalds
@ 2022-03-10 20:23                     ` Andreas Gruenbacher
  0 siblings, 0 replies; 36+ messages in thread
From: Andreas Gruenbacher @ 2022-03-10 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Alexander Viro, linux-s390, Linux-MM,
	linux-fsdevel, linux-btrfs

On Thu, Mar 10, 2022 at 8:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> It's out as commit fe673d3f5bf1 ("mm: gup: make
> fault_in_safe_writeable() use fixup_user_fault()") now.

Thanks!

Andreas



end of thread, other threads:[~2022-03-10 20:23 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-07 22:52 Buffered I/O broken on s390x with page faults disabled (gfs2) Andreas Gruenbacher
2022-03-07 23:18 ` Linus Torvalds
2022-03-08  8:21   ` David Hildenbrand
2022-03-08  8:37     ` David Hildenbrand
2022-03-08 12:11       ` David Hildenbrand
2022-03-08 12:24         ` David Hildenbrand
2022-03-08 13:20           ` Gerald Schaefer
2022-03-08 13:32             ` David Hildenbrand
2022-03-08 14:14               ` Gerald Schaefer
2022-03-08 17:23           ` David Hildenbrand
2022-03-08 17:26     ` Linus Torvalds
2022-03-08 17:40       ` Linus Torvalds
2022-03-08 19:27         ` Linus Torvalds
2022-03-08 20:03           ` Linus Torvalds
2022-03-08 23:24             ` Andreas Gruenbacher
2022-03-09  0:22               ` Linus Torvalds
2022-03-09 18:42                 ` Andreas Gruenbacher
2022-03-09 19:08                   ` Linus Torvalds
2022-03-09 20:57                     ` Andreas Gruenbacher
2022-03-09 21:08                     ` Andreas Gruenbacher
2022-03-10 12:13                       ` Filipe Manana
2022-03-09 19:21                   ` Linus Torvalds
2022-03-09 19:35                     ` Andreas Gruenbacher
2022-03-09 20:18                       ` Linus Torvalds
2022-03-09 20:36                         ` Andreas Gruenbacher
2022-03-09 20:48                           ` Linus Torvalds
2022-03-09 20:54                             ` Linus Torvalds
2022-03-10 17:13           ` David Hildenbrand
2022-03-10 18:00             ` Andreas Gruenbacher
2022-03-10 18:35             ` Linus Torvalds
2022-03-10 18:38               ` David Hildenbrand
2022-03-10 18:47               ` Andreas Gruenbacher
2022-03-10 19:22                 ` Linus Torvalds
2022-03-10 19:56                   ` Linus Torvalds
2022-03-10 20:23                     ` Andreas Gruenbacher
2022-03-08 17:47       ` David Hildenbrand
