linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Question] New mmap64 syscall?
@ 2016-12-06 18:54 Yury Norov
  2016-12-06 21:20 ` Arnd Bergmann
  2016-12-07 13:23 ` [Question] New mmap64 syscall? Florian Weimer
  0 siblings, 2 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-06 18:54 UTC (permalink / raw)
  To: libc-alpha, linux-arch, linux-kernel
  Cc: Catalin Marinas, szabolcs.nagy, heiko.carstens, cmetcalf,
	philipp.tomsich, joseph, zhouchengming1, Prasun.Kapoor, agraf,
	geert, kilobyte, manuel.montezelo, arnd, pinskia, linyongting,
	klimov.linux, broonie, bamvor.zhangjian, linux-arm-kernel,
	maxim.kuvyrkov, Nathan_Lynch, schwidefsky, davem,
	christoph.muellner

Hi all,

(Sorry if there is similar discussion, and I missed it. I didn't
find something in LKML in last half a year.)

In aarch64/ilp32 discussion Catalin wondered why we don't pass offset
in mmap() as 64-bit value (in 2 registers if needed). Looking at kernel
code I found that there's no generic interface for it. But almost all
architectures provide their own implementations, like this:

SYSCALL_DEFINE6(mips_mmap, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags, unsigned long,
                fd, off_t, offset)
{
        unsigned long result;

        result = -EINVAL;
        if (offset & ~PAGE_MASK)
                goto out;

        result = sys_mmap_pgoff(addr, len, prot, flags, fd, offset >> PAGE_SHIFT);

out:
        return result;
}

On glibc side things are even worse. There's no mmap() implementation
that allows to pass 64-bit offset in 32-bit architecture. mmap64() which 
is supposed to do this is simply broken:
void *
__mmap64 (void *addr, size_t len, int prot, int flags, int fd, off64_t
                offset)
{
        [...]
        void *result;
        result = (void *) INLINE_SYSCALL (mmap2, 6, addr,
                                         len, prot, flags, fd,
                                         (off_t) (offset >> page_shift));
        return result;
}

It explicitly declares offset as 64-bit value, but casts it to 32-bit
before passing to the kernel, which is wrong for me. Even if arch has
64-bit off_t, like aarch64/ilp32, the cast will take place because
offset is passed in a single register, which is 32-bit.

I see 3 solutions for my problem:
1. Reuse aarch64/lp64 mmap code for ilp32 in glibc, but wrap offset with
SYSCALL_LL64() macro - which converts offset to the pair for 32-bit
ports. This is simple but local solution. And most probably it's enough.

2. Add new flag to mmap, like MAP_OFFSET_IN_PAIR. This will also work.
The problem here is that there are too much arches that implement
their custom sys_mmap2(). And, of course, this type of flags is
looking ugly.

3. Introduce new mmap64() syscall like this:
sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
(The pointer here because otherwise we have 7 args, if simply pass off_hi and
off_lo in registers.)

With new 64-bit interface we can deprecate mmap2(), and generalize all
implementations in kernel.

I think we can discuss it because 64-bit is the default size for off_t 
in all new 32-bit architectures. So generic solution may take place.

The last question here is how important to support offsets bigger than
2^44 on 32-bit machines in practice? It may be a case for ARM64 servers,
which are looking like main aarch64/ilp32 users. If no, we can leave
things as is, and just do nothing.

Yury

On Mon, Dec 05, 2016 at 05:12:43PM +0000, Catalin Marinas wrote:
> On Fri, Oct 21, 2016 at 11:33:10PM +0300, Yury Norov wrote:
> > off_t is  passed in register pair just like in aarch32.
> > In this patch corresponding aarch32 handlers are shared to
> > ilp32 code.
> [...]
> > +/*
> > + * Note: off_4k (w5) is always in units of 4K. If we can't do the
> > + * requested offset because it is not page-aligned, we return -EINVAL.
> > + */
> > +ENTRY(compat_sys_mmap2_wrapper)
> > +#if PAGE_SHIFT > 12
> > +	tst	w5, #~PAGE_MASK >> 12
> > +	b.ne	1f
> > +	lsr	w5, w5, #PAGE_SHIFT - 12
> > +#endif
> > +	b	sys_mmap_pgoff
> > +1:	mov	x0, #-EINVAL
> > +	ret
> > +ENDPROC(compat_sys_mmap2_wrapper)
> 
> For compat sys_mmap2, the pgoff argument is in multiples of 4K. This was
> traditionally used for architectures where off_t is 32-bit to allow
> mapping files to 2^44.
> 
> Since off_t is 64-bit with AArch64/ILP32, should we just pass the off_t
> as a 64-bit value in two different registers (w5 and w6)?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-06 18:54 [Question] New mmap64 syscall? Yury Norov
@ 2016-12-06 21:20 ` Arnd Bergmann
  2016-12-07 10:34   ` Yury Norov
  2016-12-07 13:23 ` [Question] New mmap64 syscall? Florian Weimer
  1 sibling, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2016-12-06 21:20 UTC (permalink / raw)
  To: Yury Norov
  Cc: libc-alpha, linux-arch, linux-kernel, Catalin Marinas,
	szabolcs.nagy, heiko.carstens, cmetcalf, philipp.tomsich, joseph,
	zhouchengming1, Prasun.Kapoor, agraf, geert, kilobyte,
	manuel.montezelo, pinskia, linyongting, klimov.linux, broonie,
	bamvor.zhangjian, linux-arm-kernel, maxim.kuvyrkov, Nathan_Lynch,
	schwidefsky, davem, christoph.muellner

On Wednesday, December 7, 2016 12:24:40 AM CET Yury Norov wrote:

> I see 3 solutions for my problem:
> 1. Reuse aarch64/lp64 mmap code for ilp32 in glibc, but wrap offset with
> SYSCALL_LL64() macro - which converts offset to the pair for 32-bit
> ports. This is simple but local solution. And most probably it's enough.

I wouldn't want arm64 to be different from all other architectures
here for the 32-bit API. The mmap() API used to be done entirely
in architecture specific code, while nowadays at least new architectures
use something resembling sys_mmap_pgoff(). I think that was originally
introduced to be the default API for 32-bit architectures, but it
failed to address architectures with variable page sizes.

> 2. Add new flag to mmap, like MAP_OFFSET_IN_PAIR. This will also work.
> The problem here is that there are too much arches that implement
> their custom sys_mmap2(). And, of course, this type of flags is
> looking ugly.

Right, better not touch make complicate it further. The other problem
is that mmap2() already has six argument and on most architectures
that is the limit for the number of syscall arguments, so you
cannot add another argument for the upper half.

> 3. Introduce new mmap64() syscall like this:
> sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
> (The pointer here because otherwise we have 7 args, if simply pass off_hi and
> off_lo in registers.)

This wouldn't have to be a pair, just a pointer to a 64-bit number.

> With new 64-bit interface we can deprecate mmap2(), and generalize all
> implementations in kernel.
> 
> I think we can discuss it because 64-bit is the default size for off_t 
> in all new 32-bit architectures. So generic solution may take place.
> 
> The last question here is how important to support offsets bigger than
> 2^44 on 32-bit machines in practice? It may be a case for ARM64 servers,
> which are looking like main aarch64/ilp32 users. If no, we can leave
> things as is, and just do nothing.

If there is a use case for larger than 16TB offsets, we should add
the call on all architectures, probably using your approach 3. I don't
think that we should treat it as anything special for arm64 though.

	Arnd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-06 21:20 ` Arnd Bergmann
@ 2016-12-07 10:34   ` Yury Norov
       [not found]     ` <0F280FED-870A-42B5-ABC4-1976ACA32462@theobroma-systems.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Yury Norov @ 2016-12-07 10:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: libc-alpha, linux-arch, linux-kernel, Catalin Marinas,
	szabolcs.nagy, heiko.carstens, cmetcalf, philipp.tomsich, joseph,
	zhouchengming1, Prasun.Kapoor, agraf, geert, kilobyte,
	manuel.montezelo, pinskia, linyongting, klimov.linux, broonie,
	bamvor.zhangjian, linux-arm-kernel, maxim.kuvyrkov, Nathan_Lynch,
	schwidefsky, davem, christoph.muellner

On Tue, Dec 06, 2016 at 10:20:20PM +0100, Arnd Bergmann wrote:
> On Wednesday, December 7, 2016 12:24:40 AM CET Yury Norov wrote:
> > 3. Introduce new mmap64() syscall like this:
> > sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
> > (The pointer here because otherwise we have 7 args, if simply pass off_hi and
> > off_lo in registers.)
> 
> This wouldn't have to be a pair, just a pointer to a 64-bit number.
> 
> > With new 64-bit interface we can deprecate mmap2(), and generalize all
> > implementations in kernel.
> > 
> > I think we can discuss it because 64-bit is the default size for off_t 
> > in all new 32-bit architectures. So generic solution may take place.
> > 
> > The last question here is how important to support offsets bigger than
> > 2^44 on 32-bit machines in practice? It may be a case for ARM64 servers,
> > which are looking like main aarch64/ilp32 users. If no, we can leave
> > things as is, and just do nothing.
> 
> If there is a use case for larger than 16TB offsets, we should add
> the call on all architectures, probably using your approach 3. I don't
> think that we should treat it as anything special for arm64 though.

>From this point of view, 16+TB offset is a matter of 16+TB storage,
and it's more than real. The other consideration to add it is that
we have 64-bit support for offsets in syscalls like sys_llseek().
So mmap64() will simply extend this support.

I can prepare this patch. Some implementation details I'd like to
clarify:
Syscall declaration:
SYSCALL_DEFINE6(mmap64, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long long *, offset);

sys_mmap64() deprecates sys_mmap2(), and __ARCH_WANT_MMAP2 is
introduced to keep it enabled for all existing architectures.
All modern arches (aarch64/ilp32 is the first candidate) will have
mmap64() only. The example is set/getrlimit() or renameat() drop
patches (b0da6d44).
                                
On GLIBC side, __OFF_T_MATCHES_OFF64_t will wire mmap() from
linux/generic/wordsize32/mmap.c to mmap64() from linux/mmap64.c. 

mmap64() will first try __NR_mmap64, and if not defined, or ENOSYS
is returned, __NR_mmap2 will be called. This is to let userspace that
supports both mmap2() and mmap64() have full 64-bit offset support, not
44-bit one.

For __NR_mmap2 case, I'd also add the check against offsets more than
2^44, and set errno to EOVERFLOW in that case.

Any thoughts?

Yury.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-06 18:54 [Question] New mmap64 syscall? Yury Norov
  2016-12-06 21:20 ` Arnd Bergmann
@ 2016-12-07 13:23 ` Florian Weimer
  2016-12-07 15:48   ` Yury Norov
  1 sibling, 1 reply; 26+ messages in thread
From: Florian Weimer @ 2016-12-07 13:23 UTC (permalink / raw)
  To: Yury Norov, libc-alpha, linux-arch, linux-kernel
  Cc: Catalin Marinas, szabolcs.nagy, heiko.carstens, cmetcalf,
	philipp.tomsich, joseph, zhouchengming1, Prasun.Kapoor, agraf,
	geert, kilobyte, manuel.montezelo, arnd, pinskia, linyongting,
	klimov.linux, broonie, bamvor.zhangjian, linux-arm-kernel,
	maxim.kuvyrkov, Nathan_Lynch, schwidefsky, davem,
	christoph.muellner

On 12/06/2016 07:54 PM, Yury Norov wrote:
> 3. Introduce new mmap64() syscall like this:
> sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
> (The pointer here because otherwise we have 7 args, if simply pass off_hi and
> off_lo in registers.)

I would prefer a batched mmap/munmap/mremap/mprotect/madvise interface, 
so that VM changes can be coalesced and the output reduced.  This 
interface could then be used to implement mmap on 32-bit architectures 
as well because the offset restrictions would not apply there.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-07 13:23 ` [Question] New mmap64 syscall? Florian Weimer
@ 2016-12-07 15:48   ` Yury Norov
  2016-12-08 15:47     ` Florian Weimer
  0 siblings, 1 reply; 26+ messages in thread
From: Yury Norov @ 2016-12-07 15:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-arch, linux-kernel, Catalin Marinas,
	szabolcs.nagy, heiko.carstens, cmetcalf, philipp.tomsich, joseph,
	zhouchengming1, Prasun.Kapoor, agraf, geert, kilobyte,
	manuel.montezelo, arnd, pinskia, linyongting, klimov.linux,
	broonie, bamvor.zhangjian, linux-arm-kernel, maxim.kuvyrkov,
	Nathan_Lynch, schwidefsky, davem, christoph.muellner

On Wed, Dec 07, 2016 at 02:23:55PM +0100, Florian Weimer wrote:
> On 12/06/2016 07:54 PM, Yury Norov wrote:
> >3. Introduce new mmap64() syscall like this:
> >sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
> >(The pointer here because otherwise we have 7 args, if simply pass off_hi and
> >off_lo in registers.)
> 
> I would prefer a batched mmap/munmap/mremap/mprotect/madvise interface, so
> that VM changes can be coalesced and the output reduced.  This interface
> could then be used to implement mmap on 32-bit architectures as well because
> the offset restrictions would not apply there.

Hi Florian,

I frankly don't understand what you mean, All syscalls you mentioned
doesn't take off_t or other 64-bit arguments. 'VM changes' - virtual
memory? If so, I don't see any changes in VM with this approach, just
correct handling of big offsets.

> This interface
> could then be used to implement mmap on 32-bit architectures as well 

This is for 32-bit architectures only. 64 bit arches use
sysdeps/unix/sysv/linux/wordsize-64/mmap.c for both mmap and mmap64,
and they don't need that tricks with off_t. Or you meaning to switch
64-bit mmap to this interface?

Please explain what you mean in details.

Yury

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
       [not found]       ` <20161207123944.GA11799@yury-N73SV>
@ 2016-12-07 16:32         ` Catalin Marinas
  2016-12-07 16:43           ` Dr. Philipp Tomsich
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2016-12-07 16:32 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dr.Philipp Tomsich, Arnd Bergmann, libc-alpha, linux-arch, LKML,
	szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan_Lynch, schwidefsky,
	davem, christoph.muellner

On Wed, Dec 07, 2016 at 06:09:44PM +0530, Yury Norov wrote:
> On Wed, Dec 07, 2016 at 12:07:24PM +0100, Dr.Philipp Tomsich wrote:
> > [Resend, as my mail-client had insisted on using the wrong MIME type…]
> > 
> > > On 07 Dec 2016, at 11:34, Yury Norov <ynorov@caviumnetworks.com> wrote:
> > > 
> > >> If there is a use case for larger than 16TB offsets, we should add
> > >> the call on all architectures, probably using your approach 3. I don't
> > >> think that we should treat it as anything special for arm64 though.
> > > 
> > > From this point of view, 16+TB offset is a matter of 16+TB storage,
> > > and it's more than real. The other consideration to add it is that
> > > we have 64-bit support for offsets in syscalls like sys_llseek().
> > > So mmap64() will simply extend this support.
> > 
> > I believe the question is rather if the 16TB offset is a real use-case for ILP32.
> 
> This is not for ilp32, but for all 32-bit architectures - both native
> and compat. And because the scope is so generic, I think it's the
> strong reason for us to support true 64-bit offset in mmap().

When I mentioned it, I didn't realise that we already use 6 registers
for mmap(). While we can go up to 8 on AArch64/ILP32, I think Arnd has a
point that we don't want this to diverge from other new 32-bit
architectures. I don't really have a strong opinion either way here,
just a remark that AArch64/ILP32 already diverged from _current_ 32-bit
architectures by introducing 64-bit off_t in a 32-bit world. Introducing
an mmap64() at the same time wouldn't look too bad either.

> > This seems to bring the discussion full-circle, as this would indicate that 64bit is the 
> > preferred bit-width for all sizes, offsets, etc. throughout all filesystem-related calls 
> > (i.e. stat, seek, etc.).
> 
> AARCH64/ILP32 (and all new arches) exposes ino_t, off_t, blkcnt_t,
> fsblkcnt_t, fsfilcnt_t and rlim_t as 64-bit types. (Size_t should
> be 32-bit of course, because it's the same lengths as pointer.)
> 
> It allows to make syscalls that pass it support 64-bit values, refer
> Documentation/arm64/ilp32.txt for details. Stat and seek are both
> supporting 64-bit types. From this point of view, mmap() is the (only?)
> exception in current ILP32 ABI.

I thought ILP32 will use llseek() which has its own explicit way of
passing a 64-bit offset and the result written back by the kernel. We
wouldn't be able to use lseek() because of the return type.

> > But if that is the case, then we should have gone with 64bit arguments in a single
> > register for our ILP32 definition on AArch64.
>  
> There are 2 unrelated matters - the size of types, and the size of
> register. Most of 32-bit architectures has hardware limitation on
> register size (consider aarch32). And it doesn't mean that they are
> forced to stuck with 32-bit off_t etc. This is still opened question
> how to pass 64-bit parameters in aarch64/ilp32 because there we have
> the choice (the reason why it's RFC). If you have new ideas - welcome
> to that discussion. This topic also covers architectures that has to
> pass 64-bit parameters in a pair.

We've discussed this a few times already and the only sane option from
the _kernel_ perspective seemed to be either (a) close to native ABI for
ILP32 (and breaking POSIX) or (b) just a standard 32-bit ABI. The latter
implies splitting 64-bit values in register pairs, especially to avoid a
lot of annotations/wrapping in the generic kernel unistd.h file. IIRC,
we decided to go with option (b), so I don't think it's worth re-opening
that discussion.

> > In other words: Why not keep ILP32 simple an ask users that need a 16TB+ offset
> > to use LP64? It seems much more consistent with the other choices takes so far.
> 
> If user can switch to lp64, he doesn't need ilp32 at all, right? :)
> Also, I don't understand how true 64-bit offset in mmap64() would
> complicate this port.

It's more like the user wanting a quick transition from code that was
only ever compiled for AArch32 (or other 32-bit architecture) with a
goal of full LP64 transition on the long run. I have yet to see
convincing benchmarks showing ILP32 as an advantage over LP64 (of
course, I hear the argument of reading a pointer a loop is twice as fast
with a half-size pointer but I don't consider such benchmarks relevant).

-- 
Catalin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-07 16:32         ` Catalin Marinas
@ 2016-12-07 16:43           ` Dr. Philipp Tomsich
  2016-12-07 21:30             ` Arnd Bergmann
  0 siblings, 1 reply; 26+ messages in thread
From: Dr. Philipp Tomsich @ 2016-12-07 16:43 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yury Norov, Arnd Bergmann, libc-alpha, linux-arch, LKML,
	szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

Catalin,

> On 07 Dec 2016, at 17:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
>>> In other words: Why not keep ILP32 simple an ask users that need a 16TB+ offset
>>> to use LP64? It seems much more consistent with the other choices takes so far.
>> 
>> If user can switch to lp64, he doesn't need ilp32 at all, right? :)
>> Also, I don't understand how true 64-bit offset in mmap64() would
>> complicate this port.
> 
> It's more like the user wanting a quick transition from code that was
> only ever compiled for AArch32 (or other 32-bit architecture) with a
> goal of full LP64 transition on the long run. I have yet to see
> convincing benchmarks showing ILP32 as an advantage over LP64 (of
> course, I hear the argument of reading a pointer a loop is twice as fast
> with a half-size pointer but I don't consider such benchmarks relevant).

Most of the performance advantage in benchmarks comes from a reduction
in the size of data-structures and/or tighter packing of arrays.  In other words,
we can make slightly better use of the caches and push the memory subsystem
a little further when running multiple instances of benchmarks.

Most of these advantages should eventually go away, when struct-reorg makes
it way into the compiler. That said, it’s a marginal (but real) improvement for a
subset of SPEC.

In the real world, the importance of ILP32 as an aid to transition legacy code
that is not 64bit clean… and this should drive the ILP32 discussion. That we
get a boost in our SPEC scores is just a nice extra that we get from it ;-)


Regards,
Philipp.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-07 16:43           ` Dr. Philipp Tomsich
@ 2016-12-07 21:30             ` Arnd Bergmann
  2016-12-10  9:10               ` Pavel Machek
  0 siblings, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2016-12-07 21:30 UTC (permalink / raw)
  To: Dr. Philipp Tomsich
  Cc: Catalin Marinas, Yury Norov, libc-alpha, linux-arch, LKML,
	szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Wednesday, December 7, 2016 5:43:27 PM CET Dr. Philipp Tomsich wrote:
> Catalin,
> 
> > On 07 Dec 2016, at 17:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > 
> >>> In other words: Why not keep ILP32 simple an ask users that need a 16TB+ offset
> >>> to use LP64? It seems much more consistent with the other choices takes so far.
> >> 
> >> If user can switch to lp64, he doesn't need ilp32 at all, right? 
> >> Also, I don't understand how true 64-bit offset in mmap64() would
> >> complicate this port.
> > 
> > It's more like the user wanting a quick transition from code that was
> > only ever compiled for AArch32 (or other 32-bit architecture) with a
> > goal of full LP64 transition on the long run. I have yet to see
> > convincing benchmarks showing ILP32 as an advantage over LP64 (of
> > course, I hear the argument of reading a pointer a loop is twice as fast
> > with a half-size pointer but I don't consider such benchmarks relevant).
> 
> Most of the performance advantage in benchmarks comes from a reduction
> in the size of data-structures and/or tighter packing of arrays.  In other words,
> we can make slightly better use of the caches and push the memory subsystem
> a little further when running multiple instances of benchmarks.
> 
> Most of these advantages should eventually go away, when struct-reorg makes
> it way into the compiler. That said, it’s a marginal (but real) improvement for a
> subset of SPEC.
> 
> In the real world, the importance of ILP32 as an aid to transition legacy code
> that is not 64bit clean… and this should drive the ILP32 discussion. That we
> get a boost in our SPEC scores is just a nice extra that we get from it 

To bring this back from the philosophical questions of ABI design
to the specific point of what file offset width you want for mmap()
on 32-bit architectures.

For all I can tell, using mmap() to access a file that is many thousand
times larger than your virtual address space is completely crazy.
Adding a new mmap64() syscall on all 32-bit architectures would be
trivial if there was a use case for it, without one we but without at
least one specific application asking for it (with good reasons), we
shouldn't even be talking about that.

Note that until commit f8b7256096a2 ("Unify sys_mmap"), we actually
had a sys_mmap64 implementation on a couple of architectures, but
removed it.

	Arnd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-07 15:48   ` Yury Norov
@ 2016-12-08 15:47     ` Florian Weimer
  2017-01-03 20:54       ` Pavel Machek
  0 siblings, 1 reply; 26+ messages in thread
From: Florian Weimer @ 2016-12-08 15:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: libc-alpha, linux-arch, linux-kernel, Catalin Marinas,
	szabolcs.nagy, heiko.carstens, cmetcalf, philipp.tomsich, joseph,
	zhouchengming1, Prasun.Kapoor, agraf, geert, kilobyte,
	manuel.montezelo, arnd, pinskia, linyongting, klimov.linux,
	broonie, bamvor.zhangjian, linux-arm-kernel, maxim.kuvyrkov,
	Nathan_Lynch, schwidefsky, davem, christoph.muellner

On 12/07/2016 04:48 PM, Yury Norov wrote:
> On Wed, Dec 07, 2016 at 02:23:55PM +0100, Florian Weimer wrote:
>> On 12/06/2016 07:54 PM, Yury Norov wrote:
>>> 3. Introduce new mmap64() syscall like this:
>>> sys_mmap64(void *addr, size_t len, int prot, int flags, int fd, struct off_pair *off);
>>> (The pointer here because otherwise we have 7 args, if simply pass off_hi and
>>> off_lo in registers.)
>>
>> I would prefer a batched mmap/munmap/mremap/mprotect/madvise interface, so
>> that VM changes can be coalesced and the output reduced.  This interface
>> could then be used to implement mmap on 32-bit architectures as well because
>> the offset restrictions would not apply there.
>
> Hi Florian,
>
> I frankly don't understand what you mean, All syscalls you mentioned
> doesn't take off_t or other 64-bit arguments. 'VM changes' - virtual
> memory? If so, I don't see any changes in VM with this approach, just
> correct handling of big offsets.

What I was trying to suggest is a completely different interface which 
is not subject to register size constraints and which has been requested 
before (a mechanism for batching mm updates).

Thanks,
Florian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-07 21:30             ` Arnd Bergmann
@ 2016-12-10  9:10               ` Pavel Machek
  2016-12-10  9:21                 ` Pavel Machek
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2016-12-10  9:10 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Dr. Philipp Tomsich, Catalin Marinas, Yury Norov, libc-alpha,
	linux-arch, LKML, szabolcs.nagy, heiko.carstens, cmetcalf,
	Joseph S. Myers, zhouchengming1, Kapoor, Prasun, Alexander Graf,
	geert, kilobyte, manuel.montezelo, Andrew Pinski, linyongting,
	Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

Hi!

> > Most of these advantages should eventually go away, when struct-reorg makes
> > it way into the compiler. That said, it’s a marginal (but real) improvement for a
> > subset of SPEC.
> > 
> > In the real world, the importance of ILP32 as an aid to transition legacy code
> > that is not 64bit clean… and this should drive the ILP32 discussion. That we
> > get a boost in our SPEC scores is just a nice extra that we get from it 
> 
> To bring this back from the philosophical questions of ABI design
> to the specific point of what file offset width you want for mmap()
> on 32-bit architectures.
> 
> For all I can tell, using mmap() to access a file that is many thousand
> times larger than your virtual address space is completely crazy.

Dunno. Wanting to mmap part of a partition does not seem too crazy... I'm pretty
sure there's some tool out there that uses mmap(), just because mmap() was nicer
to use then read(). And when the partition is big, the offset may be big.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-10  9:10               ` Pavel Machek
@ 2016-12-10  9:21                 ` Pavel Machek
  2016-12-11 12:56                   ` Yury Norov
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2016-12-10  9:21 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Dr. Philipp Tomsich, Catalin Marinas, Yury Norov, libc-alpha,
	linux-arch, LKML, szabolcs.nagy, heiko.carstens, cmetcalf,
	Joseph S. Myers, zhouchengming1, Kapoor, Prasun, Alexander Graf,
	geert, kilobyte, manuel.montezelo, Andrew Pinski, linyongting,
	Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Sat 2016-12-10 10:10:01, Pavel Machek wrote:
> Hi!
> 
> > > Most of these advantages should eventually go away, when struct-reorg makes
> > > it way into the compiler. That said, it’s a marginal (but real) improvement for a
> > > subset of SPEC.
> > > 
> > > In the real world, the importance of ILP32 as an aid to transition legacy code
> > > that is not 64bit clean… and this should drive the ILP32 discussion. That we
> > > get a boost in our SPEC scores is just a nice extra that we get from it 
> > 
> > To bring this back from the philosophical questions of ABI design
> > to the specific point of what file offset width you want for mmap()
> > on 32-bit architectures.
> > 
> > For all I can tell, using mmap() to access a file that is many thousand
> > times larger than your virtual address space is completely crazy.
> 
> Dunno. Wanting to mmap part of a partition does not seem too crazy... I'm pretty
> sure there's some tool out there that uses mmap(), just because mmap() was nicer
> to use then read(). And when the partition is big, the offset may be big.

Actually, if I wrote something like jpegrecover, I'd use mmap() for that (because
otherwise I'd be keeping copy of disk in anonymous memory, increasing memory pressure).

jpegrecover definitely makes sense on partitions...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-10  9:21                 ` Pavel Machek
@ 2016-12-11 12:56                   ` Yury Norov
  2016-12-11 12:56                     ` [PATCH 1/3] mm: move argument checkers of mmap_pgoff() to separated routine Yury Norov
                                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-11 12:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich, Catalin Marinas,
	libc-alpha, linux-arch, LKML, szabolcs.nagy, heiko.carstens,
	cmetcalf, Joseph S. Myers, zhouchengming1, Kapoor, Prasun,
	Alexander Graf, geert, kilobyte, manuel.montezelo, Andrew Pinski,
	linyongting, Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

This is the draft of sys_mmap64() support in the kernel. For 64-bit
kernels everything is simple. For 32-bit kernels we have a problem.
pgoff_t is declared as unsigned long, and should be turned to 
unsigned long long. It affects the number of structures and interfaces.
Last patch does the change. It should be wide-tested on 32-bit kernels
whith I didn't do. Arm64 kernel is working with this patchset, and I don't
expect difficulties there.

Yury Norov (3):
  mm: move argument checkers of mmap_pgoff() to separated routine
  sys_mmap64()
  mm: turn page offset types to 64-bit

 fs/btrfs/extent_io.c              |  2 +-
 fs/ext2/dir.c                     |  4 +--
 include/linux/mm.h                |  9 +++---
 include/linux/radix-tree.h        |  8 ++---
 include/linux/syscalls.h          |  3 ++
 include/linux/types.h             |  2 +-
 include/uapi/asm-generic/unistd.h |  4 ++-
 lib/radix-tree.c                  |  8 ++---
 mm/debug.c                        |  2 +-
 mm/internal.h                     |  2 +-
 mm/memory.c                       |  4 +--
 mm/mmap.c                         | 66 ++++++++++++++++++++++++++++++++-------
 mm/readahead.c                    |  4 +--
 mm/util.c                         |  3 +-
 14 files changed, 85 insertions(+), 36 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] mm: move argument checkers of mmap_pgoff() to separated routine
  2016-12-11 12:56                   ` Yury Norov
@ 2016-12-11 12:56                     ` Yury Norov
  2016-12-11 12:56                     ` [PATCH 2/3] sys_mmap64() Yury Norov
  2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
  2 siblings, 0 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-11 12:56 UTC (permalink / raw)
  Cc: Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich, Catalin Marinas,
	libc-alpha, linux-arch, LKML, szabolcs.nagy, heiko.carstens,
	cmetcalf, Joseph S. Myers, zhouchengming1, Kapoor, Prasun,
	Alexander Graf, geert, kilobyte, manuel.montezelo, Andrew Pinski,
	linyongting, Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 mm/mmap.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1af87c1..fc1c943 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1455,12 +1455,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	return addr;
 }
 
-SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
-		unsigned long, prot, unsigned long, flags,
-		unsigned long, fd, unsigned long, pgoff)
+static int mmap_pgoff_prepare(struct file **f, unsigned long *l,
+		unsigned long *fl, unsigned long fd)
 {
-	struct file *file = NULL;
-	unsigned long retval;
+	struct file *file = *f;
+	unsigned long flags = *fl;
+	unsigned long len = *l;
 
 	if (!(flags & MAP_ANONYMOUS)) {
 		audit_mmap_fd(fd, flags);
@@ -1469,9 +1469,10 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 			return -EBADF;
 		if (is_file_hugepages(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
-		retval = -EINVAL;
-		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
-			goto out_fput;
+		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) {
+			fput(file);
+			return -EINVAL;
+		}
 	} else if (flags & MAP_HUGETLB) {
 		struct user_struct *user = NULL;
 		struct hstate *hs;
@@ -1497,8 +1498,23 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
+	*f = file;
+	*l = len;
+	*fl = flags;
+	return 0;
+}
+
+SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, pgoff)
+{
+	struct file *file = NULL;
+	unsigned long retval;
+	int err = mmap_pgoff_prepare(&file, &len, &flags, fd);
+	if (err)
+		return err;
+
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-out_fput:
 	if (file)
 		fput(file);
 	return retval;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] sys_mmap64()
  2016-12-11 12:56                   ` Yury Norov
  2016-12-11 12:56                     ` [PATCH 1/3] mm: move argument checkers of mmap_pgoff() to separated routine Yury Norov
@ 2016-12-11 12:56                     ` Yury Norov
  2016-12-11 14:48                       ` kbuild test robot
  2016-12-11 14:56                       ` kbuild test robot
  2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
  2 siblings, 2 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-11 12:56 UTC (permalink / raw)
  Cc: Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich, Catalin Marinas,
	libc-alpha, linux-arch, LKML, szabolcs.nagy, heiko.carstens,
	cmetcalf, Joseph S. Myers, zhouchengming1, Kapoor, Prasun,
	Alexander Graf, geert, kilobyte, manuel.montezelo, Andrew Pinski,
	linyongting, Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 include/linux/syscalls.h          |  3 +++
 include/uapi/asm-generic/unistd.h |  4 +++-
 mm/mmap.c                         | 25 +++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 91a740f..869ca76 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -857,6 +857,9 @@ asmlinkage long sys_perf_event_open(
 asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
 			unsigned long prot, unsigned long flags,
 			unsigned long fd, unsigned long pgoff);
+asmlinkage long sys_mmap64(unsigned long addr, unsigned long len,
+			unsigned long prot, unsigned long flags,
+			unsigned long fd, unsigned long long *offset);
 asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
 asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
 				      struct file_handle __user *handle,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d65e232..f9ca919 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -734,9 +734,11 @@ __SYSCALL(__NR_pkey_mprotect, sys_pkey_mprotect)
 __SYSCALL(__NR_pkey_alloc,    sys_pkey_alloc)
 #define __NR_pkey_free 290
 __SYSCALL(__NR_pkey_free,     sys_pkey_free)
+#define __NR_mmap64 291
+__SYSCALL(__NR_mmap64, sys_mmap64)
 
 #undef __NR_syscalls
-#define __NR_syscalls 291
+#define __NR_syscalls 292
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/mmap.c b/mm/mmap.c
index fc1c943..6c6b95a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1504,6 +1504,31 @@ static int mmap_pgoff_prepare(struct file **f, unsigned long *l,
 	return 0;
 }
 
+SYSCALL_DEFINE6(mmap64, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long long __user *, offset)
+{
+	int err;
+	unsigned long long koffset;
+	unsigned long retval;
+	struct file *file = NULL;
+
+	if (copy_from_user(&koffset, offset, sizeof(koffset)))
+		return -EFAULT;
+	if (offset_in_page(koffset))
+		return -EINVAL;
+
+	err = mmap_pgoff_prepare(&file, &len, &flags, fd);
+	if (err)
+		return err;
+
+	retval = vm_mmap_pgoff(file, addr, len, prot,
+			flags, koffset >> PAGE_SHIFT);
+	if (file)
+		fput(file);
+	return retval;
+}
+
 SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		unsigned long, prot, unsigned long, flags,
 		unsigned long, fd, unsigned long, pgoff)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-11 12:56                   ` Yury Norov
  2016-12-11 12:56                     ` [PATCH 1/3] mm: move argument checkers of mmap_pgoff() to separated routine Yury Norov
  2016-12-11 12:56                     ` [PATCH 2/3] sys_mmap64() Yury Norov
@ 2016-12-11 12:56                     ` Yury Norov
  2016-12-11 13:31                       ` kbuild test robot
                                         ` (2 more replies)
  2 siblings, 3 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-11 12:56 UTC (permalink / raw)
  Cc: Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich, Catalin Marinas,
	libc-alpha, linux-arch, LKML, szabolcs.nagy, heiko.carstens,
	cmetcalf, Joseph S. Myers, zhouchengming1, Kapoor, Prasun,
	Alexander Graf, geert, kilobyte, manuel.montezelo, Andrew Pinski,
	linyongting, Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

Also fix related interfaces

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 fs/btrfs/extent_io.c       | 2 +-
 fs/ext2/dir.c              | 4 ++--
 include/linux/mm.h         | 9 +++++----
 include/linux/radix-tree.h | 8 ++++----
 include/linux/types.h      | 2 +-
 lib/radix-tree.c           | 8 ++++----
 mm/debug.c                 | 2 +-
 mm/internal.h              | 2 +-
 mm/memory.c                | 4 ++--
 mm/mmap.c                  | 7 ++++---
 mm/readahead.c             | 4 ++--
 mm/util.c                  | 3 ++-
 12 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8ed05d9..f4c9318 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3436,7 +3436,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		set_range_writeback(tree, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(BTRFS_I(inode)->root->fs_info,
-				   "page %lu not writeback, cur %llu end %llu",
+				   "page %llu not writeback, cur %llu end %llu",
 			       page->index, cur, end);
 		}
 
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9..c01b76e 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -174,7 +174,7 @@ static bool ext2_check_page(struct page *page, int quiet)
 	error = "inode out of bounds";
 bad_entry:
 	if (!quiet)
-		ext2_error(sb, __func__, "bad entry in directory #%lu: : %s - "
+		ext2_error(sb, __func__, "bad entry in directory #%llu: : %s - "
 			"offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
 			dir->i_ino, error, (page->index<<PAGE_SHIFT)+offs,
 			(unsigned long) le32_to_cpu(p->inode),
@@ -184,7 +184,7 @@ static bool ext2_check_page(struct page *page, int quiet)
 	if (!quiet) {
 		p = (ext2_dirent *)(kaddr + offs);
 		ext2_error(sb, "ext2_check_page",
-			"entry in directory #%lu spans the page boundary"
+			"entry in directory #%llu spans the page boundary"
 			"offset=%lu, inode=%lu",
 			dir->i_ino, (page->index<<PAGE_SHIFT)+offs,
 			(unsigned long) le32_to_cpu(p->inode));
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..33d9150 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2022,19 +2022,20 @@ extern int install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
-extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+extern unsigned long get_unmapped_area(struct file *, unsigned long,
+		unsigned long, pgoff_t, unsigned long);
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
-	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
+	unsigned long len, vm_flags_t vm_flags, pgoff_t pgoff);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
+	vm_flags_t vm_flags, pgoff_t pgoff, unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate)
+	pgoff_t pgoff, unsigned long *populate)
 {
 	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
 }
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af3581b..1781bb7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -287,7 +287,7 @@ unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
 			void **results, unsigned long first_index,
 			unsigned int max_items);
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
-			void ***results, unsigned long *indices,
+			void ***results, unsigned long long *indices,
 			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
@@ -308,7 +308,7 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
-		unsigned long *first_indexp, unsigned long last_index,
+		unsigned long long *first_indexp, unsigned long last_index,
 		unsigned long nr_to_tag,
 		unsigned int fromtag, unsigned int totag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
@@ -335,8 +335,8 @@ static inline void radix_tree_preload_end(void)
  * radix tree tag.
  */
 struct radix_tree_iter {
-	unsigned long	index;
-	unsigned long	next_index;
+	unsigned long long	index;
+	unsigned long long	next_index;
 	unsigned long	tags;
 #ifdef CONFIG_RADIX_TREE_MULTIORDER
 	unsigned int	shift;
diff --git a/include/linux/types.h b/include/linux/types.h
index baf7183..1e711c1 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -137,7 +137,7 @@ typedef unsigned long blkcnt_t;
 /*
  * The type of an index into the pagecache.
  */
-#define pgoff_t unsigned long
+#define pgoff_t unsigned long long
 
 /*
  * A dma_addr_t can hold any valid DMA address, i.e., any address returned
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 0d1d23e..afb49381 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -97,7 +97,7 @@ static inline unsigned long get_slot_offset(struct radix_tree_node *parent,
 }
 
 static unsigned int radix_tree_descend(struct radix_tree_node *parent,
-			struct radix_tree_node **nodep, unsigned long index)
+			struct radix_tree_node **nodep, unsigned long long index)
 {
 	unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK;
 	void **entry = rcu_dereference_raw(parent->slots[offset]);
@@ -1040,14 +1040,14 @@ EXPORT_SYMBOL(radix_tree_next_chunk);
  * be prepared to handle that.
  */
 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
-		unsigned long *first_indexp, unsigned long last_index,
+		unsigned long long *first_indexp, unsigned long last_index,
 		unsigned long nr_to_tag,
 		unsigned int iftag, unsigned int settag)
 {
 	struct radix_tree_node *parent, *node, *child;
 	unsigned long maxindex;
 	unsigned long tagged = 0;
-	unsigned long index = *first_indexp;
+	unsigned long long index = *first_indexp;
 
 	radix_tree_load_root(root, &child, &maxindex);
 	last_index = min(last_index, maxindex);
@@ -1195,7 +1195,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  */
 unsigned int
 radix_tree_gang_lookup_slot(struct radix_tree_root *root,
-			void ***results, unsigned long *indices,
+			void ***results, unsigned long long *indices,
 			unsigned long first_index, unsigned int max_items)
 {
 	struct radix_tree_iter iter;
diff --git a/mm/debug.c b/mm/debug.c
index 9feb699..a568fc8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -49,7 +49,7 @@ void __dump_page(struct page *page, const char *reason)
 	 */
 	int mapcount = PageSlab(page) ? 0 : page_mapcount(page);
 
-	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
+	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#llx",
 		  page, page_ref_count(page), mapcount,
 		  page->mapping, page_to_pgoff(page));
 	if (PageCompound(page))
diff --git a/mm/internal.h b/mm/internal.h
index 537ac99..8027eac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -447,7 +447,7 @@ extern u32 hwpoison_filter_enable;
 
 extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
-        unsigned long, unsigned long);
+        unsigned long, pgoff_t);
 
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
diff --git a/mm/memory.c b/mm/memory.c
index e18c57b..c05d534 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -688,7 +688,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
 	if (page)
 		dump_page(page, "bad pte");
-	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
+	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%llx\n",
 		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
 	/*
 	 * Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y
@@ -3133,7 +3133,7 @@ static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
 	end_pgoff = start_pgoff -
 		((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
 		PTRS_PER_PTE - 1;
-	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+	end_pgoff = min3(end_pgoff, (pgoff_t) vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
 			start_pgoff + nr_pages - 1);
 
 	if (pmd_none(*fe->pmd)) {
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c6b95a..cf50232 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -9,6 +9,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/kernel.h>
+#include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/backing-dev.h>
 #include <linux/mm.h>
@@ -1304,7 +1305,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
-			unsigned long pgoff, unsigned long *populate)
+			pgoff_t pgoff, unsigned long *populate)
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
@@ -1624,7 +1625,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
+		unsigned long len, vm_flags_t vm_flags, pgoff_t pgoff)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -2088,7 +2089,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 
 unsigned long
 get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
-		unsigned long pgoff, unsigned long flags)
+		pgoff_t pgoff, unsigned long flags)
 {
 	unsigned long (*get_area)(struct file *, unsigned long,
 				  unsigned long, unsigned long, unsigned long);
diff --git a/mm/readahead.c b/mm/readahead.c
index c8a955b..902bad8 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -332,8 +332,8 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 static int try_context_readahead(struct address_space *mapping,
 				 struct file_ra_state *ra,
 				 pgoff_t offset,
-				 unsigned long req_size,
-				 unsigned long max)
+				 unsigned long long req_size,
+				 unsigned long long max)
 {
 	pgoff_t size;
 
diff --git a/mm/util.c b/mm/util.c
index 1a41553..51fae99 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -2,6 +2,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/types.h>
 #include <linux/export.h>
 #include <linux/err.h>
 #include <linux/sched.h>
@@ -292,7 +293,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
-	unsigned long flag, unsigned long pgoff)
+	unsigned long flag, pgoff_t pgoff)
 {
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
@ 2016-12-11 13:31                       ` kbuild test robot
  2016-12-11 13:41                       ` kbuild test robot
  2016-12-11 14:59                       ` Arnd Bergmann
  2 siblings, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2016-12-11 13:31 UTC (permalink / raw)
  To: Yury Norov
  Cc: kbuild-all, Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich,
	Catalin Marinas, libc-alpha, linux-arch, LKML, szabolcs.nagy,
	heiko.carstens, cmetcalf, Joseph S. Myers, zhouchengming1,
	Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 2830 bytes --]

Hi Yury,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc8]
[cannot apply to mmotm/master next-20161209]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yury-Norov/mm-move-argument-checkers-of-mmap_pgoff-to-separated-routine/20161211-211314
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   arch/x86/mm/pgtable.c: In function 'pgd_set_mm':
>> arch/x86/mm/pgtable.c:108:29: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     virt_to_page(pgd)->index = (pgoff_t)mm;
                                ^
   arch/x86/mm/pgtable.c: In function 'pgd_page_get_mm':
>> arch/x86/mm/pgtable.c:113:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     return (struct mm_struct *)page->index;
            ^
--
   mm/percpu.c: In function 'pcpu_get_page_chunk':
>> mm/percpu.c:240:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     return (struct pcpu_chunk *)page->index;
            ^

vim +108 arch/x86/mm/pgtable.c

68db065c Jeremy Fitzhardinge 2008-03-17  102  	(SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
4f76cd38 Jeremy Fitzhardinge 2008-03-17  103  
617d34d9 Jeremy Fitzhardinge 2010-09-21  104  
617d34d9 Jeremy Fitzhardinge 2010-09-21  105  static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
617d34d9 Jeremy Fitzhardinge 2010-09-21  106  {
617d34d9 Jeremy Fitzhardinge 2010-09-21  107  	BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm));
617d34d9 Jeremy Fitzhardinge 2010-09-21 @108  	virt_to_page(pgd)->index = (pgoff_t)mm;
617d34d9 Jeremy Fitzhardinge 2010-09-21  109  }
617d34d9 Jeremy Fitzhardinge 2010-09-21  110  
617d34d9 Jeremy Fitzhardinge 2010-09-21  111  struct mm_struct *pgd_page_get_mm(struct page *page)
617d34d9 Jeremy Fitzhardinge 2010-09-21  112  {
617d34d9 Jeremy Fitzhardinge 2010-09-21 @113  	return (struct mm_struct *)page->index;
617d34d9 Jeremy Fitzhardinge 2010-09-21  114  }
617d34d9 Jeremy Fitzhardinge 2010-09-21  115  
617d34d9 Jeremy Fitzhardinge 2010-09-21  116  static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)

:::::: The code at line 108 was first introduced by commit
:::::: 617d34d9e5d8326ec8f188c616aa06ac59d083fe x86, mm: Hold mm->page_table_lock while doing vmalloc_sync

:::::: TO: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
:::::: CC: H. Peter Anvin <hpa@linux.intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6363 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
  2016-12-11 13:31                       ` kbuild test robot
@ 2016-12-11 13:41                       ` kbuild test robot
  2016-12-11 14:59                       ` Arnd Bergmann
  2 siblings, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2016-12-11 13:41 UTC (permalink / raw)
  To: Yury Norov
  Cc: kbuild-all, Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich,
	Catalin Marinas, libc-alpha, linux-arch, LKML, szabolcs.nagy,
	heiko.carstens, cmetcalf, Joseph S. Myers, zhouchengming1,
	Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 2681 bytes --]

Hi Yury,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc8]
[cannot apply to mmotm/master next-20161209]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yury-Norov/mm-move-argument-checkers-of-mmap_pgoff-to-separated-routine/20161211-211314
config: i386-randconfig-x003-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   fs/ext2/dir.c: In function 'ext2_check_page':
>> fs/ext2/dir.c:177:56: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'long unsigned int' [-Wformat=]
      ext2_error(sb, __func__, "bad entry in directory #%llu: : %s - "
                                                           ^
>> fs/ext2/dir.c:177:28: warning: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'long long unsigned int' [-Wformat=]
      ext2_error(sb, __func__, "bad entry in directory #%llu: : %s - "
                               ^
   fs/ext2/dir.c:187:28: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'long unsigned int' [-Wformat=]
       "entry in directory #%llu spans the page boundary"
                               ^
   fs/ext2/dir.c:187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'long long unsigned int' [-Wformat=]
       "entry in directory #%llu spans the page boundary"
       ^

vim +177 fs/ext2/dir.c

   161	Eshort:
   162		error = "rec_len is smaller than minimal";
   163		goto bad_entry;
   164	Ealign:
   165		error = "unaligned directory entry";
   166		goto bad_entry;
   167	Enamelen:
   168		error = "rec_len is too small for name_len";
   169		goto bad_entry;
   170	Espan:
   171		error = "directory entry across blocks";
   172		goto bad_entry;
   173	Einumber:
   174		error = "inode out of bounds";
   175	bad_entry:
   176		if (!quiet)
 > 177			ext2_error(sb, __func__, "bad entry in directory #%llu: : %s - "
   178				"offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
   179				dir->i_ino, error, (page->index<<PAGE_SHIFT)+offs,
   180				(unsigned long) le32_to_cpu(p->inode),
   181				rec_len, p->name_len);
   182		goto fail;
   183	Eend:
   184		if (!quiet) {
   185			p = (ext2_dirent *)(kaddr + offs);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26862 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] sys_mmap64()
  2016-12-11 12:56                     ` [PATCH 2/3] sys_mmap64() Yury Norov
@ 2016-12-11 14:48                       ` kbuild test robot
  2016-12-11 14:56                       ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2016-12-11 14:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: kbuild-all, Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich,
	Catalin Marinas, libc-alpha, linux-arch, LKML, szabolcs.nagy,
	heiko.carstens, cmetcalf, Joseph S. Myers, zhouchengming1,
	Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

Hi Yury,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9-rc8 next-20161209]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yury-Norov/mm-move-argument-checkers-of-mmap_pgoff-to-separated-routine/20161211-211314
config: h8300-h8300h-sim_defconfig (attached as .config)
compiler: h8300-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=h8300 

All errors (new ones prefixed by >>):

>> arch/h8300/kernel/built-in.o:(.data+0x48c): undefined reference to `sys_mmap64'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 4484 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] sys_mmap64()
  2016-12-11 12:56                     ` [PATCH 2/3] sys_mmap64() Yury Norov
  2016-12-11 14:48                       ` kbuild test robot
@ 2016-12-11 14:56                       ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2016-12-11 14:56 UTC (permalink / raw)
  To: Yury Norov
  Cc: kbuild-all, Yury Norov, Arnd Bergmann, Dr. Philipp Tomsich,
	Catalin Marinas, libc-alpha, linux-arch, LKML, szabolcs.nagy,
	heiko.carstens, cmetcalf, Joseph S. Myers, zhouchengming1,
	Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 988 bytes --]

Hi Yury,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9-rc8 next-20161209]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yury-Norov/mm-move-argument-checkers-of-mmap_pgoff-to-separated-routine/20161211-211314
config: c6x-evmc6457_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=c6x 

All errors (new ones prefixed by >>):

>> arch/c6x/kernel/built-in.o:(.fardata+0xb8c): undefined reference to `sys_mmap64'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 5162 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
  2016-12-11 13:31                       ` kbuild test robot
  2016-12-11 13:41                       ` kbuild test robot
@ 2016-12-11 14:59                       ` Arnd Bergmann
  2016-12-16 10:55                         ` Yury Norov
  2 siblings, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2016-12-11 14:59 UTC (permalink / raw)
  To: libc-alpha
  Cc: Yury Norov, Dr. Philipp Tomsich, Catalin Marinas, linux-arch,
	LKML, szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Sunday, December 11, 2016 6:26:42 PM CET Yury Norov wrote:
> Also fix related interfaces
> 
> Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
> ---
>  fs/btrfs/extent_io.c       | 2 +-
>  fs/ext2/dir.c              | 4 ++--
>  include/linux/mm.h         | 9 +++++----
>  include/linux/radix-tree.h | 8 ++++----
>  include/linux/types.h      | 2 +-
>  lib/radix-tree.c           | 8 ++++----
>  mm/debug.c                 | 2 +-
>  mm/internal.h              | 2 +-
>  mm/memory.c                | 4 ++--
>  mm/mmap.c                  | 7 ++++---
>  mm/readahead.c             | 4 ++--
>  mm/util.c                  | 3 ++-
>  12 files changed, 29 insertions(+), 26 deletions(-)
> 

Thanks Yury for the demonstration. I think this would put the nail
in the coffin of the idea of mmap64 even for Pavel, who didn't
seem convinced already.

Changing all those interfaces and structure, struct page in particular,
is clearly too costly for any advantage we might have otherwise
gained.

	Arnd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-11 14:59                       ` Arnd Bergmann
@ 2016-12-16 10:55                         ` Yury Norov
  2016-12-16 11:02                           ` Arnd Bergmann
  2016-12-18  9:23                           ` Christoph Hellwig
  0 siblings, 2 replies; 26+ messages in thread
From: Yury Norov @ 2016-12-16 10:55 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: libc-alpha, Dr. Philipp Tomsich, Catalin Marinas, linux-arch,
	LKML, szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Sun, Dec 11, 2016 at 03:59:01PM +0100, Arnd Bergmann wrote:
> On Sunday, December 11, 2016 6:26:42 PM CET Yury Norov wrote:
> > Also fix related interfaces
> > 
> > Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
> 
> Thanks Yury for the demonstration. I think this would put the nail
> in the coffin of the idea of mmap64 even for Pavel, who didn't
> seem convinced already.
> 
> Changing all those interfaces and structure, struct page in particular,
> is clearly too costly for any advantage we might have otherwise
> gained.
> 
> 	Arnd

To be complete, we have 3 options:
 1 leave things as is. 32-bit architectures will have no option to
   mmap big offsets, and no one cares - as usual.
 2 add mmap64() for compat arches only. This way we don't need patch
   3, and arches like aarch32 or aarch64/ilp32 will enjoy true 64-bit
   offsets.
 3 introduce CONFIG_64_BIT_PGOFF_T, and let Pavel enable it if he has
   to work with big files on 32-bit arches.

The most realistic approach for me is 1 because I never heard about
64-bit pgoff_t requests, except Pavel's one. Thinking about
aarch64/ilp32, we probably need second approach. This is only 2 simple
patches that are already there, and one patch in glibc. It will let
32-bit software work in 64-bit environment more smoothly. Cavium
people should be completely satisfied with 2.

Third is more looking like research exercise than something we need
in practice.

The only thing that makes me sad is that we proudly declare 64-bit
off_t in new 32-bit ABIs but in fact we lie, at least in this
specific case. We should add corresponding checks on glibc side at
least. It's also simple.

Yury.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-16 10:55                         ` Yury Norov
@ 2016-12-16 11:02                           ` Arnd Bergmann
  2016-12-18  9:23                           ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Arnd Bergmann @ 2016-12-16 11:02 UTC (permalink / raw)
  To: Yury Norov
  Cc: libc-alpha, Dr. Philipp Tomsich, Catalin Marinas, linux-arch,
	LKML, szabolcs.nagy, heiko.carstens, cmetcalf, Joseph S. Myers,
	zhouchengming1, Kapoor, Prasun, Alexander Graf, geert, kilobyte,
	manuel.montezelo, Andrew Pinski, linyongting, Alexey Klimov,
	broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Friday, December 16, 2016 4:25:14 PM CET Yury Norov wrote:
> On Sun, Dec 11, 2016 at 03:59:01PM +0100, Arnd Bergmann wrote:
> > On Sunday, December 11, 2016 6:26:42 PM CET Yury Norov wrote:
> > > Also fix related interfaces
> > > 
> > > Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
> > 
> > Thanks Yury for the demonstration. I think this would put the nail
> > in the coffin of the idea of mmap64 even for Pavel, who didn't
> > seem convinced already.
> > 
> > Changing all those interfaces and structure, struct page in particular,
> > is clearly too costly for any advantage we might have otherwise
> > gained.
> > 
> >       Arnd
> 
> To be complete, we have 3 options:
>  1 leave things as is. 32-bit architectures will have no option to
>    mmap big offsets, and no one cares - as usual.
>  2 add mmap64() for compat arches only. This way we don't need patch
>    3, and arches like aarch32 or aarch64/ilp32 will enjoy true 64-bit
>    offsets.
>  3 introduce CONFIG_64_BIT_PGOFF_T, and let Pavel enable it if he has
>    to work with big files on 32-bit arches.
> 
> The most realistic approach for me is 1 because I never heard about
> 64-bit pgoff_t requests, except Pavel's one. Thinking about
> aarch64/ilp32, we probably need second approach. This is only 2 simple
> patches that are already there, and one patch in glibc. It will let
> 32-bit software work in 64-bit environment more smoothly. Cavium
> people should be completely satisfied with 2.

Agreed: If there is a serious request from Cavium or Huawei (which
are also very interested in this feature) and a specific use case,
we can still do 2 easily.

> Third is more looking like research exercise than something we need
> in practice.

Right.

> The only thing that makes me sad is that we proudly declare 64-bit
> off_t in new 32-bit ABIs but in fact we lie, at least in this
> specific case. We should add corresponding checks on glibc side at
> least. It's also simple.

Well, the only thing we are really saying there is that we support
more than 32-bit, and that the ABI uses 64-bit. Actually doing 64-bit
offsets within (very sparse) files probably also fails on 64-bit
architectures, at least on some file systems.

	Arnd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] mm: make pagoff_t type 64-bit
  2016-12-16 10:55                         ` Yury Norov
  2016-12-16 11:02                           ` Arnd Bergmann
@ 2016-12-18  9:23                           ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2016-12-18  9:23 UTC (permalink / raw)
  To: Yury Norov
  Cc: Arnd Bergmann, libc-alpha, Dr. Philipp Tomsich, Catalin Marinas,
	linux-arch, LKML, szabolcs.nagy, heiko.carstens, cmetcalf,
	Joseph S. Myers, zhouchengming1, Kapoor, Prasun, Alexander Graf,
	geert, kilobyte, manuel.montezelo, Andrew Pinski, linyongting,
	Alexey Klimov, broonie, Zhangjian (Bamvor),
	linux-arm-kernel, Maxim Kuvyrkov, Nathan Lynch,
	Martin Schwidefsky, davem, christoph.muellner

On Fri, Dec 16, 2016 at 04:25:14PM +0530, Yury Norov wrote:
>  1 leave things as is. 32-bit architectures will have no option to
>    mmap big offsets, and no one cares - as usual.

And that's the only sensible option.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2016-12-08 15:47     ` Florian Weimer
@ 2017-01-03 20:54       ` Pavel Machek
  2017-01-12 16:13         ` Florian Weimer
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2017-01-03 20:54 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yury Norov, libc-alpha, linux-arch, linux-kernel,
	Catalin Marinas, szabolcs.nagy, heiko.carstens, cmetcalf,
	philipp.tomsich, joseph, zhouchengming1, Prasun.Kapoor, agraf,
	geert, kilobyte, manuel.montezelo, arnd, pinskia, linyongting,
	klimov.linux, broonie, bamvor.zhangjian, linux-arm-kernel,
	maxim.kuvyrkov, Nathan_Lynch, schwidefsky, davem,
	christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 1060 bytes --]



> >Hi Florian,
> >
> >I frankly don't understand what you mean, All syscalls you mentioned
> >doesn't take off_t or other 64-bit arguments. 'VM changes' - virtual
> >memory? If so, I don't see any changes in VM with this approach, just
> >correct handling of big offsets.
> 
> What I was trying to suggest is a completely different interface which is
> not subject to register size constraints and which has been requested before
> (a mechanism for batching mm updates).

While I agree that batching might be good idea, I believe mmap64()
makes sense, too. Yes, I guess libc could do the translation, but
indirection will cost some performance, and will be problematic for
stuff such as strace.

...actually, with strace and batched interface, it will be impossible
to see what is going on because of races. So I'm not sure if I like
the batched interface at all...

Best regards,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2017-01-03 20:54       ` Pavel Machek
@ 2017-01-12 16:13         ` Florian Weimer
  2017-01-12 21:51           ` Pavel Machek
  0 siblings, 1 reply; 26+ messages in thread
From: Florian Weimer @ 2017-01-12 16:13 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Yury Norov, libc-alpha, linux-arch, linux-kernel,
	Catalin Marinas, szabolcs.nagy, heiko.carstens, cmetcalf,
	philipp.tomsich, joseph, zhouchengming1, Prasun.Kapoor, agraf,
	geert, kilobyte, manuel.montezelo, arnd, pinskia, linyongting,
	klimov.linux, broonie, bamvor.zhangjian, linux-arm-kernel,
	maxim.kuvyrkov, Nathan_Lynch, schwidefsky, davem,
	christoph.muellner

On 01/03/2017 09:54 PM, Pavel Machek wrote:
> ...actually, with strace and batched interface, it will be impossible
> to see what is going on because of races. So I'm not sure if I like
> the batched interface at all...

I'm not sure if I understand this problem.

ioctl, fcntl, most socket system calls, even open all have this problem 
as well, right?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Question] New mmap64 syscall?
  2017-01-12 16:13         ` Florian Weimer
@ 2017-01-12 21:51           ` Pavel Machek
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Machek @ 2017-01-12 21:51 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yury Norov, libc-alpha, linux-arch, linux-kernel,
	Catalin Marinas, szabolcs.nagy, heiko.carstens, cmetcalf,
	philipp.tomsich, joseph, zhouchengming1, Prasun.Kapoor, agraf,
	geert, kilobyte, manuel.montezelo, arnd, pinskia, linyongting,
	klimov.linux, broonie, bamvor.zhangjian, linux-arm-kernel,
	maxim.kuvyrkov, Nathan_Lynch, schwidefsky, davem,
	christoph.muellner

[-- Attachment #1: Type: text/plain, Size: 750 bytes --]

On Thu 2017-01-12 17:13:25, Florian Weimer wrote:
> On 01/03/2017 09:54 PM, Pavel Machek wrote:
> >...actually, with strace and batched interface, it will be impossible
> >to see what is going on because of races. So I'm not sure if I like
> >the batched interface at all...
> 
> I'm not sure if I understand this problem.
> 
> ioctl, fcntl, most socket system calls, even open all have this problem as
> well, right?

Yes, ioctl() and similar are problematic. Still it is possible to
implement secure sandbox using ptrace. Dealing with indirect mmap()
would difficult AFAICT.

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-01-12 21:51 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-06 18:54 [Question] New mmap64 syscall? Yury Norov
2016-12-06 21:20 ` Arnd Bergmann
2016-12-07 10:34   ` Yury Norov
     [not found]     ` <0F280FED-870A-42B5-ABC4-1976ACA32462@theobroma-systems.com>
     [not found]       ` <20161207123944.GA11799@yury-N73SV>
2016-12-07 16:32         ` Catalin Marinas
2016-12-07 16:43           ` Dr. Philipp Tomsich
2016-12-07 21:30             ` Arnd Bergmann
2016-12-10  9:10               ` Pavel Machek
2016-12-10  9:21                 ` Pavel Machek
2016-12-11 12:56                   ` Yury Norov
2016-12-11 12:56                     ` [PATCH 1/3] mm: move argument checkers of mmap_pgoff() to separated routine Yury Norov
2016-12-11 12:56                     ` [PATCH 2/3] sys_mmap64() Yury Norov
2016-12-11 14:48                       ` kbuild test robot
2016-12-11 14:56                       ` kbuild test robot
2016-12-11 12:56                     ` [PATCH 3/3] mm: make pagoff_t type 64-bit Yury Norov
2016-12-11 13:31                       ` kbuild test robot
2016-12-11 13:41                       ` kbuild test robot
2016-12-11 14:59                       ` Arnd Bergmann
2016-12-16 10:55                         ` Yury Norov
2016-12-16 11:02                           ` Arnd Bergmann
2016-12-18  9:23                           ` Christoph Hellwig
2016-12-07 13:23 ` [Question] New mmap64 syscall? Florian Weimer
2016-12-07 15:48   ` Yury Norov
2016-12-08 15:47     ` Florian Weimer
2017-01-03 20:54       ` Pavel Machek
2017-01-12 16:13         ` Florian Weimer
2017-01-12 21:51           ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).