All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] stat: don't fail if the major number is >= 256
@ 2022-04-11 14:43 Mikulas Patocka
  2022-04-11 15:03 ` Matthew Wilcox
  2022-04-11 16:30 ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Mikulas Patocka @ 2022-04-11 14:43 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

If you run a program compiled with OpenWatcom for Linux on a filesystem on 
NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the 
NVMe driver allocates a device with the major number 259 and it doesn't 
pass the "old_valid_dev" test.

This patch removes the tests - it's better to wrap around than to return
an error. (note that cp_old_stat also doesn't report an error and wraps
the number around)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/stat.c |    6 ------
 1 file changed, 6 deletions(-)

Index: linux-5.17.2/fs/stat.c
===================================================================
--- linux-5.17.2.orig/fs/stat.c	2022-04-10 21:39:27.000000000 +0200
+++ linux-5.17.2/fs/stat.c	2022-04-10 21:42:43.000000000 +0200
@@ -334,7 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd,
 #  define choose_32_64(a,b) b
 #endif
 
-#define valid_dev(x)  choose_32_64(old_valid_dev(x),true)
 #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)
 
 #ifndef INIT_STRUCT_STAT_PADDING
@@ -345,8 +344,6 @@ static int cp_new_stat(struct kstat *sta
 {
 	struct stat tmp;
 
-	if (!valid_dev(stat->dev) || !valid_dev(stat->rdev))
-		return -EOVERFLOW;
 #if BITS_PER_LONG == 32
 	if (stat->size > MAX_NON_LFS)
 		return -EOVERFLOW;
@@ -644,9 +641,6 @@ static int cp_compat_stat(struct kstat *
 {
 	struct compat_stat tmp;
 
-	if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev))
-		return -EOVERFLOW;
-
 	memset(&tmp, 0, sizeof(tmp));
 	tmp.st_dev = old_encode_dev(stat->dev);
 	tmp.st_ino = stat->ino;


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka
@ 2022-04-11 15:03 ` Matthew Wilcox
  2022-04-11 15:16   ` Christoph Hellwig
  2022-04-11 16:30 ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2022-04-11 15:03 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Alexander Viro, linux-fsdevel, linux-kernel

On Mon, Apr 11, 2022 at 10:43:33AM -0400, Mikulas Patocka wrote:
> If you run a program compiled with OpenWatcom for Linux on a filesystem on 
> NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the 
> NVMe driver allocates a device with the major number 259 and it doesn't 
> pass the "old_valid_dev" test.
> 
> This patch removes the tests - it's better to wrap around than to return
> an error. (note that cp_old_stat also doesn't report an error and wraps
> the number around)

Is it better?  You've done a good job arguing why it is for this particular
situation, but if there's a program which compares files by
st_dev+st_ino, it might think two files are identical when they're
actually different and, eg, skip backing up a file because it thinks
it already did it.  That would be a silent failure, which is worse
than this noisy failure.

The real problem is clearly that Linus denied my request for a real
major number for NVMe back in 2012 or whenever it was :-P

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  fs/stat.c |    6 ------
>  1 file changed, 6 deletions(-)
> 
> Index: linux-5.17.2/fs/stat.c
> ===================================================================
> --- linux-5.17.2.orig/fs/stat.c	2022-04-10 21:39:27.000000000 +0200
> +++ linux-5.17.2/fs/stat.c	2022-04-10 21:42:43.000000000 +0200
> @@ -334,7 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd,
>  #  define choose_32_64(a,b) b
>  #endif
>  
> -#define valid_dev(x)  choose_32_64(old_valid_dev(x),true)
>  #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)
>  
>  #ifndef INIT_STRUCT_STAT_PADDING
> @@ -345,8 +344,6 @@ static int cp_new_stat(struct kstat *sta
>  {
>  	struct stat tmp;
>  
> -	if (!valid_dev(stat->dev) || !valid_dev(stat->rdev))
> -		return -EOVERFLOW;
>  #if BITS_PER_LONG == 32
>  	if (stat->size > MAX_NON_LFS)
>  		return -EOVERFLOW;
> @@ -644,9 +641,6 @@ static int cp_compat_stat(struct kstat *
>  {
>  	struct compat_stat tmp;
>  
> -	if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev))
> -		return -EOVERFLOW;
> -
>  	memset(&tmp, 0, sizeof(tmp));
>  	tmp.st_dev = old_encode_dev(stat->dev);
>  	tmp.st_ino = stat->ino;
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-11 15:03 ` Matthew Wilcox
@ 2022-04-11 15:16   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2022-04-11 15:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mikulas Patocka, Linus Torvalds, Alexander Viro, linux-fsdevel,
	linux-kernel

On Mon, Apr 11, 2022 at 04:03:37PM +0100, Matthew Wilcox wrote:
> Is it better?  You've done a good job arguing why it is for this particular
> situation, but if there's a program which compares files by
> st_dev+st_ino, it might think two files are identical when they're
> actually different and, eg, skip backing up a file because it thinks
> it already did it.  That would be a silent failure, which is worse
> than this noisy failure.

Agreed.

> The real problem is clearly that Linus denied my request for a real
> major number for NVMe back in 2012 or whenever it was :-P

I don't think that is the real probem.  The whole dynamic dev_t scheme
has worked out really well and I'm glade drivers don't need a major
allocation anymore.  But I'm not sure why the dynamic major had to be
past 255 given these legacy issues, especially as there are plenty of
free ones with lower numbers.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka
  2022-04-11 15:03 ` Matthew Wilcox
@ 2022-04-11 16:30 ` Linus Torvalds
  2022-04-11 17:13   ` Mikulas Patocka
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2022-04-11 16:30 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Mon, Apr 11, 2022 at 4:43 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> If you run a program compiled with OpenWatcom for Linux on a filesystem on
> NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the
> NVMe driver allocates a device with the major number 259 and it doesn't
> pass the "old_valid_dev" test.

OpenWatcom? Really?

> This patch removes the tests - it's better to wrap around than to return
> an error. (note that cp_old_stat also doesn't report an error and wraps
> the number around)

Hmm. We've used majors over 256 for a long time, but some of them are
admittedly very rare (SCSI OSD?)

Unfortunate. And in this case 259 aliases to 3, which is the old
HD/IDE0 major number. That's not great - there would be other numbers
that didn't have that problem (ie 4-6 are all currently only character
device majors, I think).

Anyway, I think that check is just bogus. The cp_new_stat() thing uses
'struct stat' and it has

        unsigned long   st_dev;         /* Device.  */
        unsigned long   st_rdev;        /* Device number, if device.  */

so there's no reason to limit things to the old 8-bit behavior.

Yes, it does that

  #define valid_dev(x)  choose_32_64(old_valid_dev(x),true)
  #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)

  static __always_inline u16 old_encode_dev(dev_t dev)
  {
        return (MAJOR(dev) << 8) | MINOR(dev);
  }

which currently drops bits, but we should just *fix* that. We can put
the high bits in the upper bits, not limit it to 16 bits when we have
more space than that.

Even the *really* old 'struct old_stat' doesn't really have a 16-bit
st_dev/rdev.

           Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-11 16:30 ` Linus Torvalds
@ 2022-04-11 17:13   ` Mikulas Patocka
  2022-04-12  5:37     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Mikulas Patocka @ 2022-04-11 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox



On Mon, 11 Apr 2022, Linus Torvalds wrote:

> On Mon, Apr 11, 2022 at 4:43 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > If you run a program compiled with OpenWatcom for Linux on a filesystem on
> > NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the
> > NVMe driver allocates a device with the major number 259 and it doesn't
> > pass the "old_valid_dev" test.
> 
> OpenWatcom? Really?

Yes. I use OpenWatcom to verify that my programs are clean ANSI C without 
any gccisms.

Other than that, it is not much useful - it has it's own libc, it's own 
module format, and programs compiled with OpenWatcom cannot be linked with 
existing *.a or *.so libraries.

> > This patch removes the tests - it's better to wrap around than to return
> > an error. (note that cp_old_stat also doesn't report an error and wraps
> > the number around)
> 
> Hmm. We've used majors over 256 for a long time, but some of them are
> admittedly very rare (SCSI OSD?)
> 
> Unfortunate. And in this case 259 aliases to 3, which is the old
> HD/IDE0 major number. That's not great - there would be other numbers
> that didn't have that problem (ie 4-6 are all currently only character
> device majors, I think).

Should we perhaps hash the number, take 16 bits of the hash and hope 
than the collision won't happen?

> Anyway, I think that check is just bogus. The cp_new_stat() thing uses
> 'struct stat' and it has
> 
>         unsigned long   st_dev;         /* Device.  */
>         unsigned long   st_rdev;        /* Device number, if device.  */
> 
> so there's no reason to limit things to the old 8-bit behavior.
> 
> Yes, it does that
> 
>   #define valid_dev(x)  choose_32_64(old_valid_dev(x),true)
>   #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)
> 
>   static __always_inline u16 old_encode_dev(dev_t dev)
>   {
>         return (MAJOR(dev) << 8) | MINOR(dev);
>   }
> 
> which currently drops bits, but we should just *fix* that. We can put
> the high bits in the upper bits, not limit it to 16 bits when we have
> more space than that.

Yes - we can return values larger than 16-bit here. But there's a risk 
that the userspace code will extract the values using macros like this and 
lose the upper bits:

#define major(device)           ((int)(((device) >> 8) & 0xFF))
#define minor(device)           ((int)((device) & 0xff))

> Even the *really* old 'struct old_stat' doesn't really have a 16-bit
> st_dev/rdev.
> 
>            Linus

For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In 
struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev 
are compat_dev_t which is 16-bit. But they are followed by 16-bit 
paddings, so they could be extended.

If you have a native 32-bit kernel, it uses 'struct stat' defined at the 
beginning of arch/x86/include/uapi/asm/stat.h that has 32-bit st_dev and 
st_rdev. If you use a 64-bit kernel with 32-bit compat, it uses 'struct 
compat_stat' defined in arch/x86/include/asm/compat.h that has 16-bit 
st_dev and st_rdev. That's an inconsistency that should be resolved.

What did glibc do? Did it use 16-bit dev_t with following padding or 
32-bit dev_t? (the current glibc just uses stat64 and 64-bit dev_t always)

Mikulas


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-11 17:13   ` Mikulas Patocka
@ 2022-04-12  5:37     ` Linus Torvalds
  2022-04-12  5:41       ` Christoph Hellwig
                         ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Linus Torvalds @ 2022-04-12  5:37 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 7:13 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Should we perhaps hash the number, take 16 bits of the hash and hope
> than the collision won't happen?

That would "work", but I think it would be incredibly annoying to
users with basically random results.

I think the solution is to just put the bits in the high bits. Yes,
they might be masked off if people use 'MAJOR()' to pick them out, but
the common "compare st_dev and st_ino" model at least works. That's
the one that wants unique numbers.

> For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In
> struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev
> are compat_dev_t which is 16-bit. But they are followed by 16-bit
> paddings, so they could be extended.

Ok, that actually looks like a bug.

The compat structure should match the native structure.  Those "u16
__padX" fields seem to be just a symptom of the bug.

The only user of that compat_stat structure is the kernel, so that
should just be fixed.

Of course, who knows what the libraries have done, so user space could
still have screwed up.

> If you have a native 32-bit kernel, it uses 'struct stat' defined at the
> beginning of arch/x86/include/uapi/asm/stat.h that has 32-bit st_dev and
> st_rdev.

Correct. It's literally the compat structure that has no basis in reality.

Or it might be some truly ancient thing, but I really don't think so.

Regardless, if we fill in the high 16 bits of that field, in the
_worst_ case things will act as if your patch had been applied, but in
any sane case you'll have that working "unique st_dev" thing.

I'd love to actually use the better "modern" 64-bit encoding (12+20
bits, or whatever it is we do, too lazy to look it up), but hey, that
historical thing is what it is.

Realistically, nobody uses it. Apparently your OpenWatcom use is also
really just fairly theoretical, not some "this app is used by people
and doesn't work with a filesystem on NVMe".

                 Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-12  5:37     ` Linus Torvalds
@ 2022-04-12  5:41       ` Christoph Hellwig
  2022-04-12  5:47         ` Linus Torvalds
  2022-04-12  6:49       ` Linus Torvalds
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2022-04-12  5:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 07:37:44PM -1000, Linus Torvalds wrote:
> Realistically, nobody uses it. Apparently your OpenWatcom use is also
> really just fairly theoretical, not some "this app is used by people
> and doesn't work with a filesystem on NVMe".

And that is easily fixed by using a lower major for the block dynamic
dev_t.  In theory userspace should be able to copy with any major for
it, but I fear something might break so we could make it configurable.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-12  5:41       ` Christoph Hellwig
@ 2022-04-12  5:47         ` Linus Torvalds
  2022-04-12  5:52           ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2022-04-12  5:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 7:41 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> And that is easily fixed by using a lower major for the block dynamic
> dev_t.  In theory userspace should be able to copy with any major for
> it, but I fear something might break so we could make it configurable.

We actually have a ton of major numbers free, although it's not
obvious because of the whole "they are used by character devices, not
block devices" issue.

Ie 4-6, 12, 14, 19 are all free, afaik.

But yeah, somebody might have a static /dev for some odd embedded case, I guess.

That said, it really does look like we just return -EOVERFLOW much too
eagerly, for stupid and bad reasons.

Considering that BLOCK_EXT_MAJOR has been 259 since 2008, and this is
the first time anybody has hit this, I don't think there's much reason
to change that major number when the whole error case seems to have
been largely a mistake to begin with.

                  Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-12  5:47         ` Linus Torvalds
@ 2022-04-12  5:52           ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2022-04-12  5:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Mikulas Patocka, Alexander Viro,
	linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 07:47:44PM -1000, Linus Torvalds wrote:
> Considering that BLOCK_EXT_MAJOR has been 259 since 2008, and this is
> the first time anybody has hit this, I don't think there's much reason
> to change that major number when the whole error case seems to have
> been largely a mistake to begin with.

Yeah.  If Mikulas still has a problem he could just patch
BLOCK_EXT_MAJOR locally.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-12  5:37     ` Linus Torvalds
  2022-04-12  5:41       ` Christoph Hellwig
@ 2022-04-12  6:49       ` Linus Torvalds
  2022-04-12  8:19       ` Andreas Schwab
  2022-04-12  9:41       ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka
  3 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2022-04-12  6:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 7:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Correct. It's literally the compat structure that has no basis in reality.
>
> Or it might be some truly ancient thing, but I really don't think so.

I was intrigued, so I went back and checked.

        unsigned short st_dev;
        unsigned short __pad1;

is in fact historical. But it was changed to

        unsigned long  st_dev;

(for i386, so this is a 32-bit 'unsigned long') on April 2, 2003.

From the BK tree conversion:

    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d

so I think we should just make sure that the 64-bit compat system call
is compatible with that 2003+ state, not with some truly ancient
state.

                      Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: don't fail if the major number is >= 256
  2022-04-12  5:37     ` Linus Torvalds
  2022-04-12  5:41       ` Christoph Hellwig
  2022-04-12  6:49       ` Linus Torvalds
@ 2022-04-12  8:19       ` Andreas Schwab
  2022-04-12  9:41       ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka
  3 siblings, 0 replies; 15+ messages in thread
From: Andreas Schwab @ 2022-04-12  8:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List, Matthew Wilcox

On Apr 11 2022, Linus Torvalds wrote:

>> For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In
>> struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev
>> are compat_dev_t which is 16-bit. But they are followed by 16-bit
>> paddings, so they could be extended.
>
> Ok, that actually looks like a bug.
>
> The compat structure should match the native structure.  Those "u16
> __padX" fields seem to be just a symptom of the bug.

Looks like the move to 32-bit st_[r]dev was never applied to struct
compat_stat, see commit e95b206567 ("[PATCH] struct stat - support
larger dev_t") from tglx/history.git.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] stat: fix inconsistency between struct stat and struct compat_stat
  2022-04-12  5:37     ` Linus Torvalds
                         ` (2 preceding siblings ...)
  2022-04-12  8:19       ` Andreas Schwab
@ 2022-04-12  9:41       ` Mikulas Patocka
  2022-04-12 16:30         ` Linus Torvalds
  3 siblings, 1 reply; 15+ messages in thread
From: Mikulas Patocka @ 2022-04-12  9:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox



On Mon, 11 Apr 2022, Linus Torvalds wrote:

> On Mon, Apr 11, 2022 at 7:13 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Should we perhaps hash the number, take 16 bits of the hash and hope
> > than the collision won't happen?
> 
> That would "work", but I think it would be incredibly annoying to
> users with basically random results.
> 
> I think the solution is to just put the bits in the high bits. Yes,
> they might be masked off if people use 'MAJOR()' to pick them out, but
> the common "compare st_dev and st_ino" model at least works. That's
> the one that wants unique numbers.
> 
> > For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In
> > struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev
> > are compat_dev_t which is 16-bit. But they are followed by 16-bit
> > paddings, so they could be extended.
> 
> Ok, that actually looks like a bug.
> 
> The compat structure should match the native structure.  Those "u16
> __padX" fields seem to be just a symptom of the bug.
> 
> The only user of that compat_stat structure is the kernel, so that
> should just be fixed.
> 
> Of course, who knows what the libraries have done, so user space could
> still have screwed up.

Here I'm sending a patch that makes struct compat_stat match struct stat.



stat: fix inconsistency between struct stat and struct compat_stat

struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit
st_dev and st_rdev; struct compat_stat (defined in
arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a
16-bit padding. This patch fixes struct compat_stat to match struct stat.

Note that we can't change compat_dev_t because it is used by
compat_loop_info.

Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW
on filesystems that are on NVMe because NVMe uses the major number 259.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/compat.h |    6 ++----
 fs/stat.c                     |   19 ++++++++++---------
 2 files changed, 12 insertions(+), 13 deletions(-)

Index: linux-5.17.2/arch/x86/include/asm/compat.h
===================================================================
--- linux-5.17.2.orig/arch/x86/include/asm/compat.h	2022-01-21 10:29:12.000000000 +0100
+++ linux-5.17.2/arch/x86/include/asm/compat.h	2022-04-12 11:27:14.000000000 +0200
@@ -28,15 +28,13 @@ typedef u16		compat_ipc_pid_t;
 typedef __kernel_fsid_t	compat_fsid_t;
 
 struct compat_stat {
-	compat_dev_t	st_dev;
-	u16		__pad1;
+	u32		st_dev;
 	compat_ino_t	st_ino;
 	compat_mode_t	st_mode;
 	compat_nlink_t	st_nlink;
 	__compat_uid_t	st_uid;
 	__compat_gid_t	st_gid;
-	compat_dev_t	st_rdev;
-	u16		__pad2;
+	u32		st_rdev;
 	u32		st_size;
 	u32		st_blksize;
 	u32		st_blocks;
Index: linux-5.17.2/fs/stat.c
===================================================================
--- linux-5.17.2.orig/fs/stat.c	2022-04-12 10:39:46.000000000 +0200
+++ linux-5.17.2/fs/stat.c	2022-04-12 10:58:28.000000000 +0200
@@ -334,9 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd,
 #  define choose_32_64(a,b) b
 #endif
 
-#define valid_dev(x)  choose_32_64(old_valid_dev(x),true)
-#define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)
-
 #ifndef INIT_STRUCT_STAT_PADDING
 #  define INIT_STRUCT_STAT_PADDING(st) memset(&st, 0, sizeof(st))
 #endif
@@ -345,7 +342,9 @@ static int cp_new_stat(struct kstat *sta
 {
 	struct stat tmp;
 
-	if (!valid_dev(stat->dev) || !valid_dev(stat->rdev))
+	if (sizeof(tmp.st_dev) < 4 && !old_valid_dev(stat->dev))
+		return -EOVERFLOW;
+	if (sizeof(tmp.st_rdev) < 4 && !old_valid_dev(stat->rdev))
 		return -EOVERFLOW;
 #if BITS_PER_LONG == 32
 	if (stat->size > MAX_NON_LFS)
@@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta
 #endif
 
 	INIT_STRUCT_STAT_PADDING(tmp);
-	tmp.st_dev = encode_dev(stat->dev);
+	tmp.st_dev = new_encode_dev(stat->dev);
 	tmp.st_ino = stat->ino;
 	if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino)
 		return -EOVERFLOW;
@@ -363,7 +362,7 @@ static int cp_new_stat(struct kstat *sta
 		return -EOVERFLOW;
 	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
 	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
-	tmp.st_rdev = encode_dev(stat->rdev);
+	tmp.st_rdev = new_encode_dev(stat->rdev);
 	tmp.st_size = stat->size;
 	tmp.st_atime = stat->atime.tv_sec;
 	tmp.st_mtime = stat->mtime.tv_sec;
@@ -644,11 +643,13 @@ static int cp_compat_stat(struct kstat *
 {
 	struct compat_stat tmp;
 
-	if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev))
+	if (sizeof(tmp.st_dev) < 4 && !old_valid_dev(stat->dev))
+		return -EOVERFLOW;
+	if (sizeof(tmp.st_rdev) < 4 && !old_valid_dev(stat->rdev))
 		return -EOVERFLOW;
 
 	memset(&tmp, 0, sizeof(tmp));
-	tmp.st_dev = old_encode_dev(stat->dev);
+	tmp.st_dev = new_encode_dev(stat->dev);
 	tmp.st_ino = stat->ino;
 	if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino)
 		return -EOVERFLOW;
@@ -658,7 +659,7 @@ static int cp_compat_stat(struct kstat *
 		return -EOVERFLOW;
 	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
 	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
-	tmp.st_rdev = old_encode_dev(stat->rdev);
+	tmp.st_rdev = new_encode_dev(stat->rdev);
 	if ((u64) stat->size > MAX_NON_LFS)
 		return -EOVERFLOW;
 	tmp.st_size = stat->size;


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat
  2022-04-12  9:41       ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka
@ 2022-04-12 16:30         ` Linus Torvalds
  2022-04-12 17:42           ` Mikulas Patocka
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2022-04-12 16:30 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox

On Mon, Apr 11, 2022 at 11:41 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
> old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW
> on filesystems that are on NVMe because NVMe uses the major number 259.

The problem with this part of the patch is that this:

> @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta
>  #endif
>
>         INIT_STRUCT_STAT_PADDING(tmp);
> -       tmp.st_dev = encode_dev(stat->dev);
> +       tmp.st_dev = new_encode_dev(stat->dev);

completely changes the format of that st_dev field.

For completely insane historical reasons, we have had the rule that

 - 32-bit architectures encode the device into a 16 bit value

 - 64-bit architectures encode the device number into a 32 bit value

and that has been true *despite* the fact that the actual "st_dev"
field has been 32-bit and 64-bit respectively since 2003!

And it doesn't help that to confuse things even more, the _naming_ of
those "encode_dev" functions is "old and new", so that logically you'd
think that "cp_new_stat()" would use "new_encode_dev()". Nope.

So on 32-bit architectures, cp_new_stat() uses "old_encode_dev()",
which historically put the minor number in bits 0..7, and the major
number in bits 8..15.

End result: on a 32-bit system (or in the compat syscall mode),
changing to new_encode_dev() would confuse anybody (like just "ls -l
/dev") that uses that old stat call and tries to print out major/minor
numbers.

Now,. the good news is that

 (a) nobody should use that old stat call, since the new world order
is called "stat64" and has been for a loooong time - also since at
least 2003)

 (b) we could just hide the bits in upper bits instead.

So what I suggest we do is to make old_encode_dev() put the minor bits
in bits 0..7 _and_ 16..23, and the major bits in 8..15 _and_ 24..32.

And then the -EOVERFLOW should be something like

        unsigned int st_dev = encode_dev(stat->dev);
        tmp.st_dev = st_dev;
        if (st_dev != tmp.st_dev)
                return -EOVERFLOW;

for the lcase that tmp.st_dev is actually 16-bit (ie the compat case
for some architecture where the padding wasn't there?)

NOTE: That will still screw up 'ls -l' output, but only for the
devices that previously would have returned -EOVERFLOW.

And it will make anybopdy who does that "stat1->st_dev ==
stat2->st_dev && ino == ino2" thing for testing "same inode" work just
fine.

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat
  2022-04-12 16:30         ` Linus Torvalds
@ 2022-04-12 17:42           ` Mikulas Patocka
  2022-04-12 23:25             ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Mikulas Patocka @ 2022-04-12 17:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox



On Tue, 12 Apr 2022, Linus Torvalds wrote:

> On Mon, Apr 11, 2022 at 11:41 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
> > old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW
> > on filesystems that are on NVMe because NVMe uses the major number 259.
> 
> The problem with this part of the patch is that this:
> 
> > @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta
> >  #endif
> >
> >         INIT_STRUCT_STAT_PADDING(tmp);
> > -       tmp.st_dev = encode_dev(stat->dev);
> > +       tmp.st_dev = new_encode_dev(stat->dev);
> 
> completely changes the format of that st_dev field.

we have these definitions:

static __always_inline u16 old_encode_dev(dev_t dev)
{
        return (MAJOR(dev) << 8) | MINOR(dev);
}

static __always_inline u32 new_encode_dev(dev_t dev)
{
        unsigned major = MAJOR(dev);
        unsigned minor = MINOR(dev);
        return (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12);
}

As long as both major and minor numbers are less than 256, these functions 
return equivalent results. So, I think it's safe to replace old_encode_dev 
with new_encode_dev.

old_encode_dev shouldn't be called with minor >= 256, because it blends 
the upper minor bits into the major field - the kernel doesn't do this and 
checks the value with old_valid_dev before calling old_encode_dev. But 
when old_valid_dev returns true, it doesn't matter if you use 
old_encode_dev or new_encode_dev - both give equivalent results.

When I tested it, both gcc and openwatcom return st_dev 0x10301, which is 
the expected value (the NVMe device has major 259 and minor 1).

>  (b) we could just hide the bits in upper bits instead.
> 
> So what I suggest we do is to make old_encode_dev() put the minor bits
> in bits 0..7 _and_ 16..23, and the major bits in 8..15 _and_ 24..32.

new_encode_dev puts the minor value into bits 0..7, 20..31 and the major 
value into bits 8..19

So, we can use this instead of inventing a new format.

Mikulas


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat
  2022-04-12 17:42           ` Mikulas Patocka
@ 2022-04-12 23:25             ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2022-04-12 23:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox

On Tue, Apr 12, 2022 at 7:42 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> As long as both major and minor numbers are less than 256, these functions
> return equivalent results. So, I think it's safe to replace old_encode_dev
> with new_encode_dev.

You are of course 100% right, and I should have looked more closely at
the code rather than going by my (broken) assumptions based on old
memory of what we did when we did that "new" stat expansion.

I take back all my objections that were completely bogus.

             Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-04-12 23:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka
2022-04-11 15:03 ` Matthew Wilcox
2022-04-11 15:16   ` Christoph Hellwig
2022-04-11 16:30 ` Linus Torvalds
2022-04-11 17:13   ` Mikulas Patocka
2022-04-12  5:37     ` Linus Torvalds
2022-04-12  5:41       ` Christoph Hellwig
2022-04-12  5:47         ` Linus Torvalds
2022-04-12  5:52           ` Christoph Hellwig
2022-04-12  6:49       ` Linus Torvalds
2022-04-12  8:19       ` Andreas Schwab
2022-04-12  9:41       ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka
2022-04-12 16:30         ` Linus Torvalds
2022-04-12 17:42           ` Mikulas Patocka
2022-04-12 23:25             ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.