All of lore.kernel.org
* page->index limitation on 32bit system?
@ 2021-02-18  8:54 Qu Wenruo
  2021-02-18 12:15 ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2021-02-18  8:54 UTC (permalink / raw)
  To: Linux FS Devel; +Cc: linux-btrfs

Hi,

Recently we got a strange bug report that, on 32bit systems like armv6
or non-64bit x86, certain large btrfs can't be mounted.

It turns out that page->index is just unsigned long, and on 32bit
systems, that can be just 32 bits.

And when a filesystem is utilizing any page offset over 4T, page->index
gets truncated, causing various problems.

This is especially a big problem for btrfs, as btrfs uses its internal
address space, which is from 0 to U64_MAX, but still sometimes relies on
page->index, just like most filesystems.

If metadata is at or beyond the 4T boundary (which is not rare, even with
a small btrfs, as btrfs can relocate its chunks to a much higher bytenr than
the device boundary), then page->index will be truncated and may even
conflict with existing pages.

I'm wondering if this is a known problem, and if so, is there any plan to fix it?
If it's not a known one, does it mean we have to make page->index u64 to fix
it? (this is definitely not going to be easy)

Thanks,
Qu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: page->index limitation on 32bit system?
  2021-02-18  8:54 page->index limitation on 32bit system? Qu Wenruo
@ 2021-02-18 12:15 ` Matthew Wilcox
  2021-02-18 12:42   ` Qu Wenruo
  2021-02-18 21:27   ` Erik Jensen
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-18 12:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Linux FS Devel, linux-btrfs

On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
> Recently we got a strange bug report that, on 32bit systems like armv6
> or non-64bit x86, certain large btrfs can't be mounted.
>
> It turns out that page->index is just unsigned long, and on 32bit
> systems, that can be just 32 bits.
>
> And when a filesystem is utilizing any page offset over 4T, page->index
> gets truncated, causing various problems.

4TB?  I think you mean 16TB (4kB * 4GB)
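
The arithmetic behind the 16TB figure, and the aliasing that truncation
causes, can be sketched in userspace C. This is only an illustration:
`uint32_t` stands in for a 32-bit `unsigned long`, `PAGE_SHIFT` is assumed
to be 12 (4kB pages), and the helper names are made up for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4kB pages */

/*
 * Byte offset -> page index as a 32-bit kernel would store it.
 * uint32_t models a 32-bit unsigned long, so the upper bits are dropped.
 */
static uint32_t page_index_32(uint64_t byte_offset)
{
	return (uint32_t)(byte_offset >> PAGE_SHIFT);
}

/*
 * First byte offset whose page index no longer fits in 32 bits:
 * 2^32 pages * 4kB per page = 2^44 bytes = 16TiB.
 */
static uint64_t first_unaddressable_byte(void)
{
	return (uint64_t)1 << (32 + PAGE_SHIFT);
}
```

With these definitions, `page_index_32(first_unaddressable_byte())` is 0,
the same index as byte offset 0, which is the kind of conflict with
existing pages described in the original report.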

Yes, this is a known limitation.  Some vendors have gone to the trouble
of introducing a new page_index_t.  I'm not convinced this is a problem
worth solving.  There are very few 32-bit systems with this much storage
on a single partition (everything should work fine if you take a 20TB
drive and partition it into two 10TB partitions).

As usual, the best solution is for people to stop buying 32-bit systems.

* Re: page->index limitation on 32bit system?
  2021-02-18 12:15 ` Matthew Wilcox
@ 2021-02-18 12:42   ` Qu Wenruo
  2021-02-18 13:39     ` Matthew Wilcox
  2021-02-18 21:27   ` Erik Jensen
  1 sibling, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2021-02-18 12:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Linux FS Devel, linux-btrfs



On 2021/2/18 下午8:15, Matthew Wilcox wrote:
> On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
>> Recently we got a strange bug report that, on 32bit systems like armv6
>> or non-64bit x86, certain large btrfs can't be mounted.
>>
>> It turns out that page->index is just unsigned long, and on 32bit
>> systems, that can be just 32 bits.
>>
>> And when a filesystem is utilizing any page offset over 4T, page->index
>> gets truncated, causing various problems.
>
> 4TB?  I think you mean 16TB (4kB * 4GB)

Oh, offset by 2...

>
> Yes, this is a known limitation.  Some vendors have gone to the trouble
> of introducing a new page_index_t.  I'm not convinced this is a problem
> worth solving.  There are very few 32-bit systems with this much storage
> on a single partition (everything should work fine if you take a 20TB
> drive and partition it into two 10TB partitions).
What would happen if a user just tries to write 4K at file offset 16T
for a sparse file?

Would it be blocked by other checks before reaching the underlying fs?

>
> As usual, the best solution is for people to stop buying 32-bit systems.
>

They don't need a large single partition to even trigger it.

This is especially true for btrfs, which has its own internal address space
(and it can be any aligned U64 value).
Even a 1T btrfs can have its metadata at an internal bytenr way larger
than 1T. (although those ranges still need to be mapped inside the device).

And considering the reporter is already using 32bit with 10T+ storage, I
doubt it's really not worth solving.

BTW, what would be the extra cost of converting page::index to u64?
I know tons of printk() calls would cause warnings, but most 64bit systems
should not be affected anyway.

Thanks,
Qu

* Re: page->index limitation on 32bit system?
  2021-02-18 12:42   ` Qu Wenruo
@ 2021-02-18 13:39     ` Matthew Wilcox
  2021-02-19  0:37       ` Qu Wenruo
  2021-02-20 23:02       ` Erik Jensen
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-18 13:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Linux FS Devel, linux-btrfs

On Thu, Feb 18, 2021 at 08:42:14PM +0800, Qu Wenruo wrote:
> On 2021/2/18 下午8:15, Matthew Wilcox wrote:
> > Yes, this is a known limitation.  Some vendors have gone to the trouble
> > of introducing a new page_index_t.  I'm not convinced this is a problem
> > worth solving.  There are very few 32-bit systems with this much storage
> > on a single partition (everything should work fine if you take a 20TB
> > drive and partition it into two 10TB partitions).
> What would happen if a user just tries to write 4K at file offset 16T
> for a sparse file?
> 
> Would it be blocked by other checks before reaching the underlying fs?

/* Page cache limit. The filesystems should put that into their s_maxbytes 
   limits, otherwise bad things can happen in VM. */ 
#if BITS_PER_LONG==32
#define MAX_LFS_FILESIZE        ((loff_t)ULONG_MAX << PAGE_SHIFT)
#elif BITS_PER_LONG==64
#define MAX_LFS_FILESIZE        ((loff_t)LLONG_MAX)
#endif
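
To make the answer concrete: on a 32-bit build the first branch applies,
so the limit works out to just under 16TB and a 4K write at offset 16T
falls outside it. A rough userspace model, assuming 4kB pages, with
`UINT32_MAX` standing in for a 32-bit `ULONG_MAX` and `write_fits()` being
a made-up helper rather than the real VFS check:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4kB pages */

/* Model of MAX_LFS_FILESIZE for BITS_PER_LONG == 32. */
#define MAX_LFS_FILESIZE_32	((int64_t)UINT32_MAX << PAGE_SHIFT)

/*
 * Hypothetical helper: does a write of len bytes at pos stay within
 * the limit?  The real kernel checks are more involved than this.
 */
static bool write_fits(int64_t pos, int64_t len)
{
	if (pos < 0 || len < 0 || pos > MAX_LFS_FILESIZE_32)
		return false;
	return len <= MAX_LFS_FILESIZE_32 - pos;
}
```

Here `MAX_LFS_FILESIZE_32` is 2^44 - 4096 bytes, so
`write_fits((int64_t)1 << 44, 4096)` is false while a write ending exactly
at the limit is still accepted.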

> This is especially true for btrfs, which has its internal address space
> (and it can be any aligned U64 value).
> Even 1T btrfs can have its metadata at its internal bytenr way larger
> than 1T. (although those ranges still needs to be mapped inside the device).

Sounds like btrfs has a problem to fix.

> And considering the reporter is already using 32bit with 10T+ storage, I
> doubt if it's really not worthy.
> 
> BTW, what would be the extra cost by converting page::index to u64?
> I know tons of printk() would cause warning, but most 64bit systems
> should not be affected anyway.

No effect for 64-bit systems, other than the churn.

For 32-bit systems, it'd have some pretty horrible overhead.  You don't
just have to touch the page cache, you have to convert the XArray.
It's doable (I mean, it's been done), but it's very costly for all the
32-bit systems which don't use a humongous filesystem.  And we could
minimise that overhead with a typedef, but then the source code gets
harder to work with.
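
For illustration, the typedef approach might look roughly like the sketch
below. `CONFIG_PAGE_INDEX_64BIT` is a made-up option name and
`page_index_t` here is only a guess at the shape, not the vendor patch.
The point is that every user of the index would have to switch from
`unsigned long` to the typedef, which is where the source gets harder to
work with.

```c
#include <stdint.h>

/* Hypothetical Kconfig symbol; not an existing kernel option. */
#ifdef CONFIG_PAGE_INDEX_64BIT
typedef uint64_t page_index_t;		/* wide index, even on 32-bit */
#else
typedef unsigned long page_index_t;	/* current behaviour: native word */
#endif

#define PAGE_SHIFT 12	/* assumption: 4kB pages */

/* Convert a byte offset to a page index in whichever width is configured. */
static page_index_t offset_to_index(uint64_t byte_offset)
{
	return (page_index_t)(byte_offset >> PAGE_SHIFT);
}
```

With the option off, a 32-bit build keeps the native word size and the
existing truncation; with it on, the index is 64 bits wide everywhere.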

* Re: page->index limitation on 32bit system?
  2021-02-18 12:15 ` Matthew Wilcox
  2021-02-18 12:42   ` Qu Wenruo
@ 2021-02-18 21:27   ` Erik Jensen
  2021-02-19 14:22     ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Erik Jensen @ 2021-02-18 21:27 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: Linux FS Devel, linux-btrfs

On 2/18/21 4:15 AM, Matthew Wilcox wrote:

> On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
>> Recently we got a strange bug report that, on 32bit systems like armv6
>> or non-64bit x86, certain large btrfs can't be mounted.
>>
>> It turns out that page->index is just unsigned long, and on 32bit
>> systems, that can be just 32 bits.
>>
>> And when a filesystem is utilizing any page offset over 4T, page->index
>> gets truncated, causing various problems.
> 4TB?  I think you mean 16TB (4kB * 4GB)
>
> Yes, this is a known limitation.  Some vendors have gone to the trouble
> of introducing a new page_index_t.  I'm not convinced this is a problem
> worth solving.  There are very few 32-bit systems with this much storage
> on a single partition (everything should work fine if you take a 20TB
> drive and partition it into two 10TB partitions).
For what it's worth, I'm the reporter of the original bug. My use case 
is a custom NAS system. It runs on a 32-bit ARM processor, and has 5 8TB 
drives, which I'd like to use as a single, unified storage array. I 
chose btrfs for this project due to the filesystem-integrated snapshots 
and checksums. Currently, I'm working around this issue by exporting the 
raw drives using nbd and mounting them on a 64-bit system to access the 
filesystem, but this is very inconvenient, only allows one machine to 
access the filesystem at a time, and prevents running any tools that 
need access to the filesystem (such as backup and file sync utilities) 
on the NAS itself.

It sounds like this limitation would also prevent me from trying to use 
a different filesystem on top of software RAID, since in that case the 
logical filesystem would still be over 16TB.

> As usual, the best solution is for people to stop buying 32-bit systems.
I purchased this device in 2018, so it's not exactly ancient. At the 
time, it was the only SBC I could find that was low power, used ECC RAM, 
had a crypto accelerator, and had multiple sata ports with 
port-multiplier support.

* Re: page->index limitation on 32bit system?
  2021-02-18 13:39     ` Matthew Wilcox
@ 2021-02-19  0:37       ` Qu Wenruo
  2021-02-19 16:12         ` Theodore Ts'o
  2021-02-20 23:02       ` Erik Jensen
  1 sibling, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2021-02-19  0:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Linux FS Devel, linux-btrfs



On 2021/2/18 下午9:39, Matthew Wilcox wrote:
> On Thu, Feb 18, 2021 at 08:42:14PM +0800, Qu Wenruo wrote:
>> On 2021/2/18 下午8:15, Matthew Wilcox wrote:
>>> Yes, this is a known limitation.  Some vendors have gone to the trouble
>>> of introducing a new page_index_t.  I'm not convinced this is a problem
>>> worth solving.  There are very few 32-bit systems with this much storage
>>> on a single partition (everything should work fine if you take a 20TB
>>> drive and partition it into two 10TB partitions).
>> What would happen if a user just tries to write 4K at file offset 16T
>> for a sparse file?
>>
>> Would it be blocked by other checks before reaching the underlying fs?
>
> /* Page cache limit. The filesystems should put that into their s_maxbytes
>     limits, otherwise bad things can happen in VM. */
> #if BITS_PER_LONG==32
> #define MAX_LFS_FILESIZE        ((loff_t)ULONG_MAX << PAGE_SHIFT)
> #elif BITS_PER_LONG==64
> #define MAX_LFS_FILESIZE        ((loff_t)LLONG_MAX)
> #endif
>
>> This is especially true for btrfs, which has its internal address space
>> (and it can be any aligned U64 value).
>> Even 1T btrfs can have its metadata at its internal bytenr way larger
>> than 1T. (although those ranges still needs to be mapped inside the device).
>
> Sounds like btrfs has a problem to fix.

You're kinda right. Btrfs metadata uses an inode to organize the whole
metadata as a file, but that doesn't take the limit into consideration.

Fixing it, though, will bring tons of new problems.

We will have cases where the initial fs stays within the limit, but when the
user wants to do something like balance, it may go beyond the limit and
cause problems.

And when such a problem happens, users won't be happy anyway.
>
>> And considering the reporter is already using 32bit with 10T+ storage, I
>> doubt if it's really not worthy.
>>
>> BTW, what would be the extra cost by converting page::index to u64?
>> I know tons of printk() would cause warning, but most 64bit systems
>> should not be affected anyway.
>
> No effect for 64-bit systems, other than the churn.
>
> For 32-bit systems, it'd have some pretty horrible overhead.  You don't
> just have to touch the page cache, you have to convert the XArray.
> It's doable (I mean, it's been done), but it's very costly for all the
> 32-bit systems which don't use a humongous filesystem.  And we could
> minimise that overhead with a typedef, but then the source code gets
> harder to work with.
>
So it means the 32bit archs are already 2nd tier targets for at least
upstream linux kernel?

Or would it be possible to make it an option to make the index u64?
So guys who really want large file support can enable it while most
other 32bit guys can just keep the existing behavior?

Thanks,
Qu

* Re: page->index limitation on 32bit system?
  2021-02-18 21:27   ` Erik Jensen
@ 2021-02-19 14:22     ` Matthew Wilcox
  2021-02-19 17:51       ` Matthew Wilcox
  2021-02-22  1:48       ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-19 14:22 UTC (permalink / raw)
  To: Erik Jensen; +Cc: Qu Wenruo, Linux FS Devel, linux-btrfs

On Thu, Feb 18, 2021 at 01:27:09PM -0800, Erik Jensen wrote:
> On 2/18/21 4:15 AM, Matthew Wilcox wrote:
> 
> > On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
> > > Recently we got a strange bug report that, on 32bit systems like armv6
> > > or non-64bit x86, certain large btrfs can't be mounted.
> > >
> > > It turns out that page->index is just unsigned long, and on 32bit
> > > systems, that can be just 32 bits.
> > >
> > > And when a filesystem is utilizing any page offset over 4T, page->index
> > > gets truncated, causing various problems.
> > 4TB?  I think you mean 16TB (4kB * 4GB)
> > 
> > Yes, this is a known limitation.  Some vendors have gone to the trouble
> > of introducing a new page_index_t.  I'm not convinced this is a problem
> > worth solving.  There are very few 32-bit systems with this much storage
> > on a single partition (everything should work fine if you take a 20TB
> > drive and partition it into two 10TB partitions).
> For what it's worth, I'm the reporter of the original bug. My use case is a
> custom NAS system. It runs on a 32-bit ARM processor, and has 5 8TB drives,
> which I'd like to use as a single, unified storage array. I chose btrfs for
> this project due to the filesystem-integrated snapshots and checksums.
> Currently, I'm working around this issue by exporting the raw drives using
> nbd and mounting them on a 64-bit system to access the filesystem, but this
> is very inconvenient, only allows one machine to access the filesystem at a
> time, and prevents running any tools that need access to the filesystem
> (such as backup and file sync utilities) on the NAS itself.
> 
> It sounds like this limitation would also prevent me from trying to use a
> different filesystem on top of software RAID, since in that case the logical
> filesystem would still be over 16TB.
> 
> > As usual, the best solution is for people to stop buying 32-bit systems.
> I purchased this device in 2018, so it's not exactly ancient. At the time,
> it was the only SBC I could find that was low power, used ECC RAM, had a
> crypto accelerator, and had multiple sata ports with port-multiplier
> support.

I'm sorry you bought unsupported hardware.

This limitation has been known since at least 2009:
https://lore.kernel.org/lkml/19041.4714.686158.130252@notabene.brown/

In the last decade, nobody's tried to fix it in mainline that I know of.
As I said, some vendors have tried to fix it in their NAS products,
but I don't know where to find that patch any more.

https://bootlin.com/blog/large-page-support-for-nas-systems-on-32-bit-arm/
might help you, but btrfs might still contain assumptions that will trip
you up.

* Re: page->index limitation on 32bit system?
  2021-02-19  0:37       ` Qu Wenruo
@ 2021-02-19 16:12         ` Theodore Ts'o
  2021-02-19 23:10           ` Qu Wenruo
  2021-02-20  2:20           ` Erik Jensen
  0 siblings, 2 replies; 21+ messages in thread
From: Theodore Ts'o @ 2021-02-19 16:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Matthew Wilcox, Linux FS Devel, linux-btrfs

On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> So it means the 32bit archs are already 2nd tier targets for at least
> upstream linux kernel?

At least as far as btrfs is concerned, anyway....

> Or would it be possible to make it an option to make the index u64?
> So guys who really wants large file support can enable it while most
> other 32bit guys can just keep the existing behavior?

I think if this is going to be done at all, it would need to be a
compile-time CONFIG option to make the index be 64-bits.  That's
because there are a huge number of low-end Android devices (retail
price ~$30 USD in India, for example --- this set of customers is
sometimes called "the next billion users" by some folks) that are
using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
would be massively unfortunate/unfair/etc. to impose that performance
penalty on them.

It sounds like what Willy is saying is that supporting a 64-bit page
index on 32-bit platforms is going to have a lot of downsides, and
not just the performance / memory overhead issue.  It's also a code
maintenance concern, and that tax would land on the mm developers.
And if it's not well-maintained, without regular testing, it's likely
to be heavily subject to bitrot.  (Although I suppose if we don't mind
doubling the number of configs that kernelci has to test, this could
be mitigated.)

In contrast, changing btrfs to not depend on a single address space
for all of its metadata might be a lot of work, but it's something
which lands on the btrfs developers, as opposed to another (perhaps
more central) kernel subsystem.  Managing this tradeoff is
something that is going to be between the mm developers and the btrfs
developers, but as someone who doesn't do any work on either of these
subsystems, it seems like a pretty obvious choice.

The final observation I'll make is that if we know which NAS box
vendor can (properly) support volumes > 16 TB, we can probably find
the 64-bit page index patch.  It'll probably be against a fairly old
kernel, so it might not be all _that_ helpful, but it might give folks a
bit of a head start.

I can tell you that the NAS box vendor that it _isn't_ is Synology.
Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
volume size limit, and this is enforced by the Synology NAS
software[1].  However, Synology NAS boxes can support multiple
volumes; until today, I never understood why, since it seemed to be
unnecessary complexity, but I suspect the real answer was this was how
Synology handled storage array sizes > 16TB on their older systems.
(All of their new NAS boxes use 64-bit processors.)

[1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/

Cheers,

					- Ted

* Re: page->index limitation on 32bit system?
  2021-02-19 14:22     ` Matthew Wilcox
@ 2021-02-19 17:51       ` Matthew Wilcox
  2021-02-19 23:13         ` Qu Wenruo
  2021-02-22  1:48       ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-19 17:51 UTC (permalink / raw)
  To: Erik Jensen; +Cc: Qu Wenruo, Linux FS Devel, linux-btrfs

On Fri, Feb 19, 2021 at 02:22:01PM +0000, Matthew Wilcox wrote:
> In the last decade, nobody's tried to fix it in mainline that I know of.
> As I said, some vendors have tried to fix it in their NAS products,
> but I don't know where to find that patch any more.

Arnd found it for me.

https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/25426branch/alpine-source/linux-3.10.x-bsp.txz/download

They've done a perfect job of making the source available while making it
utterly dreadful to extract anything useful from.

 16084 files changed, 1322769 insertions(+), 285257 deletions(-)

It's full of gratuitous whitespace changes to files that definitely
aren't used (arch/alpha?  really?) and they've stripped out a lot of
comments that they didn't need to touch.

Forward porting a patch from 10 years ago wouldn't be easy, even if
they hadn't tried very hard to obfuscate their patch.  I don't think
this will be a fruitful line of inquiry.

* Re: page->index limitation on 32bit system?
  2021-02-19 16:12         ` Theodore Ts'o
@ 2021-02-19 23:10           ` Qu Wenruo
  2021-02-20  0:23             ` Matthew Wilcox
  2021-02-22  0:19             ` Dave Chinner
  2021-02-20  2:20           ` Erik Jensen
  1 sibling, 2 replies; 21+ messages in thread
From: Qu Wenruo @ 2021-02-19 23:10 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Matthew Wilcox, Linux FS Devel, linux-btrfs



On 2021/2/20 上午12:12, Theodore Ts'o wrote:
> On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
>> So it means the 32bit archs are already 2nd tier targets for at least
>> upstream linux kernel?
>
> At least as far as btrfs is concerned, anyway....

I'm afraid that would be the case.

But I'm still interested in how other fses handle such a problem.

Don't they rely on page::index to handle their metadata?
Or do all other fses simply not support allocating/deleting their AG/BG
dynamically, so they can reject the fs at mount time?

Or do they limit their metadata page::index to just inside each AG/BG?

Anyway, I'm afraid we have to reject the fs at both mount time and
runtime for now.

>
>> Or would it be possible to make it an option to make the index u64?
>> So guys who really wants large file support can enable it while most
>> other 32bit guys can just keep the existing behavior?
>
> I think if this is going to be done at all, it would need to be a
> compile-time CONFIG option to make the index be 64-bits.  That's
> because there are a huge number of low-end Android devices (retail
> price ~$30 USD in India, for example --- this set of customers is
> sometimes called "the next billion users" by some folks) that are
> using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
> would be massively unfortunate/unfair/etc. to impose that performance
> penalty on them.
>
> It sounds like what Willy is saying is that supporting a 64-bit page
> index on 32-bit platforms is going to have a lot of downsides, and
> not just the performance / memory overhead issue.  It's also a code
> maintenance concern, and that tax would land on the mm developers.
> And if it's not well-maintained, without regular testing, it's likely
> to be heavily subject to bitrot.  (Although I suppose if we don't mind
> doubling the number of configs that kernelci has to test, this could
> be mitigated.)
>
> In contrast, changing btrfs to not depend on a single address space
> for all of its metadata might be a lot of work, but it's something
> which lands on the btrfs developers, as opposed to another (perhaps
> more central) kernel subsystem.  Managing this tradeoff is
> something that is going to be between the mm developers and the btrfs
> developers, but as someone who doesn't do any work on either of these
> subsystems, it seems like a pretty obvious choice.

Yeah, I totally understand that.

And it doesn't look worthwhile (or even possible) to make several
metadata inodes (address spaces, to be more specific) just to support
32bit systems, as the lack-of-test-coverage problem would still be the same.

I don't see any active btrfs developer using a 32bit system to test, even
for ARM systems.

Even rejecting the fs is in fact much more complex and may not get
enough testing after the initial submission.
>
> The final observation I'll make is that if we know which NAS box
> vendor can (properly) support volumes > 16 TB, we can probably find
> the 64-bit page index patch.  It'll probably be against a fairly old
> kernel, so it might not be all _that_ helpful, but it might give folks a
> bit of a head start.
>
> I can tell you that the NAS box vendor that it _isn't_ is Synology.
> Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
> volume size limit, and this is enforced by the Synology NAS
> software[1].  However, Synology NAS boxes can support multiple
> volumes; until today, I never understood why, since it seemed to be
> unnecessary complexity, but I suspect the real answer was this was how
> Synology handled storage array sizes > 16TB on their older systems.
> (All of their new NAS boxes use 64-bit processors.)

BTW, even for Synology, 32bit systems can easily go beyond 16T in their
local address space while the underlying fs is only 1T or even smaller.

They only need to run routine balance and eventually they will go beyond
that 16T limit.

Thanks,
Qu

>
> [1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/
>
> Cheers,
>
> 					- Ted
>

* Re: page->index limitation on 32bit system?
  2021-02-19 17:51       ` Matthew Wilcox
@ 2021-02-19 23:13         ` Qu Wenruo
  0 siblings, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2021-02-19 23:13 UTC (permalink / raw)
  To: Matthew Wilcox, Erik Jensen; +Cc: Linux FS Devel, linux-btrfs



On 2021/2/20 上午1:51, Matthew Wilcox wrote:
> On Fri, Feb 19, 2021 at 02:22:01PM +0000, Matthew Wilcox wrote:
>> In the last decade, nobody's tried to fix it in mainline that I know of.
>> As I said, some vendors have tried to fix it in their NAS products,
>> but I don't know where to find that patch any more.
>
> Arnd found it for me.
>
> https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/25426branch/alpine-source/linux-3.10.x-bsp.txz/download
>
> They've done a perfect job of making the source available while making it
> utterly dreadful to extract anything useful from.
>
>   16084 files changed, 1322769 insertions(+), 285257 deletions(-)

Wow, I thought RedHat was the only open-source vendor that tries to send
out a super big patch to make everyone else's life miserable.
I was definitely wrong.

>
> It's full of gratuitous whitespace changes to files that definitely
> aren't used (arch/alpha?  really?) and they've stripped out a lot of
> comments that they didn't need to touch.
>
> Forward porting a patch from 10 years ago wouldn't be easy, even if
> they hadn't tried very hard to obfuscate their patch.  I don't think
> this will be a fruitful line of inquiry.
>
Yeah, I believe it's not worth it now.

I'll make btrfs try its best to reject the fs instead.

Thanks,
Qu

* Re: page->index limitation on 32bit system?
  2021-02-19 23:10           ` Qu Wenruo
@ 2021-02-20  0:23             ` Matthew Wilcox
  2021-02-22  0:19             ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-20  0:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Theodore Ts'o, Linux FS Devel, linux-btrfs

On Sat, Feb 20, 2021 at 07:10:14AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/2/20 上午12:12, Theodore Ts'o wrote:
> > On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> > > So it means the 32bit archs are already 2nd tier targets for at least
> > > upstream linux kernel?
> > 
> > At least as far as btrfs is concerned, anyway....
> 
> I'm afraid that would be the case.

btrfs already treats 32-bit arches as second class citizens.
I found a1fbc6750e212c5675a4e48d7f51d44607eb8756 by code inspection,
so clearly it hasn't been tested in five years.  I wouldn't recommend
that anybody use btrfs with a 32-bit kernel.


* Re: page->index limitation on 32bit system?
  2021-02-19 16:12         ` Theodore Ts'o
  2021-02-19 23:10           ` Qu Wenruo
@ 2021-02-20  2:20           ` Erik Jensen
  2021-02-20  3:40             ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Erik Jensen @ 2021-02-20  2:20 UTC (permalink / raw)
  To: Theodore Ts'o, Qu Wenruo; +Cc: Matthew Wilcox, Linux FS Devel, linux-btrfs

On 2/19/21 8:12 AM, Theodore Ts'o wrote:
> On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
>> So it means the 32bit archs are already 2nd tier targets for at least
>> upstream linux kernel?
> 
> At least as far as btrfs is concerned, anyway....
> 
>> Or would it be possible to make it an option to make the index u64?
>> So guys who really wants large file support can enable it while most
>> other 32bit guys can just keep the existing behavior?
> 
> I think if this is going to be done at all, it would need to be a
> compile-time CONFIG option to make the index be 64-bits.  That's
> because there are a huge number of low-end Android devices (retail
> price ~$30 USD in India, for example --- this set of customers is
> sometimes called "the next billion users" by some folks) that are
> using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
> would be massively unfortunate/unfair/etc. to impose that performance
> penalty on them.

A CONFIG option would certainly work for my use case. I was also
wondering (and I ask this as an end user with admittedly no knowledge
whatsoever about how the page cache works) whether it might be possible 
to treat the top bit as a kind of "extended address" bit, with some kind 
of additional side table that handles indexes more than 31 bits. That 
way, filesystems that are 8TB or less wouldn't lose any performance, 
while still supporting those larger than 16TB.

I assume the 4KiB entry size in the page cache is fundamental, and can't 
be, e.g., increased to 16KiB to allow addressing up to 64TiB of storage?

> It sounds like what Willy is saying is that supporting a 64-bit page
> index on 32-bit platforms is going to have a lot of downsides, and
> not just the performance / memory overhead issue.  It's also a code
> maintenance concern, and that tax would land on the mm developers.
> And if it's not well-maintained, without regular testing, it's likely
> to be heavily subject to bitrot.  (Although I suppose if we don't mind
> doubling the number of configs that kernelci has to test, this could
> be mitigated.)
>
> In contrast, changing btrfs to not depend on a single address space
> for all of its metadata might be a lot of work, but it's something
> which lands on the btrfs developers, as opposed to another (perhaps
> more central) kernel subsystem.  Managing this tradeoff is
> something that is going to be between the mm developers and the btrfs
> developers, but as someone who doesn't do any work on either of these
> subsystems, it seems like a pretty obvious choice.
>
> The final observation I'll make is that if we know which NAS box
> vendor can (properly) support volumes > 16 TB, we can probably find
> the 64-bit page index patch.  It'll probably be against a fairly old
> kernel, so it might not be all _that_ helpful, but it might give folks a
> bit of a head start.
>
> I can tell you that the NAS box vendor that it _isn't_ is Synology.
> Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
> volume size limit, and this is enforced by the Synology NAS
> software[1].  However, Synology NAS boxes can support multiple
> volumes; until today, I never understood why, since it seemed to be
> unnecessary complexity, but I suspect the real answer was this was how
> Synology handled storage array sizes > 16TB on their older systems.
> (All of their new NAS boxes use 64-bit processors.)
> 
> [1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/
> 
> Cheers,
> 
> 					- Ted
> 

* Re: page->index limitation on 32bit system?
  2021-02-20  2:20           ` Erik Jensen
@ 2021-02-20  3:40             ` Matthew Wilcox
  0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-20  3:40 UTC (permalink / raw)
  To: Erik Jensen; +Cc: Theodore Ts'o, Qu Wenruo, Linux FS Devel, linux-btrfs

On Fri, Feb 19, 2021 at 06:20:43PM -0800, Erik Jensen wrote:
> I assume the 4KiB entry size in the page cache is fundamental, and can't be,
> e.g., increased to 16KiB to allow addressing up to 64TiB of storage?

The bootlin link I sent in the other email does exactly that.
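
For reference, the arithmetic behind these limits is the page size
multiplied by the number of distinct page indexes.  A minimal userspace
sketch, assuming a 32-bit index; max_cache_bytes() is a hypothetical
helper for illustration, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): number of bytes addressable
 * through a page cache whose index is index_bits wide and whose pages
 * are (1 << page_shift) bytes each.
 */
static uint64_t max_cache_bytes(unsigned int page_shift, unsigned int index_bits)
{
	/* Keep the result representable in 64 bits. */
	assert(page_shift + index_bits < 64);
	return (uint64_t)1 << (page_shift + index_bits);
}
```

With a 32-bit index, max_cache_bytes(12, 32) is 2^44 bytes (16TiB) for
4KiB pages, and max_cache_bytes(14, 32) is 2^46 bytes (64TiB) for 16KiB
pages.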


* Re: page->index limitation on 32bit system?
  2021-02-18 13:39     ` Matthew Wilcox
  2021-02-19  0:37       ` Qu Wenruo
@ 2021-02-20 23:02       ` Erik Jensen
  2021-02-20 23:22         ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Erik Jensen @ 2021-02-20 23:02 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: Linux FS Devel, linux-btrfs

On 2/18/21 5:39 AM, Matthew Wilcox wrote:
> On Thu, Feb 18, 2021 at 08:42:14PM +0800, Qu Wenruo wrote:
>> [...]
>> BTW, what would be the extra cost by converting page::index to u64?
>> I know tons of printk() would cause warning, but most 64bit systems
>> should not be affected anyway.
> 
> No effect for 64-bit systems, other than the churn.
> 
> For 32-bit systems, it'd have some pretty horrible overhead.  You don't
> just have to touch the page cache, you have to convert the XArray.
> It's doable (I mean, it's been done), but it's very costly for all the
> 32-bit systems which don't use a humongous filesystem.  And we could
> minimise that overhead with a typedef, but then the source code gets
> harder to work with.

Out of curiosity, would it be at all feasible to use 64-bits for the 
page offset *without* changing XArray, perhaps by indexing by the lower 
32-bits, and evicting the page that's there if the top bits don't match 
(vaguely like how the CPU cache works)? Or, if there are cases where a 
page can't be evicted (I don't know if this can ever happen), use chaining?

I would expect index contention to be extremely uncommon, and it could 
only happen for inodes larger than 16 TiB, which can't be used at all 
today. I don't know how many data structures store page offsets today, 
but it seems like this should significantly reduce the performance 
impact versus upping XArray to 64-bit indexes.
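
To make the idea concrete, here is a toy userspace model of a
direct-mapped lookup keyed by the low bits of a 64-bit page offset.
Everything in it (the toy_* names, the 16-slot table standing in for a
32-bit index space) is an assumption for illustration only; it does not
model locking, pinned pages, or chaining:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 16u	/* stands in for the 2^32 slots of a 32-bit index */

struct toy_slot {
	bool valid;
	uint64_t full_index;	/* complete 64-bit page offset held in this slot */
};

struct toy_cache {
	struct toy_slot slots[TOY_SLOTS];
};

/* Hit only if the slot is occupied and the full 64-bit index matches. */
static bool toy_lookup(const struct toy_cache *c, uint64_t index)
{
	assert(c != NULL);
	const struct toy_slot *s = &c->slots[index % TOY_SLOTS];

	return s->valid && s->full_index == index;
}

/*
 * Install index in its slot, replacing whatever was there.
 * Returns true if a different page had to be evicted.
 */
static bool toy_insert(struct toy_cache *c, uint64_t index)
{
	assert(c != NULL);
	struct toy_slot *s = &c->slots[index % TOY_SLOTS];
	bool evicted = s->valid && s->full_index != index;

	s->valid = true;
	s->full_index = index;
	return evicted;
}
```

Two offsets that differ only in their upper bits land in the same slot,
so inserting the second evicts the first; that is the contention case
described above.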


* Re: page->index limitation on 32bit system?
  2021-02-20 23:02       ` Erik Jensen
@ 2021-02-20 23:22         ` Matthew Wilcox
  2021-02-21  0:01           ` Erik Jensen
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-20 23:22 UTC (permalink / raw)
  To: Erik Jensen; +Cc: Qu Wenruo, Linux FS Devel, linux-btrfs

On Sat, Feb 20, 2021 at 03:02:26PM -0800, Erik Jensen wrote:
> On 2/18/21 5:39 AM, Matthew Wilcox wrote:
> > On Thu, Feb 18, 2021 at 08:42:14PM +0800, Qu Wenruo wrote:
> > > [...]
> > > BTW, what would be the extra cost by converting page::index to u64?
> > > I know tons of printk() would cause warning, but most 64bit systems
> > > should not be affected anyway.
> > 
> > No effect for 64-bit systems, other than the churn.
> > 
> > For 32-bit systems, it'd have some pretty horrible overhead.  You don't
> > just have to touch the page cache, you have to convert the XArray.
> > It's doable (I mean, it's been done), but it's very costly for all the
> > 32-bit systems which don't use a humongous filesystem.  And we could
> > minimise that overhead with a typedef, but then the source code gets
> > harder to work with.
> 
> Out of curiosity, would it be at all feasible to use 64-bits for the page
> offset *without* changing XArray, perhaps by indexing by the lower 32-bits,
> and evicting the page that's there if the top bits don't match (vaguely like
> how the CPU cache works)? Or, if there are cases where a page can't be
> evicted (I don't know if this can ever happen), use chaining?
> 
> I would expect index contention to be extremely uncommon, and it could only
> happen for inodes larger than 16 TiB, which can't be used at all today. I
> don't know how many data structures store page offsets today, but it seems
> like this should significantly reduce the performance impact versus upping
> XArray to 64-bit indexes.

Again, you're asking for significant development work for a dying
platform.

Did you try the bootlin patch?


* Re: page->index limitation on 32bit system?
  2021-02-20 23:22         ` Matthew Wilcox
@ 2021-02-21  0:01           ` Erik Jensen
  2021-02-21 17:15             ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Erik Jensen @ 2021-02-21  0:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Qu Wenruo, Linux FS Devel, linux-btrfs

On Sat, Feb 20, 2021 at 3:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Feb 20, 2021 at 03:02:26PM -0800, Erik Jensen wrote:
> > Out of curiosity, would it be at all feasible to use 64-bits for the page
> > offset *without* changing XArray, perhaps by indexing by the lower 32-bits,
> > and evicting the page that's there if the top bits don't match (vaguely like
> > how the CPU cache works)? Or, if there are cases where a page can't be
> > evicted (I don't know if this can ever happen), use chaining?
> >
> > I would expect index contention to be extremely uncommon, and it could only
> > happen for inodes larger than 16 TiB, which can't be used at all today. I
> > don't know how many data structures store page offsets today, but it seems
> > like this should significantly reduce the performance impact versus upping
> > XArray to 64-bit indexes.
>
> Again, you're asking for significant development work for a dying
> platform.

Depending on how complex it would be, I'm not unwilling to give it a
go myself, but I admittedly have no kernel development experience or
knowledge of how locking works around the page cache. E.g., I have no
idea if evicting the old page at an index before bringing in a new one
is even possible without causing deadlocks right and left.

> Did you try the bootlin patch?

While looking into it, I discovered that btrfs can't currently handle
mounting a filesystem that was created on a system with a different
page size. However, it sounds like there is currently work being done
to support subpage sector sizes, with read-only support coming in 5.12
and write support coming later, so hopefully the bootlin patch will be
helpful to bump my page size up to 64 KiB once btrfs support for it is
fully stable. Thanks!

It does feel like I'd just be kicking the can down the road a bit, but
hopefully it will turn out to be long enough for there to be either a
better fix or an AArch64 system that meets my needs by then (e.g., if
Kobol were to release a version of the Helios64 with ECC RAM).

I do appreciate your help and explanations.

Thanks!


* Re: page->index limitation on 32bit system?
  2021-02-21  0:01           ` Erik Jensen
@ 2021-02-21 17:15             ` Matthew Wilcox
  0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2021-02-21 17:15 UTC (permalink / raw)
  To: Erik Jensen; +Cc: Qu Wenruo, Linux FS Devel, linux-btrfs

On Sat, Feb 20, 2021 at 04:01:17PM -0800, Erik Jensen wrote:
> On Sat, Feb 20, 2021 at 3:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Sat, Feb 20, 2021 at 03:02:26PM -0800, Erik Jensen wrote:
> > > Out of curiosity, would it be at all feasible to use 64-bits for the page
> > > offset *without* changing XArray, perhaps by indexing by the lower 32-bits,
> > > and evicting the page that's there if the top bits don't match (vaguely like
> > > how the CPU cache works)? Or, if there are cases where a page can't be
> > > evicted (I don't know if this can ever happen), use chaining?
> > >
> > > I would expect index contention to be extremely uncommon, and it could only
> > > happen for inodes larger than 16 TiB, which can't be used at all today. I
> > > don't know how many data structures store page offsets today, but it seems
> > > like this should significantly reduce the performance impact versus upping
> > > XArray to 64-bit indexes.
> >
> > Again, you're asking for significant development work for a dying
> > platform.
> 
> Depending on how complex it would be, I'm not unwilling to give it a
> go myself, but I admittedly have no kernel development experience or
> knowledge of how locking works around the page cache. E.g., I have no
> idea if evicting the old page at an index before bringing in a new one
> is even possible without causing deadlocks right and left.

I wouldn't recommend the page cache as the ideal place to start learning
how to hack on the kernel.  Not only is it complex, it affects almost
everything.

What might work is using "auxiliary" inodes for btrfs's special purpose.
Allocate an array of inodes and use inodes[index / (ULONG_MAX + 1)]
and look up the page at index % (ULONG_MAX + 1).
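
A rough userspace sketch of that split, assuming a 32-bit unsigned long
on the target.  The names (aux_location, aux_locate, aux_join) are
hypothetical and nothing here is existing btrfs code.  Note that
ULONG_MAX + 1 wraps to 0 if evaluated as a 32-bit unsigned long, so the
division and modulo are written as a 64-bit shift and mask:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 32	/* assumed width of unsigned long on the 32-bit target */

struct aux_location {
	uint32_t inode_slot;	/* which auxiliary inode: index / (ULONG_MAX + 1) */
	uint32_t page_index;	/* index inside that inode: index % (ULONG_MAX + 1) */
};

/* Rebuild the 64-bit page index from its two halves. */
static uint64_t aux_join(struct aux_location loc)
{
	return ((uint64_t)loc.inode_slot << INDEX_BITS) | loc.page_index;
}

/* Split a 64-bit page index into an inode slot and a 32-bit index. */
static struct aux_location aux_locate(uint64_t index)
{
	struct aux_location loc;

	loc.inode_slot = (uint32_t)(index >> INDEX_BITS);
	loc.page_index = (uint32_t)(index & 0xffffffffu);

	/* The split must be lossless. */
	assert(aux_join(loc) == index);
	return loc;
}
```

For example, page index (3 << 32) | 7 maps to auxiliary inode 3, page 7.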


* Re: page->index limitation on 32bit system?
  2021-02-19 23:10           ` Qu Wenruo
  2021-02-20  0:23             ` Matthew Wilcox
@ 2021-02-22  0:19             ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2021-02-22  0:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Theodore Ts'o, Matthew Wilcox, Linux FS Devel, linux-btrfs

On Sat, Feb 20, 2021 at 07:10:14AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/2/20 上午12:12, Theodore Ts'o wrote:
> > On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> > > So it means the 32bit archs are already 2nd tier targets for at least
> > > upstream linux kernel?
> > 
> > At least as far as btrfs is concerned, anyway....
> 
> I'm afraid that would be the case.
> 
> But I'm still interested in how other fses handle such problem.

Refuse to mount >16TB on 32 bit, 4kB page systems.  And set the max
file offset for such systems to 16TB so sparse files can't be larger
than what the kernel supports. See xfs_sb_validate_fsb_count() call
and the file offset checks against MAX_LFS_FILESIZE in
xfs_fs_fill_super()...
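
As a rough illustration of that kind of mount-time check, here is a
userspace model.  It is not the XFS code; fs_fits_page_cache() and its
parameters are hypothetical names chosen for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a mount-time size check.
 * nblocks:     filesystem size in blocks
 * block_shift: log2 of the block size
 * page_shift:  log2 of the page size (12 for 4kB pages)
 * index_bits:  width of the page cache index (32 on a 32 bit system)
 */
static bool fs_fits_page_cache(uint64_t nblocks, unsigned int block_shift,
			       unsigned int page_shift, unsigned int index_bits)
{
	uint64_t npages, max_index;

	assert(block_shift <= page_shift);
	assert(index_bits >= 1 && index_bits <= 64);

	/* Convert the block count into a page count. */
	npages = nblocks >> (page_shift - block_shift);

	/* Largest value the index type can hold. */
	max_index = index_bits == 64 ? UINT64_MAX
				     : (((uint64_t)1 << index_bits) - 1);

	/* Refuse anything whose page count exceeds the index range. */
	return npages <= max_index;
}
```

In this model an 8TB filesystem with 4kB blocks and pages passes on a
32-bit index, a 16TB one is refused, and the same 16TB filesystem passes
with a 64-bit index.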

FWIW, XFS has been doing this for roughly 20 years now - >16TB on 32
bit machines was an issue for XFS way back at the turn of the
century...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: page->index limitation on 32bit system?
  2021-02-19 14:22     ` Matthew Wilcox
  2021-02-19 17:51       ` Matthew Wilcox
@ 2021-02-22  1:48       ` Dave Chinner
  2021-03-01  1:49         ` GWB
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2021-02-22  1:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Erik Jensen, Qu Wenruo, Linux FS Devel, linux-btrfs

On Fri, Feb 19, 2021 at 02:22:01PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 18, 2021 at 01:27:09PM -0800, Erik Jensen wrote:
> > On 2/18/21 4:15 AM, Matthew Wilcox wrote:
> > 
> > > On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
> > > > Recently we got a strange bug report that, on 32bit systems like armv6
> > > > or non-64bit x86, certain large btrfs can't be mounted.
> > > > 
> > > > It turns out that, since page->index is just unsigned long, and on 32bit
> > > > systems, that can just be 32bit.
> > > > 
> > > > And when a filesystem is utilizing any page offset over 4T, page->index
> > > > gets truncated, causing various problems.
> > > 4TB?  I think you mean 16TB (4kB * 4GB)
> > > 
> > > Yes, this is a known limitation.  Some vendors have gone to the trouble
> > > of introducing a new page_index_t.  I'm not convinced this is a problem
> > > worth solving.  There are very few 32-bit systems with this much storage
> > > on a single partition (everything should work fine if you take a 20TB
> > > drive and partition it into two 10TB partitions).
> > For what it's worth, I'm the reporter of the original bug. My use case is a
> > custom NAS system. It runs on a 32-bit ARM processor, and has 5 8TB drives,
> > which I'd like to use as a single, unified storage array. I chose btrfs for
> > this project due to the filesystem-integrated snapshots and checksums.
> > Currently, I'm working around this issue by exporting the raw drives using
> > nbd and mounting them on a 64-bit system to access the filesystem, but this
> > is very inconvenient, only allows one machine to access the filesystem at a
> > time, and prevents running any tools that need access to the filesystem
> > (such as backup and file sync utilities) on the NAS itself.
> > 
> > It sounds like this limitation would also prevent me from trying to use a
> > different filesystem on top of software RAID, since in that case the logical
> > filesystem would still be over 16TB.
> > 
> > > As usual, the best solution is for people to stop buying 32-bit systems.
> > I purchased this device in 2018, so it's not exactly ancient. At the time,
> > it was the only SBC I could find that was low power, used ECC RAM, had a
> > crypto accelerator, and had multiple sata ports with port-multiplier
> > support.
> 
> I'm sorry you bought unsupported hardware.
> 
> This limitation has been known since at least 2009:
> https://lore.kernel.org/lkml/19041.4714.686158.130252@notabene.brown/

2004:

commit 839099eb5ea07aef093ae2c5674f5a16a268f8b6
Author: Eric Sandeen <sandeen@sgi.com>
Date:   Wed Jul 14 20:02:01 2004 +0000

    Add filesystem size limit even when XFS_BIG_BLKNOS is
    in effect; limited by page cache index size (16T on ia32)

This all popped up on XFS around 2003 when the disk address
space was expanded from 32 bits to 64 bits on 32 bit systems
(CONFIG_LBD) and so XFS could define XFS_BIG_FILESYSTEMS on 32 bit
systems for the first time.

FWIW, from an early 1994 commit into xfs_types.h:

+/*
+ * Some types are conditional based on the selected configuration.
+ * Set XFS_BIG_FILES=1 or 0 and XFS_BIG_FILESYSTEMS=1 or 0 depending
+ * on the desired configuration.
+ * XFS_BIG_FILES needs pgno_t to be 64 bits.
+ * XFS_BIG_FILESYSTEMS needs daddr_t to be 64 bits.
+ *
+ * Expect these to be set from klocaldefs, or from the machine-type
+ * defs files for the normal case.
+ */

So limiting file and filesystem sizes on 32 bit systems is
something XFS has done right from the start...

> In the last decade, nobody's tried to fix it in mainline that I know of.
> As I said, some vendors have tried to fix it in their NAS products,
> but I don't know where to find that patch any more.

It's not supportable from a disaster recovery perspective. I recently
saw a 14TB filesystem with billions of hardlinks in it require 240GB
of RAM to run xfs_repair. We just can't support large filesystems
on 32 bit systems, and it has nothing to do with simple stuff like
page cache index sizes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: page->index limitation on 32bit system?
  2021-02-22  1:48       ` Dave Chinner
@ 2021-03-01  1:49         ` GWB
  0 siblings, 0 replies; 21+ messages in thread
From: GWB @ 2021-03-01  1:49 UTC (permalink / raw)
  To: linux-btrfs, Linux FS Devel, Erik Jensen
  Cc: Matthew Wilcox, Dave Chinner, Qu Wenruo

Getting btrfs patched for 32 bit arm would be of interest, but I'm not
suggesting the devs can do much more with that.  In practical usage,
we ran into similar difficulties a while back on embedded and
dedicated devices which would boot btrfs, but eventually it was easier
to put storage on nilfs2.  Nilfs2 is nice, but not nearly as developed
as btrfs (it snapshots, but did not allow sending and incremental
backups, and it is comparatively slow).  No idea, however, if it would
even compile on arm 32.

I'm delighted, Erik, that you were able to get btrfs to function to
the extent that it did, and that you're willing to put time and effort
in btrfs on arm 32 bit.  But before you do, consider nilfs2 for
storage.  You can even get zfs to work on some 32 bit systems but RAM
is an issue.  Also take a look at the Raspberry Pi 4.  There is an
8 GB 64 bit embedded version called Compute Module 4, which seems to
handle btrfs, and is not too pricey.  ZFS can work, but it's too
memory intensive for 8 GB.

Gordon

On Sun, Feb 21, 2021 at 7:52 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 19, 2021 at 02:22:01PM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 18, 2021 at 01:27:09PM -0800, Erik Jensen wrote:
> > > On 2/18/21 4:15 AM, Matthew Wilcox wrote:
> > >
> > > > On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
> > > > > Recently we got a strange bug report that, on 32bit systems like armv6
> > > > > or non-64bit x86, certain large btrfs can't be mounted.
> > > > >
> > > > > It turns out that, since page->index is just unsigned long, and on 32bit
> > > > > systems, that can just be 32bit.
> > > > >
> > > > > And when a filesystem is utilizing any page offset over 4T, page->index
> > > > > gets truncated, causing various problems.
> > > > 4TB?  I think you mean 16TB (4kB * 4GB)
> > > >
> > > > Yes, this is a known limitation.  Some vendors have gone to the trouble
> > > > of introducing a new page_index_t.  I'm not convinced this is a problem
> > > > worth solving.  There are very few 32-bit systems with this much storage
> > > > on a single partition (everything should work fine if you take a 20TB
> > > > drive and partition it into two 10TB partitions).
> > > For what it's worth, I'm the reporter of the original bug. My use case is a
> > > custom NAS system. It runs on a 32-bit ARM processor, and has 5 8TB drives,
> > > which I'd like to use as a single, unified storage array. I chose btrfs for
> > > this project due to the filesystem-integrated snapshots and checksums.
> > > Currently, I'm working around this issue by exporting the raw drives using
> > > nbd and mounting them on a 64-bit system to access the filesystem, but this
> > > is very inconvenient, only allows one machine to access the filesystem at a
> > > time, and prevents running any tools that need access to the filesystem
> > > (such as backup and file sync utilities) on the NAS itself.
> > >
> > > It sounds like this limitation would also prevent me from trying to use a
> > > different filesystem on top of software RAID, since in that case the logical
> > > filesystem would still be over 16TB.
> > >
> > > > As usual, the best solution is for people to stop buying 32-bit systems.
> > > I purchased this device in 2018, so it's not exactly ancient. At the time,
> > > it was the only SBC I could find that was low power, used ECC RAM, had a
> > > crypto accelerator, and had multiple sata ports with port-multiplier
> > > support.
> >
> > I'm sorry you bought unsupported hardware.
> >
> > This limitation has been known since at least 2009:
> > https://lore.kernel.org/lkml/19041.4714.686158.130252@notabene.brown/
>
> 2004:
>
> commit 839099eb5ea07aef093ae2c5674f5a16a268f8b6
> Author: Eric Sandeen <sandeen@sgi.com>
> Date:   Wed Jul 14 20:02:01 2004 +0000
>
>     Add filesystem size limit even when XFS_BIG_BLKNOS is
>     in effect; limited by page cache index size (16T on ia32)
>
> This all popped up on XFS around 2003 when the disk address
> space was expanded from 32 bits to 64 bits on 32 bit systems
> (CONFIG_LBD) and so XFS could define XFS_BIG_FILESYSTEMS on 32 bit
> systems for the first time.
>
> FWIW, from an early 1994 commit into xfs_types.h:
>
> +/*
> + * Some types are conditional based on the selected configuration.
> + * Set XFS_BIG_FILES=1 or 0 and XFS_BIG_FILESYSTEMS=1 or 0 depending
> + * on the desired configuration.
> + * XFS_BIG_FILES needs pgno_t to be 64 bits.
> + * XFS_BIG_FILESYSTEMS needs daddr_t to be 64 bits.
> + *
> + * Expect these to be set from klocaldefs, or from the machine-type
> + * defs files for the normal case.
> + */
>
> So limiting file and filesystem sizes on 32 bit systems is
> something XFS has done right from the start...
>
> > In the last decade, nobody's tried to fix it in mainline that I know of.
> > As I said, some vendors have tried to fix it in their NAS products,
> > but I don't know where to find that patch any more.
>
> It's not supportable from a disaster recovery perspective. I recently
> saw a 14TB filesystem with billions of hardlinks in it require 240GB
> of RAM to run xfs_repair. We just can't support large filesystems
> on 32 bit systems, and it has nothing to do with simple stuff like
> page cache index sizes...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


