linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] remove 2TB block device limit
       [not found] <1060250300@toto.iv>
@ 2002-05-13 10:28 ` Peter Chubb
  2002-05-13 12:13   ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Chubb @ 2002-05-13 10:28 UTC (permalink / raw)
  To: Peter Chubb; +Cc: linux-kernel, torvalds, axboe, akpm, martin, neilb


There's now a patch available against 2.5.15, and the BK repository
has been updated to v2.5.15 as well:

    http://www.gelato.unsw.edu.au/patches/2.5.15-largefile-patch
    bk://gelato.unsw.edu.au:2023/

Peter C
peterc@gelato.unsw.edu.au


* Re: [PATCH] remove 2TB block device limit
  2002-05-13 10:28 ` [PATCH] remove 2TB block device limit Peter Chubb
@ 2002-05-13 12:13   ` Christoph Hellwig
  2002-05-14  0:30     ` Peter Chubb
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2002-05-13 12:13 UTC (permalink / raw)
  To: Peter Chubb; +Cc: linux-kernel, torvalds, axboe, akpm, martin, neilb

On Mon, May 13, 2002 at 08:28:30PM +1000, Peter Chubb wrote:
> 
> There's now a patch available against 2.5.15, and the BK repository
> has been updated to v2.5.15 as well:
> 
>     http://www.gelato.unsw.edu.au/patches/2.5.15-largefile-patch
>     bk://gelato.unsw.edu.au:2023/

This looks really good, I'd like to see something like that merged soon!
Some comments:

 - please move the sector_t typedef from <linux/types.h> to <asm/types.h>,
   so 64-bit arches don't have to have the CONFIG_ option at all, some
   32-bit platforms that are unlikely to ever support large disks
   (m68k comes to mind) can make it 32-bit unconditionally, and some like
   i386 can use a config option.
 - sector_div should move to a common header (blkdev.h?)

And something related to general sector_t usage:

 - what about sector_t vs daddr_t?  Linux still has daddr_t, but it is
   still always 32-bit.  I think a big s/sector_t/daddr_t/ would fit the
   traditional unix way of doing disk addressing.
 - why is the get_block block argument a sector_t?  It represents a logical
   filesystem block, which usually is larger than the sector, not to
   mention that for the usual blocksize == PAGE_SIZE case a ulong is
   enough, as that is the same size at which the pagecache limit triggers.
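
As a rough illustration of the first two points, the sketch below shows one
way a per-architecture typedef and a shared division helper could look.  It
is a userspace approximation, not the patch itself: SKETCH_LARGE_BLOCKDEV is
a made-up stand-in for the CONFIG_ option, and sketch_sector_div only
assumes the usual calling convention for such a helper (divide in place,
return the remainder).

```c
#include <assert.h>
#include <stdint.h>

/* SKETCH_LARGE_BLOCKDEV is hypothetical; it stands in for the CONFIG_ option. */
#if defined(SKETCH_LARGE_BLOCKDEV) || UINTPTR_MAX > 0xffffffffu
typedef uint64_t sector_t;   /* 64-bit arches, or 32-bit with the option on */
#else
typedef uint32_t sector_t;   /* small 32-bit platforms, e.g. m68k */
#endif

/* Divide *n by base in place and return the remainder. */
static uint32_t sketch_sector_div(sector_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}
```

Keeping the helper in one common header would mean each architecture only
has to pick the width of the typedef.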


* Re: [PATCH] remove 2TB block device limit
  2002-05-13 12:13   ` Christoph Hellwig
@ 2002-05-14  0:30     ` Peter Chubb
  2002-05-14  1:36       ` Anton Altaparmakov
  2002-05-14  2:09       ` Andrew Morton
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Chubb @ 2002-05-14  0:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Chubb, linux-kernel, torvalds, axboe, akpm, martin, neilb

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> On Mon, May 13, 2002 at 08:28:30PM +1000, Peter Chubb
Christoph> wrote:
>> There's now a patch available against 2.5.15, and the BK repository
>> has been updated to v2.5.15 as well:
>> 
>> http://www.gelato.unsw.edu.au/patches/2.5.15-largefile-patch
>> bk://gelato.unsw.edu.au:2023/

Christoph> This looks really good, I'd like to see something like that
Christoph> merged soon!  Some comments:

Christoph>  - please move the sector_t typedef from <linux/types.h> to
Christoph> <asm/types.h>, so 64-bit arches don't have to have the
Christoph> CONFIG_ option at all, some 32-bit platforms that are
Christoph> unlikely to ever support large disks (m68k comes to mind)
Christoph> can make it 32-bit unconditionally, and some like i386 can
Christoph> use a config option.
Christoph>  - sector_div should move to a common header (blkdev.h?)

That's not a bad idea, I'll do it.

Christoph> And something related to general sector_t usage:

Christoph>  - what about sector_t vs daddr_t?  Linux still has
Christoph> daddr_t, but it is still always 32bit, I think a big
Christoph> s/sector_t/daddr_t/ would fit the traditional unix way of
Christoph> doing disk addressing

Yes, I considered that, but daddr_t is exported to userland by the tape
ioctls, and is defined to be 32-bit, so it'd require out-of-kernel changes.

Besides, Jens had introduced sector_t for the purpose of counting
blocks and sectors, so I thought I may as well use it.  One could
argue that it's misnamed (personally, I liked Ben's `blkoff_t' for
offset in blocks), but it's been there for four months or so, and was
already being used in likely places throughout the block layer.  I
just extended its use to all the places I thought were necessary
(there may be some paths that I've missed, but I hope not).

Christoph>  - why is the get_block block argument
Christoph> a sector_t?  It presents a logical filesystem block which
Christoph> usually is larger than the sector, not to mention that for
Christoph> the usual blocksize == PAGE_SIZE case a ulong is enough as
Christoph> that is the same size the pagecache limit triggers.

For filesystems that *can* handle logical filesystem blocks beyond the
2^32 limit (i.e., that use >32bit offsets in their on-disc format),
the get_block() argument has to be > 32bits long.  At the moment
that's only JFS and XFS, but reiserfs version 4 looks as if it might
go that way.  We'll need this especially when the pagecache limit is
gone.

Besides, blocksize is not usually pagesize.  ext[23] uses 1k or 4k
blocks depending on the size and expected use of the filesystem; alpha
pagesize is usually 8k, for example.  The arm uses 4k, 16k or 32k
pagesizes depending on the model.

So on 32-bit systems, ulong is not enough.  (in fact if you look at
jfs, the first thing jfs_get_block does is convert the block number
arg to a 64-bit number).
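
To make the 32-bit problem concrete, here is a small sketch of what happens
when a logical block number beyond 2^32 is squeezed through a 32-bit
argument.  uint32_t stands in for unsigned long on a 32-bit system, and the
function names are illustrative only; they do not come from JFS or from the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Models a get_block-style argument on a 32-bit arch: unsigned long is 32 bits. */
static uint64_t block_seen_through_ulong32(uint64_t logical_block)
{
    uint32_t narrowed = (uint32_t)logical_block;  /* high bits are silently lost */

    return narrowed;
}

/* Byte offset of a block; the multiply must be done in 64 bits. */
static uint64_t block_to_byte_offset(uint64_t logical_block, uint32_t blocksize)
{
    assert(blocksize != 0);
    return logical_block * blocksize;
}
```

With 1k blocks, block 2^32 sits at byte offset 2^42 (4TB), so a filesystem
whose on-disc format can address it needs the wider argument.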

Peter C


* Re: [PATCH] remove 2TB block device limit
  2002-05-14  0:30     ` Peter Chubb
@ 2002-05-14  1:36       ` Anton Altaparmakov
  2002-05-16 20:32         ` Daniel Phillips
  2002-05-14  2:09       ` Andrew Morton
  1 sibling, 1 reply; 41+ messages in thread
From: Anton Altaparmakov @ 2002-05-14  1:36 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Christoph Hellwig, linux-kernel, torvalds, axboe, akpm, martin, neilb

On Tue, 14 May 2002, Peter Chubb wrote:
> >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:
> 
> Christoph> On Mon, May 13, 2002 at 08:28:30PM +1000, Peter Chubb
> Christoph> wrote:
> >> There's now a patch available against 2.5.15, and the BK repository
> >> has been updated to v2.5.15 as well:
> >> 
> >> http://www.gelato.unsw.edu.au/patches/2.5.15-largefile-patch
> >> bk://gelato.unsw.edu.au:2023/
> 
> Christoph> This looks really good, I'd like to see something like that
> Christoph> merged soon!  Some comments:
[snip]
> Christoph>  - why is the get_block block argument
> Christoph> a sector_t?  It presents a logical filesystem block which
> Christoph> usually is larger than the sector, not to mention that for
> Christoph> the usual blocksize == PAGE_SIZE case a ulong is enough as
> Christoph> that is the same size the pagecache limit triggers.
> 
> For filesystems that *can* handle logical filesystem blocks beyond the
> 2^32 limit (i.e., that use >32bit offsets in their on-disc format),
> the get_block() argument has to be > 32bits long.  At the moment
> that's only JFS and XFS, but reiserfs version 4 looks as if it might
> go that way.  We'll need this especially when the pagecache limit is
> gone.

NTFS uses signed 64 bits for all offsets on disk, too.  And yes, at the
moment the pagecache limit is also a problem, which we just ignore in the
hope that the kernel will have gone to 64 bits by the time devices grow
large enough to start using > 32 bits of blocks/pages...

> Besides, blocksize is not usually pagesize.  ext[23] uses 1k or 4k
> blocks depending on the size and expected use of the filesystem; alpha
> pagesize is usually 8k, for example.  The arm uses 4k, 16k or 32k
> pagesizes depending on the model.
> 
> So on 32-bit systems, ulong is not enough.  (in fact if you look at
> jfs, the first thing jfs_get_block does is convert the block number
> arg to a 64-bit number).

NTFS does that, too, but not quite immediately... (-;

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [PATCH] remove 2TB block device limit
  2002-05-14  0:30     ` Peter Chubb
  2002-05-14  1:36       ` Anton Altaparmakov
@ 2002-05-14  2:09       ` Andrew Morton
  2002-05-14  2:58         ` Peter Chubb
  2002-05-14  7:21         ` Christoph Hellwig
  1 sibling, 2 replies; 41+ messages in thread
From: Andrew Morton @ 2002-05-14  2:09 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Christoph Hellwig, linux-kernel, torvalds, axboe, martin, neilb

Peter Chubb wrote:
> 
> ...
> Christoph>  - why is the get_block block argument
> Christoph> a sector_t?  It presents a logical filesystem block which
> Christoph> usually is larger than the sector, not to mention that for
> Christoph> the usual blocksize == PAGE_SIZE case a ulong is enough as
> Christoph> that is the same size the pagecache limit triggers.
> 
> For filesystems that *can* handle logical filesystem blocks beyond the
> 2^32 limit (i.e., that use >32bit offsets in their on-disc format),
> the get_block() argument has to be > 32bits long.  At the moment
> that's only JFS and XFS, but reiserfs version 4 looks as if it might
> go that way.  We'll need this especially when the pagecache limit is
> gone.

I think Christoph's point is that a pagecache index is not a sector
number.  We agree that we need to plan for taking it to 64 bits, but
it should be something different. Like pageindex_t, or whatever.

This:

--- linux-2.5.15/include/linux/mm.h	Tue Apr 30 17:56:30 2002
+++ 25/include/linux/mm.h	Mon May 13 19:08:21 2002
@@ -148,7 +148,7 @@ struct vm_operations_struct {
 typedef struct page {
 	struct list_head list;		/* ->mapping has some page lists. */
 	struct address_space *mapping;	/* The inode (or ...) we belong to. */
-	unsigned long index;		/* Our offset within mapping. */
+	sector_t index;			/* Our offset within mapping. */
 	atomic_t count;			/* Usage count, see below. */
 	unsigned long flags;		/* atomic flags, some possibly
 					   updated asynchronously */

looks rather silly, no?

-


* Re: [PATCH] remove 2TB block device limit
  2002-05-14  2:09       ` Andrew Morton
@ 2002-05-14  2:58         ` Peter Chubb
  2002-05-14  7:22           ` Christoph Hellwig
  2002-05-14  7:21         ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Chubb @ 2002-05-14  2:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Chubb, Christoph Hellwig, linux-kernel

>>>>> "Andrew" == Andrew Morton <akpm@zip.com.au> writes:

Andrew> Peter Chubb wrote:
>> ...
Christoph> - why is the get_block block argument a sector_t?  It
Christoph> presents a logical filesystem block which usually is larger
Christoph> than the sector, not to mention that for the usual
Christoph> blocksize == PAGE_SIZE case a ulong is enough as that is
Christoph> the same size the pagecache limit triggers.
>> For filesystems that *can* handle logical filesystem blocks beyond
>> the 2^32 limit (i.e., that use >32bit offsets in their on-disc
>> format), the get_block() argument has to be > 32bits long.  At the
>> moment that's only JFS and XFS, but reiserfs version 4 looks as if
>> it might go that way.  We'll need this especially when the
>> pagecache limit is gone.

Andrew> I think Christoph's point is that a pagecache index is not a
Andrew> sector number.  We agree that we need to plan for taking it to
Andrew> 64 bits, but it should be something different. Like
Andrew> pageindex_t, or whatever.


I'll let Christoph speak for himself, but my point is that
get_block() is an interface exported from the filesystem.  It should
be possible to specify any logical block number that the filesystem
supports.   That the current VM system on 32-bit machines will never
request a block beyond 2^32 is a (one-day-soon-to-be-removed) current
limitation. 

Peter C


* Re: [PATCH] remove 2TB block device limit
  2002-05-14  2:09       ` Andrew Morton
  2002-05-14  2:58         ` Peter Chubb
@ 2002-05-14  7:21         ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2002-05-14  7:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Chubb, linux-kernel, torvalds, axboe, martin, neilb

On Mon, May 13, 2002 at 07:09:59PM -0700, Andrew Morton wrote:
> 
> I think Christoph's point is that a pagecache index is not a sector
> number.  We agree that we need to plan for taking it to 64 bits, but
> it should be something different. Like pageindex_t, or whatever.

I don't think we want to increase it.  First, it grows struct page by 32 bits,
and second, 64-bit arithmetic on 32-bit platforms is still very expensive.
I'd rather see PAGE_CACHE_SIZE growing to address the issues.
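
The size cost can be estimated with a small sketch.  The two structs below
are simplified stand-ins for struct page on a 32-bit arch, not the kernel's
real layout, and uint32_t models pointers and unsigned long there.  The
growth is at least 4 bytes per page; alignment padding may make it 8 on
some ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for struct page on a 32-bit arch; not the kernel layout. */
struct page_index32 {
    uint32_t mapping;   /* pointer-sized field on a 32-bit arch */
    uint32_t index;     /* unsigned long index */
    uint32_t count;
    uint32_t flags;
};

struct page_index64 {
    uint32_t mapping;
    uint64_t index;     /* 64-bit index, as a 64-bit sector_t would give */
    uint32_t count;
    uint32_t flags;
};

/* Extra bytes per page descriptor caused by the wider index. */
static size_t index_growth_bytes(void)
{
    assert(sizeof(struct page_index64) > sizeof(struct page_index32));
    return sizeof(struct page_index64) - sizeof(struct page_index32);
}

/* Total extra memory for a machine with the given number of pages. */
static uint64_t index_growth_total(uint64_t nr_pages)
{
    return nr_pages * index_growth_bytes();
}
```

Since there is one descriptor per physical page, the per-page cost is
multiplied by the number of pages in the machine.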



* Re: [PATCH] remove 2TB block device limit
  2002-05-14  2:58         ` Peter Chubb
@ 2002-05-14  7:22           ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2002-05-14  7:22 UTC (permalink / raw)
  To: Peter Chubb; +Cc: Andrew Morton, Christoph Hellwig, linux-kernel

On Tue, May 14, 2002 at 12:58:33PM +1000, Peter Chubb wrote:
> I'll let Christoph speak for himself, but my point is that
> get_block() is an interface exported from the filesystem.

It isn't.  get_block is a callback for the generic block-based filesystem
library routines.
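
The callback arrangement can be sketched as follows.  Everything here is
hypothetical and heavily simplified: sketch_get_block_t is not the kernel's
get_block signature, and toyfs_get_block and sketch_map_range are invented
names.  The point is only that the generic routine knows nothing about the
on-disc layout and asks the filesystem through the function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified callback type; the real kernel signature differs. */
typedef int (*sketch_get_block_t)(uint64_t logical_block, uint64_t *disk_block);

/* A toy filesystem mapping: data starts at a fixed offset on the device. */
static int toyfs_get_block(uint64_t logical_block, uint64_t *disk_block)
{
    const uint64_t data_start = 100;   /* made-up layout constant */

    *disk_block = data_start + logical_block;
    return 0;
}

/* Generic library routine: it only knows the callback, not the layout. */
static int sketch_map_range(sketch_get_block_t get_block, uint64_t first,
                            unsigned int count, uint64_t *out)
{
    unsigned int i;

    assert(get_block != NULL);
    for (i = 0; i < count; i++) {
        int err = get_block(first + i, &out[i]);

        if (err)
            return err;
    }
    return 0;
}
```

Seen this way, the width of the block argument is a contract between the
library code and each filesystem.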



* Re: [PATCH] remove 2TB block device limit
  2002-05-14  1:36       ` Anton Altaparmakov
@ 2002-05-16 20:32         ` Daniel Phillips
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Phillips @ 2002-05-16 20:32 UTC (permalink / raw)
  To: Anton Altaparmakov, Peter Chubb
  Cc: Christoph Hellwig, linux-kernel, torvalds, axboe, akpm, martin, neilb

On Tuesday 14 May 2002 03:36, Anton Altaparmakov wrote:
> ...And yes at the
> moment the pagecache limit is also a problem which we just ignore in the
> hope that the kernel will have gone to 64 bits by the time devices grow
> that large as to start using > 32 bits of blocks/pages...

PAGE_CACHE_SIZE can also grow, so 32 bit architectures are further away
from the page cache limit than it seems.

-- 
Daniel


* Re: [PATCH] remove 2TB block device limit
  2002-05-17 19:52       ` Daniel Phillips
@ 2002-05-17 20:25         ` Andrew Morton
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2002-05-17 20:25 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jesse Pollard, Peter Chubb, Anton Altaparmakov,
	Christoph Hellwig, linux-kernel, axboe, martin, neilb

Daniel Phillips wrote:
> 
> On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> > Note - most of these really large filesystems allow the inode tables and
> > bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> > and the data on different drives (striped) with a large block size (I believe
> > ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
> > done for two reasons:
> 
> Since we're on this subject, and you have experience with these large block
> sizes, where exactly do you see the large savings?
> 
>   - setup cost of the disk transfer?
>   - rotational latency of small transfers?
>   - setup cost of the network transfer?
>   - interrupt processing overhead?
>   - other?

If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os.  Reads direct into pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).

The remaining profile is interesting.  The workload is simply
`cat large_file > /dev/null':

address  samples  %           symbol
c012b448 33       0.200877    kmem_cache_free
c0131af8 33       0.200877    flush_all_zero_pkmaps   
c01e51bc 33       0.200877    blk_recount_segments    
c01f9aec 34       0.206964    hpt374_udma_stop        
c016eb80 36       0.219138    ext2_get_block          
c0133320 37       0.225225    page_cache_readahead    
c013740c 37       0.225225    __getblk                
c0131ba0 41       0.249574    kmap_high               
c01fa1c4 41       0.249574    ata_start_dma           
c016e7dc 46       0.28001     ext2_block_to_path      
c01e5320 48       0.292184    blk_rq_map_sg           
c01c65d0 50       0.304358    radix_tree_reserve      
c014bfb0 53       0.32262     do_mpage_bio_readpage   
c01f4d88 54       0.328707    ata_irq_request         
c0136b34 64       0.389579    __get_hash_table        
c0126a00 72       0.438276    do_generic_file_read    
c016e910 82       0.499148    ext2_get_branch         
c0126610 88       0.535671    unlock_page             
c0106df4 91       0.553932    system_call             
c012b04c 94       0.572194    kmem_cache_alloc        
c01f2494 126      0.766983    ata_taskfile            
c01c66e8 163      0.992208    radix_tree_lookup       
c012d250 165      1.00438     rmqueue                 
c0105274 2781     16.9284     default_idle            
c0126e48 11009    67.0136     file_read_actor         

That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.

There's 17% "overhead" here.  Going to a larger filesystem
blocksize would provide almost zero benefit in the I/O layers.

Savings from larger blocks and larger pages would come into
the radix tree operations, get_block, a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.

And larger block size significantly penalises bandwidth for
the many-small-file case.  The larger the blocks, the worse
it gets.  You end up having to implement complexities such
as tail-merging to get around the inefficiency which the
workaround for your other inefficiency caused.

And larger pages with small blocks aren't an answer - CPU load
and seek costs from 2-blocks-per-page are measurable.  At
4 blocks-per-page it's getting serious.

Small pages and pagesize=blocksize are good.  I see no point in
going to larger pages or blocks until the current scheme is 
working efficiently and has been *proven* to still be unfixably
inadequate.

The current code sucks.  Simply amortising that suckiness across
larger blocks is not the right thing to do.

-


* Re: [PATCH] remove 2TB block device limit
  2002-05-17 13:32     ` Jesse Pollard
  2002-05-17 18:02       ` Daniel Phillips
  2002-05-17 18:36       ` Andreas Dilger
@ 2002-05-17 19:52       ` Daniel Phillips
  2002-05-17 20:25         ` Andrew Morton
  2 siblings, 1 reply; 41+ messages in thread
From: Daniel Phillips @ 2002-05-17 19:52 UTC (permalink / raw)
  To: Jesse Pollard, Peter Chubb
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb

On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> Note - most of these really large filesystems allow the inode tables and
> bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> and the data on different drives (striped) with a large block size (I believe
> ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
> done for two reasons:

Since we're on this subject, and you have experience with these large block
sizes, where exactly do you see the large savings?

  - setup cost of the disk transfer?
  - rotational latency of small transfers?
  - setup cost of the network transfer?
  - interrupt processing overhead?
  - other?

-- 
Daniel


* Re: [PATCH] remove 2TB block device limit
  2002-05-17 13:32     ` Jesse Pollard
  2002-05-17 18:02       ` Daniel Phillips
@ 2002-05-17 18:36       ` Andreas Dilger
  2002-05-17 19:52       ` Daniel Phillips
  2 siblings, 0 replies; 41+ messages in thread
From: Andreas Dilger @ 2002-05-17 18:36 UTC (permalink / raw)
  To: Jesse Pollard
  Cc: phillips, Peter Chubb, Anton Altaparmakov, Christoph Hellwig,
	linux-kernel, axboe, akpm, martin, neilb

On May 17, 2002  08:32 -0500, Jesse Pollard wrote:
> Note - most of these really large filesystems allow the inode tables and
> bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> and the data on different drives (striped) with a large block size (I believe
> ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.)
> 
> The division allows for high integrity of the metadata (which is also
> backed up daily (incremental) - but without the corresponding datablocks),
> along with maximum capacity for data. 1/5 of 200TB is about 40TB if raid5
> were used with everything.

Interestingly, this can also be done for Lustre, the OSS cluster
filesystem I am currently working on (http://www.clusterfs.com).
All of the filesystem metadata is actually stored on a separate server,
so its disk can be configured totally separately from the file data.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: [PATCH] remove 2TB block device limit
  2002-05-17 18:02       ` Daniel Phillips
@ 2002-05-17 18:26         ` Jesse Pollard
  0 siblings, 0 replies; 41+ messages in thread
From: Jesse Pollard @ 2002-05-17 18:26 UTC (permalink / raw)
  To: phillips, Jesse Pollard, Peter Chubb
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb


> 
> On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> > And for the curious, the filesystems are SAMFS and SAMQFS on Sun E10000s.
> > We migrated the data from Cray NC1 filesystems with DMF - Cray data
> > migration facility (this took over 4 months. Would have taken only a month
> > or two, but we also had to accept new data at the same time).
> 
> Thanks for the fascinating data, however you left out one crucial piece of
> information: how many data bits in your processor?

Sorry - I did omit that..

Sun E10000s are Sparc based, and 64-bit.  I realize this doesn't refer to
the 32-bit boundaries, but they were operating in 32-bit mode for the initial
installation (Solaris 2.6), since some of the filesystem support was 3rd
party and didn't work when the kernel was in 64-bit mode.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.


* Re: [PATCH] remove 2TB block device limit
  2002-05-17 13:32     ` Jesse Pollard
@ 2002-05-17 18:02       ` Daniel Phillips
  2002-05-17 18:26         ` Jesse Pollard
  2002-05-17 18:36       ` Andreas Dilger
  2002-05-17 19:52       ` Daniel Phillips
  2 siblings, 1 reply; 41+ messages in thread
From: Daniel Phillips @ 2002-05-17 18:02 UTC (permalink / raw)
  To: Jesse Pollard, Peter Chubb
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb

On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> And for the curious, the filesystems are SAMFS and SAMQFS on Sun E10000s.
> We migrated the data from Cray NC1 filesystems with DMF - Cray data
> migration facility (this took over 4 months. Would have taken only a month
> or two, but we also had to accept new data at the same time).

Thanks for the fascinating data, however you left out one crucial piece of
information: how many data bits in your processor?

-- 
Daniel


* Re: [PATCH] remove 2TB block device limit
  2002-05-17  0:18   ` Daniel Phillips
  2002-05-17 13:32     ` Jesse Pollard
@ 2002-05-17 15:26     ` Jason L Tibbitts III
  1 sibling, 0 replies; 41+ messages in thread
From: Jason L Tibbitts III @ 2002-05-17 15:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

>>>>> "DP" == Daniel Phillips <phillips@bonn-fries.net> writes:

DP> Incidentally, the 200 TB high-end servers are a lot closer than you
DP> think.

Perhaps the "low-end server" is a more interesting case, though.
After all, you can put over 4TB in one machine now for somewhere near
$12K.  (32 160GB IDE disks on four 3ware 7850 cards running RAID5, all
striped into one huge "RAID50" array.)  This is only going to get
cheaper, and when someone out-does Maxtor for the "big IDE disk" crown,
the capacity will jump again.  Of course it doesn't have the same
performance or extreme reliability of the more expensive solutions,
but hey, it only costs $12K.
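
The "over 4TB" figure can be checked with a short sketch.  It assumes the 32
disks are split evenly, eight per card, that each RAID5 set gives up one
disk's worth of space to parity, and that sizes are in decimal GB; none of
those details are spelled out in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Usable capacity of one RAID5 set: one disk's worth goes to parity. */
static uint64_t raid5_usable_gb(unsigned int disks, uint64_t disk_gb)
{
    assert(disks >= 3);
    return (uint64_t)(disks - 1) * disk_gb;
}

/* RAID50: several equal RAID5 sets striped together. */
static uint64_t raid50_usable_gb(unsigned int sets, unsigned int disks_per_set,
                                 uint64_t disk_gb)
{
    return (uint64_t)sets * raid5_usable_gb(disks_per_set, disk_gb);
}
```

Under those assumptions, raid50_usable_gb(4, 8, 160) comes to 4480GB, which
is already past the 2TB block device limit under discussion.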

Still, 16TB done this way looks to be a few years away yet.

 - J<


* Re: [PATCH] remove 2TB block device limit
  2002-05-17  0:18   ` Daniel Phillips
@ 2002-05-17 13:32     ` Jesse Pollard
  2002-05-17 18:02       ` Daniel Phillips
                         ` (2 more replies)
  2002-05-17 15:26     ` Jason L Tibbitts III
  1 sibling, 3 replies; 41+ messages in thread
From: Jesse Pollard @ 2002-05-17 13:32 UTC (permalink / raw)
  To: phillips, Peter Chubb
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb

Daniel Phillips <phillips@bonn-fries.net>:
> On Friday 17 May 2002 02:04, Peter Chubb wrote:
> > >>>>> "Daniel" == Daniel Phillips <phillips@bonn-fries.net> writes:
> > 
> > Daniel> On Tuesday 14 May 2002 03:36, Anton Altaparmakov wrote:
> > >> ...And yes at the moment the pagecache limit is also a problem
> > >> which we just ignore in the hope that the kernel will have gone to
> > >> 64 bits by the time devices grow that large as to start using > 32
> > >> bits of blocks/pages...
> > 
> > Daniel> PAGE_CACHE_SIZE can also grow, so 32 bit architectures are
> > Daniel> further away from the page cache limit than it seems.
> > 
> > Check out the table on page 2 of
> > http://www.scsita.org/statech/01s005r1.pdf 
> > 
> > The SCSI trade association is predicting 200TB in a high-end server
> > within 10 years --- and  2TB in a high-end desktop by 2004.  I'd take
> > some of their predictions with a grain of salt, however.
> 
> The server definitely won't be running a 32 bit processor, and the high
> end desktop probably won't.  In any event, the current 44 bit limitation
> (32 bit arch) on the page cache takes us up to 16 TB, and going to a 16 KB 
> PAGE_CACHE_SIZE takes us to 64 TB, so I don't think we have to start
> slicing and dicing that part of the kernel just now.  Anybody who expects
> to run into this limitation should of course raise their hand.
> 
> Incidentally, the 200 TB high-end servers are a lot closer than you think.

A lot closer - though ours is really 10TB online, with 300+ TB nearline.

Note - most of these really large filesystems allow the inode tables and
bitmaps to be stored on disks with a relatively small blocksize (raid 5),
and the data on different drives (striped) with a large block size (I believe
ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
done for two reasons:

1. The data blocks are temporary until the file migrates to tape and
   is striped. If we lose a disk out of the stripe, the filesystem is
   sort of dead. Replace the disk and all is well since the metadata IS
   protected. Files that were physically damaged are recalled from tape.
   New files not yet migrated are destroyed, but since the tape archive
   is created near the file creation date (close time), it doesn't really
   happen that often. Not many files will be damaged.
2. The inode/bitmap metadata is raid5 for maximum protection. Backup of
   just the metadata can also be done much faster (about 3 hours in our
   case).

The division allows for high integrity of the metadata (which is also
backed up daily (incremental) - but without the corresponding datablocks),
along with maximum capacity for data. 1/5 of 200TB is about 40TB if raid5
were used with everything.
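
The 40TB figure follows if each RAID5 set has five disks, so that one fifth
of the raw space holds parity.  The set size is an assumption read off the
"1/5"; the helper names below are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Raw space consumed by parity when everything sits on RAID5 sets. */
static uint64_t raid5_parity_tb(uint64_t raw_tb, unsigned int disks_per_set)
{
    assert(disks_per_set >= 3);
    return raw_tb / disks_per_set;
}

/* Space left for data after parity is taken out. */
static uint64_t raid5_data_tb(uint64_t raw_tb, unsigned int disks_per_set)
{
    return raw_tb - raid5_parity_tb(raw_tb, disks_per_set);
}
```

Restricting RAID5 to the metadata keeps that parity cost off the much
larger data area.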

And for the curious, the filesystems are SAMFS and SAMQFS on Sun E10000s.
We migrated the data from Cray NC1 filesystems with DMF - Cray data
migration facility (this took over 4 months. Would have taken only a month
or two, but we also had to accept new data at the same time).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.


* Re: [PATCH] remove 2TB block device limit
  2002-05-16 22:54             ` Andreas Dilger
@ 2002-05-17  1:17               ` Daniel Phillips
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Phillips @ 2002-05-17  1:17 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: davidm, Peter Chubb, Jeremy Andrews, linux-kernel, ext2-devel

On Friday 17 May 2002 00:54, Andreas Dilger wrote:
> A minor question is whether to cap it at 65536 blocks/group or 65528?
> (The number of blocks per group must be a multiple of 8).
> 
> The current layout is such that you will _always_ have at least 3
> blocks in use for each group.  However, if we implement Ted's
> "metagroup" layout (which puts all of a group's bitmaps/itable blocks
> in the first group of its block of group descriptors) then there could
> be cases where a group has no blocks in use, and the free count will
> overflow.
> 
> Having the upper limit at 65536 is aesthetically pleasing, and it aligns
> nicely with LVM (which allocates chunks in power-of-two sizes), but may
> preclude changing such a filesystem to the metagroup layout without a
> larger effort on the resizer's part.  I'll go with 65528 I guess.

I like 65536 as well, but it's easy to relax your slightly lower limit
later if the metagroup design changes, and would not require a compatibility
flag, while tightening it would be a major pain.
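
The choice between the two caps can be illustrated with a sketch, assuming
the per-group free block count is kept in a 16-bit field (the message only
says that the count "will overflow").  The function names are invented for
the example.

```c
#include <assert.h>
#include <stdint.h>

/* Largest blocks-per-group value that is a multiple of 8 and fits in 16 bits. */
static uint32_t max_blocks_per_group_u16(void)
{
    return (UINT16_MAX / 8u) * 8u;
}

/* What a 16-bit free-block counter would hold for a completely empty group. */
static uint16_t free_count_as_stored(uint32_t blocks_per_group)
{
    assert(blocks_per_group % 8 == 0);
    return (uint16_t)blocks_per_group;   /* wraps modulo 65536 */
}
```

Under that assumption an empty 65536-block group would be recorded as having
zero free blocks, while 65528 is the largest multiple of 8 that still fits.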

> Note that going to a metagroup layout would also grow the distance
> between the itable and possible blocks quadratically (the number of
> group descriptors that fit into a block also grows with blocksize),
> but at least it is not cubic growth.  That said, the metagroup layout
> is probably only useful for cases where you _know_ you want huge files
> (in the multi-GB range) and locality of blocks to the single inode block
> is irrelevant.

-- 
Daniel


* Re: [PATCH] remove 2TB block device limit
  2002-05-17  0:04 ` Peter Chubb
@ 2002-05-17  0:18   ` Daniel Phillips
  2002-05-17 13:32     ` Jesse Pollard
  2002-05-17 15:26     ` Jason L Tibbitts III
  0 siblings, 2 replies; 41+ messages in thread
From: Daniel Phillips @ 2002-05-17  0:18 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb

On Friday 17 May 2002 02:04, Peter Chubb wrote:
> >>>>> "Daniel" == Daniel Phillips <phillips@bonn-fries.net> writes:
> 
> Daniel> On Tuesday 14 May 2002 03:36, Anton Altaparmakov wrote:
> >> ...And yes at the moment the pagecache limit is also a problem
> >> which we just ignore in the hope that the kernel will have gone to
> >> 64 bits by the time devices grow that large as to start using > 32
> >> bits of blocks/pages...
> 
> Daniel> PAGE_CACHE_SIZE can also grow, so 32 bit architectures are
> Daniel> further away from the page cache limit than it seems.
> 
> Check out the table on page 2 of
> http://www.scsita.org/statech/01s005r1.pdf 
> 
> The SCSI trade association is predicting 200TB in a high-end server
> within 10 years --- and  2TB in a high-end desktop by 2004.  I'd take
> some of their predictions with a grain of salt, however.

The server definitely won't be running a 32 bit processor, and the high
end desktop probably won't.  In any event, the current 44 bit limitation
(32 bit arch) on the page cache takes us up to 16 TB, and going to a 16 KB 
PAGE_CACHE_SIZE takes us to 64 TB, so I don't think we have to start
slicing and dicing that part of the kernel just now.  Anybody who expects
to run into this limitation should of course raise their hand.
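
The arithmetic behind those two figures can be sketched as follows,
assuming a 32-bit page index and a 4 KB PAGE_CACHE_SIZE (12 offset bits,
hence 44 bits in total).  The function name is made up for illustration
and is not taken from any kernel source:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable through the page cache when the page index is
 * index_bits wide and each page holds page_size bytes.
 * Assumption: the product fits in 64 bits.
 */
uint64_t page_cache_limit(unsigned index_bits, uint64_t page_size)
{
    assert(index_bits < 64);
    return ((uint64_t)1 << index_bits) * page_size;
}
```

With index_bits = 32, a page_size of 4096 gives 2^44 bytes (16 TB) and a
page_size of 16384 gives 2^46 bytes (64 TB).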

Incidentally, the 200 TB high-end servers are a lot closer than you think.

-- 
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
       [not found] <581856778@toto.iv>
@ 2002-05-17  0:04 ` Peter Chubb
  2002-05-17  0:18   ` Daniel Phillips
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Chubb @ 2002-05-17  0:04 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Altaparmakov, Peter Chubb, Christoph Hellwig, linux-kernel,
	axboe, akpm, martin, neilb

>>>>> "Daniel" == Daniel Phillips <phillips@bonn-fries.net> writes:

Daniel> On Tuesday 14 May 2002 03:36, Anton Altaparmakov wrote:
>> ...And yes at the moment the pagecache limit is also a problem
>> which we just ignore in the hope that the kernel will have gone to
>> 64 bits by the time devices grow that large as to start using > 32
>> bits of blocks/pages...

Daniel> PAGE_CACHE_SIZE can also grow, so 32 bit architectures are
Daniel> further away from the page cache limit than it seems.

Check out the table on page 2 of
http://www.scsita.org/statech/01s005r1.pdf 

The SCSI trade association is predicting 200TB in a high-end server
within 10 years --- and  2TB in a high-end desktop by 2004.  I'd take
some of their predictions with a grain of salt, however.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-16 20:22           ` Daniel Phillips
@ 2002-05-16 22:54             ` Andreas Dilger
  2002-05-17  1:17               ` Daniel Phillips
  0 siblings, 1 reply; 41+ messages in thread
From: Andreas Dilger @ 2002-05-16 22:54 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: davidm, Peter Chubb, Jeremy Andrews, linux-kernel, ext2-devel

On May 16, 2002  22:22 +0200, Daniel Phillips wrote:
> On Thursday 16 May 2002 00:17, Andreas Dilger wrote:
> > Even 8kB blocks would theoretically overflow these fields, but you
> > can't yet have a group _totally_ empty (there are always two bitmaps
> > and at least one inode table block), so it would always have less
> > than 65535 blocks free.  Now I realize that this isn't true of the
> > inode table in theory, but you normally also have less than the maximum
> > number of inodes per group - need to check for that.
> > 
> > This could be worked around temporarily by limiting the size of each
> > group to at most 65535 free blocks/inodes.
> 
> Imposing an absolute upper limit of 2**16 blocks per group makes the most 
> sense for now, and may always make the most sense.  Even with a cap on the
> blocks per group, group size still scales directly with block size.  We
> don't want it to scale quadratically.  If it did, then a data block could
> end up 32 GB away from the inode, still in the same group.  This
> effectively destroys the utility of block groups as a means of reducing
> seek latency.

Yes, Ted said the same.

A minor question is whether to cap it at 65536 blocks/group or 65528?
(The number of blocks per group must be a multiple of 8).

The current layout is such that you will _always_ have at least 3
blocks in use for each group.  However, if we implement Ted's
"metagroup" layout (which puts all of a group's bitmaps/itable blocks
in the first group of its block of group descriptors) then there could
be cases where a group has no blocks in use, and the free count will
overflow.

Having the upper limit at 65536 is aesthetically pleasing, and it aligns
nicely with LVM (which allocates chunks in power-of-two sizes), but may
preclude changing such a filesystem to the metagroup layout without a
larger effort on the resizer's part.  I'll go with 65528 I guess.

Note that going to a metagroup layout would also grow the distance
between the itable and possible blocks quadratically (the number of
group descriptors that fit into a block also grows with blocksize),
but at least it is not cubic growth.  That said, the metagroup layout
is probably only useful for cases where you _know_ you want huge files
(in the multi-GB range) and locality of blocks to the single inode block
is irrelevant.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-15 22:17         ` Andreas Dilger
@ 2002-05-16 20:22           ` Daniel Phillips
  2002-05-16 22:54             ` Andreas Dilger
  0 siblings, 1 reply; 41+ messages in thread
From: Daniel Phillips @ 2002-05-16 20:22 UTC (permalink / raw)
  To: Andreas Dilger, davidm
  Cc: Peter Chubb, Jeremy Andrews, linux-kernel, ext2-devel

On Thursday 16 May 2002 00:17, Andreas Dilger wrote:
> On May 10, 2002  17:07 -0700, David Mosberger wrote:
> >On Fri, 10 May 2002 17:46:23 -0600, Andreas Dilger <adilger@clusterfs.com> said:
> >   Andreas> For 64-bit systems like Alpha, it is relatively easy to use
> >   Andreas> 8kB blocks for ext3.  It has been discouraged because such
> >   Andreas> a filesystem is non-portable to other (smaller page-sized)
> >   Andreas> filesystems.  Maybe this rationale should be re-examined -
> >   Andreas> I could probably whip up a configure option for e2fsprogs
> >   Andreas> to allow 8kB blocks in a few hours.
> > 
> > If you do this, please consider allowing a block size up to 64KB.
> > The ia64 kernel offers a choice of 4, 8, 16, and 64KB page size.
> 
> Well, taking a look at the ext2 code, there is a slight problem when
> trying to use block sizes > 8kB.  This is in the group descriptors,
> where they only store a 16 bit count of free blocks and inodes for
> the group.  Since the maximum number of blocks/inodes is 8*blocksize
> (the number of bits that can fit into a single block) you overflow
> these fields if you have more than 64k (8*8k) blocks in a group.
> 
> Even 8kB blocks would theoretically overflow these fields, but you
> can't yet have a group _totally_ empty (there are always two bitmaps
> and at least one inode table block), so it would always have less
> than 65535 blocks free.  Now I realize that this isn't true of the
> inode table in theory, but you normally also have less than the maximum
> number of inodes per group - need to check for that.
> 
> This could be worked around temporarily by limiting the size of each
> group to at most 65535 free blocks/inodes.  The permanent solution is
> to probably add an extra byte for each of these two fields to allow up
> to 16M blocks/inodes per group, which gives us a max block size of 2MB.
>
> This could be a compat ext2 feature, since at worst if we didn't take
> the high byte into account on a block free it could overflow this field
> and we wouldn't be able to allocate from this group until more blocks
> are freed.  We couldn't underflow because the allocator would stop when
> the free block/inode count hit zero for that group, even if there were
> really more free blocks available.
> 
> So, for now I think I'll stick to a maximum of 8kB blocks, and maybe
> we can slip in support for the high byte of the free blocks/inodes
> count when Ted adds in support for metagroups.

Hi Andreas,

Imposing an absolute upper limit of 2**16 blocks per group makes the most 
sense for now, and may always make the most sense.  Even with a cap on the
blocks per group, group size still scales directly with block size.  We
don't want it to scale quadratically.  If it did, then a data block could
end up 32 GB away from the inode, still in the same group.  This
effectively destroys the utility of block groups as a means of reducing
seek latency.
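
A rough sketch of where the 32 GB figure comes from, assuming one bitmap
block per group (so 8*blocksize blocks per group when uncapped) and a
64 KB block size.  The function names are illustrative only and do not
come from the ext2 code:

```c
#include <assert.h>
#include <stdint.h>

/* Blocks covered by one bitmap block: one bit per block, 8 bits per byte. */
uint64_t uncapped_blocks_per_group(uint64_t block_size)
{
    return 8 * block_size;
}

/*
 * Bytes spanned by a single block group.  A cap of 0 means "no cap",
 * so the span grows with the square of the block size; a non-zero cap
 * limits the number of blocks, so the span grows only linearly.
 */
uint64_t group_span_bytes(uint64_t block_size, uint64_t cap)
{
    uint64_t blocks = uncapped_blocks_per_group(block_size);

    assert(block_size > 0);
    if (cap != 0 && blocks > cap)
        blocks = cap;
    return blocks * block_size;
}
```

With 64 KB blocks the uncapped span is 8 * 64K * 64K = 32 GB, while a cap
of 2**16 blocks per group keeps it at 4 GB.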
 
-- 
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  9:04     ` Andrew Morton
@ 2002-05-16 19:08       ` Daniel Phillips
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Phillips @ 2002-05-16 19:08 UTC (permalink / raw)
  To: Andrew Morton, Anton Altaparmakov
  Cc: Peter Chubb, linux-kernel, martin, neilb

On Friday 10 May 2002 11:04, Andrew Morton wrote:
> Anton Altaparmakov wrote:
> > 
> > ...
> > >This code:
> > >
> > >         printk("%lu%s", some_sector, some_string);
> > >
> > >will work fine with 32-bit sector_t.  But with 64-bit sector_t it
> > >will generate a warning at compile-time and an oops at runtime.
> > >
> > >The same problem applies to dma_addr_t.  Jeff, davem and I kicked
> > >that around a while back and ended up deciding that although there
> > >are a number of high-tech solutions, the dumb one was best:
> > 
> > Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and
> > typecast (unsigned long long)sector_t_variable in the printk.
> > 
> 
> Agree.   The nice thing about the typecast is that you
> can format the output with %06Lx, %9Ld, %Lo or whatever.
> The FMT_SECTOR_T thing forces you to use the chosen formatting.

This approach solves the phys_t printk problem as well.
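
A user-space sketch of the cast-and-print idiom, assuming sector_t is
configured as a 64-bit type.  It uses snprintf and the standard %llu
spelling rather than printk and %Lu, and format_sector is a made-up
helper name:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;  /* assumption: 64-bit sector_t */

/*
 * Format a sector number without caring how wide sector_t is:
 * always widen to unsigned long long and always print with %llu.
 * Returns the number of characters snprintf wanted to write.
 */
int format_sector(char *buf, size_t len, sector_t sector)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "sector %llu", (unsigned long long)sector);
}
```

The same cast works unchanged if the typedef is switched to a 32-bit
type, which is the point of the approach: the format string never has
to track the configuration.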

-- 
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-11  0:07       ` David Mosberger
@ 2002-05-15 22:17         ` Andreas Dilger
  2002-05-16 20:22           ` Daniel Phillips
  0 siblings, 1 reply; 41+ messages in thread
From: Andreas Dilger @ 2002-05-15 22:17 UTC (permalink / raw)
  To: davidm; +Cc: Peter Chubb, Jeremy Andrews, linux-kernel, ext2-devel

On May 10, 2002  17:07 -0700, David Mosberger wrote:
>On Fri, 10 May 2002 17:46:23 -0600, Andreas Dilger <adilger@clusterfs.com> said:
>   Andreas> For 64-bit systems like Alpha, it is relatively easy to use
>   Andreas> 8kB blocks for ext3.  It has been discouraged because such
>   Andreas> a filesystem is non-portable to other (smaller page-sized)
>   Andreas> filesystems.  Maybe this rationale should be re-examined -
>   Andreas> I could probably whip up a configure option for e2fsprogs
>   Andreas> to allow 8kB blocks in a few hours.
> 
> If you do this, please consider allowing a block size up to 64KB.
> The ia64 kernel offers a choice of 4, 8, 16, and 64KB page size.

Well, taking a look at the ext2 code, there is a slight problem when
trying to use block sizes > 8kB.  This is in the group descriptors,
where they only store a 16 bit count of free blocks and inodes for
the group.  Since the maximum number of blocks/inodes is 8*blocksize
(the number of bits that can fit into a single block) you overflow
these fields if you have more than 64k (8*8k) blocks in a group.

Even 8kB blocks would theoretically overflow these fields, but you
can't yet have a group _totally_ empty (there are always two bitmaps
and at least one inode table block), so it would always have less
than 65535 blocks free.  Now I realize that this isn't true of the
inode table in theory, but you normally also have less than the maximum
number of inodes per group - need to check for that.

This could be worked around temporarily by limiting the size of each
group to at most 65535 free blocks/inodes.  The permanent solution is
to probably add an extra byte for each of these two fields to allow up
to 16M blocks/inodes per group, which gives us a max block size of 2MB.

This could be a compat ext2 feature, since at worst if we didn't take
the high byte into account on a block free it could overflow this field
and we wouldn't be able to allocate from this group until more blocks
are freed.  We couldn't underflow because the allocator would stop when
the free block/inode count hit zero for that group, even if there were
really more free blocks available.

So, for now I think I'll stick to a maximum of 8kB blocks, and maybe
we can slip in support for the high byte of the free blocks/inodes
count when Ted adds in support for metagroups.
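
The numbers above can be checked with a small sketch.  It assumes one
bitmap block per group and a free count that is count_bits wide; the
function names are invented for illustration and are not the ext2
field or helper names:

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block tracks 8 * block_size blocks (or inodes). */
uint32_t bitmap_capacity(uint32_t block_size)
{
    return 8u * block_size;
}

/* Can a completely free group of this size be stored in the counter? */
int free_count_fits(uint32_t blocks_in_group, unsigned count_bits)
{
    assert(count_bits < 32);
    return blocks_in_group <= (((uint32_t)1 << count_bits) - 1);
}

/* Largest block size for which the bitmap capacity is at most 2^count_bits. */
uint32_t max_block_size_for(unsigned count_bits)
{
    assert(count_bits < 32);
    return ((uint32_t)1 << count_bits) / 8;
}
```

An 8 kB block gives 65536 blocks per group, one more than a 16 bit count
can hold, and a 24 bit count corresponds to 16M blocks per group, i.e. a
2 MB block size.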

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-15  9:41 Hirotaka Sasaki
@ 2002-05-15 21:49 ` Steve Lord
  0 siblings, 0 replies; 41+ messages in thread
From: Steve Lord @ 2002-05-15 21:49 UTC (permalink / raw)
  To: Hirotaka Sasaki; +Cc: Linux Kernel, taka, minoura, alexr

On Wed, 2002-05-15 at 04:41, Hirotaka Sasaki wrote:
> Hi,
> My name is Hirotaka Sasaki and I work for VA Linux Japan.  
> We've had a need for large disk support as well, and so I've developed
> support for 64-bit block numbers and page cache indices. 
> 
> I'm not subscribed to this list so please CC on any responses.
> 
> All development I've done so far has been tested on 2.4.17 w/XFS
>         - linux-2.4.17
>         - xfs-1.0.2
>         - x86 (p3) architecture
> 
> The main revisions my patch includes:
>         - 64-bit page cache indices (doesn't support 64-bit mmap)

You will need to extend this into the xfs code base, particularly pagebuf.

>         - 64-bit block #'s, sector #'s in the block I/O layer
>         - 64-bit block device file (support for block #'s)
>         - raw and direct I/O support for 64-bit block and sector #'s
>         - 64-bit start_sect/nr_sect support in struct hd_struct
>         - 64-bit blk_size table
>         - 64-bit SCSI device sizes (sd_sizes/sr_sizes)
>         - 64-bit loop device
> 
>   This patch is at:
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-block64-2.4.17.patch
> 
> Other revisions (not necessarily including the kernel):
> 
> In order to create a file system larger than 2TB on XFS, I:
>         - changed ioctl(BLKGETSIZE) to ioctl(BLKGETSIZE64) in mkfs.xfs
>         - in the kernel fixed an error in the handling of ioctl(BLKGETSIZE64)
> 
>   These patches are at:
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-blkgetsize64-2.4.17.patch
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/xfsprogs-1.3.17-blkgetsize64.patch
> 
> In order to display a file system size larger than 16TB using df, I:
>         - added a new system call to the kernel, statfs64
>         - added statfs64 to struct super
>         - modified XFS and NFSv3 to support statfs64
>         - created a new library, statvfs64, to use statfs64 which is
>                   then called by df command
> 
>   These patches are at:
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64-2.4.17.patch
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_xfs-2.4.17.patch
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_nfsd-2.4.17.patch
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_nfs-2.4.17.patch        
>   ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/fileutils-4.1-df_statvfs64.patch
> 
> I ran several tests on XFS by creating a file system and mounting
> it on the loop device.  I noticed that the size of the file system
> is limited to 16TB by XFS_MAX_FILE_OFFSET.  I need to test file systems
> > 16TB so I changed XFS_MAX_FILE_OFFSET to (long long)((1ULL<<63)-1ULL).
> However, XFS internally uses unsigned long's for the page cache indices
> which means everything works great until you mount the file system, but
> after that it all falls apart.

XFS_MAX_FILE_OFFSET on linux was reduced to reflect the maximum
cachable offset in a file. This only affects access to files themselves,
not to metadata. For the metadata end of things you need to look at
_pagebuf_lookup_pages in fs/xfs/pagebuf/page_buf.c. I have not looked
at your patches yet, but abstracting the page index into a special
typedef would be the way to go if you have not already done so.

I am about to totally revamp the XFS I/O path, so you might want to
wait a bit and then pick up the xfs cvs tree from oss.sgi.com.

Steve

-- 

Steve Lord                                      voice: +1-651-683-3511
Principal Engineer, Filesystem Software         email: lord@sgi.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10 23:46     ` Andreas Dilger
  2002-05-11  0:07       ` David Mosberger
  2002-05-11  4:40       ` Peter Chubb
@ 2002-05-15 13:49       ` Pavel Machek
  2 siblings, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2002-05-15 13:49 UTC (permalink / raw)
  To: Peter Chubb, Jeremy Andrews, linux-kernel

Hi!


-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
@ 2002-05-15  9:41 Hirotaka Sasaki
  2002-05-15 21:49 ` Steve Lord
  0 siblings, 1 reply; 41+ messages in thread
From: Hirotaka Sasaki @ 2002-05-15  9:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: taka, minoura, alexr

Hi,
My name is Hirotaka Sasaki and I work for VA Linux Japan.  
We've had a need for large disk support as well, and so I've developed
support for 64-bit block numbers and page cache indices. 

I'm not subscribed to this list so please CC on any responses.

All development I've done so far has been tested on 2.4.17 w/XFS
        - linux-2.4.17
        - xfs-1.0.2
        - x86 (p3) architecture

The main revisions my patch includes:
        - 64-bit page cache indices (doesn't support 64-bit mmap)
        - 64-bit block #'s, sector #'s in the block I/O layer
        - 64-bit block device file (support for block #'s)
        - raw and direct I/O support for 64-bit block and sector #'s
        - 64-bit start_sect/nr_sect support in struct hd_struct
        - 64-bit blk_size table
        - 64-bit SCSI device sizes (sd_sizes/sr_sizes)
        - 64-bit loop device

  This patch is at:
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-block64-2.4.17.patch

Other revisions (not necessarily including the kernel):

In order to create a file system larger than 2TB on XFS, I:
        - changed ioctl(BLKGETSIZE) to ioctl(BLKGETSIZE64) in mkfs.xfs
        - in the kernel fixed an error in the handling of ioctl(BLKGETSIZE64)

  These patches are at:
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-blkgetsize64-2.4.17.patch
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/xfsprogs-1.3.17-blkgetsize64.patch

In order to display a file system size larger than 16TB using df, I:
        - added a new system call to the kernel, statfs64
        - added statfs64 to struct super
        - modified XFS and NFSv3 to support statfs64
        - created a new library, statvfs64, to use statfs64 which is
                  then called by df command

  These patches are at:
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64-2.4.17.patch
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_xfs-2.4.17.patch
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_nfsd-2.4.17.patch
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-statfs64_nfs-2.4.17.patch        
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/fileutils-4.1-df_statvfs64.patch

I ran several tests on XFS by creating a file system and mounting
it on the loop device.  I noticed that the size of the file system
is limited to 16TB by XFS_MAX_FILE_OFFSET.  I need to test file systems
> 16TB so I changed XFS_MAX_FILE_OFFSET to (long long)((1ULL<<63)-1ULL).
However, XFS internally uses unsigned long's for the page cache indices
which means everything works great until you mount the file system, but
after that it all falls apart.

  This patch is at:
  ftp://ftp.valinux.co.jp/pub/people/sasaki/blk64/va-xfs_max_file_offset-2.4.17.patch

Under XFS I've tested:
        - 16TB XFS file system (successfully mounted and accessed)
        - 32TB XFS file system (successfully mounted but access failed
                as outlined above)
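
The 16TB/32TB split is what one would expect if a page cache index is
held in a 32-bit unsigned long.  The sketch below only illustrates that
truncation; it assumes 4KB pages on x86, and the function names are
hypothetical and not taken from XFS or pagebuf:

```c
#include <assert.h>
#include <stdint.h>

/* Page index as a 64-bit quantity: byte offset divided by the page size. */
uint64_t page_index_64(uint64_t byte_offset, unsigned page_shift)
{
    assert(page_shift < 64);
    return byte_offset >> page_shift;
}

/* The same index squeezed into 32 bits, as a 32-bit unsigned long would. */
uint32_t page_index_32(uint64_t byte_offset, unsigned page_shift)
{
    return (uint32_t)page_index_64(byte_offset, page_shift);
}

/* Non-zero if the index is unchanged by the 32-bit truncation. */
int index_survives_32bit(uint64_t byte_offset, unsigned page_shift)
{
    return page_index_64(byte_offset, page_shift) ==
           page_index_32(byte_offset, page_shift);
}
```

With a page shift of 12, every offset below 16TB keeps its index, while
the offset at exactly 16TB wraps around to page index 0.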

Further improvements I plan on making:
        - 64-bit support for LVM (including LVM tools)
        - SCSI device support for 64-bit in the common layer
        - 16-byte SCSI command support

Any help or advice you can offer is greatly appreciated!

# Thanks to Alexander Reeder for translating 
--
Hirotaka Sasaki <sasaki@valinux.co.jp>
Engineering Dept.
VA Linux Systems Japan K.K. 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10 19:12   ` Peter Chubb
  2002-05-10 23:46     ` Andreas Dilger
@ 2002-05-11 18:13     ` Padraig Brady
  1 sibling, 0 replies; 41+ messages in thread
From: Padraig Brady @ 2002-05-11 18:13 UTC (permalink / raw)
  To: Peter Chubb; +Cc: Jeremy Andrews, linux-kernel

I found the following related graph from Mr. Cahalan very informative:
http://www.cs.uml.edu/~acahalan/linux/ext2.gif
I just might get around to updating/expanding it.

Padraig.

Peter Chubb wrote:
>>>>>>"Jeremy" == Jeremy Andrews <jeremy@kerneltrap.org> writes:
>>>>>
> 
> Jeremy> Peter, Out of curiosity, what then does the new filesystem
> Jeremy> limit become, on a 64-bit system?  Will all filesystems
> Jeremy> support your changes?
> 
> This depends on the file system.
> See
> 	 http://www.gelato.unsw.edu.au/~peterc/lfs.html
> (which I'm intending to update next week, after some testing to
> check the new limits with my new code -- I found the 1TB limit in
> the generic code (someone using a signed int instead of unsigned long))
> 
> There are three different limits that apply:
> 
>  --- The physical layout on disc (e.g., ext2 uses 32-bit for block
>      numbers within a file system; thus the max size is
>      (2^32-1)*block_size;  although it's theoretically possible to use
>      larger blocksizes, the current toolchain has a maximum of 4k,
>      thus the largest size of an ext[23] filesystem is ((2^32)-1)*4k
>      bytes --- around 16TB)
> 
>      It's extremely unlikely that you'd want to use a non-journalled
>      file system on such a large partition, so your best bets are
>      reiserfs, jfs or XFS.  jfs and xfs work well on enormous
>      partitions on other platforms; the current version of reiserfs is
>      somewhat limited, but version 4 will allow larger file systems.
> 
> 
>  --- Limitations imposed by the partitioning scheme.
>      As far as I know, only the EFI GUID partitioning scheme uses
>      64-bit block offsets, so under any other scheme you're limited to
>      2^32 or 2^31 blocks per disc; some use the underlying hardware
>      sector size, some use a block size that's a multiple of this.
> 
>  --- The page cache limit (which on a 32-bit system is 16TB; on a 64
>      bit system is 18 EB)
> 
> 
> Jeremy>   Mind if I quote what you say on my webpage?
> 
> Go ahead
> 
> --
> Peter Chubb
> peterc@gelato.unsw.edu.au	http://www.gelato.unsw.edu.au
> -



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10 23:46     ` Andreas Dilger
  2002-05-11  0:07       ` David Mosberger
@ 2002-05-11  4:40       ` Peter Chubb
  2002-05-15 13:49       ` Pavel Machek
  2 siblings, 0 replies; 41+ messages in thread
From: Peter Chubb @ 2002-05-11  4:40 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Peter Chubb, Jeremy Andrews, linux-kernel

>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:

Andreas> On May 11, 2002 05:12 +1000, Peter Chubb wrote:
>> See http://www.gelato.unsw.edu.au/~peterc/lfs.html (which I'm
>> intending to update next week, after some testing to check the new
>> limits with my new code -- I found the 1TB limit in the generic
>> code (someone using a signed int instead of unsigned long))

Andreas> Any chance you could rename this from "LFS" to something else
Andreas> (e.g. LBD for Large Block Device).  LFS == Large File Summit
Andreas> which describes the use/access of > 2GB _files_ on 32-bit
Andreas> systems under Unix.

Will do.

Andreas> Does x86-64 and/or ia64 actually _use_ > 4kB page sizes?  If
Andreas> so, it may be more worthwhile to allow larger block sizes
Andreas> with e2fsprogs.  It may be that the kernel supports >4kB
Andreas> blocks already on systems with larger PAGE_SIZE, I don't know
Andreas> (no way for me to test this).

ia64 uses 16k standard; you can choose up to 64k and get performance
gains.
The main limitation on performance on a modern architecture is the
limited TLB coverage of the large real address space --- large pages
give fewer TLB entries for the same coverage, which leads to better
performance.

>> It's extremely unlikely that you'd want to use a non-journalled
>> file system on such a large partition, so your best bets are
>> reiserfs, jfs or XFS.

Andreas> I find it somewhat ironic that you suggest reiserfs over
Andreas> ext3, when in fact they both currently have the same 16TB
Andreas> filesystem limit.  On your web page, you say the ext[23]
Andreas> limit is 1TB, which it definitely is not (unless there are
Andreas> bugs in the code).  There is currently a 16TB filesystem
Andreas> limit for 4kB blocks, but there are plans towards fixing that
Andreas> also.

I found the limitation --- it was in the block layer, not ext2.
There were bugs in the other FS that let me create files bigger than
the 2TB pagecache limit, even though when I tried to write them bad
things happened.

Updating that web page is on my TODO list for Monday...

>> --- Limitations imposed by the partitioning scheme.  As far as I
>> know, only the EFI GUID partitioning scheme uses 64-bit block
>> offsets, so under any other scheme you're limited to 2^32 or 2^31
>> blocks per disc; some use the underlying hardware sector size, some
>> use a block size that's a multiple of this.

Andreas> LVM does not need to have partitions, and presumably EVMS
Andreas> using Linux or AIX LVM devices doesn't either.

Sure.  Discs and hardware raids don't need partitions either, providing the
size of the media is smaller than the filesystem-layout-imposed size limit. 

The reasons, as far as I'm concerned, for partitioning a disc are:
    -- convenience, to allow controlled sharing of discs
    -- to allow striping of swap space across multiple spindles (have
       a swap partition on each drive)
    -- to keep filesystem size below backup-medium size.

If you want to boot from a drive, it'd better have a volume header
that the firmware (e.g., the BIOS) understands, too.

Peter Chubb
peterc@gelato.unsw.edu.au	http://www.gelato.unsw.edu.au

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10 23:46     ` Andreas Dilger
@ 2002-05-11  0:07       ` David Mosberger
  2002-05-15 22:17         ` Andreas Dilger
  2002-05-11  4:40       ` Peter Chubb
  2002-05-15 13:49       ` Pavel Machek
  2 siblings, 1 reply; 41+ messages in thread
From: David Mosberger @ 2002-05-11  0:07 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Peter Chubb, Jeremy Andrews, linux-kernel

>>>>> On Fri, 10 May 2002 17:46:23 -0600, Andreas Dilger <adilger@clusterfs.com> said:

  Andreas> For 64-bit systems like Alpha, it is relatively easy to use
  Andreas> 8kB blocks for ext3.  It has been discouraged because such
  Andreas> a filesystem is non-portable to other (smaller page-sized)
  Andreas> filesystems.  Maybe this rationale should be re-examined -
  Andreas> I could probably whip up a configure option for e2fsprogs
  Andreas> to allow 8kB blocks in a few hours.

If you do this, please consider allowing a block size up to 64KB.
The ia64 kernel offers a choice of 4, 8, 16, and 64KB page size.

  Andreas> Does x86-64 and/or ia64 actually _use_ > 4kB page sizes?

ia64 linux normally uses > 4KB.  The recommended page size at the
moment is 16KB.  I didn't think 64KB would become realistic for quite
some time, but performance is surprisingly good, even on today's
systems.

	--david

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10 19:12   ` Peter Chubb
@ 2002-05-10 23:46     ` Andreas Dilger
  2002-05-11  0:07       ` David Mosberger
                         ` (2 more replies)
  2002-05-11 18:13     ` Padraig Brady
  1 sibling, 3 replies; 41+ messages in thread
From: Andreas Dilger @ 2002-05-10 23:46 UTC (permalink / raw)
  To: Peter Chubb; +Cc: Jeremy Andrews, linux-kernel

On May 11, 2002  05:12 +1000, Peter Chubb wrote:
> See http://www.gelato.unsw.edu.au/~peterc/lfs.html
> (which I'm intending to update next week, after some testing to
> check the new limits with my new code -- I found the 1TB limit in
> the generic code (someone using a signed int instead of unsigned long))

Any chance you could rename this from "LFS" to something else (e.g. LBD
for Large Block Device).  LFS == Large File Summit which describes the
use/access of > 2GB _files_ on 32-bit systems under Unix.

> There are three different limits that apply:
> 
>  --- The physical layout on disc (e.g., ext2 uses 32-bit for block
>      numbers within a file system; thus the max size is
>      (2^32-1)*block_size;  although it's theoretically possible to use
>      larger blocksizes, the current toolchain has a maximum of 4k,
>      thus the largest size of an ext[23] filesystem is ((2^32)-1)*4k
>      bytes --- around 16TB)

For 64-bit systems like Alpha, it is relatively easy to use 8kB blocks for
ext3.  It has been discouraged because such a filesystem is non-portable
to other (smaller page-sized) filesystems.  Maybe this rationale should
be re-examined - I could probably whip up a configure option for
e2fsprogs to allow 8kB blocks in a few hours.

Does x86-64 and/or ia64 actually _use_ > 4kB page sizes?  If so, it
may be more worthwhile to allow larger block sizes with e2fsprogs.
It may be that the kernel supports >4kB blocks already on systems with
larger PAGE_SIZE, I don't know (no way for me to test this).

>      It's extremely unlikely that you'd want to use a non-journalled
>      file system on such a large partition, so your best bets are
>      reiserfs, jfs or XFS.

I find it somewhat ironic that you suggest reiserfs over ext3, when in
fact they both currently have the same 16TB filesystem limit.  On your
web page, you say the ext[23] limit is 1TB, which it definitely is not
(unless there are bugs in the code).  There is currently a 16TB filesystem
limit for 4kB blocks, but there are plans towards fixing that also.

>  --- Limitations imposed by the partitioning scheme.
>      As far as I know, only the EFI GUID partitioning scheme uses
>      64-bit block offsets, so under any other scheme you're limited to
>      2^32 or 2^31 blocks per disc; some use the underlying hardware
>      sector size, some use a block size that's a multiple of this.

LVM does not need to have partitions, and presumably EVMS using Linux
or AIX LVM devices doesn't either.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
       [not found] ` <20020510084713.43ce396e.jeremy@kerneltrap.org>
@ 2002-05-10 19:12   ` Peter Chubb
  2002-05-10 23:46     ` Andreas Dilger
  2002-05-11 18:13     ` Padraig Brady
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Chubb @ 2002-05-10 19:12 UTC (permalink / raw)
  To: Jeremy Andrews; +Cc: linux-kernel

>>>>> "Jeremy" == Jeremy Andrews <jeremy@kerneltrap.org> writes:

Jeremy> Peter, Out of curiosity, what then does the new filesystem
Jeremy> limit become, on a 64-bit system?  Will all filesystems
Jeremy> support your changes?

This depends on the file system.
See
	 http://www.gelato.unsw.edu.au/~peterc/lfs.html
(which I'm intending to update next week, after some testing to
check the new limits with my new code -- I found the 1TB limit in
the generic code (someone using a signed int instead of unsigned long))

There are three different limits that apply:

 --- The physical layout on disc (e.g., ext2 uses 32-bit for block
     numbers within a file system; thus the max size is
     (2^32-1)*block_size;  although it's theoretically possible to use
     larger blocksizes, the current toolchain has a maximum of 4k,
     thus the largest size of an ext[23] filesystem is ((2^32)-1)*4k
     bytes --- around 16TB)

     It's extremely unlikely that you'd want to use a non-journalled
     file system on such a large partition, so your best bets are
     reiserfs, jfs or XFS.  jfs and xfs work well on enormous
     partitions on other platforms; the current version of reiserfs is
     somewhat limited, but version 4 will allow larger file systems.


 --- Limitations imposed by the partitioning scheme.
     As far as I know, only the EFI GUID partitioning scheme uses
     64-bit block offsets, so under any other scheme you're limited to
     2^32 or 2^31 blocks per disc; some use the underlying hardware
     sector size, some use a block size that's a multiple of this.

 --- The page cache limit (which on a 32-bit system is 16TB; on a
     64-bit system it is 18 EB).
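
The arithmetic behind these limits can be checked with a minimal
sketch. The function names are hypothetical, and the 512-byte sector
and 4k page size are assumptions for illustration, not values taken
from the patch:

```c
#include <stdint.h>

/* Largest ext[23] filesystem: (2^32 - 1) block numbers times the block size. */
static uint64_t ext_max_fs_bytes(uint32_t block_size)
{
	return (uint64_t)UINT32_MAX * block_size;
}

/* Partition table limit: 2^offset_bits units of unit_bytes each. */
static uint64_t partition_max_bytes(unsigned offset_bits, uint32_t unit_bytes)
{
	return ((uint64_t)1 << offset_bits) * unit_bytes;
}

/* Page cache limit with a 32-bit page index: 2^32 pages of page_size bytes. */
static uint64_t page_cache_max_bytes_32bit(uint32_t page_size)
{
	return ((uint64_t)1 << 32) * page_size;
}
```

With 4k blocks the first function gives just under 2^44 bytes (the
"around 16TB" figure), and 2^32 sectors of 512 bytes give the 2TB
limit in the subject line.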


Jeremy>   Mind if I quote what you say on my webpage?

Go ahead

--
Peter Chubb
peterc@gelato.unsw.edu.au	http://www.gelato.unsw.edu.au

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  9:53       ` Peter Chubb
  2002-05-10 10:01         ` Jens Axboe
@ 2002-05-10 11:43         ` Anton Altaparmakov
  1 sibling, 0 replies; 41+ messages in thread
From: Anton Altaparmakov @ 2002-05-10 11:43 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Jens Axboe, Andrew Morton, Peter Chubb, linux-kernel, martin, neilb

At 10:53 10/05/02, Peter Chubb wrote:
> >>>>> "Jens" == Jens Axboe <axboe@suse.de> writes:
>
>Jens> On Fri, May 10 2002, Anton Altaparmakov wrote:
> >> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu
> >> and typecast (unsigned long long)sector_t_variable in the printk.
>
>Jens> I like that better too, it's what I did in the block layer too.
>
>That's exactly what I did in the patch....
>
>Except most places I used u64 not unsigned long long (it's the same
>thing on all architectures, and much shorter to type).

I have been told that this is wrong (it was on this list but I can't 
remember who said it - it was one of the prominent kernel hackers... (-;).

u64 is not necessarily unsigned long long type and this causes compilation
problems on some architectures (apparently).

Anton


>Peter C
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  9:53       ` Peter Chubb
@ 2002-05-10 10:01         ` Jens Axboe
  2002-05-10 11:43         ` Anton Altaparmakov
  1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2002-05-10 10:01 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Anton Altaparmakov, Andrew Morton, linux-kernel, martin, neilb

On Fri, May 10 2002, Peter Chubb wrote:
> >>>>> "Jens" == Jens Axboe <axboe@suse.de> writes:
> 
> Jens> On Fri, May 10 2002, Anton Altaparmakov wrote:
> >> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu
> >> and typecast (unsigned long long)sector_t_variable in the printk.
> 
> Jens> I like that better too, it's what I did in the block layer too.
> 
> That's exactly what I did in the patch....

Excellent

> Except most places I used u64 not unsigned long long (it's the same
> thing on all architectures, and much shorter to type).

Patch looks fine to me. I was hoping someone would do the grunt
conversion work when I introduced sector_t, thanks! :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  9:05     ` Jens Axboe
@ 2002-05-10  9:53       ` Peter Chubb
  2002-05-10 10:01         ` Jens Axboe
  2002-05-10 11:43         ` Anton Altaparmakov
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Chubb @ 2002-05-10  9:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Anton Altaparmakov, Andrew Morton, Peter Chubb, linux-kernel,
	martin, neilb

>>>>> "Jens" == Jens Axboe <axboe@suse.de> writes:

Jens> On Fri, May 10 2002, Anton Altaparmakov wrote:
>> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu
>> and typecast (unsigned long long)sector_t_variable in the printk.

Jens> I like that better too, it's what I did in the block layer too.

That's exactly what I did in the patch....

Except most places I used u64 not unsigned long long (it's the same
thing on all architectures, and much shorter to type).

Peter C

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  8:43   ` Anton Altaparmakov
  2002-05-10  9:04     ` Andrew Morton
@ 2002-05-10  9:05     ` Jens Axboe
  2002-05-10  9:53       ` Peter Chubb
  1 sibling, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2002-05-10  9:05 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andrew Morton, Peter Chubb, linux-kernel, martin, neilb

On Fri, May 10 2002, Anton Altaparmakov wrote:
> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and 
> typecast (unsigned long long)sector_t_variable in the printk.

I like that better too, it's what I did in the block layer too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  8:43   ` Anton Altaparmakov
@ 2002-05-10  9:04     ` Andrew Morton
  2002-05-16 19:08       ` Daniel Phillips
  2002-05-10  9:05     ` Jens Axboe
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2002-05-10  9:04 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Peter Chubb, linux-kernel, martin, neilb

Anton Altaparmakov wrote:
> 
> ...
> >This code:
> >
> >         printk("%lu%s", some_sector, some_string);
> >
> >will work fine with 32-bit sector_t.  But with 64-bit sector_t it
> >will generate a warning at compile-time and an oops at runtime.
> >
> >The same problem applies to dma_addr_t.  Jeff, davem and I kicked
> >that around a while back and ended up deciding that although there
> >are a number of high-tech solutions, the dumb one was best:
> 
> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and
> typecast (unsigned long long)sector_t_variable in the printk.
> 

Agree.   The nice thing about the typecast is that you
can format the output with %06Lx, %9Ld, %Lo or whatever.
The FMT_SECTOR_T thing forces you to use the chosen formatting.
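
A minimal user-space sketch of the cast-and-print approach, assuming
snprintf() as a stand-in for printk() and a hypothetical
blk_sector_t typedef in place of sector_t. It uses %llu, the
standard C spelling for unsigned long long:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 64-bit sector_t. */
typedef uint64_t blk_sector_t;

/*
 * Format an I/O error message. The cast makes the argument type match
 * the format on every architecture, whatever the typedef's underlying type.
 * Returns the snprintf() result: the length of the full message.
 */
static int format_block_error(char *buf, size_t len, const char *dev,
			      blk_sector_t block)
{
	return snprintf(buf, len,
			"Buffer I/O error on device %s, logical block %llu",
			dev, (unsigned long long)block);
}
```

Because the caller writes the format, width and base stay free to
choose, e.g. "%06llx" with the same cast.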

-

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  4:05 ` Andrew Morton
@ 2002-05-10  8:43   ` Anton Altaparmakov
  2002-05-10  9:04     ` Andrew Morton
  2002-05-10  9:05     ` Jens Axboe
  0 siblings, 2 replies; 41+ messages in thread
From: Anton Altaparmakov @ 2002-05-10  8:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Chubb, linux-kernel, martin, neilb

At 05:05 10/05/02, Andrew Morton wrote:
>Peter Chubb wrote:
> >
> > Hi,
> >         At present, linux is limited to 2TB filesystems even on 64-bit
> > systems, because there are various places where the block offset on
> > disc are assigned to unsigned or int 32-bit variables.
> >
> > There's a type, sector_t, that's meant to hold offsets in sectors and
> > blocks.  It's not used consistently (yet).
> >
> > The patch at
> >     http://www.gelato.unsw.edu.au/patches/2.5.14-largefile-patch
> >
> > ...
> >
> > As this touches lots of places -- the generic block layer (Andrew?)
> > the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
> > I've CCd a few people directly.
>
>That would be more Jens and aviro than I.
>
>My vote would be: just merge the sucker while it still (almost)
>applies. 2TB is a showstopper for some people in 2.4 today.  Obviously
>2.6 will need 64-bit block numbers.
>
>The next obstacle will be page cache indices into the blockdev mapping.
>That's either an 8TB or 16TB limit, depending on signedness correctness.
>
>One minor point - it is currently not possible to print sector_t's.
>This code:
>
>         printk("%lu%s", some_sector, some_string);
>
>will work fine with 32-bit sector_t.  But with 64-bit sector_t it
>will generate a warning at compile-time and an oops at runtime.
>
>The same problem applies to dma_addr_t.  Jeff, davem and I kicked
>that around a while back and ended up deciding that although there
>are a number of high-tech solutions, the dumb one was best:

Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and 
typecast (unsigned long long)sector_t_variable in the printk.

May be ugly, but isn't it correct that you actually need the above typecast 
on some architectures where %Lu == unsigned long long != u64?

Anton

>--- 2.5.14/include/linux/types.h~sector_t-printing      Thu May  9 
>17:08:13 2002
>+++ 2.5.14-akpm/include/linux/types.h   Thu May  9 17:08:13 2002
>@@ -120,8 +120,10 @@ typedef            __s64           int64_t;
>
>  #ifdef BLK_64BIT_SECTOR
>  typedef u64 sector_t;
>+#define FMT_SECTOR_T   "%Lu"
>  #else
>  typedef unsigned long sector_t;
>+#define FMT_SECTOR_T   "%lu"
>  #endif
>
>  #endif /* __KERNEL_STRICT_NAMES */
>--- 2.5.14/fs/buffer.c~sector_t-printing        Thu May  9 17:08:13 2002
>+++ 2.5.14-akpm/fs/buffer.c     Thu May  9 17:09:35 2002
>@@ -179,7 +179,8 @@ __clear_page_buffers(struct page *page)
>
>  static void buffer_io_error(struct buffer_head *bh)
>  {
>-       printk(KERN_ERR "Buffer I/O error on device %s, logical block %ld\n",
>+       printk(KERN_ERR "Buffer I/O error on device %s,"
>+                       " logical block " FMT_SECTOR_T "\n",
>                         bdevname(bh->b_bdev), bh->b_blocknr);
>  }

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  3:36 Peter Chubb
  2002-05-10  4:05 ` Andrew Morton
@ 2002-05-10  4:51 ` Martin Dalecki
       [not found] ` <20020510084713.43ce396e.jeremy@kerneltrap.org>
  2 siblings, 0 replies; 41+ messages in thread
From: Martin Dalecki @ 2002-05-10  4:51 UTC (permalink / raw)
  To: Peter Chubb; +Cc: linux-kernel

Peter Chubb wrote:
> Hi,
> 	At present, linux is limited to 2TB filesystems even on 64-bit
> systems, because there are various places where the block offset on
> disc are assigned to unsigned or int 32-bit variables.
> 
> There's a type, sector_t, that's meant to hold offsets in sectors and
> blocks.  It's not used consistently (yet).
> 

The IDE part of it appears to be sane. I will take it.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
  2002-05-10  3:36 Peter Chubb
@ 2002-05-10  4:05 ` Andrew Morton
  2002-05-10  8:43   ` Anton Altaparmakov
  2002-05-10  4:51 ` Martin Dalecki
       [not found] ` <20020510084713.43ce396e.jeremy@kerneltrap.org>
  2 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2002-05-10  4:05 UTC (permalink / raw)
  To: Peter Chubb; +Cc: linux-kernel, martin, neilb

Peter Chubb wrote:
> 
> Hi,
>         At present, linux is limited to 2TB filesystems even on 64-bit
> systems, because there are various places where the block offset on
> disc are assigned to unsigned or int 32-bit variables.
> 
> There's a type, sector_t, that's meant to hold offsets in sectors and
> blocks.  It's not used consistently (yet).
> 
> The patch at
>     http://www.gelato.unsw.edu.au/patches/2.5.14-largefile-patch
> 
> ...
> 
> As this touches lots of places -- the generic block layer (Andrew?)
> the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
> I've CCd a few people directly.

That would be more Jens and aviro than I.

My vote would be: just merge the sucker while it still (almost) 
applies. 2TB is a showstopper for some people in 2.4 today.  Obviously
2.6 will need 64-bit block numbers.

The next obstacle will be page cache indices into the blockdev mapping.
That's either an 8TB or 16TB limit, depending on signedness correctness.

One minor point - it is currently not possible to print sector_t's.
This code:

	printk("%lu%s", some_sector, some_string);

will work fine with 32-bit sector_t.  But with 64-bit sector_t it
will generate a warning at compile-time and an oops at runtime.

The same problem applies to dma_addr_t.  Jeff, davem and I kicked
that around a while back and ended up deciding that although there
are a number of high-tech solutions, the dumb one was best:


--- 2.5.14/include/linux/types.h~sector_t-printing	Thu May  9 17:08:13 2002
+++ 2.5.14-akpm/include/linux/types.h	Thu May  9 17:08:13 2002
@@ -120,8 +120,10 @@ typedef		__s64		int64_t;
 
 #ifdef BLK_64BIT_SECTOR
 typedef u64 sector_t;
+#define FMT_SECTOR_T	"%Lu"
 #else
 typedef unsigned long sector_t;
+#define FMT_SECTOR_T	"%lu"
 #endif
 
 #endif /* __KERNEL_STRICT_NAMES */
--- 2.5.14/fs/buffer.c~sector_t-printing	Thu May  9 17:08:13 2002
+++ 2.5.14-akpm/fs/buffer.c	Thu May  9 17:09:35 2002
@@ -179,7 +179,8 @@ __clear_page_buffers(struct page *page)
 
 static void buffer_io_error(struct buffer_head *bh)
 {
-	printk(KERN_ERR "Buffer I/O error on device %s, logical block %ld\n",
+	printk(KERN_ERR "Buffer I/O error on device %s,"
+			" logical block " FMT_SECTOR_T "\n",
 			bdevname(bh->b_bdev), bh->b_blocknr);
 }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] remove 2TB block device limit
@ 2002-05-10  3:53 Neil Brown
  0 siblings, 0 replies; 41+ messages in thread
From: Neil Brown @ 2002-05-10  3:53 UTC (permalink / raw)
  To: Peter Chubb; +Cc: linux-kernel, akpm, martin

On Friday May 10, peter@chubb.wattle.id.au wrote:
> 
> Hi,
> 	At present, linux is limited to 2TB filesystems even on 64-bit
> systems, because there are various places where the block offset on
> disc are assigned to unsigned or int 32-bit variables.
> 
> There's a type, sector_t, that's meant to hold offsets in sectors and
> blocks.  It's not used consistently (yet).
> 
> The patch at
>     http://www.gelato.unsw.edu.au/patches/2.5.14-largefile-patch

> 
> As this touches lots of places -- the generic block layer (Andrew?)
> the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
> I've CCd a few people directly.
> 

Thanks.
MD part looks sane to me. However I would rather the
 
+#ifdef CONFIG_LFS
+#include <asm/div64.h>
+#else
+#undef do_div
+#define do_div(n, b)({ int _res; _res = (n) % (b); (n) /= (b); _res;})
+#endif
+

part went in linux/raid/md_k.h and defined "sector_div" (or similar)
as either do_div or ({ int _res; _res = (n) % (b); (n) /= (b); _res;})
as appropriate.
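
A minimal sketch of what such a sector_div could look like, written
for user space. The sector_t typedef here is a stand-in, and the
plain C operators replace do_div(), which is only available inside
the kernel:

```c
#include <stdint.h>

/* Stand-in for the kernel typedef; CONFIG_LFS is the option named in the patch. */
#ifdef CONFIG_LFS
typedef uint64_t sector_t;
#else
typedef unsigned long sector_t;
#endif

/*
 * sector_div(n, b): divide n by b in place and evaluate to the remainder,
 * the same contract as do_div(). In the kernel the CONFIG_LFS branch would
 * expand to do_div(n, b) from <asm/div64.h>; plain C operators are used
 * here so the sketch builds in user space.
 * Relies on the GNU statement-expression extension (gcc, clang).
 */
#define sector_div(n, b) ({				\
	uint32_t _rem = (uint32_t)((n) % (b));		\
	(n) /= (b);					\
	_rem;						\
})
```

Callers then use one name regardless of the width of sector_t, and
the generic do_div is left untouched.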

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] remove 2TB block device limit
@ 2002-05-10  3:36 Peter Chubb
  2002-05-10  4:05 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Peter Chubb @ 2002-05-10  3:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, martin, neilb


Hi,
	At present, Linux is limited to 2TB filesystems even on 64-bit
systems, because there are various places where block offsets on
disc are assigned to 32-bit unsigned or int variables.

There's a type, sector_t, that's meant to hold offsets in sectors and
blocks.  It's not used consistently (yet).

The patch at
    http://www.gelato.unsw.edu.au/patches/2.5.14-largefile-patch

(also available from bk://gelato.unsw.edu.au:2023/ for those using
bitkeeper)
has the following changes to address the problem:

	bmap() changes from int bmap(struct address_space *, long)
	to		    sector_t bmap(struct address_space *,
				     sector_t)

	The partitioning code takes sector_t everywhere that makes
	sense (to allow efi, for example, to create partitions on enormous
	discs).

	The block_sizes[] array is sector_t not int.

	get_nr_sectors() and get_start_sect() etc., now return a
	sector_t

	__bread() takes a sector_t as its second argument, and struct
	buffer_head contains a sector_t blocknumber field.

	struct scsi_disk and struct gendisk have a sector_t field for
	capacity.

	The scsi disc code now uses 16-byte commands if they're
	needed.

	ioctl(..BLKGETSIZE..) now fails with EFBIG if the size won't fit
	in a long (at least for devices using the generic version).
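
To illustrate the 16-byte command item: a minimal sketch of choosing
between READ(10) and READ(16) by LBA and transfer length. This is
not the code from the patch; build_read_cdb is a hypothetical
helper, and only the opcodes and field offsets follow the SCSI
block command layout:

```c
#include <stdint.h>
#include <string.h>

enum { READ_10 = 0x28, READ_16 = 0x88 };

/*
 * Fill cdb (16 bytes) with a READ command for nblocks blocks starting at
 * lba and return the CDB length used. READ(10) carries a 32-bit LBA and a
 * 16-bit transfer length; anything larger needs READ(16).
 */
static int build_read_cdb(uint8_t cdb[16], uint64_t lba, uint32_t nblocks)
{
	memset(cdb, 0, 16);
	if (lba > 0xFFFFFFFFULL || nblocks > 0xFFFF) {
		cdb[0] = READ_16;
		for (int i = 0; i < 8; i++)	/* bytes 2..9: big-endian LBA */
			cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
		for (int i = 0; i < 4; i++)	/* bytes 10..13: transfer length */
			cdb[10 + i] = (uint8_t)(nblocks >> (24 - 8 * i));
		return 16;
	}
	cdb[0] = READ_10;
	for (int i = 0; i < 4; i++)		/* bytes 2..5: big-endian LBA */
		cdb[2 + i] = (uint8_t)(lba >> (24 - 8 * i));
	cdb[7] = (uint8_t)(nblocks >> 8);	/* bytes 7..8: transfer length */
	cdb[8] = (uint8_t)nblocks;
	return 10;
}
```
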

Plus a smattering of casts to avoid compilation warnings (mostly so
that printk() works whether sector_t is 64 or 32 bits) and a new
CONFIG_LFS option to turn on 64-bit sector_t on 32-bit platforms.

On an old Pentium I now have a 15TB file mounted as JFS on the loop
device -- and it seems to work for almost everything.  There are a few
user-mode programs that'll have to be fixed (notably parted, mkfs.???
etc) to cope with the new BLKGETSIZE failure (they should use
alternate mechanisms, e.g., BLKGETSIZE64, or just seek to the end of
the partition and look at the offset).
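
A minimal sketch of that fallback for a user-mode tool, assuming the
64-bit size ioctl is BLKGETSIZE64 from <linux/fs.h>.
device_size_bytes is a hypothetical helper name, not taken from
parted or mkfs:

```c
#define _FILE_OFFSET_BITS 64	/* ask for a 64-bit off_t on 32-bit platforms */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKGETSIZE64 */
#endif

/*
 * Store the size of the device or file behind fd, in bytes, in *size_out.
 * Tries the 64-bit ioctl first, then falls back to seeking to the end.
 * Returns 0 on success, -1 if neither method works.
 * Note: the fallback moves the file offset of fd to the end.
 */
static int device_size_bytes(int fd, uint64_t *size_out)
{
#if defined(__linux__) && defined(BLKGETSIZE64)
	uint64_t bytes = 0;

	if (ioctl(fd, BLKGETSIZE64, &bytes) == 0) {
		*size_out = bytes;
		return 0;
	}
#endif
	off_t end = lseek(fd, 0, SEEK_END);

	if (end == (off_t)-1)
		return -1;
	*size_out = (uint64_t)end;
	return 0;
}
```

Neither path goes through a long, so the result does not overflow
on devices larger than 2TB.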

As this touches lots of places -- the generic block layer (Andrew?)
the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
I've CCd a few people directly.

--
Peter Chubb
Gelato@UNSW http://www.gelato.unsw.edu.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2002-05-17 20:27 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1060250300@toto.iv>
2002-05-13 10:28 ` [PATCH] remove 2TB block device limit Peter Chubb
2002-05-13 12:13   ` Christoph Hellwig
2002-05-14  0:30     ` Peter Chubb
2002-05-14  1:36       ` Anton Altaparmakov
2002-05-16 20:32         ` Daniel Phillips
2002-05-14  2:09       ` Andrew Morton
2002-05-14  2:58         ` Peter Chubb
2002-05-14  7:22           ` Christoph Hellwig
2002-05-14  7:21         ` Christoph Hellwig
     [not found] <581856778@toto.iv>
2002-05-17  0:04 ` Peter Chubb
2002-05-17  0:18   ` Daniel Phillips
2002-05-17 13:32     ` Jesse Pollard
2002-05-17 18:02       ` Daniel Phillips
2002-05-17 18:26         ` Jesse Pollard
2002-05-17 18:36       ` Andreas Dilger
2002-05-17 19:52       ` Daniel Phillips
2002-05-17 20:25         ` Andrew Morton
2002-05-17 15:26     ` Jason L Tibbitts III
2002-05-15  9:41 Hirotaka Sasaki
2002-05-15 21:49 ` Steve Lord
  -- strict thread matches above, loose matches on Subject: below --
2002-05-10  3:53 Neil Brown
2002-05-10  3:36 Peter Chubb
2002-05-10  4:05 ` Andrew Morton
2002-05-10  8:43   ` Anton Altaparmakov
2002-05-10  9:04     ` Andrew Morton
2002-05-16 19:08       ` Daniel Phillips
2002-05-10  9:05     ` Jens Axboe
2002-05-10  9:53       ` Peter Chubb
2002-05-10 10:01         ` Jens Axboe
2002-05-10 11:43         ` Anton Altaparmakov
2002-05-10  4:51 ` Martin Dalecki
     [not found] ` <20020510084713.43ce396e.jeremy@kerneltrap.org>
2002-05-10 19:12   ` Peter Chubb
2002-05-10 23:46     ` Andreas Dilger
2002-05-11  0:07       ` David Mosberger
2002-05-15 22:17         ` Andreas Dilger
2002-05-16 20:22           ` Daniel Phillips
2002-05-16 22:54             ` Andreas Dilger
2002-05-17  1:17               ` Daniel Phillips
2002-05-11  4:40       ` Peter Chubb
2002-05-15 13:49       ` Pavel Machek
2002-05-11 18:13     ` Padraig Brady
