From: Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>
To: jaharkes@cs.cmu.edu, Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: 64-bit block sizes on 32-bit systems
Date: Tue, 27 Mar 2001 16:23:32 -0600 (CST)
Message-ID: <200103272223.QAA37189@tomcat.admin.navo.hpc.mil>
Jan Harkes <jaharkes@cs.cmu.edu>:
>
> On Tue, Mar 27, 2001 at 01:57:42PM -0600, Jesse Pollard wrote:
> > > Using similar numbers as presented. If we are working our way through
> > > every single block in a Petabyte filesystem, and the blocksize is 512
> > > bytes. Then the 1us in extra CPU cycles because of 64-bit operations
> > > would add, according to my back of the envelope calculation, 2199023
> > > seconds of CPU time, a bit more than 25 days.
> >
> > Ummm... I don't think it adds that much. You seem to be leaving out the
> > overlap disk/IO and computation for read-ahead. This should eliminate the
> > majority of the delay effect.
>
> 1024 TB should be around 2*10^12 512-byte blocks, divide by 10^6 (1us)
> of "assumed" overhead per block operation is 2*10^6 seconds, no I
> believe I'm pretty close there. I am considering everything being
> "available in the cache", i.e. no waiting for disk access.

That would be true for small files (< 5GB). I have to deal with files that
may be 20-100 GB. Except on the largest systems (200GB of main memory),
the data will be in the cache only ~50% of the time (assuming only one
user...).
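
As a rough check of the figures quoted above, the arithmetic can be sketched
in a few lines of Python. The names below (`block_count`, `overhead_seconds`)
are made up for illustration, 1 PB is taken as 1024 TB = 2**50 bytes, and the
1us per block is the "assumed" overhead from the quoted estimate, not a
measurement. It also assumes every block is already in the cache, which is
the point being disputed here.

```python
# Back-of-the-envelope check; all constants are assumptions from the thread.
PETABYTE = 2**50            # bytes, taking 1 PB as 1024 TB
BLOCK_SIZE = 512            # bytes per block
OVERHEAD_US_PER_BLOCK = 1   # assumed extra cost of 64-bit math, in microseconds


def block_count(fs_bytes, block_size):
    """Number of whole blocks in a filesystem of fs_bytes bytes."""
    return fs_bytes // block_size


def overhead_seconds(fs_bytes, block_size, us_per_block):
    """Total extra CPU time, in seconds, if every block is touched once."""
    return block_count(fs_bytes, block_size) * us_per_block / 1_000_000


blocks = block_count(PETABYTE, BLOCK_SIZE)
seconds = overhead_seconds(PETABYTE, BLOCK_SIZE, OVERHEAD_US_PER_BLOCK)
days = seconds / 86_400
print(f"{blocks} blocks, {seconds:.0f} s, {days:.1f} days")
```

Under these assumptions the total is about 2.2 million seconds, a bit more
than 25 days, which agrees with the quoted estimate. Where disk I/O overlaps
with computation through read-ahead, the visible cost would presumably be
lower.
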
> > > Seriously, there is a lot more that needs to be done than introducing a
> > > 64-bit blocknumber. Effectively 512 byte blocks are far too small for
> > > that kind of data, and going to pagesize blocks (and increasing pagesize
> > > to 64KB or 2MB at the same time) is a solution that is far more likely
> > > to give good results since it reduces both the total number of
> > > 'blocks' on the device as well as reducing the total amount of calls
> > > throughout kernel space instead of increasing the cost per call.
> >
> > Talk about adding overhead... How long do you think it takes to read a
> > 2MB block (not to mention the time to update that page..) The additional
> > contention on the fiberchannel I/O alone might kill it if the filesystem
> > is busy.
>
> The time to update the pagetables is identical to the time to update a
> 4KB page when the OS is using a 2MB pagesize. Of course it will take more
> time to load the data into the page, however it should be a consecutive
> stretch of data on disk, which should give a more efficient transfer
> than small blocks scattered around the disk.

You assume the file is accessed sequentially. The weather models don't do
that. They do have some locality, but only in a 3D sense. When you include
time, it becomes closer to a random disk block reference once everything has
to be linearized.
>
> > Granted, 512 bytes could be considered too small for some things, but
> > once you pass 32K you start adding a lot of rotational delay problems.
> > I've used file systems with 256K blocks - they are slow when compared
> > to the throughput using 32K. I wasn't the one running the benchmarks,
> > but with a MaxStrat 400GB raid with 256K sized data transfer was much
> > slower (around 3 times slower) than 32K. (The target application was
> > a GIS server using Oracle).
>
> But your subsystem (the disk) was probably still using 512 byte blocks,
> possibly scattered. And the OS was still using 4KB pages, it takes more
> time to reclaim and gather 64 pages per IO operation than one, that's
> why I'm saying that the pagesize needs to scale along with the blocksize.

It wasn't - the "disks" were composed of groups of 5 drives in a RAID, striped
for speed and spread across 5 SCSI III controllers. Each attached RAID had a
16MB internal cache. I think the controllers were using an entire sector
read (32K).

> The application might have been assuming a small block size as well, and
> the OS was told to do several read/modify/write cycles, perhaps even 512
> times as much as necessary.

There was some of that, but not much. Oracle (as I recall) allows for the
specification of transfer size.

This also brings up the problem of small files. Allocating 2MB per file
would waste quite a bit of disk space (assuming 5 - 10 million files
with only 15% having 25GB or more).

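
To put a rough number on the small-file cost, here is a minimal Python
sketch. It assumes each file wastes about half a block in its last,
partially filled block (a common rule of thumb, not something measured on
these filesystems). The function names are hypothetical.

```python
MIB = 1024 * 1024


def tail_slack(file_size, block_size):
    """Bytes left unused in the last block of a file of file_size bytes."""
    return -file_size % block_size


def expected_slack(n_files, block_size):
    """Total slack if each file wastes half a block on average (assumption)."""
    return n_files * block_size // 2


# File counts from the message; block sizes chosen for comparison.
for n_files in (5_000_000, 10_000_000):
    for block_size in (4 * 1024, 32 * 1024, 2 * MIB):
        slack_gib = expected_slack(n_files, block_size) / 2**30
        print(f"{n_files} files, {block_size} B blocks: ~{slack_gib:.0f} GiB slack")
```

With 2MB blocks and 10 million files this rule of thumb gives on the order
of 10 TB of slack, against about 20 GB with 4K blocks. A single 512-byte
file in a 2MB block leaves all but 512 bytes of the block unused.
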
> I'm not saying that the current system will perform well when working
> with large blocks, but compared to increasing the size of block_t, a
> larger blocksize has more potential to give improvements in the long
> term without adding an unrecoverable performance hit.

Not when the filesystem is required for general use. It only makes it
simpler to actually have a large filesystem. It doesn't help when it
must be used.

Now you are saying that the throughput WILL go down, but only if you use
large block sizes.

I can go along with making block sizes up to 8K. Even 32K for special
circumstances (even 64K for dedicated use). But not larger. NFS overhead on
file I/O becomes far too high (...worst example now is having to read
a 2MB block to update 512 bytes, then write it back... :-)
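
The read-modify-write example above can be put in numbers with a small
Python sketch. It assumes the update starts at a block boundary and that
whole blocks have to be transferred in each direction. `rmw_traffic` is a
hypothetical helper, not an NFS or kernel interface.

```python
def rmw_traffic(block_size, update_size):
    """Return (bytes_read, bytes_written, amplification) for one update.

    Assumes the update starts at a block boundary, so it touches
    ceil(update_size / block_size) whole blocks.
    """
    blocks = -(-update_size // block_size)   # ceiling division
    moved = blocks * block_size              # bytes transferred per direction
    return moved, moved, moved / update_size


# Block sizes mentioned in the message, each with a 512-byte update.
for block_size in (8 * 1024, 32 * 1024, 64 * 1024, 2 * 1024 * 1024):
    read, written, amp = rmw_traffic(block_size, 512)
    print(f"{block_size:>8} B block: read {read}, write {written}, "
          f"{amp:.0f}x per direction")
```

For a 512-byte update, a 2MB block means moving 4096 times the modified
data in each direction, against 16 times for an 8K block.
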
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil
Any opinions expressed are solely my own.