linux-kernel.vger.kernel.org archive mirror
* Re: Implementing NVMHCI...
       [not found] <20090412091228.GA29937@elte.hu>
@ 2009-04-12 15:14 ` Szabolcs Szakacsits
  2009-04-12 15:20   ` Alan Cox
  2009-04-12 15:41   ` Linus Torvalds
  0 siblings, 2 replies; 44+ messages in thread
From: Szabolcs Szakacsits @ 2009-04-12 15:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven


Linus Torvalds wrote:

> And people tend to really dislike hardware that forces a particular 
> filesystem on them. Guess how big the user base is going to be if you 
> cannot format the device as NTFS, for example? Hint: if a piece of 
> hardware only works well with special filesystems, that piece of hardware 
> won't be a big seller.
> 
> Modern technology needs big volume to become cheap and relevant.
> 
> And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let 
> me doubt that.

I have not heard of NTFS using >4kB sectors yet, but technically 
it should work.

The atomic building units (sector size, block size, etc) of NTFS are 
entirely parametric. The maximum values could be bigger than the 
currently "configured" maximum limits. 

At present the limits are set in the BIOS Parameter Block in the NTFS
Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for 
"Sectors Per Block". So >4kB sector size should work since 1993.

64kB+ sector size could be possible by bootstrapping NTFS drivers 
in a different way. 

	Szaka

--
NTFS-3G: http://ntfs-3g.org


* Re: Implementing NVMHCI...
  2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits
@ 2009-04-12 15:20   ` Alan Cox
  2009-04-12 16:15     ` Avi Kivity
  2009-04-12 15:41   ` Linus Torvalds
  1 sibling, 1 reply; 44+ messages in thread
From: Alan Cox @ 2009-04-12 15:20 UTC (permalink / raw)
  To: Szabolcs Szakacsits
  Cc: Linus Torvalds, Grant Grundler, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven

> The atomic building units (sector size, block size, etc) of NTFS are 
> entirely parametric. The maximum values could be bigger than the 
> currently "configured" maximum limits. 

That isn't what bites you - you can run 8K-32K ext2 file systems but if
your physical page size is smaller than the fs page size you have a
problem.

The question is whether the NT VM can cope rather than the fs.
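
(A concrete illustration, from memory so treat the exact behaviour as
approximate: "mke2fs -b 8192 /dev/sdXn" is accepted - with a warning that
the blocksize is not usable on most systems - but mounting the result
only works on architectures whose page size is >= 8K; on 4K-page x86 the
mount is refused, precisely because of the page size vs. fs block size
problem above.)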

Alan


* Re: Implementing NVMHCI...
  2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits
  2009-04-12 15:20   ` Alan Cox
@ 2009-04-12 15:41   ` Linus Torvalds
  2009-04-12 17:02     ` Robert Hancock
                       ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-12 15:41 UTC (permalink / raw)
  To: Szabolcs Szakacsits
  Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven



On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
> 
> I have not heard of NTFS using >4kB sectors yet, but technically 
> it should work.
> 
> The atomic building units (sector size, block size, etc) of NTFS are 
> entirely parametric. The maximum values could be bigger than the 
> currently "configured" maximum limits. 

It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't 
already).

That's not the problem. The "filesystem layout" part is just a parameter.

The problem is then trying to actually access such a filesystem, in 
particular trying to write to it, or trying to mmap() small chunks of it. 
The FS layout is the trivial part.

> At present the limits are set in the BIOS Parameter Block in the NTFS
> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for 
> "Sectors Per Block". So >4kB sector size should work since 1993.
> 
> 64kB+ sector size could be possible by bootstrapping NTFS drivers 
> in a different way. 

Try it. And I don't mean "try to create that kind of filesystem". Try to 
_use_ it. Does Windows actually support using it, or is it just a matter 
of "the filesystem layout is _specified_ for up to 64kB block sizes"?

And I really don't know. Maybe Windows does support it. I'm just very 
suspicious. I think there's a damn good reason why NTFS supports larger 
block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!

Because it really is a hard problem. It's really pretty nasty to have your 
cache blocking be smaller than the actual filesystem blocksize (the other 
way is much easier, although it's certainly not pleasant either - Linux 
supports it because we _have_ to, but if the sector-size of hardware had 
traditionally been 4kB, I'd certainly also argue against adding complexity 
just to make it smaller, the same way I argue against making it much 
larger).

And don't get me wrong - we could (fairly) trivially make the 
PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a 
per-mapping thing, so that you could have some filesystems with that 
bigger sector size and some with smaller ones. I think Andrea had patches 
that did a fair chunk of it, and that _almost_ worked.

But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would 
absolutely blow chunks. It would be disgustingly horrible. Putting the 
kernel source tree on such a filesystem would waste about 75% of all 
memory (the median size of a source file is just about 4kB), so your page 
cache would be effectively cut in a quarter for a lot of real loads.

And to fix up _that_, you'd need to now do things like sub-page 
allocations, and now your page-cache size isn't even fixed per filesystem, 
it would be per-file, and the filesystem (and the drivers!) would have to 
handle the cases of getting those 4kB partial pages (and do r-m-w IO after 
all if your hardware sector size is >4kB).

IOW, there are simple things we can do - but they would SUCK. And there 
are really complicated things we could do - and they would _still_ SUCK, 
plus now I pretty much guarantee that your system would also be a lot less 
stable. 

It really isn't worth it. It's much better for everybody to just be aware 
of the incredible level of pure suckage of a general-purpose disk that has 
hardware sectors >4kB. Just educate people that it's not good. Avoid the 
whole insane suckage early, rather than be disappointed in hardware that 
is total and utter CRAP and just causes untold problems.

Now, for specialty uses, things are different. CD-ROM's have had 2kB 
sector sizes for a long time, and the reason it was never as big of a 
problem isn't that they are still smaller than 4kB - it's that they are 
read-only, and use special filesystems. And people _know_ they are 
special. Yes, even when you write to them, it's a very special op. You'd 
never try to put NTFS on a CD-ROM, and everybody knows it's not a disk 
replacement.

In _those_ kinds of situations, a 64kB block isn't much of a problem. We 
can do read-only media (where "read-only" doesn't have to be absolute: the 
important part is that writing is special), and never have problems. 
That's easy. Almost all the problems with block-size go away if you think 
reading is 99.9% of the load. 

But if you want to see it as a _disk_ (ie replacing SSD's or rotational 
media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed, 
any "Linux/not-just-database-server" - it really isn't so much about x86, 
as it is about large cache granularity causing huge memory fragmentation 
issues).

			Linus


* Re: Implementing NVMHCI...
  2009-04-12 15:20   ` Alan Cox
@ 2009-04-12 16:15     ` Avi Kivity
  2009-04-12 17:11       ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2009-04-12 16:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Szabolcs Szakacsits, Linus Torvalds, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Alan Cox wrote:
>> The atomic building units (sector size, block size, etc) of NTFS are 
>> entirely parametric. The maximum values could be bigger than the 
>> currently "configured" maximum limits. 
>>     
>
> That isn't what bites you - you can run 8K-32K ext2 file systems but if
> your physical page size is smaller than the fs page size you have a
> problem.
>
> The question is whether the NT VM can cope rather than the fs.
>   

A quick test shows that it can.  I didn't try mmap(), but copying files 
around worked.

Did you expect it not to work?

-- 
error compiling committee.c: too many arguments to function



* Re: Implementing NVMHCI...
  2009-04-12 15:41   ` Linus Torvalds
@ 2009-04-12 17:02     ` Robert Hancock
  2009-04-12 17:20       ` Linus Torvalds
  2009-04-12 17:23     ` James Bottomley
       [not found]     ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>
  2 siblings, 1 reply; 44+ messages in thread
From: Robert Hancock @ 2009-04-12 17:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> 
> On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
>> I have not heard of NTFS using >4kB sectors yet, but technically 
>> it should work.
>>
>> The atomic building units (sector size, block size, etc) of NTFS are 
>> entirely parametric. The maximum values could be bigger than the 
>> currently "configured" maximum limits. 
> 
> It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't 
> already).
> 
> That's not the problem. The "filesystem layout" part is just a parameter.
> 
> The problem is then trying to actually access such a filesystem, in 
> particular trying to write to it, or trying to mmap() small chunks of it. 
> The FS layout is the trivial part.
> 
>> At present the limits are set in the BIOS Parameter Block in the NTFS
>> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for 
>> "Sectors Per Block". So >4kB sector size should work since 1993.
>>
>> 64kB+ sector size could be possible by bootstrapping NTFS drivers 
>> in a different way. 
> 
> Try it. And I don't mean "try to create that kind of filesystem". Try to 
> _use_ it. Does Windows actually support using it, or is it just a matter 
> of "the filesystem layout is _specified_ for up to 64kB block sizes"?
> 
> And I really don't know. Maybe Windows does support it. I'm just very 
> suspicious. I think there's a damn good reason why NTFS supports larger 
> block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!

I can't find any mention that any formattable block size can't be used, 
other than the fact that "The maximum default cluster size under Windows 
NT 3.51 and later is 4K due to the fact that NTFS file compression is 
not possible on drives with a larger allocation size. So format will 
never use larger than 4k clusters unless the user specifically overrides 
the defaults".

It could be there are other downsides to >4K cluster sizes as well, but 
that's the reason they state.

What about FAT? It supports cluster sizes up to 32K at least (possibly 
up to 256K as well, although somewhat nonstandard), and that works.. We 
support that in Linux, don't we?

> 
> Because it really is a hard problem. It's really pretty nasty to have your 
> cache blocking be smaller than the actual filesystem blocksize (the other 
> way is much easier, although it's certainly not pleasant either - Linux 
> supports it because we _have_ to, but if the sector-size of hardware had 
> traditionally been 4kB, I'd certainly also argue against adding complexity 
> just to make it smaller, the same way I argue against making it much 
> larger).
> 
> And don't get me wrong - we could (fairly) trivially make the 
> PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a 
> per-mapping thing, so that you could have some filesystems with that 
> bigger sector size and some with smaller ones. I think Andrea had patches 
> that did a fair chunk of it, and that _almost_ worked.
> 
> But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would 
> absolutely blow chunks. It would be disgustingly horrible. Putting the 
> kernel source tree on such a filesystem would waste about 75% of all 
> memory (the median size of a source file is just about 4kB), so your page 
> cache would be effectively cut in a quarter for a lot of real loads.
> 
> And to fix up _that_, you'd need to now do things like sub-page 
> allocations, and now your page-cache size isn't even fixed per filesystem, 
> it would be per-file, and the filesystem (and the drivers!) would have to 
> handle the cases of getting those 4kB partial pages (and do r-m-w IO after 
> all if your hardware sector size is >4kB).
> 
> IOW, there are simple things we can do - but they would SUCK. And there 
> are really complicated things we could do - and they would _still_ SUCK, 
> plus now I pretty much guarantee that your system would also be a lot less 
> stable. 
> 
> It really isn't worth it. It's much better for everybody to just be aware 
> of the incredible level of pure suckage of a general-purpose disk that has 
> hardware sectors >4kB. Just educate people that it's not good. Avoid the 
> whole insane suckage early, rather than be disappointed in hardware that 
> is total and utter CRAP and just causes untold problems.
> 
> Now, for specialty uses, things are different. CD-ROM's have had 2kB 
> sector sizes for a long time, and the reason it was never as big of a 
> problem isn't that they are still smaller than 4kB - it's that they are 
> read-only, and use special filesystems. And people _know_ they are 
> special. Yes, even when you write to them, it's a very special op. You'd 
> never try to put NTFS on a CD-ROM, and everybody knows it's not a disk 
> replacement.
> 
> In _those_ kinds of situations, a 64kB block isn't much of a problem. We 
> can do read-only media (where "read-only" doesn't have to be absolute: the 
> important part is that writing is special), and never have problems. 
> That's easy. Almost all the problems with block-size go away if you think 
> reading is 99.9% of the load. 
> 
> But if you want to see it as a _disk_ (ie replacing SSD's or rotational 
> media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed, 
> any "Linux/not-just-database-server" - it really isn't so much about x86, 
> as it is about large cache granularity causing huge memory fragmentation 
> issues).
> 
> 			Linus



* Re: Implementing NVMHCI...
  2009-04-12 16:15     ` Avi Kivity
@ 2009-04-12 17:11       ` Linus Torvalds
  2009-04-13  6:32         ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2009-04-12 17:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven



On Sun, 12 Apr 2009, Avi Kivity wrote:
> 
> A quick test shows that it can.  I didn't try mmap(), but copying files around
> worked.

You being who you are, I'm assuming you're doing this in a virtual 
environment, so you might be able to see the IO patterns..

Can you tell if it does the IO in chunks of 16kB or smaller? That can be 
hard to see with trivial tests (since any filesystem will try to chunk up 
writes regardless of how small the cache entry is, and on file creation it 
will have to write the full 16kB anyway just to initialize the newly 
allocated blocks on disk), but there's a couple of things that should be 
reasonably good litmus tests of what WNT does internally:

 - create a big file, then rewrite just a few bytes in it, and look at the 
   IO pattern of the result. Does it actually do the rewrite IO as one 
   16kB IO, or does it do sub-blocking?

   If the latter, then the 16kB thing is just a filesystem layout issue, 
   not an internal block-size issue, and WNT would likely have exactly the 
   same issues as Linux.

 - can you tell how many small files it will cache in RAM without doing 
   IO? If it always uses 16kB blocks for caching, it will be able to cache 
   a _lot_ fewer files in the same amount of RAM than with a smaller block 
   size.

Of course, the _really_ conclusive thing (in a virtualized environment) is 
to just make the virtual disk only able to do 16kB IO accesses (and with 
16kB alignment). IOW, actually emulate a disk with a 16kB hard sector 
size, and reporting a 16kB sector size to the READ CAPACITY command. If it 
works then, then clearly WNT has no issues with bigger sectors.
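
(A cheap way to fake such a beast on the Linux side - assuming your
scsi_debug module accepts large values for its sector_size parameter -
is something like "modprobe scsi_debug dev_size_mb=256 sector_size=16384",
which gives you a RAM-backed SCSI disk that really does report a 16kB
logical block size via READ CAPACITY.)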

			Linus


* Re: Implementing NVMHCI...
  2009-04-12 17:02     ` Robert Hancock
@ 2009-04-12 17:20       ` Linus Torvalds
  2009-04-12 18:35         ` Robert Hancock
  2009-04-13 11:18         ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-12 17:20 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven



On Sun, 12 Apr 2009, Robert Hancock wrote:
>
> What about FAT? It supports cluster sizes up to 32K at least (possibly up to
> 256K as well, although somewhat nonstandard), and that works.. We support that
> in Linux, don't we?

Sure. 

The thing is, "cluster size" in an FS is totally different from sector 
size.

People are missing the point here. You can trivially implement bigger 
cluster sizes by just writing multiple sectors. In fact, even just a 4kB 
cluster size is actually writing 8 512-byte hardware sectors on all normal 
disks.

So you can support big clusters without having big sectors. A 32kB cluster 
size in FAT is absolutely trivial to do: it's really purely an allocation 
size. So a FAT filesystem allocates disk-space in 32kB chunks, but then 
when you actually do IO to it, you can still write things 4kB at a time 
(or smaller), because once the allocation has been made, you still treat 
the disk as a series of smaller blocks.

IOW, when you allocate a new 32kB cluster, you will have to allocate 8 
pages to do IO on it (since you'll have to initialize the diskspace), but 
you can still literally treat those pages as _individual_ pages, and you 
can write them out in any order, and you can free them (and then look them 
up) one at a time.

Notice? The cluster size really only ends up being a disk-space allocation 
issue, not an issue for actually caching the end result or for the actual 
size of the IO.

The hardware sector size is very different. If you have a 32kB hardware 
sector size, that implies that _all_ IO has to be done with that 
granularity. Now you can no longer treat the eight pages as individual 
pages - you _have_ to write them out and read them in as one entity. If 
you dirty one page, you effectively dirty them all. You can not drop and 
re-allocate pages one at a time any more.

				Linus


* Re: Implementing NVMHCI...
  2009-04-12 15:41   ` Linus Torvalds
  2009-04-12 17:02     ` Robert Hancock
@ 2009-04-12 17:23     ` James Bottomley
       [not found]     ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>
  2 siblings, 0 replies; 44+ messages in thread
From: James Bottomley @ 2009-04-12 17:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

On Sun, 2009-04-12 at 08:41 -0700, Linus Torvalds wrote:
> 
> On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
> > 
> > I have not heard of NTFS using >4kB sectors yet, but technically 
> > it should work.
> > 
> > The atomic building units (sector size, block size, etc) of NTFS are 
> > entirely parametric. The maximum values could be bigger than the 
> > currently "configured" maximum limits. 
> 
> It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't 
> already).
> 
> That's not the problem. The "filesystem layout" part is just a parameter.
> 
> The problem is then trying to actually access such a filesystem, in 
> particular trying to write to it, or trying to mmap() small chunks of it. 
> The FS layout is the trivial part.
> 
> > At present the limits are set in the BIOS Parameter Block in the NTFS
> > Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for 
> > "Sectors Per Block". So >4kB sector size should work since 1993.
> > 
> > 64kB+ sector size could be possible by bootstrapping NTFS drivers 
> > in a different way. 
> 
> Try it. And I don't mean "try to create that kind of filesystem". Try to 
> _use_ it. Does Windows actually support using it, or is it just a matter 
> of "the filesystem layout is _specified_ for up to 64kB block sizes"?
> 
> And I really don't know. Maybe Windows does support it. I'm just very 
> suspicious. I think there's a damn good reason why NTFS supports larger 
> block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!
> 
> Because it really is a hard problem. It's really pretty nasty to have your 
> cache blocking be smaller than the actual filesystem blocksize (the other 
> way is much easier, although it's certainly not pleasant either - Linux 
> supports it because we _have_ to, but if the sector-size of hardware had 
> traditionally been 4kB, I'd certainly also argue against adding complexity 
> just to make it smaller, the same way I argue against making it much 
> larger).
> 
> And don't get me wrong - we could (fairly) trivially make the 
> PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a 
> per-mapping thing, so that you could have some filesystems with that 
> bigger sector size and some with smaller ones. I think Andrea had patches 
> that did a fair chunk of it, and that _almost_ worked.
> 
> But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would 
> absolutely blow chunks. It would be disgustingly horrible. Putting the 
> kernel source tree on such a filesystem would waste about 75% of all 
> memory (the median size of a source file is just about 4kB), so your page 
> cache would be effectively cut in a quarter for a lot of real loads.
> 
> And to fix up _that_, you'd need to now do things like sub-page 
> allocations, and now your page-cache size isn't even fixed per filesystem, 
> it would be per-file, and the filesystem (and the drivers!) would have to 
> handle the cases of getting those 4kB partial pages (and do r-m-w IO after 
> all if your hardware sector size is >4kB).

We might not have to go that far for a device with these special
characteristics.  It should be possible to build a block size remapping
Read Modify Write type device to present a 4k block size to the OS while
operating in n*4k blocks for the device.  We could implement the read
operations as readahead in the page cache, so if we're lucky we mostly
end up operating on full n*4k blocks anyway.  For the cases where we've
lost pieces of the n*4k native block and we have to do a write, we'd
just suck it up and do a read modify write on a separate memory area, a
bit like the new 4k sector devices do emulating 512 byte blocks.  The
suck factor of this double I/O plus memory copy overhead should be
mitigated partially by the fact that the underlying device is very fast.
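
Roughly, the write side of such a remapping layer would look like the
sketch below (illustration only - the remap_dev structure and the
phys_read/phys_write hooks are invented for this example, they are not
an existing kernel interface):

	#include <stdint.h>
	#include <string.h>

	#define LOGICAL_SIZE 4096u

	struct remap_dev {
		unsigned int phys_size;	/* native block size, n * 4096 */
		int (*phys_read)(struct remap_dev *d, uint64_t phys_lba,
				 unsigned char *buf);
		int (*phys_write)(struct remap_dev *d, uint64_t phys_lba,
				  const unsigned char *buf);
	};

	/* Write one 4k logical block, doing read-modify-write in a
	 * separate bounce buffer of phys_size bytes. */
	static int remap_write_4k(struct remap_dev *d, uint64_t lba4k,
				  const void *data, unsigned char *bounce)
	{
		unsigned int per_phys = d->phys_size / LOGICAL_SIZE;
		uint64_t phys_lba = lba4k / per_phys;
		unsigned int off = (lba4k % per_phys) * LOGICAL_SIZE;
		int ret;

		ret = d->phys_read(d, phys_lba, bounce);	/* read   */
		if (ret)
			return ret;
		memcpy(bounce + off, data, LOGICAL_SIZE);	/* modify */
		return d->phys_write(d, phys_lba, bounce);	/* write  */
	}

The read that fills the bounce buffer is exactly the part the page-cache
readahead mentioned above would hopefully make unnecessary most of the
time.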

James




* Re: Implementing NVMHCI...
  2009-04-12 17:20       ` Linus Torvalds
@ 2009-04-12 18:35         ` Robert Hancock
  2009-04-13 11:18         ` Avi Kivity
  1 sibling, 0 replies; 44+ messages in thread
From: Robert Hancock @ 2009-04-12 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> IOW, when you allocate a new 32kB cluster, you will have to allocate 8 
> pages to do IO on it (since you'll have to initialize the diskspace), but 
> you can still literally treat those pages as _individual_ pages, and you 
> can write them out in any order, and you can free them (and then look them 
> up) one at a time.
> 
> Notice? The cluster size really only ends up being a disk-space allocation 
> issue, not an issue for actually caching the end result or for the actual 
> size of the IO.

Right.. I didn't realize we were actually that smart (not writing out 
the entire cluster when dirtying one page) but I guess it makes sense.

> 
> The hardware sector size is very different. If you have a 32kB hardware 
> sector size, that implies that _all_ IO has to be done with that 
> granularity. Now you can no longer treat the eight pages as individual 
> pages - you _have_ to write them out and read them in as one entity. If 
> you dirty one page, you effectively dirty them all. You can not drop and 
> re-allocate pages one at a time any more.
> 
> 				Linus

I suspect that in this case trying to gang together multiple pages 
inside the VM to actually handle it this way all the way through would 
be insanity. My guess is the only way you could sanely do it is the 
read-modify-write approach when writing out the data (in the block layer 
maybe?) where the read can be optimized away if the pages for the entire 
hardware sector are already in cache or the write is large enough to 
replace the entire sector. I assume we already do this in the md code 
somewhere for cases like software RAID 5 with a stripe size of >4KB..

That obviously would have some performance drawbacks compared to a 
smaller sector size, but if the device is bound and determined to use 
bigger sectors internally one way or the other and the alternative is 
the drive does R-M-W internally to emulate smaller sectors - which for 
some devices seems to be the case - maybe it makes more sense to do it 
in the kernel if we have more information to allow us to do it more 
efficiently. (Though, at least on the normal ATA disk side of things, 4K 
is the biggest number I've heard tossed about for a future expanded 
sector size, but flash devices like this may be another story..)


* Re: Implementing NVMHCI...
  2009-04-12 17:11       ` Linus Torvalds
@ 2009-04-13  6:32         ` Avi Kivity
  2009-04-13 15:10           ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2009-04-13  6:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> On Sun, 12 Apr 2009, Avi Kivity wrote:
>   
>> A quick test shows that it can.  I didn't try mmap(), but copying files around
>> worked.
>>     
>
> You being who you are, I'm assuming you're doing this in a virtual 
> environment, so you might be able to see the IO patterns..
>
>   

Yes.  I just used the Windows performance counters rather than mess with 
qemu for the test below.

> Can you tell if it does the IO in chunks of 16kB or smaller? That can be 
> hard to see with trivial tests (since any filesystem will try to chunk up 
> writes regardless of how small the cache entry is, and on file creation it 
> will have to write the full 16kB anyway just to initialize the newly 
> allocated blocks on disk), but there's a couple of things that should be 
> reasonably good litmus tests of what WNT does internally:
>
>  - create a big file, 

Just creating a 5GB file in a 64KB filesystem was interesting - Windows 
was throwing out 256KB I/Os even though I was generating 1MB writes  
(and cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).

> then rewrite just a few bytes in it, and look at the 
>    IO pattern of the result. Does it actually do the rewrite IO as one 
>    16kB IO, or does it do sub-blocking?
>   

It generates 4KB writes (I was generating aligned 512 byte overwrites).  
What's more interesting, it was also issuing 32KB reads to fill the 
cache, not 64KB.  Since the number of reads and writes per second is 
almost equal, it's not splitting a 64KB read into two.

>    If the latter, then the 16kB thing is just a filesystem layout issue, 
>    not an internal block-size issue, and WNT would likely have exactly the 
>    same issues as Linux.
>   

A 1 byte write on an ordinary file generates a RMW, same as a 4KB write 
on a 16KB block.  So long as the filesystem is just a layer behind the 
pagecache (which I think is the case on Windows), I don't see what 
issues it can have.

>  - can you tell how many small files it will cache in RAM without doing 
>    IO? If it always uses 16kB blocks for caching, it will be able to cache 
>    a _lot_ fewer files in the same amount of RAM than with a smaller block 
>    size.
>   

I'll do this later, but given the 32KB reads for the test above, I'm 
guessing it will cache pages, not blocks.

> Of course, the _really_ conclusive thing (in a virtualized environment) is 
> to just make the virtual disk only able to do 16kB IO accesses (and with 
> 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector 
> size, and reporting a 16kB sector size to the READ CAPACITY command. If it 
> works then, then clearly WNT has no issues with bigger sectors.
>   

I don't think IDE supports this?  And Windows 2008 doesn't like the LSI 
emulated device we expose.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Implementing NVMHCI...
  2009-04-12 17:20       ` Linus Torvalds
  2009-04-12 18:35         ` Robert Hancock
@ 2009-04-13 11:18         ` Avi Kivity
  1 sibling, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2009-04-13 11:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Hancock, Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> The hardware sector size is very different. If you have a 32kB hardware 
> sector size, that implies that _all_ IO has to be done with that 
> granularity. Now you can no longer treat the eight pages as individual 
> pages - you _have_ to write them out and read them in as one entity. If 
> you dirty one page, you effectively dirty them all. You can not drop and 
> re-allocate pages one at a time any more.
>   

You can still drop clean pages.  Sure, that costs you performance as 
you'll have to do re-read them in order to write a dirty page, but in 
the common case, the clean pages around would still be available and 
you'd avoid it.

Applications that randomly write to large files can be tuned to use the 
disk sector size.  As for the rest, they're either read-only (executable 
mappings) or sequential.

-- 
error compiling committee.c: too many arguments to function



* Re: Implementing NVMHCI...
  2009-04-13  6:32         ` Avi Kivity
@ 2009-04-13 15:10           ` Linus Torvalds
  2009-04-13 15:38             ` James Bottomley
                               ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-13 15:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven



On Mon, 13 Apr 2009, Avi Kivity wrote:
> > 
> >  - create a big file,
> 
> Just creating a 5GB file in a 64KB filesystem was interesting - Windows 
> was throwing out 256KB I/Os even though I was generating 1MB writes (and 
> cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).

Heh, ok. So the "big file" really only needed to be big enough to not be 
cached, and 5GB was probably overkill. In fact, if there's some way to 
blow the cache, you could have made it much smaller. But 5G certainly 
works ;)

And yeah, I'm not surprised it limits the size of the IO. Linux will 
generally do the same. I forget what our default maximum bio size is, but 
I suspect it is in that same kind of range.

There are often problems with bigger IO's (latency being one, actual 
controller bugs being another), and even if the hardware has no bugs and 
its limits are higher, you usually don't want to have excessively large 
DMA mapping tables _and_ the advantage of bigger IO is usually not that 
big once you pass the "reasonably sized" limit (which is 64kB+). Plus they 
happen seldom enough in practice anyway that it's often not worth 
optimizing for.

> > then rewrite just a few bytes in it, and look at the IO pattern of the 
> > result. Does it actually do the rewrite IO as one 16kB IO, or does it 
> > do sub-blocking?
> 
> It generates 4KB writes (I was generating aligned 512 byte overwrites). 
> What's more interesting, it was also issuing 32KB reads to fill the 
> cache, not 64KB.  Since the number of reads and writes per second is 
> almost equal, it's not splitting a 64KB read into two.

Ok, that sounds pretty much _exactly_ like the Linux IO patterns would 
likely be.

The 32kB read has likely nothing to do with any filesystem layout issues 
(especially as you used a 64kB cluster size), but is simply because 

 (a) Windows caches things with a 4kB granularity, so the 512-byte write 
     turned into a read-modify-write
 (b) the read was really for just 4kB, but once you start reading you want 
     to do read-ahead anyway since it hardly gets any more expensive to 
     read a few pages than to read just one.

So once it had to do the read anyway, windows just read 8 pages instead of 
one - very reasonable. 

> >    If the latter, then the 16kB thing is just a filesystem layout 
> > issue, not an internal block-size issue, and WNT would likely have 
> > exactly the same issues as Linux.
> 
> A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a
> 16KB block.  So long as the filesystem is just a layer behind the pagecache
> (which I think is the case on Windows), I don't see what issues it can have.

Right. It's all very straightforward from a filesystem layout issue. The 
problem is all about managing memory.

You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
your example!). It's a total disaster. Imagine what would happen to user 
application performance if kmalloc() always returned 16kB-aligned chunks 
of memory, all sized as integer multiples of 16kB? It would absolutely 
_suck_. Sure, it would be fine for your large allocations, but any time 
you handle strings, you'd allocate 16kB of memory for any small 5-byte 
string. You'd have horrible cache behavior, and you'd run out of memory 
much too quickly.

The same is true in the kernel. The single biggest memory user under 
almost all normal loads is the disk cache. That _is_ the normal allocator 
for any OS kernel. Everything else is almost details (ok, so Linux in 
particular does cache metadata very aggressively, so the dcache and inode 
cache are seldom "just details", but the page cache is still generally the 
most important part).

So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane 
system does that. It's only useful if you absolutely _only_ work with 
large files - ie you're a database server. For just about any other 
workload, that kind of granularity is totally unacceptable.

So doing a read-modify-write on a 1-byte (or 512-byte) write, when the 
block size is 4kB is easy - we just have to do it anyway. 

Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is 
also _doable_, and from the IO pattern standpoint it is no different. But 
from a memory allocation pattern standpoint it's a disaster - because now 
you're always working with chunks that are just 'too big' to be good 
building blocks of a reasonable allocator.

If you always allocate 64kB for file caches, and you work with lots of 
small files (like a source tree), you will literally waste all your 
memory.

And if you have some "dynamic" scheme, you'll have tons and tons of really 
nasty cases when you have to grow a 4kB allocation to a 64kB one when the 
file grows. Imagine doing "realloc()", but doing it in a _threaded_ 
environment, where any number of threads may be using the old allocation 
at the same time. And that's a kernel - it has to be _the_ most 
threaded program on the whole machine, because otherwise the kernel 
would be the scaling bottleneck.

And THAT is why 64kB blocks is such a disaster.

> >  - can you tell how many small files it will cache in RAM without doing
> > IO? If it always uses 16kB blocks for caching, it will be able to cache    a
> > _lot_ fewer files in the same amount of RAM than with a smaller block
> > size.
> 
> I'll do this later, but given the 32KB reads for the test above, I'm guessing
> it will cache pages, not blocks.

Yeah, you don't need to.

I can already guarantee that Windows does caching on a page granularity.

I can also pretty much guarantee that that is also why Windows stops 
compressing files once the blocksize is bigger than 4kB: because at that 
point, the block compressions would need to handle _multiple_ cache 
entities, and that's really painful for all the same reasons that bigger 
sectors would be really painful - you'd always need to make sure that you 
always have all of those cache entries in memory together, and you could 
never treat your cache entries as individual entities.

> > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > to just make the virtual disk only able to do 16kB IO accesses (and with
> > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > then, then clearly WNT has no issues with bigger sectors.
> 
> I don't think IDE supports this?  And Windows 2008 doesn't like the LSI
> emulated device we expose.

Yeah, you'd have to have the OS use the SCSI commands for disk discovery, 
so at least a SATA interface. With IDE disks, the sector size always has 
to be 512 bytes, I think.

		Linus


* Re: Implementing NVMHCI...
  2009-04-13 15:10           ` Linus Torvalds
@ 2009-04-13 15:38             ` James Bottomley
  2009-04-14  7:22             ` Andi Kleen
  2009-04-14  9:59             ` Avi Kivity
  2 siblings, 0 replies; 44+ messages in thread
From: James Bottomley @ 2009-04-13 15:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

On Mon, 2009-04-13 at 08:10 -0700, Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
> > > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > > to just make the virtual disk only able to do 16kB IO accesses (and with
> > > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > > then, then clearly WNT has no issues with bigger sectors.
> > 
> > I don't think IDE supports this?  And Windows 2008 doesn't like the LSI
> > emulated device we expose.
> 
> Yeah, you'd have to have the OS use the SCSI commands for disk discovery, 
> so at least a SATA interface. With IDE disks, the sector size always has 
> to be 512 bytes, I think.

Actually, the latest ATA rev supports different sector sizes in
preparation for native 4k sector size SATA disks (words 117-118 of
IDENTIFY).  Matthew Wilcox already has the patches for libata ready.
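
(For the curious, the decode is roughly the following - a sketch based
on the ATA-8 wording, not a copy of whatever the final libata helper
looks like:

	#include <stdint.h>

	/* id[] is the 256-word IDENTIFY DEVICE data in CPU byte order. */
	static uint32_t ata_logical_sector_size(const uint16_t *id)
	{
		/*
		 * Word 106: bits 15:14 must be 01b for the word to be
		 * valid, and bit 12 means "logical sector longer than
		 * 256 words"; words 117-118 then hold the logical
		 * sector size, counted in 16-bit words.
		 */
		if ((id[106] & 0xc000) == 0x4000 && (id[106] & (1 << 12)))
			return (((uint32_t)id[118] << 16) | id[117]) * 2;
		return 512;	/* the traditional default */
	}

so the sector size finally becomes discoverable for ATA as well.)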

James




* Re: Implementing NVMHCI...
  2009-04-13 15:10           ` Linus Torvalds
  2009-04-13 15:38             ` James Bottomley
@ 2009-04-14  7:22             ` Andi Kleen
  2009-04-14 10:07               ` Avi Kivity
  2009-04-14  9:59             ` Avi Kivity
  2 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2009-04-14  7:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
> your example!).

AFAIK at least for user visible anonymous memory Windows uses 64k 
chunks. At least that is what Cygwin's mmap exposes. I don't know
if it does the same for disk cache.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: Implementing NVMHCI...
  2009-04-13 15:10           ` Linus Torvalds
  2009-04-13 15:38             ` James Bottomley
  2009-04-14  7:22             ` Andi Kleen
@ 2009-04-14  9:59             ` Avi Kivity
  2009-04-14 10:23               ` Jeff Garzik
  2 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2009-04-14  9:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>   
>>>  - create a big file,
>>>       
>> Just creating a 5GB file in a 64KB filesystem was interesting - Windows 
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and 
>> cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>     
>
> Heh, ok. So the "big file" really only needed to be big enough to not be 
> cached, and 5GB was probably overkill. In fact, if there's some way to 
> blow the cache, you could have made it much smaller. But 5G certainly 
> works ;)
>   

I wanted to make sure my random writes later don't get coalesced.  A 1GB 
file, half of which is cached (I used a 1GB guest), offers lots of 
chances for coalescing if Windows delays the writes sufficiently.  At 
5GB, Windows can only cache 10% of the file, so it will be continuously 
flushing.


>
>  (a) Windows caches things with a 4kB granularity, so the 512-byte write 
>      turned into a read-modify-write
>   
>   
[...]

> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
> your example!). It's a total disaster. Imagine what would happen to user 
> application performance if kmalloc() always returned 16kB-aligned chunks 
> of memory, all sized as integer multiples of 16kB? It would absolutely 
> _suck_. Sure, it would be fine for your large allocations, but any time 
> you handle strings, you'd allocate 16kB of memory for any small 5-byte 
> string. You'd have horrible cache behavior, and you'd run out of memory 
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under 
> almost all normal loads is the disk cache. That _is_ the normal allocator 
> for any OS kernel. Everything else is almost details (ok, so Linux in 
> particular does cache metadata very aggressively, so the dcache and inode 
> cache are seldom "just details", but the page cache is still generally the 
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane 
> system does that. It's only useful if you absolutely _only_ work with 
> large files - ie you're a database server. For just about any other 
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the 
> block size is 4kB is easy - we just have to do it anyway. 
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is 
> also _doable_, and from the IO pattern standpoint it is no different. But 
> from a memory allocation pattern standpoint it's a disaster - because now 
> you're always working with chunks that are just 'too big' to be good 
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of 
> small files (like a source tree), you will literally waste all your 
> memory.
>
>   

Well, no one is talking about 64KB granularity for in-core files.  Like 
you noticed, Windows uses the mmu page size.  We could keep doing that, 
and still have 16KB+ sector sizes.  It just means a RMW if you don't 
happen to have the adjoining clean pages in cache.

Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
so while you're doubling your access time, you're doubling a fairly 
small quantity.  The controller would do the same if it exposed smaller 
sectors, so there's no huge loss.

We still lose on disk storage efficiency, but I'm guessing that for a modern 
tree with some object files with debug information and a .git directory 
it won't be such a great hit.  For more mainstream uses, it would be 
negligible.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Implementing NVMHCI...
  2009-04-14  7:22             ` Andi Kleen
@ 2009-04-14 10:07               ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2009-04-14 10:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Andi Kleen wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>   
>> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
>> your example!).
>>     
>
> AFAIK at least for user visible anonymous memory Windows uses 64k 
> chunks. At least that is what Cygwin's mmap exposes. I don't know
> if it does the same for disk cache.
>   

I think that's just the region address and size granularity (as in 
vmas).  For paging they still use the mmu page size.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Implementing NVMHCI...
  2009-04-14  9:59             ` Avi Kivity
@ 2009-04-14 10:23               ` Jeff Garzik
  2009-04-14 10:37                 ` Avi Kivity
  2009-04-25  8:26                 ` Pavel Machek
  0 siblings, 2 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-14 10:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Avi Kivity wrote:
> Well, no one is talking about 64KB granularity for in-core files.  Like 
> you noticed, Windows uses the mmu page size.  We could keep doing that, 
> and still have 16KB+ sector sizes.  It just means a RMW if you don't 
> happen to have the adjoining clean pages in cache.
> 
> Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
> so while you're doubling your access time, you're doubling a fairly 
> small quantity.  The controller would do the same if it exposed smaller 
> sectors, so there's no huge loss.
> 
> We still lose on disk storage efficiency, but I'm guessing that for a modern 
> tree with some object files with debug information and a .git directory 
> it won't be such a great hit.  For more mainstream uses, it would be 
> negligible.


Speaking of RMW...    in one sense, we have to deal with RMW anyway. 
Upcoming ATA hard drives will be configured with a normal 512b sector 
API interface, but underlying physical sector size is 1k or 4k.

The disk performs the RMW for us, but we must be aware of physical 
sector size in order to determine proper alignment of on-disk data, to 
minimize RMW cycles.

At the moment, it seems like most of the effort to get these ATA devices 
to perform efficiently is in getting partition / RAID stripe offsets set 
up properly.

So perhaps for NVMHCI we could
(a) hardcode NVM sector size maximum at 4k
(b) do RMW in the driver for sector size >4k, and
(c) export information indicating the true sector size, in a manner 
similar to how the ATA driver passes that info to userland partitioning 
tools.
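
For (c), the driver side could end up being as simple as the sketch
below, using the block-layer topology helpers from the I/O topology work
that is queued up alongside the 4k-sector ATA support (the nvmhci_* name
is made up, and treat the helper names as "whatever that API ends up
being called"):

	#include <linux/blkdev.h>

	/* Report 4k logical blocks to the OS while advertising the
	 * NVM's true sector size as the physical block / minimum I/O
	 * size, so partitioners and filesystems can align to it. */
	static void nvmhci_set_topology(struct request_queue *q,
					unsigned int nvm_sector_size)
	{
		blk_queue_logical_block_size(q, 4096);
		blk_queue_physical_block_size(q, nvm_sector_size);
		blk_queue_io_min(q, nvm_sector_size);
	}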

	Jeff





* Re: Implementing NVMHCI...
  2009-04-14 10:23               ` Jeff Garzik
@ 2009-04-14 10:37                 ` Avi Kivity
  2009-04-14 11:45                   ` Jeff Garzik
  2009-04-25  8:26                 ` Pavel Machek
  1 sibling, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2009-04-14 10:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Jeff Garzik wrote:
> Speaking of RMW...    in one sense, we have to deal with RMW anyway. 
> Upcoming ATA hard drives will be configured with a normal 512b sector 
> API interface, but underlying physical sector size is 1k or 4k.
>
> The disk performs the RMW for us, but we must be aware of physical 
> sector size in order to determine proper alignment of on-disk data, to 
> minimize RMW cycles.
>

Virtualization has the same issue.  OS installers will typically setup 
the first partition at sector 63, and that means every page-sized block 
access will be misaligned.  Particularly bad when the guest's disk is 
backed on a regular file.

Windows 2008 aligns partitions on a 1MB boundary, IIRC.
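
(The arithmetic: a first partition starting at LBA 63 begins
63 * 512 = 32256 bytes into the disk, which is not a multiple of 4096,
so every 4k block in the guest straddles two 4k blocks in the backing
store; a 2048-sector / 1MB start is aligned for any power-of-two block
size up to 1MB.)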

> At the moment, it seems like most of the effort to get these ATA 
> devices to perform efficiently is in getting partition / RAID stripe 
> offsets set up properly.
>
> So perhaps for NVMHCI we could
> (a) hardcode NVM sector size maximum at 4k
> (b) do RMW in the driver for sector size >4k, and

Why not do it in the block layer?  That way it isn't limited to one driver.

> (c) export information indicating the true sector size, in a manner 
> similar to how the ATA driver passes that info to userland 
> partitioning tools.

Eventually we'll want to allow filesystems to make use of the native 
sector size.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Implementing NVMHCI...
  2009-04-14 10:37                 ` Avi Kivity
@ 2009-04-14 11:45                   ` Jeff Garzik
  2009-04-14 11:58                     ` Szabolcs Szakacsits
  2009-04-14 12:08                     ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-14 11:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Avi Kivity wrote:
> Jeff Garzik wrote:
>> Speaking of RMW...    in one sense, we have to deal with RMW anyway. 
>> Upcoming ATA hard drives will be configured with a normal 512b sector 
>> API interface, but underlying physical sector size is 1k or 4k.
>>
>> The disk performs the RMW for us, but we must be aware of physical 
>> sector size in order to determine proper alignment of on-disk data, to 
>> minimize RMW cycles.
>>
> 
> Virtualization has the same issue.  OS installers will typically setup 
> the first partition at sector 63, and that means every page-sized block 
> access will be misaligned.  Particularly bad when the guest's disk is 
> backed on a regular file.
> 
> Windows 2008 aligns partitions on a 1MB boundary, IIRC.

Makes a lot of sense...


>> At the moment, it seems like most of the effort to get these ATA 
>> devices to perform efficiently is in getting partition / RAID stripe 
>> offsets set up properly.
>>
>> So perhaps for NVMHCI we could
>> (a) hardcode NVM sector size maximum at 4k
>> (b) do RMW in the driver for sector size >4k, and
> 
> Why not do it in the block layer?  That way it isn't limited to one driver.

Sure.  "in the driver" is a highly relative phrase :)  If there is code 
to be shared among multiple callsites, let's share it.


>> (c) export information indicating the true sector size, in a manner 
>> similar to how the ATA driver passes that info to userland 
>> partitioning tools.
> 
> Eventually we'll want to allow filesystems to make use of the native 
> sector size.

At the kernel level, you mean?

Filesystems already must deal with issues such as avoiding RAID stripe 
boundaries (man mke2fs, search for 'RAID').

So I hope that same code should be applicable to cases where the 
"logical sector size" (as exported by storage interface) differs from 
"physical sector size" (the underlying hardware sector size, not 
directly accessible by OS).

But if you are talking about filesystems directly supporting sector 
sizes >4kb, well, I'll let Linus and others settle that debate :)  I 
will just write the driver once the dust settles...

	Jeff




* Re: Implementing NVMHCI...
  2009-04-14 11:45                   ` Jeff Garzik
@ 2009-04-14 11:58                     ` Szabolcs Szakacsits
  2009-04-17 22:45                       ` H. Peter Anvin
  2009-04-14 12:08                     ` Avi Kivity
  1 sibling, 1 reply; 44+ messages in thread
From: Szabolcs Szakacsits @ 2009-04-14 11:58 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Avi Kivity, Linus Torvalds, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven


On Tue, 14 Apr 2009, Jeff Garzik wrote:
> Avi Kivity wrote:
> > Jeff Garzik wrote:
> > > Speaking of RMW...    in one sense, we have to deal with RMW anyway.
> > > Upcoming ATA hard drives will be configured with a normal 512b sector API
> > > interface, but underlying physical sector size is 1k or 4k.
> > > 
> > > The disk performs the RMW for us, but we must be aware of physical sector
> > > size in order to determine proper alignment of on-disk data, to minimize
> > > RMW cycles.
> > > 
> > 
> > Virtualization has the same issue.  OS installers will typically setup the
> > first partition at sector 63, and that means every page-sized block access
> > will be misaligned.  Particularly bad when the guest's disk is backed on a
> > regular file.
> > 
> > Windows 2008 aligns partitions on a 1MB boundary, IIRC.
> 
> Makes a lot of sense...

Since Vista, at least, the first partition is 2048-sector aligned.

	Szaka

--
NTFS-3G: http://ntfs-3g.org


* Re: Implementing NVMHCI...
  2009-04-14 11:45                   ` Jeff Garzik
  2009-04-14 11:58                     ` Szabolcs Szakacsits
@ 2009-04-14 12:08                     ` Avi Kivity
  2009-04-14 12:21                       ` Jeff Garzik
  1 sibling, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2009-04-14 12:08 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Jeff Garzik wrote:
>>> (c) export information indicating the true sector size, in a manner 
>>> similar to how the ATA driver passes that info to userland 
>>> partitioning tools.
>>
>> Eventually we'll want to allow filesystems to make use of the native 
>> sector size.
>
> At the kernel level, you mean?
>

Yes.  You'll want to align extents and I/O requests on that boundary.

>
> But if you are talking about filesystems directly supporting sector 
> sizes >4kb, well, I'll let Linus and others settle that debate :)  I 
> will just write the driver once the dust settles...

IMO drivers should expose whatever sector size the device has, 
filesystems should expose their block size, and the block layer should 
correct any impedance mismatches by doing RMW.

Unfortunately, sector size > fs block size means a lot of pointless 
locking for the RMW, so if large sector sizes take off, we'll have to 
adjust filesystems to use larger block sizes.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Implementing NVMHCI...
  2009-04-14 12:08                     ` Avi Kivity
@ 2009-04-14 12:21                       ` Jeff Garzik
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-14 12:21 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Avi Kivity wrote:
> Jeff Garzik wrote:
>>>> (c) export information indicating the true sector size, in a manner 
>>>> similar to how the ATA driver passes that info to userland 
>>>> partitioning tools.
>>>
>>> Eventually we'll want to allow filesystems to make use of the native 
>>> sector size.
>>
>> At the kernel level, you mean?
>>
> 
> Yes.  You'll want to align extents and I/O requests on that boundary.

Sure.  And RAID today presents these issues to the filesystem...

man mke2fs(8), and look at extended options 'stride' and 'stripe-width'. 
  It includes mention of RMW issues.


>> But if you are talking about filesystems directly supporting sector 
>> sizes >4kb, well, I'll let Linus and others settle that debate :)  I 
>> will just write the driver once the dust settles...
> 
> IMO drivers should expose whatever sector size the device has, 
> filesystems should expose their block size, and the block layer should 
> correct any impedance mismatches by doing RMW.
> 
> Unfortunately, sector size > fs block size means a lot of pointless 
> locking for the RMW, so if large sector sizes take off, we'll have to 
> adjust filesystems to use larger block sizes.

Don't forget the case where the device does RMW for you, and does not 
permit direct access to physical sector size (all operations are in 
terms of logical sector size).

	Jeff




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
       [not found]     ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>
@ 2009-04-15  6:37       ` Artem Bityutskiy
  2009-04-30 22:51         ` Jörn Engel
  0 siblings, 1 reply; 44+ messages in thread
From: Artem Bityutskiy @ 2009-04-15  6:37 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: Linus Torvalds, Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven,
	David Woodhouse, Jörn Engel

On Tue, 2009-04-14 at 10:52 -0700, Jared Hulbert wrote:
>         It really isn't worth it. It's much better for everybody to
>         just be aware of the incredible level of pure suckage of a
>         general-purpose disk that has hardware sectors >4kB. Just
>         educate people that it's not good. Avoid the whole insane
>         suckage early, rather than be disappointed in hardware that
>         is total and utter CRAP and just causes untold problems.
> 
> I don't disagree that >4KB DISKS are a bad idea. But I don't think
> that's what's going on here.  As I read it, NVMHCI would plug into the
> MTD subsystem, not the block subsystem.
> 
> 
> NVMHCI, as far as I understand the spec, is not trying to be a
> general-purpose disk; it's for exposing more or less the raw NAND.  As
> far as I can tell it's a DMA engine spec for large arrays of NAND.
> BTW, has anybody actually seen an NVMHCI device, or does anyone plan to
> make one?

I briefly glanced at the doc, and it does not look like this is an
interface to expose raw NAND. E.g., I could not find an "erase" operation,
and I could not find information about bad eraseblocks.

It looks like it is not about raw NAND. Maybe it is about "managed" NAND.

Also, the following sentences from the "Outside of Scope" sub-section
suggest I'm right:
"NVMHCI is also specified above any non-volatile memory management, like
wear leveling. Erases and other management tasks for NVM technologies
like NAND are abstracted.".

So it says NVMHCI is _above_ wear levelling, which means the FTL would
be _inside_ the NVMHCI device, so this is not about raw NAND.

But I may be wrong; I spent less than 10 minutes looking at the doc,
sorry.

-- 
Best regards,
Artem Bityutskiy (Битюцкий Артём)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-14 11:58                     ` Szabolcs Szakacsits
@ 2009-04-17 22:45                       ` H. Peter Anvin
  0 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2009-04-17 22:45 UTC (permalink / raw)
  To: Szabolcs Szakacsits
  Cc: Jeff Garzik, Avi Kivity, Linus Torvalds, Alan Cox,
	Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven

Szabolcs Szakacsits wrote:
>>>
>>> Windows 2008 aligns partitions on a 1MB boundary, IIRC.
>> Makes a lot of sense...
> 
> Since Vista, at least, the first partition is 2048-sector aligned.
> 
> 	Szaka
> 

2048 * 512 = 1 MB, yes.

I *think* it's actually 1 MB and not 2048 sectors, but yes, they've 
finally dumped the idiotic DOS misalignment.  Unfortunately the GNU 
parted people have said that the parted code is too fragile to fix 
in this way.  Sigh.

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-14 10:23               ` Jeff Garzik
  2009-04-14 10:37                 ` Avi Kivity
@ 2009-04-25  8:26                 ` Pavel Machek
  1 sibling, 0 replies; 44+ messages in thread
From: Pavel Machek @ 2009-04-25  8:26 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Avi Kivity, Linus Torvalds, Alan Cox, Szabolcs Szakacsits,
	Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven

Hi!

>> Well, no one is talking about 64KB granularity for in-core files.  Like 
>> you noticed, Windows uses the MMU page size.  We could keep doing that, 
>> and still have 16KB+ sector sizes.  It just means an RMW if you don't 
>> happen to have the adjoining clean pages in cache.
>>
>> Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
>> so while you're doubling your access time, you're doubling a fairly 
>> small quantity.  The controller would do the same if it exposed smaller 
>> sectors, so there's no huge loss.
>>
>> We still lose on disk storage efficiency, but I'm guessing that for a 
>> modern tree with some object files with debug information and a .git 
>> directory, it won't be such a great hit.  For more mainstream uses, it 
>> would be negligible.
>
>
> Speaking of RMW...    in one sense, we have to deal with RMW anyway.  
> Upcoming ATA hard drives will be configured with a normal 512b sector  
> API interface, but underlying physical sector size is 1k or 4k.
>
> The disk performs the RMW for us, but we must be aware of physical  
> sector size in order to determine proper alignment of on-disk data, to  
> minimize RMW cycles.

Also... RMW has some nasty reliability implications. If we use ext3
with a 1KB block size (or something like that), unrelated data may now
be damaged during powerfail. Filesystems cannot handle that :-(.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-15  6:37       ` Artem Bityutskiy
@ 2009-04-30 22:51         ` Jörn Engel
  2009-04-30 23:36           ` Jeff Garzik
  0 siblings, 1 reply; 44+ messages in thread
From: Jörn Engel @ 2009-04-30 22:51 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jared Hulbert, Linus Torvalds, Szabolcs Szakacsits, Alan Cox,
	Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven, David Woodhouse

On Wed, 15 April 2009 09:37:50 +0300, Artem Bityutskiy wrote:
> 
> I briefly glanced at the doc, and it does not look like this is an
> interface to expose raw NAND. E.g., I could not find "erase" operation.
> I could not find information about bad eraseblocks.
> 
> It looks like it is not about raw NANDs. May be about "managed" NANDs.

I'm not sure whether your distinction is exactly valid anymore.  "raw
NAND" used to mean two things.  1) A single chip of silicon without
additional hardware.  2) NAND without FTL.

Traditionally the FTL was implemented either in software or in a
controller chip.  So you could not get "cooked" flash as in FTL without
"cooked" flash as in extra hardware.  Today you can, which makes "raw
NAND" a less useful term.

And I'm not sure what to think about flash chips with the (likely
crappy) FTL inside either.  Not having to deal with bad blocks anymore
is bliss.  Not having to deal with wear leveling anymore is a lie.
Not knowing whether errors occurred and whether uncorrected data was
left on the device or replaced with corrected data is a pain.

But like it or not, the market seems to be moving in that direction.
Which means we will have "block devices" that have all the interfaces of
disks and behave much like flash - modulo the crap FTL.

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-30 22:51         ` Jörn Engel
@ 2009-04-30 23:36           ` Jeff Garzik
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-30 23:36 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Artem Bityutskiy, Jared Hulbert, Linus Torvalds,
	Szabolcs Szakacsits, Alan Cox, Grant Grundler,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven,
	David Woodhouse

Jörn Engel wrote:
> But like it or not, the market seems to be moving in that direction.
> Which means we will have "block devices" that have all the interfaces of
> disks and behave much like flash - modulo the crap FTL.


One driving goal behind NVMHCI was to avoid disk-originated interfaces, 
because they are not as well suited to flash storage.

The NVMHCI command set (distinguished from NVMHCI, the silicon) is 
specifically targeted towards flash.

	Jeff



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-12 14:23         ` Mark Lord
@ 2009-04-12 17:29           ` Jeff Garzik
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-12 17:29 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan Cox, Grant Grundler, Linus Torvalds, Linux IDE mailing list,
	LKML, Jens Axboe, Arjan van de Ven

Mark Lord wrote:
> Alan Cox wrote:
> ..
>> Alternatively you go for read-modify-write (nasty performance hit
>> especially for RAID or a log structured fs).
> ..
> 
> Initially, at least, I'd guess that this NVM-HCI thing is all about
> built-in flash memory on motherboards, to hold the "instant-boot"
> software that hardware companies (eg. ASUS) are rapidly growing fond of.
> 
> At present, that means a mostly read-only Linux installation,
> though MS for sure are hoping for Moore's Law to kick in and
> provide sufficient space for a copy of Vista there or something.

Yeah... instant boot, and "trusted boot" (booting a signed image), 
storage of useful details like boot drive layouts, etc.

I'm sure we can come up with other fun uses, too...

	Jeff





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 23:25       ` Alan Cox
  2009-04-11 23:51         ` Jeff Garzik
  2009-04-12  1:15         ` david
@ 2009-04-12 14:23         ` Mark Lord
  2009-04-12 17:29           ` Jeff Garzik
  2 siblings, 1 reply; 44+ messages in thread
From: Mark Lord @ 2009-04-12 14:23 UTC (permalink / raw)
  To: Alan Cox
  Cc: Grant Grundler, Linus Torvalds, Jeff Garzik,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Alan Cox wrote:
..
> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).
..

Initially, at least, I'd guess that this NVM-HCI thing is all about
built-in flash memory on motherboards, to hold the "instant-boot"
software that hardware companies (eg. ASUS) are rapidly growing fond of.

At present, that means a mostly read-only Linux installation,
though MS for sure are hoping for Moore's Law to kick in and
provide sufficient space for a copy of Vista there or something.

The point being, its probable *initial* intended use is for a
run-time read-only filesystem, so having to do dirty R-M-W sequences
for writes might not be a significant issue.

At present.  And even if it were, it might not be much worse than
having the hardware itself do it internally, which is what would 
have to happen if it always only ever showed 4KB to us.

Longer term, as flash densities increase, we're going to end up
with motherboards that have huge SSDs built-in, through an interface
like this one, or over a virtual SATA link or something.

I wonder how long until "desktop/notebook" computers no longer
have replaceable "hard disks" at all?

Cheers

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-12  1:15         ` david
@ 2009-04-12  3:13           ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-12  3:13 UTC (permalink / raw)
  To: david
  Cc: Alan Cox, Grant Grundler, Jeff Garzik, Linux IDE mailing list,
	LKML, Jens Axboe, Arjan van de Ven



On Sat, 11 Apr 2009, david@lang.hm wrote:
> 
> gaining this sort of ability would not be a bad thing.

.. and if my house was built of gold, that wouldn't be a bad thing either.

What's your point?

Are you going to create the magical patches that make that happen? Are you 
going to maintain the added complexity that comes from suddenly having 
multiple dirty bits per "page"? Are you going to create the mythical 
filesystems that magically start doing tail packing in order to not waste 
tons of disk-space with small files, even if they have a 32kB block-size?

In other words, your whole argument is built on "wouldn't it be nice".

And I'm just the grumpy old guy who tells you that there's this small 
thing called REALITY that comes and bites you in the *ss. And I'm sorry, 
but the very nature of "reality" is that it doesn't care one whit whether 
you believe me or not.

The fact is, >4kB sectors just aren't realistic right now, and I don't 
think you have any _clue_ about the pain of trying to make them so. You're 
just throwing pennies down a wishing well.

			Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-12  0:49           ` Linus Torvalds
@ 2009-04-12  1:59             ` Jeff Garzik
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-12  1:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let 
> me doubt that.

FWIW...  No clue about sector size, but NTFS cluster size (i.e. block 
size) goes up to 64k.  Compression is disabled for cluster sizes above 4k.

	Jeff




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 23:25       ` Alan Cox
  2009-04-11 23:51         ` Jeff Garzik
@ 2009-04-12  1:15         ` david
  2009-04-12  3:13           ` Linus Torvalds
  2009-04-12 14:23         ` Mark Lord
  2 siblings, 1 reply; 44+ messages in thread
From: david @ 2009-04-12  1:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Grant Grundler, Linus Torvalds, Jeff Garzik,
	Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

On Sun, 12 Apr 2009, Alan Cox wrote:

>> We've abstracted the DMA mapping/SG list handling enough that the
>> block size should make no more difference than it does for the
>> MTU size of a network.
>
> You need to start managing groups of pages in the vm and keeping them
> together and writing them out together and paging them together even if
> one of them is dirty and the other isn't. You have to deal with cases
> where a process forks and the two pages are dirtied one in each but still
> have to be written together.

Gaining this sort of ability would not be a bad thing. With current 
hardware (SSDs and RAID arrays) you can very easily be in a situation 
where it's much cheaper to deal with a group of related pages as one 
group rather than processing them individually. This is just an 
extension of the same issue.

David Lang

> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).
>
> Yes you can do it but it sure won't be pretty with a conventional fs.
> Some of the log structured file systems have no problems with this and
> some kinds of journalling can help but for a typical block file system
> it'll suck.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 23:51         ` Jeff Garzik
@ 2009-04-12  0:49           ` Linus Torvalds
  2009-04-12  1:59             ` Jeff Garzik
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2009-04-12  0:49 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven



On Sat, 11 Apr 2009, Jeff Garzik wrote:
> 
> Or just ignore the extra length, thereby excising the 'read-modify' step...
> Total storage is halved or worse, but you don't take as much of a performance
> hit.

Well, the people who want > 4kB sectors usually want _much_ bigger (ie 
32kB sectors), and if you end up doing the "just use the first part" 
thing, you're wasting 7/8ths of the space. 

Yes, it's doable, and yes, it obviously makes for a simple driver thing, 
but no, I don't think people will consider it acceptable to lose that much 
of their effective size of the disk.

I suspect people would scream even with a 8kB sector.

Treating all writes as read-modify-write cycles on a driver level (and 
then opportunistically avoiding the read part when you are lucky and see 
bigger contiguous writes) is likely more acceptable. But it _will_ suck 
dick from a performance angle, because no regular filesystem will care 
enough, so even with nicely behaved big writes, the two end-points will 
have a fairly high chance of requiring a rmw cycle.

Even the journaling ones that might have nice logging write behavior tend 
to have a non-logging part that then will behave badly. Rather few 
filesystems are _purely_ log-based, and the ones that are tend to have 
various limitations. Most commonly read performance just sucks.

We just merged nilfs2, and I _think_ that one is a pure logging filesystem 
with just linear writes (within a segment). But I think random read 
performance (think: loading executables off the disk) is bad.

And people tend to really dislike hardware that forces a particular 
filesystem on them. Guess how big the user base is going to be if you 
cannot format the device as NTFS, for example? Hint: if a piece of 
hardware only works well with special filesystems, that piece of hardware 
won't be a big seller. 

Modern technology needs big volume to become cheap and relevant.

And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let 
me doubt that.

			Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 23:25       ` Alan Cox
@ 2009-04-11 23:51         ` Jeff Garzik
  2009-04-12  0:49           ` Linus Torvalds
  2009-04-12  1:15         ` david
  2009-04-12 14:23         ` Mark Lord
  2 siblings, 1 reply; 44+ messages in thread
From: Jeff Garzik @ 2009-04-11 23:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Grant Grundler, Linus Torvalds, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven

Alan Cox wrote:
>> We've abstracted the DMA mapping/SG list handling enough that the
>> block size should make no more difference than it does for the
>> MTU size of a network.
> 
> You need to start managing groups of pages in the vm and keeping them
> together and writing them out together and paging them together even if
> one of them is dirty and the other isn't. You have to deal with cases
> where a process forks and the two pages are dirtied one in each but still
> have to be written together.
> 
> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).

Or just ignore the extra length, thereby excising the 'read-modify' 
step...  Total storage is halved or worse, but you don't take as much of 
a performance hit.

	Jeff




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 21:49     ` Grant Grundler
  2009-04-11 22:33       ` Linus Torvalds
@ 2009-04-11 23:25       ` Alan Cox
  2009-04-11 23:51         ` Jeff Garzik
                           ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Alan Cox @ 2009-04-11 23:25 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Linus Torvalds, Jeff Garzik, Linux IDE mailing list, LKML,
	Jens Axboe, Arjan van de Ven

> We've abstracted the DMA mapping/SG list handling enough that the
> block size should make no more difference than it does for the
> MTU size of a network.

You need to start managing groups of pages in the vm and keeping them
together and writing them out together and paging them together even if
one of them is dirty and the other isn't. You have to deal with cases
where a process forks and the two pages are dirtied one in each but still
have to be written together.

Alternatively you go for read-modify-write (nasty performance hit
especially for RAID or a log structured fs).

Yes you can do it but it sure won't be pretty with a conventional fs.
Some of the log structured file systems have no problems with this and
some kinds of journalling can help but for a typical block file system
it'll suck.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 21:49     ` Grant Grundler
@ 2009-04-11 22:33       ` Linus Torvalds
  2009-04-11 23:25       ` Alan Cox
  1 sibling, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-11 22:33 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Alan Cox, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven



On Sat, 11 Apr 2009, Grant Grundler wrote:
> 
> Why does it matter what the sector size is?
> I'm failing to see what the fuss is about.
> 
> We've abstracted the DMA mapping/SG list handling enough that the
> block size should make no more difference than it does for the
> MTU size of a network.

The VM is not ready or willing to do more than 4kB pages for any normal 
caching scheme.

> And the linux VM does handle bigger than 4k pages (several architectures
> have implemented it) - even if x86 only supports 4k as base page size.

4k is not just the "supported" base page size, it's the only sane one. 
Bigger pages waste memory like mad on any normal load due to 
fragmentation. Only basically single-purpose servers are worth doing 
bigger pages for.

> Block size just defines the granularity of the device's address space in
> the same way the VM base page size defines the Virtual address space.

.. and the point is, if you have granularity that is bigger than 4kB, you 
lose binary compatibility on x86, for example. The 4kB thing is encoded in 
mmap() semantics.

In other words, if you have sector size >4kB, your hardware is CRAP. It's 
unusable sh*t. No ifs, buts or maybe's about it.

Sure, we can work around it. We can work around it by doing things like 
read-modify-write cycles with bounce buffers (and where DMA remapping can 
be used to avoid the copy). Or we can work around it by saying that if you 
mmap files on such a filesystem, your mmaps will have to have 8kB 
alignment semantics, and the hardware is only useful for servers.

Or we can just tell people what a total piece of shit the hardware is.

So if you're involved with any such hardware or know people who are, you 
might give people strong hints that sector sizes >4kB will not be taken 
seriously by a huge number of people. Maybe it's not too late to head the 
crap off at the pass.

Btw, this is not a new issue. Sandisk and some other totally clueless SSD 
manufacturers tried to convince people that 64kB access sizes were the 
RightThing(tm) to do. The reason? Their SSD's were crap, and couldn't do 
anything better, so they tried to blame software.

Then Intel came out with their controller, and now the same people who 
tried to sell their sh*t-for-brain SSD's are finally admitting that 
it was crap hardware.

Do you really want to go through that one more time?

			Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 19:52   ` Linus Torvalds
  2009-04-11 20:21     ` Jeff Garzik
@ 2009-04-11 21:49     ` Grant Grundler
  2009-04-11 22:33       ` Linus Torvalds
  2009-04-11 23:25       ` Alan Cox
  1 sibling, 2 replies; 44+ messages in thread
From: Grant Grundler @ 2009-04-11 21:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven

On Sat, Apr 11, 2009 at 12:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Sat, 11 Apr 2009, Alan Cox wrote:
>>
>> >       The spec describes the sector size as
>> >       "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
>> >       "etc" territory.
>>
>> Over 4K will be fun.
>
> And by "fun", you mean "irrelevant".
>
> If anybody does that, they'll simply not work. And it's not worth it even
> trying to handle it.

Why does it matter what the sector size is?
I'm failing to see what the fuss is about.

We've abstracted the DMA mapping/SG list handling enough that the
block size should make no more difference than it does for the
MTU size of a network.

And the linux VM does handle bigger than 4k pages (several architectures
have implemented it) - even if x86 only supports 4k as base page size.

Block size just defines the granularity of the device's address space in
the same way the VM base page size defines the Virtual address space.

> That said, I'm pretty certain Windows has the same 4k issue, so we can
> hope nobody will ever do that kind of idiotically broken hardware. Of
> course, hardware people often do incredibly stupid things, so no
> guarantees.

That's just flame-bait. Not touching that.

thanks,
grant

>
>                Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 21:08     ` John Stoffel
@ 2009-04-11 21:31       ` John Stoffel
  0 siblings, 0 replies; 44+ messages in thread
From: John Stoffel @ 2009-04-11 21:31 UTC (permalink / raw)
  To: John Stoffel
  Cc: Jeff Garzik, Alan Cox, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven, Linus Torvalds

>>>>> "John" == John Stoffel <john@stoffel.org> writes:

>>>>> "Jeff" == Jeff Garzik <jeff@garzik.org> writes:
Jeff> Alan Cox wrote:

>>>> With a brand new command set, might as well avoid SCSI completely
>>>> IMO, and create a brand new block device.
>>> 
>>> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)

Jeff> Perhaps...  from what I can tell, this is a direct, asynchronous
Jeff> NVM interface.  It appears to lack any concept of bus or bus
Jeff> enumeration.  No worries about link up/down, storage device
Jeff> hotplug, etc.  (you still have PCI hotplug case, of course)

John> Didn't we just spend years merging the old IDE PATA block devices into
John> the libata/scsi block device setup to get a more unified userspace and
John> to share common code?  

John> I'm a total ignoramus here, but it would seem that it would be nice
John> to keep the /dev/sd# stuff around for this, esp since it is supported
John> through/with/around AHCI and libata stuff.

John> Honestly, I don't care as long as userspace isn't too affected and I
John> can just format it using ext3.  :]  Which I realize would be silly
John> since it's probably nothing like regular disk access, but more like
John> the NVRAM used on Netapps for caching writes to disk so they can be
John> acknowledged quicker to the clients.  Or like the old PrestoServe
John> NVRAM modules on DECsystems and Alphas.  

And actually spending some thought on this, I'm thinking that this
will be like the MTD block device and such... separate specialized
block devices, but still usable.  So maybe I'll just shut up now.  :]

John

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 19:54   ` Jeff Garzik
@ 2009-04-11 21:08     ` John Stoffel
  2009-04-11 21:31       ` John Stoffel
  0 siblings, 1 reply; 44+ messages in thread
From: John Stoffel @ 2009-04-11 21:08 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alan Cox, Linux IDE mailing list, LKML, Jens Axboe,
	Arjan van de Ven, Linus Torvalds

>>>>> "Jeff" == Jeff Garzik <jeff@garzik.org> writes:

Jeff> Alan Cox wrote:

>>> With a brand new command set, might as well avoid SCSI completely
>>> IMO, and create a brand new block device.
>> 
>> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)

Jeff> Perhaps...  from what I can tell, this is a direct, asynchronous
Jeff> NVM interface.  It appears to lack any concept of bus or bus
Jeff> enumeration.  No worries about link up/down, storage device
Jeff> hotplug, etc.  (you still have PCI hotplug case, of course)

Didn't we just spend years merging the old IDE PATA block devices into
the libata/scsi block device setup to get a more unified userspace and
to share common code?  

I'm a total ignoramus here, but it would seem that it would be nice
to keep the /dev/sd# stuff around for this, esp since it is supported
through/with/around AHCI and libata stuff.

Honestly, I don't care as long as userspace isn't too affected and I
can just format it using ext3.  :]  Which I realize would be silly
since it's probably nothing like regular disk access, but more like
the NVRAM used on Netapps for caching writes to disk so they can be
acknowledged quicker to the clients.  Or like the old PrestoServe
NVRAM modules on DECsystems and Alphas.  

John

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 19:52   ` Linus Torvalds
@ 2009-04-11 20:21     ` Jeff Garzik
  2009-04-11 21:49     ` Grant Grundler
  1 sibling, 0 replies; 44+ messages in thread
From: Jeff Garzik @ 2009-04-11 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven

Linus Torvalds wrote:
> 
> On Sat, 11 Apr 2009, Alan Cox wrote:
>>> 	  The spec describes the sector size as
>>> 	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
>>> 	  "etc" territory.
>> Over 4K will be fun.
> 
> And by "fun", you mean "irrelevant".
> 
> If anybody does that, they'll simply not work. And it's not worth it even 
> trying to handle it.

FSVO trying to handle...

At the driver level, it would be easy to clamp sector size to 4k, and 
point the scatterlist to a zero-filled region for the >4k portion of 
each sector.  Inefficient, sure, but it is low-cost to the driver and 
gives the user something other than a brick.

	if (too_large_sector_size)
		nvmhci_fill_sg_clamped_interleave();
	else
		nvmhci_fill_sg();

Regards,

	Jeff




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 19:32 ` Alan Cox
  2009-04-11 19:52   ` Linus Torvalds
@ 2009-04-11 19:54   ` Jeff Garzik
  2009-04-11 21:08     ` John Stoffel
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff Garzik @ 2009-04-11 19:54 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven,
	Linus Torvalds

Alan Cox wrote:
>> 	  The spec describes the sector size as
>> 	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
>> 	  "etc" territory.
> 
> Over 4K will be fun.
> 
>> - ahci.c becomes a tiny stub with a pci_device_id match table,
>>    calling functions in libahci.c.
> 
> It needs to be a little bit bigger because of the folks wanting to do
> non-PCI AHCI, so you need a little bit of PCI wrapping etc.

True...


>>    With a brand new command set, might as well avoid SCSI completely IMO,
>>    and create a brand new block device.
> 
> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)

Perhaps...  from what I can tell, this is a direct, asynchronous NVM 
interface.  It appears to lack any concept of bus or bus enumeration. 
No worries about link up/down, storage device hotplug, etc.  (you still 
have PCI hotplug case, of course)

	Jeff




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 19:32 ` Alan Cox
@ 2009-04-11 19:52   ` Linus Torvalds
  2009-04-11 20:21     ` Jeff Garzik
  2009-04-11 21:49     ` Grant Grundler
  2009-04-11 19:54   ` Jeff Garzik
  1 sibling, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2009-04-11 19:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven



On Sat, 11 Apr 2009, Alan Cox wrote:
>
> > 	  The spec describes the sector size as
> > 	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
> > 	  "etc" territory.
> 
> Over 4K will be fun.

And by "fun", you mean "irrelevant".

If anybody does that, they'll simply not work. And it's not worth it even 
trying to handle it.

That said, I'm pretty certain Windows has the same 4k issue, so we can 
hope nobody will ever do that kind of idiotically broken hardware. Of 
course, hardware people often do incredibly stupid things, so no 
guarantees.

		Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Implementing NVMHCI...
  2009-04-11 17:33 Jeff Garzik
@ 2009-04-11 19:32 ` Alan Cox
  2009-04-11 19:52   ` Linus Torvalds
  2009-04-11 19:54   ` Jeff Garzik
  0 siblings, 2 replies; 44+ messages in thread
From: Alan Cox @ 2009-04-11 19:32 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven,
	Linus Torvalds

> 	  The spec describes the sector size as
> 	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
> 	  "etc" territory.

Over 4K will be fun.

> - ahci.c becomes a tiny stub with a pci_device_id match table,
>    calling functions in libahci.c.

It needs to be a little bit bigger because of the folks wanting to do
non-PCI AHCI, so you need a little bit of PCI wrapping etc.

>    With a brand new command set, might as well avoid SCSI completely IMO,
>    and create a brand new block device.

Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)


Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Implementing NVMHCI...
@ 2009-04-11 17:33 Jeff Garzik
  2009-04-11 19:32 ` Alan Cox
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff Garzik @ 2009-04-11 17:33 UTC (permalink / raw)
  To: Linux IDE mailing list; +Cc: LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds


Has anybody looked into working on NVMHCI support?  It is a new 
controller + new command set for direct interaction with non-volatile 
memory devices:

	http://download.intel.com/standards/nvmhci/spec.pdf

Although NVMHCI is nice from a hardware design perspective, it is a bit 
problematic for Linux because

	* NVMHCI might be implemented as part of an AHCI controller's
	  register set, much like how Marvell's AHCI clones implement
	  a PATA port: with wholly different per-port registers
	  and DMA data structures, buried inside the standard AHCI
	  per-port interrupt dispatch mechanism.

	  Or, NVMHCI might be implemented as its own PCI device,
	  wholly independent from the AHCI PCI device.

	  The per-port registers and DMA data structures remain the same,
	  whether or not it is embedded within AHCI.

	* NVMHCI introduces a brand new command set, completely
	  incompatible with ATA or SCSI.  Presumably it is tuned
	  specifically for non-volatile memory.

	* The sector size can vary wildly from device to device.  There
	  is no 512-byte legacy to deal with, for a brand new
	  command set.  We should handle this OK, but......  who knows
	  until you try.

	  The spec describes the sector size as
	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
	  "etc" territory.

Here is my initial idea:

- Move 95% of ahci.c into libahci.c.

   This will make implementation of AHCI-and-more devices like
   NVMHCI (AHCI 1.3) and Marvell much easier, while avoiding
   the cost of NVMHCI or Marvell support, for those users without
   such hardware.

- ahci.c becomes a tiny stub with a pci_device_id match table,
   calling functions in libahci.c.  (see the rough sketch after this list)

- I can move my libata-dev.git#mv-ahci-pata work, recently refreshed,
   into mv-ahci.c.

- nvmhci.c implements the NVMHCI controller standard.  Maybe referenced
   from ahci.c, or used standalone.

- nvmhci-blk.c implements a block device for NVMHCI-attached devices,
   using the new NVMHCI command set.

   With a brand new command set, might as well avoid SCSI completely IMO,
   and create a brand new block device.

Open questions are...

1) When will we see hardware?  This is a feature newly introduced in
    AHCI 1.3.  AHCI 1.3 spec is public, but I have not seen any machines
    yet. http://download.intel.com/technology/serialata/pdf/rev1_3.pdf

    My ICH10 box uses AHCI 1.2.  dmesg | grep '^ahci'

> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems 


2) Has anyone else started working on this?  All relevant specs are 
public on intel.com.


3) Are there major objections to doing this as a native block device (as 
opposed to faking SCSI, for example...) ?


Thanks,

	Jeff (engaging in some light Saturday reading...)





^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2009-04-30 23:38 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20090412091228.GA29937@elte.hu>
2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits
2009-04-12 15:20   ` Alan Cox
2009-04-12 16:15     ` Avi Kivity
2009-04-12 17:11       ` Linus Torvalds
2009-04-13  6:32         ` Avi Kivity
2009-04-13 15:10           ` Linus Torvalds
2009-04-13 15:38             ` James Bottomley
2009-04-14  7:22             ` Andi Kleen
2009-04-14 10:07               ` Avi Kivity
2009-04-14  9:59             ` Avi Kivity
2009-04-14 10:23               ` Jeff Garzik
2009-04-14 10:37                 ` Avi Kivity
2009-04-14 11:45                   ` Jeff Garzik
2009-04-14 11:58                     ` Szabolcs Szakacsits
2009-04-17 22:45                       ` H. Peter Anvin
2009-04-14 12:08                     ` Avi Kivity
2009-04-14 12:21                       ` Jeff Garzik
2009-04-25  8:26                 ` Pavel Machek
2009-04-12 15:41   ` Linus Torvalds
2009-04-12 17:02     ` Robert Hancock
2009-04-12 17:20       ` Linus Torvalds
2009-04-12 18:35         ` Robert Hancock
2009-04-13 11:18         ` Avi Kivity
2009-04-12 17:23     ` James Bottomley
     [not found]     ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>
2009-04-15  6:37       ` Artem Bityutskiy
2009-04-30 22:51         ` Jörn Engel
2009-04-30 23:36           ` Jeff Garzik
2009-04-11 17:33 Jeff Garzik
2009-04-11 19:32 ` Alan Cox
2009-04-11 19:52   ` Linus Torvalds
2009-04-11 20:21     ` Jeff Garzik
2009-04-11 21:49     ` Grant Grundler
2009-04-11 22:33       ` Linus Torvalds
2009-04-11 23:25       ` Alan Cox
2009-04-11 23:51         ` Jeff Garzik
2009-04-12  0:49           ` Linus Torvalds
2009-04-12  1:59             ` Jeff Garzik
2009-04-12  1:15         ` david
2009-04-12  3:13           ` Linus Torvalds
2009-04-12 14:23         ` Mark Lord
2009-04-12 17:29           ` Jeff Garzik
2009-04-11 19:54   ` Jeff Garzik
2009-04-11 21:08     ` John Stoffel
2009-04-11 21:31       ` John Stoffel
