* Re: ATA 4 KiB sector issues.
@ 2010-03-16 22:21 H. Peter Anvin
  2010-03-17 15:08 ` Ric Wheeler
  0 siblings, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-16 22:21 UTC (permalink / raw)
  To: Ric Wheeler, Tejun Heo
  Cc: James Bottomley, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth

The only reason I see to care about CHS at all is that there are systems in the field which can only boot from USB in CHS mode, and which often look at the MBR partition table to guess the geometry.  Of course, some then *report* the detected geometry but don't *use* the detected geometry...

"Ric Wheeler" <rwheeler@redhat.com> wrote:

>On 03/16/2010 11:37 AM, Tejun Heo wrote:
>> Hello,
>>
>> On 03/17/2010 12:23 AM, James Bottomley wrote:
>>    
>>>> So, using custom geometry doesn't help compatibility at all.
>>>>        
>>> Our partitioning tools still obey the integral cylinder rule ... we can
>>> argue about whether they should, but what we need is a strategy for
>>> fixing what is rather than what should be.
>>>      
>> The updated ones don't anymore.  They just align to 1MiB + whatever
>> the drive requests for offset (the offset-by-one thing).  They will
>> basically behave the same as windows vista/7 ones, so it's already
>> fixed.  What we can argue is whether adding CHS tricks on top to make
>> those larger alignments somewhat meaningful w/ CHS interpretation too,
>> which I'm objecting on the ground that it doesn't help compatibility
>> at all.
>>
>> Thanks.
>>
>>    
>
>Dropping any mention of CHS seems to be the only sensible thing. Why 
>waste any time to continue some myth about drives that no modern 
>hardware supports (and then have the joy of explaining that to users)?
>
>Talking about it only confuses people and in the worst case, could cause 
>them to misalign their partitions by clinging to these pretend borders :-)
>
>ric
>

--
Sent from my mobile phone, pardon any lack of formatting.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 22:21 ATA 4 KiB sector issues H. Peter Anvin
@ 2010-03-17 15:08 ` Ric Wheeler
  2010-03-17 17:13   ` H. Peter Anvin
  0 siblings, 1 reply; 155+ messages in thread
From: Ric Wheeler @ 2010-03-17 15:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ric Wheeler, Tejun Heo, James Bottomley, Denys Vlasenko,
	Arnd Bergmann, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth

On 03/16/2010 06:21 PM, H. Peter Anvin wrote:
> The only reason I see to care about CHS at all is that there are systems in the field which can only boot from USB in CHS mode, and which often look at the MBR partition table to guess the geometry.  Of course, some then *report* the detected geometry but don't *use* the detected geometry...
>
>    

These systems, given the changes in modern Microsoft releases, must be
doomed even without any effort on our part.

I still think that we should work to make this ancient stuff disappear & 
help force the legacy edge cases to modernize. It has been some huge 
amount of time since storage vendors pretty much abandoned CHS (15 
years? 20?) :-)

ric


* Re: ATA 4 KiB sector issues.
  2010-03-17 15:08 ` Ric Wheeler
@ 2010-03-17 17:13   ` H. Peter Anvin
  0 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-17 17:13 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Tejun Heo, James Bottomley, Denys Vlasenko, Arnd Bergmann,
	linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth

On 03/17/2010 08:08 AM, Ric Wheeler wrote:
> On 03/16/2010 06:21 PM, H. Peter Anvin wrote:
>> The only reason I see to care about CHS at all is that there are
>> systems in the field which can only boot from USB in CHS mode, and
>> which often look at the MBR partition table to guess the geometry. Of
>> course, some then *report* the detected geometry but don't *use* the
>> detected geometry...
>
> These systems, given the changes in modern microsoft releases, must be
> doomed even without any effort on our part.
>
> I still think that we should work to make this ancient stuff disappear &
> help force the legacy edge cases to modernize. It has been some huge
> amount of time since storage vendors pretty much abandoned CHS (15
> years? 20?) :-)
>

I wish.  This is mostly systems from the first half of the 2000s.
There was a *huge* regression when BIOS vendors started doing USB
boot; almost all of them introduced major bugs at that time, and in
many cases they still haven't been fixed.

	-hpa

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:20                     ` Tejun Heo
  2010-03-16 15:22                         ` Martin K. Petersen
  2010-03-16 15:23                       ` James Bottomley
@ 2010-03-17 17:04                       ` Bill Davidsen
  2 siblings, 0 replies; 155+ messages in thread
From: Bill Davidsen @ 2010-03-17 17:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ide

Tejun Heo wrote:
> On 03/17/2010 12:02 AM, James Bottomley wrote:
>> On Tue, 2010-03-16 at 23:50 +0900, Tejun Heo wrote:
>>> e.g.  If the first partition begins at CHS 0/32/33 and ends at
>>> 12/233/19 and the corresponding LBA addresses are 2048 and 206848, you
>>> can solve the equation and determine that the parameters gotta be 63
>>> secs/trk and 255 heads/cyl to make those two pairs of addresses match
>>> each other and in fact some BIOSs try to do this depending on
>>> configuration (and sometimes falls into infinite loop or causes other
>>> boot related problems if the parameters are too uncommon).
>> for an msdos label, this is illegal, that was Arnd's point.  The
>> partitions have to begin and end on cylinder boundaries*.  Knowing that,
>> you can deduce the geometry from the last sector entry.
>>
>> * at least if you want to preserve windows compatibility, which is what
>> most of our partitioning tools seem to do.
> 
> Well, the thing is that
> 
> * Anything remotely modern (>= XP) doesn't give a hoot about cylinder
>   alignment.
> 
> * Anything older (<= 2000) is very likely to get confused with custom
>   geometry starting from the BIOS itself.  For those cases, the only
>   thing we can do is aligning partitions to cylinders abiding BIOS
>   supplied geometry parameters which will usually be 255/63.
> 
> So, using custom geometry doesn't help compatibility at all.
> 
I think you hit on the real culprit and ignored it: it seems that even modern
BIOS implementations, at least some of them, do not want to cross a cylinder
boundary during boot. Or maybe that's dumb MBR code, which at least has the
excuse of being size limited.

I did try using 48-sector geometry on a virtual drive, and it seems as though
both Linux and XP will install. Then I tried it on a USB stick, and the BIOS in
several old Asus laptops will boot that.

I cautiously suggest that since nothing past boot uses CHS, and using 48 spt
seems to work and gives correct alignment, perhaps there is value in custom
geometry.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

* Re: ATA 4 KiB sector issues.
  2010-03-17  3:44                           ` Tejun Heo
@ 2010-03-17  8:01                             ` jdow
  0 siblings, 0 replies; 155+ messages in thread
From: jdow @ 2010-03-17  8:01 UTC (permalink / raw)
  To: Tejun Heo, Kevin Easton; +Cc: James Bottomley, linux-kernel

From: "Tejun Heo" <tj@kernel.org>
Sent: Tuesday, 2010/March/16 20:44


> Hello,
> 
> On 03/17/2010 11:51 AM, Kevin Easton wrote:
>> Can't we fix the problem by defaulting to aligning partitions to
>> start on an LBA that is a multiple of 64260 ?
>> 
>> Such partitions will always be 4KiB-aligned, *and* start-of-cylinder
>> aligned (assuming 255/63, as seems to be the norm).
>> 
>> Sure, that reduces your partition granularity to almost-32-MiB, but
>> that's pretty small potatoes these days (and it's only a *default*, so
>> you could always override that if you really cared, and didn't need 
>> the compatibility).
> 
> The only thing we can gain by that is possible compatibility w/ very
> old operating systems (<=w2k, BTW, it would be great if someone can
> actually test it).  Plus, breaking the first cylinder assumption might
> not be always safe to begin with.  I personally don't think it's
> something worth departing from the behavior most vendors would assume
> from now on (1MiB alignment).  It should be enough and safer to
> provide a mechanism to choose legacy alignment if someone is trying to
> put something which is older than a decade there.
> 
> Thanks.
> 
> -- 
> tejun

WRT very old filesystems - it won't affect Amiga partition tables or
the Amiga FFS. It already understands large block sizes natively. And
that's MY definition of "old" with "very" in front of it.

{^_^}   Joanne Dow

* Re: ATA 4 KiB sector issues.
  2010-03-16 13:24           ` James Bottomley
  2010-03-16 13:56             ` Tejun Heo
@ 2010-03-17  6:48             ` H. Peter Anvin
  1 sibling, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-17  6:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On 03/16/2010 06:24 AM, James Bottomley wrote:
>
> Because the msdos label can only partition in units of cylinders.  If
> you're using an msdos label, picking the right H/S gets you alignment.
>

This is doubly false.

First, an MS-DOS partition table can partition at any boundary.  Some
OSes (like some versions of MS-DOS) needed track alignment because their
boot loaders did not support crossing track boundaries.

Second, the primary field in the (modern) MS-DOS partition table is an
LBA field.  The CHS fields are largely historic and useless because of
the 1024-cylinder limitation and because they are only 24 bits in total.
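
The size of that limit is easy to work out.  What follows is a minimal
sketch, not taken from any tool in this thread.  It assumes 512-byte
logical sectors and the usual split of the 24 bits into 10 cylinder bits,
8 head bits and 6 sector bits; the function name is made up for
illustration.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def chs_capacity_bytes(cylinders=1024, heads=255, sectors_per_track=63):
    """Largest capacity addressable through the CHS fields of an MBR entry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


print(chs_capacity_bytes())           # 8422686720 bytes with 255 heads
print(chs_capacity_bytes(heads=256))  # 8455716864 bytes with 256 heads
```

Either way the ceiling is a little over 8 GB, which is why the CHS
fields cannot describe any sector beyond that point.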

	-hpa

* Re: ATA 4 KiB sector issues.
  2010-03-17  2:51                         ` Kevin Easton
@ 2010-03-17  3:44                           ` Tejun Heo
  2010-03-17  8:01                             ` jdow
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-17  3:44 UTC (permalink / raw)
  To: Kevin Easton; +Cc: James Bottomley, linux-kernel

Hello,

On 03/17/2010 11:51 AM, Kevin Easton wrote:
> Can't we fix the problem by defaulting to aligning partitions to
> start on an LBA that is a multiple of 64260 ?
> 
> Such partitions will always be 4KiB-aligned, *and* start-of-cylinder
> aligned (assuming 255/63, as seems to be the norm).
> 
> Sure, that reduces your partition granularity to almost-32-MiB, but
> that's pretty small potatoes these days (and it's only a *default*, so
> you could always override that if you really cared, and didn't need 
> the compatibility).

The only thing we can gain by that is possible compatibility w/ very
old operating systems (<=w2k, BTW, it would be great if someone can
actually test it).  Plus, breaking the first cylinder assumption might
not be always safe to begin with.  I personally don't think it's
something worth departing from the behavior most vendors would assume
from now on (1MiB alignment).  It should be enough and safer to
provide a mechanism to choose legacy alignment if someone is trying to
put something which is older than a decade there.

Thanks.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:23                       ` James Bottomley
  2010-03-16 15:37                         ` Tejun Heo
@ 2010-03-17  2:51                         ` Kevin Easton
  2010-03-17  3:44                           ` Tejun Heo
  1 sibling, 1 reply; 155+ messages in thread
From: Kevin Easton @ 2010-03-17  2:51 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, Tejun Heo

James Bottomley wrote:
> On Wed, 2010-03-17 at 00:20 +0900, Tejun Heo wrote:

...

> > Well, the thing is that
> > 
> > * Anything remotely modern (>= XP) doesn't give a hoot about cylinder
> >   alignment.
> > 
> > * Anything older (<= 2000) is very likely to get confused with custom
> >   geometry starting from the BIOS itself.  For those cases, the only
> >   thing we can do is aligning partitions to cylinders abiding BIOS
> >   supplied geometry parameters which will usually be 255/63.
> > 
> > So, using custom geometry doesn't help compatibility at all.
> 
> Our partitioning tools still obey the integral cylinder rule ... we can
> argue about whether they should, but what we need is a strategy for
> fixing what is rather than what should be.

James / Tejun,

Can't we fix the problem by defaulting to aligning partitions to
start on an LBA that is a multiple of 64260 ?

Such partitions will always be 4KiB-aligned, *and* start-of-cylinder
aligned (assuming 255/63, as seems to be the norm).

Sure, that reduces your partition granularity to almost-32-MiB, but
that's pretty small potatoes these days (and it's only a *default*, so
you could always override that if you really cared, and didn't need 
the compatibility).
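
The arithmetic behind this proposal can be checked quickly.  The sketch
below assumes 512-byte logical sectors, a 255/63 geometry and a drive
with no alignment offset; the helper names are made up for illustration.

```python
from math import gcd

HEADS, SECTORS_PER_TRACK = 255, 63   # assumed BIOS geometry
SECTORS_PER_4K = 4096 // 512         # 8 logical sectors per 4 KiB

CYLINDER = HEADS * SECTORS_PER_TRACK  # 16065 sectors per cylinder


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def is_4k_aligned(lba):
    return lba % SECTORS_PER_4K == 0


def is_cylinder_aligned(lba):
    return lba % CYLINDER == 0


# Smallest step that is both cylinder-aligned and 4 KiB-aligned.
print(lcm(CYLINDER, SECTORS_PER_4K))                      # 128520
print(is_cylinder_aligned(64260), is_4k_aligned(64260))   # True False
```

Under these assumptions 64260 is 4 * 16065 and leaves a remainder of 4
modulo 8, so odd multiples of it are 2 KiB-aligned rather than
4 KiB-aligned.  The smallest step that meets both constraints is 128520
sectors, or roughly 62.75 MiB.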

    - Kevin


* Re: ATA 4 KiB sector issues.
  2010-03-16 15:22                         ` Martin K. Petersen
  (?)
@ 2010-03-17  2:07                         ` Tejun Heo
  -1 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-17  2:07 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare

Hello,

On 03/17/2010 12:22 AM, Martin K. Petersen wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
> 
> Tejun> * Anything remotely modern (>= XP) doesn't give a hoot about
> Tejun>   cylinder alignment.
> 
> Tejun> * Anything older (<= 2000) is very likely to get confused with
> Tejun>   custom geometry starting from the BIOS itself.  For those
> Tejun>   cases, the only thing we can do is aligning partitions to
> Tejun>   cylinders abiding BIOS supplied geometry parameters which will
> Tejun>   usually be 255/63.
> 
> Tejun> So, using custom geometry doesn't help compatibility at all.
> 
> Great reads on this topic.  Might be worth linking to:
>
> 	http://www.win.tue.nl/~aeb/partitions/partition_types.html
> 	http://www.win.tue.nl/~aeb/linux/largedisk.html

Thanks for the links.  I'll read and link them.  BTW, if you can spot
something wrong regarding this in the doc, please let me know.  I'm
still learning how all this legacy stuff is supposed to work, so there
are likely some points that I got wrong.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 20:42                           ` Ric Wheeler
@ 2010-03-17  2:04                             ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-17  2:04 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth

Hello, Ric.

On 03/17/2010 05:42 AM, Ric Wheeler wrote:
> Dropping any mention of CHS seems to be the only sensible thing. Why
> waste any time to continue some myth about drives that no modern
> hardware supports (and then have the joy of explaining that to users)?
> 
> Talking about it only confuses people and in the worst case, could cause
> them to misalign their partitions by clinging to these pretend borders :-)

I don't think not mentioning it would clear up the myth.  It would
probably be a good idea to beef up the document to clear
misconceptions around disk geometry.  I'll give a shot at it.

Thanks.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:37                         ` Tejun Heo
@ 2010-03-16 20:42                           ` Ric Wheeler
  2010-03-17  2:04                             ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Ric Wheeler @ 2010-03-16 20:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth

On 03/16/2010 11:37 AM, Tejun Heo wrote:
> Hello,
>
> On 03/17/2010 12:23 AM, James Bottomley wrote:
>    
>>> So, using custom geometry doesn't help compatibility at all.
>>>        
>> Our partitioning tools still obey the integral cylinder rule ... we can
>> argue about whether they should, but what we need is a strategy for
>> fixing what is rather than what should be.
>>      
> The updated ones don't anymore.  They just align to 1MiB + whatever
> the drive requests for offset (the offset-by-one thing).  They will
> basically behave the same as windows vista/7 ones, so it's already
> fixed.  What we can argue is whether adding CHS tricks on top to make
> those larger alignments somewhat meaningful w/ CHS interpretation too,
> which I'm objecting on the ground that it doesn't help compatibility
> at all.
>
> Thanks.
>
>    

Dropping any mention of CHS seems to be the only sensible thing. Why 
waste any time to continue some myth about drives that no modern 
hardware supports (and then have the joy of explaining that to users)?

Talking about it only confuses people and in the worst case, could cause 
them to misalign their partitions by clinging to these pretend borders :-)

ric


* Re: ATA 4 KiB sector issues.
  2010-03-16 15:25                   ` Denys Vlasenko
@ 2010-03-16 15:47                     ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 15:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: James Bottomley, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello,

On 03/17/2010 12:25 AM, Denys Vlasenko wrote:
>> C/H/S of 1023/254/63 is a special marker indicating the value there is
>> out-of-range.
> 
> You misunderstood my ^^^ markers. I was trying to highlight
> the whole columns of "end head" and "end sector", not the
> last partition's 1023/254/63 values.
>
> In the partition table like shown above it is obvious
> that geometry is 255/63.

Oh, if you have at least one partition contained under the CHS limit,
you can definitely determine the parameters.  You need to know two
params and there are two equations.  You don't even have to consider
the alignment.
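
The "two params, two equations" point can be sketched as a brute-force
search.  This is an illustration only, not what any BIOS or
partitioning tool is known to do; it assumes the standard mapping
LBA = (C * heads + H) * spt + (S - 1), and the function names are made
up.  The sample values are the start and end of partition 1 in the
table Denys posted in this thread.

```python
def chs_to_lba(c, h, s, heads, spt):
    """Standard CHS-to-LBA mapping; sectors are numbered from 1."""
    return (c * heads + h) * spt + (s - 1)


def solve_geometry(pairs, max_heads=255, max_spt=63):
    """Return every (heads, spt) consistent with all (chs, lba) pairs."""
    matches = []
    for heads in range(1, max_heads + 1):
        for spt in range(1, max_spt + 1):
            if all(h < heads and s <= spt
                   and chs_to_lba(c, h, s, heads, spt) == lba
                   for (c, h, s), lba in pairs):
                matches.append((heads, spt))
    return matches


# Partition 1: starts at CHS 0/1/1 = LBA 63 and is 13671252 sectors long,
# so its last sector (CHS 850/254/63) is LBA 63 + 13671252 - 1.
pairs = [((0, 1, 1), 63), ((850, 254, 63), 63 + 13671252 - 1)]
print(solve_geometry(pairs))   # [(255, 63)]
```

Entries whose CHS value is the out-of-range marker 1023/254/63 carry
no geometry information and would have to be left out of `pairs`.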

>> We don't have to align to cylinders either.
> 
> If neither the start nor the end is aligned to cylinder's end
> and disk has just one partition and it's bigger than 8G,
> there is no way to determine geometry.
> 
> If everybody adopts the convention of ending the partitions
> at the cylinder end, geometry can be trivially determined by
> looking at partition end values. Sans "no of cylinders" value,
> which can be easily determined by other means.

But this is irrelevant because we don't and can't control everybody.
Actually, nobody can.  Code dealing with partition tables has
already been out there for a very long time and there's no way to
retroactively make it all agree on anything.  The only reason why we
care about CHS values at all is backward compatibility.  Going
forward, we don't need them at all.

Thanks.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:23                       ` James Bottomley
@ 2010-03-16 15:37                         ` Tejun Heo
  2010-03-16 20:42                           ` Ric Wheeler
  2010-03-17  2:51                         ` Kevin Easton
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 15:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth

Hello,

On 03/17/2010 12:23 AM, James Bottomley wrote:
>> So, using custom geometry doesn't help compatibility at all.
> 
> Our partitioning tools still obey the integral cylinder rule ... we can
> argue about whether they should, but what we need is a strategy for
> fixing what is rather than what should be.

The updated ones don't anymore.  They just align to 1MiB + whatever
the drive requests for offset (the offset-by-one thing).  They will
basically behave the same as windows vista/7 ones, so it's already
fixed.  What we can argue about is whether to add CHS tricks on top to
make those larger alignments somewhat meaningful under a CHS
interpretation too, which I'm objecting to on the grounds that it
doesn't help compatibility at all.
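
As a rough illustration of "1MiB + whatever the drive requests for
offset", here is a sketch of how a start sector could be picked.  It is
not the code of any real partitioning tool.  It assumes 512-byte logical
sectors and an alignment offset expressed in bytes, and it assumes that
an offset-by-one drive reports 3584 bytes (7 logical sectors); the
function and parameter names are made up.

```python
MIB = 1024 * 1024


def aligned_start_lba(min_lba, alignment_offset=0, logical=512,
                      granularity=MIB):
    """First LBA >= min_lba whose byte address is offset + k * granularity."""
    min_byte = min_lba * logical
    # Round (min_byte - offset) up to the next multiple of the granularity.
    k = -(-(min_byte - alignment_offset) // granularity)
    return (k * granularity + alignment_offset) // logical


print(aligned_start_lba(63))                         # 2048
print(aligned_start_lba(63, alignment_offset=3584))  # 2055
```

With no offset, the first partition lands on LBA 2048, the same start
sector used in the Windows-style example discussed elsewhere in this
thread.  With the assumed offset it shifts by seven sectors, so that it
still begins on a physical 4 KiB boundary.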

Thanks.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:12                 ` Tejun Heo
@ 2010-03-16 15:25                   ` Denys Vlasenko
  2010-03-16 15:47                     ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Denys Vlasenko @ 2010-03-16 15:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tue, Mar 16, 2010 at 4:12 PM, Tejun Heo <tj@kernel.org> wrote:
>> The "end of partition" is expected to be at the last head and sector.
>> Of course this heuristic fails if there are more than one primary
>> partition and they have differing last head and sector.
>>
>> But on most "sanely" partitioned disks they are the same:
>>
>> Disk /dev/sda: 255 heads, 63 sectors, 36481 cylinders
>>
>> Nr AF  Hd Sec  Cyl  Hd Sec  Cyl      Start       Size ID
>>  1 00   1   1    0 254  63  850         63   13671252 0b
>>  2 80   0   1  851 254  63 1023   13671315  572395950 05
>>  3 00   0   0    0   0   0    0          0          0 00
>>  4 00   0   0    0   0   0    0          0          0 00
>>  5 00   1   1  851 254  63  972         63    1959867 83
>>  6 00   1   1  973 254  63 1023         63   31246362 83
>>  7 00 254  63 1023 254  63 1023         63  195318207 83
>>  8 00 254  63 1023 254  63 1023         63  343871262 83
>>                    ^^^  ^^
>
> C/H/S of 1023/254/63 is a special marker indicating the value there is
> out-of-range.

You misunderstood my ^^^ markers. I was trying to highlight
the whole columns of "end head" and "end sector", not the
last partition's 1023/254/63 values.

In the partition table like shown above it is obvious
that geometry is 255/63.

>> Which suggests another idea how to align a partition: since there is
>> no requirement on the partition *start*, we don't have to start at
>> head1,sector1 or head0,sector1
>
> We don't have to align to cylinders either.

If neither the start nor the end is aligned to cylinder's end
and disk has just one partition and it's bigger than 8G,
there is no way to determine geometry.

If everybody adopts the convention of ending the partitions
at the cylinder end, geometry can be trivially determined by
looking at partition end values. Sans "no of cylinders" value,
which can be easily determined by other means.

>> In the example above, 1st partition might be modified to start at
>> head1,sector2, IOW, LBA 64, thus making it 32k aligned.
>>
>> As long as partition *ends* adhere to the convention of being
>> exactly at last_head,last_sector, nothing should break.
>
> That has almost nothing to do with compatibility.  Just let the
> cylinder alignment go.

Then (some) bootloaders will stop working.

-- 
vda

* Re: ATA 4 KiB sector issues.
  2010-03-16 15:20                     ` Tejun Heo
  2010-03-16 15:22                         ` Martin K. Petersen
@ 2010-03-16 15:23                       ` James Bottomley
  2010-03-16 15:37                         ` Tejun Heo
  2010-03-17  2:51                         ` Kevin Easton
  2010-03-17 17:04                       ` Bill Davidsen
  2 siblings, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-16 15:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Wed, 2010-03-17 at 00:20 +0900, Tejun Heo wrote:
> On 03/17/2010 12:02 AM, James Bottomley wrote:
> > On Tue, 2010-03-16 at 23:50 +0900, Tejun Heo wrote:
> >> e.g.  If the first partition begins at CHS 0/32/33 and ends at
> >> 12/233/19 and the corresponding LBA addresses are 2048 and 206848, you
> >> can solve the equation and determine that the parameters gotta be 63
> >> secs/trk and 255 heads/cyl to make those two pairs of addresses match
> >> each other and in fact some BIOSs try to do this depending on
> >> configuration (and sometimes falls into infinite loop or causes other
> >> boot related problems if the parameters are too uncommon).
> > 
> > for an msdos label, this is illegal, that was Arnd's point.  The
> > partitions have to begin and end on cylinder boundaries*.  Knowing that,
> > you can deduce the geometry from the last sector entry.
> >
> > * at least if you want to preserve windows compatibility, which is what
> > most of our partitioning tools seem to do.
> 
> Well, the thing is that
> 
> * Anything remotely modern (>= XP) doesn't give a hoot about cylinder
>   alignment.
> 
> * Anything older (<= 2000) is very likely to get confused with custom
>   geometry starting from the BIOS itself.  For those cases, the only
>   thing we can do is aligning partitions to cylinders abiding BIOS
>   supplied geometry parameters which will usually be 255/63.
> 
> So, using custom geometry doesn't help compatibility at all.

Our partitioning tools still obey the integral cylinder rule ... we can
argue about whether they should, but what we need is a strategy for
fixing what is rather than what should be.

James


* Re: ATA 4 KiB sector issues.
  2010-03-16 15:20                     ` Tejun Heo
@ 2010-03-16 15:22                         ` Martin K. Petersen
  2010-03-16 15:23                       ` James Bottomley
  2010-03-17 17:04                       ` Bill Davidsen
  2 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-16 15:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> * Anything remotely modern (>= XP) doesn't give a hoot about
Tejun>   cylinder alignment.

Tejun> * Anything older (<= 2000) is very likely to get confused with
Tejun>   custom geometry starting from the BIOS itself.  For those
Tejun>   cases, the only thing we can do is aligning partitions to
Tejun>   cylinders abiding BIOS supplied geometry parameters which will
Tejun>   usually be 255/63.

Tejun> So, using custom geometry doesn't help compatibility at all.

Great reads on this topic.  Might be worth linking to:

	http://www.win.tue.nl/~aeb/partitions/partition_types.html

	http://www.win.tue.nl/~aeb/linux/largedisk.html

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: ATA 4 KiB sector issues.
  2010-03-16 15:02                   ` James Bottomley
@ 2010-03-16 15:20                     ` Tejun Heo
  2010-03-16 15:22                         ` Martin K. Petersen
                                         ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 15:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On 03/17/2010 12:02 AM, James Bottomley wrote:
> On Tue, 2010-03-16 at 23:50 +0900, Tejun Heo wrote:
>> e.g.  If the first partition begins at CHS 0/32/33 and ends at
>> 12/233/19 and the corresponding LBA addresses are 2048 and 206848, you
>> can solve the equation and determine that the parameters gotta be 63
>> secs/trk and 255 heads/cyl to make those two pairs of addresses match
>> each other and in fact some BIOSs try to do this depending on
>> configuration (and sometimes falls into infinite loop or causes other
>> boot related problems if the parameters are too uncommon).
> 
> for an msdos label, this is illegal, that was Arnd's point.  The
> partitions have to begin and end on cylinder boundaries*.  Knowing that,
> you can deduce the geometry from the last sector entry.
>
> * at least if you want to preserve windows compatibility, which is what
> most of our partitioning tools seem to do.

Well, the thing is that

* Anything remotely modern (>= XP) doesn't give a hoot about cylinder
  alignment.

* Anything older (<= 2000) is very likely to get confused by custom
  geometry, starting from the BIOS itself.  For those cases, the only
  thing we can do is align partitions to cylinders according to the
  BIOS-supplied geometry parameters, which will usually be 255/63.

So, using custom geometry doesn't help compatibility at all.

Thanks.

-- 
tejun

* Re: ATA 4 KiB sector issues.
  2010-03-16 14:38               ` Denys Vlasenko
@ 2010-03-16 15:12                 ` Tejun Heo
  2010-03-16 15:25                   ` Denys Vlasenko
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 15:12 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: James Bottomley, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello,

On 03/16/2010 11:38 PM, Denys Vlasenko wrote:
> The "end of partition" is expected to be at the last head and sector.
> Of course this heuristic fails if there is more than one primary
> partition and they have differing last head and sector.
> 
> But on most "sanely" partitioned disks they are the same:
> 
> Disk /dev/sda: 255 heads, 63 sectors, 36481 cylinders
> 
> Nr AF  Hd Sec  Cyl  Hd Sec  Cyl      Start       Size ID
>  1 00   1   1    0 254  63  850         63   13671252 0b
>  2 80   0   1  851 254  63 1023   13671315  572395950 05
>  3 00   0   0    0   0   0    0          0          0 00
>  4 00   0   0    0   0   0    0          0          0 00
>  5 00   1   1  851 254  63  972         63    1959867 83
>  6 00   1   1  973 254  63 1023         63   31246362 83
>  7 00 254  63 1023 254  63 1023         63  195318207 83
>  8 00 254  63 1023 254  63 1023         63  343871262 83
>                    ^^^  ^^

C/H/S of 1023/254/63 is a special marker indicating the value there is
out-of-range.  It doesn't actually carry any information regarding the
geometry parameters other than that the matching LBA can't be
expressed within its range.  The end marker doesn't change according
to geometry parameters.  It's fixed at 0xfeffff.
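
As a rough illustration of the marker value, here is a minimal sketch of
how a CHS tuple is packed into the three bytes of an MBR partition
entry (head, then sector with the two high cylinder bits, then the low
cylinder byte).  The function name pack_chs is made up for this
example, and the capacity figure assumes 512-byte sectors and 255/63.

```python
def pack_chs(c, h, s):
    """Pack a CHS tuple into the 3-byte form used in MBR partition entries."""
    if not (0 <= c <= 1023 and 0 <= h <= 255 and 1 <= s <= 63):
        raise ValueError("CHS value out of range for the MBR encoding")
    # byte 0: head; byte 1: sector (bits 0-5) plus cylinder bits 8-9
    # (bits 6-7); byte 2: low eight bits of the cylinder number.
    return bytes([h, ((c >> 8) << 6) | s, c & 0xFF])


# The out-of-range marker 1023/254/63 is the same bytes for any geometry.
marker = pack_chs(1023, 254, 63)
print(marker.hex())  # -> feffff

# Sectors addressable through CHS at 255/63 (assumed geometry).
max_sectors = 1024 * 255 * 63
print(max_sectors * 512 / 2**30)  # a little under 8 GiB
```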

> Which suggests another idea how to align a partition: since there is
> no requirement on the partition *start*, we don't have to start at
> head1,sector1 or head0,sector1

We don't have to align to cylinders either.

> In the example above, 1st partition might be modified to start at
> head1,sector2, IOW, LBA 64, thus making it 32k aligned.
> 
> As long as partition *ends* adhere to the convention of being
> exactly at last_head,last_sector, nothing should break.

That has almost nothing to do with compatibility.  Just let the
cylinder alignment go.  Anything remotely modern doesn't care about it
at all and anything older will puke way easier with custom geometry
massaging.  For those, we'll just have to stick with cylinder aligning
according to the BIOS supplied parameters.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 14:50                 ` Tejun Heo
@ 2010-03-16 15:02                   ` James Bottomley
  2010-03-16 15:20                     ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-03-16 15:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tue, 2010-03-16 at 23:50 +0900, Tejun Heo wrote:
> e.g.  If the first partition begins at CHS 0/32/33 and ends at
> 12/223/19 and the corresponding LBA addresses are 2048 and 206847, you
> can solve the equations and determine that the parameters gotta be 63
> secs/trk and 255 heads/cyl to make those two pairs of addresses match
> each other and in fact some BIOSs try to do this depending on
> configuration (and sometimes fall into an infinite loop or cause other
> boot related problems if the parameters are too uncommon).

for an msdos label, this is illegal, that was Arnd's point.  The
partitions have to begin and end on cylinder boundaries*.  Knowing that,
you can deduce the geometry from the last sector entry.

James

* at least if you want to preserve windows compatibility, which is what
most of our partitioning tools seem to do.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 14:21               ` James Bottomley
  2010-03-16 14:25                 ` Arnd Bergmann
@ 2010-03-16 14:50                 ` Tejun Heo
  2010-03-16 15:02                   ` James Bottomley
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 14:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello,

On 03/16/2010 11:21 PM, James Bottomley wrote:
>>> For msdos labels, it's embedded in the label ... for all other labels,
>>> it's made up on the spot.
>>
>> Where in the label?
> 
> No idea ... I only know you can use fdisk expert mode to change the
> C/H/S layout and the change is preserved across reboots.

The CHS addresses are stored alongside the LBA addresses.  The
problem is that the geometry parameters (sectors/track and heads/cyl)
are not stored anywhere and CHS addresses don't make any sense without
the two parameters.  The only way to figure out the geometry
parameters is to solve two equations involving CHS addresses and LBA
addresses.

e.g.  If the first partition begins at CHS 0/32/33 and ends at
12/223/19 and the corresponding LBA addresses are 2048 and 206847, you
can solve the equations and determine that the parameters gotta be 63
secs/trk and 255 heads/cyl to make those two pairs of addresses match
each other and in fact some BIOSs try to do this depending on
configuration (and sometimes fall into an infinite loop or cause other
boot related problems if the parameters are too uncommon).

This method can't work reliably even at a theoretical level because it
requires at least two pairs of CHS/LBA addresses to match (two unknown
parameters to solve for) and there is only a single pair available if
the first partition goes over the CHS limit, which is at most 8GiB.
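
To make the two-pair requirement concrete, here is a minimal sketch
that recovers heads/cyl and secs/trk from CHS/LBA pairs using the usual
mapping LBA = (C * heads + H) * spt + (S - 1).  The function names and
the brute-force search are choices made for this example, not what any
particular BIOS does.  The start pair is 0/32/33 at LBA 2048; the end
pair used here, 12/223/19 at LBA 206847, is picked because it is
consistent with 255/63.

```python
def chs_to_lba(c, h, s, heads, spt):
    """Standard CHS -> LBA mapping for a given heads/cyl and secs/trk."""
    return (c * heads + h) * spt + (s - 1)


def solve_geometry(pairs, max_heads=255, max_spt=63):
    """Return every (heads, spt) under which all CHS/LBA pairs agree."""
    solutions = []
    for heads in range(1, max_heads + 1):
        for spt in range(1, max_spt + 1):
            if all(
                h < heads and s <= spt
                and chs_to_lba(c, h, s, heads, spt) == lba
                for (c, h, s), lba in pairs
            ):
                solutions.append((heads, spt))
    return solutions


START = ((0, 32, 33), 2048)
END = ((12, 223, 19), 206847)

# Two pairs pin the geometry down to a single answer.
print(solve_geometry([START, END]))  # -> [(255, 63)]

# One pair (cylinder 0) fixes secs/trk but leaves heads/cyl open.
print(len(solve_geometry([START])) > 1)  # -> True
```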

So, CHS *values* are preserved if they fall below the CHS limit of the
geometry used during partitioning but the geometry information is lost,
making the CHS values completely meaningless, so the only sane thing
to do is to stick to whatever geometry parameters the BIOS provides,
which usually is 255/63 these days.  Otherwise, the results are...

* If the first partition ends before the CHS limit and BIOS is
  configured to calculate back the parameters, BIOS may be able to
  report the geometry correctly.

* If the first partition goes over the CHS limit,

  * BIOS can use 255/63 or whatever default parameters and CHS and LBA
    addresses won't match each other which won't be a problem for
    modern OSes as they don't look at the CHS addresses at all but
    older operating systems which consider both CHS and LBA addresses
    may get confused.

  * BIOS can set up arbitrary parameters such that the CHS and LBA for
    the start of the first partition match and maybe also try to
    cylinder align further LBA addresses but there is no guarantee
    these parameters match the original parameters used during
    partitioning and this seems to cause more compatibility problems
    than it solves.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 13:56             ` Tejun Heo
  2010-03-16 14:21               ` James Bottomley
@ 2010-03-16 14:38               ` Denys Vlasenko
  2010-03-16 15:12                 ` Tejun Heo
  1 sibling, 1 reply; 155+ messages in thread
From: Denys Vlasenko @ 2010-03-16 14:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tuesday 16 March 2010 14:56, Tejun Heo wrote:
> Hello, James.
> 
> On 03/16/2010 10:24 PM, James Bottomley wrote:
> > For msdos labels, it's embedded in the label ... for all other labels,
> > it's made up on the spot.
> 
> Where in the label?

The "end of partition" is expected to be at the last head and sector.
Of course this heuristic fails if there is more than one primary
partition and they have differing last head and sector.

But on most "sanely" partitioned disks they are the same:

Disk /dev/sda: 255 heads, 63 sectors, 36481 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl      Start       Size ID
 1 00   1   1    0 254  63  850         63   13671252 0b
 2 80   0   1  851 254  63 1023   13671315  572395950 05
 3 00   0   0    0   0   0    0          0          0 00
 4 00   0   0    0   0   0    0          0          0 00
 5 00   1   1  851 254  63  972         63    1959867 83
 6 00   1   1  973 254  63 1023         63   31246362 83
 7 00 254  63 1023 254  63 1023         63  195318207 83
 8 00 254  63 1023 254  63 1023         63  343871262 83
                   ^^^  ^^

Which suggests another idea how to align a partition:
since there is no requirement on the partition *start*,
we don't have to start at head1,sector1 or head0,sector1

In the example above, 1st partition might be modified to start
at head1,sector2, IOW, LBA 64, thus making it 32k aligned.

As long as partition *ends* adhere to the convention
of being exactly at last_head,last_sector, nothing should break.
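
A quick way to check what alignment a given start LBA gives is to take
the largest power of two that divides its byte offset.  This is only a
sketch assuming 512-byte logical sectors; alignment_bytes is a name
made up for this example.

```python
SECTOR = 512  # assumed logical sector size in bytes


def alignment_bytes(lba, sector=SECTOR):
    """Largest power of two (in bytes) dividing the byte offset of lba."""
    if lba <= 0:
        raise ValueError("lba must be positive")
    offset = lba * sector
    return offset & -offset  # isolates the lowest set bit


for lba in (63, 64, 2048):
    print(lba, alignment_bytes(lba))
# -> 63 512
# -> 64 32768
# -> 2048 1048576
```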
-- 
vda

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 14:21               ` James Bottomley
@ 2010-03-16 14:25                 ` Arnd Bergmann
  2010-03-16 14:50                 ` Tejun Heo
  1 sibling, 0 replies; 155+ messages in thread
From: Arnd Bergmann @ 2010-03-16 14:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, Denys Vlasenko, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tuesday 16 March 2010, James Bottomley wrote:
> On Tue, 2010-03-16 at 22:56 +0900, Tejun Heo wrote:
> > Hello, James.
> > 
> > On 03/16/2010 10:24 PM, James Bottomley wrote:
> > > For msdos labels, it's embedded in the label ... for all other labels,
> > > it's made up on the spot.
> > 
> > Where in the label?
> 
> No idea ... I only know you can use fdisk expert mode to change the
> C/H/S layout and the change is preserved across reboots.

IIRC, the layout is guessed from the partition end locations, on the
assumption that each partition is aligned to full cylinders. That
gives you the heads/sectors number, while the cylinder number can be
calculated from the total number of sectors using these numbers.
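
The heuristic described here can be sketched as follows.  This is an
illustration only, not the code of fdisk or any other tool;
guess_geometry is a hypothetical name, and the sample values are the
end CHS entries of partitions 1 and 5 and the disk size implied by the
table Denys posted in this thread.

```python
def guess_geometry(end_chs_list, total_sectors):
    """Guess (cylinders, heads, spt) from partition end CHS values.

    Assumes every partition ends on the last head and last sector of a
    cylinder, so heads = end_head + 1 and spt = end_sector.
    """
    candidates = {(h + 1, s) for (_c, h, s) in end_chs_list}
    if len(candidates) != 1:
        return None  # the ends disagree, so the heuristic gives up
    heads, spt = candidates.pop()
    return total_sectors // (heads * spt), heads, spt


ends = [(850, 254, 63), (972, 254, 63)]
print(guess_geometry(ends, 586067265))  # -> (36481, 255, 63)

# Ends that are not cylinder aligned defeat the guess.
print(guess_geometry([(850, 254, 63), (12, 223, 19)], 586067265))  # -> None
```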

	Arnd

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 13:56             ` Tejun Heo
@ 2010-03-16 14:21               ` James Bottomley
  2010-03-16 14:25                 ` Arnd Bergmann
  2010-03-16 14:50                 ` Tejun Heo
  2010-03-16 14:38               ` Denys Vlasenko
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-16 14:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tue, 2010-03-16 at 22:56 +0900, Tejun Heo wrote:
> Hello, James.
> 
> On 03/16/2010 10:24 PM, James Bottomley wrote:
> > For msdos labels, it's embedded in the label ... for all other labels,
> > it's made up on the spot.
> 
> Where in the label?

No idea ... I only know you can use fdisk expert mode to change the
C/H/S layout and the change is preserved across reboots.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16 13:24           ` James Bottomley
@ 2010-03-16 13:56             ` Tejun Heo
  2010-03-16 14:21               ` James Bottomley
  2010-03-16 14:38               ` Denys Vlasenko
  2010-03-17  6:48             ` H. Peter Anvin
  1 sibling, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16 13:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello, James.

On 03/16/2010 10:24 PM, James Bottomley wrote:
> For msdos labels, it's embedded in the label ... for all other labels,
> it's made up on the spot.

Where in the label?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  6:22         ` Tejun Heo
  2010-03-16  6:37           ` Felix Miata
@ 2010-03-16 13:24           ` James Bottomley
  2010-03-16 13:56             ` Tejun Heo
  2010-03-17  6:48             ` H. Peter Anvin
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-16 13:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tue, 2010-03-16 at 15:22 +0900, Tejun Heo wrote:
> Hello, James.
> 
> On 03/16/2010 03:14 PM, James Bottomley wrote:
> > So, it is true to say that picking a certain H/S geometry (which is
> > entirely within the gift of the partitioner) will align msdos label
> > partitions, but will be a don't care for all other labels: all other
> > partition labels (like gpt) use block as offset and don't have any truck
> > with the fictitious C/H/S stuff.
> 
> For any modern Linux and Windows, CHS simply doesn't matter.  They
> don't look at it at all.

If they have a msdos label, they do.

> > The big problem is that 99% of the x86 systems out there still use the
> > ancient msdos label for their boot disks, so aligning H/S going forwards
> > will give us a nice "just works" for x86 boxes.
> 
> What I don't get is how picking a custom geometry can make
> things work when there is *no* reliable way to determine which
> geometry was used during partitioning once the partitioning is
> complete.

For msdos labels, it's embedded in the label ... for all other labels,
it's made up on the spot.

>   Most BIOSs these days will simply report the geometry as
> being 255/63 regardless of the geometry used during partitioning.  So,
> how can using a custom geometry give that nice "just works" for x86
> boxes when nobody knows what geometry is in use?

Because the msdos label can only partition in units of cylinders.  If
you're using an msdos label, picking the right H/S gets you alignment.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  6:37           ` Felix Miata
@ 2010-03-16  6:42             ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16  6:42 UTC (permalink / raw)
  To: Felix Miata; +Cc: linux-ide

Hello,

On 03/16/2010 03:37 PM, Felix Miata wrote:
> On 2010/03/16 15:22 (GMT+0900) Tejun Heo composed:
> 
>> Most BIOSs these days will simply report the geometry as
>> being 255/63 regardless of the geometry used during partitioning
> 
> In my experience this "most" only applies to desktop motherboard BIOS. Most
> laptop BIOS seem to prefer 240, as do many external USB HD case controllers,
> and older but not that antique Compaq desktops.

Hmmm... interesting, does that hold for modern laptops too?  It
doesn't really change the situation much tho.  It's just another
arbitrary geometry that BIOS reports that we can't really do much
about.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  6:22         ` Tejun Heo
@ 2010-03-16  6:37           ` Felix Miata
  2010-03-16  6:42             ` Tejun Heo
  2010-03-16 13:24           ` James Bottomley
  1 sibling, 1 reply; 155+ messages in thread
From: Felix Miata @ 2010-03-16  6:37 UTC (permalink / raw)
  To: linux-ide

On 2010/03/16 15:22 (GMT+0900) Tejun Heo composed:

> Most BIOSs these days will simply report the geometry as
> being 255/63 regardless of the geometry used during partitioning

In my experience this "most" only applies to desktop motherboard BIOS. Most
laptop BIOS seem to prefer 240, as do many external USB HD case controllers,
and older but not that antique Compaq desktops.
-- 
"The wise are known for their understanding, and pleasant
words are persuasive." Proverbs 16:21 (New Living Translation)

 Team OS/2 ** Reg. Linux User #211409

Felix Miata  ***  http://fm.no-ip.com/

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  6:14       ` James Bottomley
  2010-03-16  6:22         ` Tejun Heo
@ 2010-03-16  6:27         ` Thomas Chou
  1 sibling, 0 replies; 155+ messages in thread
From: Thomas Chou @ 2010-03-16  6:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, Denys Vlasenko, Arnd Bergmann, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare

On 03/16/2010 02:14 PM, James Bottomley wrote:
>
> The big problem is that 99% of the x86 systems out there still use the
> ancient msdos label for their boot disks, so aligning H/S going forwards
> will give us a nice "just works" for x86 boxes.
>
>    
The key issue is not "just work", but "performance". When unaligned, the 
write performance can be lower than 50% of the expected rate.

- Thomas

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  6:14       ` James Bottomley
@ 2010-03-16  6:22         ` Tejun Heo
  2010-03-16  6:37           ` Felix Miata
  2010-03-16 13:24           ` James Bottomley
  2010-03-16  6:27         ` Thomas Chou
  1 sibling, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16  6:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello, James.

On 03/16/2010 03:14 PM, James Bottomley wrote:
> So, it is true to say that picking a certain H/S geometry (which is
> entirely within the gift of the partitioner) will align msdos label
> partitions, but will be a don't care for all other labels: all other
> partition labels (like gpt) use block as offset and don't have any truck
> with the fictitious C/H/S stuff.

For any modern Linux and Windows, CHS simply doesn't matter.  They
don't look at it at all.

> The big problem is that 99% of the x86 systems out there still use the
> ancient msdos label for their boot disks, so aligning H/S going forwards
> will give us a nice "just works" for x86 boxes.

What I don't get is how picking a custom geometry can make
things work when there is *no* reliable way to determine which
geometry was used during partitioning once the partitioning is
complete.  Most BIOSs these days will simply report the geometry as
being 255/63 regardless of the geometry used during partitioning.  So,
how can using a custom geometry give that nice "just works" for x86
boxes when nobody knows what geometry is in use?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  2:30     ` Tejun Heo
  2010-03-16  2:32       ` Tejun Heo
@ 2010-03-16  6:14       ` James Bottomley
  2010-03-16  6:22         ` Tejun Heo
  2010-03-16  6:27         ` Thomas Chou
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-16  6:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Arnd Bergmann, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Tue, 2010-03-16 at 11:30 +0900, Tejun Heo wrote:
> Hello,
> 
> On 03/10/2010 06:14 PM, Denys Vlasenko wrote:
> > 63s/255h is more or less "standard" now.
> > 
> > Alignment issues can be solved by picking a good multiple of
> > _heads_ or _cylinders_:
> 
> I've got a couple of comments stating that picking good geometry
> parameters can resolve the whole issue but I simply fail to see how it
> could.  We can pick any parameter we wish, but there is no reliable
> way to communicate the custom geometry parameters to others.
> 
> Geometry is determined by two parameters sec/trk and heads/cyl.  You
> can punch in those numbers if the BIOS has a menu for it (many don't
> these days).  Or hope that BIOS can somehow figure it out from the
> partition table which some BIOSs actually try to do.  The problem is
> that to determine the two parameters you need to equations matching
> CHSs and LBAs and that's available iff the first partition ends before
> CHS addressing limit according to the custom geometry, which usually
> is not the case.
> 
> So, custom geometry is only useful to trick partitioners which align
> using cylinders into using better alignments but doesn't help anything
> for compatibility as no one can determine the used geometry reliably
> after the partitioning is complete.  With compatibility benefit gone,
> there simply is no reason to stick to the cylinder abstraction at all.
> 
> Am I missing something?

Sort of.  As you say, C/H/S doesn't exist for any modern disk.  However,
the msdos label, for reasons lost in the mists of time, uses cylinders
as the units of partition boundaries, so we have to invent a bogus C/H/S
geometry for that partition label.  Because of the problems with picking
C/H/S, most boot loaders take care to ensure that BIOS never cares about
it either (by using the block offset I/O routines), so for most linux
bootloaders, the BIOS problems with C/H/S are a red herring.

So, it is true to say that picking a certain H/S geometry (which is
entirely within the gift of the partitioner) will align msdos label
partitions, but will be a don't care for all other labels: all other
partition labels (like gpt) use block as offset and don't have any truck
with the fictitious C/H/S stuff.

The big problem is that 99% of the x86 systems out there still use the
ancient msdos label for their boot disks, so aligning H/S going forwards
will give us a nice "just works" for x86 boxes.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-16  2:30     ` Tejun Heo
@ 2010-03-16  2:32       ` Tejun Heo
  2010-03-16  6:14       ` James Bottomley
  1 sibling, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16  2:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Arnd Bergmann, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

Aieee... critical typo.

On 03/16/2010 11:30 AM, Tejun Heo wrote:
> partition table which some BIOSs actually try to do.  The problem is
> that to determine the two parameters you need to equations matching
                                                ^^
                                                two
> CHSs and LBAs

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  9:14   ` Denys Vlasenko
  2010-03-10 11:02     ` Felix Miata
  2010-03-15  1:21     ` H. Peter Anvin
@ 2010-03-16  2:30     ` Tejun Heo
  2010-03-16  2:32       ` Tejun Heo
  2010-03-16  6:14       ` James Bottomley
  2 siblings, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-16  2:30 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Arnd Bergmann, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

Hello,

On 03/10/2010 06:14 PM, Denys Vlasenko wrote:
> 63s/255h is more or less "standard" now.
> 
> Alignment issues can be solved by picking a good multiple of
> _heads_ or _cylinders_:

I've got a couple of comments stating that picking good geometry
parameters can resolve the whole issue but I simply fail to see how it
could.  We can pick any parameter we wish, but there is no reliable
way to communicate the custom geometry parameters to others.

Geometry is determined by two parameters sec/trk and heads/cyl.  You
can punch in those numbers if the BIOS has a menu for it (many don't
these days).  Or hope that BIOS can somehow figure it out from the
partition table which some BIOSs actually try to do.  The problem is
that to determine the two parameters you need to equations matching
CHSs and LBAs and that's available iff the first partition ends before
CHS addressing limit according to the custom geometry, which usually
is not the case.

So, custom geometry is only useful to trick partitioners which align
using cylinders into using better alignments but doesn't help anything
for compatibility as no one can determine the used geometry reliably
after the partitioning is complete.  With compatibility benefit gone,
there simply is no reason to stick to the cylinder abstraction at all.

Am I missing something?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  9:56           ` Denys Vlasenko
@ 2010-03-15 14:47             ` H. Peter Anvin
  0 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-15 14:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: david, Arnd Bergmann, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/15/2010 02:56 AM, Denys Vlasenko wrote:
> I think Linux already is doing this. The problem is, in many cases
> OS can't possibly do this, short of using a specially designed
> filesystem.
> 
> If you untar a Linux kernel source tarball on a seriously
> fragmented ext2 filesystem, there will be a lot of discontiguous
> and/or misaligned writes smaller than 256K.
> Only smart firmware can help in this case.

Yes, but guess what... there is a lot of stupid firmware out there, and
there are lots of RAID arrays, and so on.

"Seriously fragmented" means you have already lost in the first place.

This doesn't change the fact that this is a real issue and that that is
the major reason why aligning to 63*4K is a bad idea.
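
One way to put a number on this, as a sketch assuming the common 255/63
geometry and 512-byte sectors: the smallest cylinder multiple whose
start satisfies a given alignment.  cylinder_step is a name made up for
this example.

```python
from math import gcd


def cylinder_step(align_bytes, heads=255, spt=63, sector=512):
    """Smallest number of cylinders whose total size is a multiple of
    align_bytes, for the given (assumed) geometry."""
    align_sectors = align_bytes // sector
    cyl_sectors = heads * spt
    return align_sectors // gcd(align_sectors, cyl_sectors)


print(cylinder_step(4096))        # -> 8
print(cylinder_step(256 * 1024))  # -> 512
```

With 255/63 a cylinder is 16065 sectors, an odd number, so 4K alignment
needs every 8th cylinder while 256K alignment needs every 512th, which
is a partition granularity of roughly 3.9 GiB.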

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  4:00         ` H. Peter Anvin
@ 2010-03-15 12:30           ` Arnd Bergmann
  0 siblings, 0 replies; 155+ messages in thread
From: Arnd Bergmann @ 2010-03-15 12:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Denys Vlasenko, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On Monday 15 March 2010, H. Peter Anvin wrote:
> > 256K alignment is hard to swallow for a lot of reasons anyway.
> > Unless the filesystem packs small files into blocks a-la reiserfs,
> > 256K block filesystems will be very inefficient for a typical
> > storage scenarios.
>
> No one has talked about using 256K filesystem blocks.

Well, logfs has just been merged and works with block sizes in that
range, but obviously only if the partition is correctly aligned.

	Arnd

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  5:20         ` david
@ 2010-03-15  9:56           ` Denys Vlasenko
  2010-03-15 14:47             ` H. Peter Anvin
  0 siblings, 1 reply; 155+ messages in thread
From: Denys Vlasenko @ 2010-03-15  9:56 UTC (permalink / raw)
  To: david
  Cc: H. Peter Anvin, Arnd Bergmann, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Monday 15 March 2010 06:20, david@lang.hm wrote:
> >>> For any other partition, pick start cylinder which is a multiple of 8:
> >>>
> >>> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
> >>>
> >>> This will actually work well for *any* geometry, not only for 63s/255h.
> >>
> >> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
> >
> > 4K makes sense. 256K not so much.
> >
> > 256K alignment is hard to swallow for a lot of reasons anyway.
> > Unless the filesystem packs small files into blocks a-la reiserfs,
> > 256K block filesystems will be very inefficient for a typical
> > storage scenarios.
> 
> the thing is, if the OS can learn that it's more efficient to write in 
> 256K aligned chunks, then it can batch up things so that the drive doesn't 
> have to do a read-modify-write cycle and can instead just replace the 
> entire chunk.

I think Linux already is doing this. The problem is, in many cases
OS can't possibly do this, short of using a specially designed
filesystem.

If you untar a Linux kernel source tarball on a seriously
fragmented ext2 filesystem, there will be a lot of discontiguous
and/or misaligned writes smaller than 256K.
Only smart firmware can help in this case.
-- 
vda

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  2:26       ` Denys Vlasenko
  2010-03-15  2:56         ` Greg Freemyer
  2010-03-15  4:00         ` H. Peter Anvin
@ 2010-03-15  5:20         ` david
  2010-03-15  9:56           ` Denys Vlasenko
  2 siblings, 1 reply; 155+ messages in thread
From: david @ 2010-03-15  5:20 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: H. Peter Anvin, Arnd Bergmann, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

On Mon, 15 Mar 2010, Denys Vlasenko wrote:

> On Monday 15 March 2010 02:21, H. Peter Anvin wrote:
>> On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
>>>
>>> 63s/255h is more or less "standard" now.
>>>
>>> Alignment issues can be solved by picking a good multiple of
>>> _heads_ or _cylinders_:
>>>
>>> For first partition, pick the start at 8th head:
>>>
>>> cyl 0 head 1 sector 1: LBA sector 63) - bad
>>> cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)
>>>
>>> For any other partition, pick start cylinder which is a multiple of 8:
>>>
>>> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
>>>
>>> This will actually work well for *any* geometry, not only for 63s/255h.
>>
>> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
>
> 4K makes sense. 256K not so much.
>
> 256K alignment is hard to swallow for a lot of reasons anyway.
> Unless the filesystem packs small files into blocks a-la reiserfs,
> 256K block filesystems will be very inefficient for a typical
> storage scenarios.

the thing is, if the OS can learn that it's more efficient to write in
256K aligned chunks, then it can batch up things so that the drive doesn't
have to do a read-modify-write cycle and can instead just replace the
entire chunk.

raid arrays can benefit from this as well as SSDs.

the OS can do this when writing things to swap, flushing dirty buffers,
mmaped files, etc (in fact, if the OS knows the full contents of the
chunk, it may be more efficient for the OS to write the entire thing than
to write part of it and have the drive/array do the read-modify-write
cycle)
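
As a rough sketch of the idea, a write can be split into the whole
256K chunks it covers (which the device can simply replace) and the
partial pieces at either end (which would need read-modify-write).
The 256K chunk size and the name split_write are assumptions for this
example; this is not how the Linux block layer is implemented.

```python
CHUNK = 256 * 1024  # assumed device chunk size in bytes


def split_write(offset, length, chunk=CHUNK):
    """Split a byte range into (partial_ranges, full_chunk_ranges)."""
    if length <= 0:
        raise ValueError("length must be positive")
    end = offset + length
    first_full = -(-offset // chunk) * chunk  # offset rounded up
    last_full = (end // chunk) * chunk        # end rounded down
    if first_full >= last_full:
        return [(offset, end)], []            # no whole chunk is covered
    partial = []
    if offset < first_full:
        partial.append((offset, first_full))
    if last_full < end:
        partial.append((last_full, end))
    full = [(s, s + chunk) for s in range(first_full, last_full, chunk)]
    return partial, full


# 1 MiB written at LBA 63 versus LBA 2048 (512-byte sectors assumed).
for lba in (63, 2048):
    partial, full = split_write(lba * 512, 2**20)
    print(lba, len(partial), len(full))
# -> 63 2 3
# -> 2048 0 4
```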

David Lang

> It looks like flash storage manufacturers just have to bite
> the bullet and develop smarter algorithms that combine wear
> leveling, block remapping and such and make their internal
> preference for huge continuous aligned writes nearly invisible
> from the outside - just like hard disks which do not expose
> their zoned recording, variable sector counts etc.
>
> Such algorithms aren't trivial, but they are possible.
> Whoever will incorporate them in their products,
> delivers a significantly better user experience.
>
> I just played with an Ubuntu installation on a USB stick.
> Yes, it works. Sort of. Write performance is abysmal.
> I would pay x2 or x3 for the same sized stick if it
> would perform better.
>
>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  2:26       ` Denys Vlasenko
  2010-03-15  2:56         ` Greg Freemyer
@ 2010-03-15  4:00         ` H. Peter Anvin
  2010-03-15 12:30           ` Arnd Bergmann
  2010-03-15  5:20         ` david
  2 siblings, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-15  4:00 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Arnd Bergmann, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/14/2010 07:26 PM, Denys Vlasenko wrote:
>>
>> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
> 
> 4K makes sense. 256K not so much.
> 
> 256K alignment is hard to swallow for a lot of reasons anyway.
> Unless the filesystem packs small files into blocks a-la reiserfs,
> 256K block filesystems will be very inefficient for a typical
> storage scenarios.
> 

No one has talked about using 256K filesystem blocks.  The fact of the
matter, though, is that both flash and RAID have much larger alignment
requirements than a mere 4K for optimal performance.

You might not like it, but that's the way it is.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  2:26       ` Denys Vlasenko
@ 2010-03-15  2:56         ` Greg Freemyer
  2010-03-15  4:00         ` H. Peter Anvin
  2010-03-15  5:20         ` david
  2 siblings, 0 replies; 155+ messages in thread
From: Greg Freemyer @ 2010-03-15  2:56 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: H. Peter Anvin, Arnd Bergmann, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

> I just played with an Ubuntu installation on a USB stick.
> Yes, it works. Sort of. Write performance is abysmal.
> I would pay x2 or x3 for the same sized stick if it
> would perform better.

In general USB sticks don't offer the same performance as SSDs.

You can find sticks with both USB and eSata.  I'd hope those offer
better performance.

You should read some performance reviews.  I'm sure you can find a few
sticks that are much better than what you get from a vanilla usb
stick.

Greg

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-15  1:21     ` H. Peter Anvin
@ 2010-03-15  2:26       ` Denys Vlasenko
  2010-03-15  2:56         ` Greg Freemyer
                           ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Denys Vlasenko @ 2010-03-15  2:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arnd Bergmann, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On Monday 15 March 2010 02:21, H. Peter Anvin wrote:
> On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
> >
> > 63s/255h is more or less "standard" now.
> >
> > Alignment issues can be solved by picking a good multiple of
> > _heads_ or _cylinders_:
> >
> > For first partition, pick the start at 8th head:
> >
> > cyl 0 head 1 sector 1: LBA sector 63) - bad
> > cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)
> >
> > For any other partition, pick start cylinder which is a multiple of 8:
> >
> > cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
> >
> > This will actually work well for *any* geometry, not only for 63s/255h.
> 
> Yes, but it does squat for a flash disk that wants, say, 256K alignment.

4K makes sense. 256K not so much.

256K alignment is hard to swallow for a lot of reasons anyway.
Unless the filesystem packs small files into blocks a-la reiserfs,
256K block filesystems will be very inefficient for typical
storage scenarios.

It looks like flash storage manufacturers just have to bite
the bullet and develop smarter algorithms that combine wear
leveling, block remapping and such and make their internal
preference for huge continuous aligned writes nearly invisible
from the outside - just like hard disks which do not expose
their zoned recording, variable sector counts etc.

Such algorithms aren't trivial, but they are possible.
Whoever incorporates them in their products
will deliver a significantly better user experience.

I just played with an Ubuntu installation on a USB stick.
Yes, it works. Sort of. Write performance is abysmal.
I would pay x2 or x3 for the same sized stick if it
would perform better.

-- 
vda

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  9:14   ` Denys Vlasenko
  2010-03-10 11:02     ` Felix Miata
@ 2010-03-15  1:21     ` H. Peter Anvin
  2010-03-15  2:26       ` Denys Vlasenko
  2010-03-16  2:30     ` Tejun Heo
  2 siblings, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-15  1:21 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Arnd Bergmann, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
>
> 63s/255h is more or less "standard" now.
>
> Alignment issues can be solved by picking a good multiple of
> _heads_ or _cylinders_:
>
> For first partition, pick the start at 8th head:
>
> cyl 0 head 1 sector 1: LBA sector 63) - bad
> cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)
>
> For any other partition, pick start cylinder which is a multiple of 8:
>
> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
>
> This will actually work well for *any* geometry, not only for 63s/255h.

Yes, but it does squat for a flash disk that wants, say, 256K alignment.
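
To make the arithmetic in this exchange concrete, here is a small sketch
(not part of the original mail) that converts the quoted CHS addresses to
LBA with the usual formula and checks them against both 4K and 256K
alignment. The 255 heads / 63 sectors geometry and 512-byte logical sectors
are taken from the quoted text; the function names are made up for the
example.

```python
def chs_to_lba(cyl, head, sector, heads=255, spt=63):
    """Standard CHS -> LBA conversion; sectors are numbered from 1."""
    return (cyl * heads + head) * spt + (sector - 1)


def is_aligned(lba, alignment_bytes, sector_size=512):
    """True if the byte offset of this LBA is a multiple of alignment_bytes."""
    return (lba * sector_size) % alignment_bytes == 0


if __name__ == "__main__":
    K = 1024
    for cyl, head in [(0, 1), (0, 8), (8, 0), (512, 0)]:
        lba = chs_to_lba(cyl, head, 1)
        print(f"cyl {cyl} head {head} sector 1: LBA {lba}, "
              f"4K aligned: {is_aligned(lba, 4 * K)}, "
              f"256K aligned: {is_aligned(lba, 256 * K)}")
```

With this geometry a cylinder is 255*63 = 16065 sectors, an odd number, so a
cylinder start is 4K aligned only when the cylinder number is a multiple of
8, and 256K aligned only when it is a multiple of 512.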

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-14 21:09       ` Michal Soltys
@ 2010-03-14 22:56         ` s ponnusa
  0 siblings, 0 replies; 155+ messages in thread
From: s ponnusa @ 2010-03-14 22:56 UTC (permalink / raw)
  To: Michal Soltys
  Cc: Tejun Heo, Mikael Abrahamsson, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

I have been following this thread and may be testing with
Windows XP soon. I will post the results.
-
SP

On Sun, Mar 14, 2010 at 5:09 PM, Michal Soltys <soltys@ziu.info> wrote:
> Tejun Heo wrote:
>>
>> I was thinking about testing XP booting this weekend but really want
>> to avoid it, so thanks a lot for the info.  I'll update the doc
>> accordingly but can you please enlighten me on how it works and what's
>> broken in detail?  So, XP should be fine with any alignment?
>>
>> Thanks.
>>
>
> Sorry for late reply.
>
> s/sp2/sp3 - although it shouldn't make a difference from sp2 onwards.
>
> Anyway - the tests I did were because of a weird laptop, where I shrank
> the whole win7 stuff and, having no primary partitions left to use, tested my
> usual windows xp installation that I deploy with ntfsclone. Originally that XP
> was installed from an installation disk merged with sp3 (or, as it's usually
> called in the windows world, slipstreamed). Of course, windows xp itself will
> not present any option to install itself into a logical partition in the
> usual way - but during later deployment it's not a problem to put it wherever
> one wants.
>
> It's possible that this wouldn't work, if windows were installed first from
> pre-sp2 media, and then service pack was installed (in such case, ntldr in
> C:\ is not updated afaik). It's also possible, that "brute-force" copied
> pre-sp2 or win2k to a partition made with either - a) xp sp2+'s disk manager
> or b) mkfs.ntfs and with updated most recent ntldr -  would boot as well
> (the partition requirement is due to potential differences between the code
> in bootsector, or more precisely - $Boot - first 8KiB of ntfs partition).
>
> Obvious requirements besides the above (ntldr, perhaps $Boot as well) are:
>
> - mentioned "hidden sectors" (must be manually adjusted, recent syslinux's
> chain.c32 has option to do it automatically)
> - adjusted boot.ini (to point to new partition, eventually other windowish
> stuff as necessary)
>
> As you can see, there're many "if"s and combinations here that I didn't
> test.
>
> On a related note - ironically, while I had 0 problems making it work
> through syslinux (both regular chaining and through direct ntldr loading) -
> I couldn't make win7's bootmgr (bcd, bcdedit ....) do it properly. Oh well.
>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:11     ` Tejun Heo
@ 2010-03-14 21:09       ` Michal Soltys
  2010-03-14 22:56         ` s ponnusa
  0 siblings, 1 reply; 155+ messages in thread
From: Michal Soltys @ 2010-03-14 21:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mikael Abrahamsson, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

Tejun Heo wrote:
> 
> I was thinking about testing XP booting this weekend but really want
> to avoid it, so thanks a lot for the info.  I'll update the doc
> accordingly but can you please enlighten me on how it works and what's
> broken in detail?  So, XP should be fine with any alignment?
> 
> Thanks.
> 

Sorry for late reply.

s/sp2/sp3 - although it shouldn't make a difference from sp2 onwards.

Anyway - the tests I did were because of a weird laptop, where I shrank 
the whole win7 stuff and, having no primary partitions left to use, tested 
my usual windows xp installation that I deploy with ntfsclone. Originally 
that XP was installed from an installation disk merged with sp3 (or, as 
it's usually called in the windows world, slipstreamed). Of course, 
windows xp itself will not present any option to install itself into a 
logical partition in the usual way - but during later deployment it's not 
a problem to put it wherever one wants.

It's possible that this wouldn't work, if windows were installed first 
from pre-sp2 media, and then service pack was installed (in such case, 
ntldr in C:\ is not updated afaik). It's also possible, that "brute-force" 
copied pre-sp2 or win2k to a partition made with either - a) xp sp2+'s disk 
manager or b) mkfs.ntfs and with updated most recent ntldr -  would boot as 
well (the partition requirement is due to potential differences between the code 
in bootsector, or more precisely - $Boot - first 8KiB of ntfs partition).

Obvious requirements besides the above (ntldr, perhaps $Boot as well) are:

- mentioned "hidden sectors" (must be manually adjusted, recent syslinux's 
chain.c32 has option to do it automatically)
- adjusted boot.ini (to point to the new partition, and possibly other 
windows-specific settings as necessary)

As you can see, there're many "if"s and combinations here that I didn't test.

On a related note - ironically, while I had 0 problems making it work 
through syslinux (both regular chaining and through direct ntldr loading) - 
I couldn't make win7's bootmgr (bcd, bcdedit ....) do it properly. Oh well.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
@ 2010-03-12  3:10 H. Peter Anvin
  0 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-12  3:10 UTC (permalink / raw)
  To: Tejun Heo, Greg Freemyer
  Cc: tytso, Nikanth Karthikesan, James Bottomley, Damian Lukowski,
	linux-ide, Jeff Garzik, Matthew Wilcox, Martin K. Petersen, lkml,
	Daniel Taylor, Mark Lord, hirofumi, Andrew Morton, Alan Cox,
	irtiger, aschnell, jdelvare

[-- Attachment #1: Type: text/plain, Size: 703 bytes --]

I think if you use the DOS compat option to create the legacy partitions only, you should be fine.

"Tejun Heo" <tj@kernel.org> wrote:

>Hello,
>
>On 03/12/2010 01:34 AM, Greg Freemyer wrote:
>> I do think the linux partitioners should provide a way to force a
>> cylinder alignment.  Tejun, I would like to see your doc describe how
>> to force a win2k compatible partition layout.
>
>I suppose I can play with fdisk and list it as an example but if
>anyone knows better/proper way to force certain partitions to legacy
>alignment while leaving others properly aligned, I'll be happy to
>include it.
>
>Thanks.
>
>-- 
>tejun

--
Sent from my mobile phone, pardon any lack of formatting.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 16:34                             ` Greg Freemyer
  (?)
@ 2010-03-12  1:09                             ` Tejun Heo
  -1 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-12  1:09 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: tytso, Nikanth Karthikesan, James Bottomley, Damian Lukowski,
	linux-ide, Jeff Garzik, Matthew Wilcox, Martin K. Petersen, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

Hello,

On 03/12/2010 01:34 AM, Greg Freemyer wrote:
> I do think the linux partitioners should provide a way to force a
> cylinder alignment.  Tejun, I would like to see your doc describe how
> to force a win2k compatible partition layout.

I suppose I can play with fdisk and list it as an example, but if
anyone knows a better/proper way to force certain partitions to legacy
alignment while leaving others properly aligned, I'll be happy to
include it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 16:01                       ` Mike Snitzer
@ 2010-03-11 18:26                         ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-03-11 18:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Nikanth Karthikesan, Theodore Tso, Damian Lukowski, linux-ide,
	Jeff Garzik, Matthew Wilcox, Martin K. Petersen, James Bottomley,
	Tejun Heo, lkml, Daniel Taylor, Mark Lord, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thu, Mar 11, 2010 at 11:01:41AM -0500, Mike Snitzer wrote:
> mkp in particular, Jens, James, myself, and others implemented and
> refined the SCSI and block changes.  kzak, jim meyering, hans de
> goede, hch, eric sandeen, bob peterson, myself and others updated all
> other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
> to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
>  FYI, all of these advances will be in Fedora 13 (quite a few are
> already in Fedora 12).

I also have some older patches for btrfs that I need to get back out
to the list.  There was some talk of major changes to the organization
of the tools so I held it back for a while longer.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 15:25                         ` tytso
@ 2010-03-11 16:34                             ` Greg Freemyer
  2010-03-11 16:34                             ` Greg Freemyer
  1 sibling, 0 replies; 155+ messages in thread
From: Greg Freemyer @ 2010-03-11 16:34 UTC (permalink / raw)
  To: tytso, Nikanth Karthikesan, James Bottomley, Damian Lukowski,
	linux-ide@vger.kernel.org

On Thu, Mar 11, 2010 at 10:25 AM,  <tytso@mit.edu> wrote:
> On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
>> The real problem, here is just that partitioning-tools should create
>> partitions that can work with both XP as well as Windows7. May be distro
>> installers, should ask the user which compatibility he needs.
>
> 4k aligned sectors will *work* with Windows XP, will they not?  It's
> just simply that Windows XP, being really ancient, doesn't
> create properly aligned partitions by default.
>
> And how often are we going to see Windows XP systems with these new 4k
> physical sector drives anyway, where the first OS to touch the
> partition is Windows XP?  And in the case where this does happen, the
> resulting partition will result in terrible performance for Windows
> XP as well as Linux.
>
> What's the specific scenario which you are trying to solve, and how
> likely is it to occur in real life?
>
>                                        - Ted

Ted,

Apparently the real issue is Win2K, not XP.

It seems to require that the boot partition, and possibly all partitions,
start on a cylinder boundary, and it may have 255/63 hard-coded to
define what a cylinder is.  I agree with the apparent consensus that a
2010-era Linux partitioner does not need to be Win2K compatible.  If
someone wants to install Win2K they will need to either use an older
generation partitioner to create the partitions or use specific
command-line args to force a non-optimal alignment.
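
For illustration only (this is not an existing partitioner option), the
legacy rule amounts to rounding a partition start up to the next multiple of
heads * sectors-per-track. The sketch below assumes the 255/63 geometry
mentioned above and 512-byte logical sectors; round_up_to_cylinder is a
hypothetical helper name.

```python
def round_up_to_cylinder(lba, heads=255, spt=63):
    """Round an LBA up to the next cylinder boundary for the given geometry."""
    sectors_per_cyl = heads * spt          # 16065 for 255/63
    return -(-lba // sectors_per_cyl) * sectors_per_cyl


if __name__ == "__main__":
    start = round_up_to_cylinder(2048)     # 2048 = the usual 1MiB-aligned start
    print(start)                           # 16065
    print((start * 512) % 4096 == 0)       # False: cylinder 1 is not 4K aligned
    start8 = round_up_to_cylinder(8 * 255 * 63)
    print((start8 * 512) % 4096 == 0)      # True: cylinder 8 is 4K aligned
```

This shows why cylinder alignment is "non-optimal" on 4K drives: only every
eighth cylinder boundary also falls on a 4K boundary.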

I do think the linux partitioners should provide a way to force a
cylinder alignment.  Tejun, I would like to see your doc describe how
to force a win2k compatible partition layout.

fyi: The same issue apparently also exists for users still running OS/2.

Greg

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 13:57                 ` Nikanth Karthikesan
  2010-03-11 14:28                   ` Theodore Tso
@ 2010-03-11 16:33                   ` H. Peter Anvin
  1 sibling, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-11 16:33 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Theodore Tso, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, James Bottomley, Tejun Heo,
	lkml, Daniel Taylor, Mark Lord, hirofumi, Andrew Morton,
	Alan Cox, irtiger, aschnell, jdelvare

On 03/11/2010 05:57 AM, Nikanth Karthikesan wrote:
>
> I guess, what he meant was, to keep filesystem blocks aligned, even if the
> partition is not. Say if the partition is mis-aligned by 512-bytes, let the
> filesystem waste 4k-512bytes and keep it's blocks aligned. But it might be a
> case of over-engineering, possibly requiring disk format change.
>

That's basically what you end up having to do for FAT filesystems to be 
aligned.
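
A minimal sketch of the idea being discussed (an illustration, not how any
particular mkfs does it): given the byte offset at which a partition
starts, compute how much space must be skipped inside the partition so that
the first filesystem block lands on a 4K boundary. For FAT this would
correspond to choosing the size of the area before the data region
accordingly; the helper name is an assumption for the example.

```python
def padding_to_align(partition_start_bytes, block_size=4096):
    """Bytes to skip at the start of the partition so that what follows
    is aligned to block_size on the underlying disk."""
    return (-partition_start_bytes) % block_size


if __name__ == "__main__":
    # Partition that starts 512 bytes past a 4K boundary:
    print(padding_to_align(10 * 4096 + 512))   # 3584, i.e. 4k - 512 bytes wasted
    # Classic DOS layout, partition at sector 63:
    print(padding_to_align(63 * 512))          # 512, i.e. data begins at LBA 64
    # Partition already aligned at 1MiB:
    print(padding_to_align(1024 * 1024))       # 0
```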

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 15:25                         ` tytso
@ 2010-03-11 16:26                             ` Gene Heskett
  2010-03-11 16:34                             ` Greg Freemyer
  1 sibling, 0 replies; 155+ messages in thread
From: Gene Heskett @ 2010-03-11 16:26 UTC (permalink / raw)
  To: tytso, Nikanth Karthikesan, James Bottomley, Damian Lukowski,
	linux-ide@vger.kernel.org

On Thursday 11 March 2010, tytso@mit.edu wrote:
>On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
>> The real problem, here is just that partitioning-tools should create
>> partitions that can work with both XP as well as Windows7. May be distro
>> installers, should ask the user which compatibility he needs.
>
>4k aligned sectors will *work* with Windows XP, will they not?  It's
>just simply that Windows XP, being really ancient, doesn't
>create properly aligned partitions by default.
>
>And how often are we going to see Windows XP systems with these new 4k
>physical sector drives anyway, where the first OS to touch the
>partition is Windows XP?  And in the case where this does happen, the
>resulting partition will result in terrible performance for Windows
>XP as well as Linux.
>
>What's the specific scenario which you are trying to solve, and how
>likely is it to occur in real life?

And potentially one more question from a list lurker, Ted.  Where are the 
tools that allow us to check and/or adjust that?  I ask since I have 3 of 
these terabyte drives in this box now and have no clue how to either check 
to see if they're off, or how to fix it if they are.  And I have been 
following this discussion without noticing whether the tools have been 
specifically named.

Thanks

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Authors are easy to get on with -- if you're fond of children.
		-- Michael Joseph, "Observer"

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 15:00                     ` Nikanth Karthikesan
  2010-03-11 15:10                       ` Tejun Heo
@ 2010-03-11 16:01                       ` Mike Snitzer
  2010-03-11 18:26                         ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Mike Snitzer @ 2010-03-11 16:01 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Theodore Tso, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, James Bottomley, Tejun Heo,
	lkml, Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thu, Mar 11, 2010 at 10:00 AM, Nikanth Karthikesan <knikanth@suse.de> wrote:
> On Thursday 11 March 2010 19:58:11 Theodore Tso wrote:
>> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>> > I guess, what he meant was, to keep filesystem blocks aligned, even if
>> > the partition is not. Say if the partition is mis-aligned by 512-bytes,
>> > let the filesystem waste 4k-512bytes and keep it's blocks aligned. But it
>> > might be a case of over-engineering, possibly requiring disk format
>> > change.
>>
>> Ah, yes, I agree with you; that's probably what he meant.
>>
>> Sure, that's theoretically possible, but it would mean changing every
>>  single filesystem, and it would require a file system format change --- or
>>  at least a file system format extension.
>>
>> It would seem to be way easier to simply fix the partitioning tools to do
>>  the right thing, though.
>>
>
> Yes. May be, just a simple but transparent device-mapper like mapping on top
> of the mis-aligned partition, to do the alignment. Then the file-system code
> need not change much.
>
> But Linux already has device-mapper and Linux will not be affected with mis-
> aligned partitions, when we use LVM.

Well, device-mapper and LVM needed to be updated to make them "just
work" but yes that work has been done.

> But the actual problem here is that partitioning tools might create partitions
> that wont allow other operating-systems to boot. So it might be enough, if the
> partitioning tools just create partitions with (mis-)alignment requirement for
> Windows.

I'm not following...

Anyway, 4K drives that are 512b logical and 4K physical may or may not
also have "DOS partition compensation" that uses LBA -1 as the first
naturally (4K) aligned start.  This means that the partition tools
need to shift the start of the first primary partition to be offset by
3584 bytes (7 512b sectors) for use with Linux.  But for Windows,
AFAIK Windows XP and Windows 7 create all partitions aligned on 1MB
boundaries.  Linux's parted and fdisk create 1MB aligned partitions
now too.
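
The offset arithmetic above can be sketched as follows (an illustration
under stated assumptions, not code from any partitioning tool): with 512b
logical and 4K physical sectors and an alignment offset of 3584 bytes, the
physically aligned LBAs are 7, 15, 23, ... (which includes 63). The
function name and parameter names are made up for the example.

```python
def first_aligned_lba(min_lba, alignment_offset=0,
                      physical=4096, logical=512):
    """Smallest LBA >= min_lba whose byte offset, minus alignment_offset,
    is a multiple of the physical sector size."""
    per_phys = physical // logical              # logical sectors per physical
    off_sectors = alignment_offset // logical   # 7 for a 3584-byte offset
    rem = (min_lba - off_sectors) % per_phys
    return min_lba if rem == 0 else min_lba + (per_phys - rem)


if __name__ == "__main__":
    print(first_aligned_lba(1, 3584))      # 7
    print(first_aligned_lba(63, 3584))     # 63: the DOS default is aligned here
    print(first_aligned_lba(2048, 0))      # 2048: plain 1MiB alignment
    print(first_aligned_lba(2048, 3584))   # 2055: 1MiB plus the drive's offset
```

On a drive without the compensation (alignment offset 0), sector 63 is
misaligned and the 1MiB start at LBA 2048 is already correct.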

So the only outliers are older versions of Windows (< XP) and Linux (old
fdisk and parted, etc. also use DOS partitioning) that don't use
naturally aligned (e.g. 1MB) partition boundaries.  In those versions
of Windows and Linux there are ways to change the default start of
sector 63.   That said, there is an opportunity to improve
documentation for how to work around DOS partitioning on these
operating systems.

One other piece worth mentioning on this "IO Topology" support in the
entire Linux I/O stack is the virt layers.  hch has already extended
the virt-io protocol and qemu is in the finishing stages of being
updated to properly consume the "IO Topology" information.  So we
really don't have any gaps in the Linux I/O stack.

mkp in particular, Jens, James, myself, and others implemented and
refined the SCSI and block changes.  kzak, jim meyering, hans de
goede, hch, eric sandeen, bob peterson, myself and others updated all
other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
 FYI, all of these advances will be in Fedora 13 (quite a few are
already in Fedora 12).

There are obviously other Linux systems and userland tools (likely
Xen, other mkfs.* and more) that should be updated.  Hopefully
maintainers and/or contributors of these projects will follow-up to
address those that need updating.

Again please see:
http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
Some omissions include: Linux MD, which has been updated as mkp
pointed out, and I neglected to talk about virt-io and qemu (but like
I said they have been updated too).

Hopefully we're all closer to being on the same page now.

Mike

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 15:05                       ` Nikanth Karthikesan
@ 2010-03-11 15:25                         ` tytso
  2010-03-11 16:26                             ` Gene Heskett
  2010-03-11 16:34                             ` Greg Freemyer
  0 siblings, 2 replies; 155+ messages in thread
From: tytso @ 2010-03-11 15:25 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: James Bottomley, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
> The real problem, here is just that partitioning-tools should create 
> partitions that can work with both XP as well as Windows7. May be distro 
> installers, should ask the user which compatibility he needs.

4k aligned sectors will *work* with Windows XP, will they not?  It's
just simply that Windows XP, being really ancient, doesn't
create properly aligned partitions by default.

And how often are we going to see Windows XP systems with these new 4k
physical sector drives anyway, where the first OS to touch the
partition is Windows XP?  And in the case where this does happen, the
resulting partition will result in terrible performance for Windows
XP as well as Linux.

What's the specific scenario which you are trying to solve, and how
likely is it to occur in real life?

					- Ted

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 15:00                     ` Nikanth Karthikesan
@ 2010-03-11 15:10                       ` Tejun Heo
  2010-03-11 16:01                       ` Mike Snitzer
  1 sibling, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-11 15:10 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Theodore Tso, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, James Bottomley, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

Hello,

On 03/12/2010 12:00 AM, Nikanth Karthikesan wrote:
> But the actual problem here is that partitioning tools might create
> partitions that won't allow other operating systems to boot. So it
> might be enough, if the partitioning tools just create partitions
> with (mis-)alignment requirement for Windows.

Turns out XP is generally OK.  The reported problem was only on
specific configurations (some BIOS stuff).  Windows 2000 reportedly
would be hurt but I really think we don't have to care about that too
much.  So, it seems like we wouldn't have to worry too much about it
and just go ahead with new alignment schemes.  I'll update the doc
this weekend with new information from this now rather large thread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 14:39                     ` James Bottomley
@ 2010-03-11 15:05                       ` Nikanth Karthikesan
  2010-03-11 15:25                         ` tytso
  0 siblings, 1 reply; 155+ messages in thread
From: Nikanth Karthikesan @ 2010-03-11 15:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Theodore Tso, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thursday 11 March 2010 20:09:34 James Bottomley wrote:
> On Thu, 2010-03-11 at 09:28 -0500, Theodore Tso wrote:
> > On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> > > I guess, what he meant was, to keep filesystem blocks aligned, even if
> > > the partition is not. Say if the partition is mis-aligned by 512-bytes,
> > > let the filesystem waste 4k-512bytes and keep its blocks aligned. But
> > > it might be a case of over-engineering, possibly requiring disk format
> > > change.
> >
> > Ah, yes, I agree with you; that's probably what he meant.
> >
> > Sure, that's theoretically possible, but it would mean changing every
> > single filesystem, and it would require a file system format change
> > --- or at least a file system format extension.
> >
> > It would seem to be way easier to simply fix the partitioning tools to
> > do the right thing, though.
> 
> Actually, it's a layering violation.  The filesystem shouldn't need to
> probe the device layout ... particularly when there are complexities
> like is it logical 512 or physical, and if logical 512 on 4k does it
> have an offset exponent or not.
> 
> We can transmit certain abstractions of information up the stack (like
> stripe width for RAID arrays which should be the fs optimal write size),
> but for this type of alignment, which can be completely solved at the
> partition layer, the information should really stay there and the
> filesystem should "just work".
> 

Right. It would be a layering violation, and we have LVM to solve it already.

The real problem here is just that partitioning tools should create
partitions that can work with both XP and Windows 7. Maybe distro
installers should ask the user which compatibility he needs.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 14:28                   ` Theodore Tso
  2010-03-11 14:39                     ` James Bottomley
  2010-03-11 14:48                     ` Mike Snitzer
@ 2010-03-11 15:00                     ` Nikanth Karthikesan
  2010-03-11 15:10                       ` Tejun Heo
  2010-03-11 16:01                       ` Mike Snitzer
  2 siblings, 2 replies; 155+ messages in thread
From: Nikanth Karthikesan @ 2010-03-11 15:00 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Damian Lukowski, linux-ide, Jeff Garzik, Matthew Wilcox,
	Martin K. Petersen, James Bottomley, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thursday 11 March 2010 19:58:11 Theodore Tso wrote:
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> > I guess, what he meant was, to keep filesystem blocks aligned, even if
> > the partition is not. Say if the partition is mis-aligned by 512-bytes,
> > let the filesystem waste 4k-512bytes and keep its blocks aligned. But it
> > might be a case of over-engineering, possibly requiring disk format
> > change.
> 
> Ah, yes, I agree with you; that's probably what he meant.
> 
> Sure, that's theoretically possible, but it would mean changing every
>  single filesystem, and it would require a file system format change --- or
>  at least a file system format extension.
> 
> It would seem to be way easier to simply fix the partitioning tools to do
>  the right thing, though.
> 

Yes. Maybe just a simple but transparent device-mapper-like mapping on top
of the mis-aligned partition, to do the alignment. Then the file-system code
need not change much.

But Linux already has device-mapper, and Linux will not be affected by
mis-aligned partitions when we use LVM.

But the actual problem here is that partitioning tools might create partitions
that won't allow other operating systems to boot. So it might be enough if the
partitioning tools just create partitions with the (mis-)alignment requirement
for Windows.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 14:28                   ` Theodore Tso
  2010-03-11 14:39                     ` James Bottomley
@ 2010-03-11 14:48                     ` Mike Snitzer
  2010-03-11 15:00                     ` Nikanth Karthikesan
  2 siblings, 0 replies; 155+ messages in thread
From: Mike Snitzer @ 2010-03-11 14:48 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Nikanth Karthikesan, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, James Bottomley, Tejun Heo,
	lkml, Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thu, Mar 11, 2010 at 9:28 AM, Theodore Tso <tytso@mit.edu> wrote:
>
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>>
>> I guess, what he meant was, to keep filesystem blocks aligned, even if the
>> partition is not. Say if the partition is mis-aligned by 512-bytes, let the
>> filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
>> case of over-engineering, possibly requiring disk format change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

Yes, the current supported approach is to rely on partitions (parted,
fdisk) or LVM to account for 'alignment_offset'.

This avoids having a filesystem add its own padding (format change).
But e2fsprogs at least warns if a device, that it is to format, has an
alignment_offset != 0.
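
As an illustration of that check, here is a minimal Python sketch that reads
the topology values from a sysfs-style directory and reports whether the
device advertises a non-zero alignment_offset. The attribute paths are the
ones quoted elsewhere in this thread; the function names are hypothetical,
and this is not how e2fsprogs itself implements its warning.

```python
import os


def read_topology(block_dir):
    """Return (physical, logical, alignment_offset) in bytes.

    block_dir is a directory laid out like /sys/block/sdX.
    """
    def read_int(relpath):
        with open(os.path.join(block_dir, relpath)) as f:
            return int(f.read().strip())

    return (read_int("queue/physical_block_size"),
            read_int("queue/logical_block_size"),
            read_int("alignment_offset"))


def needs_offset(block_dir):
    """True if the device reports a non-zero alignment_offset."""
    return read_topology(block_dir)[2] != 0
```

Calling needs_offset("/sys/block/sda") on a real system would be the typical
use; taking the directory as a parameter keeps the sketch testable without
real hardware.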

Mike

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 14:28                   ` Theodore Tso
@ 2010-03-11 14:39                     ` James Bottomley
  2010-03-11 15:05                       ` Nikanth Karthikesan
  2010-03-11 14:48                     ` Mike Snitzer
  2010-03-11 15:00                     ` Nikanth Karthikesan
  2 siblings, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-03-11 14:39 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Nikanth Karthikesan, Damian Lukowski, linux-ide, Jeff Garzik,
	Matthew Wilcox, Martin K. Petersen, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thu, 2010-03-11 at 09:28 -0500, Theodore Tso wrote:
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> > 
> > I guess, what he meant was, to keep filesystem blocks aligned, even if the 
> > partition is not. Say if the partition is mis-aligned by 512-bytes, let the 
> > filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
> > case of over-engineering, possibly requiring disk format change.
> 
> Ah, yes, I agree with you; that's probably what he meant.
> 
> Sure, that's theoretically possible, but it would mean changing every
> single filesystem, and it would require a file system format change
> --- or at least a file system format extension.
> 
> It would seem to be way easier to simply fix the partitioning tools to
> do the right thing, though.

Actually, it's a layering violation.  The filesystem shouldn't need to
probe the device layout ... particularly when there are complexities
like is it logical 512 or physical, and if logical 512 on 4k does it
have an offset exponent or not.

We can transmit certain abstractions of information up the stack (like
stripe width for RAID arrays which should be the fs optimal write size),
but for this type of alignment, which can be completely solved at the
partition layer, the information should really stay there and the
filesystem should "just work".

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 13:57                 ` Nikanth Karthikesan
@ 2010-03-11 14:28                   ` Theodore Tso
  2010-03-11 14:39                     ` James Bottomley
                                       ` (2 more replies)
  2010-03-11 16:33                   ` H. Peter Anvin
  1 sibling, 3 replies; 155+ messages in thread
From: Theodore Tso @ 2010-03-11 14:28 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Damian Lukowski, linux-ide, Jeff Garzik, Matthew Wilcox,
	Martin K. Petersen, James Bottomley, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare


On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> 
> I guess, what he meant was, to keep filesystem blocks aligned, even if the 
> partition is not. Say if the partition is mis-aligned by 512-bytes, let the 
> filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
> case of over-engineering, possibly requiring disk format change.

Ah, yes, I agree with you; that's probably what he meant.

Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.

It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

-- Ted


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-11 13:04               ` Theodore Tso
@ 2010-03-11 13:57                 ` Nikanth Karthikesan
  2010-03-11 14:28                   ` Theodore Tso
  2010-03-11 16:33                   ` H. Peter Anvin
  0 siblings, 2 replies; 155+ messages in thread
From: Nikanth Karthikesan @ 2010-03-11 13:57 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Damian Lukowski, linux-ide, Jeff Garzik, Matthew Wilcox,
	Martin K. Petersen, James Bottomley, Tejun Heo, lkml,
	Daniel Taylor, Mark Lord, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, jdelvare

On Thursday 11 March 2010 18:34:56 Theodore Tso wrote:
> On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
> > I have practically no knowledge of Linux' block device drivers,
> > but is this really a partitioning issue? I think the problem is
> > with the filesystems when clustering multiple blocks without
> > knowledge of the sector alignment and sector size of the underlying
> > block device. Maybe it is a better solution to adapt the filesystem
> > buffer routine which reads/writes data from/to the block device?
> 
> No, it's really a partitioning issue.   If the paging subsystem wants a 4k
>  block to fill a particular page, we need to read that 4k block into
>  memory.  If we need to swap out that 4k block, we need to write that 4k
>  block to swap space, or to the memory segment's backing store.   If the
>  partition is misaligned by 512 bytes, this is simply not possible.   The
>  file system has to do what is requested of it by its users, and the
>  reality is that we need to do 4k aligned reads and writes with respect to
>  the beginning of the partition, far more often than not.
> 
> Hence, if we want the best performance on 4k sector drives, the partition
>  needs to be aligned with respect to what is most desirable for the device
>  in question.
> 

I guess what he meant was to keep filesystem blocks aligned even if the
partition is not. Say the partition is mis-aligned by 512 bytes: let the
filesystem waste 4k - 512 bytes and keep its blocks aligned. But it might be a
case of over-engineering, possibly requiring a disk format change.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10 16:19             ` Damian Lukowski
@ 2010-03-11 13:04               ` Theodore Tso
  2010-03-11 13:57                 ` Nikanth Karthikesan
  0 siblings, 1 reply; 155+ messages in thread
From: Theodore Tso @ 2010-03-11 13:04 UTC (permalink / raw)
  To: Damian Lukowski
  Cc: linux-ide, Jeff Garzik, Matthew Wilcox, Martin K. Petersen,
	James Bottomley, Tejun Heo, lkml, Daniel Taylor, Mark Lord,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	aschnell, knikanth, jdelvare

On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
> 
> I have practically no knowledge of Linux' block device drivers,
> but is this really a partitioning issue? I think the problem is
> with the filesystems when clustering multiple blocks without
> knowledge of the sector alignment and sector size of the underlying
> block device. Maybe it is a better solution to adapt the filesystem
> buffer routine which reads/writes data from/to the block device?

No, it's really a partitioning issue.   If the paging subsystem wants a 4k block to fill a particular page, we need to read that 4k block into memory.  If we need to swap out that 4k block, we need to write that 4k block to swap space, or to the memory segment's backing store.   If the partition is misaligned by 512 bytes, this is simply not possible.   The file system has to do what is requested of it by its users, and the reality is that we need to do 4k aligned reads and writes with respect to the beginning of the partition, far more often than not.

Hence, if we want the best performance on 4k sector drives, the partition needs to be aligned with respect to what is most desirable for the device in question.
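
To make the cost concrete, the following Python sketch counts how many
physical sectors a single 4k filesystem block touches for a given partition
start. It assumes 512-byte logical sectors, 4096-byte physical sectors and an
alignment offset of zero (no offset-by-one mapping); the function name and
defaults are illustrative only.

```python
def physical_sectors_touched(part_start_lba, block_index,
                             block_size=4096, lss=512, pss=4096):
    """Number of physical sectors covered by one filesystem block.

    part_start_lba: partition start in logical sectors
    block_index:    index of the filesystem block within the partition
    """
    first_byte = part_start_lba * lss + block_index * block_size
    last_byte = first_byte + block_size - 1
    return last_byte // pss - first_byte // pss + 1


# Partition at LBA 63: every 4k block straddles two physical sectors,
# so each 4k write becomes a read-modify-write of two sectors.
print(physical_sectors_touched(63, 0))
# Partition at LBA 2048 (1 MiB): each 4k block maps to one physical sector.
print(physical_sectors_touched(2048, 0))
```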

Best regards,

-- Ted


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  5:06             ` Martin K. Petersen
@ 2010-03-10 20:50               ` Henrique de Moraes Holschuh
  0 siblings, 0 replies; 155+ messages in thread
From: Henrique de Moraes Holschuh @ 2010-03-10 20:50 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Karel Zak, Michael Tokarev, Mike Snitzer, linux-ide, linux-kernel

On Wed, 10 Mar 2010, Martin K. Petersen wrote:
> >>>>> "Karel" == Karel Zak <kzak@redhat.com> writes:
> 
> [Cleaned up the CC: list from hell]
> 
> Karel>  # cat /sys/block/md8/queue/{minimum,optimal}_io_size
> Karel>  65536 65536
> 
> This one had me puzzled.  We set min_io and opt_io correctly in raid5.c
> depending on number of non-parity disks.  And yet it turns into
> something nonsensical after.
> 
> Turns out we overrun unsigned int calculating the lowest common multiple
> in the stacking function.  That's why we ended up with the wrong value.
> 
> I never noticed this because my userland topology regression test tool
> uses unsigned long.
> 
> I'll get a patch off to Jens right away.

And please get the whole fixed deal in -stable eventually, for 2.6.32.y's
benefit :-)

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10 13:47           ` Jeff Garzik
@ 2010-03-10 16:19             ` Damian Lukowski
  2010-03-11 13:04               ` Theodore Tso
  0 siblings, 1 reply; 155+ messages in thread
From: Damian Lukowski @ 2010-03-10 16:19 UTC (permalink / raw)
  To: linux-ide
  Cc: Jeff Garzik, Matthew Wilcox, Martin K. Petersen, James Bottomley,
	Tejun Heo, lkml, Daniel Taylor, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, aschnell, knikanth,
	jdelvare

> S-2. The proper solution.
> 
>   Correct alignments for all partitions can't be achieved by the
>   firmware alone.  The system utilities should be informed about the
>   alignment requirements and align partitions accordingly.
> 
>   The above firmware workaround complicates the situation because the
>   two different configurations require different offsets to achieve
>   the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
>   export the physical and logical sector sizes and the LBA offset
>   which is aligned to the physical sectors.
> 
>   In Linux, these parameters are exported via the following sysfs
>   nodes.
> 
>     physical sector size	: /sys/block/sdX/queue/physical_block_size
>     logical sector size		: /sys/block/sdX/queue/logical_block_size
>     alignment offset		: /sys/block/sdX/alignment_offset
> 
>   Let the physical sector size be PSS, logical sector size LSS and
>   alignment offset AOFF.  The system software should place partitions
>   such that the starting LBAs of all partitions are aligned on
> 
>     (n * PSS + AOFF) / LSS
> 
>   For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
>   and AOFF 3584 and with n of 7 the above becomes,
> 
>     (7 * 4096 + 3584) / 512 == 63
> 
>   making sector 63 an aligned LBA where the first partition can be
>   put, but without the offset-by-one mapping, AOFF is zero and LBA 63
>   is not aligned.
> 
>   With the above new alignment requirement in place, it becomes
>   difficult to honor the legacy one - first partition on sector 63 and
>   all other partitions on cylinder boundary (255 * 63 sectors) - as
>   the two alignment requirements contradict each other.  This might be
>   worked around by adjusting how LBA and CHS addresses are mapped but
>   the disk geometry parameters are hard coded everywhere and there is
>   no reliable way to communicate custom geometry parameters.
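
The alignment rule quoted above can be sketched in a few lines of Python.
PSS, LSS and AOFF are the names used in the quoted text; the function names
are hypothetical and the code is only a worked example of the formula, not
code taken from any partitioning tool.

```python
def aligned_lba(n, pss, lss, aoff):
    """Return the n-th aligned LBA, i.e. (n * PSS + AOFF) / LSS."""
    byte_offset = n * pss + aoff
    if byte_offset % lss != 0:
        raise ValueError("offset is not a whole number of logical sectors")
    return byte_offset // lss


def is_aligned(lba, pss, lss, aoff):
    """True if the given LBA starts on a physical sector boundary."""
    return (lba * lss - aoff) % pss == 0


# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584, n = 7 gives LBA 63.
print(aligned_lba(7, 4096, 512, 3584))
# Without the offset-by-one mapping (AOFF 0), LBA 63 is not aligned.
print(is_aligned(63, 4096, 512, 0))
```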

Hello,
I have practically no knowledge of Linux' block device drivers,
but is this really a partitioning issue? I think the problem is
with the filesystems when clustering multiple blocks without
knowledge of the sector alignment and sector size of the underlying
block device. Maybe it is a better solution to adapt the filesystem
buffer routine which reads/writes data from/to the block device?

Best regards
 Damian

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  7:53         ` Matthew Wilcox
@ 2010-03-10 13:47           ` Jeff Garzik
  2010-03-10 16:19             ` Damian Lukowski
  0 siblings, 1 reply; 155+ messages in thread
From: Jeff Garzik @ 2010-03-10 13:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Martin K. Petersen, James Bottomley, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, knikanth, jdelvare

On 03/10/2010 02:53 AM, Matthew Wilcox wrote:
> On Mon, Mar 08, 2010 at 10:41:57AM -0500, Martin K. Petersen wrote:
>> What I meant to say was that I know ATA supports 4 KB LBS and that
>> nobody appears to care about it.
>
> I sent patches to add support ... they were ignored.

Not true, read the rest of the thread.

	Jeff




^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10 10:46         ` Johannes Stezenbach
@ 2010-03-10 11:22           ` H. Peter Anvin
  0 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-10 11:22 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: Greg Freemyer, James Bottomley, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, mkp

On 03/10/2010 02:46 AM, Johannes Stezenbach wrote:
> On Tue, Mar 09, 2010 at 04:32:04PM -0800, H. Peter Anvin wrote:
>>
>> It can.  The BIOS doesn't care about the partition table at all --
>> all it does is load the MBR.
>
> A little story for your entertainment pleasure:
>
> I have a Gigabyte GA-MA78GM-S2H board, and during install
> turned off the power after partitioning but before formatting
> any partition because I got distracted by something else.
>
> Result: System could not boot anymore, BIOS hung before
> I could get to the "select boot device" screen. This also
> happened when I removed the hdd from the boot device
> list in BIOS. The last BIOS message was "Verifying DMI Pool Data"
> and you can find numerous similar reports by searching for
> 'gigabyte bios hang "Verifying DMI Pool Data"'.
>
> In my case it worked to switch the SATA mode from AHCI to
> something else, then wipe the partition table and switch
> back to AHCI.  But I read on the net that some people had
> to format the drive in another PC, or hotplug it after the BIOS
> got past "Verifying DMI Pool Data".
>

Well, yes, there are buggy BIOSes of a gazillion varieties.  A fair 
number of them read the partition table to try to guess what C/H/S 
geometry the user intended.  However, the GPT spec specifically uses a 
"Protective MBR" to guard against this and other issues like it; it 
makes the entire disk look to MBR-reading software like a single fully 
partitioned disk with one large partition on it.
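
For readers unfamiliar with the layout, here is a rough Python sketch of what
such a protective MBR looks like on disk: a single partition entry of type
0xEE starting at LBA 1 and covering the rest of the disk (capped at the
32-bit size limit). The function names are hypothetical, and the CHS fields
are simplified to fixed values rather than computed per disk.

```python
import struct

MBR_ENTRY_OFFSET = 446   # first of four 16-byte partition entries
GPT_PROTECTIVE = 0xEE    # partition type used by the protective MBR


def protective_mbr(total_sectors):
    """Build a 512-byte protective MBR for a disk of total_sectors LBAs."""
    size = min(total_sectors - 1, 0xFFFFFFFF)
    entry = struct.pack("<B3sB3sII",
                        0x00,              # not marked bootable
                        b"\x00\x02\x00",   # start CHS (simplified)
                        GPT_PROTECTIVE,
                        b"\xff\xff\xff",   # end CHS (simplified)
                        1,                 # start LBA
                        size)              # size in sectors
    return bytes(MBR_ENTRY_OFFSET) + entry + bytes(48) + b"\x55\xaa"


def looks_protective(sector):
    """True if the first MBR entry has type 0xEE and the signature is valid."""
    return (len(sector) >= 512
            and sector[510:512] == b"\x55\xaa"
            and sector[MBR_ENTRY_OFFSET + 4] == GPT_PROTECTIVE)
```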

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  9:14   ` Denys Vlasenko
@ 2010-03-10 11:02     ` Felix Miata
  2010-03-15  1:21     ` H. Peter Anvin
  2010-03-16  2:30     ` Tejun Heo
  2 siblings, 0 replies; 155+ messages in thread
From: Felix Miata @ 2010-03-10 11:02 UTC (permalink / raw)
  To: linux-ide

On 2010/03/10 10:14 (GMT+0100) Denys Vlasenko composed:

>> Do we know of anything that requires 63s/255h?

> 63s/255h is more or less "standard" now.

When did this change? AFAICT, this "standard" only applies to desktops, while
63s/240h applies to laptops, older Compaqs, and some brain dead external/USB
drive case controllers.

> Alignment issues can be solved by picking a good multiple of
> _heads_ or _cylinders_:

> For first partition, pick the start at 8th head:

> cyl 0 head 1 sector 1: LBA sector 63) - bad
> cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)

> For any other partition, pick start cylinder which is a multiple of 8:

> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)

> This will actually work well for *any* geometry, not only for 63s/255h.
-- 
"The wise are known for their understanding, and pleasant
words are persuasive." Proverbs 16:21 (New Living Translation)

 Team OS/2 ** Reg. Linux User #211409

Felix Miata  ***  http://fm.no-ip.com/

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:32       ` H. Peter Anvin
@ 2010-03-10 10:46         ` Johannes Stezenbach
  2010-03-10 11:22           ` H. Peter Anvin
  0 siblings, 1 reply; 155+ messages in thread
From: Johannes Stezenbach @ 2010-03-10 10:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Greg Freemyer, James Bottomley, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, mkp

On Tue, Mar 09, 2010 at 04:32:04PM -0800, H. Peter Anvin wrote:
> 
> It can.  The BIOS doesn't care about the partition table at all --
> all it does is load the MBR.

A little story for your entertainment pleasure:

I have a Gigabyte GA-MA78GM-S2H board, and during install
turned off the power after partitioning but before formatting
any partition because I got distracted by something else.

Result: System could not boot anymore, BIOS hung before
I could get to the "select boot device" screen. This also
happened when I removed the hdd from the boot device
list in BIOS. The last BIOS message was "Verifying DMI Pool Data"
and you can find numerous similar reports by searching for
'gigabyte bios hang "Verifying DMI Pool Data"'.

In my case it worked to switch the SATA mode from AHCI to
something else, then wipe the partition table and switch
back to AHCI.  But I read on the net that some people had
to format the drive in another PC, or hotplug it after the BIOS
got past "Verifying DMI Pool Data".


Johannes

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 23:46 ` Arnd Bergmann
  2010-03-10  0:20   ` Tejun Heo
@ 2010-03-10  9:14   ` Denys Vlasenko
  2010-03-10 11:02     ` Felix Miata
                       ` (2 more replies)
  1 sibling, 3 replies; 155+ messages in thread
From: Denys Vlasenko @ 2010-03-10  9:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On Wed, Mar 10, 2010 at 12:46 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
>> Unfortunately, while Windows can assume that newer releases won't
>> share the hard drive with older releases including Windows XP, Linux
>> distros can't do that.  There will be many installations where a
>> modern Linux distros share a hard drive with older releases of
>> Windows.  At this point, I can't see a silver bullet solution.
>>
>> Partitioners maybe should only align partitions which will be used by
>> Linux and default to the traditional layout for others while allowing
>> explicit override.  I think Windows XP wouldn't have problem with
>> differently aligned partitions as long as it doesn't actually use them
>> but haven't tested it.
>
> Any idea if XP can cope with partition tables that use a 32-sector, 128-head
> geometry rather than the default 63-sector, 255-head one? That seems to
> be what some flash memory cards are using and it would make any cylinder
> aligned partition also 4096-byte aligned, at the cost of moving the
> 1024-cylinder boundary from 7.88 GiB to 2 GiB.
>
> Do we know of anything that requires 63s/255h?

63s/255h is more or less "standard" now.

Alignment issues can be solved by picking a good multiple of
_heads_ or _cylinders_:

For first partition, pick the start at 8th head:

cyl 0 head 1 sector 1 (LBA sector 63) - bad
cyl 0 head 8 sector 1 (LBA sector 8*63) - good (4k aligned)

For any other partition, pick start cylinder which is a multiple of 8:

cyl 8*x head 0 sector 1 (LBA sector 8*x*255*63) - good (4k aligned)

This will actually work well for *any* geometry, not only for 63s/255h.
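
The arithmetic can be checked with a short Python sketch using the usual
CHS-to-LBA conversion. It assumes 512-byte logical sectors and a drive
without the offset-by-one mapping (alignment offset of zero); the function
names are made up for this example.

```python
def chs_to_lba(c, h, s, heads=255, sectors=63):
    """Standard CHS to LBA conversion (sector numbers are 1-based)."""
    return (c * heads + h) * sectors + (s - 1)


def is_4k_aligned(lba, lss=512):
    """True if the LBA falls on a 4096-byte boundary."""
    return (lba * lss) % 4096 == 0


for chs in ((0, 1, 1), (0, 8, 1), (8, 0, 1), (16, 0, 1)):
    lba = chs_to_lba(*chs)
    print(chs, lba, is_4k_aligned(lba))
```

Because the start is a multiple of 8 heads or 8 cylinders, the resulting LBA
is a multiple of 8 sectors whatever the heads and sectors-per-track values
are, which is why the trick does not depend on 63s/255h.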
-- 
vda

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:41         ` Martin K. Petersen
  (?)
  (?)
@ 2010-03-10  7:53         ` Matthew Wilcox
  2010-03-10 13:47           ` Jeff Garzik
  -1 siblings, 1 reply; 155+ messages in thread
From: Matthew Wilcox @ 2010-03-10  7:53 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, aschnell, knikanth, jdelvare

On Mon, Mar 08, 2010 at 10:41:57AM -0500, Martin K. Petersen wrote:
> What I meant to say was that I know ATA supports 4 KB LBS and that
> nobody appears to care about it.

I sent patches to add support ... they were ignored.

Part of the problem is that ATA is heinously broken wrt non-512 byte
sector sizes.  You have to know which commands work in multiples of
the block size, and which commands work in multiples of 512-bytes.
There's no easy way to figure it out; you need a table.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:14           ` Daniel Taylor
                             ` (2 preceding siblings ...)
  (?)
@ 2010-03-10  7:09           ` Gabor Gombas
  -1 siblings, 0 replies; 155+ messages in thread
From: Gabor Gombas @ 2010-03-10  7:09 UTC (permalink / raw)
  To: Daniel Taylor
  Cc: Tejun Heo, Greg Freemyer, H. Peter Anvin, James Bottomley,
	linux-ide, lkml, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, mkp

On Tue, Mar 09, 2010 at 04:14:30PM -0800, Daniel Taylor wrote:

> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

My desktop with a BIOS from 2005 has no problems with GPT. AFAIK a
recent Debian installer automatically chooses GPT if the disk is 2 TB or
larger.

Gabor

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:00   ` Tejun Heo
@ 2010-03-10  6:08     ` Mark Lord
  0 siblings, 0 replies; 155+ messages in thread
From: Mark Lord @ 2010-03-10  6:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/09/10 19:00, Tejun Heo wrote:
> On 03/09/2010 10:55 PM, Mark Lord wrote:
>> On 03/07/10 22:48, Tejun Heo wrote:
>> ..
>>> Please note that hdparm is misreporting the alignment offset.  It
>>> should be reporting 512 instead of 256 for offset-by-one drives.
>> ..
>>
>> That issue was fixed quite a while ago.
>> Upgrade your elderly copy of hdparm.
>
> Heh heh, *you* were keeping it from me!  Anyways, is there an hdparm
> devel tree published somewhere?  I wandered the SF page for quite a
> bit (which for some reason is very difficult to find things in) but I
> couldn't find one.  If it's not, it might be a good idea to put it on
> SF or git.kernel.org?
..

No tree.  There's just my working copy (private),
and the published versions at SF.

But yes, SF has gotten incredibly more cryptic to use of late,
and I might have to move it somewhere more accessible soon.

Cheers!

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:14           ` Daniel Taylor
  (?)
  (?)
@ 2010-03-10  5:17           ` H. Peter Anvin
  -1 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-10  5:17 UTC (permalink / raw)
  To: Daniel Taylor
  Cc: Tejun Heo, Greg Freemyer, James Bottomley, linux-ide, lkml,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

On 03/09/2010 04:14 PM, Daniel Taylor wrote:
>
> The MBR in a GPT installation doesn't map the first GPT partition, it maps
> the entire drive after the first sector, as well as marking it type 0xEE.
> The start LBA of the file system is not correctly located in the MBR.
>
> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

There is something called a "hybrid MBR", which is basically a GPT disk 
with a single partition (the current bootable partition) mapped as an 
MBR partition, instead of marking the whole disk 0xEE.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 12:18           ` Karel Zak
@ 2010-03-10  5:06             ` Martin K. Petersen
  2010-03-10 20:50               ` Henrique de Moraes Holschuh
  0 siblings, 1 reply; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-10  5:06 UTC (permalink / raw)
  To: Karel Zak; +Cc: Michael Tokarev, Mike Snitzer, linux-ide, linux-kernel

>>>>> "Karel" == Karel Zak <kzak@redhat.com> writes:

[Cleaned up the CC: list from hell]

Karel>  # cat /sys/block/md8/queue/{minimum,optimal}_io_size
Karel>  65536 65536

This one had me puzzled.  We set min_io and opt_io correctly in raid5.c
depending on number of non-parity disks.  And yet it turns into
something nonsensical after.

Turns out we overrun unsigned int calculating the lowest common multiple
in the stacking function.  That's why we ended up with the wrong value.

I never noticed this because my userland topology regression test tool
uses unsigned long.

I'll get a patch off to Jens right away.
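
The overflow described above can be reproduced in a few lines. This is only a sketch: it assumes the lowest common multiple is formed as a*b/gcd(a,b) with the product held in a 32-bit unsigned int, and the two input sizes are illustrative values, not ones taken from the stacking code.

```python
from math import gcd

U32_MASK = 0xFFFFFFFF  # emulate a 32-bit unsigned int


def lcm_exact(a: int, b: int) -> int:
    """Lowest common multiple using arbitrary-precision integers."""
    if a == 0 or b == 0:
        return 0
    return a // gcd(a, b) * b


def lcm_u32_wrapping(a: int, b: int) -> int:
    """Same formula, but the intermediate product is truncated to 32 bits."""
    if a == 0 or b == 0:
        return 0
    return ((a * b) & U32_MASK) // gcd(a, b)


# Illustrative inputs (assumed): a 64 KiB chunk and a 3 x 64 KiB stripe.
chunk = 64 * 1024
stripe = 3 * chunk

print(lcm_exact(chunk, stripe))         # 196608
print(lcm_u32_wrapping(chunk, stripe))  # 0, because the product is 3 * 2**32
```

Dividing by the gcd before multiplying, as lcm_exact() does, keeps the intermediate value small and is one common way to avoid the wrap.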

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  6:53     ` Michael Tokarev
@ 2010-03-10  4:57         ` Martin K. Petersen
  2010-03-10  4:57         ` Martin K. Petersen
  1 sibling, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-10  4:57 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Mike Snitzer, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare, Karel Zak, Jim Meyering,
	Neil Brown

>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

[MD I/O topology support]

Michael> But apparently it does not implement anything of this sort.
Michael> Adding Neilb to the Cc list.......

git show 8f6c2e4b

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:26           ` Tejun Heo
@ 2010-03-10  0:36             ` H. Peter Anvin
  0 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-10  0:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Daniel Taylor, Greg Freemyer, James Bottomley, linux-ide, lkml,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

On 03/09/2010 04:26 PM, Tejun Heo wrote:
>
>> I will run some experiments to see if any of the systems on my desk can boot
>> Linux from a GPT.
>
> I'm not sure about grub although I strongly suspect recent version of
> it should work but AFAICS lilo should definitely work as it doesn't
> care how the disk is logically organized at all.
>

In the case of Syslinux, you have to install gptmbr.bin, but otherwise 
it works unmodified (Syslinux itself doesn't care about the partition 
table at all.)

Note: the official standard for GPT booting on BIOS is still evolving, 
so I might change gptmbr to match the new standard.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:50         ` H. Peter Anvin
  2010-03-08 18:58           ` James Bottomley
  2010-03-08 20:19             ` Martin K. Petersen
@ 2010-03-10  0:34           ` Tejun Heo
  2 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, James Bottomley, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello,

On 03/09/2010 03:50 AM, H. Peter Anvin wrote:
> Well, apparently Western Digital are looking at it for USB drives due to
> XP compatibility requirements -- those presumably are ATA internally and
> use a USB-ATA bridge.

This should work right now as long as the bridge chip doesn't screw
up, which we can't do much about anyway.  USB is used as SCSI
transport and SCSI layer has been working fine with devices with
differing sector sizes for quite some time.

> On the flipside, though, there really is very little net benefit to 4K
> as opposed to 512 byte logical sectors: the additional protocol overhead
> is relatively minimal, and as long as writes are aligned full blocks,
> there shouldn't be any additional overhead on either the OS or the drive
> side.  On the plus side, you get full compatibility with the existing
> software stack.  The equation really seems rather simple.

Yeap, for addressing, whether 9 bit is shifted or 12 doesn't really
matter.  That's only 8 times difference which may be breached in
probably under three years.  If the current 48 bit addressing limit is
reached, we would be far better off introducing 64 or 128 bit
addressing.  That was the reason why I thought that I would never see
an ATA disk w/ 4KiB logical sector and got pretty surprised that it
was being considered for XP compatibility where 3 year offset could be
pretty meaningful.
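
The "only 8 times" figure can be checked with simple arithmetic. A small sketch, assuming nothing beyond the 48-bit LBA limit and the two logical sector sizes mentioned above:

```python
PIB = 2 ** 50
EIB = 2 ** 60


def max_capacity_bytes(lba_bits: int, sector_bytes: int) -> int:
    """Largest addressable capacity for a given LBA width and sector size."""
    return (1 << lba_bits) * sector_bytes


cap_512 = max_capacity_bytes(48, 512)   # 9-bit shift
cap_4k = max_capacity_bytes(48, 4096)   # 12-bit shift

print(cap_512 // PIB)     # 128 (PiB)
print(cap_4k // EIB)      # 1 (EiB)
print(cap_4k // cap_512)  # 8
```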

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 22:46     ` Greg Freemyer
  2010-03-10  0:05       ` Tejun Heo
@ 2010-03-10  0:32       ` H. Peter Anvin
  2010-03-10 10:46         ` Johannes Stezenbach
  1 sibling, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-10  0:32 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

On 03/09/2010 02:46 PM, Greg Freemyer wrote:
> <snip>
>>
>> As far as partitioning... I believe we should be using GPT partition tables
>> where possible.  Even on non-EFI systems, it's simply a much better
>> partition table format.
>>
>>         -hpa
>
> GPT can not be used for boot disks in non-EFI systems, right?
>

It can.  The BIOS doesn't care about the partition table at all -- all 
it does is load the MBR.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-10  0:14           ` Daniel Taylor
  (?)
@ 2010-03-10  0:26           ` Tejun Heo
  2010-03-10  0:36             ` H. Peter Anvin
  -1 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:26 UTC (permalink / raw)
  To: Daniel Taylor
  Cc: Greg Freemyer, H. Peter Anvin, James Bottomley, linux-ide, lkml,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

Hello,

On 03/10/2010 09:14 AM, Daniel Taylor wrote:
> The MBR in a GPT installation doesn't map the first GPT partition,
> it maps the entire drive after the first sector, as well as
> marking it type 0xEE.

Yeah, yeah, that was exactly what I was saying by "describing the rest
of the whole disk as a single chunk containing GPT managed area" with
a typo making "whole" "while".

> The start LBA of the file system is not correctly located in the
> MBR.

Sure it's not but MBR belongs to the boot loader not the BIOS.  BIOS
just needs to load MBR and hands control to it.  If the MBR or more
likely later stages of the bootloader loaded by MBR knows how to boot
from GPT, it should work.
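
For reference, a minimal sketch of how a tool might recognise the protective MBR being discussed. The field offsets are the classic MBR layout (four 16-byte entries at byte 446, type at entry byte 4, start LBA and length at entry bytes 8..15, 0x55 0xAA signature at byte 510). build_protective_mbr() is a hypothetical helper used only to make the example self-contained, and it leaves the CHS fields zeroed.

```python
import struct

SECTOR = 512
PART_TABLE_OFFSET = 446
ENTRY_SIZE = 16
GPT_PROTECTIVE = 0xEE


def parse_mbr_entries(sector: bytes) -> list:
    """Return the non-empty primary partition entries of an MBR sector."""
    if len(sector) < SECTOR or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    entries = []
    for i in range(4):
        off = PART_TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[off:off + ENTRY_SIZE]
        start_lba, num_sectors = struct.unpack_from("<II", raw, 8)
        if raw[4] != 0:  # type 0 means the slot is unused
            entries.append({
                "index": i,
                "bootable": raw[0] == 0x80,
                "type": raw[4],
                "start_lba": start_lba,
                "sectors": num_sectors,
            })
    return entries


def is_protective_mbr(sector: bytes) -> bool:
    """True if the only entry is a single 0xEE partition."""
    entries = parse_mbr_entries(sector)
    return len(entries) == 1 and entries[0]["type"] == GPT_PROTECTIVE


def build_protective_mbr(total_sectors: int) -> bytes:
    """Hypothetical helper: one 0xEE entry covering LBA 1 to the end."""
    sector = bytearray(SECTOR)
    off = PART_TABLE_OFFSET
    sector[off + 4] = GPT_PROTECTIVE
    struct.pack_into("<II", sector, off + 8, 1,
                     min(total_sectors - 1, 0xFFFFFFFF))
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


mbr = build_protective_mbr(5120000)
print(is_protective_mbr(mbr), parse_mbr_entries(mbr)[0]["start_lba"])
```

Note that nothing here involves the 446 bytes of boot code in front of the table, which is the only part the BIOS executes.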

> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

I'm not sure about grub although I strongly suspect recent version of
it should work but AFAICS lilo should definitely work as it doesn't
care how the disk is logically organized at all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 23:46 ` Arnd Bergmann
@ 2010-03-10  0:20   ` Tejun Heo
  2010-03-10  9:14   ` Denys Vlasenko
  1 sibling, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/10/2010 08:46 AM, Arnd Bergmann wrote:
> On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
>> Unfortunately, while Windows can assume that newer releases won't
>> share the hard drive with older releases including Windows XP, Linux
>> distros can't do that.  There will be many installations where
>> modern Linux distros share a hard drive with older releases of
>> Windows.  At this point, I can't see a silver bullet solution.
>>
>> Partitioners maybe should only align partitions which will be used by
>> Linux and default to the traditional layout for others while allowing
>> explicit override.  I think Windows XP wouldn't have problem with
>> differently aligned partitions as long as it doesn't actually use them
>> but haven't tested it.
> 
> Any idea if XP can cope with partition tables that use a 32-sector, 128-head
> geometry rather than the default 63-sector, 255-head one? That seems to
> be what some flash memory cards are using and it would make any cylinder
> aligned partition also 4096-byte aligned, at the cost of moving the
> 1024-cylinder boundary from 7.84 GiB to 2 GiB.
> 
> Do we know of anything that requires 63s/255h?

Michal Soltys pointed out that XP doesn't really depend on the legacy
layout although 2000 does (can't boot), so I guess it shouldn't be
much of a problem.

Regarding the geometry, IIUC changing it isn't meaningful for
compatibility.  Geometry information is obtained using a BIOS call
(the int Xh thing) and the hard disk itself doesn't carry that
information, so unless you go into the BIOS set up and enter those
values manually (and I don't think you can do that on many BIOSs these
days), there's no way for anyone else to know custom geometry other
than solving equations using the CHS and LBA information in the
partition table.

So, feeding custom geometry to a partitioner which uses CHS to
determine the layout is useful to make it create partitions aligned in
certain way but as the information regarding the geometry is not
recorded anywhere, others will just keep using whatever they were
using (255*63) and figure that CHS and LBA in the partition tables
just don't match.
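
The mismatch can be shown with the usual CHS/LBA conversion. A sketch, assuming the standard formula LBA = (C * heads + H) * sectors_per_track + (S - 1); the sample LBA is arbitrary.

```python
def chs_to_lba(c: int, h: int, s: int, heads: int, spt: int) -> int:
    """Convert a CHS triple to an LBA for a given geometry (S is 1-based)."""
    return (c * heads + h) * spt + (s - 1)


def lba_to_chs(lba: int, heads: int, spt: int) -> tuple:
    """Convert an LBA to a CHS triple for a given geometry."""
    c, rem = divmod(lba, heads * spt)
    h, s0 = divmod(rem, spt)
    return c, h, s0 + 1


CUSTOM = (128, 32)   # heads, sectors per track used by the partitioner
LEGACY = (255, 63)   # what everyone else keeps assuming

lba = 4096
chs = lba_to_chs(lba, *CUSTOM)
print(chs)                       # (1, 0, 1)
print(chs_to_lba(*chs, *LEGACY)) # 16065, which is not the LBA that was written
```

A reader that assumes 255/63 sees a CHS value that disagrees with the LBA field in the same partition entry, which is the situation described above.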

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* RE: ATA 4 KiB sector issues.
  2010-03-10  0:05       ` Tejun Heo
@ 2010-03-10  0:14           ` Daniel Taylor
  0 siblings, 0 replies; 155+ messages in thread
From: Daniel Taylor @ 2010-03-10  0:14 UTC (permalink / raw)
  To: Tejun Heo, Greg Freemyer
  Cc: H. Peter Anvin, James Bottomley, linux-ide, lkml, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

 
>> GPT can not be used for boot disks in non-EFI systems, right?

> IIUC, I think any BIOS should be able to do so as it only cares about
> the code part of MBR not the partitions and even with GPT the MBR
> remains the same with the partition part describing the rest of the
> while disk as a single chunk containing GPT managed area.  The only
> problem is the older operating systems (like XP) which don't
> understand GPT wouldn't be able to access those partitions.

> Thanks.

The MBR in a GPT installation doesn't map the first GPT partition, it maps
the entire drive after the first sector, as well as marking it type 0xEE.
The start LBA of the file system is not correctly located in the MBR.

I will run some experiments to see if any of the systems on my desk can boot
Linux from a GPT.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 10:06   ` Michal Soltys
@ 2010-03-10  0:11     ` Tejun Heo
  2010-03-14 21:09       ` Michal Soltys
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:11 UTC (permalink / raw)
  To: Michal Soltys
  Cc: Mikael Abrahamsson, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

Hello,

On 03/09/2010 07:06 PM, Michal Soltys wrote:
> Mikael Abrahamsson wrote:
>> On Mon, 8 Mar 2010, Tejun Heo wrote:
>>
>>>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>>
>> Excellent summary.
>>
>>> C-2. Windows XP depends on the traditional partition layout.
>>
>> Is this really true? WD ships their EARS drives with an alignment tool
>> that as far as I can understand, moves the partition so
>> it's aligned to 4KiB:

Hmmm... I based that claim on the MS KB page and, as you pointed out,
the problem there is probably a specific BIOS implementation
interacting badly.  I'll update the doc.

> XP SP2 (or later) can boot from any place, including logical partitions
> (tested that recently). Most important thing is "hidden sectors" (recent
> chain.c32 can set that automatically through ntldr and/or sethidden
> options). No idea about pre-SP2 ; Win 2000 will not boot from
> "misaligned" (with reference to cylinder boundary) partition.

I was thinking about testing XP booting this weekend but really want
to avoid it, so thanks a lot for the info.  I'll update the doc
accordingly but can you please enlighten me on how it works and what's
broken in detail?  So, XP should be fine with any alignment?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 22:46     ` Greg Freemyer
@ 2010-03-10  0:05       ` Tejun Heo
  2010-03-10  0:14           ` Daniel Taylor
  2010-03-10  0:32       ` H. Peter Anvin
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:05 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: H. Peter Anvin, James Bottomley, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

Hello,

On 03/10/2010 07:46 AM, Greg Freemyer wrote:
>> As far as partitioning... I believe we should be using GPT partition tables
>> where possible.  Even on non-EFI systems, it's simply a much better
>> partition table format.
> 
> GPT can not be used for boot disks in non-EFI systems, right?

IIUC, I think any BIOS should be able to do so as it only cares about
the code part of MBR not the partitions and even with GPT the MBR
remains the same with the partition part describing the rest of the
while disk as a single chunk containing GPT managed area.  The only
problem is the older operating systems (like XP) which don't
understand GPT wouldn't be able to access those partitions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 13:55 ` Mark Lord
@ 2010-03-10  0:00   ` Tejun Heo
  2010-03-10  6:08     ` Mark Lord
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-10  0:00 UTC (permalink / raw)
  To: Mark Lord
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/09/2010 10:55 PM, Mark Lord wrote:
> On 03/07/10 22:48, Tejun Heo wrote:
> ..
>> Please note that hdparm is misreporting the alignment offset.  It
>> should be reporting 512 instead of 256 for offset-by-one drives.
> ..
> 
> That issue was fixed quite a while ago.
> Upgrade your elderly copy of hdparm.

Heh heh, *you* were keeping it from me!  Anyways, is there an hdparm
devel tree published somewhere?  I wandered the SF page for quite a
bit (which for some reason is very difficult to find things in) but I
couldn't find one.  If it's not, it might be a good idea to put it on
SF or git.kernel.org?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  7:27       ` Jim Meyering
  (?)
@ 2010-03-09 23:56       ` Tejun Heo
  -1 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-09 23:56 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Karel Zak, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Hello,

On 03/09/2010 04:27 PM, Jim Meyering wrote:
> Related information, prompted by my recent encounter with a
> tool that refused to let me use a GPT partition table.
> 
> Partition table formats: prefer GUID/GPT:
> 
>   Having spent more than my share of time looking at partition table
>   formats recently, I am now strongly biased against DOS partition
>   tables, and for GUID/GPT ones.  In addition to allowing for >2TiB
>   partition offsets and lengths, GPT tables provide for better
>   protection in case of corruption (checksums, backup table at end
>   of disk) and don't have the anachronistic distinction of primary
>   and extended/logical partitions (all partitions are "primary").
>   You can even give each partition a name.  The only reason to use a
>   DOS partition table on a new installation is if you're stuck with
>   a requirement of using an OS like XP on bare metal.
> 
> Please consider encouraging the use of GPT partition tables...
> or at least do not *dis*courage their use.

I'll surely include it.
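
As an illustration of the checksum protection mentioned in the quoted text, here is a sketch of GPT header verification. It assumes the 92-byte header layout from the UEFI specification ("EFI PART" signature at offset 0, header size at offset 12, header CRC32 at offset 16, computed with the CRC field zeroed). build_sample_header() is a hypothetical helper that fills in only the fields needed for the demonstration.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
HEADER_SIZE = 92


def header_crc(header: bytes) -> int:
    """CRC32 over the header with its own CRC field treated as zero."""
    size = struct.unpack_from("<I", header, 12)[0]
    zeroed = header[:16] + b"\x00\x00\x00\x00" + header[20:size]
    return zlib.crc32(zeroed) & 0xFFFFFFFF


def header_is_valid(header: bytes) -> bool:
    """Check signature, declared size and stored CRC32."""
    if len(header) < HEADER_SIZE or header[:8] != GPT_SIGNATURE:
        return False
    size = struct.unpack_from("<I", header, 12)[0]
    if size < HEADER_SIZE or size > len(header):
        return False
    stored = struct.unpack_from("<I", header, 16)[0]
    return stored == header_crc(header)


def build_sample_header(my_lba: int = 1, backup_lba: int = 5119999) -> bytes:
    """Hypothetical helper: a mostly empty header with a correct CRC."""
    hdr = bytearray(HEADER_SIZE)
    hdr[0:8] = GPT_SIGNATURE
    struct.pack_into("<I", hdr, 8, 0x00010000)          # revision 1.0
    struct.pack_into("<I", hdr, 12, HEADER_SIZE)
    struct.pack_into("<QQ", hdr, 24, my_lba, backup_lba)
    struct.pack_into("<I", hdr, 16, header_crc(bytes(hdr)))
    return bytes(hdr)


good = build_sample_header()
damaged = bytearray(good)
damaged[40] ^= 0x01  # flip one bit in the first-usable-LBA field
print(header_is_valid(good), header_is_valid(bytes(damaged)))
```

When the primary header fails this check, a tool can fall back to the backup header whose location is recorded at offset 32.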

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
                   ` (5 preceding siblings ...)
  (?)
@ 2010-03-09 23:46 ` Arnd Bergmann
  2010-03-10  0:20   ` Tejun Heo
  2010-03-10  9:14   ` Denys Vlasenko
  -1 siblings, 2 replies; 155+ messages in thread
From: Arnd Bergmann @ 2010-03-09 23:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that.  There will be many installations where
> modern Linux distros share a hard drive with older releases of
> Windows.  At this point, I can't see a silver bullet solution.
> 
> Partitioners maybe should only align partitions which will be used by
> Linux and default to the traditional layout for others while allowing
> explicit override.  I think Windows XP wouldn't have problem with
> differently aligned partitions as long as it doesn't actually use them
> but haven't tested it.

Any idea if XP can cope with partition tables that use a 32-sector, 128-head
geometry rather than the default 63-sector, 255-head one? That seems to
be what some flash memory cards are using and it would make any cylinder
aligned partition also 4096-byte aligned, at the cost of moving the
1024-cylinder boundary from 7.84 GiB to 2 GiB.

Do we know of anything that requires 63s/255h?
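
The numbers behind this trade-off are easy to verify. A small sketch, assuming 512-byte logical sectors and the two geometries named above:

```python
SECTOR_BYTES = 512
GIB = 2 ** 30


def cylinder_bytes(heads: int, spt: int) -> int:
    """Size of one cylinder in bytes for the given geometry."""
    return heads * spt * SECTOR_BYTES


def chs_limit_bytes(heads: int, spt: int) -> int:
    """Offset of the 1024-cylinder boundary in bytes."""
    return 1024 * cylinder_bytes(heads, spt)


for heads, spt in ((255, 63), (128, 32)):
    cyl = cylinder_bytes(heads, spt)
    print(heads, spt, cyl, cyl % 4096,
          round(chs_limit_bytes(heads, spt) / GIB, 2))
```

With 128 heads and 32 sectors a cylinder is exactly 2 MiB, so every cylinder boundary is a multiple of 4096 bytes; with 255/63 each cylinder leaves a 512-byte remainder.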

	Arnd

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  7:53   ` H. Peter Anvin
  2010-03-08 15:34       ` Martin K. Petersen
@ 2010-03-09 22:46     ` Greg Freemyer
  2010-03-10  0:05       ` Tejun Heo
  2010-03-10  0:32       ` H. Peter Anvin
  1 sibling, 2 replies; 155+ messages in thread
From: Greg Freemyer @ 2010-03-09 22:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

<snip>
>
> As far as partitioning... I believe we should be using GPT partition tables
> where possible.  Even on non-EFI systems, it's simply a much better
> partition table format.
>
>        -hpa

GPT can not be used for boot disks in non-EFI systems, right?

Greg

^ permalink raw reply	[flat|nested] 155+ messages in thread

* RE: ATA 4 KiB sector issues.
  2010-03-08 15:34       ` Martin K. Petersen
@ 2010-03-09 22:36         ` Daniel Taylor
  -1 siblings, 0 replies; 155+ messages in thread
From: Daniel Taylor @ 2010-03-09 22:36 UTC (permalink / raw)
  To: Martin K. Petersen, H. Peter Anvin
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

 
hpa> I would very much like a reference for a platform which has 
hpa> firmware which can successfully boot from 4K-logical media.  It 
hpa> would be very useful for bootloader testing.


I am told that the Mac UEFI platform will boot from 4K logical/physical
drives.

Now I have to scrounge one of the old drives to test it.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  3:18       ` Martin K. Petersen
  (?)
@ 2010-03-09 14:32       ` Mark Lord
  -1 siblings, 0 replies; 155+ messages in thread
From: Mark Lord @ 2010-03-09 14:32 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

On 03/08/10 22:18, Martin K. Petersen wrote:
>>>>>> "Tejun" == Tejun Heo<tj@kernel.org>  writes:
>
> Tejun>  Yeah, I know Mark fixed it but couldn't find where the tree was.
> Tejun>  SF only had old releases, so...
>
> Tejun>  (other stuff replied further down the thread)
>
> Looks like Mark hasn't made an hdparm release since I posted the patch.
..

Holy crap.  I thought I'd put that out months ago!

Anyway, it's there now:  https://sourceforge.net/projects/hdparm/

Thanks!

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
                   ` (4 preceding siblings ...)
  (?)
@ 2010-03-09 13:55 ` Mark Lord
  2010-03-10  0:00   ` Tejun Heo
  -1 siblings, 1 reply; 155+ messages in thread
From: Mark Lord @ 2010-03-09 13:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/07/10 22:48, Tejun Heo wrote:
..
> Please note that hdparm is misreporting the alignment offset.  It
> should be reporting 512 instead of 256 for offset-by-one drives.
..

That issue was fixed quite a while ago.
Upgrade your elderly copy of hdparm.

:)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 11:38             ` Michael Tokarev
@ 2010-03-09 12:20               ` Dave Chinner
  0 siblings, 0 replies; 155+ messages in thread
From: Dave Chinner @ 2010-03-09 12:20 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Karel Zak, Mike Snitzer, Martin K. Petersen, Tejun Heo,
	linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Jim Meyering,
	Neil Brown

On Tue, Mar 09, 2010 at 02:38:57PM +0300, Michael Tokarev wrote:
> Dave Chinner wrote:
> > On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> >> Karel Zak wrote:
> >>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> >>> It works as expected.
> >> Actually, for raid0, the alignment is questionable.  Should it be a
> >> multiple of chunk size or whole stripe size?  I'm not sure, both ways
> >> has bad and good sides..  But if it is the latter, the same issues
> >> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
> > 
> > Yes, alignment is still needed, especially for filesystems that can
> > do stripe unit aligned allocation like XFS. If you don't align the
> > filesystem properly, all the data IO will be mis-aligned to the
> > underlying disks and stripe unit sized IO will hit multiple disks
> > rather than just one....
> 
> I understand alignment is needed, the question is if the alignment
> should be to chunk size or full-stripe size.  In neither case it
> will be bad for underlying disks.

Depends on the RAID implementation. High end RAID arrays often have
cache bypass features that are triggered by stripe width aligned and
sized IOs. When receiving well formed IO they can more than double
write performance because they are not limited by internal cache
mirroring bandwidth (e.g. the controller magically switches to
write-through for those well formed IOs instead of writeback).

So from that perspective, alignment needs to be to stripe width,
not stripe unit. Similarly for RAID5/6 alignment needs to be to
stripe width, so that a well formed IO issued by the filesystem
only hits one RAID5/6 stripe.
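
To make the stripe unit versus stripe width distinction concrete, here is a small sketch. It assumes stripe width = chunk size times the number of data (non-parity) disks, and uses a 4-device RAID5 with 64 KiB chunks and a 1 MiB partition offset as illustrative inputs.

```python
KIB = 1024


def stripe_width(chunk_bytes: int, data_disks: int) -> int:
    """Full stripe size: one chunk on every data-bearing disk."""
    return chunk_bytes * data_disks


def alignment_report(offset_bytes: int, chunk_bytes: int,
                     data_disks: int) -> dict:
    """Report whether an offset is aligned to stripe unit and stripe width."""
    width = stripe_width(chunk_bytes, data_disks)
    return {
        "stripe_width": width,
        "stripe_unit_aligned": offset_bytes % chunk_bytes == 0,
        "stripe_width_aligned": offset_bytes % width == 0,
    }


# 4-device RAID5 -> 3 data disks, 64 KiB chunk, partition at 1 MiB.
print(alignment_report(1024 * KIB, 64 * KIB, 3))
# Same array, partition at 1536 KiB (8 full stripes).
print(alignment_report(1536 * KIB, 64 * KIB, 3))
```

The 1 MiB offset is aligned to the stripe unit but not to the 192 KiB stripe width, which is why a power-of-two default is not sufficient once the number of data disks is not a power of two.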

FWIW, XFS takes great care to ensure that it doesn't place all its
allocation group headers on the same stripe unit.  Failing to
distribute the AG headers across all the stripe units evenly loads
the disks/luns in the stripe unevenly. As soon as you have uneven
load on a stripe the performance tanks as a stripe is only as fast as
its slowest member.

Also, while XFS prefers to align to stripe unit, there are mount
options to change the default allocation alignment to be stripe
width based. Hence if you have large files and applications that are
doing well formed IO, stripe width alignment of the filesystem to
the underlying block device is critical to achieving deterministic
throughput close to the maximum the hardware can support.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 10:16         ` Michael Tokarev
  2010-03-09 11:15           ` Dave Chinner
  2010-03-09 11:50           ` Karel Zak
@ 2010-03-09 12:18           ` Karel Zak
  2010-03-10  5:06             ` Martin K. Petersen
  2 siblings, 1 reply; 155+ messages in thread
From: Karel Zak @ 2010-03-09 12:18 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Mike Snitzer, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare, Jim Meyering, Neil Brown

On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> >  # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
> 
> That's 3-disk stripe size with default 64KiB chunk size, which makes
> 3x64=192KiB - the number to which everything should be aligned.
> 
> >  # fdisk -lcu /dev/md8
> > 
> >  Disk /dev/md8: 1572 MB, 1572667392 bytes
> >  2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> >  Units = sectors of 1 * 512 = 512 bytes
> >  Sector size (logical/physical): 512 bytes / 4096 bytes
> >  I/O size (minimum/optimal): 65536 bytes / 65536 bytes
> 
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.

 Well, the same setup with 2.6.34-0.9.rc0.git13.fc14.x86_64:

 # fdisk -luc /dev/sdb

 Disk /dev/sdb: 2621 MB, 2621440000 bytes
 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 32768 bytes
 Disk identifier: 0x77fbab55

 Device Boot         Start         End      Blocks   Id  System
 /dev/sdb1            2048     1026047      512000   83  Linux
 /dev/sdb2         1026048     2050047      512000   83  Linux
 /dev/sdb3         2050048     3074047      512000   83  Linux
 /dev/sdb4         3074048     4098047      512000   83  Linux


 # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}


 # fdisk -luc /dev/md8

 Disk /dev/md8: 1572 MB, 1572667392 bytes
 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 65536 bytes / 65536 bytes


 # cat /sys/block/md8/queue/{minimum,optimal}_io_size 
 65536
 65536
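
 The same values can be collected programmatically. A sketch, assuming the
 sysfs attribute names shown in this thread; read_topology() and its
 sysfs_root parameter are hypothetical and exist only so the function can be
 pointed at a test directory.

```python
from pathlib import Path

QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_topology(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Read the I/O topology attributes of a block device from sysfs."""
    base = Path(sysfs_root) / dev
    topo = {name: int((base / "queue" / name).read_text())
            for name in QUEUE_ATTRS}
    topo["alignment_offset"] = int((base / "alignment_offset").read_text())
    return topo


def looks_stacked_correctly(topo: dict, data_disks: int) -> bool:
    """Heuristic: optimal_io_size should be minimum_io_size * data disks."""
    return topo["optimal_io_size"] == topo["minimum_io_size"] * data_disks
```

 For the md8 array above, read_topology("md8") followed by
 looks_stacked_correctly(topo, 3) would return False, since both sizes are
 reported as 65536.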

> >  # cat /sys/block/md8/md8p{1,2}/alignment_offset
> >  0
> >  0
> 
> And that's where the issue is.  md does not {sup,re}port all
> this stuff yet.

 Hmm...

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 10:16         ` Michael Tokarev
  2010-03-09 11:15           ` Dave Chinner
@ 2010-03-09 11:50           ` Karel Zak
  2010-03-09 12:18           ` Karel Zak
  2 siblings, 0 replies; 155+ messages in thread
From: Karel Zak @ 2010-03-09 11:50 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Mike Snitzer, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare, Jim Meyering, Neil Brown

On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> >  # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
> 
> That's a 3-disk stripe with the default 64KiB chunk size, which makes
> 3x64=192KiB - the number to which everything should be aligned.
> 
> >  # fdisk -lcu /dev/md8
> > 
> >  Disk /dev/md8: 1572 MB, 1572667392 bytes
> >  2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> >  Units = sectors of 1 * 512 = 512 bytes
> >  Sector size (logical/physical): 512 bytes / 4096 bytes
> >  I/O size (minimum/optimal): 65536 bytes / 65536 bytes
> 
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.
> 
> []
> >  # cat /sys/block/md8/md8p{1,2}/alignment_offset
> >  0
> >  0
> 
> And that's where the issue is.  md does not {sup,re}port all
> this stuff yet.
> 
> This is what I'm talking about.

Note that I have the 2.6.31.12-174.2.22.fc12.x86_64 kernel on my laptop.
For serious tests it would be better to use 2.6.33.

    Karel
 
-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 11:15           ` Dave Chinner
@ 2010-03-09 11:38             ` Michael Tokarev
  2010-03-09 12:20               ` Dave Chinner
  0 siblings, 1 reply; 155+ messages in thread
From: Michael Tokarev @ 2010-03-09 11:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Karel Zak, Mike Snitzer, Martin K. Petersen, Tejun Heo,
	linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Jim Meyering,
	Neil Brown

Dave Chinner wrote:
> On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
>> Karel Zak wrote:
>>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
>>> It works as expected.
>> Actually, for raid0, the alignment is questionable.  Should it be a
>> multiple of chunk size or whole stripe size?  I'm not sure, both ways
>> have bad and good sides.  But if it is the latter, the same issue
>> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
> 
> Yes, alignment is still needed, especially for filesystems that can
> do stripe unit aligned allocation like XFS. If you don't align the
> filesystem properly, all the data IO will be mis-aligned to the
> underlying disks and stripe unit sized IO will hit multiple disks
> rather than just one....

I understand that alignment is needed; the question is whether the
alignment should be to the chunk size or to the full-stripe size.  In
neither case will it be bad for the underlying disks.

/mjt

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 10:16         ` Michael Tokarev
@ 2010-03-09 11:15           ` Dave Chinner
  2010-03-09 11:38             ` Michael Tokarev
  2010-03-09 11:50           ` Karel Zak
  2010-03-09 12:18           ` Karel Zak
  2 siblings, 1 reply; 155+ messages in thread
From: Dave Chinner @ 2010-03-09 11:15 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Karel Zak, Mike Snitzer, Martin K. Petersen, Tejun Heo,
	linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Jim Meyering,
	Neil Brown

On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> > It works as expected.
> 
> Actually, for raid0, the alignment is questionable.  Should it be a
> multiple of chunk size or whole stripe size?  I'm not sure, both ways
> have bad and good sides.  But if it is the latter, the same issue
> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.

Yes, alignment is still needed, especially for filesystems that can
do stripe unit aligned allocation like XFS. If you don't align the
filesystem properly, all the data IO will be mis-aligned to the
underlying disks and stripe unit sized IO will hit multiple disks
rather than just one....
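
A minimal sketch of the effect described above, assuming a plain raid0
layout in which consecutive chunks rotate across the member disks.  The
function name and parameters are hypothetical and exist only for this
illustration; this is not md's actual mapping code.

```python
def disks_touched(offset: int, length: int, chunk: int, ndisks: int) -> set:
    """Return the member-disk indices hit by one I/O on a raid0 array.

    offset and length are in bytes from the start of the array; chunk is
    the stripe unit in bytes.  Chunk number c is assumed to live on
    disk c % ndisks.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first_chunk = offset // chunk
    last_chunk = (offset + length - 1) // chunk
    return {c % ndisks for c in range(first_chunk, last_chunk + 1)}
```

With a 64KiB chunk on 3 disks, a 64KiB I/O at offset 0 stays on one disk,
while the same I/O shifted by 63 sectors (32256 bytes, the traditional DOS
partition start) straddles a chunk boundary and hits two disks.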

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09 10:01       ` Karel Zak
@ 2010-03-09 10:16         ` Michael Tokarev
  2010-03-09 11:15           ` Dave Chinner
                             ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Michael Tokarev @ 2010-03-09 10:16 UTC (permalink / raw)
  To: Karel Zak
  Cc: Mike Snitzer, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare, Jim Meyering, Neil Brown

Karel Zak wrote:
> On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
[]
>> Think of a raid5 array - with all the mentioned good stuff in place
>> fdisk should figure out to align partitions on the array stripe
>> boundary, and should do that automatically.  And this should be
> 
> Yes. For userspace there is no difference between a RAID and a non-RAID
> device -- the topology support in the kernel provides a unified API for
> all devices. It means we don't need any extra support for RAID in
> fdisk/parted. The userspace tools follow the topology data from the kernel.
> 
> The good thing with 1MiB default alignment is that it is usable for
> usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
> size).

No, it's not that simple.  For raid5 (and I especially mentioned raid5
above), raid4 and raid6, 1MiB is only good when the number of devices
is 2^N+1 (for raid[45]) or 2^N+2 (for raid6).  For raid5 that means
3, 5, 9, 17, .. disks.  In all other cases the alignment (which should
match the stripe size) will not be a power of two.  For example, for a
4-disk raid5 array with 1MiB chunk size the partitions should be aligned
at 3MiB boundaries.  For 6-disk raid5 with 256KiB chunk size it is
5x256=1280KiB.  And so on.
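
The arithmetic above can be sketched as follows.  This is only an
illustration; the helper names are made up for the example, and the parity
counts (one disk's worth for raid4/5, two for raid6) are the only facts it
relies on.

```python
# Parity disks' worth of capacity per stripe for each RAID level.
PARITY_DISKS = {0: 0, 4: 1, 5: 1, 6: 2}


def full_stripe_kib(level: int, devices: int, chunk_kib: int) -> int:
    """Full data stripe width in KiB: data disks times chunk size."""
    parity = PARITY_DISKS[level]
    if devices <= parity:
        raise ValueError("not enough devices for this RAID level")
    return (devices - parity) * chunk_kib


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def one_mib_is_stripe_aligned(level: int, devices: int, chunk_kib: int) -> bool:
    """True when a 1MiB boundary is also a full-stripe boundary."""
    return 1024 % full_stripe_kib(level, devices, chunk_kib) == 0
```

For example, full_stripe_kib(5, 4, 1024) gives 3072 (3MiB) and
full_stripe_kib(5, 6, 256) gives 1280, neither of which is a power of two,
whereas a 5-disk raid5 with 64KiB chunks has a 256KiB stripe that divides
1MiB evenly.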

Yes it has little to do with the $subject (4KiB sectors), but it is
closely related still.

>> most easy to debug/test, since the whole thing is controllable
>> by kernel.
> 
> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> It works as expected.

Actually, for raid0, the alignment is questionable.  Should it be a
multiple of chunk size or whole stripe size?  I'm not sure, both ways
have bad and good sides.  But if it is the latter, the same issue
pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.

[]
>  Disk /dev/sdb: 2621 MB, 2621440000 bytes
>  255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
>  Units = sectors of 1 * 512 = 512 bytes
>  Sector size (logical/physical): 512 bytes / 4096 bytes
>  I/O size (minimum/optimal): 4096 bytes / 32768 bytes

Good.

>  # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

That's a 3-disk stripe with the default 64KiB chunk size, which makes
3x64=192KiB - the number to which everything should be aligned.

>  # fdisk -lcu /dev/md8
> 
>  Disk /dev/md8: 1572 MB, 1572667392 bytes
>  2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
>  Units = sectors of 1 * 512 = 512 bytes
>  Sector size (logical/physical): 512 bytes / 4096 bytes
>  I/O size (minimum/optimal): 65536 bytes / 65536 bytes

And here we go: fdisk does not see the right number: nothing
is divisible by 3.

[]
>  # cat /sys/block/md8/md8p{1,2}/alignment_offset
>  0
>  0

And that's where the issue is.  md does not {sup,re}port all
this stuff yet.

This is what I'm talking about.

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  6:34 ` Mikael Abrahamsson
@ 2010-03-09 10:06   ` Michal Soltys
  2010-03-10  0:11     ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Michal Soltys @ 2010-03-09 10:06 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

Mikael Abrahamsson wrote:
> On Mon, 8 Mar 2010, Tejun Heo wrote:
> 
>>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
> 
> Excellent summary.
> 
>> C-2. Windows XP depends on the traditional partition layout.
> 
> Is this really true? WD ships their EARS drives with an alignment tool 
> that as far as I can understand, moves the partition so
> it's aligned to 4KiB:
> 

XP SP2 (or later) can boot from any place, including logical partitions
(tested that recently). The most important thing is "hidden sectors" (recent
chain.c32 can set that automatically through the ntldr and/or sethidden
options). No idea about pre-SP2; Win 2000 will not boot from a partition
that is "misaligned" with respect to the cylinder boundary.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  6:53     ` Michael Tokarev
@ 2010-03-09 10:01       ` Karel Zak
  2010-03-09 10:16         ` Michael Tokarev
  2010-03-10  4:57         ` Martin K. Petersen
  1 sibling, 1 reply; 155+ messages in thread
From: Karel Zak @ 2010-03-09 10:01 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Mike Snitzer, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, H. Peter Anvin,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare, Jim Meyering, Neil Brown

On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
> Mike Snitzer wrote:
> []
> > I've been keeping track of all the pieces in play, have coordinated
> > with kzak and jim, and have a summary that offers some amount of macro
> > detail (at the end I touch on parted and fdisk):
> > 
> > http://people.redhat.com/msnitzer/docs/io-limits.txt
> 
> What I don't see in this thread and in this document is - any mention
> of linux md layer.  I think it is the first candidate to test the whole
> thing, the easiest and most important one.  I mean the alignment and
> "recommended I/O size" and all this similar stuff.
> 
> Think of a raid5 array - with all the mentioned good stuff in place
> fdisk should figure out to align partitions on the array stripe
> boundary, and should do that automatically.  And this should be

Yes. For userspace there is no difference between a RAID and a non-RAID
device -- the topology support in the kernel provides a unified API for
all devices. It means we don't need any extra support for RAID in
fdisk/parted. The userspace tools follow the topology data from the kernel.

The good thing with 1MiB default alignment is that it is usable for
usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
size).
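
A rough sketch of the policy as described here: align to 1MiB, or to the
optimal I/O size when that is larger, and then shift by alignment_offset.
This is a simplified reading of the description, not the actual fdisk or
parted code, and the function names are hypothetical.

```python
MIB = 1024 * 1024


def alignment_grain(optimal_io_size: int) -> int:
    """Alignment grain in bytes: 1MiB unless optimal I/O size is bigger."""
    return max(MIB, optimal_io_size)


def first_aligned_sector(min_start_sector: int, grain: int,
                         alignment_offset: int = 0,
                         sector_size: int = 512) -> int:
    """First sector >= min_start_sector lying at k*grain + alignment_offset."""
    start_bytes = min_start_sector * sector_size
    # Ceiling division, done on the offset-corrected position.
    k = -(-(start_bytes - alignment_offset) // grain)
    if k < 0:
        k = 0
    aligned_bytes = k * grain + alignment_offset
    if aligned_bytes % sector_size != 0:
        raise ValueError("aligned position is not on a sector boundary")
    return aligned_bytes // sector_size
```

With no topology hints the first usable start after sector 1 is sector
2048, matching the fdisk listings in this thread.  Note that with an
optimal I/O size below 1MiB that does not divide 1MiB (192KiB, say), this
simple rule still picks 1MiB.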

> most easy to debug/test, since the whole thing is controllable
> by kernel.

I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
It works as expected. (Note that kernel 2.6.31 has a problem with
alignment_offset calculation on stacked devices, so use the latest
kernel where the bug is already fixed.)

But I haven't tried using unpartitioned (whole) 4K disks for RAID,
because scsi_debug does not allow creating more devices (and I don't
have real hardware).

Some tests are available in util-linux-ng sources:
http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=tree;f=tests/ts/fdisk

    Karel


 # modprobe scsi_debug dev_size_mb=2500 sector_size=512 physblk_exp=3

    [..create partitions...]

 # fdisk -lcu /dev/sdb 

 Disk /dev/sdb: 2621 MB, 2621440000 bytes
 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 32768 bytes
 Disk identifier: 0xb585b0be

 Device Boot         Start         End      Blocks   Id  System
 /dev/sdb1            2048     1026047      512000   83  Linux
 /dev/sdb2         1026048     2050047      512000   83  Linux
 /dev/sdb3         2050048     3074047      512000   83  Linux
 /dev/sdb4         3074048     4098047      512000   83  Linux


 # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

     [...create partitions on the raid...]

 # fdisk -lcu /dev/md8

 Disk /dev/md8: 1572 MB, 1572667392 bytes
 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 65536 bytes / 65536 bytes
 Disk identifier: 0x1bb6fd8d

 Device Boot          Start         End      Blocks   Id  System
 /dev/md8p1            2048     1435647      716800   83  Linux
 /dev/md8p2         1435648     2869247      716800   83  Linux


 Check offsets (alignment):

 # cat /sys/block/sdb/sdb{1,2,3,4}/alignment_offset
 0
 0
 0
 0

 # cat /sys/block/md8/md8p{1,2}/alignment_offset
 0
 0
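
 The same values can be collected programmatically.  The sketch below reads
 the standard sysfs topology attributes for a whole disk; the function names
 and the sysfs_root parameter (added so the code can be pointed at a test
 directory) are assumptions made for this example.

```python
from pathlib import Path


def read_int(path: Path) -> int:
    """Read a sysfs attribute that holds a single decimal integer."""
    return int(path.read_text().strip())


def topology(disk: str, sysfs_root: str = "/sys") -> dict:
    """Collect I/O topology values for a whole-disk device such as 'md8'."""
    base = Path(sysfs_root) / "block" / disk
    queue = base / "queue"
    return {
        "logical_block_size": read_int(queue / "logical_block_size"),
        "physical_block_size": read_int(queue / "physical_block_size"),
        "minimum_io_size": read_int(queue / "minimum_io_size"),
        "optimal_io_size": read_int(queue / "optimal_io_size"),
        "alignment_offset": read_int(base / "alignment_offset"),
    }
```

 On the md8 array shown above, topology("md8") would be expected to report
 65536 for both minimum_io_size and optimal_io_size.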

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 19:58   ` Karel Zak
@ 2010-03-09  7:27       ` Jim Meyering
  2010-03-09  7:27       ` Jim Meyering
  1 sibling, 0 replies; 155+ messages in thread
From: Jim Meyering @ 2010-03-09  7:27 UTC (permalink / raw)
  To: Karel Zak
  Cc: Martin K. Petersen, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

Karel Zak wrote:
> On Mon, Mar 08, 2010 at 10:18:27AM -0500, Martin K. Petersen wrote:
...
>> It'd be great if you guys could share what you have been doing to the
>> tooling.
>
>  small summary:
>
>  - libblkid provides unified API to topology information, it supports:
>     - ioctls (kernel >= 2.6.32)
>     - sysfs (kernel >= 2.6.31)
>     - stripe chunk size and stripe width for DM, MD, LVM and EVMS on
>       old kernels
>  - libparted and fdisk are linked against libblkid
>
>  - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
>  - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
>  - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
>    and alignment_offset for all partitions in non-DOS mode
>    (util-linux-ng >= 2.17.1)
>
>  - parted supports 4KiB physical sector size
>  - parted uses 1MiB alignment for disks with unknown topology, disks
>    with topology information are aligned to optimal (or minimum) I/O
>    size (parted >= 2.1)
>
>  - EFI GPT code in the kernel has been updated to work properly with
>    4KiB sectors (kernel >= 2.6.33)
>
>  - mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
>    topology information, mkfs.{ext,xfs} are linked against libblkid
>    for compatibility with old kernel (for stripe chunk size / width)
>
>  - Fedora-13/RHEL6 installer uses libparted with 4KiB support
>
>  - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)

Thanks for the summary, Karel.
In case anyone wants more high-level detail on the parted front,
here's its NEWS file:

    http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS

Currently, I'm not planning much for Parted, other than clean-up.
For example, I want to remove all of the FS-related code (it's
horribly bit-rotted) from the package, with the exception of
HFS/HFS+ and FAT resizing capabilities, since AFAIK, Parted
has the only free implementations.  If any of you know of other
implementations or work in progress, please let me know.


Related information, prompted by my recent encounter with a
tool that refused to let me use a GPT partition table.

Partition table formats: prefer GUID/GPT:

  Having spent more than my share of time looking at partition table
  formats recently, I am now strongly biased against DOS partition
  tables, and for GUID/GPT ones.  In addition to allowing for >2TiB
  partition offsets and lengths, GPT tables provide for better
  protection in case of corruption (checksums, backup table at end
  of disk) and don't have the anachronistic distinction of primary
  and extended/logical partitions (all partitions are "primary").
  You can even give each partition a name.  The only reason to use a
  DOS partition table on a new installation is if you're stuck with
  a requirement of using an OS like XP on bare metal.
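
  The size limit follows from the field widths: a DOS partition entry
  stores the start LBA and the sector count as 32-bit values, while GPT
  uses 64-bit LBAs.  A small sketch of the arithmetic (the function name
  is made up for this example):

```python
TIB = 2 ** 40


def max_addressable_bytes(lba_bits: int, logical_sector_size: int) -> int:
    """Upper bound on addressable bytes for a given LBA field width."""
    return (2 ** lba_bits) * logical_sector_size


# DOS/MBR: 32-bit fields.
mbr_512 = max_addressable_bytes(32, 512)
mbr_4096 = max_addressable_bytes(32, 4096)
# GPT: 64-bit LBAs.
gpt_512 = max_addressable_bytes(64, 512)
```

  With 512-byte logical sectors the DOS table tops out at 2TiB; with
  4096-byte logical sectors the same 32-bit fields reach 16TiB, which is
  why large drives used with a DOS table need the bigger logical sector.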

Please consider encouraging the use of GPT partition tables...
or at least do not *dis*courage their use.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 19:34   ` Mike Snitzer
  2010-03-09  2:53     ` Tejun Heo
@ 2010-03-09  6:53     ` Michael Tokarev
  2010-03-09 10:01       ` Karel Zak
  2010-03-10  4:57         ` Martin K. Petersen
  1 sibling, 2 replies; 155+ messages in thread
From: Michael Tokarev @ 2010-03-09  6:53 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, Karel Zak, Jim Meyering, Neil Brown

Mike Snitzer wrote:
[]
> I've been keeping track of all the pieces in play, have coordinated
> with kzak and jim, and have a summary that offers some amount of macro
> detail (at the end I touch on parted and fdisk):
> 
> http://people.redhat.com/msnitzer/docs/io-limits.txt

What I don't see in this thread or in this document is any mention
of the Linux md layer.  I think it is the first candidate to test the
whole thing, the easiest and most important one.  I mean the alignment
and "recommended I/O size" and all this similar stuff.

Think of a raid5 array - with all the mentioned good stuff in place
fdisk should figure out to align partitions on the array stripe
boundary, and should do that automatically.  And this should be the
easiest to debug/test, since the whole thing is controllable
by the kernel.

But apparently it does not implement anything of this sort.
Adding Neilb to the Cc list.......

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
                   ` (3 preceding siblings ...)
  (?)
@ 2010-03-09  6:34 ` Mikael Abrahamsson
  2010-03-09 10:06   ` Michal Soltys
  -1 siblings, 1 reply; 155+ messages in thread
From: Mikael Abrahamsson @ 2010-03-09  6:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

On Mon, 8 Mar 2010, Tejun Heo wrote:

>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

Excellent summary.

> C-2. Windows XP depends on the traditional partition layout.

Is this really true? WD ships their EARS drives with an alignment tool 
that as far as I can understand, moves the partition so
it's aligned to 4KiB:

http://www.wdc.com/en/products/advancedformat/

So an XP fresh install (including letting XP partition the drive) will be 
misaligned, but if you clone xp onto a properly aligned partition (or run 
the tool and let it move the partition), it'll be ok. So saying that XP 
"depends" on traditional partition layout might be a bit of a stretch?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  3:38         ` Daniel Taylor
@ 2010-03-09  4:54           ` Martin K. Petersen
  -1 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-09  4:54 UTC (permalink / raw)
  To: Daniel Taylor
  Cc: Tejun Heo, Karel Zak, Martin K. Petersen, linux-ide, lkml,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, Jim Meyering

>>>>> "DLT" == Daniel Taylor <Daniel.Taylor@wdc.com> writes:

DLT> Simple reality is that XP is "forever".  Drives >2TiB, which may be
DLT> USB-attached, used with XP will be MBR-partitioned and use
DLT> 4096-byte sectors.  We need to be able to read/write those disks on
DLT> Linux systems.

Shouldn't be a problem as long as the DOS partition table vs. 4 KiB
sectors thing is fixed.


DLT> One last comment: I just tried to partition and format a >2TiB
DLT> drive on fully updated Ubuntu 9.10 with GParted.  I selected not to
DLT> cylinder align, use GPT and ext3, and to put 1 MiB preceding and
DLT> following.  libparted failed with "unable to satisfy all
DLT> constraints of the partition".  Using "parted", I created the
DLT> partition, and then GParted was able to apply the ext3 file system.

I don't think Ubuntu has adopted any of the relevant updates yet.

I believe the Fedora 13 Alpha is due to be released this week.  That
would be the best test platform because several of the people who have
been actively engaged in the 4 KiB sector enablement process are Fedora
developers.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:34     ` Tejun Heo
                         ` (3 preceding siblings ...)
  2010-03-09  3:38         ` Daniel Taylor
@ 2010-03-09  3:41       ` Felix Miata
  4 siblings, 0 replies; 155+ messages in thread
From: Felix Miata @ 2010-03-09  3:41 UTC (permalink / raw)
  To: linux-ide

On 2010/03/09 11:34 (GMT+0900) Tejun Heo composed:

>>> With regards to XP compatibility I don't think we should go too much out
>>> of our way to accommodate it.  XP has been disowned by its master and I
>>> think virtualization will take care of the rest.

> Yeah, good point.  I'm just a bit worried that it might generate a lot
> of frustrated bug reports.  Well, maybe we should just advise users to
> install windows first and then install Linux.

That's too crude. Installing Windows first has never been necessary on any of
the many multiboot systems I've configured. If competent partitioning is
performed prior to installing anything, as I do, no one else should need to
either. Besides, one cannot normally reinstall Windows after installing other
operating systems, and having to reinstall it is hardly uncommon, so any plan
that includes that instruction is broken from inception.

FWIW, OS/2 has existed since before Linux, and will likely continue to
exist after XP has been all but forgotten. Yet it's not too likely ever to be
GPT-capable.
-- 
"The wise are known for their understanding, and pleasant
words are persuasive." Proverbs 16:21 (New Living Translation)

 Team OS/2 ** Reg. Linux User #211409

Felix Miata  ***  http://fm.no-ip.com/

^ permalink raw reply	[flat|nested] 155+ messages in thread

* RE: ATA 4 KiB sector issues.
  2010-03-09  2:34     ` Tejun Heo
@ 2010-03-09  3:38         ` Daniel Taylor
  2010-03-09  2:42       ` Tejun Heo
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 155+ messages in thread
From: Daniel Taylor @ 2010-03-09  3:38 UTC (permalink / raw)
  To: Tejun Heo, Karel Zak
  Cc: Martin K. Petersen, linux-ide, lkml, Jeff Garzik, Mark Lord,
	tytso, H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

 

-----Original Message-----
From: Tejun Heo [mailto:tj@kernel.org] 
Sent: Monday, March 08, 2010 6:34 PM
To: Karel Zak
Cc: Martin K. Petersen; linux-ide@vger.kernel.org; lkml; Daniel Taylor; Jeff
Garzik; Mark Lord; tytso@mit.edu; H. Peter Anvin;
hirofumi@mail.parknet.co.jp; Andrew Morton; Alan Cox; irtiger@gmail.com;
Matthew Wilcox; aschnell@suse.de; knikanth@suse.de; jdelvare@suse.de; Jim
Meyering
Subject: Re: ATA 4 KiB sector issues.

Hello,

On 03/09/2010 04:58 AM, Karel Zak wrote:
>> Tejun> Reportedly, commonly used partitioners aren't ready to handle 
>> Tejun> drives larger than 2 TiB in any configuration and alignment 
>> Tejun> isn't
> 
> The limit is specific for DOS partition table (with 512-byte log.
> sectors), but for example GPT uses 64-bit LBA. I believe that our 
> partitioning tools don't introduce any other restriction.

Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just
misinterpreted the conversation.  Daniel, can you please fill in?

DLT> The problem that I see is that the installers and upper level
DLT> applications do not make good choices for partition layout.
DLT> "parted", itself, seems to work OK in the latest version.  One of
DLT> the things I've heard since I started this process is that there
DLT> are some libraries associated with the process of
DLT> partitioning/formatting.  Perhaps the upper layers and those
DLT> libraries aren't synced up?

>> Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB 
>> Tejun> logical sector support is broken in both the kernel
>>
>> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2 
>> years.

By default, they aren't aligned properly, are they?

>> Tejun> (need more details and probably a whole section on partitioner
>> Tejun> behaviors)
>>
>> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the 
>> alignment work for fdisk and parted respectively.  Karel, Jim: The 
>> full writeup is here:
>>
>> 	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>>
>> It'd be great if you guys could share what you have been doing to the 
>> tooling.
> 
>  small summary:
> 
>  - libblkid provides unified API to topology information, it supports:
>     - ioctls (kernel >= 2.6.32)
>     - sysfs (kernel >= 2.6.31)
>     - stripe chunk size and stripe width for DM, MD, LVM and EVMS on
>       old kernels
>  - libparted and fdisk are linked against libblkid
> 
>  - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
>  - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
>  - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
>    and alignment_offset for all partitions in non-DOS mode
>    (util-linux-ng >= 2.17.1)

That's great.  Daniel, maybe you were testing older versions?  Or maybe
those failures were manifested from libata mishandling 4KiB r/w requests.

DLT> As I said above, it could be libraries.  I was not aware that so
DLT> much of the implementation was embedded there.

>  - parted supports 4KiB physical sector size
>  - parted uses 1MiB alignment for disks with unknown topology, disks
>    with topology information are aligned to optimal (or minimum) I/O
>    size (parted >= 2.1)

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1).  It would
probably be best to align to at least 1MiB.

DLT> Please.

>  - EFI GPT code in the kernel has been updated to work properly with
>    4KiB sectors (kernel >= 2.6.33)

libata is broken for logical 4KiB ATA devices tho.  I'll fix it up.

>  - mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
>    topology information, mkfs.{ext,xfs} are linked against libblkid
>    for compatibility with old kernel (for stripe chunk size / width)
> 
>  - Fedora-13/RHEL6 installer uses libparted with 4KiB support
> 
>  - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
> 
>> Tejun> Unfortunately, the transition to 4 KiB sector size, physical 
>> Tejun> only or logical too, is looking fairly ugly.  Hopefully, a 
>> Tejun> reasonable solution can be reached in not too distant future 
>> Tejun> but even with all the software side updated, it looks like 
>> Tejun> it's gonna cause significant amount of confusion and frustration.
>>
>> With regards to XP compatibility I don't think we should go too much 
>> out of our way to accommodate it.  XP has been disowned by its master 
>> and I think virtualization will take care of the rest.

Yeah, good point.  I'm just a bit worried that it might generate a lot of
frustrated bug reports.  Well, maybe we should just advise users to install
windows first and then install Linux.

DLT> Simple reality is that XP is "forever".  Drives >2TiB, which may be
DLT> USB-attached, used with XP will be MBR-partitioned and use 4096-byte
DLT> sectors.  We need to be able to read/write those disks on Linux
DLT> systems.

>> FWIW, recent fdisk has a command line flag that will enable/disable 
>> DOS compatible layout.
> 
>  yes, util-linux-ng 2.17.1, fdisk -c
>  
>  Note that non-DOS mode will be default in the next major  
> util-linux-ng release.

I'll try to merge this information into the ata-4k doc.

Thank you very much.

DLT> One last comment: I just tried to partition and format a >2TiB drive
DLT> on fully updated Ubuntu 9.10 with GParted.  I selected not to cylinder
DLT> align, use GPT and ext3, and to put 1 MiB preceding and following.
DLT> libparted failed with "unable to satisfy all constraints of the
DLT> partition".  Using "parted", I created the partition, and then GParted
DLT> was able to apply the ext3 file system.
--
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:53     ` Tejun Heo
@ 2010-03-09  3:20         ` Martin K. Petersen
  0 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-09  3:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, Karel Zak, Jim Meyering

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

>> http://people.redhat.com/msnitzer/docs/io-limits.txt

Tejun> Ah... this is great.  I'll link the doc and shamelessly steal
Tejun> parts of it if that's okay with you.

There's also this one:

    http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf

It is more aimed at storage vendors than end users, though.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:44   ` Tejun Heo
@ 2010-03-09  3:18       ` Martin K. Petersen
  0 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-09  3:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Martin K. Petersen, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Karel Zak, Jim Meyering

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> Yeah, I know Mark fixed it but couldn't find where the tree was.
Tejun> SF only had old releases, so...

Tejun> (other stuff replied further down the thread)

Looks like Mark hasn't made an hdparm release since I posted the patch.
It's here:

http://marc.info/?l=linux-ide&m=126427438620651&w=2

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:42       ` Tejun Heo
@ 2010-03-09  3:11           ` Martin K. Petersen
  0 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-09  3:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Karel Zak, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, Jim Meyering

>>>>> "Tejun" == Tejun Heo <htejun@gmail.com> writes:

>> This will result in incorrect alignment for drives which lie about
>> the physical sector size to work around BIOS/drivers issues (C-1).
>> It would probably be best to align to at least 1MiB.

Tejun> I misread it.  C-1 would be disks w/o alignment information which
Tejun> will be aligned to optimal_io_size which again would be 0 and
Tejun> thus 1MiB alignment.  So, this should work, right?

Correct.  ATA only provides physical block size whereas SCSI has the
extra knobs in the block limits VPD.  And consequently ATA block devices
have min_io = physical block size and optimal_io = 0.

So we'll align to 1 MB by default.
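
A minimal sketch of that fallback, in Python.  This is not the actual
parted or fdisk code; the function name pick_alignment and the 1 MiB
floor are assumptions based on the behaviour described in this thread
("1MiB alignment, or more if optimal I/O size is bigger").  The inputs
stand for the minimum_io_size and optimal_io_size values the kernel
exports under /sys/block/<dev>/queue/.

```python
MIB = 1024 * 1024


def pick_alignment(min_io: int, optimal_io: int, default: int = MIB) -> int:
    """Return a partition alignment grain in bytes (illustrative only)."""
    # Use optimal_io when the device reports one; ATA devices report 0,
    # so they fall through to the default.
    grain = optimal_io if optimal_io > 0 else default
    # Assumed policy: never align to less than the default grain.
    grain = max(grain, default)
    # Keep the grain a multiple of min_io (the physical block size on ATA)
    # by rounding up when it is not one already.
    if min_io > 0 and grain % min_io != 0:
        grain = ((grain + min_io - 1) // min_io) * min_io
    return grain


if __name__ == "__main__":
    # 512-byte logical / 4 KiB physical ATA drive: min_io=4096, optimal_io=0
    print(pick_alignment(4096, 0))
    # Hypothetical array reporting a 4 MiB optimal I/O size
    print(pick_alignment(65536, 4 * MIB))
```

With optimal_io = 0 the result is the 1 MiB default regardless of what
the drive claims as its physical block size, which is why a drive that
lies about it still ends up aligned as long as it is 0-aligned.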

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:34     ` Tejun Heo
@ 2010-03-09  3:09         ` Martin K. Petersen
  2010-03-09  2:42       ` Tejun Heo
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-09  3:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Karel Zak, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare, Jim Meyering

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

>>> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for
>>> ~2 years.

Tejun> By default, they aren't aligned properly, are they?

Single partition.  I did the alignment manually.


Tejun> libata is broken for logical 4KiB ATA devices tho.  I'll fix it
Tejun> up.

Matthew implemented support for this a while back...


Tejun> I'm just a bit worried that it might generate a lot of frustrated
Tejun> bug reports.  Well, maybe we should just advise users to install
Tejun> windows first and then install Linux.

Unfortunately there is no simple solution given that we can't go back in
time and fix legacy DOS/XP behavior.

The 1-alignment jumper (that some drives have) fixes things for the
first partition but will mess up our alignment for subsequent ones
unless the firmware actually reports the shift.  So no matter what we do
the user will have to have a bare minimum of knowledge about 512-byte
LBS/4 KB PBS drives.  That sucks.  But even Windows users are presented
with extra documentation and alignment utilities during the transition.

Having a 1 MB alignment by default and hoping that devices that lie will
be 0-aligned is the best we can do, I think.
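
To make the jumper problem concrete, here is a small Python sketch.  It
assumes the Linux convention that alignment_offset is a byte count and
that aligned positions are alignment_offset + k * physical block size;
the value 3584 for a 1-aligned 512/4096 drive follows from that
convention and is an assumption, not something measured on hardware.
The helper names are made up for illustration.

```python
MIB = 1024 * 1024


def is_aligned(lba, logical=512, physical=4096, alignment_offset=0):
    """True if the LBA starts on a physical block boundary."""
    return (lba * logical - alignment_offset) % physical == 0


def first_aligned_lba(min_lba, grain=MIB, logical=512, alignment_offset=0):
    """Smallest LBA >= min_lba located at alignment_offset + k * grain."""
    start = min_lba * logical
    # Ceiling division without floats; clamp so k never goes negative.
    k = max(0, -(-(start - alignment_offset) // grain))
    return (alignment_offset + k * grain) // logical


if __name__ == "__main__":
    shift = 3584  # assumed alignment_offset of a 1-aligned drive
    # The legacy first partition at LBA 63 is fixed by the jumper...
    print(is_aligned(63, alignment_offset=shift))
    # ...but a partition placed at 1 MiB (LBA 2048) is now misaligned.
    print(is_aligned(2048, alignment_offset=shift))
    # If the firmware reports the shift, the partitioner can compensate.
    print(first_aligned_lba(2048, alignment_offset=shift))
```

If the drive does not report the shift, the partitioner sees
alignment_offset = 0 and has no way to know that LBA 2048 is wrong.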

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 19:34   ` Mike Snitzer
@ 2010-03-09  2:53     ` Tejun Heo
  2010-03-09  3:20         ` Martin K. Petersen
  2010-03-09  6:53     ` Michael Tokarev
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:53 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Karel Zak, Jim Meyering

Hello,

On 03/09/2010 04:34 AM, Mike Snitzer wrote:
> I've been keeping track of all the pieces in play, have coordinated
> with kzak and jim, and have a summary that offers some amount of macro
> detail (at the end I touch on parted and fdisk):
> 
> http://people.redhat.com/msnitzer/docs/io-limits.txt

Ah... this is great.  I'll link the doc and shamelessly steal parts of
it if that's okay with you.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:42       ` Jeff Garzik
@ 2010-03-09  2:49         ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:49 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Karel Zak, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

Hello,

On 03/09/2010 11:42 AM, Jeff Garzik wrote:
> On 03/08/2010 09:34 PM, Tejun Heo wrote:
>> libata is broken for logical 4KiB ATA devices tho.  I'll fix it up.
> 
> Does libata-dev.git#sectsize miss any details?

I haven't looked at it yet.  I'll review it soon but the thing is
without actual hardware it would be a bit difficult to tell.  It's not
only the drivers.  I have this mighty unhappy feeling that some
controllers (especially some of the SATA ones with internal state
machine to emulate SFF) would be sniffing the commands and making the
wrong assumption if 4KiB logical sector size is used, so we'll need to
test various controllers.  Some PATA-SATA bridge chips will definitely
be having problems too.  Then there are the USB and other bridges too
but well those aren't libata's problem at least.  :-)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:18   ` Martin K. Petersen
                     ` (4 preceding siblings ...)
  (?)
@ 2010-03-09  2:44   ` Tejun Heo
  2010-03-09  3:18       ` Martin K. Petersen
  -1 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:44 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

Hello,

On 03/09/2010 12:18 AM, Martin K. Petersen wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
> Tejun> Please note that hdparm is misreporting the alignment offset.  It
> Tejun> should be reporting 512 instead of 256 for offset-by-one drives.
> 
> Already fixed.  Your hdparm must be old.

Yeah, I know Mark fixed it but couldn't find where the tree was.  SF
only had old releases, so...

(other stuff replied further down the thread)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:34     ` Tejun Heo
  2010-03-09  2:42       ` Jeff Garzik
@ 2010-03-09  2:42       ` Tejun Heo
  2010-03-09  3:11           ` Martin K. Petersen
  2010-03-09  3:09         ` Martin K. Petersen
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:42 UTC (permalink / raw)
  To: Karel Zak
  Cc: Martin K. Petersen, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

Hello, again.

On 03/09/2010 11:34 AM, Tejun Heo wrote:
>>  - parted uses 1MiB alignment for disks with unknown topology, disks
>>    with topology information are aligned to optimal (or minimum) I/O
>>    size (parted >= 2.1)
> 
> This will result in incorrect alignment for drives which lie about the
> physical sector size to work around BIOS/drivers issues (C-1).  It
> would probably be best to align to at least 1MiB.

I misread it.  C-1 would be disks w/o alignment information which will
be aligned to optimal_io_size which again would be 0 and thus 1MiB
alignment.  So, this should work, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-09  2:34     ` Tejun Heo
@ 2010-03-09  2:42       ` Jeff Garzik
  2010-03-09  2:49         ` Tejun Heo
  2010-03-09  2:42       ` Tejun Heo
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 155+ messages in thread
From: Jeff Garzik @ 2010-03-09  2:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Karel Zak, Martin K. Petersen, linux-ide, lkml, Daniel Taylor,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

On 03/08/2010 09:34 PM, Tejun Heo wrote:
> libata is broken for logical 4KiB ATA devices tho.  I'll fix it up.

Does libata-dev.git#sectsize miss any details?

	Jeff

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 19:58   ` Karel Zak
@ 2010-03-09  2:34     ` Tejun Heo
  2010-03-09  2:42       ` Jeff Garzik
                         ` (4 more replies)
  2010-03-09  7:27       ` Jim Meyering
  1 sibling, 5 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:34 UTC (permalink / raw)
  To: Karel Zak
  Cc: Martin K. Petersen, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

Hello,

On 03/09/2010 04:58 AM, Karel Zak wrote:
>> Tejun> Reportedly, commonly used partitioners aren't ready to handle
>> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
> 
> The limit is specific for DOS partition table (with 512-byte log.
> sectors), but for example GPT uses 64-bit LBA. I believe that our
> partitioning tools don't introduce any other restriction.

Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just
misinterpreted the conversation.  Daniel, can you please fill in?

>> Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
>> Tejun> logical sector support is broken in both the kernel 
>>
>> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
>> years.

By default, they aren't aligned properly, are they?

>> Tejun> (need more details and probably a whole section on partitioner
>> Tejun> behaviors)
>>
>> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
>> alignment work for fdisk and parted respectively.  Karel, Jim: The full
>> writeup is here:
>>
>> 	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>>
>> It'd be great if you guys could share what you have been doing to the
>> tooling.
> 
>  small summary:
> 
>  - libblkid provides unified API to topology information, it supports:
>     - ioctls (kernel >= 2.6.32)
>     - sysfs (kernel >= 2.6.31)
>     - stripe chunk size and stripe width for DM, MD, LVM and evms on
>       old kernels
>  - libparted and fdisk are linked against libblkid
> 
>  - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
>  - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
>  - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
>    and alignment_offset for all partitions in non-DOS mode
>    (util-linux-ng >= 2.17.1)

That's great.  Daniel, maybe you were testing older versions?  Or
maybe those failures were caused by libata mishandling 4KiB r/w
requests.

>  - parted supports 4KiB physical sector size
>  - parted uses 1MiB alignment for disks with unknown topology, disks
>    with topology information are aligned to optimal (or minimum) I/O
>    size (parted >= 2.1)

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1).  It
would probably be best to align to at least 1MiB.

>  - EFI GPT code in the kernel has been updated to work properly with 
>    4KiB sectors (kernel >= 2.6.33)

libata is broken for logical 4KiB ATA devices tho.  I'll fix it up.

>  - mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
>    topology information, mkfs.{ext,xfs} are linked against libblkid
>    for compatibility with old kernel (for stripe chunk size / width)
> 
>  - Fedora-13/RHEL6 installer uses libparted with 4KiB support
> 
>  - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
> 
>> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
>> Tejun> or logical too, is looking fairly ugly.  Hopefully, a reasonable
>> Tejun> solution can be reached in not too distant future but even with
>> Tejun> all the software side updated, it looks like it's gonna cause
>> Tejun> significant amount of confusion and frustration.
>>
>> With regards to XP compatibility I don't think we should go too much out
>> of our way to accommodate it.  XP has been disowned by its master and I
>> think virtualization will take care of the rest.

Yeah, good point.  I'm just a bit worried that it might generate a lot
of frustrated bug reports.  Well, maybe we should just advise users to
install windows first and then install Linux.

>> FWIW, recent fdisk has a command line flag that will enable/disable DOS
>> compatible layout.
> 
>  yes, util-linux-ng 2.17.1, fdisk -c
>  
>  Note that non-DOS mode will be default in the next major
>  util-linux-ng release.

I'll try to merge this information into the ata-4k doc.

Thank you very much.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 20:12   ` H. Peter Anvin
@ 2010-03-09  2:22     ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-09  2:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

Hello,

On 03/09/2010 05:12 AM, H. Peter Anvin wrote:
> Please correct the following bit in C-3:
> 
> "A different partition format - GPT[6] - should be used beyond 2^32
> sectors, which could harm compatibility with older BIOSs or other
> operating systems which don't recognize the new format."
> 
> BIOS does not care about the partition table format.  There might be
> issues with > 2^32 sectors for BIOSes (e.g. truncating sector counts),
> but that would be unrelated.

Updated to,

  This might also be beneficial for operating systems which don't
  suffer from this limitation.  A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with other operating systems which don't recognize the new format.
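
For reference, the 2^32 sector figure works out as follows.  This is
only a back-of-the-envelope sketch in Python; the function name is made
up, and it assumes the limit comes from the 32-bit sector fields of the
DOS partition table as discussed in this thread.

```python
TIB = 2 ** 40


def mbr_max_bytes(logical_sector_size: int) -> int:
    """Capacity addressable with 2^32 sectors of the given size."""
    # DOS partition table entries store LBAs and sector counts in
    # 32-bit fields, so at most 2^32 sectors can be described.
    return (2 ** 32) * logical_sector_size


if __name__ == "__main__":
    for size in (512, 4096):
        print(size, mbr_max_bytes(size) // TIB, "TiB")
```

So 512-byte logical sectors hit the limit at 2 TiB, while 4096-byte
logical sectors push the same partition format to 16 TiB.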

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 20:19             ` Martin K. Petersen
  (?)
@ 2010-03-08 21:16             ` H. Peter Anvin
  -1 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08 21:16 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/08/2010 12:19 PM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:
> 
> hpa> On the flipside, though, there really is very little net benefit to
> hpa> 4K as opposed to 512 byte logical sectors: the additional protocol
> hpa> overhead is relatively minimal, and as long as writes are aligned
> hpa> full blocks, there shouldn't be any additional overhead on either
> hpa> the OS or the drive side.  On the plus side, you get full
> hpa> compatibility with the existing software stack.  The equation
> hpa> really seems rather simple.
> 
> 4KB sectors are not a win for anybody except the drive vendors.
> 

Obviously.  However, larger physical storage unit sizes -- 4K for
spinning media, but frequently much larger for flash, for example -- are
already in wide use, and having a huge mishmash of logical block sizes
isn't going to work very well.

> There is a push in the industry right now to keep the 512-byte logical
> blocks forever.  The first step would be to report misaligned accesses
> or accesses that are not a multiple of the physical block size.  Second
> step would be to eventually reject any write that's not a properly
> aligned multiple of the physical block size.

I personally suspect that that is the way it is going to go, rather than
trying to change the software ecosystem to a different logical block
size.  It has been tried in the past and failed, with the sole exception
of CD-ROMs, pretty much.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 20:02             ` Cláudio Martins
@ 2010-03-08 21:07                 ` Martin K. Petersen
  0 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 21:07 UTC (permalink / raw)
  To: Cláudio Martins
  Cc: James Bottomley, H. Peter Anvin, Martin K. Petersen, Tejun Heo,
	linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	hirofumi, Andrew Morton, Alan Cox, irtiger, Matthew Wilcox,
	aschnell, knikanth, jdelvare

>>>>> "Cláudio" == Cláudio Martins <ctpm@ist.utl.pt> writes:

Cláudio> So the question is: what are hard drive makers guaranteeing (if
Cláudio> anything at all)?

No guarantees.  Nothing that you can get in writing, anyway.


Cláudio> Was a 512B sector write really atomic?

Sometimes.


Cláudio> Is a 4k one?  

Sometimes, maybe.

The problem with 4KB physical blocks is that if you do a partial or
misaligned write you'll end up having to do read-modify-write.  And that
introduces a scenario where a subsequent write error will affect
logical blocks that were not part of the I/O request.

However, you also have that with regular drives because they often write
more than the actual block undergoing I/O.  For instance to reduce
hotspot bleed to adjacent sectors.

There have been several unsuccessful attempts at nudging the drive
vendors into giving us real guarantees (supercapacitors, NVRAM or
flash-backed write cache).  No luck so far.  So people that care use
arrays with non-volatile caches.
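
A minimal sketch of the read-modify-write case described above, listing
the logical blocks a drive would have to rewrite even though they were not
part of the request.  The function name and the default sizes (512-byte
logical, 4096-byte physical, physical block 0 starting at LBA 0) are
assumptions for illustration, not something taken from any drive's
documentation.

```python
def rmw_collateral(lba, count, logical=512, physical=4096):
    """Return the logical blocks that share a physical block with the
    request but were not part of it.  An empty list means the write
    covers whole physical blocks and needs no read-modify-write.

    Assumes physical block 0 starts at LBA 0 (no alignment offset).
    """
    per_phys = physical // logical
    # First logical block of the first physical block touched.
    first = (lba // per_phys) * per_phys
    # One past the last logical block of the last physical block touched.
    end = ((lba + count - 1) // per_phys + 1) * per_phys
    requested = range(lba, lba + count)
    return [b for b in range(first, end) if b not in requested]


if __name__ == "__main__":
    print(rmw_collateral(8, 8))    # aligned full-block write
    print(rmw_collateral(10, 1))   # isolated 512-byte write
```

Under these assumptions a write error during the second call could affect
seven neighbouring logical blocks that the I/O request never named.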

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:50         ` H. Peter Anvin
@ 2010-03-08 20:19             ` Martin K. Petersen
  2010-03-08 20:19             ` Martin K. Petersen
  2010-03-10  0:34           ` Tejun Heo
  2 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 20:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, James Bottomley, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> On the flipside, though, there really is very little net benefit to
hpa> 4K as opposed to 512 byte logical sectors: the additional protocol
hpa> overhead is relatively minimal, and as long as writes are aligned
hpa> full blocks, there shouldn't be any additional overhead on either
hpa> the OS or the drive side.  On the plus side, you get full
hpa> compatibility with the existing software stack.  The equation
hpa> really seems rather simple.

4KB sectors are not a win for anybody except the drive vendors.

There is a push in the industry right now to keep the 512-byte logical
blocks forever.  The first step would be to report misaligned accesses
or accesses that are not a multiple of the physical block size.  Second
step would be to eventually reject any write that's not a properly
aligned multiple of the physical block size.
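
The check behind both steps (first report, later reject) could look roughly
like the sketch below.  The function name and the first_aligned_lba
parameter are hypothetical; the latter stands in for whatever alignment the
drive reports, and the sizes are the 512/4096 case discussed in this thread.

```python
def is_full_aligned_write(lba, count, logical=512, physical=4096,
                          first_aligned_lba=0):
    """True if the write starts on a physical block boundary and its
    length is a whole number of physical blocks.

    first_aligned_lba is the lowest LBA that sits at the start of a
    physical block (0 for a naturally aligned drive).
    """
    per_phys = physical // logical
    starts_on_boundary = (lba - first_aligned_lba) % per_phys == 0
    whole_blocks = count % per_phys == 0
    return starts_on_boundary and whole_blocks
```

A drive in the reporting stage would flag every request for which this
returns False; in the rejecting stage it would fail such writes instead.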

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:18   ` Martin K. Petersen
                     ` (3 preceding siblings ...)
  (?)
@ 2010-03-08 20:12   ` H. Peter Anvin
  2010-03-09  2:22     ` Tejun Heo
  -1 siblings, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08 20:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

On 03/08/2010 07:18 AM, Martin K. Petersen wrote:
> 
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively.  Karel, Jim: The full
> writeup is here:
> 
> 	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
> 
> It'd be great if you guys could share what you have been doing to the
> tooling.
> 

Please correct the following bit in C-3:

"A different partition format - GPT[6] - should be used beyond 2^32
sectors, which could harm compatibility with older BIOSs or other
operating systems which don't recognize the new format."

BIOS does not care about the partition table format.  There might be
issues with > 2^32 sectors for BIOSes (e.g. truncating sector counts),
but that would be unrelated.
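
To illustrate what truncating a sector count would look like, here is a
small sketch.  It does not describe any particular BIOS; it only assumes
firmware that keeps the low 32 bits of the count, and the function name is
made up for the example.

```python
def truncated_sector_count(sectors):
    """Sector count as seen by code that keeps only the low 32 bits."""
    return sectors & 0xFFFFFFFF


TIB = 2 ** 40
sectors_3tib = 3 * TIB // 512            # a 3 TiB disk with 512-byte sectors
seen = truncated_sector_count(sectors_3tib)

if __name__ == "__main__":
    print(sectors_3tib, seen, seen * 512 // TIB)
```

With these numbers the 3 TiB disk would appear to be 1 TiB, regardless of
which partition table format is on it.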

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:58           ` James Bottomley
  2010-03-08 19:11             ` H. Peter Anvin
@ 2010-03-08 20:02             ` Cláudio Martins
  2010-03-08 21:07                 ` Martin K. Petersen
  1 sibling, 1 reply; 155+ messages in thread
From: Cláudio Martins @ 2010-03-08 20:02 UTC (permalink / raw)
  To: James Bottomley
  Cc: H. Peter Anvin, Martin K. Petersen, Tejun Heo, linux-ide, lkml,
	Daniel Taylor, Jeff Garzik, Mark Lord, tytso, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare


On Tue, 09 Mar 2010 00:28:25 +0530 James Bottomley <James.Bottomley@suse.de> wrote:
> 
> There's another problem that afflicts 4k drives emulating 512b: they
> have to do a read modify write for any isolated 512b write ... that
> leads to potential corruption of adjacent 512b blocks if power is lost
> at the moment the write is being done.  Since most Linux filesystems are
> 4k sectors, misalignment really hammers this, plus most journal writes
> seem to be done in 512 byte increments.  I suppose for USB this could be
> regarded as flakey as usual, though.
> 

 Most users assume that a single 512B sector write is atomic as far as
power failure is concerned. Hasn't this requirement been carried over
to the new 4k physical sector?

 It seems reasonable that if a 512B sector write is atomic in the older
drives, a 4k sector write would also be atomic on the newer drives,
since the time required to write it is negligible when compared to
capacitor voltage decay and inertia of the disk platters.

 Anyway, I suppose most of the energy/time required for a sector write
operation is being expended on head assembly positioning and the wait
for the correct sector passing under the write head. That is, the write
operation itself takes so little time that it should make no difference
whether you write 512B or 4k.

 So the question is: what are hard drive makers guaranteeing (if
anything at all)? Was a 512B sector write really atomic? Is a 4k one?
Or was it completely manufacturer-dependent to start?

 Regards

Cláudio


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:29   ` H. Peter Anvin
@ 2010-03-08 20:01       ` Martin K. Petersen
  0 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 20:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

>> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
>> years.

hpa> For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition
hpa> tables, it is.

Ah, that.  Already fixed, I believe.


>> With regards to XP compatibility I don't think we should go too much
>> out of our way to accommodate it.  XP has been disowned by its master
>> and I think virtualization will take care of the rest.

hpa> I think that is wildly optimistic,

I don't expect XP to go away any time soon.  But I do think that the
number of fresh XP installs in combination with Linux will be fairly
limited.  And general lack of hardware enablement will eventually kill
off XP on raw metal.

I think it's ok that we have stop-gap solutions in place for
interoperability.  But I wouldn't want to waste all our resources on
designing for the past.  I'm much more interested in making sure that
single-boot Linux is doing the right thing.


>> FWIW, recent fdisk has a command line flag that will enable/disable
>> DOS compatible layout.

hpa> Yes, unfortunately it is still on by default.

I agree that this is a don't-be-broken option and I would prefer it the
other way around (I know that's the plan for the next release.  I just
hope the distributions get things right).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:18   ` Martin K. Petersen
                     ` (2 preceding siblings ...)
  (?)
@ 2010-03-08 19:58   ` Karel Zak
  2010-03-09  2:34     ` Tejun Heo
  2010-03-09  7:27       ` Jim Meyering
  -1 siblings, 2 replies; 155+ messages in thread
From: Karel Zak @ 2010-03-08 19:58 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Jim Meyering

On Mon, Mar 08, 2010 at 10:18:27AM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
> 
> I don't think we take the partition type into account.  Karel?

Yes, you're right. 

(IMHO our goal should be to minimize the number of places where anything
depends on partition type.)

> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't

The limit is specific to the DOS partition table (with 512-byte logical
sectors); GPT, for example, uses 64-bit LBAs. I believe that our
partitioning tools don't introduce any other restriction.
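
The arithmetic behind that limit, as a rough sketch.  It only computes how
many bytes a single sector field of a given width can describe; the
function name is an assumption for the example, and real partition tables
have further details (separate start and length fields) that are ignored
here.

```python
def max_addressable_bytes(lba_bits, logical_block_size):
    """Bytes describable by one sector field of lba_bits bits."""
    return (2 ** lba_bits) * logical_block_size


TIB = 2 ** 40

dos_512 = max_addressable_bytes(32, 512)     # DOS table, 512-byte sectors
dos_4k = max_addressable_bytes(32, 4096)     # DOS table, 4 KiB sectors
gpt_512 = max_addressable_bytes(64, 512)     # GPT, 512-byte sectors

if __name__ == "__main__":
    print(dos_512 // TIB, "TiB")
    print(dos_4k // TIB, "TiB")
    print(gpt_512 // TIB, "TiB")
```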

> Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
> Tejun> logical sector support is broken in both the kernel 
> 
> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.
> 
> 
> Tejun> (need more details and probably a whole section on partitioner
> Tejun> behaviors)
> 
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively.  Karel, Jim: The full
> writeup is here:
> 
> 	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
> 
> It'd be great if you guys could share what you have been doing to the
> tooling.

 small summary:

 - libblkid provides unified API to topology information, it supports:
    - ioctls (kernel >= 2.6.32)
    - sysfs (kernel >= 2.6.31)
    - stripe chunk size and stripe width for DM, MD, LVM and evms on
      old kernels
 - libparted and fdisk are linked against libblkid

 - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
 - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
 - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
   and alignment_offset for all partitions in non-DOS mode
   (util-linux-ng >= 2.17.1)

 - parted supports 4KiB physical sector size
 - parted uses 1MiB alignment for disks with unknown topology, disks
   with topology information are aligned to optimal (or minimum) I/O
   size (parted >= 2.1)
 
 - EFI GPT code in the kernel has been updated to work properly with
   4KiB sectors (kernel >= 2.6.33)

 - mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
   topology information, mkfs.{ext,xfs} are linked against libblkid
   for compatibility with old kernel (for stripe chunk size / width)

 - Fedora-13/RHEL6 installer uses libparted with 4KiB support

 - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
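
 A rough sketch of the alignment rule summarized in the list above (1MiB
 grain, or more if optimal I/O size is bigger, plus alignment_offset).
 This is not the fdisk or parted code; read_topology() and aligned_start()
 are hypothetical helpers, and the sysfs file names are the topology
 attributes exported under /sys/block/<dev>/ on kernels that provide them.

```python
import os

MIB = 1024 * 1024


def read_topology(dev, sysfs="/sys/block"):
    """Read I/O topology values (bytes) for a whole disk such as "sda"."""
    def read_int(*parts):
        with open(os.path.join(sysfs, dev, *parts)) as f:
            return int(f.read().strip())

    return {
        "logical_block_size": read_int("queue", "logical_block_size"),
        "physical_block_size": read_int("queue", "physical_block_size"),
        "minimum_io_size": read_int("queue", "minimum_io_size"),
        "optimal_io_size": read_int("queue", "optimal_io_size"),
        "alignment_offset": read_int("alignment_offset"),
    }


def aligned_start(min_start, alignment_offset=0, optimal_io_size=0,
                  grain=MIB):
    """Byte offset for a partition start: round min_start up to the next
    grain boundary, then add the drive's alignment_offset.

    The grain is 1 MiB unless the optimal I/O size is larger.
    """
    grain = max(grain, optimal_io_size)
    boundary = -(-min_start // grain) * grain   # ceiling to a grain multiple
    return boundary + alignment_offset
```

 For example, aligned_start(512, alignment_offset=3584) gives 1MiB + 3584
 bytes, i.e. sector 2055 with 512-byte logical sectors.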

> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
> Tejun> or logical too, is looking fairly ugly.  Hopefully, a reasonable
> Tejun> solution can be reached in not too distant future but even with
> Tejun> all the software side updated, it looks like it's gonna cause
> Tejun> significant amount of confusion and frustration.
> 
> With regards to XP compatibility I don't think we should go too much out
> of our way to accommodate it.  XP has been disowned by its master and I
> think virtualization will take care of the rest.
> 
> FWIW, recent fdisk has a command line flag that will enable/disable DOS
> compatible layout.

 yes, util-linux-ng 2.17.1, fdisk -c
 
 Note that non-DOS mode will be default in the next major
 util-linux-ng release.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:18   ` Martin K. Petersen
  (?)
  (?)
@ 2010-03-08 19:34   ` Mike Snitzer
  2010-03-09  2:53     ` Tejun Heo
  2010-03-09  6:53     ` Michael Tokarev
  -1 siblings, 2 replies; 155+ messages in thread
From: Mike Snitzer @ 2010-03-08 19:34 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare,
	Karel Zak, Jim Meyering

On Mon, Mar 8, 2010 at 10:18 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>
> Tejun> The [Windows Vista/7] partitioner seems to be using 1M as the
> Tejun> basic alignment unit and offsetting from there if explicitly
> Tejun> requested by the drive
>
> Yep.
>
>
> Tejun> Please note that hdparm is misreporting the alignment offset.  It
> Tejun> should be reporting 512 instead of 256 for offset-by-one drives.
>
> Already fixed.  Your hdparm must be old.
>
>
>
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
>
> I don't think we take the partition type into account.  Karel?
>
>
> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
> Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
> Tejun> logical sector support is broken in both the kernel
>
> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.
>
>
> Tejun> (need more details and probably a whole section on partitioner
> Tejun> behaviors)
>
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively.  Karel, Jim: The full
> writeup is here:
>
>        http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> It'd be great if you guys could share what you have been doing to the
> tooling.

I've been keeping track of all the pieces in play, have coordinated
with kzak and jim, and have a summary that offers some amount of macro
detail (at the end I touch on parted and fdisk):

http://people.redhat.com/msnitzer/docs/io-limits.txt

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:58           ` James Bottomley
@ 2010-03-08 19:11             ` H. Peter Anvin
  2010-03-08 20:02             ` Cláudio Martins
  1 sibling, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08 19:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/08/2010 10:58 AM, James Bottomley wrote:
>>
>> On the flipside, though, there really is very little net benefit to 4K
>> as opposed to 512 byte logical sectors: the additional protocol overhead
>> is relatively minimal, and as long as writes are aligned full blocks,
>> there shouldn't be any additional overhead on either the OS or the drive
>> side.  On the plus side, you get full compatibility with the existing
>> software stack.  The equation really seems rather simple.
> 
> There's another problem that afflicts 4k drives emulating 512b: they
> have to do a read modify write for any isolated 512b write ... that
> leads to potential corruption of adjacent 512b blocks if power is lost
> at the moment the write is being done.  Since most Linux filesystems are
> 4k sectors, misalignment really hammers this, plus most journal writes
> seem to be done in 512 byte increments.  I suppose for USB this could be
> regarded as flakey as usual, though.
> 

Misalignment sucks in general.  This is nothing new - the RAID and flash
people have had these problems for a long time now.  It's clear we need
to align our filesystems, period.
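
As a small worked example of why alignment matters here, the sketch below
counts how many 4 KiB physical blocks a single 4 KiB filesystem block
touches for a given partition start.  The start sectors used (63 for the
traditional DOS layout, 2048 for a 1 MiB boundary) and the function name
are assumptions for illustration.

```python
def physical_blocks_touched(part_start_lba, fs_block_index,
                            fs_block=4096, logical=512, physical=4096):
    """Number of physical blocks covered by one filesystem block."""
    start = part_start_lba * logical + fs_block_index * fs_block
    end = start + fs_block - 1               # last byte of the fs block
    return end // physical - start // physical + 1


if __name__ == "__main__":
    print(physical_blocks_touched(63, 0))     # partition at sector 63
    print(physical_blocks_touched(2048, 0))   # partition at 1 MiB
```

With a start at sector 63 every filesystem block straddles two physical
blocks, so each 4 KiB write turns into two read-modify-write cycles.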

As to the read-modify-write issue: to some degree there is very little
you can do about it other than a big enough capacitor.  If you can't
write a sector atomically and have it stick, you're screwed no matter what.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 18:50         ` H. Peter Anvin
@ 2010-03-08 18:58           ` James Bottomley
  2010-03-08 19:11             ` H. Peter Anvin
  2010-03-08 20:02             ` Cláudio Martins
  2010-03-08 20:19             ` Martin K. Petersen
  2010-03-10  0:34           ` Tejun Heo
  2 siblings, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-08 18:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin K. Petersen, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On Mon, 2010-03-08 at 10:50 -0800, H. Peter Anvin wrote:
> On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
> >>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> > 
> >>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> > Martin> There are 4 KB LBS SSDs out there but in general the industry is
> > Martin> sticking to ATA for local boot.
> > 
> > Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> > Martin> people stick to the tried-and-true 512.
> > 
> > *sigh* I haven't had my breakfast tea yet...
> > 
> > What I meant to say was that I know ATA supports 4 KB LBS and that
> > nobody appears to care about it.
> > 
> 
> Well, apparently Western Digital are looking at it for USB drives due to
> XP compatibility requirements -- those presumably are ATA internally and
> use a USB-ATA bridge.
> 
> On the flipside, though, there really is very little net benefit to 4K
> as opposed to 512 byte logical sectors: the additional protocol overhead
> is relatively minimal, and as long as writes are aligned full blocks,
> there shouldn't be any additional overhead on either the OS or the drive
> side.  On the plus side, you get full compatibility with the existing
> software stack.  The equation really seems rather simple.

There's another problem that afflicts 4k drives emulating 512b: they
have to do a read modify write for any isolated 512b write ... that
leads to potential corruption of adjacent 512b blocks if power is lost
at the moment the write is being done.  Since most Linux filesystems are
4k sectors, misalignment really hammers this, plus most journal writes
seem to be done in 512 byte increments.  I suppose for USB this could be
regarded as flakey as usual, though.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:41         ` Martin K. Petersen
  (?)
@ 2010-03-08 18:50         ` H. Peter Anvin
  2010-03-08 18:58           ` James Bottomley
                             ` (2 more replies)
  -1 siblings, 3 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08 18:50 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> 
>>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> Martin> There are 4 KB LBS SSDs out there but in general the industry is
> Martin> sticking to ATA for local boot.
> 
> Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> Martin> people stick to the tried-and-true 512.
> 
> *sigh* I haven't had my breakfast tea yet...
> 
> What I meant to say was that I know ATA supports 4 KB LBS and that
> nobody appears to care about it.
> 

Well, apparently Western Digital are looking at it for USB drives due to
XP compatibility requirements -- those presumably are ATA internally and
use a USB-ATA bridge.

On the flipside, though, there really is very little net benefit to 4K
as opposed to 512 byte logical sectors: the additional protocol overhead
is relatively minimal, and as long as writes are aligned full blocks,
there shouldn't be any additional overhead on either the OS or the drive
side.  On the plus side, you get full compatibility with the existing
software stack.  The equation really seems rather simple.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:18   ` Martin K. Petersen
  (?)
@ 2010-03-08 18:29   ` H. Peter Anvin
  2010-03-08 20:01       ` Martin K. Petersen
  -1 siblings, 1 reply; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08 18:29 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

On 03/08/2010 07:18 AM, Martin K. Petersen wrote:
> 
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
> 
> I don't think we take the partition type into account.  Karel?
> 

We should not take the partition type into account.  The other aspect is
that FAT partitions need to be formatted differently to maintain the
alignment once set; I have recently contributed patches (which were
accepted) into mkdosfs to do the right thing there.

Looking at the Windows XP article, it looks like it is limited to
certain BIOSes; unfortunately it doesn't say what the particular BIOS
issue is.  If we can find a system which actually exhibits the bug it
might be possible to reverse-engineer a solution.

> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
> Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
> Tejun> logical sector support is broken in both the kernel 
> 
> Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.

For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition
tables, it is.

> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
> Tejun> or logical too, is looking fairly ugly.  Hopefully, a reasonable
> Tejun> solution can be reached in not too distant future but even with
> Tejun> all the software side updated, it looks like it's gonna cause
> Tejun> significant amount of confusion and frustration.
> 
> With regards to XP compatibility I don't think we should go too much out
> of our way to accommodate it.  XP has been disowned by its master and I
> think virtualization will take care of the rest.

I think that is wildly optimistic, but I do observe there is a fix
from Microsoft in the article you reference.

> FWIW, recent fdisk has a command line flag that will enable/disable DOS
> compatible layout.

Yes, unfortunately it is still on by default.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:38       ` Martin K. Petersen
@ 2010-03-08 15:41         ` Martin K. Petersen
  -1 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 15:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
Martin> people stick to the tried-and-true 512.

*sigh* I haven't had my breakfast tea yet...

What I meant to say was that I know ATA supports 4 KB LBS and that
nobody appears to care about it.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08 15:33     ` Martin K. Petersen
@ 2010-03-08 15:38       ` Martin K. Petersen
  -1 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 15:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, H. Peter Anvin, hirofumi,
	Andrew Morton, Alan Cox, irtiger, Matthew Wilcox, aschnell,
	knikanth, jdelvare

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Thus implying that ATA doesn't support 4 KB LBS, just that people stick
to the tried-and-true 512.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  7:53   ` H. Peter Anvin
@ 2010-03-08 15:34       ` Martin K. Petersen
  2010-03-09 22:46     ` Greg Freemyer
  1 sibling, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 15:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: James Bottomley, Tejun Heo, linux-ide, lkml, Daniel Taylor,
	Jeff Garzik, Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox,
	irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> I would very much like a reference for a platform which has
hpa> firmware which can successfully boot from 4K-logical media.  It
hpa> would be very useful for bootloader testing.

I have yet to find one.


hpa> Aligning partitions is something we should have done long ago.  It
hpa> affects RAID and many flash drives just as much or more than
hpa> 4K-sectored disks.

Yup.


hpa> As far as partitioning... I believe we should be using GPT
hpa> partition tables where possible.  Even on non-EFI systems, it's
hpa> simply a much better partition table format.

Agreed.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  7:00 ` James Bottomley
@ 2010-03-08 15:33     ` Martin K. Petersen
  2010-03-08  7:56   ` H. Peter Anvin
  2010-03-08 15:33     ` Martin K. Petersen
  2 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 15:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, H. Peter Anvin, hirofumi, Andrew Morton,
	Alan Cox, irtiger, Matthew Wilcox, aschnell, knikanth, jdelvare

>>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:

James> However, for 4k sectors, the main issues which have shown up in
James> testing by others (mostly Martin) are

James>      1. In native 4k mode, we work perfectly fine.  *however*,
James>         most BIOSs can't boot native 4k drives.

Correct.  I have engaged with pretty much all the big OEMs in the
industry and so far the interest has been near zero.


James>      4. The alignment problem is made more complex by drives that
James>         make use of the offset exponent feature (what you refer
James>         to as offset by one) ... fortunately very few of these
James>         have been seen in the wild and we're hopeful they can be
James>         shot before they breed.

This topic is constantly up for debate in IDEMA.  However, it looks like
we might win because of the impending demise of XP.


James> so the bottom line seems to be that if you want the device as a
James> non boot disk, use native 4k sectors and a non-msdos partition
James> label.  If you want to boot from the drive and your bios won't
James> boot 4k natively, partition everything using the 512 emulation
James> and try to align the partitions correctly.  If your bios/uefi
James> will boot 4k natively, just use it and whatever partition label
James> the bios/uefi supports.

James> Martin can fill in the pieces I've left out.

Here's my latest take given what I hear on the grapevine:

1. 512-byte logical block size drives will be around forever for legacy
   deployments because nobody is willing to do the required BIOS int13
   work.  It's not just a BIOS thing, this requires heavy changes to HBA
   boot ROMs as well.

2. Some vendors are working on EFI firmware and will support booting off
   of 4KB LBS drives there.  This is mostly aimed at the server space.

3. 4 KB logical block size drives will mainly be targeted for use inside
   arrays.  Off the shelf enterprise drive models will most likely
   continue to ship with a 512-byte LBS.

4. Part of the hesitation to work on booting off of 4 KB LBS drives is
   motivated by a general trend in the industry to move boot
   functionality to SSD.  There are 4 KB LBS SSDs out there but in
   general the industry is sticking to ATA for local boot.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
@ 2010-03-08 15:18   ` Martin K. Petersen
  -1 siblings, 0 replies; 155+ messages in thread
From: Martin K. Petersen @ 2010-03-08 15:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, Karel Zak,
	Jim Meyering

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> The [Windows Vista/7] partitioner seems to be using 1M as the
Tejun> basic alignment unit and offsetting from there if explicitly
Tejun> requested by the drive

Yep.


Tejun> Please note that hdparm is misreporting the alignment offset.  It
Tejun> should be reporting 512 instead of 256 for offset-by-one drives.

Already fixed.  Your hdparm must be old.



Tejun> Partitioners maybe should only align partitions which will be
Tejun> used by Linux and default to the traditional layout for others
Tejun> while allowing explicit override.

I don't think we take the partition type into account.  Karel?


Tejun> Reportedly, commonly used partitioners aren't ready to handle
Tejun> drives larger than 2 TiB in any configuration and alignment isn't
Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
Tejun> logical sector support is broken in both the kernel 

Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
years.


Tejun> (need more details and probably a whole section on partitioner
Tejun> behaviors)

I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
alignment work for fdisk and parted respectively.  Karel, Jim: The full
writeup is here:

	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

It'd be great if you guys could share what you have been doing to the
tooling.


Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
Tejun> or logical too, is looking fairly ugly.  Hopefully, a reasonable
Tejun> solution can be reached in not too distant future but even with
Tejun> all the software side updated, it looks like it's gonna cause
Tejun> significant amount of confusion and frustration.

With regards to XP compatibility I don't think we should go too much out
of our way to accommodate it.  XP has been disowned by its master and I
think virtualization will take care of the rest.

FWIW, recent fdisk has a command line flag that will enable/disable DOS
compatible layout.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  7:00 ` James Bottomley
  2010-03-08  7:53   ` H. Peter Anvin
@ 2010-03-08  7:56   ` H. Peter Anvin
  2010-03-08 15:33     ` Martin K. Petersen
  2 siblings, 0 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08  7:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

On 03/07/2010 11:00 PM, James Bottomley wrote:
>
> The 2TB size for msdos partitions is a problem independent of the 4k
> sector issue.  Traditional 512 byte sector drives are now available in
> those sizes.  It looks like we're going to have to move to a new
> partitioning label to solve this.
>
> There's actually another barrier at 8 or 16TB, which is where a 4k
> logical sector filesystem tops out using 32 bit block offsets (it's 8TB
> if the fs hasn't been proof checked against sign extension problems).
>

The limit for the MS-DOS partition tables is 2^32 sectors.  The patch 
that Daniel posted was for a Linux kernel internal limit that set the 
limit to 2 TB.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  7:00 ` James Bottomley
@ 2010-03-08  7:53   ` H. Peter Anvin
  2010-03-08 15:34       ` Martin K. Petersen
  2010-03-09 22:46     ` Greg Freemyer
  2010-03-08  7:56   ` H. Peter Anvin
  2010-03-08 15:33     ` Martin K. Petersen
  2 siblings, 2 replies; 155+ messages in thread
From: H. Peter Anvin @ 2010-03-08  7:53 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, linux-ide, lkml, Daniel Taylor, Jeff Garzik,
	Mark Lord, tytso, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

On 03/07/2010 11:00 PM, James Bottomley wrote:
> Just a quick note:
>
> The 2TB size for msdos partitions is a problem independent of the 4k
> sector issue.  Traditional 512 byte sector drives are now available in
> those sizes.  It looks like we're going to have to move to a new
> partitioning label to solve this.
>
> There's actually another barrier at 8 or 16TB, which is where a 4k
> logical sector filesystem tops out using 32 bit block offsets (it's 8TB
> if the fs hasn't been proof checked against sign extension problems).
>
> However, for 4k sectors, the main issues which have shown up in testing
> by others (mostly Martin) are
>
>       1. In native 4k mode, we work perfectly fine.  *however*, most
>          BIOSs can't boot native 4k drives.
>       2. Even if the BIOS can boot native 4k, our own boot loaders seem
>          to be hard coded for 512 byte sectors in several places.
>       3. If we run in the 512 byte sector emulation mode, we end up with
>          the partition alignment problems you allude to.
>       4. The alignment problem is made more complex by drives that make
>          use of the offset exponent feature (what you refer to as offset
>          by one) ... fortunately very few of these have been seen in the
>          wild and we're hopeful they can be shot before they breed.
>       5. I'm really, really sorry to have to mention it, but it looks
>          like uefi is going to be the only way we can boot non-msdos
>          partitioned devices with native 4k sectors.
>
> so the bottom line seems to be that if you want the device as a non boot
> disk, use native 4k sectors and a non-msdos partition label.  If you
> want to boot from the drive and your bios won't boot 4k natively,
> partition everything using the 512 emulation and try to align the
> partitions correctly.  If your bios/uefi will boot 4k natively, just use
> it and whatever partition label the bios/uefi supports.
>
> Martin can fill in the pieces I've left out.
>

I would very much like a reference for a platform which has firmware 
which can successfully boot from 4K-logical media.  It would be very 
useful for bootloader testing.

Aligning partitions is something we should have done long ago.  It 
affects RAID and many flash drives just as much or more than 4K-sectored 
disks.

Legacy BIOS doesn't care at all how the disk is partitioned, so as long 
as the BIOS can read the disk at all the rest is up to the bootloader. 
Of course, since there hasn't been the opportunity to test, bootloaders 
generally don't handle it correctly (early versions of Syslinux 
supported any sector size, but that bitrotted, and for the lack of 
testing I eventually ended up hard-coding the number.  Now I'd like to 
get it working properly.)

As far as partitioning... I believe we should be using GPT partition 
tables where possible.  Even on non-EFI systems, it's simply a much 
better partition table format.

	-hpa

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
  (?)
  (?)
@ 2010-03-08  7:00 ` James Bottomley
  2010-03-08  7:53   ` H. Peter Anvin
                     ` (2 more replies)
  -1 siblings, 3 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-08  7:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare, mkp

Just a quick note:

The 2TB size for msdos partitions is a problem independent of the 4k
sector issue.  Traditional 512 byte sector drives are now available in
those sizes.  It looks like we're going to have to move to a new
partitioning label to solve this.

There's actually another barrier at 8 or 16TB, which is where a 4k
logical sector filesystem tops out using 32 bit block offsets (it's 8TB
if the fs hasn't been proof checked against sign extension problems).

However, for 4k sectors, the main issues which have shown up in testing
by others (mostly Martin) are

     1. In native 4k mode, we work perfectly fine.  *however*, most
        BIOSs can't boot native 4k drives.
     2. Even if the BIOS can boot native 4k, our own boot loaders seem
        to be hard coded for 512 byte sectors in several places.
     3. If we run in the 512 byte sector emulation mode, we end up with
        the partition alignment problems you allude to.
     4. The alignment problem is made more complex by drives that make
        use of the offset exponent feature (what you refer to as offset
        by one) ... fortunately very few of these have been seen in the
        wild and we're hopeful they can be shot before they breed.
     5. I'm really, really sorry to have to mention it, but it looks
        like uefi is going to be the only way we can boot non-msdos
        partitioned devices with native 4k sectors.

so the bottom line seems to be that if you want the device as a non boot
disk, use native 4k sectors and a non-msdos partition label.  If you
want to boot from the drive and your bios won't boot 4k natively,
partition everything using the 512 emulation and try to align the
partitions correctly.  If your bios/uefi will boot 4k natively, just use
it and whatever partition label the bios/uefi supports.

Martin can fill in the pieces I've left out.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: ATA 4 KiB sector issues.
  2010-03-08  3:48 ` Tejun Heo
  (?)
@ 2010-03-08  5:38 ` Greg Freemyer
  -1 siblings, 0 replies; 155+ messages in thread
From: Greg Freemyer @ 2010-03-08  5:38 UTC (permalink / raw)
  To: Tejun Heo, Martin K. Petersen
  Cc: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working the
issue for the last year.

On Sun, Mar 7, 2010 at 10:48 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, guys.
>
> It looks like transition to ATA 4k drives will be quite painful and we
> aren't really ready although these drives are already selling widely.
> I've written up a summary document on the issue to clarify stuff as
> it's getting more and more confusing and develop some consensus.  It's
> also on the linux ata wiki.
>
>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> I've cc'd people whom I can think of off the top of my head but I
> surely have missed some people who would have been interested.  Please
> feel free to add cc's or forward the message to other MLs.
> Especially, I don't know much about partitioners so the details there
> are pretty shallow and could be plain wrong.  It would be great if
> someone who knows more about this stuff can chime in.
>
> Thanks.
>
> === Document follows ===
>
> ATA 4 KiB sector issues
>
> Background
> ==========
>
> Up until recently, all ATA hard drives have been organized in 512 byte
> sectors.  For example, my 500 GB or 466 GiB hard drive is organized into
> 976773168 512 byte sectors numbered from 0 to 976773167.  This is how
> a drive communicates with the driver.  When the operating system wants
> to read 32 KiB of data at 1 MiB position, the driver asks the drive to
> read 64 sectors from LBA (Logical block address, sector number) 2048.
>
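
The byte-to-LBA arithmetic in the paragraph above can be sketched as
follows.  This is only an illustration, not anything from the kernel or
a driver; the function name byte_range_to_lba is made up for the
example, and it assumes the request is a whole number of 512 byte
sectors.

```python
SECTOR_SIZE = 512  # bytes per sector, as described in the text

KIB = 1024
MIB = 1024 * KIB


def byte_range_to_lba(offset_bytes, length_bytes, sector_size=SECTOR_SIZE):
    """Return (first LBA, sector count) for a sector-aligned byte range.

    Hypothetical helper for illustration; raises if the range does not
    start and end on sector boundaries.
    """
    if offset_bytes % sector_size or length_bytes % sector_size:
        raise ValueError("range is not a whole number of sectors")
    return offset_bytes // sector_size, length_bytes // sector_size


if __name__ == "__main__":
    # Read 32 KiB at the 1 MiB position, the example used in the text.
    print(byte_range_to_lba(1 * MIB, 32 * KIB))
```

For the 32 KiB read at 1 MiB this gives LBA 2048 and 64 sectors, the
same numbers as in the text.
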
> Because each sector should be addressable, readable and writable
> individually, the physical medium also is organized in the same sized
> sectors.  In addition to the area to store the actual data, each
> sector requires extra space for book keeping - inter-sector space to
> enable locating and addressing each sector and ECC data to detect and
> correct inevitable raw data errors.
>
> As the densities and capacities of hard drives keep growing, stronger
> ECC becomes necessary to guarantee acceptable level of data integrity
> increasing the space overhead.  In addition, in most applications,
> hard drives are now accessed in units of at least 8 sectors or 4096
> bytes and maintaining 512 byte granularity has become somewhat
> meaningless.
>
> This reached a point where enlarging the sector size to 4096 bytes
> would yield measurably more usable space given the same raw data
> storage size and hard drive manufacturers are transitioning to 4 KiB
> sectors.
>
> Anandtech has a good article which illustrates the background and
> issues with pretty diagrams[1].
>
>
> Physical vs. Logical
> ====================
>
> Because the 512 byte sector size has been around for a very long time
> and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
> sector size assumption is scattered across all the layers -
> controllers or bridge chips snooping commands, BIOSs, boot codes,
> drivers, partitioners and system utilities, which makes it very
> difficult to change the sector size from 512 byte without breaking
> backward compatibility massively.
>
> As a workaround, the concept of logical sector size was introduced.
> The physical medium is organized in 4 KiB sectors but the firmware on
> the drive will present it as if the drive is composed of 512 byte
> sectors thus making the drive behave as before, so if the driver asks
> the hard drive to read 64 sectors from LBA 2048, the firmware will
> translate it and read 8 4 KiB sectors from hardware sector 256.  As a
> result, the hard drive now has two sector sizes - the physical one
> which the physical media is actually organized in, and the logical one
> which the firmware presents to the outside world.
>
> A straight forward example mapping between physical sector and LBA
> would be
>
>  LBA = 8 * phys_sect
>
>
> Alignment problem on 4 KiB physical / 512 logical drives
> =======================================================
>
> This workaround keeps older hardware and software working while
> allowing the drive to use larger sector size internally.  However, the
> discrepancy between physical and logical sector sizes creates an
> alignment issue.  For example, if the driver wants to read 7 sectors
> from LBA 2047, the firmware has to read hardware sectors 255 and 256
> and trim the leading 7*512 bytes and the trailing 2*512 bytes.
>
> For reads, this isn't an issue as drives read in larger chunks anyway
> but for writes, the drive has to do read-modify-write to achieve the
> requested action.  It has to first read hardware sector 255 and 256,
> update requested parts and then write back those sectors which can
> cause significant performance degradation[2].
>
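
A rough sketch of what the firmware has to work out for each request,
assuming the straightforward mapping LBA = 8 * phys_sect from the
previous section.  The function names are invented for the example and
do not correspond to any real firmware or kernel interface.

```python
LOGICAL = 512                # logical sector size in bytes
PHYSICAL = 4096              # physical sector size in bytes
RATIO = PHYSICAL // LOGICAL  # 8 logical sectors per physical sector


def physical_span(lba, count):
    """Return (first, last) physical sectors touched by a request."""
    first = lba // RATIO
    last = (lba + count - 1) // RATIO
    return first, last


def needs_read_modify_write(lba, count):
    """A write avoids read-modify-write only if it covers whole physical sectors."""
    return lba % RATIO != 0 or count % RATIO != 0


def trimmed_bytes(lba, count):
    """Return (leading, trailing) bytes read from the medium but not requested."""
    first, last = physical_span(lba, count)
    leading = (lba - first * RATIO) * LOGICAL
    trailing = ((last + 1) * RATIO - (lba + count)) * LOGICAL
    return leading, trailing


if __name__ == "__main__":
    for lba, count in [(2048, 64), (2047, 7)]:
        print(lba, count, physical_span(lba, count),
              needs_read_modify_write(lba, count), trimmed_bytes(lba, count))
```

The aligned request (64 sectors from LBA 2048) maps onto whole physical
sectors, while the 7 sector request from LBA 2047 straddles hardware
sectors 255 and 256 and would need read-modify-write if it were a write.
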
> The problem is aggravated by the way DOS partitions[3] have been laid
> out traditionally.  For reasons dating back more than two decades,
> they are laid out considering something called disk geometry, which
> nowadays consists of arbitrary values with a number of restrictions for
> backward compatibility accumulated over the years.  The end result is
> that until recently (most Linux variants and up to Windows XP) the
> first partition ends up on sector 63 and later ones on cylinder
> boundaries where each cylinder usually is composed of 255 * 63
> sectors.
>
> Most modern filesystems generate accesses that are 4 KiB aligned relative
> to the start of the partition they are in.  If a drive maps 4 KiB
> physical sectors to 512
> byte logical sectors from LBA0, the filesystem in the first partition
> will always be misaligned and filesystems in later partitions are
> likely to be misaligned too.
>
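
To see why the traditional layout goes wrong, one can check the usual
partition start positions against the 8-sector physical boundary.  This
is a small sketch assuming the 255 heads / 63 sectors per track geometry
and the mapping from LBA 0 described above; is_aligned is a name made up
for the example.

```python
SECTORS_PER_TRACK = 63
HEADS = 255
CYLINDER = HEADS * SECTORS_PER_TRACK  # logical sectors per cylinder
RATIO = 8  # 512 byte logical sectors per 4 KiB physical sector


def is_aligned(lba, ratio=RATIO):
    """True if lba falls on a physical sector boundary (mapping starts at LBA 0)."""
    return lba % ratio == 0


if __name__ == "__main__":
    # First partition at LBA 63, later ones on cylinder boundaries.
    starts = [63] + [n * CYLINDER for n in range(1, 9)]
    for lba in starts:
        print(lba, "aligned" if is_aligned(lba) else "misaligned")
```

Because one cylinder is an odd number of sectors, a cylinder boundary
only lands on a physical sector boundary when the cylinder number is a
multiple of 8, so most partitions laid out this way end up misaligned.
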
>
> Solving the alignment problem on 4 KiB physical / 512 logical drives
> ====================================================================
>
> There are multiple ways which attempt to solve the problem.
>
> S-1. Yet another workaround from the firmware - offset-by-one.
>
>  Yet another workaround which can be done by the firmware is to
>  offset physical to logical mapping by one logical sector such that
>  LBA 63 ends up on physical sector boundary, which aligns the first
>  partition to physical sectors without requiring any software update.
>  The example mapping between phys_sector and LBA becomes
>
>    LBA = 8 * phys_sect - 1
>
>  The leading 512 bytes from phys_sect 0 is not used and LBA 0 starts
>  from after that point.  phys_sect 1 maps to LBA 7 and phys_sect 8 to
>  63, making LBA 63 aligned on hardware sector.
>
>  Although this aligns only the first partition, for many use cases,
>  especially the ones involving older software, this workaround was
>  deemed useful and some recent drives with 4 KiB physical sectors are
>  equipped with a dip switch to turn on or off offset-by-one mapping.
>
> S-2. The proper solution.
>
>  Correct alignments for all partitions can't be achieved by the
>  firmware alone.  The system utilities should be informed about the
>  alignment requirements and align partitions accordingly.
>
>  The above firmware workaround complicates the situation because the
>  two different configurations require different offsets to achieve
>  the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
>  export the physical and logical sector sizes and the LBA offset
>  which is aligned to the physical sectors.
>
>  In Linux, these parameters are exported via the following sysfs
>  nodes.
>
>    physical sector size        : /sys/block/sdX/queue/physical_block_size
>    logical sector size         : /sys/block/sdX/queue/logical_block_size
>    alignment offset            : /sys/block/sdX/alignment_offset
>
>  Let the physical sector size be PSS, logical sector size LSS and
>  alignment offset AOFF.  The system software should place partitions
>  such that the starting LBAs of all partitions are aligned on
>
>    (n * PSS + AOFF) / LSS
>
>  For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
>  and AOFF 3584 and with n of 7 the above becomes,
>
>    (7 * 4096 + 3584) / 512 == 63
>
>  making sector 63 an aligned LBA where the first partition can be
>  put, but without the offset-by-one mapping, AOFF is zero and LBA 63
>  is not aligned.
>
>  With the above new alignment requirement in place, it becomes
>  difficult to honor the legacy one - first partition on sector 63 and
>  all other partitions on cylinder boundary (255 * 63 sectors) - as
>  the two alignment requirements contradict each other.  This might be
>  worked around by adjusting how LBA and CHS addresses are mapped but
>  the disk geometry parameters are hard coded everywhere and there is
>  no reliable way to communicate custom geometry parameters.
>
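
The alignment rule in S-2 can be checked mechanically.  Below is a
minimal sketch: is_aligned and aligned_lbas are hypothetical helpers
that apply the (n * PSS + AOFF) / LSS formula, and read_alignment just
reads the three sysfs nodes listed above.  The AOFF value of 3584 for
offset-by-one drives is taken from the text, not verified against real
hardware.

```python
def is_aligned(lba, pss, lss, aoff):
    """True if the byte address of lba equals n * pss + aoff for some n >= 0."""
    byte_addr = lba * lss
    return byte_addr >= aoff and (byte_addr - aoff) % pss == 0


def aligned_lbas(pss, lss, aoff, limit):
    """List every aligned LBA below limit."""
    return [lba for lba in range(limit) if is_aligned(lba, pss, lss, aoff)]


def read_alignment(dev):
    """Read (PSS, LSS, AOFF) for a block device name such as 'sda' from sysfs."""
    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    base = "/sys/block/" + dev
    return (read_int(base + "/queue/physical_block_size"),
            read_int(base + "/queue/logical_block_size"),
            read_int(base + "/alignment_offset"))


if __name__ == "__main__":
    # Offset-by-one drive versus a drive mapped from LBA 0.
    print(aligned_lbas(4096, 512, 3584, 64))
    print(aligned_lbas(4096, 512, 0, 64))
```

With AOFF 3584 the aligned LBAs below 64 are 7, 15, ... 63, so LBA 63
is a valid partition start; with AOFF 0 they are 0, 8, ... 56 and LBA
63 is not.  A partitioner could feed the result of read_alignment into
is_aligned before accepting a proposed start sector.
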
>
> Complications
> =============
>
> Unfortunately, there are complications.
>
> C-1. The standard is not and won't be followed as-is.
>
>  Some of the existing BIOSs and/or drivers can't cope with drives
>  which report 4 KiB physical sector size.  To work around this, some
>  drive models lie that their physical sector size is 512 bytes when the
>  actual configuration is 4 KiB without offsetting.
>
>  This nullifies the provisions for alignment in the ATA standard but
>  results in the correct alignment for Windows Vista and 7.  OS
>  behaviors will be described further later.
>
>  For these drives, which are likely to continue to be shipped for the
>  foreseeable future, traditional LBA 63 and cylinder based aligning
>  results in misalignment.
>
> C-2. Windows XP depends on the traditional partition layout.
>
>  Windows XP makes use of the CHS start/end addresses in the partition
>  table and gets confused if partitions are not laid out
>  traditionally.  This means that XP can't be installed into a
>  partition prepared by later versions of Windows[4].  This isn't a
>  big problem for Windows because in most cases the later version is
>  replacing the older one, not the other way around.
>
>  Unfortunately, the situation is more complex for Linux because Linux
>  is often co-installed with various versions of Windows and XP is
>  still quite popular.  This means that when a Linux partitioner is
>  used to prepare a partition which may be used by Windows, the
>  partitioner might have to consider which version of Windows is going
>  to be used and whether to align the partitions for the correct
>  alignment or compatibility with older versions of Windows.
>
> C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.
>
>  The DOS partition format uses 32 bit for the starting LBA and the
>  number of sectors and, reportedly, 32 bit Windows XP shares the
>  limitation.  With 32 bit addressing and 512 byte logical sector
>  size, the maximum addressable sector + 1 is at
>
>    2^32 * 2^9 == 2^41 == 2 TiB
>
>  The DOS partition format allows a partition to reach beyond 2 TiB as
>  long as the starting LBA is under 2 TiB; however, both Windows XP
>  and the Linux kernel (at least up to v2.6.33) refuse such
>  partition configurations.
>
>  With the right combination of host controller, BIOS and driver, this
>  barrier can be overcome by enlarging the logical sector size to 4
>  KiB, which will push the barrier out to 16 TiB.  On the right
>  configuration, Windows XP is reportedly able to address beyond the 2
>  TiB barrier with a DOS partition and 4 KiB logical sector size.
>  Linux kernels up to v2.6.33 don't work under such configurations but
>  a patch to make it work is pending[5].
>
>  This might also be beneficial for operating systems which don't
>  suffer from this limitation.  A different partition format - GPT[6]
>  - should be used beyond 2^32 sectors, which could harm compatibility
>  with older BIOSs or other operating systems which don't recognize
>  the new format.
>
>  As mentioned previously, 512 byte sector assumption has been there
>  for a very long time and changing it is likely to cause various
>  compatibility problems at many different layers from hardware up to
>  the system utilities.
>
>
> Windows
> =======
>
> As hard drive vendors aim for performance and compatibility in modern
> Windows environments, it is worthwhile to investigate how Windows
> partitions with different alignment requirements.  Up until Windows
> XP, it followed the traditional layout - the first partition on LBA 63
> and the others on cylinder boundaries where a cylinder is defined as
> 255 tracks with 63 sectors each.
>
> Windows Vista and 7 align partitions differently.  As the two behave
> similarly, only 7's behavior is shown here.  These partition tables
> are created by Windows 7 RC installer on blank disks.
>
> W-1. 512 byte physical and logical sector drive.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202100 07 df130c 00080000 00200300
>  00 df140c 07 feffff 00280300 00689e12
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2048 + 204800 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   20  : 206848
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206848 + 312371200 = 312578048
>
>  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
>
> W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202100 07 df130c 00080000 00200300
>  00 df140c 07 feffff 00280300 00b83f25
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2048 + 204800 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   20  : 206848
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206848 + 624932864 = 625139712
>
>  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
>
> W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202800 07 df130c 07080000 f91f0300
>  00 df1b0c 07 feffff 07280300 f9376d74
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   40  : 2055          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2055 + 204793 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   27  : 206855
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206855 + 1953314809 = 1953521664
>
>  Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.
>
> The partitioner seems to be using 1M as the basic alignment unit and
> offsetting from there if explicitly requested by the drive and there
> is no difference between handling of 512 byte and 4 KiB drives, which
> explains why C-1 works for hard drive vendors.
>
> In all cases, the partitioner ignores both the first partition on LBA
> 63 and the others on cylinder boundary requirements while still using
> the same 255*63 cylinder size.  Also, note that in W-3, both part 0
> and 1 end up with odd number of sectors.  It seems that they simply
> decided to completely break away from the traditional layout, which is
> understandable given that there really isn't one good solution which
> can cover all the cases and that the default larger alignment benefits
> earlier SSDs.
>
> Windows Vista basically shows the same behavior.  Vista was tested by
> creating two partitions using the management tool.  Test data is
> available at [7].
>
>  *-alignment_offset    : alignment_offset reported by Linux kernel
>  *-fdisk               : fdisk -l output
>  *-fdisk-u             : fdisk -lu output
>  *-hdparm              : hdparm -I output
>  *-mbr                 : dump of mbr
>  *-part                : decoded partition table from mbr
>
> Please note that hdparm is misreporting the alignment offset.  It
> should be reporting 512 instead of 256 for offset-by-one drives.
>
>
> So, what now for Linux?
> =======================
>
> The situation is not easy.  Considering all the factors, the only
> workable solution looks like doing what Windows is doing.  Hard drive
> and SSD vendors are focusing on compatibility and performance on
> recent Windows releases and are happy to do things which break the
> standard defined mechanism as shown by C-1, so parting away from what
> Windows does would be unnecessarily painful.
>
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that.  There will be many installations where a
> modern Linux distros share a hard drive with older releases of
> Windows.  At this point, I can't see a silver bullet solution.
>
> Partitioners maybe should only align partitions which will be used by
> Linux and default to the traditional layout for others while allowing
> explicit override.  I think Windows XP wouldn't have problem with
> differently aligned partitions as long as it doesn't actually use them
> but haven't tested it.
>
> Reportedly, commonly used partitioners aren't ready to handle drives
> larger than 2 TiB in any configuration and alignment isn't done
> properly for drives with 4 KiB physical sectors.  4 KiB logical sector
> support is broken in both the kernel and partitioners.  (need more
> details and probably a whole section on partitioner behaviors)
>
> Unfortunately, the transition to 4 KiB sector size, physical only or
> logical too, is looking fairly ugly.  Hopefully, a reasonable solution
> can be reached in not too distant future but even with all the
> software side updated, it looks like it's gonna cause significant
> amount of confusion and frustration.
>
>
> [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
> [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
> [3] http://en.wikipedia.org/wiki/Master_boot_record
> [4] http://support.microsoft.com/kb/931760
> [5] http://thread.gmane.org/gmane.linux.kernel/953981
> [6] http://en.wikipedia.org/wiki/GUID_Partition_Table
> [7] http://userweb.kernel.org/~tj/partalign/
>
> * Mar 04 2009
>        Initial draft, Tejun Heo <tj@kernel.org>
> * Mar 08 2009
>        Updated according to comments from Daniel Taylor
>        <Daniel.Taylor@wdc.com>.  Other minor updates.
>



-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 155+ messages in thread

* ATA 4 KiB sector issues.
@ 2010-03-08  3:48 ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-08  3:48 UTC (permalink / raw)
  To: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord

Hello, guys.

It looks like the transition to ATA 4 KiB sector drives will be quite
painful and we aren't really ready, although these drives are already
selling widely.  I've written up a summary document on the issue to
clarify things, as it's getting more and more confusing, and to
develop some consensus.  It's also on the linux ata wiki.

  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

I've cc'd people whom I can think of off the top of my head but I
surely have missed some people who would have been interested.  Please
feel free to add cc's or forward the message to other MLs.
Especially, I don't know much about partitioners so the details there
are pretty shallow and could be plain wrong.  It would be great if
someone who knows more about this stuff can chime in.

Thanks.

=== Document follows ===

ATA 4 KiB sector issues

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors.  For example, my 500 GB (about 466 GiB) hard drive is
organized as 976773168 512 byte sectors numbered from 0 to 976773167.
This is how a drive communicates with the driver.  When the operating
system wants to read 32 KiB of data at the 1 MiB position, the driver
asks the drive to read 64 sectors from LBA (logical block address,
i.e. sector number) 2048.

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors.  In addition to the area to store the actual data, each
sector requires extra space for bookkeeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee an acceptable level of data
integrity, which increases the space overhead.  In addition, in most
applications, hard drives are now accessed in units of at least 8
sectors or 4096 bytes, so maintaining 512 byte granularity has become
somewhat meaningless.

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].


Physical vs. Logical
====================

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSs, boot code,
drivers, partitioners and system utilities.  This makes it very
difficult to change the sector size from 512 bytes without massively
breaking backward compatibility.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive will present it as if the drive is composed of 512 byte
sectors thus making the drive behave as before, so if the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read 8 4 KiB sectors from hardware sector 256.  As a
result, the hard drive now has two sector sizes - the physical one
which the physical media is actually organized in, and the logical one
which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA
would be

  LBA = 8 * phys_sect
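
To make the arithmetic above concrete, here is a minimal Python sketch
of the translation, assuming the simple mapping where LBA 0 starts at
physical sector 0.  The function names are made up for illustration
and aren't taken from any driver or firmware.

```python
LSS = 512    # logical sector size in bytes
PSS = 4096   # physical sector size in bytes

def byte_range_to_lbas(offset, length, lss=LSS):
    """Return (first LBA, sector count) for an LSS-aligned byte range."""
    assert offset % lss == 0 and length % lss == 0
    return offset // lss, length // lss

def lba_to_phys(lba, lss=LSS, pss=PSS):
    """Physical sector holding `lba`, assuming LBA = 8 * phys_sect."""
    return lba * lss // pss

# 32 KiB read at the 1 MiB position, as in the example in the text.
lba, count = byte_range_to_lbas(1 << 20, 32 << 10)
print(lba, count, lba_to_phys(lba))   # 2048 64 256
```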


Alignment problem on 4 KiB physical / 512 logical drives
=======================================================

This workaround keeps older hardware and software working while
allowing the drive to use larger sector size internally.  However, the
discrepancy between physical and logical sector sizes creates an
alignment issue.  For example, if the driver wants to read 7 sectors
from LBA 2047, the firmware has to read hardware sectors 255 and 256
and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway,
but for writes, the drive has to do read-modify-write to achieve the
requested action.  It has to first read hardware sectors 255 and 256,
update the requested parts and then write those sectors back, which
can cause significant performance degradation[2].
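
The following sketch works out which physical sectors a request
touches and how many bytes of them fall outside the request.  It
assumes the plain LBA = 8 * phys_sect mapping, and the helper names
are hypothetical.  The example is a 4 KiB (8 sector) write starting
one logical sector before a physical boundary.

```python
def phys_span(lba, count, lss=512, pss=4096):
    """Return (first_phys, last_phys, lead_bytes, trail_bytes) for a request.

    Assumes LBA 0 starts at physical sector 0 (no offset-by-one).
    """
    ratio = pss // lss
    first = lba // ratio
    last = (lba + count - 1) // ratio
    lead = (lba % ratio) * lss                      # untouched bytes before
    trail = (last + 1) * pss - (lba + count) * lss  # untouched bytes after
    return first, last, lead, trail

def needs_rmw(lba, count, lss=512, pss=4096):
    """A write needs read-modify-write if any physical sector is partial."""
    _, _, lead, trail = phys_span(lba, count, lss, pss)
    return lead != 0 or trail != 0

print(phys_span(2047, 8), needs_rmw(2047, 8))  # (255, 256, 3584, 512) True
print(phys_span(2048, 8), needs_rmw(2048, 8))  # (256, 256, 0, 0) False
```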

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally.  For reasons dating back more than two decades,
they are laid out according to something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
accumulated over the years for backward compatibility.  The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries, where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate accesses which are 4 KiB aligned
relative to the start of the partition they are in.  If a drive maps
4 KiB physical sectors to 512 byte logical sectors starting from LBA 0,
the filesystem in the first partition will always be misaligned and
filesystems in later partitions are likely to be misaligned too.
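
As a quick check of the claim above, this sketch tests the traditional
partition start positions against 4 KiB boundaries.  It assumes the
usual 255 * 63 geometry and a mapping which starts at LBA 0; the
function name is made up.

```python
SECTORS_PER_CYL = 255 * 63   # 16065 logical sectors per cylinder

def starts_on_phys_boundary(lba, lss=512, pss=4096):
    """True if `lba` begins exactly on a physical sector boundary."""
    return (lba * lss) % pss == 0

# Cylinder boundaries among the first 32 cylinders which happen to align.
aligned_cyls = [c for c in range(1, 33)
                if starts_on_phys_boundary(c * SECTORS_PER_CYL)]

print(starts_on_phys_boundary(63), aligned_cyls)  # False [8, 16, 24, 32]
```

Because 16065 is odd, only every eighth cylinder boundary lands on a
physical sector boundary.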


Solving the alignment problem on 4 KiB physical / 512 logical drives
====================================================================

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

  Yet another workaround which can be done by the firmware is to
  offset physical to logical mapping by one logical sector such that
  LBA 63 ends up on physical sector boundary, which aligns the first
  partition to physical sectors without requiring any software update.
  The example mapping between phys_sector and LBA becomes

    LBA = 8 * phys_sect - 1

  The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
  right after that point.  phys_sect 1 maps to LBA 7 and phys_sect 8
  to LBA 63, making LBA 63 aligned on a hardware sector boundary.

  Although this aligns only the first partition, for many use cases,
  especially the ones involving older software, this workaround was
  deemed useful and some recent drives with 4 KiB physical sectors are
  equipped with a dip switch to turn on or off offset-by-one mapping.
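
The offset-by-one mapping in S-1 can be sketched as follows.  This is
only an illustration of LBA = 8 * phys_sect - 1 with hypothetical
helper names, not a description of any particular firmware.

```python
def lba_to_phys(lba, ratio=8):
    """Return (physical sector, 512 byte slot within it) under offset-by-one."""
    return divmod(lba + 1, ratio)

def is_aligned(lba, ratio=8):
    """True if `lba` sits in the first slot of a physical sector."""
    return (lba + 1) % ratio == 0

print(lba_to_phys(0), lba_to_phys(7), lba_to_phys(63))  # (0, 1) (1, 0) (8, 0)
print(is_aligned(63), is_aligned(2048))                 # True False
```

Note that with this mapping a 1 MiB aligned start such as LBA 2048 is
no longer on a physical sector boundary.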

S-2. The proper solution.

  Correct alignments for all partitions can't be achieved by the
  firmware alone.  The system utilities should be informed about the
  alignment requirements and align partitions accordingly.

  The above firmware workaround complicates the situation because the
  two different configurations require different offsets to achieve
  the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
  export the physical and logical sector sizes and the LBA offset
  which is aligned to the physical sectors.

  In Linux, these parameters are exported via the following sysfs
  nodes.

    physical sector size	: /sys/block/sdX/queue/physical_block_size
    logical sector size		: /sys/block/sdX/queue/logical_block_size
    alignment offset		: /sys/block/sdX/alignment_offset

  Let the physical sector size be PSS, logical sector size LSS and
  alignment offset AOFF.  The system software should place partitions
  such that the starting LBAs of all partitions are aligned on

    (n * PSS + AOFF) / LSS

  For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
  and AOFF 3584 and with n of 7 the above becomes,

    (7 * 4096 + 3584) / 512 == 63

  making sector 63 an aligned LBA where the first partition can be
  put, but without the offset-by-one mapping, AOFF is zero and LBA 63
  is not aligned.

  With the above new alignment requirement in place, it becomes
  difficult to honor the legacy one - first partition on sector 63 and
  all other partitions on cylinder boundary (255 * 63 sectors) - as
  the two alignment requirements contradict each other.  This might be
  worked around by adjusting how LBA and CHS addresses are mapped but
  the disk geometry parameters are hard coded everywhere and there is
  no reliable way to communicate custom geometry parameters.
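
A minimal sketch of the alignment rule from S-2, (n * PSS + AOFF) /
LSS, is shown here.  The sysfs paths are the ones listed in S-2; the
function names and the "sda" device name are assumptions made for the
example, and read_params() naturally only works on a Linux system
which exposes these nodes.

```python
from pathlib import Path

def aligned_lba(n, pss, lss, aoff):
    """LBA of the n-th aligned position: (n * PSS + AOFF) / LSS."""
    byte_pos = n * pss + aoff
    if byte_pos % lss:
        raise ValueError("position is not on a logical sector boundary")
    return byte_pos // lss

def is_aligned(lba, pss, lss, aoff):
    """True if `lba` equals (n * PSS + AOFF) / LSS for some n >= 0."""
    byte_pos = lba * lss - aoff
    return byte_pos >= 0 and byte_pos % pss == 0

def read_params(dev):
    """Read (PSS, LSS, AOFF) for a device such as "sda" from sysfs."""
    base = Path("/sys/block") / dev
    pss = int((base / "queue/physical_block_size").read_text())
    lss = int((base / "queue/logical_block_size").read_text())
    aoff = int((base / "alignment_offset").read_text())
    return pss, lss, aoff

print(aligned_lba(7, 4096, 512, 3584))                                # 63
print(is_aligned(63, 4096, 512, 3584), is_aligned(63, 4096, 512, 0))  # True False
```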


Complications
=============

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

  Some of the existing BIOSs and/or drivers can't cope with drives
  which report 4 KiB physical sector size.  To work around this, some
  drive models falsely report a physical sector size of 512 bytes when
  the actual configuration is 4 KiB without offsetting.

  This nullifies the provisions for alignment in the ATA standard but
  results in the correct alignment for Windows Vista and 7.  OS
  behaviors will be described further later.

  For these drives, which are likely to continue to be shipped for the
  foreseeable future, traditional LBA 63 and cylinder based aligning
  results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

  Windows XP makes use of the CHS start/end addresses in the partition
  table and gets confused if partitions are not laid out
  traditionally.  This means that XP can't be installed into a
  partition prepared by later versions of Windows[4].  This isn't a
  big problem for Windows because in most cases the later version is
  replacing the older one, not the other way around.

  Unfortunately, the situation is more complex for Linux because Linux
  is often co-installed with various versions of Windows and XP is
  still quite popular.  This means that when a Linux partitioner is
  used to prepare a partition which may be used by Windows, the
  partitioner might have to consider which version of Windows is going
  to be used and whether to align the partitions for the correct
  alignment or compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

  The DOS partition format uses 32 bits each for the starting LBA and
  the number of sectors and, reportedly, 32 bit Windows XP shares the
  limitation.  With 32 bit addressing and 512 byte logical sector
  size, the end of the addressable range (the maximum addressable
  sector + 1) is at

    2^32 * 2^9 == 2^41 == 2 TiB

  The DOS partition format allows a partition to reach beyond 2 TiB as
  long as the starting LBA is under 2 TiB; however, both Windows XP
  and the Linux kernel (at least up to v2.6.33) refuse such
  partition configurations.

  With the right combination of host controller, BIOS and driver, this
  barrier can be overcome by enlarging the logical sector size to 4
  KiB, which will push the barrier out to 16 TiB.  On the right
  configuration, Windows XP is reportedly able to address beyond the 2
  TiB barrier with a DOS partition and 4 KiB logical sector size.
  Linux kernels up to v2.6.33 don't work under such configurations but
  a patch to make it work is pending[5].

  This might also be beneficial for operating systems which don't
  suffer from this limitation.  A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with older BIOSs or other operating systems which don't recognize
  the new format.

  As mentioned previously, the 512 byte sector assumption has been there
  for a very long time and changing it is likely to cause various
  compatibility problems at many different layers from hardware up to
  the system utilities.
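
The limits discussed in C-3 follow directly from the 32 bit fields.
The sketch below just redoes that arithmetic for both logical sector
sizes; the function names are made up for illustration.

```python
TIB = 1 << 40

def mbr_limit_bytes(lss):
    """First byte offset which a 32 bit starting LBA cannot address."""
    return (1 << 32) * lss

def mbr_max_end_bytes(lss):
    """Largest end offset the format can express: max start + max length."""
    max_field = (1 << 32) - 1
    return (max_field + max_field) * lss

print(mbr_limit_bytes(512) // TIB, mbr_limit_bytes(4096) // TIB)  # 2 16
print(mbr_max_end_bytes(512) < 4 * TIB)                           # True
```

The second figure corresponds to a partition which starts just under
the 2 TiB mark and extends beyond it, the case which the text says
Windows XP and the Linux kernel refuse.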


Windows
=======

As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
partitions drives with different alignment requirements.  Up until
Windows XP, it followed the traditional layout - the first partition
on LBA 63 and the others on cylinder boundaries, where a cylinder is
defined as 255 tracks with 63 sectors each.

Windows Vista and 7 align partitions differently.  As the two behave
similarly, only 7's behavior is shown here.  These partition tables
were created by the Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00689e12
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 312371200 = 312578048

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00b83f25
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 624932864 = 625139712

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202800 07 df130c 07080000 f91f0300
  00 df1b0c 07 feffff 07280300 f9376d74
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   40	: 2055		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2055 + 204793 = 206848

  Part1:	FIRST	C   12	H  223	S   27	: 206855
		LAST	C 1023	H  254	S   63	: E
		LBA	206855 + 1953314809 = 1953521664

  Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.
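
For anyone who wants to re-check the decoded values, the raw rows
above can be unpacked with a few lines of Python.  This sketch assumes
the standard 16 byte MBR entry layout (status, CHS first, type, CHS
last, little-endian LBA and sector count) and the 255/63 geometry used
in the tables; the helper names are hypothetical.

```python
import struct

HEADS, SPT = 255, 63  # geometry assumed by the tables above

def unpack_chs(head, sect, cyl):
    """Split the packed 3 byte CHS field into (cylinder, head, sector)."""
    return ((sect & 0xC0) << 2) | cyl, head, sect & 0x3F

def chs_to_lba(cyl, head, sect):
    """Convert a CHS address to an LBA using the assumed geometry."""
    return (cyl * HEADS + head) * SPT + sect - 1

def decode_entry(hexrow):
    """Decode one 16 byte MBR partition entry given as a hex string."""
    raw = bytes.fromhex(hexrow.replace(" ", ""))
    status, h1, s1, c1, ptype, h2, s2, c2, lba, nblks = struct.unpack(
        "<8B2I", raw)
    return {
        "bootable": status == 0x80,
        "type": ptype,
        "first_chs": unpack_chs(h1, s1, c1),
        "last_chs": unpack_chs(h2, s2, c2),
        "lba": lba,
        "nblks": nblks,
    }

part0 = decode_entry("80 202800 07 df130c 07080000 f91f0300")  # W-3, part 0
print(part0["first_chs"], chs_to_lba(*part0["first_chs"]))  # (0, 32, 40) 2055
print(part0["lba"], part0["nblks"], part0["lba"] % 2048)    # 2055 204793 7
```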

The partitioner seems to be using 1 MiB as the basic alignment unit
and offsetting from there only if explicitly requested by the drive.
There is no difference between the handling of 512 byte and 4 KiB
drives, which explains why C-1 works for hard drive vendors.
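
The observed start positions are consistent with a simple "round up to
the next 1 MiB boundary, then add the drive's offset" rule.  The
sketch below only reproduces the start LBAs seen in W-1 to W-3; the
rule Windows actually applies isn't known here, and align_up() is a
made-up helper.

```python
def align_up(lba, unit=2048, offset=0):
    """Smallest LBA >= lba of the form unit * n + offset (in 512 byte sectors)."""
    return ((lba - offset + unit - 1) // unit) * unit + offset

# W-1 / W-2: no offset requested by the drive.
print(align_up(63), align_up(2048 + 204800))                       # 2048 206848
# W-3: offset-by-one drive, 7 logical sectors of offset.
print(align_up(63, offset=7), align_up(2055 + 204793, offset=7))   # 2055 206855
```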

In all cases, the partitioner ignores both legacy requirements - the
first partition on LBA 63 and the others on cylinder boundaries - while
still using the same 255*63 cylinder size.  Also, note that in W-3,
both part 0 and part 1 end up with an odd number of sectors.  It seems
that they simply decided to completely break away from the traditional
layout, which is understandable given that there really isn't one good
solution which can cover all the cases and that the larger default
alignment benefits earlier SSDs.

Windows Vista basically shows the same behavior.  Vista was tested by
creating two partitions using the management tool.  Test data is
available at [7].

  *-alignment_offset	: alignment_offset reported by Linux kernel
  *-fdisk		: fdisk -l output
  *-fdisk-u		: fdisk -lu output
  *-hdparm		: hdparm -I output
  *-mbr			: dump of mbr
  *-part		: decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset.  It
should be reporting 512 instead of 256 for offset-by-one drives.


So, what now for Linux?
=======================

The situation is not easy.  Considering all the factors, the only
workable solution looks like doing what Windows is doing.  Hard drive
and SSD vendors are focusing on compatibility and performance on
recent Windows releases and are happy to do things which break the
standard-defined mechanism as shown by C-1, so departing from what
Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't
share the hard drive with older releases including Windows XP, Linux
distros can't do that.  There will be many installations where
modern Linux distros share a hard drive with older releases of
Windows.  At this point, I can't see a silver bullet solution.

Perhaps partitioners should only align partitions which will be used
by Linux and default to the traditional layout for others while
allowing explicit override.  I think Windows XP wouldn't have a
problem with differently aligned partitions as long as it doesn't
actually use them, but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives
larger than 2 TiB in any configuration and alignment isn't done
properly for drives with 4 KiB physical sectors.  4 KiB logical sector
support is broken in both the kernel and partitioners.  (need more
details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or
logical too, is looking fairly ugly.  Hopefully, a reasonable solution
can be reached in the not too distant future but even with all the
software side updated, it looks like it's gonna cause a significant
amount of confusion and frustration.


[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

* Mar 04 2010
	Initial draft, Tejun Heo <tj@kernel.org>
* Mar 08 2010
	Updated according to comments from Daniel Taylor
	<Daniel.Taylor@wdc.com>.  Other minor updates.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* ATA 4 KiB sector issues.
@ 2010-03-08  3:48 ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-03-08  3:48 UTC (permalink / raw)
  To: linux-ide, lkml, Daniel Taylor, Jeff Garzik, Mark Lord, tytso,
	H. Peter Anvin, hirofumi, Andrew Morton, Alan Cox, irtiger,
	Matthew Wilcox, aschnell, knikanth, jdelvare

Hello, guys.

It looks like transition to ATA 4k drives will be quite painful and we
aren't really ready although these drives are already selling widely.
I've written up a summary document on the issue to clarify stuff as
it's getting more and more confusing and develop some consensus.  It's
also on the linux ata wiki.

  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

I've cc'd people whom I can think of off the top of my head but I
surely have missed some people who would have been interested.  Please
feel free to add cc's or forward the message to other MLs.
Especially, I don't know much about partitioners so the details there
are pretty shallow and could be plain wrong.  It would be great if
someone who knows more about this stuff can chime in.

Thanks.

=== Document follows ===

ATA 4 KiB sector issues

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors.  For example, my 500 GB (about 466 GiB) hard drive is
organized as 976773168 512 byte sectors numbered from 0 to 976773167.
This is how a drive communicates with the driver.  When the operating
system wants to read 32 KiB of data at the 1 MiB position, the driver
asks the drive to read 64 sectors from LBA (logical block address,
i.e. sector number) 2048.

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors.  In addition to the area to store the actual data, each
sector requires extra space for bookkeeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee an acceptable level of data
integrity, which increases the space overhead.  In addition, in most
applications, hard drives are now accessed in units of at least 8
sectors or 4096 bytes, so maintaining 512 byte granularity has become
somewhat meaningless.

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].


Physical vs. Logical
====================

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSs, boot code,
drivers, partitioners and system utilities.  This makes it very
difficult to change the sector size from 512 bytes without massively
breaking backward compatibility.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive will present it as if the drive is composed of 512 byte
sectors thus making the drive behave as before, so if the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read 8 4 KiB sectors from hardware sector 256.  As a
result, the hard drive now has two sector sizes - the physical one
which the physical media is actually organized in, and the logical one
which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA
would be

  LBA = 8 * phys_sect


Alignment problem on 4 KiB physical / 512 logical drives
=======================================================

This workaround keeps older hardware and software working while
allowing the drive to use larger sector size internally.  However, the
discrepancy between physical and logical sector sizes creates an
alignment issue.  For example, if the driver wants to read 7 sectors
from LBA 2047, the firmware has to read hardware sectors 255 and 256
and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway,
but for writes, the drive has to do read-modify-write to achieve the
requested action.  It has to first read hardware sectors 255 and 256,
update the requested parts and then write those sectors back, which
can cause significant performance degradation[2].

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally.  For reasons dating back more than two decades,
they are laid out according to something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
accumulated over the years for backward compatibility.  The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries, where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate accesses which are 4 KiB aligned
relative to the start of the partition they are in.  If a drive maps
4 KiB physical sectors to 512 byte logical sectors starting from LBA 0,
the filesystem in the first partition will always be misaligned and
filesystems in later partitions are likely to be misaligned too.


Solving the alignment problem on 4 KiB physical / 512 logical drives
====================================================================

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

  Yet another workaround which can be done by the firmware is to
  offset physical to logical mapping by one logical sector such that
  LBA 63 ends up on physical sector boundary, which aligns the first
  partition to physical sectors without requiring any software update.
  The example mapping between phys_sector and LBA becomes

    LBA = 8 * phys_sect - 1

  The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
  right after that point.  phys_sect 1 maps to LBA 7 and phys_sect 8
  to LBA 63, making LBA 63 aligned on a hardware sector boundary.

  Although this aligns only the first partition, for many use cases,
  especially the ones involving older software, this workaround was
  deemed useful and some recent drives with 4 KiB physical sectors are
  equipped with a dip switch to turn on or off offset-by-one mapping.

S-2. The proper solution.

  Correct alignments for all partitions can't be achieved by the
  firmware alone.  The system utilities should be informed about the
  alignment requirements and align partitions accordingly.

  The above firmware workaround complicates the situation because the
  two different configurations require different offsets to achieve
  the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
  export the physical and logical sector sizes and the LBA offset
  which is aligned to the physical sectors.

  In Linux, these parameters are exported via the following sysfs
  nodes.

    physical sector size	: /sys/block/sdX/queue/physical_block_size
    logical sector size		: /sys/block/sdX/queue/logical_block_size
    alignment offset		: /sys/block/sdX/alignment_offset

  Let the physical sector size be PSS, logical sector size LSS and
  alignment offset AOFF.  The system software should place partitions
  such that the starting LBAs of all partitions are aligned on

    (n * PSS + AOFF) / LSS

  For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
  and AOFF 3584 and with n of 7 the above becomes,

    (7 * 4096 + 3584) / 512 == 63

  making sector 63 an aligned LBA where the first partition can be
  put, but without the offset-by-one mapping, AOFF is zero and LBA 63
  is not aligned.
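
  A minimal Python sketch of the rule above, using the sysfs nodes
  listed earlier.  The helper names (read_alignment_params,
  aligned_lba, is_aligned) are hypothetical, and the sysfs reader only
  works on a Linux system which exports these nodes.

```python
from pathlib import Path


def read_alignment_params(dev):
    """Read (PSS, LSS, AOFF) in bytes for a block device name such as 'sda'."""
    base = Path("/sys/block") / dev
    pss = int((base / "queue/physical_block_size").read_text())
    lss = int((base / "queue/logical_block_size").read_text())
    aoff = int((base / "alignment_offset").read_text())
    return pss, lss, aoff


def aligned_lba(n, pss, lss, aoff):
    """n-th aligned LBA: (n * PSS + AOFF) / LSS."""
    return (n * pss + aoff) // lss


def is_aligned(lba, pss, lss, aoff):
    """True if lba sits on a physical sector boundary for these parameters."""
    return (lba * lss - aoff) % pss == 0


# Offset-by-one drive: LBA 63 is aligned.  Without the offset it is not.
print(aligned_lba(7, 4096, 512, 3584))
print(is_aligned(63, 4096, 512, 3584), is_aligned(63, 4096, 512, 0))
```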

  With the above new alignment requirement in place, it becomes
  difficult to honor the legacy one - first partition on sector 63 and
  all other partitions on cylinder boundary (255 * 63 sectors) - as
  the two alignment requirements contradict each other.  This might be
  worked around by adjusting how LBA and CHS addresses are mapped but
  the disk geometry parameters are hard coded everywhere and there is
  no reliable way to communicate custom geometry parameters.


Complications
=============

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

  Some of the existing BIOSs and/or drivers can't cope with drives
  which report 4 KiB physical sector size.  To work around this, some
  drive models falsely report a physical sector size of 512 bytes when
  the actual configuration is 4 KiB without offsetting.

  This nullifies the provisions for alignment in the ATA standard but
  results in the correct alignment for Windows Vista and 7.  OS
  behaviors will be described further later.

  For these drives, which are likely to continue to be shipped for the
  foreseeable future, traditional LBA 63 and cylinder based aligning
  results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

  Windows XP makes use of the CHS start/end addresses in the partition
  table and gets confused if partitions are not laid out
  traditionally.  This means that XP can't be installed into a
  partition prepared by later versions of Windows[4].  This isn't a
  big problem for Windows because in most cases the later version is
  replacing the older one, not the other way around.

  Unfortunately, the situation is more complex for Linux because Linux
  is often co-installed with various versions of Windows and XP is
  still quite popular.  This means that when a Linux partitioner is
  used to prepare a partition which may be used by Windows, the
  partitioner might have to consider which version of Windows is going
  to be used and whether to align the partitions for the correct
  alignment or compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

  The DOS partition format uses 32 bits each for the starting LBA and
  the number of sectors and, reportedly, 32 bit Windows XP shares the
  limitation.  With 32 bit addressing and 512 byte logical sector
  size, the maximum addressable sector + 1 is at

    2^32 * 2^9 == 2^41 == 2 TiB

  The DOS partition format allows a partition to reach beyond 2 TiB as
  long as the starting LBA is under 2 TiB; however, both Windows XP
  and the Linux kernel (at least up to v2.6.33) refuse such partition
  configurations.

  With the right combination of host controller, BIOS and driver, this
  barrier can be overcome by enlarging the logical sector size to 4
  KiB, which will push the barrier out to 16 TiB.  On the right
  configuration, Windows XP is reportedly able to address beyond the 2
  TiB barrier with a DOS partition and 4 KiB logical sector size.
  The Linux kernel up to v2.6.33 doesn't work under such
  configurations but a patch to make it work is pending[5].
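
  The arithmetic behind the 2 TiB and 16 TiB figures can be checked
  with a few lines of Python.  The function name is made up for the
  example.

```python
TIB = 1 << 40


def dos_partition_limit(logical_sector_size, lba_bits=32):
    """Bytes addressable with lba_bits-wide sector numbers."""
    return (1 << lba_bits) * logical_sector_size


print(dos_partition_limit(512) // TIB)    # 512 byte logical sectors
print(dos_partition_limit(4096) // TIB)   # 4 KiB logical sectors
```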

  This might also be beneficial for operating systems which don't
  suffer from this limitation.  A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with older BIOSs or other operating systems which don't recognize
  the new format.

  As mentioned previously, the 512 byte sector assumption has been
  there for a very long time and changing it is likely to cause
  various compatibility problems at many different layers from
  hardware up to the system utilities.


Windows
=======

As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
partitions drives with different alignment requirements.  Up until
Windows XP, it followed the traditional layout - the first partition
on LBA 63 and the others on cylinder boundaries where a cylinder is
defined as 255 tracks with 63 sectors each.

Windows Vista and 7 align partitions differently.  As the two behave
similarly, only 7's behavior is shown here.  These partition tables
were created by the Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00689e12
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 312371200 = 312578048

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00b83f25
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 624932864 = 625139712

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

  ST FIRST  T  LAST   LBA      NBLKS
  80 202800 07 df130c 07080000 f91f0300
  00 df1b0c 07 feffff 07280300 f9376d74
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   40	: 2055		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2055 + 204793 = 206848

  Part1:	FIRST	C   12	H  223	S   27	: 206855
		LAST	C 1023	H  254	S   63	: E
		LBA	206855 + 1953314809 = 1953521664

  Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.
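
The raw entries in W-1 to W-3 can be decoded with a short Python
sketch like the one below.  It assumes the standard 16 byte DOS
partition entry layout (status, CHS first, type, CHS last, then
little-endian start LBA and sector count) and the 255 heads / 63
sectors geometry noted next to the tables.  The function names are
made up for the example.

```python
import struct

HEADS, SECTORS = 255, 63   # geometry noted next to the tables


def decode_chs(raw):
    """Decode a 3 byte CHS field into (cylinder, head, sector)."""
    head, sec, cyl_lo = raw
    cyl = ((sec & 0xC0) << 2) | cyl_lo   # top two bits of byte 1 extend C
    return cyl, head, sec & 0x3F


def chs_to_lba(cyl, head, sector):
    """Convert CHS to LBA for the assumed 255/63 geometry."""
    return (cyl * HEADS + head) * SECTORS + sector - 1


def decode_entry(line):
    """Decode one 'ST FIRST T LAST LBA NBLKS' hex line from the tables."""
    raw = bytes.fromhex(line.replace(" ", ""))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {
        "status": status,
        "type": ptype,
        "first_chs": decode_chs(first),
        "last_chs": decode_chs(last),
        "lba": lba,
        "nblks": nblks,
    }


entry = decode_entry("80 202800 07 df130c 07080000 f91f0300")  # W-3, part 0
print(entry["first_chs"], entry["lba"], entry["nblks"])
print(entry["lba"] % 2048)   # remainder relative to the 2048 sector unit
```

The LAST field of the second partition in each table decodes to
C 1023 H 254 S 63, the largest value the CHS field can hold, so it
cannot be converted back to a meaningful LBA.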

The partitioner seems to be using 1 MiB as the basic alignment unit
and offsetting from there only if explicitly requested by the drive.
There is no difference between the handling of 512 byte and 4 KiB
drives, which explains why C-1 works for hard drive vendors.
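
The observed placement can be approximated with the following Python
sketch.  This is inferred from the tables above and is not the actual
Windows algorithm; in particular, the function name and the rule that
the first partition starts no earlier than one full alignment unit are
assumptions.

```python
def next_aligned_start(min_lba, lss=512, aoff=0, unit=1 << 20):
    """Smallest start LBA >= min_lba of the form n * unit + offset, n >= 1.

    lss  : logical sector size in bytes
    aoff : alignment offset in bytes as reported by the drive
    unit : basic alignment unit in bytes (1 MiB as observed above)
    """
    unit_sectors = unit // lss
    off_sectors = aoff // lss
    # Ceiling division without floating point.
    n = -(-(min_lba - off_sectors) // unit_sectors)
    n = max(n, 1)   # assumption: never start inside the first unit
    return n * unit_sectors + off_sectors


# No offset (W-1, W-2) and offset-by-one (W-3, AOFF 3584 bytes).
print(next_aligned_start(0), next_aligned_start(206848))
print(next_aligned_start(0, aoff=3584), next_aligned_start(206848, aoff=3584))
```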

In all cases, the partitioner ignores both requirements of the
traditional layout, the first partition on LBA 63 and the others on
cylinder boundaries, while still using the same 255*63 cylinder size.
Also, note that in W-3, both part 0 and 1 end up with an odd number of
sectors.  It seems that they simply decided to completely break away
from the traditional layout, which is understandable given that there
really isn't one good solution which can cover all the cases and that
the default larger alignment benefits earlier SSDs.

Windows Vista basically shows the same behavior.  Vista was tested by
creating two partitions using the management tool.  Test data is
available at [7].

  *-alignment_offset	: alignment_offset reported by Linux kernel
  *-fdisk		: fdisk -l output
  *-fdisk-u		: fdisk -lu output
  *-hdparm		: hdparm -I output
  *-mbr			: dump of mbr
  *-part		: decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset.  It
should be reporting 512 instead of 256 for offset-by-one drives.


So, what now for Linux?
=======================

The situation is not easy.  Considering all the factors, the only
workable solution looks like doing what Windows is doing.  Hard drive
and SSD vendors are focusing on compatibility and performance on
recent Windows releases and are happy to do things which break the
standard defined mechanism as shown by C-1, so departing from what
Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't
share the hard drive with older releases including Windows XP, Linux
distros can't do that.  There will be many installations where a
modern Linux distro shares a hard drive with older releases of
Windows.  At this point, I can't see a silver bullet solution.

Perhaps partitioners should only align partitions which will be used
by Linux and default to the traditional layout for others while
allowing explicit override.  I think Windows XP wouldn't have a
problem with differently aligned partitions as long as it doesn't
actually use them but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives
larger than 2 TiB in any configuration and alignment isn't done
properly for drives with 4 KiB physical sectors.  4 KiB logical sector
support is broken in both the kernel and partitioners.  (need more
details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or
logical too, is looking fairly ugly.  Hopefully, a reasonable solution
can be reached in the not too distant future but even with all the
software side updated, it looks like it's gonna cause a significant
amount of confusion and frustration.


[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

* Mar 04 2009
	Initial draft, Tejun Heo <tj@kernel.org>
* Mar 08 2009
	Updated according to comments from Daniel Taylor
	<Daniel.Taylor@wdc.com>.  Other minor updates.
