Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

From: Linda Knippers <linda.knippers@hp.com>
To: Dave Chinner <david@fromorbit.com>, Jeff Moyer <jmoyer@redhat.com>
Cc: "matthew r. wilcox" <matthew.r.wilcox@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"
Date: Wed, 05 Aug 2015 21:42:54 -0400
Message-ID: <55C2BB9E.3040709@hp.com>
In-Reply-To: <20150805220113.GC3902@dastard>

On 08/05/2015 06:01 PM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> Hi, Matthew,
>>
>> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>>
>> # mkfs -t xfs -f /dev/pmem0
>> meta-data=/dev/pmem0             isize=256    agcount=4, agsize=524288 blks
>>          =                       sectsz=512   attr=2, projid32bit=1
>>          =                       crc=0        finobt=0
>> data     =                       bsize=4096   blocks=2097152, imaxpct=25
>>          =                       sunit=0      swidth=0 blks
>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
>> log      =internal log           bsize=4096   blocks=2560, version=2
>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>> mkfs.xfs: read failed: Numerical result out of range
>>
>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>> from the last sector of the device.  This results in dax_io trying to do
>> a page-sized I/O at 512 bytes from the end of the device.
> 
> Right - we have to be able to do IO to that last sector, so this is
> a sanity check to tell if the block dev is large enough. The XFS
> kernel code does the same end-of-device sector read when the
> filesystem is mounted, too.
> 
>> bdev_direct_access, receiving this bogus pos/size combo, returns
>> -ERANGE:
>>
>> 	if ((sector + DIV_ROUND_UP(size, 512)) >
>> 					part_nr_sects_read(bdev->bd_part))
>> 		return -ERANGE;
>>
>> Given that file systems supporting dax refuse to mount with a blocksize
>> != page size, I'm guessing this is sort of expected behavior.  However,
>> we really shouldn't be breaking direct I/O on pmem devices.
> 
> If the device is advertising 512 byte sector size support, then this
> needs to work, especially as DAX is completely transparent on the
> block device. Remember that DAX through a filesystem works on
> filesystem data block size boundaries, so a 512 byte sector/4k block
> size filesystem will be able to use DAX for mmapped files just fine.
> 
>> So, what do you want to do?  We could make the pmem device's logical
>> block size fixed at the system page size.  Or, we could modify the dax
>> code to work with blocksize < pagesize.  Or, we could continue using the
>> direct I/O codepath for direct block device access.  What do you think?
> 
> I don't know how the pmem device sets up its limits. Can you post
> the output of:
> 
> 	/sys/block/pmem0/queue/logical_block_size
512

> 	/sys/block/pmem0/queue/physical_block_size
512

> 	/sys/block/pmem0/queue/hw_sector_size
512

> 	/sys/block/pmem0/queue/minimum_io_size
512

> 	/sys/block/pmem0/queue/optimal_io_size
0

Let me know if you need anything else.

-- ljk

> As these all affect how mkfs.xfs configures the filesystem being
> made and so influence the size and alignment of the IO it does....
> 
> Cheers,
> 
> Dave.
>