linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* i_size misuses
@ 2010-09-08 13:32 Mikulas Patocka
  2010-09-11 22:05 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Mikulas Patocka @ 2010-09-08 13:32 UTC (permalink / raw)
  To: linux-kernel, dm-devel, linux-fsdevel, jaxboe
  Cc: Milan Broz, Alasdair G Kergon

Hi

when reviewing some i_size problem, I searched the kernel for i_size usage 
(the variable should really be written with i_size_write and read with 
i_size_read).

Properly locked direct use of "i_size" inside memory management or 
filesystems may not be a problem, but there are many problems in general 
code outside mm.

The misuses are:
SOUND/SOUND_FIRMWARE.C:l = filp->f_path.dentry->d_inode->i_size;
KERNEL/RELAY.C:buf->dentry->d_inode->i_size = buf->early_bytes;
KERNEL/RELAY.C:buf->dentry->d_inode->i_size += buf->chan->subbuf_size 
-buf->padding[old_subbuf];
DRIVERS/USB/CORE/INODE.C:dev->usbfs_dentry->d_inode->i_size = i_size;
DRIVERS/MTD/DEVICES/BLOCK2MTD.C:dev->mtd.size = 
dev->blkdev->bd_inode->i_size & PAGE_MASK;
DRIVERS/MD/MD.C: many reads of i_size 
DRIVERS/BLOCK/NBD.C: many writes to i_size without i_size_write
DRIVERS/BLOCK/DRBD/DRBD_INT.H: return bdev ? bdev->bd_inode->i_size >> 9 : 0;
DRIVERS/BLOCK/DRBD/DRBD_WRAPPERS.H: mdev->this_bdev->bd_inode->i_size = 
(loff_t)size << 9;
BLOCK/BLK-CORE.C:printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
                bdevname(bio->bi_bdev, b),
                bio->bi_rw,
                (unsigned long long)bio->bi_sector + bio_sectors(bio),
                (long long)(bio->bi_bdev->bd_inode->i_size >> 9));
maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
BLOCK/COMPAT_IOCTL.C: size = bdev->bd_inode->i_size;
return compat_put_u64(arg, bdev->bd_inode->i_size);
BLOCK/IOCTL.C: if (start + len > (bdev->bd_inode->i_size >> 9))
size = bdev->bd_inode->i_size;
return put_u64(arg, bdev->bd_inode->i_size);

The problem with this code is that if you read i_size without i_size_read 
and the size wraps around 32 bits, for example from 0xffffffff to 
0x100000000 , there is a possibility on 32-bit machines to read an invalid 
value (either 0 or 0x1ffffffff). Similarly, if you write i_size without 
i_size_write, the readers can see intermediate invalid values.


The original problem that caused this investigation is the question, how a 
block device driver can change the size of its device. Normal method (used 
in a few drivers, including dm), consists of
mutex_lock(&inode->i_mutex);
i_size_write(inode, new_size);
mutex_unlock(&inode->i_mutex);

This is deadlock-prone, because i_mutex is also held on fsync path. 
Therefore, this deadlock happens: fsync takes i_mutex and issues I/Os, 
block device driver wants to change its size, so it waits on i_mutex, 
the I/Os wait until the device driver did its internal maintenance and 
changed the inode size. The driver doesn't change the size until fsync 
finished.

Jens, as a block maintainer, please think about it and propose some 
specification how to clean this up. Also a clean verifiable rule regarding 
i_size should be specified and the code should be fixed to conform to the 
rule: maybe we could rename i_size to __i_size and ban its using.

Mikulas

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: i_size misuses
  2010-09-08 13:32 i_size misuses Mikulas Patocka
@ 2010-09-11 22:05 ` Neil Brown
  2010-10-19 18:55   ` Mikulas Patocka
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2010-09-11 22:05 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: linux-kernel, dm-devel, linux-fsdevel, jaxboe, Milan Broz,
	Alasdair G Kergon

On Wed, 8 Sep 2010 09:32:13 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> when reviewing some i_size problem, I searched the kernel for i_size usage 
> (the variable should really be written with i_size_write and read with 
> i_size_read).
> 
> Properly locked direct use of "i_size" inside memory management or 
> filesystems may not be a problem, but there are many problems in general 
> code outside mm.
> 
> The misuses are:
> SOUND/SOUND_FIRMWARE.C:l = filp->f_path.dentry->d_inode->i_size;
> KERNEL/RELAY.C:buf->dentry->d_inode->i_size = buf->early_bytes;
> KERNEL/RELAY.C:buf->dentry->d_inode->i_size += buf->chan->subbuf_size 
> -buf->padding[old_subbuf];
> DRIVERS/USB/CORE/INODE.C:dev->usbfs_dentry->d_inode->i_size = i_size;
> DRIVERS/MTD/DEVICES/BLOCK2MTD.C:dev->mtd.size = 
> dev->blkdev->bd_inode->i_size & PAGE_MASK;
> DRIVERS/MD/MD.C: many reads of i_size 
> DRIVERS/BLOCK/NBD.C: many writes to i_size without i_size_write
> DRIVERS/BLOCK/DRBD/DRBD_INT.H: return bdev ? bdev->bd_inode->i_size >> 9 : 0;
> DRIVERS/BLOCK/DRBD/DRBD_WRAPPERS.H: mdev->this_bdev->bd_inode->i_size = 
> (loff_t)size << 9;
> BLOCK/BLK-CORE.C:printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
>                 bdevname(bio->bi_bdev, b),
>                 bio->bi_rw,
>                 (unsigned long long)bio->bi_sector + bio_sectors(bio),
>                 (long long)(bio->bi_bdev->bd_inode->i_size >> 9));
> maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
> BLOCK/COMPAT_IOCTL.C: size = bdev->bd_inode->i_size;
> return compat_put_u64(arg, bdev->bd_inode->i_size);
> BLOCK/IOCTL.C: if (start + len > (bdev->bd_inode->i_size >> 9))
> size = bdev->bd_inode->i_size;
> return put_u64(arg, bdev->bd_inode->i_size);
> 
> The problem with this code is that if you read i_size without i_size_read 
> and the size wraps around 32 bits, for example from 0xffffffff to 
> 0x100000000 , there is a possibility on 32-bit machines to read an invalid 
> value (either 0 or 0x1ffffffff). Similarly, if you write i_size without 
> i_size_write, the readers can see intermediate invalid values.
> 
> 
> The original problem that caused this investigation is the question, how a 
> block device driver can change the size of its device. Normal method (used 
> in a few drivers, including dm), consists of
> mutex_lock(&inode->i_mutex);
> i_size_write(inode, new_size);
> mutex_unlock(&inode->i_mutex);

Don't you just do

  set_capacity(gendisk, sectors);
  revalidate(gendisk);

??

NeilBrown

> 
> This is deadlock-prone, because i_mutex is also held on fsync path. 
> Therefore, this deadlock happens: fsync takes i_mutex and issues I/Os, 
> block device driver wants to change its size, so it waits on i_mutex, 
> the I/Os wait until the device driver did its internal maintenance and 
> changed the inode size. The driver doesn't change the size until fsync 
> finished.
> 
> Jens, as a block maintainer, please think about it and propose some 
> specification how to clean this up. Also a clean verifiable rule regarding 
> i_size should be specified and the code should be fixed to conform to the 
> rule: maybe we could rename i_size to __i_size and ban its using.
> 
> Mikulas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: i_size misuses
  2010-09-11 22:05 ` Neil Brown
@ 2010-10-19 18:55   ` Mikulas Patocka
  2010-10-19 22:33     ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Mikulas Patocka @ 2010-10-19 18:55 UTC (permalink / raw)
  To: Neil Brown
  Cc: linux-kernel, dm-devel, linux-fsdevel, jaxboe, Milan Broz,
	Alasdair G Kergon



On Sun, 12 Sep 2010, Neil Brown wrote:

> On Wed, 8 Sep 2010 09:32:13 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > when reviewing some i_size problem, I searched the kernel for i_size usage 
> > (the variable should really be written with i_size_write and read with 
> > i_size_read).
> > 
> > Properly locked direct use of "i_size" inside memory management or 
> > filesystems may not be a problem, but there are many problems in general 
> > code outside mm.
> > 
> > The misuses are:
> > SOUND/SOUND_FIRMWARE.C:l = filp->f_path.dentry->d_inode->i_size;
> > KERNEL/RELAY.C:buf->dentry->d_inode->i_size = buf->early_bytes;
> > KERNEL/RELAY.C:buf->dentry->d_inode->i_size += buf->chan->subbuf_size 
> > -buf->padding[old_subbuf];
> > DRIVERS/USB/CORE/INODE.C:dev->usbfs_dentry->d_inode->i_size = i_size;
> > DRIVERS/MTD/DEVICES/BLOCK2MTD.C:dev->mtd.size = 
> > dev->blkdev->bd_inode->i_size & PAGE_MASK;
> > DRIVERS/MD/MD.C: many reads of i_size 
> > DRIVERS/BLOCK/NBD.C: many writes to i_size without i_size_write
> > DRIVERS/BLOCK/DRBD/DRBD_INT.H: return bdev ? bdev->bd_inode->i_size >> 9 : 0;
> > DRIVERS/BLOCK/DRBD/DRBD_WRAPPERS.H: mdev->this_bdev->bd_inode->i_size = 
> > (loff_t)size << 9;
> > BLOCK/BLK-CORE.C:printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
> >                 bdevname(bio->bi_bdev, b),
> >                 bio->bi_rw,
> >                 (unsigned long long)bio->bi_sector + bio_sectors(bio),
> >                 (long long)(bio->bi_bdev->bd_inode->i_size >> 9));
> > maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
> > BLOCK/COMPAT_IOCTL.C: size = bdev->bd_inode->i_size;
> > return compat_put_u64(arg, bdev->bd_inode->i_size);
> > BLOCK/IOCTL.C: if (start + len > (bdev->bd_inode->i_size >> 9))
> > size = bdev->bd_inode->i_size;
> > return put_u64(arg, bdev->bd_inode->i_size);
> > 
> > The problem with this code is that if you read i_size without i_size_read 
> > and the size wraps around 32 bits, for example from 0xffffffff to 
> > 0x100000000 , there is a possibility on 32-bit machines to read an invalid 
> > value (either 0 or 0x1ffffffff). Similarly, if you write i_size without 
> > i_size_write, the readers can see intermediate invalid values.
> > 
> > 
> > The original problem that caused this investigation is the question, how a 
> > block device driver can change the size of its device. Normal method (used 
> > in a few drivers, including dm), consists of
> > mutex_lock(&inode->i_mutex);
> > i_size_write(inode, new_size);
> > mutex_unlock(&inode->i_mutex);
> 
> Don't you just do
> 
>   set_capacity(gendisk, sectors);
>   revalidate(gendisk);
> 
> ??
> 
> NeilBrown

revalidate_disk() has still the problem that it changes i_size without 
holding i_mutex and other kernel parts (for example generic_file_llseek) 
assume that if we hold the lock, i_size_can't be changed.

The rules for accessing i_size must be specified and followed.

Mikulas

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: i_size misuses
  2010-10-19 18:55   ` Mikulas Patocka
@ 2010-10-19 22:33     ` Neil Brown
  2010-10-20  0:09       ` Mikulas Patocka
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2010-10-19 22:33 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: linux-kernel, dm-devel, linux-fsdevel, jaxboe, Milan Broz,
	Alasdair G Kergon

On Tue, 19 Oct 2010 14:55:31 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Sun, 12 Sep 2010, Neil Brown wrote:
> 
> > On Wed, 8 Sep 2010 09:32:13 -0400 (EDT)
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > Hi
> > > 
> > > when reviewing some i_size problem, I searched the kernel for i_size usage 
> > > (the variable should really be written with i_size_write and read with 
> > > i_size_read).
> > > 
> > > Properly locked direct use of "i_size" inside memory management or 
> > > filesystems may not be a problem, but there are many problems in general 
> > > code outside mm.
> > > 
> > > The misuses are:
> > > SOUND/SOUND_FIRMWARE.C:l = filp->f_path.dentry->d_inode->i_size;
> > > KERNEL/RELAY.C:buf->dentry->d_inode->i_size = buf->early_bytes;
> > > KERNEL/RELAY.C:buf->dentry->d_inode->i_size += buf->chan->subbuf_size 
> > > -buf->padding[old_subbuf];
> > > DRIVERS/USB/CORE/INODE.C:dev->usbfs_dentry->d_inode->i_size = i_size;
> > > DRIVERS/MTD/DEVICES/BLOCK2MTD.C:dev->mtd.size = 
> > > dev->blkdev->bd_inode->i_size & PAGE_MASK;
> > > DRIVERS/MD/MD.C: many reads of i_size 
> > > DRIVERS/BLOCK/NBD.C: many writes to i_size without i_size_write
> > > DRIVERS/BLOCK/DRBD/DRBD_INT.H: return bdev ? bdev->bd_inode->i_size >> 9 : 0;
> > > DRIVERS/BLOCK/DRBD/DRBD_WRAPPERS.H: mdev->this_bdev->bd_inode->i_size = 
> > > (loff_t)size << 9;
> > > BLOCK/BLK-CORE.C:printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
> > >                 bdevname(bio->bi_bdev, b),
> > >                 bio->bi_rw,
> > >                 (unsigned long long)bio->bi_sector + bio_sectors(bio),
> > >                 (long long)(bio->bi_bdev->bd_inode->i_size >> 9));
> > > maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
> > > BLOCK/COMPAT_IOCTL.C: size = bdev->bd_inode->i_size;
> > > return compat_put_u64(arg, bdev->bd_inode->i_size);
> > > BLOCK/IOCTL.C: if (start + len > (bdev->bd_inode->i_size >> 9))
> > > size = bdev->bd_inode->i_size;
> > > return put_u64(arg, bdev->bd_inode->i_size);
> > > 
> > > The problem with this code is that if you read i_size without i_size_read 
> > > and the size wraps around 32 bits, for example from 0xffffffff to 
> > > 0x100000000 , there is a possibility on 32-bit machines to read an invalid 
> > > value (either 0 or 0x1ffffffff). Similarly, if you write i_size without 
> > > i_size_write, the readers can see intermediate invalid values.
> > > 
> > > 
> > > The original problem that caused this investigation is the question, how a 
> > > block device driver can change the size of its device. Normal method (used 
> > > in a few drivers, including dm), consists of
> > > mutex_lock(&inode->i_mutex);
> > > i_size_write(inode, new_size);
> > > mutex_unlock(&inode->i_mutex);
> > 
> > Don't you just do
> > 
> >   set_capacity(gendisk, sectors);
> >   revalidate(gendisk);
> > 
> > ??
> > 
> > NeilBrown
> 
> revalidate_disk() has still the problem that it changes i_size without 
> holding i_mutex and other kernel parts (for example generic_file_llseek) 
> assume that if we hold the lock, i_size_can't be changed.

generic_file_llseek is not used for block devices.  They use block_llseek
which uses i_size_read, so I think it is safe.

> 
> The rules for accessing i_size must be specified and followed.

I agree.  However the rules can be different for different file systems and
file types.
A filesystem that used the generic_* function would need to only change
i_size under i_mutex as you say.
For block devices it appears that the rule is that it can only be changed
under bd_mutex.
For 'relay' (which you mentioned above), it seem the relevant mutex is the
global relay_channels_mutex, though I didn't read the code thoroughly to be
sure.

It still would probably be useful to review all the i_size related code to
ensure that it is safe, but you should not assume that everything follows the
same rules.  So first you need to work out the rule for a given subsystem,
then audit it against that rule (and maybe document that rule if it isn't
already documented!)

NeilBrown


> 
> Mikulas


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: i_size misuses
  2010-10-19 22:33     ` Neil Brown
@ 2010-10-20  0:09       ` Mikulas Patocka
  0 siblings, 0 replies; 5+ messages in thread
From: Mikulas Patocka @ 2010-10-20  0:09 UTC (permalink / raw)
  To: Neil Brown
  Cc: linux-kernel, dm-devel, linux-fsdevel, jaxboe, Milan Broz,
	Alasdair G Kergon



On Wed, 20 Oct 2010, Neil Brown wrote:

> On Tue, 19 Oct 2010 14:55:31 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Sun, 12 Sep 2010, Neil Brown wrote:
> > 
> > > On Wed, 8 Sep 2010 09:32:13 -0400 (EDT)
> > > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > when reviewing some i_size problem, I searched the kernel for i_size usage 
> > > > (the variable should really be written with i_size_write and read with 
> > > > i_size_read).
> > > > 
> > > > Properly locked direct use of "i_size" inside memory management or 
> > > > filesystems may not be a problem, but there are many problems in general 
> > > > code outside mm.
> > > > 
> > > > The misuses are:
> > > > SOUND/SOUND_FIRMWARE.C:l = filp->f_path.dentry->d_inode->i_size;
> > > > KERNEL/RELAY.C:buf->dentry->d_inode->i_size = buf->early_bytes;
> > > > KERNEL/RELAY.C:buf->dentry->d_inode->i_size += buf->chan->subbuf_size 
> > > > -buf->padding[old_subbuf];
> > > > DRIVERS/USB/CORE/INODE.C:dev->usbfs_dentry->d_inode->i_size = i_size;
> > > > DRIVERS/MTD/DEVICES/BLOCK2MTD.C:dev->mtd.size = 
> > > > dev->blkdev->bd_inode->i_size & PAGE_MASK;
> > > > DRIVERS/MD/MD.C: many reads of i_size 
> > > > DRIVERS/BLOCK/NBD.C: many writes to i_size without i_size_write
> > > > DRIVERS/BLOCK/DRBD/DRBD_INT.H: return bdev ? bdev->bd_inode->i_size >> 9 : 0;
> > > > DRIVERS/BLOCK/DRBD/DRBD_WRAPPERS.H: mdev->this_bdev->bd_inode->i_size = 
> > > > (loff_t)size << 9;
> > > > BLOCK/BLK-CORE.C:printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
> > > >                 bdevname(bio->bi_bdev, b),
> > > >                 bio->bi_rw,
> > > >                 (unsigned long long)bio->bi_sector + bio_sectors(bio),
> > > >                 (long long)(bio->bi_bdev->bd_inode->i_size >> 9));
> > > > maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
> > > > BLOCK/COMPAT_IOCTL.C: size = bdev->bd_inode->i_size;
> > > > return compat_put_u64(arg, bdev->bd_inode->i_size);
> > > > BLOCK/IOCTL.C: if (start + len > (bdev->bd_inode->i_size >> 9))
> > > > size = bdev->bd_inode->i_size;
> > > > return put_u64(arg, bdev->bd_inode->i_size);
> > > > 
> > > > The problem with this code is that if you read i_size without i_size_read 
> > > > and the size wraps around 32 bits, for example from 0xffffffff to 
> > > > 0x100000000 , there is a possibility on 32-bit machines to read an invalid 
> > > > value (either 0 or 0x1ffffffff). Similarly, if you write i_size without 
> > > > i_size_write, the readers can see intermediate invalid values.
> > > > 
> > > > 
> > > > The original problem that caused this investigation is the question, how a 
> > > > block device driver can change the size of its device. Normal method (used 
> > > > in a few drivers, including dm), consists of
> > > > mutex_lock(&inode->i_mutex);
> > > > i_size_write(inode, new_size);
> > > > mutex_unlock(&inode->i_mutex);
> > > 
> > > Don't you just do
> > > 
> > >   set_capacity(gendisk, sectors);
> > >   revalidate(gendisk);
> > > 
> > > ??
> > > 
> > > NeilBrown
> > 
> > revalidate_disk() has still the problem that it changes i_size without 
> > holding i_mutex and other kernel parts (for example generic_file_llseek) 
> > assume that if we hold the lock, i_size_can't be changed.
> 
> generic_file_llseek is not used for block devices.  They use block_llseek
> which uses i_size_read, so I think it is safe.
> 
> > 
> > The rules for accessing i_size must be specified and followed.
> 
> I agree.  However the rules can be different for different file systems and
> file types.
> A filesystem that used the generic_* function would need to only change
> i_size under i_mutex as you say.
> For block devices it appears that the rule is that it can only be changed
> under bd_mutex.
> For 'relay' (which you mentioned above), it seem the relevant mutex is the
> global relay_channels_mutex, though I didn't read the code thoroughly to be
> sure.

You are right.

> It still would probably be useful to review all the i_size related code to
> ensure that it is safe, but you should not assume that everything follows the
> same rules.  So first you need to work out the rule for a given subsystem,
> then audit it against that rule (and maybe document that rule if it isn't
> already documented!)
> 
> NeilBrown

If you specify complex rules (i_size can be read directly under i_mutex 
for files, but can't be read for block devices under i_mutex), it will 
only complicate review and it may introduce bugs in the future --- 
remember that Linux doesn't have a formal specification and the 
specification is inferred from the code --- thus, you use 
i_size_read(inode) somewhere and inode->i_size elsewhere, people will 
infer a wrong specification --- as shown above on the few examples of 
i_size abusers.

I'd change i_size to __i_size (this will warn people not to touch it) and 
convert all accesses to i_size_read and i_size_write.

Mikulas

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-10-20  0:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-08 13:32 i_size misuses Mikulas Patocka
2010-09-11 22:05 ` Neil Brown
2010-10-19 18:55   ` Mikulas Patocka
2010-10-19 22:33     ` Neil Brown
2010-10-20  0:09       ` Mikulas Patocka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).