linux-kernel.vger.kernel.org archive mirror
* Possible ext2/3/4 filesystem iov_length integer overflow and strange behavior on large writes
From: halfdog @ 2011-06-17 16:17 UTC
  To: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

If I understand it correctly, there might be multiple iov_length
integer overflows on 32-bit architectures in ext2, ext3, ext4, e.g.

fs/ext4/file.c:

static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
                unsigned long nr_segs, loff_t pos)
{
...
        /*
         * If we have encountered a bitmap-format file, the size limit
         * is smaller than s_maxbytes, which is for extent-mapped files.
         */
        if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
                struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
                size_t length = iov_length(iov, nr_segs);
                /* << length might be any value with more than 4GB of data */

                if ((pos > sbi->s_bitmap_maxbytes ||
                    (pos == sbi->s_bitmap_maxbytes && length > 0)))
                        return -EFBIG;

                if (pos + length > sbi->s_bitmap_maxbytes) {
                        nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
                                              sbi->s_bitmap_maxbytes - pos);
                }
...


Can someone confirm or refute that? I wrote a small test program, but
failed to inflict damage on the kernel or filesystem, so I might have
missed something. Grepping the source suggests that other filesystems
might have the same problem.
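
The suspected wraparound can be illustrated in user space. The following is
a minimal sketch, not kernel code: struct iovec32, iov_length32() and
iov_length64() are hypothetical names, and the 32-bit length field stands in
for size_t on a 32-bit architecture. It assumes that a test run with 257
segments means 256 buffers of 16777216 bytes followed by one 10-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iovec with a 32-bit length field,
 * so the example behaves the same on 32-bit and 64-bit hosts. */
struct iovec32 {
    const void *base;
    uint32_t    len;
};

/* Sums the segment lengths in 32-bit arithmetic, the way an unchecked
 * sum would behave where size_t is 32 bits wide. */
static uint32_t iov_length32(const struct iovec32 *iov, size_t nr_segs)
{
    uint32_t total = 0;

    assert(iov != NULL || nr_segs == 0);
    for (size_t i = 0; i < nr_segs; i++)
        total += iov[i].len;    /* unsigned wraparound modulo 2^32 */
    return total;
}

/* Same sum in 64-bit arithmetic, for comparison. */
static uint64_t iov_length64(const struct iovec32 *iov, size_t nr_segs)
{
    uint64_t total = 0;

    assert(iov != NULL || nr_segs == 0);
    for (size_t i = 0; i < nr_segs; i++)
        total += iov[i].len;
    return total;
}
```

Under those assumptions the true total is 4GB + 10 bytes, while the 32-bit
sum comes out as 10, so a check such as "pos + length > limit" would see a
much smaller length than the data actually described.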


Apart from that, large iov writes seem to be uninterruptible. Sending a
kill signal to the process in writev terminates it only after the
syscall finishes.

./LargeWritevTest --File x --IovecNum 257 --BufferSize 16777216 --LastSize 10
pkill -KILL LargeWritevTest

[24306.588390] INFO: task LargeWritevTest:1390 blocked for more than 120 seconds.
[24306.589984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24306.590512] WritevTest      D 00000086     0  1390   1380 0x00000004
[24306.590571]  c8a91db0 00000082 c1040b73 00000086 00000000 c86a1940 c86a1bcc c183a8c0
[24309.657798]  8dcb7199 000014fc c86a1bc8 c183a8c0 c183a8c0 cac068c0 c86a1940 c87e0ca0
[24309.657871]  cac03640 c8605ae8 000581ca 00000380 00000000 00000001 c8a91d90 c103351c
[24309.657908] Call Trace:
[24309.658226]  [<c1040b73>] ? entity_tick+0x73/0x130
[24309.658284]  [<c103351c>] ? kmap_atomic_prot+0x4c/0x100
[24309.658331]  [<c10e7dc0>] ? prep_new_page+0x110/0x1a0
[24309.658439]  [<c15087e6>] __mutex_lock_slowpath+0xd6/0x140
[24309.658526]  [<c1508355>] mutex_lock+0x25/0x40
[24309.658547]  [<c10e3c1b>] generic_file_aio_write+0x4b/0xd0
[24309.658587]  [<c11a9a84>] ext4_file_write+0x54/0x2a0
[24309.658608]  [<c10e8809>] ? __alloc_pages_nodemask+0xf9/0x710
[24309.658627]  [<c10e8809>] ? __alloc_pages_nodemask+0xf9/0x710
[24309.658805]  [<c11a9a30>] ? ext4_file_write+0x0/0x2a0
[24309.660607]  [<c1127676>] do_sync_readv_writev+0xa6/0xe0


Since writev would allow 1024 segments of 1GB each, one might be able to
consume 1TB (all) disk space on a machine, and the process cannot be
stopped. On 32-bit architectures, the write stops after 2GB, but I'm not
sure why. Would terabyte writes be possible on 64-bit systems?

On 32-bit, forking and calling write on different files has to be used
instead. Since the processes cannot be terminated, a reboot does not
unmount cleanly, which might increase the likelihood of disk corruption.

For testing I used
http://www.halfdog.net/Security/2011/ExtFilesystemIovecHandling/LargeWritevTest.c
on an ext4 filesystem, but failed to understand the various outcomes.
Especially incomprehensible was the oscillation between disk-full and
disk-free when writing with O_DIRECT to a disk without enough free
space. The behavior also changed unexpectedly when aligning the memory
buffers to page size or ext block size, or when doing unaligned IO.


7G free:
./LargeWritevTest --File x --IovecNum 256 --BufferSize 16777216
./LargeWritevTest --File x --IovecNum 257 --BufferSize 16777216 --LastSize 10tou
./LargeWritevTest --File y --IovecNum 512 --BufferSize 16777216 --LastSize 16777215
Write result 2147479552 (is 2^31-4096)

./LargeWritevTest --File x --IovecNum 257 --BufferSize 16777216 --LastSize 10 --Align 65536
Write result 16740352 (fast)

3.9G free:
./LargeWritevTest --File x --IovecNum 257 --BufferSize 16777216 --LastSize 10 --Align 65536 --Direct
./LargeWritevTest --File x --IovecNum 256 --BufferSize 16777216 --Align 65536 --Direct
Write result -14 (immediate)

./LargeWritevTest --File x --IovecNum 257 --BufferSize 16777216 --LastSize 10 --Direct
./LargeWritevTest --File x --IovecNum 256 --BufferSize 16777216 --Direct
Write result -22 (immediate)

Less than 2GB:
./LargeWritevTest --File z --IovecNum 257 --BufferSize 16777216 --LastSize 10 --Align 4096 --Direct
Oscillates between disk empty/full?


- -- 
http://www.halfdog.net/
PGP: 156A AE98 B91F 0114 FE88  2BD8 C459 9386 feed a bee
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFN+34jxFmThv7tq+4RAh5gAJ45kycXTOk4zD9R+J9jkEXQbeoJvACeI3oT
KmEeBGVbF4ZDh3zaUN88mfg=
=WFDh
-----END PGP SIGNATURE-----


* Re: Possible ext2/3/4 filesystem iov_length integer overflow and strange behavior on large writes
From: Ted Ts'o @ 2011-07-16 21:16 UTC
  To: halfdog; +Cc: linux-kernel

On Fri, Jun 17, 2011 at 04:17:32PM +0000, halfdog wrote:
> 
> If I understand it correctly, there might be multiple iov_length
> integer overflows on 32-bit architectures in ext2, ext3, ext4, e.g.

> Can someone confirm or refute that? I wrote a small test program, but
> failed to inflict damage on the kernel or filesystem, so I might have
> missed something. Grepping the source suggests that other filesystems
> might have the same problem.

The iovec is checked in the VFS layer.  See the function
rw_copy_check_uvector() in fs/read_write.c.
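
A user-space sketch of that kind of check follows. It is not the kernel
source: check_and_clamp_iov(), SKETCH_PAGE_SIZE and SKETCH_MAX_RW_COUNT are
hypothetical names, and the sketch assumes 4096-byte pages, a rejection of
segment lengths that would be negative as ssize_t, and a clamp of the running
total to a page-aligned limit just under 2GB. That limit is consistent with
the "2^31-4096" write result reported earlier in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumed page size and per-call byte limit (hypothetical names). */
#define SKETCH_PAGE_SIZE    4096L
#define SKETCH_MAX_RW_COUNT ((ssize_t)(INT_MAX & ~(SKETCH_PAGE_SIZE - 1)))

/* Validates the segment lengths and shortens segments in place so that
 * the total never exceeds SKETCH_MAX_RW_COUNT.
 * Returns the clamped total, or -EINVAL for an oversized segment. */
static ssize_t check_and_clamp_iov(struct iovec *iov, size_t nr_segs)
{
    ssize_t total = 0;

    assert(iov != NULL || nr_segs == 0);
    for (size_t i = 0; i < nr_segs; i++) {
        ssize_t len;

        /* A length above SIZE_MAX / 2 would be negative as ssize_t. */
        if (iov[i].iov_len > SIZE_MAX / 2)
            return -EINVAL;
        len = (ssize_t)iov[i].iov_len;

        /* Shorten this segment if it would push the total past the limit. */
        if (len > SKETCH_MAX_RW_COUNT - total) {
            len = SKETCH_MAX_RW_COUNT - total;
            iov[i].iov_len = (size_t)len;
        }
        total += len;
    }
    return total;
}
```

With 257 segments of 16777216 bytes (the last one 10 bytes), this sketch
returns 2147479552 and leaves every segment after the clamped one with
length zero, so a later sum over the array can no longer wrap.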

> Apart from that, large iov writes seem to be uninterruptible. Sending a
> kill signal to the process in writev terminates it only after the
> syscall finishes.

That's partially historical.  There are programs out there which
assume that reads and writes to files on disk can't get interrupted in
medias res.  (Worse yet are the programs which make this assumption on
network connections, but that's another story.)  Programs should check
the return value and, on a partial read or write, retry the read/write.
Many don't.  Writes are fast enough most of the time that it's not
worth it to make them interruptible.
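
The retry described above can be sketched as a small wrapper around
write(2). This is a minimal illustration; write_all() is a hypothetical
helper name, and treating a zero-byte result as an error is a defensive
assumption of the sketch rather than something the thread specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes exactly count bytes unless an error occurs, retrying after
 * short writes and after interruption by a signal (EINTR).
 * Returns count on success, -1 with errno set on failure. */
static ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t left = count;

    assert(buf != NULL || count == 0);
    while (left > 0) {
        ssize_t n = write(fd, p, left);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        if (n == 0) {           /* no progress: give up instead of spinning */
            errno = EIO;
            return -1;
        }
        p += n;                 /* partial write: continue after it */
        left -= (size_t)n;
    }
    return (ssize_t)count;
}
```

A caller would use write_all(fd, data, len) in place of a bare write() and
only needs to distinguish the full count from -1.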

Your question about what happens if someone tries to perform a
Denial of Service attack by sending a writev of 1 TB is an interesting
one.  I'm currently not in a place where I can do experiments about
this, but I did want to acknowledge your concern.  It may be that the
right thing to do is to allow a SIGKILL to interrupt a disk write.

Apologies for not responding earlier; this managed to slip through my
inbox and I only saw it now.

                                        - Ted


* Re: Possible ext2/3/4 filesystem iov_length integer overflow and strange behavior on large writes
From: Linus Torvalds @ 2011-07-17  4:39 UTC
  To: Ted Ts'o, halfdog, linux-kernel

On Sat, Jul 16, 2011 at 2:16 PM, Ted Ts'o <tytso@mit.edu> wrote:
>
> Your question about what happens if someone tries to perform a
> Denial of Service attack by sending a writev of 1 TB is an interesting
> one.

We *should* be pretty good about this at the VFS layer at least, since
we tend to use things like "lock_page_killable()" and friends. IOW,
the thing may not be "interruptible" in the sense that we don't allow
signals (for all the historical reasons: user programs *should* be
able to handle interrupted file accesses since they do happen with NFS
if you mount things with "-o intr" etc, but nobody has really dared
say that it can happen in general). But we generally allow a signal
that kills a process to interrupt a read or write.

There may be cases where we don't do the "_killable()" version,
though. And filesystems that do their own routines (eg writes on
filesystems with logs etc) rather than use the VFS helper ones would
be more likely to have that issue.

Of course, we already say "screw posix" if somebody tries to write
really big chunks, and the VFS layer will say "I will truncate writes
bigger than 2GB, and better handle partial writes if you do big
chunks". That's partly due to security issues with drivers/filesystems
with overflow issues, and partly just because anything else would be
crazy. We could easily also say that "big writes are always
interruptible", for some random definition of "big" (ie
multi-megabyte). Exactly because of the DoS issues.
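
For user programs, handling the truncated or partial result of a big
writev() means resubmitting whatever was not written. The following is a
minimal sketch under that reading; writev_all() is a hypothetical helper
name, it modifies the caller's iovec array as it advances, and the
zero-progress check is a defensive assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Writes every byte described by iov[0..nr_segs), resubmitting the
 * remainder after each short writev(). Modifies the iov array.
 * Returns 0 on success, -1 with errno set on failure. */
static int writev_all(int fd, struct iovec *iov, int nr_segs)
{
    assert(iov != NULL || nr_segs == 0);
    while (nr_segs > 0) {
        ssize_t n = writev(fd, iov, nr_segs);
        size_t done;

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        done = (size_t)n;

        /* Skip the segments that were written completely. */
        while (nr_segs > 0 && done >= iov->iov_len) {
            done -= iov->iov_len;
            iov++;
            nr_segs--;
        }
        if (nr_segs > 0) {
            if (n == 0) {       /* data left but no progress: give up */
                errno = EIO;
                return -1;
            }
            /* Advance into the partially written segment. */
            iov->iov_base = (char *)iov->iov_base + done;
            iov->iov_len -= done;
        }
    }
    return 0;
}
```

A request larger than the per-call limit then simply takes several
writev() calls instead of silently losing its tail.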

                         Linus
