linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Help tracking down problem --- endless loop in __find_get_block_slow
@ 2005-02-21 10:46 Thomas S. Iversen
  2005-02-22  9:18 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas S. Iversen @ 2005-02-21 10:46 UTC (permalink / raw)
  To: device-mapper development, linux-kernel

Hi There

I am trying to develop a device-mapper plugin which does transparent 
block encryption and sector shuffling in the style FreeBSD does it (GBDE).

Reads are supported and working, but I have trouble getting writes to 
work properly.

If I do a simple:

echo "test" > /mnt/test (where /mnt is /dev/mapper/gbde)
sync

it works just fine. If I do

dd if=/dev/zero of=/mnt/testfile count=N, N=1..6 it works fine

But if I do

dd if=/dev/zero of=/mnt/testfile count=N, N>6

I get into an endless loop in __find_get_block_slow. My write path does 
something like this:

Receive the original BIO (e.g. size=4096). Split the BIO into sector-size chunks.
map chunks to physical sectors.
encrypt sectors
generate keys
write sectors
update keysectors
... when all sectors are written call bio_endio on the original request.

To avoid allocating a lot of pages during writes, I use the memory from 
the original request to do encryption "in place". Could that be the 
cause of my problems? I would clearly like to minimize the need for page 
allocation in my dm module, so I hope it isn't.

What's going on here? Any comments appreciated!

Regards Thomas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-21 10:46 Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
@ 2005-02-22  9:18 ` Andrew Morton
  2005-02-22 22:46   ` Jeff Mahoney
  2005-02-23 12:00   ` Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2005-02-22  9:18 UTC (permalink / raw)
  To: Thomas S. Iversen; +Cc: dm-devel, linux-kernel

"Thomas S. Iversen" <zensonic@zensonic.dk> wrote:
>
> But if I do
> 
>  dd if=/dev/zero of=/mnt/testfile count=N, N>6
> 
>  I get into an endless loop in __find_get_block_slow. 

The only way in which __find_get_block_slow() can loop is if something
wrecked the buffer_head ring at page->private: something caused an internal
loop via bh->b_this_page.

Are you sure that's where things are hanging?  That it's not stuck on a
spinlock?

A sysrq-P trace might help.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-22  9:18 ` Andrew Morton
@ 2005-02-22 22:46   ` Jeff Mahoney
  2005-02-22 23:28     ` Andrew Morton
  2005-02-23 12:00   ` Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Mahoney @ 2005-02-22 22:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Thomas S. Iversen, dm-devel, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Morton wrote:
> "Thomas S. Iversen" <zensonic@zensonic.dk> wrote:
> 
>>But if I do
>>
>> dd if=/dev/zero of=/mnt/testfile count=N, N>6
>>
>> I get into an endless loop in __find_get_block_slow. 
> 
> 
> The only way in which __find_get_block_slow() can loop is if something
> wrecked the buffer_head ring at page->private: something caused an internal
> loop via bh->b_this_page.
> 
> Are you sure that's where things are hanging?  That it's not stuck on a
> spinlock?
> 
> A sysrq-P trace might help.

I've observed similar effects without DM involved at all. I've been able
to reproduce using subfs (it brings out umount races nicely) on kernels
from 2.6.5 to 2.6.11-rc4; it's platform- and device-independent.

In my experience, the loop is actually outside of
__find_get_block_slow(), in __getblk_slow(). I've been using xmon to
interrupt the kernel, and the results vary but are all rooted in the
for(;;) loop in __getblk_slow. It appears as though grow_buffers is
finding/creating the page, but then __find_get_block can't locate the
buffer it needs.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFCG7ZeLPWxlyuTD7IRAixHAJsHORHEMfFtTIozqwUOkk9WGFxCggCgiSfn
V3kCyFn/X87Mw4laVsJLUp4=
=YNSw
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-22 22:46   ` Jeff Mahoney
@ 2005-02-22 23:28     ` Andrew Morton
  2005-02-26  0:03       ` Jeff Mahoney
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-02-22 23:28 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: zensonic, dm-devel, linux-kernel

Jeff Mahoney <jeffm@suse.com> wrote:
>
> In my experience, the loop is actually outside of
> __find_get_block_slow(), in __getblk_slow(). I've been using xmon to
> interrupt the kernel, and the results vary but are all rooted in the
> for(;;) loop in __getblk_slow. It appears as though grow_buffers is
> finding/creating the page, but then __find_get_block can't locate the
> buffer it needs.

Yes, that'll happen.  Because there are still buffers attached to the page
which have the wrong blocksize.  Say, if someone is trying to read a 2k
buffer_head which is backed by a page which already has 1k buffer_heads
attached to it.

Does your kernel not have that big printk in __find_get_block_slow()?  If
it does, maybe some of the buffers are unmapped.  Try:

--- 25/fs/buffer.c~a	Tue Feb 22 15:27:35 2005
+++ 25-akpm/fs/buffer.c	Tue Feb 22 15:27:41 2005
@@ -456,7 +456,7 @@ __find_get_block_slow(struct block_devic
 	 * file io on the block device and getblk.  It gets dealt with
 	 * elsewhere, don't buffer_error if we had some unmapped buffers
 	 */
-	if (all_mapped) {
+	{
 		printk("__find_get_block_slow() failed. "
 			"block=%llu, b_blocknr=%llu\n",
 			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
_



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-22  9:18 ` Andrew Morton
  2005-02-22 22:46   ` Jeff Mahoney
@ 2005-02-23 12:00   ` Thomas S. Iversen
  2005-02-23 12:10     ` Andrew Morton
  1 sibling, 1 reply; 14+ messages in thread
From: Thomas S. Iversen @ 2005-02-23 12:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dm-devel, linux-kernel

> >  dd if=/dev/zero of=/mnt/testfile count=N, N>6

It turns out the threshold N>6 varies, but make N large enough and it
will hang. That suggests some kind of race, I am afraid.

> >  I get into an endless loop in __find_get_block_slow. 
> 
> The only way in which __find_get_block_slow() can loop is if something
> wrecked the buffer_head ring at page->private: something caused an internal
> loop via bh->b_this_page.
> 
> Are you sure that's where things are hanging?  That it's not stuck on a
> spinlock?

I get the same message again and again (this trace is with
count=100 and a BUG() and panic inserted into buffer.c):

Feb 23 13:38:34 localhost kernel: __find_get_block_slow() failed.
block=101, b_blocknr=100
Feb 23 13:38:34 localhost kernel: b_state=0x00000010, b_size=2048
Feb 23 13:38:34 localhost kernel: device blocksize: 2048

Ad inf. The output of the BUG() is shown below. My questions remain:

1. Whose fault is this?
2. How do I fix/avoid this problem?

Regards Thomas


__find_get_block_slow() failed. block=101, b_blocknr=100
b_state=0x00000010, b_size=2048
device blocksize: 2048
------------[ cut here ]------------
DEBUG_PAGEALLOC
Modules linked in: ufs sha512 aes_i586 dm_bde md5 ipv6 af_packet evdev ehci_hcd usbcore agpgart dm_mod rtc unix
CPU:    0
EIP: 0060:[__find_get_block_slow+202/304]    Not tainted
EFLAGS: 00010286   (2.6.8) 
EIP is at __find_get_block_slow+0xca/0x130
eax: 00000017   ebx: dcd21208   ecx: c02c6950   edx: 000041e4
esi: c138d600   edi: 00000065   ebp: da63aed0   esp: da7e9c54
ds: 007b   es: 007b   ss: 0068
Process dd (pid: 1742, threadinfo=da7e8000 task=dc76da70)
Stack: c029a003 00000800 00000800 00000064 00000000 00000000 00000000 da63ae6c 
       00000065 00000800 c014ab5f da63ae6c 00000065 00000800 00000800 00000065 
       da63ae6c da764df8 c014abdb da63ae6c 00000065 00000800 00000000 00000000 
Call Trace:
 [__find_get_block+95/176] __find_get_block+0x5f/0xb0
 [__getblk+43/96] __getblk+0x2b/0x60
 [pg0+542048932/1070055424] ufs_new_fragments+0x1c4/0x740 [ufs]
 [pg0+542070875/1070055424] ufs_inode_getfrag+0x23b/0x3f0 [ufs]
 [pg0+542072873/1070055424] ufs_getfrag_block+0x349/0x3b0 [ufs]
 [create_empty_buffers+37/128] create_empty_buffers+0x25/0x80
 [__block_prepare_write+467/960] __block_prepare_write+0x1d3/0x3c0
 [kernel_map_pages+40/144] kernel_map_pages+0x28/0x90
 [inode_update_time+167/224] inode_update_time+0xa7/0xe0
 [block_prepare_write+52/80] block_prepare_write+0x34/0x50
 [pg0+542072032/1070055424] ufs_getfrag_block+0x0/0x3b0 [ufs]
 [generic_file_aio_write_nolock+821/2624] generic_file_aio_write_nolock+0x335/0xa40
 [pg0+542072032/1070055424] ufs_getfrag_block+0x0/0x3b0 [ufs]
 [do_no_page+99/688] do_no_page+0x63/0x2b0 
 [generic_file_write_nolock+116/144] generic_file_write_nolock+0x74/0x90
 [clear_user+51/80] clear_user+0x33/0x50
 [read_zero+452/528] read_zero+0x1c4/0x210
 [generic_file_write+85/112] generic_file_write+0x55/0x70
 [generic_file_write+0/112] generic_file_write+0x0/0x70
 [vfs_write+184/304] vfs_write+0xb8/0x130
 [copy_to_user+62/80] copy_to_user+0x3e/0x50
 [sys_write+81/128] sys_write+0x51/0x80
 [syscall_call+7/11] syscall_call+0x7/0xb
Code: 0f 0b 15 02 c2 9f 29 c0 8b 06 f6 c4 08 75 17 8b 46 04 40 74 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 12:00   ` Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
@ 2005-02-23 12:10     ` Andrew Morton
  2005-02-23 13:02       ` Thomas S. Iversen
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-02-23 12:10 UTC (permalink / raw)
  To: Thomas S. Iversen; +Cc: dm-devel, linux-kernel

zensonic@zensonic.dk (Thomas S. Iversen) wrote:
>
> > >  dd if=/dev/zero of=/mnt/testfile count=N, N>6
> 
> It turns out the threshold N>6 varies, but make N large enough and it
> will hang. That suggests some kind of race, I am afraid.
> 
> > >  I get into an endless loop in __find_get_block_slow. 
> > 
> > The only way in which __find_get_block_slow() can loop is if something
> > wrecked the buffer_head ring at page->private: something caused an internal
> > loop via bh->b_this_page.
> > 
> > Are you sure that's where things are hanging?  That it's not stuck on a
> > spinlock?
> 
> I get the same message again and again all over (this trace with
> count=100 and a BUG() and panic inserted into buffer.c)
> 
> Feb 23 13:38:34 localhost kernel: __find_get_block_slow() failed.
> block=101, b_blocknr=100
> Feb 23 13:38:34 localhost kernel: b_state=0x00000010, b_size=2048
> Feb 23 13:38:34 localhost kernel: device blocksize: 2048

OK, so we're looking for the buffer_head for block 101 and the first
buffer_head which is attached to the page represents block 100.  So the
next buffer_head _should_ represent block 101.  Please print it out:

--- 25/fs/buffer.c~a	2005-02-23 04:06:12.000000000 -0800
+++ 25-akpm/fs/buffer.c	2005-02-23 04:06:51.000000000 -0800
@@ -459,8 +459,10 @@ __find_get_block_slow(struct block_devic
 	 */
 	if (all_mapped) {
 		printk("__find_get_block_slow() failed. "
-			"block=%llu, b_blocknr=%llu\n",
-			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
+			"block=%llu, b_blocknr=%llu, next=%llu\n",
+			(unsigned long long)block,
+			(unsigned long long)bh->b_blocknr,
+			(unsigned long long)bh->b_this_page->b_blocknr);
 		printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
_


> Ad inf. The output of the BUG() is shown below. My questions remain:
> 
> 1. Whose fault is this?

Could be UFS.  But what does "transparent block encryption and sector
shuffling" mean?  How is the sector shuffling implemented?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 12:10     ` Andrew Morton
@ 2005-02-23 13:02       ` Thomas S. Iversen
  2005-02-23 20:09         ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas S. Iversen @ 2005-02-23 13:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dm-devel, linux-kernel

> OK, so we're looking for the buffer_head for block 101 and the first
> buffer_head which is attached to the page represents block 100.  So the
> next buffer_head _should_ represent block 101.  Please print it out:

Not quite the same, but similar:

Feb 23 14:50:24 localhost kernel: __find_get_block_slow() failed. block=102,
b_blocknr=128, next=129
Feb 23 14:50:24 localhost kernel: b_state=0x00000013, b_size=2048
Feb 23 14:50:24 localhost kernel: device blocksize: 2048
Feb 23 14:50:24 localhost kernel: ------------[ cut here ]------------

> Could be UFS.  But what does "transparent block encryption and sector
> shuffling" mean?  How is the sector shuffling implemented?

GBDE is a block-level encrypter. It encrypts the actual sectors
transparently via the GEOM API (corresponding to the device-mapper API in Linux).

GBDE assigns a key for each block, thereby introducing keysectors.
Furthermore the sectors are remapped so that one can not guess where e.g.
metadata is located on the physical disk. It is a rather simple remap:

Taken from GBDE:

/*
 * Convert an unencrypted side offset to offsets on the encrypted side.
 *
 * Security objective:  Make it harder to identify what sectors contain what
 * on a "cold" disk image.
 *
 * We do this by adding the "keyoffset" from the lock to the physical sector
 * number modulus the available number of sectors.  Since all physical sectors
 * presumably look the same cold, this will do.
 *
 * As part of the mapping we have to skip the lock sectors which we know
 * the physical address of.  We also truncate the work packet, respecting
 * zone boundaries and lock sectors, so that we end up with a sequence of
 * sectors which are physically contiguous.
 *
 * Shuffling things further is an option, but the incremental frustration is
 * not currently deemed worth the run-time performance hit resulting from the
 * increased number of disk arm movements it would incur.
 *
 * This function offers nothing but a trivial diversion for an attacker able
 * to do "the cleaning lady attack" in its current static mapping form.
 */
  
Further reading:

	http://phk.freebsd.dk/pubs/bsdcon-03.gbde.paper.pdf

Regards Thomas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 13:02       ` Thomas S. Iversen
@ 2005-02-23 20:09         ` Andrew Morton
  2005-02-23 22:24           ` Thomas S. Iversen
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-02-23 20:09 UTC (permalink / raw)
  To: Thomas S. Iversen; +Cc: dm-devel, linux-kernel

zensonic@zensonic.dk (Thomas S. Iversen) wrote:
>
> > OK, so we're looking for the buffer_head for block 101 and the first
> > buffer_head which is attached to the page represents block 100.  So the
> > next buffer_head _should_ represent block 101.  Please print it out:
> 
> Not quite the same, but similar:
> 
> Feb 23 14:50:24 localhost kernel: __find_get_block_slow() failed. block=102,
> b_blocknr=128, next=129
> Feb 23 14:50:24 localhost kernel: b_state=0x00000013, b_size=2048
> Feb 23 14:50:24 localhost kernel: device blocksize: 2048
> Feb 23 14:50:24 localhost kernel: ------------[ cut here ]------------

Something has caused the page at offset 51 (block 102) to have buffer_heads
for blocks 128 and 129 attached to it.

> > Could be UFS.  But what does "transparent block encryption and sector
> > shuffling" mean?  How is the sector shuffling implemented?
> 
> GBDE is a block-level encrypter. It encrypts the actual sectors
> transparently via the GEOM API (corresponding to the device-mapper API in Linux).
> 
> GBDE assigns a key for each block, thereby introducing keysectors.
> Furthermore the sectors are remapped so that one can not guess where e.g.
> metadata is located on the physical disk. It is a rather simple remap:

I'd be suspecting that the sector remapping is the cause of the problem. 
How is it implemented?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 20:09         ` Andrew Morton
@ 2005-02-23 22:24           ` Thomas S. Iversen
  2005-02-23 23:17             ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas S. Iversen @ 2005-02-23 22:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dm-devel, linux-kernel

> Something has caused the page at offset 51 (block 102) to have buffer_heads
> for blocks 128 and 129 attached to it.

Ok.

> I'd be suspecting that the sector remapping is the cause of the problem. 
> How is it implemented?

Quite simple actually. You're most welcome to see the code, but I have 
just done a test like the one below. Never mind the performance figures; 
correctness comes as the very first priority. It does not block, cause 
endless loops or do anything funny if I take the filesystem out of the 
question and access the raw device-mapped block device.

This should suggest that it is not a fault in my code, right?

zensonic@www:~$ ssh 192.168.1.9 -l root
Password:
Last login: Wed Feb 23 23:30:41 2005
maibritt:~# unload
maibritt:~# modprobe dm-bde
maibritt:~# alias dmc
alias dmc='/usr/src/device-mapper.1.00.19/dmsetup/dmsetup create dmbde'
maibritt:~# dmc
maibritt:~# dd if=/dev/urandom of=/dev/mapper/dmbde
dd: writing to '/dev/mapper/dmbde': No space left on device
46721+0 blocks in
46720+0 blocks out
23920640 bytes transferred in 13.004132 seconds (1839465 bytes/sec)
maibritt:~# hexdump /dev/mapper/dmbde | head
0000000 f348 b77e 96b8 62cc ea69 1ddf e232 1d74
0000010 b1fe 521c 4f4f 30cd 62a0 d697 0ce9 55de
0000020 2bae ee86 53b1 c04d 88ab cc63 5826 11d7
0000030 98c5 2ee3 d81e 1658 29f2 9196 8591 fe20
0000040 7c4c 2bf4 a088 906e 23ae d327 9def 09ea
0000050 b435 75c6 3b78 d45e 16aa 038c f592 2e03
0000060 52fe 5b05 03c1 95ff 927a 84ed 6aab f0f8
0000070 a870 0797 5a82 e574 964d c633 8365 42b8
0000080 8e6e 6e60 8544 792f 597f 5065 7766 fbbb
0000090 5eb7 e6f9 9632 477b a17a 91d3 121e 8692
maibritt:~# dd if=/dev/zero of=/dev/mapper/dmbde
dd: writing to '/dev/mapper/dmbde': No space left on device
46721+0 blocks in
46720+0 blocks out
23920640 bytes transferred in 6.335617 seconds (3775582 bytes/sec)
maibritt:~# hexdump /dev/mapper/dmbde
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
16d0000
maibritt:~#
maibritt:~#


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 22:24           ` Thomas S. Iversen
@ 2005-02-23 23:17             ` Andrew Morton
  2005-02-24  8:54               ` Thomas S. Iversen
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-02-23 23:17 UTC (permalink / raw)
  To: Thomas S. Iversen; +Cc: dm-devel, linux-kernel

"Thomas S. Iversen" <zensonic@zensonic.dk> wrote:
>
> > I'd be suspecting that the sector remapping is the cause of the problem. 
> > How is it implemented?
> 
> Quite simple actually. You're most welcome to see the code, but I have 
> just done a test like the one below. Never mind the performance figures; 
> correctness comes as the very first priority. It does not block, cause 
> endless loops or do anything funny if I take the filesystem out of the 
> question and access the raw device-mapped block device.
> 
> This should suggest that it is not a fault in my code, right?

I don't know.   Can you describe how your driver implements the remapping?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-23 23:17             ` Andrew Morton
@ 2005-02-24  8:54               ` Thomas S. Iversen
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas S. Iversen @ 2005-02-24  8:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dm-devel, linux-kernel

> I don't know.   Can you describe how your driver implements the remapping?

I have tested with ext2. I get double faults when trying to sync.
What are those? (By the look of the error message, I can see they are 
not something I want to have :-)

I'm at a loss as to how all of this can happen. All I do is allocate 
BIOs and kernel memory, process the BIOs and deal with deallocation 
afterwards.

Thomas



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-22 23:28     ` Andrew Morton
@ 2005-02-26  0:03       ` Jeff Mahoney
  2005-02-28 21:06         ` Jeff Mahoney
  2005-02-28 21:09         ` Help tracking down problem --- endless loop in __find_get_block_slow (now with the patch) Jeff Mahoney
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff Mahoney @ 2005-02-26  0:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: zensonic, dm-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2844 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Morton wrote:
> Jeff Mahoney <jeffm@suse.com> wrote:
> 
>>In my experience, the loop is actually outside of
>>__find_get_block_slow(), in __getblk_slow(). I've been using xmon to
>>interrupt the kernel, and the results vary but are all rooted in the
>>for(;;) loop in __getblk_slow. It appears as though grow_buffers is
>>finding/creating the page, but then __find_get_block can't locate the
>>buffer it needs.
> 
> 
> Yes, that'll happen.  Because there are still buffers attached to the page
> which have the wrong blocksize.  Say, if someone is trying to read a 2k
> buffer_head which is backed by a page which already has 1k buffer_heads
> attached to it.
> 
> Does your kernel not have that big printk in __find_get_block_slow()?  If
> it does, maybe some of the buffers are unmapped.  Try:

I think it's likely I'm experiencing a different bug than the original
poster. I've tried making the printk unconditional, and I get no output.
However, I've continued to track it down, and I believe I've found a
umount race. I can also reproduce it without subfs, with the attached
script.

I added some debug output to aid in my search:
__find_get_block_slow: find_get_page
[block=17508,blksize=2048,index=8754,sizebits=1,size=512] returned null
returning page [index=2188,block=17504,size=512,sizebits=3]
Couldn't find buffer @ block 17508

What I'm observing is that __find_get_block_slow is calculating the
index using the blocksize for the device, and the grow_buffers call is
using the blocksize handed down from the filesystem via sb_bread(). They
*should* be the same, but here's where my suspected race comes in. Since
the buffers are being searched for in the wrong place, they're never
found, causing the infinite loop.

The open_bdev_excl() call in get_sb_bdev() should be keeping callers out
until the block device is actually closed, but it uses the fs_type
struct as the holder which, given that the filesystem to be mounted is
the same one as the one being umounted, will be the same. This allows
the mount attempt to continue. If the superblock for the umounting
filesystem is already in the process of getting shut down, sget() will
create a new superblock and the mount attempt will use that one. The
umount will continue, destroying the old superblock and setting the
blocksize back to its original value, dropping all buffers in the process.

If kill_block_super resets the blocksize while an sb_bread is in
progress, the sizes won't match up and we'll get stuck in the loop.

I'll be working on a fix, but figured I'd send out a quick update.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFCH7zXLPWxlyuTD7IRAr/WAJ9B6MLsKl6cv48Qlcklx1saYERv7ACdHWGW
UBXAsQBiEAge3T1R4akLKd0=
=w1zP
-----END PGP SIGNATURE-----

[-- Attachment #2: test.sh --]
[-- Type: application/x-sh, Size: 290 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow
  2005-02-26  0:03       ` Jeff Mahoney
@ 2005-02-28 21:06         ` Jeff Mahoney
  2005-02-28 21:09         ` Help tracking down problem --- endless loop in __find_get_block_slow (now with the patch) Jeff Mahoney
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff Mahoney @ 2005-02-28 21:06 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Andrew Morton, zensonic, dm-devel, linux-kernel, Al Viro

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Mahoney wrote:
> Andrew Morton wrote:
> 
>>>Jeff Mahoney <jeffm@suse.com> wrote:
>>>
>>>
>>>>In my experience, the loop is actually outside of
>>>>__find_get_block_slow(), in __getblk_slow(). I've been using xmon to
>>>>interrupt the kernel, and the results vary but are all rooted in the
>>>>for(;;) loop in __getblk_slow. It appears as though grow_buffers is
>>>>finding/creating the page, but then __find_get_block can't locate the
>>>>buffer it needs.
>>>
>>>
>>>Yes, that'll happen.  Because there are still buffers attached to the page
>>>which have the wrong blocksize.  Say, if someone is trying to read a 2k
>>>buffer_head which is backed by a page which already has 1k buffer_heads
>>>attached to it.
>>>
>>>Does your kernel not have that big printk in __find_get_block_slow()?  If
>>>it does, maybe some of the buffers are unmapped.  Try:
> 
> 
> I think it's likely I'm experiencing a different bug than the original
> poster. I've tried making the printk unconditional, and I get no output.
> However, I've continued to track it down, and I believe I've found a
> umount race. I can also reproduce it without subfs, with the attached
> script.
> 
> I added some debug output to aid in my search:
> __find_get_block_slow: find_get_page
> [block=17508,blksize=2048,index=8754,sizebits=1,size=512] returned null
> returning page [index=2188,block=17504,size=512,sizebits=3]
> Couldn't find buffer @ block 17508
> 
> What I'm observing is that __find_get_block_slow is calculating the
> index using the blocksize for the device, and the grow_buffers call is
> using the blocksize handed down from the filesystem via sb_bread(). They
> *should* be the same, but here's where my suspected race comes in. Since
> the buffers are being searched for in the wrong place, they're never
> found, causing the infinite loop.
> 
> The open_bdev_excl() call in get_sb_bdev() should be keeping callers out
> until the block device is actually closed, but it uses the fs_type
> struct as the holder which, given that the filesystem to be mounted is
> the same one as the one being umounted, will be the same. This allows
> the mount attempt to continue. If the superblock for the umounting
> filesystem is already in the process of getting shut down, sget() will
> create a new superblock and the mount attempt will use that one. The
> umount will continue, destroying the old superblock and setting the
> blocksize back to its original value, dropping all buffers in the process.
> 
> If kill_block_super resets the blocksize while an sb_bread is in
> progress, the sizes won't match up and we'll get stuck in the loop.
> 
> I'll be working on a fix, but figured I'd send out a quick update.

To prove the race, I made bd_claim return -EBUSY even if the passed
holder == current holder. This eliminated the race, but made a
potentially unacceptable change to how block devices are handled. [1]

Attached is a patch that isn't pretty but does still allow the
multiple-open semantics of open_bdev_excl() while eliminating the race.
Basically, it creates a call path that allows the device's block size to
be changed only if the caller holds the sole reference to the block
device. This has passed my admittedly small test cases without returning
invalid results.

Only the close case is protected by bdev_lock. This may seem racy, but I
don't believe it to be. The only way a caller could have the opportunity
to reset the block size while the block device is opened is if the
holder value is the same, which means that it must be the same
filesystem type. The same filesystem type opening the same device will
result in finding the same on-disk superblock that the umounting
filesystem was using, so it would be the same block size. set_blocksize
will exit early if the block size is the same, thus avoiding the code
that could cause issues.

I'd appreciate any feedback.

- -Jeff

[1] Currently, mtd is the only non-filesystem caller of open_bdev_excl.
While it's possible for mtd to call open_bdev_excl twice on the same
device, it would require passing the same device more than once as a
module parameter.
- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFCI4fHLPWxlyuTD7IRAu5LAKCPaSFb+BCpHRw44YU5j2IE52X5AgCfVHby
4mz96aq70X9VOIccEWvnCko=
=59ut
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Help tracking down problem --- endless loop in __find_get_block_slow (now with the patch)
  2005-02-26  0:03       ` Jeff Mahoney
  2005-02-28 21:06         ` Jeff Mahoney
@ 2005-02-28 21:09         ` Jeff Mahoney
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff Mahoney @ 2005-02-28 21:09 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Andrew Morton, zensonic, dm-devel, linux-kernel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 4407 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Mahoney wrote:
> Andrew Morton wrote:
>
>>>Jeff Mahoney <jeffm@suse.com> wrote:
>>>
>>>
>>>>In my experience, the loop is actually outside of
>>>>__find_get_block_slow(), in __getblk_slow(). I've been using xmon to
>>>>interrupt the kernel, and the results vary but are all rooted in the
>>>>for(;;) loop in __getblk_slow. It appears as though grow_buffers is
>>>>finding/creating the page, but then __find_get_block can't locate the
>>>>buffer it needs.
>>>
>>>
>>>Yes, that'll happen.  Because there are still buffers attached to the page
>>>which have the wrong blocksize.  Say, if someone is trying to read a 2k
>>>buffer_head which is backed by a page which already has 1k buffer_heads
>>>attached to it.
>>>
>>>Does your kernel not have that big printk in __find_get_block_slow()?  If
>>>it does, maybe some of the buffers are unmapped.  Try:
>
>
> I think it's likely I'm experiencing a different bug than the original
> poster. I've tried making the printk unconditional, and I get no output.
> However, I've continued to track it down, and I believe I've found a
> umount race. I can also reproduce it without subfs, with the attached
> script.
>
> I added some debug output to aid in my search:
> __find_get_block_slow: find_get_page
> [block=17508,blksize=2048,index=8754,sizebits=1,size=512] returned null
> returning page [index=2188,block=17504,size=512,sizebits=3]
> Couldn't find buffer @ block 17508
>
> What I'm observing is that __find_get_block_slow is calculating the
> index using the blocksize for the device, and the grow_buffers call is
> using the blocksize handed down from the filesystem via sb_bread(). They
> *should* be the same, but here's where my suspected race comes in. Since
> the buffers are being searched for in the wrong place, they're never
> found, causing the infinite loop.
>
> The open_bdev_excl() call in get_sb_bdev() should keep callers out
> until the block device is actually closed, but it uses the fs_type
> struct as the holder, and since the filesystem being mounted is the
> same one being umounted, the holders match. This allows the mount
> attempt to continue. If the superblock of the umounting filesystem is
> already being shut down, sget() will create a new superblock and the
> mount attempt will use that one. Meanwhile the umount continues,
> destroying the old superblock and setting the blocksize back to its
> original value, dropping all buffers in the process.
>
> If kill_block_super resets the blocksize while an sb_bread is in
> progress, the sizes won't match up and we'll get stuck in the loop.
>
> I'll be working on a fix, but figured I'd send out a quick update.

Resent: Forgot to attach the patch. :-\

To prove the race, I made bd_claim return -EBUSY even if the passed
holder == current holder. This eliminated the race, but made a
potentially unacceptable change to how block devices are handled. [1]

Attached is a patch that isn't pretty but does still allow the
multiple-open semantics of open_bdev_excl() while eliminating the race.
Basically, it creates a call path that allows the device's block size to
be changed only if the caller holds the sole reference to the block
device. This has passed my admittedly small test cases without returning
invalid results.

Only the close case is protected by bdev_lock. This may seem racy, but I
don't believe it is. The only way a caller could get the opportunity to
reset the block size while the block device is open is if the holder
value is the same, which means it must be the same filesystem type. The
same filesystem type opening the same device will find the same on-disk
superblock that the umounting filesystem was using, and therefore the
same block size. set_blocksize exits early when the block size is
unchanged, avoiding the buffer invalidation that triggers the loop.

I'd appreciate any feedback.

-Jeff

[1] Currently, mtd is the only non-filesystem caller of open_bdev_excl.
While it's possible for mtd to call open_bdev_excl twice on the same
device, it would require passing the same device more than once as a
module parameter.
--
Jeff Mahoney
SuSE Labs

[-- Attachment #2: set_blocksize_race.diff --]
[-- Type: text/x-patch, Size: 4591 bytes --]

diff -ruNpX dontdiff linux-2.6.11-rc4.orig/fs/block_dev.c linux-2.6.11-rc4/fs/block_dev.c
--- linux-2.6.11-rc4.orig/fs/block_dev.c	2005-02-28 14:06:59.000000000 -0500
+++ linux-2.6.11-rc4/fs/block_dev.c	2005-02-28 14:49:52.000000000 -0500
@@ -62,7 +62,7 @@ static void kill_bdev(struct block_devic
 	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
 }	
 
-int set_blocksize(struct block_device *bdev, int size)
+int __set_blocksize(struct block_device *bdev, int size, int sync)
 {
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || (size & (size-1)))
@@ -74,7 +74,8 @@ int set_blocksize(struct block_device *b
 
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
-		sync_blockdev(bdev);
+		if (sync)
+			sync_blockdev(bdev);
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
@@ -82,7 +83,7 @@ int set_blocksize(struct block_device *b
 	return 0;
 }
 
-EXPORT_SYMBOL(set_blocksize);
+EXPORT_SYMBOL(__set_blocksize);
 
 int sb_set_blocksize(struct super_block *sb, int size)
 {
@@ -480,17 +481,19 @@ int bd_claim(struct block_device *bdev, 
 
 EXPORT_SYMBOL(bd_claim);
 
-void bd_release(struct block_device *bdev)
+void __bd_release(struct block_device *bdev, int size)
 {
 	spin_lock(&bdev_lock);
 	if (!--bdev->bd_contains->bd_holders)
 		bdev->bd_contains->bd_holder = NULL;
-	if (!--bdev->bd_holders)
+	if (!--bdev->bd_holders) {
 		bdev->bd_holder = NULL;
+		set_blocksize_nosync(bdev, size);
+	}
 	spin_unlock(&bdev_lock);
 }
 
-EXPORT_SYMBOL(bd_release);
+EXPORT_SYMBOL(__bd_release);
 
 /*
  * Tries to open block device by device number.  Use it ONLY if you
@@ -914,10 +917,10 @@ EXPORT_SYMBOL(open_bdev_excl);
  *
  * This is the counterpart to open_bdev_excl().
  */
-void close_bdev_excl(struct block_device *bdev)
+void __close_bdev_excl(struct block_device *bdev, int size)
 {
-	bd_release(bdev);
+	__bd_release(bdev, size);
 	blkdev_put(bdev);
 }
 
-EXPORT_SYMBOL(close_bdev_excl);
+EXPORT_SYMBOL(__close_bdev_excl);
diff -ruNpX dontdiff linux-2.6.11-rc4.orig/fs/super.c linux-2.6.11-rc4/fs/super.c
--- linux-2.6.11-rc4.orig/fs/super.c	2005-02-28 14:07:01.000000000 -0500
+++ linux-2.6.11-rc4/fs/super.c	2005-02-28 14:42:49.000000000 -0500
@@ -732,8 +732,7 @@ void kill_block_super(struct super_block
 
 	bdev_uevent(bdev, KOBJ_UMOUNT);
 	generic_shutdown_super(sb);
-	set_blocksize(bdev, sb->s_old_blocksize);
-	close_bdev_excl(bdev);
+	__close_bdev_excl(bdev, sb->s_old_blocksize);
 }
 
 EXPORT_SYMBOL(kill_block_super);
diff -ruNpX dontdiff linux-2.6.11-rc4.orig/include/linux/fs.h linux-2.6.11-rc4/include/linux/fs.h
--- linux-2.6.11-rc4.orig/include/linux/fs.h	2005-02-28 14:07:42.000000000 -0500
+++ linux-2.6.11-rc4/include/linux/fs.h	2005-02-28 14:50:53.000000000 -0500
@@ -1294,7 +1294,10 @@ extern long compat_blkdev_ioctl(struct f
 extern int blkdev_get(struct block_device *, mode_t, unsigned);
 extern int blkdev_put(struct block_device *);
 extern int bd_claim(struct block_device *, void *);
-extern void bd_release(struct block_device *);
+extern void __bd_release(struct block_device *, int);
+static inline void bd_release(struct block_device *bdev) {
+	__bd_release(bdev, bdev->bd_block_size);
+}
 
 /* fs/char_dev.c */
 extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, const char *);
@@ -1311,7 +1314,10 @@ extern const char *__bdevname(dev_t, cha
 extern const char *bdevname(struct block_device *bdev, char *buffer);
 extern struct block_device *lookup_bdev(const char *);
 extern struct block_device *open_bdev_excl(const char *, int, void *);
-extern void close_bdev_excl(struct block_device *);
+extern void __close_bdev_excl(struct block_device *, int);
+static inline void close_bdev_excl(struct block_device *bdev) {
+	__close_bdev_excl(bdev, bdev->bd_block_size);
+}
 
 extern void init_special_inode(struct inode *, umode_t, dev_t);
 
@@ -1447,7 +1453,14 @@ extern void file_kill(struct file *f);
 struct bio;
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
-extern int set_blocksize(struct block_device *, int);
+extern int __set_blocksize(struct block_device *, int, int);
+static inline int set_blocksize(struct block_device *bdev, int size) {
+	return __set_blocksize(bdev, size, 1);
+}
+static inline int set_blocksize_nosync(struct block_device *bdev, int size) {
+	return __set_blocksize(bdev, size, 0);
+}
+
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
 

end of thread, other threads:[~2005-02-28 21:13 UTC | newest]

Thread overview: 14+ messages
2005-02-21 10:46 Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
2005-02-22  9:18 ` Andrew Morton
2005-02-22 22:46   ` Jeff Mahoney
2005-02-22 23:28     ` Andrew Morton
2005-02-26  0:03       ` Jeff Mahoney
2005-02-28 21:06         ` Jeff Mahoney
2005-02-28 21:09         ` Help tracking down problem --- endless loop in __find_get_block_slow (now with the patch) Jeff Mahoney
2005-02-23 12:00   ` Help tracking down problem --- endless loop in __find_get_block_slow Thomas S. Iversen
2005-02-23 12:10     ` Andrew Morton
2005-02-23 13:02       ` Thomas S. Iversen
2005-02-23 20:09         ` Andrew Morton
2005-02-23 22:24           ` Thomas S. Iversen
2005-02-23 23:17             ` Andrew Morton
2005-02-24  8:54               ` Thomas S. Iversen
