linux-block.vger.kernel.org archive mirror
* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
       [not found] <fb63a4d0-d124-21c8-7395-90b34b57c85a@linux.ee>
@ 2019-02-10 20:27 ` Meelis Roos
  2019-02-15 16:59   ` Meelis Roos
  0 siblings, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-10 20:27 UTC (permalink / raw)
  To: linux-alpha, LKML, linux-block

02.01.19 17:52 I wrote:

> I have noticed ext4 filesystem corruption on two of my test alphas with 4.20.0-09062-gd8372ba8ce28.

Retried it; it still happens with 5.0.0-rc5-00358-gdf3865f8f568: rsync during emerge --sync just fails, with nothing in dmesg.
  
> On AlphaServer DS10:
> [10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: inode #1853093: block 1: comm rsync: invalid block
> 
> On AlphaServer DS10L:
> [ 5325.064656] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> [ 5325.069539] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> [ 5325.077351] EXT4-fs error (device sda2): ext4_empty_dir:2718: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> 
> Two other alphas, PC-164 and Eiger, worked fine with the same kernel version (different kernel configs according to hardware).
> 
> The details:
> 4.20 worked fine, with gentoo emerge package update after bootup.
> Next, 4.20.0-06428-g00c569b567c7 worked fine, with gentoo emerge after bootup.
> Next, 4.20.0-09062-gd8372ba8ce28 booted up fine but rsync and rm during start of gentoo emerge errored out like above.
> 
> So the corruption _might_ have happened during bootup of previous kernel but it looks more likely that only the latest kernel with blk-mq introduced the problems. mq-deadline is in use on all the alphas.
> 
> DS10 has Symbios 53C896 SCSI (sym2 driver), DS10L has QLogic ISP1040, so they are different. Working Eiger and PC164 have sym2 based scsi controllers too.
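
For readers unfamiliar with the "directory entry overrun" message quoted above: the numbers in the DS10L log already show why the entry is rejected. The following is a minimal sketch, not ext4 code. It assumes the check is roughly "offset + rec_len must not exceed the block size", which is a simplification of the kernel's directory entry validation; the helper names `parse_fields` and `overruns` are made up for illustration.

```python
import re

# One of the DS10L messages from the report, copied verbatim.
LINE = (
    "EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: "
    "block 4731728: comm rm: bad entry in directory: directory entry overrun - "
    "offset=76, inode=417080, rec_len=61816, name_len=35, size=4096"
)


def parse_fields(line):
    """Collect the key=value integers at the end of the message into a dict."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def overruns(fields):
    """Assumed check: the entry must end inside the directory block."""
    return fields["offset"] + fields["rec_len"] > fields["size"]


fields = parse_fields(LINE)
end_of_entry = fields["offset"] + fields["rec_len"]
print(end_of_entry, fields["size"], overruns(fields))
```

With the logged values the entry would end at byte 76 + 61816 = 61892 of a 4096-byte block, so the rec_len field read from memory cannot be a valid on-disk value.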

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-10 20:27 ` ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28 Meelis Roos
@ 2019-02-15 16:59   ` Meelis Roos
  2019-02-16 17:45     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-15 16:59 UTC (permalink / raw)
  To: linux-alpha, LKML, linux-block, Jan Kara

>> I have noticed ext4 filesystem corruption on two of my test alphas with 4.20.0-09062-gd8372ba8ce28.
> 
> Retried it; it still happens with 5.0.0-rc5-00358-gdf3865f8f568: rsync during emerge --sync just fails, with nothing in dmesg.

I have finished a second round of bisecting. The first round did not get me far
enough, so I may still have false "goods" in my bisection history.

The command I used for bisecting was Gentoo's emerge --sync, which sometimes
failed with error -6 or -11 from rsync.
Usually no file system corruption happened and nothing was in dmesg, just a file I/O error from rsync.
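
Because the failure is intermittent, a single passing run is weak evidence for "good" during a bisection. The sketch below shows one way to repeat the test before classifying a kernel. It is an illustration only, not the procedure used in this thread. It also assumes that the -6 and -11 statuses follow the Python subprocess convention (a negative value means "killed by that signal number"), which the thread does not confirm; `describe_status` and `classify` are hypothetical helper names.

```python
import signal
import subprocess


def describe_status(status):
    """Describe an exit status, assuming negative values mean 'killed by signal'."""
    if status == 0:
        return "ok"
    if status < 0:
        try:
            return "killed by " + signal.Signals(-status).name
        except ValueError:
            return "killed by signal %d" % -status
    return "exited with code %d" % status


def classify(run_once, tries=5):
    """Return 'bad' on the first failing run, 'good' only if every run passes."""
    for attempt in range(1, tries + 1):
        status = run_once()
        if status != 0:
            print("attempt %d: %s" % (attempt, describe_status(status)))
            return "bad"
    return "good"


def emerge_sync():
    """Run the test command from this thread once and return its exit status."""
    return subprocess.run(["emerge", "--sync"]).returncode


# Usage on the machine under test (not executed here):
#   verdict = classify(emerge_sync, tries=5)
```

A "bad" verdict is reliable after one failure, while a "good" verdict only becomes more trustworthy as `tries` grows, which is why a low repeat count can leave false "goods" in the history.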

The result of the bisection is
[88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages

Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?

>> On AlphaServer DS10:
>> [10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: inode #1853093: block 1: comm rsync: invalid block
>>
>> On AlphaServer DS10L:
>> [ 5325.064656] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
>> [ 5325.069539] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
>> [ 5325.077351] EXT4-fs error (device sda2): ext4_empty_dir:2718: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
>>
>> Two other alphas, PC-164 and Eiger, worked fine with the same kernel version (different kernel configs according to hardware).
>>
>> The details:
>> 4.20 worked fine, with gentoo emerge package update after bootup.
>> Next, 4.20.0-06428-g00c569b567c7 worked fine, with gentoo emerge after bootup.
>> Next, 4.20.0-09062-gd8372ba8ce28 booted up fine but rsync and rm during start of gentoo emerge errored out like above.
>>
>> So the corruption _might_ have happened during bootup of previous kernel but it looks more likely that only the latest kernel with blk-mq introduced the problems. mq-deadline is in use on all the alphas.
>>
>> DS10 has Symbios 53C896 SCSI (sym2 driver), DS10L has QLogic ISP1040, so they are different. Working Eiger and PC164 have sym2 based scsi controllers too.
> 

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-15 16:59   ` Meelis Roos
@ 2019-02-16 17:45     ` Theodore Y. Ts'o
  2019-02-16 22:29       ` Meelis Roos
  0 siblings, 1 reply; 18+ messages in thread
From: Theodore Y. Ts'o @ 2019-02-16 17:45 UTC (permalink / raw)
  To: Meelis Roos; +Cc: linux-alpha, LKML, linux-block, Jan Kara

On Fri, Feb 15, 2019 at 06:59:48PM +0200, Meelis Roos wrote:
> The result of the bisection is
> [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages
> 
> Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?

Can you try reverting the commit and see if it makes the problem go away?

    	    	      	  	     - Ted

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-16 17:45     ` Theodore Y. Ts'o
@ 2019-02-16 22:29       ` Meelis Roos
  2019-02-18 12:02         ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-16 22:29 UTC (permalink / raw)
  To: Theodore Y. Ts'o, linux-alpha, LKML, linux-block, Jan Kara

>> The result of the bisection is
>> [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages
>>
>> Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?
> 
> Can you try reverting the commit and see if it makes the problem go away?

Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems to make the kernel work - emerge --sync succeeded.

Unfinished further bisection has also not yielded any other bad revisions so far.

-- 
Meelis Roos

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-16 22:29       ` Meelis Roos
@ 2019-02-18 12:02         ` Jan Kara
  2019-02-18 12:37           ` Meelis Roos
  2019-02-19 12:17           ` Meelis Roos
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Kara @ 2019-02-18 12:02 UTC (permalink / raw)
  To: Meelis Roos
  Cc: Theodore Y. Ts'o, linux-alpha, LKML, linux-block, Jan Kara

On Sun 17-02-19 00:29:40, Meelis Roos wrote:
> > > The result of the bisection is
> > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages
> > > 
> > > Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?
> > 
> > Can you try reverting the commit and see if it makes the problem go away?
> 
> Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
> to make the kernel work - emerge --sync succeeded.
> 
> Unfinished further bisection has also not yielded any other bad revisions
> so far.

Hum, weird. I have a hard time understanding how that change could be causing
fs corruption on Alpha, but OTOH it is not completely unthinkable. With this
commit we may migrate some block device pages we were not able to migrate
previously, and that could be causing some unexpected issue. I'll look into
this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-18 12:02         ` Jan Kara
@ 2019-02-18 12:37           ` Meelis Roos
  2019-02-19 12:17           ` Meelis Roos
  1 sibling, 0 replies; 18+ messages in thread
From: Meelis Roos @ 2019-02-18 12:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Y. Ts'o, linux-alpha, LKML, linux-block

> Hum, weird. I have a hard time understanding how that change could be causing
> fs corruption on Alpha but OTOH it is not completely unthinkable. With this
> commit we may migrate some block device pages we were not able to migrate
> previously and that could be causing some unexpected issue. I'll look into
> this.

To make things more interesting, it does not happen on every alpha, only on one subarch
so far: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1889207.html
is my original bug report.

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-18 12:02         ` Jan Kara
  2019-02-18 12:37           ` Meelis Roos
@ 2019-02-19 12:17           ` Meelis Roos
  2019-02-19 13:20             ` Jan Kara
  1 sibling, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-19 12:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Y. Ts'o, linux-alpha, LKML, linux-block

>>>> The result of the bisection is
>>>> [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages
>>>>
>>>> Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?
>>>
>>> Can you try reverting the commit and see if it makes the problem go away?
>>
>> Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
>> to make the kernel work - emerge --sync succeeded.
There is more to it.

After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of that patch)
successfully for a Gentoo update, I upgraded the kernel to
5.0.0-rc7-00011-gb5372fe5dc84-dirty (today's git + revert of this patch) and it broke on rsync again:

RepoStorageException: command exited with status -6: rsync -a --link-dest /usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found --exclude=/packages --exclude /.tmp-unverified-download-quarantine /usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/

Nothing in dmesg.

This means the real root cause is somewhere deeper, and reverting this commit just made
it less likely to happen.

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-19 12:17           ` Meelis Roos
@ 2019-02-19 13:20             ` Jan Kara
  2019-02-19 13:49               ` Meelis Roos
  2019-02-19 14:44               ` Matthew Wilcox
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Kara @ 2019-02-19 13:20 UTC (permalink / raw)
  To: Meelis Roos
  Cc: Jan Kara, Theodore Y. Ts'o, linux-alpha, LKML, linux-block, linux-mm

On Tue 19-02-19 14:17:09, Meelis Roos wrote:
> > > > > The result of the bisection is
> > > > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages
> > > > > 
> > > > > Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit?
> > > > 
> > > > Can you try reverting the commit and see if it makes the problem go away?
> > > 
> > > Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
> > > to make the kernel work - emerge --sync succeeded.
> There is more to it.
> 
> After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of
> that patch) successfully for Gentoo update, I upgraded the kernel to
> 5.0.0-rc7-00011-gb5372fe5dc84-dirty (today's git + revert of this patch)
> and it broke on rsync again:
> 
> RepoStorageException: command exited with status -6: rsync -a --link-dest /usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found --exclude=/packages --exclude /.tmp-unverified-download-quarantine /usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/
> 
> Nothing in dmesg.
> 
> This means the real root cause is somewhere deeper, and reverting this
> commit just made it less likely to happen.

Thanks for information. Yeah, that makes somewhat more sense. Can you ever
see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your
findings still seem to indicate that there's some problem with page
migration and Alpha (added MM list to CC).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-19 13:20             ` Jan Kara
@ 2019-02-19 13:49               ` Meelis Roos
  2019-02-19 14:44               ` Matthew Wilcox
  1 sibling, 0 replies; 18+ messages in thread
From: Meelis Roos @ 2019-02-19 13:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Y. Ts'o, linux-alpha, LKML, linux-block, linux-mm

> Thanks for information. Yeah, that makes somewhat more sense. Can you ever
> see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE?
HAVE_ARCH_TRANSPARENT_HUGEPAGE [=n]

Seems there is no THP on alpha.

> Because your
> findings still seem to indicate that there's some problem with page
> migration and Alpha (added MM list to CC).

But my kernel config had memory compaction (that turned on page migration) and
bounce buffers. I do not remember why I found them necessary but I will try
without them.

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-19 13:20             ` Jan Kara
  2019-02-19 13:49               ` Meelis Roos
@ 2019-02-19 14:44               ` Matthew Wilcox
  2019-02-20  6:31                 ` Meelis Roos
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2019-02-19 14:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Meelis Roos, Theodore Y. Ts'o, linux-alpha, LKML,
	linux-block, linux-mm

On Tue, Feb 19, 2019 at 02:20:26PM +0100, Jan Kara wrote:
> Thanks for information. Yeah, that makes somewhat more sense. Can you ever
> see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your
> findings still seem to indicate that there's some problem with page
> migration and Alpha (added MM list to CC).

Could
https://lore.kernel.org/linux-mm/20190219123212.29838-1-larper@axis.com/T/#u
be relevant?

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-19 14:44               ` Matthew Wilcox
@ 2019-02-20  6:31                 ` Meelis Roos
  2019-02-20  9:48                   ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-20  6:31 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Theodore Y. Ts'o, linux-alpha, LKML, linux-block, linux-mm

> Could
> https://lore.kernel.org/linux-mm/20190219123212.29838-1-larper@axis.com/T/#u
> be relevant?

Tried it, still broken.

I wrote:

> But my kernel config had memory compaction (that turned on page migration) and
> bounce buffers. I do not remember why I found them necessary but I will try
> without them. 

First, I found out that both the problematic alphas had memory compaction and
page migration and bounce buffers turned on, and working alphas had them off.

Next, turning off these options makes the problematic alphas work.

-- 
Meelis Roos <mroos@linux.ee>

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-20  6:31                 ` Meelis Roos
@ 2019-02-20  9:48                   ` Jan Kara
  2019-02-20 23:23                     ` Meelis Roos
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2019-02-20  9:48 UTC (permalink / raw)
  To: Meelis Roos
  Cc: Matthew Wilcox, Jan Kara, Theodore Y. Ts'o, linux-alpha,
	LKML, linux-block, linux-mm

On Wed 20-02-19 08:31:05, Meelis Roos wrote:
> > Could
> > https://lore.kernel.org/linux-mm/20190219123212.29838-1-larper@axis.com/T/#u
> > be relevant?
> 
> Tried it, still broken.

OK, I didn't put too much hope into this patch as you see filesystem
metadata corruption so icache/dcache coherency issues seemed unlikely.
Still good that you've tried so that we are sure.

> I wrote:
> 
> > But my kernel config had memory compaction (that turned on page migration) and
> > bounce buffers. I do not remember why I found them necessary but I will try
> > without them.
> 
> First, I found out that both the problematic alphas had memory compaction and
> page migration and bounce buffers turned on, and working alphas had them off.
> 
> Next, turning off these options makes the problematic alphas work.

OK, thanks for testing! Can you narrow down whether the problem is due to
CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
completely different things so knowing where to look will help. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-20  9:48                   ` Jan Kara
@ 2019-02-20 23:23                     ` Meelis Roos
  2019-02-21 13:29                       ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Meelis Roos @ 2019-02-20 23:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Theodore Y. Ts'o, linux-alpha, LKML,
	linux-block, linux-mm

>> First, I found out that both the problematic alphas had memory compaction and
>> page migration and bounce buffers turned on, and working alphas had them off.
>>
>> Next, turning off these options makes the problematic alphas work.
> 
> OK, thanks for testing! Can you narrow down whether the problem is due to
> CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> completely different things so knowing where to look will help. Thanks!

Tested both.

Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
Just CONFIG_BOUNCE has no effect in 5 tries.
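
To compare machines quickly, the three options discussed here can be read out of a kernel .config. This is a minimal sketch under assumptions: it only parses text in the usual "CONFIG_X=y" / "# CONFIG_X is not set" form, the sample text is made up, and the helper names are hypothetical. Where the config text comes from (a build tree's .config, or /proc/config.gz if the kernel exposes it) is left to the caller.

```python
import re

OPTIONS = ("CONFIG_COMPACTION", "CONFIG_MIGRATION", "CONFIG_BOUNCE")

# Made-up sample resembling the failing combination reported above.
SAMPLE = """\
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_BOUNCE is not set
"""


def enabled_options(config_text, options=OPTIONS):
    """Map each option to True if the config text sets it to y (or m)."""
    enabled = {option: False for option in options}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(y|m)$", line.strip())
        if match and match.group(1) in enabled:
            enabled[match.group(1)] = True
    return enabled


def has_failing_combination(enabled):
    """True for the combination reported as breaking: compaction plus migration."""
    return enabled["CONFIG_COMPACTION"] and enabled["CONFIG_MIGRATION"]


state = enabled_options(SAMPLE)
print(state, has_failing_combination(state))
```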

-- 
Meelis Roos

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-20 23:23                     ` Meelis Roos
@ 2019-02-21 13:29                       ` Jan Kara
  2022-08-25 15:05                         ` matoro
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2019-02-21 13:29 UTC (permalink / raw)
  To: Meelis Roos
  Cc: Jan Kara, Matthew Wilcox, Theodore Y. Ts'o, linux-alpha,
	LKML, linux-block, linux-mm

On Thu 21-02-19 01:23:50, Meelis Roos wrote:
> > > First, I found out that both the problematic alphas had memory compaction and
> > > page migration and bounce buffers turned on, and working alphas had them off.
> > > 
> > > Next, turning off these options makes the problematic alphas work.
> > 
> > OK, thanks for testing! Can you narrow down whether the problem is due to
> > CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> > completely different things so knowing where to look will help. Thanks!
> 
> Tested both.
> 
> Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
> Just CONFIG_BOUNCE has no effect in 5 tries.

OK, so page migration is problematic. Thanks for confirmation!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2019-02-21 13:29                       ` Jan Kara
@ 2022-08-25 15:05                         ` matoro
  2022-08-26 10:55                           ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2022-08-25 15:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Meelis Roos, Matthew Wilcox, Theodore Y. Ts'o, linux-alpha,
	LKML, linux-block, linux-mm

Hello all, I know this is quite an old thread.  I recently acquired some 
alpha hardware and have run into this exact same problem on the latest 
stable kernel (5.18 and 5.19).  CONFIG_COMPACTION seems to be totally 
broken and causes userspace to be extremely unstable - random segfaults, 
corruption of glibc data structures, gcc ICEs etc etc - seems most 
noticeable during tasks with heavy I/O load.

My hardware is a DS15 (Titan), so only slightly newer than the Tsunamis 
mentioned earlier.  The problem is greatly exacerbated when using a 
machine-optimized kernel (CONFIG_ALPHA_TITAN) over one with 
CONFIG_ALPHA_GENERIC.  But it still doesn't go away on a generic kernel, 
just pops up less often, usually very I/O heavy tasks like checking out 
a tag in the kernel repo.

However all of this seems to be dependent on CONFIG_COMPACTION.  With 
this toggled off all problems disappear, regardless of other options.  I 
tried reverting the commit 88dbcbb3a4847f5e6dfeae952d3105497700c128 
mentioned earlier in the thread (the structure has moved to a different 
file but was otherwise the same), but it unfortunately did not make a 
difference.

Since this doesn't seem to have a known cause or an easy fix, would it 
be reasonable to just add a Kconfig dep to disable it automatically on 
alpha?

Thank you!

-------- Original Message --------
Subject: Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
Date: 2019-02-21 08:29
From: Jan Kara <jack@suse.cz>
To: Meelis Roos <mroos@linux.ee>

On Thu 21-02-19 01:23:50, Meelis Roos wrote:
> > > First, I found out that both the problematic alphas had memory compaction and
> > > page migration and bounce buffers turned on, and working alphas had them off.
> > >
> > > Next, turning off these options makes the problematic alphas work.
> >
> > OK, thanks for testing! Can you narrow down whether the problem is due to
> > CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> > completely different things so knowing where to look will help. Thanks!
> 
> Tested both.
> 
> Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
> Just CONFIG_BOUNCE has no effect in 5 tries.

OK, so page migration is problematic. Thanks for confirmation!

								Honza

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2022-08-25 15:05                         ` matoro
@ 2022-08-26 10:55                           ` Jan Kara
  2022-08-26 11:04                             ` Vlastimil Babka
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2022-08-26 10:55 UTC (permalink / raw)
  To: matoro
  Cc: Jan Kara, Meelis Roos, Matthew Wilcox, Theodore Y. Ts'o,
	linux-alpha, LKML, linux-block, linux-mm, vbabka

On Thu 25-08-22 11:05:48, matoro wrote:
> Hello all, I know this is quite an old thread.  I recently acquired some
> alpha hardware and have run into this exact same problem on the latest
> stable kernel (5.18 and 5.19).  CONFIG_COMPACTION seems to be totally broken
> and causes userspace to be extremely unstable - random segfaults, corruption
> of glibc data structures, gcc ICEs etc etc - seems most noticeable during
> tasks with heavy I/O load.
> 
> My hardware is a DS15 (Titan), so only slightly newer than the Tsunamis
> mentioned earlier.  The problem is greatly exacerbated when using a
> machine-optimized kernel (CONFIG_ALPHA_TITAN) over one with
> CONFIG_ALPHA_GENERIC.  But it still doesn't go away on a generic kernel,
> just pops up less often, usually very I/O heavy tasks like checking out a
> tag in the kernel repo.
> 
> However all of this seems to be dependent on CONFIG_COMPACTION.  With this
> toggled off all problems disappear, regardless of other options.  I tried
> reverting the commit 88dbcbb3a4847f5e6dfeae952d3105497700c128 mentioned
> earlier in the thread (the structure has moved to a different file but was
> otherwise the same), but it unfortunately did not make a difference.
> 
> Since this doesn't seem to have a known cause or an easy fix, would it be
> reasonable to just add a Kconfig dep to disable it automatically on alpha?

Thanks for the report. I guess this just confirms that migration of pagecache
pages is somehow broken on Alpha. Maybe we are failing to flush some cache
specific to Alpha? Or maybe the page migration code is not safe wrt the
peculiar memory ordering Alpha has... I think this will need someone with
Alpha HW and willingness to dive into MM internals to debug this. Added
Vlasta to CC mostly for awareness and in case it rings some bells :).

								Honza

> -------- Original Message --------
> Subject: Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> Date: 2019-02-21 08:29
> From: Jan Kara <jack@suse.cz>
> To: Meelis Roos <mroos@linux.ee>
> 
> On Thu 21-02-19 01:23:50, Meelis Roos wrote:
> > > > First, I found out that both the problematic alphas had memory compaction and
> > > > page migration and bounce buffers turned on, and working alphas had them off.
> > > >
> > > > Next, turning off these options makes the problematic alphas work.
> > >
> > > OK, thanks for testing! Can you narrow down whether the problem is due to
> > > CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> > > completely different things so knowing where to look will help. Thanks!
> > 
> > Tested both.
> > 
> > Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
> > Just CONFIG_BOUNCE has no effect in 5 tries.
> 
> OK, so page migration is problematic. Thanks for confirmation!
> 
> 								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2022-08-26 10:55                           ` Jan Kara
@ 2022-08-26 11:04                             ` Vlastimil Babka
  2022-08-26 16:16                               ` matoro
  0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2022-08-26 11:04 UTC (permalink / raw)
  To: Jan Kara, matoro
  Cc: Meelis Roos, Matthew Wilcox, Theodore Y. Ts'o, linux-alpha,
	LKML, linux-block, linux-mm, vbabka

On 8/26/22 12:55, Jan Kara wrote:
> On Thu 25-08-22 11:05:48, matoro wrote:
>> Hello all, I know this is quite an old thread.  I recently acquired some
>> alpha hardware and have run into this exact same problem on the latest
>> stable kernel (5.18 and 5.19).  CONFIG_COMPACTION seems to be totally broken
>> and causes userspace to be extremely unstable - random segfaults, corruption
>> of glibc data structures, gcc ICEs etc etc - seems most noticeable during
>> tasks with heavy I/O load.
>> 
>> My hardware is a DS15 (Titan), so only slightly newer than the Tsunamis
>> mentioned earlier.  The problem is greatly exacerbated when using a
>> machine-optimized kernel (CONFIG_ALPHA_TITAN) over one with
>> CONFIG_ALPHA_GENERIC.  But it still doesn't go away on a generic kernel,
>> just pops up less often, usually very I/O heavy tasks like checking out a
>> tag in the kernel repo.
>> 
>> However all of this seems to be dependent on CONFIG_COMPACTION.  With this
>> toggled off all problems disappear, regardless of other options.  I tried
>> reverting the commit 88dbcbb3a4847f5e6dfeae952d3105497700c128 mentioned
>> earlier in the thread (the structure has moved to a different file but was
>> otherwise the same), but it unfortunately did not make a difference.
>> 
>> Since this doesn't seem to have a known cause or an easy fix, would it be
>> reasonable to just add a Kconfig dep to disable it automatically on alpha?
> 
> Thanks for the report. I guess this just confirms that migration of pagecache
> pages is somehow broken on Alpha. Maybe we are failing to flush some cache
> specific to Alpha? Or maybe the page migration code is not safe wrt the
> peculiar memory ordering Alpha has... I think this will need someone with
> Alpha HW and willingness to dive into MM internals to debug this. Added
> Vlasta to CC mostly for awareness and in case it rings some bells :).

Hi, this doesn't ring any bells, unfortunately. Does the corruption also happen when
mmapping a file and applying mbind() with MPOL_MF_MOVE or migrate_pages()?
That should allow more controlled migration experiments than through
compaction. But that would also need a NUMA machine or fakenuma support;
I don't know if alpha has that.
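
On a machine without NUMA, one possible alternative for a more controlled experiment is to trigger compaction by hand and verify file contents afterwards. The sketch below is an assumption-laden illustration and not the mbind()/migrate_pages() approach suggested above. It uses a different knob, /proc/sys/vm/compact_memory, which exists only with CONFIG_COMPACTION and needs root to write. Nothing guarantees that a given compaction pass actually migrates the pages of the test file, so a clean run proves little. The helper names are hypothetical.

```python
import hashlib
import os


def write_pattern(path, size_mib):
    """Write size_mib MiB of a known pattern; return its SHA-256 hex digest."""
    block = bytes(range(256)) * 4096  # exactly 1 MiB
    digest = hashlib.sha256()
    with open(path, "wb") as out:
        for _ in range(size_mib):
            out.write(block)
            digest.update(block)
        out.flush()
        os.fsync(out.fileno())
    return digest.hexdigest()


def sha256_of(path):
    """Hash the file as currently seen through the page cache."""
    digest = hashlib.sha256()
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compaction_roundtrip(path, expected, rounds,
                         trigger="/proc/sys/vm/compact_memory"):
    """Trigger compaction, re-read the file, and report the first bad round.

    Returns the 0-based round in which the digest differed, or None if
    every round matched.
    """
    for round_index in range(rounds):
        with open(trigger, "w") as knob:
            knob.write("1")
        if sha256_of(path) != expected:
            return round_index
    return None


# Usage as root on the machine under test (not executed here):
#   expected = write_pattern("/var/tmp/compaction-test.bin", 256)
#   print(compaction_roundtrip("/var/tmp/compaction-test.bin", expected, 20))
```

The expected digest is computed from the generated pattern rather than from a first read of the file, so a corrupted first read would also be caught.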

* Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
  2022-08-26 11:04                             ` Vlastimil Babka
@ 2022-08-26 16:16                               ` matoro
  0 siblings, 0 replies; 18+ messages in thread
From: matoro @ 2022-08-26 16:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jan Kara, Meelis Roos, Matthew Wilcox, Theodore Y. Ts'o,
	linux-alpha, LKML, linux-block, linux-mm, vbabka

At least according to the docs I see, fakenuma is x86-specific.  There 
are multi-socket machines, but the one I have is single-socket 
single-core.

I can provide access to this machine to play around with it though!  
Either simple shell access or serial access if some kernel poking is 
needed.

Would that be helpful or is a NUMA system going to be required for 
debugging?

-------- Original Message --------
Subject: Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
Date: 2022-08-26 07:04
From: Vlastimil Babka <vbabka@suse.cz>
To: Jan Kara <jack@suse.cz>, matoro <matoro_mailinglist_kernel@matoro.tk>

On 8/26/22 12:55, Jan Kara wrote:
> On Thu 25-08-22 11:05:48, matoro wrote:
>> Hello all, I know this is quite an old thread.  I recently acquired some
>> alpha hardware and have run into this exact same problem on the latest
>> stable kernel (5.18 and 5.19).  CONFIG_COMPACTION seems to be totally broken
>> and causes userspace to be extremely unstable - random segfaults, corruption
>> of glibc data structures, gcc ICEs etc etc - seems most noticeable during
>> tasks with heavy I/O load.
>>
>> My hardware is a DS15 (Titan), so only slightly newer than the Tsunamis
>> mentioned earlier.  The problem is greatly exacerbated when using a
>> machine-optimized kernel (CONFIG_ALPHA_TITAN) over one with
>> CONFIG_ALPHA_GENERIC.  But it still doesn't go away on a generic kernel,
>> just pops up less often, usually very I/O heavy tasks like checking out a
>> tag in the kernel repo.
>>
>> However all of this seems to be dependent on CONFIG_COMPACTION.  With this
>> toggled off all problems disappear, regardless of other options.  I tried
>> reverting the commit 88dbcbb3a4847f5e6dfeae952d3105497700c128 mentioned
>> earlier in the thread (the structure has moved to a different file but was
>> otherwise the same), but it unfortunately did not make a difference.
>>
>> Since this doesn't seem to have a known cause or an easy fix, would it be
>> reasonable to just add a Kconfig dep to disable it automatically on alpha?
>
> Thanks for the report. I guess this just confirms that migration of pagecache
> pages is somehow broken on Alpha. Maybe we are failing to flush some cache
> specific to Alpha? Or maybe the page migration code is not safe wrt the
> peculiar memory ordering Alpha has... I think this will need someone with
> Alpha HW and willingness to dive into MM internals to debug this. Added
> Vlasta to CC mostly for awareness and in case it rings some bells :).

Hi, this doesn't ring any bells, unfortunately. Does the corruption also happen when
mmapping a file and applying mbind() with MPOL_MF_MOVE or migrate_pages()?
That should allow more controlled migration experiments than through
compaction. But that would also need a NUMA machine or fakenuma support;
I don't know if alpha has that.

end of thread, other threads:[~2022-08-26 16:18 UTC | newest]

Thread overview: 18+ messages
     [not found] <fb63a4d0-d124-21c8-7395-90b34b57c85a@linux.ee>
2019-02-10 20:27 ` ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28 Meelis Roos
2019-02-15 16:59   ` Meelis Roos
2019-02-16 17:45     ` Theodore Y. Ts'o
2019-02-16 22:29       ` Meelis Roos
2019-02-18 12:02         ` Jan Kara
2019-02-18 12:37           ` Meelis Roos
2019-02-19 12:17           ` Meelis Roos
2019-02-19 13:20             ` Jan Kara
2019-02-19 13:49               ` Meelis Roos
2019-02-19 14:44               ` Matthew Wilcox
2019-02-20  6:31                 ` Meelis Roos
2019-02-20  9:48                   ` Jan Kara
2019-02-20 23:23                     ` Meelis Roos
2019-02-21 13:29                       ` Jan Kara
2022-08-25 15:05                         ` matoro
2022-08-26 10:55                           ` Jan Kara
2022-08-26 11:04                             ` Vlastimil Babka
2022-08-26 16:16                               ` matoro
