* xfs: garbage file data inclusion bug under memory pressure
@ 2019-07-25 10:06 Tetsuo Handa
  2019-07-25 10:53 ` Brian Foster
  2019-07-25 11:32 ` Dave Chinner
  0 siblings, 2 replies; 23+ messages in thread
From: Tetsuo Handa @ 2019-07-25 10:06 UTC (permalink / raw)
  To: linux-xfs

Hello.

I noticed that a file can include data from deleted files when

  XFS (sda1): writeback error on sector XXXXX

messages are printed (due to being close to OOM).

So far I have confirmed that this bug exists at least from 4.18 through 5.3-rc1.
I haven't tried 4.17 or earlier kernels, and I haven't tried other filesystems.



Steps to test:

(1) Run the disk space filler (source code is shown below).

  # ./fillspace > file &
  # unlink file
  # fg

(2) Wait until the disk space filler completes.

(3) Start the reproducer (source code is shown below).

  # ./oom-torture

(4) Stop the reproducer using Ctrl-C after "writeback error on sector"
    messages have been printed.

  [ 1410.792467] XFS (sda1): writeback error on sector 159883016
  [ 1410.822127] XFS (sda1): writeback error on sector 187138128
  [ 1410.951357] XFS (sda1): writeback error on sector 162195392
  [ 1410.952527] XFS (sda1): writeback error on sector 95210384
  [ 1410.953870] XFS (sda1): writeback error on sector 95539264

(5) Examine files written by the reproducer for file data
    written by the disk space filler.

  # grep -F XXXXX /tmp/file.*
  Binary file /tmp/file.10111 matches
  Binary file /tmp/file.10122 matches
  Binary file /tmp/file.10143 matches
  Binary file /tmp/file.10162 matches
  Binary file /tmp/file.10179 matches

oom-torture.c
----------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>

static char use_delay = 0;

static void sigcld_handler(int unused)
{
	use_delay = 1;
}

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;
	signal(SIGCLD, sigcld_handler);
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
				poll(NULL, 0, 10);
				fsync(fd);
			}
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096) {
		buf[i] = 0;
		if (use_delay) /* Give children a chance to write(). */
			poll(NULL, 0, 10);
	}
	pause();
	return 0;
}
----------------------------------------

fillspace.c
----------------------------------------
#include <unistd.h>

int main(int argc, char *argv[])
{
	static char buffer[1048576];
	int i;
	for (i = 0; i < sizeof(buffer); i++)
		buffer[i] = 'X';
	for (i = 0; i < sizeof(buffer); i += 40)
		buffer[i] = '\n';
	while (write(1, buffer, sizeof(buffer)) > 0);
	return 0;
}
----------------------------------------



Environment:

This is a plain system with no special configuration.

  # cat /proc/partitions
  major minor  #blocks  name

     8        0  104857600 sda
     8        1  104856576 sda1


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 10:06 xfs: garbage file data inclusion bug under memory pressure Tetsuo Handa
@ 2019-07-25 10:53 ` Brian Foster
  2019-07-25 12:30   ` Tetsuo Handa
  2019-07-25 11:32 ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-07-25 10:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs

On Thu, Jul 25, 2019 at 07:06:24PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I noticed that a file can include data from deleted files when
> 
>   XFS (sda1): writeback error on sector XXXXX
> 
> messages are printed (due to being close to OOM).
> 
> So far I have confirmed that this bug exists at least from 4.18 through 5.3-rc1.
> I haven't tried 4.17 or earlier kernels, and I haven't tried other filesystems.
> 
> 
> 
> Steps to test:
> 
> (1) Run the disk space filler (source code is shown below).
> 
>   # ./fillspace > file &
>   # unlink file
>   # fg
> 
> (2) Wait until the disk space filler completes.
> 
> (3) Start the reproducer (source code is shown below).
> 
>   # ./oom-torture
> 
> (4) Stop the reproducer using Ctrl-C after "writeback error on sector"
>     messages have been printed.
> 
>   [ 1410.792467] XFS (sda1): writeback error on sector 159883016
>   [ 1410.822127] XFS (sda1): writeback error on sector 187138128
>   [ 1410.951357] XFS (sda1): writeback error on sector 162195392
>   [ 1410.952527] XFS (sda1): writeback error on sector 95210384
>   [ 1410.953870] XFS (sda1): writeback error on sector 95539264
> 
> (5) Examine files written by the reproducer for file data
>     written by the disk space filler.
> 
>   # grep -F XXXXX /tmp/file.*
>   Binary file /tmp/file.10111 matches
>   Binary file /tmp/file.10122 matches
>   Binary file /tmp/file.10143 matches
>   Binary file /tmp/file.10162 matches
>   Binary file /tmp/file.10179 matches
> 

This is a known problem. XFS delayed allocation has a window between
delalloc to real block conversion and writeback completion where stale
data exposure is possible if the writeback doesn't complete (i.e., due
to crash, I/O error, etc.). See fstests generic/536 for another
reference.  We've batted around potential solutions like using unwritten
extents for delalloc allocations, but IIRC we haven't been able to come
up with something with suitable performance to this point.

I'm curious why your OOM test results in writeback errors in the first
place. Is that generally expected? Does dmesg show any other XFS related
events, such as filesystem shutdown for example? I gave it a quick try
on a 4GB swapless VM and it doesn't trigger OOM. What's your memory
configuration and what does the /tmp filesystem look like ('xfs_info
/tmp')?

Brian

> oom-torture.c
> ----------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <poll.h>
> 
> static char use_delay = 0;
> 
> static void sigcld_handler(int unused)
> {
> 	use_delay = 1;
> }
> 
> int main(int argc, char *argv[])
> {
> 	static char buffer[4096] = { };
> 	char *buf = NULL;
> 	unsigned long size;
> 	int i;
> 	signal(SIGCLD, sigcld_handler);
> 	for (i = 0; i < 1024; i++) {
> 		if (fork() == 0) {
> 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> 			write(fd, "1000", 4);
> 			close(fd);
> 			sleep(1);
> 			if (!i)
> 				pause();
> 			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
> 			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> 			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
> 				poll(NULL, 0, 10);
> 				fsync(fd);
> 			}
> 			_exit(0);
> 		}
> 	}
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	sleep(2);
> 	/* Will cause OOM due to overcommit */
> 	for (i = 0; i < size; i += 4096) {
> 		buf[i] = 0;
> 		if (use_delay) /* Give children a chance to write(). */
> 			poll(NULL, 0, 10);
> 	}
> 	pause();
> 	return 0;
> }
> ----------------------------------------
> 
> fillspace.c
> ----------------------------------------
> #include <unistd.h>
> 
> int main(int argc, char *argv[])
> {
> 	static char buffer[1048576];
> 	int i;
> 	for (i = 0; i < sizeof(buffer); i++)
> 		buffer[i] = 'X';
> 	for (i = 0; i < sizeof(buffer); i += 40)
> 		buffer[i] = '\n';
> 	while (write(1, buffer, sizeof(buffer)) > 0);
> 	return 0;
> }
> ----------------------------------------
> 
> 
> 
> Environment:
> 
> This is a plain system with no special configuration.
> 
>   # cat /proc/partitions
>   major minor  #blocks  name
> 
>      8        0  104857600 sda
>      8        1  104856576 sda1


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 10:06 xfs: garbage file data inclusion bug under memory pressure Tetsuo Handa
  2019-07-25 10:53 ` Brian Foster
@ 2019-07-25 11:32 ` Dave Chinner
  2019-07-25 12:44   ` Tetsuo Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2019-07-25 11:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs

On Thu, Jul 25, 2019 at 07:06:24PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I noticed that a file can include data from deleted files when
> 
>   XFS (sda1): writeback error on sector XXXXX
> 
> messages are printed (due to being close to OOM).
> 
> So far I have confirmed that this bug exists at least from 4.18 through 5.3-rc1.
> I haven't tried 4.17 or earlier kernels, and I haven't tried other filesystems.
> 
> 
> 
> Steps to test:
> 
> (1) Run the disk space filler (source code is shown below).
> 
>   # ./fillspace > file &
>   # unlink file
>   # fg
> 
> (2) Wait until the disk space filler completes.
> 
> (3) Start the reproducer (source code is shown below).
> 
>   # ./oom-torture
> 
> (4) Stop the reproducer using Ctrl-C after "writeback error on sector"
>     messages have been printed.
> 
>   [ 1410.792467] XFS (sda1): writeback error on sector 159883016
>   [ 1410.822127] XFS (sda1): writeback error on sector 187138128
>   [ 1410.951357] XFS (sda1): writeback error on sector 162195392
>   [ 1410.952527] XFS (sda1): writeback error on sector 95210384
>   [ 1410.953870] XFS (sda1): writeback error on sector 95539264
> 
> (5) Examine files written by the reproducer for file data
>     written by the disk space filler.
> 
>   # grep -F XXXXX /tmp/file.*
>   Binary file /tmp/file.10111 matches
>   Binary file /tmp/file.10122 matches
>   Binary file /tmp/file.10143 matches
>   Binary file /tmp/file.10162 matches
>   Binary file /tmp/file.10179 matches

You've had writeback errors. This is somewhat expected behaviour for
most filesystems when there are write errors - space has been
allocated, but whatever was to be written into that allocated space
failed for some reason so it remains in an uninitialised state....

For XFS and sequential writes, the on-disk file size is not extended
on an IO error, hence it should not expose stale data.  However,
your test code is not checking for errors - that's a bug in your
test code - and that's why writeback errors are resulting in stale
data exposure.  i.e. by ignoring the fsync() error,
the test continues writing at the next offset and the fsync() for
that new data write exposes the region of stale data in the
file where the previous data write failed by extending the on-disk
EOF past it....

So in this case stale data exposure is a side effect of not
handling writeback errors appropriately in the application.
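
i.e. the minimum the test's write loop needs is something like the
below (untested sketch, and /tmp/file.checked is just an illustrative
path): check every fsync(), and stop appending the moment one fails.

----------------------------------------
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	int fd = open("/tmp/file.checked", O_WRONLY | O_CREAT | O_APPEND, 0600);

	if (fd == -1) {
		perror("open");
		return 1;
	}
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
		if (fsync(fd)) {
			/* Writeback failed: stop appending here. Writing at
			 * the next offset and fsync()ing again would extend
			 * the on-disk EOF past the failed (stale) region. */
			perror("fsync");
			return 1;
		}
	}
	return 0;
}
----------------------------------------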

But I have to ask: what is causing the IO to fail? OOM conditions
should not cause writeback errors - XFS will retry memory
allocations until they succeed, and the block layer is supposed to
be resilient against memory shortages, too. Hence I'd be interested
to know what is actually failing here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 10:53 ` Brian Foster
@ 2019-07-25 12:30   ` Tetsuo Handa
  2019-07-25 16:00     ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2019-07-25 12:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On 2019/07/25 19:53, Brian Foster wrote:
> This is a known problem. XFS delayed allocation has a window between
> delalloc to real block conversion and writeback completion where stale
> data exposure is possible if the writeback doesn't complete (i.e., due
> to crash, I/O error, etc.). See fstests generic/536 for another
> reference.  We've batted around potential solutions like using unwritten
> extents for delalloc allocations, but IIRC we haven't been able to come
> up with something with suitable performance to this point.
> 
> I'm curious why your OOM test results in writeback errors in the first
> place. Is that generally expected? Does dmesg show any other XFS related
> events, such as filesystem shutdown for example? I gave it a quick try
> on a 4GB swapless VM and it doesn't trigger OOM. What's your memory
> configuration and what does the /tmp filesystem look like ('xfs_info
> /tmp')?

Writeback errors should not happen just because of a close-to-OOM
situation. And there are no other XFS-related events.

----------
[  828.600750][ T5241] oom-torture invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  828.608041][ T5241] CPU: 1 PID: 5241 Comm: oom-torture Not tainted 5.3.0-rc1+ #626
[  828.612160][ T5241] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  828.617700][ T5241] Call Trace:
[  828.619616][ T5241]  dump_stack+0x67/0x95
[  828.621791][ T5241]  dump_header+0x4d/0x3e0
[  828.624026][ T5241]  oom_kill_process+0x193/0x220
[  828.626412][ T5241]  out_of_memory+0x105/0x360
[  828.628797][ T5241]  ? out_of_memory+0x1ab/0x360
[  828.631133][ T5241]  __alloc_pages_slowpath+0x937/0xbce
[  828.633676][ T5241]  __alloc_pages_nodemask+0x372/0x3b0
[  828.636181][ T5241]  do_anonymous_page+0xe0/0x5e0
[  828.638542][ T5241]  __handle_mm_fault+0x8d6/0xbe0
[  828.640926][ T5241]  handle_mm_fault+0x179/0x380
[  828.643228][ T5241]  ? handle_mm_fault+0x46/0x380
[  828.645546][ T5241]  __do_page_fault+0x255/0x4d0
[  828.647753][ T5241]  do_page_fault+0x27/0x1e0
[  828.649859][ T5241]  page_fault+0x34/0x40
[  828.651820][ T5241] RIP: 0033:0x40082f
[  828.653687][ T5241] Code: 31 e4 e8 54 ff ff ff 31 c0 48 85 db 75 15 eb 3c 0f 1f 00 49 8d 84 24 00 10 00 00 48 39 c3 76 2c 49 89 c4 80 3d 91 18 20 00 00 <c6> 44 05 00 00 74 e2 31 f6 31 ff ba 0a 00 00 00 e8 ec fe ff ff 49
[  828.661900][ T5241] RSP: 002b:00007ffc18ab7fe0 EFLAGS: 00010202
[  828.664423][ T5241] RAX: 00000000c43f9000 RBX: 0000000100000000 RCX: 00007ff8e4083370
[  828.667648][ T5241] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000000
[  828.670833][ T5241] RBP: 00007ff6e4184010 R08: 00007ffc18ab7f10 R09: 00007ffc18ab7d50
[  828.674029][ T5241] R10: 00007ffc18ab7a60 R11: 0000000000000246 R12: 00000000c43f9000
[  828.677236][ T5241] R13: 00007ff6e4184010 R14: 0000000000000000 R15: 0000000000000000
[  828.685651][ T5241] Mem-Info:
[  828.690368][ T5241] active_anon:808378 inactive_anon:3716 isolated_anon:0
[  828.690368][ T5241]  active_file:434 inactive_file:583 isolated_file:0
[  828.690368][ T5241]  unevictable:0 dirty:85 writeback:108 unstable:0
[  828.690368][ T5241]  slab_reclaimable:8616 slab_unreclaimable:28454
[  828.690368][ T5241]  mapped:2428 shmem:4221 pagetables:5031 bounce:0
[  828.690368][ T5241]  free:25443 free_pcp:646 free_cma:0
[  828.713486][ T5241] Node 0 active_anon:3233512kB inactive_anon:14864kB active_file:1784kB inactive_file:2744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:9712kB dirty:324kB writeback:532kB shmem:16884kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2826240kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  828.730876][ T5241] DMA free:14272kB min:296kB low:368kB high:440kB active_anon:1632kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  828.743754][ T5241] lowmem_reserve[]: 0 2679 3495 3495
[  828.752288][ T5241] DMA32 free:54488kB min:51572kB low:64464kB high:77356kB active_anon:2682892kB inactive_anon:0kB active_file:0kB inactive_file:544kB unevictable:0kB writepending:0kB present:3129216kB managed:2743640kB mlocked:0kB kernel_stack:2064kB pagetables:28kB bounce:0kB free_pcp:96kB local_pcp:0kB free_cma:0kB
[  828.765008][ T5241] lowmem_reserve[]: 0 0 816 816
[  828.770282][ T5241] Normal free:33040kB min:15712kB low:19640kB high:23568kB active_anon:548988kB inactive_anon:14864kB active_file:1608kB inactive_file:2660kB unevictable:0kB writepending:76kB present:1048576kB managed:835884kB mlocked:0kB kernel_stack:17396kB pagetables:20096kB bounce:0kB free_pcp:2348kB local_pcp:152kB free_cma:0kB
[  828.783430][ T5241] lowmem_reserve[]: 0 0 0 0
[  828.791481][ T5241] DMA: 2*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (U) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14272kB
[  828.802502][ T5241] DMA32: 4*4kB (UM) 7*8kB (UM) 34*16kB (UM) 90*32kB (UME) 54*64kB (UME) 30*128kB (UME) 10*256kB (UE) 26*512kB (UM) 27*1024kB (ME) 0*2048kB 0*4096kB = 54312kB
[  828.810562][ T5241] Normal: 856*4kB (UM) 750*8kB (UME) 589*16kB (UME) 348*32kB (UM) 3*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 30176kB
[  828.818691][ T5241] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  828.825729][ T5241] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  828.837586][ T5241] 5941 total pagecache pages
[  828.842234][ T5241] 0 pages in swap cache
[  828.845175][ T5241] Swap cache stats: add 0, delete 0, find 0/0
[  828.848688][ T5241] Free swap  = 0kB
[  828.851230][ T5241] Total swap = 0kB
[  828.853768][ T5241] 1048445 pages RAM
[  828.856735][ T5241] 0 pages HighMem/MovableOnly
[  828.859683][ T5241] 149588 pages reserved
[  828.862394][ T5241] 0 pages cma reserved
[  828.865161][ T5241] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=oom-torture,pid=5701,uid=0
[  828.871540][ T5241] Out of memory: Killed process 5701 (oom-torture) total-vm:4220kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  828.877783][   T35] oom_reaper: reaped process 5701 (oom-torture), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  829.475335][ T5301] XFS (sda1): writeback error on sector 90663248
[  829.475618][ T5622] XFS (sda1): writeback error on sector 92712224
[  829.476211][ T5557] XFS (sda1): writeback error on sector 370398432
[  829.476801][ T5309] XFS (sda1): writeback error on sector 494252360
[  829.477005][ T5468] XFS (sda1): writeback error on sector 493980928
[  829.727478][ T5241] oom-torture invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  829.734776][ T5241] CPU: 2 PID: 5241 Comm: oom-torture Not tainted 5.3.0-rc1+ #626
----------

Kernel config is http://I-love.SAKURA.ne.jp/tmp/config-5.3-rc1 .

The result below is from a different VM which shows the same problem.

# xfs_info /tmp
meta-data=/dev/sda1              isize=256    agcount=4, agsize=16383936 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=65535744, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=31999, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
	


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 11:32 ` Dave Chinner
@ 2019-07-25 12:44   ` Tetsuo Handa
  2019-07-25 17:28     ` Darrick J. Wong
  2019-07-25 22:07     ` Dave Chinner
  0 siblings, 2 replies; 23+ messages in thread
From: Tetsuo Handa @ 2019-07-25 12:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 2019/07/25 20:32, Dave Chinner wrote:
> You've had writeback errors. This is somewhat expected behaviour for
> most filesystems when there are write errors - space has been
> allocated, but whatever was to be written into that allocated space
> failed for some reason so it remains in an uninitialised state....

This is bad from a security perspective. The data I found includes e.g.
a random source file, /var/log/secure, an SQL database server's access
log containing secret values...

> 
> For XFS and sequential writes, the on-disk file size is not extended
> on an IO error, hence it should not expose stale data.  However,
> your test code is not checking for errors - that's a bug in your
> test code - and that's why writeback errors are resulting in stale
> data exposure.  i.e. by ignoring the fsync() error,
> the test continues writing at the next offset and the fsync() for
> that new data write exposes the region of stale data in the
> file where the previous data write failed by extending the on-disk
> EOF past it....
> 
> So in this case stale data exposure is a side effect of not
> handling writeback errors appropriately in the application.

But blaming users for not handling writeback errors is pointless from
a security perspective. A bad guy might be trying to steal data from
inaccessible files.

> 
> But I have to ask: what is causing the IO to fail? OOM conditions
> should not cause writeback errors - XFS will retry memory
> allocations until they succeed, and the block layer is supposed to
> be resilient against memory shortages, too. Hence I'd be interested
> to know what is actually failing here...

Yeah. It is strange that this problem occurs when close to OOM.
But there are no failure messages at all (except OOM killer messages
and writeback error messages).


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 12:30   ` Tetsuo Handa
@ 2019-07-25 16:00     ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2019-07-25 16:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs

On Thu, Jul 25, 2019 at 09:30:01PM +0900, Tetsuo Handa wrote:
> On 2019/07/25 19:53, Brian Foster wrote:
> > This is a known problem. XFS delayed allocation has a window between
> > delalloc to real block conversion and writeback completion where stale
> > data exposure is possible if the writeback doesn't complete (i.e., due
> > to crash, I/O error, etc.). See fstests generic/536 for another
> > reference.  We've batted around potential solutions like using unwritten
> > extents for delalloc allocations, but IIRC we haven't been able to come
> > up with something with suitable performance to this point.
> > 
> > I'm curious why your OOM test results in writeback errors in the first
> > place. Is that generally expected? Does dmesg show any other XFS related
> > events, such as filesystem shutdown for example? I gave it a quick try
> > on a 4GB swapless VM and it doesn't trigger OOM. What's your memory
> > configuration and what does the /tmp filesystem look like ('xfs_info
> > /tmp')?
> 
> Writeback errors should not happen just because of a close-to-OOM
> situation. And there are no other XFS-related events.
> 

Indeed, that is strange.

...
> 
> Kernel config is http://I-love.SAKURA.ne.jp/tmp/config-5.3-rc1 .
> 
> The result below is from a different VM which shows the same problem.
> 
> # xfs_info /tmp
> meta-data=/dev/sda1              isize=256    agcount=4, agsize=16383936 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0 spinodes=0
> data     =                       bsize=4096   blocks=65535744, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=31999, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 	

I ran your oom-torture.c tool again (without the fs fill step) after
dropping VM RAM to 3GB, and still had to invoke some usemem (from
fstests) instances to consume memory before OOM triggered. I eventually
reproduced oom-torture OOM kills but did not reproduce writeback errors.
I've only run it once, but this is against a virtio vdisk backing
lvm+XFS in the guest. What is your target device here? Is it failing
independently by chance?

Brian

> 
> 


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 12:44   ` Tetsuo Handa
@ 2019-07-25 17:28     ` Darrick J. Wong
  2019-07-25 22:07     ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-07-25 17:28 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Dave Chinner, linux-xfs

On Thu, Jul 25, 2019 at 09:44:35PM +0900, Tetsuo Handa wrote:
> On 2019/07/25 20:32, Dave Chinner wrote:
> > You've had writeback errors. This is somewhat expected behaviour for
> > most filesystems when there are write errors - space has been
> > allocated, but whatever was to be written into that allocated space
> > failed for some reason so it remains in an uninitialised state....
> 
> This is bad from a security perspective. The data I found includes e.g.
> a random source file, /var/log/secure, an SQL database server's access
> log containing secret values...

Woot.  That's bad.

By any chance do the duo of patches:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=bd012b434a56d9fac3cbc33062b8e2cd6e1ad0a0
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=adcf7c0c87191fd3616813c8ce9790f89a9a8eba

fix this problem?  I wrote them a while ago but I never got around to
quantifying how much of a performance impact they'd have.

> > For XFS and sequential writes, the on-disk file size is not extended
> > on an IO error, hence it should not expose stale data.  However,
> > your test code is not checking for errors - that's a bug in your
> > test code - and that's why writeback errors are resulting in stale
> > data exposure.  i.e. by ignoring the fsync() error,
> > the test continues writing at the next offset and the fsync() for
> > that new data write exposes the region of stale data in the
> > file where the previous data write failed by extending the on-disk
> > EOF past it....
> > 
> > So in this case stale data exposure is a side effect of not
> > handling writeback errors appropriately in the application.
> 
> But blaming users for not handling writeback errors is pointless from
> a security perspective. A bad guy might be trying to steal data from
> inaccessible files.

My thoughts exactly.  I'm not sure what data is supposed to be read()
from a file after a write error <cough> but I'm pretty sure that
"someone else's discarded junk" is /not/ in that set.

> 
> > 
> > But I have to ask: what is causing the IO to fail? OOM conditions
> > should not cause writeback errors - XFS will retry memory
> > allocations until they succeed, and the block layer is supposed to
> > be resilient against memory shortages, too. Hence I'd be interested
> > to know what is actually failing here...
> 
> Yeah. It is strange that this problem occurs when close to OOM.
> But there are no failure messages at all (except OOM killer messages
> and writeback error messages).

That /is/ strange.  I wonder if your scsi driver is trying to allocate
memory, failing, and the block layer squishes that into EIO?

--D


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 12:44   ` Tetsuo Handa
  2019-07-25 17:28     ` Darrick J. Wong
@ 2019-07-25 22:07     ` Dave Chinner
  2019-07-29  3:50       ` Tetsuo Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2019-07-25 22:07 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs

On Thu, Jul 25, 2019 at 09:44:35PM +0900, Tetsuo Handa wrote:
> On 2019/07/25 20:32, Dave Chinner wrote:
> > You've had writeback errors. This is somewhat expected behaviour for
> > most filesystems when there are write errors - space has been
> > allocated, but whatever was to be written into that allocated space
> > failed for some reason so it remains in an uninitialised state....
> 
> This is bad from a security perspective. The data I found includes e.g.
> a random source file, /var/log/secure, an SQL database server's access
> log containing secret values...

The results of a read after a write error are undefined. In this
case, the test is doing enough IO to evict the cached pages that
failed the write so the followup read is pulling them from disk, not
from the page cache. i.e. if the test read back immediately after
the failure, it would get whatever the write put into the page cache
but writeback failed to write to disk....

Perhaps we need the write retry mechanisms we use for metadata in
the data IO path, too, so that a single writeback failure doesn't
cause this sort of thing to occur.

> > For XFS and sequential writes, the on-disk file size is not extended
> > on an IO error, hence it should not expose stale data.  However,
> > your test code is not checking for errors - that's a bug in your
> > test code - and that's why writeback errors are resulting in stale
> > data exposure.  i.e. by ignoring the fsync() error,
> > the test continues writing at the next offset and the fsync() for
> > that new data write exposes the region of stale data in the
> > file where the previous data write failed by extending the on-disk
> > EOF past it....
> > 
> > So in this case stale data exposure is a side effect of not
> > handling writeback errors appropriately in the application.
> 
> But blaming users for not handling writeback errors is pointless from
> a security perspective.

I'm not blaming anyone. I'm just explaining how the problem was
exposed and pointing out that the responsibility for writing data
correctly falls on *both* the filesystem and userspace applications.
i.e. when the kernel fails it is userspace's responsibility to clean
up the mess to ensure the /application's data/ is correctly on
stable storage and not corrupt, missing, stale, etc.

So forget security - fsync is a data integrity operation. Not
checking that it failed is extremely poor behaviour from a data
integrity point of view. Fix the data integrity problems in the
application and the security issues that data integrity failures
/may/ expose are very effectively mitigated.

> A bad guy might be trying to
> steal data from inaccessible files.

Which won't happen if user-triggered OOM does not cause writeback
failures. i.e. the bug we need to find and fix is whatever is
causing writeback to error out under OOM conditions.
Writeback is a key component of memory reclaim, and if it fails
under memory pressure we have much bigger problems...

> > But I have to ask: what is causing the IO to fail? OOM conditions
> > should not cause writeback errors - XFS will retry memory
> > allocations until they succeed, and the block layer is supposed to
> > be resilient against memory shortages, too. Hence I'd be interested
> > to know what is actually failing here...
> 
> Yeah. It is strange that this problem occurs when close to OOM.
> But there are no failure messages at all (except OOM killer messages
> and writeback error messages).

Perhaps using things like trace_kmalloc and friends to isolate the
location of memory allocation failures would help....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-25 22:07     ` Dave Chinner
@ 2019-07-29  3:50       ` Tetsuo Handa
  2019-07-29 11:23         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2019-07-29  3:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Dave Chinner wrote:
> > > But I have to ask: what is causing the IO to fail? OOM conditions
> > > should not cause writeback errors - XFS will retry memory
> > > allocations until they succeed, and the block layer is supposed to
> > > be resilient against memory shortages, too. Hence I'd be interested
> > > to know what is actually failing here...
> > 
> > Yeah. It is strange that this problem occurs when close to OOM.
> > But there are no failure messages at all (except OOM killer messages
> > and writeback error messages).
> 
> Perhaps using things like trace_kmalloc and friends to isolate the
> location of memory allocation failures would help....
> 

I checked using the diff below, and confirmed that the XFS writeback failure is triggered by ENOMEM.

When fsync() is called, xfs_submit_ioend() is called. xfs_submit_ioend() invokes
xfs_setfilesize_trans_alloc(), but xfs_trans_alloc() fails with -ENOMEM because
xfs_log_reserve() from xfs_trans_reserve() fails with -ENOMEM because
xlog_ticket_alloc() is using KM_SLEEP | KM_MAYFAIL which is mapped to
GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP which will fail under close-to-OOM.
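
The KM_* to GFP_* mapping happens in kmem_flags_convert() in
fs/xfs/kmem.h. A simplified sketch (paraphrased from memory, not a
verbatim copy of the kernel source):

----------------------------------------
/* Simplified paraphrase of kmem_flags_convert(); not verbatim. */
static inline gfp_t
kmem_flags_convert(xfs_km_flags_t flags)
{
	gfp_t	lflags;

	if (flags & KM_NOSLEEP) {
		lflags = GFP_ATOMIC | __GFP_NOWARN;
	} else {
		lflags = GFP_KERNEL | __GFP_NOWARN;
		if (flags & KM_NOFS)
			lflags &= ~__GFP_FS;
	}
	/* KM_MAYFAIL is what lets the allocation fail under memory
	 * pressure instead of looping until it succeeds: */
	if (flags & KM_MAYFAIL)
		lflags |= __GFP_RETRY_MAYFAIL;
	if (flags & KM_ZERO)
		lflags |= __GFP_ZERO;
	return lflags;
}
----------------------------------------

(GFP_NOFS in the reported mask comes from the memalloc_nofs scope that
xfs_submit_ioend() sets around this path.)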

As a result, bio_endio() is immediately called due to -ENOMEM, and
xfs_destroy_ioend() from xfs_end_bio() from bio_endio() prints the
writeback error message due to the -ENOMEM error.
(By the way, why not print the error code when printing the writeback error message?)

----------------------------------------
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f16d5f196c6b..7df0f5333d91 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -112,7 +112,8 @@ xfs_destroy_ioend(
 
 	if (unlikely(error && !quiet)) {
 		xfs_err_ratelimited(XFS_I(inode)->i_mount,
-			"writeback error on sector %llu", start);
+				    "writeback error %d on sector %llu", error, start);
+		panic("STOP!");
 	}
 }
 
@@ -648,6 +649,9 @@ xfs_submit_ioend(
 {
 	unsigned int		nofs_flag;
 
+	if (status)
+		pr_err("xfs_submit_ioend(1) status=%d\n", status);
+
 	/*
 	 * We can allocate memory here while doing writeback on behalf of
 	 * memory reclaim.  To avoid memory allocation deadlocks set the
@@ -659,6 +663,8 @@ xfs_submit_ioend(
 	if (!status && ioend->io_fork == XFS_COW_FORK) {
 		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
 				ioend->io_offset, ioend->io_size);
+		if (status)
+			pr_err("xfs_submit_ioend(2) status=%d\n", status);
 	}
 
 	/* Reserve log space if we might write beyond the on-disk inode size. */
@@ -666,8 +672,11 @@ xfs_submit_ioend(
 	    (ioend->io_fork == XFS_COW_FORK ||
 	     ioend->io_state != XFS_EXT_UNWRITTEN) &&
 	    xfs_ioend_is_append(ioend) &&
-	    !ioend->io_append_trans)
+	    !ioend->io_append_trans) {
 		status = xfs_setfilesize_trans_alloc(ioend);
+		if (status)
+			pr_err("xfs_submit_ioend(3) status=%d\n", status);
+	}
 
 	memalloc_nofs_restore(nofs_flag);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..d8d1ed1c51d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3748,11 +3748,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
-	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
+	//static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
+	//			      DEFAULT_RATELIMIT_BURST);
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
-		return;
+	//if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+	//	return;
 
 	va_start(args, fmt);
 	vaf.fmt = fmt;
----------------------------------------

----------------------------------------
[  160.300800][T1662] oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
[  160.301216][T1662] CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
[  160.301220][T1662] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  160.301222][T1662] Call Trace:
[  160.301229][T1662]  dump_stack+0x67/0x95
[  160.301235][T1662]  warn_alloc+0xa9/0x140
[  160.301251][T1662]  __alloc_pages_slowpath+0x9a8/0xbce
[  160.301275][T1662]  __alloc_pages_nodemask+0x372/0x3b0
[  160.301288][T1662]  alloc_slab_page+0x3a/0x8d0
[  160.301294][T1662]  ? lockdep_hardirqs_on+0xe8/0x1d0
[  160.301297][T1662]  ? new_slab+0x251/0x420
[  160.301306][T1662]  new_slab+0x330/0x420
[  160.301314][T1662]  ___slab_alloc.constprop.94+0x879/0xb00
[  160.301363][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301398][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301430][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301434][T1662]  ? init_object+0x37/0x80
[  160.301445][T1662]  ? lockdep_hardirqs_off+0x77/0xe0
[  160.301473][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301478][T1662]  __slab_alloc.isra.89.constprop.93+0x43/0x6f
[  160.301486][T1662]  kmem_cache_alloc+0x331/0x390
[  160.301512][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301543][T1662]  kmem_zone_alloc+0x9f/0x110 [xfs]
[  160.301574][T1662]  xlog_ticket_alloc+0x33/0xd0 [xfs]
[  160.301602][T1662]  xfs_log_reserve+0xb4/0x410 [xfs]
[  160.301632][T1662]  xfs_trans_reserve+0x1d1/0x2b0 [xfs]
[  160.301684][T1662]  xfs_trans_alloc+0xc9/0x250 [xfs]
[  160.301714][T1662]  xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
[  160.301769][T1662]  xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
[  160.301799][T1662]  xfs_vm_writepages+0x76/0xa0 [xfs]
[  160.301813][T1662]  do_writepages+0x17/0x80
[  160.301819][T1662]  __filemap_fdatawrite_range+0xc1/0xf0
[  160.301836][T1662]  file_write_and_wait_range+0x53/0xa0
[  160.301865][T1662]  xfs_file_fsync+0x87/0x290 [xfs]
[  160.301878][T1662]  vfs_fsync_range+0x37/0x80
[  160.301883][T1662]  ? do_syscall_64+0x12/0x1c0
[  160.301887][T1662]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.301891][T1662]  do_fsync+0x38/0x60
[  160.301899][T1662]  __x64_sys_fsync+0xf/0x20
[  160.301903][T1662]  do_syscall_64+0x4a/0x1c0
[  160.301909][T1662]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.301912][T1662] RIP: 0033:0x7f73ecaac280
[  160.301916][T1662] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 1d 6d 2d 00 00 75 10 b8 4a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 6a 01 00 48 89 04 24
[  160.301919][T1662] RSP: 002b:00007fff2aaeb018 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  160.301923][T1662] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f73ecaac280
[  160.301925][T1662] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[  160.301928][T1662] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f73eca0420d
[  160.301931][T1662] R10: 00007fff2aaeaa60 R11: 0000000000000246 R12: 0000000000400938
[  160.301934][T1662] R13: 00007fff2aaeb120 R14: 0000000000000000 R15: 0000000000000000
[  160.302003][T1662] xfs_submit_ioend(3) status=-12
[  160.302081][T1662] XFS (sda1): writeback error -12 on sector 91827024
[  160.302085][T1662] Kernel panic - not syncing: STOP!
[  160.304453][T1662] CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
[  160.304455][T1662] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  160.304457][T1662] Call Trace:
[  160.304461][T1662]  dump_stack+0x67/0x95
[  160.304467][T1662]  panic+0xfc/0x2e0
[  160.304501][T1662]  xfs_destroy_ioend+0x228/0x260 [xfs]
[  160.304534][T1662]  xfs_end_bio+0xd0/0xe0 [xfs]
[  160.320673][T1662]  bio_endio+0x174/0x340
[  160.320677][T1662]  ? bio_endio+0xc3/0x340
[  160.320712][T1662]  xfs_submit_ioend.isra.28+0x101/0x180 [xfs]
[  160.320741][T1662]  xfs_vm_writepages+0x76/0xa0 [xfs]
[  160.329386][T1662]  do_writepages+0x17/0x80
[  160.329392][T1662]  __filemap_fdatawrite_range+0xc1/0xf0
[  160.329403][T1662]  file_write_and_wait_range+0x53/0xa0
[  160.329440][T1662]  xfs_file_fsync+0x87/0x290 [xfs]
[  160.329451][T1662]  vfs_fsync_range+0x37/0x80
[  160.329454][T1662]  ? do_syscall_64+0x12/0x1c0
[  160.329457][T1662]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.329462][T1662]  do_fsync+0x38/0x60
[  160.329467][T1662]  __x64_sys_fsync+0xf/0x20
[  160.329470][T1662]  do_syscall_64+0x4a/0x1c0
[  160.329475][T1662]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.329478][T1662] RIP: 0033:0x7f73ecaac280
[  160.329481][T1662] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 1d 6d 2d 00 00 75 10 b8 4a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 6a 01 00 48 89 04 24
[  160.329482][T1662] RSP: 002b:00007fff2aaeb018 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  160.329485][T1662] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f73ecaac280
[  160.329486][T1662] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[  160.329488][T1662] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f73eca0420d
[  160.329489][T1662] R10: 00007fff2aaeaa60 R11: 0000000000000246 R12: 0000000000400938
[  160.329490][T1662] R13: 00007fff2aaeb120 R14: 0000000000000000 R15: 0000000000000000
----------------------------------------


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-29  3:50       ` Tetsuo Handa
@ 2019-07-29 11:23         ` Brian Foster
  2019-07-29 21:56           ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-07-29 11:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Dave Chinner, linux-xfs

On Mon, Jul 29, 2019 at 12:50:11PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > > > But I have to ask: what is causing the IO to fail? OOM conditions
> > > > should not cause writeback errors - XFS will retry memory
> > > > allocations until they succeed, and the block layer is supposed to
> > > > be resilient against memory shortages, too. Hence I'd be interested
> > > > to know what is actually failing here...
> > > 
> > > Yeah. It is strange that this problem occurs when close to OOM.
> > > But there are no failure messages at all (except OOM killer messages
> > > and writeback error messages).
> > 
> > Perhaps using things like trace_kmalloc and friends to isolate the
> > location of memory allocation failures would help....
> > 
> 
> I checked using the diff below, and confirmed that the XFS writeback failure is triggered by ENOMEM.
> 
> When fsync() is called, xfs_submit_ioend() is called. xfs_submit_ioend() invokes
> xfs_setfilesize_trans_alloc(), but xfs_trans_alloc() fails with -ENOMEM because
> xfs_log_reserve() from xfs_trans_reserve() fails with -ENOMEM because
> xlog_ticket_alloc() is using KM_SLEEP | KM_MAYFAIL which is mapped to
> GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP which will fail under close-to-OOM.
> 
> As a result, bio_endio() is immediately called due to -ENOMEM, and
> xfs_destroy_ioend() from xfs_end_bio() from bio_endio() prints the
> writeback error message due to the -ENOMEM error.
> (By the way, why not print the error code when printing the writeback error message?)
> 
> ----------------------------------------

Ah, that makes sense. Thanks for tracking that down Tetsuo. For context,
it looks like that flag goes back to commit eb01c9cd87 ("[XFS] Remove
the xlog_ticket allocator") that replaces some old internal ticket
allocation mechanism (that I'm not familiar with) with a standard kmem
cache.

ISTM we can just remove that KM_MAYFAIL from ticket allocation. We're
already in NOFS context in this particular caller (writeback), though
that's probably not the case for most other transaction allocations. If
we had a reason to get more elaborate, I suppose we could conditionalize
use of the KM_MAYFAIL flag and/or lift bits of ticket allocation to
earlier in xfs_trans_alloc(), but it's not clear to me that's necessary.
Dave?

Brian

> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f16d5f196c6b..7df0f5333d91 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -112,7 +112,8 @@ xfs_destroy_ioend(
> 
>  	if (unlikely(error && !quiet)) {
>  		xfs_err_ratelimited(XFS_I(inode)->i_mount,
> -			"writeback error on sector %llu", start);
> +				    "writeback error %d on sector %llu", error, start);
> +		panic("STOP!");
>  	}
>  }
>  
> @@ -648,6 +649,9 @@ xfs_submit_ioend(
>  {
>  	unsigned int		nofs_flag;
>  
> +	if (status)
> +		pr_err("xfs_submit_ioend(1) status=%d\n", status);
> +
>  	/*
>  	 * We can allocate memory here while doing writeback on behalf of
>  	 * memory reclaim.  To avoid memory allocation deadlocks set the
> @@ -659,6 +663,8 @@ xfs_submit_ioend(
>  	if (!status && ioend->io_fork == XFS_COW_FORK) {
>  		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
>  				ioend->io_offset, ioend->io_size);
> +		if (status)
> +			pr_err("xfs_submit_ioend(2) status=%d\n", status);
>  	}
>  
>  	/* Reserve log space if we might write beyond the on-disk inode size. */
> @@ -666,8 +672,11 @@ xfs_submit_ioend(
>  	    (ioend->io_fork == XFS_COW_FORK ||
>  	     ioend->io_state != XFS_EXT_UNWRITTEN) &&
>  	    xfs_ioend_is_append(ioend) &&
> -	    !ioend->io_append_trans)
> +	    !ioend->io_append_trans) {
>  		status = xfs_setfilesize_trans_alloc(ioend);
> +		if (status)
> +			pr_err("xfs_submit_ioend(3) status=%d\n", status);
> +	}
>  
>  	memalloc_nofs_restore(nofs_flag);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 272c6de1bf4e..d8d1ed1c51d4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3748,11 +3748,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  {
>  	struct va_format vaf;
>  	va_list args;
> -	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
> -				      DEFAULT_RATELIMIT_BURST);
> +	//static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
> +	//			      DEFAULT_RATELIMIT_BURST);
>  
> -	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> -		return;
> +	//if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> +	//	return;
>  
>  	va_start(args, fmt);
>  	vaf.fmt = fmt;
> ----------------------------------------
> 
> ----------------------------------------
> [  160.300800][T1662] oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
> [  160.301216][T1662] CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
> [  160.301220][T1662] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [  160.301222][T1662] Call Trace:
> [  160.301229][T1662]  dump_stack+0x67/0x95
> [  160.301235][T1662]  warn_alloc+0xa9/0x140
> [  160.301251][T1662]  __alloc_pages_slowpath+0x9a8/0xbce
> [  160.301275][T1662]  __alloc_pages_nodemask+0x372/0x3b0
> [  160.301288][T1662]  alloc_slab_page+0x3a/0x8d0
> [  160.301294][T1662]  ? lockdep_hardirqs_on+0xe8/0x1d0
> [  160.301297][T1662]  ? new_slab+0x251/0x420
> [  160.301306][T1662]  new_slab+0x330/0x420
> [  160.301314][T1662]  ___slab_alloc.constprop.94+0x879/0xb00
> [  160.301363][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301398][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301430][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301434][T1662]  ? init_object+0x37/0x80
> [  160.301445][T1662]  ? lockdep_hardirqs_off+0x77/0xe0
> [  160.301473][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301478][T1662]  __slab_alloc.isra.89.constprop.93+0x43/0x6f
> [  160.301486][T1662]  kmem_cache_alloc+0x331/0x390
> [  160.301512][T1662]  ? kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301543][T1662]  kmem_zone_alloc+0x9f/0x110 [xfs]
> [  160.301574][T1662]  xlog_ticket_alloc+0x33/0xd0 [xfs]
> [  160.301602][T1662]  xfs_log_reserve+0xb4/0x410 [xfs]
> [  160.301632][T1662]  xfs_trans_reserve+0x1d1/0x2b0 [xfs]
> [  160.301684][T1662]  xfs_trans_alloc+0xc9/0x250 [xfs]
> [  160.301714][T1662]  xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
> [  160.301769][T1662]  xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
> [  160.301799][T1662]  xfs_vm_writepages+0x76/0xa0 [xfs]
> [  160.301813][T1662]  do_writepages+0x17/0x80
> [  160.301819][T1662]  __filemap_fdatawrite_range+0xc1/0xf0
> [  160.301836][T1662]  file_write_and_wait_range+0x53/0xa0
> [  160.301865][T1662]  xfs_file_fsync+0x87/0x290 [xfs]
> [  160.301878][T1662]  vfs_fsync_range+0x37/0x80
> [  160.301883][T1662]  ? do_syscall_64+0x12/0x1c0
> [  160.301887][T1662]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.301891][T1662]  do_fsync+0x38/0x60
> [  160.301899][T1662]  __x64_sys_fsync+0xf/0x20
> [  160.301903][T1662]  do_syscall_64+0x4a/0x1c0
> [  160.301909][T1662]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.301912][T1662] RIP: 0033:0x7f73ecaac280
> [  160.301916][T1662] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 1d 6d 2d 00 00 75 10 b8 4a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 6a 01 00 48 89 04 24
> [  160.301919][T1662] RSP: 002b:00007fff2aaeb018 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [  160.301923][T1662] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f73ecaac280
> [  160.301925][T1662] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
> [  160.301928][T1662] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f73eca0420d
> [  160.301931][T1662] R10: 00007fff2aaeaa60 R11: 0000000000000246 R12: 0000000000400938
> [  160.301934][T1662] R13: 00007fff2aaeb120 R14: 0000000000000000 R15: 0000000000000000
> [  160.302003][T1662] xfs_submit_ioend(3) status=-12
> [  160.302081][T1662] XFS (sda1): writeback error -12 on sector 91827024
> [  160.302085][T1662] Kernel panic - not syncing: STOP!
> [  160.304453][T1662] CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
> [  160.304455][T1662] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [  160.304457][T1662] Call Trace:
> [  160.304461][T1662]  dump_stack+0x67/0x95
> [  160.304467][T1662]  panic+0xfc/0x2e0
> [  160.304501][T1662]  xfs_destroy_ioend+0x228/0x260 [xfs]
> [  160.304534][T1662]  xfs_end_bio+0xd0/0xe0 [xfs]
> [  160.320673][T1662]  bio_endio+0x174/0x340
> [  160.320677][T1662]  ? bio_endio+0xc3/0x340
> [  160.320712][T1662]  xfs_submit_ioend.isra.28+0x101/0x180 [xfs]
> [  160.320741][T1662]  xfs_vm_writepages+0x76/0xa0 [xfs]
> [  160.329386][T1662]  do_writepages+0x17/0x80
> [  160.329392][T1662]  __filemap_fdatawrite_range+0xc1/0xf0
> [  160.329403][T1662]  file_write_and_wait_range+0x53/0xa0
> [  160.329440][T1662]  xfs_file_fsync+0x87/0x290 [xfs]
> [  160.329451][T1662]  vfs_fsync_range+0x37/0x80
> [  160.329454][T1662]  ? do_syscall_64+0x12/0x1c0
> [  160.329457][T1662]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.329462][T1662]  do_fsync+0x38/0x60
> [  160.329467][T1662]  __x64_sys_fsync+0xf/0x20
> [  160.329470][T1662]  do_syscall_64+0x4a/0x1c0
> [  160.329475][T1662]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.329478][T1662] RIP: 0033:0x7f73ecaac280
> [  160.329481][T1662] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 1d 6d 2d 00 00 75 10 b8 4a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 6a 01 00 48 89 04 24
> [  160.329482][T1662] RSP: 002b:00007fff2aaeb018 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [  160.329485][T1662] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f73ecaac280
> [  160.329486][T1662] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
> [  160.329488][T1662] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f73eca0420d
> [  160.329489][T1662] R10: 00007fff2aaeaa60 R11: 0000000000000246 R12: 0000000000400938
> [  160.329490][T1662] R13: 00007fff2aaeb120 R14: 0000000000000000 R15: 0000000000000000
> ----------------------------------------


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-29 11:23         ` Brian Foster
@ 2019-07-29 21:56           ` Dave Chinner
  2019-07-30 11:30             ` Brian Foster
  2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2019-07-29 21:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Tetsuo Handa, linux-xfs

On Mon, Jul 29, 2019 at 07:23:35AM -0400, Brian Foster wrote:
> On Mon, Jul 29, 2019 at 12:50:11PM +0900, Tetsuo Handa wrote:
> > Dave Chinner wrote:
> > > > > But I have to ask: what is causing the IO to fail? OOM conditions
> > > > > should not cause writeback errors - XFS will retry memory
> > > > > allocations until they succeed, and the block layer is supposed to
> > > > > be resilient against memory shortages, too. Hence I'd be interested
> > > > > to know what is actually failing here...
> > > > 
> > > > Yeah. It is strange that this problem occurs when close to OOM.
> > > > But there are no failure messages at all (except OOM killer messages
> > > > and writeback error messages).
> > > 
> > > Perhaps using things like trace_kmalloc and friends to isolate the
> > > location of memory allocation failures would help....
> > > 
> > 
> > I checked using the diff below, and confirmed that the XFS writeback failure is triggered by ENOMEM.
> > 
> > When fsync() is called, xfs_submit_ioend() is called. xfs_submit_ioend() invokes
> > xfs_setfilesize_trans_alloc(), but xfs_trans_alloc() fails with -ENOMEM because
> > xfs_log_reserve() from xfs_trans_reserve() fails with -ENOMEM because
> > xlog_ticket_alloc() is using KM_SLEEP | KM_MAYFAIL which is mapped to
> > GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP which will fail under close-to-OOM.
> > 
> > As a result, bio_endio() is immediately called due to -ENOMEM, and
> > xfs_destroy_ioend() from xfs_end_bio() from bio_endio() prints the
> > writeback error message due to the -ENOMEM error.
> > (By the way, why not print the error code when printing the writeback error message?)
> > 
> > ----------------------------------------
> 
> Ah, that makes sense. Thanks for tracking that down Tetsuo. For context,
> it looks like that flag goes back to commit eb01c9cd87 ("[XFS] Remove
> the xlog_ticket allocator") that replaces some old internal ticket
> allocation mechanism (that I'm not familiar with) with a standard kmem
> cache.
> 
> ISTM we can just remove that KM_MAYFAIL from ticket allocation. We're
> already in NOFS context in this particular caller (writeback), though
> that's probably not the case for most other transaction allocations. If
> we had a reason to get more elaborate, I suppose we could conditionalize
> use of the KM_MAYFAIL flag and/or lift bits of ticket allocation to
> earlier in xfs_trans_alloc(), but it's not clear to me that's necessary.
> Dave?

That's a long time ago, and it predates the pre-allocation of
transactions for file size updates in IO submission. The log ticket
rework is irrelevant - it was just an open-coded slab allocator - it
was the fact it handled allocation failure that mattered. That was
done because we were slowly reducing the number of blocking
allocations - trying to reduce the reliance on looping until
allocation succeeds - so MAYFAIL was used for quite a lot of new
allocations at the time.

This is perfectly fine for transactions in syscall context - if we
don't have memory available for the log ticket, we may as well give
up now before we really start creating memory demand and getting
into a state where we are half way through a transaction and
completely out of memory and can't go forwards or backwards.

The trans alloc/trans reserve/log reserve code was somewhat
different back then, as was the writeback code. I suspect it dates
back to when we had trylock semantics in writeback and so memory
allocation errors like this would have simply redirtied the page and
it was tried again later. Hence, historically, I don't think this
was an issue, either.

Hence the code has morphed so much since then I don't think we can
"blame" this commit for introducing this problem. I looks more like
we have removed all the protection it had as we've simplified the
writeback and transaction allocation/reservation code over time, and
now it's exposed directly in writeback.

----

As for how to fix it, I'd just remove KM_MAYFAIL. We've just done a
transaction allocation with just KM_SLEEP, so we may as well do the
same for the log ticket....
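
i.e. something like this in xfs_log_reserve() (untested, from memory,
so treat it as a sketch):

----------------------------------------
 	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
-				KM_SLEEP | KM_MAYFAIL);
+				KM_SLEEP);
----------------------------------------

The ENOMEM check after that allocation then becomes dead code - a
KM_SLEEP allocation loops until it succeeds - but it's harmless to
leave in place.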

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: garbage file data inclusion bug under memory pressure
  2019-07-29 21:56           ` Dave Chinner
@ 2019-07-30 11:30             ` Brian Foster
  2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
  1 sibling, 0 replies; 23+ messages in thread
From: Brian Foster @ 2019-07-30 11:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Tetsuo Handa, linux-xfs

On Tue, Jul 30, 2019 at 07:56:57AM +1000, Dave Chinner wrote:
> On Mon, Jul 29, 2019 at 07:23:35AM -0400, Brian Foster wrote:
> > On Mon, Jul 29, 2019 at 12:50:11PM +0900, Tetsuo Handa wrote:
> > > Dave Chinner wrote:
> > > > > > But I have to ask: what is causing the IO to fail? OOM conditions
> > > > > > should not cause writeback errors - XFS will retry memory
> > > > > > allocations until they succeed, and the block layer is supposed to
> > > > > > be resilient against memory shortages, too. Hence I'd be interested
> > > > > > to know what is actually failing here...
> > > > > 
> > > > > Yeah. It is strange that this problem occurs when close to OOM.
> > > > > But no failure messages at all (except OOM killer messages and writeback
> > > > > error messages).
> > > > 
> > > > Perhaps using things like trace_kmalloc and friends to isolate the
> > > > location of memory allocation failures would help....
> > > > 
> > > 
> > > I checked using below diff, and confirmed that XFS writeback failure is triggered by ENOMEM.
> > > 
> > > When fsync() is called, xfs_submit_ioend() is called. xfs_submit_ioend() invokes
> > > xfs_setfilesize_trans_alloc(), but xfs_trans_alloc() fails with -ENOMEM because
> > > xfs_log_reserve() from xfs_trans_reserve() fails with -ENOMEM because
> > > xlog_ticket_alloc() is using KM_SLEEP | KM_MAYFAIL which is mapped to
> > > GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP, which will fail when close to OOM.
> > > 
> > > As a result, bio_endio() is immediately called due to -ENOMEM, and
> > > xfs_destroy_ioend() from xfs_end_bio() from bio_endio() is printing the
> > > writeback error message due to the -ENOMEM error.
> > > (By the way, why not print the error code when printing the writeback error message?)
> > > 
> > > ----------------------------------------
> > 
> > Ah, that makes sense. Thanks for tracking that down Tetsuo. For context,
> > it looks like that flag goes back to commit eb01c9cd87 ("[XFS] Remove
> > the xlog_ticket allocator") that replaces some old internal ticket
> > allocation mechanism (that I'm not familiar with) with a standard kmem
> > cache.
> > 
> > ISTM we can just remove that KM_MAYFAIL from ticket allocation. We're
> > already in NOFS context in this particular caller (writeback), though
> > that's probably not the case for most other transaction allocations. If
> > we had a reason to get more elaborate, I suppose we could conditionalize
> > use of the KM_MAYFAIL flag and/or lift bits of ticket allocation to
> > earlier in xfs_trans_alloc(), but it's not clear to me that's necessary.
> > Dave?
> 
> That's a long time ago, and it predates the pre-allocation of
> transactions for file size updates in IO submission. The log ticket
> rework is irrelevant - it replaced what was just an open-coded slab
> allocator - what mattered was that it handled allocation failure.
> Back then we were slowly reducing the number of blocking allocations -
> trying to reduce the reliance on looping until allocation succeeds -
> so MAYFAIL was used for quite a lot of new allocations at the time.
> 
> This is perfectly fine for transactions in syscall context - if we
> don't have memory available for the log ticket, we may as well give
> up now before we really start creating memory demand and getting
> into a state where we are half way through a transaction and
> completely out of memory and can't go forwards or backwards.
> 
> The trans alloc/trans reserve/log reserve code was somewhat
> different back then, as was the writeback code. I suspect it dates
> back to when we had trylock semantics in writeback and so memory
> allocation errors like this would have simply redirtied the page,
> to be tried again later. Hence, historically, I don't think this
> was an issue, either.
> 
> Hence the code has morphed so much since then I don't think we can
> "blame" this commit for introducing this problem. It looks more like
> we have removed all the protection it had as we've simplified the
> writeback and transaction allocation/reservation code over time, and
> now it's exposed directly in writeback.
> 

To be clear, I'm not blaming this commit for a bug or anything. I posted
it for reference because the context has changed so much over such a
long period of time that it wasn't clear to me if there was some other
reason for having this flag that might not be obvious from the current
code. Thanks for the background/context, that pretty much clears that
question up...

> ----
> 
> As for how to fix it, I'd just remove KM_MAYFAIL. We've just done a
> transaction allocation with just KM_SLEEP, so we may as well do the
> same for the log ticket....
> 

... and it sounds like we're on the same page. ;)

Tetsuo,

That's a pretty straightforward change. Care to try your workload
against a change to remove the flag, confirm it behaves as we anticipate
and, if so, perhaps send a patch?

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-07-29 21:56           ` Dave Chinner
  2019-07-30 11:30             ` Brian Foster
@ 2019-08-01 10:06             ` Tetsuo Handa
  2019-08-01 10:56               ` Brian Foster
                                 ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Tetsuo Handa @ 2019-08-01 10:06 UTC (permalink / raw)
  To: linux-xfs, Dave Chinner, Brian Foster; +Cc: Tetsuo Handa

When the system is close to OOM, fsync() may fail with -ENOMEM because
xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
to a user-triggerable OOM condition is a bad thing. Since we are not
using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
let's use the same flags at xfs_log_reserve().

  oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
  CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
  Call Trace:
   dump_stack+0x67/0x95
   warn_alloc+0xa9/0x140
   __alloc_pages_slowpath+0x9a8/0xbce
   __alloc_pages_nodemask+0x372/0x3b0
   alloc_slab_page+0x3a/0x8d0
   new_slab+0x330/0x420
   ___slab_alloc.constprop.94+0x879/0xb00
   __slab_alloc.isra.89.constprop.93+0x43/0x6f
   kmem_cache_alloc+0x331/0x390
   kmem_zone_alloc+0x9f/0x110 [xfs]
   kmem_zone_alloc+0x9f/0x110 [xfs]
   xlog_ticket_alloc+0x33/0xd0 [xfs]
   xfs_log_reserve+0xb4/0x410 [xfs]
   xfs_trans_reserve+0x1d1/0x2b0 [xfs]
   xfs_trans_alloc+0xc9/0x250 [xfs]
   xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
   xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
   xfs_vm_writepages+0x76/0xa0 [xfs]
   do_writepages+0x17/0x80
   __filemap_fdatawrite_range+0xc1/0xf0
   file_write_and_wait_range+0x53/0xa0
   xfs_file_fsync+0x87/0x290 [xfs]
   vfs_fsync_range+0x37/0x80
   do_fsync+0x38/0x60
   __x64_sys_fsync+0xf/0x20
   do_syscall_64+0x4a/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/xfs/xfs_log.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00e9f5c388d3..7fc3c1ad36bc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -429,10 +429,7 @@ xfs_log_reserve(
 
 	ASSERT(*ticp == NULL);
 	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
-				KM_SLEEP | KM_MAYFAIL);
-	if (!tic)
-		return -ENOMEM;
-
+				KM_SLEEP);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
-- 
2.16.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
@ 2019-08-01 10:56               ` Brian Foster
  2019-08-01 11:00                 ` Tetsuo Handa
  2019-08-01 18:50               ` Luis Chamberlain
  2019-08-01 20:46               ` Darrick J. Wong
  2 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-08-01 10:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs, Dave Chinner

On Thu, Aug 01, 2019 at 07:06:35PM +0900, Tetsuo Handa wrote:
> When the system is close to OOM, fsync() may fail with -ENOMEM because
> xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
> to a user-triggerable OOM condition is a bad thing. Since we are not
> using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
> let's use the same flags at xfs_log_reserve().
> 
>   oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
>   CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
>   Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
>   Call Trace:
>    dump_stack+0x67/0x95
>    warn_alloc+0xa9/0x140
>    __alloc_pages_slowpath+0x9a8/0xbce
>    __alloc_pages_nodemask+0x372/0x3b0
>    alloc_slab_page+0x3a/0x8d0
>    new_slab+0x330/0x420
>    ___slab_alloc.constprop.94+0x879/0xb00
>    __slab_alloc.isra.89.constprop.93+0x43/0x6f
>    kmem_cache_alloc+0x331/0x390
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    xlog_ticket_alloc+0x33/0xd0 [xfs]
>    xfs_log_reserve+0xb4/0x410 [xfs]
>    xfs_trans_reserve+0x1d1/0x2b0 [xfs]
>    xfs_trans_alloc+0xc9/0x250 [xfs]
>    xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
>    xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
>    xfs_vm_writepages+0x76/0xa0 [xfs]
>    do_writepages+0x17/0x80
>    __filemap_fdatawrite_range+0xc1/0xf0
>    file_write_and_wait_range+0x53/0xa0
>    xfs_file_fsync+0x87/0x290 [xfs]
>    vfs_fsync_range+0x37/0x80
>    do_fsync+0x38/0x60
>    __x64_sys_fsync+0xf/0x20
>    do_syscall_64+0x4a/0x1c0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---

I assume this survived your test scenario?

If so, the change looks fine to me. Thanks for the patch.

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00e9f5c388d3..7fc3c1ad36bc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -429,10 +429,7 @@ xfs_log_reserve(
>  
>  	ASSERT(*ticp == NULL);
>  	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
> -				KM_SLEEP | KM_MAYFAIL);
> -	if (!tic)
> -		return -ENOMEM;
> -
> +				KM_SLEEP);
>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> -- 
> 2.16.5
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 10:56               ` Brian Foster
@ 2019-08-01 11:00                 ` Tetsuo Handa
  0 siblings, 0 replies; 23+ messages in thread
From: Tetsuo Handa @ 2019-08-01 11:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

On 2019/08/01 19:56, Brian Foster wrote:
> I assume this survived your test scenario?

Yes.

> 
> If so, the change looks fine to me. Thanks for the patch.
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
  2019-08-01 10:56               ` Brian Foster
@ 2019-08-01 18:50               ` Luis Chamberlain
  2019-08-01 20:46                 ` Darrick J. Wong
  2019-08-01 21:13                 ` Tetsuo Handa
  2019-08-01 20:46               ` Darrick J. Wong
  2 siblings, 2 replies; 23+ messages in thread
From: Luis Chamberlain @ 2019-08-01 18:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Thu, Aug 01, 2019 at 07:06:35PM +0900, Tetsuo Handa wrote:
> When the system is close to OOM, fsync() may fail with -ENOMEM because
> xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
> to a user-triggerable OOM condition is a bad thing. Since we are not
> using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
> let's use the same flags at xfs_log_reserve().
> 
>   oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
>   CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
>   Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
>   Call Trace:
>    dump_stack+0x67/0x95
>    warn_alloc+0xa9/0x140
>    __alloc_pages_slowpath+0x9a8/0xbce
>    __alloc_pages_nodemask+0x372/0x3b0
>    alloc_slab_page+0x3a/0x8d0
>    new_slab+0x330/0x420
>    ___slab_alloc.constprop.94+0x879/0xb00
>    __slab_alloc.isra.89.constprop.93+0x43/0x6f
>    kmem_cache_alloc+0x331/0x390
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    xlog_ticket_alloc+0x33/0xd0 [xfs]
>    xfs_log_reserve+0xb4/0x410 [xfs]
>    xfs_trans_reserve+0x1d1/0x2b0 [xfs]
>    xfs_trans_alloc+0xc9/0x250 [xfs]
>    xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
>    xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
>    xfs_vm_writepages+0x76/0xa0 [xfs]
>    do_writepages+0x17/0x80
>    __filemap_fdatawrite_range+0xc1/0xf0
>    file_write_and_wait_range+0x53/0xa0
>    xfs_file_fsync+0x87/0x290 [xfs]
>    vfs_fsync_range+0x37/0x80
>    do_fsync+0x38/0x60
>    __x64_sys_fsync+0xf/0x20
>    do_syscall_64+0x4a/0x1c0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

That's quite an opaque commit log for what started off as a severe email
thread about a potential information leak. As such, can you expand on this
commit log considerably to explain the situation a bit better? Your
initial thread here provided much clearer evidence of the issue. As-is
this commit log tells the reader *nothing* about the potential harm in
not applying this patch.

You had mentioned you identified this issue in at least 4.18 through
5.3-rc1, so I'm inclined to consider this for stable for at least
v4.19.

However, what about older kernels? Now that you have identified a fix,
was the flag changed in prior commits? Is it a regression that perhaps
added KM_MAYFAIL at some point?

  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 18:50               ` Luis Chamberlain
@ 2019-08-01 20:46                 ` Darrick J. Wong
  2019-08-02 22:21                   ` Luis Chamberlain
  2019-08-01 21:13                 ` Tetsuo Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-08-01 20:46 UTC (permalink / raw)
  To: Luis Chamberlain; +Cc: Tetsuo Handa, linux-xfs, Dave Chinner, Brian Foster

On Thu, Aug 01, 2019 at 06:50:57PM +0000, Luis Chamberlain wrote:
> On Thu, Aug 01, 2019 at 07:06:35PM +0900, Tetsuo Handa wrote:
> > When the system is close to OOM, fsync() may fail with -ENOMEM because
> > xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
> > to a user-triggerable OOM condition is a bad thing. Since we are not
> > using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
> > let's use the same flags at xfs_log_reserve().
> > 
> >   oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
> >   CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
> >   Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
> >   Call Trace:
> >    dump_stack+0x67/0x95
> >    warn_alloc+0xa9/0x140
> >    __alloc_pages_slowpath+0x9a8/0xbce
> >    __alloc_pages_nodemask+0x372/0x3b0
> >    alloc_slab_page+0x3a/0x8d0
> >    new_slab+0x330/0x420
> >    ___slab_alloc.constprop.94+0x879/0xb00
> >    __slab_alloc.isra.89.constprop.93+0x43/0x6f
> >    kmem_cache_alloc+0x331/0x390
> >    kmem_zone_alloc+0x9f/0x110 [xfs]
> >    kmem_zone_alloc+0x9f/0x110 [xfs]
> >    xlog_ticket_alloc+0x33/0xd0 [xfs]
> >    xfs_log_reserve+0xb4/0x410 [xfs]
> >    xfs_trans_reserve+0x1d1/0x2b0 [xfs]
> >    xfs_trans_alloc+0xc9/0x250 [xfs]
> >    xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
> >    xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
> >    xfs_vm_writepages+0x76/0xa0 [xfs]
> >    do_writepages+0x17/0x80
> >    __filemap_fdatawrite_range+0xc1/0xf0
> >    file_write_and_wait_range+0x53/0xa0
> >    xfs_file_fsync+0x87/0x290 [xfs]
> >    vfs_fsync_range+0x37/0x80
> >    do_fsync+0x38/0x60
> >    __x64_sys_fsync+0xf/0x20
> >    do_syscall_64+0x4a/0x1c0
> >    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> That's quite an opaque commit log for what started off as a severe email
> thread about a potential information leak. As such, can you expand on this
> commit log considerably to explain the situation a bit better?

I'm pretty sure this didn't solve the underlying stale data exposure
problem, which might be why you think this is "opaque".  It fixes a bug
that causes data writeback failure (which was the exposure vector this
time) but I think the ultimate fix for the exposure problem is the two
patches I linked to quite a ways back in this discussion....

--D

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=bd012b434a56d9fac3cbc33062b8e2cd6e1ad0a0
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=adcf7c0c87191fd3616813c8ce9790f89a9a8eba


> Your
> initial thread here provided much clearer evidence of the issue. As-is
> this commit log tells the reader *nothing* about the potential harm in
> not applying this patch.
> 
> You had mentioned you identified this issue in at least 4.18 through
> 5.3-rc1, so I'm inclined to consider this for stable for at least
> v4.19.
> 
> However, what about older kernels? Now that you have identified a fix,
> was the flag changed in prior commits? Is it a regression that perhaps
> added KM_MAYFAIL at some point?
> 
>   Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
  2019-08-01 10:56               ` Brian Foster
  2019-08-01 18:50               ` Luis Chamberlain
@ 2019-08-01 20:46               ` Darrick J. Wong
  2 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-08-01 20:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Thu, Aug 01, 2019 at 07:06:35PM +0900, Tetsuo Handa wrote:
> When the system is close to OOM, fsync() may fail with -ENOMEM because
> xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
> to a user-triggerable OOM condition is a bad thing. Since we are not
> using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
> let's use the same flags at xfs_log_reserve().
> 
>   oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
>   CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
>   Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
>   Call Trace:
>    dump_stack+0x67/0x95
>    warn_alloc+0xa9/0x140
>    __alloc_pages_slowpath+0x9a8/0xbce
>    __alloc_pages_nodemask+0x372/0x3b0
>    alloc_slab_page+0x3a/0x8d0
>    new_slab+0x330/0x420
>    ___slab_alloc.constprop.94+0x879/0xb00
>    __slab_alloc.isra.89.constprop.93+0x43/0x6f
>    kmem_cache_alloc+0x331/0x390
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    kmem_zone_alloc+0x9f/0x110 [xfs]
>    xlog_ticket_alloc+0x33/0xd0 [xfs]
>    xfs_log_reserve+0xb4/0x410 [xfs]
>    xfs_trans_reserve+0x1d1/0x2b0 [xfs]
>    xfs_trans_alloc+0xc9/0x250 [xfs]
>    xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
>    xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
>    xfs_vm_writepages+0x76/0xa0 [xfs]
>    do_writepages+0x17/0x80
>    __filemap_fdatawrite_range+0xc1/0xf0
>    file_write_and_wait_range+0x53/0xa0
>    xfs_file_fsync+0x87/0x290 [xfs]
>    vfs_fsync_range+0x37/0x80
>    do_fsync+0x38/0x60
>    __x64_sys_fsync+0xf/0x20
>    do_syscall_64+0x4a/0x1c0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Looks ok...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_log.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00e9f5c388d3..7fc3c1ad36bc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -429,10 +429,7 @@ xfs_log_reserve(
>  
>  	ASSERT(*ticp == NULL);
>  	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
> -				KM_SLEEP | KM_MAYFAIL);
> -	if (!tic)
> -		return -ENOMEM;
> -
> +				KM_SLEEP);
>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> -- 
> 2.16.5
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 18:50               ` Luis Chamberlain
  2019-08-01 20:46                 ` Darrick J. Wong
@ 2019-08-01 21:13                 ` Tetsuo Handa
  2019-08-01 21:55                   ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2019-08-01 21:13 UTC (permalink / raw)
  To: Luis Chamberlain; +Cc: linux-xfs, Dave Chinner, Brian Foster

On 2019/08/02 3:50, Luis Chamberlain wrote:
> That's quite an opaque commit log for what started off as a severe email
> thread about a potential information leak. As such, can you expand on this
> commit log considerably to explain the situation a bit better? Your
> initial thread here provided much clearer evidence of the issue. As-is
> this commit log tells the reader *nothing* about the potential harm in
> not applying this patch.
> 
> You had mentioned you identified this issue in at least 4.18 through
> 5.3-rc1, so I'm inclined to consider this for stable for at least
> v4.19.
> 
> However, what about older kernels? Now that you have identified a fix,
> was the flag changed in prior commits? Is it a regression that perhaps
> added KM_MAYFAIL at some point?

I only checked 4.18+ so that RHEL8 can backport this patch. According to
Brian Foster, commit eb01c9cd87 ("[XFS] Remove the xlog_ticket allocator")
( https://git.kernel.org/linus/eb01c9cd87 ), which dates back to April 2008,
added the KM_MAYFAIL flag for this allocation

-	buf = (xfs_caddr_t) kmem_zalloc(PAGE_SIZE, KM_SLEEP);
+	tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL);

though Dave Chinner thinks that the log ticket rework is irrelevant.
Do we need to find which commit made this problem visible?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 21:13                 ` Tetsuo Handa
@ 2019-08-01 21:55                   ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2019-08-01 21:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Luis Chamberlain, linux-xfs, Brian Foster

On Fri, Aug 02, 2019 at 06:13:12AM +0900, Tetsuo Handa wrote:
> On 2019/08/02 3:50, Luis Chamberlain wrote:
> > That's quite an opaque commit log for what started off as a severe email
> > thread about a potential information leak. As such, can you expand on this
> > commit log considerably to explain the situation a bit better? Your
> > initial thread here provided much clearer evidence of the issue. As-is
> > this commit log tells the reader *nothing* about the potential harm in
> > not applying this patch.
> > 
> > You had mentioned you identified this issue in at least 4.18 through
> > 5.3-rc1, so I'm inclined to consider this for stable for at least
> > v4.19.
> > 
> > However, what about older kernels? Now that you have identified a fix,
> > was the flag changed in prior commits? Is it a regression that perhaps
> > added KM_MAYFAIL at some point?
> 
> I only checked 4.18+ so that RHEL8 can backport this patch. According to
> Brian Foster, commit eb01c9cd87 ("[XFS] Remove the xlog_ticket allocator")
> ( https://git.kernel.org/linus/eb01c9cd87 ), which dates back to April 2008,
> added the KM_MAYFAIL flag for this allocation
> 
> -	buf = (xfs_caddr_t) kmem_zalloc(PAGE_SIZE, KM_SLEEP);
> +	tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL);
> 
> though Dave Chinner thinks that the log ticket rework is irrelevant.
> Do we need to find which commit made this problem visible?

No.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-01 20:46                 ` Darrick J. Wong
@ 2019-08-02 22:21                   ` Luis Chamberlain
  2019-08-12 10:57                     ` Tetsuo Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2019-08-02 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Tetsuo Handa, linux-xfs, Dave Chinner, Brian Foster

On Thu, Aug 01, 2019 at 01:46:14PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 01, 2019 at 06:50:57PM +0000, Luis Chamberlain wrote:
> > On Thu, Aug 01, 2019 at 07:06:35PM +0900, Tetsuo Handa wrote:
> > > When the system is close to OOM, fsync() may fail with -ENOMEM because
> > > xfs_log_reserve() is using KM_MAYFAIL. Failing a writeback operation due
> > > to a user-triggerable OOM condition is a bad thing. Since we are not
> > > using KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(),
> > > let's use the same flags at xfs_log_reserve().
> > > 
> > >   oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
> > >   CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
> > >   Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
> > >   Call Trace:
> > >    dump_stack+0x67/0x95
> > >    warn_alloc+0xa9/0x140
> > >    __alloc_pages_slowpath+0x9a8/0xbce
> > >    __alloc_pages_nodemask+0x372/0x3b0
> > >    alloc_slab_page+0x3a/0x8d0
> > >    new_slab+0x330/0x420
> > >    ___slab_alloc.constprop.94+0x879/0xb00
> > >    __slab_alloc.isra.89.constprop.93+0x43/0x6f
> > >    kmem_cache_alloc+0x331/0x390
> > >    kmem_zone_alloc+0x9f/0x110 [xfs]
> > >    kmem_zone_alloc+0x9f/0x110 [xfs]
> > >    xlog_ticket_alloc+0x33/0xd0 [xfs]
> > >    xfs_log_reserve+0xb4/0x410 [xfs]
> > >    xfs_trans_reserve+0x1d1/0x2b0 [xfs]
> > >    xfs_trans_alloc+0xc9/0x250 [xfs]
> > >    xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
> > >    xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
> > >    xfs_vm_writepages+0x76/0xa0 [xfs]
> > >    do_writepages+0x17/0x80
> > >    __filemap_fdatawrite_range+0xc1/0xf0
> > >    file_write_and_wait_range+0x53/0xa0
> > >    xfs_file_fsync+0x87/0x290 [xfs]
> > >    vfs_fsync_range+0x37/0x80
> > >    do_fsync+0x38/0x60
> > >    __x64_sys_fsync+0xf/0x20
> > >    do_syscall_64+0x4a/0x1c0
> > >    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > 
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > 
> > That's quite an opaque commit log for what started off as a severe email
> > thread about a potential information leak. As such, can you expand on this
> > commit log considerably to explain the situation a bit better?
> 
> I'm pretty sure this didn't solve the underlying stale data exposure
> problem, which might be why you think this is "opaque".  It fixes a bug
> that causes data writeback failure (which was the exposure vector this
> time) but I think the ultimate fix for the exposure problem is the two
> patches I linked to quite a ways back in this discussion....
> 
> --D
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=bd012b434a56d9fac3cbc33062b8e2cd6e1ad0a0
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=adcf7c0c87191fd3616813c8ce9790f89a9a8eba

Got it, thanks! Even with this, I still think the current commit could
say a bit more about the effects of not having this patch applied.
What are the effects of, say, having the above two patches applied but
not the one being submitted now?

  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-02 22:21                   ` Luis Chamberlain
@ 2019-08-12 10:57                     ` Tetsuo Handa
  2019-08-12 19:55                       ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2019-08-12 10:57 UTC (permalink / raw)
  To: Luis Chamberlain, Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

On 2019/08/03 7:21, Luis Chamberlain wrote:
>> I'm pretty sure this didn't solve the underlying stale data exposure
>> problem, which might be why you think this is "opaque".  It fixes a bug
>> that causes data writeback failure (which was the exposure vector this
>> time) but I think the ultimate fix for the exposure problem is the two
>> patches I linked to quite a ways back in this discussion....
>>
>> --D
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=bd012b434a56d9fac3cbc33062b8e2cd6e1ad0a0
>> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=adcf7c0c87191fd3616813c8ce9790f89a9a8eba
> 
> Got it, thanks! Even with this, I still think the current commit could
> say a bit more about the effects of not having this patch applied.
> What are the effects of, say, having the above two patches applied but
> not the one being submitted now?

Is this patch going to be applied as-is? Or does someone have a plan to rewrite the changelog?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
  2019-08-12 10:57                     ` Tetsuo Handa
@ 2019-08-12 19:55                       ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-08-12 19:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Luis Chamberlain, linux-xfs, Dave Chinner, Brian Foster

On Mon, Aug 12, 2019 at 07:57:27PM +0900, Tetsuo Handa wrote:
> On 2019/08/03 7:21, Luis Chamberlain wrote:
> >> I'm pretty sure this didn't solve the underlying stale data exposure
> >> problem, which might be why you think this is "opaque".  It fixes a bug
> >> that causes data writeback failure (which was the exposure vector this
> >> time) but I think the ultimate fix for the exposure problem is the two
> >> patches I linked to quite a ways back in this discussion....
> >>
> >> --D
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=bd012b434a56d9fac3cbc33062b8e2cd6e1ad0a0
> >> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?id=adcf7c0c87191fd3616813c8ce9790f89a9a8eba
> > 
> > Got it, thanks! Even with this, I still think the current commit could
> > say a bit more about the effects of not having this patch applied.
> > What are the effects of, say, having the above two patches applied but
> > not the one being submitted now?
> 
> Is this patch going to be applied as-is? Or does someone have a plan to
> rewrite the changelog?

The first one, since the patch eliminates a vector to the writeback race
problem but does not itself solve the race.

--D

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-08-12 19:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-25 10:06 xfs: garbage file data inclusion bug under memory pressure Tetsuo Handa
2019-07-25 10:53 ` Brian Foster
2019-07-25 12:30   ` Tetsuo Handa
2019-07-25 16:00     ` Brian Foster
2019-07-25 11:32 ` Dave Chinner
2019-07-25 12:44   ` Tetsuo Handa
2019-07-25 17:28     ` Darrick J. Wong
2019-07-25 22:07     ` Dave Chinner
2019-07-29  3:50       ` Tetsuo Handa
2019-07-29 11:23         ` Brian Foster
2019-07-29 21:56           ` Dave Chinner
2019-07-30 11:30             ` Brian Foster
2019-08-01 10:06             ` [PATCH] fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve() Tetsuo Handa
2019-08-01 10:56               ` Brian Foster
2019-08-01 11:00                 ` Tetsuo Handa
2019-08-01 18:50               ` Luis Chamberlain
2019-08-01 20:46                 ` Darrick J. Wong
2019-08-02 22:21                   ` Luis Chamberlain
2019-08-12 10:57                     ` Tetsuo Handa
2019-08-12 19:55                       ` Darrick J. Wong
2019-08-01 21:13                 ` Tetsuo Handa
2019-08-01 21:55                   ` Dave Chinner
2019-08-01 20:46               ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.