All of lore.kernel.org
 help / color / mirror / Atom feed
* ext4 out of order when use cfq scheduler
@ 2015-12-22  6:24 HUANG Weller (CM/EPF1-CN)
  2015-12-22 15:00 ` Jan Kara
  2016-01-28  8:02 ` Xiong Zhou
  0 siblings, 2 replies; 30+ messages in thread
From: HUANG Weller (CM/EPF1-CN) @ 2015-12-22  6:24 UTC (permalink / raw)
  To: linux-ext4; +Cc: Li, Michael

Hi EXT4 developers,

I am doing the power failure on a storage device. We are arm embedded system.


Test environment:
Kernel: 3.10.63
Test type: power failure
File create type: O_SYNC flag enabled
Mount option: data=ordered, journal_checksum,barrier=1
IO scheduler: CFQ

Debug method: kernel print every block number of write request. print whether it is journal block.  print whether it is (REQ_FUA | REQ_META | write)
After system startup, backup journal blocks before mount the test partition.

Below is analysis which I think there is a out of order:
hp0000017aMhWY3i0vMv is the file which checksum is error. Only half of the file contents is correct.


[root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3
debugfs 1.42.9 (28-Dec-2013)
debugfs:  imap test/hp0000017aMhWY3i0vMv
Inode 390 is part of block group 0
        located at block 141, offset 0x0280

00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00  |............f...|
00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00  |f...............|
000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00  |................|
000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|  ==> the file contents is at 0x00012c00
000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00  |.........&......|


Search the block number from the journal blocks:

[root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|

Search file name which the file checksum is error in journal blocks:

[root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31  |........hp000001|
00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00  |7aMhWY3i0vMv....|


List all journal block record to check which journal block records it:

[root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00  |.;9.............|
00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24  |.;9........o...$|
00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75  |.;9........o...u|
0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00  |.;9........o....|
0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65  |.;9........p...e|
00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00  |.;9........p....|
00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c  |.;9........q...||
00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00  |.;9........q....|
00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82  |.;9........r....|
0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00  |.;9........r....|
0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block, the file name and the start block number are all recorded in the journals.
0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|


Back to see the kernel log which it print all the block numbers:

...
...
[   46.222671] 244109  75277 
[   46.222693] 
[   46.272438] 244352  75520 
[   46.272460] 
[   46.348417] 238443  69611 
[   46.348438] 
[   46.349811] 244480  75648 
[   46.352287] 
[   46.404904] 244609  75777 
[   46.404926] 
[   46.454698] 244738  75906 
[   46.454719] 
[   46.505439] 244992  76160 
[   46.505459] 
[   46.557783] 245120  76288 
[   46.557804] 
[   46.610075] 245249  76417 
[   46.610096] 
[   46.660196] 245378  76546 
[   46.660219] 
[   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91, block size=2048,  so, the offset address in the j.bin is 0x2d800
[   46.709928] J [   46.711233] 
[   46.740635] drop to 9v
[   46.749540] 201716  32884 
[   46.749560] J S 
[   46.751039] 
[   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same start block of the file which checksum is error.
[   46.755284] 
nanddisk idle -> 1.
[   46.800227] 6v irq-2


The j.bin offset 0x2d800

0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block
0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|


Normally, ext4 will first write the file contents, and then write the journal and journal commit. Then after some delay, it will write the meta data. 
So The journal blocks contains the meta data of the file which the file contents already been written before.
But from above analysis,  the journal sequence 0x73 already contain the file name(hp0000017aMhWY3i0vMv) and the start block number(76800). So from the kernel log,  the block number(76800) should be available before the journal blocks but NOT after it.
It seems that there is out of order happen.

Could you please help to check this issue ? or give a explanation about it ?
Many thanks.

Best regards

Weller HUANG



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2015-12-22  6:24 ext4 out of order when use cfq scheduler HUANG Weller (CM/EPF1-CN)
@ 2015-12-22 15:00 ` Jan Kara
       [not found]   ` <c67f356b63d94d35ad010a6e987b68f0@SGPMBX1004.APAC.bosch.com>
  2016-01-28  8:02 ` Xiong Zhou
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2015-12-22 15:00 UTC (permalink / raw)
  To: HUANG Weller (CM/EPF1-CN); +Cc: linux-ext4, Li, Michael

  Hello,

On Tue 22-12-15 06:24:29, HUANG Weller (CM/EPF1-CN) wrote:
> I am doing the power failure on a storage device. We are arm embedded
> system.
> 
> Test environment:
> Kernel: 3.10.63
> Test type: power failure
> File create type: O_SYNC flag enabled
> Mount option: data=ordered, journal_checksum,barrier=1
> IO scheduler: CFQ
> 
> Debug method: kernel print every block number of write request. print
> whether it is journal block.  print whether it is (REQ_FUA | REQ_META |
> write) After system startup, backup journal blocks before mount the test
> partition.

Do you know blktrace / blkparse tools? They would give you this information
without the need of patching the kernel...

> Below is analysis which I think there is a out of order:
> hp0000017aMhWY3i0vMv is the file which checksum is error. Only half of
> the file contents is correct.

Can you post the test program you are using? What is in the incorrect
block? Previous content of the file or some random garbage? If you are just
overwriting the file and see old contents, that could happen and is legal (we
just guarantee newly allocated blocks are first written out).

Is the problem reproducible?

								Honza

> [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3
> debugfs 1.42.9 (28-Dec-2013)
> debugfs:  imap test/hp0000017aMhWY3i0vMv
> Inode 390 is part of block group 0
>         located at block 141, offset 0x0280
> 
> 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00  |............f...|
> 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00  |f...............|
> 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00  |................|
> 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|  ==> the file contents is at 0x00012c00
> 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00  |.........&......|
> 
> 
> Search the block number from the journal blocks:
> 
> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|
> 
> Search file name which the file checksum is error in journal blocks:
> 
> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31  |........hp000001|
> 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00  |7aMhWY3i0vMv....|
> 
> 
> List all journal block record to check which journal block records it:
> 
> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00  |.;9.............|
> 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24  |.;9........o...$|
> 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75  |.;9........o...u|
> 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00  |.;9........o....|
> 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65  |.;9........p...e|
> 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00  |.;9........p....|
> 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c  |.;9........q...||
> 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00  |.;9........q....|
> 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82  |.;9........r....|
> 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00  |.;9........r....|
> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block, the file name and the start block number are all recorded in the journals.
> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|
> 
> 
> Back to see the kernel log which it print all the block numbers:
> 
> ...
> ...
> [   46.222671] 244109  75277 
> [   46.222693] 
> [   46.272438] 244352  75520 
> [   46.272460] 
> [   46.348417] 238443  69611 
> [   46.348438] 
> [   46.349811] 244480  75648 
> [   46.352287] 
> [   46.404904] 244609  75777 
> [   46.404926] 
> [   46.454698] 244738  75906 
> [   46.454719] 
> [   46.505439] 244992  76160 
> [   46.505459] 
> [   46.557783] 245120  76288 
> [   46.557804] 
> [   46.610075] 245249  76417 
> [   46.610096] 
> [   46.660196] 245378  76546 
> [   46.660219] 
> [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91, block size=2048,  so, the offset address in the j.bin is 0x2d800
> [   46.709928] J [   46.711233] 
> [   46.740635] drop to 9v
> [   46.749540] 201716  32884 
> [   46.749560] J S 
> [   46.751039] 
> [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same start block of the file which checksum is error.
> [   46.755284] 
> nanddisk idle -> 1.
> [   46.800227] 6v irq-2
> 
> 
> The j.bin offset 0x2d800
> 
> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block
> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|
> 
> 
> Normally, ext4 will first write the file contents, and then write the
> journal and journal commit. Then after some delay, it will write the meta
> data.  So The journal blocks contains the meta data of the file which the
> file contents already been written before.  But from above analysis,  the
> journal sequence 0x73 already contain the file name(hp0000017aMhWY3i0vMv)
> and the start block number(76800). So from the kernel log,  the block
> number(76800) should be available before the journal blocks but NOT after
> it.  It seems that there is out of order happen.
> 
> Could you please help to check this issue ? or give a explanation about
> it ?  Many thanks.
> 
> Best regards
> 
> Weller HUANG
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
       [not found]   ` <c67f356b63d94d35ad010a6e987b68f0@SGPMBX1004.APAC.bosch.com>
@ 2016-01-05 15:30     ` Jan Kara
  2016-01-06  2:39       ` HUANG Weller (CM/ESW12-CN)
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-01-05 15:30 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4

On Mon 04-01-16 05:25:15, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz]
> > Sent: Tuesday, December 22, 2015 11:01 PM
> > To: HUANG Weller (CM/EPF1-CN) <Weller.Huang@cn.bosch.com>
> > Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> > Subject: Re: ext4 out of order when use cfq scheduler
> > 
> >   Hello,
> > 
> > On Tue 22-12-15 06:24:29, HUANG Weller (CM/EPF1-CN) wrote:
> > > I am doing the power failure on a storage device. We are arm 
> > > embedded system.
> > >
> > > Test environment:
> > > Kernel: 3.10.63
> > > Test type: power failure
> > > File create type: O_SYNC flag enabled Mount option: data=ordered, 
> > > journal_checksum,barrier=1 IO scheduler:
> > > CFQ
> > >
> > > Debug method: kernel print every block number of write request. 
> > > print whether it is journal block.  print whether it is (REQ_FUA | 
> > > REQ_META
> > > |
> > > write) After system startup, backup journal blocks before mount the 
> > > test partition.
> > 
> > Do you know blktrace / blkparse tools? They would give you this 
> > information without the need of patching the kernel...
> 
> Yes, I know that tool. Normally, I use that tool, but in this case, I patch kernel because of following reasons:
> 1. this is power failure test, blktrace might not catch some log near to the power failure. 
> 2. I add some calculation when print these block numbers in the kernel patch. So I can directly use the value print from kernel.
> blktrace give me lots of logs and I have to do block number conversion one by one.

OK, fair enough.

> > > Below is analysis which I think there is a out of order:
> > > hp0000017aMhWY3i0vMv is the file which checksum is error. Only half 
> > > of the file contents is correct.
> > 
> > Can you post the test program you are using? What is in the incorrect block?
> > Previous content of the file or some random garbage? If you are just 
> > overwriting the file and see old contents, that could happen and is 
> > legal (we just guarantee newly allocated blocks are first written out).
> > 
> > Is the problem reproducible?
> 
> The test tool(attached in this mail)  not rewrite a file but create many
> files after delete the old test files in every  new test round.  The test
> tool create file in this way: filename format: hp[HEX NUMBER][RANDOM
> STRING].  For example: hp0000017aMhWY3i0vMv, hex number: 0000017a  random
> string: MhWY3i0vMv
> 
> The file contents has some relationship to the file name.  every 512
> bytes has its own checksum and sequence number.  Also contain the file
> name and the random string. So we can easily trace the file contents(see
> the attached picture).

I see.

> >What is in the incorrect block?
> 
> Since I clean the test partition with /dev/zero, so what I saw is that,
> half of the file is ok. But the rest are 0. !!!! But another information
> is that:!!!!  hp0000017aMhWY3i0vMv is not the last file.  from the number
> in the file name , we can know there still 3 files created after it and
> these files are all checksum OK.  Except the last file is zero length.
> 
> This issue is very easy to reproducible. Even we test with emmc, we also
> see this issue.  If you need I to do anymore test, just let me know.

So you are running in 'ws' mode of your tool, am I right? Just looking into
the sources you've sent me I've noticed that although you set O_SYNC in
openflg when mode == MODE_WS, you do not use openflg at all. So file won't
be synced at all. That would well explain why you see that not all file
contents is written. So did you just send me a different version of the
source or is your test program really buggy?

								Honza

> > > [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3 debugfs 1.42.9
> > > (28-Dec-2013)
> > > debugfs:  imap test/hp0000017aMhWY3i0vMv Inode 390 is part of block 
> > > group 0
> > >         located at block 141, offset 0x0280
> > >
> > > 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00
> > > |............f...|
> > > 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00
> > > |f...............|
> > > 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00
> > > |................|
> > > 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > > |.............,..|  ==> the file contents is at 0x00012c00
> > > 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00
> > > |.........&......|
> > >
> > >
> > > Search the block number from the journal blocks:
> > >
> > > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> > > 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > > |.............,..|
> > >
> > > Search file name which the file checksum is error in journal blocks:
> > >
> > > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> > > 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31
> > > |........hp000001|
> > > 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00
> > > |7aMhWY3i0vMv....|
> > >
> > >
> > > List all journal block record to check which journal block records it:
> > >
> > > [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> > > 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00
> > > |.;9.............|
> > > 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24
> > > |.;9........o...$|
> > > 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75
> > > |.;9........o...u|
> > > 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00
> > > |.;9........o....|
> > > 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65
> > > |.;9........p...e|
> > > 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00
> > > |.;9........p....|
> > > 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c
> > > |.;9........q...||
> > > 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00
> > > |.;9........q....|
> > > 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82
> > > |.;9........r....|
> > > 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00
> > > |.;9........r....|
> > > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  
> > > |.;9........s....|
> > ==>00039ab0 is in last block, the file name and the start block number 
> > are all recorded in the journals.
> > > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > > |.;9........s....|
> > >
> > >
> > > Back to see the kernel log which it print all the block numbers:
> > >
> > > ...
> > > ...
> > > [   46.222671] 244109  75277
> > > [   46.222693]
> > > [   46.272438] 244352  75520
> > > [   46.272460]
> > > [   46.348417] 238443  69611
> > > [   46.348438]
> > > [   46.349811] 244480  75648
> > > [   46.352287]
> > > [   46.404904] 244609  75777
> > > [   46.404926]
> > > [   46.454698] 244738  75906
> > > [   46.454719]
> > > [   46.505439] 244992  76160
> > > [   46.505459]
> > > [   46.557783] 245120  76288
> > > [   46.557804]
> > > [   46.610075] 245249  76417
> > > [   46.610096]
> > > [   46.660196] 245378  76546
> > > [   46.660219]
> > > [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91,
> > block size=2048,  so, the offset address in the j.bin is 0x2d800
> > > [   46.709928] J [   46.711233]
> > > [   46.740635] drop to 9v
> > > [   46.749540] 201716  32884
> > > [   46.749560] J S
> > > [   46.751039]
> > > [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same start
> > block of the file which checksum is error.
> > > [   46.755284]
> > > nanddisk idle -> 1.
> > > [   46.800227] 6v irq-2
> > >
> > >
> > > The j.bin offset 0x2d800
> > >
> > > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> > > |.;9........s....|  ==>00039ab0 is in last block
> > > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > > |.;9........s....|
> > >
> > >
> > > Normally, ext4 will first write the file contents, and then write 
> > > the journal and journal commit. Then after some delay, it will write 
> > > the meta data.  So The journal blocks contains the meta data of the 
> > > file which the file contents already been written before.  But from 
> > > above analysis,  the journal sequence 0x73 already contain the file
> > > name(hp0000017aMhWY3i0vMv) and the start block number(76800). So 
> > > from the kernel log,  the block
> > > number(76800) should be available before the journal blocks but NOT 
> > > after it.  It seems that there is out of order happen.
> > >
> > > Could you please help to check this issue ? or give a explanation 
> > > about it ?  Many thanks.
> > >
> > > Best regards
> > >
> > > Weller HUANG
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4"
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR



-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-05 15:30     ` Jan Kara
@ 2016-01-06  2:39       ` HUANG Weller (CM/ESW12-CN)
  2016-01-06 19:17         ` Andreas Dilger
       [not found]         ` <20160106100621.GA24046@quack.suse.cz>
  0 siblings, 2 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-06  2:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

> So you are running in 'ws' mode of your tool, am I right? Just looking into the
> sources you've sent me I've noticed that although you set O_SYNC in openflg
> when mode == MODE_WS, you do not use openflg at all. So file won't be synced
> at all. That would well explain why you see that not all file contents is written. So
> did you just send me a different version of the source or is your test program
> really buggy?
> 

Yes, it is a bug of the test code. So the test tool create files without O_SYNC flag actually.
But , even in this case, is the out of order acceptable ? or is it normal ?

								
> 
> > > > [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3 debugfs 1.42.9
> > > > (28-Dec-2013)
> > > > debugfs:  imap test/hp0000017aMhWY3i0vMv Inode 390 is part of
> > > > block group 0
> > > >         located at block 141, offset 0x0280
> > > >
> > > > 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00
> > > > |............f...|
> > > > 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00
> > > > |f...............|
> > > > 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00
> > > > |................|
> > > > 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > > > |.............,..|  ==> the file contents is at 0x00012c00
> > > > 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00
> > > > |.........&......|
> > > >
> > > >
> > > > Search the block number from the journal blocks:
> > > >
> > > > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> > > > 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > > > |.............,..|
> > > >
> > > > Search file name which the file checksum is error in journal blocks:
> > > >
> > > > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> > > > 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31
> > > > |........hp000001|
> > > > 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00
> > > > |7aMhWY3i0vMv....|
> > > >
> > > >
> > > > List all journal block record to check which journal block records it:
> > > >
> > > > [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> > > > 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00
> > > > |.;9.............|
> > > > 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24
> > > > |.;9........o...$|
> > > > 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75
> > > > |.;9........o...u|
> > > > 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00
> > > > |.;9........o....|
> > > > 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65
> > > > |.;9........p...e|
> > > > 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00
> > > > |.;9........p....|
> > > > 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c
> > > > |.;9........q...||
> > > > 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00
> > > > |.;9........q....|
> > > > 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82
> > > > |.;9........r....|
> > > > 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00
> > > > |.;9........r....|
> > > > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> > > > |.;9........s....|
> > > ==>00039ab0 is in last block, the file name and the start block
> > > number are all recorded in the journals.
> > > > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > > > |.;9........s....|
> > > >
> > > >
> > > > Back to see the kernel log which it print all the block numbers:
> > > >
> > > > ...
> > > > ...
> > > > [   46.222671] 244109  75277
> > > > [   46.222693]
> > > > [   46.272438] 244352  75520
> > > > [   46.272460]
> > > > [   46.348417] 238443  69611
> > > > [   46.348438]
> > > > [   46.349811] 244480  75648
> > > > [   46.352287]
> > > > [   46.404904] 244609  75777
> > > > [   46.404926]
> > > > [   46.454698] 244738  75906
> > > > [   46.454719]
> > > > [   46.505439] 244992  76160
> > > > [   46.505459]
> > > > [   46.557783] 245120  76288
> > > > [   46.557804]
> > > > [   46.610075] 245249  76417
> > > > [   46.610096]
> > > > [   46.660196] 245378  76546
> > > > [   46.660219]
> > > > [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91,
> > > block size=2048,  so, the offset address in the j.bin is 0x2d800
> > > > [   46.709928] J [   46.711233]
> > > > [   46.740635] drop to 9v
> > > > [   46.749540] 201716  32884
> > > > [   46.749560] J S
> > > > [   46.751039]
> > > > [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same
> start
> > > block of the file which checksum is error.
> > > > [   46.755284]
> > > > nanddisk idle -> 1.
> > > > [   46.800227] 6v irq-2
> > > >
> > > >
> > > > The j.bin offset 0x2d800
> > > >
> > > > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> > > > |.;9........s....|  ==>00039ab0 is in last block
> > > > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > > > |.;9........s....|
> > > >
> > > >
> > > > Normally, ext4 will first write the file contents, and then write
> > > > the journal and journal commit. Then after some delay, it will
> > > > write the meta data.  So The journal blocks contains the meta data
> > > > of the file which the file contents already been written before.
> > > > But from above analysis,  the journal sequence 0x73 already
> > > > contain the file
> > > > name(hp0000017aMhWY3i0vMv) and the start block number(76800). So
> > > > from the kernel log,  the block
> > > > number(76800) should be available before the journal blocks but
> > > > NOT after it.  It seems that there is out of order happen.
> > > >
> > > > Could you please help to check this issue ? or give a explanation
> > > > about it ?  Many thanks.
> > > >
> > > > Best regards
> > > >
> > > > Weller HUANG
> > > >
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> 
> 
> 
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-01-06  2:39       ` HUANG Weller (CM/ESW12-CN)
@ 2016-01-06 19:17         ` Andreas Dilger
  2016-01-07  6:51           ` HUANG Weller (CM/ESW12-CN)
       [not found]         ` <20160106100621.GA24046@quack.suse.cz>
  1 sibling, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2016-01-06 19:17 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 6540 bytes --]


> On Jan 5, 2016, at 7:39 PM, HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com> wrote:
> 
>> So you are running in 'ws' mode of your tool, am I right? Just looking into the
>> sources you've sent me I've noticed that although you set O_SYNC in openflg
>> when mode == MODE_WS, you do not use openflg at all. So file won't be synced
>> at all. That would well explain why you see that not all file contents is written. So
>> did you just send me a different version of the source or is your test program
>> really buggy?
>> 
> 
> Yes, it is a bug of the test code. So the test tool create files without O_SYNC flag actually.
> But , even in this case, is the out of order acceptable ? or is it normal ?

Without O_SYNC there is no ordering guarantee between non-overlapping writes.  They may be reordered by the filesystem or the elevator or the storage device.

Cheers, Andreas

> 
>> 
>>>>> [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3 debugfs 1.42.9
>>>>> (28-Dec-2013)
>>>>> debugfs:  imap test/hp0000017aMhWY3i0vMv Inode 390 is part of
>>>>> block group 0
>>>>>        located at block 141, offset 0x0280
>>>>> 
>>>>> 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00
>>>>> |............f...|
>>>>> 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00
>>>>> |f...............|
>>>>> 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00
>>>>> |................|
>>>>> 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
>>>>> |.............,..|  ==> the file contents is at 0x00012c00
>>>>> 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00
>>>>> |.........&......|
>>>>> 
>>>>> 
>>>>> Search the block number from the journal blocks:
>>>>> 
>>>>> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
>>>>> 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
>>>>> |.............,..|
>>>>> 
>>>>> Search file name which the file checksum is error in journal blocks:
>>>>> 
>>>>> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
>>>>> 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31
>>>>> |........hp000001|
>>>>> 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00
>>>>> |7aMhWY3i0vMv....|
>>>>> 
>>>>> 
>>>>> List all journal block record to check which journal block records it:
>>>>> 
>>>>> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
>>>>> 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00
>>>>> |.;9.............|
>>>>> 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24
>>>>> |.;9........o...$|
>>>>> 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75
>>>>> |.;9........o...u|
>>>>> 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00
>>>>> |.;9........o....|
>>>>> 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65
>>>>> |.;9........p...e|
>>>>> 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00
>>>>> |.;9........p....|
>>>>> 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c
>>>>> |.;9........q...||
>>>>> 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00
>>>>> |.;9........q....|
>>>>> 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82
>>>>> |.;9........r....|
>>>>> 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00
>>>>> |.;9........r....|
>>>>> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
>>>>> |.;9........s....|
>>>> ==>00039ab0 is in last block, the file name and the start block
>>>> number are all recorded in the journals.
>>>>> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
>>>>> |.;9........s....|
>>>>> 
>>>>> 
>>>>> Back to see the kernel log which it print all the block numbers:
>>>>> 
>>>>> ...
>>>>> ...
>>>>> [   46.222671] 244109  75277
>>>>> [   46.222693]
>>>>> [   46.272438] 244352  75520
>>>>> [   46.272460]
>>>>> [   46.348417] 238443  69611
>>>>> [   46.348438]
>>>>> [   46.349811] 244480  75648
>>>>> [   46.352287]
>>>>> [   46.404904] 244609  75777
>>>>> [   46.404926]
>>>>> [   46.454698] 244738  75906
>>>>> [   46.454719]
>>>>> [   46.505439] 244992  76160
>>>>> [   46.505459]
>>>>> [   46.557783] 245120  76288
>>>>> [   46.557804]
>>>>> [   46.610075] 245249  76417
>>>>> [   46.610096]
>>>>> [   46.660196] 245378  76546
>>>>> [   46.660219]
>>>>> [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91,
>>>> block size=2048,  so, the offset address in the j.bin is 0x2d800
>>>>> [   46.709928] J [   46.711233]
>>>>> [   46.740635] drop to 9v
>>>>> [   46.749540] 201716  32884
>>>>> [   46.749560] J S
>>>>> [   46.751039]
>>>>> [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same
>> start
>>>> block of the file which checksum is error.
>>>>> [   46.755284]
>>>>> nanddisk idle -> 1.
>>>>> [   46.800227] 6v irq-2
>>>>> 
>>>>> 
>>>>> The j.bin offset 0x2d800
>>>>> 
>>>>> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
>>>>> |.;9........s....|  ==>00039ab0 is in last block
>>>>> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
>>>>> |.;9........s....|
>>>>> 
>>>>> 
>>>>> Normally, ext4 will first write the file contents, and then write
>>>>> the journal and journal commit. Then after some delay, it will
>>>>> write the meta data.  So The journal blocks contains the meta data
>>>>> of the file which the file contents already been written before.
>>>>> But from above analysis,  the journal sequence 0x73 already
>>>>> contain the file
>>>>> name(hp0000017aMhWY3i0vMv) and the start block number(76800). So
>>>>> from the kernel log,  the block
>>>>> number(76800) should be available before the journal blocks but
>>>>> NOT after it.  It seems that there is out of order happen.
>>>>> 
>>>>> Could you please help to check this issue ? or give a explanation
>>>>> about it ?  Many thanks.
>>>>> 
>>>>> Best regards
>>>>> 
>>>>> Weller HUANG
>>>>> 
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> Jan Kara <jack@suse.com>
>>>> SUSE Labs, CR
>> 
>> 
>> 
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-06 19:17         ` Andreas Dilger
@ 2016-01-07  6:51           ` HUANG Weller (CM/ESW12-CN)
  0 siblings, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-07  6:51 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jan Kara, linux-ext4



> -----Original Message-----
> From: Andreas Dilger [mailto:adilger@dilger.ca]
> Sent: Thursday, January 07, 2016 3:18 AM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> 
> > On Jan 5, 2016, at 7:39 PM, HUANG Weller (CM/ESW12-CN)
> <Weller.Huang@cn.bosch.com> wrote:
> >
> >> So you are running in 'ws' mode of your tool, am I right? Just
> >> looking into the sources you've sent me I've noticed that although
> >> you set O_SYNC in openflg when mode == MODE_WS, you do not use
> >> openflg at all. So file won't be synced at all. That would well
> >> explain why you see that not all file contents is written. So did you
> >> just send me a different version of the source or is your test program really
> buggy?
> >>
> >
> > Yes, it is a bug of the test code. So the test tool create files without O_SYNC
> flag actually.
> > But , even in this case, is the out of order acceptable ? or is it normal ?
> 
> Without O_SYNC there is no ordering guarantee between non-overlapping writes.
> They may be reordered by the filesystem or the elevator or the storage device.
> 

The out of sequence is happen before sending the block request to the storage device FW. I printed the log from the device block driver.
I remember, 2 years ago, I test on freescale platform with kernel version 3.8 and with same test tools. I never see such issue.
The journal blocks already contain the inode information of a file. But the file contents were wrote after the journal commit. The problem is that the length of the file described in the inode is not zero before the real data was written.
I think it is a problem. Do you agree with me ?


> 
> >
> >>
> >>>>> [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3 debugfs 1.42.9
> >>>>> (28-Dec-2013)
> >>>>> debugfs:  imap test/hp0000017aMhWY3i0vMv Inode 390 is part of
> >>>>> block group 0
> >>>>>        located at block 141, offset 0x0280
> >>>>>
> >>>>> 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00
> >>>>> |............f...|
> >>>>> 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00
> >>>>> |f...............|
> >>>>> 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00
> >>>>> |................|
> >>>>> 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> >>>>> |.............,..|  ==> the file contents is at 0x00012c00
> >>>>> 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00
> >>>>> |.........&......|
> >>>>>
> >>>>>
> >>>>> Search the block number from the journal blocks:
> >>>>>
> >>>>> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> >>>>> 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> >>>>> |.............,..|
> >>>>>
> >>>>> Search file name which the file checksum is error in journal blocks:
> >>>>>
> >>>>> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> >>>>> 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31
> >>>>> |........hp000001|
> >>>>> 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00
> >>>>> |7aMhWY3i0vMv....|
> >>>>>
> >>>>>
> >>>>> List all journal block record to check which journal block records it:
> >>>>>
> >>>>> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> >>>>> 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00
> >>>>> |.;9.............|
> >>>>> 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24
> >>>>> |.;9........o...$|
> >>>>> 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75
> >>>>> |.;9........o...u|
> >>>>> 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00
> >>>>> |.;9........o....|
> >>>>> 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65
> >>>>> |.;9........p...e|
> >>>>> 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00
> >>>>> |.;9........p....|
> >>>>> 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c
> >>>>> |.;9........q...||
> >>>>> 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00
> >>>>> |.;9........q....|
> >>>>> 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82
> >>>>> |.;9........r....|
> >>>>> 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00
> >>>>> |.;9........r....|
> >>>>> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> >>>>> |.;9........s....|
> >>>> ==>00039ab0 is in last block, the file name and the start block
> >>>> number are all recorded in the journals.
> >>>>> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> >>>>> |.;9........s....|
> >>>>>
> >>>>>
> >>>>> Back to see the kernel log which it print all the block numbers:
> >>>>>
> >>>>> ...
> >>>>> ...
> >>>>> [   46.222671] 244109  75277
> >>>>> [   46.222693]
> >>>>> [   46.272438] 244352  75520
> >>>>> [   46.272460]
> >>>>> [   46.348417] 238443  69611
> >>>>> [   46.348438]
> >>>>> [   46.349811] 244480  75648
> >>>>> [   46.352287]
> >>>>> [   46.404904] 244609  75777
> >>>>> [   46.404926]
> >>>>> [   46.454698] 244738  75906
> >>>>> [   46.454719]
> >>>>> [   46.505439] 244992  76160
> >>>>> [   46.505459]
> >>>>> [   46.557783] 245120  76288
> >>>>> [   46.557804]
> >>>>> [   46.610075] 245249  76417
> >>>>> [   46.610096]
> >>>>> [   46.660196] 245378  76546
> >>>>> [   46.660219]
> >>>>> [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91,
> >>>> block size=2048,  so, the offset address in the j.bin is 0x2d800
> >>>>> [   46.709928] J [   46.711233]
> >>>>> [   46.740635] drop to 9v
> >>>>> [   46.749540] 201716  32884
> >>>>> [   46.749560] J S
> >>>>> [   46.751039]
> >>>>> [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same
> >> start
> >>>> block of the file which checksum is error.
> >>>>> [   46.755284]
> >>>>> nanddisk idle -> 1.
> >>>>> [   46.800227] 6v irq-2
> >>>>>
> >>>>>
> >>>>> The j.bin offset 0x2d800
> >>>>>
> >>>>> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> >>>>> |.;9........s....|  ==>00039ab0 is in last block
> >>>>> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> >>>>> |.;9........s....|
> >>>>>
> >>>>>
> >>>>> Normally, ext4 will first write the file contents, and then write
> >>>>> the journal and journal commit. Then after some delay, it will
> >>>>> write the meta data.  So The journal blocks contains the meta data
> >>>>> of the file which the file contents already been written before.
> >>>>> But from above analysis,  the journal sequence 0x73 already
> >>>>> contain the file
> >>>>> name(hp0000017aMhWY3i0vMv) and the start block number(76800). So
> >>>>> from the kernel log,  the block
> >>>>> number(76800) should be available before the journal blocks but
> >>>>> NOT after it.  It seems that there is out of order happen.
> >>>>>
> >>>>> Could you please help to check this issue ? or give a explanation
> >>>>> about it ?  Many thanks.
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Weller HUANG
> >>>>>
> >>>>>
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4"
> >>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>> --
> >>>> Jan Kara <jack@suse.com>
> >>>> SUSE Labs, CR
> >>
> >>
> >>
> >> --
> >> Jan Kara <jack@suse.com>
> >> SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> Cheers, Andreas
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
       [not found]           ` <3ab48fa47e434455b101251730e69bd2@SGPMBX1004.APAC.bosch.com>
@ 2016-01-07 10:24             ` Jan Kara
  2016-01-07 11:02               ` HUANG Weller (CM/ESW12-CN)
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-01-07 10:24 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4

On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz]
> > Sent: Wednesday, January 06, 2016 6:06 PM
> > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > Subject: Re: ext4 out of order when use cfq scheduler
> > 
> > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > looking into the sources you've sent me I've noticed that although
> > > > you set O_SYNC in openflg when mode == MODE_WS, you do not use
> > > > openflg at all. So file won't be synced at all. That would well
> > > > explain why you see that not all file contents is written. So did
> > > > you just send me a different version of the source or is your test program
> > really buggy?
> > > >
> > >
> > > Yes, it is a bug of the test code. So the test tool create files
> > > without O_SYNC flag actually.  But , even in this case, is the out of
> > > order acceptable ? or is it normal ?
> > 
> > Without fsync(2) or O_SYNC, it is perfectly possible that some files are written and
> > others are not since nobody guarantees order of writeback of inodes. OTOH you
> > shouldn't ever see uninitialized data in the inode (but so far it isn't clear to me
> > whether you really see unitialized data or whether we really wrote zeros to those
> > blocks - ext4 can sometimes decide to do so).  Your traces and disk contents
> > show that the problematic inode has extent of length 128 blocks starting at block
> > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.  What is the
> > block size of the filesystem?  Because inode size is only 0x40010.
> > 
> > Some suggestions to try:
> > 1) Print also length of a write request in addition to the starting
> > block so that we can see how much actually got written
> 
> Please see below failure analysis.
> 
> > 2) Initialize the device to 0xff so that we can distinguish
> > uninitialized blocks from zeroed-out blocks.
> 
> Yes, i Initialize the device to 0xff this time.
> 
> > 3) Report exactly for which 512-byte blocks checksum matches and for
> > which it is wrong.
> The wrong contents are old file contents which are created in previous
> test round.  It is caused by the "wrong" sequence inode data(in journal)
> and  the file contents. So the file contents are not updated.

So this confuses me somewhat. You previously said that you always remove
files after each test round and then new ones are created. Is it still the
case? So the old file contents you speak about above is just some random
contents that happened to be in disk blocks we freshly allocated to the
file, am I right?

> > 4) Try newer kernel if possible.
> > 
> 
> Please see my analysis with print length of a write request

Thanks for the detailed analysis!

> Ext4 information:
> ----------------------------
> [root@SiRFatlas6 ~]# dumpe2fs /dev/nandblk0p3
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
> Filesystem flags:         unsigned_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              42816
> Block count:              85632
> Reserved block count:     4281
> Free blocks:              41789
> Free inodes:              42519
> First block:              0
> Block size:               2048
> Fragment size:            2048
> Reserved GDT blocks:      83
> Blocks per group:         16384
> Fragments per group:      16384
> Inodes per group:         7136
> Inode blocks per group:   446
> Flex block group size:    16
> Filesystem created:       Thu Jan  1 00:42:42 1970
> Last mount time:          Thu Jan  1 00:30:58 1970
> Last write time:          Thu Jan  1 00:30:58 1970
> Mount count:              21
> Maximum mount count:      -1
> Last checked:             Thu Jan  1 00:42:42 1970
> Check interval:           0 (<none>)
> Lifetime writes:          10 MB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               128
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed:      e9c1cdf4-bc4c-40cb-930f-5f3923e68a33
> Journal backup:           inode blocks
> Journal features:         journal_incompat_revoke
> Journal size:             8M
> Journal length:           4096
> Journal sequence:         0x0000005b
> Journal start:            1
> 
> Last four files are contents error:
> Failure files:
> -rw-------    1 root     root        262160 Jan  1 00:44 /mnt/test/hp0000011aHHlIpPbjK6
> -rw-------    1 root     root        262160 Jan  1 00:44 /mnt/test/hp0000011bjwWXnrY2jn
> -rw-------    1 root     root        262160 Jan  1 00:44 /mnt/test/hp0000011c7HYSBFSTAk
> -rw-------    1 root     root        135168 Jan  1 00:44 /mnt/test/hp0000011dliQ04nIiLB
> 
> Analysis /mnt/test/hp0000011aHHlIpPbjK6
> 
> debugfs:  imap test/hp0000011aHHlIpPbjK6
> Inode 294 is part of block group 0
>         located at block 135, offset 0x0280
> 
> 00000280  80 81 00 00 10 00 04 00  aa 07 00 00 72 0a 00 00  
> 00000290  72 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00  
> 000002a0  00 00 08 00 01 00 00 00  0a f3 01 00 04 00 00 00  
> 000002b0  00 00 00 00 00 00 00 00  81 00 00 00 00 dd 00 00 ==> 0xdd00
> 
> Find the block number of the file contents: j.bin is the journal blocks which I backup before journal replay.
> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 dd 00 00"
> 0002eab0  00 00 00 00 00 00 00 00  81 00 00 00 00 dd 00 00 
> 
> Check the block number in which journal commit,  the commit ID is 0x58.
>  [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00  |.;9.............|
> 00000800  c0 3b 39 98 00 00 00 05  00 00 00 55 00 00 00 30  |.;9........U...0|
> 00001000  c0 3b 39 98 00 00 00 01  00 00 00 55 00 00 00 75  |.;9........U...u|
> 0000f800  c0 3b 39 98 00 00 00 02  00 00 00 55 00 00 00 00  |.;9........U....|
> 00010000  c0 3b 39 98 00 00 00 01  00 00 00 56 00 00 00 65  |.;9........V...e|
> 00019000  c0 3b 39 98 00 00 00 02  00 00 00 56 00 00 00 00  |.;9........V....|
> 00019800  c0 3b 39 98 00 00 00 01  00 00 00 57 00 00 00 7c  |.;9........W...||
> 00024000  c0 3b 39 98 00 00 00 02  00 00 00 57 00 00 00 00  |.;9........W....|
> 00024800  c0 3b 39 98 00 00 00 01  00 00 00 58 00 00 00 82  |.;9........X....|
> 0002f000  c0 3b 39 98 00 00 00 02  00 00 00 58 00 00 00 00  |.;9........X....|
> 00039000  c0 3b 39 98 00 00 00 02  00 00 00 1a 00 00 00 00  |.;9.............|
> 00039800  c0 3b 39 98 00 00 00 01  00 00 00 1b 00 00 00 8c  |.;9.............|
> 00046800  c0 3b 39 98 00 00 00 02  00 00 00 1b 00 00 00 00  |.;9.............|
> 
> 
> 
> 
> Kernel log:
> 
> [   43.801072] 223746  54914(0xd682) 0x40800 bytes
> [   43.802580] 
> [   43.853328] 224000  55168(0xd780) 0x40000 bytes
> [   43.854876] 
> [   43.962590] 224128  55296(0xd800) 0x40800 bytes
> [   43.964097] 
> [   44.016428] 221062  52230(0xcc06) 0x1000 bytes
> [   44.017844] 
> [   44.019594] 224257  55425(0xd881) 0x40800 bytes
> [   44.023832] 
> [   44.100933] 224386  55554(0xd902) 0x40800 bytes
> [   44.102439] 
> [   44.165200] 224640  55808(0xda00) 0x40000 bytes
> [   44.166727] 
> [   44.216643] 224768  55936(0xda80) 0x40800 bytes
> [   44.218144] 
> [   44.270599] 224897  56065(0xdb01) 0x40800 bytes
> [   44.272108] 
> [   44.348213] 201673  32841(0x8049) 0xa800 bytes  ===> (32841-32768)*block_size = 0x24800 , it is the start block of journal ID 0x58
> [   44.349659] J [   44.351275] 
> [   44.360542] 201694  32862(0x805e) 0x800 bytes
> [   44.361876] J S                                                                   
> [   44.363695] 
> [   44.365843] 225026  56194(0xdb82) 0x40800 bytes
> [   44.369716] 
> [   44.417490] drop to 9v
> [   44.422435] 225280  56448(0xdc80) 0x40000 bytes   ==> file hp00000119py87lXtVHG contents. This file is checksum correct. But data contents are written after journal commit.
> [   44.423938] 
> [   44.474334] 225408  56576(0xdd00) 0x40800 bytes   ===> file hp0000011aHHlIpPbjK6's data written after journal commit!!!
> [   44.475904] 
> nanddisk idle -> 1.
> [   44.479310] 6v irq-2

OK, so I was looking into the code and indeed, reality is correct and my
mental model was wrong! ;) I thought that inode gets added to the list of
inodes for which we need to wait for data IO completion during transaction
commit during block allocation. And I was wrong. It used to happen in
mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
ext4_jbd2_file_inode() from delalloc write path) where it got removed. And
that was wrong because although we submit data writes before dropping
handle for allocating transaction and updating i_size, nobody guarantees
that data IO is not delayed in the block layer until transaction commit.
Which seems to happen in your case. I'll send a fix. Thanks for your report
and persistence!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-07 10:24             ` Jan Kara
@ 2016-01-07 11:02               ` HUANG Weller (CM/ESW12-CN)
  2016-01-07 11:47                 ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-07 11:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Li, Michael



> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz]
> Sent: Thursday, January 07, 2016 6:24 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > -----Original Message-----
> > > From: Jan Kara [mailto:jack@suse.cz]
> > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > Subject: Re: ext4 out of order when use cfq scheduler
> > >
> > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > looking into the sources you've sent me I've noticed that
> > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > not use openflg at all. So file won't be synced at all. That
> > > > > would well explain why you see that not all file contents is
> > > > > written. So did you just send me a different version of the
> > > > > source or is your test program
> > > really buggy?
> > > > >
> > > >
> > > > Yes, it is a bug of the test code. So the test tool create files
> > > > without O_SYNC flag actually.  But , even in this case, is the out
> > > > of order acceptable ? or is it normal ?
> > >
> > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > are written and others are not since nobody guarantees order of
> > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > in the inode (but so far it isn't clear to me whether you really see
> > > unitialized data or whether we really wrote zeros to those blocks -
> > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > show that the problematic inode has extent of length 128 blocks
> > > starting at block
> > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > What is the block size of the filesystem?  Because inode size is only 0x40010.
> > >
> > > Some suggestions to try:
> > > 1) Print also length of a write request in addition to the starting
> > > block so that we can see how much actually got written
> >
> > Please see below failure analysis.
> >
> > > 2) Initialize the device to 0xff so that we can distinguish
> > > uninitialized blocks from zeroed-out blocks.
> >
> > Yes, i Initialize the device to 0xff this time.
> >
> > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > which it is wrong.
> > The wrong contents are old file contents which are created in previous
> > test round.  It is caused by the "wrong" sequence inode data(in
> > journal) and  the file contents. So the file contents are not updated.
> 
> So this confuses me somewhat. You previously said that you always remove files
> after each test round and then new ones are created. Is it still the case? So the old
> file contents you speak about above is just some random contents that happened
> to be in disk blocks we freshly allocated to the file, am I right?

Yes. You are right.
 The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round.


> 
> OK, so I was looking into the code and indeed, reality is correct and my mental
> model was wrong! ;) I thought that inode gets added to the list of inodes for which
> we need to wait for data IO completion during transaction commit during block
> allocation. And I was wrong. It used to happen in
> mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> wrong because although we submit data writes before dropping handle for
> allocating transaction and updating i_size, nobody guarantees that data IO is not
> delayed in the block layer until transaction commit.
> Which seems to happen in your case. I'll send a fix. Thanks for your report and
> persistence!
> 

Thanks a lot for your feedback :-)
Because I am not familiar with the detail of the ext4 internal code.  I will try to understand your explanation which you describe above.  And have a look on related funcations.
Could you send the fix in this mail ?
And whether the kernel 3.14 also have such issue, right ?

> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-01-07 11:02               ` HUANG Weller (CM/ESW12-CN)
@ 2016-01-07 11:47                 ` Jan Kara
  2016-01-07 12:19                   ` Jan Kara
                                     ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Jan Kara @ 2016-01-07 11:47 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4, Li, Michael

[-- Attachment #1: Type: text/plain, Size: 4821 bytes --]

On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz]
> > Sent: Thursday, January 07, 2016 6:24 PM
> > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> > Subject: Re: ext4 out of order when use cfq scheduler
> > 
> > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > -----Original Message-----
> > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > >
> > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > > looking into the sources you've sent me I've noticed that
> > > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > > not use openflg at all. So file won't be synced at all. That
> > > > > > would well explain why you see that not all file contents is
> > > > > > written. So did you just send me a different version of the
> > > > > > source or is your test program
> > > > really buggy?
> > > > > >
> > > > >
> > > > > Yes, it is a bug of the test code. So the test tool create files
> > > > > without O_SYNC flag actually.  But , even in this case, is the out
> > > > > of order acceptable ? or is it normal ?
> > > >
> > > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > > are written and others are not since nobody guarantees order of
> > > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > > in the inode (but so far it isn't clear to me whether you really see
> > > > unitialized data or whether we really wrote zeros to those blocks -
> > > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > > show that the problematic inode has extent of length 128 blocks
> > > > starting at block
> > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > What is the block size of the filesystem?  Because inode size is only 0x40010.
> > > >
> > > > Some suggestions to try:
> > > > 1) Print also length of a write request in addition to the starting
> > > > block so that we can see how much actually got written
> > >
> > > Please see below failure analysis.
> > >
> > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > uninitialized blocks from zeroed-out blocks.
> > >
> > > Yes, i Initialize the device to 0xff this time.
> > >
> > > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > > which it is wrong.
> > > The wrong contents are old file contents which are created in previous
> > > test round.  It is caused by the "wrong" sequence inode data(in
> > > journal) and  the file contents. So the file contents are not updated.
> > 
> > So this confuses me somewhat. You previously said that you always remove files
> > after each test round and then new ones are created. Is it still the case? So the old
> > file contents you speak about above is just some random contents that happened
> > to be in disk blocks we freshly allocated to the file, am I right?
> 
> Yes. You are right.
>  The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round.
> 
> 
> > 
> > OK, so I was looking into the code and indeed, reality is correct and my mental
> > model was wrong! ;) I thought that inode gets added to the list of inodes for which
> > we need to wait for data IO completion during transaction commit during block
> > allocation. And I was wrong. It used to happen in
> > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> > ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> > wrong because although we submit data writes before dropping handle for
> > allocating transaction and updating i_size, nobody guarantees that data IO is not
> > delayed in the block layer until transaction commit.
> > Which seems to happen in your case. I'll send a fix. Thanks for your report and
> > persistence!
> > 
> 
> Thanks a lot for your feedback :-)
> Because I am not familiar with the detail of the ext4 internal code.  I will try to understand your explanation which you describe above.  And have a look on related funcations.
> Could you send the fix in this mail ?
> And whether the kernel 3.14 also have such issue, right ?

The problem is in all kernels starting with 3.8. Attached is a patch which
should fix the issue. Can you test whether it fixes the problem for you?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-data-exposure-after-a-crash.patch --]
[-- Type: text/x-patch, Size: 3101 bytes --]

>From 4dd4ac4bec65620a71df5e3f893e6728863f05c3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:21:25 +0100
Subject: [PATCH] ext4: Fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ff2f3cd38522..b216a3eb41a8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -682,6 +682,20 @@ out_sem:
 		ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1135,15 +1149,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-01-07 11:47                 ` Jan Kara
@ 2016-01-07 12:19                   ` Jan Kara
  2016-01-08  2:18                     ` HUANG Weller (CM/ESW12-CN)
  2016-01-08  0:46                   ` HUANG Weller (CM/ESW12-CN)
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-01-07 12:19 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4, Li, Michael

On Thu 07-01-16 12:47:36, Jan Kara wrote:
> On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > -----Original Message-----
> > > From: Jan Kara [mailto:jack@suse.cz]
> > > Sent: Thursday, January 07, 2016 6:24 PM
> > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> > > Subject: Re: ext4 out of order when use cfq scheduler
> > > 
> > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > -----Original Message-----
> > > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > > >
> > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > > > looking into the sources you've sent me I've noticed that
> > > > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > > > not use openflg at all. So file won't be synced at all. That
> > > > > > > would well explain why you see that not all file contents is
> > > > > > > written. So did you just send me a different version of the
> > > > > > > source or is your test program
> > > > > really buggy?
> > > > > > >
> > > > > >
> > > > > > Yes, it is a bug of the test code. So the test tool create files
> > > > > > without O_SYNC flag actually.  But , even in this case, is the out
> > > > > > of order acceptable ? or is it normal ?
> > > > >
> > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > > > are written and others are not since nobody guarantees order of
> > > > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > > > in the inode (but so far it isn't clear to me whether you really see
> > > > > unitialized data or whether we really wrote zeros to those blocks -
> > > > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > > > show that the problematic inode has extent of length 128 blocks
> > > > > starting at block
> > > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > > What is the block size of the filesystem?  Because inode size is only 0x40010.
> > > > >
> > > > > Some suggestions to try:
> > > > > 1) Print also length of a write request in addition to the starting
> > > > > block so that we can see how much actually got written
> > > >
> > > > Please see below failure analysis.
> > > >
> > > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > > uninitialized blocks from zeroed-out blocks.
> > > >
> > > > Yes, i Initialize the device to 0xff this time.
> > > >
> > > > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > > > which it is wrong.
> > > > The wrong contents are old file contents which are created in previous
> > > > test round.  It is caused by the "wrong" sequence inode data(in
> > > > journal) and  the file contents. So the file contents are not updated.
> > > 
> > > So this confuses me somewhat. You previously said that you always remove files
> > > after each test round and then new ones are created. Is it still the case? So the old
> > > file contents you speak about above is just some random contents that happened
> > > to be in disk blocks we freshly allocated to the file, am I right?
> > 
> > Yes. You are right.
> >  The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round.
> > 
> > 
> > > 
> > > OK, so I was looking into the code and indeed, reality is correct and my mental
> > > model was wrong! ;) I thought that inode gets added to the list of inodes for which
> > > we need to wait for data IO completion during transaction commit during block
> > > allocation. And I was wrong. It used to happen in
> > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> > > ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> > > wrong because although we submit data writes before dropping handle for
> > > allocating transaction and updating i_size, nobody guarantees that data IO is not
> > > delayed in the block layer until transaction commit.
> > > Which seems to happen in your case. I'll send a fix. Thanks for your report and
> > > persistence!
> > > 
> > 
> > Thanks a lot for your feedback :-)
> > Because I am not familiar with the detail of the ext4 internal code.  I will try to understand your explanation which you describe above.  And have a look on related funcations.
> > Could you send the fix in this mail ?
> > And whether the kernel 3.14 also have such issue, right ?
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Oh, I have realized the patch is on top of current ext4 development tree
and it won't compile for current vanilla kernel because of
EXT4_GET_BLOCKS_ZERO check. Just remove that line when  you get compilation
failure.

> +		if (map->m_flags & EXT4_MAP_NEW &&
> +		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> +		    !(flags & EXT4_GET_BLOCKS_ZERO) &&

Just remove the above line and things should work for older kernels as
well.

> +		    ext4_should_order_data(inode)) {
> +			ret = ext4_jbd2_file_inode(handle, inode);
> +			if (ret)
> +				return ret;
> +		}
>  	}
>  	return retval;
>  }

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-07 11:47                 ` Jan Kara
  2016-01-07 12:19                   ` Jan Kara
@ 2016-01-08  0:46                   ` HUANG Weller (CM/ESW12-CN)
  2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN)
  2016-03-13  4:27                   ` Theodore Ts'o
  3 siblings, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-08  0:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Li, Michael

> >
> > >
> > > OK, so I was looking into the code and indeed, reality is correct
> > > and my mental model was wrong! ;) I thought that inode gets added to
> > > the list of inodes for which we need to wait for data IO completion
> > > during transaction commit during block allocation. And I was wrong.
> > > It used to happen in
> > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > calls to
> > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > removed. And that was wrong because although we submit data writes
> > > before dropping handle for allocating transaction and updating
> > > i_size, nobody guarantees that data IO is not delayed in the block layer until
> transaction commit.
> > > Which seems to happen in your case. I'll send a fix. Thanks for your
> > > report and persistence!
> > >
> >
> > Thanks a lot for your feedback :-)
> > Because I am not familiar with the detail of the ext4 internal code.  I will try to
> understand your explanation which you describe above.  And have a look on
> related funcations.
> > Could you send the fix in this mail ?
> > And whether the kernel 3.14 also have such issue, right ?
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which should fix
> the issue. Can you test whether it fixes the problem for you?
> 
> 								Honza
> --

Yes, of course I will redo the test with the patch. And also give you feedback.
Thanks.
Huang weller

> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-07 12:19                   ` Jan Kara
@ 2016-01-08  2:18                     ` HUANG Weller (CM/ESW12-CN)
  0 siblings, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-08  2:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Li, Michael



> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz]
> Sent: Thursday, January 07, 2016 8:19 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org; Li, Michael
> <huayil@qti.qualcomm.com>
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu 07-01-16 12:47:36, Jan Kara wrote:
> > On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > > -----Original Message-----
> > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > Sent: Thursday, January 07, 2016 6:24 PM
> > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > >
> > > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > -----Original Message-----
> > > > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > > > >
> > > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > > > So you are running in 'ws' mode of your tool, am I right?
> > > > > > > > Just looking into the sources you've sent me I've noticed
> > > > > > > > that although you set O_SYNC in openflg when mode ==
> > > > > > > > MODE_WS, you do not use openflg at all. So file won't be
> > > > > > > > synced at all. That would well explain why you see that
> > > > > > > > not all file contents is written. So did you just send me
> > > > > > > > a different version of the source or is your test program
> > > > > > really buggy?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, it is a bug of the test code. So the test tool create
> > > > > > > files without O_SYNC flag actually.  But , even in this
> > > > > > > case, is the out of order acceptable ? or is it normal ?
> > > > > >
> > > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some
> > > > > > files are written and others are not since nobody guarantees
> > > > > > order of writeback of inodes. OTOH you shouldn't ever see
> > > > > > uninitialized data in the inode (but so far it isn't clear to
> > > > > > me whether you really see unitialized data or whether we
> > > > > > really wrote zeros to those blocks -
> > > > > > ext4 can sometimes decide to do so).  Your traces and disk
> > > > > > contents show that the problematic inode has extent of length
> > > > > > 128 blocks starting at block
> > > > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > > > What is the block size of the filesystem?  Because inode size is only
> 0x40010.
> > > > > >
> > > > > > Some suggestions to try:
> > > > > > 1) Print also length of a write request in addition to the
> > > > > > starting block so that we can see how much actually got
> > > > > > written
> > > > >
> > > > > Please see below failure analysis.
> > > > >
> > > > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > > > uninitialized blocks from zeroed-out blocks.
> > > > >
> > > > > Yes, i Initialize the device to 0xff this time.
> > > > >
> > > > > > 3) Report exactly for which 512-byte blocks checksum matches
> > > > > > and for which it is wrong.
> > > > > The wrong contents are old file contents which are created in
> > > > > previous test round.  It is caused by the "wrong" sequence inode
> > > > > data(in
> > > > > journal) and  the file contents. So the file contents are not updated.
> > > >
> > > > So this confuses me somewhat. You previously said that you always
> > > > remove files after each test round and then new ones are created.
> > > > Is it still the case? So the old file contents you speak about
> > > > above is just some random contents that happened to be in disk blocks we
> freshly allocated to the file, am I right?
> > >
> > > Yes. You are right.
> > >  The "old file contents" means that the disk blocks which the contents is
> generated from last test round, and they are allocated to a new file in new test
> round.
> > >
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And
> > > > I was wrong. It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block layer until
> transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the detail of the ext4 internal code.  I will try to
> understand your explanation which you describe above.  And have a look on
> related funcations.
> > > Could you send the fix in this mail ?
> > > And whether the kernel 3.14 also have such issue, right ?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> 
> Oh, I have realized the patch is on top of current ext4 development tree and it
> won't compile for current vanilla kernel because of EXT4_GET_BLOCKS_ZERO
> check. Just remove that line when  you get compilation failure.
> 
> > +		if (map->m_flags & EXT4_MAP_NEW &&
> > +		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> > +		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
> 
> Just remove the above line and things should work for older kernels as well.
> 
> > +		    ext4_should_order_data(inode)) {
> > +			ret = ext4_jbd2_file_inode(handle, inode);
> > +			if (ret)
> > +				return ret;
> > +		}
> >  	}
> >  	return retval;
> >  }
> 

Just confirmed with you because the patch tool didn't found:

"out_sem:
 		ret = check_block_validity(inode, map);" in my kernel.

after checking the code, I add the modification to the end of function : ext4_map_blocks
below is the diff. please help to double confirm.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10b71e4..d29a1d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -753,6 +753,10 @@ has_zeroout:
                int ret = check_block_validity(inode, map);
                if (ret != 0)
                        return ret;
+               if(ext4_should_order_data(inode)) {
+                       ret = ext4_jbd2_file_inode(handle, inode);
+               if (ret)
+                       return ret;
        }
        return retval;
 }
@@ -1113,15 +1117,6 @@ static int ext4_write_end(struct file *file,
        int i_size_changed = 0;

        trace_ext4_write_end(inode, pos, len, copied);
-       if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-               ret = ext4_jbd2_file_inode(handle, inode);
-               if (ret) {
-                       unlock_page(page);
-                       page_cache_release(page);
-                       goto errout;
-               }
-       }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-07 11:47                 ` Jan Kara
  2016-01-07 12:19                   ` Jan Kara
  2016-01-08  0:46                   ` HUANG Weller (CM/ESW12-CN)
@ 2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN)
  2016-01-11 10:21                     ` Jan Kara
  2016-03-13  4:27                   ` Theodore Ts'o
  3 siblings, 1 reply; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-01-11  9:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Li, Michael

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]


> -----Original Message-----
> From: HUANG Weller (CM/ESW12-CN)
> Sent: Friday, January 08, 2016 8:47 AM
> To: 'Jan Kara' <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: RE: ext4 out of order when use cfq scheduler
> 
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And I was
> wrong.
> > > > It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > layer until
> > transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the detail of the ext4 internal code.
> > > I will try to
> > understand your explanation which you describe above.  And have a look
> > on related funcations.
> > > Could you send the fix in this mail ?
> > > And whether the kernel 3.14 also have such issue, right ?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> >
> > 								Honza
> > --
> 
> Yes, of course I will redo the test with the patch. And also give you feedback.

Test result:
Test on 2 targets with the kernel applied your patch. Both of them are OK after 5000 power failure tests. Our target test cycle is 10,000.
By the way, since your original patch can't be applied on 3.x kernel, The attached one  is based on yours and can applied on old kernel(mine is 3.10) directly.

> 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-data-exposure-after-a-crash.patch --]
[-- Type: application/octet-stream, Size: 3185 bytes --]

From 98cf821e270748486132079ac8d1c9557c9adfb9 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:21:25 +0100
Subject: [PATCH] ext4: Fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fix conflict with 3.10 kernel.
Signed-off-by: weller huang <weller.huang@cn.bosch.com>
---
 fs/ext4/inode.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e48bd5a..d5d6d6c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -751,6 +751,17 @@ has_zeroout:
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if(ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1110,15 +1121,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}
-
 	if (ext4_has_inline_data(inode)) {
 		ret = ext4_write_inline_data_end(inode, pos, len,
 						 copied, page);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN)
@ 2016-01-11 10:21                     ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-01-11 10:21 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: Jan Kara, linux-ext4, Li, Michael

On Mon 11-01-16 09:05:20, HUANG Weller (CM/ESW12-CN) wrote:
> 
> > -----Original Message-----
> > From: HUANG Weller (CM/ESW12-CN)
> > Sent: Friday, January 08, 2016 8:47 AM
> > To: 'Jan Kara' <jack@suse.cz>
> > Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> > Subject: RE: ext4 out of order when use cfq scheduler
> > 
> > > >
> > > > >
> > > > > OK, so I was looking into the code and indeed, reality is correct
> > > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > > to the list of inodes for which we need to wait for data IO
> > > > > completion during transaction commit during block allocation. And I was
> > wrong.
> > > > > It used to happen in
> > > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > > calls to
> > > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > > removed. And that was wrong because although we submit data writes
> > > > > before dropping handle for allocating transaction and updating
> > > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > > layer until
> > > transaction commit.
> > > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > > your report and persistence!
> > > > >
> > > >
> > > > Thanks a lot for your feedback :-)
> > > > Because I am not familiar with the detail of the ext4 internal code.
> > > > I will try to
> > > understand your explanation which you describe above.  And have a look
> > > on related funcations.
> > > > Could you send the fix in this mail ?
> > > > And whether the kernel 3.14 also have such issue, right ?
> > >
> > > The problem is in all kernels starting with 3.8. Attached is a patch
> > > which should fix the issue. Can you test whether it fixes the problem for you?
> > >
> > > 								Honza
> > > --
> > 
> > Yes, of course I will redo the test with the patch. And also give you feedback.
> 
> Test result:
> Test on 2 targets with the kernel applied your patch. Both of them are OK
> after 5000 power failure tests. Our target test cycle is 10,000.  By the
> way, since your original patch can't be applied on 3.x kernel, The
> attached one  is based on yours and can applied on old kernel(mine is
> 3.10) directly.

Thanks for testing and the port! Once the patch is merged upstream, I'll
backport it and push it to all active stable kernels.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2015-12-22  6:24 ext4 out of order when use cfq scheduler HUANG Weller (CM/EPF1-CN)
  2015-12-22 15:00 ` Jan Kara
@ 2016-01-28  8:02 ` Xiong Zhou
  2016-02-03  6:08   ` HUANG Weller (CM/ESW12-CN)
  1 sibling, 1 reply; 30+ messages in thread
From: Xiong Zhou @ 2016-01-28  8:02 UTC (permalink / raw)
  To: HUANG Weller (CM/EPF1-CN); +Cc: linux-ext4, Li, Michael

On Tue, Dec 22, 2015 at 2:24 PM, HUANG Weller (CM/EPF1-CN)
<Weller.Huang@cn.bosch.com> wrote:
> Hi EXT4 developers,
>
> I am doing the power failure on a storage device. We are arm embedded system.

Is there any chance that make this test a xfstest case ?

Thanks,
Xiong

>
>
> Test environment:
> Kernel: 3.10.63
> Test type: power failure
> File create type: O_SYNC flag enabled
> Mount option: data=ordered, journal_checksum,barrier=1
> IO scheduler: CFQ
>
> Debug method: kernel print every block number of write request. print whether it is journal block.  print whether it is (REQ_FUA | REQ_META | write)
> After system startup, backup journal blocks before mount the test partition.
>
> Below is analysis which I think there is a out of order:
> hp0000017aMhWY3i0vMv is the file which checksum is error. Only half of the file contents is correct.
>
>
> [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3
> debugfs 1.42.9 (28-Dec-2013)
> debugfs:  imap test/hp0000017aMhWY3i0vMv
> Inode 390 is part of block group 0
>         located at block 141, offset 0x0280
>
> 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00  |............f...|
> 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00  |f...............|
> 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00  |................|
> 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|  ==> the file contents is at 0x00012c00
> 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00  |.........&......|
>
>
> Search the block number from the journal blocks:
>
> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00  |.............,..|
>
> Search file name which the file checksum is error in journal blocks:
>
> [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31  |........hp000001|
> 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00  |7aMhWY3i0vMv....|
>
>
> List all journal block record to check which journal block records it:
>
> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00  |.;9.............|
> 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24  |.;9........o...$|
> 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75  |.;9........o...u|
> 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00  |.;9........o....|
> 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65  |.;9........p...e|
> 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00  |.;9........p....|
> 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c  |.;9........q...||
> 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00  |.;9........q....|
> 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82  |.;9........r....|
> 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00  |.;9........r....|
> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block, the file name and the start block number are all recorded in the journals.
> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|
>
>
> Back to see the kernel log which it print all the block numbers:
>
> ...
> ...
> [   46.222671] 244109  75277
> [   46.222693]
> [   46.272438] 244352  75520
> [   46.272460]
> [   46.348417] 238443  69611
> [   46.348438]
> [   46.349811] 244480  75648
> [   46.352287]
> [   46.404904] 244609  75777
> [   46.404926]
> [   46.454698] 244738  75906
> [   46.454719]
> [   46.505439] 244992  76160
> [   46.505459]
> [   46.557783] 245120  76288
> [   46.557804]
> [   46.610075] 245249  76417
> [   46.610096]
> [   46.660196] 245378  76546
> [   46.660219]
> [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91, block size=2048,  so, the offset address in the j.bin is 0x2d800
> [   46.709928] J [   46.711233]
> [   46.740635] drop to 9v
> [   46.749540] 201716  32884
> [   46.749560] J S
> [   46.751039]
> [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same start block of the file which checksum is error.
> [   46.755284]
> nanddisk idle -> 1.
> [   46.800227] 6v irq-2
>
>
> The j.bin offset 0x2d800
>
> 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|  ==>00039ab0 is in last block
> 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00  |.;9........s....|
>
>
> Normally, ext4 will first write the file contents, and then write the journal and journal commit. Then after some delay, it will write the meta data.
> So The journal blocks contains the meta data of the file which the file contents already been written before.
> But from above analysis,  the journal sequence 0x73 already contain the file name(hp0000017aMhWY3i0vMv) and the start block number(76800). So from the kernel log,  the block number(76800) should be available before the journal blocks but NOT after it.
> It seems that there is out of order happen.
>
> Could you please help to check this issue ? or give a explanation about it ?
> Many thanks.
>
> Best regards
>
> Weller HUANG
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-01-28  8:02 ` Xiong Zhou
@ 2016-02-03  6:08   ` HUANG Weller (CM/ESW12-CN)
  0 siblings, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-02-03  6:08 UTC (permalink / raw)
  To: Xiong Zhou; +Cc: linux-ext4, Li, Michael



> -----Original Message-----
> From: Xiong Zhou [mailto:jencce.kernel@gmail.com]
> Sent: Thursday, January 28, 2016 4:03 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Tue, Dec 22, 2015 at 2:24 PM, HUANG Weller (CM/EPF1-CN)
> <Weller.Huang@cn.bosch.com> wrote:
> > Hi EXT4 developers,
> >
> > I am doing the power failure on a storage device. We are arm embedded system.
> 
> Is there any chance that make this test a xfstest case ?
> 

Do you mean use the xfstest tool to do the write test ? is there any difference or advantage to use xfstest ?

> >
> >
> > Test environment:
> > Kernel: 3.10.63
> > Test type: power failure
> > File create type: O_SYNC flag enabled
> > Mount option: data=ordered, journal_checksum,barrier=1 IO scheduler:
> > CFQ
> >
> > Debug method: kernel print every block number of write request. print
> > whether it is journal block.  print whether it is (REQ_FUA | REQ_META | write)
> After system startup, backup journal blocks before mount the test partition.
> >
> > Below is analysis which I think there is a out of order:
> > hp0000017aMhWY3i0vMv is the file which checksum is error. Only half of the file
> contents is correct.
> >
> >
> > [root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3 debugfs 1.42.9
> > (28-Dec-2013)
> > debugfs:  imap test/hp0000017aMhWY3i0vMv Inode 390 is part of block
> > group 0
> >         located at block 141, offset 0x0280
> >
> > 00000280  80 81 00 00 10 00 04 00  c8 09 00 00 66 0a 00 00
> > |............f...|
> > 00000290  66 0a 00 00 00 00 00 00  00 00 01 00 04 02 00 00
> > |f...............|
> > 000002a0  00 00 08 00 01 00 00 00  0a f3 02 00 04 00 00 00
> > |................|
> > 000002b0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > |.............,..|  ==> the file contents is at 0x00012c00
> > 000002c0  80 00 00 00 01 00 00 00  8e 26 01 00 00 00 00 00
> > |.........&......|
> >
> >
> > Search the block number from the journal blocks:
> >
> > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "00 2c 01 00"
> > 00039ab0  00 00 00 00 00 00 00 00  80 00 00 00 00 2c 01 00
> > |.............,..|
> >
> > Search file name which the file checksum is error in journal blocks:
> >
> > [root@SiRFatlas6 ~]# hexdump j.bin  -C |  grep "3i0vMv" -B1
> > 00030c60  86 01 00 00 1c 00 14 01  68 70 30 30 30 30 30 31
> > |........hp000001|
> > 00030c70  37 61 4d 68 57 59 33 69  30 76 4d 76 88 01 00 00
> > |7aMhWY3i0vMv....|
> >
> >
> > List all journal block record to check which journal block records it:
> >
> > [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> > 00000000  c0 3b 39 98 00 00 00 04  00 00 00 00 00 00 08 00
> > |.;9.............|
> > 00000800  c0 3b 39 98 00 00 00 05  00 00 00 6f 00 00 00 24
> > |.;9........o...$|
> > 00001000  c0 3b 39 98 00 00 00 01  00 00 00 6f 00 00 00 75
> > |.;9........o...u|
> > 0000c800  c0 3b 39 98 00 00 00 02  00 00 00 6f 00 00 00 00
> > |.;9........o....|
> > 0000d000  c0 3b 39 98 00 00 00 01  00 00 00 70 00 00 00 65
> > |.;9........p...e|
> > 00016000  c0 3b 39 98 00 00 00 02  00 00 00 70 00 00 00 00
> > |.;9........p....|
> > 00016800  c0 3b 39 98 00 00 00 01  00 00 00 71 00 00 00 7c
> > |.;9........q...||
> > 00021000  c0 3b 39 98 00 00 00 02  00 00 00 71 00 00 00 00
> > |.;9........q....|
> > 00021800  c0 3b 39 98 00 00 00 01  00 00 00 72 00 00 00 82
> > |.;9........r....|
> > 0002d000  c0 3b 39 98 00 00 00 02  00 00 00 72 00 00 00 00
> > |.;9........r....|
> > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88  |.;9........s....|
> ==>00039ab0 is in last block, the file name and the start block number are all
> recorded in the journals.
> > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > |.;9........s....|
> >
> >
> > Back to see the kernel log which it print all the block numbers:
> >
> > ...
> > ...
> > [   46.222671] 244109  75277
> > [   46.222693]
> > [   46.272438] 244352  75520
> > [   46.272460]
> > [   46.348417] 238443  69611
> > [   46.348438]
> > [   46.349811] 244480  75648
> > [   46.352287]
> > [   46.404904] 244609  75777
> > [   46.404926]
> > [   46.454698] 244738  75906
> > [   46.454719]
> > [   46.505439] 244992  76160
> > [   46.505459]
> > [   46.557783] 245120  76288
> > [   46.557804]
> > [   46.610075] 245249  76417
> > [   46.610096]
> > [   46.660196] 245378  76546
> > [   46.660219]
> > [   46.709906] 201691  32859   ==> journal start is 32768, so the offset is 91,
> block size=2048,  so, the offset address in the j.bin is 0x2d800
> > [   46.709928] J [   46.711233]
> > [   46.740635] drop to 9v
> > [   46.749540] 201716  32884
> > [   46.749560] J S
> > [   46.751039]
> > [   46.753151] 245632  76800  ==> 76800 in hex is 0x012c00, it is the same start
> block of the file which checksum is error.
> > [   46.755284]
> > nanddisk idle -> 1.
> > [   46.800227] 6v irq-2
> >
> >
> > The j.bin offset 0x2d800
> >
> > 0002d800  c0 3b 39 98 00 00 00 01  00 00 00 73 00 00 00 88
> > |.;9........s....|  ==>00039ab0 is in last block
> > 0003a000  c0 3b 39 98 00 00 00 02  00 00 00 73 00 00 00 00
> > |.;9........s....|
> >
> >
> > Normally, ext4 will first write the file contents, and then write the journal and
> journal commit. Then after some delay, it will write the meta data.
> > So The journal blocks contains the meta data of the file which the file contents
> already been written before.
> > But from above analysis,  the journal sequence 0x73 already contain the file
> name(hp0000017aMhWY3i0vMv) and the start block number(76800). So from the
> kernel log,  the block number(76800) should be available before the journal blocks
> but NOT after it.
> > It seems that there is out of order happen.
> >
> > Could you please help to check this issue ? or give a explanation about it ?
> > Many thanks.
> >
> > Best regards
> >
> > Weller HUANG
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-01-07 11:47                 ` Jan Kara
                                     ` (2 preceding siblings ...)
  2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN)
@ 2016-03-13  4:27                   ` Theodore Ts'o
  2016-03-14  2:43                     ` HUANG Weller (CM/ESW12-CN)
  2016-03-14  7:39                     ` Jan Kara
  3 siblings, 2 replies; 30+ messages in thread
From: Theodore Ts'o @ 2016-03-13  4:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Sorry, I missed this patch because it was attached to an discussion
thread.

> The problem is that although for delayed allocated blocks we write their
> contents immediately after allocating them, there is no guarantee that
> the IO scheduler or device doesn't reorder things

I don't think that's the problem.  In the commit thread when we call
blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
reorder writes after that point, which is before we write the commit
block.  Instead, I believe the problem is in ext4_writepages:

		ext4_journal_stop(handle);
		/* Submit prepared bio */
		ext4_io_submit(&mpd.io_submit);

Once we release the handle, the commit can start --- *before* we have
a chance to submit the I/O.   Oops.

I believe if we swap these two calls, it should fix the problem Huang
was seeing.

Jan, do you agree?

					- Ted
					

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-03-13  4:27                   ` Theodore Ts'o
@ 2016-03-14  2:43                     ` HUANG Weller (CM/ESW12-CN)
  2016-03-14  7:39                     ` Jan Kara
  1 sibling, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-03-14  2:43 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara; +Cc: linux-ext4, Li, Michael



> -----Original Message-----
> From: Theodore Ts'o [mailto:tytso@mit.edu]
> Sent: Sunday, March 13, 2016 12:27 PM
> To: Jan Kara <jack@suse.cz>
> Cc: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>; linux-
> ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> 
> Sorry, I missed this patch because it was attached to an discussion thread.
> 
> > The problem is that although for delayed allocated blocks we write
> > their contents immediately after allocating them, there is no
> > guarantee that the IO scheduler or device doesn't reorder things
> 
> I don't think that's the problem.  In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't reorder writes
> after that point, which is before we write the commit block.  Instead, I believe the
> problem is in ext4_writepages:
> 
> 		ext4_journal_stop(handle);
> 		/* Submit prepared bio */
> 		ext4_io_submit(&mpd.io_submit);
> 
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O.   Oops.
> 
> I believe if we swap these two calls, it should fix the problem Huang was seeing.
> 
> Jan, do you agree?
> 
> 					- Ted

Hi Ted and Jan,

You can give me a patch and I can redo the verification on my kernel and HWs.

I also look into the code, since In my test case, I use data=ordered option and without sync.  So the write operation will goto ext4_da_writepages(), right ?
My kernel version is 3.10.63, as I see io_submit and journal_stop sequence  already in that order. 

while (!ret && wbc->nr_to_write > 0) {
ext4_journal_start
write_cache_pages_da
mpage_da_map_and_submit  ==>
ext4_journal_stop
}

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-13  4:27                   ` Theodore Ts'o
  2016-03-14  2:43                     ` HUANG Weller (CM/ESW12-CN)
@ 2016-03-14  7:39                     ` Jan Kara
  2016-03-14 14:36                       ` Theodore Ts'o
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-03-14  7:39 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

On Sat 12-03-16 23:27:23, Ted Tso wrote:
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> > The problem is in all kernels starting with 3.8. Attached is a patch which
> > should fix the issue. Can you test whether it fixes the problem for you?
> 
> Sorry, I missed this patch because it was attached to an discussion
> thread.

I have actually sent this patch in a standalone thread on January 11
(http://lists.openwall.net/linux-ext4/2016/01/11/3) together with one more
cleanup.

> > The problem is that although for delayed allocated blocks we write their
> > contents immediately after allocating them, there is no guarantee that
> > the IO scheduler or device doesn't reorder things
> 
> I don't think that's the problem.  In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
> reorder writes after that point, which is before we write the commit
> block.  Instead, I believe the problem is in ext4_writepages:
> 
> 		ext4_journal_stop(handle);
> 		/* Submit prepared bio */
> 		ext4_io_submit(&mpd.io_submit);
> 
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O.   Oops.
> 
> I believe if we swap these two calls, it should fix the problem Huang
> was seeing.

No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
anything to IOs which have not reported completion before
blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
in its internal RB tree, following flush request completely bypasses this
tree and goes directly to the disk where it flushes caches. And only later
CFQ decides to schedule async writeback from the flusher thread which is
queued in the RB tree...

Note that the behavior has changed to be like this with the flushing
machinery rewrite. Before that, IO scheduler had to drain all the
outstanding IO requests (IO cache flush behaved like IO barrier). So your
patch would be enough with the old flushing machinery but is not enough
since 3.0 or so...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-14  7:39                     ` Jan Kara
@ 2016-03-14 14:36                       ` Theodore Ts'o
  2016-03-15 10:46                         ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2016-03-14 14:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> anything to IOs which have not reported completion before
> blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> in its internal RB tree, following flush request completely bypasses this
> tree and goes directly to the disk where it flushes caches. And only later
> CFQ decides to schedule async writeback from the flusher thread which is
> queued in the RB tree...

Oh, right.  I am forgetting about the flushing mahchinery rewrite.
Thanks for pointing that out.

But what we *could* do is to swap those two calls and then in the case
where delalloc is enabled, could maintain a list of inodes where we
only need to call filemap_fdatawait(), and not initiate writeback for
any dirty pages which had been caused by non-allocating writes.

						- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-14 14:36                       ` Theodore Ts'o
@ 2016-03-15 10:46                         ` Jan Kara
  2016-03-15 14:46                           ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-03-15 10:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

On Mon 14-03-16 10:36:35, Ted Tso wrote:
> On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > anything to IOs which have not reported completion before
> > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > in its internal RB tree, following flush request completely bypasses this
> > tree and goes directly to the disk where it flushes caches. And only later
> > CFQ decides to schedule async writeback from the flusher thread which is
> > queued in the RB tree...
> 
> Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> Thanks for pointing that out.
> 
> But what we *could* do is to swap those two calls and then in the case
> where delalloc is enabled, could maintain a list of inodes where we
> only need to call filemap_fdatawait(), and not initiate writeback for
> any dirty pages which had been caused by non-allocating writes.

We actually don't need to swap those two calls - page is already marked as
under writeback in

  mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page

which gets called while we still hold the transaction handle. I agree
calling filemap_fdatawait() from JBD2 during commit should be enough to fix
issues with delalloc writeback. I'm just somewhat afraid that it will be
more fragile: If we add inode to transaction's list in ext4_map_blocks(),
we are pretty sure there's no way to allocate block to an inode without
introducing data exposure issues (which are then very hard to spot). If we
depend on callers of ext4_map_blocks() to properly add inode to appropriate
transaction list, we have much more places to check. I'll think whether we
could make this more robust.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-15 10:46                         ` Jan Kara
@ 2016-03-15 14:46                           ` Jan Kara
  2016-03-15 20:09                             ` Jan Kara
                                               ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-15 14:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

[-- Attachment #1: Type: text/plain, Size: 2450 bytes --]

On Tue 15-03-16 11:46:34, Jan Kara wrote:
> On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > > anything to IOs which have not reported completion before
> > > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > > in its internal RB tree, following flush request completely bypasses this
> > > tree and goes directly to the disk where it flushes caches. And only later
> > > CFQ decides to schedule async writeback from the flusher thread which is
> > > queued in the RB tree...
> > 
> > Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> > Thanks for pointing that out.
> > 
> > But what we *could* do is to swap those two calls and then in the case
> > where delalloc is enabled, could maintain a list of inodes where we
> > only need to call filemap_fdatawait(), and not initiate writeback for
> > any dirty pages which had been caused by non-allocating writes.
> 
> We actually don't need to swap those two calls - page is already marked as
> under writeback in
> 
>   mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page
> 
> which gets called while we still hold the transaction handle. I agree
> calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> issues with delalloc writeback. I'm just somewhat afraid that it will be
> more fragile: If we add inode to transaction's list in ext4_map_blocks(),
> we are pretty sure there's no way to allocate block to an inode without
> introducing data exposure issues (which are then very hard to spot). If we
> depend on callers of ext4_map_blocks() to properly add inode to appropriate
> transaction list, we have much more places to check. I'll think whether we
> could make this more robust.

OK, I have something - Huang, can you check whether the attached patches
also fix your data exposure issues please? The first patch is the original
fix, patch two is a cleanup, patches 3 and 4 implement the speedup
suggested by Ted. Patches are only lightly tested so far.  I'll run more
comprehensive tests later and in particular I want to check whether the
additional complexity actually brings us some advantage at least for
workloads which redirty pages in addition to writing some new ones using
delayed allocation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-data-exposure-after-a-crash.patch --]
[-- Type: text/x-patch, Size: 3300 bytes --]

>From 0e1d6452e870f1f9e0bef727da18c839ed983c35 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:21:25 +0100
Subject: [PATCH 1/4] ext4: Fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ff2f3cd38522..b216a3eb41a8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -682,6 +682,20 @@ out_sem:
 		ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1135,15 +1149,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}
-
 	if (ext4_has_inline_data(inode)) {
 		ret = ext4_write_inline_data_end(inode, pos, len,
 						 copied, page);
-- 
2.6.2


[-- Attachment #3: 0002-ext4-Remove-EXT4_STATE_ORDERED_MODE.patch --]
[-- Type: text/x-patch, Size: 1813 bytes --]

>From 40e646b46c7cbb5dcfbc35055f5e4fc7068935d8 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:53:53 +0100
Subject: [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE

This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 1 -
 fs/ext4/inode.c | 5 +----
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1e20fa94fcf6..bc4910bce4ff 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1489,7 +1489,6 @@ enum {
 	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
 					   nolocking */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
-	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b216a3eb41a8..173da9d467d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3444,10 +3444,7 @@ void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
-		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
-		break;
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
-		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_JOURNAL_DATA_MODE:
 		inode->i_mapping->a_ops = &ext4_journalled_aops;
@@ -3540,7 +3537,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	} else {
 		err = 0;
 		mark_buffer_dirty(bh);
-		if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE))
+		if (ext4_should_order_data(inode))
 			err = ext4_jbd2_file_inode(handle, inode);
 	}
 
-- 
2.6.2


[-- Attachment #4: 0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch --]
[-- Type: text/x-patch, Size: 7859 bytes --]

>From 408f3da34dc87a02e8c4127c71591650955c031d Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:07:22 +0100
Subject: [PATCH 3/4] jbd2: Add support for avoiding data writes during
 transaction commits

Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h   |  3 ++-
 fs/jbd2/commit.c      |  4 ++++
 fs/jbd2/journal.c     |  3 ++-
 fs/jbd2/transaction.c | 25 ++++++++++++++++++++-----
 fs/ocfs2/journal.h    |  2 +-
 include/linux/jbd2.h  | 13 +++++++++++--
 6 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f5846211095..f1c940b38b30 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -362,7 +362,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_add_write(handle,
+						    EXT4_I(inode)->jinode);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 36345fefa3ff..7d1dbe93d290 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -221,6 +221,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WRITE_DATA))
+			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
@@ -258,6 +260,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WAIT_DATA))
+			continue;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
 		err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..8734b77ae47a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -94,7 +94,8 @@ EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidatepage);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
-EXPORT_SYMBOL(jbd2_journal_file_inode);
+EXPORT_SYMBOL(jbd2_journal_inode_add_write);
+EXPORT_SYMBOL(jbd2_journal_inode_add_wait);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ca181e81c765..0fcbec79fc7e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2476,7 +2476,8 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 /*
  * File inode in the inode list of the handle's transaction
  */
-int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
+static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
+				   unsigned long flags)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2501,14 +2502,16 @@ int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
 	 * and if jinode->i_next_transaction == transaction, commit code
 	 * will only file the inode where we want it.
 	 */
-	if (jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction)
+	if ((jinode->i_transaction == transaction ||
+	    jinode->i_next_transaction == transaction) &&
+	    jinode->i_flags & flags == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
 
-	if (jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction)
+	if ((jinode->i_transaction == transaction ||
+	    jinode->i_next_transaction == transaction) &&
+	    jinode->i_flags & flags == flags)
 		goto done;
 
 	/*
@@ -2518,6 +2521,7 @@ int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
 	 */
 	if (!transaction->t_need_data_flush)
 		transaction->t_need_data_flush = 1;
+	jinode->i_flags |= flags;
 	/* On some different transaction's list - should be
 	 * the committing one */
 	if (jinode->i_transaction) {
@@ -2537,6 +2541,17 @@ done:
 	return 0;
 }
 
+int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+				       JI_WRITE_DATA | JI_WAIT_DATA);
+}
+
+int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+}
+
 /*
  * File truncate and transaction commit interact with each other in a
  * non-trivial way.  If a transaction writing data block A is
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index f4cd3c3e9fb7..497a4171ef61 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -619,7 +619,7 @@ static inline int ocfs2_calc_tree_trunc_credits(struct super_block *sb,
 
 static inline int ocfs2_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
-	return jbd2_journal_file_inode(handle, &OCFS2_I(inode)->ip_jinode);
+	return jbd2_journal_inode_add_write(handle, &OCFS2_I(inode)->ip_jinode);
 }
 
 static inline int ocfs2_begin_ordered_truncate(struct inode *inode,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 65407f6c9120..05a8d6762b2d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -408,11 +408,19 @@ static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 
 /* Flags in jbd_inode->i_flags */
 #define __JI_COMMIT_RUNNING 0
-/* Commit of the inode data in progress. We use this flag to protect us from
+#define __JI_WRITE_DATA 1
+#define __JI_WAIT_DATA 2
+
+/*
+ * Commit of the inode data in progress. We use this flag to protect us from
  * concurrent deletion of inode. We cannot use reference to inode for this
  * since we cannot afford doing last iput() on behalf of kjournald
  */
 #define JI_COMMIT_RUNNING (1 << __JI_COMMIT_RUNNING)
+/* Write allocated dirty buffers in this inode before commit */
+#define JI_WRITE_DATA (1 << __JI_WRITE_DATA)
+/* Wait for outstanding data writes for this inode before commit */
+#define JI_WAIT_DATA (1 << __JI_WAIT_DATA)
 
 /**
  * struct jbd_inode is the structure linking inodes in ordered mode
@@ -1274,7 +1282,8 @@ extern int	   jbd2_journal_clear_err  (journal_t *);
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
-extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.6.2


[-- Attachment #5: 0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch --]
[-- Type: text/x-patch, Size: 5006 bytes --]

>From 7dd6dc4704480d9d07113c7218649cd056ea8508 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:25:02 +0100
Subject: [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  3 +++
 fs/ext4/ext4_jbd2.h   | 12 +++++++++++-
 fs/ext4/inode.c       | 10 +++++++---
 fs/ext4/move_extent.c |  2 +-
 fs/jbd2/transaction.c |  4 ++--
 5 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc4910bce4ff..a095e833a6c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,6 +561,9 @@ enum {
 #define EXT4_GET_BLOCKS_ZERO			0x0200
 #define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE |\
 					EXT4_GET_BLOCKS_ZERO)
+	/* Caller will submit data before dropping transaction handle. This
+	 * allows jbd2 to avoid submitting data before commit. */
+#define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f1c940b38b30..09c1ef38cbe6 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -359,7 +359,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 	return 0;
 }
 
-static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
+static inline int ext4_jbd2_inode_add_write(handle_t *handle,
+					    struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
 		return jbd2_journal_inode_add_write(handle,
@@ -367,6 +368,15 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
+					   struct inode *inode)
+{
+	if (ext4_handle_valid(handle))
+		return jbd2_journal_inode_add_wait(handle,
+						   EXT4_I(inode)->jinode);
+	return 0;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 173da9d467d1..01ecc67985bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -692,7 +692,10 @@ out_sem:
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    ext4_should_order_data(inode)) {
-			ret = ext4_jbd2_file_inode(handle, inode);
+			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
+				ret = ext4_jbd2_inode_add_wait(handle, inode);
+			else
+				ret = ext4_jbd2_inode_add_write(handle, inode);
 			if (ret)
 				return ret;
 		}
@@ -2164,7 +2167,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	 * the data was copied into the page cache.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
@@ -3538,7 +3542,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_file_inode(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index fb6f11709ae6..0ce9795c7404 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,7 @@ data_copy:
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_file_inode(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
 
 unlock_pages:
 	unlock_page(pagep[0]);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 0fcbec79fc7e..903ad38443c8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2504,14 +2504,14 @@ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 	 */
 	if ((jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction) &&
-	    jinode->i_flags & flags == flags)
+	    (jinode->i_flags & flags) == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
 
 	if ((jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction) &&
-	    jinode->i_flags & flags == flags)
+	    (jinode->i_flags & flags) == flags)
 		goto done;
 
 	/*
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-15 14:46                           ` Jan Kara
@ 2016-03-15 20:09                             ` Jan Kara
  2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
  2016-06-22 11:55                               ` FW: " HUANG Weller (CM/ESW12-CN)
  2016-03-16  0:41                             ` HUANG Weller (CM/ESW12-CN)
  2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
  2 siblings, 2 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-15 20:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, HUANG Weller (CM/ESW12-CN), linux-ext4, Li, Michael

[-- Attachment #1: Type: text/plain, Size: 2610 bytes --]

On Tue 15-03-16 15:46:33, Jan Kara wrote:
> On Tue 15-03-16 11:46:34, Jan Kara wrote:
> > On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > > > anything to IOs which have not reported completion before
> > > > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > > > in its internal RB tree, following flush request completely bypasses this
> > > > tree and goes directly to the disk where it flushes caches. And only later
> > > > CFQ decides to schedule async writeback from the flusher thread which is
> > > > queued in the RB tree...
> > > 
> > > Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> > > Thanks for pointing that out.
> > > 
> > > But what we *could* do is to swap those two calls and then in the case
> > > where delalloc is enabled, could maintain a list of inodes where we
> > > only need to call filemap_fdatawait(), and not initiate writeback for
> > > any dirty pages which had been caused by non-allocating writes.
> > 
> > We actually don't need to swap those two calls - page is already marked as
> > under writeback in
> > 
> >   mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page
> > 
> > which gets called while we still hold the transaction handle. I agree
> > calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> > issues with delalloc writeback. I'm just somewhat afraid that it will be
> > more fragile: If we add inode to transaction's list in ext4_map_blocks(),
> > we are pretty sure there's no way to allocate block to an inode without
> > introducing data exposure issues (which are then very hard to spot). If we
> > depend on callers of ext4_map_blocks() to properly add inode to appropriate
> > transaction list, we have much more places to check. I'll think whether we
> > could make this more robust.
> 
> OK, I have something - Huang, can you check whether the attached patches
> also fix your data exposure issues please? The first patch is the original
> fix, patch two is a cleanup, patches 3 and 4 implement the speedup
> suggested by Ted. Patches are only lightly tested so far.  I'll run more
> comprehensive tests later and in particular I want to check whether the
> additional complexity actually brings us some advantage at least for
> workloads which redirty pages in addition to writing some new ones using
> delayed allocation.

OK, there was a bug in patch 3. Attached is a new version of patches 3 and
4.
							Honza

[-- Attachment #2: 0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch --]
[-- Type: text/x-patch, Size: 7472 bytes --]

>From 5e3bf6cbe750979e003b132aceda655ebd5b4f2a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:07:22 +0100
Subject: [PATCH 3/4] jbd2: Add support for avoiding data writes during
 transaction commits

Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h   |  3 ++-
 fs/jbd2/commit.c      |  4 ++++
 fs/jbd2/journal.c     |  3 ++-
 fs/jbd2/transaction.c | 22 ++++++++++++++++++----
 fs/ocfs2/journal.h    |  2 +-
 include/linux/jbd2.h  | 13 +++++++++++--
 6 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f5846211095..f1c940b38b30 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -362,7 +362,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_add_write(handle,
+						    EXT4_I(inode)->jinode);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 36345fefa3ff..7d1dbe93d290 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -221,6 +221,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WRITE_DATA))
+			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
@@ -258,6 +260,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WAIT_DATA))
+			continue;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
 		err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..8734b77ae47a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -94,7 +94,8 @@ EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidatepage);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
-EXPORT_SYMBOL(jbd2_journal_file_inode);
+EXPORT_SYMBOL(jbd2_journal_inode_add_write);
+EXPORT_SYMBOL(jbd2_journal_inode_add_wait);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ca181e81c765..d1b5c0f465b5 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2476,7 +2476,8 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 /*
  * File inode in the inode list of the handle's transaction
  */
-int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
+static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
+				   unsigned long flags)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2501,12 +2502,14 @@ int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
 	 * and if jinode->i_next_transaction == transaction, commit code
 	 * will only file the inode where we want it.
 	 */
-	if (jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction)
+	if ((jinode->i_transaction == transaction ||
+	    jinode->i_next_transaction == transaction) &&
+	    (jinode->i_flags & flags) == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
-
+	jinode->i_flags |= flags;
+	/* Is inode already attached where we need it? */
 	if (jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction)
 		goto done;
@@ -2537,6 +2540,17 @@ done:
 	return 0;
 }
 
+int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+				       JI_WRITE_DATA | JI_WAIT_DATA);
+}
+
+int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+}
+
 /*
  * File truncate and transaction commit interact with each other in a
  * non-trivial way.  If a transaction writing data block A is
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index f4cd3c3e9fb7..497a4171ef61 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -619,7 +619,7 @@ static inline int ocfs2_calc_tree_trunc_credits(struct super_block *sb,
 
 static inline int ocfs2_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
-	return jbd2_journal_file_inode(handle, &OCFS2_I(inode)->ip_jinode);
+	return jbd2_journal_inode_add_write(handle, &OCFS2_I(inode)->ip_jinode);
 }
 
 static inline int ocfs2_begin_ordered_truncate(struct inode *inode,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 65407f6c9120..05a8d6762b2d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -408,11 +408,19 @@ static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 
 /* Flags in jbd_inode->i_flags */
 #define __JI_COMMIT_RUNNING 0
-/* Commit of the inode data in progress. We use this flag to protect us from
+#define __JI_WRITE_DATA 1
+#define __JI_WAIT_DATA 2
+
+/*
+ * Commit of the inode data in progress. We use this flag to protect us from
  * concurrent deletion of inode. We cannot use reference to inode for this
  * since we cannot afford doing last iput() on behalf of kjournald
  */
 #define JI_COMMIT_RUNNING (1 << __JI_COMMIT_RUNNING)
+/* Write allocated dirty buffers in this inode before commit */
+#define JI_WRITE_DATA (1 << __JI_WRITE_DATA)
+/* Wait for outstanding data writes for this inode before commit */
+#define JI_WAIT_DATA (1 << __JI_WAIT_DATA)
 
 /**
  * struct jbd_inode is the structure linking inodes in ordered mode
@@ -1274,7 +1282,8 @@ extern int	   jbd2_journal_clear_err  (journal_t *);
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
-extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.6.2


[-- Attachment #3: 0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch --]
[-- Type: text/x-patch, Size: 4272 bytes --]

>From e79ed02968040d8684ecfba7a9dca0a8af1c5830 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:25:02 +0100
Subject: [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  3 +++
 fs/ext4/ext4_jbd2.h   | 12 +++++++++++-
 fs/ext4/inode.c       | 10 +++++++---
 fs/ext4/move_extent.c |  2 +-
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc4910bce4ff..a095e833a6c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,6 +561,9 @@ enum {
 #define EXT4_GET_BLOCKS_ZERO			0x0200
 #define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE |\
 					EXT4_GET_BLOCKS_ZERO)
+	/* Caller will submit data before dropping transaction handle. This
+	 * allows jbd2 to avoid submitting data before commit. */
+#define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f1c940b38b30..09c1ef38cbe6 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -359,7 +359,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 	return 0;
 }
 
-static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
+static inline int ext4_jbd2_inode_add_write(handle_t *handle,
+					    struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
 		return jbd2_journal_inode_add_write(handle,
@@ -367,6 +368,15 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
+					   struct inode *inode)
+{
+	if (ext4_handle_valid(handle))
+		return jbd2_journal_inode_add_wait(handle,
+						   EXT4_I(inode)->jinode);
+	return 0;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 173da9d467d1..01ecc67985bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -692,7 +692,10 @@ out_sem:
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    ext4_should_order_data(inode)) {
-			ret = ext4_jbd2_file_inode(handle, inode);
+			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
+				ret = ext4_jbd2_inode_add_wait(handle, inode);
+			else
+				ret = ext4_jbd2_inode_add_write(handle, inode);
 			if (ret)
 				return ret;
 		}
@@ -2164,7 +2167,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	 * the data was copied into the page cache.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
@@ -3538,7 +3542,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_file_inode(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index fb6f11709ae6..0ce9795c7404 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,7 @@ data_copy:
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_file_inode(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
 
 unlock_pages:
 	unlock_page(pagep[0]);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-03-15 14:46                           ` Jan Kara
  2016-03-15 20:09                             ` Jan Kara
@ 2016-03-16  0:41                             ` HUANG Weller (CM/ESW12-CN)
  2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
  2 siblings, 0 replies; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-03-16  0:41 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o; +Cc: linux-ext4, Li, Michael

> 
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far.  I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


Sure, I will try to merge the patches and do the test on my kernel.

Huang weller

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-03-15 20:09                             ` Jan Kara
@ 2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
  2016-03-18  9:20                                 ` Jan Kara
  2016-06-22 11:55                               ` FW: " HUANG Weller (CM/ESW12-CN)
  1 sibling, 1 reply; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-03-16  2:30 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o; +Cc: linux-ext4, Li, Michael


> >
> > OK, I have something - Huang, can you check whether the attached
> > patches also fix your data exposure issues please? The first patch is
> > the original fix, patch two is a cleanup, patches 3 and 4 implement
> > the speedup suggested by Ted. Patches are only lightly tested so far.
> > I'll run more comprehensive tests later and in particular I want to
> > check whether the additional complexity actually brings us some
> > advantage at least for workloads which redirty pages in addition to
> > writing some new ones using delayed allocation.
> 
> OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
> 


Hi Kara,
Patches are applied on my kernel. Basically it is working. 
Power loss test is running now. Need some days to give you test result.
At same time, Some conflicts need your confirm:

Target Kernel 3.10.63.

Patch 0004 conflicts:
----------------------------------
- Don't have function:   *mpage_map_one_extent*
So , please review this segment code at  patch 0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch

@@ -1697,7 +1700,8 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
 	 * So use reserved blocks to allocate metadata if possible.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	if (ext4_should_dioread_nolock(mpd->inode))
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
 	if (mpd->b_state & (1 << BH_Delay))


- Don't have function *__ext4_block_zero_page_range*, so ignore the modification under this function.


Patch 0001 conflicts:
----------------------------------
- Don't have macro EXT4_GET_BLOCKS_ZERO is not available on kernel 3.10.63, so the patch is like this please help to review:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e48bd5a..03017a9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -751,6 +751,19 @@ has_zeroout:
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
@ 2016-03-18  9:20                                 ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-18  9:20 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN)
  Cc: Jan Kara, Theodore Ts'o, linux-ext4, Li, Michael

On Wed 16-03-16 02:30:13, HUANG Weller (CM/ESW12-CN) wrote:
> 
> > >
> > > OK, I have something - Huang, can you check whether the attached
> > > patches also fix your data exposure issues please? The first patch is
> > > the original fix, patch two is a cleanup, patches 3 and 4 implement
> > > the speedup suggested by Ted. Patches are only lightly tested so far.
> > > I'll run more comprehensive tests later and in particular I want to
> > > check whether the additional complexity actually brings us some
> > > advantage at least for workloads which redirty pages in addition to
> > > writing some new ones using delayed allocation.
> > 
> > OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
> > 
> 
> Patches are applied on my kernel. Basically it is working. 
> Power loss test is running now. Need some days to give you test result.

Thank you!

> At same time, Some conflicts need your confirm:

The resolution of the conflicts looks good.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: ext4 out of order when use cfq scheduler
  2016-03-15 14:46                           ` Jan Kara
  2016-03-15 20:09                             ` Jan Kara
  2016-03-16  0:41                             ` HUANG Weller (CM/ESW12-CN)
@ 2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
  2016-03-24 12:17                               ` Jan Kara
  2 siblings, 1 reply; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-03-24 10:16 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o; +Cc: linux-ext4, Li, Michael

> 
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far.  I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
> 

Test done.
Both targets(kernel 3.10.63) PASS the power loss test with 10,000 cycles. Test with io-scheduler CFQ.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: ext4 out of order when use cfq scheduler
  2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
@ 2016-03-24 12:17                               ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-24 12:17 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN)
  Cc: Jan Kara, Theodore Ts'o, linux-ext4, Li, Michael

On Thu 24-03-16 10:16:05, HUANG Weller (CM/ESW12-CN) wrote:
> > 
> > OK, I have something - Huang, can you check whether the attached patches also
> > fix your data exposure issues please? The first patch is the original fix, patch two
> > is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> > are only lightly tested so far.  I'll run more comprehensive tests later and in
> > particular I want to check whether the additional complexity actually brings us
> > some advantage at least for workloads which redirty pages in addition to writing
> > some new ones using delayed allocation.
> > 
> 
> Test done.
> Both targets(kernel 3.10.63) PASS the power loss test with 10,000 cycles. Test with io-scheduler CFQ.

Thanks for testing! I'll submit those patches once I verify there is some
performance gain in only waiting...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* FW: ext4 out of order when use cfq scheduler
  2016-03-15 20:09                             ` Jan Kara
  2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
@ 2016-06-22 11:55                               ` HUANG Weller (CM/ESW12-CN)
  2016-06-22 13:09                                 ` Jan Kara
  1 sibling, 1 reply; 30+ messages in thread
From: HUANG Weller (CM/ESW12-CN) @ 2016-06-22 11:55 UTC (permalink / raw)
  To: jack; +Cc: linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 3298 bytes --]

Hi Kara,

I saw the patch " fix data exposure after a crash " in the kernel 4.6.2 release.
I remember you provided 2 more patches for performance optimization. Could you tell me whether they are necessary ?

Thanks. 
Best regards

 Weller HUANG



-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Wednesday, March 16, 2016 4:10 AM
To: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>; HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>; linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
Subject: Re: ext4 out of order when use cfq scheduler

On Tue 15-03-16 15:46:33, Jan Kara wrote:
> On Tue 15-03-16 11:46:34, Jan Kara wrote:
> > On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > > No, that won't be enough. blkdev_issue_flush() is not guaranteed 
> > > > to do anything to IOs which have not reported completion before
> > > > blkdev_issue_flush() was called. Specifically, CFQ will queue 
> > > > submitted bio in its internal RB tree, following flush request 
> > > > completely bypasses this tree and goes directly to the disk 
> > > > where it flushes caches. And only later CFQ decides to schedule 
> > > > async writeback from the flusher thread which is queued in the RB tree...
> > > 
> > > Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> > > Thanks for pointing that out.
> > > 
> > > But what we *could* do is to swap those two calls and then in the 
> > > case where delalloc is enabled, could maintain a list of inodes 
> > > where we only need to call filemap_fdatawait(), and not initiate 
> > > writeback for any dirty pages which had been caused by non-allocating writes.
> > 
> > We actually don't need to swap those two calls - page is already 
> > marked as under writeback in
> > 
> >   mpage_map_and_submit_buffers() -> mpage_submit_page -> 
> > ext4_bio_write_page
> > 
> > which gets called while we still hold the transaction handle. I 
> > agree calling filemap_fdatawait() from JBD2 during commit should be 
> > enough to fix issues with delalloc writeback. I'm just somewhat 
> > afraid that it will be more fragile: If we add inode to 
> > transaction's list in ext4_map_blocks(), we are pretty sure there's 
> > no way to allocate block to an inode without introducing data 
> > exposure issues (which are then very hard to spot). If we depend on 
> > callers of ext4_map_blocks() to properly add inode to appropriate 
> > transaction list, we have much more places to check. I'll think whether we could make this more robust.
> 
> OK, I have something - Huang, can you check whether the attached 
> patches also fix your data exposure issues please? The first patch is 
> the original fix, patch two is a cleanup, patches 3 and 4 implement 
> the speedup suggested by Ted. Patches are only lightly tested so far.  
> I'll run more comprehensive tests later and in particular I want to 
> check whether the additional complexity actually brings us some 
> advantage at least for workloads which redirty pages in addition to 
> writing some new ones using delayed allocation.

OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
							Honza

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch --]
[-- Type: text/x-patch; name="0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch", Size: 7651 bytes --]

From 5e3bf6cbe750979e003b132aceda655ebd5b4f2a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:07:22 +0100
Subject: [PATCH 3/4] jbd2: Add support for avoiding data writes during
 transaction commits

Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h   |  3 ++-
 fs/jbd2/commit.c      |  4 ++++
 fs/jbd2/journal.c     |  3 ++-
 fs/jbd2/transaction.c | 22 ++++++++++++++++++----
 fs/ocfs2/journal.h    |  2 +-
 include/linux/jbd2.h  | 13 +++++++++++--
 6 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f5846211095..f1c940b38b30 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -362,7 +362,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_add_write(handle,
+						    EXT4_I(inode)->jinode);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 36345fefa3ff..7d1dbe93d290 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -221,6 +221,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WRITE_DATA))
+			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
@@ -258,6 +260,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WAIT_DATA))
+			continue;
 		set_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
 		spin_unlock(&journal->j_list_lock);
 		err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 81e622681c82..8734b77ae47a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -94,7 +94,8 @@ EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidatepage);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
-EXPORT_SYMBOL(jbd2_journal_file_inode);
+EXPORT_SYMBOL(jbd2_journal_inode_add_write);
+EXPORT_SYMBOL(jbd2_journal_inode_add_wait);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ca181e81c765..d1b5c0f465b5 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2476,7 +2476,8 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 /*
  * File inode in the inode list of the handle's transaction
  */
-int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
+static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
+				   unsigned long flags)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2501,12 +2502,14 @@ int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
 	 * and if jinode->i_next_transaction == transaction, commit code
 	 * will only file the inode where we want it.
 	 */
-	if (jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction)
+	if ((jinode->i_transaction == transaction ||
+	    jinode->i_next_transaction == transaction) &&
+	    (jinode->i_flags & flags) == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
-
+	jinode->i_flags |= flags;
+	/* Is inode already attached where we need it? */
 	if (jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction)
 		goto done;
@@ -2537,6 +2540,17 @@ done:
 	return 0;
 }
 
+int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+				       JI_WRITE_DATA | JI_WAIT_DATA);
+}
+
+int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+}
+
 /*
  * File truncate and transaction commit interact with each other in a
  * non-trivial way.  If a transaction writing data block A is
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index f4cd3c3e9fb7..497a4171ef61 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -619,7 +619,7 @@ static inline int ocfs2_calc_tree_trunc_credits(struct super_block *sb,
 
 static inline int ocfs2_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
-	return jbd2_journal_file_inode(handle, &OCFS2_I(inode)->ip_jinode);
+	return jbd2_journal_inode_add_write(handle, &OCFS2_I(inode)->ip_jinode);
 }
 
 static inline int ocfs2_begin_ordered_truncate(struct inode *inode,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 65407f6c9120..05a8d6762b2d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -408,11 +408,19 @@ static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 
 /* Flags in jbd_inode->i_flags */
 #define __JI_COMMIT_RUNNING 0
-/* Commit of the inode data in progress. We use this flag to protect us from
+#define __JI_WRITE_DATA 1
+#define __JI_WAIT_DATA 2
+
+/*
+ * Commit of the inode data in progress. We use this flag to protect us from
  * concurrent deletion of inode. We cannot use reference to inode for this
  * since we cannot afford doing last iput() on behalf of kjournald
  */
 #define JI_COMMIT_RUNNING (1 << __JI_COMMIT_RUNNING)
+/* Write allocated dirty buffers in this inode before commit */
+#define JI_WRITE_DATA (1 << __JI_WRITE_DATA)
+/* Wait for outstanding data writes for this inode before commit */
+#define JI_WAIT_DATA (1 << __JI_WAIT_DATA)
 
 /**
  * struct jbd_inode is the structure linking inodes in ordered mode
@@ -1274,7 +1282,8 @@ extern int	   jbd2_journal_clear_err  (journal_t *);
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
-extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.6.2


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch --]
[-- Type: text/x-patch; name="0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch", Size: 4389 bytes --]

From e79ed02968040d8684ecfba7a9dca0a8af1c5830 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:25:02 +0100
Subject: [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  3 +++
 fs/ext4/ext4_jbd2.h   | 12 +++++++++++-
 fs/ext4/inode.c       | 10 +++++++---
 fs/ext4/move_extent.c |  2 +-
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc4910bce4ff..a095e833a6c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,6 +561,9 @@ enum {
 #define EXT4_GET_BLOCKS_ZERO			0x0200
 #define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE |\
 					EXT4_GET_BLOCKS_ZERO)
+	/* Caller will submit data before dropping transaction handle. This
+	 * allows jbd2 to avoid submitting data before commit. */
+#define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f1c940b38b30..09c1ef38cbe6 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -359,7 +359,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 	return 0;
 }
 
-static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
+static inline int ext4_jbd2_inode_add_write(handle_t *handle,
+					    struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
 		return jbd2_journal_inode_add_write(handle,
@@ -367,6 +368,15 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
+					   struct inode *inode)
+{
+	if (ext4_handle_valid(handle))
+		return jbd2_journal_inode_add_wait(handle,
+						   EXT4_I(inode)->jinode);
+	return 0;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 173da9d467d1..01ecc67985bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -692,7 +692,10 @@ out_sem:
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    ext4_should_order_data(inode)) {
-			ret = ext4_jbd2_file_inode(handle, inode);
+			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
+				ret = ext4_jbd2_inode_add_wait(handle, inode);
+			else
+				ret = ext4_jbd2_inode_add_write(handle, inode);
 			if (ret)
 				return ret;
 		}
@@ -2164,7 +2167,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	 * the data was copied into the page cache.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
@@ -3538,7 +3542,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_file_inode(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index fb6f11709ae6..0ce9795c7404 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,7 @@ data_copy:
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_file_inode(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
 
 unlock_pages:
 	unlock_page(pagep[0]);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: FW: ext4 out of order when use cfq scheduler
  2016-06-22 11:55                               ` FW: " HUANG Weller (CM/ESW12-CN)
@ 2016-06-22 13:09                                 ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-06-22 13:09 UTC (permalink / raw)
  To: HUANG Weller (CM/ESW12-CN); +Cc: jack, linux-ext4, Theodore Ts'o

Hi,

On Wed 22-06-16 11:55:12, HUANG Weller (CM/ESW12-CN) wrote:
> I saw the patch " fix data exposure after a crash " in the kernel 4.6.2
> release.  I remember you provided 2 more patches for performance
> optimization. Could you tell me whether they are necessary ?

They are not necessary. They may just improve performance in some cases...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-06-22 13:09 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-22  6:24 ext4 out of order when use cfq scheduler HUANG Weller (CM/EPF1-CN)
2015-12-22 15:00 ` Jan Kara
     [not found]   ` <c67f356b63d94d35ad010a6e987b68f0@SGPMBX1004.APAC.bosch.com>
2016-01-05 15:30     ` Jan Kara
2016-01-06  2:39       ` HUANG Weller (CM/ESW12-CN)
2016-01-06 19:17         ` Andreas Dilger
2016-01-07  6:51           ` HUANG Weller (CM/ESW12-CN)
     [not found]         ` <20160106100621.GA24046@quack.suse.cz>
     [not found]           ` <3ab48fa47e434455b101251730e69bd2@SGPMBX1004.APAC.bosch.com>
2016-01-07 10:24             ` Jan Kara
2016-01-07 11:02               ` HUANG Weller (CM/ESW12-CN)
2016-01-07 11:47                 ` Jan Kara
2016-01-07 12:19                   ` Jan Kara
2016-01-08  2:18                     ` HUANG Weller (CM/ESW12-CN)
2016-01-08  0:46                   ` HUANG Weller (CM/ESW12-CN)
2016-01-11  9:05                   ` HUANG Weller (CM/ESW12-CN)
2016-01-11 10:21                     ` Jan Kara
2016-03-13  4:27                   ` Theodore Ts'o
2016-03-14  2:43                     ` HUANG Weller (CM/ESW12-CN)
2016-03-14  7:39                     ` Jan Kara
2016-03-14 14:36                       ` Theodore Ts'o
2016-03-15 10:46                         ` Jan Kara
2016-03-15 14:46                           ` Jan Kara
2016-03-15 20:09                             ` Jan Kara
2016-03-16  2:30                               ` HUANG Weller (CM/ESW12-CN)
2016-03-18  9:20                                 ` Jan Kara
2016-06-22 11:55                               ` FW: " HUANG Weller (CM/ESW12-CN)
2016-06-22 13:09                                 ` Jan Kara
2016-03-16  0:41                             ` HUANG Weller (CM/ESW12-CN)
2016-03-24 10:16                             ` HUANG Weller (CM/ESW12-CN)
2016-03-24 12:17                               ` Jan Kara
2016-01-28  8:02 ` Xiong Zhou
2016-02-03  6:08   ` HUANG Weller (CM/ESW12-CN)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.