* RE: bluestore bug#15724: bluefs _replay file with link count 0:
@ 2016-07-15 20:21 Kevan Rehm
  2016-07-16  7:23 ` Varada Kari
From: Kevan Rehm @ 2016-07-15 20:21 UTC (permalink / raw)
  To: ceph-devel (ceph-devel@vger.kernel.org)

Greetings,

I have made a bit of progress.  Still looking for ideas/advice while I
keep digging.

First, here is the ceph version I forgot to mention last time.

# ceph -v
ceph version 11.0.0-196-g85bb43e (85bb43e111692989d2296a389ce45377d2297d6f)



As a refresher, here is the beginning of the problem transaction which
will result in an inode with 0 links.   There are actually three inodes in
this transaction which will all end up having 0 links, inodes 109, 102,
and 113.

2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
txn(seq 77752 len 35269 crc 1796157826)
2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
op_dir_unlink  db/000154.log
2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_remove 109
2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
op_dir_unlink  db/000153.log
2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_remove 102
2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
op_dir_unlink  db/000151.log
2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_remove 113
2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
op_dir_unlink  db/000150.log
2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_remove 111
2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
17:05:53.738683 bdev 1 extents
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafc880 (new)
2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
17:05:54.615190 bdev 1 extents
[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+209
7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:22754099
2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251
658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,
1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+104
8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:31876710
4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342
884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,
1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+104
8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:40579891
2+3145728,1:507510784+5242880])
2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
0x7fb66aafbb90 (new)
2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
22:41:33.028030 bdev 1 extents
[1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
0x7fb66aafbe60 (new)
...


The prior transaction log entry for each of the three inodes was another
op_file_update() call.  The only difference between the two updates for
ino 113 was that the file size went from 71359362 -> 71354938, 4424 fewer
bytes; all other fields are identical.

2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
17:05:53.738683 bdev 1 extents
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90


Same story for inode 109: the file size changed from 71354939 -> 64421913,
6933026 fewer bytes; all other fields are identical.

2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
22:41:33.028030 bdev 1 extents
[1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
0x7fb66aafc910





Same story for inode 102: the file size changed from 71359405 -> 68557043,
2802362 fewer bytes; all other fields are identical.

2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
17:05:54.615190 bdev 1 extents
[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+209
7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:22754099
2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251
658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,
1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+104
8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:31876710
4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342
884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,
1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+104
8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:40579891
2+3145728,1:507510784+5242880])
2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
0x7fb66aafbe60




There is only one caller of op_file_update() in which the size of the file
is reduced but all other fields stay the same:
BlueFS::_truncate(), which in turn is only called by BlueFS::truncate(),
which is only called by BlueRocksWritableFile::Close().
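
(For reference, a minimal sketch of that truncate path, paraphrased from my
reading of BlueFS.cc rather than quoted verbatim; only the size field of the
fnode changes before the fnode is logged again:)

int BlueFS::_truncate(FileWriter *h, uint64_t offset)
{
  dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
           << " file " << h->file->fnode << dendl;

  // ... trim any buffered data beyond 'offset' ...

  h->file->fnode.size = offset;          // the only fnode field that changes
  log_t.op_file_update(h->file->fnode);  // appends an op_file_update entry
  return 0;
}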

The op_dir_unlink() and op_file_remove() calls for each of the three
inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm having
difficulty finding where that call originates.


So the problem appears to be that the file is first deleted and then
closed, and a byproduct of the close operation on a stale fnode is that
another op_file_update() entry gets logged, recreating the deleted inode
and resulting in the 0-links failure.
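
(If I read _replay() correctly, the check that fires is roughly of this
shape, paraphrased rather than exact: any file still present after replay
with zero links is treated as corruption.)

  // at the end of BlueFS::_replay(), approximately:
  for (auto& p : file_map) {
    if (p.second->refs == 0) {
      derr << __func__ << " file with link count 0: "
           << p.second->fnode << dendl;
      // the real code fails the replay here (assert / error return)
    }
  }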


Kevan



* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-15 20:21 bluestore bug#15724: bluefs _replay file with link count 0: Kevan Rehm
@ 2016-07-16  7:23 ` Varada Kari
  2016-07-16 11:23   ` Kevan Rehm
From: Varada Kari @ 2016-07-16  7:23 UTC (permalink / raw)
  To: Kevan Rehm, ceph-devel (ceph-devel@vger.kernel.org)

On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
> Greetings,
>
> I have made a bit of progress.  Still looking for ideas/advise while I
> keep digging.
>
> First, the ceph version I forgot to mention last time.
>
> # ceph -v
> ceph version 11.0.0-196-g85bb43e (85bb43e111692989d2296a389ce45377d2297d6f)
>
>
>
> As a refresher, here is the beginning of the problem transaction which
> will result in an inode with 0 links.   There are actually three inodes in
> this transaction which will all end up having 0 links, inodes 109, 102,
> and 113.
>
> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
> txn(seq 77752 len 35269 crc 1796157826)
> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_dir_unlink  db/000154.log
> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_remove 109
> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_dir_unlink  db/000153.log
> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_remove 102
> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_dir_unlink  db/000151.log
> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_remove 113
> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_dir_unlink  db/000150.log
> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_remove 111
> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
> 17:05:53.738683 bdev 1 extents
> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
> 16+58720256])
> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
> 0x7fb66aafc880 (new)
When _get_file() is called with an inode number and the file is not found
in file_map, a new file is created and returned, hence the (new) in the log.
This set of events can happen in the case of a rename; you can see the same
in BlueFS::rename(), where we reuse the same inode number.
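
Roughly, the lookup-or-create path looks like this (paraphrased, not the
exact source):

FileRef BlueFS::_get_file(uint64_t ino)
{
  auto p = file_map.find(ino);
  if (p == file_map.end()) {
    FileRef f = new File;    // fresh File; fnode/refs not yet filled in
    file_map[ino] = f;
    dout(30) << __func__ << " ino " << ino << " = " << f << " (new)" << dendl;
    return f;
  }
  dout(30) << __func__ << " ino " << ino << " = " << p->second << dendl;
  return p->second;
}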

> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
> 17:05:54.615190 bdev 1 extents
> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+209
> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:22754099
> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251
> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,
> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+104
> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:31876710
> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342
> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,
> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+104
> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:40579891
> 2+3145728,1:507510784+5242880])
> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
> 0x7fb66aafbb90 (new)
> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
> 22:41:33.028030 bdev 1 extents
> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
> 0x7fb66aafbe60 (new)
> ...
>
>
> The prior transaction log entry for each of the three inodes was another
> op_file_update() call.  The only difference between the two updates for
> ino 113 was file size went from 71359362 -> 71354938, 4424 less bytes, all
> other fields are identical.
Are the extents the same for both events?
> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
> 17:05:53.738683 bdev 1 extents
> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
> 16+58720256])
> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
> 0x7fb66aafbb90
>
>
> Same story for inode 109, the file size changed from 71354939 -> 64421913,
> 6933026 less bytes, all other fields are identical.
>
> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
> 22:41:33.028030 bdev 1 extents
> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
> 0x7fb66aafc910
>
>
>
>
>
> Same story for inode 102, the file size changed from 71359405 -> 68557043,
> 2802362 less bytes, all other fields are identical.
>
> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
> 17:05:54.615190 bdev 1 extents
> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+209
> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:22754099
> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251
> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,
> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+104
> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:31876710
> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342
> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,
> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+104
> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:40579891
> 2+3145728,1:507510784+5242880])
> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
> 0x7fb66aafbe60
>
>
>
>
> There is only one caller of op_file_update() in which the size of the file
> is reduced but all other fields stay the same, that is
> BlueFS::_truncate(), which in turn is only called by BlueFS::truncate(),
> which is only called by BlueRocksWritableFile::Close().
>
> The op_dir_unlink() and op_file_remove() calls for each of the three
> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm having
> difficulty finding where the call is coming from.
Any call making a directory rename, file remove, or directory remove will
result in BlueFS::rename or unlink (or the corresponding calls), which in
turn produce op_dir_* or op_file_* operations.  The rocksdb wal_manager or
compaction can trigger these calls during that process.  I will take a
closer look at the logs to see if I can get something from them.
>
> So the problem appears to be that the file is first deleted, then it is
> closed, and a byproduct of the close operation  on a stale fnode is that
> another op_file_update() entry gets logged, recreating the deleted inode
> with a stale fnode, resulting in the 0 links failure.
Once the file is deleted, all the underlying extents will be cleared.
Unless there is a race in handling the files at the application level,
BlueFS doesn't have any control over them.  If a file (handle) is being
reused after deletion, that seems to be a bug.  I will take a look at the
logs to confirm your theory.  But this seems to be more of a rename use
case; one more thing to verify is the extents owned by the inodes mentioned.

Varada
>
> Kevan
>
>



* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-16  7:23 ` Varada Kari
@ 2016-07-16 11:23   ` Kevan Rehm
  2016-07-16 12:16     ` Varada Kari
From: Kevan Rehm @ 2016-07-16 11:23 UTC (permalink / raw)
  To: Varada Kari, ceph-devel (ceph-devel@vger.kernel.org)

Varada, see below.

On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>> Greetings,
>>
>> I have made a bit of progress.  Still looking for ideas/advise while I
>> keep digging.
>>
>> First, the ceph version I forgot to mention last time.
>>
>> # ceph -v
>> ceph version 11.0.0-196-g85bb43e
>>(85bb43e111692989d2296a389ce45377d2297d6f)
>>
>>
>>
>> As a refresher, here is the beginning of the problem transaction which
>> will result in an inode with 0 links.   There are actually three inodes
>>in
>> this transaction which will all end up having 0 links, inodes 109, 102,
>> and 113.
>>
>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>> txn(seq 77752 len 35269 crc 1796157826)
>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_dir_unlink  db/000154.log
>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_remove 109
>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_dir_unlink  db/000153.log
>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_remove 102
>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_dir_unlink  db/000151.log
>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_remove 113
>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_dir_unlink  db/000150.log
>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_remove 111
>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>> 17:05:53.738683 bdev 1 extents
>> 
>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+
>>20
>> 
>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96049
>>56
>> 16+58720256])
>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>> 0x7fb66aafc880 (new)
>When _get_file() is called with a inode number, if the file is not found
>in file_map
>a new file is created and returned, hence that (new) in the log file.
>These set of events
>can happen in case of rename. You can see the same BlueFS::rename(). In
>that case, we use the
>same inode number.
 
I'm not sure I understand; I do not believe this is a rename case.  For
this to be a rename, there would have to be at least one op_dir_link()
call for each op_dir_unlink() call in the sequence above.   The only code
sequence I can find where an op_dir_unlink() and an op_file_remove()
appear together without a corresponding op_dir_link() is when a file is
being deleted.
 
The (new) above makes my point exactly.  The op_file_remove() call above
freed the block allocation for the inode and released the inode from the
filemap, showing that the file->refs had to be zero at that time (routine
_drop_link).   Yet the following op_file_update() call is performing a
truncate on that same inode, which is not in the filemap.   So the caller
has to be holding a file structure whose refs field is greater than zero,
which implies that the caller's file structure has to be stale.  The fact
that the op_file_update file contents for inode 113 are exactly the same as
the previous op_file_update contents prior to the op_file_remove(), except
for a reduction in file size, proves, I believe, that the file struct used
by the op_file_update() caller is stale.
 
 
>
>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>> 17:05:54.615190 bdev 1 extents
>> 
>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2
>>09
>> 
>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540
>>99
>> 
>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:2
>>51
>> 
>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209715
>>2,
>> 
>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1
>>04
>> 
>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767
>>10
>> 
>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:3
>>42
>> 
>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209715
>>2,
>> 
>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1
>>04
>> 
>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798
>>91
>> 2+3145728,1:507510784+5242880])
>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>> 0x7fb66aafbb90 (new)
>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>> 22:41:33.028030 bdev 1 extents
>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>> 0x7fb66aafbe60 (new)
>> ...
>>
>>
>> The prior transaction log entry for each of the three inodes was another
>> op_file_update() call.  The only difference between the two updates for
>> ino 113 was file size went from 71359362 -> 71354938, 4424 less bytes,
>>all
>> other fields are identical.
>Are the extents same for both the events?
 
Yes, please compare this line with the one higher above in this email; you
can see that only the file size is different.   That is true for all three
of these inodes.
 
>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>> 17:05:53.738683 bdev 1 extents
>> 
>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+
>>20
>> 
>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96049
>>56
>> 16+58720256])
>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>> 0x7fb66aafbb90
>>
>>
>> Same story for inode 109, the file size changed from 71354939 ->
>>64421913,
>> 6933026 less bytes, all other fields are identical.
>>
>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>> 22:41:33.028030 bdev 1 extents
>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>> 0x7fb66aafc910
>>
>>
>>
>>
>>
>> Same story for inode 102, the file size changed from 71359405 ->
>>68557043,
>> 2802362 less bytes, all other fields are identical.
>>
>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>> 17:05:54.615190 bdev 1 extents
>> 
>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2
>>09
>> 
>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540
>>99
>> 
>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:2
>>51
>> 
>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209715
>>2,
>> 
>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1
>>04
>> 
>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767
>>10
>> 
>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:3
>>42
>> 
>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209715
>>2,
>> 
>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1
>>04
>> 
>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798
>>91
>> 2+3145728,1:507510784+5242880])
>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>> 0x7fb66aafbe60
>>
>>
>>
>>
>> There is only one caller of op_file_update() in which the size of the
>>file
>> is reduced but all other fields stay the same, that is
>> BlueFS::_truncate(), which in turn is only called by BlueFS::truncate(),
>> which is only called by BlueRocksWritableFile::Close().
>>
>> The op_dir_unlink() and op_file_remove() calls for each of the three
>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm having
>> difficulty finding where the call is coming from.
>Any syscall making dir rename or file remove or directory remove will
>result in BlueFS::rename or unlink
>or corresponding calls and results in op_dir_* or op_file_* operations.
>rocksdb wal_manager or compaction
>can trigger these calls during that process. Will take a closer look at
>the logs, if i can get something from them.
>>
>> So the problem appears to be that the file is first deleted, then it is
>> closed, and a byproduct of the close operation  on a stale fnode is that
>> another op_file_update() entry gets logged, recreating the deleted inode
>> with a stale fnode, resulting in the 0 links failure.
>Once the file is deleted, all the underlying extents will be cleared.
>Until and unless there is race in handling on
>the files from application level, Bluefs doesn't have any control over
>them. If the a file(handle) is been reused after
>deletion, that seems to be a bug. Will take a look at the logs to
>confirm your theory.
>But here, this seems to be more of rename use case. one more thing to
>verify is the extents owned by the inodes mentioned.
 
No, I don't believe this is a rename; see above.

I'll note that every occurrence of this problem has been preceded by an
OSD segfault.  And this problem is racy: if I increase debug logging for
bluefs or bluestore, the problem goes away.  I have 50+ bluestore
segfaults and asserts in my cluster logs; I suspect many could be due to
this problem, because the same range of blocks is being freed more than
once.   If some other file should perform an allocation between the
block-freeing in op_file_remove() and the second freeing in
op_file_update(), then we could easily have multiple files with the same
block allocation at the same time.

I think this problem can be caught fairly easily: an assert could be added
to _truncate().   The _truncate() routine is only called for files which
are open, which means that the inode has to be in the filemap.   Code
could be added at the front of _truncate() to look up the inode in the
filemap, and if it is not there, then assert.   That would provide a stack
trace showing the caller using the file struct with the stale reference
count.   (It may be a few days before I have a chance to try this myself.)

Regards, Kevan
 



* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-16 11:23   ` Kevan Rehm
@ 2016-07-16 12:16     ` Varada Kari
  2016-07-16 14:35       ` Kevan Rehm
From: Varada Kari @ 2016-07-16 12:16 UTC (permalink / raw)
  To: Kevan Rehm, ceph-devel (ceph-devel@vger.kernel.org)

On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
> Varada, see below.
>
> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>
>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>> Greetings,
>>>
>>> I have made a bit of progress.  Still looking for ideas/advise while I
>>> keep digging.
>>>
>>> First, the ceph version I forgot to mention last time.
>>>
>>> # ceph -v
>>> ceph version 11.0.0-196-g85bb43e
>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>
>>>
>>>
>>> As a refresher, here is the beginning of the problem transaction which
>>> will result in an inode with 0 links.   There are actually three inodes
>>> in
>>> this transaction which will all end up having 0 links, inodes 109, 102,
>>> and 113.
>>>
>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>> txn(seq 77752 len 35269 crc 1796157826)
>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_dir_unlink  db/000154.log
>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_remove 109
>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_dir_unlink  db/000153.log
>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_remove 102
>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_dir_unlink  db/000151.log
>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_remove 113
>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_dir_unlink  db/000150.log
>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_remove 111
>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>> 17:05:53.738683 bdev 1 extents
>>>
>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+
>>> 20
>>>
>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96049
>>> 56
>>> 16+58720256])
>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>> 0x7fb66aafc880 (new)
>> When _get_file() is called with a inode number, if the file is not found
>> in file_map
>> a new file is created and returned, hence that (new) in the log file.
>> These set of events
>> can happen in case of rename. You can see the same BlueFS::rename(). In
>> that case, we use the
>> same inode number.
>
> I'm not sure I understand, I do not believe this is a rename case.  For
> this to be a rename, there would have to be at least one op_dir_link()
> call for each op_dir_unlink() call in the sequence above.   The only code
> sequence I can find where an op_dir_unlink() and an op_file_remove()
> appear together without a corresponding op_dir_link() is when a file is
> being deleted.
>
> The (new) above makes my point exactly.  The op_file_remove() call above
> freed the block allocation for the inode and released the inode from the
> filemap, showing that the file->refs had to be zero at that time (routine
> _drop_link).   Yet the following op_file_update() call is performing a
> truncate on that same inode, which is not in the filemap.   So the caller
> has to be holding a file structure whose refs field is greater than zero.
> Which implies that the caller's file structure has to be stale.  The fact
> that the op_file_update file contents for inode 113 is exactly the same as
> the previous op_file_update contents prior to the op_file_remove(), except
> for a reduction in file size, proves I believe that the file struct used
> by the op_file_update() caller is stale.
>
>
>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>> 17:05:54.615190 bdev 1 extents
>>>
>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2
>>> 09
>>>
>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540
>>> 99
>>>
>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:2
>>> 51
>>>
>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209715
>>> 2,
>>>
>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1
>>> 04
>>>
>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767
>>> 10
>>>
>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:3
>>> 42
>>>
>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209715
>>> 2,
>>>
>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1
>>> 04
>>>
>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798
>>> 91
>>> 2+3145728,1:507510784+5242880])
>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>> 0x7fb66aafbb90 (new)
>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>> 22:41:33.028030 bdev 1 extents
>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>> 0x7fb66aafbe60 (new)
>>> ...
>>>
>>>
>>> The prior transaction log entry for each of the three inodes was another
>>> op_file_update() call.  The only difference between the two updates for
>>> ino 113 was file size went from 71359362 -> 71354938, 4424 less bytes,
>>> all
>>> other fields are identical.
>> Are the extents same for both the events?
>
> Yes, please compare this line with the one higher above in this email, you
> can see that only the file size is different.   That is true for all three
> of these inodes.
>
>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>> 17:05:53.738683 bdev 1 extents
>>>
>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+
>>> 20
>>>
>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96049
>>> 56
>>> 16+58720256])
>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>> 0x7fb66aafbb90
>>>
>>>
>>> Same story for inode 109, the file size changed from 71354939 ->
>>> 64421913,
>>> 6933026 less bytes, all other fields are identical.
>>>
>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>> 22:41:33.028030 bdev 1 extents
>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>> 0x7fb66aafc910
>>>
>>>
>>>
>>>
>>>
>>> Same story for inode 102, the file size changed from 71359405 ->
>>> 68557043,
>>> 2802362 less bytes, all other fields are identical.
>>>
>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>> 17:05:54.615190 bdev 1 extents
>>>
>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2
>>> 09
>>>
>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540
>>> 99
>>>
>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:2
>>> 51
>>>
>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209715
>>> 2,
>>>
>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1
>>> 04
>>>
>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767
>>> 10
>>>
>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:3
>>> 42
>>>
>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209715
>>> 2,
>>>
>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1
>>> 04
>>>
>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798
>>> 91
>>> 2+3145728,1:507510784+5242880])
>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>> 0x7fb66aafbe60
>>>
>>>
>>>
>>>
>>> There is only one caller of op_file_update() in which the size of the
>>> file
>>> is reduced but all other fields stay the same, that is
>>> BlueFS::_truncate(), which in turn is only called by BlueFS::truncate(),
>>> which is only called by BlueRocksWritableFile::Close().
>>>
>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm having
>>> difficulty finding where the call is coming from.
>> Any syscall making dir rename or file remove or directory remove will
>> result in BlueFS::rename or unlink
>> or corresponding calls and results in op_dir_* or op_file_* operations.
>> rocksdb wal_manager or compaction
>> can trigger these calls during that process. Will take a closer look at
>> the logs, if i can get something from them.
>>> So the problem appears to be that the file is first deleted, then it is
>>> closed, and a byproduct of the close operation  on a stale fnode is that
>>> another op_file_update() entry gets logged, recreating the deleted inode
>>> with a stale fnode, resulting in the 0 links failure.
>> Once the file is deleted, all the underlying extents will be cleared.
>> Until and unless there is race in handling on
>> the files from application level, Bluefs doesn't have any control over
>> them. If the a file(handle) is been reused after
>> deletion, that seems to be a bug. Will take a look at the logs to
>> confirm your theory.
>> But here, this seems to be more of rename use case. one more thing to
>> verify is the extents owned by the inodes mentioned.
>
> No, I don't believe this is a rename, see above.
>
> I'll note that every occurrence of this problem has been preceded by an
> OSD segfault.  And this problem is racy, if I increase debug logging for
> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
> segfaults and asserts in my cluster logs; I suspect many could be due to
> this problem, because the same range of blocks is being freed more than
> once.   If some other file should perform an allocation between the
> block-freeing in op_file_remove() and the second freeing in
> op_file_update(), then we could easily have multiple files with the same
> block allocation at the same time.
>
> I think this problem can be caught fairly easily, an assert could be added
> to _truncate().   The _truncate() routine is only called for files which
> are open, which means that the inode has to be in the filemap.   Code
> could be added at the front of _truncate() to look up the inode in the
> filemap, and if it is not there, then assert.   That would provide a stack
> trace showing the caller using the file struct with the stale reference
> count.   (It may be a few days before I have a chance to try this myself.)
Sure. Could you please let me know the steps to hit the issue?
I am only doing a few short runs to check performance and some unit test
cases.
I will add the necessary asserts if I can reproduce the issue, and fix it.

Varada
> Regards, Kevan
>
>
>



* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-16 12:16     ` Varada Kari
@ 2016-07-16 14:35       ` Kevan Rehm
  2016-07-19 14:09         ` Kevan Rehm
From: Kevan Rehm @ 2016-07-16 14:35 UTC (permalink / raw)
  To: Varada Kari, ceph-devel (ceph-devel@vger.kernel.org)



On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>> Varada, see below.
>>
>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>
>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>> Greetings,
>>>>
>>>> I have made a bit of progress.  Still looking for ideas/advise while I
>>>> keep digging.
>>>>
>>>> First, the ceph version I forgot to mention last time.
>>>>
>>>> # ceph -v
>>>> ceph version 11.0.0-196-g85bb43e
>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>
>>>>
>>>>
>>>> As a refresher, here is the beginning of the problem transaction which
>>>> will result in an inode with 0 links.   There are actually three
>>>>inodes
>>>> in
>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>102,
>>>> and 113.
>>>>
>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_dir_unlink  db/000154.log
>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_remove 109
>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_dir_unlink  db/000153.log
>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_remove 102
>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_dir_unlink  db/000151.log
>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_remove 113
>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_dir_unlink  db/000150.log
>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_remove 111
>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>> 17:05:53.738683 bdev 1 extents
>>>>
>>>> 
>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113548
>>>>8+
>>>> 20
>>>>
>>>> 
>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960
>>>>49
>>>> 56
>>>> 16+58720256])
>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>> 0x7fb66aafc880 (new)
>>> When _get_file() is called with a inode number, if the file is not
>>>found
>>> in file_map
>>> a new file is created and returned, hence that (new) in the log file.
>>> These set of events
>>> can happen in case of rename. You can see the same BlueFS::rename(). In
>>> that case, we use the
>>> same inode number.
>>
>> I'm not sure I understand, I do not believe this is a rename case.  For
>> this to be a rename, there would have to be at least one op_dir_link()
>> call for each op_dir_unlink() call in the sequence above.   The only
>>code
>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>> appear together without a corresponding op_dir_link() is when a file is
>> being deleted.
>>
>> The (new) above makes my point exactly.  The op_file_remove() call above
>> freed the block allocation for the inode and released the inode from the
>> filemap, showing that the file->refs had to be zero at that time
>>(routine
>> _drop_link).   Yet the following op_file_update() call is performing a
>> truncate on that same inode, which is not in the filemap.   So the
>>caller
>> has to be holding a file structure whose refs field is greater than
>>zero.
>> Which implies that the caller's file structure has to be stale.  The
>>fact
>> that the op_file_update file contents for inode 113 is exactly the same
>>as
>> the previous op_file_update contents prior to the op_file_remove(),
>>except
>> for a reduction in file size, proves I believe that the file struct used
>> by the op_file_update() caller is stale.
>>
>>
>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>> 17:05:54.615190 bdev 1 extents
>>>>
>>>> 
>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168
>>>>+2
>>>> 09
>>>>
>>>> 
>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2275
>>>>40
>>>> 99
>>>>
>>>> 
>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1
>>>>:2
>>>> 51
>>>>
>>>> 
>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097
>>>>15
>>>> 2,
>>>>
>>>> 
>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856
>>>>+1
>>>> 04
>>>>
>>>> 
>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3187
>>>>67
>>>> 10
>>>>
>>>> 
>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1
>>>>:3
>>>> 42
>>>>
>>>> 
>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097
>>>>15
>>>> 2,
>>>>
>>>> 
>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816
>>>>+1
>>>> 04
>>>>
>>>> 
>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4057
>>>>98
>>>> 91
>>>> 2+3145728,1:507510784+5242880])
>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>> 0x7fb66aafbb90 (new)
>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>> 22:41:33.028030 bdev 1 extents
>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>> 0x7fb66aafbe60 (new)
>>>> ...
>>>>
>>>>
>>>> The prior transaction log entry for each of the three inodes was
>>>>another
>>>> op_file_update() call.  The only difference between the two updates
>>>>for
>>>> ino 113 was file size went from 71359362 -> 71354938, 4424 less bytes,
>>>> all
>>>> other fields are identical.
>>> Are the extents same for both the events?
>>
>> Yes, please compare this line with the one higher above in this email,
>>you
>> can see that only the file size is different.   That is true for all
>>three
>> of these inodes.
>>
>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>> 17:05:53.738683 bdev 1 extents
>>>>
>>>> 
>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113548
>>>>8+
>>>> 20
>>>>
>>>> 
>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960
>>>>49
>>>> 56
>>>> 16+58720256])
>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>> 0x7fb66aafbb90
>>>>
>>>>
>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>> 64421913,
>>>> 6933026 less bytes, all other fields are identical.
>>>>
>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>> 22:41:33.028030 bdev 1 extents
>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>> 0x7fb66aafc910
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>> 68557043,
>>>> 2802362 less bytes, all other fields are identical.
>>>>
>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>> 17:05:54.615190 bdev 1 extents
>>>>
>>>> 
>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168
>>>>+2
>>>> 09
>>>>
>>>> 
>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2275
>>>>40
>>>> 99
>>>>
>>>> 
>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1
>>>>:2
>>>> 51
>>>>
>>>> 
>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097
>>>>15
>>>> 2,
>>>>
>>>> 
>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856
>>>>+1
>>>> 04
>>>>
>>>> 
>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3187
>>>>67
>>>> 10
>>>>
>>>> 
>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1
>>>>:3
>>>> 42
>>>>
>>>> 
>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097
>>>>15
>>>> 2,
>>>>
>>>> 
>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816
>>>>+1
>>>> 04
>>>>
>>>> 
>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4057
>>>>98
>>>> 91
>>>> 2+3145728,1:507510784+5242880])
>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>> 0x7fb66aafbe60
>>>>
>>>>
>>>>
>>>>
>>>> There is only one caller of op_file_update() in which the size of the
>>>> file
>>>> is reduced but all other fields stay the same, that is
>>>> BlueFS::_truncate(), which in turn is only called by
>>>>BlueFS::truncate(),
>>>> which is only called by BlueRocksWritableFile::Close().
>>>>
>>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>having
>>>> difficulty finding where the call is coming from.
>>> Any syscall making dir rename or file remove or directory remove will
>>> result in BlueFS::rename or unlink
>>> or corresponding calls and results in op_dir_* or op_file_* operations.
>>> rocksdb wal_manager or compaction
>>> can trigger these calls during that process. Will take a closer look at
>>> the logs, if i can get something from them.
>>>> So the problem appears to be that the file is first deleted, then it
>>>>is
>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>that
>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>inode
>>>> with a stale fnode, resulting in the 0 links failure.
>>> Once the file is deleted, all the underlying extents will be cleared.
>>> Until and unless there is race in handling on
>>> the files from application level, Bluefs doesn't have any control over
>>> them. If the a file(handle) is been reused after
>>> deletion, that seems to be a bug. Will take a look at the logs to
>>> confirm your theory.
>>> But here, this seems to be more of rename use case. one more thing to
>>> verify is the extents owned by the inodes mentioned.
>>
>> No, I don't believe this is a rename, see above.
>>
>> I'll note that every occurrence of this problem has been preceded by an
>> OSD segfault.  And this problem is racy, if I increase debug logging for
>> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
>> segfaults and asserts in my cluster logs; I suspect many could be due to
>> this problem, because the same range of blocks is being freed more than
>> once.   If some other file should perform an allocation between the
>> block-freeing in op_file_remove() and the second freeing in
>> op_file_update(), then we could easily have multiple files with the same
>> block allocation at the same time.
>>
>> I think this problem can be caught fairly easily, an assert could be
>>added
>> to _truncate().   The _truncate() routine is only called for files which
>> are open, which means that the inode has to be in the filemap.   Code
>> could be added at the front of _truncate() to look up the inode in the
>> filemap, and if it is not there, then assert.   That would provide a
>>stack
>> trace showing the caller using the file struct with the stale reference
>> count.   (It may be a few days before I have a chance to try this
>>myself.)
>Sure. Could you please let me know the steps to hit issue?
>i am doing only few min runs to check the performance and some unit test
>cases.
>I will add the necessary asserts if i can reproduce the issue and fix it.
 
Thanks for the help!   I will have to get back to you on Monday, as there
is another person on our team who runs the benchmarks that trigger this
problem; I don't know how to run his benchmarks myself.

Kevan
 
>
>Varada
>> Regards, Kevan
>>
>>
>>
>



* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-16 14:35       ` Kevan Rehm
@ 2016-07-19 14:09         ` Kevan Rehm
  2016-07-19 14:50           ` Varada Kari
From: Kevan Rehm @ 2016-07-19 14:09 UTC (permalink / raw)
  To: Varada Kari, ceph-devel (ceph-devel@vger.kernel.org)

Varada,

I added the following assert in _truncate() but haven't yet had it fire in
our test cluster.   If you or others also want to give it a try, that
would be great.  It seems like a useful assert.

diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
index dbfd59b..759d1af 100644
--- a/src/os/bluestore/BlueFS.cc
+++ b/src/os/bluestore/BlueFS.cc
@@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
   dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
            << " file " << h->file->fnode << dendl;
 
+  // If ino is not in filemap, we are truncating a file that is not allocated.
+
+  auto p = file_map.find(h->file->fnode.ino);
+  if (p == file_map.end()) {
+    assert(0 == "_truncate(): truncating a file that is not allocated");
+  }
+
   // truncate off unflushed data?
   if (h->pos < offset &&
       h->pos + h->buffer.length() > offset) {


The test case which seemed to make the problem occur more frequently for
us consists of mounting a cephfs filesystem, then starting up 32 instances
of ior on each of 4 client machines.  The ior command looks like this:

ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10


In the meantime I have been looking further at the code, and I think I see
the problem-causing sequence of events.

BlueFS implements unlink() differently from the POSIX model.  In POSIX,
the inode and disk extents for a file are not released until both the link
count reaches zero and the inode reference count (file open count) reaches
zero.  BlueFS, however, releases the inode and disk extents when the link
count (file->refs) reaches zero, even if there are threads that still have
the file open.  

Routine _drop_link() does the inode and disk extent releases, and it sets
a File variable called 'deleted' to true, but note that it does not
otherwise clear the released inode number or disk extent fields in the
File.  It is therefore the responsibility of every thread with a FileRef
for that file to first obtain a lock (to prevent races with _drop_link())
and then look at the value of 'deleted', and only continue using the File
if 'deleted' is false.
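
Paraphrasing _drop_link() (illustrative, not the exact source), the release
happens as soon as refs hits zero, regardless of whether anyone still holds
a FileRef:

void BlueFS::_drop_link(FileRef file)
{
  --file->refs;
  if (file->refs == 0) {
    log_t.op_file_remove(file->fnode.ino);  // the op_file_remove seen in replay
    // ... return file->fnode.extents to the block allocator ...
    file_map.erase(file->fnode.ino);        // inode is gone from file_map
    file->deleted = true;                   // but fnode.ino/extents are NOT cleared
  }
}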

Function BlueRocksWritableFile::Close() does not do this.  The routine
does not grab a lock; it calls GetPreallocationStatus(), and if
last_allocated_block is > 0, it calls fs->truncate() (which does get a
lock), which calls _truncate().  Routine _truncate() lowers the file size
and calls op_file_update(), which posts an update transaction entry to the
log for a file that is already unlinked.   This is how we end up with an
inode in the transaction log that has disk extents but its link count is
zero.

Adding a check of the 'deleted' variable after the lock is taken in
_truncate() would work, but it bothers me a bit to see
BlueRocksWritableFile::Close() using the File struct as if it is still
valid.  Folks with more BlueFS experience can decide if that is a problem
or not.   They might also be able to describe situations where a file
could be unlinked while it is still open, which might lead to better test
cases that fire the assert more quickly.
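
One possible form of the 'deleted' check mentioned above, sketched here just
to make the idea concrete (not a tested patch):

int BlueFS::_truncate(FileWriter *h, uint64_t offset)
{
  if (h->file->deleted) {
    dout(10) << __func__ << " file already deleted, ignoring truncate" << dendl;
    return 0;   // don't log op_file_update() for an unlinked inode
  }
  // ... existing truncate logic ...
}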

I'm also a bit concerned about the integrity of the BlueFS filesystem,
should the log have to be replayed.  This problem was only seen because
the inode number wasn't reused in the log after it was released.  What if
a subsequent log transaction reallocates the inode?  The "file with link
count 0" assert would not catch it.  I guess that subsequent reallocation
should overlay the bad FileRef->fnode with a new fnode, so maybe it's
harmless?

It would be nice to have my assert() trigger to provide the "smoking gun".
 However, if I can't duplicate the problem in the next day or two, I will
probably just patch _truncate() to return immediately if 'deleted' is
true, and move on. 

What do you think: should I post this history to bug 15724, create a
separate bug, or do something else?

Thanks, Kevan

On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of Kevan
Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com> wrote:

>
>
>On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>
>>On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>> Varada, see below.
>>>
>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>
>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>> Greetings,
>>>>>
>>>>> I have made a bit of progress.  Still looking for ideas/advise while
>>>>>I
>>>>> keep digging.
>>>>>
>>>>> First, the ceph version I forgot to mention last time.
>>>>>
>>>>> # ceph -v
>>>>> ceph version 11.0.0-196-g85bb43e
>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>
>>>>>
>>>>>
>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>which
>>>>> will result in an inode with 0 links.   There are actually three
>>>>>inodes
>>>>> in
>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>102,
>>>>> and 113.
>>>>>
>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000154.log
>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 109
>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000153.log
>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 102
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000151.log
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 113
>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000150.log
>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 111
>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>>
>>>>> 
>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:9311354
>>>>>8
>>>>>8+
>>>>> 20
>>>>>
>>>>> 
>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96
>>>>>0
>>>>>49
>>>>> 56
>>>>> 16+58720256])
>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafc880 (new)
>>>> When _get_file() is called with a inode number, if the file is not
>>>>found
>>>> in file_map
>>>> a new file is created and returned, hence that (new) in the log file.
>>>> These set of events
>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>In
>>>> that case, we use the
>>>> same inode number.
>>>
>>> I'm not sure I understand, I do not believe this is a rename case.  For
>>> this to be a rename, there would have to be at least one op_dir_link()
>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>code
>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>> appear together without a corresponding op_dir_link() is when a file is
>>> being deleted.
>>>
>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>above
>>> freed the block allocation for the inode and released the inode from
>>>the
>>> filemap, showing that the file->refs had to be zero at that time
>>>(routine
>>> _drop_link).   Yet the following op_file_update() call is performing a
>>> truncate on that same inode, which is not in the filemap.   So the
>>>caller
>>> has to be holding a file structure whose refs field is greater than
>>>zero.
>>> Which implies that the caller's file structure has to be stale.  The
>>>fact
>>> that the op_file_update file contents for inode 113 is exactly the same
>>>as
>>> the previous op_file_update contents prior to the op_file_remove(),
>>>except
>>> for a reduction in file size, proves I believe that the file struct
>>>used
>>> by the op_file_update() caller is stale.
>>>
>>>
>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>>
>>>>> 
>>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:20237516
>>>>>8
>>>>>+2
>>>>> 09
>>>>>
>>>>> 
>>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227
>>>>>5
>>>>>40
>>>>> 99
>>>>>
>>>>> 
>>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,
>>>>>1
>>>>>:2
>>>>> 51
>>>>>
>>>>> 
>>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209
>>>>>7
>>>>>15
>>>>> 2,
>>>>>
>>>>> 
>>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:29464985
>>>>>6
>>>>>+1
>>>>> 04
>>>>>
>>>>> 
>>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318
>>>>>7
>>>>>67
>>>>> 10
>>>>>
>>>>> 
>>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,
>>>>>1
>>>>>:3
>>>>> 42
>>>>>
>>>>> 
>>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209
>>>>>7
>>>>>15
>>>>> 2,
>>>>>
>>>>> 
>>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:38377881
>>>>>6
>>>>>+1
>>>>> 04
>>>>>
>>>>> 
>>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405
>>>>>7
>>>>>98
>>>>> 91
>>>>> 2+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbb90 (new)
>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafbe60 (new)
>>>>> ...
>>>>>
>>>>>
>>>>> The prior transaction log entry for each of the three inodes was
>>>>>another
>>>>> op_file_update() call.  The only difference between the two updates
>>>>>for
>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less
>>>>>bytes,
>>>>> all
>>>>> other fields are identical.
>>>> Are the extents same for both the events?
>>>
>>> Yes, please compare this line with the one higher above in this email,
>>>you
>>> can see that only the file size is different.   That is true for all
>>>three
>>> of these inodes.
>>>
>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>>
>>>>> 
>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:9311354
>>>>>8
>>>>>8+
>>>>> 20
>>>>>
>>>>> 
>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96
>>>>>0
>>>>>49
>>>>> 56
>>>>> 16+58720256])
>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafbb90
>>>>>
>>>>>
>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>> 64421913,
>>>>> 6933026 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafc910
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>> 68557043,
>>>>> 2802362 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>>
>>>>> 
>>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:20237516
>>>>>8
>>>>>+2
>>>>> 09
>>>>>
>>>>> 
>>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227
>>>>>5
>>>>>40
>>>>> 99
>>>>>
>>>>> 
>>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,
>>>>>1
>>>>>:2
>>>>> 51
>>>>>
>>>>> 
>>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209
>>>>>7
>>>>>15
>>>>> 2,
>>>>>
>>>>> 
>>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:29464985
>>>>>6
>>>>>+1
>>>>> 04
>>>>>
>>>>> 
>>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318
>>>>>7
>>>>>67
>>>>> 10
>>>>>
>>>>> 
>>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,
>>>>>1
>>>>>:3
>>>>> 42
>>>>>
>>>>> 
>>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209
>>>>>7
>>>>>15
>>>>> 2,
>>>>>
>>>>> 
>>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:38377881
>>>>>6
>>>>>+1
>>>>> 04
>>>>>
>>>>> 
>>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405
>>>>>7
>>>>>98
>>>>> 91
>>>>> 2+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbe60
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> There is only one caller of op_file_update() in which the size of the
>>>>> file
>>>>> is reduced but all other fields stay the same, that is
>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>BlueFS::truncate(),
>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>
>>>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>having
>>>>> difficulty finding where the call is coming from.
>>>> Any syscall making dir rename or file remove or directory remove will
>>>> result in BlueFS::rename or unlink
>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>operations.
>>>> rocksdb wal_manager or compaction
>>>> can trigger these calls during that process. Will take a closer look
>>>>at
>>>> the logs, if i can get something from them.
>>>>> So the problem appears to be that the file is first deleted, then it
>>>>>is
>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>that
>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>inode
>>>>> with a stale fnode, resulting in the 0 links failure.
>>>> Once the file is deleted, all the underlying extents will be cleared.
>>>> Until and unless there is race in handling on
>>>> the files from application level, Bluefs doesn't have any control over
>>>> them. If the a file(handle) is been reused after
>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>> confirm your theory.
>>>> But here, this seems to be more of rename use case. one more thing to
>>>> verify is the extents owned by the inodes mentioned.
>>>
>>> No, I don't believe this is a rename, see above.
>>>
>>> I'll note that every occurrence of this problem has been preceded by an
>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>for
>>> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>to
>>> this problem, because the same range of blocks is being freed more than
>>> once.   If some other file should perform an allocation between the
>>> block-freeing in op_file_remove() and the second freeing in
>>> op_file_update(), then we could easily have multiple files with the
>>>same
>>> block allocation at the same time.
>>>
>>> I think this problem can be caught fairly easily, an assert could be
>>>added
>>> to _truncate().   The _truncate() routine is only called for files
>>>which
>>> are open, which means that the inode has to be in the filemap.   Code
>>> could be added at the front of _truncate() to look up the inode in the
>>> filemap, and if it is not there, then assert.   That would provide a
>>>stack
>>> trace showing the caller using the file struct with the stale reference
>>> count.   (It may be a few days before I have a chance to try this
>>>myself.)
>>Sure. Could you please let me know the steps to hit issue?
>>i am doing only few min runs to check the performance and some unit test
>>cases.
>>I will add the necessary asserts if i can reproduce the issue and fix it.
> 
>Thanks for the help!   I will have to get back to you on Monday, as there
>is another person on our team who runs the benchmarks that trigger this
>problem, I don't know how to run his benchmarks myself.
>
>Kevan
> 
>>
>>Varada
>>> Regards, Kevan
>>>
>>>
>>>
>>
>
>--
>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-19 14:09         ` Kevan Rehm
@ 2016-07-19 14:50           ` Varada Kari
  2016-07-19 15:08             ` Kevan Rehm
  0 siblings, 1 reply; 12+ messages in thread
From: Varada Kari @ 2016-07-19 14:50 UTC (permalink / raw)
  To: Kevan Rehm, ceph-devel (ceph-devel@vger.kernel.org)

On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
> Varada,
>
> I added the following assert in _truncate() but haven't yet had it fire in
> our test cluster.   If you or others also want to give it a try, that
> would be great.  It seems like a useful assert.
>
> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
> index dbfd59b..759d1af 100644
> --- a/src/os/bluestore/BlueFS.cc
> +++ b/src/os/bluestore/BlueFS.cc
> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
> offset)
>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>             << " file " << h->file->fnode << dendl;
>
> +  // If ino is not in filemap, we are truncating a file that is not
> allocated.
> +
> +  auto p = file_map.find(h->file->fnode.ino);
> +  if (p == file_map.end()) {
> +    assert(0 == "_truncate(): truncating a file that is not allocated");
> +  }
> +
>    // truncate off unflushed data?
>    if (h->pos < offset &&
>        h->pos + h->buffer.length() > offset) {
If we are using a FileWriter, we will (should) have a reference on the
fnode, so I don't think that file will be deleted from the file_map.  Will
wait for your run.
>
> The test case which seemed to make the problem occur more frequently for
> us consists of mounting a cephfs filesystem, then starting up 32 instances
> of ior on each of 4 client machines.  The ior command looks like this:
>
> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>
>
> In the meantime I have been looking further at the code, and I think I see
> the problem-causing sequence of events.
>
> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
> the inode and disk extents for a file are not released until both the link
> count reaches zero and the inode reference count (file open count) reaches
> zero.  BlueFS however releases the inode and disk extents when the link
> count (file->refs) reaches zero even if there are threads that still have
> the file open.
open_for_read and open_for_write take a reference on the file, and we
increment num_readers and num_writers on the file reference.  We have
asserts to catch any outstanding readers/writers in the file-deletion
path.
> Routine _drop_link() does the inode and disk extent releases, and it sets
> a File variable called 'deleted' to true, but note that it does not
> otherwise clear the released inode number or disk extent fields in the
> File.  It is therefore the responsibility of every thread with a FileRef
> for that file to first obtain a lock (to prevent races with _drop_link())
> and then look at the value of 'deleted', and only continue using the File
> if 'deleted' is false.
>
> Function BlueRocksWritableFile::Close() does not do this.  The routine
> does not grab a lock; it calls GetPreallocationStatus(), and if
> last_allocated_block is > 0, it calls fs->truncate() (which does get a
> lock) which calls _truncate().  Routine _truncate() lowers the file size,
> and calls op_file_update(), which posts an update transaction entry to the
> log for a file that is already unlinked.   This is how we end up with an
> inode in the transaction log that has disk extents but its link count is
> zero.
Yes, I also found the same code path, which truncates the file.  I am
trying to write some unit test cases that run all these ops concurrently
to reproduce the problem.
> Adding a check of the 'deleted' variable after the lock is taken in
> _truncate() would work, but it bothers me a bit to see
> BlueRocksWritableFile::Close() using the File struct as if it is still
> valid.  Folks with more BlueFS experience can decide if that is a problem
> or not.   They might also be able to describe situations where a file
> could be unlinked while it is still open, that might lead to better test
> cases which fire the assert more quickly.
I added some asserts to catch the issue using a test with multiple
threads, but I am still trying to reproduce it.  I believe we should hold
a reference to the file; will verify that anyway.
>
> I'm also a bit concerned about the integrity of the BlueFS filesystem,
> should the log have to be replayed.  This problem was only seen because
> the inode number wasn't reused in the log after it was released.  What if
> a subsequent log transaction reallocates the inode?  The "file with link
> count 0" assert would not catch it.  I guess that subsequent reallocation
> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
> harmless?
Hopefully.  I will try to add some debugging around the file's reference
count and verify the theory.

> It would be nice to have my assert() trigger to provide the "smoking gun".
>  However, if I can't duplicate the problem in the next day or two, I will
> probably just patch _truncate() to return immediately if 'deleted' is
> true, and move on.
>
> What do you think, should I post this history to bug 15724, create a
> separate bug, anything else?
Please post your findings to 15724; if we find a different problem, we
can file a separate ticket.

Varada
> Thanks, Kevan
>
> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of Kevan
> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com> wrote:
>
>>
>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>
>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>> Varada, see below.
>>>>
>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>
>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> I have made a bit of progress.  Still looking for ideas/advise while
>>>>>> I
>>>>>> keep digging.
>>>>>>
>>>>>> First, the ceph version I forgot to mention last time.
>>>>>>
>>>>>> # ceph -v
>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>
>>>>>>
>>>>>>
>>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>> which
>>>>>> will result in an inode with 0 links.   There are actually three
>>>>>> inodes
>>>>>> in
>>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>> 102,
>>>>>> and 113.
>>>>>>
>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_dir_unlink  db/000154.log
>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_remove 109
>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_dir_unlink  db/000153.log
>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_remove 102
>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_dir_unlink  db/000151.log
>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_remove 113
>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_dir_unlink  db/000150.log
>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_remove 111
>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>
>>>>>>
>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:9311354
>>>>>> 8
>>>>>> 8+
>>>>>> 20
>>>>>>
>>>>>>
>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96
>>>>>> 0
>>>>>> 49
>>>>>> 56
>>>>>> 16+58720256])
>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>>> 0x7fb66aafc880 (new)
>>>>> When _get_file() is called with a inode number, if the file is not
>>>>> found
>>>>> in file_map
>>>>> a new file is created and returned, hence that (new) in the log file.
>>>>> These set of events
>>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>> In
>>>>> that case, we use the
>>>>> same inode number.
>>>> I'm not sure I understand, I do not believe this is a rename case.  For
>>>> this to be a rename, there would have to be at least one op_dir_link()
>>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>> code
>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>>> appear together without a corresponding op_dir_link() is when a file is
>>>> being deleted.
>>>>
>>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>> above
>>>> freed the block allocation for the inode and released the inode from
>>>> the
>>>> filemap, showing that the file->refs had to be zero at that time
>>>> (routine
>>>> _drop_link).   Yet the following op_file_update() call is performing a
>>>> truncate on that same inode, which is not in the filemap.   So the
>>>> caller
>>>> has to be holding a file structure whose refs field is greater than
>>>> zero.
>>>> Which implies that the caller's file structure has to be stale.  The
>>>> fact
>>>> that the op_file_update file contents for inode 113 is exactly the same
>>>> as
>>>> the previous op_file_update contents prior to the op_file_remove(),
>>>> except
>>>> for a reduction in file size, proves I believe that the file struct
>>>> used
>>>> by the op_file_update() caller is stale.
>>>>
>>>>
>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>
>>>>>>
>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:20237516
>>>>>> 8
>>>>>> +2
>>>>>> 09
>>>>>>
>>>>>>
>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227
>>>>>> 5
>>>>>> 40
>>>>>> 99
>>>>>>
>>>>>>
>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,
>>>>>> 1
>>>>>> :2
>>>>>> 51
>>>>>>
>>>>>>
>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209
>>>>>> 7
>>>>>> 15
>>>>>> 2,
>>>>>>
>>>>>>
>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:29464985
>>>>>> 6
>>>>>> +1
>>>>>> 04
>>>>>>
>>>>>>
>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318
>>>>>> 7
>>>>>> 67
>>>>>> 10
>>>>>>
>>>>>>
>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,
>>>>>> 1
>>>>>> :3
>>>>>> 42
>>>>>>
>>>>>>
>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209
>>>>>> 7
>>>>>> 15
>>>>>> 2,
>>>>>>
>>>>>>
>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:38377881
>>>>>> 6
>>>>>> +1
>>>>>> 04
>>>>>>
>>>>>>
>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405
>>>>>> 7
>>>>>> 98
>>>>>> 91
>>>>>> 2+3145728,1:507510784+5242880])
>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>>> 0x7fb66aafbb90 (new)
>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>>> 0x7fb66aafbe60 (new)
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> The prior transaction log entry for each of the three inodes was
>>>>>> another
>>>>>> op_file_update() call.  The only difference between the two updates
>>>>>> for
>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less
>>>>>> bytes,
>>>>>> all
>>>>>> other fields are identical.
>>>>> Are the extents same for both the events?
>>>> Yes, please compare this line with the one higher above in this email,
>>>> you
>>>> can see that only the file size is different.   That is true for all
>>>> three
>>>> of these inodes.
>>>>
>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>
>>>>>>
>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:9311354
>>>>>> 8
>>>>>> 8+
>>>>>> 20
>>>>>>
>>>>>>
>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:96
>>>>>> 0
>>>>>> 49
>>>>>> 56
>>>>>> 16+58720256])
>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>>> 0x7fb66aafbb90
>>>>>>
>>>>>>
>>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>>> 64421913,
>>>>>> 6933026 less bytes, all other fields are identical.
>>>>>>
>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>>> 0x7fb66aafc910
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>>> 68557043,
>>>>>> 2802362 less bytes, all other fields are identical.
>>>>>>
>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>
>>>>>>
>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:20237516
>>>>>> 8
>>>>>> +2
>>>>>> 09
>>>>>>
>>>>>>
>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227
>>>>>> 5
>>>>>> 40
>>>>>> 99
>>>>>>
>>>>>>
>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,
>>>>>> 1
>>>>>> :2
>>>>>> 51
>>>>>>
>>>>>>
>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+209
>>>>>> 7
>>>>>> 15
>>>>>> 2,
>>>>>>
>>>>>>
>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:29464985
>>>>>> 6
>>>>>> +1
>>>>>> 04
>>>>>>
>>>>>>
>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318
>>>>>> 7
>>>>>> 67
>>>>>> 10
>>>>>>
>>>>>>
>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,
>>>>>> 1
>>>>>> :3
>>>>>> 42
>>>>>>
>>>>>>
>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+209
>>>>>> 7
>>>>>> 15
>>>>>> 2,
>>>>>>
>>>>>>
>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:38377881
>>>>>> 6
>>>>>> +1
>>>>>> 04
>>>>>>
>>>>>>
>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405
>>>>>> 7
>>>>>> 98
>>>>>> 91
>>>>>> 2+3145728,1:507510784+5242880])
>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>>> 0x7fb66aafbe60
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> There is only one caller of op_file_update() in which the size of the
>>>>>> file
>>>>>> is reduced but all other fields stay the same, that is
>>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>> BlueFS::truncate(),
>>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>>
>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>> having
>>>>>> difficulty finding where the call is coming from.
>>>>> Any syscall making dir rename or file remove or directory remove will
>>>>> result in BlueFS::rename or unlink
>>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>> operations.
>>>>> rocksdb wal_manager or compaction
>>>>> can trigger these calls during that process. Will take a closer look
>>>>> at
>>>>> the logs, if i can get something from them.
>>>>>> So the problem appears to be that the file is first deleted, then it
>>>>>> is
>>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>> that
>>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>> inode
>>>>>> with a stale fnode, resulting in the 0 links failure.
>>>>> Once the file is deleted, all the underlying extents will be cleared.
>>>>> Until and unless there is race in handling on
>>>>> the files from application level, Bluefs doesn't have any control over
>>>>> them. If the a file(handle) is been reused after
>>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>>> confirm your theory.
>>>>> But here, this seems to be more of rename use case. one more thing to
>>>>> verify is the extents owned by the inodes mentioned.
>>>> No, I don't believe this is a rename, see above.
>>>>
>>>> I'll note that every occurrence of this problem has been preceded by an
>>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>> for
>>>> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
>>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>> to
>>>> this problem, because the same range of blocks is being freed more than
>>>> once.   If some other file should perform an allocation between the
>>>> block-freeing in op_file_remove() and the second freeing in
>>>> op_file_update(), then we could easily have multiple files with the
>>>> same
>>>> block allocation at the same time.
>>>>
>>>> I think this problem can be caught fairly easily, an assert could be
>>>> added
>>>> to _truncate().   The _truncate() routine is only called for files
>>>> which
>>>> are open, which means that the inode has to be in the filemap.   Code
>>>> could be added at the front of _truncate() to look up the inode in the
>>>> filemap, and if it is not there, then assert.   That would provide a
>>>> stack
>>>> trace showing the caller using the file struct with the stale reference
>>>> count.   (It may be a few days before I have a chance to try this
>>>> myself.)
>>> Sure. Could you please let me know the steps to hit issue?
>>> i am doing only few min runs to check the performance and some unit test
>>> cases.
>>> I will add the necessary asserts if i can reproduce the issue and fix it.
>> Thanks for the help!   I will have to get back to you on Monday, as there
>> is another person on our team who runs the benchmarks that trigger this
>> problem, I don't know how to run his benchmarks myself.
>>
>> Kevan
>>
>>> Varada
>>>> Regards, Kevan
>>>>
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-19 14:50           ` Varada Kari
@ 2016-07-19 15:08             ` Kevan Rehm
  2016-07-19 16:40               ` Varada Kari
  0 siblings, 1 reply; 12+ messages in thread
From: Kevan Rehm @ 2016-07-19 15:08 UTC (permalink / raw)
  To: Varada Kari, ceph-devel (ceph-devel@vger.kernel.org)

Varada,

> If we are using FileWriter, we will(should) have a reference on the
>fnode.
> I don't think that will be deleted from the file_map. Will wait for your
> run.
>


> open_for_read and open_for_write will take a reference on the file and
>we will increment
> num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
> readers/writers in case file deletion.


I had hoped that, but I can't find them in the code, unless they were
added recently.  See routine _drop_link(): when the link count hits zero,
it proceeds immediately with the inode/extent deletion.  It does have an
assert there checking file->num_reading, but (in my copy) there are no
checks of num_readers or num_writers.  There are also no checks of the
'nref' in RefCountedObject, from which File inherits.  If you do find any
of these elsewhere that provide protection, please let me know; I'd like
to understand this code better.
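
For what it's worth, the kind of checks I went looking for in _drop_link()
would be something like this (sketch only, not tested, using the File
field names from my copy of BlueFS.h):

  // sketch: in BlueFS::_drop_link(), at the point where file->refs has
  // reached zero, next to the existing num_reading assert
  assert(file->num_readers == 0);   // nothing still open for read
  assert(file->num_writers == 0);   // nothing still open for write
  // nref (the intrusive_ptr count from RefCountedObject) can't be checked
  // here, since file_map and the caller still legitimately hold FileRefs
  // at this point; that is exactly the window in which a stale FileRef
  // can survive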

Thanks for your help digging into this, and trying to reproduce it.  I'll
update the bug shortly.

Kevan


On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>> Varada,
>>
>> I added the following assert in _truncate() but haven't yet had it fire
>>in
>> our test cluster.   If you or others also want to give it a try, that
>> would be great.  It seems like a useful assert.
>>
>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>> index dbfd59b..759d1af 100644
>> --- a/src/os/bluestore/BlueFS.cc
>> +++ b/src/os/bluestore/BlueFS.cc
>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
>> offset)
>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>             << " file " << h->file->fnode << dendl;
>>
>> +  // If ino is not in filemap, we are truncating a file that is not
>> allocated.
>> +
>> +  auto p = file_map.find(h->file->fnode.ino);
>> +  if (p == file_map.end()) {
>> +    assert(0 == "_truncate(): truncating a file that is not
>>allocated");
>> +  }
>> +
>>    // truncate off unflushed data?
>>    if (h->pos < offset &&
>>        h->pos + h->buffer.length() > offset) {
>If we are using FileWriter, we will(should) have a reference on the fnode.
>I don't think that will be deleted from the file_map. Will wait for your
>run.
>>
>> The test case which seemed to make the problem occur more frequently for
>> us consists of mounting a cephfs filesystem, then starting up 32
>>instances
>> of ior on each of 4 client machines.  The ior command looks like this:
>>
>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>>
>>
>> In the meantime I have been looking further at the code, and I think I
>>see
>> the problem-causing sequence of events.
>>
>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
>> the inode and disk extents for a file are not released until both the
>>link
>> count reaches zero and the inode reference count (file open count)
>>reaches
>> zero.  BlueFS however releases the inode and disk extents when the link
>> count (file->refs) reaches zero even if there are threads that still
>>have
>> the file open.
>open_for_read and open_for_write will take a reference on the file and
>we will increment
>num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
>readers/writers in case file deletion.
>> Routine _drop_link() does the inode and disk extent releases, and it
>>sets
>> a File variable called 'deleted' to true, but note that it does not
>> otherwise clear the released inode number or disk extent fields in the
>> File.  It is therefore the responsibility of every thread with a FileRef
>> for that file to first obtain a lock (to prevent races with
>>_drop_link())
>> and then look at the value of 'deleted', and only continue using the
>>File
>> if 'deleted' is false.
>>
>> Function BlueRocksWritableFile::Close() does not do this.  The routine
>> does not grab a lock; it calls GetPreallocationStatus(), and if
>> last_allocated_block is > 0, it calls fs->truncate() (which does get a
>> lock) which calls _truncate().  Routine _truncate() lowers the file
>>size,
>> and calls op_file_update(), which posts an update transaction entry to
>>the
>> log for a file that is already unlinked.   This is how we end up with an
>> inode in the transaction log that has disk extents but its link count is
>> zero.
>Yes. i also found the same code path, which truncates the file. Trying
>to write some
>unit test cases to run all these ops in a concurrent way to find the
>problem.
>> Adding a check of the 'deleted' variable after the lock is taken in
>> _truncate() would work, but it bothers me a bit to see
>> BlueRocksWritableFile::Close() using the File struct as if it is still
>> valid.  Folks with more BlueFS experience can decide if that is a
>>problem
>> or not.   They might also be able to describe situations where a file
>> could be unlinked while it is still open, that might lead to better test
>> cases which fire the assert more quickly.
>Added some asserts to find the issue using a test with multiple threads.
>Still
>trying reproduce the issue. I believe we should have a reference to the
>file. Will verify that
>anyways.
>>
>> I'm also a bit concerned about the integrity of the BlueFS filesystem,
>> should the log have to be replayed.  This problem was only seen because
>> the inode number wasn't reused in the log after it was released.  What
>>if
>> a subsequent log transaction reallocates the inode?  The "file with link
>> count 0" assert would not catch it.  I guess that subsequent
>>reallocation
>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
>> harmless?
>Hopefully. Will try to add some debug around reference count of the file
>and
>verify the theory.
>
>> It would be nice to have my assert() trigger to provide the "smoking
>>gun".
>>  However, if I can't duplicate the problem in the next day or two, I
>>will
>> probably just patch _truncate() to return immediately if 'deleted' is
>> true, and move on.
>>
>> What do you think, should I post this history to bug 15724, create a
>> separate bug, anything else?
>Please post your findings to 15724, if we find a different problem we
>can file a
>different ticket.
>
>Varada
>> Thanks, Kevan
>>
>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of
>>Kevan
>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>
>>wrote:
>>
>>>
>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>
>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>>> Varada, see below.
>>>>>
>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>>
>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>>> Greetings,
>>>>>>>
>>>>>>> I have made a bit of progress.  Still looking for ideas/advise
>>>>>>>while
>>>>>>> I
>>>>>>> keep digging.
>>>>>>>
>>>>>>> First, the ceph version I forgot to mention last time.
>>>>>>>
>>>>>>> # ceph -v
>>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>>> which
>>>>>>> will result in an inode with 0 links.   There are actually three
>>>>>>> inodes
>>>>>>> in
>>>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>>> 102,
>>>>>>> and 113.
>>>>>>>
>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_dir_unlink  db/000154.log
>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_remove 109
>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_dir_unlink  db/000153.log
>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_remove 102
>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_dir_unlink  db/000151.log
>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_remove 113
>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_dir_unlink  db/000150.log
>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_remove 111
>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113
>>>>>>>54
>>>>>>> 8
>>>>>>> 8+
>>>>>>> 20
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:
>>>>>>>96
>>>>>>> 0
>>>>>>> 49
>>>>>>> 56
>>>>>>> 16+58720256])
>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>113 =
>>>>>>> 0x7fb66aafc880 (new)
>>>>>> When _get_file() is called with a inode number, if the file is not
>>>>>> found
>>>>>> in file_map
>>>>>> a new file is created and returned, hence that (new) in the log
>>>>>>file.
>>>>>> These set of events
>>>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>>> In
>>>>>> that case, we use the
>>>>>> same inode number.
>>>>> I'm not sure I understand, I do not believe this is a rename case.
>>>>>For
>>>>> this to be a rename, there would have to be at least one
>>>>>op_dir_link()
>>>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>>> code
>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>>>> appear together without a corresponding op_dir_link() is when a file
>>>>>is
>>>>> being deleted.
>>>>>
>>>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>>> above
>>>>> freed the block allocation for the inode and released the inode from
>>>>> the
>>>>> filemap, showing that the file->refs had to be zero at that time
>>>>> (routine
>>>>> _drop_link).   Yet the following op_file_update() call is performing
>>>>>a
>>>>> truncate on that same inode, which is not in the filemap.   So the
>>>>> caller
>>>>> has to be holding a file structure whose refs field is greater than
>>>>> zero.
>>>>> Which implies that the caller's file structure has to be stale.  The
>>>>> fact
>>>>> that the op_file_update file contents for inode 113 is exactly the
>>>>>same
>>>>> as
>>>>> the previous op_file_update contents prior to the op_file_remove(),
>>>>> except
>>>>> for a reduction in file size, proves I believe that the file struct
>>>>> used
>>>>> by the op_file_update() caller is stale.
>>>>>
>>>>>
>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375
>>>>>>>16
>>>>>>> 8
>>>>>>> +2
>>>>>>> 09
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2
>>>>>>>27
>>>>>>> 5
>>>>>>> 40
>>>>>>> 99
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715
>>>>>>>2,
>>>>>>> 1
>>>>>>> :2
>>>>>>> 51
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2
>>>>>>>09
>>>>>>> 7
>>>>>>> 15
>>>>>>> 2,
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649
>>>>>>>85
>>>>>>> 6
>>>>>>> +1
>>>>>>> 04
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3
>>>>>>>18
>>>>>>> 7
>>>>>>> 67
>>>>>>> 10
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715
>>>>>>>2,
>>>>>>> 1
>>>>>>> :3
>>>>>>> 42
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2
>>>>>>>09
>>>>>>> 7
>>>>>>> 15
>>>>>>> 2,
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778
>>>>>>>81
>>>>>>> 6
>>>>>>> +1
>>>>>>> 04
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4
>>>>>>>05
>>>>>>> 7
>>>>>>> 98
>>>>>>> 91
>>>>>>> 2+3145728,1:507510784+5242880])
>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>102 =
>>>>>>> 0x7fb66aafbb90 (new)
>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>109 =
>>>>>>> 0x7fb66aafbe60 (new)
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>> The prior transaction log entry for each of the three inodes was
>>>>>>> another
>>>>>>> op_file_update() call.  The only difference between the two updates
>>>>>>> for
>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less
>>>>>>> bytes,
>>>>>>> all
>>>>>>> other fields are identical.
>>>>>> Are the extents same for both the events?
>>>>> Yes, please compare this line with the one higher above in this
>>>>>email,
>>>>> you
>>>>> can see that only the file size is different.   That is true for all
>>>>> three
>>>>> of these inodes.
>>>>>
>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113
>>>>>>>54
>>>>>>> 8
>>>>>>> 8+
>>>>>>> 20
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:
>>>>>>>96
>>>>>>> 0
>>>>>>> 49
>>>>>>> 56
>>>>>>> 16+58720256])
>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>113 =
>>>>>>> 0x7fb66aafbb90
>>>>>>>
>>>>>>>
>>>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>>>> 64421913,
>>>>>>> 6933026 less bytes, all other fields are identical.
>>>>>>>
>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>109 =
>>>>>>> 0x7fb66aafc910
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>>>> 68557043,
>>>>>>> 2802362 less bytes, all other fields are identical.
>>>>>>>
>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>[1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375
>>>>>>>16
>>>>>>> 8
>>>>>>> +2
>>>>>>> 09
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2
>>>>>>>27
>>>>>>> 5
>>>>>>> 40
>>>>>>> 99
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715
>>>>>>>2,
>>>>>>> 1
>>>>>>> :2
>>>>>>> 51
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2
>>>>>>>09
>>>>>>> 7
>>>>>>> 15
>>>>>>> 2,
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649
>>>>>>>85
>>>>>>> 6
>>>>>>> +1
>>>>>>> 04
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3
>>>>>>>18
>>>>>>> 7
>>>>>>> 67
>>>>>>> 10
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715
>>>>>>>2,
>>>>>>> 1
>>>>>>> :3
>>>>>>> 42
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2
>>>>>>>09
>>>>>>> 7
>>>>>>> 15
>>>>>>> 2,
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778
>>>>>>>81
>>>>>>> 6
>>>>>>> +1
>>>>>>> 04
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4
>>>>>>>05
>>>>>>> 7
>>>>>>> 98
>>>>>>> 91
>>>>>>> 2+3145728,1:507510784+5242880])
>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>102 =
>>>>>>> 0x7fb66aafbe60
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There is only one caller of op_file_update() in which the size of
>>>>>>>the
>>>>>>> file
>>>>>>> is reduced but all other fields stay the same, that is
>>>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>>> BlueFS::truncate(),
>>>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>>>
>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the
>>>>>>>three
>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>>> having
>>>>>>> difficulty finding where the call is coming from.
>>>>>> Any syscall making dir rename or file remove or directory remove
>>>>>>will
>>>>>> result in BlueFS::rename or unlink
>>>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>>> operations.
>>>>>> rocksdb wal_manager or compaction
>>>>>> can trigger these calls during that process. Will take a closer look
>>>>>> at
>>>>>> the logs, if i can get something from them.
>>>>>>> So the problem appears to be that the file is first deleted, then
>>>>>>>it
>>>>>>> is
>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>>> that
>>>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>>> inode
>>>>>>> with a stale fnode, resulting in the 0 links failure.
>>>>>> Once the file is deleted, all the underlying extents will be
>>>>>>cleared.
>>>>>> Until and unless there is race in handling on
>>>>>> the files from application level, Bluefs doesn't have any control
>>>>>>over
>>>>>> them. If the a file(handle) is been reused after
>>>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>>>> confirm your theory.
>>>>>> But here, this seems to be more of rename use case. one more thing
>>>>>>to
>>>>>> verify is the extents owned by the inodes mentioned.
>>>>> No, I don't believe this is a rename, see above.
>>>>>
>>>>> I'll note that every occurrence of this problem has been preceded by
>>>>>an
>>>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>>> for
>>>>> bluefs or bluestore, then the problem goes away.  I have 50+
>>>>>bluestore
>>>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>>> to
>>>>> this problem, because the same range of blocks is being freed more
>>>>>than
>>>>> once.   If some other file should perform an allocation between the
>>>>> block-freeing in op_file_remove() and the second freeing in
>>>>> op_file_update(), then we could easily have multiple files with the
>>>>> same
>>>>> block allocation at the same time.
>>>>>
>>>>> I think this problem can be caught fairly easily, an assert could be
>>>>> added
>>>>> to _truncate().   The _truncate() routine is only called for files
>>>>> which
>>>>> are open, which means that the inode has to be in the filemap.   Code
>>>>> could be added at the front of _truncate() to look up the inode in
>>>>>the
>>>>> filemap, and if it is not there, then assert.   That would provide a
>>>>> stack
>>>>> trace showing the caller using the file struct with the stale
>>>>>reference
>>>>> count.   (It may be a few days before I have a chance to try this
>>>>> myself.)
>>>> Sure. Could you please let me know the steps to hit issue?
>>>> i am doing only few min runs to check the performance and some unit
>>>>test
>>>> cases.
>>>> I will add the necessary asserts if i can reproduce the issue and fix
>>>>it.
>>> Thanks for the help!   I will have to get back to you on Monday, as
>>>there
>>> is another person on our team who runs the benchmarks that trigger this
>>> problem, I don't know how to run his benchmarks myself.
>>>
>>> Kevan
>>>
>>>> Varada
>>>>> Regards, Kevan
>>>>>
>>>>>
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-19 15:08             ` Kevan Rehm
@ 2016-07-19 16:40               ` Varada Kari
  2016-07-19 17:04                 ` Kevan Rehm
  0 siblings, 1 reply; 12+ messages in thread
From: Varada Kari @ 2016-07-19 16:40 UTC (permalink / raw)
  To: Kevan Rehm, ceph-devel (ceph-devel@vger.kernel.org)

You can see the asserts in the destructor of File (~File).  If we still
have a valid refcount, I think it asserts when boost tries to delete the
object; these are intrusive pointers.
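
To illustrate what I mean, the pattern is roughly the following (a
simplified, self-contained sketch, not the actual ceph classes):

#include <atomic>
#include <cassert>
#include <boost/intrusive_ptr.hpp>

struct File {
  std::atomic_int nref{0};   // intrusive refcount (RefCountedObject in ceph)
  std::atomic_int num_readers{0}, num_writers{0}, num_reading{0};
  bool deleted = false;
  ~File() {                  // fires if anyone still has the file open
    assert(num_readers == 0);
    assert(num_writers == 0);
    assert(num_reading == 0);
  }
};
inline void intrusive_ptr_add_ref(File *f) { ++f->nref; }
inline void intrusive_ptr_release(File *f) { if (--f->nref == 0) delete f; }
typedef boost::intrusive_ptr<File> FileRef;  // ~File runs when nref hits 0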

I will update you if I have a successful recreate of the same scenario.

Varada

On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
> Varada,
>
>> If we are using FileWriter, we will(should) have a reference on the
>> fnode.
>> I don't think that will be deleted from the file_map. Will wait for your
>> run.
>>
>
>> open_for_read and open_for_write will take a reference on the file and
>> we will increment
>> num_readers and num_writers on the file reference. we have asserts to
>> catch if we have any
>> readers/writers in case file deletion.
>
> I had hoped that, but can't find them in the code, unless they were added
> recently.  See routine _drop_link(), when the link count hits zero, it
> proceeds immediately with the inode/extent deletion.  It does have an
> assert there checking file->num_reading, but (in my copy) no checks for
> num_readers and num_writers.  There are also no checks of the 'nref' in
> RefCountedObject from which File inherits.    If you do find any of these
> elsewhere that provide protection, please let me know, I'd like to
> understand this code better.
>
> Thanks for your help digging into this, and trying to reproduce it.  I'll
> update the bug shortly.
>
> Kevan
>
>
> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>
>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>>> Varada,
>>>
>>> I added the following assert in _truncate() but haven't yet had it fire
>>> in
>>> our test cluster.   If you or others also want to give it a try, that
>>> would be great.  It seems like a useful assert.
>>>
>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>>> index dbfd59b..759d1af 100644
>>> --- a/src/os/bluestore/BlueFS.cc
>>> +++ b/src/os/bluestore/BlueFS.cc
>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
>>> offset)
>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>>             << " file " << h->file->fnode << dendl;
>>>
>>> +  // If ino is not in filemap, we are truncating a file that is not
>>> allocated.
>>> +
>>> +  auto p = file_map.find(h->file->fnode.ino);
>>> +  if (p == file_map.end()) {
>>> +    assert(0 == "_truncate(): truncating a file that is not
>>> allocated");
>>> +  }
>>> +
>>>    // truncate off unflushed data?
>>>    if (h->pos < offset &&
>>>        h->pos + h->buffer.length() > offset) {
>> If we are using FileWriter, we will(should) have a reference on the fnode.
>> I don't think that will be deleted from the file_map. Will wait for your
>> run.
>>> The test case which seemed to make the problem occur more frequently for
>>> us consists of mounting a cephfs filesystem, then starting up 32
>>> instances
>>> of ior on each of 4 client machines.  The ior command looks like this:
>>>
>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>>>
>>>
>>> In the meantime I have been looking further at the code, and I think I
>>> see
>>> the problem-causing sequence of events.
>>>
>>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
>>> the inode and disk extents for a file are not released until both the
>>> link
>>> count reaches zero and the inode reference count (file open count)
>>> reaches
>>> zero.  BlueFS however releases the inode and disk extents when the link
>>> count (file->refs) reaches zero even if there are threads that still
>>> have
>>> the file open.
>> open_for_read and open_for_write will take a reference on the file and
>> we will increment
>> num_readers and num_writers on the file reference. we have asserts to
>> catch if we have any
>> readers/writers in case file deletion.
>>> Routine _drop_link() does the inode and disk extent releases, and it
>>> sets
>>> a File variable called 'deleted' to true, but note that it does not
>>> otherwise clear the released inode number or disk extent fields in the
>>> File.  It is therefore the responsibility of every thread with a FileRef
>>> for that file to first obtain a lock (to prevent races with
>>> _drop_link())
>>> and then look at the value of 'deleted', and only continue using the
>>> File
>>> if 'deleted' is false.
>>>
>>> Function BlueRocksWritableFile::Close() does not do this.  The routine
>>> does not grab a lock; it calls GetPreallocationStatus(), and if
>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a
>>> lock) which calls _truncate().  Routine _truncate() lowers the file
>>> size,
>>> and calls op_file_update(), which posts an update transaction entry to
>>> the
>>> log for a file that is already unlinked.   This is how we end up with an
>>> inode in the transaction log that has disk extents but its link count is
>>> zero.
>> Yes. i also found the same code path, which truncates the file. Trying
>> to write some
>> unit test cases to run all these ops in a concurrent way to find the
>> problem.
>>> Adding a check of the 'deleted' variable after the lock is taken in
>>> _truncate() would work, but it bothers me a bit to see
>>> BlueRocksWritableFile::Close() using the File struct as if it is still
>>> valid.  Folks with more BlueFS experience can decide if that is a
>>> problem
>>> or not.   They might also be able to describe situations where a file
>>> could be unlinked while it is still open, that might lead to better test
>>> cases which fire the assert more quickly.
>> Added some asserts to find the issue using a test with multiple threads.
>> Still
>> trying reproduce the issue. I believe we should have a reference to the
>> file. Will verify that
>> anyways.
>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,
>>> should the log have to be replayed.  This problem was only seen because
>>> the inode number wasn't reused in the log after it was released.  What
>>> if
>>> a subsequent log transaction reallocates the inode?  The "file with link
>>> count 0" assert would not catch it.  I guess that subsequent
>>> reallocation
>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
>>> harmless?
>> Hopefully. Will try to add some debug around reference count of the file
>> and
>> verify the theory.
>>
>>> It would be nice to have my assert() trigger to provide the "smoking
>>> gun".
>>>  However, if I can't duplicate the problem in the next day or two, I
>>> will
>>> probably just patch _truncate() to return immediately if 'deleted' is
>>> true, and move on.
>>>
>>> What do you think, should I post this history to bug 15724, create a
>>> separate bug, anything else?
>> Please post your findings to 15724, if we find a different problem we
>> can file a
>> different ticket.
>>
>> Varada
>>> Thanks, Kevan
>>>
>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of
>>> Kevan
>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>
>>> wrote:
>>>
>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>
>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>>>> Varada, see below.
>>>>>>
>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>>>
>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise
>>>>>>>> while
>>>>>>>> I
>>>>>>>> keep digging.
>>>>>>>>
>>>>>>>> First, the ceph version I forgot to mention last time.
>>>>>>>>
>>>>>>>> # ceph -v
>>>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>>>> which
>>>>>>>> will result in an inode with 0 links.   There are actually three
>>>>>>>> inodes
>>>>>>>> in
>>>>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>>>> 102,
>>>>>>>> and 113.
>>>>>>>>
>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_dir_unlink  db/000154.log
>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_remove 109
>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_dir_unlink  db/000153.log
>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_remove 102
>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_dir_unlink  db/000151.log
>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_remove 113
>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_dir_unlink  db/000150.log
>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_remove 111
>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 113 =
>>>>>>>> 0x7fb66aafc880 (new)
>>>>>>> When _get_file() is called with a inode number, if the file is not
>>>>>>> found
>>>>>>> in file_map
>>>>>>> a new file is created and returned, hence that (new) in the log
>>>>>>> file.
>>>>>>> These set of events
>>>>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>>>> In
>>>>>>> that case, we use the
>>>>>>> same inode number.
>>>>>> I'm not sure I understand, I do not believe this is a rename case.
>>>>>> For
>>>>>> this to be a rename, there would have to be at least one
>>>>>> op_dir_link()
>>>>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>>>> code
>>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>>>>> appear together without a corresponding op_dir_link() is when a file
>>>>>> is
>>>>>> being deleted.
>>>>>>
>>>>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>>>> above
>>>>>> freed the block allocation for the inode and released the inode from
>>>>>> the
>>>>>> filemap, showing that the file->refs had to be zero at that time
>>>>>> (routine
>>>>>> _drop_link).   Yet the following op_file_update() call is performing
>>>>>> a
>>>>>> truncate on that same inode, which is not in the filemap.   So the
>>>>>> caller
>>>>>> has to be holding a file structure whose refs field is greater than
>>>>>> zero.
>>>>>> Which implies that the caller's file structure has to be stale.  The
>>>>>> fact
>>>>>> that the op_file_update file contents for inode 113 is exactly the
>>>>>> same
>>>>>> as
>>>>>> the previous op_file_update contents prior to the op_file_remove(),
>>>>>> except
>>>>>> for a reduction in file size, proves I believe that the file struct
>>>>>> used
>>>>>> by the op_file_update() caller is stale.
>>>>>>
>>>>>>
>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 102 =
>>>>>>>> 0x7fb66aafbb90 (new)
>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 109 =
>>>>>>>> 0x7fb66aafbe60 (new)
>>>>>>>> ...
>>>>>>>>
>>>>>>>>
>>>>>>>> The prior transaction log entry for each of the three inodes was
>>>>>>>> another
>>>>>>>> op_file_update() call.  The only difference between the two updates
>>>>>>>> for
>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less
>>>>>>>> bytes,
>>>>>>>> all
>>>>>>>> other fields are identical.
>>>>>>> Are the extents same for both the events?
>>>>>> Yes, please compare this line with the one higher above in this
>>>>>> email,
>>>>>> you
>>>>>> can see that only the file size is different.   That is true for all
>>>>>> three
>>>>>> of these inodes.
>>>>>>
>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 113 =
>>>>>>>> 0x7fb66aafbb90
>>>>>>>>
>>>>>>>>
>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>>>>> 64421913,
>>>>>>>> 6933026 less bytes, all other fields are identical.
>>>>>>>>
>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 109 =
>>>>>>>> 0x7fb66aafc910
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>>>>> 68557043,
>>>>>>>> 2802362 less bytes, all other fields are identical.
>>>>>>>>
>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>> 102 =
>>>>>>>> 0x7fb66aafbe60
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> There is only one caller of op_file_update() in which the size of
>>>>>>>> the
>>>>>>>> file
>>>>>>>> is reduced but all other fields stay the same, that is
>>>>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>>>> BlueFS::truncate(),
>>>>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>>>>
>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the
>>>>>>>> three
>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>>>> having
>>>>>>>> difficulty finding where the call is coming from.
>>>>>>> Any syscall making dir rename or file remove or directory remove
>>>>>>> will
>>>>>>> result in BlueFS::rename or unlink
>>>>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>>>> operations.
>>>>>>> rocksdb wal_manager or compaction
>>>>>>> can trigger these calls during that process. Will take a closer look
>>>>>>> at
>>>>>>> the logs, if i can get something from them.
>>>>>>>> So the problem appears to be that the file is first deleted, then
>>>>>>>> it
>>>>>>>> is
>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>>>> that
>>>>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>>>> inode
>>>>>>>> with a stale fnode, resulting in the 0 links failure.
>>>>>>> Once the file is deleted, all the underlying extents will be
>>>>>>> cleared.
>>>>>>> Until and unless there is race in handling on
>>>>>>> the files from application level, Bluefs doesn't have any control
>>>>>>> over
>>>>>>> them. If the a file(handle) is been reused after
>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>>>>> confirm your theory.
>>>>>>> But here, this seems to be more of rename use case. one more thing
>>>>>>> to
>>>>>>> verify is the extents owned by the inodes mentioned.
>>>>>> No, I don't believe this is a rename, see above.
>>>>>>
>>>>>> I'll note that every occurrence of this problem has been preceded by
>>>>>> an
>>>>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>>>> for
>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+
>>>>>> bluestore
>>>>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>>>> to
>>>>>> this problem, because the same range of blocks is being freed more
>>>>>> than
>>>>>> once.   If some other file should perform an allocation between the
>>>>>> block-freeing in op_file_remove() and the second freeing in
>>>>>> op_file_update(), then we could easily have multiple files with the
>>>>>> same
>>>>>> block allocation at the same time.
>>>>>>
>>>>>> I think this problem can be caught fairly easily, an assert could be
>>>>>> added
>>>>>> to _truncate().   The _truncate() routine is only called for files
>>>>>> which
>>>>>> are open, which means that the inode has to be in the filemap.   Code
>>>>>> could be added at the front of _truncate() to look up the inode in
>>>>>> the
>>>>>> filemap, and if it is not there, then assert.   That would provide a
>>>>>> stack
>>>>>> trace showing the caller using the file struct with the stale
>>>>>> reference
>>>>>> count.   (It may be a few days before I have a chance to try this
>>>>>> myself.)
>>>>> Sure. Could you please let me know the steps to hit issue?
>>>>> i am doing only few min runs to check the performance and some unit
>>>>> test
>>>>> cases.
>>>>> I will add the necessary asserts if i can reproduce the issue and fix
>>>>> it.
>>>> Thanks for the help!   I will have to get back to you on Monday, as
>>>> there
>>>> is another person on our team who runs the benchmarks that trigger this
>>>> problem, I don't know how to run his benchmarks myself.
>>>>
>>>> Kevan
>>>>
>>>>> Varada
>>>>>> Regards, Kevan
>>>>>>
>>>>>>
>>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-19 16:40               ` Varada Kari
@ 2016-07-19 17:04                 ` Kevan Rehm
  0 siblings, 0 replies; 12+ messages in thread
From: Kevan Rehm @ 2016-07-19 17:04 UTC (permalink / raw)
  To: Varada Kari, ceph-devel (ceph-devel@vger.kernel.org)



On 7/19/16, 11:40 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>You can see the asserts in destructor of File(~File). if we have any
>valid refcount, i think boost
>asserts to delete the object. These are intrusive pointers.
  
Ah, you are correct, I misunderstood your point.  You were saying that
asserts would fire if readers/writers exist at File object destruction
time.   I thought you were saying that asserts would fire if
readers/writers exist when the file's link count goes to zero; those are
the asserts I could not find.
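
For what it's worth, the workaround I mention below would look roughly like
this, placed right after the dout() at the top of _truncate() (an untested
sketch; 'deleted' is the flag that _drop_link() sets):

int BlueFS::_truncate(FileWriter *h, uint64_t offset)
{
  dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
           << " file " << h->file->fnode << dendl;

  // Untested sketch: truncate() already holds the lock when it calls us,
  // so if _drop_link() has marked this file deleted, skip the update
  // rather than logging an op_file_update for an unlinked inode.
  if (h->file->deleted) {
    dout(10) << __func__ << " file already deleted, skipping" << dendl;
    return 0;
  }

  // ... remainder of _truncate() unchanged ...
}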
 
>
>Will update if i have a successful recreate of the same scenario.
 
Very good, thanks!

Kevan
 
>
>Varada
>
>On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
>> Varada,
>>
>>> If we are using FileWriter, we will(should) have a reference on the
>>> fnode.
>>> I don't think that will be deleted from the file_map. Will wait for
>>>your
>>> run.
>>>
>>
>>> open_for_read and open_for_write will take a reference on the file and
>>> we will increment
>>> num_readers and num_writers on the file reference. we have asserts to
>>> catch if we have any
>>> readers/writers in case file deletion.
>>
>> I had hoped that, but can't find them in the code, unless they were
>>added
>> recently.  See routine _drop_link(), when the link count hits zero, it
>> proceeds immediately with the inode/extent deletion.  It does have an
>> assert there checking file->num_reading, but (in my copy) no checks for
>> num_readers and num_writers.  There are also no checks of the 'nref' in
>> RefCountedObject from which File inherits.    If you do find any of
>>these
>> elsewhere that provide protection, please let me know, I'd like to
>> understand this code better.
>>
>> Thanks for your help digging into this, and trying to reproduce it.
>>I'll
>> update the bug shortly.
>>
>> Kevan
>>
>>
>> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>
>>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>>>> Varada,
>>>>
>>>> I added the following assert in _truncate() but haven't yet had it
>>>>fire
>>>> in
>>>> our test cluster.   If you or others also want to give it a try, that
>>>> would be great.  It seems like a useful assert.
>>>>
>>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>>>> index dbfd59b..759d1af 100644
>>>> --- a/src/os/bluestore/BlueFS.cc
>>>> +++ b/src/os/bluestore/BlueFS.cc
>>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
>>>> offset)
>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>>>             << " file " << h->file->fnode << dendl;
>>>>
>>>> +  // If ino is not in filemap, we are truncating a file that is not
>>>> allocated.
>>>> +
>>>> +  auto p = file_map.find(h->file->fnode.ino);
>>>> +  if (p == file_map.end()) {
>>>> +    assert(0 == "_truncate(): truncating a file that is not
>>>> allocated");
>>>> +  }
>>>> +
>>>>    // truncate off unflushed data?
>>>>    if (h->pos < offset &&
>>>>        h->pos + h->buffer.length() > offset) {
>>> If we are using FileWriter, we will(should) have a reference on the
>>>fnode.
>>> I don't think that will be deleted from the file_map. Will wait for
>>>your
>>> run.
>>>> The test case which seemed to make the problem occur more frequently
>>>>for
>>>> us consists of mounting a cephfs filesystem, then starting up 32
>>>> instances
>>>> of ior on each of 4 client machines.  The ior command looks like this:
>>>>
>>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>>>>
>>>>
>>>> In the meantime I have been looking further at the code, and I think I
>>>> see
>>>> the problem-causing sequence of events.
>>>>
>>>> BlueFS implements unlink() differently than the POSIX model.  In
>>>>POSIX,
>>>> the inode and disk extents for a file are not released until both the
>>>> link
>>>> count reaches zero and the inode reference count (file open count)
>>>> reaches
>>>> zero.  BlueFS however releases the inode and disk extents when the
>>>>link
>>>> count (file->refs) reaches zero even if there are threads that still
>>>> have
>>>> the file open.
>>> open_for_read and open_for_write will take a reference on the file and
>>> we will increment
>>> num_readers and num_writers on the file reference. we have asserts to
>>> catch if we have any
>>> readers/writers in case file deletion.
>>>> Routine _drop_link() does the inode and disk extent releases, and it
>>>> sets
>>>> a File variable called 'deleted' to true, but note that it does not
>>>> otherwise clear the released inode number or disk extent fields in the
>>>> File.  It is therefore the responsibility of every thread with a
>>>>FileRef
>>>> for that file to first obtain a lock (to prevent races with
>>>> _drop_link())
>>>> and then look at the value of 'deleted', and only continue using the
>>>> File
>>>> if 'deleted' is false.
>>>>
>>>> Function BlueRocksWritableFile::Close() does not do this.  The routine
>>>> does not grab a lock; it calls GetPreallocationStatus(), and if
>>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a
>>>> lock) which calls _truncate().  Routine _truncate() lowers the file
>>>> size,
>>>> and calls op_file_update(), which posts an update transaction entry to
>>>> the
>>>> log for a file that is already unlinked.   This is how we end up with
>>>>an
>>>> inode in the transaction log that has disk extents but its link count
>>>>is
>>>> zero.
>>> Yes. i also found the same code path, which truncates the file. Trying
>>> to write some
>>> unit test cases to run all these ops in a concurrent way to find the
>>> problem.
>>>> Adding a check of the 'deleted' variable after the lock is taken in
>>>> _truncate() would work, but it bothers me a bit to see
>>>> BlueRocksWritableFile::Close() using the File struct as if it is still
>>>> valid.  Folks with more BlueFS experience can decide if that is a
>>>> problem
>>>> or not.   They might also be able to describe situations where a file
>>>> could be unlinked while it is still open, that might lead to better
>>>>test
>>>> cases which fire the assert more quickly.
>>> Added some asserts to find the issue using a test with multiple
>>>threads.
>>> Still
>>> trying reproduce the issue. I believe we should have a reference to the
>>> file. Will verify that
>>> anyways.
>>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,
>>>> should the log have to be replayed.  This problem was only seen
>>>>because
>>>> the inode number wasn't reused in the log after it was released.  What
>>>> if
>>>> a subsequent log transaction reallocates the inode?  The "file with
>>>>link
>>>> count 0" assert would not catch it.  I guess that subsequent
>>>> reallocation
>>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
>>>> harmless?
>>> Hopefully. Will try to add some debug around reference count of the
>>>file
>>> and
>>> verify the theory.
>>>
>>>> It would be nice to have my assert() trigger to provide the "smoking
>>>> gun".
>>>>  However, if I can't duplicate the problem in the next day or two, I
>>>> will
>>>> probably just patch _truncate() to return immediately if 'deleted' is
>>>> true, and move on.
>>>>
>>>> What do you think, should I post this history to bug 15724, create a
>>>> separate bug, anything else?
>>> Please post your findings to 15724, if we find a different problem we
>>> can file a
>>> different ticket.
>>>
>>> Varada
>>>> Thanks, Kevan
>>>>
>>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of
>>>> Kevan
>>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>
>>>> wrote:
>>>>
>>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>>
>>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>>>>> Varada, see below.
>>>>>>>
>>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>>>>>
>>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise
>>>>>>>>> while
>>>>>>>>> I
>>>>>>>>> keep digging.
>>>>>>>>>
>>>>>>>>> First, the ceph version I forgot to mention last time.
>>>>>>>>>
>>>>>>>>> # ceph -v
>>>>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>>>>> which
>>>>>>>>> will result in an inode with 0 links.   There are actually three
>>>>>>>>> inodes
>>>>>>>>> in
>>>>>>>>> this transaction which will all end up having 0 links, inodes
>>>>>>>>>109,
>>>>>>>>> 102,
>>>>>>>>> and 113.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_dir_unlink  db/000154.log
>>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_remove 109
>>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_dir_unlink  db/000153.log
>>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_remove 102
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_dir_unlink  db/000151.log
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_remove 113
>>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_dir_unlink  db/000150.log
>>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_remove 111
>>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 113 =
>>>>>>>>> 0x7fb66aafc880 (new)
>>>>>>>> When _get_file() is called with a inode number, if the file is not
>>>>>>>> found
>>>>>>>> in file_map
>>>>>>>> a new file is created and returned, hence that (new) in the log
>>>>>>>> file.
>>>>>>>> These set of events
>>>>>>>> can happen in case of rename. You can see the same
>>>>>>>>BlueFS::rename().
>>>>>>>> In
>>>>>>>> that case, we use the
>>>>>>>> same inode number.
>>>>>>> I'm not sure I understand, I do not believe this is a rename case.
>>>>>>> For
>>>>>>> this to be a rename, there would have to be at least one
>>>>>>> op_dir_link()
>>>>>>> call for each op_dir_unlink() call in the sequence above.   The
>>>>>>>only
>>>>>>> code
>>>>>>> sequence I can find where an op_dir_unlink() and an
>>>>>>>op_file_remove()
>>>>>>> appear together without a corresponding op_dir_link() is when a
>>>>>>>file
>>>>>>> is
>>>>>>> being deleted.
>>>>>>>
>>>>>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>>>>> above
>>>>>>> freed the block allocation for the inode and released the inode
>>>>>>>from
>>>>>>> the
>>>>>>> filemap, showing that the file->refs had to be zero at that time
>>>>>>> (routine
>>>>>>> _drop_link).   Yet the following op_file_update() call is
>>>>>>>performing
>>>>>>> a
>>>>>>> truncate on that same inode, which is not in the filemap.   So the
>>>>>>> caller
>>>>>>> has to be holding a file structure whose refs field is greater than
>>>>>>> zero.
>>>>>>> Which implies that the caller's file structure has to be stale.
>>>>>>>The
>>>>>>> fact
>>>>>>> that the op_file_update file contents for inode 113 is exactly the
>>>>>>> same
>>>>>>> as
>>>>>>> the previous op_file_update contents prior to the op_file_remove(),
>>>>>>> except
>>>>>>> for a reduction in file size, proves I believe that the file struct
>>>>>>> used
>>>>>>> by the op_file_update() caller is stale.
>>>>>>>
>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 102 =
>>>>>>>>> 0x7fb66aafbb90 (new)
>>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5be000:
>>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 109 =
>>>>>>>>> 0x7fb66aafbe60 (new)
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The prior transaction log entry for each of the three inodes was
>>>>>>>>> another
>>>>>>>>> op_file_update() call.  The only difference between the two
>>>>>>>>>updates
>>>>>>>>> for
>>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less
>>>>>>>>> bytes,
>>>>>>>>> all
>>>>>>>>> other fields are identical.
>>>>>>>> Are the extents same for both the events?
>>>>>>> Yes, please compare this line with the one higher above in this
>>>>>>> email,
>>>>>>> you
>>>>>>> can see that only the file size is different.   That is true for
>>>>>>>all
>>>>>>> three
>>>>>>> of these inodes.
>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5b4000:
>>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>>>>>> 17:05:53.738683 bdev 1 extents
>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 113 =
>>>>>>>>> 0x7fb66aafbb90
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>>>>>> 64421913,
>>>>>>>>> 6933026 less bytes, all other fields are identical.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5ba000:
>>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>>>>>> 22:41:33.028030 bdev 1 extents
>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 109 =
>>>>>>>>> 0x7fb66aafc910
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>>>>>> 68557043,
>>>>>>>>> 2802362 less bytes, all other fields are identical.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay
>>>>>>>>>0x5b9000:
>>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>>>>>> 17:05:54.615190 bdev 1 extents
>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino
>>>>>>>>> 102 =
>>>>>>>>> 0x7fb66aafbe60
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is only one caller of op_file_update() in which the size of
>>>>>>>>> the
>>>>>>>>> file
>>>>>>>>> is reduced but all other fields stay the same, that is
>>>>>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>>>>> BlueFS::truncate(),
>>>>>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>>>>>
>>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the
>>>>>>>>> three
>>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>>>>> having
>>>>>>>>> difficulty finding where the call is coming from.
>>>>>>>> Any syscall making dir rename or file remove or directory remove
>>>>>>>> will
>>>>>>>> result in BlueFS::rename or unlink
>>>>>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>>>>> operations.
>>>>>>>> rocksdb wal_manager or compaction
>>>>>>>> can trigger these calls during that process. Will take a closer
>>>>>>>>look
>>>>>>>> at
>>>>>>>> the logs, if i can get something from them.
>>>>>>>>> So the problem appears to be that the file is first deleted, then
>>>>>>>>> it
>>>>>>>>> is
>>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode
>>>>>>>>>is
>>>>>>>>> that
>>>>>>>>> another op_file_update() entry gets logged, recreating the
>>>>>>>>>deleted
>>>>>>>>> inode
>>>>>>>>> with a stale fnode, resulting in the 0 links failure.
>>>>>>>> Once the file is deleted, all the underlying extents will be
>>>>>>>> cleared.
>>>>>>>> Until and unless there is race in handling on
>>>>>>>> the files from application level, Bluefs doesn't have any control
>>>>>>>> over
>>>>>>>> them. If the a file(handle) is been reused after
>>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>>>>>> confirm your theory.
>>>>>>>> But here, this seems to be more of rename use case. one more thing
>>>>>>>> to
>>>>>>>> verify is the extents owned by the inodes mentioned.
>>>>>>> No, I don't believe this is a rename, see above.
>>>>>>>
>>>>>>> I'll note that every occurrence of this problem has been preceded
>>>>>>>by
>>>>>>> an
>>>>>>> OSD segfault.  And this problem is racy, if I increase debug
>>>>>>>logging
>>>>>>> for
>>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+
>>>>>>> bluestore
>>>>>>> segfaults and asserts in my cluster logs; I suspect many could be
>>>>>>>due
>>>>>>> to
>>>>>>> this problem, because the same range of blocks is being freed more
>>>>>>> than
>>>>>>> once.   If some other file should perform an allocation between the
>>>>>>> block-freeing in op_file_remove() and the second freeing in
>>>>>>> op_file_update(), then we could easily have multiple files with the
>>>>>>> same
>>>>>>> block allocation at the same time.
>>>>>>>
>>>>>>> I think this problem can be caught fairly easily, an assert could
>>>>>>>be
>>>>>>> added
>>>>>>> to _truncate().   The _truncate() routine is only called for files
>>>>>>> which
>>>>>>> are open, which means that the inode has to be in the filemap.
>>>>>>>Code
>>>>>>> could be added at the front of _truncate() to look up the inode in
>>>>>>> the
>>>>>>> filemap, and if it is not there, then assert.   That would provide
>>>>>>>a
>>>>>>> stack
>>>>>>> trace showing the caller using the file struct with the stale
>>>>>>> reference
>>>>>>> count.   (It may be a few days before I have a chance to try this
>>>>>>> myself.)
>>>>>> Sure. Could you please let me know the steps to hit issue?
>>>>>> i am doing only few min runs to check the performance and some unit
>>>>>> test
>>>>>> cases.
>>>>>> I will add the necessary asserts if i can reproduce the issue and
>>>>>>fix
>>>>>> it.
>>>>> Thanks for the help!   I will have to get back to you on Monday, as
>>>>> there
>>>>> is another person on our team who runs the benchmarks that trigger
>>>>>this
>>>>> problem, I don't know how to run his benchmarks myself.
>>>>>
>>>>> Kevan
>>>>>
>>>>>> Varada
>>>>>>> Regards, Kevan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bluestore bug#15724: bluefs _replay file with link count 0:
  2016-07-14 18:49 Kevan Rehm
@ 2016-07-14 19:49 ` Kevan Rehm
  0 siblings, 0 replies; 12+ messages in thread
From: Kevan Rehm @ 2016-07-14 19:49 UTC (permalink / raw)
  To: Kevan Rehm, ceph-devel (ceph-devel@vger.kernel.org)

A bit more info missing in the previous email:

# ceph -v
ceph version 11.0.0-196-g85bb43e (85bb43e111692989d2296a389ce45377d2297d6f)


The dump of recent events prior to the transaction log problem is 10,000
lines, so I'll just include the tail for now, if you want more, let me
know.

Kevan

   -22> 2016-07-13 00:35:43.893066 7f6d53fcc700 20 --
10.12.0.127:6804/4080 >> 10.12.0.129:0/1660564381 pipe(0x7f6d6f8c8800
sd=59 :6804 s=4 pgs=4 cs=1 l=1 c=0x7f6d783a6800).join
   -21> 2016-07-13 00:35:43.893702 7f6d53fcc700 10 --
10.12.0.127:6804/4080 reaper reaped pipe 0x7f6d6f8c8800
10.12.0.129:0/1660564381
   -20> 2016-07-13 00:35:43.893742 7f6d53fcc700 10 --
10.12.0.127:6804/4080 reaper deleted pipe 0x7f6d6f8c8800
   -19> 2016-07-13 00:35:43.893744 7f6d53fcc700 10 --
10.12.0.127:6804/4080 reaper done
   -18> 2016-07-13 00:35:44.201223 7f6d427a9700 10 monclient: tick
   -17> 2016-07-13 00:35:44.201250 7f6d427a9700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after 2016-07-13
00:35:14.201237)
   -16> 2016-07-13 00:35:44.201257 7f6d427a9700 20 --
10.12.0.127:6804/4080 send_keepalive con 0x7f6d68891c00, have pipe.
   -15> 2016-07-13 00:35:44.201298 7f6d5fcae700 10 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).writer: state = open
policy.server=0
   -14> 2016-07-13 00:35:44.201318 7f6d5fcae700 10 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).write_keepalive2 14
2016-07-13 00:35:44.201316
   -13> 2016-07-13 00:35:44.201337 7f6d5fcae700 10 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).writer: state = open
policy.server=0
   -12> 2016-07-13 00:35:44.201341 7f6d5fcae700 20 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).writer sleeping
   -11> 2016-07-13 00:35:44.201463 7f6d5ec90700  2 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).reader got KEEPALIVE_ACK
   -10> 2016-07-13 00:35:44.201474 7f6d5ec90700 20 --
10.12.0.127:6804/4080 >> 10.12.0.133:6789/0 pipe(0x7f6d768c2000 sd=17
:49366 s=2 pgs=4915 cs=1 l=1 c=0x7f6d68891c00).reader reading tag...
    -9> 2016-07-13 00:35:44.210392 7f6d33e61700  2 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=2 pgs=25 cs=3 l=0 c=0x7f6d68893700).reader couldn't read tag,
(107) Transport endpoint is not connected
    -8> 2016-07-13 00:35:44.210435 7f6d33e61700  2 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=2 pgs=25 cs=3 l=0 c=0x7f6d68893700).fault (107) Transport endpoint
is not connected
    -7> 2016-07-13 00:35:44.210486 7f6d33e61700  0 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=2 pgs=25 cs=3 l=0 c=0x7f6d68893700).fault with nothing to send,
going to standby
    -6> 2016-07-13 00:35:44.210494 7f6d33e61700 20 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=3 pgs=25 cs=3 l=0 c=0x7f6d68893700).reader sleeping during
reconnect|standby
    -5> 2016-07-13 00:35:44.210509 7f6d3577a700 10 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=3 pgs=25 cs=3 l=0 c=0x7f6d68893700).writer: state = standby
policy.server=0
    -4> 2016-07-13 00:35:44.210522 7f6d3577a700 20 --
10.13.0.127:6804/4080 >> 10.13.0.127:6800/4072 pipe(0x7f6d782df400 sd=115
:6804 s=3 pgs=25 cs=3 l=0 c=0x7f6d68893700).writer sleeping
    -3> 2016-07-13 00:35:44.438101 7f6d37793700 10 osd.20 342 Store synced
    -2> 2016-07-13 00:35:44.438121 7f6d37793700 20 osd.20 342  kicking pg
2.278
    -1> 2016-07-13 00:35:44.438124 7f6d37793700 30 osd.20 pg_epoch: 342
pg[2.278( v 339'406 (0'0,339'406] local-les=338 n=0 ec=130 les/c/f
338/338/0 337/337/337) [20] r=0 lpr
=337 luod=0'0 crt=339'404 lcod 339'405 mlcod 0'0 active+clean] lock
     0> 2016-07-13 00:35:44.438703 7f6d37793700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f6d37793700 thread_name:signal_handler


 ceph version 11.0.0-196-g85bb43e
(85bb43e111692989d2296a389ce45377d2297d6f)
 1: (()+0x980a3a) [0x7f6d5f614a3a]
 2: (()+0xf100) [0x7f6d5d5dd100]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.



On 7/14/16, 1:49 PM, "ceph-devel-owner@vger.kernel.org on behalf of Kevan
Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com> wrote:

>Greetings,
>
>Bug #15724 reports several occurrences of an OSD that will not start due
>to the message "bluefs _replay file with link count 0:".  I have also had
>several occurrences of this in the last couple of days, using a ceph built
>from master a couple of weeks ago.
>
>I jacked up the bluefs debug tracing and have been working through the
>reading of the transaction log by the OSD during startup, and I can see
>what appears to be a bad transaction that is causing this, see below.
>(Probably also the cause of the segmentation fault that occurred just
>before this incident.)
>
>Before I fry any more brain cells, I'd like to know if someone is already
>fixing this, or if someone has any advice to offer, like where the problem
>most likely lies.   If not, I'll keep digging.
>
>
>Here is what we see after the transaction log has been read, the OSD gets
>excited and bails:
>
>
>2016-07-13 16:03:03.933669 7fb6617c1800 20 bluefs _read got 4096
>2016-07-13 16:03:03.933671 7fb6617c1800 10 bluefs _replay 0x5cf000: stop:
>uuid 4ef0859e-06e8-51c2-9612-add7269b0100 != super.uuid
>7eb30434-317a-4844-b8d1-6f8cb37585d1
>2016-07-13 16:03:03.933678 7fb6617c1800 10 bluefs _replay log file size
>was 0x5cf000
>2016-07-13 16:03:03.933731 7fb6617c1800 -1 bluefs _replay file with link
>count 0: file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev
>1 extents 
>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>2016-07-13 16:03:03.933739 7fb6617c1800 -1 bluefs mount failed to replay
>log: (5) Input/output error
>2016-07-13 16:03:03.933753 7fb6617c1800 20 bluefs _stop_alloc
>2016-07-13 16:03:03.933754 7fb6617c1800 10 bitmapalloc:shutdown instance
>140421450426752
>2016-07-13 16:03:03.933924 7fb6617c1800 -1
>bluestore(/var/lib/ceph/osd/ceph-20) _open_db failed bluefs mount: (5)
>Input/output error
>
>
>Starting from the front of the log, here are the first references to ino
>113 in the transaction log:
>
>2016-07-13 16:03:03.900994 7fb6617c1800 20 bluefs _replay 0x562000:
>op_file_update  file(ino 113 size 0 mtime 2016-07-12 17:05:48.312770 bdev
>1 extents [])
>2016-07-13 16:03:03.900996 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90 (new)
>2016-07-13 16:03:03.900997 7fb6617c1800 20 bluefs _replay 0x562000:
>op_dir_link  db/000145.log to 113
>2016-07-13 16:03:03.900998 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90
>...
>2016-07-13 16:03:03.901047 7fb6617c1800 20 bluefs _replay 0x56e000:
>op_file_update  file(ino 113 size 23 mtime 2016-07-12 17:05:48.314814 bdev
>1 extents [1:904921088+1048576])
>2016-07-13 16:03:03.901048 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90
>
>
>Lots of op_file_updates are applied to ino 113 in the log, leading up to
>this final one.
>
>
>2016-07-13 16:03:03.924173 7fb6617c1800 20 bluefs _replay 0x5ab000:
>op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>17:05:49.420150 bdev 1 extents
>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>2016-07-13 16:03:03.924176 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90
>
>
>Next ino 113's filename is renamed from 000145.log to 000151.log, followed
>by an update, so it must be coming from BlueRocksEnv::ReuseWritableFile().
>
>
>2016-07-13 16:03:03.925181 7fb6617c1800 10 bluefs _replay 0x5b4000:
>txn(seq 77743 len 982 crc 98786513)
>2016-07-13 16:03:03.925182 7fb6617c1800 20 bluefs _replay 0x5b4000:
>op_dir_link  db/000151.log to 113
>2016-07-13 16:03:03.925183 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90
>2016-07-13 16:03:03.925184 7fb6617c1800 20 bluefs _replay 0x5b4000:
>op_dir_unlink  db/000145.log
>2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>17:05:53.738683 bdev 1 extents
>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafbb90
>
>
>In txn 77752, ino 113 and its filename get removed via a
>BlueRocksEnv::DeleteFile() call, followed by an update with the original
>allocation.  Bad news.  Don't yet know where the update is coming from.
>So now we have an ino 113 with no links, and a series of extents which it
>doesn't really own.
>
>
>2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>op_file_remove 113
>2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>op_dir_unlink  db/000150.log
>...
>2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>17:05:53.738683 bdev 1 extents
>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>0x7fb66aafc880 (new)
>
>
>The end result is an ino without a link, and an unhappy OSD.
>
>
>2016-07-13 16:03:03.933669 7fb6617c1800 20 bluefs _read got 4096
>2016-07-13 16:03:03.933671 7fb6617c1800 10 bluefs _replay 0x5cf000: stop:
>uuid 4ef0859e-06e8-51c2-9612-add7269b0100 != super.uuid
>7eb30434-317a-4844-b8d1-6f8cb37585d1
>2016-07-13 16:03:03.933678 7fb6617c1800 10 bluefs _replay log file size
>was 0x5cf000
>
>2016-07-13 16:03:03.933731 7fb6617c1800 -1 bluefs _replay file with link
>count 0: file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev
>1 extents 
>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>
>
>
>I do have the OSD logs showing the original segfault plus the error
>mentioned here, but I had to fix the block device so that I could move on.
>  I can upload the logs if someone is interested, but they're large.
>
>Thanks, Kevan
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* bluestore bug#15724: bluefs _replay file with link count 0:
@ 2016-07-14 18:49 Kevan Rehm
  2016-07-14 19:49 ` Kevan Rehm
  0 siblings, 1 reply; 12+ messages in thread
From: Kevan Rehm @ 2016-07-14 18:49 UTC (permalink / raw)
  To: ceph-devel (ceph-devel@vger.kernel.org)

Greetings,

Bug #15724 reports several occurrences of an OSD that will not start due
to the message "bluefs _replay file with link count 0:".  I have hit this
several times myself in the last couple of days, using a build of ceph
from master that is a couple of weeks old.

I turned up the bluefs debug tracing and have been working through the
OSD's replay of the transaction log during startup, and I can see what
appears to be a bad transaction that is causing this; see below.  (It is
probably also the cause of the segmentation fault that occurred just
before this incident.)

Before I fry any more brain cells, I'd like to know if someone is already
fixing this, or if someone has any advice to offer, like where the problem
most likely lies.   If not, I'll keep digging.
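
To make the rest of the walk-through easier to follow, here is a minimal
C++ sketch of the replay semantics as I understand them from the trace.
All the names here (ReplayModel, FileInfo, and so on) are my own invention
for illustration; this is a simplified model, not the actual BlueFS code.

// Minimal model of how I read the bluefs replay semantics from the debug
// trace.  Invented names and simplified types -- NOT the real BlueFS code.
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct FileInfo {
  uint64_t ino = 0;
  uint64_t size = 0;
  int nlink = 0;                                       // bumped by dir ops
  std::vector<std::pair<uint64_t, uint64_t>> extents;  // (offset, length)
};

struct ReplayModel {
  std::map<uint64_t, std::shared_ptr<FileInfo>> file_map;  // ino -> file
  std::map<std::string, uint64_t> dir_map;                 // filename -> ino

  // Get-or-create; a created entry corresponds to the
  // "_get_file ino N = 0x... (new)" lines in the trace.
  std::shared_ptr<FileInfo> get_file(uint64_t ino) {
    auto& f = file_map[ino];
    if (!f) {
      f = std::make_shared<FileInfo>();
      f->ino = ino;
      std::cout << "_get_file ino " << ino << " (new)\n";
    }
    return f;
  }

  // An update carries the file's complete metadata (size plus the whole
  // extent list), so on replay the latest update simply overwrites
  // whatever state came before it.
  void op_file_update(uint64_t ino, uint64_t size,
                      std::vector<std::pair<uint64_t, uint64_t>> extents) {
    auto f = get_file(ino);
    f->size = size;
    f->extents = std::move(extents);
  }

  void op_dir_link(const std::string& name, uint64_t ino) {
    dir_map[name] = ino;
    get_file(ino)->nlink++;
  }

  void op_dir_unlink(const std::string& name) {
    auto it = dir_map.find(name);
    if (it == dir_map.end()) return;
    get_file(it->second)->nlink--;
    dir_map.erase(it);
  }

  void op_file_remove(uint64_t ino) {
    file_map.erase(ino);
  }
};

int main() {
  ReplayModel m;
  // A healthy create, as seen for ino 113 at offset 0x562000 below: an
  // empty update followed by a link of the new name.
  m.op_file_update(113, 0, {});
  m.op_dir_link("db/000145.log", 113);
  std::cout << "ino 113 nlink=" << m.file_map[113]->nlink << "\n";  // 1
}

The snippets later in this mail just drive this model with the op
sequences taken from the trace.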


Here is what we see after the transaction log has been read; the OSD gets
excited and bails:


2016-07-13 16:03:03.933669 7fb6617c1800 20 bluefs _read got 4096
2016-07-13 16:03:03.933671 7fb6617c1800 10 bluefs _replay 0x5cf000: stop:
uuid 4ef0859e-06e8-51c2-9612-add7269b0100 != super.uuid
7eb30434-317a-4844-b8d1-6f8cb37585d1
2016-07-13 16:03:03.933678 7fb6617c1800 10 bluefs _replay log file size
was 0x5cf000
2016-07-13 16:03:03.933731 7fb6617c1800 -1 bluefs _replay file with link
count 0: file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev
1 extents 
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.933739 7fb6617c1800 -1 bluefs mount failed to replay
log: (5) Input/output error
2016-07-13 16:03:03.933753 7fb6617c1800 20 bluefs _stop_alloc
2016-07-13 16:03:03.933754 7fb6617c1800 10 bitmapalloc:shutdown instance
140421450426752
2016-07-13 16:03:03.933924 7fb6617c1800 -1
bluestore(/var/lib/ceph/osd/ceph-20) _open_db failed bluefs mount: (5)
Input/output error


Starting from the front of the log, here are the first references to ino
113 in the transaction log:

2016-07-13 16:03:03.900994 7fb6617c1800 20 bluefs _replay 0x562000:
op_file_update  file(ino 113 size 0 mtime 2016-07-12 17:05:48.312770 bdev
1 extents [])
2016-07-13 16:03:03.900996 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90 (new)
2016-07-13 16:03:03.900997 7fb6617c1800 20 bluefs _replay 0x562000:
op_dir_link  db/000145.log to 113
2016-07-13 16:03:03.900998 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90
...
2016-07-13 16:03:03.901047 7fb6617c1800 20 bluefs _replay 0x56e000:
op_file_update  file(ino 113 size 23 mtime 2016-07-12 17:05:48.314814 bdev
1 extents [1:904921088+1048576])
2016-07-13 16:03:03.901048 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90


Lots of op_file_updates are applied to ino 113 in the log, leading up to
this final one.


2016-07-13 16:03:03.924173 7fb6617c1800 20 bluefs _replay 0x5ab000:
op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
17:05:49.420150 bdev 1 extents
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.924176 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90


Next, ino 113's filename is renamed from 000145.log to 000151.log and is
then updated, so this must be coming from BlueRocksEnv::ReuseWritableFile().


2016-07-13 16:03:03.925181 7fb6617c1800 10 bluefs _replay 0x5b4000:
txn(seq 77743 len 982 crc 98786513)
2016-07-13 16:03:03.925182 7fb6617c1800 20 bluefs _replay 0x5b4000:
op_dir_link  db/000151.log to 113
2016-07-13 16:03:03.925183 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90
2016-07-13 16:03:03.925184 7fb6617c1800 20 bluefs _replay 0x5b4000:
op_dir_unlink  db/000145.log
2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
17:05:53.738683 bdev 1 extents
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafbb90
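
For comparison, here is a small sketch of the per-transaction op patterns
I am matching against: a fresh create versus a reuse-under-a-new-name.
The Op struct and the two helper functions are hypothetical, invented
purely to spell out the pattern; they are not anything from the Ceph tree.

// Hypothetical sketch of the op patterns I'm matching against in the
// trace.  Names and types are invented for illustration only.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Op {
  enum Kind { DIR_LINK, DIR_UNLINK, FILE_UPDATE } kind;
  std::string name;   // directory entry involved (empty for FILE_UPDATE)
  uint64_t ino;
};

// Fresh create, as seen for ino 113 at 0x562000 earlier: an empty
// file_update followed by a dir_link of the new name.
std::vector<Op> create_txn(const std::string& name, uint64_t ino) {
  return {
    {Op::FILE_UPDATE, "", ino},   // size 0, no extents yet
    {Op::DIR_LINK, name, ino},
  };
}

// Reuse under a new name, matching the txn 77743 excerpt just above:
// link the new name, unlink the old one, then persist the (otherwise
// unchanged) metadata.  This dir_link/dir_unlink/file_update pattern is
// what points me at BlueRocksEnv::ReuseWritableFile().
std::vector<Op> reuse_txn(const std::string& old_name,
                          const std::string& new_name, uint64_t ino) {
  return {
    {Op::DIR_LINK, new_name, ino},
    {Op::DIR_UNLINK, old_name, ino},
    {Op::FILE_UPDATE, "", ino},
  };
}

int main() {
  for (const Op& op : reuse_txn("db/000145.log", "db/000151.log", 113))
    std::cout << op.kind << " " << op.name << " ino " << op.ino << "\n";
}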


In txn 77752, ino 113 and its filename get removed via a
BlueRocksEnv::DeleteFile() call, but that is followed by another update
carrying the original allocation.  Bad news.  I don't yet know where that
update is coming from.  So now we have an ino 113 with no links and a set
of extents which it doesn't really own.


2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_remove 113
2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
op_dir_unlink  db/000150.log
...
2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
17:05:53.738683 bdev 1 extents
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
0x7fb66aafc880 (new)
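
Feeding the ops above into the ReplayModel sketch from earlier in this
mail reproduces the state I'm describing: the remove drops ino 113 from
the file table, and the stray update then resurrects it as a brand-new
entry with a link count of 0 (hence the second "(new)" _get_file line)
while re-attaching the old extent list.  Again, this is only my
simplified model, not the real code, and the helper name is invented.

// Drives the ReplayModel sketch from earlier in this mail with the
// (condensed) op sequence replayed for ino 113.  Illustrative only.
void replay_ino_113(ReplayModel& m) {
  // Creation and growth of db/000145.log, heavily condensed.
  m.op_file_update(113, 0, {});
  m.op_dir_link("db/000145.log", 113);                    // nlink 1
  // txn 77743: reuse under the new name (bdev prefixes dropped, extent
  // list truncated for brevity).
  m.op_dir_link("db/000151.log", 113);                    // nlink 2
  m.op_dir_unlink("db/000145.log");                       // nlink 1
  m.op_file_update(113, 71359362, {{904921088, 1048576}});
  // txn 77752: the delete...
  m.op_dir_unlink("db/000151.log");                       // nlink 0
  m.op_file_remove(113);                                  // gone from file_map
  // ...and then the stray update: get_file() re-creates ino 113 ("new")
  // with nlink 0 and the old extents re-attached.
  m.op_file_update(113, 71354938, {{904921088, 1048576}});
}

After this runs, file_map still contains ino 113 with nlink == 0, which
is exactly what the link-count check at the end of replay trips over.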


The end result is an ino without a link, and an unhappy OSD.


2016-07-13 16:03:03.933669 7fb6617c1800 20 bluefs _read got 4096
2016-07-13 16:03:03.933671 7fb6617c1800 10 bluefs _replay 0x5cf000: stop:
uuid 4ef0859e-06e8-51c2-9612-add7269b0100 != super.uuid
7eb30434-317a-4844-b8d1-6f8cb37585d1
2016-07-13 16:03:03.933678 7fb6617c1800 10 bluefs _replay log file size
was 0x5cf000

2016-07-13 16:03:03.933731 7fb6617c1800 -1 bluefs _replay file with link
count 0: file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev
1 extents 
[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+20
97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:9604956
16+58720256])
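
For completeness, here is a sketch of the end-of-replay check that I
believe emits the message above: any file still present in the table but
with no directory links is treated as corruption, and the resulting -EIO
is the "(5) Input/output error" that then surfaces in the bluefs mount
and _open_db failures at the top of this mail.  The helper extends the
ReplayModel sketch from earlier and, as before, is only an illustration.

// Extends the ReplayModel sketch from earlier in this mail: the
// end-of-replay sanity pass.  Illustrative only, not the real BlueFS code.
#include <cerrno>
#include <iostream>

int finish_replay(const ReplayModel& m) {
  for (const auto& [ino, f] : m.file_map) {
    if (f->nlink == 0) {
      std::cerr << "_replay file with link count 0: ino " << ino << "\n";
      return -EIO;   // shows up as "(5) Input/output error" further up
    }
  }
  return 0;
}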



I do have the OSD logs showing the original segfault plus the error
mentioned here, but I had to fix the block device so that I could move on.
  I can upload the logs if someone is interested, but they're large.

Thanks, Kevan

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


