* corruption of active mmapped files in btrfs snapshots
@ 2013-03-18 21:14 ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-18 21:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ceph-devel

For quite a while, I've experienced oddities with snapshotted Firefox
_CACHE_00?_ files, whose checksums (and contents) would change after the
btrfs snapshot was taken, and would even change depending on how the
file was brought to memory (e.g., rsyncing it to backup storage vs
checking its md5sum before or after the rsync).  This only affected
these cache files, so I didn't give it too much attention.

A similar problem seems to affect the leveldb databases maintained by
ceph within the periodic snapshots it takes of its object storage
volumes.  I'm told others using ceph on filesystems other than btrfs are
not observing this problem, which makes me think it's not memory
corruption within ceph itself.  I've looked into this for a bit, and I'm
now inclined to believe it has to do with some bad interaction of mmap
and snapshots; I'm not sure the fact that the filesystem has compression
enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only
appends to files, ftruncate()ing them to about a MB early on, mmap()ping
that in and memcpy()ing blocks of various sizes to the end of the output
buffer, occasionally msync()ing the maps, or running fdatasync if it
didn't msync a map before munmap()ping it.  If it runs out of space in a
map, it munmap()s the previously mapped range, truncates the file to a
larger size, then maps in the new tail of the file, starting at the page
it should append to next.

What I'm observing is that some btrfs snapshots taken by ceph osds,
containing the leveldb database, are corrupted, causing crashes during
the use of the database.

I've scripted regular checks of osd snapshots, saving the
last-known-good database along with the first one that displays the
corruption.  Studying about two dozen failures over the weekend, which
took place on all of 13 btrfs-based osds on 3 servers running btrfs from
kernel 3.8.3(-gnu), I noticed that all of the corrupted databases had a
similar pattern: a stream of NULs of varying sizes at the end of a page,
starting at a block boundary (leveldb doesn't do page-sized blocking, so
blocks can start anywhere in a page), and ending close to the beginning
of the next page, although not exactly at the page boundary; 20 bytes
past the page boundary seemed to be the most common size, but the
occasional presence of NULs in the database contents makes it harder to
tell for sure.

The stream of NULs ended in the middle of a database block (meaning it
was not the beginning of a subsequent database block written later; the
beginning of the database block was partially replaced with NULs).
Furthermore, the checksum fails to match on this one partially-NULed
block.  Since the checksum is computed just before the block and the
checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
that the block was copied entirely to the right place at some point, and
if part of it became zeros, it's either because the modification was
partially lost, or because the mmapped buffer was partially overwritten.
The fact that all instances of corruption I looked at were correct right
to the end of one block boundary, and then all zeros instead of the
beginning of the subsequent block to the end of that page, makes a
failure to write that modified page seem more likely in my mind (more so
given the Firefox _CACHE_ file oddities in snapshots); intense memory
pressure at the time of the corruption also seems to favor this
possibility.

Now, it could be that btrfs requires those who modify SHARED mmap()ed
files to take some precaution, along the lines of msync MS_ASYNC, to
make sure that data makes it to a subsequent snapshot, and leveldb does
not take this sort of precaution.  However, I noticed that the
unexpected stream of zeros
after a prior block and before the rest of the subsequent block
*remains* in subsequent snapshots, which to me indicates the page update
is effectively lost.  This explains why even the running osd, that
operates on the “current” subvolumes from which snapshots for recovery
are taken, occasionally crashes because of database corruption, and will
later fail to restart from an earlier snapshot due to that same
corruption.


Does this problem sound familiar to anyone else?

Should mmapped-file writers in general do more than munmap or msync to
ensure changes make it to subsequent snapshots that are supposed to be
consistent?

Any tips on where to start looking so as to fix the problem, or even to
confirm that the problem is indeed in btrfs?


TIA,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread


* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-18 21:14 ` Alexandre Oliva
  (?)
@ 2013-03-18 22:43 ` Alexandre Oliva
  -1 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-18 22:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ceph-devel

While I wrote the previous email, a smoking gun formed in one of my
servers: a snapshot that had passed a database consistency check turned
out to be corrupted when I tried to rollback to it!  Since the snapshot
was not modified in any way between the initial scripted check and the
later manual check, the problem must be in btrfs.

On Mar 18, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> I've scripted regular checks of osd snapshots, saving the
> last-known-good database along with the first one that displays the
> corruption.  Studying about two dozen failures over the weekend, that
> took place on all of 13 btrfs-based osds on 3 servers running btrfs as
> in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
> similar pattern: a stream of NULs of varying sizes at the end of a page,
> starting at a block boundary (leveldb doesn't do page-sized blocking, so
> blocks can start anywhere in a page), and ending close to the beginning
> of the next page, although not exactly at the page boundary; 20 bytes
> past the page boundary seemed to be the most common size, but the
> occasional presence of NULs in the database contents makes it harder to
> tell for sure.

Additional corrupted snapshots collected today have confirmed this
pattern, except that today I got several corrupted files with non-NULs
right at the beginning of the page following the one that marked the
beginning of the corrupted database block.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-18 21:14 ` Alexandre Oliva
@ 2013-03-18 22:52   ` Chris Mason
  -1 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-18 22:52 UTC (permalink / raw)
  To: Alexandre Oliva, linux-btrfs; +Cc: ceph-devel

A few questions.  Does leveldb use O_DIRECT and mmap together? (the
source of a write being pages that are mmap'd from somewhere else)

That's the most likely place for this kind of problem.  Also, you
mention crc errors.  Are those reported by btrfs or are they application
level crcs.

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
> For quite a while, I've experienced oddities with snapshotted Firefox
> _CACHE_00?_ files, whose checksums (and contents) would change after the
> btrfs snapshot was taken, and would even change depending on how the
> file was brought to memory (e.g., rsyncing it to backup storage vs
> checking its md5sum before or after the rsync).  This only affected
> these cache files, so I didn't give it too much attention.
> 
> A similar problem seems to affect the leveldb databases maintained by
> ceph within the periodic snapshots it takes of its object storage
> volumes.  I'm told others using ceph on filesystems other than btrfs are
> not observing this problem, which makes me think it's not memory
> corruption within ceph itself.  I've looked into this for a bit, and I'm
> now inclined to believe it has to do with some bad interaction of mmap
> and snapshots; I'm not sure the fact that the filesystem has compression
> enabled has any effect, but that's certainly a possibility.
> 
> leveldb does not modify file contents once they're initialized, it only
> appends to files, ftruncate()ing them to about a MB early on, mmap()ping
> that in and memcpy()ing blocks of various sizes to the end of the output
> buffer, occasionally msync()ing the maps, or running fdatasync if it
> didn't msync a map before munmap()ping it.  If it runs out of space in a
> map, it munmap()s the previously mapped range, truncates the file to a
> larger size, then maps in the new tail of the file, starting at the page
> it should append to next.
> 
> What I'm observing is that some btrfs snapshots taken by ceph osds,
> containing the leveldb database, are corrupted, causing crashes during
> the use of the database.
> 
> I've scripted regular checks of osd snapshots, saving the
> last-known-good database along with the first one that displays the
> corruption.  Studying about two dozen failures over the weekend, that
> took place on all of 13 btrfs-based osds on 3 servers running btrfs as
> in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
> similar pattern: a stream of NULs of varying sizes at the end of a page,
> starting at a block boundary (leveldb doesn't do page-sized blocking, so
> blocks can start anywhere in a page), and ending close to the beginning
> of the next page, although not exactly at the page boundary; 20 bytes
> past the page boundary seemed to be the most common size, but the
> occasional presence of NULs in the database contents makes it harder to
> tell for sure.
> 
> The stream of NULs ended in the middle of a database block (meaning it
> was not the beginning of a subsequent database block written later; the
> beginning of the database block was partially replaced with NULs).
> Furthermore, the checksum fails to match on this one partially-NULed
> block.  Since the checksum is computed just before the block and the
> checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
> that the block was copied entirely to the right place at some point, and
> if part of it became zeros, it's either because the modification was
> partially lost, or because the mmapped buffer was partially overwritten.
> The fact that all instances of corruption I looked at were correct right
> to the end of one block boundary, and then all zeros instead of the
> beginning of the subsequent block to the end of that page, makes a
> failure to write that modified page seem more likely in my mind (more so
> given the Firefox _CACHE_ file oddities in snapshots); intense memory
> pressure at the time of the corruption also seems to favor this
> possibility.
> 
> Now, it could be that btrfs requires those who modify SHARED mmap()ed
> files to take some precaution, along the lines of msync MS_ASYNC, to
> make sure that data makes it to a subsequent snapshot, and leveldb does
> not take this sort of precaution.  However, I noticed that the
> unexpected stream of zeros
> after a prior block and before the rest of the subsequent block
> *remains* in subsequent snapshots, which to me indicates the page update
> is effectively lost.  This explains why even the running osd, that
> operates on the “current” subvolumes from which snapshots for recovery
> are taken, occasionally crashes because of database corruption, and will
> later fail to restart from an earlier snapshot due to that same
> corruption.
> 
> 
> Does this problem sound familiar to anyone else?
> 
> Should mmapped-file writers in general do more than munmap or msync to
> ensure changes make it to subsequent snapshots that are supposed to be
> consistent?
> 
> Any tips on where to start looking so as to fix the problem, or even to
> confirm that the problem is indeed in btrfs?
> 
> 
> TIA,
> 
> -- 
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-18 22:52   ` Chris Mason
@ 2013-03-19  5:20     ` Alexandre Oliva
  -1 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-19  5:20 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 18, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> A few questions.  Does leveldb use O_DIRECT and mmap together?

No, it doesn't use O_DIRECT at all.  Its I/O interface is very
simplified: it just opens each new file (database chunks limited to 2MB)
with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
munmap and fdatasync.  It doesn't seem to modify data once it's written;
it only appends.  Reading data back from it uses a completely different
class interface, using separate descriptors and using pread only.

> (the source of a write being pages that are mmap'd from somewhere
> else)

AFAICT the source of the memcpy()s that append to the file are
malloc()ed memory.

> That's the most likely place for this kind of problem.  Also, you
> mention crc errors.  Are those reported by btrfs or are they application
> level crcs.

These are CRCs leveldb computes and writes out after each db block.  No
btrfs CRC errors are reported in this process.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-19  5:20     ` Alexandre Oliva
  (?)
@ 2013-03-19 12:09     ` Chris Mason
  2013-03-19 17:29       ` Sage Weil
  2013-03-19 19:26         ` Alexandre Oliva
  -1 siblings, 2 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-19 12:09 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Alexandre Oliva (2013-03-19 01:20:10)
> On Mar 18, 2013, Chris Mason <chris.mason@fusionio.com> wrote:
> 
> > A few questions.  Does leveldb use O_DIRECT and mmap together?
> 
> No, it doesn't use O_DIRECT at all.  Its I/O interface is very
> simplified: it just opens each new file (database chunks limited to 2MB)
> with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
> munmap and fdatasync.  It doesn't seem to modify data once it's written;
> it only appends.  Reading data back from it uses a completely different
> class interface, using separate descriptors and using pread only.
> 
> > (the source of a write being pages that are mmap'd from somewhere
> > else)
> 
> AFAICT the source of the memcpy()s that append to the file are
> malloc()ed memory.
> 
> > That's the most likely place for this kind of problem.  Also, you
> > mention crc errors.  Are those reported by btrfs or are they application
> > level crcs.
> 
> These are CRCs leveldb computes and writes out after each db block.  No
> btrfs CRC errors are reported in this process.

Ok, so we have three moving pieces here.

1) leveldb truncating the files
2) leveldb using mmap to write
3) btrfs snapshots

My guess is the truncate is creating an orphan item that is being
processed inside the snapshot.

Is it possible to create a smaller leveldb unit test that we might use
to exercise all of this?

-chris


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-19 12:09     ` Chris Mason
@ 2013-03-19 17:29       ` Sage Weil
  2013-03-19 19:26           ` Alexandre Oliva
  2013-03-19 19:26         ` Alexandre Oliva
  1 sibling, 1 reply; 42+ messages in thread
From: Sage Weil @ 2013-03-19 17:29 UTC (permalink / raw)
  To: Chris Mason; +Cc: Alexandre Oliva, linux-btrfs, ceph-devel

On Tue, 19 Mar 2013, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-19 01:20:10)
> > On Mar 18, 2013, Chris Mason <chris.mason@fusionio.com> wrote:
> > 
> > > A few questions.  Does leveldb use O_DIRECT and mmap together?
> > 
> > No, it doesn't use O_DIRECT at all.  Its I/O interface is very
> > simplified: it just opens each new file (database chunks limited to 2MB)
> > with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
> > munmap and fdatasync.  It doesn't seem to modify data once it's written;
> > it only appends.  Reading data back from it uses a completely different
> > class interface, using separate descriptors and using pread only.
> > 
> > > (the source of a write being pages that are mmap'd from somewhere
> > > else)
> > 
> > AFAICT the source of the memcpy()s that append to the file are
> > malloc()ed memory.
> > 
> > > That's the most likely place for this kind of problem.  Also, you
> > > mention crc errors.  Are those reported by btrfs or are they application
> > > level crcs.
> > 
> > These are CRCs leveldb computes and writes out after each db block.  No
> > btrfs CRC errors are reported in this process.
> 
> Ok, so we have three moving pieces here.
> 
> 1) leveldb truncating the files
> 2) leveldb using mmap to write
> 3) btrfs snapshots
> 
> My guess is the truncate is creating an orphan item that is being
> processed inside the snapshot.
> 
> Is it possible to create a smaller leveldb unit test that we might use
> to exercise all of this?

There is a set of unit tests in the leveldb source tree that ought to do 
the trick:

	git clone https://code.google.com/p/leveldb/

sage

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-19 12:09     ` Chris Mason
@ 2013-03-19 19:26         ` Alexandre Oliva
  2013-03-19 19:26         ` Alexandre Oliva
  1 sibling, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-19 19:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 19, 2013, Chris Mason <clmason@fusionio.com> wrote:

> My guess is the truncate is creating an orphan item

Would it, even though the truncate is used to grow rather than to shrink
the file?

> that is being processed inside the snapshot.

This doesn't explain why the master database occasionally gets similarly
corrupted, does it?

> Is it possible to create a smaller leveldb unit test that we might use
> to exercise all of this?

I suppose we can even do away with leveldb altogether, using only a
PosixMmapFile object, as created by PosixEnv::NewWritableFile (all of
this is defined in leveldb's util/env_posix.cc), to exercise the
creation and growth of multiple files, one at a time, taking btrfs
snapshots at random in between the writes.  This ought to suffice.

One thing I have yet to check is whether ceph uses leveldb's sync
WriteOption, to determine whether or not to call the file object's Sync
member function in the test; this would bring fdatasync and msync calls
into the picture that would otherwise be left entirely out of the test.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-19 17:29       ` Sage Weil
@ 2013-03-19 19:26           ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-19 19:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chris Mason, linux-btrfs, ceph-devel

On Mar 19, 2013, Sage Weil <sage@inktank.com> wrote:

> There is a set of unit tests in the leveldb source tree that ought to do 
> the trick:

> 	git clone https://code.google.com/p/leveldb/

But these don't create btrfs snapshots.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-19 19:26         ` Alexandre Oliva
@ 2013-03-20  1:58           ` Alexandre Oliva
  -1 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-20  1:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

>> that is being processed inside the snapshot.

> This doesn't explain why the master database occasionally gets similarly
> corrupted, does it?

Actually, scratch this bit for now.  I don't really have proof that the
master database actually gets corrupted while it's in use, rather than
having inherited corruption on a server restart, which rolls back to the
most recent snapshot and replays the osd journal on it.  It could be
that the used snapshot is corrupted in a way that doesn't manifest
itself immediately, or that it gets corrupted afterwards with your
delayed-orphan theory.

I wrote a test that exercises leveldb's PosixMmapFile with highly
compressible appends of varying sizes, as well as syncs and btrfs
snapshots at random, but I haven't been able to trigger the problem with
it (yet?).

I'm now instrumenting the failing code to try to collect more data.  It
looks like, even though ceph does use leveldb's sync option in some
situations, the syncs don't all seem to reach the data files, only the
leveldb logs.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-20  1:58           ` Alexandre Oliva
@ 2013-03-21  7:14             ` Alexandre Oliva
  -1 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-21  7:14 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>>> that is being processed inside the snapshot.

>> This doesn't explain why the master database occasionally gets similarly
>> corrupted, does it?

> Actually, scratch this bit for now.  I don't really have proof that the
> master database actually gets corrupted while it's in use

Scratch the “scratch this”.  The master database actually gets
corrupted, and it's with recently-created files, created after earlier
known-good snapshots.  So, it can't really be orphan processing, can it?

Some more info from the errors and instrumentation:

- no data syncing on the affected files is taking place.  it's just
  memcpy()ing data in <4KiB-sized chunks onto mmap()ed areas,
  munmap()ing it, growing the file with ftruncate and mapping a
  subsequent chunk for further output

- the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
  I suspected at first, but they do coincide with the end of extents
  that are smaller than the maximum compressed extent size.  So,
  something's making btrfs flush pages to disk before the pages are
  completely written (which is fine in principle), but apparently
  failing to pick up subsequent changes to the pages (eek!)

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-21  7:14             ` Alexandre Oliva
@ 2013-03-21 18:06               ` Chris Mason
  -1 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-21 18:06 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Alexandre Oliva (2013-03-21 03:14:02)
> On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
> 
> > On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
> >>> that is being processed inside the snapshot.
> 
> >> This doesn't explain why the master database occasionally gets similarly
> >> corrupted, does it?
> 
> > Actually, scratch this bit for now.  I don't really have proof that the
> > master database actually gets corrupted while it's in use
> 
> Scratch the “scratch this”.  The master database actually gets
> corrupted, and it's with recently-created files, created after earlier
> known-good snapshots.  So, it can't really be orphan processing, can it?

Right, it can't be orphan processing.

> 
> Some more info from the errors and instrumentation:
> 
> - no data syncing on the affected files is taking place.  it's just
>   memcpy()ing data in <4KiB-sized chunks onto mmap()ed areas,
>   munmap()ing it, growing the file with ftruncate and mapping a
>   subsequent chunk for further output
> 
> - the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
>   I suspected at first, but they do coincide with the end of extents
>   that are smaller than the maximum compressed extent size.  So,
>   something's making btrfs flush pages to disk before the pages are
>   completely written (which is fine in principle), but apparently
>   failing to pick up subsequent changes to the pages (eek!)

With mmap the kernel can pick any given time to start writing out dirty
pages.  The idea is that if the application makes more changes the page
becomes dirty again and the kernel writes it again.

So the question is, can you trigger this without snapshots being done
at all?  I'll try to make an mmap tester here that hammers on the
related code.  We usually test this with fsx, which catches all kinds of
horrors.

-chris

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-21 18:06               ` Chris Mason
@ 2013-03-21 23:06                 ` Chris Mason
  -1 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-21 23:06 UTC (permalink / raw)
  To: Chris Mason, Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Chris Mason (2013-03-21 14:06:14)
> Quoting Alexandre Oliva (2013-03-21 03:14:02)
> > On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
> > 
> > > On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
> > >>> that is being processed inside the snapshot.
> > 
> > >> This doesn't explain why the master database occasionally gets similarly
> > >> corrupted, does it?
> > 
> > > Actually, scratch this bit for now.  I don't really have proof that the
> > > master database actually gets corrupted while it's in use
> > 
> > Scratch the “scratch this”.  The master database actually gets
> > corrupted, and it's with recently-created files, created after earlier
> > known-good snapshots.  So, it can't really be orphan processing, can it?
> 
> Right, it can't be orphan processing.
> 
> > 
> > Some more info from the errors and instrumentation:
> > 
> > - no data syncing on the affected files is taking place.  it's just
> >   memcpy()ing data in <4KiB-sized chunks onto mmap()ed areas,
> >   munmap()ing it, growing the file with ftruncate and mapping a
> >   subsequent chunk for further output
> > 
> > - the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
> >   I suspected at first, but they do coincide with the end of extents
> >   that are smaller than the maximum compressed extent size.  So,
> >   something's making btrfs flush pages to disk before the pages are
> >   completely written (which is fine in principle), but apparently
> >   failing to pick up subsequent changes to the pages (eek!)
> 
> With mmap the kernel can pick any given time to start writing out dirty
> pages.  The idea is that if the application makes more changes the page
> becomes dirty again and the kernel writes it again.
> 
> So the question is, can you trigger this without snapshots being done
> at all?  I'll try to make an mmap tester here that hammers on the
> related code.  We usually test this with fsx, which catches all kinds of
> horrors.

So my test program creates an 8GB file in chunks of 1MB each, using
truncate to extend the file and then mmap to write into the new hole.
It is writing in 1MB chunks, ever so slightly not aligned.  After
creating the whole file, it reads it back to look for errors.

I'm running this with heavy memory pressure, but no snapshots.  No
corruptions yet, but I'll let it run a while longer.

-chris


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-21 23:06                 ` Chris Mason
@ 2013-03-22  5:27                   ` Alexandre Oliva
  -1 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-22  5:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: Chris Mason, linux-btrfs, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2664 bytes --]

On Mar 21, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> Quoting Chris Mason (2013-03-21 14:06:14)
>> With mmap the kernel can pick any given time to start writing out dirty
>> pages.  The idea is that if the application makes more changes the page
>> becomes dirty again and the kernel writes it again.

That's the theory.  But what if there's some race between the time the
page is frozen for compressing and the time it's marked as clean, or
it's marked as clean after it's further modified, or a subsequent write
to the same page ends up overridden by the background compression of the
old contents of the page?  These are all possibilities that come to mind
without knowing much about btrfs inner workings.

>> So the question is, can you trigger this without snapshots being done
>> at all?

I haven't tried, but I now have a program that hit the error condition
while taking snapshots in background with small time perturbations to
increase the likelihood of hitting a race condition at the exact time.
It uses leveldb's infrastructure for the mmapping, but it shouldn't be
too hard to adapt it so that it doesn't.

> So my test program creates an 8GB file in chunks of 1MB each.

That's probably too large a chunk to write at a time.  The bug is
exercised with writes slightly smaller than a single page (although
straddling across two consecutive pages).

This half-baked test program (hereby provided under the terms of the GNU
GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
will be performed with write()s, another that will get the same data
appended with leveldb's mmap-based output interface.  Random block
sizes, as well as milli and microsecond timing perturbations, are read
from /dev/urandom, and the rest of the output buffer is filled with
(char)1.

The test that actually failed (on the first try!, after some other
variations that didn't fail) didn't have any of the #ifdef options
enabled (i.e., no -D* flags during compilation), but it triggered the
exact failure observed with ceph: zeros at the end of a page where there
should have been nonzero data, followed by nonzero data on the following
page!  That was within snapshots, not in the main subvol, but hopefully
it's the same problem, just a bit harder to trigger.

I can't tell whether memory pressure is required to hit the problem.
The system on which I hit the error was mostly otherwise idle while
running the test, but starting so many shell commands in background
surely creates intense activity on the system, possibly increasing the
odds that some race condition will hit.

Two subsequent runs of the program failed to trigger the problem.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: snaptest.cc --]
[-- Type: text/x-c++src, Size: 2684 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <string>
#include <sstream>
#include "leveldb/env.h"

#ifndef MAXROUNDS
#define MAXROUNDS 400
#endif

int main () {
  leveldb::Env *env = leveldb::Env::Default();
  leveldb::Status s;
  std::string str;
  int wd;
  int rd;
  leveldb::WritableFile *out;
  char buf[4096];
  unsigned long long totalsize = 0;
  int blocks;
  pid_t __attribute__((__unused__)) pid = getpid();

  memset(buf, 1, sizeof(buf));

  rd = open("/dev/urandom", O_RDONLY);
  if (rd == -1) {
    perror ("open random");
    abort ();
  }

  str = "btrfs su cr snaptest.";
  if (system (str.c_str())) {
    perror ("subvol create");
    abort ();
  }

  str = "snaptest./";
  unlink((str + "ca").c_str ());
  wd = open((str + "ca").c_str (), O_CREAT | O_TRUNC | O_WRONLY, 0644);
  if (wd == -1) {
    perror ("open wd");
    abort ();
  }
  
  unlink((str + "db").c_str ());
  s = env->NewWritableFile(str + "db", &out);
  if (!s.ok()) {
    perror ("open db");
    abort ();
  }

  for (blocks = 0; blocks < MAXROUNDS; blocks++) {
    if (read (rd, buf, 3) != 3) {
      printf ("\nread error: %s\n", strerror (errno));
      break;
    }

    printf("\r%i blocks, %llu total size\n",
	   blocks, totalsize);

#if !NOBGCMP
    std::ostringstream os;
    if (buf[1] || buf[2])
      os << "usleep " << 1000L * (long)(unsigned char)buf[1]
	+ (buf[1] ? (long)(signed char)buf[2] : (unsigned char)buf[2])
	 << " && ";
#if !NOSNAPS
    os << "btrfs su snap snaptest. snaptest." << blocks
       << " && sleep 5 && if cmp -n `stat -c %s snaptest." << blocks
       << "/ca` snaptest." << blocks
       << "/??; then btrfs su del snaptest." << blocks
       << "; else kill " << pid << "; fi &";
#else
    os << "sleep 5 && if cmp -n " << totalsize
       << " snaptest./??; then :; else kill " << pid << "; fi &";
#endif
    if (system (os.str().c_str())) {
      printf ("\nsnap error: %s\n", strerror (errno));
      break;
    }
#endif

    int size = 4096 - (unsigned char)buf[0];

    s = out->Append(leveldb::Slice(buf, size));
    if (!s.ok ()) {
      printf("\nappend error: %s\n", s.ToString().c_str());
      break;
    }

    if (write (wd, buf, size) != size) {
      printf("\nwrite error: %s\n", strerror (errno));
      break;
    }

    if (buf[1])
      for (timespec tv = { 0, (unsigned char)buf[1] * 1000000L };
	   nanosleep(&tv, &tv);)
	;

    totalsize += size;
  }

  printf("\r%i blocks, %llu total size\n",
	 blocks, totalsize);

#if NOBGCMP
  if (system("cmp snaptest./??")) {
    printf ("\ncmp error: %s\n", strerror (errno));
    abort ();
  }
#endif
}

[-- Attachment #3: Type: text/plain, Size: 257 bytes --]


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
@ 2013-03-22  5:27                   ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-22  5:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: Chris Mason, linux-btrfs, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2664 bytes --]

On Mar 21, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> Quoting Chris Mason (2013-03-21 14:06:14)
>> With mmap the kernel can pick any given time to start writing out dirty
>> pages.  The idea is that if the application makes more changes the page
>> becomes dirty again and the kernel writes it again.

That's the theory.  But what if there's some race between the time the
page is frozen for compressing and the time it's marked as clean, or
it's marked as clean after it's further modified, or a subsequent write
to the same page ends up overridden by the background compression of the
old contents of the page?  These are all possibilities that come to mind
without knowing much about btrfs inner workings.

>> So the question is, can you trigger this without snapshots being done
>> at all?

I haven't tried, but I now have a program that hit the error condition
while taking snapshots in background with small time perturbations to
increase the likelihood of hitting a race condition at the exact time.
It uses leveldb's infrastructure for the mmapping, but it shouldn't be
too hard to adapt it so that it doesn't.

> So my test program creates an 8GB file in chunks of 1MB each.

That's probably too large a chunk to write at a time.  The bug is
exercised with writes slightly smaller than a single page (although
straddling across two consecutive pages).

This half-baked test program (hereby provided under the terms of the GNU
GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
will be performed with write()s, another that will get the same data
appended with leveldb's mmap-based output interface.  Random block
sizes, as well as milli and microsecond timing perturbations, are read
from /dev/urandom, and the rest of the output buffer is filled with
(char)1.

The test that actually failed (on the first try!, after some other
variations that didn't fail) didn't have any of the #ifdef options
enabled (i.e., no -D* flags during compilation), but it triggered the
exact failure observed with ceph: zeros at the end of a page where there
should have been nonzero data, followed by nonzero data on the following
page!  That was within snapshots, not in the main subvol, but hopefully
it's the same problem, just a bit harder to trigger.

I can't tell whether memory pressure is required to hit the problem.
The system on which I hit the error was mostly otherwise idle while
running the test, but starting so many shell commands in background
surely creates intense activity on the system, possibly increasing the
odds that some race condition will hit.

Two subsequent runs of the program failed to trigger the problem.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: snaptest.cc --]
[-- Type: text/x-c++src, Size: 2684 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <string>
#include <sstream>
#include "leveldb/env.h"

#ifndef MAXROUNDS
#define MAXROUNDS 400
#endif

int main () {
  leveldb::Env *env = leveldb::Env::Default();
  leveldb::Status s;
  std::string str;
  int wd;
  int rd;
  leveldb::WritableFile *out;
  char lenbuf[1];
  char buf[4096];
  unsigned long long totalsize = 0;
  int blocks;
  pid_t __attribute__((__unused__)) pid = getpid();

  memset(buf, 1, sizeof(buf));

  rd = open("/dev/urandom", O_RDONLY);
  if (rd == -1) {
    perror ("open random");
    abort ();
  }

  str = "btrfs su cr snaptest.";
  if (system (str.c_str())) {
    perror ("subvol create");
    abort ();
  }

  str = "snaptest./";
  unlink((str + "ca").c_str ());
  wd = open((str + "ca").c_str (), O_CREAT | O_TRUNC | O_WRONLY, 0644);
  if (wd == -1) {
    perror ("open wd");
    abort ();
  }
  
  unlink((str + "db").c_str ());
  s = env->NewWritableFile(str + "db", &out);
  if (!s.ok()) {
    perror ("open db");
    abort ();
  }

  for (blocks = 0; blocks < MAXROUNDS; blocks++) {
    if (read (rd, buf, 3) != 3) {
      printf ("\nread error: %s\n", strerror (errno));
      break;
    }

    printf("\r%i blocks, %llu total size\n",
	   blocks, totalsize);

#if !NOBGCMP
    std::ostringstream os;
    if (buf[1] || buf[2])
      os << "usleep " << 1000L * (long)(unsigned char)buf[1]
	+ (buf[1] ? (long)(signed char)buf[2] : (unsigned char)buf[2])
	 << " && ";
#if !NOSNAPS
    os << "btrfs su snap snaptest. snaptest." << blocks
       << " && sleep 5 && if cmp -n `stat -c %s snaptest." << blocks
       << "/ca` snaptest." << blocks
       << "/??; then btrfs su del snaptest." << blocks
       << "; else kill " << pid << "; fi &";
#else
    /* without snapshots, compare the two files in the main subvol */
    os << "sleep 5 && if cmp -n " << totalsize
       << " snaptest./??; then :; else kill " << pid << "; fi &";
#endif
    if (system (os.str().c_str())) {
      printf ("\nsnap error: %s\n", strerror (errno));
      break;
    }
#endif

    int size = 4096 - (unsigned char)buf[0];

    s = out->Append(leveldb::Slice(buf, size));
    if (!s.ok ()) {
      printf("\nappend error: %s\n", s.ToString().c_str());
      break;
    }

    if (write (wd, buf, size) != size) {
      printf("\nwrite error: %s\n", strerror (errno));
      break;
    }

    if (buf[1])
      for (timespec tv = { 0, (unsigned char)buf[1] * 1000000L };
	   nanosleep(&tv, &tv);)
	;

    totalsize += size;
  }

  printf("\r%i blocks, %llu total size\n",
	 blocks, totalsize);

#if NOBGCMP
  if (system("cmp snaptest./??")) {
    printf ("\ncmp error: %s\n", strerror (errno));
    exit (1);
  }
#endif
}

[-- Attachment #3: Type: text/plain, Size: 257 bytes --]


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22  5:27                   ` Alexandre Oliva
  (?)
@ 2013-03-22 12:07                   ` Chris Mason
  2013-03-22 14:17                       ` Alexandre Oliva
  -1 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2013-03-22 12:07 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason <chris.mason@fusionio.com> wrote:
> 
> > Quoting Chris Mason (2013-03-21 14:06:14)
> >> With mmap the kernel can pick any given time to start writing out dirty
> >> pages.  The idea is that if the application makes more changes the page
> >> becomes dirty again and the kernel writes it again.
> 
> That's the theory.  But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of the
> old contents of the page?  These are all possibilities that come to mind
> without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing.  Are you using
compression in btrfs or just in leveldb?

> 
> >> So the question is, can you trigger this without snapshots being done
> >> at all?
> 
> I haven't tried, but I now have a program that hit the error condition
> while taking snapshots in background with small time perturbations to
> increase the likelihood of hitting the race condition at just the right time.
> It uses leveldb's infrastructure for the mmapping, but it shouldn't be
> too hard to adapt it so that it doesn't.
> 
> > So my test program creates an 8GB file in chunks of 1MB each.
> 
> That's probably too large a chunk to write at a time.  The bug is
> exercised with writes slightly smaller than a single page (although
> straddling across two consecutive pages).
> 
> This half-baked test program (hereby provided under the terms of the GNU
> GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
> will be performed with write()s, another that will get the same data
> appended with leveldb's mmap-based output interface.  Random block
> sizes, as well as milli and microsecond timing perturbations, are read
> from /dev/urandom, and the rest of the output buffer is filled with
> (char)1.
> 
> The test that actually failed (on the first try, after some other
> variations that didn't fail) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where there
> should have been nonzero data, followed by nonzero data on the following
> page!  That was within snapshots, not in the main subvol, but hopefully
> it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute.  We need
some way to synchronize the leveldb with snapshotting because the
snapshot is basically the same thing as a crash from a db point of view.

Corrupting the main database file is a much different (and bigger)
problem.

-chris


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 12:07                   ` Chris Mason
@ 2013-03-22 14:17                       ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-22 14:17 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:

> Are you using compression in btrfs or just in leveldb?

btrfs lzo compression.

> I'd like to take snapshots out of the picture for a minute.

That's understandable, I guess, but I don't know that anyone has ever
got the problem without snapshots.  I mean, even when the master copy of
the database got corrupted, snapshots of the subvol containing it were
being taken every now and again, because that's the way ceph works.
Even back when I noticed corruption of firefox _CACHE_* files, snapshots
taken for archival were involved.  So, unless the program happens to
trigger the problem with the -DNOSNAPS option about as easily as it did
without it, I guess we may not have a choice but to keep snapshots in
the picture.

> We need some way to synchronize the leveldb with snapshotting

I purposefully refrained from doing that, because AFAICT ceph doesn't do
that.  Once I failed to trigger the problem with Sync calls, and
determined ceph only syncs the leveldb logs before taking its snapshots,
I went without syncing and finally succeeded in triggering the bug in
snapshots by simulating very similar snapshotting and mmapping
conditions to those generated by ceph.  I haven't managed to trigger the
corruption of the master subvol yet with the test program, but I already
knew its corruption didn't occur as often as that of the snapshots, and
since it smells like two slightly different symptoms of the same bug, I
decided to leave the test program at that.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 14:17                       ` Alexandre Oliva
  (?)
@ 2013-03-22 14:26                       ` Chris Mason
  2013-03-22 17:06                         ` Samuel Just
                                           ` (2 more replies)
  -1 siblings, 3 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-22 14:26 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Alexandre Oliva (2013-03-22 10:17:30)
> On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:
> 
> > Are you using compression in btrfs or just in leveldb?
> 
> btrfs lzo compression.

Perfect, I'll focus on that part of things.

> 
> > I'd like to take snapshots out of the picture for a minute.
> 
> That's understandable, I guess, but I don't know that anyone has ever
> got the problem without snapshots.  I mean, even when the master copy of
> the database got corrupted, snapshots of the subvol containing it were
> being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a
database file the db must participate.  If it doesn't, it really is the
same effect as crashing the box.

Something is definitely broken if we're corrupting the source files
(either with or without snapshots), but avoiding incomplete writes in
the snapshot files requires synchronization with the db.

-chris

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 14:26                       ` Chris Mason
@ 2013-03-22 17:06                         ` Samuel Just
  2013-03-22 17:12                           ` Chris Mason
  2013-03-22 17:08                         ` David Sterba
  2013-03-22 17:18                         ` Sage Weil
  2 siblings, 1 reply; 42+ messages in thread
From: Samuel Just @ 2013-03-22 17:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: Alexandre Oliva, linux-btrfs, ceph-devel

Incomplete writes for leveldb should just result in lost updates, not
corruption.  Also, we do stop writes before the snapshot is initiated
so there should be no in-progress writes to leveldb other than leveldb
compaction (though that might be something to investigate).
-Sam

On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason <clmason@fusionio.com> wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
>> On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:
>>
>> > Are you using compression in btrfs or just in leveldb?
>>
>> btrfs lzo compression.
>
> Perfect, I'll focus on that part of things.
>
>>
>> > I'd like to take snapshots out of the picture for a minute.
>>
>> That's understandable, I guess, but I don't know that anyone has ever
>> got the problem without snapshots.  I mean, even when the master copy of
>> the database got corrupted, snapshots of the subvol containing it were
>> being taken every now and again, because that's the way ceph works.
>
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate.  If it doesn't, it really is the
> same effect as crashing the box.
>
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 14:26                       ` Chris Mason
  2013-03-22 17:06                         ` Samuel Just
@ 2013-03-22 17:08                         ` David Sterba
  2013-03-23  9:48                             ` Alexandre Oliva
  2013-03-22 17:18                         ` Sage Weil
  2 siblings, 1 reply; 42+ messages in thread
From: David Sterba @ 2013-03-22 17:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: Alexandre Oliva, linux-btrfs, ceph-devel

On Fri, Mar 22, 2013 at 10:26:59AM -0400, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:
> > 
> > > Are you using compression in btrfs or just in leveldb?
> > 
> > btrfs lzo compression.
> 
> Perfect, I'll focus on that part of things.

> > > I'd like to take snapshots out of the picture for a minute.

I've reproduced this without compression, with autodefrag on. The test
was using snapshots (i.e. the unmodified version) and ended with

1087 blocks, 4316779 total size
snaptest.268/ca snaptest.268/db differ: char 4245170, line 16

after a few minutes.

Before that, I was running the NOSNAPS mode for many minutes (up to 50k
rounds) without a reported problem.

There was the same 'make clean && make -j 32' kernel compilation running
in parallel; the box has 8 CPUs and 4GB of RAM.  Watching 'free' showed the
memory going up to a few gigs and down to ~130MB.


david

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 17:06                         ` Samuel Just
@ 2013-03-22 17:12                           ` Chris Mason
  2013-03-23  9:47                               ` Alexandre Oliva
  0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2013-03-22 17:12 UTC (permalink / raw)
  To: Samuel Just; +Cc: Alexandre Oliva, linux-btrfs, ceph-devel

In this case, I think Alexandre is scanning for zeros in the file.  The
incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
> Incomplete writes for leveldb should just result in lost updates, not
> corruption.  Also, we do stop writes before the snapshot is initiated
> so there should be no in-progress writes to leveldb other than leveldb
> compaction (though that might be something to investigate).
> -Sam
> 
> On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason <clmason@fusionio.com> wrote:
> > Quoting Alexandre Oliva (2013-03-22 10:17:30)
> >> On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:
> >>
> >> > Are you using compression in btrfs or just in leveldb?
> >>
> >> btrfs lzo compression.
> >
> > Perfect, I'll focus on that part of things.
> >
> >>
> >> > I'd like to take snapshots out of the picture for a minute.
> >>
> >> That's understandable, I guess, but I don't know that anyone has ever
> >> got the problem without snapshots.  I mean, even when the master copy of
> >> the database got corrupted, snapshots of the subvol containing it were
> >> being taken every now and again, because that's the way ceph works.
> >
> > Hopefully Sage can comment, but the basic idea is that if you snapshot a
> > database file the db must participate.  If it doesn't, it really is the
> > same effect as crashing the box.
> >
> > Something is definitely broken if we're corrupting the source files
> > (either with or without snapshots), but avoiding incomplete writes in
> > the snapshot files requires synchronization with the db.
> >
> > -chris
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 14:26                       ` Chris Mason
  2013-03-22 17:06                         ` Samuel Just
  2013-03-22 17:08                         ` David Sterba
@ 2013-03-22 17:18                         ` Sage Weil
  2 siblings, 0 replies; 42+ messages in thread
From: Sage Weil @ 2013-03-22 17:18 UTC (permalink / raw)
  To: Chris Mason; +Cc: Alexandre Oliva, linux-btrfs, ceph-devel

On Fri, 22 Mar 2013, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:
> > 
> > > Are you using compression in btrfs or just in leveldb?
> > 
> > btrfs lzo compression.
> 
> Perfect, I'll focus on that part of things.
> 
> > 
> > > I'd like to take snapshots out of the picture for a minute.
> > 
> > That's understandable, I guess, but I don't know that anyone has ever
> > got the problem without snapshots.  I mean, even when the master copy of
> > the database got corrupted, snapshots of the subvol containing it were
> > being taken every now and again, because that's the way ceph works.
> 
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate.  If it doesn't, it really is the
> same effect as crashing the box.
> 
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.

In this case, we quiesce write activity, call leveldb's sync(), take the 
snapshot, and then continue.

(FWIW, this isn't the first time we've heard about leveldb corruption, but 
in each case we've looked into, the user had btrfs compression 
enabled... so I suspect that's the right avenue of investigation!)

sage

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-18 21:14 ` Alexandre Oliva
                   ` (2 preceding siblings ...)
  (?)
@ 2013-03-22 18:07 ` Chris Mason
  2013-03-22 20:31   ` Chris Mason
  -1 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2013-03-22 18:07 UTC (permalink / raw)
  To: Alexandre Oliva, linux-btrfs; +Cc: ceph-devel

[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to
trigger problems.  With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps.  Each step ftruncates
the file larger, and then mmaps a region for writing.  It dirties some
unaligned bytes (a little more than 8K), and then munmaps.

Then a verify stage goes back through the file to make sure the data we
wrote is really there.  I'm using a simple rotating pattern of chars
that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)
/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)

#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}

	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;

	memset(file_buf, val, CHUNK_SIZE);

	extra = pos & 4095;
	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);

	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ,
			 MAP_SHARED, fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr, "%s: bad val %x wanted %x offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}

	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		/* posix_memalign returns the error code instead of setting errno */
		fprintf(stderr, "cannot allocate memory: %s\n", strerror(ret));
		exit(1);
	}

	file_buf[0] = 0;

	file_name = av[1];

	fprintf(stderr, "running test on %s\n", file_name);

	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 18:07 ` Chris Mason
@ 2013-03-22 20:31   ` Chris Mason
  2013-03-26  0:08     ` Chris Mason
  0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2013-03-22 20:31 UTC (permalink / raw)
  To: Chris Mason, Alexandre Oliva, linux-btrfs; +Cc: ceph-devel

Quoting Chris Mason (2013-03-22 14:07:05)
> [ mmap corruptions with leveldb and btrfs compression ]
> 
> I ran this a number of times with compression off and wasn't able to
> trigger problems.  With compress=lzo, I see errors on every run.
> 
> Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
> Run: ./mmap-trunc file_name
> 
> The basic idea is to create a 256MB file in steps.  Each step ftruncates
> the file larger, and then mmaps a region for writing.  It dirties some
> unaligned bytes (a little more than 8K), and then munmaps.
> 
> Then a verify stage goes back through the file to make sure the data we
> wrote is really there.  I'm using a simple rotating pattern of chars
> that compress very well.

Going through the code here, when I change the test to truncate once in
the very beginning, I still get errors.  So, it isn't an interaction
between mmap and truncate.  It must be a problem between lzo and mmap.

-chris

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 17:12                           ` Chris Mason
@ 2013-03-23  9:47                               ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-23  9:47 UTC (permalink / raw)
  To: Chris Mason; +Cc: Samuel Just, linux-btrfs, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:

> Quoting Samuel Just (2013-03-22 13:06:41)
>> Incomplete writes for leveldb should just result in lost updates, not
>> corruption.

> In this case, I think Alexandre is scanning for zeros in the file.

Yup, the symptom is zeros at the end of a page, with nonzeros on the
subsequent page, which indicates that the writes to the previous page
were dropped.

What I actually do is to iterate over the entire database, which will
error out when the block header is found to be corrupted.  I use this
program I wrote (also hereby provided under GNU GPLv3+) to check the
database for corruption.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: dbck.cc --]
[-- Type: text/x-c++src, Size: 3237 bytes --]

#include <assert.h>
#include <string.h>	/* strcmp */
#include <iostream>
#include "leveldb/db.h"

int main(int argc, char *argv[]) {
  bool paranoid = false;
  bool dump = false;
  bool repair = false;
  bool quiet = false;
  int i = 0;
  int errors = 0;

  if (argc == 1) {
  usage:
    std::cout << "usage: [flags] dbname [flags] ..." << std::endl
	      << "-d --dump     dump database contents" << std::endl
	      << "-r --repair   repair database" << std::endl
	      << "-p --paranoid enable paranoid mode" << std::endl
	      << "-l --lax      disable paranoid mode (default)" << std::endl
	      << "-q --quiet    enable quiet mode" << std::endl
	      << "-v --verbose  disable quiet mode (default)" << std::endl
	      << "-h --help     show this message and exit" << std::endl
	      << "dbname        check, dump and repair" << std::endl
	      << std::endl
	      << "exit status is the number of errors" << std::endl;
    return errors;
  }

  for (i++; i < argc; i++) {
    if (argv[i][0] == '-') {
      if (strcmp (argv[i], "--dump") == 0
	  || strcmp (argv[i], "-d") == 0)
	dump = true;
      else if (strcmp (argv[i], "--repair") == 0
	       || strcmp (argv[i], "-r") == 0)
	repair = true;
      else if (strcmp (argv[i], "--paranoid") == 0
	       || strcmp (argv[i], "-p") == 0)
	paranoid = true;
      else if (strcmp (argv[i], "--lax") == 0
	       || strcmp (argv[i], "-l") == 0)
	paranoid = false;
      else if (strcmp (argv[i], "--quiet") == 0
	       || strcmp (argv[i], "-q") == 0)
	quiet = true;
      else if (strcmp (argv[i], "--verbose") == 0
	       || strcmp (argv[i], "-v") == 0)
	quiet = false;
      else if (strcmp (argv[i], "--help") == 0
	       || strcmp (argv[i], "-h") == 0)
	goto usage;
      else {
	std::cerr << "unrecognized option: " << argv[i] << std::endl;
	goto usage;
      }
    } else {
      if (!quiet)
	std::cout << argv[i] << std::endl;

      leveldb::DB* db;
      leveldb::Options options;
      options.paranoid_checks = paranoid;
      leveldb::Status status = leveldb::DB::Open(options, argv[i], &db);
      bool bad = false;

      if (!status.ok()) {
	std::cerr << status.ToString() << std::endl;
	bad = true;
      } else {
	leveldb::ReadOptions rdopt;
	rdopt.verify_checksums = paranoid;
	rdopt.fill_cache = false;
	leveldb::Iterator* it = db->NewIterator(rdopt);
	int count = 0;
	try {
	  for (it->SeekToFirst(); it->Valid(); it->Next()) {
	    count++;
	    if (dump)
	      std::cout << it->key().ToString() << ": "
			<< it->value().ToString() << std::endl;
	    else if (!quiet && count % 1000 == 0)
	      std::cout << count << " entries\r" << std::flush;
	  }
	  if (!it->status().ok()) {
	    std::cerr << it->status().ToString() << std::endl;
	    bad = true;
	  }
	} catch (...) {
	  std::cerr << "caught an exception" << std::endl;
	}
	delete it;
	if (!quiet)
	  std::cout << count << " entries" << std::endl;
      }

      delete db;

      if (bad) {
	errors++;
	if (repair) {
	  if (!quiet)
	    std::cout << "repairing..." << std::endl;
	  status = RepairDB(argv[i], options);
	  if (!status.ok()) {
	    std::cerr << status.ToString() << std::endl;
	    errors++;
	  }
	} else if (!quiet)
	  std::cout << "use --repair to repair" << std::endl;
      }
    }
  }

  return errors;
}

[-- Attachment #3: Type: text/plain, Size: 258 bytes --]



-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
@ 2013-03-23  9:47                               ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-23  9:47 UTC (permalink / raw)
  To: Chris Mason; +Cc: Samuel Just, linux-btrfs, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:

> Quoting Samuel Just (2013-03-22 13:06:41)
>> Incomplete writes for leveldb should just result in lost updates, not
>> corruption.

> In this case, I think Alexandre is scanning for zeros in the file.

Yup, the symptom is zeros at the end of a page, with nonzeros on the
subsequent page, which indicates that the writes to the previous page
were dropped.

What I actually do is to iterate over the entire database, which will
error out when the block header is found to be corrupted.  I use this
program I wrote (also hereby provided under GNU GPLv3+) to check the
database for corruption.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: dbck.cc --]
[-- Type: text/x-c++src, Size: 3237 bytes --]

#include <assert.h>
#include <iostream>
#include "leveldb/db.h"

int main(int argc, char *argv[]) {
  bool paranoid = false;
  bool dump = false;
  bool repair = false;
  bool quiet = false;
  int i = 0;
  int errors = 0;

  if (argc == 1) {
  usage:
    std::cout << "usage: [flags] dbname [flags] ..." << std::endl
	      << "-d --dump     dump database contents" << std::endl
	      << "-r --repair   repair database" << std::endl
	      << "-p --paranoid enable paranoid mode" << std::endl
	      << "-l --lax      disable paranoid mode (default)" << std::endl
	      << "-q --quiet    enable quiet mode" << std::endl
	      << "-v --verbose  disable quiet mode (default)" << std::endl
	      << "-h --help     show this message and exit" << std::endl
	      << "dbname        check, dump and repair" << std::endl
	      << std::endl
	      << "exit status is the number of errors" << std::endl;
    return errors;
  }

  for (i++; i < argc; i++) {
    if (argv[i][0] == '-') {
      if (strcmp (argv[i], "--dump") == 0
	  || strcmp (argv[i], "-d") == 0)
	dump = true;
      else if (strcmp (argv[i], "--repair") == 0
	       || strcmp (argv[i], "-r") == 0)
	repair = true;
      else if (strcmp (argv[i], "--paranoid") == 0
	       || strcmp (argv[i], "-p") == 0)
	paranoid = true;
      else if (strcmp (argv[i], "--lax") == 0
	       || strcmp (argv[i], "-l") == 0)
	paranoid = false;
      else if (strcmp (argv[i], "--quiet") == 0
	       || strcmp (argv[i], "-q") == 0)
	quiet = true;
      else if (strcmp (argv[i], "--verbose") == 0
	       || strcmp (argv[i], "-v") == 0)
	quiet = false;
      else if (strcmp (argv[i], "--help") == 0
	       || strcmp (argv[i], "-h") == 0)
	goto usage;
      else {
	std::cerr << "unrecognized option: " << argv[i] << std::endl;
	goto usage;
      }
    } else {
      if (!quiet)
	std::cout << argv[i] << std::endl;

      leveldb::DB* db;
      leveldb::Options options;
      options.paranoid_checks = paranoid;
      leveldb::Status status = leveldb::DB::Open(options, argv[i], &db);
      bool bad = false;

      if (!status.ok()) {
	std::cerr << status.ToString() << std::endl;
	bad = true;
      } else {
	leveldb::ReadOptions rdopt;
	rdopt.verify_checksums = paranoid;
	rdopt.fill_cache = false;
	leveldb::Iterator* it = db->NewIterator(rdopt);
	int count = 0;
	try {
	  for (it->SeekToFirst(); it->Valid(); it->Next()) {
	    count++;
	    if (dump)
	      std::cout << it->key().ToString() << ": "
			<< it->value().ToString() << std::endl;
	    else if (!quiet && count % 1000 == 0)
	      std::cout << count << " entries\r" << std::flush;
	  }
	  if (!it->status().ok()) {
	    std::cerr << it->status().ToString() << std::endl;
	    bad = true;
	  }
	} catch (...) {
	  std::cerr << "caught an exception" << std::endl;
	}
	delete it;
	if (!quiet)
	  std::cout << count << " entries" << std::endl;
      }

      delete db;

      if (bad) {
	errors++;
	if (repair) {
	  if (!quiet)
	    std::cout << "repairing..." << std::endl;
	  status = leveldb::RepairDB(argv[i], options);
	  if (!status.ok()) {
	    std::cerr << status.ToString() << std::endl;
	    errors++;
	  }
	} else if (!quiet)
	  std::cout << "use --repair to repair" << std::endl;
      }
    }
  }

  return errors;
}

[-- Attachment #3: Type: text/plain, Size: 258 bytes --]



-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 17:08                         ` David Sterba
@ 2013-03-23  9:48                             ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-23  9:48 UTC (permalink / raw)
  To: dsterba; +Cc: Chris Mason, linux-btrfs, ceph-devel

On Mar 22, 2013, David Sterba <dsterba@suse.cz> wrote:

> I've reproduced this without compression, with autodefrag on.

I don't have autodefrag on, unless it's enabled by default on 3.8.3 or
on the for-linus tree.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-23  9:48                             ` Alexandre Oliva
  (?)
@ 2013-03-25 15:33                             ` David Sterba
  -1 siblings, 0 replies; 42+ messages in thread
From: David Sterba @ 2013-03-25 15:33 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Chris Mason, linux-btrfs, ceph-devel

On Sat, Mar 23, 2013 at 06:48:38AM -0300, Alexandre Oliva wrote:
> On Mar 22, 2013, David Sterba <dsterba@suse.cz> wrote:
> 
> > I've reproduced this without compression, with autodefrag on.
> 
> I don't have autodefrag on, unless it's enabled by default on 3.8.3 or
> on the for-linus tree.

It's not on by default, but it seemed that I reproduced the same issue
without compression (and autodefrag was the difference).

david

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-22 20:31   ` Chris Mason
@ 2013-03-26  0:08     ` Chris Mason
  2013-03-29  9:56         ` Alexandre Oliva
  0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2013-03-26  0:08 UTC (permalink / raw)
  To: Chris Mason, Alexandre Oliva, linux-btrfs; +Cc: ceph-devel

Quoting Chris Mason (2013-03-22 16:31:42)
> Going through the code here, when I change the test to truncate once in
> the very beginning, I still get errors.  So, it isn't an interaction
> between mmap and truncate.  It must be a problem between lzo and mmap.

With compression off, we use clear_page_dirty_for_io to create a wall
between applications using mmap and our crc code.  Once we call
clear_page_dirty_for_io, it means we're in the process of writing the
page and anyone using mmap must wait (by calling page_mkwrite) before
they are allowed to change the page.

We use it with compression on as well, but it only ends up protecting
the crcs.  It gets called after the compression is done, which allows
applications to race in and modify the pages while we are compressing
them.

This patch changes our compression code to call clear_page_dirty_for_io
before we compress, and then redirty the pages if the compression fails.

Alexandre, many thanks for tracking this down into a well defined use
case. 

-chris

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f173c5a..cdee391 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1257,6 +1257,39 @@ int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 				GFP_NOFS);
 }
 
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		account_page_redirty(page);
+		__set_page_dirty_nobuffers(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
 /*
  * helper function to set both pages and extents in the tree writeback
  */
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6068a19..258c921 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -325,6 +325,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
 		      unsigned long *map_len);
 int extent_range_uptodate(struct extent_io_tree *tree,
 			  u64 start, u64 end);
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 int extent_clear_unlock_delalloc(struct inode *inode,
 				struct extent_io_tree *tree,
 				u64 start, u64 end, struct page *locked_page,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca1b767..88d4a18 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -353,6 +353,7 @@ static noinline int compress_file_range(struct inode *inode,
 	int i;
 	int will_compress;
 	int compress_type = root->fs_info->compress_type;
+	int redirty = 0;
 
 	/* if this is a small write inside eof, kick off a defrag */
 	if ((end - start + 1) < 16 * 1024 &&
@@ -415,6 +416,8 @@ again:
 		if (BTRFS_I(inode)->force_compress)
 			compress_type = BTRFS_I(inode)->force_compress;
 
+		extent_range_clear_dirty_for_io(inode, start, end);
+		redirty = 1;
 		ret = btrfs_compress_pages(compress_type,
 					   inode->i_mapping, start,
 					   total_compressed, pages,
@@ -554,6 +557,8 @@ cleanup_and_bail_uncompressed:
 			__set_page_dirty_nobuffers(locked_page);
 			/* unlocked later on in the async handlers */
 		}
+		if (redirty)
+			extent_range_redirty_for_io(inode, start, end);
 		add_async_extent(async_cow, start, end - start + 1,
 				 0, NULL, 0, BTRFS_COMPRESS_NONE);
 		*num_added += 1;

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-26  0:08     ` Chris Mason
@ 2013-03-29  9:56         ` Alexandre Oliva
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre Oliva @ 2013-03-29  9:56 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, ceph-devel

On Mar 25, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> This patch changes our compression code to call clear_page_dirty_for_io
> before we compress, and then redirty the pages if the compression fails.

> Alexandre, many thanks for tracking this down into a well defined use
> case. 

Thanks for the patch, it's run flawlessly since I started gradually
rolling it out onto my ceph OSDs on Monday!  Ship it! :-)

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: corruption of active mmapped files in btrfs snapshots
  2013-03-29  9:56         ` Alexandre Oliva
  (?)
@ 2013-03-29 11:35         ` Chris Mason
  -1 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2013-03-29 11:35 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs, ceph-devel

Quoting Alexandre Oliva (2013-03-29 05:56:06)
> On Mar 25, 2013, Chris Mason <chris.mason@fusionio.com> wrote:
> 
> > This patch changes our compression code to call clear_page_dirty_for_io
> > before we compress, and then redirty the pages if the compression fails.
> 
> > Alexandre, many thanks for tracking this down into a well defined use
> > case. 
> 
> Thanks for the patch, it's run flawlessly since I started gradually
> rolling it out onto my ceph OSDs on Monday!  Ship it! :-)

Awesome, it's in my for-linus and will go out to Linus today.

-chris


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2013-03-29 11:35 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-18 21:14 corruption of active mmapped files in btrfs snapshots Alexandre Oliva
2013-03-18 21:14 ` Alexandre Oliva
2013-03-18 22:43 ` Alexandre Oliva
2013-03-18 22:52 ` Chris Mason
2013-03-18 22:52   ` Chris Mason
2013-03-19  5:20   ` Alexandre Oliva
2013-03-19  5:20     ` Alexandre Oliva
2013-03-19 12:09     ` Chris Mason
2013-03-19 17:29       ` Sage Weil
2013-03-19 19:26         ` Alexandre Oliva
2013-03-19 19:26           ` Alexandre Oliva
2013-03-19 19:26       ` Alexandre Oliva
2013-03-19 19:26         ` Alexandre Oliva
2013-03-20  1:58         ` Alexandre Oliva
2013-03-20  1:58           ` Alexandre Oliva
2013-03-21  7:14           ` Alexandre Oliva
2013-03-21  7:14             ` Alexandre Oliva
2013-03-21 18:06             ` Chris Mason
2013-03-21 18:06               ` Chris Mason
2013-03-21 23:06               ` Chris Mason
2013-03-21 23:06                 ` Chris Mason
2013-03-22  5:27                 ` Alexandre Oliva
2013-03-22  5:27                   ` Alexandre Oliva
2013-03-22 12:07                   ` Chris Mason
2013-03-22 14:17                     ` Alexandre Oliva
2013-03-22 14:17                       ` Alexandre Oliva
2013-03-22 14:26                       ` Chris Mason
2013-03-22 17:06                         ` Samuel Just
2013-03-22 17:12                           ` Chris Mason
2013-03-23  9:47                             ` Alexandre Oliva
2013-03-23  9:47                               ` Alexandre Oliva
2013-03-22 17:08                         ` David Sterba
2013-03-23  9:48                           ` Alexandre Oliva
2013-03-23  9:48                             ` Alexandre Oliva
2013-03-25 15:33                             ` David Sterba
2013-03-22 17:18                         ` Sage Weil
2013-03-22 18:07 ` Chris Mason
2013-03-22 20:31   ` Chris Mason
2013-03-26  0:08     ` Chris Mason
2013-03-29  9:56       ` Alexandre Oliva
2013-03-29  9:56         ` Alexandre Oliva
2013-03-29 11:35         ` Chris Mason
