ceph-devel.vger.kernel.org archive mirror
* Hole punch races in Ceph
@ 2021-04-22 11:15 Jan Kara
  2021-04-22 11:43 ` Jeff Layton
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2021-04-22 11:15 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ilya Dryomov, ceph-devel

Hello,

I'm looking into how Ceph protects against races between page faults and
hole punching (I'm unifying protection against this kind of race across
filesystems), and AFAICT it does not. What I have in mind in particular is a
race like:

CPU1					CPU2

ceph_fallocate()
  ...
  ceph_zero_pagecache_range()
					ceph_filemap_fault()
					  faults in page in the range being
					  punched
  ceph_zero_objects()

And now we have a page in the punched range with invalid data. If
ceph_page_mkwrite() manages to squeeze in at the right moment, we might
even associate invalid metadata with the page, I'd assume (though I'm not
sure whether this would be harmful). Am I missing something?
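
To make the window explicit, here is a rough sketch of the ordering as I
read it (not the real fs/ceph/file.c code, and the prototypes of the two
helpers are only approximated):

#include <linux/fs.h>

/* Approximated prototypes - in reality these are static helpers in
 * fs/ceph/file.c. */
void ceph_zero_pagecache_range(struct inode *inode, loff_t offset, loff_t size);
int ceph_zero_objects(struct inode *inode, loff_t offset, loff_t length);

/* Simplified punch-hole path; nothing serializes page faults against
 * the window between the two steps. */
static long ceph_punch_hole_sketch(struct inode *inode, loff_t offset,
				   loff_t length)
{
	/* 1) Drop the cached pages covering the range. */
	ceph_zero_pagecache_range(inode, offset, length);

	/*
	 * <-- window: ceph_filemap_fault() can run here, read the still
	 * unzeroed object data back from the OSDs and re-instantiate a
	 * page with stale contents.
	 */

	/* 2) Zero / punch out the backing RADOS objects. */
	return ceph_zero_objects(inode, offset, length);
}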

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Hole punch races in Ceph
  2021-04-22 11:15 Hole punch races in Ceph Jan Kara
@ 2021-04-22 11:43 ` Jeff Layton
  2021-04-22 12:02   ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff Layton @ 2021-04-22 11:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ilya Dryomov, ceph-devel

On Thu, 2021-04-22 at 13:15 +0200, Jan Kara wrote:
> Hello,
> 
> I'm looking into how Ceph protects against races between page faults and
> hole punching (I'm unifying protection against this kind of race across
> filesystems), and AFAICT it does not. What I have in mind in particular is a
> race like:
> 
> CPU1					CPU2
> 
> ceph_fallocate()
>   ...
>   ceph_zero_pagecache_range()
> 					ceph_filemap_fault()
> 					  faults in page in the range being
> 					  punched
>   ceph_zero_objects()
> 
> And now we have a page in the punched range with invalid data. If
> ceph_page_mkwrite() manages to squeeze in at the right moment, we might
> even associate invalid metadata with the page, I'd assume (though I'm not
> sure whether this would be harmful). Am I missing something?
> 
> 								Honza

No, I don't think you're missing anything. If ceph_page_mkwrite happens
to get called at an inopportune time then we'd probably end up writing
that page back into the punched range too. What would be the best way to
fix this, do you think?

One idea:

We could lock the pages we're planning to punch out first, then
zero/punch out the objects on the OSDs, and then do the hole punch in
the pagecache? Would that be sufficient to close the race?
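
Roughly like this, just to make the ordering concrete (the two page-range
helpers below are made up for illustration, and this glosses over error
handling and partial pages):

/* Sketch of the idea only. */
static int ceph_punch_hole_idea(struct inode *inode, loff_t off, loff_t len)
{
	int ret;

	/* 1) Pin and lock the pagecache pages in the range so that
	 *    ceph_filemap_fault()/ceph_page_mkwrite() block on them. */
	lock_range_pages(inode->i_mapping, off, len);		/* hypothetical */

	/* 2) Zero/punch out the objects on the OSDs while the pages
	 *    are held locked. */
	ret = ceph_zero_objects(inode, off, len);

	/* 3) Only now punch the hole in the pagecache and unlock. */
	truncate_and_unlock_range_pages(inode->i_mapping, off, len);	/* hypothetical */

	return ret;
}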

-- 
Jeff Layton <jlayton@kernel.org>



* Re: Hole punch races in Ceph
  2021-04-22 11:43 ` Jeff Layton
@ 2021-04-22 12:02   ` Jan Kara
  2021-04-22 12:05     ` Jeff Layton
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2021-04-22 12:02 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Jan Kara, Ilya Dryomov, ceph-devel

On Thu 22-04-21 07:43:16, Jeff Layton wrote:
> On Thu, 2021-04-22 at 13:15 +0200, Jan Kara wrote:
> > Hello,
> > 
> > I'm looking into how Ceph protects against races between page faults and
> > hole punching (I'm unifying protection against this kind of race across
> > filesystems), and AFAICT it does not. What I have in mind in particular is a
> > race like:
> > 
> > CPU1					CPU2
> > 
> > ceph_fallocate()
> >   ...
> >   ceph_zero_pagecache_range()
> > 					ceph_filemap_fault()
> > 					  faults in page in the range being
> > 					  punched
> >   ceph_zero_objects()
> > 
> > And now we have a page in the punched range with invalid data. If
> > ceph_page_mkwrite() manages to squeeze in at the right moment, we might
> > even associate invalid metadata with the page, I'd assume (though I'm not
> > sure whether this would be harmful). Am I missing something?
> > 
> > 								Honza
> 
> No, I don't think you're missing anything. If ceph_page_mkwrite happens
> to get called at an inopportune time then we'd probably end up writing
> that page back into the punched range too. What would be the best way to
> fix this, do you think?
> 
> One idea:
> 
> We could lock the pages we're planning to punch out first, then
> zero/punch out the objects on the OSDs, and then do the hole punch in
> the pagecache? Would that be sufficient to close the race?

Yes, that would be sufficient, but it is very awkward, e.g. if you want to
punch out 4GB of data that needn't even be in the page cache. But all
filesystems have this problem - e.g. ext4, xfs, etc. already have their own
private locks to avoid races like this. I'm now working on lifting the
fs-private solutions into a generic one, so I'll fix Ceph along the way as
well. I was just making sure I'm not missing some other protection
mechanism in Ceph.
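
Roughly, the shape I'm aiming for is a per-mapping rwsem that hole punching
takes exclusively around both the pagecache invalidation and the backend
zeroing, and that the fault path takes shared while it instantiates pages.
Purely as an illustration (the lock name and placement are not final, and
struct address_space has no such field today):

/* Hole punching side: exclude faults for the whole operation. */
static long punch_hole_with_lock(struct inode *inode, loff_t offset,
				 loff_t length)
{
	long ret;

	down_write(&inode->i_mapping->invalidate_lock);	/* illustrative name */
	ceph_zero_pagecache_range(inode, offset, length);
	ret = ceph_zero_objects(inode, offset, length);
	up_write(&inode->i_mapping->invalidate_lock);

	return ret;
}

/* Fault side: take the same lock shared while filling the pagecache. */
static vm_fault_t fault_with_lock(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	vm_fault_t ret;

	down_read(&inode->i_mapping->invalidate_lock);	/* illustrative name */
	ret = filemap_fault(vmf);
	up_read(&inode->i_mapping->invalidate_lock);

	return ret;
}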

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Hole punch races in Ceph
  2021-04-22 12:02   ` Jan Kara
@ 2021-04-22 12:05     ` Jeff Layton
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff Layton @ 2021-04-22 12:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ilya Dryomov, ceph-devel

On Thu, 2021-04-22 at 14:02 +0200, Jan Kara wrote:
> On Thu 22-04-21 07:43:16, Jeff Layton wrote:
> > On Thu, 2021-04-22 at 13:15 +0200, Jan Kara wrote:
> > > Hello,
> > > 
> > > I'm looking into how Ceph protects against races between page faults and
> > > hole punching (I'm unifying protection against this kind of race across
> > > filesystems), and AFAICT it does not. What I have in mind in particular is a
> > > race like:
> > > 
> > > CPU1					CPU2
> > > 
> > > ceph_fallocate()
> > >   ...
> > >   ceph_zero_pagecache_range()
> > > 					ceph_filemap_fault()
> > > 					  faults in page in the range being
> > > 					  punched
> > >   ceph_zero_objects()
> > > 
> > > And now we have a page in the punched range with invalid data. If
> > > ceph_page_mkwrite() manages to squeeze in at the right moment, we might
> > > even associate invalid metadata with the page, I'd assume (though I'm not
> > > sure whether this would be harmful). Am I missing something?
> > > 
> > > 								Honza
> > 
> > No, I don't think you're missing anything. If ceph_page_mkwrite happens
> > to get called at an inopportune time then we'd probably end up writing
> > that page back into the punched range too. What would be the best way to
> > fix this, do you think?
> > 
> > One idea:
> > 
> > We could lock the pages we're planning to punch out first, then
> > zero/punch out the objects on the OSDs, and then do the hole punch in
> > the pagecache? Would that be sufficient to close the race?
> 
> Yes, that would be sufficient, but it is very awkward, e.g. if you want to
> punch out 4GB of data that needn't even be in the page cache. But all
> filesystems have this problem - e.g. ext4, xfs, etc. already have their own
> private locks to avoid races like this. I'm now working on lifting the
> fs-private solutions into a generic one, so I'll fix Ceph along the way as
> well. I was just making sure I'm not missing some other protection
> mechanism in Ceph.
> 

Even better! I'll keep an eye out for your patches.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


