[10/13] PCI: revoke mappings like devmem

Message ID 20201007164426.1812530-11-daniel.vetter@ffwll.ch
State New, archived
Series
  • follow_pfn and other iomap races

Commit Message

Daniel Vetter Oct. 7, 2020, 4:44 p.m. UTC
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
the region") /dev/mem zaps PTEs when the kernel requests exclusive
access to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
the default for all driver uses.

Except there are two more ways to access PCI BARs: sysfs and proc mmap
support. Let's plug that hole.

For revoke_devmem() to work we need to link our vma into the same
address_space, with consistent vma->vm_pgoff. ->pgoff is already
adjusted, because that's how (io_)remap_pfn_range works, but for the
mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
at ->open time, but that's a bit tricky here with all the entry points
and arch code. So instead create a fake file and adjust vma->vm_file.

Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that
seems to be a subset of the architectures that support STRICT_DEVMEM,
so we should be good.

The only difference in access checks left is that sysfs PCI mmap does
not check for CAP_SYS_RAWIO. But I think that makes some sense compared
to /dev/mem and proc, where one file gives you access to everything and
no ownership applies.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/char/mem.c     | 16 +++++++++++++++-
 drivers/pci/mmap.c     |  3 +++
 include/linux/ioport.h |  2 ++
 3 files changed, 20 insertions(+), 1 deletion(-)
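
The revoke mechanism the commit message relies on can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: the
model_* structs and functions are hypothetical stand-ins, not the kernel's
struct address_space, struct vm_area_struct or unmap_mapping_range(), and a
4 KiB page size is assumed. It shows why a vma has to be linked into the
shared address_space, with a page offset expressed in physical page frames,
before a revoke can find it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for a vma; not the kernel structure. */
struct model_vma {
	uint64_t pgoff;		/* first page frame covered, like vma->vm_pgoff */
	uint64_t nr_pages;	/* length of the mapping in pages */
	int mapped;		/* 1 while PTEs are present, 0 once zapped */
	struct model_vma *next;
};

/* Hypothetical stand-in for an address_space: a list of linked vmas. */
struct model_address_space {
	struct model_vma *vmas;
};

/* Link a vma into a mapping, as happens when a file is mmap()ed. */
static void model_link_vma(struct model_address_space *as,
			   struct model_vma *vma)
{
	vma->mapped = 1;
	vma->next = as->vmas;
	as->vmas = vma;
}

/*
 * Zap every vma linked to @as that overlaps the byte range
 * [start, start + len). Returns the number of vmas zapped.
 * Only vmas linked to this particular mapping are visited.
 */
static int model_unmap_mapping_range(struct model_address_space *as,
				     uint64_t start, uint64_t len)
{
	uint64_t first, last;
	struct model_vma *vma;
	int zapped = 0;

	assert(len > 0);
	first = start >> MODEL_PAGE_SHIFT;
	last = (start + len - 1) >> MODEL_PAGE_SHIFT;

	for (vma = as->vmas; vma != NULL; vma = vma->next) {
		uint64_t vma_last = vma->pgoff + vma->nr_pages - 1;

		if (vma->mapped && vma->pgoff <= last && vma_last >= first) {
			vma->mapped = 0;
			zapped++;
		}
	}
	return zapped;
}
```

In this model a vma that covers the same page frames but is linked to a
different mapping survives the revoke, which is the hole the patch describes
for the sysfs and proc mmap paths.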

Comments

Bjorn Helgaas Oct. 7, 2020, 6:41 p.m. UTC | #1
Capitalize subject, like other patches in this series and previous
drivers/pci history.

On Wed, Oct 07, 2020 at 06:44:23PM +0200, Daniel Vetter wrote:
> Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> the region") /dev/kmem zaps ptes when the kernel requests exclusive
> acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> the default for all driver uses.
> 
> Except there's two more ways to access pci bars: sysfs and proc mmap
> support. Let's plug that hole.

s/pci/PCI/ in commit logs and comments.

> For revoke_devmem() to work we need to link our vma into the same
> address_space, with consistent vma->vm_pgoff. ->pgoff is already
> adjusted, because that's how (io_)remap_pfn_range works, but for the
> mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> at ->open time, but that's a bit tricky here with all the entry points
> and arch code. So instead create a fake file and adjust vma->vm_file.
> 
> Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that
> seems to be a subset of architectures support STRICT_DEVMEM, so we
> should be good.
> 
> The only difference in access checks left is that sysfs pci mmap does
> not check for CAP_RAWIO. But I think that makes some sense compared to
> /dev/mem and proc, where one file gives you access to everything and
> no ownership applies.

> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -810,6 +810,7 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
>  }
>  
>  static struct inode *devmem_inode;
> +static struct vfsmount *devmem_vfs_mount;
>  
>  #ifdef CONFIG_IO_STRICT_DEVMEM
>  void revoke_devmem(struct resource *res)
> @@ -843,6 +844,20 @@ void revoke_devmem(struct resource *res)
>  
>  	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
>  }
> +
> +struct file *devmem_getfile(void)
> +{
> +	struct file *file;
> +
> +	file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
> +				 O_RDWR, &kmem_fops);
> +	if (IS_ERR(file))
> +		return NULL;
> +
> +	file->f_mapping = devmem_indoe->i_mapping;

"devmem_indoe"?  Obviously not compiled, I guess?

> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -304,8 +304,10 @@ struct resource *request_free_mem_region(struct resource *base,
>  
>  #ifdef CONFIG_IO_STRICT_DEVMEM
>  void revoke_devmem(struct resource *res);
> +struct file *devm_getfile(void);
>  #else
>  static inline void revoke_devmem(struct resource *res) { };
> +static inline struct file *devmem_getfile(void) { return NULL; };

I guess these names are supposed to match?

>  #endif
>  
>  #endif /* __ASSEMBLY__ */
> -- 
> 2.28.0
>
Daniel Vetter Oct. 7, 2020, 7:24 p.m. UTC | #2
On Wed, Oct 7, 2020 at 8:41 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> Capitalize subject, like other patches in this series and previous
> drivers/pci history.
>
> On Wed, Oct 07, 2020 at 06:44:23PM +0200, Daniel Vetter wrote:
> > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > the default for all driver uses.
> >
> > Except there's two more ways to access pci bars: sysfs and proc mmap
> > support. Let's plug that hole.
>
> s/pci/PCI/ in commit logs and comments.
>
> > For revoke_devmem() to work we need to link our vma into the same
> > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > at ->open time, but that's a bit tricky here with all the entry points
> > and arch code. So instead create a fake file and adjust vma->vm_file.
> >
> > Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that
> > seems to be a subset of architectures support STRICT_DEVMEM, so we
> > should be good.
> >
> > The only difference in access checks left is that sysfs pci mmap does
> > not check for CAP_RAWIO. But I think that makes some sense compared to
> > /dev/mem and proc, where one file gives you access to everything and
> > no ownership applies.
>
> > --- a/drivers/char/mem.c
> > +++ b/drivers/char/mem.c
> > @@ -810,6 +810,7 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
> >  }
> >
> >  static struct inode *devmem_inode;
> > +static struct vfsmount *devmem_vfs_mount;
> >
> >  #ifdef CONFIG_IO_STRICT_DEVMEM
> >  void revoke_devmem(struct resource *res)
> > @@ -843,6 +844,20 @@ void revoke_devmem(struct resource *res)
> >
> >       unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> >  }
> > +
> > +struct file *devmem_getfile(void)
> > +{
> > +     struct file *file;
> > +
> > +     file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
> > +                              O_RDWR, &kmem_fops);
> > +     if (IS_ERR(file))
> > +             return NULL;
> > +
> > +     file->f_mapping = devmem_indoe->i_mapping;
>
> "devmem_indoe"?  Obviously not compiled, I guess?

Yeah, apologies, I forgot to compile this with CONFIG_IO_STRICT_DEVMEM
set. The entire series is more of an RFC about the overall problem really; I
also need to figure out how to even test this somehow. I guess there's
nothing really ready-made here?
-Daniel

> > --- a/include/linux/ioport.h
> > +++ b/include/linux/ioport.h
> > @@ -304,8 +304,10 @@ struct resource *request_free_mem_region(struct resource *base,
> >
> >  #ifdef CONFIG_IO_STRICT_DEVMEM
> >  void revoke_devmem(struct resource *res);
> > +struct file *devm_getfile(void);
> >  #else
> >  static inline void revoke_devmem(struct resource *res) { };
> > +static inline struct file *devmem_getfile(void) { return NULL; };
>
> I guess these names are supposed to match?
>
> >  #endif
> >
> >  #endif /* __ASSEMBLY__ */
> > --
> > 2.28.0
> >
Dan Williams Oct. 7, 2020, 7:33 p.m. UTC | #3
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> the region") /dev/kmem zaps ptes when the kernel requests exclusive
> acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> the default for all driver uses.
>
> Except there's two more ways to access pci bars: sysfs and proc mmap
> support. Let's plug that hole.

Ooh, yes, let's.

>
> For revoke_devmem() to work we need to link our vma into the same
> address_space, with consistent vma->vm_pgoff. ->pgoff is already
> adjusted, because that's how (io_)remap_pfn_range works, but for the
> mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> at ->open time, but that's a bit tricky here with all the entry points
> and arch code. So instead create a fake file and adjust vma->vm_file.

I don't think you want to share the devmem inode for this. This should
be based off the sysfs inode, of which I believe there is already only one
instance per resource. In contrast, /dev/mem can have multiple inodes
because anyone can just mknod a new character device file; the same
problem does not exist for sysfs.
Daniel Vetter Oct. 7, 2020, 7:47 p.m. UTC | #4
On Wed, Oct 7, 2020 at 9:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > the default for all driver uses.
> >
> > Except there's two more ways to access pci bars: sysfs and proc mmap
> > support. Let's plug that hole.
>
> Ooh, yes, lets.
>
> > For revoke_devmem() to work we need to link our vma into the same
> > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > at ->open time, but that's a bit tricky here with all the entry points
> > and arch code. So instead create a fake file and adjust vma->vm_file.
>
> I don't think you want to share the devmem inode for this, this should
> be based off the sysfs inode which I believe there is already only one
> instance per resource. In contrast /dev/mem can have multiple inodes
> because anyone can just mknod a new character device file, the same
> problem does not exist for sysfs.

But then I need to find the right one, plus I also need to find the
right one for the procfs side. That gets messy, and I already have no
idea how to really test this. Shared address_space is the same trick
we're using in drm (where we have multiple things all pointing to the
same underlying resources, through different files), and it gets the
job done. So that's why I figured the shared address_space is the
cleaner solution, since then unmap_mapping_range() takes care of
iterating over all vmas for us. I guess I could reimplement that logic
with our own locking and everything in revoke_devmem(), but that feels a
bit silly. But it would also solve the problem of having multiple
different mknods of /dev/mem with different address_spaces behind them.
Also, because of how remap_pfn_range() works, all these vmas do use the
same pgoff already anyway.
-Daniel
Dan Williams Oct. 7, 2020, 10:23 p.m. UTC | #5
On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Wed, Oct 7, 2020 at 9:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > the default for all driver uses.
> > >
> > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > support. Let's plug that hole.
> >
> > Ooh, yes, lets.
> >
> > > For revoke_devmem() to work we need to link our vma into the same
> > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > at ->open time, but that's a bit tricky here with all the entry points
> > > and arch code. So instead create a fake file and adjust vma->vm_file.
> >
> > I don't think you want to share the devmem inode for this, this should
> > be based off the sysfs inode which I believe there is already only one
> > instance per resource. In contrast /dev/mem can have multiple inodes
> > because anyone can just mknod a new character device file, the same
> > problem does not exist for sysfs.
>
> But then I need to find the right one, plus I also need to find the
> right one for the procfs side. That gets messy, and I already have no
> idea how to really test this. Shared address_space is the same trick
> we're using in drm (where we have multiple things all pointing to the
> same underlying resources, through different files), and it gets the
> job done. So that's why I figured the shared address_space is the
> cleaner solution since then unmap_mapping_range takes care of
> iterating over all vma for us. I guess I could reimplement that logic
> with our own locking and everything in revoke_devmem, but feels a bit
> silly. But it would also solve the problem of having mutliple
> different mknod of /dev/kmem with different address_space behind them.
> Also because of how remap_pfn_range works, all these vma do use the
> same pgoff already anyway.

True, remap_pfn_range() makes sure that ->pgoff is an absolute
physical address offset for all use cases. So you might be able to
just point proc_bus_pci_open() at the shared devmem address space. For
sysfs it's messier. I think you would need to somehow get the inode
from kernfs_fop_open() to adjust its address space, but only if the
bin_file will ultimately be used for PCI memory.
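
To make the ->pgoff point concrete, here is a minimal arithmetic sketch. It
is an illustration only: the helper names are hypothetical, a 4 KiB page
size is assumed, and the BAR address is made up. It shows that once the BAR
start is folded into the page offset, a PCI resource mmap and a /dev/mem
mmap of the same physical page end up with the same offset, which is what
lets a single shared address space cover both.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical helper: page offset of a PCI resource mmap after the
 * page-aligned BAR start has been added to the offset userspace passed.
 */
static uint64_t model_absolute_pgoff(uint64_t bar_start, uint64_t user_pgoff)
{
	assert((bar_start & ((1ULL << MODEL_PAGE_SHIFT) - 1)) == 0);
	return user_pgoff + (bar_start >> MODEL_PAGE_SHIFT);
}

/*
 * Hypothetical helper: page offset of a /dev/mem mmap, where the file
 * offset is the physical address itself.
 */
static uint64_t model_devmem_pgoff(uint64_t phys)
{
	return phys >> MODEL_PAGE_SHIFT;
}
```

For a made-up BAR at 0xfe000000, mapping its third page through the resource
file and mapping physical address 0xfe002000 through /dev/mem both give page
offset 0xfe002 in this model.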
Dan Williams Oct. 7, 2020, 10:29 p.m. UTC | #6
On Wed, Oct 7, 2020 at 3:23 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Wed, Oct 7, 2020 at 9:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > > the default for all driver uses.
> > > >
> > > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > > support. Let's plug that hole.
> > >
> > > Ooh, yes, lets.
> > >
> > > > For revoke_devmem() to work we need to link our vma into the same
> > > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > > at ->open time, but that's a bit tricky here with all the entry points
> > > > and arch code. So instead create a fake file and adjust vma->vm_file.
> > >
> > > I don't think you want to share the devmem inode for this, this should
> > > be based off the sysfs inode which I believe there is already only one
> > > instance per resource. In contrast /dev/mem can have multiple inodes
> > > because anyone can just mknod a new character device file, the same
> > > problem does not exist for sysfs.
> >
> > But then I need to find the right one, plus I also need to find the
> > right one for the procfs side. That gets messy, and I already have no
> > idea how to really test this. Shared address_space is the same trick
> > we're using in drm (where we have multiple things all pointing to the
> > same underlying resources, through different files), and it gets the
> > job done. So that's why I figured the shared address_space is the
> > cleaner solution since then unmap_mapping_range takes care of
> > iterating over all vma for us. I guess I could reimplement that logic
> > with our own locking and everything in revoke_devmem, but feels a bit
> > silly. But it would also solve the problem of having mutliple
> > different mknod of /dev/kmem with different address_space behind them.
> > Also because of how remap_pfn_range works, all these vma do use the
> > same pgoff already anyway.
>
> True, remap_pfn_range() makes sure that ->pgoff is an absolute
> physical address offset for all use cases. So you might be able to
> just point proc_bus_pci_open() at the shared devmem address space. For
> sysfs it's messier. I think you would need to somehow get the inode
> from kernfs_fop_open() to adjust its address space, but only if the
> bin_file will ultimately be used for PCI memory.

To me this seems like a new sysfs_create_bin_file() flavor that
registers the file with the common devmem address_space.
Jason Gunthorpe Oct. 7, 2020, 11:24 p.m. UTC | #7
On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
> On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > the default for all driver uses.
> >
> > Except there's two more ways to access pci bars: sysfs and proc mmap
> > support. Let's plug that hole.
> 
> Ooh, yes, lets.
> 
> >
> > For revoke_devmem() to work we need to link our vma into the same
> > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > at ->open time, but that's a bit tricky here with all the entry points
> > and arch code. So instead create a fake file and adjust vma->vm_file.
> 
> I don't think you want to share the devmem inode for this, this should
> be based off the sysfs inode which I believe there is already only one
> instance per resource. In contrast /dev/mem can have multiple inodes
> because anyone can just mknod a new character device file, the same
> problem does not exist for sysfs.

The inode does not come from the filesystem; char/mem.c creates a
singular anon inode in devmem_init_inode().

Seems OK to use this more widely, but it feels a bit weird to live in
char/mem.c.

This is what got me thinking maybe this needs to be a bit bigger
generic infrastructure, e.g. enter this scheme from fops mmap and
everything else is in mm/user_iomem.c

Jason
Daniel Vetter Oct. 8, 2020, 7:31 a.m. UTC | #8
On Thu, Oct 8, 2020 at 1:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
> > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > the default for all driver uses.
> > >
> > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > support. Let's plug that hole.
> >
> > Ooh, yes, lets.
> >
> > >
> > > For revoke_devmem() to work we need to link our vma into the same
> > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > at ->open time, but that's a bit tricky here with all the entry points
> > > and arch code. So instead create a fake file and adjust vma->vm_file.
> >
> > I don't think you want to share the devmem inode for this, this should
> > be based off the sysfs inode which I believe there is already only one
> > instance per resource. In contrast /dev/mem can have multiple inodes
> > because anyone can just mknod a new character device file, the same
> > problem does not exist for sysfs.
>
> The inode does not come from the filesystem char/mem.c creates a
> singular anon inode in devmem_init_inode()
>
> Seems OK to use this more widely, but it feels a bit weird to live in
> char/memory.c.
>
> This is what got me thinking maybe this needs to be a bit bigger
> generic infrastructure - eg enter this scheme from fops mmap and
> everything else is in mm/user_iomem.c

Yeah, moving it to iomem and renaming it to have an iomem_ prefix
instead of devmem_ sounds like a good idea.
-Daniel
Dan Williams Oct. 8, 2020, 7:49 a.m. UTC | #9
On Wed, Oct 7, 2020 at 4:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
> > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > the default for all driver uses.
> > >
> > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > support. Let's plug that hole.
> >
> > Ooh, yes, lets.
> >
> > >
> > > For revoke_devmem() to work we need to link our vma into the same
> > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > at ->open time, but that's a bit tricky here with all the entry points
> > > and arch code. So instead create a fake file and adjust vma->vm_file.
> >
> > I don't think you want to share the devmem inode for this, this should
> > be based off the sysfs inode which I believe there is already only one
> > instance per resource. In contrast /dev/mem can have multiple inodes
> > because anyone can just mknod a new character device file, the same
> > problem does not exist for sysfs.
>
> The inode does not come from the filesystem char/mem.c creates a
> singular anon inode in devmem_init_inode()

That's not quite right. An inode does come from the filesystem; I just
arranged for that inode's i_mapping to be set to a common instance.

> Seems OK to use this more widely, but it feels a bit weird to live in
> char/memory.c.

Sure, now that more users have arrived it should move somewhere common.

> This is what got me thinking maybe this needs to be a bit bigger
> generic infrastructure - eg enter this scheme from fops mmap and
> everything else is in mm/user_iomem.c

It still requires every file that can map physical memory to have its
->open fop do

       inode->i_mapping = devmem_inode->i_mapping;
       filp->f_mapping = inode->i_mapping;

I don't see how you can centralize that part.
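
As a rough illustration of the ->open redirection above, the following
userspace model mimics the two assignments. It is a sketch only: the model_*
types and model_open() are hypothetical stand-ins for struct inode, struct
file and a driver's open handler, not kernel code. It shows that however
many inodes exist for the same device node (for example after several
mknod calls), every opened file ends up pointing at one common mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for address_space, inode and file. */
struct model_mapping {
	int id;
};

struct model_inode {
	struct model_mapping *i_mapping;
};

struct model_file {
	struct model_mapping *f_mapping;
};

/* The single common mapping that all physical-memory files share. */
static struct model_mapping model_shared_iomem = { 0 };

/*
 * Model of an ->open handler: redirect both the inode and the file to
 * the shared mapping, mirroring the two assignments quoted above.
 */
static void model_open(struct model_inode *inode, struct model_file *filp)
{
	assert(inode != NULL && filp != NULL);
	inode->i_mapping = &model_shared_iomem;
	filp->f_mapping = inode->i_mapping;
}
```

Dropping the inode->i_mapping assignment from model_open() and setting only
filp->f_mapping would correspond to the variant where just the file is
redirected.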
Daniel Vetter Oct. 8, 2020, 8:09 a.m. UTC | #10
On Thu, Oct 8, 2020 at 12:29 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Oct 7, 2020 at 3:23 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 9:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > > > the default for all driver uses.
> > > > >
> > > > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > > > support. Let's plug that hole.
> > > >
> > > > Ooh, yes, lets.
> > > >
> > > > > For revoke_devmem() to work we need to link our vma into the same
> > > > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > > > at ->open time, but that's a bit tricky here with all the entry points
> > > > > and arch code. So instead create a fake file and adjust vma->vm_file.
> > > >
> > > > I don't think you want to share the devmem inode for this, this should
> > > > be based off the sysfs inode which I believe there is already only one
> > > > instance per resource. In contrast /dev/mem can have multiple inodes
> > > > because anyone can just mknod a new character device file, the same
> > > > problem does not exist for sysfs.
> > >
> > > But then I need to find the right one, plus I also need to find the
> > > right one for the procfs side. That gets messy, and I already have no
> > > idea how to really test this. Shared address_space is the same trick
> > > we're using in drm (where we have multiple things all pointing to the
> > > same underlying resources, through different files), and it gets the
> > > job done. So that's why I figured the shared address_space is the
> > > cleaner solution since then unmap_mapping_range takes care of
> > > iterating over all vma for us. I guess I could reimplement that logic
> > > with our own locking and everything in revoke_devmem, but feels a bit
> > > silly. But it would also solve the problem of having mutliple
> > > different mknod of /dev/kmem with different address_space behind them.
> > > Also because of how remap_pfn_range works, all these vma do use the
> > > same pgoff already anyway.
> >
> > True, remap_pfn_range() makes sure that ->pgoff is an absolute
> > physical address offset for all use cases. So you might be able to
> > just point proc_bus_pci_open() at the shared devmem address space. For
> > sysfs it's messier. I think you would need to somehow get the inode
> > from kernfs_fop_open() to adjust its address space, but only if the
> > bin_file will ultimately be used for PCI memory.

Just read the code a bit more, and for proc it's impossible. There's
only a single file, and before you mmap it you have to call a few
ioctls to select the right PCI resource on that device you want to
mmap. Which includes legacy ioport stuff, and at least for now those
don't get revoked (maybe they should, but I'm looking at iomem here
now). Setting the mapping too early in ->open means that on
architectures which can do ioport as mmaps (not many, but powerpc is
among them) we'd shoot down these mmaps too.

Looking at the code there's the generic implementation, which consults
pci_iobar_pfn. And the only other implementation for sparc looks
similar, they separate iomem vs ioport through different pfn. So I
think this should indeed work.

> To me this seems like a new sysfs_create_bin_file() flavor that
> registers the file with the common devmem address_space.

Hm, I think we could just add an i_mapping member to bin_attribute and
let the normal open code set that up for us. That should work. An
mmapable binary sysfs file is already a similar special case.
-Daniel




--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Daniel Vetter Oct. 8, 2020, 8:13 a.m. UTC | #11
On Thu, Oct 8, 2020 at 9:50 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Oct 7, 2020 at 4:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
> > > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > > the default for all driver uses.
> > > >
> > > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > > support. Let's plug that hole.
> > >
> > > Ooh, yes, lets.
> > >
> > > >
> > > > For revoke_devmem() to work we need to link our vma into the same
> > > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > > at ->open time, but that's a bit tricky here with all the entry points
> > > > and arch code. So instead create a fake file and adjust vma->vm_file.
> > >
> > > I don't think you want to share the devmem inode for this, this should
> > > be based off the sysfs inode which I believe there is already only one
> > > instance per resource. In contrast /dev/mem can have multiple inodes
> > > because anyone can just mknod a new character device file, the same
> > > problem does not exist for sysfs.
> >
> > The inode does not come from the filesystem char/mem.c creates a
> > singular anon inode in devmem_init_inode()
>
> That's not quite right, An inode does come from the filesystem I just
> arranged for that inode's i_mapping to be set to a common instance.
>
> > Seems OK to use this more widely, but it feels a bit weird to live in
> > char/memory.c.
>
> Sure, now that more users have arrived it should move somewhere common.
>
> > This is what got me thinking maybe this needs to be a bit bigger
> > generic infrastructure - eg enter this scheme from fops mmap and
> > everything else is in mm/user_iomem.c
>
> It still requires every file that can map physical memory to have its
> ->open fop do
>
>        inode->i_mapping = devmem_inode->i_mapping;
>        filp->f_mapping = inode->i_mapping;
>
> I don't see how you can centralize that part.

btw, why are you setting inode->i_mapping? The inode is already
published, changing that looks risky. And I don't think it's needed,
vma_link() only looks at filp->f_mapping, and in our drm_open() we
only set that one.
-Daniel
Dan Williams Oct. 8, 2020, 8:35 a.m. UTC | #12
On Thu, Oct 8, 2020 at 1:13 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Thu, Oct 8, 2020 at 9:50 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Oct 7, 2020 at 4:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
> > > > On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> > > > > the region") /dev/kmem zaps ptes when the kernel requests exclusive
> > > > > acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> > > > > the default for all driver uses.
> > > > >
> > > > > Except there's two more ways to access pci bars: sysfs and proc mmap
> > > > > support. Let's plug that hole.
> > > >
> > > > Ooh, yes, lets.
> > > >
> > > > >
> > > > > For revoke_devmem() to work we need to link our vma into the same
> > > > > address_space, with consistent vma->vm_pgoff. ->pgoff is already
> > > > > adjusted, because that's how (io_)remap_pfn_range works, but for the
> > > > > mapping we need to adjust vma->vm_file->f_mapping. Usually that's done
> > > > > at ->open time, but that's a bit tricky here with all the entry points
> > > > > and arch code. So instead create a fake file and adjust vma->vm_file.
> > > >
> > > > I don't think you want to share the devmem inode for this, this should
> > > > be based off the sysfs inode which I believe there is already only one
> > > > instance per resource. In contrast /dev/mem can have multiple inodes
> > > > because anyone can just mknod a new character device file, the same
> > > > problem does not exist for sysfs.
> > >
> > > The inode does not come from the filesystem; char/mem.c creates a
> > > singular anon inode in devmem_init_inode().
> >
> > That's not quite right. An inode does come from the filesystem; I just
> > arranged for that inode's i_mapping to be set to a common instance.
> >
> > > Seems OK to use this more widely, but it feels a bit weird to live in
> > > char/mem.c.
> >
> > Sure, now that more users have arrived it should move somewhere common.
> >
> > > This is what got me thinking maybe this needs to be a bit bigger
> > > generic infrastructure - eg enter this scheme from fops mmap and
> > > everything else is in mm/user_iomem.c
> >
> > It still requires every file that can map physical memory to have its
> > ->open fop do
> >
> >        inode->i_mapping = devmem_inode->i_mapping;
> >        filp->f_mapping = inode->i_mapping;
> >
> > I don't see how you can centralize that part.
>
> btw, why are you setting inode->i_mapping? The inode is already
> published, changing that looks risky. And I don't think it's needed,
> vma_link() only looks at filp->f_mapping, and in our drm_open() we
> only set that one.

I think you're right that it is unnecessary for devmem, but I don't
think it's dangerous to do it on the very first open, before anything
is using the address space. It's copy-pasted from what all the other
"shared address space" implementers do, for example block devices in
bd_acquire(). However, the rationale for block devices is that page
cache pages can then be associated with the address space in the
absence of an f_mapping. Without filesystem page writeback to
coordinate, I don't see any devmem code paths that would operate on
inode->i_mapping.
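
The distinction discussed above can be illustrated with a minimal userspace
sketch. It is not kernel code: the struct layouts and the model_* helpers are
hypothetical stand-ins, and the only behaviour it assumes is what the thread
states, namely that linking a vma consults file->f_mapping and nothing else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel definitions. */
struct address_space { int nr_vmas; };	/* counts the mappings linked to it */
struct inode {
	struct address_space *i_mapping;
	struct address_space i_data;
};
struct file {
	struct inode *f_inode;
	struct address_space *f_mapping;
};
struct vma {
	struct file *vm_file;
	struct address_space *linked_to;
};

/* Models the claim about vma_link(): only file->f_mapping is consulted. */
static void model_vma_link(struct vma *vma, struct file *file)
{
	vma->vm_file = file;
	vma->linked_to = file->f_mapping;
	vma->linked_to->nr_vmas++;
}

/* Models an ->open that redirects only the per-open file. */
static void model_open(struct inode *inode, struct file *file,
		       struct address_space *shared)
{
	file->f_inode = inode;
	file->f_mapping = shared;	/* inode->i_mapping is left untouched */
}

/* Two different inodes, one shared address space: returns the vma count. */
static int model_demo(void)
{
	struct address_space shared = { 0 };
	struct inode a = { 0 }, b = { 0 };
	struct file fa, fb;
	struct vma va, vb;

	a.i_mapping = &a.i_data;
	b.i_mapping = &b.i_data;
	model_open(&a, &fa, &shared);
	model_open(&b, &fb, &shared);
	model_vma_link(&va, &fa);
	model_vma_link(&vb, &fb);

	return shared.nr_vmas;
}
```

In this model both vmas end up in the shared address space even though neither
inode's i_mapping was changed, which is the situation a revoke over that one
address space relies on.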
Jason Gunthorpe Oct. 8, 2020, 12:41 p.m. UTC | #13
On Thu, Oct 08, 2020 at 12:49:54AM -0700, Dan Williams wrote:

> > This is what got me thinking maybe this needs to be a bit bigger
> > generic infrastructure - eg enter this scheme from fops mmap and
> > everything else is in mm/user_iomem.c
> 
> It still requires every file that can map physical memory to have its
> ->open fop do

Common infrastructure would have to create a dummy struct file at mmap
time with the global inode and attach that to the VMA.

Jason
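
A rough userspace model of the "dummy struct file at mmap time" idea, under
stated assumptions: the structs, the refcounting and every model_* name are
hypothetical, and global_iomem_space merely stands in for the mapping of the
global inode. The sketch allocates the replacement file before dropping the
original reference; that ordering is a choice made here, not something the
thread or the patch specifies.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are the kernel definitions. */
struct address_space { int id; };
struct file { int refcount; struct address_space *f_mapping; };
struct vma { struct file *vm_file; };

/* Stands in for the mapping of the single global inode. */
static struct address_space global_iomem_space;

static struct file *model_alloc_file(struct address_space *mapping)
{
	struct file *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->refcount = 1;
	f->f_mapping = mapping;
	return f;
}

/* Drop one reference and free the file when the last one goes away. */
static void model_fput(struct file *f)
{
	assert(!f || f->refcount > 0);
	if (f && --f->refcount == 0)
		free(f);
}

/*
 * mmap-time swap: attach a dummy file that points at the global mapping.
 * Returns 0 on success, or -1 (leaving the vma unchanged) on allocation failure.
 */
static int model_mmap_swap_file(struct vma *vma)
{
	struct file *dummy = model_alloc_file(&global_iomem_space);

	if (!dummy)
		return -1;
	model_fput(vma->vm_file);
	vma->vm_file = dummy;
	return 0;
}

/* Returns 1 if the vma ends up attached to the global mapping, else 0. */
static int model_swap_demo(void)
{
	struct address_space sysfs_space = { 0 };
	struct file *orig = model_alloc_file(&sysfs_space);
	struct vma vma;
	int ok;

	if (!orig)
		return 0;
	vma.vm_file = orig;
	ok = model_mmap_swap_file(&vma) == 0 &&
	     vma.vm_file->f_mapping == &global_iomem_space;
	model_fput(vma.vm_file);
	return ok;
}
```

The point of the model is that the file which opened the mapping needs no
special ->open handling: the redirection to the common address space happens
entirely in the mmap path.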

Patch

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index abd4ffdc8cde..5e58a326d4ee 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -810,6 +810,7 @@  static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 }
 
 static struct inode *devmem_inode;
+static struct vfsmount *devmem_vfs_mount;
 
 #ifdef CONFIG_IO_STRICT_DEVMEM
 void revoke_devmem(struct resource *res)
@@ -843,6 +844,20 @@  void revoke_devmem(struct resource *res)
 
 	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
 }
+
+struct file *devmem_getfile(void)
+{
+	struct file *file;
+
+	file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
+				 O_RDWR, &kmem_fops);
+	if (IS_ERR(file))
+		return NULL;
+
+	file->f_mapping = devmem_inode->i_mapping;
+
+	return file;
+}
 #endif
 
 static int open_port(struct inode *inode, struct file *filp)
@@ -1010,7 +1025,6 @@  static struct file_system_type devmem_fs_type = {
 
 static int devmem_init_inode(void)
 {
-	static struct vfsmount *devmem_vfs_mount;
 	static int devmem_fs_cnt;
 	struct inode *inode;
 	int rc;
diff --git a/drivers/pci/mmap.c b/drivers/pci/mmap.c
index b8c9011987f4..63786cc9c746 100644
--- a/drivers/pci/mmap.c
+++ b/drivers/pci/mmap.c
@@ -7,6 +7,7 @@ 
  * Author: David Woodhouse <dwmw2@infradead.org>
  */
 
+#include <linux/file.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/pci.h>
@@ -64,6 +65,8 @@  int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
 		vma->vm_pgoff += (pci_resource_start(pdev, bar) >> PAGE_SHIFT);
 
 	vma->vm_ops = &pci_phys_vm_ops;
+	fput(vma->vm_file);
+	vma->vm_file = devmem_getfile();
 
 	return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 				  vma->vm_end - vma->vm_start,
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6c2b06fe8beb..83238cba19fe 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -304,8 +304,10 @@  struct resource *request_free_mem_region(struct resource *base,
 
 #ifdef CONFIG_IO_STRICT_DEVMEM
 void revoke_devmem(struct resource *res);
+struct file *devmem_getfile(void);
 #else
 static inline void revoke_devmem(struct resource *res) { };
+static inline struct file *devmem_getfile(void) { return NULL; };
 #endif
 
 #endif /* __ASSEMBLY__ */